<a href="https://colab.research.google.com/github/Ayikanying-ux/multi_class_classification_on_stack_overflow_questions/blob/main/multi_class_classification.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Import libraries

In [1]:
import numpy as np
import tensorflow as tf
import os
import shutil
import string
import re
import matplotlib.pyplot as plt

from tensorflow.keras import layers
from tensorflow.keras import losses

## Download the dataset
We will use the Stack Overflow dataset hosted by TensorFlow to predict the correct label (the programming language) of each question posted on Stack Overflow.

In [2]:
url="https://storage.googleapis.com/download.tensorflow.org/data/stack_overflow_16k.tar.gz"
dataset = tf.keras.utils.get_file("stack_overflow_16k_v1", url,
                                 untar=True, cache_dir="stack_overflow",
                                 cache_subdir=""
                                 )

In [3]:
dataset_dir = os.path.dirname(dataset)
os.listdir(dataset_dir)

['README.md', 'test', 'stack_overflow_16k_v1.tar.gz', 'train']

As we can see, after downloading the dataset we have a test directory for testing and a train directory for training the model.

Let's take a look at the train folder to see how the data is organised.

In [4]:
train_dir = os.path.join(dataset_dir, "train")
os.listdir(train_dir)

['java', 'javascript', 'python', 'csharp']

In [5]:
test_dir = os.path.join(dataset_dir, "test")

We can see that there are 4 folders, one per class label, and each contains the questions we will train on.

Let's take a look at one of the questions.
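
To check how many questions each class folder holds, a small helper can count the files per subdirectory. This is a minimal sketch: `count_files_per_class` is a hypothetical helper, not part of the notebook, and the demo directory below is a made-up stand-in for the real train folder.

```python
import os
import tempfile

def count_files_per_class(root):
    """Return {class_name: number_of_files} for each subdirectory of root."""
    counts = {}
    for name in sorted(os.listdir(root)):
        class_dir = os.path.join(root, name)
        if os.path.isdir(class_dir):
            counts[name] = len(os.listdir(class_dir))
    return counts

# Tiny stand-in for the real train directory (hypothetical contents).
with tempfile.TemporaryDirectory() as demo_root:
    for class_name, n_files in [("java", 2), ("python", 3)]:
        os.makedirs(os.path.join(demo_root, class_name))
        for i in range(n_files):
            with open(os.path.join(demo_root, class_name, f"{i}.txt"), "w") as f:
                f.write("sample question")
    demo_counts = count_files_per_class(demo_root)

print(demo_counts)
```

Calling the same helper on the train directory created above would show whether the four classes are balanced.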

In [6]:
java = os.path.join(train_dir, 'java')
file = os.path.join(java, '117.txt')
with open(file) as f:
  print(f.read())

"how to sort list<map<string, object>> in blank8 with .stream()? i have a list like this ..list&lt;map&lt;string, object&gt;&gt; list = new arraylist&lt;&gt;();..    for(int i = 0; i &lt; 20; i++) {.        map&lt;string, object&gt; map = new hashmap&lt;&gt;();.        map.put(""quantity"", math.random());.        map.put(""price"", math.random());.        list.add(map);.    }...how can i sort by price?..i hope it is use blank8 stream"



## Load the dataset
Next, we load the dataset in a format suitable for training. We hold out 20% of the files in the train directory as a validation set.

In [7]:
batch_size=32
seed=42
train_ds = tf.keras.utils.text_dataset_from_directory(
    train_dir,
    batch_size=batch_size,
    validation_split=0.2,
    subset="training",
    seed=seed
)

Found 8000 files belonging to 4 classes.
Using 6400 files for training.


In [8]:
train_val = tf.keras.utils.text_dataset_from_directory(
    train_dir,
    batch_size=batch_size,
    validation_split=0.2,
    subset="validation",
    seed=seed
)

Found 8000 files belonging to 4 classes.
Using 1600 files for validation.
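
The 6400/1600 figures follow directly from `validation_split=0.2` applied to the 8000 files. The arithmetic can be sketched as follows; `split_counts` is a hypothetical helper, and the rounding (truncating the validation share) is an assumption about how the split is computed.

```python
def split_counts(total_files, validation_split):
    """Sizes of the (training, validation) subsets for a fractional split."""
    n_val = int(total_files * validation_split)  # assumed rounding: truncate
    return total_files - n_val, n_val

n_train, n_val = split_counts(8000, 0.2)
print(n_train, n_val)
```

Both calls above pass the same `seed`, which is what keeps the training and validation subsets from overlapping.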


In [9]:
test_ds = tf.keras.utils.text_dataset_from_directory(
    os.path.join(dataset_dir, "test"),
    batch_size=batch_size)

Found 8000 files belonging to 4 classes.


Let's look at a few sample questions and the classes they belong to.

In [10]:
for text, label in train_ds.take(1):
  for i in range(5):
    print("Question: ", text.numpy()[i])
    print("Label: ", label.numpy()[i])

Question:  b'"my tester is going to the wrong constructor i am new to programming so if i ask a question that can be easily fixed, please forgive me. my program has a tester class with a main. when i send that to my regularpolygon class, it sends it to the wrong constructor. i have two constructors. 1 without perameters..public regularpolygon().    {.       mynumsides = 5;.       mysidelength = 30;.    }//end default constructor...and my second, with perameters. ..public regularpolygon(int numsides, double sidelength).    {.        mynumsides = numsides;.        mysidelength = sidelength;.    }// end constructor...in my tester class i have these two lines:..regularpolygon shape = new regularpolygon(numsides, sidelength);.        shape.menu();...numsides and sidelength were declared and initialized earlier in the testing class...so what i want to happen, is the tester class sends numsides and sidelength to the second constructor and use it in that class. but it only uses the default con

In [11]:
for i, label in enumerate(train_ds.class_names):
  print(f'Label {i}, corresponds to {label}')

Label 0, corresponds to csharp
Label 1, corresponds to java
Label 2, corresponds to javascript
Label 3, corresponds to python
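
The integer labels follow the alphanumeric order of the class folder names, not the order in which `os.listdir` returned them. A short sketch of that mapping, where `folders` and `label_of` are illustrative names:

```python
# Folder names in the order os.listdir returned them earlier.
folders = ['java', 'javascript', 'python', 'csharp']

# Sorting the names reproduces the label indices printed above.
class_names = sorted(folders)
label_of = {name: index for index, name in enumerate(class_names)}

print(label_of)
```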


## Prepare the dataset for training

In [12]:
def standardize_text(text):
  lower=tf.strings.lower(text)
  return tf.strings.regex_replace(lower,
                          '[%s]' % re.escape(string.punctuation),
                          '')
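
To see what this standardization does to a question, here is a plain-Python approximation that uses the same regular expression. It is a sketch only: `standardize_text_py` is a hypothetical stand-in that works on ordinary strings rather than tensors, and it is assumed to match the TensorFlow version for ASCII text.

```python
import re
import string

def standardize_text_py(text):
    """Lowercase the text, then delete every ASCII punctuation character."""
    lowered = text.lower()
    return re.sub('[%s]' % re.escape(string.punctuation), '', lowered)

sample = 'How to sort List<Map<String, Object>>?'
cleaned = standardize_text_py(sample)
print(cleaned)
```

Note that punctuation is deleted rather than replaced by a space, so tokens that were separated only by punctuation are merged into one.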

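Let's create a TextVectorization layer to standardize, tokenize and vectorize our data. Each question becomes a sequence of 250 integer token ids drawn from a vocabulary of at most 10000 entries.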

In [13]:
max_feature=10000
sequence_length=250

vectorize_layer=layers.TextVectorization(
    standardize=standardize_text,
    max_tokens=max_feature,
    output_mode="int",
    output_sequence_length=sequence_length
)

In [14]:
train_text = train_ds.map(lambda x, y: x)
vectorize_layer.adapt(train_text)
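
Conceptually, adapting the layer builds a vocabulary from the training text, and calling it maps each token to an integer id and pads or truncates to a fixed length. The following plain-Python sketch illustrates the idea under stated assumptions: `build_vocab` and `vectorize` are hypothetical helpers, index 0 is used for padding and index 1 for out-of-vocabulary tokens, and ties in frequency are broken by first appearance, which may differ from the layer's own ordering.

```python
from collections import Counter

PAD, OOV = 0, 1  # index 0 pads short sequences, index 1 marks unknown tokens

def build_vocab(texts, max_tokens):
    """Map the most frequent whitespace-separated tokens to ids starting at 2."""
    counts = Counter(token for text in texts for token in text.split())
    # Two slots are reserved, so keep at most max_tokens - 2 real tokens.
    most_common = [token for token, _ in counts.most_common(max_tokens - 2)]
    return {token: index + 2 for index, token in enumerate(most_common)}

def vectorize(text, vocab, sequence_length):
    """Convert text to exactly sequence_length ids (truncate, then pad)."""
    ids = [vocab.get(token, OOV) for token in text.split()][:sequence_length]
    return ids + [PAD] * (sequence_length - len(ids))

corpus = ["how to sort a list", "how to sort", "how to"]
vocab = build_vocab(corpus, max_tokens=5)
encoded = vectorize("how to parse a list", vocab, sequence_length=6)
print(vocab, encoded)
```

With `max_tokens=5` only three real tokens fit, so rarer words such as "a" and "list" fall back to the out-of-vocabulary id.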

In [15]:
def vectorize_text(text, label):
  text = tf.expand_dims(text, -1)
  return vectorize_layer(text), label

# Retrieve a batch (of 32 questions and labels) from the dataset
text_batch, label_batch = next(iter(train_ds))
first_review, first_label = text_batch[0], label_batch[0]
print("Review", first_review)
print("Label", train_ds.class_names[first_label])
print("Vectorized review", vectorize_text(first_review, first_label))

Review tf.Tensor(b'"set blank to quit on exception? i\'m using blank 3..i\'ve been looking around for an answer to this, but i haven\'t found it yet. basically, i\'m running several blank scripts into a game engine, and each script has its own entry point...i\'d rather not add try: except blocks through all of my code, so i was wondering if it\'s at all possible to tell blank to quit (or perhaps assign a custom function to that ""callback"") on finding its first error, regardless of where or what it found? ..currently, the game engine will continue after finding and hitting an error, making it more difficult than necessary to diagnose issues since running into one error may make a subsequent script not work (as it relies on variables that the error-ing script set, for example). any ideas? ..i know that i could redirect the console to a file to allow for easier scrolling, but just capturing the first error and stopping the game prematurely would be really useful...okay, a couple of extr

In [16]:
train_ds = train_ds.map(vectorize_text)
val_ds = train_val.map(vectorize_text)
test_ds = test_ds.map(vectorize_text)

In [17]:
AUTOTUNE = tf.data.AUTOTUNE

train_ds = train_ds.cache().prefetch(buffer_size=AUTOTUNE)
val_ds = val_ds.cache().prefetch(buffer_size=AUTOTUNE)
test_ds = test_ds.cache().prefetch(buffer_size=AUTOTUNE)

In [18]:
embedding_dim = 16
model = tf.keras.Sequential([
  layers.Embedding(max_feature, embedding_dim),
  layers.Dropout(0.2),
  layers.GlobalAveragePooling1D(),
  layers.Dropout(0.2),
  layers.Dense(4, activation='sigmoid')])

model.summary()

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding (Embedding)       (None, None, 16)          160000    
                                                                 
 dropout (Dropout)           (None, None, 16)          0         
                                                                 
 global_average_pooling1d (  (None, 16)                0         
 GlobalAveragePooling1D)                                         
                                                                 
 dropout_1 (Dropout)         (None, 16)                0         
                                                                 
 dense (Dense)               (None, 4)                 68        
                                                                 
Total params: 160068 (625.27 KB)
Trainable params: 160068 (625.27 KB)
Non-trainable params: 0 (0.00 Byte)
________________
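
The parameter counts in the summary can be checked by hand, and the pooling layer's effect on the shape can be mimicked with a plain average over the sequence axis. This is a rough sketch: `num_classes` and `global_average_pool` are illustrative names, and the pooling function is a simplified stand-in rather than the Keras implementation.

```python
max_feature, embedding_dim, num_classes = 10000, 16, 4

# One embedding_dim-sized vector per token id.
embedding_params = max_feature * embedding_dim
# One weight per (input feature, class) pair plus one bias per class.
dense_params = embedding_dim * num_classes + num_classes
total_params = embedding_params + dense_params

def global_average_pool(sequence):
    """Average a list of equal-length vectors over the sequence axis."""
    return [sum(column) / len(sequence) for column in zip(*sequence)]

# Two time steps of 2-d vectors collapse into a single 2-d vector.
pooled = global_average_pool([[1.0, 2.0], [3.0, 4.0]])
print(embedding_params, dense_params, total_params, pooled)
```

The dropout and pooling layers add no parameters, which is why the total equals the embedding and dense counts combined.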