# Model Creation

In this Jupyter notebook we take the preprocessed data generated by preprocessing_part_2.ipynb and build a machine learning model from it
that reads each review and tries to predict its average score. In other words, we are building a text classifier.

In [98]:
#start with the relevant imports

#used to load and visualise the data
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

#used to build the model
from sklearn.model_selection import train_test_split
import tensorflow as tf
from keras.callbacks import EarlyStopping, ModelCheckpoint
from keras import Sequential
from keras.layers import Dense, Dropout, Embedding
from keras.optimizers import RMSprop
from keras.losses import SparseCategoricalCrossentropy

Step 1: load and inspect the csv with pandas

In [99]:
#first load the data with pandas
df=pd.read_csv("./data/data_ready_for_model.csv")


In [100]:
df.head()

Unnamed: 0.1,Unnamed: 0,Comments,Average Score
0,0,moved uk end august got virgin media broadband...,1.0
1,1,truly attrocious service terms broadband custo...,1.0
2,2,hard cancel contract. phone 2 hours t o spend ...,2.0
3,3,pay 350mbps package managed 250mbps upload 34 ...,2.0
4,4,worst customer service: -the bots ask irreleva...,2.0


In [101]:
df.drop("Unnamed: 0", axis=1, inplace=True) #unneeded column, resulted when csv was created from dataframe

The last step before splitting our data into train, validation and test sets is to tokenize the words.

In [102]:
#maximum number of most frequent words to keep
max_words=5000
#maximum number of tokens per comment
max_sequence=250
#size of each embedding vector (fixed)
embedding_dim=250

tokenizer = tf.keras.preprocessing.text.Tokenizer(num_words=max_words)
tokenizer.fit_on_texts(df["Comments"].values)
word_index=tokenizer.word_index

Next, we convert the comments to integer sequences, then truncate and pad them so that they are all the same length for modeling.

In [103]:
print(f"found {len(word_index)} unique tokens")

found 13038 unique tokens
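Note that `word_index` holds every token seen during fitting (13038 here), while `max_words` only limits which indices are emitted later when the texts are converted to sequences. Below is a minimal pure-Python sketch of that behaviour as we understand it from the Keras `Tokenizer`: indices are ranked by frequency starting at 1, and words whose index is not below `num_words` are skipped when no `oov_token` is set. The helper names and toy texts are made up for illustration and are not part of Keras.

```python
from collections import Counter

def build_word_index(texts):
    """Rank words by frequency; the most frequent word gets index 1 (0 is left for padding)."""
    counts = Counter(word for text in texts for word in text.split())
    ranked = sorted(counts.items(), key=lambda kv: -kv[1])  # stable sort: ties keep first-seen order
    return {word: rank for rank, (word, _) in enumerate(ranked, start=1)}

def to_sequence(text, word_index, num_words):
    """Keep only words whose index is below num_words; all other words are silently skipped."""
    return [word_index[w] for w in text.split()
            if w in word_index and word_index[w] < num_words]

toy_texts = ["slow broadband slow service", "broadband slow", "terrible service"]
toy_index = build_word_index(toy_texts)          # four distinct words in total
toy_seq = to_sequence("terrible slow broadband", toy_index, num_words=3)
print(len(toy_index), toy_seq)                   # the rare word "terrible" is dropped
```

So with `max_words=5000`, the rarer words among the 13038 found simply disappear from the sequences fed to the model.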


In [104]:
X = tokenizer.texts_to_sequences(df['Comments'].values)
X = tf.keras.utils.pad_sequences(X, maxlen=max_sequence)
print('Shape of data tensor:', X.shape)

Shape of data tensor: (4342, 250)
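By default `pad_sequences` pads and truncates at the start of each sequence, so short comments get leading zeros and long comments keep only their last `maxlen` tokens. The following is a small pure-Python sketch of that default behaviour; `pad_pre` is a hypothetical helper written for illustration, not the Keras implementation.

```python
def pad_pre(sequences, maxlen, value=0):
    """Left-pad short sequences and keep only the last maxlen items of long ones."""
    padded = []
    for seq in sequences:
        trimmed = seq[-maxlen:]                       # 'pre' truncation drops the earliest tokens
        padded.append([value] * (maxlen - len(trimmed)) + trimmed)
    return padded

toy_padded = pad_pre([[5, 8], [1, 2, 3, 4, 5, 6]], maxlen=4)
print(toy_padded)   # [[0, 0, 5, 8], [3, 4, 5, 6]]
```

For reviews longer than 250 tokens this means the opening of the review is discarded, which may or may not be what we want.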


Step 2: prepare the data as train, validation and test sets (the code is borrowed from my Wine reviews classification neural network). We want our target to be the "Average Score" and our features to be the "Comments". We have quite an imbalanced dataset, because scores of 1 and 2 are more common than any other score. Because we are implementing a classification model, this could be especially problematic.

To mitigate this, we will _stratify_ the splits. This ensures that the relative class frequencies are approximately preserved in the training, validation and test sets.

In [105]:
y=df["Average Score"]

X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.340, random_state=0, stratify=y)
#roughly 66% training, 17% validation, 17% testing (test_size=0.34, then split in half)
X_val, X_test, y_val, y_test =train_test_split(X_temp, y_temp, test_size = 0.5, random_state=0, stratify=y_temp)
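To make the idea of stratification concrete, here is a rough pure-Python sketch that splits each class separately so that both parts keep about the same class mix. It is only an illustration under simple assumptions (per-class rounding, toy labels) and is not how scikit-learn implements `stratify`; `stratified_split` is a hypothetical helper.

```python
import random
from collections import Counter, defaultdict

def stratified_split(labels, test_fraction, seed=0):
    """Return (train_idx, test_idx) with roughly the same class mix in both parts."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)                          # shuffle within each class
        n_test = round(len(indices) * test_fraction)  # same fraction taken from every class
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)

toy_labels = [1.0] * 60 + [2.0] * 30 + [5.0] * 10     # imbalanced toy labels, not the real data
train_idx, test_idx = stratified_split(toy_labels, test_fraction=0.2)
test_mix = Counter(toy_labels[i] for i in test_idx)
print(test_mix)
```

In the toy example the 60/30/10 mix of the full label list is carried over to the test part as 12/6/2.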

# Training an LSTM model

The time for creating a neural network has finally arrived! Our comments have already been encoded as padded integer sequences by the tokenizer above, so the model starts with an Embedding layer that turns each token index into a dense vector.

The embeddings are then passed through an LSTM layer, a small dense layer and a softmax output over the five score classes.

In [106]:
model = Sequential([
        Embedding(max_words, embedding_dim, input_length=X.shape[1]),#mask_zero is left at its default (False), so the padding zeros are not masked
        #now we have a sequence of dense vectors a nn can comprehend
        tf.keras.layers.SpatialDropout1D(0.2),
        tf.keras.layers.LSTM(32),
        Dense(32, activation="relu"),
        Dropout(0.4),
        Dense(5, activation="softmax")
])
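As a sanity check on model size, the parameter count can be worked out by hand. The arithmetic below assumes the standard Keras parameter formulas for `Embedding`, `LSTM` and `Dense` (the dropout layers have no weights) and should agree with `model.summary()`, although we have not shown that output here.

```python
vocab_size, embedding_dim = 5000, 250      # max_words and embedding_dim from above
lstm_units, dense_units, n_classes = 32, 32, 5

embedding_params = vocab_size * embedding_dim
# An LSTM has four gates, each with input weights, recurrent weights and a bias.
lstm_params = 4 * (lstm_units * (embedding_dim + lstm_units) + lstm_units)
dense_params = lstm_units * dense_units + dense_units
output_params = dense_units * n_classes + n_classes

total_params = embedding_params + lstm_params + dense_params + output_params
print(embedding_params, lstm_params, dense_params, output_params)
print(total_params)   # 1287445
```

Almost all of the weights sit in the embedding matrix, and the total is far larger than the fewer than 3,000 training reviews, which is worth keeping in mind when we look for overfitting.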

In [107]:
callback = [EarlyStopping(monitor='val_loss', patience=5),
             ModelCheckpoint(filepath='saved_model', monitor='val_loss', save_best_only=True)]



In [108]:
model.compile(RMSprop(learning_rate=0.001), 
             loss = SparseCategoricalCrossentropy(), #sparse categorical cross-entropy as this is a multi-class classification problem
                metrics=["sparse_categorical_accuracy"])
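`SparseCategoricalCrossentropy` expects class indices from 0 to `n_classes - 1`, while `df.head()` shows the scores stored as floats such as 1.0 and 2.0. Whether the preprocessing notebook already takes care of this is not shown here, so the following is only a sketch of one way to map scores to indices; `encode_labels` and the toy scores are hypothetical.

```python
def encode_labels(scores):
    """Map each distinct score to an integer index 0..k-1, in ascending score order."""
    classes = sorted(set(scores))
    score_to_index = {score: i for i, score in enumerate(classes)}
    return [score_to_index[s] for s in scores], classes

toy_scores = [1.0, 2.0, 2.0, 5.0, 3.0, 4.0]   # made-up scores, not the real column
encoded, classes = encode_labels(toy_scores)
print(encoded)    # [0, 1, 1, 4, 2, 3]
print(classes)    # [1.0, 2.0, 3.0, 4.0, 5.0]
```

Keeping `classes` around lets us translate a predicted index back into the original score later.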

In [109]:
model.evaluate(X_train, y_train) #evaluate performance of model without training it first
#returns [loss, accuracy]; accuracy is around 0.097



[1.612160325050354, 0.09668412059545517]
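The untrained loss of about 1.612 is what we would expect from a model that spreads its probability almost evenly over five classes. A quick check of the cross-entropy of a perfectly uniform guess:

```python
import math

n_classes = 5
uniform_loss = -math.log(1 / n_classes)   # cross-entropy when every class gets probability 1/5
print(round(uniform_loss, 4))             # 1.6094
```

Any trained model should end up well below this value on the training data.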

In [110]:
history = model.fit(X_train, y_train, epochs=50, validation_data=(X_val, y_val), callbacks=callback)

Epoch 1/50



INFO:tensorflow:Assets written to: saved_model\assets




Epoch 2/50
Epoch 3/50
Epoch 4/50
Epoch 5/50
Epoch 6/50
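The log stops at epoch 6 of 50, and the checkpoint message appears only after epoch 1. That would be consistent with `val_loss` being lowest in the first epoch and `EarlyStopping(patience=5)` halting training five epochs later, although the per-epoch losses are not shown above. A minimal sketch of the patience logic, with made-up loss values and a hypothetical helper name:

```python
def early_stopping_epoch(val_losses, patience):
    """Return the 1-based epoch at which training would stop, or None if it never does."""
    best = float("inf")
    wait = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best, wait = loss, 0      # improvement: remember it and reset the counter
        else:
            wait += 1                 # no improvement this epoch
            if wait >= patience:
                return epoch
    return None

toy_val_losses = [1.20, 1.25, 1.31, 1.40, 1.52, 1.60, 1.70]   # made-up values
stop_epoch = early_stopping_epoch(toy_val_losses, patience=5)
print(stop_epoch)   # 6
```

A validation loss that only rises after the first epoch is itself a typical sign of overfitting.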


In [111]:
model.save("saved_model/model1")



INFO:tensorflow:Assets written to: saved_model/model1\assets




The model has trained, but its validation accuracy is only 0.6491, and it is clearly overfitting, so we will have to try to improve it. As our dataset is quite unbalanced, one way to do this is by using the Jaccard similarity metric.

Jaccard similarity is a measure of how similar two sets are, based on the items they share. It is defined as the number of elements common to both sets divided by the number of elements in the union of the two sets.

We can apply it to the "Comments" column to find all the comments that are very similar to each other and remove them from the dataframe, essentially removing any "quasi duplicate" data.
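A minimal sketch of this idea in plain Python is shown below. It treats each comment as a set of whitespace-separated tokens and keeps a comment only if it is not too similar to any comment already kept. The threshold of 0.8, the helper names and the toy comments are assumptions chosen for illustration, not values taken from the data.

```python
def jaccard(a, b):
    """Jaccard similarity of two comments, each treated as a set of whitespace-separated tokens."""
    set_a, set_b = set(a.split()), set(b.split())
    if not set_a and not set_b:
        return 1.0                    # convention chosen here: two empty comments count as identical
    return len(set_a & set_b) / len(set_a | set_b)

def drop_near_duplicates(comments, threshold=0.8):
    """Keep a comment only if its similarity to every comment kept so far is below the threshold."""
    kept = []
    for comment in comments:
        if all(jaccard(comment, other) < threshold for other in kept):
            kept.append(comment)
    return kept

toy_comments = [
    "worst customer service ever",
    "worst customer service ever really",   # shares 4 of 5 distinct tokens with the first comment
    "broadband speed fine",
]
unique_comments = drop_near_duplicates(toy_comments)
print(unique_comments)
```

This pairwise comparison is quadratic in the number of comments, which should still be manageable for a few thousand reviews.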