# Model Creation

In this Jupyter notebook we take the preprocessed data generated by preprocessing_part_2.ipynb and build a machine learning model
that reads each review and tries to predict its average score. In other words, we are building a text classifier.

In [494]:
#start with the relevant imports

#use to visualise the data 
import pandas as pd

import numpy as np
import matplotlib.pyplot as plt

#used to build the model
from sklearn.model_selection import train_test_split
import tensorflow as tf
from keras.callbacks import EarlyStopping, ModelCheckpoint
from keras import Sequential
from keras.layers import Dense, Dropout, Embedding
from keras.optimizers import RMSprop
from keras.losses import SparseCategoricalCrossentropy

Step 1: load and inspect the csv with pandas

In [495]:
#first load the data with pandas
df=pd.read_csv("./data/data_ready_for_model.csv")


In [496]:
df.head(10)

Unnamed: 0.1,Unnamed: 0,Comments,Average Score
0,0,moved uk end august got virgin media broadband...,1.0
1,1,truly attrocious service terms broadband custo...,1.0
2,2,hard cancel contract. phone 2 hours t o spend ...,2.0
3,3,pay 350mbps package managed 250mbps upload 34 ...,2.0
4,4,worst customer service: -the bots ask irreleva...,2.0
5,5,informed given upgrade difference speeds reboo...,3.0
6,6,wish zero star virgin media unfortunately lowe...,1.0
7,7,sold package 1gb speed 2 tivo box 6 ask act li...,2.0
8,8,simply don't sold. example switching bt 67mb p...,3.0
9,9,virgin worse broadband company. ive trying boo...,1.0


In [497]:
df.drop("Unnamed: 0", axis=1, inplace=True) #unneeded column, resulted when csv was created from dataframe

The last step before splitting our data into train, validation and test sets is to tokenize the words.

In [498]:
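#max number of distinct words the tokenizer uses (most frequent first)
max_words=5000 
#max number of tokens per review: longer reviews are truncated, shorter ones padded
max_sequence=250
#size of each word embedding vector
embedding_dim=250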

tokenizer = tf.keras.preprocessing.text.Tokenizer(num_words=max_words)
tokenizer.fit_on_texts(df["Comments"].values)
word_index=tokenizer.word_index

Truncate and pad the input sequences so that they are all the same length for modeling.

In [461]:
print(f"found {len(word_index)} unique tokens")

found 12592 unique tokens
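The count above is larger than max_words because the Keras Tokenizer records every word it sees in word_index; num_words is only applied later, when texts are converted to sequences, and only the most frequent num_words - 1 words are kept. The sketch below mimics that behaviour in plain Python under simplifying assumptions: whitespace splitting only, made-up example texts, and hypothetical helper names (build_word_index, to_sequences) that are not part of Keras.

```python
from collections import Counter

def build_word_index(texts):
    """Rank words by frequency; the most frequent word gets index 1."""
    counts = Counter(word for text in texts for word in text.split())
    return {word: rank for rank, (word, _) in enumerate(counts.most_common(), start=1)}

def to_sequences(texts, word_index, num_words):
    """Replace words by their index, dropping any word whose index is num_words or higher."""
    return [
        [word_index[w] for w in text.split() if w in word_index and word_index[w] < num_words]
        for text in texts
    ]

texts = ["slow broadband slow service", "bad service"]  # made-up examples
word_index = build_word_index(texts)
sequences = to_sequences(texts, word_index, num_words=3)

print(len(word_index))  # every distinct word is recorded
print(sequences)        # only indices 1 and 2 survive the cutoff
```

So a vocabulary of 12592 tokens is compatible with max_words=5000: the rarer words stay in word_index but never reach the model.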


In [462]:
X = tokenizer.texts_to_sequences(df['Comments'].values)
X = tf.keras.utils.pad_sequences(X, maxlen=max_sequence)
print('Shape of data tensor:', X.shape)

Shape of data tensor: (4342, 250)
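By default, pad_sequences pads and truncates at the start of each sequence (padding='pre', truncating='pre'), so short reviews get leading zeros and long reviews keep their last max_sequence tokens. A minimal pure-Python sketch of that default behaviour, using a hypothetical helper name pad_or_truncate and made-up sequences:

```python
def pad_or_truncate(seq, maxlen, value=0):
    """Left-pad with `value`, or keep only the last `maxlen` items."""
    if len(seq) >= maxlen:
        return seq[-maxlen:]
    return [value] * (maxlen - len(seq)) + seq

short = pad_or_truncate([7, 3], maxlen=4)
long = pad_or_truncate([1, 2, 3, 4, 5, 6], maxlen=4)
print(short)
print(long)
```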


Step 2: split the data into train, validation and test sets (the code is borrowed from my Wine reviews classification neural network). We want our target to be the "Average Score" and our features to be the "Comments". The dataset is quite imbalanced: scores of 1 and 2 are far more common than any other score. Because we are building a classification model, this could be especially problematic.

To mitigate this imbalance we will _stratify_ the splits. This ensures that the relative class frequencies are approximately preserved in each of the train, validation and test sets.

In [463]:
y=df["Average Score"]

X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.340, random_state=0, stratify=y)
#roughly 66% training, 17% validation, 17% testing (0.34 held out, then split in half)
X_val, X_test, y_val, y_test =train_test_split(X_temp, y_temp, test_size = 0.5, random_state=0, stratify=y_temp)
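The two calls above chain a 0.34 hold-out with a 50/50 split of that hold-out. The resulting proportions can be checked with a little arithmetic; the variable names here are only for illustration:

```python
holdout = 0.34          # test_size of the first split
second_test_size = 0.5  # test_size of the second split

train_frac = 1 - holdout
val_frac = holdout * (1 - second_test_size)
test_frac = holdout * second_test_size

print(f"train={train_frac:.2f}, val={val_frac:.2f}, test={test_frac:.2f}")
```

That works out to about 66% training, 17% validation and 17% testing. A first test_size of 0.4 would give a 60/20/20 split instead.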

# Training an LSTM model

The time for creating a neural network has finally arrived! The comments have already been encoded as padded integer sequences above, so the model starts with an embedding layer that maps each token index to a dense vector, followed by an LSTM layer and a small dense classifier.

In [464]:
model = Sequential([
        Embedding(max_words, embedding_dim, input_length=X.shape[1]),#maps each token index to a dense vector; mask_zero is not set, so padded zeros are not masked
        #now we have a vector of numbers a nn can comprehend
        tf.keras.layers.SpatialDropout1D(0.2),
        tf.keras.layers.LSTM(32),
        Dense(32, activation="relu"),
        Dropout(0.4),
        Dense(5, activation="softmax") #5 units: labels 1-4 are used as class indices, leaving index 0 unused
])
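To get a feel for the size of this network, the trainable parameters can be counted by hand. This is a back-of-the-envelope sketch that assumes the standard Keras parameterisation of each layer (biases enabled); calling model.summary() on the built model would confirm the figures.

```python
vocab_size, embedding_dim = 5000, 250   # max_words and embedding_dim above
lstm_units, dense_units, n_outputs = 32, 32, 5

embedding = vocab_size * embedding_dim
# an LSTM has 4 gates, each with input weights, recurrent weights and a bias
lstm = 4 * (lstm_units * (embedding_dim + lstm_units) + lstm_units)
dense = lstm_units * dense_units + dense_units
output = dense_units * n_outputs + n_outputs

total = embedding + lstm + dense + output
print(embedding, lstm, dense, output, total)
```

The embedding table accounts for almost all of the parameters, which is worth keeping in mind when training on only a few thousand reviews.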

In [465]:
callback = [EarlyStopping(monitor='val_loss', patience=5),
             ModelCheckpoint(filepath='saved_model', monitor='val_loss', save_best_only=True)]
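With patience=5, EarlyStopping halts training once val_loss has failed to improve for five consecutive epochs. The sketch below reproduces that rule in plain Python; the function name early_stopping_epoch and the loss values are made up for illustration and are not taken from this notebook's training run.

```python
def early_stopping_epoch(val_losses, patience=5):
    """Return the 1-based epoch at which training would stop, or None."""
    best = float("inf")
    wait = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best = loss   # improvement: remember it and reset the counter
            wait = 0
        else:
            wait += 1     # no improvement this epoch
            if wait >= patience:
                return epoch
    return None

# hypothetical validation losses: best value at epoch 2, then steadily worse
stop = early_stopping_epoch([0.90, 0.80, 0.85, 0.90, 0.95, 1.00, 1.10], patience=5)
print(stop)
```

Because ModelCheckpoint is created with save_best_only=True, it only writes the model on epochs where val_loss improves.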



In [466]:
model.compile(RMSprop(learning_rate=0.001), 
             loss = SparseCategoricalCrossentropy(), #categorical cross entropy as multi classification problem
                metrics=["sparse_categorical_accuracy"])

In [467]:
model.evaluate(X_train, y_train) #evaluate performance of the model before any training
#untrained accuracy is around 0.106 (second value in the output below)



[1.612270474433899, 0.10610820353031158]
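The untrained loss of about 1.61 is what one would expect from a freshly initialised softmax layer that assigns roughly equal probability to each of its outputs: the cross-entropy of a uniform prediction over n classes is ln(n). A quick check, with uniform_loss as an illustrative name:

```python
import math

# cross-entropy of a uniform prediction over n output units
uniform_loss = {n: math.log(n) for n in (5, 4)}

for n, loss in uniform_loss.items():
    print(n, round(loss, 3))
```

The value for 5 units is close to the loss reported above, and the value for 4 units is the matching reference point for the 4-unit models trained later in this notebook.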

In [468]:
history = model.fit(X_train, y_train, epochs=50, validation_data=(X_val, y_val), callbacks=callback)

Epoch 1/50



INFO:tensorflow:Assets written to: saved_model\assets


Epoch 2/50



INFO:tensorflow:Assets written to: saved_model\assets


Epoch 3/50
Epoch 4/50
Epoch 5/50
Epoch 6/50
Epoch 7/50


In [469]:
model.save("saved_model/model1")



INFO:tensorflow:Assets written to: saved_model/model1\assets


In [470]:
df["Average Score"].value_counts()

2.0    2819
1.0    1227
3.0     275
4.0      21
Name: Average Score, dtype: int64

The model has trained, but its validation accuracy is only 0.6504, and it appears to be overfitting. The dataset is quite imbalanced, which we need to account for: there are lots of 2.0-star reviews and very few 3- and 4-star reviews.
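One useful reference point for that accuracy is the share of the most common class, since a model that always predicted 2.0 would score about that much. A short sketch using the counts printed above (the variable names are illustrative):

```python
counts = {2.0: 2819, 1.0: 1227, 3.0: 275, 4.0: 21}  # from value_counts() above

total = sum(counts.values())
majority_share = max(counts.values()) / total

print(total, round(majority_share, 4))
```

The majority share over the whole dataset is roughly 0.65, so a validation accuracy of 0.6504 is about what always predicting the majority class would achieve. This suggests, though it does not prove, that the model has learned little beyond the class imbalance.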


To increase the accuracy, it might be worth merging the two smallest classes into one with a combined rating of 3.5.

In addition, to tackle the large number of 2.0- and 1.0-star reviews, we will use Jaccard similarity to look for reviews that are similar enough to each other to be counted as duplicates, and then remove them. Jaccard similarity is a mathematical function that does just that: it measures how similar two sets are to each other.
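For reference, the standard definition of the Jaccard similarity of two sets $A$ and $B$ is the size of their intersection divided by the size of their union:

```latex
J(A, B) = \frac{|A \cap B|}{|A \cup B|}, \qquad 0 \le J(A, B) \le 1
```

As a made-up example, for $A = \{\text{slow}, \text{broadband}, \text{service}\}$ and $B = \{\text{slow}, \text{service}, \text{rude}, \text{staff}\}$ the intersection has 2 elements and the union has 5, so $J(A, B) = 2/5 = 0.4$.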

First, let's address the two smaller classes and merge them together.


In [500]:
#merge the two smallest classes (3 and 4 stars) into a single label, 3.5,
#leaving three distinct labels: 1.0, 2.0 and 3.5
condition = df['Average Score'].isin([3, 4])
df.loc[condition, 'Average Score'] = 3.5
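Sparse categorical cross-entropy expects integer class indices running from 0 up to one less than the number of output units, so a fractional label such as 3.5 is worth handling explicitly. One option, sketched here as an assumption rather than as what this notebook does, is to map the three distinct labels to the contiguous indices 0, 1 and 2, which would pair with a 3-unit softmax layer. The names labels, label_to_index and encoded are hypothetical.

```python
labels = [1.0, 2.0, 3.5, 2.0, 1.0, 3.5]  # made-up sample of merged scores

# assign each distinct label a contiguous integer index, in sorted order
label_to_index = {label: i for i, label in enumerate(sorted(set(labels)))}
encoded = [label_to_index[label] for label in labels]

print(label_to_index)
print(encoded)
```

Keeping the mapping around also makes it easy to translate predicted indices back into scores.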

In [501]:
X = tokenizer.texts_to_sequences(df['Comments'].values)
X = tf.keras.utils.pad_sequences(X, maxlen=max_sequence)
y=df["Average Score"]

X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.340, random_state=0, stratify=y)
#roughly 66% training, 17% validation, 17% testing (0.34 held out, then split in half)
X_val, X_test, y_val, y_test =train_test_split(X_temp, y_temp, test_size = 0.5, random_state=0, stratify=y_temp)

Training again with the merged classes:

In [506]:
model3 = Sequential([
        Embedding(max_words, embedding_dim, input_length=X.shape[1]),#maps each token index to a dense vector; mask_zero is not set, so padded zeros are not masked
        #now we have a vector of numbers a nn can comprehend
        tf.keras.layers.SpatialDropout1D(0.2),
        tf.keras.layers.LSTM(32),
        Dense(32, activation="relu"),
        Dropout(0.4),
        Dense(4, activation="softmax")
])

model3.compile(RMSprop(learning_rate=0.001), 
             loss = SparseCategoricalCrossentropy(), #categorical cross entropy as multi classification problem
                metrics=["sparse_categorical_accuracy"])

In [507]:
history = model3.fit(X_train, y_train, epochs=50, validation_data=(X_val, y_val), callbacks=callback)

Epoch 1/50
Epoch 2/50
Epoch 3/50
Epoch 4/50
Epoch 5/50
Epoch 6/50
Epoch 7/50


In [508]:
df["Average Score"].value_counts()

2.0    2819
1.0    1227
3.5     296
Name: Average Score, dtype: int64

Now let's deal with the comments. First, tokenize the comments column and define the Jaccard function.

In [509]:
tokenize = lambda doc: doc.lower().split(" ")
tokenized_documents = [tokenize(d) for d in df["Comments"]] # tokenized docs

In [510]:
print(tokenized_documents[6])

['wish', 'zero', 'star', 'virgin', 'media', 'unfortunately', 'lowest', 'star', '1.', 'intend', 'joining', 'network', "don't", 'want', 'bills', 'increasing', 'informed', "don't", 'mistake', 'joing', 'virgin', 'media.', 'set', 'liars', 'lied', '2019', 'join', 'network', 'pay', '33', 'pounds', '18', 'month', 'contract.', '4', 'month', 'virgin', 'media', 'increase', 'price', '59', 'pounds', 'month', "wasn't", 'initially', 'told', 'me.', 'realised', "they've", 'increased', 'monthly', 'end', 'contract', 'ended', 'contract', '102.27', 'insisted', 'pay', 'began', 'send', 'debt', 'collector', 'disturb', 'payment', 'spoilt', 'credit', 'rating', 'reported', 'case', 'credit', 'authority', 'credit', 'affected', 'them.', 'love', 'credit', 'report', "don't", 'want', 'paying', 'extra', 'bills', 'comparison', 'initially', 'agree', "don't", 'join', 'virgin', 'media.', 'lastly', 'saw', "they've", 'gone', 'bit', 'spoil', 'credit', 'report', 'contacted', 'voice', 'complain', 'customer', 'service', 'listene

In [511]:
def jaccard_similarity(query: set, document: set) -> float:
    
    """Returns the Jaccard similarity between a query and a specified document"""

    intersection = set(query).intersection(set(document))
    union = set(query).union(set(document))
    #ratio of shared elements to all distinct elements
    return len(intersection)/len(union)

Let's apply this measure to the comments column! As the most common score is 2, we will try to eliminate reviews that have two-star ratings, or at least look like they could have two-star ratings.

To do this, I will (arbitrarily) pick one comment as the reference and compare every comment in the dataframe to it using the Jaccard metric. Note that the code below uses the comment at index 0 as the reference, which has a 1-star rating in the table above rather than a 2-star one.

In [512]:
values=[]
def jaccard_similarity(query: str) -> None:
    
    """Appends to `values` the Jaccard similarity between a query and the first comment"""

    #note: set() on a string yields its characters, so this compares character sets, not word sets
    reference = set(df["Comments"][0])
    intersection = set(query).intersection(reference)
    union = set(query).union(reference)
    values.append(len(intersection)/len(union))
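One thing to be aware of: calling set() on a Python string produces the set of its characters, not its words. Since most English sentences share most of their letters, character-level Jaccard scores tend to be high for almost any pair of comments. The sketch below contrasts the two on made-up strings; jaccard is a hypothetical helper, and in this notebook the word-level variant would operate on the entries of tokenized_documents.

```python
def jaccard(a: set, b: set) -> float:
    """Size of the intersection divided by the size of the union."""
    return len(a & b) / len(a | b)

first = "slow broadband"   # made-up comments
second = "fast broadband"

char_score = jaccard(set(first), set(second))                   # sets of characters
token_score = jaccard(set(first.split()), set(second.split()))  # sets of words

print(round(char_score, 3), round(token_score, 3))
```

Here the character-level score is twice the word-level one, even though the two comments share only one of their words.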

In [513]:
df["Comments"].apply(jaccard_similarity)

0       None
1       None
2       None
3       None
4       None
        ... 
4337    None
4338    None
4339    None
4340    None
4341    None
Name: Comments, Length: 4342, dtype: object

Create a new column holding the Jaccard similarity scores:

In [514]:
df["Jaccard Similarity"]=values

In [515]:
df.head()

Unnamed: 0,Comments,Average Score,Jaccard Similarity
0,moved uk end august got virgin media broadband...,1.0,1.0
1,truly attrocious service terms broadband custo...,1.0,0.702703
2,hard cancel contract. phone 2 hours t o spend ...,2.0,0.666667
3,pay 350mbps package managed 250mbps upload 34 ...,2.0,0.692308
4,worst customer service: -the bots ask irreleva...,2.0,0.692308


Keep only the rows with a Jaccard score below 0.70, i.e. drop every row scoring 0.70 or higher:

In [516]:
df=df[df["Jaccard Similarity"]<0.70]


In [517]:
df.head()

Unnamed: 0,Comments,Average Score,Jaccard Similarity
2,hard cancel contract. phone 2 hours t o spend ...,2.0,0.666667
3,pay 350mbps package managed 250mbps upload 34 ...,2.0,0.692308
4,worst customer service: -the bots ask irreleva...,2.0,0.692308
5,informed given upgrade difference speeds reboo...,3.5,0.527778
7,sold package 1gb speed 2 tivo box 6 ask act li...,2.0,0.648649


In [518]:

X = tokenizer.texts_to_sequences(df['Comments'].values)
X = tf.keras.utils.pad_sequences(X, maxlen=max_sequence)
y=df["Average Score"]

X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.340, random_state=0, stratify=y)
#roughly 66% training, 17% validation, 17% testing (0.34 held out, then split in half)
X_val, X_test, y_val, y_test =train_test_split(X_temp, y_temp, test_size = 0.5, random_state=0, stratify=y_temp)

In [519]:
model2 = Sequential([
        Embedding(max_words, embedding_dim, input_length=X.shape[1]),#maps each token index to a dense vector; mask_zero is not set, so padded zeros are not masked
        #now we have a vector of numbers a nn can comprehend
        tf.keras.layers.SpatialDropout1D(0.2),
        tf.keras.layers.LSTM(32),
        Dense(32, activation="relu"),
        Dropout(0.4),
        Dense(4, activation="softmax")
])

In [520]:
model2.compile(RMSprop(learning_rate=0.001), 
             loss = SparseCategoricalCrossentropy(), #categorical cross entropy as multi classification problem
                metrics=["sparse_categorical_accuracy"])

In [521]:
model2.evaluate(X_train, y_train) #evaluate performance of the model before any training



[1.381453514099121, 0.3383311629295349]

In [522]:
history = model2.fit(X_train, y_train, epochs=50, validation_data=(X_val, y_val), callbacks=callback)

Epoch 1/50
Epoch 2/50
Epoch 3/50
Epoch 4/50
Epoch 5/50
Epoch 6/50


In [523]:
model2.save("saved_model/model2")



INFO:tensorflow:Assets written to: saved_model/model2\assets


In [524]:
new_model = tf.keras.models.load_model('saved_model/model2')


# After saving the models let's evaluate them:

In [525]:
first_model=tf.keras.models.load_model("saved_model/model1/")

In [526]:
#results from the first model on the validation set
first_model.evaluate(X_val, y_val)



[0.6878027319908142, 0.7493671178817749]

In [527]:
#results from the first model on the test set
first_model.evaluate(X_test, y_test)



[0.6655564904212952, 0.7954545617103577]

In [528]:
#results from the second model on the validation set
new_model.evaluate(X_val, y_val)



[1.4099856615066528, 0.6177214980125427]

In [493]:
#results from the second model on the test set
new_model.evaluate(X_test, y_test)



[1.4404915571212769, 0.5345699787139893]