# Model Creation

In this Jupyter notebook we take the preprocessed data generated by preprocessing_part_2.ipynb and build a machine learning model from it
that reads each review and tries to predict its average score. Thus we are building a text classifier.

In [113]:
#start with the relevant imports

#used to load and visualise the data
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

#used to build the model
from sklearn.model_selection import train_test_split
import tensorflow as tf
from keras.callbacks import EarlyStopping, ModelCheckpoint
from keras import Sequential
from keras.layers import Dense, TextVectorization, Dropout, Embedding, LSTM
from keras.optimizers import Adam
from keras.losses import SparseCategoricalCrossentropy

Step 1: Load and inspect the CSV with pandas

In [114]:
#first load the data with pandas
df=pd.read_csv("./data/data_ready_for_model.csv")


In [115]:
df.head()

Unnamed: 0.1,Unnamed: 0,Comments,Average Score
0,0,Moved back to the UK end of August and got Vir...,1.0
1,1,"A truly attrocious service, both in terms of b...",1.0
2,2,They make it as hard as they can for you to ca...,2.0
3,3,Pay for the 350Mbps package but only ever mana...,2.0
4,4,The worst customer service:\r-The bots ask irr...,2.0


In [116]:
df.drop("Unnamed: 0", axis=1, inplace=True) #unneeded column, resulted when csv was created from dataframe

Step 2: Prepare the data as train, validation and test sets (the code is borrowed from my Wine reviews classification neural network). Our target is the "Average Score" column and our feature is the "Comments" column. The dataset is quite imbalanced: scores of 1 and 2 are more common than any other score. Because we are implementing a classification model, this could be especially problematic.

To mitigate this we will _stratify_ the splits. This ensures that the relative class frequencies are approximately preserved in each of the train, validation and test sets.
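To make the effect of stratification concrete, here is a minimal pure-Python sketch. It is a hand-rolled illustration, not scikit-learn's implementation: the helper name `stratified_split` and the toy labels are made up for the example. It splits each class separately, so every class keeps roughly the same share on both sides of the split.

```python
import random
from collections import Counter, defaultdict

def stratified_split(labels, test_size, seed=0):
    """Return (train_idx, test_idx), splitting each class separately."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_size)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return train_idx, test_idx

# toy, imbalanced labels (hypothetical): mostly scores of 1 and 2
toy_labels = [1.0] * 50 + [2.0] * 30 + [3.0] * 10 + [4.0] * 6 + [5.0] * 4

train_idx, test_idx = stratified_split(toy_labels, test_size=0.5)
train_counts = Counter(toy_labels[i] for i in train_idx)
test_counts = Counter(toy_labels[i] for i in test_idx)
print(sorted(train_counts.items()))
print(sorted(test_counts.items()))
```

Even the rarest class ends up in both halves, which a purely random split of such a small class would not guarantee.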

In [117]:
X=df.drop("Average Score", axis=1)
y=df["Average Score"]

X_train, X_temp, y_train, y_temp = train_test_split(X, y, test_size=0.340, random_state=0, stratify=y)
#approx. 66% training, 17% validation, 17% testing
X_val, X_test, y_val, y_test = train_test_split(X_temp, y_temp, test_size=0.5, random_state=0, stratify=y_temp)
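As a quick sanity check on the two-stage split, the resulting proportions can be worked out by hand. The sketch below only redoes the arithmetic for the `test_size` values used above; the variable names are illustrative.

```python
first_test_size = 0.340   # fraction held out by the first split
second_test_size = 0.5    # fraction of the held-out part that becomes the test set

train_frac = 1 - first_test_size
val_frac = first_test_size * (1 - second_test_size)
test_frac = first_test_size * second_test_size

print(f"train={train_frac:.2f}, val={val_frac:.2f}, test={test_frac:.2f}")
```

So these settings give roughly a 66/17/17 split. An exact 60/20/20 split would need `test_size=0.4` in the first call.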

In [118]:

def df_to_dataset(features, target, shuffle=True, batch_size=1024):
  ds = tf.data.Dataset.from_tensor_slices((features, target))
  if shuffle:
    ds = ds.shuffle(buffer_size=len(features))
  ds = ds.batch(batch_size)
  ds = ds.prefetch(tf.data.AUTOTUNE)
  return ds


In [119]:
train_data= df_to_dataset(X_train, y_train)
valid_data= df_to_dataset(X_val, y_val)
test_data= df_to_dataset(X_test, y_test)

In [120]:
print(train_data)

<PrefetchDataset element_spec=(TensorSpec(shape=(None, 1), dtype=tf.string, name=None), TensorSpec(shape=(None,), dtype=tf.float64, name=None))>


# Training an LSTM model

The time for creating a neural network has finally arrived! First, let's encode our comments using a text vectorizer layer:

In [130]:
#max_tokens = maximum size of the vocabulary the encoder will keep
encoder=TextVectorization(max_tokens=2000)

#train_data yields (comment, score) pairs, but the encoder only
#needs the text, so use a lambda function to drop the score.
encoder.adapt(train_data.map(lambda comment, score: comment))



Let's check our vocabulary. These are the first 50 tokens the encoder has learned, ordered from most to least frequent; `[UNK]` stands in for any unknown token and the empty string is the padding token.

In [132]:
vocab=np.array(encoder.get_vocabulary())
vocab[:50]

array(['', '[UNK]', 'to', 'the', 'i', 'and', 'a', 'they', 'for', 'is',
       'of', 'have', 'my', 'virgin', 'it', 'service', 'with', 'was',
       'that', 'in', 'you', 'not', 'on', 'me', 'customer', 'this', 'but',
       'be', 'as', 'had', 'them', 'no', 'are', 'we', 'so', 'been', 'get',
       'when', 'up', 'broadband', 'at', 'media', 'will', 'from', 'their',
       'would', 'an', 'all', 'contract', 'if'], dtype='<U15')
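The listing above can be read as a frequency table: index 0 is reserved for padding, index 1 for unknown tokens, and the remaining slots go to the most frequent words. The pure-Python sketch below imitates that behaviour on made-up reviews. It approximates the layer's default settings (lowercasing, stripping punctuation, splitting on whitespace) and is not the Keras implementation; the helper names are hypothetical.

```python
import string
from collections import Counter

def standardize(text):
    """Lowercase and strip punctuation, roughly like the layer's default."""
    return text.lower().translate(str.maketrans("", "", string.punctuation))

def build_vocab(texts, max_tokens):
    """Reserve index 0 (padding) and 1 ([UNK]), then add the most frequent words."""
    counts = Counter()
    for text in texts:
        counts.update(standardize(text).split())
    most_common = [token for token, _ in counts.most_common(max_tokens - 2)]
    return ["", "[UNK]"] + most_common

def encode(text, vocab):
    """Map each word to its vocabulary index; unseen words map to 1."""
    index = {token: i for i, token in enumerate(vocab)}
    return [index.get(token, 1) for token in standardize(text).split()]

# made-up reviews, for illustration only
toy_reviews = ["The service is bad!", "The service is slow.", "The service?", "the"]
toy_vocab = build_vocab(toy_reviews, max_tokens=5)
encoded = encode("The broadband is bad!", toy_vocab)

print(toy_vocab)   # ['', '[UNK]', 'the', 'service', 'is']
print(encoded)     # [2, 1, 4, 1]
```

Both "broadband" (never seen) and "bad" (seen, but cut off by `max_tokens`) end up as index 1, which is why a small `max_tokens` throws away information about rarer words.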

In [123]:
model = Sequential([
        encoder,
        Embedding(input_dim=len(encoder.get_vocabulary()), output_dim=32, mask_zero=True),#mask_zero=True so padded positions (index 0) are ignored and inputs can vary in length
        #each comment is now a sequence of vectors that a neural network can work with
        LSTM(32),
        Dense(32, activation="relu"),
        Dropout(0.4),
        Dense(5, activation="softmax")
])

ValueError: Exception encountered when calling layer "text_vectorization_7" (type TextVectorization).

When using `TextVectorization` to tokenize strings, the input rank must be 1 or the last shape dimension must be 1. Received: inputs.shape=(None, None) with rank=2

Call arguments received by layer "text_vectorization_7" (type TextVectorization):
  • inputs=tf.Tensor(shape=(None, None), dtype=string)
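For reference, the layer sizes in the `Sequential` definition above imply the trainable-parameter counts below. This is a back-of-the-envelope sketch that assumes the vocabulary fills all 2000 `max_tokens` slots; a smaller vocabulary would shrink the embedding table accordingly.

```python
vocab_size = 2000   # assumption: len(encoder.get_vocabulary()) reaches max_tokens
embed_dim = 32
lstm_units = 32
dense_units = 32
num_classes = 5

# one 32-dimensional vector per vocabulary entry
embedding_params = vocab_size * embed_dim
# an LSTM has 4 gates, each with input weights, recurrent weights and a bias
lstm_params = 4 * (lstm_units * (embed_dim + lstm_units) + lstm_units)
# fully connected layers: weights plus biases
dense_params = lstm_units * dense_units + dense_units
output_params = dense_units * num_classes + num_classes

total_params = embedding_params + lstm_params + dense_params + output_params
print(embedding_params, lstm_params, dense_params, output_params, total_params)
```

Under that assumption the embedding table accounts for most of the parameters, so `max_tokens` and `output_dim` are the main levers on model size.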

In [None]:
callback = [EarlyStopping(monitor='val_loss', patience=5),
            ModelCheckpoint(filepath='saved_model', monitor='val_loss', save_best_only=True)]
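The idea behind `patience=5` can be sketched in a few lines of plain Python: training stops once the validation loss has failed to improve for five epochs in a row. This is a simplified illustration of the rule, not the Keras source; the function name and the loss values are invented.

```python
def early_stopping_epoch(val_losses, patience):
    """Return (stop_epoch, best_epoch), both 0-based.

    stop_epoch is None if the patience is never exhausted.
    """
    best_loss = float("inf")
    best_epoch = None
    wait = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            # improvement: remember it and reset the counter
            best_loss, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch
    return None, best_epoch

# made-up validation losses: the best value comes at epoch 2
val_losses = [1.00, 0.90, 0.80, 0.85, 0.82, 0.81, 0.83, 0.90, 0.70]
stop_epoch, best_epoch = early_stopping_epoch(val_losses, patience=5)
print(stop_epoch, best_epoch)   # 7 2
```

In this toy run training halts at epoch 7 and never reaches the lower loss at epoch 8, which is the trade-off of a finite patience. A checkpoint with `save_best_only=True` would hold the weights from epoch 2 at that point.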



In [None]:
model.compile(optimizer=Adam(learning_rate=0.001),
              loss=SparseCategoricalCrossentropy(), #sparse categorical cross-entropy: a multi-class problem with labels given as class indices rather than one-hot vectors
              metrics=["accuracy"])
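Sparse categorical cross-entropy treats each label as a class index, so with five output units the valid labels are 0 to 4. The sketch below shows the per-example loss and one way to map scores onto that range. It assumes the scores are whole numbers from 1 to 5, which the notebook does not confirm; the scores and probabilities are invented.

```python
import math

def sparse_categorical_crossentropy(class_index, probabilities):
    """Loss for one example: negative log of the probability of the true class."""
    if not 0 <= class_index < len(probabilities):
        raise ValueError("class index must lie in [0, number of classes)")
    return -math.log(probabilities[class_index])

# hypothetical scores, assumed to be whole numbers from 1 to 5
scores = [1.0, 2.0, 5.0]
class_ids = [int(score) - 1 for score in scores]   # 1..5 -> 0..4

probabilities = [0.7, 0.1, 0.1, 0.05, 0.05]        # made-up softmax output
losses = [sparse_categorical_crossentropy(c, probabilities) for c in class_ids]
print(class_ids)
print([round(loss, 3) for loss in losses])
```

If the scores really do run up to 5, an unshifted label of 5 would fall outside the five-unit softmax, so it is worth checking `df["Average Score"].unique()` before training.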

In [None]:
model.evaluate(train_data) #evaluate the model before any training, as a baseline
#loss is about 1.61 and accuracy about 0.12



[1.6107749938964844, 0.11949323117733002]
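The loss reported above, about 1.61, is close to what a model would score if it spread its probability evenly over the five classes. A short worked check, using only the number of classes from the final `Dense(5)` layer:

```python
import math

num_classes = 5
# cross-entropy of a uniform prediction: -log(1 / num_classes) = log(num_classes)
uniform_loss = math.log(num_classes)
print(round(uniform_loss, 4))   # 1.6094
```

An untrained network with small random weights outputs something close to a uniform distribution, so a starting loss near ln 5 is what one would expect here.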

In [None]:
history = model.fit(train_data, epochs=50, validation_data=valid_data, callbacks=callback) #pass the callbacks defined above so they take effect

Epoch 1/50
...
Epoch 50/50
