# Model Creation

In this Jupyter notebook we take the preprocessed data generated by preprocessing_part_2.ipynb and build a machine learning model
that reads each review and tries to predict its average score. In other words, we are building a text classifier.

In [21]:
#start with the relevant imports

#used to load and visualise the data
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

#used to build the model
import tensorflow as tf
from keras.callbacks import EarlyStopping, ModelCheckpoint
from keras import Sequential
from keras.layers import Dense, TextVectorization, Dropout, Embedding, LSTM
from keras.optimizers import Adam
from keras.losses import SparseCategoricalCrossentropy

Step 1: load and inspect the csv with pandas

In [22]:
#first load the data with pandas
df=pd.read_csv("./data/data_ready_for_model.csv")


In [23]:
df.head()

| Unnamed: 0.1 | Unnamed: 0 | Comments | Average Score |
|---|---|---|---|
| 0 | 0 | Moved back to the UK end of August and got Vir... | 1.0 |
| 1 | 1 | A truly attrocious service, both in terms of b... | 1.0 |
| 2 | 2 | They make it as hard as they can for you to ca... | 2.0 |
| 3 | 3 | Pay for the 350Mbps package but only ever mana... | 2.0 |
| 4 | 4 | The worst customer service:\r-The bots ask irr... | 2.0 |


In [24]:
#unneeded index columns, created each time the dataframe was written to csv
df.drop(columns=["Unnamed: 0.1", "Unnamed: 0"], inplace=True)

Step 2: prepare the data into train, validation and test sets (code is borrowed from my Wine reviews classification Neural Network). We want our target to be the "Average Score" and our feature to be the "Comments". The dataset is quite imbalanced, because scores of 1 and 2 are far more common than any other score. Because we are implementing a classification model, this could be especially problematic.

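To overcome this imbalance we can _stratify_ the data, which ensures that the relative class frequencies are approximately preserved in the train, validation and test sets. Note that the split in the next cell is a plain random shuffle, so it preserves the class frequencies only on average.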

In [25]:
#shuffle the rows, then cut at 80% and 90% to get an 80/10/10 train/val/test split
train, val, test = np.split(df.sample(frac=1), [int(0.8*len(df)), int(0.9 * len(df))])
print(len(train), len(val), len(test))

3473 434 435
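
The cell above shuffles the rows at random, so the class proportions in each split can drift, especially for the rarer scores. As a minimal sketch of what a stratified 80/10/10 split could look like, the helper below samples each score separately with pandas. The function name `stratified_split`, the seed, and the toy rows are assumptions made for illustration; they are not part of the original notebook.

```python
import pandas as pd

def stratified_split(df, label_col, fracs=(0.8, 0.1), seed=42):
    """Split df into train/val/test, sampling each class separately."""
    train_frac, val_frac = fracs
    # Sample the training rows class by class so each score keeps its share.
    train = df.groupby(label_col, group_keys=False).sample(
        frac=train_frac, random_state=seed)
    rest = df.drop(train.index)
    # Take the validation share from what is left; the remainder is the test set.
    val = rest.groupby(label_col, group_keys=False).sample(
        frac=val_frac / (1 - train_frac), random_state=seed)
    test = rest.drop(val.index)
    return train, val, test

# Toy frame standing in for the review data (made-up rows, imbalanced on purpose).
toy = pd.DataFrame({
    "Comments": [f"review {i}" for i in range(100)],
    "Average Score": [1.0] * 60 + [2.0] * 20 + [3.0] * 10 + [4.0] * 10,
})
train, val, test = stratified_split(toy, "Average Score")
print(len(train), len(val), len(test))
print(train["Average Score"].value_counts(normalize=True).sort_index())
```

This relies on the dataframe having a unique index, which holds for a frame freshly loaded with `pd.read_csv`. With the real data the call would be `stratified_split(df, "Average Score")`.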


In [26]:
#creating a function that feeds data to the model

def df_to_dataset(dataframe, shuffle=True, batch_size=1024):
  df = dataframe.copy()
  labels = df.pop('Average Score')
  df = df["Comments"]
  ds = tf.data.Dataset.from_tensor_slices((df, labels))
  if shuffle:
    ds = ds.shuffle(buffer_size=len(dataframe))
  ds = ds.batch(batch_size)
  ds = ds.prefetch(tf.data.AUTOTUNE)
  return ds

In [27]:
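#creating instances of our train, val and test data
#(shuffling only matters for training, so it is switched off for validation and test)
train_data = df_to_dataset(train)
valid_data = df_to_dataset(val, shuffle=False)
test_data = df_to_dataset(test, shuffle=False)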
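
One thing worth checking before training: `SparseCategoricalCrossentropy` expects integer class indices in the range 0 to `num_classes - 1`, while the "Average Score" column holds floats such as 1.0 and 2.0. Whether a conversion is needed depends on the values actually present in the column, which this notebook does not show in full. The sketch below assumes, purely for illustration, that the scores run from 1.0 to 5.0, and maps them to indices 0 to 4 with NumPy.

```python
import numpy as np

# Made-up sample of scores; the real column may contain other values.
scores = np.array([1.0, 1.0, 2.0, 5.0, 3.0, 4.0])

# np.unique returns the sorted distinct scores and, for every row,
# the position of its score in that sorted list (a 0-based class index).
classes, class_ids = np.unique(scores, return_inverse=True)

print(classes)    # the distinct scores, in ascending order
print(class_ids)  # one integer class index per row
```

To turn a predicted class index back into a score, index into `classes`, for example `classes[class_ids[3]]`.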

# Training an LSTM model

The time for creating a neural network has finally arrived! First, let's encode our comments using a text vectorization layer:

In [28]:
#max_tokens = maximum vocabulary size (number of distinct words the encoder will remember)
encoder = TextVectorization(max_tokens=2000)

#train_data yields (comment, score) pairs, but the encoder only needs the text,
#so use a lambda function to pass in just the comments
encoder.adapt(train_data.map(lambda comment, score: comment))

Let's check our vocabulary. These are the 50 most frequent tokens the encoder has learned; each one is mapped to an integer index. `''` is the padding token and `[UNK]` represents any unknown (out-of-vocabulary) token.

In [29]:
vocab=np.array(encoder.get_vocabulary())
vocab[:50]

array(['', '[UNK]', 'to', 'the', 'i', 'and', 'a', 'they', 'for', 'of',
       'is', 'my', 'have', 'virgin', 'it', 'service', 'with', 'was',
       'that', 'in', 'you', 'not', 'on', 'me', 'customer', 'this', 'be',
       'but', 'as', 'them', 'no', 'had', 'we', 'are', 'broadband', 'been',
       'so', 'get', 'when', 'up', 'at', 'will', 'media', 'from', 'their',
       'would', 'contract', 'all', 'phone', 'an'], dtype='<U15')
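
To make the mapping concrete, here is a rough pure-Python sketch of what the encoder does with its default settings: lowercase the text, strip punctuation, split on whitespace, keep the most frequent tokens, and reserve index 0 for padding and index 1 for `[UNK]`. The helper names `standardize`, `build_vocab` and `encode` and the example sentences are made up for this illustration, and details such as tie-breaking between equally frequent words may differ from the real layer.

```python
import string
from collections import Counter

def standardize(text):
    # Lowercase, remove punctuation, then split on whitespace.
    cleaned = text.lower().translate(str.maketrans("", "", string.punctuation))
    return cleaned.split()

def build_vocab(texts, max_tokens):
    # Two slots are reserved: '' for padding and '[UNK]' for unknown words.
    counts = Counter(tok for text in texts for tok in standardize(text))
    most_common = [tok for tok, _ in counts.most_common(max_tokens - 2)]
    return ["", "[UNK]"] + most_common

def encode(text, vocab):
    # Words outside the vocabulary fall back to index 1 ('[UNK]').
    index = {tok: i for i, tok in enumerate(vocab)}
    return [index.get(tok, 1) for tok in standardize(text)]

texts = ["The service was slow.", "The service was down!", "The service?", "the"]
vocab = build_vocab(texts, max_tokens=5)
print(vocab)                                    # ['', '[UNK]', 'the', 'service', 'was']
print(encode("The broadband was slow", vocab))  # [2, 1, 4, 1]
```

With `max_tokens=5` only three real words fit, so both "broadband" (never seen) and "slow" (seen, but too rare) are encoded as 1.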

Now we can create our model:

In [30]:
model = Sequential([
        encoder,
        #mask_zero=True treats index 0 as padding so we can handle inputs of variable lengths
        Embedding(input_dim=len(encoder.get_vocabulary()), output_dim=32, mask_zero=True),
        #now each comment is a sequence of dense vectors a nn can comprehend
        LSTM(32),
        Dense(32, activation="relu"),
        Dropout(0.4),
        Dense(5, activation="softmax")
])

In [31]:
callback = [EarlyStopping(monitor='val_loss', patience=5),
             ModelCheckpoint(filepath='best_model.h5', monitor='val_loss', save_best_only=True)]
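
As a rough illustration of the `patience=5` rule, the sketch below tracks the best validation loss seen so far and stops once five epochs in a row fail to improve on it. It is a simplified stand-in written for this notebook, not Keras's actual implementation; the function name `early_stopping_epoch` and the loss values are made up.

```python
def early_stopping_epoch(val_losses, patience):
    """Return the 1-based epoch at which training would stop, or None."""
    best = float("inf")
    wait = 0  # number of consecutive epochs without improvement
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best = loss
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                return epoch
    return None

# Made-up validation losses: the best value (1.30) is reached at epoch 3.
losses = [1.60, 1.40, 1.30, 1.31, 1.35, 1.32, 1.33, 1.34, 1.36]
print(early_stopping_epoch(losses, patience=5))  # 8
```

Here training stops after epoch 8, five epochs past the best one. That is why pairing early stopping with a checkpoint that keeps only the best weights is useful: the final weights are not the best ones.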



In [32]:
model.compile(Adam(learning_rate=0.001),
              loss=SparseCategoricalCrossentropy(), #sparse categorical cross-entropy, as this is a multi-class problem with integer labels
              metrics=["accuracy"])

In [34]:
model.evaluate(train_data) #evaluate performance of model without training it first
#accuracy is only around 0.09, and the loss is about 1.61, roughly what an untrained 5-class model should give



[1.611752986907959, 0.0889720693230629]
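
The loss value above is a useful sanity check. An untrained softmax layer gives every class roughly the same probability, so with 5 classes the cross-entropy should sit near ln(5). A quick check of that arithmetic:

```python
import math

num_classes = 5
uniform_p = 1.0 / num_classes

# Cross-entropy for one example is -log(probability given to the true class).
loss = -math.log(uniform_p)
print(round(loss, 4))  # 1.6094
```

The measured 1.6118 is very close to this, which suggests the model starts out predicting near-uniform probabilities.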

In [None]:
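#pass in the callbacks defined earlier so early stopping and checkpointing actually take effect
history = model.fit(train_data, epochs=50, validation_data=valid_data, callbacks=callback)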