## Final Project Day 3 Solution: Neural Networks and Transformers for a Classification Task

We continue to work with the final project dataset to see how Recurrent Neural Networks (RNNs), Long Short-Term Memory networks (LSTMs), and Transformers perform when predicting the __isPositive__ field of the dataset.

* We are giving you two pieces of code to read your training and test datasets.
* Use the notebooks from the class and implement the model, train and test with the corresponding datasets.

*Note: Incorporate all that you have learned over Day 1 and Day 2. Feel free to use your processed data from Day 1 to save on redundant work.*

Overall dataset schema:
* __reviewText:__ Text of the review
* __summary:__ Summary of the review
* __verified:__ Whether the purchase was verified (True or False)
* __time:__ UNIX timestamp for the review
* __log_votes:__ Logarithm-adjusted votes log(1+votes)
* __isPositive:__ Rating of the review (1 = positive, 0 = negative)
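
As a small illustration of the __log_votes__ definition above, the sketch below applies log(1+votes) with Python's `math.log1p` and inverts it with `math.expm1`. The helper names `to_log_votes` and `from_log_votes` are hypothetical and not part of the project code.

```python
import math

def to_log_votes(votes):
    # log(1 + votes), as described in the schema
    return math.log1p(votes)

def from_log_votes(log_votes):
    # Inverse transform: exp(log_votes) - 1, rounded to a whole vote count
    return round(math.expm1(log_votes))

zero_votes = to_log_votes(0)      # a review with no votes maps to 0.0
two_votes = to_log_votes(2)       # log(3), about 1.0986
recovered = from_log_votes(two_votes)
print(zero_votes, two_votes, recovered)
```

A value such as 1.098612 in the __log_votes__ column therefore corresponds to a review with 2 votes.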

Importing libraries:

In [1]:
import re
from collections import Counter
import numpy as np
from sklearn.model_selection import train_test_split
import mxnet as mx
from mxnet import gluon, nd, autograd
from mxnet.gluon import nn, rnn

### 1. Reading the dataset

Let's read the datasets below. We will later fill in the missing values of the reviewText field, which we use as input to our ML model.

In [2]:
import pandas as pd

df_train = pd.read_csv('../../DATA/NLP/EMBK-NLP-FINAL-TRAIN-CSV.csv')
df_test = pd.read_csv('../../DATA/NLP/EMBK-NLP-FINAL-TEST-CSV.csv')

Let's look at the first five rows in the datasets. As you can see, the target field __isPositive__ takes the values 0 and 1. That's why we will build a binary classification model.

In [3]:
df_train.head()

Unnamed: 0,reviewText,summary,verified,time,log_votes,isPositive
0,"PURCHASED FOR YOUNGSTER WHO\nINHERITED MY ""TOO...",IDEAL FOR BEGINNER!,True,1361836800,0.0,1.0
1,unable to open or use,Two Stars,True,1452643200,0.0,0.0
2,Waste of money!!! It wouldn't load to my system.,Dont buy it!,True,1433289600,0.0,0.0
3,I attempted to install this OS on two differen...,I attempted to install this OS on two differen...,True,1518912000,0.0,0.0
4,I've spent 14 fruitless hours over the past tw...,Do NOT Download.,True,1441929600,1.098612,0.0


In [4]:
df_test.head()

Unnamed: 0,reviewText,summary,verified,time,log_votes,isPositive
0,Kaspersky offers the best security for your co...,State of the art protection,True,1465516800,0.0,1.0
1,This Value was extremely discounted which I ap...,Quickbooks,True,1393632000,0.0,1.0
2,Some dufus probably got stock options by the t...,Sad,False,1228176000,2.639057,0.0
3,I have reviewed the software and it is beyond ...,Excellent product,True,1402531200,0.0,1.0
4,"Plain old simple you need Anti-Virus,I have tr...",A must have,True,1367539200,0.0,1.0


### 2. Exploratory Data Analysis and Missing Value Imputation

Let's look at the target distribution for our datasets.

In [5]:
df_train["isPositive"].value_counts()

1.0    43692
0.0    26308
Name: isPositive, dtype: int64

In [6]:
df_test["isPositive"].value_counts()

1.0    4980
0.0    3020
Name: isPositive, dtype: int64
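
The class counts above give a useful reference point: a model that always predicts the majority class already reaches the accuracy computed below. This is a minimal sketch that hard-codes the counts printed above; `majority_baseline` is a hypothetical helper name.

```python
def majority_baseline(counts):
    # Accuracy of always predicting the most frequent class
    return max(counts.values()) / sum(counts.values())

train_counts = {1.0: 43692, 0.0: 26308}
test_counts = {1.0: 4980, 0.0: 3020}

train_baseline = majority_baseline(train_counts)
test_baseline = majority_baseline(test_counts)
print(round(train_baseline, 4), round(test_baseline, 4))
```

Any model trained later is best compared against this number (roughly 62%) rather than against 50%.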

Checking the number of missing values:

In [7]:
print(df_train.isna().sum())

reviewText    11
summary       14
verified       0
time           0
log_votes      0
isPositive     0
dtype: int64


In [8]:
print(df_test.isna().sum())

reviewText    2
summary       1
verified      0
time          0
log_votes     0
isPositive    0
dtype: int64


We will only consider the reviewText field. Let's fill in its missing values below, using the placeholder "Missing".

In [9]:
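# Assign the result back instead of using inplace=True on a column selection,
# which may not modify the original DataFrame in newer pandas versions
df_train["reviewText"] = df_train["reviewText"].fillna("Missing")
df_test["reviewText"] = df_test["reviewText"].fillna("Missing")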

### 3. Text processing-cleaning
Next, we will clean the text. We will lowercase it and remove leading/trailing whitespace, extra spaces, and HTML tags. Recurrent neural networks usually __DON'T__ need text processing beyond simple text cleaning. Stemming and lemmatization can produce word forms that are not in our vocabulary, which would cause our model to skip those words completely.

In [10]:
# Some string preprocessing
def clean_str(text):
    text = text.lower().strip()  # Lowercase and remove leading/trailing whitespace
    text = re.sub(r'\s+', ' ', text)  # Collapse runs of spaces, tabs and newlines
    text = re.sub(r'<.*?>', '', text)  # Remove HTML tags/markup
    return text
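
To see what the cleaning step does, the sketch below applies a standalone copy of `clean_str` to a made-up review string (the sample text is an assumption for illustration only).

```python
import re

def clean_str(text):
    text = text.lower().strip()          # lowercase, trim both ends
    text = re.sub(r'\s+', ' ', text)     # collapse whitespace runs
    text = re.sub(r'<.*?>', '', text)    # drop HTML tags
    return text

sample = "  Great <b>product</b>,\n works   WELL!  "
cleaned = clean_str(sample)
print(cleaned)
```

Punctuation is left attached to the words, so "product," and "product" would count as different tokens when the text is later split on spaces.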

Next, we are going to process all of the words in the reviews, count the number of occurrences of each word, and then index the words in descending order of how often they occur. This is a necessary step to encode the words in the reviews as numbers that a machine can process.

In [11]:
# This builds a dictionary of the words and their counts in the reviews
# passed to create_count

word_counter = Counter()
def create_count(sentiments):
    for line in sentiments:
        for word in clean_str(line).split():
            # A Counter returns 0 for unseen words, so we can increment directly
            word_counter[word] += 1

# This assigns a unique number to each word (in descending order of
# frequency) and returns a word_dict. Index 0 is left free for padding.

def create_word_index():
    word_dict = {}
    for idx, (word, _) in enumerate(word_counter.most_common(), start=1):
        word_dict[word] = idx
    return word_dict

# Here we build the word dictionary from the training reviews only
create_count(df_train["reviewText"].tolist())
word_dict = create_word_index()

# This creates a reverse index from a number to the word
idx2word = {v: k for k, v in word_dict.items()}
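
The frequency-ranked indexing can be checked on a toy corpus. This is a minimal sketch with made-up sentences; the `toy_` names are hypothetical and only mirror the structure of the code above.

```python
from collections import Counter

toy_reviews = ["the cat sat", "the cat ran", "the dog"]

toy_counter = Counter()
for line in toy_reviews:
    toy_counter.update(line.split())

# Rank words by frequency; index 0 stays free for padding
toy_word_dict = {word: idx for idx, (word, _) in
                 enumerate(toy_counter.most_common(), start=1)}
toy_idx2word = {idx: word for word, idx in toy_word_dict.items()}
print(toy_word_dict)
```

The most frequent word ("the") receives index 1 and the next most frequent ("cat") index 2, so small indices always denote common words. This property is what makes capping the vocabulary by index meaningful later on.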

Next we create a set of helper functions that (1) encode words into a sequence of numbers, (2) decode a sequence of numbers back into words, and (3) truncate and pad the input data to ensure they are of equal length and thereby enable easier processing.  

In [12]:
# This helper function creates encoded sentences by replacing each word in
# the input text with its unique id from word_dict (unknown words are dropped)
def encoded_sentences(input_file,word_dict):
    output_string = []
    for line in input_file:
        output_line = []
        for word in (clean_str(line)).split():
            if word in word_dict:
                output_line.append(word_dict[word])
        output_string.append(output_line)
    return output_string

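# This helper function decodes encoded sentences back into words.
# It relies on the global idx2word reverse index; indices without an
# entry (such as the padding value 0) are skipped.
def decode_sentences(input_file,word_dict):
    output_string = []
    for line in input_file:
        output_line = ''
        for idx in line:
            if idx in idx2word:
                output_line += idx2word[idx] + ' '
        output_string.append(output_line)
    return output_string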

# This helper function pads the sequences to maxlen.
# If the sentence is longer than maxlen, it truncates the sentence.
# If the sentence is shorter than maxlen, it pads with the given value (default 0).
def pad_sequences(sentences,maxlen=50,value=0):
    padded_sentences = []
    for sen in sentences:
        new_sentence = []
        if(len(sen) > maxlen):
            new_sentence = sen[:maxlen]
            padded_sentences.append(new_sentence)
        else:
            num_padding = maxlen - len(sen)
            new_sentence = np.append(sen,[value] * num_padding)
            padded_sentences.append(new_sentence)
    return padded_sentences
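
The encode, truncate and pad steps can be traced on a toy vocabulary. The sketch below uses plain Python lists and hypothetical helper names (`encode`, `pad_or_truncate`); it follows the same rules as the helpers above but is not the notebook's implementation.

```python
def encode(sentence, word_dict):
    # Words missing from word_dict are dropped, as in encoded_sentences
    return [word_dict[w] for w in sentence.split() if w in word_dict]

def pad_or_truncate(seq, maxlen, value=0):
    # Cut long sequences, right-pad short ones with `value`
    return seq[:maxlen] + [value] * max(0, maxlen - len(seq))

toy_word_dict = {"the": 1, "cat": 2, "sat": 3}
short = pad_or_truncate(encode("the cat sat", toy_word_dict), maxlen=5)
long_ = pad_or_truncate(encode("the cat sat the cat sat the", toy_word_dict), maxlen=5)
unknown = pad_or_truncate(encode("the purple cat", toy_word_dict), maxlen=5)
print(short, long_, unknown)
```

Every sequence ends up with exactly `maxlen` entries, which is what allows the reviews to be stacked into a single rectangular array.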

Next we are going to encode all of the reviewText entries using the word dictionary we created. In addition, we are going to cap the size of the tracked vocabulary, meaning any word outside of the tracked range will be encoded with the last position. This is a performance versus accuracy trade-off: a larger tracked vocabulary tends to give better accuracy but requires a longer training process.

In [13]:
# Let's encode sentences
encoded_texts = encoded_sentences(df_train["reviewText"].tolist(), word_dict)

#Here we set the total num of words to be tracked
vocab_size = 5000 

#Any word outside of the tracked range will be encoded with last position.
t_data = [np.array([i if i<(vocab_size-1) else (vocab_size-1) for i in s]) for s in encoded_texts]
all_labels = df_train["isPositive"].tolist()

In [14]:
# Let's print the first sentence
# Words outside the tracked range of 5000 words are encoded as 4999

t_data[0]

array([ 165,    9, 4999,  133, 4999,   13, 4999,  353,    9, 4999,  960,
       3758,    9,  257,    5, 1141,   71, 4999,  457,  926,    7,  668,
          8,    5,  545,   21,   19,   10, 4999])
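
The capping rule can be written compactly with `min`, which gives the same result as the conditional used for `t_data`. This is a minimal sketch with a tiny, made-up vocabulary size; `cap_indices` is a hypothetical helper name.

```python
def cap_indices(indices, vocab_size):
    # Any index at or beyond vocab_size - 1 collapses into the last slot
    return [min(i, vocab_size - 1) for i in indices]

toy_vocab_size = 5
capped = cap_indices([1, 3, 4, 7, 120], toy_vocab_size)
print(capped)
```

Note that the word whose own rank equals `vocab_size - 1` shares its id with all out-of-range words, so the last slot acts as a combined "rare word" bucket.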

### 4. Using pre-trained GloVe Word Embeddings:

In this example, we will use GloVe word vectors. The following code shows how to get the word vectors and create an embedding dictionary with them. The dictionary maps the words to their word vectors. The file is downloaded from here: https://nlp.stanford.edu/projects/glove/

In [15]:
# Download the zip file - WARNING: THIS TAKES A WHILE!
! wget http://nlp.stanford.edu/data/glove.6B.zip

--2020-02-12 04:08:39--  http://nlp.stanford.edu/data/glove.6B.zip
Resolving nlp.stanford.edu (nlp.stanford.edu)... 171.64.67.140
Connecting to nlp.stanford.edu (nlp.stanford.edu)|171.64.67.140|:80... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://nlp.stanford.edu/data/glove.6B.zip [following]
--2020-02-12 04:08:39--  https://nlp.stanford.edu/data/glove.6B.zip
Connecting to nlp.stanford.edu (nlp.stanford.edu)|171.64.67.140|:443... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: http://downloads.cs.stanford.edu/nlp/data/glove.6B.zip [following]
--2020-02-12 04:08:39--  http://downloads.cs.stanford.edu/nlp/data/glove.6B.zip
Resolving downloads.cs.stanford.edu (downloads.cs.stanford.edu)... 171.64.64.22
Connecting to downloads.cs.stanford.edu (downloads.cs.stanford.edu)|171.64.64.22|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 862182613 (822M) [application/zip]
Saving to: ‘glove.6B.zip’



In [16]:
# Unzipping
! unzip glove.6B.zip

# Deleting the zip file
! rm glove.6B.zip

Archive:  glove.6B.zip
  inflating: glove.6B.50d.txt        
  inflating: glove.6B.100d.txt       
  inflating: glove.6B.200d.txt       
  inflating: glove.6B.300d.txt       


Below, we first create a mapper for the word vectors (word -> vector) with the __load_glove_index()__ function. Then, __create_emb()__ creates an embedding matrix in which each row corresponds to a word. For our vocabulary size of 5000 and a word vector dimension of 300, this gives a matrix of 5000 rows and 300 columns.

In [17]:
# We downloaded the 300 dimension word vectors
num_embed = 300 

def load_glove_index(loc):
    embeddings_index = {}
    # Each line holds a word followed by its vector components
    with open(loc, encoding="utf8") as f:
        for line in f:
            values = line.split()
            word = values[0]
            coefs = np.asarray(values[1:], dtype='float32')
            embeddings_index[word] = coefs
    return embeddings_index

def create_emb():
    embedding_matrix = np.zeros((vocab_size, num_embed))
    for word, i in word_dict.items():
        if i >= vocab_size:
            continue
        embedding_vector = embeddings_index.get(word)
        if embedding_vector is not None:
            embedding_matrix[i] = embedding_vector
    embedding_matrix = nd.array(embedding_matrix)
    return embedding_matrix

embeddings_index = load_glove_index('glove.6B.300d.txt')
embedding_matrix = create_emb()

In [18]:
embedding_matrix.shape

(5000, 300)
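
The construction of the embedding matrix can be reproduced at toy scale without downloading anything. The sketch below parses two made-up lines in the GloVe text layout and fills a small matrix; all `toy_` names, the 3-dimensional vectors and the word "zzyzx" are assumptions for illustration.

```python
import io
import numpy as np

# Made-up 3-dimensional vectors in GloVe's text layout: word v1 v2 v3
toy_glove_file = io.StringIO("the 0.1 0.2 0.3\ncat 0.4 0.5 0.6\n")

toy_index = {}
for line in toy_glove_file:
    values = line.split()
    toy_index[values[0]] = np.asarray(values[1:], dtype="float32")

toy_word_dict = {"the": 1, "cat": 2, "zzyzx": 3}   # hypothetical vocabulary
toy_vocab_size, toy_num_embed = 4, 3

toy_matrix = np.zeros((toy_vocab_size, toy_num_embed), dtype="float32")
for word, i in toy_word_dict.items():
    vector = toy_index.get(word)
    if i < toy_vocab_size and vector is not None:
        toy_matrix[i] = vector

print(toy_matrix)
```

Row 0 (the padding index) and the row of any word without a pre-trained vector stay all zeros, which is also what happens in the full 5000 x 300 matrix.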

Next we prepare the review texts to be fed into the deep learning model by (1) reserving 15% of the training data as a validation dataset, (2) padding and truncating the data to a length of 50 words, and (3) converting the encoded text into MXNet's NDArray format.

In [19]:
# This separates 15% of the training data into a validation dataset.
X_train, X_val, y_train, y_val = train_test_split(t_data, all_labels, test_size=0.15, random_state=42)
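
To make the split sizes concrete, here is a standard-library stand-in for a shuffled holdout split. It is only a sketch: `holdout_split` is a hypothetical helper, it does not reproduce scikit-learn's shuffle order, and the 70,000 stand-in items simply match the number of training rows shown in the class counts earlier.

```python
import random

def holdout_split(items, labels, val_fraction, seed):
    # Shuffle indices reproducibly, then cut off the validation share
    indices = list(range(len(items)))
    random.Random(seed).shuffle(indices)
    n_val = int(round(len(items) * val_fraction))
    val_idx, train_idx = indices[:n_val], indices[n_val:]

    def pick(seq, idx):
        return [seq[i] for i in idx]

    return (pick(items, train_idx), pick(items, val_idx),
            pick(labels, train_idx), pick(labels, val_idx))

toy_items = list(range(70000))          # stand-ins for encoded reviews
toy_labels = [i % 2 for i in toy_items]
tr_x, va_x, tr_y, va_y = holdout_split(toy_items, toy_labels, 0.15, seed=42)
print(len(tr_x), len(va_x))
```

With 70,000 rows and a 15% holdout, 59,500 reviews remain for training and 10,500 are used for validation.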

In [20]:
# This sets the max word length of each review text
seq_len = 50

#Below we pad the reviews and convert them to MXNet's NDArray format
X_train = nd.array(pad_sequences(X_train, maxlen=seq_len, value=0))
y_train = nd.array(y_train)
X_val = nd.array(pad_sequences(X_val, maxlen=seq_len, value=0))
y_val = nd.array(y_val)

We will set our hyperparameters below.

In [21]:
num_hidden = 64
learning_rate = .001
epochs = 10
batch_size = 16

In [22]:
train_arraydataset = mx.gluon.data.ArrayDataset(X_train, y_train)
train_loader = mx.gluon.data.DataLoader(train_arraydataset, batch_size=batch_size, shuffle=False, last_batch='keep')

val_arraydataset = mx.gluon.data.ArrayDataset(X_val, y_val)
val_loader = mx.gluon.data.DataLoader(val_arraydataset, batch_size=batch_size, shuffle=False, last_batch='keep')
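
What the loaders produce can be pictured with a plain Python generator. This is a sketch of sequential mini-batching in which the final, smaller batch is kept (the behaviour requested by `last_batch='keep'`); `iter_batches` is a hypothetical name, and the figure of 59,500 training rows assumes 85% of the 70,000 training reviews.

```python
import math

def iter_batches(items, batch_size):
    # Yield consecutive slices; the final one may be smaller
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

toy_batches = list(iter_batches(list(range(50)), batch_size=16))
batch_sizes = [len(b) for b in toy_batches]

# Batches per epoch for 59,500 training rows at batch size 16
train_batches_per_epoch = math.ceil(59500 / 16)
print(batch_sizes, train_batches_per_epoch)
```

With a batch size of 16, each epoch therefore performs 3,719 parameter updates, the last of them on a partial batch.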

We will be using an RNN model with 64 hidden units. Let's build it with Gluon's Sequential container.

In [23]:
context = mx.cpu()

model = nn.Sequential()
model.add(mx.gluon.nn.Embedding(vocab_size, num_embed),      # Embedding layer
          mx.gluon.rnn.RNN(num_hidden, layout = 'NTC'),      # Recurrent layer
          mx.gluon.nn.Dense(2)                               # Output layer
         )

model.collect_params().initialize(mx.init.Xavier(), ctx=context)
# Load the pre-trained GloVe vectors into the embedding layer and freeze them
model[0].weight.set_data(embedding_matrix.as_in_context(context))
model[0].collect_params().setattr('grad_req', 'null')
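
To follow the tensor shapes through the three layers, the NumPy sketch below mimics the forward pass with random weights. It is not Gluon's implementation: it assumes a ReLU recurrence and assumes that the dense layer flattens the per-time-step outputs into one vector per review before the final projection. All variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
N, T, V, E, H = 4, 50, 5000, 300, 64   # batch, seq_len, vocab, embed, hidden

tokens = rng.integers(0, V, size=(N, T))           # (N, T) word indices
emb_table = rng.normal(size=(V, E)) * 0.01
W_x = rng.normal(size=(E, H)) * 0.01
W_h = rng.normal(size=(H, H)) * 0.01
b = np.zeros(H)

embedded = emb_table[tokens]                       # (N, T, E) lookup
h = np.zeros((N, H))
states = []
for t in range(T):
    # ReLU recurrence, assumed here as the activation
    h = np.maximum(0.0, embedded[:, t, :] @ W_x + h @ W_h + b)
    states.append(h)
rnn_out = np.stack(states, axis=1)                 # (N, T, H)

flat = rnn_out.reshape(N, -1)                      # (N, T * H)
W_out = rng.normal(size=(T * H, 2)) * 0.01
logits = flat @ W_out                              # (N, 2)
print(embedded.shape, rnn_out.shape, flat.shape, logits.shape)
```

Under these assumptions, each review yields two logits, one per class, computed from all 50 hidden states rather than only the last one.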

Before we execute the training loop, we need to define a function that will calculate the accuracy metric for the model.

In [24]:
def evaluate_accuracy(model,loader):
    correct = 0
    total = 0
    
    for _, (data, target) in enumerate(loader):
        
        data = data.as_in_context(context)
        target = target.as_in_context(context)
        
        output = model(data)
        predictions = nd.argmax(output, axis=1)
        correct+=np.sum(predictions.asnumpy()==target.asnumpy())
        total+=data.shape[0]
    
    return float(correct/total)
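
The core of the accuracy computation (argmax over the two logits, then comparison with the labels) can be checked on a handful of made-up values. This NumPy sketch uses hypothetical `toy_` names and is independent of MXNet.

```python
import numpy as np

toy_logits = np.array([[2.0, 0.5],    # predicts class 0
                       [0.1, 1.2],    # predicts class 1
                       [0.3, 0.9],    # predicts class 1
                       [1.5, 1.0]])   # predicts class 0
toy_targets = np.array([0, 1, 0, 0])

toy_predictions = toy_logits.argmax(axis=1)
toy_accuracy = float((toy_predictions == toy_targets).mean())
print(toy_predictions, toy_accuracy)
```

Three of the four predictions match their targets, so the accuracy is 0.75.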

Let's start the training process below. We will print accuracy score after each epoch.

In [25]:

trainer = gluon.Trainer(model.collect_params(), 'sgd',
                        {'learning_rate': learning_rate})

softmax_cross_entropy = gluon.loss.SoftmaxCrossEntropyLoss() 

for epoch in range(epochs):
    # Training loop, train the network
    for data, target in train_loader:

        data = data.as_in_context(context)
        target = target.as_in_context(context)

        # Record the forward pass so gradients can be computed
        with autograd.record():
            L = softmax_cross_entropy(model(data), target)
        L.backward()
        # Normalize the gradient by the batch size and update the weights
        trainer.step(data.shape[0])
    
    # Calculating training and validation accuracy
    val_accuracy = evaluate_accuracy(model, val_loader)
    train_accuracy = evaluate_accuracy(model, train_loader)
    
    print("Epoch %s. Train_acc %f Validation_acc %f" % (epoch, train_accuracy, val_accuracy))

Epoch 0. Train_acc 0.681815 Validation_acc 0.670762
Epoch 1. Train_acc 0.730840 Validation_acc 0.720667
Epoch 2. Train_acc 0.748739 Validation_acc 0.736571
Epoch 3. Train_acc 0.762874 Validation_acc 0.749714
Epoch 4. Train_acc 0.773092 Validation_acc 0.762571
Epoch 5. Train_acc 0.780639 Validation_acc 0.768857
Epoch 6. Train_acc 0.786000 Validation_acc 0.775238
Epoch 7. Train_acc 0.791412 Validation_acc 0.777143
Epoch 8. Train_acc 0.795899 Validation_acc 0.782762
Epoch 9. Train_acc 0.800723 Validation_acc 0.785333
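
For readers who want to see what one training step computes, the NumPy sketch below evaluates a softmax cross-entropy loss on two made-up examples and applies a single SGD update. As a simplification, the logits themselves are treated as the trainable parameters; the function and variable names are hypothetical and this is not MXNet's implementation.

```python
import numpy as np

def softmax(logits):
    # Subtract the row max for numerical stability
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

def softmax_cross_entropy(logits, targets):
    # Per-example loss: -log(probability of the true class)
    probs = softmax(logits)
    return -np.log(probs[np.arange(len(targets)), targets])

toy_logits = np.array([[2.0, 0.0], [0.0, 0.0]])
toy_targets = np.array([0, 1])
losses = softmax_cross_entropy(toy_logits, toy_targets)

# Gradient of the summed loss w.r.t. the logits: softmax minus one-hot
grad = softmax(toy_logits)
grad[np.arange(len(toy_targets)), toy_targets] -= 1.0

# One SGD step, with the gradient averaged over the batch
toy_lr = 0.001
updated_logits = toy_logits - toy_lr * grad / len(toy_targets)
new_losses = softmax_cross_entropy(updated_logits, toy_targets)
print(losses, new_losses)
```

The second example starts with equal logits, so its loss is log(2); after the update both losses are slightly lower, which is the per-step effect that accumulates into the rising accuracy figures printed above.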


In [26]:
# Deleting the unzipped files
! rm glove.6B.50d.txt
! rm glove.6B.100d.txt
! rm glove.6B.200d.txt
! rm glove.6B.300d.txt
