## Final Project Day 3 Solution: Use Recurrent Neural Networks (RNNs), LSTMs or Transformers for a Classification Task

We continue working with the final project dataset to see how well Recurrent Neural Networks (RNNs), Long Short-Term Memory networks (LSTMs), and Transformers predict the __isPositive__ field of the dataset.

* We are giving you two pieces of code to read your training and test datasets.
* Use the notebooks from the class and implement the model, train and test with the corresponding datasets.

Overall dataset schema:
* __reviewText:__ Text of the review
* __summary:__ Summary of the review
* __verified:__ Whether the purchase was verified (True or False)
* __time:__ UNIX timestamp for the review
* __log_votes:__ Logarithm-adjusted votes log(1+votes)
* __isPositive:__ Whether the review is positive (1) or negative (0); this is the target field
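
As a quick illustration of the __log_votes__ definition, the transform log(1+votes) can be inverted to recover the raw vote count. The snippet below is a minimal sketch using only the Python standard library; the helper names `to_log_votes` and `from_log_votes` are ours and are not part of the dataset tooling.

```python
import math

def to_log_votes(votes):
    # log(1 + votes), the transform described in the schema
    return math.log1p(votes)

def from_log_votes(log_votes):
    # Inverse transform: votes = exp(log_votes) - 1
    return math.expm1(log_votes)

print(round(to_log_votes(2), 6))        # 1.098612
print(round(from_log_votes(2.639057)))  # 13
```

Under this reading, a review with `log_votes` of 0.0 simply received no votes.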

In [1]:
# Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved
# SPDX-License-Identifier: MIT-0

! pip install -q gluonnlp mxnet


Importing libraries:

In [2]:
import re
import numpy as np
import mxnet as mx
from mxnet import gluon, nd, autograd
from mxnet.gluon import nn, rnn, Trainer
from mxnet.gluon.loss import SigmoidBinaryCrossEntropyLoss
from sklearn.model_selection import train_test_split

### 1. Reading the dataset

Let's read the datasets below. We will use the reviewText field as the input to our ML model, after filling in its missing values.

In [3]:
import pandas as pd

df_train = pd.read_csv('../../DATA/NLP/EMBK-NLP-FINAL-TRAIN-CSV.csv')
df_test = pd.read_csv('../../DATA/NLP/EMBK-NLP-FINAL-TEST-CSV.csv')

Let's look at the first five rows of each dataset. The target field, __isPositive__, takes the values 0 and 1, so we will build a binary classification model.

In [4]:
df_train.head()

Unnamed: 0,reviewText,summary,verified,time,log_votes,isPositive
0,"PURCHASED FOR YOUNGSTER WHO\nINHERITED MY ""TOO...",IDEAL FOR BEGINNER!,True,1361836800,0.0,1.0
1,unable to open or use,Two Stars,True,1452643200,0.0,0.0
2,Waste of money!!! It wouldn't load to my system.,Dont buy it!,True,1433289600,0.0,0.0
3,I attempted to install this OS on two differen...,I attempted to install this OS on two differen...,True,1518912000,0.0,0.0
4,I've spent 14 fruitless hours over the past tw...,Do NOT Download.,True,1441929600,1.098612,0.0


In [5]:
df_test.head()

Unnamed: 0,reviewText,summary,verified,time,log_votes,isPositive
0,Kaspersky offers the best security for your co...,State of the art protection,True,1465516800,0.0,1.0
1,This Value was extremely discounted which I ap...,Quickbooks,True,1393632000,0.0,1.0
2,Some dufus probably got stock options by the t...,Sad,False,1228176000,2.639057,0.0
3,I have reviewed the software and it is beyond ...,Excellent product,True,1402531200,0.0,1.0
4,"Plain old simple you need Anti-Virus,I have tr...",A must have,True,1367539200,0.0,1.0


### 2. Exploratory Data Analysis and Missing Value Imputation

Let's look at the target distribution for our datasets.

In [6]:
df_train["isPositive"].value_counts()

1.0    43692
0.0    26308
Name: isPositive, dtype: int64
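
The counts above show that the classes are moderately imbalanced. A minimal sketch of what that implies for a trivial baseline is given below; the counts are copied from the output above, and the variable names are ours.

```python
# Class counts copied from the value_counts() output above
counts = {1.0: 43692, 0.0: 26308}
total = sum(counts.values())

# Fraction of each class in the training data
fractions = {label: n / total for label, n in counts.items()}

# Training-set accuracy of a trivial model that always predicts the majority class
majority_baseline = max(fractions.values())

print(total)                        # 70000
print(round(majority_baseline, 4))  # 0.6242
```

Any trained model should be compared against a baseline of this kind, not against 50%.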

Checking the number of missing values:

In [7]:
print(df_train.isna().sum())

reviewText    11
summary       14
verified       0
time           0
log_votes      0
isPositive     0
dtype: int64


In [8]:
print(df_test.isna().sum())

reviewText    2
summary       1
verified      0
time          0
log_votes     0
isPositive    0
dtype: int64


We will only consider the reviewText field. Let's fill in its missing values below, using the placeholder "Missing".

In [9]:
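# Assign the result back instead of using inplace=True on a column selection,
# which may not update the DataFrame in recent pandas versions
df_train["reviewText"] = df_train["reviewText"].fillna("Missing")
df_test["reviewText"] = df_test["reviewText"].fillna("Missing")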

### 3. Train-validation split

Let's split the training data into training and validation sets.

In [10]:
# Hold out 15% of the training data as the validation set.
train_text, val_text, train_label, val_label = \
    train_test_split(df_train["reviewText"].tolist(),
                     df_train["isPositive"].tolist(),
                     test_size=0.15,
                     random_state=42)
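
To make the idea of a seeded hold-out split concrete, here is a minimal standard-library sketch. It is not how `train_test_split` is implemented and will not reproduce its exact shuffling; the function `split_train_val` and the toy data are hypothetical.

```python
import random

def split_train_val(items, labels, val_fraction, seed):
    # Shuffle indices reproducibly, then carve off the validation share
    indices = list(range(len(items)))
    random.Random(seed).shuffle(indices)
    n_val = round(len(items) * val_fraction)
    val_idx, train_idx = indices[:n_val], indices[n_val:]
    pick = lambda seq, idx: [seq[i] for i in idx]
    return (pick(items, train_idx), pick(items, val_idx),
            pick(labels, train_idx), pick(labels, val_idx))

texts = [f"review {i}" for i in range(20)]
labels = [i % 2 for i in range(20)]
tr_x, va_x, tr_y, va_y = split_train_val(texts, labels, 0.15, seed=42)
print(len(tr_x), len(va_x))  # 17 3
```

Fixing the seed keeps the split identical between runs, which makes validation losses comparable across experiments.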

### 4. Text processing and Transformation
We will apply the following processes here:
* __Text cleaning:__ Simple text cleaning operations. We won't do stemming or lemmatization, as our word vectors already cover different forms of words. We are using GloVe word embeddings trained on a corpus of about 6 billion tokens in this example.
* __Tokenization:__ Tokenizing all sentences
* __Creating vocabulary:__ We will create a vocabulary of the tokens. In this vocabulary, tokens will map to unique ids, such as "car"->32, "house"->651, etc.
* __Transforming text:__ Tokenized sentences will be mapped to unique ids. For example: ["this", "is", "sentence"] -> [13, 54, 412].
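
The four steps above can be sketched end to end with the standard library alone. This is an illustration under simplifying assumptions: `simple_tokenize` uses a regular expression as a stand-in for nltk's `word_tokenize` (it is not equivalent), and the vocabulary is a plain dictionary with id 0 reserved for unknown tokens.

```python
import re
from collections import Counter

def simple_tokenize(text):
    # Lowercase, collapse whitespace, then split into word and punctuation tokens.
    # This regex is a stand-in for nltk's word_tokenize, not an equivalent.
    text = re.sub(r"\s+", " ", text.lower().strip())
    return re.findall(r"\w+|[^\w\s]", text)

texts = ["This is a sentence.", "This is   another sentence!"]
tokenized = [simple_tokenize(t) for t in texts]

# Reserve id 0 for unknown tokens, then number the rest by descending frequency
counts = Counter(tok for toks in tokenized for tok in toks)
token_to_idx = {"<unk>": 0}
for token, _ in counts.most_common():
    token_to_idx[token] = len(token_to_idx)

# Map a new sentence to ids; unseen tokens fall back to 0
ids = [token_to_idx.get(tok, 0) for tok in simple_tokenize("This is new")]
print(tokenized[1])  # ['this', 'is', 'another', 'sentence', '!']
print(ids)           # [1, 2, 0]
```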

In [11]:
import nltk, gluonnlp
from nltk.tokenize import word_tokenize

nltk.download('punkt')

def cleanStr(text):
    # Lowercase and remove leading/trailing whitespace
    text = text.lower().strip()
    # Collapse runs of whitespace (spaces, tabs, newlines) into one space
    text = re.sub(r'\s+', ' ', text)
    # Remove HTML tags/markup
    text = re.sub(r'<.*?>', '', text)
    return text

def tokenize(text):
    # Clean the text, then split it into word and punctuation tokens
    return word_tokenize(cleanStr(text))

def createVocabulary(text_list, min_freq):
    all_tokens = []
    for sentence in text_list:
        all_tokens += tokenize(sentence)
    # Calculate token frequencies
    counter = gluonnlp.data.count_tokens(all_tokens)
    # Create the vocabulary
    vocab = gluonnlp.Vocab(counter,
                           min_freq = min_freq,
                           unknown_token = '<unk>',
                           padding_token = None,
                           bos_token = None,
                           eos_token = None)
    
    return vocab

def transformText(text, vocab, max_length):
    token_arr = np.zeros((max_length,))
    tokens = tokenize(text)[0:max_length]
    for idx, token in enumerate(tokens):
        try:
            # Use the vocabulary index of the token
            token_arr[idx] = vocab.token_to_idx[token]
        except KeyError:
            token_arr[idx] = 0 # Unknown word
    return token_arr

[nltk_data] Downloading package punkt to /home/ec2-user/nltk_data...
[nltk_data]   Package punkt is already up-to-date!



To keep the training time low, we only consider the first 125 tokens (max_length) of each review. We also only keep tokens that occur at least 5 times across all training reviews (min_freq); rarer tokens are treated as unknown.
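
The effect of these two settings can be shown on toy data. The sketch below assumes that `min_freq` is an inclusive lower bound (our reading of the gluonnlp parameter) and that shorter sequences are right-padded with 0; the helper names are hypothetical.

```python
from collections import Counter

def pad_or_truncate(ids, max_length, pad_id=0):
    # Keep at most max_length ids, then right-pad with pad_id
    ids = ids[:max_length]
    return ids + [pad_id] * (max_length - len(ids))

def frequent_tokens(tokens, min_freq):
    # Keep tokens whose count is at least min_freq
    counts = Counter(tokens)
    return {tok for tok, n in counts.items() if n >= min_freq}

print(pad_or_truncate([5, 8, 2], 5))           # [5, 8, 2, 0, 0]
print(pad_or_truncate([5, 8, 2, 9, 4, 7], 5))  # [5, 8, 2, 9, 4]
print(sorted(frequent_tokens(["a", "a", "b", "a", "b", "c"], 2)))  # ['a', 'b']
```

Every review ends up with the same length, which is what allows the transformed texts to be stacked into a single array.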

In [12]:
min_freq = 5
max_length = 125

print("Creating the vocabulary")
vocab = createVocabulary(train_text, min_freq)
print("Transforming training texts")
train_text_transformed = nd.array([transformText(text, vocab, max_length) for text in train_text])
print("Transforming validation texts")
val_text_transformed = nd.array([transformText(text, vocab, max_length) for text in val_text])

Creating the vocabulary
Transforming training texts
Transforming validation texts


Let's look at the ids assigned to a few words.

In [13]:
print("Vocabulary index for computer:", vocab['computer'])
print("Vocabulary index for beautiful:", vocab['beautiful'])
print("Vocabulary index for code:", vocab['code'])

Vocabulary index for computer: 67
Vocabulary index for beautiful: 1923
Vocabulary index for code: 395


### 5. Using pre-trained GloVe Word Embeddings

In this example, we will use GloVe word vectors. The `'glove.6B.50d.txt'` file contains vectors trained on a corpus of about 6 billion tokens, and each word vector has 50 dimensions. The following code shows how to get the word vectors and create an embedding matrix from them. We connect our vocabulary indexes to the GloVe embeddings with the `get_vecs_by_tokens()` function.

In [14]:
from mxnet.contrib import text
glove = text.embedding.create('glove',
                              pretrained_file_name = 'glove.6B.50d.txt')
embedding_matrix = glove.get_vecs_by_tokens(vocab.idx_to_token)
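
Conceptually, the embedding matrix has one row per vocabulary index, holding the pretrained vector of the token at that index. The toy sketch below uses plain Python lists and made-up 3-dimensional vectors; it assumes that tokens missing from the pretrained set receive a zero vector, which is our understanding of the default behaviour and should be checked against the MXNet documentation.

```python
# Toy 3-dimensional "pretrained" vectors; the real GloVe file has 50 dimensions
pretrained = {
    "computer": [0.1, 0.2, 0.3],
    "code": [0.4, 0.5, 0.6],
}
embed_dim = 3

# Index-to-token list, in the same order as the vocabulary ids
idx_to_token = ["<unk>", "computer", "zzyzx", "code"]

# Row i of the matrix is the vector for token i; missing tokens get zeros
embedding_matrix = [pretrained.get(tok, [0.0] * embed_dim) for tok in idx_to_token]

print(len(embedding_matrix), len(embedding_matrix[0]))  # 4 3
print(embedding_matrix[2])                              # [0.0, 0.0, 0.0]
```

Because the rows follow the vocabulary order, a token id can be used directly as a row index by the embedding layer.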

### 6. Training and validation

We have processed our text data and created our embedding matrix from GloVe. Now it is time to start the training process.

We will set our parameters below.

In [15]:
# Size of the state vectors
hidden_size = 12

# General NN training parameters
learning_rate = 0.01
epochs = 15
batch_size = 32

# Embedding vector and vocabulary sizes
num_embed = 50
vocab_size = len(vocab.token_to_idx.keys())

We need to put our data into the correct format before training.

In [16]:
from mxnet.gluon.data import ArrayDataset, DataLoader

train_label = nd.array(train_label)
val_label = nd.array(val_label)

train_dataset = ArrayDataset(train_text_transformed, train_label)
train_loader = DataLoader(train_dataset, batch_size=batch_size)

Our sequential model is made of these layers:
* Embedding layer: This is where our words/tokens are mapped to word vectors.
* RNN layer: We will be using a simple RNN model. We won't stack RNN layers in this example; the model uses a single RNN layer with a hidden state size of 12. More details about the RNN are available [here](https://mxnet.incubator.apache.org/api/python/docs/api/gluon/rnn/index.html#mxnet.gluon.rnn.RNN).
* Dense layer: A dense layer with a single neuron is used for the output.
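
To make the recurrent layer less abstract, the sketch below implements one step of a simple (Elman) RNN in plain Python: the new hidden state is an activation applied to a weighted sum of the current input and the previous hidden state. It assumes a ReLU activation (which we believe is the Gluon default for `rnn.RNN`) and uses tiny hand-picked weights; it illustrates the recurrence only and is not the library's implementation.

```python
def relu(x):
    return max(0.0, x)

def rnn_step(x_t, h_prev, W_xh, W_hh, b):
    # h_t[j] = relu(sum_i W_xh[j][i] * x_t[i] + sum_k W_hh[j][k] * h_prev[k] + b[j])
    h_t = []
    for j in range(len(b)):
        total = b[j]
        total += sum(w * x for w, x in zip(W_xh[j], x_t))
        total += sum(w * h for w, h in zip(W_hh[j], h_prev))
        h_t.append(relu(total))
    return h_t

# Toy sizes: 2-dimensional inputs, hidden state of size 2
W_xh = [[1.0, 0.0], [0.0, 1.0]]
W_hh = [[0.5, 0.0], [0.0, 0.5]]
b = [0.0, 0.0]

h = [0.0, 0.0]
for x_t in [[1.0, 2.0], [3.0, -4.0]]:
    h = rnn_step(x_t, h, W_xh, W_hh, b)

print(h)  # [3.5, 0.0]
```

In the notebook's model the inputs are 50-dimensional word vectors and the hidden state has 12 entries, but the update has the same form.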

In [17]:
context = mx.cpu()

model = nn.Sequential()
model.add(nn.Embedding(vocab_size, num_embed), # Embedding layer
          rnn.RNN(hidden_size),                # Recurrent layer
          nn.Dense(1, activation="sigmoid"))   # Output layer

Let's initialize this network. Then, we will need to make the embedding layer use our GloVe word vectors.

In [18]:
# Initialize networks parameters
model.collect_params().initialize(mx.init.Xavier(), ctx=context)

# We set the embedding layer's parameters with our embedding matrix (from GloVe)
model[0].weight.set_data(embedding_matrix.as_in_context(context))
# We won't change/train the embedding layer
model[0].collect_params().setattr('grad_req', 'null')

We will define the trainer and loss function below. __Binary cross-entropy loss__ is used as this is a binary classification problem.
$$
\mathrm{BinaryCrossEntropyLoss} = -\frac{1}{n}\sum_{examples}{(y\log(p) + (1 - y)\log(1 - p))}
$$
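
As a worked instance of the formula above, the following standard-library sketch computes the loss for three hypothetical examples. It does not guard against probabilities of exactly 0 or 1, which a production implementation would need to handle.

```python
import math

def binary_cross_entropy(labels, probs):
    # Mean of -(y*log(p) + (1-y)*log(1-p)) over all examples
    total = 0.0
    for y, p in zip(labels, probs):
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(labels)

labels = [1, 0, 1]
probs = [0.9, 0.2, 0.6]
print(round(binary_cross_entropy(labels, probs), 4))  # 0.2798
```

The third example contributes the most to the loss, because its predicted probability (0.6) is the furthest from its label.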

In [19]:
# Setting our trainer
trainer = Trainer(model.collect_params(),
                  'sgd',
                  {'learning_rate': learning_rate})

# We will use Sigmoid Binary Cross-entropy loss
binary_cross_entropy_loss = SigmoidBinaryCrossEntropyLoss(from_sigmoid=True) 

Now, it is time to start the training process. We will print the Binary cross-entropy loss after each epoch.

In [20]:
import time
for epoch in range(epochs):
    start = time.time()
    training_loss = 0
    # Training loop, train the network
    for idx, (data, target) in enumerate(train_loader):

        data = data.as_in_context(context)
        target = target.as_in_context(context)
        
        with autograd.record():
            output = model(data)
            
            L = binary_cross_entropy_loss(output, target)
            training_loss += nd.sum(L).asscalar()
            L.backward()
        trainer.step(data.shape[0])
    
    # Calculate validation loss
    val_predictions = model(val_text_transformed.as_in_context(context))
    val_loss = nd.sum(binary_cross_entropy_loss(val_predictions, val_label)).asscalar()
    
    # Let's take the average losses
    training_loss = training_loss / len(train_label)
    val_loss = val_loss / len(val_label)
    
    end = time.time()
    print("Epoch %s. Train_loss %f Validation_loss %f Seconds %f" % \
          (epoch, training_loss, val_loss, end-start))

Epoch 0. Train_loss 0.621807 Validation_loss 0.586678 Seconds 9.037092
Epoch 1. Train_loss 0.553893 Validation_loss 0.532601 Seconds 8.970854
Epoch 2. Train_loss 0.519342 Validation_loss 0.512030 Seconds 9.035579
Epoch 3. Train_loss 0.501472 Validation_loss 0.498159 Seconds 8.994958
Epoch 4. Train_loss 0.487072 Validation_loss 0.488197 Seconds 9.114947
Epoch 5. Train_loss 0.475217 Validation_loss 0.479580 Seconds 9.111258
Epoch 6. Train_loss 0.465758 Validation_loss 0.473038 Seconds 9.038848
Epoch 7. Train_loss 0.458406 Validation_loss 0.468344 Seconds 8.940323
Epoch 8. Train_loss 0.452745 Validation_loss 0.464891 Seconds 8.859067
Epoch 9. Train_loss 0.448174 Validation_loss 0.462168 Seconds 9.030634
Epoch 10. Train_loss 0.444210 Validation_loss 0.459715 Seconds 8.995466
Epoch 11. Train_loss 0.440575 Validation_loss 0.456971 Seconds 9.041262
Epoch 12. Train_loss 0.437049 Validation_loss 0.454920 Seconds 9.100632
Epoch 13. Train_loss 0.433966 Validation_loss 0.452838 Seconds 8.930927
Ep

We trained the model for 15 epochs. The validation loss goes down with each epoch shown, so feel free to increase the number of epochs.
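
For intuition about what each `trainer.step(batch_size)` call does with plain SGD, the sketch below applies a single update: gradients summed over the batch are divided by the batch size and scaled by the learning rate. The weights and gradient values are hypothetical, and the function is an illustration, not the Gluon implementation.

```python
def sgd_step(weights, summed_grads, learning_rate, batch_size):
    # Average the summed per-example gradients, then move against the gradient
    return [w - learning_rate * g / batch_size
            for w, g in zip(weights, summed_grads)]

weights = [0.5, -1.0]
summed_grads = [3.2, -6.4]  # hypothetical gradients summed over a batch of 32
new_weights = sgd_step(weights, summed_grads, learning_rate=0.01, batch_size=32)
print([round(w, 4) for w in new_weights])  # [0.499, -0.998]
```

With a learning rate of 0.01 each step is small, which is consistent with the slow, steady decrease of the losses above.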

### 7. Test performance
Let's see how the model performs on the test data. We will use some of sklearn's metric functions here.

In [21]:
from sklearn.metrics import classification_report, accuracy_score

test_text = df_test["reviewText"].tolist()
test_label = df_test["isPositive"].tolist()

# Transform test text
test_text_transformed = nd.array([transformText(text, vocab, max_length) for text in test_text])
# Get test predictions
test_predictions = model(test_text_transformed.as_in_context(context))

# Map the predictions to 0 and 1
mapped_predictions = []
for pred in test_predictions.asnumpy():
    if pred[0]>0.5:
        mapped_predictions.append(1)
    else:
        mapped_predictions.append(0)

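# Note: sklearn's metric functions take (y_true, y_pred). The calls below pass
# the predictions first, so in the report precision and recall are swapped
# relative to that convention and "support" counts predicted labels.
# Accuracy is unaffected by the argument order.
print("Classification Report")
print(classification_report(mapped_predictions, test_label))
print("Accuracy")
print(accuracy_score(mapped_predictions, test_label))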

Classification Report
              precision    recall  f1-score   support

           0       0.69      0.74      0.72      2807
           1       0.86      0.82      0.84      5193

   micro avg       0.79      0.79      0.79      8000
   macro avg       0.77      0.78      0.78      8000
weighted avg       0.80      0.79      0.79      8000

Accuracy
0.793125
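
To show how numbers like these are derived, the sketch below thresholds a handful of hypothetical probabilities at 0.5 and computes accuracy, precision and recall by hand. The helper functions are ours and do not guard against division by zero. The last line shows that swapping the two arguments swaps precision and recall, which is why the argument order matters for sklearn's `(y_true, y_pred)` convention.

```python
def threshold(probs, cutoff=0.5):
    # 1 when the probability is above the cutoff, else 0
    return [1 if p > cutoff else 0 for p in probs]

def precision_recall(y_true, y_pred, positive=1):
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    return tp / (tp + fp), tp / (tp + fn)

y_true = [1, 1, 1, 1, 0]
y_pred = threshold([0.9, 0.7, 0.4, 0.3, 0.6])  # [1, 1, 0, 0, 1]

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
precision, recall = precision_recall(y_true, y_pred)
print(accuracy)                               # 0.4
print(round(precision, 3), round(recall, 3))  # 0.667 0.5

# Swapping the arguments swaps the two metrics
swapped = precision_recall(y_pred, y_true)
print(round(swapped[0], 3), round(swapped[1], 3))  # 0.5 0.667
```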
