# <a name="0">Machine Learning Accelerator - Natural Language Processing - Lecture 2</a>

## Recurrent Neural Networks (RNNs) for the Product Review Problem - Classify Product Reviews as Positive or Not

In this exercise, we will learn how to use Recurrent Neural Networks. 

We will follow these steps:
1. <a href="#1">Reading the dataset</a>
2. <a href="#2">Exploratory data analysis</a>
3. <a href="#3">Train-validation dataset split</a>
4. <a href="#4">Text Transformation</a>
5. <a href="#5">Generating data batch and iterator</a>
6. <a href="#6">Using pre-trained GloVe Word Embeddings</a>
7. <a href="#7">Setting Hyperparameters and Bulding the Network</a>
8. <a href="#8">Training the Network</a>
9. <a href="#9">Improvement ideas</a>

Overall dataset schema:
* __reviewText:__ Text of the review
* __summary:__ Summary of the review
* __verified:__ Whether the purchase was verified (True or False)
* __time:__ UNIX timestamp for the review
* __log_votes:__ Logarithm-adjusted votes log(1+votes)
* __isPositive:__ Whether the review is positive or negative (1 or 0)

__Important note:__ One big distinction betweeen the regular neural networks and RNNs is that RNNs work with sequential data. In our case, RNNs will help us with the text field. If we also want to consider other fields such as time, log_votes, verified, etc., we need to use the regular neural networks with the RNN network.

In [1]:
# Upgrade dependencies
!pip install -r ../../requirements.txt

[31mERROR: Could not install packages due to an OSError: [Errno 2] No such file or directory: '/home/ec2-user/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages/six-1.16.0.dist-info/METADATA'
[0m
You should consider upgrading via the '/home/ec2-user/anaconda3/envs/pytorch_p36/bin/python -m pip install --upgrade pip' command.[0m


In [2]:
import re, time
import numpy as np
import torch, torchtext
import boto3
import os
import pandas as pd

from os import path
from collections import Counter
from torch import nn, optim
from torch.nn import BCEWithLogitsLoss
from torchtext.data.utils import get_tokenizer
from torchtext.vocab import Vocab
from torch.utils.data import TensorDataset, DataLoader
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score
from torchtext.vocab import GloVe

## 1. <a name="1">Reading the dataset</a>
(<a href="#0">Go to top</a>)

Let's read the dataset below and look at the first five rows in the dataset. 

In [3]:
df = pd.read_csv("../../data/examples/NLP-REVIEW-DATA-CLASSIFICATION-TRAINING.csv")
df.head()

Unnamed: 0,ID,reviewText,summary,verified,time,log_votes,isPositive
0,65886,Purchased as a quick fix for a needed Server 2...,"Easy install, seamless migration",True,1458864000,0.0,1.0
1,19822,So far so good. Installation was simple. And r...,Five Stars,True,1417478400,0.0,1.0
2,14558,Microsoft keeps making Visual Studio better. I...,This is the best development tool I've ever used.,False,1252886400,0.0,1.0
3,39708,Very good product.,Very good product.,True,1458604800,0.0,1.0
4,8015,So very different from my last version and I a...,... from my last version and I am having a gre...,True,1454716800,2.197225,0.0


## 2. <a name="2">Exploratory Data Analysis</a>
(<a href="#0">Go to top</a>)

Let's look at the range and distribution of the target column `isPositive`.

In [4]:
df["isPositive"].value_counts()

1.0    34954
0.0    21046
Name: isPositive, dtype: int64

We can check the number of missing values for each columm below.

In [5]:
print(df.isna().sum())

ID             0
reviewText    10
summary       12
verified       0
time           0
log_votes      0
isPositive     0
dtype: int64


We have missing values in our text fields. We will use the __reviewText__ field, so we fill-in the missing values in it iwth the empty string.

In [6]:
df["reviewText"] = df["reviewText"].fillna("")

## 3. <a name="3">Train-validation split</a>
(<a href="#0">Go to top</a>)

Let's split the dataset into training and validation.

In [7]:
# This separates 10% of the entire dataset into validation dataset.
train_text, val_text, train_label, val_label = train_test_split(
    df["reviewText"].tolist(),
    df["isPositive"].tolist(),
    test_size=0.10,
    shuffle=True,
    random_state=324,
)

## 4. <a name="4">Text Transformation</a>
(<a href="#0">Go to top</a>)

We will apply the following processes here:
1. Creating a vocabulary
2. Text transformation

__1. Creating a vocabulary:__ 

We will create a vocabulary with the tokens from the text data. We use a simple english tokenizer and use these tokens to create our vocabulary. In this vocabulary, tokens will map to unique ids, such as "car"->32, "house"->651, etc. 

In [8]:
tokenizer = get_tokenizer("basic_english")
counter = Counter()
for line in train_text:
    counter.update(tokenizer(line))
vocab = Vocab(counter, min_freq=1)

Here are some examples.

In [9]:
print(f"'home' -> {vocab['home']}")
print(f"'wash' -> {vocab['wash']}")
# unknown word (assume from test set)
print(f"'fhshbasdhb' -> {vocab['fhshbasdhb']}")

'home' -> 211
'wash' -> 10241
'fhshbasdhb' -> 0


__2. Text transformation:__ 

We will use the vocabulary and map tokens in the text to unique ids of the tokens. For example: `["this", "is", "a", "sentence"] -> [14, 12, 9, 2066]`.

In [10]:
# Let's create a mapper to transform our text data
text_transform_pipeline = lambda x: [vocab[token] for token in tokenizer(x)]

Let's see some text before and after transformation.

In [11]:
print(f"Before transform:\t{train_text[37]}")
print(f"After transform:\t{text_transform_pipeline(train_text[37])}")

Before transform:	Happy to own it.
After transform:	[321, 6, 237, 8, 2]


Let's create a function for this. In this function, we transform and pad (if necessary) our text data. We cut the series of words at the point where it reaches a certain lenght (we used `max_len=50` here). If the text is shorter than max_len, we `pad zeros` to the end.

In [12]:
def transformText(text_list, max_len):
    # Transform the text
    transformed_data = [text_transform_pipeline(text)[:max_len] for text in text_list]

    # Pad zeros if the text is shoter than max_len
    for data in transformed_data:
        data[len(data) : max_len] = np.zeros(max_len - len(data))

    return torch.tensor(transformed_data, dtype=torch.int64)

In [13]:
text = train_text[8:10]
print(f"Text: {text}\n")
print(f"Num sentences: {len(text)}\n")
tt = transformText(text, max_len=50)
print(f"Transformed text: \n{tt}\n")
print(f"Shape of transformed text: {tt.shape}")

Text: ['Horrible. One day of frustration and I am back to the PC version.', "Didnt give a 5 because I don't know what I need. I like it great"]

Num sentences: 2

Transformed text: 
tensor([[1073,    2,   54,  307,   11, 1254,    7,    4,   86,  114,    6,    3,
          139,   50,    2,    0,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0],
        [4934,  240,    9,  189,  100,    4,   85,   10,   25,  155,   67,    4,
          109,    2,    4,   60,    8,   68,    0,    0,    0,    0,    0,    0,
            0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
            0,    0]])

Shape of transformed text: torch.Size([2, 50])


## 5. <a name="5">Generating data batch and iterator</a>
(<a href="#0">Go to top</a>)

Let's use the transformText() function and create the data loaders. Here, we use __max_len=100__ to consider the first 100 words in the text.

In [14]:
max_len = 100
batch_size = 16

# Pass transformed and padded data to dataset
# Create data loaders
train_dataset = TensorDataset(
    transformText(train_text, max_len), torch.tensor(train_label)
)
train_loader = DataLoader(train_dataset, batch_size=batch_size)

val_dataset = TensorDataset(transformText(val_text, max_len), torch.tensor(val_label))
val_loader = DataLoader(val_dataset, batch_size=batch_size)

## 6. <a name="6">Using pre-trained GloVe Word Embeddings</a>
(<a href="#0">Go to top</a>)

In this example, we will use GloVe word vectors. `name='6B'` `dim=300` gives us 6 billion words/phrases vectors. Each word vector has 300 numbers in it. The following code shows how to get the word vectors and create an embedding matrix from them. We will connect our vocabulary indexes to the GloVe embedding with the `get_vecs_by_tokens()` function.

In [15]:
glove = GloVe(name="6B", dim=300)
embedding_matrix = glove.get_vecs_by_tokens(vocab.itos)

.vector_cache/glove.6B.zip: 862MB [02:41, 5.34MB/s]                               
100%|█████████▉| 399999/400000 [00:39<00:00, 10171.69it/s]


## 7. <a name="7">Setting Hyperparameters and Bulding the Network</a>
(<a href="#0">Go to top</a>)

We will set our parameters like below.

In [16]:
# Size of the state vectors
hidden_size = 8

# General NN training parameters
learning_rate = 0.001
epochs = 25

# Embedding vector and vocabulary sizes
embed_size = 300  # glove.6B.300d.txt
vocab_size = len(vocab.itos)

We need to put our data into correct format before the process.

Our model is made of these layers:
* Embedding layer: This is where our words/tokens are mapped to word vectors.
* RNN layer: We are using a simple RNN model. We stack 2 RNN layers in this example. More details about the RNN are available [here](https://pytorch.org/docs/stable/generated/torch.nn.RNN.html).
* Linear layer: A linear layer with a single neuron is used to output the `isPositive` prediction.

In [17]:
class Net(nn.Module):
    def __init__(self, vocab_size, embed_size, hidden_size, num_layers=1):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_size)
        self.rnn = nn.RNN(
            embed_size, hidden_size, num_layers=num_layers
        )

        self.linear = nn.Linear(hidden_size*max_len, 1)
        self.act = nn.Sigmoid()

    def forward(self, inputs):
        embeddings = self.embedding(inputs)
        # Call RNN layer
        outputs, _ = self.rnn(embeddings)
        # Use the output of each time step
        # Send it all together to the linear layer
        outs = self.linear(outputs.reshape(outputs.shape[0], -1))
        return self.act(outs)
    
model = Net(vocab_size, embed_size, hidden_size, num_layers=2)

# Initialize the weights
def init_weights(m):
    if type(m) == nn.Linear:
        nn.init.xavier_uniform_(m.weight)
    if type(m) == nn.RNN:
        for param in m._flat_weights_names:
            if "weight" in param:
                nn.init.xavier_uniform_(m._parameters[param])

Let's initialize this network. Then, we will need to make the embedding layer use our GloVe word vectors.

In [18]:
# We set the embedding layer's parameters from GloVe
model.embedding.weight.data.copy_(embedding_matrix)
# We won't change/train the embedding layer
model.embedding.weight.requires_grad = False

## 8. <a name="8">Training the Network</a>
(<a href="#0">Go to top</a>)

Now, it is time to start our training. We define the loss function and training algorithm first. Then, training starts!

We will define the trainer and loss function below. 

__Binary cross-entropy loss__ is used as this is a binary classification problem.

$$
\mathrm{BinaryCrossEntropyLoss} = -\sum_{examples}{(y\log(p) + (1 - y)\log(1 - p))}
$$

In [19]:
# Setting our trainer
trainer = torch.optim.SGD(model.parameters(), lr=learning_rate)

# We will use Binary Cross-entropy loss
# reduction="sum" sums the losses for given output and target
cross_ent_loss = nn.BCELoss(reduction="sum")

Now, it is time to start the training process. We will print the Binary cross-entropy loss loss after each epoch.

In [20]:
# Get the compute device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model.apply(init_weights)
model.to(device)

for epoch in range(epochs):
    start = time.time()
    training_loss = 0
    val_loss = 0
    # Training loop, train the network
    for data, target in train_loader:
        trainer.zero_grad()
        data = data.to(device)
        target = target.to(device)
        output = model(data)
        L = cross_ent_loss(output, target.unsqueeze(1))
        training_loss += L.item()
        L.backward()
        trainer.step()

    # Validate the network, no training (no weight update)
    for data, target in val_loader:
        val_predictions = model(data.to(device))
        L = cross_ent_loss(val_predictions, target.to(device).unsqueeze(1))
        val_loss += L.item()

    # Let's take the average losses
    training_loss = training_loss / len(train_label)
    val_loss = val_loss / len(val_label)

    end = time.time()
    print(
        f"Epoch {epoch}. Train_loss {training_loss}. Val_loss {val_loss}. Seconds {end-start}"
    )

Epoch 0. Train_loss 0.5973615130545601. Val_loss 0.5182847450886454. Seconds 15.003392457962036
Epoch 1. Train_loss 0.49063601183985905. Val_loss 0.47741369175059456. Seconds 14.428903102874756
Epoch 2. Train_loss 0.46167316157666466. Val_loss 0.46435574889183046. Seconds 15.341912031173706
Epoch 3. Train_loss 0.4447657239389798. Val_loss 0.45652317472866605. Seconds 14.96990418434143
Epoch 4. Train_loss 0.43217158005824163. Val_loss 0.45046656757593156. Seconds 14.371657848358154
Epoch 5. Train_loss 0.42222203342214465. Val_loss 0.4456157413550786. Seconds 15.766808986663818
Epoch 6. Train_loss 0.4141492530679892. Val_loss 0.44175665663821356. Seconds 15.575850009918213
Epoch 7. Train_loss 0.4074958306243495. Val_loss 0.438498275067125. Seconds 14.719719409942627
Epoch 8. Train_loss 0.4019319280828275. Val_loss 0.43549691004412516. Seconds 16.524062156677246
Epoch 9. Train_loss 0.39719844862700454. Val_loss 0.4326266840313162. Seconds 22.080443859100342
Epoch 10. Train_loss 0.39311184

## 9. <a name="9">Test the classifier on the validation data</a>
(<a href="#0">Go to top</a>)

Let's get the validation predictions. Earlier we made predictions on the validation set with this line: ```model(data.to(device))```.

In [21]:
val_predictions = []
for data, target in val_loader:
    val_preds = model(data.to(device))
    val_predictions.extend(
        [np.rint(val_pred)[0] for val_pred in val_preds.detach().cpu().numpy()]
    )
print(val_predictions[:10])

[1.0, 1.0, 0.0, 1.0, 0.0, 1.0, 1.0, 1.0, 1.0, 1.0]


Confusion matrix, classification report and accuracy score are printed below.

In [22]:
# Use the fitted pipeline to make predictions on the validation dataset
print(confusion_matrix(val_label, val_predictions))
print(classification_report(val_label, val_predictions))
print("Accuracy (validation):", accuracy_score(val_label, val_predictions))

[[1332  720]
 [ 337 3211]]
              precision    recall  f1-score   support

         0.0       0.80      0.65      0.72      2052
         1.0       0.82      0.91      0.86      3548

    accuracy                           0.81      5600
   macro avg       0.81      0.78      0.79      5600
weighted avg       0.81      0.81      0.81      5600

Accuracy (validation): 0.81125


This score isn't a great improvement over the single layer network from yesterday (it can be even slightly worse than that result). RNNs usually require more data than regular neural networks in training. They also have additional hyperparameters to work with: __max_len, hidden_size, embed_size__ 

We will see some improved versions of RNNs namely Gated Recurrent Units and Long Sort-term Memory Networks tomorrow.

## 10. <a name="10">Test the classifier on the unseen test data</a>
(<a href="#0">Go to top</a>)

Let's get the test predictions. 

In [23]:
df_test = pd.read_csv("../../data/examples/NLP-REVIEW-DATA-CLASSIFICATION-TEST.csv")

In [24]:
test_text = df_test["reviewText"].fillna(value='').tolist()

In [25]:
test_dataset = TensorDataset(transformText(test_text, max_len)) #, torch.tensor(val_label))
test_loader = DataLoader(test_dataset, batch_size=batch_size)

In [26]:
test_predictions = []
for data, in test_loader:
    test_preds = model(data.to(device))
    test_predictions.extend(
        [np.rint(test_pred)[0] for test_pred in test_preds.detach().cpu().numpy()]
    )
print(test_predictions[:10])

[0.0, 1.0, 1.0, 1.0, 0.0, 1.0, 1.0, 1.0, 1.0, 1.0]


In [27]:
import pandas as pd

result_df = pd.DataFrame()
result_df["ID"] = df_test["ID"]
result_df["isPositive"] = test_predictions

result_df.to_csv("result_day2_rnn.csv", encoding='utf-8', index=False)

## 11. <a name="11">Improvement ideas</a>
(<a href="#0">Go to top</a>)

We can improve our model by
* Changing hyper-parameters: Learning rate, batch size and hidden size
* Increase the number of layers: num_layers