## Recurrent Neural Networks (RNNs) for the Product Review Problem - Classify Product Reviews as Positive or Not

In this exercise, we will learn how to use Recurrent Neural Networks. 

We will follow these steps:
1. <a href="#1">Reading the dataset</a>
2. <a href="#2">Exploratory data analysis</a>
3. <a href="#3">Train-validation dataset split</a>
4. <a href="#4">Text processing and transformation</a>
5. <a href="#5">Using GloVe Word Embeddings</a>
6. <a href="#6">Training and validating model</a>
7. <a href="#7">Improvement ideas</a>

Overall dataset schema:
* __reviewText:__ Text of the review
* __summary:__ Summary of the review
* __verified:__ Whether the purchase was verified (True or False)
* __time:__ UNIX timestamp for the review
* __log_votes:__ Logarithm-adjusted votes log(1+votes)
* __isPositive:__ Whether the review is positive or negative (1 or 0)

__Important note:__ One big distinction betweeen the regular neural networks and RNNs is that RNNs work with sequential data. In our case, RNNs will help us with the text field. If we also want to consider other fields such as time, log_votes, verified, etc. , we need to use the regular neural networks with the RNN network.

In [1]:
!pip install -q torch==1.8.1 torchtext nltk

In [2]:
import re
import numpy as np
import torch
from torch import nn, optim
from torch.nn import BCEWithLogitsLoss 
from sklearn.model_selection import train_test_split

## 1. <a name="1">Reading the dataset</a>
(<a href="#0">Go to top</a>)

Let's read the dataset below and fill-in the reviewText field. We will use this field as input to our ML model.

In [3]:
import pandas as pd

df = pd.read_csv('../../DATA/examples/AMAZON-REVIEW-DATA-CLASSIFICATION.csv')

Let's look at the first five rows in the dataset. As you can see the __log_votes__ field is numeric. That's why we will build a regression model.

In [4]:
df.head()

Unnamed: 0,reviewText,summary,verified,time,log_votes,isPositive
0,"PURCHASED FOR YOUNGSTER WHO\nINHERITED MY ""TOO...",IDEAL FOR BEGINNER!,True,1361836800,0.0,1.0
1,unable to open or use,Two Stars,True,1452643200,0.0,0.0
2,Waste of money!!! It wouldn't load to my system.,Dont buy it!,True,1433289600,0.0,0.0
3,I attempted to install this OS on two differen...,I attempted to install this OS on two differen...,True,1518912000,0.0,0.0
4,I've spent 14 fruitless hours over the past tw...,Do NOT Download.,True,1441929600,1.098612,0.0


## 2. <a name="2">Exploratory Data Analysis</a>
(<a href="#0">Go to top</a>)

Let's look at the range and distribution of log_votes

In [5]:
df["isPositive"].value_counts()

1.0    43692
0.0    26308
Name: isPositive, dtype: int64

We can check the number of missing values for each columm below.

In [6]:
print(df.isna().sum())

reviewText    11
summary       14
verified       0
time           0
log_votes      0
isPositive     0
dtype: int64


We have missing values in our text fields.

## 3. <a name="3">Train-validation split</a>
(<a href="#0">Go to top</a>)

Let's split the dataset into training and validation

In [7]:
# This separates 15% of the entire dataset into validation dataset.
train_text, val_text, train_label, val_label = \
    train_test_split(df["reviewText"].tolist(),
                     df["isPositive"].tolist(),
                     test_size=0.10,
                     shuffle=True,
                     random_state=324)

## 4. <a name="4">Text processing and Transformation</a>
(<a href="#0">Go to top</a>)

We will apply the following processes here:
* __Text cleaning:__ Simple text cleaning operations. We won't do stemming or lemmatization as our word vectors already cover different forms of words. We are using GloVe word embeddings for 6 billion words, phrases or punctuations in this example.
* __Tokenization:__ Tokenizing all sentences
* __Creating vocabulary:__ We will create a vocabulary of the tokens. In this vocabulary, tokens will map to unique ids, such as "car"->32, "house"->651, etc.
* __Transforming text:__ Tokenized sentences will be mapped to unique ids. For example: ["this", "is", "sentence"] -> [13, 54, 412].

In [8]:
from collections import Counter
import nltk, torchtext
from nltk.tokenize import word_tokenize

nltk.download('punkt')

def cleanStr(text):
    
    # Check if the sentence is a missing value
    if isinstance(text, str) == False:
        text = ""
            
    # Remove leading/trailing whitespace
    text = text.lower().strip()
    # Remove extra space and tabs
    text = re.sub('\s+', ' ', text)
    # Remove HTML tags/markups
    text = re.compile('<.*?>').sub('', text)
    return text

def tokenize(text):
    tokens = []
    text = cleanStr(text)
    words = word_tokenize(text)
    for word in words:
        tokens.append(word)
    return tokens

def createVocabulary(text_list, min_freq):
    all_tokens = []
    for sentence in text_list:
        all_tokens += tokenize(sentence)
    # Calculate token frequencies
    counter = Counter()
    for token in all_tokens:
        counter[token] += 1
    # Create the vocabulary
    vocab = torchtext.vocab.Vocab(counter,
                           min_freq = min_freq,
                           specials = ('<unk>'))
    
    return vocab

def transformText(text, vocab, max_length):
    token_arr = torch.zeros((max_length,))
    tokens = tokenize(text)[0:max_length]
    for idx, token in enumerate(tokens):
        try:
            # Use the vocabulary index of the token
            token_arr[idx] = vocab.stoi[token]
        except:
            token_arr[idx] = 0 # Unknown word
    return token_arr

[nltk_data] Downloading package punkt to /Users/mimayer/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


In order to keep the training time low, we only consider the first 250 words (max_length) in sentences. We also only use words that occur more than 5 times in the all sentences (min_freq).

In [9]:
min_freq = 5
max_length = 250

print("Creating the vocabulary")
vocab = createVocabulary(train_text, min_freq)

Creating the vocabulary


In [10]:
print("Transforming training texts")
train_text_transformed = torch.stack([transformText(text, vocab, max_length) for text in train_text])
print("Transforming validation texts")
val_text_transformed = torch.stack([transformText(text, vocab, max_length) for text in val_text])

Transforming training texts
Transforming validation texts


Let's see some unique ids for some words.

In [11]:
print("Vocabulary index for computer:", vocab['computer'])
print("Vocabulary index for beautiful:", vocab['beautiful'])
print("Vocabulary index for code:", vocab['code'])

Vocabulary index for computer: 71
Vocabulary index for beautiful: 1935
Vocabulary index for code: 407


## 5. <a name="5">Using pre-trained GloVe Word Embeddings</a>
(<a href="#0">Go to top</a>)

In this example, we will use GloVe word vectors. `name='6B'` `dim=50` gives us 6 billion words/phrases vectors. Each word vector has 50 numbers in it. The following code shows how to get the word vectors and create an embedding matrix from them. We will connect our vocabulary indexes to the GloVe embedding with the `get_vecs_by_tokens()` function.

In [12]:
from torchtext.vocab import GloVe
glove = GloVe(name='6B', dim=50)
embedding_matrix = glove.get_vecs_by_tokens(vocab.itos)

.vector_cache/glove.6B.zip: 862MB [02:51, 5.04MB/s]                               
100%|█████████▉| 399999/400000 [00:07<00:00, 53606.58it/s]


## 6. <a name="6">Training and validation</a>
(<a href="#0">Go to top</a>)

We have processed our text data and also created our embedding matrixes from GloVe. Now, it is time to start the training process.

We will set our parameters below

In [13]:
# Size of the state vectors
hidden_size = 12

# General NN training parameters
learning_rate = 0.01
epochs = 15
batch_size = 32

# Embedding vector and vocabulary sizes
num_embed = 50 # glove.6B.50d.txt
vocab_size = len(vocab.itos)

We need to put our data into correct format before the process.

In [14]:
from torch.utils.data import TensorDataset, DataLoader

train_label = torch.tensor(train_label)
val_label = torch.tensor(val_label)
train_dataset = TensorDataset(train_text_transformed, train_label)
train_loader = DataLoader(train_dataset, batch_size=batch_size)

Our model is made of these layers:
* Embedding layer: This is where our words/tokens are mapped to word vectors.
* RNN layer: We will be using a simple RNN model. We won't stack RNN units in this example. It uses a sinle RNN unit with its hidden state size of 12. More details about the RNN is available [here](https://pytorch.org/docs/stable/generated/torch.nn.RNN.html).
* Linear layer: A linear layer with a single neuron is used to output our log_votes prediction.

In [15]:
device = torch.device("cpu") # use "cuda:0" if you are using GPU

class Net(nn.Module):
    def __init__(self, vocab_size, embed_size, num_hiddens, num_layers=1):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_size)
        self.rnn = nn.RNN(embed_size, num_hiddens, num_layers=num_layers)
        self.linear = nn.Linear(3000, 1)
        self.act = nn.Sigmoid()

    def forward(self, inputs):
        embeddings = self.embedding(inputs)
        outputs, _ = self.rnn(embeddings)
        outs = self.linear(outputs.reshape(outputs.shape[0], -1))
        return self.act(outs)

model = Net(vocab_size, num_embed, hidden_size)

Let's initialize this network. Then, we will need to make the embedding layer use our GloVe word vectors.

In [16]:
# We set the embedding layer's parameters from GloVe
model.embedding.weight.data.copy_(embedding_matrix)
# We won't change/train the embedding layer
model.embedding.weight.requires_grad = False

We will define the trainer and loss function below. __Binary cross-entropy loss__ is used as this is a binary classification problem.
$$
\mathrm{BinaryCrossEntropyLoss} = -\sum_{examples}{(y\log(p) + (1 - y)\log(1 - p))}
$$

In [17]:
# Setting our trainer
trainer = torch.optim.SGD(model.parameters(), lr=learning_rate)

# We will use Binary Cross-entropy loss
cross_ent_loss = nn.BCEWithLogitsLoss(reduction='none') 

Now, it is time to start the training process. We will print the Binary cross-entropy loss loss after each epoch.

In [18]:
import time
for epoch in range(epochs):
    start = time.time()
    training_loss = 0
    # Training loop, train the network
    for idx, (data, target) in enumerate(train_loader):
        trainer.zero_grad()
        data = data.long().to(device)
        target = target.to(device)
        output = model(data)
        L = cross_ent_loss(output.squeeze(1), target).sum()
        training_loss += L.item()
        L.backward()
        trainer.step()
    
    # Calculate validation loss
    val_predictions = model(val_text_transformed.long().to(device)).squeeze(1)
    val_loss = cross_ent_loss(val_predictions, val_label).sum().item()
    
    # Let's take the average losses
    training_loss = training_loss / len(train_label)
    val_loss = val_loss / len(val_label)
    
    end = time.time()
    print("Epoch %s. Train_loss %f Validation_loss %f Seconds %f" % \
          (epoch, training_loss, val_loss, end-start))

Epoch 0. Train_loss 0.629892 Validation_loss 0.602221 Seconds 8.312113
Epoch 1. Train_loss 0.598350 Validation_loss 0.582967 Seconds 7.471832
Epoch 2. Train_loss 0.587359 Validation_loss 0.576832 Seconds 7.734207
Epoch 3. Train_loss 0.579904 Validation_loss 0.574100 Seconds 7.834989
Epoch 4. Train_loss 0.573832 Validation_loss 0.569988 Seconds 7.897461
Epoch 5. Train_loss 0.569642 Validation_loss 0.570019 Seconds 7.486577
Epoch 6. Train_loss 0.566562 Validation_loss 0.569189 Seconds 10.890178
Epoch 7. Train_loss 0.563948 Validation_loss 0.573817 Seconds 11.268969
Epoch 8. Train_loss 0.561591 Validation_loss 0.568954 Seconds 11.232005
Epoch 9. Train_loss 0.559987 Validation_loss 0.568272 Seconds 10.575020
Epoch 10. Train_loss 0.557994 Validation_loss 0.567034 Seconds 11.526398
Epoch 11. Train_loss 0.556572 Validation_loss 0.568186 Seconds 11.837136
Epoch 12. Train_loss 0.555396 Validation_loss 0.566539 Seconds 10.796005
Epoch 13. Train_loss 0.554863 Validation_loss 0.563169 Seconds 10.5

Let's see some validation results below

In [19]:
from sklearn.metrics import classification_report, accuracy_score

# Get validation predictions
val_predictions = model(val_text_transformed.to(device).long())

# Round predictions: 1 if pred>0.5, 0 otherwise
val_predictions = np.round(val_predictions.detach().numpy())

print("Classification Report")
print(classification_report(val_label.numpy(), val_predictions))
print("Accuracy")
print(accuracy_score(val_label.numpy(), val_predictions))

Classification Report
              precision    recall  f1-score   support

         0.0       0.67      0.74      0.70      2605
         1.0       0.83      0.79      0.81      4395

    accuracy                           0.77      7000
   macro avg       0.75      0.76      0.76      7000
weighted avg       0.77      0.77      0.77      7000

Accuracy
0.7672857142857142
