<a href="https://colab.research.google.com/github/LxYuan0420/aws-machine-learning-university-accelerated-nlp/blob/master/colab_notebooks/MLA_NLP_Lecture3_Recurrent_Neural_Networks_PyTorch.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
from google.colab import drive
drive.mount('/gdrive')

Mounted at /gdrive


In [2]:
%cd "/gdrive/MyDrive/Colab Notebooks/git/aws-machine-learning-university-accelerated-nlp/colab_notebooks"

/gdrive/MyDrive/Colab Notebooks/git/aws-machine-learning-university-accelerated-nlp/colab_notebooks


**Machine Learning Accelerator - Natural Language Processing - Lecture 3**

**Recurrent Neural Networks (RNNs) for the Product Review Problem - Classify Product Reviews as Positive or Not**

In this exercise, we will learn how to use Recurrent Neural Networks.

We will follow these steps:

1. Reading the dataset
1. Exploratory data analysis
1. Train-validation dataset split
1. Text processing and transformation
1. Using GloVe Word Embeddings
1. Training and validating model
1. Improvement ideas

Overall dataset schema:

1. reviewText: Text of the review
1. summary: Summary of the review
1. verified: Whether the purchase was verified (True or False)
1. time: UNIX timestamp for the review
1. log_votes: Logarithm-adjusted votes log(1+votes)
1. isPositive: Whether the review is positive or negative (1 or 0)

**Important note: One big distinction between regular neural networks and RNNs is that RNNs work with sequential data. In our case, the RNN will help us with the text field. If we also want to use other fields such as time, log_votes, and verified, we need to combine a regular (feed-forward) network with the RNN.**
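The idea in the note above can be sketched as follows. This is only an illustration, not part of the exercise: the class name `TextAndFeaturesNet`, the toy sizes, and the choice to concatenate the final hidden state with the numeric fields are all assumptions made for this sketch.

```python
import torch
from torch import nn


class TextAndFeaturesNet(nn.Module):
    """Hypothetical model: an RNN over token ids plus extra numeric fields."""

    def __init__(self, vocab_size, embed_size, num_hiddens, num_extra):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_size)
        self.rnn = nn.RNN(embed_size, num_hiddens, batch_first=True)
        # The final hidden state is concatenated with the numeric features
        self.linear = nn.Linear(num_hiddens + num_extra, 1)

    def forward(self, token_ids, extra):
        embeddings = self.embedding(token_ids)      # (batch, seq, embed)
        _, last_hidden = self.rnn(embeddings)       # (num_layers, batch, hidden)
        combined = torch.cat([last_hidden[-1], extra], dim=1)
        return self.linear(combined)                # raw scores, (batch, 1)


torch.manual_seed(0)
toy_model = TextAndFeaturesNet(vocab_size=100, embed_size=8, num_hiddens=4, num_extra=2)
toy_tokens = torch.randint(0, 100, (3, 10))   # 3 reviews, 10 token ids each
toy_extra = torch.rand(3, 2)                  # e.g. log_votes and verified as floats
toy_logits = toy_model(toy_tokens, toy_extra)
print(toy_logits.shape)
```

The rest of this notebook uses only the text field, so this combined model is not trained here.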


In [4]:
import re
import numpy as np
import torch
from torch import nn, optim
from torch.nn import BCEWithLogitsLoss 
from sklearn.model_selection import train_test_split

**1. Reading the dataset**

Let's read the dataset below. We will use the reviewText field as input to our ML model; missing values in this field are handled later, during text cleaning.

In [5]:
import pandas as pd

df = pd.read_csv('../data/examples/AMAZON-REVIEW-DATA-CLASSIFICATION.csv')

In [6]:
df.head()

Unnamed: 0,reviewText,summary,verified,time,log_votes,isPositive
0,"PURCHASED FOR YOUNGSTER WHO\nINHERITED MY ""TOO...",IDEAL FOR BEGINNER!,True,1361836800,0.0,1.0
1,unable to open or use,Two Stars,True,1452643200,0.0,0.0
2,Waste of money!!! It wouldn't load to my system.,Dont buy it!,True,1433289600,0.0,0.0
3,I attempted to install this OS on two differen...,I attempted to install this OS on two differen...,True,1518912000,0.0,0.0
4,I've spent 14 fruitless hours over the past tw...,Do NOT Download.,True,1441929600,1.098612,0.0
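Row 4 above shows log_votes = 1.098612. Assuming the natural logarithm is meant (the schema only says log(1+votes)), that value corresponds to 2 votes, as this small check suggests:

```python
import math

votes = 2
log_votes = math.log1p(votes)        # log(1 + votes)
print(round(log_votes, 6))           # 1.098612

# expm1 inverts log1p, recovering the raw vote count
recovered = math.expm1(log_votes)
print(round(recovered))              # 2
```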


**2. Exploratory Data Analysis**

Let's look at the distribution of the target label isPositive and count the missing values in each column.

In [7]:
df['isPositive'].value_counts()

1.0    43692
0.0    26308
Name: isPositive, dtype: int64
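The counts above show a moderate class imbalance. As a rough reference point, the share of positive reviews can be computed directly from those two numbers (copied by hand from the output, not recomputed from the CSV):

```python
counts = {1.0: 43692, 0.0: 26308}    # copied from the value_counts() output above
total = sum(counts.values())
positive_share = counts[1.0] / total

print(total)                          # 70000
print(round(positive_share, 4))       # 0.6242
```

A model that always predicts "positive" would therefore be right about 62% of the time on the full dataset, which is a useful baseline when reading the accuracy reported later.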

In [8]:
print(df.isna().sum())

reviewText    11
summary       14
verified       0
time           0
log_votes      0
isPositive     0
dtype: int64


**3. Train-validation split**

Let's split the dataset into training and validation

In [9]:
train_text, val_text, train_label, val_label = train_test_split(
    df["reviewText"].tolist(),
    df["isPositive"].tolist(),
    test_size=0.1,
    shuffle=True,
    random_state=324,
)

**4. Text processing and Transformation**

We will apply the following processes here:

1. Text cleaning: Simple text cleaning operations. We won't do stemming or lemmatization, as our word vectors already cover different forms of words. We are using GloVe word embeddings trained on a corpus of 6 billion tokens (words and punctuation) in this example.
1. Tokenization: Tokenizing all sentences
1. Creating vocabulary: We will create a vocabulary of the tokens. In this vocabulary, tokens will map to unique ids, such as "car"->32, "house"->651, etc.
1. Transforming text: Tokenized sentences will be mapped to unique ids. For example: ["this", "is", "sentence"] -> [13, 54, 412].
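The steps above can be sketched on a toy corpus using only the standard library. This is a simplified stand-in for the notebook's own functions: a whitespace split replaces `word_tokenize`, and the names `toy_sentences`, `toy_min_freq`, `toy_max_length`, and `to_ids` are made up for the illustration.

```python
from collections import Counter

toy_sentences = ["this is a sentence", "this is another sentence", "a car"]
toy_min_freq = 2
toy_max_length = 5

# Count tokens; a plain whitespace split stands in for word_tokenize here
counter = Counter(tok for s in toy_sentences for tok in s.split())

# Index 0 is reserved for unknown tokens; it also serves as padding in this sketch
itos = ["<unk>"] + sorted(
    (tok for tok, n in counter.items() if n >= toy_min_freq),
    key=lambda tok: (-counter[tok], tok),
)
stoi = {tok: i for i, tok in enumerate(itos)}


def to_ids(sentence):
    # Map tokens to ids, truncate to toy_max_length, then pad with 0
    ids = [stoi.get(tok, 0) for tok in sentence.split()][:toy_max_length]
    return ids + [0] * (toy_max_length - len(ids))


print(itos)                       # ['<unk>', 'a', 'is', 'sentence', 'this']
print(to_ids("this is a car"))    # [4, 2, 1, 0, 0]
```

"car" occurs only once, so it falls below the frequency threshold and maps to index 0, the same index used to pad the short sentence.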


In [10]:
from collections import Counter
import nltk, torchtext
from nltk.tokenize import word_tokenize

nltk.download("punkt")


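def cleanStr(text):
    # Missing values (e.g. NaN) are not strings; treat them as empty text
    if not isinstance(text, str):
        text = ""

    text = text.lower().strip()
    # Collapse runs of whitespace into a single space
    text = re.sub(r"\s+", " ", text)
    # Remove HTML tags
    text = re.sub(r"<.*?>", "", text)
    return text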

def tokenize(text):
    # Clean the text, then split it into word and punctuation tokens
    return word_tokenize(cleanStr(text))

def createVocabulary(text_list, min_freq):
    # Count how often each token occurs across all texts
    counter = Counter()
    for sentence in text_list:
        counter.update(tokenize(sentence))

    # specials must be a sequence of tokens, so pass a one-element tuple;
    # ("<unk>") without the trailing comma is just the string "<unk>"
    vocab = torchtext.vocab.Vocab(counter, min_freq=min_freq, specials=("<unk>",))

    return vocab

def transformText(text, vocab, max_length):
    # Index 0 (<unk>) doubles as padding for texts shorter than max_length
    token_arr = torch.zeros(max_length, dtype=torch.long)
    tokens = tokenize(text)[:max_length]
    for idx, token in enumerate(tokens):
        # vocab.stoi returns index 0 for out-of-vocabulary tokens
        token_arr[idx] = vocab.stoi[token]
    return token_arr

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


In [11]:
min_freq = 5
max_length = 250

print("Creating the vocabulary")
vocab = createVocabulary(train_text, min_freq)

Creating the vocabulary


In [14]:
print("Transforming training texts")
train_text_transformed = torch.stack([transformText(text, vocab, max_length) for text in train_text])
print("Transforming validation texts")
val_text_transformed = torch.stack([transformText(text, vocab, max_length) for text in val_text])

Transforming training texts
Transforming validation texts


In [21]:
# print some words and ids
print(f"Vocab index for beautiful: {vocab.stoi['beautiful']}")
print(f"Vocab index for xyzzzxc: {vocab.stoi['xyzzzxc']}")

Vocab index for beautiful: 1931
Vocab index for xyzzzxc: 0


**5. Using pre-trained GloVe Word Embeddings**

In this example, we will use GloVe word vectors. name='6B' selects the vectors trained on a corpus of 6 billion tokens (a vocabulary of 400,000 words), and dim=50 means each word vector has 50 numbers in it. The following code shows how to get the word vectors and create an embedding matrix from them. We will connect our vocabulary indexes to the GloVe embedding with the get_vecs_by_tokens() function.
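To make the idea of an embedding matrix concrete, here is a minimal sketch with made-up 3-dimensional vectors instead of GloVe. The dictionary `toy_vectors`, the list `toy_itos`, and the choice to leave all-zero rows for tokens without a pre-trained vector are assumptions of this illustration.

```python
import torch

# Hypothetical 3-dimensional "pre-trained" vectors standing in for GloVe
toy_vectors = {
    "good": torch.tensor([0.1, 0.2, 0.3]),
    "bad": torch.tensor([-0.1, -0.2, -0.3]),
}
toy_itos = ["<unk>", "good", "bad", "xyzzzxc"]   # vocabulary index -> token
toy_dim = 3

# Row i holds the vector for token i; tokens without a vector stay all-zero
toy_matrix = torch.zeros(len(toy_itos), toy_dim)
for idx, token in enumerate(toy_itos):
    if token in toy_vectors:
        toy_matrix[idx] = toy_vectors[token]

print(toy_matrix.shape)
```

Row order follows the vocabulary, so looking up row `i` of the matrix returns the vector of the token with vocabulary index `i`.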

In [26]:
a = vocab.itos
print(type(a))

<class 'list'>


In [25]:
from torchtext.vocab import GloVe

glove = GloVe(name="6B", dim=50)
embedding_matrix = glove.get_vecs_by_tokens(vocab.itos)

.vector_cache/glove.6B.zip: 862MB [02:41, 5.35MB/s]                           
100%|█████████▉| 398025/400000 [00:13<00:00, 29083.17it/s]

AttributeError: ignored

**6. Training and validation**

We have processed our text data and created our embedding matrix from GloVe. Now, it is time to start the training process.

We will set our parameters below.

In [30]:
hidden_size = 12
learning_rate = 0.01
epochs = 15
batch_size = 32

num_embed = 50
vocab_size = len(vocab.itos)

In [29]:
from torch.utils.data import TensorDataset, DataLoader

train_label = torch.Tensor(train_label)
val_label = torch.Tensor(val_label)
train_dataset = TensorDataset(train_text_transformed, train_label) 
train_loader = DataLoader(train_dataset, batch_size=batch_size)

In [None]:
device = torch.device("cpu") # use "cuda:0" if you are using GPU

class Net(nn.Module):
    def __init__(self, vocab_size, embed_size, num_hiddens, seq_len, num_layers=1):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_size)
        # batch_first=True: inputs arrive as (batch, seq_len, embed_size)
        self.rnn = nn.RNN(embed_size, num_hiddens, num_layers=num_layers, batch_first=True)
        # Hidden states of all time steps are flattened: seq_len * num_hiddens (250 * 12 = 3000)
        self.linear = nn.Linear(seq_len * num_hiddens, 1)
        self.act = nn.Sigmoid()

    def forward(self, inputs):
        embeddings = self.embedding(inputs)
        outputs, _ = self.rnn(embeddings)
        outs = self.linear(outputs.reshape(outputs.shape[0], -1))
        # Sigmoid turns the score into a probability in (0, 1)
        return self.act(outs)
    
model = Net(vocab_size, num_embed, hidden_size, max_length).to(device)
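The value 3000 used for the linear layer comes from flattening the RNN outputs: 250 time steps times 12 hidden units. The shape trace below illustrates this with random inputs. Note that nn.RNN reads its input as (seq_len, batch, features) unless batch_first=True is passed; the toy batch size of 4 is arbitrary.

```python
import torch
from torch import nn

batch, seq_len, embed_size, num_hiddens = 4, 250, 50, 12

rnn = nn.RNN(embed_size, num_hiddens, batch_first=True)
toy_embeddings = torch.rand(batch, seq_len, embed_size)

outputs, last_hidden = rnn(toy_embeddings)
print(outputs.shape)        # (batch, seq_len, num_hiddens)
print(last_hidden.shape)    # (num_layers, batch, num_hiddens)

# Flatten the hidden states of all time steps into one vector per review
flat = outputs.reshape(batch, -1)
print(flat.shape[1])        # 250 * 12 = 3000
```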

In [None]:
model.embedding.weight.data.copy_(embedding_matrix)
model.embedding.weight.requires_grad=False

In [None]:
trainer = torch.optim.SGD(model.parameters(), lr=learning_rate)
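# The model already applies a sigmoid, so use BCELoss (which expects probabilities)
# rather than BCEWithLogitsLoss (which would apply a second sigmoid)
cross_ent_loss = nn.BCELoss(reduction='none')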
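When choosing the loss, it matters whether the model outputs raw scores or probabilities: nn.BCEWithLogitsLoss applies the sigmoid itself, while nn.BCELoss expects values that have already passed through a sigmoid. A small check on made-up numbers shows that the two pairings agree:

```python
import torch
from torch import nn

toy_logits = torch.tensor([2.0, -1.0, 0.5])     # raw scores, made up for illustration
toy_targets = torch.tensor([1.0, 0.0, 1.0])

# Loss computed directly from raw scores
loss_from_logits = nn.BCEWithLogitsLoss(reduction="none")(toy_logits, toy_targets)
# Loss computed from probabilities obtained with an explicit sigmoid
loss_from_probs = nn.BCELoss(reduction="none")(torch.sigmoid(toy_logits), toy_targets)

print(torch.allclose(loss_from_logits, loss_from_probs, atol=1e-6))
```

Feeding sigmoid outputs into nn.BCEWithLogitsLoss would apply the sigmoid twice, which distorts the loss.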

In [None]:
import time

for epoch in range(epochs):
    start = time.time()
    training_loss = 0.0
    model.train()
    for idx, (data, target) in enumerate(train_loader):
        trainer.zero_grad()
        
        data = data.long().to(device)
        target = target.to(device)

        output = model(data)
        L = cross_ent_loss(output.squeeze(1), target).sum()
        # .item() keeps a plain float instead of holding on to the autograd graph
        training_loss += L.item()

        L.backward()
        trainer.step()

    # End of epoch: evaluate on the validation set without tracking gradients
    model.eval()
    with torch.no_grad():
        val_predictions = model(val_text_transformed.long().to(device)).squeeze(1)
        val_loss = cross_ent_loss(val_predictions, val_label.to(device)).sum().item()

    training_loss /= len(train_label)
    val_loss /= len(val_label)

    end = time.time()
    print("Epoch %s. Train_loss %f Validation_loss %f Seconds %f" %
          (epoch, training_loss, val_loss, end-start))

In [None]:
from sklearn.metrics import classification_report, accuracy_score

# Get validation predictions (no gradients needed for inference)
model.eval()
with torch.no_grad():
    val_predictions = model(val_text_transformed.to(device).long()).squeeze(1)

# Round predictions: 1 if pred>0.5, 0 otherwise
val_predictions = np.round(val_predictions.cpu().numpy())

print("Classification Report")
print(classification_report(val_label.numpy(), val_predictions))
print("Accuracy")
print(accuracy_score(val_label.numpy(), val_predictions))