## Assignment 2.3: Text classification via RNN (30 points)

In this assignment you will perform sentiment analysis of IMDB reviews using an RNN. An additional goal is to learn the high-level abstractions of the **torchtext** module, which provides data processing utilities and popular datasets for natural language.

In [0]:
import pandas as pd
import numpy as np
import torch

from torchtext import datasets

from torchtext.data import Field, LabelField
from torchtext.data import BucketIterator

import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

SEED = 42
torch.manual_seed(SEED)
torch.backends.cudnn.deterministic = True

### Preparing Data

In [0]:
TEXT = Field(sequential=True, lower=True)
LABEL = LabelField(dtype=torch.float)

In [3]:
train, tst = datasets.IMDB.splits(TEXT, LABEL)
trn, vld = train.split()
print(len(trn))

17500


In [4]:
%%time
MAX_VOCAB_SIZE = 25000
TEXT.build_vocab(trn, max_size = MAX_VOCAB_SIZE)

CPU times: user 1.1 s, sys: 22 ms, total: 1.12 s
Wall time: 1.13 s


In [0]:
LABEL.build_vocab(trn)

TEXT.vocab.freqs is a collections.Counter object, so we can take a look at the most frequent words.

In [6]:
TEXT.vocab.freqs.most_common(10)

[('the', 225207),
 ('a', 111530),
 ('and', 110502),
 ('of', 101129),
 ('to', 93533),
 ('is', 72659),
 ('in', 63316),
 ('i', 49352),
 ('this', 48818),
 ('that', 46311)]
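
To make the vocabulary step concrete, here is a minimal pure-Python sketch of a frequency-capped vocabulary. The helper **build_vocab** and the toy sentences are hypothetical and only approximate what torchtext does (for example, torchtext may order equally frequent tokens differently). The sketch shows why a maximum size of 25,000 leads to the 25,002 rows of the embedding layer printed further down: two ids are reserved for the `<unk>` and `<pad>` tokens.

```python
from collections import Counter


def build_vocab(tokenized_texts, max_size, specials=("<unk>", "<pad>")):
    """Hypothetical helper: keep the max_size most frequent tokens plus specials."""
    freqs = Counter(tok for text in tokenized_texts for tok in text)
    # Special tokens take the first ids, then tokens follow by decreasing frequency.
    itos = list(specials) + [tok for tok, _ in freqs.most_common(max_size)]
    stoi = {tok: idx for idx, tok in enumerate(itos)}
    return freqs, itos, stoi


texts = [
    ["the", "movie", "was", "great"],
    ["the", "plot", "was", "the", "problem"],
]
freqs, itos, stoi = build_vocab(texts, max_size=3)

print(freqs.most_common(2))  # [('the', 3), ('was', 2)]
print(len(itos))             # 5: three kept tokens + two special tokens

# Tokens outside the vocabulary fall back to the <unk> id.
encoded = [stoi.get(tok, stoi["<unk>"]) for tok in ["the", "ending", "was"]]
print(encoded)               # [2, 0, 3]
```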

### Creating the Iterator (2 points)

During training, we'll be using a special kind of Iterator, called the **BucketIterator**. When we pass data into a neural network, we want the sequences to be padded to the same length so that we can process them as a batch:

e.g.
\[ 
\[3, 15, 2, 7\],
\[4, 1\], 
\[5, 5, 6, 8, 1\] 
\] -> \[ 
\[3, 15, 2, 7, **0**\],
\[4, 1, **0**, **0**, **0**\], 
\[5, 5, 6, 8, 1\] 
\] 

If the sequences differ greatly in length, the padding wastes a lot of memory and compute time. The BucketIterator groups sequences of similar lengths together for each batch to minimize padding.
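
The effect of bucketing can be sketched in plain Python. The helpers **pad_batch**, **make_batches** and **count_padding** are hypothetical, and sorting the whole dataset by length is a simplification: the real BucketIterator works on chunks of the data and shuffles batches, so treat this only as an illustration of why grouping similar lengths reduces padding.

```python
import random


def pad_batch(seqs, pad_id=0):
    """Pad every sequence in the batch to the length of the longest one."""
    max_len = max(len(s) for s in seqs)
    return [s + [pad_id] * (max_len - len(s)) for s in seqs]


def make_batches(seqs, batch_size):
    """Split a list of sequences into consecutive batches."""
    return [seqs[i:i + batch_size] for i in range(0, len(seqs), batch_size)]


def count_padding(batches):
    """Total number of padding tokens needed across all batches."""
    return sum(
        max(len(s) for s in b) * len(b) - sum(len(s) for s in b)
        for b in batches
    )


# The example from the text above.
print(pad_batch([[3, 15, 2, 7], [4, 1], [5, 5, 6, 8, 1]]))
# [[3, 15, 2, 7, 0], [4, 1, 0, 0, 0], [5, 5, 6, 8, 1]]

# Toy corpus with random lengths (assumed values, not IMDB statistics).
random.seed(0)
seqs = [[1] * random.randint(5, 200) for _ in range(256)]

naive = count_padding(make_batches(seqs, 32))
bucketed = count_padding(make_batches(sorted(seqs, key=len), 32))
print(naive, bucketed)  # the bucketed count is much smaller
```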

Complete the definition of the **BucketIterator** object.

In [0]:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

train_iter, val_iter, test_iter = BucketIterator.splits(
        (trn, vld, tst),
        batch_size= 64,
        sort=False,
        sort_key=lambda x: len(x.text),  # bucket by review length; the field is named "text"
        sort_within_batch=False,
        device=device,
        repeat=False
)

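Let's take a look at what the output of the BucketIterator looks like. Do not be surprised by the layout: the TEXT field was created without **batch_first=True**, so the batch has shape [text_len, batch_size] and each column is one review.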

In [8]:
batch = next(iter(train_iter)); batch.text

tensor([[    0, 11878,  8688,  ...,  7889,  7306,    23],
        [    0,   347, 15642,  ...,    30,  9080,     6],
        [    3,   102,     7,  ...,     5,    14,    28],
        ...,
        [    1,     1,     1,  ...,     1,     1,     1],
        [    1,     1,     1,  ...,     1,     1,     1],
        [    1,     1,     1,  ...,     1,     1,     1]], device='cuda:0')

The batch has all the fields we passed to the Dataset as attributes. The batch data can be accessed through the attribute with the same name.

In [9]:
batch.__dict__.keys()

dict_keys(['batch_size', 'dataset', 'fields', 'input_fields', 'target_fields', 'text', 'label'])

### Define the RNN-based text classification model (10 points)

Start simple. Implement the model according to the schema below.  
![alt text](https://miro.medium.com/max/1396/1*v-tLYQCsni550A-hznS0mw.jpeg)


In [0]:
class RNNBaseline(nn.Module):
    def __init__(self, hidden_dim, emb_dim):
        super().__init__()

        self.embedding = nn.Embedding(
            num_embeddings=len(TEXT.vocab), 
            embedding_dim=emb_dim,
            )
        
        self.gru = nn.GRU(
            input_size=emb_dim, 
            hidden_size=hidden_dim
            )
        
        self.fc = nn.Linear(
            in_features=hidden_dim, 
            out_features = 1
            )
            
    def forward(self, seq):
        # seq = [text_len, batch_size]

        embedded = self.embedding(seq)
        # embedded = [text_len, batch_size, embed_dim]

        output, hidden = self.gru(embedded)
        # output = [text_len, batch_size, hid_dim]
        # hidden = [1, batch_size, hid_dim]

        assert torch.equal(output[-1,:,:], hidden.squeeze(0))

        preds = self.fc(hidden.squeeze(0))
        return preds

In [11]:
em_sz = 200
nh = 300
model = RNNBaseline(nh, emb_dim=em_sz); model

RNNBaseline(
  (embedding): Embedding(25002, 200)
  (gru): GRU(200, 300)
  (fc): Linear(in_features=300, out_features=1, bias=True)
)
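
The shape comments in the forward pass can be checked on random data. The following is a small sketch with arbitrary toy dimensions (assumed values, unrelated to the model sizes above): for a single-layer, unidirectional GRU, the last time step of the output is the final hidden state, which is what the assert in the model relies on.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy sizes chosen only for illustration.
text_len, batch_size, emb_dim, hid_dim = 7, 4, 5, 3

gru = nn.GRU(input_size=emb_dim, hidden_size=hid_dim)
embedded = torch.randn(text_len, batch_size, emb_dim)

output, hidden = gru(embedded)
print(output.shape)  # torch.Size([7, 4, 3]) -> [text_len, batch_size, hid_dim]
print(hidden.shape)  # torch.Size([1, 4, 3]) -> [1, batch_size, hid_dim]

# The final hidden state equals the output at the last time step.
last_matches = torch.allclose(output[-1], hidden.squeeze(0))
print(last_matches)  # True
```

Note that with padded batches the last time step of a shorter review is a padding token, so the final hidden state is computed after the padding has been processed.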

If you're using a GPU, remember to move your model to it with model.to(device) (model.cuda() also works).

In [0]:
model = model.to(device)

### The training loop (3 points)

Define the optimizer and the loss function.

In [0]:
opt = optim.Adam(model.parameters(), lr=1e-3) 
loss_func = nn.BCEWithLogitsLoss().to(device) 
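
PyTorch documents nn.BCEWithLogitsLoss as a sigmoid followed by binary cross-entropy, fused into one numerically more stable operation. The pure-Python sketch below illustrates that idea for a single example. The function names are hypothetical, and the stable expression is the standard textbook form, not a copy of PyTorch's implementation.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def bce_naive(logit, target):
    """Sigmoid first, then cross-entropy; breaks down for large |logit|."""
    p = sigmoid(logit)
    return -(target * math.log(p) + (1 - target) * math.log(1 - p))


def bce_with_logits(logit, target):
    """Stable form: max(x, 0) - x * y + log(1 + exp(-|x|))."""
    return max(logit, 0.0) - logit * target + math.log1p(math.exp(-abs(logit)))


for logit, target in [(-3.0, 0), (-0.5, 1), (0.0, 1), (2.0, 0)]:
    print(logit, target, bce_naive(logit, target), bce_with_logits(logit, target))

# An uninformative prediction (logit 0) costs ln 2, about 0.693.
chance_loss = bce_with_logits(0.0, 1)

# The stable form still works where the naive one would evaluate log(0).
confident_wrong = bce_with_logits(50.0, 0)
```

The value ln 2 is a useful reference point: a model whose loss per example stays near 0.693 has not yet learned anything beyond guessing.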

Define the stopping criteria.

In [0]:
epochs = 5  # stop after a fixed number of epochs
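
A fixed epoch count is the simplest stopping criterion. A common alternative is patience-based early stopping on the validation loss, sketched below. The helper **should_stop**, its parameters and the loss history are hypothetical and are not used elsewhere in this notebook.

```python
def should_stop(val_losses, patience=2, min_delta=0.0):
    """Return True if the last `patience` epochs did not beat the earlier best loss."""
    if len(val_losses) <= patience:
        return False
    best_before = min(val_losses[:-patience])
    return all(loss > best_before - min_delta for loss in val_losses[-patience:])


# Made-up validation losses, one value per epoch.
history = [0.70, 0.55, 0.50, 0.52, 0.51]

# First epoch (1-based) at which training would be stopped.
stop_epoch = next(
    (epoch for epoch in range(1, len(history) + 1)
     if should_stop(history[:epoch], patience=2)),
    None,
)
print(stop_epoch)  # 5
```

In a training loop one would append the validation loss after every epoch and break once the helper returns True, usually keeping a copy of the best model weights.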

In [15]:
%%time
for epoch in range(1, epochs + 1):
    running_loss = 0.0
    model.train()
    for batch in train_iter:
        opt.zero_grad()
        # batch.text = [text_len, batch_size]; the model returns [batch_size, 1]
        preds = model(batch.text).squeeze(1)
        loss = loss_func(preds, batch.label)
        loss.backward()
        opt.step()
        running_loss += loss.item()

    # loss.item() is already a batch mean, so dividing the sum by len(trn)
    # scales the reported value by roughly 1 / batch_size
    epoch_loss = running_loss / len(trn)

    val_loss = 0.0
    model.eval()
    with torch.no_grad():  # gradients are not needed for validation
        for batch in val_iter:
            preds = model(batch.text).squeeze(1)
            loss = loss_func(preds, batch.label)
            val_loss += loss.item()

    val_loss /= len(vld)
    print('Epoch: {}, Training Loss: {}, Validation Loss: {}'.format(epoch, epoch_loss, val_loss))

Epoch: 1, Training Loss: 0.010980496900422233, Validation Loss: 0.010902446150779723
Epoch: 2, Training Loss: 0.010882041849408832, Validation Loss: 0.010913486154874166
Epoch: 3, Training Loss: 0.010845106482505798, Validation Loss: 0.010769179232915243
Epoch: 4, Training Loss: 0.008927319971152715, Validation Loss: 0.006545571482181549
Epoch: 5, Training Loss: 0.004724217485530036, Validation Loss: 0.005722816616296768
CPU times: user 1min 55s, sys: 28.9 s, total: 2min 24s
Wall time: 2min 25s
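
The logged losses look tiny because the loop sums per-batch mean losses and then divides by the number of examples rather than the number of batches. The arithmetic below is a sketch of that scaling. It assumes, purely for illustration, that every batch has the loss of a chance-level model (ln 2); the example count and batch size are the ones used above.

```python
import math

n_examples = 17500  # len(trn) printed earlier
batch_size = 64     # batch size passed to BucketIterator.splits
n_batches = math.ceil(n_examples / batch_size)

# Assumption for illustration: each batch has the mean loss of a chance-level model.
chance_loss = math.log(2)
running_loss = chance_loss * n_batches  # sum of per-batch mean losses

per_example_scaled = running_loss / n_examples  # normalisation used in the loop above
per_batch_mean = running_loss / n_batches       # average loss per example

print(n_batches)                     # 274
print(round(per_example_scaled, 4))  # 0.0109
print(round(per_batch_mean, 4))      # 0.6931
```

Under this assumption the first-epoch values of about 0.0109 correspond to a per-example loss close to ln 2, i.e. a model that is still guessing.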


### Calculate performance of the trained model (5 points)

In [0]:
predictions = np.array([])
y_true = np.array([])

model.eval()
with torch.no_grad():
    for batch in test_iter:
        preds = model(batch.text).squeeze(1)
        preds = torch.round(torch.sigmoid(preds))
        predictions = np.append(predictions, preds.cpu().data.numpy())
        y_true = np.append(y_true, batch.label.cpu().data.numpy())

In [0]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
def metrics_report(true_y, preds_y):
    accuracy = accuracy_score(true_y, preds_y)
    precision = precision_score(true_y, preds_y)
    recall = recall_score(true_y, preds_y)
    f1 = f1_score(true_y, preds_y)
    return accuracy, precision, recall, f1

In [18]:
accuracy, precision, recall, f1 = metrics_report(y_true, predictions)
print('Accuracy  ', accuracy)
print('Precision ', precision)
print('Recall    ', recall)
print('F1        ', f1)

Accuracy   0.83908
Precision  0.921783262016121
Recall     0.74104
F1         0.8215885405117743
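
The four metrics are all functions of the confusion-matrix counts, which the sketch below spells out. The counts are not printed anywhere in this notebook: they are back-computed from the metrics above under the assumption that the IMDB test split contains 12,500 positive and 12,500 negative reviews. The helper **binary_metrics** is hypothetical.

```python
def binary_metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall and F1 for the positive class."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1


# Back-computed counts (assumption: 12,500 reviews per class in the test split).
tp, fn = 9263, 3237   # positive reviews: found / missed
fp, tn = 786, 11714   # negative reviews: wrongly flagged / correctly rejected

acc, prec, rec, f1_value = binary_metrics(tp, fp, fn, tn)
print(acc, prec, rec, f1_value)
```

Read this way, the baseline is conservative: it rarely labels a negative review as positive (high precision) but misses roughly a quarter of the positive reviews (lower recall).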


Write down the calculated performance:

Accuracy   0.828

Precision  0.894990366088632

Recall     0.7432

F1         0.8120629370629371

In [0]:
del model
torch.cuda.empty_cache()

### Experiments (10 points)

Experiment with the model and achieve better results. You can find advice [here](https://arxiv.org/abs/1801.06146). Implement and describe your experiments in detail, and mention what was helpful.

In [0]:
class RNN(nn.Module):
    def __init__(self, hidden_dim, emb_dim):
        super().__init__()

        self.embedding = nn.Embedding(
            num_embeddings=len(TEXT.vocab), 
            embedding_dim=emb_dim,
            )
        
        self.lstm = nn.LSTM(
            input_size=emb_dim, 
            hidden_size=hidden_dim,
            num_layers=2,
            bidirectional=True,
            dropout=0.2
            )
        self.lstm_2 = nn.LSTM(
            input_size=hidden_dim*2,
            hidden_size=hidden_dim,
            num_layers=2,
            bidirectional=True,
            dropout=0.2
            )
        self.fc = nn.Linear(
            in_features=hidden_dim * 2, 
            out_features = 1
            )
        self.dropout = nn.Dropout(0.2)
            
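    def forward(self, seq):
        # seq = [text_len, batch_size]
        embedded = self.embedding(seq)
        # embedded = [text_len, batch_size, emb_dim]
        output, (hidden, cell) = self.lstm(embedded)
        # output = [text_len, batch_size, hidden_dim * 2]
        output, (hidden, cell) = self.lstm_2(output)
        # hidden = [num_layers * 2, batch_size, hidden_dim];
        # hidden[-2] / hidden[-1] are the last layer's forward / backward states
        hidden = self.dropout(torch.cat((hidden[-2, :, :], hidden[-1, :, :]), dim=1))
        # hidden = [batch_size, hidden_dim * 2]
        preds = self.fc(hidden)
        return preds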

In [35]:
em_sz = 100
nh = 256
model = RNN(nh, emb_dim=em_sz); model

RNN(
  (embedding): Embedding(25002, 100)
  (lstm): LSTM(100, 256, num_layers=2, dropout=0.2, bidirectional=True)
  (lstm_2): LSTM(512, 256, num_layers=2, dropout=0.2, bidirectional=True)
  (fc): Linear(in_features=512, out_features=1, bias=True)
  (dropout): Dropout(p=0.2, inplace=False)
)
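
The indexing hidden[-2] and hidden[-1] in the forward pass relies on how PyTorch lays out the hidden state of a multi-layer bidirectional LSTM. The sketch below checks that layout on random data with arbitrary toy sizes (assumed values) and without dropout, so that the comparison is deterministic.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy sizes chosen only for illustration.
text_len, batch_size, emb_dim, hid_dim = 6, 3, 4, 5

lstm = nn.LSTM(input_size=emb_dim, hidden_size=hid_dim,
               num_layers=2, bidirectional=True)
x = torch.randn(text_len, batch_size, emb_dim)

output, (hidden, cell) = lstm(x)
print(output.shape)  # torch.Size([6, 3, 10]) -> [text_len, batch_size, 2 * hid_dim]
print(hidden.shape)  # torch.Size([4, 3, 5])  -> [num_layers * 2, batch_size, hid_dim]

# Last layer, forward direction: its final state sits at the last time step.
fwd_ok = torch.allclose(output[-1, :, :hid_dim], hidden[-2])
# Last layer, backward direction: its final state sits at the first time step.
bwd_ok = torch.allclose(output[0, :, hid_dim:], hidden[-1])
print(fwd_ok, bwd_ok)  # True True

# Concatenating both gives the [batch_size, 2 * hid_dim] input of the linear layer.
features = torch.cat((hidden[-2], hidden[-1]), dim=1)
```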

In [0]:
model = model.to(device)

opt = optim.Adam(model.parameters(), lr=1e-3) 
loss_func = nn.BCEWithLogitsLoss().to(device)

epochs = 7

In [37]:
%%time
for epoch in range(1, epochs + 1):
    running_loss = 0.0
    model.train()
    for batch in train_iter:
        opt.zero_grad()
        preds = model(batch.text).squeeze(1)
        loss = loss_func(preds, batch.label)
        loss.backward()
        opt.step()
        running_loss += loss.item()

    # same normalisation as for the baseline, so the losses are comparable
    epoch_loss = running_loss / len(trn)

    val_loss = 0.0
    model.eval()
    with torch.no_grad():  # gradients are not needed for validation
        for batch in val_iter:
            preds = model(batch.text).squeeze(1)
            loss = loss_func(preds, batch.label)
            val_loss += loss.item()

    val_loss /= len(vld)
    print('Epoch: {}, Training Loss: {}, Validation Loss: {}'.format(epoch, epoch_loss, val_loss))

Epoch: 1, Training Loss: 0.010666684733118329, Validation Loss: 0.010889897274971009
Epoch: 2, Training Loss: 0.009524216517380306, Validation Loss: 0.009167809947331747
Epoch: 3, Training Loss: 0.008454268496377128, Validation Loss: 0.007150847907861074
Epoch: 4, Training Loss: 0.006180550186974661, Validation Loss: 0.00610453633069992
Epoch: 5, Training Loss: 0.0045721262037754055, Validation Loss: 0.005529360460241635
Epoch: 6, Training Loss: 0.004056513360994203, Validation Loss: 0.005276496680577596
Epoch: 7, Training Loss: 0.0031857222161122732, Validation Loss: 0.005711840677261352
CPU times: user 22min 22s, sys: 7min 50s, total: 30min 13s
Wall time: 30min 26s


In [0]:
predictions = np.array([])
y_true = np.array([])

model.eval()
with torch.no_grad():
    for batch in test_iter:
        preds = model(batch.text).squeeze(1)
        preds = torch.round(torch.sigmoid(preds))
        predictions = np.append(predictions, preds.cpu().data.numpy())
        y_true = np.append(y_true, batch.label.cpu().data.numpy())

In [39]:
accuracy, precision, recall, f1 = metrics_report(y_true, predictions)
print('Accuracy  ', accuracy)
print('Precision ', precision)
print('Recall    ', recall)
print('F1        ', f1)

Accuracy   0.8546
Precision  0.8733260338583341
Recall     0.82952
F1         0.8508595577072992


### 1. Tried a bidirectional LSTM trained for 5 epochs with dropout of 0.2. Obtained:

Accuracy   0.85144

Precision  0.8706547418157273

Recall     0.82552

F1         0.8474868593955323

### 2. Tried a bidirectional two-layer LSTM trained for 7 epochs with dropout of 0.2. Obtained:

Accuracy   0.8546

Precision  0.8733260338583341

Recall     0.82952

F1         0.8508595577072992
