## Assignment 2.3: Text classification via RNN (30 points)

In this assignment you will perform sentiment analysis of the IMDBs reviews by using RNN. An additional goal is to learn high abstactions of the **torchtext** module that consists of data processing utilities and popular datasets for natural language.

In [None]:
import pandas as pd
import numpy as np
import torch

from torchtext import datasets

from torchtext.data import Field, LabelField
from torchtext.data import BucketIterator

import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torch.utils.tensorboard import SummaryWriter

### Preparing Data

In [None]:
device = 'cuda'

In [None]:
TEXT = Field(sequential=True, lower=True)
LABEL = LabelField()

In [4]:
train, tst = datasets.IMDB.splits(TEXT, LABEL)
trn, vld = train.split()

downloading aclImdb_v1.tar.gz


aclImdb_v1.tar.gz: 100%|██████████| 84.1M/84.1M [00:09<00:00, 9.01MB/s]


In [5]:
%%time
TEXT.build_vocab(trn)

CPU times: user 1.27 s, sys: 35.5 ms, total: 1.31 s
Wall time: 1.31 s


In [None]:
LABEL.build_vocab(trn)

The vocab.freqs is a collections.Counter object, so we can take a look at the most frequent words.

In [7]:
TEXT.vocab.freqs.most_common(10)

[('the', 225513),
 ('a', 111704),
 ('and', 110729),
 ('of', 101179),
 ('to', 93530),
 ('is', 72445),
 ('in', 63261),
 ('i', 49429),
 ('this', 48961),
 ('that', 46429)]

### Creating the Iterator (2 points)

During training, we'll be using a special kind of Iterator, called the **BucketIterator**. When we pass data into a neural network, we want the data to be padded to be the same length so that we can process them in batch:

e.g.
\[ 
\[3, 15, 2, 7\],
\[4, 1\], 
\[5, 5, 6, 8, 1\] 
\] -> \[ 
\[3, 15, 2, 7, **0**\],
\[4, 1, **0**, **0**, **0**\], 
\[5, 5, 6, 8, 1\] 
\] 

If the sequences differ greatly in length, the padding will consume a lot of wasteful memory and time. The BucketIterator groups sequences of similar lengths together for each batch to minimize padding.

Complete the definition of the **BucketIterator** object

In [None]:
batch_size = 64
train_iter, val_iter, test_iter = BucketIterator.splits(
        (trn, vld, tst),
        batch_sizes=(batch_size, batch_size, batch_size),
        sort=True,
        sort_key=lambda x: len(x.text), # write your code here
        sort_within_batch=False,
        device=device,
        repeat=False
)

Let's take a look at what the output of the BucketIterator looks like. Do not be suprised **batch_first=True**

In [9]:
batch = next(train_iter.__iter__()); batch.text

tensor([[    9,    10,  1280,  ...,    10,     9, 11804],
        [  522,    20,   137,  ...,     7,   371, 47404],
        [  853,     7,  2148,  ...,     3,     2,   277],
        ...,
        [    1,     1,     1,  ...,    24,   220,    52],
        [    1,     1,     1,  ...,    40,   531,     5],
        [    1,     1,     1,  ...,     9,   112,   743]], device='cuda:0')

The batch has all the fields we passed to the Dataset as attributes. The batch data can be accessed through the attribute with the same name.

In [10]:
batch.__dict__.keys()

dict_keys(['batch_size', 'dataset', 'fields', 'input_fields', 'target_fields', 'text', 'label'])

### Define the RNN-based text classification model (10 points)

Start simple first. Implement the model according to the shema below.  
![alt text](https://miro.medium.com/max/1396/1*v-tLYQCsni550A-hznS0mw.jpeg)


In [None]:
class RNNBaseline(nn.Module):
    def __init__(self, vocab_size, hidden_dim, emb_dim, num_classes):
        super().__init__()
        # =============================
        #      Write code here
        # =============================
        self.embedding = nn.Embedding(vocab_size, emb_dim)
        self.rnn = nn.GRU(emb_dim, hidden_dim)
        self.classification = nn.Linear(hidden_dim, num_classes)
            
    def forward(self, seq):
        # =============================
        #      Write code here
        # =============================
        x = self.embedding(seq)
        _, x = self.rnn(x)
        x = self.classification(x)
        return F.softmax(x, -1).squeeze(0)

In [12]:
em_sz = 200
nh = 300
model = RNNBaseline(len(TEXT.vocab), nh, em_sz, len(LABEL.vocab)); model

RNNBaseline(
  (embedding): Embedding(201550, 200)
  (rnn): GRU(200, 300)
  (classification): Linear(in_features=300, out_features=2, bias=True)
)

If you're using a GPU, remember to call model.cuda() to move your model to the GPU.

In [13]:
model.to(device)

RNNBaseline(
  (embedding): Embedding(201550, 200)
  (rnn): GRU(200, 300)
  (classification): Linear(in_features=300, out_features=2, bias=True)
)

### The training loop (3 points)

Define the optimization and the loss functions.

In [None]:
opt = optim.Adam(model.parameters()) # your code goes here
loss_func = nn.CrossEntropyLoss() # your code goes here

Define the stopping criteria.

In [None]:
epochs = 20 # your code goes here

In [None]:
writer = SummaryWriter()
log_every = 10

In [22]:
%%time
for epoch in range(1, epochs + 1):
    running_loss = 0.0
    running_corrects = 0
    model.train() 
    for i, batch in enumerate(train_iter): 
        
        x = batch.text
        y = batch.label

        opt.zero_grad()
        preds = model(x)   
        loss = loss_func(preds, y)
        loss.backward()
        opt.step()
        running_loss += loss.item()
        
        global_step = epoch * len(trn) + (i + 1) * batch_size
        if i % log_every == 0:
            writer.add_scalar('training_loss', loss.item(), global_step)

    epoch_loss = running_loss / len(train_iter)
    writer.add_scalar('epoch_loss', epoch_loss, global_step)
    
    val_loss = 0.0
    model.eval()
    for batch in val_iter:
        
        x = batch.text
        y = batch.label
        
        preds = model(x) 
        loss = loss_func(preds, y)
        val_loss += loss.item()
        
    val_loss /= len(val_iter)
    writer.add_scalar('val_loss', val_loss, global_step)
    print('Epoch: {}, Training Loss: {}, Validation Loss: {}'.format(epoch, epoch_loss, val_loss))

Epoch: 1, Training Loss: 0.6855767456284405, Validation Loss: 0.6718965900146355
Epoch: 2, Training Loss: 0.5819907411389107, Validation Loss: 0.5329511206028825
Epoch: 3, Training Loss: 0.4506173847365553, Validation Loss: 0.5136477596173852
Epoch: 4, Training Loss: 0.391089590796589, Validation Loss: 0.4557222862870006
Epoch: 5, Training Loss: 0.36087735236561214, Validation Loss: 0.5070311331142814
Epoch: 6, Training Loss: 0.3453608222686461, Validation Loss: 0.45424947900287177
Epoch: 7, Training Loss: 0.33872245774216897, Validation Loss: 0.45070389115204246
Epoch: 8, Training Loss: 0.3335897464604273, Validation Loss: 0.44812590035341554
Epoch: 9, Training Loss: 0.3304089866850498, Validation Loss: 0.4524917466155553
Epoch: 10, Training Loss: 0.32783904377996487, Validation Loss: 0.463598412729926
Epoch: 11, Training Loss: 0.32736551641982836, Validation Loss: 0.49507346370462646
Epoch: 12, Training Loss: 0.3249196514596034, Validation Loss: 0.4505924875453367
Epoch: 13, Training

In [None]:
%load_ext tensorboard
%tensorboard --logdir runs

![](imgs/rnn_epoch_loss.png)

![](imgs/rnn_train_loss.png)

![](imgs/rnn_val_loss.png)

### Calculate performance of the trained model (5 points)

In [None]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, precision_recall_fscore_support

In [44]:
model.eval()
y_pred, y_true = [], []

for batch in test_iter:
    x = batch.text
    y = batch.label

    preds = model(x).argmax(axis=1).cpu().numpy()

    y_pred.append(preds)
    y_true.append(y.cpu().numpy())

y_true = np.concatenate(y_true)
y_pred = np.concatenate(y_pred)

accuracy = accuracy_score(y_true, y_pred)
precision, recall, f1, support = precision_recall_fscore_support(y_true, y_pred)
print('accuracy =', accuracy, '\nprecision =', precision[1], '\nrecall =', recall[1], '\nf1 =', f1[1])

accuracy = 0.84396 
precision = 0.8192144925384216 
recall = 0.88272 
f1 = 0.8497824329007663


Write down the calculated performance

### Accuracy: 0.8440
### Precision: 0.8192
### Recall: 0.8827
### F1: 0.8498

### Experiments (10 points)

Experiment with the model and achieve better results. You can find advices [here](https://arxiv.org/abs/1801.06146). Implement and describe your experiments in details, mention what was helpful.

### 1. ?
### 2. ?
### 3. ?