## Part 2. Model Training & Evaluation - RNN   
Now with the pretrained word embeddings acquired from Part 1 and the dataset acquired from
Part 0, you need to train a deep learning model for sentiment classification using the training set,
conforming to these requirements:


• Use the pretrained word embeddings from Part 1 as inputs; do not update them during training
(they are “frozen”).   

• Design a simple recurrent neural network (RNN), taking the input word embeddings, and
predicting a sentiment label for each sentence. To do that, you need to consider how to
aggregate the word representations to represent a sentence.   

• Use the validation set to gauge the performance of the model for each epoch during training.
You are required to use accuracy as the performance metric during validation and evaluation. 
   
• Use the mini-batch strategy during training. You may choose any preferred optimizer (e.g.,
SGD, Adagrad, Adam, RMSprop). Be careful when you choose your initial learning rate and
mini-batch size. (You should use the validation set to determine the optimal configuration.)
Train the model until the accuracy score on the validation set is not increasing for a few
epochs.
   
• Evaluate your trained model on the test dataset, observing the accuracy score.
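
The stopping rule in the training bullet above (stop once validation accuracy has not improved for a few epochs) could be sketched as a small patience counter. `EarlyStopper`, `patience`, and `min_delta` are hypothetical names, not part of the assignment or of PyTorch, and the accuracy values below are made-up numbers used only to exercise the logic.

```python
class EarlyStopper:
    """Hypothetical helper: signal a stop after `patience` epochs without improvement."""

    def __init__(self, patience: int = 5, min_delta: float = 0.0):
        self.patience = patience      # epochs to wait after the last improvement
        self.min_delta = min_delta    # minimum gain that counts as an improvement
        self.best_acc = float("-inf")
        self.best_epoch = -1
        self.bad_epochs = 0

    def step(self, epoch: int, val_acc: float) -> bool:
        """Record one epoch's validation accuracy; return True when training should stop."""
        if val_acc > self.best_acc + self.min_delta:
            self.best_acc = val_acc
            self.best_epoch = epoch
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience


# Made-up validation accuracies, only to show the control flow
simulated_val_acc = [0.60, 0.65, 0.70, 0.69, 0.70, 0.68]
stopper = EarlyStopper(patience=3)
for epoch, acc in enumerate(simulated_val_acc):
    if stopper.step(epoch, acc):
        print(f"stopping at epoch {epoch}; best epoch was {stopper.best_epoch}")
        break
```

In a real run, `val_acc` would come from the validation pass of each epoch, and the model weights from the best epoch would be the ones evaluated on the test set.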

In [None]:
import os
import json
import torch
import torch.nn as nn
import torch.nn.functional as F
import numpy as np
import time
from datasets import load_dataset
from common_utils import load_glove_embeddings, set_seed, EmbeddingMatrix
import nltk

In [None]:
# set seed 
set_seed()

In [None]:
# initialize hyperparameters
BATCH_SIZE = 32
INPUT_SIZE = 100  # word embedding size
HIDDEN_SIZE = 128  # starting value; tune on the validation set
NUM_EPOCHS = 100
EMBEDDING_DIM = 100  # must match the pretrained embedding size
LEARNING_RATE = 0.01
GRADIENT_CLIP = 5  # maximum gradient norm used for clipping

In [None]:
# initialize word embeddings
word_embeddings = EmbeddingMatrix.load()
word_embeddings.add_padding()

print("The index of <PAD> is: ", word_embeddings.pad_idx)

In [None]:
word_embeddings.to_tensor[word_embeddings.pad_idx]

In [None]:
class RNN(nn.Module):

    def __init__(
        self,
        hidden_dim: int,
        embedding_dim: int,
        word_embeddings: torch.Tensor,
        pad_idx,
        num_layers=1,
        output_size=1,
        dropout_rate=0
    ):
        super().__init__()
        self.hidden_dim = hidden_dim
        self.num_layers = num_layers
        self.dropout_rate = dropout_rate
        self.embedding = nn.Embedding.from_pretrained(word_embeddings, freeze=True, padding_idx=pad_idx)
        # input size = embedding_dim; batch_first=True means inputs are (batch, seq_len, embedding_dim)
        self.rnn = nn.RNN(
            embedding_dim, hidden_dim, num_layers, batch_first=True
        )
        self.fc = nn.Linear(hidden_dim, output_size)
        self.sigmoid = nn.Sigmoid()
        
        self.dropout = nn.Dropout(dropout_rate)

    def forward(self, x):
        h0 = torch.zeros(self.num_layers, x.size(0), self.hidden_dim).to(x.device)
        
        if self.dropout_rate > 0:
            embedded = self.dropout(self.embedding(x)).float()
        else:
            embedded = self.embedding(x).float()
        
        out, _ = self.rnn(embedded, h0)
        
        if self.dropout_rate > 0:
            out = self.dropout(out)
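        # Aggregate the per-token outputs into one sentence vector
        if self.num_layers > 1:
            # Stacked RNN: max-pool over the time dimension
            # (padded positions are included in the max)
            out, _ = torch.max(out, 1)
        else:
            # Single layer: take the output at the last real (non-padding) token;
            # out[:, -1, :] would read a padded position for shorter sentences
            lengths = (x != self.embedding.padding_idx).sum(dim=1).clamp(min=1)
            out = out[torch.arange(out.size(0), device=out.device), lengths - 1]

        out = self.fc(out)  # map the sentence vector to a single score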
        sig_out = self.sigmoid(out)
        return sig_out
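
The aggregation step in `forward` (turning one hidden vector per word into one vector per sentence) can be made padding-aware in several ways. The sketch below is one possible formulation, not the notebook's own implementation: `masked_pool` and its `mode` argument are hypothetical names, and it assumes right-padded batches such as those produced by `pad_sequence`.

```python
import torch


def masked_pool(outputs: torch.Tensor, token_ids: torch.Tensor, pad_idx: int, mode: str = "last") -> torch.Tensor:
    """Hypothetical helper: reduce (batch, seq_len, hidden) RNN outputs to (batch, hidden), ignoring padding."""
    mask = token_ids != pad_idx                   # (batch, seq_len), True at real tokens
    lengths = mask.sum(dim=1).clamp(min=1)        # clamp guards against all-padding rows
    if mode == "last":
        # Output at the last real token of each sentence (assumes right padding)
        idx = (lengths - 1).view(-1, 1, 1).expand(-1, 1, outputs.size(2))
        return outputs.gather(1, idx).squeeze(1)
    if mode == "mean":
        # Average over real tokens only
        summed = (outputs * mask.unsqueeze(-1)).sum(dim=1)
        return summed / lengths.unsqueeze(-1)
    if mode == "max":
        # Padded positions are set to -inf so they can never win the max
        # (an all-padding row would therefore yield -inf)
        filled = outputs.masked_fill(~mask.unsqueeze(-1), float("-inf"))
        return filled.max(dim=1).values
    raise ValueError(f"unknown mode: {mode}")


# Toy batch: the second sentence has one padded position holding a large dummy value
outputs = torch.tensor([[[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]],
                        [[7.0, 8.0], [9.0, 10.0], [99.0, 99.0]]])
token_ids = torch.tensor([[4, 5, 6], [7, 8, 0]])
for mode in ("last", "mean", "max"):
    print(mode, masked_pool(outputs, token_ids, pad_idx=0, mode=mode))
```

Comparing the three modes on the validation set is one way to gather the evidence asked for in Question 2(c).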

In [None]:
# load dataset from huggingface first 
dataset = load_dataset("rotten_tomatoes")
train_dataset = dataset['train']
validation_dataset = dataset['validation']
test_dataset = dataset['test']

In [None]:
# create train, validate and test datasets and dataloaders
from torch.utils.data import Dataset, DataLoader
from torch.nn.utils.rnn import pad_sequence

class EmbeddingsDataset(Dataset):
    def __init__(self, X, y, word_embeddings:EmbeddingMatrix =word_embeddings):
        self.word_embeddings = word_embeddings
        self.X = X # train_dataset['text']
        self.y = y # train_dataset['label']
        self.len = len(self.X)

    def __getitem__(self, index):
        # tokenize the sentence
        tokens = self.tokenize_sentence(self.X[index])
        return tokens, self.y[index] 

    def __len__(self):
        return self.len 

    def tokenize_sentence(self, x):
        """Return the vocabulary indices of the tokens in x, skipping out-of-vocabulary tokens."""
        tokens = nltk.word_tokenize(x)
        # look each token up once; tokens whose index is None (not in the vocabulary) are skipped
        indices = (self.word_embeddings.get_idx(token) for token in tokens)
        return [idx for idx in indices if idx is not None]


def pad_collate(batch, pad_value):
    (xx, yy) = zip(*batch)
    # convert xx to a tensor
    xx = [torch.tensor(x, dtype=torch.int64) for x in xx]
    xx_pad = pad_sequence(xx, batch_first=True, padding_value=pad_value)
    return xx_pad, torch.tensor(yy, dtype=torch.long)
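
To make the effect of the collate step concrete, the following sketch pads a toy batch of index sequences and also keeps the true lengths, which a model can use later to ignore padded positions. `pad_batch` is a hypothetical helper, and the token indices and the padding index 0 are arbitrary values chosen for illustration.

```python
import torch
from torch.nn.utils.rnn import pad_sequence


def pad_batch(sequences, pad_idx):
    """Hypothetical helper: right-pad index lists to a common length and return the true lengths."""
    tensors = [torch.tensor(seq, dtype=torch.int64) for seq in sequences]
    lengths = torch.tensor([len(seq) for seq in sequences])
    padded = pad_sequence(tensors, batch_first=True, padding_value=pad_idx)
    return padded, lengths


# Three sentences of different lengths (arbitrary token indices)
padded, lengths = pad_batch([[4, 9, 2], [7], [3, 5]], pad_idx=0)
print(padded)   # shape (3, 3); shorter rows are filled with pad_idx on the right
print(lengths)  # original length of each sentence
```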

In [None]:
train_dataset_ed = EmbeddingsDataset(
    train_dataset["text"], train_dataset["label"]
)
validation_dataset_ed = EmbeddingsDataset(
    validation_dataset["text"], validation_dataset["label"]
)
test_dataset_ed = EmbeddingsDataset(test_dataset["text"], test_dataset["label"])

pad_value = word_embeddings.pad_idx
# implement minibatch training
train_dataloader = DataLoader(
    train_dataset_ed,
    batch_size=BATCH_SIZE,
    shuffle=True,
    collate_fn=lambda x: pad_collate(x, pad_value),
)
# validation and test data do not need shuffling; a fixed order keeps evaluation reproducible
validation_dataloader = DataLoader(
    validation_dataset_ed,
    batch_size=BATCH_SIZE,
    shuffle=False,
    collate_fn=lambda x: pad_collate(x, pad_value),
)
test_dataloader = DataLoader(
    test_dataset_ed,
    batch_size=BATCH_SIZE,
    shuffle=False,
    collate_fn=lambda x: pad_collate(x, pad_value),
)

In [None]:
# obtain one batch of training data
dataiter = iter(train_dataloader)
sample_x, sample_y = next(dataiter)

print('Sample input size: ', sample_x.size()) # batch_size, seq_length
print('Sample input: \n', sample_x)
print()
print('Sample label size: ', sample_y.size()) # batch_size
print('Sample label: \n', sample_y)

In [None]:
# obtain one batch of validation data
dataiter = iter(validation_dataloader)
sample_x, sample_y = next(dataiter)

print('Sample input size: ', sample_x.size()) # batch_size, seq_length
print('Sample input: \n', sample_x)
print()
print('Sample label size: ', sample_y.size()) # batch_size
print('Sample label: \n', sample_y)

In [None]:
def train_loop(train_dataloader, model, loss_fn, optimizer):
    model.train()
    num_batches = len(train_dataloader)
    size = len(train_dataloader.dataset)
    train_loss, train_correct = 0, 0
    for batch_no, (X_batch, y_batch) in enumerate(train_dataloader):
        # Forward pass
        pred = model(X_batch)
        pred = pred.squeeze(1)

        # Binary cross-entropy between predicted probabilities and 0/1 labels
        # (labels are never padded, so no mask is needed here)
        loss = loss_fn(pred, y_batch.float())

        train_loss += loss.item()
        train_correct += ((pred >= 0.5).long() == y_batch).sum().item()

        # Backpropagation
        optimizer.zero_grad()
        loss.backward()
        # Clip the gradient norm to guard against exploding gradients
        torch.nn.utils.clip_grad_norm_(model.parameters(), GRADIENT_CLIP)
        optimizer.step()
        
    train_loss /= num_batches 
    train_correct /= size 

    return train_loss, train_correct 
   

def test_loop(validate_dataloader, model, loss_fn):
    model.eval()
    num_batches = len(validate_dataloader)
    size = len(validate_dataloader.dataset)
    test_loss, test_correct = 0, 0

    with torch.no_grad():
        for X_batch, y_batch in validate_dataloader:

            pred = model(X_batch)
            pred = pred.squeeze(1)
            pred_binary = (pred >= 0.5).long()
            
            # Same loss as in training; labels carry no padding, so no mask is needed
            loss = loss_fn(pred, y_batch.float())

        
            test_loss += loss.item()
            test_correct += (pred_binary == y_batch).sum().item()

    test_loss /= num_batches
    test_correct /= size
    return test_loss, test_correct

In [None]:
basic_RNN = RNN(
    hidden_dim=HIDDEN_SIZE,
    embedding_dim=EMBEDDING_DIM,
    word_embeddings=word_embeddings.to_tensor,
    pad_idx=word_embeddings.pad_idx,
    num_layers=1,
)
optim = torch.optim.Adam(basic_RNN.parameters(), lr=LEARNING_RATE)
# scheduler = torch.optim.lr_scheduler.StepLR(optim, step_size=5, gamma=0.9)
scheduler = torch.optim.lr_scheduler.LinearLR(optim, start_factor=1.0, end_factor=0.01, total_iters=100)
criterion = nn.BCELoss()

# stacked RNN
stacked_RNN = RNN(
    hidden_dim=HIDDEN_SIZE,
    embedding_dim=EMBEDDING_DIM,
    word_embeddings=word_embeddings.to_tensor,
    pad_idx=word_embeddings.pad_idx,
    num_layers=3,
)
optim_stacked_rnn = torch.optim.Adam(stacked_RNN.parameters(), lr=LEARNING_RATE)
criterion_stacked_rnn = nn.BCELoss()
scheduler_stacked_rnn = torch.optim.lr_scheduler.StepLR(optim_stacked_rnn, step_size=5, gamma=0.9)

# RNN with dropout
RNN_dropout = RNN(
    hidden_dim=HIDDEN_SIZE,
    embedding_dim=EMBEDDING_DIM,
    word_embeddings=word_embeddings.to_tensor,
    pad_idx=word_embeddings.pad_idx,
    num_layers=1,
    dropout_rate=0.5,
)

optim_rnn_dropout = torch.optim.Adam(RNN_dropout.parameters(), lr=LEARNING_RATE)
criterion_rnn_dropout = nn.BCELoss()
scheduler_rnn_dropout = torch.optim.lr_scheduler.StepLR(optim_rnn_dropout, step_size=5, gamma=0.9)

# Testing the model, just using epoch = 100 

In [None]:
import matplotlib.pyplot as plt
    
def evaluate_model(train_dataloader, validation_dataloader, model, criterion, optim, scheduler):
    """Train for NUM_EPOCHS, plot both accuracy curves, and return the per-epoch histories."""
    validation_acc = []
    train_acc = []
    for i in range(NUM_EPOCHS):
        train_loss, train_correct = train_loop(
            train_dataloader, model, criterion, optim
        )
        validate_loss, validate_correct = test_loop(
            validation_dataloader, model, criterion
        )
        validation_acc.append(validate_correct)
        train_acc.append(train_correct)
        scheduler.step()
        if i % 10 == 0:
            print(
                f"Epoch:{i+1} \tValidation Acc:{validate_correct} \tTrain Acc:{train_correct} \tLearning rate:{scheduler.get_last_lr()}"
            )

    plt.plot(train_acc, label="train acc")
    plt.plot(validation_acc, label="validation acc")

    plt.xlabel("epoch")
    plt.ylabel("accuracy")
    plt.title("train vs validation accuracy")
    plt.legend()
    plt.show()
    return train_acc, validation_acc

In [None]:
train_acc, validation_acc = evaluate_model(train_dataloader, validation_dataloader, basic_RNN, criterion, optim, scheduler)

In [None]:
train_acc_stacked, validation_acc_stacked = evaluate_model(train_dataloader, validation_dataloader, stacked_RNN, criterion_stacked_rnn, optim_stacked_rnn, scheduler_stacked_rnn)

In [None]:
train_acc_dropout, validation_acc_dropout = evaluate_model(train_dataloader, validation_dataloader, RNN_dropout, criterion_rnn_dropout, optim_rnn_dropout, scheduler_rnn_dropout)

In [None]:
# Re-plot the basic RNN curves from the histories returned by evaluate_model
plt.plot(train_acc, label="train acc")
plt.plot(validation_acc, label="validation acc")

plt.xlabel("epoch")
plt.ylabel("accuracy")
plt.title("train vs validation accuracy")
plt.legend()
plt.show()

# Hyperparameter Tuning
We will perform a grid search over the number of training epochs, learning rate, optimizer, and batch size.
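
One way such a grid search could be organised is sketched below. `grid_search` is a hypothetical helper, the grid values are illustrative assumptions rather than the configurations actually tried, and `toy_train_and_validate` is a stand-in that returns a made-up score so the sketch runs without training anything. In practice that callable would build the model, train it with the given configuration, and return its validation accuracy.

```python
from itertools import product


def grid_search(train_and_validate, grid):
    """Hypothetical helper: try every combination in `grid`, return (best_config, best_val_acc)."""
    keys = list(grid)
    best_config, best_acc = None, float("-inf")
    for values in product(*(grid[key] for key in keys)):
        config = dict(zip(keys, values))
        val_acc = train_and_validate(**config)   # selection uses validation accuracy only
        if val_acc > best_acc:
            best_config, best_acc = config, val_acc
    return best_config, best_acc


# Illustrative search space (assumed values, not results)
grid = {
    "lr": [0.01, 0.001, 0.0001],
    "batch_size": [32, 64],
    "optimizer": ["adam", "sgd"],
}


def toy_train_and_validate(lr, batch_size, optimizer):
    """Stand-in scoring function with made-up numbers; replace with real training."""
    score = 0.5
    score += 0.10 if optimizer == "adam" else 0.0
    score += 0.05 if lr == 0.001 else 0.0
    score += 0.01 if batch_size == 64 else 0.0
    return score


best_config, best_acc = grid_search(toy_train_and_validate, grid)
print(best_config, best_acc)
```

The number of epochs does not need its own grid dimension if training stops once validation accuracy plateaus; the epoch with the best validation score can then be reported as part of the final configuration.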

Question 2. RNN
(a) Report the final configuration of your best model, namely the number of training epochs,
learning rate, optimizer, batch size.
(b) Report the accuracy score on the test set, as well as the accuracy score on the validation
set for each epoch during training.
(c) RNNs produce a hidden vector for each word, instead of the entire sentence. Which methods
have you tried in deriving the final sentence representation to perform sentiment classification?
Describe all the strategies you have implemented, together with their accuracy scores on the
test set.

In [None]:
test_acc = [] 