# KAIST AI605 Assignment 1: Text Classification with RNNs
Authors: Hyeong-Gwon Hong (honggudrnjs@kaist.ac.kr) and Minjoon Seo (minjoon@kaist.ac.kr)

**Due Date:** March 31 (Wed) 11:00pm, 2021

## Assignment Objectives
- Verify theoretically and empirically why gating mechanisms (LSTM, GRU) help in Recurrent Neural Networks (RNNs).
- Design an LSTM-based text classification model from scratch using PyTorch.
- Apply the classification model to a popular classification task, Stanford Sentiment Treebank v2 (SST-2).
- Achieve higher accuracy by applying common machine learning strategies, including Dropout.
- Utilize pretrained word embedding, GloVe, to leverage self-supervision over a large text corpus.
- (Bonus) Use Hugging Face library (`transformers`) to leverage self-supervision via large language models.

## Your Submission
Your submission will be a link to a CoLab notebook that has all written answers and is fully executable. Use in-line LaTeX (see below) for mathematical expressions. Collaboration among students is allowed, but this is not a group assignment, so make sure your answers and code are your own.

## Grading
The entire assignment is out of 100 points. There are two bonus questions with 10 points each.


## Environment
You will only use Python 3.7 and PyTorch 1.8, which is already available:

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
from platform import python_version
import torch
import os
import pandas as pd
import re
from torch import nn
import numpy as np
import math

print("python", python_version())
print("torch", torch.__version__)

python 3.7.10
torch 1.8.1+cu101


## 1. Limitations of Vanilla RNNs
In Lecture 04 and 05, we saw how RNNs suffer from exploding or vanishing gradients. We mathematically showed that, if the recurrent relation is
$$ \textbf{h}_t = \sigma (\textbf{V}\textbf{h}_{t-1} + \textbf{U}\textbf{x}_t + \textbf{b}) $$
then
$$ \frac{\partial \textbf{h}_t}{\partial \textbf{h}_{t-1}} = \text{diag}(\sigma' (\textbf{V}\textbf{h}_{t-1} + \textbf{U}\textbf{x}_t + \textbf{b}))\textbf{V}$$
so
$$\frac{\partial \textbf{h}_T}{\partial \textbf{h}_1} \propto \textbf{V}^{T-1}$$
which means this term will be very close to zero if the norm of $\bf{V}$ is smaller than 1, and can become very large if the norm is larger than 1.
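As a rough illustration of the $\textbf{V}^{T-1}$ factor, the sketch below uses a scalar recurrence, so that $\textbf{V}$ is a single number `v`. It assumes the activation derivative is 1 (the nonlinearity is ignored), and the helper name `gradient_scale` is hypothetical, introduced only for this example.

```python
def gradient_scale(v: float, num_steps: int) -> float:
    """Magnitude of d h_T / d h_1 for the scalar linear recurrence h_t = v * h_{t-1}.

    Assumption: sigma' is treated as 1, so this is only the V^(T-1)
    factor from the text, not the full RNN Jacobian.
    """
    return abs(v) ** (num_steps - 1)


# Compare a contracting, a neutral, and an expanding recurrent weight.
for v in (0.9, 1.0, 1.1):
    print(f"v={v}: scale after T=101 steps = {gradient_scale(v, 101):.3e}")
```

Even a weight only slightly below or above 1 changes the gradient scale by several orders of magnitude over 100 steps.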




**Problem 1.1** *(10 points)* Explain how exploding gradient can be mitigated if we use gradient clipping.

<font color='blue'> **Solution:** Gradient clipping is based on a simple idea: if the norm of the gradient $g$ exceeds some threshold $c$, the gradient is rescaled so that its norm equals $c$ while its direction is preserved:
$$g \leftarrow c \cdot \frac{g}{||g||} \quad \text{if } ||g|| > c$$
</font>

<font color='blue'>This rule ensures that the gradient norm is at most $c$. When gradients explode, clipping bounds the size of each descent step, which keeps training stable. A common variant checks each component of the gradient against the threshold and clips it to that value whenever it is exceeded.</font>
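The clipping rule above can be sketched in plain Python as follows. This is a minimal illustration on a list of floats, and `clip_by_norm` is a hypothetical helper name. In PyTorch, the built-in `torch.nn.utils.clip_grad_norm_` serves the same purpose and is typically called between `loss.backward()` and `optimizer.step()`.

```python
import math
from typing import List


def clip_by_norm(grad: List[float], c: float) -> List[float]:
    """Rescale `grad` so that its L2 norm is at most `c`, keeping its direction."""
    norm = math.sqrt(sum(g * g for g in grad))
    if norm <= c:
        # Small gradients are left untouched.
        return list(grad)
    scale = c / norm
    return [g * scale for g in grad]


print(clip_by_norm([3.0, 4.0], c=1.0))  # norm 5 is rescaled down to norm 1
print(clip_by_norm([0.3, 0.4], c=1.0))  # norm 0.5 is below the threshold
```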

**Problem 1.2** *(10 points)* Explain how vanishing gradient can be mitigated if we use LSTM. See the Lecture 04 and 05 slides for the definition of LSTM.

<font color='blue'>The cell-state gradient is a sum of four terms, as shown in the equation below.
$$\begin{aligned}
\frac{\partial c_{t}}{\partial c_{t-1}}=& \sigma^{\prime}\left(W_{f} \cdot\left[h_{t-1}, x_{t}\right]\right) \cdot W_{f} \cdot o_{t-1} \otimes \tanh ^{\prime}\left(c_{t-1}\right) \cdot c_{t-1} \\
&+f_{t} \\
&+\sigma^{\prime}\left(W_{i} \cdot\left[h_{t-1}, x_{t}\right]\right) \cdot W_{i} \cdot o_{t-1} \otimes \tanh ^{\prime}\left(c_{t-1}\right) \cdot \tilde{c}_{t} \\
&+\tanh^{\prime}\left(W_{c} \cdot\left[h_{t-1}, x_{t}\right]\right) \cdot W_{c} \cdot o_{t-1} \otimes \tanh ^{\prime}\left(c_{t-1}\right) \cdot i_{t}
\end{aligned}$$
This additive structure enables better balancing of gradient values during backpropagation. The LSTM can adjust the four terms through its gates, which makes it more likely that the sum does not vanish.</font>

<font color='blue'> Two factors primarily determine the magnitude of the backpropagated gradients: the recurrent weights and the derivatives of the activation functions.
If either of these factors is smaller than 1, the gradients may vanish over time (the vanishing gradient problem). If either of these factors is larger than 1, the gradients may explode (the exploding gradient problem).
In the cell-state recurrence of the LSTM, the activation function is the identity function, whose derivative is 1.0. So the activation neither shrinks nor amplifies the backpropagated gradient.
The effective weight of the recurrence is equal to the forget gate activation. So, if the forget gate is open (activation close to 1.0), the gradient along this path does not vanish. Since the forget gate activation never exceeds 1.0, the gradient along this path cannot explode either.</font>
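To make the role of the forget gate concrete, the sketch below multiplies forget-gate activations along the direct path from $c_1$ to $c_T$. It is a simplification that keeps only the $f_t$ term of the cell-state gradient and ignores the other three terms. The helper name `cell_path_gradient` and the gate values are illustrative assumptions.

```python
def cell_path_gradient(forget_gates):
    """Product of forget-gate activations along the direct c_{t-1} -> c_t path.

    Assumption: only the f_t term of the cell-state gradient is kept.
    """
    grad = 1.0
    for f in forget_gates:
        grad *= f
    return grad


# Gates that stay almost fully open preserve the gradient over 100 steps.
print(cell_path_gradient([0.99] * 100))
# Half-closed gates let the gradient decay quickly.
print(cell_path_gradient([0.5] * 100))
```

Because every factor lies in $[0, 1]$, the product can never exceed 1, which matches the argument that this path cannot explode.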



## 2. Creating Vocabulary from Training Data
Creating the vocabulary is the first step for every natural language processing model. In this section, you will use Stanford Sentiment Treebank v2, a popular dataset for sentiment classification, to create your vocabulary.

### Obtaining SST-2 via GLUE
The General Language Understanding Evaluation (GLUE) benchmark is a collection of tools for evaluating the performance of models across a diverse set of existing natural language understanding (NLU) tasks. From the GLUE website (https://gluebenchmark.com/), you can access the GLUE paper (https://openreview.net/pdf?id=rJ4km2R5t7) and the GitHub repository for GLUE baselines (https://github.com/nyu-mll/GLUE-baselines).

You can download SST-2 dataset by following the steps below:

1. Clone GitHub repository:

In [None]:
!git clone https://github.com/nyu-mll/GLUE-baselines.git

Cloning into 'GLUE-baselines'...
remote: Enumerating objects: 891, done.
remote: Counting objects: 100% (5/5), done.
remote: Compressing objects: 100% (5/5), done.
remote: Total 891 (delta 1), reused 2 (delta 0), pack-reused 886
Receiving objects: 100% (891/891), 1.48 MiB | 5.33 MiB/s, done.
Resolving deltas: 100% (610/610), done.


2. Download dataset:

In [None]:
%cd GLUE-baselines/
!python download_glue_data.py --data_dir glue_data --tasks SST

/content/GLUE-baselines/GLUE-baselines
Downloading and extracting SST...
	Completed!


Your training, dev, and test data can be found at `glue_data/SST-2`. Note that each file is in a tsv format, where the first column is the sentence and the second column is the label (either 0 or 1, where 1 means positive review). 

In [None]:
# Splitting the training dataset to train, val set and using validation data for testing
tsv_read = pd.read_csv('glue_data/SST-2/train.tsv', sep='\t')[:65000]
val_tsv_read = pd.read_csv('glue_data/SST-2/train.tsv', sep='\t')[65000:]
test_tsv_read = pd.read_csv('glue_data/SST-2/dev.tsv', sep='\t')
print(len(tsv_read),len(val_tsv_read), len(test_tsv_read))

65000 2349 872


In [None]:
val_tsv_read.head()

Unnamed: 0,sentence,label
65000,content merely to lionize its title character ...,0
65001,"a subtle , humorous , illuminating study",1
65002,a solid and refined piece of moviemaking,1
65003,as padded as allen 's jelly belly,0
65004,having so much fun,1


**Problem 2.1** *(10 points)* Using space tokenizer, create the vocabulary for the training data and report the vocabulary size here. Make sure that you add an `UNK` token to the vocabulary to account for words (during inference time) that you haven't seen. See below for an example with a short text.
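A minimal sketch of the idea on a short text, assuming a plain space tokenizer and an `UNK` entry at id 0. The names `build_vocab`, `encode`, and `toy_word2id` are hypothetical and used only in this example.

```python
def build_vocab(sentences):
    """Space-tokenize the sentences and map each distinct token to an integer id."""
    vocab = ["UNK"]  # id 0 is reserved for unseen words
    seen = set(vocab)
    for sentence in sentences:
        for token in sentence.split(" "):
            if token and token not in seen:
                seen.add(token)
                vocab.append(token)
    return {word: idx for idx, word in enumerate(vocab)}


def encode(sentence, word2id):
    """Map a sentence to ids, falling back to UNK for out-of-vocabulary tokens."""
    return [word2id.get(tok, word2id["UNK"]) for tok in sentence.split(" ") if tok]


toy_word2id = build_vocab(["hi world !", "hi there"])
print(toy_word2id)                             # {'UNK': 0, 'hi': 1, 'world': 2, '!': 3, 'there': 4}
print(encode("hi unseen world", toy_word2id))  # [1, 0, 2]
```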

#### $\color{blue}{\text{Solution 2.1}}$


In [None]:
# Space tokenization
text = list(tsv_read['sentence'])
tokens = []
for sentence in text:
  # Split on single spaces and drop the empty strings that repeated spaces produce
  tokens.extend(token for token in sentence.split(" ") if token)

print("Number of reviews in the training dataset:", len(text))
print("Total number of tokens in the reviews from training dataset:", len(tokens))

Number of reviews in the training dataset: 65000
Total number of tokens in the reviews from training dataset: 612161


In [None]:
# Constructing vocabulary from train data using space tokenizer and adding 'UNK' token
vocab = ['UNK'] + ['PAD'] + list(set(tokens))
word2id = {word: id_ for id_, word in enumerate(vocab)}
print("Vocabulary size:", len(vocab))
print(word2id["UNK"])

Vocabulary size: 14799
0


<font color='blue'> In the cells above, I created the vocabulary for the training data using a space tokenizer. The vocabulary size is **14799** words in this case, including the `UNK` and `PAD` tokens.</font>

**Problem 2.2** *(10 points)* Using all words in the training data will make the vocabulary very big. Reduce its size by only including words that occur at least 2 times. How does the size of the vocabulary change?

#### $\color{blue}{\text{Solution 2.2}}$

In [None]:
# Using the tokens generated in Problem 2.1 
print("Number of reviews in the training dataset:", len(text))
print("Number of tokens in the reviews from training dataset:", len(tokens))

# Finding the frequency of occurrence of each token in the reviews
token_counts = {}
tokens_final = []

for token in tokens:
    if token in token_counts:
        token_counts[token] += 1
    else:
        token_counts[token] = 1

# Including only those tokens that occurred at least 2 times
for token in token_counts.keys():
    if token_counts[token] >= 2:
        tokens_final.append(token)

print("\nNumber of tokens that occurred at least 2 times in the training data :", len(tokens_final))

Number of reviews in the training dataset: 65000
Number of tokens in the reviews from training dataset: 612161

Number of tokens that occurred at least 2 times in the training data : 14259


In [None]:
# Constructing vocabulary using space tokenizer and adding 'UNK' and 'PAD' tokens
vocab = ['UNK'] + ['PAD'] + list(set(tokens_final))
word2id = {word: id_ for id_, word in enumerate(vocab)}
print("Vocabulary size after including tokens with frequency of at least 2", len(vocab))

Vocabulary size after including tokens with frequency of at least 2 14261


<font color='blue'> In the cells above, I created the vocabulary for the training data using a space tokenizer. The vocabulary size after keeping only tokens that occur at least two times in the dataset is **14261** words. This vocabulary has **538 fewer words** than the vocabulary with all words included. Although this is not a significant reduction in the size of the vocabulary, keeping only words that occur at least two times in the training dataset is helpful, because many of the dropped words are uninformative or typos and including them in the vocabulary is not important. </font>
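The frequency filter can also be written more compactly with `collections.Counter` from the standard library. The sketch below is an alternative formulation on toy data, not the code used for the numbers above, and `frequent_tokens` is a hypothetical helper name.

```python
from collections import Counter


def frequent_tokens(tokens, min_count=2):
    """Return the distinct tokens that occur at least `min_count` times."""
    counts = Counter(tokens)
    return [tok for tok, n in counts.items() if n >= min_count]


toy_tokens = "a good movie , a bad movie , a film".split(" ")
kept = frequent_tokens(toy_tokens, min_count=2)
print(kept)  # tokens seen only once ('good', 'bad', 'film') are dropped
```

Any token that is filtered out this way is later mapped to `UNK` when a sentence is converted to ids.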

## 3. Text Classification Baselines

You can now use the vocabulary constructed from the training data to create an embedding matrix. You will use the embedding matrix to map each input sequence of tokens to a list of embedding vectors. One of the simplest baselines is to pass the embeddings through one neural network layer, average the outputs, and finally classify the averaged vector:

In [None]:
def tokenization(text):
  """Space-tokenize each sentence and map tokens to vocabulary ids (unseen words map to 'UNK')."""
  tokens_per_sent = []
  input_ids = []
  for sentence in text:
    # Split on single spaces and drop the empty strings that repeated spaces produce
    temp_tokens = [token for token in sentence.split(" ") if token]
    tokens_per_sent.append(temp_tokens)
    input_ids.append([word2id.get(word, word2id['UNK']) for word in temp_tokens])
  return tokens_per_sent, input_ids

text = tsv_read['sentence']
labels = list(tsv_read['label'])
text = list(text.str.lower())
tokens, token_ids = tokenization(text)
print("Number of reviews in the training dataset:", len(text))

val_text = val_tsv_read['sentence']
val_labels = list(val_tsv_read['label'])
val_text = list(val_text.str.lower())
val_tokens, val_token_ids = tokenization(val_text)
print("Number of reviews in the validation dataset:", len(val_text))

Number of reviews in the training dataset: 65000
Number of reviews in the validation dataset: 2349


In [None]:
# One layer, average pooling and classification
class Baseline(nn.Module):
  def __init__(self, d):
    super(Baseline, self).__init__()
    self.embedding = nn.Embedding(len(vocab), d)
    self.layer = nn.Linear(d, d, bias=True)
    self.relu = nn.ReLU()
    self.class_layer = nn.Linear(d, 2, bias=True)

  def forward(self, input_tensor):
    emb = self.embedding(input_tensor)
    out = self.relu(self.layer(emb))
    avg = out.mean(1)
    logits = self.class_layer(avg)
    return logits

Now we will compute the loss, which is the negative log probability of the input text's label being the target label (`1`). This turns out to be equivalent to the cross entropy (https://en.wikipedia.org/wiki/Cross_entropy) between the predicted probability distribution and a one-hot distribution of the target label. Note that we use `logits` instead of `softmax(logits)` as the input to the cross entropy, which allows us to avoid numerical instability.

In [None]:
# cel = nn.CrossEntropyLoss()
# label = torch.LongTensor([1]) # The ground truth label for "hi world!" is positive.
# loss = cel(logits, label) # Loss, a.k.a L
# print(loss)
# print(logits, label)

Once we have the loss defined, only one step remains! We compute the gradients of the loss with respect to the parameters and update the parameters. Fortunately, PyTorch does this for us in a very convenient way. Note that we used only one example to update the model, which is basically Stochastic Gradient Descent (SGD) with a minibatch size of 1. A recommended minibatch size in this exercise is at least 16. It is also recommended that you reuse your training data at least 10 times (i.e. 10 *epochs*).
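Moving from a minibatch size of 1 to 16 or more requires sequences of equal length within a batch. A minimal sketch of one way to do this is shown below, assuming a dedicated padding id is available. The helper names `pad_batch` and `minibatches` are hypothetical, and each padded batch could then be wrapped in `torch.LongTensor`.

```python
def pad_batch(sequences, pad_id):
    """Right-pad every id sequence with `pad_id` up to the longest length in the batch."""
    max_len = max(len(seq) for seq in sequences)
    return [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]


def minibatches(data, labels, batch_size, pad_id):
    """Yield (padded_inputs, labels) pairs covering the data in order."""
    for start in range(0, len(data), batch_size):
        batch = data[start:start + batch_size]
        yield pad_batch(batch, pad_id), labels[start:start + batch_size]


toy_data = [[5], [6, 7], [8]]
toy_labels = [0, 1, 0]
for inputs, targets in minibatches(toy_data, toy_labels, batch_size=2, pad_id=1):
    print(inputs, targets)
```

One caveat with this design: a plain mean over the time dimension also averages the padded positions, and a recurrent model would read the padding after the real tokens, so a mask or per-sequence lengths would be needed for exact results.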

In [None]:
# optimizer = torch.optim.SGD(baseline.parameters(), lr=0.1)
# optimizer.zero_grad() # reset process
# loss.backward() # compute gradients
# optimizer.step() # update parameters

Once you have done this, all weight parameters will have `grad` attributes that contain the gradients of the loss with respect to them.

In [None]:
# print(baseline.layer.weight.grad) # dL/dw of weights in the linear layer

**Problem 3.1** *(10 points)* Properly train this average-pooling baseline model on SST-2 and report the model's accuracy on the dev data.

#### $\color{blue}{\text{Solution 3.1}}$
<font color='blue'> Training the model on 65000 samples of train set; validating using remaining samples of train set; testing it on the dev set. </font>

In [None]:
# result_path = '/home/radhika/radhika_77/data/nlp/models/'
result_path = '/content/drive/MyDrive/nlp/'
criterion = nn.CrossEntropyLoss()
softmax = nn.Softmax(1) 

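def train(model, data, labels, val_data, val_labels, num_epochs = 12, file = None):    
    """Train with SGD on one example at a time; save the best model (by validation accuracy) to `file` if given."""
    if file is not None:
      best_file = os.path.join(result_path, file)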

    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
    best_acc = 0
    best_acc_epoch = 0
    
    train_loss=[]
    train_acc=[]
    val_loss = []
    val_acc=[]

    for epoch in range(num_epochs):
      model.train()
      epoch_loss = 0
      correct = 0
      total = 0
      epoch_acc = 0
      for i in range(len(labels)):
          input_tensor = torch.LongTensor([data[i]])
          label = torch.LongTensor([labels[i]])

          optimizer.zero_grad()
          logits = model(input_tensor)
          loss = criterion(logits, label)
          loss.backward()
          optimizer.step()
          _, predicted = torch.max(logits.data, 1)
          total += label.size(0)
          correct += (predicted == label).sum().item()
          epoch_loss += loss.item()

      epoch_acc = (100 * correct / total)
      train_loss.append(round((epoch_loss / len(labels)), 2))
      train_acc.append(round((epoch_acc),2))

      with torch.no_grad():
            model.eval()
            val_epoch_loss = 0
            val_total = 0
            val_correct = 0
            epoch_val_acc = 0
            for i in range(len(val_labels)):
                input_tensor, label = val_data[i], val_labels[i]
                input_tensor = torch.LongTensor([input_tensor])
                label = torch.LongTensor([label])
                logits = model(input_tensor)
                loss = criterion(logits, label)
                _, predicted = torch.max(logits.data, 1)
                val_total += label.size(0)
                val_correct += (predicted == label).sum().item()
                val_epoch_loss += loss.item()

            val_loss.append(round((val_epoch_loss / len(val_labels)),2))
            epoch_val_acc = (100 * val_correct / val_total)
            val_acc.append(round((epoch_val_acc),2))
      print("epoch {}: Training Loss- {:.2f} , Val Loss- {:.2f} , Training Acc- {:.2f}, Val Acc- {:.2f}".format(epoch, epoch_loss, val_epoch_loss, epoch_acc, epoch_val_acc))
      if (epoch_val_acc >= best_acc):
            best_acc = epoch_val_acc
            best_acc_epoch = epoch
            if (file!=None):
              torch.save(model.state_dict(), best_file)         
      if (epoch == num_epochs - 1):
        print("Best accuracy at epoch: {}".format(best_acc_epoch))
    return train_loss, train_acc, val_loss, val_acc

In [None]:
def test(model, data, labels, file = None):
    """Evaluate `model` on the given data; load the weights saved in `file` first if given."""
    if file is not None:
      best_file = os.path.join(result_path, file)
      model.load_state_dict(torch.load(best_file))
    model.eval()
    
    with torch.no_grad():
        test_total = 0
        test_correct = 0
        test_loss = 0
        test_accuracy = 0
        for i in range(len(labels)):
            input_tensor, label = data[i],  labels[i]
            input_tensor = torch.LongTensor([input_tensor])
            label = torch.LongTensor([label])
            logits = model(input_tensor)
            loss = criterion(logits, label)
            _, predicted = torch.max(logits.data, 1)
            test_total += label.size(0)
            test_correct += (predicted == label).sum().item()
            test_loss += loss.item()
        test_accuracy = round((100 * test_correct / test_total), 2)
        print("Test loss: {}, test accuracy: {}". format(round((test_loss / len(labels)),2), test_accuracy))

In [None]:
d = 128 # size of word-embedding
num_epochs = 30
model = Baseline(d=128)
train_loss, train_acc, val_loss, val_acc = train(model, token_ids, labels, val_token_ids, val_labels, num_epochs, 'best_baseline.pt')

epoch 0: Training Loss- 65647.65 , Val Loss- 1418.86 , Training Acc- 64.27, Val Acc- 74.37
epoch 1: Training Loss- 33663.40 , Val Loss- 1347.99 , Training Acc- 78.30, Val Acc- 77.82
epoch 2: Training Loss- 28251.81 , Val Loss- 1158.29 , Training Acc- 82.68, Val Acc- 83.01
epoch 3: Training Loss- 24638.87 , Val Loss- 1199.34 , Training Acc- 85.32, Val Acc- 83.14
epoch 4: Training Loss- 21758.73 , Val Loss- 1244.83 , Training Acc- 87.06, Val Acc- 85.01
epoch 5: Training Loss- 19489.57 , Val Loss- 1239.51 , Training Acc- 88.39, Val Acc- 85.99
epoch 6: Training Loss- 19327.20 , Val Loss- 1296.61 , Training Acc- 88.58, Val Acc- 85.99
epoch 7: Training Loss- 17280.85 , Val Loss- 1316.56 , Training Acc- 89.71, Val Acc- 86.80
epoch 8: Training Loss- 16001.65 , Val Loss- 1494.07 , Training Acc- 90.39, Val Acc- 86.89
epoch 9: Training Loss- 15093.05 , Val Loss- 1589.43 , Training Acc- 91.14, Val Acc- 87.31
epoch 10: Training Loss- 14603.18 , Val Loss- 1632.44 , Training Acc- 91.39, Val Acc- 86.0

In [None]:
test_text = test_tsv_read['sentence']
test_labels = list(test_tsv_read['label'])
test_text = list(test_text.str.lower())
test_tokens, test_token_ids = tokenization(test_text)
test(model, test_token_ids, test_labels, 'best_baseline.pt')

1
Test loss: 0.72, test accuracy: 78.21


<font color='blue'>The model is trained for 30 epochs and the best model based on performance on the validation split is obtained at the 12th epoch. The accuracy of the model is **78.21%** on the dev set.</font>

**Problem 3.2** *(10 points)* Implement a recurrent neural network (without using PyTorch's RNN module) where the output of the linear layer not only depends on the current input but also the previous output. Report the model's accuracy on the dev data. Is it better or worse than the baseline? Why?


#### $\color{blue}{\text{Solution 3.2}}$
<font color='blue'> Training the model on 65000 samples of train set; validating using remaining samples of train set; testing it on the dev set.
Here, the **`train function`** and **`test function`** defined in **solution 3.1** are used. </font>

In [None]:
class RNN(nn.Module):
    """
    Implementation of a recurrent neural network using
    the `nn.Linear` class.
    3 types of layer connections:
    - input to hidden
    - hidden to hidden
    - hidden to output
    Weights are shared across time steps.
    """
    def __init__(self, d):
        super(RNN, self).__init__()
        # Set the sizes of layers and more.
        self.input_size = d # size of word_embeddings
        self.hidden_size = d # size of hidden layers
        self.output_size = 2 # size of output

        self.embedding = nn.Embedding(len(vocab), self.input_size)
        self.x2h = nn.Linear(self.input_size, self.hidden_size)    # input to hidden 
        self.h2h = nn.Linear(self.hidden_size, self.hidden_size)    # hidden to  hidden
        self.h2y = nn.Linear(self.hidden_size, self.output_size, bias=True)  # hidden to output


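    def forward(self, input_tensor):
        emb = self.embedding(input_tensor)  # (batch, sequence, d)
        # Initial hidden state: one zero vector per sequence in the batch
        h = torch.zeros(emb.shape[0], self.hidden_size, device=emb.device)
        for i in range(emb.shape[1]):
            h = torch.tanh(self.h2h(h) + self.x2h(emb[:,i,:]))   
        out = self.h2y(h)
        return out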
        

In [None]:
# d = 128 # size of word-embedding
num_epochs = 20
model = RNN(d=128)
train_loss, train_acc, val_loss, val_acc = train(model, token_ids, labels, val_token_ids, val_labels, num_epochs, 'best_rnn.pt')

epoch 0: Training Loss- 392037.07 , Val Loss- 10098.62 , Training Acc- 51.32, Val Acc- 54.45
epoch 1: Training Loss- 388190.52 , Val Loss- 12706.09 , Training Acc- 52.00, Val Acc- 51.85
epoch 2: Training Loss- 385291.59 , Val Loss- 11382.55 , Training Acc- 52.33, Val Acc- 51.81
epoch 3: Training Loss- 385804.41 , Val Loss- 13494.90 , Training Acc- 52.23, Val Acc- 50.79
epoch 4: Training Loss- 384812.21 , Val Loss- 11289.45 , Training Acc- 52.45, Val Acc- 52.66
epoch 5: Training Loss- 381607.05 , Val Loss- 10052.96 , Training Acc- 52.79, Val Acc- 51.17
epoch 6: Training Loss- 381293.32 , Val Loss- 12750.19 , Training Acc- 52.83, Val Acc- 52.70
epoch 7: Training Loss- 379108.14 , Val Loss- 11628.76 , Training Acc- 53.17, Val Acc- 51.81
epoch 8: Training Loss- 379794.24 , Val Loss- 11344.03 , Training Acc- 53.11, Val Acc- 52.75
epoch 9: Training Loss- 378869.61 , Val Loss- 9530.38 , Training Acc- 53.15, Val Acc- 50.96
epoch 10: Training Loss- 379121.06 , Val Loss- 10143.09 , Training Acc-

In [None]:
test(model, test_token_ids, test_labels, 'best_rnn.pt')

Test loss: 4.95, test accuracy: 51.72


<font color='blue'>The model is trained for 20 epochs and the best model based on performance on the validation split is obtained at the 20th epoch. The accuracy of the model is **51.72%** on the dev set.</font>

<font color='blue'> Based on the accuracy of the RNN and the baseline model on the dev set, we can clearly say that the **RNN model is performing worse than the baseline model**. As the training and validation accuracies also stay only slightly above 50%, the model is not learning properly. **This may be primarily because the gradients are vanishing or exploding (the vanishing/exploding gradient problem), so the model is not able to learn properly.**</font>


**Problem 3.3 (bonus)** *(10 points)* Show that the cross entropy computed above is equivalent to the negative log likelihood of the probability distribution.

#### $\color{blue}{\text{Solution 3.3}}$
**Mathematically**

<font color='blue'>**Cross-entropy is a measure of the difference between two probability distributions**. Cross-entropy loss measures the performance of a classification model whose output is a probability value between 0 and 1. It increases as the predicted probability diverges from the actual label. **Cross-entropy loss** is mathematically defined as:
$$H(p,q) = - \sum_{i} p_{i} \log q_{i},$$
where $q_{i}$ is the estimated probability of outcome $i$ and $p_{i}$ is the empirical probability of outcome $i$ in the training data.</font>

<font color='blue'> The **likelihood of the training set** is given by:
$$\text{likelihood} = \prod_{i} q_{i}^{Np_{i}} $$
where $q_{i}$ is the estimated probability of outcome $i$, $p_{i}$ is the empirical probability of outcome $i$ in the training data, and $N$ is the number of independent samples in the training set.

**Taking the logarithm of the likelihood and dividing it by $N$**, we get
$$\frac{1}{N} \log \prod_{i} q_{i}^{Np_{i}}  = \sum_{i} p_{i} \log q_{i} = - H(p,q) $$
</font>
<font color='blue'> 
Hence, the cross-entropy equals the negative log-likelihood of the training set divided by $N$. For a single example with a one-hot target, $p$ puts all its mass on the target label $y$, so $H(p,q) = -\log q_{y}$, which is exactly the negative log probability of the target label.</font>

**Based on experiments**

<font color='blue'>Additionally, I performed a small experiment in the cell below to verify this. I used the baseline model and, at inference time, compared the value of the cross-entropy loss with the NLL loss applied to the log-softmax of the logits. The two loss values obtained in the cell below are identical, which is consistent with the cross entropy computed above being equivalent to the negative log likelihood of the probability distribution. </font>


In [None]:
# Test sample
input_tensor = torch.LongTensor([token_ids[10]])
label = torch.LongTensor([labels[10]])

#Loading baseline model
model = Baseline(128)
best_file = os.path.join(result_path, "best_baseline.pt")
model.load_state_dict(torch.load(best_file))
model.eval()
logits = model(input_tensor)

# Cross entropy loss
ce = nn.CrossEntropyLoss()
loss1 = ce(logits, label)
m = nn.LogSoftmax(dim = 1)

# NLL loss
nll_loss = nn.NLLLoss()
loss2 = nll_loss(m(logits), label)
print("Cross entropy loss is ", loss1.item())
print("NLL loss is ", loss2.item())

Cross entropy loss is  4.339123915997334e-05
NLL loss is  4.339123915997334e-05



**Problem 3.4 (bonus)** *(10 points)* Why is it numerically unstable if you compute log on top of softmax?

<font color='blue'> Computing log on top of softmax is numerically unstable because, in floating point, the softmax probability of a strongly disfavored class can **underflow** to exactly 0, and **$\log(0)$** is undefined. The logarithm function is not defined for zero, so log probabilities can only represent non-zero probabilities. </font>

<font color='blue'> ***In the cell below, we do a simple experiment to prove that computing log on top of softmax is numerically unstable, particularly numerical underflow.***
In this example, we have an input vector containing one value significantly larger than the rest of values. <font>
<font color='blue'>
$$x = [10, 2, 10000, 4]$$

<font color='blue'>
Computing softmax on this vector generates: 

<font color='blue'>
$$ softmax(x) = [0., 0., 1., 0.]$$

<font color='blue'>
Since $x$ contains a much larger value at index $2$ than at any other index, the softmax of this vector assigns probability $1$ to the value at index $2$ and probability $0$ to the rest of the values.

<font color='blue'>
Computing log of softmax(x) generates: 

<font color='blue'>
$$\log(softmax(x)) = [-inf, -inf,   0., -inf]$$

<font color='blue'>
Since $\log(0)$ is not defined, NumPy emits the warning "*RuntimeWarning: divide by zero encountered in log*" and returns $-\infty$. This is a classic example of numerical instability, in particular numerical underflow.

<font color='blue'>Hence, **computing log on top of softmax is numerically unstable.**</font>

In [None]:
import numpy as np
def softmax(x):
    max_x = np.max(x)
    exp_x = np.exp(x - max_x)
    sum_exp_x = np.sum(exp_x)
    sm_x = exp_x/sum_exp_x
    return sm_x
    
x = np.array([10, 2, 10000, 4])
softmax(x)

array([0., 0., 1., 0.])

In [None]:
np.log(softmax(x))



array([-inf, -inf,   0., -inf])

<font color='blue'>This is an example where computing log on top of softmax leads to numerical instability, in particular, numerical underflow.</font>
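For contrast, the log-softmax can be computed directly from the logits with the log-sum-exp trick, which never takes the logarithm of an underflowed probability. The sketch below is a plain-Python illustration on the same input vector. The function name `log_softmax` is chosen for this example and is not the PyTorch implementation.

```python
import math


def log_softmax(logits):
    """Stable log-softmax: x_i - (m + log(sum_j exp(x_j - m))) with m = max(x)."""
    m = max(logits)
    # Every exponent is <= 0, so exp cannot overflow, and the sum is >= 1.
    log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_sum_exp for x in logits]


print(log_softmax([10.0, 2.0, 10000.0, 4.0]))
# [-9990.0, -9998.0, 0.0, -9996.0]
```

All log probabilities stay finite, which is why the cross-entropy loss is computed from `logits` rather than from `softmax(logits)`.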

## 4. Text Classification with LSTM and Dropout

Now it is time to drastically improve your baselines! Replace your RNN module with an LSTM module. See Lecture slides 04 and 05 for the formal definition of LSTMs. 

You will also use Dropout, which during training randomly sets each dimension to zero with probability `p` and scales it by `1/(1-p)` if it is not zeroed. Put it either at the input or the output of the LSTM to prevent it from overfitting.

In [None]:
a = torch.FloatTensor([0.1, 0.3, 0.5, 0.7, 0.9])
dropout = nn.Dropout(0.5) # p=0.5
print(dropout(a))

tensor([0.2000, 0.6000, 0.0000, 0.0000, 0.0000])
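The behavior described above can be sketched without PyTorch as follows. This is an illustrative re-implementation under the stated definition, not the internals of `nn.Dropout`. The function name `inverted_dropout` and its `rng` parameter are assumptions made for this example.

```python
import random


def inverted_dropout(values, p, training=True, rng=None):
    """Zero each value with probability p and scale the survivors by 1/(1-p).

    At evaluation time (training=False) the input is returned unchanged.
    """
    if not training or p == 0.0:
        return list(values)
    rng = rng or random.Random()
    keep_scale = 1.0 / (1.0 - p)
    return [0.0 if rng.random() < p else v * keep_scale for v in values]


a = [0.1, 0.3, 0.5, 0.7, 0.9]
print(inverted_dropout(a, p=0.5, rng=random.Random(0)))  # some entries zeroed, the rest doubled
print(inverted_dropout(a, p=0.5, training=False))        # unchanged at evaluation time
```

Scaling the surviving values keeps the expected activation the same in training and evaluation, so no rescaling is needed at test time.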


**Problem 4.1** *(20 points)* Use LSTM instead of vanilla RNN to improve your model. Report the accuracy on the dev data.

**Problem 4.2** *(10 points)* Use Dropout on LSTM (either at input or output). Report the accuracy on the dev data and briefly describe how it differs from 4.1.

**Problem 4.3 (bonus)** *(10 points)* Consider implementing bidirectional LSTM and two layers of LSTM to further improve your model. Report your accuracy on dev data.

#### $\color{blue}{\text{Solution 4.1}}$
<font color='blue'> Training the model on 65000 samples of train set; validating using remaining samples of train set; testing it on the dev set.
Here, the **`train function`** and **`test function`** defined in **solution 3.1** are used. </font>

In [None]:
class LSTM(nn.Module):
    def __init__(self, d, dropout = None):
        """
        Implementation of an LSTM using the
        `nn.Linear` and `nn.Parameter` classes.
        Dropout on the final hidden state is enabled
        when `dropout` is not None.
        """
        super(LSTM, self).__init__()
        self.input_size = d
        self.hidden_size = d
        self.output_size = 2
        self.drop = dropout
        self.embedding = nn.Embedding(len(vocab), self.input_size)
        # Stacked weights for i_t, f_t, g_t, o_t (in this order)
        self.W = nn.Parameter(torch.Tensor(self.input_size, self.hidden_size * 4))
        self.U = nn.Parameter(torch.Tensor(self.hidden_size, self.hidden_size * 4))
        self.b = nn.Parameter(torch.Tensor(self.hidden_size * 4))

        self.dropout = nn.Dropout(0.25)
        self.linear = nn.Linear(self.hidden_size, self.output_size, bias=True) 

        self.init_weights()
                
    def init_weights(self):
        stdv = 1.0 / math.sqrt(self.hidden_size)
        for weight in self.parameters():
            weight.data.uniform_(-stdv, stdv)
         
    def forward(self, input_tensor):
        """Assumes input_tensor holds token ids of shape (batch, sequence)"""
        # print(input_tensor.shape)
        emb = self.embedding(input_tensor)
        batch_size = emb.shape[0]

        h_t, c_t = (torch.zeros(batch_size, self.hidden_size).to(emb.device), 
                        torch.zeros(batch_size, self.hidden_size).to(emb.device))
          
        ds = self.hidden_size
        for t in range(emb.shape[1]):
            emb_t = emb[:, t, :]
            gate = emb_t @ self.W + h_t @ self.U + self.b
            i_t, f_t, g_t, o_t = (
                torch.sigmoid(gate[:, :ds]), 
                torch.sigmoid(gate[:, ds:ds*2]),  
                torch.tanh(gate[:, ds*2:ds*3]),
                torch.sigmoid(gate[:, ds*3:]), 
            )
            c_t = f_t * c_t + i_t * g_t
            h_t = o_t * torch.tanh(c_t)

        # Apply dropout to the final hidden state only when it was requested
        if self.drop is not None:
          h_t = self.dropout(h_t)
        out = self.linear(h_t)
        return out

In [None]:
# num_epochs = 14
# model = LSTM(d=128)
# train_loss, train_acc, val_loss, val_acc = train(model, token_ids, labels, val_token_ids, val_labels, num_epochs, 'best_lstm.pt')

In [None]:
model = LSTM(d=128)
test(model, test_token_ids, test_labels, 'best_lstm.pt')

Test loss: 1.02, test accuracy: 83.14


In [None]:
d = 128 # size of word-embedding
num_epochs = 14
model = LSTM(d=128)
train_loss, train_acc, val_loss, val_acc = train(model, token_ids, labels, val_token_ids, val_labels, num_epochs, 'best_lstm_v2.pt')

epoch 0: Training Loss- 29851.19 , Val Loss- 662.57 , Training Acc- 76.46, Val Acc- 88.72
epoch 1: Training Loss- 13292.34 , Val Loss- 550.47 , Training Acc- 92.31, Val Acc- 91.36
epoch 2: Training Loss- 8900.31 , Val Loss- 612.81 , Training Acc- 94.94, Val Acc- 91.40
epoch 3: Training Loss- 6605.37 , Val Loss- 720.49 , Training Acc- 96.12, Val Acc- 91.49
epoch 4: Training Loss- 5250.48 , Val Loss- 750.51 , Training Acc- 96.91, Val Acc- 91.78
epoch 5: Training Loss- 4233.30 , Val Loss- 799.96 , Training Acc- 97.44, Val Acc- 91.06
epoch 6: Training Loss- 3518.11 , Val Loss- 978.44 , Training Acc- 97.92, Val Acc- 91.27
epoch 7: Training Loss- 3123.30 , Val Loss- 1093.42 , Training Acc- 98.19, Val Acc- 91.02
epoch 8: Training Loss- 2797.79 , Val Loss- 1197.00 , Training Acc- 98.38, Val Acc- 91.53
epoch 9: Training Loss- 2601.87 , Val Loss- 1046.80 , Training Acc- 98.52, Val Acc- 91.70
epoch 10: Training Loss- 2259.63 , Val Loss- 1211.96 , Training Acc- 98.72, Val Acc- 91.83
epoch 11: Trai

In [None]:
model = LSTM(d=128)
test(model, test_token_ids, test_labels, 'best_lstm_v2.pt')

Test loss: 1.01, test accuracy: 82.34


<font color='blue'>Here the LSTM model replaces the vanilla RNN. The model was trained twice, for 20 and 14 epochs respectively, and the best checkpoints on the validation split came from the 11th and 14th epochs respectively. Their accuracies on the dev set are **83.14%** and **82.34%** respectively.</font>

**Problem 4.2** *(10 points)* Use Dropout on LSTM (either at input or output). Report the accuracy on the dev data and briefly describe how it differs from 4.1.


#### $\color{blue}{\text{Solution 4.2}}$
<font color='blue'> The model is trained on 65,000 samples of the train set, validated on the remaining train samples, and tested on the dev set.
The **`train`** and **`test`** functions defined in **Solution 3.1** are used. Dropout is applied at the output layer, once with p=0.5 and once with p=0.25. </font>

In [None]:
d = 128 # size of word-embedding
num_epochs = 12
model = LSTM(d, 'dropout')
train_loss, train_acc, val_loss, val_acc = train(model, token_ids, labels, val_token_ids, val_labels, num_epochs, 'best_lstm_with_dropout.pt')

epoch 0: Training Loss- 31048.76 , Val Loss- 665.26 , Training Acc- 75.44, Val Acc- 88.34
epoch 1: Training Loss- 14554.30 , Val Loss- 552.70 , Training Acc- 91.52, Val Acc- 90.51
epoch 2: Training Loss- 9957.78 , Val Loss- 577.75 , Training Acc- 94.41, Val Acc- 91.49
epoch 3: Training Loss- 7421.37 , Val Loss- 609.56 , Training Acc- 95.65, Val Acc- 92.08
epoch 4: Training Loss- 5983.45 , Val Loss- 702.58 , Training Acc- 96.38, Val Acc- 91.74
epoch 5: Training Loss- 5030.75 , Val Loss- 841.36 , Training Acc- 96.88, Val Acc- 91.70
epoch 6: Training Loss- 4340.32 , Val Loss- 861.91 , Training Acc- 97.27, Val Acc- 91.49
epoch 7: Training Loss- 3877.79 , Val Loss- 975.23 , Training Acc- 97.60, Val Acc- 92.08
epoch 8: Training Loss- 3506.09 , Val Loss- 946.87 , Training Acc- 97.81, Val Acc- 92.04
epoch 9: Training Loss- 3325.59 , Val Loss- 893.63 , Training Acc- 97.95, Val Acc- 92.04
epoch 10: Training Loss- 3219.58 , Val Loss- 1005.38 , Training Acc- 98.03, Val Acc- 92.21
epoch 11: Trainin

In [None]:
test_text = test_tsv_read['sentence']
test_labels = list(test_tsv_read['label'])
test_text = list(test_text.str.lower())
test_tokens, test_token_ids = tokenization(test_text)
d = 128
model = LSTM(d, 'dropout')
test(model, test_token_ids, test_labels, 'best_lstm_with_dropout.pt')

Test loss: 0.99, test accuracy: 82.11


<font color='blue'> I added **dropout(p=0.5) at the output of the LSTM** to reduce overfitting and trained the model for 20 epochs. The best model on the validation split is obtained at the 12th epoch. Its accuracy on the dev set is **82.11%**. 
The **LSTM model (Solution 4.1) outperforms the LSTM model with dropout at the output layer**, so adding dropout(p=0.5) at the output layer does not improve classification performance here. A possible reason is that the LSTM output has only 128 dimensions, and dropping half of them may lose important information. Dropout may be more helpful against overfitting when the output dimension is much larger (1024 or 2048). </font>

In [None]:
d = 128 # size of word-embedding
num_epochs = 12
model = LSTM(d, 'dropout')
train_loss, train_acc, val_loss, val_acc = train(model, token_ids, labels, val_token_ids, val_labels, num_epochs, 'best_lstm_with_dropout_v2.pt')

epoch 0: Training Loss- 30602.65 , Val Loss- 691.71 , Training Acc- 75.65, Val Acc- 88.04
epoch 1: Training Loss- 14054.04 , Val Loss- 586.01 , Training Acc- 91.77, Val Acc- 91.19
epoch 2: Training Loss- 9542.08 , Val Loss- 557.85 , Training Acc- 94.61, Val Acc- 91.87
epoch 3: Training Loss- 7170.22 , Val Loss- 628.99 , Training Acc- 95.79, Val Acc- 91.66
epoch 4: Training Loss- 5456.86 , Val Loss- 758.17 , Training Acc- 96.67, Val Acc- 92.00
epoch 5: Training Loss- 4532.95 , Val Loss- 793.72 , Training Acc- 97.24, Val Acc- 92.00
epoch 6: Training Loss- 3864.70 , Val Loss- 932.18 , Training Acc- 97.67, Val Acc- 92.08
epoch 7: Training Loss- 3557.85 , Val Loss- 932.38 , Training Acc- 97.85, Val Acc- 92.51
epoch 8: Training Loss- 3345.23 , Val Loss- 988.03 , Training Acc- 98.03, Val Acc- 92.25
epoch 9: Training Loss- 2848.40 , Val Loss- 1103.57 , Training Acc- 98.29, Val Acc- 91.53
epoch 10: Training Loss- 2436.67 , Val Loss- 1237.35 , Training Acc- 98.54, Val Acc- 92.46
epoch 11: Traini

In [None]:
d = 128
model = LSTM(d, 'dropout')
test(model, test_token_ids, test_labels, 'best_lstm_with_dropout_v2.pt')

Test loss: 1.01, test accuracy: 83.78


<font color='blue'> I added **dropout(p=0.25) at the output of the LSTM** to reduce overfitting and trained the model for 20 epochs. The best model on the validation split is obtained at the 12th epoch. Its accuracy on the dev set is **83.78%**. 
The **LSTM model with dropout at the output layer outperforms the LSTM model (Solution 4.1)**, so adding dropout(p=0.25) at the output layer improves classification performance. </font>

<font color='blue'> In summary, adding dropout(p=0.25) improves classification performance, whereas adding dropout(p=0.5) degrades it. A possible reason is that the LSTM output has only 128 dimensions, and dropping half of them may lose important information. In contrast, dropout(p=0.25) drops fewer units, so most of the important information is retained and the test performance improves. Dropout(p=0.5) may be more helpful against overfitting when the output dimension is much larger (1024 or 2048). </font>
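
To make the comparison between the two dropout rates concrete, here is a minimal sketch of what `nn.Dropout` does to a 128-dimensional vector. The all-ones tensor `h` is a stand-in for the final hidden state and is an assumption for illustration only. In training mode the surviving units are scaled by `1 / (1 - p)`, and in evaluation mode the layer is the identity.

```python
import torch
from torch import nn

torch.manual_seed(0)
p = 0.25
drop = nn.Dropout(p)
h = torch.ones(1, 128)   # stand-in for the final hidden state, d = 128

# Training mode: each unit is zeroed with probability p, the rest are rescaled.
drop.train()
h_train = drop(h)
kept = h_train != 0
scale = 1.0 / (1.0 - p)

# Evaluation mode: dropout is disabled and the input passes through unchanged.
drop.eval()
h_eval = drop(h)

print("units kept in training mode:", int(kept.sum()), "of", h.numel())
print("value of each kept unit:", scale)
print("eval mode is identity:", torch.equal(h_eval, h))
```

This is also why `model.eval()` matters before validation and testing: without it, a random subset of the 128 units would be dropped at inference time as well.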

#### $\color{blue}{\text{Solution 4.3.1}}$
<font color='blue'> A **bidirectional LSTM** is implemented to further improve the model. 
 The model is trained on 65,000 samples of the train set, validated on the remaining train samples, and tested on the dev set.
The **`train`** and **`test`** functions defined in **Solution 3.1** are used. </font>

In [None]:
class BiLSTM(nn.Module):
    def __init__(self, d, dropout = None):
        """
        Bidirectional LSTM built from raw `nn.Parameter` weights,
        with an `nn.Linear` classifier on the concatenated final states
        """
        super(BiLSTM, self).__init__()
        self.input_size = d
        self.hidden_size = d
        self.output_size = 2

        self.embedding = nn.Embedding(len(vocab), self.input_size)
        # Forward-direction gate parameters; columns are ordered i_t, f_t, g_t, o_t
        self.W = nn.Parameter(torch.Tensor(self.input_size, self.hidden_size * 4))
        self.U = nn.Parameter(torch.Tensor(self.hidden_size, self.hidden_size * 4))
        self.b = nn.Parameter(torch.Tensor(self.hidden_size * 4))

        self.Wb = nn.Parameter(torch.Tensor(self.input_size, self.hidden_size * 4))
        self.Ub = nn.Parameter(torch.Tensor(self.hidden_size, self.hidden_size * 4))
        self.bb = nn.Parameter(torch.Tensor(self.hidden_size * 4))

        self.linear = nn.Linear(self.hidden_size*2, self.output_size, bias=True) 

        self.init_weights()
                
    def init_weights(self):
        stdv = 1.0 / math.sqrt(self.hidden_size)
        for weight in self.parameters():
            weight.data.uniform_(-stdv, stdv)
         
    def forward(self, input_tensor):
        """Assumes input_tensor holds token ids of shape (batch, sequence)"""
        # print(input_tensor.shape)
        emb = self.embedding(input_tensor)
        batch_size = emb.shape[0]

        forward = []
        backward = []
        h_t_for, c_t_for = (torch.zeros(batch_size, self.hidden_size).to(emb.device), 
                        torch.zeros(batch_size, self.hidden_size).to(emb.device))
        h_t_back, c_t_back = (torch.zeros(batch_size, self.hidden_size).to(emb.device), 
                        torch.zeros(batch_size, self.hidden_size).to(emb.device))
          
        ds = self.hidden_size
        for t in range(emb.shape[1]):
            emb_t = emb[:, t, :]
            gate = emb_t @ self.W + h_t_for @ self.U + self.b
            i_t, f_t, g_t, o_t = (
                torch.sigmoid(gate[:, :ds]), 
                torch.sigmoid(gate[:, ds:ds*2]),  
                torch.tanh(gate[:, ds*2:ds*3]),
                torch.sigmoid(gate[:, ds*3:]), 
            )
            c_t_for = f_t * c_t_for + i_t * g_t
            h_t_for = o_t * torch.tanh(c_t_for)
            forward.append(h_t_for)

        for t in reversed(range(emb.shape[1])):
            emb_t = emb[:, t, :]
            gate = emb_t @ self.Wb + h_t_back @ self.Ub + self.bb
            i_t, f_t, g_t, o_t = (
                torch.sigmoid(gate[:, :ds]), 
                torch.sigmoid(gate[:, ds:ds*2]),  
                torch.tanh(gate[:, ds*2:ds*3]),
                torch.sigmoid(gate[:, ds*3:]), 
            )
            c_t_back = f_t * c_t_back + i_t * g_t
            h_t_back = o_t * torch.tanh(c_t_back)
            backward.append(h_t_back)

        h_final= torch.cat((h_t_for, h_t_back), 1)
        out = self.linear(h_final)
        return out

In [None]:
d = 128 # size of word-embedding
num_epochs = 12
model = BiLSTM(d)
train_loss, train_acc, val_loss, val_acc = train(model, token_ids, labels, val_token_ids, val_labels, num_epochs, 'best_bilstm.pt')

epoch 0: Training Loss- 28008.84 , Val Loss- 630.29 , Training Acc- 78.60, Val Acc- 89.14
epoch 1: Training Loss- 12509.46 , Val Loss- 578.43 , Training Acc- 92.79, Val Acc- 91.61
epoch 2: Training Loss- 8051.79 , Val Loss- 597.64 , Training Acc- 95.50, Val Acc- 92.00
epoch 3: Training Loss- 5674.09 , Val Loss- 676.43 , Training Acc- 96.79, Val Acc- 92.29
epoch 4: Training Loss- 4230.75 , Val Loss- 750.23 , Training Acc- 97.57, Val Acc- 92.29
epoch 5: Training Loss- 3436.98 , Val Loss- 899.44 , Training Acc- 98.05, Val Acc- 92.08
epoch 6: Training Loss- 2777.25 , Val Loss- 965.44 , Training Acc- 98.44, Val Acc- 92.17
epoch 7: Training Loss- 2392.08 , Val Loss- 965.91 , Training Acc- 98.67, Val Acc- 92.42
epoch 8: Training Loss- 2051.31 , Val Loss- 1072.55 , Training Acc- 98.95, Val Acc- 92.34
epoch 9: Training Loss- 1918.34 , Val Loss- 1054.57 , Training Acc- 98.98, Val Acc- 92.38
epoch 10: Training Loss- 1572.50 , Val Loss- 1117.17 , Training Acc- 99.15, Val Acc- 92.64
epoch 11: Train

In [None]:
test_text = test_tsv_read['sentence']
test_labels = list(test_tsv_read['label'])
test_text = list(test_text.str.lower())
test_tokens, test_token_ids = tokenization(test_text)
d = 128
model = BiLSTM(d)
test(model, test_token_ids, test_labels, 'best_bilstm.pt')

Test loss: 0.96, test accuracy: 86.32


<font color='blue'>The model is trained for 12 epochs, and the best model on the validation split is obtained at the 12th epoch. Its accuracy on the dev set is **86.32%**. This model outperforms all the models trained in Solution 3 and Solution 4 (the baseline model, vanilla RNN, LSTM, and LSTM with dropout).</font>
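
The classifier above uses the concatenation of the last forward state and the last backward state. As a cross-check of that idea, the sketch below uses PyTorch's built-in `nn.LSTM` with `bidirectional=True` (not the hand-written class from this notebook; the sizes are illustrative assumptions). It shows where the two final states sit in the per-token output: the forward direction finishes at the last token, the backward direction finishes at the first token.

```python
import torch
from torch import nn

torch.manual_seed(0)
batch, seq_len, d = 2, 5, 8   # small illustrative sizes
lstm = nn.LSTM(input_size=d, hidden_size=d, batch_first=True, bidirectional=True)
x = torch.randn(batch, seq_len, d)

with torch.no_grad():
    output, (h_n, c_n) = lstm(x)   # output: (batch, seq_len, 2 * d)

h_forward_final = h_n[0]    # forward state after reading the last token
h_backward_final = h_n[1]   # backward state after reading the first token
h_final = torch.cat((h_forward_final, h_backward_final), dim=1)   # (batch, 2 * d)

# The per-token output is aligned by token position for both directions.
fwd_ok = torch.allclose(output[:, -1, :d], h_forward_final)
bwd_ok = torch.allclose(output[:, 0, d:], h_backward_final)
print("forward final state is output[:, -1, :d]:", fwd_ok)
print("backward final state is output[:, 0, d:]:", bwd_ok)
```

The resulting `h_final` has `2 * d` features, which matches the `nn.Linear(self.hidden_size*2, ...)` input size used by the `BiLSTM` class.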

#### $\color{blue}{\text{Solution 4.3.2}}$
<font color='blue'> Implementing **bidirectional LSTM and two layers of LSTM** to further improve the model. 
 Training the model on 65000 samples of train set; validating using remaining samples of train set; testing it on the dev set. </font>

In [None]:
class BiLSTMv2(nn.Module):
    def __init__(self, d, dropout = None):
        """
        Bidirectional LSTM followed by two stacked LSTM layers, built from
        raw `nn.Parameter` weights, with an `nn.Linear` classifier on top
        """
        super(BiLSTMv2, self).__init__()
        self.input_size = d
        self.hidden_size = d
        self.output_size = 2
        self.hidden_size1 = 2*d
        
        #BiLSTM
        self.embedding = nn.Embedding(len(vocab), self.input_size)
        # Forward-direction gate parameters; columns are ordered i_t, f_t, g_t, o_t
        self.W = nn.Parameter(torch.Tensor(self.input_size, self.hidden_size * 4))
        self.U = nn.Parameter(torch.Tensor(self.hidden_size, self.hidden_size * 4))
        self.b = nn.Parameter(torch.Tensor(self.hidden_size * 4))

        self.Wb = nn.Parameter(torch.Tensor(self.input_size, self.hidden_size * 4))
        self.Ub = nn.Parameter(torch.Tensor(self.hidden_size, self.hidden_size * 4))
        self.bb = nn.Parameter(torch.Tensor(self.hidden_size * 4))

        # LSTM1
        self.W1 = nn.Parameter(torch.Tensor(self.input_size*2, self.hidden_size1 * 4))
        self.U1 = nn.Parameter(torch.Tensor(self.hidden_size1, self.hidden_size1 * 4))
        self.b1 = nn.Parameter(torch.Tensor(self.hidden_size1 * 4))
        # LSTM2
        self.W2 = nn.Parameter(torch.Tensor(self.input_size*2, self.hidden_size1 * 4))
        self.U2 = nn.Parameter(torch.Tensor(self.hidden_size1, self.hidden_size1 * 4))
        self.b2 = nn.Parameter(torch.Tensor(self.hidden_size1 * 4))


        self.linear = nn.Linear(self.hidden_size1, self.output_size, bias=True) 

        self.init_weights()
                
    def init_weights(self):
        stdv = 1.0 / math.sqrt(self.hidden_size)
        for weight in self.parameters():
            weight.data.uniform_(-stdv, stdv)
         
    def forward(self, input_tensor):
        """Assumes input_tensor holds token ids of shape (batch, sequence)"""
        # print(input_tensor.shape)
        emb = self.embedding(input_tensor)
        batch_size = emb.shape[0]

        forward = []
        backward = []

        #BiLSTM
        h_t_for, c_t_for = (torch.zeros(batch_size, self.hidden_size).to(emb.device), 
                        torch.zeros(batch_size, self.hidden_size).to(emb.device))
        h_t_back, c_t_back = (torch.zeros(batch_size, self.hidden_size).to(emb.device), 
                        torch.zeros(batch_size, self.hidden_size).to(emb.device))
        #LSTM1
        h_t_1, c_t_1 = (torch.zeros(batch_size, self.hidden_size1).to(emb.device), 
                        torch.zeros(batch_size, self.hidden_size1).to(emb.device))
        #LSTM2
        h_t_2, c_t_2 = (torch.zeros(batch_size, self.hidden_size1).to(emb.device), 
                        torch.zeros(batch_size, self.hidden_size1).to(emb.device))
        
        #BiLSTM
        ds = self.hidden_size
        ds1 = self.hidden_size1
        for t in range(emb.shape[1]):
            emb_t = emb[:, t, :]
            gate = emb_t @ self.W + h_t_for @ self.U + self.b
            i_t, f_t, g_t, o_t = (
                torch.sigmoid(gate[:, :ds]), 
                torch.sigmoid(gate[:, ds:ds*2]),  
                torch.tanh(gate[:, ds*2:ds*3]),
                torch.sigmoid(gate[:, ds*3:]), 
            )
            c_t_for = f_t * c_t_for + i_t * g_t
            h_t_for = o_t * torch.tanh(c_t_for)
            forward.append(h_t_for)

        for t in reversed(range(emb.shape[1])):
            emb_t = emb[:, t, :]
            gate = emb_t @ self.Wb + h_t_back @ self.Ub + self.bb
            i_t, f_t, g_t, o_t = (
                torch.sigmoid(gate[:, :ds]), 
                torch.sigmoid(gate[:, ds:ds*2]),  
                torch.tanh(gate[:, ds*2:ds*3]),
                torch.sigmoid(gate[:, ds*3:]), 
            )
            c_t_back = f_t * c_t_back + i_t * g_t
            h_t_back = o_t * torch.tanh(c_t_back)
            backward.append(h_t_back)

        #LSTM1
        h_lstm1 = []
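        # `backward` was filled from the last token to the first, so reverse it
        # to pair each forward state with the backward state of the same token
        for fwd, bwd in zip(forward, reversed(backward)):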
            # print(fwd.shape, bwd.shape)
            input_tensor = torch.cat((fwd, bwd), 1)
            gate = input_tensor @ self.W1 + h_t_1 @ self.U1 + self.b1
            i_t, f_t, g_t, o_t = (
                torch.sigmoid(gate[:, :ds1]), 
                torch.sigmoid(gate[:, ds1:ds1*2]),  
                torch.tanh(gate[:, ds1*2:ds1*3]),
                torch.sigmoid(gate[:, ds1*3:]), 
            )
            c_t_1 = f_t * c_t_1 + i_t * g_t
            h_t_1 = o_t * torch.tanh(c_t_1)
            h_lstm1.append(h_t_1)

        #LSTM2
        h_lstm2 = []
        for input_tensor in h_lstm1:
            gate = input_tensor @ self.W2 + h_t_2 @ self.U2 + self.b2
            i_t, f_t, g_t, o_t = (
                torch.sigmoid(gate[:, :ds1]), 
                torch.sigmoid(gate[:, ds1:ds1*2]),  
                torch.tanh(gate[:, ds1*2:ds1*3]),
                torch.sigmoid(gate[:, ds1*3:]), 
            )
            c_t_2 = f_t * c_t_2 + i_t * g_t
            h_t_2 = o_t * torch.tanh(c_t_2)
            h_lstm2.append(h_t_2)

        out = self.linear(h_t_2)
        return out
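
One detail worth checking when a later layer consumes both directions at every step: a list filled inside a `reversed(range(...))` loop holds its entries from the last token to the first. The toy sketch below uses strings in place of hidden states (the tokens are hypothetical) to show how the pairing differs with and without reversing the backward list.

```python
tokens = ["a", "gripping", "film"]   # hypothetical three-token sentence

forward, backward = [], []
for t in range(len(tokens)):
    forward.append(("fwd", tokens[t]))
for t in reversed(range(len(tokens))):
    backward.append(("bwd", tokens[t]))   # appended last token first

# Zipping directly pairs the first token with the last one.
misaligned = list(zip(forward, backward))

# Reversing the backward list restores token order before pairing.
aligned = list(zip(forward, reversed(backward)))

for f, b in aligned:
    print(f, b)
```

With token-aligned pairs, the concatenated vector at step `t` describes the same token from both directions, which is the usual input to a stacked layer.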

In [None]:
# result_path = '/home/radhika/radhika_77/data/nlp/models/'
result_path = '/content/drive/MyDrive/nlp/'
criterion = nn.CrossEntropyLoss()
softmax = nn.Softmax(1) 

def train(model, data, labels, val_data, val_labels, num_epochs = 12, file = None):    
    if (file != None):
      best_file = os.path.join(result_path, file)

    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
    best_acc = 0
    best_acc_epoch = 0
    
    train_loss=[]
    train_acc=[]
    val_loss = []
    val_acc=[]

    for epoch in range(num_epochs):
      model.train()
      epoch_loss = 0
      correct = 0
      total = 0
      epoch_acc = 0
      for i in range(len(labels)):
          input_tensor = torch.LongTensor([data[i]])
          label = torch.LongTensor([labels[i]])

          optimizer.zero_grad()
          logits = model(input_tensor)
          loss = criterion(logits, label)
          loss.backward()
          optimizer.step()
          _, predicted = torch.max(logits.data, 1)
          total += label.size(0)
          correct += (predicted == label).sum().item()
          epoch_loss += loss.item()

      epoch_acc = (100 * correct / total)
      train_loss.append(round((epoch_loss / len(labels)), 2))
      train_acc.append(round((epoch_acc),2))

      with torch.no_grad():
            model.eval()
            val_epoch_loss = 0
            val_total = 0
            val_correct = 0
            epoch_val_acc = 0
            for i in range(len(val_labels)):
                input_tensor, label = val_data[i], val_labels[i]
                input_tensor = torch.LongTensor([input_tensor])
                label = torch.LongTensor([label])
                logits = model(input_tensor)
                loss = criterion(logits, label)
                _, predicted = torch.max(logits.data, 1)
                val_total += label.size(0)
                val_correct += (predicted == label).sum().item()
                val_epoch_loss += loss.item()

            val_loss.append(round((val_epoch_loss / len(val_labels)),2))
            epoch_val_acc = (100 * val_correct / val_total)
            val_acc.append(round((epoch_val_acc),2))
      # print("epoch {}: Training Loss- {:.2f} , Val Loss- {:.2f} , Training Acc- {:.2f}, Val Acc- {:.2f}".format(epoch, epoch_loss, val_epoch_loss, epoch_acc, epoch_val_acc))
      if (epoch_val_acc >= best_acc):
            best_acc = epoch_val_acc
            best_acc_epoch = epoch
            if (file!=None):
              torch.save(model.state_dict(), best_file)         
      if (epoch == num_epochs - 1):
        print("Best accuracy at epoch: {}".format(best_acc_epoch))
    return model, train_loss, train_acc, val_loss, val_acc
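
The checkpoint rule in the loop above keeps the weights from the epoch with the highest validation accuracy, and the `>=` comparison means a tie goes to the later epoch. A minimal sketch of that selection logic, isolated from training, is shown here. The helper name `best_epoch` is an assumption for illustration; the accuracy list is copied from the `BiLSTMv2` training log in this section.

```python
def best_epoch(val_accs):
    """Return (epoch, accuracy) of the best validation accuracy; ties go to the later epoch."""
    best_acc, best_idx = 0.0, 0
    for epoch, acc in enumerate(val_accs):
        if acc >= best_acc:
            best_acc, best_idx = acc, epoch
    return best_idx, best_acc

# Validation accuracies per epoch from the BiLSTMv2 run (epochs 0 to 10).
val_accs = [83.91, 90.38, 91.74, 92.08, 91.91, 92.55, 92.12, 92.38, 92.76, 92.64, 92.34]
print(best_epoch(val_accs))            # (8, 92.76)

# A tie is resolved in favour of the later epoch because of `>=`.
print(best_epoch([90.0, 91.0, 91.0]))  # (2, 91.0)
```

Note that epochs are zero-indexed in the log, so index 8 corresponds to the ninth pass over the training data.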

In [None]:
d = 128 # size of word-embedding
num_epochs = 11
model = BiLSTMv2(d)
model, train_loss, train_acc, val_loss, val_acc = train(model, token_ids, labels, val_token_ids, val_labels, num_epochs, 'best_bilstm_v2.pt')

epoch 0: Training Loss- 39540.20 , Val Loss- 877.73 , Training Acc- 63.37, Val Acc- 83.91
epoch 1: Training Loss- 18006.32 , Val Loss- 605.49 , Training Acc- 89.02, Val Acc- 90.38
epoch 2: Training Loss- 10992.66 , Val Loss- 592.08 , Training Acc- 93.92, Val Acc- 91.74
epoch 3: Training Loss- 7898.49 , Val Loss- 636.27 , Training Acc- 95.76, Val Acc- 92.08
epoch 4: Training Loss- 6086.68 , Val Loss- 659.23 , Training Acc- 96.66, Val Acc- 91.91
epoch 5: Training Loss- 4977.78 , Val Loss- 699.48 , Training Acc- 97.23, Val Acc- 92.55
epoch 6: Training Loss- 3962.54 , Val Loss- 800.84 , Training Acc- 97.76, Val Acc- 92.12
epoch 7: Training Loss- 3289.22 , Val Loss- 846.90 , Training Acc- 98.13, Val Acc- 92.38
epoch 8: Training Loss- 3137.43 , Val Loss- 715.26 , Training Acc- 98.20, Val Acc- 92.76
epoch 9: Training Loss- 2620.80 , Val Loss- 889.52 , Training Acc- 98.50, Val Acc- 92.64
epoch 10: Training Loss- 2266.90 , Val Loss- 871.81 , Training Acc- 98.74, Val Acc- 92.34


In [None]:
test_text = test_tsv_read['sentence']
test_labels = list(test_tsv_read['label'])
test_text = list(test_text.str.lower())
test_tokens, test_token_ids = tokenization(test_text)
test(model, test_token_ids, test_labels, 'best_bilstm_v2.pt')

Test loss: 0.97, test accuracy: 87.43


<font color='blue'>The model is trained for 11 epochs. Its accuracy on the dev set is 87.43%. This model outperforms all the models trained in Solution 3 and Solution 4 (the baseline model, vanilla RNN, LSTM, and LSTM with dropout).</font>

## 5. Pretrained Word Vectors
The last step is to use pretrained vocabulary and word vectors. The prebuilt vocabulary will replace the vocabulary you built with SST-2 training data, and the word vectors will replace the embedding vectors. You will observe the power of leveraging self-supervised pretrained models.

**Problem 5.1** *(10 points)* Go to https://nlp.stanford.edu/projects/glove/ and download `glove.6B.zip`. Use these pretrained word vectors to further improve your model from 4.2. Report the model's accuracy on the dev data.

**Problem 5.2 (bonus)** *(10 points)* You can go one step further by using word vectors obtained from pretrained language models. Can you import the word embeddings from `bert-base-uncased` model (via Hugging Face's `transformers`: https://huggingface.co/transformers/pretrained_models.html) into your model and improve it further? Report the accuracy on the dev data here. If the score is now higher, explain why you think this is better.
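
For Problem 5.1, one possible way to plug pretrained vectors into the model is sketched below. GloVe text files store one word per line followed by its vector components, separated by spaces. The three-word "file", the `<unk>` entry at index 0, and the helper name `load_glove` are assumptions made for illustration; a real run would read one of the text files unpacked from `glove.6B.zip` instead of the in-memory string.

```python
import io
import torch
from torch import nn

# Hypothetical three-word file in GloVe's text format: "<word> <v1> <v2> ...".
toy_glove = io.StringIO(
    "the 0.1 0.2 0.3\n"
    "movie 0.4 0.5 0.6\n"
    "great 0.7 0.8 0.9\n"
)

def load_glove(handle):
    """Build a word-to-id map and a weight matrix; id 0 is an all-zero <unk> row."""
    word_to_id = {"<unk>": 0}
    vectors = []
    for line in handle:
        parts = line.rstrip().split(" ")
        word_to_id[parts[0]] = len(word_to_id)
        vectors.append([float(v) for v in parts[1:]])
    matrix = torch.tensor(vectors)
    unk = torch.zeros(1, matrix.shape[1])
    return word_to_id, torch.cat((unk, matrix), dim=0)

word_to_id, weights = load_glove(toy_glove)

# freeze=True keeps the pretrained vectors fixed during training.
embedding = nn.Embedding.from_pretrained(weights, freeze=True)

sentence = "the movie was great".split()
ids = torch.LongTensor([[word_to_id.get(w, 0) for w in sentence]])
emb = embedding(ids)   # (1, 4, 3); "was" maps to the <unk> row
print(emb.shape)
```

In a model such as the `LSTM` class above, this layer would take the place of `nn.Embedding(len(vocab), self.input_size)`, with `input_size` set to the GloVe dimension. Whether to freeze or fine-tune the vectors is a design choice worth comparing on the validation split.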

#### $\color{blue}{\text{Solution 5.2}}$

In [None]:
!pip install transformers

In [None]:
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased', output_hidden_states = True)
model.eval()

BertModel(
  (embeddings): BertEmbeddings(
    (word_embeddings): Embedding(30522, 768, padding_idx=0)
    (position_embeddings): Embedding(512, 768)
    (token_type_embeddings): Embedding(2, 768)
    (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
    (dropout): Dropout(p=0.1, inplace=False)
  )
  (encoder): BertEncoder(
    (layer): ModuleList(
      (0): BertLayer(
        (attention): BertAttention(
          (self): BertSelfAttention(
            (query): Linear(in_features=768, out_features=768, bias=True)
            (key): Linear(in_features=768, out_features=768, bias=True)
            (value): Linear(in_features=768, out_features=768, bias=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
          (output): BertSelfOutput(
            (dense): Linear(in_features=768, out_features=768, bias=True)
            (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
            (dropout): Dropout(p=0.1, inplace=False)
          

In [None]:
def tokenization1(text): 
  indexed_tokens_list = []
  segment_ids_list = []
  for i in range(len(text)):
    tokenized_text = tokenizer.tokenize(text[i])
    indexed_tokens = tokenizer.convert_tokens_to_ids(tokenized_text)
    segment_ids = [1]*len(tokenized_text) 
    indexed_tokens_list.append(indexed_tokens)
    segment_ids_list.append(segment_ids)
  return indexed_tokens_list, segment_ids_list

text = list(tsv_read['sentence'])
labels = list(tsv_read['label'])
indexed_tokens_list, segment_ids_list = tokenization1(text)

val_text = list(val_tsv_read['sentence'])
val_labels = list(val_tsv_read['label'])
val_indexed_tokens_list, val_segment_ids_list = tokenization1(val_text)

In [None]:
class LSTM_without_emb(nn.Module):
    def __init__(self, d):
        """
        LSTM over precomputed 768-dimensional BERT features (no embedding
        layer), built from `nn.Parameter` weights and an `nn.Linear` classifier
        """
        super(LSTM_without_emb, self).__init__()
        self.input_size = 768
        self.hidden_size = d
        self.output_size = 2
        # Fused gate parameters; columns are ordered i_t, f_t, g_t, o_t
        self.W = nn.Parameter(torch.Tensor(self.input_size, self.hidden_size * 4))
        self.U = nn.Parameter(torch.Tensor(self.hidden_size, self.hidden_size * 4))
        self.b = nn.Parameter(torch.Tensor(self.hidden_size * 4))

        self.linear = nn.Linear(self.hidden_size, self.output_size, bias=True) 

        self.init_weights()
                
    def init_weights(self):
        stdv = 1.0 / math.sqrt(self.hidden_size)
        for weight in self.parameters():
            weight.data.uniform_(-stdv, stdv)
         
    def forward(self, input_tensor):
        """Assumes x is of shape (batch, sequence, feature)"""
        # print(input_tensor.shape)
        emb = input_tensor
        batch_size = emb.shape[0]

        h_t, c_t = (torch.zeros(batch_size, self.hidden_size).to(emb.device), 
                        torch.zeros(batch_size, self.hidden_size).to(emb.device))
          
        ds = self.hidden_size
        for t in range(emb.shape[1]):
            emb_t = emb[:, t, :]
            gate = emb_t @ self.W + h_t @ self.U + self.b
            i_t, f_t, g_t, o_t = (
                torch.sigmoid(gate[:, :ds]), 
                torch.sigmoid(gate[:, ds:ds*2]),  
                torch.tanh(gate[:, ds*2:ds*3]),
                torch.sigmoid(gate[:, ds*3:]), 
            )
            c_t = f_t * c_t + i_t * g_t
            h_t = o_t * torch.tanh(c_t)

        out = self.linear(h_t)
        return out

In [None]:
# result_path = '/home/edlab/radhika/radhika_77/data/nlp/models/'
result_path = '/content/drive/MyDrive/nlp/'
criterion = nn.CrossEntropyLoss()
softmax = nn.Softmax(1) 

def train(model1, indexed_tokens_list, segment_ids_list, labels, val_indexed_tokens_list, val_segment_ids_list, val_labels, num_epochs = 12, file = None):
    if file is not None:
      best_file = os.path.join(result_path, file)
    model = BertModel.from_pretrained('bert-base-uncased', output_hidden_states = True)
    model.eval()
    # Optimize the LSTM classifier (model1); BERT (model) is only a frozen feature extractor
    optimizer = torch.optim.SGD(model1.parameters(), lr=0.1)
    best_acc = 0
    best_acc_epoch = 0
    
    train_loss=[]
    train_acc=[]
    val_loss = []
    val_acc=[]

    for epoch in range(num_epochs):
      model1.train()  # only the LSTM classifier is trained; BERT stays in eval mode
      epoch_loss = 0
      correct = 0
      total = 0
      epoch_acc = 0
      for i in range(len(labels[:40000])):
          with torch.no_grad():
            tokens_tensor = torch.tensor([indexed_tokens_list[i]])
            segments_tensor = torch.tensor([segment_ids_list[i]])
            outputs = model(tokens_tensor, segments_tensor)
            hidden_states = outputs[2]
            token_embeddings = torch.stack(hidden_states, dim =0)
            token_embeddings = torch.squeeze(token_embeddings, dim =1)
            token_embeddings = token_embeddings.permute(1,0,2)
            tokens_vec_sum = []
            for token in token_embeddings:
              sum_vec = torch.sum(token[-4:], dim=0)
              tokens_vec_sum.append(sum_vec)
          # print(data[i].shape)
          input_tensor = torch.stack(tokens_vec_sum)
          input_tensor = input_tensor.reshape(1, input_tensor.shape[0], input_tensor.shape[1])
          label = torch.LongTensor([labels[i]])

          optimizer.zero_grad()
          logits = model1(input_tensor)
          # print(logits.shape, label.shape)
          loss = criterion(logits, label)
          loss.backward()
          optimizer.step()
          _, predicted = torch.max(logits.data, 1)
          total += label.size(0)
          # print(predicted, label)
          correct += (predicted == label).sum().item()
          epoch_loss += loss.item()

      epoch_acc = (100 * correct / total)
      train_loss.append(round((epoch_loss / total), 2))  # average over the samples actually used
      train_acc.append(round((epoch_acc),2))

      with torch.no_grad():
            model1.eval()
            val_epoch_loss = 0
            val_total = 0
            val_correct = 0
            epoch_val_acc = 0
            for i in range(len(val_labels)):
                with torch.no_grad():
                  tokens_tensor = torch.tensor([val_indexed_tokens_list[i]])
                  segments_tensor = torch.tensor([val_segment_ids_list[i]])
                  outputs = model(tokens_tensor, segments_tensor)
                  hidden_states = outputs[2]
                  token_embeddings = torch.stack(hidden_states, dim =0)
                  token_embeddings = torch.squeeze(token_embeddings, dim =1)
                  token_embeddings = token_embeddings.permute(1,0,2)
                  tokens_vec_sum = []
                  for token in token_embeddings:
                    sum_vec = torch.sum(token[-4:], dim=0)
                    tokens_vec_sum.append(sum_vec)
                input_tensor = torch.stack(tokens_vec_sum)
                input_tensor = input_tensor.reshape(1, input_tensor.shape[0], input_tensor.shape[1])
                label = torch.LongTensor([val_labels[i]])
                logits = model1(input_tensor)
                loss = criterion(logits, label)
                _, predicted = torch.max(logits.data, 1)
                val_total += label.size(0)
                val_correct += (predicted == label).sum().item()
                val_epoch_loss += loss.item()

            val_loss.append(round((val_epoch_loss / len(val_labels)),2))
            epoch_val_acc = (100 * val_correct / val_total)
            val_acc.append(round((epoch_val_acc),2))
      # print("epoch {}: Training Loss- {:.2f} , Val Loss- {:.2f} , Training Acc- {:.2f}, Val Acc- {:.2f}".format(epoch, epoch_loss, val_epoch_loss, epoch_acc, epoch_val_acc))
      if (epoch_val_acc >= best_acc):
            best_acc = epoch_val_acc
            best_acc_epoch = epoch
            if file is not None:
              # Save the classifier (model1): test() loads this checkpoint into the classifier.
              torch.save(model1.state_dict(), best_file)
    return model1, train_loss, train_acc, val_loss, val_acc

In [None]:
def test(model1, indexed_tokens_list, segment_ids_list, labels, file=None):
    # `model1` is the LSTM classifier. The global `model` is assumed to be the
    # pretrained encoder, matching how train() uses the two names.
    if file is not None:
      best_file = os.path.join(result_path, file)
      model1.load_state_dict(torch.load(best_file))
    model1.eval()

    with torch.no_grad():
        test_total = 0
        test_correct = 0
        test_loss = 0
        test_accuracy = 0
        for i in range(len(labels)):
            # Encode sentence i with the pretrained encoder.
            tokens_tensor = torch.tensor([indexed_tokens_list[i]])
            segments_tensor = torch.tensor([segment_ids_list[i]])
            outputs = model(tokens_tensor, segments_tensor)
            hidden_states = outputs[2]
            token_embeddings = torch.stack(hidden_states, dim=0)
            token_embeddings = torch.squeeze(token_embeddings, dim=1)
            token_embeddings = token_embeddings.permute(1, 0, 2)
            # Sum the last four hidden layers for each token.
            tokens_vec_sum = []
            for token in token_embeddings:
                sum_vec = torch.sum(token[-4:], dim=0)
                tokens_vec_sum.append(sum_vec)
            input_tensor = torch.stack(tokens_vec_sum)
            input_tensor = input_tensor.reshape(1, input_tensor.shape[0], input_tensor.shape[1])
            label = torch.LongTensor([labels[i]])
            logits = model1(input_tensor)
            loss = criterion(logits, label)
            _, predicted = torch.max(logits.data, 1)
            test_total += label.size(0)
            test_correct += (predicted == label).sum().item()
            test_loss += loss.item()
        test_accuracy = round((100 * test_correct / test_total), 2)
        print("Test loss: {}, test accuracy: {}".format(round((test_loss / len(labels)), 2), test_accuracy))
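
The per-token loop that sums the last four hidden layers can also be written as a single tensor operation. The sketch below is only an illustration under stated assumptions: `hidden_states` is a random stand-in for the tuple of per-layer outputs, and the sizes (13 layers, 5 tokens, 768 dimensions) are hypothetical rather than taken from the notebook. It checks that the vectorized form gives the same `(1, seq_len, hidden)` tensor as the loop.

```python
import torch

torch.manual_seed(0)

# Hypothetical sizes; random tensors stand in for the encoder's hidden states.
num_layers, seq_len, hidden = 13, 5, 768
hidden_states = tuple(torch.randn(1, seq_len, hidden) for _ in range(num_layers))

# Loop version, following the steps used in the notebook.
token_embeddings = torch.stack(hidden_states, dim=0)       # (layers, 1, seq, hidden)
token_embeddings = torch.squeeze(token_embeddings, dim=1)  # (layers, seq, hidden)
token_embeddings = token_embeddings.permute(1, 0, 2)       # (seq, layers, hidden)
tokens_vec_sum = [torch.sum(token[-4:], dim=0) for token in token_embeddings]
loop_result = torch.stack(tokens_vec_sum)
loop_result = loop_result.reshape(1, loop_result.shape[0], loop_result.shape[1])

# Vectorized version: stack the last four layers and sum over the layer axis.
vectorized = torch.stack(hidden_states[-4:], dim=0).sum(dim=0)  # (1, seq, hidden)

print(loop_result.shape, vectorized.shape)
print(torch.allclose(loop_result, vectorized, atol=1e-5))
```

Because the encoder runs under `torch.no_grad()` and its weights are not updated, these summed embeddings could in principle be computed once per sentence and cached instead of being recomputed every epoch.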

In [None]:
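d = 128
num_epochs = 12
# Use a separate name for the classifier so that the global `model`, which train()
# appears to use as the pretrained encoder, is not overwritten.
lstm_model = LSTM_without_emb(d)
model1, train_loss, train_acc, val_loss, val_acc = train(
    lstm_model, indexed_tokens_list, segment_ids_list, labels,
    val_indexed_tokens_list, val_segment_ids_list, val_labels,
    num_epochs, 'best_LSTM_without_emb.pt')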

In [None]:
test_text = list(test_tsv_read['sentence'])
test_labels = list(test_tsv_read['label'])
test_indexed_tokens_list, test_segment_ids_list = tokenization1(test_text)
# Evaluate on the held-out split prepared above, not on the training data.
test(model1, test_indexed_tokens_list, test_segment_ids_list, test_labels)

Test loss: 0.91, test accuracy: 85.62


<font color='blue'>The model is trained for 12 epochs (`num_epochs = 12`). Its accuracy on the dev set is **85.62%**, which is higher than that of the baseline model, the vanilla RNN, the LSTM, and the LSTM with dropout.</font>

<font color='blue'>This model also outperforms the earlier LSTM variants (the LSTM and the LSTM with dropout from Section 4), for two reasons:

1. Word vectors obtained from a pretrained language model cover words that never appeared in the training set, so **the out-of-vocabulary problem is eliminated**. **A pretrained language model provides embeddings for words that were unseen or occurred only rarely in the training set**.
2. The problem of polysemy is eliminated. The model recognizes **that a word may have more than one meaning depending on the context**, so it does not assign a single fixed vector to each word. Instead, it computes the representation of a word from the entire input sequence.</font>
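
The second point can be illustrated with a small sketch. Everything in it is an assumption made for illustration: the four-word vocabulary is a toy, and a randomly initialized `nn.LSTM` stands in for the pretrained language model. It only shows the mechanism: a static lookup table returns the same vector for "bank" in both sentences, while an encoder that reads the whole sequence returns different vectors.

```python
import torch
from torch import nn

torch.manual_seed(0)

# Toy vocabulary (hypothetical, for illustration only).
vocab = {"the": 0, "river": 1, "money": 2, "bank": 3}

static_emb = nn.Embedding(len(vocab), 8)    # one fixed vector per word
encoder = nn.LSTM(8, 8, batch_first=True)   # stand-in for a contextual encoder

def encode(words):
    """Return (static, contextual) vectors for every word in the sentence."""
    ids = torch.tensor([[vocab[w] for w in words]])
    static = static_emb(ids)           # (1, seq_len, 8)
    contextual, _ = encoder(static)    # (1, seq_len, 8), depends on earlier words
    return static[0], contextual[0]

with torch.no_grad():
    static_a, ctx_a = encode(["the", "river", "bank"])
    static_b, ctx_b = encode(["the", "money", "bank"])

# "bank" is the last token (index 2) in both sentences.
same_static = torch.allclose(static_a[2], static_b[2])
same_contextual = torch.allclose(ctx_a[2], ctx_b[2])
print("static vectors identical:    ", same_static)
print("contextual vectors identical:", same_contextual)
```

With a real pretrained model the difference between the two "bank" vectors would reflect learned word senses rather than random weights, but the dependence on the surrounding sequence is the same.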