# Assignment 2

In this part of Assignment 2, we will build a machine learning model that detects the sentiment of movie reviews using the Stanford Sentiment Treebank ([SST](http://ai.stanford.edu/~amaas/data/sentiment/)) dataset. First we will import all the required libraries. We highly recommend that you finish the PyTorch tutorials [1](https://pytorch.org/tutorials/beginner/pytorch_with_examples.html), [2](https://pytorch.org/tutorials/beginner/data_loading_tutorial.html), and [3](https://github.com/yunjey/pytorch-tutorial) before starting this assignment. After finishing this assignment you will be able to answer the following questions:


* How to write DataLoaders in PyTorch?
* How to build dictionaries and vocabularies for deep nets?
* How to use embedding layers in PyTorch?
* How to build various recurrent models (LSTMs and GRUs) for sentiment analysis?
* How to use `pack_padded_sequence` for sequential models?




# Import Libraries

In [0]:
import numpy as np
import torch
import torch.nn as nn
import torch.optim as optim
from torch.autograd import Variable
import torch.nn.functional as F
from collections import defaultdict
from torchtext import datasets
from torchtext import data
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
from torch.nn.utils.rnn import pack_sequence, pad_sequence

## Download dataset
First we will download the dataset using [torchtext](https://torchtext.readthedocs.io/en/latest/index.html), a package that supports NLP for PyTorch. The following command returns three objects: `train_data`, `val_data`, and `test_data`. To access the data:

*   To access list of textual tokens - `train_data[0].text`
*   To access label - `train_data[0].label`



In [0]:
if(__name__=='__main__'):
  train_data, val_data, test_data = datasets.SST.splits(data.Field(tokenize = 'spacy'), data.LabelField(dtype = torch.float), filter_pred=lambda ex: ex.label != 'neutral')

In [0]:
if(__name__=='__main__'):
  train_data[0]



# Define the Dataset Class

In the following cell, we will define the dataset class. You need to implement the following functions: 


*   ` build_dictionary() ` - Creates the dictionaries `ixtoword` and `wordtoix`, converts the text of every example into a list of word ids, and stores the result in `textual_ids`. A word that is not present in your dictionary should map to `<unk>`. Use the hyperparameter `THRESHOLD` to control which words enter the dictionary: a word is included only if it occurs `>=THRESHOLD` times.
*   ` get_label() ` - It should return the value `0` if the label in the dataset is `positive`, and should return `1` if it is `negative`. 
*   ` get_text() ` - This function should pad the review with the `<end>` token up to a length of `MAX_LEN` if the text is shorter than `MAX_LEN`.
*   ` __len__() ` - This function should return the total length of the dataset.
*   ` __getitem__() ` - This function should return the padded text, the length of the text (without the padding) and the label.
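
The steps listed above can be sketched on a toy corpus without torchtext. This is a minimal illustration under stated assumptions, not the reference solution: the helper names `build_vocab` and `encode_and_pad`, the toy reviews, and the small `THRESHOLD` and `MAX_LEN` values are all made up for the example. It reserves id `0` for `<end>` and id `1` for `<unk>`, the same convention the cell below uses.

```python
from collections import Counter

THRESHOLD = 2    # toy value; the assignment cell uses 3
MAX_LEN = 6      # toy value; the assignment cell uses 60
END, UNK = 0, 1  # reserved ids for '<end>' (padding) and '<unk>'

def build_vocab(token_lists, threshold=THRESHOLD):
    """Map every word seen at least `threshold` times to an id >= 2."""
    counts = Counter(tok for toks in token_lists for tok in toks)
    wordtoix = {'<end>': END, '<unk>': UNK}
    for word, count in counts.most_common():
        if count >= threshold:
            wordtoix[word] = len(wordtoix)
    ixtoword = {ix: word for word, ix in wordtoix.items()}
    return wordtoix, ixtoword

def encode_and_pad(tokens, wordtoix, max_len=MAX_LEN):
    """Return (padded ids, unpadded length) for one review."""
    ids = [wordtoix.get(tok, UNK) for tok in tokens][:max_len]
    length = len(ids)
    return ids + [END] * (max_len - length), length

reviews = [['a', 'good', 'movie'], ['a', 'bad', 'movie'], ['good', 'fun']]
wordtoix, ixtoword = build_vocab(reviews)
padded, length = encode_and_pad(['a', 'weird', 'movie'], wordtoix)
print(wordtoix)
print(padded, length)
```

Words below the threshold (`bad`, `fun`) and unseen words (`weird`) all fall back to the `<unk>` id, and the returned length excludes the padding, which is what the recurrent model later needs for packing.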


In [0]:
from collections import Counter
import copy
THRESHOLD = 3
MAX_LEN = 60
class TextDataset(data.Dataset):
  def __init__(self, examples, split, ixtoword=None, wordtoix=None, THRESHOLD=THRESHOLD):
    self.examples=examples
    self.split = split
    self.THRESHOLD = THRESHOLD
    self.lenth=len(examples)
    if self.split=='train':
      # train 
      self.textual_ids,self.ixtoword,self.wordtoix=self.build_dictionary()
    elif self.split=='test':
      print("test")
      self.ixtoword=ixtoword
      self.wordtoix=wordtoix
      self.textual_ids=self.get_textual_ids()
    ### TO-DO
  
  def get_textual_ids(self):
    textual_ids=[]
    UNK=1
    for i in range(len(self.examples)):
        textual_ids.append(list(map(lambda word: self.wordtoix.get(word,UNK), self.examples[i].text)))
    return textual_ids
       
    
  def build_dictionary(self):
    # count words
    copy_ex=[]
    for i in range(len(self.examples)):
        for inner in self.examples[i].text:
            copy_ex.append(inner)
    vocab_count=Counter(copy_ex)
    END=0
    UNK=1
    wordtoix={
        word:idx
        for idx, (word,count) in enumerate(vocab_count.most_common(), start=2)
        if count>=self.THRESHOLD

    }
    wordtoix['<end>']=END
    wordtoix['<unk>']=UNK
    ixtoword={
        idx:word
        for word, idx in wordtoix.items()
        if idx not in {END,UNK}    
    }
    ixtoword[END]='<end>'
    ixtoword[UNK]='<unk>'
    print(ixtoword[END])
    textual_ids=[]
    for i in range(len(self.examples)):
        textual_ids.append(list(map(lambda word: wordtoix.get(word,UNK), self.examples[i].text)))
    return textual_ids, ixtoword, wordtoix
  
  def get_label(self, index):
    if self.examples[index].label=='positive':
        return 0
    elif self.examples[index].label=='negative':
        return 1
               
  def get_text(self, index):
    ### TO-DO
    # Return a padded copy of the tokens so the stored example is not modified in place.
    END='<end>'
    tokens=list(self.examples[index].text)
    tokens+=[END]*(MAX_LEN-len(tokens))
    return tokens
    
  
  def __len__(self):
    return self.lenth
  
  def __getitem__(self, index):
    ### TO-DO
   
    text_len=0
    for i in self.examples[index].text:
        if i!='<end>':
            text_len+=1
        else:
            break
            
    lbl=torch.tensor(self.get_label(index))
    while len(self.textual_ids[index])<MAX_LEN:
      self.textual_ids[index].append(0)
      
    text=np.array(self.textual_ids[index])
    text_len=int(text_len)  # np.int was removed in NumPy 1.24; the built-in int is equivalent
    return text, text_len, lbl
    
    

## Initialize the Dataloader
We initialize the training and testing dataloaders using the Dataset classes we create for both training and testing. Make sure you use the same vocabulary for both the datasets.
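
As a rough illustration of what a `DataLoader` does with the three values returned by `__getitem__()`, the sketch below batches a tiny hand-written dataset. `ToyDataset` and its contents are hypothetical stand-ins for `TextDataset`; only `torch.utils.data.Dataset` and `DataLoader` are real PyTorch classes.

```python
import torch
from torch.utils.data import Dataset, DataLoader

class ToyDataset(Dataset):
    """Hypothetical stand-in for TextDataset: items are already padded id lists."""
    def __init__(self, padded_ids, lengths, labels):
        self.padded_ids = padded_ids
        self.lengths = lengths
        self.labels = labels

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, index):
        # Same triple as the assignment: padded text, unpadded length, label.
        return (torch.tensor(self.padded_ids[index]),
                self.lengths[index],
                torch.tensor(self.labels[index]))

toy = ToyDataset(
    padded_ids=[[2, 3, 4, 0], [5, 6, 0, 0], [7, 0, 0, 0], [8, 9, 10, 11]],
    lengths=[3, 2, 1, 4],
    labels=[0, 1, 1, 0],
)
loader = DataLoader(toy, batch_size=2, shuffle=False)
text, text_lens, label = next(iter(loader))
print(text.shape, text_lens, label)
```

The default collate function stacks the per-example values, so `text` arrives as a `[batch_size, MAX_LEN]` tensor and the lengths and labels as `[batch_size]` tensors. This is why every review must be padded to the same length before batching.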

In [17]:
if(__name__=='__main__'):
  Ds = TextDataset(train_data, 'train')
  print(Ds.__getitem__(0))
  print(Ds.wordtoix)
  print(Ds.ixtoword)
  textual_ids, ixtoword, wordtoix= Ds.build_dictionary()
  batch_size = 32

  train_loader = torch.utils.data.DataLoader(Ds, batch_size=batch_size, shuffle=True, num_workers=4, drop_last=True)
  test_Ds = TextDataset(test_data,'test',ixtoword, wordtoix) ### TO-DO - using test_data
  
  test_loader = torch.utils.data.DataLoader(test_Ds, batch_size=1, shuffle=False, num_workers=4, drop_last=True) 


<end>
(array([  15, 1206,   10, 3050,    8,   28,    4, 3051, 3052,   11,  151,
         25,   25,    1,   66,    5,   12,   96,   11,  256,    8,   85,
          7,    1,   75, 3881,   41, 2137, 3053,    3,    1,    9,    1,
       3882,    1,   51,  825,    1,    2,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
          0,    0,    0,    0,    0]), 39, tensor(0))
<end>
test


## Build your Sequential Model
In the following cell, we provide the class in which you will build your model. We also provide some parameters that we expect you to use when initializing your sequential model.

In [0]:
import torch.nn as nn

class RNN(nn.Module):
    def __init__(self, vocab_size, embedding_dim, hidden_dim, output_dim, n_layers, 
                 bidirectional, dropout, pad_idx):
        
        super().__init__()
        ## To-Do
        # - Create an embedding layer - refer to nn.Embedding
        self.embeds=nn.Embedding(vocab_size,embedding_dim,padding_idx=pad_idx)
        # - Use a sequential network - nn.LSTM or nn.GRU
        self.LSTM=nn.LSTM(
            embedding_dim,
            hidden_dim,
            num_layers=n_layers,
            bidirectional=bidirectional,
            dropout=dropout
        )
        self.final_cell=nn.Linear(hidden_dim*2, output_dim)
        # Have an output layer for outputting a single output value
        self.dropout=nn.Dropout(dropout)
        
    def forward(self, text, text_lengths):
      
        ## TO - DO 
        ## Hint(s):  Refer to nn.utils.rnn.pack_padded_sequence for padded tensors
        #text = [MAX LEN, batch size]
        #text_lengths = [batch size]
        # generate word embedding
        word_embedding=self.dropout(self.embeds(text))
        # pack word embedding (recent PyTorch versions require the lengths to be a CPU tensor)
        pack_sequence= nn.utils.rnn.pack_padded_sequence(word_embedding,text_lengths.cpu(),enforce_sorted=False)
        
        # feed packed word embedding into LSTM
        pack_output, (hidden, cell)=self.LSTM(pack_sequence)
        # unpack LSTM output
        output, output_len= nn.utils.rnn.pad_packed_sequence(pack_output)
        # final forward: concatenate the last layer's final forward and backward hidden states (assumes bidirectional=True)
        hidden= self.dropout(torch.cat((hidden[-2,:,:],hidden[-1,:,:]),dim=1))
        
        ## You do not need to apply a sigmoid to the final output - we do that for you when we call it in evaluation
        
        return self.final_cell(hidden)
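
The hint about `pack_padded_sequence` can be checked in isolation. The sketch below is a toy setup with made-up sizes and token ids, not the assignment model: it packs two padded sequences, runs a single-layer unidirectional LSTM, and compares the final hidden state of the short sequence with the one obtained by running that sequence alone without padding.

```python
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence

torch.manual_seed(0)
embedding = nn.Embedding(num_embeddings=10, embedding_dim=4, padding_idx=0)
lstm = nn.LSTM(input_size=4, hidden_size=3)  # expects [seq_len, batch, features]

# Two sequences padded to length 5 with id 0; transposed to [seq_len=5, batch=2].
text = torch.tensor([[1, 2, 3, 0, 0],
                     [4, 5, 6, 7, 8]]).t()
lengths = torch.tensor([3, 5])  # kept on the CPU

packed = pack_padded_sequence(embedding(text), lengths, enforce_sorted=False)
packed_out, (hidden, cell) = lstm(packed)
output, output_lengths = pad_packed_sequence(packed_out)

# Run the short sequence alone, without any padding, for comparison.
_, (hidden_alone, _) = lstm(embedding(text[:3, :1]))
print(output.shape, output_lengths)
print(torch.allclose(hidden[0, 0], hidden_alone[0, 0], atol=1e-6))
```

Because the packed LSTM stops at each sequence's true length, `hidden` holds the state after the last real token rather than after the padding, and the padded positions of `output` are filled with zeros.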

In [7]:
# Hyperparameters for your model
# Feel Free to play around with these
# for getting optimal performance
# TO-DO
if(__name__=='__main__'):
  INPUT_DIM = len(Ds.build_dictionary()[2]) #this should be your vocab size
  EMBEDDING_DIM = 100
  HIDDEN_DIM = 256
  OUTPUT_DIM = 1
  N_LAYERS = 2
  BIDIRECTIONAL = True
  DROPOUT = 0.5
  PAD_IDX = 0

  model = RNN(INPUT_DIM, 
              EMBEDDING_DIM, 
              HIDDEN_DIM, 
              OUTPUT_DIM, 
              N_LAYERS, 
              BIDIRECTIONAL, 
              DROPOUT, 
              PAD_IDX)

<end>


In [8]:
def count_parameters(model):
    return sum(p.numel() for p in model.parameters() if p.requires_grad)
if(__name__=='__main__'):
  print(f'The model has {count_parameters(model):,} trainable parameters')

The model has 2,830,557 trainable parameters
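
The reported total can be reproduced by hand. The sketch below is plain arithmetic based on how `nn.LSTM` lays out its weights (four gates, each with input weights, recurrent weights, and two bias vectors). The vocabulary size of 5199 is not printed anywhere in this notebook; it is inferred here by subtracting the LSTM and linear-layer counts from the reported total, so treat it as an assumption.

```python
def lstm_param_count(input_dim, hidden_dim, n_layers, bidirectional):
    """Count nn.LSTM parameters: 4 gates, each with input weights,
    recurrent weights, and two bias vectors, per layer and direction."""
    directions = 2 if bidirectional else 1
    total = 0
    for layer in range(n_layers):
        # Layers after the first read the concatenated outputs of both directions.
        layer_input = input_dim if layer == 0 else hidden_dim * directions
        per_direction = 4 * hidden_dim * (layer_input + hidden_dim + 2)
        total += directions * per_direction
    return total

EMBEDDING_DIM, HIDDEN_DIM, OUTPUT_DIM, N_LAYERS = 100, 256, 1, 2
VOCAB_SIZE = 5199  # inferred from the reported total, not printed in the notebook

embedding_params = VOCAB_SIZE * EMBEDDING_DIM
lstm_params = lstm_param_count(EMBEDDING_DIM, HIDDEN_DIM, N_LAYERS, bidirectional=True)
linear_params = (HIDDEN_DIM * 2) * OUTPUT_DIM + OUTPUT_DIM
total_params = embedding_params + lstm_params + linear_params
print(embedding_params, lstm_params, linear_params, total_params)
```

Most of the parameters sit in the second LSTM layer, whose input is twice the hidden size because the model is bidirectional.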


### Define your loss function and optimizer

In [0]:
import torch.optim as optim
# TO-DO
# Feel Free to play around with different optimizers and loss functions
# for getting optimal performance
# For optimizers : https://pytorch.org/docs/stable/optim.html
# For loss functions : https://pytorch.org/docs/stable/nn.html#loss-functions
if(__name__=='__main__'):
#   optimizer = optim.SGD(model.parameters(), lr=1e-3)
  optimizer = optim.Adam(model.parameters())
  criterion = nn.BCEWithLogitsLoss() 
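
`nn.BCEWithLogitsLoss` is the reason the model does not apply a sigmoid to its output: the loss applies it internally. A small check with made-up logits and labels suggests it matches `nn.BCELoss` applied to sigmoid outputs.

```python
import torch
import torch.nn as nn

logits = torch.tensor([2.0, -1.0, 0.5])  # raw model outputs, no sigmoid applied
labels = torch.tensor([1.0, 0.0, 0.0])   # targets must be floats

loss_from_logits = nn.BCEWithLogitsLoss()(logits, labels)
loss_from_probs = nn.BCELoss()(torch.sigmoid(logits), labels)
print(loss_from_logits.item(), loss_from_probs.item())
```

The fused version is generally preferred because it is numerically more stable for large-magnitude logits.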

### Put your model on the GPU

In [0]:
if(__name__=='__main__'):
  model = model.to(device)
  criterion = criterion.to(device)

In [0]:
def binary_accuracy(preds, y):
    """
    Returns accuracy per batch, i.e. if you get 8/10 right, this returns 0.8, NOT 8
    """
    #round predictions to the closest integer
    rounded_preds = torch.round(torch.sigmoid(preds))
    correct = (rounded_preds == y).float() #convert into float for division 
    acc = correct.sum() / len(correct)
    return acc

## Train your Model

In [0]:
def train_model(model, num_epochs, data_loader):
  model.train()
  for epoch in range(num_epochs):
    epoch_loss = 0
    epoch_acc = 0
    for idx, (text, text_lens, label) in enumerate(data_loader):
        if(idx%100==0):
          print('Executed Step {} of Epoch {}'.format(idx, epoch))
        text = text.to(device)
        # text - [batch_len, MAX_LEN]
        text_lens = text_lens.to(device)
        # text_lens - [batch_len]
        label = label.float()
        label = label.to(device)
        optimizer.zero_grad()
        text = text.permute(1, 0) # permute for sentence_len first for embedding
        predictions = model(text, text_lens).squeeze(1)
        loss = criterion(predictions, label)

        acc = binary_accuracy(predictions, label)

        loss.backward()

        optimizer.step()

        epoch_loss += loss.item()
        epoch_acc += acc.item()
    # average over the loader that was passed in, not the global train_loader
    print('Training Loss Value of Epoch {} = {}'.format(epoch ,epoch_loss/len(data_loader)))
    print('Training Accuracy of Epoch {} = {}'.format(epoch ,epoch_acc/len(data_loader)))

## Evaluate your Model

In [0]:
def evaluate(model, data_loader):
  model.eval()
  epoch_loss = 0
  epoch_acc = 0
  all_predictions = []
  # gradients are not needed at evaluation time
  with torch.no_grad():
    for idx, (text, text_lens, label) in enumerate(data_loader):
        if(idx%100==0):
          print('Executed Step {}'.format(idx))
        text = text.to(device)
        text_lens = text_lens.to(device)
        label = label.float()
        label = label.to(device)

        text = text.permute(1, 0)
        predictions = model(text, text_lens).squeeze(1)
        all_predictions.append(torch.round(torch.sigmoid(predictions)))
        loss = criterion(predictions, label)
        acc = binary_accuracy(predictions, label)
        epoch_loss += loss.item()
        epoch_acc += acc.item()
  print(epoch_loss/len(data_loader))
  print(epoch_acc/len(data_loader))
  predictions = torch.cat(all_predictions)
  return predictions

## Training and Evaluation

We first train your model using the training data. Feel free to play around with the number of epochs. We recommend **you write code to save your model** [(save/load model tutorial)](https://pytorch.org/tutorials/beginner/saving_loading_models.html), as Colab connections are not permanent and it gets messy if you have to train your model again and again.
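
One way to follow the saving recommendation is to checkpoint the model and optimizer state dictionaries together with the epoch number. The sketch below is a minimal example under stated assumptions: it uses a small `nn.Linear` as a stand-in for the RNN, and the dictionary keys and the temporary file path are made up for illustration.

```python
import os
import tempfile
import torch
import torch.nn as nn

toy_model = nn.Linear(4, 1)  # stand-in for the RNN defined in this notebook
toy_optimizer = torch.optim.Adam(toy_model.parameters())

# Hypothetical path; on Colab you would point this at mounted Drive storage.
checkpoint_path = os.path.join(tempfile.mkdtemp(), 'checkpoint.pt')
torch.save({'epoch': 3,
            'model_state': toy_model.state_dict(),
            'optimizer_state': toy_optimizer.state_dict()},
           checkpoint_path)

# Later (for example after a disconnect): rebuild the objects, then restore them.
restored_model = nn.Linear(4, 1)
restored_optimizer = torch.optim.Adam(restored_model.parameters())
checkpoint = torch.load(checkpoint_path, map_location='cpu')
restored_model.load_state_dict(checkpoint['model_state'])
restored_optimizer.load_state_dict(checkpoint['optimizer_state'])
start_epoch = checkpoint['epoch'] + 1
```

Saving state dictionaries rather than the whole model object keeps the checkpoint independent of how the class is defined in the notebook at load time.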

In [14]:
if(__name__=='__main__'):
  train_model(model, 10, train_loader)


Executed Step 0 of Epoch 0
Executed Step 100 of Epoch 0
Executed Step 200 of Epoch 0
Training Loss Value of Epoch 0 = 0.6787004556368899
Training Accuracy of Epoch 0 = 0.5638020833333334
Executed Step 0 of Epoch 1
Executed Step 100 of Epoch 1
Executed Step 200 of Epoch 1
Training Loss Value of Epoch 1 = 0.6444165176815457
Training Accuracy of Epoch 1 = 0.6385995370370371
Executed Step 0 of Epoch 2
Executed Step 100 of Epoch 2
Executed Step 200 of Epoch 2
Training Loss Value of Epoch 2 = 0.6098295383983188
Training Accuracy of Epoch 2 = 0.6685474537037037
Executed Step 0 of Epoch 3
Executed Step 100 of Epoch 3
Executed Step 200 of Epoch 3
Training Loss Value of Epoch 3 = 0.5683078213974282
Training Accuracy of Epoch 3 = 0.7077546296296297
Executed Step 0 of Epoch 4
Executed Step 100 of Epoch 4
Executed Step 200 of Epoch 4
Training Loss Value of Epoch 4 = 0.530580783193862
Training Accuracy of Epoch 4 = 0.7313368055555556
Executed Step 0 of Epoch 5
Executed Step 100 of Epoch 5
Executed S

Now we will evaluate your model on the test set.

In [15]:
if(__name__=='__main__'):
  predictions = evaluate(model, test_loader)
  predictions = predictions.detach().cpu().numpy()
  assert(len(predictions)==len(test_data))

Executed Step 0
Executed Step 100
Executed Step 200
Executed Step 300
Executed Step 400
Executed Step 500
Executed Step 600
Executed Step 700
Executed Step 800
Executed Step 900
Executed Step 1000
Executed Step 1100
Executed Step 1200
Executed Step 1300
Executed Step 1400
Executed Step 1500
Executed Step 1600
Executed Step 1700
Executed Step 1800
0.5252349222605169
0.7825370675453048


## Saving results for Submission
Save your test results for submission by writing them to `result.txt`. Make sure you do not **shuffle** the order of the `test_data`, or the autograder will give you a bad score.

You will submit the following files to the autograder on Gradescope:


1.   Your `result.txt` of test data results
2.   Your code of this notebook. You can do it by clicking `File`-> `Download .py` - make sure the name of the downloaded file is `assignment2.py`
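
To see what `result.txt` contains, the sketch below writes a few made-up predictions with `np.savetxt` and reads them back. The values and the temporary path are illustrative only; the real file should hold one prediction per test example, in the original test-set order.

```python
import os
import tempfile
import numpy as np

toy_predictions = np.array([1.0, 0.0, 0.0, 1.0])  # made-up values, in test-set order
result_path = os.path.join(tempfile.mkdtemp(), 'result.txt')  # hypothetical location

np.savetxt(result_path, toy_predictions, delimiter=',')
reloaded = np.loadtxt(result_path, delimiter=',')
print(reloaded)
```

For a one-dimensional array, `np.savetxt` writes one value per line, so line `i` of the file corresponds to test example `i`.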



In [16]:
if(__name__=='__main__'):
  try:
    from google.colab import drive
    drive.mount('/content/drive')
  except ImportError:
    # not running on Colab, so there is no Drive to mount
    pass
  np.savetxt('drive/My Drive/result.txt', predictions, delimiter=',')

Mounted at /content/drive
