## Please write your name and NetID below:

Name:

NetID:

# CPSC 477/577: Natural Language Processing
## Homework 2: Sentiment analysis (due **11:59pm on March 29**)
*TFs: Aditya Chander and Michael Linden*

In this homework, you'll build several classifiers that can predict a user's rating of a movie solely from the words of their review. This is called **sentiment analysis**. We'll start with some simple non-neural classifiers before moving to a neural network implementation. We'll also see how transformers perform on this task. Throughout the assignment, we'll explore a few different ways to encode the text data and assess an encoding's impact on model performance. 

The dataset we'll be using is the Rotten Tomatoes dataset. We've downloaded it for you, but if you want to find out where we got it from you can check out this link: https://www.kaggle.com/c/sentiment-analysis-on-movie-reviews.

## Part A: dataset exploration, non-neural methods, investigating different encodings

In [None]:
# Import our libraries (this may take a minute or two)
import pandas as pd   # Great for tables (google spreadsheets, microsoft excel, csv). 
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import nltk
import spacy
from scipy import stats
import os # Good for navigating your computer's files 
import sys
pd.options.mode.chained_assignment = None #suppress warnings

nltk.download('wordnet')
nltk.download('punkt')

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.decomposition import PCA
from sklearn.metrics import confusion_matrix
!python -m spacy download en_core_web_md
import en_core_web_md
text_to_nlp = en_core_web_md.load()

import torch
from torchtext.legacy import data
from torch.utils.data import random_split
import torch.nn as nn
import torch.optim as optim

! pip install transformers
from transformers import BertTokenizer
from transformers import BertModel

import time

from google.colab import drive
drive.mount('/content/drive')

### Data preprocessing
The dataset is in the file **train.tsv** (even though the file is called train.tsv, this is the entire dataset that we want you to use). We have written a line of code to load up the dataset from the directory where you stored the file into a pandas DataFrame (check out the documentation [here](https://pandas.pydata.org/docs/reference/frame.html) if you've never used pandas before). This variable is called `all_raw`.

The dataframe `all_raw` currently has the columns `PhraseId`, `SentenceId`, `Phrase` and `Sentiment`. We're going to outline a few preprocessing steps that you need to take to get the data in the right form for classification. Name your processed dataframe variable `all_data`. 

**Step 1**: There are several reviews with duplicated `SentenceId` (subphrases of the full review). Write some code to keep only the first phrase for any given `SentenceId` in your dataframe. Then you can drop the `PhraseId` and `SentenceId` columns. Rename the column called `Phrase` to `text`.

**Step 2**: The next preprocessing step is to convert the values in the `Sentiment` column to a binary encoding (Boolean values) that represents whether the review is good (rating of 3 or 4) or bad (rating of 0, 1 or 2). **Keep the original `Sentiment` column**, renaming it `rating`, and title your binary encoding column `is_good`.
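
To make the two steps concrete, here is a minimal sketch on a made-up five-row table (the rows and the names `toy_raw` and `toy` are invented for illustration; this is not the real dataset). It is one possible approach using standard pandas calls, not the only way to do it.

```python
import pandas as pd

# Hypothetical miniature version of all_raw (values are made up)
toy_raw = pd.DataFrame({
    "PhraseId": [1, 2, 3, 4, 5],
    "SentenceId": [1, 1, 2, 2, 3],
    "Phrase": ["A fine film .", "A fine film", "Dull and slow .", "Dull", "It is okay ."],
    "Sentiment": [4, 3, 0, 1, 2],
})

toy = (
    toy_raw.drop_duplicates(subset="SentenceId", keep="first")  # first phrase per sentence
    .drop(columns=["PhraseId", "SentenceId"])
    .rename(columns={"Phrase": "text", "Sentiment": "rating"})
    .reset_index(drop=True)
)

# Ratings of 3 or 4 count as good; 0, 1 and 2 count as bad
toy["is_good"] = toy["rating"] >= 3
print(toy)
```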

In [None]:
data_path = '/content/drive/MyDrive/Year2/Spring2021/CS477/train.tsv' # TODO: replace this with your own path to the dataset (should end in train.tsv)
all_raw = pd.read_table(data_path)
all_data = None # this will contain your final preprocessed dataset

## TODO: write code for steps 1 and 2 above

# Tests to ensure your columns are labelled correctly and contain the right content
assert 'is_good' in all_data.columns
assert 'text' in all_data.columns
assert 'rating' in all_data.columns
assert sum(all_data.text.apply(lambda x: len(x))) == 868869

### Exploring the dataset
Make up to 5 plots that show some relationships between the text of the review and the sentiment (you can use both the 0-4 rating in the `rating` column and the binary rating in the `is_good` column, though ultimately we'll only be using the binary rating). Some things you may wish to explore are whether the length of the review has any bearing on the sentiment, how many good and bad reviews there are, etc. In the text cell below, comment on what you've found and whether this matches your intuitions about movie reviews more generally.

We've imported matplotlib.pyplot and seaborn for your convenience.

In [None]:
# TODO: make some plots that show some patterns in the dataset!

**TODO: write your observations about the plots here.**

### Bag of words encoding
**Step 1**: Print out the number of unique tokens in your dataset. You should use the `word_tokenize` function from NLTK (imported).

In [None]:
## TODO: count the number of unique tokens

**Step 2**: Convert the text of each review to a bag-of-words encoding. You should use a vocabulary size that is the minimum of 2001 and the number of tokens you calculated in the cell above.

The number 2001 corresponds to the 2000 most common tokens and an extra one for any words that do not match those tokens (think of these like the `RARE_WORDS` from Homework 1). 
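
As an illustration of the "most common tokens plus one extra slot" idea, here is a small sketch in plain Python. It assumes whitespace tokenization instead of `word_tokenize` to keep the example dependency-free, and the helper names `build_vocab` and `encode` are hypothetical.

```python
from collections import Counter

def build_vocab(tokenized_docs, max_size):
    """Map the most common tokens to indices, leaving one slot for everything else."""
    counts = Counter(tok for doc in tokenized_docs for tok in doc)
    n_known = min(max_size - 1, len(counts))
    return {tok: i for i, (tok, _) in enumerate(counts.most_common(n_known))}

def encode(doc, vocab):
    """Count vector; the last position collects tokens outside the vocabulary."""
    vec = [0] * (len(vocab) + 1)
    for tok in doc:
        vec[vocab.get(tok, len(vocab))] += 1
    return vec

# Toy corpus (made up); a vocabulary size of 3 means 2 known tokens + 1 "rare" slot
docs = [s.lower().split() for s in ["good good film", "bad film", "odd plot"]]
vocab = build_vocab(docs, max_size=3)
vectors = [encode(d, vocab) for d in docs]
print(vocab)
print(vectors)
```

In the toy corpus, "bad", "odd" and "plot" all fall into the final slot, just as rare words would in the full dataset.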

In [None]:
# TODO: implement bag of words

### Basic models: logistic regression
**Step 1**: Use your bag of words encoding to train a logistic regression model that predicts whether a review is good or bad. Train the model on 80% of your dataset and print the accuracy score for both your training and testing data.

We've imported the `LogisticRegression` model from scikit-learn for your convenience. Use the scikit-learn `train_test_split` function (also imported) to split your dataset into training and testing sets; set `random_state` to 1.
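
The split-then-fit pattern looks roughly like the sketch below. The feature matrix here is random toy data (an assumption for illustration), and `max_iter=1000` is our own choice to help the solver converge, not a requirement of the assignment.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Toy stand-in for a bag-of-words matrix and binary labels (made up)
rng = np.random.default_rng(0)
X = rng.integers(0, 3, size=(100, 6)).astype(float)
y = (X[:, 0] > X[:, 1]).astype(int)

# 80% train / 20% test, with a fixed random_state for reproducibility
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1
)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# score() returns mean accuracy on the given data
train_acc = model.score(X_train, y_train)
test_acc = model.score(X_test, y_test)
print(f"train accuracy: {train_acc:.3f}, test accuracy: {test_acc:.3f}")
```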

In [None]:
# TODO: train your logistic regression model

**Step 2**: We're going to see which words were most important to the model. Print out the words that correspond to the 20 coefficients in the logistic regression model with the highest absolute value. (To get the coefficients for the model you trained, you can use `model.coef_`.)
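
One way to pick out the largest coefficients by absolute value is with `np.argsort`. The sketch below uses a made-up coefficient array and word list in place of a trained model's `coef_` and your vocabulary.

```python
import numpy as np

# Stand-ins for model.coef_ (shape [1, n_features]) and the vocabulary (made up)
coef = np.array([[0.2, -1.5, 0.9, -0.1, 1.1]])
words = ["the", "dull", "great", "of", "fun"]

k = 3
# Sort indices by absolute value, largest first, and keep the top k
top_idx = np.argsort(np.abs(coef[0]))[::-1][:k]
top_words = [words[i] for i in top_idx]
top_coefs = coef[0][top_idx]  # original signs are preserved

for word, value in zip(top_words, top_coefs):
    print(f"{word}: {value:+.2f}")
```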

In [None]:
# TODO: print out top 20 features of logistic regression model

**Step 3**: Plot these 20 coefficients (with their original sign) in a bar plot. Label each bar with the word to which it corresponds. What do you notice about the relationship between the words and the bars? Add your observation in the text cell below.

In [None]:
# TODO: plot the top 20 coefficients

**TODO: write your observations here**

### Other non-neural models

Run the classification again using three different models: **K-nearest Neighbors (with 5 neighbors)**, **Gaussian Naive Bayes** and **Support Vector Classifier**. All of these have been imported from scikit-learn. Which model performed best? Can you speculate why this was the case? You may provide feature importance plots/figures where appropriate.


In [None]:
# TODO: do the same thing with other models and see whether they perform any better


**TODO: which model was best and why?**

### Reducing the size of our review representations
Our reviews are currently encoded in a form with thousands of features. This is potentially slowing down the runtime of our models. However, we can use dimensionality reduction techniques such as **Principal Component Analysis (PCA)** to capture the majority of the variance that we observe in our encodings but with far fewer dimensions needed.

Use PCA to perform dimensionality reduction on the bag-of-words encodings. Then train a logistic regression model using these reduced vectors. Try this on different numbers of principal components; we have provided these in the code. We have imported the PCA implementation from scikit-learn for your convenience. (Hint: the `fit_transform` function might be useful.)

Plot the test performance of your different encodings in a bar plot. How does the performance of the reduced vectors compare with the original?
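
The sketch below shows the basic PCA workflow on random toy data (an assumption for illustration). It fits the projection on the training split only and then applies it to the test split; this is one common convention, and the variable names are our own.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split

# Toy count matrix standing in for the bag-of-words encodings (made up)
rng = np.random.default_rng(0)
X = rng.poisson(1.0, size=(60, 20)).astype(float)
y = rng.integers(0, 2, size=60)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1
)

n = 5
pca = PCA(n_components=n)
X_train_reduced = pca.fit_transform(X_train)  # learn the projection and apply it
X_test_reduced = pca.transform(X_test)        # reuse the same projection

# Fraction of the training variance captured by the n components
explained = pca.explained_variance_ratio_.sum()
print(X_train_reduced.shape, X_test_reduced.shape, round(explained, 3))
```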

In [None]:
n_components = [10, 50, 100, 300, 800]

# TODO: perform PCA on bag of words using the numbers of components above and explore how this affects logistic regression model performance
for n in n_components:
    pca = PCA(n_components=n)
    # YOUR CODE HERE


**TODO: how do the dimensionally-reduced vectors affect performance?**

### Other encodings
We're going to explore different ways to represent the dataset that may improve performance on the basic models compared to the bag-of-words representation. 
#### Bag of bigrams
We're going to extend the bag-of-words model to a "bag-of-bigrams" model, where we count the number of occurrences of each bigram that appears in the dataset. Implementing this is very similar to implementing bag of words, but we might expect improved performance as this model accounts for valenced phrases such as "not good". 

As such, implement the bag-of-bigrams model on the dataset (with a vocab size of the minimum of 5000 and the number of unique bigrams) and train a logistic regression model as before. Feel free to reuse your code from the bag-of-words model implementation as you see fit. 

Is the test performance with the bigram encoding any better? Explain why.
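
As a reminder of what a bigram is, here is a tiny sketch that pairs each token with its successor. It assumes whitespace tokenization on a made-up sentence; counting and encoding the pairs then works as in bag of words.

```python
# Toy sentence (made up), split on whitespace for simplicity
tokens = "this film is not good".split()

# Each bigram is a pair of adjacent tokens
bigrams = list(zip(tokens, tokens[1:]))
print(bigrams)
```

Note that the pair `("not", "good")` is now a single feature, which a bag-of-words model cannot represent.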

In [None]:
# TODO: implement bag of bigrams and train on logistic regression model

**TODO: explain the performance of your bag-of-bigrams model vs. bag-of-words model.**

#### Word embeddings 
Another type of text representation is **word embeddings**. These are large vectors that aim to capture some kind of semantic information about the words. As such, you might expect to find similar words close to each other in word embedding vector space. 

We're going to be using the spacy word embeddings in this assignment. The line below loads up the spacy embeddings, which are 300-dimensional vectors, into a column in the `all_data` dataframe.

In [None]:
# Spacy embeddings
all_data['spacy_embedding'] = all_data['text'].apply(lambda x: text_to_nlp(x).vector)

As before, train a logistic regression model on the spacy embeddings and see how the test performance compares to the bag-of-words and bag-of-bigrams encodings.

In [None]:
# TODO: train logistic regression on spacy word embeddings

**TODO: how does the word embedding vector performance compare to bag of words/bag of bigrams?**

## Part B: RNN implementation (with embedding layer + pretrained embeddings)
Now, we're going to look at some neural methods for sentiment classification. Specifically, we'll explore how **Recurrent Neural Networks (with and without LSTM cells)** and **Transformers** perform on the same task.

Please run the following cells to set everything up. **DO NOT MODIFY THEM**, but do take a look to see how they work.


In [None]:
# Settings
SEED = 12138
torch.manual_seed(SEED)
torch.backends.cudnn.deterministic = True

# Torchtext lets us load the text and labels separately.
TEXT = data.Field(tokenize = 'spacy')
LABEL = data.LabelField(dtype = torch.float)

In [None]:
# class to convert pandas dataframe to the torchtext dataset
class DataFrameDataset(data.Dataset):
    """Class for using pandas DataFrames as a datasource"""
    def __init__(self, examples, fields, filter_pred=None):
        """
        Create a dataset from a pandas dataframe of examples and Fields
        Arguments:
            examples pd.DataFrame: DataFrame of examples
            fields {str: Field}: The Fields to use in this tuple. The
                string is a field name, and the Field is the associated field.
            filter_pred (callable or None): use only examples for which
                filter_pred(example) is true, or use all examples if None.
                Default is None
        """
        self.examples = examples.apply(SeriesExample.fromSeries, args=(fields,), axis=1).tolist()
        if filter_pred is not None:
            # filter() returns a lazy iterator in Python 3, so materialize it as a list
            self.examples = list(filter(filter_pred, self.examples))
        self.fields = dict(fields)
        # Unpack field tuples
        for n, f in list(self.fields.items()):
            if isinstance(n, tuple):
                self.fields.update(zip(n, f))
                del self.fields[n]

class SeriesExample(data.Example):
    """Class to convert a pandas Series to an Example"""

    @classmethod
    def fromSeries(cls, data, fields):
        return cls.fromdict(data.to_dict(), fields)

    @classmethod
    def fromdict(cls, data, fields):
        ex = cls()
        
        for key, field in fields.items():
            if key not in data:
                raise ValueError("Specified key {} was not found in "
                "the input data".format(key))
            if field is not None:
                setattr(ex, key, field.preprocess(data[key]))
            else:
                setattr(ex, key, data[key])
        return ex

In [None]:
# Format reviews in a pytorch-specific way
fields = {'is_good' : LABEL, 'text' : TEXT}
all_ds = DataFrameDataset(all_data, fields)

In [None]:
# display first entry; this should be the text of the review followed by a binary sentiment classification
print(vars(all_ds.examples[0]))

In [None]:
# cast to actual torchtext dataset class
all_ds = data.Dataset(all_ds,fields)

In [None]:
# make train + valid + test sets
train_ds, test_ds = random_split(all_ds,[int(0.8*len(all_ds)), len(all_ds)-int(0.8*len(all_ds))])
train_ds, valid_ds = random_split(train_ds, [int(0.8*len(train_ds)), len(train_ds)-int(0.8*len(train_ds))])
train_ds, valid_ds, test_ds = data.Dataset(train_ds,fields), data.Dataset(valid_ds,fields), data.Dataset(test_ds,fields)

In [None]:
# See how many examples there are in each subset
print(f'Number of training examples: {len(train_ds)}')
print(f'Number of validation examples: {len(valid_ds)}')
print(f'Number of testing examples: {len(test_ds)}')

In [None]:
# Build preliminary encodings to pass into the model
MAX_VOCAB_SIZE = 25_000

TEXT.build_vocab(train_ds, max_size = MAX_VOCAB_SIZE)
LABEL.build_vocab(train_ds)

In [None]:
# Define iterator for minibatches (helps to train RNN in chunks)
BATCH_SIZE = 64

# If there is a GPU available, we will set to use it; otherwise we will use cpu.
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

train_iterator, valid_iterator, test_iterator = data.BucketIterator.splits(
    (train_ds, valid_ds, test_ds), 
    batch_size = BATCH_SIZE,
    device = device,
    sort_key = lambda x: len(x.text))

### Building the network
Now that you've set everything up, you're ready to build the network architecture.

**Step 1**: The first thing you'll want to do is fill out the code in the initialization of the RNN class. You'll need to define three layers: `self.embedding`, `self.rnn`, and `self.fc`. Use the built-in modules in `torch.nn` to accomplish this (remember that a fully-connected layer is also a linear layer!) and pay attention to the input and output dimensions each layer should have.

**Step 2**: The next step (still in the RNN class) is to implement the forward pass. Make use of the layers you defined above to create embedded, hidden, and output vectors for a given input x.

Hint to get started: the RNN model should have the following structure:
1. Start with an embedding layer; shape: (input_dim, embedding_dim)
2. Then add the RNN layer; shape: (embedding_dim, hidden_dim)
3. Finally, add a linear layer; shape: (hidden_dim, output_dim)
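
To see how tensor shapes flow through these three layers, here is a standalone shape-tracing sketch. The sizes are arbitrary toy values and the input is random token ids; it is meant to illustrate the dimensions, not to serve as the class you need to write.

```python
import torch
import torch.nn as nn

# Arbitrary toy sizes (assumptions for illustration only)
vocab_size, embedding_dim, hidden_dim, output_dim = 50, 8, 16, 1
seq_len, batch_size = 7, 4

embedding = nn.Embedding(vocab_size, embedding_dim)
rnn = nn.RNN(embedding_dim, hidden_dim)
fc = nn.Linear(hidden_dim, output_dim)

# Random token ids shaped [sentence length, batch size]
tokens = torch.randint(0, vocab_size, (seq_len, batch_size))

embedded = embedding(tokens)      # [seq_len, batch_size, embedding_dim]
output, hidden = rnn(embedded)    # output: [seq_len, batch_size, hidden_dim]
                                  # hidden: [1, batch_size, hidden_dim]
logits = fc(hidden.squeeze(0))    # [batch_size, output_dim]

print(embedded.shape, output.shape, hidden.shape, logits.shape)
```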

In [None]:
## TODO: define the RNN class
class RNN(nn.Module):
    def __init__(self, input_dim, embedding_dim, hidden_dim, output_dim):
        super().__init__()
        
        ## TODO starts
        ## TODO ends
        
    def forward(self, text):

        ## TODO starts
        ## TODO ends

        return result

Now we're going to define a few model hyperparameters and initialize our model.

In [None]:
# define some hyperparameters
INPUT_DIM = len(TEXT.vocab)
EMBEDDING_DIM = 100
HIDDEN_DIM = 256
OUTPUT_DIM = 1

# apply our RNN model here
rnn = RNN(INPUT_DIM, EMBEDDING_DIM, HIDDEN_DIM, OUTPUT_DIM)

# set up learning rate + optimization algorithm
learning_rate = 0.001
optimizer = optim.Adam(rnn.parameters(), lr=learning_rate)
criterion = torch.nn.BCEWithLogitsLoss() # "with logits": this loss applies the sigmoid itself, so the model should output raw scores rather than probabilities

In [None]:
rnn = rnn.to(device)
criterion = criterion.to(device)

### Calculating model accuracy
The model will give us outputs in the range (-Inf,Inf). Your job now is to take a list of these outputs, convert them to predictions of the sentiment label, and calculate the accuracy of the RNN classifier. To this end, fill in the code cell below to implement the `binary_accuracy` function.

Hint: you will need to use an activation function called the **sigmoid** function (`torch.sigmoid`) to get your outputs in the appropriate range, as they need to be converted to probabilities. Then you can predict the class based on whether the probability is below or above 0.5, and compare it to the ground truth label.
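
The sigmoid-then-threshold idea can be checked on a handful of hand-picked numbers. The values below are made up, and treating a probability of exactly 0.5 as the positive class is our own convention.

```python
import torch

# Made-up raw model outputs and ground-truth labels
logits = torch.tensor([2.0, -1.0, 0.3, -0.2])
labels = torch.tensor([1.0, 0.0, 0.0, 0.0])

probs = torch.sigmoid(logits)      # squash to (0, 1)
preds = (probs >= 0.5).float()     # 1.0 if predicted good, else 0.0

# Fraction of predictions that match the labels (a 0-dim tensor)
accuracy = (preds == labels).float().mean()
print(accuracy.item())  # 3 of 4 correct -> 0.75
```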

In [None]:
## TODO: return the accuracy given the RNN outputs (outputs) and true values (y); accuracy should be a float number
def binary_accuracy(outputs, y):
    """
    Returns accuracy per batch, i.e. if you get 8/10 right, this returns 0.8, NOT 8
    """
    return accuracy

### Training function

The next function is the train function. Most of the code is handled for you – you only need to get a set of predictions for the current batch and then calculate the current loss and accuracy. For the latter two calculations, make sure to use the criterion and binary_accuracy functions you are given. For calculating the batch predictions, extract the text of the current batch and run it through the model, which is passed in as a parameter.


In [None]:
## TODO: finish the training function
## iterator contains batches of the training data; 
## hint: use batch.text and batch.is_good to get access to the training data and labels (the label field is named 'is_good' in `fields`)
def train(model, iterator, optimizer, criterion):
    
    epoch_loss = 0
    epoch_acc = 0
    
    model.train()
    
    for batch in iterator:
        # reset the optimiser
        optimizer.zero_grad()

        # TODO: make predictions

        # TODO: calculate loss and accuracy

        # backprop
        loss.backward()
        # update params
        optimizer.step()

        epoch_loss += loss.item()
        epoch_acc += acc.item()

    return epoch_loss / len(iterator), epoch_acc / len(iterator)

### Evaluation function

The evaluate function repeats the prediction, loss, and accuracy calculations from the training function. This time, there is no optimization step after the predictions, loss, and accuracy are calculated.

In [None]:
## TODO: finish the evaluation function
## iterator contains batches of the validation or test data; 
## hint: this function is very similar to the training function
def evaluate(model, iterator, criterion):
    
    epoch_loss = 0
    epoch_acc = 0
    
    model.eval()
    
    with torch.no_grad():
    
        for batch in iterator:
            
            ## TODO starts

            ## TODO ends
            epoch_loss += loss.item()
            epoch_acc += acc.item()
            
        
    return epoch_loss / len(iterator), epoch_acc / len(iterator)

### Start training
Training may take a few minutes in total. Expect a validation accuracy of around 54%.

In [None]:
N_EPOCHS = 5

best_valid_loss = float('inf')
# let's train 5 epochs
for epoch in range(N_EPOCHS):
    
    train_loss, train_acc = train(rnn, train_iterator, optimizer, criterion)
    valid_loss, valid_acc = evaluate(rnn, valid_iterator, criterion)
      
    # we keep track of the best model, and save it
    if valid_loss < best_valid_loss:
        best_valid_loss = valid_loss
        torch.save(rnn.state_dict(), 'best_model.pt')
    
    print(f'\tTrain Loss: {train_loss:.3f} | Train Acc: {train_acc*100:.2f}%')
    print(f'\t Val. Loss: {valid_loss:.3f} |  Val. Acc: {valid_acc*100:.2f}%')

### Restore the best model and evaluate
The test accuracy should be around 55%.


In [None]:
rnn.load_state_dict(torch.load('best_model.pt'))
test_loss, test_acc = evaluate(rnn, test_iterator, criterion)

print(f'Test Loss: {test_loss:.3f} | Test Acc: {test_acc*100:.2f}%')

Compare this performance to the simple model performance from Part A.

**TODO: write your answer below**

### LSTM
This step of this assignment is to modify your RNN into a bidirectional LSTM network. We should expect that this kind of model performs better than our previous ones.

1. You'll be adapting your `RNN` class into the `LSTM` class below. In the `__init__` method, use `nn.LSTM` for the rnn layer and make sure you pass in the `bidirectional` argument. Also note that the fully connected layer now has to map from two hidden states (forward and backward), so its input size is `hidden_dim * 2`.
2. In the forward pass, not much changes from before, besides the addition of the cell state. Also note that you'll have to concatenate the final forward hidden state and the final backward hidden state. If any of this is unclear, look up examples of how `nn.LSTM` works for clarification.
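
The shapes involved in a bidirectional LSTM can be traced with a short standalone sketch. The sizes are arbitrary toy values and the input is a random tensor standing in for embedded text; it illustrates the concatenation step rather than the full class.

```python
import torch
import torch.nn as nn

# Arbitrary toy sizes (assumptions for illustration only)
embedding_dim, hidden_dim = 8, 16
seq_len, batch_size = 7, 4

lstm_layer = nn.LSTM(embedding_dim, hidden_dim, bidirectional=True)

# Random stand-in for embedded text: [seq_len, batch_size, embedding_dim]
embedded = torch.randn(seq_len, batch_size, embedding_dim)

# nn.LSTM returns the outputs plus a (hidden, cell) tuple
output, (hidden, cell) = lstm_layer(embedded)

# hidden[-2] is the final forward state, hidden[-1] the final backward state
final_hidden = torch.cat((hidden[-2], hidden[-1]), dim=1)

print(output.shape, hidden.shape, cell.shape, final_hidden.shape)
```

The concatenated tensor has `hidden_dim * 2` features per example, which is why the fully connected layer needs a wider input.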


In [None]:
class LSTM(nn.Module):
    # TODO: IMPLEMENT THIS FUNCTION
    # Initialize the three layers in the RNN, self.embedding, self.rnn, and self.fc
    # Each one has a corresponding function in nn
    # embedding maps from input_dim->embedding_dim
    # rnn maps from embedding_dim->hidden_dim
    # fc maps from hidden_dim*2->output_dim
    def __init__(self, input_dim, embedding_dim, hidden_dim, output_dim, bidirectional):
        super().__init__()
        
        ## TODO starts
        ## TODO ends
        
    
    # TODO: IMPLEMENT THIS FUNCTION
    # x has dimensions [sentence length, batch size]
    # embedded has dimensions [sentence length, batch size, embedding_dim]
    # output has dimensions [sentence length, batch size, hidden_dim*2] (since bidirectional)
    # hidden has dimensions [2, batch size, hidden_dim]
    # cell has dimensions [2, batch_size, hidden_dim]
    # Need to concatenate the final forward and backward hidden layers
    def forward(self, x):
        
        ## TODO starts
        ## TODO ends
        
        return result

In [None]:
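# apply our LSTM model here
BIDIRECTIONAL = True
lstm = LSTM(INPUT_DIM, EMBEDDING_DIM, HIDDEN_DIM, OUTPUT_DIM, BIDIRECTIONAL)
# the optimizer defined earlier only tracks the RNN's parameters, so create a new one for the LSTM
optimizer = optim.Adam(lstm.parameters(), lr=learning_rate)
## setup device
lstm = lstm.to(device)
criterion = criterion.to(device)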

In [None]:
# train again!
best_valid_loss = float('inf')
# let's train 5 epochs
for epoch in range(N_EPOCHS):
    
    train_loss, train_acc = train(lstm, train_iterator, optimizer, criterion)
    valid_loss, valid_acc = evaluate(lstm, valid_iterator, criterion)
      
    # we keep track of the best model, and save it
    if valid_loss < best_valid_loss:
        best_valid_loss = valid_loss
        torch.save(lstm.state_dict(), 'best_model_LSTM.pt')
    
    print(f'\tTrain Loss: {train_loss:.3f} | Train Acc: {train_acc*100:.2f}%')
    print(f'\t Val. Loss: {valid_loss:.3f} |  Val. Acc: {valid_acc*100:.2f}%')

Do you think the LSTM is working better than the RNN? Why or why not? How do the LSTM and the RNN compare (model complexity, etc.)?

**TODO: write your answer to this question**



## Part C: Transformers
In this part, we will see how to apply a pre-trained BERT model to improve text classification. You won't have to code anything for yourself here; we provide the code to serve as a learning resource.

Bidirectional Encoder Representations from Transformers (BERT) is a technique for NLP (Natural Language Processing) pre-training developed by Google. BERT was created and published in 2018 by Jacob Devlin and his colleagues from Google. Google is leveraging BERT to better understand user searches. (Source: Wikipedia)


Read more: http://jalammar.github.io/illustrated-transformer/

BERT paper: [BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding](https://arxiv.org/abs/1810.04805)

Understanding BERT: https://towardsdatascience.com/understanding-bert-is-it-a-game-changer-in-nlp-7cca943cf3ad

As before, run the cells below. Don't modify them but do take a look at them to see what's going on.



In [None]:
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
max_input_length = tokenizer.max_model_input_sizes['bert-base-uncased']

# tokenize a sentence: you will see the tokenizer "cleans" the sentence as well.
tokens = tokenizer.tokenize('Hello WORLD how ARE yoU?')
print(tokens)
# There may be a warning; you can ignore it

In [None]:
# Convert text to ids (positions in a dictionary)
indexes = tokenizer.convert_tokens_to_ids(tokens)

print(indexes)

# Special tokens
init_token = tokenizer.cls_token
eos_token = tokenizer.sep_token
pad_token = tokenizer.pad_token
unk_token = tokenizer.unk_token

print(init_token, eos_token, pad_token, unk_token)

# Special token ids
init_token_idx = tokenizer.convert_tokens_to_ids(init_token)
eos_token_idx = tokenizer.convert_tokens_to_ids(eos_token)
pad_token_idx = tokenizer.convert_tokens_to_ids(pad_token)
unk_token_idx = tokenizer.convert_tokens_to_ids(unk_token)

print(init_token_idx, eos_token_idx, pad_token_idx, unk_token_idx)

In [None]:
# Tokenize a sentence and trim it to the max allowed sentence length
def tokenize_and_cut(sentence):
    tokens = tokenizer.tokenize(sentence) 
    tokens = tokens[:max_input_length-2]
    return tokens

In [None]:
# Redefine TEXT with the new tokens
TEXT = data.Field(batch_first = True,
                  use_vocab = False,
                  tokenize = tokenize_and_cut,
                  preprocessing = tokenizer.convert_tokens_to_ids,
                  init_token = init_token_idx,
                  eos_token = eos_token_idx,
                  pad_token = pad_token_idx,
                  unk_token = unk_token_idx)

In [None]:
# Redefine the training, validation and test data
fields = {'is_good' : LABEL, 'text' : TEXT}
all_ds_t = DataFrameDataset(all_data, fields)
train_ds_t, test_ds_t = random_split(all_ds_t,[int(0.8*len(all_ds_t)), len(all_ds_t)-int(0.8*len(all_ds_t))])
train_ds_t, valid_ds_t = random_split(train_ds_t, [int(0.8*len(train_ds_t)), len(train_ds_t)-int(0.8*len(train_ds_t))])
train_ds_t, valid_ds_t, test_ds_t = data.Dataset(train_ds_t,fields), data.Dataset(valid_ds_t,fields), data.Dataset(test_ds_t,fields)

In [None]:
# Rebuild vocabulary 
LABEL.build_vocab(train_ds_t)
BATCH_SIZE = 128

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

train_iterator, valid_iterator, test_iterator = data.BucketIterator.splits(
    (train_ds_t, valid_ds_t, test_ds_t), 
    batch_size = BATCH_SIZE, 
    device = device,
    sort_key = lambda x: len(x.text))

# It will download the pre-trained model
bert = BertModel.from_pretrained('bert-base-uncased')


In [None]:
# Modify the RNN class to incorporate BERT representations
class MyBERTwithRNN(nn.Module):
    def __init__(self,
                 bert,
                 hidden_dim,
                 output_dim,
                 n_layers,
                 bidirectional,
                 dropout):
        
        super().__init__()
        
        self.bert = bert
        
        embedding_dim = bert.config.to_dict()['hidden_size']
        
        self.rnn = nn.GRU(embedding_dim,
                          hidden_dim,
                          num_layers = n_layers,
                          bidirectional = bidirectional,
                          batch_first = True,
                          dropout = 0 if n_layers < 2 else dropout)
        
        self.out = nn.Linear(hidden_dim * 2 if bidirectional else hidden_dim, output_dim)
        
        self.dropout = nn.Dropout(dropout)
        
    def forward(self, text):
        
        #text = [batch size, sent len]
                
        with torch.no_grad():
            embedded = self.bert(text)[0]
                
        #embedded = [batch size, sent len, emb dim]
        
        _, hidden = self.rnn(embedded)
        
        #hidden = [n layers * n directions, batch size, hid dim]
        
        if self.rnn.bidirectional:
            hidden = self.dropout(torch.cat((hidden[-2,:,:], hidden[-1,:,:]), dim = 1))
        else:
            hidden = self.dropout(hidden[-1,:,:])
                
        #hidden = [batch size, hid dim]
        
        output = self.out(hidden)
        
        #output = [batch size, out dim]
        
        return output

In [None]:
# create model instance
HIDDEN_DIM = 256
OUTPUT_DIM = 1
N_LAYERS = 2
BIDIRECTIONAL = True
DROPOUT = 0.25

bert_model = MyBERTwithRNN(bert,
                         HIDDEN_DIM,
                         OUTPUT_DIM,
                         N_LAYERS,
                         BIDIRECTIONAL,
                         DROPOUT)

In [None]:
# Count the number of parameters in the model. The value should shock you!
def count_parameters(model):
    param_number = 0
    param_number = sum(p.numel() for p in model.parameters() if p.requires_grad)
    return param_number

print(f'The model has {count_parameters(bert_model):,} trainable parameters')

In [None]:
# Let's freeze the BERT parameters so that only the GRU and output layer are trained
for name, param in bert_model.named_parameters():                
    if name.startswith('bert'):
        param.requires_grad = False

In [None]:
# Move the model to the device (the GPU, if one is available)
bert_model = bert_model.to(device)

In [None]:
# a helper function to report how long each epoch takes
def epoch_time(start_time, end_time):
    elapsed_time = end_time - start_time
    elapsed_mins = int(elapsed_time / 60)
    elapsed_secs = int(elapsed_time - (elapsed_mins * 60))
    return elapsed_mins, elapsed_secs

In [None]:
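# Start training.
# Note that it will take ~17 minutes for one epoch.
# The output will be: Epoch: 01 | Epoch Time: 17m 36s...
# Validation accuracy should be higher than 85% in the first epoch and higher than 90% in the second epoch.

N_EPOCHS = 2

# the optimizer defined in Part B does not track this model's parameters, so create a new one
optimizer = optim.Adam(bert_model.parameters(), lr=learning_rate)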

best_valid_loss = float('inf')

for epoch in range(N_EPOCHS):
    
    start_time = time.time()
    
    train_loss, train_acc = train(bert_model, train_iterator, optimizer, criterion)
    valid_loss, valid_acc = evaluate(bert_model, valid_iterator, criterion)
        
    end_time = time.time()
    
    epoch_mins, epoch_secs = epoch_time(start_time, end_time)
        
    if valid_loss < best_valid_loss:
        best_valid_loss = valid_loss
        torch.save(bert_model.state_dict(), 'best_model.pt')
    
    print(f'Epoch: {epoch+1:02} | Epoch Time: {epoch_mins}m {epoch_secs}s')
    print(f'\tTrain Loss: {train_loss:.3f} | Train Acc: {train_acc*100:.2f}%')
    print(f'\t Val. Loss: {valid_loss:.3f} |  Val. Acc: {valid_acc*100:.2f}%')

**Question**: how does the BERT model compare with the models from Part B? (Hint: consider training time, accuracy, model complexity, etc.)

**TODO: write your answer here**

# Submission instructions
1. Print this page as a **PDF** and name it `[your_NetID]_nlp_hw2.pdf`.
2. Rename your notebook `[your_NetID]_nlp_hw2.ipynb`.
3. Submit both on Canvas by **11:59pm on March 29**.