## <h1><center>Assignment 2: Sentiment Analysis</center></h1>




<center><img src="https://www.cs.cornell.edu/courses/cs4782/2025sp/images/clapperboard_attention.jpeg"></center>



&nbsp;


---



**GOAL:** In this project you will implement several NLP models to classify IMDB movie reviews as positive or negative (sentiment analysis). You will also gain familiarity with the Hugging Face platform, which is commonly used to share machine learning models, datasets, and more.

&nbsp;

**WHAT YOU'LL SUBMIT:** Your submission to Gradescope includes:


1.   A `.zip` file uploaded to ***[Coding Assignment 2](https://www.gradescope.com/courses/963234/assignments/5850403)***  containing the following files:

<center>

\#|Files
---|---
i. | `submission.py`
ii. |`LR_google.csv`
iii. |`LR_student.csv`
iv. |`Transformer_preds.csv`
v. |`LSTM_preds.csv`

</center>


2.   A `.txt` file with responses to the questions in the notebook, uploaded to ***[Coding Assignment 2 Responses](https://www.gradescope.com/courses/963234/assignments/5850976)***

*The instructions in the notebook explain how to access, modify, and save these files as you work through the assignment.*

&nbsp;

**DOs:**


1.   **Recommendation:** Finish coding and debugging on CPU; only use the GPU in the end to get the final results.
2.   **Running on GPU:** You can click on the runtime option and change your runtime type to the **T4 GPU (this should make your training faster)**
3.   As before, all functionality you need to modify is within `submission.py`.
4.   Remember to execute all code cells sequentially, not just those you’ve edited, to ensure your code runs properly.
5.   Please cite any external sources you use to complete this assignment in your written responses.
6.   Before starting your work, please review <a href="https://s3.amazonaws.com/ecornell/global/eCornellPlagiarismPolicy.pdf">eCornell's policy regarding plagiarism</a> (the presentation of someone else's work as your own without source credit).

&nbsp;

**DON'Ts:**


1.   DO NOT change the names of any provided functions, classes, or variables within the existing code cells, as this will interfere with grading.
2.   DO NOT delete any provided code/imports.

&nbsp;

***NOTE:***
    
*You can resubmit your work as many times as necessary before the submission deadline. If you experience difficulty or have questions about this exercise, use the Ed discussion board to engage with your peers or seek assistance from the TAs.*




# Part 0: Setting up the Colab environment.

The next few code blocks will set up your Colab environment. Upload the `a2_release` folder to your Google Drive and run/update the cells below, following the TODO instructions. Just like in the first assignment, you must specify the path to your implementation so it can be accessed by this notebook (see *TODO 0*).

In [None]:
# TODO -1: Reinstall ipython kernel to enable autoreload on latest runtime (by 02/12/2026)
# You'll be prompted to restart the session.
!pip install ipython==8.12.0

In [None]:
# TODO 0: Mount your Google Drive; this allows the runtime environment to access your drive.
from google.colab import drive
drive.mount('/content/gdrive')
import sys
# NOTE: Replace with the path to the A2 folder on your google drive. Make sure your path does NOT include a '/' at the end!
base_dir = "/content/gdrive/MyDrive/a2_release"
sys.path.append(base_dir)
## END TODO

In [None]:
!pip -q install gensim
# This makes sure the submission module is reloaded whenever you make edits.
%load_ext autoreload
%aimport submission
%autoreload 1
import submission


import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader
import torch.nn.functional as F
from torch.autograd import Variable

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(F"Device set to {device}")

In [None]:
!pip install datasets
!pip install stop_words
!pip install transformers

In [None]:
import datasets

import random
import os
import numpy as np
import pandas as pd
import math
from collections import Counter
from itertools import chain
from typing import List
import textwrap

import matplotlib.pyplot as plt
import seaborn as sns
from tqdm.auto import tqdm

random.seed(0)
torch.manual_seed(0)

# Part 1: Create a Dataset
The dataset https://huggingface.co/datasets/imdb that we will be using consists of movie reviews from the IMDB website that are labelled as either `"Negative" (0)` or `"Positive" (1)`.

In [None]:
from datasets import load_dataset

imdb = load_dataset("imdb")

The next cell creates the train, validation, and test splits used for the assignment. To speed up training, we will only use a subset of the full IMDB dataset.

In [None]:
train_dataset = imdb["train"].shuffle(seed=82).select(range(3000))
val_dataset = imdb["test"].shuffle(seed=82).select(range(301, 601))
test_dataset = imdb["test"].shuffle(seed=82).select(range(300))

In [None]:
train_df = train_dataset.to_pandas()
val_df = val_dataset.to_pandas()
test_df = test_dataset.to_pandas()

train_df.head()

Visualize training examples

In [None]:
print('Negative Review', '\n')
print(textwrap.fill(train_dataset[2]['text'], 130), '\n')

print('Positive Review', '\n')
print(textwrap.fill(train_dataset[10]['text'], 130))

# Part 2: Word Embeddings

Before we start, let's learn some word embeddings for the words found in our movie review dataset. <br>
Run the following cells to:

*   process our dataset,
*   train a word2vec embedding model on the words in our dataset,
*   visualize these embeddings in 2D space.


In [None]:
import re
import nltk
import stop_words

!pip -q install gensim
from gensim.models import word2vec
from sklearn.manifold import TSNE

The following code cells will clean the text in the imdb review dataset. This includes removing characters that are not alpha-numeric and removing stop words (common words in the English language that do not convey much meaning, e.g. the, and, it, etc.). This is a common processing step in many NLP pipelines.

In [None]:
nltk.download('stopwords')
# Stored as a set so the membership test in clean_sentence is fast.
STOP_WORDS = set(nltk.corpus.stopwords.words())

def clean_sentence(val):
    '''
    Removes characters that are not letters, digits, or whitespace, then
    removes stop words (common words that do not convey much meaning,
    e.g. the, and, it).
    '''
    val = val.lower()
    val = val.replace('<br />', '')
    val = val.replace('.', '. ')
    val = val.replace('!', '! ')
    # Raw string avoids invalid-escape warnings for \s and \w.
    regex = re.compile(r'([^\s\w]|_)+')
    sentence = regex.sub('', val)
    words = sentence.split(" ")

    # Keep only the words that are not stop words.
    words = [word for word in words if word not in STOP_WORDS]

    return " ".join(words)

def clean_dataframe(data):
    data = data.dropna(how="any")

    for col in ['text']:
        data[col] = data[col].apply(clean_sentence)

    return data

data = clean_dataframe(train_df)
data.head(5)

In [None]:
def build_corpus(data):
    '''
    Creates a list of lists containing words from each sentence.
    '''
    corpus = []
    for col in ['text']:
        for sentence in data[col].items():
            word_list = sentence[1].split(" ")
            corpus.append([word for word in word_list if len(word) != 0])

    return corpus

corpus = build_corpus(data)
corpus[0:2]

The following cell trains a word2vec model on our cleaned dataset of movie reviews.

In [None]:
w2v_model = word2vec.Word2Vec(corpus, vector_size=100, window=20, min_count=320, workers=4)

We can visualize our word embeddings in 2D space using a TSNE plot.

In [None]:
def tsne_plot(model):
    '''
    Creates a t-SNE model and plots it. This will help visualize the distances
    between word embeddings in 2D space.
    '''
    labels = []
    tokens = []

    for word in model.wv.key_to_index:
        tokens.append(model.wv[word])
        labels.append(word)

    tsne_model = TSNE(perplexity=40, n_components=2, init='pca', n_iter=2500, random_state=23)
    new_values = tsne_model.fit_transform(np.asarray(tokens))

    x = []
    y = []
    for value in new_values:
        x.append(value[0])
        y.append(value[1])

    plt.figure(figsize=(15, 7))
    for i in range(len(x)):
        color = sns.color_palette('Set2')[2]
        plt.scatter(x[i], y[i], color = color)
        plt.annotate(labels[i], xy=(x[i], y[i]), xytext=(5, 2), textcoords='offset points', ha='right',va='bottom')

    plt.show()

In [None]:
tsne_plot(w2v_model)

## Part 2.1: Logistic Regression with Word Embeddings

Now that we have a Word2Vec model that can generate embeddings for the words in our reviews, we can use these embeddings to train a Logistic Regression model to classify the sentiment of these reviews.

In [None]:
import gensim
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
import numpy as np
from gensim.models import Word2Vec
import gensim.downloader as api

In order to generate our inputs to the Logistic Regression model, start by implementing the `get_word_embeddings` function, which given a single review (i.e. a sequence of words), returns the list of corresponding word embeddings for each of the words in the review.

[This](https://radimrehurek.com/gensim/models/word2vec.html#usage-examples) Word2vec documentation may be helpful in implementing `get_word_embeddings`.

In [None]:
# TODO 1: Implement get_word_embeddings in submission.py
from submission import get_word_embeddings

Next, implement `get_reviews_embeddings`, which given a dataframe containing a list of reviews, returns the list of word embeddings for each review in the input data.

In [None]:
# TODO 2: Implement get_reviews_embeddings in submission.py
from submission import get_reviews_embeddings


Finally, we will use this function to obtain the word embeddings for the train, test, and validation sets.

In [None]:
word_embeddings_train = get_reviews_embeddings(w2v_model, data)
word_embeddings_val = get_reviews_embeddings(w2v_model, clean_dataframe(val_df))
word_embeddings_test = get_reviews_embeddings(w2v_model, clean_dataframe(test_df))

Before we can train a logistic regression model on these reviews, we need to ensure that all of our inputs into the model have the same dimensions. Since reviews can have different lengths (i.e. have different word counts), we need a method to standardize the size of the embeddings for each review in a way that does not depend on the length of the review.

A simple way to do this is to perform max pooling over the word embeddings in each review to obtain a single vector of length `d`, where `d` is the size of the word embeddings.

Implement the function `max_pool`, which performs this pooling operation.

Some reviews may not have any words in our vocabulary. If that is the case, we should return a vector of zeros for the features.
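
As a rough illustration of the pooling described above (not the required `max_pool` implementation), the sketch below takes the element-wise maximum over a toy review's word vectors with NumPy. The helper name `toy_max_pool` and the example vectors are made up for this illustration.

```python
import numpy as np

def toy_max_pool(word_vectors, d):
    """Element-wise max over a list of d-dimensional vectors.

    Hypothetical helper for illustration only; returns a zero vector
    when the review has no in-vocabulary words.
    """
    if len(word_vectors) == 0:
        return np.zeros(d)
    # Stack into shape (num_words, d), then take the max down each column.
    return np.max(np.stack(word_vectors), axis=0)

review = [np.array([0.1, -0.5, 2.0]), np.array([0.7, -0.2, -1.0])]
pooled = toy_max_pool(review, d=3)   # one length-3 vector for the whole review
empty = toy_max_pool([], d=3)        # all zeros
```

Whatever the number of words in the review, the result always has length `d`, which is what lets reviews of different lengths share one classifier.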

In [None]:
# TODO 3: Implement max_pool in submission.py
from submission import max_pool


In [None]:
X_train_max_pool = [
    max_pool(review, w2v_model.vector_size) for review in word_embeddings_train
]
X_val_max_pool = [
    max_pool(review, w2v_model.vector_size) for review in word_embeddings_val
]
X_test_max_pool = [
    max_pool(review, w2v_model.vector_size) for review in word_embeddings_test
]

y_train = train_df['label']
y_val = val_df['label']
y_test = test_df['label']

Now that we have our training data set up, we can train our model.

We will use the sklearn library's LogisticRegression model class to create and train a logistic regression classification model `logreg_model` as a baseline.

In [None]:
logreg_model = LogisticRegression(max_iter=1000).fit(X_train_max_pool, y_train)


The following code will display the accuracy of your model on the test set.

In [None]:
y_test_pred = logreg_model.predict(X_test_max_pool)
accuracy_test = accuracy_score(y_test, y_test_pred)
print(f"Test Accuracy: {accuracy_test:.2f}")

Now let's see how your model stacks up against a logistic regression model using Google's pre-trained word embeddings. We will load in pretrained embeddings below. This cell may take several minutes to run (*~12 minutes*).

In [None]:
google_w2v_model = api.load('word2vec-google-news-300')

In [None]:
# The type can be helpful to distinguish google's word2vec model from ours
type(google_w2v_model)

Just as before, we can use Google's word2vec model to obtain word embeddings for each review in our dataset and max pool them. However, Google's word2vec model doesn't store its vocabulary in the attribute `wv`. <br> Update your `get_word_embeddings` function to check the type of the model passed in. <br> The expected behavior and inputs/outputs of the function remain the same; it only needs to accommodate multiple model types.
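
One possible way to handle both kinds of model is to branch on the model's type and then look words up in whichever object actually holds the vectors. The sketch below uses two made-up stand-in classes (`ToyWord2Vec`, `ToyKeyedVectors`) and a hypothetical helper `lookup_vectors` so that it runs without gensim. In the notebook you would presumably test against the classes reported by `type(w2v_model)` and `type(google_w2v_model)` instead. Skipping out-of-vocabulary words is an assumption of this sketch.

```python
class ToyKeyedVectors:
    """Stand-in for a vectors object that is indexed directly by word."""
    def __init__(self, table):
        self.table = table

    def __contains__(self, word):
        return word in self.table

    def __getitem__(self, word):
        return self.table[word]


class ToyWord2Vec:
    """Stand-in for a trained model that keeps its vectors under `.wv`."""
    def __init__(self, table):
        self.wv = ToyKeyedVectors(table)


def lookup_vectors(model, words):
    # Dispatch on the model type: unwrap `.wv` only for the wrapper class.
    vectors = model.wv if isinstance(model, ToyWord2Vec) else model
    # Assumption: words missing from the vocabulary are skipped.
    return [vectors[w] for w in words if w in vectors]


table = {"good": [1.0, 0.0], "bad": [0.0, 1.0]}
words = ["good", "unknown", "bad"]
from_wrapper = lookup_vectors(ToyWord2Vec(table), words)
from_vectors = lookup_vectors(ToyKeyedVectors(table), words)
```

Both calls return the same two vectors, so the caller never needs to know which kind of model it was given.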

In [None]:
# TODO 4: Add support for Google's word2vec

In [None]:
google_embeddings_train = get_reviews_embeddings(google_w2v_model, data)
google_embeddings_val = get_reviews_embeddings(google_w2v_model, clean_dataframe(val_df))
google_embeddings_test = get_reviews_embeddings(google_w2v_model, clean_dataframe(test_df))

X_train_max_pool_google = [
    max_pool(review, google_w2v_model.vector_size) for review in google_embeddings_train
]
X_val_max_pool_google = [
    max_pool(review, google_w2v_model.vector_size) for review in google_embeddings_val
]
X_test_max_pool_google = [
    max_pool(review, google_w2v_model.vector_size) for review in google_embeddings_test
]

Just like before, we use the sklearn library's LogisticRegression model class. This time, we use the word embeddings from google to compare what accuracy we achieve with slightly more sophisticated embeddings.

In [None]:
logreg_model_google = LogisticRegression(max_iter=1000).fit(X_train_max_pool_google, y_train)

In [None]:
y_test_pred_google = logreg_model_google.predict(X_test_max_pool_google)
accuracy_google = accuracy_score(y_test, y_test_pred_google)
print(f"Test Accuracy (Google's Word2Vec): {accuracy_google:.2f}")

# Part 3: LSTMs for Text Classification (20 pts)

## 3.1: LSTM Model (10 pts)

We first begin by implementing the LSTM model that we will later train for sentiment analysis. The entire architecture is as follows:
1. **LSTM** takes the data, an initial hidden state, and an initial cell state (`batch_first` should be set to `True`).
2. **A classification head for sentiment analysis**:
  - **relu** nonlinearity on the final hidden state. Use `nn.ReLU()`.
  - **fc1** fully-connected layer with `hidden_size*num_layers` input dimensions and `128` output features.
  - **relu** nonlinearity on the output of **fc1**. The same relu layer can be used across the entire forward pass.
  - **fc2** fully-connected layer with `128` input dimensions and `num_classes` output features.

Instead of implementing the LSTM architecture from scratch, you may make use of [PyTorch's built-in LSTM class](https://pytorch.org/docs/stable/generated/torch.nn.LSTM.html).

Note: During the forward pass, you will need to get the final hidden state from the LSTM to pass into the MLP. You should look at the LSTM documentation to figure out what the shape of the hidden unit is and reshape it so the input to the MLP is `batch_size x (num_layers x hidden_size)`.
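
To make the shapes in the note concrete, here is a small sketch with made-up sizes. It shows one way to turn the final hidden state into a `batch_size x (num_layers x hidden_size)` tensor. The per-example ordering used here (layer 0 first, then layer 1) is an assumption; the notebook does not say which ordering is expected.

```python
import torch
import torch.nn as nn

batch_size, seq_len, input_size = 4, 7, 5
hidden_size, num_layers = 3, 2

lstm = nn.LSTM(input_size, hidden_size, num_layers, batch_first=True)
x = torch.randn(batch_size, seq_len, input_size)
h0 = torch.zeros(num_layers, batch_size, hidden_size)
c0 = torch.zeros(num_layers, batch_size, hidden_size)

output, (h_n, c_n) = lstm(x, (h0, c0))
# h_n is (num_layers, batch_size, hidden_size) even with batch_first=True.

# Move the batch axis to the front before flattening so that each row
# holds the final hidden states of a single example.
flat = h_n.permute(1, 0, 2).reshape(batch_size, num_layers * hidden_size)
```

Flattening `h_n` without first moving the batch axis would still produce a tensor of the right size, but rows could mix hidden states from different examples.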






In [None]:
#TODO 5: Implement LSTM in submission.py

## 3.2: LSTM Training (10 pts)

We can now move on to training our LSTM on our training data. <br>First you will need to use the downloaded Google word2vec embeddings to create embeddings for input to your LSTM. You will do this by implementing a function `reviews_processing` that processes the reviews embedded with Google's word2vec model. <br>


*   We will clip the sequences to length 40.
*   If a review has fewer than 40 word embeddings, you should pad the review with 0 vectors so that the final sequence has 40 embeddings.

Once we have preprocessed the reviews to have the same length, we can create a `CustomLSTMDataset` to store the reviews. Finally, we can create a data loader that will give us batched data for training.
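
The clipping and padding rule can be sketched on toy data as follows. The helper name `toy_clip_or_pad` is hypothetical, and two details are assumptions of this sketch rather than requirements stated above: the first `length` embeddings are the ones kept, and zero vectors are appended at the end.

```python
import numpy as np

def toy_clip_or_pad(vectors, length, d):
    """Return an array of shape (length, d): truncate long inputs, zero-pad short ones."""
    out = np.zeros((length, d), dtype=np.float32)
    kept = vectors[:length]              # assumption: keep the first `length` vectors
    if len(kept) > 0:
        out[:len(kept)] = np.stack(kept)
    return out

d = 3
short_review = [np.ones(d), 2 * np.ones(d)]
long_review = [np.full(d, i, dtype=np.float32) for i in range(6)]

padded = toy_clip_or_pad(short_review, length=4, d=d)   # 2 real rows + 2 zero rows
clipped = toy_clip_or_pad(long_review, length=4, d=d)   # first 4 of 6 rows
```

Because every processed review has the same shape, the data loader can stack them into a single batch tensor.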

In [None]:
from torch.utils.data import Dataset
class CustomLSTMDataset(Dataset):
  def __init__(self, embeddings, labels):
    self.embeddings = embeddings
    self.labels = labels

  def __len__(self):
    return len(self.embeddings)

  def __getitem__(self, idx):
    return self.embeddings[idx], self.labels[idx]

In [None]:
# TODO 6: Implement reviews_processing in submission.py
from submission import reviews_processing

In [None]:
length = 40
batch_size = 16

train_data = CustomLSTMDataset(reviews_processing(google_embeddings_train, length), y_train)
validation_data = CustomLSTMDataset(reviews_processing(google_embeddings_val, length), y_val)
test_data = CustomLSTMDataset(reviews_processing(google_embeddings_test, length), y_test)

train_loader = DataLoader(train_data, batch_size=batch_size, shuffle=True)
val_loader = DataLoader(validation_data, batch_size=batch_size, shuffle=False)
test_loader = DataLoader(test_data, batch_size=batch_size, shuffle=False)

Now create a training loop to train the model, where the data loader provides batches of (inputs, labels). The `train` function should keep track of the train and validation losses and the validation accuracy at each epoch and return them. As a reminder, for each batch the training loop should:

1. Zero out the gradients of the model.
2. Perform a forward pass through the model.
3. Compute the loss using the specified criterion.
4. Perform a backward pass to compute the gradients.
5. Step the optimizer to update the model parameters.

Additionally, calculate the validation loss and accuracy at the end of each epoch using the `val` function. Remember to set the model to evaluation mode with `.eval()` before evaluating, and set it back to training mode with `.train()` afterward. If you still find it difficult to get started, you can refer to the `train_bert()` and `val_bert()` functions in Section 5.1.
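
The per-batch steps and the switch between training and evaluation mode can be sketched on a toy problem as below. This is only an outline under stated assumptions: the linear model, the random data, and the choice of SGD are made up for illustration, and the "validation" pass simply reuses the toy training data. It does not follow the signatures required for `train` and `val`.

```python
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)

# Toy data: 32 examples, 4 features, 2 classes (made up for illustration).
features = torch.randn(32, 4)
targets = (features[:, 0] > 0).long()
loader = DataLoader(TensorDataset(features, targets), batch_size=8, shuffle=True)

model = nn.Linear(4, 2)
criterion = nn.CrossEntropyLoss()
optimizer = optim.SGD(model.parameters(), lr=0.1)

epoch_losses = []
for epoch in range(3):
    model.train()
    running = 0.0
    for inputs, labels in loader:
        optimizer.zero_grad()              # 1. clear old gradients
        logits = model(inputs)             # 2. forward pass
        loss = criterion(logits, labels)   # 3. compute the loss
        loss.backward()                    # 4. backward pass
        optimizer.step()                   # 5. update the parameters
        running += loss.item()
    epoch_losses.append(running / len(loader))

    model.eval()                           # evaluation mode for the validation pass
    with torch.no_grad():                  # no gradients needed when evaluating
        preds = model(features).argmax(dim=1)
        accuracy = (preds == targets).float().mean().item()
    model.train()                          # back to training mode
```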

In [None]:
# TODO 7: Implement val and train in submission.py
from submission import val, train

In [None]:
from submission import LSTM

num_layers = 2
input_size = 300
hidden_size = 64
seq_length = 40
num_classes = 2

# you may change the learning rate and numbers of epochs run
learning_rate = 0.01
lstm_epochs = 10

criterion = nn.CrossEntropyLoss()


# Initialize LSTM model
lstm_model = LSTM(num_layers, input_size, hidden_size, seq_length, num_classes).to(device)

#Initialize optimizer
optimizer = optim.Adam(lstm_model.parameters(), lr=learning_rate)

#run training
lstm_train_loss, lstm_val_loss, lstm_val_acc = train(lstm_model, train_loader, val_loader, criterion, lstm_epochs, optimizer, device)

Now run the cell below to compare the training losses, validation losses, and validation accuracy of the LSTM model over each training epoch.

In [None]:
x = [epoch + 1 for epoch in range(lstm_epochs)]

f, (ax1, ax2) = plt.subplots(1, 2, figsize=(9, 4))

ax1.plot(x, lstm_train_loss, color='green', label='Train Loss')
ax1.plot(x, lstm_val_loss, color='red', label='Validation Loss')
ax1.set_title("Training & Validation Loss")
ax1.set_xlabel("Epochs")
ax1.set_ylabel("Loss")


ax2.plot(x, lstm_val_acc, color='blue', label='Validation Accuracy')
ax2.set_title("Validation Accuracy")
ax2.set_xlabel("Epochs")
ax2.set_ylabel("Accuracy")

ax1.legend()
ax2.legend()

plt.tight_layout()
plt.show()


In [None]:
_, lstm_accuracy = val(lstm_model, test_loader, criterion, device)

print(f"Test Accuracy (LSTM): {lstm_accuracy:.2f}")

# Part 4: Transformers (50 pts)

The next NLP model we will be applying to our sentiment analysis task is the Transformer!

### Positional Encoding
The following code implements the positional encoding, which injects information about the position of each element in the input sequence into the model. The positional encoding is added to the input embeddings before they are passed through the rest of the encoder, as shown in the figure below.

<center>Image of the positional embedding section of a transformer</center>

<center><img src="https://www.cs.cornell.edu/courses/cs4782/2025sp/images/positional_encoding.png"></center>






In [None]:
# TODO 8: Look at PositionalEncoding in submission.py

### Q1: Based on the provided implementation of PositionalEncoding, what is the formula used to assign an embedding to each position/index `i` in the input sequence? What is one benefit of using this function/formula specifically to generate position embeddings?

**Answer: add your answer to responses.tex**

## 4.1: Attention Mechanism (15 pts)
First, let's revisit how the multi-head attention layer works.

**A)** Recall that given query, key, and value matrices $Q \in \mathbb{R}^{n \times d_q}$, $K \in \mathbb{R}^{n \times d_k}$, and $V \in \mathbb{R}^{n \times d_v}$, where $d_q = d_k$ and $n$ is the sequence length, the attention equation for a single head is:
$$\texttt{attention}(Q, K, V) = \texttt{softmax} \left( \frac{QK^T}{\sqrt{d_k}} \right)V$$

**B)** To produce the query, key, and value matrices from the model inputs, weight matrices $W^q$, $W^k$, and $W^v$ are used to transform the input sequence X into the Q, K, and V matrices.

$$Q = X(W^q)^T$$
$$K = X(W^k)^T$$
$$V = X(W^v)^T$$

**C)** If we have $h$ attention heads, the output from each attention head can be represented as:
$$\texttt{head}_i = \texttt{attention}(Q_i, K_i, V_i) \quad \forall \; i \in [1, h]$$

$$\texttt{head}_i = \texttt{attention}(X(W^q_i)^T, X(W^k_i)^T, X(W^v_i)^T) \quad \forall \; i \in [1, h]$$

**Note**: to make the dimensions work out, given $h$ attention heads, $d_k = \frac{d_{model}}{h}$, where $d_{model}$ is the dimension of the embeddings.

**D)** The multi-headed attention output is the concatenation of the output of each head as follows:

$$\texttt{multi-head} = Concat[\texttt{head}_1, \texttt{head}_2, ...\texttt{head}_h]$$

**E)** Finally,  a weight matrix $W^o$ is used to transform the multi-head output and generate the final output of the multi-head attention layer, i.e.

$$\texttt{MultiHead}(Q, K, V) = [\texttt{head}_1, \texttt{head}_2, ...\texttt{head}_h](W^o)^T$$
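
For intuition, the single-head equation in step (A) can be written out in NumPy for one unbatched sequence. This is a sketch with made-up sizes and a hypothetical function name `toy_attention`; it leaves out batching, head splitting, and the learned projections that the assignment's `MultiHeadAttention` needs.

```python
import numpy as np

def softmax(scores):
    # Subtract the row max for numerical stability before exponentiating.
    shifted = scores - scores.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

def toy_attention(Q, K, V):
    d_k = K.shape[-1]
    weights = softmax(Q @ K.T / np.sqrt(d_k))   # (n, n), each row sums to 1
    return weights @ V, weights                 # (n, d_v), (n, n)

rng = np.random.default_rng(0)
n, d_k, d_v = 5, 4, 6
Q = rng.normal(size=(n, d_k))
K = rng.normal(size=(n, d_k))
V = rng.normal(size=(n, d_v))

out, weights = toy_attention(Q, K, V)
```

Each output row is a weighted average of the rows of `V`, with weights given by how strongly that position's query matches every key.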


In [None]:
# TODO 9: Implement MultiHeadAttention in submission.py

### Now, complete the TODOs in the following order

1. **TODO 9.1**: Initialize the linear layers $W^q$, $W^k$, $W^v$, and $W^o$ used to generate the Q, K, V matrices and transform the multi-head output.

2. Understand the logic used to split the heads. (*barbaric! I know!*)

3. **TODO 9.2**: Implement the `compute_attention` function that performs the operation in **step (A)** in the description above.

4. Understand the logic used to combine the heads.

5. **TODO 9.3** Implement the forward pass that performs the entire attention process described previously and returns the output as per operation in **step (E)** above.

### Q2: Describe the matrix size transformations as they happen from steps 1-5. Specifically, what is the shape of:
### i. the input to the `split_heads` function?
### ii. the output from the `split_heads` function?
### iii. the `multi-head` variable in the description above?
### iv. the final output of `MultiHead(Q, K, V)`?

### Answers expected as matrix shapes in terms of $n$, $d_{model}$, $d_q$, $d_k$, $d_v$ and $h$. You can use the additional variable $b$ to represent batch size.

**Answer: add your answer to responses.tex**

## 4.2: Position-Wise Feed-Forward Neural Network (5 pts)
Next, we will implement the position-wise feed-forward portion of the encoder. The feed-forward architecture is as follows:
- **fc1**: fully-connected layer with `d_model` input dimensions and `d_ff` output features
- **ReLU** nonlinearity
- **fc2**: fully-connected layer with `d_ff` input dimensions and `d_model` output features
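
"Position-wise" means the same two layers are applied independently at every position in the sequence. The NumPy sketch below illustrates this with randomly initialized weights standing in for the learned `fc1` and `fc2` parameters; the sizes and the function name `toy_feed_forward` are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d_model, d_ff = 5, 8, 16

# Randomly initialized weights stand in for the learned fc1 / fc2 parameters.
W1, b1 = rng.normal(size=(d_model, d_ff)), np.zeros(d_ff)
W2, b2 = rng.normal(size=(d_ff, d_model)), np.zeros(d_model)

def toy_feed_forward(x):
    hidden = np.maximum(x @ W1 + b1, 0.0)   # fc1 followed by ReLU
    return hidden @ W2 + b2                 # fc2 maps back to d_model

x = rng.normal(size=(n, d_model))
whole_sequence = toy_feed_forward(x)
one_position_at_a_time = np.stack([toy_feed_forward(row) for row in x])
```

Applying the block to the whole sequence at once gives the same result as applying it to each position separately, so no information moves between positions in this part of the encoder.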

In [None]:
# TODO 10: Implement FeedForward in submission.py

## 4.3: Encoder Layer (15 pts)

Next, we will implement a single encoder layer (shown in the dashed red outline).

<center>Image of encoder section of transformer</center>

<center><img src="https://www.cs.cornell.edu/courses/cs4782/2025sp/images/encoder.png"></center>

You will implement an encoder layer with the following structure:

1. A single multi-head attention layer with `num_heads` heads. This attention layer performs self-attention, i.e. the key, query, and value matrices are all generated from the same input. Name this layer `self_attn`.

2. A single layer normalization layer with input shape `d_model`. Name this layer `norm1`. As shown in the diagram, the inputs to this layer are the following:

  a. Input embeddings which were input to the multi-head attention block.

  b. Output of the multi-head attention block. Dropout (with dropout probability `p`) should be applied to the multi-head attention output before adding.

3. A feed-forward block with input dimension `d_model` and hidden layer dimension `d_ff`. Name this layer `feed_forward`.

4. A second layer normalization layer with input shape `d_model`. Name this layer `norm2`. Similar to the first layer norm layer, this layer also has two inputs:

  a. The input to the feed-forward network.

  b. Output of the feed-forward network. Dropout (with dropout probability `p`) should be applied to the feed-forward output before adding.

5. A Dropout layer named `dropout`, applied to the attention and feed-forward outputs before they are added to their inputs and passed to `norm1` and `norm2`.
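
The "add, then normalize" pattern used in items 2 and 4 can be sketched on its own as follows. A plain linear layer stands in for the attention or feed-forward sublayer, and all sizes are made up. This is an illustration of the pattern, not the `EncoderLayer` itself.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
d_model, p = 8, 0.1

sublayer = nn.Linear(d_model, d_model)   # stand-in for attention or feed-forward
dropout = nn.Dropout(p)
norm = nn.LayerNorm(d_model)

x = torch.randn(2, 5, d_model)           # (batch, sequence, d_model)

dropout.eval()                           # disable dropout so this demo is deterministic
y = norm(x + dropout(sublayer(x)))       # residual add, then layer norm
```

The output keeps the input's shape, which is what allows encoder layers to be stacked one after another.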



In [None]:
# TODO 11: Implement EncoderLayer in submission.py

## 4.4: (Encoder-only) Transformer (15 pts)

Now that we created all of its components, we can implement the full Transformer Encoder. Similar to the ResNet you implemented in the previous assignment, the main portion of the Encoder involves stacking Encoder Layers together. The entire architecture is as follows:

1. The first step is to add the positional encodings to the input embeddings, using the `PositionalEncoding` module we provided. Name this layer `positional_encoding`. Dropout (with dropout probability `p`) should be applied to the final output embedding from this layer.

2. Initialize a Dropout layer named `dropout` to be used as per step 1.

3. Next, the output from step 1 is passed through `num_layers` encoder layers. These encoder layers are stored in a `ModuleList` variable named `encoder_layers`.

4. The output of the encoder layers (`batch_size x max_seq_length x d_model`) is then mean-pooled across the sequence to yield an output of shape `batch_size x d_model`.

5. Finally, the pooled output is passed through two fully-connected layers, with a `ReLU` non-linearity applied before each fully-connected layer. The first fully-connected layer `fc1` should have `128` output features and the second fully-connected layer `fc2` should have `num_classes` output features.
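
The mean-pooling in step 4 reduces over the sequence axis only. A tiny NumPy sketch with made-up sizes shows the shape change; with a PyTorch tensor the analogous call would be `x.mean(dim=1)`.

```python
import numpy as np

batch_size, max_seq_length, d_model = 2, 4, 3
encoded = np.arange(batch_size * max_seq_length * d_model, dtype=np.float32)
encoded = encoded.reshape(batch_size, max_seq_length, d_model)

pooled = encoded.mean(axis=1)   # average over the sequence axis
```

Each example ends up with a single `d_model`-dimensional vector, regardless of how many positions the sequence had.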

In [None]:
# TODO 12: Implement Transformer in submission.py

## Transformer Training

The following cells will train the transformer model. They make use of the same datasets and functions that you created in Part 3 for the LSTM. (*~12 minutes on CPU*)

In [None]:
from submission import Transformer
from submission import val, train

d_model = 300
num_heads = 4
num_layers = 4
d_ff = 1024
max_seq_length = 40
dropout = 0.1
num_classes = 2

transformer_epochs = 10 # you may change the number of epochs (note: 10 epochs should take ~15 minutes to train on Colab CPU-only)
lr = 0.0001

criterion = nn.CrossEntropyLoss()
transformer = Transformer(num_classes, d_model, num_heads, num_layers, d_ff, max_seq_length, dropout).to(device)
optimizer = optim.Adam(transformer.parameters(), lr=lr, betas=(0.9, 0.98), eps=1e-9)

In [None]:
transformer_train_loss, transformer_val_loss, transformer_val_acc = train(transformer, train_loader, val_loader, criterion, transformer_epochs, optimizer, device)

Now run the cell below to compare the training losses, validation losses, and validation accuracy of the transformer model over each training epoch.

In [None]:
x = [epoch + 1 for epoch in range(transformer_epochs)]

f, (ax1, ax2) = plt.subplots(1, 2, figsize=(9, 4))

ax1.plot(x, transformer_train_loss, color='green', label='Train Loss')
ax1.plot(x, transformer_val_loss, color='red', label='Validation Loss')
ax1.set_title("Transformer Training & Validation Loss")
ax1.set_xlabel("Epochs")
ax1.set_ylabel("Loss")

ax2.plot(x, transformer_val_acc, color='blue', label='Validation Accuracy')
ax2.set_title("Transformer Validation Accuracy")
ax2.set_xlabel("Epochs")
ax2.set_ylabel("Accuracy")

ax1.legend()
ax2.legend()

plt.tight_layout()
plt.show()



In [None]:
_, transformer_accuracy = val(transformer, test_loader, criterion, device)

print(f'Transformer Test Accuracy: {transformer_accuracy:.2f}')

# Part 5: Pre-trained Models (10 pts)

As discussed in lecture, in natural language processing, it is common to make use of pre-trained models that can be fine-tuned to a specific task to improve performance.

Through the Hugging Face platform, we have access to a wide variety of these pre-trained models. For this portion of the assignment, we will be using the DistilBERT model, a small, fast, cheap, and light Transformer model trained by distilling BERT base. We will be fine-tuning DistilBERT for our sentiment analysis task.

For this portion of the assignment you will need to connect to a GPU runtime.


First, run the following code to download the pre-trained DistilBERT model and its tokenizer from Hugging Face.

Notice that after we download the DistilBERT model, we call `.to(device)`. This sends the model to the GPU. Later in the assignment, when inputting data into the model, you will similarly need to ensure that the data is also on the GPU, i.e. in the same location as the model.





In [None]:
from transformers import AutoModelForSequenceClassification
import torch.optim as opt
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
bert_model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased", num_labels=2).to(device)

The following code implements a custom dataset class which will be used in the data loader for the DistilBERT model you will be fine-tuning.

The `__getitem__` function is called automatically by the data loader when iterating over the dataset, e.g. during training, to produce the input that will eventually be passed into the model. This implementation of `__getitem__` uses the DistilBERT tokenizer to produce the following values:


*   `'source_ids'`: The list of token ids representing the input sequence.
*   `'source_mask'`: A binary mask (1 for real tokens, 0 for padding) specifying which tokens should be attended to by the model. Since different sequences in the same batch might have different lengths, the sequences must be padded or truncated to the same length in order to fit in one tensor. The source attention mask tells the model which elements in `source_ids` are padding so that the model does not attend to them. A more detailed explanation can be found [here](https://huggingface.co/transformers/v3.5.1/glossary.html#attention-mask).

Finally, the output of `__getitem__` also includes the example's ground truth label (`'label'`).
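
To see how token ids and the attention mask line up, here is a simplified pure-Python sketch. The token ids, the padding id `PAD_ID`, and the helper `toy_pad` are all made up; a real tokenizer also handles special tokens when truncating, which this sketch ignores.

```python
PAD_ID = 0  # hypothetical padding id, chosen for this illustration only

def toy_pad(token_ids, max_len):
    kept = token_ids[:max_len]               # truncate long sequences
    n_pad = max_len - len(kept)
    ids = kept + [PAD_ID] * n_pad            # pad short sequences
    mask = [1] * len(kept) + [0] * n_pad     # 1 = real token, 0 = padding
    return ids, mask

ids_a, mask_a = toy_pad([11, 57, 42, 12], max_len=6)
ids_b, mask_b = toy_pad([11, 23, 64, 8, 95, 31, 12], max_len=6)
```

After padding or truncation every sequence has the same length, and the mask records which positions hold real tokens.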






In [None]:
import torch.utils.data as data

class CustomClassDataset(data.Dataset):
    def __init__(self, data, tokenizer):
        super(CustomClassDataset, self).__init__()

        self.data = data
        self.tokenizer = tokenizer
        self.out = self.data['label']
        self.text = self.data['text']
        self.max_len = 256

    def __len__(self):
        '''
        Returns the length of the dataset.
        '''
        return len(self.text)

    def __getitem__(self, idx):
        '''
        Returns the training/validation/test example at index idx.
        '''

        text = str(self.text[idx])
        text = ' '.join(text.split())

        source = self.tokenizer([text], padding='max_length', truncation = True, return_tensors="pt", max_length = self.max_len)
        source_ids = source['input_ids'].squeeze()
        source_mask = source['attention_mask'].squeeze()

        label = self.out[idx]

        inputs = {
            'source_ids': source_ids.to(dtype=torch.long),
            'source_mask': source_mask.to(dtype=torch.long),
            'label': label
        }

        return inputs

In [None]:
batch_size = 32

train_data = CustomClassDataset(train_df, tokenizer)
train_loader = DataLoader(train_data, batch_size=batch_size, shuffle=True, pin_memory=True, drop_last=True)

val_data = CustomClassDataset(val_df, tokenizer)
val_loader = DataLoader(val_data, batch_size=batch_size, shuffle=True, pin_memory=True, drop_last=False)

test_data = CustomClassDataset(test_df, tokenizer)
test_loader = DataLoader(test_data, batch_size=batch_size, shuffle=True, pin_memory=True, drop_last=False)

## 5.1: Fine-tuning

To avoid duplicating the code that processes each batch during training and validation, implement the function `process_batch`, which feeds the examples in a batch into the model and returns the model outputs, predicted labels, and loss for that batch. During validation, the function also returns the total number of examples in the batch and the number of examples for which the model predicted the correct label. These numbers are later used to calculate model accuracy.

[This link](https://huggingface.co/docs/transformers/en/model_doc/distilbert#transformers.DistilBertForSequenceClassification) provides more information on the inputs to a forward pass through the DistilBERT model. For the purposes of this assignment, you will only need to provide the `input_ids` and `attention_mask` as inputs.



Here are some PyTorch docs that may be useful in processing the model outputs:

*   argmax: https://pytorch.org/docs/stable/generated/torch.argmax.html
*   sum: https://pytorch.org/docs/stable/generated/torch.sum.html
*   eq: https://pytorch.org/docs/stable/generated/torch.eq.html
*   size: https://pytorch.org/docs/stable/generated/torch.Tensor.size.html
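
As a rough, dependency-free illustration of what the four operations above compute, the following plain-Python sketch mirrors them on a toy batch. The logits and labels are made-up values, and the list-based code is only an analogue of the tensor operations, not an implementation of `process_batch`.

```python
# Toy batch: one row of class scores (logits) per example, two classes each
logits = [[2.1, -0.3], [0.2, 1.7], [-1.0, 0.4], [0.9, 0.8]]
labels = [0, 1, 0, 0]

# argmax analogue: index of the largest score in each row
predicted = [max(range(len(row)), key=row.__getitem__) for row in logits]

# eq analogue: element-wise comparison of predictions and labels
matches = [p == y for p, y in zip(predicted, labels)]

# sum analogue: number of correct predictions (True counts as 1)
num_correct = sum(matches)

# size analogue: number of examples in the batch
batch_size = len(labels)

print(predicted, num_correct, batch_size, num_correct / batch_size)
```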



In [None]:
# TODO 13: Implement process_batch in submission.py
from submission import process_batch

We provide the validation and training functions for fine-tuning the DistilBERT model, which make use of the `process_batch` function that you implemented.
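
The provided `val_bert` accumulates `num_correct` and `batch_size` across batches to compute accuracy, and averages the per-batch losses to compute the reported loss. The sketch below reproduces that arithmetic on made-up per-batch numbers. It assumes each batch loss is already a mean over the batch, which is the default behavior of `nn.CrossEntropyLoss`. The tuples and variable names are hypothetical.

```python
# Hypothetical per-batch metrics: (mean loss of the batch, num_correct, batch_size)
batches = [(0.50, 24, 32), (0.40, 26, 32), (0.80, 4, 8)]

# Loss reported by a loop like val_bert: mean of the per-batch mean losses
avg_loss_over_batches = sum(loss for loss, _, _ in batches) / len(batches)

# Accuracy: total correct predictions divided by total examples
total_correct = sum(correct for _, correct, _ in batches)
total = sum(size for _, _, size in batches)
accuracy = total_correct / total

# For comparison: loss averaged per example, weighting each batch by its size
per_example_loss = sum(loss * size for loss, _, size in batches) / total

print(round(avg_loss_over_batches, 4), round(per_example_loss, 4), accuracy)
```

The two loss averages differ slightly whenever the last batch is smaller than the others, which can happen with `drop_last=False`. The accuracy is unaffected because it is computed from totals.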

In [None]:
def val_bert(model, val_loader, criterion, device):
    """
    Inputs:
    model (torch.nn.Module): The deep learning model to be evaluated.
    val_loader (torch.utils.data.DataLoader): DataLoader for the validation (or test) dataset.
    criterion (torch.nn.Module): Loss function to compute the validation loss.
    device: Device (CPU or GPU) on which to run the model.

    Outputs:
    Tuple of (validation loss, validation accuracy)
    """
    val_running_loss = 0.0
    correct = 0
    total = 0

    model.eval()
    with torch.no_grad():
        for i, data in enumerate(val_loader, 0):

            _, batch_metrics = process_batch(model, data, criterion, device, val=True)

            val_running_loss += batch_metrics['loss'].cpu().item()
            correct += batch_metrics['num_correct']
            total += batch_metrics['batch_size']
    model.train()

    avg_val_loss = val_running_loss / len(val_loader)
    return avg_val_loss, (correct / total).item()


In [None]:
def train_bert(model, train_loader, criterion, epochs, optim, lr_scheduler, device):
    """
    Inputs:
    model (torch.nn.Module): The deep learning model to be trained.
    train_loader (torch.utils.data.DataLoader): DataLoader for the training dataset.
    criterion (torch.nn.Module): Loss function to compute the training loss.
    epochs: Number of epochs to train.
    optim: The optimizer for training.
    lr_scheduler: Learning rate scheduler for training.
    device: Device (CPU or GPU) on which to run the model.

    Note: validation after each epoch uses the notebook-level `val_loader`.

    Outputs:
    Tuple of (train_loss_arr, val_loss_arr, val_acc_arr), lists of the training loss,
    validation loss, and validation accuracy at each epoch
    """
    train_loss_arr = []
    val_loss_arr = []
    val_acc_arr = []
    running_loss = 0.0

    for epoch in range(epochs):
        running_loss = 0.0

        for batch_idx, data in enumerate(train_loader):

            _, metrics = process_batch(model, data, criterion, device)

            loss = metrics['loss'].cpu().item()

            optim.zero_grad()
            metrics['loss'].backward()
            optim.step()

            running_loss += loss

        avg_train_loss = running_loss / len(train_loader)

        avg_val_loss, val_acc = val_bert(model, val_loader, criterion, device)
        train_loss_arr.append(avg_train_loss)
        val_loss_arr.append(avg_val_loss)
        val_acc_arr.append(val_acc)

        print("epoch:", epoch+1, "training loss:", round(avg_train_loss, 3), 'validation loss:', round(avg_val_loss, 3), 'validation accuracy:', round(val_acc*100, 2))

        lr_scheduler.step()

    return train_loss_arr, val_loss_arr, val_acc_arr

The following cells will fine-tune the DistilBERT model.

Note: If you want to start fresh with fine-tuning (for example, to test different hyperparameters), make sure to re-run the first cell in Part 5. This reloads the original pre-trained weights. If you skip this step, training will continue from where you left off, making it hard to fairly compare different settings.
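
When choosing hyperparameters, it may help to see how the step scheduler interacts with the number of epochs. `StepLR` multiplies the learning rate by `gamma` once every `step_size` scheduler steps, so after `epoch` steps the rate is `base_lr * gamma ** (epoch // step_size)`. The sketch below evaluates that formula in plain Python, using the default values from the next cell (learning rate `3e-5`, step size 15, gamma 0.1, 5 epochs). The helper name `step_lr` is hypothetical.

```python
def step_lr(base_lr, epoch, step_size, gamma=0.1):
    """Learning rate after `epoch` scheduler steps under a step decay schedule."""
    return base_lr * gamma ** (epoch // step_size)


# Default settings: step size 15, 5 epochs (epochs are 0-indexed here)
schedule = [step_lr(3e-5, epoch, step_size=15) for epoch in range(5)]
print(schedule)

# A smaller step size, for comparison
faster = [step_lr(3e-5, epoch, step_size=2) for epoch in range(5)]
print(faster)
```

With a step size of 15 and only 5 epochs, the learning rate never decays. The scheduler only takes effect if you train for more than 15 epochs or lower the step size.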

In [None]:
scheduler_step_size = 15

# you may change the learning rate and training epochs
learning_rate = 3e-5
bert_epochs = 5

optim = opt.Adam(bert_model.parameters(), learning_rate)
lr_scheduler = opt.lr_scheduler.StepLR(optim, scheduler_step_size, 0.1)
criterion = nn.CrossEntropyLoss()

In [None]:
bert_train_loss, bert_val_loss, bert_val_acc = train_bert(bert_model, train_loader, criterion, bert_epochs, optim, lr_scheduler, device)

Now run the cell below to compare the training losses, validation losses, and validation accuracy of the DistilBERT model over each training epoch.

In [None]:
x = [epoch + 1 for epoch in range(bert_epochs)]

f, (ax1, ax2) = plt.subplots(1, 2, figsize=(9, 4))

ax1.plot(x, bert_train_loss, color='green', label='Train Loss')
ax1.plot(x, bert_val_loss, color='red', label='Validation Loss')
ax1.set_title("DistilBERT Training & Validation Loss")
ax1.set_xlabel("Epochs")
ax1.set_ylabel("Loss")

ax2.plot(x, bert_val_acc, color='blue', label='Validation Accuracy')
ax2.set_title("DistilBERT Validation Accuracy")
ax2.set_xlabel("Epochs")
ax2.set_ylabel("Accuracy")

ax1.legend()
ax2.legend()

plt.tight_layout()
plt.show()


In [None]:
_, bert_accuracy = val_bert(bert_model, test_loader, criterion, device)

print('Fine-tuned DistilBERT Test Set Accuracy:', round(bert_accuracy, 3))

# Part 6: Model Comparisons

Finally, now that we have implemented a variety of different NLP models to perform sentiment analysis, we can compare their performance on the IMDB dataset.

The following code will generate a barplot comparing the test accuracies of each of the five models you trained throughout this assignment.

In [None]:
model_names = ['LR', 'LR + W2V', 'LSTM', 'transformer', 'distilBERT']

accuracies = [accuracy_test, accuracy_google, lstm_accuracy, transformer_accuracy, bert_accuracy]
y_pos = np.arange(len(accuracies))

color = sns.color_palette('Set2')[2]
plt.bar(y_pos, height = accuracies, color = color)

plt.xlabel("Model")
plt.ylabel("Test Accuracy")
plt.ylim(0, 1)
plt.xticks(y_pos, model_names)

plt.show()

### Q3: What do you notice about the graph above and the differences between the five models you have implemented? Are the results consistent with what you expected? How do pre-trained models compare to models you trained from scratch? Write 3-4 sentences below.

**Answer: add your answer to responses.tex**

## Run the following to create your submission files.

In [None]:
length = 40
batch_size = 16

test_data = CustomLSTMDataset(
    reviews_processing(google_embeddings_test, length), y_test
)
test_loader = DataLoader(test_data, batch_size=batch_size, shuffle=False)


# Run a model over the data loader and save its predictions to <ModelClassName>_preds.csv
def get_predictions(model, data_loader):
    model.eval()
    model.to(device)
    preds = []

    for i, (inputs, labels) in enumerate(data_loader):
        inputs = inputs.to(device, dtype=torch.float32)
        labels = labels.to(device)

        with torch.no_grad():
            outputs = model(inputs)

        preds.extend(outputs.argmax(dim=1).cpu().numpy())

    p = pd.DataFrame(preds, columns=["preds"])
    file_save_path = os.path.join(base_dir, f"{model.__class__.__name__}_preds.csv")
    p.to_csv(file_save_path, index=False)


# Store predictions for LSTM and Transformer models
get_predictions(lstm_model, test_loader)
get_predictions(transformer, test_loader)

# Store predictions for Logistic Regression
logreg_model = LogisticRegression(max_iter=1000)
logreg_model.fit(X_train_max_pool, y_train)
y_test_pred = logreg_model.predict(X_test_max_pool)
preds = pd.DataFrame(y_test_pred, columns=["preds"])
file_save_path = os.path.join(base_dir, "LR_student.csv")
preds.to_csv(file_save_path, index=False)

logreg_model_google = LogisticRegression(max_iter=1000)
logreg_model_google.fit(X_train_max_pool_google, y_train)
y_test_pred_google = logreg_model_google.predict(X_test_max_pool_google)
preds = pd.DataFrame(y_test_pred_google, columns=["preds"])
file_save_path = os.path.join(base_dir, "LR_google.csv")
preds.to_csv(file_save_path, index=False)
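
Before zipping, you may want to confirm that all four prediction files exist and have the format written above (a single `preds` column). The following is a minimal sketch using only the standard library. The helper `check_prediction_files` is hypothetical, and the demo runs on dummy files in a temporary directory. To check your real files, pass your own `base_dir` instead. It only checks the file format, not what the autograder expects.

```python
import csv
import os
import tempfile

EXPECTED_FILES = ["LR_google.csv", "LR_student.csv", "Transformer_preds.csv", "LSTM_preds.csv"]


def check_prediction_files(directory, expected=EXPECTED_FILES):
    """Return {file name: number of prediction rows}; raise if a file is missing or malformed."""
    counts = {}
    for name in expected:
        path = os.path.join(directory, name)
        with open(path, newline="") as f:        # raises FileNotFoundError if missing
            reader = csv.DictReader(f)
            if reader.fieldnames != ["preds"]:
                raise ValueError(f"{name}: expected a single 'preds' column, got {reader.fieldnames}")
            counts[name] = sum(1 for _ in reader)
    return counts


# Demo on dummy files (made-up predictions) in a temporary directory
demo_dir = tempfile.mkdtemp()
for name in EXPECTED_FILES:
    with open(os.path.join(demo_dir, name), "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["preds"])
        writer.writerows([[1], [0], [1]])

demo_counts = check_prediction_files(demo_dir)
print(demo_counts)
```

All four files should report the same number of rows, since each model predicts on the same test set.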