# Part 2: Sentiment Analysis Fine-Tuning with BERT

In this part you will fine-tune a pre-trained encoder-only language model called BERT (originally trained and released by Google in 2018) for a sentiment analysis task. Unlike a causal GPT-style language model, BERT is bidirectional in the sense that it was trained to predict a masked word in the middle of a sequence using both the previous and subsequent tokens. For example, BERT was trained on tasks like predicting the masked token in `The sweet black cat [MASK] by the window in the sun.` considering both the preceding tokens `The sweet black cat` **and** the subsequent tokens `by the window in the sun.`

This kind of model is not used for autoregressively generating new text, but is very useful when you want to understand an entire sequence of text as a whole, because every token can attend to both earlier and later tokens in the sequence. Sentiment analysis, wherein we want to classify an entire input sequence as either positive or negative in sentiment (in this part, we classify movie reviews as either positive or negative), is a good example where this kind of understanding is important.
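
The difference between causal and bidirectional context can be illustrated with a small plain-Python sketch. It is only an illustration: the helper `visible_positions` is hypothetical, and whole words stand in for tokens, which is a simplification of how BERT actually tokenizes text.

```python
def visible_positions(seq_len, position, causal):
    """Return the token positions that `position` is allowed to attend to."""
    if causal:
        # GPT-style: only the current and earlier positions are visible
        return list(range(position + 1))
    # BERT-style: every position in the sequence is visible
    return list(range(seq_len))

# Assumption: one word per token, purely for illustration
tokens = ["The", "sweet", "black", "cat", "[MASK]", "by", "the", "window"]
mask_index = tokens.index("[MASK]")

causal_context = [tokens[i] for i in visible_positions(len(tokens), mask_index, causal=True)]
bidirectional_context = [tokens[i] for i in visible_positions(len(tokens), mask_index, causal=False)]

print("causal:       ", causal_context)
print("bidirectional:", bidirectional_context)
```

Under the causal rule the words after `[MASK]` are never seen, whereas the bidirectional rule exposes the whole sentence when predicting the masked position.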

In this part we will directly modify the `PyTorch` model and will conduct the fine-tuning directly in `PyTorch` as we have done with previous models.

**Learning objectives.** You will:
1. Examine an encoder-only BERT transformer model
2. Modify a BERT model for sentiment analysis
3. Fine-tune the model on movie review data for sentiment analysis

While it is possible to complete this assignment using CPU compute, it may be slow. To accelerate your training, consider using GPU resources such as `CUDA` through the CS department cluster. Alternatives include Google Colab or local GPU resources for those running on machines with GPU support.

First, ensure that you have the `transformers` and `datasets` modules installed. We will use these modules for importing tokenizers, pretrained models, and datasets. You can run the following cells to try to install them with `pip` if needed. If you are using ondemand, ideally you would simply include `module load transformers` and `module load datasets` when making your initial reservation.

In [2]:
pip install transformers

Defaulting to user installation because normal site-packages is not writeable
Note: you may need to restart the kernel to use updated packages.


In [3]:
pip install datasets

Defaulting to user installation because normal site-packages is not writeable
Note: you may need to restart the kernel to use updated packages.


Now the following code imports a *tokenizer* and demonstrates its use. 

Note how the sequence of words in the input string is replaced with a sequence of numbers in the `input_ids`: these are indices into the tokenizer's vocabulary of 30522 tokens. Also note the `special_tokens`: `[UNK]` is used for anything not in the vocabulary, and `[PAD]` can be useful for padding out a sequence of tokens to a specified length.

Given a sequence of strings, the tokenizer returns a dictionary containing not just the `input_ids` (what you will most often want to use) but also `token_type_ids` (which segment each token belongs to when a pair of sentences is encoded together; you will use this least often) and `attention_mask`. The `attention_mask` has the same dimensions as the `input_ids` with a `1` in a given position if there is a non-padding token in that position and a `0` if that position is just a padding token. This is helpful when you are tokenizing a batch of multiple strings with potentially different lengths but want to create a single tensor. `padding='longest'` as shown pads all of the inputs to the same number of tokens as the longest input by adding `[PAD]` tokens to the end. The `attention_mask` is then passed along so that you can ignore the extraneous padding tokens as needed.

Also note the `return_tensors` parameter. Using `"pt"` as shown indicates that the results should be returned as PyTorch tensors. If you omit this parameter then the results will be returned as Python lists by default.
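
The padding and `attention_mask` behavior described above can be mimicked in plain Python. This is a minimal sketch, not the tokenizer's actual implementation: `pad_batch` is a hypothetical helper, and the token ids are copied from the example output in this section (with `0` as the id of `[PAD]`).

```python
PAD_ID = 0  # id of [PAD] according to the tokenizer printout

def pad_batch(sequences, pad_id=PAD_ID):
    """Right-pad each id list to the longest length and build the matching mask."""
    longest = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = longest - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}

# Ids for "the cow" and "jumped over the moon", before padding
batch = pad_batch([[101, 1996, 11190, 102],
                   [101, 5598, 2058, 1996, 4231, 102]])
print(batch["input_ids"])
print(batch["attention_mask"])
```

The shorter sequence gains two trailing `0` ids, and its mask ends in two zeros so that downstream code can tell real tokens from padding.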

In [7]:
# run but you do not need to modify this code
from transformers import BertTokenizer
tokenizer = BertTokenizer.from_pretrained('google-bert/bert-base-uncased',  
                                          clean_up_tokenization_spaces=True)
print(tokenizer)
tokenized = tokenizer(["the cow", "jumped over the moon"], padding='longest', return_tensors="pt")
print(tokenized)

BertTokenizer(name_or_path='google-bert/bert-base-uncased', vocab_size=30522, model_max_length=512, is_fast=False, padding_side='right', truncation_side='right', special_tokens={'unk_token': '[UNK]', 'sep_token': '[SEP]', 'pad_token': '[PAD]', 'cls_token': '[CLS]', 'mask_token': '[MASK]'}, clean_up_tokenization_spaces=True),  added_tokens_decoder={
	0: AddedToken("[PAD]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	100: AddedToken("[UNK]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	101: AddedToken("[CLS]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	102: AddedToken("[SEP]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	103: AddedToken("[MASK]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
}
{'input_ids': tensor([[  101,  1996, 11190,   102,     0,     0],
        [  101,  5598,  2058,  1996,  4231,   102]])

The tokenizer also has a `decode` method by which you can translate `input_ids` back into strings. You can optionally set `skip_special_tokens=True` if you want to ignore the special tokens like padding, unknown, etc.

In [8]:
# run but you do not need to modify this code
for tokens in tokenized["input_ids"]:
    print(tokenizer.decode(tokens, skip_special_tokens=True))

the cow
jumped over the moon


Now we import our language model, in this case a pretrained BERT model. This is an encoder-only transformer architecture previewed below. As you can see, the embedding expects a vocabulary of 30522 matching our tokenizer. The model embedding dimension is 768 and the output layer of the model also has 768 units.

In [9]:
# run but you do not need to modify this code
import torch
from torch import nn
from transformers import BertModel
pretrained_model = BertModel.from_pretrained("google-bert/bert-base-uncased")
print(pretrained_model)

BertModel(
  (embeddings): BertEmbeddings(
    (word_embeddings): Embedding(30522, 768, padding_idx=0)
    (position_embeddings): Embedding(512, 768)
    (token_type_embeddings): Embedding(2, 768)
    (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
    (dropout): Dropout(p=0.1, inplace=False)
  )
  (encoder): BertEncoder(
    (layer): ModuleList(
      (0-11): 12 x BertLayer(
        (attention): BertAttention(
          (self): BertSelfAttention(
            (query): Linear(in_features=768, out_features=768, bias=True)
            (key): Linear(in_features=768, out_features=768, bias=True)
            (value): Linear(in_features=768, out_features=768, bias=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
          (output): BertSelfOutput(
            (dense): Linear(in_features=768, out_features=768, bias=True)
            (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
            (dropout): Dropout(p=0.1, inplace=False)
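
As a rough sanity check on the sizes in the printout, the parameter counts of the layers shown can be worked out by hand. The sketch below covers only the embeddings and one self-attention block, since those are the layers visible above; the helper names are hypothetical.

```python
vocab_size, hidden, max_positions, token_types = 30522, 768, 512, 2

def linear_params(n_in, n_out):
    """Weights plus biases of a Linear(n_in, n_out, bias=True) layer."""
    return n_in * n_out + n_out

def layernorm_params(dim):
    """Scale and shift vectors of an elementwise-affine LayerNorm."""
    return 2 * dim

embedding_params = (vocab_size * hidden          # word_embeddings
                    + max_positions * hidden     # position_embeddings
                    + token_types * hidden       # token_type_embeddings
                    + layernorm_params(hidden))  # embeddings LayerNorm

# query, key, value and the output dense layer, plus the attention LayerNorm
attention_params = 4 * linear_params(hidden, hidden) + layernorm_params(hidden)

print(f"embeddings:                  {embedding_params:,}")
print(f"attention block (one layer): {attention_params:,}")
```

The embeddings alone account for roughly 23.8 million parameters, most of them in the word embedding table, which is why the vocabulary size must match the tokenizer.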
  

## Task 1

Our goal will be to modify a base BERT model for a sentiment analysis task. Specifically, we want to predict whether a given review text has a positive (1) or negative (0) sentiment. Define a model architecture that uses the pretrained BERT model but modifies it for classifying a sequence as positive or negative.

Before proceeding, create a model object and ensure you can run forward propagation on a small example such as that defined in the second code block below. Your values may not be interpretable yet prior to fine-tuning, but you should be able to generate outputs of the correct shape.

In [10]:
import torch
import torch.nn as nn
from transformers import BertModel, BertTokenizer

class SentimentBert(nn.Module):
    def __init__(self):
        super(SentimentBert, self).__init__()
        
        # Load pre-trained BERT model (encoder-only)
        self.bert = BertModel.from_pretrained('bert-base-uncased')
        
        # Classification layer to classify the [CLS] embedding
        self.classifier = nn.Linear(self.bert.config.hidden_size, 1)
        
    def forward(self, input_ids, attention_mask):
        # Get the last hidden states from BERT
        outputs = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        
        # Use the pooled output: the final [CLS] hidden state passed through
        # BERT's pooler (a dense layer followed by tanh)
        cls_output = outputs.pooler_output  # Shape: [batch_size, hidden_size]
        
        # Pass the [CLS] token representation through the classifier
        logits = self.classifier(cls_output)  # Shape: [batch_size, 1]
        
        return logits

In [11]:
model = SentimentBert()

# Test forward propagation on a small example
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
tokenized = tokenizer(["the cow", "jumped over the moon"], padding='longest', return_tensors="pt")

# Run a forward pass with the tokenized example
input_ids = tokenized['input_ids']
attention_mask = tokenized['attention_mask']
logits = model(input_ids, attention_mask)

print("Logits shape:", logits.shape)  # Expected shape: [batch_size, 1]
print("Logits:", logits)

Logits shape: torch.Size([2, 1])
Logits: tensor([[0.2616],
        [0.1739]], grad_fn=<AddmmBackward0>)
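
The raw logits above become probabilities once a sigmoid is applied. A minimal sketch using only the standard library, with the two logit values copied from the output above (your own values will differ, since the classifier layer is randomly initialized):

```python
import math

def sigmoid(z):
    """Map a real-valued logit to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

logits = [0.2616, 0.1739]  # copied from the printed output above
probabilities = [sigmoid(z) for z in logits]
# A probability above 0.5 is read as the positive class (label 1)
predictions = [1 if p > 0.5 else 0 for p in probabilities]

for z, p, y in zip(logits, probabilities, predictions):
    print(f"logit={z:+.4f}  probability={p:.4f}  prediction={y}")
```

Both probabilities sit only slightly above 0.5, which is what one would expect from an untrained classification head.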


## Task 2

Our dataset is drawn from several thousand reviews on the Rotten Tomatoes website. Below we download and preview some of the data. Note that each element of a dataset is a dictionary with a `text` containing the review and a `label` which is `1` for a positive review or `0` for a negative review.

In [12]:
# run but you do not need to modify this code
from datasets import load_dataset
train_data = load_dataset("rotten_tomatoes", split="train")
val_data = load_dataset("rotten_tomatoes", split="validation")

print(f"Training examples: {len(train_data)}, Validation examples: {len(val_data)}")
for i in range(1, 3):
    print(train_data[i])
    print(train_data[-i])

Training examples: 8530, Validation examples: 1066
{'text': 'the gorgeously elaborate continuation of " the lord of the rings " trilogy is so huge that a column of words cannot adequately describe co-writer/director peter jackson\'s expanded vision of j . r . r . tolkien\'s middle-earth .', 'label': 1}
{'text': 'things really get weird , though not particularly scary : the movie is all portent and no content .', 'label': 0}
{'text': 'effective but too-tepid biopic', 'label': 1}
{'text': 'interminably bleak , to say nothing of boring .', 'label': 0}


As you can see, the reviews are not all the same length. It is better not to pad the entire dataset to the same length, and instead just to perform padding per batch. We will want to have `DataLoader`s for easy iteration over batches of data as tokenized tensors. 

One way to do this is to supply a `collate_fn` to the `DataLoader` constructor. This is a function that takes as input a list of elements from the dataset (called `batch`), which in our case will be a list of dictionaries containing `text` and `label` values. The function should return the batch with tokenized strings padded to the same length along with the corresponding labels.
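
To see why per-batch padding is preferable, one can count how many `[PAD]` tokens each strategy needs. The sketch below is illustrative only: `padding_tokens` is a hypothetical helper and the sequence lengths are made up.

```python
def padding_tokens(lengths, batch_size, per_batch):
    """Count the [PAD] tokens needed for a list of sequence lengths."""
    if not per_batch:
        # Pad everything to the longest sequence in the whole dataset
        longest = max(lengths)
        return sum(longest - n for n in lengths)
    total = 0
    for start in range(0, len(lengths), batch_size):
        chunk = lengths[start:start + batch_size]
        # Pad only to the longest sequence within this batch
        total += sum(max(chunk) - n for n in chunk)
    return total

lengths = [47, 16, 18, 9, 30, 12, 25, 8]  # hypothetical token counts
global_pad = padding_tokens(lengths, batch_size=4, per_batch=False)
batch_pad = padding_tokens(lengths, batch_size=4, per_batch=True)
print(f"pad whole dataset: {global_pad} padding tokens")
print(f"pad per batch:     {batch_pad} padding tokens")
```

Per-batch padding can never need more padding than padding the whole dataset, and with one unusually long review in the data the saving can be substantial.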

In [13]:
from torch.utils.data import DataLoader

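# Load the tokenizer once at module level rather than on every batch
tokenizer = BertTokenizer.from_pretrained('google-bert/bert-base-uncased')

def collate(batch):
    # batch is a list of {'text': ..., 'label': ...} dictionaries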
    texts = [item['text'] for item in batch]
    labels = [item['label'] for item in batch]

    tokenized_inputs = tokenizer(
        texts, 
        padding='longest',  
        truncation=True,    
        return_tensors="pt" 
    )
    
    # Convert labels to a tensor
    labels = torch.tensor(labels, dtype=torch.long)

    # Return the tokenized inputs and labels
    return {
        'input_ids': tokenized_inputs['input_ids'],
        'attention_mask': tokenized_inputs['attention_mask'],
        'labels': labels
    }

train_dataloader = DataLoader(train_data, batch_size=8, shuffle=True, collate_fn=collate)
val_dataloader = DataLoader(val_data, batch_size=8, shuffle=False, collate_fn=collate)

In [14]:
# check if DataLoader is as intended
for batch in train_dataloader:
    print(batch)
    break

{'input_ids': tensor([[  101,  2130,  2007,  1037,  2665, 22338,  1998,  1037,  7123,  1997,
          2543,  1011,  2417,  8457, 18395,  5266,  2010,  3244,  1010,  2174,
          1010, 11382, 23398,  3849,  2000,  2022, 20540,  1010,  2738,  2084,
          3772,  1012,  1998,  2008,  3727,  1037,  4920,  1999,  1996,  2415,
          1997,  1996,  5474,  2239,  2712,  1012,   102],
        [  101,  1996,  4795,  3268,  1997,  9216,  3337,  1005,  2202,  2006,
         29101,  5683, 16267,  2995,  1012,   102,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0],
        [  101,  2045,  1005,  1055,  2019,  8680,  2182,  1010,  2021,  2017,
          2031,  2000,  2404,  2009,  2362,  4426,  1012,   102,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,

## Task 3

Fine-tune the model on the training dataset until you achieve at least 80% accuracy on the validation dataset. You are welcome to use the [SGD](https://pytorch.org/docs/stable/generated/torch.optim.SGD.html) or [Adam](https://pytorch.org/docs/stable/generated/torch.optim.Adam.html) optimizer, whichever you prefer. As always, you may need to experiment to find a good learning rate or to decide on other optimization hyperparameters like momentum.

You should track and evaluate the training loss at least every hundred batches. Evaluate the validation loss and accuracy at least once every epoch of training. 

Note that you are working with a relatively large model and should expect a single epoch to take several minutes, even using GPU compute. This is one reason we direct you to evaluate the training loss at least every hundred batches to monitor progress. With well-chosen hyperparameters, you should only need a small number of epochs (such as 1-3) of fine-tuning; this should take minutes but not hours.

Make sure to use the `attention_mask`; otherwise the BERT model will be encoding unnecessary `[PAD]` tokens at the ends of sequences within a batch.
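
The effect of the mask on attention can be sketched in plain Python. This is a simplified illustration, not BERT's actual code (real implementations typically add a large negative bias to the masked scores rather than using infinity); `masked_softmax` is a hypothetical helper, the scores are made up, and at least one unmasked position is assumed.

```python
import math

def masked_softmax(scores, attention_mask):
    """Softmax over `scores`, giving zero weight wherever the mask is 0."""
    masked = [s if m == 1 else float("-inf") for s, m in zip(scores, attention_mask)]
    peak = max(masked)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in masked]
    total = sum(exps)
    return [e / total for e in exps]

scores = [2.0, 1.0, 0.5, 3.0, 3.0]  # hypothetical attention scores
mask = [1, 1, 1, 0, 0]              # the last two positions are [PAD]
weights = masked_softmax(scores, mask)
print([round(w, 4) for w in weights])
```

Even though the padded positions have the largest raw scores here, they receive zero attention weight, so the real tokens' representations are unaffected by padding.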

In [15]:
import torch
import torch.nn as nn
from torch.optim import AdamW  # transformers' AdamW is deprecated in favor of torch.optim.AdamW
from tqdm import tqdm

# Initialize the model, move to GPU if available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = SentimentBert().to(device)

# Loss function and optimizer
criterion = nn.BCEWithLogitsLoss()  # Binary cross-entropy with logits loss for binary classification
optimizer = AdamW(model.parameters(), lr=2e-5)  # A good starting learning rate for BERT fine-tuning

def train(model, dataloader, optimizer, criterion, device, log_interval=100):
    model.train()
    total_loss = 0

    for batch_idx, batch in enumerate(tqdm(dataloader)):
        # Move data to the device
        input_ids = batch['input_ids'].to(device)
        attention_mask = batch['attention_mask'].to(device)
        labels = batch['labels'].float().to(device).unsqueeze(1)  # Reshape for BCEWithLogitsLoss

        optimizer.zero_grad()

        outputs = model(input_ids=input_ids, attention_mask=attention_mask)
        loss = criterion(outputs, labels)

        loss.backward()
        optimizer.step()

        total_loss += loss.item()

        if (batch_idx + 1) % log_interval == 0:
            avg_loss = total_loss / log_interval
            print(f"Batch {batch_idx + 1}/{len(dataloader)}, Average Training Loss: {avg_loss:.4f}")
            total_loss = 0

def evaluate(model, dataloader, criterion, device):
    model.eval()
    val_loss = 0
    correct = 0
    total = 0

    with torch.no_grad():
        for batch in dataloader:
            # Move data to the device
            input_ids = batch['input_ids'].to(device)
            attention_mask = batch['attention_mask'].to(device)
            labels = batch['labels'].float().to(device).unsqueeze(1)

            # Forward pass
            outputs = model(input_ids=input_ids, attention_mask=attention_mask)
            loss = criterion(outputs, labels)
            val_loss += loss.item()

            # Compute predictions and accuracy
            predictions = torch.round(torch.sigmoid(outputs))  # Apply sigmoid and round for binary predictions
            correct += (predictions == labels).sum().item()
            total += labels.size(0)

    avg_val_loss = val_loss / len(dataloader)
    accuracy = correct / total * 100
    return avg_val_loss, accuracy

num_epochs = 3
best_val_accuracy = 0

for epoch in range(num_epochs):
    print(f"\nEpoch {epoch + 1}/{num_epochs}")

    train(model, train_dataloader, optimizer, criterion, device)

    val_loss, val_accuracy = evaluate(model, val_dataloader, criterion, device)
    print(f"Validation Loss: {val_loss:.4f}, Validation Accuracy: {val_accuracy:.2f}%")

    if val_accuracy > best_val_accuracy:
        best_val_accuracy = val_accuracy
        torch.save(model.state_dict(), "best_sentiment_bert_model.pth")
        print("Model saved!")

    if val_accuracy >= 80.0:
        print("Target validation accuracy reached. Stopping training.")
        break



Epoch 1/3


  9%|▉         | 101/1067 [00:18<02:49,  5.71it/s]

Batch 100/1067, Average Training Loss: 0.5864


 19%|█▉        | 201/1067 [00:37<02:26,  5.91it/s]

Batch 200/1067, Average Training Loss: 0.4316


 28%|██▊       | 301/1067 [00:55<02:14,  5.68it/s]

Batch 300/1067, Average Training Loss: 0.4097


 38%|███▊      | 401/1067 [01:13<01:59,  5.57it/s]

Batch 400/1067, Average Training Loss: 0.3877


 47%|████▋     | 501/1067 [01:31<01:40,  5.66it/s]

Batch 500/1067, Average Training Loss: 0.3791


 56%|█████▋    | 601/1067 [01:49<01:17,  5.98it/s]

Batch 600/1067, Average Training Loss: 0.3387


 66%|██████▌   | 701/1067 [02:07<01:01,  5.99it/s]

Batch 700/1067, Average Training Loss: 0.3110


 75%|███████▌  | 801/1067 [02:24<00:46,  5.71it/s]

Batch 800/1067, Average Training Loss: 0.3278


 84%|████████▍ | 901/1067 [02:41<00:27,  5.95it/s]

Batch 900/1067, Average Training Loss: 0.3357


 94%|█████████▍| 1001/1067 [02:59<00:11,  5.70it/s]

Batch 1000/1067, Average Training Loss: 0.3603


100%|██████████| 1067/1067 [03:11<00:00,  5.56it/s]


Validation Loss: 0.3461, Validation Accuracy: 85.83%
Model saved!
Target validation accuracy reached. Stopping training.
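
The training cell above uses `nn.BCEWithLogitsLoss`. For a single example, the quantity it computes can be sketched with the standard library as follows. This is an illustrative re-derivation, not PyTorch's source; both function names are hypothetical.

```python
import math

def bce_with_logits(logit, label):
    """Binary cross-entropy of sigmoid(logit) against a 0/1 label, in a stable form."""
    return max(logit, 0.0) - logit * label + math.log1p(math.exp(-abs(logit)))

def bce_naive(logit, label):
    """The textbook form: -[y*log(p) + (1-y)*log(1-p)] with p = sigmoid(logit)."""
    p = 1.0 / (1.0 + math.exp(-logit))
    return -(label * math.log(p) + (1 - label) * math.log(1 - p))

for logit, label in [(0.0, 1), (0.2616, 1), (0.2616, 0), (4.0, 1), (4.0, 0)]:
    print(f"logit={logit:+.4f} label={label} "
          f"stable={bce_with_logits(logit, label):.4f} naive={bce_naive(logit, label):.4f}")
```

A logit of zero gives a loss of log 2 ≈ 0.693 for either label, so the first logged training loss of 0.5864 already indicates the model is doing better than outputting a probability of 0.5 for every review.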


## Task 4

Finally, retrieve five examples (your choice) from the validation dataset for which your fine-tuned model made incorrect predictions. Interpret the results on these five examples. Do you think the model is clearly incorrect or is there any ambiguity in whether the reviews are positive or negative?

In [16]:
model = SentimentBert().to(device)
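# map_location lets the checkpoint load even if it was saved on a different device
model.load_state_dict(torch.load("best_sentiment_bert_model.pth", map_location=device))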
model.eval()

# Retrieve five misclassified examples from the validation dataset
misclassified_examples = []
with torch.no_grad():
    for batch in val_dataloader:

        input_ids = batch['input_ids'].to(device)
        attention_mask = batch['attention_mask'].to(device)
        labels = batch['labels'].to(device)

        # Get predictions
        outputs = model(input_ids=input_ids, attention_mask=attention_mask)
        predictions = torch.round(torch.sigmoid(outputs))

        # Identify misclassified samples
        for i in range(len(labels)):
            if predictions[i] != labels[i] and len(misclassified_examples) < 5:
                misclassified_examples.append({
                    "text": tokenizer.decode(batch['input_ids'][i], skip_special_tokens=True),
                    "prediction": predictions[i].item(),
                    "actual": labels[i].item()
                })
        
        if len(misclassified_examples) >= 5:
            break

# Display the misclassified examples and analyze results
for idx, example in enumerate(misclassified_examples):
    print(f"\nExample {idx + 1}:")
    print(f"Review Text: {example['text']}")
    print(f"Predicted Sentiment: {'Positive' if example['prediction'] == 1 else 'Negative'}")
    print(f"Actual Sentiment: {'Positive' if example['actual'] == 1 else 'Negative'}")



Example 1:
Review Text: made for teens and reviewed as such, this is recommended only for those under 20 years of age... and then only as a very mild rental.
Predicted Sentiment: Negative
Actual Sentiment: Positive

Example 2:
Review Text: those moviegoers who would automatically bypass a hip - hop documentary should give " scratch " a second look.
Predicted Sentiment: Negative
Actual Sentiment: Positive

Example 3:
Review Text: there's absolutely no reason why blue crush, a late - summer surfer girl entry, should be as entertaining as it is
Predicted Sentiment: Negative
Actual Sentiment: Positive

Example 4:
Review Text: the events of the film are just so weird that i honestly never knew what the hell was coming next.
Predicted Sentiment: Negative
Actual Sentiment: Positive

Example 5:
Review Text: mark pellington's latest pop thriller is as kooky and overeager as it is spooky and subtly in love with myth.
Predicted Sentiment: Negative
Actual Sentiment: Positive


The SentimentBert model's misclassifications mostly involve subtle or ambiguous language, where it struggles to recognize nuanced positivity or near-neutral sentiment. All five examples are positive reviews that the model labeled negative. Because the prediction is obtained by rounding the sigmoid output, any review whose predicted probability falls below 0.5, even narrowly, is labeled negative. In examples with mixed expressions, like “no reason why... should be as entertaining as it is,” the model appears to latch onto phrases commonly associated with negativity and fails to identify the underlying positive tone. Similarly, phrases like “kooky and overeager” or “weird” can imply affection or intrigue in the context of film reviews, but the model interprets them as negative. These errors show the model's limitations in handling subjective language; it could be improved with additional training on data containing nuanced expressions, neutral sentiments, or sarcasm.
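
The direction of the errors can be tallied with a few lines of Python. The (predicted, actual) pairs below are copied from the five examples above; `error_type` is a hypothetical helper. Because these are simply the first five misclassifications encountered rather than a random sample, the tally should not be read as evidence about the model's errors overall.

```python
from collections import Counter

# (predicted, actual) pairs for Examples 1-5: 0 = negative, 1 = positive
errors = [(0, 1), (0, 1), (0, 1), (0, 1), (0, 1)]

def error_type(predicted, actual):
    """Name the outcome of a single binary prediction."""
    if predicted == 1 and actual == 0:
        return "false positive"
    if predicted == 0 and actual == 1:
        return "false negative"
    return "correct"

tally = Counter(error_type(p, a) for p, a in errors)
print(dict(tally))
```

Extending the collection loop to gather every misclassified validation example, rather than stopping at five, would show whether false negatives really dominate.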