# Part 1: Encoder - Sentiment Analysis with BERT

In this part you will fine-tune a pre-trained encoder-only language model called BERT (originally trained and released by Google in 2018) for a sentiment analysis task. BERT is bidirectional in the sense that it was trained to predict a masked word in the middle of a sequence using both the previous and subsequent tokens. For example, BERT was trained on tasks like predicting the masked token in `The sweet black cat [MASK] by the window in the sun.` considering both the preceding tokens `The sweet black cat` **and** the subsequent tokens `by the window in the sun.` 

This kind of model is not used for autoregressively generating new text, but is very useful when you want to understand an entire sequence of text as a whole, allowing attention to earlier or later tokens in a sequence. Sentiment analysis, wherein we want to classify an entire input sequence as either positive or negative in sentiment (for example, in this task we classify movie reviews as either positive or negative), is a good example where this kind of understanding is important. 

You may recall that we similarly classified pieces of text as toxic or not using Logistic Regression in a previous homework assignment. In that assignment we took a much simpler approach, representing each piece of text as a Bag-of-Words: a vector counting how many times certain words appear in the text, completely disregarding the order of the words. In this assignment, we take a more modern approach to text analysis using a Transformer language model.
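To make the limitation of Bag-of-Words concrete, here is a minimal sketch in plain Python. The helper `bag_of_words`, the toy vocabulary, and the two reviews are hypothetical and chosen only for illustration; they are not taken from the earlier assignment.

```python
from collections import Counter

def bag_of_words(text, vocabulary):
    """Return a count vector for `text` over a fixed vocabulary (word order is discarded)."""
    counts = Counter(text.lower().split())
    return [counts[word] for word in vocabulary]

# Hypothetical toy vocabulary and reviews, for illustration only.
vocabulary = ["good", "bad", "not", "movie"]
review_a = "not good , bad movie"
review_b = "not bad , good movie"

vector_a = bag_of_words(review_a, vocabulary)
vector_b = bag_of_words(review_b, vocabulary)

# Both reviews contain the same words, so their count vectors are identical
# even though their sentiment differs.
print(vector_a, vector_b, vector_a == vector_b)
```

Because the two vectors are identical, any classifier built on these counts must assign both reviews the same prediction. A Transformer sees the tokens in order and can, in principle, tell them apart.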

In this part we will directly modify the `PyTorch` model and will conduct the fine-tuning directly in `PyTorch`. Recall that fine-tuning is an example of transfer learning where we update all model parameters. You similarly conducted a fine-tuning of a pretrained ResNet convolutional neural network in a previous assignment.

**Learning objectives.** You will:
1. Examine an encoder-only pretrained BERT transformer model
2. Use BERT embeddings to identify semantically similar and dissimilar text
3. Modify a BERT model for sentiment analysis
4. Fine-tune the model on movie review data for sentiment analysis

While it is possible to complete this assignment using CPU compute, it may be slow. To accelerate your training, consider using GPU resources such as `CUDA` through the CS department cluster. Alternatives include Google Colab or local GPU resources for those running on machines with GPU support.

First, ensure that you have the `transformers` and `datasets` modules installed. We will use these modules for importing tokenizers, pretrained models, and datasets. You can run the following cells to try to install them with `pip` if needed. If you are using ondemand, ideally you would simply include `module load transformers` and `module load datasets` when making your initial reservation.

In [3]:
# !pip install transformers

In [4]:
# !pip install datasets

Now the following code imports a *tokenizer* and demonstrates its use. 

Note how the sequence of words in the input string is replaced with a sequence of numbers in the `input_ids`: These are indices into the vocabulary of size 30,522 used by the tokenizer and BERT model. Also note the `special_tokens`: an `[UNK]` is used for anything not in the vocabulary, and a `[PAD]` can be useful for padding out a sequence of tokens to a specified length.

Given a sequence of strings, the tokenizer returns a dictionary containing not just the `input_ids` (what you will most often want to use) but also `token_type_ids` (which you should not need to use) and `attention_mask`. 

The `attention_mask` has the same dimensions as the `input_ids` with a `1` in a given position if there is a non-padding token in that position and a `0` if that position is just a padding token. This is helpful when you are tokenizing a batch of multiple strings with potentially different lengths but want to create a single tensor. `padding='longest'` as shown pads all of the input to the same number of tokens as the longest input by adding `[PAD]` tokens to the end. The `attention_mask` is then passed so that you can ignore the extraneous padding tokens as needed.
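A minimal sketch of what `padding='longest'` and the `attention_mask` amount to, written in plain Python rather than with the real tokenizer. The helper name `pad_longest` is hypothetical; the token ids are the ones the tokenizer prints for the two example strings in the next code cell, and `0` is the id of `[PAD]` in that same output.

```python
PAD_ID = 0  # id of the [PAD] token in the bert-base-uncased vocabulary

def pad_longest(sequences, pad_id=PAD_ID):
    """Right-pad every sequence to the longest length and build the matching mask."""
    max_len = max(len(seq) for seq in sequences)
    input_ids = [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]
    # 1 marks a real token, 0 marks a padding position.
    attention_mask = [[1] * len(seq) + [0] * (max_len - len(seq)) for seq in sequences]
    return input_ids, attention_mask

# Token ids for "the cow" and "jumped over the moon", including [CLS]=101 and [SEP]=102.
sequences = [[101, 1996, 11190, 102], [101, 5598, 2058, 1996, 4231, 102]]
input_ids, attention_mask = pad_longest(sequences)

for ids, mask in zip(input_ids, attention_mask):
    print(ids, mask)
```

The shorter sequence gains two trailing `0` ids, and its mask has `0` in exactly those two positions.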

Also note the `return_tensors` parameter. Using `"pt"` as shown indicates that the results should be returned as PyTorch tensors. If you omit this parameter then the results will be returned as Python lists by default.

In [6]:
# run but you do not need to modify this code
from transformers import BertTokenizer
tokenizer = BertTokenizer.from_pretrained('google-bert/bert-base-uncased',  
                                          clean_up_tokenization_spaces=True)
print(tokenizer)
tokenized = tokenizer(["the cow", "jumped over the moon"], padding='longest', return_tensors="pt")
print(tokenized)

BertTokenizer(name_or_path='google-bert/bert-base-uncased', vocab_size=30522, model_max_length=512, is_fast=False, padding_side='right', truncation_side='right', special_tokens={'unk_token': '[UNK]', 'sep_token': '[SEP]', 'pad_token': '[PAD]', 'cls_token': '[CLS]', 'mask_token': '[MASK]'}, clean_up_tokenization_spaces=True, added_tokens_decoder={
	0: AddedToken("[PAD]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	100: AddedToken("[UNK]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	101: AddedToken("[CLS]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	102: AddedToken("[SEP]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
	103: AddedToken("[MASK]", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
}
)
{'input_ids': tensor([[  101,  1996, 11190,   102,     0,     0],
        [  101,  5598,  2058,  1996,  4231,   102]])

The tokenizer also has a `decode` method by which you can translate `input_ids` back into strings. You can optionally set `skip_special_tokens=True` if you want to ignore the special tokens like padding, unknown, etc.
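As a rough illustration of what decoding does, the sketch below maps ids back to tokens with a tiny lookup table. The table `ID_TO_TOKEN` and the function `toy_decode` are hypothetical; the table contains only the ids that appear in the example batch above. The real tokenizer uses the full 30,522-entry vocabulary and also merges WordPiece sub-word pieces, which this sketch does not attempt.

```python
# Toy id-to-token table covering only the ids seen in the example batch.
ID_TO_TOKEN = {
    0: "[PAD]", 101: "[CLS]", 102: "[SEP]",
    1996: "the", 11190: "cow", 5598: "jumped", 2058: "over", 4231: "moon",
}
SPECIAL_TOKENS = {"[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"}

def toy_decode(ids, skip_special_tokens=False):
    """Translate token ids back into a space-separated string."""
    tokens = [ID_TO_TOKEN.get(i, "[UNK]") for i in ids]
    if skip_special_tokens:
        tokens = [t for t in tokens if t not in SPECIAL_TOKENS]
    return " ".join(tokens)

padded_ids = [101, 1996, 11190, 102, 0, 0]
print(toy_decode(padded_ids))
print(toy_decode(padded_ids, skip_special_tokens=True))
```

Without `skip_special_tokens=True` the `[CLS]`, `[SEP]`, and `[PAD]` markers remain in the decoded string, which is occasionally useful for debugging how a batch was padded.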

In [8]:
# run but you do not need to modify this code
for tokens in tokenized["input_ids"]:
    print(tokenizer.decode(tokens, skip_special_tokens=True))

the cow
jumped over the moon


Now we import our language model, in this case a pretrained BERT model. This is an encoder-only transformer architecture previewed below. As you can see, the embedding expects a vocabulary of size 30,522 matching the tokenizer. The model embedding dimension is 768 and the output dimension is also 768.

In [10]:
# run but you do not need to modify this code
import torch
from torch import nn
from transformers import BertModel
pretrained_model = BertModel.from_pretrained("google-bert/bert-base-uncased")
print(pretrained_model)


BertModel(
  (embeddings): BertEmbeddings(
    (word_embeddings): Embedding(30522, 768, padding_idx=0)
    (position_embeddings): Embedding(512, 768)
    (token_type_embeddings): Embedding(2, 768)
    (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
    (dropout): Dropout(p=0.1, inplace=False)
  )
  (encoder): BertEncoder(
    (layer): ModuleList(
      (0-11): 12 x BertLayer(
        (attention): BertAttention(
          (self): BertSdpaSelfAttention(
            (query): Linear(in_features=768, out_features=768, bias=True)
            (key): Linear(in_features=768, out_features=768, bias=True)
            (value): Linear(in_features=768, out_features=768, bias=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
          (output): BertSelfOutput(
            (dense): Linear(in_features=768, out_features=768, bias=True)
            (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
            (dropout): Dropout(p=0.1, inplace=False

Our dataset is drawn from several thousand reviews on the Rotten Tomatoes website. Below we download and preview some of the data. Note that each element of the dataset is a dictionary with a `text` field containing the review and a `label` field that is `1` for a positive review or `0` for a negative review.

In [12]:
# run but you do not need to modify this code
from datasets import load_dataset
from torch.utils.data import random_split
import random

# Download the data, then create a 50/25/25 split into train/val/test
data = load_dataset("rotten_tomatoes", split="train")

# Shuffle data then create random splits for train, val, test
random.seed(372)  # For reproducibility
indices = list(range(len(data)))
random.shuffle(indices)

# We'll just use the first 1,000 data points for simplicity
train_data = [data[i] for i in indices[:500]]
val_data = [data[i] for i in indices[500:750]]
test_data = [data[i] for i in indices[750:1000]]

print(f"Split sizes - Train: {len(train_data)}, Val: {len(val_data)}, Test: {len(test_data)}")

for i in range(5):
    print(train_data[i])

Split sizes - Train: 500, Val: 250, Test: 250
{'text': 'like a pack of dynamite sticks , built for controversy . the film is explosive , but a few of those sticks are wet .', 'label': 0}
{'text': 'effective in all its aspects , margarita happy hour represents an auspicious feature debut for chaiken .', 'label': 1}
{'text': 'this piece of channel 5 grade trash is , quite frankly , an insult to the intelligence of the true genre enthusiast .', 'label': 0}
{'text': 'see clockstoppers if you have nothing better to do with 94 minutes . but be warned , you too may feel time has decided to stand still . or that the battery on your watch has died .', 'label': 0}
{'text': 'it tries too hard , and overreaches the logic of its own world .', 'label': 0}


## Task 1

Before modifying BERT, let's explore what it already knows about text similarity. BERT embeddings can capture semantic relationships between texts, even without task-specific fine-tuning. 

For example, texts with similar meanings should have similar BERT embeddings, whereas texts whose BERT embeddings are very different should also read as having very different meanings.

A BERT embedding is just a numerical vector of size 768 and is the final output of the pretrained model. In this task, you will explore similar and dissimilar embeddings using the provided functions to find reviews that are semantically similar and dissimilar to a given example.

In [14]:
# Run but DO NOT MODIFY this code

def get_bert_embedding(text, model, tokenizer, device):
    """Get BERT [CLS] embedding for a single text."""
    model.eval()
    inputs = tokenizer(text, return_tensors='pt', padding=True, truncation=True, max_length=512)
    inputs = {k: v.to(device) for k, v in inputs.items()}
    
    with torch.no_grad():
        outputs = model(**inputs)
        # Extract [CLS] token embedding (first token)
        embedding = outputs.last_hidden_state[:, 0, :].squeeze()
    
    return embedding.cpu()

def compute_similarity(embedding1, embedding2):
    """Compute cosine similarity between two embeddings."""
    from torch.nn.functional import cosine_similarity
    return cosine_similarity(embedding1.unsqueeze(0), embedding2.unsqueeze(0)).item()

# Setup for embedding exploration
#device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
device = torch.device('cuda' if torch.cuda.is_available() else ('mps' if torch.backends.mps.is_available() else 'cpu'))
pretrained_model.to(device)
print(f"Using: {device}")

Using: mps


Follow the TODO described in the code to identify similar and dissimilar reviews to a reference in the training data.
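The `compute_similarity` helper above relies on cosine similarity, which compares the direction of two vectors and ignores their length. The sketch below spells out the formula in plain Python on small hypothetical vectors; it omits the small epsilon that the PyTorch function uses to guard against division by zero.

```python
import math

def cosine_similarity(u, v):
    """cos(u, v) = (u . v) / (||u|| * ||v||), ranging from -1 to 1."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical low-dimensional vectors standing in for 768-dimensional embeddings.
same_direction = cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])  # parallel vectors
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])                # perpendicular vectors
opposite = cosine_similarity([1.0, 2.0], [-1.0, -2.0])                # opposite directions

print(same_direction, orthogonal, opposite)
```

Scaling a vector does not change its cosine similarity with another vector, so the score reflects orientation in embedding space rather than magnitude.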

In [16]:
reference_text = train_data[0]['text']  
print(f"Reference review: {reference_text}")

reference_embedding = get_bert_embedding(reference_text, pretrained_model, tokenizer, device)

# TODO: Use pretrained_model and the functions above to:
# 1. Find the most similar review in the training data to the reference (highest cosine similarity)
# 2. Find the most dissimilar review in the training data to the reference (lowest cosine similarity)
# 3. Print the most similar review and most dissimilar review along with their similarity scores
#
# Hint: You'll need to loop through the dataset and compute similarities of the embeddings
# Note: Start the loop from index 1 to exclude the reference review itself
similarities = []
most_similar_index = -1
most_similar_val = float("-inf")
least_similar_index = -1
least_similar_val = float("inf")
for index in range(1, len(train_data)):
    text = train_data[index]['text']
    embedding = get_bert_embedding(text, pretrained_model, tokenizer, device)
    similarity = compute_similarity(embedding, reference_embedding)
    similarities.append(similarity)
    if similarity > most_similar_val:
        most_similar_index = index
        most_similar_val = similarity
    if similarity < least_similar_val:
        least_similar_index = index
        least_similar_val = similarity
print()
print("Most similar at index", most_similar_index, "with a similarity score of", most_similar_val)
print("Most similar review: ", train_data[most_similar_index]['text'])
print()
print("Most dissimilar at index", least_similar_index, "with a similarity score of", least_similar_val)
print("Most dissimilar review: ", train_data[least_similar_index]['text'])

Reference review: like a pack of dynamite sticks , built for controversy . the film is explosive , but a few of those sticks are wet .

Most similar at index 441 with a similarity score of 0.9287109971046448
Most similar review:  the gags , and the script , are a mixed bag .

Most dissimilar at index 431 with a similarity score of 0.6546767950057983
Most dissimilar review:  almost everyone growing up believes their family must look like " the addams family " to everyone looking in . . . " my big fat greek wedding " comes from the heart . . .


## Task 2

Our goal will be to modify a base BERT model for a sentiment analysis task. Specifically, we want to predict whether a given review text has a positive (1) or negative (0) sentiment. Define a model architecture that uses the pretrained BERT model but modifies it for classifying a sequence as positive or negative.

For binary classification, you have two valid architectural choices:

1. **Single output unit**: Outputs one score per input. Probability estimates would come from applying the sigmoid function, and classifications come from thresholding. This resembles binary logistic regression.
2. **Two output units**: Outputs two scores per input, one for the positive and one for the negative class. Probability estimates come from softmax and classification from choosing the highest score/probability class. This resembles the general approach for multiclass classification.

Both approaches work well and you can choose whichever you prefer. We slightly recommend the latter, two-output-unit approach simply because it is more common in practice and consistent with classification architectures beyond binary.
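The two choices are closely related: for two classes, the softmax probability of the positive class equals the sigmoid of the difference between the two scores. The sketch below checks this numerically in plain Python. The scores are hypothetical, and the convention that index 0 is negative and index 1 is positive follows the dataset labels.

```python
import math

def sigmoid(z):
    """Logistic function, mapping a single score to a probability."""
    return 1.0 / (1.0 + math.exp(-z))

def softmax(scores):
    """Softmax over a list of scores, shifted by the max for numerical stability."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical scores for one review: index 0 = negative, index 1 = positive.
negative_score, positive_score = -0.4, 1.1

p_positive_softmax = softmax([negative_score, positive_score])[1]
p_positive_sigmoid = sigmoid(positive_score - negative_score)

print(p_positive_softmax, p_positive_sigmoid)
```

In other words, a two-unit head carries one redundant degree of freedom compared with a single-unit head, which is why either architecture can represent the same decision rule.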

In [18]:
# TODO: Implement the SentimentBert class

class SentimentBert(nn.Module):
    def __init__(self, pretrained_model_name="google-bert/bert-base-uncased"):
        """
        Initialize the sentiment classification model.
        
        Architecture notes:
        - BERT outputs 768-dimensional representations
        - The [CLS] token (first token) is designed for classification tasks
        - You'll need a linear layer to map from BERT's output to your classification head
        """
        super(SentimentBert, self).__init__()
        self.bert = BertModel.from_pretrained(pretrained_model_name)
        
        #TODO: add a final classifier layer
        self.classifier = nn.Linear(768, 2)
        
    def forward(self, input_ids, attention_mask=None):
        """
        Forward pass through the model.
        
        Key points:
        - BERT expects both input_ids and attention_mask
        - Use the [CLS] token representation for sequence classification
        - The [CLS] token is always the first token in the sequence
        
        Args:
            input_ids: Token indices (batch_size, seq_len)
            attention_mask: Mask for padding tokens (batch_size, seq_len)
            
        Returns:
            logits: Classification scores (batch_size, num_classes)
                   where num_classes is 1 or 2 depending on your choice
        """
        outputs = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        
        # Extract [CLS] token representation (first token of last hidden state)
        pooled_output = outputs.last_hidden_state[:, 0, :]  # Shape: (batch_size, 768)
        
        # TODO: Finish forward to classify the BERT embedding stored in pooled_output
        # Hint: Apply your classifier layer to pooled_output and return the result
        logits = self.classifier(pooled_output)
        return logits

Before proceeding, create a model object and ensure you can run forward propagation on a small example such as that defined in the second code block below. Your values may not be reasonable yet prior to fine-tuning, but you should be able to generate outputs of the correct shape.

In [20]:
# Run this code. You do not need to make modifications, but 
# confirm that you do not have any errors and that the output
# shape matches your architectural specifications
torch.manual_seed(2025)

model = SentimentBert()

# Example batch with a batch size of 2
tokenized = tokenizer(["the cow", "jumped over the moon"], padding='longest', return_tensors="pt")

# Test forward pass
with torch.no_grad():
    outputs = model(tokenized['input_ids'], tokenized['attention_mask'])
    print(f"Output shape: {outputs.shape}")
    print(f"Sample outputs: {outputs}")

Output shape: torch.Size([2, 2])
Sample outputs: tensor([[-0.0032, -0.1649],
        [ 0.0317, -0.1576]])


## Task 3

Now we are ready for fine-tuning. However, as you can see, the reviews are not all the same length. It is better NOT to pad the entire dataset to the same length, and instead just to perform padding per batch. We will want to have `DataLoader`s for easy iteration over batches of data as tokenized tensors. 

One way to do this is to supply a `collate_fn` to the `DataLoader` constructor. This is a function that takes as input a list of elements from the dataset (called `batch`), which in our case will be a list of dictionaries containing `text` and `label` values. The function should return the batch with tokenized strings padded to the same length along with the corresponding labels. An implementation is provided for you.
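To see why per-batch padding is preferable, the sketch below counts how many `[PAD]` tokens each strategy would add. The function `padding_needed` and the list of token counts are hypothetical; real savings depend on the actual review lengths and on how the batches happen to be shuffled.

```python
def padding_needed(lengths, batch_size):
    """Count [PAD] tokens added by dataset-wide padding vs. per-batch padding."""
    global_max = max(lengths)
    global_padding = sum(global_max - n for n in lengths)

    per_batch_padding = 0
    for start in range(0, len(lengths), batch_size):
        batch = lengths[start:start + batch_size]
        # Each batch is padded only up to its own longest sequence.
        per_batch_padding += sum(max(batch) - n for n in batch)
    return global_padding, per_batch_padding

# Hypothetical token counts for six reviews, processed in batches of two.
lengths = [4, 6, 53, 39, 16, 8]
global_padding, per_batch_padding = padding_needed(lengths, batch_size=2)

print(f"dataset-wide padding: {global_padding} tokens")
print(f"per-batch padding:    {per_batch_padding} tokens")
```

Every padding token still passes through all twelve encoder layers, so fewer padding tokens translates directly into less wasted computation per epoch.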

In [23]:
# Run but DO NOT MODIFY this code

from torch.utils.data import DataLoader

def collate(batch):
    tokenizer = BertTokenizer.from_pretrained('google-bert/bert-base-uncased')
    
    # Extract texts and labels from the batch items
    texts = [item['text'] for item in batch]
    labels = [item['label'] for item in batch]
    
    labels = torch.tensor(labels)
    encoded = tokenizer(texts, padding='longest', return_tensors='pt')
    return {
        'input_ids': encoded['input_ids'],
        'attention_mask': encoded['attention_mask'],
        'labels': labels
    }

train_dataloader = DataLoader(train_data, batch_size=8, shuffle=True, collate_fn=collate)
val_dataloader = DataLoader(val_data, batch_size=8, shuffle=False, collate_fn=collate)

# check if DataLoader is as intended
for batch in train_dataloader:
    print(batch)
    break

{'input_ids': tensor([[  101, 11472, 22327,  5937,  4107,  2019,  8387,  4942, 18209,  1010,
          8301,  5244,  2058,  1996, 22213,  1997,  5637,  3348,  1010,  1998,
          7534,  2664,  2178,  5458,  2214,  4432,  1997,  1996,  5637,  2451,
          2004,  2019,  2035,  1011, 18678,  2088,  2073,  2039, 26143,  1010,
          2690,  2465,  8501,  2015,  2066, 24272,  2064,  2514,  2204,  2055,
          3209,  1012,   102],
        [  101,  1996,  8487,  9257,  1011,  7168,  3744, 17738,  1999, 25608,
          2229, 15346,  2007,  1037,  8150,  1997,  2757,  9739,  4658,  1010,
         24639,  8562,  1998,  2074,  1996,  5468,  1997, 24605,  3223,  2000,
          2507,  2023,  5021, 23667, 14081,  2070,  2540,  1012,   102,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0],
        [  101,  2205,  4030,  2005,  1037,  3920,  4306,  1010,  2205,  8467,
          2005,  2019,  3080,  2028,  1012,   102,     

**Fine-tune on the training dataset until you achieve at least 80% accuracy on the validation dataset. Print training loss and validation accuracy per epoch.**


**Training Implementation Notes:**

1. Remember this is a binary classification task. You should train on the cross entropy loss, either the [binary cross entropy loss](https://docs.pytorch.org/docs/stable/generated/torch.nn.BCEWithLogitsLoss.html#bcewithlogitsloss) or the general [cross entropy loss](https://docs.pytorch.org/docs/stable/generated/torch.nn.CrossEntropyLoss.html) depending on your architectural choice in Task 2.

2. Use the [SGD](https://pytorch.org/docs/stable/generated/torch.optim.SGD.html) or [Adam](https://pytorch.org/docs/stable/generated/torch.optim.Adam.html) optimizer, whichever you prefer (Adam is most common for Transformers). As always, you may need to experiment to find a good learning rate or to decide on other optimization hyperparameters like momentum.

3. Recall that in fine-tuning we generally optimize all model parameters with a small learning rate. It is recommended that you start experimenting with learning rates between 0.00005 and 0.000005.

4. Make sure to use both `input_ids` and `attention_mask` in your forward pass. If you omit the `attention_mask`, the BERT model will attend to the uninformative `[PAD]` tokens at the ends of shorter sequences within a batch.

5. Remember to move all tensors to the same device (GPU if available).

Your training should include:
  - A training loop that sets the model to training mode, iterates through batches, and updates weights
  - A validation loop that evaluates accuracy without updating weights once per epoch (remember you can use `torch.no_grad()`)
  - Tracking the training loss and validation accuracy for each epoch. **Print both after each epoch**
  - You can stop training once you reach at least 80% validation accuracy (or better).

Note that because you are working with a relatively large model, a single epoch of training may take one or several minutes, even using GPU compute. However, the model is extensively pre-trained for natural language understanding. With well-chosen hyperparameters, you should only need a relatively small number of epochs of fine-tuning (likely no more than 2-10); this should take minutes but not hours.
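To make the loss in note 1 and the accuracy metric concrete, here is a plain-Python sketch of what the general cross entropy loss (with its default mean reduction) and an argmax-based accuracy compute for the two-output architecture. The function names, logits, and labels are hypothetical; in your training loop you would use the PyTorch loss on tensors instead.

```python
import math

def cross_entropy(logits, label):
    """Negative log-probability of the true class under softmax(logits)."""
    top = max(logits)
    log_sum_exp = top + math.log(sum(math.exp(z - top) for z in logits))
    return log_sum_exp - logits[label]

def batch_loss_and_accuracy(batch_logits, labels):
    """Mean cross entropy and fraction of correct argmax predictions for one batch."""
    losses = [cross_entropy(z, y) for z, y in zip(batch_logits, labels)]
    predictions = [max(range(len(z)), key=lambda k: z[k]) for z in batch_logits]
    accuracy = sum(p == y for p, y in zip(predictions, labels)) / len(labels)
    return sum(losses) / len(losses), accuracy

# Hypothetical logits (negative score, positive score) for three reviews.
batch_logits = [[2.0, -1.0], [-0.3, 0.8], [-0.5, 1.5]]
labels = [0, 1, 0]

mean_loss, accuracy = batch_loss_and_accuracy(batch_logits, labels)
print(f"mean loss: {mean_loss:.4f}, accuracy: {accuracy:.4f}")
```

A useful reference point: a model that gives both classes equal scores incurs a loss of log 2 (about 0.693) per example, so a training loss well above that value suggests something is wrong with the setup rather than merely undertrained.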

In [25]:
# TODO: fine-tune / train the modified BERT model for sentiment analysis
torch.manual_seed(2025)
fine_tuned_model = SentimentBert().to(device)
CEL = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(fine_tuned_model.parameters(), lr = 0.00002)
max_epochs = 10
best_accuracy = 0.0
for epoch in range(1, max_epochs + 1):
    fine_tuned_model.train()
    sum_loss = 0.0
    n_train = 0
    for batch in train_dataloader:
        input_ids = batch["input_ids"].to(device)
        attention_mask = batch["attention_mask"].to(device)
        labels = batch["labels"].to(device)
        optimizer.zero_grad()
        logits = fine_tuned_model(input_ids, attention_mask)
        loss = CEL(logits, labels)
        loss.backward()
        optimizer.step()
        sum_loss += loss.item() * labels.size(0)
        n_train += labels.size(0)
    training_loss = sum_loss / n_train
    fine_tuned_model.eval()
    correct = 0
    n = 0
    with torch.no_grad():
        for batch in val_dataloader:
            input_ids = batch["input_ids"].to(device)
            attention_mask = batch["attention_mask"].to(device)
            labels = batch["labels"].to(device)
            correct += (fine_tuned_model(input_ids, attention_mask).argmax(1) == labels).sum().item()
            n += labels.size(0)
    val_accuracy = correct / n
    best_accuracy = max(best_accuracy, val_accuracy)
    print(device, "Epoch", epoch, ": Training Loss", training_loss, "| Validation Accuracy", val_accuracy)
    if val_accuracy >= 0.8:
        break
print("The best validation accuracy is", best_accuracy)

mps Epoch 1 : Training Loss 0.5876601719856263 | Validation Accuracy 0.82
The best validation accuracy is 0.82


## Task 4

Evaluate your fine-tuned model on the held-out `test_data`. You do not need a DataLoader or batching for this task and are welcome to simply iterate through the `test_data` one example at a time using a for loop.

1. **Quantitatively** compute the accuracy of your model's predictions on the held-out `test_data`. Print your test accuracy.

2. **Qualitatively** retrieve five examples from the test dataset for which your fine-tuned model made incorrect predictions. Print the review, the true label, and the predicted label for each.

3. **Reflect on Quantitative vs Qualitative Evaluation.** Briefly (1-2 paragraphs) interpret the results on these five qualitative examples. Are the labels ambiguous for any of these examples? Discuss why qualitative evaluation is crucial alongside quantitative metrics in real-world ML deployment for natural language processing tasks such as sentiment analysis.

In [27]:
# TODO: code for task 4 here
torch.manual_seed(2025)
fine_tuned_model.eval()  # make sure dropout is disabled while evaluating
wrong_examples = []
test_correct = 0
n_test = 0
with torch.no_grad():
    for example in test_data:
        encode = tokenizer(example["text"], padding = "longest", return_tensors = "pt")
        input_ids = encode["input_ids"].to(device)
        attention_mask = encode["attention_mask"].to(device)
        true_label = int(example["label"])
        pred_label = fine_tuned_model(input_ids, attention_mask).argmax(1).item()
        test_correct += int(pred_label == true_label)
        n_test += 1
        if pred_label != true_label and len(wrong_examples) < 5:
            wrong_examples.append((example["text"], true_label, pred_label))
test_accuracy = test_correct / n_test
print("The test accuracy is", test_accuracy)
print()
print("Displaying 5 incorrect predictions")
print()
for i, (text, true_label, pred_label) in enumerate(wrong_examples, 1):
    print("Example", i, ":")
    print("True Label:", true_label, "| Predicted Label:", pred_label)
    print("Review:", text)
    print()

The test accuracy is 0.8

Displaying 5 incorrect predictions

Example 1 :
True Label: 1 | Predicted Label: 0
Review: better than the tepid star trek : insurrection ; falls short of first contact because the villain couldn't pick the lint off borg queen alice krige's cape ; and finishes half a parsec ( a nose ) ahead of generations .

Example 2 :
True Label: 1 | Predicted Label: 0
Review: the mark of a respectable summer blockbuster is one of two things : unadulterated thrills or genuine laughs .

Example 3 :
True Label: 1 | Predicted Label: 0
Review: the movie has an avalanche of eye-popping visual effects .

Example 4 :
True Label: 0 | Predicted Label: 1
Review: it's so crammed with scenes and vistas and pretty moments that it's left a few crucial things out , like character development and coherence .

Example 5 :
True Label: 1 | Predicted Label: 0
Review: girls gone wild and gone civil again



Looking at these examples, I notice that the mistakes mostly come from mixed-polarity or indirect language. Example 1 uses phrases like "better than" and "falls short of", mixing good and bad points in a way that makes the overall tone difficult to judge. Example 2 does not read like a sentiment at all, since it never clearly says whether the movie is good or bad, so I find its label ambiguous. Example 3 gives the movie faint praise, but I can see it being read as a backhanded criticism, which might be why the predicted label is negative; I think this label is context-dependent. Example 4 uses phrases such as "it's so crammed" and "left a few crucial things out" that clearly mark the review as negative, but phrases like "pretty moments" carry a positive connotation on their own and could mislead the model into predicting positive. Lastly, Example 5 uses sarcastic phrasing, which makes it difficult to decide whether the review has a positive or negative tone, so I find this label ambiguous as well. Overall, several of these labels feel ambiguous to me, and the model's wrong guesses are mostly understandable.

I think this case is a good example of why qualitative evaluation should be used alongside quantitative metrics in real-world ML development. Quantitative metrics mainly tell us how often the model is right, but they do not tell us why the errors happen: sarcasm, idioms, references, or label ambiguity, to name a few. Seeing these mistakes can help us plan real fixes, such as adding more training examples with contrast words, improving labelling rules, and adjusting how we handle very short reviews. This matters in practice because these edge cases drive user trust, safety, and product quality. Therefore, I think it is important to conduct qualitative evaluations to ensure that the model performs robustly not just on quantitative metrics, but on the nuanced language that people actually write.