# Programming Assignment: Sentiment Analysis with Transformers

**Course:** Advanced Transformer Architecture & Practical NLP   
**GPU Required:** L4 or T4 (free tier is fine!)  

---

## 📋 Assignment Overview

In this assignment, you will:
1. Load and explore a sentiment analysis dataset
2. Tokenize text data using a pre-trained tokenizer
3. Fine-tune a transformer model for sentiment classification
4. Evaluate your model's performance
5. Test your model on custom examples

## 🎯 Learning Objectives

By completing this assignment, you will demonstrate:
- Understanding of tokenization
- Ability to fine-tune pre-trained transformers
- Knowledge of evaluation metrics
- Practical skills in using Hugging Face Transformers

## ✅ Grading Criteria

- **Part 1:** Data Loading & Exploration (15 points)
- **Part 2:** Tokenization (20 points)
- **Part 3:** Model Training (30 points)
- **Part 4:** Evaluation (20 points)
- **Part 5:** Inference (15 points)

**Total: 100 points**

---

## ⚙️ Setup Instructions

1. **Enable GPU:** Runtime → Change runtime type → GPU (L4)
2. **Run all cells in order**
3. **Fill in the TODO sections**
4. **Don't modify test cells** (marked with 🧪)

Let's get started! 🚀

## Setup: Install Required Libraries

Run this cell first to install all dependencies.

In [1]:
# %%capture
# %pip install transformers datasets accelerate evaluate scikit-learn

# print("✅ Installation complete!")

In [2]:
# Import required libraries
import torch
import numpy as np
from datasets import load_dataset
from transformers import (
    AutoTokenizer,
    AutoModelForSequenceClassification,
    TrainingArguments,
    Trainer
)
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

# Pick the best available device: CUDA GPU, Apple Silicon (MPS), or CPU

if torch.cuda.is_available():
    device = torch.device("cuda")
elif torch.backends.mps.is_available():
    device = torch.device("mps")
else:
    device = torch.device("cpu")

print("Using device:", device)

Using device: mps


---
# Part 1: Data Loading & Exploration (15 points)

In this section, you'll load a dataset and explore its structure.

## Task 1.1: Load the Dataset (5 points)

Load the "imdb" dataset and select a small subset for quick training.

**Instructions:**
- Load the IMDB dataset using `load_dataset()`
- Select 500 training examples
- Select 100 test examples

In [3]:

# Load the IMDB dataset
dataset = load_dataset("imdb")

# Create training subset (500 examples)
train_dataset = dataset["train"].shuffle(seed=42).select(range(500))

# Create test subset (100 examples)
test_dataset = dataset["test"].shuffle(seed=42).select(range(100))

print(f"✅ Training examples: {len(train_dataset)}")
print(f"✅ Test examples: {len(test_dataset)}")


✅ Training examples: 500
✅ Test examples: 100
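
The idea behind `shuffle(seed=42).select(range(500))` is a repeatable random subset: the seed fixes the permutation, and `select` keeps the first rows of it. The sketch below illustrates this with the standard library. `sample_subset` is a hypothetical helper, and it will not reproduce the exact permutation the `datasets` library generates.

```python
import random

def sample_subset(n_rows, n_keep, seed=42):
    """Return n_keep distinct row indices, shuffled reproducibly by seed.

    Hypothetical helper for illustration; it mirrors the idea of
    shuffle(seed).select(range(n)) but not the library's exact ordering.
    """
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)   # seeded, so the order is repeatable
    return indices[:n_keep]

# The IMDB training split has 25,000 rows.
first = sample_subset(25_000, 500)
second = sample_subset(25_000, 500)

print(len(first), first == second)
```

Shuffling before selecting matters if the underlying split is ordered by label, since taking the first 500 rows would then yield a single class.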


## Task 1.2: Explore the Data (10 points)

Examine the dataset structure and print statistics.

In [4]:

# Print the first example from training dataset
print("First example:")
print(train_dataset[0])

# Count how many positive (label=1) and negative reviews in training data
positive_count = int(sum([ex["label"] for ex in train_dataset]))
negative_count = int(len(train_dataset) - positive_count)

print(f"\nLabel distribution:")
print(f"  Positive reviews: {positive_count} ({positive_count/len(train_dataset)*100:.1f}%)")
print(f"  Negative reviews: {negative_count} ({negative_count/len(train_dataset)*100:.1f}%)")

# Calculate average review length (in characters)
avg_length = float(np.mean([len(ex["text"]) for ex in train_dataset]))

print(f"\nAverage review length: {avg_length:.0f} characters")


First example:
{'text': 'There is no relation at all between Fortier and Profiler but the fact that both are police series about violent crimes. Profiler looks crispy, Fortier looks classic. Profiler plots are quite simple. Fortier\'s plot are far more complicated... Fortier looks more like Prime Suspect, if we have to spot similarities... The main character is weak and weirdo, but have "clairvoyance". People like to compare, to judge, to evaluate. How about just enjoying? Funny thing too, people writing Fortier looks American but, on the other hand, arguing they prefer American series (!!!). Maybe it\'s the language, or the spirit, but I think this series is more English than American. By the way, the actors are really good and funny. The acting is not superficial at all...', 'label': 1}

Label distribution:
  Positive reviews: 246 (49.2%)
  Negative reviews: 254 (50.8%)

Average review length: 1303 characters
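
The same statistics can be computed with the standard library alone. The sketch below uses three made-up records (`toy_reviews` is hypothetical, not taken from IMDB) to show the counting and averaging logic without downloading the dataset.

```python
from collections import Counter
from statistics import mean

# Hypothetical toy records in the same {'text', 'label'} shape as IMDB rows.
toy_reviews = [
    {"text": "Loved it.", "label": 1},
    {"text": "Dull and far too long.", "label": 0},
    {"text": "A charming, funny film.", "label": 1},
]

# Count how many times each label value occurs.
label_counts = Counter(row["label"] for row in toy_reviews)

# Average text length in characters (not tokens).
avg_chars = mean(len(row["text"]) for row in toy_reviews)

for label, count in sorted(label_counts.items()):
    share = count / len(toy_reviews)
    print(f"label={label}: {count} ({share:.1%})")
print(f"average length: {avg_chars:.1f} characters")
```

Character length is only a rough proxy for model input size: the `max_length` limit used in Part 2 is measured in tokens, not characters.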


### 🧪 Test Cell - Part 1 (Do Not Modify)

In [5]:
# Test Part 1
# Checks the subset sizes, the expected fields, and that positive labels are present.
assert len(train_dataset) == 500, "Training dataset should have 500 examples"
assert len(test_dataset) == 100, "Test dataset should have 100 examples"
assert 'text' in train_dataset[0], "Dataset should have 'text' field"
assert 'label' in train_dataset[0], "Dataset should have 'label' field"
assert positive_count > 0, "Should have some positive examples"
print("✅ Part 1 tests passed! (15/15 points)")

✅ Part 1 tests passed! (15/15 points)


---
# Part 2: Tokenization (20 points)

Tokenize the text data using a pre-trained tokenizer.

## Task 2.1: Initialize Tokenizer (5 points)

Load the tokenizer for DistilBERT.

In [6]:

model_name = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)

print(f"✅ Tokenizer loaded: {model_name}")
print(f"   Vocabulary size: {tokenizer.vocab_size:,}")


✅ Tokenizer loaded: distilbert-base-uncased
   Vocabulary size: 30,522


## Task 2.2: Create Tokenization Function (10 points)

Write a function to tokenize the dataset.

In [7]:

def tokenize_function(examples):
    """
    Tokenize the text examples.

    Args:
        examples: Dictionary with 'text' field containing reviews

    Returns:
        Tokenized examples with input_ids and attention_mask
    """
    return tokenizer(
        examples["text"],
        padding="max_length",
        truncation=True,
        max_length=128
    )

# Test the function
test_example = {"text": ["This is a test review."]}
result = tokenize_function(test_example)
print("✅ Tokenization function created")
# list(...) prints only the key names; printing result.keys() directly dumps every padded ID
print(f"   Output keys: {list(result.keys())}")
print(f"   First 8 input IDs: {result['input_ids'][0][:8]}")


✅ Tokenization function created
   Output keys: ['input_ids', 'token_type_ids', 'attention_mask']
   First 8 input IDs: [101, 2023, 2003, 1037, 3231, 3319, 1012, 102]
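
To make the effect of `padding="max_length"` and `truncation=True` concrete, here is a minimal pure-Python sketch of what happens to a single sequence of token IDs. `pad_or_truncate` is a hypothetical helper written for illustration, not part of the Transformers API, and it simplifies truncation (the real tokenizer keeps the closing `[SEP]` token when it truncates). The IDs are the ones the tokenizer produced in this notebook for "This is a test review.", and the pad ID of 0 matches the zeros in that output.

```python
def pad_or_truncate(token_ids, max_length=128, pad_id=0):
    """Return (input_ids, attention_mask), both exactly max_length long.

    Hypothetical helper for illustration only; simplified relative to
    the real tokenizer, which also preserves special tokens on truncation.
    """
    kept = token_ids[:max_length]            # truncation: drop the overflow
    n_pad = max_length - len(kept)           # padding needed to reach max_length
    input_ids = kept + [pad_id] * n_pad
    attention_mask = [1] * len(kept) + [0] * n_pad  # 1 = real token, 0 = padding
    return input_ids, attention_mask


# [CLS] this is a test review . [SEP]
ids = [101, 2023, 2003, 1037, 3231, 3319, 1012, 102]
input_ids, attention_mask = pad_or_truncate(ids)

print(len(input_ids), sum(attention_mask))
```

The attention mask is what lets the model ignore the padding positions, so padding every review to 128 tokens does not change which tokens it attends to.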

## Task 2.3: Apply Tokenization (5 points)

Apply the tokenization function to both datasets.

In [8]:

# Apply tokenization to train_dataset
train_tokenized = train_dataset.map(tokenize_function, batched=True)

# Apply tokenization to test_dataset
test_tokenized = test_dataset.map(tokenize_function, batched=True)

print("✅ Tokenization complete!")
print(f"   Train features: {train_tokenized.column_names}")
print(f"   Test features: {test_tokenized.column_names}")


✅ Tokenization complete!
   Train features: ['text', 'label', 'input_ids', 'token_type_ids', 'attention_mask']
   Test features: ['text', 'label', 'input_ids', 'token_type_ids', 'attention_mask']


### 🧪 Test Cell - Part 2 (Do Not Modify)

In [9]:
# Test Part 2
assert tokenizer is not None, "Tokenizer should be initialized"
assert 'input_ids' in train_tokenized.column_names, "Should have input_ids"
assert 'attention_mask' in train_tokenized.column_names, "Should have attention_mask"
assert len(train_tokenized[0]['input_ids']) == 128, "Should have max_length=128"
print("✅ Part 2 tests passed! (20/20 points)")

✅ Part 2 tests passed! (20/20 points)


---
# Part 3: Model Training (30 points)

Load a pre-trained model and fine-tune it on the dataset.

## Task 3.1: Load Pre-trained Model (10 points)

Load DistilBERT for sequence classification.

In [10]:

model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

# Move the model to the selected device (cuda, mps, or cpu)
model = model.to(device)

# Print model info
total_params = sum(p.numel() for p in model.parameters())
print(f"✅ Model loaded: {model_name}")
print(f"   Total parameters: {total_params:,}")
print(f"   Device: {device}")


DistilBertForSequenceClassification LOAD REPORT from: distilbert-base-uncased
Key                     | Status     | 
------------------------+------------+-
vocab_layer_norm.weight | UNEXPECTED | 
vocab_transform.weight  | UNEXPECTED | 
vocab_transform.bias    | UNEXPECTED | 
vocab_layer_norm.bias   | UNEXPECTED | 
vocab_projector.bias    | UNEXPECTED | 
pre_classifier.weight   | MISSING    | 
pre_classifier.bias     | MISSING    | 
classifier.weight       | MISSING    | 
classifier.bias         | MISSING    | 

Notes:
- UNEXPECTED: can be ignored when loading from different task/architecture; not ok if you expect identical arch.
- MISSING: those params were newly initialized because missing from the checkpoint. Consider training on your downstream task.


✅ Model loaded: distilbert-base-uncased
   Total parameters: 66,955,010
   Device: mps


## Task 3.2: Define Metrics Function (10 points)

Create a function to compute accuracy and F1 score.

In [11]:

def compute_metrics(eval_pred):
    """
    Compute metrics for evaluation.

    Args:
        eval_pred: Tuple of (predictions, labels)

    Returns:
        Dictionary with accuracy and f1 score
    """
    predictions, labels = eval_pred

    # Hugging Face sometimes wraps predictions in a tuple
    if isinstance(predictions, (tuple, list)):
        predictions = predictions[0]

    # Get predicted class (argmax of predictions)
    pred_classes = np.argmax(predictions, axis=1)

    # Calculate accuracy
    accuracy = accuracy_score(labels, pred_classes)

    # Calculate precision, recall, f1
    precision, recall, f1, _ = precision_recall_fscore_support(
        labels, pred_classes, average="binary"
    )

    return {
        "accuracy": accuracy,
        "f1": f1,
        "precision": precision,
        "recall": recall
    }

print("✅ Metrics function defined")


✅ Metrics function defined
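
As a sanity check on what `compute_metrics` receives, the sketch below works through the same steps by hand on four made-up examples. The logits and labels are hypothetical values chosen for illustration, and the arithmetic is written out in plain Python rather than calling scikit-learn.

```python
# Hypothetical logits for 4 reviews: one row per example, [negative, positive].
logits = [
    [2.1, -1.3],   # argmax -> 0
    [-0.4, 0.9],   # argmax -> 1
    [0.2, 0.7],    # argmax -> 1
    [1.5, 0.3],    # argmax -> 0
]
labels = [0, 1, 0, 1]  # hypothetical ground truth

# Predicted class = index of the largest logit in each row.
preds = [row.index(max(row)) for row in logits]

# Tally outcomes with class 1 (positive) as the target class,
# which is what average="binary" does by default.
tp = sum(p == 1 and y == 1 for p, y in zip(preds, labels))
fp = sum(p == 1 and y == 0 for p, y in zip(preds, labels))
fn = sum(p == 0 and y == 1 for p, y in zip(preds, labels))

accuracy = sum(p == y for p, y in zip(preds, labels)) / len(labels)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(preds, accuracy, precision, recall, f1)
```

Note that the metrics are computed from raw logits: `argmax` picks the same class whether or not a softmax is applied first.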


## Task 3.3: Configure and Run Training (10 points)

Set up training arguments and train the model.

In [12]:

training_args = TrainingArguments(
    output_dir="./results",
    eval_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    num_train_epochs=10,  # 10 passes over the 500-example training subset
    weight_decay=0.01,
    report_to="none"
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_tokenized,
    eval_dataset=test_tokenized,
    compute_metrics=compute_metrics
)

print("🚀 Starting training...\n")
print("This will take 2-3 minutes on a T4 GPU.\n")

train_result = trainer.train()

print("\n✅ Training complete!")
print(f"   Training time: {train_result.metrics['train_runtime']:.1f} seconds")


🚀 Starting training...

This will take 2-3 minutes on a T4 GPU.



| Epoch | Training Loss | Validation Loss | Accuracy | F1 | Precision | Recall |
|-------|---------------|-----------------|----------|----|-----------|--------|
| 1 | No log | 0.64124 | 0.68 | 0.714286 | 0.615385 | 0.851064 |
| 2 | No log | 0.457318 | 0.79 | 0.769231 | 0.795455 | 0.744681 |
| 3 | No log | 0.422035 | 0.8 | 0.795918 | 0.764706 | 0.829787 |
| 4 | No log | 0.438322 | 0.82 | 0.808511 | 0.808511 | 0.808511 |
| 5 | No log | 0.566241 | 0.77 | 0.780952 | 0.706897 | 0.87234 |
| 6 | No log | 0.578756 | 0.8 | 0.795918 | 0.764706 | 0.829787 |
| 7 | No log | 0.632469 | 0.78 | 0.784314 | 0.727273 | 0.851064 |
| 8 | No log | 0.604608 | 0.77 | 0.757895 | 0.75 | 0.765957 |
| 9 | No log | 0.669182 | 0.78 | 0.784314 | 0.727273 | 0.851064 |
| 10 | No log | 0.668859 | 0.78 | 0.784314 | 0.727273 | 0.851064 |


✅ Training complete!
   Training time: 95.2 seconds
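
The `No log` entries in the Training Loss column have a simple explanation if the run used the `Trainer` default of logging the training loss every 500 optimizer steps (`logging_steps=500`), which the arguments above do not override. A rough count, assuming a single device and no gradient accumulation:

```python
import math

train_examples = 500     # size of train_dataset
batch_size = 16          # per_device_train_batch_size
epochs = 10              # num_train_epochs
logging_steps = 500      # assumed Trainer default; not set in this notebook

steps_per_epoch = math.ceil(train_examples / batch_size)  # last batch is partial
total_steps = steps_per_epoch * epochs

print(f"steps per epoch: {steps_per_epoch}")
print(f"total steps:     {total_steps}")
print(f"training loss logged at least once: {total_steps >= logging_steps}")
```

Under these assumptions the run finishes before the first logging step is reached. Passing a smaller `logging_steps` value to `TrainingArguments` should make the column show numbers.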


### 🧪 Test Cell - Part 3 (Do Not Modify)

In [13]:
# Test Part 3
assert model is not None, "Model should be initialized"
assert trainer is not None, "Trainer should be created"
assert train_result is not None, "Training should be completed"
print("✅ Part 3 tests passed! (30/30 points)")

✅ Part 3 tests passed! (30/30 points)
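
The per-epoch results above suggest the model starts to overfit well before epoch 10: validation loss bottoms out early and then climbs, while accuracy stops improving. The sketch below copies the validation loss and accuracy columns from the training log and picks out the best epoch by each criterion.

```python
# (epoch, validation_loss, accuracy) copied from the training log above.
history = [
    (1, 0.641240, 0.68),
    (2, 0.457318, 0.79),
    (3, 0.422035, 0.80),
    (4, 0.438322, 0.82),
    (5, 0.566241, 0.77),
    (6, 0.578756, 0.80),
    (7, 0.632469, 0.78),
    (8, 0.604608, 0.77),
    (9, 0.669182, 0.78),
    (10, 0.668859, 0.78),
]

best_by_loss = min(history, key=lambda row: row[1])
best_by_acc = max(history, key=lambda row: row[2])
final = history[-1]

print(f"lowest validation loss: epoch {best_by_loss[0]} ({best_by_loss[1]:.3f})")
print(f"highest accuracy:       epoch {best_by_acc[0]} ({best_by_acc[2]:.0%})")
print(f"final epoch:            epoch {final[0]} ({final[1]:.3f}, {final[2]:.0%})")
```

Because these training arguments do not ask the `Trainer` to reload the best checkpoint, the evaluation in Part 4 scores the weights left after the last epoch, which is why the final accuracy matches the epoch 10 row rather than the peak.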


---
# Part 4: Evaluation (20 points)

Evaluate the trained model and analyze results.

## Task 4.1: Evaluate on Test Set (10 points)

In [14]:

eval_results = trainer.evaluate()

test_accuracy = eval_results["eval_accuracy"]
test_f1 = eval_results["eval_f1"]

## Task 4.2: Interpret Results (10 points)

Answer the questions by filling in the variables.

In [15]:

# Based on the evaluation results, answer these questions:

# Question 1: Is the accuracy above 70%?
accuracy_above_70 = bool(test_accuracy > 0.70)

# Question 2: Is the F1 score above 0.65?
f1_above_65 = bool(test_f1 > 0.65)

# Question 3: What's the difference between precision and recall?
precision_recall_diff = float(eval_results["eval_precision"] - eval_results["eval_recall"])

print("📝 Analysis:")
print(f"   Accuracy > 70%: {accuracy_above_70}")
print(f"   F1 Score > 0.65: {f1_above_65}")
print(f"   Precision - Recall: {precision_recall_diff:.4f}")

if test_accuracy > 0.75:
    print("\n🎉 Excellent! Your model performs well!")
elif test_accuracy > 0.65:
    print("\n👍 Good job! Your model is working.")
else:
    print("\n💡 Your model works, but could be improved with more data/epochs.")


📝 Analysis:
   Accuracy > 70%: True
   F1 Score > 0.65: True
   Precision - Recall: -0.1238

🎉 Excellent! Your model performs well!


### 🧪 Test Cell - Part 4 (Do Not Modify)

In [16]:
# Test Part 4
assert eval_results is not None, "Should have evaluation results"
assert 'eval_accuracy' in eval_results, "Should have accuracy metric"
assert 'eval_f1' in eval_results, "Should have F1 metric"
assert test_accuracy > 0.5, "Accuracy should be better than random (50%)"
assert isinstance(accuracy_above_70, bool), "accuracy_above_70 should be boolean"
assert isinstance(f1_above_65, bool), "f1_above_65 should be boolean"
print("✅ Part 4 tests passed! (20/20 points)")

✅ Part 4 tests passed! (20/20 points)
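
Precision below recall means false positives outnumber false negatives, i.e. the model leans toward predicting POSITIVE. One set of confusion-matrix counts that reproduces the reported metrics on the 100 test reviews is shown below. The counts are inferred from the metrics, not read from the notebook, so treat them as a reconstruction.

```python
# Counts inferred from the reported metrics (a reconstruction, not notebook output).
tp, fp, fn, tn = 40, 15, 7, 38

total = tp + fp + fn + tn
accuracy = (tp + tn) / total
precision = tp / (tp + fp)   # of predicted positives, how many were right
recall = tp / (tp + fn)      # of actual positives, how many were found
f1 = 2 * tp / (2 * tp + fp + fn)

print(f"accuracy:  {accuracy:.2f}")
print(f"precision: {precision:.4f}")
print(f"recall:    {recall:.4f}")
print(f"f1:        {f1:.4f}")
print(f"precision - recall: {precision - recall:.4f}")
```

Read this way, the model mislabels roughly twice as many negative reviews as positive (15 false positives against 7 false negatives).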


---
# Part 5: Inference (15 points)

Use your trained model to classify new reviews!

## Task 5.1: Create a Prediction Function (10 points)

In [17]:

def predict_sentiment(text):
    """
    Predict sentiment of a text review.

    Args:
        text: String containing the review

    Returns:
        Dictionary with 'label' and 'score'
    """
    # Tokenize the input text
    inputs = tokenizer(
        text,
        return_tensors="pt",
        truncation=True,
        max_length=128
    )

    # Move inputs to the same device as model
    inputs = {k: v.to(device) for k, v in inputs.items()}

    # Get model predictions (no gradient needed)
    with torch.no_grad():
        outputs = model(**inputs)

    # Get predicted class and confidence
    probs = torch.softmax(outputs.logits, dim=-1).squeeze(0)  # [2]
    predicted_class = int(torch.argmax(probs).item())
    confidence = float(probs[predicted_class].item())

    label = "POSITIVE" if predicted_class == 1 else "NEGATIVE"

    return {
        "label": label,
        "score": confidence
    }

print("✅ Prediction function created")


✅ Prediction function created
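
The confidence score is the softmax probability of the winning class. A pure-Python sketch of that step, using made-up logits (the values are hypothetical; a real call would take them from `outputs.logits`):

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]  # subtract max to avoid overflow
    total = sum(exps)
    return [e / total for e in exps]

logits = [-1.2, 2.3]  # hypothetical [negative, positive] logits
probs = softmax(logits)

# Same mapping as predict_sentiment: class 1 is POSITIVE, class 0 is NEGATIVE.
predicted_class = probs.index(max(probs))
label = "POSITIVE" if predicted_class == 1 else "NEGATIVE"

print(label, round(probs[predicted_class], 3))
```

With only two classes the score can never fall below 50%, so values close to that floor indicate an uncertain prediction.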


## Task 5.2: Test on Custom Examples (5 points)

In [18]:

# Test reviews
test_reviews = [
    "This movie was absolutely amazing! Best film ever!",
    "Terrible movie. Complete waste of time and money.",
    "Not bad, but not great either. Just okay.",
]

print("🎬 Testing on Custom Reviews:")
print("=" * 60)

for i, review in enumerate(test_reviews, 1):
    result = predict_sentiment(review)

    emoji = "😊" if result['label'] == "POSITIVE" else "😞"

    print(f"\n{i}. {review}")
    print(f"   {emoji} Prediction: {result['label']} (confidence: {result['score']:.1%})")

print("\n" + "=" * 60)

# Store first prediction for testing
first_prediction = predict_sentiment(test_reviews[0])


🎬 Testing on Custom Reviews:

1. This movie was absolutely amazing! Best film ever!
   😊 Prediction: POSITIVE (confidence: 98.6%)

2. Terrible movie. Complete waste of time and money.
   😞 Prediction: NEGATIVE (confidence: 99.1%)

3. Not bad, but not great either. Just okay.
   😞 Prediction: NEGATIVE (confidence: 74.5%)



## Task 5.3: Try Your Own Review!

Write your own movie review and see what the model predicts.

In [19]:

my_review = "I really enjoyed this movie! The plot was engaging and the acting was superb."

# Predict sentiment
my_prediction = predict_sentiment(my_review)

print("🎭 Your Review:")
print("=" * 60)
print(f"Review: {my_review}")
print(f"\nPrediction: {my_prediction['label']}")
print(f"Confidence: {my_prediction['score']:.1%}")
print("=" * 60)


🎭 Your Review:
Review: I really enjoyed this movie! The plot was engaging and the acting was superb.

Prediction: POSITIVE
Confidence: 99.1%


### 🧪 Test Cell - Part 5 (Do Not Modify)

In [20]:
# Test Part 5
assert first_prediction is not None, "Should have prediction"
assert 'label' in first_prediction, "Prediction should have label"
assert 'score' in first_prediction, "Prediction should have score"
assert first_prediction['label'] in ['POSITIVE', 'NEGATIVE'], "Label should be POSITIVE or NEGATIVE"
assert 0 <= first_prediction['score'] <= 1, "Score should be between 0 and 1"
assert my_review != "YOUR REVIEW HERE", "Please write your own review"
print("✅ Part 5 tests passed! (15/15 points)")

✅ Part 5 tests passed! (15/15 points)


---
# 🎉 Assignment Complete!

## Summary of What You Accomplished:

✅ **Part 1:** Loaded and explored the IMDB dataset  
✅ **Part 2:** Tokenized text data using the DistilBERT tokenizer  
✅ **Part 3:** Fine-tuned a transformer model for sentiment analysis  
✅ **Part 4:** Evaluated model performance with multiple metrics  
✅ **Part 5:** Built a prediction function and tested it on custom examples  

## Your Results:


In [21]:
print("\n" + "=" * 60)
print("📊 FINAL RESULTS")
print("=" * 60)
print(f"\nModel Performance:")
print(f"  Accuracy:  {test_accuracy:.1%}")
print(f"  F1 Score:  {test_f1:.3f}")
print(f"  Precision: {eval_results['eval_precision']:.3f}")
print(f"  Recall:    {eval_results['eval_recall']:.3f}")

print(f"\nTraining Info:")
print(f"  Training examples: {len(train_dataset)}")
print(f"  Test examples: {len(test_dataset)}")
print(f"  Training time: {train_result.metrics['train_runtime']:.1f} seconds")
print(f"  Model: {model_name}")

total_score = 100
print(f"\n🎯 Estimated Score: {total_score}/100 points")

if test_accuracy > 0.75:
    print("\n🌟 Outstanding work! Your model performs excellently!")
elif test_accuracy > 0.65:
    print("\n👏 Great job! You've successfully trained a working model!")
else:
    print("\n✅ Good start! Consider training for more epochs for better performance.")

print("\n" + "=" * 60)
print("\n💡 Next Steps:")
print("  - Try training for more epochs (increase num_train_epochs)")
print("  - Use a larger dataset (increase number of examples)")
print("  - Experiment with different learning rates")
print("  - Try other models (roberta-base, bert-base-uncased)")
print("\n🎓 Thank you for completing this assignment!")
print("=" * 60)


📊 FINAL RESULTS

Model Performance:
  Accuracy:  78.0%
  F1 Score:  0.784
  Precision: 0.727
  Recall:    0.851

Training Info:
  Training examples: 500
  Test examples: 100
  Training time: 95.2 seconds
  Model: distilbert-base-uncased

🎯 Estimated Score: 100/100 points

🌟 Outstanding work! Your model performs excellently!


💡 Next Steps:
  - Try training for more epochs (increase num_train_epochs)
  - Use a larger dataset (increase number of examples)
  - Experiment with different learning rates
  - Try other models (roberta-base, bert-base-uncased)

🎓 Thank you for completing this assignment!


---
# 📤 Submission Instructions

1. **Verify all cells have run successfully** (no errors)
2. **Check that all TODO sections are completed**
3. **Make sure all test cells passed** (✅ marks)
4. **Download the notebook:**
   - File → Download → Download .ipynb
5. **Submit the downloaded .ipynb file to the course portal on Moodle**

## Grading Rubric Recap:

| Part | Task | Points |
|------|------|--------|
| 1 | Data Loading & Exploration | 15 |
| 2 | Tokenization | 20 |
| 3 | Model Training | 30 |
| 4 | Evaluation | 20 |
| 5 | Inference | 15 |
| **Total** | | **100** |

---

**Questions?** Contact the TA.

**Good luck! 🚀**