In [1]:
!pip install transformers datasets scikit-learn




# Homework 6 – Mini AI Pipeline
## IMDB Sentiment Analysis (Positive / Negative)

- Student: Kanghan Lee (2024149005)
- Model: DistilBERT (SST-2 fine-tuned)
- Dataset: IMDB (200 train / 50 test)

This notebook implements:
1. A naive keyword-based baseline
2. An AI pipeline using a pretrained transformer
3. A side-by-side evaluation of both on the same 50-review test subset (accuracy and F1)


In [3]:
import random
import numpy as np
import torch

from datasets import load_dataset
from transformers import pipeline

from sklearn.metrics import accuracy_score, f1_score


In [4]:
# Reproducibility
SEED = 42
random.seed(SEED)
np.random.seed(SEED)
torch.manual_seed(SEED)



In [5]:
dataset = load_dataset("imdb")
dataset


DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    unsupervised: Dataset({
        features: ['text', 'label'],
        num_rows: 50000
    })
})

In [7]:
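def sample_subset(split, n_samples, seed=SEED):
    """Return a reproducible random subset of `n_samples` rows from `split`."""
    # Shuffle with a fixed seed, then keep the first n_samples rows.
    shuffled = split.shuffle(seed=seed)
    return shuffled.select(range(n_samples))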
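
Because the subsets are small random draws, it can help to check how the two classes are balanced before interpreting accuracy. The helper below is a minimal sketch; `label_balance` is a hypothetical name that does not appear in the notebook, and the labels passed to it here are toy values.

```python
from collections import Counter


def label_balance(labels):
    """Return {label: fraction of examples} for an iterable of class labels."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: count / total for label, count in sorted(counts.items())}


# Toy labels for illustration only. In the notebook one could pass the
# label column of a sampled subset instead (an assumption, not shown above).
print(label_balance([1, 0, 1, 1]))
```

A strongly skewed subset would make the tie-breaking rule of the baseline matter more, since ties are resolved toward the positive class.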


In [8]:
train_subset = sample_subset(dataset["train"], 200)
test_subset  = sample_subset(dataset["test"], 50)

print("Train size:", len(train_subset))
print("Test size :", len(test_subset))


Train size: 200
Test size : 50


In [9]:
# Inspect an example
train_subset[0]


{'text': 'There is no relation at all between Fortier and Profiler but the fact that both are police series about violent crimes. Profiler looks crispy, Fortier looks classic. Profiler plots are quite simple. Fortier\'s plot are far more complicated... Fortier looks more like Prime Suspect, if we have to spot similarities... The main character is weak and weirdo, but have "clairvoyance". People like to compare, to judge, to evaluate. How about just enjoying? Funny thing too, people writing Fortier looks American but, on the other hand, arguing they prefer American series (!!!). Maybe it\'s the language, or the spirit, but I think this series is more English than American. By the way, the actors are really good and funny. The acting is not superficial at all...',
 'label': 1}

In [10]:
positive_keywords = [
    "good", "great", "amazing", "excellent", "love", "wonderful"
]

negative_keywords = [
    "bad", "terrible", "awful", "boring", "hate", "worst"
]


In [11]:
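def keyword_baseline_predict(text):
    text = text.lower()

    # Each sum counts how many distinct keywords appear at least once.
    # `in` is a substring test, so "hate" also matches inside "whatever".
    pos_count = sum(word in text for word in positive_keywords)
    neg_count = sum(word in text for word in negative_keywords)

    # Ties (including reviews with no keyword hits) default to positive.
    return 1 if pos_count >= neg_count else 0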
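
The baseline above matches keywords as substrings. As a point of comparison, here is a sketch of a whole-word variant built on regular-expression word boundaries. It is not the method used for the reported results: the names `keyword_predict_whole_words`, `POS_RE`, and `NEG_RE` are hypothetical, and unlike the baseline this version counts every occurrence rather than distinct keywords.

```python
import re

POSITIVE = ["good", "great", "amazing", "excellent", "love", "wonderful"]
NEGATIVE = ["bad", "terrible", "awful", "boring", "hate", "worst"]


def _compile(words):
    # \b anchors the alternation to word boundaries, so "hate" no longer
    # matches inside "whatever".
    return re.compile(r"\b(?:" + "|".join(map(re.escape, words)) + r")\b")


POS_RE = _compile(POSITIVE)
NEG_RE = _compile(NEGATIVE)


def keyword_predict_whole_words(text):
    """Predict 1 (positive) or 0 (negative) from whole-word keyword counts."""
    text = text.lower()
    pos_count = len(POS_RE.findall(text))
    neg_count = len(NEG_RE.findall(text))
    # Same tie rule as the baseline: ties default to positive.
    return 1 if pos_count >= neg_count else 0


print(keyword_predict_whole_words("Whatever."))
```

The trade-off is that whole-word matching also stops matching inflected forms such as "loved" or "boringly", so it is not guaranteed to score better than the substring version.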


In [12]:
baseline_preds = [
    keyword_baseline_predict(ex["text"]) for ex in test_subset
]

true_labels = [ex["label"] for ex in test_subset]


In [13]:
# f1_score defaults to binary F1 with label 1 (positive) as the target class.
baseline_acc = accuracy_score(true_labels, baseline_preds)
baseline_f1  = f1_score(true_labels, baseline_preds)

baseline_acc, baseline_f1


(0.6, 0.6774193548387096)
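
To make the two numbers easier to read, the sketch below computes accuracy and binary F1 directly from confusion-matrix counts. The helper names are hypothetical. The counts used in the example are an assumption chosen to be consistent with the reported values on 50 test reviews; the notebook does not show the actual split between false positives and false negatives.

```python
def accuracy_from_counts(tp, tn, fp, fn):
    """Fraction of all predictions that are correct."""
    return (tp + tn) / (tp + tn + fp + fn)


def f1_from_counts(tp, fp, fn):
    """Binary F1 for the positive class: 2*TP / (2*TP + FP + FN)."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Hypothetical counts: only TP + TN = 30 and FP + FN = 20 follow from the
# reported accuracy; the individual FP and FN values are made up.
tp, tn, fp, fn = 21, 9, 18, 2
print(accuracy_from_counts(tp, tn, fp, fn), f1_from_counts(tp, fp, fn))
```

F1 ignores true negatives, which is why it can sit above accuracy when a classifier leans toward predicting the positive class.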

In [14]:
sentiment_pipe = pipeline(
    "sentiment-analysis",
    model="distilbert-base-uncased-finetuned-sst-2-english",
    device=0 if torch.cuda.is_available() else -1
)


Device set to use cpu


In [15]:
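def distilbert_predict(text):
    # truncation=True cuts reviews longer than the model's maximum input length,
    # so only the beginning of a long review is scored.
    output = sentiment_pipe(text, truncation=True)[0]
    # The SST-2 checkpoint labels its outputs "POSITIVE" or "NEGATIVE".
    return 1 if output["label"] == "POSITIVE" else 0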
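
The function above scores one review per call. A text-classification pipeline can also be called with a list of texts and a `batch_size` argument, which may be faster on larger subsets. The sketch below shows that pattern under stated assumptions: `batch_predict` is a hypothetical helper, and `fake_pipe` is a stand-in stub used here so the example runs without downloading a model.

```python
def batch_predict(pipe, texts, batch_size=16):
    """Map a list of texts to 0/1 labels using a sentiment pipeline-like callable."""
    outputs = pipe(list(texts), batch_size=batch_size, truncation=True)
    return [1 if out["label"] == "POSITIVE" else 0 for out in outputs]


def fake_pipe(texts, batch_size=1, truncation=False):
    # Stub that mimics the output shape of a sentiment pipeline:
    # one {"label": ..., "score": ...} dict per input text.
    return [
        {"label": "POSITIVE" if "good" in t else "NEGATIVE", "score": 1.0}
        for t in texts
    ]


print(batch_predict(fake_pipe, ["a good film", "dull and slow"]))
```

In the notebook, `sentiment_pipe` would take the place of `fake_pipe`; on CPU with 50 reviews the difference is likely to be small.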


In [16]:
bert_preds = [
    distilbert_predict(ex["text"]) for ex in test_subset
]


In [17]:
bert_acc = accuracy_score(true_labels, bert_preds)
bert_f1  = f1_score(true_labels, bert_preds)

bert_acc, bert_f1


(0.94, 0.9333333333333333)

In [18]:
print("=== IMDB Sentiment Classification Results ===")
print(f"Naive Baseline | Accuracy: {baseline_acc:.3f}, F1: {baseline_f1:.3f}")
print(f"DistilBERT     | Accuracy: {bert_acc:.3f}, F1: {bert_f1:.3f}")


=== IMDB Sentiment Classification Results ===
Naive Baseline | Accuracy: 0.600, F1: 0.677
DistilBERT     | Accuracy: 0.940, F1: 0.933
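
With only 50 test reviews, each accuracy value carries noticeable sampling uncertainty. One rough way to gauge it is a Wilson score interval for a binomial proportion, sketched below. The function name is hypothetical, and the counts of 30 and 47 correct predictions are derived from the reported accuracies (0.600 and 0.940 of 50) rather than printed by the notebook.

```python
import math


def wilson_interval(correct, n, z=1.96):
    """Approximate 95% Wilson score interval for an accuracy of correct/n."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = correct / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half_width, center + half_width


# Counts derived from the reported accuracies on 50 test reviews.
for name, correct in [("Naive Baseline", 30), ("DistilBERT", 47)]:
    low, high = wilson_interval(correct, 50)
    print(f"{name}: [{low:.3f}, {high:.3f}]")
```

This treats each model separately; since both are scored on the same reviews, a paired comparison would be a stricter check.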


In [19]:
def show_differences(n=3):
    """Print up to `n` test reviews where the baseline and DistilBERT disagree."""
    shown = 0
    for i in range(len(test_subset)):
        if baseline_preds[i] != bert_preds[i]:
            ex = test_subset[i]
            print("=" * 80)
            print("Review (truncated):")
            print(ex["text"][:300], "...")
            print()
            print(f"Gold label      : {ex['label']}")
            print(f"Baseline pred   : {baseline_preds[i]}")
            print(f"DistilBERT pred : {bert_preds[i]}")
            shown += 1
            if shown >= n:
                break


In [20]:
show_differences(3)


================================================================================
Review (truncated):
I was truly and wonderfully surprised at "O' Brother, Where Art Thou?" The video store was out of all the movies I was planning on renting, so then I came across this. I came home and as I watched I became engrossed and found myself laughing out loud. The Coen's have made a magnificiant film again.  ...

Gold label      : 1
Baseline pred   : 1
DistilBERT pred : 0
================================================================================
Review (truncated):
This movie spends most of its time preaching that it is the script that makes the movie, but apparently there was no script when they shot this waste of time! The trailer makes this out to be a comedy, but the film can't decide if it wants to be a comedy, a drama, a romance or an action film. Press  ...

Gold label      : 0
Baseline pred   : 1
DistilBERT pred : 0
================================================================================
Review (truncated):
Porn legend Gregory Dark directs this cheesy horror flick that has Glen Jacobs (Kane from WWF/WWE/ whatever it calls itself nowadays) in his cinematic debut. He plays Jacob Goodknight, a blind serial killer w

In [21]:
results = {
    "baseline_accuracy": baseline_acc,
    "baseline_f1": baseline_f1,
    "bert_accuracy": bert_acc,
    "bert_f1": bert_f1
}

results


{'baseline_accuracy': 0.6,
 'baseline_f1': 0.6774193548387096,
 'bert_accuracy': 0.94,
 'bert_f1': 0.9333333333333333}