<a href="https://colab.research.google.com/github/LucasColas/Movie-review/blob/main/IMDB_movie_reviews.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## prerequisites

In [None]:
!pip install --upgrade datasets fsspec s3fs huggingface_hub


Collecting datasets
  Downloading datasets-3.6.0-py3-none-any.whl.metadata (19 kB)
Collecting fsspec
  Downloading fsspec-2025.5.1-py3-none-any.whl.metadata (11 kB)
Collecting s3fs
  Downloading s3fs-2025.5.1-py3-none-any.whl.metadata (1.9 kB)
Collecting huggingface_hub
  Downloading huggingface_hub-0.33.2-py3-none-any.whl.metadata (14 kB)
Collecting fsspec
  Downloading fsspec-2025.3.0-py3-none-any.whl.metadata (11 kB)
Collecting aiobotocore<3.0.0,>=2.5.4 (from s3fs)
  Downloading aiobotocore-2.23.0-py3-none-any.whl.metadata (24 kB)
INFO: pip is looking at multiple versions of s3fs to determine which version is compatible with other requirements. This could take a while.
Collecting s3fs
  Downloading s3fs-2025.5.0-py3-none-any.whl.metadata (1.9 kB)
  Downloading s3fs-2025.3.2-py3-none-any.whl.metadata (1.9 kB)
  Downloading s3fs-2025.3.1-py3-none-any.whl.metadata (1.9 kB)
  Downloading s3fs-2025.3.0-py3-none-any.whl.metadata (1.9 kB)
Collecting aioitertools<1.0.0,>=0.5.1 (from aioboto

In [None]:
!pip install --upgrade transformers


Collecting transformers
  Downloading transformers-4.53.1-py3-none-any.whl.metadata (40 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 40.9/40.9 kB 2.9 MB/s eta 0:00:00
Downloading transformers-4.53.1-py3-none-any.whl (10.8 MB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 10.8/10.8 MB 51.4 MB/s eta 0:00:00
Installing collected packages: transformers
  Attempting uninstall: transformers
    Found existing installation: transformers 4.53.0
    Uninstalling transformers-4.53.0:
      Successfully uninstalled transformers-4.53.0
Successfully installed transformers-4.53.1


## libraries

In [None]:
from collections import Counter
from datasets import load_dataset, concatenate_datasets
from transformers import AutoTokenizer
from transformers import AutoModelForSequenceClassification, TrainingArguments, Trainer
import numpy as np
from sklearn.metrics import accuracy_score, precision_recall_fscore_support


## dataset

In [None]:


# Load IMDB: the train and test splits contain 25k examples each.
# Labels: 0 (negative), 1 (positive)
imdb = load_dataset("imdb")
train_ds, test_ds = imdb["train"], imdb["test"]


In [None]:
train_ds

Dataset({
    features: ['text', 'label'],
    num_rows: 25000
})

In [None]:
display(train_ds[0])

{'text': 'I rented I AM CURIOUS-YELLOW from my video store because of all the controversy that surrounded it when it was first released in 1967. I also heard that at first it was seized by U.S. customs if it ever tried to enter this country, therefore being a fan of films considered "controversial" I really had to see this for myself.<br /><br />The plot is centered around a young Swedish drama student named Lena who wants to learn everything she can about life. In particular she wants to focus her attentions to making some sort of documentary on what the average Swede thought about certain political issues such as the Vietnam War and race issues in the United States. In between asking politicians and ordinary denizens of Stockholm about their opinions on politics, she has sex with her drama teacher, classmates, and married men.<br /><br />What kills me about I AM CURIOUS-YELLOW is that 40 years ago, this was considered pornographic. Really, the sex and nudity scenes are few and far be

In [None]:
# Count labels in the training set
train_label_counts = Counter(train_ds["label"])
print("Label counts in the training dataset:", train_label_counts)

# Count labels in the testing set
test_label_counts = Counter(test_ds["label"])
print("Label counts in the testing dataset:", test_label_counts)

Label counts in the training dataset: Counter({0: 12500, 1: 12500})
Label counts in the testing dataset: Counter({0: 12500, 1: 12500})


In [None]:
# Move half of the test set into the training set, keeping both labels balanced.

test_ds_label_0 = test_ds.filter(lambda x: x['label'] == 0)
test_ds_label_1 = test_ds.filter(lambda x: x['label'] == 1)


num_to_move_0 = len(test_ds_label_0) // 2
num_to_move_1 = len(test_ds_label_1) // 2

# examples to move from test to train
move_ds_label_0 = test_ds_label_0.select(range(num_to_move_0))
move_ds_label_1 = test_ds_label_1.select(range(num_to_move_1))

# new training set
new_train_ds = concatenate_datasets([train_ds, move_ds_label_0, move_ds_label_1])

# the remaining examples are for the new test set
remain_ds_label_0 = test_ds_label_0.select(range(num_to_move_0, len(test_ds_label_0)))
remain_ds_label_1 = test_ds_label_1.select(range(num_to_move_1, len(test_ds_label_1)))
new_test_ds     = concatenate_datasets([remain_ds_label_0, remain_ds_label_1])

new_train_ds = new_train_ds.shuffle(seed=42)


train_ds, test_ds = new_train_ds, new_test_ds

# Verify label distributions
print("Label counts in new TRAIN set:", Counter(train_ds["label"]))
print("Label counts in new TEST  set:", Counter(test_ds["label"]))

Label counts in new TRAIN set: Counter({1: 18750, 0: 18750})
Label counts in new TEST  set: Counter({0: 6250, 1: 6250})
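
The per-label split above can be illustrated without downloading the dataset. The following is a minimal sketch on a toy label list; the helper name `split_half_per_label` and the toy data are hypothetical and not part of the notebook. It moves the first half of each label's indices and keeps the rest, which mirrors the `filter` / `select` logic of the cell above.

```python
from collections import Counter


def split_half_per_label(labels):
    """Return (moved, kept) index lists; the first half of each label's indices is moved."""
    by_label = {}
    for idx, label in enumerate(labels):
        by_label.setdefault(label, []).append(idx)

    moved, kept = [], []
    for label in sorted(by_label):
        indices = by_label[label]
        half = len(indices) // 2
        moved.extend(indices[:half])   # goes to the training set
        kept.extend(indices[half:])    # stays in the test set
    return moved, kept


# Toy stand-in for the test labels (hypothetical data, 6 examples per class)
toy_labels = [0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1]
moved, kept = split_half_per_label(toy_labels)

moved_counts = Counter(toy_labels[i] for i in moved)
kept_counts = Counter(toy_labels[i] for i in kept)
print("moved:", moved_counts)
print("kept: ", kept_counts)
```

Because each label is halved separately, both resulting sets stay balanced, which is what the label counts printed above confirm for the real data.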


## model

In [None]:


model_name = "prajjwal1/bert-tiny"
tokenizer  = AutoTokenizer.from_pretrained(model_name)

def tokenize(batch):
    return tokenizer(batch["text"],
                     padding="max_length",
                     truncation=True,
                     max_length=256)

train_tok = train_ds.map(tokenize, batched=True)
test_tok  = test_ds.map(tokenize, batched=True)

# Set format for PyTorch
train_tok.set_format("torch", columns=["input_ids", "attention_mask", "label"])
test_tok.set_format("torch", columns=["input_ids", "attention_mask", "label"])


Map:   0%|          | 0/37500 [00:00<?, ? examples/s]

Map:   0%|          | 0/12500 [00:00<?, ? examples/s]
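
To make the effect of `padding="max_length"` and `truncation=True` concrete, here is a simplified sketch on plain lists of token ids. The function `pad_or_truncate` is hypothetical, and `pad_id=0` is an assumption about the padding id. Unlike the real tokenizer, this toy version ignores special tokens, so it only shows how every sequence ends up with the same length and how the attention mask marks padding.

```python
def pad_or_truncate(token_ids, max_length, pad_id=0):
    """Cut a sequence to max_length, then right-pad it to exactly max_length."""
    ids = list(token_ids[:max_length])           # truncation
    n_pad = max_length - len(ids)                # number of padding positions
    return {
        "input_ids": ids + [pad_id] * n_pad,
        "attention_mask": [1] * len(ids) + [0] * n_pad,  # 1 = real token, 0 = padding
    }


short = pad_or_truncate([101, 2023, 102], max_length=5)
long = pad_or_truncate([1, 2, 3, 4, 5, 6, 7, 8], max_length=5)
print(short)
print(long)
```

With `max_length=256`, as in the cell above, every example becomes a fixed-length sequence of 256 ids, so batches can be stacked into tensors directly.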

In [None]:
from transformers import AutoModelForSequenceClassification, TrainingArguments, Trainer

# Load pretrained model with a classification head
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

# Define training hyperparameters
args = TrainingArguments(
    output_dir="imdb-transformer",
    eval_strategy="epoch",
    save_strategy="epoch",
    logging_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=64,
    per_device_eval_batch_size=32,
    num_train_epochs=10,
    weight_decay=0.01,
)

def compute_metrics(pred):
    labels = pred.label_ids
    preds  = pred.predictions.argmax(-1)
    acc    = accuracy_score(labels, preds)
    p, r, f, _ = precision_recall_fscore_support(labels, preds, average="binary")
    return {"accuracy": acc, "precision": p, "recall": r, "f1": f}

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_tok,
    eval_dataset=test_tok,
    compute_metrics=compute_metrics,
)


Some weights of BertForSequenceClassification were not initialized from the model checkpoint at prajjwal1/bert-tiny and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
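
The metrics returned by `compute_metrics` can be derived by hand from confusion counts. The sketch below is a plain-Python illustration on toy labels and predictions (hypothetical data); it treats label 1 as the positive class, which is what `average="binary"` does by default in scikit-learn.

```python
def binary_metrics(labels, preds):
    """Accuracy, precision, recall and F1 with label 1 as the positive class."""
    tp = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 0)
    correct = sum(1 for y, p in zip(labels, preds) if y == p)

    accuracy = correct / len(labels)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Toy example: 3 true positives, 1 false positive, 1 false negative, 3 true negatives
toy_labels = [1, 1, 1, 0, 0, 0, 1, 0]
toy_preds = [1, 0, 1, 0, 1, 0, 1, 0]
metrics = binary_metrics(toy_labels, toy_preds)
print(metrics)
```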


## training

In [None]:

trainer.train()
results = trainer.evaluate()
print(results)


| Epoch | Training Loss | Validation Loss | Accuracy | Precision | Recall | F1 |
|---|---|---|---|---|---|---|
| 1 | 0.5751 | 0.446763 | 0.7988 | 0.803116 | 0.79168 | 0.797357 |
| 2 | 0.4093 | 0.379284 | 0.8352 | 0.803447 | 0.88752 | 0.843394 |
| 3 | 0.3621 | 0.349035 | 0.85 | 0.84972 | 0.8504 | 0.85006 |
| 4 | 0.3348 | 0.338216 | 0.85376 | 0.861156 | 0.84352 | 0.852247 |
| 5 | 0.3184 | 0.332857 | 0.85856 | 0.863681 | 0.85152 | 0.857557 |
| 6 | 0.3062 | 0.329025 | 0.86184 | 0.852753 | 0.87472 | 0.863597 |
| 7 | 0.2938 | 0.327034 | 0.86128 | 0.850948 | 0.876 | 0.863292 |
| 8 | 0.2862 | 0.328337 | 0.86104 | 0.872421 | 0.84576 | 0.858884 |
| 9 | 0.2796 | 0.326676 | 0.86408 | 0.854384 | 0.87776 | 0.865914 |
| 10 | 0.2755 | 0.325666 | 0.86352 | 0.856001 | 0.87408 | 0.864946 |


{'eval_loss': 0.32566601037979126, 'eval_accuracy': 0.86352, 'eval_precision': 0.8560012535255406, 'eval_recall': 0.87408, 'eval_f1': 0.8649461684610513, 'eval_runtime': 104.8092, 'eval_samples_per_second': 119.264, 'eval_steps_per_second': 3.731, 'epoch': 10.0}
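
The throughput figures in the evaluation output can be cross-checked with a little arithmetic. The sketch below assumes a single device, no gradient accumulation, and that the last partial batch is kept (all assumptions, not stated in the notebook); the example counts, batch sizes and runtime are taken from the cells and output above.

```python
import math

# Values from the notebook above
train_examples, train_batch_size, epochs = 37_500, 64, 10
eval_examples, eval_batch_size = 12_500, 32
eval_runtime = 104.8092  # seconds, from the evaluation output

# Assumption: one device, last partial batch kept
train_steps_per_epoch = math.ceil(train_examples / train_batch_size)
total_train_steps = train_steps_per_epoch * epochs
eval_steps = math.ceil(eval_examples / eval_batch_size)

samples_per_second = eval_examples / eval_runtime
steps_per_second = eval_steps / eval_runtime

print(train_steps_per_epoch, total_train_steps, eval_steps)
print(round(samples_per_second, 3), round(steps_per_second, 3))
```

Under these assumptions the computed rates agree with the reported `eval_samples_per_second` and `eval_steps_per_second` to three decimal places.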


In [None]:
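# Alternative: mixed-precision (fp16) training. This needs a CUDA GPU or another supported accelerator.
# Reload the pretrained checkpoint so this run starts fresh instead of
# continuing from the weights already fine-tuned above.
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)
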
args = TrainingArguments(
    output_dir="imdb-transformer",
    eval_strategy="epoch",
    save_strategy="epoch",
    logging_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=32,
    num_train_epochs=3,
    weight_decay=0.01,
    fp16=True,
    fp16_opt_level="O1",
    gradient_accumulation_steps=2,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_tok,
    eval_dataset=test_tok,
    compute_metrics=compute_metrics,
)

trainer.train()
results = trainer.evaluate()
print(results)
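
The fp16 configuration halves the per-device batch size but accumulates gradients over two batches, so each optimizer update still sees 64 examples. A small sketch of that arithmetic, assuming a single device (an assumption, not stated in the notebook):

```python
import math

# Values from the fp16 TrainingArguments above
train_examples = 37_500
per_device_batch_size = 32
gradient_accumulation_steps = 2
epochs = 3

# Examples contributing to one optimizer update (single device assumed)
effective_batch_size = per_device_batch_size * gradient_accumulation_steps

# Forward/backward passes per epoch, then optimizer updates per epoch
micro_batches_per_epoch = math.ceil(train_examples / per_device_batch_size)
optimizer_steps_per_epoch = micro_batches_per_epoch // gradient_accumulation_steps
total_optimizer_steps = optimizer_steps_per_epoch * epochs

print(effective_batch_size, micro_batches_per_epoch,
      optimizer_steps_per_epoch, total_optimizer_steps)
```

The effective batch size therefore matches the first run; the two runs differ in precision and in the number of epochs (3 instead of 10).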

## inference

In [None]:
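from transformers import pipeline
# Note: `return_all_scores` is deprecated in recent transformers releases;
# the deprecation message suggests `top_k=None` as the replacement.
nlp = pipeline("text-classification", model=trainer.model, tokenizer=tokenizer, return_all_scores=True)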
example = "This movie was not as good as I expected."
print(nlp(example))

Device set to use cpu


[[{'label': 'LABEL_0', 'score': 0.3659629225730896}, {'label': 'LABEL_1', 'score': 0.6340370178222656}]]
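
The pipeline reports generic names (`LABEL_0`, `LABEL_1`) because the model config was not given human-readable labels. The sketch below maps the scores printed above back to the dataset's convention (0 = negative, 1 = positive) and shows a softmax, the usual way such scores are obtained from a classifier's logits. The `id2label` dictionary and the logit values are hypothetical and chosen only for illustration.

```python
import math


def softmax(logits):
    """Convert raw logits into probabilities that sum to 1."""
    m = max(logits)                              # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical mapping, following the dataset comment: 0 (negative), 1 (positive)
id2label = {0: "negative", 1: "positive"}

# Scores copied from the pipeline output above, indexed by label id
scores = [0.3659629225730896, 0.6340370178222656]
pred_id = max(range(len(scores)), key=lambda i: scores[i])
prediction = id2label[pred_id]
print(prediction, scores[pred_id])

# Hypothetical logits, only to show that softmax yields a probability distribution
probs = softmax([0.0, 0.55])
print(probs)
```

Here the model assigns the higher score to the positive class even though the sentence reads as mildly negative, which suggests that negation is still hard for this small model.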


