<a href="https://colab.research.google.com/github/098Steve/Jupyter/blob/main/IMDB_BERT.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

This notebook fine-tunes BERT to classify IMDB movie reviews as positive or negative.

To get your own editable copy, click File > Save a copy in Drive.

Set the runtime to T4 GPU: click the triangle arrow at the top right, click Change runtime type, and select T4 GPU. You may need to reload the notebook afterwards.

Work through the notebook and make sure you understand each step. Use Google and an LLM to help you.

In [None]:
# ✅ Step 0: Install required libraries (only run once in Colab)
!pip install -U datasets transformers

In [None]:
# ✅ Step 1: Imports
from datasets import load_dataset
from transformers import BertTokenizer, BertForSequenceClassification, Trainer, TrainingArguments
import torch
import numpy as np
from sklearn.metrics import accuracy_score

In [None]:
# ✅ Step 2: Load the IMDB dataset and take a small subset for quick training
dataset = load_dataset("imdb", split={"train": "train[:2000]", "test": "test[:1000]"})
small_train = dataset["train"]
small_test = dataset["test"]
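
Before tokenising, it can be worth checking what the slices actually contain. The sketch below is not part of the original notebook: `label_counts` is a hypothetical helper and `toy_labels` is made-up data, used so the cell runs without downloading anything. In IMDB, label 0 means negative and label 1 means positive.

```python
from collections import Counter

def label_counts(labels):
    """Count how many examples carry each label (hypothetical helper)."""
    return dict(Counter(labels))

# Made-up labels purely for illustration.
toy_labels = [0, 1, 1, 0, 1]
print(label_counts(toy_labels))
```

In the notebook itself, the same helper could be called as `label_counts(small_train["label"])` and `label_counts(small_test["label"])`.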

In [None]:
# ✅ Step 3: Load tokenizer
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

In [None]:
# ✅ Step 4: Tokenise text
def tokenize(batch):
    return tokenizer(batch["text"], padding="max_length", truncation=True, max_length=512)

train_enc = small_train.map(tokenize, batched=True)
test_enc = small_test.map(tokenize, batched=True)
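
To make `padding="max_length"` and `truncation=True` more concrete, here is a toy sketch of what they do to a list of token IDs. `pad_or_truncate` is a hypothetical function written for illustration, not the tokenizer's own code, and it deliberately ignores one detail: the real tokenizer keeps special tokens such as `[SEP]` in place when it truncates.

```python
def pad_or_truncate(token_ids, max_length, pad_id=0):
    """Return (input_ids, attention_mask), each exactly max_length long."""
    kept = token_ids[:max_length]              # truncation: drop anything past max_length
    n_pad = max_length - len(kept)             # padding: how many slots are still empty
    input_ids = kept + [pad_id] * n_pad
    attention_mask = [1] * len(kept) + [0] * n_pad  # 1 = real token, 0 = padding
    return input_ids, attention_mask

# A short sequence gets padded; a long one gets cut off.
short_ids, short_mask = pad_or_truncate([101, 2023, 102], max_length=5)
long_ids, long_mask = pad_or_truncate([101, 1, 2, 3, 4, 5, 102], max_length=5)
print(short_ids, short_mask)
print(long_ids, long_mask)
```

The attention mask is what lets the model ignore the padding positions, which is why Step 5 keeps both `input_ids` and `attention_mask`.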

In [None]:
# ✅ Step 5: Set PyTorch format
train_enc.set_format("torch", columns=["input_ids", "attention_mask", "label"])
test_enc.set_format("torch", columns=["input_ids", "attention_mask", "label"])

In [None]:
# ✅ Step 6: Load model
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

In [None]:
# ✅ Step 7: Define metrics
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return {"accuracy": accuracy_score(labels, predictions)}
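
As a rough illustration of what `compute_metrics` receives, the sketch below runs the same `argmax` step on made-up logits for four reviews. The arrays are invented for the example, and accuracy is computed with a plain NumPy mean, which gives the same fraction as `accuracy_score`.

```python
import numpy as np

# Made-up logits: one row per review, columns are [negative, positive].
toy_logits = np.array([[2.0, -1.0],
                       [0.1, 0.3],
                       [-0.5, 1.5],
                       [1.2, 0.8]])
toy_labels = np.array([0, 1, 0, 0])

# The predicted class is the column with the larger logit in each row.
toy_predictions = np.argmax(toy_logits, axis=-1)

# Accuracy is the fraction of predictions that match the labels.
toy_accuracy = float((toy_predictions == toy_labels).mean())
print(toy_predictions, toy_accuracy)
```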

In [None]:
# ✅ Step 8: Training arguments (tuned for Colab + T4 GPU)
training_args = TrainingArguments(
    output_dir="./bert-imdb",
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    num_train_epochs=2,
    eval_strategy="epoch",
    save_strategy="no",
    logging_steps=10,
    logging_dir="./logs",
    load_best_model_at_end=False,
    report_to="none",  # turn off wandb
)
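
To get a feel for how long training will take, the arithmetic behind these settings can be sketched as follows. `training_steps` is a hypothetical helper, and it assumes a single GPU, no gradient accumulation, and that the last partial batch is kept (the defaults, as far as this notebook sets them).

```python
import math

def training_steps(num_examples, batch_size, num_epochs):
    """Return (steps per epoch, total optimisation steps) for one device."""
    steps_per_epoch = math.ceil(num_examples / batch_size)
    return steps_per_epoch, steps_per_epoch * num_epochs

# Values taken from this notebook: 2000 training reviews, batch size 8, 2 epochs.
steps_per_epoch, total_steps = training_steps(2000, 8, 2)

# With logging_steps=10, the loss is logged once every 10 steps.
log_entries = total_steps // 10
print(steps_per_epoch, total_steps, log_entries)
```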

In [None]:
# ✅ Step 9: Train
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_enc,
    eval_dataset=test_enc,
    compute_metrics=compute_metrics,
)

trainer.train()

In [None]:
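# The Trainer has already placed the model on the GPU when one is available.
# Record that device so the inputs in Step 11 can be moved alongside the model.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)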

In [None]:
# ✅ Step 10: Evaluate
metrics = trainer.evaluate()
print("✅ Test Accuracy:", metrics["eval_accuracy"])

In [None]:
# ✅ Step 11: Define a prediction function
def predict_sentiment(text):
    # Tokenise and immediately move tensors to the same device as model
    inputs = tokenizer(
        text,
        return_tensors="pt",
        truncation=True,
        padding=True,
        max_length=512
    ).to(device)

    model.eval()
    with torch.no_grad():
        outputs = model(**inputs)
        logits = outputs.logits
        probs = torch.nn.functional.softmax(logits, dim=-1)
        pred = torch.argmax(probs, dim=1).item()
        confidence = probs[0, pred].item()

    sentiment = "👍 Positive" if pred == 1 else "👎 Negative"
    print(f"Sentiment: {sentiment} ({confidence:.2%} confidence)")
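
The confidence printed above comes from a softmax over the two logits. Here is a minimal pure-Python sketch of that calculation on made-up logits. The `softmax` function is written out for illustration; the notebook itself uses `torch.nn.functional.softmax`.

```python
import math

def softmax(logits):
    """Turn a list of logits into probabilities that sum to 1."""
    largest = max(logits)
    exps = [math.exp(x - largest) for x in logits]  # subtract the max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]

# Made-up logits in the order [negative, positive].
toy_logits = [-1.0, 2.0]
probs = softmax(toy_logits)
pred = probs.index(max(probs))   # index of the most likely class
confidence = probs[pred]         # probability assigned to that class
print(pred, f"{confidence:.2%}")
```

Note that a high confidence only says the model is sure of itself, not that it is right.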

In [None]:
predict_sentiment("I loved the movie.")
predict_sentiment("It was boring, slow, and way too long. I wouldn't recommend it.")



---






Well done for getting to the end of this notebook!

How was the performance of the model?

See if you can spot the error that causes the performance to be low. It is easily fixed!