This project implements a sentiment analysis pipeline with the Hugging Face transformers and datasets libraries, fine-tuning a pre-trained BERT model (bert-base-uncased) on the IMDb movie reviews dataset. The pipeline begins with data loading and preprocessing: the dataset is fetched via load_dataset("imdb") and tokenized with BertTokenizer, truncating each review to BERT's maximum input length, while padding is applied per batch by DataCollatorWithPadding. We fine-tune BertForSequenceClassification (the PyTorch implementation of BERT) with the Trainer API, configured through TrainingArguments with a learning rate of 2e-5 and a weight decay of 0.01. For speed, the model is trained for two epochs on a shuffled 2,000-review subset of the training split and evaluated with accuracy, F1, and loss on a 1,000-review subset of the test split. After training, the model and tokenizer are saved, and the fine-tuned model is used for inference on sample text through a custom predict_sentiment function. The pipeline relies on transfer learning to keep training time short; in the recorded run it reaches an accuracy of 0.901 on that evaluation subset.

In [None]:
!pip install transformers datasets tokenizers



In [None]:
!pip install -U datasets huggingface_hub fsspec # using this because there were some errors

Collecting datasets
  Downloading datasets-3.6.0-py3-none-any.whl.metadata (19 kB)
Collecting huggingface_hub
  Downloading huggingface_hub-0.33.2-py3-none-any.whl.metadata (14 kB)
Collecting fsspec
  Downloading fsspec-2025.5.1-py3-none-any.whl.metadata (11 kB)
  Downloading fsspec-2025.3.0-py3-none-any.whl.metadata (11 kB)
Downloading datasets-3.6.0-py3-none-any.whl (491 kB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 491.5/491.5 kB 9.6 MB/s eta 0:00:00
Downloading huggingface_hub-0.33.2-py3-none-any.whl (515 kB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 515.4/515.4 kB 36.2 MB/s eta 0:00:00
Downloading fsspec-2025.3.0-py3-none-any.whl (193 kB)
   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 193.6/193.6 kB 18.2 MB/s eta 0:00:00
Installing collected packages: fsspec, huggingface_hub, datasets
  Attempting uninstall: fsspec
    Found existing installati
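
Because the upgrade above was added to work around errors, it can help to confirm which versions actually ended up installed. The following is a minimal sketch using only the standard library; installed_versions is a hypothetical helper name, and the package list simply mirrors the pip commands above.

```python
from importlib import metadata


def installed_versions(packages):
    """Return {package: version string, or None if the package is not installed}."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


# Package names taken from the pip install cells above.
report = installed_versions(["datasets", "huggingface_hub", "fsspec", "transformers"])
for name, version in report.items():
    print(f"{name}: {version or 'not installed'}")
```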

In [None]:
import torch
from datasets import load_dataset
from transformers import BertTokenizer, BertForSequenceClassification, Trainer, TrainingArguments
from transformers import DataCollatorWithPadding
from sklearn.metrics import accuracy_score, f1_score
import numpy as np

In [None]:
# Load IMDb dataset
dataset = load_dataset("imdb")

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


README.md: 0.00B [00:00, ?B/s]

train-00000-of-00001.parquet:   0%|          | 0.00/21.0M [00:00<?, ?B/s]

test-00000-of-00001.parquet:   0%|          | 0.00/20.5M [00:00<?, ?B/s]

unsupervised-00000-of-00001.parquet:   0%|          | 0.00/42.0M [00:00<?, ?B/s]

Generating train split:   0%|          | 0/25000 [00:00<?, ? examples/s]

Generating test split:   0%|          | 0/25000 [00:00<?, ? examples/s]

Generating unsupervised split:   0%|          | 0/50000 [00:00<?, ? examples/s]

In [None]:
# Load tokenizer
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

tokenizer_config.json:   0%|          | 0.00/48.0 [00:00<?, ?B/s]

vocab.txt:   0%|          | 0.00/232k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/466k [00:00<?, ?B/s]

config.json:   0%|          | 0.00/570 [00:00<?, ?B/s]

In [None]:
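# Preprocess dataset: tokenize each review, truncating to BERT's 512-token limit.
# Padding is deferred to the data collator so each batch is padded only as far as needed.
def tokenize_function(example):
    return tokenizer(example["text"], truncation=True)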
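
To make the effect of truncation=True concrete, here is a toy sketch of the idea. It is not the BertTokenizer implementation: whitespace splitting stands in for WordPiece, and toy_encode is a hypothetical helper. The point it illustrates is that the two special tokens count toward the length limit, so the text itself gets at most max_length - 2 positions.

```python
# Toy illustration of truncation; whitespace splitting stands in for WordPiece.
CLS, SEP = "[CLS]", "[SEP]"


def toy_encode(text, max_length=512):
    """Split on whitespace, then truncate so [CLS] + tokens + [SEP] fits max_length."""
    tokens = text.lower().split()
    budget = max_length - 2  # reserve two positions for [CLS] and [SEP]
    return [CLS] + tokens[:budget] + [SEP]


short = toy_encode("A great film", max_length=8)
long = toy_encode("word " * 100, max_length=8)

print(short)      # short input is left intact, only wrapped in special tokens
print(len(long))  # long input is cut down to exactly max_length positions
```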

In [None]:
tokenized_datasets = dataset.map(tokenize_function, batched=True)

Map:   0%|          | 0/25000 [00:00<?, ? examples/s]

Map:   0%|          | 0/25000 [00:00<?, ? examples/s]

Map:   0%|          | 0/50000 [00:00<?, ? examples/s]

In [None]:
# Select the tokenized train and test splits (IMDb already ships with both)
train_dataset = tokenized_datasets["train"]
test_dataset = tokenized_datasets["test"]

In [None]:
# Load pre-trained BERT model for binary classification
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

model.safetensors:   0%|          | 0.00/440M [00:00<?, ?B/s]

Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [None]:
# Data collator for dynamic padding
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
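
Dynamic padding means each batch is padded only to the length of its own longest sequence rather than to a global maximum. The sketch below shows the idea in plain Python; pad_batch is a hypothetical helper, not the DataCollatorWithPadding implementation, and the token ids are arbitrary example values (0 is the [PAD] id in bert-base-uncased).

```python
PAD_ID = 0  # id of [PAD] in the bert-base-uncased vocabulary


def pad_batch(sequences, pad_id=PAD_ID):
    """Right-pad every sequence to the longest one in this batch and build the mask."""
    longest = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = longest - len(seq)
        input_ids.append(list(seq) + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding the model should ignore
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


batch = pad_batch([[101, 2023, 102], [101, 2023, 3185, 2001, 102]])
print(batch["input_ids"])
print(batch["attention_mask"])
```

A batch of short reviews therefore costs far less compute than one padded to 512 tokens, which is the main reason to pad in the collator instead of in tokenize_function.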

In [None]:
# Compute metrics
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    acc = accuracy_score(labels, predictions)
    f1 = f1_score(labels, predictions)
    return {"accuracy": acc, "f1": f1}
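
For readers who want to see what compute_metrics calculates, the following is a small pure-Python sketch of the same three steps: argmax over logits, accuracy, and binary F1 for the positive class (label 1). The helper names and the toy logits are illustrative assumptions; the notebook itself relies on NumPy and scikit-learn.

```python
def argmax_rows(logits):
    """Index of the largest logit in each row, i.e. the predicted class."""
    return [max(range(len(row)), key=row.__getitem__) for row in logits]


def accuracy(labels, preds):
    return sum(l == p for l, p in zip(labels, preds)) / len(labels)


def binary_f1(labels, preds, positive=1):
    tp = sum(l == positive and p == positive for l, p in zip(labels, preds))
    fp = sum(l != positive and p == positive for l, p in zip(labels, preds))
    fn = sum(l == positive and p != positive for l, p in zip(labels, preds))
    if tp == 0:
        return 0.0  # no true positives: precision and recall are zero or undefined
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


toy_logits = [[2.0, -1.0], [0.1, 0.9], [1.5, 0.3], [-0.2, 0.4]]
toy_labels = [0, 1, 1, 1]
toy_preds = argmax_rows(toy_logits)

print(toy_preds)
print(accuracy(toy_labels, toy_preds), binary_f1(toy_labels, toy_preds))
```

Note the tp == 0 branch: if the evaluation labels contain no positive examples at all, F1 for the positive class is 0.0 no matter how high the accuracy is.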

In [None]:
# Training arguments
training_args = TrainingArguments(
    output_dir="./results",
    eval_strategy="epoch",
    save_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    num_train_epochs=2,
    weight_decay=0.01,
    logging_dir="./logs",
    logging_steps=10,
    save_total_limit=1,
    report_to=[]
)
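
The arguments above set learning_rate=2e-5 without naming a scheduler. Assuming the Trainer defaults (a linear schedule with no warmup), the learning rate would fall linearly from 2e-5 to 0 over the run. A minimal sketch of that schedule follows; linear_lr is a hypothetical helper, and total_steps=500 is taken from the global_step reported by the training run in this notebook.

```python
def linear_lr(step, total_steps, base_lr=2e-5, warmup_steps=0):
    """Linear warmup (if any) followed by linear decay to zero."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


for step in (0, 250, 500):
    print(step, linear_lr(step, total_steps=500))
```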

In [None]:
# Initialize Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset.shuffle(seed=42).select(range(2000)),  # subset for speed
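    # Shuffle before selecting: the IMDb test split is ordered by label, so the first 1000 rows
    # are all negative reviews (the recorded run below used them unshuffled, hence F1 = 0.0).
    eval_dataset=test_dataset.shuffle(seed=42).select(range(1000)),  # subset for speed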
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics,
)

In [None]:
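import os
# Disable Weights & Biases logging (a safeguard; report_to=[] above already turns off all integrations)
os.environ["WANDB_DISABLED"] = "true"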

In [None]:
# Fine-tune the model
trainer.train()

Epoch,Training Loss,Validation Loss,Accuracy,F1
1,0.3853,0.531316,0.812,0.0
2,0.1701,0.389994,0.901,0.0


TrainOutput(global_step=500, training_loss=0.2449758805632591, metrics={'train_runtime': 449.0833, 'train_samples_per_second': 8.907, 'train_steps_per_second': 1.113, 'total_flos': 973938460296960.0, 'train_loss': 0.2449758805632591, 'epoch': 2.0})
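
The figures in TrainOutput can be cross-checked against the configuration. The sketch below assumes a single device and no gradient accumulation (neither is set in the training arguments), and uses the 2,000-example training subset, batch size 8, and 2 epochs from the cells above.

```python
import math

num_examples = 2000      # training subset passed to the Trainer
batch_size = 8           # per_device_train_batch_size
epochs = 2               # num_train_epochs
train_runtime = 449.0833 # seconds, as reported in TrainOutput

steps_per_epoch = math.ceil(num_examples / batch_size)
total_steps = steps_per_epoch * epochs
samples_per_second = num_examples * epochs / train_runtime
steps_per_second = total_steps / train_runtime

print(total_steps)                   # compare with global_step
print(round(samples_per_second, 3))  # compare with train_samples_per_second
print(round(steps_per_second, 3))    # compare with train_steps_per_second
```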

In [None]:
# Evaluate the model
metrics = trainer.evaluate()
print("Evaluation metrics:", metrics)

Evaluation metrics: {'eval_loss': 0.3899937868118286, 'eval_accuracy': 0.901, 'eval_f1': 0.0, 'eval_runtime': 28.6301, 'eval_samples_per_second': 34.928, 'eval_steps_per_second': 4.366, 'epoch': 2.0}
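
An accuracy of 0.901 next to an F1 of 0.0 means the model produced no true positives, which is what happens when the evaluation labels contain no positive reviews. That is consistent with a split stored in label order, where the first 1,000 rows all belong to one class. The simulation below uses a synthetic label list as a stand-in for such a split (an assumption, not the real dataset) to show the difference between taking the head and taking a shuffled sample.

```python
import random
from collections import Counter

# Stand-in for a label-sorted split: 12,500 negatives (0) followed by 12,500 positives (1).
labels = [0] * 12500 + [1] * 12500

# Taking the first 1000 rows without shuffling yields a single class.
head = labels[:1000]

# Shuffling first (with a fixed seed for reproducibility) yields a mix of both classes.
shuffled = labels[:]
random.Random(42).shuffle(shuffled)
sample = shuffled[:1000]

print("head:  ", Counter(head))
print("sample:", Counter(sample))
```

On the real data, the same check would be Counter(test_dataset.select(range(1000))["label"]).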


In [None]:
# Save the model
model_path = "./fine-tuned-bert-imdb"
trainer.save_model(model_path)
tokenizer.save_pretrained(model_path)

('./fine-tuned-bert-imdb/tokenizer_config.json',
 './fine-tuned-bert-imdb/special_tokens_map.json',
 './fine-tuned-bert-imdb/vocab.txt',
 './fine-tuned-bert-imdb/added_tokens.json')

In [None]:
# Inference on new text
def predict_sentiment(text):
    """Classify a single review as "Positive" or "Negative" with the fine-tuned model."""
    inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True)
    # Move input tensors to the same device as the model
    device = model.device
    inputs = {k: v.to(device) for k, v in inputs.items()}
    model.eval()  # make sure dropout is disabled during inference
    with torch.no_grad():
        outputs = model(**inputs)
    logits = outputs.logits
    # IMDb labels: 0 = negative, 1 = positive
    predicted_class = torch.argmax(logits, dim=-1).item()
    return "Positive" if predicted_class == 1 else "Negative"
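
predict_sentiment returns only the label. If a confidence score is also wanted, the logits can be passed through a softmax first. The following is a pure-Python sketch of that step; label_with_confidence is a hypothetical helper, and the example logits are made up for illustration rather than taken from the model.

```python
import math

LABELS = ("Negative", "Positive")  # index 0 = negative, 1 = positive, as in predict_sentiment


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def label_with_confidence(logits):
    probs = softmax(logits)
    idx = max(range(len(probs)), key=probs.__getitem__)
    return LABELS[idx], probs[idx]


label, confidence = label_with_confidence([-1.2, 2.3])
print(label, round(confidence, 3))
```

With PyTorch tensors, the equivalent step would be torch.softmax(logits, dim=-1) applied to the model's logits.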

In [None]:
# Example inference
sample_text = "I absolutely loved this movie. It was fantastic!"
print("Sample Text Prediction:", predict_sentiment(sample_text))

Sample Text Prediction: Positive
