# 🎬 IMDB Sentiment Classifier with Hugging Face

This notebook fine-tunes a BERT model to classify IMDB movie reviews as **positive** or **negative** using Hugging Face's `transformers` library.

**Recommendation:** Use a GPU. Training is significantly faster on a GPU, so switching the runtime type to GPU before running the notebook is highly recommended.

You can do this by following these steps:

- Go to "Runtime" in the top menu.
- Select "Change runtime type."
- Under "Hardware accelerator," choose a GPU option.
- Click "Save."
- Once you have configured the GPU, you can run the next cell.
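
Before running the remaining cells, it can help to confirm that the GPU is actually visible. The following is a minimal sketch, not part of the original notebook: `pick_device` is a hypothetical helper name, and it falls back to `"cpu"` when PyTorch is not installed or no GPU is found.

```python
def pick_device():
    """Return "cuda" if PyTorch can see a GPU, otherwise "cpu"."""
    try:
        # Imported lazily so the check also runs where torch is not installed.
        import torch
    except ImportError:
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"


device = pick_device()
print(f"Training will run on: {device}")
```

If this prints `cpu` after the runtime type was changed, restarting the session is usually the first thing to try.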

In [1]:
# Install compatible versions of datasets and fsspec to avoid errors
!pip install -U datasets fsspec

Collecting datasets
  Downloading datasets-3.6.0-py3-none-any.whl.metadata (19 kB)
Collecting fsspec
  Downloading fsspec-2025.5.1-py3-none-any.whl.metadata (11 kB)
  Downloading fsspec-2025.3.0-py3-none-any.whl.metadata (11 kB)
Downloading datasets-3.6.0-py3-none-any.whl (491 kB)
Downloading fsspec-2025.3.0-py3-none-any.whl (193 kB)
Installing collected packages: fsspec, datasets
  Attempting uninstall: fsspec
    Found existing installation: fsspec 2025.3.2
    Uninstalling fsspec-2025.3.2:
      Successfully uninstalled fsspec-2025.3.2
  Attempting uninstall: datasets
    Found existing installation: datasets 2.14.4
    Uninstalling datasets-2.14.4:
      Successfully uninstalled datasets-2.14.4
ERROR: pip's dependency resolver

In [2]:
#  Upgrade transformers to avoid compatibility issues with TrainingArguments
!pip install --upgrade transformers

Collecting transformers
  Downloading transformers-4.53.1-py3-none-any.whl.metadata (40 kB)
Downloading transformers-4.53.1-py3-none-any.whl (10.8 MB)
Installing collected packages: transformers
  Attempting uninstall: transformers
    Found existing installation: transformers 4.53.0
    Uninstalling transformers-4.53.0:
      Successfully uninstalled transformers-4.53.0
Successfully installed transformers-4.53.1


In [3]:
!pip uninstall -y transformers
!pip install transformers==4.53.1


Found existing installation: transformers 4.53.1
Uninstalling transformers-4.53.1:
  Successfully uninstalled transformers-4.53.1
Collecting transformers==4.53.1
  Using cached transformers-4.53.1-py3-none-any.whl.metadata (40 kB)
Using cached transformers-4.53.1-py3-none-any.whl (10.8 MB)
Installing collected packages: transformers
Successfully installed transformers-4.53.1


In [4]:
!pip install wandb



In [5]:
!pip install transformers datasets scikit-learn



## 1.  Import libraries

In [6]:
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification, TrainingArguments, Trainer
from datasets import load_dataset
from sklearn.metrics import accuracy_score
import numpy as np
import wandb

## 2.  Load and prepare the IMDB dataset

In [7]:
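dataset = load_dataset("imdb")
dataset = dataset.shuffle(seed=42)  # fixed seed so the subset is reproducible
# Keep a small subset so that training finishes in a few minutes
dataset["train"] = dataset["train"].select(range(2000))
dataset["test"] = dataset["test"].select(range(1000))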


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.
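
Because the training set is a 2,000-review sample of a shuffled split, it is worth checking that both classes are still roughly balanced. The sketch below is an illustration only: `label_balance` is a hypothetical helper, and the toy list stands in for `dataset["train"]["label"]`, where IMDB uses 0 for negative and 1 for positive.

```python
from collections import Counter


def label_balance(labels):
    """Return the fraction of examples per label value."""
    counts = Counter(labels)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {label: count / total for label, count in sorted(counts.items())}


# Toy labels standing in for dataset["train"]["label"] (0 = negative, 1 = positive).
toy_labels = [0, 1, 1, 0, 1, 0, 0, 1, 1, 1]
print(label_balance(toy_labels))  # {0: 0.4, 1: 0.6}
```

In the notebook, the equivalent call would be `label_balance(dataset["train"]["label"])`. A split far from 50/50 would make plain accuracy harder to interpret.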



## 3.  Tokenize the data

In [8]:
model_name = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)

def tokenize_function(examples):
    return tokenizer(examples["text"], padding="max_length", truncation=True)

tokenized_datasets = dataset.map(tokenize_function, batched=True)
tokenized_datasets = tokenized_datasets.remove_columns(["text"])
tokenized_datasets = tokenized_datasets.rename_column("label", "labels")
tokenized_datasets.set_format("torch")
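
With `padding="max_length"` and `truncation=True`, every review ends up with the same number of tokens (512 for `bert-base-uncased`): short reviews are padded and long ones are cut. The toy sketch below shows the idea on plain lists. It is not the tokenizer's implementation: `pad_or_truncate` is a hypothetical helper, the token ids are made up, and special tokens such as `[CLS]` and `[SEP]` are ignored.

```python
def pad_or_truncate(token_ids, max_length, pad_id=0):
    """Return (input_ids, attention_mask), both exactly max_length long."""
    kept = token_ids[:max_length]                  # truncation: drop the tail
    padding = [pad_id] * (max_length - len(kept))  # padding: fill up with pad_id
    input_ids = kept + padding
    # The mask marks real tokens with 1 and padding with 0.
    attention_mask = [1] * len(kept) + [0] * len(padding)
    return input_ids, attention_mask


print(pad_or_truncate([11, 12, 13], max_length=5))
# ([11, 12, 13, 0, 0], [1, 1, 1, 0, 0])
print(pad_or_truncate([11, 12, 13, 14, 15, 16], max_length=4))
# ([11, 12, 13, 14], [1, 1, 1, 1])
```

Padding everything to the maximum length is simple but wasteful for short reviews. Padding per batch is a common alternative, at the cost of a slightly more involved setup.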



## 4.  Load pre-trained BERT model

In [9]:
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)


Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


## 5.  Define metrics function

In [10]:
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    acc = accuracy_score(labels, predictions)
    return {"accuracy": acc}
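
To make the metric concrete, here is the same computation on a handful of made-up logits, written in plain Python so it runs without NumPy or scikit-learn. The helper names and the toy values are illustrative assumptions, not part of the notebook.

```python
def argmax(row):
    """Index of the largest value in a list."""
    return max(range(len(row)), key=row.__getitem__)


def accuracy_from_logits(logits, labels):
    """Fraction of rows whose highest-scoring class matches the label."""
    predictions = [argmax(row) for row in logits]
    correct = sum(int(p == y) for p, y in zip(predictions, labels))
    return correct / len(labels)


# One row of two logits per review: [score for negative, score for positive].
toy_logits = [[2.1, -1.3], [-0.4, 0.9], [0.2, 0.1], [-1.0, 3.0]]
toy_labels = [0, 1, 1, 1]
print(accuracy_from_logits(toy_logits, toy_labels))  # 0.75
```

The third review is the only miss: its logits slightly favor class 0 while the label is 1.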


In [11]:
import transformers
print(transformers.__version__)


4.53.1


## 6.  Set training arguments

In [12]:
training_args = TrainingArguments(
    output_dir="./results",
    eval_strategy="epoch",
    save_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    num_train_epochs=2,
    weight_decay=0.01,
    logging_dir='./logs',
    logging_steps=10,
    report_to="none",
)
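
These settings determine how many optimizer steps the run performs. The arithmetic can be sketched as follows, assuming a single device and no gradient accumulation (the defaults here); `optimizer_steps` is a hypothetical helper, not a `transformers` function.

```python
import math


def optimizer_steps(num_examples, batch_size, num_epochs):
    """Total optimizer steps, assuming one device and no gradient accumulation."""
    steps_per_epoch = math.ceil(num_examples / batch_size)
    return steps_per_epoch * num_epochs


total = optimizer_steps(num_examples=2000, batch_size=8, num_epochs=2)
print(total)        # 500
print(total // 10)  # 50 log entries with logging_steps=10
```

The result, 500, matches the `global_step=500` reported by the training output in section 7.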

## 7.  Train the model

Training takes roughly 8 to 10 minutes on a GPU. The run below reports a `train_runtime` of about 487 seconds.

In [13]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["test"],
    processing_class=tokenizer,  # replaces the deprecated `tokenizer` argument
    compute_metrics=compute_metrics
)

trainer.train()



| Epoch | Training Loss | Validation Loss | Accuracy |
|-------|---------------|-----------------|----------|
| 1     | 0.3649        | 0.279812        | 0.887    |
| 2     | 0.2938        | 0.37447         | 0.906    |


TrainOutput(global_step=500, training_loss=0.30332061201334, metrics={'train_runtime': 487.2281, 'train_samples_per_second': 8.21, 'train_steps_per_second': 1.026, 'total_flos': 1052444221440000.0, 'train_loss': 0.30332061201334, 'epoch': 2.0})

## 8.  Evaluate the model

In [14]:
metrics = trainer.evaluate()
print(f"Final accuracy on IMDB: {metrics['eval_accuracy']:.4f}")


Final accuracy on IMDB: 0.9060


## 9.  Save the trained model

In [15]:
model.save_pretrained("imdb_sentiment_model")
tokenizer.save_pretrained("imdb_sentiment_model")


('imdb_sentiment_model/tokenizer_config.json',
 'imdb_sentiment_model/special_tokens_map.json',
 'imdb_sentiment_model/vocab.txt',
 'imdb_sentiment_model/added_tokens.json',
 'imdb_sentiment_model/tokenizer.json')
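
The saved directory can later be loaded without retraining. The sketch below is one possible way to do that with the `pipeline` helper from `transformers`. It assumes the `imdb_sentiment_model` directory from the cell above exists. `load_sentiment_pipeline` and `readable_label` are hypothetical names. Because the notebook does not set `id2label`, the pipeline is expected to return the default names `LABEL_0` and `LABEL_1`, which are mapped to readable strings here.

```python
def load_sentiment_pipeline(model_dir="imdb_sentiment_model"):
    """Build a text-classification pipeline from a saved model directory."""
    # Imported lazily so the helpers below can be used without transformers.
    from transformers import pipeline
    return pipeline("text-classification", model=model_dir, tokenizer=model_dir)


# Default label names used when the model config does not define id2label.
LABEL_NAMES = {"LABEL_0": "Negative", "LABEL_1": "Positive"}


def readable_label(raw_label):
    """Map a default pipeline label to a readable name; pass others through."""
    return LABEL_NAMES.get(raw_label, raw_label)


# Usage (requires the saved model directory, so it is left commented out):
# classifier = load_sentiment_pipeline()
# result = classifier("the movie was boring!")[0]
# print(readable_label(result["label"]), result["score"])
```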

## 10.  Classify a custom review

In [17]:
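def predict_sentiment(text):
    """Return "Positive" or "Negative" for a single review string."""
    model.eval()  # disable dropout for inference
    # Tokenize and move the tensors to the same device as the model
    inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True).to(model.device)
    with torch.no_grad():  # gradients are not needed at inference time
        outputs = model(**inputs)
    prediction = torch.argmax(outputs.logits, dim=-1).item()
    # IMDB labels: 0 = negative, 1 = positive
    return "Positive" if prediction == 1 else "Negative"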

user_review = input("Write a movie review: ")
print("Predicted sentiment:", predict_sentiment(user_review))


Write a movie review: the movie was boring!
Predicted sentiment: Negative
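
Taking the `argmax` alone hides how confident the model is. Applying a softmax to the two logits yields class probabilities. The sketch below shows the idea in plain Python with made-up logits; in the notebook the list could come from `outputs.logits[0].tolist()`. The helper names are illustrative assumptions.

```python
import math


def softmax(logits):
    """Convert raw scores to probabilities that sum to 1."""
    peak = max(logits)  # subtracting the max avoids overflow in exp
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def label_with_confidence(logits, names=("Negative", "Positive")):
    """Return the most likely label name and its probability."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return names[best], probs[best]


label, confidence = label_with_confidence([1.8, -1.2])  # hypothetical logits
print(f"{label} ({confidence:.1%})")  # Negative (95.3%)
```

Softmax probabilities from a fine-tuned classifier are not necessarily well calibrated, so they are best read as a rough confidence signal.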
