# IMDb Sentiment Analysis - Fine-tuned DistilBERT Model

This project fine-tunes a pre-trained DistilBERT model (`distilbert-base-uncased`) from Hugging Face on the IMDb movie reviews dataset for binary sentiment classification (positive/negative).

## Project Overview

- Task: Sentiment Analysis (Binary Classification)
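- Dataset: IMDb (25,000 training reviews, 25,000 test reviews)
- Model: distilbert-base-uncased, fine-tuned with Transformers and PyTorch
- Accuracy: 92.85% after one epoch, measured on the 25,000-review IMDb test split (used as the validation set)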

## Sample Usage

### Load the model and tokenizer

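```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# "saved_model" is the directory written by save_pretrained() after training
tokenizer = AutoTokenizer.from_pretrained("saved_model")
model = AutoModelForSequenceClassification.from_pretrained("saved_model")
```

## Training

The steps below come from the training notebook, which was run in Google Colab.

### Load the dataset

```python
from datasets import load_dataset

dataset = load_dataset("imdb")
```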


### Inspect the dataset

```python
import datasets

print(datasets.__version__)   # 3.6.0
print(dataset.keys())         # dict_keys(['train', 'test', 'unsupervised'])
print(len(dataset["train"]))  # 25000
print(len(dataset["test"]))   # 25000

# Look at the first training example
dataset["train"][0]
```

The first training example starts as follows (the output is cut off in the notebook export):

```text
{'text': 'I rented I AM CURIOUS-YELLOW from my video store because of all the controversy that surrounded it when it was first released in 1967. I also heard that at first it was seized by U.S. customs if it ever tried to enter this country, therefore being a fan of films considered "controversial" I really had to see this for myself.<br /><br />The plot is centered around a young Swedish drama student named Lena who wants to learn everything she can about life. In particular she wants to focus her attentions to making some sort of documentary on what the average Swede thought about certain political issues such as the Vietnam War and race issues in the United States. In between asking politicians and ordinary denizens of Stockholm about their opinions on politics, she has sex with her drama teacher, classmates, and married men.<br /><br />What kills me about I AM CURIOUS-YELLOW is that 40 years ago, this was considered pornographic. Really, the sex and nudity scenes are few and far be
```
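
The sample review above contains raw `<br />` tags. The notebook tokenizes the text as-is, so the following is only an optional sketch of how the markup could be stripped first. The helper name `clean_review` is hypothetical and not part of the original notebook.

```python
import html
import re

# Matches <br>, <br/>, and <br /> in any letter case
BR_TAG = re.compile(r"<br\s*/?>", flags=re.IGNORECASE)


def clean_review(text: str) -> str:
    """Replace HTML line breaks with spaces and collapse repeated whitespace."""
    text = BR_TAG.sub(" ", text)
    text = html.unescape(text)  # e.g. "&amp;" becomes "&"
    return re.sub(r"\s+", " ", text).strip()


sample = "I really had to see this for myself.<br /><br />The plot is centered around a young Swedish drama student."
print(clean_review(sample))
# I really had to see this for myself. The plot is centered around a young Swedish drama student.
```

If you wanted to use it, one option would be to apply it with `dataset.map` before tokenizing. Whether it changes accuracy was not measured here.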


### Tokenize the reviews

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
```


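```python
def tokenizer_function(example):
    # Pad every review to the tokenizer's maximum length (512 tokens for this
    # checkpoint) and truncate reviews that are longer
    return tokenizer(example["text"], padding="max_length", truncation=True)


# batched=True passes batches of examples to the function, which is faster
tokenized_datasets = dataset.map(tokenizer_function, batched=True)
```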
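
To make the effect of `padding="max_length"` and `truncation=True` concrete, here is a toy sketch in plain Python. It is not the tokenizer's implementation: the function `pad_or_truncate` and the token ids are illustrative, and the real tokenizer also keeps the special end-of-sequence token when it truncates.

```python
def pad_or_truncate(token_ids, max_length, pad_id=0):
    """Toy version of padding="max_length" plus truncation=True for one sequence."""
    ids = token_ids[:max_length]          # truncation: drop tokens past max_length
    n_pad = max_length - len(ids)         # how many padding positions are needed
    return {
        "input_ids": ids + [pad_id] * n_pad,              # padding fills up to max_length
        "attention_mask": [1] * len(ids) + [0] * n_pad,   # 1 = real token, 0 = padding
    }


short = pad_or_truncate([101, 2023, 3185, 102], max_length=6)
long = pad_or_truncate(list(range(1, 11)), max_length=6)

print(short)  # {'input_ids': [101, 2023, 3185, 102, 0, 0], 'attention_mask': [1, 1, 1, 1, 0, 0]}
print(long)   # {'input_ids': [1, 2, 3, 4, 5, 6], 'attention_mask': [1, 1, 1, 1, 1, 1]}
```

Every tokenized review therefore has the same length, which makes batching simple at the cost of computing over padding for short reviews.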


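### Prepare PyTorch tensors

```python
import torch
from torch.utils.data import DataLoader

# Return PyTorch tensors for the columns the model needs
tokenized_datasets.set_format(type="torch", columns=["input_ids", "attention_mask", "label"])

# Note: the Trainer used later builds its own batches, so these loaders
# are only needed for a hand-written training loop
train_dataloader = DataLoader(tokenized_datasets["train"], batch_size=16, shuffle=True)
eval_dataloader = DataLoader(tokenized_datasets["test"], batch_size=16)
```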

### Load the model

```python
from transformers import AutoModelForSequenceClassification

# num_labels=2 adds a two-class classification head (negative/positive)
model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased", num_labels=2)
```

Transformers warns here that some weights of `DistilBertForSequenceClassification` (`classifier.bias`, `classifier.weight`, `pre_classifier.bias`, `pre_classifier.weight`) are newly initialized and that the model should be trained on a downstream task before use. This is expected, because the classification head is not part of the `distilbert-base-uncased` checkpoint.


### Set up the Trainer

The accuracy metric comes from the `evaluate` library. In a notebook, install it with:

```python
%pip install evaluate
```

```python
from transformers import TrainingArguments, Trainer
import numpy as np
import evaluate

accuracy_metric = evaluate.load("accuracy")


def compute_metrics(eval_pred):
    # eval_pred holds the model's raw scores and the true labels
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)  # pick the higher-scoring class
    return accuracy_metric.compute(predictions=predictions, references=labels)


training_args = TrainingArguments(
    output_dir="./results",
    eval_strategy="epoch",           # evaluate at the end of each epoch
    learning_rate=2e-5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    num_train_epochs=1,
    weight_decay=0.01,
    fp16=True,                       # mixed precision; needs a supported GPU
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["test"],  # the IMDb test split serves as the validation set
    compute_metrics=compute_metrics,
)
```
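
As a worked example of what `compute_metrics` calculates, the sketch below reproduces the argmax-then-accuracy step in plain Python. The logits and labels are made up for illustration, and the helper names `argmax` and `accuracy` are not from the notebook.

```python
def argmax(row):
    """Index of the largest value in a list (what np.argmax does along the last axis)."""
    return max(range(len(row)), key=lambda i: row[i])


def accuracy(logits, labels):
    """Fraction of rows whose highest-scoring class equals the true label."""
    predictions = [argmax(row) for row in logits]
    correct = sum(int(p == y) for p, y in zip(predictions, labels))
    return {"accuracy": correct / len(labels)}


# Hypothetical logits for four reviews: column 0 = negative, column 1 = positive
logits = [[2.1, -1.3], [-0.4, 0.9], [0.2, 0.1], [-1.5, 2.2]]
labels = [0, 1, 1, 1]

print(accuracy(logits, labels))  # {'accuracy': 0.75}
```

Three of the four predictions (0, 1, 0, 1) match the labels, which gives 0.75.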

### Train

```python
trainer.train()
```

| Epoch | Training Loss | Validation Loss | Accuracy |
|-------|---------------|-----------------|----------|
| 1     | 0.2408        | 0.230765        | 0.92852  |

```text
TrainOutput(global_step=3125, training_loss=0.2066149475097656, metrics={'train_runtime': 626.9477, 'train_samples_per_second': 39.876, 'train_steps_per_second': 4.984, 'total_flos': 3311684966400000.0, 'train_loss': 0.2066149475097656, 'epoch': 1.0})
```
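
The numbers in `TrainOutput` can be checked with a little arithmetic. This sketch assumes a single device and no gradient accumulation, which is consistent with the reported step count but is not stated explicitly in the notebook.

```python
import math

train_examples = 25_000    # len(dataset["train"])
batch_size = 8             # per_device_train_batch_size
epochs = 1                 # num_train_epochs
train_runtime = 626.9477   # seconds, from TrainOutput

steps_per_epoch = math.ceil(train_examples / batch_size)
total_steps = steps_per_epoch * epochs
samples_per_second = train_examples * epochs / train_runtime
steps_per_second = total_steps / train_runtime

print(total_steps)                   # 3125
print(round(samples_per_second, 3))  # 39.876
print(round(steps_per_second, 3))    # 4.984
```

All three values match the reported `global_step`, `train_samples_per_second`, and `train_steps_per_second`.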

### Save the model and tokenizer

```python
model.save_pretrained("./saved_model")
tokenizer.save_pretrained("./saved_model")
```

```text
('./saved_model/tokenizer_config.json',
 './saved_model/special_tokens_map.json',
 './saved_model/vocab.txt',
 './saved_model/added_tokens.json',
 './saved_model/tokenizer.json')
```

### Mount Google Drive (Colab only)

```python
from google.colab import drive

drive.mount("/content/drive")  # prints: Mounted at /content/drive
```


### Run inference

```python
model.eval()  # disable dropout for inference

text = "This movie was amazing and full of emotions!"
inputs = tokenizer(text, return_tensors="pt").to(model.device)
with torch.no_grad():  # gradients are not needed for prediction
    outputs = model(**inputs)
prediction = outputs.logits.argmax(-1).item()

print("Positive" if prediction == 1 else "Negative")  # Positive
```

```python
text = "This movie was terrible. The acting was bad and the story was boring."
inputs = tokenizer(text, return_tensors="pt").to(model.device)
with torch.no_grad():
    outputs = model(**inputs)
prediction = outputs.logits.argmax(-1).item()

print("Positive" if prediction == 1 else "Negative")  # Negative
```
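
The inference code above reports only the winning class. If a confidence score is also useful, the logits can be passed through a softmax. The sketch below shows the idea in plain Python with hypothetical logits, not real model output. The names `softmax`, `describe`, and `ID2LABEL` are illustrative.

```python
import math


def softmax(logits):
    """Convert raw scores to probabilities that sum to 1."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]  # subtract the max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]


# Label ids as used by the print statements above: 1 = positive, otherwise negative
ID2LABEL = {0: "Negative", 1: "Positive"}


def describe(logits):
    """Return the predicted label and its probability."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=lambda i: probs[i])
    return ID2LABEL[best], probs[best]


label, confidence = describe([-2.0, 2.0])  # hypothetical logits
print(label, round(confidence, 3))  # Positive 0.982
```

With the real model, `torch.softmax(outputs.logits, dim=-1)` should give the equivalent probabilities directly as a tensor.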
