<a href="https://colab.research.google.com/github/MominaSiddiq/bert-sentiment-analysis/blob/main/Bert_sentiment_analysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Installation

In [None]:
# ✅ Install core libraries with compatible versions
!pip install -q transformers==4.53.0 datasets accelerate fsspec==2023.6.0


# Imports


In [19]:
# Import essential libraries for working with transformers and datasets
from datasets import load_dataset                    # For loading the IMDb dataset
from transformers import (BertTokenizer,             # Tokenizer for BERT
                          BertForSequenceClassification,  # Pretrained BERT model for sentiment classification
                          Trainer,                   # Trainer handles the training loop
                          TrainingArguments)         # Used to define training configurations
import torch                                          # PyTorch backend
import numpy as np
from sklearn.metrics import accuracy_score, precision_recall_fscore_support


In [3]:
import transformers
print(transformers.__version__)
print(transformers.TrainingArguments.__init__.__code__.co_varnames)


4.53.0
('self', 'output_dir', 'overwrite_output_dir', 'do_train', 'do_eval', 'do_predict', 'eval_strategy', 'prediction_loss_only', 'per_device_train_batch_size', 'per_device_eval_batch_size', 'per_gpu_train_batch_size', 'per_gpu_eval_batch_size', 'gradient_accumulation_steps', 'eval_accumulation_steps', 'eval_delay', 'torch_empty_cache_steps', 'learning_rate', 'weight_decay', 'adam_beta1', 'adam_beta2', 'adam_epsilon', 'max_grad_norm', 'num_train_epochs', 'max_steps', 'lr_scheduler_type', 'lr_scheduler_kwargs', 'warmup_ratio', 'warmup_steps', 'log_level', 'log_level_replica', 'log_on_each_node', 'logging_dir', 'logging_strategy', 'logging_first_step', 'logging_steps', 'logging_nan_inf_filter', 'save_strategy', 'save_steps', 'save_total_limit', 'save_safetensors', 'save_on_each_node', 'save_only_model', 'restore_callback_states_from_checkpoint', 'no_cuda', 'use_cpu', 'use_mps_device', 'seed', 'data_seed', 'jit_mode_eval', 'use_ipex', 'bf16', 'fp16', 'fp16_opt_level', 'half_precision_ba

# Load IMDb Dataset

Load the IMDb movie reviews dataset using Hugging Face's `datasets` library. This dataset contains 25,000 labeled movie reviews for training and 25,000 for testing, with binary sentiment labels: `0` for negative, and `1` for positive.


In [None]:
# Load the IMDb dataset from Hugging Face
# The dataset contains 25,000 training and 25,000 test examples
dataset = load_dataset("imdb")

# Display the dataset structure
print(dataset)


## Printed Sample

Below, a positive and a negative example from the dataset is printed to better understand the data.


In [5]:
# Instead of printing full text, just show first 300 characters
print("Sample Negative Review:\n")  # Show a sample of negitive review
print(dataset['train'][0]['text'][:300])
print("Label:", dataset['train'][0]['label'])

print("\nSample Positive Review:\n") # Show a sample of positive review
print(dataset['train'][1]['text'][:300])
print("Label:", dataset['train'][1]['label'])



Sample Negative Review:

I rented I AM CURIOUS-YELLOW from my video store because of all the controversy that surrounded it when it was first released in 1967. I also heard that at first it was seized by U.S. customs if it ever tried to enter this country, therefore being a fan of films considered "controversial" I really h
Label: 0

Sample Positive Review:

"I Am Curious: Yellow" is a risible and pretentious steaming pile. It doesn't matter what one's political views are because this film can hardly be taken seriously on any level. As for the claim that frontal male nudity is an automatic NC-17, that isn't true. I've seen R-rated films with male nudity
Label: 0


# Tokenizing the Dataset

The text data is tokenized using a pretrained BERT tokenizer.
Each movie review is converted into input tokens and padded or truncated to a fixed length.
The tokenizer also generates attention masks, which indicate which tokens are actual input versus padding.


## Load BERT Tokenizer

In [None]:
# Load the pretrained BERT tokenizer (base uncased model)
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")


## Define a Tokenization Function

In [7]:
# Define a function that will tokenize the text data
def tokenize_function(example):
    return tokenizer(
        example["text"],
        padding="max_length",       # pad all sequences to max_length
        truncation=True,            # truncate reviews longer than max_length
        max_length=512              # BERT supports max 512 tokens
    )


## Apply Tokenization to the Dataset

In [None]:
# Apply tokenization to the entire dataset
# This creates new fields: input_ids, token_type_ids, attention_mask
tokenized_datasets = dataset.map(tokenize_function, batched=True)


## Remove Unused Columns

In [9]:
# Remove the original text column to keep only tokenized inputs
tokenized_datasets = tokenized_datasets.remove_columns(["text"])


## Set Format for PyTorch

In [10]:
# Set the dataset format for PyTorch (input_ids, attention_mask, labels)
tokenized_datasets.set_format("torch")


## Debug Check

In [11]:
# Preview one tokenized example
# Temporarily remove formatting to preview
tokenized_datasets.reset_format()
print(tokenized_datasets["train"][0])

{'label': 0, 'input_ids': [101, 1045, 12524, 1045, 2572, 8025, 1011, 3756, 2013, 2026, 2678, 3573, 2138, 1997, 2035, 1996, 6704, 2008, 5129, 2009, 2043, 2009, 2001, 2034, 2207, 1999, 3476, 1012, 1045, 2036, 2657, 2008, 2012, 2034, 2009, 2001, 8243, 2011, 1057, 1012, 1055, 1012, 8205, 2065, 2009, 2412, 2699, 2000, 4607, 2023, 2406, 1010, 3568, 2108, 1037, 5470, 1997, 3152, 2641, 1000, 6801, 1000, 1045, 2428, 2018, 2000, 2156, 2023, 2005, 2870, 1012, 1026, 7987, 1013, 1028, 1026, 7987, 1013, 1028, 1996, 5436, 2003, 8857, 2105, 1037, 2402, 4467, 3689, 3076, 2315, 14229, 2040, 4122, 2000, 4553, 2673, 2016, 2064, 2055, 2166, 1012, 1999, 3327, 2016, 4122, 2000, 3579, 2014, 3086, 2015, 2000, 2437, 2070, 4066, 1997, 4516, 2006, 2054, 1996, 2779, 25430, 14728, 2245, 2055, 3056, 2576, 3314, 2107, 2004, 1996, 5148, 2162, 1998, 2679, 3314, 1999, 1996, 2142, 2163, 1012, 1999, 2090, 4851, 8801, 1998, 6623, 7939, 4697, 3619, 1997, 8947, 2055, 2037, 10740, 2006, 4331, 1010, 2016, 2038, 3348, 2007, 201

In [12]:
# Set it back to torch format
tokenized_datasets.set_format("torch")


# Defining and Training the BERT Model

A pretrained BERT model (`bert-base-uncased`) is loaded for sequence classification.
The model is then fine-tuned on the IMDb movie review dataset using the Hugging Face `Trainer` API.
Training arguments such as learning rate, batch size, and number of epochs are defined to control the fine-tuning process.


## Load the BERT Model

In [None]:
# Load a pretrained BERT model for sequence classification with two labels
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)


## Define Metrics Function


In [34]:
# Define the compute metrics function
def compute_metrics(pred):
    labels = pred.label_ids
    preds = np.argmax(pred.predictions, axis=1)
    precision, recall, f1, _ = precision_recall_fscore_support(labels, preds, average='binary')
    acc = accuracy_score(labels, preds)
    return {
        'accuracy': acc,
        'f1': f1,
        'precision': precision,
        'recall': recall
    }


## Define Training Arguments

In [35]:
# Define training parameters
training_args = TrainingArguments(
    output_dir="./results",              # Save checkpoints here
    save_strategy="epoch",               # Save model each epoch
    logging_dir="./logs",                # Directory for logs
    logging_steps=10,                    # Log every 10 steps
    per_device_train_batch_size=8,       # Training batch size
    per_device_eval_batch_size=8,        # Evaluation batch size
    num_train_epochs=2,                  # Number of epochs
    learning_rate=2e-5,                  # Learning rate
    weight_decay=0.01,                   # Weight decay to avoid overfitting
    load_best_model_at_end=True,         # Load best model after training
    eval_strategy="epoch",               # Evaluate every epoch
    metric_for_best_model="f1",          # Use F1 to choose best model
    greater_is_better=True,              # Higher F1 is better
    report_to="none"                     # Disable W&B integration
)

## Split Dataset for Training/Evaluation

In [36]:
# Split the dataset into training and validation sets (e.g., 90% train, 10% eval)
tokenized_datasets = tokenized_datasets["train"].train_test_split(test_size=0.1)

# Rename splits for clarity
tokenized_datasets["train"] = tokenized_datasets["train"]
tokenized_datasets["eval"] = tokenized_datasets["test"]

# Set dataset format to PyTorch to avoid NumPy Arrow error
tokenized_datasets.set_format("torch")

# Print dataset sizes
print(f"Training samples: {len(tokenized_datasets['train'])}")
print(f"Evaluation samples: {len(tokenized_datasets['eval'])}")


Training samples: 14761
Evaluation samples: 1641


## Create the Trainer

In [37]:
# Create Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    compute_metrics=compute_metrics
)


## Train the Model

In [38]:
# Train the model
trainer.train()


ValueError: Unable to avoid copy while creating an array as requested.
If using `np.array(obj, copy=False)` replace it with `np.asarray(obj)` to allow a copy when needed (no behavior change in NumPy 1.x).
For more details, see https://numpy.org/devdocs/numpy_2_0_migration_guide.html#adapting-to-changes-in-the-copy-keyword.