
# Installation

In [None]:
# Install the required Hugging Face libraries.
# -U upgrades transformers to the latest release in the same step,
# which avoids Trainer-related errors caused by an outdated version.
!pip install -q -U transformers datasets accelerate


In [None]:
# Fix for dataset loading issue: pin fsspec to version 2023.6.0
# (this is a fixed release, not the latest one)
!pip install -U fsspec==2023.6.0


# Imports


In [None]:
# Import essential libraries for working with transformers and datasets
from datasets import load_dataset                    # For loading the IMDb dataset
from transformers import (BertTokenizer,             # Tokenizer for BERT
                          BertForSequenceClassification,  # Pretrained BERT model for sentiment classification
                          Trainer,                   # Trainer handles the training loop
                          TrainingArguments)         # Used to define training configurations
import torch                                          # PyTorch backend


# Load IMDb Dataset

Load the IMDb movie reviews dataset using Hugging Face's `datasets` library. This dataset contains 25,000 labeled movie reviews for training and 25,000 for testing, with binary sentiment labels: `0` for negative, and `1` for positive.


In [None]:
# Load the IMDb dataset from Hugging Face
# The dataset contains 25,000 training and 25,000 test examples
dataset = load_dataset("imdb")

# Display the dataset structure
print(dataset)


# Sample Reviews

Below, one negative and one positive example from the training split are printed to show what the data looks like.


In [None]:
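# Show only the first 300 characters of each review instead of the full text.
# Examples are selected by label rather than by position, because neighbouring
# rows of the train split can share the same label.
train = dataset["train"]
negative = next(ex for ex in train if ex["label"] == 0)  # first negative review
positive = next(ex for ex in train if ex["label"] == 1)  # first positive review

print("Sample Negative Review:\n")
print(negative["text"][:300])
print("Label:", negative["label"])

print("\nSample Positive Review:\n")
print(positive["text"][:300])
print("Label:", positive["label"])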



# Tokenizing the Dataset

The text data is tokenized using a pretrained BERT tokenizer.
Each movie review is converted into input tokens and padded or truncated to a fixed length.
The tokenizer also generates attention masks, which indicate which tokens are actual input versus padding.
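
To make the padding, truncation, and attention-mask behaviour concrete, here is a minimal pure-Python sketch. It is not the tokenizer's actual implementation: `pad_or_truncate` is a hypothetical helper, the middle token ids are made up, and the sketch ignores details such as keeping the closing `[SEP]` token when a sequence is cut. It only assumes the `bert-base-uncased` ids `101` (`[CLS]`), `102` (`[SEP]`), and `0` (`[PAD]`).

```python
def pad_or_truncate(token_ids, max_length, pad_id=0):
    """Return (input_ids, attention_mask), each exactly max_length long.

    Hypothetical helper for illustration only; the real tokenizer does this
    internally when padding="max_length" and truncation=True are set.
    """
    kept = token_ids[:max_length]              # truncate sequences that are too long
    n_pad = max_length - len(kept)             # number of padding slots needed
    input_ids = kept + [pad_id] * n_pad        # fill the remainder with the pad id
    attention_mask = [1] * len(kept) + [0] * n_pad  # 1 = real token, 0 = padding
    return input_ids, attention_mask


# A short sequence is padded up to max_length.
short_ids, short_mask = pad_or_truncate([101, 2023, 3185, 102], max_length=6)
print(short_ids)   # [101, 2023, 3185, 102, 0, 0]
print(short_mask)  # [1, 1, 1, 1, 0, 0]

# A long sequence is cut down to max_length and needs no padding.
long_ids, long_mask = pad_or_truncate(list(range(1, 11)), max_length=6)
print(long_ids)    # [1, 2, 3, 4, 5, 6]
print(long_mask)   # [1, 1, 1, 1, 1, 1]
```

The model uses the mask to ignore padded positions, so padding changes the tensor shape but not which tokens are attended to.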


## Load BERT Tokenizer

In [None]:
# Load the pretrained BERT tokenizer (base uncased model)
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")


## Define a Tokenization Function

In [None]:
# Define a function that will tokenize the text data
def tokenize_function(example):
    return tokenizer(
        example["text"],
        padding="max_length",       # pad all sequences to max_length
        truncation=True,            # truncate reviews longer than max_length
        max_length=512              # BERT supports max 512 tokens
    )


## Apply Tokenization to the Dataset

In [None]:
# Apply tokenization to the entire dataset
# This creates new fields: input_ids, token_type_ids, attention_mask
tokenized_datasets = dataset.map(tokenize_function, batched=True)


## Remove Unused Columns

In [None]:
# Remove the original text column to keep only tokenized inputs
tokenized_datasets = tokenized_datasets.remove_columns(["text"])


## Set Format for PyTorch

In [None]:
# Return all columns (input_ids, token_type_ids, attention_mask, label) as PyTorch tensors
tokenized_datasets.set_format("torch")


## Debug Check

In [None]:
# Preview one tokenized example
# Temporarily remove formatting to preview
tokenized_datasets.reset_format()
print(tokenized_datasets["train"][0])

In [None]:
# Set it back to torch format
tokenized_datasets.set_format("torch")


# Defining and Training the BERT Model

A pretrained BERT model (`bert-base-uncased`) is loaded for sequence classification.
The model is then fine-tuned on the IMDb movie review dataset using the Hugging Face `Trainer` API.
Training arguments such as learning rate, batch size, and number of epochs are defined to control the fine-tuning process.


## Load the BERT Model

In [None]:
# Load a pretrained BERT model for sequence classification with two labels
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)
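
With `num_labels=2`, the classification head produces two logits per review, one for each label. The sketch below shows, in plain Python, how such a pair of logits can be turned into a predicted label and a confidence score. The helper names `softmax` and `predict_label` and the example logits are illustrative assumptions, not values produced by the model above.

```python
import math


def softmax(logits):
    """Convert raw logits into probabilities that sum to 1."""
    m = max(logits)                             # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict_label(logits):
    """Return (label, probability) for the highest-scoring class."""
    probs = softmax(logits)
    label = max(range(len(probs)), key=probs.__getitem__)  # argmax over classes
    return label, probs[label]


# Hypothetical logits for one review: index 0 = negative, index 1 = positive.
label, confidence = predict_label([-1.2, 2.3])
print(label)                  # 1, i.e. positive
print(round(confidence, 3))
```

In practice the same argmax would be applied to the logits tensor returned by the model, one row per review in the batch.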


## Define Training Arguments

In [None]:
# Define training parameters for the Trainer API
training_args = TrainingArguments(
    output_dir="./results",              # output directory for checkpoints
    eval_strategy="epoch",               # evaluate every epoch (named evaluation_strategy in older transformers releases)
    save_strategy="epoch",               # save model every epoch
    per_device_train_batch_size=8,       # batch size for training
    per_device_eval_batch_size=8,        # batch size for evaluation
    num_train_epochs=2,                  # number of training epochs
    learning_rate=2e-5,                  # learning rate
    weight_decay=0.01,                   # weight decay to reduce overfitting
    logging_dir="./logs",                # directory for logs
    logging_steps=10,                    # log every 10 steps
    load_best_model_at_end=True          # load best model after training
)
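
To get a feel for what these settings imply, the following sketch works out the number of optimizer steps and log entries for the 25,000-example training split. It assumes a single device, no gradient accumulation, and a learning rate that decays linearly to zero without warmup. These are assumptions about the run, not values read from the `Trainer`, and the helpers `training_schedule` and `linear_lr` are hypothetical.

```python
import math


def training_schedule(num_examples, batch_size, epochs, logging_steps):
    """Estimate step counts for a run (single device, no gradient accumulation)."""
    steps_per_epoch = math.ceil(num_examples / batch_size)
    total_steps = steps_per_epoch * epochs
    return {
        "steps_per_epoch": steps_per_epoch,
        "total_steps": total_steps,
        "log_entries": total_steps // logging_steps,  # one entry every logging_steps
    }


def linear_lr(step, total_steps, base_lr):
    """Learning rate under a linear decay to zero with no warmup (assumed schedule)."""
    return base_lr * max(0.0, (total_steps - step) / total_steps)


schedule = training_schedule(num_examples=25_000, batch_size=8,
                             epochs=2, logging_steps=10)
print(schedule)
# {'steps_per_epoch': 3125, 'total_steps': 6250, 'log_entries': 625}

# Learning rate at the end of the first epoch, halfway through training.
print(linear_lr(3125, schedule["total_steps"], base_lr=2e-5))
```

With `logging_steps=10`, the run emits several hundred log entries; a larger value reduces log noise without affecting training.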
