<a href="https://colab.research.google.com/github/AnovaYoung/Natural-Language-Processing/blob/main/Sentiment_Analysis_with_BERT.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Sentiment Analysis with BERT on IMDb Movie Reviews

In this assignment, I will use the IMDb movie review dataset and the BERT model from the Transformers library to build a sentiment analysis model that predicts whether a movie review is positive or negative.

In [1]:
import zipfile
import os

zip_file_path = '/content/archive (4).zip'

extract_dir = '/content/imdb_dataset/'

with zipfile.ZipFile(zip_file_path, 'r') as zip_ref:
    zip_ref.extractall(extract_dir)

extracted_files = os.listdir(extract_dir)
extracted_files

['IMDB Dataset.csv']

Text Preprocessing:

Tokenize the movie reviews using the BERT tokenizer.

Convert the tokenized reviews into input features suitable for BERT.

In [2]:
import pandas as pd
from transformers import BertTokenizer

dataset_path = '/content/imdb_dataset/IMDB Dataset.csv'
df = pd.read_csv(dataset_path)

print(df.head())

# Initialize the BERT tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

# Tokenize the text in the 'review' column and convert it to input format for BERT
def tokenize_reviews(reviews):
    return tokenizer(
        reviews.tolist(),
        padding=True,
        truncation=True,
        max_length=512,
        return_tensors='pt'
    )

# Apply the tokenization to the reviews
tokenized_reviews = tokenize_reviews(df['review'])

print(tokenized_reviews)


                                              review sentiment
0  One of the other reviewers has mentioned that ...  positive
1  A wonderful little production. <br /><br />The...  positive
2  I thought this was a wonderful way to spend ti...  positive
3  Basically there's a family where a little boy ...  negative
4  Petter Mattei's "Love in the Time of Money" is...  positive



{'input_ids': tensor([[ 101, 2028, 1997,  ...,    0,    0,    0],
        [ 101, 1037, 6919,  ...,    0,    0,    0],
        [ 101, 1045, 2245,  ...,    0,    0,    0],
        ...,
        [ 101, 1045, 2572,  ...,    0,    0,    0],
        [ 101, 1045, 1005,  ...,    0,    0,    0],
        [ 101, 2053, 2028,  ...,    0,    0,    0]]), 'token_type_ids': tensor([[0, 0, 0,  ..., 0, 0, 0],
        [0, 0, 0,  ..., 0, 0, 0],
        [0, 0, 0,  ..., 0, 0, 0],
        ...,
        [0, 0, 0,  ..., 0, 0, 0],
        [0, 0, 0,  ..., 0, 0, 0],
        [0, 0, 0,  ..., 0, 0, 0]]), 'attention_mask': tensor([[1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0],
        ...,
        [1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0],
        [1, 1, 1,  ..., 0, 0, 0]])}


In [10]:
label_counts = df['sentiment'].value_counts()
print(label_counts)

sentiment
positive    25000
negative    25000
Name: count, dtype: int64


**Explanation:**

I am using the BERT tokenizer (bert-base-uncased), which lowercases the text and splits it into WordPiece tokens drawn from a 30,522-entry vocabulary.

The tokenize_reviews function converts the reviews into token IDs, padded to the longest review in the call and truncated to 512 tokens, which is the maximum sequence length bert-base-uncased supports.

Setting return_tensors='pt' returns the output as PyTorch tensors, so it can be passed directly to a PyTorch model.
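To make the padding, truncation, and attention-mask behavior concrete, here is a minimal pure-Python sketch. The helper pad_and_truncate and the toy token IDs are hypothetical stand-ins for illustration only; the real tokenizer also inserts special tokens such as [CLS] (ID 101 in the output above) and [SEP], which this sketch ignores.

```python
def pad_and_truncate(sequences, max_length, pad_id=0):
    """Toy version of padding=True, truncation=True for lists of token IDs."""
    # Truncation: drop everything past max_length
    truncated = [seq[:max_length] for seq in sequences]
    # padding=True pads to the longest sequence in the call, not to max_length
    target = max(len(seq) for seq in truncated)

    input_ids, attention_mask = [], []
    for seq in truncated:
        n_pad = target - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding the model should ignore
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return input_ids, attention_mask


toy_ids, toy_mask = pad_and_truncate([[5, 6, 7, 8, 9], [5, 6]], max_length=4)
print(toy_ids)   # [[5, 6, 7, 8], [5, 6, 0, 0]]
print(toy_mask)  # [[1, 1, 1, 1], [1, 1, 0, 0]]
```

Because padding=True pads to the longest sequence in the call, tokenizing all 50,000 reviews at once yields a 50,000 x 512 tensor as soon as at least one review reaches the 512-token limit.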

**Model Training:**

Steps I am going to take:

Load the pre-trained BERT model for sequence classification from the Transformers library.

Fine-tune the BERT model for sentiment analysis.

Implement training loops and loss calculation.

In [3]:
from sklearn.model_selection import train_test_split
import torch
from torch.utils.data import DataLoader, TensorDataset

# Convert sentiment labels to binary (1 for positive, 0 for negative)
labels = df['sentiment'].apply(lambda x: 1 if x == 'positive' else 0).values
labels = torch.tensor(labels)

# Training and validation sets (80/20). Splitting inputs, masks, and labels in a single call
# keeps each attention mask and label paired with its own review.
train_inputs, val_inputs, train_masks, val_masks, train_labels, val_labels = train_test_split(
    tokenized_reviews['input_ids'],
    tokenized_reviews['attention_mask'],
    labels,
    test_size=0.2,
    random_state=42
)

# Create TensorDatasets
train_data = TensorDataset(train_inputs, train_masks, train_labels)
val_data = TensorDataset(val_inputs, val_masks, val_labels)

# Set batch size
batch_size = 8

# Create DataLoaders
train_dataloader = DataLoader(train_data, batch_size=batch_size, shuffle=True)
val_dataloader = DataLoader(val_data, batch_size=batch_size)
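Splitting several aligned arrays is easiest to get right in a single call. The sketch below uses toy lists (hypothetical values, not the IMDb data) to check that passing input IDs, attention masks, and labels together to train_test_split keeps every row's mask and label attached to the same example.

```python
from sklearn.model_selection import train_test_split

# Toy stand-ins: row i carries the value i everywhere, so alignment is easy to verify
toy_ids = [[i, i + 100] for i in range(10)]
toy_masks = [[i] for i in range(10)]
toy_labels = list(range(10))

# One call shuffles all three lists with the same permutation
(tr_ids, va_ids,
 tr_masks, va_masks,
 tr_labels, va_labels) = train_test_split(
    toy_ids, toy_masks, toy_labels, test_size=0.2, random_state=42
)

# Every row should still agree across the three outputs
aligned = all(
    ids[0] == mask[0] == label
    for ids, mask, label in zip(tr_ids + va_ids, tr_masks + va_masks, tr_labels + va_labels)
)
print(len(tr_ids), len(va_ids), aligned)  # 8 2 True
```

Two separate calls only stay aligned when both use the same random_state on inputs of the same length; without a seed, each call shuffles differently and the masks no longer match their reviews.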

**Step 2: Load Pre-trained BERT Model**

In this step I load the pre-trained BERT model for sequence classification.

In [4]:
from transformers import BertForSequenceClassification
from torch.optim import AdamW  # the AdamW shipped in transformers is deprecated in favor of PyTorch's

# Load the BERT model for binary classification
model = BertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=2)

# Move the model to the GPU if one is available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

# Initialize the optimizer
optimizer = AdamW(model.parameters(), lr=2e-5)


Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


**Step 3: Simplified Training Loop (Single Epoch)**


I've been having trouble with the runtime crashing. To avoid that, I will simplify the training loop and run one epoch at a time.

In [None]:
# Training loop for only one epoch
model.train()
total_loss = 0

for batch in train_dataloader:
    input_ids, attention_mask, labels = [b.to(device) for b in batch]

    # This will zero out any previously calculated gradients
    model.zero_grad()

    # Forward pass
    outputs = model(input_ids=input_ids, attention_mask=attention_mask, labels=labels)
    loss = outputs.loss
    total_loss += loss.item()

    # Backward pass (calculates the gradients)
    loss.backward()

    # Update the weights based on the backpropagated gradients
    optimizer.step()

print(f"Training loss for this epoch: {total_loss / len(train_dataloader)}")


KeyboardInterrupt: 

I interrupted this run because of how long it was taking (over an hour and a half for a single epoch). I'm going to switch from BERT to DistilBERT, which is a smaller, faster, and lighter version of BERT. According to its authors, it retains 97% of BERT's language understanding performance while being 40% smaller and 60% faster.

In [5]:
from transformers import DistilBertForSequenceClassification, DistilBertTokenizer

# Initialize DistilBERT tokenizer and model
tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased')
model = DistilBertForSequenceClassification.from_pretrained('distilbert-base-uncased', num_labels=2)

# Move model to GPU
model.to(device)



Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


DistilBertForSequenceClassification(
  (distilbert): DistilBertModel(
    (embeddings): Embeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (transformer): Transformer(
      (layer): ModuleList(
        (0-5): 6 x TransformerBlock(
          (attention): MultiHeadSelfAttention(
            (dropout): Dropout(p=0.1, inplace=False)
            (q_lin): Linear(in_features=768, out_features=768, bias=True)
            (k_lin): Linear(in_features=768, out_features=768, bias=True)
            (v_lin): Linear(in_features=768, out_features=768, bias=True)
            (out_lin): Linear(in_features=768, out_features=768, bias=True)
          )
          (sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
          (ffn): FFN(
            (dropout): Dropout(p=0.1, inplace=False)
 

**Use Mixed Precision Training**

Mixed precision training trains models faster by using both float16 (half precision) and float32 (single precision), which can reduce memory usage and training time significantly.
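As a rough illustration of how mixed precision is commonly combined with gradient accumulation and gradient clipping in PyTorch, here is a minimal sketch. It assumes a generic nn.Module and loss function as stand-ins for DistilBERT; the function name train_one_epoch, the toy model, and the random data are all hypothetical. On a machine without a GPU, autocast and the scaler are disabled and the loop runs in plain float32.

```python
import torch
from torch import nn


def train_one_epoch(model, batches, loss_fn, optimizer, device,
                    accumulation_steps=4, clip_value=1.0):
    """One epoch with autocast, loss scaling, gradient accumulation, and clipping."""
    use_amp = device.type == "cuda"  # only enable mixed precision on a GPU
    scaler = torch.cuda.amp.GradScaler(enabled=use_amp)
    model.train()
    optimizer.zero_grad()
    total_loss = 0.0

    for step, (features, targets) in enumerate(batches, start=1):
        features, targets = features.to(device), targets.to(device)

        # Forward pass runs in reduced precision where autocast considers it safe
        with torch.autocast(device_type=device.type, enabled=use_amp):
            loss = loss_fn(model(features), targets)
        total_loss += loss.item()

        # Divide so the accumulated gradient approximates one larger batch
        scaler.scale(loss / accumulation_steps).backward()

        # Step every accumulation_steps batches, and once more for a final partial group
        if step % accumulation_steps == 0 or step == len(batches):
            scaler.unscale_(optimizer)  # clip in unscaled gradient units
            nn.utils.clip_grad_norm_(model.parameters(), clip_value)
            scaler.step(optimizer)
            scaler.update()
            optimizer.zero_grad()

    return total_loss / len(batches)


# Tiny stand-in model and random data so the sketch runs anywhere
torch.manual_seed(0)
toy_device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
toy_model = nn.Linear(4, 2).to(toy_device)
toy_batches = [(torch.randn(8, 4), torch.randint(0, 2, (8,))) for _ in range(6)]
toy_optimizer = torch.optim.AdamW(toy_model.parameters(), lr=1e-2)
avg_loss = train_one_epoch(toy_model, toy_batches, nn.CrossEntropyLoss(),
                           toy_optimizer, toy_device)
print(f"average loss: {avg_loss:.4f}")
```

With a batch size of 8 and accumulation_steps of 4, the optimizer would see gradients averaged over roughly 32 reviews per update while only 8 are held in memory at a time.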

**First step: Initialize the model, optimizer, and gradient scaler**

In [6]:
from torch.cuda.amp import autocast, GradScaler

# Gradient scaler for mixed precision
scaler = GradScaler()

# Set up the AdamW optimizer
optimizer = AdamW(model.parameters(), lr=2e-5)

# Early stopping and gradient accumulation
accumulation_steps = 4  # Number of batches to accumulate gradients over
best_val_loss = float("inf")  # Keep track of the best validation loss
patience = 2  # Stop if there is no improvement after 2 validation checks
no_improvement = 0  # Counter used for early stopping

# Gradient clipping
clip_value = 1.0  # Clip gradients so exploding gradients do not destabilize training



**Second step: Validation loop for early stopping (run this after each epoch)**

In [7]:
# Val loop for early stopping
model.eval()
total_val_loss = 0
correct_predictions = 0

with torch.no_grad():
    for batch in val_dataloader:
        input_ids, attention_mask, labels = [b.to(device) for b in batch]

        with autocast():
            outputs = model(input_ids=input_ids, attention_mask=attention_mask, labels=labels)
            loss = outputs.loss

            # This calculates accuracy
            logits = outputs.logits
            predictions = torch.argmax(logits, dim=-1)
            correct_predictions += torch.sum(predictions == labels)

        total_val_loss += loss.item()

# Calculate val loss and accuracy
avg_val_loss = total_val_loss / len(val_dataloader)
val_accuracy = correct_predictions.double() / len(val_labels)

print(f"Validation loss: {avg_val_loss}, Validation accuracy: {val_accuracy}")

# Logic for early stopping
if avg_val_loss < best_val_loss:
    best_val_loss = avg_val_loss
    no_improvement = 0  # This code will reset counter if validation loss improves
else:
    no_improvement += 1  # Increment counter if no improvement

if no_improvement >= patience:
    print("Early stopping triggered!")
    stop_training = True  # Set to stop training



Validation loss: 0.69559951171875, Validation accuracy: 0.48300000000000004


In [11]:
tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased')
model = DistilBertForSequenceClassification.from_pretrained('distilbert-base-uncased', num_labels=2)

# Adjust learning rate
optimizer = AdamW(model.parameters(), lr=1e-5)  # Lower learning rate for better convergence

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


DistilBertForSequenceClassification(
  (distilbert): DistilBertModel(
    (embeddings): Embeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (transformer): Transformer(
      (layer): ModuleList(
        (0-5): 6 x TransformerBlock(
          (attention): MultiHeadSelfAttention(
            (dropout): Dropout(p=0.1, inplace=False)
            (q_lin): Linear(in_features=768, out_features=768, bias=True)
            (k_lin): Linear(in_features=768, out_features=768, bias=True)
            (v_lin): Linear(in_features=768, out_features=768, bias=True)
            (out_lin): Linear(in_features=768, out_features=768, bias=True)
          )
          (sa_layer_norm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
          (ffn): FFN(
            (dropout): Dropout(p=0.1, inplace=False)
 

**Reduce Sequence Length to 128**

In [12]:
def tokenize_reviews(reviews):
    return tokenizer(
        reviews.tolist(),
        padding=True,
        truncation=True,
        max_length=128,
        return_tensors='pt'
    )

tokenized_reviews = tokenize_reviews(df['review'])



**Run the Training Loop**

I'll run the updated training loop with the DistilBERT model again, this time with a shorter sequence length, a lower learning rate, and more epochs.

In [14]:
# Training loop for a single epoch
model.train()
total_loss = 0

for step, batch in enumerate(train_dataloader):
    input_ids, attention_mask, labels = [b.to(device) for b in batch]

    optimizer.zero_grad()

    # Forward pass
    outputs = model(input_ids=input_ids, attention_mask=attention_mask, labels=labels)
    loss = outputs.loss
    total_loss += loss.item()

    # Backward pass and optimization
    loss.backward()
    optimizer.step()

print(f"Training loss for this epoch: {total_loss / len(train_dataloader)}")


Training loss for this epoch: 0.29475172882005574


In [16]:
# Validation loop to evaluate the model
model.eval()  # Set model to evaluation mode
total_val_loss = 0
correct_predictions = 0

# No gradient calculations during evaluation
with torch.no_grad():
    for batch in val_dataloader:
        input_ids, attention_mask, labels = [b.to(device) for b in batch]

        # Forward pass for validation
        outputs = model(input_ids=input_ids, attention_mask=attention_mask, labels=labels)
        loss = outputs.loss
        total_val_loss += loss.item()

        # Get the model's predictions
        logits = outputs.logits
        predictions = torch.argmax(logits, dim=-1)

        # Calculate the number of correct predictions
        correct_predictions += torch.sum(predictions == labels)

# Calculate average validation loss and accuracy
avg_val_loss = total_val_loss / len(val_dataloader)
val_accuracy = correct_predictions.double() / len(val_labels)

print(f"Validation loss: {avg_val_loss}, Validation accuracy: {val_accuracy}")


Validation loss: 0.23967382346987726, Validation accuracy: 0.9019


In [17]:
# Early stopping logic
if avg_val_loss < best_val_loss:
    best_val_loss = avg_val_loss
    no_improvement = 0  # Reset counter if validation loss improves
else:
    no_improvement += 1  # Increment counter if no improvement

# Stop training if no improvement
if no_improvement >= patience:
    print("Early stopping triggered!")
    stop_training = True


In [18]:
# Number of epochs
epochs = 4
patience = 2  # Number of epochs to wait before early stopping

stop_training = False
best_val_loss = float('inf')  # best val loss for early stopping
no_improvement = 0

for epoch in range(epochs):
    if stop_training:
        break  # Stop training once early stopping has been triggered

    print(f"Epoch {epoch + 1}/{epochs}")

    # Training phase (same code as before, now repeated for every epoch)
    model.train()
    total_loss = 0

    for step, batch in enumerate(train_dataloader):
        input_ids, attention_mask, labels = [b.to(device) for b in batch]

        optimizer.zero_grad()

        # Forward pass
        outputs = model(input_ids=input_ids, attention_mask=attention_mask, labels=labels)
        loss = outputs.loss
        total_loss += loss.item()

        # Backward pass and optimization
        loss.backward()
        optimizer.step()

    print(f"Training loss for this epoch: {total_loss / len(train_dataloader)}")

    # Validation phase
    model.eval()
    total_val_loss = 0
    correct_predictions = 0

    with torch.no_grad():
        for batch in val_dataloader:
            input_ids, attention_mask, labels = [b.to(device) for b in batch]

            outputs = model(input_ids=input_ids, attention_mask=attention_mask, labels=labels)
            loss = outputs.loss
            total_val_loss += loss.item()

            logits = outputs.logits
            predictions = torch.argmax(logits, dim=-1)
            correct_predictions += torch.sum(predictions == labels)

    avg_val_loss = total_val_loss / len(val_dataloader)
    val_accuracy = correct_predictions.double() / len(val_labels)

    print(f"Validation loss: {avg_val_loss}, Validation accuracy: {val_accuracy}")

    # Early stopping logic
    if avg_val_loss < best_val_loss:
        best_val_loss = avg_val_loss
        no_improvement = 0  # Reset counter if validation loss improves
    else:
        no_improvement += 1  # Increment counter if no improvement

    if no_improvement >= patience:
        print("Early stopping triggered!")
        stop_training = True


Epoch 1/4
Training loss for this epoch: 0.18610273154806345
Validation loss: 0.2423911775164306, Validation accuracy: 0.9011
Epoch 2/4
Training loss for this epoch: 0.11254101849168073
Validation loss: 0.26106351408641787, Validation accuracy: 0.9072
Epoch 3/4
Training loss for this epoch: 0.06520828280560673
Validation loss: 0.3034048312114086, Validation accuracy: 0.9079
Early stopping triggered!



The model fit the training data well, reaching a training loss of 0.0652 in the third epoch and a validation accuracy of 90.79%. Early stopping was triggered because the validation loss failed to improve for two consecutive epochs: it rose from 0.2424 to 0.2611 to 0.3034 while the training loss kept falling and validation accuracy flattened (90.72% to 90.79%). The model is well-trained, but this pattern suggests it started to overfit after the first epoch of this run.
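The early-stopping rule used in the loop can be isolated into a small helper and replayed against the validation losses printed above. This is a minimal sketch; the class name EarlyStopping and the best_epoch attribute are hypothetical additions for illustration, and the losses are rounded from the log.

```python
class EarlyStopping:
    """Stop once validation loss has not improved for `patience` checks in a row."""

    def __init__(self, patience=2):
        self.patience = patience
        self.best_loss = float("inf")
        self.best_epoch = None
        self.no_improvement = 0

    def step(self, epoch, val_loss):
        """Record one epoch's validation loss; return True when training should stop."""
        if val_loss < self.best_loss:
            self.best_loss = val_loss
            self.best_epoch = epoch
            self.no_improvement = 0  # improvement resets the counter
        else:
            self.no_improvement += 1
        return self.no_improvement >= self.patience


val_losses = [0.2424, 0.2611, 0.3034]  # rounded from the log above
stopper = EarlyStopping(patience=2)
stopped_at = None
for epoch, loss in enumerate(val_losses, start=1):
    if stopper.step(epoch, loss):
        stopped_at = epoch
        break

print(stopped_at, stopper.best_epoch)  # 3 1
```

Tracking best_epoch makes it clear which checkpoint had the lowest validation loss, which is the one worth saving if the weights are stored after each epoch.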

# Split the Dataset into Training and Testing Sets

In [22]:
print(f"Number of input_ids: {len(input_ids_list)}")
print(f"Number of labels: {len(labels)}")

Number of input_ids: 50000
Number of labels: 8


In [23]:
# re-convert sentiment labels to binary values (1 for positive, 0 for negative)
labels = df['sentiment'].apply(lambda x: 1 if x == 'positive' else 0).values

assert len(labels) == len(input_ids_list), "Number of labels does not match the number of input samples."

In [24]:
print(f"Number of input_ids: {len(input_ids_list)}")
print(f"Number of labels: {len(labels)}")

Number of input_ids: 50000
Number of labels: 50000


In [26]:
train_inputs, test_inputs, train_labels, test_labels = train_test_split(
    input_ids_list,
    labels,
    test_size=0.2,
    random_state=42
)

train_masks, test_masks = train_test_split(
    attention_mask_list,
    test_size=0.2,
    random_state=42
)


In [27]:
print(f"Training set size: {len(train_inputs)}")
print(f"Testing set size: {len(test_inputs)}")

Training set size: 40000
Testing set size: 10000


# Create DataLoaders

In [29]:
from torch.utils.data import TensorDataset, DataLoader, RandomSampler, SequentialSampler

# Convert lists back to tensors
train_inputs = torch.tensor(train_inputs)
test_inputs = torch.tensor(test_inputs)
train_labels = torch.tensor(train_labels)
test_labels = torch.tensor(test_labels)
train_masks = torch.tensor(train_masks)
test_masks = torch.tensor(test_masks)

# Create TensorDatasets
train_data = TensorDataset(train_inputs, train_masks, train_labels)
test_data = TensorDataset(test_inputs, test_masks, test_labels)

# Create DataLoaders
batch_size = 16

# Training DataLoader
train_dataloader = DataLoader(
    train_data,
    sampler=RandomSampler(train_data),
    batch_size=batch_size
)

# Test DataLoader
test_dataloader = DataLoader(
    test_data,
    sampler=SequentialSampler(test_data),
    batch_size=batch_size
)


# Model Evaluation Using Accuracy, Precision, Recall, and F1-score

In [30]:
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

# This will evaluate the model on the test set
model.eval()
test_predictions = []
test_labels_all = []

with torch.no_grad():
    for batch in test_dataloader:
        input_ids, attention_mask, labels = [b.to(device) for b in batch]

        # Forward pass
        outputs = model(input_ids=input_ids, attention_mask=attention_mask)
        logits = outputs.logits
        predictions = torch.argmax(logits, dim=-1)

        test_predictions.extend(predictions.cpu().numpy())
        test_labels_all.extend(labels.cpu().numpy())

# Calculate scores
accuracy = accuracy_score(test_labels_all, test_predictions)
precision, recall, f1, _ = precision_recall_fscore_support(test_labels_all, test_predictions, average='binary')

print(f"Test Accuracy: {accuracy}")
print(f"Precision: {precision}")
print(f"Recall: {recall}")
print(f"F1 Score: {f1}")


Test Accuracy: 0.9266
Precision: 0.9422642284775016
Recall: 0.9101012105576504
F1 Score: 0.9259034928326267



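These are strong metrics. Precision (0.942) is somewhat higher than recall (0.910), so the model misses positive reviews a little more often than it mislabels negative reviews as positive. One caveat: this test split was drawn from the same 50,000 reviews the model was trained on, using a different random split, so part of the test set likely overlaps with the training data and these numbers may be optimistic.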
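To show how these four numbers relate to each other, here is a minimal sketch that recomputes them from confusion-matrix counts. The counts are not printed anywhere in the notebook; they are back-calculated from the reported accuracy, precision, and recall under the assumption of 10,000 test reviews, so treat them as an inference rather than logged output.

```python
def binary_metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall, and F1 for the positive class."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)  # of the reviews predicted positive, how many were
    recall = tp / (tp + fn)     # of the truly positive reviews, how many were found
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return accuracy, precision, recall, f1


# Counts inferred from the reported metrics (assumption: 10,000 test reviews)
accuracy, precision, recall, f1 = binary_metrics(tp=4586, fp=281, fn=453, tn=4680)
print(round(accuracy, 4), round(precision, 4), round(recall, 4), round(f1, 4))
# 0.9266 0.9423 0.9101 0.9259
```

Under these counts there are 453 false negatives against 281 false positives, which is why recall comes out lower than precision.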

# Predictions on Sample Reviews

Now that the model has been evaluated, the final step in the project is to make predictions on a set of sample movie reviews.

In [32]:
# A couple of hand-written sample reviews
sample_reviews = [
    "This movie was interesting, I didn't think I was going to like it because of all the violence, but ended up enjoying it very much!",
    "This movie was bizzare and wasn't great. Honestly, it was really boring and kind of a waste of time."
]

# Tokenize the samples with the same settings used for the training data
sample_inputs = tokenizer(
    sample_reviews,
    padding=True,
    truncation=True,
    max_length=128,
    return_tensors='pt'
)

# Move inputs to the device (GPU/T4)
sample_inputs = {key: val.to(device) for key, val in sample_inputs.items()}

# Code for predictions
model.eval()
with torch.no_grad():
    outputs = model(**sample_inputs)
    logits = outputs.logits
    predictions = torch.argmax(logits, dim=-1)

# Convert predictions to sentiment labels
predicted_labels = ['positive' if label == 1 else 'negative' for label in predictions]

# Display
for review, sentiment in zip(sample_reviews, predicted_labels):
    print(f"Review: {review}")
    print(f"Predicted Sentiment: {sentiment}\n")


Review: This movie was interesting, I didn't think I was going to like it because of all the violence, but ended up enjoying it very much!
Predicted Sentiment: positive

Review: This movie was bizzare and wasn't great. Honestly, it was really boring and kind of a waste of time.
Predicted Sentiment: negative
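Taking the argmax of the logits gives only a hard label. If a confidence score is also wanted, the logits can be passed through a softmax. The sketch below does this in plain Python with hypothetical logit values (the notebook does not print the actual logits); on real model output, torch.softmax(logits, dim=-1) performs the same computation.

```python
import math


def softmax(logits):
    """Convert raw logits into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [value / total for value in exps]


example_logits = [-1.2, 2.3]  # hypothetical [negative, positive] logits
probs = softmax(example_logits)
predicted = max(range(len(probs)), key=probs.__getitem__)
label = 'positive' if predicted == 1 else 'negative'
print(f"{label} (confidence {probs[predicted]:.2f})")
```

The predicted class is the same as with argmax on the logits, since softmax preserves their ordering.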

