# Sentiment Analysis with BERT on IMDb Movie Reviews

In this assignment, we implement a sentiment classifier that labels IMDb movie reviews as positive or negative. We start from the pre-trained BERT (Bidirectional Encoder Representations from Transformers) model provided by the Transformers library and fine-tune it on the IMDb dataset.

## Setting up Environment

We import the libraries needed for the task: transformers for BERT, PyTorch for training, pandas for data handling, and scikit-learn for splitting the data and computing evaluation metrics. We also fix the random seeds for reproducibility.


In [3]:
# Importing required libraries
import pandas as pd
import torch
import torch.nn as nn
from torch.utils.data import Dataset, DataLoader
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, precision_recall_fscore_support, classification_report
import warnings
warnings.filterwarnings('ignore')

# Import transformers components; stop here if the library is missing
try:
    from transformers import BertTokenizer, BertForSequenceClassification
    print("BERT components imported successfully!")
except ImportError as e:
    print(f"Import error: {e}")
    raise

# Import the optimizer from torch (torch.optim.AdamW replaces the AdamW that transformers used to ship)
from torch.optim import AdamW

# Import numpy
import numpy as np

# Setting random seeds for reproducibility
torch.manual_seed(42)
np.random.seed(42)

print("Environment setup completed successfully!")


BERT components imported successfully!
Environment setup completed successfully!


## Loading and Exploring the Dataset

We load the IMDb movie review dataset, which contains 50,000 reviews labeled as positive or negative, and inspect its structure and basic statistics.


In [4]:
# Loading the IMDB dataset
df = pd.read_csv('IMDB-Dataset.csv')

# Displaying basic information about the dataset
print(f"Dataset shape: {df.shape}")
print(f"Columns: {list(df.columns)}")
print(f"\nFirst few rows:")
print(df.head())


Dataset shape: (50000, 2)
Columns: ['review', 'sentiment']

First few rows:
                                              review sentiment
0  One of the other reviewers has mentioned that ...  positive
1  A wonderful little production. <br /><br />The...  positive
2  I thought this was a wonderful way to spend ti...  positive
3  Basically there's a family where a little boy ...  negative
4  Petter Mattei's "Love in the Time of Money" is...  positive


## Data Exploration and Preprocessing

We check the class distribution and look for missing values, then preprocess the text by removing HTML tags and map the sentiment labels to integers for binary classification.


In [5]:
# Exploring dataset distribution
print("Sentiment distribution:")
print(df['sentiment'].value_counts())
print(f"\nDataset balance: {df['sentiment'].value_counts(normalize=True)}")

# Checking for missing values
print(f"\nMissing values:")
print(df.isnull().sum())


Sentiment distribution:
sentiment
positive    25000
negative    25000
Name: count, dtype: int64

Dataset balance: sentiment
positive    0.5
negative    0.5
Name: proportion, dtype: float64

Missing values:
review       0
sentiment    0
dtype: int64


In [6]:
# Preprocessing text data - removing HTML tags and cleaning
import re

def clean_text(text):
    # Replace HTML tags (e.g. <br />) with a space so adjacent words are not joined
    text = re.sub(r'<.*?>', ' ', text)
    # Remove extra whitespace
    text = ' '.join(text.split())
    return text

# Applying text cleaning
df['cleaned_review'] = df['review'].apply(clean_text)

# Converting sentiment labels to numerical format (0: negative, 1: positive)
df['label'] = df['sentiment'].map({'negative': 0, 'positive': 1})

print("Text preprocessing completed!")
print(f"Sample cleaned review: {df['cleaned_review'].iloc[1][:200]}...")


Text preprocessing completed!
Sample cleaned review: A wonderful little production. The filming technique is very unassuming- very old-time-BBC fashion and gives a comforting, and sometimes discomforting, sense of realism to the entire piece. The actors...
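
One detail worth checking when stripping tags is what takes their place. The following is a minimal sketch using only the standard library; `strip_tags` and the sample string are hypothetical and exist only for this illustration. It compares an empty replacement with a single space on a review where `<br />` sits directly between two words.

```python
import re

def strip_tags(text, replacement):
    """Remove HTML tags, then collapse runs of whitespace."""
    text = re.sub(r'<.*?>', replacement, text)
    return ' '.join(text.split())

# Hypothetical review fragment with no space around the line breaks
sample = "I loved it<br /><br />Highly recommended"

glued = strip_tags(sample, '')    # tags removed with nothing in their place
spaced = strip_tags(sample, ' ')  # tags replaced by a single space

print(glued)
print(spaced)
```

With an empty replacement, `it` and `Highly` fuse into one word, while a space keeps them apart. The whitespace-collapsing step removes any extra spaces this introduces.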


## Splitting Dataset into Training and Testing Sets

We split the dataset into a training set (80%) and a testing set (20%) so that the model is evaluated on reviews it has not seen. The split is stratified on the label, so both sets keep the 50/50 class balance.


In [7]:
# Splitting the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    df['cleaned_review'], 
    df['label'], 
    test_size=0.2, 
    random_state=42, 
    stratify=df['label']
)

print(f"Training set size: {len(X_train)}")
print(f"Testing set size: {len(X_test)}")
print(f"Training label distribution: {y_train.value_counts().sort_index()}")
print(f"Testing label distribution: {y_test.value_counts().sort_index()}")


Training set size: 40000
Testing set size: 10000
Training label distribution: label
0    20000
1    20000
Name: count, dtype: int64
Testing label distribution: label
0    5000
1    5000
Name: count, dtype: int64
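
To make the effect of `stratify` concrete, here is a small standard-library sketch of a stratified split. It is not scikit-learn's implementation, and `stratified_split` and the toy labels are hypothetical names used only for illustration. The idea is to shuffle and split each class separately so both parts keep the same class proportions.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_size, seed):
    """Return (train_idx, test_idx) with each class split in the same proportion."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)                      # shuffle within the class
        n_test = round(len(indices) * test_size)  # same fraction for every class
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return train_idx, test_idx

# Toy labels: 10 negative (0) and 10 positive (1) reviews
toy_labels = [0] * 10 + [1] * 10
train_idx, test_idx = stratified_split(toy_labels, test_size=0.2, seed=42)

print(len(train_idx), len(test_idx))
print(sorted(toy_labels[i] for i in test_idx))
```

On the toy data this yields 16 training and 4 testing indices, with two reviews of each class in the test part. This mirrors the 20,000/20,000 and 5,000/5,000 counts reported above.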


## Setting up BERT Tokenizer

We initialize the BERT tokenizer, which converts review text into the WordPiece token IDs that BERT expects. We use the vocabulary of the 'bert-base-uncased' model, which lowercases the text before tokenizing it.


In [8]:
# Initializing BERT tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

# Setting the maximum sequence length (BERT accepts at most 512 tokens)
MAX_LEN = 256  # Shorter sequences train faster; reviews longer than this are truncated

print("BERT tokenizer initialized successfully!")
print(f"Vocabulary size: {tokenizer.vocab_size}")
print(f"Maximum sequence length: {MAX_LEN}")


BERT tokenizer initialized successfully!
Vocabulary size: 30522
Maximum sequence length: 256
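
BERT's tokenizer breaks words that are not in its vocabulary into subword pieces using greedy longest-match-first WordPiece splitting. The sketch below illustrates that idea on a hypothetical toy vocabulary. It is a simplified stand-in, not the library's code, and it skips the lowercasing and punctuation handling that the real tokenizer performs first.

```python
def wordpiece(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first split of one word into subword pieces."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Shrink the candidate from the right until it is in the vocabulary
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a ## prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk_token]  # no piece fits: the whole word becomes unknown
        pieces.append(match)
        start = end
    return pieces

# Hypothetical toy vocabulary, not the real bert-base-uncased one
toy_vocab = {"un", "##aff", "##able", "movie"}

print(wordpiece("unaffable", toy_vocab))
print(wordpiece("movie", toy_vocab))
print(wordpiece("xyz", toy_vocab))
```

The first call splits the word into `un`, `##aff`, `##able`, the second keeps `movie` whole, and the third falls back to `[UNK]`. Because of this splitting, a review usually produces more tokens than it has words, which matters when choosing `MAX_LEN`.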


## Creating BERT Input Features

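We convert the reviews into BERT input features: input_ids and attention masks. Each review is tokenized, wrapped in the special [CLS] and [SEP] tokens, and truncated or padded to MAX_LEN tokens. We do not keep token_type_ids, because every input is a single sequence and the model uses all-zero segment IDs when they are omitted.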


In [9]:
# Function to encode text using BERT tokenizer
def encode_texts(texts, tokenizer, max_len):
    input_ids = []
    attention_masks = []
    
    for text in texts:
        # Calling the tokenizer directly replaces the deprecated encode_plus method
        encoded = tokenizer(
            text,
            add_special_tokens=True,
            max_length=max_len,
            padding='max_length',
            truncation=True,
            return_attention_mask=True,
            return_tensors='pt'
        )
        
        input_ids.append(encoded['input_ids'])
        attention_masks.append(encoded['attention_mask'])
    
    return torch.cat(input_ids, dim=0), torch.cat(attention_masks, dim=0)

# Encoding training and testing data
print("Encoding training data...")
train_input_ids, train_attention_masks = encode_texts(X_train.tolist(), tokenizer, MAX_LEN)

print("Encoding testing data...")
test_input_ids, test_attention_masks = encode_texts(X_test.tolist(), tokenizer, MAX_LEN)

print("Text encoding completed!")
print(f"Training input shape: {train_input_ids.shape}")
print(f"Testing input shape: {test_input_ids.shape}")


Encoding training data...
Encoding testing data...
Text encoding completed!
Training input shape: torch.Size([40000, 256])
Testing input shape: torch.Size([10000, 256])
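
The following sketch shows, in plain Python, what `padding='max_length'` and `truncation=True` do to a single review. The word IDs are hypothetical placeholders and `pad_or_truncate` is an illustrative helper, not a library function. The special token IDs are the ones used by bert-base-uncased.

```python
CLS_ID, SEP_ID, PAD_ID = 101, 102, 0  # special token IDs in bert-base-uncased

def pad_or_truncate(token_ids, max_len):
    """Wrap token IDs in [CLS]/[SEP], then truncate or pad to max_len."""
    body = token_ids[:max_len - 2]          # leave room for [CLS] and [SEP]
    input_ids = [CLS_ID] + body + [SEP_ID]
    attention_mask = [1] * len(input_ids)   # 1 marks real tokens
    n_pad = max_len - len(input_ids)
    input_ids += [PAD_ID] * n_pad
    attention_mask += [0] * n_pad           # 0 marks padding the model should ignore
    return input_ids, attention_mask

# Hypothetical word IDs standing in for tokenized reviews
short_ids, short_mask = pad_or_truncate([2001, 2002, 2003], max_len=8)
long_ids, long_mask = pad_or_truncate(list(range(2000, 2020)), max_len=8)

print(short_ids, short_mask)
print(long_ids, long_mask)
```

The short review is padded with zeros and its mask ends in zeros. The long review loses everything beyond the first six word tokens, so its mask is all ones. With `MAX_LEN = 256`, any review longer than 254 word pieces is cut in the same way.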


## Creating PyTorch Dataset and DataLoader

We create a custom PyTorch Dataset class that returns one encoded review and its label per index, then wrap the datasets in DataLoaders to get shuffled mini-batches for training and ordered mini-batches for evaluation.


In [10]:
# Custom Dataset class for BERT inputs
class IMDbDataset(Dataset):
    def __init__(self, input_ids, attention_masks, labels):
        self.input_ids = input_ids
        self.attention_masks = attention_masks
        self.labels = labels
    
    def __len__(self):
        return len(self.labels)
    
    def __getitem__(self, idx):
        return {
            'input_ids': self.input_ids[idx],
            'attention_mask': self.attention_masks[idx],
            'labels': self.labels[idx]
        }

# Converting labels to tensors
train_labels = torch.tensor(y_train.values, dtype=torch.long)
test_labels = torch.tensor(y_test.values, dtype=torch.long)

# Creating dataset objects from the label tensors
train_dataset = IMDbDataset(train_input_ids, train_attention_masks, train_labels)
test_dataset = IMDbDataset(test_input_ids, test_attention_masks, test_labels)

print("Dataset objects created successfully!")
print(f"Training dataset size: {len(train_dataset)}")
print(f"Testing dataset size: {len(test_dataset)}")


Dataset objects created successfully!
Training dataset size: 40000
Testing dataset size: 10000


In [11]:
# Creating DataLoaders for batch processing
BATCH_SIZE = 16  # Using smaller batch size for memory efficiency

train_dataloader = DataLoader(
    train_dataset,
    batch_size=BATCH_SIZE,
    shuffle=True,
    num_workers=0  # Set to 0 for compatibility
)

test_dataloader = DataLoader(
    test_dataset,
    batch_size=BATCH_SIZE,
    shuffle=False,
    num_workers=0
)

print("DataLoaders created successfully!")
print(f"Training batches: {len(train_dataloader)}")
print(f"Testing batches: {len(test_dataloader)}")
print(f"Batch size: {BATCH_SIZE}")


DataLoaders created successfully!
Training batches: 2500
Testing batches: 625
Batch size: 16
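
The batch counts above follow directly from the dataset sizes and the batch size. As a quick check, the sketch below reproduces them. `num_batches` is a hypothetical helper that mirrors how a DataLoader counts batches with and without `drop_last`.

```python
import math

def num_batches(n_samples, batch_size, drop_last=False):
    """Number of batches a DataLoader yields for a dataset of n_samples."""
    if drop_last:
        return n_samples // batch_size      # an incomplete final batch is discarded
    return math.ceil(n_samples / batch_size)  # an incomplete final batch is kept

print(num_batches(40000, 16))
print(num_batches(10000, 16))
print(num_batches(10001, 16))
```

Both dataset sizes divide evenly by 16, giving 2500 and 625 batches. A dataset with one extra sample would add a final batch containing a single review.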


## Loading Pre-trained BERT Model

We load the pre-trained BERT model with a sequence classification head on top. The head has 2 output classes (negative and positive); its weights are newly initialized, as the warning in the output below notes, and are learned during fine-tuning together with the encoder.


In [12]:
# Loading pre-trained BERT model for sequence classification
model = BertForSequenceClassification.from_pretrained(
    'bert-base-uncased',
    num_labels=2,  # Binary classification (negative=0, positive=1)
    output_attentions=False,
    output_hidden_states=False
)

# Checking if GPU is available and moving model to device
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = model.to(device)

print(f"BERT model loaded successfully!")
print(f"Device: {device}")
print(f"Model parameters: {sum(p.numel() for p in model.parameters()):,}")


Some weights of BertForSequenceClassification were not initialized from the model checkpoint at bert-base-uncased and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


BERT model loaded successfully!
Device: cpu
Model parameters: 109,483,778
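
The parameter count can be reproduced by hand from the bert-base-uncased configuration (vocabulary 30,522, hidden size 768, 12 layers, feed-forward size 3,072, 512 positions, 2 segment types). The sketch below is a back-of-the-envelope tally under those values. The variable names are ours and do not correspond to attributes of the model object.

```python
# bert-base-uncased configuration values
VOCAB, HIDDEN, LAYERS, INTERMEDIATE = 30522, 768, 12, 3072
MAX_POS, TYPE_VOCAB, NUM_LABELS = 512, 2, 2

def linear(n_in, n_out):
    """Weights plus biases of a fully connected layer."""
    return n_in * n_out + n_out

layer_norm = 2 * HIDDEN  # scale and shift vectors

# Word, position, and segment embeddings, followed by a LayerNorm
embeddings = (VOCAB + MAX_POS + TYPE_VOCAB) * HIDDEN + layer_norm
# Query, key, value, and output projections, followed by a LayerNorm
attention = 4 * linear(HIDDEN, HIDDEN) + layer_norm
# Two-layer feed-forward block, followed by a LayerNorm
feed_forward = linear(HIDDEN, INTERMEDIATE) + linear(INTERMEDIATE, HIDDEN) + layer_norm
encoder = LAYERS * (attention + feed_forward)
pooler = linear(HIDDEN, HIDDEN)
classifier = linear(HIDDEN, NUM_LABELS)  # the newly initialized head

total = embeddings + encoder + pooler + classifier
print(f"{classifier:,} of {total:,} parameters belong to the classification head")
```

The total matches the 109,483,778 reported by the notebook, and only 1,538 of those parameters (the `classifier.weight` and `classifier.bias` named in the warning) start from random values.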


## Setting up Training Configuration

We configure the hyperparameters for fine-tuning: the number of epochs, the learning rate, and the AdamW optimizer that updates all model parameters.


In [16]:
# Training configuration
EPOCHS = 4  # BERT fine-tuning usually needs only a few epochs (the BERT paper suggests 2-4)
LEARNING_RATE = 2e-5  # One of the learning rates suggested in the BERT paper for fine-tuning

# Setting up optimizer
optimizer = AdamW(
    model.parameters(),
    lr=LEARNING_RATE,
    eps=1e-8  # Default epsilon value for AdamW
)

# No separate loss function is needed: BertForSequenceClassification computes
# the cross-entropy loss internally when labels are passed to the forward call
print("Training configuration completed!")
print(f"Epochs: {EPOCHS}")
print(f"Learning rate: {LEARNING_RATE}")
print("Optimizer: AdamW")
print(f"Total training steps: {len(train_dataloader) * EPOCHS}")


Training configuration completed!
Epochs: 4
Learning rate: 2e-05
Optimizer: AdamW
Total training steps: 10000
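
For reference, the cross-entropy loss that the model returns for a two-class problem can be written in a few lines of plain Python. This is a sketch of the formula only, not the PyTorch implementation, and `cross_entropy` and `mean_cross_entropy` are illustrative names.

```python
import math

def cross_entropy(logits, label):
    """Negative log-probability of the true class under softmax(logits)."""
    peak = max(logits)  # subtract the max for numerical stability
    log_sum_exp = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return log_sum_exp - logits[label]

def mean_cross_entropy(batch_logits, batch_labels):
    """Average the per-example losses over a batch."""
    losses = [cross_entropy(z, y) for z, y in zip(batch_logits, batch_labels)]
    return sum(losses) / len(losses)

# Equal logits mean the model has no preference between the two classes
print(round(cross_entropy([0.0, 0.0], 1), 4))
# A confident correct prediction costs less than a confident wrong one
print(cross_entropy([-2.0, 3.0], 1) < cross_entropy([-2.0, 3.0], 0))
```

When the two logits are equal, the loss is ln 2 (about 0.693). On a balanced binary task, a loss near that value at the start of training suggests the classification head has not yet learned anything, so it is a useful baseline when reading the training loss.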


## Fine-tuning BERT Model

We implement the training loop that fine-tunes BERT on the IMDb training set. For each batch, the loop runs a forward pass with the labels to obtain the loss, backpropagates it, and takes one optimizer step.


In [None]:
# Training function
def train_model(model, train_dataloader, optimizer, device, epochs):
    model.train()
    total_loss = 0
    
    for epoch in range(epochs):
        epoch_loss = 0
        
        for batch in train_dataloader:
            # Moving batch to device
            input_ids = batch['input_ids'].to(device)
            attention_mask = batch['attention_mask'].to(device)
            labels = batch['labels'].to(device)
            
            # Zero gradients
            optimizer.zero_grad()
            
            # Forward pass
            outputs = model(
                input_ids=input_ids,
                attention_mask=attention_mask,
                labels=labels
            )
            
            # Calculate loss
            loss = outputs.loss
            epoch_loss += loss.item()
            
            # Backward pass
            loss.backward()
            optimizer.step()
        
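        # Report the mean batch loss for this epoch
        print(f"Epoch {epoch + 1}/{epochs} - average loss: {epoch_loss / len(train_dataloader):.4f}")
        total_loss += epoch_loss
    
    # Mean batch loss over all epochs
    return total_loss / (epochs * len(train_dataloader))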

# Fine-tuning the model
print("Starting BERT fine-tuning...")
avg_loss = train_model(model, train_dataloader, optimizer, device, EPOCHS)
print(f"Training completed! Average loss: {avg_loss:.4f}")


Starting BERT fine-tuning...
