# Sentiment Analysis of IMDb Reviews

This notebook implements a sentiment analysis task using IMDb movie reviews. We compare the effectiveness of static embeddings (Word2Vec) and contextual embeddings (BERT) in classifying sentiments expressed in movie reviews.

### 1. Setup and Installation

First, we install and import the necessary libraries, select the compute device, and fix the random seeds.

In [None]:
# Install the required libraries (the PyTorch wheels below target CUDA 12.1)
%pip install scipy pandas numpy transformers scikit-learn matplotlib seaborn threadpoolctl joblib gensim ipywidgets tqdm
%pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121

# Import necessary libraries
import pandas as pd
import numpy as np
import torch
from torch import nn, optim
from torch.utils.data import Dataset, DataLoader
from transformers import BertTokenizer, BertForSequenceClassification
from torch.optim import AdamW
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix, roc_curve, auc, matthews_corrcoef, precision_recall_curve, average_precision_score
import matplotlib.pyplot as plt
import seaborn as sns
import re
from gensim import downloader as api
import warnings
from tqdm.notebook import tqdm
import itertools

# Check for CUDA availability for GPU acceleration
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")

# Set up random seed for reproducibility
torch.manual_seed(42)
np.random.seed(42)

### 2. Problem Definition and Hypothesis
#### Problem Definition
The objective of this project is to classify sentiments of IMDb movie reviews using vector-based text representations. Specifically, we aim to determine the efficacy of different embeddings and model architectures (static vs. contextual) in sentiment classification.

#### Hypothesis
We hypothesize that contextual embeddings (like BERT) will perform significantly better than static embeddings (like Word2Vec) for sentiment classification in movie reviews.

### 3. Data Acquisition and Preprocessing
We load the IMDb dataset and preprocess the text by removing HTML tags, converting to lowercase, and removing punctuation.

In [None]:
# Load the IMDb movie reviews dataset
df = pd.read_csv('C:/Users/serem/Documents/Workspaces/Sentiment Analysis/IMDB Dataset.csv')
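# Strip HTML tags, lowercase, replace punctuation with spaces, and trim whitespace.
# str.replace does not interpret regex patterns, so re.sub is used for both substitutions;
# tags are replaced with a space so that words on either side of a tag are not fused.
df['review'] = df['review'].apply(lambda x: re.sub(r'[^\w\s]', ' ', re.sub(r'<[^>]*>', ' ', x).lower()).strip())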

# Split data into training and validation sets (80/20)
train_texts, val_texts, train_labels, val_labels = train_test_split(df['review'], df['sentiment'].map({'positive': 1, 'negative': 0}), test_size=0.2, random_state=42)

# Tokenization for BERT
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

def tokenize_for_bert(texts):
    return tokenizer(texts, padding='max_length', truncation=True, max_length=512, return_tensors="pt")

train_encodings = tokenize_for_bert(train_texts.tolist())
val_encodings = tokenize_for_bert(val_texts.tolist())
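
The cleaning steps described in this section can also be sketched as a standalone function, which is easier to test than an inline lambda. The name `clean_review` is hypothetical. Replacing tags and punctuation with spaces (rather than deleting them) and collapsing repeated whitespace are assumptions made here so that adjacent words are not fused.

```python
import re

TAG_RE = re.compile(r"<[^>]*>")    # HTML tags such as <br />
PUNCT_RE = re.compile(r"[^\w\s]")  # anything that is not a word character or whitespace
SPACE_RE = re.compile(r"\s+")      # runs of whitespace

def clean_review(text: str) -> str:
    """Hypothetical helper: strip HTML tags, lowercase, drop punctuation, collapse spaces."""
    text = TAG_RE.sub(" ", text)
    text = text.lower()
    text = PUNCT_RE.sub(" ", text)
    return SPACE_RE.sub(" ", text).strip()

print(clean_review("Great movie!<br /><br />Loved it."))
```

Applied to the DataFrame, this would look like `df['review'].apply(clean_review)`.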

### 4. Model Configuration
#### Static Embeddings with Word2Vec
We utilize pre-trained Word2Vec embeddings to transform text data into vectors and feed these into a simple neural network for classification.

In [None]:
# Load pre-trained Word2Vec
word_vectors = api.load("word2vec-google-news-300")

# Model for Word2Vec embeddings
class SentimentClassifier(nn.Module):
    def __init__(self):
        super(SentimentClassifier, self).__init__()
        self.fc1 = nn.Linear(300, 50)
        self.relu = nn.ReLU()
        self.fc2 = nn.Linear(50, 1)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):
        x = self.fc1(x)
        x = self.relu(x)
        x = self.fc2(x)
        return self.sigmoid(x)

static_model = SentimentClassifier().to(torch.device("cuda" if torch.cuda.is_available() else "cpu"))
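
The notebook does not show how a review becomes the 300-dimensional input that `SentimentClassifier` expects. One common choice, assumed here rather than taken from the original, is to average the vectors of all in-vocabulary tokens. The sketch below uses a hypothetical `embed_review` helper and a toy dictionary in place of the real vectors; gensim's `KeyedVectors` supports the same `token in vectors` and `vectors[token]` operations, so `word_vectors` could be passed instead.

```python
import numpy as np

def embed_review(text, vectors, dim=300):
    """Average the vectors of in-vocabulary tokens; return zeros if none are found."""
    tokens = text.split()
    found = [vectors[t] for t in tokens if t in vectors]
    if not found:
        return np.zeros(dim, dtype=np.float32)
    return np.mean(found, axis=0).astype(np.float32)

# Toy 3-dimensional vocabulary standing in for the 300-dimensional Word2Vec vectors.
toy_vectors = {
    "good": np.array([1.0, 0.0, 0.0]),
    "movie": np.array([0.0, 1.0, 0.0]),
}

# "overall" is out of vocabulary and is skipped.
review_vector = embed_review("good movie overall", toy_vectors, dim=3)
print(review_vector)
```

Mean pooling discards word order, which is one reason a static-embedding baseline may struggle with negation such as "not good".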

#### Contextual Embeddings with BERT
We employ a pre-trained BERT model from the Hugging Face library, fine-tuning it on the sentiment classification task.

In [None]:
# Load BERT model
bert_model = BertForSequenceClassification.from_pretrained('bert-base-uncased').to(torch.device("cuda" if torch.cuda.is_available() else "cpu"))

### 5. Experimentation
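We define the data pipeline and the training and validation routines. Only the BERT model is trained in this notebook; the Word2Vec-based classifier is configured above but not yet trained.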

#### Define a Dataloader

In [None]:
# Dataset wrapping the tokenized encodings and labels, plus DataLoaders for batching
class IMDbDataset(Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        item = {key: val[idx].clone().detach() for key, val in self.encodings.items()}
        item['labels'] = torch.tensor(self.labels[idx], dtype=torch.long)
        return item

train_dataset = IMDbDataset(train_encodings, train_labels.tolist())
val_dataset = IMDbDataset(val_encodings, val_labels.tolist())
train_loader = DataLoader(train_dataset, batch_size=16, shuffle=True)
val_loader = DataLoader(val_dataset, batch_size=16, shuffle=False)

#### Training and Evaluation Functions

In [None]:
# Training function
def train(model, data_loader, optimizer, device):
    model.train()
    total_loss = 0
    progress_bar = tqdm(data_loader, desc="Training", leave=False)
    for batch in progress_bar:
        optimizer.zero_grad()
        input_ids = batch['input_ids'].to(device)
        attention_mask = batch['attention_mask'].to(device)
        labels = batch['labels'].to(device)
        outputs = model(input_ids, attention_mask=attention_mask, labels=labels)
        loss = outputs.loss
        loss.backward()
        optimizer.step()
        total_loss += loss.item()
        progress_bar.set_postfix(loss=loss.item())
    return total_loss / len(data_loader)

# Evaluation function
def evaluate(model, data_loader, device):
    model.eval()
    total_loss = 0
    progress_bar = tqdm(data_loader, desc="Evaluating", leave=False)
    with torch.no_grad():
        for batch in progress_bar:
            input_ids = batch['input_ids'].to(device)
            attention_mask = batch['attention_mask'].to(device)
            labels = batch['labels'].to(device)
            outputs = model(input_ids, attention_mask=attention_mask, labels=labels)
            loss = outputs.loss
            total_loss += loss.item()
            progress_bar.set_postfix(loss=loss.item())
    return total_loss / len(data_loader)

### 6. Model Training
Train the BERT model and validate its performance.

In [None]:
# Suppress specific UserWarning from transformers or PyTorch
warnings.filterwarnings("ignore", message="Torch was not compiled with flash attention.")

# Optimizers
bert_optimizer = AdamW(bert_model.parameters(), lr=2e-5)
static_optimizer = optim.Adam(static_model.parameters(), lr=0.001)

# Training loop
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")

for epoch in range(3):
    print(f"Starting Epoch {epoch + 1}")
    
    print("Training phase...")
    train_loss = train(bert_model, train_loader, bert_optimizer, device)
    print(f"Finished training for Epoch {epoch + 1}, Train Loss: {train_loss:.4f}")
    
    print("Evaluation phase...")
    val_loss = evaluate(bert_model, val_loader, device)
    print(f"Finished evaluation for Epoch {epoch + 1}, Val Loss: {val_loss:.4f}")

### 7. Evaluation
Assess the performance of the BERT model using accuracy and other metrics.

In [None]:
# Function to get model predictions
def get_predictions(model, data_loader, device):
    model.eval()
    predictions = []
    real_values = []
    prediction_scores = []
    with torch.no_grad():
        for batch in data_loader:
            inputs = batch['input_ids'].to(device)
            attention_mask = batch['attention_mask'].to(device)
            labels = batch['labels'].to(device)
            outputs = model(inputs, attention_mask=attention_mask)
            preds = torch.argmax(outputs.logits, dim=1)
            predictions.extend(preds.cpu().numpy())
            real_values.extend(labels.cpu().numpy())
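            # Score each sample with the softmax probability of the positive class, so thresholds lie in [0, 1]
            prediction_scores.extend(torch.softmax(outputs.logits, dim=1)[:, 1].cpu().numpy())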
    return predictions, real_values, prediction_scores

# Evaluate model
predictions, real_values, prediction_scores = get_predictions(bert_model, val_loader, device)

# Metrics
cm = confusion_matrix(real_values, predictions)
accuracy = accuracy_score(real_values, predictions)
precision = precision_score(real_values, predictions, average='binary')
recall = recall_score(real_values, predictions, average='binary')
f1 = f1_score(real_values, predictions, average='binary')
mcc = matthews_corrcoef(real_values, predictions)

# Output metrics
print(f"Accuracy: {accuracy:.2f}")
print(f"Precision: {precision:.2f}")
print(f"Recall: {recall:.2f}")
print(f"F1 Score: {f1:.2f}")
print(f"Matthews Correlation Coefficient: {mcc:.2f}")
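
To make the relationship between these metrics explicit, the sketch below recomputes them from the four confusion-matrix counts. The function name `binary_metrics` is hypothetical and the counts are made up for illustration; they are not results from this notebook.

```python
import math

def binary_metrics(tp, fp, fn, tn):
    """Compute precision, recall, F1, and MCC from confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"precision": precision, "recall": recall, "f1": f1, "mcc": mcc}

# Made-up counts for illustration only.
metrics = binary_metrics(tp=45, fp=10, fn=5, tn=40)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Unlike accuracy, MCC uses all four counts symmetrically, so it stays informative even if one class dominates.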

### 8. Visualization
This section leverages multiple visual tools to comprehensively display the model's performance metrics, helping to illustrate its strengths and pinpoint areas that may require improvement. By examining these visualizations, we can gain deeper insights into the model's behavior in various scenarios.

#### Confusion Matrix
The Confusion Matrix is an essential tool for understanding the performance of classification models at a granular level. It reports the counts of true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN). In this row-normalized version, each value is the proportion of samples from a true class that received a given predicted label, so the diagonal entries are the per-class recall. This helps identify whether the model is biased toward one class or particularly weak at recognizing the other.

In [None]:
# Normalized Confusion Matrix visualization
cm_normalized = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]

plt.figure(figsize=(8, 6))
sns.heatmap(cm_normalized, annot=True, fmt=".2f", cmap='Blues', xticklabels=['Negative', 'Positive'], yticklabels=['Negative', 'Positive'])
plt.title('Normalized Confusion Matrix')
plt.ylabel('True label')
plt.xlabel('Predicted label')
plt.show()
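
As a small worked example of what the normalization does, the sketch below divides made-up counts by their row totals and, for contrast, by their column totals. The array `cm_example` is illustrative only and is not output from this notebook.

```python
import numpy as np

# Made-up counts: rows are true labels, columns are predicted labels.
cm_example = np.array([[40, 10],
                       [5, 45]])

# Dividing each row by its total gives per-class recall on the diagonal.
row_normalized = cm_example / cm_example.sum(axis=1, keepdims=True)
per_class_recall = np.diag(row_normalized)

# Dividing each column by its total gives per-class precision on the diagonal.
col_normalized = cm_example / cm_example.sum(axis=0, keepdims=True)
per_class_precision = np.diag(col_normalized)

print(row_normalized)
print(per_class_recall)
print(per_class_precision)
```

The choice of axis therefore determines which question the heatmap answers: "how many true positives were found" (rows) or "how many positive predictions were right" (columns).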

#### ROC Curve
The Receiver Operating Characteristic (ROC) Curve is a powerful diagnostic tool for binary classifiers. It plots the True Positive Rate (TPR) against the False Positive Rate (FPR) at various threshold levels, providing a visual representation of the trade-off between sensitivity and specificity. The Area Under the Curve (AUC) metric summarizes the overall ability of the model to discriminate between the classes at all thresholds. Points along the curve represent different thresholds; a higher curve indicates better performance.

In [None]:
# ROC Curve with threshold annotations
fpr, tpr, thresholds = roc_curve(real_values, prediction_scores)
roc_auc = auc(fpr, tpr)
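# roc_curve returns thresholds in decreasing order, so np.searchsorted (which expects ascending input) does not apply.
# Instead, pick the index whose threshold is closest to each target value.
threshold_indices = [int(np.argmin(np.abs(thresholds - t))) for t in np.linspace(0.1, 0.9, 9)]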

plt.figure(figsize=(8, 6))
plt.plot(fpr, tpr, color='darkorange', lw=2, label=f'ROC curve (AUC = {roc_auc:.2f})')
for idx in np.unique(threshold_indices):  # Ensure unique indices to avoid duplicate labels
    plt.text(fpr[idx], tpr[idx], f'Thresh={thresholds[idx]:.2f}', fontsize=9)
plt.fill_between(fpr, tpr, alpha=0.2, color='orange')
plt.plot([0, 1], [0, 1], color='navy', linestyle='--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic')
plt.legend(loc="lower right")
plt.show()
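
Each point on the ROC curve comes from applying one threshold to the scores. The sketch below computes a single (FPR, TPR) pair by hand. The helper `roc_point` and the example labels and scores are hypothetical, and the function assumes both classes are present.

```python
def roc_point(y_true, scores, threshold):
    """Return (FPR, TPR) when predicting positive for scores >= threshold."""
    tp = sum(1 for y, s in zip(y_true, scores) if y == 1 and s >= threshold)
    fp = sum(1 for y, s in zip(y_true, scores) if y == 0 and s >= threshold)
    positives = sum(y_true)
    negatives = len(y_true) - positives
    return fp / negatives, tp / positives

# Made-up labels and scores for illustration only.
y_example = [1, 0, 1, 1, 0]
score_example = [0.9, 0.8, 0.7, 0.6, 0.2]

for t in (0.85, 0.65, 0.1):
    fpr_t, tpr_t = roc_point(y_example, score_example, t)
    print(f"threshold={t:.2f}  FPR={fpr_t:.2f}  TPR={tpr_t:.2f}")
```

Lowering the threshold moves the point up and to the right: more positives are recovered, at the cost of more false alarms.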

#### Precision-Recall Curve
The Precision-Recall Curve is most informative when classes are imbalanced or when the cost of false negatives is high. It shows the trade-off between precision (the fraction of positive predictions that are correct) and recall (the fraction of actual positives that are found) as the decision threshold varies. Average precision (AP) summarizes the curve as a single value.

In [None]:
# Precision-Recall Curve visualization
# Use distinct names so the scalar precision and recall computed in the evaluation cell are not overwritten
pr_precision, pr_recall, _ = precision_recall_curve(real_values, prediction_scores)
ap_score = average_precision_score(real_values, prediction_scores)

plt.figure(figsize=(8, 6))
plt.plot(pr_recall, pr_precision, color='green', lw=2, label=f'Precision-Recall curve (AP = {ap_score:.2f})')
plt.fill_between(pr_recall, pr_precision, alpha=0.2, color='green')
plt.xlim([0.0, 1.05])
plt.ylim([0.0, 1.05])
plt.xlabel('Recall')
plt.ylabel('Precision')
plt.title('Precision-Recall Curve')
plt.legend(loc="lower right")
plt.show()
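
Average precision can be read as the precision at each correctly ranked positive, weighted by the recall gained at that step. The sketch below implements that sum directly. The function `average_precision` is a hypothetical helper that assumes no tied scores, and the data are made up. Under that assumption it should agree with scikit-learn's `average_precision_score`.

```python
def average_precision(y_true, scores):
    """Sum precision at each positive hit, weighted by the recall step (assumes no tied scores)."""
    ranked = sorted(zip(scores, y_true), key=lambda pair: pair[0], reverse=True)
    total_positives = sum(y_true)
    hits = 0
    ap = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            # Precision at this rank is hits / rank; each hit adds 1 / total_positives recall.
            ap += (hits / rank) / total_positives
    return ap

# Made-up labels and scores for illustration only.
y_example = [1, 0, 1, 1, 0]
score_example = [0.9, 0.8, 0.7, 0.6, 0.2]
print(f"AP = {average_precision(y_example, score_example):.3f}")
```

Here the three positives sit at ranks 1, 3, and 4, so the sum is (1/1 + 2/3 + 3/4) / 3.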

#### Histograms of Predictions vs. Actual Values
Histograms are straightforward yet effective for comparing the distribution of predicted versus actual values. This visualization helps assess how well the predictions align with the true labels and can indicate if the model tends to over-predict one class or is generally balanced. It is particularly useful for identifying biases in predictive behavior.

In [None]:
# Histogram for Predictions vs Actual Values
plt.figure(figsize=(10, 6))
bins = np.arange(3) - 0.5  # Bin edges [-0.5, 0.5, 1.5]: one bin centered on each class label
plt.hist(predictions, bins=bins, alpha=0.5, label='Predictions', color='blue', rwidth=0.8)
plt.hist(real_values, bins=bins, alpha=0.5, label='Actual Values', color='red', rwidth=0.8)
plt.xticks([0, 1], ['Negative', 'Positive'])
plt.ylabel('Number of Samples')
plt.xlabel('Sentiment')
plt.title('Comparison of Predictions and Actual Values')
plt.legend(loc='upper right')
plt.show()
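
Centering bins on integer class labels needs one more edge than there are classes. The sketch below, using made-up labels, compares two edge arrays with `np.histogram` to show how a missing edge silently drops a class.

```python
import numpy as np

labels_example = np.array([0, 1, 1, 0, 1])  # made-up labels for illustration

one_bin = np.array([0, 1]) - 0.5   # edges [-0.5, 0.5]: a single bin covering class 0 only
two_bins = np.arange(3) - 0.5      # edges [-0.5, 0.5, 1.5]: one bin per class

counts_one, _ = np.histogram(labels_example, bins=one_bin)
counts_two, _ = np.histogram(labels_example, bins=two_bins)

print(counts_one)  # values equal to 1 fall outside the edges and are not counted
print(counts_two)
```

With only two labels, a grouped bar chart of these counts would convey the same comparison as the overlaid histograms.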

### 9. Conclusions and Future Work
#### Findings
The BERT model demonstrates strong performance in sentiment classification, with high accuracy, precision, recall, and F1 scores.
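Because the Word2Vec-based model was not trained or evaluated in this notebook, the hypothesis that contextual embeddings capture the nuances of sentiment better than static embeddings remains untested here; the BERT results are consistent with it but do not confirm it.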
#### Limitations
The training time and computational resources required for BERT are significantly higher than those for Word2Vec-based models.
The current implementation does not include hyperparameter tuning, which could potentially improve model performance.
#### Future Directions
Implement and evaluate the Word2Vec-based model for a direct comparison with BERT.
Experiment with other contextual models like RoBERTa or GPT to see if they offer performance improvements.
Explore different preprocessing techniques and their impact on model performance.
Conduct hyperparameter tuning to optimize model performance further.