# **IMDB Movie Reviews – Sentiment Analysis**

This project aims to build and evaluate machine learning models for **sentiment classification** of IMDB movie reviews (**positive vs. negative**).  
The workflow follows a structured pipeline covering preprocessing, feature engineering, model training, evaluation, and error analysis.

---

### 🔹 Project Objectives
- Preprocess raw text data into clean and usable form.  
- Explore two feature extraction methods: **Bag of Words (BoW)** and **TF-IDF**.  
- Train and compare multiple models under both **baseline** and **tuned** configurations.  
- Select the **best baseline** and **best tuned** model based on **F1-Score**.  
- Perform **error analysis** (False Positives & False Negatives) to interpret model behavior.  

---

### 🔹 Workflow Overview
1. **Data Preprocessing** – text cleaning, tokenization, vectorization.  
2. **Model Training & Evaluation** – baseline vs. tuned models.  
3. **Model Selection** – identify best models by F1-Score.  
4. **Error Analysis** – compare misclassifications across models.  
5. **Visualization & Reporting** – metrics tables, confusion matrices, error breakdown.  

---

### 🔹 Evaluation Metrics
- **Accuracy** – overall correctness of predictions.  
- **F1-Score** – balance between precision & recall, used for final model selection.  


---
### Project Functions
All project-specific functions are implemented in the `src/` folder.  
Each module contains the relevant functions for preprocessing, feature extraction, model training, evaluation, and error analysis.

In [None]:
# General purpose
import numpy as np
import pandas as pd
import copy

# Data Preprocessing
import nltk
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from nltk.tokenize.toktok import ToktokTokenizer
import re
from bs4 import BeautifulSoup
import spacy
from sklearn.preprocessing import LabelBinarizer

# Feature extraction
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Model training
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC

# Model selection & evaluation
from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Data Visualization
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
!git clone https://github.com/TuliDas/Benchmark-ML-Models-Movie-Reviews-Sentiment
%cd Benchmark-ML-Models-Movie-Reviews-Sentiment

In [None]:
import sys
sys.path.append("src")  # Add src folder to Python path to import project modules

## Dataset Loading

- **Source**: [IMDB Movie Reviews Dataset (50K, Kaggle)](https://www.kaggle.com/datasets/lakshmi25npathi/imdb-dataset-of-50k-movie-reviews)  
- **Steps**:
  1. Download the dataset from Kaggle and upload it to Google Colab.  
  2. Load the CSV file into a Pandas DataFrame.  
  3. Preserve a deep copy of the original dataset (`data_OG`).  
  4. Display the dataset shape and preview the first few rows.  

In [None]:
csv_path = "/content/IMDB Dataset.csv"

data = pd.read_csv(csv_path, engine='python', delimiter=',')
data_OG = copy.deepcopy(data)

print("Loaded data shape:", data.shape)
data.head()

# **Data Preprocessing**

Before proceeding with model training, we will apply the following **eight key preprocessing steps** to clean and prepare the dataset:

1. **HTML & bracketed text removal (clean raw text)**

2. **Lowercasing + remove numbers and special chars (normalize text)**

3. **Handle negations (combine negation + next word to preserve meaning)**

4. **Remove extra whitespace (clean spacing for consistent tokens)**

5. **Lemmatization (reduce words to their base forms)**

6. **Remove stopwords (remove common words that add little meaning)**

7. **Remove very short tokens (remove tiny tokens like 'a', 'I' that may remain)**

8. **Stemming (optional last step to reduce words further)**
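
Steps 1 to 4 can be sketched with the standard library alone. This is only an illustration, not the `preprocess_pipeline` implementation in `src/data_preprocessing.py`: the function name `sketch_clean`, the `NEGATIONS` set, the underscore joiner, and the regex-based HTML stripping are all assumptions made for this example.

```python
import re

# Hypothetical negation list; the real pipeline may use a different one.
NEGATIONS = {"not", "no", "never"}

def sketch_clean(text: str) -> str:
    """Apply steps 1-4: strip HTML/brackets, normalize, join negations, fix spacing."""
    text = re.sub(r"<[^>]+>", " ", text)       # step 1: drop HTML tags
    text = re.sub(r"\[[^\]]*\]", " ", text)    # step 1: drop bracketed text
    text = text.lower()                         # step 2: lowercase
    text = re.sub(r"[^a-z\s]", " ", text)      # step 2: drop digits and special chars
    tokens = text.split()                       # step 4: split() collapses extra whitespace

    merged, i = [], 0
    while i < len(tokens):
        # step 3: glue a negation to the word that follows it
        if tokens[i] in NEGATIONS and i + 1 < len(tokens):
            merged.append(tokens[i] + "_" + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return " ".join(merged)

print(sketch_clean("This movie was <br />NOT good [spoiler] at all!! 10/10"))
# -> this movie was not_good at all
```

Joining the negation to the next word keeps `not_good` as a single token, so a Bag of Words model does not count it as evidence for `good`.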

In [None]:
# If spaCy is not installed
!pip install spacy
!python -m spacy download en_core_web_sm

In [None]:
nltk.download('stopwords')

In [None]:
from src.data_preprocessing import preprocess_pipeline

# Apply text preprocessing to the 'review' column
# See src/data_preprocessing.py for detailed implementation of preprocess_pipeline
data['review'] = data['review'].apply(lambda x: preprocess_pipeline(x, use_stemming=True))

In [None]:
data.head()

# 🎯 **Label Encoding**

We convert the sentiment labels from text (`"positive"`, `"negative"`) to numerical format (0 and 1) using `LabelBinarizer`.

- **`LabelBinarizer`** transforms categorical labels into binary values:
  - 0 for **negative**
  - 1 for **positive**

scikit-learn classifiers can also train on string labels, but a 0/1 target makes the positive class explicit for metrics such as precision, recall, and F1-Score.
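
The 0/1 assignment follows from the alphabetical order of the class names. The pure-Python sketch below mimics that behaviour for illustration only; it is not the `LabelBinarizer` implementation, and the variable names are hypothetical.

```python
labels = ["positive", "negative", "negative", "positive"]

# Classes are sorted, so "negative" < "positive" gives negative -> 0, positive -> 1.
classes = sorted(set(labels))
class_to_id = {name: idx for idx, name in enumerate(classes)}

# One row per sample, one column: the same (n_samples, 1) layout that
# LabelBinarizer.fit_transform returns for a two-class problem.
encoded = [[class_to_id[name]] for name in labels]

print(class_to_id)   # -> {'negative': 0, 'positive': 1}
print(encoded)       # -> [[1], [0], [0], [1]]
```

Because the encoded target is a column rather than a flat vector, some estimators may warn about its shape; flattening it (for example with `.ravel()`) is a common remedy if that happens.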


In [None]:
# Encode sentiment labels to 0 (negative) and 1 (positive)
label = LabelBinarizer()
sentiment_binary = label.fit_transform(data['sentiment'])

# **🧪 Train-Test Split**

We split the dataset into training and testing sets to evaluate model performance on unseen data.

- **`test_size=0.4`**: 40% of data is reserved for testing, 60% for training.  
- **`random_state=42`**: Ensures reproducibility by fixing the random seed.  
- **`stratify=data['sentiment']`**: Maintains the same class distribution in both train and test sets to avoid imbalance.
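
To make the effect of stratification concrete, here is a minimal pure-Python sketch of a stratified 60/40 split. It is not the algorithm `train_test_split` uses internally; the function name `stratified_split` and the toy labels are assumptions for this example.

```python
import random
from collections import Counter, defaultdict

def stratified_split(labels, test_size=0.4, seed=42):
    """Return (train_idx, test_idx) so that each class keeps its share in both parts."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    rng = random.Random(seed)            # fixed seed -> reproducible split
    train_idx, test_idx = [], []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        n_test = round(len(indices) * test_size)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)

toy_labels = ["positive"] * 10 + ["negative"] * 10
train_idx, test_idx = stratified_split(toy_labels)

print(Counter(toy_labels[i] for i in train_idx))  # 6 positive, 6 negative
print(Counter(toy_labels[i] for i in test_idx))   # 4 positive, 4 negative
```

Applied to the 50K reviews, the same 60/40 ratio yields 30,000 training and 20,000 test reviews.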

In [None]:
X_train, X_test, y_train, y_test = train_test_split(
    data['review'],
    sentiment_binary,
    test_size=0.4,          # 60% train, 40% test split
    random_state=42,        # for reproducibility
    stratify=data['sentiment']  # keep class balance in splits
)

In [None]:
print("X_train shape:", X_train.shape)
print("X_test shape:", X_test.shape)
print("y_train shape:", y_train.shape)
print("y_test shape:", y_test.shape)

## **Feature Extraction**
A utility function converts raw text into numerical features using a given vectorizer (e.g., `CountVectorizer` for BoW, `TfidfVectorizer` for TF-IDF).


#### **1. CountVectorizer (Bag of Words)**
Converts text into token counts with filtering to reduce noise:
- `min_df=5` → ignore terms that appear in fewer than 5 documents  
- `max_df=0.9` → ignore terms that appear in more than 90% of documents  
- `ngram_range=(1,3)` → include unigrams, bigrams, and trigrams
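
The sketch below shows what n-gram generation and document-frequency filtering do on a toy corpus. It is a simplified stand-in for `CountVectorizer`, not its implementation; the helper names `ngrams` and `build_vocabulary` are hypothetical, and `min_df=2` is used instead of 5 only because the toy corpus has four documents.

```python
from collections import Counter

def ngrams(tokens, n_min=1, n_max=3):
    """All contiguous n-grams with n_min <= n <= n_max, joined by spaces."""
    return [
        " ".join(tokens[i:i + n])
        for n in range(n_min, n_max + 1)
        for i in range(len(tokens) - n + 1)
    ]

def build_vocabulary(docs, min_df=2, max_df=0.9):
    """Keep terms whose document frequency is >= min_df (a count)
    and <= max_df (a fraction of all documents)."""
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(ngrams(doc.split())))   # count each term once per document
    max_count = max_df * len(docs)
    return sorted(t for t, df in doc_freq.items() if min_df <= df <= max_count)

docs = ["not good movie", "good movie", "bad movie", "good plot"]

print(ngrams("not good movie".split()))
# -> ['not', 'good', 'movie', 'not good', 'good movie', 'not good movie']
print(build_vocabulary(docs))
# -> ['good', 'good movie', 'movie']
```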

#### **2. TF-IDF Vectorizer**
Represents text by term importance:
- A term gets a high weight when it is frequent in a document but rare across the corpus
- Terms that appear in most reviews are down-weighted, which raw counts cannot do
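
A small worked example of the weighting, assuming the smoothed IDF formula and L2 normalisation that scikit-learn documents as the `TfidfVectorizer` defaults. The function `tfidf_vector` and the toy corpus are hypothetical and only illustrate the arithmetic.

```python
import math
from collections import Counter

def tfidf_vector(doc, corpus):
    """TF-IDF weights for one document: raw term counts times a smoothed IDF,
    idf(t) = ln((1 + n) / (1 + df(t))) + 1, followed by L2 normalisation."""
    n_docs = len(corpus)
    doc_freq = Counter()
    for other in corpus:
        doc_freq.update(set(other.split()))

    counts = Counter(doc.split())
    weights = {
        term: tf * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
        for term, tf in counts.items()
    }
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {term: w / norm for term, w in weights.items()}

corpus = ["good movie", "good plot", "bad movie"]
weights = tfidf_vector("good plot", corpus)
print(weights)
```

Here `plot` occurs in one document and `good` in two, so `plot` receives the larger weight even though both appear once in the review.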



### `featured_data`
Stores train/test splits for different feature extraction methods.  
```python
featured_data = {
    "BoW": {
        "train": ...,   # sparse matrix: X_train after CountVectorizer
        "test": ...     # sparse matrix: X_test after CountVectorizer
    },
    "TF-IDF": {
        "train": ...,   # sparse matrix: X_train after TfidfVectorizer
        "test": ...     # sparse matrix: X_test after TfidfVectorizer
    }
}
```


In [None]:
from src.feature_extraction import initialize_featured_data, extract_all_features
# See src/feature_extraction.py for function details (`initialize_featured_data` and `extract_all_features`)

featured_data = initialize_featured_data()

# Extract all features for training and testing data
featured_data = extract_all_features(featured_data, X_train, X_test)

# Check the keys to see available feature types (e.g., 'BoW', 'TF-IDF')
featured_data.keys()

In [None]:
# Print Some sample output
print(f"BoW-train-Shape : {featured_data['BoW']['train'].shape}")
print(f"TF-IDF-test-Shape : {featured_data['TF-IDF']['test'].shape}")

## 🔹 **Model Training Plan**  

We will now train and evaluate multiple machine learning models for sentiment classification.  
The goal is to compare different algorithms and feature extraction techniques to determine which combination works best for IMDB movie reviews.  

### Models to Train
We will train **four classifiers**, each with both BoW and TF-IDF features:  
- **Logistic Regression (LR)**  
- **Multinomial Naive Bayes (MNB)**  
- **Stochastic Gradient Descent Classifier (SGDClassifier)**  
- **Linear Support Vector Classifier (LinearSVC)**  

## Dictionary Structures Used

To organize predictions, evaluation scores, and error analysis, we maintain a few structured dictionaries:

### 1. `results`
Stores predictions and metrics for each model, version, and feature type.  
```python
results = {
    "LogisticRegression": {
        "baseline": {
            "BoW": {"predictions": ..., "f1_score": ...},
            "TF-IDF": {...}
        },
        "tuned": {...}
    },
    "LinearSVC": {...}
}
```


In [None]:
from src.model_training import initialize_model_dict_structure, train_models_and_store_predictions

# Initialize structured dictionary to store models and predictions
# See src/model_training.py for function details
results = initialize_model_dict_structure()


In [None]:
# Train all baseline models on extracted features and store their predictions in the results dictionary
results = train_models_and_store_predictions(results, featured_data, y_train, mode="baseline")

In [None]:
# Print Some sample output
print(f"Prediction array shape of LogisticRegression-Baseline-BoW model : {results['LogisticRegression']['baseline']['BoW']['predictions'].shape}")
print(f"Prediction array shape of MultinomialNB-Baseline-TF-IDF model : {results['MultinomialNB']['baseline']['TF-IDF']['predictions'].shape}")

# **📊 Model Evaluation**

After training all models with both **Bag of Words (BoW)** and **TF-IDF** features, the next step is to **evaluate their performance**.

We will calculate:  

1. **Accuracy Score** – Proportion of correctly classified reviews out of all predictions.
2. **Classification Report** – Includes **Precision, Recall, F1-score** for each class (positive & negative).  

The classification report provides a detailed evaluation of model performance by showing:

- **Precision:** How many of the predicted positive reviews are actually positive.  
- **Recall:** How many of the actual positive reviews were correctly identified.  
- **F1-Score:** Harmonic mean of precision and recall, balancing both metrics.  
- **Support:** Number of true instances for each class in the test set.
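
These definitions can be checked by hand on a toy example. The sketch below computes them for the positive class from scratch; the function name and the toy labels are made up for illustration.

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Precision, recall and F1 for the class given by `positive`."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0]   # 2 TP, 2 FN, 1 FP, 3 TN

precision, recall, f1 = precision_recall_f1(y_true, y_pred)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
# -> precision=0.667 recall=0.500 f1=0.571
```

The harmonic mean sits closer to the lower of the two values, which is why F1 penalises a model that trades recall for precision or vice versa.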

#### **🔄 Cross-Validation**

Cross-validation splits the training data into multiple folds, training and testing the model on different subsets.

- Provides a **more reliable estimate** of model performance than a single train-test split.
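
The fold mechanics can be sketched as follows. The choice of `k=5` and the helper name `k_fold_indices` are assumptions; the number of folds used inside `calculate_cross_validation` is not stated here.

```python
def k_fold_indices(n_samples, k=5):
    """Yield (train_idx, val_idx) pairs; each sample is validated exactly once."""
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, val_idx
        start += size

folds = list(k_fold_indices(10, k=5))
for train_idx, val_idx in folds:
    print("validate on", val_idx, "train on", len(train_idx), "samples")
```

Averaging the score over all folds gives the cross-validation estimate; the test set stays untouched throughout.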

In [None]:
# Calculate and store cross-validation accuracy for baseline models
from src.model_evaluation import calculate_cross_validation

results = calculate_cross_validation(results, featured_data, y_train, mode="baseline")

In [None]:
# Print Some sample output

print(f"Cross-Validation Score of SGDClassifier-Baseline-BoW model : {results['SGDClassifier']['baseline']['BoW']['cv_accuracy']}")
print(f"Cross-Validation Score of LinearSVC-Baseline-TF-IDF model : {results['LinearSVC']['baseline']['TF-IDF']['cv_accuracy']}")

In [None]:
# Evaluate all baseline models: store accuracy, classification report (precision, recall, F1-score separately), and confusion matrix
from src.model_evaluation import evaluate_and_update_metrics

results = evaluate_and_update_metrics(results, featured_data, y_test, mode="baseline")

In [None]:
# Print some sample output as a sanity check

prec = results['LogisticRegression']['baseline']['BoW']['precision']
rec = results['LogisticRegression']['baseline']['BoW']['recall']
f1 = results['LogisticRegression']['baseline']['BoW']['f1_score']
print(f"Precision, Recall and F1-score of Logistic Regression model : {prec},{rec},{f1}")

#### 📊 Visualize Confusion Matrix

A confusion matrix shows how well a classification model performs by displaying counts of:

- **True Positives (TP)**: Correctly predicted positive cases  
- **True Negatives (TN)**: Correctly predicted negative cases  
- **False Positives (FP)**: Incorrectly predicted positives (Type I error)  
- **False Negatives (FN)**: Incorrectly predicted negatives (Type II error)

It shows which types of errors the model makes, which overall accuracy alone hides.
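
As a quick illustration, the four cells can be counted directly from true and predicted labels. The helper `confusion_counts` and the toy labels are hypothetical; 1 stands for positive and 0 for negative, as in the label encoding above.

```python
from collections import Counter

def confusion_counts(y_true, y_pred):
    """Count TP, TN, FP and FN for binary labels (1 = positive, 0 = negative)."""
    names = {(1, 1): "TP", (0, 0): "TN", (0, 1): "FP", (1, 0): "FN"}
    counts = Counter(names[(t, p)] for t, p in zip(y_true, y_pred))
    return {name: counts.get(name, 0) for name in ("TP", "TN", "FP", "FN")}

y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0]

cells = confusion_counts(y_true, y_pred)
accuracy = (cells["TP"] + cells["TN"]) / len(y_true)

print(cells)      # -> {'TP': 2, 'TN': 3, 'FP': 1, 'FN': 2}
print(accuracy)   # -> 0.625
```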

In [None]:
# Visualize confusion matrices for all baseline models and feature types
from src.utils import visualize_confusion_matrix
visualize_confusion_matrix(results, featured_data, mode="baseline")

## 🔧 Hyperparameter Tuning of Models

After establishing baseline performance, the next step is **hyperparameter tuning** to optimize each model.  

We will use **GridSearchCV** on both **Bag of Words (BoW)** and **TF-IDF** features to systematically search for the best parameters.  

### Steps:
1. **Define parameter grids** for each model (Logistic Regression, Multinomial Naive Bayes, SGDClassifier, and LinearSVC).  
2. **Run GridSearchCV** to perform cross-validation and find the best parameters.  
3. **Retrain each model** using the best parameters found.  
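
Conceptually, a grid search enumerates every parameter combination, scores each one, and keeps the best. The sketch below shows that loop with a toy scoring function standing in for the cross-validated score; the grid values and function names are hypothetical and are not the grids returned by `get_hyperparameter_grids()`.

```python
from itertools import product

def grid_search(param_grid, score_fn):
    """Try every combination in param_grid and return (best_params, best_score)."""
    names = sorted(param_grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = score_fn(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score

# Hypothetical grid: 4 x 2 = 8 combinations to evaluate.
param_grid = {"C": [0.01, 0.1, 1, 10], "penalty": ["l1", "l2"]}

def toy_score(params):
    """Stand-in for a cross-validated score; peaks at C=1 with the l2 penalty."""
    return -abs(params["C"] - 1) + (0.05 if params["penalty"] == "l2" else 0.0)

best_params, best_score = grid_search(param_grid, toy_score)
print(best_params)   # -> {'C': 1, 'penalty': 'l2'}
```

In the real search each score is itself an average over cross-validation folds, so the cost grows with the number of combinations times the number of folds.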


In [None]:
# Tune all models and store tuned predictions
from src.hyperparameter_tuning import get_hyperparameter_grids, tune_all_models
param_grids = get_hyperparameter_grids()
results = tune_all_models(results, featured_data, param_grids, y_train)

In [None]:
# Print some sample output
print(f"Prediction array shape of SGDClassifier-Tuned-BoW model : {results['SGDClassifier']['tuned']['BoW']['predictions'].shape}")
print(f"Prediction array shape of LinearSVC-Tuned-TF-IDF model : {results['LinearSVC']['tuned']['TF-IDF']['predictions'].shape}")

In [None]:
# Evaluate all tuned models: store accuracy, classification report (precision, recall, F1-score separately), and confusion matrix
results = evaluate_and_update_metrics(results, featured_data, y_test, mode="tuned")

In [None]:
# Visualize confusion matrices for all tuned models
visualize_confusion_matrix(results, featured_data, mode="tuned")

## **Model Performance Comparison**
In this section, we consolidate the evaluation results of all models and feature types into a single **comparison DataFrame**.
This provides a structured view of **baseline vs. tuned** performance for each model across **BoW and TF-IDF** representations.
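
One way to build such a table is to flatten the nested `results` dictionary into one row per model, version, and feature type. The sketch below is not the code of `create_comparison_dataframe`; the function name, the column names, and the numbers are placeholders chosen for illustration.

```python
def flatten_results(results, metric_keys=("f1_score", "cv_accuracy")):
    """Turn results[model][version][feature] into a flat list of row dicts."""
    rows = []
    for model_name, versions in results.items():
        for version, features in versions.items():
            for feature_name, entry in features.items():
                row = {"Model": model_name, "Version": version, "Feature": feature_name}
                row.update({key: entry.get(key) for key in metric_keys})
                rows.append(row)
    return rows

# Placeholder values, not results from this notebook.
toy_results = {
    "LogisticRegression": {
        "baseline": {"BoW": {"f1_score": 0.5, "cv_accuracy": 0.5}},
        "tuned": {"TF-IDF": {"f1_score": 0.75, "cv_accuracy": 0.75}},
    },
}

rows = flatten_results(toy_results)
for row in rows:
    print(row)
```

Passing `rows` to `pandas.DataFrame` would then give a table with one line per configuration.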


In [None]:
# Comparing results across models with featured data
from src.model_performance_comparison import create_comparison_dataframe

df_results = create_comparison_dataframe(results, featured_data)
df_results


In [None]:
# Visualize model comparison table
from src.utils import visualize_model_comparison

visualize_model_comparison(df_results)

### 🔎 Identifying Models to Compare

During model selection, we observed that both the **best baseline** and the **best tuned** models turned out to be the same:  
**LinearSVC with TF-IDF features**, achieving the highest F1-Score overall.  

Since comparing two identical models does not provide meaningful insights, we adopt the following strategy:  

1. **Best Tuned Model**  
   - Selected strictly by highest F1-Score.  
   - In this run, it is a **TF-IDF–based tuned model** (LinearSVC).  

2. **Baseline Model for Comparison**  
   - To ensure diversity, we restrict the baseline to use a **different feature representation (BoW)**.  
   - We then select the **best-performing baseline model** within BoW features (Logistic Regression).  

This approach allows us to compare:  
- **LinearSVC + TF-IDF (tuned)** vs. **Logistic Regression + BoW (baseline)**  

✅ This gives a more informative comparison between a strong tuned model and a simpler baseline built on different features.
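
The selection rule can be sketched as a filter followed by a maximum. This is not the code of `get_best_baseline_and_tuned`; the helper `pick_best`, the column names `Model`, `Version`, and `Feature`, and the scores are placeholders chosen only to reproduce the ordering described above.

```python
def pick_best(rows, version, feature=None, metric="F1-Score"):
    """Best row for a version, optionally restricted to one feature type."""
    candidates = [
        r for r in rows
        if r["Version"] == version and (feature is None or r["Feature"] == feature)
    ]
    return max(candidates, key=lambda r: r[metric])

# Placeholder scores, not results from this notebook.
rows = [
    {"Model": "LogisticRegression", "Version": "baseline", "Feature": "BoW", "F1-Score": 0.3},
    {"Model": "MultinomialNB", "Version": "baseline", "Feature": "BoW", "F1-Score": 0.2},
    {"Model": "LinearSVC", "Version": "baseline", "Feature": "TF-IDF", "F1-Score": 0.4},
    {"Model": "LinearSVC", "Version": "tuned", "Feature": "TF-IDF", "F1-Score": 0.5},
]

best_tuned = pick_best(rows, "tuned")                       # strictly by F1-Score
best_baseline = pick_best(rows, "baseline", feature="BoW")  # restricted to BoW

print(best_tuned["Model"], best_tuned["Feature"])         # -> LinearSVC TF-IDF
print(best_baseline["Model"], best_baseline["Feature"])   # -> LogisticRegression BoW
```

Without the `feature="BoW"` restriction, the baseline pick would also be LinearSVC with TF-IDF, which is the situation the strategy is designed to avoid.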


### `best_two_models` (Sample Structure)

```python
best_two_models = {
    "baseline": {"full-name": "...", "predictions": [...], "f1-score": ...},
    "tuned":    {"full-name": "...", "predictions": [...], "f1-score": ...}
}
```


In [None]:
from src.select_best_models import get_best_baseline_and_tuned
best_two_models = get_best_baseline_and_tuned(df_results, results, metric="F1-Score")

for _,info in best_two_models.items():
  print(f"{info['full-name']} : F1-Score = {info['f1-score']}")



### **Error Analysis Initialization & Computation**

1. **Initialize Error Analysis Dictionary**  
   Using `initialize_error_analysis_dict(best_two_models)`, we create a structured dictionary to store:
   - Model details (`name`, `feature`, `f1_score`, etc.)
   - False positives (`false_positives`)
   - False negatives (`false_negatives`)
   - Their corresponding indexes (`fp_indexes`, `fn_indexes`)

2. **Compute FP & FN for Best Models**  
   `compute_bestModels_all_fp_fn()` fills the dictionary with the false positives and false negatives for:
   - Best baseline model
   - Best tuned model  

3. **Quick Overview**  
   We print the counts of FP and FN for each model/version to get an immediate sense of misclassifications.
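
The core of step 2 is a comparison of true and predicted labels that keeps track of which reviews were misclassified. A minimal sketch, assuming 1 = positive and 0 = negative; the helper `fp_fn_indexes` and the sample IDs are hypothetical and do not reflect the signature of `compute_bestModels_all_fp_fn`.

```python
def fp_fn_indexes(y_true, y_pred, sample_ids):
    """Return the IDs of false positives and false negatives."""
    triples = list(zip(sample_ids, y_true, y_pred))
    fp = [i for i, t, p in triples if t == 0 and p == 1]   # predicted positive, actually negative
    fn = [i for i, t, p in triples if t == 1 and p == 0]   # predicted negative, actually positive
    return fp, fn

sample_ids = [12, 7, 33, 4, 21]   # e.g. row labels of the test reviews
y_true = [1, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 1]

fp_ids, fn_ids = fp_fn_indexes(y_true, y_pred, sample_ids)
print("FP:", fp_ids)   # -> FP: [7, 21]
print("FN:", fn_ids)   # -> FN: [4]
```

Keeping the original row labels makes it possible to look up the raw review text in `data_OG` for each error.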


### `error_analysis_dict` (Sample Structure)

```python
error_analysis_dict = {
    "baseline": {
        "full-name": "...", "name": "...", "feature": "...",
        "false_positives": ..., "false_negatives": ...,
        "fp_indexes": ..., "fn_indexes": ...
    },
    "tuned": {
        "full-name": "...",   # remaining keys as in "baseline"
        "fn_indexes": ...
    }
}
```


In [None]:
from src.error_analysis import initialize_error_analysis_dict, compute_bestModels_all_fp_fn

error_analysis_dict = initialize_error_analysis_dict(best_two_models)
error_analysis_dict = compute_bestModels_all_fp_fn(error_analysis_dict, best_two_models, data_OG, data, X_test, y_test)

for version, info in error_analysis_dict.items():
  print(f"{info['name']}-{version}-({info['feature']}) : FP = {len(info['false_positives'])} , FN = {len(info['false_negatives'])}")

### Writing False Positives and False Negatives to Text Files

For deeper insight into model misclassifications, we focus on the **two selected models**:

1. **LinearSVC – Tuned – (TF-IDF)** (best overall model by F1-score)  
2. **Logistic Regression – Baseline – (BoW)** (best baseline with different feature representation)  

We generate **three text files** for analysis:

1. **FP & FN shared by both models** – Reviews misclassified by **both models**.  
2. **FP & FN made by Logistic Regression only** – Errors unique to the **baseline BoW model**.  
3. **FP & FN made by LinearSVC only** – Errors unique to the **tuned TF-IDF model**.  

This approach helps us identify:  
- **Common hard cases** that challenge both models.  
- **Model-specific weaknesses**, highlighting where one model struggles while the other succeeds.  
- **Error patterns** that may guide future improvements in preprocessing, feature extraction, or model design.  
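
The three groups correspond to a set intersection and two set differences over the misclassified review indexes. The sketch below illustrates the idea with made-up index sets; it is not the implementation of `generate_error_analysis_separate_dataFrames`, and the helper name `split_errors` is hypothetical.

```python
def split_errors(baseline_errors, tuned_errors):
    """Partition misclassified indexes into shared and model-specific groups."""
    baseline_errors, tuned_errors = set(baseline_errors), set(tuned_errors)
    return {
        "both_models": sorted(baseline_errors & tuned_errors),     # hard for both
        "only_baseline": sorted(baseline_errors - tuned_errors),   # baseline-specific
        "only_tuned": sorted(tuned_errors - baseline_errors),      # tuned-specific
    }

# Made-up indexes of misclassified reviews for each model.
groups = split_errors(baseline_errors=[3, 8, 15, 21], tuned_errors=[8, 21, 30])
for name, indexes in groups.items():
    print(name, indexes)
# -> both_models [8, 21]
# -> only_baseline [3, 15]
# -> only_tuned [30]
```

The same split would be applied separately to false positives and false negatives.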


### `separate_fp_fn_df` (Sample Structure)

```python
separate_fp_fn_df = {
    "both_models": {
        "file-name": ...,   # name of the output text file
        "fp": ...,          # DataFrame of false positives
        "fn": ...           # DataFrame of false negatives
    },
    "only_baseline": {
        "file-name": ...,   # remaining keys as in "both_models"
    },
    "only_tuned": {
        ...                 # same structure
    }
}
```


In [None]:
from src.error_analysis import generate_error_analysis_separate_dataFrames

baseline_name = error_analysis_dict['baseline']['full-name']
tuned_name = error_analysis_dict['tuned']['full-name']

separate_fp_fn_df = generate_error_analysis_separate_dataFrames(error_analysis_dict,baseline_name, tuned_name)

for key, value in separate_fp_fn_df.items():
  print(f"{key} : FP = {len(value['fp'])} , FN = {len(value['fn'])}")

In [None]:
from src.reporting import create_three_text_files
create_three_text_files(separate_fp_fn_df)