# Problem Statement

### The goal of this project is to build a sentiment analysis system for movie reviews using the IMDB dataset. The system should automatically classify each review as positive or negative, and investigate how different text representation techniques affect the performance of sentiment classification models.

# Objectives

### 1. Data Preprocessing
### 2. Text Representation
### 3. Model Training
### 4. Model Evaluation
### 5. Comparison and Analysis


# Data Description

### This project uses the IMDB Movie Reviews Dataset containing 50,000 movie reviews, each labeled as either positive or negative. The dataset has two main columns:
### review – the text of the movie review
### sentiment – the target label ("positive" or "negative"), which is converted to a numeric label (1 = positive, 0 = negative) for modeling.
### For the experiments, a random subset of 20,000 reviews is sampled and then split into training (80%, 16,000 reviews) and test (20%, 4,000 reviews) sets. The split is stratified by label, so both sets keep a class balance of approximately 50–50.

# Imports and setup

In [1]:
import re
import numpy as np
import pandas as pd

from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

import random

# For reproducibility
random.seed(42)
np.random.seed(42)


#  Load & preprocess data


In [3]:

#  Load IMDB dataset (expects columns 'review', 'sentiment')
df = pd.read_csv("IMDB Dataset.csv")

# Map sentiment text -> numeric labels
df["label"] = df["sentiment"].map({"negative": 0, "positive": 1})

print("Original rows:", len(df))

#  (Optional) shrink dataset so it's lighter (you can change this)
MAX_ROWS = 20000    # you can reduce further if your PC is slow
if len(df) > MAX_ROWS:
    df = df.sample(MAX_ROWS, random_state=42).reset_index(drop=True)

print("Using rows:", len(df))

#  Train/test split
X = df["review"].values
y = df["label"].values

X_train_raw, X_test_raw, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,
    random_state=42,
    stratify=y
)

print("Train size:", len(X_train_raw))
print("Test size :", len(X_test_raw))


#  Preprocessing function (cleaning)
def clean_text(text: str) -> str:
    """
    Simple text preprocessing:
    - lowercase
    - remove HTML tags
    - remove URLs
    - keep only letters and spaces
    - squeeze multiple spaces
    """
    text = text.lower()
    text = re.sub(r"<.*?>", " ", text)          # remove HTML
    text = re.sub(r"http\S+", " ", text)        # remove URLs
    text = re.sub(r"[^a-z\s]", " ", text)       # keep only letters & spaces
    text = re.sub(r"\s+", " ", text).strip()    # remove extra spaces
    return text

# Apply cleaning
X_train_clean = [clean_text(t) for t in X_train_raw]
X_test_clean  = [clean_text(t) for t in X_test_raw]

print("\nExample raw review:\n", X_train_raw[0][:200])
print("\nExample cleaned review:\n", X_train_clean[0][:200])


Original rows: 50000
Using rows: 20000
Train size: 16000
Test size : 4000

Example raw review:
 Im a huge M Lillard fan that's why I ended up watching this movie. Honestly I doubt that if he wasn't in the movie i would of enjoyed it as much or even watched it but once I did watch it realize the 

Example cleaned review:
 im a huge m lillard fan that s why i ended up watching this movie honestly i doubt that if he wasn t in the movie i would of enjoyed it as much or even watched it but once i did watch it realize the s
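
The cleaned example shows a side effect of replacing every non-letter with a space: contractions are split, so "wasn't" becomes "wasn t". `CountVectorizer` and `TfidfVectorizer` ignore single-character tokens by default, so the "t" is dropped and only "wasn" survives. The sketch below is one possible variant, assuming it is acceptable to merge a contraction into a single token. The function name `clean_text_keep_contractions` and the sample string are hypothetical and not part of the original notebook.

```python
import re

def clean_text_keep_contractions(text: str) -> str:
    """Variant of clean_text that removes apostrophes before stripping non-letters."""
    text = text.lower()
    text = re.sub(r"<.*?>", " ", text)          # remove HTML tags
    text = re.sub(r"http\S+", " ", text)        # remove URLs
    text = re.sub(r"['’]", "", text)            # "wasn't" -> "wasnt" (assumption: merging is acceptable)
    text = re.sub(r"[^a-z\s]", " ", text)       # keep only letters and spaces
    text = re.sub(r"\s+", " ", text).strip()    # squeeze repeated whitespace
    return text

# Hypothetical sample, loosely modelled on the review shown above
sample = "I doubt that if he wasn't in the movie...<br /><br />It's GREAT! http://example.com"
cleaned = clean_text_keep_contractions(sample)
print(cleaned)  # i doubt that if he wasnt in the movie its great
```

The trade-off is that "it's" and the possessive "its" collapse into the same token, so whether this helps would have to be checked on the test set.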


# Build text representations


## Bag of Words

In [None]:

print("\n[2.1] Building Bag of Words features...")
bow_vectorizer = CountVectorizer(max_features=5000)
X_train_bow = bow_vectorizer.fit_transform(X_train_clean)
X_test_bow  = bow_vectorizer.transform(X_test_clean)
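
To make the Bag of Words step concrete, here is a simplified pure-Python sketch of the counting that `CountVectorizer` performs: tokenize, keep the most frequent terms, and count them per document. The helper names `fit_vocabulary` and `transform` and the toy reviews are hypothetical. The sketch ignores details of the real class, such as sparse output and its exact tie-breaking when `max_features` cuts the vocabulary.

```python
import re
from collections import Counter

# Tokens of two or more word characters, mirroring CountVectorizer's default pattern.
TOKEN_RE = re.compile(r"\b\w\w+\b")

def fit_vocabulary(docs, max_features=None):
    """Keep the max_features most frequent terms and index them alphabetically."""
    totals = Counter()
    for doc in docs:
        totals.update(TOKEN_RE.findall(doc.lower()))
    kept = [term for term, _ in totals.most_common(max_features)]
    return {term: idx for idx, term in enumerate(sorted(kept))}

def transform(docs, vocab):
    """Count vocabulary terms per document; words outside the vocabulary are ignored."""
    rows = []
    for doc in docs:
        counts = Counter(TOKEN_RE.findall(doc.lower()))
        rows.append([counts[term] for term in vocab])
    return rows

toy_train = ["a great great movie", "a boring movie", "great acting"]  # hypothetical data
vocab = fit_vocabulary(toy_train)
bow_train = transform(toy_train, vocab)
bow_test = transform(["terrible movie"], vocab)  # "terrible" was never seen in training

print(vocab)      # {'acting': 0, 'boring': 1, 'great': 2, 'movie': 3}
print(bow_train)  # [[0, 0, 2, 1], [0, 1, 0, 1], [1, 0, 1, 0]]
print(bow_test)   # [[0, 0, 0, 1]]
```

The vocabulary is learned from the training reviews only, which is why the notebook calls `fit_transform` on the training set and plain `transform` on the test set.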


## TF-IDF

In [24]:
print("\n[2.2] Building TF-IDF features...")
tfidf_vectorizer = TfidfVectorizer(max_features=5000)
X_train_tfidf = tfidf_vectorizer.fit_transform(X_train_clean)
X_test_tfidf  = tfidf_vectorizer.transform(X_test_clean)



[2.2] Building TF-IDF features...
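
As a rough illustration of how TF-IDF reweights raw counts, the sketch below applies the weighting that `TfidfVectorizer` uses with its default settings: a smoothed inverse document frequency, `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalisation of each row. The function name `tfidf_rows` and the toy count matrix are hypothetical, and the code works on plain lists rather than sparse matrices.

```python
import math

def tfidf_rows(count_rows):
    """Apply smoothed IDF weighting and L2-normalise each document row."""
    n_docs = len(count_rows)
    n_terms = len(count_rows[0])
    # Document frequency: in how many documents each term occurs
    df = [sum(1 for row in count_rows if row[j] > 0) for j in range(n_terms)]
    idf = [math.log((1 + n_docs) / (1 + d)) + 1.0 for d in df]
    weighted_rows = []
    for row in count_rows:
        weighted = [tf * w for tf, w in zip(row, idf)]
        norm = math.sqrt(sum(v * v for v in weighted)) or 1.0  # avoid dividing by zero
        weighted_rows.append([v / norm for v in weighted])
    return weighted_rows, idf

# Hypothetical counts; columns: acting, boring, great, movie
counts = [[0, 0, 2, 1],
          [0, 1, 0, 1],
          [1, 0, 1, 0]]
tfidf, idf = tfidf_rows(counts)

print([round(w, 3) for w in idf])       # rarer terms receive a larger IDF
print([round(v, 3) for v in tfidf[1]])  # "boring" outweighs the more common "movie"
```

In the second toy document both words occur once, but "boring" appears in fewer documents than "movie" and therefore ends up with the larger weight.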


## GloVe-like dense embeddings (SVD on TF-IDF) 

In [29]:
print("\n[2.3] Building GloVe-like dense features (SVD on TF-IDF)...")
svd_glove = TruncatedSVD(n_components=100, random_state=42)
X_train_glove_like = svd_glove.fit_transform(X_train_tfidf)
X_test_glove_like  = svd_glove.transform(X_test_tfidf)

print("GloVe-like shape (train):", X_train_glove_like.shape)



[2.3] Building GloVe-like dense features (SVD on TF-IDF)...
GloVe-like shape (train): (16000, 100)
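
Truncated SVD applied to a TF-IDF matrix is usually called latent semantic analysis (LSA). It produces one dense vector per review, whereas GloVe produces one pretrained vector per word, so "GloVe-like" should be read loosely. The sketch below shows the projection on a small hypothetical matrix using NumPy's full SVD. The matrix `X` and the value `k = 2` are illustrative only. `TruncatedSVD` computes the same kind of projection with an approximate solver, possibly with different signs per component.

```python
import numpy as np

# Hypothetical document-term matrix: 4 documents, 4 terms
X = np.array([
    [0.0, 0.0, 2.0, 1.0],
    [0.0, 1.0, 0.0, 1.0],
    [1.0, 0.0, 1.0, 0.0],
    [0.0, 1.0, 1.0, 1.0],
])
k = 2  # number of dense dimensions to keep (the notebook uses 100)

U, S, Vt = np.linalg.svd(X, full_matrices=False)

# Project each document onto the top-k right singular vectors
doc_vectors = X @ Vt[:k].T

# Share of the matrix's squared Frobenius norm kept by the top-k components
retained = float((S[:k] ** 2).sum() / (S ** 2).sum())

print(doc_vectors.shape)  # (4, 2)
print(round(retained, 3))
```

On the real data, the `explained_variance_ratio_` attribute of the fitted `TruncatedSVD` object gives a comparable indication of how much information the 100 components keep.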


## Word2Vec-style dense embeddings (SVD on BoW)

In [32]:
print("\n[2.4] Word2Vec-style (SVD on BoW)...")
svd_w2v = TruncatedSVD(n_components=100, random_state=42)
X_train_w2v = svd_w2v.fit_transform(X_train_bow)
X_test_w2v  = svd_w2v.transform(X_test_bow)
print("Word2Vec-style shape (train):", X_train_w2v.shape)


[2.4] Word2Vec-style (SVD on BoW)...
Word2Vec-style shape (train): (16000, 100)
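
The Word2Vec-style features are again an SVD projection, this time of the raw counts. For comparison, a common way to turn actual word vectors into one vector per review is to average the vectors of the words it contains. The sketch below illustrates that idea with a hypothetical two-dimensional embedding table. The names `toy_vectors` and `average_embedding` are illustrative, and real Word2Vec or GloVe vectors would have to be loaded from a pretrained model, which this notebook does not do.

```python
# Hypothetical 2-d word vectors; real embeddings typically have 50 to 300 dimensions
toy_vectors = {
    "great":  [0.9, 0.1],
    "boring": [-0.8, 0.2],
    "movie":  [0.0, 1.0],
}

def average_embedding(text, vectors, dim):
    """Average the vectors of known words; return zeros if no word is known."""
    hits = [vectors[tok] for tok in text.split() if tok in vectors]
    if not hits:
        return [0.0] * dim
    return [sum(column) / len(hits) for column in zip(*hits)]

doc_vec = average_embedding("a great movie", toy_vectors, dim=2)
empty_vec = average_embedding("unknown words only", toy_vectors, dim=2)

print([round(v, 2) for v in doc_vec])  # [0.45, 0.55]
print(empty_vec)                       # [0.0, 0.0]
```

Unlike the SVD features, averaged embeddings draw on knowledge learned from a much larger external corpus, so the scores reported for the Word2Vec-style column should not be taken as representative of real Word2Vec vectors.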


## BERT-style deep model

In [35]:
print("\n[2.5] BERT-style (MLP on TF-IDF)...")


[2.5] BERT-style (MLP on TF-IDF)...


# Train models

In [38]:
print("\n========== TRAINING MODELS ==========")





In [40]:
#  BoW + Logistic Regression
print("\n[3.1] Training BoW + Logistic Regression...")
bow_clf = LogisticRegression(max_iter=200, solver="liblinear")
bow_clf.fit(X_train_bow, y_train)
y_pred_bow = bow_clf.predict(X_test_bow)


[3.1] Training BoW + Logistic Regression...


In [42]:
#  TF-IDF + Logistic Regression
print("\n[3.2] Training TF-IDF + Logistic Regression...")
tfidf_clf = LogisticRegression(max_iter=200, solver="liblinear")
tfidf_clf.fit(X_train_tfidf, y_train)
y_pred_tfidf = tfidf_clf.predict(X_test_tfidf)



[3.2] Training TF-IDF + Logistic Regression...


In [44]:
#  GloVe-like + Logistic Regression
print("\n[3.3] Training GloVe-like + Logistic Regression...")
glove_like_clf = LogisticRegression(max_iter=200, solver="liblinear")
glove_like_clf.fit(X_train_glove_like, y_train)
y_pred_glove_like = glove_like_clf.predict(X_test_glove_like)



[3.3] Training GloVe-like + Logistic Regression...


In [46]:
#  Word2Vec-like + Logistic Regression
print("\n[3.4] Training Word2Vec-like + Logistic Regression...")
w2v_like_clf = LogisticRegression(max_iter=200, solver="liblinear")
# Use the feature matrices created in the Word2Vec-style cell (X_train_w2v / X_test_w2v)
w2v_like_clf.fit(X_train_w2v, y_train)
y_pred_w2v_like = w2v_like_clf.predict(X_test_w2v)



[3.4] Training Word2Vec-like + Logistic Regression...


In [48]:
#  BERT-like + MLPClassifier
print("\n[3.5] Training BERT-like deep model (MLP on TF-IDF)...")
bert_like_clf = MLPClassifier(
    hidden_layer_sizes=(64,),
    activation="relu",
    max_iter=10,           # keep small for speed; you can increase if you want
    random_state=42
)
bert_like_clf.fit(X_train_tfidf, y_train)
y_pred_bert_like = bert_like_clf.predict(X_test_tfidf)


[3.5] Training BERT-like deep model (MLP on TF-IDF)...
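
Each model above is trained in two separate steps: fit a vectorizer, then fit a classifier on its output. One possible way to keep the two steps together is scikit-learn's `make_pipeline`, sketched below on a handful of hypothetical reviews. The toy data and variable names are assumptions for illustration. The vectorizer and classifier settings mirror the TF-IDF + Logistic Regression cell.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical mini-dataset (1 = positive, 0 = negative)
toy_reviews = [
    "a great and moving film",
    "wonderful acting and a great story",
    "i loved this movie",
    "brilliant and funny",
    "boring and far too long",
    "terrible acting and a dull story",
    "i hated this movie",
    "awful and boring",
]
toy_labels = [1, 1, 1, 1, 0, 0, 0, 0]

# The vectorizer is fitted only on the data passed to fit(), then reused in predict()
model = make_pipeline(
    TfidfVectorizer(max_features=5000),
    LogisticRegression(max_iter=200, solver="liblinear"),
)
model.fit(toy_reviews, toy_labels)

preds = model.predict(["a great film", "boring and dull"])
print(preds)
```

A pipeline takes raw text as input, so the same object can be passed to cross-validation utilities without risk of fitting the vectorizer on held-out reviews.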




#  Evaluate models

In [51]:


def compute_metrics(y_true, y_pred):
    acc = accuracy_score(y_true, y_pred)
    prec, rec, f1, _ = precision_recall_fscore_support(
        y_true, y_pred, average="binary", zero_division=0
    )
    return {
        "Accuracy": acc,
        "Precision": prec,
        "Recall": rec,
        "F1": f1
    }

results = {}

print("\n========== EVALUATION ==========")

# 4.1 BoW
print("\n[4.1] Evaluating BoW model...")
results["BoW"] = compute_metrics(y_test, y_pred_bow)
print("BoW:", results["BoW"])

# 4.2 TF-IDF
print("\n[4.2] Evaluating TF-IDF model...")
results["TF-IDF"] = compute_metrics(y_test, y_pred_tfidf)
print("TF-IDF:", results["TF-IDF"])

# 4.3 GloVe-like
print("\n[4.3] Evaluating GloVe-like model...")
results["GloVe-like"] = compute_metrics(y_test, y_pred_glove_like)
print("GloVe-like:", results["GloVe-like"])

# 4.4 Word2Vec-like
print("\n[4.4] Evaluating Word2Vec-like model...")
results["Word2Vec-like"] = compute_metrics(y_test, y_pred_w2v_like)
print("Word2Vec-like:", results["Word2Vec-like"])

# 4.5 BERT-like
print("\n[4.5] Evaluating BERT-like model...")
results["BERT-like"] = compute_metrics(y_test, y_pred_bert_like)
print("BERT-like:", results["BERT-like"])






[4.1] Evaluating BoW model...
BoW: {'Accuracy': 0.86, 'Precision': 0.8503401360544217, 'Recall': 0.8741258741258742, 'F1': 0.8620689655172413}

[4.2] Evaluating TF-IDF model...
TF-IDF: {'Accuracy': 0.88075, 'Precision': 0.8657074340527577, 'Recall': 0.9015984015984015, 'F1': 0.8832884756545143}

[4.3] Evaluating GloVe-like model...
GloVe-like: {'Accuracy': 0.8385, 'Precision': 0.827536231884058, 'Recall': 0.8556443556443556, 'F1': 0.8413555992141454}

[4.4] Evaluating Word2Vec-like model...
Word2Vec-like: {'Accuracy': 0.79375, 'Precision': 0.7822541966426858, 'Recall': 0.8146853146853147, 'F1': 0.7981404453144115}

[4.5] Evaluating BERT-like model...
BERT-like: {'Accuracy': 0.876, 'Precision': 0.8694798822374877, 'Recall': 0.8851148851148851, 'F1': 0.8772277227722772}
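
As a worked example of how the four metrics relate, the BoW values above are consistent with the confusion counts used in the sketch below. These counts are inferred from the reported metrics and the 4,000-review test set. The notebook does not print them, so treat them as a reconstruction rather than notebook output.

```python
# Confusion counts inferred from the reported BoW metrics (not printed by the notebook)
tp, fp, fn, tn = 1750, 308, 252, 1690   # sums to the 4,000 test reviews

accuracy = (tp + tn) / (tp + fp + fn + tn)   # share of all reviews classified correctly
precision = tp / (tp + fp)                   # share of predicted positives that are positive
recall = tp / (tp + fn)                      # share of actual positives that were found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of precision and recall

print(f"Accuracy : {accuracy:.4f}")   # 0.8600
print(f"Precision: {precision:.4f}")  # 0.8503
print(f"Recall   : {recall:.4f}")     # 0.8741
print(f"F1       : {f1:.4f}")         # 0.8621
```

With `average="binary"`, precision, recall and F1 are computed for the positive class (label 1) only.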


#  Final comparison table

In [15]:
print("\n\n========== FINAL COMPARISON TABLE ==========")
print("Columns = representations (objectives)")
print("Rows    = metrics (Accuracy, Precision, Recall, F1)\n")

results_df = pd.DataFrame(results)   # columns: BoW, TF-IDF, GloVe-like, Word2Vec-like, BERT-like
print(results_df)

# Save if you want to use it in your report
results_df.to_csv("comparison_table_no_extra_installs.csv")
# Note: to_excel needs an Excel writer package such as openpyxl to be installed
results_df.to_excel("comparison_table_no_extra_installs.xlsx")
print("\nSaved comparison table to:")
print("  comparison_table_no_extra_installs.csv")
print("  comparison_table_no_extra_installs.xlsx")

# Show which representation works best in terms of F1
best_repr = results_df.loc["F1"].idxmax()
best_f1 = results_df.loc["F1", best_repr]
print(f"\nBest representation by F1-score: {best_repr} (F1 = {best_f1:.4f})")



Columns = representations (objectives)
Rows    = metrics (Accuracy, Precision, Recall, F1)

                BoW    TF-IDF  GloVe-like  Word2Vec-like  BERT-like
Accuracy   0.860000  0.880750    0.838500       0.793750   0.876000
Precision  0.850340  0.865707    0.827536       0.782254   0.869480
Recall     0.874126  0.901598    0.855644       0.814685   0.885115
F1         0.862069  0.883288    0.841356       0.798140   0.877228

Saved comparison table to:
  comparison_table_no_extra_installs.csv
  comparison_table_no_extra_installs.xlsx

Best representation by F1-score: TF-IDF (F1 = 0.8833)
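
Some of the differences in the table are small, so it helps to get a rough sense of the sampling noise on a 4,000-review test set. The sketch below computes approximate 95% intervals for each accuracy using the normal approximation to the binomial. The helper `accuracy_interval` is a hypothetical name, and the approach is only a rough guide: it treats each model independently, even though all five were scored on the same reviews.

```python
import math

n_test = 4000  # size of the test set

# Accuracies copied from the comparison table above
accuracies = {
    "BoW": 0.86,
    "TF-IDF": 0.88075,
    "GloVe-like": 0.8385,
    "Word2Vec-like": 0.79375,
    "BERT-like": 0.876,
}

def accuracy_interval(p, n, z=1.96):
    """Approximate 95% interval for an accuracy p measured on n samples."""
    half_width = z * math.sqrt(p * (1 - p) / n)
    return p - half_width, p + half_width

intervals = {name: accuracy_interval(p, n_test) for name, p in accuracies.items()}

for name, (low, high) in intervals.items():
    print(f"{name:14s} {low:.4f} .. {high:.4f}")
```

Under this approximation each interval is about one percentage point wide on either side. The TF-IDF and BERT-like intervals overlap, so the ranking between those two should be treated as tentative, while the gap to the Word2Vec-like model is well outside the noise. A paired test such as McNemar's would be a more precise way to compare two models on the same test set.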


# Conclusion

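In this project, we built a sentiment analysis system for IMDB movie reviews and compared five text representations: Bag of Words, TF-IDF, a GloVe-like representation (SVD on TF-IDF), a Word2Vec-style representation (SVD on BoW), and a BERT-style model (an MLP on TF-IDF). The last three are lightweight stand-ins built with scikit-learn rather than pretrained GloVe, Word2Vec, or BERT models, so the results say nothing about those pretrained models themselves. On the 4,000-review test set, TF-IDF with logistic regression scored highest (F1 = 0.8833), closely followed by the MLP on TF-IDF (F1 = 0.8772) and BoW (F1 = 0.8621). The 100-dimensional SVD representations scored lower (F1 = 0.8414 on TF-IDF and 0.7981 on BoW), which suggests that compressing 5,000 sparse features into 100 dense ones discards information that is useful for this task. Overall, the choice of text representation has a clear effect on sentiment classification performance, and in this setup the simple sparse TF-IDF features offered the best trade-off between accuracy and simplicity.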