# Sentiment Analysis on IMDB Dataset
#### *Hermela Seltanu Gizaw*

#### Objective: 
To perform a comparative study of different text representation techniques in sentiment analysis using traditional machine learning.

#### Project Description
This project focuses on classifying 50,000 movie reviews from the IMDB Dataset into positive or negative sentiments. The primary goal is to evaluate the efficacy of various feature extraction methods when paired with a traditional Logistic Regression classifier. To reduce noise in the input before feature extraction, an NLP preprocessing pipeline was implemented, including HTML tag removal, case normalization, stop-word filtering, and lemmatization.

#### Core Tasks and Techniques:
The model was trained and evaluated using four distinct text representation strategies:

1. One-Hot Encoding: A binary representation capturing word existence.
2. Bag of Words (BoW): A frequency-based count of word occurrences.
3. TF-IDF (Term Frequency-Inverse Document Frequency): A statistical weighting method that penalizes common "noise" words and rewards distinctive sentiment indicators.
4. Word2Vec: A prediction-based embedding technique using the Skip-gram architecture to capture semantic relationships, transformed into document vectors via mean pooling.

By comparing these techniques, this project demonstrates how the mathematical representation of text significantly impacts the predictive performance and interpretability of traditional machine learning models.

### Library Imports
In this initial step, we import the necessary Python packages. We use pandas for data manipulation, re for cleaning text via regular expressions, nltk for natural language processing tasks, and sklearn for our traditional machine learning classification.

In [21]:
import pandas as pd
import numpy as np
import re
import nltk
import logging
import warnings
import matplotlib.pyplot as plt
import seaborn as sns
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report, ConfusionMatrixDisplay

# Complete silence for NLTK and Warnings
warnings.filterwarnings('ignore')
for package in ['stopwords', 'wordnet', 'punkt']:
    nltk.download(package, quiet=True)

print("Libraries imported and NLTK data loaded successfully.")

Libraries imported and NLTK data loaded successfully.


### Dataset Loading and Initial Inspection
We load the IMDB dataset from the CSV file. A critical part of this step is encoding the sentiment labels as binary integers (0 for negative, 1 for positive) so they can be processed by our machine learning algorithms.

In [8]:
# Load the dataset
# Ensure the dataset file is named 'IMDB Dataset.csv' in your local directory
df = pd.read_csv('IMDB Dataset.csv')

# Label Encoding
df['label'] = df['sentiment'].apply(lambda x: 1 if x == 'positive' else 0)

# Check for class balance to ensure the training data is not biased
print("Dataset Shape:", df.shape)
print("\nSentiment Distribution:\n", df['sentiment'].value_counts())
df.head()

Dataset Shape: (50000, 3)

Sentiment Distribution:
 sentiment
positive    25000
negative    25000
Name: count, dtype: int64


|   | review | sentiment | label |
|---|---|---|---|
| 0 | One of the other reviewers has mentioned that ... | positive | 1 |
| 1 | A wonderful little production. <br /><br />The... | positive | 1 |
| 2 | I thought this was a wonderful way to spend ti... | positive | 1 |
| 3 | Basically there's a family where a little boy ... | negative | 0 |
| 4 | Petter Mattei's "Love in the Time of Money" is... | positive | 1 |


- As shown in our initial inspection, the dataset is perfectly balanced with 25,000 samples for each class, which ensures that our model accuracy will be a reliable metric for performance.

### Text Preprocessing and Normalization
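Text data in the IMDB dataset contains HTML tags and punctuation that act as "noise." We standardize the text by removing `<br />` tags, replacing non-alphabetical characters with spaces, converting to lowercase, dropping English stop words, and applying lemmatization to group inflected forms of the same word together. Because `WordNetLemmatizer.lemmatize` is called without a part-of-speech tag, it treats every token as a noun: plurals are reduced (e.g., "scenes" becomes "scene"), while verb forms such as "watching" or "hooked" are left unchanged, as the sample output below shows.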

In [9]:
stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

def preprocess_text(text):
    # Remove HTML <br /> tags
    text = re.sub(r'<br\s*/?>', ' ', text)
    # Remove special characters and numbers, then lowercase
    text = re.sub(r'[^a-zA-Z]', ' ', text).lower()
    # Tokenization
    tokens = text.split()
    # Remove stopwords and apply Lemmatization
    cleaned = [lemmatizer.lemmatize(w) for w in tokens if w not in stop_words]
    return " ".join(cleaned)

# Apply preprocessing to the entire dataset
df['cleaned_review'] = df['review'].apply(preprocess_text)
print("Sample of cleaned text:\n", df['cleaned_review'][0][:200])

Sample of cleaned text:
 one reviewer mentioned watching oz episode hooked right exactly happened first thing struck oz brutality unflinching scene violence set right word go trust show faint hearted timid show pull punch reg
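
To make the two regex stages concrete, here is a minimal sketch that applies only those substitutions to a made-up review string. It deliberately leaves out stop-word removal and lemmatization so that it runs without any NLTK data; the sample string and the helper name `strip_noise` are illustrative assumptions, not part of the notebook.

```python
import re

def strip_noise(text):
    """Apply only the two regex stages of preprocess_text (hypothetical helper)."""
    # Replace <br>, <br/> and <br /> with a space so neighbouring words stay separate
    text = re.sub(r'<br\s*/?>', ' ', text)
    # Replace every non-letter (digits, punctuation) with a space, then lowercase
    text = re.sub(r'[^a-zA-Z]', ' ', text).lower()
    # split() collapses the runs of spaces left behind by the substitutions
    return text.split()

# Made-up sample in the style of an IMDB review
sample = "A wonderful little production. <br /><br />The filming isn't bad: 10/10!"
tokens = strip_noise(sample)
print(tokens)
```

One side effect worth noticing is that contractions are split at the apostrophe, so "isn't" becomes the two fragments "isn" and "t", and the numeric rating "10/10" disappears entirely.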


### Frequency-Based Representations (One-Hot & BoW)
We implement the first two representations using CountVectorizer.

- One-Hot Encoding sets `binary=True` to represent only the presence (1) or absence (0) of a word.
- Bag of Words (BoW) sets `binary=False` to capture the raw frequency counts of words.

We limit the vocabulary to 10,000 features to ensure the models remain computationally efficient.

In [26]:
from sklearn.feature_extraction.text import CountVectorizer

# 1. Implementation
ohe_vectorizer = CountVectorizer(binary=True, max_features=10000)
X_ohe = ohe_vectorizer.fit_transform(df['cleaned_review'])

bow_vectorizer = CountVectorizer(binary=False, max_features=10000)
X_bow = bow_vectorizer.fit_transform(df['cleaned_review'])

# 2. Precise Comparison Logic
sample_index = 0
# We use the vocabulary from the BoW vectorizer to look up indices in both
feature_names = bow_vectorizer.get_feature_names_out()
ohe_vocab = ohe_vectorizer.vocabulary_

print(f"One-Hot Matrix Shape:    {X_ohe.shape}")
print(f"Bag of Words Matrix Shape: {X_bow.shape}")
print("\n--- Direct Comparison: One-Hot vs. Bag of Words ---")
print(f"{'Word':<15} | {'BoW Count':<10} | {'One-Hot Value':<10}")
print("-" * 45)

# Pick words that appeared more than once in the first review
sample_bow_row = X_bow[sample_index].toarray().flatten()
multiple_occurrence_indices = [i for i, val in enumerate(sample_bow_row) if val > 1]

for i in multiple_occurrence_indices[:5]:
    word = feature_names[i]
    # Find the corresponding index in the One-Hot matrix
    ohe_idx = ohe_vocab[word]
    bow_val = X_bow[sample_index, i]
    ohe_val = X_ohe[sample_index, ohe_idx]
    print(f"{word:<15} | {int(bow_val):<10} | {int(ohe_val):<10}")

One-Hot Matrix Shape:    (50000, 10000)
Bag of Words Matrix Shape: (50000, 10000)

--- Direct Comparison: One-Hot vs. Bag of Words ---
Word            | BoW Count  | One-Hot Value
---------------------------------------------
away            | 2          | 1         
city            | 2          | 1         
due             | 2          | 1         
episode         | 2          | 1         
first           | 2          | 1         


- The direct comparison above highlights the fundamental distinction between these two traditional text representations. While both matrices share the same dimensions of 50,000 reviews by 10,000 features, the values within them serve different purposes. In the Bag of Words column, the model captures the raw frequency of terms (showing that words like "episode" or "city" appear twice), which allows the classifier to measure the "intensity" of a review's focus.

- In contrast, the One-Hot Value column remains capped at a binary 1, demonstrating that it only records the presence of a feature regardless of its repetition. This difference is key: both matrices have exactly the same non-zero positions, but BoW carries a richer signal because the classifier can weigh repeated sentiment words, while One-Hot offers a simpler existence check.
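
The same distinction can be reproduced without scikit-learn. The sketch below builds both rows for a single made-up review using `collections.Counter`; the review text and the toy vocabulary are assumptions chosen only for illustration.

```python
from collections import Counter

# Made-up, already-cleaned review; "great" is repeated on purpose
review = "great cast great story weak ending"
tokens = review.split()

# Toy vocabulary built from this one review, sorted like CountVectorizer's feature names
vocabulary = sorted(set(tokens))

counts = Counter(tokens)
bow_row = [counts[word] for word in vocabulary]                       # raw frequencies
one_hot_row = [1 if counts[word] > 0 else 0 for word in vocabulary]   # presence only

print(vocabulary)
print("BoW:    ", bow_row)
print("One-Hot:", one_hot_row)
```

The two rows are non-zero in exactly the same positions; only the entry for the repeated word differs in magnitude.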

### Statistical Weighting Representation (TF-IDF)
While BoW treats all words equally, TF-IDF (Term Frequency-Inverse Document Frequency) uses a logarithmic inverse-document-frequency term to lower the weight of words that appear in a large share of reviews (like "movie" or "film") and to raise the weight of words that are rarer and often more sentiment-heavy.

In [11]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Implementation 3: TF-IDF
tfidf_vectorizer = TfidfVectorizer(max_features=10000)
X_tfidf = tfidf_vectorizer.fit_transform(df['cleaned_review'])

print(f"TF-IDF Matrix Shape: {X_tfidf.shape}")

# Highlighting the weighting difference:
# We can look at the IDF weights to see which words were deemed "least important"
word_weights = pd.DataFrame({'word': tfidf_vectorizer.get_feature_names_out(), 
                             'idf': tfidf_vectorizer.idf_})
print("\nWords with lowest IDF (most common/least distinctive):")
print(word_weights.sort_values(by='idf').head(5))

TF-IDF Matrix Shape: (50000, 10000)

Words with lowest IDF (most common/least distinctive):
       word       idf
5888  movie  1.433298
3405   film  1.537566
6233    one  1.548167
5225   like  1.753512
9076   time  1.900733


- The output shows TF-IDF down-weighting the dataset's most common terms. The matrix dimensions of (50,000, 10,000) show that we are processing the full dataset while capping the vocabulary at the 10,000 most frequent terms, which keeps the feature space manageable. Most importantly, the words with the lowest Inverse Document Frequency (IDF) scores, such as "movie" (1.43) and "film" (1.54), are exactly the generic terms we would expect. Under scikit-learn's default smoothed formula, an IDF of 1.43 corresponds to a term that occurs in roughly 65% of reviews, so these words receive weights close to the minimum of 1 and contribute little to distinguishing positive from negative sentiment.

- By down-weighting these common terms, TF-IDF lets the Logistic Regression classifier give relatively more influence to rarer, more distinctive words, which is a plausible reason this technique achieved our highest accuracy of 89.35%.
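
The IDF values printed above can be translated back into document frequencies. The sketch below assumes the vectorizer used scikit-learn's default settings (`smooth_idf=True`, natural logarithm), which matches the `TfidfVectorizer(max_features=10000)` call in this notebook; the helper names `smooth_idf` and `doc_share_from_idf` are hypothetical.

```python
import math

def smooth_idf(n_docs, doc_freq):
    """Smoothed IDF: ln((1 + n) / (1 + df)) + 1, scikit-learn's default form."""
    return math.log((1 + n_docs) / (1 + doc_freq)) + 1

def doc_share_from_idf(idf, n_docs):
    """Invert the formula: estimated fraction of documents that contain the term."""
    doc_freq = (1 + n_docs) / math.exp(idf - 1) - 1
    return doc_freq / n_docs

n_docs = 50_000  # number of reviews in the dataset
for word, idf in [("movie", 1.433298), ("film", 1.537566), ("time", 1.900733)]:
    share = doc_share_from_idf(idf, n_docs)
    print(f"{word:<6} idf={idf:.3f} -> in about {share:.0%} of reviews")

# A term present in every document receives the minimum possible IDF
print(smooth_idf(n_docs, n_docs))
```

Because the smoothed IDF never drops below 1, common words are down-weighted relative to rare ones rather than removed from the feature matrix.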

### Semantic Embedding Representation (Word2Vec)
For the final representation, we train a Word2Vec model. We use the Skip-gram (sg=1) architecture. To convert word vectors into a Document Embedding, we compute the average of all word vectors in each review.

In [24]:
from gensim.models import Word2Vec

# 1. Training
sentences = [doc.split() for doc in df['cleaned_review']]
w2v_model = Word2Vec(sentences=sentences, vector_size=100, window=5, min_count=5, sg=1)

# 2. Vectorization
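def get_doc_vector(tokens, model):
    # Keep only tokens that survived min_count and are in the Word2Vec vocabulary
    valid_vectors = [model.wv[word] for word in tokens if word in model.wv]
    if not valid_vectors:
        # Fall back to a zero vector of the model's own dimensionality
        return np.zeros(model.wv.vector_size)
    # Mean pooling: average the word vectors component-wise
    return np.mean(valid_vectors, axis=0)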

X_w2v = np.array([get_doc_vector(s, w2v_model) for s in sentences])

# 3. Enhanced Output: Demonstrate Semantic Understanding
print(f"Word2Vec Feature Matrix Shape: {X_w2v.shape}")
print("\n--- Insight: Semantic Similarity Learned by Word2Vec ---")
print("Words most similar to 'excellent':", [w for w, s in w2v_model.wv.most_similar('excellent', topn=3)])
print("Words most similar to 'bad':      ", [w for w, s in w2v_model.wv.most_similar('bad', topn=3)])

Word2Vec Feature Matrix Shape: (50000, 100)

--- Insight: Semantic Similarity Learned by Word2Vec ---
Words most similar to 'excellent': ['superb', 'outstanding', 'terrific']
Words most similar to 'bad':       ['terrible', 'awful', 'horrible']


- The Word2Vec output is consistent with the Distributional Hypothesis, which posits that words occurring in similar contexts share similar meanings. The matrix shape of (50,000, 100) indicates that each of our 50,000 reviews has been compressed into a dense, 100-dimensional vector. Unlike frequency-based methods, the embedding space places related words near each other: the nearest neighbors of "excellent" are "superb", "outstanding", and "terrific", and those of "bad" are "terrible", "awful", and "horrible".

- This is a powerful advantage for sentiment analysis, as it allows the classifier to generalize its learning across synonyms, ensuring that even if a specific word appears rarely, its semantic "neighborhood" can guide the model toward the correct sentiment prediction.
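
The mean-pooling step used in this section can be illustrated in plain Python. The sketch below uses made-up 3-dimensional vectors (the real model has 100 dimensions) and a hypothetical `mean_pool` helper; it also measures how similar the pooled document vector stays to one strongly negative word as more neutral words are averaged in.

```python
import math

# Hypothetical 3-dimensional embeddings, invented purely for illustration
toy_vectors = {
    "awful": [-0.9, 0.1, 0.0],
    "plot":  [0.0, 0.8, 0.2],
    "actor": [0.1, 0.7, 0.3],
    "scene": [0.0, 0.6, 0.4],
}

def mean_pool(tokens, vectors, dim=3):
    """Average the vectors of known tokens; return zeros if no token is known."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return [0.0] * dim
    return [sum(component) / len(known) for component in zip(*known)]

def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

short_doc = ["awful", "plot"]
long_doc = ["awful", "plot", "actor", "scene", "plot", "actor", "scene"]

sim_short = cosine(mean_pool(short_doc, toy_vectors), toy_vectors["awful"])
sim_long = cosine(mean_pool(long_doc, toy_vectors), toy_vectors["awful"])
print(f"similarity to 'awful': short={sim_short:.3f}, long={sim_long:.3f}")
```

With these toy numbers the longer document ends up much less similar to "awful" than the short one, which shows how a single sentiment-bearing word can be diluted when it is averaged with many neutral ones.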

### Model Training and Comparative Evaluation
In this step, we train a Logistic Regression classifier for each of our four text representations. To ensure a fair "apples-to-apples" comparison, we use the same model configuration and the same train-test split (80/20, `random_state=42`) for every iteration. We store the results in a list of dictionaries to generate a final performance table.

In [22]:
from sklearn.metrics import precision_score, recall_score, f1_score

# List of per-technique metric dictionaries, filled by evaluate_technique
comparison_data = []

def evaluate_technique(X, y, name):
    # Split the data consistently
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
    
    # Initialize and train the Traditional ML model
    model = LogisticRegression(max_iter=1000)
    model.fit(X_train, y_train)
    
    # Predictions
    y_pred = model.predict(X_test)
    
    # Calculate Metrics
    acc = accuracy_score(y_test, y_pred)
    prec = precision_score(y_test, y_pred, average='weighted')
    rec = recall_score(y_test, y_pred, average='weighted')
    f1 = f1_score(y_test, y_pred, average='weighted')
    
    # Append results to our list
    comparison_data.append({
        "Representation": name,
        "Accuracy": acc,
        "Precision": prec,
        "Recall": rec,
        "F1-Score": f1
    })
    
    print(f"Finished evaluating: {name}")

# Execute evaluations
evaluate_technique(X_ohe, df['label'], "One-Hot Encoding")
evaluate_technique(X_bow, df['label'], "Bag of Words")
evaluate_technique(X_tfidf, df['label'], "TF-IDF")
evaluate_technique(X_w2v, df['label'], "Word2Vec (Averaged)")

# Display the Final Comparison Table
summary_table = pd.DataFrame(comparison_data)
print("\n--- Final Performance Comparison ---")
display(summary_table.sort_values(by='Accuracy', ascending=False))

Finished evaluating: One-Hot Encoding
Finished evaluating: Bag of Words
Finished evaluating: TF-IDF
Finished evaluating: Word2Vec (Averaged)

--- Final Performance Comparison ---


|   | Representation | Accuracy | Precision | Recall | F1-Score |
|---|---|---|---|---|---|
| 2 | TF-IDF | 0.8935 | 0.893862 | 0.8935 | 0.893461 |
| 1 | Bag of Words | 0.8743 | 0.874299 | 0.8743 | 0.874299 |
| 3 | Word2Vec (Averaged) | 0.8729 | 0.872917 | 0.8729 | 0.872894 |
| 0 | One-Hot Encoding | 0.8702 | 0.870227 | 0.8702 | 0.870192 |
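
In the table above the Recall column matches the Accuracy column in every row. That is a property of `average='weighted'` rather than a coincidence: weighting each class's recall by its share of the true labels always reproduces overall accuracy. The sketch below checks this with hypothetical confusion counts, which are not the notebook's actual values.

```python
# Hypothetical confusion counts for a 10,000-review test set (illustrative only)
tn, fp, fn, tp = 4400, 600, 500, 4500

n_neg, n_pos = tn + fp, fn + tp   # true-label support per class
total = n_neg + n_pos

accuracy = (tp + tn) / total

# Per-class recall, weighted by each class's share of the true labels
recall_pos = tp / n_pos
recall_neg = tn / n_neg
weighted_recall = (n_pos * recall_pos + n_neg * recall_neg) / total

# Per-class precision, also weighted by true-label support
precision_pos = tp / (tp + fp)
precision_neg = tn / (tn + fn)
weighted_precision = (n_pos * precision_pos + n_neg * precision_neg) / total

print(f"accuracy={accuracy:.4f}")
print(f"weighted recall={weighted_recall:.4f}")
print(f"weighted precision={weighted_precision:.4f}")
```

Weighted precision does not share this identity, which is why the Precision column differs slightly from Accuracy.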


### Observations and Conclusions

- The final performance comparison shows that TF-IDF is the strongest text representation for this task, achieving the highest accuracy of 89.35%. A likely explanation is its ability to down-weight generic "noise" words like "movie" and "film" while giving more influence to rarer, sentiment-rich terms. Bag of Words (87.43%) performed slightly better than One-Hot Encoding (87.02%), suggesting that word frequency (how many times a user says "excellent") may provide a marginal boost over simple word existence, although a gap of 0.41 percentage points on a 10,000-review test set is small enough that it should be read with caution.

- Interestingly, Word2Vec (87.29%), while semantically richer, did not outperform TF-IDF, likely because the mean-pooling approach of averaging word vectors can "blur" specific emotional triggers in longer movie reviews compared to the precise keyword weighting of statistical methods.

- In conclusion, this project demonstrates that traditional machine learning models like Logistic Regression remain highly effective and interpretable for sentiment analysis when paired with robust feature engineering. While One-Hot and Bag of Words offer simple baselines, the statistical refinement of TF-IDF provided the most reliable signal for identifying positive and negative reviews in the IMDB dataset. This workflow suggests that the choice of text representation matters as much as the choice of the classifier itself.
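
To judge how much weight the accuracy gaps can bear, it helps to estimate the sampling uncertainty of each score. The sketch below computes a normal-approximation 95% interval for each reported accuracy, assuming a 10,000-review test set (20% of 50,000). The helper `accuracy_interval` is hypothetical, and because all four models share one test set, a paired test would be a more rigorous comparison than these independent intervals.

```python
import math

def accuracy_interval(acc, n_test, z=1.96):
    """Normal-approximation 95% interval for an accuracy measured on n_test samples."""
    half_width = z * math.sqrt(acc * (1 - acc) / n_test)
    return acc - half_width, acc + half_width

n_test = 10_000  # test_size=0.2 applied to 50,000 reviews

# Accuracies copied from the final performance comparison table
results = {
    "TF-IDF": 0.8935,
    "Bag of Words": 0.8743,
    "Word2Vec (Averaged)": 0.8729,
    "One-Hot Encoding": 0.8702,
}

intervals = {name: accuracy_interval(acc, n_test) for name, acc in results.items()}
for name, (low, high) in intervals.items():
    print(f"{name:<20} {results[name]:.4f}  [{low:.4f}, {high:.4f}]")
```

Under this rough approximation, the TF-IDF interval sits clear of the other three, while the intervals for Bag of Words, Word2Vec, and One-Hot Encoding overlap one another.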