# Dataset

This project uses the IMDB Large Movie Review Dataset introduced by Maas et al. (2011) in their paper "Learning Word Vectors for Sentiment Analysis." The dataset consists of 50,000 labeled movie reviews collected from IMDB, divided equally into 25,000 training and 25,000 test samples, with balanced classes for binary sentiment classification (positive and negative). The reviews are distributed as raw text that still contains HTML markup, so they are cleaned in the preprocessing step below. The dataset is a standard benchmark for sentiment analysis models and is widely used for text classification, embedding learning, and sentiment prediction.

#### Reference
__Maas, A. L., Daly, R. E., Pham, P. T., Huang, D., Ng, A. Y., & Potts, C. (2011). Learning word vectors for sentiment analysis. Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, 142–150. http://www.aclweb.org/anthology/P11-1015__

In [1]:
#import nltk
#nltk.download('punkt_tab')
#nltk.download('punkt')
#nltk.download('wordnet')
#nltk.download('omw-1.4')
#nltk.download('stopwords')

In [48]:
import tarfile
import os
import numpy as np
import pandas as pd
from bs4 import BeautifulSoup
import re
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from concurrent.futures import ThreadPoolExecutor
from glob import glob

from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC 
from sklearn.metrics import accuracy_score, classification_report  

from sklearn.model_selection import GridSearchCV

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout
from sklearn.manifold import TSNE

import matplotlib.pyplot as plt

## Loading Data

In [2]:
#this cell reads the dataset in its original path and extracts it to the project folder
#data_path = r"D:\Data_and_AI\Datasets\Large Movie Review Dataset\aclImdb_v1.tar.gz"
#with tarfile.open(data_path, 'r:gz') as tar:
#    tar.extractall()



In [3]:
def load_files(directory):
    # Get all .txt files in pos/neg subdirectories
    pos_files = glob(os.path.join(directory, 'pos', '*.txt'))
    neg_files = glob(os.path.join(directory, 'neg', '*.txt'))
    all_files = pos_files + neg_files
    
    # Read files in parallel (I/O-bound)
    def read_file(file_path):
        with open(file_path, 'r', encoding='utf-8') as f:
            text = f.read()
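        # Take the label from the parent directory name ('pos' or 'neg'),
        # so that other parts of the path cannot affect it
        label = os.path.basename(os.path.dirname(file_path))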
        return (text, label)
    
    with ThreadPoolExecutor() as executor:
        data = list(executor.map(read_file, all_files))
    
    return pd.DataFrame(data, columns=['text', 'label'])

In [50]:
train_df = load_files('aclImdb/train')
test_df = load_files('aclImdb/test')

print(f"Training data shape: {train_df.shape}")
print(f"Test data shape: {test_df.shape}")

Training data shape: (25000, 2)
Test data shape: (25000, 2)


In [51]:
print("Training Data\n")
train_df.head()

Training Data



|   | text | label |
|---|------|-------|
| 0 | Bromwell High is a cartoon comedy. It ran at t... | pos |
| 1 | Homelessness (or Houselessness as George Carli... | pos |
| 2 | Brilliant over-acting by Lesley Ann Warren. Be... | pos |
| 3 | This is easily the most underrated film inn th... | pos |
| 4 | This is not the typical Mel Brooks film. It wa... | pos |


In [52]:
print("Test Data\n")
test_df.head()

Test Data



|   | text | label |
|---|------|-------|
| 0 | I went and saw this movie last night after bei... | pos |
| 1 | Actor turned director Bill Paxton follows up h... | pos |
| 2 | As a recreational golfer with some knowledge o... | pos |
| 3 | I saw this film in a sneak preview, and it is ... | pos |
| 4 | Bill Paxton has taken the true story of the 19... | pos |


## Preprocessing

In [53]:
def clean_html(text):
    return BeautifulSoup(text, 'html.parser').get_text()

def clean_text(text):
    # Remove HTML tags
    text = clean_html(text)
    # Remove every character that is not an ASCII letter or whitespace, then lowercase
    text = re.sub(r'[^a-zA-Z\s]', '', text).lower()
    # Collapse each run of whitespace into a single space and trim the ends
    text = re.sub(r'\s+', ' ', text).strip()

    return text

# Example:
sample_text = "This is a <b>sample</b> review!!! 😊"
print(clean_text(sample_text))  # Output: "this is a sample review"

this is a sample review
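
The character filter above also deletes apostrophes, hyphens, and digits, which joins some words and drops ratings such as "3/10". The following is a minimal sketch of that behavior, assuming the same two regular expressions as `clean_text` but without the HTML step; `strip_non_letters` is a hypothetical helper name that is not part of the notebook.

```python
import re

def strip_non_letters(text):
    # Same two substitutions as clean_text, minus the BeautifulSoup step
    text = re.sub(r'[^a-zA-Z\s]', '', text).lower()
    return re.sub(r'\s+', ' ', text).strip()

print(strip_non_letters("I didn't like it: 3/10."))  # i didnt like it
print(strip_non_letters("Brilliant over-acting"))    # brilliant overacting
```

Forms such as `didnt` may no longer match entries in a stop-word list that expects the apostrophe, so it can be worth inspecting a few cleaned reviews before moving on.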


In [54]:
text = "this is a sample review"
tokens = word_tokenize(text)
print(tokens)  # Output: ['this', 'is', 'a', 'sample', 'review']

['this', 'is', 'a', 'sample', 'review']


In [55]:
stop_words = set(stopwords.words('english'))

def remove_stopwords(tokens):
    return [word for word in tokens if word not in stop_words]

tokens = ['this', 'is', 'a', 'sample', 'review']
filtered_tokens = remove_stopwords(tokens)
print(filtered_tokens)  # Output: ['sample', 'review']

['sample', 'review']
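
The NLTK English stop-word list includes negations such as "not" and "no", so filtering removes them together with the function words. The sketch below shows one possible variation that keeps a small set of negations. It uses a hypothetical miniature stop-word list in place of the NLTK one, and `remove_stopwords_keep_negations` is an illustrative name, not a function from the notebook. Whether this changes accuracy on this dataset is not tested here.

```python
# Hypothetical miniature stop-word list; the real NLTK list is much longer
toy_stop_words = {"this", "is", "a", "was", "not", "no"}
negations = {"not", "no"}

def remove_stopwords_keep_negations(tokens, stop_words, keep):
    # Drop stop words unless they are in the set of words to keep
    return [t for t in tokens if t not in stop_words or t in keep]

tokens = ["this", "was", "not", "good"]
print([t for t in tokens if t not in toy_stop_words])                      # ['good']
print(remove_stopwords_keep_negations(tokens, toy_stop_words, negations))  # ['not', 'good']
```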


In [56]:
lemmatizer = WordNetLemmatizer()

def lemmatize(tokens):
    return [lemmatizer.lemmatize(word) for word in tokens]

tokens = ['running', 'dogs', 'ate']
lemmatized = lemmatize(tokens)
print(lemmatized)  # Output: ['running', 'dog', 'ate']

['running', 'dog', 'ate']


In [57]:
#full preprocessing pipeline

def preprocess_text(text):
    # Clean HTML and special characters
    text = clean_text(text)
    # Tokenize
    tokens = word_tokenize(text)
    # Remove stopwords
    tokens = remove_stopwords(tokens)
    # Lemmatize
    tokens = lemmatize(tokens)
    return ' '.join(tokens)  # Return as a single string

In [58]:
#preprocess training and test data
train_df['processed_text'] = train_df['text'].apply(preprocess_text)
test_df['processed_text'] = test_df['text'].apply(preprocess_text)



In [59]:
train_df.head()

|   | text | label | processed_text |
|---|------|-------|----------------|
| 0 | Bromwell High is a cartoon comedy. It ran at t... | pos | bromwell high cartoon comedy ran time program ... |
| 1 | Homelessness (or Houselessness as George Carli... | pos | homelessness houselessness george carlin state... |
| 2 | Brilliant over-acting by Lesley Ann Warren. Be... | pos | brilliant overacting lesley ann warren best dr... |
| 3 | This is easily the most underrated film inn th... | pos | easily underrated film inn brook cannon sure f... |
| 4 | This is not the typical Mel Brooks film. It wa... | pos | typical mel brook film much less slapstick mov... |


In [60]:
test_df.head()

|   | text | label | processed_text |
|---|------|-------|----------------|
| 0 | I went and saw this movie last night after bei... | pos | went saw movie last night coaxed friend mine i... |
| 1 | Actor turned director Bill Paxton follows up h... | pos | actor turned director bill paxton follows prom... |
| 2 | As a recreational golfer with some knowledge o... | pos | recreational golfer knowledge sport history pl... |
| 3 | I saw this film in a sneak preview, and it is ... | pos | saw film sneak preview delightful cinematograp... |
| 4 | Bill Paxton has taken the true story of the 19... | pos | bill paxton taken true story u golf open made ... |


## Feature Extraction

In [15]:
# 1. Bag-of-Words (BoW) Features
bow_vectorizer = CountVectorizer(max_features=5000) 
# max_features limits the vocabulary to the 5,000 most frequent terms in the corpus
X_train_bow = bow_vectorizer.fit_transform(train_df['processed_text'])
X_test_bow = bow_vectorizer.transform(test_df['processed_text'])

In [16]:
# 2. TF-IDF Features
tfidf_vectorizer = TfidfVectorizer(max_features=5000)
# As with CountVectorizer, max_features keeps the top 5,000 terms ranked by term frequency across the corpus (not by TF-IDF score)
X_train_tfidf = tfidf_vectorizer.fit_transform(train_df['processed_text'])
X_test_tfidf = tfidf_vectorizer.transform(test_df['processed_text'])
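
To make the TF-IDF weighting concrete, here is a minimal pure-Python sketch of the formula that `TfidfVectorizer` applies with its default settings: raw term counts, smoothed IDF `ln((1 + n) / (1 + df)) + 1`, and L2 normalization of each row. It is an illustration under simplifying assumptions (whitespace tokenization, no vocabulary limit), not a replacement for the vectorizer, and `tfidf_rows` is a hypothetical name.

```python
import math
from collections import Counter

def tfidf_rows(docs):
    tokenized = [doc.split() for doc in docs]
    vocab = sorted({t for doc in tokenized for t in doc})
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term
    df = Counter(t for doc in tokenized for t in set(doc))
    # Smoothed inverse document frequency
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}
    rows = []
    for doc in tokenized:
        counts = Counter(doc)
        weights = [counts[t] * idf[t] for t in vocab]
        # L2-normalize the row so that document length does not dominate
        norm = math.sqrt(sum(w * w for w in weights)) or 1.0
        rows.append([w / norm for w in weights])
    return vocab, rows

vocab, rows = tfidf_rows(["good movie", "bad movie", "good good plot"])
print(vocab)  # ['bad', 'good', 'movie', 'plot']
print([round(w, 2) for w in rows[1]])
```

In the second toy document, "bad" receives a higher weight than "movie" because it occurs in fewer documents.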

In [17]:
# Get labels
y_train = train_df['label']
y_test = test_df['label']

## Models

In [18]:
def train_evaluate_model(model_name, X_train, X_test, y_train, y_test):
    print(f"\n=== {model_name} ===")
    
    # Initialize and train model
    model = LogisticRegression(max_iter=1000, random_state=42)
    model.fit(X_train, y_train)
    
    # Predict and evaluate
    y_pred = model.predict(X_test)
    print(f"Accuracy: {accuracy_score(y_test, y_pred):.4f}")
    print("\nClassification Report:")
    print(classification_report(y_test, y_pred))

In [19]:
# 1. BoW Model
train_evaluate_model("Logistic Regression with BoW", X_train_bow, X_test_bow, y_train, y_test)


=== Logistic Regression with BoW ===
Accuracy: 0.8438

Classification Report:
              precision    recall  f1-score   support

         neg       0.84      0.85      0.85     12500
         pos       0.85      0.83      0.84     12500

    accuracy                           0.84     25000
   macro avg       0.84      0.84      0.84     25000
weighted avg       0.84      0.84      0.84     25000
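
The per-class columns in the report follow directly from counts of true positives, false positives, and false negatives. A minimal sketch of that arithmetic on a made-up set of six labels (the data and the helper name `precision_recall_f1` are illustrative assumptions, not taken from the notebook):

```python
def precision_recall_f1(y_true, y_pred, positive):
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Made-up labels for illustration only
y_true_toy = ["pos", "pos", "pos", "neg", "neg", "neg"]
y_pred_toy = ["pos", "pos", "neg", "neg", "neg", "pos"]
print(precision_recall_f1(y_true_toy, y_pred_toy, positive="pos"))
```

Here the "pos" class has two true positives, one false positive, and one false negative, so precision, recall, and F1 are all 2/3.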



In [20]:
# 2. TF-IDF Model
train_evaluate_model("Logistic Regression with TF-IDF", X_train_tfidf, X_test_tfidf, y_train, y_test)


=== Logistic Regression with TF-IDF ===
Accuracy: 0.8773

Classification Report:
              precision    recall  f1-score   support

         neg       0.88      0.87      0.88     12500
         pos       0.87      0.88      0.88     12500

    accuracy                           0.88     25000
   macro avg       0.88      0.88      0.88     25000
weighted avg       0.88      0.88      0.88     25000



### Hyperparameter Tuning

### 5-fold Cross Validation
The training data is split into 5 equal parts (folds).

The model is trained on 4 folds and validated on the remaining 1 fold.

This process repeats 5 times, with each fold serving as the validation set once.

Performance metrics (e.g., accuracy) are averaged across all 5 runs to estimate generalization.
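
The splitting procedure described above can be sketched in a few lines. This is a simplified illustration with contiguous, unshuffled folds; `k_fold_indices` is a hypothetical helper. When `GridSearchCV` is given an integer `cv` and a classifier, scikit-learn uses stratified folds, which additionally keep the class proportions roughly equal in every fold.

```python
def k_fold_indices(n_samples, k=5):
    """Yield (train_idx, val_idx) pairs for k contiguous, near-equal folds."""
    # Spread any remainder over the first folds
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, val_idx
        start += size

folds = list(k_fold_indices(10, k=5))
for train_idx, val_idx in folds:
    print("validate on", val_idx, "train on", train_idx)
```

Each sample appears in exactly one validation fold, and every model is trained on the remaining four fifths of the data.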

In [21]:
params = {'C': [0.01, 0.1, 1.0, 10, 100]}
grid = GridSearchCV(LogisticRegression(max_iter=1000), params, cv=5) #cv=5 is a 5-fold crossvalidation
grid.fit(X_train_tfidf, y_train)

print("Best C:", grid.best_params_['C'])  # e.g., C=1.0
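# best_score_ is the mean cross-validated accuracy for the best C, not accuracy on the full training set
print("Best Training Accuracy:", grid.best_score_)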

Best C: 1.0
Best Training Accuracy: 0.85796


`C` is a hyperparameter of logistic regression. It is the inverse of the regularization strength, so a smaller `C` means stronger regularization. Its default value is 1.0, which is also the value the grid search selects here.

The cross-validated accuracy (about 85.8%) is lower than the test accuracy (87.7%). During cross-validation each model is fit on only four of the five folds, that is 20,000 of the 25,000 training reviews, and the score is the average over the five held-out folds. The model evaluated on the test set is fit on all 25,000 training reviews. The smaller training set is one plausible reason for the gap; the two figures are also measured on different data, so they are not expected to match exactly.
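
One common way to write the L2-regularized logistic regression objective is `0.5 * ||w||^2 + C * sum(log(1 + exp(-y_i * (x_i . w + b))))` with labels in {-1, +1}. The sketch below evaluates that expression on a tiny made-up example to show how `C` trades the data-fit term against the penalty. The function name, the data, and the weights are all illustrative assumptions.

```python
import math

def l2_logreg_objective(w, b, X, y, C):
    """0.5 * ||w||^2 + C * (sum of logistic losses), labels in {-1, +1}."""
    penalty = 0.5 * sum(wj * wj for wj in w)
    loss = 0.0
    for xi, yi in zip(X, y):
        margin = yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
        loss += math.log1p(math.exp(-margin))
    return penalty + C * loss

# Made-up data and weights for illustration only
X_toy = [[1.0, 0.0], [0.0, 1.0]]
y_toy = [1, -1]
w_toy, b_toy = [2.0, -2.0], 0.0

for C in (0.01, 1.0, 100.0):
    print(C, l2_logreg_objective(w_toy, b_toy, X_toy, y_toy, C))
```

With a small `C` the penalty term dominates the total, so the optimizer is pushed toward small weights; with a large `C` the data-fit term dominates.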

In [22]:
# After running GridSearchCV:
best_model = grid.best_estimator_  # Get the model with best C
test_accuracy = best_model.score(X_test_tfidf, y_test)
print(f"Test accuracy with best C: {test_accuracy:.4f}")

Test accuracy with best C: 0.8773


### Using N-grams

In [23]:
# Use bigrams/trigrams
tfidf = TfidfVectorizer(
    ngram_range=(1, 3),  # Unigrams, Bigrams, and Trigrams
    max_features=15000,   # Keep the 15,000 most frequent terms across the corpus
    sublinear_tf=True     # Sublinear tf scaling: replace tf with 1 + log(tf)
)
X_train_tfidf_ngram = tfidf.fit_transform(train_df['processed_text'])
X_test_tfidf_ngram = tfidf.transform(test_df['processed_text'])
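
With `ngram_range=(1, 3)`, every run of one, two, or three consecutive tokens becomes a candidate feature. A minimal sketch of that expansion, using a phrase from the processed training data shown earlier; `word_ngrams` is a hypothetical helper written for illustration and is not the vectorizer's own code.

```python
def word_ngrams(tokens, min_n=1, max_n=3):
    grams = []
    for n in range(min_n, max_n + 1):
        # Slide a window of n tokens across the sequence
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams

print(word_ngrams(["much", "less", "slapstick"]))
# ['much', 'less', 'slapstick', 'much less', 'less slapstick', 'much less slapstick']
```

Because the number of distinct n-grams grows quickly with `n`, the feature cap is raised from 5,000 to 15,000 in this configuration.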

In [24]:
# Define hyperparameters to tune
params = {
    'C': [0.1, 1.0, 10],  # Inverse of regularization strength
    'penalty': ['l2'],      # L2 regularization (default)
}

# Search for best parameters
lr = LogisticRegression(max_iter=1000, random_state=42)
grid = GridSearchCV(lr, params, cv=5, scoring='accuracy')
grid.fit(X_train_tfidf_ngram, y_train)

print(f"Best C: {grid.best_params_['C']}")  # value selected by the grid search
best_lr = grid.best_estimator_

Best C: 1.0


In [25]:
params = {
    'C': [0.1, 1.0, 10],
}

svm = LinearSVC(random_state=42, max_iter=10000) #Support Vector Classification
grid_svm = GridSearchCV(svm, params, cv=5, scoring='accuracy')
grid_svm.fit(X_train_tfidf_ngram, y_train)

print(f"Best SVM C: {grid_svm.best_params_['C']}")  # e.g., C=0.1
best_svm = grid_svm.best_estimator_

Best SVM C: 0.1


### Evaluate on test set

In [26]:
# Logistic Regression
y_pred_lr = best_lr.predict(X_test_tfidf_ngram)
print(f"Logistic Regression Accuracy: {accuracy_score(y_test, y_pred_lr):.4f}")
print(classification_report(y_test, y_pred_lr))

# SVM
y_pred_svm = best_svm.predict(X_test_tfidf_ngram)
print(f"\nSVM Accuracy: {accuracy_score(y_test, y_pred_svm):.4f}")
print(classification_report(y_test, y_pred_svm))

Logistic Regression Accuracy: 0.8887
              precision    recall  f1-score   support

         neg       0.89      0.88      0.89     12500
         pos       0.89      0.89      0.89     12500

    accuracy                           0.89     25000
   macro avg       0.89      0.89      0.89     25000
weighted avg       0.89      0.89      0.89     25000


SVM Accuracy: 0.8896
              precision    recall  f1-score   support

         neg       0.89      0.88      0.89     12500
         pos       0.89      0.89      0.89     12500

    accuracy                           0.89     25000
   macro avg       0.89      0.89      0.89     25000
weighted avg       0.89      0.89      0.89     25000



### Interpretation
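The features with the largest positive and negative Logistic Regression coefficients show which terms push a prediction most strongly toward each sentiment class.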

In [28]:
# For Logistic Regression
feature_names = tfidf.get_feature_names_out()
coefs = best_lr.coef_[0]
top_positive = [feature_names[i] for i in coefs.argsort()[-10:][::-1]]  # e.g., ['excellent', 'best']
top_negative = [feature_names[i] for i in coefs.argsort()[:10]]         # e.g., ['worst', 'awful']
print("Top Positive Words:", top_positive)
print("Top Negative Words:", top_negative)

Top Positive Words: ['great', 'excellent', 'perfect', 'wonderful', 'best', 'favorite', 'amazing', 'loved', 'today', 'fun']
Top Negative Words: ['worst', 'bad', 'awful', 'waste', 'boring', 'poor', 'nothing', 'terrible', 'dull', 'worse']
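
The ranking works because, for a binary problem, a positive coefficient pushes the prediction toward the second entry of `classes_`. scikit-learn sorts string labels, so that entry is `'pos'` here. A small self-contained sketch of the same sort-and-slice idea, using made-up coefficients rather than the fitted model's values; `top_features` is a hypothetical helper.

```python
def top_features(feature_names, coefs, k=2):
    # Indices ordered from the most negative to the most positive coefficient
    order = sorted(range(len(coefs)), key=lambda i: coefs[i])
    most_negative = [feature_names[i] for i in order[:k]]
    most_positive = [feature_names[i] for i in order[::-1][:k]]
    return most_positive, most_negative

# Made-up coefficients for illustration only
toy_names = ["great", "boring", "plot", "worst", "fun"]
toy_coefs = [1.8, -1.2, 0.1, -2.5, 0.9]
print(top_features(toy_names, toy_coefs))
# (['great', 'fun'], ['worst', 'boring'])
```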


# Summary

This project focuses on building and evaluating sentiment analysis models using the IMDB Large Movie Review Dataset introduced by Maas et al. (2011). The dataset contains 50,000 labeled movie reviews split evenly between positive and negative sentiment classes, making it ideal for binary text classification.

The pipeline begins with thorough text preprocessing, including HTML tag removal, lowercasing, punctuation removal, stopword elimination, and lemmatization. Two main types of textual features are extracted:

- Bag-of-Words (BoW): Using a CountVectorizer with a vocabulary size limited to the top 5,000 most frequent terms.
- TF-IDF (Term Frequency-Inverse Document Frequency): Captures term importance across documents with the same 5,000-word limit. An extended version with unigrams, bigrams, and trigrams (n-grams up to size 3) and 15,000 features is also explored.

Two classification algorithms are trained and evaluated:

- Logistic Regression
- Linear Support Vector Machine (SVM)

Initial models using BoW and TF-IDF achieved accuracies of 84.38% and 87.73%, respectively. Further performance improvements were achieved through hyperparameter tuning using 5-fold cross-validation and the inclusion of n-gram features. The final models reached accuracies of:

- Logistic Regression: 88.87%
- SVM: 88.96%

The project also includes interpretability analysis by extracting the most influential positive and negative words based on the Logistic Regression model’s coefficients. This provides insights into the linguistic patterns associated with each sentiment class.

Overall, the project demonstrates the effectiveness of traditional machine learning approaches for sentiment analysis when combined with careful preprocessing, feature engineering, and model tuning.