## IMDB Movie Review Sentiment Analysis with EDA, Feature Engineering, and ML

This notebook implements a comprehensive sentiment analysis pipeline covering:
- Data Loading & Preprocessing
- First EDA
- Feature Engineering
- Second EDA & Statistical Inference
- Machine Learning
- Presentation & Reflection

### Dataset
- **Source**: IMDB Movie Reviews Dataset - https://www.kaggle.com/datasets/mahmoudshaheen1134/imdp-data
- **Location**: `../dataset/IMDB-Dataset.csv`

### Authors
- **Students**: Ainedembe Denis, Musinguzi Benson
- **Lecturer**: Harriet Sibitenda (PhD)


### Import Required Libraries


In [1]:
# DATA MANIPULATION AND ANALYSIS
import pandas as pd  # Data manipulation and analysis: DataFrames, data loading, cleaning
import numpy as np   # Numerical computing: arrays, mathematical operations, random number generation

# VISUALIZATION
import matplotlib.pyplot as plt   # Plotting library: create charts, graphs, histograms
import seaborn as sns             # Statistical visualization: enhanced plots, heatmaps, statistical graphics
from wordcloud import WordCloud   # Word cloud generation: visualize most frequent words in text

# MACHINE LEARNING - MODEL SELECTION AND TRAINING
from sklearn.model_selection import train_test_split  # Split data into training and testing sets
from sklearn.model_selection import KFold             # K-fold cross-validation iterator

# MACHINE LEARNING - FEATURE EXTRACTION
from sklearn.feature_extraction.text import TfidfVectorizer  # Convert text to TF-IDF features (term frequency-inverse document frequency)
from sklearn.feature_extraction.text import CountVectorizer  # Convert text to bag-of-words count features

# MACHINE LEARNING - CLASSIFICATION MODELS
from sklearn.naive_bayes import MultinomialNB               # Naive Bayes classifier for text classification
from sklearn.linear_model import LogisticRegression         # Logistic regression classifier
from sklearn.svm import LinearSVC                           # Linear Support Vector Machine classifier
from sklearn.ensemble import GradientBoostingClassifier     # Gradient Boosting ensemble classifier

# MACHINE LEARNING - MODEL EVALUATION METRICS
from sklearn.metrics import accuracy_score      # Calculate classification accuracy
from sklearn.metrics import precision_score     # Calculate precision - positive predictive value
from sklearn.metrics import recall_score        # Calculate recall (sensitivity, true positive rate)
from sklearn.metrics import f1_score            # Calculate F1 score (harmonic mean of precision and recall)
from sklearn.metrics import confusion_matrix    # Generate confusion matrix for classification results
from sklearn.metrics import roc_auc_score       # Calculate ROC-AUC (Area Under ROC Curve)

# MACHINE LEARNING - UTILITIES
from sklearn.utils import resample                      # Resample datasets for balancing classes
from sklearn.preprocessing import MinMaxScaler          # Scale features to a specific range (0-1)
from sklearn.base import clone                          # Clone/copy machine learning models
from sklearn.calibration import CalibratedClassifierCV  # Calibrate classifier probabilities

# STATISTICAL ANALYSIS
from scipy import stats               # Statistical functions: distributions, statistical tests (stats.sem, stats.t.ppf)
from scipy.stats import ttest_ind     # Independent samples t-test for hypothesis testing

# NATURAL LANGUAGE PROCESSING
import nltk                                    # Natural Language Toolkit: NLP library
from nltk.corpus import stopwords              # Common stopwords (e.g., "the", "a", "is") to filter out
import re                                      # Regular expressions: pattern matching for text cleaning (HTML removal, punctuation)

# NLTK DATA DOWNLOAD AND INITIALIZATION
# Download stopwords corpus
try:
    nltk.data.find('corpora/stopwords')
except LookupError:
    nltk.download('stopwords', quiet=True)

# Initialize stopwords set for English language
stop_words = set(stopwords.words('english'))

# REPRODUCIBILITY SETTINGS
# Set random seed for reproducibility (ensures consistent results across runs)
np.random.seed(42)

# DISPLAY SETTINGS
# Configure pandas display options
pd.set_option('display.max_columns', None)      # Show all columns when printing DataFrames
pd.set_option('display.max_colwidth', 100)      # Maximum column width when displaying text

# Configure matplotlib and seaborn visualization styles
plt.style.use('seaborn-v0_8')                   # Use seaborn style for plots
sns.set_palette("husl")                         # Set color palette for seaborn plots
# Display plots inline in Jupyter notebook
%matplotlib inline                              

print("Libraries imported and environment set up successfully.")
print(f"NLTK data ready. Loaded {len(stop_words)} stopwords.")

Libraries imported and environment set up successfully.
NLTK data ready. Loaded 198 stopwords.


### Part A: Data Loading & Preprocessing

#### A1. Load Dataset


In [2]:
# Load dataset from CSV file
# The dataset contains movie reviews and their sentiment labels
df = pd.read_csv('../dataset/IMDB-Dataset.csv')

print(f"Dataset shape: {df.shape}")
df.head()  # Display first few rows to inspect the data structure


Dataset shape: (50000, 2)


| | review | sentiment |
|---|---|---|
| 0 | One of the other reviewers has mentioned that after watching just 1 Oz episode you'll be hooked.... | positive |
| 1 | A wonderful little production. <br /><br />The filming technique is very unassuming- very old-ti... | positive |
| 2 | I thought this was a wonderful way to spend time on a too hot summer weekend, sitting in the air... | positive |
| 3 | Basically there's a family where a little boy (Jake) thinks there's a zombie in his closet & his... | negative |
| 4 | Petter Mattei's "Love in the Time of Money" is a visually stunning film to watch. Mr. Mattei off... | positive |


In [None]:
# Check data quality: count missing values across all columns
missing_values = df.isnull().sum().sum()
print(f"Missing values: {missing_values}")


#### A2. Text Cleaning

Steps: remove HTML tags, convert to lowercase, remove punctuation, tokenize, and remove stopwords and very short tokens.


In [None]:
# Sample review before cleaning
print("First 300 characters of a sample review:")
print(df['review'].iloc[0][:300])
print("\n")


In [None]:
# Text cleaning function
# Implements the preprocessing steps: strip HTML tags, lowercase, remove punctuation,
# tokenize on whitespace, and drop stopwords and very short tokens

def clean_text(text):
    # Handle missing or empty text
    if pd.isna(text) or text == '':
        return ''
    # Step 1: Remove HTML tags using regex pattern matching
    # (replace with a space so words separated only by a tag, e.g. "good<br />film", do not fuse)
    text = re.sub(r'<[^>]+>', ' ', str(text))
    # Step 2: Convert to lowercase for consistency
    text = text.lower()
    # Step 3: Remove punctuation - keep alphanumeric and spaces only
    text = re.sub(r'[^a-zA-Z0-9\s]', ' ', text)
    # Step 4: Simple tokenization - split on whitespace
    tokens = text.split()
    # Step 5: Remove stopwords (common words like "the", "a", "is" that carry little sentiment)
    # and tokens of 2 or fewer characters to filter out noise
    tokens = [word for word in tokens if word not in stop_words and len(word) > 2]
    # Join tokens back to string for vectorization
    return ' '.join(tokens)

# Test cleaning function
sample_text = df['review'].iloc[0]
cleaned_sample = clean_text(sample_text)
print("Original review (first 300 chars):", sample_text[:300])
print("Cleaned review (first 300 chars):", cleaned_sample[:300])
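
To make the cleaning steps concrete, the sketch below traces one string through the same sequence of operations. It is a simplified stand-in rather than the notebook's function: `clean_text_demo`, `demo_stop_words`, and the sample string are hypothetical, and the tiny stop-word set stands in for NLTK's full English list.

```python
import re

# Hypothetical mini stop-word set for illustration only; the notebook uses
# NLTK's full English list. "not" is included because NLTK's list contains it.
demo_stop_words = {"a", "the", "is", "it", "not"}

def clean_text_demo(text, stop_words=demo_stop_words):
    """Trace the cleaning steps on a single string (illustrative only)."""
    text = re.sub(r'<[^>]+>', ' ', text)       # 1. strip HTML tags (space keeps words apart)
    text = text.lower()                        # 2. lowercase
    text = re.sub(r'[^a-z0-9\s]', ' ', text)   # 3. punctuation -> spaces
    tokens = text.split()                      # 4. whitespace tokenization
    # 5. drop stop words and tokens of 2 or fewer characters
    tokens = [t for t in tokens if t not in stop_words and len(t) > 2]
    return ' '.join(tokens)

sample = "A wonderful little production. <br /><br />The filming is NOT bad!"
print(clean_text_demo(sample))  # wonderful little production filming bad
```

Note that "NOT bad" is reduced to "bad" because "not" is treated as a stop word. Whether this loss of negation matters for the classifiers is worth checking, since n-grams built after cleaning cannot recover it.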


In [None]:
# Apply cleaning to all reviews
df['cleaned_review'] = df['review'].apply(clean_text)
print("Cleaning applied to all reviews.")

In [None]:
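# Convert text to numerical features using TF-IDF (Term Frequency-Inverse Document Frequency)
# TF-IDF is chosen over plain Bag-of-Words counts because it down-weights terms that appear in
# many documents, so terms concentrated in fewer documents receive relatively higher weights
# Parameters:
#   - max_features=5000: Keep only the 5000 most frequent terms across the corpus (reduces dimensionality)
#   - ngram_range=(1, 2): Include unigrams (single words) and bigrams (word pairs) for context
#   - min_df=2: Ignore terms appearing in fewer than 2 documents (filter rare terms)
#   - max_df=0.95: Ignore terms appearing in more than 95% of documents (filter common terms)
# Note: the vectorizer is fitted on the full dataset here; if these features are used for model
# evaluation, fitting on the training split only avoids leaking test-set statistics into the features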

tfidf_vectorizer = TfidfVectorizer(max_features=5000, ngram_range=(1, 2), min_df=2, max_df=0.95)
X_tfidf = tfidf_vectorizer.fit_transform(df['cleaned_review'])  # Transform text to TF-IDF matrix
y = df['sentiment'].map({'negative': 0, 'positive': 1})  # Convert labels to binary (0=negative, 1=positive)

print(f"TF-IDF matrix shape: {X_tfidf.shape}")
print(f"Vocabulary size: {len(tfidf_vectorizer.vocabulary_)}")
print(f"Label distribution: {y.value_counts().to_dict()}")
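
The weighting behind the TF-IDF matrix can be reproduced by hand on a toy corpus. The sketch below assumes scikit-learn's default settings for `TfidfVectorizer` (raw term counts, smoothed IDF `ln((1 + n) / (1 + df)) + 1`, and L2 row normalization). It uses unigrams only and skips the `max_features`, `min_df`, and `max_df` filters. The function `tfidf_rows` and the three documents are hypothetical.

```python
import math
from collections import Counter

# Hypothetical toy corpus (already "cleaned"), not taken from the dataset
docs = ["great movie great cast", "bad movie", "great fun"]

def tfidf_rows(docs):
    """Compute unigram TF-IDF rows with smoothed IDF and L2 normalization."""
    tokenized = [d.split() for d in docs]
    n_docs = len(tokenized)
    vocab = sorted({w for doc in tokenized for w in doc})
    # Document frequency: number of documents containing each term
    df = Counter(w for doc in tokenized for w in set(doc))
    idf = {w: math.log((1 + n_docs) / (1 + df[w])) + 1 for w in vocab}
    rows = []
    for doc in tokenized:
        tf = Counter(doc)
        raw = [tf[w] * idf[w] for w in vocab]
        norm = math.sqrt(sum(v * v for v in raw)) or 1.0
        rows.append([v / norm for v in raw])
    return vocab, idf, rows

vocab, idf, rows = tfidf_rows(docs)
for word in vocab:
    print(f"{word:>6}: idf={idf[word]:.3f}")
```

Terms that occur in a single document ("bad", "cast", "fun") receive a higher IDF than terms shared by two documents ("great", "movie"), which is the behaviour the comment above describes.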


#### A3. Handling Class Imbalance (If Present)

Check the sentiment distribution, test whether the classes are balanced, and visualize the distribution.


In [None]:
# Check sentiment distribution
sentiment_counts = df['sentiment'].value_counts()
print(f"\nSentiment distribution:\n{sentiment_counts}")
print(f"\nPercentage:\n{df['sentiment'].value_counts(normalize=True) * 100}")

# Check if balanced (heuristic: standard deviation of the class proportions below 0.05)
is_balanced = df['sentiment'].value_counts(normalize=True).std() < 0.05
print(f"\nDataset is {'balanced' if is_balanced else 'imbalanced'}")
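
The 0.05 threshold is easier to interpret when translated into class shares. For two classes with shares `p` and `1 - p`, the sample standard deviation (pandas uses `ddof=1` by default) equals `sqrt(2) * |p - 0.5|`, so the check passes when the majority class holds roughly 53.5% of the data or less. The sketch below illustrates this with the standard library; `is_balanced_demo` is a hypothetical helper, not part of the notebook.

```python
import statistics

def is_balanced_demo(positive_share, threshold=0.05):
    """Apply the same std-of-proportions heuristic to a two-class split."""
    shares = [positive_share, 1 - positive_share]
    # statistics.stdev divides by n - 1, matching pandas' default ddof=1
    return statistics.stdev(shares) < threshold

for p in (0.50, 0.52, 0.53, 0.54, 0.60):
    spread = statistics.stdev([p, 1 - p])
    print(f"positive share={p:.2f}  std={spread:.4f}  balanced={is_balanced_demo(p)}")
```

A 53/47 split therefore still counts as balanced under this heuristic, while 54/46 does not.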


In [None]:
# Visualize sentiment distribution
# Fix the class order so the bar colors always match the labels
ordered_counts = sentiment_counts.reindex(['positive', 'negative'])
plt.figure(figsize=(10, 6))
ordered_counts.plot(kind='bar', color=['green', 'red'], edgecolor='black')
plt.title('Sentiment Distribution in IMDB Dataset', fontsize=14, fontweight='bold')
plt.xlabel('Sentiment', fontsize=12)
plt.ylabel('Count', fontsize=12)
plt.xticks(rotation=0)
plt.grid(axis='y', alpha=0.3)

# Add value labels on bars
for i, v in enumerate(ordered_counts):
    plt.text(i, v + 500, str(v), ha='center', va='bottom', fontweight='bold')

plt.tight_layout()
plt.show()


### Part B: First Exploratory Data Analysis (EDA)

#### B1. Most Frequent Words by Sentiment


In [None]:
# Get most frequent words for positive and negative reviews
# Part B requirement: "Compute most frequent words in positive vs negative reviews"
from collections import Counter

def get_top_words(reviews, n=20):
    """Return the n most frequent words across reviews as (word, count) pairs."""
    all_words = []
    # Collect all words from all reviews in the set
    for review in reviews:
        all_words.extend(review.split())  # Split each review into words and add to list
    # Count word frequencies and return top N
    return Counter(all_words).most_common(n)

# Positive reviews
positive_reviews = df[df['sentiment'] == 'positive']['cleaned_review']
positive_words = get_top_words(positive_reviews, 20)

# Negative reviews
negative_reviews = df[df['sentiment'] == 'negative']['cleaned_review']
negative_words = get_top_words(negative_reviews, 20)

print("Top 10 words in POSITIVE reviews:")
for word, count in positive_words[:10]:
    print(f"  {word}: {count}")

print("\nTop 10 words in NEGATIVE reviews:")
for word, count in negative_words[:10]:
    print(f"  {word}: {count}")
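
Raw top-N lists for the two classes can share many generic words, which makes the contrast hard to read. One simple way to surface class-specific vocabulary is to keep only the words that appear in one top list but not the other. The sketch below shows the idea on toy data; `top_words_demo`, `distinctive_words`, and the sample reviews are hypothetical.

```python
from collections import Counter

def top_words_demo(reviews, n=10):
    """Return the n most frequent words as (word, count) pairs."""
    counts = Counter()
    for review in reviews:
        counts.update(review.split())
    return counts.most_common(n)

def distinctive_words(words_a, words_b):
    """Words from top list A that do not appear in top list B, in A's order."""
    in_b = {word for word, _ in words_b}
    return [word for word, _ in words_a if word not in in_b]

# Hypothetical cleaned reviews, not taken from the dataset
pos_demo = ["great movie great story", "movie love film"]
neg_demo = ["bad movie bad film", "movie boring film"]

pos_top = top_words_demo(pos_demo)
neg_top = top_words_demo(neg_demo)
print("Only in positive top list:", distinctive_words(pos_top, neg_top))
print("Only in negative top list:", distinctive_words(neg_top, pos_top))
```

With the notebook's variables, the equivalent call would be `distinctive_words(positive_words, negative_words)`, since both are lists of `(word, count)` pairs.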


#### B2. Word Clouds for Each Class


In [None]:
# Create word clouds
fig, axes = plt.subplots(1, 2, figsize=(16, 8))

# Positive reviews word cloud
positive_text = ' '.join(positive_reviews)
wordcloud_positive = WordCloud(width=800, height=400, background_color='white').generate(positive_text)
axes[0].imshow(wordcloud_positive, interpolation='bilinear')
axes[0].set_title('Positive Reviews Word Cloud', fontsize=14, fontweight='bold')
axes[0].axis('off')

# Negative reviews word cloud
negative_text = ' '.join(negative_reviews)
wordcloud_negative = WordCloud(width=800, height=400, background_color='white').generate(negative_text)
axes[1].imshow(wordcloud_negative, interpolation='bilinear')
axes[1].set_title('Negative Reviews Word Cloud', fontsize=14, fontweight='bold')
axes[1].axis('off')

plt.tight_layout()
plt.show()


#### B3. Histogram of Review Lengths


In [None]:
# Calculate review lengths (token counts after cleaning, so stopwords and short tokens are excluded)
df['review_length'] = df['cleaned_review'].apply(lambda x: len(x.split()))

# Create histogram
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Histogram by sentiment
positive_lengths = df[df['sentiment'] == 'positive']['review_length']
negative_lengths = df[df['sentiment'] == 'negative']['review_length']

axes[0].hist([positive_lengths, negative_lengths], bins=50, alpha=0.7, 
             label=['Positive', 'Negative'], color=['green', 'red'])
axes[0].set_xlabel('Review Length (words)', fontsize=12)
axes[0].set_ylabel('Frequency', fontsize=12)
axes[0].set_title('Review Length Distribution by Sentiment', fontsize=14, fontweight='bold')
axes[0].legend()
axes[0].grid(alpha=0.3)

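# Box plot (tick labels set separately; the `labels` argument is deprecated in newer Matplotlib releases)
axes[1].boxplot([positive_lengths, negative_lengths])
axes[1].set_xticklabels(['Positive', 'Negative'])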
axes[1].set_ylabel('Review Length (words)', fontsize=12)
axes[1].set_title('Review Length Box Plot by Sentiment', fontsize=14, fontweight='bold')
axes[1].grid(alpha=0.3)

plt.tight_layout()
plt.show()

# Statistics
print(f"Positive: Mean={positive_lengths.mean():.1f}, Median={positive_lengths.median():.1f}, Std={positive_lengths.std():.1f}")
print(f"Negative: Mean={negative_lengths.mean():.1f}, Median={negative_lengths.median():.1f}, Std={negative_lengths.std():.1f}")


#### B4. Do Positive vs Negative Reviews Differ in Length or Vocabulary?

**Findings:**
1. **Length differences**: Positive and negative reviews have similar average lengths (see the mean, median, and standard deviation printed in B3), although the shapes of the two distributions may differ slightly.
2. **Vocabulary differences**: Positive reviews contain words like "great", "excellent", "wonderful", "love", "good", "best", "amazing", "enjoy", "perfect", "brilliant", whereas negative reviews contain words like "bad", "worst", "awful", "terrible", "horrible", "waste", "boring", "disappointing", "poor", "dull".
3. **Key insight**: Although review length is similar across the two classes, vocabulary clearly differs, which makes sentiment classification feasible.
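
The length comparison can be quantified rather than judged by eye. The sketch below computes Welch's t statistic and Cohen's d with the standard library; the helper names and the sample values are hypothetical and are not results from the dataset.

```python
import math
import statistics

def welch_t(a, b):
    """Welch's t statistic for two independent samples with unequal variances."""
    se = math.sqrt(statistics.variance(a) / len(a) + statistics.variance(b) / len(b))
    return (statistics.mean(a) - statistics.mean(b)) / se

def cohens_d(a, b):
    """Standardized mean difference using the pooled sample standard deviation."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * statistics.variance(a)
                  + (nb - 1) * statistics.variance(b)) / (na + nb - 2)
    return (statistics.mean(a) - statistics.mean(b)) / math.sqrt(pooled_var)

# Hypothetical token counts per review, for illustration only
pos_demo = [100, 120, 140]
neg_demo = [90, 110, 130]
print(f"t = {welch_t(pos_demo, neg_demo):.3f}, d = {cohens_d(pos_demo, neg_demo):.2f}")
```

On the notebook's data the inputs would be `positive_lengths` and `negative_lengths`. SciPy's `ttest_ind(positive_lengths, negative_lengths, equal_var=False)`, already imported above, performs Welch's test and also returns a p-value. With 50,000 reviews even a small difference can reach statistical significance, so the effect size is likely the more informative number.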
