## IMDB Movie Review Sentiment Analysis with EDA, Feature Engineering, and ML

This notebook implements a comprehensive sentiment analysis pipeline covering:
- Data Loading & Preprocessing
- First EDA
- Feature Engineering
- Second EDA & Statistical Inference
- Machine Learning
- Presentation & Reflection

### Dataset
- **Source**: IMDB Movie Reviews Dataset - https://www.kaggle.com/datasets/mahmoudshaheen1134/imdp-data
- **Location**: `../dataset/IMDB-Dataset.csv`

### Authors
- **Students**: Ainedembe Denis, Musinguzi Benson
- **Lecturer**: Harriet Sibitenda (PhD)


### Import Required Libraries


In [None]:
# DATA MANIPULATION AND ANALYSIS
import pandas as pd  # Data manipulation and analysis: DataFrames, data loading, cleaning
import numpy as np   # Numerical computing: arrays, mathematical operations, random number generation

# VISUALIZATION
import matplotlib.pyplot as plt   # Plotting library: create charts, graphs, histograms
import seaborn as sns             # Statistical visualization: enhanced plots, heatmaps, statistical graphics
from wordcloud import WordCloud   # Word cloud generation: visualize most frequent words in text

# MACHINE LEARNING - MODEL SELECTION AND TRAINING
from sklearn.model_selection import train_test_split  # Split data into training and testing sets
from sklearn.model_selection import KFold             # K-fold cross-validation iterator

# MACHINE LEARNING - FEATURE EXTRACTION
from sklearn.feature_extraction.text import TfidfVectorizer  # Convert text to TF-IDF features (term frequency-inverse document frequency)
from sklearn.feature_extraction.text import CountVectorizer  # Convert text to bag-of-words count features
from collections import Counter                              # Count occurrences of elements in iterable i.e used for word frequency

# MACHINE LEARNING - CLASSIFICATION MODELS
from sklearn.naive_bayes import MultinomialNB               # Naive Bayes classifier for text classification
from sklearn.linear_model import LogisticRegression         # Logistic regression classifier
from sklearn.svm import LinearSVC                           # Linear Support Vector Machine classifier
from sklearn.ensemble import GradientBoostingClassifier     # Gradient Boosting ensemble classifier

# MACHINE LEARNING - MODEL EVALUATION METRICS
from sklearn.metrics import accuracy_score      # Calculate classification accuracy
from sklearn.metrics import precision_score     # Calculate precision - positive predictive value
from sklearn.metrics import recall_score        # Calculate recall (sensitivity, true positive rate)
from sklearn.metrics import f1_score            # Calculate F1 score (harmonic mean of precision and recall)
from sklearn.metrics import confusion_matrix    # Generate confusion matrix for classification results
from sklearn.metrics import roc_auc_score       # Calculate ROC-AUC (Area Under ROC Curve)

# MACHINE LEARNING - UTILITIES
from sklearn.utils import resample                      # Resample datasets for balancing classes
from sklearn.preprocessing import MinMaxScaler          # Scale features to a specific range (0-1)
from sklearn.base import clone                          # Clone/copy machine learning models
from sklearn.calibration import CalibratedClassifierCV  # Calibrate classifier probabilities

# STATISTICAL ANALYSIS
from scipy import stats               # Statistical functions: distributions, statistical tests (stats.sem, stats.t.ppf)
from scipy.stats import ttest_ind     # Independent samples t-test for hypothesis testing

# NATURAL LANGUAGE PROCESSING
import nltk                                    # Natural Language Toolkit: NLP library
from nltk.corpus import stopwords              # Common stopwords (e.g., "the", "a", "is") to filter out
import re                                      # Regular expressions: pattern matching for text cleaning (HTML removal, punctuation)

# NLTK DATA DOWNLOAD AND INITIALIZATION
# Download stopwords corpus
try:
    nltk.data.find('corpora/stopwords')
except LookupError:
    nltk.download('stopwords', quiet=True)

# Initialize stopwords set for English language
stop_words = set(stopwords.words('english'))

# REPRODUCIBILITY SETTINGS
# Set random seed for reproducibility (ensures consistent results across runs)
np.random.seed(42)

# DISPLAY SETTINGS
# Configure pandas display options
pd.set_option('display.max_columns', None)      # Show all columns when printing DataFrames
pd.set_option('display.max_colwidth', 100)      # Maximum column width when displaying text

# Configure matplotlib and seaborn visualization styles
plt.style.use('seaborn-v0_8')                   # Use seaborn style for plots
sns.set_palette("husl")                         # Set color palette for seaborn plots
# Display plots inline in Jupyter notebook
%matplotlib inline                              

print("Libraries imported and environment set up successfully.")
print(f"NLTK data ready. Loaded {len(stop_words)} stopwords.")

### Part A: Data Loading & Preprocessing

#### A1. Load Dataset


In [None]:
# Load dataset from CSV file
# The dataset contains movie reviews and their sentiment labels
df = pd.read_csv('../dataset/IMDB-Dataset.csv')

print(f"Dataset shape: {df.shape}")
df.head()  # Display first few rows to inspect the data structure


In [None]:
# Check data quality
msing_vl = df.isnull().sum().sum()
print(f"Missing values: {msing_vl}")


### A2. Text Cleaning

Removing HTML tags, Converting to lowercase, Removing punctuation & Tokenizing
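
The steps listed above can be traced on a single string. This is a minimal sketch, assuming a tiny hand-written stopword set (`TOY_STOPWORDS`, hypothetical) in place of the NLTK list so that it runs without any downloads. The helper name `clean_text_demo` is also made up for illustration.

```python
import re

# Hypothetical mini stopword list; the notebook itself uses NLTK's English stopwords
TOY_STOPWORDS = {"the", "a", "is", "was", "it", "this", "and"}

def clean_text_demo(text):
    """Apply the cleaning steps to one string and return the surviving tokens."""
    text = re.sub(r"<[^>]+>", " ", text)       # 1. strip HTML tags such as <br />
    text = text.lower()                         # 2. lowercase
    text = re.sub(r"[^a-z0-9\s]", " ", text)    # 3. replace punctuation with spaces
    tokens = text.split()                       # 4. whitespace tokenization
    # Drop stopwords and tokens of one or two characters
    return [t for t in tokens if t not in TOY_STOPWORDS and len(t) > 2]

print(clean_text_demo("This movie was GREAT!<br /><br />Loved it."))
```

Only the content-bearing tokens survive, which is what the vectorizer later sees.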


In [None]:
# Sample review before cleaning
print("Sample review of first 300 characters:")
print(df['review'].iloc[0][:300])
print("\n")


In [None]:
# Text cleaning function
# This function implements the required preprocessing steps:

def clean_text(text):
    # Handle missing or empty text
    if pd.isna(text) or text == '':
        return ''
    # Step 1: Remove HTML tags using regex; replace them with a space so that
    # words on either side of a tag (e.g. "end<br />start") do not merge
    text = re.sub(r'<[^>]+>', ' ', str(text))
    # Step 2: Convert to lowercase for consistency
    text = text.lower()
    # Step 3: Remove punctuation - keep alphanumeric and spaces only
    text = re.sub(r'[^a-zA-Z0-9\s]', ' ', text)
    # Step 4: Simple tokenization - split on whitespace
    tokens = text.split()
    # Step 5: Remove stopwords (common words like "the", "a", "is" that carry little sentiment)
    # and very short tokens (two characters or fewer) to filter out noise
    tokens = [word for word in tokens if word not in stop_words and len(word) > 2]
    # Join tokens back to string for vectorization
    return ' '.join(tokens)

# Test cleaning function
sample_text = df['review'].iloc[0]
cleaned_sample = clean_text(sample_text)
print("Original review, first 300 chars:", sample_text[:300])
print("Cleaned review, first 300 chars:", cleaned_sample[:300])


In [None]:
# Apply cleaning to all reviews
df['cleaned_review'] = df['review'].apply(clean_text)
print("Apply cleaning to all reviews completed.")

In [None]:
# Convert text to numerical features using TF-IDF (Term Frequency-Inverse Document Frequency)
# TF-IDF is chosen over plain Bag-of-Words because it weights words by importance:
# terms that are rare across documents get higher scores
# Parameters:
#   - max_features=5000: Limit vocabulary to the 5000 most frequent terms, which reduces dimensionality
#   - ngram_range=(1, 2): Include unigrams (single words) and bigrams (word pairs) for context
#   - min_df=2: Ignore terms appearing in fewer than 2 documents (filter rare terms)
#   - max_df=0.95: Ignore terms appearing in more than 95% of documents (filter common terms)

tfidf_vectorizer = TfidfVectorizer(max_features=5000, ngram_range=(1, 2), min_df=2, max_df=0.95)
X_tfidf = tfidf_vectorizer.fit_transform(df['cleaned_review'])  # Transform text to TF-IDF matrix
y = df['sentiment'].map({'negative': 0, 'positive': 1})  # Convert labels to binary (0=negative, 1=positive)

print(f"TF-IDF matrix shape: {X_tfidf.shape}")
print(f"Vocabulary size: {len(tfidf_vectorizer.vocabulary_)}")
print(f"Label distribution: {y.value_counts().to_dict()}")
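
To make the "rare terms get higher scores" point concrete, here is a minimal pure-Python sketch of the smoothed IDF weighting that scikit-learn documents as the `TfidfVectorizer` default, idf(t) = ln((1 + n) / (1 + df(t))) + 1. The three-document corpus and the helper name `idf_weights` are made up for illustration, and the sketch leaves out the term-frequency factor and the L2 row normalization that the vectorizer also applies.

```python
import math
from collections import Counter

def idf_weights(docs):
    """Smoothed inverse document frequency for every term in a toy corpus."""
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term at least once
    doc_freq = Counter(term for doc in docs for term in set(doc.split()))
    return {t: math.log((1 + n_docs) / (1 + df)) + 1 for t, df in doc_freq.items()}

toy_docs = ["great movie", "boring movie", "movie plot twist"]
weights = idf_weights(toy_docs)
for term in sorted(weights, key=weights.get):
    print(f"{term:>7}: {weights[term]:.3f}")
```

The term that appears in every document ("movie") receives the smallest weight, while terms confined to one document receive the largest.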


### A3. Handling class balance if dataset is imbalanced

Check sentiment distribution, Check if balanced, Visualize sentiment distribution
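
The cells in this section only check the balance. If the check were to report an imbalance, one simple remedy is random undersampling of the majority class. Below is a minimal standard-library sketch on made-up labels; the helper `undersample` is hypothetical and not part of the notebook, which imports `sklearn.utils.resample` and could use that for the same purpose.

```python
import random
from collections import Counter

def undersample(rows, label_of, seed=42):
    """Randomly drop rows until every class has as many rows as the smallest one."""
    rng = random.Random(seed)
    by_class = {}
    for row in rows:
        by_class.setdefault(label_of(row), []).append(row)
    target = min(len(group) for group in by_class.values())
    balanced = []
    for group in by_class.values():
        balanced.extend(rng.sample(group, target))  # sample without replacement
    rng.shuffle(balanced)
    return balanced

# Toy data: 8 positive rows and 3 negative rows
toy_rows = [(f"review {i}", "positive") for i in range(8)]
toy_rows += [(f"review {i}", "negative") for i in range(8, 11)]

balanced_rows = undersample(toy_rows, label_of=lambda row: row[1])
print(Counter(label for _, label in balanced_rows))
```

Undersampling discards data, so it is usually reserved for cases where the majority class is much larger than the minority class.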


In [None]:
# Check sentiment distribution
sentiment_counts = df['sentiment'].value_counts()
print(f"\nSentiment distribution:\n{sentiment_counts}")

percg = df['sentiment'].value_counts(normalize=True) * 100
print(f"\nPercentage:\n{percg}")

# Check if balanced
is_balanced = df['sentiment'].value_counts(normalize=True).std() < 0.05
print(f"\nDataset is {'balanced' if is_balanced else 'imbalanced'}")


In [None]:
# Visualize sentiment distribution
plt.figure(figsize=(10, 6))
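# Map colors by label so that green is always 'positive', whatever order value_counts() returns
bar_colors = ['green' if label == 'positive' else 'red' for label in sentiment_counts.index]
sentiment_counts.plot(kind='bar', color=bar_colors, edgecolor='black')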
plt.title('Sentiment Distribution in IMDB Dataset', fontsize=14, fontweight='bold')
plt.xlabel('Sentiment', fontsize=12)
plt.ylabel('Count', fontsize=12)
plt.xticks(rotation=0)
plt.grid(axis='y', alpha=0.3)

# Add value labels on bars
for i, v in enumerate(sentiment_counts):
    plt.text(i, v + 500, str(v), ha='center', va='bottom', fontweight='bold')

plt.tight_layout()
plt.show()


### Part B: First Exploratory Data Analysis (EDA)

### B1. Most Frequent Words by Sentiment


In [None]:
# Get most frequent words for positive and negative reviews
# Part B requirement: "Compute most frequent words in positive vs negative reviews"

def get_top_words(reviews, n=20):
    #Get top N words from reviews by frequency count.
    all_words = []
    # Collect all words from all reviews in the set
    for review in reviews:
        all_words.extend(review.split())  # Split each review into words and add to list
    # Count word frequencies and return top N
    return Counter(all_words).most_common(n)

# Positive reviews
positive_reviews = df[df['sentiment'] == 'positive']['cleaned_review']
positive_words = get_top_words(positive_reviews, 20)

# Negative reviews
negative_reviews = df[df['sentiment'] == 'negative']['cleaned_review']
negative_words = get_top_words(negative_reviews, 20)

print("Top 10 words in POSITIVE reviews:")
for word, count in positive_words[:10]:
    print(f"  {word}: {count}")

print("\nTop 10 words in NEGATIVE reviews:")
for word, count in negative_words[:10]:
    print(f"  {word}: {count}")


### B2. Word Clouds for Each Class


In [None]:
# Create word clouds
fig, axes = plt.subplots(1, 2, figsize=(16, 8))

# Positive reviews word cloud
positive_text = ' '.join(positive_reviews)
wordcloud_positive = WordCloud(width=800, height=400, background_color='white').generate(positive_text)
axes[0].imshow(wordcloud_positive, interpolation='bilinear')
axes[0].set_title('Positive Reviews Word Cloud', fontsize=14, fontweight='bold')
axes[0].axis('off')

# Negative reviews word cloud
negative_text = ' '.join(negative_reviews)
wordcloud_negative = WordCloud(width=800, height=400, background_color='white').generate(negative_text)
axes[1].imshow(wordcloud_negative, interpolation='bilinear')
axes[1].set_title('Negative Reviews Word Cloud', fontsize=14, fontweight='bold')
axes[1].axis('off')

plt.tight_layout()
plt.show()


### B3. Histogram of Review Lengths


In [None]:
# Calculate review lengths
df['review_length'] = df['cleaned_review'].apply(lambda x: len(x.split()))

# Create histogram
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Histogram by sentiment
positive_lengths = df[df['sentiment'] == 'positive']['review_length']
negative_lengths = df[df['sentiment'] == 'negative']['review_length']

axes[0].hist([positive_lengths, negative_lengths], bins=50, alpha=0.7, 
             label=['Positive', 'Negative'], color=['green', 'red'])
axes[0].set_xlabel('Review Length (words)', fontsize=12)
axes[0].set_ylabel('Frequency', fontsize=12)
axes[0].set_title('Review Length Distribution by Sentiment', fontsize=14, fontweight='bold')
axes[0].legend()
axes[0].grid(alpha=0.3)

# Box plot
axes[1].boxplot([positive_lengths, negative_lengths], labels=['Positive', 'Negative'])
axes[1].set_ylabel('Review Length (words)', fontsize=12)
axes[1].set_title('Review Length Box Plot by Sentiment', fontsize=14, fontweight='bold')
axes[1].grid(alpha=0.3)

plt.tight_layout()
plt.show()

# Statistics
print(f"Positive: Mean={positive_lengths.mean():.1f}, Median={positive_lengths.median():.1f}, Std={positive_lengths.std():.1f}")
print(f"Negative: Mean={negative_lengths.mean():.1f}, Median={negative_lengths.median():.1f}, Std={negative_lengths.std():.1f}")


### B4. Do Positive vs Negative Reviews Differ in Length or Vocabulary?

**Findings:**
1. **Length differences**: Positive and negative reviews have similar average lengths, but there may be slight variations in distribution.
2. **Vocabulary differences**: Positive reviews contain words like "great", "excellent", "wonderful", "love", "good", "best", "amazing", "enjoy", "perfect", "brilliant".
3. **Negative reviews** contain words like "bad", "worst", "awful", "terrible", "horrible", "waste", "boring", "disappointing", "poor", "dull".
4. **Key insight**: While length may be similar, vocabulary clearly differs between positive and negative reviews, which makes sentiment classification feasible.


### Part C: Feature Engineering

### C1. Create sentiment lexicon score per review (positive − negative word counts)
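
As a quick illustration of the score, the sketch below counts lexicon hits in two short cleaned reviews. The lexicons and the function name `lexicon_score_demo` are tiny hypothetical stand-ins, not the lists used in the notebook.

```python
def lexicon_score_demo(text, positive_words, negative_words):
    """Return (#positive words) - (#negative words) for a whitespace-tokenized text."""
    tokens = text.lower().split()
    positives = sum(token in positive_words for token in tokens)
    negatives = sum(token in negative_words for token in tokens)
    return positives - negatives

# Tiny hypothetical lexicons; sets give constant-time membership checks
toy_positive = {"great", "love", "wonderful"}
toy_negative = {"boring", "awful", "waste"}

mixed = lexicon_score_demo("great cast wonderful score boring ending", toy_positive, toy_negative)
harsh = lexicon_score_demo("awful script total waste", toy_positive, toy_negative)
print(mixed, harsh)
```

A mixed review with two positive hits and one negative hit scores 1, while a review with two negative hits and no positive ones scores -2.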


In [None]:
# SENTIMENT LEXICON SCORE - TWO APPROACHES,  MANUAL LEXICON & DYNAMIC EXTRACTION
# (Choose ONE approach by commenting/uncommenting the appropriate sections below)
# ============================================================================
# 
# OPTION 1: MANUAL LEXICON
#   - Pros: Fast, interpretable, based on linguistic knowledge
#   - Cons: Limited coverage, may miss domain-specific terms
#
# OPTION 2: DYNAMIC EXTRACTION (Advanced - Data-Driven)
#   - Pros: Discovers domain-specific words automatically, data-driven
#   - Cons: More complex, may include noise, requires data to be loaded first
# ============================================================================

# OPTION 1: MANUAL LEXICON (Pre-defined word lists)
# Comment out this entire section if using OPTION 2

positive_words_lexicon = ['good', 'great', 'excellent', 'amazing', 'wonderful', 'fantastic', 
                          'love', 'like', 'best', 'perfect', 'brilliant', 'awesome', 'enjoy',
                          'beautiful', 'outstanding', 'superb', 'marvelous', 'fabulous', 'delightful']

negative_words_lexicon = ['bad', 'worst', 'awful', 'terrible', 'horrible', 'hate', 'disappointing',
                          'poor', 'boring', 'dull', 'waste', 'stupid', 'annoying', 'frustrating',
                          'disgusting', 'pathetic', 'ridiculous', 'unpleasant', 'ugly']


"""
# OPTION 2: DYNAMIC EXTRACTION (Extract from dataset)
# Uncomment this entire section if using OPTION 2 (and comment out OPTION 1 above)

# Get word frequencies for positive and negative reviews
positive_reviews = df[df['sentiment'] == 'positive']['cleaned_review']
negative_reviews = df[df['sentiment'] == 'negative']['cleaned_review']

# Count words in each class
positive_word_counts = Counter()
negative_word_counts = Counter()

for review in positive_reviews:
    positive_word_counts.update(review.split())

for review in negative_reviews:
    negative_word_counts.update(review.split())

# Calculate relative frequency (word frequency / total words in class)
total_positive_words = sum(positive_word_counts.values())
total_negative_words = sum(negative_word_counts.values())

# Get all unique words
all_words = set(positive_word_counts.keys()) | set(negative_word_counts.keys())

# Calculate sentiment score for each word
# Score = (freq in positive / total positive) - (freq in negative / total negative)
word_sentiment_scores = {}
for word in all_words:
    pos_freq = positive_word_counts.get(word, 0) / total_positive_words if total_positive_words > 0 else 0
    neg_freq = negative_word_counts.get(word, 0) / total_negative_words if total_negative_words > 0 else 0
    word_sentiment_scores[word] = pos_freq - neg_freq

# Extract top positive and negative words (words with highest/lowest scores)
# Filter by minimum frequency to avoid rare words
min_frequency = 10  # Word must appear at least 10 times
positive_words_dynamic = [word for word, score in sorted(word_sentiment_scores.items(), 
                                                          key=lambda x: x[1], reverse=True)
                          if positive_word_counts.get(word, 0) >= min_frequency][:30]

negative_words_dynamic = [word for word, score in sorted(word_sentiment_scores.items(), 
                                                          key=lambda x: x[1])
                          if negative_word_counts.get(word, 0) >= min_frequency][:30]

print("Top 20 Dynamically Extracted +ve Words:", positive_words_dynamic[:20])
print("Top 20 Dynamically Extracted -ve Words:", negative_words_dynamic[:20])

# Use dynamically extracted lexicons
positive_words_lexicon = positive_words_dynamic
negative_words_lexicon = negative_words_dynamic
"""

# Convert the lexicons to sets once for fast membership checks
positive_lexicon_set = set(positive_words_lexicon)
negative_lexicon_set = set(negative_words_lexicon)

def calculate_lexicon_score(text):
    """Return the sentiment lexicon score: positive word count minus negative word count.

    Positive values indicate more positive words, negative values more negative words,
    and zero indicates a balanced or neutral review.
    """
    words = text.lower().split()  # Convert to lowercase and split into words
    positive_count = sum(1 for word in words if word in positive_lexicon_set)
    negative_count = sum(1 for word in words if word in negative_lexicon_set)
    return positive_count - negative_count

# Calculate lexicon scores
df['lexicon_score'] = df['cleaned_review'].apply(calculate_lexicon_score)

print("Lexicon score statistics:")
print(df['lexicon_score'].describe())
print(f"\nMean by sentiment:\n{df.groupby('sentiment')['lexicon_score'].mean()}")


### C2. Extract N-grams (Bigrams and Trigrams)
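
An n-gram is simply a run of n consecutive tokens. Before handing the work to `CountVectorizer`, the idea can be sketched in a few lines of plain Python; the helper `ngrams` and the toy reviews are hypothetical and only meant to show the mechanics.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return all runs of n consecutive tokens as space-joined strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

toy_reviews = ["one best films ever seen", "one best films year", "worst film ever seen"]

bigram_counts = Counter()
for review in toy_reviews:
    bigram_counts.update(ngrams(review.split(), 2))

print(ngrams("one best films ever seen".split(), 3))
print(bigram_counts.most_common(3))
```

A text with fewer than n tokens yields no n-grams, which is why very short reviews contribute nothing to the trigram counts.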


In [None]:
# Extract bigrams and trigrams
# N-grams capture multi-word phrases that single words miss (e.g., "special effects", "waste time")
# Note: cleaned_review has stopwords removed, so negations such as "not good" no longer appear as n-grams

# Extract the 20 most frequent bigrams (2-word phrases)
bigram_vectorizer = CountVectorizer(ngram_range=(2, 2), max_features=20)
bigrams = bigram_vectorizer.fit_transform(df['cleaned_review'])
bigram_features = bigram_vectorizer.get_feature_names_out()

# Extract the 20 most frequent trigrams (3-word phrases)
trigram_vectorizer = CountVectorizer(ngram_range=(3, 3), max_features=20)
trigrams = trigram_vectorizer.fit_transform(df['cleaned_review'])
trigram_features = trigram_vectorizer.get_feature_names_out()

# get_feature_names_out() is sorted alphabetically, so rank by total corpus count before printing
bigram_totals = np.asarray(bigrams.sum(axis=0)).ravel()
trigram_totals = np.asarray(trigrams.sum(axis=0)).ravel()
print("Top 10 Bigrams:", list(bigram_features[bigram_totals.argsort()[::-1][:10]]))
print("Top 10 Trigrams:", list(trigram_features[trigram_totals.argsort()[::-1][:10]]))

# Store per-review n-gram counts as features (row sums of the sparse count matrices)
df['bigram_count'] = np.asarray(bigrams.sum(axis=1)).ravel()
df['trigram_count'] = np.asarray(trigrams.sum(axis=1)).ravel()


### C3. Compute Readability Metrics
Compute readability metrics (e.g., average word length, sentence length)
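
Both metrics are easiest to see on raw text that still contains its punctuation. The sketch below is one possible way to compute them with regular expressions; the function `readability_demo` is hypothetical and splits on sentence-ending punctuation, which differs slightly from the punctuation-counting approach used in the notebook cell.

```python
import re

def readability_demo(raw_text):
    """Average word length (characters) and average sentence length (words)."""
    words = re.findall(r"[A-Za-z0-9']+", raw_text)
    # Split on runs of sentence-ending punctuation and drop empty pieces
    sentences = [s for s in re.split(r"[.!?]+", raw_text) if s.strip()]
    avg_word_len = sum(len(w) for w in words) / len(words) if words else 0.0
    avg_sent_len = len(words) / len(sentences) if sentences else 0.0
    return avg_word_len, avg_sent_len

avg_word, avg_sent = readability_demo("I loved it. The plot was slow, but the ending worked!")
print(f"avg word length = {avg_word:.2f}, avg sentence length = {avg_sent:.1f}")
```

Sentence length depends on sentence-ending punctuation being present, so it has to be measured before punctuation is stripped.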


In [None]:
# These metrics capture writing style and complexity, which may correlate with sentiment expression
def compute_readability_metrics(text):
    """Compute readability metrics for one text.

    Returns: (avg_word_length, avg_sentence_length), where avg_word_length is the average
    number of characters per word and avg_sentence_length the average number of words per sentence.
    """
    # Average word length: an indicator of vocabulary complexity
    words = text.split()
    if len(words) > 0:
        avg_word_length = sum(len(word) for word in words) / len(words)
    else:
        avg_word_length = 0

    # Average sentence length: an indicator of writing complexity
    # Approximate the sentence count by counting sentence-ending punctuation
    # Note: text without any '.', '!' or '?' (e.g. output of clean_text) counts as one sentence
    sentence_count = max(1, text.count('.') + text.count('!') + text.count('?'))
    avg_sentence_length = len(words) / sentence_count

    return avg_word_length, avg_sentence_length

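# Apply readability metrics
readability_metrics = df['cleaned_review'].apply(compute_readability_metrics)
df['avg_word_length'] = [m[0] for m in readability_metrics]
# clean_text strips '.', '!' and '?', so sentence length is measured on the raw review (HTML tags removed)
raw_text = df['review'].astype(str).str.replace(r'<[^>]+>', ' ', regex=True)
df['avg_sentence_length'] = [compute_readability_metrics(t)[1] for t in raw_text]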

print("Readability metrics statistics:")
print(df[['avg_word_length', 'avg_sentence_length']].describe())
print(f"\nMean by sentiment:\n{df.groupby('sentiment')[['avg_word_length', 'avg_sentence_length']].mean()}")


### C4. Why These Engineered Features Help Sentiment Classification

Importance of these features:
1. *Sentiment Lexicon Score*: Directly captures sentiment polarity by counting positive vs negative words; higher scores indicate more positive sentiment.
2. *N-grams (Bigrams/Trigrams)*: Capture multi-word phrases that single words miss. They can also represent negations ("not good" vs "good"), but only if negation words survive preprocessing; the stopword removal in `clean_text` drops words such as "not".
3. *Readability Metrics*: Capture stylistic differences that may correlate with sentiment.
   - Average word length: longer words may indicate more formal or detailed reviews.
   - Average sentence length: can indicate review complexity and writing style.

Combining these with TF-IDF features provides a richer feature representation.


### Part D: Second EDA & Statistical Inference

### D1. Hypothesis Test: Are Reviews with Higher Lexicon Scores More Likely to be Positive?
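
For intuition about what the t-test measures, the statistic can be computed by hand. The sketch below implements Welch's (unequal-variance) t statistic on made-up lexicon scores. Note that `ttest_ind` uses the pooled-variance form by default, so this is an illustrative variant rather than a reproduction of the notebook's result; the two forms give the same statistic when both groups have the same size.

```python
import math
import statistics

def welch_t(sample_a, sample_b):
    """Welch's t statistic: difference in means divided by its standard error."""
    mean_a, mean_b = statistics.mean(sample_a), statistics.mean(sample_b)
    var_a, var_b = statistics.variance(sample_a), statistics.variance(sample_b)
    std_err = math.sqrt(var_a / len(sample_a) + var_b / len(sample_b))
    return (mean_a - mean_b) / std_err

# Made-up lexicon scores for illustration only
toy_positive_scores = [3, 2, 4, 1, 3, 2]
toy_negative_scores = [-1, 0, -2, 1, -1, 0]

t_value = welch_t(toy_positive_scores, toy_negative_scores)
print(f"t = {t_value:.3f}")  # a large positive t favours H1 (positive > negative)
```

The one-tailed p-value is then the probability of seeing a t statistic at least this large if H0 were true.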


In [None]:
# Hypothesis test: Are reviews with higher lexicon scores more likely to be positive?
# 
# Null Hypothesis (H0): No difference in lexicon scores between positive and negative reviews
# Alternative Hypothesis (H1): Positive reviews have higher lexicon scores than negative reviews
# Significance level: α = 0.05

# Extract lexicon scores for each sentiment class
positive_scores = df[df['sentiment'] == 'positive']['lexicon_score']
negative_scores = df[df['sentiment'] == 'negative']['lexicon_score']

# Perform independent samples t-test (one-tailed, right-tailed)
# Tests if positive reviews have significantly higher lexicon scores
t_stat, p_value = ttest_ind(positive_scores, negative_scores, alternative='greater')

print(f"H0: No difference in lexicon scores")
print(f"H1: Positive reviews have higher lexicon scores")
print(f"T-statistic: {t_stat:.4f}, P-value: {p_value:.4f}")
print(f"Conclusion: {'Reject H0' if p_value < 0.05 else 'Fail to reject H0'} (α = 0.05)")
print(f"Reviews with higher lexicon scores are {'more likely' if p_value < 0.05 else 'not significantly more likely'} to be positive.")


### D2. Confidence Interval for Average Review Length by Sentiment
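
For samples as large as the per-class review counts, the t critical value is very close to the normal one, so the interval can be approximated with the standard library alone. The sketch below works under that normal-approximation assumption; the function `normal_ci` and the toy data are hypothetical, and the notebook's own function uses the t-distribution.

```python
import math
import statistics

def normal_ci(data, confidence=0.95):
    """Approximate CI for the mean: mean +/- z * s / sqrt(n)."""
    n = len(data)
    mean = statistics.mean(data)
    std_err = statistics.stdev(data) / math.sqrt(n)
    z = statistics.NormalDist().inv_cdf((1 + confidence) / 2)  # about 1.96 for 95%
    return mean - z * std_err, mean + z * std_err

# Made-up review lengths; with only ten values a t-based interval would be wider
toy_lengths = [120, 95, 130, 110, 105, 140, 90, 125, 115, 100]
low, high = normal_ci(toy_lengths)
print(f"95% CI for mean length: [{low:.1f}, {high:.1f}]")
```

The interval is centred on the sample mean and shrinks in proportion to 1 / sqrt(n), which is why intervals computed on tens of thousands of reviews are narrow.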


In [None]:
# Calculate confidence intervals for average review length
# Confidence intervals provide a range estimate for the true population mean
def confidence_interval(data, confidence=0.95):
    """Calculate a confidence interval for the mean using the t-distribution.

    Args:
        data: Array-like data
        confidence: Confidence level (default 0.95 for a 95% CI)

    Returns:
        Tuple of (lower_bound, upper_bound)
    """
    n = len(data)
    mean = data.mean()  # Sample mean
    std_err = stats.sem(data)  # Standard error of the mean
    # Margin of error from the t-distribution (appropriate when the population std is unknown)
    h = std_err * stats.t.ppf((1 + confidence) / 2, n - 1)
    return mean - h, mean + h  # Return (lower, upper) bounds

# Positive reviews
pos_mean = positive_lengths.mean()
pos_ci = confidence_interval(positive_lengths)

# Negative reviews
neg_mean = negative_lengths.mean()
neg_ci = confidence_interval(negative_lengths)

print("95% Confidence Intervals for Average Review Length:")
print(f"Positive: Mean={pos_mean:.2f}, CI=[{pos_ci[0]:.2f}, {pos_ci[1]:.2f}]")
print(f"Negative: Mean={neg_mean:.2f}, CI=[{neg_ci[0]:.2f}, {neg_ci[1]:.2f}]")
print(f"Difference: {pos_mean - neg_mean:.2f} words, CI overlap: {'Yes' if pos_ci[1] >= neg_ci[0] and neg_ci[1] >= pos_ci[0] else 'No'}")


### D3. Visualize Sentiment Differences in Engineered Features


In [None]:
# Visualize engineered features by sentiment
fig, axes = plt.subplots(2, 2, figsize=(14, 10))

# Lexicon score distribution
axes[0, 0].hist([df[df['sentiment'] == 'positive']['lexicon_score'],
                 df[df['sentiment'] == 'negative']['lexicon_score']],
                bins=30, alpha=0.7, label=['Positive', 'Negative'], color=['green', 'red'])
axes[0, 0].set_xlabel('Lexicon Score', fontsize=12)
axes[0, 0].set_ylabel('Frequency', fontsize=12)
axes[0, 0].set_title('Lexicon Score Distribution by Sentiment', fontsize=12, fontweight='bold')
axes[0, 0].legend()
axes[0, 0].grid(alpha=0.3)

# Average word length
axes[0, 1].boxplot([df[df['sentiment'] == 'positive']['avg_word_length'],
                    df[df['sentiment'] == 'negative']['avg_word_length']],
                   labels=['Positive', 'Negative'])
axes[0, 1].set_ylabel('Average Word Length', fontsize=12)
axes[0, 1].set_title('Average Word Length by Sentiment', fontsize=12, fontweight='bold')
axes[0, 1].grid(alpha=0.3)

# Average sentence length
axes[1, 0].boxplot([df[df['sentiment'] == 'positive']['avg_sentence_length'],
                    df[df['sentiment'] == 'negative']['avg_sentence_length']],
                   labels=['Positive', 'Negative'])
axes[1, 0].set_ylabel('Average Sentence Length', fontsize=12)
axes[1, 0].set_title('Average Sentence Length by Sentiment', fontsize=12, fontweight='bold')
axes[1, 0].grid(alpha=0.3)

# Review length
axes[1, 1].boxplot([positive_lengths, negative_lengths], labels=['Positive', 'Negative'])
axes[1, 1].set_ylabel('Review Length (words)', fontsize=12)
axes[1, 1].set_title('Review Length by Sentiment', fontsize=12, fontweight='bold')
axes[1, 1].grid(alpha=0.3)

plt.tight_layout()
plt.show()


### D4. Interpret Findings in Context

**Key Findings:**
1. **Lexicon Score**: Positive reviews have significantly higher lexicon scores than negative reviews, as confirmed by the hypothesis test. A significant difference in means shows that the lexicon score carries sentiment signal, although it does not by itself show how well the score separates individual reviews.
2. **Review Length**: Positive and negative reviews have similar average lengths, with overlapping confidence intervals. Length alone is not a strong predictor of sentiment.
3. **Readability Metrics**: Average word length and sentence length show similar distributions for both sentiments, suggesting that writing style (complexity) is not strongly correlated with sentiment.
4. **Implications**: 
   - Lexicon-based features are the most discriminative engineered features
   - TF-IDF features capturing vocabulary differences are likely more important than readability metrics
   - Combining lexicon scores with TF-IDF features should improve classification performance
