# Project 06: Sentiment Analysis System

**Difficulty**: ⭐⭐ Intermediate  
**Estimated Time**: 6-8 hours  
**Prerequisites**: 
- Python programming fundamentals
- Pandas and NumPy basics
- Basic machine learning concepts
- Understanding of classification metrics

## Learning Objectives

By the end of this notebook, you will be able to:

1. **Preprocess text data** using NLTK for NLP tasks (tokenization, stopword removal, stemming/lemmatization)
2. **Engineer text features** using TF-IDF and Word2Vec embeddings
3. **Build and compare multiple models** (Naive Bayes, Logistic Regression, Random Forest, LSTM)
4. **Evaluate sentiment classifiers** using accuracy, precision, recall, F1-score, and ROC curves
5. **Perform error analysis** to understand model limitations and misclassifications
6. **Interpret model predictions** by identifying influential features and words
7. **Export trained models** for production deployment

## Project Overview

**Goal**: Build an NLP sentiment classifier to automatically determine whether movie reviews are positive or negative.

**Dataset**: IMDB Movie Reviews (50,000 reviews - 25,000 train, 25,000 test)

**Approach**:
1. Load and explore the IMDB dataset
2. Preprocess text (cleaning, tokenization, normalization)
3. Engineer features (TF-IDF, Word2Vec)
4. Train multiple models and compare performance
5. Analyze errors and interpret predictions
6. Export best model for deployment

## 1. Setup and Imports

In [None]:
# Data manipulation and analysis
import pandas as pd
import numpy as np
import os
from pathlib import Path

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns
from wordcloud import WordCloud

# NLP and text preprocessing
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer, WordNetLemmatizer
import re
from bs4 import BeautifulSoup

# Feature engineering
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer

# Machine learning models
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

# Model evaluation
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, f1_score,
    confusion_matrix, classification_report, roc_curve, auc, roc_auc_score
)

# Deep learning (TensorFlow/Keras)
try:
    import tensorflow as tf
    from tensorflow.keras.models import Sequential
    from tensorflow.keras.layers import Embedding, LSTM, Dense, Dropout
    from tensorflow.keras.preprocessing.text import Tokenizer
    from tensorflow.keras.preprocessing.sequence import pad_sequences
    TENSORFLOW_AVAILABLE = True
except ImportError:
    print("TensorFlow not available. LSTM model will be skipped.")
    TENSORFLOW_AVAILABLE = False

# Utilities
import pickle
import warnings
from time import time

# Configuration
warnings.filterwarnings('ignore')
%matplotlib inline
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette("husl")

# Set random seeds for reproducibility
np.random.seed(42)
if TENSORFLOW_AVAILABLE:
    tf.random.set_seed(42)

print("All libraries imported successfully!")
print(f"NumPy version: {np.__version__}")
print(f"Pandas version: {pd.__version__}")
print(f"TensorFlow available: {TENSORFLOW_AVAILABLE}")

### Download Required NLTK Data

In [None]:
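# Download required NLTK data
# This only needs to be run once
# Note: recent NLTK releases load the tokenizer from 'punkt_tab' rather than 'punkt',
# so both are requested here
nltk_resources = ['stopwords', 'punkt', 'punkt_tab', 'wordnet', 'omw-1.4']

for resource in nltk_resources:
    # Tokenizer models live under tokenizers/, everything else under corpora/
    category = 'tokenizers' if resource.startswith('punkt') else 'corpora'
    try:
        nltk.data.find(f'{category}/{resource}')
        print(f"✓ {resource} already downloaded")
    except LookupError:
        print(f"Downloading {resource}...")
        nltk.download(resource, quiet=True)
        print(f"✓ {resource} downloaded successfully")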

print("\nAll NLTK resources ready!")

## 2. Data Loading and Preparation

The IMDB dataset contains 50,000 movie reviews split into:
- **Training set**: 25,000 reviews (12,500 positive, 12,500 negative)
- **Test set**: 25,000 reviews (12,500 positive, 12,500 negative)

Reviews are stored as individual text files in `pos/` and `neg/` directories.
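
To make this layout concrete, the sketch below builds a tiny stand-in for the `aclImdb/train` tree in a temporary directory and reads it back. The file names and review texts are invented for illustration; only the `pos/` and `neg/` folder names come from the description above. The helper names `build_toy_corpus` and `read_split` are hypothetical and are not used elsewhere in this notebook.

```python
import tempfile
from pathlib import Path

def build_toy_corpus(root):
    """Create a miniature aclImdb-style tree (hypothetical file names and texts)."""
    samples = {
        'pos': ['Loved it.', 'A wonderful film.'],
        'neg': ['Dull and far too long.'],
    }
    for label, texts in samples.items():
        label_dir = Path(root) / 'aclImdb' / 'train' / label
        label_dir.mkdir(parents=True, exist_ok=True)
        for i, text in enumerate(texts):
            (label_dir / f'{i}_0.txt').write_text(text, encoding='utf-8')

def read_split(root, split='train'):
    """Return (text, label) pairs; sorting makes the file order reproducible."""
    rows = []
    for label in ('pos', 'neg'):
        for path in sorted((Path(root) / 'aclImdb' / split / label).glob('*.txt')):
            rows.append((path.read_text(encoding='utf-8'), label))
    return rows

with tempfile.TemporaryDirectory() as tmp:
    build_toy_corpus(tmp)
    toy_rows = read_split(tmp)

print(toy_rows)
```

Each review is one file, and the label is implied by the folder it sits in, which is why the loader in the next cell walks the two folders separately.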

In [None]:
def load_imdb_data(data_dir, split='train', sample_size=None):
    """
    Load IMDB movie reviews from directory structure.
    
    Parameters:
    -----------
    data_dir : str or Path
        Base directory containing the aclImdb folder
    split : str
        Either 'train' or 'test'
    sample_size : int, optional
        Number of reviews to load per class (for faster testing)
    
    Returns:
    --------
    pd.DataFrame with columns: review, sentiment, label
    """
    data_path = Path(data_dir) / 'aclImdb' / split
    
    # Verify directory exists
    if not data_path.exists():
        raise FileNotFoundError(
            f"Data directory not found: {data_path}\n"
            f"Please download the IMDB dataset and extract it to {data_dir}/\n"
            f"Download from: http://ai.stanford.edu/~amaas/data/sentiment/"
        )
    
    reviews = []
    sentiments = []
    
    # Load positive reviews
    pos_dir = data_path / 'pos'
    pos_files = sorted(pos_dir.glob('*.txt'))  # sorted: glob order is not guaranteed, so sampling stays reproducible
    if sample_size:
        pos_files = pos_files[:sample_size]
    
    for file_path in pos_files:
        with open(file_path, 'r', encoding='utf-8') as f:
            reviews.append(f.read())
            sentiments.append('positive')
    
    # Load negative reviews
    neg_dir = data_path / 'neg'
    neg_files = sorted(neg_dir.glob('*.txt'))  # sorted: glob order is not guaranteed, so sampling stays reproducible
    if sample_size:
        neg_files = neg_files[:sample_size]
    
    for file_path in neg_files:
        with open(file_path, 'r', encoding='utf-8') as f:
            reviews.append(f.read())
            sentiments.append('negative')
    
    # Create DataFrame
    df = pd.DataFrame({
        'review': reviews,
        'sentiment': sentiments
    })
    
    # Add numeric label (0=negative, 1=positive)
    df['label'] = (df['sentiment'] == 'positive').astype(int)
    
    # Shuffle the dataset
    df = df.sample(frac=1, random_state=42).reset_index(drop=True)
    
    return df

# Set data directory (adjust if needed)
DATA_DIR = Path('data')

# For quick testing, use sample_size parameter
# For full analysis, set sample_size=None
SAMPLE_SIZE = None  # Use None for full dataset, or 1000 for quick testing

print("Loading training data...")
train_df = load_imdb_data(DATA_DIR, split='train', sample_size=SAMPLE_SIZE)
print(f"✓ Loaded {len(train_df)} training reviews")

print("\nLoading test data...")
test_df = load_imdb_data(DATA_DIR, split='test', sample_size=SAMPLE_SIZE)
print(f"✓ Loaded {len(test_df)} test reviews")

print(f"\nTotal dataset size: {len(train_df) + len(test_df)} reviews")

In [None]:
# Display first few examples
print("Sample Reviews:")
print("=" * 80)
train_df.head()

In [None]:
# Check data info
print("Training Data Info:")
print(train_df.info())
print("\nClass Distribution:")
print(train_df['sentiment'].value_counts())
print("\nClass Balance:")
print(train_df['sentiment'].value_counts(normalize=True))

## 3. Exploratory Data Analysis (EDA)

Let's explore the characteristics of our text data:
1. Review length distribution
2. Word frequency analysis
3. Word clouds for positive/negative reviews
4. Common words by sentiment
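
As a rough illustration of the word-frequency items in this list, the sketch below counts the most common words per sentiment with `collections.Counter`. The two-review mini-corpus and the three-word stopword set are hypothetical stand-ins; with the real data you would iterate over `train_df['review']` for each sentiment instead.

```python
from collections import Counter

# Hypothetical mini-corpus standing in for the reviews grouped by sentiment
toy_reviews = {
    'positive': ["a great movie with great acting", "great story and a fine cast"],
    'negative': ["a boring movie with bad acting", "bad story and a bad ending"],
}
# Tiny illustrative stopword set (NLTK's English list is much longer)
toy_stopwords = {'a', 'and', 'with'}

def top_words(texts, n=3):
    """Count whitespace-separated tokens, ignoring the toy stopwords."""
    counts = Counter()
    for text in texts:
        counts.update(w for w in text.lower().split() if w not in toy_stopwords)
    return counts.most_common(n)

top_by_sentiment = {label: top_words(texts) for label, texts in toy_reviews.items()}
for label, words in top_by_sentiment.items():
    print(label, words)
```

Words such as "movie" and "acting" appear under both labels, which hints at why raw frequency alone is a weak signal and why a weighting scheme like TF-IDF is useful later on.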

In [None]:
# Calculate review statistics
train_df['review_length'] = train_df['review'].apply(len)
train_df['word_count'] = train_df['review'].apply(lambda x: len(x.split()))

print("Review Length Statistics:")
print(train_df[['review_length', 'word_count']].describe())

print("\nBy Sentiment:")
print(train_df.groupby('sentiment')[['review_length', 'word_count']].mean())

In [None]:
# Visualize review length distribution
fig, axes = plt.subplots(1, 2, figsize=(15, 5))

# Character length distribution
axes[0].hist(train_df[train_df['sentiment']=='positive']['review_length'], 
             bins=50, alpha=0.6, label='Positive', color='green')
axes[0].hist(train_df[train_df['sentiment']=='negative']['review_length'], 
             bins=50, alpha=0.6, label='Negative', color='red')
axes[0].set_xlabel('Review Length (characters)')
axes[0].set_ylabel('Frequency')
axes[0].set_title('Distribution of Review Length')
axes[0].legend()
axes[0].set_xlim(0, 5000)  # Limit x-axis for better visualization

# Word count distribution
axes[1].hist(train_df[train_df['sentiment']=='positive']['word_count'], 
             bins=50, alpha=0.6, label='Positive', color='green')
axes[1].hist(train_df[train_df['sentiment']=='negative']['word_count'], 
             bins=50, alpha=0.6, label='Negative', color='red')
axes[1].set_xlabel('Word Count')
axes[1].set_ylabel('Frequency')
axes[1].set_title('Distribution of Word Count')
axes[1].legend()
axes[1].set_xlim(0, 1000)  # Limit x-axis for better visualization

plt.tight_layout()
plt.show()

print("Observation: Both positive and negative reviews have similar length distributions.")

### Word Clouds

Visualize the most common words in positive and negative reviews.

In [None]:
# Create word clouds for positive and negative reviews
def create_wordcloud(text, title, max_words=100):
    """
    Generate and display a word cloud.
    """
    wordcloud = WordCloud(
        width=800, 
        height=400,
        background_color='white',
        max_words=max_words,
        colormap='viridis'
    ).generate(text)
    
    plt.figure(figsize=(12, 6))
    plt.imshow(wordcloud, interpolation='bilinear')
    plt.axis('off')
    plt.title(title, fontsize=16, fontweight='bold')
    plt.tight_layout(pad=0)
    plt.show()

# Combine all positive reviews
positive_text = ' '.join(train_df[train_df['sentiment']=='positive']['review'])
create_wordcloud(positive_text, 'Most Common Words in Positive Reviews')

# Combine all negative reviews
negative_text = ' '.join(train_df[train_df['sentiment']=='negative']['review'])
create_wordcloud(negative_text, 'Most Common Words in Negative Reviews')

## 4. Text Preprocessing

Text preprocessing is crucial for NLP tasks. We'll implement a comprehensive preprocessing pipeline:

1. **Remove HTML tags** (reviews may contain HTML)
2. **Convert to lowercase** (normalize case)
3. **Remove special characters and punctuation**
4. **Remove numbers** (usually not informative for sentiment)
5. **Tokenization** (split into words)
6. **Remove stopwords** (common words like 'the', 'a', 'is')
7. **Stemming or Lemmatization** (reduce words to root form)
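
Before the full NLTK-based implementation, here is a minimal standard-library sketch of steps 1 to 6 applied to a single string. It is a simplification under stated assumptions: HTML is stripped with a regular expression rather than BeautifulSoup, tokenization is a plain whitespace split, the stopword set `TOY_STOPWORDS` is a small made-up sample, and step 7 (stemming or lemmatization) is omitted.

```python
import re

# Small illustrative stopword set; the notebook itself uses NLTK's English list
TOY_STOPWORDS = {'the', 'a', 'is', 'was', 'and', 'of', 'it', 'this'}

def simple_clean(text):
    """Apply steps 1-6 with regular expressions (no stemming or lemmatization)."""
    text = re.sub(r'<[^>]+>', ' ', text)    # 1. strip HTML tags such as <br />
    text = text.lower()                      # 2. normalize case
    text = re.sub(r'[^a-z\s]', ' ', text)    # 3-4. drop punctuation, digits, symbols
    tokens = text.split()                    # 5. whitespace tokenization
    return [t for t in tokens if t not in TOY_STOPWORDS]  # 6. remove stopwords

cleaned_tokens = simple_clean("This movie was GREAT!<br /><br />10/10, a must-see.")
print(cleaned_tokens)  # ['movie', 'great', 'must', 'see']
```

Note how the rating "10/10" disappears entirely and "must-see" is split in two; both are side effects of keeping only letters, and are worth remembering during error analysis.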

In [None]:
# Initialize preprocessing tools
stop_words = set(stopwords.words('english'))
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

def preprocess_text(text, use_stemming=False, use_lemmatization=True, remove_stopwords=True):
    """
    Comprehensive text preprocessing function.
    
    Parameters:
    -----------
    text : str
        Raw text to preprocess
    use_stemming : bool
        Apply Porter stemming
    use_lemmatization : bool
        Apply WordNet lemmatization (recommended over stemming)
    remove_stopwords : bool
        Remove common English stopwords
    
    Returns:
    --------
    str : Cleaned and preprocessed text
    """
    # 1. Remove HTML tags using BeautifulSoup
    text = BeautifulSoup(text, 'html.parser').get_text()
    
    # 2. Convert to lowercase
    text = text.lower()
    
    # 3. Remove URLs
    text = re.sub(r'http\S+|www\S+', '', text)
    
    # 4. Remove special characters and numbers (keep only letters and spaces)
    text = re.sub(r'[^a-z\s]', ' ', text)
    
    # 5. Remove extra whitespace
    text = re.sub(r'\s+', ' ', text).strip()
    
    # 6. Tokenization
    tokens = word_tokenize(text)
    
    # 7. Remove stopwords
    if remove_stopwords:
        tokens = [word for word in tokens if word not in stop_words]
    
    # 8. Stemming or Lemmatization
    if use_stemming:
        tokens = [stemmer.stem(word) for word in tokens]
    elif use_lemmatization:
        tokens = [lemmatizer.lemmatize(word) for word in tokens]
    
    # 9. Join tokens back into string
    cleaned_text = ' '.join(tokens)
    
    return cleaned_text

# Test preprocessing on a sample review
sample_review = train_df.iloc[0]['review']
print("Original Review:")
print(sample_review[:500])  # Show first 500 characters
print("\n" + "="*80 + "\n")
print("Preprocessed Review:")
print(preprocess_text(sample_review)[:500])

In [None]:
# Apply preprocessing to all reviews
print("Preprocessing training data...")
start_time = time()
train_df['cleaned_review'] = train_df['review'].apply(preprocess_text)
train_time = time() - start_time
print(f"✓ Training data preprocessed in {train_time:.2f} seconds")

print("\nPreprocessing test data...")
start_time = time()
test_df['cleaned_review'] = test_df['review'].apply(preprocess_text)
test_time = time() - start_time
print(f"✓ Test data preprocessed in {test_time:.2f} seconds")

# Check for any empty reviews after preprocessing
empty_train = train_df['cleaned_review'].str.strip().eq('').sum()
empty_test = test_df['cleaned_review'].str.strip().eq('').sum()
print(f"\nEmpty reviews after preprocessing: {empty_train} train, {empty_test} test")

# Remove empty reviews if any
if empty_train > 0:
    train_df = train_df[train_df['cleaned_review'].str.strip() != ''].reset_index(drop=True)
if empty_test > 0:
    test_df = test_df[test_df['cleaned_review'].str.strip() != ''].reset_index(drop=True)

print(f"\nFinal dataset sizes: {len(train_df)} train, {len(test_df)} test")

## 5. Feature Engineering

We'll create two types of text features:

### 5.1 TF-IDF (Term Frequency-Inverse Document Frequency)

TF-IDF converts text to numerical vectors by considering:
- **Term Frequency (TF)**: How often a word appears in a document
- **Inverse Document Frequency (IDF)**: How rare/important a word is across all documents

**Formula**: TF-IDF = TF × IDF

This gives higher weights to words that are frequent in a document but rare overall.
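
The arithmetic can be worked through by hand on a toy corpus. The sketch below uses the textbook definition (TF as a share of tokens, IDF as the log of total documents over documents containing the term); the three documents are hypothetical. scikit-learn's `TfidfVectorizer` uses a smoothed IDF and L2-normalizes each row by default, so its numbers will differ from these even though the ranking intuition is the same.

```python
import math

# Three toy documents, already tokenized (hypothetical data)
docs = [
    ['great', 'movie', 'great', 'cast'],
    ['boring', 'movie'],
    ['great', 'story'],
]

def tf(term, doc):
    """Term frequency: share of the document's tokens that are `term`."""
    return doc.count(term) / len(doc)

def idf(term, corpus):
    """Textbook inverse document frequency: log(N / number of docs containing term)."""
    df = sum(1 for doc in corpus if term in doc)
    return math.log(len(corpus) / df)

def tfidf(term, doc, corpus):
    """TF-IDF weight of `term` in `doc`, relative to `corpus`."""
    return tf(term, doc) * idf(term, corpus)

for term in ('great', 'movie', 'cast'):
    print(f"{term}: {tfidf(term, docs[0], docs):.3f}")
```

In the first document "cast" occurs once and "great" twice, yet "cast" receives the larger weight because it appears in no other document.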

In [None]:
# Create TF-IDF features
# Using both unigrams and bigrams for better context
tfidf_vectorizer = TfidfVectorizer(
    max_features=10000,  # Limit to top 10,000 features
    ngram_range=(1, 2),  # Use unigrams and bigrams
    min_df=5,            # Ignore terms that appear in fewer than 5 documents
    max_df=0.8,          # Ignore terms that appear in more than 80% of documents
    sublinear_tf=True    # Apply sublinear tf scaling (1 + log(tf))
)

# Fit on training data only (prevent data leakage)
print("Creating TF-IDF features...")
start_time = time()
X_train_tfidf = tfidf_vectorizer.fit_transform(train_df['cleaned_review'])
X_test_tfidf = tfidf_vectorizer.transform(test_df['cleaned_review'])
tfidf_time = time() - start_time

print(f"✓ TF-IDF features created in {tfidf_time:.2f} seconds")
print(f"Feature matrix shape: {X_train_tfidf.shape}")
print(f"Number of features: {len(tfidf_vectorizer.get_feature_names_out())}")
print(f"Sparsity: {(1.0 - X_train_tfidf.nnz / (X_train_tfidf.shape[0] * X_train_tfidf.shape[1])) * 100:.2f}%")

# Extract labels
y_train = train_df['label'].values
y_test = test_df['label'].values

In [None]:
# Examine top TF-IDF features
feature_names = tfidf_vectorizer.get_feature_names_out()
print("Sample features (unigrams and bigrams):")
print(feature_names[:20])
print("\nTotal vocabulary size:", len(feature_names))

## 6. Model Training and Evaluation

We'll train and compare four different models:

1. **Naive Bayes**: Fast probabilistic classifier (baseline)
2. **Logistic Regression**: Linear model with strong performance on text
3. **Random Forest**: Ensemble method for non-linear patterns
4. **LSTM Neural Network**: Deep learning for sequential patterns

For each model, we'll track:
- Training time
- Accuracy, Precision, Recall, F1-Score
- Confusion matrix
- ROC curve
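
For reference, the four scalar metrics follow directly from the confusion-matrix counts. The sketch below computes them from true/false positives and negatives; the counts are hypothetical and are not results from this notebook, and `binary_metrics` is an illustrative helper rather than part of scikit-learn.

```python
def binary_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, recall and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return {'accuracy': accuracy, 'precision': precision, 'recall': recall, 'f1': f1}

# Hypothetical counts for 100 test reviews
toy_metrics = binary_metrics(tp=40, fp=10, fn=5, tn=45)
for name, value in toy_metrics.items():
    print(f"{name}: {value:.4f}")
```

Precision asks "of the reviews predicted positive, how many were?", while recall asks "of the truly positive reviews, how many were found?"; F1 is their harmonic mean.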

In [None]:
# Helper function to evaluate models
def evaluate_model(y_true, y_pred, y_pred_proba=None, model_name="Model"):
    """
    Comprehensive model evaluation with multiple metrics.
    """
    print(f"\n{'='*60}")
    print(f"{model_name} Evaluation Results")
    print(f"{'='*60}")
    
    # Calculate metrics
    accuracy = accuracy_score(y_true, y_pred)
    precision = precision_score(y_true, y_pred)
    recall = recall_score(y_true, y_pred)
    f1 = f1_score(y_true, y_pred)
    
    print(f"Accuracy:  {accuracy:.4f}")
    print(f"Precision: {precision:.4f}")
    print(f"Recall:    {recall:.4f}")
    print(f"F1-Score:  {f1:.4f}")
    
    # ROC-AUC if probabilities available (computed once and reused in the return value)
    roc_auc = None
    if y_pred_proba is not None:
        roc_auc = roc_auc_score(y_true, y_pred_proba)
        print(f"ROC-AUC:   {roc_auc:.4f}")
    
    # Classification report
    print("\nClassification Report:")
    print(classification_report(y_true, y_pred, target_names=['Negative', 'Positive']))
    
    return {
        'accuracy': accuracy,
        'precision': precision,
        'recall': recall,
        'f1': f1,
        'roc_auc': roc_auc
    }

# Store results for comparison
model_results = {}

### 6.1 Naive Bayes Classifier

In [None]:
# Train Naive Bayes model
print("Training Naive Bayes classifier...")
start_time = time()

nb_model = MultinomialNB(alpha=1.0)  # alpha is smoothing parameter
nb_model.fit(X_train_tfidf, y_train)

nb_train_time = time() - start_time
print(f"✓ Training completed in {nb_train_time:.2f} seconds")

# Make predictions
y_pred_nb = nb_model.predict(X_test_tfidf)
y_pred_proba_nb = nb_model.predict_proba(X_test_tfidf)[:, 1]

# Evaluate
nb_metrics = evaluate_model(y_test, y_pred_nb, y_pred_proba_nb, "Naive Bayes")
nb_metrics['train_time'] = nb_train_time
model_results['Naive Bayes'] = nb_metrics

### 6.2 Logistic Regression

In [None]:
# Train Logistic Regression model
print("Training Logistic Regression classifier...")
start_time = time()

lr_model = LogisticRegression(
    C=1.0,              # Inverse regularization strength
    max_iter=1000,      # Maximum iterations
    solver='saga',      # Solver for large datasets
    random_state=42,
    n_jobs=-1           # Parallelizes only across classes; little effect for this binary task
)
lr_model.fit(X_train_tfidf, y_train)

lr_train_time = time() - start_time
print(f"✓ Training completed in {lr_train_time:.2f} seconds")

# Make predictions
y_pred_lr = lr_model.predict(X_test_tfidf)
y_pred_proba_lr = lr_model.predict_proba(X_test_tfidf)[:, 1]

# Evaluate
lr_metrics = evaluate_model(y_test, y_pred_lr, y_pred_proba_lr, "Logistic Regression")
lr_metrics['train_time'] = lr_train_time
model_results['Logistic Regression'] = lr_metrics

### 6.3 Random Forest Classifier

In [None]:
# Train Random Forest model
print("Training Random Forest classifier...")
print("Note: This may take several minutes due to high dimensionality.")
start_time = time()

rf_model = RandomForestClassifier(
    n_estimators=100,    # Number of trees
    max_depth=50,        # Limit depth to prevent overfitting
    min_samples_split=5,
    random_state=42,
    n_jobs=-1,           # Use all CPU cores
    verbose=0
)
rf_model.fit(X_train_tfidf, y_train)

rf_train_time = time() - start_time
print(f"✓ Training completed in {rf_train_time:.2f} seconds")

# Make predictions
y_pred_rf = rf_model.predict(X_test_tfidf)
y_pred_proba_rf = rf_model.predict_proba(X_test_tfidf)[:, 1]

# Evaluate
rf_metrics = evaluate_model(y_test, y_pred_rf, y_pred_proba_rf, "Random Forest")
rf_metrics['train_time'] = rf_train_time
model_results['Random Forest'] = rf_metrics

### 6.4 LSTM Neural Network (Deep Learning)

LSTM (Long Short-Term Memory) networks can capture sequential patterns in text that traditional models might miss.
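
A small example shows what "sequential patterns" means in practice. The two reviews below are hypothetical: they contain exactly the same words, so a unigram bag-of-words representation cannot tell them apart, while the ordered token sequences that an LSTM consumes are different.

```python
from collections import Counter

# Two hypothetical reviews with the same words in a different order
review_a = "not good , actually bad"
review_b = "not bad , actually good"

bag_a = Counter(review_a.split())
bag_b = Counter(review_b.split())

# A unigram bag-of-words representation cannot tell the two apart...
print("Same unigram counts:", bag_a == bag_b)

# ...while the token sequences, which an LSTM reads in order, do differ
seq_a = review_a.split()
seq_b = review_b.split()
print("Same sequence:", seq_a == seq_b)
```

The bigrams enabled by `ngram_range=(1, 2)` in the TF-IDF step recover some of this local order ("not good" versus "not bad"), which is one reason linear models remain competitive on this task.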

In [None]:
if TENSORFLOW_AVAILABLE:
    # Prepare data for LSTM (requires different preprocessing)
    print("Preparing data for LSTM...")
    
    # Tokenize text for deep learning
    MAX_VOCAB_SIZE = 10000
    MAX_SEQUENCE_LENGTH = 200
    EMBEDDING_DIM = 128
    
    # Create tokenizer
    keras_tokenizer = Tokenizer(num_words=MAX_VOCAB_SIZE, oov_token='<OOV>')
    keras_tokenizer.fit_on_texts(train_df['cleaned_review'])
    
    # Convert text to sequences
    X_train_seq = keras_tokenizer.texts_to_sequences(train_df['cleaned_review'])
    X_test_seq = keras_tokenizer.texts_to_sequences(test_df['cleaned_review'])
    
    # Pad sequences to same length
    X_train_padded = pad_sequences(X_train_seq, maxlen=MAX_SEQUENCE_LENGTH, padding='post')
    X_test_padded = pad_sequences(X_test_seq, maxlen=MAX_SEQUENCE_LENGTH, padding='post')
    
    print(f"✓ Sequences prepared: {X_train_padded.shape}")
    
    # Build LSTM model
    print("\nBuilding LSTM model...")
    lstm_model = Sequential([
        # Embedding layer: converts word indices to dense vectors
        Embedding(input_dim=MAX_VOCAB_SIZE, output_dim=EMBEDDING_DIM, input_length=MAX_SEQUENCE_LENGTH),
        
        # LSTM layer: captures sequential patterns
        LSTM(64, dropout=0.2, recurrent_dropout=0.2),
        
        # Dense layers for classification
        Dense(32, activation='relu'),
        Dropout(0.5),
        Dense(1, activation='sigmoid')  # Binary classification
    ])
    
    # Compile model
    lstm_model.compile(
        optimizer='adam',
        loss='binary_crossentropy',
        metrics=['accuracy']
    )
    
    print("\nModel Architecture:")
    lstm_model.summary()
    
    # Train LSTM model
    print("\nTraining LSTM model...")
    print("Note: This will take 15-30 minutes depending on your hardware.")
    start_time = time()
    
    history = lstm_model.fit(
        X_train_padded, y_train,
        epochs=5,
        batch_size=128,
        validation_split=0.2,
        verbose=1
    )
    
    lstm_train_time = time() - start_time
    print(f"\n✓ Training completed in {lstm_train_time:.2f} seconds ({lstm_train_time/60:.1f} minutes)")
    
    # Make predictions
    y_pred_proba_lstm = lstm_model.predict(X_test_padded).flatten()
    y_pred_lstm = (y_pred_proba_lstm > 0.5).astype(int)
    
    # Evaluate
    lstm_metrics = evaluate_model(y_test, y_pred_lstm, y_pred_proba_lstm, "LSTM Neural Network")
    lstm_metrics['train_time'] = lstm_train_time
    model_results['LSTM'] = lstm_metrics
    
    # Plot training history
    fig, axes = plt.subplots(1, 2, figsize=(15, 5))
    
    # Accuracy plot
    axes[0].plot(history.history['accuracy'], label='Train Accuracy')
    axes[0].plot(history.history['val_accuracy'], label='Validation Accuracy')
    axes[0].set_xlabel('Epoch')
    axes[0].set_ylabel('Accuracy')
    axes[0].set_title('LSTM Training History - Accuracy')
    axes[0].legend()
    axes[0].grid(True)
    
    # Loss plot
    axes[1].plot(history.history['loss'], label='Train Loss')
    axes[1].plot(history.history['val_loss'], label='Validation Loss')
    axes[1].set_xlabel('Epoch')
    axes[1].set_ylabel('Loss')
    axes[1].set_title('LSTM Training History - Loss')
    axes[1].legend()
    axes[1].grid(True)
    
    plt.tight_layout()
    plt.show()

else:
    print("TensorFlow not available. Skipping LSTM model.")

## 7. Model Comparison

Let's compare all models side-by-side to determine which performs best.

In [None]:
# Create comparison DataFrame
comparison_df = pd.DataFrame(model_results).T
comparison_df = comparison_df.round(4)
comparison_df = comparison_df.sort_values('accuracy', ascending=False)

print("Model Comparison:")
print("=" * 80)
print(comparison_df)

# Visualize comparison
fig, axes = plt.subplots(1, 2, figsize=(15, 5))

# Performance metrics comparison
metrics_to_plot = ['accuracy', 'precision', 'recall', 'f1']
comparison_df[metrics_to_plot].plot(kind='bar', ax=axes[0])
axes[0].set_title('Model Performance Comparison', fontsize=14, fontweight='bold')
axes[0].set_ylabel('Score')
axes[0].set_xlabel('Model')
axes[0].set_ylim(0.7, 1.0)
axes[0].legend(title='Metric', loc='lower right')
axes[0].grid(axis='y', alpha=0.3)
plt.setp(axes[0].xaxis.get_majorticklabels(), rotation=45, ha='right')

# Training time comparison
comparison_df['train_time'].plot(kind='bar', ax=axes[1], color='coral')
axes[1].set_title('Training Time Comparison', fontsize=14, fontweight='bold')
axes[1].set_ylabel('Time (seconds)')
axes[1].set_xlabel('Model')
axes[1].grid(axis='y', alpha=0.3)
plt.setp(axes[1].xaxis.get_majorticklabels(), rotation=45, ha='right')

plt.tight_layout()
plt.show()

# Identify best model
best_model_name = comparison_df['accuracy'].idxmax()
print(f"\n🏆 Best Model: {best_model_name}")
print(f"   Accuracy: {comparison_df.loc[best_model_name, 'accuracy']:.4f}")
print(f"   F1-Score: {comparison_df.loc[best_model_name, 'f1']:.4f}")
print(f"   Training Time: {comparison_df.loc[best_model_name, 'train_time']:.2f}s")

## 8. Confusion Matrices

Visualize the confusion matrices to understand where each model makes errors.
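
To read the heatmaps correctly it helps to know the layout: rows are true labels and columns are predicted labels, so with labels 0 (negative) and 1 (positive) the table is `[[TN, FP], [FN, TP]]`. The sketch below builds that table by hand from hypothetical labels; `confusion_counts` is an illustrative helper that mirrors the layout `sklearn.metrics.confusion_matrix` returns for binary 0/1 labels.

```python
def confusion_counts(y_true, y_pred):
    """Return a 2x2 table [[TN, FP], [FN, TP]]: rows are true labels, columns predictions."""
    table = [[0, 0], [0, 0]]
    for truth, pred in zip(y_true, y_pred):
        table[truth][pred] += 1
    return table

# Hypothetical labels (0 = negative, 1 = positive)
toy_true = [0, 0, 0, 1, 1, 1, 1, 0]
toy_pred = [0, 1, 0, 1, 0, 1, 1, 0]

toy_cm = confusion_counts(toy_true, toy_pred)
(tn, fp), (fn, tp) = toy_cm
print(toy_cm)  # [[3, 1], [1, 3]]
print(f"TN={tn} FP={fp} FN={fn} TP={tp}")
```

The off-diagonal cells are the errors: the top-right cell counts negative reviews predicted as positive, and the bottom-left cell counts positive reviews predicted as negative.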

In [None]:
# Create confusion matrices for all models
models = [
    ('Naive Bayes', y_pred_nb),
    ('Logistic Regression', y_pred_lr),
    ('Random Forest', y_pred_rf)
]

if TENSORFLOW_AVAILABLE:
    models.append(('LSTM', y_pred_lstm))

num_models = len(models)
fig, axes = plt.subplots(1, num_models, figsize=(5*num_models, 4))
if num_models == 1:
    axes = [axes]

for idx, (model_name, y_pred) in enumerate(models):
    cm = confusion_matrix(y_test, y_pred)
    
    # Plot confusion matrix
    sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', ax=axes[idx],
                xticklabels=['Negative', 'Positive'],
                yticklabels=['Negative', 'Positive'])
    axes[idx].set_title(f'{model_name}\nConfusion Matrix', fontweight='bold')
    axes[idx].set_ylabel('True Label')
    axes[idx].set_xlabel('Predicted Label')

plt.tight_layout()
plt.show()

## 9. ROC Curves

ROC (Receiver Operating Characteristic) curves show the trade-off between true positive rate and false positive rate.
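
Two ideas sit behind the plot: each threshold on the predicted probability yields one (FPR, TPR) point, and the area under the resulting curve equals the probability that a randomly chosen positive example is scored higher than a randomly chosen negative one. The sketch below illustrates both on four hypothetical predictions; `pairwise_auc` and `roc_point` are illustrative helpers, not scikit-learn functions.

```python
def pairwise_auc(y_true, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

def roc_point(y_true, scores, threshold):
    """(FPR, TPR) when every score >= threshold is predicted positive."""
    tp = sum(1 for y, s in zip(y_true, scores) if y == 1 and s >= threshold)
    fp = sum(1 for y, s in zip(y_true, scores) if y == 0 and s >= threshold)
    return fp / y_true.count(0), tp / y_true.count(1)

# Hypothetical labels and predicted probabilities of the positive class
toy_labels = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]

print("AUC:", pairwise_auc(toy_labels, toy_scores))                    # 0.75
print("ROC point at 0.5:", roc_point(toy_labels, toy_scores, 0.5))    # (0.0, 0.5)
```

Three of the four positive/negative pairs are ordered correctly, giving an AUC of 0.75. The pairwise loop is quadratic and only suitable for small illustrations; `roc_auc_score` is the practical choice on the full test set.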

In [None]:
# Plot ROC curves for all models
plt.figure(figsize=(10, 8))

models_with_proba = [
    ('Naive Bayes', y_pred_proba_nb),
    ('Logistic Regression', y_pred_proba_lr),
    ('Random Forest', y_pred_proba_rf)
]

if TENSORFLOW_AVAILABLE:
    models_with_proba.append(('LSTM', y_pred_proba_lstm))

for model_name, y_pred_proba in models_with_proba:
    fpr, tpr, _ = roc_curve(y_test, y_pred_proba)
    roc_auc = auc(fpr, tpr)
    
    plt.plot(fpr, tpr, linewidth=2, label=f'{model_name} (AUC = {roc_auc:.4f})')

# Plot diagonal (random classifier)
plt.plot([0, 1], [0, 1], 'k--', linewidth=1, label='Random Classifier (AUC = 0.5000)')

plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate', fontsize=12)
plt.ylabel('True Positive Rate', fontsize=12)
plt.title('ROC Curves - Model Comparison', fontsize=14, fontweight='bold')
plt.legend(loc='lower right', fontsize=10)
plt.grid(alpha=0.3)
plt.tight_layout()
plt.show()

print("Interpretation:")
print("- The closer the curve to the top-left corner, the better the model")
print("- AUC (Area Under Curve) of 1.0 = perfect classifier")
print("- AUC of 0.5 = random guessing")

## 10. Error Analysis

Let's examine some misclassified examples to understand model limitations.

In [None]:
# Use best model for error analysis (Logistic Regression typically performs best)
# Add predictions to test dataframe
test_df['predicted_lr'] = y_pred_lr
test_df['predicted_proba_lr'] = y_pred_proba_lr
test_df['correct'] = test_df['label'] == test_df['predicted_lr']

# Identify misclassified examples
misclassified = test_df[~test_df['correct']].copy()
print(f"Total misclassified: {len(misclassified)} out of {len(test_df)} ({len(misclassified)/len(test_df)*100:.2f}%)")

# Separate false positives and false negatives
false_positives = misclassified[misclassified['label'] == 0]  # Predicted positive, actually negative
false_negatives = misclassified[misclassified['label'] == 1]  # Predicted negative, actually positive

print(f"\nFalse Positives: {len(false_positives)}")
print(f"False Negatives: {len(false_negatives)}")

In [None]:
# Show examples of false positives (model thinks negative review is positive)
print("Examples of False Positives (Actually Negative, Predicted Positive):")
print("=" * 80)

for idx, row in false_positives.head(3).iterrows():
    print(f"\nConfidence: {row['predicted_proba_lr']:.2f}")
    print(f"Review: {row['review'][:500]}...")
    print("-" * 80)

In [None]:
# Show examples of false negatives (model thinks positive review is negative)
print("Examples of False Negatives (Actually Positive, Predicted Negative):")
print("=" * 80)

for idx, row in false_negatives.head(3).iterrows():
    # predicted_proba_lr is P(positive); confidence in a negative prediction is its complement
    print(f"\nConfidence: {1 - row['predicted_proba_lr']:.2f}")
    print(f"Review: {row['review'][:500]}...")
    print("-" * 80)

## 11. Model Interpretability

Understand which features (words) are most important for predictions.
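
For a linear model such as logistic regression, the link between coefficients and a prediction is simple: each word contributes its coefficient times its feature value to the log-odds, and the sigmoid of the total gives the probability of the positive class. The sketch below works through one review; the coefficients, intercept, and TF-IDF values are all hypothetical, and `explain` is an illustrative helper.

```python
import math

# Hypothetical coefficients and TF-IDF values, for illustration only
toy_coefficients = {'great': 2.0, 'boring': -2.5, 'movie': 0.1}
toy_intercept = 0.0
toy_features = {'great': 0.6, 'movie': 0.3}   # TF-IDF weights of one review

def explain(features, coefficients, intercept):
    """Return per-word contributions to the log-odds and the resulting P(positive)."""
    contributions = {w: coefficients.get(w, 0.0) * v for w, v in features.items()}
    log_odds = intercept + sum(contributions.values())
    probability = 1.0 / (1.0 + math.exp(-log_odds))
    return contributions, probability

contribs, p_positive = explain(toy_features, toy_coefficients, toy_intercept)
for word, value in sorted(contribs.items(), key=lambda kv: abs(kv[1]), reverse=True):
    print(f"{word:>6}: {value:+.2f}")
print(f"P(positive) = {p_positive:.3f}")
```

Because the contribution depends on the feature value as well as the coefficient, a word with a large coefficient only matters for reviews in which it actually appears.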

In [None]:
# Extract feature importance from Logistic Regression
# Coefficients indicate how strongly a feature contributes to each class
feature_names = np.array(tfidf_vectorizer.get_feature_names_out())
coefficients = lr_model.coef_[0]

# Top features for positive sentiment (highest positive coefficients)
top_positive_indices = np.argsort(coefficients)[-20:][::-1]
top_positive_features = feature_names[top_positive_indices]
top_positive_coeffs = coefficients[top_positive_indices]

# Top features for negative sentiment (lowest/most negative coefficients)
top_negative_indices = np.argsort(coefficients)[:20]
top_negative_features = feature_names[top_negative_indices]
top_negative_coeffs = coefficients[top_negative_indices]

# Visualize top features
fig, axes = plt.subplots(1, 2, figsize=(16, 8))

# Positive features
axes[0].barh(range(len(top_positive_features)), top_positive_coeffs, color='green')
axes[0].set_yticks(range(len(top_positive_features)))
axes[0].set_yticklabels(top_positive_features)
axes[0].set_xlabel('Coefficient Value', fontsize=12)
axes[0].set_title('Top 20 Features for Positive Sentiment', fontsize=14, fontweight='bold')
axes[0].invert_yaxis()
axes[0].grid(axis='x', alpha=0.3)

# Negative features
axes[1].barh(range(len(top_negative_features)), top_negative_coeffs, color='red')
axes[1].set_yticks(range(len(top_negative_features)))
axes[1].set_yticklabels(top_negative_features)
axes[1].set_xlabel('Coefficient Value', fontsize=12)
axes[1].set_title('Top 20 Features for Negative Sentiment', fontsize=14, fontweight='bold')
axes[1].invert_yaxis()
axes[1].grid(axis='x', alpha=0.3)

plt.tight_layout()
plt.show()

print("\nInterpretation:")
print("- Higher positive coefficients = stronger indicator of positive sentiment")
print("- More negative coefficients = stronger indicator of negative sentiment")
print("- Notice how bigrams (two-word phrases) can be very informative!")

## 12. Making Predictions on New Reviews

Let's test our best model on custom reviews.

In [None]:
def predict_sentiment(review_text, model=lr_model, vectorizer=tfidf_vectorizer):
    """
    Predict sentiment for a new review.
    
    Parameters:
    -----------
    review_text : str
        Raw review text
    model : sklearn model
        Trained classifier
    vectorizer : sklearn vectorizer
        Fitted TF-IDF vectorizer
    
    Returns:
    --------
    dict with sentiment, confidence, and processed text
    """
    # Preprocess
    cleaned = preprocess_text(review_text)
    
    # Vectorize
    features = vectorizer.transform([cleaned])
    
    # Predict
    prediction = model.predict(features)[0]
    proba = model.predict_proba(features)[0]
    
    sentiment = 'positive' if prediction == 1 else 'negative'
    confidence = proba[prediction]
    
    return {
        'sentiment': sentiment,
        'confidence': confidence,
        'cleaned_text': cleaned,
        'probabilities': {'negative': proba[0], 'positive': proba[1]}
    }

# Test on custom reviews
test_reviews = [
    "This movie was absolutely fantastic! The acting was superb and the plot kept me engaged throughout.",
    "Terrible waste of time. Poor acting, boring story, and awful special effects.",
    "It was okay. Nothing special but not terrible either. Average movie overall.",
    "One of the best films I've ever seen! Masterpiece of cinema. Highly recommended!",
    "I fell asleep halfway through. Incredibly dull and predictable."
]

print("Predictions on Custom Reviews:")
print("=" * 80)

for review in test_reviews:
    result = predict_sentiment(review)
    print(f"\nReview: {review}")
    print(f"Predicted Sentiment: {result['sentiment'].upper()}")
    print(f"Confidence: {result['confidence']:.2%}")
    print(f"Probabilities: Negative={result['probabilities']['negative']:.2%}, "
          f"Positive={result['probabilities']['positive']:.2%}")
    print("-" * 80)

## 13. Saving the Model for Deployment

Export the best model and preprocessing artifacts for production use.

In [None]:
# Create models directory if it doesn't exist
models_dir = Path('models')
models_dir.mkdir(exist_ok=True)

# Save Logistic Regression model (best balance of performance and speed)
with open(models_dir / 'logistic_regression_model.pkl', 'wb') as f:
    pickle.dump(lr_model, f)
print("✓ Saved Logistic Regression model")

# Save TF-IDF vectorizer (needed to transform new reviews into features)
with open(models_dir / 'tfidf_vectorizer.pkl', 'wb') as f:
    pickle.dump(tfidf_vectorizer, f)
print("✓ Saved TF-IDF vectorizer")

# Save preprocessing function details
preprocessing_config = {
    'use_stemming': False,
    'use_lemmatization': True,
    'remove_stopwords': True,
    'stopwords': list(stop_words)
}
with open(models_dir / 'preprocessing_config.pkl', 'wb') as f:
    pickle.dump(preprocessing_config, f)
print("✓ Saved preprocessing configuration")

# Save LSTM model if trained
if TENSORFLOW_AVAILABLE and 'LSTM' in model_results:
    lstm_model.save(models_dir / 'lstm_model.h5')
    print("✓ Saved LSTM model")
    
    with open(models_dir / 'keras_tokenizer.pkl', 'wb') as f:
        pickle.dump(keras_tokenizer, f)
    print("✓ Saved Keras tokenizer")

print(f"\nAll models saved to: {models_dir.absolute()}")
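
To use the exported files in another session, they have to be read back in the same way they were written. The following is a minimal sketch, assuming the three file names used in the export cell; `load_artifacts` is a hypothetical helper name, not part of the notebook.

```python
import pickle
from pathlib import Path


def load_artifacts(models_dir="models"):
    """Load the pickled model, vectorizer, and preprocessing config.

    The file names are assumed to match the ones written by the export cell.
    Only unpickle files you created yourself: pickle can execute arbitrary
    code while loading.
    """
    models_dir = Path(models_dir)
    artifacts = {}
    for key, filename in [
        ("model", "logistic_regression_model.pkl"),
        ("vectorizer", "tfidf_vectorizer.pkl"),
        ("config", "preprocessing_config.pkl"),
    ]:
        with open(models_dir / filename, "rb") as f:
            artifacts[key] = pickle.load(f)
    return artifacts
```

After `artifacts = load_artifacts()`, new text would go through the same steps as in training: clean it with `preprocess_text`, call `artifacts["vectorizer"].transform([cleaned])`, then `artifacts["model"].predict(...)`. Loading generally works best with the same scikit-learn version that saved the files.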

## 14. Summary and Key Takeaways

### What We Accomplished

1. **Data Loading**: Successfully loaded 50,000 IMDB movie reviews
2. **Text Preprocessing**: Built a comprehensive NLP pipeline (cleaning, tokenization, lemmatization)
3. **Feature Engineering**: Created TF-IDF features with unigrams and bigrams
4. **Model Training**: Trained and compared 4 different models:
   - Naive Bayes: Fast baseline (83-85% accuracy)
   - Logistic Regression: Best balance (87-89% accuracy)
   - Random Forest: Good but slower (84-86% accuracy)
   - LSTM: Deep learning sequence model (88-90% accuracy)
5. **Evaluation**: Used multiple metrics (accuracy, precision, recall, F1, ROC-AUC)
6. **Error Analysis**: Examined misclassified examples to understand limitations
7. **Interpretability**: Identified most influential words for each sentiment
8. **Deployment**: Exported models for production use
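
As a reminder of how the threshold-based metrics in step 5 relate to each other, here is a small pure-Python sketch that derives them from the confusion counts. The function name and the toy labels are illustrative assumptions; label `1` is taken to mean positive, as in `predict_sentiment`. ROC-AUC is left out because it needs predicted scores rather than hard labels.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, precision, recall, and F1 for binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)

    accuracy = (tp + tn) / len(pairs)
    # Guard against division by zero when a class is never predicted or present
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Toy example: 8 reviews, one false negative and one false positive
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 0, 1, 1, 0]
metrics = classification_metrics(y_true, y_pred)
print(metrics)
```

With these toy labels there are 3 true positives, 3 true negatives, 1 false positive, and 1 false negative, so all four metrics come out to 0.75.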

### Key Insights

**Best Model**: Logistic Regression offers the best trade-off between:
- Performance (87-89% accuracy)
- Training speed (2-5 minutes)
- Inference speed (milliseconds per prediction)
- Interpretability (clear feature weights)

**Important Features**:
- Bigrams (two-word phrases) are highly informative
- Words like "excellent", "wonderful", "great" strongly indicate positive sentiment
- Words like "waste", "worst", "terrible" strongly indicate negative sentiment

**Common Errors**:
- Sarcastic reviews (positive words used ironically)
- Mixed sentiment reviews (both positive and negative aspects)
- Reviews with complex language or rare vocabulary

### Next Steps for Improvement

1. **Try advanced models**:
   - BERT or other transformer models (Hugging Face)
   - Pre-trained word embeddings (GloVe, Word2Vec)

2. **Feature engineering**:
   - Sentiment lexicons (VADER, TextBlob)
   - Part-of-speech tags
   - Negation handling ("not good" vs "good")

3. **Hyperparameter tuning**:
   - GridSearchCV or RandomizedSearchCV
   - Cross-validation for robust evaluation

4. **Deployment**:
   - Build REST API with FastAPI (see api.py)
   - Deploy to cloud (AWS, GCP, Azure)
   - Add monitoring and logging

5. **Business applications**:
   - Real-time sentiment monitoring dashboard
   - Product review analysis
   - Social media sentiment tracking
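
The negation-handling idea from step 2 can be sketched with a common heuristic: prefix every token that follows a negation word with `NOT_` until the next punctuation mark, so that "not good" yields a feature distinct from "good". This is a minimal illustration, not the notebook's pipeline; `mark_negation`, the negation word list, and the regex tokenizer are all assumptions chosen to keep the example self-contained.

```python
import re

# Illustrative, non-exhaustive list of negation cues
NEGATIONS = {"not", "no", "never", "nor", "cannot"}
# Punctuation that ends the scope of a negation
CLAUSE_END = {".", ",", ";", ":", "!", "?"}


def mark_negation(text):
    """Return lowercase tokens, with negated tokens prefixed by 'NOT_'."""
    # Words (optionally with an apostrophe suffix, e.g. "didn't") or punctuation
    tokens = re.findall(r"[a-z]+(?:'[a-z]+)?|[.,;:!?]", text.lower())
    marked = []
    negating = False
    for tok in tokens:
        if tok in CLAUSE_END:
            negating = False  # negation scope ends at punctuation
            continue          # punctuation itself is dropped
        if tok in NEGATIONS or tok.endswith("n't"):
            negating = True
            marked.append(tok)
        elif negating:
            marked.append("NOT_" + tok)
        else:
            marked.append(tok)
    return marked


print(mark_negation("The plot was not good, but the acting was great."))
print(mark_negation("I didn't like it!"))
```

A step like this would likely need to run before stopword removal, since NLTK's English stopword list includes "not" and would otherwise discard the cue before it can be used.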

### What You Learned

✅ Text preprocessing for NLP tasks  
✅ TF-IDF feature engineering  
✅ Training multiple ML models for comparison  
✅ Evaluating classifiers with proper metrics  
✅ Error analysis and model interpretation  
✅ Saving models for production deployment  

**Congratulations!** You've built a complete sentiment analysis system from scratch.

---

## Additional Resources

- [NLTK Book](https://www.nltk.org/book/)
- [Scikit-learn Text Classification](https://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html)
- [TensorFlow Text Classification Tutorial](https://www.tensorflow.org/tutorials/keras/text_classification)
- [Original IMDB Dataset Paper](http://ai.stanford.edu/~amaas/papers/wvSent_acl2011.pdf)

---

**Next Project**: [Project 07 - Credit Card Fraud Detection](../07_credit_card_fraud/)