# Text Classification: Naive Bayes vs LSTM Comparison

**Course:** BUS 405 - Foundations of Big Data Analytics  
**Topic:** Comparative Analysis of Traditional ML and Deep Learning for Text Classification

---

## Objectives
1. Implement text classification using **Multinomial Naive Bayes** (traditional ML approach)
2. Implement text classification using **LSTM** (deep learning approach)
3. Compare performance metrics, training time, and model characteristics
4. Understand when to use each approach

## Dataset
We'll use the **SMS Spam Collection Dataset**, a real-world dataset of 5,574 SMS messages labeled as 'spam' or 'ham' (legitimate). It is hosted by the UCI Machine Learning Repository, which the download cell in Part 1 uses, and is mirrored on Kaggle.

Dataset source: https://www.kaggle.com/datasets/uciml/sms-spam-collection-dataset

## Part 1: Setup and Data Loading

In [None]:
# Install required packages (uncomment if needed)
# !pip install pandas numpy scikit-learn tensorflow matplotlib seaborn

In [None]:
# Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import time
import warnings
warnings.filterwarnings('ignore')

# Scikit-learn imports for Naive Bayes
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import (accuracy_score, precision_score, recall_score, 
                             f1_score, confusion_matrix, classification_report,
                             roc_curve, auc)

# TensorFlow/Keras imports for LSTM
import tensorflow as tf
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense, Dropout, Bidirectional
from tensorflow.keras.callbacks import EarlyStopping

# Set random seeds for reproducibility
np.random.seed(42)
tf.random.set_seed(42)

print("TensorFlow version:", tf.__version__)
print("All libraries imported successfully!")

In [None]:
# Download the SMS Spam Collection dataset
# Method 1: Direct download from UCI ML Repository
import urllib.request
import zipfile
import os

# Download dataset
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/00228/smsspamcollection.zip"
filename = "smsspamcollection.zip"

if not os.path.exists('SMSSpamCollection'):
    print("Downloading SMS Spam Collection dataset...")
    urllib.request.urlretrieve(url, filename)
    
    # Extract the zip file
    with zipfile.ZipFile(filename, 'r') as zip_ref:
        zip_ref.extractall('.')
    print("Dataset downloaded and extracted!")
else:
    print("Dataset already exists.")

In [None]:
# Load the dataset
df = pd.read_csv('SMSSpamCollection', sep='\t', header=None, names=['label', 'message'])

print(f"Dataset Shape: {df.shape}")
print(f"\nFirst 5 rows:")
df.head()

## Part 2: Exploratory Data Analysis

In [None]:
# Basic statistics
print("Dataset Information:")
print("="*50)
print(f"Total messages: {len(df)}")
print(f"\nLabel Distribution:")
print(df['label'].value_counts())
print(f"\nLabel Percentage:")
print(df['label'].value_counts(normalize=True) * 100)

In [None]:
# Visualize class distribution
fig, axes = plt.subplots(1, 2, figsize=(12, 4))

# Bar plot
colors = ['#2ecc71', '#e74c3c']
df['label'].value_counts().plot(kind='bar', ax=axes[0], color=colors)
axes[0].set_title('Class Distribution (Count)', fontsize=12, fontweight='bold')
axes[0].set_xlabel('Label')
axes[0].set_ylabel('Count')
axes[0].tick_params(axis='x', rotation=0)

# Pie chart
df['label'].value_counts().plot(kind='pie', ax=axes[1], autopct='%1.1f%%', 
                                 colors=colors, explode=(0, 0.1))
axes[1].set_title('Class Distribution (Percentage)', fontsize=12, fontweight='bold')
axes[1].set_ylabel('')

plt.tight_layout()
plt.show()

print("\n📊 Observation: The dataset is imbalanced with ~87% ham and ~13% spam messages.")

In [None]:
# Analyze message length
df['message_length'] = df['message'].apply(len)
df['word_count'] = df['message'].apply(lambda x: len(x.split()))

print("Message Length Statistics:")
print("="*50)
print(df.groupby('label')[['message_length', 'word_count']].describe())

In [None]:
# Visualize message length distribution
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Message length distribution
for label, color in zip(['ham', 'spam'], colors):
    subset = df[df['label'] == label]
    axes[0].hist(subset['message_length'], bins=50, alpha=0.7, label=label, color=color)
axes[0].set_title('Distribution of Message Length (Characters)', fontsize=12, fontweight='bold')
axes[0].set_xlabel('Message Length')
axes[0].set_ylabel('Frequency')
axes[0].legend()

# Word count distribution
for label, color in zip(['ham', 'spam'], colors):
    subset = df[df['label'] == label]
    axes[1].hist(subset['word_count'], bins=30, alpha=0.7, label=label, color=color)
axes[1].set_title('Distribution of Word Count', fontsize=12, fontweight='bold')
axes[1].set_xlabel('Word Count')
axes[1].set_ylabel('Frequency')
axes[1].legend()

plt.tight_layout()
plt.show()

print("\n📊 Observation: Spam messages tend to be longer than ham messages.")

In [None]:
# Sample messages
print("Sample HAM messages:")
print("-"*50)
for msg in df[df['label']=='ham']['message'].sample(3, random_state=42).values:
    print(f"• {msg[:100]}..." if len(msg) > 100 else f"• {msg}")

print("\nSample SPAM messages:")
print("-"*50)
for msg in df[df['label']=='spam']['message'].sample(3, random_state=42).values:
    print(f"• {msg[:100]}..." if len(msg) > 100 else f"• {msg}")

## Part 3: Data Preprocessing

In [None]:
import re

def preprocess_text(text):
    """
    Preprocess text data:
    1. Convert to lowercase
    2. Remove special characters and digits
    3. Remove extra whitespace
    """
    # Convert to lowercase
    text = text.lower()
    
    # Remove special characters and digits
    text = re.sub(r'[^a-zA-Z\s]', '', text)
    
    # Remove extra whitespace
    text = ' '.join(text.split())
    
    return text

# Apply preprocessing
df['cleaned_message'] = df['message'].apply(preprocess_text)

# Convert labels to binary
df['label_encoded'] = df['label'].map({'ham': 0, 'spam': 1})

print("Preprocessing completed!")
print("\nSample before and after:")
print("-"*50)
for i in [0, 1, 2]:
    print(f"Original: {df['message'].iloc[i][:60]}...")
    print(f"Cleaned:  {df['cleaned_message'].iloc[i][:60]}...")
    print()

In [None]:
# Split data into training and testing sets
X = df['cleaned_message']
y = df['label_encoded']

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print(f"Training set size: {len(X_train)}")
print(f"Testing set size: {len(X_test)}")
print(f"\nTraining label distribution:")
print(y_train.value_counts())
print(f"\nTesting label distribution:")
print(y_test.value_counts())

---

## Part 4: Naive Bayes Classification

### Theory
**Multinomial Naive Bayes** is a probabilistic classifier based on Bayes' theorem with the "naive" assumption that features are conditionally independent given the class label.

$$P(y|x_1, x_2, ..., x_n) = \frac{P(y) \prod_{i=1}^{n} P(x_i|y)}{P(x_1, x_2, ..., x_n)}$$

For text classification:
- **Features**: Word frequencies (Bag of Words) or TF-IDF scores
- **Advantages**: Fast, works well with small datasets, interpretable
- **Limitations**: Independence assumption, doesn't capture word order
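
To make the formula concrete, here is a minimal sketch of the same calculation in plain Python. The toy messages, the helper names `fit_multinomial_nb` and `predict_proba`, and the choice to skip unseen words are all illustrative assumptions; this is not how scikit-learn implements `MultinomialNB` internally.

```python
import math
from collections import Counter

# Toy training data (hypothetical, not taken from the SMS dataset)
train = [
    ("win free prize now", "spam"),
    ("free entry win cash", "spam"),
    ("are we meeting for lunch", "ham"),
    ("see you at lunch tomorrow", "ham"),
    ("call me when you are free", "ham"),
]

def fit_multinomial_nb(docs, alpha=1.0):
    """Estimate log priors P(y) and smoothed log likelihoods P(x_i | y)."""
    class_docs = Counter(label for _, label in docs)
    word_counts = {label: Counter() for label in class_docs}
    for text, label in docs:
        word_counts[label].update(text.split())
    vocab = sorted({w for counts in word_counts.values() for w in counts})
    log_prior = {c: math.log(n / len(docs)) for c, n in class_docs.items()}
    log_lik = {}
    for c, counts in word_counts.items():
        # Additive (Laplace) smoothing: add alpha to every word count
        total = sum(counts.values()) + alpha * len(vocab)
        log_lik[c] = {w: math.log((counts[w] + alpha) / total) for w in vocab}
    return log_prior, log_lik

def predict_proba(text, log_prior, log_lik):
    """Return P(class | text); words outside the vocabulary are skipped."""
    scores = {}
    for c in log_prior:
        scores[c] = log_prior[c] + sum(
            log_lik[c][w] for w in text.split() if w in log_lik[c]
        )
    # Normalizing the scores plays the role of the denominator P(x_1, ..., x_n)
    m = max(scores.values())
    z = sum(math.exp(s - m) for s in scores.values())
    return {c: math.exp(s - m) / z for c, s in scores.items()}

log_prior, log_lik = fit_multinomial_nb(train)
posterior = predict_proba("free prize", log_prior, log_lik)
print(posterior)
```

Working in log space avoids numerical underflow when many small probabilities are multiplied, which is why the product in the formula becomes a sum in the code.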

In [None]:
# Method 1: Bag of Words with Naive Bayes
print("="*60)
print("NAIVE BAYES WITH BAG OF WORDS (CountVectorizer)")
print("="*60)

# Create CountVectorizer (Bag of Words)
count_vectorizer = CountVectorizer(max_features=5000, stop_words='english')

# Transform text to feature vectors
X_train_bow = count_vectorizer.fit_transform(X_train)
X_test_bow = count_vectorizer.transform(X_test)

print(f"Vocabulary size: {len(count_vectorizer.vocabulary_)}")
print(f"Feature matrix shape (train): {X_train_bow.shape}")
print(f"Feature matrix shape (test): {X_test_bow.shape}")

In [None]:
# Train Naive Bayes model with BOW
start_time = time.time()

nb_bow_model = MultinomialNB(alpha=1.0)  # alpha is Laplace smoothing parameter
nb_bow_model.fit(X_train_bow, y_train)

nb_bow_train_time = time.time() - start_time
print(f"Training time: {nb_bow_train_time:.4f} seconds")

# Predictions
start_time = time.time()
y_pred_nb_bow = nb_bow_model.predict(X_test_bow)
y_pred_proba_nb_bow = nb_bow_model.predict_proba(X_test_bow)[:, 1]
nb_bow_inference_time = time.time() - start_time
print(f"Inference time: {nb_bow_inference_time:.4f} seconds")

In [None]:
# Method 2: TF-IDF with Naive Bayes
print("\n" + "="*60)
print("NAIVE BAYES WITH TF-IDF")
print("="*60)

# Create TF-IDF Vectorizer
tfidf_vectorizer = TfidfVectorizer(max_features=5000, stop_words='english')

# Transform text to TF-IDF features
X_train_tfidf = tfidf_vectorizer.fit_transform(X_train)
X_test_tfidf = tfidf_vectorizer.transform(X_test)

print(f"Feature matrix shape (train): {X_train_tfidf.shape}")

# Train Naive Bayes with TF-IDF
start_time = time.time()
nb_tfidf_model = MultinomialNB(alpha=1.0)
nb_tfidf_model.fit(X_train_tfidf, y_train)
nb_tfidf_train_time = time.time() - start_time
print(f"Training time: {nb_tfidf_train_time:.4f} seconds")

# Predictions
start_time = time.time()
y_pred_nb_tfidf = nb_tfidf_model.predict(X_test_tfidf)
y_pred_proba_nb_tfidf = nb_tfidf_model.predict_proba(X_test_tfidf)[:, 1]
nb_tfidf_inference_time = time.time() - start_time
print(f"Inference time: {nb_tfidf_inference_time:.4f} seconds")

In [None]:
# Evaluate Naive Bayes models
def evaluate_model(y_true, y_pred, model_name):
    """Calculate and display evaluation metrics"""
    accuracy = accuracy_score(y_true, y_pred)
    precision = precision_score(y_true, y_pred)
    recall = recall_score(y_true, y_pred)
    f1 = f1_score(y_true, y_pred)
    
    print(f"\n{model_name} Results:")
    print("-"*40)
    print(f"Accuracy:  {accuracy:.4f}")
    print(f"Precision: {precision:.4f}")
    print(f"Recall:    {recall:.4f}")
    print(f"F1-Score:  {f1:.4f}")
    
    return {'accuracy': accuracy, 'precision': precision, 'recall': recall, 'f1': f1}

# Evaluate both NB models
nb_bow_metrics = evaluate_model(y_test, y_pred_nb_bow, "Naive Bayes (BOW)")
nb_tfidf_metrics = evaluate_model(y_test, y_pred_nb_tfidf, "Naive Bayes (TF-IDF)")
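
The four metrics reported by `evaluate_model` can all be derived from the confusion matrix counts. The sketch below shows the arithmetic with spam as the positive class; the function name `binary_metrics` and the counts are hypothetical and chosen only for illustration, not results from this notebook.

```python
def binary_metrics(tn, fp, fn, tp):
    """Compute accuracy, precision, recall and F1 from confusion matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    # Precision: of the messages flagged as spam, how many really were spam?
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    # Recall: of the real spam messages, how many were caught?
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1: harmonic mean of precision and recall
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

# Hypothetical counts for illustration only
m = binary_metrics(tn=960, fp=6, fn=19, tp=130)
for name, value in m.items():
    print(f"{name:9s}: {value:.4f}")
```

On an imbalanced dataset like this one, accuracy alone can look strong even when recall on the spam class is weak, so precision, recall and F1 are worth reading together.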

In [None]:
# Confusion matrices for Naive Bayes
fig, axes = plt.subplots(1, 2, figsize=(12, 5))

# BOW confusion matrix
cm_bow = confusion_matrix(y_test, y_pred_nb_bow)
sns.heatmap(cm_bow, annot=True, fmt='d', cmap='Blues', ax=axes[0],
            xticklabels=['Ham', 'Spam'], yticklabels=['Ham', 'Spam'])
axes[0].set_title('Naive Bayes (BOW)\nConfusion Matrix', fontsize=12, fontweight='bold')
axes[0].set_xlabel('Predicted')
axes[0].set_ylabel('Actual')

# TF-IDF confusion matrix
cm_tfidf = confusion_matrix(y_test, y_pred_nb_tfidf)
sns.heatmap(cm_tfidf, annot=True, fmt='d', cmap='Blues', ax=axes[1],
            xticklabels=['Ham', 'Spam'], yticklabels=['Ham', 'Spam'])
axes[1].set_title('Naive Bayes (TF-IDF)\nConfusion Matrix', fontsize=12, fontweight='bold')
axes[1].set_xlabel('Predicted')
axes[1].set_ylabel('Actual')

plt.tight_layout()
plt.show()

In [None]:
# Feature importance analysis for Naive Bayes
def get_top_features(model, vectorizer, n=10):
    """Get top features for spam classification"""
    feature_names = vectorizer.get_feature_names_out()
    
    # Log probabilities for each class
    log_prob_spam = model.feature_log_prob_[1]  # Spam class
    log_prob_ham = model.feature_log_prob_[0]   # Ham class
    
    # Difference indicates importance for spam detection
    spam_importance = log_prob_spam - log_prob_ham
    
    # Get top spam indicators
    top_spam_idx = spam_importance.argsort()[-n:][::-1]
    top_ham_idx = spam_importance.argsort()[:n]
    
    return ([(feature_names[i], spam_importance[i]) for i in top_spam_idx],
            [(feature_names[i], spam_importance[i]) for i in top_ham_idx])

top_spam, top_ham = get_top_features(nb_tfidf_model, tfidf_vectorizer, n=15)

print("Top 15 Spam Indicator Words:")
print("-"*40)
for word, score in top_spam:
    print(f"  {word:15s} : {score:.4f}")

print("\nTop 15 Ham Indicator Words:")
print("-"*40)
for word, score in top_ham:
    print(f"  {word:15s} : {score:.4f}")

---

## Part 5: LSTM Classification

### Theory
**Long Short-Term Memory (LSTM)** is a type of Recurrent Neural Network (RNN) designed to capture long-range dependencies in sequences.

Key components:
- **Forget Gate**: Decides what information to discard
- **Input Gate**: Decides what new information to store
- **Output Gate**: Decides what to output

**Advantages**: 
- Captures sequential patterns and word order
- Handles variable-length inputs
- Can learn complex patterns

**Limitations**: 
- Requires more data and computational resources
- Longer training time
- Less interpretable
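
The gate descriptions above can be sketched as a single time step of a one-unit LSTM cell with scalar inputs. This is a simplified illustration in plain Python: the weight names in `params` and their values are hypothetical, and real Keras layers use weight matrices over vectors rather than scalars.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x, h_prev, c_prev, p):
    """One time step of a single-unit LSTM cell with scalar input x."""
    f = sigmoid(p["wf"] * x + p["uf"] * h_prev + p["bf"])    # forget gate
    i = sigmoid(p["wi"] * x + p["ui"] * h_prev + p["bi"])    # input gate
    o = sigmoid(p["wo"] * x + p["uo"] * h_prev + p["bo"])    # output gate
    g = math.tanh(p["wg"] * x + p["ug"] * h_prev + p["bg"])  # candidate value
    c = f * c_prev + i * g   # keep part of the old memory, add part of the new
    h = o * math.tanh(c)     # expose part of the memory as the hidden state
    return h, c

# Hypothetical weights chosen only for illustration
params = {
    "wf": 0.5, "uf": 0.1, "bf": 1.0,
    "wi": 0.4, "ui": 0.2, "bi": 0.0,
    "wo": 0.3, "uo": 0.1, "bo": 0.0,
    "wg": 0.8, "ug": 0.5, "bg": 0.0,
}

h, c = 0.0, 0.0
for x in [0.5, -1.0, 0.25]:  # a toy input sequence
    h, c = lstm_step(x, h, c, params)
    print(f"x={x:+.2f}  h={h:+.4f}  c={c:+.4f}")
```

The cell state `c` is what lets information persist across many steps: when the forget gate stays close to 1, the old value is carried forward almost unchanged.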

In [None]:
# Prepare data for LSTM
print("="*60)
print("LSTM DATA PREPARATION")
print("="*60)

# Hyperparameters
MAX_VOCAB_SIZE = 10000  # Maximum vocabulary size
MAX_SEQUENCE_LENGTH = 100  # Maximum sequence length
EMBEDDING_DIM = 100  # Embedding dimension

# Tokenize text
tokenizer = Tokenizer(num_words=MAX_VOCAB_SIZE, oov_token='<OOV>')
tokenizer.fit_on_texts(X_train)

# Convert text to sequences
X_train_seq = tokenizer.texts_to_sequences(X_train)
X_test_seq = tokenizer.texts_to_sequences(X_test)

# Pad sequences to same length
X_train_padded = pad_sequences(X_train_seq, maxlen=MAX_SEQUENCE_LENGTH, 
                                padding='post', truncating='post')
X_test_padded = pad_sequences(X_test_seq, maxlen=MAX_SEQUENCE_LENGTH, 
                               padding='post', truncating='post')

# Convert labels to numpy arrays
y_train_lstm = y_train.values
y_test_lstm = y_test.values

print(f"Vocabulary size: {len(tokenizer.word_index)}")
print(f"Training sequences shape: {X_train_padded.shape}")
print(f"Testing sequences shape: {X_test_padded.shape}")

In [None]:
# Example: How tokenization works
sample_text = X_train.iloc[0]
sample_seq = X_train_seq[0]

print("Tokenization Example:")
print("-"*50)
print(f"Original text: {sample_text}")
print(f"\nToken IDs: {sample_seq[:20]}...")
print(f"\nPadded sequence (first 20): {X_train_padded[0][:20]}")

# Reverse mapping
reverse_word_index = {v: k for k, v in tokenizer.word_index.items()}
decoded = ' '.join([reverse_word_index.get(idx, '?') for idx in sample_seq[:10]])
print(f"\nDecoded (first 10 tokens): {decoded}")
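
To see what `padding='post'` and `truncating='post'` do without running Keras, the following plain-Python sketch mimics the behavior on a tiny vocabulary. The helpers `encode` and `pad_post` and the `word_index` contents are hypothetical; the sketch assumes, as in the Keras `Tokenizer`, that index 0 is reserved for padding and that the OOV token receives index 1.

```python
def encode(text, word_index, oov_id=1):
    """Map each word to its id, falling back to the OOV id for unknown words."""
    return [word_index.get(w, oov_id) for w in text.split()]

def pad_post(seq, maxlen, pad_value=0):
    """Keep the first maxlen ids, then pad on the right up to maxlen."""
    trimmed = seq[:maxlen]
    return trimmed + [pad_value] * (maxlen - len(trimmed))

# Hypothetical vocabulary for illustration only
word_index = {"<OOV>": 1, "free": 2, "call": 3, "now": 4}

ids = encode("call now for free prize", word_index)
padded = pad_post(ids, maxlen=8)
truncated = pad_post(ids, maxlen=3)

print(ids)        # [3, 4, 1, 2, 1]
print(padded)     # [3, 4, 1, 2, 1, 0, 0, 0]
print(truncated)  # [3, 4, 1]
```

With post-truncation, anything beyond the first `maxlen` tokens is dropped, so a very long message loses its ending rather than its beginning.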

In [None]:
# Build LSTM Model
print("\n" + "="*60)
print("LSTM MODEL ARCHITECTURE")
print("="*60)

def build_lstm_model(vocab_size, embedding_dim, max_length):
    """
    Build LSTM model for binary text classification
    """
    model = Sequential([
        # Embedding layer: converts word indices to dense vectors
        Embedding(input_dim=vocab_size, 
                  output_dim=embedding_dim, 
                  input_length=max_length,
                  name='embedding'),
        
        # LSTM layer with dropout for regularization
        LSTM(64, return_sequences=True, dropout=0.2, name='lstm_1'),
        LSTM(32, dropout=0.2, name='lstm_2'),
        
        # Dense layers
        Dense(64, activation='relu', name='dense_1'),
        Dropout(0.5, name='dropout'),
        
        # Output layer
        Dense(1, activation='sigmoid', name='output')
    ])
    
    model.compile(
        optimizer='adam',
        loss='binary_crossentropy',
        metrics=['accuracy']
    )
    
    return model

# Build model
lstm_model = build_lstm_model(
    vocab_size=MAX_VOCAB_SIZE,
    embedding_dim=EMBEDDING_DIM,
    max_length=MAX_SEQUENCE_LENGTH
)

# Display model summary
lstm_model.summary()

In [None]:
# Train LSTM Model
print("\n" + "="*60)
print("TRAINING LSTM MODEL")
print("="*60)

# Early stopping to prevent overfitting
early_stopping = EarlyStopping(
    monitor='val_loss',
    patience=3,
    restore_best_weights=True,
    verbose=1
)

# Train the model
start_time = time.time()

history = lstm_model.fit(
    X_train_padded, y_train_lstm,
    epochs=10,
    batch_size=32,
    validation_split=0.2,
    callbacks=[early_stopping],
    verbose=1
)

lstm_train_time = time.time() - start_time
print(f"\nTotal training time: {lstm_train_time:.2f} seconds")

In [None]:
# Plot training history
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Accuracy plot
axes[0].plot(history.history['accuracy'], label='Training Accuracy', linewidth=2)
axes[0].plot(history.history['val_accuracy'], label='Validation Accuracy', linewidth=2)
axes[0].set_title('LSTM Model Accuracy', fontsize=12, fontweight='bold')
axes[0].set_xlabel('Epoch')
axes[0].set_ylabel('Accuracy')
axes[0].legend()
axes[0].grid(True, alpha=0.3)

# Loss plot
axes[1].plot(history.history['loss'], label='Training Loss', linewidth=2)
axes[1].plot(history.history['val_loss'], label='Validation Loss', linewidth=2)
axes[1].set_title('LSTM Model Loss', fontsize=12, fontweight='bold')
axes[1].set_xlabel('Epoch')
axes[1].set_ylabel('Loss')
axes[1].legend()
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

In [None]:
# Evaluate LSTM Model
print("\n" + "="*60)
print("LSTM MODEL EVALUATION")
print("="*60)

# Predictions
start_time = time.time()
y_pred_proba_lstm = lstm_model.predict(X_test_padded, verbose=0)
y_pred_lstm = (y_pred_proba_lstm > 0.5).astype(int).flatten()
lstm_inference_time = time.time() - start_time

print(f"Inference time: {lstm_inference_time:.4f} seconds")

# Calculate metrics
lstm_metrics = evaluate_model(y_test_lstm, y_pred_lstm, "LSTM")

In [None]:
# LSTM Confusion Matrix
plt.figure(figsize=(6, 5))

cm_lstm = confusion_matrix(y_test_lstm, y_pred_lstm)
sns.heatmap(cm_lstm, annot=True, fmt='d', cmap='Greens',
            xticklabels=['Ham', 'Spam'], yticklabels=['Ham', 'Spam'])
plt.title('LSTM Model\nConfusion Matrix', fontsize=12, fontweight='bold')
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.tight_layout()
plt.show()

# Detailed classification report
print("\nDetailed Classification Report (LSTM):")
print(classification_report(y_test_lstm, y_pred_lstm, target_names=['Ham', 'Spam']))

---

## Part 6: Comparative Analysis

In [None]:
# Comprehensive comparison
print("="*70)
print("COMPREHENSIVE MODEL COMPARISON")
print("="*70)

# Create comparison DataFrame
comparison_data = {
    'Model': ['Naive Bayes (BOW)', 'Naive Bayes (TF-IDF)', 'LSTM'],
    'Accuracy': [nb_bow_metrics['accuracy'], nb_tfidf_metrics['accuracy'], lstm_metrics['accuracy']],
    'Precision': [nb_bow_metrics['precision'], nb_tfidf_metrics['precision'], lstm_metrics['precision']],
    'Recall': [nb_bow_metrics['recall'], nb_tfidf_metrics['recall'], lstm_metrics['recall']],
    'F1-Score': [nb_bow_metrics['f1'], nb_tfidf_metrics['f1'], lstm_metrics['f1']],
    'Training Time (s)': [nb_bow_train_time, nb_tfidf_train_time, lstm_train_time],
    'Inference Time (s)': [nb_bow_inference_time, nb_tfidf_inference_time, lstm_inference_time]
}

comparison_df = pd.DataFrame(comparison_data)
comparison_df = comparison_df.round(4)

# Display comparison
print(comparison_df.to_string(index=False))

In [None]:
# Visualize performance comparison
fig, axes = plt.subplots(1, 3, figsize=(16, 5))

metrics = ['Accuracy', 'Precision', 'Recall', 'F1-Score']
models = ['NB (BOW)', 'NB (TF-IDF)', 'LSTM']
colors = ['#3498db', '#2ecc71', '#e74c3c']

# Performance metrics comparison
x = np.arange(len(metrics))
width = 0.25

for i, (model, color) in enumerate(zip(models, colors)):
    # Rows of comparison_df are in the same order as `models`, so select by position
    values = comparison_df[metrics].iloc[i].values
    axes[0].bar(x + i*width, values, width, label=model, color=color, alpha=0.8)

axes[0].set_xlabel('Metrics')
axes[0].set_ylabel('Score')
axes[0].set_title('Performance Metrics Comparison', fontsize=12, fontweight='bold')
axes[0].set_xticks(x + width)
axes[0].set_xticklabels(metrics)
axes[0].legend()
axes[0].set_ylim(0.8, 1.0)
axes[0].grid(True, alpha=0.3, axis='y')

# Training time comparison
train_times = comparison_df['Training Time (s)'].values
bars = axes[1].bar(models, train_times, color=colors, alpha=0.8)
axes[1].set_xlabel('Model')
axes[1].set_ylabel('Time (seconds)')
axes[1].set_title('Training Time Comparison', fontsize=12, fontweight='bold')
axes[1].grid(True, alpha=0.3, axis='y')

# Add value labels on bars
# Use `t` as the loop variable so the `time` module is not shadowed
for bar, t in zip(bars, train_times):
    axes[1].text(bar.get_x() + bar.get_width()/2, bar.get_height() + 0.5, 
                 f'{t:.2f}s', ha='center', va='bottom', fontsize=10)

# Inference time comparison
inference_times = comparison_df['Inference Time (s)'].values
bars = axes[2].bar(models, inference_times, color=colors, alpha=0.8)
axes[2].set_xlabel('Model')
axes[2].set_ylabel('Time (seconds)')
axes[2].set_title('Inference Time Comparison', fontsize=12, fontweight='bold')
axes[2].grid(True, alpha=0.3, axis='y')

# Add value labels
# Use `t` as the loop variable so the `time` module is not shadowed
for bar, t in zip(bars, inference_times):
    axes[2].text(bar.get_x() + bar.get_width()/2, bar.get_height() + 0.001, 
                 f'{t:.4f}s', ha='center', va='bottom', fontsize=10)

plt.tight_layout()
plt.show()

In [None]:
# ROC Curve Comparison
plt.figure(figsize=(10, 8))

# Calculate ROC curves
fpr_bow, tpr_bow, _ = roc_curve(y_test, y_pred_proba_nb_bow)
roc_auc_bow = auc(fpr_bow, tpr_bow)

fpr_tfidf, tpr_tfidf, _ = roc_curve(y_test, y_pred_proba_nb_tfidf)
roc_auc_tfidf = auc(fpr_tfidf, tpr_tfidf)

fpr_lstm, tpr_lstm, _ = roc_curve(y_test_lstm, y_pred_proba_lstm)
roc_auc_lstm = auc(fpr_lstm, tpr_lstm)

# Plot ROC curves
plt.plot(fpr_bow, tpr_bow, color='#3498db', lw=2, 
         label=f'Naive Bayes (BOW) - AUC = {roc_auc_bow:.4f}')
plt.plot(fpr_tfidf, tpr_tfidf, color='#2ecc71', lw=2, 
         label=f'Naive Bayes (TF-IDF) - AUC = {roc_auc_tfidf:.4f}')
plt.plot(fpr_lstm, tpr_lstm, color='#e74c3c', lw=2, 
         label=f'LSTM - AUC = {roc_auc_lstm:.4f}')

# Diagonal line (random classifier)
plt.plot([0, 1], [0, 1], color='gray', lw=1, linestyle='--', label='Random Classifier')

plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate', fontsize=12)
plt.ylabel('True Positive Rate', fontsize=12)
plt.title('ROC Curve Comparison', fontsize=14, fontweight='bold')
plt.legend(loc='lower right', fontsize=10)
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

print(f"\nAUC Scores:")
print(f"  Naive Bayes (BOW):    {roc_auc_bow:.4f}")
print(f"  Naive Bayes (TF-IDF): {roc_auc_tfidf:.4f}")
print(f"  LSTM:                 {roc_auc_lstm:.4f}")

In [None]:
# All confusion matrices side by side
fig, axes = plt.subplots(1, 3, figsize=(15, 4))

cms = [cm_bow, cm_tfidf, cm_lstm]
titles = ['Naive Bayes (BOW)', 'Naive Bayes (TF-IDF)', 'LSTM']
cmaps = ['Blues', 'Blues', 'Greens']

for ax, cm, title, cmap in zip(axes, cms, titles, cmaps):
    sns.heatmap(cm, annot=True, fmt='d', cmap=cmap, ax=ax,
                xticklabels=['Ham', 'Spam'], yticklabels=['Ham', 'Spam'])
    ax.set_title(f'{title}', fontsize=11, fontweight='bold')
    ax.set_xlabel('Predicted')
    ax.set_ylabel('Actual')

plt.suptitle('Confusion Matrix Comparison', fontsize=14, fontweight='bold', y=1.02)
plt.tight_layout()
plt.show()

---

## Part 7: Sample Predictions

In [None]:
# Test with sample messages
sample_messages = [
    "Congratulations! You've won a $1000 gift card. Click here to claim now!",
    "Hey, are we still meeting for lunch tomorrow at 2pm?",
    "URGENT: Your account has been compromised. Call this number immediately.",
    "Thanks for your help yesterday. I really appreciate it!",
    "FREE entry to win iPhone 15! Text WIN to 12345 now!",
    "Can you pick up some groceries on your way home?"
]

def predict_message(message, models_dict):
    """
    Predict spam/ham for a given message using all models
    """
    # Preprocess
    cleaned = preprocess_text(message)
    
    results = {}
    
    # Naive Bayes (TF-IDF) prediction
    tfidf_vec = tfidf_vectorizer.transform([cleaned])
    nb_pred = nb_tfidf_model.predict(tfidf_vec)[0]
    nb_prob = nb_tfidf_model.predict_proba(tfidf_vec)[0][1]
    results['NB (TF-IDF)'] = ('Spam' if nb_pred == 1 else 'Ham', nb_prob)
    
    # LSTM prediction
    seq = tokenizer.texts_to_sequences([cleaned])
    padded = pad_sequences(seq, maxlen=MAX_SEQUENCE_LENGTH, padding='post', truncating='post')
    lstm_prob = lstm_model.predict(padded, verbose=0)[0][0]
    lstm_pred = 1 if lstm_prob > 0.5 else 0
    results['LSTM'] = ('Spam' if lstm_pred == 1 else 'Ham', lstm_prob)
    
    return results

# Make predictions
print("="*80)
print("SAMPLE PREDICTIONS")
print("="*80)

for i, msg in enumerate(sample_messages, 1):
    print(f"\n📧 Message {i}: \"{msg[:60]}...\"" if len(msg) > 60 else f"\n📧 Message {i}: \"{msg}\"")
    print("-"*70)
    
    results = predict_message(msg, None)
    
    for model, (prediction, prob) in results.items():
        icon = "🚫" if prediction == 'Spam' else "✅"
        # `prob` is the predicted probability of the spam class, not confidence in the label
        print(f"  {model:15s}: {icon} {prediction:4s} (spam probability: {prob:.2%})")

---

## Part 8: Summary and Conclusions

In [None]:
# Final Summary Table
print("="*80)
print("FINAL COMPARISON SUMMARY")
print("="*80)

# Add AUC to comparison
comparison_df['AUC'] = [roc_auc_bow, roc_auc_tfidf, roc_auc_lstm]

# Display final comparison
print("\n📊 Performance Metrics:")
print(comparison_df[['Model', 'Accuracy', 'Precision', 'Recall', 'F1-Score', 'AUC']].to_string(index=False))

print("\n⏱️ Computational Efficiency:")
print(comparison_df[['Model', 'Training Time (s)', 'Inference Time (s)']].to_string(index=False))

# Find best model for each metric
print("\n🏆 Best Model by Metric:")
for metric in ['Accuracy', 'Precision', 'Recall', 'F1-Score', 'AUC']:
    best_idx = comparison_df[metric].idxmax()
    best_model = comparison_df.loc[best_idx, 'Model']
    best_value = comparison_df.loc[best_idx, metric]
    print(f"  {metric:12s}: {best_model} ({best_value:.4f})")

## Key Findings and Recommendations

### Performance Analysis

| Aspect | Naive Bayes | LSTM |
|--------|-------------|------|
| **Accuracy** | High (~96-98%) | High (~97-99%) |
| **Training Speed** | Very fast (<1 sec) | Slow (minutes) |
| **Inference Speed** | Very fast | Slower |
| **Interpretability** | High (feature importance) | Low (black box) |
| **Data Requirements** | Works with small data | Needs more data |
| **Sequential Patterns** | No | Yes |

### When to Use Each Model

**Choose Naive Bayes when:**
- You have limited computational resources
- You need a quick prototype or a baseline
- Dataset is small to medium-sized
- Interpretability is important
- Real-time predictions are required

**Choose LSTM when:**
- Word order and context matter significantly
- You have large datasets
- Computational resources are available
- Higher accuracy is worth the extra complexity
- You're dealing with longer text sequences

### Practical Recommendations for SMS Spam Detection

1. **For production systems**: Consider Naive Bayes due to its speed and competitive accuracy
2. **For research or high-stakes applications**: LSTM or transformer models may provide marginal improvements
3. **Ensemble approach**: Combine both models for robust predictions
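
One simple way to combine the two models is soft voting, that is, a weighted average of their predicted spam probabilities. The sketch below is a hypothetical illustration: `soft_vote`, its default weight, and the example probabilities are assumptions, and the weight would normally be tuned on validation data.

```python
def soft_vote(prob_nb, prob_lstm, weight_nb=0.5, threshold=0.5):
    """Combine two spam probabilities with a weighted average."""
    if not 0.0 <= weight_nb <= 1.0:
        raise ValueError("weight_nb must be between 0 and 1")
    combined = weight_nb * prob_nb + (1.0 - weight_nb) * prob_lstm
    label = "Spam" if combined > threshold else "Ham"
    return label, combined

# Hypothetical probabilities for a single message
label, combined = soft_vote(prob_nb=0.90, prob_lstm=0.70)
print(label, round(combined, 2))

# Models disagree: Naive Bayes leans ham, LSTM leans spam
label_2, combined_2 = soft_vote(prob_nb=0.20, prob_lstm=0.60)
print(label_2, round(combined_2, 2))
```

In this notebook the two inputs could come from `nb_tfidf_model.predict_proba` and `lstm_model.predict`, as in the `predict_message` function in Part 7.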

---

## Exercises for Students

1. **Experiment with hyperparameters:**
   - Try different `max_features` in vectorizers
   - Modify LSTM architecture (add/remove layers, change units)
   - Adjust `alpha` parameter in Naive Bayes

2. **Try different preprocessing:**
   - Add stemming or lemmatization
   - Remove/keep numbers
   - Experiment with n-grams

3. **Advanced models:**
   - Implement Bidirectional LSTM
   - Try GRU instead of LSTM
   - Implement attention mechanism

4. **Handle class imbalance:**
   - Apply SMOTE
   - Use class weights
   - Try undersampling/oversampling

5. **Deploy the model:**
   - Create a simple Flask/FastAPI endpoint
   - Build a Streamlit dashboard
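
As a starting point for Exercise 4, the sketch below computes "balanced" class weights with the heuristic `n_samples / (n_classes * class_count)`, which is the same formula scikit-learn uses for `class_weight='balanced'`. The helper name `balanced_class_weights` and the toy label list are hypothetical; the list only mirrors the roughly 87/13 ham/spam split observed earlier.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Weight each class inversely to its frequency."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {c: n_samples / (n_classes * cnt) for c, cnt in counts.items()}

# Hypothetical labels: 87 ham (0) and 13 spam (1)
labels = [0] * 87 + [1] * 13
weights = balanced_class_weights(labels)
print(weights)
```

A dictionary of this shape can be passed as `class_weight=` to the Keras `fit` method, so that errors on the minority spam class contribute more to the loss.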

In [None]:
# Save models for future use
import pickle

# Save Naive Bayes model and vectorizer
with open('nb_tfidf_model.pkl', 'wb') as f:
    pickle.dump({'model': nb_tfidf_model, 'vectorizer': tfidf_vectorizer}, f)

# Save LSTM model
lstm_model.save('lstm_spam_classifier.h5')

# Save tokenizer
with open('lstm_tokenizer.pkl', 'wb') as f:
    pickle.dump(tokenizer, f)

print("✅ Models saved successfully!")
print("   - nb_tfidf_model.pkl")
print("   - lstm_spam_classifier.h5")
print("   - lstm_tokenizer.pkl")

---

## References

1. Almeida, T.A., Gómez Hidalgo, J.M., Yamakami, A. (2011). Contributions to the Study of SMS Spam Filtering: New Collection and Results. DocEng'11.

2. Hochreiter, S., & Schmidhuber, J. (1997). Long Short-Term Memory. Neural Computation.

3. McCallum, A., & Nigam, K. (1998). A Comparison of Event Models for Naive Bayes Text Classification. AAAI Workshop.

4. Dataset: https://www.kaggle.com/datasets/uciml/sms-spam-collection-dataset

---

*Notebook prepared for BUS 405: Foundations of Big Data Analytics*