# Project 2: Machine Learning - SMS Spam Classification (FILL-IN VERSION)

## Overview
This is the fill-in version for classroom instruction. Fill in the missing code sections to build an end-to-end ML pipeline for SMS spam classification.

## Learning Objectives
- **Supervised Learning**: Text classification with multiple algorithms
- **Unsupervised Learning**: Clustering and dimensionality reduction
- **Model Evaluation**: Comprehensive performance metrics
- **Deployment**: FastAPI web service creation

## Instructions
- Fill in the code sections marked with `# TODO: ...`
- Run each cell after completing it
- Ask for help if you get stuck!

## Phase 1: Data Loading and Exploration (0-20 minutes)

In [None]:
# TODO: Import necessary libraries
# Hint: You'll need pandas, numpy, matplotlib, seaborn, and sklearn modules
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, accuracy_score
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
import re
import warnings
warnings.filterwarnings('ignore')

print("✅ Libraries imported successfully!")

In [None]:
# TODO: Load the dataset
# Hint: Use pd.read_csv() to load 'sms_spam_dataset.csv'
df = pd.read_csv('sms_spam_dataset.csv')

# TODO: Display basic information about the dataset
print("📊 Dataset Overview:")
print(f"Shape: {df.shape}")
print(f"Columns: {list(df.columns)}")

# TODO: Display first few rows
print("\n📋 First 5 rows:")
# TODO: Use df.head() to display first 5 rows
df.head()

In [None]:
# TODO: Explore the data
print("🔍 Data Exploration:")

# TODO: Check for missing values
print("\n1. Missing Values:")
print(df.isnull().sum())

# TODO: Check label distribution
print("\n2. Label Distribution:")
# TODO: Use df['label'].value_counts() to see spam vs ham distribution
print(df['label'].value_counts())

# TODO: Display basic statistics
print("\n3. Basic Statistics:")
# TODO: Use df.describe() to get statistical summary
df.describe()

## Phase 2: Supervised Learning (20-50 minutes)

In [None]:
# TODO: Text preprocessing function
def preprocess_text(text):
    # TODO: Convert to lowercase
    text = text.lower()
    
    # TODO: Remove punctuation (keep only letters, numbers, and spaces)
    text = re.sub(r'[^a-zA-Z0-9\s]', '', text)
    
    # TODO: Remove extra whitespaces
    text = ' '.join(text.split())
    
    return text

# TODO: Apply preprocessing to the text column
df['text_processed'] = df['text'].apply(preprocess_text)

print("📝 Text preprocessing completed!")
print("\nExample:")
print(f"Original: {df['text'].iloc[0]}")
print(f"Processed: {df['text_processed'].iloc[0]}")
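
To see what each preprocessing step does, here is a minimal standalone sketch that applies the same three steps to two made-up messages. The sample strings are illustrative assumptions, not rows from `sms_spam_dataset.csv`.

```python
import re

def preprocess_text(text):
    """Lowercase, drop punctuation, and collapse whitespace (same steps as the cell above)."""
    text = text.lower()
    text = re.sub(r'[^a-z0-9\s]', '', text)  # keep letters, digits, and whitespace
    return ' '.join(text.split())            # collapse runs of whitespace

# Hypothetical sample messages, made up for illustration
samples = [
    "WINNER!! Claim your FREE prize now: call 0800-123-456",
    "  Hey,   are we still on for lunch?  ",
]

cleaned = [preprocess_text(s) for s in samples]
for original, processed in zip(samples, cleaned):
    print(f"{original!r} -> {processed!r}")
```

Notice that removing the hyphens fuses the phone number into a single token, and that markers such as `!!` disappear. Whether that helps or hurts a spam classifier is worth discussing in class.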

In [None]:
# TODO: Prepare the data for machine learning
# Hint: Create X (features) and y (target)
X = df['text_processed']  # TODO: Use df['text_processed'] as features
y = df['label']  # TODO: Use df['label'] as target

# TODO: Split the data into training and testing sets
# Hint: Use train_test_split with test_size=0.2, random_state=42, and stratify=y (keeps the spam/ham ratio the same in both splits)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

# TODO: Create TF-IDF vectorizer and transform the text data
# Hint: Use TfidfVectorizer with max_features=1000, stop_words='english'
tfidf_vectorizer = TfidfVectorizer(max_features=1000, stop_words='english')
X_train_tfidf = tfidf_vectorizer.fit_transform(X_train)
X_test_tfidf = tfidf_vectorizer.transform(X_test)

print("🚀 Data Preparation Complete!")
print(f"Training set size: {X_train_tfidf.shape}")
print(f"Test set size: {X_test_tfidf.shape}")
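
The vectorizer is fitted on the training text only and then reused to transform the test text. The toy sketch below shows what that means in practice: words that never appeared in training are ignored at test time. The tiny corpus is made up for illustration, and the default `TfidfVectorizer()` settings are used here rather than the `max_features` and `stop_words` settings from the cell above.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up toy corpus (an assumption for illustration only)
train_texts = ["free prize call now", "lunch at noon", "call me at noon"]
test_texts = ["free lunch voucher"]

vectorizer = TfidfVectorizer()
X_train_demo = vectorizer.fit_transform(train_texts)  # learns vocabulary and IDF weights
X_test_demo = vectorizer.transform(test_texts)        # reuses them, learns nothing new

vocab = sorted(vectorizer.vocabulary_)
print("Vocabulary:", vocab)
print("Train shape:", X_train_demo.shape)
print("Test shape:", X_test_demo.shape)
print("Non-zero test features:", X_test_demo.nnz)
```

The word "voucher" never appears in the training texts, so it has no column and contributes nothing to the test vector. Calling `fit_transform` on the test set instead would leak test information into the features.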

In [None]:
# TODO: Train and evaluate different models
# Hint: Try Logistic Regression, Naive Bayes, and Random Forest

# TODO: Initialize the models
models = {
    'Logistic Regression': LogisticRegression(random_state=42),
    'Naive Bayes': MultinomialNB(),
    'Random Forest': RandomForestClassifier(n_estimators=100, random_state=42)
}

results = {}
print("🎯 Model Training and Evaluation:")

# TODO: Create a loop to train and evaluate each model
for name, model in models.items():
    print(f"\n🤖 Training {name}...")
    
    # TODO: Train the model using fit()
    model.fit(X_train_tfidf, y_train)
    
    # TODO: Make predictions using predict()
    y_pred = model.predict(X_test_tfidf)
    
    # TODO: Calculate accuracy using accuracy_score()
    accuracy = accuracy_score(y_test, y_pred)
    
    print(f"✅ {name} Accuracy: {accuracy:.4f}")

    # TODO: Print classification report
    # Note: target_names assumes the labels sort as ham first, then spam
    print("📊 Classification Report:")
    print(classification_report(y_test, y_pred, target_names=['Ham', 'Spam']))
    
    # Store results
    results[name] = {
        'model': model,
        'accuracy': accuracy
    }

# TODO: Find the best model
best_model_name = max(results, key=lambda x: results[x]['accuracy'])
print(f"\n🏆 Best Model: {best_model_name}")
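
Selecting the best model by accuracy alone can be misleading when one class is much rarer than the other, which is often the case for spam. The sketch below uses made-up labels (90 ham, 10 spam, an assumed ratio rather than the real dataset's) to compare a "predict ham for everything" baseline with a hypothetical classifier.

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Made-up ground truth: 90 ham followed by 10 spam
y_true = ['ham'] * 90 + ['spam'] * 10

# Baseline that never predicts spam
y_baseline = ['ham'] * 100

# Hypothetical classifier: 2 ham wrongly flagged, 7 of 10 spam caught
y_model = ['ham'] * 88 + ['spam'] * 2 + ['spam'] * 7 + ['ham'] * 3

baseline_acc = accuracy_score(y_true, y_baseline)
baseline_recall = recall_score(y_true, y_baseline, pos_label='spam')

model_acc = accuracy_score(y_true, y_model)
model_recall = recall_score(y_true, y_model, pos_label='spam')
model_precision = precision_score(y_true, y_model, pos_label='spam')

print(f"Baseline: accuracy={baseline_acc:.2f}, spam recall={baseline_recall:.2f}")
print(f"Model:    accuracy={model_acc:.2f}, spam recall={model_recall:.2f}, "
      f"spam precision={model_precision:.2f}")
```

The baseline reaches 90% accuracy while catching no spam at all, so it is worth comparing the spam-class precision and recall in each classification report before declaring a winner.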

## Phase 3: Unsupervised Learning (50-70 minutes)

In [None]:
# TODO: Apply K-means clustering
print("🔍 Unsupervised Learning - K-means Clustering:")

# TODO: Create KMeans with n_clusters=2
# n_init is set explicitly because its default differs across scikit-learn versions
kmeans = KMeans(n_clusters=2, n_init=10, random_state=42)

# TODO: Fit the model and predict clusters
X_full_tfidf = tfidf_vectorizer.transform(df['text_processed'])
cluster_labels = kmeans.fit_predict(X_full_tfidf)

# TODO: Add cluster labels to dataframe
df['cluster'] = cluster_labels

# TODO: Analyze clusters
print("📊 Cluster Analysis:")
cluster_analysis = df.groupby(['cluster', 'label']).size().unstack(fill_value=0)
print(cluster_analysis)

# TODO: Calculate cluster purity
cluster_0_purity = max(cluster_analysis.iloc[0]) / cluster_analysis.iloc[0].sum()
cluster_1_purity = max(cluster_analysis.iloc[1]) / cluster_analysis.iloc[1].sum()

print("\n🎯 Cluster Purity:")
print(f"Cluster 0 purity: {cluster_0_purity:.4f}")
print(f"Cluster 1 purity: {cluster_1_purity:.4f}")
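
Per-cluster purity can also be combined into a single overall score: sum the majority-label counts of all clusters and divide by the total number of messages. The sketch below works through this on a made-up table of ten messages. The cluster assignments and labels are assumptions for illustration, not K-means output.

```python
import pandas as pd

# Made-up cluster assignments and labels (illustration only)
demo = pd.DataFrame({
    'cluster': [0, 0, 0, 0, 1, 1, 1, 1, 1, 1],
    'label': ['ham', 'ham', 'ham', 'spam', 'spam', 'spam', 'ham', 'ham', 'ham', 'ham'],
})

# Rows are clusters, columns are labels, cells are message counts
table = pd.crosstab(demo['cluster'], demo['label'])
print(table)

# Purity of each cluster: share of its most common label
per_cluster_purity = table.max(axis=1) / table.sum(axis=1)

# Overall purity: majority counts summed over clusters, divided by all messages
overall_purity = table.max(axis=1).sum() / table.values.sum()

print(per_cluster_purity)
print(f"Overall purity: {overall_purity:.2f}")
```

In this toy table both clusters are mostly ham, so purity looks reasonable even though neither cluster isolates spam. High purity on imbalanced data does not by itself mean the clusters match the spam/ham split.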

## Summary

### What You've Learned:
- Text preprocessing and feature engineering
- Supervised learning with multiple algorithms
- Unsupervised learning techniques
- Model evaluation and comparison
- Deployment preparation

### Instructions:
1. Complete all TODO sections in order
2. Run each cell after filling in the code
3. Ask questions if you get stuck
4. Share your findings with the class

### Next Steps:
After completing this notebook, you'll be ready to:
- Create FastAPI web services
- Deploy ML models in production
- Handle real-world text classification problems