# Assignment 8 – Q1: SMS Spam Classification using AdaBoost

This notebook implements **Q1 (SMS Spam Collection Dataset)** from Assignment 8:

1. Data loading and preprocessing (TF–IDF).
2. Baseline weak learner (Decision Stump).
3. **Manual AdaBoost** implementation (T = 15 rounds).
4. **Sklearn AdaBoost** implementation and comparison.

> **Note:** Place the dataset file `spam.csv` or `SMSSpamCollection` in the same folder as this notebook before running.

In [None]:
import numpy as np
import pandas as pd
import re
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer, ENGLISH_STOP_WORDS
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score
from sklearn.ensemble import AdaBoostClassifier
import matplotlib.pyplot as plt

np.random.seed(42)

## Part A — Data Preprocessing & Exploration

Steps:
1. Load the SMS spam dataset.
2. Convert label: `'spam' → 1`, `'ham' → 0`.
3. Text preprocessing: lowercase, remove punctuation, remove stopwords.
4. Convert text to numeric features using **TF–IDF**.
5. Train–test split (80/20).
6. Show class distribution.
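
The cells below fit TF–IDF on the full corpus and split afterwards, which follows the step order above but lets the vocabulary and IDF weights see the test messages. As a hedged alternative, the sketch below splits the raw text first and fits the vectorizer on the training part only. The toy messages and the `*_toy` names are hypothetical and not taken from the dataset.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

# Toy corpus (hypothetical messages, not from the SMS dataset).
texts = [
    "win free prize now", "free entry win cash",
    "claim free prize today", "urgent cash prize claim",
    "see you at lunch", "call me later tonight",
    "are we meeting today", "lunch at noon then",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]  # 1 = spam, 0 = ham

# Split the raw text first, then fit the vectorizer on the training part only.
texts_train, texts_test, y_train_toy, y_test_toy = train_test_split(
    texts, labels, test_size=0.25, stratify=labels, random_state=42
)

tfidf_toy = TfidfVectorizer()
X_train_toy = tfidf_toy.fit_transform(texts_train)  # learns vocabulary and IDF
X_test_toy = tfidf_toy.transform(texts_test)        # reuses them, no refit

print("Train matrix:", X_train_toy.shape)
print("Test matrix :", X_test_toy.shape)
```

Words that appear only in the test messages are simply ignored by `transform`, which mirrors what happens when the model meets unseen vocabulary after deployment.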

In [None]:
# === Load SMS Spam Dataset ===
# This cell looks for a Kaggle-style 'spam.csv' first, then the UCI raw text file
# 'SMSSpamCollection'. If your file uses different column names, adjust accordingly.

import os

def load_sms_dataset():
    # Try common filenames
    possible_files = [
        'spam.csv',                # Kaggle CSV
        'SMSSpamCollection',      # UCI raw text
        'SMSSpamCollection.txt'
    ]

    dataset_path = None
    for fname in possible_files:
        if os.path.exists(fname):
            dataset_path = fname
            break

    if dataset_path is None:
        raise FileNotFoundError(
            "Dataset file not found. Please place 'spam.csv' or 'SMSSpamCollection' in the notebook folder."
        )

    print(f"Using dataset file: {dataset_path}\n")

    # Case 1: Kaggle CSV (spam.csv) with columns: v1 (label), v2 (text)
    if dataset_path.endswith('.csv'):
        df = pd.read_csv(dataset_path, encoding='latin-1')
        # Try to standardize column names
        if 'label' in df.columns and 'text' in df.columns:
            pass
        elif 'v1' in df.columns and 'v2' in df.columns:
            df = df.rename(columns={'v1': 'label', 'v2': 'text'})
        else:
            # Keep first two columns as label & text if names are unexpected
            df = df.rename(columns={df.columns[0]: 'label', df.columns[1]: 'text'})
        # Drop any completely extra unnamed columns
        df = df[['label', 'text']]
        return df

    # Case 2: UCI raw text file 'SMSSpamCollection' (tab-separated)
    else:
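        # quoting=3 (csv.QUOTE_NONE): treat double quotes inside messages as literal
        # characters so that unbalanced quotes cannot merge or truncate rows.
        df = pd.read_csv(
            dataset_path, sep='\t', header=None, names=['label', 'text'], quoting=3
        )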
        return df


df = load_sms_dataset()
print('First 5 rows:')
display(df.head())
print('\nDataset shape:', df.shape)

In [None]:
# === Encode labels: spam -> 1, ham -> 0 ===
df['label_num'] = df['label'].map({'ham': 0, 'spam': 1})
if df['label_num'].isnull().any():
    raise ValueError("Found labels other than 'ham' and 'spam'. Check dataset.")

print(df['label'].value_counts())
print('\nNumeric labels distribution:')
print(df['label_num'].value_counts())

In [None]:
# === Text preprocessing: lowercase, remove punctuation, remove stopwords ===
def preprocess_text(text):
    # Lowercase
    text = text.lower()
    # Remove punctuation / non-alphanumeric (keep spaces)
    text = re.sub(r'[^a-z0-9\s]', ' ', text)
    # Tokenize by whitespace
    tokens = text.split()
    # Remove English stopwords
    tokens = [t for t in tokens if t not in ENGLISH_STOP_WORDS]
    return ' '.join(tokens)

df['clean_text'] = df['text'].astype(str).apply(preprocess_text)
print('Original vs Cleaned (first 5):')
display(df[['text', 'clean_text']].head())

In [None]:
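# === TF–IDF Vectorization ===
# Note: the vectorizer is fitted on the full corpus before the split, so the
# vocabulary and IDF weights also see the test messages. For a stricter evaluation,
# fit on the training texts only and call transform() on the test texts.
tfidf = TfidfVectorizer()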
X = tfidf.fit_transform(df['clean_text'])
y = df['label_num'].values

print('Feature matrix shape:', X.shape)

In [None]:
# === Train–Test Split (80/20) ===
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

print('Train size:', X_train.shape[0])
print('Test size:', X_test.shape[0])

# Class distribution in train & test
print('\nClass distribution (train):')
print(pd.Series(y_train).value_counts(normalize=True))

print('\nClass distribution (test):')
print(pd.Series(y_test).value_counts(normalize=True))

## Part B — Weak Learner Baseline (Decision Stump)

We train a **Decision Stump**:
`DecisionTreeClassifier(max_depth=1)`

We will report:
- Train accuracy
- Test accuracy
- Confusion matrix
- Brief comment on stump performance

In [None]:
# === Train Decision Stump Baseline ===
stump_baseline = DecisionTreeClassifier(max_depth=1, random_state=42)
stump_baseline.fit(X_train, y_train)

y_train_pred_stump = stump_baseline.predict(X_train)
y_test_pred_stump = stump_baseline.predict(X_test)

train_acc_stump = accuracy_score(y_train, y_train_pred_stump)
test_acc_stump = accuracy_score(y_test, y_test_pred_stump)

print(f'Train Accuracy (Stump): {train_acc_stump:.4f}')
print(f'Test Accuracy  (Stump): {test_acc_stump:.4f}\n')

print('Confusion Matrix (Test, Stump):')
cm_stump = confusion_matrix(y_test, y_test_pred_stump)
print(cm_stump)

print('\nClassification Report (Test, Stump):')
print(classification_report(y_test, y_test_pred_stump, target_names=['ham', 'spam']))

### Comment on Stump Performance

A single decision stump makes one split on one feature, which here means thresholding
the TF–IDF weight of a single word. No single word separates spam from ham, so the stump
catches only the messages that contain its chosen word and misclassifies the rest.
Because ham is the majority class in this dataset, test accuracy can still look
respectable; the spam recall in the classification report is the more telling number.

AdaBoost addresses this by **combining many stumps**, each focusing on different
mistakes, to form a much stronger classifier.
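
To make "one split on one feature" concrete, the sketch below fits a stump on a tiny hypothetical dataset and reads the learned rule from the fitted tree's `tree_` attribute. The toy arrays and the `*_toy` names are assumptions for illustration only.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Toy data (hypothetical): 6 samples, 3 features; the label equals feature 1.
X_toy = np.array([
    [0, 0, 1],
    [1, 0, 0],
    [0, 1, 1],
    [1, 1, 0],
    [0, 1, 0],
    [1, 0, 1],
], dtype=float)
y_toy = X_toy[:, 1].astype(int)

stump_toy = DecisionTreeClassifier(max_depth=1, random_state=42)
stump_toy.fit(X_toy, y_toy)

# Node 0 is the root; a stump has one decision node and two leaves.
split_feature = int(stump_toy.tree_.feature[0])
split_threshold = float(stump_toy.tree_.threshold[0])

print(f"Rule: test whether feature {split_feature} <= {split_threshold}")
print("Nodes in tree:", stump_toy.tree_.node_count)
```

Applied to the notebook's objects, the same lookup on `stump_baseline.tree_.feature[0]`, combined with `tfidf.get_feature_names_out()`, should reveal which word the baseline stump splits on.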

## Part C — Manual AdaBoost (T = 15 rounds)

We implement **AdaBoost from scratch** using decision stumps as weak learners.
At each iteration we will:

- Train a stump with current sample weights.
- Compute the weighted error.
- Compute the stump weight (alpha).
- Print:
  - Iteration number
  - Misclassified sample indices
  - Weights of misclassified samples
  - Alpha value
- Update and normalize the sample weights.

We will also plot:
- Iteration vs **weighted error**
- Iteration vs **alpha**

Finally, we report train/test accuracy and confusion matrix for the **final ensemble**.
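
Before the full implementation, a single boosting round can be worked through by hand. The sketch below uses five hypothetical samples with one mistake and applies the same formulas as the implementation that follows: weighted error, `alpha = 0.5 * ln((1 - eps) / eps)`, exponential reweighting, and normalization.

```python
import numpy as np

# Toy round (hypothetical values): 5 samples, the stump gets sample 3 wrong.
y_true = np.array([1, 0, 1, 0, 1])
y_pred = np.array([1, 0, 1, 1, 1])

w = np.ones(5) / 5                      # uniform initial weights
incorrect = (y_pred != y_true)

eps = np.dot(w, incorrect)              # weighted error: 1 of 5 equal weights
alpha = 0.5 * np.log((1 - eps) / eps)   # stump weight: 0.5 * ln(4)

# Scale weights up for mistakes, down for correct predictions, then renormalize.
w_new = w * np.exp(np.where(incorrect, alpha, -alpha))
w_new /= w_new.sum()

print("eps   =", round(float(eps), 3))
print("alpha =", round(float(alpha), 3))
print("w_new =", np.round(w_new, 3))
```

After the update, the single misclassified sample carries half of the total weight. This holds in general: the reweighting leaves the stump that was just trained with a weighted error of exactly 0.5 on the new weights, so the next stump has to find a different split to do better than chance.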

In [None]:
# === Manual AdaBoost Implementation (with decision stumps) ===

def manual_adaboost(X_train, y_train, X_test, T=15):
    n_samples = X_train.shape[0]
    # Initialize weights uniformly
    w = np.ones(n_samples) / n_samples

    stumps = []
    alphas = []
    errors = []

    for t in range(1, T + 1):
        stump = DecisionTreeClassifier(max_depth=1, random_state=42)
        stump.fit(X_train, y_train, sample_weight=w)
        y_pred = stump.predict(X_train)

        incorrect = (y_pred != y_train)
        eps = np.dot(w, incorrect)  # weighted error (weights sum to 1)

        # Avoid division by zero or eps >= 0.5 issues
        eps = np.clip(eps, 1e-10, 0.499999)

        alpha = 0.5 * np.log((1 - eps) / eps)

        # Store
        stumps.append(stump)
        alphas.append(alpha)
        errors.append(eps)

        # Print details for this iteration
        mis_idx = np.where(incorrect)[0]
        print(f"Iteration {t}")
        print(f"  Weighted error (eps): {eps:.6f}")
        print(f"  Alpha: {alpha:.6f}")
        print(f"  # Misclassified samples: {len(mis_idx)}")
        # WARNING: printing all indices & weights can be very long for large datasets.
        # To avoid huge outputs, we show at most first 20 misclassified samples.
        max_show = 20
        show_idx = mis_idx[:max_show]
        print(f"  Misclassified indices (first {max_show}): {show_idx}")
        print(f"  Weights of these misclassified samples: {w[show_idx]}\n")

        # Update weights: increase for misclassified, decrease for correctly classified
        # w_i <- w_i * exp(alpha) if misclassified, exp(-alpha) if correct
        w[~incorrect] *= np.exp(-alpha)
        w[incorrect] *= np.exp(alpha)

        # Normalize weights
        w /= w.sum()

    return stumps, np.array(alphas), np.array(errors)


T = 15
stumps, alphas, errors = manual_adaboost(X_train, y_train, X_test, T=T)

In [None]:
# === Helper: Predict using the manual AdaBoost ensemble ===
def adaboost_predict(X, stumps, alphas):
    # Aggregate stump predictions with weights (alphas)
    stump_preds = np.array([stump.predict(X) for stump in stumps])  # shape: (T, n_samples)
    # Convert {0,1} -> {-1, +1}
    stump_preds_pm1 = 2 * stump_preds - 1
    # Weighted sum
    agg = np.dot(alphas, stump_preds_pm1)
    # Final prediction: sign -> {0,1}
    y_pred_final = (np.sign(agg) == 1).astype(int)
    return y_pred_final


# Predictions for train and test using manual AdaBoost
y_train_pred_ada = adaboost_predict(X_train, stumps, alphas)
y_test_pred_ada = adaboost_predict(X_test, stumps, alphas)

train_acc_ada = accuracy_score(y_train, y_train_pred_ada)
test_acc_ada = accuracy_score(y_test, y_test_pred_ada)

print(f"Manual AdaBoost Train Accuracy: {train_acc_ada:.4f}")
print(f"Manual AdaBoost Test Accuracy : {test_acc_ada:.4f}\n")

print('Confusion Matrix (Test, Manual AdaBoost):')
cm_ada_manual = confusion_matrix(y_test, y_test_pred_ada)
print(cm_ada_manual)

print('\nClassification Report (Test, Manual AdaBoost):')
print(classification_report(y_test, y_test_pred_ada, target_names=['ham', 'spam']))

In [None]:
# === Plots: Iteration vs Weighted Error and Alpha ===
iterations = np.arange(1, len(errors) + 1)

plt.figure()
plt.plot(iterations, errors, marker='o')
plt.xlabel('Boosting Round (t)')
plt.ylabel('Weighted Error (epsilon_t)')
plt.title('Manual AdaBoost: Iteration vs Weighted Error')
plt.grid(True)
plt.show()

plt.figure()
plt.plot(iterations, alphas, marker='o')
plt.xlabel('Boosting Round (t)')
plt.ylabel('Alpha (alpha_t)')
plt.title('Manual AdaBoost: Iteration vs Alpha')
plt.grid(True)
plt.show()

### Interpretation of Weight Evolution

- Round 1 trains on uniform weights, so its weighted error is simply the training error
  rate. This is typically the lowest error of any round and yields the largest alpha.
- After each round, the algorithm **increases weights** on misclassified samples (often
  borderline or rare spam/ham messages) and decreases them on correctly classified ones.
- Later stumps therefore face a harder, reweighted problem: their weighted errors usually
  drift toward 0.5 and their alphas shrink. A rising weighted-error curve does not mean
  the ensemble is getting worse; progress shows up in the error of the combined vote.
- Samples that remain misclassified across many rounds accumulate **very high weights**,
  showing that AdaBoost is strongly focusing on them. If such samples are mislabeled or
  atypical, this focus is also what makes AdaBoost sensitive to noise.
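
One way to check whether boosting is actually correcting mistakes is to track the accuracy of the combined vote after each round, rather than the per-round weighted error. The helper below is a minimal sketch of that idea; `staged_accuracy` and the toy arrays are hypothetical names and values, not part of the notebook.

```python
import numpy as np

def staged_accuracy(stump_preds_pm1, alphas, y_pm1):
    """Accuracy of the weighted vote after each boosting round.

    stump_preds_pm1: array of shape (T, n_samples) with entries in {-1, +1}
    alphas:          array of shape (T,)
    y_pm1:           array of shape (n_samples,) with entries in {-1, +1}
    """
    # Row t holds the running weighted vote after the first t + 1 rounds.
    running_vote = np.cumsum(alphas[:, None] * stump_preds_pm1, axis=0)
    # Ties (vote == 0) fall to -1, i.e. positive only if the vote is > 0.
    staged_pred = np.where(running_vote > 0, 1, -1)
    return (staged_pred == y_pm1).mean(axis=1)

# Toy example (hypothetical predictions from three stumps on four samples).
preds_toy = np.array([
    [+1, +1, -1, -1],
    [+1, -1, -1, +1],
    [-1, +1, +1, +1],
])
alphas_toy = np.array([0.8, 0.5, 0.4])
y_toy_pm1 = np.array([+1, +1, -1, +1])

acc_per_round = staged_accuracy(preds_toy, alphas_toy, y_toy_pm1)
print(acc_per_round)
```

In the notebook, the rows would come from `2 * stump.predict(X_train) - 1` for each stored stump and the labels from `2 * y_train - 1`.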

## Part D — Sklearn AdaBoost

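Now we use `sklearn.ensemble.AdaBoostClassifier` with decision stumps as base learners.
The base learner is passed as `estimator`, which replaced the older `base_estimator`
argument in scikit-learn 1.2:

```python
AdaBoostClassifier(
    estimator=DecisionTreeClassifier(max_depth=1),
    n_estimators=100,
    learning_rate=0.6
)
```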

We will:
- Train the model
- Report train/test accuracy
- Show confusion matrix
- Compare with the manual AdaBoost implementation
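
Because the manual version stops at 15 rounds, it can help to see how the sklearn model scores at the same point. The sketch below uses `staged_score`, which yields the accuracy after each added estimator. It runs on synthetic stand-in data (an assumption, not the SMS dataset) and relies on the default base learner, a depth-1 tree, instead of passing one explicitly.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (hypothetical, not the SMS dataset).
X_syn, y_syn = make_classification(
    n_samples=400, n_features=20, n_informative=5, random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.2, stratify=y_syn, random_state=42
)

# With no base learner passed, AdaBoostClassifier boosts depth-1 trees (stumps).
ada_syn = AdaBoostClassifier(n_estimators=100, learning_rate=0.6, random_state=42)
ada_syn.fit(X_tr, y_tr)

# staged_score yields the test accuracy after 1, 2, ..., n fitted estimators.
test_scores = list(ada_syn.staged_score(X_te, y_te))

# Boosting can stop early, so only report rounds that were actually fitted.
rounds_to_show = [r for r in (15, 100) if r <= len(test_scores)]
for r in rounds_to_show:
    print(f"Test accuracy after {r:3d} stumps: {test_scores[r - 1]:.4f}")
```

Calling `ada_sklearn.staged_score(X_test, y_test)` on the fitted notebook model should give the corresponding curve for the SMS data.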

In [None]:
# === Sklearn AdaBoost ===
base_stump = DecisionTreeClassifier(max_depth=1, random_state=42)
ada_sklearn = AdaBoostClassifier(
    estimator=base_stump,
    n_estimators=100,
    learning_rate=0.6,
    random_state=42
)

ada_sklearn.fit(X_train, y_train)

y_train_pred_sklearn = ada_sklearn.predict(X_train)
y_test_pred_sklearn = ada_sklearn.predict(X_test)

train_acc_sklearn = accuracy_score(y_train, y_train_pred_sklearn)
test_acc_sklearn = accuracy_score(y_test, y_test_pred_sklearn)

print(f'Sklearn AdaBoost Train Accuracy: {train_acc_sklearn:.4f}')
print(f'Sklearn AdaBoost Test Accuracy : {test_acc_sklearn:.4f}\n')

print('Confusion Matrix (Test, Sklearn AdaBoost):')
cm_ada_sklearn = confusion_matrix(y_test, y_test_pred_sklearn)
print(cm_ada_sklearn)

print('\nClassification Report (Test, Sklearn AdaBoost):')
print(classification_report(y_test, y_test_pred_sklearn, target_names=['ham', 'spam']))

### Comparison: Manual vs Sklearn AdaBoost

- Both methods use decision stumps as base learners and combine them via weighted voting.
- The two models differ in their settings, not only in their implementation: the manual
  version runs `T = 15` rounds with full-size updates, whereas the sklearn model uses
  `n_estimators = 100` and `learning_rate = 0.6`, which shrinks each stump's contribution.
  The sklearn model may therefore achieve higher accuracy simply because it combines
  more stumps.
- If you increase `T` in the manual version, its performance should become closer to
  sklearn's implementation. For a like-for-like comparison, match the learning rate too.
- In practice, sklearn's AdaBoost is optimized and handles edge cases and numerical
  stability more robustly, but the manual version is very useful to understand how
  weights, errors, and alphas evolve during boosting.