
# Spam Detection: Strategy, Models, and Selection (Do Not Execute Here)

> **Important:** This notebook is provided as **code for your boss to run**. Please do **not** execute it in this environment.
> It demonstrates multiple encoders and models, performs proper model selection, and **saves the best model** and associated
> artifacts into the `models/` directory.

## What this notebook does
1. Loads a labelled spam dataset (default: SMS Spam Collection).  
2. Tries multiple **text encoders** (TF-IDF word and char n-grams, HashingVectorizer; SentenceTransformer embeddings are optional).  
3. Trains multiple **models** (Logistic Regression, calibrated Linear SVM, Multinomial Naive Bayes, RandomForest).  
4. Uses **pipelines** + **cross-validation** and **RandomizedSearchCV** for robust selection.  
5. Evaluates with **ROC-AUC** and **PR-AUC**, plus confusion matrix and classification report.  
6. Saves the **best pipeline** (vectorizer + model) to `models/best_model.joblib` and metadata to `models/metadata.json`.
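
Because the vectorizer is saved together with the classifier, the stored pipeline can be applied directly to raw strings. The following is a minimal sketch of that round trip, not the notebook's own output: it fits a toy pipeline on made-up messages and writes it to a temporary directory instead of `models/`, so the names `toy_pipeline` and `new_messages` are illustrative assumptions.

```python
import tempfile
from pathlib import Path

import joblib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Toy stand-in for the selected pipeline (made-up messages, illustration only).
texts = ["win a free prize now", "free cash, claim now",
         "see you at lunch", "call me when you land"]
labels = [1, 1, 0, 0]  # 1 = spam, 0 = ham
toy_pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", LogisticRegression()),
]).fit(texts, labels)

# Save and reload, mirroring the best_model.joblib step of the notebook.
with tempfile.TemporaryDirectory() as tmp:
    model_path = Path(tmp) / "best_model.joblib"
    joblib.dump(toy_pipeline, model_path)
    loaded = joblib.load(model_path)

# The loaded pipeline accepts raw text; column 1 is the spam probability.
new_messages = ["claim your free prize", "lunch at noon?"]
spam_proba = loaded.predict_proba(new_messages)[:, 1]
print(dict(zip(new_messages, spam_proba.round(3))))
```

`joblib.load` unpickles arbitrary Python objects, so it should only be used on model files from a trusted source.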

## Dataset
- Default: SMS Spam Collection dataset (labelled "ham"/"spam").
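- You can replace this with your own CSV, provided it has `text` and `label` columns and is saved as `Spam_Detector.csv` next to this notebook (the path the loading cell reads).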
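
The raw SMS Spam Collection file is distributed as tab-separated text with the label first and no header row, so it needs a small conversion before it matches the `text`/`label` layout. The sketch below shows one way to do that on two made-up rows held in memory; the variable names (`raw`, `df_raw`, `label_map`) are illustrative and the real file path is left to the reader.

```python
import csv
import io

import pandas as pd

# Two made-up rows in the raw layout: label<TAB>message, no header row.
raw = "ham\tAre we still on for lunch?\nspam\tWINNER! Claim your \"free\" prize now\n"

df_raw = pd.read_csv(
    io.StringIO(raw),
    sep="\t",
    header=None,
    names=["label", "text"],
    quoting=csv.QUOTE_NONE,  # treat quote characters inside messages as plain text
)

# Map string labels to integers and fail loudly on anything unexpected.
label_map = {"ham": 0, "spam": 1}
df_raw["label"] = df_raw["label"].str.strip().str.lower().map(label_map)
if df_raw["label"].isna().any():
    raise ValueError("Found labels other than 'ham'/'spam'")
df_raw["label"] = df_raw["label"].astype(int)

# Emit the two columns the notebook expects.
csv_text = df_raw[["text", "label"]].to_csv(index=False)
print(csv_text)
```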

## Encoders & Models Included
**Encoders**
- `TfidfVectorizer` (word-level, char-level, mixed n-grams)
- `HashingVectorizer` (for speed/memory; paired with an online learner or standard linear models)
- *(Optional)* Sentence embeddings via `sentence-transformers` (not included by default to keep dependencies light)
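
To make the difference between these encoders concrete, here is a small sketch on three made-up messages. It assumes nothing beyond scikit-learn: the TF-IDF vectorizers learn a vocabulary from the data, while `HashingVectorizer` is stateless and always produces `n_features` columns. Character n-grams may also share some features between obfuscated and plain spellings (for example `ca$h` and `cash`), which is one reason to try them on spam.

```python
from sklearn.feature_extraction.text import HashingVectorizer, TfidfVectorizer

# Made-up messages, illustration only.
docs = ["Free entry! Win cash now", "Are you free for lunch?", "W1N ca$h n0w"]

word_tfidf = TfidfVectorizer(analyzer="word", ngram_range=(1, 2))
char_tfidf = TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4))
hashing = HashingVectorizer(n_features=2**10, alternate_sign=False)

X_word = word_tfidf.fit_transform(docs)  # vocabulary learned from the data
X_char = char_tfidf.fit_transform(docs)  # character n-grams inside word boundaries
X_hash = hashing.transform(docs)         # stateless: no fit needed, fixed width

print("word:", X_word.shape, "char_wb:", X_char.shape, "hash:", X_hash.shape)
```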

**Models**
- Logistic Regression (liblinear solver, L1 and L2 penalties)
- Linear SVM (LinearSVC with CalibratedClassifierCV for probabilities)
- Multinomial Naive Bayes
- RandomForest (as a non-linear baseline; expect slower training on large data)
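
`LinearSVC` on its own exposes only `decision_function`, which is why it is wrapped in `CalibratedClassifierCV` when probabilities are needed. The sketch below illustrates the wrapping on a dozen made-up messages; the data and the name `calibrated_svc` are assumptions for illustration, and with so few samples the probabilities themselves are not meaningful.

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Made-up messages: six spam, six ham, enough for 3-fold calibration.
spam = ["win free cash now", "free prize claim today", "urgent: claim your reward",
        "cheap loans, apply now", "you won a free ticket", "claim free bonus now"]
ham = ["see you at lunch", "call me when you land", "meeting moved to friday",
       "thanks for the notes", "are you home yet", "happy birthday mate"]
texts = spam + ham
labels = [1] * len(spam) + [0] * len(ham)

# The wrapper fits the SVM on CV folds and learns a mapping from margins to probabilities.
calibrated_svc = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("svc", CalibratedClassifierCV(LinearSVC(), cv=3)),
]).fit(texts, labels)

proba = calibrated_svc.predict_proba(["free cash prize", "see you friday"])
print("plain LinearSVC has predict_proba:", hasattr(LinearSVC(), "predict_proba"))
print("calibrated pipeline probabilities:\n", proba.round(3))
```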

## Metrics
Model selection prioritizes **Average Precision (PR-AUC)**, with **ROC-AUC** as the tie-breaker. **F1** (via the classification report) and the confusion matrix are also logged.
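
As a worked example of how these metrics relate, the sketch below scores four made-up predictions. The ranking metrics (ROC-AUC, Average Precision) take continuous scores, while F1 and the confusion matrix need hard labels, here obtained with an assumed 0.5 threshold.

```python
from sklearn.metrics import (
    average_precision_score,
    confusion_matrix,
    f1_score,
    roc_auc_score,
)

# Made-up labels and scores, illustration only.
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]
y_pred = [int(s >= 0.5) for s in y_score]  # assumed threshold of 0.5

roc_auc = roc_auc_score(y_true, y_score)           # 3 of 4 (spam, ham) pairs ranked correctly: 0.75
pr_auc = average_precision_score(y_true, y_score)  # 0.5 * 1.0 + 0.5 * (2/3) = 0.8333...
f1 = f1_score(y_true, y_pred)                      # precision 1.0, recall 0.5: 0.6667
cm = confusion_matrix(y_true, y_pred)              # rows = true class, columns = predicted class

print(f"ROC-AUC={roc_auc:.4f}  PR-AUC={pr_auc:.4f}  F1={f1:.4f}")
print(cm)
```

The spam message scored 0.35 ranks below a ham message, which lowers both ranking metrics, and it also falls under the threshold, which shows up as the single false negative in the confusion matrix.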

## Reproducibility
- Set `RANDOM_STATE` where applicable.
- Log all hyperparameters and CV splits.
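
One way to log CV splits is to record the row indices of each fold. The sketch below does this with a hypothetical helper, `cv_split_log`, on made-up labels; with `shuffle=True` and a fixed `random_state`, `StratifiedKFold` produces the same folds on every run, so the log can be regenerated or compared later.

```python
import json

import numpy as np
from sklearn.model_selection import StratifiedKFold

RANDOM_STATE = 42
y = np.array([0] * 8 + [1] * 4)  # made-up labels: 8 ham, 4 spam


def cv_split_log(y, n_splits=4, seed=RANDOM_STATE):
    """Return the train/validation indices of each fold as JSON-friendly lists."""
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    placeholder_X = np.zeros((len(y), 1))  # the splitter only needs the labels
    return [
        {"fold": i, "train_idx": train.tolist(), "valid_idx": valid.tolist()}
        for i, (train, valid) in enumerate(skf.split(placeholder_X, y))
    ]


split_log = cv_split_log(y)
print(json.dumps(split_log[0]))
```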

---


In [None]:
# 📦 Data Loading (Read from local filename in this notebook's folder)
import pandas as pd
from pathlib import Path

CSV_PATH = Path("Spam_Detector.csv")  # Expect the file to be next to this notebook
assert CSV_PATH.exists(), f"❌ File not found: {CSV_PATH}. Place Spam_Detector.csv alongside this notebook."
df = pd.read_csv(CSV_PATH)

required_cols = {"text", "label"}
assert required_cols.issubset(df.columns), f"❌ CSV must contain the columns: {required_cols}"

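# Normalize string labels ("spam"/"ham", any case) to 1/0; numeric strings such as "1"/"0" pass through
if not pd.api.types.is_numeric_dtype(df["label"]):
    normalized = df["label"].astype(str).str.strip().str.lower().replace({"spam": "1", "ham": "0"})
    unknown = sorted(set(normalized) - {"0", "1"})
    assert not unknown, f"❌ Unexpected label values: {unknown}"
    df["label"] = normalized.astype(int)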

print("✅ Loaded", len(df), "rows from", CSV_PATH)
print(df["label"].value_counts(normalize=True).rename("class_proportion"))
df.head()

In [None]:

# ⚙️ Setup (Do Not Execute Here)
# This cell imports the libraries used below; uncomment the pip line to install any that are missing.
# Your boss should run this in their environment (local or Colab).
# !pip install -U scikit-learn pandas numpy joblib matplotlib seaborn tqdm

import os, json, math, random, warnings
import numpy as np
import pandas as pd
from pathlib import Path
from typing import Dict, Any

from sklearn.model_selection import train_test_split, StratifiedKFold, RandomizedSearchCV
from sklearn.pipeline import Pipeline
from sklearn.metrics import roc_auc_score, average_precision_score, classification_report, confusion_matrix
from sklearn.feature_extraction.text import TfidfVectorizer, HashingVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC
from sklearn.calibration import CalibratedClassifierCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.utils import Bunch
import joblib
import matplotlib.pyplot as plt

warnings.filterwarnings("ignore")

RANDOM_STATE = 42
# Seed both Python's and NumPy's global generators for reproducibility
random.seed(RANDOM_STATE)
np.random.seed(RANDOM_STATE)


In [None]:

# ✂️ Train/Validation Split (Do Not Execute Here)
X = df["text"].astype(str).values
y = df["label"].values

X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=RANDOM_STATE
)
print(len(X_train), len(X_valid))


In [None]:

# 🧪 Define Pipelines & Hyperparameters (Do Not Execute Here)
# We'll configure a few candidate pipelines and hyperparameter grids for RandomizedSearchCV.

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=RANDOM_STATE)

candidates = []

# 1) TF-IDF + Logistic Regression
pipe_lr = Pipeline([
    ("tfidf", TfidfVectorizer(strip_accents='unicode')),
    ("clf", LogisticRegression(max_iter=2000, n_jobs=None, random_state=RANDOM_STATE, solver="liblinear"))
])
params_lr = {
    "tfidf__ngram_range": [(1,1), (1,2)],
    "tfidf__min_df": [1,2,3,5],
    "tfidf__max_df": [0.9, 1.0],
    "tfidf__sublinear_tf": [True, False],
    "clf__C": np.logspace(-2,2,10),
    "clf__penalty": ["l1","l2"]
}
candidates.append(("tfidf+logreg", pipe_lr, params_lr))

# 2) TF-IDF (word or char analyzer, chosen by the search) + LinearSVC (calibrated for probabilities)
pipe_svc = Pipeline([
    ("tfidf", TfidfVectorizer(strip_accents='unicode')),
    ("svc", CalibratedClassifierCV(LinearSVC(random_state=RANDOM_STATE), cv=3))
])
params_svc = {
    "tfidf__ngram_range": [(1,2), (1,3)],
    "tfidf__analyzer": ["word", "char", "char_wb"],
    "tfidf__min_df": [1,2,3],
    "tfidf__max_df": [0.9, 1.0]
}
candidates.append(("tfidf+svc(calibrated)", pipe_svc, params_svc))

# 3) TF-IDF + MultinomialNB
pipe_mnb = Pipeline([
    ("tfidf", TfidfVectorizer(strip_accents='unicode')),
    ("clf", MultinomialNB())
])
params_mnb = {
    "tfidf__ngram_range": [(1,1), (1,2)],
    "tfidf__min_df": [1,2,3],
    "tfidf__max_df": [0.9, 1.0],
    "clf__alpha": np.logspace(-3,0,8)
}
candidates.append(("tfidf+mnb", pipe_mnb, params_mnb))

# 4) HashingVectorizer + Logistic Regression
pipe_hash_lr = Pipeline([
    ("hash", HashingVectorizer(alternate_sign=False)),
    ("clf", LogisticRegression(max_iter=2000, solver="liblinear", random_state=RANDOM_STATE))
])
params_hash_lr = {
    "hash__n_features": [2**16, 2**18, 2**20],
    "clf__C": np.logspace(-2,2,10),
    "clf__penalty": ["l1","l2"]
}
candidates.append(("hash+logreg", pipe_hash_lr, params_hash_lr))

# 5) TF-IDF + RandomForest (baseline non-linear)
pipe_rf = Pipeline([
    ("tfidf", TfidfVectorizer(strip_accents='unicode', ngram_range=(1,2))),
    ("rf", RandomForestClassifier(random_state=RANDOM_STATE, n_estimators=300))
])
params_rf = {
    "tfidf__min_df": [1,3,5],
    "tfidf__max_df": [0.9, 1.0],
    "rf__max_depth": [None, 10, 20],
    "rf__min_samples_split": [2, 5, 10]
}
candidates.append(("tfidf+rf", pipe_rf, params_rf))

# (Optional) SentenceTransformers + Linear models could be added here for larger budgets.
print(f"Configured {len(candidates)} candidate pipelines.")


In [None]:

# 🔍 Randomized Search over Candidates (Do Not Execute Here)
# We evaluate each candidate with RandomizedSearchCV and keep the best overall by PR-AUC primarily, then ROC-AUC.

def eval_candidate(name, pipe, param_dist, X_train, y_train, X_valid, y_valid, n_iter=20):
    search = RandomizedSearchCV(
        pipe,
        param_distributions=param_dist,
        n_iter=n_iter,
        scoring="average_precision",
        n_jobs=-1,
        cv=cv,
        random_state=RANDOM_STATE,
        verbose=1
    )
    search.fit(X_train, y_train)
    best = search.best_estimator_
    y_pred = best.predict(X_valid)
    # Prefer probabilities, then decision_function margins; hard labels are a last resort
    if hasattr(best, "predict_proba"):
        scores = best.predict_proba(X_valid)[:, 1]
    elif hasattr(best, "decision_function"):
        scores = best.decision_function(X_valid)
    else:
        scores = y_pred
    pr_auc = average_precision_score(y_valid, scores)
    try:
        roc = roc_auc_score(y_valid, scores)
    except ValueError:
        # ROC-AUC is undefined when y_valid contains a single class
        roc = float("nan")
    report = classification_report(y_valid, y_pred, digits=4)
    cm = confusion_matrix(y_valid, y_pred).tolist()
    return {
        "name": name,
        "search": search,
        "best": best,
        "pr_auc": float(pr_auc),
        "roc_auc": float(roc),
        "report": report,
        "cm": cm,
        "best_params": search.best_params_
    }

results = []
for name, pipe, param_dist in candidates:
    res = eval_candidate(name, pipe, param_dist, X_train, y_train, X_valid, y_valid, n_iter=20)
    print(f"\n{name}: PR-AUC={res['pr_auc']:.4f} | ROC-AUC={res['roc_auc']:.4f}")
    print(res["report"])
    results.append(res)

# Select best by PR-AUC, then ROC-AUC
results = sorted(results, key=lambda r: (r["pr_auc"], r["roc_auc"]), reverse=True)
best = results[0]
print("Best model:", best["name"])
print("Best params:", best["best_params"])


In [None]:

# 💾 Save Best Model & Metadata (Do Not Execute Here)
MODELS_DIR = Path("models")
MODELS_DIR.mkdir(exist_ok=True, parents=True)

best_path = MODELS_DIR / "best_model.joblib"
joblib.dump(best["best"], best_path)

metadata = {
    "selected_model": best["name"],
    "best_params": best["best_params"],
    "pr_auc": best["pr_auc"],
    "roc_auc": best["roc_auc"],
    "random_state": RANDOM_STATE
}
with open(MODELS_DIR / "metadata.json", "w") as f:
    json.dump(metadata, f, indent=2)

print("Saved:", best_path, "and metadata.json")
