<a href="https://colab.research.google.com/github/EthanGaoZhiyuan/A2-Spam-Email-Detection/blob/main/A2_%E2%80%93_Spam_Email_Detection_(Option%E2%80%AF2).ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Task**: Binary classification: predict whether an email is spam.<br/>
**Training Input**: text: str, typically 20–2,000 characters (max 5,000); may contain URLs, HTML, emoji, and typos. Most but not all inputs fit this profile.<br/>
**Training Label**: y ∈ {0=ham, 1=spam}.<br/>
**Deployment Output**: p_spam ∈ [0,1] (model confidence; if the model outputs logit z, then p=1/(1+e^-z)).<br/>
**Failure policy**: if cleaned text is empty, return p_spam=0 and log a warning.
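
The output contract and failure policy above can be sketched in a few lines. This is only an illustration under stated assumptions: the notebook does not define a cleaning step, so `clean` here simply strips whitespace, and the logger name and the `logit_fn` callback are hypothetical stand-ins for the real model.

```python
import logging
import math

logger = logging.getLogger("spam_api")  # hypothetical logger name


def sigmoid(z: float) -> float:
    """Map a logit z to p = 1 / (1 + e^-z), written to avoid overflow."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def clean(text: str) -> str:
    """Placeholder cleaning step (assumption): strip surrounding whitespace."""
    return text.strip()


def p_spam_with_policy(text: str, logit_fn) -> float:
    """Return p_spam; if the cleaned text is empty, log a warning and return 0."""
    cleaned = clean(text)
    if not cleaned:
        logger.warning("Empty text after cleaning; returning p_spam=0")
        return 0.0
    return sigmoid(logit_fn(cleaned))
```

Here `logit_fn` would wrap whatever produces the model's logit for a single string, for example a call to the fitted pipeline's `decision_function`.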

## Import dependencies

In [1]:
!pip -q install scikit-learn imbalanced-learn
import random, numpy as np
SEED = 42
random.seed(SEED); np.random.seed(SEED)

## Load & clean

In [2]:
import pandas as pd

df = pd.read_csv("https://raw.githubusercontent.com/EthanGaoZhiyuan/A2-Spam-Email-Detection/main/spam.csv")
df = df.rename(columns={df.columns[0]:"Category", df.columns[1]:"Message"})
df = df.dropna(subset=["Message"]).drop_duplicates(subset=["Message"]).reset_index(drop=True)
df["y"] = (df["Category"].str.lower().str.strip()=="spam").astype(int)
X = df["Message"].astype(str); y = df["y"].values

## Split

In [3]:
from sklearn.model_selection import train_test_split
X_tr, X_tmp, y_tr, y_tmp = train_test_split(X, y, test_size=0.30, stratify=y, random_state=SEED)
X_va, X_te, y_va, y_te = train_test_split(X_tmp, y_tmp, test_size=2/3, stratify=y_tmp, random_state=SEED)
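
The two-stage split above yields roughly 70% train, 10% validation, and 20% test. A quick arithmetic check of those fractions, using only the two `test_size` values from the cell (actual row counts may differ by one or two because scikit-learn rounds to whole samples):

```python
from fractions import Fraction

holdout = Fraction(30, 100)        # first split: test_size=0.30
test_of_holdout = Fraction(2, 3)   # second split: test_size=2/3 of the holdout

train_frac = 1 - holdout
test_frac = holdout * test_of_holdout
val_frac = holdout - test_frac

print(train_frac, val_frac, test_frac)  # 7/10 1/10 1/5
```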

## Pipeline & baseline

In [4]:
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

pipe = Pipeline([
    ("tfidf", TfidfVectorizer(ngram_range=(1,2), min_df=2, max_df=0.95)),
    ("clf", LogisticRegression(max_iter=1000, class_weight="balanced"))
]).fit(X_tr, y_tr)
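
For intuition about `class_weight="balanced"`: scikit-learn documents the weight for each class as `n_samples / (n_classes * count)`. The sketch below applies that formula by hand. The class counts (904 ham, 128 spam) are taken from the test-set support reported later in this notebook and are used purely as a stand-in, since the training-set counts are not printed; `balanced_class_weights` is a hypothetical helper, not a scikit-learn function.

```python
def balanced_class_weights(counts: dict[int, int]) -> dict[int, float]:
    """Weight each class by n_samples / (n_classes * class_count)."""
    n_samples = sum(counts.values())
    n_classes = len(counts)
    return {label: n_samples / (n_classes * c) for label, c in counts.items()}


# Illustrative counts only (assumption): 0 = ham, 1 = spam
weights = balanced_class_weights({0: 904, 1: 128})
print({k: round(v, 4) for k, v in weights.items()})
```

With this class ratio, each spam example counts for about seven times as much as a ham example in the loss.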

## Threshold search on validation (optimize spam‑class F1)

In [5]:
import numpy as np
from sklearn.metrics import precision_recall_curve, average_precision_score, f1_score, classification_report, confusion_matrix

def scores(m, X):
    # Prefer raw decision_function scores (logits for LogisticRegression); otherwise use P(spam)
    return m.decision_function(X) if hasattr(m, "decision_function") else m.predict_proba(X)[:,1]

s_val = scores(pipe, X_va)
prec, rec, thr = precision_recall_curve(y_va, s_val)
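# prec and rec have one more entry than thr; drop the final (precision=1, recall=0) point so indices align
f1 = 2*prec[:-1]*rec[:-1]/(prec[:-1]+rec[:-1]+1e-12)
best = np.nanargmax(f1); t_best = thr[best]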
print(f"Val: AUPRC={average_precision_score(y_va, s_val):.3f} | best F1={f1[best]:.3f} at t={t_best:.3f}")

Val: AUPRC=0.941 | best F1=0.905 at t=0.085
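
Because `scores` returns `decision_function` values for this pipeline, the reported threshold t=0.085 is a logit, not a probability. A minimal sketch of the equivalent probability cutoff, assuming the logistic mapping from the task description:

```python
import math

t_logit = 0.085  # threshold reported in the validation output above
t_prob = 1.0 / (1.0 + math.exp(-t_logit))  # p = 1 / (1 + e^-z)

print(f"{t_prob:.3f}")  # 0.521
```

The same decision rule in probability space would therefore be roughly `p_spam >= 0.521`.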


## Test once at locked threshold & confusion matrix

In [6]:
s_te = scores(pipe, X_te)
y_hat = (s_te >= t_best).astype(int)
print(classification_report(y_te, y_hat, digits=4))
print("Confusion matrix:\n", confusion_matrix(y_te, y_hat))

              precision    recall  f1-score   support

           0     0.9878    0.9856    0.9867       904
           1     0.9000    0.9141    0.9070       128

    accuracy                         0.9767      1032
   macro avg     0.9439    0.9498    0.9468      1032
weighted avg     0.9769    0.9767    0.9768      1032

Confusion matrix:
 [[891  13]
 [ 11 117]]
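
As a sanity check, the spam-class row of the report can be recomputed directly from the confusion matrix (rows are true labels, columns are predictions, in scikit-learn's convention):

```python
# Values copied from the confusion matrix above
tn, fp = 891, 13
fn, tp = 11, 117

precision = tp / (tp + fp)                       # 117 / 130
recall = tp / (tp + fn)                          # 117 / 128
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tp + tn + fp + fn)       # 1008 / 1032

print(f"{precision:.4f} {recall:.4f} {f1:.4f} {accuracy:.4f}")
# 0.9000 0.9141 0.9070 0.9767
```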


## 5‑fold CV for mean ± std

In [7]:
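from sklearn.base import clone
from sklearn.model_selection import StratifiedKFold
skf = StratifiedKFold(5, shuffle=True, random_state=SEED)
f1s, aprs = [], []
for tr, va in skf.split(X, y):
    # Fit a fresh copy per fold so the `pipe` trained on X_tr is not overwritten
    m = clone(pipe).fit(X.iloc[tr], y[tr])
    s = scores(m, X.iloc[va])
    # s is a decision_function logit, so 0.5 is a fixed logit cutoff (p ≈ 0.62), not the tuned t_best
    f1s.append(f1_score(y[va], (s>=0.5).astype(int)))
    aprs.append(average_precision_score(y[va], s))
print(f"CV F1={np.mean(f1s):.3f}±{np.std(f1s):.3f} | CV AUPRC={np.mean(aprs):.3f}±{np.std(aprs):.3f}")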

CV F1=0.931±0.013 | CV AUPRC=0.973±0.011


## Deployment‑style API

In [8]:
def predict_spam_proba(texts: list[str]) -> list[float]:
    """Return p_spam in [0, 1] for each input text."""
    if hasattr(pipe, "predict_proba"):
        return pipe.predict_proba(texts)[:, 1].tolist()
    # Fallback for models that expose only a logit z: p = 1/(1+e^-z)
    z = pipe.decision_function(texts)
    return (1 / (1 + np.exp(-z))).tolist()