# Task 2 — Spam Email Classifier (Naive Bayes)
**Machine Learning**

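**Goal:** Build a spam vs. ham email classifier using TF-IDF features + Naive Bayes.

**Dataset:** Upload the Kaggle CSV. The file used in this notebook is pre-vectorized (about 3,000 word-count columns plus a binary `Prediction` label) rather than two raw columns like `label`, `text`, and the code below expects that pre-vectorized layout.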

**Steps:**
- Load dataset
- Clean & split
- Vectorize (`TfidfTransformer` on the pre-computed word counts; `TfidfVectorizer` applies only to raw text)
- Train `MultinomialNB`
- Evaluate accuracy, precision, recall, F1
- Test with custom samples

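For a dataset that does ship raw text in two columns (`label`, `text`), the steps above can be sketched as a single pipeline. This is a minimal illustration, not the code this notebook runs: the six messages and their labels are made up, and the variable names (`texts`, `labels`, `clf`) are assumptions for the example only.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Made-up toy data standing in for a two-column (label, text) CSV.
texts = [
    "win a free prize now",
    "free money click now",
    "claim your free prize",
    "meeting agenda for monday",
    "please review the attached report",
    "lunch on monday with the team",
]
labels = ["spam", "spam", "spam", "ham", "ham", "ham"]

# TfidfVectorizer tokenizes the raw text and applies TF-IDF weighting in one step;
# the pipeline then feeds those weights to MultinomialNB.
clf = make_pipeline(TfidfVectorizer(lowercase=True), MultinomialNB(alpha=1.0))
clf.fit(texts, labels)

# Predict on two unseen messages.
preds = clf.predict(["free prize now", "monday team meeting"])
print(list(preds))
```

Wrapping the vectorizer and the classifier in one pipeline keeps the vocabulary and IDF weights learned on the training text only, which matters once a train/test split or cross-validation is added.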

In [1]:
# ---- Utility: SAFE_READ_CSV (no google.colab required) ----
import os, pandas as pd

def SAFE_READ_CSV(preferred_paths, fallback_msg):
    # Try a list of paths. If not found, ask for a manual path via input().
    for p in preferred_paths:
        if os.path.exists(p):
            try:
                df = pd.read_csv(p)
                print(f"Loaded dataset from: {p}")
                return df
            except Exception as e:
                print(f"Found {p} but couldn't read it as CSV: {e}")
    print(fallback_msg)
    manual = input("➡ Enter full path to your CSV (or press Enter to cancel): ").strip()
    if manual:
        if not os.path.exists(manual):
            raise FileNotFoundError(f"Path does not exist: {manual}")
        return pd.read_csv(manual)
    raise FileNotFoundError("CSV not found. Please place the file next to this notebook or give a valid path.")


In [None]:
# --- Spam Email Classifier (Kaggle "Email Spam Classification Dataset CSV") ---
# Dataset: https://www.kaggle.com/datasets/balaka18/email-spam-classification-dataset-csv
# The CSV is PRE-VECTORIZED: ~3000 word-count columns + a binary label column.
# I convert counts -> TF-IDF with TfidfTransformer, then train NB models.

import os, pandas as pd, numpy as np
from scipy import sparse
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.naive_bayes import MultinomialNB, ComplementNB
from sklearn.feature_extraction.text import TfidfTransformer

# 1) Load CSV
PREFERRED_PATHS = [
    "data/email_spam.csv", "emails.csv", "email_spam.csv",
    "/mnt/data/email_spam.csv", "/mnt/data/emails.csv"
]

def load_csv(paths):
    for p in paths:
        if os.path.exists(p):
            try:
                try:
                    df = pd.read_csv(p)
                except UnicodeDecodeError:
                    df = pd.read_csv(p, encoding="latin-1")
                print(f"Loaded dataset from: {p}")
                return df
            except Exception as e:
                print(f"Found {p} but could not read: {e}")
    manual = input("➡ Enter full path to your Kaggle emails CSV: ").strip()
    if not manual or not os.path.exists(manual):
        raise FileNotFoundError("CSV not found. Put it next to the notebook or in data/.")
    try:
        return pd.read_csv(manual)
    except UnicodeDecodeError:
        # Fall back to latin-1 for files that are not valid UTF-8.
        return pd.read_csv(manual, encoding="latin-1")

df = load_csv(PREFERRED_PATHS)
print("Shape:", df.shape)
display(df.head())

# 2) Identify label + drop non-feature ID columns
cols_lower = [str(c).strip().lower() for c in df.columns]
label_col = None
for name in ("prediction", "label", "target", "class"):
    if name in cols_lower:
        label_col = df.columns[cols_lower.index(name)]
        break
if label_col is None:
    label_col = df.columns[-1]  # default: last column is label (0=ham, 1=spam)

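# Drop obvious identifier columns if present. In a word-count matrix, names such as
# "email", "name", or "id" may also be vocabulary words, so drop those only when non-numeric.
drop_ids = []
for cand in ("email no.", "email name", "email_no", "email_name", "email", "name", "id"):
    if cand in cols_lower:
        col = df.columns[cols_lower.index(cand)]
        ambiguous = cand in ("email", "name", "id")
        if not ambiguous or not pd.api.types.is_numeric_dtype(df[col]):
            drop_ids.append(col)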

X = df.drop(columns=[label_col] + drop_ids, errors="ignore")
y = df[label_col]

# Ensure numeric features (safe even if already numeric)
X = X.apply(pd.to_numeric, errors="coerce").fillna(0)
assert X.shape[1] > 0, "No feature columns detected."

# 3) Train/test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42,
    stratify=y if len(np.unique(y)) > 1 else None
)

# 4) Counts -> TF-IDF weights (for count matrices use TfidfTransformer, not TfidfVectorizer)
tfidf = TfidfTransformer(norm="l2", use_idf=True, smooth_idf=True, sublinear_tf=False)
X_train_tfidf = tfidf.fit_transform(sparse.csr_matrix(X_train.values))
X_test_tfidf  = tfidf.transform(sparse.csr_matrix(X_test.values))

# 5) Models & hyperparameters
param_grid = {"alpha": [0.05, 0.1, 0.3, 0.5, 1.0]}

mn_nb = GridSearchCV(MultinomialNB(), param_grid, cv=5, n_jobs=-1, scoring="accuracy")
mn_nb.fit(X_train_tfidf, y_train)

cb_nb = GridSearchCV(ComplementNB(), param_grid, cv=5, n_jobs=-1, scoring="accuracy")
cb_nb.fit(X_train_tfidf, y_train)

models = {
    "MultinomialNB": mn_nb.best_estimator_,
    "ComplementNB": cb_nb.best_estimator_,
}

# 6) Evaluate & pick best
def evaluate(model, Xtr, ytr, Xte, yte, name):
    pred = model.predict(Xte)
    acc = accuracy_score(yte, pred)
    print(f"\n=== {name} ===")
    print("Best alpha:", getattr(model, "alpha", None))
    print("Accuracy:", acc)
    print("\nClassification Report:\n",
          classification_report(yte, pred, target_names=['ham','spam'] if 1 in np.unique(yte) else None))
    print("Confusion Matrix:\n", confusion_matrix(yte, pred))
    return acc, pred

scores = {}
for name, model in models.items():
    acc, _ = evaluate(model, X_train_tfidf, y_train, X_test_tfidf, y_test, name)
    scores[name] = acc

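# Note: picking the winner by test accuracy makes the reported score slightly optimistic;
# comparing mn_nb.best_score_ and cb_nb.best_score_ (cross-validated) would keep the test set untouched.
best_name = max(scores, key=scores.get)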
best_model = models[best_name]
print(f"\n🏁 Best model: {best_name} (alpha={best_model.alpha}) with accuracy={scores[best_name]:.4f}")

# 7) (Optional) Save the winning model + TF-IDF transformer for later use
# from joblib import dump
# dump({"model": best_model, "tfidf": tfidf, "feature_names": X.columns.tolist()}, "spam_nb_model.joblib")
# print("Saved spam_nb_model.joblib")


Loaded dataset from: data/email_spam.csv
Shape: (5172, 3002)


Unnamed: 0,Email No.,the,to,ect,and,for,of,a,you,hou,...,connevey,jay,valued,lay,infrastructure,military,allowing,ff,dry,Prediction
0,Email 1,0,0,1,0,0,0,2,0,0,...,0,0,0,0,0,0,0,0,0,0
1,Email 2,8,13,24,6,6,2,102,1,27,...,0,0,0,0,0,0,0,1,0,0
2,Email 3,0,0,1,0,0,0,8,0,0,...,0,0,0,0,0,0,0,0,0,0
3,Email 4,0,5,22,0,5,1,51,2,10,...,0,0,0,0,0,0,0,0,0,0
4,Email 5,7,6,17,1,5,2,57,0,9,...,0,0,0,0,0,0,0,1,0,0



=== MultinomialNB ===
Best alpha: 0.05
Accuracy: 0.9584541062801932

Classification Report:
               precision    recall  f1-score   support

         ham       0.98      0.97      0.97       735
        spam       0.92      0.94      0.93       300

    accuracy                           0.96      1035
   macro avg       0.95      0.95      0.95      1035
weighted avg       0.96      0.96      0.96      1035

Confusion Matrix:
 [[710  25]
 [ 18 282]]

=== ComplementNB ===
Best alpha: 1.0
Accuracy: 0.9410628019323671

Classification Report:
               precision    recall  f1-score   support

         ham       0.99      0.93      0.96       735
        spam       0.85      0.97      0.91       300

    accuracy                           0.94      1035
   macro avg       0.92      0.95      0.93      1035
weighted avg       0.95      0.94      0.94      1035

Confusion Matrix:
 [[683  52]
 [  9 291]]

🏁 Best model: MultinomialNB (alpha=0.05) with accuracy=0.9585
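
The accuracy, precision, recall, and F1 figures in the reports above can be re-derived from the two confusion matrices. The sketch below does that arithmetic by hand; `binary_metrics` is a hypothetical helper written for this check, and it assumes scikit-learn's convention of rows as true labels and columns as predictions, ordered ham then spam.

```python
import numpy as np

def binary_metrics(cm):
    """Accuracy, precision, recall, and F1 for the positive (spam) class.

    Assumes rows are true labels and columns are predictions, ordered [ham, spam].
    """
    tn, fp, fn, tp = np.asarray(cm).ravel()
    accuracy = (tp + tn) / (tn + fp + fn + tp)
    precision = tp / (tp + fp)   # of everything flagged as spam, how much was spam
    recall = tp / (tp + fn)      # of all real spam, how much was caught
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

# Confusion matrices copied from the output above.
reported = {
    "MultinomialNB": [[710, 25], [18, 282]],
    "ComplementNB": [[683, 52], [9, 291]],
}
for name, cm in reported.items():
    acc, prec, rec, f1 = binary_metrics(cm)
    print(f"{name}: accuracy={acc:.4f} precision={prec:.2f} recall={rec:.2f} f1={f1:.2f}")
```

Read this way, the two models trade off differently: ComplementNB misses fewer spam emails (9 vs. 18) but flags more legitimate mail as spam (52 vs. 25), which is why its spam recall is higher and its spam precision lower.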

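The "test with custom samples" step is not covered by the cell above. Because the CSV is pre-vectorized, a raw email first has to be mapped onto the same word columns the model was trained on. One possible approach, sketched here under assumptions, is a `CountVectorizer` with a fixed vocabulary. The six-word vocabulary, the count matrix, and the sample emails are all made up so the block runs on its own; in the notebook, `feature_names` would be `X.columns.tolist()`, and `tfidf` and `model` would be the fitted transformer and `best_model`. The tokenization used to build the Kaggle file is not documented here, so the default `CountVectorizer` tokenizer (which, for instance, ignores single-character words) may not match it exactly.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical stand-ins for the notebook's objects: a tiny vocabulary in place of
# X.columns, and a made-up count matrix in place of the Kaggle word counts.
feature_names = ["free", "prize", "click", "meeting", "report", "monday"]
X_counts = np.array([
    [3, 2, 1, 0, 0, 0],
    [2, 0, 2, 0, 0, 0],
    [0, 0, 0, 2, 1, 1],
    [0, 0, 0, 1, 2, 0],
])
y = np.array([1, 1, 0, 0])  # 1 = spam, 0 = ham

# Same two stages as the notebook: counts -> TF-IDF -> Naive Bayes.
tfidf = TfidfTransformer()
model = MultinomialNB(alpha=0.05).fit(tfidf.fit_transform(X_counts), y)

# A CountVectorizer with a fixed vocabulary maps raw text onto the training columns,
# in the same order; words outside the vocabulary are ignored.
to_counts = CountVectorizer(vocabulary=feature_names, lowercase=True)

samples = [
    "Click now to claim your FREE prize",
    "Monday meeting: please bring the report",
]
sample_counts = to_counts.transform(samples)  # no fit needed with a fixed vocabulary
preds = model.predict(tfidf.transform(sample_counts))
for text, p in zip(samples, preds):
    print("spam" if p == 1 else "ham", "<-", text)
```

Reusing the fitted `tfidf` object, rather than fitting a new one on the samples, keeps the IDF weights identical to those seen during training.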