# 🤖 Notebook 03: Model Training

In this notebook, I train a machine learning model to classify emails as spam or ham based on cleaned text.

---

## 📥 Step 1: Load Preprocessed Data

I load the `processed_emails.csv` file that I saved after cleaning in the previous notebook.

This version includes:
- `label`: spam or ham
- `processed_text`: cleaned and tokenized email content

Next, I transform the text into features for training the model.

In [1]:
import pandas as pd
import sys
sys.path.append('../src')

df = pd.read_csv('../data/processed_emails.csv')
df[['label', 'processed_text']].head()

Unnamed: 0,label,processed_text
0,spam,nextpart content type text html charset iso co...
1,spam,mailings sent complying proposed unsolicited c...
2,spam,need health insurance addition featuring large...
3,spam,html align center font ptsize family sansserif...
4,spam,worldwide great restaurants shopping activitie...


## Step 2: TF-IDF Vectorization

I convert the cleaned email text into numeric features using `TfidfVectorizer`.

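- Removes English stopwords (`stop_words='english'`)
- Ignores terms that appear in more than 85% of emails (`max_df=0.85`) or in fewer than 3 emails (`min_df=3`)
- Includes unigrams and bigrams (`ngram_range=(1, 2)`)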

This transforms each email into a vector of word importance scores.

In [2]:
from sklearn.feature_extraction.text import TfidfVectorizer

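vectorizer = TfidfVectorizer(
    stop_words='english',  # drop common English stopwords
    max_df=0.85,           # ignore terms found in more than 85% of emails
    min_df=3,              # ignore terms found in fewer than 3 emails
    ngram_range=(1, 2)     # use unigrams and bigrams
)

X = vectorizer.fit_transform(df['processed_text'])  # Features
y = df['label']                                     # Target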


## Step 3: Train-Test Split

I split the data into training (80%) and test (20%) sets while keeping the spam/ham ratio consistent using `stratify=y`.


In [3]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
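
As a quick sanity check on what `stratify` does, the sketch below splits a made-up label list and counts the classes on each side. The 60/40 labels and the `toy_` names are assumptions for illustration only.

```python
from collections import Counter

from sklearn.model_selection import train_test_split

# Hypothetical labels: 60 ham and 40 spam (not the real dataset).
toy_y = ["ham"] * 60 + ["spam"] * 40
toy_X = list(range(len(toy_y)))  # placeholder features, one per label

_, _, y_tr, y_te = train_test_split(
    toy_X, toy_y, test_size=0.2, stratify=toy_y, random_state=42
)

# With stratify, both sides keep the original 60/40 class ratio.
train_counts = Counter(y_tr)
test_counts = Counter(y_te)
print(train_counts)
print(test_counts)
```

The same check can be run on the real split with `y_train.value_counts(normalize=True)` and `y_test.value_counts(normalize=True)`.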

## Step 4: Train Naive Bayes Classifier

I use `MultinomialNB`, which works well for text classification problems like spam filtering.  
The model learns word patterns that distinguish spam from ham.

In [4]:
from sklearn.naive_bayes import MultinomialNB

model = MultinomialNB()
model.fit(X_train, y_train)


MultinomialNB(alpha=1.0, force_alpha=True, fit_prior=True, class_prior=None)


## Step 5: Evaluate Model Performance

I predict labels on the test set and print the confusion matrix + classification report.

This shows how well the model catches spam and avoids false positives.


In [5]:
from sklearn.metrics import classification_report, confusion_matrix

y_pred = model.predict(X_test)

print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))


[[777   4]
 [ 41 339]]
              precision    recall  f1-score   support

         ham       0.95      0.99      0.97       781
        spam       0.99      0.89      0.94       380

    accuracy                           0.96      1161
   macro avg       0.97      0.94      0.95      1161
weighted avg       0.96      0.96      0.96      1161



In [6]:
import joblib

joblib.dump(model, '../models/naive_bayes_spam_model.pkl')
joblib.dump(vectorizer, '../models/tfidf_vectorizer.pkl')

['../models/tfidf_vectorizer.pkl']
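
For reference, a minimal sketch of how a saved model and vectorizer pair might be loaded and used later. It trains on a tiny made-up dataset and writes to a temporary directory, so the texts, file names, and variable names here are all hypothetical.

```python
import os
import tempfile

import joblib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Tiny hypothetical training set standing in for the processed emails.
texts = [
    "free prize claim",
    "win free money",
    "meeting agenda notes",
    "project status meeting",
]
labels = ["spam", "spam", "ham", "ham"]

vec = TfidfVectorizer()
clf = MultinomialNB().fit(vec.fit_transform(texts), labels)

with tempfile.TemporaryDirectory() as tmp:
    model_path = os.path.join(tmp, "model.pkl")
    vec_path = os.path.join(tmp, "vectorizer.pkl")
    joblib.dump(clf, model_path)
    joblib.dump(vec, vec_path)

    # Later (for example in another script): load both artifacts.
    loaded_clf = joblib.load(model_path)
    loaded_vec = joblib.load(vec_path)

# New text should go through the same cleaning as the training data,
# then through transform (not fit_transform) of the saved vectorizer.
new_text = ["claim free prize"]
prediction = loaded_clf.predict(loaded_vec.transform(new_text))
print(prediction)
```

The vectorizer has to be saved alongside the model because the model only understands the column layout of the vocabulary it was trained on.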

## Model Evaluation Summary

The Naive Bayes model performs well overall:

- **Accuracy:** 96% of emails were correctly classified
- **Spam Recall:** 0.89 → it correctly identified 89% of spam emails
- **Spam Precision:** 0.99 → when it predicts spam, it's almost always right
- **F1 Score for Spam:** 0.94 → strong balance between precision and recall

📌 Only 4 ham emails were misclassified as spam, and 41 spam emails were missed. Not bad!
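
The figures above can be recomputed from the Naive Bayes confusion matrix. This is a small sketch that treats spam as the positive class and assumes the `[ham, spam]` label order shown in the classification report.

```python
import numpy as np

# Naive Bayes confusion matrix from this notebook: rows are true labels,
# columns are predicted labels, both in the order [ham, spam].
cm = np.array([[777, 4],
               [41, 339]])

# With spam as the positive class:
# tn = ham kept as ham, fp = ham flagged as spam,
# fn = spam missed,     tp = spam caught.
tn, fp, fn, tp = cm.ravel()

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / cm.sum()

print(f"precision={precision:.2f} recall={recall:.2f} "
      f"f1={f1:.2f} accuracy={accuracy:.2f}")
# prints: precision=0.99 recall=0.89 f1=0.94 accuracy=0.96
```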

## Model Comparison: Logistic Regression vs Naive Bayes

To compare models, I trained a **Logistic Regression** classifier using the same TF-IDF features.

In [7]:
from sklearn.linear_model import LogisticRegression

logreg_model = LogisticRegression(max_iter=1000)  # increased to ensure convergence
logreg_model.fit(X_train, y_train)

LogisticRegression(penalty='l2', dual=False, tol=0.0001, C=1.0, fit_intercept=True, intercept_scaling=1, class_weight=None, random_state=None, solver='lbfgs', max_iter=1000)


In [8]:
from sklearn.metrics import classification_report, confusion_matrix

logreg_pred = logreg_model.predict(X_test)

print(confusion_matrix(y_test, logreg_pred))
print(classification_report(y_test, logreg_pred))


[[777   4]
 [ 32 348]]
              precision    recall  f1-score   support

         ham       0.96      0.99      0.98       781
        spam       0.99      0.92      0.95       380

    accuracy                           0.97      1161
   macro avg       0.97      0.96      0.96      1161
weighted avg       0.97      0.97      0.97      1161



In [9]:
probs = logreg_model.predict_proba(X_test)

print(probs[0]) 

[0.01063197 0.98936803]
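
The columns of `predict_proba` follow the order of `classes_`, which scikit-learn keeps sorted. Assuming the labels are the strings `ham` and `spam` as in the dataframe, the output above reads as roughly 1% ham and 99% spam for the first test email. The sketch below shows this on made-up data, along with one possible way to apply a custom spam threshold. The texts and the 0.4 threshold are hypothetical.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Tiny hypothetical dataset, for illustration only.
texts = [
    "free prize claim",
    "win free money",
    "cheap offer click",
    "meeting agenda notes",
    "project status meeting",
    "lunch with team",
]
labels = ["spam", "spam", "spam", "ham", "ham", "ham"]

X_toy = TfidfVectorizer().fit_transform(texts)
clf = LogisticRegression(max_iter=1000).fit(X_toy, labels)

# Column order of predict_proba matches clf.classes_ (sorted labels).
print(clf.classes_)
spam_col = list(clf.classes_).index("spam")
toy_probs = clf.predict_proba(X_toy)
spam_prob = toy_probs[:, spam_col]

# Hypothetical custom threshold. For two classes, predict() picks the
# class with probability above 0.5; a lower threshold flags more spam.
threshold = 0.4
custom_pred = np.where(spam_prob >= threshold, "spam", "ham")
print(custom_pred)
```

Lowering the threshold trades precision for recall, so any value other than 0.5 would need to be checked against the confusion matrix.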


### Results Summary:

- **Accuracy:** 97% (slightly higher than Naive Bayes at 96%)
- **Spam Recall:** 0.92 (up from 0.89 for Naive Bayes)
- **Spam Precision:** 0.99 (same as Naive Bayes)
- **F1 Score (Spam):** 0.95 (vs 0.94 for Naive Bayes)



### Interpretation:

Both models perform similarly, but **Logistic Regression has slightly better spam recall** (0.92 vs 0.89), which matters for spam detection because I want to catch as much spam as possible.  
For this project, I will use a `VotingClassifier` to combine both models (NB and LR) and try to get the best of each.

## Voting Classifier (Ensemble Voting)

In [10]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import VotingClassifier


In [11]:
nb_model = MultinomialNB()
lr_model = LogisticRegression(max_iter=1000)

voting_model = VotingClassifier(
    estimators=[('nb', nb_model), ('lr', lr_model)],
    voting='soft'  # average class probabilities
)
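
To illustrate what `voting='soft'` means, the sketch below fits a similar ensemble on made-up data and checks that its probabilities equal the plain average of the two fitted base models' probabilities. The toy texts and variable names are hypothetical.

```python
import numpy as np
from sklearn.ensemble import VotingClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB

# Tiny hypothetical dataset, for illustration only.
texts = [
    "free prize claim",
    "win free money",
    "cheap offer click",
    "meeting agenda notes",
    "project status meeting",
    "lunch with team",
]
labels = ["spam", "spam", "spam", "ham", "ham", "ham"]
X_toy = TfidfVectorizer().fit_transform(texts)

ensemble = VotingClassifier(
    estimators=[
        ("nb", MultinomialNB()),
        ("lr", LogisticRegression(max_iter=1000)),
    ],
    voting="soft",
).fit(X_toy, labels)

# Soft voting without weights: the ensemble probability is the mean of
# the fitted base estimators' class probabilities.
manual = np.mean(
    [est.predict_proba(X_toy) for est in ensemble.estimators_], axis=0
)
ensemble_probs = ensemble.predict_proba(X_toy)
print(np.allclose(manual, ensemble_probs))

# The predicted class is the one with the highest averaged probability.
manual_labels = ensemble.classes_[np.argmax(manual, axis=1)]
print(manual_labels)
```

With `voting='hard'` the ensemble would instead take a majority vote over predicted labels, which is less informative with only two base models.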


In [12]:
voting_model.fit(X_train, y_train)


VotingClassifier(estimators=[('nb', ...), ('lr', ...)], voting='soft', weights=None, n_jobs=None, flatten_transform=True, verbose=False)
  nb: MultinomialNB(alpha=1.0, force_alpha=True, fit_prior=True, class_prior=None)
  lr: LogisticRegression(penalty='l2', dual=False, tol=0.0001, C=1.0, fit_intercept=True, intercept_scaling=1, class_weight=None, random_state=None, solver='lbfgs', max_iter=1000)


In [13]:
from sklearn.metrics import classification_report, confusion_matrix

y_pred = voting_model.predict(X_test)

print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))


[[776   5]
 [ 35 345]]
              precision    recall  f1-score   support

         ham       0.96      0.99      0.97       781
        spam       0.99      0.91      0.95       380

    accuracy                           0.97      1161
   macro avg       0.97      0.95      0.96      1161
weighted avg       0.97      0.97      0.97      1161



In [14]:
import joblib

joblib.dump(voting_model, '../models/spam_classifier_voting_model.pkl')
joblib.dump(vectorizer, '../models/tfidf_vectorizer.pkl')


['../models/tfidf_vectorizer.pkl']

### Voting Classifier Summary

- **Accuracy**: **97%**, on par with Logistic Regression and above Naive Bayes (96%)  
- **Spam Recall**: **0.91** → caught 91% of spam  
- **Spam Precision**: **0.99** → very few false positives  
- **F1 Score (Spam)**: **0.95** → excellent balance between precision and recall

 **Only 5 ham and 35 spam emails were misclassified.**
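
The rounded report values hide small differences between the three models. A minimal sketch that recomputes error counts and unrounded accuracy from the confusion matrices printed in this notebook (the dictionary keys are just labels chosen for this example):

```python
import numpy as np

# Confusion matrices copied from the outputs in this notebook,
# rows = true labels, columns = predicted labels, order [ham, spam].
matrices = {
    "Naive Bayes": np.array([[777, 4], [41, 339]]),
    "Logistic Regression": np.array([[777, 4], [32, 348]]),
    "Voting (soft)": np.array([[776, 5], [35, 345]]),
}

summary = {}
for name, cm in matrices.items():
    correct = int(np.trace(cm))        # diagonal = correctly classified
    errors = int(cm.sum()) - correct   # off-diagonal = misclassified
    accuracy = correct / int(cm.sum())
    summary[name] = (errors, accuracy)
    print(f"{name}: {errors} errors, accuracy {accuracy:.4f}")
```

These counts come from a single test split of 1,161 emails, so differences of a few emails may not be meaningful on their own.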


### Final Verdict:

**The voting ensemble is the model I save for deployment.**  
It combines Naive Bayes and Logistic Regression, improves on Naive Bayes alone (40 vs 45 test errors), and performs on par with Logistic Regression (40 vs 36 test errors).
