# Training the ML Model

In [13]:
## Loading the training set into a Pandas dataframe
import pandas as pd
labeled_df = pd.read_csv(r'C:\Users\u411296\OneDrive - United Airlines\Documents\My Space\Upskilling Msyelf\Machine Learning\Personal Gmail Classifier ML Project\data\labeled_emails.csv')

labeled_df.head()

| Unnamed: 0 | message_id | subject | snippet | sender | sender_domain | internal_date | is_important | is_promo | is_spam | label |
|------|------|------|------|------|------|------|------|------|------|------|
| 0 | 17e99c136d770537 | Your mobile recharge for Rs. 15.00 is success... | Amazon.in Recharges Dear customer, Your rechar... | Amazon Pay <noreply@amazonpay.in> | amazonpay.in | 1/27/2022 | True | False | False | important |
| 1 | 18c39485254e45a4 | Your order is on its way | Download app The one you&#39;ve been waiting f... | RENTOMOJO <noreply@rentomojo.com> | rentomojo.com | 12/5/2023 | True | False | False | important |
| 2 | 182c941da6fa286d | Discover Fresh Arrivals for Our Brand New Cate... | Shop for ₹599 \| Get Upto 40% Off \| Code: MAMA4... | Mamaearth <support@info.mamaearth.in> | info.mamaearth.in | 8/23/2022 | True | False | False | promotional |
| 3 | 190b1e939cc8f72b | Product registration confirmation | PRODUCTS SUPPORT PRODUCT REGISTRATION Dear Pri... | Sony India Product Registration System <no-rep... | alerts.sony.co.in | 7/14/2024 | True | False | False | important |
| 4 | 17b4f317cb3f7b10 | Online Live Project / Work Experience Program ... | Dear Sri Venkateswara College Students, Here i... | Finlatics Hub <finlatics@fincruxtech.com> | fincruxtech.com | 8/16/2021 | True | False | False | important |


## Class Balance

In [6]:
labeled_df["label"].value_counts()

label
promotional    510
spam           399
important      290
Name: count, dtype: int64

After sampling emails based on their Gmail labels and manually labelling each one by its relevance to me, the training dataset contains 1,199 emails with the following composition:

| Label         | Count            |
| ------------- | ------------------- |
| `important`   | 290   |
| `promotional` | 510 |
| `spam`        | 399 |

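The dataset is not perfectly balanced, but the **class balance is healthy** because:
- No class is tiny: the smallest (`important`) still makes up about 24% of the data
- No class dominates: the largest (`promotional`) is about 43% of the data
- The model will see enough examples of each class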
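The shares quoted for each class follow directly from the counts. A minimal sketch that recomputes them, with the counts copied by hand from the `value_counts()` output rather than read from `labeled_df`:

```python
# Class counts copied from the value_counts() output above.
counts = {"promotional": 510, "spam": 399, "important": 290}

total = sum(counts.values())
shares = {label: n / total for label, n in counts.items()}

# Ratio of the largest class to the smallest; 1.0 would be perfectly balanced.
imbalance_ratio = max(counts.values()) / min(counts.values())

for label, share in shares.items():
    print(f"{label:<12} {counts[label]:>4}  {share:.1%}")
print(f"{'total':<12} {total:>4}")
print(f"largest / smallest = {imbalance_ratio:.2f}")
```

A largest-to-smallest ratio below 2 is mild; it is the kind of imbalance that a stratified split and per-class metrics can handle without resampling.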

## Identifying Features and Target

In [14]:
# Features
X = labeled_df[["subject", "snippet"]]

# Target
y = labeled_df["label"]

## Splitting the data into Training and Test Set 

We need to preserve the class proportions in both the training and test sets so that:
- Each class is represented in both sets
- Per-class evaluation is based on enough examples to be meaningful

Hence,

> **We'll use Stratified Split**.

In [15]:
from sklearn.model_selection import train_test_split

X = labeled_df[["subject", "snippet"]]
y = labeled_df["label"]

X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.2,
    stratify=y,
    random_state=42
)
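One way to confirm that stratification worked is to compare class shares in the full dataset with those in the test set. The sketch below does this by hand, assuming the test-set counts are the `support` values printed by the classification report later in this notebook (58 / 102 / 80):

```python
# Full-dataset counts (from value_counts) and test-set counts
# (the "support" column of the classification report).
full_counts = {"important": 290, "promotional": 510, "spam": 399}
test_counts = {"important": 58, "promotional": 102, "spam": 80}

full_total = sum(full_counts.values())
test_total = sum(test_counts.values())

max_diff = 0.0
for label in full_counts:
    full_share = full_counts[label] / full_total
    test_share = test_counts[label] / test_total
    max_diff = max(max_diff, abs(full_share - test_share))
    print(f"{label:<12} full={full_share:.3f}  test={test_share:.3f}")

print(f"test size: {test_total} of {full_total}")
print(f"largest share difference: {max_diff:.4f}")
```

With a real DataFrame the same check could be done with `y.value_counts(normalize=True)` and `y_test.value_counts(normalize=True)`.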


In [16]:
## Combining both the text fields

X_train_text = (
    X_train["subject"].fillna("") + " " +
    X_train["snippet"].fillna("")
)

X_test_text = (
    X_test["subject"].fillna("") + " " +
    X_test["snippet"].fillna("")
)


## TF-IDF Vectorization

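We'll use `TfidfVectorizer` to convert the combined text into sparse vectors of TF-IDF scores that the model can be trained on. The vectorizer is fitted on the training text only and then applied to the test text, so no vocabulary or document-frequency information leaks from the test set.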

In [4]:
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(
    stop_words="english",
    ngram_range=(1, 2),
    min_df=2,
    max_df=0.9,
    sublinear_tf=True
)

X_train_tfidf = vectorizer.fit_transform(X_train_text)
X_test_tfidf = vectorizer.transform(X_test_text)
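To make the TF-IDF scores less of a black box, here is a rough pure-Python sketch of the weighting for a single document. It assumes scikit-learn's default smoothed IDF, `ln((1 + n) / (1 + df)) + 1`, together with `sublinear_tf=True` and L2 normalisation. The three-document corpus is made up, and stop-word removal, bigrams, and the `min_df` / `max_df` filters are left out for brevity:

```python
import math
from collections import Counter

# Hypothetical toy corpus, not taken from the dataset.
docs = [
    "order shipped order confirmed",
    "flat discount on your order",
    "discount discount discount sale",
]
tokenized = [doc.split() for doc in docs]
n_docs = len(tokenized)

# Document frequency: in how many documents each term appears.
df = Counter(term for tokens in tokenized for term in set(tokens))

def tfidf_vector(tokens):
    """TF-IDF weights for one document, L2-normalised."""
    tf = Counter(tokens)
    weights = {}
    for term, count in tf.items():
        sublinear_tf = 1.0 + math.log(count)                 # sublinear_tf=True
        idf = math.log((1 + n_docs) / (1 + df[term])) + 1.0  # smoothed IDF
        weights[term] = sublinear_tf * idf
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {term: w / norm for term, w in weights.items()}

vec = tfidf_vector(tokenized[0])
for term, weight in sorted(vec.items(), key=lambda kv: -kv[1]):
    print(f"{term:<10} {weight:.3f}")
```

In the first document "order" appears twice, but the sublinear scaling means it does not get twice the weight of a term that appears once; terms that occur in fewer documents get a higher IDF.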


## Fitting a Logistic Regression Model

We'll start by fitting a simple but strong baseline model to the data.

In [7]:
from sklearn.linear_model import LogisticRegression

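model = LogisticRegression(
    max_iter=1000
    # multi_class="multinomial" is omitted: the argument is deprecated since
    # scikit-learn 1.5, and the default lbfgs solver already fits a
    # multinomial model when there are three or more classes.
)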

model.fit(X_train_tfidf, y_train)




| Parameter | Value |
|-----------|-------|
| penalty | 'l2' |
| dual | False |
| tol | 0.0001 |
| C | 1.0 |
| fit_intercept | True |
| intercept_scaling | 1 |
| class_weight | None |
| random_state | None |
| solver | 'lbfgs' |
| max_iter | 1000 |
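With three classes and the `lbfgs` solver, the fitted model is a multinomial (softmax) logistic regression: every class gets a linear score, and softmax turns those scores into probabilities. A minimal sketch of that last step, using made-up scores for one email rather than the fitted coefficients:

```python
import math

def softmax(scores):
    """Convert raw class scores into probabilities that sum to 1."""
    shift = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

classes = ["important", "promotional", "spam"]
scores = [0.4, 1.6, -0.9]  # hypothetical scores, one per class

probs = softmax(scores)
predicted = classes[probs.index(max(probs))]

for label, p in zip(classes, probs):
    print(f"{label:<12} {p:.3f}")
print("predicted:", predicted)
```

`model.predict_proba` returns probabilities of this kind for every email, and `model.predict` picks the class with the highest one.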


## Model Evaluation

In [12]:
from sklearn.metrics import classification_report, confusion_matrix

y_pred = model.predict(X_test_tfidf)

print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))


              precision    recall  f1-score   support

   important       0.92      0.59      0.72        58
 promotional       0.73      0.97      0.84       102
        spam       1.00      0.85      0.92        80

    accuracy                           0.84       240
   macro avg       0.88      0.80      0.82       240
weighted avg       0.87      0.84      0.83       240

[[34 24  0]
 [ 3 99  0]
 [ 0 12 68]]


### Understanding Evaluation Metrics 

Every classifier makes two kinds of mistakes:

1. False Positive (FP) → saying X when it’s not X

2. False Negative (FN) → missing a real X

Different metrics care about different mistakes.

| Metric | Formula | What question it answers | What it penalizes | When it is important |
|------|--------|--------------------------|------------------|---------------------|
| Accuracy | Correct predictions / Total, i.e. (TP + TN) / Total in the binary case | How often is the model correct overall? | Treats all errors equally | Useful only when classes are balanced and all mistakes cost the same |
| Precision | TP / (TP + FP) | When the model predicts a class, how often is it correct? | False Positives (false alarms) | Important when false alarms are costly (e.g., marking spam as important) |
| Recall (Sensitivity) | TP / (TP + FN) | Of all real cases, how many did the model catch? | False Negatives (misses) | Important when missing a true case is costly (e.g., missing important emails) |
| F1-score | 2 × (Precision × Recall) / (Precision + Recall) | How well does the model balance precision and recall? | Imbalance between precision and recall | Useful when you care about both false alarms and misses |
| Support | Number of true samples per class | How many real examples exist for this class? | — | Helps interpret reliability of metrics |
| Macro Average | Unweighted mean of the metric across classes | How does the model perform when every class counts equally? | Poor performance on any class | Useful when all classes are equally important |
| Weighted Average | Mean of the metric weighted by class support | How does the model perform overall considering class frequency? | Poor performance on large classes | Useful when class distribution reflects real-world usage |
| Confusion Matrix | Counts of actual (rows) vs predicted (columns) classes | Where exactly is the model making mistakes? | Specific misclassifications | Best tool for understanding model behavior |
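The formulas in the table can be applied by hand to the confusion matrix printed above. The sketch below treats each class in turn as "positive" (one-vs-rest) and recomputes precision, recall and F1; the helper name `per_class_metrics` is only for illustration:

```python
labels = ["important", "promotional", "spam"]

# Rows = actual class, columns = predicted class (from the output above).
cm = [
    [34, 24, 0],
    [3, 99, 0],
    [0, 12, 68],
]

def per_class_metrics(cm, k):
    """Precision, recall and F1 for class index k, treated one-vs-rest."""
    tp = cm[k][k]
    fp = sum(row[k] for row in cm) - tp  # predicted k, actually another class
    fn = sum(cm[k]) - tp                 # actually k, predicted another class
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

metrics = {label: per_class_metrics(cm, k) for k, label in enumerate(labels)}
accuracy = sum(cm[k][k] for k in range(len(cm))) / sum(map(sum, cm))

for label, (p, r, f1) in metrics.items():
    print(f"{label:<12} precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
print(f"accuracy = {accuracy:.4f}")
```

The values agree with the classification report once rounded to two decimals, which is a useful check that rows are being read as actual classes and columns as predictions.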


### Metric Intuition (Final Mental Model)

> **Accuracy answers:** “How often was I right?”

> **Precision answers:** “Can I trust you?”

> **Recall answers:** “Did you miss anything important?”

> **F1-score answers:** “Am I balancing trust and coverage?”


### *Why can accuracy be misleading?*

Accuracy is a useful sanity check in binary classification, but on imbalanced data it is biased towards the majority class.

Our model is a multi-class classifier, where accuracy can be even more misleading because it hides:
- Which classes are failing
- Whether mistakes are serious or minor
- Bias toward majority classes

Hence, always inspect the class-wise precision and recall along with the confusion matrix.
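A small thought experiment shows the problem. The data below is hypothetical (it is not our test set): 95 promotional emails, 5 important ones, and a degenerate "model" that always predicts the majority class.

```python
# Hypothetical imbalanced test set: 95 "promotional" and 5 "important" emails.
y_true = ["promotional"] * 95 + ["important"] * 5

# A degenerate "model" that always predicts the majority class.
y_pred = ["promotional"] * len(y_true)

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def recall(label):
    """Share of emails whose true class is `label` that were predicted as `label`."""
    preds_for_label = [p for t, p in zip(y_true, y_pred) if t == label]
    return sum(p == label for p in preds_for_label) / len(preds_for_label)

print(f"accuracy           = {accuracy:.2f}")
print(f"promotional recall = {recall('promotional'):.2f}")
print(f"important recall   = {recall('important'):.2f}")
```

Accuracy alone would rate this model highly even though it never finds a single important email; per-class recall exposes that immediately.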




## Understanding our Model Results

#### Classification Report

| Class | Precision | Recall | F1-score | Support |
|------|-----------|--------|----------|---------|
| Important | **0.92** | **0.59** | 0.72 | 58 |
| Promotional | **0.73** | **0.97** | 0.84 | 102 |
| Spam | **1.00** | **0.85** | 0.92 | 80 |
| **Accuracy** |  |  | **0.84** | **240** |
| Macro Avg| 0.88 | 0.80 | 0.82 | 240 |
| Weighted Avg | 0.87 | 0.84 | 0.83 | 240 |


#### Confusion Matrix  

*(Rows = Actual, Columns = Predicted)*

| Actual \\ Predicted | Important | Promotional | Spam |
|--------------------|-----------|-------------|------|
| **Important** | 34 | 24 | 0 |
| **Promotional** | 3 | 99 | 0 |
| **Spam** | 0 | 12 | 68 |


### Key Observations

- **Important:**
  - High precision (0.92): predictions are trustworthy
  - Low recall (0.59): 24 of the 58 important emails are downgraded to promotional

- **Promotional:**
  - Very high recall (0.97): almost all promotions are caught
  - Moderate precision (0.73): 36 of the 135 emails predicted as promotional are actually important (24) or spam (12)

- **Spam:**
  - Perfect precision (1.00): no false spam alarms
  - Strong recall (0.85): most spam is correctly detected; the 12 misses all go to promotional
  - No spam is misclassified as important (excellent behavior)

- **Accuracy** (0.84) looks good, but hides class-specific tradeoffs.
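To see which confusions matter most, the off-diagonal cells of the confusion matrix can be ranked by count. A minimal sketch, with the matrix typed in by hand from the table above:

```python
labels = ["important", "promotional", "spam"]

# Rows = actual class, columns = predicted class.
cm = [
    [34, 24, 0],
    [3, 99, 0],
    [0, 12, 68],
]

# Collect every non-zero off-diagonal cell as (count, actual, predicted).
errors = [
    (cm[i][j], labels[i], labels[j])
    for i in range(len(labels))
    for j in range(len(labels))
    if i != j and cm[i][j] > 0
]
errors.sort(reverse=True)

total_errors = sum(count for count, _, _ in errors)
into_promotional = sum(count for count, _, pred in errors if pred == "promotional")

for count, actual, predicted in errors:
    print(f"{actual:<12} -> {predicted:<12} {count:>3}  ({count / total_errors:.0%} of errors)")
print(f"errors predicted as promotional: {into_promotional} of {total_errors}")
```

Almost every mistake is an email pulled into the promotional class, so promotional acts as the model's fallback when it is unsure.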


#### Is this a GOOD model?
Short answer: Yes, for a first iteration this is strong.

Especially:

- Spam handling is excellent

- Important precision is high (trustworthy)

- Promotions are rarely missed (recall 0.97)

The only weakness:
> Low recall for important emails

This means:

> The model prefers to be safe and under-predicts “important”.

The model is conservative, not bad: it labels an email “important” only when it is confident. That is a reasonable starting point, but it has a cost: 24 of the 58 important test emails end up in promotional, which is exactly the kind of miss that recall penalizes. Raising important recall is the natural next step.

That’s a good baseline.

## Model Evaluation using Cost Function (Andrew Ng Method)

One way to evaluate a model's performance is to compute the error on the training set and on the test set and compare the two.

The error here is the mean logistic loss (log loss), i.e. the average negative log of the probability the model assigned to the true class.

> Log loss measures how confident and correct your probabilities are.

- Correct & confident → very low loss
- Correct & under-confident → moderate loss
- Wrong & under-confident → moderately high loss
- Wrong & confident → very high loss
- Unsure (0.33 / 0.33 / 0.33) → loss of ln 3 ≈ 1.10

Lower is always better.

Perfect model → 0.0
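For a single email, log loss is just `-ln(p)`, where `p` is the probability the model gave to the true class. The sketch below evaluates that for a few illustrative probabilities; the specific values (0.95, 0.40, and so on) are made up to represent each case, not taken from the model:

```python
import math

def sample_log_loss(prob_of_true_class):
    """Log loss contribution of one sample: -ln(probability of the true class)."""
    return -math.log(prob_of_true_class)

# Hypothetical probability assigned to the TRUE class in each scenario.
cases = {
    "correct & confident": 0.95,
    "correct & under-confident": 0.40,
    "unsure (uniform over 3 classes)": 1 / 3,
    "wrong & under-confident": 0.30,
    "wrong & confident": 0.02,
}

losses = {name: sample_log_loss(p) for name, p in cases.items()}

for name, loss in losses.items():
    print(f"{name:<34} p={cases[name]:.2f}  loss={loss:.3f}")
```

The loss depends only on the probability given to the true class, so it grows steadily as that probability shrinks and climbs steeply once the model is confidently wrong.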

Let's do this.

In [10]:
from sklearn.metrics import log_loss

y_train_proba = model.predict_proba(X_train_tfidf)
y_test_proba = model.predict_proba(X_test_tfidf)

labels = model.classes_

train_loss = log_loss(y_train,y_train_proba, labels=labels)

test_loss = log_loss(y_test, y_test_proba,labels=labels)

print("Training Log Loss:", train_loss)
print("Test Log Loss:", test_loss)

Training Log Loss: 0.3109483362198767
Test Log Loss: 0.4560343733981221


Here, the gap between the training loss (0.31) and the test loss (0.46) is about 0.15, which is not very high. This is good, but the numbers need context.

> **Log loss is best judged by comparison: it is relative, not absolute.**

So we compare against baselines.

### Model Performance Comparison (Log Loss)

| Model / Baseline | Train Log Loss | Test Log Loss | Interpretation |
|------------------|---------------|---------------|----------------|
| Random Guessing (3 classes) | ~1.10 | ~1.10 | No learning; assigns equal probability to all classes |
| Majority Class (predicts the class frequencies for every email, so Promotional is always the top class) | ~1.07 | ~1.07 | High-bias baseline; ignores the email text entirely |
| TF-IDF + Logistic Regression (Our Model) | **0.31** | **0.46** | Strong classical ML model with good generalization |
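Both baseline figures can be derived from the class counts alone. The sketch below assumes the baseline predicts the same probability vector for every email and is evaluated on data with the same class shares as the full dataset, which the stratified split makes approximately true:

```python
import math

# Class counts from the dataset (value_counts output).
counts = {"important": 290, "promotional": 510, "spam": 399}
total = sum(counts.values())
priors = {label: n / total for label, n in counts.items()}

# Random guessing: probability 1/3 for every class, so the loss is ln(3).
uniform_loss = math.log(len(counts))

# Class-prior baseline: always predict the class frequencies.
# Its expected log loss equals the entropy of the class distribution.
prior_loss = -sum(p * math.log(p) for p in priors.values())

print(f"random guessing      : {uniform_loss:.3f}")
print(f"class-prior baseline : {prior_loss:.3f}")
```

A baseline that predicts "promotional" with probability 1.0 would do far worse than either figure, because log loss punishes confident wrong answers very heavily; that is why the probabilistic version is the fairer comparison.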


### Key Insights

- Random guessing sets the worst acceptable baseline (ln 3 ≈ 1.10 for 3 classes).
- A naive majority-class model reaches only about 0.43 accuracy here (102 of 240 test emails), and its log loss stays close to the random-guessing level.
- Our model significantly **outperforms both baselines, indicating it has learned real structure in the data.**
- The gap between train and test log loss (~0.15) is modest and suggests limited overfitting.


### Bias–Variance Diagnosis

| Observation | What it indicates |
|------------|------------------|
| Train loss ≪ Random baseline | Model has learned meaningful patterns |
| Test loss ≪ Random baseline | Model generalizes beyond training data |
| Train loss < Test loss | Expected generalization gap |
| Small train–test gap | Good bias–variance tradeoff |
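The observations in this table can be turned into numbers. A small sketch using the two log loss values printed earlier and the random-guessing baseline of ln 3; the variable names are only for illustration:

```python
import math

# Values printed by the log loss cell above.
train_loss = 0.3109483362198767
test_loss = 0.4560343733981221

random_baseline = math.log(3)  # uniform guessing over 3 classes

gap = test_loss - train_loss
relative_gap = gap / train_loss

# Share of the random-guessing loss that the model removes.
train_reduction = 1 - train_loss / random_baseline
test_reduction = 1 - test_loss / random_baseline

print(f"train-test gap        : {gap:.3f} ({relative_gap:.0%} of train loss)")
print(f"reduction vs random   : train {train_reduction:.0%}, test {test_reduction:.0%}")
```

Reading the gap both in absolute terms and relative to the training loss gives a fuller picture than either number alone, since a gap that is small in absolute terms can still be a sizeable fraction of a low training loss.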


### Accuracy vs Log Loss Perspective

| Metric | What it measures | Limitation |
|------|------------------|------------|
| Accuracy (0.84) | Hard correctness | Hides confidence and class-wise failures |
| Log Loss (0.46) | Confidence-weighted correctness | Sensitive to wrong but confident predictions |


#### Final One-Line Summary
*The TF-IDF + Logistic Regression model achieves a test log loss of 0.46, substantially outperforming random and majority-class baselines, indicating informative probability estimates and good generalization despite a limited labeled dataset.*
