# Model Improvement

Right now, our model stands here:

#### Classification Report

| Class | Precision | Recall | F1-score | Support |
|------|-----------|--------|----------|---------|
| Important | **0.92** | **0.59** | 0.72 | 58 |
| Promotional | **0.73** | **0.97** | 0.84 | 102 |
| Spam | **1.00** | **0.85** | 0.92 | 80 |
| **Accuracy** |  |  | **0.84** | **240** |
| Macro Avg | 0.88 | 0.80 | 0.82 | 240 |
| Weighted Avg | 0.87 | 0.84 | 0.83 | 240 |


I want to improve recall for the **Important** class. Before doing that, let's set up the prerequisites in this notebook.



In [2]:
## Loading the training set into a Pandas dataframe
import pandas as pd
labeled_df = pd.read_csv(r'C:\Users\u411296\OneDrive - United Airlines\Documents\My Space\Upskilling Msyelf\Machine Learning\Personal Gmail Classifier ML Project\data\labeled_emails.csv')

labeled_df.head()

Unnamed: 0,message_id,subject,snippet,sender,sender_domain,internal_date,is_important,is_promo,is_spam,label
0,17e99c136d770537,Your mobile recharge for Rs. 15.00 is success...,"Amazon.in Recharges Dear customer, Your rechar...",Amazon Pay <noreply@amazonpay.in>,amazonpay.in,1/27/2022,True,False,False,important
1,18c39485254e45a4,Your order is on its way,Download app The one you&#39;ve been waiting f...,RENTOMOJO <noreply@rentomojo.com>,rentomojo.com,12/5/2023,True,False,False,important
2,182c941da6fa286d,Discover Fresh Arrivals for Our Brand New Cate...,Shop for ₹599 | Get Upto 40% Off | Code: MAMA4...,Mamaearth <support@info.mamaearth.in>,info.mamaearth.in,8/23/2022,True,False,False,promotional
3,190b1e939cc8f72b,Product registration confirmation,PRODUCTS SUPPORT PRODUCT REGISTRATION Dear Pri...,Sony India Product Registration System <no-rep...,alerts.sony.co.in,7/14/2024,True,False,False,important
4,17b4f317cb3f7b10,Online Live Project / Work Experience Program ...,"Dear Sri Venkateswara College Students, Here i...",Finlatics Hub <finlatics@fincruxtech.com>,fincruxtech.com,8/16/2021,True,False,False,important


In [3]:
# Features
X = labeled_df[["subject", "snippet"]]

#Target
y = labeled_df["label"]

In [4]:
# Splitting data into training and test set
from sklearn.model_selection import train_test_split

X = labeled_df[["subject", "snippet"]]
y = labeled_df["label"]

X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.2,
    stratify=y,
    random_state=42
)

In [5]:
## Combining both the text feature fields

X_train_text = (
    X_train["subject"].fillna("") + " " +
    X_train["snippet"].fillna("")
)

X_test_text = (
    X_test["subject"].fillna("") + " " +
    X_test["snippet"].fillna("")
)
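
The `fillna("")` calls in the cell above matter because adding strings element-wise in pandas yields a missing value whenever either side is missing. Here is a minimal sketch of that behaviour, using a hypothetical two-row frame (`toy` and its contents are made up for illustration, not taken from the dataset):

```python
import pandas as pd

# Hypothetical two-row frame; the second email has no subject.
toy = pd.DataFrame({
    "subject": ["Your order is on its way", None],
    "snippet": ["Download the app", "Shop now and save"],
})

# Without fillna, a missing subject makes the whole combined text missing.
naive_text = toy["subject"] + " " + toy["snippet"]

# With fillna(""), the snippet survives even when the subject is missing.
safe_text = toy["subject"].fillna("") + " " + toy["snippet"].fillna("")

print(naive_text.isna().tolist())
print(safe_text.tolist())
```

A missing value passed to `TfidfVectorizer` would fail or be silently dropped, so filling first keeps every email usable.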


## Cross-Validation

There are several ways to improve recall for the Important class: adjusting class weights, tuning the decision threshold, or performing error analysis to see manually where the model makes mistakes.

Whichever method we choose to drive this behavior change in the model, we will need to use a procedure called **Cross-Validation** to validate that the **behavior change generalizes well to unseen data.**

Thus, 

> **Cross-Validation is a method to estimate how well a model will perform on unseen data.**

- We could hold out a separate, fixed CV set, OR
- We could run K-fold CV on the training data.
- A separate fixed CV set makes sense when the dataset is huge and training is very expensive.
- In our case, however, the dataset is small, so training is cheap. K-fold CV on the training data has another benefit: a single train/CV/test split can be lucky or unlucky, and K-fold CV reduces this randomness by evaluating the model on multiple validation splits.

So, we'll use this method to perform CV.

### Stratified K-fold CV

We'll be using the stratified version of k-fold CV to maintain the proportions of our dataset.

1. Split the training data into k equal parts (folds).
2. Repeat k times:
   - Train on k−1 folds.
   - Validate on the remaining fold.
3. Average the validation results.
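
To make the steps above concrete, here is a small sketch on hypothetical labels (10 important, 20 promotional, 15 spam; these counts are made up, not our dataset). It only inspects the folds, without training anything, to show that every validation fold keeps the same class mix. The names `toy_labels`, `toy_features` and `toy_skf` are assumptions for this illustration.

```python
from collections import Counter

import numpy as np
from sklearn.model_selection import StratifiedKFold

# Hypothetical labels: 10 important, 20 promotional, 15 spam.
toy_labels = np.array(["important"] * 10 + ["promotional"] * 20 + ["spam"] * 15)
# Placeholder features; the splitter only needs the number of rows.
toy_features = np.zeros((len(toy_labels), 1))

toy_skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

fold_counts = []
for fold, (train_idx, val_idx) in enumerate(toy_skf.split(toy_features, toy_labels), start=1):
    # Count how many emails of each class landed in this validation fold.
    counts = Counter(toy_labels[val_idx].tolist())
    fold_counts.append(counts)
    print(f"Fold {fold}: train={len(train_idx)}, val={len(val_idx)}, val classes={dict(counts)}")
```

Because each class count here is divisible by 5, every validation fold should receive the same number of emails per class, which is exactly the property we want when the Important class is the smallest one.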

The metric used to evaluate model performance is the CV error. In our case, it is the CV log loss, because we are using the Logistic Regression algorithm.

- If we have 3 different models to choose from, we perform CV on each of them and compare their CV Errors to choose the one with the lowest error. 
- In our case, we already have a good enough model and are only modifying it to drive a behaviour change. So CV will help us validate whether the modified model is stable and how well it generalizes to unseen data compared to the old model.
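
For reference, log loss is the average of −log(probability the model assigned to the true class). The sketch here computes it by hand on four hypothetical predictions and checks the result against `sklearn.metrics.log_loss`. The probabilities are made up for illustration; the columns are assumed to follow the alphabetical class order that scikit-learn uses.

```python
import numpy as np
from sklearn.metrics import log_loss

classes = ["important", "promotional", "spam"]  # alphabetical, as in model.classes_
y_true = ["important", "promotional", "spam", "important"]

# Hypothetical predicted probabilities, one row per email, one column per class.
proba = np.array([
    [0.70, 0.20, 0.10],
    [0.10, 0.80, 0.10],
    [0.05, 0.15, 0.80],
    [0.30, 0.60, 0.10],  # an important email the model leans towards promotional
])

# Pick the probability assigned to the true class of each email.
true_idx = np.array([classes.index(label) for label in y_true])
per_sample = -np.log(proba[np.arange(len(y_true)), true_idx])

manual = per_sample.mean()
reference = log_loss(y_true, proba, labels=classes)

print("Per-sample losses:", per_sample.round(3))
print("Manual log loss:", manual)
print("sklearn log loss:", reference)
```

The last email contributes the largest per-sample loss, which is why log loss is sensitive to confident mistakes and not only to the final predicted label.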

## Measure 1 : Class Weight

Let's start by increasing the weight of the **Important** class, so that the model treats this class as significant and is less likely to miss it.
> In classification, the model is trained by minimizing a loss function. Normally, every training example contributes equally to the loss. With class_weight, errors on some classes are penalized more heavily than others.

Increasing the class weight for **Important** increases its contribution to the loss. The model therefore treats these mistakes as costlier than others and tries to make fewer of them.
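
As a rough illustration of that idea, the sketch here scales each example's loss by the weight of its true class. It uses three hypothetical emails that the model is equally unsure about and covers only the data term of the loss (the regularization term used by `LogisticRegression` is left out). The probabilities are made-up values, not model output.

```python
import numpy as np

class_weight = {"important": 2.0, "promotional": 1.0, "spam": 1.0}

# Three hypothetical emails, one per class, each given probability 0.40 for its true class.
y_true = ["important", "promotional", "spam"]
p_true = np.array([0.40, 0.40, 0.40])

per_sample_loss = -np.log(p_true)                            # identical for all three
weights = np.array([class_weight[label] for label in y_true])
weighted_loss = weights * per_sample_loss                    # important error counts double

# Share of the total weighted loss that comes from each email.
share = weighted_loss / weighted_loss.sum()
for label, s in zip(y_true, share):
    print(f"{label:<12} share of loss = {s:.2f}")
```

With equal per-sample losses, the important email accounts for half of the total weighted loss instead of a third, so the optimizer gains more by fixing it.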

In [6]:
# Using Stratified K-fold Split (Stratified = preserves class proportions)
from sklearn.model_selection import StratifiedKFold

skf = StratifiedKFold(
    n_splits=5,
    shuffle=True,
    random_state=42
)

print(skf)

StratifiedKFold(n_splits=5, random_state=42, shuffle=True)


### Model Selection - Comparing two models

Let's try using **weights 1.5 and 2 for Important** and see which model is better:

#### Weight 1.5

In [17]:
# Running CV loop on X_train_text, y_train
from sklearn.metrics import log_loss, recall_score, precision_score, f1_score
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
import numpy as np

cv_losses = []
important_recalls = []
important_precisions = []
important_f1s = []

#Creating K-fold splits in training data
for train_idx, val_idx in skf.split(X_train_text, y_train):
    X_tr = X_train_text.iloc[train_idx]
    X_val = X_train_text.iloc[val_idx]

    y_tr = y_train.iloc[train_idx]
    y_val = y_train.iloc[val_idx]
    
    
    # Fit vectorizer ONLY on training fold
    vectorizer = TfidfVectorizer(
        stop_words="english",
        ngram_range=(1, 1),
        min_df=1,
        sublinear_tf=True
    )

    X_tr_tfidf = vectorizer.fit_transform(X_tr)
    X_val_tfidf = vectorizer.transform(X_val)
    
    #Training the model
    model = LogisticRegression(
        max_iter=1000,
        multi_class="multinomial",
        class_weight={"important": 1.5, "promotional": 1, "spam": 1}
    )

    model.fit(X_tr_tfidf, y_tr)

    # ---- Log loss ----
    y_val_proba = model.predict_proba(X_val_tfidf)
    loss = log_loss(y_val, y_val_proba, labels=model.classes_)
    cv_losses.append(loss)

    # ---- Precision & Recall for IMPORTANT ----
    y_val_pred = model.predict(X_val_tfidf)

    important_recalls.append(
        recall_score(y_val, y_val_pred, labels=["important"], average=None)[0]
    )

    important_precisions.append(
        precision_score(y_val, y_val_pred, labels=["important"], average=None)[0]
    )
    
    important_f1s.append(
        f1_score(y_val, y_val_pred, labels=["important"], average=None)[0]
    )



In [18]:
# Aggregating CV Log Loss
print("CV Log Losses:", cv_losses)
print("Mean CV Log Loss:", np.mean(cv_losses))
print("Std Dev:", np.std(cv_losses))

CV Log Losses: [0.463446673417334, 0.457184826299196, 0.4997583338724707, 0.4490745466419084, 0.4392842196258664]
Mean CV Log Loss: 0.46174971997135505
Std Dev: 0.0206578380312398


In [19]:
# Looking at Precision-Recall for sanity check
print("CV Important Recall:", important_recalls)
print("Mean Recall:", np.mean(important_recalls))

print("CV Important Precision:", important_precisions)
print("Mean Precision:", np.mean(important_precisions))

print("Mean Important F1:", np.mean(important_f1s))


CV Important Recall: [np.float64(0.6956521739130435), np.float64(0.6304347826086957), np.float64(0.6808510638297872), np.float64(0.6595744680851063), np.float64(0.6304347826086957)]
Mean Recall: 0.6593894542090657
CV Important Precision: [np.float64(0.8421052631578947), np.float64(0.8055555555555556), np.float64(0.8205128205128205), np.float64(0.8611111111111112), np.float64(0.7435897435897436)]
Mean Precision: 0.8145748987854251
Mean Important F1: 0.7285497549141642


#### Weight 2

In [20]:
# Running CV loop on X_train_text, y_train
from sklearn.metrics import log_loss, recall_score, precision_score, f1_score  # f1_score is used in the loop
import numpy as np

cv_losses = []
important_recalls = []
important_precisions = []
important_f1s = []

#Creating K-fold splits in training data
for train_idx, val_idx in skf.split(X_train_text, y_train):
    X_tr = X_train_text.iloc[train_idx]
    X_val = X_train_text.iloc[val_idx]

    y_tr = y_train.iloc[train_idx]
    y_val = y_train.iloc[val_idx]
    
    
    # Fit vectorizer ONLY on training fold
    vectorizer = TfidfVectorizer(
        stop_words="english",
        ngram_range=(1, 1),
        min_df=1,
        sublinear_tf=True
    )

    X_tr_tfidf = vectorizer.fit_transform(X_tr)
    X_val_tfidf = vectorizer.transform(X_val)
    
    #Training the model
    model = LogisticRegression(
        max_iter=1000,
        multi_class="multinomial",
        class_weight={"important": 2, "promotional": 1, "spam": 1}
    )

    model.fit(X_tr_tfidf, y_tr)

    # ---- Log loss ----
    y_val_proba = model.predict_proba(X_val_tfidf)
    loss = log_loss(y_val, y_val_proba, labels=model.classes_)
    cv_losses.append(loss)

    # ---- Precision & Recall for IMPORTANT ----
    y_val_pred = model.predict(X_val_tfidf)

    important_recalls.append(
        recall_score(y_val, y_val_pred, labels=["important"], average=None)[0]
    )

    important_precisions.append(
        precision_score(y_val, y_val_pred, labels=["important"], average=None)[0]
    )
    
    important_f1s.append(
        f1_score(y_val, y_val_pred, labels=["important"], average=None)[0]
    )




In [21]:
# Aggregating CV Log Loss
print("CV Log Losses:", cv_losses)
print("Mean CV Log Loss:", np.mean(cv_losses))
print("Std Dev:", np.std(cv_losses))

CV Log Losses: [0.46515709286302, 0.4615146213028387, 0.5026583050067471, 0.44853796847971267, 0.4408295457954755]
Mean CV Log Loss: 0.4637395066895588
Std Dev: 0.02134284704584207


In [22]:
# Looking at Precision-Recall for sanity check
print("CV Important Recall:", important_recalls)
print("Mean Recall:", np.mean(important_recalls))

print("CV Important Precision:", important_precisions)
print("Mean Precision:", np.mean(important_precisions))

print("Mean Important F1:", np.mean(important_f1s))


CV Important Recall: [np.float64(0.782608695652174), np.float64(0.782608695652174), np.float64(0.723404255319149), np.float64(0.7872340425531915), np.float64(0.7391304347826086)]
Mean Recall: 0.7629972247918595
CV Important Precision: [np.float64(0.782608695652174), np.float64(0.7058823529411765), np.float64(0.723404255319149), np.float64(0.7551020408163265), np.float64(0.7083333333333334)]
Mean Precision: 0.735066135612432
Mean Important F1: 0.7485037161721837


### CV Interpretation

We experimented with two class weights for the important class. Both models achieved nearly identical cross-validated log loss (mean 0.462 for weight 1.5 vs. 0.464 for weight 2). However, a weight of 2 raised mean CV recall for important emails from 0.66 to 0.76, while mean precision dropped from 0.81 to 0.74, a reduction we consider acceptable. Mean F1 also rose slightly, from 0.73 to 0.75. Based on the CV F1-score and our priority on recall, we selected the higher-weight model.
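
The two CV loops above differ only in the weight, so the comparison could be wrapped in one function. The sketch here is one possible way to do that with a `Pipeline` and `cross_validate`; it is not the code that produced the numbers above. `evaluate_weight` is a hypothetical helper, the toy corpus is made up so that the block runs on its own, and `multi_class` is omitted on the assumption that the default `lbfgs` solver already fits a multinomial model for multiclass targets.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, make_scorer, precision_score, recall_score
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.pipeline import make_pipeline


def evaluate_weight(texts, labels, important_weight, n_splits=5):
    """Hypothetical helper: mean CV metrics for one 'important' class weight."""
    # The pipeline refits the vectorizer on each training fold, so no validation text leaks in.
    pipeline = make_pipeline(
        TfidfVectorizer(stop_words="english", ngram_range=(1, 1), min_df=1, sublinear_tf=True),
        LogisticRegression(
            max_iter=1000,
            class_weight={"important": important_weight, "promotional": 1, "spam": 1},
        ),
    )
    # Restrict precision/recall/F1 to the 'important' class only.
    only_important = {"labels": ["important"], "average": "macro", "zero_division": 0}
    scoring = {
        "neg_log_loss": "neg_log_loss",
        "recall": make_scorer(recall_score, **only_important),
        "precision": make_scorer(precision_score, **only_important),
        "f1": make_scorer(f1_score, **only_important),
    }
    cv = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=42)
    scores = cross_validate(pipeline, texts, labels, cv=cv, scoring=scoring)
    return {
        "log_loss": -scores["test_neg_log_loss"].mean(),
        "recall": scores["test_recall"].mean(),
        "precision": scores["test_precision"].mean(),
        "f1": scores["test_f1"].mean(),
    }


# Made-up toy corpus (6 emails per class) so the sketch is self-contained.
toy_texts = [
    "invoice payment receipt order", "order shipped tracking receipt",
    "bank statement account payment", "flight booking confirmation ticket",
    "password reset account security", "appointment confirmation booking",
    "sale discount offer shop", "flash sale coupon discount",
    "new arrivals shop offer", "exclusive deal coupon save",
    "clearance sale save shop", "weekend offer deal discount",
    "lottery winner claim prize", "claim free prize click",
    "urgent wire transfer prince", "free pills click winner",
    "cheap loans click urgent", "prize lottery free claim",
]
toy_labels = ["important"] * 6 + ["promotional"] * 6 + ["spam"] * 6

for weight in (1.5, 2):
    print(weight, evaluate_weight(toy_texts, toy_labels, weight, n_splits=3))
```

In the notebook, the same call would take `X_train_text` and `y_train` with the default `n_splits=5`.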

### Retraining the model with the following class_weights

> class_weight = {"important": 2, "promotional": 1, "spam": 1}

In [23]:
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(
    stop_words="english",
    ngram_range=(1, 1),
    min_df=1,
    sublinear_tf=True
)

X_train_tfidf = vectorizer.fit_transform(X_train_text)
X_test_tfidf = vectorizer.transform(X_test_text)

In [24]:
from sklearn.linear_model import LogisticRegression

model = LogisticRegression(
    max_iter=1000,
    multi_class="multinomial",
    class_weight={
        "important": 2.0,
        "promotional": 1.0,
        "spam": 1.0
    }
)

model.fit(X_train_tfidf, y_train)



| Parameter | Value |
|-----------|-------|
| penalty | 'l2' |
| dual | False |
| tol | 0.0001 |
| C | 1.0 |
| fit_intercept | True |
| intercept_scaling | 1 |
| class_weight | {'important': 2.0, 'promotional': 1.0, 'spam': 1.0} |
| random_state | None |
| solver | 'lbfgs' |
| max_iter | 1000 |


In [25]:
from sklearn.metrics import classification_report, confusion_matrix

y_pred = model.predict(X_test_tfidf)

print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))

              precision    recall  f1-score   support

   important       0.83      0.78      0.80        58
 promotional       0.80      0.92      0.86       102
        spam       1.00      0.86      0.93        80

    accuracy                           0.87       240
   macro avg       0.88      0.85      0.86       240
weighted avg       0.88      0.87      0.87       240

[[45 13  0]
 [ 8 94  0]
 [ 1 10 69]]


In [26]:
from sklearn.metrics import log_loss

y_train_proba = model.predict_proba(X_train_tfidf)
y_test_proba = model.predict_proba(X_test_tfidf)

labels = model.classes_

train_loss = log_loss(y_train,y_train_proba, labels=labels)

test_loss = log_loss(y_test, y_test_proba,labels=labels)

print("Training Log Loss:", train_loss)
print("Test Log Loss:", test_loss)

Training Log Loss: 0.30390291491932464
Test Log Loss: 0.4584545457866945



## Model Evaluation

### Key Observations

- **Important emails**:
  - Recall of **0.78** means the model captures 45 of the 58 important emails in the test set.
  - Precision of **0.83** means 45 of the 54 emails predicted as important really are important, so the important inbox is not flooded with noise.
  - No important emails are misclassified as spam; all 13 missed important emails are classified as promotional.

- **Spam safety**:
  - Spam precision is **1.00**, meaning no legitimate emails are incorrectly flagged as spam.
  - Only one spam email is misclassified as important, which is acceptable; the other 10 missed spam emails are classified as promotional.

- **Promotional emails**:
  - High recall (**0.92**) means 94 of the 102 promotional emails are classified correctly.
  - The remaining 8 promotional emails are classified as important, an expected side effect of the recall-focused weighting.
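
These observations can be checked directly against the confusion matrix printed earlier. A small sketch, assuming rows are true classes and columns are predicted classes in the order important, promotional, spam (the layout `confusion_matrix` uses for sorted labels):

```python
import numpy as np

classes = ["important", "promotional", "spam"]

# Confusion matrix from the test-set evaluation (rows = true class, columns = predicted class).
cm = np.array([
    [45, 13, 0],
    [8, 94, 0],
    [1, 10, 69],
])

true_positives = np.diag(cm)
recall = true_positives / cm.sum(axis=1)      # correct / all emails truly in the class
precision = true_positives / cm.sum(axis=0)   # correct / all emails predicted as the class

for name, p, r in zip(classes, precision, recall):
    print(f"{name:<12} precision={p:.2f} recall={r:.2f}")
```

Reading across a row shows where a class's emails end up; reading down a column shows what gets mixed into a predicted class.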


### Log Loss Analysis (Generalization Check)

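- **Training Log Loss:** 0.304  
- **Test Log Loss:** 0.458  
- **Mean CV Log Loss (weight 2):** 0.464  

The close alignment between the cross-validation and test log loss suggests:
- No evidence of severe overfitting, although the training log loss (0.304) is lower than the test log loss (0.458), as expected for a model evaluated on the data it was fit on
- Probability estimates of similar quality on the CV folds and on the test set
- Good generalization to unseen data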
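
As a quick arithmetic check on those claims, the sketch here compares the reported losses. The numbers are copied from the outputs earlier in this notebook; using one CV standard deviation as the yardstick is an informal rule of thumb we assume here, not a formal test.

```python
# Values copied from the notebook outputs above (weight 2 model).
reported_train_loss = 0.30390291491932464
reported_test_loss = 0.4584545457866945
reported_cv_mean = 0.4637395066895588
reported_cv_std = 0.02134284704584207

# Gap between training and test loss: how much better the model fits data it has seen.
train_test_gap = reported_test_loss - reported_train_loss

# Gap between the CV estimate and the actual test loss.
cv_test_gap = abs(reported_cv_mean - reported_test_loss)

# Informal check: does the test loss fall within one std dev of the CV mean?
within_one_std = cv_test_gap <= reported_cv_std

print(f"Train-test gap: {train_test_gap:.3f}")
print(f"CV-test gap:    {cv_test_gap:.3f}")
print("Test loss within one CV std dev:", within_one_std)
```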



#### Final Conclusion

*By introducing a higher class weight for important emails, the model achieves a **significant improvement in recall** (0.59 to 0.78 on the test set) while maintaining strong precision (0.92 to 0.83) and strict spam safety (spam precision stays at 1.00). Cross-validation indicated that this improvement generalizes well, and the test results match the expected trade-offs.*
