# Model Training & Evaluation

In this notebook, we train and evaluate machine learning models to predict customer churn.

The main goal is not only to achieve high accuracy, but also to correctly identify customers who are likely to churn, since missing such customers can have a direct negative business impact.


In [1]:
import pandas as pd
import numpy as np

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (
    accuracy_score,
    classification_report,
    confusion_matrix
)


## Loading Processed Data

The dataset was split into training and testing sets in an earlier step and saved under `data/processed/model_input/`, so that every notebook works from the same split and results stay reproducible.


In [2]:
# Load processed data
X_train = pd.read_csv("../data/processed/model_input/X_train.csv")
X_test = pd.read_csv("../data/processed/model_input/X_test.csv")
y_train = pd.read_csv("../data/processed/model_input/y_train.csv").squeeze()
y_test = pd.read_csv("../data/processed/model_input/y_test.csv").squeeze()
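
The split itself happens outside this notebook and is not shown here. As a rough sketch of how such a split could be produced and saved, the block below uses a synthetic table with hypothetical column names (`tenure`, `monthly_charges`, `churn`), an assumed 80/20 stratified split, and a temporary directory in place of `data/processed/model_input/`. None of these choices are confirmed by the original pipeline.

```python
import tempfile
from pathlib import Path

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the cleaned churn table (hypothetical columns).
rng = np.random.default_rng(42)
X = pd.DataFrame({
    "tenure": rng.integers(1, 72, size=100),
    "monthly_charges": rng.uniform(20, 120, size=100).round(2),
})
y = pd.Series([1] * 30 + [0] * 70, name="churn")

# stratify=y keeps the churn rate similar in both subsets;
# random_state makes the split repeatable across runs.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Save each piece as its own CSV, then reload the labels the same way
# the notebook does (read_csv followed by squeeze).
with tempfile.TemporaryDirectory() as tmp:
    out_dir = Path(tmp)  # assumption: stands in for data/processed/model_input/
    pieces = {"X_train": X_tr, "X_test": X_te, "y_train": y_tr, "y_test": y_te}
    for name, obj in pieces.items():
        obj.to_csv(out_dir / f"{name}.csv", index=False)
    y_test_loaded = pd.read_csv(out_dir / "y_test.csv").squeeze()

print(len(X_tr), len(X_te), round(y_te.mean(), 2))
```

Writing with `index=False` keeps the row index out of the CSV, which is why `squeeze()` on the reloaded single-column file yields a plain Series of labels.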

## Baseline Model: Logistic Regression

We start with Logistic Regression as a baseline model because:
- It is simple and interpretable
- It performs well on binary classification problems
- It provides a strong reference point for comparison


In [3]:
# Initialize model
log_reg = LogisticRegression(max_iter=1000, random_state=42)

# Train model
log_reg.fit(X_train, y_train)

In [4]:
# Predictions
y_pred = log_reg.predict(X_test)

In [5]:
accuracy_score(y_test, y_pred)

0.8289567068843151

In [6]:
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.87      0.90      0.89      1035
           1       0.70      0.63      0.66       374

    accuracy                           0.83      1409
   macro avg       0.78      0.76      0.77      1409
weighted avg       0.82      0.83      0.83      1409



In [7]:
confusion_matrix(y_test, y_pred)

array([[933, 102],
       [139, 235]])

### Baseline Model Results

The baseline Logistic Regression model achieved an accuracy of approximately **83%**.

However, when focusing on the churn class (label = 1), we observe:
- Recall of around **63%**
- **139** false negatives out of 374 actual churners

This means that more than a third of the customers who actually churned were not detected by the model, which is undesirable from a business perspective.
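
To make the link between the confusion matrix and these figures explicit, the sketch below recomputes them from the matrix printed above. It assumes scikit-learn's layout, where rows are true labels and columns are predicted labels.

```python
# Baseline confusion matrix as reported above.
# Rows are true labels (0, 1); columns are predicted labels (0, 1).
cm = [[933, 102],
      [139, 235]]

(tn, fp), (fn, tp) = cm

recall = tp / (tp + fn)        # share of actual churners that were caught
precision = tp / (tp + fp)     # share of churn predictions that were correct
accuracy = (tp + tn) / (tn + fp + fn + tp)

print(f"false negatives: {fn} of {tp + fn} churners")
print(f"recall={recall:.2f} precision={precision:.2f} accuracy={accuracy:.2f}")
```

The same unpacking works on the NumPy array returned by `confusion_matrix`.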


## Handling Class Imbalance

The dataset is imbalanced: in the test set, 1,035 customers (about 73%) did not churn and only 374 (about 27%) did.

In such cases, accuracy alone can be misleading.  
Our main concern is reducing **false negatives**, i.e., customers who churn but are predicted as non-churn.
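
One way to see why accuracy can mislead here is to score a hypothetical model that never predicts churn. The sketch below uses the test-set class counts from the `support` column of the reports above; the always-zero predictor is an illustration, not a model trained in this notebook.

```python
# Test-set class counts taken from the `support` column of the reports.
n_non_churn, n_churn = 1035, 374

y_true = [0] * n_non_churn + [1] * n_churn
y_always_zero = [0] * len(y_true)   # hypothetical model: never predicts churn

# Accuracy: fraction of predictions that match the true label.
accuracy = sum(t == p for t, p in zip(y_true, y_always_zero)) / len(y_true)

# Recall for churn: fraction of actual churners that were flagged.
caught = sum(1 for t, p in zip(y_true, y_always_zero) if t == 1 and p == 1)
recall_churn = caught / n_churn

print(f"accuracy={accuracy:.3f} recall_churn={recall_churn:.3f}")
```

Such a model would look reasonable on accuracy alone while detecting no churners at all.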


## Logistic Regression with Class Weight Balancing

To address class imbalance, we retrain Logistic Regression using `class_weight='balanced'`.

This weights each class inversely to its frequency in the training data, so errors on the minority class (churned customers) are penalized more heavily during fitting.
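
scikit-learn documents the `'balanced'` weights as `n_samples / (n_classes * np.bincount(y))`. The sketch below reimplements that formula in plain Python on a toy label list; the helper name `balanced_class_weights` and the labels are illustrative and do not come from the notebook's training data.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Return {class: n_samples / (n_classes * class_count)} for each class."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * cnt) for cls, cnt in counts.items()}


# Illustrative labels only (3:1 imbalance), not the real y_train.
toy_labels = [0, 0, 0, 1]
weights = balanced_class_weights(toy_labels)
print(weights)
```

The rarer class receives the larger weight, so each misclassified churner contributes more to the loss than a misclassified non-churner.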


In [8]:
# Logistic Regression with class balancing
log_reg_balanced = LogisticRegression(
    max_iter=1000,
    random_state=42,
    class_weight='balanced'
)

# Train model
log_reg_balanced.fit(X_train, y_train)

# Predictions
y_pred_balanced = log_reg_balanced.predict(X_test)

# Evaluation
accuracy_score(y_test, y_pred_balanced)

0.7814052519517388

In [9]:
print(classification_report(y_test, y_pred_balanced))


              precision    recall  f1-score   support

           0       0.93      0.76      0.84      1035
           1       0.56      0.83      0.67       374

    accuracy                           0.78      1409
   macro avg       0.74      0.80      0.75      1409
weighted avg       0.83      0.78      0.79      1409



In [10]:
confusion_matrix(y_test, y_pred_balanced)


array([[789, 246],
       [ 62, 312]])

### Balanced Model Results

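After applying class balancing:
- Overall accuracy decreased from about 83% to around **78%**
- Recall for churned customers increased significantly, from 63% to **83%**
- Precision for churned customers dropped from 0.70 to 0.56, as false positives rose from 102 to 246

Most importantly, the number of false negatives fell from 139 to 62, a reduction of more than 50%.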
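
The size of this change can be checked directly from the two confusion matrices reported above. A minimal sketch, again assuming rows are true labels and columns are predictions:

```python
# Confusion matrices as printed in the notebook outputs.
baseline_cm = [[933, 102], [139, 235]]
balanced_cm = [[789, 246], [62, 312]]

(_, fp_base), (fn_base, _) = baseline_cm
(_, fp_bal), (fn_bal, _) = balanced_cm

# Relative drop in missed churners, and the extra false alarms paid for it.
fn_reduction = (fn_base - fn_bal) / fn_base
extra_fp = fp_bal - fp_base

print(f"false negatives: {fn_base} -> {fn_bal} ({fn_reduction:.1%} fewer)")
print(f"false positives: {fp_base} -> {fp_bal} ({extra_fp} more)")
```

With these counts, false negatives drop by roughly 55% while false positives rise by 144.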


## Model Comparison

| Metric | Baseline Model | Balanced Model |
|--------|----------------|----------------|
| Accuracy | 0.83 | 0.78 |
| Recall (Churn) | 0.63 | 0.83 |
| Precision (Churn) | 0.70 | 0.56 |
| False Negatives | 139 | 62 |
| False Positives | 102 | 246 |

Although the balanced model has lower accuracy, it is more effective at identifying customers who are likely to churn.


## Final Model Selection

The balanced Logistic Regression model is selected as the final model.

From a business perspective:
- Missing a churned customer is typically more costly than incorrectly flagging a loyal one
- The balanced model significantly reduces false negatives
- This allows the company to take proactive actions (offers, retention campaigns)

Therefore, the balanced model provides better real-world value despite its lower accuracy and precision, and this trade-off aligns better with churn prevention strategies.
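
The cost argument can be made concrete with a simple error-cost comparison. In the sketch below, the helper `total_error_cost` and the unit costs `COST_FN` and `COST_FP` are hypothetical placeholders; real values would have to come from the business (for example, lost revenue per churner versus the cost of a retention offer).

```python
def total_error_cost(cm, cost_fn, cost_fp):
    """Total cost of a model's mistakes, given per-error unit costs."""
    (_, fp), (fn, _) = cm
    return fn * cost_fn + fp * cost_fp


# Confusion matrices as printed in the notebook outputs.
baseline_cm = [[933, 102], [139, 235]]
balanced_cm = [[789, 246], [62, 312]]

# Hypothetical unit costs: a missed churner costs 5x a false alarm.
COST_FN = 5.0
COST_FP = 1.0

baseline_cost = total_error_cost(baseline_cm, COST_FN, COST_FP)
balanced_cost = total_error_cost(balanced_cm, COST_FN, COST_FP)

# Cost ratio (missed churner / false alarm) above which the balanced model wins:
# extra false positives divided by avoided false negatives.
break_even = (246 - 102) / (139 - 62)

print(f"baseline cost: {baseline_cost}, balanced cost: {balanced_cost}")
print(f"break-even cost ratio: {break_even:.2f}")
```

Under this simple linear cost model, the balanced model is cheaper whenever a missed churner costs more than about 1.87 times a false alarm; below that ratio the baseline would be the better choice.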