# Phase 2: Modeling & Evaluation

## Context
Phase 1 (EDA, data cleaning, feature engineering, and documentation) has been completed.  
This phase focuses on building a baseline churn prediction model using the processed dataset:

**Input dataset:** `/content/drive/MyDrive/ML_internship_projects/ConnectTel-churn-prediction/data/processed/churn_cleaned.csv`


In [117]:
import pandas as pd

df = pd.read_csv("/content/drive/MyDrive/ML_internship_projects/ConnectTel-churn-prediction/data/processed/churn_cleaned.csv")
df.head()


| | CustomerID | Gender | Senior_Citizen | Partner | Dependents | Tenure_Months | Phone_Service | Multiple_Lines | Internet_Service | Online_Security | ... | Contract | Paperless_Billing | Payment_Method | Monthly_Charges | Total_Charges | Churn_Label | CLTV | TotalChargesPerTenure | ServiceCount | IsLongTermContract |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 3668-QPYBK | Male | No | No | No | 2 | Yes | No | DSL | Yes | ... | Month-to-month | Yes | Mailed check | 53.85 | 108.15 | Yes | 3239 | 36.05 | 2 | 0 |
| 1 | 9237-HQITU | Female | No | No | Yes | 2 | Yes | No | Fiber optic | No | ... | Month-to-month | Yes | Electronic check | 70.7 | 151.65 | Yes | 2701 | 50.55 | 0 | 0 |
| 2 | 9305-CDSKC | Female | No | No | Yes | 8 | Yes | Yes | Fiber optic | No | ... | Month-to-month | Yes | Electronic check | 99.65 | 820.5 | Yes | 5372 | 91.166667 | 3 | 0 |
| 3 | 7892-POOKP | Female | No | Yes | Yes | 28 | Yes | Yes | Fiber optic | No | ... | Month-to-month | Yes | Electronic check | 104.8 | 3046.05 | Yes | 5003 | 105.036207 | 4 | 0 |
| 4 | 0280-XJGEX | Male | No | No | Yes | 49 | Yes | Yes | Fiber optic | No | ... | Month-to-month | Yes | Bank transfer (automatic) | 103.7 | 5036.3 | Yes | 5340 | 100.726 | 4 | 0 |


In [118]:
df.columns

Index(['CustomerID', 'Gender', 'Senior_Citizen', 'Partner', 'Dependents',
       'Tenure_Months', 'Phone_Service', 'Multiple_Lines', 'Internet_Service',
       'Online_Security', 'Online_Backup', 'Device_Protection', 'Tech_Support',
       'Streaming_TV', 'Streaming_Movies', 'Contract', 'Paperless_Billing',
       'Payment_Method', 'Monthly_Charges', 'Total_Charges', 'Churn_Label',
       'CLTV', 'TotalChargesPerTenure', 'ServiceCount', 'IsLongTermContract'],
      dtype='object')

In [119]:
X = df.drop(columns=["Churn_Label"])
y = df["Churn_Label"].map({"Yes": 1, "No": 0})


In [120]:
X = X.drop(columns=["CustomerID"])


## Feature and Target Definition

The dataset is split into:
- **Features (X):** All customer attributes
- **Target (y):** Customer churn indicator

The churn variable is mapped to a binary format to enable probabilistic modeling and ROC-based evaluation.
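
The mapping step can be sketched on a small made-up `labels` series (not the project data). `Series.map` returns `NaN` for any value missing from the dictionary, so a null check is one way to catch unexpected label spellings early.

```python
import pandas as pd

# Hypothetical label column, for illustration only
labels = pd.Series(["Yes", "No", "No", "Yes"])

# Map to binary; any value not in the dict would become NaN
y_demo = labels.map({"Yes": 1, "No": 0})

# Fail early if some label was not mapped
assert y_demo.notna().all(), "unmapped churn labels found"

print(y_demo.tolist())
```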


In [121]:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

categorical_cols = X.select_dtypes(include="object").columns
numerical_cols = X.select_dtypes(exclude="object").columns


preprocessor = ColumnTransformer(
    transformers=[
        ("cat", OneHotEncoder(drop="first", handle_unknown="ignore"), categorical_cols),
        ("num", "passthrough", numerical_cols)
    ]
)


## Preprocessing Pipeline

Feature encoding is implemented using a **ColumnTransformer** and integrated into a modeling pipeline.

### Benefits
- Prevents data leakage, because the encoder is fitted only on the training portion of each split
- Ensures consistent preprocessing across training and validation
- Simplifies experimentation and deployment
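
As a minimal sketch of the "consistent preprocessing" point, the toy example below fits a `ColumnTransformer` on a made-up training frame and applies it to a validation row containing a category never seen in training. The column names `plan` and `charges` are hypothetical, and `drop="first"` is omitted to keep the illustration simple.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Toy data (assumed for illustration): "c" never appears in training
train = pd.DataFrame({"plan": ["a", "b", "a"], "charges": [10.0, 20.0, 30.0]})
val = pd.DataFrame({"plan": ["c"], "charges": [15.0]})

pre = ColumnTransformer(
    transformers=[
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["plan"]),
        ("num", "passthrough", ["charges"]),
    ],
    sparse_threshold=0,  # always return a dense array
)

# Categories are learned from the training frame only
pre.fit(train)

# The unseen category is encoded as all zeros instead of raising an error
encoded_val = pre.transform(val)
print(encoded_val)
```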


In [122]:
from sklearn.model_selection import train_test_split

X_train, X_val, y_train, y_val = train_test_split(
    X,
    y,
    test_size=0.2,
    stratify=y,
    random_state=42
)


In [123]:
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression

lr_pipeline = Pipeline(
    steps=[
        ("preprocessor", preprocessor),
        ("model", LogisticRegression(max_iter=1000))
    ]
)

lr_pipeline.fit(X_train, y_train)


STOP: TOTAL NO. OF ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


## Baseline Model Selection

**Logistic Regression** is selected as the baseline model.

### Reasons
- Strong interpretability
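- Usually converges reliably, although in this run the solver hit the `max_iter=1000` limit (see the warning above), most likely because the numeric features are passed through unscaled
- Produces churn probabilities that tend to be reasonably well calibrated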
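
The convergence warning above points to feature scaling as one remedy. The sketch below illustrates the idea on synthetic data from `make_classification` (not the churn dataset), with one column inflated to mimic a large-range feature such as total charges. In the notebook itself the scaler would presumably replace `"passthrough"` for the numeric columns, which is an assumption rather than something the original pipeline does.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data, for illustration only
X_demo, y_demo = make_classification(n_samples=500, n_features=6, random_state=0)
X_demo[:, 0] *= 5000.0  # mimic a column with a much larger range

# Standardize features before the logistic regression
scaled_lr = Pipeline(
    steps=[
        ("scaler", StandardScaler()),
        ("model", LogisticRegression(max_iter=1000)),
    ]
)
scaled_lr.fit(X_demo, y_demo)

# Number of solver iterations actually used
iterations_used = int(scaled_lr.named_steps["model"].n_iter_[0])
print(iterations_used)
```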


In [124]:
y_pred = lr_pipeline.predict(X_val)
y_prob = lr_pipeline.predict_proba(X_val)[:, 1]


## Model Predictions

- `y_pred`: Binary churn predictions for the validation set  
- `y_prob`: Predicted probability of churn for each customer  

Predicted probabilities are used for ROC-AUC evaluation and business decision analysis.
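
To illustrate how probabilities feed business decisions, the sketch below thresholds a few made-up probabilities at the conventional 0.5 cut-off and at a lower, hypothetical cut-off of 0.4 that flags more customers as likely churners. The threshold value is an assumption for illustration, not a recommendation from this analysis.

```python
import numpy as np

# Hypothetical churn probabilities, for illustration only
probs = np.array([0.20, 0.45, 0.60, 0.90])

# Conventional cut-off at 0.5
default_pred = (probs >= 0.5).astype(int)

# A lower cut-off flags more customers (higher recall, lower precision)
recall_oriented_pred = (probs >= 0.4).astype(int)

print(default_pred.tolist(), recall_oriented_pred.tolist())
```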


In [125]:
# import warnings
# warnings.filterwarnings(
#     "ignore",
#     category=UserWarning,
#     module="sklearn.preprocessing"
# )


In [126]:
from sklearn.metrics import classification_report

print(classification_report(y_val, y_pred))


              precision    recall  f1-score   support

           0       0.85      0.90      0.87      1035
           1       0.66      0.57      0.61       374

    accuracy                           0.81      1409
   macro avg       0.76      0.73      0.74      1409
weighted avg       0.80      0.81      0.81      1409



## Classification Metrics

The following metrics are used to evaluate the model:
- **Precision:** Share of predicted churners who actually churn (low precision means many false alarms)
- **Recall:** Measures the ability to identify actual churners
- **F1-score:** Balances precision and recall
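
A small worked example may help make these definitions concrete. The labels below are made up (2 true positives, 1 false positive, 1 false negative) and are not taken from the validation set.

```python
from sklearn.metrics import f1_score, precision_score, recall_score

# Toy labels, for illustration only: 1 = churn, 0 = no churn
y_true_demo = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred_demo = [1, 1, 0, 1, 0, 0, 0, 0]

# precision = TP / (TP + FP), recall = TP / (TP + FN)
precision = precision_score(y_true_demo, y_pred_demo)
recall = recall_score(y_true_demo, y_pred_demo)

# F1 is the harmonic mean of precision and recall
f1 = f1_score(y_true_demo, y_pred_demo)

print(precision, recall, f1)
```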


In [127]:
from sklearn.metrics import roc_auc_score

roc_auc = roc_auc_score(y_val, y_prob)
roc_auc


np.float64(0.8554625539280271)

In [128]:
feature_names = (
    lr_pipeline.named_steps["preprocessor"]
    .get_feature_names_out()
)


In [129]:
import numpy as np

coefficients = lr_pipeline.named_steps["model"].coef_[0]

coef_df = pd.DataFrame({
    "Feature": feature_names,
    "Coefficient": coefficients,
    "Odds_Ratio": np.exp(coefficients)
}).sort_values(by="Odds_Ratio", ascending=False)

coef_df.head(10)


| | Feature | Coefficient | Odds_Ratio |
|---|---|---|---|
| 7 | cat__Internet_Service_Fiber optic | 0.581874 | 1.789389 |
| 6 | cat__Multiple_Lines_Yes | 0.372497 | 1.451354 |
| 23 | cat__Paperless_Billing_Yes | 0.338471 | 1.402801 |
| 2 | cat__Partner_Yes | 0.314351 | 1.36937 |
| 18 | cat__Streaming_TV_Yes | 0.302599 | 1.353371 |
| 20 | cat__Streaming_Movies_Yes | 0.287401 | 1.332959 |
| 25 | cat__Payment_Method_Electronic check | 0.229971 | 1.258564 |
| 5 | cat__Multiple_Lines_No phone service | 0.189283 | 1.208383 |
| 14 | cat__Device_Protection_Yes | 0.042659 | 1.043582 |
| 28 | num__Monthly_Charges | 0.03567 | 1.036314 |


## Logistic Regression Coefficient Analysis

To interpret the baseline Logistic Regression model, the learned coefficients are analyzed after feature encoding.

### Interpretation Logic
- **Coefficient** gives the direction and strength of a feature's association with churn, as a change in the log-odds.
- **Odds Ratio = exp(coefficient)** provides a business-friendly interpretation:
  - Odds Ratio **> 1** → increases the odds of churn  
  - Odds Ratio **< 1** → decreases the odds of churn  
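
As a quick check of the conversion, the snippet below exponentiates the `cat__Internet_Service_Fiber optic` coefficient reported in the table above. Under the usual reading, the result means the odds of churn are multiplied by roughly that factor relative to the dropped reference category, holding the other features fixed.

```python
import math

# Coefficient for cat__Internet_Service_Fiber optic, copied from the table
coefficient = 0.581874

# Odds ratio is the exponential of the log-odds coefficient
odds_ratio = math.exp(coefficient)

print(round(odds_ratio, 4))
```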

### Methodology
- Extract coefficients from the trained Logistic Regression model.
- Align coefficients with feature names generated after One-Hot Encoding.
- Sort features by Odds Ratio in descending order.

### Outcome
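The table above lists the **10 features with the highest odds ratios**, i.e. the strongest churn-increasing associations in this model. Protective factors (odds ratio below 1) sit at the bottom of the sorted table and are not shown by `head(10)`. Numeric features were not scaled, so their coefficients are per-unit effects and are not directly comparable in magnitude with the one-hot encoded features.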
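
To view churn drivers and protective factors together, one option is to rank by absolute coefficient instead of by odds ratio. The sketch below uses a made-up coefficient table; the feature names `f_a` to `f_d` and the `Abs_Coefficient` column are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical coefficients, for illustration only
demo = pd.DataFrame({
    "Feature": ["f_a", "f_b", "f_c", "f_d"],
    "Coefficient": [0.58, -1.20, 0.04, -0.30],
})
demo["Odds_Ratio"] = np.exp(demo["Coefficient"])

# Rank by magnitude so strong protective factors are not pushed to the bottom
demo["Abs_Coefficient"] = demo["Coefficient"].abs()
ranked = demo.sort_values(by="Abs_Coefficient", ascending=False)

print(ranked[["Feature", "Coefficient", "Odds_Ratio"]])
```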


In [130]:
# Cross-Validation Strategy
from sklearn.model_selection import StratifiedKFold

cv = StratifiedKFold(
    n_splits=5,
    shuffle=True,
    random_state=42
)


## Random Forest Model (Baseline)

In [131]:
# Model Pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline

rf_pipeline = Pipeline(
    steps=[
        ("preprocessor", preprocessor),
        ("model", RandomForestClassifier(
            random_state=42,
            n_jobs=-1
        ))
    ]
)


In [132]:
rf_pipeline.fit(X_train, y_train)


In [133]:
import joblib

joblib.dump(
    rf_pipeline,
    "/content/drive/MyDrive/ML_internship_projects/ConnectTel-churn-prediction/churn_artifacts/rf_pipeline.pkl"
)


['/content/drive/MyDrive/ML_internship_projects/ConnectTel-churn-prediction/churn_artifacts/rf_pipeline.pkl']

In [134]:
# Cross-Validated Performance
from sklearn.model_selection import cross_val_score

rf_auc = cross_val_score(
    rf_pipeline,
    X,
    y,
    cv=cv,
    scoring="roc_auc"
)

rf_auc_mean = rf_auc.mean()
rf_auc_mean


np.float64(0.8456343469667376)

CustomerID was removed from modeling as it is a unique identifier with no predictive value and causes high-cardinality encoding issues.
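
One simple way to spot identifier-like columns is to compare each column's number of unique values with the row count. The sketch below does this on a tiny made-up frame; the rule "unique ratio equals 1" is an illustrative heuristic, not a check the notebook performs.

```python
import pandas as pd

# Toy frame, for illustration only
demo = pd.DataFrame({
    "CustomerID": ["A1", "B2", "C3", "D4"],
    "Contract": ["Month-to-month", "One year", "Month-to-month", "Two year"],
})

# Fraction of distinct values per column
unique_ratio = demo.nunique() / len(demo)

# Columns where every row has its own value behave like identifiers
id_like_cols = unique_ratio[unique_ratio == 1.0].index.tolist()

print(id_like_cols)
```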


In [135]:
# for col in categorical_cols:
#     unseen = set(X_val[col]) - set(X_train[col])
#     if unseen:
#         print(f"{col}: {unseen}")


## XGBoost Hyperparameter Tuning

In [136]:
from sklearn.model_selection import GridSearchCV
from xgboost import XGBClassifier
from sklearn.pipeline import Pipeline

xgb_pipeline = Pipeline(
    steps=[
        ("preprocessor", preprocessor),
        ("model", XGBClassifier(
            random_state=42,
            eval_metric="logloss"
        ))
    ]
)

xgb_param_grid = {
    "model__n_estimators": [300, 500],
    "model__max_depth": [3, 5],
    "model__learning_rate": [0.05, 0.1],
    "model__subsample": [0.8, 1.0]
}

xgb_grid = GridSearchCV(
    xgb_pipeline,
    param_grid=xgb_param_grid,
    scoring="roc_auc",
    cv=cv,
    n_jobs=-1
)

xgb_grid.fit(X, y)

xgb_grid.best_score_


np.float64(0.8618985174598596)

## XGBoost Cross-Validated Performance

After hyperparameter tuning and removal of high-cardinality identifiers, the XGBoost model achieves a mean ROC-AUC of **0.862** using stratified K-Fold cross-validation.

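The score is only slightly higher than Random Forest (0.846, cross-validated) and the Logistic Regression baseline (0.855, single validation split), so it is consistent with, rather than proof of, XGBoost capturing non-linear relationships and feature interactions relevant to customer churn. Because `best_score_` is the maximum over the grid on the same folds, it is also mildly optimistic.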
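
When judging small differences in mean ROC-AUC, the fold-to-fold spread is useful context. The sketch below reads it from `cv_results_` after a grid search on synthetic data. `LogisticRegression` stands in for `XGBClassifier` purely to keep the example free of extra dependencies, and the grid over `C` is hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Synthetic stand-in data, for illustration only
X_demo, y_demo = make_classification(n_samples=400, n_features=8, random_state=0)

cv_demo = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

grid = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid={"C": [0.1, 1.0]},
    scoring="roc_auc",
    cv=cv_demo,
)
grid.fit(X_demo, y_demo)

# Mean and standard deviation across folds for the best parameter setting
best_mean = grid.cv_results_["mean_test_score"][grid.best_index_]
best_std = grid.cv_results_["std_test_score"][grid.best_index_]

print(f"{best_mean:.3f} +/- {best_std:.3f}")
```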


In [137]:
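# Model comparison
# Note: roc_auc is the Logistic Regression score on the single validation split,
# whereas the other two values are 5-fold cross-validated means.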
comparison_df = pd.DataFrame({
    "Model": [
        "Logistic Regression (Baseline)",
        "Random Forest",
        "XGBoost (Tuned)"
    ],
    "CV ROC-AUC": [
        roc_auc,
        rf_auc_mean,
        xgb_grid.best_score_
    ]
}).sort_values(by="CV ROC-AUC", ascending=False)

comparison_df



| | Model | CV ROC-AUC |
|---|---|---|
| 2 | XGBoost (Tuned) | 0.861899 |
| 0 | Logistic Regression (Baseline) | 0.855463 |
| 1 | Random Forest | 0.845634 |


*Even small ROC-AUC gains can matter when the model is used to rank a large customer base.*

*With ROC-AUC as the primary metric, XGBoost achieved the best score, followed by Logistic Regression and Random Forest. The Random Forest and XGBoost scores are 5-fold stratified cross-validation means, while the Logistic Regression score comes from the single validation split, so the ranking is indicative only. The result suggests that linear relationships explain much of the churn behavior, with non-linear interactions captured by gradient boosting adding a modest improvement.*
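
For a like-for-like comparison, the logistic regression pipeline could be scored with the same stratified folds as the other models. The sketch below shows the pattern on synthetic data; `lr_demo`, `cv_demo`, and the added `StandardScaler` step are assumptions for illustration, not the notebook's objects.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data, for illustration only
X_demo, y_demo = make_classification(n_samples=400, n_features=8, random_state=0)

# Same fold configuration as the notebook's cv object
cv_demo = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

lr_demo = Pipeline(
    steps=[
        ("scaler", StandardScaler()),
        ("model", LogisticRegression(max_iter=1000)),
    ]
)

# One ROC-AUC value per fold, then the mean used for comparison
lr_auc = cross_val_score(lr_demo, X_demo, y_demo, cv=cv_demo, scoring="roc_auc")
lr_auc_mean = lr_auc.mean()

print(lr_auc_mean)
```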

In [138]:
# Final model section
best_model = xgb_grid.best_estimator_


In [139]:
model = best_model.named_steps["model"]

In [140]:
import os, joblib

ARTIFACTS_DIR = "/content/drive/MyDrive/ML_internship_projects/ConnectTel-churn-prediction/churn_artifacts"
os.makedirs(ARTIFACTS_DIR, exist_ok=True)

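# Persist the tuned pipeline together with the validation split.
# Note: xgb_grid was fitted on the full X and y, so best_model has already seen
# X_val; the files named *_test.pkl are therefore not an untouched test set.
joblib.dump(best_model, f"{ARTIFACTS_DIR}/churn_pipeline.pkl")
joblib.dump(X_val, f"{ARTIFACTS_DIR}/X_test.pkl")
joblib.dump(y_val, f"{ARTIFACTS_DIR}/y_test.pkl")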

os.listdir(ARTIFACTS_DIR)



['y_test.pkl', 'X_test.pkl', 'churn_pipeline.pkl', 'rf_pipeline.pkl']

In [141]:
best_model.named_steps


{'preprocessor': ColumnTransformer(transformers=[('cat',
                                  OneHotEncoder(drop='first',
                                                handle_unknown='ignore'),
                                  Index(['Gender', 'Senior_Citizen', 'Partner', 'Dependents', 'Phone_Service',
        'Multiple_Lines', 'Internet_Service', 'Online_Security',
        'Online_Backup', 'Device_Protection', 'Tech_Support', 'Streaming_TV',
        'Streaming_Movies', 'Contract', 'Paperless_Billing', 'Payment_Method'],
       dtype='object')),
                                 ('num', 'passthrough',
                                  Index(['Tenure_Months', 'Monthly_Charges', 'Total_Charges', 'CLTV',
        'TotalChargesPerTenure', 'ServiceCount', 'IsLongTermContract'],
       dtype='object'))]),
 'model': XGBClassifier(base_score=None, booster=None, callbacks=None,
               colsample_bylevel=None, colsample_bynode=None,
               colsample_bytree=None, device=None, early_s

In [143]:
# type(best_model)


In [144]:
# type(best_model.named_steps["model"])


In [145]:
type(best_model)
X_val.shape
y_val.shape


(1409,)
