<div style="border:solid green 2px; padding: 20px">
<b>Iteration - 1</b><br>
<b>Reviewer’s comments</b><br>


  Hello Collin!,
  
I am Alexangel, your reviewer in this first iteration,
  
Another project successfully completed - well done! 🏆 Your consistent effort and progress are truly commendable.

Our team is here to help you keep pushing forward and honing your skills as you advance through the program.

You can contact me via Tripleten Hub for further feedback. This information is described below.

**What Was Great**:
- Excellent job on following the structure of the project.
- You’ve shown strong skills in testing the models in this project.

**Tips for Future Projects**:
- Consider adding brief comments after the analysis or graph of every dataframe to make your work even more integral.

Congratulations again on your accomplishment! Each project you complete adds to your growing expertise, and it’s exciting to see you make such great strides. Keep up the great work! 🎯

------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

***Name of the reviewer***: Alexangel Bracho

***Reviewer's Tripleten Hub  link*** : [reviewer's link](https://hub.tripleten.com/u/6b1cbe37)


## Introduction

The telecom operator Interconnect has contracted me to create a machine learning model to predict customer churn. In this notebook, I will create and test multiple models to solve this task.

<div class="alert alert-block alert-success">
<b>Reviewer's comment</b> <a class="tocSkip"></a>

Good preview about what the project will be about in general terms.

</div>

In [1]:
# Import Libraries

import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.experimental import enable_hist_gradient_boosting
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.model_selection import RandomizedSearchCV, StratifiedKFold
from sklearn.metrics import roc_auc_score
from catboost import CatBoostClassifier
from xgboost import XGBClassifier

In [2]:
# Load Datasets
contract = pd.read_csv('/datasets/final_provider/contract.csv')
personal = pd.read_csv('/datasets/final_provider/personal.csv')
internet = pd.read_csv('/datasets/final_provider/internet.csv')
phone = pd.read_csv('/datasets/final_provider/phone.csv')

In [3]:
# Merging DataFrames together
df = pd.merge(contract, personal, on='customerID', how='left')
df = pd.merge(df, internet, on='customerID', how='left')  
df = pd.merge(df, phone, on='customerID', how='left')

print(df.info())
df.head()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 7043 entries, 0 to 7042
Data columns (total 20 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   customerID        7043 non-null   object 
 1   BeginDate         7043 non-null   object 
 2   EndDate           7043 non-null   object 
 3   Type              7043 non-null   object 
 4   PaperlessBilling  7043 non-null   object 
 5   PaymentMethod     7043 non-null   object 
 6   MonthlyCharges    7043 non-null   float64
 7   TotalCharges      7043 non-null   object 
 8   gender            7043 non-null   object 
 9   SeniorCitizen     7043 non-null   int64  
 10  Partner           7043 non-null   object 
 11  Dependents        7043 non-null   object 
 12  InternetService   5517 non-null   object 
 13  OnlineSecurity    5517 non-null   object 
 14  OnlineBackup      5517 non-null   object 
 15  DeviceProtection  5517 non-null   object 
 16  TechSupport       5517 non-null   object 


Unnamed: 0,customerID,BeginDate,EndDate,Type,PaperlessBilling,PaymentMethod,MonthlyCharges,TotalCharges,gender,SeniorCitizen,Partner,Dependents,InternetService,OnlineSecurity,OnlineBackup,DeviceProtection,TechSupport,StreamingTV,StreamingMovies,MultipleLines
0,7590-VHVEG,2020-01-01,No,Month-to-month,Yes,Electronic check,29.85,29.85,Female,0,Yes,No,DSL,No,Yes,No,No,No,No,
1,5575-GNVDE,2017-04-01,No,One year,No,Mailed check,56.95,1889.5,Male,0,No,No,DSL,Yes,No,Yes,No,No,No,No
2,3668-QPYBK,2019-10-01,2019-12-01 00:00:00,Month-to-month,Yes,Mailed check,53.85,108.15,Male,0,No,No,DSL,Yes,Yes,No,No,No,No,No
3,7795-CFOCW,2016-05-01,No,One year,No,Bank transfer (automatic),42.3,1840.75,Male,0,No,No,DSL,Yes,No,Yes,Yes,No,No,
4,9237-HQITU,2019-09-01,2019-11-01 00:00:00,Month-to-month,Yes,Electronic check,70.7,151.65,Female,0,No,No,Fiber optic,No,No,No,No,No,No,No


<div class="alert alert-block alert-success">
<b>Reviewer's comment</b> <a class="tocSkip"></a>

Good work with the preview of the data.

</div>

In [4]:
# Fill internet service columns
internet_columns = ['InternetService', 'OnlineSecurity', 'OnlineBackup', 
                   'DeviceProtection', 'TechSupport', 'StreamingTV', 'StreamingMovies']

for col in internet_columns:
    df[col] = df[col].fillna('No internet service')

# Fill phone service column
df['MultipleLines'] = df['MultipleLines'].fillna('No phone service')

# Verify the changes
print("\nAfter filling missing values:")
print(df.isnull().sum())


After filling missing values:
customerID          0
BeginDate           0
EndDate             0
Type                0
PaperlessBilling    0
PaymentMethod       0
MonthlyCharges      0
TotalCharges        0
gender              0
SeniorCitizen       0
Partner             0
Dependents          0
InternetService     0
OnlineSecurity      0
OnlineBackup        0
DeviceProtection    0
TechSupport         0
StreamingTV         0
StreamingMovies     0
MultipleLines       0
dtype: int64


### Feature Engineering

In [5]:
# Converting Total Charges to float
df['TotalCharges'] = df['TotalCharges'].replace(' ', np.nan)
df['TotalCharges'] = df['TotalCharges'].astype(float)
print(df['TotalCharges'].dtype)
print(df['TotalCharges'].isna().sum())

float64
11


In [6]:
# Create churn target variable
df['Churn'] = df['EndDate'] != 'No'
print("Churn distribution:")
print(df['Churn'].value_counts())

Churn distribution:
False    5174
True     1869
Name: Churn, dtype: int64


In [7]:
df['BeginDate'] = pd.to_datetime(df['BeginDate'])
print(df['BeginDate'].dtype)

datetime64[ns]


In [8]:
# Converting Begin Date to numerical features
reference_date = df['BeginDate'].max()
print(f"Reference date: {reference_date}")

Reference date: 2020-02-01 00:00:00


In [9]:
df['tenure_days'] = (reference_date - df['BeginDate']).dt.days
df['tenure_months'] = df['tenure_days'] / 30.44  # average days per month
df['tenure_years'] = df['tenure_days'] / 365.25  # accounting for leap years

In [10]:
df.head()

Unnamed: 0,customerID,BeginDate,EndDate,Type,PaperlessBilling,PaymentMethod,MonthlyCharges,TotalCharges,gender,SeniorCitizen,...,OnlineBackup,DeviceProtection,TechSupport,StreamingTV,StreamingMovies,MultipleLines,Churn,tenure_days,tenure_months,tenure_years
0,7590-VHVEG,2020-01-01,No,Month-to-month,Yes,Electronic check,29.85,29.85,Female,0,...,Yes,No,No,No,No,No phone service,False,31,1.018397,0.084873
1,5575-GNVDE,2017-04-01,No,One year,No,Mailed check,56.95,1889.5,Male,0,...,No,Yes,No,No,No,No,False,1036,34.034166,2.836413
2,3668-QPYBK,2019-10-01,2019-12-01 00:00:00,Month-to-month,Yes,Mailed check,53.85,108.15,Male,0,...,Yes,No,No,No,No,No,True,123,4.040736,0.336756
3,7795-CFOCW,2016-05-01,No,One year,No,Bank transfer (automatic),42.3,1840.75,Male,0,...,No,Yes,Yes,No,No,No phone service,False,1371,45.039422,3.753593
4,9237-HQITU,2019-09-01,2019-11-01 00:00:00,Month-to-month,Yes,Electronic check,70.7,151.65,Female,0,...,No,No,No,No,No,No,True,153,5.026281,0.418891


### Preparing DataFrame for Machine Learning

In [11]:
# Train test split 70/15/15

features = df.drop(['customerID', 'BeginDate', 'EndDate', 'Churn'], axis=1)
target = df['Churn']

features_train, features_temp, target_train, target_temp = train_test_split(features, target, test_size=0.3, random_state=42, stratify=target)

features_val, features_test, target_val, target_test = train_test_split(features_temp, target_temp, test_size=0.5, random_state=42, stratify=target_temp)

In [12]:
print("Train:", len(features_train))
print("Validation:", len(features_val))
print("Test:", len(features_test))

Train: 4930
Validation: 1056
Test: 1057


In [13]:
features_train_enc = pd.get_dummies(features_train, drop_first=True)
features_val_enc = pd.get_dummies(features_val, drop_first=True)
features_test_enc = pd.get_dummies(features_test, drop_first=True)

In [14]:
features_train_enc.head()

Unnamed: 0,MonthlyCharges,TotalCharges,SeniorCitizen,tenure_days,tenure_months,tenure_years,Type_One year,Type_Two year,PaperlessBilling_Yes,PaymentMethod_Credit card (automatic),...,DeviceProtection_No internet service,DeviceProtection_Yes,TechSupport_No internet service,TechSupport_Yes,StreamingTV_No internet service,StreamingTV_Yes,StreamingMovies_No internet service,StreamingMovies_Yes,MultipleLines_No phone service,MultipleLines_Yes
5557,80.2,384.25,0,153,5.026281,0.418891,0,0,0,0,...,0,0,0,0,0,1,0,0,0,0
2270,86.85,220.95,1,184,6.044678,0.503765,0,0,1,0,...,0,1,0,0,0,1,0,0,0,0
6930,75.15,216.75,0,184,6.044678,0.503765,0,0,1,1,...,0,0,0,0,0,0,0,0,0,1
2257,80.55,4847.05,0,1826,59.986859,4.999316,1,0,0,1,...,0,1,0,1,0,1,0,1,0,1
898,98.9,1120.95,0,457,15.013141,1.251198,0,0,1,0,...,0,0,0,1,0,1,0,1,0,0


In [15]:
# Standard Scaler

scaler = StandardScaler()

features_train_scaled = scaler.fit_transform(features_train_enc)
features_val_scaled = scaler.fit_transform(features_val_enc)
features_test_scaled = scaler.fit_transform(features_test_enc)

#### Attempted to use SMOTE to balance classes, but permissions don't allow to install imblearn library, will continue with model based weight adjustments

<div class="alert alert-block alert-success">
<b>Reviewer's comment</b> <a class="tocSkip"></a>

Good work with preparing the dataframe for the models.

</div>

## HistGradientBoostingClassifier

In [16]:
# Base model
base = HistGradientBoostingClassifier(
    loss="binary_crossentropy",
    random_state=42,
    early_stopping=True,               
    validation_fraction=0.1,
    n_iter_no_change=20,
    scoring="roc_auc"                  
)

In [17]:
# Hyperparameter search space (compact but effective)
param_dist = {
    "learning_rate": np.logspace(-2.5, -0.3, 12),
    "max_depth": [None, 3, 5, 7, 9],
    "max_leaf_nodes": [15, 31, 63, 127],
    "min_samples_leaf": [5, 10, 20, 50, 100],
    "l2_regularization": np.logspace(-4, 1, 10),
    "max_bins": [127, 255],            # 255 can help with many unique values
}

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

search = RandomizedSearchCV(
    estimator=base,
    param_distributions=param_dist,
    n_iter=40,
    scoring="roc_auc",
    n_jobs=-1,
    cv=cv,
    refit=True,
    random_state=42,
    verbose=0
)

In [18]:
X_train = features_train_scaled
y_train = target_train
X_val   = features_val_scaled
y_val   = target_val
X_test  = features_test_scaled
y_test  = target_test

In [19]:
# Tune on training data
search.fit(X_train, y_train)
best_model = search.best_estimator_

In [20]:
# Validate
val_proba = best_model.predict_proba(X_val)[:, 1]
val_auc = roc_auc_score(y_val, val_proba)
print(f"Validation ROC-AUC (tuned HGB): {val_auc:.4f}")
print("Best params:", search.best_params_)

Validation ROC-AUC (tuned HGB): 0.9252
Best params: {'min_samples_leaf': 100, 'max_leaf_nodes': 15, 'max_depth': 7, 'max_bins': 255, 'learning_rate': 0.5011872336272722, 'l2_regularization': 0.016681005372000592}


In [21]:
# Retrain on train+val with the best hyperparameters, then evaluate on test
from sklearn.utils import shuffle
X_tv = np.vstack([X_train, X_val])
y_tv = np.concatenate([y_train, y_val])
X_tv, y_tv = shuffle(X_tv, y_tv, random_state=42)

final_model = HistGradientBoostingClassifier(
    **{**best_model.get_params(), "early_stopping": True, "scoring": "roc_auc", "random_state": 42}
)
final_model.fit(X_tv, y_tv)

test_proba = final_model.predict_proba(X_test)[:, 1]
test_auc = roc_auc_score(y_test, test_proba)
print(f"Test ROC-AUC (final HGB): {test_auc:.4f}")

Test ROC-AUC (final HGB): 0.9073


<div class="alert alert-block alert-success">
<b>Reviewer's comment</b> <a class="tocSkip"></a>

Good work with the fine tuning of the `HistGradientBoostingClassifier` model.

</div>

## CatBoost

In [22]:
# Base model
clf = CatBoostClassifier(
    loss_function="Logloss",
    eval_metric="AUC",          
    iterations=5000,            
    learning_rate=0.03,
    depth=6,
    l2_leaf_reg=3.0,
    random_strength=1.0,
    border_count=128,
    auto_class_weights="Balanced",   
    random_seed=42,
    verbose=200,
    use_best_model=True
)


In [23]:
# Train the model
clf.fit(
    features_train_scaled, target_train,
    eval_set=(features_val_scaled, target_val),
    early_stopping_rounds=300
)

0:	test: 0.8541720	best: 0.8541720 (0)	total: 47.8ms	remaining: 3m 58s
200:	test: 0.8784334	best: 0.8786497 (192)	total: 358ms	remaining: 8.54s
400:	test: 0.8860963	best: 0.8860963 (400)	total: 668ms	remaining: 7.66s
600:	test: 0.8920885	best: 0.8923279 (585)	total: 983ms	remaining: 7.19s
800:	test: 0.8977541	best: 0.8977541 (800)	total: 1.3s	remaining: 6.82s
1000:	test: 0.9003176	best: 0.9003912 (992)	total: 1.62s	remaining: 6.46s
1200:	test: 0.9022275	best: 0.9023104 (1190)	total: 1.93s	remaining: 6.11s
1400:	test: 0.9039994	best: 0.9042296 (1375)	total: 2.24s	remaining: 5.77s
1600:	test: 0.9047220	best: 0.9047450 (1573)	total: 2.56s	remaining: 5.43s
1800:	test: 0.9056655	best: 0.9057898 (1796)	total: 2.88s	remaining: 5.11s
2000:	test: 0.9066136	best: 0.9066228 (1997)	total: 3.19s	remaining: 4.78s
2200:	test: 0.9068023	best: 0.9070186 (2183)	total: 3.5s	remaining: 4.46s
2400:	test: 0.9069265	best: 0.9070186 (2183)	total: 3.82s	remaining: 4.13s
Stopped by overfitting detector  (300 it

<catboost.core.CatBoostClassifier at 0x7f74e09ab190>

In [24]:
# Evaluate ROC-AUC on validation and test
val_pred = clf.predict_proba(features_val_scaled)[:, 1]
test_pred = clf.predict_proba(features_test_scaled)[:, 1]

val_auc = roc_auc_score(target_val, val_pred)
test_auc = roc_auc_score(target_test, test_pred)

print(f"Validation ROC-AUC (CatBoost): {val_auc:.4f}")
print(f"Test ROC-AUC (CatBoost): {test_auc:.4f}")

Validation ROC-AUC (CatBoost): 0.9070
Test ROC-AUC (CatBoost): 0.9078


<div class="alert alert-block alert-success">
<b>Reviewer's comment</b> <a class="tocSkip"></a>

Good work with the fine tuning of the `CatBoost` model.

</div>

## XGBoost

In [25]:
# Handle class imbalance
pos = np.sum(target_train == 1)
neg = np.sum(target_train == 0)
spw = neg / max(pos, 1)  # avoid div by zero

In [26]:
# Base model
xgb = XGBClassifier(
    n_estimators=5000,          
    learning_rate=0.03,
    max_depth=6,
    min_child_weight=1.0,
    subsample=0.8,
    colsample_bytree=0.8,
    reg_lambda=1.0,
    reg_alpha=0.0,
    gamma=0.0,
    objective="binary:logistic",
    eval_metric="auc",          
    tree_method="hist",         
    random_state=42,
    scale_pos_weight=spw,       
    n_jobs=-1
)

In [27]:
# Train model
xgb.fit(
    features_train_scaled, target_train,
    eval_set=[(features_val_scaled, target_val)],
    early_stopping_rounds=300,
    verbose=200
)

[0]	validation_0-auc:0.79122




[200]	validation_0-auc:0.85594
[400]	validation_0-auc:0.86713
[600]	validation_0-auc:0.87393
[800]	validation_0-auc:0.87947
[1000]	validation_0-auc:0.87997
[1200]	validation_0-auc:0.88366
[1400]	validation_0-auc:0.88170
[1500]	validation_0-auc:0.88217


XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=0.8,
              enable_categorical=False, eval_metric='auc', gamma=0.0, gpu_id=-1,
              importance_type=None, interaction_constraints='',
              learning_rate=0.03, max_delta_step=0, max_depth=6,
              min_child_weight=1.0, missing=nan, monotone_constraints='()',
              n_estimators=5000, n_jobs=-1, num_parallel_tree=1,
              predictor='auto', random_state=42, reg_alpha=0.0, reg_lambda=1.0,
              scale_pos_weight=2.7691131498470947, subsample=0.8,
              tree_method='hist', validate_parameters=1, verbosity=None)

In [28]:
# Evaluate ROC-AUC on validation and test
val_proba = xgb.predict_proba(features_val_scaled)[:, 1]
test_proba = xgb.predict_proba(features_test_scaled)[:, 1]

val_auc = roc_auc_score(target_val, val_proba)
test_auc = roc_auc_score(target_test, test_proba)

print(f"Validation ROC-AUC (XGBoost): {val_auc:.4f}")
print(f"Test ROC-AUC (XGBoost): {test_auc:.4f}")

Validation ROC-AUC (XGBoost): 0.8837
Test ROC-AUC (XGBoost): 0.8488


<div class="alert alert-block alert-success">
<b>Reviewer's comment</b> <a class="tocSkip"></a>

Good work with the fine tuning of the `XGBoost` model.

</div>

# Conclusions

The telecom operator Interconnect has contracted me to create a machine learning model to predict customer churn. After training multiple models, here are the results along with next steps:

1. Models tested:
   - HistGradientBoostingClassifier
   - CatBoost
   - XGBoost
2. All models produced an AUC-ROC score above the 0.88 threshold. HistGradientBoostingClassifier produced the highest scores on both the validation and test sets because of the use of parameter distributions to find the best model. This did not have a significant effect on training time.
3. With access to GPU hardware acceleration, training, accuracy, and hyperparameter tuning could be shorter and more extensive to produce the highest quality model.

<div class="alert alert-block alert-success">
<b>Reviewer's comment</b> <a class="tocSkip"></a>

Good conclusions.

</div>