# TELCO CUSTOMER CHURN PREDICTIVE MODELING

**Author**: Devin Vliet  
**Date**: 24 September, 2024

---

### Agenda:
1. Introduction
2. Data Exploration
3. Model Development
4. Evaluation
5. Conclusion

## Introduction:
This project is designed around the Telco Customer Churn data from Kaggle, an online dataset that is free to use. The goal is to build a machine learning model that predicts customer churn robustly. To do this, I will take a wide-lens approach: start with a large selection of models and sampling techniques, then incrementally narrow down to a concise set of predictive models. ROC-AUC is my main evaluation metric, and the target score is 0.80.

---
## Data Exploration
I will begin by reading in the data from the Kaggle file and examining it for preprocessing. This includes examining data types, checking for nulls, and determining the categorical feature types (ordinal, nominal, binary) for encoding.


In [316]:
#Reading in Data from kaggle
import pandas as pd
data = pd.read_csv('WA_Fn-UseC_-Telco-Customer-Churn.csv')
pd.set_option('display.max_colwidth', None)
data

Unnamed: 0,customerID,gender,SeniorCitizen,Partner,Dependents,tenure,PhoneService,MultipleLines,InternetService,OnlineSecurity,...,DeviceProtection,TechSupport,StreamingTV,StreamingMovies,Contract,PaperlessBilling,PaymentMethod,MonthlyCharges,TotalCharges,Churn
0,7590-VHVEG,Female,0,Yes,No,1,No,No phone service,DSL,No,...,No,No,No,No,Month-to-month,Yes,Electronic check,29.85,29.85,No
1,5575-GNVDE,Male,0,No,No,34,Yes,No,DSL,Yes,...,Yes,No,No,No,One year,No,Mailed check,56.95,1889.5,No
2,3668-QPYBK,Male,0,No,No,2,Yes,No,DSL,Yes,...,No,No,No,No,Month-to-month,Yes,Mailed check,53.85,108.15,Yes
3,7795-CFOCW,Male,0,No,No,45,No,No phone service,DSL,Yes,...,Yes,Yes,No,No,One year,No,Bank transfer (automatic),42.30,1840.75,No
4,9237-HQITU,Female,0,No,No,2,Yes,No,Fiber optic,No,...,No,No,No,No,Month-to-month,Yes,Electronic check,70.70,151.65,Yes
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
7038,6840-RESVB,Male,0,Yes,Yes,24,Yes,Yes,DSL,Yes,...,Yes,Yes,Yes,Yes,One year,Yes,Mailed check,84.80,1990.5,No
7039,2234-XADUH,Female,0,Yes,Yes,72,Yes,Yes,Fiber optic,No,...,Yes,No,Yes,Yes,One year,Yes,Credit card (automatic),103.20,7362.9,No
7040,4801-JZAZL,Female,0,Yes,Yes,11,No,No phone service,DSL,Yes,...,No,No,No,No,Month-to-month,Yes,Electronic check,29.60,346.45,No
7041,8361-LTMKD,Male,1,Yes,No,4,Yes,Yes,Fiber optic,No,...,No,No,No,No,Month-to-month,Yes,Mailed check,74.40,306.6,Yes


In [317]:
#A closer look
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7043 entries, 0 to 7042
Data columns (total 21 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   customerID        7043 non-null   object 
 1   gender            7043 non-null   object 
 2   SeniorCitizen     7043 non-null   int64  
 3   Partner           7043 non-null   object 
 4   Dependents        7043 non-null   object 
 5   tenure            7043 non-null   int64  
 6   PhoneService      7043 non-null   object 
 7   MultipleLines     7043 non-null   object 
 8   InternetService   7043 non-null   object 
 9   OnlineSecurity    7043 non-null   object 
 10  OnlineBackup      7043 non-null   object 
 11  DeviceProtection  7043 non-null   object 
 12  TechSupport       7043 non-null   object 
 13  StreamingTV       7043 non-null   object 
 14  StreamingMovies   7043 non-null   object 
 15  Contract          7043 non-null   object 
 16  PaperlessBilling  7043 non-null   object 


In [318]:
#Checking for null-entries
data.isnull().sum()

customerID          0
gender              0
SeniorCitizen       0
Partner             0
Dependents          0
tenure              0
PhoneService        0
MultipleLines       0
InternetService     0
OnlineSecurity      0
OnlineBackup        0
DeviceProtection    0
TechSupport         0
StreamingTV         0
StreamingMovies     0
Contract            0
PaperlessBilling    0
PaymentMethod       0
MonthlyCharges      0
TotalCharges        0
Churn               0
dtype: int64
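
Note that `isnull()` only counts true missing values (`NaN`/`None`), so an `object` column that stores numbers as text can still hide blank entries. The sketch below uses a small made-up frame (the values are illustrative, not rows from the Telco file) to show one way to surface them.

```python
import pandas as pd

# Toy frame (hypothetical values): "charges" is stored as text and
# contains one whitespace-only entry.
toy = pd.DataFrame({
    "tenure": [1, 34, 0],
    "charges": ["29.85", "1889.5", " "],
})

# isnull() reports nothing, because " " is a valid string.
null_counts = toy.isnull().sum()

# Count whitespace-only strings in the text column instead.
blank_count = toy["charges"].str.strip().eq("").sum()

# Coercing to numeric turns the blank entry into NaN, which isnull() can see.
coerced = pd.to_numeric(toy["charges"], errors="coerce")

print(null_counts)
print(blank_count)
print(coerced.isna().sum())
```

Running a check like this early makes it easier to decide where in the workflow the blanks should be converted and imputed.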

In [319]:
#Removing customerID, which offers nothing towards prediction
data = data.drop(columns="customerID")
data.head()

Unnamed: 0,gender,SeniorCitizen,Partner,Dependents,tenure,PhoneService,MultipleLines,InternetService,OnlineSecurity,OnlineBackup,DeviceProtection,TechSupport,StreamingTV,StreamingMovies,Contract,PaperlessBilling,PaymentMethod,MonthlyCharges,TotalCharges,Churn
0,Female,0,Yes,No,1,No,No phone service,DSL,No,Yes,No,No,No,No,Month-to-month,Yes,Electronic check,29.85,29.85,No
1,Male,0,No,No,34,Yes,No,DSL,Yes,No,Yes,No,No,No,One year,No,Mailed check,56.95,1889.5,No
2,Male,0,No,No,2,Yes,No,DSL,Yes,Yes,No,No,No,No,Month-to-month,Yes,Mailed check,53.85,108.15,Yes
3,Male,0,No,No,45,No,No phone service,DSL,Yes,No,Yes,Yes,No,No,One year,No,Bank transfer (automatic),42.3,1840.75,No
4,Female,0,No,No,2,Yes,No,Fiber optic,No,No,No,No,No,No,Month-to-month,Yes,Electronic check,70.7,151.65,Yes


### Encoding features
---
Examining the categorical features: first, the yes/no features are transformed into binary encodings, where "Yes" maps to 1 (True) and "No" maps to 0 (False).

In [320]:
import pandas as pd

# avoiding downcasting error messages
pd.set_option('future.no_silent_downcasting', True)

# Function to determine which columns have binary classes ('Yes' and 'No')
def has_yes_no(col):
    return col.isin(['Yes', 'No']).all()

# Apply the function to each column
yes_no_columns = data.apply(has_yes_no)

# Keep only the columns whose values are all 'Yes'/'No' ('gender' is handled separately below)
yes_no_columns = yes_no_columns[yes_no_columns].index.tolist() 

# Replace 'Yes'/'No' with numerical values
data[yes_no_columns] = data[yes_no_columns].replace({'Yes': 1, 'No': 0})

# Replace 'Female'/'Male' with numerical values in 'gender' column
data["gender"] = data["gender"].replace({'Female': 1, 'Male': 0})

# Explicitly convert these columns to numeric types (int)
data[yes_no_columns] = data[yes_no_columns].astype(int)
data["gender"] = data["gender"].astype(int)

# Display the data
data.head()


Unnamed: 0,gender,SeniorCitizen,Partner,Dependents,tenure,PhoneService,MultipleLines,InternetService,OnlineSecurity,OnlineBackup,DeviceProtection,TechSupport,StreamingTV,StreamingMovies,Contract,PaperlessBilling,PaymentMethod,MonthlyCharges,TotalCharges,Churn
0,1,0,1,0,1,0,No phone service,DSL,No,Yes,No,No,No,No,Month-to-month,1,Electronic check,29.85,29.85,0
1,0,0,0,0,34,1,No,DSL,Yes,No,Yes,No,No,No,One year,0,Mailed check,56.95,1889.5,0
2,0,0,0,0,2,1,No,DSL,Yes,Yes,No,No,No,No,Month-to-month,1,Mailed check,53.85,108.15,1
3,0,0,0,0,45,0,No phone service,DSL,Yes,No,Yes,Yes,No,No,One year,0,Bank transfer (automatic),42.3,1840.75,0
4,1,0,0,0,2,1,No,Fiber optic,No,No,No,No,No,No,Month-to-month,1,Electronic check,70.7,151.65,1


Clearly, not all of the features that contain "Yes" and "No" entries were converted. Since there are no nulls in the data, these features most likely contain additional categories. Let's take a closer look at the features in question:

In [321]:
#Columns in question
MultipleLines_val = data["MultipleLines"].unique()
OnlineSecurity_val = data["OnlineSecurity"].unique()
OnlineBackup_val = data["OnlineBackup"].unique()
DeviceProtection_val = data["DeviceProtection"].unique()
TechSupport_val = data["TechSupport"].unique()
StreamingTV_val = data["StreamingTV"].unique()
StreamingMovies_val = data["StreamingMovies"].unique()
PaymentMethod_val = data["PaymentMethod"].unique()

columns_dict = {
    "MultipleLines": MultipleLines_val,
    "OnlineSecurity": OnlineSecurity_val,
    "OnlineBackup": OnlineBackup_val,
    "DeviceProtection": DeviceProtection_val,
    "TechSupport": TechSupport_val,
    "StreamingTV": StreamingTV_val,
    "StreamingMovies": StreamingMovies_val,
    "PaymentMethod": PaymentMethod_val
}
#generating a list of the categorical columns for my encoder
columns_list = list(columns_dict.keys())

#Printing for visualization
for col_name, col_values in columns_dict.items():
    print(f"{col_name} values: {col_values}\n")

MultipleLines values: ['No phone service' 'No' 'Yes']

OnlineSecurity values: ['No' 'Yes' 'No internet service']

OnlineBackup values: ['Yes' 'No' 'No internet service']

DeviceProtection values: ['No' 'Yes' 'No internet service']

TechSupport values: ['No' 'Yes' 'No internet service']

StreamingTV values: ['No' 'Yes' 'No internet service']

StreamingMovies values: ['No' 'Yes' 'No internet service']

PaymentMethod values: ['Electronic check' 'Mailed check' 'Bank transfer (automatic)'
 'Credit card (automatic)']



### Ordinal vs. non-ordinal
---
Ordinal categorical features specify a hierarchy within the data, for example freshman, sophomore, junior, senior. Encoding such features calls for a different approach than one-hot encoding: the encoded values should preserve the hierarchy present within the feature. That being said, the features displayed above are mostly binary; "No internet service" and "No phone service" essentially act as an extension of "No". To handle this, I will replace those entries with "No" and encode them as zeros. PaymentMethod is a nominal feature, so I will use one-hot encoding for it. The two features I treat as ordinal are InternetService and Contract, and I will use ordinal encoding for them.

In [322]:
#Handling additional binary encodings
secondary_binary = ["OnlineSecurity", "OnlineBackup","DeviceProtection", "TechSupport","StreamingTV","StreamingMovies"]

#replacing no internet service with "No"
for element in secondary_binary:
    data[element] = data[element].replace("No internet service", 'No')

data["MultipleLines"] = data["MultipleLines"].replace("No phone service", "No")

data[secondary_binary] = data[secondary_binary].replace({'Yes': 1, 'No': 0})
data["MultipleLines"] = data["MultipleLines"].replace({'Yes': 1, 'No': 0})
data.head()

Unnamed: 0,gender,SeniorCitizen,Partner,Dependents,tenure,PhoneService,MultipleLines,InternetService,OnlineSecurity,OnlineBackup,DeviceProtection,TechSupport,StreamingTV,StreamingMovies,Contract,PaperlessBilling,PaymentMethod,MonthlyCharges,TotalCharges,Churn
0,1,0,1,0,1,0,0,DSL,0,1,0,0,0,0,Month-to-month,1,Electronic check,29.85,29.85,0
1,0,0,0,0,34,1,0,DSL,1,0,1,0,0,0,One year,0,Mailed check,56.95,1889.5,0
2,0,0,0,0,2,1,0,DSL,1,1,0,0,0,0,Month-to-month,1,Mailed check,53.85,108.15,1
3,0,0,0,0,45,0,0,DSL,1,0,1,1,0,0,One year,0,Bank transfer (automatic),42.3,1840.75,0
4,1,0,0,0,2,1,0,Fiber optic,0,0,0,0,0,0,Month-to-month,1,Electronic check,70.7,151.65,1


In [323]:
#importing ordinal encoder from scikit-learn
from sklearn.preprocessing import OrdinalEncoder

#defining the categorical ordering for the encoder
internet_ordering = [['No', 'DSL', 'Fiber optic']]
contract_ordering = [['Month-to-month', 'One year', 'Two year']]

# Initialize and fit the OrdinalEncoder for each ordinal feature
encoder = OrdinalEncoder(categories=internet_ordering)
data['InternetService'] = encoder.fit_transform(data[['InternetService']])

encoder = OrdinalEncoder(categories=contract_ordering)
data['Contract'] = encoder.fit_transform(data[['Contract']])

# Display the transformed DataFrame
data.head()

Unnamed: 0,gender,SeniorCitizen,Partner,Dependents,tenure,PhoneService,MultipleLines,InternetService,OnlineSecurity,OnlineBackup,DeviceProtection,TechSupport,StreamingTV,StreamingMovies,Contract,PaperlessBilling,PaymentMethod,MonthlyCharges,TotalCharges,Churn
0,1,0,1,0,1,0,0,1.0,0,1,0,0,0,0,0.0,1,Electronic check,29.85,29.85,0
1,0,0,0,0,34,1,0,1.0,1,0,1,0,0,0,1.0,0,Mailed check,56.95,1889.5,0
2,0,0,0,0,2,1,0,1.0,1,1,0,0,0,0,0.0,1,Mailed check,53.85,108.15,1
3,0,0,0,0,45,0,0,1.0,1,0,1,1,0,0,1.0,0,Bank transfer (automatic),42.3,1840.75,0
4,1,0,0,0,2,1,0,2.0,0,0,0,0,0,0,0.0,1,Electronic check,70.7,151.65,1


In [324]:
#Importing one-hot encoder from sklearn
from sklearn.preprocessing import OneHotEncoder

#Initialize encoder
encoder = OneHotEncoder(handle_unknown = 'ignore', sparse_output = False).set_output(transform = 'pandas')

#transforming the data
encoder_transform = encoder.fit_transform(data[["PaymentMethod"]])

#display the data
encoder_transform.head()

Unnamed: 0,PaymentMethod_Bank transfer (automatic),PaymentMethod_Credit card (automatic),PaymentMethod_Electronic check,PaymentMethod_Mailed check
0,0.0,0.0,1.0,0.0
1,0.0,0.0,0.0,1.0
2,0.0,0.0,0.0,1.0
3,1.0,0.0,0.0,0.0
4,0.0,0.0,1.0,0.0


In [325]:
#Now let's combine these dataframes and drop every column in columns_list
#(PaymentMethod plus the seven service columns inspected above)
dataset = pd.concat([data, encoder_transform], axis=1).drop(columns=columns_list)

#display the data
dataset.head()

Unnamed: 0,gender,SeniorCitizen,Partner,Dependents,tenure,PhoneService,InternetService,Contract,PaperlessBilling,MonthlyCharges,TotalCharges,Churn,PaymentMethod_Bank transfer (automatic),PaymentMethod_Credit card (automatic),PaymentMethod_Electronic check,PaymentMethod_Mailed check
0,1,0,1,0,1,0,1.0,0.0,1,29.85,29.85,0,0.0,0.0,1.0,0.0
1,0,0,0,0,34,1,1.0,1.0,0,56.95,1889.5,0,0.0,0.0,0.0,1.0
2,0,0,0,0,2,1,1.0,0.0,1,53.85,108.15,1,0.0,0.0,0.0,1.0
3,0,0,0,0,45,0,1.0,1.0,0,42.3,1840.75,0,1.0,0.0,0.0,0.0
4,1,0,0,0,2,1,2.0,0.0,1,70.7,151.65,1,0.0,0.0,1.0,0.0


### Final check for nulls and data types
---
Checking once again for nulls that the encoding process may have generated, and checking data types again. At the end of this preprocessing all features should be numeric, but it is possible that some features were not transformed correctly.

In [275]:
#Checking for null values
has_nulls = dataset.isnull().sum().any()
has_nulls

False

In [326]:
#Checking all columns are numeric
data_types = dataset.dtypes
data_types

gender                                       int64
SeniorCitizen                                int64
Partner                                      int64
Dependents                                   int64
tenure                                       int64
PhoneService                                 int64
InternetService                            float64
Contract                                   float64
PaperlessBilling                             int64
MonthlyCharges                             float64
TotalCharges                                object
Churn                                        int64
PaymentMethod_Bank transfer (automatic)    float64
PaymentMethod_Credit card (automatic)      float64
PaymentMethod_Electronic check             float64
PaymentMethod_Mailed check                 float64
dtype: object

In [327]:
#Let's convert TotalCharges to a numeric column; entries that cannot be parsed become NaN
dataset["TotalCharges"] = pd.to_numeric(dataset['TotalCharges'], errors='coerce')

dataset.dtypes

gender                                       int64
SeniorCitizen                                int64
Partner                                      int64
Dependents                                   int64
tenure                                       int64
PhoneService                                 int64
InternetService                            float64
Contract                                   float64
PaperlessBilling                             int64
MonthlyCharges                             float64
TotalCharges                               float64
Churn                                        int64
PaymentMethod_Bank transfer (automatic)    float64
PaymentMethod_Credit card (automatic)      float64
PaymentMethod_Electronic check             float64
PaymentMethod_Mailed check                 float64
dtype: object

In [329]:
#Checking for Null once again
dataset.isnull().sum()

gender                                      0
SeniorCitizen                               0
Partner                                     0
Dependents                                  0
tenure                                      0
PhoneService                                0
InternetService                             0
Contract                                    0
PaperlessBilling                            0
MonthlyCharges                              0
TotalCharges                               11
Churn                                       0
PaymentMethod_Bank transfer (automatic)     0
PaymentMethod_Credit card (automatic)       0
PaymentMethod_Electronic check              0
PaymentMethod_Mailed check                  0
dtype: int64

### Splitting into target
---
Now that the "TotalCharges" feature has been converted, we see that there are 11 null entries. Since I am going to use cross-validation for my baseline evaluations, I will impute these missing values in a pipeline within each fold to avoid leakage across folds. Because that is handled down the line, we can now split the data into our target variable, Churn, and the rest of the dataset.

In [330]:
#First let's separate the data into our input data and our target variable
X = dataset.drop(columns="Churn")
y = dataset['Churn']

#Display the shape of input and target
print(f'shape of X: {X.shape}\n')
print(f'shape of y: {y.shape}\n')

shape of X: (7043, 15)

shape of y: (7043,)
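
To make the leakage point concrete, here is a minimal sketch (toy numbers, illustrative names) of what fitting the imputer inside a fold means: the fill value is learned from the training rows only and then reused on the held-out rows.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy single-feature data with a missing value in each part of the fold.
X_train_fold = np.array([[10.0], [20.0], [np.nan], [30.0]])
X_val_fold = np.array([[np.nan], [1000.0]])

imputer = SimpleImputer(strategy="mean")

# fit_transform learns the mean from the training rows only.
X_train_filled = imputer.fit_transform(X_train_fold)

# transform reuses that training mean; the large validation value never influences it.
X_val_filled = imputer.transform(X_val_fold)

print(imputer.statistics_)
print(X_val_filled.ravel())
```

Had the imputer been fitted on all rows at once, the validation value would have shifted the fill value used for the training data.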



### Train, Validation, and Test
---
I want to split the data into three distinct sets: train, validation, and test. Because there is a class imbalance, it is also important to preserve the class proportions within each split; for the model to generalize well, each set needs to be representative of the whole dataset. It is incredibly important to avoid data leakage at this stage, so I will first split off the test set and then split the remaining data into train and validation.

In [332]:
from sklearn.model_selection import train_test_split

#Now we will split our data with stratification in order to preserve class distribution
X_temp, X_test, y_temp, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)

#Splitting into train and validation
X_train, X_val, y_train, y_val = train_test_split(X_temp, y_temp, test_size=0.25, stratify=y_temp, random_state=42)

print("Class proportions among splits:\n")
#Churn is encoded as 0/1, so the mean of each target is its positive-class proportion
y_train.mean(), y_val.mean(), y_test.mean()


Class proportions among splits:



(0.26532544378698225, 0.2654364797728886, 0.2654364797728886)
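
The two-step split can be wrapped in a small helper. The sketch below is one way to do it; `three_way_split` and its arguments are hypothetical names, and the data is synthetic rather than the Telco set.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split


def three_way_split(X, y, test_size=0.2, val_size=0.2, seed=42):
    """Stratified train/validation/test split; sizes are fractions of the full data."""
    # Hold out the test set first so it never influences later steps.
    X_temp, X_test, y_temp, y_test = train_test_split(
        X, y, test_size=test_size, stratify=y, random_state=seed
    )
    # Rescale so the validation share is expressed relative to the remaining data.
    relative_val = val_size / (1.0 - test_size)
    X_train, X_val, y_train, y_val = train_test_split(
        X_temp, y_temp, test_size=relative_val, stratify=y_temp, random_state=seed
    )
    return (X_train, y_train), (X_val, y_val), (X_test, y_test)


# Synthetic data with a 1:3 class imbalance.
rng = np.random.default_rng(0)
X_demo = pd.DataFrame({"feature": rng.normal(size=1000)})
y_demo = pd.Series([1] * 250 + [0] * 750, name="Churn")

(train_X, train_y), (val_X, val_y), (test_X, test_y) = three_way_split(X_demo, y_demo)
print(train_y.mean(), val_y.mean(), test_y.mean())
```

With the defaults this reproduces the 60/20/20 layout used above: 20% is held out for testing, and 25% of the remainder becomes the validation set.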

### Storing data
---
Now that we have split our data in a manner that avoids direct data leakage, and each split has the same class proportions, let's save these splits to CSV files to further prevent any tampering. We will save them as train_data, val_data, and test_data respectively. We will NOT touch the test data until our final evaluation.
This is not technically necessary, but it gives me peace of mind and helps me be explicit about which set I am using.

In [333]:
# Combine features and labels for each split
train_data = pd.concat([X_train, y_train], axis=1)
val_data = pd.concat([X_val, y_val], axis=1)
test_data = pd.concat([X_test, y_test], axis=1)

#Now we will save this data as csv for later use
train_data.to_csv('train_data.csv', index=False)
val_data.to_csv('val_data.csv', index=False)
test_data.to_csv('test_data.csv', index=False)


## Model Development
---
For my baseline models, I will import and initialize a variety of models frequently used for classification problems, and specifically for churn. I will then initialize a variety of sampling techniques to address the class imbalance, train each model with each sampling technique, and store the results. This gives me direction on which models are viable and which sampling technique works best for each model.


In [281]:
# Import necessary libraries for bagging
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

# Import necessary libraries for ensemble models
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier

# Import necessary libraries for gradient boosting
from sklearn.ensemble import GradientBoostingClassifier
from xgboost import XGBClassifier

In [282]:
# Initialize base model (Decision Tree) and Bagging model
base_model = DecisionTreeClassifier(random_state=42)
bagging_model = BaggingClassifier(estimator=base_model, n_estimators=100, random_state=42)

# Initialize base models for stacking
base_models = [
    ('svc', SVC(probability=True, random_state=42)),
    ('rf', RandomForestClassifier(n_estimators=100, random_state=42))
]

# Define the meta-model (e.g., Logistic Regression)
meta_model = LogisticRegression()

# Initialize stacking classifier
stacking_model = StackingClassifier(estimators=base_models, final_estimator=meta_model, cv=5)

# Initialize logistic regression model
logistic_model = LogisticRegression(random_state=42, max_iter=1000)

# Initialize gradient boosting model
gb_model = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1, random_state=42)

# Initialize random forest model
rf_model = RandomForestClassifier(n_estimators=100, random_state=42)

# Initialize support vector machine model
svm_model = SVC(kernel='linear', C=1.0, probability=True, random_state=42)

# Initialize xgboost model
xg_model = XGBClassifier(objective = 'binary:logistic',
                         eval_metric = 'auc')


In [283]:
#importing libraries for stratified k-folds
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score
import numpy as np


In [284]:
#Importing sampling packages
from imblearn.over_sampling import SMOTE
from imblearn.over_sampling import BorderlineSMOTE
from imblearn.combine import SMOTEENN

#creating instances
smote = SMOTE(sampling_strategy='minority', random_state=42)
borderline_smote = BorderlineSMOTE(sampling_strategy='minority', kind='borderline-1', random_state=42)
smote_enn = SMOTEENN(sampling_strategy='minority', random_state=42)
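
For intuition about what these samplers do: SMOTE creates each synthetic minority point by interpolating between a minority sample and one of its nearest minority-class neighbours. The following is a simplified NumPy illustration of that idea on random data. It is not imbalanced-learn's implementation, and `smote_like_samples` is a hypothetical helper.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors


def smote_like_samples(X_minority, n_new, k=5, seed=42):
    """Generate n_new synthetic points by interpolating between minority neighbours."""
    rng = np.random.default_rng(seed)
    # Ask for k + 1 neighbours because each point is its own nearest neighbour.
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_minority)
    neighbour_idx = nn.kneighbors(X_minority, return_distance=False)[:, 1:]

    # Pick a random minority point and one of its k neighbours for each new sample.
    base = rng.integers(0, len(X_minority), size=n_new)
    chosen = neighbour_idx[base, rng.integers(0, k, size=n_new)]

    # Place the new point somewhere on the segment between the two.
    gap = rng.random((n_new, 1))
    return X_minority[base] + gap * (X_minority[chosen] - X_minority[base])


rng = np.random.default_rng(0)
X_min = rng.normal(size=(20, 2))   # stand-in for the minority class
synthetic = smote_like_samples(X_min, n_new=30)
print(synthetic.shape)
```

Because the interpolation works on distances between points, features on very different scales would dominate the neighbour search, which is one reason to scale before resampling.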

### Pipeline for Baseline Evaluation
---
I will initialize a dictionary to store scoring metrics, and two lists holding the models and sampling techniques. I find it cleaner and more concise to write a function that handles training and evaluating every model and sampling technique combination in one fell swoop, instead of training and evaluating each model separately.

In [285]:
#Preparing our pipeline

# Dictionary to store results
results = {
    'model': [],
    'dataset': [],
    'accuracy': [],
    'precision': [],
    'recall': [],
    'f1': [],
    'roc_auc': []
}

#Sampler list
sampler_list = [
    ('SMOTE', smote),
    ('BORDERLINE_SMOTE', borderline_smote),
    ('SMOTE_ENN', smote_enn),
    (None, None)
]


model_list = [
    ('Bagging', bagging_model),
    ('Stacking', stacking_model),
    ('Logistic Regression', logistic_model),
    ('Gradient Boosting', gb_model),
    ('Random Forest', rf_model),
    ('SVM', svm_model),
    ('XGBOOST', xg_model)
]



### Evaluating Functions
Now we will define a function that takes a model, a dataset, and a sampler, then trains and evaluates the model using stratified k-fold cross-validation. It returns a dictionary storing the average scores for accuracy, precision, recall, F1-score, and ROC-AUC. The goal is to simplify the process of training a variety of models while enforcing clean data management. After designing this function, we will implement another function to quickly train and evaluate all the proposed models and store their results in a DataFrame.

In [286]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, Normalizer
from sklearn.impute import SimpleImputer


def evaluate_model(model, X, y, sampler, splits=5):
    # Initialize StratifiedKFold
    skf = StratifiedKFold(n_splits=splits, shuffle=True, random_state=42)
    
    # Lists to hold the results
    accuracies = []
    precisions = []
    recalls = []
    f1_scores = []
    roc_aucs = []
    
    # Pull the model name
    model_name = model.__class__.__name__
    
    # Define a Pipeline that handles preprocessing
    preprocessing_pipeline = Pipeline(steps=[
        ('imputer', SimpleImputer(strategy='mean')),
        ('scaler', StandardScaler()),
        ('normalizer', Normalizer()),
    ])
    
    # Stratified K-Fold Cross-validation
    for train_idx, val_idx in skf.split(X, y):
        X_train, X_val = X.iloc[train_idx], X.iloc[val_idx]
        y_train, y_val = y.iloc[train_idx], y.iloc[val_idx]
        
        # Apply preprocessing
        X_train_preprocessed = preprocessing_pipeline.fit_transform(X_train)
        X_val_preprocessed = preprocessing_pipeline.transform(X_val)
        
        # Apply resampling only to the training data, after preprocessing:
        # these sampling techniques are sensitive to scaling and don't handle NaN values
        if sampler is not None:
            X_train_resampled, y_train_resampled = sampler.fit_resample(X_train_preprocessed, y_train)
        else:
            X_train_resampled, y_train_resampled = X_train_preprocessed, y_train
        
        # Train the model
        model.fit(X_train_resampled, y_train_resampled)
        
        # Predict and evaluate on the validation set
        y_pred = model.predict(X_val_preprocessed)
        y_proba = model.predict_proba(X_val_preprocessed)[:, 1]  # For ROC-AUC, use predicted probabilities
        
        # Append metrics for each fold
        accuracies.append(accuracy_score(y_val, y_pred))
        precisions.append(precision_score(y_val, y_pred))
        recalls.append(recall_score(y_val, y_pred))
        f1_scores.append(f1_score(y_val, y_pred))
        roc_aucs.append(roc_auc_score(y_val, y_proba))
    
    # Average metrics across folds
    avg_accuracy = np.mean(accuracies)
    avg_precision = np.mean(precisions)
    avg_recall = np.mean(recalls)
    avg_f1 = np.mean(f1_scores)
    avg_roc_auc = np.mean(roc_aucs)
    
    # Return a dictionary of results
    results_dict = {
        'model': model_name,
        'sampler': sampler.__class__.__name__ if sampler else 'None',  # Include sampler class name for clarity
        'accuracy': avg_accuracy,
        'precision': avg_precision,
        'recall': avg_recall,
        'f1_score': avg_f1,
        'roc_auc': avg_roc_auc
    }
    return results_dict
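
For comparison, scikit-learn's `cross_validate` can collect the same five metrics in one call when no resampling step is needed. The sketch below runs on synthetic data from `make_classification` rather than the Telco set, so the numbers it prints say nothing about the models in this project.

```python
from sklearn.datasets import make_classification
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced binary problem (assumption: stands in for the real data).
X_demo, y_demo = make_classification(
    n_samples=500, n_features=8, weights=[0.75, 0.25], random_state=42
)

# Preprocessing lives inside the pipeline, so it is refit on every training fold.
pipe = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="mean")),
    ("scaler", StandardScaler()),
    ("model", LogisticRegression(max_iter=1000)),
])

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scoring = ["accuracy", "precision", "recall", "f1", "roc_auc"]
scores = cross_validate(pipe, X_demo, y_demo, cv=skf, scoring=scoring)

# Average each metric across the five folds.
mean_scores = {name: scores[f"test_{name}"].mean() for name in scoring}
print(mean_scores)
```

A plain scikit-learn `Pipeline` cannot hold a resampler; imbalanced-learn provides its own `Pipeline` class for that case, which is one reason a hand-written loop like the one above is a reasonable choice here.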

In [287]:
def helper_function(model_list, sampler_list, X, y, skf=5):
    '''
    This function takes a list of models structured as tuples
    and a list of samplers structured as tuples, 
    and implements the evaluate function for each sampler-model 
    pair, returning a DataFrame of the results for evaluation.
    '''
    # Initialize return storage
    model_names = []
    sampler_names = []
    accuracy_store = []
    precision_store = []
    recall_store = []
    f1_store = []
    roc_auc_store = []

    # Loop through each model
    for model_name, model in model_list:
        for sampler_name, sampler in sampler_list:
            # Store model name and sampler name for each iteration
            model_names.append(model_name)
            sampler_names.append(sampler_name)
            
            # Make copies of X and y for sampling
            X_copy, y_copy = X.copy(), y.copy()
            
            # Call the evaluate function and get results
            results_dict = evaluate_model(model, X_copy, y_copy, sampler=sampler, splits=skf)

            # Storing the results in the respective lists
            accuracy_store.append(results_dict['accuracy'])
            precision_store.append(results_dict['precision'])
            recall_store.append(results_dict['recall'])
            f1_store.append(results_dict['f1_score'])
            roc_auc_store.append(results_dict['roc_auc'])

    # Create a DataFrame to store the results
    results_df = pd.DataFrame({
        'Sampler': sampler_names,
        'Model': model_names,
        'Accuracy': accuracy_store,
        'Precision': precision_store,
        'Recall': recall_store,
        'F1-score': f1_store,
        'ROC-AUC': roc_auc_store
    })

    return results_df
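
A slightly more compact way to assemble the same kind of table is to collect one dictionary per model-sampler pair and let pandas build the frame. This sketch uses a stand-in scoring function (`fake_evaluate`, hypothetical) that returns placeholder zeros so it runs on its own; it only illustrates the bookkeeping.

```python
from itertools import product

import pandas as pd


def fake_evaluate(model_name, sampler_name):
    """Stand-in for a real evaluation; returns placeholder scores, not results."""
    return {"accuracy": 0.0, "roc_auc": 0.0}


model_names = ["Logistic Regression", "Random Forest"]
sampler_names = ["SMOTE", None]

records = []
# product() yields every model-sampler combination, replacing the nested loops.
for model_name, sampler_name in product(model_names, sampler_names):
    row = {"Sampler": sampler_name, "Model": model_name}
    row.update(fake_evaluate(model_name, sampler_name))
    records.append(row)

# One row per combination; column order follows the dictionary keys.
results_df = pd.DataFrame(records)
print(results_df)
```

Keeping each row as a single dictionary means a new metric only has to be added in one place.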


### Display Results:
---
Now that we have our evaluation functions defined, let's evaluate all the proposed models with every sampling technique on the training data. The resulting DataFrame stores each model-sampler combination and its respective scores.

In [288]:
#Reading in our data
training_data = pd.read_csv("train_data.csv")

#Splitting Data
X, y = training_data.drop(columns = "Churn", axis = 1), training_data["Churn"]

#Pulling scores
eval_data = helper_function(model_list, X = X, y = y, sampler_list= sampler_list , skf = 5)

#Displaying Scores
eval_data.sort_values(by = 'ROC-AUC', ascending=False)

Unnamed: 0,Sampler,Model,Accuracy,Precision,Recall,F1-score,ROC-AUC
15,,Gradient Boosting,0.8,0.660854,0.511206,0.574846,0.841162
8,SMOTE,Logistic Regression,0.741775,0.509121,0.801087,0.62232,0.841123
11,,Logistic Regression,0.794556,0.640346,0.517452,0.571536,0.840395
20,SMOTE,SVM,0.733964,0.499738,0.824278,0.622096,0.840335
9,BORDERLINE_SMOTE,Logistic Regression,0.726627,0.491549,0.827857,0.616653,0.840176
22,SMOTE_ENN,SVM,0.731834,0.497324,0.824274,0.620193,0.839312
12,SMOTE,Gradient Boosting,0.758107,0.530518,0.777909,0.630516,0.83894
10,SMOTE_ENN,Logistic Regression,0.725207,0.48984,0.824278,0.614341,0.838657
14,SMOTE_ENN,Gradient Boosting,0.746982,0.515479,0.799294,0.62651,0.837973
6,SMOTE_ENN,Stacking,0.754556,0.526122,0.770762,0.625087,0.835724
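
The table shows ROC-AUC barely moving across samplers while recall and precision shift considerably. One likely reason is that ROC-AUC depends only on how the predicted probabilities rank the customers, whereas precision and recall depend on where the 0.5 cut-off falls. A small sketch with made-up scores illustrates the difference:

```python
import numpy as np
from sklearn.metrics import recall_score, roc_auc_score

# Made-up labels and predicted probabilities (not model output).
y_true = np.array([0, 0, 0, 0, 1, 1, 0, 1])
proba = np.array([0.05, 0.10, 0.20, 0.30, 0.35, 0.45, 0.40, 0.80])

# A monotonic transform pushes scores upward but keeps their ranking intact.
shifted = np.sqrt(proba)

# ROC-AUC only looks at the ranking, so it is unaffected by the transform.
auc_original = roc_auc_score(y_true, proba)
auc_shifted = roc_auc_score(y_true, shifted)

# Recall uses a fixed 0.5 threshold, so it changes when scores move past it.
recall_original = recall_score(y_true, (proba >= 0.5).astype(int))
recall_shifted = recall_score(y_true, (shifted >= 0.5).astype(int))

print(auc_original, auc_shifted)
print(recall_original, recall_shifted)
```

This is why a sampler can raise recall sharply without changing a model's position in an ROC-AUC ranking.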


### Model Selection:
---
Several factors affect model selection at this point, chiefly compute time and expected returns. SVM is very computationally expensive, and although it ranks among the best baseline scores, in my case its performance is not worth the time it takes to run, so I will omit it at this stage. Logistic regression features heavily among the top baseline scores. It is a simple and robust model that does not benefit significantly from parameter changes, and I know from experience that hyperparameter tuning techniques such as Optuna, grid search, and random search do not necessarily improve its score but can take many hours to run. For this reason I will keep it for evaluation but not tune it intensively. XGBoost, gradient boosting, and random forest benefit significantly from tuning, so even where they rank lower on the baseline scores I will select them for tuning; I do not think it is fair to judge them against logistic regression without parameter tuning. The models I will move forward with are therefore Logistic Regression, XGBoost, Random Forest, and Gradient Boosting, though later I will consider ensemble methods.

In [289]:
#sorting data
eval_data = eval_data.sort_values(by = 'ROC-AUC', ascending=False)
top_models = eval_data.loc[eval_data.groupby('Model')['ROC-AUC'].idxmax()]
top_models_sorted = top_models.sort_values(by='ROC-AUC', ascending=False)

filtered_models = top_models_sorted[top_models_sorted["Model"].isin(["Logistic Regression", "Gradient Boosting", "XGBOOST", "Random Forest"])]
filtered_models

Unnamed: 0,Sampler,Model,Accuracy,Precision,Recall,F1-score,ROC-AUC
15,,Gradient Boosting,0.8,0.660854,0.511206,0.574846,0.841162
8,SMOTE,Logistic Regression,0.741775,0.509121,0.801087,0.62232,0.841123
18,SMOTE_ENN,Random Forest,0.753373,0.525127,0.759171,0.620327,0.832299
26,SMOTE_ENN,XGBOOST,0.752426,0.52427,0.743115,0.614441,0.830674


### Hyperparameter Tuning:
---
There are many techniques for hyperparameter tuning, the most common being grid search, randomized search, and Bayesian search. I have been reading about a framework called Optuna. Optuna wraps the search in an objective function and uses suggest methods to sample candidate hyperparameters. Another reason Optuna is attractive here is pruning: all of these optimization methods take a lot of time and compute, and since I am tuning three models the cost adds up, but a pruner can stop unpromising trials early and save serious time and compute. After I have determined the optimal parameters for each model, I will train it on the whole training set. To do this I need a separate preprocessing pipeline for each model, because when evaluating on the validation and test sets I must transform them with a pipeline fit only on the training data, without refitting, to avoid data leakage. I will store each pipeline in a dictionary keyed by model, and when it comes time to evaluate I will use each model with its own pipeline.
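To make the pruning idea described above concrete, here is a simplified sketch of a median stopping rule in plain Python. It is an illustration only, not Optuna's implementation: the helper `should_prune`, its `n_startup_trials` default, and the sample scores are all hypothetical, and Optuna's own MedianPruner has additional startup and warm-up settings.

```python
from statistics import median

def should_prune(current_score, step, finished_histories, n_startup_trials=2):
    """Simplified median rule for a maximisation problem (hypothetical helper).

    current_score: running mean ROC-AUC of the trial after fold `step`
    finished_histories: per-fold running means from completed trials
    """
    # Not enough history yet: let early trials run to completion.
    if len(finished_histories) < n_startup_trials:
        return False
    # Scores that earlier trials had reached at the same fold.
    peers = [h[step] for h in finished_histories if len(h) > step]
    if not peers:
        return False
    # Prune when the trial is below the median of its peers at this step.
    return current_score < median(peers)

# Made-up running means for three completed trials (5 folds each).
history = [
    [0.83, 0.84, 0.84, 0.84, 0.84],
    [0.80, 0.81, 0.81, 0.81, 0.81],
    [0.82, 0.83, 0.83, 0.83, 0.83],
]

print(should_prune(0.79, 0, history))  # True: below the fold-0 median of 0.82
print(should_prune(0.85, 0, history))  # False: above the median, keeps running
```

Because each trial reports its running mean after every fold, a weak trial can be stopped after the first fold or two instead of paying for all five.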

In [334]:
#importing optuna for hyper parameter optimization
import optuna

#importing skf and cross val scoring for evaluation
from sklearn.model_selection import StratifiedKFold, cross_val_score

#training data for optimization
hyper_data = pd.read_csv('train_data.csv')

#Split for optimization
X, y = hyper_data.drop(columns = "Churn", axis = 1), hyper_data["Churn"]

#Initialize stratified k-folds
n = 5
skf = StratifiedKFold(n_splits = n, shuffle = True, random_state = 42)


In [291]:
'''
Objective function for Optuna hyperparameter optimization.
Optuna calls it once per trial to score a sampled set of parameters;
a pruner stops unpromising trials early to save compute.
'''

def objective(trial):
    
    #Parameter Grid
    n_estimators = trial.suggest_int('n_estimators', 100, 500)
    learning_rate = trial.suggest_float('learning_rate', 0.01, 0.5, log=True)  
    max_depth = trial.suggest_int('max_depth', 3, 20)
    min_samples_split = trial.suggest_int('min_samples_split', 2, 16)
    min_samples_leaf = trial.suggest_int('min_samples_leaf', 1, 16)
    subsample = trial.suggest_float('subsample', 0.6, 1.0)
    max_features = trial.suggest_categorical('max_features', ['sqrt', 'log2', None])
    min_impurity_decrease = trial.suggest_float('min_impurity_decrease', 0.0, 0.1)

    # Create a pipeline with imputation, scaling, normalizing, and Gradient Boosting
    pipeline = Pipeline([
        ('imputer', SimpleImputer(strategy='mean')),  
        ('scaler', StandardScaler()),
        ('normalizer', Normalizer()),  
        ('classifier', GradientBoostingClassifier(
            n_estimators=n_estimators,
            learning_rate=learning_rate,
            max_depth=max_depth,
            min_samples_split=min_samples_split,
            min_samples_leaf=min_samples_leaf,
            subsample=subsample,
            max_features=max_features,
            min_impurity_decrease=min_impurity_decrease,
            random_state=42
        ))
    ])

    # Initialize cross-validation and placeholder for scores
    skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
    scores = []

    # Perform cross-validation manually to report intermediate values
    for fold_idx, (train_idx, val_idx) in enumerate(skf.split(X, y)):
        X_train, X_val = X.iloc[train_idx], X.iloc[val_idx]
        y_train, y_val = y.iloc[train_idx], y.iloc[val_idx]

        # Fit pipeline on training data
        pipeline.fit(X_train, y_train)

        # Predict probabilities on validation data
        y_pred_proba = pipeline.predict_proba(X_val)[:, 1]

        # Calculate the ROC-AUC score
        auc = roc_auc_score(y_val, y_pred_proba)
        scores.append(auc)

        # Report intermediate value for pruning
        trial.report(np.mean(scores), fold_idx)

        # Check if the trial should be pruned
        if trial.should_prune():
            raise optuna.exceptions.TrialPruned()

    # Return the mean ROC-AUC score across all folds
    return np.mean(scores)
    
# Create an Optuna study with the TPE sampler and MedianPruner
study = optuna.create_study(direction='maximize', pruner=optuna.pruners.MedianPruner(), sampler=optuna.samplers.TPESampler())

# Optimize the objective function with a specified number of trials
study.optimize(objective, n_trials=100)

print("Best parameters:", study.best_params)
print("Best ROC-AUC score:", study.best_value)

[I 2024-09-23 22:26:12,323] A new study created in memory with name: no-name-4194ef53-1ade-42af-93cd-21efef68ffca
[I 2024-09-23 22:26:53,632] Trial 0 finished with value: 0.8366248315486395 and parameters: {'n_estimators': 449, 'learning_rate': 0.035156066452700495, 'max_depth': 3, 'min_samples_split': 9, 'min_samples_leaf': 16, 'subsample': 0.9285895961410906, 'max_features': None, 'min_impurity_decrease': 0.0825537614699766}. Best is trial 0 with value: 0.8366248315486395.
[I 2024-09-23 22:27:05,852] Trial 1 finished with value: 0.8113596447679898 and parameters: {'n_estimators': 500, 'learning_rate': 0.08154380704685057, 'max_depth': 18, 'min_samples_split': 10, 'min_samples_leaf': 3, 'subsample': 0.8422626107148343, 'max_features': 'sqrt', 'min_impurity_decrease': 0.034833068447044126}. Best is trial 0 with value: 0.8366248315486395.
[I 2024-09-23 22:28:43,047] Trial 2 finished with value: 0.79930922326855 and parameters: {'n_estimators': 438, 'learning_rate': 0.12142103054413522, 

Best parameters: {'n_estimators': 245, 'learning_rate': 0.01821929664357426, 'max_depth': 3, 'min_samples_split': 14, 'min_samples_leaf': 8, 'subsample': 0.6609048900665013, 'max_features': None, 'min_impurity_decrease': 0.09636324290341301}
Best ROC-AUC score: 0.843284210838769


In [292]:
#Storing Model
best_gbc_params = study.best_params

best_gbc_model = GradientBoostingClassifier(
    n_estimators=best_gbc_params['n_estimators'],
    learning_rate=best_gbc_params['learning_rate'],
    max_depth=best_gbc_params['max_depth'],
    min_samples_split=best_gbc_params['min_samples_split'],
    min_samples_leaf=best_gbc_params['min_samples_leaf'],
    subsample=best_gbc_params['subsample'],
    max_features=best_gbc_params['max_features'],
    min_impurity_decrease=best_gbc_params['min_impurity_decrease'],
    random_state=42
)

# Fit the preprocessing pipeline on the full training set; it is reused (transform only) at evaluation time
preprocessing_gbc = Pipeline([
    ('impute',SimpleImputer(strategy= 'mean')),
    ('scaler', StandardScaler()),
    ('normalizer', Normalizer())
])
preprocesssed_X = preprocessing_gbc.fit_transform(X)

best_gbc_model.fit(preprocesssed_X, y)
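The cell above fits the preprocessing on the training data only. The same fit-on-training, transform-on-held-out pattern can be checked on a small synthetic example. This is a minimal sketch, assuming scikit-learn and NumPy are installed; the arrays `X_train` and `X_holdout` are made up and stand in for the training and validation/test features, and the Normalizer step is omitted for brevity.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
# Synthetic stand-ins: the hold-out set is deliberately shifted.
X_train = rng.normal(loc=0.0, scale=1.0, size=(200, 3))
X_holdout = rng.normal(loc=5.0, scale=1.0, size=(50, 3))

preprocessing = Pipeline([
    ("impute", SimpleImputer(strategy="mean")),
    ("scaler", StandardScaler()),
])

# fit_transform learns imputation means and scaling statistics from training data only.
X_train_prep = preprocessing.fit_transform(X_train)
# transform reuses those statistics, so nothing is learned from the hold-out set.
X_holdout_prep = preprocessing.transform(X_holdout)

scaler = preprocessing.named_steps["scaler"]
print(np.allclose(scaler.mean_, X_train.mean(axis=0)))  # statistics match the training set
print(X_holdout_prep.mean(axis=0))  # far from 0 because the hold-out shift is preserved
```

Calling `fit_transform` on the hold-out set instead would re-centre it around zero and hide the shift, which is the kind of leakage the separate pipelines are meant to prevent.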

In [293]:
# Define the Optuna objective function with pruning
def objective(trial):
    n_estimators = trial.suggest_int('n_estimators', 100, 1000)
    learning_rate = trial.suggest_float('learning_rate', 0.01, 0.3, log=True)
    max_depth = trial.suggest_int('max_depth', 3, 15)
    min_child_weight = trial.suggest_int('min_child_weight', 1, 10)
    gamma = trial.suggest_float('gamma', 0.0, 0.5)
    subsample = trial.suggest_float('subsample', 0.6, 1.0)
    colsample_bytree = trial.suggest_float('colsample_bytree', 0.6, 1.0)

    # Preprocessing pipeline (imputation, scaling, normalizing); SMOTE-ENN and XGBoost are applied per fold below
    preprocessing_pipeline = Pipeline([
        ('imputer', SimpleImputer(strategy='mean')),
        ('scaler', StandardScaler()),
        ('normalizer', Normalizer()),
    ])
    # Stratified K-Fold Cross-validation
    roc_auc = []
    
    for fold, (train_idx, val_idx) in enumerate(skf.split(X, y)):
        X_train, X_val = X.iloc[train_idx], X.iloc[val_idx]
        y_train, y_val = y.iloc[train_idx], y.iloc[val_idx]
        
        # Apply preprocessing
        X_train_preprocessed = preprocessing_pipeline.fit_transform(X_train)
        X_val_preprocessed = preprocessing_pipeline.transform(X_val)

        #Resampling data
        X_train_resampled, y_train_resampled = smote_enn.fit_resample(X_train_preprocessed, y_train)

        #Fit model
        model = XGBClassifier(
            n_estimators=n_estimators,
            learning_rate=learning_rate,
            max_depth=max_depth,
            min_child_weight=min_child_weight,
            gamma=gamma,
            subsample=subsample,
            colsample_bytree=colsample_bytree,
            eval_metric='auc',
            random_state=42)
        
        model.fit(X_train_resampled, y_train_resampled)

        #Predict
        y_proba = model.predict_proba(X_val_preprocessed)[:, 1]

        auc = roc_auc_score(y_val, y_proba)
        roc_auc.append(auc)
        
        # Report intermediate results to Optuna for pruning
        intermediate_score = np.mean(roc_auc)
        trial.report(intermediate_score, fold)

        # Check if trial should be pruned
        if trial.should_prune():
            raise optuna.TrialPruned()

    # Perform cross-validation and calculate the mean ROC-AUC score
    score = np.mean(roc_auc)
    return score
    
# Create an Optuna study with the TPE sampler
study = optuna.create_study(direction='maximize', pruner=optuna.pruners.MedianPruner(), sampler = optuna.samplers.TPESampler())

# Optimize the objective function with a specified number of trials
study.optimize(objective, n_trials=100)

print("Best parameters:", study.best_params)
print("Best ROC-AUC score:", study.best_value)
   


[I 2024-09-23 22:42:43,470] A new study created in memory with name: no-name-738a9a03-359c-44a3-b12c-2ddcf8bdc1d4
[I 2024-09-23 22:42:46,562] Trial 0 finished with value: 0.8372833478038245 and parameters: {'n_estimators': 606, 'learning_rate': 0.02671237958767696, 'max_depth': 10, 'min_child_weight': 9, 'gamma': 0.2857145811238927, 'subsample': 0.6878968180435411, 'colsample_bytree': 0.8237811098551622}. Best is trial 0 with value: 0.8372833478038245.
[I 2024-09-23 22:42:50,125] Trial 1 finished with value: 0.8286067894432201 and parameters: {'n_estimators': 540, 'learning_rate': 0.14856334979797617, 'max_depth': 12, 'min_child_weight': 4, 'gamma': 0.1521082858971236, 'subsample': 0.7350432876625727, 'colsample_bytree': 0.6348009215295596}. Best is trial 0 with value: 0.8372833478038245.
[I 2024-09-23 22:43:07,684] Trial 2 finished with value: 0.8318252213243097 and parameters: {'n_estimators': 971, 'learning_rate': 0.03680210846613654, 'max_depth': 5, 'min_child_weight': 10, 'gamma':

Best parameters: {'n_estimators': 264, 'learning_rate': 0.011881586305027423, 'max_depth': 15, 'min_child_weight': 7, 'gamma': 0.26009231674903405, 'subsample': 0.7000308940253648, 'colsample_bytree': 0.7485955023097265}
Best ROC-AUC score: 0.8440337416609156


In [294]:
#Storing the Model
best_xgb_params = study.best_params

best_xgb_model = XGBClassifier(
    n_estimators=best_xgb_params['n_estimators'],
    learning_rate=best_xgb_params['learning_rate'],
    max_depth=best_xgb_params['max_depth'],
    min_child_weight=best_xgb_params['min_child_weight'],
    gamma=best_xgb_params['gamma'],
    subsample=best_xgb_params['subsample'],
    colsample_bytree=best_xgb_params['colsample_bytree'],
    eval_metric='auc',
    random_state=42
)
# Fit the preprocessing pipeline on the full training set; it is reused (transform only) at evaluation time
preprocessing_xgb = Pipeline([
    ('impute',SimpleImputer(strategy= 'mean')),
    ('scaler', StandardScaler()),
    ('normalizer', Normalizer())
])
preprocesssed_X = preprocessing_xgb.fit_transform(X)

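# Resample the preprocessed training data with SMOTE-ENN, then fit the tuned XGBoost model
# (the Gradient Boosting model was already fit in its own cell, so it is not refit here)
resamp_X, resamp_y = smote_enn.fit_resample(preprocesssed_X, y)
best_xgb_model.fit(resamp_X, resamp_y)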

In [295]:
# Define the Optuna objective function (with pruning) for Random Forest
def objective(trial):
    n_estimators = trial.suggest_int('n_estimators', 100, 500)  # Number of trees
    max_depth = trial.suggest_int('max_depth', 3, 20, log=True)  # Max depth of each tree
    min_samples_split = trial.suggest_int('min_samples_split', 2, 13)  # Minimum samples to split a node
    min_samples_leaf = trial.suggest_int('min_samples_leaf', 1, 13)  # Minimum samples per leaf node
    max_features = trial.suggest_categorical('max_features', ['sqrt', 'log2', None])  # Features to consider at every split
    bootstrap = trial.suggest_categorical('bootstrap', [True, False])  # Whether to use bootstrap samples
    criterion = trial.suggest_categorical('criterion', ['gini', 'entropy', 'log_loss'])  # Criterion for splitting
    max_samples = trial.suggest_float('max_samples', 0.5, 1.0) if bootstrap else None  # Fraction of samples drawn per tree (only used when bootstrap=True)
    
    
    preprocessing_pipeline = Pipeline([
        ('imputer', SimpleImputer(strategy='mean')),
        ('scaler', StandardScaler()),
        ('normalizer', Normalizer()),
    ])
    # Stratified K-Fold Cross-validation
    roc_auc = []
    
    for fold, (train_idx, val_idx) in enumerate(skf.split(X, y)):
        X_train, X_val = X.iloc[train_idx], X.iloc[val_idx]
        y_train, y_val = y.iloc[train_idx], y.iloc[val_idx]
        
        # Apply preprocessing
        X_train_preprocessed = preprocessing_pipeline.fit_transform(X_train)
        X_val_preprocessed = preprocessing_pipeline.transform(X_val)

        #Resampling data
        X_train_resampled, y_train_resampled = smote_enn.fit_resample(X_train_preprocessed, y_train)

        #Fit model
        model = RandomForestClassifier(
            n_estimators=n_estimators,
            max_depth=max_depth,
            min_samples_split=min_samples_split,
            min_samples_leaf=min_samples_leaf,
            max_features=max_features,
            bootstrap=bootstrap,
            criterion=criterion,
            max_samples=max_samples,
            random_state=42
        )
        
        model.fit(X_train_resampled, y_train_resampled)

        #Predict
        y_proba = model.predict_proba(X_val_preprocessed)[:, 1]

        auc = roc_auc_score(y_val, y_proba)
        roc_auc.append(auc)
        
        # Report intermediate results to Optuna for pruning
        intermediate_score = np.mean(roc_auc)
        trial.report(intermediate_score, fold)

        # Check if trial should be pruned
        if trial.should_prune():
            raise optuna.TrialPruned()

    # Perform cross-validation and calculate the mean ROC-AUC score
    score = np.mean(roc_auc)
    return score
    
# Create an Optuna study with the TPE sampler and a SuccessiveHalvingPruner
study = optuna.create_study(direction='maximize', pruner=optuna.pruners.SuccessiveHalvingPruner(
        min_resource=1,  # Minimum resource per trial
        reduction_factor=3  # Reduction factor
    ), sampler = optuna.samplers.TPESampler())

# Optimize the objective function with a specified number of trials
study.optimize(objective, n_trials=100)

print("Best parameters:", study.best_params)
print("Best ROC-AUC score:", study.best_value)
   


[I 2024-09-23 22:46:28,648] A new study created in memory with name: no-name-695ee825-58d8-4ed7-aa33-c382158274ad


[I 2024-09-23 22:46:58,076] Trial 0 finished with value: 0.797450111868029 and parameters: {'n_estimators': 149, 'max_depth': 8, 'min_samples_split': 11, 'min_samples_leaf': 8, 'max_features': None, 'bootstrap': False, 'criterion': 'log_loss'}. Best is trial 0 with value: 0.797450111868029.
[I 2024-09-23 22:47:57,269] Trial 1 finished with value: 0.8363981188361272 and parameters: {'n_estimators': 452, 'max_depth': 11, 'min_samples_split': 12, 'min_samples_leaf': 6, 'max_features': None, 'bootstrap': True, 'criterion': 'entropy', 'max_samples': 0.9936032724479706}. Best is trial 1 with value: 0.8363981188361272.
[I 2024-09-23 22:48:01,217] Trial 2 finished with value: 0.8424036276408646 and parameters: {'n_estimators': 114, 'max_depth': 7, 'min_samples_split': 10, 'min_samples_leaf': 10, 'max_features': 'sqrt', 'bootstrap': True, 'criterion': 'entropy', 'max_samples': 0.6554265315801853}. Best is trial 2 with value: 0.8424036276408646.
[I 2024-09-23 22:48:10,876] Trial 3 pruned. 
[I 20

Best parameters: {'n_estimators': 495, 'max_depth': 7, 'min_samples_split': 9, 'min_samples_leaf': 11, 'max_features': 'sqrt', 'bootstrap': True, 'criterion': 'entropy', 'max_samples': 0.5334758089147925}
Best ROC-AUC score: 0.8430215435932829


In [296]:
#Storing the Model
best_rf_params = study.best_params

best_rf_model = RandomForestClassifier(
        n_estimators=best_rf_params["n_estimators"],
        max_depth=best_rf_params["max_depth"],
        min_samples_split=best_rf_params["min_samples_split"],
        min_samples_leaf=best_rf_params["min_samples_leaf"],
        max_features=best_rf_params["max_features"],
        bootstrap=best_rf_params["bootstrap"],
        criterion=best_rf_params["criterion"],
        max_samples=best_rf_params.get("max_samples"),  # key is absent when bootstrap=False, so fall back to None
        random_state=42
    )
# Fit the preprocessing pipeline on the full training set; it is reused (transform only) at evaluation time
preprocessing_rf = Pipeline([
    ('impute',SimpleImputer(strategy= 'mean')),
    ('scaler', StandardScaler()),
    ('normalizer', Normalizer())
])
# Each model keeps its own pipeline, fit only on training data
preprocesssed_X = preprocessing_rf.fit_transform(X)

resampled_X, resampled_y = smote_enn.fit_resample(preprocesssed_X, y)
best_rf_model.fit(resampled_X, resampled_y)

In [297]:
# Fitting the logistic model
# Fit the preprocessing pipeline on the full training set; it is reused (transform only) at evaluation time
preprocessing_logistic = Pipeline([
    ('impute',SimpleImputer(strategy= 'mean')),
    ('scaler', StandardScaler()),
    ('normalizer', Normalizer())
])
preprocesssed_X = preprocessing_logistic.fit_transform(X)

# Only the logistic model is fit here; the Gradient Boosting model was already fit in its own cell
best_logistic_model = logistic_model.fit(preprocesssed_X, y)

#Storing each model
best_models = {
    "Logistic Regression" : best_logistic_model,
    "Gradient Boosting Classifier" : best_gbc_model,
    "XG-BOOST" : best_xgb_model,
    "Random Forest Classifier": best_rf_model
}


In [298]:
pipeline_dict = {
    "Logistic Regression" : preprocessing_logistic,
    "Gradient Boosting Classifier" : preprocessing_gbc,
    "XG-BOOST" : preprocessing_xgb,
    "Random Forest Classifier" : preprocessing_rf
}

## Evaluation
---
Now that the best models and their corresponding pipelines are each stored in a dictionary, I am going to create another evaluation function to evaluate on both the validation and test sets. This function takes a model dictionary and a corresponding pipeline dictionary and predicts on a given dataset, using each model with its corresponding preprocessing pipeline.

In [304]:
def evaluator(model_dict, val_X, val_y, pipeline_dict):
    # Score each model on (val_X, val_y) using its own fitted preprocessing pipeline
    model_names = []
    precision = []
    accuracy = []
    f1_scores = []
    roc_auc = []
    recall = []

    for name, model in model_dict.items():
        model_names.append(name)
        # Transform only: each pipeline was already fit on the training data
        pipeline = pipeline_dict[name]
        preprocesssed_X = pipeline.transform(val_X)
        predictions = model.predict(preprocesssed_X)
        probabilities = model.predict_proba(preprocesssed_X)[:,1]

        accuracy.append(accuracy_score(val_y, predictions))
        precision.append(precision_score(val_y, predictions))
        recall.append(recall_score(val_y, predictions))
        f1_scores.append(f1_score(val_y, predictions))
        roc_auc.append(roc_auc_score(val_y, probabilities))



    results_df = pd.DataFrame({
            'Model': model_names,
            'Accuracy': accuracy,
            'Precision': precision,
            'Recall': recall,
            'F1-score': f1_scores,
            'ROC-AUC': roc_auc
        })

    return results_df

#validation data
validation_data = pd.read_csv('val_data.csv')
val_X, val_y = validation_data.drop(columns = "Churn", axis = 1), validation_data["Churn"]

results = evaluator(best_models, val_X, val_y, pipeline_dict=pipeline_dict)  # using preprocessing pipelines fit on training data
results

Unnamed: 0,Model,Accuracy,Precision,Recall,F1-score,ROC-AUC
0,Logistic Regression,0.802697,0.666667,0.513369,0.58006,0.831854
1,Gradient Boosting Classifier,0.809794,0.700758,0.494652,0.579937,0.835003
2,XG-BOOST,0.740951,0.507666,0.796791,0.620187,0.829248
3,Random Forest Classifier,0.739532,0.505942,0.796791,0.618899,0.826839


Since the models' performance is satisfactory on the validation set (every ROC-AUC is above the 0.80 target), I will proceed to evaluating performance on the test set.

In [305]:
#Reading in our test data
test_data = pd.read_csv('test_data.csv')

#Split
test_X, test_y = test_data.drop(columns = "Churn", axis = 1), test_data["Churn"]

#evaluate and display results
final_results = evaluator(best_models, test_X, test_y, pipeline_dict)#using preprocessing pipeline fit on training data
final_results.sort_values(by='ROC-AUC', ascending=False)

Unnamed: 0,Model,Accuracy,Precision,Recall,F1-score,ROC-AUC
1,Gradient Boosting Classifier,0.791341,0.641844,0.483957,0.551829,0.841917
2,XG-BOOST,0.736693,0.502496,0.807487,0.619487,0.837657
3,Random Forest Classifier,0.733854,0.499171,0.804813,0.616172,0.83702
0,Logistic Regression,0.789922,0.622642,0.529412,0.572254,0.833447


## Conclusion
---


The best model is the Gradient Boosting Classifier, with a test ROC-AUC of 0.841917, though the rest of the models are close behind (0.833 to 0.838), and all four exceed the 0.80 target. Next steps to improve these models would be threshold tuning and feature engineering. There are more improvements to be made here, and I will continue to work on improving these models.
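As a rough illustration of the threshold-tuning step mentioned above, the sketch below sweeps candidate thresholds over predicted probabilities and keeps the one with the best F1-score. It is a minimal sketch on synthetic data, assuming only NumPy; `y_true`, `y_proba`, and `best_f1_threshold` are hypothetical names, and in practice the threshold would be chosen on the validation set rather than the test set.

```python
import numpy as np

def best_f1_threshold(y_true, y_proba, thresholds=None):
    """Return (threshold, f1) maximising F1 over a grid of candidate thresholds."""
    y_true = np.asarray(y_true)
    y_proba = np.asarray(y_proba)
    if thresholds is None:
        thresholds = np.linspace(0.05, 0.95, 19)
    best_t, best_f1 = 0.5, -1.0
    for t in thresholds:
        pred = (y_proba >= t).astype(int)
        tp = np.sum((pred == 1) & (y_true == 1))
        fp = np.sum((pred == 1) & (y_true == 0))
        fn = np.sum((pred == 0) & (y_true == 1))
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when there are no positives at all
        f1 = 2 * tp / denom if denom > 0 else 0.0
        if f1 > best_f1:
            best_t, best_f1 = float(t), float(f1)
    return best_t, best_f1

# Synthetic, imbalanced example (roughly a quarter positives), not the Telco data.
rng = np.random.default_rng(0)
y_true = (rng.random(1000) < 0.27).astype(int)
# Positives tend to receive higher scores than negatives.
y_proba = np.clip(0.25 + 0.3 * y_true + rng.normal(0, 0.15, 1000), 0, 1)

threshold, f1 = best_f1_threshold(y_true, y_proba)
print(f"best threshold: {threshold:.2f}, F1: {f1:.3f}")
```

Threshold tuning does not change ROC-AUC, which is computed from the probabilities themselves; it only shifts the precision/recall trade-off of the final class predictions.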