This end-to-end data science project was carried out as part of the requirements of the course *INSY-695 Advanced Topics in Information Systems* for the **McGill Master of Management Analytics** program. It is an extension of the [Telco Churn Project](https://github.com/McGill-MMA-EnterpriseAnalytics/MMA4_ChurnPrediction/blob/main/final/TelcoChurn_Final.ipynb).
The contributors to this project, along with their GitHub usernames and roles, are as follows:

1. **Nathan Murstein (NathanMurstein)** - Data Scientist
2. **Hisham Salem (HishamSalem)** - Data Analyst
3. **Dany Stefan (dany-stefan)** - MLOps Engineer
4. **Jeewon Kim (jeewonk)** - Data Scientist
5. **Reza Soleimani** - Lead Data Scientist
6. **Danyal Hamid** - Business Intelligence
7. **Uzair Ahmad (uzairahmadxy)** - Project Manager

# Project Description

The telecommunication industry has a huge presence in the global economy. The global telecom services market was [valued at](https://www.grandviewresearch.com/industry-analysis/global-telecom-services-market) **USD 1,657.7 billion** in 2020. One of the major problems telcos (telecommunication companies) face is customer churn. According to [Profitwell](https://www.smartlook.com/blog/customer-churn-retention/), the average churn rate in telecom businesses is 22%.

## Objective

This project aims to use machine learning techniques to reduce customer churn and provide measurable business value to telecommunication companies. This notebook is designed to be deployed on **Databricks**, as an extension of the [Telco Churn Project for Enterprise-1](https://github.com/McGill-MMA-EnterpriseAnalytics/MMA4_ChurnPrediction/blob/main/final/TelcoChurn_Final.ipynb).

## Pain Points

- Churn leads to higher Customer Acquisition Costs & reduced revenue - acquiring new customers is more costly than keeping the existing ones.

- High churn rates are more likely to compound over time. 

## Business Value

- Increased revenue
- Higher customer satisfaction and loyalty
- Higher market share

## Methodology Overview

Our team focuses on predicting whether a customer is likely to churn, using **predictive modelling**. With this estimate of churn likelihood, we can take *proactive steps*. One of the most effective retention methods in marketing is sending promotions or coupons to customers. Sending coupons to everyone is unsustainable and carries high distribution costs, so we pick specific customers to target. To do this, we feed the predictions into an **optimization model**, which recommends a coupon distribution strategy that minimizes the revenue lost (opportunity cost) as a consequence of customer churn. Please refer to the [Telco Churn Project for Enterprise-1](https://github.com/McGill-MMA-EnterpriseAnalytics/MMA4_ChurnPrediction/blob/main/final/TelcoChurn_Final.ipynb) for the optimization section. This project extends the previous one with the following:
- Hyperparameter tuning using Optuna
- Model logging and serving using MLFlow
- Deployment on Databricks

# Preliminary imports

Please note that the missing libraries need to be installed (pip install, conda install).

In [0]:
# !pip install category-encoders
# !pip install pandas-profiling
# !pip install Boruta
# !pip install -U imbalanced-learn
# !pip install optuna
# !pip install mlflow

In [0]:
#!conda install pandas==1.2.5
#!conda install -c conda-forge pandas-profiling=2.8.0
#!conda install -c conda-forge boruta_py
#!conda install -c conda-forge imbalanced-learn
#!conda install -c conda-forge optuna
import sys
assert sys.version_info >= (3, 5)
import sklearn
assert sklearn.__version__ >= '0.20'
import numpy as np
import pandas as pd
import mlflow
import mlflow.pyfunc
import mlflow.sklearn
import mlflow.xgboost
import mlflow.lightgbm
from mlflow.models.signature import infer_signature
from mlflow.utils.environment import _mlflow_conda_env
import os
import warnings
%matplotlib inline
import matplotlib as mpl
import matplotlib.pyplot as plt
import category_encoders as ce
from sklearn.preprocessing import OneHotEncoder
#import pandas_profiling
#from pandas_profiling import ProfileReport
from boruta import BorutaPy
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.ensemble import RandomForestClassifier
from patsy import dmatrices
from statsmodels.stats.outliers_influence import variance_inflation_factor
from sklearn.model_selection import StratifiedShuffleSplit
from sklearn.ensemble  import IsolationForest
from imblearn.over_sampling import SMOTE
from imblearn.over_sampling import RandomOverSampler
from imblearn.under_sampling import NearMiss
from xgboost import XGBClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn import metrics
from sklearn.metrics import confusion_matrix, accuracy_score,pairwise_distances,roc_curve, precision_recall_curve, auc, make_scorer, recall_score, accuracy_score, precision_score
import lightgbm as lgb
from sklearn.ensemble import RandomForestClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn import linear_model

from sklearn.metrics import plot_confusion_matrix
from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import cross_val_score
from sklearn import preprocessing
from numpy import argmax
import optuna
import seaborn as sns

In [0]:
warnings.filterwarnings(action="ignore", message="^internal gelsd")

In [0]:
PROJECT_ROOT_DIR = '.'
CHAPTER_ID = 'end_to_end_project'
IMAGES_PATH = os.path.join(PROJECT_ROOT_DIR, 'images', CHAPTER_ID)
os.makedirs(IMAGES_PATH, exist_ok=True)

def save_fig(fig_id, tight_layout=True, fig_extension='png', resolution=300):
    path = os.path.join(IMAGES_PATH, fig_id + "." + fig_extension)
    print("Saving figure", fig_id)
    if tight_layout:
        plt.tight_layout()
    plt.savefig(path, format=fig_extension, dpi=resolution)

In [0]:
mpl.rc('axes', labelsize=14)
mpl.rc('xtick', labelsize=12)
mpl.rc('ytick', labelsize=12)

# Data Extraction

In [0]:
df = pd.read_csv('/dbfs/FileStore/tables/WA_Fn_UseC__Telco_Customer_Churn.csv')

# Data Pre-processing
 Contributor(s): @uzairxy
 
 Categorical Variables are encoded.
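
To make the encoding concrete, here is a minimal pandas-only sketch of the same idea on a toy frame. It is an illustration rather than the `category_encoders` implementation: the toy rows are made up, and only two of the dataset's columns are shown, using the same integer codes as the mapping defined in the next cell.

```python
import pandas as pd

# Toy frame with two of the dataset's categorical columns (rows are illustrative).
toy = pd.DataFrame({
    "Contract": ["Month-to-month", "Two year", "One year"],
    "OnlineSecurity": ["No", "No internet service", "Yes"],
})

# Same integer codes as the OrdinalEncoder mapping used in this notebook.
contract_codes = {"Month-to-month": 1, "One year": 2, "Two year": 3}
service_codes = {"No internet service": -1, "No": 1, "Yes": 2}

# Series.map replaces each category label with its integer code.
encoded = toy.assign(
    Contract=toy["Contract"].map(contract_codes),
    OnlineSecurity=toy["OnlineSecurity"].map(service_codes),
)
print(encoded)
```

Giving "No internet service" its own code (-1) keeps it distinct from a customer who has internet service but declined the add-on.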

In [0]:
ore = ce.OrdinalEncoder(
    mapping=[
        {
            'col': 'MultipleLines',
            'mapping': {
                'No phone service': -1,
                'No': 1,
                'Yes': 2
            }
        },
        {
            'col': 'gender',
            'mapping': {
                'Male': 0,
                'Female': 1
            }
        },
        {
            'col': 'Contract',
            'mapping': {
                'Month-to-month': 1,
                'One year': 2,
                'Two year': 3
            }
        },
        {
            'col': 'InternetService',
            'mapping': {
                'No': 0,
                'DSL': 1,
                'Fiber optic': 2
            }
        },
        {
            'col': 'OnlineSecurity',
            'mapping': {
                'No internet service': -1,
                'No': 1,
                'Yes': 2
            }
        },
        {
            'col': 'OnlineBackup',
            'mapping': {
                'No internet service': -1,
                'No': 1,
                'Yes': 2
            }
        },
        {
            'col': 'DeviceProtection',
            'mapping': {
                'No internet service': -1,
                'No': 1,
                'Yes': 2
            }
        },
        {
            'col': 'TechSupport',
            'mapping': {
                'No internet service': -1,
                'No': 1,
                'Yes': 2
            }
        },
        {
            'col': 'StreamingTV',
            'mapping': {
                'No internet service': -1,
                'No': 1,
                'Yes': 2
            }
        },
        {
            'col': 'StreamingMovies',
            'mapping': {
                'No internet service': -1,
                'No': 1,
                'Yes': 2
            }
        }
    ]
)

In [0]:
df = ore.fit_transform(df)

In [0]:
df = df.replace({'Yes': 1,'No': 0})
df = pd.get_dummies(df, columns=['PaymentMethod'])

In [0]:
df.columns

In [0]:
column_names = [
    'Churn',
    'gender', 
    'SeniorCitizen', 
    'Partner', 
    'Dependents', 
    'tenure',
    'PhoneService', 
    'MultipleLines', 
    'InternetService', 
    'OnlineSecurity',
    'OnlineBackup', 
    'DeviceProtection', 
    'TechSupport', 
    'StreamingTV',
    'StreamingMovies', 
    'Contract', 
    'PaperlessBilling', 
    'PaymentMethod_Bank transfer (automatic)',
    'PaymentMethod_Credit card (automatic)',
    'PaymentMethod_Electronic check', 
    'PaymentMethod_Mailed check', 
    'MonthlyCharges',
    'TotalCharges']
df = df.reindex(columns=column_names)

In [0]:
df.to_csv('Churn_Processed.csv')

# Preliminary Data Exploration
Contributor(s): @uzairxy

Looking at nulls across columns, multi-collinearity and variable distributions.
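
As a small illustration of how nulls can appear in `TotalCharges`, the sketch below assumes that some raw values are blank strings (a toy assumption for this example). `pd.to_numeric(..., errors='coerce')` turns anything unparseable into `NaN`, which can then be inspected and dropped.

```python
import pandas as pd

# Toy rows: the blank TotalCharges value is an assumption made for illustration.
toy = pd.DataFrame({"tenure": [0, 12], "TotalCharges": [" ", "350.5"]})

# Unparseable strings become NaN instead of raising an error.
toy["TotalCharges"] = pd.to_numeric(toy["TotalCharges"], errors="coerce")

rows_with_nulls = toy[toy.isnull().any(axis=1)]  # rows holding at least one NaN
cleaned = toy.dropna()                           # keep only complete rows
print(rows_with_nulls)
print(cleaned)
```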

In [0]:
df['TotalCharges'] = pd.to_numeric(df['TotalCharges'], errors='coerce')
df

Unnamed: 0,Churn,gender,SeniorCitizen,Partner,Dependents,tenure,PhoneService,MultipleLines,InternetService,OnlineSecurity,OnlineBackup,DeviceProtection,TechSupport,StreamingTV,StreamingMovies,Contract,PaperlessBilling,PaymentMethod_Bank transfer (automatic),PaymentMethod_Credit card (automatic),PaymentMethod_Electronic check,PaymentMethod_Mailed check,MonthlyCharges,TotalCharges
0,0,1,0,1,0,1,0,-1,1,1,2,1,1,1,1,1,1,0,0,1,0,29.85,29.85
1,0,0,0,0,0,34,1,1,1,2,1,2,1,1,1,2,0,0,0,0,1,56.95,1889.50
2,1,0,0,0,0,2,1,1,1,2,2,1,1,1,1,1,1,0,0,0,1,53.85,108.15
3,0,0,0,0,0,45,0,-1,1,2,1,2,2,1,1,2,0,1,0,0,0,42.30,1840.75
4,1,1,0,0,0,2,1,1,2,1,1,1,1,1,1,1,1,0,0,1,0,70.70,151.65
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
7038,0,0,0,1,1,24,1,2,1,2,1,2,2,2,2,2,1,0,0,0,1,84.80,1990.50
7039,0,1,0,1,1,72,1,2,2,1,2,2,1,2,2,2,1,0,1,0,0,103.20,7362.90
7040,0,1,0,1,1,11,0,-1,1,2,1,1,1,1,1,1,1,0,0,1,0,29.60,346.45
7041,1,0,1,1,0,4,1,2,2,1,1,1,1,1,1,1,1,0,0,0,1,74.40,306.60


In [0]:
df[df.isnull().any(axis=1)]

Unnamed: 0,Churn,gender,SeniorCitizen,Partner,Dependents,tenure,PhoneService,MultipleLines,InternetService,OnlineSecurity,OnlineBackup,DeviceProtection,TechSupport,StreamingTV,StreamingMovies,Contract,PaperlessBilling,PaymentMethod_Bank transfer (automatic),PaymentMethod_Credit card (automatic),PaymentMethod_Electronic check,PaymentMethod_Mailed check,MonthlyCharges,TotalCharges
488,0,1,0,1,1,0,0,-1,1,2,1,2,2,2,1,3,1,1,0,0,0,52.55,
753,0,0,0,0,1,0,1,1,0,-1,-1,-1,-1,-1,-1,3,0,0,0,0,1,20.25,
936,0,1,0,1,1,0,1,1,1,2,2,2,1,2,2,3,0,0,0,0,1,80.85,
1082,0,0,0,1,1,0,1,2,0,-1,-1,-1,-1,-1,-1,3,0,0,0,0,1,25.75,
1340,0,1,0,1,1,0,0,-1,1,2,2,2,2,2,1,3,0,0,1,0,0,56.05,
3331,0,0,0,1,1,0,1,1,0,-1,-1,-1,-1,-1,-1,3,0,0,0,0,1,19.85,
3826,0,0,0,1,1,0,1,2,0,-1,-1,-1,-1,-1,-1,3,0,0,0,0,1,25.35,
4380,0,1,0,1,1,0,1,1,0,-1,-1,-1,-1,-1,-1,3,0,0,0,0,1,20.0,
5218,0,0,0,1,1,0,1,1,0,-1,-1,-1,-1,-1,-1,2,1,0,0,0,1,19.7,
6670,0,1,0,1,1,0,1,2,1,1,2,2,2,2,1,3,0,0,0,0,1,73.35,


In [0]:
df = df.dropna()
df[df.isnull().any(axis=1)]

Unnamed: 0,Churn,gender,SeniorCitizen,Partner,Dependents,tenure,PhoneService,MultipleLines,InternetService,OnlineSecurity,OnlineBackup,DeviceProtection,TechSupport,StreamingTV,StreamingMovies,Contract,PaperlessBilling,PaymentMethod_Bank transfer (automatic),PaymentMethod_Credit card (automatic),PaymentMethod_Electronic check,PaymentMethod_Mailed check,MonthlyCharges,TotalCharges


In [0]:
df.to_csv('Churn_Preprocessed.csv')

# Feature Selection
Contributor(s): @uzairxy, @dany-stefan

## Feature Importance using Boruta

Boruta is a popular feature selection method. Its primary advantages are that it works for both regression and classification tasks, and that it can account for multi-variable relationships and interactions between variables.

For our dataset, the most important features are *'tenure', 'Contract', 'MonthlyCharges', 'TotalCharges'*.
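
The core idea behind Boruta can be sketched with scikit-learn alone: add shuffled "shadow" copies of every feature and keep the real features whose importance beats the best shadow. The snippet below is a single simplified iteration on synthetic data, not the `BorutaPy` algorithm itself, which repeats this comparison many times and applies a statistical test.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n = 500
signal = rng.normal(size=n)   # informative feature (synthetic)
noise = rng.normal(size=n)    # uninformative feature (synthetic)
X = np.column_stack([signal, noise])
y = (signal > 0).astype(int)

# Shadow features: each column shuffled independently, which breaks any link to y.
shadows = np.column_stack([rng.permutation(X[:, j]) for j in range(X.shape[1])])
X_aug = np.hstack([X, shadows])

forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_aug, y)
importances = forest.feature_importances_

# A real feature scores a "hit" if it is more important than the best shadow.
shadow_max = importances[X.shape[1]:].max()
hits = importances[:X.shape[1]] > shadow_max
print(dict(zip(["signal", "noise"], hits)))
```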

In [0]:
feat_selector = BorutaPy(
    verbose=1,
    estimator=RandomForestClassifier(),
    max_iter=10
)

In [0]:
# Use a 1-D target array, which is the shape the estimator expects
BorutaY = df.iloc[:, 0].to_numpy()

BorutaX = df.iloc[:, 1:23]
BorutaX_np = BorutaX.to_numpy(dtype=None, copy=False) 

In [0]:
feat_selector.fit(BorutaX_np, BorutaY)

In [0]:
for i in range(len(feat_selector.support_)):
    if feat_selector.support_[i]:
        print("Passes the test: ", 
            BorutaX.columns[i],
            " - Ranking: ", 
            feat_selector.ranking_[i])
    else:
        print("Doesn't pass the test: ",
            BorutaX.columns[i], 
            " - Ranking: ", 
            feat_selector.ranking_[i])

In [0]:
accepted = BorutaX.columns[feat_selector.support_].to_list()
print('Accepted features:', accepted)

undecided = BorutaX.columns[feat_selector.support_weak_].to_list()
print('Undecided features', undecided)

selected_features = accepted + undecided # include undecided to be safe
selected_features.append(df.columns[0])

In [0]:
df_selected = df[selected_features]

## Correlation Analysis
Contributor(s): @HishamSalem

Correlation analysis with the variance inflation factor (VIF) helps identify features that are strongly correlated. We want to remove redundant features, as this reduces training time without a significant loss of information from the predictors.

We find that the Total Charges and Monthly Charges are highly correlated, which makes sense as the Total Charges should be based on the Monthly Charges for a customer.
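
For reference, the VIF of a predictor is $1 / (1 - R_j^2)$, where $R_j^2$ comes from regressing that predictor on all the others. The NumPy sketch below computes it directly on synthetic data; the column names mirror the dataset, but the values are randomly generated for illustration, and an intercept is added explicitly (in the notebook, `dmatrices` supplies it).

```python
import numpy as np

def vif(X):
    """Return VIF_j = 1 / (1 - R_j^2) for each column j of X."""
    X = np.asarray(X, dtype=float)
    out = []
    for j in range(X.shape[1]):
        target = X[:, j]
        others = np.delete(X, j, axis=1)
        # Regress column j on the remaining columns plus an intercept.
        design = np.column_stack([np.ones(len(X)), others])
        coef, *_ = np.linalg.lstsq(design, target, rcond=None)
        resid = target - design @ coef
        r2 = 1.0 - resid.var() / target.var()
        out.append(1.0 / (1.0 - r2))
    return np.array(out)

# Synthetic data: total charges are roughly tenure times monthly charges.
rng = np.random.default_rng(0)
tenure = rng.integers(1, 73, size=1000).astype(float)
monthly = rng.uniform(20, 120, size=1000)
total = tenure * monthly + rng.normal(0, 50, size=1000)
contract = rng.integers(1, 4, size=1000).astype(float)

vifs = vif(np.column_stack([tenure, monthly, total, contract]))
print(dict(zip(["tenure", "MonthlyCharges", "TotalCharges", "Contract"], vifs.round(2))))
```

In this toy setup the `TotalCharges` column gets a much larger VIF than the independent `Contract` column, which is the pattern that motivates dropping it.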

In [0]:
# Build the right-hand side from the predictors only (exclude the target)
VIFAddition = '+'.join(df_selected.columns.drop('Churn'))
formula = "Churn ~ {X_vars}"
full_formula = formula.format(X_vars=VIFAddition)
print(full_formula)

In [0]:
def run_vif(formula):
    y, X = dmatrices(formula, data=df_selected, return_type='dataframe')    
    vif = pd.DataFrame()
    vif['variable'] = X.columns
    vif['VIF'] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
    print(vif)
    print('\n')


run_vif(full_formula)
run_vif('Churn ~ tenure+Contract+MonthlyCharges')

In [0]:
df_selected = df_selected.drop(['TotalCharges'],axis=1)
df_selected

Unnamed: 0,tenure,Contract,MonthlyCharges,Churn
0,1,1,29.85,0
1,34,2,56.95,0
2,2,1,53.85,1
3,45,2,42.30,0
4,2,1,70.70,1
...,...,...,...,...
7038,24,2,84.80,0
7039,72,2,103.20,0
7040,11,1,29.60,0
7041,4,1,74.40,1


## Train/Test Split

In [0]:
split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)

for train_index, test_index in split.split(df_selected, df_selected['Churn']):
    print('train:', train_index, 'test:', test_index)
    # Copy the slices so that columns can be added later without pandas warnings
    strat_train_set = df_selected.iloc[train_index].copy()
    strat_test_set = df_selected.iloc[test_index].copy()
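
A stratified split keeps the churn rate the same in the training and test sets. The short check below shows this on a toy frame with a 3:1 class ratio; the data is made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedShuffleSplit

# Toy data: 25 churners and 75 non-churners.
toy = pd.DataFrame({"tenure": np.arange(100), "Churn": [1] * 25 + [0] * 75})

split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
train_index, test_index = next(split.split(toy, toy["Churn"]))

train_rate = toy.iloc[train_index]["Churn"].mean()
test_rate = toy.iloc[test_index]["Churn"].mean()
print(train_rate, test_rate)
```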

# Analyze Outliers
Contributor(s): @HishamSalem

We use an isolation forest to detect outliers in the training set. Removing outliers should help the model generalize better.
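
As a minimal sketch of the mechanics, `IsolationForest.predict` returns `-1` for points it considers outliers and `1` otherwise, so filtering is a boolean mask. The data below is synthetic, and `random_state=0` is added only to make the example repeatable.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
inliers = rng.normal(loc=0.0, scale=1.0, size=(98, 2))   # dense cluster
outliers = np.array([[1000.0, 1000.0], [-1000.0, 1000.0]])  # two extreme points
X = np.vstack([inliers, outliers])

# contamination=0.02 asks the model to flag roughly 2% of the points.
forest = IsolationForest(n_estimators=50, contamination=0.02, random_state=0).fit(X)
labels = forest.predict(X)   # -1 = outlier, 1 = inlier

kept = X[labels != -1]
print(len(X), "->", len(kept))
```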

In [0]:
# Score outliers on the predictor columns only (exclude the target)
AnomalyX = strat_train_set.drop(columns=['Churn'])
AnomalyX_np = AnomalyX.to_numpy()

In [0]:
iforest = IsolationForest(max_samples='auto',
                          n_estimators=50, 
                          contamination=0.02,
                          max_features=1.0)

iforest_= iforest.fit(AnomalyX_np)

In [0]:
strat_train_set['scores'] = iforest_.decision_function(AnomalyX_np)
strat_train_set['anomaly'] = iforest_.predict(AnomalyX_np)

In [0]:
strat_train_set

Unnamed: 0,tenure,Contract,MonthlyCharges,Churn,scores,anomaly
1413,65,3,94.55,0,0.117505,1
7003,26,1,35.75,0,0.099800,1
3355,68,3,90.20,0,0.141129,1
4494,3,1,84.30,0,0.151634,1
3541,49,1,40.65,0,0.112138,1
...,...,...,...,...,...,...
3451,65,2,70.95,0,0.142802,1
5123,15,1,75.30,1,0.162792,1
4135,36,3,92.90,0,0.120129,1
4249,10,2,65.90,0,0.147786,1


In [0]:
strat_train_set = strat_train_set[strat_train_set['anomaly'] != -1]

In [0]:
strat_train_set

Unnamed: 0,tenure,Contract,MonthlyCharges,Churn,scores,anomaly
1413,65,3,94.55,0,0.117505,1
7003,26,1,35.75,0,0.099800,1
3355,68,3,90.20,0,0.141129,1
4494,3,1,84.30,0,0.151634,1
3541,49,1,40.65,0,0.112138,1
...,...,...,...,...,...,...
3451,65,2,70.95,0,0.142802,1
5123,15,1,75.30,1,0.162792,1
4135,36,3,92.90,0,0.120129,1
4249,10,2,65.90,0,0.147786,1


In [0]:
strat_train_set = strat_train_set.drop(['scores','anomaly'], axis=1)

# Sampling Training Data
Contributor(s): @HishamSalem

We try over-sampling and under-sampling methods as we have a significant class imbalance of *3:1*. Our predictive results are better with over-sampling the minority class.
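
To show what over-sampling the minority class means in practice, here is a simplified NumPy sketch of SMOTE-style interpolation. It is not the `imblearn` implementation: the function name `smote_like`, its parameters, and the toy data are all hypothetical and chosen for illustration.

```python
import numpy as np

def smote_like(X_min, n_new, k=3, seed=0):
    """Create n_new synthetic rows by interpolating between minority neighbours."""
    rng = np.random.default_rng(seed)
    X_min = np.asarray(X_min, dtype=float)
    # Pairwise distances within the minority class only.
    dists = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)            # a point is not its own neighbour
    neighbours = np.argsort(dists, axis=1)[:, :k]
    base = rng.integers(0, len(X_min), size=n_new)
    picked = neighbours[base, rng.integers(0, k, size=n_new)]
    gap = rng.random((n_new, 1))               # position along the segment, in [0, 1)
    return X_min[base] + gap * (X_min[picked] - X_min[base])

# Toy data with a 3:1 imbalance.
rng = np.random.default_rng(1)
X_majority = rng.normal(0, 1, size=(30, 2))
X_minority = rng.normal(3, 1, size=(10, 2))

synthetic = smote_like(X_minority, n_new=len(X_majority) - len(X_minority))
X_minority_balanced = np.vstack([X_minority, synthetic])
print(len(X_majority), len(X_minority_balanced))
```

Each synthetic row lies on the segment between a minority point and one of its nearest minority neighbours, so the new rows stay inside the region the minority class already occupies.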

## X/Y Split

In [0]:
strat_train_set_y = strat_train_set['Churn'].copy()
strat_train_set_x = strat_train_set.drop(['Churn'], axis=1)

In [0]:
sum(strat_train_set_y)/len(strat_train_set_y)

## Oversampler

In [0]:
oversample = SMOTE()
sample_x_over, sample_y_over = oversample.fit_resample(strat_train_set_x, strat_train_set_y)

In [0]:
strat_test_set_y = strat_test_set['Churn'].copy()
strat_test_set_x = strat_test_set.drop(['Churn'], axis=1)

## Production Code Pickling & Logging
Contributor(s): @HishamSalem, @rsoleimani

This section depends on `pickle` and Optuna's `TPESampler`, in addition to the base dependencies imported above.

In [0]:
TrialNumber=20

In [0]:
from optuna.samplers import TPESampler
import pickle
def ChampionModelOptimization(trial):
    Classifier=trial.suggest_categorical("Classifier", ['xgb','gb','lightGBM'])
    if Classifier=='xgb':
        n_estimators_option_xgb=trial.suggest_int("n_estimators_option_xgb", 1, 100, step=1)
        max_features_option_xgb=trial.suggest_int('max_features_option_xgb',1,15,step=1,log=True)    
        max_depth_option_xgb=trial.suggest_int('max_depth_option_xgb', 1, 20,step=1, log=True)
        eval_metric_option_xgb=trial.suggest_categorical("eval_metric_option_xgb", ['aucpr','map','error@0.1','error@0.3','error@0.5','error@0.7','error@0.8'])
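        # Start just above zero, as for the other classifiers: a learning rate of exactly 0 cannot learn
        learning_rate_option_xgb=trial.suggest_float("learning_rate_option_xgb", 0.00000001, 0.03,step=0.005)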
        tree_method_option_xgb=trial.suggest_categorical("tree_method_option_xgb", ['auto','exact','approx','hist'])
        scale_pos_weight_option_xgb=trial.suggest_int("scale_pos_weight_option_xgb", 10, 1000, step=10)
        max_delta_step_option_xgb=trial.suggest_int("max_delta_step_option_xgb", 0, 100, step=1)    

        Classifier_obj =  XGBClassifier(n_estimators=n_estimators_option_xgb,
                                    max_features=max_features_option_xgb,
                                    max_depth=max_depth_option_xgb,
                                    eval_metric=eval_metric_option_xgb,
                                    learning_rate=learning_rate_option_xgb,
                                    tree_method=tree_method_option_xgb,
                                    scale_pos_weight=scale_pos_weight_option_xgb,
                                    max_delta_step=max_delta_step_option_xgb,
                                    probability=True)
        mlflow.xgboost.autolog()
        with mlflow.start_run():
            Classifier_obj.fit(sample_x_over, sample_y_over)
            y_scores = Classifier_obj.predict_proba(strat_test_set_x)
            # Keep probabilities for the positive outcome only
            y_scores = y_scores[:, 1]
            # Calculate p/r curves
            precision, recall, thresholds = precision_recall_curve(strat_test_set_y, y_scores)
            # Convert to f-score
            fscore = (2 * precision * recall) / (precision + recall)
            # Locate the index of the largest f-score
            ix = argmax(fscore)
            ModelThreshold=thresholds[ix]
            Modelrecall=recall[ix]
            Modelprecision=precision[ix]
            Modelfscore=fscore[ix]
            signature = infer_signature(sample_x_over, Classifier_obj.predict(sample_x_over))
            mlflow.xgboost.log_model(Classifier_obj, "xgb_model", signature=signature)
            mlflow.log_metric('recall', Modelrecall)
    
    elif Classifier=='gb':
        n_estimators_option_gb=trial.suggest_int("n_estimators_option_gb", 1, 100, step=1)      
        max_depth_option_gb=trial.suggest_int('max_depth_option_gb', 1, 20,step=1, log=True)
        max_features_option_gb=trial.suggest_categorical('max_features_option_gb',['auto', 'sqrt', 'log2'])
        loss_option_gb=trial.suggest_categorical('loss_option_gb',['deviance', 'exponential'])
        learning_rate_option_gb=trial.suggest_float("learning_rate_option_gb", 0.00000001, 0.03,step=0.005)


        Classifier_obj =  GradientBoostingClassifier(n_estimators=n_estimators_option_gb,
                                                 max_depth=max_depth_option_gb,
                                                 max_features=max_features_option_gb,
                                                 loss=loss_option_gb,
                                                 learning_rate=learning_rate_option_gb)  
        mlflow.sklearn.autolog()
        with mlflow.start_run():
            Classifier_obj.fit(sample_x_over, sample_y_over)
            y_scores = Classifier_obj.predict_proba(strat_test_set_x)
            # Keep probabilities for the positive outcome only
            y_scores = y_scores[:, 1]
            # Calculate p/r curves
            precision, recall, thresholds = precision_recall_curve(strat_test_set_y, y_scores)
            # Convert to f-score
            fscore = (2 * precision * recall) / (precision + recall)
            # Locate the index of the largest f-score
            ix = argmax(fscore)
            ModelThreshold=thresholds[ix]
            Modelrecall=recall[ix]
            Modelprecision=precision[ix]
            Modelfscore=fscore[ix]
            signature = infer_signature(sample_x_over, Classifier_obj.predict(sample_x_over))
            mlflow.sklearn.log_model(Classifier_obj, "gb_model", signature=signature)
            mlflow.log_metric('recall', Modelrecall)    
    else:
        n_estimators_option_lgbm=trial.suggest_int("n_estimators_option_lgbm", 1, 100, step=1)    
        max_depth_option_lgbm=trial.suggest_int('max_depth_option_lgbm', 1, 20,step=1, log=True)
        metric_option_lgbm=trial.suggest_categorical("metric_option_lgbm", ['mape','huber','l1','l2','fair','poisson','auc','average_precision','binary_logloss','binary_error','cross_entropy','cross_entropy_lambda','kullback_leibler'])
        learning_rate_option_lgbm=trial.suggest_float("learning_rate_option_lgbm", 0.00000001, 0.03,step=0.005)
        scale_pos_weight_option_lgbm=trial.suggest_int("scale_pos_weight_option_lgbm", 10, 1000, step=10)
        max_delta_step_option_lgbm=trial.suggest_int("max_delta_step_option_lgbm", 1, 100, step=1)
        tree_learner_option_lgbm=trial.suggest_categorical('tree_learner_option_lgbm',['serial', 'feature', 'data', 'voting'])


        Classifier_obj =  lgb.LGBMClassifier(n_estimators=n_estimators_option_lgbm,
                                         max_depth=max_depth_option_lgbm,
                                         metric=metric_option_lgbm,
                                         learning_rate=learning_rate_option_lgbm,
                                         scale_pos_weight=scale_pos_weight_option_lgbm,
                                         max_delta_step=max_delta_step_option_lgbm,
                                         tree_learner=tree_learner_option_lgbm)   #scale_pos_weight=25
  
        mlflow.lightgbm.autolog()
        with mlflow.start_run():
            Classifier_obj.fit(sample_x_over, sample_y_over)
            y_scores = Classifier_obj.predict_proba(strat_test_set_x)
            # Keep probabilities for the positive outcome only
            y_scores = y_scores[:, 1]
            # Calculate p/r curves
            precision, recall, thresholds = precision_recall_curve(strat_test_set_y, y_scores)
            # Convert to f-score
            fscore = (2 * precision * recall) / (precision + recall)
            # Locate the index of the largest f-score
            ix = argmax(fscore)
            ModelThreshold=thresholds[ix]
            Modelrecall=recall[ix]
            Modelprecision=precision[ix]
            Modelfscore=fscore[ix]
            signature = infer_signature(sample_x_over, Classifier_obj.predict(sample_x_over))
            mlflow.lightgbm.log_model(Classifier_obj, "lightgbm_model", signature=signature)
            mlflow.log_metric('recall', Modelrecall)  
    return Modelfscore

  
sampler = TPESampler(seed=42)
study = optuna.create_study(directions=['maximize'],sampler=sampler)
study.optimize(lambda trial: ChampionModelOptimization(trial), n_trials=TrialNumber)

if study.best_params['Classifier']=='xgb':
    model=XGBClassifier(n_estimators=study.best_params['n_estimators_option_xgb'],
                                            max_features=study.best_params['max_features_option_xgb'],
                                            max_depth=study.best_params['max_depth_option_xgb'],
                                            eval_metric=study.best_params['eval_metric_option_xgb'],
                                            learning_rate=study.best_params['learning_rate_option_xgb'],
                                            tree_method=study.best_params['tree_method_option_xgb'],
                                            scale_pos_weight=study.best_params['scale_pos_weight_option_xgb'],
                                            max_delta_step=study.best_params['max_delta_step_option_xgb'],
                                            probability=True)
    model.fit(sample_x_over, sample_y_over)
    with open("BestModel{}.pickle".format(study.best_params['Classifier']), "wb") as fout:
        pickle.dump(model, fout)

elif study.best_params['Classifier']=='gb':
    model=GradientBoostingClassifier(n_estimators=study.best_params['n_estimators_option_gb'],
                                                 max_depth=study.best_params['max_depth_option_gb'],
                                                 max_features=study.best_params['max_features_option_gb'],
                                                 loss=study.best_params['loss_option_gb'],
                                                 learning_rate=study.best_params['learning_rate_option_gb'])
    model.fit(sample_x_over, sample_y_over)
    with open("BestModel{}.pickle".format(study.best_params['Classifier']), "wb") as fout:
        pickle.dump(model, fout)  
else:
    model=lgb.LGBMClassifier(n_estimators=study.best_params['n_estimators_option_lgbm'],
                                            max_depth=study.best_params['max_depth_option_lgbm'],
                                            metric=study.best_params['metric_option_lgbm'],
                                            learning_rate=study.best_params['learning_rate_option_lgbm'],
                                            scale_pos_weight=study.best_params['scale_pos_weight_option_lgbm'],
                                            max_delta_step=study.best_params['max_delta_step_option_lgbm'],
                                            tree_learner=study.best_params['tree_learner_option_lgbm'])
    model.fit(sample_x_over, sample_y_over)
    with open("BestModel{}.pickle".format(study.best_params['Classifier']), "wb") as fout:
        pickle.dump(model, fout)

In [0]:
with open("BestModel{}.pickle".format(study.best_params['Classifier']), "rb") as fin:
    Best_Model = pickle.load(fin)
y_scores = Best_Model.predict_proba(strat_test_set_x)
# Keep probabilities for the positive outcome only
y_scores = y_scores[:, 1]
# Calculate p/r curves
precision, recall, thresholds = precision_recall_curve(strat_test_set_y, y_scores)
# Convert to f-score
fscore = (2 * precision * recall) / (precision + recall)
# Locate the index of the largest f-score
ix = argmax(fscore)
ModelThreshold=thresholds[ix]
Modelrecall=recall[ix]
Modelprecision=precision[ix]
Modelfscore=fscore[ix]
with open("Threshold.pickle", "wb") as fout:
    pickle.dump(ModelThreshold, fout)
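
The threshold selection used above can be sketched on a tiny example. The helper name `best_f1_threshold` is hypothetical; compared with the cells above, it also drops the final precision/recall point (which has no matching threshold) and guards against division by zero.

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

def best_f1_threshold(y_true, y_scores):
    """Return the score threshold that maximizes F1, and that F1 value."""
    precision, recall, thresholds = precision_recall_curve(y_true, y_scores)
    # The final (precision=1, recall=0) point has no threshold, so drop it.
    precision, recall = precision[:-1], recall[:-1]
    denom = precision + recall
    fscore = np.divide(2 * precision * recall, denom,
                       out=np.zeros_like(denom), where=denom > 0)
    ix = int(np.argmax(fscore))
    return thresholds[ix], fscore[ix]

# Toy labels and predicted churn probabilities.
y_true = np.array([0, 0, 1, 1])
y_scores = np.array([0.1, 0.4, 0.35, 0.8])

threshold, f1 = best_f1_threshold(y_true, y_scores)
y_pred = (y_scores >= threshold).astype(int)   # vectorized thresholding
print(threshold, f1, y_pred)
```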

In [0]:
with open("Threshold.pickle", "rb") as fin:
    ModelThreshold = pickle.load(fin)
for i in range(len(y_scores)):
    if y_scores[i]>=ModelThreshold:
        y_scores[i]=1
    else:
        y_scores[i]=0

In [0]:
#from google.colab import files
#files.download("BestModel{}.pickle".format(study.best_params['Classifier']))
#files.download("Threshold.pickle")

## Details of All Runs
Contributor(s): @NathanMurstein

Status of all our experiments.

In [0]:
runs = mlflow.search_runs()
print(f"Total experiments ran: {len(runs)}")
runs.head(10)

Unnamed: 0,run_id,experiment_id,status,artifact_uri,start_time,end_time,metrics.recall,metrics.training_recall_score,metrics.training_precision_score,metrics.training_f1_score,metrics.training_score,metrics.training_roc_auc_score,metrics.training_accuracy_score,metrics.training_log_loss,params.subsample_for_bin,params.verbose,params.boosting_type,params.keep_training_booster,params.random_state,params.subsample_freq,params.evals_result,params.colsample_bytree,params.num_boost_round,params.min_child_weight,params.min_split_gain,params.min_child_samples,params.scale_pos_weight,params.tree_learner,params.categorical_feature,params.max_delta_step,params.metric,params.n_jobs,params.verbose_eval,params.reg_alpha,params.subsample,params.objective,params.early_stopping_rounds,params.feature_name,params.max_depth,params.learning_rate,...,params.verbosity,params.probability,params.gpu_id,params.grow_policy,params.eval_metric,params.booster,params.maximize,params.predictor,params.max_cat_to_onehot,params.validate_parameters,params.sampling_method,params.max_bin,params.metric_params,params.algorithm,params.n_neighbors,params.p,params.leaf_size,params.radius,params.bootstrap,params.max_samples,params.contamination,params.class_weight,params.oob_score,tags.mlflow.databricks.cluster.id,tags.mlflow.user,tags.mlflow.databricks.workspaceID,tags.mlflow.databricks.workspaceURL,tags.mlflow.databricks.notebookPath,tags.mlflow.source.name,tags.mlflow.databricks.notebookID,tags.mlflow.source.type,tags.mlflow.autologging,tags.mlflow.log-model.history,tags.mlflow.databricks.cluster.info,tags.mlflow.databricks.notebook.commandID,tags.mlflow.databricks.webappURL,tags.mlflow.databricks.cluster.libraries,tags.mlflow.databricks.notebookRevisionID,tags.estimator_class,tags.estimator_name
0,1770d7bb58bd442a9e4fb7c37374a01e,1452363639079685,FINISHED,dbfs:/databricks/mlflow-tracking/1452363639079...,2022-04-18 00:18:06.774000+00:00,2022-04-18 00:18:09.476000+00:00,,,,,,,,,200000,-1,gbdt,False,,0,,1.0,1,0.001,0.0,20,910,feature,auto,2,['average_precision'],-1,warn,0.0,1.0,binary,,auto,18,0.02500001,...,,,,,,,,,,,,,,,,,,,,,,,,0417-225212-v1ch2xxi,dany.stefan@mail.mcgill.ca,1962064964953307,adb-1962064964953307.7.azuredatabricks.net,/Users/dany.stefan@mail.mcgill.ca/ProductionCo...,/Users/dany.stefan@mail.mcgill.ca/ProductionCo...,1452363639079685,NOTEBOOK,lightgbm,"[{""artifact_path"":""model"",""signature"":{""inputs...","{""cluster_name"":""INSY_695_cluster"",""spark_vers...",2323102042788931009_7549896367611906941_046585...,https://canadacentral.azuredatabricks.net,"{""installable"":[{""pypi"":{""package"":""mlflow""}},...",,,
1,23b0297d90864d7f8308d2d6c39ece01,1452363639079685,FINISHED,dbfs:/databricks/mlflow-tracking/1452363639079...,2022-04-18 00:18:01.694000+00:00,2022-04-18 00:18:06.659000+00:00,0.783422,,,,,,,,200000,-1,gbdt,False,,0,,1.0,46,0.001,0.0,20,420,data,auto,44,['mape'],-1,warn,0.0,1.0,binary,,auto,9,0.02000001,...,,,,,,,,,,,,,,,,,,,,,,,,0417-225212-v1ch2xxi,dany.stefan@mail.mcgill.ca,1962064964953307,adb-1962064964953307.7.azuredatabricks.net,/Users/dany.stefan@mail.mcgill.ca/ProductionCo...,/Users/dany.stefan@mail.mcgill.ca/ProductionCo...,1452363639079685,NOTEBOOK,,"[{""artifact_path"":""model"",""signature"":{""inputs...","{""cluster_name"":""INSY_695_cluster"",""spark_vers...",2323102042788931009_7549896367611906941_046585...,https://canadacentral.azuredatabricks.net,"{""installable"":[{""pypi"":{""package"":""mlflow""}},...",1650241086813.0,,
2,208df67fb8fb4c64be2a8676ab87cec5,1452363639079685,FINISHED,dbfs:/databricks/mlflow-tracking/1452363639079...,2022-04-18 00:17:56.684000+00:00,2022-04-18 00:18:01.540000+00:00,0.764706,,,,,,,,200000,-1,gbdt,False,,0,,1.0,1,0.001,0.0,20,990,feature,auto,16,['mape'],-1,warn,0.0,1.0,binary,,auto,11,0.02500001,...,,,,,,,,,,,,,,,,,,,,,,,,0417-225212-v1ch2xxi,dany.stefan@mail.mcgill.ca,1962064964953307,adb-1962064964953307.7.azuredatabricks.net,/Users/dany.stefan@mail.mcgill.ca/ProductionCo...,/Users/dany.stefan@mail.mcgill.ca/ProductionCo...,1452363639079685,NOTEBOOK,,"[{""artifact_path"":""model"",""signature"":{""inputs...","{""cluster_name"":""INSY_695_cluster"",""spark_vers...",2323102042788931009_7549896367611906941_046585...,https://canadacentral.azuredatabricks.net,"{""installable"":[{""pypi"":{""package"":""mlflow""}},...",1650241081672.0,,
3,414fe4d2bedf4e4096ea47f703e72a00,1452363639079685,FINISHED,dbfs:/databricks/mlflow-tracking/1452363639079...,2022-04-18 00:17:51.544000+00:00,2022-04-18 00:17:56.540000+00:00,0.820856,,,,,,,,200000,-1,gbdt,False,,0,,1.0,23,0.001,0.0,20,750,feature,auto,25,['cross_entropy_lambda'],-1,warn,0.0,1.0,binary,,auto,2,0.01500001,...,,,,,,,,,,,,,,,,,,,,,,,,0417-225212-v1ch2xxi,dany.stefan@mail.mcgill.ca,1962064964953307,adb-1962064964953307.7.azuredatabricks.net,/Users/dany.stefan@mail.mcgill.ca/ProductionCo...,/Users/dany.stefan@mail.mcgill.ca/ProductionCo...,1452363639079685,NOTEBOOK,,"[{""artifact_path"":""model"",""signature"":{""inputs...","{""cluster_name"":""INSY_695_cluster"",""spark_vers...",2323102042788931009_7549896367611906941_046585...,https://canadacentral.azuredatabricks.net,"{""installable"":[{""pypi"":{""package"":""mlflow""}},...",1650241076693.0,,
4,43d56886779243ccb461512ea566e26b,1452363639079685,FINISHED,dbfs:/databricks/mlflow-tracking/1452363639079...,2022-04-18 00:17:46.573000+00:00,2022-04-18 00:17:51.392000+00:00,0.703209,,,,,,,,200000,-1,gbdt,False,,0,,1.0,26,0.001,0.0,20,390,feature,auto,26,['cross_entropy'],-1,warn,0.0,1.0,binary,,auto,6,0.01500001,...,,,,,,,,,,,,,,,,,,,,,,,,0417-225212-v1ch2xxi,dany.stefan@mail.mcgill.ca,1962064964953307,adb-1962064964953307.7.azuredatabricks.net,/Users/dany.stefan@mail.mcgill.ca/ProductionCo...,/Users/dany.stefan@mail.mcgill.ca/ProductionCo...,1452363639079685,NOTEBOOK,,"[{""artifact_path"":""model"",""signature"":{""inputs...","{""cluster_name"":""INSY_695_cluster"",""spark_vers...",2323102042788931009_7549896367611906941_046585...,https://canadacentral.azuredatabricks.net,"{""installable"":[{""pypi"":{""package"":""mlflow""}},...",1650241071516.0,,
5,e6d73c1f21b249208cb646c917fadf39,1452363639079685,FINISHED,dbfs:/databricks/mlflow-tracking/1452363639079...,2022-04-18 00:17:41.530000+00:00,2022-04-18 00:17:46.425000+00:00,0.68984,,,,,,,,200000,-1,gbdt,False,,0,,1.0,29,0.001,0.0,20,870,voting,auto,32,['huber'],-1,warn,0.0,1.0,binary,,auto,6,0.02000001,...,,,,,,,,,,,,,,,,,,,,,,,,0417-225212-v1ch2xxi,dany.stefan@mail.mcgill.ca,1962064964953307,adb-1962064964953307.7.azuredatabricks.net,/Users/dany.stefan@mail.mcgill.ca/ProductionCo...,/Users/dany.stefan@mail.mcgill.ca/ProductionCo...,1452363639079685,NOTEBOOK,,"[{""artifact_path"":""model"",""signature"":{""inputs...","{""cluster_name"":""INSY_695_cluster"",""spark_vers...",2323102042788931009_7549896367611906941_046585...,https://canadacentral.azuredatabricks.net,"{""installable"":[{""pypi"":{""package"":""mlflow""}},...",1650241066548.0,,
6,fecba01e4f8e45c6a3c203cf23b418fc,1452363639079685,FINISHED,dbfs:/databricks/mlflow-tracking/1452363639079...,2022-04-18 00:17:36.627000+00:00,2022-04-18 00:17:41.381000+00:00,1.0,,,,,,,,200000,-1,gbdt,False,,0,,1.0,4,0.001,0.0,20,620,feature,auto,1,['l2'],-1,warn,0.0,1.0,binary,,auto,20,0.02500001,...,,,,,,,,,,,,,,,,,,,,,,,,0417-225212-v1ch2xxi,dany.stefan@mail.mcgill.ca,1962064964953307,adb-1962064964953307.7.azuredatabricks.net,/Users/dany.stefan@mail.mcgill.ca/ProductionCo...,/Users/dany.stefan@mail.mcgill.ca/ProductionCo...,1452363639079685,NOTEBOOK,,"[{""artifact_path"":""model"",""signature"":{""inputs...","{""cluster_name"":""INSY_695_cluster"",""spark_vers...",2323102042788931009_7549896367611906941_046585...,https://canadacentral.azuredatabricks.net,"{""installable"":[{""pypi"":{""package"":""mlflow""}},...",1650241061541.0,,
7,7bcfb02a50c24ef8b27bf7aa2e383e53,1452363639079685,FINISHED,dbfs:/databricks/mlflow-tracking/1452363639079...,2022-04-18 00:17:31.638000+00:00,2022-04-18 00:17:36.479000+00:00,0.764706,,,,,,,,200000,-1,gbdt,False,,0,,1.0,1,0.001,0.0,20,1000,feature,auto,4,['mape'],-1,warn,0.0,1.0,binary,,auto,18,0.02500001,...,,,,,,,,,,,,,,,,,,,,,,,,0417-225212-v1ch2xxi,dany.stefan@mail.mcgill.ca,1962064964953307,adb-1962064964953307.7.azuredatabricks.net,/Users/dany.stefan@mail.mcgill.ca/ProductionCo...,/Users/dany.stefan@mail.mcgill.ca/ProductionCo...,1452363639079685,NOTEBOOK,,"[{""artifact_path"":""model"",""signature"":{""inputs...","{""cluster_name"":""INSY_695_cluster"",""spark_vers...",2323102042788931009_7549896367611906941_046585...,https://canadacentral.azuredatabricks.net,"{""installable"":[{""pypi"":{""package"":""mlflow""}},...",1650241056637.0,,
8,51cbf650b3e04c549e012f001fbef671,1452363639079685,FINISHED,dbfs:/databricks/mlflow-tracking/1452363639079...,2022-04-18 00:17:26.767000+00:00,2022-04-18 00:17:31.480000+00:00,0.764706,,,,,,,,200000,-1,gbdt,False,,0,,1.0,1,0.001,0.0,20,1000,feature,auto,4,['average_precision'],-1,warn,0.0,1.0,binary,,auto,20,0.02500001,...,,,,,,,,,,,,,,,,,,,,,,,,0417-225212-v1ch2xxi,dany.stefan@mail.mcgill.ca,1962064964953307,adb-1962064964953307.7.azuredatabricks.net,/Users/dany.stefan@mail.mcgill.ca/ProductionCo...,/Users/dany.stefan@mail.mcgill.ca/ProductionCo...,1452363639079685,NOTEBOOK,,"[{""artifact_path"":""model"",""signature"":{""inputs...","{""cluster_name"":""INSY_695_cluster"",""spark_vers...",2323102042788931009_7549896367611906941_046585...,https://canadacentral.azuredatabricks.net,"{""installable"":[{""pypi"":{""package"":""mlflow""}},...",1650241051644.0,,
9,7d332a20dd694221aebb65253b427080,1452363639079685,FINISHED,dbfs:/databricks/mlflow-tracking/1452363639079...,2022-04-18 00:17:21.812000+00:00,2022-04-18 00:17:26.619000+00:00,0.764706,,,,,,,,200000,-1,gbdt,False,,0,,1.0,1,0.001,0.0,20,910,feature,auto,2,['average_precision'],-1,warn,0.0,1.0,binary,,auto,18,0.02500001,...,,,,,,,,,,,,,,,,,,,,,,,,0417-225212-v1ch2xxi,dany.stefan@mail.mcgill.ca,1962064964953307,adb-1962064964953307.7.azuredatabricks.net,/Users/dany.stefan@mail.mcgill.ca/ProductionCo...,/Users/dany.stefan@mail.mcgill.ca/ProductionCo...,1452363639079685,NOTEBOOK,,"[{""artifact_path"":""model"",""signature"":{""inputs...","{""cluster_name"":""INSY_695_cluster"",""spark_vers...",2323102042788931009_7549896367611906941_046585...,https://canadacentral.azuredatabricks.net,"{""installable"":[{""pypi"":{""package"":""mlflow""}},...",1650241046781.0,,


In [0]:
runs.columns
runs['tags.estimator_name']

In [0]:
runs['tags.mlflow.autologging'].value_counts()

## Best Experiment
Contributor(s): @dany-stefan, @danyalhamid1996

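This section selects the run with the highest recall, registers its model in the MLflow Model Registry, and promotes it to Production on Databricks.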

In [0]:
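# sort by recall, best first (search_runs sorts in ascending order unless DESC is given)
runs = mlflow.search_runs(order_by=['metrics.recall DESC'], max_results=1)
best_run_id = runs.loc[0, 'run_id']
runs.loc[0]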
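
Runs that never logged `metrics.recall` (such as row 0 in the table above) carry a missing value in that column. As a minimal sketch, assuming a `runs` DataFrame with the `run_id` and `metrics.recall` columns shown above, the best run can also be picked locally with pandas. `pick_best_run` is a hypothetical helper name, not part of MLflow, and the toy data below is illustrative only.

```python
import pandas as pd


def pick_best_run(runs: pd.DataFrame, metric: str = "metrics.recall") -> pd.Series:
    """Return the row with the highest `metric`, skipping runs that never logged it."""
    scored = runs.dropna(subset=[metric])
    if scored.empty:
        raise ValueError(f"no run logged {metric!r}")
    return scored.loc[scored[metric].idxmax()]


# Toy frame: recall values copied from three rows of the table above,
# run IDs shortened for readability.
toy_runs = pd.DataFrame(
    {
        "run_id": ["1770d7bb", "23b0297d", "414fe4d2"],
        "metrics.recall": [float("nan"), 0.783422, 0.820856],
    }
)
best = pick_best_run(toy_runs)
print(best["run_id"])
```

Filtering out missing values first keeps a run without a logged recall from being selected by accident.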

In [0]:
# register model to the MLflow registry

# this requires a database to hold the registry so it doesn't work locally, but it should work on Databricks
# if this doesn't work, try removing the '/gb_model' from the first parameter of register_model

best_run_id = runs.loc[0]['run_id']
model_name = 'best_model'

best_model = mlflow.register_model(
    f'runs:/{best_run_id}/gb_model',
    model_name
    )


In [0]:
# fetch model from registry for production

from mlflow.tracking import MlflowClient
 
client = MlflowClient()

# first, we change the stage of the model to production within the registry
client.transition_model_version_stage(
  name=model_name,
  version=best_model.version,
  stage="Production",
)

# now, we load the model

production_model = mlflow.pyfunc.load_model(f"models:/{model_name}/production")


In [0]:
# code for updating the model in the registry if we make changes or optimize it further
# registering under an existing name automatically creates a new version

# replace best_model.run_id with the ID of the improved run, and gb_model with its artifact path if it differs
new_best_model = mlflow.register_model(f"runs:/{best_model.run_id}/gb_model", model_name)

# Archive the old model version
client.transition_model_version_stage(
  name=model_name,
  version=best_model.version,
  stage="Archived"
)
 
# Promote the new model version to Production
client.transition_model_version_stage(
  name=model_name,
  version=new_best_model.version,
  stage="Production"
)
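
The archive-then-promote pair above can also be collapsed into a single call. The sketch below assumes an MLflow version in which `MlflowClient.transition_model_version_stage` accepts the `archive_existing_versions` argument (stage-based promotion is deprecated in recent MLflow releases). `promote_to_production` is a hypothetical helper name, and the client is passed in as an argument so the function can be exercised without a live registry.

```python
def promote_to_production(client, name, version):
    """Move `version` of model `name` to Production, archiving the versions already there.

    `client` is expected to be an mlflow.tracking.MlflowClient, or any object
    exposing the same `transition_model_version_stage` method.
    """
    return client.transition_model_version_stage(
        name=name,
        version=version,
        stage="Production",
        archive_existing_versions=True,  # archive whatever currently holds the stage
    )
```

With the names used in this notebook, the call would be `promote_to_production(client, model_name, new_best_model.version)`.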

In [0]:
# from datetime import datetime, timedelta
# earliest_start_time = (datetime.now() - timedelta(days=14)).strftime('%Y-%m-%d')
# recent_runs = runs[runs.start_time >= earliest_start_time]
# recent_runs

# recent_runs['Run Date'] = recent_runs.start_time.dt.floor(freq='D')

# best_runs_per_day_idx = recent_runs.groupby(
#   ['Run Date']
# )['metrics.recall'].idxmax()  # higher recall is better, so take the max per day
# best_runs = recent_runs.loc[best_runs_per_day_idx]
# display(best_runs[['Run Date', 'metrics.recall']])
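
The commented-out cell above outlines a best-run-per-day summary. A minimal runnable sketch of that idea follows, assuming the `start_time`, `run_id`, and `metrics.recall` columns of the runs table. It uses `idxmax`, since a higher recall is better; `best_run_per_day` is a hypothetical helper name and the toy data is illustrative only.

```python
import pandas as pd


def best_run_per_day(runs: pd.DataFrame, metric: str = "metrics.recall") -> pd.DataFrame:
    """Return, for each calendar day, the run with the highest `metric`."""
    scored = runs.dropna(subset=[metric]).copy()
    # Truncate each timestamp to midnight so runs group by day.
    scored["Run Date"] = scored["start_time"].dt.floor("D")
    best_idx = scored.groupby("Run Date")[metric].idxmax()
    return scored.loc[best_idx, ["Run Date", "run_id", metric]].reset_index(drop=True)


# Toy frame with made-up run IDs: one run on 17 April, two on 18 April.
toy_runs = pd.DataFrame(
    {
        "run_id": ["a", "b", "c"],
        "start_time": pd.to_datetime(
            ["2022-04-17 23:50:00", "2022-04-18 00:17:51", "2022-04-18 00:18:01"],
            utc=True,
        ),
        "metrics.recall": [0.70, 0.820856, 0.783422],
    }
)
daily_best = best_run_per_day(toy_runs)
print(daily_best)
```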

## Generating deepchecks report
Contributor(s): @rsoleimani

In [None]:
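import numpy as np
from deepchecks.tabular import Dataset
from deepchecks.tabular.suites import full_suite

label_col = 'Churn'

# pd.DataFrame(features, labels) would treat the labels as the row index,
# so attach the label as an explicit column instead
df_train = pd.DataFrame(sample_x_over).reset_index(drop=True)
df_train[label_col] = np.ravel(sample_y_over)
df_test = pd.DataFrame(strat_test_set_x).reset_index(drop=True)
df_test[label_col] = np.ravel(strat_test_set_y)

ds_train = Dataset(df_train, label=label_col, cat_features=[])
ds_test = Dataset(df_test, label=label_col, cat_features=[])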

In [None]:
suite = full_suite()
suite.run(train_dataset=ds_train, test_dataset=ds_test, model=model)