# Model Training and Evaluation

Sub-Task 1:Build churn model(s) to try to predict the churn probability of any customer.

Sub-Task 2:Evaluate your model, using a holdout set, and with metrics of your choosing.

Sub-Task 3:Interpret the results and use them to formulate answers to the client’s hypotheses and questions.

### Import packages

In [None]:
import datetime
import matplotlib.pyplot as plt
import numpy as np
import os
import pandas as pd
import pickle
import seaborn as sns
import shap

from sklearn.model_selection import train_test_split

# Handle class imbalance
from imblearn.pipeline import Pipeline as imbPipeline
from imblearn.over_sampling import SMOTE, ADASYN
from imblearn.under_sampling import (TomekLinks, 
                                     NeighbourhoodCleaningRule as NCR, 
                                     RandomUnderSampler)


# ML
from xgboost import XGBClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB

# Assemble pipeline(s)
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.compose import ColumnTransformer, make_column_selector
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, MinMaxScaler
from sklearn import set_config

# Performance metrics
from sklearn.model_selection import RepeatedStratifiedKFold, StratifiedKFold, cross_validate
from sklearn.metrics import make_scorer, classification_report
from mlxtend.evaluate import lift_score

In [None]:
os.environ["OMP_NUM_THREADS"] = "1"  #export OMP_NUM_THREADS=1

In [None]:
# Show plots in jupyter notebook
%matplotlib inline

In [None]:
# Set plot style
sns.set(color_codes=True)

In [None]:
# Set maximum number of columns to be displayed
pd.set_option('display.max_columns', 100)

In [None]:
# load JS visualization code to notebook
shap.initjs()

#### Loading data (pickle)

In [None]:
os.getcwd()

In [None]:
os.chdir('/Users/soumyadeepray/My Documents/DS_Projects/BCG_Customer_Churn_Case_Study/data')

In [None]:
df = pd.read_pickle('./processed_data.pkl')
df.head(5)

### Defining sampling approaches

Churn data sets suffer usually from high-class imbalance. This means that the number of churners are in the minority. To deal with this class imbalance the package imbalanced-learn comes with a battery of different sampling approaches.

- `SMOTE - (Synthetic Minority Oversampling Technique)` is an oversampling technique where the synthetic samples are generated for the minority class. 

- `ADASYN - (Adaptive Synthetic)` is an algorithm that generates synthetic data. Its greatest advantages are not copying the same minority data, and generating more data for “harder to learn" examples.

- `TomekLinks` is used as an undersampling method and removes noisy and borderline majority class examples. The procedure for finding Tomek Links can be used to locate all cross-class nearest neighbors. If the examples in the minority class are held constant, the procedure can be used to find all of those examples in the majority class that are closest to the minority class, then removed.

- `NCR - Neighborhood Cleaning rule` is an undersampling technique that combines both the Condensed Nearest Neighbor (CNN) Rule to remove redundant examples and the Edited Nearest Neighbors (ENN) Rule to remove noisy or ambiguous examples.


In [None]:
# Store different sampling approaches
sampl_app = dict()

# No sampling
#sampl_app['no_sampling'] = ('no_sampling', None)

# SMOTE
sampl_app['o_SMOTE'] = ('smote', SMOTE())

# ADASYN (Adaptive Synthetic) 
sampl_app['o_ADASYN'] = ('adasyn', ADASYN(sampling_strategy='not minority'))

# TomekLinks
sampl_app['u_TomekLinks'] = ('tomeklinks', TomekLinks())

# # NCR
# sampl_app['u_NCR'] = ('ncr', NCR())

# # SMOTE + TomekLinks
# sampl_app['h_SMOTE_Tomek'] = imbPipeline([('smote', SMOTE()),
#                                           ('tomeklinks', TomekLinks())])

# # SMOTE + NCR
# sampl_app['h_SMOTE_NCR'] = imbPipeline([('smote', SMOTE()),
#                                         ('ncr', 
#                                          NCR(sampling_strategy='not majority'))]
#                                        )

To use these sampling methods later in the pipeline, they have to be brought in the right format (tuple) first. Approaches that use a combination of multiple sampling methods, have to be wrapped in an `imbPipeline` object.

In [None]:
sampl_app

In [39]:
# Linear model (logistic regression)
lr = LogisticRegression(solver='saga',
                            warm_start=True,
                            max_iter=1000)

# RandomForest
rf = RandomForestClassifier()

# XGB
xgb = XGBClassifier(tree_method="hist",
                        verbosity=0,
                        silent=True)

# LR, XGB,RF, FFNN
lr_xgb_rf = VotingClassifier(estimators=[('lr', lr),
                                         ('xgb', xgb),
                                         ('rf', rf)
                                        ], voting='soft')

In [None]:
# Store them as tuples in a list
classifiers = {'LogisiticRegression': lr,
          'RandomForestClassifier': rf,
          'XGBClassifier': xgb}

#### Splitting data


First of all we will split the data into the variable that we are trying to predict y (churn) and those variables that we will use to predict churn X (the
rest)

In [None]:
y = df["churn"]
X = df.drop(labels = ["Unnamed: 0","origin_up", "id","churn"],axis = 1)

Next we will split the data into training and validation data. The percentages of each test can be changed but a 75%-25% is a good ratio.
We also use a random state generator in order to split it randomly.

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=0.25)
print(f"len of X {len(y)}\nlen of train {len(y_train)}\nlen of test {len(y_test)}")

In [None]:
# Scores to track
scorer = {
    'lift_score': make_scorer(lift_score),
    'roc_auc':'roc_auc', 
    'f1_macro':'f1_macro', 
    'recall':'recall'
}

# To store the performance
bnchmrk_results = {}

In [40]:
# Store them as tuples in a list
models = [#('lr', lr),
          #('rf', rf),
          #('xgb', xgb)
        #   ('svc',svc),
        #   ('gnb', gnb),
        #   ('lgb', lgb),
        #   ('knn', knn)]
        #   ('gev_nn', gev_nn),
        #   ('ffnn', ffnn),
          ('lr_xgb_rf', lr_xgb_rf)]
        #   ('lr_xgb_rf_ffnn', lr_xgb_rf_ffnn)]

In [None]:
# Initial pipeline
ppl = imbPipeline([
    ('transformation', ColumnTransformer([
        ('num',make_pipeline(
            SimpleImputer(strategy='mean'),
            MinMaxScaler()),
         make_column_selector(dtype_include='number')
        ),
        ('cat',make_pipeline(
            SimpleImputer(strategy='most_frequent'),
            OneHotEncoder(sparse=False, handle_unknown='ignore')),
         make_column_selector(dtype_include='object')
        )])
    )
])

initial_steps = len(ppl.steps)

In [None]:
# Scores to track
scorer = {
    'lift_score': make_scorer(lift_score),
    'roc_auc':'roc_auc', 
    'f1_macro':'f1_macro', 
    'recall':'recall'
}

# To store the performance
bnchmrk_results = {}

In [41]:
model_results = {} #Store model performance

for m in models:

    sampling_results = {} # Store sampling performance for respective model
    for sa in sampl_app.keys():
        #logging.info(f"== Running {m[0]} with {sa} strategy ==")
          
        # Extend initial pipeline by sampling approach and model
        # Since some sampling approaches have multiple steps 
        # (e.g., SMOTE + RND) I have to append them via loop
        if hasattr(sampl_app[sa], 'steps'):
            for s in sampl_app[sa].steps:
                ppl.steps.append(s)
        else:
            ppl.steps.append(sampl_app[sa])
            
        # Add model to pipeline
        ppl.steps.append(m)

        # Configure KFold and CV
        rsf = RepeatedStratifiedKFold(n_repeats=5, random_state=42)
            
        scores = cross_validate(ppl, X, y, 
                                    cv=rsf, 
                                    scoring=scorer, 
                                    verbose=0, 
                                    n_jobs=1,
                                    error_score='raise',
                                    fit_params=None,
                                    return_estimator=False
                                   )
            
        # Write results in dict
        sampling_results[sa] = scores
            
        # After running CV we reset pipeline to initial state
        # to be clean for next iteration
        ppl = ppl[:initial_steps]
        
    # Write results in dict
    model_results[m[0]] = sampling_results

In [44]:
model_results['lr_xgb_rf']['o_ADASYN']

{'fit_time': array([3.426965  , 3.29950309, 3.36576891, 3.4700439 , 3.36356997,
        3.55037904, 3.44505906, 3.60998607, 3.4273479 , 3.45576406,
        3.34746909, 3.36455202, 3.64999604, 3.52729607, 3.38929296,
        3.32763386, 3.41151786, 3.40283394, 3.42948508, 3.52487588,
        3.42055511, 3.57840395, 3.43593502, 3.31255007, 3.63628793]),
 'score_time': array([0.10512996, 0.13772273, 0.13218832, 0.12826228, 0.10882497,
        0.13108087, 0.12550473, 0.15912294, 0.11616611, 0.13758898,
        0.11991286, 0.20744014, 0.17795181, 0.14138913, 0.13046002,
        0.14185214, 0.20169711, 0.11568689, 0.12576008, 0.12802505,
        0.15417862, 0.15334988, 0.12695098, 0.13796687, 0.15740085]),
 'test_lift_score': array([ 8.22816901,  9.55055332,  8.91384977, 10.28521127,  9.17471535,
         9.55055332,  8.47017399,  7.34657948, 10.28521127,  9.17471535,
         7.91170098,  9.49404117,  9.14241002,  9.14241002,  8.02787593,
         7.99960876,  8.81589537,  9.55055332, 10.28

In [None]:
model_results['xgb']['o_SMOTE']

In [None]:
sampling_results

In [None]:
type(model_results)

In [None]:
model_results_df = pd.DataFrame.from_dict(model_results['xgb'])
model_results_df