# **Bank Customer Exit Predictor (CI PP-5)** 

# **ML Modeling : Classification**

## Objectives

* Fit and evaluate a classification model that predicts whether a customer will exit.

## Inputs

* outputs/datasets/collection/BankCustomerData.csv
* Data cleaning and Feature Engineering conclusions based on respective notebooks.

## Outputs

* Train and Test set (Features and Target)
* Data cleaning and Feature Engineering pipeline
* Modeling pipeline to predict customer exit
* Feature Importance Plot


---

# Change working directory

* The notebooks are stored in a subfolder, so when running a notebook in the editor we need to change the working directory from its current folder to the parent folder.

1. We access the current directory with os.getcwd()

In [None]:
import os
current_dir = os.getcwd()
current_dir

2. We want to make the parent of the current directory the new current directory
* os.path.dirname() gets the parent directory
* os.chdir() defines the new current directory

In [None]:
os.chdir(os.path.dirname(current_dir))
print("You have set a new current directory")

3. Confirm the new current directory

In [None]:
current_dir = os.getcwd()
current_dir

---

# Load Data

* We load the dataset from the outputs folder and drop CustomerId, Surname and RowNumber, as they are only identifiers and do not affect the exit study. We also drop Tenure, which is the target variable for the regression model.

In [None]:
import numpy as np
import pandas as pd
df = (pd.read_csv("outputs/datasets/collection/BankCustomerData.csv")
      .drop(labels=['Tenure', 'CustomerId', 'Surname', 'RowNumber'], axis=1) 
  )
print(df.shape)
df.head(3)

---

# ML Pipeline: Classification

### 1. ML Pipeline : Data Cleaning and Feature Engineering

* Based on the conclusions of the Data Cleaning and Feature Engineering notebooks, we prepare a custom pipeline.
* We don't require any data cleaning steps.

In [None]:
from sklearn.pipeline import Pipeline

# Feature Engineering: Ordinal Encoder and Transformation
from feature_engine.encoding import OrdinalEncoder
from feature_engine import transformation as vt


def PipelineDataCleanAndFeatEng():
    pipeline_base = Pipeline([
        ("OrdinalCategoricalEncoder", OrdinalEncoder(encoding_method='arbitrary', variables=['Gender', 'Geography'])),
        ("log", vt.LogTransformer(variables=['Age']) )                                              
    ])

    return pipeline_base


PipelineDataCleanAndFeatEng()
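
To make the two feature engineering steps concrete, the sketch below reproduces their effect with plain pandas and NumPy on a few invented rows. It is not the feature_engine implementation: the assumption here is that the arbitrary encoding assigns integers in order of first appearance, and that the log step is the natural logarithm.

```python
import numpy as np
import pandas as pd

# Toy rows; the values are invented for illustration only.
toy = pd.DataFrame({
    "Gender": ["Female", "Male", "Female"],
    "Geography": ["France", "Spain", "France"],
    "Age": [42, 35, 58],
})

encoded = toy.copy()
for col in ["Gender", "Geography"]:
    # Assumption: integer codes follow the order of first appearance.
    codes, categories = pd.factorize(encoded[col])
    encoded[col] = codes

# Natural log of Age; only valid for strictly positive values.
encoded["Age"] = np.log(toy["Age"])

print(encoded)
```

The integer codes carry no ordering meaning, which is usually acceptable for tree-based models such as the ones compared later in this notebook.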

### 2. ML Pipeline for Modeling and Hyperparameter Optimisation

* We prepare a pipeline for feature scaling and selection

In [None]:
# Standard Scaler for Feature Scaling
from sklearn.preprocessing import StandardScaler

# SelectFromModel for Feature Selection
from sklearn.feature_selection import SelectFromModel

# All ML classification algorithms
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.ensemble import AdaBoostClassifier
from xgboost import XGBClassifier


def PipelineClf(model):
    pipeline_base = Pipeline([
        ("scaler", StandardScaler()),
        ("feat_selection", SelectFromModel(model)),
        ("model", model),
    ])

    return pipeline_base

* For hyperparameter optimisation we use a custom class from Code Institute's Scikit lesson

In [None]:
from sklearn.model_selection import GridSearchCV


class HyperparameterOptimizationSearch:

    def __init__(self, models, params):
        self.models = models
        self.params = params
        self.keys = models.keys()
        self.grid_searches = {}

    def fit(self, X, y, cv, n_jobs, verbose=1, scoring=None, refit=False):
        for key in self.keys:
            print(f"\nRunning GridSearchCV for {key} \n")

            model = PipelineClf(self.models[key])
            params = self.params[key]
            gs = GridSearchCV(model, params, cv=cv, n_jobs=n_jobs,
                              verbose=verbose, scoring=scoring, )
            gs.fit(X, y)
            self.grid_searches[key] = gs

    def score_summary(self, sort_by='mean_score'):
        def row(key, scores, params):
            d = {
                'estimator': key,
                'min_score': min(scores),
                'max_score': max(scores),
                'mean_score': np.mean(scores),
                'std_score': np.std(scores),
            }
            return pd.Series({**params, **d})

        rows = []
        for k in self.grid_searches:
            params = self.grid_searches[k].cv_results_['params']
            scores = []
            for i in range(self.grid_searches[k].cv):
                key = "split{}_test_score".format(i)
                r = self.grid_searches[k].cv_results_[key]
                scores.append(r.reshape(len(params), 1))

            all_scores = np.hstack(scores)
            for p, s in zip(params, all_scores):
                rows.append((row(k, s, p)))

        df = pd.concat(rows, axis=1).T.sort_values([sort_by], ascending=False)
        columns = ['estimator', 'min_score',
                   'mean_score', 'max_score', 'std_score']
        columns = columns + [c for c in df.columns if c not in columns]
        return df[columns], self.grid_searches
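
The aggregation inside score_summary can be hard to follow, so here is a minimal sketch of the same idea on a hand-made stand-in for cv_results_. The dictionary fake_cv_results and its scores are invented; only the key naming (params, split0_test_score, ...) mirrors what GridSearchCV produces.

```python
import numpy as np

# Invented fold scores: 2 parameter combinations x 3 folds.
fake_cv_results = {
    "params": [{"model__n_estimators": 5}, {"model__n_estimators": 10}],
    "split0_test_score": np.array([0.70, 0.80]),
    "split1_test_score": np.array([0.72, 0.78]),
    "split2_test_score": np.array([0.74, 0.82]),
}
n_splits = 3
n_params = len(fake_cv_results["params"])

# One column per fold, one row per parameter combination.
fold_columns = [
    fake_cv_results[f"split{i}_test_score"].reshape(n_params, 1)
    for i in range(n_splits)
]
all_scores = np.hstack(fold_columns)  # shape: (n_params, n_splits)

summary = []
for params, scores in zip(fake_cv_results["params"], all_scores):
    summary.append({
        **params,
        "min_score": scores.min(),
        "mean_score": scores.mean(),
        "max_score": scores.max(),
        "std_score": scores.std(),
    })

for row in summary:
    print(row)
```

Each row of all_scores holds the per-fold scores of one parameter combination, which is why the class can report min, mean, max and standard deviation per candidate.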

---

# Split Train and Test Set

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    df.drop(['Exited'], axis=1),
    df['Exited'],
    test_size=0.2,
    random_state=0,
)

print(X_train.shape, y_train.shape, X_test.shape, y_test.shape)
X_train.info()

---

# Managing Target Imbalance 

1. Fitting the Data Cleaning and Feature Engineering pipeline on the train set and applying it to both the train and test sets.

In [None]:
pipeline_preprocessed = PipelineDataCleanAndFeatEng()
X_train = pipeline_preprocessed.fit_transform(X_train)
X_test = pipeline_preprocessed.transform(X_test)
print(X_train.shape, y_train.shape, X_test.shape, y_test.shape)

2. Checking target distribution in train set 

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style("whitegrid")
y_train.value_counts().plot(kind='bar', title='Train Set Target Distribution')
plt.show()

* The target variable is imbalanced in the train set: there are relatively few Exited cases.
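
Besides the bar plot, the imbalance can be expressed as class proportions. The sketch below uses an invented target series, y_toy, rather than the notebook's y_train, so the numbers are illustrative only.

```python
import pandas as pd

# Invented target: 8 customers who stayed (0) and 2 who exited (1).
y_toy = pd.Series([0] * 8 + [1] * 2, name="Exited")

# Share of each class in the series.
proportions = y_toy.value_counts(normalize=True)
print(proportions)
```

Applying the same call to y_train would give the actual class shares for this dataset.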

3. Using SMOTE (Synthetic Minority Oversampling Technique) to balance Train Set target

In [None]:
from imblearn.over_sampling import SMOTE
oversample = SMOTE(sampling_strategy='minority',k_neighbors=30, random_state=0)
X_train, y_train = oversample.fit_resample(X_train, y_train)
print(X_train.shape, y_train.shape, X_test.shape, y_test.shape)
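
For intuition about what SMOTE generates, the following is a simplified NumPy sketch of its core idea: a synthetic point is placed on the line segment between a minority sample and one of its nearest minority neighbours. This is not the imblearn implementation; the function synth_sample, the toy points and k=2 are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Invented minority-class points with two features.
minority = np.array([
    [1.0, 2.0],
    [1.5, 1.8],
    [2.0, 2.2],
    [1.2, 2.5],
])


def synth_sample(X, k, rng):
    """Create one synthetic point by interpolating towards a near neighbour."""
    i = rng.integers(len(X))
    distances = np.linalg.norm(X - X[i], axis=1)
    # Position 0 of the sorted order is the point itself (distance 0).
    neighbours = np.argsort(distances)[1:k + 1]
    j = rng.choice(neighbours)
    gap = rng.random()  # uniform in [0, 1)
    return X[i] + gap * (X[j] - X[i])


new_points = np.array([synth_sample(minority, k=2, rng=rng) for _ in range(3)])
print(new_points)
```

Because every synthetic point is an interpolation between two existing minority samples, the number of neighbours considered has to be smaller than the number of minority samples available.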

4. Checking Train Set Target distribution after resampling

In [None]:
import matplotlib.pyplot as plt
y_train.value_counts().plot(kind='bar', title='Train Set Target Distribution')
plt.show()

---

# Grid Search CV - Sklearn

1. Using algorithms with default hyperparameters to identify the most suitable algorithm

In [None]:
models_default = {
    "GradientBoostingClassifier": GradientBoostingClassifier(random_state=0),
    "ExtraTreesClassifier": ExtraTreesClassifier(random_state=0),
    "AdaBoostClassifier": AdaBoostClassifier(random_state=0),
    "XGBClassifier": XGBClassifier(random_state=0),
    "DecisionTreeClassifier": DecisionTreeClassifier(random_state=0),
    "RandomForestClassifier": RandomForestClassifier(random_state=0),
}

params_default = {
    "GradientBoostingClassifier": {},
    "ExtraTreesClassifier": {},
    "AdaBoostClassifier": {},
    "XGBClassifier": {},
    "DecisionTreeClassifier": {},
    "RandomForestClassifier": {},
}

2. Using make_scorer with recall_score to run a quick GridSearchCV for the binary classifier

In [None]:
from sklearn.metrics import make_scorer, recall_score
best_alg = HyperparameterOptimizationSearch(models=models_default, params=params_default)
best_alg.fit(X_train, y_train,
           scoring =  make_scorer(recall_score, pos_label=1),
           n_jobs=-1, cv=5)

3. Checking Score Summary

In [None]:
grid_search_summary, grid_search_pipelines = best_alg.score_summary(sort_by='mean_score')
grid_search_summary 

### Identifying best hyperparameter configuration for top two ML Algorithms

1. Defining top two models and parameters for further analysis

In [None]:
models_search = {
   "AdaBoostClassifier":AdaBoostClassifier(random_state=0),
   "GradientBoostingClassifier":GradientBoostingClassifier(random_state=0),
}

params_search = {
  "AdaBoostClassifier":{'model__n_estimators': [5,10,15,20],
                          'model__learning_rate':[0.1,0.5,1],
                          'model__algorithm': ['SAMME', 'SAMME.R']
                            },
  "GradientBoostingClassifier":{'model__n_estimators':[10,20,30],
                                'model__learning_rate':[.1,.5,1], 
                                'model__max_depth': [3,5,8],
                                'model__min_samples_split': [2,25,40],
                                'model__min_samples_leaf': [1,50,100], 
                                'model__max_leaf_nodes': [None,25,40]
                            }
}
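
To gauge the cost of this search before running it, the number of candidate combinations per grid can be counted directly. The sketch copies the grid values from the cell above; the names grids, cv_folds and n_candidates are illustrative. It counts cross-validation fits only and leaves out the final refit that GridSearchCV performs by default.

```python
from math import prod

# Grid values copied from params_search above.
grids = {
    "AdaBoostClassifier": {
        "model__n_estimators": [5, 10, 15, 20],
        "model__learning_rate": [0.1, 0.5, 1],
        "model__algorithm": ["SAMME", "SAMME.R"],
    },
    "GradientBoostingClassifier": {
        "model__n_estimators": [10, 20, 30],
        "model__learning_rate": [0.1, 0.5, 1],
        "model__max_depth": [3, 5, 8],
        "model__min_samples_split": [2, 25, 40],
        "model__min_samples_leaf": [1, 50, 100],
        "model__max_leaf_nodes": [None, 25, 40],
    },
}
cv_folds = 5

n_candidates = {
    name: prod(len(values) for values in grid.values())
    for name, grid in grids.items()
}

for name, count in n_candidates.items():
    print(f"{name}: {count} candidates, {count * cv_folds} cross-validation fits")
```

With these grids the Gradient Boosting search is by far the larger of the two, which is worth keeping in mind when choosing n_jobs.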

2. Using make_scorer with recall_score to run GridSearchCV for the binary classifier

In [None]:
from sklearn.metrics import recall_score, make_scorer
best_alg = HyperparameterOptimizationSearch(models=models_search, params=params_search)
best_alg.fit(X_train, y_train,
           scoring =  make_scorer(recall_score, pos_label=1),
           n_jobs=-1, cv=5)

3. Checking Score Summary

In [None]:
grid_search_summary, grid_search_pipelines = best_alg.score_summary(sort_by='mean_score')
grid_search_summary 

4. Identifying best model 

In [None]:
best_model = grid_search_summary.iloc[0,0]
best_model

5. Identifying best parameters

In [None]:
best_parameters = grid_search_pipelines[best_model].best_params_
best_parameters

6. Defining the best classification pipeline based on the extensive search

In [None]:
pipeline_clf = grid_search_pipelines[best_model].best_estimator_
pipeline_clf

---

# Assess Feature Importance

Train set DataFrame showing the variables

In [None]:
X_train.head(3)

Assessing feature importance with current classification model

In [None]:
# Creating Dataframe
df_feature_importance = (pd.DataFrame(data={
    'Features': X_train.columns[pipeline_clf['feat_selection'].get_support()],
    'Importance': pipeline_clf['model'].feature_importances_})
    .sort_values(by='Importance', ascending=False)
)

# Arranging best features in order
best_features = df_feature_importance['Features'].to_list()

# Plotting best features against importance
print(f"* These are the {len(best_features)} most important features in descending order. "
      f"The model was trained on them: \n{df_feature_importance['Features'].to_list()}")

df_feature_importance.plot(kind='bar', x='Features', y='Importance')
plt.show()
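
The DataFrame above works because the boolean mask from get_support() and the fitted model's feature_importances_ refer to the same subset of columns. The sketch below checks that alignment on synthetic data; the dataset, the column names f0..f5 and the use of RandomForestClassifier as a stand-in estimator are assumptions, not the notebook's actual setup.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data with hypothetical column names.
X_arr, y_demo = make_classification(
    n_samples=200, n_features=6, n_informative=3, random_state=0
)
X_demo = pd.DataFrame(X_arr, columns=[f"f{i}" for i in range(6)])

pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("feat_selection", SelectFromModel(
        RandomForestClassifier(n_estimators=20, random_state=0))),
    ("model", RandomForestClassifier(n_estimators=20, random_state=0)),
])
pipe.fit(X_demo, y_demo)

# Boolean mask over the original columns: True means the feature was kept.
mask = pipe["feat_selection"].get_support()
kept = X_demo.columns[mask]

# The final model only saw the kept columns, so the lengths match.
importances = pipe["model"].feature_importances_
print(dict(zip(kept, importances.round(3))))
```

If the two lengths ever differed, building the importance DataFrame would raise an error, so this check is a cheap way to confirm the pipeline steps are wired as expected.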

---

# Evaluate Pipeline on Train and Test Sets

Evaluating pipeline performance using classification report and confusion matrix.

In [None]:
from sklearn.metrics import classification_report, confusion_matrix


def confusion_matrix_and_report(X, y, pipeline, label_map):

    prediction = pipeline.predict(X)

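    print('---  Confusion Matrix  ---')
    # prediction is passed as y_true on purpose: rows then hold predictions
    # and columns hold actual values, matching the index/column labels below
    print(pd.DataFrame(confusion_matrix(y_true=prediction, y_pred=y),
          columns=[["Actual " + sub for sub in label_map]],
          index=[["Prediction " + sub for sub in label_map]]
          ))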
    print("\n")

    print('---  Classification Report  ---')
    print(classification_report(y, prediction, target_names=label_map), "\n")


def clf_performance(X_train, y_train, X_test, y_test, pipeline, label_map):
    print("#### Train Set #### \n")
    confusion_matrix_and_report(X_train, y_train, pipeline, label_map)

    print("#### Test Set ####\n")
    confusion_matrix_and_report(X_test, y_test, pipeline, label_map)

Evaluating against metrics defined in ML business case
* 70% Recall for Will-Exit on train and test set
* 70% Precision for No-Exit on train and test set 
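
As a reminder of what these two metrics measure, the sketch below computes them from confusion matrix counts. The counts are invented for illustration and are not results from this notebook; the variable names are hypothetical.

```python
# Invented counts, with Will-Exit as the positive class.
tn = 1300  # actual No-Exit, predicted No-Exit
fp = 300   # actual No-Exit, predicted Will-Exit
fn = 110   # actual Will-Exit, predicted No-Exit
tp = 290   # actual Will-Exit, predicted Will-Exit

# Recall for Will-Exit: share of actual exits that the model catches.
recall_will_exit = tp / (tp + fn)

# Precision for No-Exit: share of No-Exit predictions that are correct.
precision_no_exit = tn / (tn + fn)

threshold = 0.70
meets_case = recall_will_exit >= threshold and precision_no_exit >= threshold

print(f"Recall (Will-Exit):  {recall_will_exit:.3f}")
print(f"Precision (No-Exit): {precision_no_exit:.3f}")
print(f"Meets business case: {meets_case}")
```

Both metrics depend on the false negatives (exiting customers predicted as staying), which is why the business case focuses on them.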

In [None]:
clf_performance(X_train=X_train, y_train=y_train,
                X_test=X_test, y_test=y_test,
                pipeline=pipeline_clf,
                label_map= ['No-Exit', 'Will-Exit'] 
                )

* The model meets the performance requirements set in the ML business case.

---

# **Refit Pipeline with Best Features** 

1. Best features identified

In [None]:
best_features

2. Creating a new pipeline for Data Cleaning and Feature Engineering. We do not use the Ordinal Encoder, as the features it applied to (Gender and Geography) have been dropped.

In [None]:
from feature_engine import transformation as vt
def PipelineDataCleanAndFeatEngRefit():
    pipeline_base = Pipeline([
        ("log", vt.LogTransformer(variables=['Age']) )                                              
    ])

    return pipeline_base


PipelineDataCleanAndFeatEngRefit()

3. Modifying the classification pipeline: feature selection is no longer required, as we already know the best features.

In [None]:
def PipelineClf(model):
    pipeline_base = Pipeline([
        ("scaler", StandardScaler()),
        ("model", model),
    ])

    return pipeline_base

4. Splitting Train and Test Set

In [None]:
from sklearn.model_selection import train_test_split
X_train_refit, X_test_refit, y_train_refit, y_test_refit = train_test_split(
    df.drop(['Exited'], axis=1),
    df['Exited'],
    test_size=0.2,
    random_state=0,
)

print(X_train_refit.shape, y_train_refit.shape, X_test_refit.shape, y_test_refit.shape)

5. Filtering the most important variables

In [None]:
X_train_refit = X_train_refit.filter(best_features)
X_test_refit = X_test_refit.filter(best_features)

print(X_train_refit.shape, y_train_refit.shape, X_test_refit.shape, y_test_refit.shape)


### Managing Target Imbalance

1. Fitting the Data Cleaning and Feature Engineering pipeline on the train set and applying it to both the train and test sets.

In [None]:
pipeline_preprocessed_refit = PipelineDataCleanAndFeatEngRefit()
X_train_refit = pipeline_preprocessed_refit.fit_transform(X_train_refit)
X_test_refit = pipeline_preprocessed_refit.transform(X_test_refit)
print(X_train_refit.shape, y_train_refit.shape, X_test_refit.shape, y_test_refit.shape)


2. Checking target distribution in train set

In [None]:
import matplotlib.pyplot as plt
y_train_refit.value_counts().plot(kind='bar', title='Train Set Target Distribution')
plt.show()

3. Using SMOTE (Synthetic Minority Oversampling Technique) to balance Train Set target

In [None]:
from imblearn.over_sampling import SMOTE
oversample = SMOTE(sampling_strategy='minority', random_state=0)
X_train_refit , y_train_refit = oversample.fit_resample(X_train_refit, y_train_refit)
print(X_train_refit.shape, y_train_refit.shape, X_test_refit.shape, y_test_refit.shape)


4. Checking Train Set Target distribution after resampling

In [None]:
y_train_refit.value_counts().plot(kind='bar',title='Train Set Target Distribution')
plt.show()

---

# Grid Search CV - Sklearn

1. Using the best model and its best hyperparameter configuration from last GridCV search.

Best Model

In [None]:
models_search = {
   "AdaBoostClassifier":AdaBoostClassifier(random_state=0),
}

Best Parameters

In [None]:
best_parameters

2. Preparing best parameter list.

In [None]:
params_search = {'AdaBoostClassifier':  {
 'model__algorithm': ['SAMME'],
 'model__learning_rate': [0.5],
 'model__n_estimators': [10]
 },
}
params_search

3. Using make_scorer with recall_score to run a quick GridSearchCV for the binary classifier.

In [None]:
from sklearn.metrics import recall_score, make_scorer
best_alg_refit = HyperparameterOptimizationSearch(models=models_search, params=params_search)
best_alg_refit.fit(X_train_refit, y_train_refit,
                 scoring=make_scorer(recall_score, pos_label=1),
                 n_jobs=-1, cv=5)

4. Checking Score Summary

In [None]:
grid_search_summary, grid_search_pipelines = best_alg_refit.score_summary(sort_by='mean_score')
grid_search_summary 

5. Choosing best model

In [None]:
best_model_refit = grid_search_summary.iloc[0, 0]
pipeline_clf_refit = grid_search_pipelines[best_model_refit].best_estimator_
pipeline_clf_refit

---

# Assess Feature Importance

In [None]:
best_features = X_train_refit.columns

# create DataFrame to display feature importance
df_feature_importance_refit = (pd.DataFrame(data={
    'Feature': best_features,
    'Importance': pipeline_clf_refit['model'].feature_importances_})
    .sort_values(by='Importance', ascending=False)
)


# Most important features statement and plot
print(f"* These are the {len(best_features)} most important features in descending order. "
      f"The model was trained on them: \n{df_feature_importance_refit['Feature'].to_list()}")

df_feature_importance_refit.plot(kind='bar', x='Feature', y='Importance')
plt.show()


---

# Evaluate Pipeline on Train and Test Sets

In [None]:
from sklearn.metrics import classification_report, confusion_matrix


def confusion_matrix_and_report(X, y, pipeline, label_map):

    prediction = pipeline.predict(X)

    print('---  Confusion Matrix  ---')
    print(pd.DataFrame(confusion_matrix(y_true=prediction, y_pred=y),
          columns=[["Actual " + sub for sub in label_map]],
          index=[["Prediction " + sub for sub in label_map]]
          ))
    print("\n")

    print('---  Classification Report  ---')
    print(classification_report(y, prediction, target_names=label_map), "\n")


def clf_performance_refit(X_train_refit, y_train_refit, X_test_refit, y_test_refit, pipeline, label_map):
    print("#### Train Set #### \n")
    confusion_matrix_and_report(X_train_refit, y_train_refit, pipeline, label_map)

    print("#### Test Set ####\n")
    confusion_matrix_and_report(X_test_refit, y_test_refit, pipeline, label_map)


Evaluating against metrics defined in ML business case

* 70% Recall for Will-Exit on train and test set
* 70% Precision for No-Exit on train and test set

In [None]:
clf_performance_refit(X_train_refit=X_train_refit, y_train_refit=y_train_refit,
                X_test_refit=X_test_refit, y_test_refit=y_test_refit,
                pipeline=pipeline_clf_refit,
                label_map= ['No-Exit', 'Will-Exit'] 
                )

### We are getting low recall (69%) for test data.

---

# Choosing Model

#### We will proceed with the initially tuned model with all features, as we are getting **low performance ( Recall value : 69% ) on test set** with the refitted model (using best features). This is to meet the ML business case requirement of:

* 70% Recall for Will-Exit on train and test set
* 70% Precision for No-Exit on train and test set


---

# Save Files To Repository

####  We will use/save the datasets and pipelines from initially tuned model before refitting.

## 1. Create file path and version

In [None]:
import joblib
import os

version = 'v1'
file_path = f'outputs/ml_pipeline/predict_exit/{version}'

try:
    os.makedirs(name=file_path)
except Exception as e:
    print(e)
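
An alternative to catching the exception is the exist_ok flag of os.makedirs, which makes the call safe to repeat when the notebook is re-run. The sketch below demonstrates it inside a temporary directory so that it does not touch the repository; the variable names are illustrative.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as base:
    # Same folder layout as above, but rooted in a throwaway directory.
    target = os.path.join(base, "outputs", "ml_pipeline", "predict_exit", "v1")

    os.makedirs(target, exist_ok=True)
    # A second call does nothing and raises no error.
    os.makedirs(target, exist_ok=True)

    created = os.path.isdir(target)

print(f"Folder created: {created}")
```

Unlike a broad try/except, this only ignores the case where the folder already exists; other problems, such as missing permissions, still raise an error.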

## 2. Saving Datasets

* We will be using datasets with all features used before refitting.

### Train Dataset

In [None]:
X_train

In [None]:
X_train.to_csv(f"{file_path}/X_train.csv", index=False)

In [None]:
y_train

In [None]:

y_train.to_csv(f"{file_path}/y_train.csv", index=False)

### Test Dataset

In [None]:
X_test

In [None]:
X_test.to_csv(f"{file_path}/X_test.csv", index=False)

In [None]:
y_test

In [None]:

y_test.to_csv(f"{file_path}/y_test.csv", index=False)

## 3. Saving ML Pipelines

* Both pipelines are taken from the initial model, before refitting.

### Data Cleaning and Feature Engineering Pipeline

In [None]:

pipeline_preprocessed

In [None]:

joblib.dump(value=pipeline_preprocessed ,
            filename=f"{file_path}/clf_pipeline_preprocessed.pkl")

### Feature Scaling and Modeling Pipeline

In [None]:
pipeline_clf

In [None]:
joblib.dump(value=pipeline_clf ,
            filename=f"{file_path}/clf_pipeline_model.pkl")
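
A saved pipeline can be read back with joblib.load and used for prediction. The sketch below shows the round trip on a small stand-in pipeline written to a temporary directory; the data, the LogisticRegression model and the variable names are assumptions for illustration and are not the notebook's pipeline.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Invented data: the label depends on the sign of the first feature.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(40, 3))
y_demo = (X_demo[:, 0] > 0).astype(int)

demo_pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("model", LogisticRegression()),
]).fit(X_demo, y_demo)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "clf_pipeline_model.pkl")
    joblib.dump(value=demo_pipeline, filename=path)
    reloaded = joblib.load(path)

# The reloaded pipeline should reproduce the original predictions.
same_predictions = bool(
    (reloaded.predict(X_demo) == demo_pipeline.predict(X_demo)).all()
)
print(f"Predictions identical after reload: {same_predictions}")
```

When loading the real files, new data would first go through the saved data cleaning and feature engineering pipeline and then through the modeling pipeline, in that order.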

## 4. Saving Plots

### Feature Importance plot

* We display the important features identified. However, we did not use them alone to train the selected model, because doing so gave a low recall value on the test data. All features were used to train the selected model.

In [None]:
df_feature_importance.plot(kind='bar',x='Features',y='Importance')
plt.show()

In [None]:

df_feature_importance.plot(kind='bar', x='Features', y='Importance')
plt.savefig(f'{file_path}/features_importance.png', bbox_inches='tight')