# **Bank Customer Exit Predictor (CI PP-5)** 

# **ML Modeling : Regression**

## Objectives

* To fit and evaluate a regression-based model that predicts when a customer will exit.

## Inputs

* outputs/datasets/collection/BankCustomerData.csv
* Data cleaning and Feature Engineering steps and conclusions based on respective notebooks.

## Outputs

* Train and Test set (Features and Target)
* Data cleaning and Feature Engineering pipeline
* Modeling pipeline to predict tenure
* Feature Importance Plot

---

# Change working directory

* Notebooks are stored in a subfolder, so when running the notebook in the editor we need to change the working directory from its current folder to the parent folder.

1. We access the current directory with os.getcwd()

In [None]:
import os
current_dir = os.getcwd()
current_dir

2. We want to make the parent of the current directory the new current directory
* os.path.dirname() gets the parent directory
* os.chdir() defines the new current directory

In [None]:
os.chdir(os.path.dirname(current_dir))
print("You have set a new current directory")

3. Confirm the new current directory

In [None]:
current_dir = os.getcwd()
current_dir

---

# Load Data

* Loading the dataset from the outputs folder. We exclude the variables CustomerId, Surname and RowNumber, as they are just identifiers and don't impact the exit study. We also remove non-exited customers, as we only need to analyse exited cases (Exited == 1).

* We drop rows with missing data, which occurs in ['Age', 'Geography', 'HasCrCard', 'IsActiveMember'], as the missing data level is not significant.

In [None]:
import numpy as np
import pandas as pd
df = (pd.read_csv("outputs/datasets/collection/BankCustomerData.csv")
      .query("Exited == 1")
      .drop(labels=['CustomerId', 'Surname', 'RowNumber','Exited'], axis=1) 
  )
df.dropna(inplace=True) 
print(df.shape)
df.head(3)
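The statement that the missing data level is not significant can be checked before dropping rows. The sketch below uses a small made-up DataFrame (the values and the `toy` name are assumptions, not the bank dataset) to show one way to measure the share of missing values per column and the number of rows that `dropna()` would remove.

```python
import numpy as np
import pandas as pd

# Toy data standing in for the bank dataset; all values are made up.
toy = pd.DataFrame({
    "Age": [42, np.nan, 39, 51, 28, 33, 45, 60, 37, 29],
    "Geography": ["France", "Spain", None, "Germany", "France",
                  "Spain", "France", "Germany", "Spain", "France"],
    "Balance": [0.0, 83807.9, 159660.8, 0.0, 125510.8,
                113755.8, 0.0, 115046.7, 142051.1, 134603.9],
})

# Share of missing values per column, in percent
missing_pct = toy.isnull().mean().mul(100).round(1)

# Rows that survive dropna() compared with the original row count
rows_before = len(toy)
rows_after = len(toy.dropna())

print(missing_pct)
print(f"Rows kept: {rows_after} of {rows_before}")
```

Running the same two checks on the real DataFrame before calling `dropna()` would show how many exited customers are lost.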

---

# ML Pipeline: Regression

### 1. ML Pipeline 

* We prepare the pipeline based on the Data cleaning and Feature Engineering notebooks.
* We don't require any data cleaning steps.

In [None]:
from sklearn.pipeline import Pipeline

# Feature Engineering : Ordinal Encoder and Numerical Transformer
from feature_engine.encoding import OrdinalEncoder
from feature_engine import transformation as vt

# Feat Scaling : Standard Scaler
from sklearn.preprocessing import StandardScaler

# Feat Selection : Select From Model
from sklearn.feature_selection import SelectFromModel

# All Regression ML algorithms
from sklearn.tree import DecisionTreeRegressor
from xgboost import XGBRegressor
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import AdaBoostRegressor
from sklearn.ensemble import ExtraTreesRegressor

# Data cleaning, Feature Engineering , Feature Scaling, Feature Selection and Modeling Pipeline
def PipelineOpt(model):
    pipeline_base = Pipeline([

        ("OrdinalCategoricalEncoder", OrdinalEncoder(encoding_method='arbitrary',
                                                     variables=['Gender', 'Geography'])),
        ("log", vt.LogTransformer(variables=['Age']) ),

        ("feat_scaling", StandardScaler()),

        ("feat_selection",  SelectFromModel(model)),

        ("model", model),

    ])

    return pipeline_base

### 2. Custom Class for Hyperparameter Optimisation

* For hyperparameter optimisation we use a custom class from Code Institute's Scikit lesson

In [None]:
from sklearn.model_selection import GridSearchCV


class HyperparameterOptimizationSearch:

    def __init__(self, models, params):
        self.models = models
        self.params = params
        self.keys = models.keys()
        self.grid_searches = {}

    def fit(self, X, y, cv, n_jobs, verbose=1, scoring=None, refit=False):
        for key in self.keys:
            print(f"\nRunning GridSearchCV for {key} \n")
            model = PipelineOpt(self.models[key])

            params = self.params[key]
            gs = GridSearchCV(model, params, cv=cv, n_jobs=n_jobs,
                              verbose=verbose, scoring=scoring)
            gs.fit(X, y)
            self.grid_searches[key] = gs

    def score_summary(self, sort_by='mean_score'):
        def row(key, scores, params):
            d = {
                'estimator': key,
                'min_score': min(scores),
                'max_score': max(scores),
                'mean_score': np.mean(scores),
                'std_score': np.std(scores),
            }
            return pd.Series({**params, **d})

        rows = []
        for k in self.grid_searches:
            params = self.grid_searches[k].cv_results_['params']
            scores = []
            for i in range(self.grid_searches[k].cv):
                key = "split{}_test_score".format(i)
                r = self.grid_searches[k].cv_results_[key]
                scores.append(r.reshape(len(params), 1))

            all_scores = np.hstack(scores)
            for p, s in zip(params, all_scores):
                rows.append((row(k, s, p)))

        df = pd.concat(rows, axis=1).T.sort_values([sort_by], ascending=False)

        columns = ['estimator', 'min_score',
                   'mean_score', 'max_score', 'std_score']
        columns = columns + [c for c in df.columns if c not in columns]

        return df[columns], self.grid_searches

---

# Split Train and Test Set

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    df.drop(['Tenure'], axis=1),
    df['Tenure'],
    test_size=0.2,
    random_state=0
)

print("* Train set:", X_train.shape, y_train.shape,
      "\n* Test set:",  X_test.shape, y_test.shape)

---

# Grid Search CV - Sklearn

1. Using Algorithms with standard hyperparameters to identify most suitable algorithm

In [None]:
models_default = {
    'LinearRegression': LinearRegression(),
    "DecisionTreeRegressor": DecisionTreeRegressor(random_state=0),
    "RandomForestRegressor": RandomForestRegressor(random_state=0),
    "ExtraTreesRegressor": ExtraTreesRegressor(random_state=0),
    "AdaBoostRegressor": AdaBoostRegressor(random_state=0),
    "GradientBoostingRegressor": GradientBoostingRegressor(random_state=0),
    "XGBRegressor": XGBRegressor(random_state=0),
}

params_default = {
    'LinearRegression': {},
    "DecisionTreeRegressor": {},
    "RandomForestRegressor": {},
    "ExtraTreesRegressor": {},
    "AdaBoostRegressor": {},
    "GradientBoostingRegressor": {},
    "XGBRegressor": {},
}

2. Hyperparameter optimisation search using default parameters

In [None]:
best_alg = HyperparameterOptimizationSearch(models=models_default, params=params_default)
best_alg.fit(X_train, y_train, scoring='r2', n_jobs=-1, cv=5)

3. Checking Score Summary

In [None]:
grid_search_summary, grid_search_pipelines = best_alg.score_summary(sort_by='mean_score')
grid_search_summary

## **As we are getting negative R2 scores, we won't proceed with the regression model and will switch to a classification model.**
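For context, R2 is defined as 1 - SS_res / SS_tot, so it is 0 for a model that always predicts the mean of the target and negative for a model that does worse than that baseline. The sketch below illustrates this with made-up numbers; the helper `r2` and the toy arrays are assumptions for illustration, and scikit-learn's `r2` scoring follows the same definition.

```python
import numpy as np


def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)         # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
    return 1 - ss_res / ss_tot


y_true = [1, 2, 3, 4, 5]
baseline = [3, 3, 3, 3, 3]   # always predict the mean
worse = [5, 4, 3, 2, 1]      # predictions in the wrong direction

r2_baseline = r2(y_true, baseline)
r2_worse = r2(y_true, worse)

print(r2_baseline)  # 0.0
print(r2_worse)     # -3.0
```

A negative cross-validated score therefore means the pipeline predicts tenure less accurately than a constant guess at the average tenure.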

---

---

# **ML Modeling : Classification**

## Target Discretisation

1. Checking unique values in target variable (Tenure).

In [None]:
y_train.unique()


2. Converting the target variable into categories by discretisation. We divide the target into groups so that the first group represents customer exits within up to 1 year, which is our primary focus.

In [None]:
# Equal Frequency Discretiser
from feature_engine.discretisation import EqualFrequencyDiscretiser
import seaborn as sns
import matplotlib.pyplot as plt
disc = EqualFrequencyDiscretiser(q=10, variables=['Tenure']) 
df_clf = disc.fit_transform(df)

print(f"* The classes represent the following ranges: \n{disc.binner_dict_} \n")
sns.countplot(data=df_clf, x='Tenure')
plt.show()
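To make the idea of equal-frequency binning concrete, here is a minimal sketch on made-up tenure values. It uses pandas `qcut` as a stand-in rather than the `EqualFrequencyDiscretiser` used above, and the data and the choice of 5 bins are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Hypothetical tenure values: 0 to 9 years, three customers each
tenure = pd.Series(np.repeat(np.arange(10), 3), name="Tenure")

# Equal-frequency binning into 5 classes; labels=False returns class ids 0..4
classes, edges = pd.qcut(tenure, q=5, labels=False, retbins=True)

# Each class holds the same number of customers
counts = classes.value_counts().sort_index()

print(edges)
print(counts)
```

The bin edges are picked so that each class receives roughly the same number of rows, which is why the class ids no longer map to evenly spaced tenure values when the real distribution is uneven.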

---

# New ML Pipeline for Modeling

In [None]:
# All Classification ML algorithms

from sklearn.tree import DecisionTreeClassifier 
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier 
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.ensemble import AdaBoostClassifier
from xgboost import XGBClassifier

# Data cleaning, Feature Engineering , Feature Scaling, Feature Selection and Modeling Pipeline

def PipelineOpt(model):
    pipeline_base = Pipeline([

        ("OrdinalCategoricalEncoder", OrdinalEncoder(encoding_method='arbitrary',
                                                     variables=['Gender', 'Geography'])),
        ("log", vt.LogTransformer(variables=['Age']) ),

        ("feat_scaling", StandardScaler()),

        ("feat_selection",  SelectFromModel(model)),

        ("model", model),

    ])

    return pipeline_base

---

# Split Train and Test Set

In [None]:
X_train, X_test, y_train, y_test = train_test_split(
    df_clf.drop(['Tenure'], axis=1),
    df_clf['Tenure'],
    test_size=0.2,
    random_state=0
)

print("* Train set:", X_train.shape, y_train.shape,
      "\n* Test set:",  X_test.shape, y_test.shape)

---

# Grid Search CV - Sklearn

1. Using Algorithms with standard hyperparameters to identify most suitable algorithm

In [None]:
models_quick_search = {
    "XGBClassifier": XGBClassifier(random_state=0),
    "DecisionTreeClassifier": DecisionTreeClassifier(random_state=0),
    "RandomForestClassifier": RandomForestClassifier(random_state=0),
    "GradientBoostingClassifier": GradientBoostingClassifier(random_state=0),
    "ExtraTreesClassifier": ExtraTreesClassifier(random_state=0),
    "AdaBoostClassifier": AdaBoostClassifier(random_state=0),
}

params_quick_search = {
    "XGBClassifier":{},
    "DecisionTreeClassifier":{},
    "RandomForestClassifier":{},
    "GradientBoostingClassifier":{},
    "ExtraTreesClassifier":{},
    "AdaBoostClassifier":{},
}

2. Using make scorer and recall score for Quick GridSearch CV for Classification

In [None]:
from sklearn.metrics import make_scorer, recall_score
quick_search = HyperparameterOptimizationSearch(models=models_quick_search, params=params_quick_search)
quick_search.fit(X_train, y_train,
                 scoring = make_scorer(recall_score, labels=[0], average=None),
                 n_jobs=-1,
                 cv=5)
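The scorer above restricts recall to class 0, the first tenure group. A minimal sketch with made-up labels shows what `recall_score` returns with and without the `labels=[0]` restriction; the toy `y_true` and `y_pred` lists are assumptions for illustration.

```python
from sklearn.metrics import recall_score

# Made-up labels for three classes
y_true = [0, 0, 0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 0, 1, 1, 0, 2, 2]

# Recall for every class: one value per class
per_class = recall_score(y_true, y_pred, average=None)

# Recall for class 0 only: a one-element array
class0 = recall_score(y_true, y_pred, labels=[0], average=None)

print(per_class)
print(class0)
```

Three of the four true class-0 samples are predicted as class 0, so the class-0 recall is 0.75, and misclassifications among the other classes do not affect it.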

3. Checking Score Summary

In [None]:
grid_search_summary, grid_search_pipelines = quick_search.score_summary(sort_by='mean_score')
grid_search_summary

### Identifying best hyperparameter configuration for best ML Algorithm

1. Defining model and parameter for further analysis

In [None]:
models_search = {
    "AdaBoostClassifier": AdaBoostClassifier(random_state=0),
}

params_search = {
    "AdaBoostClassifier": {
        'model__n_estimators': [50, 100, 300],
        'model__learning_rate': [1e-1, 1e-2, 1e-3],
    }
}

2. Using make scorer and recall score for the extensive GridSearch CV

In [None]:
from sklearn.metrics import make_scorer,  recall_score
search = HyperparameterOptimizationSearch(
    models=models_search, params=params_search)
search.fit(X_train, y_train,
           scoring=make_scorer(recall_score, labels=[0], average=None),
           n_jobs=-1, cv=5)

3. Checking Score Summary

In [None]:
grid_search_summary, grid_search_pipelines = search.score_summary(sort_by='mean_score')
grid_search_summary

4. Identifying best model

In [None]:
best_model = grid_search_summary.iloc[0,0]
best_model

5. Identifying best parameters

In [None]:
best_parameters = grid_search_pipelines[best_model].best_params_
best_parameters

6. Defining the best classification pipeline based on the extensive search

In [None]:
pipeline_clf = grid_search_pipelines[best_model].best_estimator_
pipeline_clf

---

# Assess Feature Importance

In [None]:
# create DataFrame to display feature importance
df_feature_importance = (pd.DataFrame(data={
    'Feature': X_train.columns[pipeline_clf['feat_selection'].get_support()],
    'Importance': pipeline_clf['model'].feature_importances_})
    .sort_values(by='Importance', ascending=False)
)

# reassign best features in order
best_features = df_feature_importance['Feature'].to_list()

# Most important features statement and plot
print(f"* These are the {len(best_features)} most important features in descending order. "
      f"The model was trained on them: \n{best_features}")

df_feature_importance.plot(kind='bar', x='Feature', y='Importance')
plt.show()
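The feature names above are recovered by applying the boolean mask from `get_support()` to the column index. The following sketch shows that mapping on synthetic data; the column names `signal`, `noise_a` and `noise_b` are hypothetical, and the target is built so that only `signal` is informative.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import AdaBoostClassifier
from sklearn.feature_selection import SelectFromModel

# Synthetic data: the target depends on the 'signal' column only
rng = np.random.default_rng(0)
X = pd.DataFrame({
    "signal": rng.normal(size=200),
    "noise_a": rng.normal(size=200),
    "noise_b": rng.normal(size=200),
})
y = (X["signal"] > 0).astype(int)

# Fit the selector; by default it keeps features whose importance
# is at least the mean importance
selector = SelectFromModel(AdaBoostClassifier(random_state=0)).fit(X, y)

# Boolean mask aligned with X.columns, True for the kept features
mask = selector.get_support()
selected = X.columns[mask].to_list()

print(mask)
print(selected)
```

Because the mask is aligned with the original column order, the same indexing works after the encoding, log and scaling steps as long as those steps keep the columns in place.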

---

# Evaluate Pipeline on Train and Test Sets

Evaluating pipeline performance using classification report and confusion matrix.

In [None]:
from sklearn.metrics import classification_report, confusion_matrix


def confusion_matrix_and_report(X, y, pipeline, label_map):

    prediction = pipeline.predict(X)

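    print('---  Confusion Matrix  ---')
    # Passing the prediction as y_true puts predictions on the rows and
    # actual values on the columns, matching the labels set below.
    print(pd.DataFrame(confusion_matrix(y_true=prediction, y_pred=y),
          columns=[["Actual " + sub for sub in label_map]],
          index=[["Prediction " + sub for sub in label_map]]
          ))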
    print("\n")

    print('---  Classification Report  ---')
    print(classification_report(y, prediction, target_names=label_map), "\n")


def clf_performance(X_train, y_train, X_test, y_test, pipeline, label_map):
    print("#### Train Set #### \n")
    confusion_matrix_and_report(X_train, y_train, pipeline, label_map)

    print("#### Test Set ####\n")
    confusion_matrix_and_report(X_test, y_test, pipeline, label_map)

Creating label map for performance metrics

In [None]:
label_map = ['Upto 1','1-2','2-3','3-4','4-5','5-6','6-7','7-8','8-9','9+']
label_map

Evaluating against new metrics defined after discussing with stakeholders
* 70% Recall for 'Upto 1 Yrs'

In [None]:
clf_performance(X_train=X_train, y_train=y_train,
                        X_test=X_test, y_test=y_test,
                        pipeline=pipeline_clf,
                        label_map= label_map )

* The model meets the performance requirement, with a Recall of 100% on the train set and 98% on the test set.

---

---

# Choosing Pipeline

We tested 2 pipelines: **Regression** and **Classification**

* **Regression** resulted in **negative R2 scores**.
* **Classification** resulted in reasonable performance in **predicting Upto 1 Yrs customer exit**, which is the primary focus of the bank.

Based on the above we decide to **proceed with classification model** as it serves the business requirement.

* As we require all the features for predicting exit, we won't refit this pipeline with only the most important features, as it would serve no purpose and would further decrease pipeline performance.

---

# Save Files To Repository

## 1. Create file path and version

In [None]:
import joblib
import os

version = 'v1'
file_path = f'outputs/ml_pipeline/predict_tenure/{version}'

try:
    os.makedirs(name=file_path)
except Exception as e:
    print(e)
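As an alternative to catching the exception, `os.makedirs` accepts `exist_ok=True`, which makes repeated runs of the cell silent when the folder already exists. The sketch below runs inside a temporary directory (an assumption made so the example leaves no files behind); the nested folder names mirror the path used above.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as base_dir:
    target = os.path.join(base_dir, "outputs", "ml_pipeline",
                          "predict_tenure", "v1")

    # Creates all missing parent folders
    os.makedirs(target, exist_ok=True)

    # Calling it again does not raise, because the folder already exists
    os.makedirs(target, exist_ok=True)

    created = os.path.isdir(target)

print(created)
```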

## 2. Saving Datasets

### Train Dataset

In [None]:
X_train

In [None]:
X_train.to_csv(f"{file_path}/X_train.csv", index=False)

In [None]:
y_train

In [None]:
y_train.to_csv(f"{file_path}/y_train.csv", index=False)

### Test Dataset

In [None]:
X_test

In [None]:
X_test.to_csv(f"{file_path}/X_test.csv", index=False)

In [None]:
y_test

In [None]:
y_test.to_csv(f"{file_path}/y_test.csv", index=False)

## 3. Saving ML Pipelines

In [None]:
pipeline_clf

In [None]:
joblib.dump(value=pipeline_clf, filename=f"{file_path}/clf_pipeline.pkl")

## 4. Saving Label Maps

In [None]:
label_map

In [None]:
joblib.dump(value=label_map, filename=f"{file_path}/label_map.pkl")
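Files written with `joblib.dump` can be read back with `joblib.load`, which is how a downstream app or notebook would typically restore the pipeline and label map. The round trip below is a minimal sketch that uses a temporary directory and a shortened, illustrative label list rather than the saved project files.

```python
import os
import tempfile

import joblib

# Shortened label list, for illustration only
label_map_example = ['Upto 1', '1-2', '2-3']

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "label_map.pkl")

    # Save, then load the object back from disk
    joblib.dump(value=label_map_example, filename=path)
    restored = joblib.load(path)

print(restored == label_map_example)
```

The same `joblib.load` call applies to the saved pipeline file, provided the loading environment has compatible versions of the libraries used to build it.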

## 5. Saving Plots

### Feature Importance plot

In [None]:
df_feature_importance.plot(kind='bar', x='Feature', y='Importance')
plt.show()

In [None]:
df_feature_importance.plot(kind='bar',x='Feature',y='Importance')
plt.savefig(f'{file_path}/features_importance.png', bbox_inches='tight')