# Capstone Project 1 In-depth Analysis
## Federico Di Martino

## Problem Statement 
The churn rate is a key metric for almost any business, one that should ideally be minimised. In this specific case, the business is a bank, and the hypothetical client for my work is that same bank. I will build a classifier to predict whether an individual customer of the bank will churn. For clarity, I will always use "client" to refer to the bank and "customer" to refer to an individual customer of the bank.

This part of the project constructs machine learning models to predict customer churn.



### Preliminary setup including steps detailed in previous parts.

In [1]:
## Import libraries
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns

## Import data
churn_data = pd.read_csv("Churn_Modelling.csv", index_col = 0)

## Print head to show structure of data
print(churn_data.head())


## To work, all values need to be numeric
# churn_data.fillna(value=np.nan, inplace=True)
## reshape data so that geography column becomes three binary columns
heatmap_data = churn_data.copy()  # copy, so that churn_data keeps its original columns
heatmap_data['IsFrance'] = 0
heatmap_data['IsSpain'] = 0
heatmap_data['IsGermany'] = 0

heatmap_data.loc[heatmap_data['Geography'] == 'France','IsFrance'] = 1
heatmap_data.loc[heatmap_data['Geography'] == 'Spain','IsSpain'] = 1
heatmap_data.loc[heatmap_data['Geography'] == 'Germany','IsGermany'] = 1

heatmap_data['IsFrance'] = pd.to_numeric(heatmap_data['IsFrance'])
heatmap_data['IsSpain'] = pd.to_numeric(heatmap_data['IsSpain'])
heatmap_data['IsGermany'] = pd.to_numeric(heatmap_data['IsGermany'])

## Change gender column such that female -> 1, male -> 0
heatmap_data.loc[heatmap_data['Gender'] == 'Female','Gender'] = 1
heatmap_data.loc[heatmap_data['Gender'] == 'Male','Gender'] = 0
heatmap_data["Gender"] = pd.to_numeric(heatmap_data["Gender"])
#print(churn_data.head())


# Drop columns that will not be used
heatmap_data = heatmap_data.drop(['CustomerId', 'Surname', 'Geography'], axis = 'columns')

#print(heatmap_data.head())

#sns.heatmap(heatmap_data)
#plt.show()


# Calculate correlations
corr = heatmap_data.corr()

# Visualise correlation matrix
# Styler.set_precision was removed in pandas 2.0; format(precision=...) needs pandas >= 1.3
corr.style.background_gradient(cmap='coolwarm', axis = None).format(precision=2)

#corr.style.background_gradient(cmap='coolwarm', axis=None)

           CustomerId   Surname  CreditScore Geography  Gender  Age  Tenure  \
RowNumber                                                                     
1            15634602  Hargrave          619    France  Female   42       2   
2            15647311      Hill          608     Spain  Female   41       1   
3            15619304      Onio          502    France  Female   42       8   
4            15701354      Boni          699    France  Female   39       1   
5            15737888  Mitchell          850     Spain  Female   43       2   

             Balance  NumOfProducts  HasCrCard  IsActiveMember  \
RowNumber                                                        
1               0.00              1          1               1   
2           83807.86              1          0               1   
3          159660.80              3          1               0   
4               0.00              2          0               0   
5          125510.82              1          1    

| | CreditScore | Gender | Age | Tenure | Balance | NumOfProducts | HasCrCard | IsActiveMember | EstimatedSalary | Exited | IsFrance | IsSpain | IsGermany |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| CreditScore | 1.0 | 0.0029 | -0.004 | 0.00084 | 0.0063 | 0.012 | -0.0055 | 0.026 | -0.0014 | -0.027 | -0.0089 | 0.0048 | 0.0055 |
| Gender | 0.0029 | 1.0 | 0.028 | -0.015 | -0.012 | 0.022 | -0.0058 | -0.023 | 0.0081 | 0.11 | -0.0068 | -0.017 | 0.025 |
| Age | -0.004 | 0.028 | 1.0 | -0.01 | 0.028 | -0.031 | -0.012 | 0.085 | -0.0072 | 0.29 | -0.039 | -0.0017 | 0.047 |
| Tenure | 0.00084 | -0.015 | -0.01 | 1.0 | -0.012 | 0.013 | 0.023 | -0.028 | 0.0078 | -0.014 | -0.0028 | 0.0039 | -0.00057 |
| Balance | 0.0063 | -0.012 | 0.028 | -0.012 | 1.0 | -0.3 | -0.015 | -0.01 | 0.013 | 0.12 | -0.23 | -0.13 | 0.4 |
| NumOfProducts | 0.012 | 0.022 | -0.031 | 0.013 | -0.3 | 1.0 | 0.0032 | 0.0096 | 0.014 | -0.048 | 0.0012 | 0.009 | -0.01 |
| HasCrCard | -0.0055 | -0.0058 | -0.012 | 0.023 | -0.015 | 0.0032 | 1.0 | -0.012 | -0.0099 | -0.0071 | 0.0025 | -0.013 | 0.011 |
| IsActiveMember | 0.026 | -0.023 | 0.085 | -0.028 | -0.01 | 0.0096 | -0.012 | 1.0 | -0.011 | -0.16 | 0.0033 | 0.017 | -0.02 |
| EstimatedSalary | -0.0014 | 0.0081 | -0.0072 | 0.0078 | 0.013 | 0.014 | -0.0099 | -0.011 | 1.0 | 0.012 | -0.0033 | -0.0065 | 0.01 |
| Exited | -0.027 | 0.11 | 0.29 | -0.014 | 0.12 | -0.048 | -0.0071 | -0.16 | 0.012 | 1.0 | -0.1 | -0.053 | 0.17 |
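
The manual geography and gender encoding in the setup cell can also be written with `pd.get_dummies`. The sketch below is an alternative, not the code that produced the outputs above. The three-row `toy` frame is made up for illustration, and the `Is` prefix is chosen only to mirror the column names used in the setup cell.

```python
import pandas as pd

# Toy frame standing in for churn_data (values are illustrative only)
toy = pd.DataFrame({
    "Geography": ["France", "Spain", "Germany"],
    "Gender": ["Female", "Male", "Female"],
})

# One binary column per country: IsFrance, IsGermany, IsSpain
geo = pd.get_dummies(toy["Geography"], prefix="Is", prefix_sep="").astype(int)

# Female -> 1, Male -> 0, as in the setup cell
gender = toy["Gender"].map({"Female": 1, "Male": 0})

encoded = pd.concat([gender, geo], axis="columns")
print(encoded)
```

Note that `get_dummies` orders the new columns alphabetically, so the column order differs from the manual version.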


### Choose which machine learning method to use

#### Preliminary setup.

In [2]:
## Import libraries for next section
import random
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import KFold
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import BaggingClassifier

import time

from sklearn.model_selection import GridSearchCV

from sklearn.model_selection import cross_validate
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report
from sklearn.metrics import accuracy_score
from sklearn.metrics import f1_score

## Set up data and target
X = heatmap_data.loc[:, heatmap_data.columns != 'Exited']
y = heatmap_data['Exited']  # select the target by name rather than by column position

random.seed( 123456789 )

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state = 123456789)
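
`train_test_split` accepts a `stratify` argument that keeps the class proportions the same in both splits. The split above does not use it. The sketch below shows the option on made-up labels (`X_toy` and `y_toy` are hypothetical, not the bank data).

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up imbalanced labels: 80 zeros and 20 ones (not the bank data)
y_toy = np.array([0] * 80 + [1] * 20)
X_toy = np.arange(100).reshape(-1, 1)

# stratify=y_toy keeps the 80/20 class ratio in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.30, random_state=0, stratify=y_toy
)

print("share of class 1 in train:", y_tr.mean())
print("share of class 1 in test:", y_te.mean())
```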

#### Testing the methods.

In [5]:
# First try a variety of methods and see which ones perform better

# Create list of methods to try
methods = [] # Generate empty list and then append name and function
methods.append(('KNN', KNeighborsClassifier()))
methods.append(('LR', LogisticRegression(solver='liblinear', multi_class='ovr')))
methods.append(('DT', DecisionTreeClassifier()))
methods.append(('SVM', SVC(gamma='auto')))
methods.append(('RF', RandomForestClassifier(n_estimators = 100)))
methods.append(('BC', BaggingClassifier()))

# Function to build and test models from list of methods assigned
def model_tester(methods):
    results = []
    names = []
    for name, method in methods:
        start = time.time()
        kf = KFold(n_splits=10, random_state=123456789, shuffle = True)
        cv_results = cross_val_score(method, X_train, y_train, cv= kf, scoring='accuracy')
        end = time.time()
        time_elapsed = end - start
        results.append(cv_results)
        
        names.append(name)
        print('%s Accuracy: Mean %f StD (%f) Time: %f' % (name, cv_results.mean(), cv_results.std(), time_elapsed))
        
## Use the function   
model_tester(methods)

KNN Accuracy: Mean 0.769857 StD (0.013053) Time: 0.499040
LR Accuracy: Mean 0.796857 StD (0.011759) Time: 0.277039
DT Accuracy: Mean 0.786286 StD (0.019306) Time: 0.331011
SVM Accuracy: Mean 0.799714 StD (0.012678) Time: 34.908538
RF Accuracy: Mean 0.861143 StD (0.013337) Time: 6.584527
BC Accuracy: Mean 0.848286 StD (0.010969) Time: 1.829168



Random Forest has the highest mean accuracy (0.861). It is not the slowest method (SVM takes about 35 seconds), although at about 6.6 seconds it is roughly an order of magnitude slower than KNN, LR and DT.
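
The comparison above scores accuracy only. `cross_validate`, which is imported above, can score several metrics in one pass, which helps when the classes are unbalanced. This is a minimal sketch on synthetic data: the `make_classification` data, the 80/20 class weights, the model settings and the metric list are all assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_validate

# Synthetic, imbalanced stand-in for X_train / y_train (about 80% class 0)
X_demo, y_demo = make_classification(
    n_samples=500, n_features=10, weights=[0.8, 0.2], random_state=0
)

metrics = ["accuracy", "precision", "recall", "f1"]
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_validate(
    RandomForestClassifier(n_estimators=50, random_state=0),
    X_demo, y_demo, cv=cv, scoring=metrics,
)

# cross_validate returns one array of per-fold scores per metric
for metric in metrics:
    values = scores["test_" + metric]
    print("%s: mean %.3f, std %.3f" % (metric, values.mean(), values.std()))
```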


### Tuning the hyper-parameters

In [6]:
## What hyperparameters are there?
RF = RandomForestClassifier(n_estimators = 100,random_state=123456789) 
print(RF.get_params())

# Will tune max_depth; n_estimators is held at 100

# Total number of trees in the random forest
# Only a single value is searched here; add more values to this list to tune it as well
n_estimators = [100]  # has to be integers

# Maximum number of levels in each tree
max_depth = [int(i) for i in np.linspace(10, 90, num = 9) ]
max_depth.append(None) ## 'None' means no arbitrary maximum




## Create hyperparameter grid
hyper_grid = {'n_estimators': n_estimators,
               'max_depth': max_depth}

## instantiate grid search
grid_search = GridSearchCV(estimator = RF, param_grid = hyper_grid, 
                          cv = 5) # 5 fold

# fit to data
grid_search.fit(X_train, y_train)

## See which hyper-parameters are best
print(grid_search.best_params_)

{'bootstrap': True, 'ccp_alpha': 0.0, 'class_weight': None, 'criterion': 'gini', 'max_depth': None, 'max_features': 'auto', 'max_leaf_nodes': None, 'max_samples': None, 'min_impurity_decrease': 0.0, 'min_impurity_split': None, 'min_samples_leaf': 1, 'min_samples_split': 2, 'min_weight_fraction_leaf': 0.0, 'n_estimators': 100, 'n_jobs': None, 'oob_score': False, 'random_state': 123456789, 'verbose': 0, 'warm_start': False}
{'max_depth': 10, 'n_estimators': 100}
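
A grid search fits one model per combination of grid values per fold, so its cost grows quickly with the size of the grid. The sketch below enumerates a grid by hand to make that count explicit. The three `n_estimators` values are hypothetical (the search above only used 100), while the `max_depth` values match the grid above.

```python
from itertools import product

# Hypothetical n_estimators values; max_depth values as in the grid above
grid = {
    "n_estimators": [100, 200, 300],
    "max_depth": [10, 20, 30, 40, 50, 60, 70, 80, 90, None],
}
n_folds = 5

# Every combination of one value per hyperparameter is one candidate
candidates = [dict(zip(grid, values)) for values in product(*grid.values())]
total_fits = len(candidates) * n_folds

print("candidates:", len(candidates))
print("cross-validation fits:", total_fits)
print("first candidate:", candidates[0])
```

With its default `refit=True`, `GridSearchCV` also fits the best candidate once more on the full training data after the search.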


Use the best hyperparameters found above to fit a model and evaluate it on the test set.

### Testing the model

In [7]:
rf = RandomForestClassifier(n_estimators = 100, max_depth=10 ,random_state=123456789) 
rf.fit(X_train, y_train)
rf_predict = rf.predict(X_test)




print ( 'Accuracy:', accuracy_score(y_test, rf_predict))
print ('F1 score:', f1_score(y_test, rf_predict))
print ('Recall:', recall_score(y_test, rf_predict))
print ('Precision:', precision_score(y_test,rf_predict))
print ('\n classification report:\n', classification_report(y_test, rf_predict))
print ('\n confusion matrix:\n',confusion_matrix(y_test, rf_predict))

Accuracy: 0.8596666666666667
F1 score: 0.5655314757481941
Recall: 0.431496062992126
Precision: 0.8203592814371258

 classification report:
               precision    recall  f1-score   support

           0       0.86      0.97      0.92      2365
           1       0.82      0.43      0.57       635

    accuracy                           0.86      3000
   macro avg       0.84      0.70      0.74      3000
weighted avg       0.86      0.86      0.84      3000


 confusion matrix:
 [[2305   60]
 [ 361  274]]
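
The summary metrics follow directly from the confusion matrix. As a worked check, using scikit-learn's layout (rows are true classes, columns are predicted classes), the four counts above reproduce the reported accuracy, precision, recall and F1 score for class 1:

```python
# Counts taken from the confusion matrix above
tn, fp = 2305, 60    # true class 0: predicted 0, predicted 1
fn, tp = 361, 274    # true class 1: predicted 0, predicted 1

accuracy = (tp + tn) / (tn + fp + fn + tp)
precision = tp / (tp + fp)   # of the predicted churners, how many churned
recall = tp / (tp + fn)      # of the actual churners, how many were caught
f1 = 2 * precision * recall / (precision + recall)

print("Accuracy: %.4f" % accuracy)
print("Precision: %.4f" % precision)
print("Recall: %.4f" % recall)
print("F1 score: %.4f" % f1)
```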


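The recall (0.43) and F1 score (0.57) for class 1 (customer exiting) are low: the model misses 361 of the 635 customers who churned in the test set. This is probably due to the data being unbalanced towards class 0, which makes up 2365 of the 3000 test rows. The weighted averages look better mainly because they are weighted by support and so are dominated by class 0. The macro averages, which weight both classes equally, are lower (recall 0.70, F1 0.74) and give a fairer picture under class imbalance.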
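
One common response to low recall on the minority class is to lower the probability cut-off at which a customer is flagged as a churner. Lowering the cut-off can only add predicted churners, so recall cannot fall, though precision usually does. The sketch below is illustrative only: it uses synthetic data from `make_classification`, the 0.3 cut-off is an arbitrary assumption, and whether the trade-off is worthwhile on the bank data is untested here.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in for the bank data (about 80% class 0)
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=10, weights=[0.8, 0.2], random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.30, random_state=0, stratify=y_demo
)

model = RandomForestClassifier(n_estimators=100, max_depth=10, random_state=0)
model.fit(X_tr, y_tr)

# Estimated probability of class 1 for each test row
proba_churn = model.predict_proba(X_te)[:, 1]

recalls = {}
for threshold in (0.5, 0.3):
    pred = (proba_churn >= threshold).astype(int)
    recalls[threshold] = recall_score(y_te, pred)
    precision = precision_score(y_te, pred, zero_division=0)
    print("threshold %.1f: recall %.3f, precision %.3f"
          % (threshold, recalls[threshold], precision))
```

The default `predict` method is roughly equivalent to the 0.5 cut-off.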
