**Intro**

In this notebook, I build a churn predictor for our bank.

**Read in libraries**

In [35]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

**Set Notebook Preferences**

In [36]:
#Set plot style
plt.style.use('Solarize_Light2')

#Set path to cleaned data
path = r'C:\Users\kishe\Documents\Data Science\Projects\Python Projects\In Progress\Bank Churn Analysis\Data\02_Cleaned_Data'

**Read in data**

In [37]:
df = pd.read_csv(path + '/2020_0720_Cleaned_Churn_Date.csv', index_col=0)

## Data Overview

**Preview data**

In [38]:
#Display data shape and head
print('Data shape:', df.shape)
display(df.head())


Data shape: (10000, 11)


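| | creditscore | geography | gender | age | tenure | balance | numofproducts | hascrcard | isactivemember | estimatedsalary | exited |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 619 | France | 0 | 42 | 2 | 0.0 | 1 | 1 | 1 | 101348.88 | 1 |
| 1 | 608 | Spain | 0 | 41 | 1 | 83807.86 | 1 | 0 | 1 | 112542.58 | 0 |
| 2 | 502 | France | 0 | 42 | 8 | 159660.8 | 3 | 1 | 0 | 113931.57 | 1 |
| 3 | 699 | France | 0 | 39 | 1 | 0.0 | 2 | 0 | 0 | 93826.63 | 0 |
| 4 | 850 | Spain | 0 | 43 | 2 | 125510.82 | 1 | 1 | 1 | 79084.1 | 0 |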


**About the data - Info**

In [39]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 10000 entries, 0 to 9999
Data columns (total 11 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   creditscore      10000 non-null  int64  
 1   geography        10000 non-null  object 
 2   gender           10000 non-null  int64  
 3   age              10000 non-null  int64  
 4   tenure           10000 non-null  int64  
 5   balance          10000 non-null  float64
 6   numofproducts    10000 non-null  int64  
 7   hascrcard        10000 non-null  int64  
 8   isactivemember   10000 non-null  int64  
 9   estimatedsalary  10000 non-null  float64
 10  exited           10000 non-null  int64  
dtypes: float64(2), int64(8), object(1)
memory usage: 937.5+ KB


**About the data - Descriptive Statistics**

In [40]:
df.describe()

| | creditscore | gender | age | tenure | balance | numofproducts | hascrcard | isactivemember | estimatedsalary | exited |
|---|---|---|---|---|---|---|---|---|---|---|
| count | 10000.0 | 10000.0 | 10000.0 | 10000.0 | 10000.0 | 10000.0 | 10000.0 | 10000.0 | 10000.0 | 10000.0 |
| mean | 650.5288 | 0.5457 | 38.9218 | 5.0128 | 76485.889288 | 1.5302 | 0.7055 | 0.5151 | 100090.239881 | 0.2037 |
| std | 96.653299 | 0.497932 | 10.487806 | 2.892174 | 62397.405202 | 0.581654 | 0.45584 | 0.499797 | 57510.492818 | 0.402769 |
| min | 350.0 | 0.0 | 18.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 11.58 | 0.0 |
| 25% | 584.0 | 0.0 | 32.0 | 3.0 | 0.0 | 1.0 | 0.0 | 0.0 | 51002.11 | 0.0 |
| 50% | 652.0 | 1.0 | 37.0 | 5.0 | 97198.54 | 1.0 | 1.0 | 1.0 | 100193.915 | 0.0 |
| 75% | 718.0 | 1.0 | 44.0 | 7.0 | 127644.24 | 2.0 | 1.0 | 1.0 | 149388.2475 | 0.0 |
| max | 850.0 | 1.0 | 92.0 | 10.0 | 250898.09 | 4.0 | 1.0 | 1.0 | 199992.48 | 1.0 |


# Machine Learning

## Prepare data

**Split data**

In [41]:
#Read in library
from sklearn.model_selection import train_test_split

#Separate features from target
X = df.drop('exited', axis=1)
y = df.exited.values

#Split into training and test data
#Note: shuffle=False keeps the original row order, so random_state has no effect here
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42, test_size=0.2, shuffle=False)

#Check
print('Training data shape:{} Label shape:{}'.format(X_train.shape, y_train.shape))
print('Test data shape:{} Label shape:{}'.format(X_test.shape, y_test.shape))

Training data shape:(8000, 10) Label shape:(8000,)
Test data shape:(2000, 10) Label shape:(2000,)
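
The split above uses `shuffle=False`, so the test set is simply the last 20% of rows. As a minimal sketch on toy data (the arrays `X_toy` and `y_toy` are made up for illustration and are not the bank data), the example below shows how an ordered split can distort the class ratio when the rows happen to be sorted, and how `stratify` preserves it. It is an alternative worth knowing about, not what this notebook does.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy labels: 20% positives, all placed at the end to mimic sorted data
y_toy = np.array([0] * 80 + [1] * 20)
X_toy = np.arange(100).reshape(-1, 1)

# shuffle=False: the split is a plain slice, so the last 20 rows form the test set
_, _, y_tr_ordered, y_te_ordered = train_test_split(
    X_toy, y_toy, test_size=0.2, shuffle=False)

# stratify=y_toy: rows are shuffled and the class ratio is kept in both parts
_, _, y_tr_strat, y_te_strat = train_test_split(
    X_toy, y_toy, test_size=0.2, stratify=y_toy, random_state=42)

ordered_test_rate = y_te_ordered.mean()
stratified_test_rate = y_te_strat.mean()
print('Positive rate in test set (shuffle=False):', ordered_test_rate)
print('Positive rate in test set (stratify=y_toy):', stratified_test_rate)
```

Whether row order matters for the churn file depends on how it was exported, which the notebook does not say.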


## Develop Preprocessing Pipeline

In [55]:
#Read in libraries
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, MinMaxScaler
from sklearn.model_selection import cross_val_predict

#Get names of numeric/categorical features
num_features = X_train.select_dtypes(include = ['int64', 'float64']).columns
cat_features = X_train.select_dtypes(include = ['object']).columns

#Init steps for column transformer
num_transformer = Pipeline([('scaler', MinMaxScaler())])
cat_transformer = Pipeline([('onehot', OneHotEncoder(handle_unknown='ignore'))])

#Build preprocessor
preprocessor = ColumnTransformer(transformers=[('numerics', num_transformer, num_features),
                                              ('categoricals', cat_transformer, cat_features)],
                                n_jobs=-1)

#Build initial pipeline
classifier = Pipeline([('Preprocessor', preprocessor)])
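
To make the preprocessor's effect concrete, here is a small sketch on a three-row toy frame (the frame `toy` and its values are invented for illustration). It applies the same two transformers, `MinMaxScaler` and `OneHotEncoder`, to one numeric and one categorical column. `sparse_threshold=0` is set here only so the result prints as a dense array.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

# Hypothetical toy data with one numeric and one categorical feature
toy = pd.DataFrame({'creditscore': [350, 600, 850],
                    'geography': ['France', 'Spain', 'France']})

toy_preprocessor = ColumnTransformer(
    transformers=[('numerics', MinMaxScaler(), ['creditscore']),
                  ('categoricals', OneHotEncoder(handle_unknown='ignore'), ['geography'])],
    sparse_threshold=0)  # always return a dense array

# Column 0: creditscore rescaled to [0, 1]; columns 1-2: one-hot for France, Spain
transformed = toy_preprocessor.fit_transform(toy)
print(transformed)
```

The numeric column is rescaled to the range 0 to 1, and each category becomes its own indicator column.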

## Build Base Model - Naive Bayes Classifier

**Build base model and get predictions**

In [56]:
#Import naive bayes
from sklearn.naive_bayes import GaussianNB

#Append GaussianNB to pipeline
classifier.steps.append(['model', GaussianNB()])

#Get predictions on training data
predictions = cross_val_predict(classifier, X_train, y_train,  n_jobs=-1)

**Evaluate base model**

In [60]:
#Import classification report
from sklearn.metrics import classification_report

#Evaluate
print('Naive Bayes Classification Report: \n',classification_report(y_train, predictions))

Naive Bayes Classification Report: 
               precision    recall  f1-score   support

           0       0.85      0.94      0.89      6353
           1       0.60      0.37      0.46      1647

    accuracy                           0.82      8000
   macro avg       0.73      0.65      0.68      8000
weighted avg       0.80      0.82      0.80      8000



## Model Selection

Evaluate a series of additional models to later tune into final classifier.

The goal is to optimize recall on the churn class. A missed churner (false negative) is a customer we never get the chance to intervene with and retain, so we want to catch as many likely leavers as possible.
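
For reference, recall is TP / (TP + FN): the share of actual churners the model flags. In the base model report above, class 1 has recall 0.37 on 1,647 churners, so roughly 600 of them are caught. The sketch below computes recall by hand and with `recall_score` on made-up labels (`y_true` and `y_pred` are illustrative only).

```python
from sklearn.metrics import confusion_matrix, recall_score

# Hypothetical labels: 4 actual churners, the model catches 2 of them
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 0, 0, 0, 1]

# For binary labels, ravel() returns counts in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

manual_recall = tp / (tp + fn)
sk_recall = recall_score(y_true, y_pred)
print('Manual recall:', manual_recall)
print('recall_score: ', sk_recall)
```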

In [61]:
#Read in classifiers
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

#Read in metrics
from sklearn.metrics import accuracy_score, precision_score, recall_score

#Init classifiers
classifiers = [KNeighborsClassifier(n_jobs=-1),
               LogisticRegression(n_jobs=-1),
               DecisionTreeClassifier(max_depth=10),
               RandomForestClassifier(max_depth=10, n_jobs=-1)]

#Init model names
names = ['KNeighborsClassifier',
         'LogisticRegression',
         'DecisionTreeClassifier',
         'RandomForestClassifier']

In [None]:
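#Fit each model in the pipeline and evaluate recall on cross-validated predictions
from sklearn.metrics import recall_score

for name, model in zip(names, classifiers):
    classifier.steps.pop(-1) #Delete last step of pipe
    classifier.steps.append([name, model]) #Append new step of pipe
    predictions = cross_val_predict(classifier, X_train, y_train, n_jobs=-1)
    print('{} recall: {:.3f}'.format(name, recall_score(y_train, predictions)))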

In [64]:
classifier.steps[1]

['model', GaussianNB()]
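
The model-comparison idea can also be sketched without mutating a shared pipeline. The example below is a self-contained illustration on synthetic data from `make_classification` (a stand-in with roughly 20% positives, not the bank data). It builds a fresh pipeline per candidate and collects cross-validated recall in a dictionary. The names `X_demo`, `y_demo`, `candidates`, and `recalls` are assumptions made for this sketch.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import cross_val_predict
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced stand-in for the churn data
X_demo, y_demo = make_classification(n_samples=500, n_features=8,
                                     weights=[0.8, 0.2], random_state=42)

candidates = {
    'LogisticRegression': LogisticRegression(max_iter=1000),
    'DecisionTreeClassifier': DecisionTreeClassifier(max_depth=10, random_state=42),
}

recalls = {}
for name, model in candidates.items():
    # A fresh pipeline per model avoids leftover steps from earlier iterations
    pipe = Pipeline([('scaler', MinMaxScaler()), (name, model)])
    preds = cross_val_predict(pipe, X_demo, y_demo, cv=5)
    recalls[name] = recall_score(y_demo, preds)

# Report models from highest to lowest recall
for name, score in sorted(recalls.items(), key=lambda item: item[1], reverse=True):
    print('{}: recall = {:.3f}'.format(name, score))
```

Rebuilding the pipeline each time costs little and makes it harder to end up evaluating a stale final step by accident.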