# Bank Customer Churn Prediction

In this exercise, we build and train a model that predicts which customers
are likely to churn, so that the bank can take steps to incentivise those customers to stay.
We frame this as a binary classification problem: each customer is predicted as either stayed (0) or exited (1).


In [None]:
# Reading the input directory files
import os
print(os.listdir("../input/"))
# Importing Libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import scipy.stats as stats
import seaborn as sns
import math
%matplotlib inline

In [None]:
# Reading the bank customers file with the pandas function read_csv()
customers_data = pd.read_csv('../input/Churn_Modelling.csv')

## 1. Exploratory Analysis

In this first phase, we explore the data to understand its format and content, and to see whether it needs cleaning before we use it to train the prediction model.

In [None]:
# Displaying the top rows of the dataset for a quick visualization of the data
print(customers_data.head())

In [None]:
# Running customers_data.describe() to compute descriptive statistics on the data,
# in order to screen for outliers and potentially bad data.

customers_data.describe(include="all")

In [None]:
# Checking the number of rows and columns, and whether any values are missing
print(customers_data.shape)
print("The number of null values is:", customers_data.isnull().values.sum())
print(customers_data.isnull().sum())

The results show that there is no missing data.

In [None]:
# Running customers_data.info() to check for missing or NaN values in any of the fields
# and to check that every column type is consistent with the data it contains.
customers_data.info()


#### From the .info() output above we identify 3 columns with object dtype. Two of them, Geography and Gender, are categorical features; the third, Surname, is free-text string data rather than a categorical feature.
#### The data is also complete: every column has 10000 non-null values, which equals the total number of rows.
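
One quick way to support the distinction between categorical and free-text columns is to count the distinct values per column: a categorical feature has few distinct values, while a name-like column has many. The sketch below uses a small made-up frame (the rows and the variable names `sample` and `unique_counts` are assumptions for illustration, not the real dataset).

```python
import pandas as pd

# Made-up rows that only mimic the three string columns of the dataset
sample = pd.DataFrame({
    "Surname": ["Smith", "Jones", "Lee", "Brown", "Khan", "Silva"],
    "Geography": ["France", "Spain", "Germany", "France", "Spain", "France"],
    "Gender": ["Female", "Male", "Male", "Female", "Male", "Female"],
})

# Number of distinct values per column: low counts suggest a categorical feature
unique_counts = sample[["Surname", "Geography", "Gender"]].nunique()
print(unique_counts)
```

On the real data the same call would be made on customers_data instead of the made-up frame.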

In [None]:
# Helper function to visualise the distribution of the numeric predictor variables

def visual_exploratory(x):
    
    for var in x.select_dtypes(include = [np.number]).columns :
        print( var + ' : ')
        x[var].plot(kind='hist')  # kind must be passed by keyword in recent pandas versions
        plt.show()
        
visual_exploratory(customers_data)

# plotting box plots to visually inspect the numeric data

def boxPlot_exploratory(x):
    
    for var in x.select_dtypes(include = [np.number]).columns :
        print( var + ' : ')
        x.boxplot(column = var)
        plt.show()
        
boxPlot_exploratory(customers_data)

In [None]:
#Creating a variable of Categorical features

cat_df_customers = customers_data.select_dtypes(include = ['object']).copy()
print(cat_df_customers.head()) 
print(" The number of null values is: " , cat_df_customers.isnull().values.sum())

#### Plotting the distribution of the above categorical features

In [None]:
#Plotting categorical features

## 1. Plot for Geographical location

location_count = cat_df_customers['Geography'].value_counts()
sns.barplot(x=location_count.index, y=location_count.values)  # recent seaborn versions require keyword arguments
plt.title('Geographical location Distribution of Bank Customers')
plt.ylabel('Frequency', fontsize=11)
plt.xlabel('Geography', fontsize=11)
plt.show()


## 2. Plot for Gender 

gender_count = cat_df_customers['Gender'].value_counts()
sns.barplot(x=gender_count.index, y=gender_count.values)  # recent seaborn versions require keyword arguments
plt.title('Gender Distribution of Bank Customers')
plt.ylabel('Frequency', fontsize=11)
plt.xlabel('Gender', fontsize=11)
plt.show()

## 2. Building a classification model with the Extreme Gradient Boosting (XGBoost) algorithm

We chose XGBoost because it builds a strong learner from an ensemble of weak learners (shallow decision trees).
The trees are added one at a time, and each new tree is fitted to the errors of the ensemble built so far, so the combined model gradually corrects its own mistakes.
It also brings regularization (penalties on tree complexity), good predictive performance and fast training together in one library.
Because it is tree-based, it is generally less sensitive to collinearity among features than linear models are.
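
The idea of learning from the errors of earlier weak learners can be sketched with plain NumPy. This is a simplified illustration of gradient boosting with squared error on made-up one-dimensional regression data, not XGBoost itself: the function names `fit_stump`, `predict_stump` and `boost`, the data and the hyperparameter values are all assumptions chosen for the example.

```python
import numpy as np

def fit_stump(x, residual):
    """Find the single threshold split on x that minimises the squared error."""
    best = None
    for threshold in np.unique(x)[:-1]:  # skip the maximum so the right side is never empty
        left = residual[x <= threshold]
        right = residual[x > threshold]
        error = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if best is None or error < best[0]:
            best = (error, threshold, left.mean(), right.mean())
    _, threshold, left_value, right_value = best
    return threshold, left_value, right_value

def predict_stump(stump, x):
    threshold, left_value, right_value = stump
    return np.where(x <= threshold, left_value, right_value)

def boost(x, y, n_rounds=20, learning_rate=0.3):
    """Add stumps one at a time, each fitted to the current residuals."""
    prediction = np.full_like(y, y.mean(), dtype=float)
    errors = [np.mean((y - prediction) ** 2)]
    for _ in range(n_rounds):
        residual = y - prediction              # what the ensemble still gets wrong
        stump = fit_stump(x, residual)         # weak learner trained on those errors
        prediction += learning_rate * predict_stump(stump, x)
        errors.append(np.mean((y - prediction) ** 2))
    return prediction, errors

# Made-up data: a noisy sine curve
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=200)
y = np.sin(x) + rng.normal(scale=0.1, size=200)

prediction, errors = boost(x, y)
print("training MSE before boosting:", errors[0])
print("training MSE after boosting: ", errors[-1])
```

Each round lowers (or at least does not raise) the training error, which is the behaviour the paragraph above describes. XGBoost applies the same principle with deeper trees, a classification loss and explicit regularization.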

In [None]:
#gradient boosting decision tree algorithm
import xgboost as xgb
import sklearn as skt
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import LabelEncoder

In [None]:
new_customers_data=customers_data.copy()

### 2.1. Label Encode string values in the dataset

By default, XGBoost accepts only numeric inputs because it builds regression trees internally, so we convert the string values of Gender into numbers. A label encoder is appropriate here because Gender has only two values in this dataset, so encoding them as 0 and 1 does not imply a misleading order or weighting.
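
As a small illustration of what the label encoder produces, the sketch below encodes a made-up list of gender values (the sample values and the variable names are assumptions, not the notebook's data). LabelEncoder assigns the integer codes in sorted order of the class labels.

```python
from sklearn.preprocessing import LabelEncoder

gender = ["Female", "Male", "Male", "Female"]  # made-up sample values

encoder = LabelEncoder()
encoded = encoder.fit_transform(gender)

# classes_ holds the labels in sorted order; the position of a label is its code
mapping = {str(label): code for code, label in enumerate(encoder.classes_)}
print(mapping)
print(encoded.tolist())
```

Inspecting encoder.classes_ after fitting is a cheap way to confirm which label received which code before interpreting the model.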

In [None]:
# encode string class values as integers

Gender = new_customers_data['Gender']
label_encoder = LabelEncoder()
label_encoder = label_encoder.fit(Gender)
label_encoded =label_encoder.transform(Gender)
new_customers_data['Gender']=label_encoded

### 2.2 Transform Geography string values with one-Hot Encoding

One-hot encoding converts each geographic name into its own column holding 1 or 0; for this we use
pd.get_dummies(), a pandas function. This prevents the model from interpreting the values as an ordered weight, since each column contains only 0 and 1 rather than the 0, 1 and 2 that a label encoder would assign.
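
A minimal sketch of the transformation on a made-up frame (the rows and variable names are assumptions for illustration). Passing dtype=int is an optional addition: recent pandas versions return boolean columns by default, and the integer form matches the 0/1 description above.

```python
import pandas as pd

# Made-up rows using the three country names found in the dataset
sample = pd.DataFrame({"Geography": ["France", "Spain", "Germany", "France"]})

# One new 0/1 column per distinct value of Geography
encoded = pd.get_dummies(sample, columns=["Geography"], prefix="Geography", dtype=int)
print(encoded)
```

Every row has exactly one column set to 1, so no country is ranked above another.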

In [None]:
#print(new_customers_data.head())
#Gend = new_customers_data['Gender']
#print(Gend)

In [None]:
temp_customers_data=new_customers_data.copy()
temp_customers_data = pd.get_dummies(temp_customers_data, columns=['Geography'], prefix = ['Geography'])
print(temp_customers_data.head())

In [None]:
# Inserting the new one-hot columns before the target column, so that Exited stays the last column

new_customers_data.insert(13, 'Geography_France' , temp_customers_data['Geography_France'])
new_customers_data.insert(14, 'Geography_Germany' , temp_customers_data['Geography_Germany'])
new_customers_data.insert(15, 'Geography_Spain' , temp_customers_data['Geography_Spain'])
print(new_customers_data.head())

### 2.3 Creating new transformed features and adding them to the dataset
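
The credit score ranges built in this section can also be expressed without an explicit loop. The sketch below is an alternative formulation, assuming the same cut-offs (600, 650, 700, 750) and the same 1 to 5 codes as the helper in the notebook; the sample scores are made up.

```python
import numpy as np
import pandas as pd

scores = pd.Series([350, 599, 600, 649, 650, 700, 749, 750, 850])  # made-up scores

# right=False makes each interval closed on the left: [600, 650), [650, 700), ...
bins = [-np.inf, 600, 650, 700, 750, np.inf]
score_range = pd.cut(scores, bins=bins, labels=[1, 2, 3, 4, 5], right=False).astype(int)
print(score_range.tolist())
```

The same pattern would apply to the age groups, with the age boundaries as bins and the codes 1 to 7 as labels.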

In [None]:
# Helper function that builds the credit score range column to add to the data frame
def creditscore(data):
    score = data.CreditScore
    score_range =[]
    for i in range(len(score)) : 
        if (score[i] < 600) :  
            score_range.append(1) # 'Very Bad Credit'
        elif ( 600 <= score[i] < 650) :  
            score_range.append(2) # 'Bad Credit'
        elif ( 650 <= score[i] < 700) :  
            score_range.append(3) # 'Good Credit'
        elif ( 700 <= score[i] < 750) :  
            score_range.append(4) # 'Very Good Credit'
        elif score[i] >= 750 : 
            score_range.append(5) # 'Excellent Credit'
    return score_range

# converting the returned list into a dataframe
CreditScore_category = pd.DataFrame({'CreditScore_range': creditscore(new_customers_data)})

# Appending the new column to the new_customers_data dataframe
new_customers_data.insert(16, 'CreditScore_range' , CreditScore_category['CreditScore_range'])

In [None]:
# Helper function that builds the age group column to add to the data frame
def agegroup(data):
    age = data.Age
    age_range =[]
    for i in range(len(age)) : 
        if (age[i] < 30) :  
            age_range.append(1) # 'Under 30 years'
        elif ( 30 <= age[i] < 40) :  
            age_range.append(2) # 'Between 30 and 40 years'
        elif ( 40 <= age[i] < 50) :  
            age_range.append(3) # 'Between 40 and 50 years'
        elif ( 50 <= age[i] < 60) :  
            age_range.append(4) # 'Between 50 and 60 years'
        elif ( 60 <= age[i] < 70) :  
            age_range.append(5) # 'Between 60 and 70 years'
        elif ( 70 <= age[i] < 80) :  
            age_range.append(6) # 'Between 70 and 80 years'
        elif age[i] >= 80 : 
            age_range.append(7) # '80 years and above'
    return age_range

# converting the returned list into a dataframe
AgeGroup_category = pd.DataFrame({'age_group': agegroup(new_customers_data)})

# Appending the new column to the new_customers_data dataframe
new_customers_data.insert(17, 'age_group' , AgeGroup_category['age_group'])

In [None]:
print(new_customers_data.head())

### 2.4 Training and Building the XGBoost model

To train the XGBoost model and test its performance, we use 67% of the data as the training set and the remaining 33% as the test set.
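
The effect of this split can be checked on made-up data before applying it to the real frame. In the sketch below, the arrays X_demo and y_demo and the 80/20 class ratio are assumptions for illustration; the stratify argument is an optional addition that is not part of the original notebook and keeps the class ratio similar in both parts.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 100 rows, 2 features, 20% positive class
X_demo = np.arange(200).reshape(100, 2)
y_demo = np.array([0] * 80 + [1] * 20)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.33, random_state=7, stratify=y_demo
)

print("train rows:", len(X_tr), "test rows:", len(X_te))
print("positive share in train:", y_tr.mean(), "in test:", y_te.mean())
```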

In [None]:
new_customers_data_xgboost=new_customers_data.copy()
Target = 'Exited'
Surname = 'Surname'
Geography = 'Geography'
#Gender= 'Gender'
ID= 'RowNumber'
CustomerId = 'CustomerId'
# Choose all predictors except Target, Surname, Geography, CustomerId & ID, and separate the response variable
X = [x for x in new_customers_data_xgboost.columns if x not in [Surname,Geography, Target, ID, CustomerId]]
Y = new_customers_data_xgboost[Target]  # select the response by name rather than by column position


In [None]:
predictors = new_customers_data_xgboost[X] #predictor variable
response = Y # response variable
print(predictors.head())
print(response.head())

### 2.5 Training the model with a train/test split

In [None]:
# split data into train and test sets
seed = 7
test_size = 0.33
X_train, X_test, y_train, y_test = train_test_split(predictors, response, test_size=test_size,
random_state=seed)
# fit model on training data

## xg_reg = xgb.XGBRegressor(objective ='reg:linear', colsample_bytree = 0.3, learning_rate = 0.1,
##                max_depth = 5, alpha = 10, n_estimators = 10)

#model = xgb.XGBClassifier(objective ='binary:logistic', colsample_bytree = 0.3, learning_rate = 0.1,
                #max_depth = 5, alpha = 10, n_estimators = 10)
model = xgb.XGBClassifier()
model.fit(X_train, y_train)
# make predictions for test data
predictions = model.predict(X_test)
# evaluate predictions
accuracy = accuracy_score(y_test, predictions)

print("Accuracy: %.2f%%" % (accuracy * 100.0))

#### 2.5.1 Inspecting the model parameters

#### 2.5.2 plotting important features identified by the model

In [None]:
# plotting the important features for a quick idea of which ones contribute most to the model performance
import matplotlib.pyplot as plt
# reference hyperparameters; the default model trained above does not use them
params = {"objective":"binary:logistic",'colsample_bytree': 0.3,'learning_rate': 0.1,
                'max_depth': 5, 'alpha': 10}
plt.rcParams['figure.figsize'] = [5, 5]  # set the figure size before plotting so that it takes effect
xgb.plot_importance(model)
plt.show()

### 2.6 Training the model with k-fold cross validation technique

This technique splits the data into k folds. The algorithm is trained on k − 1 folds and tested on the one fold that was held back.
This is repeated so that each fold of the dataset gets a turn as the held-back test set.

The resulting estimate is more reliable than a single train/test split because the algorithm is trained and evaluated several times on different data.
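
The fold rotation can be made visible on a tiny made-up array. The sketch below uses plain KFold with 5 folds on 10 samples (both numbers and the variable names are assumptions for illustration) and counts how often each sample lands in the held-back fold.

```python
import numpy as np
from sklearn.model_selection import KFold

data = np.arange(10)  # made-up samples, identified by their index
kfold_demo = KFold(n_splits=5)

held_out_counts = np.zeros(len(data), dtype=int)
for fold, (train_idx, test_idx) in enumerate(kfold_demo.split(data), start=1):
    held_out_counts[test_idx] += 1
    print(f"fold {fold}: train={train_idx.tolist()} test={test_idx.tolist()}")

# Every sample is held back exactly once across the folds
print(held_out_counts.tolist())
```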

In [None]:
from sklearn.model_selection import KFold
from sklearn.model_selection import StratifiedKFold # to enforce the same distribution of classes in each fold
from sklearn.model_selection import cross_val_score

In [None]:
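# testing the cross validated model
model2 = xgb.XGBClassifier()
# shuffle=True is needed for random_state to take effect; recent scikit-learn versions raise an error otherwise
kfold = StratifiedKFold(n_splits=10, shuffle=True, random_state=7)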
results = cross_val_score(model2, predictors, response, cv=kfold)
print("Accuracy: %.2f%% (%.2f%%)" % (results.mean()*100, results.std()*100))

In [None]:
print(results) # These results represent the accuracy at each fold in the cross validated model

In [None]:
print(model.feature_importances_) # built-in attribute of the model: importance scores in the same order as the input predictors

In [None]:
# plot feature importance
from matplotlib import pyplot
xgb.plot_importance(model)
pyplot.show()

#### 2.6.1 Selecting features based on their respective feature importance scores

This avoids including redundant features in the training dataset, since they do not contribute to improving the model.
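
The selection rule itself is simple: a feature is kept when its importance score is greater than or equal to the threshold. The sketch below shows that rule with NumPy on made-up importance scores (the scores and the choice of four feature names are assumptions for illustration, not the model's actual output).

```python
import numpy as np

feature_names = np.array(["Age", "CreditScore", "Gender", "Geography_Spain"])
importances = np.array([0.40, 0.35, 0.20, 0.05])  # made-up scores

kept_per_threshold = {}
for thresh in np.sort(importances):
    # keep the features whose importance reaches the threshold
    kept = feature_names[importances >= thresh]
    kept_per_threshold[float(thresh)] = kept.tolist()
    print(f"thresh={thresh:.2f}, n={len(kept)}, kept={kept.tolist()}")
```

Using each sorted importance as a threshold removes one more feature per step, starting from the full set and ending with the single most important feature.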

In [None]:
print(np.sort(model.feature_importances_)) # Sorting the importance scores in ascending order

In [None]:
from sklearn.feature_selection import SelectFromModel
thresholds = np.sort(model.feature_importances_)
for thresh in thresholds:
    selection = SelectFromModel(model, threshold=thresh, prefit=True)
    select_X_train = selection.transform(X_train)
    # train model
    selection_model = xgb.XGBClassifier()
    selection_model.fit(select_X_train, y_train)
    # eval model
    select_X_test = selection.transform(X_test)
    y_pred = selection_model.predict(select_X_test)
    accuracy = accuracy_score(y_test, y_pred)
    print("Thresh=%.3f, n=%d, Accuracy: %.2f%%" % (thresh, select_X_train.shape[1],accuracy*100.0))

### 2.7 Re-run the model and prediction with the optimal threshold of 0.04582651, taking only 7 predictors

In [None]:
from sklearn.feature_selection import SelectFromModel
# select features using threshold
thresh= 0.04582651

selection = SelectFromModel(model, threshold=thresh, prefit=True)
select_X_train = selection.transform(X_train)
# train model
selection_model = xgb.XGBClassifier()
selection_model.fit(select_X_train, y_train)
# eval model
select_X_test = selection.transform(X_test)
y_pred = selection_model.predict(select_X_test)
accuracy = accuracy_score(y_test, y_pred)
print("Thresh=%.3f, n=%d, Accuracy: %.2f%%" % (thresh, select_X_train.shape[1],accuracy*100.0))

In [None]:
pred = pd.DataFrame(y_pred)
print(pred.head())
# Writing all predictions to a CSV file (printing the DataFrame to the file would only save its truncated display)
pred.to_csv('churns_predict.csv', index=False)

In [None]:
# predict() already returns class labels (0 or 1), so rounding leaves the values unchanged
predictions = [round(value) for value in y_pred]
print(predictions)
preds = pd.DataFrame(predictions)
print(preds)