# Imbalanced Classification Project: Beta Bank Churn

## 1. Defining the Question

### a) Specifying the Data Analysis Question

The task is to develop a model that predicts customer churn.


### b) Defining the Metric for Success

We will have accomplished our objective if we build a model with the maximum possible F1 score.
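
For reference, F1 is the harmonic mean of precision and recall. The sketch below computes it from confusion-matrix counts. The function name `f1_from_counts` and the example counts are illustrative assumptions, not values from this project.

```python
def f1_from_counts(tp, fp, fn):
    """Return F1 given true positives, false positives and false negatives."""
    if tp == 0:
        # No correct positive predictions: precision and recall are both 0
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts, for illustration only
example_f1 = f1_from_counts(tp=60, fp=40, fn=40)
print(round(example_f1, 2))
```

Because both precision and recall are computed for the positive class (here, customers who leave), a model that never predicts churn gets an F1 of 0 no matter how high its accuracy is.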

### c) Understanding the Context

Beta Bank customers are leaving: little by little, chipping away every month. The bankers
figured out it’s cheaper to save the existing customers rather than to attract new ones.
We need to predict whether a customer will leave the bank soon. You have the data on
clients’ past behavior and termination of contracts with the bank.
Build a model with the maximum possible F1 score. To pass the project, you need an F1
score of at least 0.59.

### d) Recording the Experimental Design

1. Load libraries and datasets
2. Prepare the data
3. Analyze the data
4. Machine learning modelling
5. Predictions
6. Conclusions and recommendations

### e) Data Relevance

The given data sets were relevant in answering the research question.

## 2. Data Cleaning & Analysis

### 2.1. Data cleaning & exploration

In [1]:
# Loading the required libraries

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# To preview all rows and columns
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

In [2]:
# Loading the dataset
df=pd.read_csv('https://bit.ly/2XZK7Bo')

In [3]:
# Previewing the first 5 records
df.head()

Unnamed: 0,RowNumber,CustomerId,Surname,CreditScore,Geography,Gender,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Exited
0,1,15634602,Hargrave,619,France,Female,42,2.0,0.0,1,1,1,101348.88,1
1,2,15647311,Hill,608,Spain,Female,41,1.0,83807.86,1,0,1,112542.58,0
2,3,15619304,Onio,502,France,Female,42,8.0,159660.8,3,1,0,113931.57,1
3,4,15701354,Boni,699,France,Female,39,1.0,0.0,2,0,0,93826.63,0
4,5,15737888,Mitchell,850,Spain,Female,43,2.0,125510.82,1,1,1,79084.1,0


In [4]:
# Standardize column names: trim whitespace, lowercase, and drop stray ')' and '?' characters
df.columns = (df.columns.str.strip().str.lower()
              .str.replace(')', '', regex=False).str.replace('?', '', regex=False))
df.head(2)

Unnamed: 0,rownumber,customerid,surname,creditscore,geography,gender,age,tenure,balance,numofproducts,hascrcard,isactivemember,estimatedsalary,exited
0,1,15634602,Hargrave,619,France,Female,42,2.0,0.0,1,1,1,101348.88,1
1,2,15647311,Hill,608,Spain,Female,41,1.0,83807.86,1,0,1,112542.58,0


In [5]:
# Getting our dataset shape

df.shape

#dataframe has 10,000 rows and 14 columns

(10000, 14)

In [6]:
# Looking for duplicates
df.duplicated().sum()

#there are no duplicate values 

0

In [7]:
df.dtypes

rownumber            int64
customerid           int64
surname             object
creditscore          int64
geography           object
gender              object
age                  int64
tenure             float64
balance            float64
numofproducts        int64
hascrcard            int64
isactivemember       int64
estimatedsalary    float64
exited               int64
dtype: object

In [8]:
# Looking for null values 

df.isnull().sum()

# tenure has 909 missing values; no other column has any

rownumber            0
customerid           0
surname              0
creditscore          0
geography            0
gender               0
age                  0
tenure             909
balance              0
numofproducts        0
hascrcard            0
isactivemember       0
estimatedsalary      0
exited               0
dtype: int64

In [9]:
# A closer inspection of the missing observations

df['tenure'].value_counts().sort_values(ascending=False)
df[df['tenure'].isnull()].head()

# Fill the missing tenure values with the column mean
df['tenure'] = df['tenure'].fillna(df['tenure'].mean())
df.isnull().sum()

rownumber          0
customerid         0
surname            0
creditscore        0
geography          0
gender             0
age                0
tenure             0
balance            0
numofproducts      0
hascrcard          0
isactivemember     0
estimatedsalary    0
exited             0
dtype: int64

In [10]:
# Cast tenure back to integer; note that this truncates the imputed mean
df['tenure'] = df['tenure'].astype(int)
df.dtypes

rownumber            int64
customerid           int64
surname             object
creditscore          int64
geography           object
gender              object
age                  int64
tenure               int64
balance            float64
numofproducts        int64
hascrcard            int64
isactivemember       int64
estimatedsalary    float64
exited               int64
dtype: object
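
Casting a mean-imputed column to `int` drops the fractional part of the imputed value. The sketch below uses hypothetical tenure values (not taken from the dataset) to compare truncating the mean, rounding it, and imputing the median instead.

```python
import statistics

# Hypothetical tenure values; None marks a missing entry
tenure = [1, None, 2, 3, None, 8, 10]

observed = [t for t in tenure if t is not None]
mean_value = statistics.mean(observed)      # 24 / 5 = 4.8
median_value = statistics.median(observed)  # middle value of [1, 2, 3, 8, 10]


def fill(values, replacement):
    """Replace missing entries with the given value."""
    return [replacement if v is None else v for v in values]


filled_truncated = fill(tenure, int(mean_value))    # what astype(int) does to the mean
filled_rounded = fill(tenure, round(mean_value))
filled_median = fill(tenure, median_value)

print(filled_truncated)
print(filled_rounded)
print(filled_median)
```

Which variant is preferable depends on the distribution of the column; the point is only that the three choices can give different integers.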

In [11]:
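# Transform the categorical gender column into a numeric one: Male = 1, Female = 0
# (an exact comparison avoids the second positional argument of str.contains, which is `case`, not a second pattern)
df['gender'] = np.where(df['gender'] == 'Male', 1, 0)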
df.head(2)

Unnamed: 0,rownumber,customerid,surname,creditscore,geography,gender,age,tenure,balance,numofproducts,hascrcard,isactivemember,estimatedsalary,exited
0,1,15634602,Hargrave,619,France,0,42,2,0.0,1,1,1,101348.88,1
1,2,15647311,Hill,608,Spain,0,41,1,83807.86,1,0,1,112542.58,0


In [12]:
# Convert the column to the category dtype
df["geography"] = df["geography"].astype('category')
df.dtypes

# Replace the labels in the same column with integer codes using the cat.codes accessor
# (codes follow the alphabetical order of the labels)
df["geography"] = df["geography"].cat.codes
df.head(2)

Unnamed: 0,rownumber,customerid,surname,creditscore,geography,gender,age,tenure,balance,numofproducts,hascrcard,isactivemember,estimatedsalary,exited
0,1,15634602,Hargrave,619,0,0,42,2,0.0,1,1,1,101348.88,1
1,2,15647311,Hill,608,2,0,41,1,83807.86,1,0,1,112542.58,0
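
Integer codes impose an order on the countries that linear models such as logistic regression may read as meaningful. One-hot encoding is a common alternative (in pandas, `pd.get_dummies` does this). The dependency-free sketch below shows the idea. The helper `one_hot` is hypothetical, and the third label "Germany" is an assumption: the preview only shows France and Spain, although Spain's code of 2 implies one more category.

```python
def one_hot(values):
    """Map each category to a row of 0/1 indicator columns."""
    categories = sorted(set(values))
    rows = [[1 if value == category else 0 for category in categories]
            for value in values]
    return categories, rows


# Illustrative rows; "Germany" is an assumed third category
categories, encoded = one_hot(["France", "Spain", "Germany", "France"])

print(categories)
for row in encoded:
    print(row)
```

Each row has exactly one 1, so no country is treated as "larger" than another.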


### 2.2. Machine Learning

#### 2.2.1 Data Modelling

Examine the balance of classes. Train the model without taking into account the
imbalance. Briefly describe your findings.


In [13]:
#Examine the balance of classes
print(df[df['exited']== 1]['exited'].count())
print(df[df['exited']== 0]['exited'].count())

# The classes are imbalanced at roughly 1:4 (2,037 customers exited, 7,963 stayed)

2037
7963
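
The counts above are enough to work out the class ratio and the accuracy of a trivial model that always predicts "stays". This small sketch uses only those two numbers; the variable names are illustrative.

```python
n_exited = 2037   # customers with exited == 1 (from the output above)
n_stayed = 7963   # customers with exited == 0 (from the output above)

total = n_exited + n_stayed
positive_share = n_exited / total
ratio = n_stayed / n_exited
# Accuracy of a model that always predicts the majority class
majority_baseline_accuracy = n_stayed / total

print(f"positive share: {positive_share:.4f}")
print(f"stayed : exited = {ratio:.2f} : 1")
print(f"always-predict-0 accuracy: {majority_baseline_accuracy:.4f}")
```

On the full dataset such a model would be right about 80% of the time while finding no churners at all, which is why accuracy alone is a weak yardstick here.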


Training models ignoring class imbalance

In [14]:
# Preparing the data: rownumber and surname are dropped because they are not relevant for prediction
# (customerid is still kept among the features)
x = df.drop(['rownumber','exited','surname'], axis = 1)
y = df['exited']

# Split the source data into a training set and a test set (80/20).
# No separate validation set is created in this cell.
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(x, y, test_size = 0.2, random_state = 12345)

#confirm size of datasets
print(df.shape)
print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)

(10000, 14)
(8000, 11)
(2000, 11)
(8000,)
(2000,)
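
The project brief also mentions a validation set. With scikit-learn a 3:1:1 split is commonly built from two `train_test_split` calls with `stratify=y`. The dependency-free sketch below illustrates the same idea on toy labels; the function `stratified_three_way_split` and its parameters are hypothetical names introduced for this example.

```python
import random
from collections import defaultdict


def stratified_three_way_split(labels, parts=(3, 1, 1), seed=12345):
    """Return (train, valid, test) index lists that keep similar class shares."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    total_parts = sum(parts)
    train, valid, test = [], [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_train = len(indices) * parts[0] // total_parts
        n_valid = len(indices) * parts[1] // total_parts
        train += indices[:n_train]
        valid += indices[n_train:n_train + n_valid]
        test += indices[n_train + n_valid:]
    return train, valid, test


# Toy labels with a 1:4 class ratio, similar to the imbalance seen above
labels = [1] * 200 + [0] * 800
train_idx, valid_idx, test_idx = stratified_three_way_split(labels)
print(len(train_idx), len(valid_idx), len(test_idx))
```

Splitting each class separately keeps the share of churners roughly equal across the three subsets, so scores on the validation and test sets are comparable.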


In [17]:
# Min-max normalisation: rescale each feature to the [0, 1] range
# The scaler is fitted on the training set only and then applied to both sets
from sklearn.preprocessing import MinMaxScaler
norm = MinMaxScaler().fit(X_train)
X_train = norm.transform(X_train)
X_test = norm.transform(X_test)
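
To make the fit/transform separation concrete, here is a minimal single-column sketch of min-max scaling. The helper names and the credit-score values are hypothetical. It also shows that test values outside the training range land outside [0, 1].

```python
def fit_min_max(column):
    """Learn the minimum and the range of one training column."""
    low, high = min(column), max(column)
    span = (high - low) or 1.0  # avoid division by zero for constant columns
    return low, span


def apply_min_max(column, low, span):
    """Rescale values with statistics learned elsewhere."""
    return [(value - low) / span for value in column]


# Hypothetical credit scores
train_scores = [500, 600, 700, 850]
test_scores = [450, 650, 900]

low, span = fit_min_max(train_scores)          # learned from training data only
scaled_train = apply_min_max(train_scores, low, span)
scaled_test = apply_min_max(test_scores, low, span)

print(scaled_train)
print(scaled_test)
```

Fitting on the training data alone keeps information about the test set out of the preprocessing step.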

In [18]:
# Import classifiers
from sklearn.linear_model import LogisticRegression # Logistic Regression Classifier
from sklearn.tree import DecisionTreeClassifier     # Decision Tree Classifier
from sklearn.ensemble import RandomForestClassifier # Random Forest Classifier
from sklearn.dummy import DummyClassifier           # Dummy (baseline) classifier

In [19]:
# Create one instance of each classifier
logistic_classifier = LogisticRegression()
decision_classifier = DecisionTreeClassifier()
random_classifer = RandomForestClassifier()
dummy_classifer = DummyClassifier()

In [20]:
#train model
logistic_classifier.fit(X_train, y_train)
decision_classifier.fit(X_train, y_train)
random_classifer.fit(X_train, y_train)
dummy_classifer.fit(X_train, y_train)



DummyClassifier(constant=None, random_state=None, strategy='warn')

In [21]:
#predict test results
logistic_y_prediction = logistic_classifier.predict(X_test) 
decision_y_prediction = decision_classifier.predict(X_test) 
random_y_prediction = random_classifer.predict(X_test) 
dummy_y_prediction = dummy_classifer.predict(X_test) 

In [22]:
from sklearn.metrics import accuracy_score
# Print the accuracy of each classifier (arguments in y_true, y_pred order)
print('Logistic accuracy:')
print(accuracy_score(y_test, logistic_y_prediction))
print('Decision Tree accuracy:')
print(accuracy_score(y_test, decision_y_prediction))
print('Random Forest accuracy:')
print(accuracy_score(y_test, random_y_prediction))
print('Dummy accuracy:')
print(accuracy_score(y_test, dummy_y_prediction))

# The most accurate classifier is the random forest at about 0.85
# Random guessing (the dummy classifier) scores about 0.69

Logistic accuracy:
0.7985
Decision Tree accuracy:
0.7715
Random Forest accuracy:
0.848
Dummy accuracy:
0.6855


In [23]:
from sklearn.metrics import f1_score
# Print the F1 score of each classifier (arguments in y_true, y_pred order)
print('Logistic f1_score:')
print(f1_score(y_test, logistic_y_prediction))
print('Decision Tree f1_score:')
print(f1_score(y_test, decision_y_prediction))
print('Random Forest f1_score:')
print(f1_score(y_test, random_y_prediction))
print('Dummy f1_score:')
print(f1_score(y_test, dummy_y_prediction))
# The best classifier is the random forest at about 0.54, still below the 0.59 target

Logistic f1_score:
0.25231910946196656
Decision Tree f1_score:
0.46922183507549364
Random Forest f1_score:
0.5351681957186545
Dummy f1_score:
0.24670658682634727
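
scikit-learn metric functions expect arguments in `(y_true, y_pred)` order. For binary F1 the order does not change the value, but precision and recall trade places when the arguments are swapped. The sketch below checks this on a toy example; the helper `precision_recall_f1` and the label vectors are illustrative.

```python
def precision_recall_f1(y_true, y_pred):
    """Compute precision, recall and F1 for the positive class (label 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
    return precision, recall, f1


# Toy labels: three churners, one of them detected, plus one false alarm
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 0, 0, 1, 0, 0, 0, 0]

correct_order = precision_recall_f1(y_true, y_pred)
swapped_order = precision_recall_f1(y_pred, y_true)
print(correct_order)
print(swapped_order)
```

Keeping the conventional order matters as soon as precision, recall or a confusion matrix is reported alongside F1.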


In [24]:
#prediction probabilities test results
logistic_y_proba = logistic_classifier.predict_proba(X_test)[:,1]
decision_y_proba = decision_classifier.predict_proba(X_test)[:,1]
random_y_proba = random_classifer.predict_proba(X_test)[:,1] 
dummy_y_proba = dummy_classifer.predict_proba(X_test)[:,1] 

In [25]:
from sklearn.metrics import roc_auc_score

# Print the ROC AUC of each classifier
print('Logistic auc:') 
print(roc_auc_score(y_test,logistic_y_proba)) 
print('Decision Tree auc:')
print(roc_auc_score(y_test, decision_y_proba))
print('Random Forest auc:')
print(roc_auc_score(y_test, random_y_proba))
print('Dummy auc:')
print(roc_auc_score(y_test,dummy_y_proba))

# The random forest has the highest AUC (about 0.86), so it separates the positive and negative classes best

Logistic auc:
0.7443376295835311
Decision Tree auc:
0.6627895204646324
Random Forest auc:
0.8564490650928803
Dummy auc:
0.5022980000625307
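
ROC AUC can be read as the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, with ties counting as half. The brute-force sketch below computes it that way on toy data; `auc_by_pairs` is a hypothetical helper and is far slower than `roc_auc_score` on real datasets.

```python
def auc_by_pairs(y_true, scores):
    """ROC AUC as the share of (positive, negative) pairs ranked correctly."""
    positives = [s for y, s in zip(y_true, scores) if y == 1]
    negatives = [s for y, s in zip(y_true, scores) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(positives) * len(negatives))


# Toy labels and predicted probabilities
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]

toy_auc = auc_by_pairs(y_true, scores)
constant_auc = auc_by_pairs(y_true, [0.5, 0.5, 0.5, 0.5])
print(toy_auc, constant_auc)
```

A value near 0.5, as for the dummy classifier above, means the scores carry essentially no ranking information.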


#### 2.2.2 Evaluation Metrics

Improve the quality of the model. Make sure you use at least two approaches to
fixing class imbalance. Use the training set to pick the best parameters. Train
different models on training and validation sets. Find the best one. Briefly
describe your findings.


In [37]:
# fixing class imbalance
from sklearn.utils import shuffle
#upsampling

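def upsample(features, target, repeat):
    # Work on the arguments rather than the global x and y, so that only the
    # training rows are resampled. X_train is a NumPy array after scaling, so
    # wrap it in a DataFrame that shares the target's index.
    features = pd.DataFrame(features, index=target.index)
    features_zeros = features[target == 0]
    features_ones = features[target == 1]
    target_zeros = target[target == 0]
    target_ones = target[target == 1]

    # Repeat the minority class `repeat` times
    features_upsampled = pd.concat([features_zeros] + [features_ones] * repeat)
    target_upsampled = pd.concat([target_zeros] + [target_ones] * repeat)

    features_upsampled, target_upsampled = shuffle(
        features_upsampled, target_upsampled, random_state=12345
    )

    return features_upsampled, target_upsampled

features_upsampled, target_upsampled = upsample(
    X_train, y_train, 10
)

upsampled_logistic = LogisticRegression(random_state=12345, solver='liblinear')
upsampled_logistic.fit(features_upsampled, target_upsampled)
upsampled_logistic_pred = upsampled_logistic.predict(X_test)

print('Accuracy', upsampled_logistic.score(X_test, y_test))
print('F1 score:', f1_score(y_test, upsampled_logistic_pred))
logistic_proba = upsampled_logistic.predict_proba(X_test)[:,1]
print('AUC:', roc_auc_score(y_test, logistic_proba))

# The output shown below comes from a run in which upsample() read the global
# x and y (unscaled, all 10,000 rows) rather than its arguments, which most
# likely explains the low scores; it does not reflect the function as written here.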

Accuracy 0.2135
F1 score: 0.3518747424804285
AUC: 0.5717352692017372


In [67]:
#downsampling

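def downsample(features, target, fraction):
    # Work on the arguments rather than the global x and y, so that only the
    # training rows are resampled. X_train is a NumPy array after scaling, so
    # wrap it in a DataFrame that shares the target's index.
    features = pd.DataFrame(features, index=target.index)
    features_zeros = features[target == 0]
    features_ones = features[target == 1]
    target_zeros = target[target == 0]
    target_ones = target[target == 1]

    # Keep only a fraction of the majority class
    features_downsampled = pd.concat(
    [features_zeros.sample(frac=fraction,random_state=12345)] + [features_ones])

    target_downsampled = pd.concat(
    [target_zeros.sample(frac=fraction,random_state=12345)] + [target_ones])

    return features_downsampled, target_downsampled

features_downsampled, target_downsampled = downsample(
    X_train, y_train, 0.1
)

downsampled_logistic = LogisticRegression(random_state=12345, solver='liblinear')
downsampled_logistic.fit(features_downsampled, target_downsampled)
downsampled_logistic_pred = downsampled_logistic.predict(X_test)

print('Accuracy', downsampled_logistic.score(X_test, y_test))
print('F1 score:', f1_score(y_test, downsampled_logistic_pred))
logistic_proba = downsampled_logistic.predict_proba(X_test)[:,1]
print('AUC:', roc_auc_score(y_test, logistic_proba))

# The output shown below comes from a run in which downsample() read the global
# x and y (unscaled, all 10,000 rows) rather than its arguments, which most
# likely explains the low scores; it does not reflect the function as written here.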

Accuracy 0.214
F1 score: 0.3520197856553998
AUC: 0.5717054927189056
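
The `repeat` and `fraction` settings decide how balanced the resampled data ends up. The sketch below works through the arithmetic using the full-dataset class counts printed earlier as a stand-in (the training subset holds roughly 80% of each class). The helper names are hypothetical.

```python
n_negative = 7963  # exited == 0, full dataset
n_positive = 2037  # exited == 1, full dataset


def upsampled_counts(repeat):
    """Class counts after repeating the positive class `repeat` times."""
    return n_negative, n_positive * repeat


def downsampled_counts(fraction):
    """Class counts after keeping a fraction of the negative class."""
    return round(n_negative * fraction), n_positive


print(upsampled_counts(10))      # setting used above
print(upsampled_counts(4))       # close to balanced
print(downsampled_counts(0.1))   # setting used above
print(downsampled_counts(0.25))  # close to balanced
```

With `repeat=10` or `fraction=0.1` the positive class ends up about 2.5 times larger than the negative class, so the imbalance is reversed instead of removed. Values near 4 and 0.25 bring the two classes close to parity.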


#### 2.2.3 Hyperparameter Tuning 

From the previous cells, the random forest gives the best F1 score, about 0.54. To improve on that, we search for better hyperparameters for this model.

In [75]:
# Random forest tuning
from sklearn.model_selection import GridSearchCV
# Candidate values: max_depth 1 to 9 and n_estimators 1 to 49
depth_param = {'max_depth': range(1, 10), 'n_estimators': range(1, 50)}
hyper_forest = RandomForestClassifier(random_state=12345, class_weight='balanced')
# No scoring argument is passed, so the search ranks candidates by accuracy, not F1
grid_forest = GridSearchCV(hyper_forest, depth_param)
grid_forest.fit(X_train, y_train)
print(grid_forest.best_estimator_)
hyper_predicted = grid_forest.predict(X_test)
print('Forest Accuracy:', grid_forest.score(X_test, y_test))
print('Forest F1 score:', f1_score(y_test, hyper_predicted))
forest_proba = grid_forest.predict_proba(X_test)[:,1]
print('Forest AUC:', roc_auc_score(y_test, forest_proba))

RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight='balanced',
                       criterion='gini', max_depth=9, max_features='auto',
                       max_leaf_nodes=None, max_samples=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=42,
                       n_jobs=None, oob_score=False, random_state=12345,
                       verbose=0, warm_start=False)
Forest Accuracy: 0.833
Forest F1 score: 0.6329670329670329
Forest AUC: 0.8607562333344747


We get an F1 score of 0.63 for the tuned random forest, which exceeds the required F1 score of 0.59.

## 3. Summary of Findings and Recommendations

Findings:

Random forest is the best classifier: after tuning with balanced class weights, it reaches an F1 score of 0.63.