# Imbalanced Classification Project

# Defining the question

#a) Specifying the data analysis question

Beta Bank would like to build a model to predict whether a customer will leave the bank soon.

#b) Defining the Metric for Success

We will have accomplished our objective if we can develop a model that will predict whether a customer will leave the bank soon.

 # c)Understanding the Context

Beta Bank customers are leaving: little by little, chipping away every month. The bankers
figured out it’s cheaper to save the existing customers rather than to attract new ones.
The bank needs to predict whether a customer will leave the bank soon.
The clients’ past behavior and termination of contracts with the bank is available

# d) Recording the Experimental Design



1. Importing libraries
2. Data Importation
3. Data Modeling
4. Model Evaluation
5. Hyparameter Tuning
6. Sanity Check
7. Findings and Recommendations




# e) Data Relevance

The data was relevant for the analysis

#Data importation and preparation

In [None]:
#importing the libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.metrics import accuracy_score
from sklearn.metrics import recall_score
from sklearn.metrics import precision_score
from sklearn.utils import resample
from sklearn.metrics import roc_auc_score
from sklearn.utils import shuffle
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_curve

In [None]:
#importing the data
bank_df = pd.read_csv('https://bit.ly/2XZK7Bo')
#previewing first 5 records
bank_df.head()

Unnamed: 0,RowNumber,CustomerId,Surname,CreditScore,Geography,Gender,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Exited
0,1,15634602,Hargrave,619,France,Female,42,2.0,0.0,1,1,1,101348.88,1
1,2,15647311,Hill,608,Spain,Female,41,1.0,83807.86,1,0,1,112542.58,0
2,3,15619304,Onio,502,France,Female,42,8.0,159660.8,3,1,0,113931.57,1
3,4,15701354,Boni,699,France,Female,39,1.0,0.0,2,0,0,93826.63,0
4,5,15737888,Mitchell,850,Spain,Female,43,2.0,125510.82,1,1,1,79084.1,0


In [None]:
#previewing last 5 records
bank_df.tail()

Unnamed: 0,RowNumber,CustomerId,Surname,CreditScore,Geography,Gender,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Exited
9995,9996,15606229,Obijiaku,771,France,Male,39,5.0,0.0,2,1,0,96270.64,0
9996,9997,15569892,Johnstone,516,France,Male,35,10.0,57369.61,1,1,1,101699.77,0
9997,9998,15584532,Liu,709,France,Female,36,7.0,0.0,1,0,1,42085.58,1
9998,9999,15682355,Sabbatini,772,Germany,Male,42,3.0,75075.31,2,1,0,92888.52,1
9999,10000,15628319,Walker,792,France,Female,28,,130142.79,1,1,0,38190.78,0


In [None]:
#checking the shape of the data
bank_df.shape

(10000, 14)

In [None]:
#checking the data info
bank_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 14 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   RowNumber        10000 non-null  int64  
 1   CustomerId       10000 non-null  int64  
 2   Surname          10000 non-null  object 
 3   CreditScore      10000 non-null  int64  
 4   Geography        10000 non-null  object 
 5   Gender           10000 non-null  object 
 6   Age              10000 non-null  int64  
 7   Tenure           9091 non-null   float64
 8   Balance          10000 non-null  float64
 9   NumOfProducts    10000 non-null  int64  
 10  HasCrCard        10000 non-null  int64  
 11  IsActiveMember   10000 non-null  int64  
 12  EstimatedSalary  10000 non-null  float64
 13  Exited           10000 non-null  int64  
dtypes: float64(3), int64(8), object(3)
memory usage: 1.1+ MB


Columns RowNumber, CustomerId and Surname will not be required in our analysis, we will therefore drop them from our dataset      

In [None]:
bank_df = bank_df.drop(['RowNumber','CustomerId','Surname'],axis=1)
bank_df.head()

Unnamed: 0,CreditScore,Geography,Gender,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Exited
0,619,France,Female,42,2.0,0.0,1,1,1,101348.88,1
1,608,Spain,Female,41,1.0,83807.86,1,0,1,112542.58,0
2,502,France,Female,42,8.0,159660.8,3,1,0,113931.57,1
3,699,France,Female,39,1.0,0.0,2,0,0,93826.63,0
4,850,Spain,Female,43,2.0,125510.82,1,1,1,79084.1,0


In [None]:
#checking for null values
bank_df.isnull().sum()

CreditScore          0
Geography            0
Gender               0
Age                  0
Tenure             909
Balance              0
NumOfProducts        0
HasCrCard            0
IsActiveMember       0
EstimatedSalary      0
Exited               0
dtype: int64

We will impute missing values in tenure with the mean, this will not affect the outcome as our model will be evaluated using F1 score

In [None]:
#imputing null values with the mean
bank_df['Tenure']= bank_df['Tenure'].fillna(bank_df['Tenure'].mean())

#converting tenure to type int
bank_df['Tenure'] = bank_df['Tenure'].astype(int)

In [None]:
bank_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 11 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   CreditScore      10000 non-null  int64  
 1   Geography        10000 non-null  object 
 2   Gender           10000 non-null  object 
 3   Age              10000 non-null  int64  
 4   Tenure           10000 non-null  int64  
 5   Balance          10000 non-null  float64
 6   NumOfProducts    10000 non-null  int64  
 7   HasCrCard        10000 non-null  int64  
 8   IsActiveMember   10000 non-null  int64  
 9   EstimatedSalary  10000 non-null  float64
 10  Exited           10000 non-null  int64  
dtypes: float64(2), int64(7), object(2)
memory usage: 859.5+ KB


In [None]:
#checking for duplicates
bank_df.duplicated().sum()

0

#Data Modelling

#Examining the balance of the classes


In [None]:
#Examining the balance of the classes
print(bank_df[bank_df['Exited'] == 1]['Exited'].count())
print(bank_df[bank_df['Exited'] == 0]['Exited'].count())

2037
7963


we do have an imbalanced classification problem

In [None]:
scaler = StandardScaler()

#OHE
bank_df = pd.get_dummies(bank_df, drop_first=True)

#creating features and target
features = bank_df.drop(['Exited'],axis=1)
target =  bank_df['Exited']

#splitting the data into train, test and validation sets, 20% test, 75% train,20% valid
features_train, features_test, target_train, target_test = train_test_split(features, target, test_size=0.20, random_state=12345)
features_train, features_valid, target_train, target_valid = train_test_split(features_train, target_train, test_size=0.2, random_state=12345 )

In [None]:
#Training the model without taking into account the imbalance

model_lr = LogisticRegression(solver='liblinear', random_state=12345)

model_lr.fit(features_train,target_train)

prediction =model_lr.predict(features_train)

accuracy = model_lr.score(features_valid, target_valid)

print('Accuracy', accuracy)
print('f1 score:' ,f1_score(target_valid, model_lr.predict(features_valid)))
print('AUC:', roc_auc_score(target_valid, model_lr.predict_proba(features_valid)[:,1]))


Accuracy 0.800625
f1 score: 0.08069164265129683
AUC: 0.6750307258944861


Using logistic regression and an imbalanced classification, we get an accuracy score of 0.80, an F1 score of 0.08 and an AUC of 0.65


In [None]:
#Training the other models without taking into account the imbalance

model_dt = DecisionTreeClassifier(random_state=12345)
model_rf = RandomForestClassifier(random_state=12345,n_estimators=3)

model_dt.fit(features_train,target_train)
model_rf.fit(features_train,target_train)

prediction_dt =model_dt.predict(features_train)
prediction_rf =model_rf.predict(features_train)

accuracy_dt = model_dt.score(features_valid, target_valid)
accuracy_rf = model_rf.score(features_valid, target_valid)


print('accuracy_dt', accuracy_dt)
print('accuracy_rf', accuracy_rf)

print('f1 score_dt:' ,f1_score(target_valid, model_dt.predict(features_valid)))
print('f1 score_rf:' ,f1_score(target_valid, model_rf.predict(features_valid)))

print('AUC_dt:', roc_auc_score(target_valid, model_dt.predict_proba(features_valid)[:,1]))
print('AUC_rf:', roc_auc_score(target_valid, model_rf.predict_proba(features_valid)[:,1]))

accuracy_dt 0.793125
accuracy_rf 0.825
f1 score_dt: 0.46353322528363045
f1 score_rf: 0.49640287769784175
AUC_dt: 0.6700522403820953
AUC_rf: 0.7437880256799774


the accuracy of random forest is highest at 0.82, the f1 scores of random forest and decision trees are higher than that of logistic regression meaning they have an improved F1 score, AUC of decision tree and that of losigtic regression are very close and the highest AUC is that of random forest

#Improving model quality and fixing class imbalance

Downsampling

In [None]:
#downsampling
def downsample(features, target, fraction):
    features_zeros = features[target == 0]
    features_ones = features[target == 1]
    target_zeros = target[target == 0]
    target_ones = target[target == 1]

    features_downsampled = pd.concat(
        [features_zeros.sample(frac=fraction, random_state=12345)] + [features_ones])
    target_downsampled = pd.concat(
        [target_zeros.sample(frac=fraction, random_state=12345)] + [target_ones])
    
    features_downsampled, target_downsampled = shuffle(
        features_downsampled, target_downsampled, random_state=12345)
    
    return features_downsampled, target_downsampled

features_downsampled, target_downsampled = downsample(features_train, target_train, 0.1)

model_lr_downsample =LogisticRegression(random_state=12345,solver='liblinear')
model_lr_downsample.fit(features_downsampled, target_downsampled)
predicted_valid_downsample = model_lr_downsample.predict(features_valid)

print("F1:", f1_score(target_valid, predicted_valid_downsample))
print('Accuracy:', model_lr_downsample.score(features_valid, target_valid))
print("AUC-ROC:", roc_auc_score(target_valid, model_lr_downsample.predict_proba(features_valid)[:,1]))

F1: 0.33906071019473083
Accuracy: 0.27875
AUC-ROC: 0.7149298584445954


we need an F1 Score of at least 0.59. downsampling using logistic regression gives us 0.33. we will proceed to try upsampling

Upsampling

In [None]:
#Upsampling
def upsample(features, target, repeat):
    features_zeros = features[target == 0]
    features_ones = features[target == 1]
    target_zeros = target[target == 0]
    target_ones = target[target == 1]

    features_upsampled = pd.concat([features_zeros] + [features_ones] * repeat)
    target_upsampled = pd.concat([target_zeros] + [target_ones] * repeat)
    
    features_upsampled, target_upsampled = shuffle(
        features_upsampled, target_upsampled, random_state=12345)
    
    return features_upsampled, target_upsampled

features_upsampled, target_upsampled = upsample(features_train, target_train, 10)

model_lr_upsample =LogisticRegression(random_state=12345,solver='liblinear')
model_lr_upsample.fit(features_upsampled, target_upsampled)
predicted_valid_upsample = model_lr_upsample.predict(features_valid)


print('Accuracy', model_lr_upsample.score(features_valid, target_valid))
print('f1 score:' ,f1_score(target_valid, predicted_valid_upsample))
print('AUC:',roc_auc_score(target_valid, model_lr_upsample.predict_proba(features_valid)[:,1]))

Accuracy 0.294375
f1 score: 0.34398605461940734
AUC: 0.7221005061184609


F1 score slightly improves but still with logistic regression.

Fixing class imbalance with decision tree

In [None]:
#downsampling decision tree
def downsample(features, target, fraction):
    features_zeros = features[target == 0]
    features_ones = features[target == 1]
    target_zeros = target[target == 0]
    target_ones = target[target == 1]

    features_downsampled = pd.concat(
        [features_zeros.sample(frac=fraction, random_state=12345)] + [features_ones])
    target_downsampled = pd.concat(
        [target_zeros.sample(frac=fraction, random_state=12345)] + [target_ones])
    
    features_downsampled, target_downsampled = shuffle(
        features_downsampled, target_downsampled, random_state=12345)
    
    return features_downsampled, target_downsampled

features_downsampled, target_downsampled = downsample(features_train, target_train, 0.1)

model_dt_downsample =DecisionTreeClassifier(random_state=12345)
model_dt_downsample.fit(features_downsampled, target_downsampled)
predicted_valid_downsample_dt = model_dt_downsample.predict(features_valid)

print("F1:", f1_score(target_valid, predicted_valid_downsample_dt))
print('Accuracy:', model_dt_downsample.score(features_valid, target_valid))
print("AUC-ROC:", roc_auc_score(target_valid, model_dt_downsample.predict_proba(features_valid)[:,1]))

F1: 0.4186046511627908
Accuracy: 0.59375
AUC-ROC: 0.6621602021420339


In [None]:
#Upsampling decision tree
def upsample(features, target, repeat):
    features_zeros = features[target == 0]
    features_ones = features[target == 1]
    target_zeros = target[target == 0]
    target_ones = target[target == 1]

    features_upsampled = pd.concat([features_zeros] + [features_ones] * repeat)
    target_upsampled = pd.concat([target_zeros] + [target_ones] * repeat)
    
    features_upsampled, target_upsampled = shuffle(
        features_upsampled, target_upsampled, random_state=12345)
    
    return features_upsampled, target_upsampled

features_upsampled, target_upsampled = upsample(features_train, target_train, 10)

model_dt_upsample =DecisionTreeClassifier(random_state=12345)
model_dt_upsample.fit(features_upsampled, target_upsampled)
predicted_valid_upsample_dt = model_lr_upsample.predict(features_valid)


print('Accuracy', model_dt_upsample.score(features_valid, target_valid))
print('f1 score:' ,f1_score(target_valid, predicted_valid_upsample_dt))
print('AUC:',roc_auc_score(target_valid, model_dt_upsample.predict_proba(features_valid)[:,1]))

Accuracy 0.803125
f1 score: 0.34398605461940734
AUC: 0.6698970205424551


Upsampling using decision tree raises our F1 score to 0.41 but this is still low. we will now try random forest

Fixing class imbalance with random forest

In [None]:
#downsampling random forest
def downsample(features, target, fraction):
    features_zeros = features[target == 0]
    features_ones = features[target == 1]
    target_zeros = target[target == 0]
    target_ones = target[target == 1]

    features_downsampled = pd.concat(
        [features_zeros.sample(frac=fraction, random_state=12345)] + [features_ones])
    target_downsampled = pd.concat(
        [target_zeros.sample(frac=fraction, random_state=12345)] + [target_ones])
    
    features_downsampled, target_downsampled = shuffle(
        features_downsampled, target_downsampled, random_state=12345)
    
    return features_downsampled, target_downsampled

features_downsampled, target_downsampled = downsample(features_train, target_train, 0.1)

model_rf_downsample =RandomForestClassifier(random_state=12345)
model_rf_downsample.fit(features_downsampled, target_downsampled)
predicted_valid_downsample_rf = model_rf_downsample.predict(features_valid)

print("F1:", f1_score(target_valid, predicted_valid_downsample_dt))
print('Accuracy:', model_rf_downsample.score(features_valid, target_valid))
print("AUC-ROC:", roc_auc_score(target_valid, model_rf_downsample.predict_proba(features_valid)[:,1]))

F1: 0.4186046511627908
Accuracy: 0.565625
AUC-ROC: 0.8377062070123744


In [None]:
#Upsampling random forest
def upsample(features, target, repeat):
    features_zeros = features[target == 0]
    features_ones = features[target == 1]
    target_zeros = target[target == 0]
    target_ones = target[target == 1]

    features_upsampled = pd.concat([features_zeros] + [features_ones] * repeat)
    target_upsampled = pd.concat([target_zeros] + [target_ones] * repeat)
    
    features_upsampled, target_upsampled = shuffle(
        features_upsampled, target_upsampled, random_state=12345)
    
    return features_upsampled, target_upsampled

features_upsampled, target_upsampled = upsample(features_train, target_train, 10)

model_rf_upsample =RandomForestClassifier(random_state=12345)
model_rf_upsample.fit(features_upsampled, target_upsampled)
predicted_valid_upsample_rf = model_rf_upsample.predict(features_valid)


print('Accuracy', model_rf_upsample.score(features_valid, target_valid))
print('f1 score:' ,f1_score(target_valid, predicted_valid_upsample_rf))
print('AUC:',roc_auc_score(target_valid, model_rf_upsample.predict_proba(features_valid)[:,1]))

Accuracy 0.860625
f1 score: 0.5981981981981981
AUC: 0.8516556358797022


Random forest is our ideal model with an F1 score of 0.59 which is above our target f1 score of 0.56. AUC is also above 0.85

In [None]:
#tuning hyperparameters
for estimator in range(1,15):
        model_rf_upsample = RandomForestClassifier(random_state=12345, n_estimators=estimator,class_weight='balanced')
        
        model_rf_upsample.fit(features_train, target_train)

        prediction_rf_2 = model_rf_upsample.predict(features_valid)
        #score(test_features, target)
        #predictions = model.predict(test_features)
        accuracy_rf_2 = accuracy_score(target_valid, prediction_rf_2)
       
        print("estimator =", estimator, ": ", end='')
        print(accuracy_score(target_valid, prediction_rf_2))
        print('f1 score:' ,f1_score(target_valid, prediction_rf_2))

estimator = 1 : 0.791875
f1 score: 0.4722662440570523
estimator = 2 : 0.8425
f1 score: 0.4375
estimator = 3 : 0.828125
f1 score: 0.5098039215686275
estimator = 4 : 0.84875
f1 score: 0.48945147679324896
estimator = 5 : 0.85
f1 score: 0.5437262357414449


estimator 13 has the highest f1 score of 0.568, we will use this to tune the max_depth

In [None]:
#tuning hyperparameters
for depth in range(1,15):
        model_rf_upsample = RandomForestClassifier(random_state=12345,max_depth=depth, n_estimators=13,class_weight='balanced')
        
        model_rf_upsample.fit(features_train, target_train)

        prediction_rf_2 = model_rf_upsample.predict(features_valid)
        #score(test_features, target)
        #predictions = model.predict(test_features)
        accuracy_rf_2 = accuracy_score(target_valid, prediction_rf_2)
       
        print("depth =", depth, ": ", end='')
        print(accuracy_score(target_valid, prediction_rf_2))
        print('f1 score:' ,f1_score(target_valid, prediction_rf_2))

at max_depth =11 we get the highest F1 score of 0.605. We will tune our final model with n_estimator = 13 and max_depth =11

In [None]:
#final modelling and testing
final_model = RandomForestClassifier(random_state=12345,max_depth=11, n_estimators=13,class_weight='balanced')
final_model.fit(features_train, target_train)
predicted_test_rf = final_model.predict(features_test)
print("F1:", f1_score(target_test, predicted_test_rf))
print("AUC-ROC:", roc_auc_score(target_test, final_model.predict_proba(features_test)[:,1]))
print('Accuracy:', final_model.score(features_valid, target_valid))

#Conclusion

With an F1 score of 0.60 and AUC-ROC of 0.84 we can conclude that we have a quality model and meet the requirement of an F1 score of at least 0.59