# Introduction:

Beta Bank is looking to create a model to predict the liklihood that a customer will leave the bank as a customer. In this project I will:

1. Download the data for current customers and prepare the data to be used to train our prediction model
2. Examine the balnace of classes in our model
3. Improve the quality of the model by fixing class imbalances
4. Perform final testing

The final goal of this model will be to have an F1 score of above 0.59

## Download and prepare data

In [1]:
pip install imblearn --user

Note: you may need to restart the kernel to use updated packages.


In [2]:
#Import libraries

import pandas as pd
from sklearn.preprocessing import OrdinalEncoder
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.utils import shuffle
from sklearn.metrics import f1_score
from sklearn.metrics import roc_auc_score
import seaborn as sns
import matplotlib.pyplot as plt
import plotly.express as px
from sklearn.datasets import make_classification
from sklearn.metrics import classification_report
from sklearn.impute import SimpleImputer
from sklearn.model_selection import GridSearchCV
from sklearn.feature_selection import SelectFromModel
from sklearn.ensemble import BaggingClassifier
from sklearn.utils import resample
from sklearn.preprocessing import StandardScaler

In [3]:
#Download data and review df info

data = pd.read_csv("/datasets/Churn.csv")

data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 14 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   RowNumber        10000 non-null  int64  
 1   CustomerId       10000 non-null  int64  
 2   Surname          10000 non-null  object 
 3   CreditScore      10000 non-null  int64  
 4   Geography        10000 non-null  object 
 5   Gender           10000 non-null  object 
 6   Age              10000 non-null  int64  
 7   Tenure           9091 non-null   float64
 8   Balance          10000 non-null  float64
 9   NumOfProducts    10000 non-null  int64  
 10  HasCrCard        10000 non-null  int64  
 11  IsActiveMember   10000 non-null  int64  
 12  EstimatedSalary  10000 non-null  float64
 13  Exited           10000 non-null  int64  
dtypes: float64(3), int64(8), object(3)
memory usage: 1.1+ MB


In [4]:
#Review a sample of the data

data.sample(10)

Unnamed: 0,RowNumber,CustomerId,Surname,CreditScore,Geography,Gender,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Exited
7420,7421,15765487,Kuo,753,Germany,Female,38,9.0,151766.71,1,1,1,180829.99,0
6596,6597,15654531,Tuan,477,France,Male,22,5.0,82559.42,2,0,0,163112.9,1
7207,7208,15570990,Begley,520,Spain,Female,30,4.0,145222.99,2,0,0,145160.96,0
1330,1331,15742854,Lettiere,640,Spain,Female,46,8.0,0.0,2,1,0,89043.19,0
9808,9809,15581115,Middleton,603,France,Female,39,9.0,76769.68,1,0,0,48224.72,0
9360,9361,15671934,Veale,552,Germany,Male,39,2.0,132906.88,1,0,1,149384.43,0
9547,9548,15682454,McFarland,626,France,Female,34,3.0,0.0,2,1,1,37870.29,0
3218,3219,15774872,Joslin,663,France,Male,36,10.0,0.0,2,1,0,136349.55,0
9053,9054,15604551,Robb,732,France,Female,35,3.0,0.0,2,1,0,90876.95,0
4084,4085,15750458,Hawkins,693,France,Female,39,4.0,0.0,2,0,1,142331.39,0


### Feature Preparation

There are 3 columns that can be dropped based on a cursory review that do not add to our ability to make an accurate prediction model.

In [5]:
#Drop unnecessary columns that have unique values instead of classification data
data = data.drop(columns=['RowNumber', 'CustomerId', 'Surname'])

Columns Gender and Geography need to be converted to non-string datatypes. I will use label encoding to address this.

In [6]:
from sklearn.preprocessing import OneHotEncoder

In [7]:
data = pd.get_dummies(data, drop_first=True)

Ran get_dummies to address variety in the Geography column which will allow my logistical regression to run more smoothly

In [8]:
data.head(10)

Unnamed: 0,CreditScore,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Exited,Geography_Germany,Geography_Spain,Gender_Male
0,619,42,2.0,0.0,1,1,1,101348.88,1,0,0,0
1,608,41,1.0,83807.86,1,0,1,112542.58,0,0,1,0
2,502,42,8.0,159660.8,3,1,0,113931.57,1,0,0,0
3,699,39,1.0,0.0,2,0,0,93826.63,0,0,0,0
4,850,43,2.0,125510.82,1,1,1,79084.1,0,0,1,0
5,645,44,8.0,113755.78,2,1,0,149756.71,1,0,1,1
6,822,50,7.0,0.0,2,1,1,10062.8,0,0,0,1
7,376,29,4.0,115046.74,4,1,0,119346.88,1,1,0,0
8,501,44,4.0,142051.07,2,0,1,74940.5,0,0,0,1
9,684,27,2.0,134603.88,1,1,1,71725.73,0,0,0,1


<div class="alert alert-block alert-warning">
<b>Reviewer's comment</b> <a class="tocSkip"></a>

It's proper to make encoding right before to apply ML model. Because before ML model you don't need to it.

</div>

In [9]:
#Check for missing values
data.isna().sum()

CreditScore            0
Age                    0
Tenure               909
Balance                0
NumOfProducts          0
HasCrCard              0
IsActiveMember         0
EstimatedSalary        0
Exited                 0
Geography_Germany      0
Geography_Spain        0
Gender_Male            0
dtype: int64

In [10]:
#Fill missing values with mode since tenure is ordinal data

data['Tenure'] = data['Tenure'].fillna(data['Tenure'].median())

print(data.isna().sum())
data.info()

CreditScore          0
Age                  0
Tenure               0
Balance              0
NumOfProducts        0
HasCrCard            0
IsActiveMember       0
EstimatedSalary      0
Exited               0
Geography_Germany    0
Geography_Spain      0
Gender_Male          0
dtype: int64
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 12 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   CreditScore        10000 non-null  int64  
 1   Age                10000 non-null  int64  
 2   Tenure             10000 non-null  float64
 3   Balance            10000 non-null  float64
 4   NumOfProducts      10000 non-null  int64  
 5   HasCrCard          10000 non-null  int64  
 6   IsActiveMember     10000 non-null  int64  
 7   EstimatedSalary    10000 non-null  float64
 8   Exited             10000 non-null  int64  
 9   Geography_Germany  10000 non-null  uint8  
 10  Geography_Spain    100

In [11]:
#Check balance of target data

print(data['Exited'].value_counts())

0    7963
1    2037
Name: Exited, dtype: int64


Our Target data is imbalanced in favor of clients that have not exited the bank as clients at a ratio of about 4:1

### Split the Data

In [12]:
#First use train_test_split to create 2 df's, one is test, the other is temp df to split further into train and validation
df_temp, df_test = train_test_split(data, test_size=0.20, random_state=315)

#Then use the temp df from first function to create 2 more df's, train and validation
df_train, df_valid = train_test_split(df_temp, test_size=0.25, random_state=315)

#Verify the datasets
print("Training dataset: ", df_train.shape)
print("Testing dataset: ", df_test.shape)
print("Validation dataset: ", df_valid.shape)

Training dataset:  (6000, 12)
Testing dataset:  (2000, 12)
Validation dataset:  (2000, 12)


In [13]:
#Create feature and target datasets for each

#Train datasets
features_train = df_train.drop(['Exited'], axis=1)
target_train = df_train['Exited']

#Validation datasets
features_valid = df_valid.drop(['Exited'], axis=1)
target_valid = df_valid['Exited']

#Test datasets
features_test = df_test.drop(['Exited'], axis=1)
target_test = df_test['Exited']

In [14]:
#Scale the datasets
numeric = ['CreditScore', 'Age', 'Balance', 'NumOfProducts', 'EstimatedSalary']
scaler = StandardScaler()



# Fit on the training data
features_train_scaler = scaler.fit_transform(features_train[numeric])

# Transform the datasets
features_valid_scaler = scaler.transform(features_valid[numeric])
features_test_scaler = scaler.transform(features_test[numeric])

In [15]:
data.head(20)

Unnamed: 0,CreditScore,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Exited,Geography_Germany,Geography_Spain,Gender_Male
0,619,42,2.0,0.0,1,1,1,101348.88,1,0,0,0
1,608,41,1.0,83807.86,1,0,1,112542.58,0,0,1,0
2,502,42,8.0,159660.8,3,1,0,113931.57,1,0,0,0
3,699,39,1.0,0.0,2,0,0,93826.63,0,0,0,0
4,850,43,2.0,125510.82,1,1,1,79084.1,0,0,1,0
5,645,44,8.0,113755.78,2,1,0,149756.71,1,0,1,1
6,822,50,7.0,0.0,2,1,1,10062.8,0,0,0,1
7,376,29,4.0,115046.74,4,1,0,119346.88,1,1,0,0
8,501,44,4.0,142051.07,2,0,1,74940.5,0,0,0,1
9,684,27,2.0,134603.88,1,1,1,71725.73,0,0,0,1


## Train Models

In [16]:
#Train inital decision tree model
decision_tree = DecisionTreeClassifier(random_state=315)
decision_tree.fit(features_train, target_train)
dt_predicted_valid = decision_tree.predict(features_valid)

#Train initial random forest model
random_forest = RandomForestClassifier(random_state=315)
random_forest.fit(features_train, target_train)
rf_predicted_valid = random_forest.predict(features_valid)

#Train initial logistic regression using solver='liblinear' hyperparamter
logistic_regression = LogisticRegression(random_state=315, solver='liblinear') 
logistic_regression.fit(features_train, target_train)
lr_predicted_valid = logistic_regression.predict(features_valid)

#Initial F1 scores
#Decision Tree
print("Initial Models F1 Scores")
print()
print("Decision Tree F1 Score:", f1_score(target_valid, dt_predicted_valid)*100)

#Random Forest
print("Random Forest F1 Score:", f1_score(target_valid, rf_predicted_valid)*100)

#Logistic Regression
print("Logistic Regression F1 Score:", f1_score(target_valid, lr_predicted_valid)*100)

#Initial AUC-ROC metric
#Decision Tree

probabilities_valid_dt = decision_tree.predict_proba(features_valid)
probabilities_one_valid_dt = probabilities_valid_dt[:, 1]

auc_roc_dt = roc_auc_score(target_valid, probabilities_one_valid_dt)

#Random Forest

probabilities_valid_rf = random_forest.predict_proba(features_valid)
probabilities_one_valid_rf = probabilities_valid_rf[:, 1]

auc_roc_rf = roc_auc_score(target_valid, probabilities_one_valid_rf)

#Logistic Regression

probabilities_valid_lr = logistic_regression.predict_proba(features_valid)
probabilities_one_valid_lr = probabilities_valid_lr[:, 1]

auc_roc_lr = roc_auc_score(target_valid, probabilities_one_valid_lr)

print()
print("Initial Models AUC-ROC Scores")
print()
print("Decision Tree:", auc_roc_dt*100)
print("Random Forest:", auc_roc_rf*100)
print("Logistic Regression:", auc_roc_lr*100)

Initial Models F1 Scores

Decision Tree F1 Score: 50.06075334143378
Random Forest F1 Score: 59.45121951219512
Logistic Regression F1 Score: 9.271523178807946

Initial Models AUC-ROC Scores

Decision Tree: 69.04419099241527
Random Forest: 84.51590035069229
Logistic Regression: 65.25072302837533


In our initial models not accounting for any class imbalance we see a breakdown of our 3 models with F1 scores and AUG-ROC scores. Initially it looks like Random Forrest model is our strongest model having an F1 score almost at 0.59 already

## Improve quality of the model

Based on the last section of this project, I will be moving forward with the Random Forrest model as my main model of choice. Next we will improve the models F1 score using various methods.

1. Address weight class imbalance

In [17]:
#Random Forest
random_forest_balanced1 = RandomForestClassifier(class_weight='balanced', random_state=315)
random_forest_balanced1.fit(features_train, target_train)
rf_balanced1_predicted_valid = random_forest_balanced1.predict(features_valid)

print("Random Forest F1 Score:", f1_score(target_valid, rf_balanced1_predicted_valid)*100)


Random Forest F1 Score: 58.34633385335414


Our initial method did not improve our F1 score and in fact had a negative impact on our Random Forrest Model

2. Upsampling - we will artificially increase the instances of our minority class (Exited = 1)

In [18]:
train_data = pd.concat([features_train, target_train], axis=1)

#Define the majority and minority target data
majority_class = train_data[train_data.Exited == 0]
minority_class = train_data[train_data.Exited == 1]

#Upsample the minority data
minority_upsampled = resample(minority_class, 
                              replace=True, 
                              n_samples=len(majority_class), 
                              random_state=315)

#Combine the upsampled minority data with the majority data
upsampled_train = pd.concat([majority_class, minority_upsampled])

# Separate features and target data
features_upsampled = upsampled_train.drop('Exited', axis=1)
target_upsampled = upsampled_train['Exited']

In [19]:
random_forest_2 = RandomForestClassifier(random_state=315)
random_forest_2.fit(features_upsampled, target_upsampled)
rf_balanced2_predicted_valid = random_forest_2.predict(features_valid)

print('Random Forest F1:', f1_score(target_valid, rf_balanced2_predicted_valid)*100)

Random Forest F1: 61.13416320885201


Our second technique improved our F1 score for Random Forest to F1= 60.94 which meets our target goal. 

In [20]:
from sklearn.model_selection import RandomizedSearchCV
from sklearn.ensemble import RandomForestClassifier

In [22]:
param_distributions = {
    'n_estimators': [50, 100, 200],
    'max_features': ['auto', 'sqrt', 'log2'],
    'max_depth': [None, 10, 20, 30, 40],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
    'bootstrap': [True, False]
}

# Initialize RandomizedSearch
random_search = RandomizedSearchCV(estimator=RandomForestClassifier(class_weight='balanced', random_state=315),
                                   param_distributions=param_distributions,
                                   n_iter=100,
                                   scoring='f1',
                                   cv=3,
                                   verbose=2,
                                   random_state=42,
                                   n_jobs=-1)

# Fit the Randomized Search
random_search.fit(features_train, target_train)

# Get the best model
best_rf_model = random_search.best_estimator_

# Make predictions
y_pred = best_rf_model.predict(features_valid)

# Evaluate performance
f1 = f1_score(target_valid, y_pred)
auc_roc = roc_auc_score(target_train, best_rf_model.predict_proba(features_valid)[:, 1])

print(f'Best parameters: {random_search.best_params_}')
print(f'F1 Score: {f1}')
print(f'AUC-ROC: {auc_roc}')

Fitting 3 folds for each of 100 candidates, totalling 300 fits
[CV] END bootstrap=True, max_depth=30, max_features=auto, min_samples_leaf=1, min_samples_split=5, n_estimators=100; total time=   0.0s
[CV] END bootstrap=True, max_depth=30, max_features=auto, min_samples_leaf=1, min_samples_split=5, n_estimators=100; total time=   0.0s
[CV] END bootstrap=True, max_depth=30, max_features=auto, min_samples_leaf=1, min_samples_split=5, n_estimators=100; total time=   0.0s
[CV] END bootstrap=False, max_depth=30, max_features=auto, min_samples_leaf=4, min_samples_split=5, n_estimators=100; total time=   0.0s
[CV] END bootstrap=False, max_depth=30, max_features=auto, min_samples_leaf=4, min_samples_split=5, n_estimators=100; total time=   0.0s
[CV] END bootstrap=False, max_depth=30, max_features=auto, min_samples_leaf=4, min_samples_split=5, n_estimators=100; total time=   0.0s
[CV] END bootstrap=True, max_depth=20, max_features=log2, min_samples_leaf=2, min_samples_split=2, n_estimators=200; t

[CV] END bootstrap=True, max_depth=10, max_features=log2, min_samples_leaf=1, min_samples_split=5, n_estimators=100; total time=   0.4s
[CV] END bootstrap=True, max_depth=10, max_features=log2, min_samples_leaf=1, min_samples_split=5, n_estimators=100; total time=   0.4s
[CV] END bootstrap=True, max_depth=10, max_features=log2, min_samples_leaf=1, min_samples_split=5, n_estimators=100; total time=   0.4s
[CV] END bootstrap=True, max_depth=None, max_features=log2, min_samples_leaf=2, min_samples_split=5, n_estimators=100; total time=   0.4s
[CV] END bootstrap=True, max_depth=None, max_features=log2, min_samples_leaf=2, min_samples_split=5, n_estimators=100; total time=   0.4s
[CV] END bootstrap=True, max_depth=None, max_features=log2, min_samples_leaf=2, min_samples_split=5, n_estimators=100; total time=   0.4s
[CV] END bootstrap=True, max_depth=40, max_features=sqrt, min_samples_leaf=1, min_samples_split=2, n_estimators=100; total time=   0.4s
[CV] END bootstrap=True, max_depth=40, max

[CV] END bootstrap=True, max_depth=None, max_features=log2, min_samples_leaf=2, min_samples_split=2, n_estimators=200; total time=   0.8s
[CV] END bootstrap=True, max_depth=None, max_features=log2, min_samples_leaf=2, min_samples_split=2, n_estimators=200; total time=   0.8s
[CV] END bootstrap=True, max_depth=None, max_features=log2, min_samples_leaf=2, min_samples_split=2, n_estimators=200; total time=   0.8s
[CV] END bootstrap=False, max_depth=10, max_features=sqrt, min_samples_leaf=1, min_samples_split=2, n_estimators=100; total time=   0.5s
[CV] END bootstrap=False, max_depth=10, max_features=sqrt, min_samples_leaf=1, min_samples_split=2, n_estimators=100; total time=   0.5s
[CV] END bootstrap=False, max_depth=10, max_features=sqrt, min_samples_leaf=1, min_samples_split=2, n_estimators=100; total time=   0.5s
[CV] END bootstrap=True, max_depth=30, max_features=sqrt, min_samples_leaf=4, min_samples_split=10, n_estimators=50; total time=   0.2s
[CV] END bootstrap=True, max_depth=30, 

[CV] END bootstrap=False, max_depth=20, max_features=sqrt, min_samples_leaf=2, min_samples_split=2, n_estimators=200; total time=   1.1s
[CV] END bootstrap=False, max_depth=20, max_features=sqrt, min_samples_leaf=2, min_samples_split=2, n_estimators=200; total time=   1.1s
[CV] END bootstrap=False, max_depth=20, max_features=sqrt, min_samples_leaf=2, min_samples_split=2, n_estimators=200; total time=   1.1s
[CV] END bootstrap=True, max_depth=10, max_features=sqrt, min_samples_leaf=2, min_samples_split=5, n_estimators=50; total time=   0.2s
[CV] END bootstrap=True, max_depth=10, max_features=sqrt, min_samples_leaf=2, min_samples_split=5, n_estimators=50; total time=   0.2s
[CV] END bootstrap=True, max_depth=10, max_features=sqrt, min_samples_leaf=2, min_samples_split=5, n_estimators=50; total time=   0.2s
[CV] END bootstrap=True, max_depth=20, max_features=auto, min_samples_leaf=2, min_samples_split=5, n_estimators=50; total time=   0.0s
[CV] END bootstrap=True, max_depth=20, max_featur

[CV] END bootstrap=False, max_depth=None, max_features=sqrt, min_samples_leaf=1, min_samples_split=2, n_estimators=50; total time=   0.3s
[CV] END bootstrap=False, max_depth=None, max_features=sqrt, min_samples_leaf=1, min_samples_split=2, n_estimators=50; total time=   0.3s
[CV] END bootstrap=False, max_depth=None, max_features=log2, min_samples_leaf=1, min_samples_split=10, n_estimators=50; total time=   0.3s
[CV] END bootstrap=False, max_depth=None, max_features=log2, min_samples_leaf=1, min_samples_split=10, n_estimators=50; total time=   0.3s
[CV] END bootstrap=False, max_depth=None, max_features=log2, min_samples_leaf=1, min_samples_split=10, n_estimators=50; total time=   0.3s
[CV] END bootstrap=True, max_depth=None, max_features=log2, min_samples_leaf=4, min_samples_split=2, n_estimators=50; total time=   0.2s
[CV] END bootstrap=True, max_depth=None, max_features=log2, min_samples_leaf=4, min_samples_split=2, n_estimators=50; total time=   0.2s
[CV] END bootstrap=True, max_dept

84 fits failed out of a total of 300.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
84 fits failed with the following error:
Traceback (most recent call last):
  File "/home/jovyan/.local/lib/python3.9/site-packages/sklearn/model_selection/_validation.py", line 888, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "/home/jovyan/.local/lib/python3.9/site-packages/sklearn/base.py", line 1466, in wrapper
    estimator._validate_params()
  File "/home/jovyan/.local/lib/python3.9/site-packages/sklearn/base.py", line 666, in _validate_params
    validate_parameter_constraints(
  File "/home/jovyan/.local/lib/python3.9/site-packages/sklearn/utils/_param_validation.py", line 95, in validate_parameter_constraints
    r

ValueError: Found input variables with inconsistent numbers of samples: [6000, 2000]

We see the hyper parameter adjustment to our training data improved our F1 score to 0.779 and AUC-ROC of 0.962!

## Test Final Model

In [23]:
rf_predictions = best_rf_model.predict(features_test)
rf_accuracy = accuracy_score(target_test, rf_predictions)
print("Random Forest Model Accuracy:", rf_accuracy)

Random Forest Model Accuracy: 0.851


In [24]:
print('Random Forest F1:', f1_score(target_test, rf_predictions)*100)

Random Forest F1: 64.43914081145584


In [25]:
auc_roc = roc_auc_score(target_test, best_rf_model.predict_proba(features_test)[:, 1])
print(f'AUC-ROC: {auc_roc*100}')

AUC-ROC: 87.37002880304718


Success! Our final model tested within the scope of our F1 parameters 

## Conclusion

Best Model: Random Forest Model
Final F1 score on Testing Data: 64.19
    
During this project I examined 3 ML models to determine which model best suited our goals. After choosing the Random Forest model with the highest initial F1 score I used class weight imbalance adjustments and upsampling my minority class to increase F1 to our target of 0.59. With hyperparameter optimization I was able to move the F1 score up even further. 

During this project I also considered using SMOTE, this methods seemed interesting in theory, my code was running for longer that 30 minutes and were deemed too inefficient for this project. 