# Introduction

### Stated Objective(s) Here:
Beta Bank's customer retention rate is slowly deteriorating and they want to predict whether a customer will leave the bank soon or not.

### Stated Goal(s) Here:
Build a model that predicts customer churn with an F1 score of at least 0.59, and report the AUC-ROC metric alongside it for comparison.

### Initial Question(s): 
Which model will outperform the other two models?

In [1]:
#Load Libraries Here
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.utils import shuffle
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import precision_score, recall_score, f1_score, roc_auc_score, accuracy_score, confusion_matrix
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

In [2]:
#Load Data Here
data = pd.read_csv('/datasets/Churn.csv')

# Data Wrangling

This section explores the data and cleans it up so that it is ready for modeling.

In [3]:
data.head(30)

Unnamed: 0,RowNumber,CustomerId,Surname,CreditScore,Geography,Gender,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Exited
0,1,15634602,Hargrave,619,France,Female,42,2.0,0.0,1,1,1,101348.88,1
1,2,15647311,Hill,608,Spain,Female,41,1.0,83807.86,1,0,1,112542.58,0
2,3,15619304,Onio,502,France,Female,42,8.0,159660.8,3,1,0,113931.57,1
3,4,15701354,Boni,699,France,Female,39,1.0,0.0,2,0,0,93826.63,0
4,5,15737888,Mitchell,850,Spain,Female,43,2.0,125510.82,1,1,1,79084.1,0
5,6,15574012,Chu,645,Spain,Male,44,8.0,113755.78,2,1,0,149756.71,1
6,7,15592531,Bartlett,822,France,Male,50,7.0,0.0,2,1,1,10062.8,0
7,8,15656148,Obinna,376,Germany,Female,29,4.0,115046.74,4,1,0,119346.88,1
8,9,15792365,He,501,France,Male,44,4.0,142051.07,2,0,1,74940.5,0
9,10,15592389,H?,684,France,Male,27,2.0,134603.88,1,1,1,71725.73,0


In [4]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 14 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   RowNumber        10000 non-null  int64  
 1   CustomerId       10000 non-null  int64  
 2   Surname          10000 non-null  object 
 3   CreditScore      10000 non-null  int64  
 4   Geography        10000 non-null  object 
 5   Gender           10000 non-null  object 
 6   Age              10000 non-null  int64  
 7   Tenure           9091 non-null   float64
 8   Balance          10000 non-null  float64
 9   NumOfProducts    10000 non-null  int64  
 10  HasCrCard        10000 non-null  int64  
 11  IsActiveMember   10000 non-null  int64  
 12  EstimatedSalary  10000 non-null  float64
 13  Exited           10000 non-null  int64  
dtypes: float64(3), int64(8), object(3)
memory usage: 1.1+ MB


In [5]:
data.isna().sum()

RowNumber            0
CustomerId           0
Surname              0
CreditScore          0
Geography            0
Gender               0
Age                  0
Tenure             909
Balance              0
NumOfProducts        0
HasCrCard            0
IsActiveMember       0
EstimatedSalary      0
Exited               0
dtype: int64

In [6]:
data['Tenure'].unique()

array([ 2.,  1.,  8.,  7.,  4.,  6.,  3., 10.,  5.,  9.,  0., nan])

In [7]:
data.sample(10)

Unnamed: 0,RowNumber,CustomerId,Surname,CreditScore,Geography,Gender,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Exited
4969,4970,15584477,K?,655,Spain,Female,35,1.0,106405.03,1,1,1,82900.25,0
8783,8784,15617052,Watson,782,France,Male,34,9.0,0.0,1,1,0,183021.06,1
1297,1298,15793247,Hancock,498,France,Male,34,5.0,0.0,2,1,1,91711.66,0
5620,5621,15752409,Grant,553,France,Male,31,6.0,0.0,2,0,0,124596.63,0
7895,7896,15660571,Halpern,668,Spain,Male,43,,113034.31,1,1,1,100423.88,0
2900,2901,15668575,Hao,626,Spain,Female,26,,148610.41,3,0,1,104502.02,1
4234,4235,15567335,Allsop,559,France,Female,42,7.0,0.0,2,1,1,190040.29,0
306,307,15594898,Hewitt,731,France,Male,43,2.0,0.0,1,1,1,170034.95,1
6582,6583,15785975,Mason,525,Spain,Female,60,7.0,0.0,2,0,1,168034.9,0
7943,7944,15774250,Gallo,532,France,Male,42,1.0,159024.71,1,1,0,100982.93,1


The Tenure column has 909 *NaN* values. These rows may belong both to customers who left and to customers who stayed, so before replacing the NaN values with the column's mean I check how they are distributed across the two groups.

In [8]:
nan_values = data.query('Tenure.isna()')
nan_values.head()

Unnamed: 0,RowNumber,CustomerId,Surname,CreditScore,Geography,Gender,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Exited
30,31,15589475,Azikiwe,591,Spain,Female,39,,0.0,3,1,0,140469.38,1
48,49,15766205,Yin,550,Germany,Male,38,,103391.38,1,0,1,90878.13,0
51,52,15768193,Trevisani,585,Germany,Male,36,,146050.97,2,0,0,86424.57,0
53,54,15702298,Parkhill,655,Germany,Male,41,,125561.97,1,0,0,164040.94,1
60,61,15651280,Hunter,742,Germany,Male,35,,136857.0,1,0,0,84509.57,0


In [9]:
is_customer = nan_values[nan_values['Exited'] == 0]
not_customer = nan_values[nan_values['Exited'] == 1]

print(is_customer['Exited'].value_counts())
print(not_customer['Exited'].value_counts())

0    726
Name: Exited, dtype: int64
1    183
Name: Exited, dtype: int64


A closer look shows that the 909 rows split into 726 current customers and 183 who exited. I will give them an average placeholder value so I can proceed with the main objective: training a model that helps the bank retain its current customers.
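As a quick sanity check on that choice, the counts printed above can be compared with the churn share in another slice of the data. The sketch below uses only numbers that appear in this notebook's outputs (726 and 183 here, plus the 1582 stayed and 418 exited customers implied by the validation confusion matrices further down); the variable names are illustrative.

```python
# Counts taken from the notebook's printed outputs.
missing_stayed, missing_exited = 726, 183   # rows with missing Tenure
valid_stayed, valid_exited = 1582, 418      # validation split (confusion-matrix row sums)

missing_churn_rate = missing_exited / (missing_stayed + missing_exited)
valid_churn_rate = valid_exited / (valid_stayed + valid_exited)

print(f"churn share among missing-Tenure rows: {missing_churn_rate:.3f}")
print(f"churn share in the validation split:   {valid_churn_rate:.3f}")
```

The two shares are close (about 0.20 versus 0.21), which suggests the missing values are not concentrated in either class, so a single placeholder is unlikely to distort the target much.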

In [10]:
avg_tenure = data['Tenure'].mean()
avg_tenure # the mean is 4.9979

data['Tenure'] = data['Tenure'].fillna(5)

I rounded the column's mean to 5 as an easier value to fill in for the customers with missing tenure.
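One optional refinement, shown here only as a sketch on a toy frame: keep a flag column that records which rows were imputed, so a model can still tell a real tenure of 5 from a filled-in one. The `TenureMissing` column name and the toy values are assumptions, not part of the original notebook.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the Tenure column (hypothetical values).
toy = pd.DataFrame({"Tenure": [2.0, np.nan, 8.0, np.nan, 5.0]})

# Record which rows were originally missing before overwriting them.
toy["TenureMissing"] = toy["Tenure"].isna().astype(int)

# Fill with the rounded column mean, mirroring the fillna(5) step above.
fill_value = round(toy["Tenure"].mean())
toy["Tenure"] = toy["Tenure"].fillna(fill_value)

print(toy)
```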

In [11]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 14 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   RowNumber        10000 non-null  int64  
 1   CustomerId       10000 non-null  int64  
 2   Surname          10000 non-null  object 
 3   CreditScore      10000 non-null  int64  
 4   Geography        10000 non-null  object 
 5   Gender           10000 non-null  object 
 6   Age              10000 non-null  int64  
 7   Tenure           10000 non-null  float64
 8   Balance          10000 non-null  float64
 9   NumOfProducts    10000 non-null  int64  
 10  HasCrCard        10000 non-null  int64  
 11  IsActiveMember   10000 non-null  int64  
 12  EstimatedSalary  10000 non-null  float64
 13  Exited           10000 non-null  int64  
dtypes: float64(3), int64(8), object(3)
memory usage: 1.1+ MB


In [12]:
data.sample(4)

Unnamed: 0,RowNumber,CustomerId,Surname,CreditScore,Geography,Gender,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Exited
8354,8355,15669994,Greece,556,Germany,Female,31,1.0,128663.81,2,1,0,125083.29,0
6800,6801,15743149,Findlay,711,France,Female,35,8.0,0.0,1,1,1,67508.01,0
3132,3133,15619407,Buckley,615,France,Male,39,4.0,133707.09,1,1,1,108152.75,0
5438,5439,15633274,Tai,679,France,Male,34,7.0,160515.37,1,1,0,121904.14,0


Now that every column is fully populated, I removed three identifier columns that are not useful as features: *CustomerId*, *RowNumber* and the customer's *Surname*.

In [13]:
data = data.drop(['CustomerId', 'RowNumber', 'Surname'], axis=1)
data

Unnamed: 0,CreditScore,Geography,Gender,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Exited
0,619,France,Female,42,2.0,0.00,1,1,1,101348.88,1
1,608,Spain,Female,41,1.0,83807.86,1,0,1,112542.58,0
2,502,France,Female,42,8.0,159660.80,3,1,0,113931.57,1
3,699,France,Female,39,1.0,0.00,2,0,0,93826.63,0
4,850,Spain,Female,43,2.0,125510.82,1,1,1,79084.10,0
...,...,...,...,...,...,...,...,...,...,...,...
9995,771,France,Male,39,5.0,0.00,2,1,0,96270.64,0
9996,516,France,Male,35,10.0,57369.61,1,1,1,101699.77,0
9997,709,France,Female,36,7.0,0.00,1,0,1,42085.58,1
9998,772,Germany,Male,42,3.0,75075.31,2,1,0,92888.52,1


# Machine Learning

Now that the data is clean, we can move forward with the initial preprocessing for building a model.

First, I one-hot encode the categorical columns and separate the target from the features, then split both into three sets: training, validation and test (60/20/20). StandardScaler is imported, but the features are left unscaled in the cells below. I then print the shape of each split. Lastly, I print the info summary of the encoded *data_ohe* frame to confirm that the encoding worked as expected.

In [14]:
#Preprocessed the data via encoding
data_ohe = pd.get_dummies(data, drop_first=True)

#Splitting the data into two groupings & three datasets: Train, Valid and Test.
target = data_ohe['Exited']
features = data_ohe.drop('Exited', axis=1)

train_features, split_features, train_target, split_target = train_test_split(
    features, target, test_size=0.40, random_state=12345)

valid_features, test_features, valid_target, test_target = train_test_split(
    split_features, split_target, test_size=0.50, random_state=12345 )

#Train Datasets' Shapes
print('train_features:', train_features.shape)
print('train_target:', train_target.shape)

#Validation Datasets' Shapes
print('valid_features:', valid_features.shape)
print('valid_target:', valid_target.shape)

#Testing Datasets' Shapes
print('test_features:', test_features.shape)
print('test_target:', test_target.shape)

train_features: (6000, 11)
train_target: (6000,)
valid_features: (2000, 11)
valid_target: (2000,)
test_features: (2000, 11)
test_target: (2000,)
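The numeric columns in this dataset sit on very different scales (Balance runs into the hundreds of thousands, NumOfProducts is a single digit), which can matter for a regularized LogisticRegression. The following is a minimal sketch of how the already-imported StandardScaler could be applied, fitting on the training rows only so that nothing leaks from validation or test. The toy frames and the `numeric` column list are assumptions for illustration.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical miniature train/validation frames with two numeric columns.
toy_train = pd.DataFrame({"Balance": [0.0, 50000.0, 100000.0, 150000.0],
                          "Age": [30.0, 40.0, 50.0, 60.0]})
toy_valid = pd.DataFrame({"Balance": [75000.0], "Age": [45.0]})

numeric = ["Balance", "Age"]  # assumed list of columns to scale

scaler = StandardScaler()
scaler.fit(toy_train[numeric])  # learn mean and std from training rows only

toy_train_scaled = toy_train.copy()
toy_valid_scaled = toy_valid.copy()
toy_train_scaled[numeric] = scaler.transform(toy_train[numeric])
toy_valid_scaled[numeric] = scaler.transform(toy_valid[numeric])

print(toy_train_scaled.round(3))
print(toy_valid_scaled.round(3))
```

Tree-based models are largely insensitive to this step, so it would mainly affect the logistic model.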


In [15]:
data_ohe.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 12 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   CreditScore        10000 non-null  int64  
 1   Age                10000 non-null  int64  
 2   Tenure             10000 non-null  float64
 3   Balance            10000 non-null  float64
 4   NumOfProducts      10000 non-null  int64  
 5   HasCrCard          10000 non-null  int64  
 6   IsActiveMember     10000 non-null  int64  
 7   EstimatedSalary    10000 non-null  float64
 8   Exited             10000 non-null  int64  
 9   Geography_Germany  10000 non-null  uint8  
 10  Geography_Spain    10000 non-null  uint8  
 11  Gender_Male        10000 non-null  uint8  
dtypes: float64(3), int64(6), uint8(3)
memory usage: 732.5 KB
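Because churners are the minority class, it can be worth making sure each split keeps the same class mix. In this notebook the validation and test splits already ended up close (418 and 423 churners out of 2000), so the sketch below is an optional safeguard rather than a fix. It uses the `stratify` parameter of `train_test_split` on hypothetical toy data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 100 rows, 20% positive class.
X = np.arange(100).reshape(-1, 1)
y = np.array([1] * 20 + [0] * 80)

# 60% train, then the remaining 40% halved into validation and test.
X_train, X_rest, y_train, y_rest = train_test_split(
    X, y, test_size=0.40, random_state=12345, stratify=y)
X_valid, X_test, y_valid, y_test = train_test_split(
    X_rest, y_rest, test_size=0.50, random_state=12345, stratify=y_rest)

for name, part in [("train", y_train), ("valid", y_valid), ("test", y_test)]:
    print(name, len(part), part.mean())
```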


## The Imbalanced LogisticRegression Model

The imbalanced model yielded an F1 score of 0.08 and an AUC-ROC score of 0.67.

In [16]:
model = LogisticRegression(random_state=12345, solver='liblinear')
model.fit(train_features, train_target)
predicted_valid = model.predict(valid_features)

probabilities_valid = model.predict_proba(valid_features)
probabilities_one_valid = probabilities_valid[:, 1]

#AUC-ROC Metric
auc_roc = roc_auc_score(valid_target, probabilities_one_valid)

#F1-score Metric
f1_score_value = f1_score(valid_target, predicted_valid)

print("AUC-ROC Score:", auc_roc)
print("F1 Score:", f1_score_value)

AUC-ROC Score: 0.6727947180904797
F1 Score: 0.08385744234800838


## The Balanced LogisticRegression Model

The balanced model yielded an F1 score of 0.42 and an AUC-ROC score of 0.72.
Compared with 0.08 and 0.67 for the imbalanced model, this shows how much balancing the classes during training can strengthen a model.

In [17]:
#Upsampling the Minority Class
def upsample(features, target, repeat):
    features_zeros = features[target == 0]
    features_ones = features[target == 1]
    target_zeros = target[target == 0]
    target_ones = target[target == 1]

    features_upsampled = pd.concat([features_zeros] + [features_ones] * repeat)
    target_upsampled = pd.concat([target_zeros] + [target_ones] * repeat)

    features_upsampled, target_upsampled = shuffle(
        features_upsampled, target_upsampled, random_state=12345
    )

    return features_upsampled, target_upsampled


features_upsampled, target_upsampled = upsample(train_features, train_target, 6)


#experimental model
model = LogisticRegression(random_state=12345, solver='liblinear')
model.fit(features_upsampled, target_upsampled)
valid_predicted = model.predict(valid_features)


probabilities_valid = model.predict_proba(valid_features)
probabilities_one_valid = probabilities_valid[:, 1]

#AUC-ROC Metric
auc_roc = roc_auc_score(valid_target, probabilities_one_valid)
print('AUC-ROC Score:', auc_roc)

#F1 Score
print('F1:', f1_score(valid_target, valid_predicted))

AUC-ROC Score: 0.720853319945076
F1: 0.4216725559481743


## Model Training & Validating

Now that the data has been prepared and the effect of class balancing is clear, we can train candidate models and validate them before the final test. We compare three different models to see how each behaves when the classes are re-weighted.

### LogisticRegression Model

For the logistic model, the results show clear *room for improvement*. The confusion matrix shows that most customers were classified correctly (1239 true negatives and 222 true positives), but with 343 false positives and 196 false negatives the model is not reliable enough to serve as a baseline. The other metrics tell the same story: accuracy and AUC-ROC are both around 0.72-0.73, roughly a C+, and the F1 score of 0.45 is only marginally higher than the upsampled model's 0.42 and remains far from *1.0*.

In [18]:
class_weights = {0:1, 1:3}

model = LogisticRegression(random_state=12345, solver='liblinear', class_weight=class_weights)
model.fit(train_features, train_target)
predicted_valid = model.predict(valid_features)

probabilities_valid = model.predict_proba(valid_features)
probabilities_one_valid = probabilities_valid[:, 1]

#AUC-ROC Metric
auc_roc = roc_auc_score(valid_target, probabilities_one_valid)

# Accuracy metric
accuracy_valid = accuracy_score(valid_target, predicted_valid)

# Confusion matrix
con_matrix = confusion_matrix(valid_target, predicted_valid)

#F1-score Metric
f1_score_value = f1_score(valid_target, predicted_valid)


print("Confusion Matrix:")
print(con_matrix)
print()
print("Accuracy Metric:", accuracy_valid)
print("AUC-ROC Score:", auc_roc)
print("F1 Score:", f1_score_value)

Confusion Matrix:
[[1239  343]
 [ 196  222]]

Accuracy Metric: 0.7305
AUC-ROC Score: 0.7239806071897361
F1 Score: 0.4516785350966429
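To make the link between the confusion matrix and the reported scores explicit, the sketch below recomputes precision, recall, F1 and accuracy by hand from the four counts printed above. It assumes scikit-learn's layout, where rows are actual classes and columns are predicted classes.

```python
# Confusion matrix printed above: rows are actual classes, columns are predictions.
tn, fp = 1239, 343
fn, tp = 196, 222

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tp + tn + fp + fn)

print(f"precision: {precision:.4f}")
print(f"recall:    {recall:.4f}")
print(f"F1:        {f1:.4f}")
print(f"accuracy:  {accuracy:.4f}")
```

The low F1 comes mostly from precision: of the 565 customers flagged as churners, only 222 actually left.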


### DecisionTreeClassifier Model

The tree model is a clear overall improvement on the previous model. In the confusion matrix, correct predictions (TN and TP) far outnumber incorrect ones (FN and FP), which indicates that the model is performing well. The remaining metrics agree: accuracy and AUC-ROC are both above 0.80, and the F1 score of 0.586 comes very close to the **0.59 score benchmark**, showing that the model can distinguish between customers who stayed and those who left. Still, it falls just short of the benchmark, so it does not yet qualify as a dependable model for testing.

In [19]:
class_weights = {0:1, 1:3}

# Training & Validating the DTC Model
model = DecisionTreeClassifier(random_state=12345, class_weight=class_weights, max_depth=5)
model.fit(train_features, train_target)
predicted_valid = model.predict(valid_features)

# Confusion matrix
conf_matrix = confusion_matrix(valid_target, predicted_valid)

# Accuracy metric
accuracy = accuracy_score(valid_target, predicted_valid)

#AUC-ROC Metric
probabilities_valid = model.predict_proba(valid_features)
probabilities_one_valid = probabilities_valid[:, 1]
auc_roc = roc_auc_score(valid_target, probabilities_one_valid)

#F1-score Metric
f1 = f1_score(valid_target, predicted_valid)

# Print metrics
print("Confusion Matrix:")
print(conf_matrix)
print("Accuracy:", accuracy)
print("AUC-ROC Score:", auc_roc)
print("F1 Score:", f1)

Confusion Matrix:
[[1419  163]
 [ 177  241]]
Accuracy: 0.83
AUC-ROC Score: 0.8229097986317362
F1 Score: 0.586374695863747


### RandomForestClassifier Model

The final model performs similarly to the tree model but slightly better, and it easily outperforms the logistic model; it therefore wins on overall model performance.

In [20]:
class_weights = {0:1, 1:3} #class adjustments

# Training & Validating the RFC Model
model = RandomForestClassifier(random_state=12345, class_weight=class_weights, max_depth=5)
model.fit(train_features, train_target)
predicted_valid = model.predict(valid_features)

#confusion matrix
conf_matrix = confusion_matrix(valid_target, predicted_valid)

#accuracy metric
accuracy = accuracy_score(valid_target, predicted_valid)

#AUC-ROC Metric
probabilities_valid = model.predict_proba(valid_features)
probabilities_one_valid = probabilities_valid[:, 1]
auc_roc = roc_auc_score(valid_target, probabilities_one_valid)

#F1-score Metric
f1 = f1_score(valid_target, predicted_valid)

#Print metrics
print("Confusion Matrix:")
print(conf_matrix)
print("Accuracy:", accuracy)
print("AUC-ROC Score:", auc_roc)
print("F1 Score:", f1)

Confusion Matrix:
[[1420  162]
 [ 155  263]]
Accuracy: 0.8415
AUC-ROC Score: 0.847938682184141
F1 Score: 0.6239620403321471


Although all three models share the same *random_state=12345* and *class_weights*, the RandomForestClassifier came out as the best model. This held after 15-20 rounds of parameter adjustments for each model: the highest scores consistently belonged to the *RandomForestClassifier*.
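The manual adjustments described above could also be written as a small loop that records the validation F1 for each setting. This is a sketch under stated assumptions: it runs on synthetic data from `make_classification` rather than the bank dataset, and the grid values for `max_depth` and `n_estimators` are arbitrary examples, not the settings that were actually tried.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data with about 20% positives (not the bank dataset).
X, y = make_classification(n_samples=1000, n_features=10, weights=[0.8, 0.2],
                           random_state=12345)
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.25, random_state=12345, stratify=y)

results = []
for depth in (3, 5, 7):
    for n_trees in (20, 50):
        model = RandomForestClassifier(
            n_estimators=n_trees, max_depth=depth,
            class_weight={0: 1, 1: 3}, random_state=12345)
        model.fit(X_train, y_train)
        score = f1_score(y_valid, model.predict(X_valid))
        results.append((score, depth, n_trees))

# Tuples compare by their first element, so max() picks the highest F1.
best_score, best_depth, best_trees = max(results)
print(f"best F1 {best_score:.3f} at max_depth={best_depth}, n_estimators={best_trees}")
```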

## Model Testing

On the test set the model scores slightly below its validation results: the F1 score drops by about 0.04 (from 0.624 to 0.585), which lands just under the 0.59 target, while AUC-ROC stays close at 0.843 versus 0.848.

In [21]:
class_weights = {0:1, 1:3} #class adjustments

# Training the RFC model and evaluating it on the test set
model = RandomForestClassifier(random_state=12345, class_weight=class_weights, max_depth=5)
model.fit(train_features, train_target)
test_predictions = model.predict(test_features)


#confusion matrix
conf_matrix = confusion_matrix(test_target, test_predictions)

#Accuracy
score = accuracy_score(test_target, test_predictions)

#AUC-ROC Metric (computed on the test set)
probabilities_test = model.predict_proba(test_features)
probabilities_one_test = probabilities_test[:, 1]
auc_roc = roc_auc_score(test_target, probabilities_one_test)

#F1-score Metric
f1 = f1_score(test_target, test_predictions)

#Print metrics
print("Confusion Matrix:")
print(conf_matrix)
print("Accuracy:", score)
print("AUC-ROC Score:", auc_roc)
print("F1 Score:", f1)

Confusion Matrix:
[[1391  186]
 [ 171  252]]
Accuracy: 0.8215
AUC-ROC Score: 0.8428503112862049
F1 Score: 0.5853658536585366


# Conclusion

In short, after several rounds of fine-tuning, the RandomForestClassifier model clearly surpassed the DecisionTree and LogisticRegression models at predicting which customers are likely to leave the bank and which are likely to stay. Its validation F1 score of 0.62 cleared the 0.59 target, while the test F1 score of 0.585 landed just below it, so there is still room for improvement. A sensible next step is to tune the model further and evaluate it on a fresh dataset to see how well it generalizes.