<a href="https://colab.research.google.com/github/Alina-Tur/cusomer_churn/blob/main/project8.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Project description

Beta Bank has a high decrease in number of existing customers. It is cheaper for the company to save the existing customers than to attract new ones.
We need to predict whether a customer will leave the bank soon. We have the data on clients’ past behavior and termination of contracts with the bank.
We will build a model with the maximum possible F1 score(at least 0.59). Additionally,we will measure the AUC-ROC metric and compare it with the F1.


<a href='step1'> Data Preparation </a>

<a href='step2'> Model without Imbalance Handling </a>

<a href='step3'> Balanced Class Parameters </a>

<a href='step4'> Oversampling Minorities </a>

<a href='step5'> Conclusion </a>

<a id='step1'></a>

# Data Preparation

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats as st
from sklearn.tree import DecisionTreeClassifier #DecisionTreeRegressor
# import other models
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier #RandomForestRegressor
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score, balanced_accuracy_score, confusion_matrix, classification_report

from sklearn.model_selection import train_test_split
%matplotlib inline


In [None]:
df = pd.read_csv('https://code.s3.yandex.net/datasets/Churn.csv')

In [None]:
df.head()

Unnamed: 0,RowNumber,CustomerId,Surname,CreditScore,Geography,Gender,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Exited
0,1,15634602,Hargrave,619,France,Female,42,2.0,0.0,1,1,1,101348.88,1
1,2,15647311,Hill,608,Spain,Female,41,1.0,83807.86,1,0,1,112542.58,0
2,3,15619304,Onio,502,France,Female,42,8.0,159660.8,3,1,0,113931.57,1
3,4,15701354,Boni,699,France,Female,39,1.0,0.0,2,0,0,93826.63,0
4,5,15737888,Mitchell,850,Spain,Female,43,2.0,125510.82,1,1,1,79084.1,0


In [None]:
df.columns

Index(['RowNumber', 'CustomerId', 'Surname', 'CreditScore', 'Geography',
       'Gender', 'Age', 'Tenure', 'Balance', 'NumOfProducts', 'HasCrCard',
       'IsActiveMember', 'EstimatedSalary', 'Exited'],
      dtype='object')

In [None]:
df.Exited.value_counts()/df.shape[0]

0    0.7963
1    0.2037
Name: Exited, dtype: float64

We can see that 20% of customers left the bank

In [None]:
df.isnull().sum()

RowNumber            0
CustomerId           0
Surname              0
CreditScore          0
Geography            0
Gender               0
Age                  0
Tenure             909
Balance              0
NumOfProducts        0
HasCrCard            0
IsActiveMember       0
EstimatedSalary      0
Exited               0
dtype: int64

We can see that the dataset is in good condition and only one column has missing values.

In [None]:
df.Geography.value_counts()

France     5014
Germany    2509
Spain      2477
Name: Geography, dtype: int64

The current variable is a categorical and has three values. Later on, we will be using the method of dummy enconding.

In [None]:
df.Gender.value_counts()

Male      5457
Female    4543
Name: Gender, dtype: int64

The current variable is also categorical and has two values. We will use dummy encoding for this one as well.

In [None]:
df.dropna(inplace=True)

We do not need NULL values for our data model, therefore we have deleted those null values from the dataframe.

In [None]:
X = df.drop(['RowNumber', 'CustomerId', 'Surname','Exited'], axis=1)
y = df['Exited']

In [None]:
X

Unnamed: 0,CreditScore,Geography,Gender,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary
0,619,France,Female,42,2.0,0.00,1,1,1,101348.88
1,608,Spain,Female,41,1.0,83807.86,1,0,1,112542.58
2,502,France,Female,42,8.0,159660.80,3,1,0,113931.57
3,699,France,Female,39,1.0,0.00,2,0,0,93826.63
4,850,Spain,Female,43,2.0,125510.82,1,1,1,79084.10
...,...,...,...,...,...,...,...,...,...,...
9994,800,France,Female,29,2.0,0.00,2,0,0,167773.55
9995,771,France,Male,39,5.0,0.00,2,1,0,96270.64
9996,516,France,Male,35,10.0,57369.61,1,1,1,101699.77
9997,709,France,Female,36,7.0,0.00,1,0,1,42085.58


We will not need to use Row Number, Cutomer Id, Surname for our model as feaures. They will not add any values, therefore we will drop them from the table.

The data set was split for X that is features and Y that is prediction.

In [None]:
X = pd.get_dummies(X,drop_first=True)

In [None]:
X

Unnamed: 0,CreditScore,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Geography_Germany,Geography_Spain,Gender_Male
0,619,42,2.0,0.00,1,1,1,101348.88,0,0,0
1,608,41,1.0,83807.86,1,0,1,112542.58,0,1,0
2,502,42,8.0,159660.80,3,1,0,113931.57,0,0,0
3,699,39,1.0,0.00,2,0,0,93826.63,0,0,0
4,850,43,2.0,125510.82,1,1,1,79084.10,0,1,0
...,...,...,...,...,...,...,...,...,...,...,...
9994,800,29,2.0,0.00,2,0,0,167773.55,0,0,0
9995,771,39,5.0,0.00,2,1,0,96270.64,0,0,1
9996,516,35,10.0,57369.61,1,1,1,101699.77,0,0,1
9997,709,36,7.0,0.00,1,0,1,42085.58,0,0,0


We have used dummies encoding method for categorical data such as Geography and Gender to ransfer it to numerical type.

In [None]:
#75% train, 25% test
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size= 0.25, random_state=42)


In [None]:
np.array([X_train.shape[0], X_test.shape[0]])/df.shape[0]

array([0.7499725, 0.2500275])

We have splitted the data set into 75% for training and 25% for testing. We have also confirmed with the tool of np. array if it was split correctly.

<a id='step2'></a>

# Model without imbalance handling

In [None]:
model = DecisionTreeClassifier(random_state=12345)
model.fit(X_train,y_train)

NameError: ignored

We have tested Decision Tree Classifier for the first model that doesn't take into consideration the imbalancing.

In [None]:
print('F1 Score: ' + str((f1_score(y_test, model.predict(X_test))*100).round(2))+'%')
print('Accuracy Score: ' + str((accuracy_score(y_test, model.predict(X_test))*100).round(2))+'%')
print('Balanced Accuracy Score: ' + str((balanced_accuracy_score(y_test, model.predict(X_test))*100).round(2))+'%')
print('ROC AUC Score: ' + str((roc_auc_score(y_test, model.predict(X_test))*100).round(2))+'%')

We can see that Accuracy Score is relatively high and it is about 80%. We could say that the model is good however F1 score is below required F1=0.59.

In [None]:
confusion_matrix(y_test, model.predict(X_test))

Based on the confusion matrix we can see that the class 1 has almost the same numbe rpredicted correctly and incorrectly. It means that our model is not accurate by predicting 50/50 correct and incorrect especially for the class 1 that is our churn rate.

In [None]:
print(classification_report(y_test, model.predict(X_test)))

Based on the classification report we can confirm again that class 1 doesn't have good scores for precision, recall and f-1 score.

In [None]:
for i in range(1,10):
    model = DecisionTreeClassifier(max_depth=i,random_state=12345, )
    model.fit(X_train,y_train)
    print('Max Depth ' + str(i))
    print('F1 Score: ' + str((f1_score(y_test, model.predict(X_test))*100).round(2))+'%')

We wanted to test if by changing max depth in Decision Tree we can achieve the required minimum of F-1 = 0.59. However we can see that the max score could 56.19 with imbalanced decision tree model that is below our requirement.

<a id='step3'></a>

# Balanced Class Parameters

In [None]:
#Decision Tree Classifier Model
for i in range(1,10):
    model = DecisionTreeClassifier(class_weight='balanced',max_depth=i,random_state=12345, )
    model.fit(X_train,y_train)
    print('Max Depth ' + str(i))
    print('F1 Score: ' + str((f1_score(y_test, model.predict(X_test))*100).round(2))+'%')

Max Depth 1
F1 Score: 48.03%
Max Depth 2
F1 Score: 50.33%
Max Depth 3
F1 Score: 50.33%
Max Depth 4
F1 Score: 55.06%
Max Depth 5
F1 Score: 55.95%
Max Depth 6
F1 Score: 56.86%
Max Depth 7
F1 Score: 57.09%
Max Depth 8
F1 Score: 55.62%
Max Depth 9
F1 Score: 55.63%


We have passed class weight hypermater "balanced" to balance the data. We can see the improvements in the scores in comaprison with unbalanced testing, however they are still below required f-1 score of 0.59.

In [None]:
#Random Forest Classifier
for i in range(1,10):
    rf_model = RandomForestClassifier(class_weight='balanced',max_depth=i, random_state=12345, )
    rf_model.fit(X_train,y_train)
    print('Max Depth ' + str(i))
    print('F1 Score: ' + str((f1_score(y_test, rf_model.predict(X_test))*100).round(2))+'%')




Max Depth 1
F1 Score: 49.61%
Max Depth 2
F1 Score: 53.07%
Max Depth 3
F1 Score: 55.65%
Max Depth 4
F1 Score: 57.25%




Max Depth 5
F1 Score: 57.47%
Max Depth 6
F1 Score: 58.41%
Max Depth 7
F1 Score: 57.83%
Max Depth 8
F1 Score: 59.5%
Max Depth 9
F1 Score: 58.0%




We have applied the same method of balancing for Random Forest Classifier, and we can see that we were able to achieve threshold score of 59 with max depth of 8.

In [None]:
rf_model = RandomForestClassifier(class_weight='balanced',max_depth=8, random_state=12345, )
rf_model.fit(X_train,y_train)



RandomForestClassifier(bootstrap=True, class_weight='balanced',
                       criterion='gini', max_depth=8, max_features='auto',
                       max_leaf_nodes=None, min_impurity_decrease=0.0,
                       min_impurity_split=None, min_samples_leaf=1,
                       min_samples_split=2, min_weight_fraction_leaf=0.0,
                       n_estimators=10, n_jobs=None, oob_score=False,
                       random_state=12345, verbose=0, warm_start=False)

In [None]:
confusion_matrix(y_test, rf_model.predict(X_test))

array([[1567,  248],
       [ 159,  299]])

We can see that class 1 has a higher number of predicted correct answers over incorrect. We can state the accuracy score is better than imbalanced model.

In [None]:
print(classification_report(y_test, rf_model.predict(X_test)))

              precision    recall  f1-score   support

           0       0.91      0.86      0.89      1815
           1       0.55      0.65      0.60       458

    accuracy                           0.82      2273
   macro avg       0.73      0.76      0.74      2273
weighted avg       0.84      0.82      0.83      2273



We can see that F-1 score is 60% and accuracy score is 82% that is higher in another models.

In [None]:
print(roc_auc_score(y_test, rf_model.predict(X_test)))

0.758099654745149


<a id='step4'></a>

# Oversampling Minorities

In [None]:
y_train.value_counts()

0    5422
1    1396
Name: Exited, dtype: int64

In [None]:
X_train_resample = X_train[y_train==0]
X_train_resample = X_train_resample.append(X_train[y_train==1].sample((y_train==0).sum(),replace=True))

In [None]:
y_train_resample = np.array([0]*(y_train==0).sum()+[1]*(y_train==0).sum())

We have chosen to use the method of oversampling minorities to resample our class 1 and have it closer to the number of class 0. This method balnces two parameters when the quantity of classes is too different.

In [None]:
#Decision Tree Classifier
for i in range(1,10):
    model = DecisionTreeClassifier(max_depth=i,random_state=12345, )
    model.fit(X_train_resample,y_train_resample)
    print('Max Depth ' + str(i))
    print('F1 Score: ' + str((f1_score(y_test, model.predict(X_test))*100).round(2))+'%')

Max Depth 1
F1 Score: 48.03%
Max Depth 2
F1 Score: 50.33%
Max Depth 3
F1 Score: 52.61%
Max Depth 4
F1 Score: 52.73%
Max Depth 5
F1 Score: 56.59%
Max Depth 6
F1 Score: 55.79%
Max Depth 7
F1 Score: 57.48%
Max Depth 8
F1 Score: 56.86%
Max Depth 9
F1 Score: 53.49%


We can see that Decision Tree Classifier and oversampling didn't help to achieve F-1 score of 0.59, but we can still conclude that oversampling performs better than imbalanced.

In [None]:
#Random Forest Classifier
for i in range(1,10):
    rf_model = RandomForestClassifier(max_depth=i, random_state=12345, )
    rf_model.fit(X_train_resample,y_train_resample)
    print('Max Depth ' + str(i))
    print('F1 Score: ' + str((f1_score(y_test, rf_model.predict(X_test))*100).round(2))+'%')




Max Depth 1
F1 Score: 51.42%
Max Depth 2
F1 Score: 53.41%
Max Depth 3
F1 Score: 54.19%
Max Depth 4
F1 Score: 56.89%
Max Depth 5
F1 Score: 57.25%
Max Depth 6
F1 Score: 59.21%




Max Depth 7
F1 Score: 59.11%
Max Depth 8
F1 Score: 59.18%




Max Depth 9
F1 Score: 59.51%


We can see that Random Forest model achieves F-1 score over 0.59 with the max depths starting from 6 till 9. We can see that the highest score is with max depth=9.

In [None]:
print(roc_auc_score(y_test, rf_model.predict(X_test)))

0.7559469245852732


<a id='step5'></a>

# Conclusion

The data was prepared in the beginning of the project and we have dropped some null values and dermined features and predicted variables.
We have trained the model with imbalanced classes, and we saw that we aren't able to achieve a high enough F-1 score of 0.59.
We have balanced the classes with the hypermeter class weight "balanced" that helped to achieve a minimum of F-1 score with the Random Forest Model. Decision Tree model had higher scores but still not 0.59 as it was mentioned.
Another method was used to balance the classes is to oversample minorities. With this method, we have also achieved F-1 score 0.59 with the Random Forest model.
We can say that for our task the balancing is important to achieve a higher score and class weight balances and oversampling helped to achieve the required F-1 score.