# Project Beta Bank by Maria Shemyakina

## Project description
Beta Bank customers are leaving: little by little, chipping away every month. The bankers figured out it’s cheaper to save the existing customers rather than to attract new ones.
We need to predict whether a customer will leave the bank soon. You have the data on clients’ past behavior and termination of contracts with the bank.
Build a model with the maximum possible F1 score. To pass the project, you need an F1 score of at least 0.59. Check the F1 for the test set.
Additionally, measure the AUC-ROC metric and compare it with the F1.
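
As a reminder of what the two metrics measure, here is a minimal sketch on made-up labels and probabilities (the values are illustrative assumptions, not project data). F1 is the harmonic mean of precision and recall and needs hard class labels, while AUC-ROC is computed from the predicted probabilities directly.

```python
from sklearn.metrics import f1_score, precision_score, recall_score, roc_auc_score

# Toy labels and predicted churn probabilities (made-up values for illustration).
y_true = [0, 0, 0, 0, 1, 1, 1, 0]
y_prob = [0.1, 0.4, 0.35, 0.8, 0.65, 0.9, 0.3, 0.2]

# F1 needs hard class labels, so the probabilities are cut at a threshold.
y_pred = [int(p >= 0.5) for p in y_prob]

precision = precision_score(y_true, y_pred)
recall = recall_score(y_true, y_pred)
f1_manual = 2 * precision * recall / (precision + recall)

# The manual harmonic mean should agree with sklearn's f1_score.
print(f1_manual, f1_score(y_true, y_pred))

# AUC-ROC uses the probabilities themselves, so it does not depend on a threshold.
auc = roc_auc_score(y_true, y_prob)
print(auc)
```

Because F1 depends on the chosen threshold and AUC-ROC does not, the two metrics can move independently, which is why the task asks to compare them.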

### Features
* RowNumber — data string index
* CustomerId — unique customer identifier
* Surname — surname
* CreditScore — credit score
* Geography — country of residence
* Gender — gender
* Age — age
* Tenure — period of maturation for a customer’s fixed deposit (years)
* Balance — account balance
* NumOfProducts — number of banking products used by the customer
* HasCrCard — customer has a credit card
* IsActiveMember — customer’s activeness
* EstimatedSalary — estimated salary

**Target**
    Exited — customer has left

# 1. Open the data file and explore the general information

In [1]:
import pandas as pd
import numpy as np

from sklearn.model_selection import GridSearchCV, train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.utils import shuffle

from sklearn.metrics import precision_score, recall_score, roc_auc_score, roc_curve, f1_score, confusion_matrix
from sklearn.metrics import precision_recall_curve
import matplotlib.pyplot as plt


from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

import warnings
warnings.filterwarnings('ignore')

SEED=12345

In [2]:
data = pd.read_csv('Churn.csv')

In [3]:
data.sample(5)

Unnamed: 0,RowNumber,CustomerId,Surname,CreditScore,Geography,Gender,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Exited
9180,9181,15568326,Kenenna,637,France,Female,44,2.0,0.0,2,1,0,149665.65,0
6257,6258,15617301,Chamberlin,774,Germany,Male,36,9.0,130809.77,1,1,0,152290.28,0
3638,3639,15684367,Chigbogu,555,Spain,Male,27,5.0,0.0,2,0,0,96398.51,0
3618,3619,15750867,Nucci,489,Germany,Female,46,8.0,92060.06,1,1,0,147222.95,1
2923,2924,15631159,H?,705,Germany,Male,41,4.0,72252.64,2,1,1,142514.66,0


In [4]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 14 columns):
RowNumber          10000 non-null int64
CustomerId         10000 non-null int64
Surname            10000 non-null object
CreditScore        10000 non-null int64
Geography          10000 non-null object
Gender             10000 non-null object
Age                10000 non-null int64
Tenure             9091 non-null float64
Balance            10000 non-null float64
NumOfProducts      10000 non-null int64
HasCrCard          10000 non-null int64
IsActiveMember     10000 non-null int64
EstimatedSalary    10000 non-null float64
Exited             10000 non-null int64
dtypes: float64(3), int64(8), object(3)
memory usage: 1.1+ MB


In [5]:
data.describe().T

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
RowNumber,10000.0,5000.5,2886.89568,1.0,2500.75,5000.5,7500.25,10000.0
CustomerId,10000.0,15690940.0,71936.186123,15565701.0,15628528.25,15690740.0,15753230.0,15815690.0
CreditScore,10000.0,650.5288,96.653299,350.0,584.0,652.0,718.0,850.0
Age,10000.0,38.9218,10.487806,18.0,32.0,37.0,44.0,92.0
Tenure,9091.0,4.99769,2.894723,0.0,2.0,5.0,7.0,10.0
Balance,10000.0,76485.89,62397.405202,0.0,0.0,97198.54,127644.2,250898.09
NumOfProducts,10000.0,1.5302,0.581654,1.0,1.0,1.0,2.0,4.0
HasCrCard,10000.0,0.7055,0.45584,0.0,0.0,1.0,1.0,1.0
IsActiveMember,10000.0,0.5151,0.499797,0.0,0.0,1.0,1.0,1.0
EstimatedSalary,10000.0,100090.2,57510.492818,11.58,51002.11,100193.9,149388.2,199992.48


The `Tenure` column has 909 missing values (9091 of 10000 are non-null). Fill them with the column median.

In [6]:
data.loc[data['Tenure'].isna(), 'Tenure'] = data['Tenure'].median()
data['Tenure'] = data['Tenure'].astype('int')
data.isna().sum().sum()

0
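
The same fill can be sketched on a toy column (made-up values). The `was_missing` flag is an optional extra that is not used in this project; it only shows how the information about the gaps could be kept.

```python
import numpy as np
import pandas as pd

# Toy column with gaps (values are made up for illustration).
tenure = pd.Series([2.0, np.nan, 5.0, 8.0, np.nan], name="Tenure")

# Optional: remember which rows were missing before they are filled.
was_missing = tenure.isna()

# median() skips NaN by default, so it uses the observed values only.
tenure_filled = tenure.fillna(tenure.median()).astype("int")

print(tenure_filled.tolist())
print(int(was_missing.sum()))
```

Note that `astype('int')` truncates, so this conversion is only lossless when the median itself is a whole number, as it is here (5.0).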

In [7]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 14 columns):
RowNumber          10000 non-null int64
CustomerId         10000 non-null int64
Surname            10000 non-null object
CreditScore        10000 non-null int64
Geography          10000 non-null object
Gender             10000 non-null object
Age                10000 non-null int64
Tenure             10000 non-null int64
Balance            10000 non-null float64
NumOfProducts      10000 non-null int64
HasCrCard          10000 non-null int64
IsActiveMember     10000 non-null int64
EstimatedSalary    10000 non-null float64
Exited             10000 non-null int64
dtypes: float64(2), int64(9), object(3)
memory usage: 1.1+ MB


In [8]:
data.duplicated().sum() 

0

In [9]:
data.describe(include=object)

Unnamed: 0,Surname,Geography,Gender
count,10000,10000,10000
unique,2932,3,2
top,Smith,France,Male
freq,32,5014,5457


* Let's remove the obviously useless columns: `RowNumber` and `Surname`.
* Check the `CustomerId` column for uniqueness; if every value is unique, it is a pure identifier that could be declared the index and should not be used as a feature.

Verify customer uniqueness

In [10]:
data['CustomerId'].nunique()

10000

All customers are unique

Categorical features must be converted to numerical features. We will use one-hot encoding via `pd.get_dummies` with the argument `drop_first=True`. Selecting the columns from `CreditScore` to `Exited` also drops `RowNumber`, `CustomerId` and `Surname`.

In [11]:
data_ohe = pd.get_dummies(data.loc[:, 'CreditScore' : 'Exited'], drop_first=True)

In [12]:
data_ohe.head()

Unnamed: 0,CreditScore,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Exited,Geography_Germany,Geography_Spain,Gender_Male
0,619,42,2,0.0,1,1,1,101348.88,1,0,0,0
1,608,41,1,83807.86,1,0,1,112542.58,0,0,1,0
2,502,42,8,159660.8,3,1,0,113931.57,1,0,0,0
3,699,39,1,0.0,2,0,0,93826.63,0,0,0,0
4,850,43,2,125510.82,1,1,1,79084.1,0,0,1,0
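
To see what `drop_first=True` does, here is a small sketch on a made-up frame (the rows are illustrative, not taken from `Churn.csv`). The first category of each column, in sorted order, is dropped and becomes the implicit baseline.

```python
import pandas as pd

# Toy frame with two categorical columns and one numeric column.
toy = pd.DataFrame({
    "Geography": ["France", "Germany", "Spain", "France"],
    "Gender": ["Female", "Male", "Male", "Female"],
    "Age": [42, 41, 39, 43],
})

# One dummy column per category, minus the first category of each feature.
encoded = pd.get_dummies(toy, drop_first=True)
print(encoded.columns.tolist())
```

A customer from France is therefore encoded as zeros in both `Geography_Germany` and `Geography_Spain`, which avoids a redundant, perfectly collinear column.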


Split the data into `features` and `target`.

In [13]:
features = data_ohe.drop(['Exited'] , axis=1)
target = data_ohe['Exited']

Use `train_test_split` twice to get stratified training, validation and test sets in a 60/20/20 ratio.

In [14]:
features_train_val, features_test, target_train_val,  target_test = train_test_split(
    features, target, test_size=0.2, random_state=SEED, stratify=target)

In [15]:
features_train, features_valid, target_train, target_valid = train_test_split(
    features_train_val, target_train_val, test_size=0.25, random_state=SEED,stratify=target_train_val)

Check it

In [16]:
features_train.shape

(6000, 11)

In [17]:
features_test.shape

(2000, 11)

In [18]:
features_valid.shape

(2000, 11)
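
The two-step split can be checked on synthetic data (random features and a made-up 20% positive rate, assumed only for illustration). Taking 25% of the remaining 80% gives 0.25 * 0.8 = 0.2 of the full data for validation, and `stratify` keeps the class share equal in every part.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.RandomState(12345)
X = rng.normal(size=(1000, 3))
y = np.array([1] * 200 + [0] * 800)  # 20% positives, roughly like the churn data

# Step 1: hold out 20% of the data as the test set.
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.2, random_state=12345, stratify=y)

# Step 2: take 25% of the remaining 80% as the validation set.
X_train, X_valid, y_train, y_valid = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=12345, stratify=y_rest)

for name, part in [("train", y_train), ("valid", y_valid), ("test", y_test)]:
    print(name, len(part), part.mean())
```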

Let's list the numeric features to be standardized in a variable and fit a `StandardScaler` on the training set only.

In [19]:
numeric = ['CreditScore', 'Age', 'Balance', 'EstimatedSalary', 'Tenure', 'NumOfProducts']

In [20]:
scaler = StandardScaler()
scaler.fit(features_train[numeric])

StandardScaler(copy=True, with_mean=True, with_std=True)

Let's transform the training, validation and test samples with the `transform()` method. The scaled values are written back to the same variables: `features_train`, `features_valid` and `features_test`.

In [21]:
features_train[numeric] = scaler.transform(features_train[numeric])
features_valid[numeric] = scaler.transform(features_valid[numeric])
features_test[numeric] = scaler.transform(features_test[numeric])

In [22]:
features_train.head()

Unnamed: 0,CreditScore,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Geography_Germany,Geography_Spain,Gender_Male
5536,-0.143332,0.577533,-0.001274,-1.220573,0.797767,1,1,1.029613,0,1,1
8530,1.632702,-0.564119,-1.092954,0.435807,-0.916018,1,0,0.237986,0,0,0
1762,1.116413,-0.468981,-1.456847,1.245822,-0.916018,1,1,-0.686104,0,0,0
9090,1.643028,0.006707,-0.001274,-1.220573,-0.916018,1,0,-0.391097,0,0,0
8777,-0.484083,-1.420358,-1.456847,1.421989,0.797767,1,0,-1.361559,0,1,1
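
A small sketch of the same idea on made-up frames: the scaler learns the mean and standard deviation from the training rows only, and the validation rows are transformed with those training statistics. The explicit `.copy()` calls are an assumption added here so that the assignment does not write into a slice of another frame.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy training and validation frames (made-up values).
train = pd.DataFrame({"Age": [20.0, 30.0, 40.0, 50.0], "HasCrCard": [1, 0, 1, 1]})
valid = pd.DataFrame({"Age": [35.0, 60.0], "HasCrCard": [0, 1]})
numeric = ["Age"]

# Fit on the training set only, so no validation statistics leak into the scaler.
scaler = StandardScaler().fit(train[numeric])

# Work on copies and overwrite only the numeric columns.
train_scaled = train.copy()
valid_scaled = valid.copy()
train_scaled[numeric] = scaler.transform(train[numeric])
valid_scaled[numeric] = scaler.transform(valid[numeric])

print(valid_scaled)
```

The binary column `HasCrCard` is left untouched, just as the 0/1 columns are left out of `numeric` in the project.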


Our data is ready

# 2. Using models on our data

In [23]:
target_train.value_counts()

0    4777
1    1223
Name: Exited, dtype: int64

The classes are imbalanced: about 20% of the training customers exited (1223 of 6000). As the task requires, we first train the models without correcting the imbalance and then balance the data.

I will use decision tree, random forest and logistic regression models.

## 2.1 Random Forest

Use hyperparameters n_estimators=50, min_samples_split=10, min_samples_leaf=5

In [24]:
for i in range(5, 21, 1):
    model_RF = RandomForestClassifier(n_estimators=50, max_depth=i, random_state=SEED, min_samples_split=10, 
                                  min_samples_leaf=5)
    model_RF.fit(features_train, target_train)
    predicted_valid_RF = model_RF.predict(features_valid)
    
    print("max_depth =", i, ": ", end='')
    print('F1-score = {:.4f}'.format(f1_score(target_valid, predicted_valid_RF)))
    

max_depth = 5 : F1-score = 0.4524
max_depth = 6 : F1-score = 0.4743
max_depth = 7 : F1-score = 0.5068
max_depth = 8 : F1-score = 0.5153
max_depth = 9 : F1-score = 0.5298
max_depth = 10 : F1-score = 0.5463
max_depth = 11 : F1-score = 0.5563
max_depth = 12 : F1-score = 0.5636
max_depth = 13 : F1-score = 0.5487
max_depth = 14 : F1-score = 0.5539
max_depth = 15 : F1-score = 0.5692
max_depth = 16 : F1-score = 0.5664
max_depth = 17 : F1-score = 0.5678
max_depth = 18 : F1-score = 0.5737
max_depth = 19 : F1-score = 0.5628
max_depth = 20 : F1-score = 0.5674
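
Instead of reading the best depth off the printed list, the loop can keep track of it. This is a minimal sketch on synthetic data from `make_classification` (the data, the depth range and `n_estimators=20` are assumptions chosen to keep the example fast, not the project settings).

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data standing in for the churn features.
X, y = make_classification(n_samples=600, n_features=8, weights=[0.8, 0.2],
                           random_state=12345)
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.25, random_state=12345, stratify=y)

best_depth, best_f1 = None, -1.0
for depth in range(2, 9):
    model = RandomForestClassifier(n_estimators=20, max_depth=depth, random_state=12345)
    model.fit(X_train, y_train)
    score = f1_score(y_valid, model.predict(X_valid))
    # Remember the depth with the highest validation F1 seen so far.
    if score > best_f1:
        best_depth, best_f1 = depth, score

print("best max_depth =", best_depth, "F1 = {:.4f}".format(best_f1))
```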


## 2.2 Decision Tree model

Use min_samples_split=10, min_samples_leaf=7

In [25]:
for i in range(5, 16, 1):    
    model_DT = DecisionTreeClassifier(random_state=SEED, max_depth=i, min_samples_split=10, 
                                  min_samples_leaf=7)
    model_DT.fit(features_train, target_train)
    predicted_valid_DT = model_DT.predict(features_valid)

    print("max_depth =", i, ": ", end='')
    print('F1-score = {:.4f}'.format(f1_score(target_valid, predicted_valid_DT)))
    

max_depth = 5 : F1-score = 0.4577
max_depth = 6 : F1-score = 0.5457
max_depth = 7 : F1-score = 0.5853
max_depth = 8 : F1-score = 0.5891
max_depth = 9 : F1-score = 0.5702
max_depth = 10 : F1-score = 0.5643
max_depth = 11 : F1-score = 0.5668
max_depth = 12 : F1-score = 0.5638
max_depth = 13 : F1-score = 0.5497
max_depth = 14 : F1-score = 0.5626
max_depth = 15 : F1-score = 0.5563


## 2.3 Logistic Regression 

Use hyperparameters solver='newton-cg', penalty='none'

In [26]:
model_regression = LogisticRegression(random_state=SEED, solver='newton-cg', penalty='none')
model_regression.fit(features_train, target_train)


LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='auto', n_jobs=None, penalty='none',
                   random_state=12345, solver='newton-cg', tol=0.0001,
                   verbose=0, warm_start=False)

In [27]:
predictions_regression = model_regression.predict(features_valid)
print('F1-score = {:.4f}'.format(f1_score(target_valid, predictions_regression)))

F1-score = 0.3215
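
When F1 is this low, it helps to look at precision and recall separately. The sketch below does that on synthetic data (an assumption; it does not reproduce the project numbers) and uses a default `LogisticRegression` rather than the `penalty='none'` setting above, so that it runs on recent scikit-learn versions as well.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data standing in for the churn features.
X, y = make_classification(n_samples=1000, n_features=8, weights=[0.8, 0.2],
                           random_state=12345)
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.25, random_state=12345, stratify=y)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
pred = model.predict(X_valid)

# Rows are true classes, columns are predicted classes.
tn, fp, fn, tp = confusion_matrix(y_valid, pred, labels=[0, 1]).ravel()
recall_manual = tp / (tp + fn)

print("precision:", precision_score(y_valid, pred, zero_division=0))
print("recall:", recall_manual)
```

On imbalanced data a model trained without any correction often predicts the rare class too seldom, which shows up as a low recall and pulls F1 down.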


### Conclusion

The best F1 scores were obtained with the decision tree and random forest models. Despite the strong class imbalance, the decision tree reached F1 = 0.59 (`max_depth=8`), and the random forest reached 0.57 (`max_depth=18`). Logistic regression scored only 0.32, so the class imbalance apparently affects it strongly.



# 3. Correcting the class imbalance

To improve model quality under class imbalance, we apply upsampling (repeating the rare class to enlarge the sample) and downsampling (dropping part of the frequent class to shrink the sample).
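
Besides resampling, scikit-learn estimators accept `class_weight='balanced'`, which the final model in section 4 also sets. As a sketch of what that option implies, the weights for the training class counts shown earlier (4777 and 1223) can be computed with `compute_class_weight`; the label array here is rebuilt from those counts for illustration.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Labels rebuilt from the training class counts reported in the notebook.
y = np.array([0] * 4777 + [1] * 1223)

# 'balanced' uses n_samples / (n_classes * count_of_class).
weights = compute_class_weight(class_weight="balanced", classes=np.array([0, 1]), y=y)
print(dict(zip([0, 1], weights.round(3))))
```

The rare class gets a weight about four times larger than the frequent one, which has a similar effect to repeating its rows four times.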

## 3.1 Increase the sample

The `upsample` function takes `features`, `target` and `repeat`, and repeats the positive-class rows `repeat` times to balance the data.

In [28]:
def upsample(features, target, repeat):
    features_zeros = features[target == 0]
    features_ones = features[target == 1]
    target_zeros = target[target == 0]
    target_ones = target[target == 1]

    
    features_upsampled = pd.concat([features_zeros] + [features_ones] * repeat)
    target_upsampled = pd.concat([target_zeros] + [target_ones] * repeat)
    
    
    features_upsampled, target_upsampled = shuffle(
        features_upsampled, target_upsampled, random_state=12345)
    
    return features_upsampled, target_upsampled

In [29]:
features_upsampled, target_upsampled = upsample(features_train, target_train, 4)

In [30]:
target_upsampled.value_counts()

1    4892
0    4777
Name: Exited, dtype: int64
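
The value `repeat = 4` matches the class ratio, 4777 / 1223 ≈ 3.9. A sketch of deriving it from the counts, on a tiny made-up target (8 negatives and 2 positives, assumed for illustration):

```python
import pandas as pd
from sklearn.utils import shuffle

# Toy data: 8 negatives and 2 positives.
target = pd.Series([0] * 8 + [1] * 2, name="Exited")
features = pd.DataFrame({"x": range(10)})

# Repeat the positive class about as many times as it is outnumbered.
counts = target.value_counts()
repeat = int(round(counts.loc[0] / counts.loc[1]))

features_up = pd.concat([features[target == 0]] + [features[target == 1]] * repeat)
target_up = pd.concat([target[target == 0]] + [target[target == 1]] * repeat)

# Shuffle both objects together so the rows stay aligned.
features_up, target_up = shuffle(features_up, target_up, random_state=12345)

print(repeat)
print(target_up.value_counts().to_dict())
```

Only the training set should be upsampled. The validation and test sets keep the original class ratio so that the metrics reflect real data.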

Looks great. Let's use it for our models

### 3.1.1 Random Forest

In [31]:
for i in range(5, 21, 1):
    model_RF_uns = RandomForestClassifier(n_estimators=50, max_depth=i, random_state=SEED, min_samples_split=10, 
                                  min_samples_leaf=5)
    model_RF_uns.fit(features_upsampled, target_upsampled)
    predicted_valid_uns = model_RF_uns.predict(features_valid)

    print("max_depth =", i, ": ", end='')
    print('F1-score = {:.4f}'.format(f1_score(target_valid, predicted_valid_uns)))

max_depth = 5 : F1-score = 0.5936
max_depth = 6 : F1-score = 0.6022
max_depth = 7 : F1-score = 0.6030
max_depth = 8 : F1-score = 0.6225
max_depth = 9 : F1-score = 0.6218
max_depth = 10 : F1-score = 0.6423
max_depth = 11 : F1-score = 0.6257
max_depth = 12 : F1-score = 0.6352
max_depth = 13 : F1-score = 0.6345
max_depth = 14 : F1-score = 0.6266
max_depth = 15 : F1-score = 0.6292
max_depth = 16 : F1-score = 0.6375
max_depth = 17 : F1-score = 0.6353
max_depth = 18 : F1-score = 0.6370
max_depth = 19 : F1-score = 0.6447
max_depth = 20 : F1-score = 0.6446


### 3.1.2 Decision Tree 

In [32]:
for i in range(5, 16, 1):    
    model_dt2 = DecisionTreeClassifier(random_state=SEED, max_depth=i, min_samples_split=10, 
                                  min_samples_leaf=7)
    model_dt2.fit(features_upsampled, target_upsampled)
    predicted_valid_dt2 = model_dt2.predict(features_valid)

    print("max_depth =", i, ": ", end='')
    print('F1-score = {:.4f}'.format(f1_score(target_valid, predicted_valid_dt2)))

max_depth = 5 : F1-score = 0.5626
max_depth = 6 : F1-score = 0.5773
max_depth = 7 : F1-score = 0.5652
max_depth = 8 : F1-score = 0.5448
max_depth = 9 : F1-score = 0.5400
max_depth = 10 : F1-score = 0.5256
max_depth = 11 : F1-score = 0.5249
max_depth = 12 : F1-score = 0.5279
max_depth = 13 : F1-score = 0.5080
max_depth = 14 : F1-score = 0.5051
max_depth = 15 : F1-score = 0.5054


### 3.1.3 Logistic Regression

In [33]:
model_regression2 = LogisticRegression(random_state=SEED, solver='newton-cg', penalty='none')
model_regression2.fit(features_upsampled, target_upsampled)



LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='auto', n_jobs=None, penalty='none',
                   random_state=12345, solver='newton-cg', tol=0.0001,
                   verbose=0, warm_start=False)

In [34]:
predictions_regression2 = model_regression2.predict(features_valid)
print('F1-score = {:.4f}'.format(f1_score(target_valid, predictions_regression2)))

F1-score = 0.5068


### Conclusion

The random forest reaches F1 = 0.64 (best at `max_depth=19`). The decision tree did not improve: its best score is 0.58 against 0.59 before. Logistic regression improved from 0.32 to 0.51.

## 3.2 Downsampling

The `downsample` function keeps only a `fraction` of the negative-class rows to shrink the sample.

In [35]:
def downsample(features, target, fraction):
    features_zeros = features[target == 0]
    features_ones = features[target == 1]
    target_zeros = target[target == 0]
    target_ones = target[target == 1]

    features_downsampled = pd.concat(
        [features_zeros.sample(frac=fraction, random_state=12345)] + [features_ones])
    target_downsampled = pd.concat(
        [target_zeros.sample(frac=fraction, random_state=12345)] + [target_ones])
    
    features_downsampled, target_downsampled = shuffle(
        features_downsampled, target_downsampled, random_state=12345)
    
    return features_downsampled, target_downsampled

In [36]:
features_downsampled, target_downsampled = downsample(features_train, target_train, 0.25)

In [37]:
target_downsampled.value_counts()

1    1223
0    1194
Name: Exited, dtype: int64
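
The fraction 0.25 is close to the ratio of the class counts, 1223 / 4777 ≈ 0.256. The sketch below rebuilds a target with the same counts (the Series is constructed here for illustration) and checks how many negative rows `sample(frac=0.25)` keeps.

```python
import pandas as pd

# Target rebuilt from the training class counts reported in the notebook.
target_train = pd.Series([0] * 4777 + [1] * 1223, name="Exited")

counts = target_train.value_counts()
fraction = counts.loc[1] / counts.loc[0]

# Keep a quarter of the negative rows, as in the downsample call above.
zeros_kept = target_train[target_train == 0].sample(frac=0.25, random_state=12345)

print(round(fraction, 3), len(zeros_kept))
```

Downsampling discards about three quarters of the negative examples, so the models see far less data than with upsampling.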

Looks great! Let's use it

### 3.2.1 Random Forest

In [38]:
for i in range(5, 21, 1):
    model_rf3 = RandomForestClassifier(n_estimators=50, max_depth=i, random_state=SEED, min_samples_split=10, 
                                  min_samples_leaf=5)
    model_rf3.fit(features_downsampled, target_downsampled)
    predicted_valid_rf3 = model_rf3.predict(features_valid)

    print("max_depth =", i, ": ", end='')
    print('F1-score = {:.4f}'.format(f1_score(target_valid, predicted_valid_rf3)))

max_depth = 5 : F1-score = 0.6030
max_depth = 6 : F1-score = 0.5996
max_depth = 7 : F1-score = 0.6061
max_depth = 8 : F1-score = 0.6064
max_depth = 9 : F1-score = 0.6086
max_depth = 10 : F1-score = 0.6072
max_depth = 11 : F1-score = 0.6070
max_depth = 12 : F1-score = 0.6017
max_depth = 13 : F1-score = 0.5988
max_depth = 14 : F1-score = 0.5962
max_depth = 15 : F1-score = 0.5866
max_depth = 16 : F1-score = 0.5877
max_depth = 17 : F1-score = 0.5913
max_depth = 18 : F1-score = 0.5945
max_depth = 19 : F1-score = 0.5945
max_depth = 20 : F1-score = 0.5945


### 3.2.2 Decision Tree

In [39]:
for i in range(5, 16, 1):    
    model_DT = DecisionTreeClassifier(random_state=SEED, max_depth=i, min_samples_split=10, 
                                  min_samples_leaf=7)
    model_DT.fit(features_downsampled, target_downsampled)
    predicted_valid_DT = model_DT.predict(features_valid)

    print("max_depth =", i, ": ", end='')
    print('F1-score = {:.4f}'.format(f1_score(target_valid, predicted_valid_DT)))

max_depth = 5 : F1-score = 0.5547
max_depth = 6 : F1-score = 0.5936
max_depth = 7 : F1-score = 0.5636
max_depth = 8 : F1-score = 0.5657
max_depth = 9 : F1-score = 0.5283
max_depth = 10 : F1-score = 0.5237
max_depth = 11 : F1-score = 0.5425
max_depth = 12 : F1-score = 0.5403
max_depth = 13 : F1-score = 0.5197
max_depth = 14 : F1-score = 0.5153
max_depth = 15 : F1-score = 0.5196


### 3.2.3 Logistic Regression

In [40]:
model_regression3 = LogisticRegression(random_state=SEED, solver='newton-cg', penalty='none')
model_regression3.fit(features_downsampled, target_downsampled)



LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='auto', n_jobs=None, penalty='none',
                   random_state=12345, solver='newton-cg', tol=0.0001,
                   verbose=0, warm_start=False)

In [41]:
predictions_regression3 = model_regression3.predict(features_valid)
print('F1-score = {:.4f}'.format(f1_score(target_valid, predictions_regression3)))

F1-score = 0.5043


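Downsampling gives results close to upsampling for the decision tree (best F1 0.59) and logistic regression (0.50), while the random forest peaks lower (0.61 against 0.64).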

# 4. AUC-ROC

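For the final test, we select the random forest because it gave the best validation results. It is trained on the upsampled data with `max_depth=16` and `class_weight='balanced'`, and a customer is classified as exiting when the predicted probability exceeds 0.55. We calculate the F1 score and the AUC-ROC metric on the test set.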

In [42]:
model_test = RandomForestClassifier(n_estimators=50, max_depth=16, random_state=SEED, min_samples_split=10, 
                                  min_samples_leaf=5, class_weight='balanced')
model_test.fit(features_upsampled, target_upsampled)

probabilities_test = model_test.predict_proba(features_test)
probabilities_one_test = probabilities_test[:, 1]
predict_test = model_test.predict(features_test)
print('F1-score: {:.3f}'.format(f1_score(target_test, probabilities_one_test>0.55)))
print('AUC-ROC: {:.3f}'.format(roc_auc_score(target_test, probabilities_one_test)))

F1-score: 0.615
AUC-ROC: 0.861
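
A probability threshold such as 0.55 is usually tuned on the validation set rather than on the test set. The sketch below shows one way to do that on synthetic data; the data, the threshold grid and the model settings are assumptions for illustration and do not describe how 0.55 was chosen in this project.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data standing in for the churn features.
X, y = make_classification(n_samples=1000, n_features=8, weights=[0.8, 0.2],
                           random_state=12345)
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.25, random_state=12345, stratify=y)

model = RandomForestClassifier(n_estimators=50, max_depth=8,
                               class_weight="balanced", random_state=12345)
model.fit(X_train, y_train)
prob_valid = model.predict_proba(X_valid)[:, 1]

# F1 changes with the threshold, so scan a small grid on the validation set.
thresholds = np.arange(0.30, 0.71, 0.05)
scores = [f1_score(y_valid, (prob_valid > t).astype(int)) for t in thresholds]
best_threshold = thresholds[int(np.argmax(scores))]

# AUC-ROC is threshold-free: it is the same whichever cut-off is chosen.
auc = roc_auc_score(y_valid, prob_valid)

print("best threshold = {:.2f}".format(best_threshold))
print("AUC-ROC = {:.3f}".format(auc))
```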


# Summary

The goal of the project was to predict which customers will leave the bank in the near future. Missing values were found in the `Tenure` column and replaced with the column median. Categorical features were converted to numerical ones with one-hot encoding. The dataset was divided into training, validation and test samples. The numeric features were standardized so that they are on the same scale.

The data showed a class imbalance. As the task required, models were trained both without and with correction for the imbalance. Three models were studied: decision tree, random forest and logistic regression. The imbalance was corrected in two ways: upsampling and downsampling. The best F1 scores were obtained with the random forest on the upsampled data.

The following results were obtained on the test sample:
* F1 = 0.615
* AUC-ROC = 0.861

The F1 score satisfies the project requirement (at least 0.59). The AUC-ROC of 0.861 is well above the 0.5 of a random classifier, which indicates that the model ranks customers well regardless of the chosen threshold.