# Project 'Bank clients outflow'

This project was completed as part of the Yandex.Practicum Data Scientist professional program.

**Key words**: SupervisedLearning, model, LogisticRegression, RandomForestClassifier, f1_score, OrdinalEncoder 

**Libraries used**: pandas, sklearn

## Table of contents

- [Project's goal](#goal)
- [Data preparation](#dp)
- [Splitting sets](#ss)
- [Training](#training)
- [Final test](#finaltest)
- [Conclusion](#conclusion)

## Project's goal<a id='goal'></a>

We work at Beta Bank, which is suffering from client outflow.
We need to **predict** whether a **customer will leave the bank soon**. We have data on **clients’ past behavior** and **termination of contracts** with the bank.
Our task is to **build a model** with the **maximum possible F1 score**. The acceptable **F1 score** is at least **0.59** on the test set.
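
For reference, F1 is the harmonic mean of precision and recall. The sketch below uses made-up labels (`y_true` and `y_pred` are illustrative, not taken from the project data) to show that a manual computation agrees with `sklearn.metrics.f1_score`:

```python
from sklearn.metrics import f1_score, precision_score, recall_score

# Hypothetical labels for illustration only (1 = client left, 0 = client stayed)
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

precision = precision_score(y_true, y_pred)  # TP / (TP + FP)
recall = recall_score(y_true, y_pred)        # TP / (TP + FN)

# Harmonic mean of precision and recall
f1_manual = 2 * precision * recall / (precision + recall)

print(precision, recall, f1_manual, f1_score(y_true, y_pred))
```

Here there are 3 true positives, 1 false positive and 1 false negative, so precision, recall and F1 are all 0.75.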


In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.metrics import roc_auc_score
from sklearn.preprocessing import OrdinalEncoder
from sklearn.utils import shuffle

In [8]:
import warnings
warnings.filterwarnings('ignore')

## Data preparation<a id='dp'></a>

In [9]:
df = pd.read_csv('/datasets/Churn.csv')

In [10]:
df.head()

| | RowNumber | CustomerId | Surname | CreditScore | Geography | Gender | Age | Tenure | Balance | NumOfProducts | HasCrCard | IsActiveMember | EstimatedSalary | Exited |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 15634602 | Hargrave | 619 | France | Female | 42 | 2.0 | 0.0 | 1 | 1 | 1 | 101348.88 | 1 |
| 1 | 2 | 15647311 | Hill | 608 | Spain | Female | 41 | 1.0 | 83807.86 | 1 | 0 | 1 | 112542.58 | 0 |
| 2 | 3 | 15619304 | Onio | 502 | France | Female | 42 | 8.0 | 159660.8 | 3 | 1 | 0 | 113931.57 | 1 |
| 3 | 4 | 15701354 | Boni | 699 | France | Female | 39 | 1.0 | 0.0 | 2 | 0 | 0 | 93826.63 | 0 |
| 4 | 5 | 15737888 | Mitchell | 850 | Spain | Female | 43 | 2.0 | 125510.82 | 1 | 1 | 1 | 79084.1 | 0 |


In [11]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 14 columns):
RowNumber          10000 non-null int64
CustomerId         10000 non-null int64
Surname            10000 non-null object
CreditScore        10000 non-null int64
Geography          10000 non-null object
Gender             10000 non-null object
Age                10000 non-null int64
Tenure             9091 non-null float64
Balance            10000 non-null float64
NumOfProducts      10000 non-null int64
HasCrCard          10000 non-null int64
IsActiveMember     10000 non-null int64
EstimatedSalary    10000 non-null float64
Exited             10000 non-null int64
dtypes: float64(3), int64(8), object(3)
memory usage: 1.1+ MB


###### The Tenure column contains missing values (9091 non-null out of 10000)

In [12]:
df = df.drop(['RowNumber','CustomerId','Surname'] , axis=1)

###### Columns 'RowNumber', 'CustomerId', 'Surname' are not needed for training

In [13]:
df['Tenure'] = df['Tenure'].fillna(df['Tenure'].median())

###### Filling missing Tenure values with the median
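
As a small illustration of the median fill, the sketch below applies the same `fillna(median)` pattern to a toy series (the values are made up, not taken from `Churn.csv`):

```python
import numpy as np
import pandas as pd

# Toy column with gaps; the values are made up for illustration
tenure = pd.Series([2.0, np.nan, 8.0, 1.0, np.nan, 5.0], name='Tenure')

median = tenure.median()        # NaN values are skipped by default
filled = tenure.fillna(median)  # every gap receives the same median value

print(median)
print(filled.isna().sum())
```

The median of the four known values (1, 2, 5, 8) is 3.5, and no missing values remain after the fill.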

In [14]:
encoder = OrdinalEncoder()

###### I use OrdinalEncoder for tree algorithms

In [15]:
encoder.fit(df)

OrdinalEncoder(categories='auto', dtype=<class 'numpy.float64'>)

In [16]:
df = pd.DataFrame(encoder.fit_transform(df),columns=df.columns)

In [17]:
df.head()

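| | CreditScore | Geography | Gender | Age | Tenure | Balance | NumOfProducts | HasCrCard | IsActiveMember | EstimatedSalary | Exited |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 228.0 | 0.0 | 0.0 | 24.0 | 2.0 | 0.0 | 0.0 | 1.0 | 1.0 | 5068.0 | 1.0 |
| 1 | 217.0 | 2.0 | 0.0 | 23.0 | 1.0 | 743.0 | 0.0 | 0.0 | 1.0 | 5639.0 | 0.0 |
| 2 | 111.0 | 0.0 | 0.0 | 24.0 | 8.0 | 5793.0 | 2.0 | 1.0 | 0.0 | 5707.0 | 1.0 |
| 3 | 308.0 | 0.0 | 0.0 | 21.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 4704.0 | 0.0 |
| 4 | 459.0 | 2.0 | 0.0 | 25.0 | 2.0 | 3696.0 | 0.0 | 1.0 | 1.0 | 3925.0 | 0.0 |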


###### Seems ok
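
Note that fitting the encoder on the whole frame also replaces numeric columns with category indices (for example, CreditScore 619 becomes 228.0 in the output above). Tree models only depend on the ordering of values, so this is workable, but a possible variation is to encode only the string columns. The sketch below is not what the notebook does; the frame `toy` and the list `cat_cols` are hypothetical:

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Small made-up frame mirroring a few column names from the dataset
toy = pd.DataFrame({
    'CreditScore': [619, 608, 502],
    'Geography': ['France', 'Spain', 'France'],
    'Gender': ['Female', 'Female', 'Male'],
})

# Encode only the categorical columns and leave numeric ones untouched
cat_cols = ['Geography', 'Gender']

encoded = toy.copy()
encoded[cat_cols] = OrdinalEncoder().fit_transform(toy[cat_cols])

print(encoded)
```

Categories are numbered in sorted order, so France maps to 0 and Spain to 1, while CreditScore keeps its original values.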

## Splitting sets <a id='ss'></a>

In [18]:
target = df['Exited']

In [19]:
features = df.drop(['Exited'] , axis=1)

In [20]:
features_train, f_test_valid, target_train, t_test_valid = train_test_split(features, target, test_size=0.4, random_state=12345)

In [21]:
features_valid, features_test, target_valid, target_test = train_test_split(f_test_valid, t_test_valid, test_size=0.5, random_state=12345)

In [22]:
df.shape, features_train.shape, features_valid.shape, features_test.shape

((10000, 11), (6000, 10), (2000, 10), (2000, 10))

###### We have 60% for the training set and 20% each for the validation and test sets
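
The two-step split can be checked on synthetic data. In the sketch below, `X` and `y` are made-up stand-ins for the features and target; the split parameters match the cells above:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the features and target (100 rows)
X = np.arange(200).reshape(100, 2)
y = np.arange(100) % 2

# First split: 60% train, 40% held out
X_train, X_rest, y_train, y_rest = train_test_split(
    X, y, test_size=0.4, random_state=12345)

# Second split: halve the held-out part into validation and test
X_valid, X_test, y_valid, y_test = train_test_split(
    X_rest, y_rest, test_size=0.5, random_state=12345)

print(len(X_train), len(X_valid), len(X_test))
```

With 100 rows this gives 60, 20 and 20 rows, the same 60/20/20 proportions as in the project.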

## Training<a id='training'></a>

### Without balancing

In [23]:
target.mean()

0.2037

**There is a class imbalance in the dataset: about 20 percent of clients left the bank**
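
With this imbalance, accuracy alone can be misleading, which is one reason F1 is the target metric. A minimal sketch on a synthetic target with 20% positives (the arrays are made up) compares the two metrics for a model that always predicts that the client stays:

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

# Synthetic target with 20% positives, mimicking the imbalance above
y_true = np.array([1] * 20 + [0] * 80)

# A constant model that always predicts "client stays"
y_const = np.zeros_like(y_true)

acc = accuracy_score(y_true, y_const)
# zero_division=0 returns 0.0 without a warning when nothing is predicted positive
f1 = f1_score(y_true, y_const, zero_division=0)

print(acc, f1)
```

The constant model reaches 0.8 accuracy but an F1 score of 0.0, because it never identifies a leaving client.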

In [24]:
def get_auc_roc(model,features,target):
    probabilities_valid = model.predict_proba(features)
    probabilities_one_valid = probabilities_valid[:, 1]
    return roc_auc_score(target,probabilities_one_valid)    

In [25]:
def TrainDTC(max_depth_dtc,features_train_set,target_train_set):
    model_DTC = DecisionTreeClassifier(random_state=1, max_depth=max_depth_dtc)
    model_DTC.fit(features_train_set,target_train_set)
    
    return model_DTC

In [26]:
def TestDTC(max_max_depth,features_train_set,target_train_set,features_valid_set,target_valid_set):
    results = {}
    
    for i in range(max_max_depth):
        model_DTC = TrainDTC(i+1,features_train_set,target_train_set)
        predicted_valid = model_DTC.predict(features_valid_set)
        score = f1_score(target_valid_set,predicted_valid)
        results[i+1] = score
    
    results = sorted(results.items(), key=lambda x: x[1], reverse=True)
    
    model_DTC = TrainDTC(results[0][0],features_train_set,target_train_set)
    
    print('max_depth:', results[0][0])
    print('f1 score:',results[0][1])
    print('auc_roc:',get_auc_roc(model_DTC,features_valid_set,target_valid_set))

In [27]:
TestDTC(100,features_train,target_train,features_valid,target_valid)

max_depth: 4
f1 score: 0.5528700906344411
auc_roc: 0.8203012055480615


**For DecisionTreeClassifier the best F1 score is 55.29%, with AUC-ROC of 82.03%**
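
AUC-ROC is computed from the predicted probability of class 1 rather than from hard labels, which is why `get_auc_roc` calls `predict_proba`. It can be read as the share of (positive, negative) pairs in which the positive example receives the higher score. A small sketch with made-up scores:

```python
from sklearn.metrics import roc_auc_score

# Made-up labels and predicted probabilities of class 1
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]

auc = roc_auc_score(y_true, scores)

# Manual check: count pairs where the positive outranks the negative
positives = [s for s, t in zip(scores, y_true) if t == 1]
negatives = [s for s, t in zip(scores, y_true) if t == 0]
pairs = [(p, n) for p in positives for n in negatives]
auc_manual = sum(p > n for p, n in pairs) / len(pairs)

print(auc, auc_manual)
```

Three of the four pairs are ranked correctly, so both values are 0.75.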

In [28]:
def TrainRFC(max_estimator,max_depth_rfc,features_train_set,target_train_set):
    model_RFC = RandomForestClassifier(random_state=1, max_depth=max_depth_rfc, n_estimators = max_estimator)
    model_RFC.fit(features_train_set,target_train_set)
    
    return model_RFC

In [36]:
def TestRFC(max_estimators,max_max_depth, features_train_set,target_train_set,features_valid_set,target_valid_set):
    results = []
    
    for i in range(max_estimators):
        for n in range(max_max_depth):
            model = TrainRFC(i+1,n+1,features_train_set,target_train_set)
            predicted_valid = model.predict(features_valid_set)
            score = f1_score(target_valid_set,predicted_valid)
            results.append([i+1,n+1,score,get_auc_roc(model,features_valid_set,target_valid_set)])
    
    df_results = pd.DataFrame(results,columns = ['estimator','max_depth','f1_score','auc_roc'])
    df_results = df_results.sort_values(by='f1_score',ascending = False)
    print(df_results.head(10))

In [30]:
TestRFC(50,20,features_train,target_train,features_valid,target_valid)

     estimator  max_depth  f1_score   auc_roc
894         45         15  0.596125  0.841513
934         47         15  0.594918  0.842053
458         23         19  0.593974  0.827492
974         49         15  0.593703  0.841773
954         48         15  0.593703  0.841719
854         43         15  0.591045  0.842011
914         46         15  0.590705  0.841214
418         21         19  0.589744  0.825568
514         26         15  0.589286  0.838995
257         13         18  0.588406  0.818524


**For RandomForestClassifier the best F1 score is 59.61% with AUC-ROC of 84.15%, so it is the best model without class balancing**
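
The nested loops in `TestRFC` amount to a grid search over `n_estimators` and `max_depth`. As an alternative sketch (not the notebook's code), scikit-learn's `ParameterGrid` can enumerate the same kind of combinations. The synthetic data from `make_classification` and the deliberately small grid below are assumptions for illustration only:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import ParameterGrid, train_test_split

# Synthetic imbalanced data standing in for the bank dataset
X, y = make_classification(
    n_samples=500, n_features=10, weights=[0.8, 0.2], random_state=1)
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.25, random_state=12345)

# A small grid; the notebook scans 1..50 estimators and depths 1..20
grid = ParameterGrid({'n_estimators': [10, 30], 'max_depth': [3, 6]})

results = []
for params in grid:
    model = RandomForestClassifier(random_state=1, **params)
    model.fit(X_train, y_train)
    score = f1_score(y_valid, model.predict(X_valid))
    results.append((score, params))

# Pick the combination with the highest validation F1
best_score, best_params = max(results, key=lambda r: r[0])
print(best_params, best_score)
```

Each combination is scored on the validation split only, so the test set stays untouched until the final check.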

### Upsampling

In [31]:
def upsample(features, target, repeat):
    features_zeros = features[target == 0]
    features_ones = features[target == 1]
    target_zeros = target[target == 0]
    target_ones = target[target == 1]

    features_upsampled = pd.concat([features_zeros] + [features_ones] * repeat)
    target_upsampled = pd.concat([target_zeros] + [target_ones] * repeat)
    
    features_upsampled, target_upsampled = shuffle(
        features_upsampled, target_upsampled, random_state=12345)
    
    return features_upsampled, target_upsampled

In [32]:
features_train_upsampled, target_train_upsampled = upsample(features_train, target_train, 4)

In [33]:
target_train_upsampled.mean(), target_train_upsampled.count()

(0.49895702962035876, 9588)

**After upsampling the classes are balanced, with almost 10k entries**
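
The effect of upsampling on class proportions is easy to verify on a toy frame. The sketch below uses a compact version of the `upsample` function with made-up data (8 negatives, 2 positives):

```python
import pandas as pd
from sklearn.utils import shuffle

def upsample(features, target, repeat):
    # Repeat the positive-class rows `repeat` times, then shuffle both together
    features_up = pd.concat(
        [features[target == 0]] + [features[target == 1]] * repeat)
    target_up = pd.concat(
        [target[target == 0]] + [target[target == 1]] * repeat)
    return shuffle(features_up, target_up, random_state=12345)

# Made-up data: 8 negatives, 2 positives
toy_features = pd.DataFrame({'x': range(10)})
toy_target = pd.Series([0] * 8 + [1] * 2)

up_features, up_target = upsample(toy_features, toy_target, 4)

print(len(up_target), up_target.mean())
```

Repeating the 2 positives 4 times gives 8 positives against 8 negatives, so the share of positives becomes 0.5. Only the training set should be upsampled, so that validation and test data keep the original class distribution.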

In [34]:
TestDTC(100,features_train_upsampled,target_train_upsampled,features_valid,target_valid)

max_depth: 5
f1 score: 0.5894962486602359
auc_roc: 0.8197447056902112


**After upsampling, DecisionTreeClassifier reaches an F1 score of 58.95% with AUC-ROC of 81.97%. The F1 score is better than without balancing, while AUC-ROC is essentially unchanged**

In [35]:
TestRFC(50,20,features_train_upsampled,target_train_upsampled,features_valid,target_valid)

     estimator  max_depth  f1_score   auc_roc
908         46          9  0.628998  0.851988
928         47          9  0.628755  0.852322
888         45          9  0.627535  0.852066
868         44          9  0.626866  0.852192
967         49          8  0.626033  0.851638
947         48          8  0.625387  0.851944
988         50          9  0.625268  0.851924
867         44          8  0.625259  0.852537
927         47          8  0.625259  0.852339
827         42          8  0.625130  0.852110


**For RandomForestClassifier F1 is 62.90% and AUC-ROC is 85.20%. This is better than without balancing**

### Downsampling

In [37]:
def downsample(features, target, fraction):
    features_zeros = features[target == 0]
    features_ones = features[target == 1]
    target_zeros = target[target == 0]
    target_ones = target[target == 1]

    features_downsampled = pd.concat(
        [features_zeros.sample(frac=fraction, random_state=12345)] + [features_ones])
    target_downsampled = pd.concat(
        [target_zeros.sample(frac=fraction, random_state=12345)] + [target_ones])
    
    features_downsampled, target_downsampled = shuffle(
        features_downsampled, target_downsampled, random_state=12345)
    
    return features_downsampled, target_downsampled

In [38]:
features_downsampled, target_downsampled = downsample(features_train, target_train, 0.25)

In [39]:
target_downsampled.mean(),target_downsampled.count()

(0.49895702962035876, 2397)

**After downsampling the dataset has 2397 entries, which is 4 times fewer than after upsampling**
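
The same kind of check works for downsampling. The sketch below uses a compact version of `downsample` on made-up data (16 negatives, 4 positives). Features and target are sampled separately, so the rows stay aligned only because both calls use the same `random_state`:

```python
import pandas as pd
from sklearn.utils import shuffle

def downsample(features, target, fraction):
    # Keep a random `fraction` of negative rows and all positive rows
    features_zeros = features[target == 0]
    target_zeros = target[target == 0]
    features_down = pd.concat(
        [features_zeros.sample(frac=fraction, random_state=12345),
         features[target == 1]])
    target_down = pd.concat(
        [target_zeros.sample(frac=fraction, random_state=12345),
         target[target == 1]])
    return shuffle(features_down, target_down, random_state=12345)

# Made-up data: 16 negatives, 4 positives
toy_features = pd.DataFrame({'x': range(20)})
toy_target = pd.Series([0] * 16 + [1] * 4)

down_features, down_target = downsample(toy_features, toy_target, 0.25)

print(len(down_target), down_target.mean())
print((down_features.index == down_target.index).all())
```

A quarter of the 16 negatives is kept, leaving 4 negatives and 4 positives.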

In [40]:
TestDTC(100,features_downsampled,target_downsampled,features_valid,target_valid)

max_depth: 6
f1 score: 0.5600706713780919
auc_roc: 0.8093375534572552


In [41]:
TestRFC(50,20,features_downsampled,target_downsampled,features_valid,target_valid)

     estimator  max_depth  f1_score   auc_roc
706         36          7  0.608440  0.851759
686         35          7  0.607595  0.851477
546         28          7  0.607522  0.851427
847         43          8  0.606936  0.852165
566         29          7  0.606352  0.851330
666         34          7  0.606002  0.851142
785         40          6  0.605469  0.848637
746         38          7  0.605289  0.851345
606         31          7  0.605010  0.850657
626         32          7  0.605010  0.851229


**Results after downsampling are worse than after upsampling**

## Final test <a id='finaltest'></a>

**So far the best model is RandomForestClassifier after upsampling, with 46 estimators and max depth 9. Next it is trained on the combined train+valid set**

In [42]:
total_features_frames = [features_train,features_valid]

In [43]:
total_target_frames = [target_train,target_valid]

In [44]:
total_features = pd.concat(total_features_frames)

In [45]:
total_target = pd.concat(total_target_frames)

**Upsample dataset**

In [46]:
features_total_upsampled, target_total_upsampled = upsample(total_features, total_target, 4)

In [47]:
target_total_upsampled.mean(),target_total_upsampled.count()

(0.5027254321756736, 12842)

**Seems ok**

In [48]:
fmodel = TrainRFC(46,9,features_total_upsampled,target_total_upsampled)

In [49]:
predicted_test = fmodel.predict(features_test)

In [50]:
f1_score(target_test,predicted_test)

0.6053169734151329

**The F1 score of 0.605 meets the task requirement of at least 0.59**

In [51]:
get_auc_roc(fmodel,features_test,target_test)

0.8585727756115916

**AUC-ROC is 85.86%**

## Conclusion<a id='conclusion'></a>

**After training different models, we get a good enough score with the random forest classifier: an F1 score of 0.605 on the test set. To eliminate class imbalance we used upsampling, which showed better results than downsampling**