# Project Background

Beta Bank customers are leaving little by little, with more of them churning every month. The bankers figured out that it's cheaper to retain existing customers than to attract new ones.
We need to predict whether a customer will leave the bank soon. We have data on clients' past behavior and on contract terminations with the bank.

# Data Description

- RowNumber — row index in the data
- CustomerId — unique customer identifier
- Surname — surname
- CreditScore — credit score
- Geography — country of residence
- Gender — gender
- Age — age
- Tenure — period of maturation for a customer’s fixed deposit (years)
- Balance — account balance
- NumOfProducts — number of banking products used by the customer
- HasCrCard — customer has a credit card
- IsActiveMember — customer’s activeness
- EstimatedSalary — estimated salary
- Exited — customer has left

**Importing Necessary Libraries**

In [128]:
import pandas as pd
import numpy as np
from sklearn.metrics import f1_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score 
from sklearn.model_selection import train_test_split
from catboost import CatBoostClassifier, Pool
from sklearn.metrics import roc_auc_score
from sklearn.utils import shuffle

#### Data Exploration

In [129]:
data = pd.read_csv('/Users/rsavy/Downloads/Churn.csv')
data.head(10)

Unnamed: 0,RowNumber,CustomerId,Surname,CreditScore,Geography,Gender,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Exited
0,1,15634602,Hargrave,619,France,Female,42,2.0,0.0,1,1,1,101348.88,1
1,2,15647311,Hill,608,Spain,Female,41,1.0,83807.86,1,0,1,112542.58,0
2,3,15619304,Onio,502,France,Female,42,8.0,159660.8,3,1,0,113931.57,1
3,4,15701354,Boni,699,France,Female,39,1.0,0.0,2,0,0,93826.63,0
4,5,15737888,Mitchell,850,Spain,Female,43,2.0,125510.82,1,1,1,79084.1,0
5,6,15574012,Chu,645,Spain,Male,44,8.0,113755.78,2,1,0,149756.71,1
6,7,15592531,Bartlett,822,France,Male,50,7.0,0.0,2,1,1,10062.8,0
7,8,15656148,Obinna,376,Germany,Female,29,4.0,115046.74,4,1,0,119346.88,1
8,9,15792365,He,501,France,Male,44,4.0,142051.07,2,0,1,74940.5,0
9,10,15592389,H?,684,France,Male,27,2.0,134603.88,1,1,1,71725.73,0


In [130]:
print("The size of table is", data.shape)

The size of table is (10000, 14)


In [131]:
data.isnull().sum()

RowNumber            0
CustomerId           0
Surname              0
CreditScore          0
Geography            0
Gender               0
Age                  0
Tenure             909
Balance              0
NumOfProducts        0
HasCrCard            0
IsActiveMember       0
EstimatedSalary      0
Exited               0
dtype: int64

There are 909 missing values in the Tenure column. Tenure reflects the period of maturation for a customer's fixed deposit (years). Let's look more closely at the rows with missing values.

In [132]:
null_data = data[data.isna().any(axis=1)]
null_data.sample(20)

Unnamed: 0,RowNumber,CustomerId,Surname,CreditScore,Geography,Gender,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Exited
992,993,15724563,Hawkins,752,Germany,Female,42,,65046.08,2,0,1,140139.28,0
353,354,15812007,Power,670,Spain,Male,25,,0.0,2,1,1,78358.94,0
3586,3587,15652626,Grave,826,France,Male,55,,115285.85,1,1,0,140126.17,0
8148,8149,15572777,Meng,780,Spain,Male,47,,86006.21,1,1,1,37973.13,0
8430,8431,15775949,Trevisani,612,France,Female,38,,110615.47,1,1,1,193502.93,0
5866,5867,15600392,Amaechi,735,France,Female,53,,123845.36,2,0,1,170454.93,1
1771,1772,15633260,Dumetochukwu,600,France,Male,37,,142663.46,1,0,1,88669.89,0
9281,9282,15679966,Marsh,661,France,Female,31,,133964.3,1,1,1,166187.1,0
3778,3779,15658486,Gidney,579,Spain,Female,59,,148021.12,1,1,1,74878.22,0
8766,8767,15638159,Trentino,649,Spain,Female,36,,86607.39,1,0,0,19825.09,0


Above we can see that some of the customers with missing Tenure have a zero balance in their account. Let's see how many of them there are.

In [133]:
(null_data['Balance'].values == 0.00).sum()

334

In [134]:
(data['Balance'].values == 0.00).sum()

3617

There are 334 customers with missing Tenure and zero balance. We fill those missing values with zero, on the assumption that a customer with a zero balance holds no fixed deposit.

Also, let's look at the mean and median Tenure for the whole dataset.

In [135]:
data['Tenure'].describe()

count    9091.000000
mean        4.997690
std         2.894723
min         0.000000
25%         2.000000
50%         5.000000
75%         7.000000
max        10.000000
Name: Tenure, dtype: float64

On average, Tenure is about 5 years (mean 4.998, median 5). Now, let's fill the missing Tenure values with zero where the customer's balance is zero, and fill the rest of the missing values with the median of the column.

In [136]:
data['Tenure'] = np.where(data['Balance'] == 0.00, data['Tenure'].fillna(0), data['Tenure'])

In [137]:
data.isnull().sum()

RowNumber            0
CustomerId           0
Surname              0
CreditScore          0
Geography            0
Gender               0
Age                  0
Tenure             575
Balance              0
NumOfProducts        0
HasCrCard            0
IsActiveMember       0
EstimatedSalary      0
Exited               0
dtype: int64

In [138]:
# Only Tenure still has gaps, so fill that column explicitly
data['Tenure'] = data['Tenure'].fillna(data['Tenure'].median())

In [139]:
data.isnull().sum()

RowNumber          0
CustomerId         0
Surname            0
CreditScore        0
Geography          0
Gender             0
Age                0
Tenure             0
Balance            0
NumOfProducts      0
HasCrCard          0
IsActiveMember     0
EstimatedSalary    0
Exited             0
dtype: int64
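
The two-step fill used above can be sketched on a small toy frame. The values below are hypothetical and are not taken from Churn.csv; the point is only to show the order of operations: zero where Balance is zero first, then the median of what is left.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values, not taken from Churn.csv).
toy = pd.DataFrame({
    "Balance": [0.0, 50000.0, 0.0, 72000.0, 91000.0],
    "Tenure": [np.nan, 3.0, 4.0, np.nan, 7.0],
})

# Step 1: a missing Tenure becomes 0 only where Balance is 0.
zero_balance = toy["Balance"] == 0
toy.loc[zero_balance & toy["Tenure"].isna(), "Tenure"] = 0

# Step 2: the remaining gaps get the column median
# (median() skips NaN, so it is computed from the known values).
toy["Tenure"] = toy["Tenure"].fillna(toy["Tenure"].median())

print(toy)
```

Note that the median in step 2 is taken after the zeros from step 1 have been written, which mirrors the order of the cells above.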

# Model Training and Testing

**Dummy Trap**

In [140]:
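# Note: RowNumber, CustomerId and Surname are not dropped here, so every
# distinct surname becomes its own dummy column (2,944 feature columns in total).
data_ohe = pd.get_dummies(data, drop_first=True)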
target = data_ohe['Exited']
features = data_ohe.drop('Exited', axis=1)
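
Because `pd.get_dummies` encodes every text column it is given, identifier-like columns such as Surname inflate the feature matrix. A minimal sketch of one alternative, dropping the identifier columns before encoding, is shown below. The toy rows and the `id_columns` list are assumptions made for illustration; this is not what the cell above does, and the scores reported in this notebook come from the full 2,944-column encoding.

```python
import pandas as pd

# Toy rows (hypothetical values) with the same kinds of columns as the dataset.
toy = pd.DataFrame({
    "RowNumber": [1, 2, 3, 4],
    "CustomerId": [101, 102, 103, 104],
    "Surname": ["Adams", "Baker", "Clark", "Davis"],
    "CreditScore": [619, 608, 502, 699],
    "Geography": ["France", "Spain", "Germany", "France"],
    "Gender": ["Female", "Female", "Male", "Male"],
    "Exited": [1, 0, 1, 0],
})

# Assumed list of identifier columns that carry no predictive meaning.
id_columns = ["RowNumber", "CustomerId", "Surname"]

# Encoding everything: each distinct surname turns into a dummy column.
encoded_all = pd.get_dummies(toy, drop_first=True)

# Encoding after dropping identifiers: only Geography and Gender are expanded.
encoded_trimmed = pd.get_dummies(toy.drop(columns=id_columns), drop_first=True)

print(encoded_all.shape[1], encoded_trimmed.shape[1])
print(list(encoded_trimmed.columns))
```

With `drop_first=True`, one category per encoded column is omitted (here France and Female), which is what avoids the dummy trap.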

**Splitting Dataset**

In [141]:
# 60% train, then the remaining 40% is split evenly into validation and test
features_train, features_rem, target_train, target_rem = train_test_split(
    features, target, train_size=0.60, random_state=12345
)
features_valid, features_test, target_valid, target_test = train_test_split(
    features_rem, target_rem, test_size=0.5, random_state=12345
)

# Model Training

**Logistic Regression**

In [142]:
model = LogisticRegression(random_state=12345,class_weight='balanced',solver='liblinear')
model.fit(features_train, target_train)
predicted_valid = model.predict(features_valid)
print('F1:', f1_score(target_valid, predicted_valid))


F1: 0.5034602076124567
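
For reference, F1 is the harmonic mean of precision and recall for the positive class (here, Exited == 1). The sketch below computes it by hand on made-up labels; the helper name `f1_from_labels` and the label lists are hypothetical, and in the notebook itself `sklearn.metrics.f1_score` does this work.

```python
def f1_from_labels(y_true, y_pred):
    """F1 for the positive class (label 1), computed from raw counts."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    if tp == 0:
        return 0.0  # no correct positive prediction: F1 is defined as 0 here
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Made-up labels: 3 true positives, 1 false positive, 1 false negative.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 1, 0, 1, 0]
print(f1_from_labels(y_true, y_pred))

# A model that never predicts churn is right on half of these rows,
# yet its F1 is 0 because it finds no positive cases.
print(f1_from_labels(y_true, [0] * len(y_true)))
```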


**Random Forest Classifier**

In [143]:
for estimators in range(1,10):
    model = RandomForestClassifier(random_state=12345, n_estimators=estimators)
    fit=model.fit(features_train, target_train)
    predicted_valid = fit.predict(features_valid)
    
    print("estimators =", estimators, "F1 Score: ", end='')
    print(f1_score(target_valid, predicted_valid))

estimators = 1 F1 Score: 0.43283582089552236
estimators = 2 F1 Score: 0.2935779816513761
estimators = 3 F1 Score: 0.45454545454545453
estimators = 4 F1 Score: 0.3596330275229358
estimators = 5 F1 Score: 0.44976076555023925
estimators = 6 F1 Score: 0.38686131386861317
estimators = 7 F1 Score: 0.49190938511326865
estimators = 8 F1 Score: 0.450354609929078
estimators = 9 F1 Score: 0.4750830564784053
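
The scores printed above can be compared programmatically rather than by eye. The sketch below simply copies the printed values into a dictionary (the name `f1_by_estimators` is an assumption for illustration) and picks the setting with the highest validation F1.

```python
# Validation F1 scores copied from the loop output above.
f1_by_estimators = {
    1: 0.43283582089552236,
    2: 0.2935779816513761,
    3: 0.45454545454545453,
    4: 0.3596330275229358,
    5: 0.44976076555023925,
    6: 0.38686131386861317,
    7: 0.49190938511326865,
    8: 0.450354609929078,
    9: 0.4750830564784053,
}

# Pick the number of trees with the highest F1.
best_n = max(f1_by_estimators, key=f1_by_estimators.get)
print(best_n, f1_by_estimators[best_n])
```

In the loop itself, the same effect could be had by tracking the best score while iterating instead of printing every value.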


**Class Balance Adjustment (Downsampling)**

In [144]:
features_zeros = features_train[target_train == 0]
features_ones = features_train[target_train== 1]
target_zeros = target_train[target_train == 0]
target_ones = target_train[target_train == 1]

In [145]:
print(features_zeros.shape)
print(features_ones.shape)
print(target_zeros.shape)
print(target_ones.shape)

(4804, 2944)
(1196, 2944)
(4804,)
(1196,)


In [146]:
# Downsample the majority class (target == 0) of the training data

def downsample(features, target, fraction):
    # Use the function arguments rather than the global training variables
    features_zeros = features[target == 0]
    features_ones = features[target == 1]
    target_zeros = target[target == 0]
    target_ones = target[target == 1]

    features_downsampled = pd.concat(
        [features_zeros.sample(frac=fraction, random_state=12345)]
        + [features_ones]
    )
    target_downsampled = pd.concat(
        [target_zeros.sample(frac=fraction, random_state=12345)]
        + [target_ones]
    )

    features_downsampled, target_downsampled = shuffle(
        features_downsampled, target_downsampled, random_state=12345
    )

    return features_downsampled, target_downsampled


features_downsampled, target_downsampled = downsample(
    features_train, target_train, 0.24
)
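
The effect of the 0.24 fraction can be checked with simple arithmetic on the class counts printed earlier (4,804 zeros and 1,196 ones in the training set). The sketch below is only a worked calculation; the variable names are made up for illustration.

```python
# Class counts in the training set, taken from the shapes printed earlier.
n_zeros = 4804
n_ones = 1196

fraction = 0.24

# sample(frac=...) keeps round(frac * number_of_rows) rows.
kept_zeros = round(n_zeros * fraction)

# Share of the positive class after downsampling.
positive_share = n_ones / (kept_zeros + n_ones)

# Fraction that would make the two classes exactly equal in size.
balanced_fraction = n_ones / n_zeros

print(kept_zeros, round(positive_share, 3), round(balanced_fraction, 3))
```

A fraction of 0.24 therefore leaves the two classes close to a 50/50 split, slightly in favor of the positive class.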


In [147]:
features_zeros = features_downsampled[target_downsampled == 0]
features_ones = features_downsampled[target_downsampled == 1]
target_zeros = target_downsampled[target_downsampled == 0]
target_ones = target_downsampled[target_downsampled == 1]
print(features_zeros.shape)
print(features_ones.shape)
print(target_zeros.shape)
print(target_ones.shape)

(1153, 2944)
(1196, 2944)
(1153,)
(1196,)


#### Training models on the downsampled dataset

**Logistic Regression**

In [148]:
model = LogisticRegression(random_state=12345, solver='liblinear',class_weight='balanced')
model.fit(features_downsampled,target_downsampled)
predicted_valid= model.predict(features_valid)

print('F1:', f1_score(target_valid, predicted_valid))

# Probabilities are computed on the validation set
probabilities_valid = model.predict_proba(features_valid)
probabilities_one_valid = probabilities_valid[:, 1]

auc_roc = roc_auc_score(target_valid, probabilities_one_valid)

print(auc_roc)

F1: 0.4888123924268503
0.745854983395738
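
The second number printed above is the AUC-ROC. One way to read it: it is the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, with ties counted as half. The sketch below computes it by brute force over all pairs on four made-up scores; the helper `auc_by_pairs` is hypothetical, and the notebook uses `roc_auc_score` for the real data.

```python
def auc_by_pairs(y_true, scores):
    """AUC-ROC as the share of (positive, negative) pairs ranked correctly."""
    positives = [s for t, s in zip(y_true, scores) if t == 1]
    negatives = [s for t, s in zip(y_true, scores) if t == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(positives) * len(negatives))


# Made-up labels and predicted probabilities of class 1.
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
print(auc_by_pairs(y_true, scores))
```

A value of 0.5 corresponds to random ranking and 1.0 to a perfect ranking, which is why AUC-ROC complements F1: it does not depend on the 0.5 decision threshold.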


**DecisionTree Classifier**

In [149]:
model = DecisionTreeClassifier(random_state=12345)
model.fit(features_downsampled,target_downsampled)
predicted_valid= model.predict(features_valid)

print('F1:', f1_score(target_valid, predicted_valid))

# Probabilities are computed on the validation set
probabilities_valid = model.predict_proba(features_valid)
probabilities_one_valid = probabilities_valid[:, 1]

auc_roc = roc_auc_score(target_valid, probabilities_one_valid)

print(auc_roc)

F1: 0.511864406779661
0.715858431275292


**RandomForestClassifier**

In [150]:
model = RandomForestClassifier(random_state=12345, max_depth = 50, n_estimators=200)

model.fit(features_downsampled, target_downsampled)
predicted_valid = model.predict(features_valid)

print('F1:', f1_score(target_valid, predicted_valid))

# Probabilities are computed on the validation set
probabilities_valid = model.predict_proba(features_valid)
probabilities_one_valid = probabilities_valid[:, 1]

auc_roc = roc_auc_score(target_valid, probabilities_one_valid)

print(auc_roc)

F1: 0.5653333333333332
0.8311733678524549


**CatBoost Classifier**

In [151]:
clf = CatBoostClassifier(
    iterations=5, 
    learning_rate=0.1, 
    #loss_function='CrossEntropy'
)

clf.fit(features_downsampled, target_downsampled,
        eval_set=(features_valid, target_valid), 
        verbose=False
)

print('CatBoost model is fitted: ' + str(clf.is_fitted()))
print('CatBoost model parameters:')
print(clf.get_params())

CatBoost model is fitted: True
CatBoost model parameters:
{'iterations': 5, 'learning_rate': 0.1}


In [152]:
clf = CatBoostClassifier(
    iterations=100,
    random_state=12345,
    learning_rate=0.005,
    custom_loss=['AUC', 'Accuracy']
)

clf.fit(
    features_downsampled, target_downsampled, 
    eval_set=(features_valid, target_valid),
    verbose=False,
    plot=True
)


<catboost.core.CatBoostClassifier at 0x21f014a2460>

**The CatBoost model obtained the highest accuracy score, so it is used for testing.**

**Model Testing**

In [153]:
# Note: the test set is passed as eval_set, so the 'validation' entry
# returned by get_best_score() is computed on the test set.
eval_dataset = Pool(features_test, target_test)
clf = CatBoostClassifier(
    eval_metric='AUC',
    random_seed=12345,
    learning_rate=0.1,
    iterations=5,
    custom_metric=['Accuracy', 'AUC:hints=skip_train~false']
)

clf.fit(
    features_downsampled, target_downsampled,
    eval_set=eval_dataset,
    verbose=False,
    plot=True
)
print(clf.get_best_score())


{'learn': {'Accuracy': 0.7696892294593444, 'Logloss': 0.558415257585487, 'AUC': 0.8581158791809645}, 'validation': {'Accuracy': 0.7765, 'Logloss': 0.5808155945029289, 'AUC': 0.8337478319399284}}
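
When reading the accuracy figure above, it helps to compare it with a trivial baseline that always predicts "stays". The sketch below uses the class counts of the training set (4,804 zeros, 1,196 ones) as a stand-in for the class ratio, which is an assumption: the class ratio of the test set is not shown in this notebook.

```python
# Class counts of the training set, from the shapes printed earlier.
n_zeros_train = 4804
n_ones_train = 1196

# Accuracy of a model that always predicts class 0, assuming the test set
# has roughly the same class ratio as the training set (not verified here).
majority_share = n_zeros_train / (n_zeros_train + n_ones_train)

# Test-set accuracy reported by get_best_score() above.
reported_test_accuracy = 0.7765

print(round(majority_share, 4), reported_test_accuracy)
print(reported_test_accuracy < majority_share)
```

Under that assumption, accuracy alone does not separate the model from the baseline, so the AUC of about 0.83 (and an F1 score, if computed on the test set) is the more informative number.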
