**Churn prediction**

**Objective**: 

- Predict whether a customer will leave the bank soon, using data on clients’ past behavior and contract terminations.

- Build a model with the highest possible F1 score; the minimum acceptable value is 0.59.

- Measure the AUC-ROC metric.

**Features**
- RowNumber — row index in the data
- CustomerId — unique customer identifier
- Surname — surname
- CreditScore — credit score
- Geography — country of residence
- Gender — gender
- Age — age
- Tenure — period of maturation for a customer’s fixed deposit (years)
- Balance — account balance
- NumOfProducts — number of banking products used by the customer
- HasCrCard — customer has a credit card
- IsActiveMember — whether the customer is an active member
- EstimatedSalary — estimated salary
- Exited — customer has left (target)

In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score, roc_auc_score
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.utils import shuffle
from sklearn.preprocessing import StandardScaler

In [2]:
data = pd.read_csv('Churn.csv')
print(data.info())
display(data.head())
data.dtypes

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 14 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   RowNumber        10000 non-null  int64  
 1   CustomerId       10000 non-null  int64  
 2   Surname          10000 non-null  object 
 3   CreditScore      10000 non-null  int64  
 4   Geography        10000 non-null  object 
 5   Gender           10000 non-null  object 
 6   Age              10000 non-null  int64  
 7   Tenure           9091 non-null   float64
 8   Balance          10000 non-null  float64
 9   NumOfProducts    10000 non-null  int64  
 10  HasCrCard        10000 non-null  int64  
 11  IsActiveMember   10000 non-null  int64  
 12  EstimatedSalary  10000 non-null  float64
 13  Exited           10000 non-null  int64  
dtypes: float64(3), int64(8), object(3)
memory usage: 1.1+ MB
None


Unnamed: 0,RowNumber,CustomerId,Surname,CreditScore,Geography,Gender,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Exited
0,1,15634602,Hargrave,619,France,Female,42,2.0,0.0,1,1,1,101348.88,1
1,2,15647311,Hill,608,Spain,Female,41,1.0,83807.86,1,0,1,112542.58,0
2,3,15619304,Onio,502,France,Female,42,8.0,159660.8,3,1,0,113931.57,1
3,4,15701354,Boni,699,France,Female,39,1.0,0.0,2,0,0,93826.63,0
4,5,15737888,Mitchell,850,Spain,Female,43,2.0,125510.82,1,1,1,79084.1,0


RowNumber            int64
CustomerId           int64
Surname             object
CreditScore          int64
Geography           object
Gender              object
Age                  int64
Tenure             float64
Balance            float64
NumOfProducts        int64
HasCrCard            int64
IsActiveMember       int64
EstimatedSalary    float64
Exited               int64
dtype: object

**Get rid of useless features**

In [3]:
data = data.drop(['RowNumber', 'CustomerId', 'Surname'], axis = 1)

**Fill in the missing values**

In [4]:
data['Tenure'] = data['Tenure'].fillna(data['Tenure'].median()).astype('int')
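
The median fill above can be tried on a small made-up column. The sketch below is only an illustration: the `toy` frame and the extra `Tenure_missing` flag are assumptions and are not part of the notebook. The flag is one optional way to let a model still see which rows were imputed.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the Tenure column (values are made up for illustration)
toy = pd.DataFrame({'Tenure': [2.0, np.nan, 8.0, 1.0, np.nan, 5.0]})

# Optional: remember which rows were missing before they are filled
toy['Tenure_missing'] = toy['Tenure'].isna().astype('int')

# median() skips NaN, so it is computed from the observed values [1, 2, 5, 8]
median_tenure = toy['Tenure'].median()

# astype('int') truncates, so a fractional median such as 3.5 becomes 3
toy['Tenure'] = toy['Tenure'].fillna(median_tenure).astype('int')

print(median_tenure)
print(toy)
```

Note that casting to `int` silently truncates a fractional median, which is worth keeping in mind if the fill value matters.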

**One-hot encoding of categorical features, avoiding the dummy variable trap**

In [5]:
data = pd.get_dummies(data, drop_first = True)
data.head()

Unnamed: 0,CreditScore,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Exited,Geography_Germany,Geography_Spain,Gender_Male
0,619,42,2,0.0,1,1,1,101348.88,1,0,0,0
1,608,41,1,83807.86,1,0,1,112542.58,0,0,1,0
2,502,42,8,159660.8,3,1,0,113931.57,1,0,0,0
3,699,39,1,0.0,2,0,0,93826.63,0,0,0,0
4,850,43,2,125510.82,1,1,1,79084.1,0,0,1,0
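
To see what `drop_first = True` changes, here is a minimal sketch on a made-up frame (the `toy` data is an assumption for illustration). Without it, the dummy columns of each feature always sum to 1, so one of them is redundant; dropping the first category removes that redundancy.

```python
import pandas as pd

# Made-up rows with the same categorical columns as the dataset
toy = pd.DataFrame({
    'Geography': ['France', 'Spain', 'Germany', 'France'],
    'Gender': ['Female', 'Male', 'Female', 'Male'],
    'Age': [42, 41, 39, 43],
})

full = pd.get_dummies(toy)                      # one column per category
reduced = pd.get_dummies(toy, drop_first=True)  # first category of each feature dropped

print(list(full.columns))
print(list(reduced.columns))

# In the full encoding the Geography dummies always sum to 1 per row,
# which is the linear dependence that drop_first removes
geo_cols = ['Geography_France', 'Geography_Germany', 'Geography_Spain']
print(full[geo_cols].astype(int).sum(axis=1).tolist())
```

A row with all remaining dummies equal to 0 then stands for the dropped category (here France and Female).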


**Splitting the data**

In [6]:
target = data['Exited']
features = data.drop('Exited', axis = 1)

In [7]:
target_train, target_rest, features_train, features_rest = train_test_split(target, features, test_size = 0.4, random_state = 1)

In [8]:
target_valid, target_test, features_valid, features_test = train_test_split(target_rest, features_rest, test_size = 0.5, random_state = 2)

In [9]:
print(len(target_train))
print(len(target_valid))
len(target_test)

6000
2000


2000
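
The 60/20/20 split above does not stratify by class. As a possible variation (not what the notebook does), the sketch below passes `stratify` so that every part keeps the same share of positives. The synthetic `toy_features` and `toy_target` are assumptions used only to make the example runnable without `Churn.csv`.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 1000 rows with 20% positives
toy_features = pd.DataFrame({'x': range(1000)})
toy_target = pd.Series([1] * 200 + [0] * 800, name='Exited')

# 60% train, 40% held out; stratify keeps the class ratio in both parts
f_train, f_rest, t_train, t_rest = train_test_split(
    toy_features, toy_target, test_size=0.4, random_state=1, stratify=toy_target)

# Split the held-out 40% in half: 20% validation, 20% test
f_valid, f_test, t_valid, t_test = train_test_split(
    f_rest, t_rest, test_size=0.5, random_state=2, stratify=t_rest)

for name, part in [('train', t_train), ('valid', t_valid), ('test', t_test)]:
    print(name, len(part), round(part.mean(), 2))
```

With an imbalanced target, stratifying makes validation and test scores less dependent on how many positives happened to land in each part.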

**Standardization**

In [10]:
scaler = StandardScaler()

In [11]:
numeric = ['CreditScore', 'Age', 'Tenure', 'Balance', 'NumOfProducts', 'EstimatedSalary']

In [12]:
# Copy the splits first so the assignments below do not raise SettingWithCopyWarning
features_train, features_valid, features_test = features_train.copy(), features_valid.copy(), features_test.copy()
features_train[numeric] = scaler.fit_transform(features_train[numeric])
features_valid[numeric] = scaler.transform(features_valid[numeric])
features_test[numeric] = scaler.transform(features_test[numeric])
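
The scaler in this step is fitted on the training set only and then reused for the validation and test sets. A minimal sketch of that pattern on made-up numbers (the `train` and `valid` frames are assumptions for illustration):

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

train = pd.DataFrame({'Age': [20.0, 30.0, 40.0]})
valid = pd.DataFrame({'Age': [30.0, 50.0]})

scaler = StandardScaler()
train_scaled = scaler.fit_transform(train)  # learns the mean and std from train only
valid_scaled = scaler.transform(valid)      # reuses the training statistics

print(scaler.mean_)   # mean learned from the training column
print(valid_scaled)   # 30 maps to 0 because it equals the training mean
```

Fitting on the training set alone keeps information about the validation and test distributions out of the model inputs.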

**Class balance**

In [13]:
round(data['Exited'].value_counts(normalize = True), 2)

0    0.8
1    0.2
Name: Exited, dtype: float64

**Upsampling**

In [14]:
def get_balanced(features, target, repeat):
    features_zeros = features[target == 0]
    features_ones = features[target == 1]
    target_zeros = target[target == 0]
    target_ones = target[target == 1]
    target_upsample = shuffle(pd.concat([target_ones]*repeat + [target_zeros]), random_state = 12)
    features_upsample = shuffle(pd.concat([features_ones]*repeat + [features_zeros]), random_state = 12)
    return target_upsample, features_upsample

In [15]:
target_train, features_train = get_balanced(features_train, target_train, 4)

In [16]:
round(target_train.value_counts(normalize = True), 2)

0    0.5
1    0.5
Name: Exited, dtype: float64
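
The repeat factor of 4 follows from the 0.8 / 0.2 class ratio. The sketch below shows one way to derive it from the data instead of hard-coding it. It is an illustration under assumptions: the `upsample` helper and the toy data are hypothetical, and the helper shuffles features and target in a single `shuffle` call rather than in two calls with the same seed.

```python
import pandas as pd
from sklearn.utils import shuffle

def upsample(features, target, repeat, random_state=12):
    """Repeat the positive rows `repeat` times, then shuffle both objects together."""
    features_up = pd.concat([features[target == 0]] + [features[target == 1]] * repeat)
    target_up = pd.concat([target[target == 0]] + [target[target == 1]] * repeat)
    # One shuffle call applies the same permutation to features and target
    return shuffle(features_up, target_up, random_state=random_state)

# Toy data: 8 negatives and 2 positives, i.e. the same 80/20 ratio
toy_target = pd.Series([0] * 8 + [1] * 2, name='Exited')
toy_features = pd.DataFrame({'x': range(10)})

# Derive the repeat factor from the class counts
counts = toy_target.value_counts()
repeat = int(round(counts[0] / counts[1]))

features_up, target_up = upsample(toy_features, toy_target, repeat)
print(repeat)
print(target_up.value_counts().to_dict())
```

Upsampling should be applied to the training set only, as in the notebook, so that validation and test scores reflect the real class ratio.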

**Logistic Regression**

In [23]:
model = LogisticRegression(random_state = 12345)
model.fit(features_train, target_train)
predicted_valid = model.predict(features_valid)
print(f'f1_score for logistic regression {round(f1_score(target_valid, predicted_valid), 3)}')
print(f'auc_roc_score for logistic regression {round(roc_auc_score(target_valid, predicted_valid), 3)}')

f1_score for logistic regression 0.496
auc_roc_score for logistic regression 0.702
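
The AUC-ROC above is computed from hard 0/1 predictions. `roc_auc_score` is usually given scores or probabilities, so that the whole ROC curve is used rather than a single operating point. A minimal sketch of the difference, on synthetic data from `make_classification` (an assumption, used so the example runs without `Churn.csv`):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data, roughly 80/20
X, y = make_classification(n_samples=2000, n_features=8, weights=[0.8, 0.2], random_state=0)
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

model = LogisticRegression(max_iter=1000, random_state=12345).fit(X_train, y_train)

labels = model.predict(X_valid)             # hard 0/1 predictions
proba = model.predict_proba(X_valid)[:, 1]  # estimated probability of class 1

auc_from_labels = roc_auc_score(y_valid, labels)
auc_from_proba = roc_auc_score(y_valid, proba)

print(round(auc_from_labels, 3), round(auc_from_proba, 3))
```

The same change would apply to the tree and forest cells: pass `model.predict_proba(features_valid)[:, 1]` as the second argument. The scores printed in this notebook were obtained from hard labels, so they are not directly comparable to probability-based values.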


**Decision Tree**

In [29]:
for depth in range(2, 10):
    model = DecisionTreeClassifier(max_depth = depth, random_state = 12)
    model.fit(features_train, target_train)
    predicted_valid = model.predict(features_valid)
    print(f'f1_score for decision tree with depth {depth} = {round(f1_score(target_valid, predicted_valid), 3)}')
    print(f'roc_auc_score for decision tree with depth {depth} = {round(roc_auc_score(target_valid, predicted_valid), 3)}')
    print('*'*60)

f1_score for decision tree with depth 2 = 0.526
roc_auc_score for decision tree with depth 2 = 0.726
************************************************************
f1_score for decision tree with depth 3 = 0.553
roc_auc_score for decision tree with depth 3 = 0.74
************************************************************
f1_score for decision tree with depth 4 = 0.546
roc_auc_score for decision tree with depth 4 = 0.753
************************************************************
f1_score for decision tree with depth 5 = 0.58
roc_auc_score for decision tree with depth 5 = 0.778
************************************************************
f1_score for decision tree with depth 6 = 0.59
roc_auc_score for decision tree with depth 6 = 0.781
************************************************************
f1_score for decision tree with depth 7 = 0.613
roc_auc_score for decision tree with depth 7 = 0.791
************************************************************
f1_score for decision tree with

**RandomForestClassifier**

In [34]:
# Track the best F1 across the whole grid; initialise once, before the loops
max_f1 = 0
max_f1_est = 0
max_f1_depth = 0
for est in range(10, 40, 10):
    for depth in range(2, 10):
        model = RandomForestClassifier(n_estimators = est, max_depth = depth, random_state = 123)
        model.fit(features_train, target_train)
        predicted_valid = model.predict(features_valid)
        f1 = f1_score(target_valid, predicted_valid)
        # Report only the configurations that clear the 0.59 target
        if round(f1, 2) > 0.59:
            print(f'f1_score for random forest with depth {depth} and estimators = {est} = {round(f1, 3)}')
            print(f'roc_auc_score for random forest with depth {depth} and estimators = {est} = {round(roc_auc_score(target_valid, predicted_valid), 3)}')
            print('*'*60)
        if f1 > max_f1:
            max_f1 = f1
            max_f1_est = est
            max_f1_depth = depth
print(f'The best random forest model has f1_score = {round(max_f1, 3)} with {max_f1_est} estimators and {max_f1_depth} depth')

f1_score for random forest with depth 4 and estimators = 10 = 0.599
roc_auc_score for random forest with depth 4 and estimators = 10 = 0.774
************************************************************
f1_score for random forest with depth 5 and estimators = 10 = 0.603
roc_auc_score for random forest with depth 5 and estimators = 10 = 0.775
************************************************************
f1_score for random forest with depth 6 and estimators = 10 = 0.614
roc_auc_score for random forest with depth 6 and estimators = 10 = 0.784
************************************************************
f1_score for random forest with depth 7 and estimators = 10 = 0.623
roc_auc_score for random forest with depth 7 and estimators = 10 = 0.784
************************************************************
f1_score for random forest with depth 8 and estimators = 10 = 0.644
roc_auc_score for random forest with depth 8 and estimators = 10 = 0.794
***************************************************

**Model testing**

In [36]:
model = RandomForestClassifier(n_estimators = 30, max_depth = 9, random_state = 123)
model.fit(features_train, target_train)
predicted_test = model.predict(features_test)
print(f'f1_score for test sample = {round(f1_score(predicted_test, target_test), 3)}')
print(f'roc_auc_score for test sample = {round(roc_auc_score(predicted_test, target_test), 3)}')

f1_score for test sample = 0.613
roc_auc_score for test sample = 0.729
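
The test cell passes the predictions first and the true labels second. For binary F1 the order does not change the result, but for `roc_auc_score` it does, since the function expects `(y_true, y_score)`. A small hand-made example (the labels below are made up for illustration):

```python
from sklearn.metrics import f1_score, roc_auc_score

# Made-up labels: 4 positives, 6 negatives
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 0, 1, 1, 1, 1, 1, 0]

# Binary F1 is symmetric: swapping the arguments only swaps precision and recall
f1_ok = f1_score(y_true, y_pred)
f1_swapped = f1_score(y_pred, y_true)

# AUC is not symmetric: the first argument defines which rows count as positives
auc_ok = roc_auc_score(y_true, y_pred)
auc_swapped = roc_auc_score(y_pred, y_true)

print(round(f1_ok, 3), round(f1_swapped, 3))
print(round(auc_ok, 3), round(auc_swapped, 3))
```

A safer habit is to always write `metric(target, prediction)`, and for AUC-ROC to pass `model.predict_proba(features_test)[:, 1]` as the prediction.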


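**Conclusion: the best model is a RandomForestClassifier with 30 estimators and max_depth = 9. On the test sample it reaches F1 = 0.613, above the 0.59 target, with AUC-ROC = 0.729.**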