# Project description

Customers have started leaving Beta-Bank. Every month, a few at a time, but noticeably. The bank's marketers calculated that
retaining current customers is cheaper than acquiring new ones. <br />
<br />
The task is to predict whether a client will leave the bank in the near future. You are given historical data on
customer behavior and on terminated contracts with the bank. <br />
<br />
Build a model with the highest possible F1 score. The metric must reach at least 0.59, and it has to be checked
on the test set. Additionally, compute AUC-ROC and compare it with the F1 score.
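
As a reminder of what the target metric measures, here is a minimal sketch that computes F1 by hand from precision and recall and compares it with `f1_score`. The labels `y_true` and `y_pred` are toy values invented for illustration; they are not taken from the bank data.

```python
from sklearn.metrics import f1_score, precision_score, recall_score

# Toy labels, invented purely for illustration (1 = the client left).
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]

precision = precision_score(y_true, y_pred)  # TP / (TP + FP)
recall = recall_score(y_true, y_pred)        # TP / (TP + FN)

# F1 is the harmonic mean of precision and recall.
f1_manual = 2 * precision * recall / (precision + recall)
f1 = f1_score(y_true, y_pred)

print('precision:', precision, 'recall:', recall)
print('F1 (manual):', f1_manual, 'F1 (sklearn):', f1)
```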

# Data description

Features: <br />
- RowNumber - row index in the data
- CustomerId - unique customer identifier
- Surname - surname
- CreditScore - credit score
- Geography - country of residence
- Gender - gender
- Age - age
- Tenure - number of years the client has been with the bank
- Balance - account balance
- NumOfProducts - number of bank products used by the client
- HasCrCard - whether the client has a credit card
- IsActiveMember - whether the client is an active member
- EstimatedSalary - estimated salary

Target feature: <br />
- Exited - whether the client left the bank

# Load the data and review general information

In [None]:
import pandas as pd
from IPython.display import display


from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.utils import shuffle

from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression

from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix
from sklearn.metrics import recall_score
from sklearn.metrics import precision_score
from sklearn.metrics import f1_score

from sklearn.metrics import roc_curve
from sklearn.metrics import roc_auc_score

from sklearn.dummy import DummyClassifier

import matplotlib.pyplot as plt

pd.set_option('display.max_rows', 1000)
pd.set_option('display.max_columns', 1000)

In [None]:
#df = pd.read_csv('Churn.csv', sep=',')
df = pd.read_csv('/datasets/Churn.csv', sep=',')
display(df)
df.info()

# Data preparation

Drop the features that carry no predictive information

In [None]:
df = df.drop(['RowNumber', 'CustomerId', 'Surname'], axis=1)

There are gaps in the Tenure feature, 9 in total. Let's get rid of them

In [None]:
# Comparing with the string "NaN" does not match missing values, so drop them with dropna
df = df.dropna(subset=['Tenure']).reset_index(drop=True)
df.info()
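
A minimal sketch of why missing values should be detected with `isna()` or removed with `dropna()` rather than compared with a string. The frame `toy` is invented for illustration and only mimics the `Tenure` column.

```python
import numpy as np
import pandas as pd

# Toy frame, invented for illustration; the real data comes from Churn.csv.
toy = pd.DataFrame({'Tenure': [2.0, np.nan, 5.0, np.nan]})

# Comparing a float column with the string 'NaN' matches nothing.
string_mask = toy['Tenure'] == 'NaN'

# isna() is the reliable way to find missing values.
nan_mask = toy['Tenure'].isna()

# dropna() removes the rows with a missing Tenure in one step.
cleaned = toy.dropna(subset=['Tenure']).reset_index(drop=True)

print('matched by string comparison:', string_mask.sum())
print('matched by isna():', nan_mask.sum())
print(cleaned)
```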

Let's convert the categorical features to numeric ones using one-hot encoding

In [None]:
df_ohe = pd.get_dummies(df, drop_first=True)

In [None]:
print(df_ohe.dtypes)
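
To see what `drop_first=True` does, here is a small sketch on a toy frame. The values in `toy` are invented for illustration. For each categorical column, the first category in sorted order is dropped, which avoids one redundant column per feature.

```python
import pandas as pd

# Toy frame with invented values; only the column names mirror the dataset.
toy = pd.DataFrame({
    'Geography': ['France', 'Spain', 'Germany', 'France'],
    'Gender': ['Female', 'Male', 'Male', 'Female'],
    'Age': [42, 41, 39, 50],
})

# Numeric columns are kept as-is; each categorical column becomes
# (number of categories - 1) indicator columns.
toy_ohe = pd.get_dummies(toy, drop_first=True)
columns = list(toy_ohe.columns)
print(columns)
```

Here 'France' and 'Female' become the implicit baseline: a row with zeros in all `Geography_*` columns is a client from France.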

Separate the features from the target feature

In [None]:
features = df_ohe.drop(['Exited'], axis=1)
target = df_ohe['Exited']

Let's divide the data into three samples (train, validation, and test) in the ratio `3 : 1 : 1`

In [None]:
# Split off 20% for validation, then 25% of the remainder (20% of the total) for testing.
# Passing features and target together keeps their rows aligned.
features_train, features_valid, target_train, target_valid = train_test_split(
    features, target, test_size=0.20, random_state=12345)
features_train, features_test, target_train, target_test = train_test_split(
    features_train, target_train, test_size=0.25, random_state=12345)

print(features.shape)
print(features_train.shape)
print(features_valid.shape)
print(features_test.shape)
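
A quick sanity check of the split arithmetic, sketched on synthetic arrays (the arrays `X` and `y` are invented for illustration). Taking 20% first and then 25% of the remaining 80% leaves 60% / 20% / 20%, which is the `3 : 1 : 1` ratio.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data, invented for illustration: y is the parity of the single feature.
X = np.arange(1000).reshape(-1, 1)
y = np.arange(1000) % 2

X_rest, X_valid, y_rest, y_valid = train_test_split(
    X, y, test_size=0.20, random_state=12345)
X_train, X_test, y_train, y_test = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=12345)

sizes = (len(X_train), len(X_valid), len(X_test))
print(sizes)

# Features and target stay aligned because they were split in the same call.
aligned = bool(np.all(X_train[:, 0] % 2 == y_train))
print('aligned:', aligned)
```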

Scale features

In [None]:
numeric = ['CreditScore', 'Age', 'Tenure', 'Balance', 'NumOfProducts', 'HasCrCard', 'IsActiveMember', 'EstimatedSalary']

scaler = StandardScaler()
scaler.fit(features_train.loc[:, numeric])
features_train.loc[:, numeric] = scaler.transform(features_train.loc[:, numeric])
features_valid.loc[:, numeric] = scaler.transform(features_valid.loc[:, numeric])
features_test.loc[:, numeric] = scaler.transform(features_test.loc[:, numeric])
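
The scaler is fitted on the training sample only and then applied to all three samples. A minimal sketch of what that means, on toy arrays invented for illustration: the training data ends up with zero mean and unit variance, while the validation data is transformed with the training statistics and does not have to.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy single-feature samples, invented for illustration.
train = np.array([[10.0], [20.0], [30.0]])
valid = np.array([[20.0], [40.0]])

scaler = StandardScaler()
scaler.fit(train)  # learns mean and standard deviation from train only

train_scaled = scaler.transform(train)
valid_scaled = scaler.transform(valid)  # reuses the train statistics

print('train mean/std:', train_scaled.mean(), train_scaled.std())
print('valid scaled:', valid_scaled.ravel())
```

Fitting on the training sample alone keeps information about the validation and test samples out of the preprocessing step.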

# Exploring models without accounting for class imbalance

## Model: decision tree

In [None]:
for depth in range(1, 12):
    model_dec_tree = DecisionTreeClassifier(random_state=12345, max_depth=depth)
    model_dec_tree.fit(features_train, target_train)
    predicted_valid = model_dec_tree.predict(features_valid)

    print('max_depth =', depth)
    print("F1:", f1_score(target_valid, predicted_valid))
    print()

Retrain the model with the best parameters found

In [None]:
model_dec_tree = DecisionTreeClassifier(random_state=12345, max_depth=9) 
model_dec_tree.fit(features_train, target_train)
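
Instead of reading the best depth off the printed list, the loop can keep track of it. The sketch below shows one possible way to do this. It runs on synthetic data from `make_classification`, so the names `X`, `y`, `best_depth`, and `best_f1` are illustrative assumptions and the resulting depth says nothing about the bank data.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic imbalanced data, generated only to make the sketch runnable.
X, y = make_classification(n_samples=500, n_features=8,
                           weights=[0.8, 0.2], random_state=12345)
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.25, random_state=12345)

best_depth, best_f1 = None, -1.0
for depth in range(1, 12):
    model = DecisionTreeClassifier(random_state=12345, max_depth=depth)
    model.fit(X_train, y_train)
    score = f1_score(y_valid, model.predict(X_valid))
    # Keep the depth with the highest validation F1 seen so far.
    if score > best_f1:
        best_depth, best_f1 = depth, score

print('best max_depth =', best_depth, 'F1 =', round(best_f1, 3))
```

The same pattern applies to `n_estimators` in the random forest loop.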

## Model: random forest

In [None]:
for est in range(10, 102, 5):
    
    model_ran_forest = RandomForestClassifier(random_state=12345, n_estimators=est, max_depth=9)
    model_ran_forest.fit(features_train, target_train)
    predicted_valid = model_ran_forest.predict(features_valid)
    
    print('n_estimators =', est)
    print("F1:", f1_score(target_valid, predicted_valid))
    print()

Retrain the model with the best parameters found

In [None]:
model_ran_forest = RandomForestClassifier(random_state=12345, n_estimators=45, max_depth=9)
model_ran_forest.fit(features_train, target_train) 

## Model: logistic regression

In [None]:
model_log_reg = LogisticRegression(random_state=12345, solver='liblinear')
model_log_reg.fit(features_train, target_train)
predicted_valid = model_log_reg.predict(features_valid)

print("F1:", f1_score(target_valid, predicted_valid))

## Checking models on a test set

In [None]:
predicted_test = model_dec_tree.predict(features_test)

print("Decision tree model on the test sample:")
print("F1:", f1_score(target_test, predicted_test))

probabilities_test = model_dec_tree.predict_proba(features_test)
probabilities_one_test = probabilities_test[:, 1]
fpr_tree, tpr_tree, thresholds_tree = roc_curve(target_test, probabilities_one_test) 
print('auc_roc:', roc_auc_score(target_test, probabilities_one_test))
print()
#//
#//
predicted_test = model_ran_forest.predict(features_test)

print("Random forest model on a test sample:")
print("F1:", f1_score(target_test, predicted_test))

probabilities_test = model_ran_forest.predict_proba(features_test)
probabilities_one_test = probabilities_test[:, 1]
fpr_for, tpr_for, thresholds_for = roc_curve(target_test, probabilities_one_test) 
print('auc_roc:', roc_auc_score(target_test, probabilities_one_test))
print()
#//
#//
predicted_test = model_log_reg.predict(features_test)

print("Logistic regression model on a test sample:")
print("F1:", f1_score(target_test, predicted_test))

probabilities_test = model_log_reg.predict_proba(features_test)
probabilities_one_test = probabilities_test[:, 1]
fpr_reg, tpr_reg, thresholds_reg = roc_curve(target_test, probabilities_one_test) 
print('auc_roc:', roc_auc_score(target_test, probabilities_one_test))
print()

Let's build the ROC curves of the models and the ROC curve of the random model

In [None]:
plt.figure(figsize=(15,15))

plt.plot(fpr_tree, tpr_tree, label='Tree')
plt.plot(fpr_for, tpr_for, label='Forest')
plt.plot(fpr_reg, tpr_reg, label='Log_regression')

# ROC curve of a random model
plt.plot([0, 1], [0, 1], linestyle='--')

# set the axis limits from 0 to 1
plt.xlim([0,1])
plt.ylim([0,1])

# label the axes "False Positive Rate" and "True Positive Rate"
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')

# add the title "ROC curve" with plt.title()
plt.title('ROC curve')

# add a legend
legend = plt.legend(loc='lower right', shadow=False, fontsize='x-large')

plt.show()

Conclusion: <br />
<br />
The decision tree has the best F1 score, followed by the random forest. <br />
The random forest has the largest area under the ROC curve, with the decision tree in second place.

# Exploring models with class imbalance taken into account

## Prepare the data as before, but account for class imbalance

Separate the features from the target feature

In [None]:
features = df_ohe.drop(['Exited'], axis=1)
target = df_ohe['Exited']

Let's divide the data into three samples (train, validation, and test) in the ratio `3 : 1 : 1`

In [None]:
# Split off 20% for validation, then 25% of the remainder (20% of the total) for testing.
# Passing features and target together keeps their rows aligned.
features_train, features_valid, target_train, target_valid = train_test_split(
    features, target, test_size=0.20, random_state=12345)
features_train, features_test, target_train, target_test = train_test_split(
    features_train, target_train, test_size=0.25, random_state=12345)

print(features.shape)
print(features_train.shape)
print(features_valid.shape)
print(features_test.shape)

Let's measure the class imbalance

In [None]:
class_frequency = df_ohe['Exited'].value_counts(normalize=True)
print(class_frequency)
class_frequency.plot(kind='bar')

We balance the training sample by upsampling the rare class. Alternatively, the classes could be balanced by passing `class_weight='balanced'` when creating the model.

In [None]:
features_zeros = features_train[target_train == 0]
features_ones = features_train[target_train == 1]
target_zeros = target_train[target_train == 0]
target_ones = target_train[target_train == 1]

repeat = 4
features_train = pd.concat([features_zeros] + [features_ones] * repeat)
target_train = pd.concat([target_zeros] + [target_ones] * repeat)

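# shuffle returns shuffled copies, so the result has to be assigned back;
# shuffling both objects in one call keeps features and target aligned
features_train, target_train = shuffle(features_train, target_train, random_state=12345)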

print(features_train[target_train == 0].shape)
print(features_train[target_train == 1].shape)

class_frequency = target_train.value_counts(normalize=True)
print(class_frequency)
class_frequency.plot(kind='bar')
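
The alternative mentioned above, `class_weight='balanced'`, can be sketched as follows. This is only an illustration on synthetic data from `make_classification`, not the approach used in this notebook; all variable names here are assumptions. With this setting each class is weighted by `n_samples / (n_classes * class_count)`, so the rare class gets the larger weight.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split
from sklearn.utils.class_weight import compute_class_weight

# Synthetic data with roughly 20% positives, generated only for this sketch.
X, y = make_classification(n_samples=2000, n_features=8,
                           weights=[0.8, 0.2], random_state=12345)
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.25, random_state=12345, stratify=y)

# The weights that 'balanced' assigns to classes 0 and 1.
weights = compute_class_weight('balanced', classes=np.array([0, 1]), y=y_train)
print('class weights:', weights)

plain = LogisticRegression(random_state=12345, solver='liblinear')
balanced = LogisticRegression(random_state=12345, solver='liblinear',
                              class_weight='balanced')
plain.fit(X_train, y_train)
balanced.fit(X_train, y_train)

recall_plain = recall_score(y_valid, plain.predict(X_valid))
recall_balanced = recall_score(y_valid, balanced.predict(X_valid))
print('recall without weights:', recall_plain)
print('recall with class_weight="balanced":', recall_balanced)
```

Unlike upsampling, this leaves the training sample unchanged and only alters how much each error costs during fitting.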

Scale features

In [None]:
numeric = ['CreditScore', 'Age', 'Tenure', 'Balance', 'NumOfProducts', 'HasCrCard', 'IsActiveMember', 'EstimatedSalary']

scaler = StandardScaler()
scaler.fit(features_train.loc[:, numeric])
features_train.loc[:, numeric] = scaler.transform(features_train.loc[:, numeric])
features_valid.loc[:, numeric] = scaler.transform(features_valid.loc[:, numeric])
features_test.loc[:, numeric] = scaler.transform(features_test.loc[:, numeric])

## Model: decision tree

In [None]:
for depth in range(1, 12):
    
    model_dec_tree = DecisionTreeClassifier(random_state=12345, max_depth=depth) 
    model_dec_tree.fit(features_train, target_train)
    predicted_valid = model_dec_tree.predict(features_valid)
    
    print('max_depth =', depth)
    print("F1:", f1_score(target_valid, predicted_valid))
    print()

Retrain the model with the best parameters found

In [None]:
model_dec_tree = DecisionTreeClassifier(random_state=12345, max_depth=6) 
model_dec_tree.fit(features_train, target_train)

## Model: random forest

In [None]:
for est in range(10, 102, 5):
    
    model_ran_forest = RandomForestClassifier(random_state=12345, n_estimators=est, max_depth=6)
    model_ran_forest.fit(features_train, target_train)
    predicted_valid = model_ran_forest.predict(features_valid)
    
    print('n_estimators =', est)
    print("F1:", f1_score(target_valid, predicted_valid))
    print()

Retrain the model with the best parameters found

In [None]:
model_ran_forest = RandomForestClassifier(random_state=12345, n_estimators=95, max_depth=6)
model_ran_forest.fit(features_train, target_train) 

## Model: logistic regression

In [None]:
model_log_reg = LogisticRegression(random_state=12345, solver='liblinear')
model_log_reg.fit(features_train, target_train)
predicted_valid = model_log_reg.predict(features_valid)

print("F1:", f1_score(target_valid, predicted_valid))

## Checking models on a test set

In [None]:
predicted_test = model_dec_tree.predict(features_test)

print("Decision tree model on the test sample:")
print("F1:", f1_score(target_test, predicted_test))

probabilities_test = model_dec_tree.predict_proba(features_test)
probabilities_one_test = probabilities_test[:, 1]
fpr_tree, tpr_tree, thresholds_tree = roc_curve(target_test, probabilities_one_test) 
print('auc_roc:', roc_auc_score(target_test, probabilities_one_test))

print()
#//
#//
predicted_test = model_ran_forest.predict(features_test)

print("Random forest model on a test sample:")
print("F1:", f1_score(target_test, predicted_test))

probabilities_test = model_ran_forest.predict_proba(features_test)
probabilities_one_test = probabilities_test[:, 1]
fpr_for, tpr_for, thresholds_for = roc_curve(target_test, probabilities_one_test) 
print('auc_roc:', roc_auc_score(target_test, probabilities_one_test))

print()
#//
#//
predicted_test = model_log_reg.predict(features_test)

print("Logistic regression model on a test sample:")
print("F1:", f1_score(target_test, predicted_test))

probabilities_test = model_log_reg.predict_proba(features_test)
probabilities_one_test = probabilities_test[:, 1]
fpr_reg, tpr_reg, thresholds_reg = roc_curve(target_test, probabilities_one_test) 
print('auc_roc:', roc_auc_score(target_test, probabilities_one_test))

print()

Let's build the ROC curves of the models and the ROC curve of the random model

In [None]:
plt.figure(figsize=(15,15))

plt.plot(fpr_tree, tpr_tree, label='Tree')
plt.plot(fpr_for, tpr_for, label='Forest')
plt.plot(fpr_reg, tpr_reg, label='Log_regression')

# ROC curve of a random model
plt.plot([0, 1], [0, 1], linestyle='--')

# set the axis limits from 0 to 1
plt.xlim([0,1])
plt.ylim([0,1])

# label the axes "False Positive Rate" and "True Positive Rate"
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')

# add the title "ROC curve" with plt.title()
plt.title('ROC curve')

# add a legend
legend = plt.legend(loc='lower right', shadow=False, fontsize='x-large')

plt.show()

Conclusion: <br />
<br />
The random forest has both the best F1 score and the largest area under the ROC curve. <br />
The decision tree lags slightly behind. <br />
Logistic regression shows the worst results.
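
When comparing F1 with AUC-ROC, it helps to remember that F1 is computed from hard predictions at one threshold (0.5 by default in `predict`), while AUC-ROC summarizes the ranking over all thresholds. A minimal sketch on toy probabilities, invented for illustration:

```python
import numpy as np
from sklearn.metrics import f1_score, roc_auc_score

# Toy labels and predicted probabilities of class 1, invented for illustration.
y_true = np.array([0, 0, 0, 0, 1, 0, 1, 1])
proba = np.array([0.05, 0.10, 0.20, 0.30, 0.35, 0.40, 0.45, 0.90])

# AUC-ROC depends only on how the probabilities rank the objects.
auc = roc_auc_score(y_true, proba)

# F1 depends on where the probabilities are cut into 0 and 1.
f1_at_05 = f1_score(y_true, (proba >= 0.5).astype(int))
f1_at_03 = f1_score(y_true, (proba >= 0.3).astype(int))

print('AUC-ROC:', auc)
print('F1 at threshold 0.5:', f1_at_05)
print('F1 at threshold 0.3:', f1_at_03)
```

The same probabilities give one AUC-ROC value but different F1 scores at different thresholds, so a model can rank clients well and still have a modest F1 at the default cut-off.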