# Customer Churn

Customers have started leaving Beta-Bank. It happens every month: not many at a time, but noticeably. The bank's marketers have calculated that retaining existing customers is cheaper than acquiring new ones.

The task is to predict whether a customer will leave the bank in the near future. You are given historical data on customer behavior and contract terminations.

Build a model with the highest possible *F1* score. To pass the project, the metric must reach at least 0.59. Check the *F1* score on the test set yourself.

Additionally, measure *AUC-ROC* and compare its value with the *F1* score.

Data source: [https://www.kaggle.com/barelydedicated/bank-customer-churn-modeling](https://www.kaggle.com/barelydedicated/bank-customer-churn-modeling)

## Data Preparation

In [394]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (confusion_matrix, f1_score, precision_score,
                             recall_score, roc_curve)
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier
from sklearn.utils import shuffle

In [395]:
data = pd.read_csv('/datasets/Churn.csv')
data

Unnamed: 0,RowNumber,CustomerId,Surname,CreditScore,Geography,Gender,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Exited
0,1,15634602,Hargrave,619,France,Female,42,2.0,0.00,1,1,1,101348.88,1
1,2,15647311,Hill,608,Spain,Female,41,1.0,83807.86,1,0,1,112542.58,0
2,3,15619304,Onio,502,France,Female,42,8.0,159660.80,3,1,0,113931.57,1
3,4,15701354,Boni,699,France,Female,39,1.0,0.00,2,0,0,93826.63,0
4,5,15737888,Mitchell,850,Spain,Female,43,2.0,125510.82,1,1,1,79084.10,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9995,9996,15606229,Obijiaku,771,France,Male,39,5.0,0.00,2,1,0,96270.64,0
9996,9997,15569892,Johnstone,516,France,Male,35,10.0,57369.61,1,1,1,101699.77,0
9997,9998,15584532,Liu,709,France,Female,36,7.0,0.00,1,0,1,42085.58,1
9998,9999,15682355,Sabbatini,772,Germany,Male,42,3.0,75075.31,2,1,0,92888.52,1


A first look at the dataset. The target is the 'Exited' column. We drop the columns we do not need: 'Surname', 'RowNumber', and 'CustomerId'.

In [396]:
# Drop identifier columns that carry no predictive signal
data = data.drop(['Surname', 'RowNumber', 'CustomerId'], axis=1)
data

Unnamed: 0,CreditScore,Geography,Gender,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Exited
0,619,France,Female,42,2.0,0.00,1,1,1,101348.88,1
1,608,Spain,Female,41,1.0,83807.86,1,0,1,112542.58,0
2,502,France,Female,42,8.0,159660.80,3,1,0,113931.57,1
3,699,France,Female,39,1.0,0.00,2,0,0,93826.63,0
4,850,Spain,Female,43,2.0,125510.82,1,1,1,79084.10,0
...,...,...,...,...,...,...,...,...,...,...,...
9995,771,France,Male,39,5.0,0.00,2,1,0,96270.64,0
9996,516,France,Male,35,10.0,57369.61,1,1,1,101699.77,0
9997,709,France,Female,36,7.0,0.00,1,0,1,42085.58,1
9998,772,Germany,Male,42,3.0,75075.31,2,1,0,92888.52,1


In [397]:
data.isnull().sum()

CreditScore          0
Geography            0
Gender               0
Age                  0
Tenure             909
Balance              0
NumOfProducts        0
HasCrCard            0
IsActiveMember       0
EstimatedSalary      0
Exited               0
dtype: int64

The 'Tenure' column has 909 missing values.

In [398]:
data['Tenure'] = data['Tenure'].fillna(-1)

In [399]:
data.isnull().sum()

CreditScore        0
Geography          0
Gender             0
Age                0
Tenure             0
Balance            0
NumOfProducts      0
HasCrCard          0
IsActiveMember     0
EstimatedSalary    0
Exited             0
dtype: int64

In [400]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 11 columns):
CreditScore        10000 non-null int64
Geography          10000 non-null object
Gender             10000 non-null object
Age                10000 non-null int64
Tenure             10000 non-null float64
Balance            10000 non-null float64
NumOfProducts      10000 non-null int64
HasCrCard          10000 non-null int64
IsActiveMember     10000 non-null int64
EstimatedSalary    10000 non-null float64
Exited             10000 non-null int64
dtypes: float64(3), int64(6), object(2)
memory usage: 859.5+ KB


The missing 'Tenure' values have been replaced with the sentinel value -1, so no nulls remain.
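A sentinel such as -1 is easy for tree models to split off, but a linear model treats it as a real tenure value. The sketch below contrasts the sentinel fill with one possible alternative, a median fill plus a "was missing" flag. The toy values and the column names `Tenure_missing` and `Tenure_median` are made up for illustration and are not part of this notebook.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the 'Tenure' column; the values are invented for illustration.
toy = pd.DataFrame({'Tenure': [2.0, np.nan, 8.0, 1.0, np.nan, 5.0]})

# Approach used in this notebook: a sentinel outside the valid range.
sentinel = toy['Tenure'].fillna(-1)

# Hypothetical alternative: keep a flag for missingness and fill with the median.
toy['Tenure_missing'] = toy['Tenure'].isna().astype(int)
toy['Tenure_median'] = toy['Tenure'].fillna(toy['Tenure'].median())

print(sentinel.tolist())
print(toy)
```

Which variant works better depends on the model; it would need to be checked on the validation set.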

In [401]:
data1 = pd.get_dummies(data, drop_first=True)
data1.head()

Unnamed: 0,CreditScore,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Exited,Geography_Germany,Geography_Spain,Gender_Male
0,619,42,2.0,0.0,1,1,1,101348.88,1,0,0,0
1,608,41,1.0,83807.86,1,0,1,112542.58,0,0,1,0
2,502,42,8.0,159660.8,3,1,0,113931.57,1,0,0,0
3,699,39,1.0,0.0,2,0,0,93826.63,0,0,0,0
4,850,43,2.0,125510.82,1,1,1,79084.1,0,0,1,0


The categorical features ('Geography' and 'Gender') are converted to numeric indicator columns.
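To make the effect of `drop_first=True` concrete, here is a minimal sketch on a toy frame (the rows are invented for illustration). The first category of each feature, in sorted order, is dropped because it is implied when all the remaining indicators are zero.

```python
import pandas as pd

# Toy frame with the same categorical columns as the dataset; values are illustrative.
toy = pd.DataFrame({
    'Geography': ['France', 'Spain', 'Germany', 'France'],
    'Gender': ['Female', 'Female', 'Male', 'Male'],
    'Age': [42, 41, 39, 35],
})

# 'France' and 'Female' become the implicit baseline categories.
encoded = pd.get_dummies(toy, drop_first=True)
print(list(encoded.columns))
```

The resulting columns match the ones in the notebook output: 'Geography_Germany', 'Geography_Spain', and 'Gender_Male'.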

In [402]:
target = data1['Exited']
features = data1.drop(['Exited'] , axis=1)

We split the dataset into two parts: the features ('features') and the target ('target').

In [403]:
features_df, features_valid, target_df, target_valid = train_test_split(features, target, test_size = 0.2, random_state=12345)

print(features_df.shape)
print(features_valid.shape)

(8000, 11)
(2000, 11)


In [404]:
features_train, features_test, target_train, target_test = train_test_split(features_df, target_df, test_size = 0.25, random_state=12345)

print(features_train.shape)
print(features_test.shape)

(6000, 11)
(2000, 11)


The data is split into training, validation, and test sets in a 60/20/20 ratio (6000, 2000, and 2000 rows).
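The two-step split can be sketched on synthetic data as follows. The `stratify` argument is an addition that this notebook does not use; it keeps the class share equal across the three parts, which may help when the target is unbalanced. The arrays `X` and `y` are synthetic stand-ins, not the bank data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 10000 rows, roughly 20% positives (an assumption for illustration).
rng = np.random.RandomState(12345)
X = rng.normal(size=(10000, 3))
y = (rng.rand(10000) < 0.2).astype(int)

# Step 1: hold out 20% for validation.
X_rest, X_valid, y_rest, y_valid = train_test_split(
    X, y, test_size=0.2, random_state=12345, stratify=y)

# Step 2: 25% of the remaining 80% is 20% of the total, used as the test set.
X_train, X_test, y_train, y_test = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=12345, stratify=y_rest)

print(len(X_train), len(X_valid), len(X_test))
print(y_train.mean(), y_valid.mean(), y_test.mean())
```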

## Exploring the Task

### Logistic Regression

In [405]:
model = LogisticRegression(random_state=12345, solver = 'liblinear')
model.fit(features_train, target_train)
predictions_valid = model.predict(features_valid)

print('F1:', f1_score(target_valid, predictions_valid))

F1: 0.09896907216494845


We start with logistic regression. Its F1 score of about 0.10 is quite poor.

### Decision Tree

In [406]:
model = DecisionTreeClassifier(random_state=12345)
model.fit(features_train, target_train)
predictions_valid = model.predict(features_valid)
print('F1:', f1_score(target_valid, predictions_valid))

F1: 0.5105882352941177


The decision tree handles the task better (F1 of about 0.51), but that is still not enough. Let's try tuning its parameters.

In [407]:
for depth in range(1, 10):
    model = DecisionTreeClassifier(random_state=12345, max_depth=depth)
    model.fit(features_train, target_train)
    # model.score() reports accuracy on the validation set, not F1
    print("max_depth =", depth, ": ", end='')
    print(model.score(features_valid, target_valid))

max_depth = 1 : 0.7865
max_depth = 2 : 0.8225
max_depth = 3 : 0.834
max_depth = 4 : 0.8375
max_depth = 5 : 0.848
max_depth = 6 : 0.848
max_depth = 7 : 0.8485
max_depth = 8 : 0.8385
max_depth = 9 : 0.832


We choose depth 7, which gives the highest validation accuracy (0.8485).
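The loop above ranks depths by accuracy, while the project target is F1, and the two metrics do not have to agree on the best depth. A minimal sketch of the same search scored by F1 is shown below. It runs on synthetic data from `make_classification` (an assumption for illustration), so the depth it picks says nothing about the bank dataset.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic unbalanced data standing in for the real features and target.
X, y = make_classification(n_samples=2000, n_features=10,
                           weights=[0.8, 0.2], random_state=12345)
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.25, random_state=12345, stratify=y)

best_depth, best_f1 = None, -1.0
for depth in range(1, 10):
    model = DecisionTreeClassifier(random_state=12345, max_depth=depth)
    model.fit(X_train, y_train)
    # Score by F1 on the validation part instead of accuracy.
    score = f1_score(y_valid, model.predict(X_valid))
    if score > best_f1:
        best_depth, best_f1 = depth, score

print('best max_depth by F1:', best_depth, 'F1:', round(best_f1, 3))
```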

In [408]:
model = DecisionTreeClassifier(random_state=12345, max_depth=7)
model.fit(features_train, target_train)
predictions_valid = model.predict(features_valid)
print('F1:', f1_score(target_valid, predictions_valid))

F1: 0.5640287769784172


The model improved (F1 of about 0.564), but it is still below the 0.59 target.

### Random Forest

In [409]:
model = RandomForestClassifier(random_state=12345)
model.fit(features_train, target_train)
predictions_valid = model.predict(features_valid)
print('F1:', f1_score(target_valid, predictions_valid))

F1: 0.5158371040723982




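Among the models with default settings, this is the best F1 so far (0.516, compared with 0.511 for the decision tree and 0.099 for logistic regression).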

In [410]:
for depth in range(1, 16):
    model = RandomForestClassifier(n_estimators=20, max_depth=depth, random_state=12345)
    model.fit(features_train, target_train)
    # model.score() reports accuracy on the validation set, not F1
    print("max_depth =", depth, ": ", end='')
    print(model.score(features_valid, target_valid))

max_depth = 1 : 0.7865
max_depth = 2 : 0.8035
max_depth = 3 : 0.806
max_depth = 4 : 0.8205
max_depth = 5 : 0.8435
max_depth = 6 : 0.8505
max_depth = 7 : 0.8525
max_depth = 8 : 0.8485
max_depth = 9 : 0.85
max_depth = 10 : 0.8555
max_depth = 11 : 0.85
max_depth = 12 : 0.85
max_depth = 13 : 0.8475
max_depth = 14 : 0.8455
max_depth = 15 : 0.848


We look for the best depth and choose 10, which gives the highest validation accuracy (0.8555).

In [411]:
model = RandomForestClassifier(n_estimators=20, max_depth=10, random_state=12345)
model.fit(features_train, target_train)
predictions_valid = model.predict(features_valid)
print('F1:', f1_score(target_valid, predictions_valid))

F1: 0.556067588325653


That is better: F1 rose from about 0.516 to about 0.556.

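### Conclusion

We explored three models and tuned the depth of the two tree-based ones. The decision tree (F1 of about 0.564) and the random forest (F1 of about 0.556) performed best, but neither reaches the 0.59 target yet.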
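One reason accuracy and F1 diverge so strongly here is class imbalance. The validation set holds 1573 negatives and 427 positives (the row sums of the confusion matrix printed in the class-imbalance section of this notebook). The sketch below scores a hypothetical baseline that always predicts the majority class: its accuracy equals the 0.7865 printed for `max_depth = 1` in both depth searches, while its F1 is zero.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

# Validation-set class counts from the confusion matrix in this notebook:
# 1068 + 505 negatives and 123 + 304 positives.
y_true = np.array([0] * 1573 + [1] * 427)

# Hypothetical baseline: always predict class 0.
y_pred = np.zeros_like(y_true)

accuracy = accuracy_score(y_true, y_pred)
# zero_division=0 returns 0 without a warning when no positives are predicted.
f1 = f1_score(y_true, y_pred, zero_division=0)

print('accuracy:', accuracy)
print('F1:', f1)
```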

## Handling Class Imbalance

### Logistic Regression

In [412]:
model = LogisticRegression(random_state=12345, solver = 'liblinear', class_weight='balanced')
model.fit(features_train, target_train)
predictions_valid = model.predict(features_valid)

print('F1:', f1_score(target_valid, predictions_valid))
print(confusion_matrix(target_valid, predictions_valid))




F1: 0.49190938511326854
[[1068  505]
 [ 123  304]]


We try the first balancing method, class weighting (`class_weight='balanced'`). F1 improved from about 0.10 to about 0.49.
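The confusion matrix printed above is enough to reproduce the reported F1 by hand. In scikit-learn's layout, rows are true classes and columns are predicted classes, so the matrix reads `[[TN, FP], [FN, TP]]`. A short worked calculation:

```python
import numpy as np

# Confusion matrix copied from the cell output above.
cm = np.array([[1068, 505],
               [123, 304]])
tn, fp, fn, tp = (int(v) for v in cm.ravel())

precision = tp / (tp + fp)   # share of predicted churners who really churned
recall = tp / (tp + fn)      # share of real churners the model caught
f1 = 2 * precision * recall / (precision + recall)

print('precision:', round(precision, 3))  # 0.376
print('recall:', round(recall, 3))        # 0.712
print('F1:', round(f1, 3))                # 0.492
```

The model catches about 71% of the customers who leave, but only about 38% of its positive predictions are correct, which is what holds F1 down.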

In [413]:
def upsample(features, target, repeat):
    features_zeros = features[target == 0]
    features_ones = features[target == 1]
    target_zeros = target[target == 0]
    target_ones = target[target == 1]

    features_upsampled = pd.concat([features_zeros] + [features_ones] * repeat)
    target_upsampled = pd.concat([target_zeros] + [target_ones] * repeat)
    
    features_upsampled, target_upsampled = shuffle(
        features_upsampled, target_upsampled, random_state=12345)
    
    return features_upsampled, target_upsampled

features_upsampled, target_upsampled = upsample(features_train, target_train, 2)

model = LogisticRegression(random_state=12345, solver='liblinear')
model.fit(features_upsampled, target_upsampled)
predicted_valid = model.predict(features_valid)
print("F1:", f1_score(target_valid, predicted_valid))

F1: 0.31024096385542166


Upsampling, with the positive class repeated twice: F1 is about 0.31, which is worse than the 0.49 obtained with class weighting.
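With `repeat=2` the positive class is only doubled, so it may still be the minority after upsampling. The sketch below uses a compact version of the same idea on toy data (8 negatives and 2 positives, invented for illustration) to show how the class counts change with the repeat factor.

```python
import pandas as pd
from sklearn.utils import shuffle


def upsample(features, target, repeat):
    """Repeat the positive-class rows `repeat` times, then shuffle."""
    features_up = pd.concat([features[target == 0]] + [features[target == 1]] * repeat)
    target_up = pd.concat([target[target == 0]] + [target[target == 1]] * repeat)
    return shuffle(features_up, target_up, random_state=12345)


# Toy data: 8 negatives and 2 positives.
toy_features = pd.DataFrame({'x': range(10)})
toy_target = pd.Series([0] * 8 + [1] * 2)

for repeat in (2, 4):
    _, upsampled_target = upsample(toy_features, toy_target, repeat)
    n_neg = int((upsampled_target == 0).sum())
    n_pos = int((upsampled_target == 1).sum())
    print(f'repeat={repeat}: negatives={n_neg}, positives={n_pos}')
```

In this toy case the classes only become equal at `repeat=4`; the right factor for the bank data depends on its actual class ratio.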

In [414]:
def downsample(features, target, fraction):
    features_zeros = features[target == 0]
    features_ones = features[target == 1]
    target_zeros = target[target == 0]
    target_ones = target[target == 1]

    features_downsampled = pd.concat(
        [features_zeros.sample(frac=fraction, random_state=12345)] + [features_ones])
    target_downsampled = pd.concat(
        [target_zeros.sample(frac=fraction, random_state=12345)] + [target_ones])
    
    features_downsampled, target_downsampled = shuffle(
        features_downsampled, target_downsampled, random_state=12345)
    
    return features_downsampled, target_downsampled

features_downsampled, target_downsampled = downsample(features_train, target_train, 0.1)

model = LogisticRegression(random_state=12345, solver='liblinear')
model.fit(features_downsampled, target_downsampled)
predicted_valid = model.predict(features_valid)
print("F1:", f1_score(target_valid, predicted_valid))

F1: 0.3701440419030991


Downsampling did not help either: F1 is about 0.37, still below the 0.49 obtained with class weighting.
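A possible reason is the choice of `fraction=0.1`. The arithmetic below uses the class counts of the validation set (1573 negatives, 427 positives, from the confusion matrix above) and assumes the training set has a similar ratio, which this notebook does not verify. Under that assumption, keeping 10% of the negatives flips the imbalance the other way, while a fraction near the ratio of positives to negatives would roughly equalize the classes.

```python
# Class counts from the validation confusion matrix above.
# Assumption: the training set has a similar class ratio (not checked here).
n_neg, n_pos = 1068 + 505, 123 + 304

# Fraction of negatives to keep so that both classes have equal size.
balanced_fraction = n_pos / n_neg

for fraction in (0.1, balanced_fraction):
    kept_neg = round(n_neg * fraction)
    share_pos = n_pos / (n_pos + kept_neg)
    print(f'fraction={fraction:.3f}: negatives kept={kept_neg}, '
          f'positive share={share_pos:.2f}')
```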

In [415]:
#numeric = ['CreditScore', 'Age', 'Balance', 'EstimatedSalary']

#scaler = StandardScaler()
#scaler.fit(features_train[numeric])

#features_train[numeric] = scaler.transform(features_train[numeric])
#features_valid[numeric] = scaler.transform(features_valid[numeric])

#pd.options.mode.chained_assignment = None

#model = LogisticRegression(random_state=12345, solver = 'liblinear')
#model.fit(features_train, target_train)
#predictions_valid = model.predict(features_valid)
#features_train.head()


In [416]:
#print('F1:', f1_score(target_valid, predictions_valid))

Resampling is less effective than class weighting for logistic regression, so we keep class weighting.
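The commented-out cells above were an attempt to standardize the numeric features before fitting logistic regression. One way to do this without assigning into slices of the original frames is a scikit-learn pipeline. The sketch below is an illustration on synthetic data from `make_classification`, not a result for the bank dataset; combining scaling with `class_weight='balanced'` is an assumption of this example.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic unbalanced data standing in for the real features and target.
X, y = make_classification(n_samples=2000, n_features=10,
                           weights=[0.8, 0.2], random_state=12345)
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.25, random_state=12345, stratify=y)

# The scaler is fitted on the training part only and reused at predict time.
model = make_pipeline(
    StandardScaler(),
    LogisticRegression(random_state=12345, solver='liblinear',
                       class_weight='balanced'),
)
model.fit(X_train, y_train)

f1 = f1_score(y_valid, model.predict(X_valid))
print('F1:', round(f1, 3))
```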

### Decision Tree

In [417]:
model = DecisionTreeClassifier(random_state=12345, max_depth=7, class_weight='balanced')
model.fit(features_train, target_train)
predictions_valid = model.predict(features_valid)
print('F1:', f1_score(target_valid, predictions_valid))
  

F1: 0.5867689357622243


With class weighting the model improves: F1 rises from about 0.564 to about 0.587.
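For reference, `class_weight='balanced'` assigns each class the weight `n_samples / (n_classes * count_of_that_class)`. The sketch below checks that formula against scikit-learn's `compute_class_weight` helper on a toy target with the same class counts as the validation set (1573 and 427); the training-set weights in this notebook may differ slightly.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Toy target with the validation-set class counts (an illustrative choice).
y = np.array([0] * 1573 + [1] * 427)

# Weights as scikit-learn computes them for class_weight='balanced'.
weights = compute_class_weight('balanced', classes=np.array([0, 1]), y=y)

# The same weights from the formula n_samples / (n_classes * class_count).
manual = len(y) / (2 * np.bincount(y))

print('sklearn:', weights.round(3))
print('manual: ', manual.round(3))
```

The rare class gets a weight several times larger than the common one, so mistakes on churners cost more during training.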

In [418]:
# upsample() is defined in the logistic regression subsection above

features_upsampled, target_upsampled = upsample(features_train, target_train, 2)

model = DecisionTreeClassifier(random_state=12345, max_depth=7)
model.fit(features_upsampled, target_upsampled)
predicted_valid = model.predict(features_valid)
print("F1:", f1_score(target_valid, predicted_valid))

F1: 0.6170731707317073


In [419]:
# downsample() is defined in the logistic regression subsection above

features_downsampled, target_downsampled = downsample(features_train, target_train, 0.1)

model = DecisionTreeClassifier(random_state=12345, max_depth=7)
model.fit(features_downsampled, target_downsampled)
predicted_valid = model.predict(features_valid)
print("F1:", f1_score(target_valid, predicted_valid))

F1: 0.502013422818792


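For the decision tree, upsampling gives the best result (F1 of about 0.617, compared with 0.587 for class weighting), while downsampling makes the model worse (F1 of about 0.502).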

### Random Forest

In [420]:
model = RandomForestClassifier(n_estimators=20, max_depth=10, random_state=12345, class_weight='balanced')
model.fit(features_train, target_train) 
predictions_valid = model.predict(features_valid)
print('F1:', f1_score(target_valid, predictions_valid))


F1: 0.6148409893992933


With class weighting the random forest reaches F1 of about 0.615, above the 0.59 target, so we continue with the random forest.

In [421]:
# upsample() is defined in the logistic regression subsection above

features_upsampled, target_upsampled = upsample(features_train, target_train, 2)

model = RandomForestClassifier(n_estimators=20, max_depth=10, random_state=12345)
model.fit(features_upsampled, target_upsampled)
predicted_valid = model.predict(features_valid)
print("F1:", f1_score(target_valid, predicted_valid))

F1: 0.6325224071702944


In [422]:
# downsample() is defined in the logistic regression subsection above

features_downsampled, target_downsampled = downsample(features_train, target_train, 0.1)

model = RandomForestClassifier(n_estimators=20, max_depth=10, random_state=12345)
model.fit(features_downsampled, target_downsampled)
predicted_valid = model.predict(features_valid)
print("F1:", f1_score(target_valid, predicted_valid))

F1: 0.4871634314339386


## Testing the Model

In [423]:
# upsample() is defined in the class-imbalance section above

features_upsampled, target_upsampled = upsample(features_train, target_train, 2)


model = RandomForestClassifier(n_estimators=100, max_depth=10, random_state=12345)
model.fit(features_upsampled, target_upsampled)
predictions_test  = model.predict(features_test)

print('F1:', f1_score(target_test, predictions_test))

F1: 0.6028169014084506


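The random forest turned out to be the best model. Trained on the upsampled training set with `n_estimators=100` and `max_depth=10`, it scores F1 of about 0.603 on the test set, above the required 0.59.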
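The task statement also asks for AUC-ROC, which this notebook does not compute. A minimal sketch of how it could be measured next to F1 is given below. It runs on synthetic data from `make_classification`, so the printed numbers are not results for the bank dataset. Note that AUC-ROC needs the predicted probability of the positive class, taken from `predict_proba`, rather than the hard labels from `predict`.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic unbalanced data standing in for the real features and target.
X, y = make_classification(n_samples=2000, n_features=10,
                           weights=[0.8, 0.2], random_state=12345)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=12345, stratify=y)

model = RandomForestClassifier(n_estimators=100, max_depth=10, random_state=12345)
model.fit(X_train, y_train)

# Probability of class 1 for each test row.
proba_one = model.predict_proba(X_test)[:, 1]

auc = roc_auc_score(y_test, proba_one)
f1 = f1_score(y_test, model.predict(X_test))

print('AUC-ROC:', round(auc, 3))
print('F1:', round(f1, 3))
```

AUC-ROC is computed over all decision thresholds, while F1 describes the single default threshold of 0.5, so the two values are not directly comparable in magnitude.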

## Conclusion

We built a model that predicts whether a customer intends to leave the bank. We inspected the data, dropped the columns we did not need, filled in the missing values, and converted the categorical features to numeric ones, because the models accept only numeric input.
We trained three models and compared them by F1 score. After addressing class imbalance, we selected the best model, a random forest, and verified it on the test set, where it reached F1 of about 0.603.