# Customer churn

Clients have started leaving Beta Bank. The outflow is small but noticeable every month. The bank's marketers have calculated that retaining existing clients is cheaper than acquiring new ones.

The task is to predict whether a client will leave the bank in the near future.

We are given historical data on client behavior and contract terminations.

We need to build a model with the highest possible *F1* score (at least 0.59).

In addition, we measure *AUC-ROC* and compare it with the *F1* score.
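
As a reminder of what the two metrics measure, here is a minimal pure-Python sketch on made-up labels and scores. The toy data and the helper names `f1_from_labels` and `auc_from_scores` are illustrative assumptions, not part of the project. F1 depends on the decision threshold (0.5 here), whereas AUC-ROC depends only on how well the scores rank positives above negatives.

```python
def f1_from_labels(y_true, y_pred):
    """F1 = 2*TP / (2*TP + FP + FN) for the positive class (label 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def auc_from_scores(y_true, scores):
    """AUC-ROC as the share of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy data: 3 churners (1) and 5 loyal clients (0) with hypothetical model scores.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
scores = [0.9, 0.4, 0.3, 0.35, 0.2, 0.1, 0.1, 0.05]
y_pred = [1 if s >= 0.5 else 0 for s in scores]

f1 = f1_from_labels(y_true, y_pred)
auc = auc_from_scores(y_true, scores)
print(f"F1 = {f1:.3f}, AUC-ROC = {auc:.3f}")  # F1 = 0.500, AUC-ROC = 0.933
```

In this toy case the ranking is nearly perfect, yet F1 is only 0.5 because two of the three churners score below the threshold. A model can therefore have a decent AUC-ROC and a poor F1 at the same time.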

Data source: [https://www.kaggle.com/barelydedicated/bank-customer-churn-modeling](https://www.kaggle.com/barelydedicated/bank-customer-churn-modeling)

## Data preparation

In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OrdinalEncoder 
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import f1_score
from sklearn.metrics import roc_auc_score
from sklearn.utils import shuffle

In [2]:
df = pd.read_csv('Churn.csv')

In [3]:
df.head()

| | RowNumber | CustomerId | Surname | CreditScore | Geography | Gender | Age | Tenure | Balance | NumOfProducts | HasCrCard | IsActiveMember | EstimatedSalary | Exited |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 15634602 | Hargrave | 619 | France | Female | 42 | 2.0 | 0.0 | 1 | 1 | 1 | 101348.88 | 1 |
| 1 | 2 | 15647311 | Hill | 608 | Spain | Female | 41 | 1.0 | 83807.86 | 1 | 0 | 1 | 112542.58 | 0 |
| 2 | 3 | 15619304 | Onio | 502 | France | Female | 42 | 8.0 | 159660.8 | 3 | 1 | 0 | 113931.57 | 1 |
| 3 | 4 | 15701354 | Boni | 699 | France | Female | 39 | 1.0 | 0.0 | 2 | 0 | 0 | 93826.63 | 0 |
| 4 | 5 | 15737888 | Mitchell | 850 | Spain | Female | 43 | 2.0 | 125510.82 | 1 | 1 | 1 | 79084.1 | 0 |


In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 14 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   RowNumber        10000 non-null  int64  
 1   CustomerId       10000 non-null  int64  
 2   Surname          10000 non-null  object 
 3   CreditScore      10000 non-null  int64  
 4   Geography        10000 non-null  object 
 5   Gender           10000 non-null  object 
 6   Age              10000 non-null  int64  
 7   Tenure           9091 non-null   float64
 8   Balance          10000 non-null  float64
 9   NumOfProducts    10000 non-null  int64  
 10  HasCrCard        10000 non-null  int64  
 11  IsActiveMember   10000 non-null  int64  
 12  EstimatedSalary  10000 non-null  float64
 13  Exited           10000 non-null  int64  
dtypes: float64(3), int64(8), object(3)
memory usage: 1.1+ MB


The data is incomplete: the `Tenure` column is missing 909 values. We will fill them in.

In [5]:
df.corr(numeric_only=True)  # restrict to numeric columns; the frame also contains object columns


| | RowNumber | CustomerId | CreditScore | Age | Tenure | Balance | NumOfProducts | HasCrCard | IsActiveMember | EstimatedSalary | Exited |
|---|---|---|---|---|---|---|---|---|---|---|---|
| RowNumber | 1.0 | 0.004202 | 0.00584 | 0.000783 | -0.007322 | -0.009067 | 0.007246 | 0.000599 | 0.012044 | -0.005988 | -0.016571 |
| CustomerId | 0.004202 | 1.0 | 0.005308 | 0.009497 | -0.021418 | -0.012419 | 0.016972 | -0.014025 | 0.001665 | 0.015271 | -0.006248 |
| CreditScore | 0.00584 | 0.005308 | 1.0 | -0.003965 | -6.2e-05 | 0.006268 | 0.012238 | -0.005458 | 0.025651 | -0.001384 | -0.027094 |
| Age | 0.000783 | 0.009497 | -0.003965 | 1.0 | -0.013134 | 0.028308 | -0.03068 | -0.011721 | 0.085472 | -0.007201 | 0.285323 |
| Tenure | -0.007322 | -0.021418 | -6.2e-05 | -0.013134 | 1.0 | -0.007911 | 0.011979 | 0.027232 | -0.032178 | 0.01052 | -0.016761 |
| Balance | -0.009067 | -0.012419 | 0.006268 | 0.028308 | -0.007911 | 1.0 | -0.30418 | -0.014858 | -0.010084 | 0.012797 | 0.118533 |
| NumOfProducts | 0.007246 | 0.016972 | 0.012238 | -0.03068 | 0.011979 | -0.30418 | 1.0 | 0.003183 | 0.009612 | 0.014204 | -0.04782 |
| HasCrCard | 0.000599 | -0.014025 | -0.005458 | -0.011721 | 0.027232 | -0.014858 | 0.003183 | 1.0 | -0.011866 | -0.009933 | -0.007138 |
| IsActiveMember | 0.012044 | 0.001665 | 0.025651 | 0.085472 | -0.032178 | -0.010084 | 0.009612 | -0.011866 | 1.0 | -0.011421 | -0.156128 |
| EstimatedSalary | -0.005988 | 0.015271 | -0.001384 | -0.007201 | 0.01052 | 0.012797 | 0.014204 | -0.009933 | -0.011421 | 1.0 | 0.012097 |


`Tenure` shows no notable correlation with any other column, so we fill the gaps with the plain median.

In [6]:
df['Tenure'] = df['Tenure'].fillna(df['Tenure'].median())  # fill only the Tenure column

In [7]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 14 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   RowNumber        10000 non-null  int64  
 1   CustomerId       10000 non-null  int64  
 2   Surname          10000 non-null  object 
 3   CreditScore      10000 non-null  int64  
 4   Geography        10000 non-null  object 
 5   Gender           10000 non-null  object 
 6   Age              10000 non-null  int64  
 7   Tenure           10000 non-null  float64
 8   Balance          10000 non-null  float64
 9   NumOfProducts    10000 non-null  int64  
 10  HasCrCard        10000 non-null  int64  
 11  IsActiveMember   10000 non-null  int64  
 12  EstimatedSalary  10000 non-null  float64
 13  Exited           10000 non-null  int64  
dtypes: float64(3), int64(8), object(3)
memory usage: 1.1+ MB


The gaps have been filled.
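
A minimal sketch of column-specific median imputation on a toy frame (the values are made up for illustration). Restricting `fillna` to one column keeps the tenure median from silently filling gaps in any other column that might contain them.

```python
import numpy as np
import pandas as pd

# Toy frame with gaps in two columns; only Tenure should be imputed here.
toy = pd.DataFrame({
    "Tenure": [2.0, np.nan, 8.0, 5.0, np.nan],
    "Balance": [0.0, 83807.86, np.nan, 0.0, 125510.82],
})

tenure_median = toy["Tenure"].median()           # NaN values are skipped by default
toy["Tenure"] = toy["Tenure"].fillna(tenure_median)

print(toy)
```

After this step `Tenure` has no missing values, while the gap in `Balance` is left for a separate decision.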

In [8]:
df = df.drop(['RowNumber', 'CustomerId', 'Surname'], axis=1)

We dropped the `RowNumber`, `CustomerId`, and `Surname` columns: they are identifiers and should not influence the model.

In [9]:
df.columns = ['credit_score', 'geography', 'gender', 'age', 'tenure', 'balance', 'num_of_products', 'has_cr_card', 'is_active_member', 'estimated_salary', 'exited']

In [10]:
features = df.drop('exited', axis=1)
target = df['exited']
features_train, features_valid_and_test, target_train, target_valid_and_test = train_test_split(
    features, target, test_size=0.4, random_state=222)
features_test, features_valid, target_test, target_valid = train_test_split(
    features_valid_and_test, target_valid_and_test, test_size=0.5, random_state=222)


In [11]:
print(f'Size of features_train {features_train.shape}')
print(f'Size of target_train {target_train.shape}')
print(f'Size of features_test {features_test.shape}')
print(f'Size of target_test {target_test.shape}')
print(f'Size of features_valid {features_valid.shape}')
print(f'Size of target_valid {target_valid.shape}')

Size of features_train (6000, 10)
Size of target_train (6000,)
Size of features_test (2000, 10)
Size of target_test (2000,)
Size of features_valid (2000, 10)
Size of target_valid (2000,)


The data was split in a 3:1:1 ratio (train, validation, test).
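
With an imbalanced target it can help to keep the class share identical in every subset. The sketch below shows the same 60/20/20 split with the `stratify` argument of `train_test_split`; the synthetic data and the 20% positive share are assumptions for illustration, not figures from the bank dataset.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))           # synthetic features
y = np.array([1] * 200 + [0] * 800)      # assumed 20% positives

# 60% train, then the remaining 40% is halved into validation and test.
X_train, X_rest, y_train, y_rest = train_test_split(
    X, y, test_size=0.4, random_state=222, stratify=y)
X_valid, X_test, y_valid, y_test = train_test_split(
    X_rest, y_rest, test_size=0.5, random_state=222, stratify=y_rest)

for name, part in [("train", y_train), ("valid", y_valid), ("test", y_test)]:
    print(name, len(part), round(float(part.mean()), 3))
```

Each subset keeps the same share of positives, so metrics measured on validation and test data refer to the same class balance as the training data.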

In [12]:
features_train_categorial = features_train[['geography', 'gender', 'has_cr_card', 'is_active_member']]
features_valid_categorial = features_valid[['geography', 'gender', 'has_cr_card', 'is_active_member']]
features_test_categorial = features_test[['geography', 'gender', 'has_cr_card', 'is_active_member']]

In [13]:
features_train_numeric = features_train.drop(['geography', 'gender', 'has_cr_card', 'is_active_member'], axis=1)
features_valid_numeric = features_valid.drop(['geography', 'gender', 'has_cr_card', 'is_active_member'], axis=1)
features_test_numeric = features_test.drop(['geography', 'gender', 'has_cr_card', 'is_active_member'], axis=1)

The features were split into categorical and numeric ones for each subset.

In [14]:
encoder = OrdinalEncoder(handle_unknown = 'use_encoded_value', unknown_value=-1)

In [15]:
features_train_categorial_ohe = pd.DataFrame(encoder.fit_transform(features_train_categorial), 
                                             columns=features_train_categorial.columns, 
                                             index=features_train_categorial.index)

In [16]:
features_valid_categorial_ohe = pd.DataFrame(encoder.transform(features_valid_categorial), 
                                             columns=features_valid_categorial.columns, 
                                             index=features_valid_categorial.index)
features_test_categorial_ohe = pd.DataFrame(encoder.transform(features_test_categorial), 
                                            columns=features_test_categorial.columns, 
                                            index=features_test_categorial.index)

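We encoded `geography` and `gender` with `OrdinalEncoder`; the already binary `has_cr_card` and `is_active_member` keep their 0/1 values. Despite the `_ohe` suffix in the variable names, this is ordinal, not one-hot, encoding.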
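
Ordinal encoding maps each category to an integer, which implies an order among countries that a linear model may pick up. A commonly used alternative is one-hot encoding. The sketch below is an illustration on toy rows, not the notebook's method; the value "Italy" is a hypothetical unseen category used to show `handle_unknown="ignore"`.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

train = pd.DataFrame({"geography": ["France", "Spain", "Germany", "France"],
                      "gender": ["Female", "Male", "Male", "Female"]})
valid = pd.DataFrame({"geography": ["Spain", "Italy"],    # "Italy" was not seen in train
                      "gender": ["Female", "Male"]})

# Fit on the training rows only; unseen categories become all-zero columns.
ohe = OneHotEncoder(handle_unknown="ignore")
train_ohe = ohe.fit_transform(train).toarray()
valid_ohe = ohe.transform(valid).toarray()

print(ohe.categories_)
print(valid_ohe)
```

The columns follow the sorted categories (France, Germany, Spain, then Female, Male), so the second validation row has zeros in all three geography columns.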

In [17]:
scaler = StandardScaler()

In [18]:
scaler.fit(features_train_numeric)

In [19]:
features_train_numeric_scaled = pd.DataFrame(scaler.transform(features_train_numeric), 
                                             columns=features_train_numeric.columns, 
                                             index=features_train_numeric.index)
features_valid_numeric_scaled = pd.DataFrame(scaler.transform(features_valid_numeric), 
                                             columns=features_valid_numeric.columns, 
                                             index=features_valid_numeric.index)
features_test_numeric_scaled = pd.DataFrame(scaler.transform(features_test_numeric), 
                                            columns=features_test_numeric.columns, 
                                            index=features_test_numeric.index)

We scaled the numeric columns.

In [20]:
features_train_merged = features_train_categorial_ohe.join(features_train_numeric_scaled)
features_valid_merged = features_valid_categorial_ohe.join(features_valid_numeric_scaled)
features_test_merged = features_test_categorial_ohe.join(features_test_numeric_scaled)

In [21]:
print(features_train_merged.shape)
print(features_valid_merged.shape)
print(features_test_merged.shape)

(6000, 10)
(2000, 10)
(2000, 10)


We joined the numeric and categorical columns back together.
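
The split, encode, scale, and join steps can also be expressed as a single object. The following is a sketch with scikit-learn's `ColumnTransformer` on toy rows (the data is made up, and this is an alternative layout rather than the notebook's own code). It is fitted on the training rows and reused for the other subsets.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OrdinalEncoder, StandardScaler

train = pd.DataFrame({"geography": ["France", "Spain", "France", "Germany"],
                      "age": [42, 41, 39, 43],
                      "balance": [0.0, 83807.86, 0.0, 125510.82]})
valid = pd.DataFrame({"geography": ["Spain", "France"],
                      "age": [40, 50],
                      "balance": [1000.0, 0.0]})

preprocess = ColumnTransformer([
    ("cat", OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1), ["geography"]),
    ("num", StandardScaler(), ["age", "balance"]),
])

train_prepared = preprocess.fit_transform(train)   # categories and scaling statistics from train only
valid_prepared = preprocess.transform(valid)       # the same mapping applied to validation rows
print(train_prepared.shape, valid_prepared.shape)
```

The output columns follow the order of the listed transformers: the encoded `geography` first, then the scaled numeric columns.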

## Exploring the task

In [22]:
%%time
def log_regression_model():
    model = LogisticRegression(random_state=12345, solver='lbfgs')
    parameters = {'max_iter': range(101, 1001, 50)}
    grid_search = GridSearchCV(model, parameters, scoring='f1', cv=5)
    grid_search.fit(features_train_merged, target_train)

    best_model = grid_search.best_estimator_
    best_iter = best_model.max_iter

    predicted_valid = best_model.predict(features_valid_merged)
    best_model_accuracy = f1_score(target_valid, predicted_valid)  # F1 score; arguments are (y_true, y_pred)

    probabilities_valid = best_model.predict_proba(features_valid_merged)
    probabilities_one_valid = probabilities_valid[:, 1]
    auc_roc_of_best = roc_auc_score(target_valid, probabilities_one_valid)

    print(f'best model accuracy is {best_model_accuracy}')
    print(f'best iter is {best_iter}')
    print(f'auc roc of best model is {auc_roc_of_best}')

log_regression_model()

best model accuracy is 0.28785046728971964
best iter is 101
auc roc of best model is 0.7564304264721767
CPU times: total: 938 ms
Wall time: 868 ms
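
The metric computation in these cells is identical for every model, so it could live in one helper. A minimal sketch follows; the function name `evaluate` and the synthetic data are assumptions made for this illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, roc_auc_score
from sklearn.model_selection import train_test_split


def evaluate(model, features, target):
    """Return (F1, AUC-ROC) of a fitted binary classifier on one data split."""
    predicted = model.predict(features)
    probabilities_one = model.predict_proba(features)[:, 1]
    return f1_score(target, predicted), roc_auc_score(target, probabilities_one)


# Synthetic imbalanced data stands in for the bank dataset.
X, y = make_classification(n_samples=1000, weights=[0.8, 0.2], random_state=12345)
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.25, random_state=12345, stratify=y)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
f1, auc = evaluate(model, X_valid, y_valid)
print(f"F1 = {f1:.3f}, AUC-ROC = {auc:.3f}")
```

Any estimator with `predict` and `predict_proba` can be passed in, which keeps each model cell down to the grid search plus one call.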


In [23]:
%%time
def tree_regression_model():
    model = DecisionTreeClassifier(random_state=12345)
    parameters = {'max_depth': range(1, 20)}
    grid_search = GridSearchCV(model, parameters, scoring='f1', cv=5)
    grid_search.fit(features_train_merged, target_train)

    best_model = grid_search.best_estimator_
    best_depth = best_model.max_depth

    predicted_valid = best_model.predict(features_valid_merged)
    best_model_accuracy = f1_score(target_valid, predicted_valid)  # F1 score; arguments are (y_true, y_pred)

    probabilities_valid = best_model.predict_proba(features_valid_merged)
    probabilities_one_valid = probabilities_valid[:, 1]
    auc_roc_of_best = roc_auc_score(target_valid, probabilities_one_valid)

    print(f'best model accuracy is {best_model_accuracy}')
    print(f'best depth is {best_depth}')
    print(f'auc roc of best model is {auc_roc_of_best}')

tree_regression_model()

best model accuracy is 0.5622254758418741
best depth is 8
auc roc of best model is 0.814302307924715
CPU times: total: 1.2 s
Wall time: 1.2 s


In [24]:
%%time
def forest_model():
    model = RandomForestClassifier(random_state=12345)
    parameters = {'n_estimators': range(1, 100, 5), 'max_depth': range(1, 20)}
    grid_search = GridSearchCV(model, parameters, scoring='f1', cv=5)
    grid_search.fit(features_train_merged, target_train)

    best_model = grid_search.best_estimator_
    best_depth = best_model.max_depth
    best_est = best_model.n_estimators

    predicted_valid = best_model.predict(features_valid_merged)
    best_model_accuracy = f1_score(target_valid, predicted_valid)  # F1 score; arguments are (y_true, y_pred)

    probabilities_valid = best_model.predict_proba(features_valid_merged)
    probabilities_one_valid = probabilities_valid[:, 1]
    auc_roc_of_best = roc_auc_score(target_valid, probabilities_one_valid)

    print(f'best depth is {best_depth}')
    print(f'best est is {best_est}')
    print(f'best model accuracy is {best_model_accuracy}')
    print(f'auc roc of best model is {auc_roc_of_best}')

forest_model()

best depth is 15
best est is 31
best model accuracy is 0.5830815709969789
auc roc of best model is 0.8416325009901392
CPU times: total: 5min 31s
Wall time: 5min 31s


Even with class imbalance, some models performed reasonably well (the tree and the forest reached F1 of ~0.56 and ~0.58 respectively), whereas logistic regression scored only ~0.29, which is very low. Let's balance the classes in the training set to improve F1.
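
Before choosing a remedy it helps to quantify the imbalance. The sketch below computes class shares and the per-class weights given by the formula scikit-learn documents for `class_weight="balanced"`, `n_samples / (n_classes * count)`. The toy labels with an 80/20 split are an assumption, and class weighting is mentioned only as an alternative that this notebook does not use.

```python
from collections import Counter

target = [0] * 8 + [1] * 2   # toy labels, assumed 80/20 split

counts = Counter(target)
n_samples, n_classes = len(target), len(counts)

# Share of each class in the sample.
shares = {label: count / n_samples for label, count in counts.items()}
# Weight of each class under the "balanced" heuristic.
balanced_weights = {label: n_samples / (n_classes * count) for label, count in counts.items()}

print(shares)             # {0: 0.8, 1: 0.2}
print(balanced_weights)   # {0: 0.625, 1: 2.5}
```

The rare class receives a weight four times larger than the frequent one, which mirrors what repeating the positive rows four times or halving the negative rows tries to achieve by resampling.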

## Dealing with class imbalance

In [25]:
def downsample(features, target, fraction):
    features_zeros = features[target == 0]
    features_ones = features[target == 1]
    target_zeros = target[target == 0]
    target_ones = target[target == 1]
    features_downsampled = pd.concat([features_zeros.sample(frac=fraction, random_state=12345)] + [features_ones])
    target_downsampled = pd.concat([target_zeros.sample(frac=fraction, random_state=12345)] + [target_ones])
    features_downsampled, target_downsampled = shuffle(features_downsampled, target_downsampled, random_state=12345)
    return features_downsampled, target_downsampled
features_train_merged_down, target_train_down = downsample(features_train_merged, target_train, 0.5)

In [26]:
def upsample(features, target, repeat):
    features_zeros = features[target == 0]
    features_ones = features[target == 1]
    target_zeros = target[target == 0]
    target_ones = target[target == 1]
    features_upsampled = pd.concat([features_zeros] + [features_ones] * repeat)
    target_upsampled = pd.concat([target_zeros] + [target_ones] * repeat)
    features_upsampled, target_upsampled = shuffle(features_upsampled, target_upsampled, random_state=12345)
    return features_upsampled, target_upsampled
features_train_merged_up, target_train_up = upsample(features_train_merged, target_train, 4)

In [27]:
%%time
def log_regression_model_down():
    model = LogisticRegression(random_state=12345, solver='lbfgs')
    parameters = {'max_iter': range(101, 1001, 50)}
    grid_search = GridSearchCV(model, parameters, scoring='f1', cv=5)
    grid_search.fit(features_train_merged_down, target_train_down)

    best_model = grid_search.best_estimator_
    best_iter = best_model.max_iter

    predicted_valid = best_model.predict(features_valid_merged)
    best_model_accuracy = f1_score(target_valid, predicted_valid)  # F1 score; arguments are (y_true, y_pred)

    probabilities_valid = best_model.predict_proba(features_valid_merged)
    probabilities_one_valid = probabilities_valid[:, 1]
    auc_roc_of_best = roc_auc_score(target_valid, probabilities_one_valid)

    print(f'best model accuracy is {best_model_accuracy}')
    print(f'best iter is {best_iter}')
    print(f'auc roc of best model is {auc_roc_of_best}')

log_regression_model_down()

best model accuracy is 0.4591836734693878
best iter is 101
auc roc of best model is 0.7569000577646489
CPU times: total: 906 ms
Wall time: 798 ms


Logistic regression reached F1 ≈ 0.46, which is below the required threshold.

In [28]:
%%time
def tree_regression_model_down():
    model = DecisionTreeClassifier(random_state=12345)
    parameters = {'max_depth': range(1, 20)}
    grid_search = GridSearchCV(model, parameters, scoring='f1', cv=5)
    grid_search.fit(features_train_merged_down, target_train_down)

    best_model = grid_search.best_estimator_
    best_depth = best_model.max_depth

    predicted_valid = best_model.predict(features_valid_merged)
    best_model_accuracy = f1_score(target_valid, predicted_valid)  # F1 score; arguments are (y_true, y_pred)

    probabilities_valid = best_model.predict_proba(features_valid_merged)
    probabilities_one_valid = probabilities_valid[:, 1]
    auc_roc_of_best = roc_auc_score(target_valid, probabilities_one_valid)

    print(f'best model accuracy is {best_model_accuracy}')
    print(f'best depth is {best_depth}')
    print(f'auc roc of best model is {auc_roc_of_best}')

tree_regression_model_down()

best model accuracy is 0.5896510228640193
best depth is 5
auc roc of best model is 0.8330476409637461
CPU times: total: 766 ms
Wall time: 765 ms


With downsampling, the tree fell just short of the minimum acceptable F1 (0.5897 against 0.59).

In [29]:
%%time
def tree_regression_model_up():
    model = DecisionTreeClassifier(random_state=12345)
    parameters = {'max_depth': range(1, 20)}
    grid_search = GridSearchCV(model, parameters, scoring='f1', cv=5)
    grid_search.fit(features_train_merged_up, target_train_up)

    best_model = grid_search.best_estimator_
    best_depth = best_model.max_depth

    predicted_valid = best_model.predict(features_valid_merged)
    best_model_accuracy = f1_score(target_valid, predicted_valid)  # F1 score; arguments are (y_true, y_pred)

    probabilities_valid = best_model.predict_proba(features_valid_merged)
    probabilities_one_valid = probabilities_valid[:, 1]
    auc_roc_of_best = roc_auc_score(target_valid, probabilities_one_valid)

    print(f'best model accuracy is {best_model_accuracy}')
    print(f'best depth is {best_depth}')
    print(f'auc roc of best model is {auc_roc_of_best}')

tree_regression_model_up()

best model accuracy is 0.5077105575326216
best depth is 19
auc roc of best model is 0.6976216305911562
CPU times: total: 1.62 s
Wall time: 1.62 s


Upsampling made the tree's results worse (F1 ≈ 0.51, AUC-ROC ≈ 0.70).
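
One plausible reason the grid search picked a very deep tree (`max_depth=19`) on the upsampled data is that copies of the same positive row land in both the training and the validation folds of the cross-validation, which rewards memorization. The pure-Python sketch below illustrates that effect on made-up row identifiers; it is an illustration of the mechanism, not a diagnosis of this particular run.

```python
import random

negatives = [f"neg_{i}" for i in range(80)]
positives = [f"pos_{i}" for i in range(20)]
upsampled = negatives + positives * 4          # same idea as upsample(..., repeat=4)

random.Random(12345).shuffle(upsampled)

n_folds = 5
folds = [upsampled[i::n_folds] for i in range(n_folds)]

leaked_share = []
for i, valid_fold in enumerate(folds):
    # Rows available for training when fold i is held out.
    train_rows = {row for j, fold in enumerate(folds) if j != i for row in fold}
    valid_pos = [row for row in valid_fold if row.startswith("pos_")]
    leaked = [row for row in valid_pos if row in train_rows]
    leaked_share.append(len(leaked) / max(len(valid_pos), 1))

# Share of held-out positives that also appear in the training folds.
print([round(share, 2) for share in leaked_share])
```

A way to avoid this is to resample only the training part of each fold, so the held-out rows never have copies in the data the model was fitted on.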

In [30]:
%%time
def forest_model_down():
    model = RandomForestClassifier(random_state=12345)
    parameters = {'n_estimators': range(1, 100, 5), 'max_depth': range(1, 20)}
    grid_search = GridSearchCV(model, parameters, scoring='f1', cv=5)
    grid_search.fit(features_train_merged_down, target_train_down)

    best_model = grid_search.best_estimator_
    best_depth = best_model.max_depth
    best_est = best_model.n_estimators

    predicted_valid = best_model.predict(features_valid_merged)
    best_model_accuracy = f1_score(target_valid, predicted_valid)  # F1 score; arguments are (y_true, y_pred)

    probabilities_valid = best_model.predict_proba(features_valid_merged)
    probabilities_one_valid = probabilities_valid[:, 1]
    auc_roc_of_best = roc_auc_score(target_valid, probabilities_one_valid)

    print(f'best depth is {best_depth}')
    print(f'best est is {best_est}')
    print(f'best model accuracy is {best_model_accuracy}')
    print(f'auc roc of best model is {auc_roc_of_best}')

forest_model_down()

best depth is 16
best est is 81
best model accuracy is 0.6289308176100629
auc roc of best model is 0.856069749639558
CPU times: total: 3min 47s
Wall time: 3min 47s


Downsampling improved the forest's results, though only modestly (F1 rose from ~0.58 to ~0.63).

In [31]:
%%time
def forest_model_up():
    model = RandomForestClassifier(random_state=12345)
    parameters = {'n_estimators': range(1, 100, 5), 'max_depth': range(1, 20)}
    grid_search = GridSearchCV(model, parameters, scoring='f1', cv=5)
    grid_search.fit(features_train_merged_up, target_train_up)

    best_model = grid_search.best_estimator_
    best_depth = best_model.max_depth
    best_est = best_model.n_estimators

    predicted_valid = best_model.predict(features_valid_merged)
    best_model_accuracy = f1_score(target_valid, predicted_valid)  # F1 score; arguments are (y_true, y_pred)

    probabilities_valid = best_model.predict_proba(features_valid_merged)
    probabilities_one_valid = probabilities_valid[:, 1]
    auc_roc_of_best = roc_auc_score(target_valid, probabilities_one_valid)

    print(f'best depth is {best_depth}')
    print(f'best est is {best_est}')
    print(f'best model accuracy is {best_model_accuracy}')
    print(f'auc roc of best model is {auc_roc_of_best}')

forest_model_up()

best depth is 19
best est is 46
best model accuracy is 0.6186666666666667
auc roc of best model is 0.8499864589644003
CPU times: total: 7min 39s
Wall time: 7min 39s


Upsampling also improved the forest's result (F1 ≈ 0.62), but downsampling did better.

`RandomForestClassifier` trained successfully and cleared the threshold on the validation set, `DecisionTreeClassifier` came close, and `LogisticRegression` stayed far below the F1 threshold. As a result, only the forest model goes on to the test stage.

## Testing the model

In [32]:
%%time
def forest_model_down_test():
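    # NOTE: the best validation run above was the downsampled forest (max_depth=16, n_estimators=81);
    # this cell uses different values and the upsampled set, and the output below matches it as written.
    model = RandomForestClassifier(n_estimators=19, max_depth=81, random_state=12345)
    model.fit(features_train_merged_up, target_train_up)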
    predicted_test = model.predict(features_test_merged)
    model_accuracy = f1_score(target_test, predicted_test)  # F1 score; arguments are (y_true, y_pred)
    probabilities_test = model.predict_proba(features_test_merged)
    probabilities_one_test = probabilities_test[:, 1]
    model_roc_auc = roc_auc_score(target_test, probabilities_one_test)
    print(f'Model F1 accuracy is {model_accuracy}')
    print(f'Auc-Roc of model is {model_roc_auc}')
forest_model_down_test()

Model F1 accuracy is 0.5982905982905983
Auc-Roc of model is 0.8269299303689894
CPU times: total: 172 ms
Wall time: 159 ms


On the test set, F1 also reached the 0.59 threshold (≈0.598).
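
The test cell sets `n_estimators` and `max_depth` by hand. A sketch of a less error-prone pattern is to take them from `GridSearchCV.best_params_` and unpack them into the final model. The synthetic data and the tiny grid below are assumptions chosen to keep the example fast.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=300, random_state=12345)   # synthetic stand-in data

grid_search = GridSearchCV(
    RandomForestClassifier(random_state=12345),
    {"n_estimators": [5, 10], "max_depth": [2, 4]},
    scoring="f1",
    cv=3,
)
grid_search.fit(X, y)

# Reuse the selected hyperparameters by name instead of retyping them.
best_params = grid_search.best_params_
final_model = RandomForestClassifier(random_state=12345, **best_params).fit(X, y)
print(best_params)
```

Because the parameters are passed by keyword, the values of `n_estimators` and `max_depth` cannot be swapped by accident.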

We examined three kinds of models: logistic regression, a decision tree, and a random forest.
After tuning hyperparameters and balancing the classes we improved the F1 scores of these models; the forest turned out to be the best and was tested successfully.
The resulting model predicts whether a client will leave the bank in the near future with F1 ≈ 0.60 and AUC-ROC ≈ 0.83 on the test set.