# Customer churn

Oleg Bulygin:  
* [LinkedIn](linkedin.com/in/obulygin)  
* [My Telegram channel on Python](https://t.me/pythontalk_ru)
* [Channel chat](https://t.me/pythontalk_chat)
* [Blog on Teletype](https://teletype.in/@pythontalk)
* [PythonTalk on Yandex Q](https://yandex.ru/q/loves/pythontalk/)

Data description: https://archive.ics.uci.edu/ml/datasets/Credit+Approval


In [None]:
import pandas as pd
import numpy as np
import warnings
warnings.simplefilter('ignore')
pd.options.display.max_colwidth = 500
pd.options.display.max_columns = 100
pd.options.display.max_rows = 100

In [None]:
RANDOM = 4281

In [None]:
df = pd.read_csv("https://raw.githubusercontent.com/obulygin/SkillFactory/main/crx.data", header=None)

In [None]:
df.head()

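|   | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 |
|---|---|---|---|---|---|---|---|---|---|---|----|----|----|----|----|----|
| 0 | b | 30.83 | 0.0 | u | g | w | v | 1.25 | t | t | 1 | f | g | 202 | 0 | + |
| 1 | a | 58.67 | 4.46 | u | g | q | h | 3.04 | t | t | 6 | f | g | 43 | 560 | + |
| 2 | a | 24.5 | 0.5 | u | g | q | h | 1.5 | t | f | 0 | f | g | 280 | 824 | + |
| 3 | b | 27.83 | 1.54 | u | g | w | v | 3.75 | t | t | 5 | t | g | 100 | 3 | + |
| 4 | b | 20.17 | 5.625 | u | g | w | v | 1.71 | t | f | 0 | f | s | 120 | 0 | + |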


In [None]:
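# The file marks missing values with '?'; convert them to NaN
df.replace('?', np.nan, inplace=True)
# Column 1 was read as strings because of the '?' markers; cast it to float
df[1] = df[1].astype('float64')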
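
As an aside, the `'?'` markers can also be turned into NaN at load time. The sketch below is an alternative under stated assumptions, not what the notebook does: it parses a made-up three-row inline sample with `pd.read_csv(..., na_values='?')`, so the numeric column comes out as float without a separate cast.

```python
import io

import pandas as pd

# Hypothetical three-row sample in the same comma-separated style as crx.data
sample = io.StringIO("b,30.83\na,?\na,24.5\n")

# na_values='?' tells the parser to treat '?' as a missing value while reading
toy = pd.read_csv(sample, header=None, na_values='?')

n_missing = int(toy[1].isna().sum())
print(toy.dtypes)
print('Missing values in column 1:', n_missing)
```

With this approach numeric columns that contain `'?'` are parsed as floats right away, which also affects which columns a dtype-based split would later treat as categorical.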

In [None]:
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

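# Features are columns 0-14; column 15 holds the target ('+' or '-')
X, y = df.iloc[:, 0:15], df.iloc[:, 15]
# Category codes follow sorted order: '+' -> 0, '-' -> 1
y = y.astype('category').cat.codes

# Split the features by dtype: string columns are categorical, the rest are numeric
cat_columns = X.dtypes[X.dtypes == 'object'].index
num_columns = X.dtypes[X.dtypes != 'object'].index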

num_pipe = Pipeline([
    ('imputer', SimpleImputer(missing_values=np.nan, strategy='mean')),
    ('mms', MinMaxScaler(feature_range=(0, 1)))
])
cat_pipe = Pipeline([
    ('imputer', SimpleImputer(missing_values=np.nan, strategy='most_frequent')),
    ('ohe', OneHotEncoder(handle_unknown='ignore'))
])
transformer = ColumnTransformer(transformers=
                                [('num', num_pipe, num_columns),
                                 ('cat', cat_pipe, cat_columns)])
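
To see what this preprocessing does, here is a minimal sketch on a hypothetical four-row frame. The column names `num` and `cat` and all values are invented for illustration; the pipeline steps mirror the ones defined above.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

# Hypothetical toy frame: one numeric and one categorical column, each with a gap
toy = pd.DataFrame({
    'num': [0.0, 10.0, np.nan, 20.0],
    'cat': pd.Series(['a', 'a', np.nan, 'b'], dtype=object),
})

toy_transformer = ColumnTransformer(transformers=[
    # Numeric branch: fill gaps with the column mean, then scale to [0, 1]
    ('num', Pipeline([
        ('imputer', SimpleImputer(strategy='mean')),
        ('mms', MinMaxScaler()),
    ]), ['num']),
    # Categorical branch: fill gaps with the most frequent value, then one-hot encode
    ('cat', Pipeline([
        ('imputer', SimpleImputer(strategy='most_frequent')),
        ('ohe', OneHotEncoder(handle_unknown='ignore')),
    ]), ['cat']),
])

out = toy_transformer.fit_transform(toy)
# The result can be sparse or dense depending on its density; make it dense for display
out = out.toarray() if hasattr(out, 'toarray') else np.asarray(out)
print(out)
```

In the third row the numeric gap is filled with the mean (10, which scales to 0.5) and the categorical gap is filled with `'a'`, the most frequent value. Because the transformer sits inside each model pipeline, these statistics are refitted on the training folds during cross-validation rather than on the full data.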

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn import svm
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import make_pipeline

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, shuffle=True, random_state=RANDOM)

res = {}

In [None]:
logreg_pipe = make_pipeline(transformer, LogisticRegression(random_state=RANDOM))
logreg_params_grid = {'logisticregression__C': np.logspace(-4, 2, 20),
                      'logisticregression__solver': ['newton-cg', 'lbfgs', 'liblinear'],
                      'logisticregression__class_weight': ['balanced', None]
                     }
logreg_grid = RandomizedSearchCV(logreg_pipe, logreg_params_grid, scoring='f1', random_state=RANDOM)
logreg_grid.fit(X_train, y_train)
print('Best parameters:', logreg_grid.best_params_)
print('Cross-validation F1 score:', logreg_grid.best_score_)
print('Logistic regression F1 score on the test set:', logreg_grid.score(X_test, y_test))
res['Logistic regression'] = logreg_grid.score(X_test, y_test)

Best parameters: {'logisticregression__solver': 'liblinear', 'logisticregression__class_weight': None, 'logisticregression__C': 0.01623776739188721}
Cross-validation F1 score: 0.8833357380560525
Logistic regression F1 score on the test set: 0.8571428571428572
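
The odd-looking value of `C` in the best parameters is simply one point of the `np.logspace(-4, 2, 20)` grid. The short sketch below rebuilds that grid and locates the reported value in it; the variable names are illustrative.

```python
import numpy as np

# Same grid as in logreg_params_grid: 20 points from 1e-4 to 1e2,
# evenly spaced on a log10 scale
c_grid = np.logspace(-4, 2, 20)

best_c = 0.01623776739188721  # value reported in the best parameters above
idx = int(np.argmin(np.abs(c_grid - best_c)))

print('Grid:', c_grid)
print('Index of the selected C:', idx)
print('Exponent of the selected C:', np.log10(c_grid[idx]))
```

A log-spaced grid covers several orders of magnitude with few points, which suits regularization strengths whose effect is roughly multiplicative.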


In [None]:
dtc_pipe = make_pipeline(transformer, DecisionTreeClassifier(random_state=RANDOM))
dtc_params_grid = {'decisiontreeclassifier__min_samples_split': range(2, 200, 5),
                   'decisiontreeclassifier__criterion': ['gini', 'entropy'],
                   'decisiontreeclassifier__max_depth': range(1, 35),
                   'decisiontreeclassifier__class_weight': ['balanced', None],
                   # Note: 'auto' (same as 'sqrt' for classifiers) was removed in scikit-learn 1.3
                   'decisiontreeclassifier__max_features': ['auto', None, 'log2']}
dtc_grid = RandomizedSearchCV(dtc_pipe, dtc_params_grid, scoring='f1', random_state=RANDOM)
dtc_grid.fit(X_train, y_train)
print('Best parameters:', dtc_grid.best_params_)
print('Cross-validation F1 score:', dtc_grid.best_score_)
print('Decision tree F1 score on the test set:', dtc_grid.score(X_test, y_test))
res['Decision tree'] = dtc_grid.score(X_test, y_test)

Best parameters: {'decisiontreeclassifier__min_samples_split': 152, 'decisiontreeclassifier__max_features': None, 'decisiontreeclassifier__max_depth': 27, 'decisiontreeclassifier__criterion': 'entropy', 'decisiontreeclassifier__class_weight': 'balanced'}
Cross-validation F1 score: 0.8660302779133016
Decision tree F1 score on the test set: 0.8374384236453202
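
It helps to keep in mind how little of this search space a randomized search visits. The sketch below counts the combinations in the decision tree grid and compares that with 10 sampled candidates, which is the `RandomizedSearchCV` default for `n_iter` (the notebook does not set it). The dictionary is a copy of the grid with shortened, illustrative key names.

```python
from math import prod

# Candidate values copied from dtc_params_grid (key prefixes dropped for brevity)
dtc_space = {
    'min_samples_split': range(2, 200, 5),
    'criterion': ['gini', 'entropy'],
    'max_depth': range(1, 35),
    'class_weight': ['balanced', None],
    'max_features': ['auto', None, 'log2'],
}

# Size of the full grid is the product of the number of options per parameter
n_combinations = prod(len(values) for values in dtc_space.values())
n_sampled = 10  # default n_iter of RandomizedSearchCV

print('Combinations in the grid:', n_combinations)
print('Share explored by the search: {:.4%}'.format(n_sampled / n_combinations))
```

Passing a larger `n_iter` would trade longer fitting time for broader coverage; whether that changes the selected model here is not something the notebook shows.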


In [None]:
rfc_pipe = make_pipeline(transformer, RandomForestClassifier(random_state=RANDOM))
rfc_params_grid = {'randomforestclassifier__min_samples_split': range(2, 100, 5),
                   'randomforestclassifier__n_estimators': range(50, 200, 5),
                   'randomforestclassifier__criterion': ['gini', 'entropy'],
                   'randomforestclassifier__max_depth': range(1, 35),
                   'randomforestclassifier__class_weight': ['balanced', None],
                   # Note: 'auto' (same as 'sqrt' for classifiers) was removed in scikit-learn 1.3
                   'randomforestclassifier__max_features': ['auto', None, 'log2']}
rfc_grid = RandomizedSearchCV(rfc_pipe, rfc_params_grid, scoring='f1', random_state=RANDOM)
rfc_grid.fit(X_train, y_train)
print('Best parameters:', rfc_grid.best_params_)
print('Cross-validation F1 score:', rfc_grid.best_score_)
print('Random forest F1 score on the test set:', rfc_grid.score(X_test, y_test))
res['Random forest'] = rfc_grid.score(X_test, y_test)

Best parameters: {'randomforestclassifier__n_estimators': 185, 'randomforestclassifier__min_samples_split': 87, 'randomforestclassifier__max_features': 'auto', 'randomforestclassifier__max_depth': 9, 'randomforestclassifier__criterion': 'gini', 'randomforestclassifier__class_weight': None}
Cross-validation F1 score: 0.8912284655680882
Random forest F1 score on the test set: 0.8699551569506726


In [None]:
svc_pipe = make_pipeline(transformer, svm.SVC(random_state=RANDOM))
svc_params_grid = {'svc__C': np.logspace(-4, 2, 20),
                   'svc__gamma': ['scale', 'auto'],
                   'svc__class_weight': ['balanced', None],
                   'svc__kernel': ['linear', 'rbf', 'poly', 'sigmoid']}
svc_grid = RandomizedSearchCV(svc_pipe, svc_params_grid, scoring='f1', random_state=RANDOM)
svc_grid.fit(X_train, y_train)
print('Best parameters:', svc_grid.best_params_)
print('Cross-validation F1 score:', svc_grid.best_score_)
print('Support vector machine F1 score on the test set:', svc_grid.score(X_test, y_test))
res['Support vector machine'] = svc_grid.score(X_test, y_test)

Best parameters: {'svc__kernel': 'poly', 'svc__gamma': 'scale', 'svc__class_weight': None, 'svc__C': 0.29763514416313164}
Cross-validation F1 score: 0.8703089998374847
Support vector machine F1 score on the test set: 0.8390243902439025


In [None]:
res

{'Logistic regression': 0.8571428571428572,
 'Decision tree': 0.8374384236453202,
 'Random forest': 0.8699551569506726,
 'Support vector machine': 0.8390243902439025}
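
The scores are easier to compare when ranked. The sketch below copies the four test-set F1 values from the output above into a standalone dictionary (model names are given in English here) and sorts them from best to worst.

```python
# Test-set F1 scores copied from the results above
scores = {
    'Logistic regression': 0.8571428571428572,
    'Decision tree': 0.8374384236453202,
    'Random forest': 0.8699551569506726,
    'Support vector machine': 0.8390243902439025,
}

# Sort the models from the highest to the lowest test F1
ranking = sorted(scores.items(), key=lambda item: item[1], reverse=True)
for name, score in ranking:
    print(f'{name:<24}{score:.4f}')

best_name, best_score = ranking[0]
spread = ranking[0][1] - ranking[-1][1]
print('Best model:', best_name)
print(f'Gap between best and worst: {spread:.4f}')
```

The random forest comes out on top, which is presumably why it is the model inspected in the next cell. The gap between the best and worst model is only about 0.03, so the ranking should be read with some caution on a test set of this size.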

In [None]:
from sklearn.metrics import confusion_matrix
y_pred = rfc_grid.predict(X_test)
confusion_matrix(y_test, y_pred)

array([[81, 16],
       [13, 97]])
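
The confusion matrix can be tied back to the F1 score reported for the random forest. In scikit-learn's layout rows are true classes and columns are predicted classes, so with labels 0 and 1 the matrix reads `[[TN, FP], [FN, TP]]`. Class 1 is the positive class for `scoring='f1'`, and given the sorted category codes it should correspond to the `-` label. The sketch below recomputes the metrics by hand from the numbers above.

```python
# Confusion matrix from the output above: rows are true classes, columns are predictions
cm = [[81, 16],
      [13, 97]]

tn, fp = cm[0]
fn, tp = cm[1]

precision = tp / (tp + fp)   # share of predicted positives that are correct
recall = tp / (tp + fn)      # share of actual positives that were found
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tn + fp + fn + tp)

print(f'precision={precision:.4f} recall={recall:.4f} f1={f1:.4f} accuracy={accuracy:.4f}')
```

The hand-computed F1 agrees with the 0.8699551569506726 stored in `res` for the random forest, and the four cells sum to the 207 rows of the test set.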
