<a name="Оглавление"></a>

# Table of contents
1. [Data preparation](#Шаг_1)
2. [Exploring the task](#Шаг_2)
3. [Handling class imbalance](#Шаг_3)
4. [Model testing](#Шаг_4)

<a name="Шаг_1"></a>

# 1. Data preparation

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OrdinalEncoder
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.metrics import confusion_matrix
from sklearn.metrics import recall_score
from sklearn.metrics import precision_score
from sklearn.utils import shuffle
from sklearn.metrics import roc_auc_score

In [5]:
data = pd.read_csv('Churn.csv')

In [6]:
data.shape

(10000, 14)

In [7]:
data.dtypes

RowNumber            int64
CustomerId           int64
Surname             object
CreditScore          int64
Geography           object
Gender              object
Age                  int64
Tenure             float64
Balance            float64
NumOfProducts        int64
HasCrCard            int64
IsActiveMember       int64
EstimatedSalary    float64
Exited               int64
dtype: object

In [8]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 14 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   RowNumber        10000 non-null  int64  
 1   CustomerId       10000 non-null  int64  
 2   Surname          10000 non-null  object 
 3   CreditScore      10000 non-null  int64  
 4   Geography        10000 non-null  object 
 5   Gender           10000 non-null  object 
 6   Age              10000 non-null  int64  
 7   Tenure           9091 non-null   float64
 8   Balance          10000 non-null  float64
 9   NumOfProducts    10000 non-null  int64  
 10  HasCrCard        10000 non-null  int64  
 11  IsActiveMember   10000 non-null  int64  
 12  EstimatedSalary  10000 non-null  float64
 13  Exited           10000 non-null  int64  
dtypes: float64(3), int64(8), object(3)
memory usage: 1.1+ MB


Only Tenure (the number of properties the customer owns) has missing values: 909 of 10,000 rows.

In [9]:
data.head()

| | RowNumber | CustomerId | Surname | CreditScore | Geography | Gender | Age | Tenure | Balance | NumOfProducts | HasCrCard | IsActiveMember | EstimatedSalary | Exited |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 15634602 | Hargrave | 619 | France | Female | 42 | 2.0 | 0.0 | 1 | 1 | 1 | 101348.88 | 1 |
| 1 | 2 | 15647311 | Hill | 608 | Spain | Female | 41 | 1.0 | 83807.86 | 1 | 0 | 1 | 112542.58 | 0 |
| 2 | 3 | 15619304 | Onio | 502 | France | Female | 42 | 8.0 | 159660.8 | 3 | 1 | 0 | 113931.57 | 1 |
| 3 | 4 | 15701354 | Boni | 699 | France | Female | 39 | 1.0 | 0.0 | 2 | 0 | 0 | 93826.63 | 0 |
| 4 | 5 | 15737888 | Mitchell | 850 | Spain | Female | 43 | 2.0 | 125510.82 | 1 | 1 | 1 | 79084.1 | 0 |


In [10]:
columns = data.columns

In [11]:
for column in columns:
    display(column)
    display(data[column].describe())

'RowNumber'

count    10000.00000
mean      5000.50000
std       2886.89568
min          1.00000
25%       2500.75000
50%       5000.50000
75%       7500.25000
max      10000.00000
Name: RowNumber, dtype: float64

'CustomerId'

count    1.000000e+04
mean     1.569094e+07
std      7.193619e+04
min      1.556570e+07
25%      1.562853e+07
50%      1.569074e+07
75%      1.575323e+07
max      1.581569e+07
Name: CustomerId, dtype: float64

'Surname'

count     10000
unique     2932
top       Smith
freq         32
Name: Surname, dtype: object

'CreditScore'

count    10000.000000
mean       650.528800
std         96.653299
min        350.000000
25%        584.000000
50%        652.000000
75%        718.000000
max        850.000000
Name: CreditScore, dtype: float64

'Geography'

count      10000
unique         3
top       France
freq        5014
Name: Geography, dtype: object

'Gender'

count     10000
unique        2
top        Male
freq       5457
Name: Gender, dtype: object

'Age'

count    10000.000000
mean        38.921800
std         10.487806
min         18.000000
25%         32.000000
50%         37.000000
75%         44.000000
max         92.000000
Name: Age, dtype: float64

'Tenure'

count    9091.000000
mean        4.997690
std         2.894723
min         0.000000
25%         2.000000
50%         5.000000
75%         7.000000
max        10.000000
Name: Tenure, dtype: float64

'Balance'

count     10000.000000
mean      76485.889288
std       62397.405202
min           0.000000
25%           0.000000
50%       97198.540000
75%      127644.240000
max      250898.090000
Name: Balance, dtype: float64

'NumOfProducts'

count    10000.000000
mean         1.530200
std          0.581654
min          1.000000
25%          1.000000
50%          1.000000
75%          2.000000
max          4.000000
Name: NumOfProducts, dtype: float64

'HasCrCard'

count    10000.00000
mean         0.70550
std          0.45584
min          0.00000
25%          0.00000
50%          1.00000
75%          1.00000
max          1.00000
Name: HasCrCard, dtype: float64

'IsActiveMember'

count    10000.000000
mean         0.515100
std          0.499797
min          0.000000
25%          0.000000
50%          1.000000
75%          1.000000
max          1.000000
Name: IsActiveMember, dtype: float64

'EstimatedSalary'

count     10000.000000
mean     100090.239881
std       57510.492818
min          11.580000
25%       51002.110000
50%      100193.915000
75%      149388.247500
max      199992.480000
Name: EstimatedSalary, dtype: float64

'Exited'

count    10000.000000
mean         0.203700
std          0.402769
min          0.000000
25%          0.000000
50%          0.000000
75%          0.000000
max          1.000000
Name: Exited, dtype: float64

RowNumber (row index in the data): int64, no issues. Will be dropped.

CustomerId (unique customer identifier): int64, no issues. Will be dropped.

Surname: object, 2,932 unique surnames, no issues. Will be dropped.

CreditScore (credit score): int64, no issues.

In [12]:
data.Geography.value_counts()

France     5014
Germany    2509
Spain      2477
Name: Geography, dtype: int64

Geography (country of residence): object, 3 unique values, no issues. France is the most frequent value (about 50% of all rows).

In [13]:
data.Gender.value_counts()

Male      5457
Female    4543
Name: Gender, dtype: int64

Gender: object, 2 unique values, no issues. Male is the most frequent value. There is no notable imbalance.

In [14]:
data.Age.value_counts()

37    478
38    477
35    474
36    456
34    447
     ... 
92      2
88      1
82      1
85      1
83      1
Name: Age, Length: 70, dtype: int64

Age: int64, ranges from 18 to 92, no issues.

In [15]:
data.Tenure.value_counts()

1.0     952
2.0     950
8.0     933
3.0     928
5.0     927
7.0     925
4.0     885
9.0     882
6.0     881
10.0    446
0.0     382
Name: Tenure, dtype: int64

Tenure (the number of properties the customer owns): I will replace the missing values with 0 and change the type from float64 to int64.

In [16]:
data.Tenure = data.Tenure.fillna(0)  # fill with the number 0, not the string '0'

In [17]:
data.Tenure = data.Tenure.astype('int')
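The fill-and-cast step can be illustrated on a toy column. This is a minimal sketch with made-up values (the Series `tenure` is hypothetical, not the notebook's data); it only shows that filling with the numeric value 0 keeps the column numeric, so the cast to int64 is straightforward.

```python
import numpy as np
import pandas as pd

# Hypothetical toy column with one missing value
tenure = pd.Series([2.0, np.nan, 8.0, 1.0], name="Tenure")

# Fill the gap with the number 0, then cast float64 -> int64
tenure_filled = tenure.fillna(0).astype("int64")

print(tenure_filled.tolist())
print(tenure_filled.dtype)
```

Filling with 0 treats a missing value as a genuine zero. Filling with the median is a common alternative; it is mentioned here only as an option, not as what this notebook does.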

Balance (account balance): float64, no issues.

In [18]:
data.NumOfProducts.value_counts()

1    5084
2    4590
3     266
4      60
Name: NumOfProducts, dtype: int64

NumOfProducts (number of bank products the customer uses): int64, no issues.

In [19]:
data.HasCrCard.value_counts()

1    7055
0    2945
Name: HasCrCard, dtype: int64

HasCrCard (whether the customer has a credit card): int64, no issues.

In [20]:
data.IsActiveMember.value_counts()

1    5151
0    4849
Name: IsActiveMember, dtype: int64

IsActiveMember (customer activity flag): int64, no issues.

EstimatedSalary: float64, no issues.

In [21]:
data.Exited.value_counts()

0    7963
1    2037
Name: Exited, dtype: int64

Exited (whether the customer left): int64, no issues. The classes are imbalanced: 2,037 of 10,000 rows (about 20%) are positive.

In [22]:
data_red = data.drop(['RowNumber', 'CustomerId', 'Surname'], axis = 1)

The row number, customer ID, and surname are not features that influence whether a customer leaves.

In [23]:
data_red.shape

(10000, 11)

In [24]:
data_red = pd.get_dummies(data_red, drop_first=True)

In [25]:
data_red.shape

(10000, 12)
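The shape change from 11 to 12 columns follows from how `drop_first=True` works: each categorical column with k categories becomes k - 1 indicator columns. A minimal sketch on a hypothetical toy frame (the values are made up for illustration):

```python
import pandas as pd

# Hypothetical toy frame with the same two categorical columns
toy = pd.DataFrame({
    "Geography": ["France", "Spain", "Germany", "France"],
    "Gender": ["Female", "Male", "Male", "Female"],
    "Age": [42, 41, 39, 43],
})

# drop_first=True drops the first category of each column (in sorted order),
# so 3 countries give 2 columns and 2 genders give 1 column
encoded = pd.get_dummies(toy, drop_first=True)

print(list(encoded.columns))
```

In the notebook the same arithmetic applies: 11 columns, minus the 2 categorical ones, plus 2 for Geography and 1 for Gender, gives 12.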

In [26]:
features = data_red.drop(['Exited'], axis = 1)

In [27]:
target = data_red['Exited']

In [28]:
features_train, features_valid, target_train, target_valid = train_test_split(
    features, target, test_size=0.40, random_state=12345)

In [29]:
features_valid, features_test, target_valid, target_test = train_test_split(
    features_valid, target_valid, test_size=0.40, random_state=12345)

In [30]:
scaler = StandardScaler()

In [31]:
numeric = ['CreditScore', 'Age', 'Tenure', 'Balance', 'NumOfProducts', 'EstimatedSalary']

In [32]:
scaler.fit(features_train[numeric])

StandardScaler()

In [33]:
# Work on an explicit copy so the assignment does not trigger SettingWithCopyWarning
features_train = features_train.copy()
features_train[numeric] = scaler.transform(features_train[numeric])

In [34]:
features_valid = features_valid.copy()
features_valid[numeric] = scaler.transform(features_valid[numeric])

In [35]:
features_test = features_test.copy()
features_test[numeric] = scaler.transform(features_test[numeric])

I split the data into three sets: training (60%), validation (24%), and test (16%).
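The proportions follow from applying `test_size=0.40` twice. A minimal sketch on hypothetical toy data (1,000 rows instead of 10,000; the names `X` and `y` are made up for illustration):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 1000 rows, one feature, binary target
X = pd.DataFrame({"x": np.arange(1000)})
y = pd.Series(np.arange(1000) % 5 == 0).astype(int)

# First split: 60% for training, 40% held out
X_train, X_rest, y_train, y_rest = train_test_split(
    X, y, test_size=0.40, random_state=12345)

# Second split of the held-out 40%: 60% of it for validation, 40% for test
X_valid, X_test, y_valid, y_test = train_test_split(
    X_rest, y_rest, test_size=0.40, random_state=12345)

print(len(X_train), len(X_valid), len(X_test))
```

Passing `stratify=` to `train_test_split` would keep the class share equal across the sets. The notebook does not do this, so it is mentioned only as an option.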

I avoided the dummy-variable trap (drop_first=True) and scaled the numeric features.
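The scaler is fitted on the training set only and then applied to the other sets, so no statistics leak from validation or test data. A minimal sketch with hypothetical toy values:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical toy sets with a single numeric feature
train = pd.DataFrame({"Age": [20.0, 30.0, 40.0]})
valid = pd.DataFrame({"Age": [30.0, 50.0]})

scaler = StandardScaler()
scaler.fit(train[["Age"]])  # mean and std come from the training set only

train_scaled = scaler.transform(train[["Age"]])
valid_scaled = scaler.transform(valid[["Age"]])  # reuses the training statistics

print(scaler.mean_[0])
print(valid_scaled.ravel())
```

The validation value 30 maps to 0 because it equals the training mean, even though it is not the mean of the validation set itself.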

[Home](#Оглавление)

<a name="Шаг_2"></a>

# 2. Exploring the task

Building models on the data as is, without addressing the class imbalance

#### RandomForestClassifier

In [36]:
model = RandomForestClassifier(random_state=12345)

In [37]:
model.fit(features_train, target_train)

RandomForestClassifier(random_state=12345)

In [38]:
predicted_valid = model.predict(features_valid)

In [39]:
f1_score(target_valid, predicted_valid)

0.5678073510773131

In [40]:
confusion_matrix(target_valid, predicted_valid)

array([[1835,   72],
       [ 269,  224]], dtype=int64)

In [41]:
recall_score(target_valid, predicted_valid)

0.4543610547667343

In [42]:
precision_score(target_valid, predicted_valid)

0.7567567567567568

In [43]:
probabilities_valid = model.predict_proba(features_valid)

In [44]:
probabilities_one_valid = probabilities_valid[:, 1]

In [45]:
auc_roc = roc_auc_score(target_valid, probabilities_one_valid)

In [46]:
auc_roc

0.8442250234270878

The model does better than random guessing (AUC-ROC is above 0.5).
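The 0.5 reference point can be checked directly. In this minimal sketch (toy labels and scores, made up for illustration) a model that gives every object the same score gets AUC-ROC 0.5, and a model that ranks every positive above every negative gets 1.0:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

y_true = np.array([0, 0, 0, 1, 1])

# Identical scores for all objects: the model cannot rank the classes
constant_scores = np.full(len(y_true), 0.3)
auc_constant = roc_auc_score(y_true, constant_scores)

# Every positive scored above every negative: perfect ranking
perfect_scores = np.array([0.1, 0.2, 0.3, 0.8, 0.9])
auc_perfect = roc_auc_score(y_true, perfect_scores)

print(auc_constant, auc_perfect)
```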

#### DecisionTreeClassifier

In [47]:
model = DecisionTreeClassifier(random_state=12345)

In [48]:
model.fit(features_train, target_train)

DecisionTreeClassifier(random_state=12345)

In [49]:
predicted_valid = model.predict(features_valid)

In [50]:
f1_score(target_valid, predicted_valid)

0.47336065573770497

In [51]:
confusion_matrix(target_valid, predicted_valid)

array([[1655,  252],
       [ 262,  231]], dtype=int64)

In [52]:
recall_score(target_valid, predicted_valid)

0.4685598377281947

In [53]:
precision_score(target_valid, predicted_valid)

0.4782608695652174

In [54]:
probabilities_valid = model.predict_proba(features_valid)

In [55]:
probabilities_one_valid = probabilities_valid[:, 1]

In [56]:
auc_roc = roc_auc_score(target_valid, probabilities_one_valid)

In [57]:
auc_roc

0.6682075538929385

The AUC-ROC value is low.

#### LogisticRegression

In [58]:
model = LogisticRegression(random_state=12345, solver='liblinear')

In [59]:
model.fit(features_train, target_train)

LogisticRegression(random_state=12345, solver='liblinear')

In [60]:
predicted_valid = model.predict(features_valid)

In [61]:
f1_score(target_valid, predicted_valid)

0.31837916063675836

In [62]:
confusion_matrix(target_valid, predicted_valid)

array([[1819,   88],
       [ 383,  110]], dtype=int64)

In [63]:
recall_score(target_valid, predicted_valid)

0.2231237322515213

In [64]:
precision_score(target_valid, predicted_valid)

0.5555555555555556

In [65]:
probabilities_valid = model.predict_proba(features_valid)

In [66]:
probabilities_one_valid = probabilities_valid[:, 1]

In [67]:
auc_roc = roc_auc_score(target_valid, probabilities_one_valid)

In [68]:
auc_roc

0.75240466690989

The model does better than random guessing (AUC-ROC is above 0.5).

I trained three models and reported F1, the confusion matrix, recall, and precision for each. Next I will tune the hyperparameters, using only F1 to compare models.

Baseline F1 scores:

    RandomForestClassifier 0.568

    DecisionTreeClassifier 0.473

    LogisticRegression 0.318
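F1 combines precision and recall, and all three can be recomputed from a confusion matrix. The sketch below uses the matrix printed for the baseline RandomForestClassifier earlier in this section; only the variable names are new.

```python
import numpy as np

# Confusion matrix of the baseline RandomForestClassifier on the validation set:
# rows are true classes, columns are predicted classes
cm = np.array([[1835, 72],
               [269, 224]])
tn, fp, fn, tp = cm.ravel()

precision = tp / (tp + fp)   # share of predicted churners that really left
recall = tp / (tp + fn)      # share of real churners that were found
f1 = 2 * precision * recall / (precision + recall)

print(round(precision, 3), round(recall, 3), round(f1, 3))
```

The values agree with the `precision_score`, `recall_score`, and `f1_score` outputs shown for that model.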






#### RandomForestClassifier

In [69]:
count = 0
name = 0
for depth in range(1, 100, 1):
    model = RandomForestClassifier(n_estimators = 10, max_depth = depth, 
                                   random_state=12345)
    model.fit(features_train, target_train)
    predicted_valid = model.predict(features_valid)
    score = f1_score(target_valid, predicted_valid)
    if score > count:
        count = score
        name = depth
        print(name, count)    
    else:
        continue

2 0.18214936247723132
3 0.2359154929577465
4 0.4608567208271787
6 0.5254470426409904
8 0.536986301369863
9 0.5613577023498695
12 0.572139303482587


Best value: max_depth = 12

In [70]:
count = 0
name = 0
for estim in range(1, 100, 1):
    model = RandomForestClassifier(n_estimators = estim, max_depth = 12, 
                                   random_state=12345)
    model.fit(features_train, target_train)
    predicted_valid = model.predict(features_valid)
    score = f1_score(target_valid, predicted_valid)
    if score > count:
        count = score
        name = estim
        print(name, count)    
    else:
        continue

1 0.5184404636459432
3 0.5598086124401913
7 0.5682656826568265
8 0.5693069306930693
9 0.5806451612903227
40 0.582051282051282
43 0.582798459563543


Best value: n_estimators = 43

In [71]:
model_RandomForestClassifier = RandomForestClassifier(n_estimators = 43, max_depth = 12, random_state=12345)
model_RandomForestClassifier.fit(features_train, target_train)
predicted_valid = model_RandomForestClassifier.predict(features_valid)
f1_score(target_valid, predicted_valid)

0.582798459563543

In [72]:
probabilities_valid = model_RandomForestClassifier.predict_proba(features_valid)

In [73]:
probabilities_one_valid = probabilities_valid[:, 1]

In [74]:
auc_roc = roc_auc_score(target_valid, probabilities_one_valid)

In [75]:
auc_roc

0.844044201410199

The model does better than random guessing (AUC-ROC is above 0.5).

The best F1 of RandomForestClassifier rose from 0.568 to 0.583.

#### DecisionTreeClassifier

In [76]:
count = 0
name = 0
for depth in range(1, 100, 1):
    model = DecisionTreeClassifier(max_depth = depth, 
                                   random_state=12345)
    model.fit(features_train, target_train)
    predicted_valid = model.predict(features_valid)
    score = f1_score(target_valid, predicted_valid)
    if score > count:
        count = score
        name = depth
        print(name, count)    
    else:
        continue

2 0.5011547344110855
4 0.5341935483870968
6 0.5621761658031088
9 0.5703971119133574


Best value: max_depth = 9

In [77]:
model_DecisionTreeClassifier = DecisionTreeClassifier(max_depth = 9, random_state=12345)
model_DecisionTreeClassifier.fit(features_train, target_train)
predicted_valid = model_DecisionTreeClassifier.predict(features_valid)
f1_score(target_valid, predicted_valid)

0.5703971119133574

In [78]:
probabilities_valid = model_DecisionTreeClassifier.predict_proba(features_valid)

In [79]:
probabilities_one_valid = probabilities_valid[:, 1]

In [80]:
auc_roc = roc_auc_score(target_valid, probabilities_one_valid)

In [81]:
auc_roc

0.7950962132678687

The model does better than random guessing (AUC-ROC is above 0.5).

The best F1 of DecisionTreeClassifier rose from 0.473 to 0.570.

#### LogisticRegression

In [82]:
model_LogisticRegression = LogisticRegression(random_state=12345, solver='liblinear')
model_LogisticRegression.fit(features_train, target_train)
predicted_valid = model_LogisticRegression.predict(features_valid)
f1_score(target_valid, predicted_valid)

0.31837916063675836

In [83]:
probabilities_valid = model_LogisticRegression.predict_proba(features_valid)

In [84]:
probabilities_one_valid = probabilities_valid[:, 1]

In [85]:
auc_roc = roc_auc_score(target_valid, probabilities_one_valid)

In [86]:
auc_roc

0.75240466690989

The model does better than random guessing (AUC-ROC is above 0.5).

The F1 of LogisticRegression is 0.318.

I did not write a helper function for the hyperparameter search because of the class_weight parameter. For this number of repetitions, copying the code seemed simpler.
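For comparison, a helper that forwards extra keyword arguments to the model constructor would handle `class_weight` without special cases. This is a minimal sketch, not the notebook's method: the function name `search_param` and the synthetic data from `make_classification` are assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier


def search_param(model_cls, param_name, values, X_train, y_train,
                 X_valid, y_valid, **fixed_params):
    """Return (best_value, best_f1) for one hyperparameter on the validation set.

    Extra keyword arguments (for example class_weight='balanced') are passed
    to the model constructor unchanged.
    """
    best_value, best_f1 = None, -1.0
    for value in values:
        model = model_cls(**{param_name: value}, **fixed_params)
        model.fit(X_train, y_train)
        score = f1_score(y_valid, model.predict(X_valid))
        if score > best_f1:
            best_value, best_f1 = value, score
    return best_value, best_f1


# Synthetic imbalanced data stands in for the notebook's features and target
X, y = make_classification(n_samples=500, weights=[0.8, 0.2], random_state=12345)
X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.4, random_state=12345)

best_depth, best_f1 = search_param(
    DecisionTreeClassifier, "max_depth", range(1, 11),
    X_tr, y_tr, X_va, y_va,
    random_state=12345, class_weight="balanced")
print(best_depth, best_f1)
```

Returning the best value instead of printing it also lets the next cell reuse it directly, which avoids copying a number by hand from the loop output.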

[Home](#Оглавление)

<a name="Шаг_3"></a>

# 3. Handling class imbalance

### class_weight='balanced' 

#### RandomForestClassifier

In [87]:
count = 0
name = 0
for depth in range(1, 100, 1):
    model = RandomForestClassifier(n_estimators = 10, max_depth = depth, 
                                   random_state=12345, class_weight='balanced')
    model.fit(features_train, target_train)
    predicted_valid = model.predict(features_valid)
    score = f1_score(target_valid, predicted_valid)
    if score > count:
        count = score
        name = depth
        print(name, count)    
    else:
        continue

1 0.5022999080036799
2 0.556350626118068
3 0.5580286168521462
4 0.5814360770577934
5 0.5836120401337792
6 0.5850694444444444
7 0.6104702750665483


The search output above peaks at max_depth = 7 (F1 ≈ 0.610); the following cells use max_depth = 10.

In [88]:
count = 0
name = 0
for estim in range(1, 100, 1):
    model = RandomForestClassifier(n_estimators = estim, max_depth = 10, 
                                   random_state=12345, class_weight='balanced')
    model.fit(features_train, target_train)
    predicted_valid = model.predict(features_valid)
    score = f1_score(target_valid, predicted_valid)
    if score > count:
        count = score
        name = estim
        print(name, count)    
    else:
        continue

1 0.4875471698113207
2 0.5055350553505535
3 0.5626134301270418
4 0.5659655831739963
5 0.5780346820809248
6 0.5919370698131761
7 0.594
9 0.5963488843813387
28 0.6
48 0.6016260162601627
49 0.603238866396761
50 0.6050761421319797
51 0.6091370558375635
52 0.6105476673427992
55 0.612410986775178


The search output above peaks at n_estimators = 55 (F1 ≈ 0.612); the model below uses n_estimators = 36.

In [89]:
model_RandomForestClassifier = RandomForestClassifier(n_estimators = 36, max_depth = 10, 
                                                      random_state=12345, class_weight='balanced')
model_RandomForestClassifier.fit(features_train, target_train)
predicted_valid = model_RandomForestClassifier.predict(features_valid)
print(f1_score(target_valid, predicted_valid))

0.598974358974359


In [90]:
probabilities_valid = model_RandomForestClassifier.predict_proba(features_valid)

In [91]:
probabilities_one_valid = probabilities_valid[:, 1]

In [92]:
auc_roc = roc_auc_score(target_valid, probabilities_one_valid)

In [93]:
auc_roc

0.8488647036486692

The model does better than random guessing (AUC-ROC is above 0.5).

The best F1 reached by RandomForestClassifier with class_weight='balanced' during the search is 0.612 (the model refitted above scores 0.599). This is slightly higher than the tuned value of 0.583 and the baseline of 0.568.

#### DecisionTreeClassifier

In [94]:
count = 0
name = 0
for depth in range(1, 100, 1):
    model = DecisionTreeClassifier(max_depth = depth, 
                                   random_state=12345, class_weight='balanced')
    model.fit(features_train, target_train)
    predicted_valid = model.predict(features_valid)
    score = f1_score(target_valid, predicted_valid)
    if score > count:
        count = score
        name = depth
        print(name, count)    
    else:
        continue

1 0.493103448275862
2 0.531897265948633
5 0.5927927927927927


Best value: max_depth = 5

In [95]:
model_DecisionTreeClassifier = DecisionTreeClassifier(max_depth = 5, 
                                   random_state=12345, class_weight='balanced')
model_DecisionTreeClassifier.fit(features_train, target_train)
predicted_valid = model_DecisionTreeClassifier.predict(features_valid)
print(f1_score(target_valid, predicted_valid))

0.5927927927927927


In [96]:
probabilities_valid = model_DecisionTreeClassifier.predict_proba(features_valid)

In [97]:
probabilities_one_valid = probabilities_valid[:, 1]

In [98]:
auc_roc = roc_auc_score(target_valid, probabilities_one_valid)

In [99]:
auc_roc

0.8327614393858007

The best F1 achieved by DecisionTreeClassifier with class_weight='balanced' is 0.593, higher than the tuned value of 0.570 and the baseline of 0.473.

#### LogisticRegression

In [100]:
model_LogisticRegression = LogisticRegression(random_state=12345, solver='liblinear', class_weight='balanced')
model_LogisticRegression.fit(features_train, target_train)
predicted_valid = model_LogisticRegression.predict(features_valid)
f1_score(target_valid, predicted_valid)

0.47804175665946724

In [101]:
probabilities_valid = model_LogisticRegression.predict_proba(features_valid)

In [102]:
probabilities_one_valid = probabilities_valid[:, 1]

In [103]:
auc_roc = roc_auc_score(target_valid, probabilities_one_valid)

In [104]:
auc_roc

0.7574847019255417

The model does better than random guessing (AUC-ROC is above 0.5).

The best F1 achieved by LogisticRegression with class_weight='balanced' is 0.478, higher than the earlier value of 0.318.
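What `class_weight='balanced'` does can be made concrete: each class gets the weight n_samples / (n_classes * class_count). The sketch below uses the class counts from `data.Exited.value_counts()` earlier in the notebook. The models themselves are fitted on the training split, so the weights they actually use may differ slightly from these.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Class counts from the full dataset: 7963 stayed, 2037 left
y = np.array([0] * 7963 + [1] * 2037)

# 'balanced' weights are n_samples / (n_classes * count_of_class)
weights = compute_class_weight("balanced", classes=np.array([0, 1]), y=y)

print(weights.round(3))
```

The rare class gets a weight roughly four times larger, so a mistake on a churner costs the model about as much as four mistakes on customers who stayed.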

### Upsampling

In [105]:
(data.Exited == 1).sum() / data.Exited.count()

0.2037

The classes are in a ratio of roughly 4 to 1, so the positive class is repeated 4 times below.

In [106]:
def upsample(features, target, repeat):
    features_0 = features[target == 0]
    features_1 = features[target == 1]
    target_0 = target[target == 0]
    target_1 = target[target == 1]
    features_upsampled = pd.concat([features_0] + [features_1] * repeat)
    target_upsampled = pd.concat([target_0] + [target_1] * repeat)
    features_upsampled, target_upsampled = shuffle(features_upsampled, target_upsampled, random_state=12345)
    return features_upsampled, target_upsampled

In [107]:
features_upsampled, target_upsampled = upsample(features_train, target_train, 4)
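The effect of `repeat=4` is easiest to see on a toy target with the same 4:1 ratio. In this minimal sketch the function repeats the notebook's logic in condensed form, and the toy data is hypothetical:

```python
import pandas as pd
from sklearn.utils import shuffle


def upsample(features, target, repeat):
    # Same logic as the notebook's function: repeat the positive class `repeat` times
    features_0, features_1 = features[target == 0], features[target == 1]
    target_0, target_1 = target[target == 0], target[target == 1]
    features_up = pd.concat([features_0] + [features_1] * repeat)
    target_up = pd.concat([target_0] + [target_1] * repeat)
    return shuffle(features_up, target_up, random_state=12345)


# Hypothetical toy data with a 4:1 class ratio
toy_features = pd.DataFrame({"x": range(10)})
toy_target = pd.Series([0] * 8 + [1] * 2)

up_features, up_target = upsample(toy_features, toy_target, 4)
print(up_target.value_counts().to_dict())
```

Only the training set is upsampled. The validation and test sets keep their original class ratio, so the metrics still reflect the real distribution.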

#### RandomForestClassifier

In [108]:
count = 0
name = 0
for depth in range(1, 100, 1):
    model = RandomForestClassifier(n_estimators = 10, max_depth = depth, 
                                   random_state=12345)
    model.fit(features_upsampled, target_upsampled)
    predicted_valid = model.predict(features_valid)
    score = f1_score(target_valid, predicted_valid)
    if score > count:
        count = score
        name = depth
        print(name, count)    
    else:
        continue

1 0.5022999080036799
2 0.5692307692307692
3 0.5737308622078969
4 0.5926558497011102
5 0.5945499587118084
8 0.5974025974025974
9 0.5996376811594203


Best value: max_depth = 9

In [109]:
count = 0
name = 0
for estim in range(1, 100, 1):
    model = RandomForestClassifier(n_estimators = estim, max_depth = 9, 
                                   random_state=12345)
    model.fit(features_upsampled, target_upsampled)
    predicted_valid = model.predict(features_valid)
    score = f1_score(target_valid, predicted_valid)
    if score > count:
        count = score
        name = estim
        print(name, count)    
    else:
        continue

1 0.5505804311774462
2 0.552901023890785
3 0.5789014821272885
5 0.5862676056338028
6 0.594449418084154
9 0.5985533453887885
10 0.5996376811594203
15 0.6032608695652174
33 0.6070460704607047
41 0.607843137254902
43 0.6087735004476276
44 0.611260053619303


Best value: n_estimators = 44

In [110]:
model_RandomForestClassifier = RandomForestClassifier(n_estimators = 44, max_depth = 9, 
                                                      random_state=12345)
model_RandomForestClassifier.fit(features_upsampled, target_upsampled)
predicted_valid = model_RandomForestClassifier.predict(features_valid)
print(f1_score(target_valid, predicted_valid))

0.611260053619303


In [111]:
probabilities_valid = model_RandomForestClassifier.predict_proba(features_valid)

In [112]:
probabilities_one_valid = probabilities_valid[:, 1]

In [113]:
auc_roc = roc_auc_score(target_valid, probabilities_one_valid)

In [114]:
auc_roc

0.8546084618321952

The model does better than random guessing (AUC-ROC is above 0.5).

The best F1 achieved by RandomForestClassifier with upsampling is 0.611, marginally below the 0.612 obtained with class_weight='balanced'.

#### DecisionTreeClassifier

In [115]:
count = 0
name = 0
for depth in range(1, 100, 1):
    model = DecisionTreeClassifier(max_depth = depth, 
                                   random_state=12345)
    model.fit(features_upsampled, target_upsampled)
    predicted_valid = model.predict(features_valid)
    score = f1_score(target_valid, predicted_valid)
    if score > count:
        count = score
        name = depth
        print(name, count)    
    else:
        continue

1 0.493103448275862
2 0.531897265948633
5 0.5927927927927927


Best value: max_depth = 5

In [116]:
model_DecisionTreeClassifier = DecisionTreeClassifier(max_depth = 5, 
                                   random_state=12345)
model_DecisionTreeClassifier.fit(features_upsampled, target_upsampled)
predicted_valid = model_DecisionTreeClassifier.predict(features_valid)
print(f1_score(target_valid, predicted_valid))

0.5927927927927927


In [117]:
probabilities_valid = model_DecisionTreeClassifier.predict_proba(features_valid)

In [118]:
probabilities_one_valid = probabilities_valid[:, 1]

In [119]:
auc_roc = roc_auc_score(target_valid, probabilities_one_valid)

In [120]:
auc_roc

0.8327614393858007

The model does better than random guessing (AUC-ROC is above 0.5).

The best F1 achieved by DecisionTreeClassifier with upsampling is 0.593.

#### LogisticRegression

In [121]:
model_LogisticRegression = LogisticRegression(random_state=12345, solver='liblinear')
model_LogisticRegression.fit(features_upsampled, target_upsampled)
predicted_valid = model_LogisticRegression.predict(features_valid)
f1_score(target_valid, predicted_valid)

0.4769452449567724

In [122]:
probabilities_valid = model_LogisticRegression.predict_proba(features_valid)

In [123]:
probabilities_one_valid = probabilities_valid[:, 1]

In [124]:
auc_roc = roc_auc_score(target_valid, probabilities_one_valid)

In [125]:
auc_roc

0.7573932272581745

The model does better than random guessing (AUC-ROC is above 0.5).

The best F1 achieved by LogisticRegression with upsampling is 0.477.

### Downsampling

In [126]:
def downsample(features, target, fraction):
    features_zeros = features[target == 0]
    features_ones = features[target == 1]
    target_zeros = target[target == 0]
    target_ones = target[target == 1]
    features_downsampled = pd.concat([features_zeros.sample(frac=fraction, random_state=12345)] + [features_ones])
    target_downsampled = pd.concat([target_zeros.sample(frac=fraction, random_state=12345)] + [target_ones])
    features_downsampled, target_downsampled = shuffle(features_downsampled, target_downsampled, random_state=12345)
    return features_downsampled, target_downsampled

In [127]:
features_downsampled, target_downsampled = downsample(features_train, target_train, 0.2)

In [128]:
features_downsampled.shape

(2157, 11)

In [129]:
target_downsampled.shape

(2157,)
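For reference, the fraction that would leave both classes the same size is n_pos / n_neg. A minimal sketch on hypothetical toy counts (not the notebook's data):

```python
import pandas as pd

# Hypothetical toy target: 800 negatives, 200 positives
toy_target = pd.Series([0] * 800 + [1] * 200)

n_neg = (toy_target == 0).sum()
n_pos = (toy_target == 1).sum()

# Fraction of negatives to keep so that both classes end up the same size
balancing_fraction = n_pos / n_neg

kept_neg = toy_target[toy_target == 0].sample(
    frac=balancing_fraction, random_state=12345)
print(balancing_fraction, len(kept_neg), n_pos)
```

With a class ratio of about 4 to 1, a fraction of 0.2 keeps somewhat fewer negatives than there are positives, so the downsampled set leans slightly toward the positive class.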

#### RandomForestClassifier

In [130]:
count = 0
name = 0
for depth in range(1, 100, 1):
    model = RandomForestClassifier(n_estimators = 10, max_depth = depth, 
                                   random_state=12345)
    model.fit(features_downsampled, target_downsampled)
    predicted_valid = model.predict(features_valid)
    score = f1_score(target_valid, predicted_valid)
    if score > count:
        count = score
        name = depth
        print(name, count)    
    else:
        continue

1 0.37031310398144573
2 0.47130242825607066
3 0.5214564369310793
4 0.5853658536585366


Best value: max_depth = 4

In [131]:
count = 0
name = 0
for estim in range(1, 100, 1):
    model = RandomForestClassifier(n_estimators = estim, max_depth = 4, 
                                   random_state=12345)
    model.fit(features_downsampled, target_downsampled)
    predicted_valid = model.predict(features_valid)
    score = f1_score(target_valid, predicted_valid)
    if score > count:
        count = score
        name = estim
        print(name, count)    
    else:
        continue

1 0.5386819484240688
3 0.55
6 0.5621621621621622
8 0.5678466076696165
9 0.5836516424751719
10 0.5853658536585366


Best value: n_estimators = 10

In [132]:
model_RandomForestClassifier = RandomForestClassifier(n_estimators = 10, max_depth = 4, 
                                                      random_state=12345)
model_RandomForestClassifier.fit(features_downsampled, target_downsampled)
predicted_valid = model_RandomForestClassifier.predict(features_valid)
print(f1_score(target_valid, predicted_valid))

0.5853658536585366


In [133]:
probabilities_valid = model_RandomForestClassifier.predict_proba(features_valid)

In [134]:
probabilities_one_valid = probabilities_valid[:, 1]

In [135]:
auc_roc = roc_auc_score(target_valid, probabilities_one_valid)

In [136]:
auc_roc

0.8380898387599438

The model does better than random guessing (AUC-ROC is above 0.5).

The best F1 achieved by RandomForestClassifier with downsampling is 0.585.

#### DecisionTreeClassifier

In [137]:
count = 0
name = 0
for depth in range(1, 100, 1):
    model = DecisionTreeClassifier(max_depth = depth, 
                                   random_state=12345)
    model.fit(features_downsampled, target_downsampled)
    predicted_valid = model.predict(features_valid)
    score = f1_score(target_valid, predicted_valid)
    if score > count:
        count = score
        name = depth
        print(name, count)    
    else:
        continue

1 0.47179487179487184
2 0.487098804279421
4 0.5346676197283775
5 0.5543018335684062


Best value: max_depth = 5

In [138]:
model_DecisionTreeClassifier = DecisionTreeClassifier(max_depth = 5, 
                                   random_state=12345)
model_DecisionTreeClassifier.fit(features_downsampled, target_downsampled)
predicted_valid = model_DecisionTreeClassifier.predict(features_valid)
print(f1_score(target_valid, predicted_valid))

0.5543018335684062


In [139]:
probabilities_valid = model_DecisionTreeClassifier.predict_proba(features_valid)

In [140]:
probabilities_one_valid = probabilities_valid[:, 1]

In [141]:
auc_roc = roc_auc_score(target_valid, probabilities_one_valid)

In [142]:
auc_roc

0.8232342464136081

The model does better than random guessing (AUC-ROC is above 0.5).

The best F1 achieved by DecisionTreeClassifier with downsampling is 0.554.

#### LogisticRegression

In [143]:
model_LogisticRegression = LogisticRegression(random_state=12345, solver='liblinear')
model_LogisticRegression.fit(features_downsampled, target_downsampled)
predicted_valid = model_LogisticRegression.predict(features_valid)
f1_score(target_valid, predicted_valid)

0.46852693056456846

In [144]:
probabilities_valid = model_LogisticRegression.predict_proba(features_valid)

In [145]:
probabilities_one_valid = probabilities_valid[:, 1]

In [146]:
auc_roc = roc_auc_score(target_valid, probabilities_one_valid)

In [147]:
auc_roc

0.757984621619293

The model does better than random guessing (AUC-ROC is above 0.5).

The best F1 achieved by LogisticRegression with downsampling is 0.469.

In [148]:
# Each cell is 'F1/AUC-ROC' on the validation set, rounded to three decimals
itog = [
    ['RandomForestClassifier' , '0.568/0.844' , '0.583/0.844' , '0.612/0.849' , '0.611/0.855' , '0.585/0.838'],
    ['DecisionTreeClassifier' , '0.473/0.668' , '0.570/0.795' , '0.593/0.833' , '0.593/0.833' , '0.554/0.823'],
    ['LogisticRegression', '0.318/0.752' , '0.318/0.752' , '0.478/0.757' , '0.477/0.757' , '0.469/0.758']
]

In [149]:
column_itog = ['model_name', 'non_parameters_F1_auc_roc', 'hyperparameters_F1_auc_roc', 
               'balans_F1_auc_roc', 'upsampled_F1_auc_roc', 'downsampled_F1_auc_roc']

In [150]:
itog_data = pd.DataFrame(data=itog, columns=column_itog)

In [151]:
itog_data

| | model_name | non_parameters_F1_auc_roc | hyperparameters_F1_auc_roc | balans_F1_auc_roc | upsampled_F1_auc_roc | downsampled_F1_auc_roc |
|---|---|---|---|---|---|---|
| 0 | RandomForestClassifier | 0.568/0.844 | 0.583/0.844 | 0.612/0.849 | 0.611/0.855 | 0.585/0.838 |
| 1 | DecisionTreeClassifier | 0.473/0.668 | 0.570/0.795 | 0.593/0.833 | 0.593/0.833 | 0.554/0.823 |
| 2 | LogisticRegression | 0.318/0.752 | 0.318/0.752 | 0.478/0.757 | 0.477/0.757 | 0.469/0.758 |


Summary

I consider RandomForestClassifier trained on the upsampled set to be the best model for this task.

RandomForestClassifier with class_weight='balanced' is also a candidate, but its AUC-ROC is lower.

[Home](#Оглавление)

<a name="Шаг_4"></a>

# 4. Testing the RandomForestClassifier model

In [152]:
model = RandomForestClassifier(n_estimators = 44, max_depth = 9, 
                                                      random_state=12345)
model.fit(features_upsampled, target_upsampled)

RandomForestClassifier(max_depth=9, n_estimators=44, random_state=12345)

In [153]:
predicted = model.predict(features_test)

In [154]:
f1_score(target_test, predicted)

0.6159695817490494

In [155]:
probabilities_valid = model.predict_proba(features_valid)

In [156]:
probabilities_one_valid = probabilities_valid[:, 1]

In [157]:
auc_roc = roc_auc_score(target_valid, probabilities_one_valid)

In [158]:
auc_roc

0.8546084618321952

Testing gives F1 ≈ 0.616 on the test set, which I consider an acceptable result; the model is adequate. The AUC-ROC of ≈ 0.855 shown above is computed on features_valid and target_valid, so it is the validation value rather than a test-set measurement.
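Computing both metrics inside one function that receives a single feature/target pair makes it harder to mix validation and test data. This is a minimal sketch: the helper name `evaluate` and the synthetic data from `make_classification` are assumptions made for illustration, not part of the notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score, roc_auc_score
from sklearn.model_selection import train_test_split


def evaluate(model, features, target):
    """Return (F1, AUC-ROC) computed on the same feature/target pair."""
    predicted = model.predict(features)
    probabilities_one = model.predict_proba(features)[:, 1]
    return f1_score(target, predicted), roc_auc_score(target, probabilities_one)


# Synthetic data stands in for the notebook's training and test sets
X, y = make_classification(n_samples=600, weights=[0.8, 0.2], random_state=12345)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=12345)

model = RandomForestClassifier(n_estimators=44, max_depth=9, random_state=12345)
model.fit(X_train, y_train)

test_f1, test_auc = evaluate(model, X_test, y_test)
print(test_f1, test_auc)
```

In the notebook the equivalent call would be `evaluate(model, features_test, target_test)`.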

[Home](#Оглавление)

# 4. Testing the DecisionTreeClassifier model

Out of curiosity, I will also test the DecisionTreeClassifier model whose upsampled_F1_auc_roc value is 0.593/0.833.

In [159]:
model_DecisionTreeClassifier_test = DecisionTreeClassifier(max_depth = 5, 
                                   random_state=12345)
model_DecisionTreeClassifier_test.fit(features_upsampled, target_upsampled)

DecisionTreeClassifier(max_depth=5, random_state=12345)

In [160]:
predict_DecisionTreeClassifier_valid = model_DecisionTreeClassifier_test.predict(features_valid)

In [161]:
f1_score(target_valid, predict_DecisionTreeClassifier_valid)

0.5927927927927927

In [162]:
probabilities_valid = model_DecisionTreeClassifier_test.predict_proba(features_valid)

In [163]:
probabilities_one_valid = probabilities_valid[:, 1]

In [164]:
auc_roc = roc_auc_score(target_valid, probabilities_one_valid)

In [165]:
auc_roc

0.8327614393858007

In [166]:
predict_DecisionTreeClassifier_test = model_DecisionTreeClassifier_test.predict(features_test)

In [167]:
f1_score(target_test, predict_DecisionTreeClassifier_test)

0.5825977301387136

In [168]:
probabilities_valid_test = model_DecisionTreeClassifier_test.predict_proba(features_test)

In [169]:
probabilities_one_valid_test = probabilities_valid_test[:, 1]

In [170]:
auc_roc = roc_auc_score(target_test, probabilities_one_valid_test)

In [171]:
auc_roc

0.8342766516102971

On the test set, F1 dropped from 0.593 (validation) to 0.583, while AUC-ROC stayed about the same (0.833 on validation, 0.834 on test).

#### Final model: RandomForestClassifier trained on the upsampled set.