### Homework 6
1. Take any dataset for binary classification (you can download one of the sample datasets from https://archive.ics.uci.edu/ml/datasets.php)
2. Do feature engineering
3. Train any classifier (whichever you like)
4. Then split your dataset into two sets: P (positives) and U (unlabeled). Take only part of the positive (class 1) examples for P, not all of them
5. Apply random negative sampling to build a classifier in the new setting
6. Compare quality with the solution from step 4 (build a report: a table of metrics)
7. Experiment with the share of P in step 5 (how does model quality change as P shrinks or grows)

#### 1. Take any dataset for binary classification (you can download one of the sample datasets from https://archive.ics.uci.edu/ml/datasets.php)

In [1]:
import pandas as pd
import numpy as np
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, roc_auc_score, precision_score, classification_report, precision_recall_curve, confusion_matrix

from sklearn.model_selection import train_test_split
#from sklearn.feature_extraction.text import TfidfVectorizer
import itertools

import matplotlib.pyplot as plt

%matplotlib inline

In [2]:
df = pd.read_csv("churn_data.csv")
df

Unnamed: 0,RowNumber,CustomerId,Surname,CreditScore,Geography,Gender,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Exited
0,1,15634602,Hargrave,619,France,Female,42,2,0.00,1,1,1,101348.88,1
1,2,15647311,Hill,608,Spain,Female,41,1,83807.86,1,0,1,112542.58,0
2,3,15619304,Onio,502,France,Female,42,8,159660.80,3,1,0,113931.57,1
3,4,15701354,Boni,699,France,Female,39,1,0.00,2,0,0,93826.63,0
4,5,15737888,Mitchell,850,Spain,Female,43,2,125510.82,1,1,1,79084.10,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9995,9996,15606229,Obijiaku,771,France,Male,39,5,0.00,2,1,0,96270.64,0
9996,9997,15569892,Johnstone,516,France,Male,35,10,57369.61,1,1,1,101699.77,0
9997,9998,15584532,Liu,709,France,Female,36,7,0.00,1,0,1,42085.58,1
9998,9999,15682355,Sabbatini,772,Germany,Male,42,3,75075.31,2,1,0,92888.52,1


#### 2. Do feature engineering

Let's look at the class distribution:

In [3]:
df['Exited'].value_counts()

0    7963
1    2037
Name: Exited, dtype: int64

The class balance is not too bad: roughly 1 churned customer for every 4 who stayed (2037 vs 7963).

Let's build a model. We will use an sklearn pipeline from the start.

In [4]:
# split the data into train/test
X_train, X_test, y_train, y_test = train_test_split(df, df['Exited'], random_state=0)

- Categorical features are one-hot encoded
- Numeric features are standardized with StandardScaler

In [5]:
# build a simple pipeline; first we need a class that selects the required column
class FeatureSelector(BaseEstimator, TransformerMixin):
    def __init__(self, column):
        self.column = column

    def fit(self, X, y=None):
        return self

    def transform(self, X, y=None):
        return X[self.column]
    
class NumberSelector(BaseEstimator, TransformerMixin):
    """
    Transformer to select a single column from the data frame to perform additional transformations on.
    Use on numeric columns in the data.
    """
    def __init__(self, key):
        self.key = key

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return X[[self.key]]
    
class OHEEncoder(BaseEstimator, TransformerMixin):
    def __init__(self, key):
        self.key = key
        self.columns = []

    def fit(self, X, y=None):
        self.columns = [col for col in pd.get_dummies(X, prefix=self.key).columns]
        return self

    def transform(self, X):
        X = pd.get_dummies(X, prefix=self.key)
        test_columns = [col for col in X.columns]
        # print(test_columns)
        for col_ in self.columns:
            # print(col_)
            if col_ not in test_columns:
                X[col_] = 0
        return X[self.columns]
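
The column alignment that `OHEEncoder.transform` performs by hand can also be expressed with `DataFrame.reindex`. The snippet below is a minimal sketch on made-up `Geography` values, not a drop-in replacement for the class above; `train`, `test`, `train_cols` and `test_ohe` are illustrative names.

```python
import pandas as pd

# Toy data: the test part lacks two of the categories seen in training.
train = pd.Series(["France", "Spain", "Germany"], name="Geography")
test = pd.Series(["Spain", "Spain"], name="Geography")

# Remember the dummy columns produced on the training data.
train_cols = pd.get_dummies(train, prefix="Geography").columns

# Encode the test data, add the missing columns filled with 0,
# and put the columns in the training order.
test_ohe = pd.get_dummies(test, prefix="Geography").reindex(
    columns=train_cols, fill_value=0
)
print(list(test_ohe.columns))
```

Categories that appear only at transform time are silently dropped by `reindex`, which matches the behavior of `X[self.columns]` in the class above.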

In [6]:
df.head(3)

Unnamed: 0,RowNumber,CustomerId,Surname,CreditScore,Geography,Gender,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Exited
0,1,15634602,Hargrave,619,France,Female,42,2,0.0,1,1,1,101348.88,1
1,2,15647311,Hill,608,Spain,Female,41,1,83807.86,1,0,1,112542.58,0
2,3,15619304,Onio,502,France,Female,42,8,159660.8,3,1,0,113931.57,1


Define the feature lists

In [7]:
categorical_columns = ['Geography', 'Gender', 'Tenure', 'HasCrCard', 'IsActiveMember']
continuous_columns = ['CreditScore', 'Age', 'Balance', 'NumOfProducts', 'EstimatedSalary']

Now we create a transformer for each feature and collect them in a list (in a loop, to save effort)

In [8]:
final_transformers = list()

for cat_col in categorical_columns:
    cat_transformer = Pipeline([
                ('selector', FeatureSelector(column=cat_col)),
                ('ohe', OHEEncoder(key=cat_col))
            ])
    final_transformers.append((cat_col, cat_transformer))
    
for cont_col in continuous_columns:
    cont_transformer = Pipeline([
                ('selector', NumberSelector(key=cont_col)),
                ( 'std_scaler', StandardScaler())
            ])
    final_transformers.append((cont_col, cont_transformer))

In [9]:
final_transformers

[('Geography',
  Pipeline(steps=[('selector', FeatureSelector(column='Geography')),
                  ('ohe', OHEEncoder(key='Geography'))])),
 ('Gender',
  Pipeline(steps=[('selector', FeatureSelector(column='Gender')),
                  ('ohe', OHEEncoder(key='Gender'))])),
 ('Tenure',
  Pipeline(steps=[('selector', FeatureSelector(column='Tenure')),
                  ('ohe', OHEEncoder(key='Tenure'))])),
 ('HasCrCard',
  Pipeline(steps=[('selector', FeatureSelector(column='HasCrCard')),
                  ('ohe', OHEEncoder(key='HasCrCard'))])),
 ('IsActiveMember',
  Pipeline(steps=[('selector', FeatureSelector(column='IsActiveMember')),
                  ('ohe', OHEEncoder(key='IsActiveMember'))])),
 ('CreditScore',
  Pipeline(steps=[('selector', NumberSelector(key='CreditScore')),
                  ('std_scaler', StandardScaler())])),
 ('Age',
  Pipeline(steps=[('selector', NumberSelector(key='Age')),
                  ('std_scaler', StandardScaler())])),
 ('Balance',
  Pipeline(st

Combine everything into a single pipeline

In [10]:
from sklearn.pipeline import Pipeline
from sklearn.pipeline import FeatureUnion
feats = FeatureUnion(final_transformers)

feature_processing = Pipeline([('feats', feats)])

Now we have a pipeline that prepares the features for modeling.

#### 3. Train any classifier (whichever you like)

In [11]:
from sklearn.ensemble import HistGradientBoostingClassifier
from catboost import CatBoostClassifier

pipeline = Pipeline([
    ('features',feats),
    ('classifier', CatBoostClassifier(random_state=42)),
])

Train the model

In [12]:
# fit the pipeline
pipeline.fit(X_train, y_train)

Learning rate set to 0.024355
0:	learn: 0.6726374	total: 149ms	remaining: 2m 29s
1:	learn: 0.6548174	total: 154ms	remaining: 1m 17s
2:	learn: 0.6405685	total: 158ms	remaining: 52.7s
3:	learn: 0.6237743	total: 163ms	remaining: 40.6s
4:	learn: 0.6080844	total: 167ms	remaining: 33.3s
5:	learn: 0.5929243	total: 173ms	remaining: 28.7s
6:	learn: 0.5787577	total: 179ms	remaining: 25.4s
7:	learn: 0.5659958	total: 185ms	remaining: 23s
8:	learn: 0.5556154	total: 195ms	remaining: 21.5s
9:	learn: 0.5441443	total: 202ms	remaining: 20s
10:	learn: 0.5348181	total: 209ms	remaining: 18.8s
11:	learn: 0.5257042	total: 215ms	remaining: 17.7s
12:	learn: 0.5169250	total: 231ms	remaining: 17.5s
13:	learn: 0.5074365	total: 239ms	remaining: 16.8s
14:	learn: 0.4985522	total: 247ms	remaining: 16.2s
15:	learn: 0.4914367	total: 255ms	remaining: 15.7s
16:	learn: 0.4826085	total: 261ms	remaining: 15.1s
17:	learn: 0.4757232	total: 269ms	remaining: 14.7s
18:	learn: 0.4680217	total: 274ms	remaining: 14.1s
19:	learn: 0.

Pipeline(steps=[('features',
                 FeatureUnion(transformer_list=[('Geography',
                                                 Pipeline(steps=[('selector',
                                                                  FeatureSelector(column='Geography')),
                                                                 ('ohe',
                                                                  OHEEncoder(key='Geography'))])),
                                                ('Gender',
                                                 Pipeline(steps=[('selector',
                                                                  FeatureSelector(column='Gender')),
                                                                 ('ohe',
                                                                  OHEEncoder(key='Gender'))])),
                                                ('Tenure',
                                                 Pipeline(steps=[('selector',
           

In [13]:
# predicted probabilities for the test set
preds_boost = pipeline.predict_proba(X_test)[:, 1]
preds_boost[:10]

array([0.46917705, 0.24499582, 0.11518556, 0.04942908, 0.02369729,
       0.87775373, 0.02032456, 0.13019025, 0.16192791, 0.91391094])

We also need to go from probabilities to class labels. To do that we pick a threshold: if the predicted probability is above it, the object is labeled class 1, otherwise class 0.

In [14]:
precision, recall, thresholds = precision_recall_curve(y_test, preds_boost)

b=1
fscore = (1+b**2)*(precision * recall) / (b**2*precision + recall)

# fscore = (2 * precision * recall) / (precision + recall)
# locate the index of the largest f score
ix = np.argmax(fscore)
print('Best Threshold=%f, F-Score=%.3f, Precision=%.3f, Recall=%.3f' % (thresholds[ix], 
                                                                        fscore[ix],
                                                                        precision[ix],
                                                                        recall[ix]))

Best Threshold=0.384476, F-Score=0.645, Precision=0.661, Recall=0.629
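
The threshold search above can also be sketched with numpy alone. Two details motivate this: `precision_recall_curve` returns one more precision/recall value than thresholds, and the F-score formula yields NaN when precision and recall are both zero, which can mislead `np.argmax`. The helper below is a hypothetical illustration (the name `best_f_threshold` and the toy data are not from the notebook); it scans every distinct score as a candidate threshold and treats 0/0 as 0.

```python
import numpy as np

def best_f_threshold(y_true, scores, beta=1.0):
    """Return (threshold, f_beta, precision, recall) with the highest F-beta.

    An object is predicted positive when score >= threshold.
    """
    y_true = np.asarray(y_true)
    scores = np.asarray(scores, dtype=float)
    best = (None, -1.0, 0.0, 0.0)
    for thr in np.unique(scores):
        pred = scores >= thr
        tp = np.sum(pred & (y_true == 1))
        fp = np.sum(pred & (y_true == 0))
        fn = np.sum(~pred & (y_true == 1))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = beta**2 * precision + recall
        # Define the score as 0 when both precision and recall are 0.
        f = (1 + beta**2) * precision * recall / denom if denom else 0.0
        if f > best[1]:
            best = (float(thr), float(f), float(precision), float(recall))
    return best

# Toy labels and scores, made up for the illustration.
y_demo = [0, 0, 1, 1, 0, 1]
s_demo = [0.1, 0.4, 0.35, 0.8, 0.2, 0.7]
thr, f, p, r = best_f_threshold(y_demo, s_demo)
print('Best Threshold=%f, F-Score=%.3f, Precision=%.3f, Recall=%.3f' % (thr, f, p, r))
```

The loop is quadratic in the number of distinct scores, so it suits small checks; for the full test set `precision_recall_curve` remains the practical choice.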


In [15]:
# store the metrics in a dictionary
metrics = {
    'Variant': [
    'Basic', 
    ],
    'precision': [precision[ix]],
    'recall': [recall[ix]],
    'f_score': [fscore[ix]]
     }  


In [16]:
metrics

{'Variant': ['Basic'],
 'precision': [0.6611570247933884],
 'recall': [0.6286836935166994],
 'f_score': [0.6445115810674723]}

#### 4. Split the dataset into two sets: P (positives) and U (unlabeled). Take only part of the positive (class 1) examples for P, not all of them

In [17]:
df = df.sort_values('Exited')

In [18]:
U = df[:9000]
U

Unnamed: 0,RowNumber,CustomerId,Surname,CreditScore,Geography,Gender,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Exited
4999,5000,15710408,Cunningham,584,Spain,Female,38,3,0.00,2,1,1,4525.40,0
6317,6318,15654878,Yobanna,450,France,Male,29,7,117199.80,1,1,1,43480.63,0
6316,6317,15765643,Hamilton,725,France,Male,37,6,124348.38,2,0,1,176984.34,0
6313,6314,15812482,Young,575,France,Male,27,3,139301.68,1,1,0,99843.98,0
6312,6313,15648136,Green,658,Germany,Female,28,9,152812.58,1,1,0,166682.57,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
6065,6066,15674720,Smith,691,Germany,Female,37,7,123067.63,1,1,1,98162.44,1
2032,2033,15658716,Banks,667,Germany,Female,37,5,92171.35,3,1,0,178106.34,1
7502,7503,15697844,Whitehouse,721,Spain,Female,32,10,0.00,1,1,0,136119.96,1
7504,7505,15587038,Ogochukwu,654,Spain,Female,32,2,0.00,1,1,1,51972.92,1


In [19]:
P = df[9000:]
P

Unnamed: 0,RowNumber,CustomerId,Surname,CreditScore,Geography,Gender,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Exited
7506,7507,15700300,Okoli,674,Germany,Female,44,4,131593.85,1,0,1,171345.02,1
7507,7508,15642001,Lorenzen,576,Germany,Male,44,9,119530.52,1,1,0,119056.68,1
3180,3181,15750447,Ozoemena,678,France,Female,60,10,117738.81,1,1,0,147489.76,1
3184,3185,15631070,Gerasimova,667,Germany,Male,55,9,154393.43,1,1,1,137674.96,1
2007,2008,15727384,Chukwuemeka,705,Germany,Female,43,10,146547.78,1,0,1,10072.55,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5700,5701,15812888,Perreault,447,France,Male,41,3,0.00,4,1,1,197490.39,1
1583,1584,15730394,Crowther,709,France,Female,43,8,0.00,2,0,0,168035.62,1
3452,3453,15722965,Yefimova,757,France,Male,57,3,89079.41,1,1,1,53179.21,1
1676,1677,15658057,Padovesi,812,Spain,Female,44,8,0.00,3,1,0,66926.83,1
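
The split above takes the last 1000 rows of the sorted frame as P. An alternative, shown here only as a sketch, is to pick the labeled positives at random with a fixed seed so the split is reproducible. The helper name `pu_split_mask`, its parameters and the toy labels are assumptions for illustration, not part of the notebook.

```python
import numpy as np

def pu_split_mask(y, labeled_frac, seed=0):
    """Boolean mask: True for positives placed in P, False for rows that go to U."""
    y = np.asarray(y)
    rng = np.random.default_rng(seed)
    pos_idx = np.flatnonzero(y == 1)
    n_labeled = int(round(labeled_frac * len(pos_idx)))
    chosen = rng.choice(pos_idx, size=n_labeled, replace=False)
    mask = np.zeros(len(y), dtype=bool)
    mask[chosen] = True
    return mask

# Toy labels: 8 negatives and 4 positives; half of the positives go to P.
y_demo = np.array([0] * 8 + [1] * 4)
in_P = pu_split_mask(y_demo, labeled_frac=0.5, seed=42)
print("rows in P:", in_P.sum(), "| positives hidden in U:", y_demo[~in_P].sum())
```

With a DataFrame, the same mask would select `P = df[in_P]` and `U = df[~in_P]`.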


#### 5. Apply random negative sampling to build a classifier in the new setting

In [20]:
U_train = U.sample(n = 5000)
U_train

Unnamed: 0,RowNumber,CustomerId,Surname,CreditScore,Geography,Gender,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Exited
7709,7710,15574119,Okwuadigbo,598,Spain,Female,64,1,62979.93,1,1,1,152273.57,0
6794,6795,15694098,Jackson,575,France,Female,54,9,68332.96,1,1,1,144390.75,0
7448,7449,15593834,Genovese,691,Spain,Male,36,7,129934.64,1,0,0,75664.56,1
1945,1946,15607347,Olisaemeka,734,France,Male,22,5,130056.23,1,0,0,121894.31,1
564,565,15788126,Evans,689,Spain,Female,38,6,121021.05,1,1,1,12182.15,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8735,8736,15713599,Castiglione,728,France,Male,30,10,114835.43,1,0,1,37662.49,0
1524,1525,15653595,Ts'ai,796,France,Male,51,6,0.00,2,0,1,194733.28,0
3489,3490,15809817,Ch'en,593,Spain,Male,43,10,0.00,2,0,0,53478.02,0
3859,3860,15694450,Bianchi,677,France,Male,42,5,99580.13,1,1,0,21007.96,0


In [21]:
U_train['Exited'].value_counts()

0    4418
1     582
Name: Exited, dtype: int64

In [23]:
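# random negative sampling: treat every sampled unlabeled row as a negative
U_train['Exited'] = 0
# stack P on top of the sampled U rows
df_UP = df_merged = pd.concat([P, U_train], ignore_index=True)
df_UP['Exited'].value_counts()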

0    5000
1    1000
Name: Exited, dtype: int64
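
The sampling step can be wrapped in a small function so that later experiments only change the sample size. This is a sketch under assumptions: `build_rns_frame`, its `seed` argument and the toy frames are hypothetical, and the fixed `random_state` is an addition that the notebook cells do not use.

```python
import pandas as pd

def build_rns_frame(P, U, n_unlabeled, target="Exited", seed=0):
    """Sample n_unlabeled rows from U, label them 0, and stack them under P."""
    sampled = U.sample(n=n_unlabeled, random_state=seed).assign(**{target: 0})
    return pd.concat([P, sampled], ignore_index=True)

# Toy frames: U still contains two true positives that will be relabeled as 0.
P_demo = pd.DataFrame({"Age": [44, 60, 55], "Exited": [1, 1, 1]})
U_demo = pd.DataFrame({"Age": [38, 29, 37, 27, 32], "Exited": [0, 0, 0, 1, 1]})

df_demo = build_rns_frame(P_demo, U_demo, n_unlabeled=4, seed=0)
print(df_demo["Exited"].value_counts())
```

`assign` returns a new frame, so the original `U` keeps its true labels for later evaluation.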

In [24]:
# split the data into train/test
X_train, X_test, y_train, y_test = train_test_split(df_UP, df_UP['Exited'], random_state=0)

- Categorical features are one-hot encoded
- Numeric features are standardized with StandardScaler

Define the feature lists

In [25]:
categorical_columns = ['Geography', 'Gender', 'Tenure', 'HasCrCard', 'IsActiveMember']
continuous_columns = ['CreditScore', 'Age', 'Balance', 'NumOfProducts', 'EstimatedSalary']

Now we create a transformer for each feature and collect them in a list (in a loop, to save effort)

In [26]:
final_transformers = list()

for cat_col in categorical_columns:
    cat_transformer = Pipeline([
                ('selector', FeatureSelector(column=cat_col)),
                ('ohe', OHEEncoder(key=cat_col))
            ])
    final_transformers.append((cat_col, cat_transformer))
    
for cont_col in continuous_columns:
    cont_transformer = Pipeline([
                ('selector', NumberSelector(key=cont_col)),
                ( 'std_scaler', StandardScaler())
            ])
    final_transformers.append((cont_col, cont_transformer))

Combine everything into a single pipeline

In [27]:
from sklearn.pipeline import Pipeline
from sklearn.pipeline import FeatureUnion
feats = FeatureUnion(final_transformers)

feature_processing = Pipeline([('feats', feats)])

Now we have a pipeline that prepares the features for modeling.

In [28]:
from sklearn.ensemble import HistGradientBoostingClassifier
from catboost import CatBoostClassifier

pipeline = Pipeline([
    ('features',feats),
    ('classifier', CatBoostClassifier(random_state=42)),
])

Train the model

In [29]:
# fit the pipeline
pipeline.fit(X_train, y_train)

Learning rate set to 0.019582
0:	learn: 0.6789399	total: 6.95ms	remaining: 6.94s
1:	learn: 0.6663983	total: 11.3ms	remaining: 5.64s
2:	learn: 0.6540292	total: 14.6ms	remaining: 4.84s
3:	learn: 0.6403602	total: 18.5ms	remaining: 4.6s
4:	learn: 0.6278082	total: 24.8ms	remaining: 4.92s
5:	learn: 0.6171175	total: 27.7ms	remaining: 4.59s
6:	learn: 0.6061837	total: 32.3ms	remaining: 4.58s
7:	learn: 0.5977223	total: 35.3ms	remaining: 4.38s
8:	learn: 0.5861484	total: 42ms	remaining: 4.63s
9:	learn: 0.5765010	total: 48.4ms	remaining: 4.79s
10:	learn: 0.5674385	total: 54.1ms	remaining: 4.86s
11:	learn: 0.5585806	total: 60.4ms	remaining: 4.97s
12:	learn: 0.5483408	total: 64.5ms	remaining: 4.89s
13:	learn: 0.5406628	total: 68.3ms	remaining: 4.81s
14:	learn: 0.5347691	total: 72.6ms	remaining: 4.76s
15:	learn: 0.5284205	total: 76.2ms	remaining: 4.69s
16:	learn: 0.5209869	total: 79.9ms	remaining: 4.62s
17:	learn: 0.5151123	total: 83.3ms	remaining: 4.54s
18:	learn: 0.5088267	total: 90.1ms	remaining: 4

Pipeline(steps=[('features',
                 FeatureUnion(transformer_list=[('Geography',
                                                 Pipeline(steps=[('selector',
                                                                  FeatureSelector(column='Geography')),
                                                                 ('ohe',
                                                                  OHEEncoder(key='Geography'))])),
                                                ('Gender',
                                                 Pipeline(steps=[('selector',
                                                                  FeatureSelector(column='Gender')),
                                                                 ('ohe',
                                                                  OHEEncoder(key='Gender'))])),
                                                ('Tenure',
                                                 Pipeline(steps=[('selector',
           

In [30]:
# predicted probabilities for the test set
preds_boost = pipeline.predict_proba(X_test)[:, 1]
preds_boost[:10]

array([0.02446561, 0.02429135, 0.0366544 , 0.62260353, 0.00458079,
       0.1926542 , 0.14595509, 0.08492538, 0.31872863, 0.64412171])

We also need to go from probabilities to class labels. To do that we pick a threshold: if the predicted probability is above it, the object is labeled class 1, otherwise class 0.

In [31]:
precision_RNS, recall_RNS, thresholds_RNS = precision_recall_curve(y_test, preds_boost)

b=1
fscore_RNS = (1+b**2)*(precision_RNS * recall_RNS) / (b**2*precision_RNS + recall_RNS)

# fscore = (2 * precision * recall) / (precision + recall)
# locate the index of the largest f score
ix_RNS = np.argmax(fscore_RNS)
print('Best Threshold=%f, F-Score=%.3f, Precision=%.3f, Recall=%.3f' % (thresholds_RNS[ix_RNS], 
                                                                        fscore_RNS[ix_RNS],
                                                                        precision_RNS[ix_RNS],
                                                                        recall_RNS[ix_RNS]))

Best Threshold=0.300567, F-Score=0.507, Precision=0.467, Recall=0.555


In [32]:
# store the metrics in a dictionary
metrics = {
    'Variant': [
    'Basic',
    'RNS_10'
    ],
    'precision': [precision[ix], precision_RNS[ix_RNS]],
    'recall': [recall[ix], recall_RNS[ix_RNS]],
    'f_score': [fscore[ix], fscore_RNS[ix_RNS]]
     }  

In [33]:
metrics

{'Variant': ['Basic', 'RNS_10'],
 'precision': [0.6611570247933884, 0.46735395189003437],
 'recall': [0.6286836935166994, 0.5551020408163265],
 'f_score': [0.6445115810674723, 0.5074626865671641]}

#### 6. Compare quality with the solution from step 4 (build a report: a table of metrics)

In [34]:
pd.DataFrame(metrics)

Unnamed: 0,Variant,precision,recall,f_score
0,Basic,0.661157,0.628684,0.644512
1,RNS_10,0.467354,0.555102,0.507463


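All three metrics dropped relative to the baseline: precision from 0.661 to 0.467, recall from 0.629 to 0.555, and F-score from 0.645 to 0.507.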
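
One caveat when reading this table: in the RNS run `y_test` is taken from `df_UP['Exited']`, which carries the PU labels, so churned customers hidden in U count as negatives during evaluation. The toy sketch below uses made-up labels and a hypothetical helper `precision_recall` to show how the same predictions score differently against PU labels and against true labels.

```python
def precision_recall(y_ref, y_pred):
    """Precision and recall of y_pred measured against the reference labels."""
    pairs = list(zip(y_ref, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

y_true = [1, 1, 1, 1, 0, 0, 0, 0]  # real labels
y_pu   = [1, 1, 0, 0, 0, 0, 0, 0]  # two positives are hidden inside U
y_pred = [1, 1, 1, 0, 1, 0, 0, 0]  # one set of model predictions

print("against PU labels:  ", precision_recall(y_pu, y_pred))
print("against true labels:", precision_recall(y_true, y_pred))
```

In this toy case a correctly found hidden positive is counted as a false positive under PU labels, which lowers measured precision. A comparison against the true `Exited` values of a held-out set would avoid that effect.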

#### 7. Experiment with the share of P in step 5 (how does model quality change as P shrinks or grows)

Decrease the share of P (sample more rows from U)

In [35]:
U_train = U.sample(n = 6000)
U_train['Exited'] = 0
df_UP = df_merged = pd.concat([P, U_train], ignore_index=True)
# split the data into train/test
X_train, X_test, y_train, y_test = train_test_split(df_UP, df_UP['Exited'], random_state=0)
feature_processing = Pipeline([('feats', feats)])
pipeline = Pipeline([
    ('features', feats),
    ('classifier', CatBoostClassifier(random_state=42)),
])
# train the model
pipeline.fit(X_train, y_train)
# predicted probabilities for the test set
preds_boost = pipeline.predict_proba(X_test)[:, 1]

precision_RNS_2, recall_RNS_2, thresholds_RNS_2 = precision_recall_curve(y_test, preds_boost)
b=1
fscore_RNS_2 = (1+b**2)*(precision_RNS_2 * recall_RNS_2) / (b**2*precision_RNS_2 + recall_RNS_2)

# fscore = (2 * precision * recall) / (precision + recall)
# locate the index of the largest f score
ix_RNS_2 = np.argmax(fscore_RNS_2)
print('Best Threshold=%f, F-Score=%.3f, Precision=%.3f, Recall=%.3f' % (thresholds_RNS_2[ix_RNS_2], 
                                                                        fscore_RNS_2[ix_RNS_2],
                                                                        precision_RNS_2[ix_RNS_2],
                                                                        recall_RNS_2[ix_RNS_2]))

Learning rate set to 0.020914
0:	learn: 0.6754614	total: 60.8ms	remaining: 1m
1:	learn: 0.6599227	total: 68.7ms	remaining: 34.3s
2:	learn: 0.6425162	total: 74.7ms	remaining: 24.8s
3:	learn: 0.6274129	total: 78.6ms	remaining: 19.6s
4:	learn: 0.6128106	total: 82.2ms	remaining: 16.4s
5:	learn: 0.5991837	total: 84.9ms	remaining: 14.1s
6:	learn: 0.5884525	total: 87ms	remaining: 12.3s
7:	learn: 0.5745301	total: 90.8ms	remaining: 11.3s
8:	learn: 0.5629099	total: 95.4ms	remaining: 10.5s
9:	learn: 0.5518883	total: 101ms	remaining: 10s
10:	learn: 0.5414372	total: 106ms	remaining: 9.49s
11:	learn: 0.5295037	total: 109ms	remaining: 9s
12:	learn: 0.5202985	total: 113ms	remaining: 8.57s
13:	learn: 0.5123247	total: 117ms	remaining: 8.21s
14:	learn: 0.5049557	total: 121ms	remaining: 7.92s
15:	learn: 0.4955921	total: 124ms	remaining: 7.64s
16:	learn: 0.4900004	total: 128ms	remaining: 7.4s
17:	learn: 0.4831631	total: 134ms	remaining: 7.32s
18:	learn: 0.4752834	total: 139ms	remaining: 7.17s
19:	learn: 0.

In [36]:
# store the metrics in a dictionary
metrics = {
    'Variant': [
    'Basic',
    'RNS_5000',
    'RNS_6000'
    ],
    'precision': [precision[ix], precision_RNS[ix_RNS], precision_RNS_2[ix_RNS_2]],
    'recall': [recall[ix], recall_RNS[ix_RNS], recall_RNS_2[ix_RNS_2]],
    'f_score': [fscore[ix], fscore_RNS[ix_RNS], fscore_RNS_2[ix_RNS_2]]
     }  

In [37]:
metrics

{'Variant': ['Basic', 'RNS_5000', 'RNS_6000'],
 'precision': [0.6611570247933884, 0.46735395189003437, 0.3923444976076555],
 'recall': [0.6286836935166994, 0.5551020408163265, 0.656],
 'f_score': [0.6445115810674723, 0.5074626865671641, 0.4910179640718564]}

In [38]:
pd.DataFrame(metrics)

Unnamed: 0,Variant,precision,recall,f_score
0,Basic,0.661157,0.628684,0.644512
1,RNS_5000,0.467354,0.555102,0.507463
2,RNS_6000,0.392344,0.656,0.491018


Increase the share of P (sample fewer rows from U)

In [39]:
U_train = U.sample(n = 4000)
U_train['Exited'] = 0
df_UP = df_merged = pd.concat([P, U_train], ignore_index=True)
# split the data into train/test
X_train, X_test, y_train, y_test = train_test_split(df_UP, df_UP['Exited'], random_state=0)
feature_processing = Pipeline([('feats', feats)])
pipeline = Pipeline([
    ('features', feats),
    ('classifier', CatBoostClassifier(random_state=42)),
])
# train the model
pipeline.fit(X_train, y_train)
# predicted probabilities for the test set
preds_boost = pipeline.predict_proba(X_test)[:, 1]

precision_RNS_3, recall_RNS_3, thresholds_RNS_3 = precision_recall_curve(y_test, preds_boost)
b=1
fscore_RNS_3 = (1+b**2)*(precision_RNS_3 * recall_RNS_3) / (b**2*precision_RNS_3 + recall_RNS_3)

# fscore = (2 * precision * recall) / (precision + recall)
# locate the index of the largest f score
ix_RNS_3 = np.argmax(fscore_RNS_3)
print('Best Threshold=%f, F-Score=%.3f, Precision=%.3f, Recall=%.3f' % (thresholds_RNS_3[ix_RNS_3], 
                                                                        fscore_RNS_3[ix_RNS_3],
                                                                        precision_RNS_3[ix_RNS_3],
                                                                        recall_RNS_3[ix_RNS_3]))

Learning rate set to 0.018115
0:	learn: 0.6817307	total: 4.82ms	remaining: 4.82s
1:	learn: 0.6718944	total: 7.54ms	remaining: 3.76s
2:	learn: 0.6595330	total: 11.6ms	remaining: 3.85s
3:	learn: 0.6482013	total: 16.3ms	remaining: 4.05s
4:	learn: 0.6385725	total: 21.7ms	remaining: 4.32s
5:	learn: 0.6297736	total: 25.4ms	remaining: 4.21s
6:	learn: 0.6223896	total: 27.6ms	remaining: 3.92s
7:	learn: 0.6139608	total: 33.1ms	remaining: 4.1s
8:	learn: 0.6043418	total: 37.6ms	remaining: 4.14s
9:	learn: 0.5965772	total: 42ms	remaining: 4.16s
10:	learn: 0.5876357	total: 48.8ms	remaining: 4.38s
11:	learn: 0.5799144	total: 54.2ms	remaining: 4.46s
12:	learn: 0.5720608	total: 57.2ms	remaining: 4.35s
13:	learn: 0.5655896	total: 60.4ms	remaining: 4.25s
14:	learn: 0.5597478	total: 66.3ms	remaining: 4.35s
15:	learn: 0.5548024	total: 68.5ms	remaining: 4.21s
16:	learn: 0.5492336	total: 72.2ms	remaining: 4.17s
17:	learn: 0.5436261	total: 75.3ms	remaining: 4.11s
18:	learn: 0.5394516	total: 80.8ms	remaining: 4

In [40]:
# store the metrics in a dictionary
metrics = {
    'Variant': [
    'Basic',
    'RNS_5000',
    'RNS_6000',
    'RNS_4000'
    ],
    'precision': [precision[ix], precision_RNS[ix_RNS], precision_RNS_2[ix_RNS_2], precision_RNS_3[ix_RNS_3]],
    'recall': [recall[ix], recall_RNS[ix_RNS], recall_RNS_2[ix_RNS_2], recall_RNS_3[ix_RNS_3]],
    'f_score': [fscore[ix], fscore_RNS[ix_RNS], fscore_RNS_2[ix_RNS_2], fscore_RNS_3[ix_RNS_3]]
     }  

In [41]:
metrics

{'Variant': ['Basic', 'RNS_5000', 'RNS_6000', 'RNS_4000'],
 'precision': [0.6611570247933884,
  0.46735395189003437,
  0.3923444976076555,
  0.4837758112094395],
 'recall': [0.6286836935166994, 0.5551020408163265, 0.656, 0.6721311475409836],
 'f_score': [0.6445115810674723,
  0.5074626865671641,
  0.4910179640718564,
  0.562607204116638]}

In [42]:
pd.DataFrame(metrics)

Unnamed: 0,Variant,precision,recall,f_score
0,Basic,0.661157,0.628684,0.644512
1,RNS_5000,0.467354,0.555102,0.507463
2,RNS_6000,0.392344,0.656,0.491018
3,RNS_4000,0.483776,0.672131,0.562607


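Conclusion: as the share of P grows, prediction quality improves. With P fixed at 1000 rows, the F-score rises from 0.491 (6000 sampled U rows) to 0.507 (5000) and 0.563 (4000). The U samples are drawn without a fixed seed, so these are single-run figures.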
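
To make "share of P" concrete, the arithmetic below computes it for the three RNS variants, assuming P holds the 1000 rows of `df[9000:]` and U contributes the sampled rows. The dictionary name `shares` is illustrative.

```python
n_positive = 1000  # rows in P

# Share of P in the combined P + sampled-U frame for each variant.
shares = {
    n_unlabeled: n_positive / (n_positive + n_unlabeled)
    for n_unlabeled in (6000, 5000, 4000)
}

for n_unlabeled, share in shares.items():
    print(f"RNS_{n_unlabeled}: P share = {share:.1%}")
```

So the three runs correspond to P shares of roughly 14%, 17% and 20%.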