### Training the pipeline

1. Load the data: https://www.kaggle.com/datasets/sonujha090/bank-marketing
2. Build a pipeline with data preprocessing
3. Train CatBoost and save the fitted pipeline to disk



Bank Marketing: 

The data is related to direct marketing campaigns of a Portuguese banking institution.
The marketing campaigns were based on phone calls. Often, more than one contact with the same client was required
in order to assess whether the product (bank term deposit) would be subscribed to or not.

The classification goal is to predict whether the client will subscribe to a term deposit (variable y).

Number of Attributes: 16 + output attribute.

Attribute information:
1. age (numeric)
2. job : type of job (categorical: "admin.","unknown","unemployed","management","housemaid","entrepreneur","student",
"blue-collar","self-employed","retired","technician","services")
3. marital : marital status (categorical: "married","divorced","single"; note: "divorced" means divorced or widowed)
4. education (categorical: "unknown","secondary","primary","tertiary")
5. default: has credit in default? (binary: "yes","no")
6. balance: average yearly balance, in euros (numeric)
7. housing: has housing loan? (binary: "yes","no")
8. loan: has personal loan? (binary: "yes","no")


9. contact: contact communication type (categorical: "unknown","telephone","cellular")
10. day: last contact day of the month (numeric)
11. month: last contact month of year (categorical: "jan", "feb", "mar", …, "nov", "dec")
12. duration: last contact duration, in seconds (numeric)


13. campaign: number of contacts performed during this campaign and for this client (numeric, includes last contact)
14. pdays: number of days that passed by after the client was last contacted from a previous campaign (numeric, -1 means client was not previously contacted)
15. previous: number of contacts performed before this campaign and for this client (numeric)
16. poutcome: outcome of the previous marketing campaign (categorical: "unknown","other","failure","success")


17. y - has the client subscribed a term deposit? (binary: "yes","no")

In [1]:
import dill
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from urllib import request, parse 
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.preprocessing import StandardScaler
from catboost import CatBoostClassifier
from sklearn.metrics import roc_auc_score, precision_recall_curve, confusion_matrix

Load the data:

In [2]:
df = pd.read_csv("./bank-full.csv")
df.head(3)

Unnamed: 0,age,job,marital,education,default,balance,housing,loan,contact,day,month,duration,campaign,pdays,previous,poutcome,y
0,58,management,married,tertiary,no,2143,yes,no,unknown,5,may,261,1,-1,0,unknown,no
1,44,technician,single,secondary,no,29,yes,no,unknown,5,may,151,1,-1,0,unknown,no
2,33,entrepreneur,married,secondary,no,2,yes,yes,unknown,5,may,76,1,-1,0,unknown,no
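
The `Unnamed: 0` column in the output above is what pandas produces when the first CSV column has no header, which usually means a saved row index. In this notebook it is harmless, because the pipeline later selects columns by name. A minimal sketch, assuming the file has that layout, of reading the column as the index instead; the inline CSV is a shortened stand-in for `bank-full.csv`, not the real file.

```python
import io
import pandas as pd

# Inline stand-in for bank-full.csv: the header starts with an empty field,
# which is what a saved DataFrame index looks like.
csv_text = """,age,job,marital,y
0,58,management,married,no
1,44,technician,single,no
2,33,entrepreneur,married,no
"""

# Default read: the unnamed first column becomes a regular "Unnamed: 0" column.
df_default = pd.read_csv(io.StringIO(csv_text))

# index_col=0 uses that column as the row index instead.
df_indexed = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(list(df_default.columns))
print(list(df_indexed.columns))
```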


Transform the data:

In [3]:
# Convert yes/no text features to binary: 1 = yes, 0 = no
df['y'] = df['y'].apply(lambda x: 1 if x=='yes' else 0).astype('int')
df['default'] = df['default'].apply(lambda x: 1 if x=='yes' else 0).astype('int')
df['housing'] = df['housing'].apply(lambda x: 1 if x=='yes' else 0).astype('int')
df['loan'] = df['loan'].apply(lambda x: 1 if x=='yes' else 0).astype('int')
# poutcome becomes 1 only for 'success'; 'unknown', 'other' and 'failure' all map to 0
df['poutcome'] = df['poutcome'].apply(lambda x: 1 if x=='success' else 0).astype('int')
# drop columns we do not use:
df = df.drop(['day', 'month'], axis=1)
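
The same yes/no encoding can also be written without `apply`, using a vectorized comparison. This is only a sketch on a made-up three-row frame (the values are illustrative, not taken from the dataset); it is meant to show the equivalent idiom, not to replace the cell above.

```python
import pandas as pd

# Toy frame with the same kind of columns as the notebook's DataFrame.
toy = pd.DataFrame({
    "default": ["no", "yes", "no"],
    "poutcome": ["unknown", "success", "failure"],
    "y": ["no", "no", "yes"],
})

# The comparison yields booleans; astype(int) turns them into 0/1.
for col in ["default", "y"]:
    toy[col] = toy[col].eq("yes").astype(int)

# poutcome keeps only the "success" signal, as in the cell above.
toy["poutcome"] = toy["poutcome"].eq("success").astype(int)

print(toy)
```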

Split the data into train and test sets and save both to disk:

In [4]:
X_train, X_test, y_train, y_test = train_test_split(df.drop('y', axis=1), df['y'], random_state=4)

X_test.to_csv("X_test.csv", index=None)
y_test.to_csv("y_test.csv", index=None)
X_train.to_csv("X_train.csv", index=None)
y_train.to_csv("y_train.csv", index=None)
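
The split above relies on the default `test_size` (25% of the rows) and does not stratify by the target. If the classes are imbalanced, passing `stratify` keeps the share of positives the same in both parts. A minimal sketch on synthetic labels; the 10% positive rate is an assumption for illustration, not a figure measured on this dataset.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 1000 rows, 10% positives (the ratio is an assumption).
X = pd.DataFrame({"feature": np.arange(1000)})
y = pd.Series([1] * 100 + [0] * 900, name="y")

# stratify=y preserves the class ratio in both the train and the test part.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=4
)

print(y_tr.mean(), y_te.mean())
```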

Build the pipeline:

In [5]:
class FeatureSelector(BaseEstimator, TransformerMixin):
    def __init__(self, column):
        self.column = column

    def fit(self, X, y=None):
        return self

    def transform(self, X, y=None):
        return X[self.column]
    
class NumberSelector(BaseEstimator, TransformerMixin):
    """
    Transformer to select a single column from the data frame to perform additional transformations on
    Use on numeric columns in the data
    """
    def __init__(self, key):
        self.key = key

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return X[[self.key]]
    
class OHEEncoder(BaseEstimator, TransformerMixin):
    def __init__(self, key):
        self.key = key
        self.columns = []

    def fit(self, X, y=None):
        self.columns = [col for col in pd.get_dummies(X, prefix=self.key).columns]
        return self

    def transform(self, X):
        X = pd.get_dummies(X, prefix=self.key)
        test_columns = [col for col in X.columns]
        for col_ in self.columns:
            if col_ not in test_columns:
                X[col_] = 0
        return X[self.columns]
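
`OHEEncoder` remembers the dummy columns seen during `fit`; at `transform` time it adds any of them that are missing (filled with zeros) and drops columns for categories it has never seen. The same alignment can be sketched with `reindex`. The category "widowed" below is a hypothetical unseen value, used only to show the behavior.

```python
import pandas as pd

train = pd.Series(["married", "single", "divorced"], name="marital")
new = pd.Series(["single", "widowed"], name="marital")  # "widowed" is unseen (hypothetical)

# "fit": remember the dummy columns produced by the training data.
fitted_columns = pd.get_dummies(train, prefix="marital").columns

# "transform": encode the new data, then align it to the remembered columns.
# Missing columns are added as zeros; the unseen "marital_widowed" is dropped.
encoded = (
    pd.get_dummies(new, prefix="marital")
    .reindex(columns=fitted_columns, fill_value=0)
    .astype(int)
)

print(encoded)
```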

In [6]:
# define the feature lists
categorical_columns = ['job', 'marital', 'education', 'contact']
continuous_columns = ['age', 'balance', 'duration', 'campaign', 'pdays', 'previous', 'default', 'housing', 'loan', 'poutcome']

# build one transformer per column of each list (in a loop):
final_transformers = list()

for cat_col in categorical_columns:
    cat_transformer = Pipeline([
                ('selector', FeatureSelector(column=cat_col)),
                ('ohe', OHEEncoder(key=cat_col))
            ])
    final_transformers.append((cat_col, cat_transformer))
    
for cont_col in continuous_columns:
    cont_transformer = Pipeline([
                ('selector', NumberSelector(key=cont_col)),
                ('standard', StandardScaler())
            ])
    final_transformers.append((cont_col, cont_transformer))
    
# combine everything into a single feature-processing step
feats = FeatureUnion(final_transformers)
feature_processing = Pipeline([('feats', feats)])
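
For comparison, scikit-learn's built-in `ColumnTransformer` can express the same idea (one-hot encode the categorical columns, scale the numeric ones) without custom selector classes. This is an alternative sketch, not the approach used in this notebook; the four-row frame is made up for illustration.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy frame with two categorical and two numeric columns.
toy = pd.DataFrame({
    "job": ["admin.", "technician", "admin.", "services"],
    "marital": ["married", "single", "single", "married"],
    "age": [58, 44, 33, 47],
    "balance": [2143, 29, 2, 1506],
})

preprocess = ColumnTransformer([
    # handle_unknown="ignore" encodes unseen categories as all zeros.
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["job", "marital"]),
    ("num", StandardScaler(), ["age", "balance"]),
])

features = preprocess.fit_transform(toy)
print(features.shape)  # 3 job columns + 2 marital columns + 2 numeric columns
```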

Add the CatBoost classifier:

In [7]:
%%time

pipeline = Pipeline([
    ('features', feats),
    ('classifier', CatBoostClassifier(iterations=200, max_depth=8, 
                                      learning_rate=0.03, random_state = 42)), # parameters tuned with GridSearchCV 
])

pipeline.fit(X_train, y_train)

0:	learn: 0.6546283	total: 172ms	remaining: 34.3s
1:	learn: 0.6178640	total: 200ms	remaining: 19.8s
2:	learn: 0.5859201	total: 223ms	remaining: 14.7s
3:	learn: 0.5562781	total: 243ms	remaining: 11.9s
4:	learn: 0.5316039	total: 263ms	remaining: 10.3s
5:	learn: 0.5088800	total: 279ms	remaining: 9.02s
6:	learn: 0.4863376	total: 294ms	remaining: 8.12s
7:	learn: 0.4665573	total: 311ms	remaining: 7.46s
8:	learn: 0.4491216	total: 326ms	remaining: 6.91s
9:	learn: 0.4320297	total: 339ms	remaining: 6.45s
10:	learn: 0.4181152	total: 350ms	remaining: 6.02s
11:	learn: 0.4050417	total: 364ms	remaining: 5.71s
12:	learn: 0.3920418	total: 389ms	remaining: 5.59s
13:	learn: 0.3804464	total: 404ms	remaining: 5.36s
14:	learn: 0.3702705	total: 416ms	remaining: 5.13s
15:	learn: 0.3631044	total: 423ms	remaining: 4.86s
16:	learn: 0.3560123	total: 436ms	remaining: 4.7s
17:	learn: 0.3486263	total: 451ms	remaining: 4.56s
18:	learn: 0.3413402	total: 466ms	remaining: 4.44s
19:	learn: 0.3318880	total: 481ms	remainin

164:	learn: 0.2112600	total: 2.82s	remaining: 599ms
165:	learn: 0.2111839	total: 2.84s	remaining: 582ms
166:	learn: 0.2110496	total: 2.86s	remaining: 564ms
167:	learn: 0.2109166	total: 2.87s	remaining: 547ms
168:	learn: 0.2107604	total: 2.89s	remaining: 530ms
169:	learn: 0.2106917	total: 2.9s	remaining: 512ms
170:	learn: 0.2106216	total: 2.92s	remaining: 495ms
171:	learn: 0.2104833	total: 2.93s	remaining: 477ms
172:	learn: 0.2103660	total: 2.94s	remaining: 460ms
173:	learn: 0.2103046	total: 2.96s	remaining: 442ms
174:	learn: 0.2101861	total: 2.98s	remaining: 425ms
175:	learn: 0.2101129	total: 2.99s	remaining: 408ms
176:	learn: 0.2100222	total: 3.01s	remaining: 391ms
177:	learn: 0.2099759	total: 3.02s	remaining: 374ms
178:	learn: 0.2098920	total: 3.04s	remaining: 357ms
179:	learn: 0.2097998	total: 3.06s	remaining: 340ms
180:	learn: 0.2096714	total: 3.07s	remaining: 323ms
181:	learn: 0.2095650	total: 3.1s	remaining: 306ms
182:	learn: 0.2094623	total: 3.11s	remaining: 289ms
183:	learn: 0.

Pipeline(steps=[('features',
                 FeatureUnion(transformer_list=[('job',
                                                 Pipeline(steps=[('selector',
                                                                  FeatureSelector(column='job')),
                                                                 ('ohe',
                                                                  OHEEncoder(key='job'))])),
                                                ('marital',
                                                 Pipeline(steps=[('selector',
                                                                  FeatureSelector(column='marital')),
                                                                 ('ohe',
                                                                  OHEEncoder(key='marital'))])),
                                                ('education',
                                                 Pipeline(steps=[('selector',
                       

In [8]:
pipeline.steps

[('features',
  FeatureUnion(transformer_list=[('job',
                                  Pipeline(steps=[('selector',
                                                   FeatureSelector(column='job')),
                                                  ('ohe',
                                                   OHEEncoder(key='job'))])),
                                 ('marital',
                                  Pipeline(steps=[('selector',
                                                   FeatureSelector(column='marital')),
                                                  ('ohe',
                                                   OHEEncoder(key='marital'))])),
                                 ('education',
                                  Pipeline(steps=[('selector',
                                                   FeatureSelector(column='education')),
                                                  ('ohe',
                                                   OHEEncoder(key='educ

Save the model (the whole pipeline):

In [9]:
with open("catboost_pipeline.dill", "wb") as f:
    dill.dump(pipeline, f)
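
A saved pipeline is typically loaded back with `dill.load` and scored on held-out data. The sketch below shows that round trip under stated assumptions: the data is synthetic, `LogisticRegression` stands in for CatBoost to keep the example light, and the file name `toy_pipeline.dill` is hypothetical. For the real pipeline, the custom transformer classes still need to be importable or serialized along with it, which is a common reason to prefer `dill` over plain `pickle` in notebooks.

```python
import os
import tempfile

import dill
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data: the target depends mostly on the first feature.
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 3))
y = (X[:, 0] + 0.5 * rng.normal(size=200) > 0).astype(int)

# Stand-in pipeline (LogisticRegression instead of CatBoost).
model = Pipeline([("scale", StandardScaler()), ("clf", LogisticRegression())])
model.fit(X[:150], y[:150])

# Save to a temporary location, then load it back.
path = os.path.join(tempfile.mkdtemp(), "toy_pipeline.dill")
with open(path, "wb") as f:
    dill.dump(model, f)

with open(path, "rb") as f:
    restored = dill.load(f)

# Score the restored pipeline on the held-out rows using
# the predicted probability of the positive class.
proba = restored.predict_proba(X[150:])[:, 1]
auc = roc_auc_score(y[150:], proba)
print(round(auc, 3))
```

In the notebook itself the same pattern would apply to `catboost_pipeline.dill` together with the `X_test.csv` and `y_test.csv` files written earlier.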