**First approach:**

- use the filtered train set
- oversample with ADASYN
- run a grid search for CatBoost with cv=3 and scoring='roc_auc'
- filter the test set using the CSV with the list of features kept after filtering
- run predict_proba
- compute the metrics
- submit submission_version_1

Import the libraries

In [2]:
import pandas as pd
from catboost import CatBoostClassifier
from imblearn.over_sampling import ADASYN
from sklearn.metrics import f1_score, precision_score, recall_score, roc_auc_score
from sklearn.model_selection import GridSearchCV, train_test_split

In [35]:
VERSION = 1

Read the filtered data

In [3]:
X = pd.read_parquet("../../data/intermediate_data/filter_train.parquet")

In [4]:
print(X.shape)

(519615, 228)


In [5]:
X.head()

Unnamed: 0,feature547,feature418,feature867,feature15,feature641,feature734,feature549,feature674,feature442,feature763,...,feature717,feature844,feature380,feature550,feature864,feature727,feature454,feature369,feature367,feature409
0,37,77,7063,6,0,14,17,1,35,0,...,4,1283,0,49,1619,28,0,1,4,56
1,1,77,7063,135,0,14,17,0,35,153,...,4,1283,0,49,2092,28,6,1,4,56
2,37,77,7063,0,0,14,17,0,35,153,...,4,1283,0,49,7174,28,83,1,3,56
3,37,77,7063,0,0,14,17,0,35,1,...,4,1283,0,49,7174,28,0,1,2,56
4,37,77,7063,0,0,14,17,0,35,153,...,4,1283,0,49,1439,28,1,2,2,56


In [6]:
X.target.value_counts()

target
0    501078
1     18537
Name: count, dtype: int64

In [7]:
y = X["target"]
X.drop(columns=["target", "id"], inplace=True)

In [8]:
y.tail(3)

519612    0
519613    0
519614    1
Name: target, dtype: int64

Oversample with ADASYN

In [10]:
X_resampled, y_resampled = ADASYN().fit_resample(X, y)

In [11]:
X_resampled.shape

(995140, 226)
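
The shapes above imply how much synthetic data ADASYN produced. A small worked calculation, assuming ADASYN's default behaviour of adding rows only to the minority class (the variable names are illustrative):

```python
# Class counts before oversampling, taken from the value_counts() output.
n_negative = 501_078
n_positive = 18_537
n_before = n_negative + n_positive            # rows in the filtered train set

# Row count after ADASYN, taken from X_resampled.shape.
n_after = 995_140

positive_share = n_positive / n_before        # share of the minority class
n_synthetic = n_after - n_before              # rows generated by ADASYN
n_positive_after = n_positive + n_synthetic   # minority rows after resampling

print(f"positive share before: {positive_share:.4f}")
print(f"synthetic rows added:  {n_synthetic}")
print(f"positive rows after:   {n_positive_after}")
```

Under that assumption the minority class grows to 494,062 rows against 501,078 majority rows, so the result is close to balanced but not exactly 1:1.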

Create the CatBoost parameter grid and run the parameter search with cross-validation

In [12]:
catboost_parameters = {
    'depth'         : [4, 7, 10],
    'learning_rate' : [0.01, 0.03, 0.05],
    'iterations'    : [50, 250]
}
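
The cost of the search follows directly from the grid size. A short sketch that counts the combinations with scikit-learn's `ParameterGrid`, assuming the 3-fold cross-validation used in this notebook:

```python
from sklearn.model_selection import ParameterGrid

catboost_parameters = {
    "depth": [4, 7, 10],
    "learning_rate": [0.01, 0.03, 0.05],
    "iterations": [50, 250],
}

n_candidates = len(ParameterGrid(catboost_parameters))  # 3 * 3 * 2 combinations
n_folds = 3                                             # cv=3 in the grid search
n_fits = n_candidates * n_folds                         # models trained during the search

print(n_candidates, n_fits)
```

With the default `refit=True`, `GridSearchCV` trains one more model on the full training data after the search.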

Split the data into training and test sets

In [13]:
X_train, X_test, y_train, y_test = train_test_split(X_resampled,
                                                    y_resampled, 
                                                    test_size=0.2, 
                                                    stratify=y_resampled, 
                                                    random_state=23)
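
The split above is applied to the already oversampled data, so synthetic minority rows end up in both parts and the hold-out metrics describe the resampled distribution rather than the original one. A minimal sketch of the alternative order (split first, oversample only the training part) follows. It runs on toy data and uses plain random duplication via `sklearn.utils.resample` as a stand-in for ADASYN; `make_classification` and all variable names are assumptions for illustration.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

# Toy imbalanced data (roughly 5% positives) standing in for the filtered train.
X_toy, y_toy = make_classification(n_samples=2000, n_features=10,
                                   weights=[0.95, 0.05], random_state=23)
X_toy = pd.DataFrame(X_toy, columns=[f"feature{i}" for i in range(10)])
y_toy = pd.Series(y_toy, name="target")

# 1. Split first, so the hold-out part keeps the original class balance.
X_tr, X_ho, y_tr, y_ho = train_test_split(
    X_toy, y_toy, test_size=0.2, stratify=y_toy, random_state=23)

# 2. Oversample only the training part (random duplication, not ADASYN).
minority = X_tr[y_tr == 1]
n_extra = int((y_tr == 0).sum() - (y_tr == 1).sum())
extra = resample(minority, replace=True, n_samples=n_extra, random_state=23)
X_tr_bal = pd.concat([X_tr, extra])
y_tr_bal = pd.concat([y_tr, pd.Series(1, index=extra.index, name="target")])

print("train positives share:   ", round(y_tr_bal.mean(), 3))
print("hold-out positives share:", round(y_ho.mean(), 3))
```

With imbalanced-learn, the same ordering can be obtained by calling `ADASYN().fit_resample` on the training part only.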

In [15]:
grid_search_ct = GridSearchCV(estimator=CatBoostClassifier(),
                              param_grid=catboost_parameters,
                              scoring='roc_auc',
                              cv=3,
                              n_jobs=-1,
                              verbose=2)

In [16]:
grid_search_ct.fit(X_train, y_train)

Fitting 3 folds for each of 18 candidates, totalling 54 fits
0:	learn: 0.6874815	total: 2.49s	remaining: 10m 19s
1:	learn: 0.6818866	total: 5.28s	remaining: 10m 54s
2:	learn: 0.6764365	total: 9.39s	remaining: 12m 52s
3:	learn: 0.6712369	total: 13.5s	remaining: 13m 50s
4:	learn: 0.6661263	total: 17.9s	remaining: 14m 38s
5:	learn: 0.6593215	total: 20.4s	remaining: 13m 48s
6:	learn: 0.6544661	total: 25.4s	remaining: 14m 41s
7:	learn: 0.6480862	total: 29.9s	remaining: 15m 4s
8:	learn: 0.6444070	total: 33.2s	remaining: 14m 48s
9:	learn: 0.6407364	total: 39.1s	remaining: 15m 37s
10:	learn: 0.6363140	total: 42.3s	remaining: 15m 19s
11:	learn: 0.6328893	total: 44.2s	remaining: 14m 37s
12:	learn: 0.6269407	total: 46.1s	remaining: 14m
13:	learn: 0.6213928	total: 48.8s	remaining: 13m 42s
14:	learn: 0.6173857	total: 51.1s	remaining: 13m 20s
15:	learn: 0.6120618	total: 53.4s	remaining: 13m
16:	learn: 0.6090555	total: 55.4s	remaining: 12m 39s
17:	learn: 0.6047042	total: 58.3s	remaining: 12m 31s
...

In [17]:
print("the best estimator:\n", grid_search_ct.best_estimator_)
print("the best score:\n", grid_search_ct.best_score_)
print("the best parameters:\n", grid_search_ct.best_params_)

the best estimator:
 <catboost.core.CatBoostClassifier object at 0x7fbef51e2190>
the best score:
 0.9883840833194593
the best parameters:
 {'depth': 10, 'iterations': 250, 'learning_rate': 0.05}
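
All three best values sit on the upper edge of the grid (depth 10, 250 iterations, learning rate 0.05), so it may be worth looking at the full ranking before settling on them. A sketch of inspecting `cv_results_`, using a small `DecisionTreeClassifier` on toy data as a stand-in for the CatBoost search (the estimator, data, and grid here are assumptions):

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X_toy, y_toy = make_classification(n_samples=500, n_features=8, random_state=23)

search = GridSearchCV(
    estimator=DecisionTreeClassifier(random_state=23),
    param_grid={"max_depth": [2, 4, 6], "min_samples_leaf": [1, 10]},
    scoring="roc_auc",
    cv=3,
)
search.fit(X_toy, y_toy)

# One row per parameter combination, best mean ROC AUC first.
results = (
    pd.DataFrame(search.cv_results_)
    [["params", "mean_test_score", "std_test_score", "rank_test_score"]]
    .sort_values("rank_test_score")
    .reset_index(drop=True)
)
print(results)
```

The same `cv_results_` attribute is available on `grid_search_ct`; if the top rows are close in score, a cheaper configuration may perform about as well.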


Predict on the test split

In [18]:
pred = grid_search_ct.predict_proba(X_test)
pred = pred[:, 1]

In [19]:
for threshold in [0.1, 0.3, 0.5]:
    pred_binary = (pred >= threshold)
    
    print("threshold:", threshold)
    print("F1_SCORE:", f1_score(y_test, pred_binary))
    print("PRECISION:", precision_score(y_test, pred_binary))
    print("RECALL:", recall_score(y_test, pred_binary))
    print("ROC_AUC:", roc_auc_score(y_test, pred))
    print()

threshold: 0.1
F1_SCORE: 0.9501345841744844
PRECISION: 0.9295298764667157
RECALL: 0.9716734809537303
ROC_AUC: 0.9882289358010621

threshold: 0.3
F1_SCORE: 0.9793229115728896
PRECISION: 0.9965116644493562
RECALL: 0.9627170788973
ROC_AUC: 0.9882289358010621

threshold: 0.5
F1_SCORE: 0.9797114913163532
PRECISION: 0.9996734883035084
RECALL: 0.9605311095818322
ROC_AUC: 0.9882289358010621
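
Instead of checking three fixed thresholds, the whole precision/recall trade-off can be scanned at once. A minimal sketch with `precision_recall_curve`, run on made-up scores (`y_true` and `scores` are illustrative, not the notebook's predictions):

```python
import numpy as np
from sklearn.metrics import f1_score, precision_recall_curve

rng = np.random.default_rng(23)
y_true = rng.integers(0, 2, size=1000)
# Noisy scores that are higher on average for the positive class.
scores = np.clip(0.35 * y_true + rng.normal(0.4, 0.2, size=1000), 0, 1)

precision, recall, thresholds = precision_recall_curve(y_true, scores)

# precision/recall have one more element than thresholds; drop the last point.
denom = precision[:-1] + recall[:-1]
f1 = np.divide(2 * precision[:-1] * recall[:-1], denom,
               out=np.zeros_like(denom), where=denom > 0)

best_idx = int(np.argmax(f1))
best_thr = float(thresholds[best_idx])
print(f"threshold with the highest F1: {best_thr:.3f} (F1 = {f1[best_idx]:.3f})")

# Cross-check against f1_score computed at that threshold.
print("f1_score at that threshold:", f1_score(y_true, scores >= best_thr))
```

Which threshold is appropriate still depends on the relative cost of false positives and false negatives; maximizing F1 is only one possible criterion.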



In [20]:
best_threshold = 0.1

Save the submission

In [21]:
X_test_submit = pd.read_parquet("../../data/input_data/test_sber.parquet")
#X.drop(columns = ["sample_ml_new", "id"], inplace = True)

In [22]:
print(X_test_submit.shape)

(173433, 1078)


In [23]:
X_test_submit.head()

Unnamed: 0,id,sample_ml_new,feature1,feature2,feature3,feature4,feature5,feature6,feature7,feature8,...,feature1067,feature1068,feature1069,feature1070,feature1071,feature1072,feature1073,feature1074,feature1075,feature1076
0,3,3,1696,458,26,102479,22,16,0,121,...,779,7740,9577,254,0,355,308,779,7740,9577
1,4,3,1688,53,78,103922,191,64,0,0,...,79401,109240,153820,24766,48600,46029,65113,79401,109240,153820
2,12,3,1689,13,81,104111,191,4,0,0,...,0,0,0,0,0,0,0,0,0,0
3,16,3,1761,1759,44,102433,191,4,0,0,...,0,0,0,0,0,0,0,0,0,0
4,20,3,1761,1759,77,102010,191,34,0,0,...,0,0,0,0,0,0,0,0,0,0


In [24]:
filtered_columns = pd.read_csv("../../data/intermediate_data/cols_after_preprocessing.csv")

In [25]:
cols = list(set(filtered_columns.cols_after_preprocessing.values) & set(X_test_submit.columns))

In [26]:
X_filtered = X_test_submit[cols]

In [28]:
# drop() without inplace returns a new frame and avoids SettingWithCopyWarning
X_filtered = X_filtered.drop(columns=["id"])

In [29]:
print(X_filtered.shape)

(173433, 226)
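
A set intersection has no defined order, so `cols` can come out differently between sessions and is not guaranteed to match the column order used for training. A sketch of selecting the test columns in the training order instead, on toy frames (`train_features` and `test_raw` are hypothetical names; in this notebook the training order would be `X.columns`):

```python
import pandas as pd

# Hypothetical training feature order (in the notebook: list(X.columns)).
train_features = ["feature547", "feature418", "feature867"]

# Hypothetical raw test frame with extra columns in a different order.
test_raw = pd.DataFrame({
    "id": [3, 4],
    "feature867": [7063, 7063],
    "feature1": [1696, 1688],
    "feature547": [37, 1],
    "feature418": [77, 77],
})

# Fail early if the test set lacks a feature the model was trained on.
missing = [c for c in train_features if c not in test_raw.columns]
if missing:
    raise KeyError(f"test data is missing features: {missing}")

# Selecting by list follows the training order and yields a new frame.
test_aligned = test_raw[train_features].copy()
print(list(test_aligned.columns))
```

Selecting by an explicit list also drops `id` and any other non-feature columns in one step.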


In [30]:
X_filtered.head()

Unnamed: 0,feature239,feature598,feature483,feature752,feature230,feature831,feature576,feature423,feature868,feature398,...,feature760,feature312,feature326,feature584,feature781,feature777,feature185,feature514,feature780,feature23
0,0,0,16,17,0,1306,8,17,3709,1,...,0,657,1222,0,79,0,9945,87,14,9
1,0,0,16,17,0,0,0,17,1980,0,...,12,657,1222,0,79,0,66467,1,14,9
2,0,0,16,17,0,1306,8,17,3821,0,...,2,657,1222,0,79,2,87996,87,14,9
3,0,0,16,17,0,1306,8,17,4122,0,...,0,657,1222,0,79,1,234812,87,14,9
4,0,0,16,17,0,1306,8,17,2540,0,...,0,657,1222,0,79,0,234812,87,14,9


In [31]:
pred = grid_search_ct.predict_proba(X_filtered)

pred = pred[:, 1]
pred_binary = (pred >= best_threshold).astype(int)

In [32]:
submission = pd.read_csv("../../data/intermediate_data/sample_submission.csv")
submission.head(10)

Unnamed: 0,id,target_bin,target_prob
0,3,0,0.03
1,4,0,0.03
2,12,1,0.03
3,16,1,0.03
4,20,0,0.03
5,23,0,0.03
6,26,0,0.03
7,50,0,0.03
8,51,1,0.03
9,53,0,0.03


In [33]:
submission["target_prob"] = pred
submission["target_bin"] = pred_binary
submission.head(10)

Unnamed: 0,id,target_bin,target_prob
0,3,0,0.056115
1,4,0,0.035944
2,12,0,0.01621
3,16,0,0.017151
4,20,0,0.074867
5,23,0,0.021514
6,26,0,0.028442
7,50,1,0.119715
8,51,0,0.031713
9,53,0,0.016059
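
Assigning `pred` into the sample submission relies on its rows being in the same order as the test file. A minimal sketch of building the frame from the test ids directly, which removes that assumption (toy values; `test_ids` and `proba` are illustrative names):

```python
import numpy as np
import pandas as pd

test_ids = pd.Series([3, 4, 12, 16, 20], name="id")     # ids from the test file
proba = np.array([0.056, 0.036, 0.016, 0.017, 0.120])   # made-up probabilities
threshold = 0.1                                          # same role as best_threshold

submission_toy = pd.DataFrame({
    "id": test_ids.to_numpy(),
    "target_bin": (proba >= threshold).astype(int),
    "target_prob": proba,
})

# Sanity checks before writing the file.
assert len(submission_toy) == len(test_ids)
assert submission_toy["id"].is_unique
assert submission_toy["target_prob"].between(0, 1).all()
print(submission_toy)
```

In the notebook itself, a one-line check such as comparing `submission["id"]` with `X_test_submit["id"]` would confirm the same thing before the assignment.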


In [None]:
submission.to_csv(f"../../data/output_data/submission_version_{VERSION}_{best_threshold}.csv", index=False)