### Lesson 3. Building reliable validation schemes and optimizing target metrics

### -- Author: Шенк Евгений Станиславович

### Homework 3:
Main task:  
You are given a training sample and a test sample. The task is to try different validation schemes, analyze the pros and cons of each, and conclude which scheme is the most stable for this problem. The quality metric for evaluating predictions is ROC-AUC; the target variable is isFraud. Gradient boosting models are recommended; any implementation and any hyperparameters may be used. Attention! The assignment_2_test.csv sample is our analogue of a leaderboard. We will simulate submitting a solution to the leaderboard and compare the metric on the leaderboard with the metric on local validation. Using this sample for any other purpose is forbidden!  

Terminology used in the assignment:  
* training sample - the sample passed to the `fit` / `train` method;
* validation sample - the sample obtained with a Hold-Out split into 2 samples (`train`, `valid`);
* test sample - the sample obtained with a Hold-Out split into 3 samples (`train`, `valid`, `test`);
* LB - leaderboard, the `assignment_2_test.csv` sample.

Task 1: perform Hold-Out validation with a split whose size you consider adequate; split by transaction ID (TransactionID); train a gradient boosting model of any implementation, selecting the number of trees with the early_stopping criterion until convergence. Evaluate the model on the validation sample and assess the gap between the quality on the training and validation samples. Evaluate the quality on the LB and compare it with the quality on training and validation. Draw conclusions.  

Task 2: perform Hold-Out validation with a split into 3 samples, splitting by transaction ID (TransactionID); choose the size of each sample yourself. Repeat the procedure from item 1 for each sample.  

Task 3: build a confidence interval on the data from item 2 using bootstrap samples, and assess the model's LB score relative to this confidence interval. Draw conclusions.  

Task 4: perform Adversarial Validation: select the objects from the training sample that closely resemble the objects from assignment_2_test.csv and use them as the validation set. Evaluate the model on the LB and draw conclusions about the results.  

Task 5: perform KFold / StratifiedKFold validation (your choice); evaluate the resulting quality and the spread of the quality metric. Draw conclusions about the stability of cross-validation and about how the cross-validation estimate agrees with the estimate on the held-out data. Evaluate the quality on the LB and draw conclusions.  

Task 6 (optional): perform Hold-Out validation by time (TransactionDT) and repeat the procedures from item 1 / item 2 (your choice). Build a confidence interval and compare the quality on the LB sample with it. Draw conclusions.  

Task 7 (entirely optional): this dataset contains the transaction ID (TransactionID) and the transaction time (TransactionDT), but no ID of the client who made the transactions. Validation by client seems likely to work well in this problem. Propose a criterion for identifying clients and repeat item 5 with validation by client (GroupKFold), using your definition of a client.  
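
One possible starting point for Task 7, sketched under assumptions: a client is approximated by the combination of `card1`, `addr1` and `P_emaildomain`. These columns are present in the dataset, but choosing this particular combination is an assumption, not something the assignment prescribes. The helper names `client_key` and `assign_group_folds` are hypothetical and the rows are illustrative. The sketch only demonstrates the property that validation by client needs: all transactions of one client land in the same fold.

```python
import zlib
from collections import defaultdict


def client_key(row: dict) -> str:
    """Hypothetical client identifier: card and address fields joined together."""
    parts = (row.get("card1"), row.get("addr1"), row.get("P_emaildomain"))
    return "|".join("NA" if p is None else str(p) for p in parts)


def assign_group_folds(rows: list, n_splits: int = 5) -> list:
    """Give every row a fold number so that one client never spans two folds."""
    # crc32 is stable between runs, unlike the built-in hash() for strings
    return [zlib.crc32(client_key(r).encode("utf-8")) % n_splits for r in rows]


# Illustrative rows (not real clients); rows 1 and 2 share the same key
rows = [
    {"card1": 13926, "addr1": 315.0, "P_emaildomain": None},
    {"card1": 2755, "addr1": 325.0, "P_emaildomain": "gmail.com"},
    {"card1": 2755, "addr1": 325.0, "P_emaildomain": "gmail.com"},
    {"card1": 12473, "addr1": 299.0, "P_emaildomain": "aol.com"},
]
folds = assign_group_folds(rows, n_splits=3)

# Collect the set of folds seen for every client key
fold_by_client = defaultdict(set)
for row, fold in zip(rows, folds):
    fold_by_client[client_key(row)].add(fold)
```

In the notebook, the same key could be computed per row and passed as the `groups` argument of `GroupKFold.split(X, y, groups)`.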

In [1]:
import numpy as np
import pandas as pd
import lightgbm as lgb
import matplotlib.pyplot as plt
from tqdm import tqdm
from typing import List, Tuple

from sklearn.model_selection import KFold, StratifiedKFold, GroupKFold, train_test_split, cross_val_score, GroupShuffleSplit
from sklearn.metrics import roc_auc_score
from sklearn.preprocessing import OrdinalEncoder

import warnings
warnings.filterwarnings('ignore')

%matplotlib inline

In [2]:
pd.options.display.max_columns = 400

### Task 1. Hold-Out 2 parts

In [3]:
data = pd.read_csv("../data/assignment_2_train.csv")
lb_dataset = pd.read_csv("../data/assignment_2_test.csv")

print("data.shape = {} rows, {} cols".format(*data.shape))
print("lb_dataset.shape = {} rows, {} cols".format(*lb_dataset.shape))

data.shape = 180000 rows, 394 cols
lb_dataset.shape = 100001 rows, 394 cols


In [4]:
data.sort_values(by='TransactionID', ascending=True, inplace=True)

In [5]:
data.head(2)

Unnamed: 0,TransactionID,isFraud,TransactionDT,TransactionAmt,ProductCD,card1,card2,card3,card4,card5,card6,addr1,addr2,dist1,dist2,P_emaildomain,R_emaildomain,C1,C2,C3,C4,C5,C6,C7,C8,C9,C10,C11,C12,C13,C14,D1,D2,D3,D4,D5,D6,D7,D8,D9,D10,D11,D12,D13,D14,D15,M1,M2,M3,M4,M5,M6,M7,M8,M9,V1,V2,V3,V4,V5,V6,V7,V8,V9,V10,V11,V12,V13,V14,V15,V16,V17,V18,V19,V20,V21,V22,V23,V24,V25,V26,V27,V28,V29,V30,V31,V32,V33,V34,V35,V36,V37,V38,V39,V40,V41,V42,V43,V44,V45,V46,V47,V48,V49,V50,V51,V52,V53,V54,V55,V56,V57,V58,V59,V60,V61,V62,V63,V64,V65,V66,V67,V68,V69,V70,V71,V72,V73,V74,V75,V76,V77,V78,V79,V80,V81,V82,V83,V84,V85,V86,V87,V88,V89,V90,V91,V92,V93,V94,V95,V96,V97,V98,V99,V100,V101,V102,V103,V104,V105,V106,V107,V108,V109,V110,V111,V112,V113,V114,V115,V116,V117,V118,V119,V120,V121,V122,V123,V124,V125,V126,V127,V128,V129,V130,V131,V132,V133,V134,V135,V136,V137,V138,V139,V140,V141,V142,V143,V144,V145,V146,V147,V148,V149,V150,V151,V152,V153,V154,V155,V156,V157,V158,V159,V160,V161,V162,V163,V164,V165,V166,V167,V168,V169,V170,V171,V172,V173,V174,V175,V176,V177,V178,V179,V180,V181,V182,V183,V184,V185,V186,V187,V188,V189,V190,V191,V192,V193,V194,V195,V196,V197,V198,V199,V200,V201,V202,V203,V204,V205,V206,V207,V208,V209,V210,V211,V212,V213,V214,V215,V216,V217,V218,V219,V220,V221,V222,V223,V224,V225,V226,V227,V228,V229,V230,V231,V232,V233,V234,V235,V236,V237,V238,V239,V240,V241,V242,V243,V244,V245,V246,V247,V248,V249,V250,V251,V252,V253,V254,V255,V256,V257,V258,V259,V260,V261,V262,V263,V264,V265,V266,V267,V268,V269,V270,V271,V272,V273,V274,V275,V276,V277,V278,V279,V280,V281,V282,V283,V284,V285,V286,V287,V288,V289,V290,V291,V292,V293,V294,V295,V296,V297,V298,V299,V300,V301,V302,V303,V304,V305,V306,V307,V308,V309,V310,V311,V312,V313,V314,V315,V316,V317,V318,V319,V320,V321,V322,V323,V324,V325,V326,V327,V328,V329,V330,V331,V332,V333,V334,V335,V336,V337,V338,V339
0,2987000,0,86400,68.5,W,13926,,150.0,discover,142.0,credit,315.0,87.0,19.0,,,,1.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,2.0,0.0,1.0,1.0,14.0,,13.0,,,,,,,13.0,13.0,,,,0.0,T,T,T,M2,F,T,,,,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,1.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,,,,,,,,,,,,,,,,,,,1.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0,117.0,0.0,0.0,0.0,0.0,0.0,117.0,0.0,0.0,0.0,0.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,117.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,117.0,0.0,0.0,0.0,0.0,,,,,,,,,,,,,,,,,,
1,2987001,0,86401,29.0,W,2755,404.0,150.0,mastercard,102.0,credit,325.0,87.0,,,gmail.com,,1.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,1.0,0.0,,,0.0,,,,,,0.0,,,,,0.0,,,,M0,T,T,,,,,,,,,,,,,,,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,1.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,1.0,0.0,0.0,1.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,,,,,,,,,,,,,,,,,,


In [6]:
lb_dataset.head(2)

Unnamed: 0,TransactionID,isFraud,TransactionDT,TransactionAmt,ProductCD,card1,card2,card3,card4,card5,card6,addr1,addr2,dist1,dist2,P_emaildomain,R_emaildomain,C1,C2,C3,C4,C5,C6,C7,C8,C9,C10,C11,C12,C13,C14,D1,D2,D3,D4,D5,D6,D7,D8,D9,D10,D11,D12,D13,D14,D15,M1,M2,M3,M4,M5,M6,M7,M8,M9,V1,V2,V3,V4,V5,V6,V7,V8,V9,V10,V11,V12,V13,V14,V15,V16,V17,V18,V19,V20,V21,V22,V23,V24,V25,V26,V27,V28,V29,V30,V31,V32,V33,V34,V35,V36,V37,V38,V39,V40,V41,V42,V43,V44,V45,V46,V47,V48,V49,V50,V51,V52,V53,V54,V55,V56,V57,V58,V59,V60,V61,V62,V63,V64,V65,V66,V67,V68,V69,V70,V71,V72,V73,V74,V75,V76,V77,V78,V79,V80,V81,V82,V83,V84,V85,V86,V87,V88,V89,V90,V91,V92,V93,V94,V95,V96,V97,V98,V99,V100,V101,V102,V103,V104,V105,V106,V107,V108,V109,V110,V111,V112,V113,V114,V115,V116,V117,V118,V119,V120,V121,V122,V123,V124,V125,V126,V127,V128,V129,V130,V131,V132,V133,V134,V135,V136,V137,V138,V139,V140,V141,V142,V143,V144,V145,V146,V147,V148,V149,V150,V151,V152,V153,V154,V155,V156,V157,V158,V159,V160,V161,V162,V163,V164,V165,V166,V167,V168,V169,V170,V171,V172,V173,V174,V175,V176,V177,V178,V179,V180,V181,V182,V183,V184,V185,V186,V187,V188,V189,V190,V191,V192,V193,V194,V195,V196,V197,V198,V199,V200,V201,V202,V203,V204,V205,V206,V207,V208,V209,V210,V211,V212,V213,V214,V215,V216,V217,V218,V219,V220,V221,V222,V223,V224,V225,V226,V227,V228,V229,V230,V231,V232,V233,V234,V235,V236,V237,V238,V239,V240,V241,V242,V243,V244,V245,V246,V247,V248,V249,V250,V251,V252,V253,V254,V255,V256,V257,V258,V259,V260,V261,V262,V263,V264,V265,V266,V267,V268,V269,V270,V271,V272,V273,V274,V275,V276,V277,V278,V279,V280,V281,V282,V283,V284,V285,V286,V287,V288,V289,V290,V291,V292,V293,V294,V295,V296,V297,V298,V299,V300,V301,V302,V303,V304,V305,V306,V307,V308,V309,V310,V311,V312,V313,V314,V315,V316,V317,V318,V319,V320,V321,V322,V323,V324,V325,V326,V327,V328,V329,V330,V331,V332,V333,V334,V335,V336,V337,V338,V339
0,3287000,1,7415038,226.0,W,12473,555.0,150.0,visa,226.0,credit,299.0,87.0,116.0,,aol.com,,2.0,3.0,0.0,0.0,0.0,5.0,0.0,0.0,3.0,0.0,3.0,2.0,6.0,2.0,4.0,4.0,0.0,4.0,3.0,,,,,4.0,4.0,,,,3.0,T,T,F,M0,T,F,F,F,T,1.0,2.0,2.0,1.0,3.0,1.0,1.0,1.0,2.0,0.0,0.0,2.0,2.0,1.0,0.0,0.0,0.0,0.0,2.0,5.0,0.0,0.0,1.0,1.0,1.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,2.0,2.0,0.0,0.0,1.0,0.0,0.0,1.0,1.0,1.0,2.0,0.0,0.0,0.0,1.0,1.0,2.0,2.0,2.0,2.0,0.0,0.0,0.0,0.0,2.0,5.0,0.0,0.0,1.0,1.0,2.0,0.0,0.0,0.0,0.0,0.0,1.0,2.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,1.0,3.0,0.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0,7.0,7.0,0.0,1.0,1.0,2.0,6.0,6.0,0.0,0.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,452.0,1482.0,1482.0,0.0,206.0,206.0,452.0,1276.0,1276.0,0.0,0.0,0.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,2.0,14.0,7.0,9.0,15.0,0.0,2.0,0.0,2.0,1.0,2.0,1.0,2.0,2.0,2.0,12.0,12.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,452.0,2924.0,2924.0,0.0,412.0,0.0,412.0,206.0,412.0,412.0,452.0,2512.0,2512.0,0.0,0.0,0.0,,,,,,,,,,,,,,,,,,
1,3287001,0,7415054,3072.0,W,15651,417.0,150.0,visa,226.0,debit,330.0,87.0,,,yahoo.com,,1.0,1.0,0.0,0.0,1.0,1.0,0.0,0.0,1.0,0.0,1.0,0.0,1.0,1.0,0.0,,0.0,0.0,0.0,,,,,0.0,,,,,0.0,,,,,,T,,,,,,,,,,,,,,,1.0,1.0,1.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,1.0,1.0,1.0,1.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,1.0,0.0,0.0,1.0,0.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0,0.0,0.0,1.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,1.0,1.0,1.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,1.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,1.0,1.0,1.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,1.0,1.0,0.0,2.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0,2.0,2.0,0.0,0.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0,0.0,0.0,1.0,3059.949951,3059.949951,3059.949951,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,3059.949951,3059.949951,3059.949951,,,,,,,,,,,,,,,,,,


In [7]:
numerical_features = data.drop(['isFraud'], axis=1).select_dtypes(include=[np.number]).columns
categorical_features = data.select_dtypes(include=["object"]).columns

In [8]:
# Split features and target in a single call so that rows stay aligned
x_train, x_valid, y_train, y_valid = train_test_split(
    data.drop(['isFraud'], axis=1), data["isFraud"],
    train_size=0.8, shuffle=True, random_state=2177,
)

x_train = x_train[numerical_features]
x_valid = x_valid[numerical_features]

print("x_train.shape = {} rows, {} cols".format(*x_train.shape))
print("x_valid.shape = {} rows, {} cols".format(*x_valid.shape))

x_train.shape = 144000 rows, 379 cols
x_valid.shape = 36000 rows, 379 cols


In [9]:
set(x_train["TransactionID"].unique()) & (set(x_valid["TransactionID"].unique()))

set()

In [10]:
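# Keyword arguments for fit(): validation metric, logging period and early stopping.
# Passing "verbose" and "early_stopping_rounds" to fit() works in LightGBM < 4.0;
# newer releases expect the lgb.log_evaluation and lgb.early_stopping callbacks instead.
params = {
    "eval_metric": "auc",
    "verbose": 50,
    "early_stopping_rounds": 25,
}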

In [11]:
model_lgb_1 = lgb.LGBMClassifier(n_estimators=500, seed=2177)

In [12]:
model_lgb_1.fit(x_train, y_train, 
                eval_set=(x_valid, y_valid),
                **params)

Training until validation scores don't improve for 25 rounds
[50]	valid_0's auc: 0.925357	valid_0's binary_logloss: 0.0660707
[100]	valid_0's auc: 0.940357	valid_0's binary_logloss: 0.0599541
[150]	valid_0's auc: 0.945258	valid_0's binary_logloss: 0.0569468
[200]	valid_0's auc: 0.947712	valid_0's binary_logloss: 0.0552523
[250]	valid_0's auc: 0.950237	valid_0's binary_logloss: 0.0537638
[300]	valid_0's auc: 0.951637	valid_0's binary_logloss: 0.0527156
[350]	valid_0's auc: 0.95333	valid_0's binary_logloss: 0.0517707
[400]	valid_0's auc: 0.954379	valid_0's binary_logloss: 0.0509542
[450]	valid_0's auc: 0.955543	valid_0's binary_logloss: 0.0504493
Early stopping, best iteration is:
[461]	valid_0's auc: 0.95578	valid_0's binary_logloss: 0.0503164


LGBMClassifier(n_estimators=500, seed=2177)

In [13]:
### Train
# predict() returns hard 0/1 labels, so this AUC is computed on labels, not on probabilities
roc_auc_score(y_train, model_lgb_1.predict(x_train))

0.9057862674966389

In [14]:
### Valid
roc_auc_score(y_valid, model_lgb_1.predict(x_valid))

0.7770641021206961

In [15]:
### Check on the LB dataset
roc_auc_score(lb_dataset['isFraud'], model_lgb_1.predict(lb_dataset.drop(['isFraud'], axis=1)[numerical_features]))

0.6529725500204971

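### Conclusions
train - 0.905786  
valid - 0.777064  
lb - 0.652972  
The gap between train and valid is substantial (about 0.13), which points to overfitting. The LB score is well below valid (by about 0.12) and even further below train (by about 0.25). Note that all three scores are computed from the hard labels returned by `predict`, which is why valid (0.777) is far below the probability-based AUC of 0.956 reported in the early-stopping log. The result is unsatisfactory, so we move on.  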
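
ROC-AUC is a ranking metric, so it is normally computed from scores or probabilities rather than from thresholded labels. The dependency-free sketch below uses a hypothetical helper `roc_auc` (pairwise definition, ties counted as 0.5) and made-up toy values to show how thresholding at 0.5 discards ranking information and changes the metric.

```python
def roc_auc(y_true, y_score):
    """ROC-AUC as the share of (positive, negative) pairs ranked correctly; ties count as 0.5."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy data (illustrative only): true labels and predicted probabilities
y_true = [0, 0, 0, 0, 1, 0, 1, 1]
proba = [0.05, 0.10, 0.20, 0.30, 0.35, 0.40, 0.60, 0.90]

# Hard labels, as a 0.5 threshold would produce them
labels = [int(p >= 0.5) for p in proba]

auc_from_proba = roc_auc(y_true, proba)
auc_from_labels = roc_auc(y_true, labels)
print(f"AUC on probabilities: {auc_from_proba:.4f}")
print(f"AUC on hard labels:   {auc_from_labels:.4f}")
```

With the model above, the probability-based variant would be `roc_auc_score(y_valid, model_lgb_1.predict_proba(x_valid)[:, 1])`, which should agree with the `valid_0's auc` value in the training log.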

### Task 2. Hold-Out 3 parts

In [16]:
# First split: train vs. the rest (features and target are split together)
x_train, x_valid, y_train, y_valid = train_test_split(
    data.drop(['isFraud'], axis=1), data["isFraud"],
    train_size=0.75, shuffle=True, random_state=2177,
)

x_train = x_train[numerical_features]
x_valid = x_valid[numerical_features]

# Second split: the rest is divided into valid and test
x_valid, x_test, y_valid, y_test = train_test_split(
    x_valid, y_valid, train_size=0.75, shuffle=True, random_state=2177,
)

print("x_train.shape = {} rows, {} cols".format(*x_train.shape))
print("x_valid.shape = {} rows, {} cols".format(*x_valid.shape))
print("x_test.shape = {} rows, {} cols".format(*x_test.shape))

x_train.shape = 135000 rows, 379 cols
x_valid.shape = 33750 rows, 379 cols
x_test.shape = 11250 rows, 379 cols


In [17]:
model_lgb_2 = lgb.LGBMClassifier(n_estimators=500, seed=2177)

In [18]:
model_lgb_2.fit(x_train, y_train, 
                eval_set=(x_valid, y_valid),
                **params)

Training until validation scores don't improve for 25 rounds
[50]	valid_0's auc: 0.922897	valid_0's binary_logloss: 0.0665576
[100]	valid_0's auc: 0.934427	valid_0's binary_logloss: 0.0612892
[150]	valid_0's auc: 0.940218	valid_0's binary_logloss: 0.0584157
[200]	valid_0's auc: 0.944383	valid_0's binary_logloss: 0.0562879
[250]	valid_0's auc: 0.946136	valid_0's binary_logloss: 0.055016
[300]	valid_0's auc: 0.948081	valid_0's binary_logloss: 0.0538104
[350]	valid_0's auc: 0.951082	valid_0's binary_logloss: 0.0525569
[400]	valid_0's auc: 0.95263	valid_0's binary_logloss: 0.0516261
[450]	valid_0's auc: 0.953922	valid_0's binary_logloss: 0.0508516
[500]	valid_0's auc: 0.955416	valid_0's binary_logloss: 0.0501041
Did not meet early stopping. Best iteration is:
[500]	valid_0's auc: 0.955416	valid_0's binary_logloss: 0.0501041


LGBMClassifier(n_estimators=500, seed=2177)

In [19]:
### Train
# Note: this call scores model_lgb_1 (the Task 1 model), not model_lgb_2
roc_auc_score(y_train, model_lgb_1.predict(x_train))

0.9050552521729186

In [20]:
### Valid
# Note: this call scores model_lgb_1 (the Task 1 model), not model_lgb_2
roc_auc_score(y_valid, model_lgb_1.predict(x_valid))

0.7961109764117802

In [21]:
### Check on test
roc_auc_score(y_test, model_lgb_2.predict(x_test))

0.7984561569648038

In [22]:
### Check on the LB dataset
roc_auc_score(lb_dataset['isFraud'], model_lgb_2.predict(lb_dataset.drop(['isFraud'], axis=1)[numerical_features]))

0.6481790177615604

### Conclusions
train - 0.905055  
valid - 0.796110  
test - 0.798456  
lb - 0.648179  
The gap between train and valid is about the same as in Task 1. Note that train and valid above were scored with model_lgb_1 from Task 1; only test and lb use model_lgb_2. The gap between valid and test is small (0.796 vs 0.798), so a split into 3 parts, one of which takes no part in training at all, is more reliable. However, the gap between test and lb is still very large (0.798 vs 0.648): a random split (shuffle=True) is NOT a good choice for this dataset. (With the time-based split in Task 6 the situation improves.)

### Task 3. Bootstrap

In [23]:
def create_bootstrap_samples(data: np.ndarray, n_samples: int = 1000) -> np.ndarray:
    """
    Create bootstrap samples.

    Parameters
    ----------
    data: np.ndarray
        Original sample used to create
        the bootstrap samples.

    n_samples: int, optional, default = 1000
        Number of bootstrap samples to create.
        Optional parameter, equal to 1000 by default.

    Returns
    -------
    bootstrap_idx: np.ndarray
        Matrix of indices for building the bootstrap samples.

    """
    bootstrap_idx = np.random.randint(
        low=0, high=len(data), size=(n_samples, len(data))
    )
    return bootstrap_idx


def create_bootstrap_metrics(y_true: np.ndarray,
                             y_pred: np.ndarray,
                             metric: callable,
                             n_samples: int = 1000) -> List[float]:
    """
    Compute bootstrap estimates of a metric.

    Parameters
    ----------
    y_true: np.ndarray
        Target vector.

    y_pred: np.ndarray
        Vector of predictions.

    metric: callable
        Function that computes the metric.
        It must take 2 arguments: y_true, y_pred.

    n_samples: int, optional, default = 1000
        Number of bootstrap samples to create.
        Optional parameter, equal to 1000 by default.

    Returns
    -------
    bootstrap_metrics: List[float]
        List with the metric value on each bootstrap sample.

    """
    scores = []

    if isinstance(y_true, pd.Series):
        y_true = y_true.values

    # Pass n_samples through so that the argument actually takes effect
    bootstrap_idx = create_bootstrap_samples(y_true, n_samples)
    for idx in bootstrap_idx:
        y_true_bootstrap = y_true[idx]
        y_pred_bootstrap = y_pred[idx]

        score = metric(y_true_bootstrap, y_pred_bootstrap)
        scores.append(score)

    return scores


def calculate_confidence_interval(scores: list, conf_interval: float = 0.95) -> Tuple[float, float]:
    """
    Compute a percentile confidence interval.

    Parameters
    ----------
    scores: List[float / int]
        List with estimates of the quantity of interest.

    conf_interval: float, optional, default = 0.95
        Confidence level used to build the interval.
        Optional parameter, equal to 0.95 by default.

    Returns
    -------
    conf_interval: Tuple[float, float]
        Tuple with the bounds of the confidence interval.

    """
    left_bound = np.percentile(
        scores, ((1 - conf_interval) / 2) * 100
    )
    right_bound = np.percentile(
        scores, (conf_interval + ((1 - conf_interval) / 2)) * 100
    )

    return left_bound, right_bound

In [24]:
np.random.seed(2177)
scores = create_bootstrap_metrics(y_test, model_lgb_2.predict(x_test), roc_auc_score)

calculate_confidence_interval(scores)

(0.7719425836952748, 0.8235648601494142)

In [25]:
### Check on the LB dataset
roc_auc_score(lb_dataset['isFraud'], model_lgb_2.predict(lb_dataset.drop(['isFraud'], axis=1)[numerical_features]))

0.6481790177615604

### Conclusions
The LB score (0.648) falls outside the confidence interval (0.772, 0.824). As in the previous task, the reason is that a random split (shuffle=True) is NOT a good choice for this dataset.
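
The comparison with the leaderboard can be made explicit. A small sketch, assuming a hypothetical helper `position_vs_interval`; the numbers are the interval and the LB score printed above.

```python
def position_vs_interval(score: float, interval: tuple) -> tuple:
    """Return where the score lies relative to the interval and the distance to the nearest bound."""
    left, right = interval
    if score < left:
        return "below", left - score
    if score > right:
        return "above", score - right
    return "inside", 0.0


test_interval = (0.7719425836952748, 0.8235648601494142)  # bootstrap interval on test
lb_score = 0.6481790177615604                             # ROC-AUC on the LB dataset

position, gap = position_vs_interval(lb_score, test_interval)
interval_width = test_interval[1] - test_interval[0]

# Express the gap in units of the interval width to judge how far off the LB score is
print(position, round(gap, 4), round(gap / interval_width, 2))
```

Measuring the gap in units of the interval width shows how far the leaderboard score is from what the bootstrap considers plausible, rather than merely that it lies outside.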

### Task 4. Adversarial Validation

We drop 'TransactionID' and 'TransactionDT': the classifier separates the two samples perfectly on them, since both simply increase over time.

In [26]:
lb_db = lb_dataset.drop(['isFraud'], axis=1)[numerical_features]
data_adv = data.drop(['isFraud'], axis=1)[numerical_features]

x_adv = pd.concat([
    data_adv, lb_db], axis=0
)
x_adv.drop(['TransactionID', 'TransactionDT'], axis=1, inplace=True)
y_adv = np.hstack((np.zeros(data_adv.shape[0]), np.ones(lb_db.shape[0])))
assert x_adv.shape[0] == y_adv.shape[0]

In [27]:
model_adv = lgb.LGBMClassifier(n_estimators=25)
model_adv.fit(x_adv, y_adv)

LGBMClassifier(n_estimators=25)

In [28]:
y_pred_adv = model_adv.predict_proba(x_adv)
score = roc_auc_score(y_adv, y_pred_adv[:, 1])
print(round(score, 4))

0.8833


In [29]:
y_pred = model_adv.predict_proba(x_train.drop(['TransactionID', 'TransactionDT'], axis=1))

In [30]:
pd.cut(
    y_pred[:, 1], bins=np.arange(0, 1.01, 0.1)
).value_counts().sort_index()

(0.0, 0.1]    10292
(0.1, 0.2]    76957
(0.2, 0.3]    16915
(0.3, 0.4]    14746
(0.4, 0.5]    11103
(0.5, 0.6]     4537
(0.6, 0.7]       74
(0.7, 0.8]       36
(0.8, 0.9]      104
(0.9, 1.0]      236
dtype: int64

In [31]:
y_pred[:, 1] > 0.5

array([False, False, False, ..., False, False, False])

In [32]:
x_valid_adv = x_train.loc[y_pred[:, 1] >= 0.5]
y_valid_adv = y_train.loc[y_pred[:, 1] >= 0.5]
x_train_adv = x_train.loc[y_pred[:, 1] < 0.5]
y_train_adv = y_train.loc[y_pred[:, 1] < 0.5]

In [33]:
model_lgb_4 = lgb.LGBMClassifier(n_estimators=500, seed=2177)

In [34]:
model_lgb_4.fit(x_train_adv, y_train_adv, 
                eval_set=(x_valid_adv, y_valid_adv),
                **params)

Training until validation scores don't improve for 25 rounds
[50]	valid_0's auc: 0.920079	valid_0's binary_logloss: 0.0452587
Early stopping, best iteration is:
[44]	valid_0's auc: 0.921014	valid_0's binary_logloss: 0.0456134


LGBMClassifier(n_estimators=500, seed=2177)

In [35]:
### Train
roc_auc_score(y_train_adv, model_lgb_4.predict(x_train_adv))

0.718753574819543

In [36]:
### Valid
roc_auc_score(y_valid_adv, model_lgb_4.predict(x_valid_adv))

0.723549419000886

In [37]:
### Check on the LB dataset
roc_auc_score(lb_dataset['isFraud'], model_lgb_4.predict(lb_dataset.drop(['isFraud'], axis=1)[numerical_features]))

0.6342812001073812

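### Conclusions
Done. The gap between valid and lb has narrowed (about 0.09 here versus about 0.15 in Task 2), so local validation is now closer to the leaderboard, although the LB score itself (0.634) is not higher than before.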
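
A fixed 0.5 threshold leaves the size of the validation set to chance (here roughly 5,000 of 135,000 rows, judging by the histogram above). An alternative, shown as a sketch and not as what the notebook does, is to take a fixed share of the most test-like rows. The helper `split_by_adversarial_score` and the scores are hypothetical.

```python
def split_by_adversarial_score(scores, valid_fraction=0.2):
    """Return (train_idx, valid_idx); valid_idx holds the rows that look most like the test set."""
    n_valid = max(1, int(round(len(scores) * valid_fraction)))
    # Positions sorted by adversarial probability, most test-like first
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    valid_idx = sorted(order[:n_valid])
    train_idx = sorted(order[n_valid:])
    return train_idx, valid_idx


# Illustrative adversarial probabilities P(row comes from the test set)
adv_scores = [0.12, 0.55, 0.08, 0.91, 0.33, 0.47, 0.15, 0.62, 0.18, 0.05]

train_idx, valid_idx = split_by_adversarial_score(adv_scores, valid_fraction=0.3)
print("train rows:", train_idx)
print("valid rows:", valid_idx)
```

The returned values are positional indices, so with a DataFrame they would be applied through `.iloc`.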

### Task 5. StratifiedKFold

In [38]:
from sklearn.base import clone


def make_cross_validation(X: pd.DataFrame,
                          y: pd.Series,
                          estimator: object,
                          params: dict,
                          metric: callable,
                          cv_strategy):
    """
    Cross-validation.

    Parameters
    ----------
    X: pd.DataFrame
        Feature matrix.

    y: pd.Series
        Target vector.

    estimator: object
        Model object to train; a fresh copy is fitted on every fold.

    params: dict
        Parameters passed to the estimator's fit method.

    metric: callable
        Metric for evaluating the quality of the solution.
        A function taking 2 arguments is expected: y_true, y_pred.

    cv_strategy: cross-validation generator
        Object describing the cross-validation strategy.
        A KFold or StratifiedKFold object is expected.

    Returns
    -------
    estimators: List[object]
        Models fitted on each fold.

    oof_score: float
        Value of the quality metric on the OOF predictions.

    fold_train_scores: List[float]
        Value of the quality metric on each training fold.

    fold_valid_scores: List[float]
        Value of the quality metric on each validation fold.

    oof_predictions: np.array
        OOF predictions.

    """
    estimators, fold_train_scores, fold_valid_scores = [], [], []
    oof_predictions = np.zeros(X.shape[0])

    for fold_number, (train_idx, valid_idx) in enumerate(cv_strategy.split(X, y)):
        # split() yields positional indices, hence iloc
        x_train, x_valid = X.iloc[train_idx], X.iloc[valid_idx]
        y_train, y_valid = y.iloc[train_idx], y.iloc[valid_idx]

        # Clone so that every fold keeps its own fitted model
        fold_estimator = clone(estimator)
        fold_estimator.fit(x_train, y_train, 
                           eval_set=(x_valid, y_valid),
                           **params)
        y_train_pred = fold_estimator.predict(x_train)
        y_valid_pred = fold_estimator.predict(x_valid)

        fold_train_scores.append(metric(y_train, y_train_pred))
        fold_valid_scores.append(metric(y_valid, y_valid_pred))
        oof_predictions[valid_idx] = y_valid_pred

        msg = (
            f"Fold: {fold_number+1}, train-observations = {len(train_idx)}, "
            f"valid-observations = {len(valid_idx)}\n"
            f"train-score = {round(fold_train_scores[fold_number], 4)}, "
            f"valid-score = {round(fold_valid_scores[fold_number], 4)}" 
        )
        print(msg)
        print("="*69)
        estimators.append(fold_estimator)

    oof_score = metric(y, oof_predictions)
    print(f"CV-results train: {round(np.mean(fold_train_scores), 4)} +/- {round(np.std(fold_train_scores), 3)}")
    print(f"CV-results valid: {round(np.mean(fold_valid_scores), 4)} +/- {round(np.std(fold_valid_scores), 3)}")
    print(f"OOF-score = {round(oof_score, 4)}")

    return estimators, oof_score, fold_train_scores, fold_valid_scores, oof_predictions
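
Per-fold models are only useful if each fold keeps its own fitted object. The toy sketch below illustrates why: refitting one object and storing it repeatedly leaves a list of references to the last fit. `ToyModel` is a hypothetical stand-in for an estimator, and `copy.deepcopy` plays the role that `sklearn.base.clone` would play for a scikit-learn compatible model.

```python
import copy


class ToyModel:
    """Stand-in for an estimator: fit() just remembers the mean of the targets."""

    def __init__(self):
        self.mean_ = None

    def fit(self, y):
        self.mean_ = sum(y) / len(y)
        return self


# Illustrative target values for three folds
folds = [[0, 0, 1], [1, 1, 1], [0, 0, 0]]

# Pitfall: one object is refitted and appended on every fold
template = ToyModel()
shared = []
for y_fold in folds:
    template.fit(y_fold)
    shared.append(template)

# Fix: fit an independent copy of the unfitted template on every fold
unfitted = ToyModel()
independent = [copy.deepcopy(unfitted).fit(y_fold) for y_fold in folds]

print([m.mean_ for m in shared])       # every entry reflects the last fold only
print([m.mean_ for m in independent])  # one distinct fit per fold
```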

In [39]:
# First split: train vs. the rest (features and target are split together)
x_train, x_valid, y_train, y_valid = train_test_split(
    data.drop(['isFraud'], axis=1), data["isFraud"],
    train_size=0.75, shuffle=True, random_state=2177,
)

# Reset the index so that row labels run from 0 to n - 1 again
y_train = y_train.reset_index(drop=True)
y_valid = y_valid.reset_index(drop=True)
x_train = x_train[numerical_features].reset_index(drop=True)
x_valid = x_valid[numerical_features].reset_index(drop=True)

# Second split: the rest is divided into valid and test
x_valid, x_test, y_valid, y_test = train_test_split(
    x_valid, y_valid, train_size=0.75, shuffle=True, random_state=2177,
)

print("x_train.shape = {} rows, {} cols".format(*x_train.shape))
print("x_valid.shape = {} rows, {} cols".format(*x_valid.shape))
print("x_test.shape = {} rows, {} cols".format(*x_test.shape))

x_train.shape = 135000 rows, 379 cols
x_valid.shape = 33750 rows, 379 cols
x_test.shape = 11250 rows, 379 cols


In [40]:
model_lgb_5 = lgb.LGBMClassifier(n_estimators=500, seed=2177)
### Model for CV
model_5 = lgb.LGBMClassifier(n_estimators=500, seed=2177)

In [41]:
model_lgb_5.fit(x_train, y_train, 
                eval_set=(x_valid, y_valid),
                **params)

Training until validation scores don't improve for 25 rounds
[50]	valid_0's auc: 0.922897	valid_0's binary_logloss: 0.0665576
[100]	valid_0's auc: 0.934427	valid_0's binary_logloss: 0.0612892
[150]	valid_0's auc: 0.940218	valid_0's binary_logloss: 0.0584157
[200]	valid_0's auc: 0.944383	valid_0's binary_logloss: 0.0562879
[250]	valid_0's auc: 0.946136	valid_0's binary_logloss: 0.055016
[300]	valid_0's auc: 0.948081	valid_0's binary_logloss: 0.0538104
[350]	valid_0's auc: 0.951082	valid_0's binary_logloss: 0.0525569
[400]	valid_0's auc: 0.95263	valid_0's binary_logloss: 0.0516261
[450]	valid_0's auc: 0.953922	valid_0's binary_logloss: 0.0508516
[500]	valid_0's auc: 0.955416	valid_0's binary_logloss: 0.0501041
Did not meet early stopping. Best iteration is:
[500]	valid_0's auc: 0.955416	valid_0's binary_logloss: 0.0501041


LGBMClassifier(n_estimators=500, seed=2177)

#### StratifiedKFold

In [42]:
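# Without shuffle=True the folds are deterministic, so random_state is not needed
# (recent scikit-learn versions raise an error if it is set while shuffle=False)
cv_strategy = StratifiedKFold(n_splits=5)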

estimators, oof_score, fold_train_scores, fold_valid_scores, oof_predictions = make_cross_validation(
    x_train, y_train, model_5, params, metric=roc_auc_score, cv_strategy=cv_strategy
)

Training until validation scores don't improve for 25 rounds
[50]	valid_0's auc: 0.919757	valid_0's binary_logloss: 0.0669143
[100]	valid_0's auc: 0.93084	valid_0's binary_logloss: 0.061332
Early stopping, best iteration is:
[99]	valid_0's auc: 0.930952	valid_0's binary_logloss: 0.0613069
Fold: 1, train-observations = 108000, valid-observations = 27000
train-score = 0.7742, valid-score = 0.7247
Training until validation scores don't improve for 25 rounds
[50]	valid_0's auc: 0.904801	valid_0's binary_logloss: 0.0693576
[100]	valid_0's auc: 0.920171	valid_0's binary_logloss: 0.0637327
[150]	valid_0's auc: 0.923557	valid_0's binary_logloss: 0.0611433
[200]	valid_0's auc: 0.925908	valid_0's binary_logloss: 0.0596512
[250]	valid_0's auc: 0.92814	valid_0's binary_logloss: 0.0584122
[300]	valid_0's auc: 0.929355	valid_0's binary_logloss: 0.0574889
Early stopping, best iteration is:
[305]	valid_0's auc: 0.929633	valid_0's binary_logloss: 0.0573748
Fold: 2, train-observations = 108000, valid-ob

#### Eval

In [43]:
### Train
roc_auc_score(y_train, model_lgb_5.predict(x_train))

0.917964937862153

In [44]:
### Valid
roc_auc_score(y_valid, model_lgb_5.predict(x_valid))

0.7817604367666687

In [45]:
### Check on test
roc_auc_score(y_test, model_lgb_5.predict(x_test))

0.7984561569648038

In [46]:
### Check on the LB dataset
roc_auc_score(lb_dataset['isFraud'], model_lgb_5.predict(lb_dataset.drop(['isFraud'], axis=1)[numerical_features]))

0.6481790177615604

### Conclusions
Cross-validation gives 0.7439 +/- 0.011 (mean +/- standard deviation over the folds), a narrow spread, so on the one hand the estimate is stable. On the other hand, neither the score on the held-out test sample (0.798456) nor, even more so, the LB score (0.648179) falls within this range. Apparently there is a problem with how the dataset is split.

### Task 6. Time-based split (field 'TransactionDT')

In [47]:
data.sort_values(by='TransactionDT', ascending=True, inplace=True)

In [48]:
# Chronological split (shuffle=False): train comes first, then the rest
x_train, x_valid, y_train, y_valid = train_test_split(
    data.drop(['isFraud'], axis=1), data["isFraud"],
    train_size=0.75, shuffle=False,
)

x_train = x_train[numerical_features]
x_valid = x_valid[numerical_features]

# The rest is divided, again in time order, into valid and test
x_valid, x_test, y_valid, y_test = train_test_split(
    x_valid, y_valid, train_size=0.75, shuffle=False,
)

print("x_train.shape = {} rows, {} cols".format(*x_train.shape))
print("x_valid.shape = {} rows, {} cols".format(*x_valid.shape))
print("x_test.shape = {} rows, {} cols".format(*x_test.shape))

x_train.shape = 135000 rows, 379 cols
x_valid.shape = 33750 rows, 379 cols
x_test.shape = 11250 rows, 379 cols


In [49]:
set(x_train["TransactionDT"].unique()) & (set(x_valid["TransactionDT"].unique()))

set()

In [50]:
(set(x_valid["TransactionDT"].unique())) & (set(x_test["TransactionDT"].unique()))

set()
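
The two checks above confirm that no timestamp is shared between adjacent parts. A chronological split can also enforce this by construction. The sketch below is an illustration under assumptions: `time_split` is a hypothetical helper, and the timestamps are made up apart from the first two, which come from the head of the training data.

```python
def time_split(timestamps, train_frac=0.75):
    """Return (train_idx, holdout_idx) so that every training row precedes every hold-out row in time."""
    order = sorted(range(len(timestamps)), key=lambda i: timestamps[i])
    cut = int(len(order) * train_frac)
    # Move the cut forward while the boundary would separate rows with equal timestamps
    while 0 < cut < len(order) and timestamps[order[cut]] == timestamps[order[cut - 1]]:
        cut += 1
    return order[:cut], order[cut:]


# Illustrative TransactionDT values with one tie (86499) at the naive boundary
transaction_dt = [86400, 86401, 86469, 86499, 86499, 86506, 86510, 86522]

train_idx, holdout_idx = time_split(transaction_dt, train_frac=0.5)
latest_train = max(transaction_dt[i] for i in train_idx)
earliest_holdout = min(transaction_dt[i] for i in holdout_idx)
print(train_idx, holdout_idx, latest_train, earliest_holdout)
```

Applying the same helper to the hold-out part once more would give a valid / test pair with the same guarantee.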

In [51]:
model_lgb_6 = lgb.LGBMClassifier(n_estimators=500, seed=2177)
### Model for CV
model_6 = lgb.LGBMClassifier(n_estimators=500, seed=2177)

In [52]:
model_lgb_6.fit(x_train, y_train, 
                eval_set=(x_valid, y_valid),
                **params)

Training until validation scores don't improve for 25 rounds
[50]	valid_0's auc: 0.889916	valid_0's binary_logloss: 0.0936729
[100]	valid_0's auc: 0.900924	valid_0's binary_logloss: 0.0896224
[150]	valid_0's auc: 0.904311	valid_0's binary_logloss: 0.0888089
[200]	valid_0's auc: 0.905983	valid_0's binary_logloss: 0.0881964
[250]	valid_0's auc: 0.907066	valid_0's binary_logloss: 0.0879782
Early stopping, best iteration is:
[235]	valid_0's auc: 0.907597	valid_0's binary_logloss: 0.087622


LGBMClassifier(n_estimators=500, seed=2177)

In [53]:
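# shuffle is off, so random_state has no effect and is dropped
# (recent scikit-learn versions raise ValueError if it is set).
cv_strategy = StratifiedKFold(n_splits=5)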

estimators, oof_score, fold_train_scores, fold_valid_scores, oof_predictions = make_cross_validation(
    x_train, y_train, model_6, params, metric=roc_auc_score, cv_strategy=cv_strategy
)

Training until validation scores don't improve for 25 rounds
Early stopping, best iteration is:
[11]	valid_0's auc: 0.824223	valid_0's binary_logloss: 0.0990629
Fold: 1, train-observations = 108000, valid-observations = 27000
train-score = 0.6524, valid-score = 0.5704
Training until validation scores don't improve for 25 rounds
Early stopping, best iteration is:
[6]	valid_0's auc: 0.864261	valid_0's binary_logloss: 0.108433
Fold: 2, train-observations = 108000, valid-observations = 27000
train-score = 0.6382, valid-score = 0.6301
Training until validation scores don't improve for 25 rounds
Early stopping, best iteration is:
[18]	valid_0's auc: 0.893164	valid_0's binary_logloss: 0.0730625
Fold: 3, train-observations = 108000, valid-observations = 27000
train-score = 0.6639, valid-score = 0.6481
Training until validation scores don't improve for 25 rounds
[50]	valid_0's auc: 0.896692	valid_0's binary_logloss: 0.0747098
Early stopping, best iteration is:
[25]	valid_0's auc: 0.888299	valid

In [54]:
print(f'{np.mean(fold_valid_scores) - np.std(fold_valid_scores)}, {np.mean(fold_valid_scores) + np.std(fold_valid_scores)}')

0.5827848111441667, 0.6582821238609649


In [55]:
### Train
# Note: predict() returns class labels, so this AUC is computed on thresholded predictions.
roc_auc_score(y_train, model_lgb_6.predict(x_train))

0.8517302624995997

In [56]:
### Valid
roc_auc_score(y_valid, model_lgb_6.predict(x_valid))

0.7188534372619785

In [57]:
### Check on test
roc_auc_score(y_test, model_lgb_6.predict(x_test))

0.6600130709318778

In [58]:
### Check on the LB dataset
roc_auc_score(lb_dataset['isFraud'], model_lgb_6.predict(lb_dataset.drop(['isFraud'], axis=1)[numerical_features]))

0.6369634874640081
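
The training log reports a validation AUC of 0.907597, while the `roc_auc_score` cell gives 0.7189 on the same validation set. One likely reason is that `predict` on an `LGBMClassifier` returns class labels, whereas the logged AUC is computed from scores. The sketch below uses a hand-written pairwise AUC (the function name `roc_auc` and the toy data are made up for illustration) to show how much information is lost when scores are replaced by labels. With a fitted model, passing `predict_proba(x)[:, 1]` to the metric would be the usual way to score by probability.

```python
def roc_auc(y_true, y_score):
    """Pairwise ROC-AUC: share of (positive, negative) pairs ranked correctly.

    A tie between a positive and a negative counts as half a correct pair.
    """
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    correct = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                correct += 1.0
            elif p == n:
                correct += 0.5
    return correct / (len(pos) * len(neg))


# Toy data (made up): 2 fraud cases among 8 transactions.
y_true = [0, 0, 0, 0, 0, 0, 1, 1]
proba = [0.02, 0.05, 0.10, 0.15, 0.20, 0.30, 0.35, 0.60]

# Hard labels at a 0.5 threshold, standing in for what predict() returns.
labels = [int(p > 0.5) for p in proba]

auc_proba = roc_auc(y_true, proba)
auc_labels = roc_auc(y_true, labels)
print(f"AUC on probabilities: {auc_proba}, AUC on labels: {auc_labels}")
```

In this toy case the probabilities rank every fraud case above every legitimate one, yet the label-based score drops because one fraud case falls below the threshold and ties with all negatives.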

### Conclusions
The result here is much better: test (0.6600) and LB (0.6370) are close to each other. LB falls inside the confidence interval 0.6205 +/- 0.038 (from 0.5828 to 0.6583), and test lies just above its upper bound. The interval is fairly narrow, so the validation is stable.
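
The hold-out split in this task is ordered by time, but the cross-validation folds come from `StratifiedKFold`, which does not guarantee that training rows precede validation rows. One possible alternative, not used in the original notebook, is scikit-learn's `TimeSeriesSplit`. The sketch below runs it on made-up timestamps that stand in for the sorted 'TransactionDT' column.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Toy timestamps (made up) standing in for the sorted 'TransactionDT' column.
transaction_dt = np.arange(86400, 86400 + 20).reshape(-1, 1)

tscv = TimeSeriesSplit(n_splits=4)
folds = list(tscv.split(transaction_dt))

for fold, (train_idx, valid_idx) in enumerate(folds, start=1):
    last_train = transaction_dt[train_idx].max()
    first_valid = transaction_dt[valid_idx].min()
    print(f"Fold {fold}: train={len(train_idx)}, valid={len(valid_idx)}, "
          f"last train DT={last_train}, first valid DT={first_valid}")
```

Each fold trains on an expanding window of earlier rows and validates on the block that follows, which mirrors how the LB data would follow the training period if it is later in time.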

### Task 7

### Solution:
Let's try to identify users from the card fields ('card1', 'card2', 'card3', 'card4', 'card5', 'card6').  
Create a 'user_id' field with OrdinalEncoder.  
Split the dataset with GroupShuffleSplit.  
Then run cross-validation with GroupKFold.  
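
The steps above can be sketched on a small made-up frame. This is an illustration under assumptions, not the notebook's code: two card columns stand in for 'card1'..'card6', and `groupby(...).ngroup()` is swapped in for the string conversion plus OrdinalEncoder used in the cells of this task.

```python
import pandas as pd
from sklearn.model_selection import GroupKFold, GroupShuffleSplit

# Toy frame (made up): two card columns stand in for 'card1'..'card6'.
toy = pd.DataFrame({
    "card1": [1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6],
    "card4": ["visa", "visa", "mc", "mc", "visa", "visa",
              "mc", "mc", "visa", "visa", "mc", "mc"],
    "isFraud": [0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0],
})

# One integer id per unique combination of card fields.
toy["user_id"] = toy.groupby(["card1", "card4"]).ngroup()

# Hold-out split that keeps every user entirely on one side.
gss = GroupShuffleSplit(n_splits=1, train_size=0.66, random_state=2177)
train_idx, valid_idx = next(gss.split(toy, toy["isFraud"], groups=toy["user_id"]))
train_users = set(toy["user_id"].iloc[train_idx])
valid_users = set(toy["user_id"].iloc[valid_idx])
print("shared users in hold-out split:", train_users & valid_users)

# Group-aware cross-validation: no user appears in both parts of a fold.
gkf = GroupKFold(n_splits=3)
fold_overlaps = []
for tr, va in gkf.split(toy, toy["isFraud"], groups=toy["user_id"]):
    overlap = set(toy["user_id"].iloc[tr]) & set(toy["user_id"].iloc[va])
    fold_overlaps.append(overlap)
print("shared users per fold:", fold_overlaps)
```

On the real data some card fields are missing, so rows with NaN in the key columns would need explicit handling; the notebook's string conversion treats `nan` as a value of its own.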

In [59]:
data['user_id'] = data[['card1', 'card2', 'card3', 'card4', 'card5', 'card6']].values.tolist()
data['user_id'] = data['user_id'].astype(str)
data.head(2)

Unnamed: 0,TransactionID,isFraud,TransactionDT,TransactionAmt,ProductCD,card1,card2,card3,card4,card5,card6,addr1,addr2,dist1,dist2,P_emaildomain,R_emaildomain,C1,C2,C3,C4,C5,C6,C7,C8,C9,C10,C11,C12,C13,C14,D1,D2,D3,D4,D5,D6,D7,D8,D9,D10,D11,D12,D13,D14,D15,M1,M2,M3,M4,M5,M6,M7,M8,M9,V1,V2,V3,V4,V5,V6,V7,V8,V9,V10,V11,V12,V13,V14,V15,V16,V17,V18,V19,V20,V21,V22,V23,V24,V25,V26,V27,V28,V29,V30,V31,V32,V33,V34,V35,V36,V37,V38,V39,V40,V41,V42,V43,V44,V45,V46,V47,V48,V49,V50,V51,V52,V53,V54,V55,V56,V57,V58,V59,V60,V61,V62,V63,V64,V65,V66,V67,V68,V69,V70,V71,V72,V73,V74,V75,V76,V77,V78,V79,V80,V81,V82,V83,V84,V85,V86,V87,V88,V89,V90,V91,V92,V93,V94,V95,V96,V97,V98,V99,V100,V101,V102,V103,V104,V105,V106,V107,V108,V109,V110,V111,V112,V113,V114,V115,V116,V117,V118,V119,V120,V121,V122,V123,V124,V125,V126,V127,V128,V129,V130,V131,V132,V133,V134,V135,V136,V137,V138,V139,V140,V141,V142,V143,V144,V145,V146,V147,V148,V149,V150,V151,V152,V153,V154,V155,V156,V157,V158,V159,V160,V161,V162,V163,V164,V165,V166,V167,V168,V169,V170,V171,V172,V173,V174,V175,V176,V177,V178,V179,V180,V181,V182,V183,V184,V185,V186,V187,V188,V189,V190,V191,V192,V193,V194,V195,V196,V197,V198,V199,V200,V201,V202,V203,V204,V205,V206,V207,V208,V209,V210,V211,V212,V213,V214,V215,V216,V217,V218,V219,V220,V221,V222,V223,V224,V225,V226,V227,V228,V229,V230,V231,V232,V233,V234,V235,V236,V237,V238,V239,V240,V241,V242,V243,V244,V245,V246,V247,V248,V249,V250,V251,V252,V253,V254,V255,V256,V257,V258,V259,V260,V261,V262,V263,V264,V265,V266,V267,V268,V269,V270,V271,V272,V273,V274,V275,V276,V277,V278,V279,V280,V281,V282,V283,V284,V285,V286,V287,V288,V289,V290,V291,V292,V293,V294,V295,V296,V297,V298,V299,V300,V301,V302,V303,V304,V305,V306,V307,V308,V309,V310,V311,V312,V313,V314,V315,V316,V317,V318,V319,V320,V321,V322,V323,V324,V325,V326,V327,V328,V329,V330,V331,V332,V333,V334,V335,V336,V337,V338,V339,user_id
0,2987000,0,86400,68.5,W,13926,,150.0,discover,142.0,credit,315.0,87.0,19.0,,,,1.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,2.0,0.0,1.0,1.0,14.0,,13.0,,,,,,,13.0,13.0,,,,0.0,T,T,T,M2,F,T,,,,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,1.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,,,,,,,,,,,,,,,,,,,1.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0,117.0,0.0,0.0,0.0,0.0,0.0,117.0,0.0,0.0,0.0,0.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,117.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,117.0,0.0,0.0,0.0,0.0,,,,,,,,,,,,,,,,,,,"[13926, nan, 150.0, 'discover', 142.0, 'credit']"
1,2987001,0,86401,29.0,W,2755,404.0,150.0,mastercard,102.0,credit,325.0,87.0,,,gmail.com,,1.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,1.0,0.0,,,0.0,,,,,,0.0,,,,,0.0,,,,M0,T,T,,,,,,,,,,,,,,,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,1.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,1.0,0.0,0.0,1.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,,,,,,,,,,,,,,,,,,,"[2755, 404.0, 150.0, 'mastercard', 102.0, 'cre..."


In [60]:
ordinal_enc = OrdinalEncoder()
data['user_id'] = ordinal_enc.fit_transform(data[['user_id']])
np.save('ordinal_enc_cats.npy', ordinal_enc.categories_)
data.head(2)

Unnamed: 0,TransactionID,isFraud,TransactionDT,TransactionAmt,ProductCD,card1,card2,card3,card4,card5,card6,addr1,addr2,dist1,dist2,P_emaildomain,R_emaildomain,C1,C2,C3,C4,C5,C6,C7,C8,C9,C10,C11,C12,C13,C14,D1,D2,D3,D4,D5,D6,D7,D8,D9,D10,D11,D12,D13,D14,D15,M1,M2,M3,M4,M5,M6,M7,M8,M9,V1,V2,V3,V4,V5,V6,V7,V8,V9,V10,V11,V12,V13,V14,V15,V16,V17,V18,V19,V20,V21,V22,V23,V24,V25,V26,V27,V28,V29,V30,V31,V32,V33,V34,V35,V36,V37,V38,V39,V40,V41,V42,V43,V44,V45,V46,V47,V48,V49,V50,V51,V52,V53,V54,V55,V56,V57,V58,V59,V60,V61,V62,V63,V64,V65,V66,V67,V68,V69,V70,V71,V72,V73,V74,V75,V76,V77,V78,V79,V80,V81,V82,V83,V84,V85,V86,V87,V88,V89,V90,V91,V92,V93,V94,V95,V96,V97,V98,V99,V100,V101,V102,V103,V104,V105,V106,V107,V108,V109,V110,V111,V112,V113,V114,V115,V116,V117,V118,V119,V120,V121,V122,V123,V124,V125,V126,V127,V128,V129,V130,V131,V132,V133,V134,V135,V136,V137,V138,V139,V140,V141,V142,V143,V144,V145,V146,V147,V148,V149,V150,V151,V152,V153,V154,V155,V156,V157,V158,V159,V160,V161,V162,V163,V164,V165,V166,V167,V168,V169,V170,V171,V172,V173,V174,V175,V176,V177,V178,V179,V180,V181,V182,V183,V184,V185,V186,V187,V188,V189,V190,V191,V192,V193,V194,V195,V196,V197,V198,V199,V200,V201,V202,V203,V204,V205,V206,V207,V208,V209,V210,V211,V212,V213,V214,V215,V216,V217,V218,V219,V220,V221,V222,V223,V224,V225,V226,V227,V228,V229,V230,V231,V232,V233,V234,V235,V236,V237,V238,V239,V240,V241,V242,V243,V244,V245,V246,V247,V248,V249,V250,V251,V252,V253,V254,V255,V256,V257,V258,V259,V260,V261,V262,V263,V264,V265,V266,V267,V268,V269,V270,V271,V272,V273,V274,V275,V276,V277,V278,V279,V280,V281,V282,V283,V284,V285,V286,V287,V288,V289,V290,V291,V292,V293,V294,V295,V296,V297,V298,V299,V300,V301,V302,V303,V304,V305,V306,V307,V308,V309,V310,V311,V312,V313,V314,V315,V316,V317,V318,V319,V320,V321,V322,V323,V324,V325,V326,V327,V328,V329,V330,V331,V332,V333,V334,V335,V336,V337,V338,V339,user_id
0,2987000,0,86400,68.5,W,13926,,150.0,discover,142.0,credit,315.0,87.0,19.0,,,,1.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,2.0,0.0,1.0,1.0,14.0,,13.0,,,,,,,13.0,13.0,,,,0.0,T,T,T,M2,F,T,,,,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,1.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,,,,,,,,,,,,,,,,,,,1.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0,117.0,0.0,0.0,0.0,0.0,0.0,117.0,0.0,0.0,0.0,0.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,117.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,117.0,0.0,0.0,0.0,0.0,,,,,,,,,,,,,,,,,,,2432.0
1,2987001,0,86401,29.0,W,2755,404.0,150.0,mastercard,102.0,credit,325.0,87.0,,,gmail.com,,1.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,1.0,0.0,,,0.0,,,,,,0.0,,,,,0.0,,,,M0,T,T,,,,,,,,,,,,,,,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,1.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,1.0,0.0,0.0,1.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,,,,,,,,,,,,,,,,,,,5705.0


In [61]:
len(data['user_id'].unique())

9786

In [62]:
gs = GroupShuffleSplit(n_splits=1, train_size=.66, random_state=2177)

# Train / valid split: all rows of a given user_id stay on one side.
train_idx, valid_idx  = next(gs.split(data.drop(['isFraud'], axis=1), data["isFraud"], groups=data.user_id))

x_train = data.drop(['isFraud'], axis=1).iloc[train_idx]
x_valid = data.drop(['isFraud'], axis=1).iloc[valid_idx]
y_train = data["isFraud"].iloc[train_idx]
y_valid = data["isFraud"].iloc[valid_idx]

print(f'user_id in train and valid:{set(x_train["user_id"].unique()) & (set(x_valid["user_id"].unique()))}')

# Split the valid part again into valid / test with the same splitter.
valid_idx, test_idx  = next(gs.split(x_valid, y_valid, groups=x_valid.user_id))

x_test = x_valid.iloc[test_idx]
x_valid = x_valid.iloc[valid_idx]
y_test = y_valid.iloc[test_idx]
y_valid = y_valid.iloc[valid_idx]

print(f'user_id in test and valid:{set(x_test["user_id"].unique()) & (set(x_valid["user_id"].unique()))}')

# Keep numeric columns only; user_id is still included here and is dropped before fitting.
new_feat = data.drop(['isFraud'], axis=1).select_dtypes(include=[np.number]).columns
x_train = x_train[new_feat].reset_index(drop=True)
x_valid = x_valid[new_feat].reset_index(drop=True)
x_test = x_test[new_feat].reset_index(drop=True)

y_train = y_train.reset_index(drop=True)
y_valid = y_valid.reset_index(drop=True)
y_test = y_test.reset_index(drop=True)

print("x_train.shape = {} rows, {} cols".format(*x_train.shape))
print("x_valid.shape = {} rows, {} cols".format(*x_valid.shape))
print("x_test.shape = {} rows, {} cols".format(*x_test.shape))

user_id in train and valid:set()
user_id in test and valid:set()
x_train.shape = 122005 rows, 380 cols
x_valid.shape = 43805 rows, 380 cols
x_test.shape = 14190 rows, 380 cols


In [63]:
model_lgb_7 = lgb.LGBMClassifier(n_estimators=500, seed=2177)
### Model for CV
model_7 = lgb.LGBMClassifier(n_estimators=500, seed=2177)

In [64]:
from sklearn.base import clone


def make_GroupKFold_cross_validation(X: pd.DataFrame,
                                      y: pd.Series,
                                      by_groups: str,
                                      estimator: object,
                                      params: dict,
                                      metric: callable,
                                      cv_strategy):
    """
    Cross-validation with a grouping column.

    Parameters
    ----------
    X: pd.DataFrame
        Feature matrix; must contain the column named in by_groups.

    y: pd.Series
        Target vector.

    by_groups: str
        Name of the column in X whose values define the groups.

    estimator: object
        Model object to train; a fresh clone is fitted on each fold.

    params: dict
        Keyword arguments passed to estimator.fit.

    metric: callable
        Metric used to evaluate the solution.
        Expected to be a function of two arguments: y_true, y_pred.

    cv_strategy: cross-validation generator
        Object describing the cross-validation strategy.
        The groups are passed to its split method; only group-aware
        splitters such as GroupKFold actually use them.

    Returns
    -------
    estimators: List[object]
        Models fitted on each fold.

    oof_score: float
        Metric value on the OOF predictions.

    fold_train_scores: List[float]
        Metric value on the training part of each fold.

    fold_valid_scores: List[float]
        Metric value on the validation part of each fold.

    oof_predictions: np.array
        OOF predictions.

    """
    estimators, fold_train_scores, fold_valid_scores = [], [], []
    oof_predictions = np.zeros(X.shape[0])

    for fold_number, (train_idx, valid_idx) in enumerate(cv_strategy.split(X, y, groups=X[by_groups])):
        # Positional indexing; the grouping column is not used as a feature.
        x_train = X.iloc[train_idx].drop(columns=[by_groups])
        x_valid = X.iloc[valid_idx].drop(columns=[by_groups])
        y_train, y_valid = y.iloc[train_idx], y.iloc[valid_idx]

        # Clone so that each fold keeps its own fitted model.
        fold_estimator = clone(estimator)
        fold_estimator.fit(x_train, y_train,
              eval_set=(x_valid, y_valid),
              **params)
        # predict() returns class labels, so the metric is computed on labels.
        y_train_pred = fold_estimator.predict(x_train)
        y_valid_pred = fold_estimator.predict(x_valid)

        fold_train_scores.append(metric(y_train, y_train_pred))
        fold_valid_scores.append(metric(y_valid, y_valid_pred))
        oof_predictions[valid_idx] = y_valid_pred

        msg = (
            f"Fold: {fold_number+1}, train-observations = {len(train_idx)}, "
            f"valid-observations = {len(valid_idx)}\n"
            f"train-score = {round(fold_train_scores[fold_number], 4)}, "
            f"valid-score = {round(fold_valid_scores[fold_number], 4)}"
        )
        print(msg)
        print("="*69)
        estimators.append(fold_estimator)

    oof_score = metric(y, oof_predictions)
    print(f"CV-results train: {round(np.mean(fold_train_scores), 4)} +/- {round(np.std(fold_train_scores), 3)}")
    print(f"CV-results valid: {round(np.mean(fold_valid_scores), 4)} +/- {round(np.std(fold_valid_scores), 3)}")
    print(f"OOF-score = {round(oof_score, 4)}")

    return estimators, oof_score, fold_train_scores, fold_valid_scores, oof_predictions

In [65]:
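# Note: StratifiedKFold ignores the groups argument; GroupKFold(n_splits=5)
# would be needed to keep each user_id inside a single fold.
# random_state is dropped because shuffle is off
# (recent scikit-learn versions raise ValueError if it is set).
cv_strategy = StratifiedKFold(n_splits=5)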

estimators, oof_score, fold_train_scores, fold_valid_scores, oof_predictions = make_GroupKFold_cross_validation(
    x_train, y_train, 'user_id', model_7, params, metric=roc_auc_score, cv_strategy=cv_strategy
)

Training until validation scores don't improve for 25 rounds
Early stopping, best iteration is:
[22]	valid_0's auc: 0.864163	valid_0's binary_logloss: 0.0942405
Fold: 1, train-observations = 97604, valid-observations = 24401
train-score = 0.7141, valid-score = 0.6054
Training until validation scores don't improve for 25 rounds
[50]	valid_0's auc: 0.888955	valid_0's binary_logloss: 0.0860413
Early stopping, best iteration is:
[29]	valid_0's auc: 0.896677	valid_0's binary_logloss: 0.0824086
Fold: 2, train-observations = 97604, valid-observations = 24401
train-score = 0.7178, valid-score = 0.688
Training until validation scores don't improve for 25 rounds
Early stopping, best iteration is:
[8]	valid_0's auc: 0.860975	valid_0's binary_logloss: 0.0957598
Fold: 3, train-observations = 97604, valid-observations = 24401
train-score = 0.6654, valid-score = 0.6265
Training until validation scores don't improve for 25 rounds
Early stopping, best iteration is:
[10]	valid_0's auc: 0.85624	valid_0's

In [66]:
print(f'{np.mean(fold_valid_scores) - np.std(fold_valid_scores)}, {np.mean(fold_valid_scores) + np.std(fold_valid_scores)}')

0.5508187771005971, 0.6775271569737821


In [67]:
x_train.drop(['user_id'], axis=1, inplace=True)
x_valid.drop(['user_id'], axis=1, inplace=True)
x_test.drop(['user_id'], axis=1, inplace=True)

In [68]:
model_lgb_7.fit(x_train, y_train, 
                eval_set=(x_valid, y_valid),
                **params)

Training until validation scores don't improve for 25 rounds
[50]	valid_0's auc: 0.878905	valid_0's binary_logloss: 0.0727789
Early stopping, best iteration is:
[45]	valid_0's auc: 0.879546	valid_0's binary_logloss: 0.0730245


LGBMClassifier(n_estimators=500, seed=2177)

#### Eval

In [69]:
### Train
# Note: predict() returns class labels, so this AUC is computed on thresholded predictions.
roc_auc_score(y_train, model_lgb_7.predict(x_train))

0.7448539979687759

In [70]:
### Valid
roc_auc_score(y_valid, model_lgb_7.predict(x_valid))

0.6525851889436612

In [71]:
### Check on test
roc_auc_score(y_test, model_lgb_7.predict(x_test))

0.6769730815235002

In [72]:
### Check on the LB dataset
roc_auc_score(lb_dataset['isFraud'], model_lgb_7.predict(lb_dataset.drop(['isFraud'], axis=1)[numerical_features]))

0.6338022167575481

### Conclusions
This split also gives a decent result: test (0.67697) and LB (0.633802) fall inside the confidence interval (0.6142 +/- 0.063), although the interval itself is wider than in Task 6.