Урок 5. Feature Engineering, Feature Selection, part I

Задание 0: выбрать любую модель машнного обучения и зафиксировать любой тип валидации. Обучить базовую модель и зафиксировать базовое качество модели. В каждом следующем задании нужно будет обучить выбранную модель и оценивать ее качество на зафиксированной схеме валидации. После каждого задания, требуется сделать вывод о достигаемом качестве модели, по сравнению с качестом из предыдущего шага.

Задание 1: признак TransactionDT - это смещение в секундах относительно базовой даты. Базовая дата - 2017-12-01, преобразовать признак TransactionDT в datetime, прибавив к базовой дате исходное значение признака. Из полученного признака выделить год, месяц, день недели, час, день.

Задание 2: сделать конкатенацию признаков
* card1 + card2;
* card1 + card2 + card_3 + card_5;
* card1 + card2 + card_3 + card_5 + addr1 + addr2

Рассматривать их как категориальных признаки.

Задание 3: Сделать FrequencyEncoder для признаков card1 - card6, addr1, addr2.

Задание 4: Создать признаки на основе отношения: TransactionAmt к вычисленной статистике. Статистика - среднее значение / стандартное отклонение TransactionAmt, сгруппированное по card1 - card6, addr1, addr2, и по признакам, созданным в задании 2.

Задание 5: Создать признаки на основе отношения: D15 к вычисленной статистике. Статистика - среднее значение / стандартное отклонение D15, сгруппированное по card1 - card6, addr1, addr2, и по признакам, созданным в задании 2.

Задание 6: выделить дробную часть и целую часть признака TransactionAmt в два отдельных признака. После создать отдельных признак - логарифм от TransactionAmt

Задание 7 (опция): выполнить предварительную подготовку / очистку признаков P_emaildomain и R_emaildomain (что и как делать - остается на ваше усмотрение) и сделать Frequency Encoding для очищенных признаков.

In [None]:
import numpy as np
import pandas as pd
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import KFold 
from sklearn.model_selection import train_test_split 
from sklearn.preprocessing import LabelEncoder
import lightgbm as lgb

In [None]:
import warnings
warnings.simplefilter("ignore")
pd.options.display.max_columns = 450

In [None]:
!unzip /content/drive/MyDrive/assignment2_data.zip

Archive:  /content/drive/MyDrive/assignment2_data.zip
  inflating: assignment_2_test.csv   
  inflating: assignment_2_train.csv  


In [None]:
data = pd.read_csv('/content/assignment_2_train.csv')
lb = pd.read_csv('/content/assignment_2_test.csv')

X_data = data.drop('isFraud', axis=1)
y_data = data['isFraud']

X_lb = lb.drop('isFraud', axis=1)
y_lb = lb['isFraud']

stata = pd.DataFrame(columns=['train_mean', 'train_std', 'valid_mean', 'valid_std', \
                              'valid_conf_interval','auc_lb'])

# Задание 0

In [None]:
categorical_features = data.select_dtypes(include=[np.object])
print(f"Categorical Feature Count {categorical_features.shape[1]}")
categorical_features.head()

Categorical Feature Count 14


Unnamed: 0,ProductCD,card4,card6,P_emaildomain,R_emaildomain,M1,M2,M3,M4,M5,M6,M7,M8,M9
0,W,discover,credit,,,T,T,T,M2,F,T,,,
1,W,mastercard,credit,gmail.com,,,,,M0,T,T,,,
2,W,visa,debit,outlook.com,,T,T,T,M0,F,F,F,F,F
3,W,mastercard,debit,yahoo.com,,,,,M0,T,F,,,
4,H,mastercard,credit,gmail.com,,,,,,,,,,


In [None]:
def evaluation_model(X, y, lb_X, lb_y, operation=None):
    
    # преобразование категориальных признаков
    cat_features = X.select_dtypes(exclude=np.number).columns.to_list()
    X[cat_features] = X[cat_features].astype('category')
    lb_X[cat_features] = lb_X[cat_features].astype('category')
    
    # обучение модели
    X_train, X_valid = train_test_split(X, train_size=0.7, shuffle=True, random_state=5)
    y_train, y_valid = train_test_split(y, train_size=0.7, shuffle=True, random_state=5)

    model = lgb.LGBMClassifier(objective="binary", n_estimators=1000, random_state=5)

    model.fit(X=X_train, y=y_train,
                eval_set=[(X_train, y_train), (X_valid, y_valid)],
                categorical_feature=cat_features, # "auto",
                early_stopping_rounds=25,
                eval_metric="auc",
                verbose=100)
    
    # кросс-валидация
    fold_train_scores, fold_valid_scores = [], []
    
    cv_strategy = KFold(n_splits=5, random_state=1)
    
    for fold_number, (train_idx, valid_idx) in enumerate(cv_strategy.split(X, y)):
        X_train, X_valid = X.loc[train_idx], X.loc[valid_idx]
        y_train, y_valid = y.loc[train_idx], y.loc[valid_idx]

        y_train_pred = model.predict(X_train)
        y_valid_pred = model.predict(X_valid)

        fold_train_scores.append(roc_auc_score(y_train, y_train_pred))
        fold_valid_scores.append(roc_auc_score(y_valid, y_valid_pred))
        
    # доверительный интервал
    conf_interval = 0.95 
        
    left_bound = np.percentile(fold_valid_scores, ((1 - conf_interval) / 2) * 100)
    right_bound = np.percentile(fold_valid_scores, (conf_interval + ((1 - conf_interval) / 2)) * 100)
    
    # статистика
    if operation != None:
        
        stata.loc[f'{operation}', 'train_mean'] = round(np.mean(fold_train_scores), 4)
        stata.loc[f'{operation}', 'valid_mean'] = round(np.mean(fold_valid_scores), 4)
        stata.loc[f'{operation}', 'train_std'] = round(np.std(fold_train_scores), 3)
        stata.loc[f'{operation}', 'valid_std'] = round(np.std(fold_valid_scores), 3)
        stata.loc[f'{operation}', 'valid_conf_interval'] = f'{round(left_bound, 3)}/{round(right_bound, 3)}'

        auc_lb = round(roc_auc_score(lb_y, model.predict_proba(lb_X)[:, 1]), 4)
        stata.loc[f'{operation}', 'auc_lb'] = auc_lb

        return stata

In [None]:
X_data_base = X_data.copy()
X_lb_base = X_lb.copy()

stata = evaluation_model(X_data_base, y_data, X_lb_base, y_lb, operation='baseline')

Training until validation scores don't improve for 25 rounds.
[100]	training's auc: 0.976086	training's binary_logloss: 0.0436651	valid_1's auc: 0.941268	valid_1's binary_logloss: 0.0562966
[200]	training's auc: 0.991432	training's binary_logloss: 0.0315221	valid_1's auc: 0.948801	valid_1's binary_logloss: 0.0517477
[300]	training's auc: 0.996621	training's binary_logloss: 0.0236964	valid_1's auc: 0.95171	valid_1's binary_logloss: 0.0490745
Early stopping, best iteration is:
[345]	training's auc: 0.99764	training's binary_logloss: 0.0210635	valid_1's auc: 0.952709	valid_1's binary_logloss: 0.0483246


# Задание 1

In [None]:
def transform_datetime(data): 
    data = data.copy()
    data["DT"] = pd.to_datetime(data["TransactionDT"], unit='s', origin='2017-11-30')
    data["year"] = data["DT"].dt.year
    data["month"] = data["DT"].dt.month
    data["day"] = data["DT"].dt.day
    data["hour"] = data["DT"].dt.hour
    data["day_of_week"] = data["DT"].dt.weekday
    data = data.drop("DT", axis=1)   
    return data

In [None]:
X_data_dt = transform_datetime(data)
X_lb_dt = transform_datetime(lb)

stata = evaluation_model(X_data_dt, y_data, X_lb_dt, y_lb, operation='Задание 1')

Training until validation scores don't improve for 25 rounds.
Early stopping, best iteration is:
[1]	training's auc: 1	training's binary_logloss: 0.0471359	valid_1's auc: 1	valid_1's binary_logloss: 0.0458656


# Задание 2

In [None]:
def concatenation(data):   
    data = data.copy()
    data['card_1_2'] = data['card1'].astype(np.str) + '_' + data['card2'].astype(np.str)
    data['card_1_2_3_5'] = data['card_1_2'] + '_' + data['card3'].astype(np.str) + '_' + data['card5'].astype(np.str)
    data['card_1_2_3_5_addr_1_2'] = data['card_1_2_3_5'] + '_' + data['addr1'].astype(np.str) + '_' + data['addr2'].astype(np.str)
    return data

In [None]:
X_data_concat = concatenation(X_data) 
X_lb_concat = concatenation(X_lb)

stata = evaluation_model(X_data_concat, y_data, X_lb_concat, y_lb, operation='Задание 2')

Training until validation scores don't improve for 25 rounds.
[100]	training's auc: 0.991948	training's binary_logloss: 0.0288224	valid_1's auc: 0.947225	valid_1's binary_logloss: 0.0509659
[200]	training's auc: 0.997546	training's binary_logloss: 0.0183563	valid_1's auc: 0.951762	valid_1's binary_logloss: 0.0471606
Early stopping, best iteration is:
[240]	training's auc: 0.99855	training's binary_logloss: 0.0157772	valid_1's auc: 0.952873	valid_1's binary_logloss: 0.0463486


# Задание 3

In [None]:
def FrequencyEncoder(data, features):   
    data = data.copy()
    for feature in features:
        freq_encoder = data[feature].value_counts(normalize=True)
        data[f"{feature}_freq_enc"] = data[feature].map(freq_encoder)
    return data

In [None]:
features = ['card1', 'card2', 'card3', 'card4', 'card5', 'card6', 'addr1', 'addr2']
X_data_freq = FrequencyEncoder(X_data, features)
X_lb_freq = FrequencyEncoder(X_lb, features)

stata = evaluation_model(X_data_freq, y_data, X_lb_freq, y_lb, operation='Задание 3')

Training until validation scores don't improve for 25 rounds.
[100]	training's auc: 0.979645	training's binary_logloss: 0.0423194	valid_1's auc: 0.943879	valid_1's binary_logloss: 0.0555691
[200]	training's auc: 0.99382	training's binary_logloss: 0.0295744	valid_1's auc: 0.94963	valid_1's binary_logloss: 0.0505619
[300]	training's auc: 0.997883	training's binary_logloss: 0.0217524	valid_1's auc: 0.952833	valid_1's binary_logloss: 0.0477901
[400]	training's auc: 0.999186	training's binary_logloss: 0.0165938	valid_1's auc: 0.954415	valid_1's binary_logloss: 0.0463061
Early stopping, best iteration is:
[430]	training's auc: 0.999333	training's binary_logloss: 0.0154046	valid_1's auc: 0.954927	valid_1's binary_logloss: 0.0459595


# Задание 4

In [None]:
def create_aggs(data, groupby_id, aggs=None, features=None): 
    data = data.copy()
    if aggs != None:
        data_grouped_num = data.groupby(groupby_id)
        stats_num = data_grouped_num.agg(aggs)
        stats_num.columns = [f"{groupby_id}_{feature}_{stat}" for feature, stat in stats_num]
        stats_num = stats_num.reset_index()
        data = data.merge(stats_num, how='left', on=groupby_id)
    if features != None:
        categorical = data[features].copy()
        le = LabelEncoder()
        for feature in features:
            cat_value = list(categorical[feature].values.astype('str'))
            le.fit(cat_value)
            categorical[feature] = le.transform(cat_value)
        categorical[groupby_id] = data[groupby_id]
        data_grouped_cat = categorical.groupby(groupby_id)
        stats_cat = data_grouped_cat.agg({col: ["mean", "sum"] for col in features})
        stats_cat.columns = [f"{groupby_id}_{feature}_{stat}" for feature, stat in stats_cat]
        stats_cat = stats_cat.reset_index()
        data = data.merge(stats_cat, how='left', on=groupby_id) 
    return data

In [None]:
aggs = {"card1": [np.mean, np.std],
        "card2": [np.mean, np.std],
        "card3": [np.mean, np.std],
        "card5": [np.mean, np.std],
        "addr1": [np.mean, np.std],
        "addr2": [np.mean, np.std]
        }
features = ["card4", "card6", "card_1_2", "card_1_2_3_5", "card_1_2_3_5_addr_1_2"]
groupby_id = "TransactionAmt"
    
X_data_agg_amt = create_aggs(X_data_concat, groupby_id, aggs, features)
X_lb_agg_amt = create_aggs(X_lb_concat, groupby_id, aggs, features)

stata = evaluation_model(X_data_agg_amt, y_data, X_lb_agg_amt, y_lb, operation='Задание 4')

Training until validation scores don't improve for 25 rounds.
[100]	training's auc: 0.992804	training's binary_logloss: 0.027592	valid_1's auc: 0.947165	valid_1's binary_logloss: 0.0503691
[200]	training's auc: 0.998728	training's binary_logloss: 0.0167352	valid_1's auc: 0.952425	valid_1's binary_logloss: 0.0467891
[300]	training's auc: 0.999661	training's binary_logloss: 0.01084	valid_1's auc: 0.954487	valid_1's binary_logloss: 0.0451334
[400]	training's auc: 0.999925	training's binary_logloss: 0.00730164	valid_1's auc: 0.956156	valid_1's binary_logloss: 0.0447157
Early stopping, best iteration is:
[384]	training's auc: 0.99992	training's binary_logloss: 0.00771351	valid_1's auc: 0.956098	valid_1's binary_logloss: 0.0446878


# Задание 5

In [None]:
groupby_id = "D15"
X_data_agg_d15 = create_aggs(X_data_concat, groupby_id, aggs, features)
X_lb_agg_d15 = create_aggs(X_lb_concat, groupby_id, aggs, features)

stata = evaluation_model(X_data_agg_d15, y_data, X_lb_agg_d15, y_lb, operation='Задание 5')

Training until validation scores don't improve for 25 rounds.
[100]	training's auc: 0.993076	training's binary_logloss: 0.0283584	valid_1's auc: 0.950206	valid_1's binary_logloss: 0.0500928
[200]	training's auc: 0.998015	training's binary_logloss: 0.0180425	valid_1's auc: 0.954119	valid_1's binary_logloss: 0.0464295
[300]	training's auc: 0.999258	training's binary_logloss: 0.0125933	valid_1's auc: 0.956627	valid_1's binary_logloss: 0.0451057
Early stopping, best iteration is:
[348]	training's auc: 0.999649	training's binary_logloss: 0.010527	valid_1's auc: 0.957297	valid_1's binary_logloss: 0.0446245


# Задание 6

In [None]:
def transform_TransactionAmt(data):   
    data = data.copy()
    data['TransactionAmt_whole'] = data['TransactionAmt']//1
    data['TransactionAmt_frac'] = data['TransactionAmt']%1
    data['TransactionAmt_log'] = np.log2(data['TransactionAmt'])
    return data

In [None]:
X_data_trans_amt = transform_TransactionAmt(X_data)
X_lb_trans_amt = transform_TransactionAmt(X_lb)

stata = evaluation_model(X_data_trans_amt, y_data, X_lb_trans_amt, y_lb, operation='Задание 6')

Training until validation scores don't improve for 25 rounds.
[100]	training's auc: 0.97545	training's binary_logloss: 0.043453	valid_1's auc: 0.939509	valid_1's binary_logloss: 0.0566433
[200]	training's auc: 0.992229	training's binary_logloss: 0.0308658	valid_1's auc: 0.949247	valid_1's binary_logloss: 0.051666
[300]	training's auc: 0.996512	training's binary_logloss: 0.0234502	valid_1's auc: 0.951631	valid_1's binary_logloss: 0.0492989
[400]	training's auc: 0.998567	training's binary_logloss: 0.0177986	valid_1's auc: 0.953997	valid_1's binary_logloss: 0.0473766
[500]	training's auc: 0.999555	training's binary_logloss: 0.0136307	valid_1's auc: 0.954883	valid_1's binary_logloss: 0.0464313
Early stopping, best iteration is:
[533]	training's auc: 0.999688	training's binary_logloss: 0.0124602	valid_1's auc: 0.955168	valid_1's binary_logloss: 0.0463191


# Задание 7

In [None]:
def transform_emaildomain(data):
    data = data.copy()
    
    # отсутствующие значения P_emaildomain заполнить данными из R_emaildomain
    condition = (data['P_emaildomain'].isnull()) & (data['R_emaildomain'].notnull())
    data[condition]['P_emaildomain'] = data[condition]['R_emaildomain']

    # разбиение домена на уровни
    new = data['P_emaildomain'].str.split(".", n = 1, expand = True)
    data['P_emaildomain_1'] = new[0]
    data['P_emaildomain_2'] = new[1]

    # R_emaildomain, P_emaildomain удалить
    data = data.drop(['R_emaildomain', 'P_emaildomain'], axis=1)
    return data

In [None]:
X_data_trans_email = transform_emaildomain(X_data)
X_lb_trans_email = transform_emaildomain(X_lb)

stata = evaluation_model(X_data_trans_email, y_data, X_lb_trans_email, y_lb, operation='Задание 7')

Training until validation scores don't improve for 25 rounds.
[100]	training's auc: 0.97758	training's binary_logloss: 0.0438405	valid_1's auc: 0.93965	valid_1's binary_logloss: 0.0566607
[200]	training's auc: 0.991221	training's binary_logloss: 0.0315137	valid_1's auc: 0.94722	valid_1's binary_logloss: 0.0517566
[300]	training's auc: 0.996309	training's binary_logloss: 0.023992	valid_1's auc: 0.951246	valid_1's binary_logloss: 0.0492848
[400]	training's auc: 0.998313	training's binary_logloss: 0.0185288	valid_1's auc: 0.953171	valid_1's binary_logloss: 0.047829
Early stopping, best iteration is:
[446]	training's auc: 0.999126	training's binary_logloss: 0.0163242	valid_1's auc: 0.954389	valid_1's binary_logloss: 0.0473662


In [None]:
stata

Unnamed: 0,train_mean,train_std,valid_mean,valid_std,valid_conf_interval,auc_lb
baseline,0.8597,0.005,0.8578,0.016,0.839/0.883,0.8585
Задание 1,0.5,0.0,0.5,0.0,0.5/0.5,1.0
Задание 2,0.8945,0.002,0.893,0.008,0.885/0.906,0.8506
Задание 3,0.8893,0.003,0.8879,0.01,0.88/0.905,0.8506
Задание 4,0.9341,0.001,0.9336,0.003,0.929/0.938,0.8458
Задание 5,0.9188,0.002,0.9177,0.007,0.911/0.928,0.8398
Задание 6,0.9036,0.003,0.902,0.01,0.894/0.919,0.8493
Задание 7,0.8846,0.004,0.8828,0.011,0.873/0.902,0.8532
