# Topic: "Feature Engineering, Feature Selection, part I."


<b>We continue working with the data used in homework assignments 2 and 3 and keep solving the fraudulent-transaction detection task, so that by the end we have a complete solution: a full pipeline.</b>
<hr>
Tasks:
<ol>
<li><a href="#task_0">Task 0: choose any machine learning model and fix any validation scheme. Train a baseline model and record its baseline quality. In each subsequent task, retrain the chosen model and evaluate it with the same fixed validation scheme. After each task, draw a conclusion about the model quality achieved compared with the previous step.
</a>
<li><a href = "#task_1">Task 1: the TransactionDT feature is an offset in seconds from a base date. The base date is 2017-12-01. Convert TransactionDT to datetime by adding its value to the base date. From the resulting feature, extract the year, month, day of week, hour, and day.
</a>
<li><a href = "#task_2">Task 2: concatenate the features
* card1 + card2;
* card1 + card2 + card3 + card5;
* card1 + card2 + card3 + card5 + addr1 + addr2

Treat them as categorical features.
</a>
<li><a href = "#task_3">Task 3: build a FrequencyEncoder for the features card1 - card6, addr1, addr2.
</a>
<li><a href = "#task_4">Task 4: create features based on the ratio of TransactionAmt to a computed statistic. The statistic is the mean / standard deviation of TransactionAmt, grouped by card1 - card6, addr1, addr2, and by the features created in Task 2.
</a>
<li><a href = "#task_5">Task 5: create features based on the ratio of D15 to a computed statistic. The statistic is the mean / standard deviation of D15, grouped by card1 - card6, addr1, addr2, and by the features created in Task 2.
</a>
<li><a href = "#task_6">Task 6: extract the fractional part and the integer part of TransactionAmt into two separate features. Then create one more feature: the logarithm of TransactionAmt.
</a>
<li><a href = "#task_7">Task 7 (optional): perform preliminary preparation / cleaning of the P_emaildomain and R_emaildomain features (what to do and how is up to you) and apply Frequency Encoding to the cleaned features.
</a>
</ol>

## Importing libraries

In [1]:
import warnings

import lightgbm as lgbm
import numpy as np
import pandas as pd

from sklearn.metrics import (f1_score, roc_auc_score,
                             precision_score, classification_report,
                             precision_recall_curve, confusion_matrix)

from sklearn.model_selection import  train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.base import BaseEstimator, TransformerMixin

from sklearn.preprocessing import LabelEncoder

warnings.simplefilter("ignore")
%matplotlib inline

In [2]:
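def optimization_memory_usage(df: pd.DataFrame):
    """
    Downcast numeric columns to the smallest fitting int[8, 16, 32, 64] or
    float[16, 32, 64] dtype and convert object columns to category.
    The frame is modified in place and also returned.

    Note: float16 keeps only about three significant decimal digits, so
    float columns downcast to it (for example TransactionAmt) are rounded.
    """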
    start_memory_usage = df.memory_usage().sum() / 1024**2
    print(f'Memory usage of dataframe is {start_memory_usage:.2f} MB')
    
    for column in df.columns:
        column_type = df[column].dtype
        
        if column_type != object: 
            col_min = df[column].min()
            col_max = df[column].max()
            if str(column_type)[:3] == 'int':
                if col_min > np.iinfo(np.int8).min and col_max < np.iinfo(np.int8).max:
                    df[column] = df[column].astype(np.int8)

                elif col_min > np.iinfo(np.int16).min and col_max < np.iinfo(np.int16).max:
                    df[column] = df[column].astype(np.int16)

                elif col_min > np.iinfo(np.int32).min and col_max < np.iinfo(np.int32).max:
                    df[column] = df[column].astype(np.int32)

                elif col_min > np.iinfo(np.int64).min and col_max < np.iinfo(np.int64).max:
                    df[column] = df[column].astype(np.int64) 
            else:
                if col_min > np.finfo(np.float16).min and col_max < np.finfo(np.float16).max:
                    df[column] = df[column].astype(np.float16)

                elif col_min > np.finfo(np.float32).min and col_max < np.finfo(np.float32).max:
                    df[column] = df[column].astype(np.float32)

                else:
                    df[column] = df[column].astype(np.float64)
        else:
            df[column] = df[column].astype('category')
            
    end_memory_usage = df.memory_usage().sum() / 1024**2
    percent_optimization  = 100 * (start_memory_usage - end_memory_usage) / start_memory_usage
    print(f'Memory usage after optimization is: {end_memory_usage:.2f} MB')
    print(f'Decreased by {percent_optimization:.1f}%')
    
    return df


def get_continuos_object_base_features_names(data: pd.DataFrame,
                       continuous_feature_threshold: int=21,
                      ) -> (list, list, list):
    """Return a tuple of lists of feature names:
    (continuos_features [continuous],
     object_features [categorical],
     base_features [base, i.e. numeric features with fewer than
                    `continuous_feature_threshold` unique values])
    """
    numerical_features = data.select_dtypes(include=[np.number]).columns.to_list()
    object_features = data.select_dtypes(exclude=[np.number]).columns.to_list()
    
    
    base_features = [feature for feature in numerical_features
                     if len(data[feature].unique())<continuous_feature_threshold
                    ]
    
    
    continuos_features = [feature for feature in numerical_features
                          if feature not in base_features
                         ]
    return (continuos_features, object_features, base_features)
  

## Loading data

In [3]:
train_df = pd.read_csv('../data/assignment_2_train.csv')
lb_test = pd.read_csv('../data/assignment_2_test.csv')

In [4]:
train_df = optimization_memory_usage(train_df)

Memory usage of dataframe is 541.08 MB
Memory usage after optimization is: 141.28 MB
Decreased by 73.9%


In [5]:
lb_test = optimization_memory_usage(lb_test)

Memory usage of dataframe is 300.60 MB
Memory usage after optimization is: 74.11 MB
Decreased by 75.3%
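
The memory savings above partly come from casting float columns to float16, which rounds the stored values. A minimal sketch of that effect, assuming illustrative amounts (they are not taken from the dataset):

```python
import numpy as np

# Illustrative amounts only; float16 has an 11-bit significand,
# so the gap between neighbouring values grows with magnitude.
amounts = np.array([107.95, 445.31, 3072.0], dtype=np.float64)
as_half = amounts.astype(np.float16)

# Distance from each stored value to the next representable float16.
spacing = np.spacing(as_half)

for original, stored, gap in zip(amounts, as_half, spacing):
    print(f"{original:>8.2f} -> {float(stored):>10.4f} (step {float(gap):.4f})")
```

If exact cents matter for a feature such as the fractional part of an amount, keeping that column as float32 or float64 is one option.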


In [6]:
target = 'isFraud'

train_data = train_df.drop(target, axis=1)
train_target = train_df[target]

test_data = lb_test.drop(target, axis=1)
test_target = lb_test[target]

number_feature_df = train_df.select_dtypes(include=[np.number])
object_feature_df = train_df.select_dtypes(exclude=[np.number])


## Completing the tasks

In [7]:
def check_missings(df: pd.DataFrame) -> pd.DataFrame:
    """
    Compute the total number and the percentage of missing values per column.

    Parameters
    ----------
    df: pandas.core.DataFrame
        Dataset to compute the statistics for.

    Returns
    -------
    result: pandas.core.DataFrame
        Dataframe with the missing-value statistics and column dtypes.

    """
    na = df.isnull().sum()
    result = pd.DataFrame({
        "Total": na,
        "Percent": 100*na/df.shape[0],
        "Types": df.dtypes
    })
    print(f"Total NA-values = {na.sum()}")
    return result.T

In [8]:
check_missings(train_df)

Total NA-values = 28186929


Unnamed: 0,TransactionID,isFraud,TransactionDT,TransactionAmt,ProductCD,card1,card2,card3,card4,card5,...,V330,V331,V332,V333,V334,V335,V336,V337,V338,V339
Total,0,0,0,0,0,0,2611,3,8,953,...,132004,132004,132004,132004,132004,132004,132004,132004,132004,132004
Percent,0.0,0.0,0.0,0.0,0.0,0.0,1.450556,0.001667,0.004444,0.529444,...,73.335556,73.335556,73.335556,73.335556,73.335556,73.335556,73.335556,73.335556,73.335556,73.335556
Types,int32,int8,int32,float16,category,int16,float16,float16,category,float16,...,float16,float16,float16,float16,float16,float16,float16,float16,float16,float16


<p><a name="task_0"></a></p>

### Task 0: choose any machine learning model and fix any validation scheme. Train a baseline model and record its baseline quality. In each subsequent task, retrain the chosen model and evaluate it with the same fixed validation scheme. After each task, draw a conclusion about the model quality achieved compared with the previous step.


In [9]:
def run_model(train_data: pd.DataFrame,
              train_target: pd.Series,
              test_data: pd.DataFrame,
              test_target: pd.Series,
              cv: int =5, fscore_b:int = 1,
              params_lgbm:dict = {"boosting_type": "gbdt",
                                  "objective": "binary",
                                  "metric": "auc",
                                  "learning_rate": 0.1,
                                  "n_estimators": 1000,
                                  "categorical_feature":"auto",
                                  "n_jobs": -1,
                                  "seed": 21
                                 }) -> dict:
    
    # A single call keeps features and target aligned; with the same
    # random_state it yields the same 70/30 split as two separate calls.
    X_train, X_valid, y_train, y_valid = train_test_split(train_data,
                                                          train_target,
                                                          train_size=0.7,
                                                          shuffle=True,
                                                          random_state=21)

    classifier = lgbm.LGBMClassifier(**params_lgbm)  
    
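    # Note: early_stopping_rounds and verbose are fit() arguments in LightGBM < 4.0;
    # newer versions expect callbacks=[lgbm.early_stopping(100), lgbm.log_evaluation(100)].
    classifier.fit(X=X_train, y=y_train,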
                  eval_set=[(X_train, y_train),
                            (X_valid, y_valid)
                           ],
                  early_stopping_rounds=100,
                  eval_metric="auc",
                  verbose=100)
    
    # run cross-validation on the whole training set
    cv_scores = cross_val_score(classifier,
                                train_data,
                                train_target,
                                cv=cv,
                                scoring='roc_auc')
    
    cv_score = np.mean(cv_scores)
    cv_score_std = np.std(cv_scores)

    cv_result = f'{round(cv_score, 4)} +/- {round(cv_score_std, 3)}'
    
    # valid scores
    y_valscore = classifier.predict_proba(X_valid)[:, 1]
    
    # Precision, Recall, F_score, Roc auc
    precision_val, recall_val, thresholds_val = precision_recall_curve(y_valid.values, y_valscore)
    fscore_val = (1 + fscore_b**2) * (precision_val * recall_val) / (fscore_b**2 * precision_val + recall_val)
    roc_auc_val = roc_auc_score(y_valid, y_valscore)
    ix_val = np.argmax(fscore_val)
    
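    # test scores (the F-score threshold below is picked on the test labels
    # themselves, so test precision / recall / F-score are optimistic)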
    y_score = classifier.predict_proba(test_data)[:, 1]
    
    # Precision, Recall, F_score, Roc auc
    precision, recall, thresholds = precision_recall_curve(test_target.values, y_score)
    fscore = (1 + fscore_b**2) * (precision * recall) / (fscore_b**2 * precision + recall)
    roc_auc = roc_auc_score(test_target, y_score)
    ix = np.argmax(fscore)
    
    res_score = {
        'cv_results': [cv_result],
        'roc_auc(valid, test)': [roc_auc_val, roc_auc],
        'precision(valid, test)': [precision_val[ix_val], precision[ix]],
        'recall(valid, test)': [recall_val[ix_val], recall[ix]],
        'fscore(valid, test)': [fscore_val[ix_val], fscore[ix]],
    } 
    
    return res_score

In [10]:
res_0 = run_model(train_data, train_target, test_data, test_target)
res_0

[100]	training's auc: 0.976518	valid_1's auc: 0.941829
[200]	training's auc: 0.991183	valid_1's auc: 0.949929
[300]	training's auc: 0.99682	valid_1's auc: 0.953164
[400]	training's auc: 0.998887	valid_1's auc: 0.9556
[500]	training's auc: 0.999547	valid_1's auc: 0.956955
[600]	training's auc: 0.999775	valid_1's auc: 0.957666
[700]	training's auc: 0.99992	valid_1's auc: 0.958238


{'cv_results': ['0.8546 +/- 0.062'],
 'roc_auc(valid, test)': [0.9583620691289452, 0.8413296629822693],
 'precision(valid, test)': [0.8883291351805206, 0.6845151953690304],
 'recall(valid, test)': [0.6924083769633508, 0.37302839116719244],
 'fscore(valid, test)': [0.7782272894446488, 0.482899438489025]}
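
In `run_model` the threshold that maximizes the F-score is searched separately on the validation and on the test predictions. A minimal sketch of an alternative, choosing the threshold on validation and reusing it on test; the labels and scores are toy values, and `fbeta_at` and `pick_threshold` are hypothetical helpers, not part of the notebook:

```python
import numpy as np

def fbeta_at(y_true, y_score, threshold, beta=1.0):
    """F-beta of the rule `score >= threshold`; 0.0 when there are no true positives."""
    y_true = np.asarray(y_true).astype(bool)
    y_pred = np.asarray(y_score) >= threshold
    tp = np.sum(y_true & y_pred)
    if tp == 0:
        return 0.0
    precision = tp / y_pred.sum()
    recall = tp / y_true.sum()
    return (1 + beta**2) * precision * recall / (beta**2 * precision + recall)

def pick_threshold(y_true, y_score, beta=1.0):
    """Return the distinct score value that gives the highest F-beta."""
    candidates = np.unique(y_score)
    scores = [fbeta_at(y_true, y_score, t, beta) for t in candidates]
    return float(candidates[int(np.argmax(scores))])

# Toy data standing in for labels and predict_proba(...)[:, 1].
y_valid = np.array([0, 0, 1, 0, 1, 1])
s_valid = np.array([0.1, 0.4, 0.35, 0.2, 0.8, 0.7])
y_test = np.array([0, 1, 0, 1])
s_test = np.array([0.3, 0.9, 0.75, 0.2])

threshold = pick_threshold(y_valid, s_valid)   # chosen on validation only
test_f1 = fbeta_at(y_test, s_test, threshold)  # evaluated on test with that threshold
print(threshold, test_f1)
```

With this scheme the test metrics no longer depend on the test labels through the threshold, which makes them comparable across the feature sets.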

<p><a name="task_1"></a></p>

### Task 1: the TransactionDT feature is an offset in seconds from a base date. The base date is 2017-12-01. Convert TransactionDT to datetime by adding its value to the base date. From the resulting feature, extract the year, month, day of week, hour, and day.


In [11]:
def transaction_dt_to_datetime(data):
    
    data = data.copy()
    data["transaction_datetime"] = pd.to_datetime(data["TransactionDT"],
                                                  unit='s',
                                                  origin='2017-12-01')
    
    data["transaction_year"] = data["transaction_datetime"].dt.year
    data["transaction_month"] = data["transaction_datetime"].dt.month
    data["transaction_day"] = data["transaction_datetime"].dt.day
    data["transaction_hour"] = data["transaction_datetime"].dt.hour
    data["transaction_day_of_week"] = data["transaction_datetime"].dt.weekday
    data = data.drop("transaction_datetime", axis=1)
    return data

In [12]:
train_data = transaction_dt_to_datetime(train_data)
test_data = transaction_dt_to_datetime(test_data)
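
A small worked example of the conversion, using three TransactionDT values that appear in the tables of this notebook; it only illustrates what `pd.to_datetime` with `unit` and `origin` produces:

```python
import pandas as pd

# Offsets in seconds from the base date 2017-12-01.
offsets = pd.Series([86400, 86401, 7415038], name="TransactionDT")
stamps = pd.to_datetime(offsets, unit="s", origin="2017-12-01")

parts = pd.DataFrame({
    "datetime": stamps,
    "month": stamps.dt.month,
    "day": stamps.dt.day,
    "hour": stamps.dt.hour,
    "day_of_week": stamps.dt.weekday,  # Monday=0 ... Sunday=6
})
print(parts)
```

The first offset is exactly one day, so it maps to midnight on 2017-12-02, a Saturday.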

In [13]:
res_1 = run_model(train_data, train_target, test_data, test_target)

[100]	training's auc: 0.97711	valid_1's auc: 0.942055
[200]	training's auc: 0.992392	valid_1's auc: 0.949524
[300]	training's auc: 0.996718	valid_1's auc: 0.953295
[400]	training's auc: 0.998525	valid_1's auc: 0.954913
[500]	training's auc: 0.999497	valid_1's auc: 0.955961
[600]	training's auc: 0.99985	valid_1's auc: 0.956006
[700]	training's auc: 0.999914	valid_1's auc: 0.956683
[800]	training's auc: 0.999963	valid_1's auc: 0.957016


In [14]:
res_1

{'cv_results': ['0.7732 +/- 0.161'],
 'roc_auc(valid, test)': [0.9572710941942418, 0.8368506811409077],
 'precision(valid, test)': [0.9094076655052264, 0.6567855488652153],
 'recall(valid, test)': [0.6832460732984293, 0.37276550998948477],
 'fscore(valid, test)': [0.780269058295964, 0.47559953043769915]}

<p><a name="task_2"></a></p>

### Task 2: concatenate the features
* card1 + card2;
* card1 + card2 + card3 + card5;
* card1 + card2 + card3 + card5 + addr1 + addr2

Treat them as categorical features.

In [15]:
def sum_card_addr(data):
    
    data = data.copy()
    data['card12'] = data['card1'].fillna(0) +  data['card2'].fillna(0)
    data['card1235'] = data['card12'] + data['card3'].fillna(0) + data['card5'].fillna(0)
    data['card1235_addr12'] = data['card1235'] + data['addr1'].fillna(0) + data['addr2'].fillna(0)

    return data

def concat_card_addr(data):
    
    data = data.copy()
    data['card12'] = data['card1'].astype(np.str_) + '|' +data['card2'].astype(np.str_)
    data['card1235'] = data['card12'] + '|' + data['card3'].astype(np.str_) + '|' + data['card5'].astype(np.str_)
    data['card1235_addr12'] = data['card1235'] + '|' + data['addr1'].astype(np.str_) + '|' + data['addr2'].astype(np.str_)
    
    data['card12'] = data['card12'].astype('category')
    data['card1235'] = data['card1235'].astype('category')
    data['card1235_addr12'] = data['card1235_addr12'].astype('category')
    return data
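
In `concat_card_addr` missing values end up inside the key as the literal string `nan`. A minimal sketch of the same concatenation with an explicit placeholder for missing values; `concat_keys` and the `"NA"` token are assumptions made for illustration:

```python
import numpy as np
import pandas as pd

def concat_keys(frame, columns, sep="|", missing="NA"):
    """Join several columns row-wise into one categorical key."""
    parts = [
        # Replace missing entries before converting everything to text.
        frame[col].astype(object).where(frame[col].notna(), missing).astype(str)
        for col in columns
    ]
    key = parts[0]
    for part in parts[1:]:
        key = key + sep + part
    return key.astype("category")

toy = pd.DataFrame({"card1": [13926, 2755], "card2": [np.nan, 404.0]})
toy["card12"] = concat_keys(toy, ["card1", "card2"])
print(toy)
```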

In [16]:
# train_data_sum_card_addr = sum_card_addr(train_data)
# test_data_sum_card_addr = sum_card_addr(test_data)

train_data = concat_card_addr(train_data)
test_data = concat_card_addr(test_data)


In [17]:
# res_2_1 = run_model(train_data_sum_card_addr, train_target, test_data_sum_card_addr, test_target)


res_2 = run_model(train_data, train_target, test_data, test_target)
res_2 # , res_2_1

[100]	training's auc: 0.993629	valid_1's auc: 0.950091
[200]	training's auc: 0.998987	valid_1's auc: 0.953618
[300]	training's auc: 0.999876	valid_1's auc: 0.954358
[400]	training's auc: 1	valid_1's auc: 0.954874
[500]	training's auc: 1	valid_1's auc: 0.95615
[600]	training's auc: 1	valid_1's auc: 0.956878
[700]	training's auc: 1	valid_1's auc: 0.957505
[800]	training's auc: 1	valid_1's auc: 0.957632


{'cv_results': ['0.805 +/- 0.157'],
 'roc_auc(valid, test)': [0.9578622109620476, 0.833840309482774],
 'precision(valid, test)': [0.895367412140575, 0.6180150125104253],
 'recall(valid, test)': [0.7336387434554974, 0.38958990536277605],
 'fscore(valid, test)': [0.8064748201438848, 0.47791035149951633]}

<p><a name="task_3"></a></p>

### Task 3: build a FrequencyEncoder for the features card1 - card6, addr1, addr2.

In [18]:
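class FrequencyEncoder(BaseEstimator, TransformerMixin):
    """
    Frequency encoding for a list of features.

    Note: frequencies are computed in transform() from the frame being
    transformed (not learned in fit()), and the frame is modified in place.
    """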
    def __init__(self, keys: list):
        self.keys = keys
        self.columns = []
        
        
    def fit(self, X:pd.DataFrame, y=None):
        self.columns = [(key, f'{key}_freq') for key in self.keys]
        return self

    
    def transform(self, data: pd.DataFrame)->pd.DataFrame:
        for key, key_freq in self.columns:
            fr_cod = data[key].value_counts(normalize=True)
            data[key_freq] = data[key].map(fr_cod)

        return data
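
Because `transform` recomputes the frequencies on every frame, train and test are encoded with different tables (compare `addr2_freq` in the two outputs below). A minimal sketch of a variant that learns the table once in `fit` and reuses it; `FittedFrequencyEncoder` is a hypothetical class and the data are toy values:

```python
import pandas as pd

class FittedFrequencyEncoder:
    """Hypothetical variant: frequencies are learned in fit() and reused in transform()."""

    def __init__(self, keys):
        self.keys = keys

    def fit(self, X, y=None):
        # One {value: relative frequency} table per key, from the fitting frame only.
        self.freq_ = {key: X[key].value_counts(normalize=True).to_dict()
                      for key in self.keys}
        return self

    def transform(self, X):
        X = X.copy()  # leave the caller's frame untouched
        for key in self.keys:
            # Values unseen during fit (and missing values) become NaN.
            X[f"{key}_freq"] = X[key].astype(object).map(self.freq_[key]).astype(float)
        return X

train = pd.DataFrame({"card4": ["visa", "visa", "mastercard", "visa"]})
test = pd.DataFrame({"card4": ["mastercard", "amex"]})

encoder = FittedFrequencyEncoder(["card4"]).fit(train)
encoded_test = encoder.transform(test)
print(encoded_test)
```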

In [19]:
freq_enc_features = ['card1', 'card2', 'card3',
                     'card4', 'card5', 'card6',
                     'addr1', 'addr2']

freqencoder = FrequencyEncoder(freq_enc_features)
freqencoder.fit_transform(train_data)
freqencoder.fit_transform(test_data)

Unnamed: 0,TransactionID,TransactionDT,TransactionAmt,ProductCD,card1,card2,card3,card4,card5,card6,...,card1235,card1235_addr12,card1_freq,card2_freq,card3_freq,card4_freq,card5_freq,card6_freq,addr1_freq,addr2_freq
0,3287000,7415038,226.000000,W,12473,555.0,150.0,visa,226.0,credit,...,12473|555.0|150.0|226.0,12473|555.0|150.0|226.0|299.0|87.0,0.00032,0.072942,0.881635,0.648834,0.495638,0.209399,0.083024,0.994213
1,3287001,7415054,3072.000000,W,15651,417.0,150.0,visa,226.0,debit,...,15651|417.0|150.0|226.0,15651|417.0|150.0|226.0|330.0|87.0,0.00269,0.004261,0.881635,0.648834,0.495638,0.790450,0.048890,0.994213
2,3287002,7415081,320.000000,W,13844,583.0,150.0,visa,226.0,credit,...,13844|583.0|150.0|226.0,13844|583.0|150.0|226.0|126.0|87.0,0.00041,0.026372,0.881635,0.648834,0.495638,0.209399,0.029411,0.994213
3,3287003,7415111,171.000000,W,11556,309.0,150.0,visa,226.0,debit,...,11556|309.0|150.0|226.0,11556|309.0|150.0|226.0|181.0|87.0,0.00022,0.000224,0.881635,0.648834,0.495638,0.790450,0.027373,0.994213
4,3287004,7415112,107.937500,W,10985,555.0,150.0,visa,226.0,debit,...,10985|555.0|150.0|226.0,10985|555.0|150.0|226.0|231.0|87.0,0.00007,0.072942,0.881635,0.648834,0.495638,0.790450,0.016127,0.994213
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
99996,3386996,10091528,369.000000,W,13964,496.0,150.0,mastercard,224.0,debit,...,13964|496.0|150.0|224.0,13964|496.0|150.0|224.0|299.0|87.0,0.00077,0.000864,0.881635,0.334167,0.148428,0.790450,0.083024,0.994213
99997,3386997,10091533,445.250000,W,10616,583.0,150.0,visa,226.0,credit,...,10616|583.0|150.0|226.0,10616|583.0|150.0|226.0|472.0|87.0,0.00689,0.026372,0.881635,0.648834,0.495638,0.209399,0.016285,0.994213
99998,3386998,10091544,15.226562,C,9803,583.0,150.0,visa,226.0,credit,...,9803|583.0|150.0|226.0,9803|583.0|150.0|226.0|nan|nan,0.00184,0.026372,0.881635,0.648834,0.495638,0.209399,,
99999,3386999,10091549,34.750000,C,16062,500.0,185.0,mastercard,137.0,credit,...,16062|500.0|185.0|137.0,16062|500.0|185.0|137.0|284.0|60.0,0.00193,0.009601,0.098039,0.334167,0.019969,0.209399,0.003250,0.004202


In [20]:
res_3 = run_model(train_data, train_target, test_data, test_target)
res_3

[100]	training's auc: 0.996852	valid_1's auc: 0.952671
[200]	training's auc: 0.999703	valid_1's auc: 0.955408
[300]	training's auc: 0.999978	valid_1's auc: 0.956762
[400]	training's auc: 0.999999	valid_1's auc: 0.957012
[500]	training's auc: 1	valid_1's auc: 0.956963


{'cv_results': ['0.8012 +/- 0.158'],
 'roc_auc(valid, test)': [0.9571069484450891, 0.8324991596558006],
 'precision(valid, test)': [0.8964110929853181, 0.6466789667896679],
 'recall(valid, test)': [0.7192408376963351, 0.3685594111461619],
 'fscore(valid, test)': [0.7981118373275236, 0.46952444742129934]}

In [21]:
train_data.head(3)

Unnamed: 0,TransactionID,TransactionDT,TransactionAmt,ProductCD,card1,card2,card3,card4,card5,card6,...,card1235,card1235_addr12,card1_freq,card2_freq,card3_freq,card4_freq,card5_freq,card6_freq,addr1_freq,addr2_freq
0,2987000,86400,68.5,W,13926,,150.0,discover,142.0,credit,...,13926|nan|150.0|142.0,13926|nan|150.0|142.0|315.0|87.0,6.1e-05,,0.879737,0.013212,0.000274,0.317951,0.042773,0.982344
1,2987001,86401,29.0,W,2755,404.0,150.0,mastercard,102.0,credit,...,2755|404.0|150.0|102.0,2755|404.0|150.0|102.0|325.0|87.0,0.001244,0.006855,0.879737,0.302797,0.054723,0.317951,0.080004,0.982344
2,2987002,86469,59.0,W,4663,490.0,150.0,visa,166.0,debit,...,4663|490.0|150.0|166.0,4663|490.0|150.0|166.0|330.0|87.0,0.001428,0.061413,0.879737,0.657224,0.080269,0.681949,0.046205,0.982344


<p><a name="task_4"></a></p>

### Task 4: create features based on the ratio of TransactionAmt to a computed statistic. The statistic is the mean / standard deviation of TransactionAmt, grouped by card1 - card6, addr1, addr2, and by the features created in Task 2.

In [22]:
def add_features_stat_agg(data: pd.DataFrame,
                          stat_key: str,
                          group_keys: list[str],
                          agg_funcs: list[str]):
    """
    Add ratio features:
        stat_key / mean(stat_key) by group_key
        stat_key / std(stat_key) by group_key
    for every combination of agg_funcs and group_keys.
    """
    for agg_func in agg_funcs:
        for group_key in group_keys:
            agg_data = data.groupby(group_key, as_index=False).agg({stat_key: agg_func})\
                                    .rename(columns={stat_key:f'{agg_func}{stat_key}By_{group_key}'}) 
            data = data.merge(agg_data, on=group_key, how='left')
            data[f'{stat_key} / {agg_func}{stat_key}By_{group_key}'] = data[stat_key] / data[f'{agg_func}{stat_key}By_{group_key}']
            data = data.drop(f'{agg_func}{stat_key}By_{group_key}', axis=1)
    return data


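class StatFeatures(BaseEstimator, TransformerMixin):
    """
    Ratio features: stat_key divided by its per-group statistics.
    The statistics are computed on the frame passed to fit_transform().
    """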
    def __init__(self,
                 stat_key: str,
                 group_keys: list[str],
                 agg_funcs: list[str]):
        
        self.stat_key = stat_key
        self.group_keys = group_keys
        self.agg_funcs = agg_funcs
        
                 
    def fit_transform(self, data:pd.DataFrame, y=None):
        for agg_func in self.agg_funcs:
            for group_key in self.group_keys:
                agg_data = data.groupby(group_key, as_index=False).agg({self.stat_key: agg_func})\
                                        .rename(columns={self.stat_key:f'{agg_func}{self.stat_key}By_{group_key}'}) 
                data = data.merge(agg_data, on=group_key, how='left')
                data[f'{self.stat_key} / {agg_func}{self.stat_key}By_{group_key}'] = data[self.stat_key] / data[f'{agg_func}{self.stat_key}By_{group_key}']
                data = data.drop(f'{agg_func}{self.stat_key}By_{group_key}', axis=1)

        return data
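
`StatFeatures` computes the group statistics on whichever frame it receives, so the test set gets its own means and standard deviations. A minimal sketch of the ratio feature with statistics taken from the training frame only and a guard against a zero statistic; the helper names, the column naming and the toy data are assumptions:

```python
import numpy as np
import pandas as pd

def fit_group_stats(train, value, group_key, agg_funcs=("mean", "std")):
    """Per-group statistics of `value`, computed on the training frame only."""
    return train.groupby(group_key, observed=True)[value].agg(list(agg_funcs))

def add_ratio_features(frame, stats, value, group_key):
    """Add `value / statistic` columns using a precomputed `stats` table."""
    frame = frame.copy()
    for agg_func in stats.columns:
        lookup = stats[agg_func].to_dict()
        denom = frame[group_key].astype(object).map(lookup).astype(float)
        denom = denom.replace(0.0, np.nan)  # avoid inf when a statistic is zero
        frame[f"{value}_to_{agg_func}_by_{group_key}"] = frame[value] / denom
    return frame

train = pd.DataFrame({"card4": ["visa", "visa", "visa", "discover"],
                      "TransactionAmt": [10.0, 20.0, 30.0, 50.0]})
test = pd.DataFrame({"card4": ["visa", "discover", "amex"],
                     "TransactionAmt": [40.0, 25.0, 5.0]})

stats = fit_group_stats(train, "TransactionAmt", "card4")
ratios = add_ratio_features(test, stats, "TransactionAmt", "card4")
print(ratios)
```

Groups with a single training row have an undefined standard deviation, and groups absent from the training frame have no statistics at all; in both cases the ratio stays NaN.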
    


In [23]:
transform_feature_names = ['card1', 'card2', 'card3', 'card4',
                           'card5', 'card6','addr1', 'addr2',
                           'card12', 'card1235', 'card1235_addr12']

add_stat_features = StatFeatures('TransactionAmt',
                                 group_keys = transform_feature_names,
                                 agg_funcs = ['mean', 'std'])

train_data = add_stat_features.fit_transform(train_data)
test_data = add_stat_features.fit_transform(test_data)
train_data.head(3)

Unnamed: 0,TransactionID,TransactionDT,TransactionAmt,ProductCD,card1,card2,card3,card4,card5,card6,...,TransactionAmt / stdTransactionAmtBy_card2,TransactionAmt / stdTransactionAmtBy_card3,TransactionAmt / stdTransactionAmtBy_card4,TransactionAmt / stdTransactionAmtBy_card5,TransactionAmt / stdTransactionAmtBy_card6,TransactionAmt / stdTransactionAmtBy_addr1,TransactionAmt / stdTransactionAmtBy_addr2,TransactionAmt / stdTransactionAmtBy_card12,TransactionAmt / stdTransactionAmtBy_card1235,TransactionAmt / stdTransactionAmtBy_card1235_addr12
0,2987000,86400,68.5,W,13926,,150.0,discover,142.0,credit,...,,0.315706,0.200991,0.587949,0.262548,0.287963,0.314952,0.290306,0.290306,
1,2987001,86401,29.0,W,2755,404.0,150.0,mastercard,102.0,credit,...,0.098587,0.133657,0.141234,0.09987,0.111152,0.126671,0.133337,0.07015,0.07015,0.107415
2,2987002,86469,59.0,W,4663,490.0,150.0,visa,166.0,debit,...,0.297846,0.271922,0.289315,0.457807,0.342364,0.26399,0.271272,0.691704,0.691704,4.110451


In [24]:
res_4 = run_model(train_data, train_target, test_data, test_target)
res_4

[100]	training's auc: 0.996912	valid_1's auc: 0.954091
[200]	training's auc: 0.999746	valid_1's auc: 0.957619
[300]	training's auc: 0.999998	valid_1's auc: 0.957948
[400]	training's auc: 1	valid_1's auc: 0.958448


{'cv_results': ['0.8043 +/- 0.157'],
 'roc_auc(valid, test)': [0.9584874764314094, 0.8320139866548608],
 'precision(valid, test)': [0.8849487785657998, 0.5995943204868154],
 'recall(valid, test)': [0.7349476439790575, 0.38853838065194535],
 'fscore(valid, test)': [0.8030032177332855, 0.47152655925985004]}

<p><a name="task_5"></a></p>

### Task 5: create features based on the ratio of D15 to a computed statistic. The statistic is the mean / standard deviation of D15, grouped by card1 - card6, addr1, addr2, and by the features created in Task 2.

In [25]:
add_stat_features = StatFeatures('D15',
                                 group_keys = transform_feature_names,
                                 agg_funcs = ['mean', 'std'])

train_data = add_stat_features.fit_transform(train_data)
test_data = add_stat_features.fit_transform(test_data)
train_data.head(3)

Unnamed: 0,TransactionID,TransactionDT,TransactionAmt,ProductCD,card1,card2,card3,card4,card5,card6,...,D15 / stdD15By_card2,D15 / stdD15By_card3,D15 / stdD15By_card4,D15 / stdD15By_card5,D15 / stdD15By_card6,D15 / stdD15By_addr1,D15 / stdD15By_addr2,D15 / stdD15By_card12,D15 / stdD15By_card1235,D15 / stdD15By_card1235_addr12
0,2987000,86400,68.5,W,13926,,150.0,discover,142.0,credit,...,,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,
1,2987001,86401,29.0,W,2755,404.0,150.0,mastercard,102.0,credit,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,2987002,86469,59.0,W,4663,490.0,150.0,visa,166.0,debit,...,1.650909,1.690476,1.720592,1.93627,1.717345,1.657848,1.690481,1.954574,1.954574,3.17756


In [26]:
res_5 = run_model(train_data, train_target, test_data, test_target)
res_5

[100]	training's auc: 0.997258	valid_1's auc: 0.953727
[200]	training's auc: 0.999685	valid_1's auc: 0.957027
[300]	training's auc: 0.999991	valid_1's auc: 0.956473


{'cv_results': ['0.8081 +/- 0.157'],
 'roc_auc(valid, test)': [0.9571216780347669, 0.8312631560146133],
 'precision(valid, test)': [0.8458049886621315, 0.6301305970149254],
 'recall(valid, test)': [0.7323298429319371, 0.35515247108307046],
 'fscore(valid, test)': [0.7849877236057523, 0.4542703429724277]}

<p><a name="task_6"></a></p>

### Task 6: extract the fractional part and the integer part of TransactionAmt into two separate features. Then create one more feature: the logarithm of TransactionAmt.


In [27]:
def num_frac_encoder(data, key):
    
    data = data.copy()
    data[f'{key}_int'] = data[key]//1
    data[f'{key}_frac'] = data[key]%1
    data[f'{key}_log'] = np.log2(data[key])

    return data



class NumFracEncoder(BaseEstimator, TransformerMixin):

    def __init__(self, key: str):
        self.key = key
        
      
    def fit_transform(self, data: pd.DataFrame)->pd.DataFrame:
        data[f'{self.key}_int'] = data[self.key]//1
        data[f'{self.key}_frac'] = data[self.key]%1
        data[f'{self.key}_log'] = np.log2(data[self.key])

        return data
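
A small worked example of the same split on three amounts from the tables in this notebook. The sketch uses `np.modf`, which for non-negative amounts gives the same parts as `// 1` and `% 1`; `split_amount` is a hypothetical helper:

```python
import numpy as np
import pandas as pd

def split_amount(amount: pd.Series) -> pd.DataFrame:
    """Integer part, fractional part and base-2 logarithm of an amount column."""
    values = amount.to_numpy(dtype=np.float64)  # do the arithmetic in double precision
    frac, whole = np.modf(values)               # modf returns (fractional, integral)
    return pd.DataFrame({"int": whole, "frac": frac, "log": np.log2(values)},
                        index=amount.index)

amounts = pd.Series([68.5, 29.0, 59.0], name="TransactionAmt")
amount_parts = split_amount(amounts)
print(amount_parts)
```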

In [28]:
num_frac_features = NumFracEncoder('TransactionAmt')

train_data = num_frac_features.fit_transform(train_data)
test_data = num_frac_features.fit_transform(test_data)
train_data.head(3)

Unnamed: 0,TransactionID,TransactionDT,TransactionAmt,ProductCD,card1,card2,card3,card4,card5,card6,...,D15 / stdD15By_card5,D15 / stdD15By_card6,D15 / stdD15By_addr1,D15 / stdD15By_addr2,D15 / stdD15By_card12,D15 / stdD15By_card1235,D15 / stdD15By_card1235_addr12,TransactionAmt_int,TransactionAmt_frac,TransactionAmt_log
0,2987000,86400,68.5,W,13926,,150.0,discover,142.0,credit,...,0.0,0.0,0.0,0.0,0.0,0.0,,68.0,0.5,6.097656
1,2987001,86401,29.0,W,2755,404.0,150.0,mastercard,102.0,credit,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,29.0,0.0,4.859375
2,2987002,86469,59.0,W,4663,490.0,150.0,visa,166.0,debit,...,1.93627,1.717345,1.657848,1.690481,1.954574,1.954574,3.17756,59.0,0.0,5.882812


In [29]:
res_6 = run_model(train_data, train_target, test_data, test_target)
res_6

[100]	training's auc: 0.997656	valid_1's auc: 0.95324
[200]	training's auc: 0.99991	valid_1's auc: 0.956637
[300]	training's auc: 0.999994	valid_1's auc: 0.958302
[400]	training's auc: 1	valid_1's auc: 0.958917
[500]	training's auc: 1	valid_1's auc: 0.959898
[600]	training's auc: 1	valid_1's auc: 0.960315
[700]	training's auc: 1	valid_1's auc: 0.960538
[800]	training's auc: 1	valid_1's auc: 0.960694
[900]	training's auc: 1	valid_1's auc: 0.960951
[1000]	training's auc: 1	valid_1's auc: 0.960877


{'cv_results': ['0.8033 +/- 0.157'],
 'roc_auc(valid, test)': [0.9591660727436689, 0.8345476841265984],
 'precision(valid, test)': [0.9179916317991632, 0.6358695652173914],
 'recall(valid, test)': [0.7179319371727748, 0.36908517350157727],
 'fscore(valid, test)': [0.8057289753947852, 0.46706586826347307]}

In [32]:
results = pd.DataFrame([res_0, res_1, res_2, res_3, res_4, res_5, res_6],
                       index=['baseline', 'add_datatime_features',
                              'add_concate_card_addr_features', 
                              'frequenc_encoder_feaature',
                              'add_stat_features_TransactionAmt',
                              'add_stat_features_D15',
                              'add_num_frac_features_TransactionAmt'])


Sorted by ROC AUC

In [33]:
results.sort_values(by='roc_auc(valid, test)', ascending=False)

Unnamed: 0,cv_results,"roc_auc(valid, test)","precision(valid, test)","recall(valid, test)","fscore(valid, test)"
add_num_frac_features_TransactionAmt,[0.8033 +/- 0.157],"[0.9591660727436689, 0.8345476841265984]","[0.9179916317991632, 0.6358695652173914]","[0.7179319371727748, 0.36908517350157727]","[0.8057289753947852, 0.46706586826347307]"
add_stat_features_TransactionAmt,[0.8043 +/- 0.157],"[0.9584874764314094, 0.8320139866548608]","[0.8849487785657998, 0.5995943204868154]","[0.7349476439790575, 0.38853838065194535]","[0.8030032177332855, 0.47152655925985004]"
baseline,[0.8546 +/- 0.062],"[0.9583620691289452, 0.8413296629822693]","[0.8883291351805206, 0.6845151953690304]","[0.6924083769633508, 0.37302839116719244]","[0.7782272894446488, 0.482899438489025]"
add_concate_card_addr_features,[0.805 +/- 0.157],"[0.9578622109620476, 0.833840309482774]","[0.895367412140575, 0.6180150125104253]","[0.7336387434554974, 0.38958990536277605]","[0.8064748201438848, 0.47791035149951633]"
add_datatime_features,[0.7732 +/- 0.161],"[0.9572710941942418, 0.8368506811409077]","[0.9094076655052264, 0.6567855488652153]","[0.6832460732984293, 0.37276550998948477]","[0.780269058295964, 0.47559953043769915]"
add_stat_features_D15,[0.8081 +/- 0.157],"[0.9571216780347669, 0.8312631560146133]","[0.8458049886621315, 0.6301305970149254]","[0.7323298429319371, 0.35515247108307046]","[0.7849877236057523, 0.4542703429724277]"
frequenc_encoder_feaature,[0.8012 +/- 0.158],"[0.9571069484450891, 0.8324991596558006]","[0.8964110929853181, 0.6466789667896679]","[0.7192408376963351, 0.3685594111461619]","[0.7981118373275236, 0.46952444742129934]"


Sorted by cross-validation score

In [34]:
results.sort_values(by='cv_results', ascending=False)

Unnamed: 0,cv_results,"roc_auc(valid, test)","precision(valid, test)","recall(valid, test)","fscore(valid, test)"
baseline,[0.8546 +/- 0.062],"[0.9583620691289452, 0.8413296629822693]","[0.8883291351805206, 0.6845151953690304]","[0.6924083769633508, 0.37302839116719244]","[0.7782272894446488, 0.482899438489025]"
add_stat_features_D15,[0.8081 +/- 0.157],"[0.9571216780347669, 0.8312631560146133]","[0.8458049886621315, 0.6301305970149254]","[0.7323298429319371, 0.35515247108307046]","[0.7849877236057523, 0.4542703429724277]"
add_concate_card_addr_features,[0.805 +/- 0.157],"[0.9578622109620476, 0.833840309482774]","[0.895367412140575, 0.6180150125104253]","[0.7336387434554974, 0.38958990536277605]","[0.8064748201438848, 0.47791035149951633]"
add_stat_features_TransactionAmt,[0.8043 +/- 0.157],"[0.9584874764314094, 0.8320139866548608]","[0.8849487785657998, 0.5995943204868154]","[0.7349476439790575, 0.38853838065194535]","[0.8030032177332855, 0.47152655925985004]"
add_num_frac_features_TransactionAmt,[0.8033 +/- 0.157],"[0.9591660727436689, 0.8345476841265984]","[0.9179916317991632, 0.6358695652173914]","[0.7179319371727748, 0.36908517350157727]","[0.8057289753947852, 0.46706586826347307]"
frequenc_encoder_feaature,[0.8012 +/- 0.158],"[0.9571069484450891, 0.8324991596558006]","[0.8964110929853181, 0.6466789667896679]","[0.7192408376963351, 0.3685594111461619]","[0.7981118373275236, 0.46952444742129934]"
add_datatime_features,[0.7732 +/- 0.161],"[0.9572710941942418, 0.8368506811409077]","[0.9094076655052264, 0.6567855488652153]","[0.6832460732984293, 0.37276550998948477]","[0.780269058295964, 0.47559953043769915]"


Sorted by F-score

In [35]:
results.sort_values(by='fscore(valid, test)', ascending=False)

Unnamed: 0,cv_results,"roc_auc(valid, test)","precision(valid, test)","recall(valid, test)","fscore(valid, test)"
add_concate_card_addr_features,[0.805 +/- 0.157],"[0.9578622109620476, 0.833840309482774]","[0.895367412140575, 0.6180150125104253]","[0.7336387434554974, 0.38958990536277605]","[0.8064748201438848, 0.47791035149951633]"
add_num_frac_features_TransactionAmt,[0.8033 +/- 0.157],"[0.9591660727436689, 0.8345476841265984]","[0.9179916317991632, 0.6358695652173914]","[0.7179319371727748, 0.36908517350157727]","[0.8057289753947852, 0.46706586826347307]"
add_stat_features_TransactionAmt,[0.8043 +/- 0.157],"[0.9584874764314094, 0.8320139866548608]","[0.8849487785657998, 0.5995943204868154]","[0.7349476439790575, 0.38853838065194535]","[0.8030032177332855, 0.47152655925985004]"
frequenc_encoder_feaature,[0.8012 +/- 0.158],"[0.9571069484450891, 0.8324991596558006]","[0.8964110929853181, 0.6466789667896679]","[0.7192408376963351, 0.3685594111461619]","[0.7981118373275236, 0.46952444742129934]"
add_stat_features_D15,[0.8081 +/- 0.157],"[0.9571216780347669, 0.8312631560146133]","[0.8458049886621315, 0.6301305970149254]","[0.7323298429319371, 0.35515247108307046]","[0.7849877236057523, 0.4542703429724277]"
add_datatime_features,[0.7732 +/- 0.161],"[0.9572710941942418, 0.8368506811409077]","[0.9094076655052264, 0.6567855488652153]","[0.6832460732984293, 0.37276550998948477]","[0.780269058295964, 0.47559953043769915]"
baseline,[0.8546 +/- 0.062],"[0.9583620691289452, 0.8413296629822693]","[0.8883291351805206, 0.6845151953690304]","[0.6924083769633508, 0.37302839116719244]","[0.7782272894446488, 0.482899438489025]"


In [36]:
results.to_csv('hw5_results.csv') 
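
Each cell of `results` holds a list (`[valid, test]`) or a one-element list with a string, so the sorts above compare lists element by element and, for `cv_results`, compare text. A minimal sketch that expands the table into numeric columns so it can be sorted by any single score; `flatten_results` is a hypothetical helper, and the two rows are rounded values from the outputs above:

```python
import pandas as pd

def flatten_results(results: pd.DataFrame) -> pd.DataFrame:
    """Turn list-valued cells into separate numeric columns."""
    flat = pd.DataFrame(index=results.index)
    # 'cv_results' cells look like ['0.8546 +/- 0.062'].
    cv_parts = results["cv_results"].map(lambda item: item[0].split(" +/- "))
    flat["cv_mean"] = cv_parts.map(lambda pair: float(pair[0]))
    flat["cv_std"] = cv_parts.map(lambda pair: float(pair[1]))
    # Other cells look like [valid_score, test_score].
    for column in results.columns.drop("cv_results"):
        name = column.split("(")[0]
        flat[f"{name}_valid"] = results[column].map(lambda pair: pair[0])
        flat[f"{name}_test"] = results[column].map(lambda pair: pair[1])
    return flat

toy_results = pd.DataFrame(
    [{"cv_results": ["0.8546 +/- 0.062"], "roc_auc(valid, test)": [0.9584, 0.8413]},
     {"cv_results": ["0.7732 +/- 0.161"], "roc_auc(valid, test)": [0.9573, 0.8369]}],
    index=["baseline", "add_datatime_features"],
)

flat = flatten_results(toy_results)
print(flat.sort_values("roc_auc_test", ascending=False))
```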

<p><a name="task_7"></a></p>

### Task 7 (optional): perform preliminary preparation / cleaning of the P_emaildomain and R_emaildomain features (what to do and how is up to you) and apply Frequency Encoding to the cleaned features.
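
One possible approach, shown only as a sketch: normalize the domain text, collapse domains to a provider by their first label, and frequency-encode the result with a table learned on the training data. The provider mapping, the helper name and the toy domains are all assumptions made for illustration, not results from this dataset:

```python
import pandas as pd

# Hypothetical, deliberately short provider mapping (first label of the domain).
PROVIDERS = {"gmail": "google", "googlemail": "google",
             "yahoo": "yahoo", "ymail": "yahoo",
             "hotmail": "microsoft", "outlook": "microsoft", "live": "microsoft"}

def clean_email_domain(domain: pd.Series, missing="unknown", other="other") -> pd.Series:
    """Lower-case, mark missing values, and collapse domains to a provider."""
    text = domain.astype(object).where(domain.notna(), missing).astype(str)
    text = text.str.strip().str.lower()

    def to_provider(value):
        if value == missing:
            return missing
        return PROVIDERS.get(value.split(".")[0], other)

    return text.map(to_provider)

train_domain = pd.Series(["gmail.com", "Yahoo.com.mx", None, "gmail.com"])
test_domain = pd.Series(["hotmail.es", "gmail.com"])

train_clean = clean_email_domain(train_domain)
test_clean = clean_email_domain(test_domain)

# Frequency table from the training data; providers unseen there get 0.0.
freq = train_clean.value_counts(normalize=True).to_dict()
train_freq = train_clean.map(freq)
test_freq = test_clean.map(freq).fillna(0.0)
print(test_clean.tolist(), test_freq.tolist())
```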