In this notebook we prepare the main validation and training sets. The models are trained on them, and the additional feature sets are later joined to them.

In [1]:
import numpy as np
import pandas as pd
import gc
import holidays

In [2]:
#load the extended dataset produced by the previous notebook
df = pd.read_csv("../input/alfabattle-1-parq/alfa1_train_expend.csv")

Sort the data by client and time, and cast the time column to a datetime type.

In [3]:
df = df.sort_values(by=['client_pin', 'timestamp'])

In [4]:
df['timestamp'] = pd.to_datetime(df['timestamp'])
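
Sorting before the conversion gives the right order only if the raw timestamp strings happen to sort chronologically (for example, zero-padded ISO 8601). A minimal sketch on invented data, assuming that is not guaranteed, parses first and sorts afterwards:

```python
import pandas as pd

# Toy frame; the values are invented for illustration only.
toy = pd.DataFrame({
    "client_pin": ["a", "a", "b"],
    "timestamp": ["2020-10-02 09:00:00", "2020-10-01 23:30:00", "2020-09-30 12:00:00"],
})

# Parse first, so the sort compares real datetimes rather than strings.
toy["timestamp"] = pd.to_datetime(toy["timestamp"])
toy = toy.sort_values(by=["client_pin", "timestamp"]).reset_index(drop=True)
```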

In [5]:
#from here on we work with copies of the main dataframe
df_temp = df.copy()

In [6]:
#create a helper dataframe that collects per-client statistics
df_stat = pd.DataFrame()
df_stat['client_pin'] = df_temp['client_pin'].unique()

In [7]:
#get the id of each client's last session so it can be excluded when building the validation and training sets
last_ses = df_temp.groupby('client_pin')['session_id'].tail(1).values
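
To make the exclusion step concrete, here is a small sketch on made-up data. It assumes, as the cell above does, that rows are already sorted by client and time and that session ids are not shared between clients:

```python
import pandas as pd

# Toy frame; client and session ids are invented.
toy = pd.DataFrame({
    "client_pin": ["a", "a", "a", "b", "b"],
    "session_id": ["s1", "s1", "s2", "s3", "s4"],
})

# Last row per client gives the id of that client's final session.
last_ses_toy = toy.groupby("client_pin")["session_id"].tail(1).values

# Rows that remain once every final session is masked out.
history = toy.loc[~toy["session_id"].isin(last_ses_toy)]
```

Only `history` is used for the statistics, so nothing from the session being predicted leaks into the features.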

In [8]:
#get each client's most frequent action (last sessions excluded)
df_stat = df_stat.merge(df_temp.loc[~df_temp['session_id'].isin(last_ses)].groupby(['client_pin'])['multi_class_target'].agg(
    lambda x: pd.Series.mode(x)[0]).rename('most_popular'), how='left', on='client_pin')
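
A short sketch of the same aggregation on invented data (the class names are hypothetical). Note that `mode()` returns its values sorted, so taking element `[0]` resolves ties in favour of the first value in sort order:

```python
import pandas as pd

# Toy frame; class names are placeholders, not the real target labels.
toy = pd.DataFrame({
    "client_pin": ["a", "a", "a", "b", "b"],
    "multi_class_target": ["main_screen", "invest", "main_screen", "credit_info", "invest"],
})

most_popular = (
    toy.groupby("client_pin")["multi_class_target"]
    .agg(lambda x: x.mode()[0])  # first mode; ties are broken by sort order
    .rename("most_popular")
)
```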

In [9]:
#merge in the time of the client's future action (the last record of each client)
df_temp = df_temp.merge(df_temp[['client_pin', 'timestamp']].groupby(['client_pin']).tail(1).rename({'timestamp':'next_time'}, axis=1), how='left', on='client_pin')

In [10]:
#time difference in hours between each action and the future action
df_temp['timedelta_act'] = (df_temp['next_time'] - df_temp['timestamp']) / np.timedelta64(1, 'h')
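
Dividing a timedelta by `np.timedelta64(1, 'h')` converts it to a float number of hours. A tiny illustration with invented timestamps:

```python
import numpy as np
import pandas as pd

# Two past actions and the (invented) time of the future action.
ts = pd.Series(pd.to_datetime(["2020-10-01 10:00:00", "2020-10-01 11:30:00"]))
next_time = pd.Timestamp("2020-10-01 13:00:00")

# Timedelta / one hour -> float hours.
hours = (next_time - ts) / np.timedelta64(1, "h")
```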

In [11]:
#add to the statistics the time between the client's last known action and the future action
df_stat = df_stat.merge(df_temp[['client_pin', 'timedelta_act']].loc[~df_temp['session_id'].isin(last_ses)].groupby('client_pin').tail(1), how='left', on='client_pin')

In [12]:
#for every target class, compute the time between the future action and the last occurrence of that class,
#the mean time between consecutive occurrences of that class, and the absolute error between
#that mean and the current gap. Missing values are filled with a deliberately large number.
for action in df_temp['multi_class_target'].unique():
    #history (last sessions excluded) restricted to the current class; work on an explicit copy
    mask = (~df_temp['session_id'].isin(last_ses)) & (df_temp['multi_class_target'] == action)
    df_time = df_temp.loc[mask].copy()
    #time of the client's next occurrence of the same class
    df_time['next_time_act'] = df_time.groupby('client_pin')['timestamp'].shift(-1)
    df_time.dropna(inplace=True)
    df_time[f'mean_timedelta_{action}'] = (df_time['next_time_act'] - df_time['timestamp']) / np.timedelta64(1, 'h')
    df_time = df_time[['client_pin', f'mean_timedelta_{action}']].groupby('client_pin').mean()
    df_stat = df_stat.merge(df_time, how='left', on='client_pin')
    #hours between the last occurrence of this class and the future action
    last_act = df_temp.loc[mask].groupby('client_pin').tail(1)[['client_pin', 'timedelta_act']]
    df_stat = df_stat.merge(last_act.rename(columns={'timedelta_act': f'timedelta_{action}'}),
                            how='left', on='client_pin')
    df_stat[f'err_timedelta_{action}'] = np.abs(df_stat[f'mean_timedelta_{action}'] - df_stat[f'timedelta_{action}'])
    #assign the result back instead of calling fillna(inplace=True) on a column selection
    for col in (f'timedelta_{action}', f'mean_timedelta_{action}', f'err_timedelta_{action}'):
        df_stat[col] = df_stat[col].fillna(10000)
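
The three per-class features can be traced on a toy example. This is a simplified sketch for a single client and a single class with invented timestamps, not the notebook's exact code path:

```python
import numpy as np
import pandas as pd

# Three past occurrences of one class for one client (invented values).
toy = pd.DataFrame({
    "client_pin": ["a", "a", "a"],
    "timestamp": pd.to_datetime(["2020-10-01 00:00", "2020-10-01 02:00", "2020-10-01 06:00"]),
})
next_time = pd.Timestamp("2020-10-01 11:00")  # hypothetical time of the action to predict

# Gap in hours to the next occurrence of the same class (NaN for the last one).
toy["gap_h"] = (toy.groupby("client_pin")["timestamp"].shift(-1) - toy["timestamp"]) / np.timedelta64(1, "h")
mean_gap = toy.groupby("client_pin")["gap_h"].mean()

# Hours elapsed between the last occurrence and the future action.
since_last = (next_time - toy.groupby("client_pin")["timestamp"].max()) / np.timedelta64(1, "h")

# How far the current gap deviates from the client's usual rhythm.
err = (mean_gap - since_last).abs()
```

A small `err` means the class is "due" according to the client's usual interval, which is presumably why the feature is informative.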

In [13]:
#list of columns that come from the parquet files
from_parq = ['application_id', 'event_type', 'event_category', 'event_name',
       'event_label', 'device_screen_name', 'timezone', 'device_is_webview',
       'page_urlhost', 'page_urlpath_full', 'net_connection_type',
       'net_connection_tech', 'timedelta', 'count']

In [14]:
#add the most recent parquet-derived values to the statistics
df_stat = df_stat.merge(df_temp.loc[~df_temp['session_id'].isin(last_ses)].groupby('client_pin').tail(1)[from_parq+['client_pin']], 
                            how='left', on='client_pin')

In [15]:
#build a dataset of the last 35 lags of each client's actions;
#clients that lack the last or the second-to-last record (lag_1, lag_2) are dropped
exp_df = df[['client_pin', 'multi_class_target']].copy()
exp_df['row_number'] = exp_df.groupby('client_pin').cumcount() + 1
exp_df = exp_df.groupby('client_pin').tail(35).copy()
#re-number within each client so that 1 is the most recent action
exp_df['row_number'] = exp_df.groupby('client_pin')['row_number'].rank(ascending=False).astype('int32')
exp_df.set_index(['client_pin', 'row_number'], inplace=True)
#empty frame with one row per (client, lag) pair, so every client gets all 35 lag slots
pins = df['client_pin'].unique()
lags = list(range(1, 36))
index = pd.MultiIndex.from_product([pins, lags], names=['client_pin', 'row_number'])
df_lag = pd.DataFrame({'total': 0}, index=index)
df_lag = df_lag.merge(exp_df, how='left', on=['client_pin', 'row_number'])
df_lag.drop(['total'], axis=1, inplace=True)
#move the lag number from the index to the columns: lag_1 ... lag_35
df_lag = df_lag.unstack().add_prefix('lag_').rename_axis([None, None], axis=1)
df_lag.columns = df_lag.columns.droplevel(0)
df_lag = df_lag[df_lag['lag_2'].notnull()]
df_lag = df_lag[df_lag['lag_1'].notnull()]
df_lag['weight'] = 1
df_lag.reset_index(inplace=True)
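
The long-to-wide reshaping above can also be pictured with a shorter formulation. The sketch below is an alternative written for illustration (it uses `cumcount` and `pivot` rather than the merge and `unstack` of the cell above), with two lags and invented data:

```python
import pandas as pd

# Toy frame, already sorted by client and time; values are invented.
toy = pd.DataFrame({
    "client_pin": ["a", "a", "a", "b"],
    "multi_class_target": ["x", "y", "z", "x"],
})
n_lags = 2  # the notebook uses 35

# lag 1 = most recent action, lag 2 = the one before it, and so on.
toy["lag"] = toy.groupby("client_pin").cumcount(ascending=False) + 1

wide = (
    toy[toy["lag"] <= n_lags]
    .pivot(index="client_pin", columns="lag", values="multi_class_target")
    .add_prefix("lag_")
)

# Clients without both of the two most recent actions are dropped.
wide = wide.dropna(subset=["lag_1", "lag_2"]).reset_index()
```

Client `b` has a single record, so it has no `lag_2` and is removed, which mirrors the two `notnull` filters above.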

In [16]:
#join the lags with the rest of the statistics; this gives a validation set of about 77k rows
df_lag = df_lag.merge(df_stat, how='left', on='client_pin')

In [17]:
df_lag.to_csv('alfa1_df_valid6.csv', index=False)

In [18]:
#next we repeat the same procedure, each time dropping the client's last action to step further back in time.
#We do this 64 times, which yields a training set of roughly 2.8M rows. The models were first trained on 35 shifts;
#64 shifts gave a gain, but not a dramatic one, so we decided not to train on all the data for the sake of development speed.
#Another key point is that rows are weighted according to the shift they come from.
concat_list = []
for i in range(1, 65):
    #explicit copy, so the column assignments below do not write into a slice of df
    exp_df = df[['session_id', 'client_pin', 'multi_class_target']].copy()
    #shift the targets i steps back within each client
    exp_df['multi_class_target'] = df.groupby('client_pin')['multi_class_target'].shift(i)
    exp_df['row_number'] = exp_df.groupby(['client_pin']).cumcount()+1
    exp_df.drop(['session_id'], axis=1, inplace=True)
    exp_df = exp_df.groupby(['client_pin']).tail(35).copy()
    #re-number within each client so that 1 is the most recent action
    exp_df['row_number'] = exp_df.groupby('client_pin')['row_number'].rank(ascending=False).astype('int32')
    exp_df.set_index(['client_pin', 'row_number'], inplace=True)
    df_lag = pd.DataFrame()
    df_lag['total'] = [0]*len(df['client_pin'].unique())*35
    pins = df['client_pin'].unique()
    lags = list(range(1, 36))
    index = pd.MultiIndex.from_product([pins, lags], names=['client_pin', 'row_number'])
    df_lag.set_index(index, inplace=True)
    df_lag = df_lag.merge(exp_df, how='left', on=['client_pin', 'row_number'])
    df_lag.drop(['total'], axis=1, inplace=True)
    df_lag = df_lag.unstack().add_prefix('lag_').rename_axis([None, None], axis=1)
    df_lag.columns = df_lag.columns.droplevel(0)
    df_lag = df_lag[df_lag['lag_2'].notnull()]
    df_lag = df_lag[df_lag['lag_1'].notnull()]
    df_lag.reset_index(inplace=True)
    
    if (i >= 1) and (i <= 5):
        df_lag['weight'] = 1
    elif i <= 10:
        df_lag['weight'] = 6/7
    elif i <= 15:
        df_lag['weight'] = 5/7
    elif i <= 20:
        df_lag['weight'] = 4/7
    elif i <= 25:
        df_lag['weight'] = 3/7
    elif i <= 30:
        df_lag['weight'] = 2/7
    else:
        df_lag['weight'] = 1/7
        
        
    df_temp = df.copy()
    cols = df_temp.drop(['client_pin'], axis=1).columns
    df_temp[cols] = df.groupby('client_pin')[cols].shift(i)
    df_temp.dropna(inplace=True)

    df_stat = pd.DataFrame()
    df_stat['client_pin'] = df_temp['client_pin'].unique()
    last_ses = df_temp.groupby('client_pin')['session_id'].tail(1).values
    df_stat = df_stat.merge(df_temp.loc[~df_temp['session_id'].isin(last_ses)].groupby(['client_pin'])['multi_class_target'].agg(lambda x: pd.Series.mode(x)[0]).rename('most_popular'), how='left', on='client_pin')
    
    df_temp = df_temp.merge(df_temp[['client_pin', 'timestamp']].groupby(['client_pin']).tail(1).rename({'timestamp':'next_time'}, axis=1), how='left', on='client_pin')
    
    df_temp['timedelta_act'] = (df_temp['next_time'] - df_temp['timestamp']) / np.timedelta64(1, 'h')
    
    df_stat = df_stat.merge(df_temp[['client_pin', 'timedelta_act']].loc[~df_temp['session_id'].isin(last_ses)].groupby('client_pin').tail(1), how='left', on='client_pin')
    
    for action in df_temp['multi_class_target'].unique():
        #history (last sessions excluded) restricted to the current class; work on an explicit copy
        mask = (~df_temp['session_id'].isin(last_ses)) & (df_temp['multi_class_target'] == action)
        df_time = df_temp.loc[mask].copy()
        df_time['next_time_act'] = df_time.groupby('client_pin')['timestamp'].shift(-1)
        df_time.dropna(inplace=True)
        df_time[f'mean_timedelta_{action}'] = (df_time['next_time_act'] - df_time['timestamp']) / np.timedelta64(1, 'h')
        df_time = df_time[['client_pin', f'mean_timedelta_{action}']].groupby('client_pin').mean()
        df_stat = df_stat.merge(df_time, how='left', on='client_pin')
        last_act = df_temp.loc[mask].groupby('client_pin').tail(1)[['client_pin', 'timedelta_act']]
        df_stat = df_stat.merge(last_act.rename(columns={'timedelta_act': f'timedelta_{action}'}),
                                how='left', on='client_pin')
        df_stat[f'err_timedelta_{action}'] = np.abs(df_stat[f'mean_timedelta_{action}'] - df_stat[f'timedelta_{action}'])
        #assign the result back instead of calling fillna(inplace=True) on a column selection
        for col in (f'timedelta_{action}', f'mean_timedelta_{action}', f'err_timedelta_{action}'):
            df_stat[col] = df_stat[col].fillna(10000)
        
    df_stat = df_stat.merge(df_temp.loc[~df_temp['session_id'].isin(last_ses)].groupby('client_pin').tail(1)[from_parq+['client_pin']], 
                            how='left', on='client_pin')
    
    df_lag = df_lag.merge(df_stat, how='left', on='client_pin')
    
    
        
    concat_list.append(df_lag)
    del exp_df
    del df_lag
    del df_stat
    del df_temp
    gc.collect()
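
Two ideas drive the loop above: shifting each client's history back by `i` steps, and down-weighting samples from older shifts. A compact sketch of both follows; the helper `shift_weight` is a hypothetical name introduced here, written to reproduce the if/elif tiers, and the data is invented:

```python
import pandas as pd


def shift_weight(i: int) -> float:
    """Tiered weight for shift i (1-based), matching the if/elif chain in the loop."""
    tier = min((i - 1) // 5, 6)  # 0 for shifts 1-5, 1 for 6-10, ..., 6 for 31 and above
    return (7 - tier) / 7


# Toy frame, sorted by client and time; values are invented.
toy = pd.DataFrame({
    "client_pin": ["a", "a", "a", "b", "b"],
    "multi_class_target": ["x", "y", "z", "x", "y"],
})

# Shifting by 1 within each client hides the most recent action:
# the last row of every client now holds the previous action.
shifted = toy.groupby("client_pin")["multi_class_target"].shift(1)
```

With the real data, each shift produces one more "as of an earlier moment" snapshot per client, and `shift_weight(i)` would be the value written to the `weight` column for that snapshot.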

In [None]:
#concatenate and save

In [19]:
df_all = pd.concat(concat_list).reset_index(drop=True)

In [20]:
df_all.to_csv('alfa1_df_train6.csv', index=False)