## Введение (1)

Учебный проект состоит из нескольких разделов (блокнотов), в которых последовательно обрабатываются данные, предоставленные организаторами соревнования Catch Me If You Can ("Alice") на платформе Kaggle. В проект входят следующие разделы:
* 1. Введение, подготовка датасетов
* 2. Исследование данных
* 3. Обучение моделей
* 4. Дополнительная глава

При изучении данных материалов важно помнить, что это мой первый опыт как программирования, так и анализа данных :) Буду благодарен каждому совету и указанию на ошибки!

<span style="color:red">Цель</span> учебного проекта заключается в получении навыков обработки и визуализации данных, в изучении основ машинного обучения, а также в знакомстве с площадкой Kaggle.

<span style="color:red">Задача</span>, поставленная передо мной и другими участниками соревнования - задача бинарной классификации сессий пользователей. Класс "1" - сессия Alice, класс "0" - все остальные.

### О данных

In [1]:
import warnings
warnings.filterwarnings('ignore')
from glob import glob
import os
from tqdm.notebook import tqdm
import numpy as np
import pandas as pd
import scipy.sparse as sp
import pickle

PATH_TO_TRAIN = os.path.join('initial_data', 'train_sessions')
PATH_TO_TEST = os.path.join('initial_data', 'test_sessions')
PATH_TO_DICT = os.path.join('initial_data', 'site_dict')
PATH_TO_DATASET = os.path.join('intermediate_data', 'test_train')
PATH_TO_TRAIN_RAW = os.path.join('initial_data', 'user_logs')

Список исходных данных:
* Сырые данные в train.zip (распаковано в папку user_logs)
Сырые данные представляют из себя набор (1558 файлов, по одному на пользователя) csv-файлов. Каждый файл содержит таблицу из двух столбцов с информацией о последовательности посещённых пользователем сайтов. Например, так выглядят первые три строки файла user_logs/user0002.csv:
```
timestamp,site
2013-11-29 08:14:18,fpdownload2.macromedia.com
2013-11-29 08:14:26,hotmail.fr
...
```

* Подготовленный организаторами соревнования на Kaggle тренировочный набор сессий (сессия - последовательность из 10 сайтов, ограниченная по времени 30 минутами) в train_sessions/train_sessions.csv:

In [2]:
train_df = pd.read_csv(os.path.join(PATH_TO_TRAIN, 'train_sessions.csv'), 
            index_col='session_id', parse_dates=['time' + str(j) for j in range(1,11)])
train_df

Unnamed: 0_level_0,site1,time1,site2,time2,site3,time3,site4,time4,site5,time5,...,time6,site7,time7,site8,time8,site9,time9,site10,time10,target
session_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
1,718,2014-02-20 10:02:45,,NaT,,NaT,,NaT,,NaT,...,NaT,,NaT,,NaT,,NaT,,NaT,0
2,890,2014-02-22 11:19:50,941.0,2014-02-22 11:19:50,3847.0,2014-02-22 11:19:51,941.0,2014-02-22 11:19:51,942.0,2014-02-22 11:19:51,...,2014-02-22 11:19:51,3847.0,2014-02-22 11:19:52,3846.0,2014-02-22 11:19:52,1516.0,2014-02-22 11:20:15,1518.0,2014-02-22 11:20:16,0
3,14769,2013-12-16 16:40:17,39.0,2013-12-16 16:40:18,14768.0,2013-12-16 16:40:19,14769.0,2013-12-16 16:40:19,37.0,2013-12-16 16:40:19,...,2013-12-16 16:40:19,14768.0,2013-12-16 16:40:20,14768.0,2013-12-16 16:40:21,14768.0,2013-12-16 16:40:22,14768.0,2013-12-16 16:40:24,0
4,782,2014-03-28 10:52:12,782.0,2014-03-28 10:52:42,782.0,2014-03-28 10:53:12,782.0,2014-03-28 10:53:42,782.0,2014-03-28 10:54:12,...,2014-03-28 10:54:42,782.0,2014-03-28 10:55:12,782.0,2014-03-28 10:55:42,782.0,2014-03-28 10:56:12,782.0,2014-03-28 10:56:42,0
5,22,2014-02-28 10:53:05,177.0,2014-02-28 10:55:22,175.0,2014-02-28 10:55:22,178.0,2014-02-28 10:55:23,177.0,2014-02-28 10:55:23,...,2014-02-28 10:55:59,175.0,2014-02-28 10:55:59,177.0,2014-02-28 10:55:59,177.0,2014-02-28 10:57:06,178.0,2014-02-28 10:57:11,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
253557,3474,2013-11-25 10:26:54,3474.0,2013-11-25 10:26:58,141.0,2013-11-25 10:27:03,2428.0,2013-11-25 10:27:04,106.0,2013-11-25 10:27:13,...,2013-11-25 10:27:16,2428.0,2013-11-25 10:27:28,2428.0,2013-11-25 10:27:40,2428.0,2013-11-25 10:27:52,148.0,2013-11-25 10:27:53,0
253558,12727,2013-03-12 16:01:15,12727.0,2013-03-12 16:01:16,2215.0,2013-03-12 16:01:16,38.0,2013-03-12 16:01:17,2215.0,2013-03-12 16:01:17,...,2013-03-12 16:01:17,25444.0,2013-03-12 16:01:18,2215.0,2013-03-12 16:01:18,23.0,2013-03-12 16:01:18,21.0,2013-03-12 16:01:18,0
253559,2661,2013-09-12 14:05:03,15004.0,2013-09-12 14:05:10,5562.0,2013-09-12 14:05:10,5562.0,2013-09-12 14:06:29,5562.0,2013-09-12 14:06:30,...,NaT,,NaT,,NaT,,NaT,,NaT,0
253560,812,2013-12-19 15:20:22,676.0,2013-12-19 15:20:22,814.0,2013-12-19 15:20:22,22.0,2013-12-19 15:20:22,39.0,2013-12-19 15:20:22,...,2013-12-19 15:20:23,814.0,2013-12-19 15:20:23,570.0,2013-12-19 15:20:23,22.0,2013-12-19 15:20:24,570.0,2013-12-19 15:20:24,0


In [22]:
train_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 253561 entries, 1 to 253561
Data columns (total 21 columns):
 #   Column  Non-Null Count   Dtype         
---  ------  --------------   -----         
 0   site1   253561 non-null  int64         
 1   time1   253561 non-null  datetime64[ns]
 2   site2   250098 non-null  float64       
 3   time2   250098 non-null  datetime64[ns]
 4   site3   246919 non-null  float64       
 5   time3   246919 non-null  datetime64[ns]
 6   site4   244321 non-null  float64       
 7   time4   244321 non-null  datetime64[ns]
 8   site5   241829 non-null  float64       
 9   time5   241829 non-null  datetime64[ns]
 10  site6   239495 non-null  float64       
 11  time6   239495 non-null  datetime64[ns]
 12  site7   237297 non-null  float64       
 13  time7   237297 non-null  datetime64[ns]
 14  site8   235224 non-null  float64       
 15  time8   235224 non-null  datetime64[ns]
 16  site9   233084 non-null  float64       
 17  time9   233084 non-null  date

* Подготовленный организаторами тестовый набор сессий в test_sessions/test_sessions.csv:

In [3]:
test_df = pd.read_csv(os.path.join(PATH_TO_TEST, 'test_sessions.csv'), 
                      index_col='session_id', parse_dates=['time' + str(j) for j in range(1,11)])
test_df

Unnamed: 0_level_0,site1,time1,site2,time2,site3,time3,site4,time4,site5,time5,site6,time6,site7,time7,site8,time8,site9,time9,site10,time10
session_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1
1,29,2014-10-04 11:19:53,35.0,2014-10-04 11:19:53,22.0,2014-10-04 11:19:54,321.0,2014-10-04 11:19:54,23.0,2014-10-04 11:19:54,2211.0,2014-10-04 11:19:54,6730.0,2014-10-04 11:19:54,21.0,2014-10-04 11:19:54,44582.0,2014-10-04 11:20:00,15336.0,2014-10-04 11:20:00
2,782,2014-07-03 11:00:28,782.0,2014-07-03 11:00:53,782.0,2014-07-03 11:00:58,782.0,2014-07-03 11:01:06,782.0,2014-07-03 11:01:09,782.0,2014-07-03 11:01:10,782.0,2014-07-03 11:01:23,782.0,2014-07-03 11:01:29,782.0,2014-07-03 11:01:30,782.0,2014-07-03 11:01:53
3,55,2014-12-05 15:55:12,55.0,2014-12-05 15:55:13,55.0,2014-12-05 15:55:14,55.0,2014-12-05 15:56:15,55.0,2014-12-05 15:56:16,55.0,2014-12-05 15:56:17,55.0,2014-12-05 15:56:18,55.0,2014-12-05 15:56:19,1445.0,2014-12-05 15:56:33,1445.0,2014-12-05 15:56:36
4,1023,2014-11-04 10:03:19,1022.0,2014-11-04 10:03:19,50.0,2014-11-04 10:03:20,222.0,2014-11-04 10:03:21,202.0,2014-11-04 10:03:21,3374.0,2014-11-04 10:03:22,50.0,2014-11-04 10:03:22,48.0,2014-11-04 10:03:22,48.0,2014-11-04 10:03:23,3374.0,2014-11-04 10:03:23
5,301,2014-05-16 15:05:31,301.0,2014-05-16 15:05:32,301.0,2014-05-16 15:05:33,66.0,2014-05-16 15:05:39,67.0,2014-05-16 15:05:40,69.0,2014-05-16 15:05:40,70.0,2014-05-16 15:05:40,68.0,2014-05-16 15:05:40,71.0,2014-05-16 15:05:40,167.0,2014-05-16 15:05:44
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
82793,812,2014-10-02 18:20:09,1039.0,2014-10-02 18:20:09,676.0,2014-10-02 18:20:09,,NaT,,NaT,,NaT,,NaT,,NaT,,NaT,,NaT
82794,300,2014-05-26 14:16:40,302.0,2014-05-26 14:16:41,302.0,2014-05-26 14:16:44,300.0,2014-05-26 14:16:44,300.0,2014-05-26 14:17:19,1222.0,2014-05-26 14:17:19,302.0,2014-05-26 14:17:19,1218.0,2014-05-26 14:17:19,1221.0,2014-05-26 14:17:19,1216.0,2014-05-26 14:17:19
82795,29,2014-05-02 11:21:56,33.0,2014-05-02 11:21:56,35.0,2014-05-02 11:21:56,22.0,2014-05-02 11:22:03,37.0,2014-05-02 11:22:03,6779.0,2014-05-02 11:22:03,30.0,2014-05-02 11:22:03,21.0,2014-05-02 11:22:04,23.0,2014-05-02 11:22:04,6780.0,2014-05-02 11:22:04
82796,5828,2014-05-03 10:05:25,23.0,2014-05-03 10:05:27,21.0,2014-05-03 10:05:27,804.0,2014-05-03 10:05:27,21.0,2014-05-03 10:05:36,3350.0,2014-05-03 10:05:37,23.0,2014-05-03 10:05:37,894.0,2014-05-03 10:05:38,21.0,2014-05-03 10:05:38,961.0,2014-05-03 10:05:38


In [23]:
test_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 82797 entries, 1 to 82797
Data columns (total 20 columns):
 #   Column  Non-Null Count  Dtype         
---  ------  --------------  -----         
 0   site1   82797 non-null  int64         
 1   time1   82797 non-null  datetime64[ns]
 2   site2   81308 non-null  float64       
 3   time2   81308 non-null  datetime64[ns]
 4   site3   80075 non-null  float64       
 5   time3   80075 non-null  datetime64[ns]
 6   site4   79182 non-null  float64       
 7   time4   79182 non-null  datetime64[ns]
 8   site5   78341 non-null  float64       
 9   time5   78341 non-null  datetime64[ns]
 10  site6   77566 non-null  float64       
 11  time6   77566 non-null  datetime64[ns]
 12  site7   76840 non-null  float64       
 13  time7   76840 non-null  datetime64[ns]
 14  site8   76151 non-null  float64       
 15  time8   76151 non-null  datetime64[ns]
 16  site9   75484 non-null  float64       
 17  time9   75484 non-null  datetime64[ns]
 18  site10

## Подготовка датасетов

В данном блокноте содержатся функции, позволяющие подготовить тренировочный и тестовый наборы данных по образу сформированного организаторами соревнования на Kaggle для дальнейшей визуализации, генерации признаков и тренировки моделей. Это упражнение проделано для формирования навыков предобработки данных.

### Подготовка тренировочного набора данных

В данном разделе главная функция, подготавливающая набор данных - prepare_train_set_with_fe. На вход она принимает путь до файлов с информацией о пользователях (path_to_csv_files), путь до словаря с сайтами (path_to_site_dict), а также длину сессии session_length и ширину окна window_size. 

In [4]:
def prepare_train_set_with_fe(path_to_csv_files, path_to_site_dict, 
                            session_length=10, window_size=10):
    ''' Подготавливает набор данных. Сырые данных хранятся в path_to_csv_files. 
    Возвращает DataFrame с сессиями и с несколькими извлеченными признаками
    Внимание! В текущем виде функция плохо работает с другим значением window_size'''
    
    with open(os.path.join(path_to_site_dict, 'site_dict.pkl'), 'rb') as file:
        site_dict = pickle.load(file)

    data = []
    
    list_of_files = glob(os.path.join(path_to_csv_files, '*'))         
    for path_to_user in tqdm(list_of_files):
        if 'Alice' in path_to_user:
            target = 1
        else:
            target = 0
            
        sites_array = pd.read_csv(path_to_user)['site'].map(site_dict).values.tolist()
        timestamps = pd.read_csv(path_to_user, parse_dates=['timestamp'])['timestamp']
        time_diffs, ind_reset_generator = timestamps_handling(timestamps)
        
        try:
            ind_reset = next(ind_reset_generator)
        except StopIteration:
            continue_iter = False
        continue_iter = True
        n = len(sites_array)
        ind = 0

        while True:
            if continue_iter and (ind_reset in range(ind, ind+session_length)):
                '''Если в сессии есть перерыв'''
                sites = sites_array[ind:ind_reset+1]
                times = time_diffs[ind:ind_reset]
                zeros_sites = [0 for _ in range(session_length-len(sites))]
                zeros_times = [0 for _ in range(session_length-len(times) - 1)]
                session = sites + zeros_sites + times + zeros_times
                
                features = add_features_to_train(timestamps, times, sites, ind)
                data.append(session + features + [target])

                ind = ind_reset + 1
                try:
                    ind_reset = next(ind_reset_generator)
                except StopIteration:
                    continue_iter = False
                    
            else:
                '''Если в сессии нет перерыва'''
                if ind + session_length > n-1:
                    '''Если сессия - последняя сессия пользователя'''
                    sites = sites_array[ind:n]
                    times = time_diffs[ind:n-1]
                    zeros_sites = [0 for _ in range(session_length-len(sites))]
                    zeros_times = [0 for _ in range(session_length-len(times) - 1)]
                    session = sites + zeros_sites + times + zeros_times
                    
                    features = add_features_to_train(timestamps, times, sites, ind)
                    data.append(session + features + [target])
                    
                    ind += window_size
                else:
                    '''Обычная сессия'''
                    sites = sites_array[ind:ind+session_length]
                    times = time_diffs[ind:ind+session_length-1]
                    session = sites + times

                    features = add_features_to_train(timestamps, times, sites, ind)
                    data.append(session + features + [target])
                    
                    ind += window_size

            if ind >= n:
                break
                
    feature_names=([f'site{i}' for i in range(1, session_length + 1)] +
                   [f'time_diff{i}' for i in range(1, session_length)] +
                   ['session_timespan', '#unique_sites', 'start_hour', 
                    'day_of_week', 'day', 'month', 'year',
                    'time', 'time_of_day'] +
                   ['target'])
    data_df = pd.DataFrame(data, columns=feature_names)

    return data_df

In [5]:
def timestamps_handling(timestamps):
    ''' Принимает на вход timestamps (pd.Series). Возвращает список с time_diff и
    генератор индексов перерывов (time_diffs >= 1800) '''
    
    time_diffs = (timestamps[1:].reset_index(drop=True)-
                  timestamps[:-1].reset_index(drop=True)).values
    
    time_diffs = np.array(time_diffs/1e9, dtype='int') # --> astype(int)

    return time_diffs.tolist(), (i for i in np.where(time_diffs >= 1800)[0])

In [6]:
def add_features_to_train(timestamps, times, sites, ind):
    ''' Добавляет основные фичи к тренировочному набору '''
    
    session_timespan = sum(times)
    num_unique_sites = len(set(sites)-{0})
    start_hour = timestamps[ind].hour
    day_of_week = timestamps[ind].weekday()
    start_day = timestamps[ind].day
    start_month = timestamps[ind].month
    start_year = timestamps[ind].year
    start_time = timestamps[ind].time()
    if timestamps[ind].hour >= 6: time_of_day = 'morning'
    if timestamps[ind].hour >= 12: time_of_day = 'afternoon'
    if timestamps[ind].hour >= 18: time_of_day = 'evening'
    
    return [session_timespan, num_unique_sites, start_hour, 
            day_of_week, start_day, start_month, start_year,
            start_time, time_of_day]

In [7]:
%%time
train_s10_w10_m30 = prepare_train_set_with_fe(PATH_TO_TRAIN_RAW, PATH_TO_DICT,
                                    session_length=10, window_size=10)

HBox(children=(HTML(value=''), FloatProgress(value=0.0, max=1558.0), HTML(value='')))


Wall time: 2min 30s


In [8]:
train_s10_w10_m30

Unnamed: 0,site1,site2,site3,site4,site5,site6,site7,site8,site9,site10,...,session_timespan,#unique_sites,start_hour,day_of_week,day,month,year,time,time_of_day,target
0,270,270,270,21,21,7832,21,7832,30,7832,...,437,4,16,1,12,2,2013,16:25:10,afternoon,1
1,29,7832,37,7832,7832,29,7832,29,7832,7832,...,26,3,16,1,12,2,2013,16:32:27,afternoon,1
2,29,7832,7832,29,37,7832,29,7832,29,270,...,53,4,16,1,12,2,2013,16:32:53,afternoon,1
3,167,167,1515,167,37,1514,855,1515,855,1514,...,3,5,16,1,12,2,2013,16:33:50,afternoon,1
4,1516,1515,1514,1518,1521,1523,1519,1524,1517,855,...,0,10,16,1,12,2,2013,16:33:55,afternoon,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
252205,3,3,8,3,3,4,4819,8,4819,6,...,15,5,16,0,17,3,2014,16:58:12,afternoon,0
252206,4819,8,41592,14,6,4819,10,14,7,4819,...,2,7,16,0,17,3,2014,16:58:28,afternoon,0
252207,4819,14,11,12,7,3,18,41592,14,4819,...,59,8,16,0,17,3,2014,16:58:31,afternoon,0
252208,11,66,63,8,29,30,162,131,162,29,...,18,8,16,0,17,3,2014,16:59:51,afternoon,0


In [9]:
with open(os.path.join(PATH_TO_DATASET, 'train_s10_w10_m30.pkl'), 'wb') as file:
    pickle.dump(train_s10_w10_m30, file)

Сформированный из сырых данных тренировочный набор данных получен достаточно быстро (вследствие хорошей оптимизации используемой функции), однако в дальнейшем будет использоваться другой набор данных. Набор данных, на котором будут тренироваться модели, формируется в заключении данного раздела при помощи куда менее эффективной функции.

### Подготовка тестового набора данных

In [10]:
test_df = pd.read_csv(os.path.join(PATH_TO_TEST, 'test_sessions.csv'), 
                      index_col='session_id', parse_dates=['time' + str(j) for j in range(1,11)])

In [11]:
test_df

Unnamed: 0_level_0,site1,time1,site2,time2,site3,time3,site4,time4,site5,time5,site6,time6,site7,time7,site8,time8,site9,time9,site10,time10
session_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1
1,29,2014-10-04 11:19:53,35.0,2014-10-04 11:19:53,22.0,2014-10-04 11:19:54,321.0,2014-10-04 11:19:54,23.0,2014-10-04 11:19:54,2211.0,2014-10-04 11:19:54,6730.0,2014-10-04 11:19:54,21.0,2014-10-04 11:19:54,44582.0,2014-10-04 11:20:00,15336.0,2014-10-04 11:20:00
2,782,2014-07-03 11:00:28,782.0,2014-07-03 11:00:53,782.0,2014-07-03 11:00:58,782.0,2014-07-03 11:01:06,782.0,2014-07-03 11:01:09,782.0,2014-07-03 11:01:10,782.0,2014-07-03 11:01:23,782.0,2014-07-03 11:01:29,782.0,2014-07-03 11:01:30,782.0,2014-07-03 11:01:53
3,55,2014-12-05 15:55:12,55.0,2014-12-05 15:55:13,55.0,2014-12-05 15:55:14,55.0,2014-12-05 15:56:15,55.0,2014-12-05 15:56:16,55.0,2014-12-05 15:56:17,55.0,2014-12-05 15:56:18,55.0,2014-12-05 15:56:19,1445.0,2014-12-05 15:56:33,1445.0,2014-12-05 15:56:36
4,1023,2014-11-04 10:03:19,1022.0,2014-11-04 10:03:19,50.0,2014-11-04 10:03:20,222.0,2014-11-04 10:03:21,202.0,2014-11-04 10:03:21,3374.0,2014-11-04 10:03:22,50.0,2014-11-04 10:03:22,48.0,2014-11-04 10:03:22,48.0,2014-11-04 10:03:23,3374.0,2014-11-04 10:03:23
5,301,2014-05-16 15:05:31,301.0,2014-05-16 15:05:32,301.0,2014-05-16 15:05:33,66.0,2014-05-16 15:05:39,67.0,2014-05-16 15:05:40,69.0,2014-05-16 15:05:40,70.0,2014-05-16 15:05:40,68.0,2014-05-16 15:05:40,71.0,2014-05-16 15:05:40,167.0,2014-05-16 15:05:44
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
82793,812,2014-10-02 18:20:09,1039.0,2014-10-02 18:20:09,676.0,2014-10-02 18:20:09,,NaT,,NaT,,NaT,,NaT,,NaT,,NaT,,NaT
82794,300,2014-05-26 14:16:40,302.0,2014-05-26 14:16:41,302.0,2014-05-26 14:16:44,300.0,2014-05-26 14:16:44,300.0,2014-05-26 14:17:19,1222.0,2014-05-26 14:17:19,302.0,2014-05-26 14:17:19,1218.0,2014-05-26 14:17:19,1221.0,2014-05-26 14:17:19,1216.0,2014-05-26 14:17:19
82795,29,2014-05-02 11:21:56,33.0,2014-05-02 11:21:56,35.0,2014-05-02 11:21:56,22.0,2014-05-02 11:22:03,37.0,2014-05-02 11:22:03,6779.0,2014-05-02 11:22:03,30.0,2014-05-02 11:22:03,21.0,2014-05-02 11:22:04,23.0,2014-05-02 11:22:04,6780.0,2014-05-02 11:22:04
82796,5828,2014-05-03 10:05:25,23.0,2014-05-03 10:05:27,21.0,2014-05-03 10:05:27,804.0,2014-05-03 10:05:27,21.0,2014-05-03 10:05:36,3350.0,2014-05-03 10:05:37,23.0,2014-05-03 10:05:37,894.0,2014-05-03 10:05:38,21.0,2014-05-03 10:05:38,961.0,2014-05-03 10:05:38


In [12]:
def feature_extracting_from_test(test_df, train=False):
    ''' Преобразует тестовый набор данных в вид, аналогичный тренировочному набору '''
    
    data_df = test_df.fillna(0)
    
    sites = ['site' + str(j) for j in range(1,11)]
    times = ['time_diff' + str(j) for j in range(1,10)]
    base_features = ['session_timespan', '#unique_sites', 
                     'start_hour', 'day_of_week', 'day',
                     'month', 'year', 'time', 'time_of_day']
    
    if train is True:
        base_features.append('target')
    
    new_df = pd.DataFrame(columns=sites + times + base_features,
                          index=range(1, data_df.shape[0] + 1))

    for column in ['site' + str(j) for j in range(1,11)]:
        new_df[column] = data_df[column]

    for j in range(1, 10):
        new_df['time_diff' + str(j)] = (test_df['time' + str(j+1)] - 
                                        test_df['time' + str(j)]).apply(lambda x: x.total_seconds())
    
    new_df['session_timespan'] = new_df.loc[:, 'time_diff1':'time_diff9'].sum(axis=1)

    for i in tqdm(range(1, data_df.shape[0])):
        sites_row = data_df[sites].iloc[i].values
        num_unique_in_session = np.unique(sites_row[sites_row != 0]).shape[0]
        new_df['#unique_sites'][i] = num_unique_in_session

    new_df['start_hour'] = test_df['time1'].apply(lambda x: x.hour)
    new_df['day_of_week'] = test_df['time1'].apply(lambda x: x.weekday())
    new_df['day'] = test_df['time1'].apply(lambda x: x.day)
    new_df['month'] = test_df['time1'].apply(lambda x: x.month)
    new_df['year'] = test_df['time1'].apply(lambda x: x.year)
    new_df['time'] = test_df['time1'].apply(lambda x: x.time())

    def time_of_day(x):
        time_of_day = 'night'
        if x.hour >= 6: time_of_day = 'morning'
        if x.hour >= 12: time_of_day = 'afternoon'
        if x.hour >= 18: time_of_day = 'evening'
        return time_of_day

    new_df['time_of_day'] = data_df['time1'].apply(time_of_day)
    
    if train is True:
        new_df['target'] = data_df['target']

    return new_df

In [13]:
%%time
test_df_final = feature_extracting_from_test(test_df)

HBox(children=(HTML(value=''), FloatProgress(value=0.0, max=82796.0), HTML(value='')))


Wall time: 5min 5s


In [14]:
test_df_final = test_df_final.fillna(0)

In [15]:
test_df_final

Unnamed: 0,site1,site2,site3,site4,site5,site6,site7,site8,site9,site10,...,time_diff9,session_timespan,#unique_sites,start_hour,day_of_week,day,month,year,time,time_of_day
1,29,35.0,22.0,321.0,23.0,2211.0,6730.0,21.0,44582.0,15336.0,...,0.0,7.0,1,11,5,4,10,2014,11:19:53,morning
2,782,782.0,782.0,782.0,782.0,782.0,782.0,782.0,782.0,782.0,...,23.0,85.0,2,11,3,3,7,2014,11:00:28,morning
3,55,55.0,55.0,55.0,55.0,55.0,55.0,55.0,1445.0,1445.0,...,3.0,84.0,7,15,4,5,12,2014,15:55:12,afternoon
4,1023,1022.0,50.0,222.0,202.0,3374.0,50.0,48.0,48.0,3374.0,...,0.0,4.0,8,10,1,4,11,2014,10:03:19,morning
5,301,301.0,301.0,66.0,67.0,69.0,70.0,68.0,71.0,167.0,...,4.0,13.0,8,15,4,16,5,2014,15:05:31,afternoon
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
82793,812,1039.0,676.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,6,18,3,2,10,2014,18:20:09,evening
82794,300,302.0,302.0,300.0,300.0,1222.0,302.0,1218.0,1221.0,1216.0,...,0.0,39.0,10,14,0,26,5,2014,14:16:40,afternoon
82795,29,33.0,35.0,22.0,37.0,6779.0,30.0,21.0,23.0,6780.0,...,0.0,8.0,7,11,4,2,5,2014,11:21:56,morning
82796,5828,23.0,21.0,804.0,21.0,3350.0,23.0,894.0,21.0,961.0,...,0.0,13.0,2,10,5,3,5,2014,10:05:25,morning


In [16]:
with open(os.path.join(PATH_TO_DATASET, 'test_s10_w10_m30_final.pkl'), 'wb') as file:
    pickle.dump(test_df_final, file)

Замечание! Модели, натренированные в разделе "Тренировка моделей" на полученном в этом разделе вручную тренировочном наборе, выдают меньший score на публичном тестовом наборе данных, чем при использовании сформированного организаторами тестового набора данных. Здесь будет применена функция feature_extracting_from_test с параметром train=True к сформированному организаторами набору данных train_sessions.csv. К сожалению, данная функция неоптимизирована и работает существенно дольше той, что собирает датасет из сырых данных.

In [17]:
train_df = pd.read_csv(os.path.join(PATH_TO_TRAIN, 'train_sessions.csv'), 
            index_col='session_id', parse_dates=['time' + str(j) for j in range(1,11)])

In [18]:
train_df

Unnamed: 0_level_0,site1,time1,site2,time2,site3,time3,site4,time4,site5,time5,...,time6,site7,time7,site8,time8,site9,time9,site10,time10,target
session_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
1,718,2014-02-20 10:02:45,,NaT,,NaT,,NaT,,NaT,...,NaT,,NaT,,NaT,,NaT,,NaT,0
2,890,2014-02-22 11:19:50,941.0,2014-02-22 11:19:50,3847.0,2014-02-22 11:19:51,941.0,2014-02-22 11:19:51,942.0,2014-02-22 11:19:51,...,2014-02-22 11:19:51,3847.0,2014-02-22 11:19:52,3846.0,2014-02-22 11:19:52,1516.0,2014-02-22 11:20:15,1518.0,2014-02-22 11:20:16,0
3,14769,2013-12-16 16:40:17,39.0,2013-12-16 16:40:18,14768.0,2013-12-16 16:40:19,14769.0,2013-12-16 16:40:19,37.0,2013-12-16 16:40:19,...,2013-12-16 16:40:19,14768.0,2013-12-16 16:40:20,14768.0,2013-12-16 16:40:21,14768.0,2013-12-16 16:40:22,14768.0,2013-12-16 16:40:24,0
4,782,2014-03-28 10:52:12,782.0,2014-03-28 10:52:42,782.0,2014-03-28 10:53:12,782.0,2014-03-28 10:53:42,782.0,2014-03-28 10:54:12,...,2014-03-28 10:54:42,782.0,2014-03-28 10:55:12,782.0,2014-03-28 10:55:42,782.0,2014-03-28 10:56:12,782.0,2014-03-28 10:56:42,0
5,22,2014-02-28 10:53:05,177.0,2014-02-28 10:55:22,175.0,2014-02-28 10:55:22,178.0,2014-02-28 10:55:23,177.0,2014-02-28 10:55:23,...,2014-02-28 10:55:59,175.0,2014-02-28 10:55:59,177.0,2014-02-28 10:55:59,177.0,2014-02-28 10:57:06,178.0,2014-02-28 10:57:11,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
253557,3474,2013-11-25 10:26:54,3474.0,2013-11-25 10:26:58,141.0,2013-11-25 10:27:03,2428.0,2013-11-25 10:27:04,106.0,2013-11-25 10:27:13,...,2013-11-25 10:27:16,2428.0,2013-11-25 10:27:28,2428.0,2013-11-25 10:27:40,2428.0,2013-11-25 10:27:52,148.0,2013-11-25 10:27:53,0
253558,12727,2013-03-12 16:01:15,12727.0,2013-03-12 16:01:16,2215.0,2013-03-12 16:01:16,38.0,2013-03-12 16:01:17,2215.0,2013-03-12 16:01:17,...,2013-03-12 16:01:17,25444.0,2013-03-12 16:01:18,2215.0,2013-03-12 16:01:18,23.0,2013-03-12 16:01:18,21.0,2013-03-12 16:01:18,0
253559,2661,2013-09-12 14:05:03,15004.0,2013-09-12 14:05:10,5562.0,2013-09-12 14:05:10,5562.0,2013-09-12 14:06:29,5562.0,2013-09-12 14:06:30,...,NaT,,NaT,,NaT,,NaT,,NaT,0
253560,812,2013-12-19 15:20:22,676.0,2013-12-19 15:20:22,814.0,2013-12-19 15:20:22,22.0,2013-12-19 15:20:22,39.0,2013-12-19 15:20:22,...,2013-12-19 15:20:23,814.0,2013-12-19 15:20:23,570.0,2013-12-19 15:20:23,22.0,2013-12-19 15:20:24,570.0,2013-12-19 15:20:24,0


In [19]:
%%time
train_df_final = feature_extracting_from_test(train_df, train=True)

HBox(children=(HTML(value=''), FloatProgress(value=0.0, max=253560.0), HTML(value='')))


Wall time: 39min


In [20]:
train_df_final = train_df_final.fillna(0)

In [21]:
with open(os.path.join(PATH_TO_DATASET, 'train_s10_w10_m30_final.pkl'), 'wb') as file:
    pickle.dump(train_df_final, file)

В результате выполнения кода из данного блокнота будут сформированы несколько файлов. В дальнейшем будут использоваться следующие:

* test_train/train_s10_w10_m30_final.pkl
* test_train/test_s10_w10_m30_final.pkl