<center>
<img src="../../img/ods_stickers.jpg">
## Open Machine Learning Course
<center>
Author: Yury Kashnitsky, Data Scientist at Mail.Ru Group

This material is subject to the terms and conditions of the [Creative Commons CC BY-NC-SA 4.0](https://creativecommons.org/licenses/by-nc-sa/4.0/) license. Free use is permitted for any non-commercial purpose, provided the authors and the source are credited.

## <center>Assignment #6. Part 1
### <center> Beating benchmarks in "Catch Me If You Can: Intruder Detection through Webpage Session Tracking"
    
[Competition](https://www.kaggle.com/c/catch-me-if-you-can-intruder-detection-through-webpage-session-tracking2). The task is to beat "Assignment 6 baseline".

In [1]:
import warnings
warnings.filterwarnings('ignore')
%matplotlib inline
from matplotlib import pyplot as plt
import seaborn as sns
import os
import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix, hstack
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import roc_auc_score
from sklearn.linear_model import LogisticRegression, LogisticRegressionCV
from sklearn.feature_extraction.text import TfidfVectorizer

Reading original data

In [2]:
PATH_TO_DATA = '../../data'
train_df = pd.read_csv(os.path.join(PATH_TO_DATA, 'train_sessions.csv'), index_col='session_id')
test_df = pd.read_csv(os.path.join(PATH_TO_DATA, 'test_sessions.csv'), index_col='session_id')

In [3]:
#table = impute_nan_with_median(train_df)

Separate the target feature.

In [4]:
y = train_df['target']

Build Tf-Idf features based on sites. You can use `ngram_range=(1, 3)` and `max_features=100000` or more.

In [5]:
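# Keep only the site columns; a missing site is encoded as 0.
site_cols = ['site%d' % i for i in range(1, 11)]
sites_table = train_df[site_cols].fillna(0).astype(int)

# TfidfVectorizer expects one string per document, so join each session's
# site IDs into a single space-separated "sentence".
# Note: the default token pattern ignores one-character tokens such as "0".
sites_text = sites_table.astype(str).apply(' '.join, axis=1)

tf_idf_sites = TfidfVectorizer(ngram_range=(1, 3),
                               max_features=100000).fit(sites_text)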
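
`TfidfVectorizer` works on an iterable of strings, so each session has to be turned into one "sentence" of site IDs before fitting. The sketch below illustrates the idea on a made-up three-session table; the `toy_*` names and the `token_pattern=r'\S+'` setting are assumptions for illustration, not part of the assignment.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy stand-in for the site columns (hypothetical IDs, 0 = missing site).
toy_sites = pd.DataFrame({'site1': [11, 22, 11],
                          'site2': [22, 33, 0],
                          'site3': [33, 0, 0]})

# One whitespace-separated string of site IDs per session.
toy_text = toy_sites.astype(str).apply(' '.join, axis=1)

# token_pattern=r'\S+' keeps every ID, including single-digit ones
# (the default pattern drops one-character tokens).
toy_vectorizer = TfidfVectorizer(ngram_range=(1, 3), token_pattern=r'\S+')
toy_matrix = toy_vectorizer.fit_transform(toy_text)

print(toy_text.tolist())
print(toy_matrix.shape)
```

With `ngram_range=(1, 3)` the vocabulary contains single sites as well as ordered pairs and triples of consecutive sites, which is what lets the model pick up short browsing patterns.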

Add features based on the session start time: hour, whether it's morning, day or night and so on.

In [6]:
time_table = train_df.drop(['site1', 'site2', 'site3', 'site4', 'site5', 'site6', 'site7', 'site8', 'site9', 'site10', 'target'], axis=1)

for col in time_table.columns:    
    time_table[col]= time_table[col].fillna(0)
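
One possible way to derive start-time features is to parse the first timestamp of each session and use the `.dt` accessors. This is only a sketch on made-up values: the `toy_*` names, the extra `evening` flag, and the hour boundaries are assumptions and should be tuned on the real data.

```python
import pandas as pd

# Toy session start times (hypothetical values in the same format as time1).
toy_start = pd.to_datetime(pd.Series(['2014-02-20 10:02:45',
                                      '2013-12-16 16:40:17',
                                      '2014-03-18 23:18:31']))

start_hour = toy_start.dt.hour
toy_time_features = pd.DataFrame({
    'start_hour': start_hour,
    # The hour boundaries below are an assumption, not part of the assignment.
    'morning': ((start_hour >= 7) & (start_hour <= 11)).astype(int),
    'day': ((start_hour >= 12) & (start_hour <= 18)).astype(int),
    'evening': ((start_hour >= 19) & (start_hour <= 23)).astype(int),
    'night': (start_hour <= 6).astype(int),
})
print(toy_time_features)
```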

In [70]:
time_table

session_id,time1,time2,time3,time4,time5,time6,time7,time8,time9,time10
1,2014-02-20 10:02:45,0,0,0,0,0,0,0,0,0
2,2014-02-22 11:19:50,2014-02-22 11:19:50,2014-02-22 11:19:51,2014-02-22 11:19:51,2014-02-22 11:19:51,2014-02-22 11:19:51,2014-02-22 11:19:52,2014-02-22 11:19:52,2014-02-22 11:20:15,2014-02-22 11:20:16
3,2013-12-16 16:40:17,2013-12-16 16:40:18,2013-12-16 16:40:19,2013-12-16 16:40:19,2013-12-16 16:40:19,2013-12-16 16:40:19,2013-12-16 16:40:20,2013-12-16 16:40:21,2013-12-16 16:40:22,2013-12-16 16:40:24
4,2014-03-28 10:52:12,2014-03-28 10:52:42,2014-03-28 10:53:12,2014-03-28 10:53:42,2014-03-28 10:54:12,2014-03-28 10:54:42,2014-03-28 10:55:12,2014-03-28 10:55:42,2014-03-28 10:56:12,2014-03-28 10:56:42
5,2014-02-28 10:53:05,2014-02-28 10:55:22,2014-02-28 10:55:22,2014-02-28 10:55:23,2014-02-28 10:55:23,2014-02-28 10:55:59,2014-02-28 10:55:59,2014-02-28 10:55:59,2014-02-28 10:57:06,2014-02-28 10:57:11
6,2014-03-18 15:18:31,2014-03-18 15:18:39,2014-03-18 15:23:02,2014-03-18 15:23:43,2014-03-18 15:29:57,0,0,0,0,0
7,2014-02-13 16:45:35,2014-02-13 16:45:35,2014-02-13 16:45:35,2014-02-13 16:45:35,2014-02-13 16:46:05,2014-02-13 16:47:14,2014-02-13 16:47:14,2014-02-13 16:47:15,2014-02-13 16:47:16,2014-02-13 16:47:17
8,2013-04-12 10:27:26,2013-04-12 10:27:26,2013-04-12 10:27:28,2013-04-12 10:27:29,2013-04-12 10:27:29,2013-04-12 10:27:29,2013-04-12 10:27:29,2013-04-12 10:27:31,2013-04-12 10:27:31,2013-04-12 10:27:32
9,2014-03-17 16:23:08,2014-03-17 16:23:35,2014-03-17 16:23:35,2014-03-17 16:23:35,2014-03-17 16:23:36,2014-03-17 16:23:36,2014-03-17 16:23:36,2014-03-17 16:23:52,2014-03-17 16:23:52,2014-03-17 16:23:53
10,2014-02-20 16:09:13,2014-02-20 16:10:08,2014-02-20 16:10:08,2014-02-20 16:10:08,2014-02-20 16:10:24,2014-02-20 16:10:24,2014-02-20 16:10:29,2014-02-20 16:10:39,2014-02-20 16:10:40,2014-02-20 16:10:40


In [72]:
pd.Timestamp(time_table.iloc[0, 0]).second

45

In [73]:
v = pd.DataFrame([(1, 2)],
                 columns=['a', 'b'])

In [74]:
# DataFrame.append was removed in pandas 2.0; pd.concat is the replacement.
pd.concat([v, pd.DataFrame([(1, 2)], columns=['a', 'b'])],
          ignore_index=True)

Unnamed: 0,a,b
0,1,2
1,1,2


In [84]:
# Thirteen features per timestamp (one-hot day of week plus hour, minute,
# second, year, month, day), repeated for each of the 10 timestamps.
features = ['mon', 'tue', 'wed', 'thu', 'fri', 'sat', 'sun',
            'hour', 'min', 'sec', 'year', 'month', 'day']
h = pd.DataFrame(columns=features * 10)

In [85]:
h

Unnamed: 0,mon,tue,wed,thu,fri,sat,sun,hour,min,sec,...,thu.1,fri.1,sat.1,sun.1,hour.1,min.1,sec.1,year,month,day


In [None]:
rows = []
for i in range(time_table.shape[0]):
    b = []
    for j in range(time_table.shape[1]):
        value = time_table.iloc[i, j]
        if value == 0:
            continue  # missing timestamp (filled with 0 above)
        ts = pd.Timestamp(value)
        # One-hot encode the day of week (Monday=0 ... Sunday=6).
        day_one_hot = [int(ts.dayofweek == d) for d in range(7)]
        b.extend(day_one_hot + [ts.hour, ts.minute, ts.second,
                                ts.year, ts.month, ts.day])
    # Assumption: sessions with fewer than 10 timestamps are padded with
    # zeros so that every row matches the columns of `h`.
    b.extend([0] * (len(h.columns) - len(b)))
    rows.append(b)

# Note: this row-by-row loop is slow on the full training set.
h = pd.DataFrame(rows, columns=h.columns, index=time_table.index)
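
A row-by-row loop over `pd.Timestamp` objects tends to be slow on a few hundred thousand sessions. A vectorized alternative, sketched here for a single made-up timestamp column, parses the whole column at once and reads the parts through `.dt`. The `toy_*` and `dow_*` names are hypothetical, and treating `0` as "missing" follows the `fillna(0)` convention used above.

```python
import pandas as pd

# Toy column of timestamps; 0 marks a missing value, as in time_table.
toy_col = pd.Series(['2014-02-20 10:02:45', 0, '2013-12-16 16:40:17'])

# Mask the 0 placeholders so they are parsed as NaT.
ts = pd.to_datetime(toy_col.where(toy_col != 0))

# One-hot day of week (Monday=0 ... Sunday=6); missing rows stay all-zero.
dow = ts.dt.dayofweek
one_hot = pd.DataFrame({'dow_%d' % d: (dow == d).astype(int)
                        for d in range(7)})

# Remaining parts; missing timestamps become 0.
parts = pd.DataFrame({'hour': ts.dt.hour,
                      'minute': ts.dt.minute,
                      'second': ts.dt.second}).fillna(0).astype(int)

toy_features = pd.concat([one_hot, parts], axis=1)
print(toy_features)
```

Repeating this for `time1` through `time10` and concatenating the results column-wise would give the same kind of table as the loop, without per-cell Python overhead.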

Scale these features and combine them with the Tf-Idf features based on sites (you'll need `scipy.sparse.hstack`).

In [None]:
# Your code here
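
As a rough illustration of this step, the sketch below standardizes a small dense block with `StandardScaler` and stacks it next to a sparse matrix with `scipy.sparse.hstack`. The matrices and the `toy_*` names are made up; on the real data the scaler would be fitted on the training features and then applied to the test features with `transform`.

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack
from sklearn.preprocessing import StandardScaler

# Toy stand-ins: a sparse Tf-Idf-like matrix and two dense time features.
toy_tfidf = csr_matrix(np.array([[0.0, 0.7, 0.7],
                                 [1.0, 0.0, 0.0],
                                 [0.0, 0.0, 1.0]]))
toy_time = np.array([[10.0, 3.0],
                     [16.0, 0.0],
                     [23.0, 1.0]])

# Standardize each dense column to zero mean and unit variance.
toy_time_scaled = StandardScaler().fit_transform(toy_time)

# Stack the two blocks column-wise and convert the result to CSR format.
toy_X = hstack([toy_tfidf, csr_matrix(toy_time_scaled)]).tocsr()
print(toy_X.shape)
```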

Perform cross-validation with logistic regression.

In [None]:
# Your code here
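
One way to set this up, using the classes imported at the top of the notebook, is `cross_val_score` with a `StratifiedKFold` splitter and ROC AUC scoring. The data below is synthetic, and the hyperparameters (`C=1.0`, five folds, seed 17) are placeholder assumptions, not recommended values.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in for the combined feature matrix and the target.
rng = np.random.RandomState(17)
toy_X = rng.normal(size=(200, 5))
toy_y = (toy_X[:, 0] + 0.5 * rng.normal(size=200) > 1.0).astype(int)
toy_X_sparse = csr_matrix(toy_X)

# Stratified folds keep the class ratio similar in every split.
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=17)
logit = LogisticRegression(C=1.0, random_state=17, max_iter=1000)

# One ROC AUC value per fold.
cv_scores = cross_val_score(logit, toy_X_sparse, toy_y,
                            cv=skf, scoring='roc_auc')
print(cv_scores.mean())
```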

Make prediction for the test set and form a submission file.

In [None]:
test_pred = # Your code here
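
Since the notebook imports `roc_auc_score`, the predictions are presumably scored by ranking, so class probabilities are a more natural submission than hard labels. A minimal sketch on synthetic data, with hypothetical `toy_*` names:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic stand-ins for the training and test feature matrices.
rng = np.random.RandomState(17)
toy_X_train = rng.normal(size=(100, 3))
toy_y_train = (toy_X_train[:, 0] > 0).astype(int)
toy_X_test = rng.normal(size=(20, 3))

logit = LogisticRegression(C=1.0, random_state=17)
logit.fit(toy_X_train, toy_y_train)

# Column 1 of predict_proba holds the probability of the positive class.
toy_test_pred = logit.predict_proba(toy_X_test)[:, 1]
print(toy_test_pred[:5])
```

An array of this shape, one probability per test session, is what `write_to_submission_file` expects as its first argument.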

In [None]:
def write_to_submission_file(predicted_labels, out_file,
                             target='target', index_label="session_id"):
    predicted_df = pd.DataFrame(predicted_labels,
                                index = np.arange(1, predicted_labels.shape[0] + 1),
                                columns=[target])
    predicted_df.to_csv(out_file, index_label=index_label)


In [None]:
write_to_submission_file(test_pred, "assignment6_alice_submission.csv")