<center>
<img src="../../img/ods_stickers.jpg">
## Open Machine Learning Course
<center>
Author: Yury Kashnitsky, Data Scientist at Mail.Ru Group

This material is subject to the terms and conditions of the license [Creative Commons CC BY-NC-SA 4.0](https://creativecommons.org/licenses/by-nc-sa/4.0/). Free use is permitted for any non-commercial purpose with an obligatory indication of the names of the authors and of the source.

## <center>Assignment #6. Part 1
### <center> Beating benchmarks in "Catch Me If You Can: Intruder Detection through Webpage Session Tracking"
    
[Competition](https://www.kaggle.com/c/catch-me-if-you-can-intruder-detection-through-webpage-session-tracking2). The task is to beat "Assignment 6 baseline".

In [27]:
import warnings
warnings.filterwarnings('ignore')
%matplotlib inline
from matplotlib import pyplot as plt
import seaborn as sns
import os
import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix, hstack
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import roc_auc_score
from sklearn.linear_model import LogisticRegression, LogisticRegressionCV
from sklearn.feature_extraction.text import TfidfVectorizer

Reading original data

In [28]:
PATH_TO_DATA = 'intruder'

parse_dates = [f"time{i}" for i in range (1, 11)]
train_df = pd.read_csv(os.path.join(PATH_TO_DATA, 'train_sessions.csv'), index_col='session_id', parse_dates=parse_dates)
test_df = pd.read_csv(os.path.join(PATH_TO_DATA, 'test_sessions.csv'), index_col='session_id', parse_dates=parse_dates)

train_df.head()

session_id,site1,time1,site2,time2,site3,time3,site4,time4,site5,time5,...,time6,site7,time7,site8,time8,site9,time9,site10,time10,target
1,718,2014-02-20 10:02:45,,NaT,,NaT,,NaT,,NaT,...,NaT,,NaT,,NaT,,NaT,,NaT,0
2,890,2014-02-22 11:19:50,941.0,2014-02-22 11:19:50,3847.0,2014-02-22 11:19:51,941.0,2014-02-22 11:19:51,942.0,2014-02-22 11:19:51,...,2014-02-22 11:19:51,3847.0,2014-02-22 11:19:52,3846.0,2014-02-22 11:19:52,1516.0,2014-02-22 11:20:15,1518.0,2014-02-22 11:20:16,0
3,14769,2013-12-16 16:40:17,39.0,2013-12-16 16:40:18,14768.0,2013-12-16 16:40:19,14769.0,2013-12-16 16:40:19,37.0,2013-12-16 16:40:19,...,2013-12-16 16:40:19,14768.0,2013-12-16 16:40:20,14768.0,2013-12-16 16:40:21,14768.0,2013-12-16 16:40:22,14768.0,2013-12-16 16:40:24,0
4,782,2014-03-28 10:52:12,782.0,2014-03-28 10:52:42,782.0,2014-03-28 10:53:12,782.0,2014-03-28 10:53:42,782.0,2014-03-28 10:54:12,...,2014-03-28 10:54:42,782.0,2014-03-28 10:55:12,782.0,2014-03-28 10:55:42,782.0,2014-03-28 10:56:12,782.0,2014-03-28 10:56:42,0
5,22,2014-02-28 10:53:05,177.0,2014-02-28 10:55:22,175.0,2014-02-28 10:55:22,178.0,2014-02-28 10:55:23,177.0,2014-02-28 10:55:23,...,2014-02-28 10:55:59,175.0,2014-02-28 10:55:59,177.0,2014-02-28 10:55:59,177.0,2014-02-28 10:57:06,178.0,2014-02-28 10:57:11,0
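
The preview above shows site IDs such as `941.0` and timestamps such as `NaT` for short sessions. A minimal sketch of why this happens, assuming a tiny in-memory CSV (the `csv_text` sample and the `demo_df` name are made up for illustration and are not part of the competition data): a column with missing values is read as float, and `parse_dates` turns empty timestamps into `NaT`.

```python
import io

import pandas as pd

# Hypothetical two-row sample in the same layout as train_sessions.csv
csv_text = (
    "session_id,site1,time1,site2,time2\n"
    "1,718,2014-02-20 10:02:45,,\n"
    "2,890,2014-02-22 11:19:50,941,2014-02-22 11:19:50\n"
)

demo_df = pd.read_csv(
    io.StringIO(csv_text),
    index_col="session_id",
    parse_dates=["time1", "time2"],
)

# site1 has no gaps and stays integer; site2 has a gap and becomes float
print(demo_df.dtypes)
# The missing timestamp in session 1 is parsed as NaT
print(demo_df.loc[1, "time2"])
```

This is why the site-to-string helper later has to cast values with `int(...)` and treat NaN separately.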


Separate the target feature.

In [29]:
y = train_df['target']

Build Tf-Idf features based on sites. You can use `ngram_range`=(1, 3) and `max_features`=100000 or more

In [30]:
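def sstr_get(df=train_df):
    # Join the ten site IDs of each session into one space-separated string.
    # Missing sites (NaN) become empty strings; str.strip() removes the padding they leave.
    site_str = df['site1'].apply(lambda s: '' if np.isnan(s) else str(int(s)))
    for i in range(2, 11):
        site_str += " " + df[f"site{i}"].apply(lambda s: '' if np.isnan(s) else str(int(s)))
    return site_str.str.strip()
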
tfidf = TfidfVectorizer(ngram_range=(1,7), max_features=100000, lowercase=False)
train_sites = tfidf.fit_transform(sstr_get(train_df))
test_sites = tfidf.transform(sstr_get(test_df))
train_sites.shape, test_sites.shape

((253561, 100000), (82797, 100000))
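
To make the n-gram features more concrete, here is a small sketch on three made-up session strings (the `sessions` list and both vectorizer names are illustrative assumptions). It also shows one detail worth checking: with the default `token_pattern`, `TfidfVectorizer` keeps only tokens of two or more characters, so a single-digit site ID is dropped unless the pattern is changed. Whether that matters for the leaderboard score is not something this sketch can tell.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical sessions written as space-separated site IDs
sessions = ["890 941 3847", "782 782 782", "22 177 5"]

# Default token_pattern: tokens need at least two characters
vec_default = TfidfVectorizer(ngram_range=(1, 2), lowercase=False)
X_default = vec_default.fit_transform(sessions)

# Alternative pattern that also keeps one-character tokens such as "5"
vec_all = TfidfVectorizer(ngram_range=(1, 2), lowercase=False,
                          token_pattern=r"(?u)\b\w+\b")
X_all = vec_all.fit_transform(sessions)

print(sorted(vec_default.vocabulary_))
print(X_default.shape, X_all.shape)
```

Each column is either a single site ID or a pair of consecutive IDs; a larger `ngram_range` adds longer visit sequences in the same way.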

Add features based on the session start time: hour, whether it's morning, day or night and so on.

In [31]:
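def time_get(df=train_df):
    hour = df.time1.apply(lambda ts: ts.hour)
    # One boolean indicator per hour of day; between(i, i) is equivalent to == i
    time_features = [hour.between(i, i) for i in range(24)]
    # One boolean indicator per day of week (Monday is 0)
    weekdays = [df.time1.apply(lambda ts: ts.weekday()).between(i, i) for i in range(7)]
    # Month counter (year * 12 + month) that grows monotonically over time
    yearmon = df.time1.apply(lambda ts: ts.year * 12 + ts.month)
    time_features += weekdays
    time_features.append(yearmon)
    return pd.concat(time_features, axis=1)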

train_time = time_get(train_df)
test_time = time_get(test_df)

train_time.shape, test_time.shape

((253561, 32), (82797, 32))
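
The 32 columns are 24 hour indicators, 7 weekday indicators and one month counter. A minimal sketch of the same encoding on three timestamps taken from the preview above, using the vectorized `.dt` accessor instead of `apply` (the variable names here are illustrative, not from the notebook):

```python
import pandas as pd

times = pd.Series(pd.to_datetime([
    "2014-02-20 10:02:45",
    "2014-02-22 11:19:50",
    "2013-12-16 16:40:17",
]))

hour = times.dt.hour

# Two equivalent ways to build the hour indicators
onehot_between = pd.concat([hour.between(i, i) for i in range(24)], axis=1)
onehot_eq = pd.concat([hour == i for i in range(24)], axis=1)

# Month counter: consecutive months differ by exactly 1, also across years
yearmon = times.dt.year * 12 + times.dt.month

print(onehot_eq.shape)
print(yearmon.tolist())
```

Every row of the indicator block has exactly one `True`, which is the usual one-hot layout.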

Scale these features and combine them with the Tf-Idf features based on sites (you'll need `scipy.sparse.hstack`).

In [32]:
time_scaler = StandardScaler()
train_time_scaled = time_scaler.fit_transform(train_time.values)
test_time_scaled = time_scaler.transform(test_time.values)
train_time_scaled.shape, test_time_scaled.shape

((253561, 32), (82797, 32))

In [33]:
X_train = hstack([train_sites, train_time_scaled])
X_test = hstack([test_sites, test_time_scaled])
X_train.shape, X_test.shape

((253561, 100032), (82797, 100032))
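
A small sketch of the scale-then-stack step on toy matrices (all arrays below are made up). Two choices here are assumptions of this sketch rather than something the notebook does: the dense block is wrapped in `csr_matrix` explicitly, and `format="csr"` is requested so the result supports fast row slicing.

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack
from sklearn.preprocessing import StandardScaler

# Toy stand-ins for the sparse Tf-Idf block and the dense time block
sparse_part = csr_matrix(np.array([
    [0.0, 1.0, 0.0],
    [0.5, 0.0, 0.0],
    [0.0, 0.0, 2.0],
    [0.0, 0.0, 0.0],
]))
dense_part = np.array([
    [10.0, 0.0],
    [11.0, 1.0],
    [16.0, 0.0],
    [10.0, 1.0],
])

# Standardize each dense column to zero mean and unit variance
scaler = StandardScaler()
dense_scaled = scaler.fit_transform(dense_part)

# Stack column-wise: rows must match, column counts add up
combined = hstack([sparse_part, csr_matrix(dense_scaled)], format="csr")
print(combined.shape)
```

As in the notebook, the scaler should be fitted on the training block only and then applied to the test block with `transform`.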

Perform cross-validation with logistic regression.

In [34]:
%%time
cv = LogisticRegressionCV(
    Cs=np.logspace(-5, 5, 15), cv=13,
    penalty='l2', scoring='roc_auc',
    random_state=17, n_jobs=-1)
cv.fit(X_train, y)

CPU times: user 15.8 s, sys: 1.02 s, total: 16.8 s
Wall time: 6min 45s
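
For comparison, cross-validation can also be run by hand for a single value of `C`. The sketch below uses synthetic data from `make_classification` and an arbitrary `C=1.0`; both are assumptions for illustration, and the scores it prints say nothing about the competition data.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic, imbalanced binary problem (about 10% positives)
X_demo, y_demo = make_classification(
    n_samples=400, n_features=20, weights=[0.9, 0.1], random_state=17
)

# Stratified folds keep the class ratio roughly equal in every fold
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=17)
logit = LogisticRegression(C=1.0, max_iter=1000)

# One ROC AUC value per fold
auc_scores = cross_val_score(logit, X_demo, y_demo, cv=skf, scoring="roc_auc")
print(auc_scores.mean(), auc_scores.std())
```

`LogisticRegressionCV` repeats this kind of loop for every value in `Cs` and then refits on the full training set with the best one.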


Make predictions for the test set and create a submission file.

In [35]:
cv.scores_[1].mean(axis=0).max()

0.9888133175066453
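
The expression `cv.scores_[1].mean(axis=0).max()` reads the best mean validation score across the grid. A sketch of the layout on synthetic data, assuming the `scores_` behavior of scikit-learn versions where it is a dict keyed by class label (the data, the three `Cs` values and all `demo_` names are made up):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegressionCV

X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=17)

Cs = [0.01, 1.0, 100.0]
demo_cv = LogisticRegressionCV(Cs=Cs, cv=3, scoring="roc_auc",
                               random_state=17, max_iter=1000)
demo_cv.fit(X_demo, y_demo)

# For a binary target, the entry for class 1 has shape (n_folds, n_Cs)
fold_scores = np.asarray(demo_cv.scores_[1])
mean_per_C = fold_scores.mean(axis=0)   # average over folds for each C
best_idx = int(mean_per_C.argmax())

print(fold_scores.shape)
print("best C:", Cs[best_idx], "mean ROC AUC:", mean_per_C.max())
```

Averaging over `axis=0` collapses the folds, so the maximum of the result is the score of the best regularization strength.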

In [36]:
test_pred = cv.predict_proba(X_test)[:,1]

In [37]:
test_pred.shape

(82797,)

In [38]:
def write_to_submission_file(predicted_labels, out_file,
                             target='target', index_label="session_id"):
    predicted_df = pd.DataFrame(predicted_labels,
                                index = np.arange(1, predicted_labels.shape[0] + 1),
                                columns=[target])
    predicted_df.to_csv(out_file, index_label=index_label)


In [39]:
write_to_submission_file(test_pred, "assignment6_alice_submission.csv")
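
A quick way to sanity-check the file format before uploading is to write a few made-up probabilities to a temporary file and read them back. The sketch below repeats the writer under the hypothetical name `write_submission` so that it runs on its own; the three values in `pred` are illustrative only.

```python
import os
import tempfile

import numpy as np
import pandas as pd


def write_submission(pred, out_file, target="target", index_label="session_id"):
    # Session IDs in the file start at 1 and follow the order of the predictions
    df = pd.DataFrame(pred,
                      index=np.arange(1, pred.shape[0] + 1),
                      columns=[target])
    df.to_csv(out_file, index_label=index_label)


pred = np.array([0.01, 0.5, 0.99])  # made-up probabilities

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo_submission.csv")
    write_submission(pred, path)
    check = pd.read_csv(path, index_col="session_id")

print(check)
```

The file should have one row per test session, a `session_id` index starting at 1, and a single `target` column with predicted probabilities.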