<center>
<img src="../../img/ods_stickers.jpg">
## Open Machine Learning Course
<center>
Author: Yury Kashnitsky, Data Scientist at Mail.Ru Group

This material is subject to the terms and conditions of the license [Creative Commons CC BY-NC-SA 4.0](https://creativecommons.org/licenses/by-nc-sa/4.0/). Free use is permitted for any non-commercial purpose with an obligatory indication of the names of the authors and of the source.

## <center>Assignment #6. Part 1
### <center> Beating benchmarks in "Catch Me If You Can: Intruder Detection through Webpage Session Tracking"
    
[Competition](https://www.kaggle.com/c/catch-me-if-you-can-intruder-detection-through-webpage-session-tracking2). The task is to beat "Assignment 6 baseline".

In [1]:
import warnings
warnings.filterwarnings('ignore')
%matplotlib inline
from matplotlib import pyplot as plt
import seaborn as sns
import os
import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix, hstack
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.model_selection import StratifiedKFold, cross_val_score, GridSearchCV
from sklearn.metrics import roc_auc_score
from sklearn.linear_model import LogisticRegression, LogisticRegressionCV
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline

Reading original data

In [2]:
train_df = pd.read_csv('train_sessions.csv', index_col='session_id')
test_df = pd.read_csv('test_sessions.csv', index_col='session_id')
full_df = pd.concat([train_df.drop('target', axis=1), test_df])

idx = train_df.shape[0]

Separate the target feature.

In [3]:
y = train_df['target']

In [4]:
sites = ['site%s' %i for i in range(1, 11)]
full_df[sites] = full_df[sites].fillna(0).astype(int)
full_df[sites] = full_df[sites].astype(str)
full_df[sites].head()

session_id,site1,site2,site3,site4,site5,site6,site7,site8,site9,site10
1,718,0,0,0,0,0,0,0,0,0
2,890,941,3847,941,942,3846,3847,3846,1516,1518
3,14769,39,14768,14769,37,39,14768,14768,14768,14768
4,782,782,782,782,782,782,782,782,782,782
5,22,177,175,178,177,178,175,177,177,178


Build Tf-Idf features based on sites. You can use `ngram_range=(1, 3)` and `max_features=100000` or more.

In [5]:
site1 = full_df['site1'].values.flatten()
site2 = full_df['site2'].values.flatten()
site3 = full_df['site3'].values.flatten()
site4 = full_df['site4'].values.flatten()
site5 = full_df['site5'].values.flatten()
site6 = full_df['site6'].values.flatten()
site7 = full_df['site7'].values.flatten()
site8 = full_df['site8'].values.flatten()
site9 = full_df['site9'].values.flatten()
site10 = full_df['site10'].values.flatten()

session_str = site1 + ' ' + site2 + ' ' + site3 + ' ' + site4 + ' ' + site5 + ' ' + site6 + ' ' + site7 + ' ' + site8 + ' ' + site9 + ' ' + site10 

In [6]:
tf = TfidfVectorizer(ngram_range=(1, 4), max_features=300000)
tf.fit(session_str)

site_feat = tf.transform(session_str[:idx])
site_feat

<253561x300000 sparse matrix of type '<class 'numpy.float64'>'
	with 4536233 stored elements in Compressed Sparse Row format>
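A minimal sketch of what the vectorizer does with session strings, using made-up toy sessions rather than the competition data. One detail worth knowing: the default `token_pattern` of `TfidfVectorizer` only keeps tokens of two or more characters, so the padding value `0` and any single-digit site IDs never reach the vocabulary. The second vectorizer below (`keep_tf`, a hypothetical name) shows a pattern that keeps them; whether that helps the score here is untested.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy sessions: space-separated site IDs (made-up values, "0" marks padding).
toy_sessions = [
    "718 0 0",
    "890 941 3847",
    "941 890 941",
]

# Default token_pattern drops one-character tokens such as "0".
toy_tf = TfidfVectorizer(ngram_range=(1, 2))
toy_matrix = toy_tf.fit_transform(toy_sessions)
print(sorted(toy_tf.vocabulary_))

# Assumption for illustration: a pattern that also keeps one-character tokens.
keep_tf = TfidfVectorizer(ngram_range=(1, 2), token_pattern=r"(?u)\b\w+\b")
keep_matrix = keep_tf.fit_transform(toy_sessions)
print(sorted(keep_tf.vocabulary_))
```

With the default pattern, the first toy session contributes only the token `718`, so its bigrams with the padding zeros are never created.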

Add features based on the session start time: hour, whether it's morning, day or night and so on.

In [7]:
times = ['time%s' % i for i in range(1, 11)]
full_df[times] = full_df[times].apply(pd.to_datetime)
full_df['hour'] = full_df['time1'].apply(lambda x: x.hour)

In [8]:
# Hours run from 0 to 23, so the four ranges below cover every hour exactly once.
full_df['is_morning'] = full_df['hour'].apply(lambda x: 1 if x in range(4, 13) else 0)
full_df['is_day'] = full_df['hour'].apply(lambda x: 1 if x in range(13, 19) else 0)
full_df['is_evening'] = full_df['hour'].apply(lambda x: 1 if x in range(19, 24) else 0)
full_df['is_night'] = full_df['hour'].apply(lambda x: 1 if x in range(0, 4) else 0)
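The same time-of-day flags can be sketched without `apply`, using the `.dt` accessor and `Series.between`, which is inclusive on both ends by default. The toy timestamps below are made up for illustration; the hour boundaries mirror the ones used in the cell above.

```python
import pandas as pd

# Toy session start times (made-up values).
toy = pd.DataFrame({"time1": pd.to_datetime([
    "2014-02-20 03:15:00",
    "2014-02-20 10:02:45",
    "2014-02-22 15:30:00",
    "2014-02-22 21:45:10",
])})

hour = toy["time1"].dt.hour

# between() includes both endpoints, so 4..12 matches range(4, 13), and so on.
toy["is_morning"] = hour.between(4, 12).astype(int)
toy["is_day"] = hour.between(13, 18).astype(int)
toy["is_evening"] = hour.between(19, 23).astype(int)
toy["is_night"] = hour.between(0, 3).astype(int)

print(toy)
```

Each row gets exactly one flag set, which is a quick sanity check that the ranges neither overlap nor leave gaps.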

Scale these features and combine them with the Tf-Idf features based on sites (you'll need `scipy.sparse.hstack`).

In [9]:
full_df = pd.concat([full_df, pd.get_dummies(full_df['hour'], prefix='hour')], axis=1)
full_df.drop('hour', axis=1, inplace=True)
full_df.drop(sites, axis=1, inplace=True)
full_df.drop(times, axis=1, inplace=True)
full_df.head()

session_id,is_morning,is_day,is_evening,is_night,hour_7,hour_8,hour_9,hour_10,hour_11,hour_12,...,hour_14,hour_15,hour_16,hour_17,hour_18,hour_19,hour_20,hour_21,hour_22,hour_23
1,1,0,0,0,0,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0
2,1,0,0,0,0,0,0,0,1,0,...,0,0,0,0,0,0,0,0,0,0
3,0,1,0,0,0,0,0,0,0,0,...,0,0,1,0,0,0,0,0,0,0
4,1,0,0,0,0,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0
5,1,0,0,0,0,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0


In [10]:
scaler = StandardScaler()
scaler.fit(full_df)

StandardScaler(copy=True, with_mean=True, with_std=True)

In [11]:
time_feat_scaled = scaler.transform(full_df[:idx])
time_feat = csr_matrix(time_feat_scaled)
time_feat

<253561x21 sparse matrix of type '<class 'numpy.float64'>'
	with 5071220 stored elements in Compressed Sparse Row format>
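The stored-element count above equals 253561 × 20, that is, 20 of the 21 columns are fully populated after scaling. A small sketch on made-up data shows why: standardizing 0/1 columns turns every zero into a non-zero value, while a constant column (here, plausibly `is_night`, since the dummy columns start at `hour_7`) stays at zero.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.preprocessing import StandardScaler

# Toy binary features: two informative 0/1 columns and one constant column.
toy = np.array([
    [1, 0, 0],
    [0, 1, 0],
    [1, 0, 0],
    [0, 1, 0],
], dtype=float)

scaled = StandardScaler().fit_transform(toy)
sparse_scaled = csr_matrix(scaled)

print(scaled)
print(sparse_scaled.nnz)  # number of explicitly stored non-zero entries
```

So wrapping the scaled block in `csr_matrix` keeps it compatible with `hstack`, but it does not save memory for these columns.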

In [12]:
X = hstack([time_feat, site_feat])
X

<253561x300021 sparse matrix of type '<class 'numpy.float64'>'
	with 9607453 stored elements in COOrdinate format>
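The output above reports the stacked matrix in COOrdinate format. A small sketch on toy matrices shows how `hstack` joins blocks column-wise and how the `format` argument can request CSR explicitly, which supports row slicing; the matrices and names here are made up for illustration.

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack

# Two toy feature blocks with the same number of rows.
time_block = csr_matrix(np.array([[1.0, 0.0], [0.0, 2.0], [3.0, 0.0]]))
site_block = csr_matrix(np.array([[0.0, 5.0, 0.0],
                                  [0.0, 0.0, 0.0],
                                  [7.0, 0.0, 0.0]]))

# Columns of time_block come first, then columns of site_block.
stacked = hstack([time_block, site_block], format="csr")

print(stacked.shape)
print(stacked.format)
print(stacked[:2].toarray())  # row slicing works on CSR
```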

Perform cross-validation with logistic regression.

In [13]:
scv = StratifiedKFold(n_splits=4, random_state=18, shuffle=True)
lr = LogisticRegression(random_state=18, n_jobs=-1, C=10)
np.mean(cross_val_score(lr, X, y, cv=scv, scoring='roc_auc', n_jobs=-1))

0.9844536457845772
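To illustrate what the stratified splitter contributes, here is a sketch on synthetic imbalanced data (generated with `make_classification`, not the competition data). It checks that each fold receives nearly the same number of positive examples and reports per-fold ROC AUC rather than only the mean.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic data with roughly 10% positives (an assumption for illustration).
X_toy, y_toy = make_classification(n_samples=400, n_features=10,
                                   weights=[0.9, 0.1], random_state=18)

skf = StratifiedKFold(n_splits=4, shuffle=True, random_state=18)

# Count positives in each validation fold.
fold_pos = [int(y_toy[test_idx].sum()) for _, test_idx in skf.split(X_toy, y_toy)]

scores = cross_val_score(LogisticRegression(C=10, max_iter=1000),
                         X_toy, y_toy, cv=skf, scoring="roc_auc")

print(fold_pos)
print(scores, scores.mean(), scores.std())
```

Looking at the spread of the fold scores, not just their mean, gives a rough idea of how much a difference between two feature sets can be trusted.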

In [14]:
param_grid = {'C': range(5,15)}
lr_grid = GridSearchCV(lr, param_grid, cv=3, n_jobs=-1, scoring='roc_auc')

In [15]:
lr_grid.fit(X, y)

GridSearchCV(cv=3, error_score='raise',
       estimator=LogisticRegression(C=10, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=-1,
          penalty='l2', random_state=18, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False),
       fit_params=None, iid=True, n_jobs=-1,
       param_grid={'C': range(5, 15)}, pre_dispatch='2*n_jobs', refit=True,
       return_train_score='warn', scoring='roc_auc', verbose=0)

In [16]:
lr_grid.best_params_, lr_grid.best_score_

({'C': 14}, 0.9831682618813616)
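The best value, `C=14`, is the largest value in `range(5, 15)`, so the optimum may lie outside the searched grid. One common way to check is a log-spaced grid plus a test for whether the winner sits on the grid boundary. The sketch below uses synthetic data and an arbitrary grid, so the numbers it prints say nothing about this competition.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

X_toy, y_toy = make_classification(n_samples=300, n_features=8, random_state=18)

# Log-spaced candidates: 0.01, 0.1, 1, 10, 100 (grid chosen for illustration).
c_values = np.logspace(-2, 2, 5)

search = GridSearchCV(LogisticRegression(max_iter=1000),
                      {"C": c_values}, cv=3, scoring="roc_auc")
search.fit(X_toy, y_toy)

best_c = search.best_params_["C"]
# If the winner is the smallest or largest candidate, the grid should be widened.
on_edge = best_c in (c_values[0], c_values[-1])

print(best_c, search.best_score_, on_edge)
```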

Make predictions for the test set and form a submission file.

In [17]:
time_feat_scaled = scaler.transform(full_df[idx:])
time_feat_test = csr_matrix(time_feat_scaled)

site_feat_test = tf.transform(session_str[idx:])

X_test = hstack([time_feat_test, site_feat_test])

In [18]:
test_pred = lr_grid.predict_proba(X_test)[:, 1]

In [19]:
def write_to_submission_file(predicted_labels, out_file,
                             target='target', index_label="session_id"):
    predicted_df = pd.DataFrame(predicted_labels,
                                index = np.arange(1, predicted_labels.shape[0] + 1),
                                columns=[target])
    predicted_df.to_csv(out_file, index_label=index_label)


In [20]:
# write_to_submission_file(test_pred, "assignment6_alice_submission.csv")
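Before uploading, it can help to confirm the file layout. The sketch below repeats the helper so that it runs on its own, writes made-up probabilities to an in-memory buffer instead of a file, and reads them back; `fake_pred` and `buf` are hypothetical names.

```python
import io

import numpy as np
import pandas as pd


def write_to_submission_file(predicted_labels, out_file,
                             target='target', index_label="session_id"):
    # Session IDs in the submission start at 1, not 0.
    predicted_df = pd.DataFrame(predicted_labels,
                                index=np.arange(1, predicted_labels.shape[0] + 1),
                                columns=[target])
    predicted_df.to_csv(out_file, index_label=index_label)


# Made-up probabilities standing in for real predictions.
fake_pred = np.array([0.1, 0.9, 0.5])

buf = io.StringIO()
write_to_submission_file(fake_pred, buf)

buf.seek(0)
check = pd.read_csv(buf, index_col="session_id")
print(check)
```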