<center>
<img src="../../img/ods_stickers.jpg">
## Open Machine Learning Course
<center>
Author: Yury Kashnitsky, Data Scientist at Mail.Ru Group

This material is subject to the terms and conditions of the license [Creative Commons CC BY-NC-SA 4.0](https://creativecommons.org/licenses/by-nc-sa/4.0/). Free use is permitted for any non-commercial purpose with an obligatory indication of the names of the authors and of the source.

## <center>Assignment #6. Part 1
### <center> Beating benchmarks in "Catch Me If You Can: Intruder Detection through Webpage Session Tracking"
    
[Competition](https://www.kaggle.com/c/catch-me-if-you-can-intruder-detection-through-webpage-session-tracking2). The task is to beat "Assignment 6 baseline".

In [51]:
import warnings
warnings.filterwarnings('ignore')
%matplotlib inline
from matplotlib import pyplot as plt
import seaborn as sns
import os
import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix, hstack
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.model_selection import StratifiedKFold, cross_val_score, GridSearchCV
from sklearn.metrics import roc_auc_score
from sklearn.linear_model import LogisticRegression, LogisticRegressionCV
from sklearn.feature_extraction.text import TfidfVectorizer

Reading original data

In [67]:
train_df = pd.read_csv('./data/competitions/catch-me-if-you-can-intruder-detection-through-webpage-session-tracking2/train_sessions.csv', index_col='session_id')
test_df = pd.read_csv('./data/competitions/catch-me-if-you-can-intruder-detection-through-webpage-session-tracking2/test_sessions.csv', index_col='session_id')
full_df = pd.concat([train_df.drop('target', axis=1), test_df])

idx = train_df.shape[0]

Separate the target variable.

In [68]:
y = train_df['target']

In [69]:
sites = ['site%s' %i for i in range(1, 11)]
full_df[sites] = full_df[sites].fillna(0).astype(int)
full_df[sites] = full_df[sites].astype(str)
full_df[sites].head()

session_id,site1,site2,site3,site4,site5,site6,site7,site8,site9,site10
1,718,0,0,0,0,0,0,0,0,0
2,890,941,3847,941,942,3846,3847,3846,1516,1518
3,14769,39,14768,14769,37,39,14768,14768,14768,14768
4,782,782,782,782,782,782,782,782,782,782
5,22,177,175,178,177,178,175,177,177,178


Build Tf-Idf features based on sites. You can use `ngram_range`=(1, 3) and `max_features`=100000 or more

In [70]:
# Join the ten site IDs of every session into one space-separated string
site_values = [full_df[col].values for col in sites]
session_str = site_values[0]
for col_values in site_values[1:]:
    session_str = session_str + ' ' + col_values

# Drop the '0' placeholders that mark missing sites
# (the loop variable is named `s` to avoid shadowing the built-in `str`)
session_str = [s.replace(' 0', '') for s in session_str]

In [71]:
tf = TfidfVectorizer(ngram_range=(1, 3), max_features=100000)
tf.fit(session_str)

site_feat = tf.transform(session_str[:idx])
site_feat

<253561x100000 sparse matrix of type '<class 'numpy.float64'>'
	with 3610893 stored elements in Compressed Sparse Row format>
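
To make the n-gram idea concrete, here is a minimal sketch on three made-up session strings (the site IDs and the `toy_*` names are invented for illustration, not taken from the data). With `ngram_range=(1, 3)` the vocabulary holds single site IDs as well as runs of two and three consecutive sites.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical sessions: each string is a sequence of site IDs
toy_sessions = ['718', '890 941 3847', '782 782 782']

toy_tf = TfidfVectorizer(ngram_range=(1, 3))
toy_matrix = toy_tf.fit_transform(toy_sessions)

# Vocabulary keys are unigrams, bigrams and trigrams of site IDs
toy_vocab = sorted(toy_tf.vocabulary_)
print(toy_vocab)
print(toy_matrix.shape)
```

Note that the default `token_pattern` of `TfidfVectorizer` keeps only tokens of two or more characters, so single-digit site IDs are ignored unless a different pattern is passed.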

Add features based on the session start time: hour, whether it's morning, day or night and so on.

In [72]:
times = ['time%s' % i for i in range(1, 11)]
full_df[times] = full_df[times].apply(pd.to_datetime)
full_df['hour'] = full_df['time1'].apply(lambda x: x.hour)
full_df['month'] = full_df['time1'].apply(lambda x: x.month)
full_df['year'] = full_df['time1'].apply(lambda x: x.year)

In [73]:
full_df['is_morning'] =  full_df['hour'].apply(lambda x: 1 if x in range(4,13) else 0)
full_df['is_day'] = full_df['hour'].apply(lambda x: 1 if x in range(13,19) else 0)
full_df['is_evening'] = full_df['hour'].apply(lambda x: 1 if x in range(19,25) else 0)
full_df['is_night'] = full_df['hour'].apply(lambda x: 1 if x in range(0,4) else 0)
full_df['day'] = full_df['time1'].apply(lambda x: x.weekday())
full_df['year_month'] = full_df['time1'].apply(lambda x: 100 * x.year + x.month)

full_df['is_morning_1'] = full_df['hour'].apply(lambda x: 1 if x in range(4, 9) else 0)
full_df['is_morning_2'] = full_df['hour'].apply(lambda x: 1 if x in range(9, 13) else 0)
full_df['is_day_1'] = full_df['hour'].apply(lambda x: 1 if x in range(13,16) else 0)
full_df['is_day_2'] = full_df['hour'].apply(lambda x: 1 if x in range(16,19) else 0)
full_df['is_evening_1'] = full_df['hour'].apply(lambda x: 1 if x in range(19,22) else 0)
full_df['is_evening_2'] = full_df['hour'].apply(lambda x: 1 if x in range(22,25) else 0)
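
Features of this kind can also be computed without per-row `apply` calls. The sketch below uses the pandas `.dt` accessor and `Series.between` on a small made-up frame; the timestamps and the `toy_df` name are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical session start times
toy_df = pd.DataFrame({'time1': pd.to_datetime([
    '2014-02-20 10:02:45',
    '2013-12-16 16:40:17',
    '2014-03-28 21:15:00',
])})

hour = toy_df['time1'].dt.hour
# Same boundaries as in the cell above (between() is inclusive on both ends):
# morning is 4..12, day is 13..18, evening is 19..23
toy_df['is_morning'] = hour.between(4, 12).astype(int)
toy_df['is_day'] = hour.between(13, 18).astype(int)
toy_df['is_evening'] = hour.between(19, 23).astype(int)
toy_df['day'] = toy_df['time1'].dt.weekday
toy_df['year_month'] = 100 * toy_df['time1'].dt.year + toy_df['time1'].dt.month
print(toy_df)
```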

Scale these features and combine them with the Tf-Idf features based on sites (you'll need `scipy.sparse.hstack`).

In [74]:
full_df = pd.concat([full_df, 
                     pd.get_dummies(full_df['hour'], prefix='hour'),
                     pd.get_dummies(full_df['day'], prefix='day'),
                     pd.get_dummies(full_df['month'], prefix='month'),
                     pd.get_dummies(full_df['year'], prefix='year')], axis=1)

full_df.drop(['hour', 'is_night', 'day', 'month', 'year'] + sites + times, axis=1, inplace=True)
full_df.head()

session_id,is_morning,is_day,is_evening,year_month,is_morning_1,is_morning_2,is_day_1,is_day_2,is_evening_1,is_evening_2,...,month_5,month_6,month_7,month_8,month_9,month_10,month_11,month_12,year_2013,year_2014
1,1,0,0,201402,0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
2,1,0,0,201402,0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
3,0,1,0,201312,0,0,0,1,0,0,...,0,0,0,0,0,0,0,1,1,0
4,1,0,0,201403,0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
5,1,0,0,201402,0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1


In [75]:
scaler = StandardScaler()
scaler.fit(full_df['year_month'].values.reshape(-1, 1))

StandardScaler(copy=True, with_mean=True, with_std=True)

In [76]:
# full_df['year_month_scaled'] = scaler.transform(full_df['year_month'].values.reshape(-1, 1))
to_drop=['is_morning_2','is_day_1','hour_7','hour_11','hour_15','hour_16',
        'hour_18','hour_21','hour_23','day_0','day_1',
        'day_4','month_1','month_3','month_12','year_month']
full_df.drop(to_drop, axis=1, inplace=True)



<336358x32 sparse matrix of type '<class 'numpy.int64'>'
	with 1369108 stored elements in Compressed Sparse Row format>

In [78]:
time_feat = csr_matrix(full_df[:idx])
time_feat

<253561x32 sparse matrix of type '<class 'numpy.int64'>'
	with 1015090 stored elements in Compressed Sparse Row format>

In [79]:
X = hstack([time_feat, site_feat]).tocsr()
X

<253561x100032 sparse matrix of type '<class 'numpy.float64'>'
	with 4625983 stored elements in Compressed Sparse Row format>
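
The scaler above is fitted, but the line that applies it is commented out, so only binary columns end up in `X`. For reference, a minimal sketch of how a numeric column could be standardized and stacked next to a sparse block is shown below. The toy values and the `toy_*` names are assumptions, not part of the solution.

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack
from sklearn.preprocessing import StandardScaler

# Hypothetical inputs: a sparse Tf-Idf-like block and one numeric column
toy_sparse = csr_matrix(np.array([[0.0, 0.5], [0.7, 0.0], [0.0, 0.0]]))
toy_year_month = np.array([201312, 201402, 201404], dtype=float).reshape(-1, 1)

# Standardize the numeric column to zero mean and unit variance
toy_scaled = StandardScaler().fit_transform(toy_year_month)

# Stack the two blocks column-wise; both must have the same number of rows
toy_X = hstack([toy_sparse, csr_matrix(toy_scaled)]).tocsr()
print(toy_X.shape)
```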

Perform cross-validation with logistic regression.

In [80]:
scv = StratifiedKFold(n_splits=4, random_state=18, shuffle=True)
lr = LogisticRegression(random_state=18, n_jobs=-1, C=10)
np.mean(cross_val_score(lr, X, y, cv=scv, scoring='roc_auc', n_jobs=-1))

0.989114014793439

In [65]:
def get_auc_lr_valid(X, y, C=1.0, seed=17, ratio = 0.9):
    # Split the data into the training and validation sets
    idx = int(round(X.shape[0] * ratio))
    # Classifier training
    lr = LogisticRegression(C=C, random_state=seed, n_jobs=-1).fit(X[:idx, :], y[:idx])
    # Prediction for validation set
    y_pred = lr.predict_proba(X[idx:, :])[:, 1]
    # Calculate the quality
    score = roc_auc_score(y[idx:], y_pred)
    
    return score

In [46]:
# X = csr_matrix(full_df.drop('year_month_scaled', axis=1)).tocsr()[:idx, :]
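# 'month_1' may already have been dropped in an earlier cell, hence errors='ignore'
time_feat = csr_matrix(full_df[:idx].drop(['month_1'], axis=1, errors='ignore'))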
X = hstack([site_feat, time_feat]).tocsr()

In [49]:
full_df.head()

session_id,is_morning,is_day,is_evening,is_morning_1,is_morning_2,is_day_1,is_day_2,is_evening_1,is_evening_2,hour_7,...,month_5,month_6,month_7,month_8,month_9,month_10,month_11,month_12,year_2013,year_2014
1,1,0,0,0,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
2,1,0,0,0,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
3,0,1,0,0,0,0,1,0,0,0,...,0,0,0,0,0,0,0,1,1,0
4,1,0,0,0,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
5,1,0,0,0,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1


In [None]:
useless_features = []
for feature in full_df.columns:
    time_feat = csr_matrix(full_df[:idx].drop([feature], axis=1))
    X = hstack([site_feat, time_feat]).tocsr()
    auc_score = get_auc_lr_valid(X, y, C=10.0)
    
    if auc_score > 0.989458442753113:
        print(feature)
        useless_features.append(feature)

is_morning
is_day
is_evening
is_morning_1
is_day_2
is_evening_1
is_evening_2
hour_8
hour_9
hour_10
hour_13
hour_14
hour_17
hour_19
hour_20
hour_22
day_3
day_5
day_6
month_2
month_4
month_5
month_6
month_7
month_8


In [None]:
%%time
get_auc_lr_valid(X, y, C=10.0)

In [None]:
# all 48 features 0.989458442753113
# without 'year_month_scaled' 0.989461012297841

# time_feat 0.9332844209787909
# site_feat 0.966657748205173
# site_feat + 'year_month_scaled' 0.9687415686813612
# site_feat + 'is_morning' 0.9698333842959705
# site_feat + 'year_month_scaled'+ 'is_morning' 0.9724231641887691
# site_feat + 'year_month_scaled'+ 'is_morning' + 'is_day' 0.9729608414231167
# site_feat + bin_feat_is_* + 'year_month_scaled' 0.9730133564934966
# site_feat + bin_feat_is_* + 'year_month_scaled' + day_* 0.9774413244461346
# site_feat + day_* 0.9733971822372511
# site_feat + day_* + 'year_month_scaled' 0.9744273287783871
# site_feat + day_* + bin_feat_is_* 0.976016030104786
# site_feat + hours 0.9855231850020814
# site_feat + hours + 'year_month_scaled' 0.9867531939440969
# site_feat + hours + day_* 0.9870136012426318
# site_feat + hours + day_* + bin_feat_is_* 0.9870601742408279
# site_feat + hours + day_* + bin_feat_is_* + 'year_month_scaled' 0.9880636617554102
# site_feat + months 0.9746322499704503

# 48 without 'year_month_scaled', 'month_1' 0.9894912044483959

In [None]:
lr_grid = LogisticRegression(random_state=18, n_jobs=-1, C=10).fit(X, y)

Make prediction for the test set and form a submission file.

In [None]:
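# These columns may be absent depending on which cells were run, hence errors='ignore'
time_feat_test = csr_matrix(full_df[idx:].drop(['year_month_scaled', 'month_1'], axis=1, errors='ignore'))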
site_feat_test = tf.transform(session_str[idx:])

X_test = hstack([site_feat_test, time_feat_test]).tocsr()

In [None]:
test_pred = lr_grid.predict_proba(X_test)[:, 1]

In [None]:
def write_to_submission_file(predicted_labels, out_file,
                             target='target', index_label="session_id"):
    predicted_df = pd.DataFrame(predicted_labels,
                                index = np.arange(1, predicted_labels.shape[0] + 1),
                                columns=[target])
    predicted_df.to_csv(out_file, index_label=index_label)


In [None]:
write_to_submission_file(test_pred, "assignment6_alice_submission.csv")
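
Before uploading, it can help to read the file back and check its layout. The sketch below writes a few made-up probabilities in the same format to a temporary location and verifies the result; `toy_pred` and the file name are hypothetical stand-ins for `test_pred` and the real submission file.

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Hypothetical predictions standing in for test_pred
toy_pred = np.array([0.01, 0.93, 0.20])

# Write them in the same layout as the submission file: session_id, target
out_path = os.path.join(tempfile.mkdtemp(), 'toy_submission.csv')
pd.DataFrame(toy_pred,
             index=np.arange(1, toy_pred.shape[0] + 1),
             columns=['target']).to_csv(out_path, index_label='session_id')

# Read the file back and check the index, the column name and the value range
check_df = pd.read_csv(out_path, index_col='session_id')
print(check_df)
```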