<center>
<img src="../../img/ods_stickers.jpg">
## Open Machine Learning Course
<center>
Author: Yury Kashnitsky, Data Scientist at Mail.Ru Group

This material is subject to the terms and conditions of the license [Creative Commons CC BY-NC-SA 4.0](https://creativecommons.org/licenses/by-nc-sa/4.0/). Free use is permitted for any non-commercial purpose with an obligatory indication of the names of the authors and of the source.

## <center>Assignment #6. Part 1
### <center> Beating benchmarks in "Catch Me If You Can: Intruder Detection through Webpage Session Tracking"
    
[Competition](https://www.kaggle.com/c/catch-me-if-you-can-intruder-detection-through-webpage-session-tracking2). The task is to beat "Assignment 6 baseline".

In [1]:
import warnings
warnings.filterwarnings('ignore')
%matplotlib inline
from matplotlib import pyplot as plt
import seaborn as sns
import os
import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix, hstack
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import roc_auc_score
from sklearn.linear_model import LogisticRegression, LogisticRegressionCV
from sklearn.feature_extraction.text import TfidfVectorizer

Reading original data

In [16]:
PATH_TO_DATA = ('../../data')
train_df = pd.read_csv(os.path.join(PATH_TO_DATA, 'train_sessions.csv'), index_col='session_id')
test_df = pd.read_csv(os.path.join(PATH_TO_DATA, 'test_sessions.csv'), index_col='session_id')

Separate target feature 

In [3]:
y = train_df['target']

In [4]:
sites = ['site%s' % i for i in range(1, 11)]

In [22]:
import pickle
# Load websites dictionary
with open(r"../../data/site_dic.pkl", "rb") as input_file:
    site_dict = pickle.load(input_file)

# Create dataframe for the dictionary
sites_dict = pd.DataFrame(list(site_dict.keys()), index=list(site_dict.values()), columns=['site'])
print(u'Websites total:', sites_dict.shape[0])
sites_dict.head()

Websites total: 48371


,site
25075,www.abmecatronique.com
13997,groups.live.com
42436,majeureliguefootball.wordpress.com
30911,cdt46.media.tourinsoft.eu
8104,www.hdwallpapers.eu
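When inspecting sessions it is often handy to go the other way, from a site ID back to its name. The sketch below inverts a small stand-in dictionary; `site_dict_toy`, `id_to_site` and the `"<empty>"` placeholder are illustrative names, and the two entries are copied from the output above.

```python
# Toy stand-in for site_dict (site name -> integer ID), taken from the output above
site_dict_toy = {
    "www.abmecatronique.com": 25075,
    "groups.live.com": 13997,
}

# Invert the mapping: integer ID -> site name
id_to_site = {site_id: name for name, site_id in site_dict_toy.items()}

# 0 is used later in the notebook as the filler for missing sites
session = [13997, 0, 25075]
names = [id_to_site.get(site_id, "<empty>") for site_id in session]
print(names)
```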


Build Tf-Idf features based on sites. You can use `ngram_range`=(1, 3) and `max_features`=100000 or more

In [141]:
# Your code here
tfidf = TfidfVectorizer(ngram_range=(1, 3), max_features=100000)
tfidf12 = TfidfVectorizer(ngram_range=(1, 3), max_features=12000)
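To make the n-gram idea concrete, here is a minimal sketch on three made-up sessions written as space-separated site IDs. One detail worth knowing: the default `token_pattern` of `TfidfVectorizer` only keeps tokens of two or more word characters, so the `0` filler and single-digit site IDs are silently dropped. The toy sessions and the `vec_all` variant are assumptions for illustration, not part of the assignment.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Three toy sessions; 0 stands for "no site" (padding)
sessions = ["718 0 0", "890 941 3847", "941 890 941"]

# Default tokenizer: tokens need at least two characters, so "0" disappears
vec = TfidfVectorizer(ngram_range=(1, 3), max_features=100000)
X = vec.fit_transform(sessions)
print(X.shape)
print(sorted(vec.vocabulary_))

# Optional variant that keeps one-character tokens as well
vec_all = TfidfVectorizer(ngram_range=(1, 3), token_pattern=r"(?u)\b\w+\b")
X_all = vec_all.fit_transform(sessions)
```

Whether dropping the padding token is desirable depends on the model; with the default pattern it costs nothing extra.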

In [57]:
full_df = pd.concat([train_df.drop('target', axis=1), test_df])

## Let's try adding the difference between time1 and time10

In [164]:
train_df['time1t'] = pd.to_datetime(train_df['time1'], format='%Y-%m-%d %H:%M:%S')

In [190]:
from datetime import datetime

def delta_time(row):
    # Session duration: time10 - time1, or zero when time10 is missing
    fmt = '%Y-%m-%d %H:%M:%S'
    start = datetime.strptime(row.time1, fmt)
    try:
        return datetime.strptime(row.time10, fmt) - start
    except TypeError:
        # time10 is NaN (a float) for sessions with fewer than 10 sites
        return start - start

In [191]:
train_df['delta'] = train_df.apply(delta_time, axis=1)

In [206]:
full_df['delta'] = full_df.apply(delta_time, axis=1)
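A row-wise `apply` with `strptime` is slow on several hundred thousand rows. A vectorized alternative, sketched here on a two-row toy frame, parses both columns with `pd.to_datetime` and fills missing durations with zero, which mirrors the behaviour of `delta_time`. The frame `toy` and the name `delta_seconds` are illustrative.

```python
import pandas as pd

# Toy frame: the first session has only one site, so time10 is missing
toy = pd.DataFrame({
    "time1": ["2014-02-20 10:02:45", "2014-02-22 11:19:50"],
    "time10": [None, "2014-02-22 11:20:16"],
})

start = pd.to_datetime(toy["time1"])
end = pd.to_datetime(toy["time10"])  # missing values become NaT

# NaT - timestamp gives NaT; replace it with a zero-length duration
delta = (end - start).fillna(pd.Timedelta(0))
delta_seconds = delta.dt.total_seconds()
print(delta_seconds.tolist())
```

Keeping the duration as a number of seconds also makes it easier to scale or bin later than a `Timedelta` column.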

In [192]:
train_df.head()

session_id,site1,time1,site2,time2,site3,time3,site4,time4,site5,time5,...,time7,site8,time8,site9,time9,site10,time10,target,time1t,delta
1,718,2014-02-20 10:02:45,,,,,,,,,...,,,,,,,,0,2014-02-20 10:02:45,00:00:00
2,890,2014-02-22 11:19:50,941.0,2014-02-22 11:19:50,3847.0,2014-02-22 11:19:51,941.0,2014-02-22 11:19:51,942.0,2014-02-22 11:19:51,...,2014-02-22 11:19:52,3846.0,2014-02-22 11:19:52,1516.0,2014-02-22 11:20:15,1518.0,2014-02-22 11:20:16,0,2014-02-22 11:19:50,00:00:26
3,14769,2013-12-16 16:40:17,39.0,2013-12-16 16:40:18,14768.0,2013-12-16 16:40:19,14769.0,2013-12-16 16:40:19,37.0,2013-12-16 16:40:19,...,2013-12-16 16:40:20,14768.0,2013-12-16 16:40:21,14768.0,2013-12-16 16:40:22,14768.0,2013-12-16 16:40:24,0,2013-12-16 16:40:17,00:00:07
4,782,2014-03-28 10:52:12,782.0,2014-03-28 10:52:42,782.0,2014-03-28 10:53:12,782.0,2014-03-28 10:53:42,782.0,2014-03-28 10:54:12,...,2014-03-28 10:55:12,782.0,2014-03-28 10:55:42,782.0,2014-03-28 10:56:12,782.0,2014-03-28 10:56:42,0,2014-03-28 10:52:12,00:04:30
5,22,2014-02-28 10:53:05,177.0,2014-02-28 10:55:22,175.0,2014-02-28 10:55:22,178.0,2014-02-28 10:55:23,177.0,2014-02-28 10:55:23,...,2014-02-28 10:55:59,177.0,2014-02-28 10:55:59,177.0,2014-02-28 10:57:06,178.0,2014-02-28 10:57:11,0,2014-02-28 10:53:05,00:04:06


In [195]:
train_df[train_df.target == 1].delta.mean()

Timedelta('0 days 00:00:46.634305')

In [196]:
train_df[train_df.target == 0].delta.mean()

Timedelta('0 days 00:01:49.528766')

In [194]:
train_df.delta.mean()

Timedelta('0 days 00:01:48.959007')
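The three means above can be obtained in one call with `groupby`, and it may help to look at the median as well, since a few very long sessions can dominate the mean. This is a sketch on made-up values; `toy`, `delta_sec` and `summary` are illustrative names.

```python
import pandas as pd

# Toy data: two sessions per class, durations given in seconds
toy = pd.DataFrame({
    "target": [0, 0, 1, 1],
    "delta": pd.to_timedelta([0, 26, 7, 270], unit="s"),
})

# Work in seconds so that the aggregates are plain floats
toy["delta_sec"] = toy["delta"].dt.total_seconds()
summary = toy.groupby("target")["delta_sec"].agg(["mean", "median", "count"])
print(summary)
```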

In [201]:
len(train_df.delta.value_counts())

1797

In [199]:
len(train_df[train_df.target == 0].delta.value_counts())

1797

In [200]:
len(train_df[train_df.target == 1].delta.value_counts())

265

In [207]:
DeltaOHE = pd.get_dummies(full_df.delta)
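One-hot encoding the raw duration creates one column per distinct value (1797 in the training set), most of them very sparse. A possible alternative, not what the notebook does, is to bin the duration first and encode the bins. The bin edges and labels below are arbitrary assumptions chosen for illustration.

```python
import pandas as pd

# Toy session durations in seconds
seconds = pd.Series([0, 7, 26, 246, 270, 1500])

# Hypothetical bin edges; intervals are right-closed: (-1, 0], (0, 10], ...
bins = [-1, 0, 10, 60, 300, float("inf")]
labels = ["0s", "1-10s", "11-60s", "1-5min", ">5min"]
binned = pd.cut(seconds, bins=bins, labels=labels)

# One column per bin instead of one column per distinct duration
delta_bins_ohe = pd.get_dummies(binned)
print(delta_bins_ohe.shape)
```

Whether coarse bins help or hurt the score would need to be checked with the same cross-validation used elsewhere in the notebook.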

In [166]:
train_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 253561 entries, 1 to 253561
Data columns (total 22 columns):
site1     253561 non-null int64
time1     253561 non-null object
site2     250098 non-null float64
time2     250098 non-null object
site3     246919 non-null float64
time3     246919 non-null object
site4     244321 non-null float64
time4     244321 non-null object
site5     241829 non-null float64
time5     241829 non-null object
site6     239495 non-null float64
time6     239495 non-null object
site7     237297 non-null float64
time7     237297 non-null object
site8     235224 non-null float64
time8     235224 non-null object
site9     233084 non-null float64
time9     233084 non-null object
site10    231052 non-null float64
time10    231052 non-null object
target    253561 non-null int64
time1t    253561 non-null datetime64[ns]
dtypes: datetime64[ns](1), float64(9), int64(2), object(10)
memory usage: 54.5+ MB


In [66]:
train_df.shape, test_df.shape

((253561, 21), (82797, 20))

In [73]:
len(y)

253561

In [68]:
idx_split = train_df.shape[0]

In [60]:
Xfull_df = full_df[sites]
Xfull_df = Xfull_df.fillna(0).astype(int)

In [64]:
# One space-separated string of site IDs per session
Xfull = [' '.join(row) for row in Xfull_df.astype(str).values]

In [72]:
len(Xfull[:idx_split])

253561

In [50]:
X_tfidf = tfidf.fit_transform(Xfull[:idx_split])  # training rows only (assumed to be what the undefined Xtrain_st held)

In [142]:
Xfull_ifidf = tfidf.fit_transform(Xfull)

In [143]:
Xfull_ifidf12 = tfidf12.fit_transform(Xfull)

In [144]:
Xfull_ifidf

<336358x100000 sparse matrix of type '<class 'numpy.float64'>'
	with 4759727 stored elements in Compressed Sparse Row format>

In [145]:
Xfull_ifidf12

<336358x12000 sparse matrix of type '<class 'numpy.float64'>'
	with 3732499 stored elements in Compressed Sparse Row format>

Add features based on the session start time: hour, whether it's morning, day or night and so on.

In [222]:
# Your code here
def getMDAN(data):
    # Part-of-day code from the start hour: 1 = 00-07, 2 = 08-15, 3 = 16-22, 4 = 23
    hour = datetime.strptime(data, '%Y-%m-%d %H:%M:%S').hour
    if 0 <= hour < 8:
        return 1
    elif 8 <= hour < 16:
        return 2
    elif 16 <= hour < 23:
        return 3
    else:
        return 4
PartsOfDay = full_df['time1'].apply(getMDAN)
PartsOfDayOHE = pd.get_dummies(PartsOfDay)

In [223]:
WeekDays = full_df['time1'].apply(lambda x : datetime.strptime(x,'%Y-%m-%d %H:%M:%S').isoweekday())
WeekDaysOHE1 = pd.get_dummies(WeekDays)
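The same start-time features can be computed without per-row `strptime` calls by parsing the column once and using the `.dt` accessor. The sketch below uses three timestamps and reproduces the part-of-day boundaries of `getMDAN` with `pd.cut`; the names `times`, `hour`, `weekday` and `part` are illustrative.

```python
import pandas as pd

times = pd.to_datetime(pd.Series([
    "2014-02-20 10:02:45",
    "2014-02-22 23:19:50",
    "2013-12-16 16:40:17",
]))

hour = times.dt.hour
weekday = times.dt.dayofweek + 1  # dayofweek is 0-based; +1 matches isoweekday()

# Same boundaries as getMDAN: 0-7 -> 1, 8-15 -> 2, 16-22 -> 3, 23 -> 4
part = pd.cut(hour, bins=[-1, 7, 15, 22, 23], labels=[1, 2, 3, 4]).astype(int)

print(hour.tolist(), weekday.tolist(), part.tolist())
```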

In [232]:
SitesDaysPartsOHE12 = hstack([DeltasOHE12,PartsOfDayOHE,WeekDaysOHE1.drop([3,6,7], axis=1)])
# SitesDaysPartsOHE12 = hstack([DeltasOHE12,WeekDaysOHE1.drop([3,6,7], axis=1)])

In [115]:
from datetime import datetime

StartHour = full_df['time1'].apply(lambda x : datetime.strptime(x,'%Y-%m-%d %H:%M:%S').hour)

In [116]:
StartHourOHE = pd.get_dummies(StartHour)

In [146]:
SitesHoursOHE = hstack([Xfull_ifidf,StartHourOHE])
SitesHoursOHE12 = hstack([Xfull_ifidf12,StartHourOHE])

In [217]:
SitesHoursOHE12drop = hstack([Xfull_ifidf12,StartHourOHE.drop([16,17,18], axis=1)])
DeltasOHE12drop = hstack([SitesHoursOHE12drop, DeltaOHE])

In [210]:
DeltasOHE = hstack([SitesHoursOHE, DeltaOHE])
DeltasOHE12 = hstack([SitesHoursOHE12, DeltaOHE])

Scale these features and combine them with the Tf-Idf features based on sites (you'll need `scipy.sparse.hstack`)

In [49]:
# Your code here
enc = OneHotEncoder()
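As a minimal sketch of the scaling step, assuming a single numeric feature such as the session duration in seconds: standardize it with `StandardScaler`, wrap it in a sparse matrix and append it to the Tf-Idf block. The small arrays are made up; in practice it is usually safer to fit the scaler on the training rows only and then transform the test rows.

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack
from sklearn.preprocessing import StandardScaler

# Stand-in for a Tf-Idf matrix: 3 sessions, 2 text features
tfidf_part = csr_matrix(np.array([[0.5, 0.0], [0.0, 1.0], [0.7, 0.7]]))

# Hypothetical numeric feature: session duration in seconds, as a column
duration_sec = np.array([[0.0], [26.0], [270.0]])
scaled = StandardScaler().fit_transform(duration_sec)

# hstack returns COO format; convert to CSR so that rows can be sliced
combined = hstack([tfidf_part, csr_matrix(scaled)]).tocsr()
print(combined.shape)
```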

Perform cross-validation with logistic regression.

In [99]:
from sklearn.model_selection import TimeSeriesSplit
tscv = TimeSeriesSplit(n_splits=4)
def calcScore(X, y, C):
    # Print ROC AUC on each time-ordered validation fold
    for train_index, test_index in tscv.split(X):
        X_train, X_test = X[train_index, :], X[test_index, :]
        # y is a Series indexed by session_id, so select rows by position
        y_train1, y_test = y.iloc[train_index], y.iloc[test_index]

        lr = LogisticRegression(C=C, random_state=17, n_jobs=-1).fit(X_train, y_train1)
        # Prediction for validation set
        y_pred = lr.predict_proba(X_test)[:, 1]
        # Calculate the quality
        score = roc_auc_score(y_test, y_pred)
        print(score)

In [None]:
calcScore(Xfull_ifidf[:idx_split], y,1)
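To see what `TimeSeriesSplit` actually does, the sketch below prints the folds for ten toy rows. Every validation fold lies strictly after its training rows, so the scheme only imitates "train on the past, predict the future" if the rows are sorted by session start time. The rows shown by `train_df.head()` are not in chronological order, so sorting by `time1` first would be needed; that is an observation about the sample shown, not something the notebook does.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Ten toy rows, assumed to be already sorted by time
X_toy = np.arange(10).reshape(-1, 1)
tscv_toy = TimeSeriesSplit(n_splits=4)

folds = [(train.tolist(), test.tolist()) for train, test in tscv_toy.split(X_toy)]
for train_idx, test_idx in folds:
    print("train:", train_idx, "validate:", test_idx)
```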

In [95]:
from sklearn.model_selection import cross_val_score

In [135]:
cross_val_score(lr, Xfull_ifidf[:idx_split], y, scoring='roc_auc').mean()

0.94340095387900647

In [147]:
cross_val_score(lr, SitesHoursOHE.tocsr()[:idx_split], y, scoring='roc_auc').mean()

0.97512292128486155

In [148]:
cross_val_score(lr, SitesHoursOHE12.tocsr()[:idx_split], y, scoring='roc_auc').mean()

0.97319879771730233

In [211]:
cross_val_score(lr, DeltasOHE12.tocsr()[:idx_split], y, scoring='roc_auc').mean()

0.97226883897277949

In [212]:
cross_val_score(lr, DeltasOHE.tocsr()[:idx_split], y, scoring='roc_auc').mean()

0.97389228315846799

In [225]:
cross_val_score(lr, SitesDaysPartsOHE12.tocsr()[:idx_split], y, scoring='roc_auc').mean()

0.9754504242499088

In [229]:
cross_val_score(lr, SitesDaysPartsOHE12.tocsr()[:idx_split], y, scoring='roc_auc').mean()

0.97549451045162139

In [None]:
cross_val_score(lr, Xfull_ifidf[:idx_split], y, scoring='roc_auc').mean()
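The calls above rely on the default `cv` of `cross_val_score`, which depends on the scikit-learn version (3 folds in older releases, 5 from version 0.22). Passing the splitter explicitly makes the scheme reproducible. The sketch below uses synthetic data from `make_classification` as a stand-in for the session features; `X_demo`, `y_demo` and `scores` are illustrative names.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import TimeSeriesSplit, cross_val_score

# Synthetic stand-in for the real feature matrix and target
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=17)

scores = cross_val_score(
    LogisticRegression(C=1.0, random_state=17),
    X_demo,
    y_demo,
    scoring="roc_auc",
    cv=TimeSeriesSplit(n_splits=4),  # explicit splitter instead of the default
)
print(scores, scores.mean())
```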

In [54]:
# Your code here
# first, let's try validating in the baseline style
def get_auc_lr_valid(X, y, C=1.0, seed=17, ratio = 0.9):
    # Split the data into the training and validation sets
    idx = int(round(X.shape[0] * ratio))
    # Classifier training
    lr = LogisticRegression(C=C, random_state=seed, n_jobs=-1).fit(X[:idx, :], y[:idx])
    # Prediction for validation set
    y_pred = lr.predict_proba(X[idx:, :])[:, 1]
    # Calculate the quality
    score = roc_auc_score(y[idx:], y_pred)
    
    return score

In [151]:
get_auc_lr_valid(SitesHoursOHE.tocsr()[:idx_split], y)

0.98011268738404933

In [152]:
get_auc_lr_valid(SitesHoursOHE12.tocsr()[:idx_split], y)

0.97665472256625574
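The same hold-out scheme can be reused to compare several values of the regularization strength `C` instead of fixing `C=1.0`. This is a rough sketch on synthetic data, with an arbitrary grid of five values; `holdout_auc`, `X_demo` and `y_demo` are hypothetical names, and a single 10% hold-out is a noisy estimate, so small differences between values of `C` should not be over-interpreted.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

def holdout_auc(X, y, C, ratio=0.9, seed=17):
    # Train on the first `ratio` share of rows, score on the remainder
    idx = int(round(X.shape[0] * ratio))
    model = LogisticRegression(C=C, random_state=seed, max_iter=1000)
    model.fit(X[:idx], y[:idx])
    return roc_auc_score(y[idx:], model.predict_proba(X[idx:])[:, 1])

# Synthetic stand-in for the real feature matrix and target
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=17)

# Arbitrary grid: 0.01, 0.1, 1, 10, 100
scores_by_C = {C: holdout_auc(X_demo, y_demo, C) for C in np.logspace(-2, 2, 5)}
best_C = max(scores_by_C, key=scores_by_C.get)
print(best_C, scores_by_C[best_C])
```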

Make prediction for the test set and form a submission file.

In [154]:
lr = LogisticRegression(C=1.0, random_state=17, n_jobs=-1).fit(SitesHoursOHE.tocsr()[:idx_split, :], y)
lr12 = LogisticRegression(C=1.0, random_state=17, n_jobs=-1).fit(SitesHoursOHE12.tocsr()[:idx_split, :], y)

In [155]:
# Your code here
test_pred = lr.predict_proba(SitesHoursOHE.tocsr()[idx_split:])[:, 1]
test_pred12 = lr12.predict_proba(SitesHoursOHE12.tocsr()[idx_split:])[:, 1]

In [214]:
lr_delta = LogisticRegression(C=1.0, random_state=17, n_jobs=-1).fit(DeltasOHE.tocsr()[:idx_split, :], y)
lr_delta12 = LogisticRegression(C=1.0, random_state=17, n_jobs=-1).fit(DeltasOHE12.tocsr()[:idx_split, :], y)

test_pred_delta = lr_delta.predict_proba(DeltasOHE.tocsr()[idx_split:])[:, 1]
test_pred_delta12 = lr_delta12.predict_proba(DeltasOHE12.tocsr()[idx_split:])[:, 1]

In [219]:
lr_delta12drop = LogisticRegression(C=1.0, random_state=17, n_jobs=-1).fit(DeltasOHE12drop.tocsr()[:idx_split, :], y)
test_pred_delta12drop = lr_delta12drop.predict_proba(DeltasOHE12drop.tocsr()[idx_split:])[:, 1]

In [233]:
#SitesDaysPartsOHE12
lr_delta12VK = LogisticRegression(C=7, random_state=17, n_jobs=-1).fit(SitesDaysPartsOHE12.tocsr()[:idx_split, :], y)
test_pred_delta12VK = lr_delta12VK.predict_proba(SitesDaysPartsOHE12.tocsr()[idx_split:])[:, 1]

In [84]:
def write_to_submission_file(predicted_labels, out_file,
                             target='target', index_label="session_id"):
    predicted_df = pd.DataFrame(predicted_labels,
                                index = np.arange(1, predicted_labels.shape[0] + 1),
                                columns=[target])
    predicted_df.to_csv(out_file, index_label=index_label)
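Before uploading, it can be worth checking what the written file looks like. The sketch below builds an equivalent frame for three made-up predictions and writes it to an in-memory buffer instead of disk; it mirrors the layout produced by `write_to_submission_file` (a `session_id` index starting at 1 and a `target` column) but is not a replacement for it.

```python
import io

import numpy as np
import pandas as pd

# Made-up probabilities for three test sessions
preds = np.array([0.1, 0.9, 0.3])

submission = pd.DataFrame(
    {"target": preds},
    index=pd.RangeIndex(1, len(preds) + 1, name="session_id"),
)

# Write to a string buffer so that nothing is created on disk
buf = io.StringIO()
submission.to_csv(buf)
csv_text = buf.getvalue()
print(csv_text)
```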


In [156]:
write_to_submission_file(test_pred, "assignment6_alice_submission5_100000.csv")
write_to_submission_file(test_pred12, "assignment6_alice_submission5_12000.csv")

In [215]:
write_to_submission_file(test_pred_delta, "assignment6_alice_submission6_delta_100000.csv")
write_to_submission_file(test_pred_delta12, "assignment6_alice_submission6_delta_12000.csv")

In [220]:
write_to_submission_file(test_pred_delta12drop, "assignment6_alice_submission7_delta_12000_drop161718.csv")

In [234]:
write_to_submission_file(test_pred_delta12VK, "assignment6_alice_submission7_delta_12000_VK3.csv")