<center>
<img src="../../img/ods_stickers.jpg">
## Open Machine Learning Course
<center>
Author: Yury Kashnitsky, Data Scientist at Mail.Ru Group

This material is subject to the terms and conditions of the license [Creative Commons CC BY-NC-SA 4.0](https://creativecommons.org/licenses/by-nc-sa/4.0/). Free use is permitted for any non-commercial purpose with an obligatory indication of the names of the authors and of the source.

## <center>Assignment #6. Part 1
### <center> Beating benchmarks in "Catch Me If You Can: Intruder Detection through Webpage Session Tracking"
    
[Competition](https://www.kaggle.com/c/catch-me-if-you-can-intruder-detection-through-webpage-session-tracking2). The task is to beat "Assignment 6 baseline".

In [1]:
import warnings
warnings.filterwarnings('ignore')
%matplotlib inline
from matplotlib import pyplot as plt
import seaborn as sns
import os
import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix, hstack
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import roc_auc_score
from sklearn.linear_model import LogisticRegression, LogisticRegressionCV
from sklearn.feature_extraction.text import TfidfVectorizer

Reading original data

In [16]:
PATH_TO_DATA = '../../data'
train_df = pd.read_csv(os.path.join(PATH_TO_DATA, 'train_sessions.csv'), index_col='session_id')
test_df = pd.read_csv(os.path.join(PATH_TO_DATA, 'test_sessions.csv'), index_col='session_id')

Separate the target variable and select the site columns.

In [3]:
y = train_df['target']

In [4]:
sites = ['site%s' % i for i in range(1, 11)]

In [17]:
Xtrain_st = train_df[sites]

In [18]:
Xtrain_st = Xtrain_st.fillna(0).astype(int)

In [23]:
Xtrain_st = Xtrain_st.to_string(index=False).split('\n')[1:]

In [30]:
Xtrain_st = Xtrain_st[1:]

In [33]:
len(Xtrain_st), len(y)

(253561, 253561)
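The conversion above depends on how `to_string` lays out its header lines, which may differ between pandas versions (an assumption worth checking on your own setup). As a minimal sketch of an alternative, the rows can be joined into space-separated strings directly. The helper name `sessions_to_strings` and the toy frame are hypothetical and only illustrate the idea.

```python
import numpy as np
import pandas as pd

# Toy stand-in for train_df[sites]; the values are made up for illustration.
toy_sites = pd.DataFrame(
    {"site1": [718, 890], "site2": [np.nan, 941], "site3": [np.nan, 3847]}
)

def sessions_to_strings(sites_df):
    """Join each row of site IDs into one space-separated string."""
    # Missing sites become 0, matching the fillna(0) step in the notebook.
    filled = sites_df.fillna(0).astype(int).astype(str)
    return filled.apply(" ".join, axis=1).tolist()

toy_sessions = sessions_to_strings(toy_sites)
print(toy_sessions)
```

Row-wise `apply` is not the fastest option on a few hundred thousand rows, but it avoids any dependence on header layout and yields exactly one string per session.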

In [22]:
import pickle
# Load websites dictionary
with open(r"../../data/site_dic.pkl", "rb") as input_file:
    site_dict = pickle.load(input_file)

# Create dataframe for the dictionary
sites_dict = pd.DataFrame(list(site_dict.keys()), index=list(site_dict.values()), columns=['site'])
print(u'Websites total:', sites_dict.shape[0])
sites_dict.head()

Websites total: 48371


| | site |
|---|---|
| 25075 | www.abmecatronique.com |
| 13997 | groups.live.com |
| 42436 | majeureliguefootball.wordpress.com |
| 30911 | cdt46.media.tourinsoft.eu |
| 8104 | www.hdwallpapers.eu |


Build Tf-Idf features based on sites. You can use `ngram_range`=(1, 3) and `max_features`=100000 or more

In [118]:
# Your code here
tfidf = TfidfVectorizer(ngram_range=(1, 3), max_features=12000)

In [46]:
X_tfidf = tfidf.fit_transform(Xtrain_st[:2])
Xtrain_st[:2]

['  718      0      0      0      0      0      0      0      0       0',
 '  890    941   3847    941    942   3846   3847   3846   1516    1518']

In [47]:
X_tfidf.toarray()

array([[ 0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
         0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
         0.        ,  0.        ,  0.        ,  1.        ,  0.        ,
         0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
         0.        ,  0.        ,  0.        ,  0.        ,  0.        ],
       [ 0.17407766,  0.17407766,  0.17407766,  0.34815531,  0.17407766,
         0.17407766,  0.17407766,  0.17407766,  0.34815531,  0.17407766,
         0.17407766,  0.17407766,  0.17407766,  0.        ,  0.17407766,
         0.17407766,  0.17407766,  0.34815531,  0.17407766,  0.17407766,
         0.17407766,  0.17407766,  0.17407766,  0.17407766,  0.17407766]])

In [48]:
tfidf.vocabulary_

{'1516': 0,
 '1516 1518': 1,
 '1518': 2,
 '3846': 3,
 '3846 1516': 4,
 '3846 1516 1518': 5,
 '3846 3847': 6,
 '3846 3847 3846': 7,
 '3847': 8,
 '3847 3846': 9,
 '3847 3846 1516': 10,
 '3847 941': 11,
 '3847 941 942': 12,
 '718': 13,
 '890': 14,
 '890 941': 15,
 '890 941 3847': 16,
 '941': 17,
 '941 3847': 18,
 '941 3847 941': 19,
 '941 942': 20,
 '941 942 3846': 21,
 '942': 22,
 '942 3846': 23,
 '942 3846 3847': 24}
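Note that `0` does not appear in the vocabulary above. The default `token_pattern` of `TfidfVectorizer` only keeps tokens of two or more word characters, so the padding value `0` and any single-digit site IDs are dropped. The sketch below compares the default with a custom pattern on made-up sessions. Whether keeping one-character tokens helps on this task is an open question, and the pattern `\S+` is just one possible choice.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up sessions: site ID 7 is a single-digit ID, 0 is padding.
toy_sessions = ["7 718 0 0", "890 941 7 941"]

# Default token_pattern keeps only tokens with 2+ word characters.
default_vec = TfidfVectorizer(ngram_range=(1, 1))
default_vec.fit(toy_sessions)

# Assumption: treat every run of non-space characters as a token.
keep_all_vec = TfidfVectorizer(ngram_range=(1, 1), token_pattern=r"\S+")
keep_all_vec.fit(toy_sessions)

default_vocab = sorted(default_vec.vocabulary_)
keep_all_vocab = sorted(keep_all_vec.vocabulary_)
print(default_vocab)
print(keep_all_vocab)
```

With the custom pattern the padding token `0` also enters the vocabulary. If that is unwanted, one option is to replace missing sites with an empty string instead of `0` before joining.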

In [50]:
X_tfidf = tfidf.fit_transform(Xtrain_st)

In [117]:
X_tfidf

<253561x100000 sparse matrix of type '<class 'numpy.float64'>'
	with 3636401 stored elements in Compressed Sparse Row format>

Add features based on the session start time: hour, whether it's morning, day or night and so on.

In [112]:
# Your code here
X_tfidf

<253561x100000 sparse matrix of type '<class 'numpy.float64'>'
	with 3636401 stored elements in Compressed Sparse Row format>

In [115]:
# Note: full_df (train and test concatenated) is built in the submission section below
StartHour = pd.to_datetime(full_df['time1']).dt.hour

In [116]:
StartHourOHE = pd.get_dummies(StartHour)

In [120]:
SitesHoursOHE = hstack([Xfull_ifidf, StartHourOHE])
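Beyond the one-hot encoded hour, the task text also suggests part-of-day indicators. Here is a minimal sketch on made-up timestamps. The hour boundaries for morning, day, evening and night are assumptions chosen for illustration, not values taken from the assignment.

```python
import pandas as pd

# Made-up session start times in the same format as the time1 column.
toy_times = pd.Series(
    ["2014-02-20 10:02:45", "2014-02-22 23:12:40", "2013-12-16 16:40:17"]
)
start_hour = pd.to_datetime(toy_times).dt.hour

# Hypothetical boundaries; tune them against validation scores.
time_features = pd.DataFrame({
    "morning": ((start_hour >= 7) & (start_hour <= 11)).astype(int),
    "day": ((start_hour >= 12) & (start_hour <= 18)).astype(int),
    "evening": ((start_hour >= 19) & (start_hour <= 23)).astype(int),
    "night": (start_hour <= 6).astype(int),
})
print(time_features)
```

Each session falls into exactly one of the four bins, so the columns behave like a coarse one-hot encoding of the hour.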

Scale these features and combine them with the Tf-Idf features based on sites (you'll need `scipy.sparse.hstack`).

In [49]:
# Your code here
enc = OneHotEncoder()
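One possible way to carry out the scaling step is sketched below on toy data. It treats the start hour as a numeric column, fits `StandardScaler` on the training rows only, and then stacks the scaled column next to a sparse matrix. All names (`X_sparse_toy`, `start_hour_toy`, `n_train`) are hypothetical stand-ins for the notebook's Tf-Idf matrix, hour feature and train/test boundary.

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack
from sklearn.preprocessing import StandardScaler

# Toy stand-ins: a small sparse "Tf-Idf" matrix and one numeric time feature.
X_sparse_toy = csr_matrix(
    np.array([[0.0, 1.0], [0.5, 0.0], [0.0, 0.3], [0.2, 0.0]])
)
start_hour_toy = np.array([[9.0], [14.0], [21.0], [9.0]])
n_train = 3  # hypothetical boundary between training and test rows

# Fit the scaler on training rows only, then transform all rows.
scaler = StandardScaler().fit(start_hour_toy[:n_train])
hour_scaled = scaler.transform(start_hour_toy)

# Stack column-wise and convert to CSR so rows can be sliced later.
X_combined = hstack([X_sparse_toy, csr_matrix(hour_scaled)]).tocsr()
print(X_combined.shape)
```

Fitting the scaler on the training part alone keeps test-set statistics out of the features. The conversion to CSR matters because `hstack` returns COO format by default, which does not support row slicing.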

Perform cross-validation with logistic regression.

In [99]:
from sklearn.model_selection import TimeSeriesSplit
tscv = TimeSeriesSplit(n_splits=4)
def calcScore(X, y, C):
    # y is a Series indexed by session_id, so convert it to an array
    # to make the fold indices positional
    y = np.asarray(y)
    scores = []
    for train_index, test_index in tscv.split(X):
        X_train, X_test = X[train_index, :], X[test_index, :]
        y_train1, y_test = y[train_index], y[test_index]

        lr = LogisticRegression(C=C, random_state=17, n_jobs=-1).fit(X_train, y_train1)
        # Predicted probabilities for the validation fold
        y_pred = lr.predict_proba(X_test)[:, 1]
        # ROC AUC on the validation fold
        score = roc_auc_score(y_test, y_pred)
        print(score)
        scores.append(score)
    return scores
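To see what `TimeSeriesSplit` does with the row order, the short sketch below lists the folds for ten toy rows. Every training fold ends before its validation fold starts, which only mimics "train on the past, validate on the future" if the rows are sorted by time (an assumption about the data that the notebook does not verify).

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Ten toy rows standing in for sessions ordered by start time.
toy_X = np.arange(10).reshape(-1, 1)

splits = [
    (train_idx.tolist(), test_idx.tolist())
    for train_idx, test_idx in TimeSeriesSplit(n_splits=4).split(toy_X)
]
for train_idx, test_idx in splits:
    print("train:", train_idx, "validate:", test_idx)
```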

In [None]:
calcScore(Xfull_ifidf[:idx_split], y, 1)

In [95]:
from sklearn.model_selection import cross_val_score

In [98]:
cross_val_score(lr, Xfull_ifidf[:idx_split], y, scoring='roc_auc').mean()

0.94340095387900647

In [123]:
cross_val_score(lr, SitesHoursOHE.tocsr()[:idx_split], y, scoring='roc_auc').mean()

0.97319879771730233

In [54]:
# Your code here
# First, try validating in the baseline style
def get_auc_lr_valid(X, y, C=1.0, seed=17, ratio=0.9):
    # Split the data into the training and validation sets
    idx = int(round(X.shape[0] * ratio))
    # Classifier training
    lr = LogisticRegression(C=C, random_state=seed, n_jobs=-1).fit(X[:idx, :], y[:idx])
    # Prediction for validation set
    y_pred = lr.predict_proba(X[idx:, :])[:, 1]
    # Calculate the quality
    score = roc_auc_score(y[idx:], y_pred)
    
    return score

In [56]:
get_auc_lr_valid(X_tfidf, y)

0.95858359628341061
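A holdout helper like the one above can also be used to compare regularization strengths. The sketch below does this on synthetic dense data with a hypothetical re-implementation named `holdout_auc`. The grid `np.logspace(-2, 2, 5)` is an arbitrary choice for illustration, not a recommendation for this competition.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

# Synthetic data: the first column carries the signal, the rest is noise.
rng = np.random.RandomState(17)
X_toy = rng.normal(size=(400, 5))
y_toy = (X_toy[:, 0] + 0.5 * rng.normal(size=400) > 0).astype(int)

def holdout_auc(X, y, C=1.0, seed=17, ratio=0.9):
    """Train on the first `ratio` share of rows, score ROC AUC on the rest."""
    idx = int(round(X.shape[0] * ratio))
    lr = LogisticRegression(C=C, random_state=seed).fit(X[:idx], y[:idx])
    y_pred = lr.predict_proba(X[idx:])[:, 1]
    return roc_auc_score(y[idx:], y_pred)

# Evaluate each candidate C on the same holdout part.
scores = {C: holdout_auc(X_toy, y_toy, C=C) for C in np.logspace(-2, 2, 5)}
best_C = max(scores, key=scores.get)
print(scores)
print("best C:", best_C)
```

Since a single holdout part is small relative to the training set, the ranking of `C` values can be noisy. Checking the chosen value with a time-ordered cross-validation is a reasonable follow-up.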

Make prediction for the test set and form a submission file.

In [57]:
full_df = pd.concat([train_df.drop('target', axis=1), test_df])

In [66]:
train_df.shape, test_df.shape

((253561, 21), (82797, 20))

In [73]:
len(y)

253561

In [68]:
idx_split = train_df.shape[0]

In [60]:
Xfull_df = full_df[sites]
Xfull_df = Xfull_df.fillna(0).astype(int)

In [64]:
Xfull = Xfull_df.to_string(index=False).split('\n')[2:]

In [72]:
len(Xfull[:idx_split])

253561

In [119]:
Xfull_ifidf = tfidf.fit_transform(Xfull)

In [125]:
lr = LogisticRegression(C=1.0, random_state=17, n_jobs=-1).fit(SitesHoursOHE.tocsr()[:idx_split, :], y)

In [126]:
# Your code here
test_pred = lr.predict_proba(SitesHoursOHE.tocsr()[idx_split:])[:, 1]

In [84]:
def write_to_submission_file(predicted_labels, out_file,
                             target='target', index_label="session_id"):
    predicted_df = pd.DataFrame(predicted_labels,
                                index = np.arange(1, predicted_labels.shape[0] + 1),
                                columns=[target])
    predicted_df.to_csv(out_file, index_label=index_label)


In [127]:
write_to_submission_file(test_pred, "assignment6_alice_submission4_time1.csv")