<center>
<img src="../../img/ods_stickers.jpg">
## Open Machine Learning Course
<center>
Author: Yury Kashnitsky, Data Scientist at Mail.Ru Group

This material is subject to the terms and conditions of the license [Creative Commons CC BY-NC-SA 4.0](https://creativecommons.org/licenses/by-nc-sa/4.0/). Free use is permitted for any non-commercial purpose with an obligatory indication of the names of the authors and of the source.

## <center>Assignment #6. Part 1
### <center> Beating benchmarks in "Catch Me If You Can: Intruder Detection through Webpage Session Tracking"
    
[Competition](https://www.kaggle.com/c/catch-me-if-you-can-intruder-detection-through-webpage-session-tracking2). The task is to beat "Assignment 6 baseline".

In [1]:
import warnings
warnings.filterwarnings('ignore')
%matplotlib inline
from matplotlib import pyplot as plt
import seaborn as sns
import os
import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix, hstack
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import roc_auc_score
from sklearn.linear_model import LogisticRegression, LogisticRegressionCV
from sklearn.feature_extraction.text import TfidfVectorizer

Reading original data

In [12]:
PATH_TO_DATA = '../../data'
train_df = pd.read_csv(os.path.join(PATH_TO_DATA, 'train_sessions.csv'), index_col='session_id')
test_df = pd.read_csv(os.path.join(PATH_TO_DATA, 'test_sessions.csv'), index_col='session_id')

In [200]:
Xst_df = test_df.fillna(0).astype(str)
Xst = Xst_df.reset_index()[Xst_df.columns].to_string(index=False).split('\n')

In [205]:
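# Reuse the vectorizer fitted on the training data: refitting on the test set
# would build a different vocabulary than the one `lr` was trained on.
Xstf = tfidf.transform(Xst)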

In [206]:
lr.predict_proba(Xstf[1:])

array([[ 0.99586368,  0.00413632],
       [ 0.99403525,  0.00596475],
       [ 0.99598239,  0.00401761],
       ..., 
       [ 0.99638491,  0.00361509],
       [ 0.99590688,  0.00409312],
       [ 0.9956035 ,  0.0043965 ]])

Separate the target feature.

In [13]:
y = train_df['target']

In [15]:
train_df.target.value_counts()

0    251264
1      2297
Name: target, dtype: int64

In [22]:
import pickle
# Load websites dictionary
with open(r"../../data/site_dic.pkl", "rb") as input_file:
    site_dict = pickle.load(input_file)

# Create dataframe for the dictionary
sites_dict = pd.DataFrame(list(site_dict.keys()), index=list(site_dict.values()), columns=['site'])
print(u'Websites total:', sites_dict.shape[0])
sites_dict.head()

Websites total: 48371


|       | site |
|-------|------|
| 25075 | www.abmecatronique.com |
| 13997 | groups.live.com |
| 42436 | majeureliguefootball.wordpress.com |
| 30911 | cdt46.media.tourinsoft.eu |
| 8104 | www.hdwallpapers.eu |


In [29]:
sites_dict.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 48371 entries, 25075 to 43410
Data columns (total 1 columns):
site    48371 non-null object
dtypes: object(1)
memory usage: 2.0+ MB


In [40]:
sites_dict.loc[_]

|       | site |
|-------|------|
| 21.0 | www.google.fr |
| 23.0 | www.google.com |
| 782.0 | annotathon.org |
| 22.0 | apis.google.com |
| 29.0 | www.facebook.com |
| 167.0 | www.bing.com |
| 780.0 | blast.ncbi.nlm.nih.gov |
| 52.0 | clients1.google.com |
| 778.0 | www.ncbi.nlm.nih.gov |
| 812.0 | mail.google.com |


In [47]:
sites_dict[sites_dict.site.str.contains('porn')]

|       | site |
|-------|------|
| 25559 | thumb.pornravage.com |
| 46821 | www.sitedeporno.com |
| 25565 | pdv.pornravage.com |
| 28373 | www.youporn.com |
| 12724 | fr.youporn.com |
| 25566 | dhtml.pornravage.com |
| 25538 | images.v2.mypornmotion.com |
| 25537 | www.mypornmotion.com |
| 25558 | log.mypornmotion.com |
| 36045 | www.pornovoisines.com |


Build Tf-Idf features based on sites. You can use `ngram_range=(1, 3)` and `max_features=100000` or more.

In [117]:
# Your code here
tfidf = TfidfVectorizer(ngram_range=(1, 3), max_features=100000)
X_df = train_df.drop(['target'], axis=1)

In [123]:
X_df = X_df.fillna(0).astype(str)

In [182]:
X_df.shape

(253561, 20)

In [158]:
%%time
X = X_df.reset_index()[X_df.columns].to_string(index=False).split('\n')

Wall time: 1min 20s


In [160]:
%%time
X = tfidf.fit_transform(X)

Wall time: 44.5 s


In [168]:
N = 10
# indices of the top 10 columns with the largest column sums
idx = np.ravel(X.sum(axis=0).argsort(axis=1))[::-1][:N]
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
top_10_words = np.array(tfidf.get_feature_names_out())[idx].tolist()

In [169]:
top_10_words

['2014', '2013', '11', '03', '02', '12', '2014 03', '2014 02', '04', '10']
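
The top tokens above look like date fragments (`'2014'`, `'2014 03'`), which suggests the timestamp columns were stringified together with the site ids. A minimal sketch of building the text from site columns only follows. It assumes the columns are named `site1`, `site2`, ... and `time1`, `time2`, ... (not shown in this notebook), and the toy values are made up.

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy frame; the `site*` / `time*` column names are assumed, values are made up.
toy = pd.DataFrame({
    "site1": [718, 890],
    "time1": ["2014-02-20 10:02:45", "2014-02-22 11:19:50"],
    "site2": [np.nan, 941],
    "time2": [None, "2014-02-22 11:19:50"],
})
site_cols = [c for c in toy.columns if c.startswith("site")]

# Missing sites become 0, then each row is joined into one space-separated string.
sites_text = (
    toy[site_cols].fillna(0).astype(int).astype(str).apply(" ".join, axis=1)
)

# token_pattern is widened so that single-character ids such as "0" are kept.
vec = TfidfVectorizer(ngram_range=(1, 3), token_pattern=r"\S+")
X_sites = vec.fit_transform(sites_text)
print(sorted(vec.vocabulary_))
```

With this variant the vocabulary contains only site ids and their n-grams, so time information can be added separately as explicit features.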

In [177]:
tfidf.vocabulary_

{'718': 97128,
 '2014': 57169,
 '02': 1336,
 '20': 55786,
 '10': 15108,
 '45': 85194,
 '718 2014': 97131,
 '2014 02': 57194,
 '02 20': 2062,
 '20 10': 55938,
 '10 02': 15231,
 '02 45': 2374,
 '718 2014 02': 97133,
 '2014 02 20': 57205,
 '02 20 10': 2066,
 '20 10 02': 55941,
 '10 02 45': 15277,
 '890': 99063,
 '22': 59156,
 '11': 20175,
 '19': 53916,
 '50': 88607,
 '941': 99385,
 '3847': 80534,
 '51': 89314,
 '942': 99400,
 '3846': 80522,
 '52': 90005,
 '1516': 43243,
 '15': 39085,
 '1518': 43282,
 '16': 43733,
 '890 2014': 99071,
 '02 22': 2112,
 '22 11': 59389,
 '11 19': 21416,
 '19 50': 54980,
 '50 941': 89150,
 '941 2014': 99395,
 '50 3847': 88916,
 '3847 2014': 80542,
 '19 51': 54989,
 '51 941': 89856,
 '51 942': 89859,
 '942 2014': 99410,
 '51 3846': 89627,
 '3846 2014': 80529,
 '51 3847': 89628,
 '19 52': 54996,
 '52 3846': 90315,
 '1516 2014': 43250,
 '11 20': 21490,
 '20 15': 56288,
 '1518 2014': 43289,
 '20 16': 56365,
 '890 2014 02': 99073,
 '2014 02 22': 57207,
 '02 22 11': 

In [179]:
X.shape

(253562, 100000)

In [185]:
X[0:10].toarray()

array([[ 0.,  0.,  0., ...,  0.,  0.,  0.],
       [ 0.,  0.,  0., ...,  0.,  0.,  0.],
       [ 0.,  0.,  0., ...,  0.,  0.,  0.],
       ..., 
       [ 0.,  0.,  0., ...,  0.,  0.,  0.],
       [ 0.,  0.,  0., ...,  0.,  0.,  0.],
       [ 0.,  0.,  0., ...,  0.,  0.,  0.]])

In [180]:
y.shape

(253561,)
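
The one-row mismatch between `X.shape` and `y.shape` is consistent with `DataFrame.to_string` emitting a header line by default, which would explain why the following cells slice `X[1:]`. A small sketch on a toy frame (column names and values are hypothetical) shows the effect and how `header=False` avoids it:

```python
import pandas as pd

# Toy stand-in for the session table; names and values are hypothetical.
toy = pd.DataFrame({"site1": ["718", "890"], "site2": ["941", "3847"]})

# By default to_string() puts the column names on the first line.
with_header = toy.to_string(index=False).split("\n")

# header=False drops that line, leaving exactly one string per row.
without_header = toy.to_string(index=False, header=False).split("\n")

print(len(with_header), len(without_header))
```

Passing `header=False` keeps the text rows aligned with the target, so no `[1:]` slicing would be needed afterwards.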

In [186]:
# X has one extra leading row (the header line produced by to_string), so skip it.
lr = LogisticRegression(C=1.0, random_state=17, n_jobs=-1).fit(X[1:], y)

In [188]:
y_pred = lr.predict_proba(X[1:])

In [189]:
y_pred

array([[  9.99076443e-01,   9.23557276e-04],
       [  9.98348705e-01,   1.65129465e-03],
       [  9.51719582e-01,   4.82804183e-02],
       ..., 
       [  9.97669493e-01,   2.33050713e-03],
       [  9.99570319e-01,   4.29680964e-04],
       [  9.99080547e-01,   9.19452657e-04]])

In [194]:
# ROC AUC measured on the training data itself, so it is an optimistic estimate.
score = roc_auc_score(y, y_pred[:, 1])
score

0.99733487149399291

In [195]:
from sklearn.model_selection import cross_val_score

In [198]:
cross_val_score(LogisticRegression(), X[1:], y, scoring='roc_auc')

array([ 0.99326522,  0.99353094,  0.99354434])

Add features based on the session start time: hour, whether it's morning, day or night and so on.

In [None]:
# Your code here
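
One possible way to derive start-time features is sketched below on a toy frame. It assumes the session start is stored in a column named `time1` (an assumption, since column names are not shown here), and the hour boundaries for morning, day and evening are arbitrary illustrative choices.

```python
import pandas as pd

# Toy stand-in; `time1` is assumed to hold the session start timestamp.
toy = pd.DataFrame(
    {"time1": ["2014-02-20 10:02:45", "2014-02-22 11:19:50", "2013-11-15 20:15:16"]}
)
start = pd.to_datetime(toy["time1"])

time_features = pd.DataFrame(index=toy.index)
time_features["start_hour"] = start.dt.hour
# The boundaries below are illustrative, not taken from the assignment.
time_features["morning"] = (start.dt.hour < 12).astype(int)
time_features["day"] = start.dt.hour.between(12, 17).astype(int)
time_features["evening"] = (start.dt.hour >= 18).astype(int)
time_features["day_of_week"] = start.dt.dayofweek  # Monday is 0

print(time_features)
```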

Scale these features and combine them with the Tf-Idf features based on sites (you'll need `scipy.sparse.hstack`).

In [None]:
# Your code here
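
A minimal sketch of scaling dense features and stacking them next to a sparse matrix. The matrices here are tiny hypothetical stand-ins for the real Tf-Idf matrix and time features.

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-ins: a tiny sparse "Tf-Idf" matrix and two dense time features.
X_sites = csr_matrix(np.array([[0.0, 0.7, 0.7], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]))
time_feats = np.array([[10, 1], [11, 1], [20, 0]], dtype=float)

# Fit the scaler on training rows only; test rows should go through transform().
scaler = StandardScaler()
time_scaled = scaler.fit_transform(time_feats)

# hstack keeps the result sparse; tocsr() makes row slicing efficient.
X_full = hstack([X_sites, csr_matrix(time_scaled)]).tocsr()
print(X_full.shape)
```

Scaling matters here because Tf-Idf values lie in [0, 1], while a raw hour feature ranges up to 23 and would otherwise dominate the regularized model.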

Perform cross-validation with logistic regression.

In [None]:
# Your code here
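
As a sketch of this step, the snippet below runs stratified cross-validation with ROC AUC scoring. It uses synthetic imbalanced data from `make_classification` in place of the real feature matrix, and the fold count and `max_iter` are illustrative choices.

```python
from scipy.sparse import csr_matrix
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic imbalanced data standing in for the real feature matrix.
X_dense, y_toy = make_classification(
    n_samples=500, n_features=20, weights=[0.9, 0.1], random_state=17
)
X_toy = csr_matrix(X_dense)

# Stratified folds keep the share of positives roughly equal across folds.
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=17)
logit = LogisticRegression(C=1.0, random_state=17, max_iter=1000)
cv_scores = cross_val_score(logit, X_toy, y_toy, cv=skf, scoring="roc_auc")
print(cv_scores.mean())
```

Stratification is a reasonable default with a target as rare as the one in this dataset, where fewer than 1% of sessions are positive.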

Make prediction for the test set and form a submission file.

In [None]:
test_pred = # Your code here
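
A hedged end-to-end sketch of this step on toy data: fit the vectorizer on training text, apply it to test text with `transform`, and write positive-class probabilities in the `session_id,target` layout. The session strings and labels are made up, and the output goes to an in-memory buffer instead of a file.

```python
import io

import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Toy "sessions" written as space-separated site ids (hypothetical values).
train_texts = ["21 23 22", "21 29 167", "780 778 782", "782 780 778"]
train_y = np.array([0, 0, 1, 1])
test_texts = ["23 21 22", "778 780 21"]

vec = TfidfVectorizer(ngram_range=(1, 3), token_pattern=r"\S+")
X_tr = vec.fit_transform(train_texts)  # learn the vocabulary on train only
X_te = vec.transform(test_texts)       # reuse it for test: transform, not fit_transform

model = LogisticRegression(C=1.0, random_state=17).fit(X_tr, train_y)
test_pred = model.predict_proba(X_te)[:, 1]  # probability of the positive class

# Same layout as the submission file: session_id starts at 1.
submission = pd.DataFrame(
    {"target": test_pred}, index=np.arange(1, len(test_pred) + 1)
)
buffer = io.StringIO()
submission.to_csv(buffer, index_label="session_id")
print(buffer.getvalue())
```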

In [208]:
def write_to_submission_file(predicted_labels, out_file,
                             target='target', index_label="session_id"):
    predicted_df = pd.DataFrame(predicted_labels,
                                index = np.arange(1, predicted_labels.shape[0] + 1),
                                columns=[target])
    predicted_df.to_csv(out_file, index_label=index_label)


In [211]:
write_to_submission_file(lr.predict_proba(Xstf[1:])[:,1], "assignment6_alice_submission.csv")