<center>
<img src="../../img/ods_stickers.jpg">
## Open Machine Learning Course
<center>
Author: Yury Kashnitsky, Data Scientist at Mail.Ru Group

This material is subject to the terms and conditions of the license [Creative Commons CC BY-NC-SA 4.0](https://creativecommons.org/licenses/by-nc-sa/4.0/). Free use is permitted for any non-comercial purpose with an obligatory indication of the names of the authors and of the source.

## <center>Assignment #6. Part 1
### <center> Beating benchmarks in "Catch Me If You Can: Intruder Detection through Webpage Session Tracking"
    
[Competition](https://www.kaggle.com/c/catch-me-if-you-can-intruder-detection-through-webpage-session-tracking2). The task is to beat "Assignment 6 baseline".

In [1]:
import warnings
warnings.filterwarnings('ignore')
%matplotlib inline
from matplotlib import pyplot as plt
import seaborn as sns
import os
import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix, hstack
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import roc_auc_score
from sklearn.linear_model import LogisticRegression, LogisticRegressionCV
from sklearn.feature_extraction.text import TfidfVectorizer

Reading original data

In [12]:
PATH_TO_DATA = ('../../data')
train_df = pd.read_csv(os.path.join(PATH_TO_DATA, 'train_sessions.csv'), index_col='session_id')
test_df = pd.read_csv(os.path.join(PATH_TO_DATA, 'test_sessions.csv'), index_col='session_id')

In [225]:
Xst_df = test_df.fillna(0).astype(str)
Xst = Xst_df.reset_index()[sites].to_string(index=False).split('\n')

In [226]:
Xstf = tfidf.fit_transform(Xst)

In [227]:
lr.predict_proba(Xstf[1:])

array([[ 0.99583719,  0.00416281],
       [ 0.99577614,  0.00422386],
       [ 0.99582558,  0.00417442],
       ..., 
       [ 0.99601264,  0.00398736],
       [ 0.9955968 ,  0.0044032 ],
       [ 0.99491351,  0.00508649]])

Separate target feature 

In [13]:
y = train_df['target']

In [15]:
train_df.target.value_counts()

0    251264
1      2297
Name: target, dtype: int64

In [22]:
import pickle
# Load websites dictionary
with open(r"../../data/site_dic.pkl", "rb") as input_file:
    site_dict = pickle.load(input_file)

# Create dataframe for the dictionary
sites_dict = pd.DataFrame(list(site_dict.keys()), index=list(site_dict.values()), columns=['site'])
print(u'Websites total:', sites_dict.shape[0])
sites_dict.head()

Websites total: 48371


Unnamed: 0,site
25075,www.abmecatronique.com
13997,groups.live.com
42436,majeureliguefootball.wordpress.com
30911,cdt46.media.tourinsoft.eu
8104,www.hdwallpapers.eu


In [29]:
sites_dict.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 48371 entries, 25075 to 43410
Data columns (total 1 columns):
site    48371 non-null object
dtypes: object(1)
memory usage: 2.0+ MB


In [40]:
sites_dict.loc[_]

Unnamed: 0,site
21.0,www.google.fr
23.0,www.google.com
782.0,annotathon.org
22.0,apis.google.com
29.0,www.facebook.com
167.0,www.bing.com
780.0,blast.ncbi.nlm.nih.gov
52.0,clients1.google.com
778.0,www.ncbi.nlm.nih.gov
812.0,mail.google.com


In [47]:
sites_dict[sites_dict.site.str.contains('porn')]

Unnamed: 0,site
25559,thumb.pornravage.com
46821,www.sitedeporno.com
25565,pdv.pornravage.com
28373,www.youporn.com
12724,fr.youporn.com
25566,dhtml.pornravage.com
25538,images.v2.mypornmotion.com
25537,www.mypornmotion.com
25558,log.mypornmotion.com
36045,www.pornovoisines.com


Build Tf-Idf features based on sites. You can use `ngram_range`=(1, 3) and `max_features`=100000 or more

In [117]:
# You code here
tfidf = TfidfVectorizer(ngram_range=(1, 3), max_features=100000)
X_df = train_df.drop(['target'], axis=1)

In [123]:
X_df = X_df.fillna(0).astype(str)

In [182]:
X_df.shape

(253561, 20)

In [213]:
sites = ['site%s' % i for i in range(1, 11)]

In [None]:
get_X(train_df)

In [215]:
%%time
X = X_df.reset_index()[sites].to_string(index=False).split('\n')

Wall time: 50.8 s


In [216]:
%%time
X = tfidf.fit_transform(X)
y = y

Wall time: 15.4 s


In [217]:
N = 10
# индексы топ 10 столбцов с максимальной суммой элементов (в столбцах)
idx = np.ravel(X.sum(axis=0).argsort(axis=1))[::-1][:N]
top_10_words = np.array(tfidf.get_feature_names())[idx].tolist()

In [218]:
top_10_words

['21', '23', '782', '167', '22', '782 782', '780', '29', '812', '778']

In [219]:
tfidf.vocabulary_

{'718': 85566,
 '890': 96271,
 '941': 97671,
 '3847': 63304,
 '942': 97886,
 '3846': 63204,
 '1516': 14280,
 '1518': 14502,
 '890 941': 96377,
 '941 3847': 97729,
 '3847 941': 63386,
 '941 942': 97854,
 '942 3846': 97927,
 '3846 3847': 63226,
 '3847 3846': 63315,
 '1516 1518': 14301,
 '890 941 3847': 96379,
 '941 3847 941': 97737,
 '3847 941 942': 63397,
 '941 942 3846': 97861,
 '942 3846 3847': 97929,
 '3846 3847 3846': 63230,
 '14769': 13329,
 '39': 64183,
 '14768': 13303,
 '37': 60598,
 '14768 14769': 13310,
 '14769 37': 13339,
 '37 39': 61270,
 '14768 14768': 13304,
 '14768 14768 14768': 13305,
 '782': 89859,
 '782 782': 90063,
 '782 782 782': 90111,
 '22': 30146,
 '177': 19914,
 '175': 19485,
 '178': 20223,
 '22 177': 30347,
 '177 175': 19921,
 '175 178': 19558,
 '178 177': 20261,
 '177 178': 19958,
 '178 175': 20233,
 '175 177': 19532,
 '177 177': 19950,
 '22 177 175': 30348,
 '177 175 178': 19926,
 '175 178 177': 19563,
 '178 177 178': 20266,
 '177 178 175': 19960,
 '178 175 177

In [179]:
X.shape

(253562, 100000)

In [185]:
X[0:10].toarray()

array([[ 0.,  0.,  0., ...,  0.,  0.,  0.],
       [ 0.,  0.,  0., ...,  0.,  0.,  0.],
       [ 0.,  0.,  0., ...,  0.,  0.,  0.],
       ..., 
       [ 0.,  0.,  0., ...,  0.,  0.,  0.],
       [ 0.,  0.,  0., ...,  0.,  0.,  0.],
       [ 0.,  0.,  0., ...,  0.,  0.,  0.]])

In [180]:
y.shape

(253561,)

In [221]:
%%time
lr = LogisticRegression(C=1.0, random_state=17, n_jobs=-1).fit(X[1:], y)

Wall time: 5.36 s


In [222]:
y_pred = lr.predict_proba(X[1:])

In [223]:
y_pred

array([[  9.96692545e-01,   3.30745457e-03],
       [  9.52155081e-01,   4.78449194e-02],
       [  9.97780522e-01,   2.21947811e-03],
       ..., 
       [  9.95872639e-01,   4.12736073e-03],
       [  9.99586822e-01,   4.13178269e-04],
       [  9.93219896e-01,   6.78010368e-03]])

In [224]:
score = roc_auc_score(y, y_pred[:, 1])
score

0.97826921780907161

In [195]:
from sklearn.model_selection import cross_val_score

In [198]:
cross_val_score(LogisticRegression(), X[1:], y, scoring='roc_auc')

array([ 0.99326522,  0.99353094,  0.99354434])

Add features based on the session start time: hour, whether it's morning, day or night and so on.

In [None]:
# You code here

Scale this features and combine then with Tf-Idf based on sites (you'll need `scipy.sparse.hstack`)

In [None]:
# You code here

Perform cross-validation with logistic regression.

In [None]:
# You code here

Make prediction for the test set and form a submission file.

In [None]:
test_pred = # You code here

In [208]:
def write_to_submission_file(predicted_labels, out_file,
                             target='target', index_label="session_id"):
    predicted_df = pd.DataFrame(predicted_labels,
                                index = np.arange(1, predicted_labels.shape[0] + 1),
                                columns=[target])
    predicted_df.to_csv(out_file, index_label=index_label)


In [228]:
write_to_submission_file(lr.predict_proba(Xstf[1:])[:,1], "assignment6_alice_submission1.csv")