# <center> Kaggle Competition "Catch Me If You Can"

<img src = 'https://storage.googleapis.com/kaggle-competitions/kaggle/7173/logos/front_page.png' >

Link to the [Kaggle competition](https://inclass.kaggle.com/c/catch-me-if-you-can-intruder-detection-through-webpage-session-tracking2).

### <center><b>The task is to identify a user on the Internet by tracking the sequence of web pages he or she visits. The algorithm to be built will take a webpage session (a sequence of web pages visited consecutively by the same person) and predict whether it belongs to Alice or to somebody else.</b>

#### <center><b>The data comes from Blaise Pascal University proxy servers. See the paper "A Tool for Classification of Sequential Data" by Giacomo Kahn, Yannick Loiseau and Olivier Raynaud.</b>

  

In [1]:
# Import all required libraries

from __future__ import division, print_function
# Suppress all warnings
import warnings
warnings.filterwarnings('ignore')

import os
import pickle
import numpy as np
import pandas as pd

import sklearn
from sklearn import linear_model
from scipy.sparse import csr_matrix, hstack
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split, cross_val_score, StratifiedKFold, GridSearchCV
from sklearn.linear_model import SGDClassifier, LogisticRegression

from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier

from sklearn import metrics
from sklearn.metrics import roc_auc_score


**Load the [competition](https://inclass.kaggle.com/c/catch-me-if-you-can-intruder-detection-through-webpage-session-tracking2) data into the DataFrames train_df and test_df.**

In [2]:
#!kaggle competitions download -c catch-me-if-you-can-intruder-detection-through-webpage-session-tracking2

In [3]:
PATH_TO_DATA = '/../project_alice'

In [4]:
train_df = pd.read_csv(os.path.join(PATH_TO_DATA, 'train_sessions.csv'),
                       index_col='session_id')
test_df = pd.read_csv(os.path.join(PATH_TO_DATA, 'test_sessions.csv'),
                      index_col='session_id')

In [5]:
train_df.head()

session_id,site1,time1,site2,time2,site3,time3,site4,time4,site5,time5,...,time6,site7,time7,site8,time8,site9,time9,site10,time10,target
1,718,2014-02-20 10:02:45,,,,,,,,,...,,,,,,,,,,0
2,890,2014-02-22 11:19:50,941.0,2014-02-22 11:19:50,3847.0,2014-02-22 11:19:51,941.0,2014-02-22 11:19:51,942.0,2014-02-22 11:19:51,...,2014-02-22 11:19:51,3847.0,2014-02-22 11:19:52,3846.0,2014-02-22 11:19:52,1516.0,2014-02-22 11:20:15,1518.0,2014-02-22 11:20:16,0
3,14769,2013-12-16 16:40:17,39.0,2013-12-16 16:40:18,14768.0,2013-12-16 16:40:19,14769.0,2013-12-16 16:40:19,37.0,2013-12-16 16:40:19,...,2013-12-16 16:40:19,14768.0,2013-12-16 16:40:20,14768.0,2013-12-16 16:40:21,14768.0,2013-12-16 16:40:22,14768.0,2013-12-16 16:40:24,0
4,782,2014-03-28 10:52:12,782.0,2014-03-28 10:52:42,782.0,2014-03-28 10:53:12,782.0,2014-03-28 10:53:42,782.0,2014-03-28 10:54:12,...,2014-03-28 10:54:42,782.0,2014-03-28 10:55:12,782.0,2014-03-28 10:55:42,782.0,2014-03-28 10:56:12,782.0,2014-03-28 10:56:42,0
5,22,2014-02-28 10:53:05,177.0,2014-02-28 10:55:22,175.0,2014-02-28 10:55:22,178.0,2014-02-28 10:55:23,177.0,2014-02-28 10:55:23,...,2014-02-28 10:55:59,175.0,2014-02-28 10:55:59,177.0,2014-02-28 10:55:59,177.0,2014-02-28 10:57:06,178.0,2014-02-28 10:57:11,0


**Concatenate the train and test data so that one sparse matrix with a shared set of site columns can be built from both.**

In [6]:
train_test_df = pd.concat([train_df, test_df])

The training data has the following features:
    - site1 – the index of the first visited site in the session
    - time1 – the time of the first visit in the session
    - ...
    - site10 – the index of the tenth visited site in the session
    - time10 – the time of the tenth visit in the session
    - target – 1 if the session belongs to Alice, 0 otherwise
    
User sessions are defined so that they last no longer than 30 minutes and contain no more than 10 sites. That is, a session ends either when the user has visited 10 sites in a row or when the session has lasted more than 30 minutes.

**Let's look at the statistics of the features.**

Missing values arise where sessions are short (fewer than 10 sites). For example, if a person visited *vk.com* at 20:01 on January 1, 2015, then *yandex.ru* at 20:29 and *google.com* at 20:33, the first session will consist of only two sites (site1 is the ID of *vk.com*, time1 is 2015-01-01 20:01:00, site2 is the ID of *yandex.ru*, time2 is 2015-01-01 20:29:00, and the other features are NaN). A new session starts with *google.com*, because more than 30 minutes have passed since the visit to *vk.com*.
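
The splitting rule can be sketched in a few lines. This is only an illustration, not the organisers' preprocessing code: the helper `split_sessions`, its parameter names, and the convention that the 30-minute window is measured from the first visit of the session are assumptions.

```python
from datetime import datetime, timedelta


def split_sessions(visits, max_sites=10, max_duration=timedelta(minutes=30)):
    """Split a time-ordered list of (site, timestamp) pairs into sessions.

    Assumption: a session is closed once it holds `max_sites` visits, or once
    the next visit falls more than `max_duration` after the session's first visit.
    """
    sessions, current = [], []
    for site, ts in visits:
        session_full = len(current) == max_sites
        too_late = bool(current) and ts - current[0][1] > max_duration
        if session_full or too_late:
            sessions.append(current)
            current = []
        current.append((site, ts))
    if current:
        sessions.append(current)
    return sessions


# The example from the text: three visits on January 1, 2015
visits = [
    ("vk.com", datetime(2015, 1, 1, 20, 1)),
    ("yandex.ru", datetime(2015, 1, 1, 20, 29)),
    ("google.com", datetime(2015, 1, 1, 20, 33)),
]
sessions = split_sessions(visits)
session_sites = [[site for site, _ in session] for session in sessions]
print(session_sites)
```

The first session holds *vk.com* and *yandex.ru*; *google.com* opens a second one, and the unused site/time slots of each session would be NaN in the table.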

In [7]:
train_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 253561 entries, 1 to 253561
Data columns (total 21 columns):
site1     253561 non-null int64
time1     253561 non-null object
site2     250098 non-null float64
time2     250098 non-null object
site3     246919 non-null float64
time3     246919 non-null object
site4     244321 non-null float64
time4     244321 non-null object
site5     241829 non-null float64
time5     241829 non-null object
site6     239495 non-null float64
time6     239495 non-null object
site7     237297 non-null float64
time7     237297 non-null object
site8     235224 non-null float64
time8     235224 non-null object
site9     233084 non-null float64
time9     233084 non-null object
site10    231052 non-null float64
time10    231052 non-null object
target    253561 non-null int64
dtypes: float64(9), int64(2), object(10)
memory usage: 42.6+ MB


In [8]:
test_df.head()

session_id,site1,time1,site2,time2,site3,time3,site4,time4,site5,time5,site6,time6,site7,time7,site8,time8,site9,time9,site10,time10
1,29,2014-10-04 11:19:53,35.0,2014-10-04 11:19:53,22.0,2014-10-04 11:19:54,321.0,2014-10-04 11:19:54,23.0,2014-10-04 11:19:54,2211.0,2014-10-04 11:19:54,6730.0,2014-10-04 11:19:54,21.0,2014-10-04 11:19:54,44582.0,2014-10-04 11:20:00,15336.0,2014-10-04 11:20:00
2,782,2014-07-03 11:00:28,782.0,2014-07-03 11:00:53,782.0,2014-07-03 11:00:58,782.0,2014-07-03 11:01:06,782.0,2014-07-03 11:01:09,782.0,2014-07-03 11:01:10,782.0,2014-07-03 11:01:23,782.0,2014-07-03 11:01:29,782.0,2014-07-03 11:01:30,782.0,2014-07-03 11:01:53
3,55,2014-12-05 15:55:12,55.0,2014-12-05 15:55:13,55.0,2014-12-05 15:55:14,55.0,2014-12-05 15:56:15,55.0,2014-12-05 15:56:16,55.0,2014-12-05 15:56:17,55.0,2014-12-05 15:56:18,55.0,2014-12-05 15:56:19,1445.0,2014-12-05 15:56:33,1445.0,2014-12-05 15:56:36
4,1023,2014-11-04 10:03:19,1022.0,2014-11-04 10:03:19,50.0,2014-11-04 10:03:20,222.0,2014-11-04 10:03:21,202.0,2014-11-04 10:03:21,3374.0,2014-11-04 10:03:22,50.0,2014-11-04 10:03:22,48.0,2014-11-04 10:03:22,48.0,2014-11-04 10:03:23,3374.0,2014-11-04 10:03:23
5,301,2014-05-16 15:05:31,301.0,2014-05-16 15:05:32,301.0,2014-05-16 15:05:33,66.0,2014-05-16 15:05:39,67.0,2014-05-16 15:05:40,69.0,2014-05-16 15:05:40,70.0,2014-05-16 15:05:40,68.0,2014-05-16 15:05:40,71.0,2014-05-16 15:05:40,167.0,2014-05-16 15:05:44


In [9]:
test_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 82797 entries, 1 to 82797
Data columns (total 20 columns):
site1     82797 non-null int64
time1     82797 non-null object
site2     81308 non-null float64
time2     81308 non-null object
site3     80075 non-null float64
time3     80075 non-null object
site4     79182 non-null float64
time4     79182 non-null object
site5     78341 non-null float64
time5     78341 non-null object
site6     77566 non-null float64
time6     77566 non-null object
site7     76840 non-null float64
time7     76840 non-null object
site8     76151 non-null float64
time8     76151 non-null object
site9     75484 non-null float64
time9     75484 non-null object
site10    74806 non-null float64
time10    74806 non-null object
dtypes: float64(9), int64(1), object(10)
memory usage: 13.3+ MB


**The training sample contains 2297 sessions of one user (Alice) and 251264 sessions of other users. The class imbalance is very strong, so the proportion of correct answers (accuracy) is not an informative metric.**

In [10]:
train_df['target'].value_counts(normalize=True)

0    0.990941
1    0.009059
Name: target, dtype: float64
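
To make the point about accuracy concrete, the sketch below scores a "model" that never predicts Alice, using the class counts quoted above. The rank-based `roc_auc` function is a hand-written stand-in so that the example has no dependencies; `roc_auc_score` from scikit-learn, imported at the top of the notebook, should give the same value.

```python
def accuracy(y_true, y_pred):
    """Share of labels predicted correctly."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def roc_auc(y_true, scores):
    """Rank-based ROC AUC (Mann-Whitney U), using average ranks for ties."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # average 1-based rank of the tie group
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    pos_rank_sum = sum(r for r, t in zip(ranks, y_true) if t == 1)
    return (pos_rank_sum - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)


# Class counts from the training sample: 2297 Alice sessions, 251264 others
y_true = [1] * 2297 + [0] * 251264
always_not_alice = [0] * len(y_true)    # hard predictions: never Alice
constant_score = [0.0] * len(y_true)    # the same score for every session

acc = accuracy(y_true, always_not_alice)
auc = roc_auc(y_true, constant_score)
print(round(acc, 6), auc)
```

The useless constant model reaches about 99.1% accuracy but only 0.5 ROC AUC, which is why the rest of the notebook reports ROC AUC.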

**For now, we will use only the indices of the visited sites for prediction. The indices are numbered from 1, so we replace the missing values with zeros.**

In [11]:
train_test_df_sites = train_test_df[['site%d' % i for i in range(1, 11)]].fillna(0).astype('int')

In [12]:
train_test_df_sites.head(10)

session_id,site1,site2,site3,site4,site5,site6,site7,site8,site9,site10
1,718,0,0,0,0,0,0,0,0,0
2,890,941,3847,941,942,3846,3847,3846,1516,1518
3,14769,39,14768,14769,37,39,14768,14768,14768,14768
4,782,782,782,782,782,782,782,782,782,782
5,22,177,175,178,177,178,175,177,177,178
6,570,21,570,21,21,0,0,0,0,0
7,803,23,5956,17513,37,21,803,17514,17514,17514
8,22,21,29,5041,14422,23,21,5041,14421,14421
9,668,940,942,941,941,942,940,23,21,22
10,3700,229,570,21,229,21,21,21,2336,2044


**Sparse matrices *X_train_sparse* and *X_test_sparse***

In [13]:
# Row at which the train part ends (the train set has 253561 sessions)
idx_split = train_df.shape[0]
# Flattened sequence of site indices, 10 per session
sites_flatten = train_test_df_sites.values.flatten()

train_test_sparse = csr_matrix(([1] * sites_flatten.shape[0],
                                sites_flatten,
                                range(0, sites_flatten.shape[0] + 10, 10)))[:, 1:]

y_train = train_df['target']
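
To see what the `csr_matrix((data, indices, indptr))` call produces, here is a toy version with three sessions of four sites each. The toy array and its variable names are made up for illustration. Each flattened site index is used directly as a column index, `indptr` marks where each session starts, repeated sites within a session add up to counts when the matrix is densified, and slicing off column 0 drops the zero padding.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Three toy sessions, padded with 0 where fewer than four sites were visited
sites = np.array([[1, 2, 1, 0],
                  [3, 0, 0, 0],
                  [2, 2, 2, 2]])
n_slots = sites.shape[1]

flat = sites.flatten()                                # column indices
data = np.ones(flat.shape[0], dtype=int)              # one "visit" per slot
indptr = np.arange(0, flat.shape[0] + n_slots, n_slots)  # row boundaries

# Column 0 collects the padding, so it is sliced away
counts = csr_matrix((data, flat, indptr))[:, 1:]
dense = counts.toarray()
print(dense)
```

Row 0 becomes `[2, 1, 0]`: site 1 was visited twice, site 2 once, site 3 never.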


In [14]:
X_train_sparse = train_test_sparse[:idx_split]
X_test_sparse = train_test_sparse[idx_split:]

In [15]:
X_test_sparse.shape

(82797, 48371)

In [16]:
print('The shape of X_train_sparse', X_train_sparse.shape, 'The shape of y_train', y_train.shape)
print('The shape of X_test_sparse', X_test_sparse.shape)

The shape of X_train_sparse (253561, 48371) The shape of y_train (253561,)
The shape of X_test_sparse (82797, 48371)


In [17]:
PATH_TO_DATA = '/../project_alice'

**Save the sparse matrices and the target as pickle files.**

In [18]:
with open(os.path.join(PATH_TO_DATA, 'X_train_sparse.pkl'), 'wb') as X_train_sparse_pkl:
    pickle.dump(X_train_sparse, X_train_sparse_pkl, protocol=2)
with open(os.path.join(PATH_TO_DATA, 'X_test_sparse.pkl'), 'wb') as X_test_sparse_pkl:
    pickle.dump(X_test_sparse, X_test_sparse_pkl, protocol=2)
with open(os.path.join(PATH_TO_DATA, 'train_target.pkl'), 'wb') as train_target_pkl:
    pickle.dump(y_train, train_target_pkl, protocol=2)

In [19]:
# Checking the model

logit = LogisticRegression(n_jobs = -1, random_state = 17) 
logit.fit(X_train_sparse, y_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='warn', n_jobs=-1,
          penalty='l2', random_state=17, solver='warn', tol=0.0001,
          verbose=0, warm_start=False)

In [20]:
logit.predict_proba(X_test_sparse[0, :])

array([[0.99779021, 0.00220979]])

### Logistic Regression without hyperparameter tuning
**Function that evaluates the model on a hold-out validation set (90/10 split)**

In [21]:
# C = 1.0
# C = 1e4
# C = 1e-2
def get_auc_lr_valid(X, y, C = 1.0, ratio = 0.9, seed = 17):
    train_len = int(ratio * X.shape[0])
    X_train = X[:train_len, :]
    X_valid = X[train_len:, :]
    y_train = y[:train_len]
    y_valid = y[train_len:]
    
    logit = LogisticRegression(n_jobs = -1, random_state = seed, C = C)
    
    logit.fit(X_train, y_train)
    
    valid_predict = logit.predict_proba(X_valid)[:, 1]
    
    return(roc_auc_score(y_valid, valid_predict))
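
With `ratio = 0.9` the split is positional: the first 90% of the rows are used for training and the last 10% for validation, with no shuffling. For the 253561 training sessions this gives the sizes below. `holdout_sizes` is a hypothetical helper, written only to show the arithmetic.

```python
def holdout_sizes(n_rows, ratio=0.9):
    """Return (train_len, valid_len) for a positional hold-out split."""
    train_len = int(ratio * n_rows)   # same rounding as in get_auc_lr_valid
    return train_len, n_rows - train_len


train_len, valid_len = holdout_sizes(253561)
print(train_len, valid_len)
```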
    

In [22]:
%%time
get_auc_lr_valid(X_train_sparse, y_train)


0.9642429170108648

In [23]:
logit = LogisticRegression(n_jobs=-1, random_state=17)
logit.fit(X_train_sparse, y_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='warn', n_jobs=-1,
          penalty='l2', random_state=17, solver='warn', tol=0.0001,
          verbose=0, warm_start=False)

In [24]:
test_predict = logit.predict_proba(X_test_sparse)[:, 1]

In [25]:
# Check the shape of the obtained predictions
test_predict.shape

(82797,)

In [26]:
# Create a file for kaggle
pd.Series(test_predict,
          index=range(1, test_predict.shape[0] + 1),
          name='target').to_csv('Benchmark_1_koef_reg.csv',
                                header=True,
                                index_label='session_id')
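
The same `pd.Series(...).to_csv(...)` call is repeated for every model in this notebook. A small helper, sketched here under the assumption that session IDs always run from 1 to the number of predictions, would keep the file format in one place. The name `write_submission` and the demo file are hypothetical.

```python
import os
import tempfile

import numpy as np
import pandas as pd


def write_submission(predictions, path, index_label="session_id", name="target"):
    """Write predictions as a two-column CSV: session_id (from 1), target."""
    ids = range(1, len(predictions) + 1)
    pd.Series(predictions, index=ids, name=name).to_csv(
        path, header=True, index_label=index_label)


# Demo with three made-up probabilities written to a temporary file
demo_path = os.path.join(tempfile.mkdtemp(), "demo_submission.csv")
write_submission(np.array([0.1, 0.2, 0.7]), demo_path)
written = pd.read_csv(demo_path)
print(written)
```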

In [27]:
# Check the correctness of file for kaggle
!head -10 Benchmark_1_koef_reg.csv

session_id,target
1,0.0022097873803092077
2,4.810213590399314e-09
3,1.8729363666424977e-08
4,2.3547447875078495e-08
5,3.130243424826118e-05
6,0.0002184619136023449
7,0.0005479231135761249
8,0.00013227051257335375
9,0.0007951454475832064


In [28]:
#!kaggle competitions submit -c catch-me-if-you-can-intruder-detection-through-webpage-session-tracking2 -f Benchmark_1_koef_reg.csv -m "Log regression and koef regular "

### SGD Classifier without hyperparameter tuning

In [29]:
def get_auc_sgd_valid(X, y, ratio = 0.9, seed = 17):
    train_len = int(ratio * X.shape[0])
    X_train = X[:train_len, :]
    X_valid = X[train_len:, :]
    y_train = y[:train_len]
    y_valid = y[train_len:]
    
    sgd_logit = sklearn.linear_model.SGDClassifier(loss = 'log', random_state = seed, n_jobs = -1)
    sgd_logit.fit(X_train, y_train) 
    
    valid_predict = sgd_logit.predict_proba(X_valid)[:, 1]
    
    return(roc_auc_score(y_valid, valid_predict))
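
The validation helpers for logistic regression and SGD differ only in the estimator they build. One possible refactoring, shown here as a sketch rather than as the notebook's method, is a single helper that accepts any estimator with `fit` and `predict_proba`. The name `get_auc_valid` and the synthetic data from `make_classification` are assumptions used only for this demo.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.neighbors import KNeighborsClassifier


def get_auc_valid(model, X, y, ratio=0.9):
    """Fit `model` on the first `ratio` share of rows, return ROC AUC on the rest."""
    train_len = int(ratio * X.shape[0])
    model.fit(X[:train_len], y[:train_len])
    valid_predict = model.predict_proba(X[train_len:])[:, 1]
    return roc_auc_score(y[train_len:], valid_predict)


# Synthetic stand-in for the session data
X_demo, y_demo = make_classification(n_samples=500, n_features=10, random_state=17)

auc_lr = get_auc_valid(LogisticRegression(), X_demo, y_demo)
auc_knn = get_auc_valid(KNeighborsClassifier(n_neighbors=15), X_demo, y_demo)
print(auc_lr, auc_knn)
```

Passing the estimator in keeps the split logic identical across models, so the scores stay comparable.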


In [30]:
%%time
get_auc_sgd_valid(X_train_sparse, y_train)


0.9362467544730779

In [31]:
sgd_logit = sklearn.linear_model.SGDClassifier(loss = 'log', random_state = 17, n_jobs = -1)
sgd_logit.fit(X_train_sparse, y_train)

SGDClassifier(alpha=0.0001, average=False, class_weight=None,
       early_stopping=False, epsilon=0.1, eta0=0.0, fit_intercept=True,
       l1_ratio=0.15, learning_rate='optimal', loss='log', max_iter=None,
       n_iter=None, n_iter_no_change=5, n_jobs=-1, penalty='l2',
       power_t=0.5, random_state=17, shuffle=True, tol=None,
       validation_fraction=0.1, verbose=0, warm_start=False)

In [32]:
test_predict = sgd_logit.predict_proba(X_test_sparse)[:, 1]

In [33]:
# Check the shape of the obtained predictions
test_predict.shape

(82797,)

In [34]:
# Create a file for kaggle
pd.Series(test_predict,
          index=range(1, test_predict.shape[0] + 1),
          name='target').to_csv('Benchmark_2_sgd.csv',
                                header=True,
                                index_label='session_id')

In [35]:
# Check the correctness of file for kaggle
!head -10 Benchmark_2_sgd.csv

session_id,target
1,0.010805617795376256
2,5.708909249665912e-05
3,1.7003168099574377e-05
4,0.00046359989945096133
5,0.0010328783654981939
6,0.007023601371539453
7,0.014843678934588137
8,0.005238344765274798
9,0.008108683575818041


**Send a submission to Kaggle from the IPython notebook**
<b>Usual form</b>: `kaggle competitions submit -c catch-me-if-you-can-intruder-detection-through-webpage-session-tracking2 -f submission.csv -m "Message"`

In [36]:
#!kaggle competitions submit -c catch-me-if-you-can-intruder-detection-through-webpage-session-tracking2 -f Benchmark_2_sgd.csv -m "Test submission from notebook"
                                  

## Logistic Regression and Feature Engineering

In [37]:
time = ['time%d' % i for i in range(1, 11)]
# Convert the time columns from strings to datetime
train_df[time] = train_df[time].apply(pd.to_datetime)
test_df[time] = test_df[time].apply(pd.to_datetime)

train_df[time]

session_id,time1,time2,time3,time4,time5,time6,time7,time8,time9,time10
1,2014-02-20 10:02:45,NaT,NaT,NaT,NaT,NaT,NaT,NaT,NaT,NaT
2,2014-02-22 11:19:50,2014-02-22 11:19:50,2014-02-22 11:19:51,2014-02-22 11:19:51,2014-02-22 11:19:51,2014-02-22 11:19:51,2014-02-22 11:19:52,2014-02-22 11:19:52,2014-02-22 11:20:15,2014-02-22 11:20:16
3,2013-12-16 16:40:17,2013-12-16 16:40:18,2013-12-16 16:40:19,2013-12-16 16:40:19,2013-12-16 16:40:19,2013-12-16 16:40:19,2013-12-16 16:40:20,2013-12-16 16:40:21,2013-12-16 16:40:22,2013-12-16 16:40:24
4,2014-03-28 10:52:12,2014-03-28 10:52:42,2014-03-28 10:53:12,2014-03-28 10:53:42,2014-03-28 10:54:12,2014-03-28 10:54:42,2014-03-28 10:55:12,2014-03-28 10:55:42,2014-03-28 10:56:12,2014-03-28 10:56:42
5,2014-02-28 10:53:05,2014-02-28 10:55:22,2014-02-28 10:55:22,2014-02-28 10:55:23,2014-02-28 10:55:23,2014-02-28 10:55:59,2014-02-28 10:55:59,2014-02-28 10:55:59,2014-02-28 10:57:06,2014-02-28 10:57:11
6,2014-03-18 15:18:31,2014-03-18 15:18:39,2014-03-18 15:23:02,2014-03-18 15:23:43,2014-03-18 15:29:57,NaT,NaT,NaT,NaT,NaT
7,2014-02-13 16:45:35,2014-02-13 16:45:35,2014-02-13 16:45:35,2014-02-13 16:45:35,2014-02-13 16:46:05,2014-02-13 16:47:14,2014-02-13 16:47:14,2014-02-13 16:47:15,2014-02-13 16:47:16,2014-02-13 16:47:17
8,2013-04-12 10:27:26,2013-04-12 10:27:26,2013-04-12 10:27:28,2013-04-12 10:27:29,2013-04-12 10:27:29,2013-04-12 10:27:29,2013-04-12 10:27:29,2013-04-12 10:27:31,2013-04-12 10:27:31,2013-04-12 10:27:32
9,2014-03-17 16:23:08,2014-03-17 16:23:35,2014-03-17 16:23:35,2014-03-17 16:23:35,2014-03-17 16:23:36,2014-03-17 16:23:36,2014-03-17 16:23:36,2014-03-17 16:23:52,2014-03-17 16:23:52,2014-03-17 16:23:53
10,2014-02-20 16:09:13,2014-02-20 16:10:08,2014-02-20 16:10:08,2014-02-20 16:10:08,2014-02-20 16:10:24,2014-02-20 16:10:24,2014-02-20 16:10:29,2014-02-20 16:10:39,2014-02-20 16:10:40,2014-02-20 16:10:40


In [38]:
# New DataFrames that will hold the new features;
# they are later appended to the existing X_train_sparse and X_test_sparse

new_feat_train = pd.DataFrame(index=train_df.index)
new_feat_test = pd.DataFrame(index=test_df.index)

In [39]:
# Add a single feature based on time1, the start of the session

new_feat_train['year_month'] = train_df['time1'].apply(lambda ts: 100 * ts.year + ts.month)
new_feat_test['year_month'] = test_df['time1'].apply(lambda ts: 100 * ts.year + ts.month)

In [40]:
new_feat_train.head()

session_id,year_month
1,201402
2,201402
3,201312
4,201403
5,201402
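
The `year_month` feature packs the year and the month of the session start into one integer, `100 * year + month`. The sketch below reproduces the five values above from the `time1` strings of the first five training sessions, using only the standard library. The helper name `year_month` and the list `session_starts` are introduced here for illustration.

```python
from datetime import datetime


def year_month(ts):
    """Encode a timestamp as YYYYMM, e.g. February 2014 -> 201402."""
    return 100 * ts.year + ts.month


# time1 of the first five training sessions, as shown by train_df.head()
session_starts = [
    "2014-02-20 10:02:45",
    "2014-02-22 11:19:50",
    "2013-12-16 16:40:17",
    "2014-03-28 10:52:12",
    "2014-02-28 10:53:05",
]
codes = [year_month(datetime.strptime(s, "%Y-%m-%d %H:%M:%S"))
         for s in session_starts]
print(codes)

# The encoding is not evenly spaced: December 2013 to January 2014 is a step of 89
year_gap = year_month(datetime(2014, 1, 1)) - year_month(datetime(2013, 12, 1))
print(year_gap)
```

The uneven step across year boundaries is worth keeping in mind when the value is treated as a numeric feature.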


In [41]:
# Normalize the new feature
# using StandardScaler

In [42]:
scaler = StandardScaler()
scaler.fit(new_feat_train['year_month'].values.reshape(-1,1))


new_feat_train['year_month_scaled'] = scaler.transform(new_feat_train['year_month'].values.reshape(-1,1))
new_feat_test['year_month_scaled'] = scaler.transform(new_feat_test['year_month'].values.reshape(-1,1))

In [43]:
new_feat_train.head(), new_feat_test.head()

(            year_month  year_month_scaled
 session_id                               
 1               201402           0.634518
 2               201402           0.634518
 3               201312          -1.485314
 4               201403           0.658072
 5               201402           0.634518,
             year_month  year_month_scaled
 session_id                               
 1               201410           0.822948
 2               201407           0.752287
 3               201412           0.870055
 4               201411           0.846501
 5               201405           0.705179)
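
`StandardScaler` subtracts the mean and divides by the standard deviation, and both statistics are estimated on the training data only and then reused for the test data. The sketch below does the same by hand with the standard library. It uses only the five head rows of each set, so its numbers differ from the notebook output, which is fitted on all training sessions. The helper names are hypothetical.

```python
from statistics import fmean, pstdev


def fit_standardizer(values):
    """Return the mean and the population standard deviation of `values`."""
    return fmean(values), pstdev(values)


def standardize(values, mean, std):
    """Apply (v - mean) / std with statistics estimated elsewhere."""
    return [(v - mean) / std for v in values]


train_values = [201402, 201402, 201312, 201403, 201402]
test_values = [201410, 201407, 201412, 201411, 201405]

# Fit on train only, then apply the same shift and scale to both sets
mean, std = fit_standardizer(train_values)
train_scaled = standardize(train_values, mean, std)
test_scaled = standardize(test_values, mean, std)
print(train_scaled)
print(test_scaled)
```

The scaled test values are all positive here because every test session starts later than the training mean, which matches the sign pattern in the output above.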

In [44]:
# Add this feature to X_train_sparse and X_test_sparse


X_train_sparse_new = csr_matrix(hstack([X_train_sparse, 
                            new_feat_train['year_month_scaled'].values.reshape(-1,1)]))

X_test_sparse_new = csr_matrix(hstack([X_test_sparse, 
                            new_feat_test['year_month_scaled'].values.reshape(-1,1)]))
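
As a small check of what `hstack` does here, the toy example below appends one dense column to a 3x3 sparse count matrix. The toy values are made up. Wrapping the result in `csr_matrix` mirrors the cell above; passing `format='csr'` to `hstack` would be an alternative.

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack

# Toy site-count matrix (3 sessions x 3 sites)
X_sites = csr_matrix(np.array([[2, 1, 0],
                               [0, 0, 1],
                               [0, 4, 0]]))

# One scaled feature per session, shaped as a column
extra = np.array([0.63, -1.49, 0.66]).reshape(-1, 1)

# The new feature becomes the last column of the combined matrix
X_combined = csr_matrix(hstack([X_sites, extra]))
last_column = X_combined[:, -1].toarray().ravel()
print(X_combined.shape, last_column)
```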

In [45]:
%%time
# Use the prepared data X_train_sparse_new
get_auc_lr_valid(X_train_sparse_new, y_train)


0.9655742093504578

In [46]:
logit = LogisticRegression(n_jobs=-1, random_state=17)
logit.fit(X_train_sparse_new, y_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='warn', n_jobs=-1,
          penalty='l2', random_state=17, solver='warn', tol=0.0001,
          verbose=0, warm_start=False)

In [47]:
test_predict = logit.predict_proba(X_test_sparse_new)[:, 1]

In [48]:
# Create a file for kaggle
pd.Series(test_predict,
          index=range(1, test_predict.shape[0] + 1),
          name='target').to_csv('Benchmark_1_and_feat.csv',
                                header=True,
                                index_label='session_id')

In [49]:
# Check the correctness of file for kaggle
!head -10 Benchmark_1_and_feat.csv

session_id,target
1,0.0015730914592577904
2,1.3990578037385505e-08
3,1.0015397304783872e-08
4,1.221598419197595e-08
5,2.4722982825425023e-05
6,0.00015755102509422563
7,0.0003639165635836856
8,6.740295496143995e-05
9,0.000513955488784313


In [50]:
#!kaggle competitions submit -c catch-me-if-you-can-intruder-detection-through-webpage-session-tracking2 -f Benchmark_1_and_feat.csv -m "Logistic Regression as Benchmark_1.csv + added 1 feature"

## SGD Classifier and Feature Engineering

In [51]:
%%time
get_auc_sgd_valid(X_train_sparse_new, y_train)


0.9448731755353111

In [52]:
sgd_logit = sklearn.linear_model.SGDClassifier(loss = 'log', random_state = 17, n_jobs = -1)
sgd_logit.fit(X_train_sparse_new, y_train)

SGDClassifier(alpha=0.0001, average=False, class_weight=None,
       early_stopping=False, epsilon=0.1, eta0=0.0, fit_intercept=True,
       l1_ratio=0.15, learning_rate='optimal', loss='log', max_iter=None,
       n_iter=None, n_iter_no_change=5, n_jobs=-1, penalty='l2',
       power_t=0.5, random_state=17, shuffle=True, tol=None,
       validation_fraction=0.1, verbose=0, warm_start=False)

In [53]:
test_predict = sgd_logit.predict_proba(X_test_sparse_new)[:, 1]

In [54]:
# Create a file for kaggle
pd.Series(test_predict,
          index=range(1, test_predict.shape[0] + 1),
          name='target').to_csv('Benchmark_2_and_feat.csv',
                                header=True,
                                index_label='session_id')

In [55]:
# Check the correctness of file for kaggle
!head -10 Benchmark_2_and_feat.csv

session_id,target
1,0.007387491670547649
2,5.562100324859741e-05
3,1.4389572417221454e-05
4,0.0002790352970941884
5,0.000686246635037244
6,0.00354995447663557
7,0.006145660998630394
8,0.0024797118011489856
9,0.005031294832855142


In [56]:
#!kaggle competitions submit -c catch-me-if-you-can-intruder-detection-through-webpage-session-tracking2 -f Benchmark_2_and_feat.csv -m "Stochastic Gradient Descent as Benchmark_2.csv + added 1 feature"

So far we have tried: <ul>
<li>Untuned Logistic Regression</li>
<li>Untuned Logistic Regression + 1 additional feature</li>
<li>Untuned SGD</li>
<li>Untuned SGD + 1 additional feature</li>
</ul>

**<b>Next, on the original data X_train_sparse:</b>**

## KNeighbors Classifier

In [57]:
X_train_sparse.shape, y_train.shape, X_test_sparse.shape

((253561, 48371), (253561,), (82797, 48371))

In [58]:
def get_auc_KNN_valid(X, y, ratio = 0.9):
    train_len = int(ratio * X.shape[0])
    X_train = X[:train_len, :]
    X_valid = X[train_len:, :]
    y_train = y[:train_len]
    y_valid = y[train_len:]
    
    knn = KNeighborsClassifier(n_neighbors = 100, n_jobs = -1)
    
    knn.fit(X_train, y_train)
    
    valid_predict = knn.predict_proba(X_valid)[:, 1]
    
    return(roc_auc_score(y_valid, valid_predict))
    

In [59]:
%%time
get_auc_KNN_valid(X_train_sparse, y_train)


0.8989015152001788

In [60]:
knn = KNeighborsClassifier(n_neighbors = 100, n_jobs = -1)
knn.fit(X_train_sparse, y_train)

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=-1, n_neighbors=100, p=2,
           weights='uniform')

In [77]:
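# Prediction takes a long time, so it is commented out.
# Uncomment it before writing the submission file below; otherwise
# test_predict still holds the previous model's predictions.
#test_predict = knn.predict_proba(X_test_sparse)[:, 1]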

In [None]:
# Create a file for kaggle
pd.Series(test_predict,
          index=range(1, test_predict.shape[0] + 1),
          name='target').to_csv('Benchmark_KNN.csv',
                                header=True,
                                index_label='session_id')

In [None]:
#!kaggle competitions submit -c catch-me-if-you-can-intruder-detection-through-webpage-session-tracking2 -f Benchmark_KNN.csv -m "Clean KNN"

## Random Forest Classifier

In [62]:
def get_auc_RANDOM_FOREST_valid(X, y, ratio = 0.9, seed = 17):
    train_len = int(ratio * X.shape[0])
    X_train = X[:train_len, :]
    X_valid = X[train_len:, :]
    y_train = y[:train_len]
    y_valid = y[train_len:]
    
    rand_forest = RandomForestClassifier(n_estimators = 100, max_depth=2, random_state = seed)
    
    rand_forest.fit(X_train, y_train)
    
    valid_predict = rand_forest.predict_proba(X_valid)[:, 1]
    
    return(roc_auc_score(y_valid, valid_predict))
    

In [63]:
%%time
get_auc_RANDOM_FOREST_valid(X_train_sparse, y_train)


0.8348523020276754

In [64]:
RANDOM_FOREST = RandomForestClassifier(n_estimators = 100, max_depth = 2, random_state = 17, n_jobs = -1)
RANDOM_FOREST.fit(X_train_sparse, y_train)

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=2, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=100, n_jobs=-1,
            oob_score=False, random_state=17, verbose=0, warm_start=False)

In [65]:
test_predict = RANDOM_FOREST.predict_proba(X_test_sparse)[:, 1]

In [66]:
# Create a file for kaggle
pd.Series(test_predict,
          index=range(1, test_predict.shape[0] + 1),
          name='target').to_csv('Benchmark_random_forest.csv',
                                header=True,
                                index_label='session_id')

In [67]:
#!kaggle competitions submit -c catch-me-if-you-can-intruder-detection-through-webpage-session-tracking2 -f Benchmark_random_forest.csv -m "Clean RANDOM FOREST"

## KNN and Feature Engineering

In [68]:
%%time
get_auc_KNN_valid(X_train_sparse_new, y_train)


0.9257749920026106

In [69]:
knn_feat = KNeighborsClassifier(n_neighbors = 100, n_jobs = -1)
knn_feat.fit(X_train_sparse_new, y_train)

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=-1, n_neighbors=100, p=2,
           weights='uniform')

In [78]:
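# Prediction takes a long time, so it is commented out.
# Uncomment it before writing the submission file below; otherwise
# test_predict still holds the previous model's predictions.
#test_predict = knn_feat.predict_proba(X_test_sparse_new)[:, 1]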

In [79]:
# Create a file for kaggle
pd.Series(test_predict,
          index=range(1, test_predict.shape[0] + 1),
          name='target').to_csv('Benchmark_KNN_and_Feat.csv',
                                header=True,
                                index_label='session_id')

In [80]:
#!kaggle competitions submit -c catch-me-if-you-can-intruder-detection-through-webpage-session-tracking2 -f Benchmark_KNN_and_Feat.csv -m "Benchmark_KNN_and_Feat"

## Random Forest Classifier and Feature Engineering

In [74]:
%%time
get_auc_RANDOM_FOREST_valid(X_train_sparse_new, y_train)


0.8292273911552085

In [81]:
RANDOM_FOREST = RandomForestClassifier(n_estimators = 100, max_depth = 2, random_state = 17, n_jobs = -1)
RANDOM_FOREST.fit(X_train_sparse_new, y_train)

RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=2, max_features='auto', max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, n_estimators=100, n_jobs=-1,
            oob_score=False, random_state=17, verbose=0, warm_start=False)

In [82]:
test_predict = RANDOM_FOREST.predict_proba(X_test_sparse_new)[:, 1]

In [83]:
# Create a file for kaggle
pd.Series(test_predict,
          index=range(1, test_predict.shape[0] + 1),
          name='target').to_csv('Benchmark_random_forest_and_feat.csv',
                                header=True,
                                index_label='session_id')

In [84]:
#!kaggle competitions submit -c catch-me-if-you-can-intruder-detection-through-webpage-session-tracking2 -f Benchmark_random_forest_and_feat.csv -m "Benchmark_random_forest_and_feat"

**Try changing the SGD hyperparameters**

In [85]:
def get_auc_sgd_tunned_valid(X, y, ratio = 0.9, seed = 17):
    train_len = int(ratio * X.shape[0])
    X_train = X[:train_len, :]
    X_valid = X[train_len:, :]
    y_train = y[:train_len]
    y_valid = y[train_len:]
    
    # Use the seed argument rather than a hard-coded random_state
    sgd_logit = sklearn.linear_model.SGDClassifier(loss = 'log', learning_rate='adaptive', eta0 = 0.05,
                                                   penalty = 'l2',  random_state = seed, n_jobs = -1)
    sgd_logit.fit(X_train, y_train) 
    
    valid_predict = sgd_logit.predict_proba(X_valid)[:, 1]
    
    return(roc_auc_score(y_valid, valid_predict))



In [86]:
%%time
get_auc_sgd_tunned_valid(X_train_sparse_new, y_train)


0.9150965500096997

In [87]:
sgd_logit = sklearn.linear_model.SGDClassifier(loss = 'log', learning_rate='adaptive', eta0 = 0.05,
                                              penalty = 'l2',  random_state = 17, n_jobs = -1)
sgd_logit.fit(X_train_sparse_new, y_train)

SGDClassifier(alpha=0.0001, average=False, class_weight=None,
       early_stopping=False, epsilon=0.1, eta0=0.05, fit_intercept=True,
       l1_ratio=0.15, learning_rate='adaptive', loss='log', max_iter=None,
       n_iter=None, n_iter_no_change=5, n_jobs=-1, penalty='l2',
       power_t=0.5, random_state=17, shuffle=True, tol=None,
       validation_fraction=0.1, verbose=0, warm_start=False)

In [88]:
test_predict = sgd_logit.predict_proba(X_test_sparse_new)[:, 1]

In [89]:
# Create a file for kaggle
pd.Series(test_predict,
          index=range(1, test_predict.shape[0] + 1),
          name='target').to_csv('Benchmark_sgd_tried.csv',
                                header=True,
                                index_label='session_id')

In [None]:
#!kaggle competitions submit -c catch-me-if-you-can-intruder-detection-through-webpage-session-tracking2 -f Benchmark_sgd_tried.csv -m "Benchmark_sgd_tried"