# Catch Me If You Can

_**Intruder Detection through Webpage Session Tracking**_

---

## Contents

1. [Background](#Background)
1. [Setups](#Setups)
1. [Meet and Greet the Data](#Data)
1. [Model Training](#Model-Training)
    1. [Count vectorizer](#Count-vectorizer)
    1. [Logistic Regression model](#Logistic-Regression-model)

---



## Background

_This notebook has been adapted from multiple notebook submissions to [Kaggle: Catch Me If You Can](https://www.kaggle.com/c/catch-me-if-you-can-intruder-detection-through-webpage-session-tracking2/overview)_

Web-user identification is an active research topic at the intersection of sequential pattern mining and behavioral psychology.

Here we try to identify a user on the Internet by tracking the sequence of web pages they visit. The algorithm to be built takes a webpage session (a sequence of web pages visited consecutively by the same person) and predicts whether it belongs to Alice or to somebody else.

With this dataset, this workshop aims to introduce you to the data science workflow for creating and deploying models as an API. AWS provides a service for blablabla...

## Setups

_This notebook was created and tested on an ml.m4.xlarge notebook instance._

Let's start by specifying:

- The S3 bucket and prefix that you want to use for training and model data. This should be within the same region as the notebook instance, training, and hosting.
- The IAM role ARN used to give training and hosting access to your data. See the documentation for how to create these. Note that if more than one role is required for notebook instances, training, and/or hosting, you should replace the boto regexp with the appropriate full IAM role ARN string(s).

### AWS Configuration and S3

In [None]:
bucket = '<your_s3_bucket_name_here>'
prefix = 'sagemaker/DEMO-xgboost-churn'

# Define IAM role
import boto3
import re
from sagemaker import get_execution_role

role = get_execution_role()

### Python Packages & Libraries

In [6]:
# Standard library
import os, json, time
from IPython.display import display
from time import strftime, gmtime

# AWS Sagemaker Python API
import sagemaker
from sagemaker.predictor import csv_serializer

In [None]:
# Visualisation library
%matplotlib inline
from matplotlib import pyplot as plt
import seaborn as sns

# Datascience libraries
import pickle
import numpy as np
import pandas as pd

from scipy.sparse import csr_matrix
from scipy.sparse import hstack

from sklearn.preprocessing import StandardScaler
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import roc_auc_score
from sklearn.linear_model import LogisticRegression

---
## Meet and Greet the Data


The training set `train_sessions.csv` contains information on user browsing sessions, with the following features:

- `site_i` – ids of the sites visited in this session; the mapping from site names to ids is stored in the pickled dictionary `site_dic.pkl`
- `time_j` – timestamps of the visits to the corresponding sites
- `target` – whether this session belongs to Alice

The dataset is publicly available and was mentioned in the paper *A Tool for Classification of Sequential Data* by Giacomo Kahn, Yannick Loiseau and Olivier Raynaud. Let's begin exploring the data.

In [17]:
PATH_TO_DATA = './data/catch_me_kaggle/'

path_to_train = os.path.join(PATH_TO_DATA, 'train_sessions.csv')
path_to_test = os.path.join(PATH_TO_DATA, 'test_sessions.csv')

In [18]:
train_df = pd.read_csv(path_to_train,
                       index_col='session_id', parse_dates=['time1'])
test_df = pd.read_csv(path_to_test,
                      index_col='session_id', parse_dates=['time1'])

# Sort the data by time
train_df = train_df.sort_values(by='time1')

# Look at the first rows of the training set
train_df.head()

| session_id | site1 | time1 | site2 | time2 | site3 | time3 | site4 | time4 | site5 | time5 | ... | time6 | site7 | time7 | site8 | time8 | site9 | time9 | site10 | time10 | target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 21669 | 56 | 2013-01-12 08:05:57 | 55.0 | 2013-01-12 08:05:57 | | | | | | | ... | | | | | | | | | | 0 |
| 54843 | 56 | 2013-01-12 08:37:23 | 55.0 | 2013-01-12 08:37:23 | 56.0 | 2013-01-12 09:07:07 | 55.0 | 2013-01-12 09:07:09 | | | ... | | | | | | | | | | 0 |
| 77292 | 946 | 2013-01-12 08:50:13 | 946.0 | 2013-01-12 08:50:14 | 951.0 | 2013-01-12 08:50:15 | 946.0 | 2013-01-12 08:50:15 | 946.0 | 2013-01-12 08:50:16 | ... | 2013-01-12 08:50:16 | 948.0 | 2013-01-12 08:50:16 | 784.0 | 2013-01-12 08:50:16 | 949.0 | 2013-01-12 08:50:17 | 946.0 | 2013-01-12 08:50:17 | 0 |
| 114021 | 945 | 2013-01-12 08:50:17 | 948.0 | 2013-01-12 08:50:17 | 949.0 | 2013-01-12 08:50:18 | 948.0 | 2013-01-12 08:50:18 | 945.0 | 2013-01-12 08:50:18 | ... | 2013-01-12 08:50:18 | 947.0 | 2013-01-12 08:50:19 | 945.0 | 2013-01-12 08:50:19 | 946.0 | 2013-01-12 08:50:19 | 946.0 | 2013-01-12 08:50:20 | 0 |
| 146670 | 947 | 2013-01-12 08:50:20 | 950.0 | 2013-01-12 08:50:20 | 948.0 | 2013-01-12 08:50:20 | 947.0 | 2013-01-12 08:50:21 | 950.0 | 2013-01-12 08:50:21 | ... | 2013-01-12 08:50:21 | 946.0 | 2013-01-12 08:50:21 | 951.0 | 2013-01-12 08:50:22 | 946.0 | 2013-01-12 08:50:22 | 947.0 | 2013-01-12 08:50:22 | 0 |


In [25]:
# show the shape of our dataset
print("The shape of training dataset: ", train_df.shape)

The shape of training dataset:  (253561, 21)


The training set contains 253,561 sessions. Each one is described by 21 columns: ten site ids, the ten corresponding visit timestamps, and the target.

The training data set contains the following features:

- `site1` – id of the first visited website in the session
- `time1` – visiting time for the first website in the session
...
- `site10` – id of the tenth visited website in the session
- `time10` – visiting time for the tenth website in the session
- `target` – target variable, possesses value of 1 for Alice's sessions, and 0 for the other users' sessions

Sessions are constructed so that each one lasts no longer than thirty minutes and contains no more than ten websites. In other words, a session is considered finished either once the user has visited ten websites or once it has lasted more than thirty minutes.
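The session rule can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: `split_into_sessions` is a hypothetical helper, the input is one user's visits already sorted by time, and the thirty-minute limit is measured from the first visit of the session. The dataset authors' exact procedure is not documented here and may differ.

```python
from datetime import datetime, timedelta

MAX_SITES = 10                          # a session holds at most ten visits
MAX_DURATION = timedelta(minutes=30)    # and spans at most thirty minutes


def split_into_sessions(visits):
    """Group one user's time-ordered (site_id, timestamp) visits into sessions."""
    sessions, current = [], []
    for site_id, ts in visits:
        # Close the current session when it is full or has run too long
        # (duration measured from its first visit, an assumption).
        if current and (len(current) == MAX_SITES
                        or ts - current[0][1] > MAX_DURATION):
            sessions.append(current)
            current = []
        current.append((site_id, ts))
    if current:
        sessions.append(current)
    return sessions


# Hypothetical visit log: two visits, a long pause, then two more visits
start = datetime(2013, 1, 12, 8, 5, 57)
visits = [
    (56, start),
    (55, start),
    (56, start + timedelta(minutes=61)),
    (55, start + timedelta(minutes=61, seconds=2)),
]
sessions = split_into_sessions(visits)
print([len(s) for s in sessions])
```

Sessions shorter than ten visits are what produce the empty `site_i` and `time_j` cells in the training table.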

Now let's load the websites dictionary and see what it looks like:

In [24]:
# Load websites dictionary
path_to_sitedic = os.path.join(PATH_TO_DATA, 'site_dic.pkl')
with open(path_to_sitedic, 'rb') as input_file:
    site_dict = pickle.load(input_file)

# Create dataframe for the dictionary
sites_dict = pd.DataFrame(list(site_dict.keys()), index=list(site_dict.values()), columns=['site'])
print('Websites total:', sites_dict.shape[0])

sites_dict.head()

Websites total: 48371


| | site |
|---|---|
| 25075 | www.abmecatronique.com |
| 13997 | groups.live.com |
| 42436 | majeureliguefootball.wordpress.com |
| 30911 | cdt46.media.tourinsoft.eu |
| 8104 | www.hdwallpapers.eu |
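Since `site_dict` maps site names to ids, reading a session back as site names needs the inverse mapping. A minimal sketch, assuming a three-entry excerpt of the dictionary (the real one is loaded from `site_dic.pkl`); `decode_session` is a hypothetical helper, and skipping id 0 assumes the convention of using 0 for an empty slot:

```python
# Hypothetical excerpt; the real site_dict comes from site_dic.pkl
site_dict = {
    'www.abmecatronique.com': 25075,
    'groups.live.com': 13997,
    'www.hdwallpapers.eu': 8104,
}

# Invert the name -> id mapping into id -> name
id_to_site = {site_id: name for name, site_id in site_dict.items()}


def decode_session(site_ids):
    """Translate site ids to names; skip 0 (empty slot), keep unknown ids as text."""
    return [id_to_site.get(i, str(i)) for i in site_ids if i != 0]


print(decode_session([13997, 8104, 0, 0]))
```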


Some site columns in the training set are empty, which means that those sessions contain fewer than ten websites. We replace the empty values with 0 and change the column types to integer.

In [26]:
# Change site1, ..., site10 columns type to integer and fill NA-values with zeros
sites = ['site%s' % i for i in range(1, 11)]
train_df[sites] = train_df[sites].fillna(0).astype('int')
test_df[sites] = test_df[sites].fillna(0).astype('int')

In [32]:
# extract the target/label
y_train = train_df['target']

## Modelling
For a very basic model, we use only the websites visited in the session and ignore the timestamp features. The reasoning is that Alice has her favorite sites: the more often they appear in a session, the more likely the session is hers, and vice versa.

To prepare the data, we take only the features `site1`, `site2`, ... , `site10` from the dataframe. Keep in mind that missing values have been replaced with zero.

Many models work well with this basic idea. For this workshop, we propose the following workflow:
1. Transform the data into a "*bag of words*" representation: it captures how often each site is visited in a session, by Alice as well as by the intruder, in the hope that the next model can discriminate between the two.
2. Classification model: we choose the XGBoost algorithm for this example. Feel free to experiment with other available algorithms as well.
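To make the bag-of-words step concrete, here is a plain-Python sketch that counts site-id n-grams in a single session. It is an illustration rather than the vectorizer itself: `ngram_counts` is a hypothetical helper, and scikit-learn's `CountVectorizer` additionally applies its own tokenization. Its default token pattern keeps only tokens of two or more characters, so single-digit ids, including a padding 0, are dropped unless the pattern is changed.

```python
from collections import Counter


def ngram_counts(session, n_max=3):
    """Count all 1- to n_max-grams of site ids in one space-separated session."""
    tokens = session.split()
    counts = Counter()
    for n in range(1, n_max + 1):
        # Slide a window of n consecutive site ids over the session
        for i in range(len(tokens) - n + 1):
            counts[' '.join(tokens[i:i + n])] += 1
    return counts


# Hypothetical four-visit session, counting unigrams and bigrams
counts = ngram_counts('946 946 951 946', n_max=2)
print(counts['946'], counts['946 946'], counts['951 946'])
```

Each distinct n-gram becomes one column of the feature matrix, which is why a sparse representation is used for the result.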

In [None]:
# transform dataframe 
train_df[sites].fillna(0).to_csv('train_sessions_text.txt', 
                                 sep=' ', index=None, header=None)
test_df[sites].fillna(0).to_csv('test_sessions_text.txt', 
                                sep=' ', index=None, header=None)

In [28]:
%%time
cv = CountVectorizer(ngram_range=(1, 3), max_features=50000)

with open('train_sessions_text.txt') as inp_train_file:
    X_train = cv.fit_transform(inp_train_file)

with open('test_sessions_text.txt') as inp_test_file:
    X_test = cv.transform(inp_test_file)

print(X_train.shape, X_test.shape)

(253561, 50000) (82797, 50000)
CPU times: user 7.62 s, sys: 114 ms, total: 7.74 s
Wall time: 7.75 s


## Training the model

In [35]:
from sklearn.model_selection import train_test_split

X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.1)

In [50]:
import xgboost as xgb

D_train = xgb.DMatrix(X_train, label=y_train)
D_val = xgb.DMatrix(X_val, label=y_val)

In [51]:
param = {
    'eta': 0.03, 
    'max_depth': 5,
    'objective': 'binary:logistic',  
    'eval_metric': 'logloss',
    'gamma': 2,
} 

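steps = 30  # The number of boosting rounds

# Train the booster on the training split
model = xgb.train(param, D_train, num_boost_round=steps)

In [52]:
import numpy as np
from sklearn.metrics import precision_score, recall_score, accuracy_score

# With 'binary:logistic', predict() returns one probability per session
probs = model.predict(D_val)
# Threshold at 0.5 to get class labels (np.argmax on a scalar always returns 0)
preds = (probs > 0.5).astype(int)

print("Precision = {}".format(precision_score(y_val, preds, average='macro')))
print("Recall = {}".format(recall_score(y_val, preds, average='macro')))
print("Accuracy = {}".format(accuracy_score(y_val, preds)))

Alice's sessions are rare, so accuracy is a weak guide here: labelling every session in this validation split as 0 already gives an accuracy of 0.9898, a macro recall of 0.5 and a macro precision of 0.4949.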
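The effect of class imbalance on these metrics can be reproduced on a small example. The sketch below uses hypothetical labels and probabilities, with a hypothetical helper `binary_metrics`, and compares a classifier that always predicts 0 with one that thresholds predicted probabilities at 0.5:

```python
def binary_metrics(y_true, y_pred):
    """Return (accuracy, precision, recall) for the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    accuracy = (tp + tn) / len(y_true)
    # Define precision/recall as 0.0 when the denominator is empty
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return accuracy, precision, recall


# Hypothetical labels: one positive session among 100
y_true = [1] + [0] * 99

# A classifier that never flags anyone still looks accurate
always_zero = [0] * 100
print(binary_metrics(y_true, always_zero))

# Hypothetical predicted probabilities, thresholded at 0.5
probs = [0.8] + [0.1] * 99
thresholded = [int(p > 0.5) for p in probs]
print(binary_metrics(y_true, thresholded))
```

The always-zero classifier reaches 0.99 accuracy while finding none of the positive sessions, which is why precision and recall (or ROC AUC) are more informative for this task.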


In [53]:
D_test = xgb.DMatrix(X_test)