### Catch Me If You Can: Intruder Detection through Webpage Session Tracking
We try to identify a user on the Internet tracking his/her sequence of attended Web pages. The algorithm to be built will take a webpage session (a sequence of webpages attended consequently by the same person) and predict whether it belongs to Alice or somebody else.

**Objects**
- Websites visits of either Alice or Intruder

In [5]:
import pandas as pd
import numpy as np
import warnings

import xgboost as xgb

from gensim.models import word2vec
from collections import defaultdict
from typing import Dict, List

from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import TfidfVectorizer

warnings.filterwarnings('ignore')

### Data Loading

In [6]:
train_df = pd.read_csv('/home/jovyan/work/data_sets/alice_catch_me/sessions_train.csv')
test_df = pd.read_csv('/home/jovyan/work/data_sets/alice_catch_me/sessions_test.csv')

In [7]:
train_df.head()

Unnamed: 0,session_id,site1,time1,site2,time2,site3,time3,site4,time4,site5,...,time6,site7,time7,site8,time8,site9,time9,site10,time10,target
0,1,718,2014-02-20 10:02:45,,,,,,,,...,,,,,,,,,,0
1,2,890,2014-02-22 11:19:50,941.0,2014-02-22 11:19:50,3847.0,2014-02-22 11:19:51,941.0,2014-02-22 11:19:51,942.0,...,2014-02-22 11:19:51,3847.0,2014-02-22 11:19:52,3846.0,2014-02-22 11:19:52,1516.0,2014-02-22 11:20:15,1518.0,2014-02-22 11:20:16,0
2,3,14769,2013-12-16 16:40:17,39.0,2013-12-16 16:40:18,14768.0,2013-12-16 16:40:19,14769.0,2013-12-16 16:40:19,37.0,...,2013-12-16 16:40:19,14768.0,2013-12-16 16:40:20,14768.0,2013-12-16 16:40:21,14768.0,2013-12-16 16:40:22,14768.0,2013-12-16 16:40:24,0
3,4,782,2014-03-28 10:52:12,782.0,2014-03-28 10:52:42,782.0,2014-03-28 10:53:12,782.0,2014-03-28 10:53:42,782.0,...,2014-03-28 10:54:42,782.0,2014-03-28 10:55:12,782.0,2014-03-28 10:55:42,782.0,2014-03-28 10:56:12,782.0,2014-03-28 10:56:42,0
4,5,22,2014-02-28 10:53:05,177.0,2014-02-28 10:55:22,175.0,2014-02-28 10:55:22,178.0,2014-02-28 10:55:23,177.0,...,2014-02-28 10:55:59,175.0,2014-02-28 10:55:59,177.0,2014-02-28 10:55:59,177.0,2014-02-28 10:57:06,178.0,2014-02-28 10:57:11,0


### Data Preprocessing

In [8]:
# class distribution
(
    train_df['target'].value_counts(),
    train_df['target'].value_counts(normalize=True)*100
)

(0    251264
 1      2297
 Name: target, dtype: int64,
 0    99.094104
 1     0.905896
 Name: target, dtype: float64)

- highly imbalanced data

In [9]:
# look at number of sessions and websites
print('N Unique Sessions: ', train_df['session_id'].value_counts().sum())

site_columns = train_df.columns[train_df.columns.str.contains('site')]
time_columns = train_df.columns[train_df.columns.str.contains('time')]

sites_vocab = set()

for col in site_columns:
    sites_vocab = sites_vocab.union(set(train_df[col].unique()))

print('N Unique Websites: ', len(sites_vocab))

N Unique Sessions:  253561
N Unique Websites:  41610


In [10]:
train_df['time1'].min()

'2013-01-12 08:05:57'

In [11]:
# preprocess time column
train_df[time_columns] = train_df[time_columns].apply(pd.to_datetime)
test_df[time_columns] = test_df[time_columns].apply(pd.to_datetime)

# what time period we have
print(f'Time min: {train_df["time1"].min()}, Time max: {train_df["time10"].max()} (Train)')
print(f'Time min: {test_df["time1"].min()}, Time max: {test_df["time10"].max()} (Test)')

Time min: 2013-01-12 08:05:57, Time max: 2014-04-30 23:39:53 (Train)
Time min: 2014-05-01 17:14:03, Time max: 2014-12-05 19:10:03 (Test)


**Important**
- Since we have sequences / sites that are visited in sessions and time, we have to consider time order!

In [12]:
train_df = train_df.sort_values(by='time1', ascending=True)
test_df = test_df.sort_values(by='time1', ascending=True)

In [13]:
# NaN -> 0
train_df[site_columns] = train_df[site_columns].fillna(0).astype('int').astype('str')
test_df[site_columns] = test_df[site_columns].fillna(0).astype('int').astype('str')

In [14]:
train_df.head()

Unnamed: 0,session_id,site1,time1,site2,time2,site3,time3,site4,time4,site5,...,time6,site7,time7,site8,time8,site9,time9,site10,time10,target
21668,21669,56,2013-01-12 08:05:57,55,2013-01-12 08:05:57,0,NaT,0,NaT,0,...,NaT,0,NaT,0,NaT,0,NaT,0,NaT,0
54842,54843,56,2013-01-12 08:37:23,55,2013-01-12 08:37:23,56,2013-01-12 09:07:07,55,2013-01-12 09:07:09,0,...,NaT,0,NaT,0,NaT,0,NaT,0,NaT,0
77291,77292,946,2013-01-12 08:50:13,946,2013-01-12 08:50:14,951,2013-01-12 08:50:15,946,2013-01-12 08:50:15,946,...,2013-01-12 08:50:16,948,2013-01-12 08:50:16,784,2013-01-12 08:50:16,949,2013-01-12 08:50:17,946,2013-01-12 08:50:17,0
114020,114021,945,2013-01-12 08:50:17,948,2013-01-12 08:50:17,949,2013-01-12 08:50:18,948,2013-01-12 08:50:18,945,...,2013-01-12 08:50:18,947,2013-01-12 08:50:19,945,2013-01-12 08:50:19,946,2013-01-12 08:50:19,946,2013-01-12 08:50:20,0
146669,146670,947,2013-01-12 08:50:20,950,2013-01-12 08:50:20,948,2013-01-12 08:50:20,947,2013-01-12 08:50:21,950,...,2013-01-12 08:50:21,946,2013-01-12 08:50:21,951,2013-01-12 08:50:22,946,2013-01-12 08:50:22,947,2013-01-12 08:50:22,0


### Word2Vec
Many different models and features can be built on this dataset. We will train Word2Vec model tolearn association/dependencies between the websites visited.

In [15]:
# create texts to train word2vec
train_df['site_seq'] = train_df['site1']
test_df['site_seq'] = test_df['site1']

for s in site_columns[1:]:
    train_df['site_seq'] = train_df['site_seq'] + "," + train_df[s]
    test_df['site_seq'] = test_df['site_seq'] + "," + test_df[s]

train_df['site_seq_list'] = train_df['site_seq'].apply(lambda x: x.split(','))
test_df['site_seq_list'] = test_df['site_seq'].apply(lambda x: x.split(','))

In [16]:
# obtained site seq for each session
train_df['site_seq_list']

21668                      [56, 55, 0, 0, 0, 0, 0, 0, 0, 0]
54842                    [56, 55, 56, 55, 0, 0, 0, 0, 0, 0]
77291     [946, 946, 951, 946, 946, 945, 948, 784, 949, ...
114020    [945, 948, 949, 948, 945, 946, 947, 945, 946, ...
146669    [947, 950, 948, 947, 950, 952, 946, 951, 946, ...
                                ...                        
12223            [50, 50, 48, 49, 48, 52, 52, 49, 303, 304]
164437    [4207, 753, 753, 52, 50, 4207, 3346, 3359, 334...
12220     [52, 3346, 784, 784, 3346, 979, 3324, 7330, 35...
156967    [3328, 3324, 3599, 3413, 753, 3328, 3599, 3359...
204761     [222, 3346, 3346, 3359, 55, 2891, 3346, 0, 0, 0]
Name: site_seq_list, Length: 253561, dtype: object

**Interpretation**
- Sentence -> sequence of websites visited by a user
- It's unnecessary transform numbers into website names because algorithm will learn correlation/dependencies anyway

In [17]:
# for better word2vec model training concatenate train and test (use both train and test only for word2vec)
test_df['target'] = -1
data = pd.concat([train_df, test_df], axis=0)

# define word2vec model
w2v_model = word2vec.Word2Vec(data['site_seq_list'], vector_size=300, window=3, workers=4)

In [21]:
# define vocab and embeddigs
embeddings_vocab = dict(zip(w2v_model.wv.index_to_key, w2v_model.wv.vectors))

- Now each word has a vector but in our case a session consists of websites/words
- We need to decide how can we represent a sentence using a set of word embeddings?
- **One solution** -> **get average embedding vector** -> average meaning of a sentence

### Mean Embeddings

In [60]:
# define custom 'MeanVectorizer'
class MeanSentenceVectorizer():
    """
    Class to get mean embeddings of sentences.
    """
    def __init__(self, embeddings_vocab: Dict[str, np.ndarray]) -> None:
        self.embeddings_vocab = embeddings_vocab # word name and its embedding
        self.dim = len(next(iter(self.embeddings_vocab.values()))) # embedding dimension


    def transform(self, X: List[str]) -> np.ndarray:
        """
        Gets mean embedding of a sentence. If token is missing in model vocab,
        returns zero vector.

        Parameters
        ----------
        X: str
            Sentence (i.e. sequence of tokens).
        """
        return np.array([
            np.mean(
                [self.embeddings_vocab[w] for w in words if w in self.embeddings_vocab] 
                or
                [np.zeros(self.dim)], axis=0
            )
            for words in X
        ])

In [63]:
# get mean embeddings of sentences / sessions
mean_sen_vec = MeanSentenceVectorizer(embeddings_vocab)

# get transformations for both train and test
mean_embeddings_train = mean_sen_vec.transform(train_df['site_seq_list'])
mean_embeddings_test = mean_sen_vec.transform(test_df['site_seq_list'])

train_df['seq_mean_embedding'] = list(mean_embeddings_train)
test_df['seq_mean_embedding'] = list(mean_embeddings_test)

print('Obtained Embeddings Shape: ', mean_embeddings_train.shape)

Obtained Embeddings Shape:  (253561, 300)


- Now we have our embeddings and can train a model
- Apply train test split first to get validation set

In [64]:
# don't shuffle since time is important! (first we always check the performance and train and validation)
X_train, X_val, y_train, y_val = train_test_split(
    mean_embeddings_train, train_df['target'],
    train_size=0.8, shuffle=False, random_state=23
)

# to check the ratio of samples in train and validation
(
    y_train.value_counts(normalize=True)*100,
    y_val.value_counts(normalize=True)*100
)

(0    99.027351
 1     0.972649
 Name: target, dtype: float64,
 0    99.361111
 1     0.638889
 Name: target, dtype: float64)

### Model Training

**Simple Neural Net**

- To predict if a session is fraudulent, train NN

In [65]:
from keras import regularizers
from keras.layers import Activation, Dense, Dropout, Input
from keras.models import Model, Sequential
from keras.preprocessing.text import Tokenizer

In [66]:
# define a model
model = Sequential()
model.add(Dense(128, input_dim=(X_train.shape[1])))
model.add(Activation('relu'))
model.add(Dropout(0.5))
model.add(Dense(1))
model.add(Activation('sigmoid'))

model.compile(
    loss='binary_crossentropy',
    optimizer='adam',
    metrics=['binary_accuracy']
)

In [67]:
# train a model
history = model.fit(
    X_train,
    y_train, 
    batch_size=128,
    epochs=10,
    validation_data=(X_val, y_val),
#     class_weight='auto',
    verbose=1
)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


In [68]:
# get performance (validation set)
y_pred_nn_val = model.predict(X_val, batch_size=128)
roc_auc_score(y_val, y_pred_nn_val)

0.9236614448234709

Let's test model performance on Leaderboard (Test data)

In [1]:
def prediction_to_submission(ids, preds, f_name=''):
    submission_df = pd.DataFrame({
        'session_id': ids,
        'target': preds
    })
    submission_df.to_csv(f_name, index=False)

In [71]:
# get predictions for test data
y_pred_nn_test = model.predict(mean_embeddings_test, batch_size=128)

prediction_to_submission(
    test_df['session_id'], np.squeeze(y_pred_nn_test).tolist(), 'mean_embeddings_nn.csv'
)

**Interpretation**
- The performance of the model is close to a baseline (log_reg + OHE(sites)). However, the dimensionality is much less, thus, word2vec model was able to identify some dependencies among sessions.
- Performance on Leaderboard (Test): `0.89974`

**Tree-Based Model**

In [84]:
# prepare data for xgb
train_xgb_matrix = xgb.DMatrix(X_train, label=y_train, missing = np.nan)
val_xgb_matrix = xgb.DMatrix(X_val, label=y_val, missing = np.nan)
test_xgb_matrix = xgb.DMatrix(mean_embeddings_test)

watchlist = [(train_xgb_matrix, 'train'), (val_xgb_matrix, 'eval')]
history = dict()

In [74]:
params = {
    'max_depth': 26,
    'eta': 0.025,
    'nthread': 4,
    'gamma' : 1,
    'alpha' : 1,
    'subsample': 0.85,
    'eval_metric': ['auc'],
    'objective': 'binary:logistic',
    'colsample_bytree': 0.9,
    'min_child_weight': 100,
    'scale_pos_weight': (1)/train_df['target'].mean(),
    'seed': 23
}

xgb_model = xgb.train(
    params,
    train_xgb_matrix,
    num_boost_round=100,
    evals=watchlist,
    evals_result=history,
    verbose_eval=20
)

[0]	train-auc:0.95504	eval-auc:0.85363
[20]	train-auc:0.98877	eval-auc:0.91651
[40]	train-auc:0.99146	eval-auc:0.91991
[60]	train-auc:0.99308	eval-auc:0.92145
[80]	train-auc:0.99425	eval-auc:0.92304
[99]	train-auc:0.99511	eval-auc:0.92383


In [81]:
# performance estimation (validation data)
scores_val = xgb_model.predict(val_xgb_matrix)
roc_auc_score(y_val, scores_val)

0.9238325825080871

In [87]:
y_pred_xgb_test = xgb_model.predict(test_xgb_matrix)

prediction_to_submission(
    test_df['session_id'], np.squeeze(y_pred_xgb_test).tolist(), 'mean_embeddings_xgb.csv'
)

**Results Analysis**
- XGBoost overfits the training data and thus val performance is not good, let's check linear models
- Performance on Leaderboard (Test): `0.89870` although val performance is a bit better

**Linear Models**

In [88]:
params = {
    'C': 1,
    'random_state': 23,
    'n_jobs': -1
}

mean_lr_model = LogisticRegression(**params)
mean_lr_model.fit(X_train, y_train)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


LogisticRegression(C=1, n_jobs=-1, random_state=23)

In [90]:
# performance estimation
y_pred_lr_mean = mean_lr_model.predict_proba(X_val)[:, 1]
roc_auc_score(y_val, y_pred_lr_mean)

0.9009595164435507

### IDF Weighted Mean Embeddings
- Let's assign a weight in front of every word in a text and **compute weighted average**
- Weight -> `idf`
    - High weight for words that rare -> Can better distinguish between similar documents
    - Low weight for words that are common and appear in many documents

In [91]:
class CustomTfIdfVectorizer():
    def __init__(self, embeddings_vocab: Dict[str, np.ndarray]) -> None:
        self.embeddings_vocab = embeddings_vocab
        self.dim = len(next(iter(self.embeddings_vocab.values())))
        self.weight = None


    def fit(self, X: List[str]) -> None:
        """
        Computes IDF weights for tokens.
        
        Important
        ---------
        Words that are missed in vocab must be as infrequent as any known words.
        Thus, the default IDF value for them is max(idf).

        Parameters
        ----------
        X: list
            Sentence (i.e. sequence of tokens).
        """
        # fit TfidfVectorizer
        tfidf_vectorizer = TfidfVectorizer(analyzer=lambda x: x)
        tfidf_vectorizer.fit(X)

        max_idf = max(tfidf_vectorizer.idf_) # this weight will have words that are missing in vocab

        self.weight = defaultdict(
            lambda: max_idf,
            [
                (w, tfidf_vectorizer.idf_[i]) for w, i in tfidf_vectorizer.vocabulary_.items()
            ]
        )
        return None

    def transform(self, X: List[str]) -> np.ndarray:
        """Returns IDF weighted mean embedding vector of a sentence."""
        return np.array([
            np.mean(
                [self.embeddings_vocab[w] * self.weight[w] for w in words if w in self.embeddings_vocab] 
                or
                [np.zeros(self.dim)], axis=0 # if word is missing, it has zero vector and max(idf)
            )
            for words in X
        ])

In [92]:
# get idf weighted mean embeddings of sentences / sessions
idf_mean_sen_vec = CustomTfIdfVectorizer(embeddings_vocab)
idf_mean_sen_vec.fit(train_df['site_seq_list'])

idf_mean_embeddings = idf_mean_sen_vec.transform(train_df['site_seq_list'])
train_df['seq_idf_mean_embedding'] = list(idf_mean_embeddings)

print('Obtained Embeddings Shape: ', idf_mean_embeddings.shape)

Obtained Embeddings Shape:  (253561, 300)


**Check Improvement**
- Use logistic regression

In [93]:
# train/val split
X_train, X_val, y_train, y_val = train_test_split(
    idf_mean_embeddings, train_df['target'],
    train_size=0.8, shuffle=False, random_state=23
)

In [94]:
# model
params = {
    'C': 1,
    'random_state': 23,
    'n_jobs': -1
}

lr_model = LogisticRegression(**params)
lr_model.fit(X_train, y_train)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


LogisticRegression(C=1, n_jobs=-1, random_state=23)

In [95]:
# performance estimation
y_pred_lr = lr_model.predict_proba(X_val)[:, 1]
score_lr = roc_auc_score(y_val, y_pred_lr)
score_lr

0.8977889366408356

**Results Analysis**

There is a small improvement that supports the usefulness of weighting the word embeddings:
- simple embeddings average: `0.8922`
- weighted embeddings average: `0.8989`

### Leaderboard Performance
Let's check how well we trained our model.
- [Competition Leaderboard](https://www.kaggle.com/competitions/catch-me-if-you-can-intruder-detection-through-webpage-session-tracking2/leaderboard?)

**Linear Regression**

In [96]:
idf_mean_sen_vec = CustomTfIdfVectorizer(embeddings_vocab)
mean_sen_vec = MeanSentenceVectorizer(embeddings_vocab)

mean_embeddings_train = mean_sen_vec.transform(train_df['site_seq_list'])
mean_embeddings_test = mean_sen_vec.transform(test_df['site_seq_list'])

idf_mean_sen_vec.fit(train_df['site_seq_list'])

idf_mean_embeddings_train = idf_mean_sen_vec.transform(train_df['site_seq_list'])
idf_mean_embeddings_test = idf_mean_sen_vec.transform(test_df['site_seq_list'])

In [97]:
%%time

# use all data and train model
params = {
    'C': 1,
    'random_state': 23,
    'n_jobs': -1
}

lr_model_mean = LogisticRegression(**params)
lr_model_mean.fit(mean_embeddings_train, train_df['target'])

lr_model_mean_idf = LogisticRegression(**params)
lr_model_mean_idf.fit(idf_mean_embeddings_train, train_df['target'])

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


LogisticRegression(C=1, n_jobs=-1, random_state=23)

In [98]:
# submission
pred_mean = lr_model_mean.predict_proba(mean_embeddings_test)[:, 1]
pred_mean_idf = lr_model_mean_idf.predict_proba(idf_mean_embeddings_test)[:, 1]

In [68]:
prediction_to_submission(
    test_df['session_id'], pred_mean, 'mean_embeddings.csv'
)

prediction_to_submission(
    test_df['session_id'], pred_mean_idf, 'idf_mean_embeddings.csv'
)

- mean_embeddings: `0.89886`
- idf_mean_embeddings: `0.89501`

**Neural Net**

In [100]:
# train a model
history = model.fit(
    idf_mean_embeddings_train,
    train_df['target'], 
    batch_size=128,
    epochs=10,
    verbose=1
)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


In [101]:
# get predictions for test data
y_pred_nn_idf_mean_test = model.predict(mean_embeddings_test, batch_size=128)

prediction_to_submission(
    test_df['session_id'], np.squeeze(y_pred_nn_idf_mean_test).tolist(), 'idf_mean_embeddings_nn.csv'
)

- Performance on Leaderboard (Test): `0.90333` -> the best model

### Further Potential
We trained a model only on embeddings that allowed drastically reducing dimensionality. However, we have only embedding features and other features can be created as well. For example:
- Time related features:
    - Month, day, time of day, ...
    - Time between sessions
    - ...