Idea parking space:

* Extract features from URL



What I've done so far: 

* Load in the data
* Inspect the data
* Look for unique values with `value_counts()`
* Look for missing values with `df.info()`
* Drop NaN values (todo: refactor this into the pipeline)
* Create targets and remove from df

# Bot or not v2

This is version 2 of the bot-or-not framework. The goal is to incorporate more features and to put all preprocessing and modelling steps into a single pipeline.

In [34]:
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split

from sklearn.preprocessing import OneHotEncoder
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier

from sklearn.base import BaseEstimator
from sklearn.base import TransformerMixin

## Load in the data

In [35]:
filename = "../../../data/bot-or-not-clickdata.csv"
df = pd.read_csv(filename)

The data contains the following columns:

* `epoch_ms`
* `session_id`
* `country_by_ip_address`
* `region_by_ip_address`
* `url_without_parameters`
* `referrer_without_parameters`
* `visitor_recognition_type`
* `ua_agent_class`


In [36]:
df.head()

Unnamed: 0,epoch_ms,session_id,country_by_ip_address,region_by_ip_address,url_without_parameters,referrer_without_parameters,visitor_recognition_type,ua_agent_class
0,1520280001034,be73c8d1b836170a21529a1b23140f8e,US,CA,https://www.bol.com/nl/l/nederlandstalige-kuns...,,ANONYMOUS,Robot
1,1520280001590,c24c6637ed7dcbe19ad64056184212a7,US,CA,https://www.bol.com/nl/l/italiaans-natuur-wete...,,ANONYMOUS,Robot
2,1520280002397,ee391655f5680a7bfae0019450aed396,IT,LI,https://www.bol.com/nl/p/nespresso-magimix-ini...,https://www.bol.com/nl/p/nespresso-magimix-ini...,ANONYMOUS,Browser
3,1520280002598,f8c8a696dd37ca88233b2df096afa97f,US,CA,https://www.bol.com/nl/l/nieuwe-engelstalige-o...,,ANONYMOUS,Robot
4,1520280004428,f8b0c06747b7dd1d53c0932306bd04d6,US,CA,https://www.bol.com/nl/l/nieuwe-actie-avontuur...,,ANONYMOUS,Robot Mobile


# Preprocessing

## Drop NaNs

In [37]:
mask = df['region_by_ip_address'].isnull()
df = df.loc[~mask]
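The mask-based drop can also be written with `dropna(subset=...)`, which may be easier to move into a pipeline step later. A minimal sketch on a made-up toy frame (only the column names come from the data above):

```python
import numpy as np
import pandas as pd

# Toy frame with the same column names as the click data (values are made up).
toy = pd.DataFrame({
    "region_by_ip_address": ["CA", np.nan, "LI"],
    "country_by_ip_address": ["US", "NL", "IT"],
})

# Boolean-mask version, as used in the cell above.
masked = toy.loc[~toy["region_by_ip_address"].isnull()]

# Equivalent one-liner: drop rows where the listed column is missing.
dropped = toy.dropna(subset=["region_by_ip_address"])

print(masked.equals(dropped))
```

Both versions keep the original index labels, which is why `df.info()` can report fewer entries than the highest index value.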

Let's check for missing data with `df.info()`:

In [38]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 49886 entries, 0 to 59780
Data columns (total 8 columns):
epoch_ms                       49886 non-null int64
session_id                     49886 non-null object
country_by_ip_address          49886 non-null object
region_by_ip_address           49886 non-null object
url_without_parameters         49886 non-null object
referrer_without_parameters    12838 non-null object
visitor_recognition_type       49886 non-null object
ua_agent_class                 49886 non-null object
dtypes: int64(1), object(7)
memory usage: 3.4+ MB


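Before the drop, we had missing values in:
* `country_by_ip_address`
* `region_by_ip_address`
* `referrer_without_parameters`

After dropping the rows without a region, the `df.info()` output shows that only `referrer_without_parameters` is still incomplete (12838 non-null values out of 49886 rows).

First, we come up with a very simple model:

* We drop the column `region_by_ip_address`
* We drop the column `referrer_without_parameters`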

## Create target/labels

Let's check what categories we have:

In [39]:
df['ua_agent_class'].value_counts()

Browser              26667
Robot                15852
Robot Mobile          5115
Browser Webview       1454
Hacker                 690
Special                102
Mobile App               4
Cloud Application        2
Name: ua_agent_class, dtype: int64

We turn these classes into binary labels: the bot classes (`Robot`, `Robot Mobile`, `Special`, `Cloud Application`) become 1 and every other class becomes 0.

In [40]:
def class_to_bot(agent):
    if agent in ["Robot", "Robot Mobile", "Special", "Cloud Application"]: 
        return 1
    else: 
        return 0
    
df['target'] = df['ua_agent_class'].apply(class_to_bot)

df.head()

Unnamed: 0,epoch_ms,session_id,country_by_ip_address,region_by_ip_address,url_without_parameters,referrer_without_parameters,visitor_recognition_type,ua_agent_class,target
0,1520280001034,be73c8d1b836170a21529a1b23140f8e,US,CA,https://www.bol.com/nl/l/nederlandstalige-kuns...,,ANONYMOUS,Robot,1
1,1520280001590,c24c6637ed7dcbe19ad64056184212a7,US,CA,https://www.bol.com/nl/l/italiaans-natuur-wete...,,ANONYMOUS,Robot,1
2,1520280002397,ee391655f5680a7bfae0019450aed396,IT,LI,https://www.bol.com/nl/p/nespresso-magimix-ini...,https://www.bol.com/nl/p/nespresso-magimix-ini...,ANONYMOUS,Browser,0
3,1520280002598,f8c8a696dd37ca88233b2df096afa97f,US,CA,https://www.bol.com/nl/l/nieuwe-engelstalige-o...,,ANONYMOUS,Robot,1
4,1520280004428,f8b0c06747b7dd1d53c0932306bd04d6,US,CA,https://www.bol.com/nl/l/nieuwe-actie-avontuur...,,ANONYMOUS,Robot Mobile,1
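The same labelling can be done without `apply`, using `isin`. This is only a sketch on a toy series standing in for `df['ua_agent_class']`; the list of bot classes is copied from `class_to_bot`.

```python
import pandas as pd

# Classes treated as bots, copied from class_to_bot above.
BOT_CLASSES = ["Robot", "Robot Mobile", "Special", "Cloud Application"]

# Toy series standing in for df['ua_agent_class'].
agents = pd.Series(["Robot", "Browser", "Robot Mobile", "Hacker", "Special"])

# isin() returns booleans; astype(int) turns them into 1 (bot) / 0 (not bot).
target = agents.isin(BOT_CLASSES).astype(int)
print(target.tolist())  # [1, 0, 1, 0, 1]
```

Note that `Hacker` ends up as 0 under this mapping, just as it does in `class_to_bot`.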


In [41]:
df = df.drop(columns=['ua_agent_class'])

# Feature engineering



In [48]:
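# pop() removes 'target' from df in place, so running this cell a second
# time raises KeyError: 'target'. The guard makes the cell safe to re-run.
if 'target' in df.columns:
    y = df.pop('target')
X = df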

In [49]:
df.head()

Unnamed: 0,epoch_ms,session_id,country_by_ip_address,region_by_ip_address,url_without_parameters,referrer_without_parameters,visitor_recognition_type
0,1520280001034,be73c8d1b836170a21529a1b23140f8e,US,CA,https://www.bol.com/nl/l/nederlandstalige-kuns...,,ANONYMOUS
1,1520280001590,c24c6637ed7dcbe19ad64056184212a7,US,CA,https://www.bol.com/nl/l/italiaans-natuur-wete...,,ANONYMOUS
2,1520280002397,ee391655f5680a7bfae0019450aed396,IT,LI,https://www.bol.com/nl/p/nespresso-magimix-ini...,https://www.bol.com/nl/p/nespresso-magimix-ini...,ANONYMOUS
3,1520280002598,f8c8a696dd37ca88233b2df096afa97f,US,CA,https://www.bol.com/nl/l/nieuwe-engelstalige-o...,,ANONYMOUS
4,1520280004428,f8b0c06747b7dd1d53c0932306bd04d6,US,CA,https://www.bol.com/nl/l/nieuwe-actie-avontuur...,,ANONYMOUS


In [33]:
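class CustomTransformer(BaseEstimator, TransformerMixin):
    """Template for a custom transformer; transform() returns a copy of X."""

    def __init__(self):
        # __init__ must not return a value; there are no parameters to store.
        pass

    def fit(self, X, y=None):
        # Nothing to learn, so fit is a no-op.
        return self

    def transform(self, X, y=None):
        # Work on a copy so the caller's data is left untouched.
        X_ = X.copy()
        return X_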

In [55]:
class FeatureSelector(BaseEstimator, TransformerMixin):
    """Transformer that selects a particular feature."""

    def __init__(self, feature_names):
        self.feature_names = feature_names
        
    def fit(self, X, y=None):
        return self
        
    def transform(self, X):
        return X[self.feature_names]
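A quick check of how the selector behaves, on made-up toy rows. The class is repeated so the snippet runs on its own. The difference between passing a list and passing a string matters, because most scikit-learn estimators expect two-dimensional input.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin


class FeatureSelector(BaseEstimator, TransformerMixin):
    """Same selector as above, repeated so the example runs on its own."""

    def __init__(self, feature_names):
        self.feature_names = feature_names

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return X[self.feature_names]


# Toy rows; column names match the click data, values are made up.
toy = pd.DataFrame({
    "country_by_ip_address": ["US", "IT", "US"],
    "visitor_recognition_type": ["ANONYMOUS"] * 3,
})

# A list of names keeps the result two-dimensional (a DataFrame) ...
as_frame = FeatureSelector(["country_by_ip_address"]).fit_transform(toy)
# ... while a bare string returns a one-dimensional Series.
as_series = FeatureSelector("country_by_ip_address").fit_transform(toy)

print(type(as_frame).__name__, as_frame.shape)
print(type(as_series).__name__, as_series.shape)
```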

In [57]:
class UrlLength(BaseEstimator, TransformerMixin):
    """Transformer that returns the character length of each URL."""

    def __init__(self, url):
        self.url = url
        
    def fit(self, X, y=None):
        return self
        
    def transform(self, X):
        # .str.len() gives one length per row; len(X[self.url]) would
        # return the number of rows instead.
        return X[self.url].str.len().to_frame()

In [58]:
from sklearn.pipeline import FeatureUnion

In [61]:
url_pipeline = Pipeline([
    ('selector', FeatureSelector(['url_without_parameters'])),
    ("length", UrlLength('url_without_parameters'))
])

url_pipeline.fit_transform(X, y)
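Length is only one thing that can be pulled out of a URL. As a sketch for the "extract features from URL" idea, the helper below splits the path into segments with the standard library. The function name `url_features`, the feature names, and the example URLs are all made up for illustration; the only thing taken from the data is that the second path segment is `l` or `p` in the rows shown earlier.

```python
from urllib.parse import urlparse

import pandas as pd


def url_features(url):
    """Return a few simple features for one URL (hypothetical feature set)."""
    path = urlparse(url).path
    segments = [s for s in path.split("/") if s]
    return {
        "url_length": len(url),
        "path_depth": len(segments),
        # Second path segment, e.g. 'l' or 'p' in the rows shown earlier.
        "page_type": segments[1] if len(segments) > 1 else "",
    }


# Made-up URLs that only mimic the shape of the real ones.
urls = pd.Series([
    "https://www.bol.com/nl/l/example-list/",
    "https://www.bol.com/nl/p/example-product/123/",
])

features = pd.DataFrame([url_features(u) for u in urls])
print(features)
```

`page_type` is categorical, so it would need one-hot encoding before it reaches the classifier.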

In [121]:
df = df.drop(columns=['epoch_ms', 'session_id', 'region_by_ip_address', 'referrer_without_parameters', 'url_without_parameters'])

In [122]:
df.head()

Unnamed: 0,country_by_ip_address,visitor_recognition_type,ua_agent_class
0,US,ANONYMOUS,Robot
1,US,ANONYMOUS,Robot
2,IT,ANONYMOUS,Browser
3,US,ANONYMOUS,Robot
4,US,ANONYMOUS,Robot Mobile


# Prepare data for ML algorithm

In [135]:
X_train, X_test, y_train, y_test = train_test_split(df, y, test_size=0.30)
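The split above is random on every run and not stratified. If reproducible numbers and equal class proportions in both sets are wanted, `random_state` and `stratify` can be passed. A sketch on toy data (the seed value 42 is arbitrary):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data: 20 rows, perfectly balanced labels (values are made up).
X_toy = pd.DataFrame({"country_by_ip_address": ["US", "IT"] * 10})
y_toy = pd.Series([1, 0] * 10)

# random_state fixes the shuffle; stratify keeps the class ratio in both sets.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.30, random_state=42, stratify=y_toy
)

print(len(X_tr), len(X_te))
print(y_tr.mean(), y_te.mean())
```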

In [144]:
pipe = Pipeline([
    ('ohe', OneHotEncoder(handle_unknown='ignore')), 
    ('clf', RandomForestClassifier(n_estimators=10))
])
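One possible way to get closer to the "everything in a single pipeline" goal is to combine the one-hot encoded categorical columns with the URL length through `FeatureUnion`. The sketch below is an assumption about how that could look, not the notebook's final design. It uses a variant of `UrlLength` that returns a NumPy column so that it stacks with the sparse one-hot output, and it runs on made-up toy rows.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import OneHotEncoder


class FeatureSelector(BaseEstimator, TransformerMixin):
    """Select one or more columns from a DataFrame."""

    def __init__(self, feature_names):
        self.feature_names = feature_names

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return X[self.feature_names]


class UrlLength(BaseEstimator, TransformerMixin):
    """Variant that returns the per-row URL length as an (n, 1) array."""

    def __init__(self, url):
        self.url = url

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return X[self.url].str.len().to_numpy().reshape(-1, 1)


categorical = ["country_by_ip_address", "visitor_recognition_type"]

# Each branch selects its own columns; FeatureUnion stacks the results.
features = FeatureUnion([
    ("categorical", Pipeline([
        ("selector", FeatureSelector(categorical)),
        ("ohe", OneHotEncoder(handle_unknown="ignore")),
    ])),
    ("url", Pipeline([
        ("selector", FeatureSelector(["url_without_parameters"])),
        ("length", UrlLength("url_without_parameters")),
    ])),
])

model = Pipeline([
    ("features", features),
    ("clf", RandomForestClassifier(n_estimators=10, random_state=0)),
])

# Toy rows: column names match the click data, values and URLs are made up.
toy = pd.DataFrame({
    "country_by_ip_address": ["US", "IT", "US", "IT"],
    "visitor_recognition_type": ["ANONYMOUS"] * 4,
    "url_without_parameters": [
        "https://www.bol.com/nl/l/example-a/",
        "https://www.bol.com/nl/p/example-b/",
        "https://www.bol.com/nl/l/example-cc/",
        "https://www.bol.com/nl/p/example-ddd/",
    ],
})
y_toy = [1, 0, 1, 0]

model.fit(toy, y_toy)
preds = model.predict(toy)
print(preds)
```

With this layout the raw DataFrame goes in at the top, so columns no longer have to be dropped by hand before fitting.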

In [145]:
pipe.fit(X_train, y_train)

Pipeline(memory=None,
         steps=[('ohe',
                 OneHotEncoder(categorical_features=None, categories=None,
                               drop=None, dtype=<class 'numpy.float64'>,
                               handle_unknown='ignore', n_values=None,
                               sparse=True)),
                ('clf',
                 RandomForestClassifier(bootstrap=True, class_weight=None,
                                        criterion='gini', max_depth=None,
                                        max_features='auto',
                                        max_leaf_nodes=None,
                                        min_impurity_decrease=0.0,
                                        min_impurity_split=None,
                                        min_samples_leaf=1, min_samples_split=2,
                                        min_weight_fraction_leaf=0.0,
                                        n_estimators=10, n_jobs=None,
                                        o

In [146]:
train_acc = pipe.score(X_train, y_train)
test_acc = pipe.score(X_test, y_test)

In [148]:
print("Accuracy on train set:", train_acc)
print("Accuracy on test set:", test_acc)

Accuracy on train set: 0.959077892325315
Accuracy on test set: 0.9573032206334358
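Accuracy is easier to judge next to a baseline. Using the class counts from the `value_counts()` output above, the bot classes add up to 21071 of 49886 rows, so a model that always predicts "not bot" would already be right on about 58% of the full data set. The sketch below reproduces that number with `DummyClassifier`. It uses the full data set rather than the 30% test split, so the baseline on the actual test set may differ slightly.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Class counts taken from the value_counts() output above.
n_bot = 15852 + 5115 + 102 + 2   # Robot, Robot Mobile, Special, Cloud Application
n_total = 49886
n_not_bot = n_total - n_bot

# Rebuild a label vector with the same class balance.
y_all = np.array([1] * n_bot + [0] * n_not_bot)
X_dummy = np.zeros((n_total, 1))  # features are ignored by the dummy model

# Always predict the most frequent class (0, "not bot").
baseline = DummyClassifier(strategy="most_frequent").fit(X_dummy, y_all)
baseline_acc = baseline.score(X_dummy, y_all)
print(round(baseline_acc, 3))  # 0.578
```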
