<h3><center>Feature Engineering and Anomalies - Outliers Detection</center></h3>
This notebook is devoted to feature exploration. Its results come from experimentation and are not used directly. At a later stage, the findings of this notebook are applied to both the training and the testing data, before a model is fitted and tested.

In [18]:
import matplotlib.pyplot as plt 
from matplotlib import style  
style.use("ggplot")

import pandas as pd

from commons import load_data, get_dummies

In [23]:
# load data
filename = "data/traindata.csv"
data = load_data(filename)

## feature engineering "day" feature

In [4]:
data.head()

Unnamed: 0,id,purpose_validated,dur_in_sec,hour_start,hour_end,day_of_month,week,dist_gtfsstops_all,dist_gtfsstops_train,dist_gtfsstops_tram,dist_gtfsstops_bus,day
0,0,home,1017.087,8,9,2,5,126.188532,574.259045,10496.593564,126.188532,3
1,1,leisure,310.0,0,0,14,2,411.47895,3388.489009,24408.713071,411.47895,6
2,2,leisure,4993.099,16,17,23,51,242.799218,420.061325,242.799218,242.799218,4
3,3,errand,422.515001,11,11,29,26,73.118524,1757.238956,16999.384112,73.118524,3
4,4,leisure,9673.579,18,21,28,4,644.638357,644.638357,45432.181046,2333.563174,5


In [24]:
# weekdays vs weekends feature.
weekend = [5,6]
data['weekend'] = 0
data.loc[data['day'].isin(weekend),'weekend'] = 1

# After these commands we have created a new feature called "weekend", which has only two values,
# either 1 or 0. The value 1 corresponds to days 5 and 6 of the feature 'day', while the
# value 0 corresponds to every other day of the week.

In [6]:
data.head(n=6)

Unnamed: 0,id,purpose_validated,dur_in_sec,hour_start,hour_end,day_of_month,week,dist_gtfsstops_all,dist_gtfsstops_train,dist_gtfsstops_tram,dist_gtfsstops_bus,day,weekend
0,0,home,1017.087,8,9,2,5,126.188532,574.259045,10496.593564,126.188532,3,0
1,1,leisure,310.0,0,0,14,2,411.47895,3388.489009,24408.713071,411.47895,6,1
2,2,leisure,4993.099,16,17,23,51,242.799218,420.061325,242.799218,242.799218,4,0
3,3,errand,422.515001,11,11,29,26,73.118524,1757.238956,16999.384112,73.118524,3,0
4,4,leisure,9673.579,18,21,28,4,644.638357,644.638357,45432.181046,2333.563174,5,1
5,5,leisure,331.453,6,6,1,13,20.64124,20.64124,1549.041373,69.659482,5,1


Since the feature 'day' contains categorical values, it is necessary to transform them into dummy variables using one-hot encoding. This prevents the model from treating higher values (like Friday = 4) as more important than others.

In [25]:
# transform to dummy
data = get_dummies(data, columns = ["day"])  #Convert categorical variable into dummy/indicator variables

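After this command the feature 'day' is split into six indicator features, day_0 to day_5. Day 6 gets no column of its own and shows up as a row of all zeros, which suggests that get_dummies drops one level as the reference category.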

In [8]:
data.head(n=6)

Unnamed: 0,id,purpose_validated,dur_in_sec,hour_start,hour_end,day_of_month,week,dist_gtfsstops_all,dist_gtfsstops_train,dist_gtfsstops_tram,dist_gtfsstops_bus,weekend,day_0,day_1,day_2,day_3,day_4,day_5
0,0,home,1017.087,8,9,2,5,126.188532,574.259045,10496.593564,126.188532,0,0,0,0,1,0,0
1,1,leisure,310.0,0,0,14,2,411.47895,3388.489009,24408.713071,411.47895,1,0,0,0,0,0,0
2,2,leisure,4993.099,16,17,23,51,242.799218,420.061325,242.799218,242.799218,0,0,0,0,0,1,0
3,3,errand,422.515001,11,11,29,26,73.118524,1757.238956,16999.384112,73.118524,0,0,0,0,1,0,0
4,4,leisure,9673.579,18,21,28,4,644.638357,644.638357,45432.181046,2333.563174,1,0,0,0,0,0,1
5,5,leisure,331.453,6,6,1,13,20.64124,20.64124,1549.041373,69.659482,1,0,0,0,0,0,1
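
The `get_dummies` helper comes from `commons` and is not shown in this notebook, so the following is only a minimal sketch of the same idea with pandas' own `pd.get_dummies`. The toy frame and the choice to drop `day_6` are assumptions made to mirror the output above; the real helper may work differently.

```python
import pandas as pd

# Toy frame (hypothetical values); 'day' runs from 0 to 6.
toy = pd.DataFrame({"day": [3, 6, 4, 3, 5, 5, 0, 1, 2]})

# Declaring all seven levels keeps the columns stable even if a day is missing.
toy["day"] = pd.Categorical(toy["day"], categories=list(range(7)))

# One indicator column per level: day_0 ... day_6.
full = pd.get_dummies(toy, columns=["day"], dtype=int)

# Drop one column to remove the redundancy. Here day_6 becomes the
# all-zero reference level (an assumption that matches the output above).
reduced = full.drop(columns="day_6")

print(reduced)
```

A row with `day == 6` ends up with zeros in every remaining column, which is the pattern visible in row 1 of the table above.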


## feature engineering "hour_start" feature

The commands that follow split the feature 'hour_start' into eight equal-width groups ('start hour bins'). Each example gets the value 1 for the bin its start hour falls into and the value 0 for every other bin. In addition, 'hour_start' itself is converted to dummy variables, as in the previous example. The column listing further down shows seven bin columns rather than eight, which is consistent with get_dummies dropping one level, as it did for 'day'.

In [26]:
hour_start_bins = pd.cut(data.hour_start, bins=8)
data['start_hour_bins'] = hour_start_bins
data = get_dummies(data, ['start_hour_bins', 'hour_start'])
print(hour_start_bins.nunique())
#print(data.hour_start.nunique())

8


In [27]:
data.columns

Index(['id', 'purpose_validated', 'dur_in_sec', 'hour_end', 'day_of_month',
       'week', 'dist_gtfsstops_all', 'dist_gtfsstops_train',
       'dist_gtfsstops_tram', 'dist_gtfsstops_bus', 'weekend', 'day_0',
       'day_1', 'day_2', 'day_3', 'day_4', 'day_5',
       'start_hour_bins_(-0.023, 2.875]', 'start_hour_bins_(2.875, 5.75]',
       'start_hour_bins_(5.75, 8.625]', 'start_hour_bins_(8.625, 11.5]',
       'start_hour_bins_(11.5, 14.375]', 'start_hour_bins_(14.375, 17.25]',
       'start_hour_bins_(17.25, 20.125]', 'hour_start_0', 'hour_start_1',
       'hour_start_2', 'hour_start_3', 'hour_start_4', 'hour_start_5',
       'hour_start_6', 'hour_start_7', 'hour_start_8', 'hour_start_9',
       'hour_start_10', 'hour_start_11', 'hour_start_12', 'hour_start_13',
       'hour_start_14', 'hour_start_15', 'hour_start_16', 'hour_start_17',
       'hour_start_18', 'hour_start_19', 'hour_start_20', 'hour_start_21',
       'hour_start_22'],
      dtype='object')
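
The bin names above, such as `(-0.023, 2.875]`, follow from how `pd.cut` builds equal-width bins. A small sketch, assuming the start hours span 0 to 23 (the toy series below is a stand-in, not the real data): the range is divided into eight intervals of width 23 / 8 = 2.875, and the lowest edge is pushed down by 0.1% of the range so that the minimum value falls inside the first bin.

```python
import numpy as np
import pandas as pd

# Toy stand-in for data.hour_start: every hour from 0 to 23 once.
hours = pd.Series(range(24), name="hour_start")

# Same call as in the notebook: eight equal-width bins.
binned = pd.cut(hours, bins=8)

# The edges pd.cut starts from, before it lowers the first one slightly.
edges = np.linspace(hours.min(), hours.max(), 9)
width = edges[1] - edges[0]

print(binned.cat.categories)  # the eight intervals
print(width)                  # width of each bin
```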

In [29]:
data.head(n=5)

Unnamed: 0,id,purpose_validated,dur_in_sec,hour_end,day_of_month,week,dist_gtfsstops_all,dist_gtfsstops_train,dist_gtfsstops_tram,dist_gtfsstops_bus,...,hour_start_13,hour_start_14,hour_start_15,hour_start_16,hour_start_17,hour_start_18,hour_start_19,hour_start_20,hour_start_21,hour_start_22
0,0,home,1017.087,9,2,5,126.188532,574.259045,10496.593564,126.188532,...,0,0,0,0,0,0,0,0,0,0
1,1,leisure,310.0,0,14,2,411.47895,3388.489009,24408.713071,411.47895,...,0,0,0,0,0,0,0,0,0,0
2,2,leisure,4993.099,17,23,51,242.799218,420.061325,242.799218,242.799218,...,0,0,0,1,0,0,0,0,0,0
3,3,errand,422.515001,11,29,26,73.118524,1757.238956,16999.384112,73.118524,...,0,0,0,0,0,0,0,0,0,0
4,4,leisure,9673.579,21,28,4,644.638357,644.638357,45432.181046,2333.563174,...,0,0,0,0,0,1,0,0,0,0


## day_of_month feature

In [31]:
# simplest option: just transform to dummy variables
# alternative: bin into start, middle, end of month (people may go out more at the beginning of the month, when they have money)
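
The binning idea from the comment could look roughly like this. It is only a sketch: the cut points (days 1-10, 11-20, 21-31), the label names and the `month_part` column are hypothetical choices, not something decided in this notebook.

```python
import pandas as pd

# Toy values taken from the day_of_month column shown earlier.
toy = pd.DataFrame({"day_of_month": [2, 14, 23, 29, 28, 1]})

# Hypothetical cut points: (0, 10] = start, (10, 20] = middle, (20, 31] = end.
toy["month_part"] = pd.cut(
    toy["day_of_month"],
    bins=[0, 10, 20, 31],
    labels=["start", "middle", "end"],
)

# One indicator column per part of the month.
month_part_dummies = pd.get_dummies(toy["month_part"], prefix="month_part", dtype=int)

print(toy.join(month_part_dummies))
```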

## Week Feature

In [None]:
# TODO: week to month and then month to season
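
One possible way to approach this TODO, as a rough sketch only. The data has no year column, so `ASSUMED_YEAR` is a made-up placeholder. The helper names, the rule of assigning a week to the month of its Thursday, and the northern-hemisphere season mapping are assumptions as well.

```python
import datetime
import pandas as pd

ASSUMED_YEAR = 2017  # hypothetical: the data does not state the year
SEASONS = ["winter", "spring", "summer", "autumn"]  # northern-hemisphere assumption


def week_to_month(week, year=ASSUMED_YEAR):
    """Return the month that contains the Thursday of the given ISO week."""
    return datetime.date.fromisocalendar(year, int(week), 4).month


def month_to_season(month):
    """Map Dec-Feb to winter, Mar-May to spring, Jun-Aug to summer, Sep-Nov to autumn."""
    return SEASONS[(month % 12) // 3]


toy = pd.DataFrame({"week": [1, 26, 40, 52]})
toy["month"] = toy["week"].apply(week_to_month)
toy["season"] = toy["month"].apply(month_to_season)

print(toy)
```

Note that `fromisocalendar` needs Python 3.8 or later and raises an error for week 53 in years that have only 52 ISO weeks.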

## dist_* features

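I already have the min distance. I can also add the max, mean and sum of the distances.
The cell below adds the max and the mean, i.e. two more features (the anomaly output further down shows a dist_sum column instead of dist_mean, apparently from an earlier run of this cell).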

In [35]:
data["dist_max"] = data[["dist_gtfsstops_train","dist_gtfsstops_tram","dist_gtfsstops_bus"]].max(axis=1)
data["dist_mean"] = data[["dist_gtfsstops_train","dist_gtfsstops_tram","dist_gtfsstops_bus"]].mean(axis=1)
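
For reference, here is a small sketch of the row-wise aggregates on a toy frame built from the first and fifth rows shown earlier. The `dist_sum` column name is taken from the anomaly output further down; `dist_min` and the `dist_cols` list are hypothetical additions for illustration.

```python
import pandas as pd

dist_cols = ["dist_gtfsstops_train", "dist_gtfsstops_tram", "dist_gtfsstops_bus"]

# Two rows copied from the data.head() output above.
toy = pd.DataFrame({
    "dist_gtfsstops_train": [574.259045, 644.638357],
    "dist_gtfsstops_tram": [10496.593564, 45432.181046],
    "dist_gtfsstops_bus": [126.188532, 2333.563174],
})

# axis=1 aggregates across the three distance columns of each row.
toy["dist_min"] = toy[dist_cols].min(axis=1)
toy["dist_max"] = toy[dist_cols].max(axis=1)
toy["dist_mean"] = toy[dist_cols].mean(axis=1)
toy["dist_sum"] = toy[dist_cols].sum(axis=1)

print(toy[["dist_min", "dist_max", "dist_mean", "dist_sum"]])
```

The mean and the sum differ only by a constant factor of three, so a linear model gains nothing from having both.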

### Anomaly detection

<h5>At this point we look for inconsistencies in our data</h5>

In [118]:
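from sklearn.neighbors import LocalOutlierFactor
# contamination is the expected fraction of outliers (here 0.2% of the rows)
clf = LocalOutlierFactor(n_neighbors=20, contamination=0.002)
# fit_predict returns 1 for inliers and -1 for outliers
y_pred = clf.fit_predict(data[["dur_in_sec","dist_gtfsstops_train","dist_gtfsstops_tram","dist_gtfsstops_bus"]])
data['is_anomaly'] = y_pred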
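
To see what `fit_predict` returns, here is a minimal sketch on synthetic data. Everything in it is made up for illustration: the cluster, the two far-away points and the `contamination` value are not related to the real trips.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(0)

# 200 points around the origin plus two points placed far away from them.
inliers = rng.normal(loc=0.0, scale=1.0, size=(200, 2))
outliers = np.array([[100.0, 100.0], [-100.0, 100.0]])
X = np.vstack([inliers, outliers])

# contamination sets the fraction of points that will be flagged.
lof = LocalOutlierFactor(n_neighbors=20, contamination=0.01)
labels = lof.fit_predict(X)            # 1 = inlier, -1 = outlier
scores = lof.negative_outlier_factor_  # more negative = more anomalous

flagged = np.where(labels == -1)[0]
print(flagged)
print(scores[flagged])
```

Because `contamination` fixes the share of flagged rows in advance, the method always reports roughly that fraction as outliers, whether or not the data really contains that many.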

In [120]:
data.loc[data['is_anomaly'] == -1]

Unnamed: 0,id,purpose_validated,dur_in_sec,hour_end,day_of_month,week,dist_gtfsstops_all,dist_gtfsstops_train,dist_gtfsstops_tram,dist_gtfsstops_bus,...,hour_start_16,hour_start_17,hour_start_18,hour_start_19,hour_start_20,hour_start_21,hour_start_22,dist_max,dist_sum,is_anomaly
168,168,wait,2020.306999,11,5,1,14501380.0,14501380.0,14773590.0,14646570.0,...,0,0,0,0,0,0,0,14773590.0,43921530.0,-1
171,171,leisure,764.063999,12,11,49,42.95447,86.04259,148000.2,81.98159,...,0,0,0,0,0,0,0,148000.2,148168.2,-1
404,404,leisure,872.068,12,5,1,14502150.0,14502150.0,14774310.0,14647240.0,...,0,0,0,0,0,0,0,14774310.0,43923690.0,-1
1685,1685,leisure,353.017,19,18,11,51.31761,72.2422,147922.3,51.31761,...,0,0,0,1,0,0,0,147922.3,148045.9,-1
1848,1848,errand,1159.506,9,18,7,14549950.0,14549950.0,14822900.0,14696500.0,...,0,0,0,0,0,0,0,14822900.0,44069350.0,-1
2374,2374,home,-171448.67,8,24,25,307.5106,2645.471,33102.38,307.5106,...,0,0,0,0,0,0,0,33102.38,36055.36,-1
2726,2726,wait,546.0,8,27,9,44.07361,7527.494,139209.9,2669.291,...,0,0,0,0,0,0,0,139209.9,149406.6,-1
3405,3405,leisure,414.962,13,8,32,36.72308,9930.756,145402.9,36.72308,...,0,0,0,0,0,0,0,145402.9,155370.4,-1
3686,3686,wait,725.0,8,29,52,11.62068,119.3461,147043.5,72.25561,...,0,0,0,0,0,0,0,147043.5,147235.1,-1
3734,3734,wait,918.999,9,23,16,38.93122,47.38335,147910.4,38.93122,...,0,0,0,0,0,0,0,147910.4,147996.7,-1


In [121]:
# just remove by id
list(data.loc[data['is_anomaly'] == -1, 'id']) # this is a list containing the ids
                                               # of the inconsistent examples.

[168, 171, 404, 1685, 1848, 2374, 2726, 3405, 3686, 3734]
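
Removing the flagged examples by id could then be done as in the sketch below. The toy frame is hypothetical, and dropping the `is_anomaly` helper column afterwards is my own choice, not something the notebook does.

```python
import pandas as pd

# Toy frame with the same two columns used above (made-up rows).
toy = pd.DataFrame({
    "id": [167, 168, 169, 171],
    "is_anomaly": [1, -1, 1, -1],
})

# Collect the ids of the flagged rows.
anomaly_ids = toy.loc[toy["is_anomaly"] == -1, "id"].tolist()

# Keep every row whose id is not in that list, then drop the helper column.
cleaned = (
    toy[~toy["id"].isin(anomaly_ids)]
    .drop(columns="is_anomaly")
    .reset_index(drop=True)
)

print(anomaly_ids)
print(cleaned)
```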