## Introduction

The goal of this notebook is to demonstrate how we can use the time features in the TalkingData dataset.

This notebook uses material made available in the following public kernels:
 - https://www.kaggle.com/nanomathias/feature-engineering-importance-testing
 - https://www.kaggle.com/asydorchuk/nextclick-calculation-without-hashing-trick
 - https://www.kaggle.com/anttip/talkingdata-wordbatch-fm-ftrl-lb-0-9769
 
If I had to choose one public kernel, I would pick [the first notebook](https://www.kaggle.com/nanomathias/feature-engineering-importance-testing). It is an extensive evaluation of all sorts of features in the context of TalkingData. In my opinion it is a must-read.

At the end of this notebook you will know how to:
 - convert a date in string format into an easy-to-use pandas datetime feature
 - easily parse a date in string format to get information like day, hour, time...
 - compute the time difference between 2 consecutive clicks originating from the same ip address
 - hopefully get a better score on the LB with time features

What is included in this notebook is really advanced. 

Please don't be frustrated if you do not fully understand the pandas code; ask me questions instead. You can reach me directly on Slack at @advisor_olivier!

Have a good read.

In [1]:
import pandas as pd
import numpy as np
from sklearn.metrics import roc_auc_score, log_loss
from sklearn.linear_model import SGDClassifier
import time
import re
import gc

gc.enable()

Please change the file_path so that it points to where the train file is on your system  

In [2]:
file_path = "../input/train.csv.zip"

Specify data types to limit memory usage

In [3]:
dtypes = {
        'ip': 'uint32',
        'app': 'uint16',
        'device': 'uint16',
        'os': 'uint16',
        'channel': 'uint16',
        'is_attributed': 'uint8'
    }
cols = list(dtypes)
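
To see why the explicit dtypes matter, here is a small sketch on synthetic data (the values below are made up, not taken from the competition file). It compares the memory footprint of default int64 columns with the smaller unsigned types used above.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for a few TalkingData columns (values are made up)
rng = np.random.RandomState(0)
n_rows = 100000
toy = pd.DataFrame({
    'ip': rng.randint(0, 300000, size=n_rows).astype('int64'),
    'app': rng.randint(0, 500, size=n_rows).astype('int64'),
    'is_attributed': rng.randint(0, 2, size=n_rows).astype('int64'),
})

# Same data stored with the compact types from the dtypes dictionary
small_dtypes = {'ip': 'uint32', 'app': 'uint16', 'is_attributed': 'uint8'}
toy_small = toy.astype(small_dtypes)

bytes_default = toy.memory_usage(index=False).sum()
bytes_small = toy_small.memory_usage(index=False).sum()
print("int64 columns  : %d bytes" % bytes_default)
print("compact columns: %d bytes" % bytes_small)
```

On this toy frame the compact version needs 7 bytes per row instead of 24. The conversion is only safe when every value fits in the chosen type.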

## Using dates in a pandas dataframe

In this section we will read a few rows in the TalkingData training file and show how you can convert and use the date information inside the training file.

The time feature in the dataset is called **click_time**

In [4]:
train = pd.read_csv(file_path, nrows=1000000, usecols=['ip', 'click_time', 'is_attributed'], dtype=dtypes)

Let's look at the time format used in the dataset

In [5]:
train.head(5)

Unnamed: 0,ip,click_time,is_attributed
0,83230,2017-11-06 14:32:21,0
1,17357,2017-11-06 14:33:34,0
2,35810,2017-11-06 14:34:12,0
3,45745,2017-11-06 14:34:52,0
4,161007,2017-11-06 14:35:08,0


The time is written as year, month and day, followed by hour, minute and second.

To convert a date string into a usable object within pandas, use the [pd.to_datetime()](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.to_datetime.html) function.

In [6]:
datetime = pd.to_datetime(train['click_time'])
datetime.head()

0   2017-11-06 14:32:21
1   2017-11-06 14:33:34
2   2017-11-06 14:34:12
3   2017-11-06 14:34:52
4   2017-11-06 14:35:08
Name: click_time, dtype: datetime64[ns]

In some cases you will need to give pandas the exact structure of the datetime field. To do that, use the following format codes:
 - %Y : is the year with 4 digits
 - %m : is the month in numerical format (1 up to 12)
 - %d : is the day of the month
 - %H : is the hour
 - %M : is the minutes
 - %S : is the seconds
 
In our case, the next cell shows the structure of the datetime string:

In [7]:
datetime = pd.to_datetime(train['click_time'], format='%Y-%m-%d %H:%M:%S')
datetime.head()

0   2017-11-06 14:32:21
1   2017-11-06 14:33:34
2   2017-11-06 14:34:12
3   2017-11-06 14:34:52
4   2017-11-06 14:35:08
Name: click_time, dtype: datetime64[ns]
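
An explicit format becomes necessary when the string is ambiguous. As an illustration on made-up strings (not from the dataset), a day-first date such as 06/11/2017 is only guaranteed to be read as 6 November if we spell out the format:

```python
import pandas as pd

# Made-up strings written day/month/year, unlike the TalkingData file
raw = pd.Series(['06/11/2017 14:32:21', '07/11/2017 09:05:00'])

# %d comes first, so 06/11 is parsed as the 6th of November
parsed = pd.to_datetime(raw, format='%d/%m/%Y %H:%M:%S')
print(parsed)
```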

The advantage of using the pandas datetime format is that you can add or subtract datetimes, which will soon become very handy!
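
As a quick illustration using two of the click times shown above: subtracting two timestamps yields a Timedelta, and adding a Timedelta to a timestamp yields a new timestamp.

```python
import pandas as pd

first_click = pd.Timestamp('2017-11-06 14:32:21')
second_click = pd.Timestamp('2017-11-06 14:33:34')

# Subtracting two timestamps gives a Timedelta
gap = second_click - first_click
print(gap)                  # 0 days 00:01:13
print(gap.total_seconds())  # 73.0

# Adding a Timedelta to a timestamp gives another timestamp
one_hour_later = first_click + pd.Timedelta(hours=1)
print(one_hour_later)       # 2017-11-06 15:32:21
```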

Another advantage is that pandas datetime comes with lots of attributes such as day, dayofweek, month, year, dayofyear...

To access these attributes you need to use the **.dt** property of pandas datetime objects like so :

In [8]:
if isinstance(datetime, pd.Series):
    datetime = datetime.to_frame()
datetime['year'] = datetime['click_time'].dt.year
datetime['month'] = datetime['click_time'].dt.month
datetime['day'] = datetime['click_time'].dt.day
datetime['day_of_week'] = datetime['click_time'].dt.dayofweek
datetime['day_of_year'] = datetime['click_time'].dt.dayofyear
datetime.head(5)

Unnamed: 0,click_time,year,month,day,day_of_week,day_of_year
0,2017-11-06 14:32:21,2017,11,6,0,310
1,2017-11-06 14:33:34,2017,11,6,0,310
2,2017-11-06 14:34:12,2017,11,6,0,310
3,2017-11-06 14:34:52,2017,11,6,0,310
4,2017-11-06 14:35:08,2017,11,6,0,310


All datetime related attributes are detailed [here](https://pandas.pydata.org/pandas-docs/stable/api.html#datetimelike-properties) 
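
The same pattern works for the time-of-day attributes. A minimal sketch on three of the click times that appear in this notebook:

```python
import pandas as pd

clicks = pd.to_datetime(pd.Series([
    '2017-11-06 14:32:21',
    '2017-11-06 14:33:34',
    '2017-11-06 16:05:59',
]))

# .dt exposes hour, minute and second just like year, month and day
time_parts = pd.DataFrame({
    'hour': clicks.dt.hour,
    'minute': clicks.dt.minute,
    'second': clicks.dt.second,
})
print(time_parts)
```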

Now, if you just need the hour of a click event, you may be tempted to use a simple string extraction technique like split or a regex. Let's time the three approaches.

In [9]:
%%timeit
x = pd.to_datetime(train['click_time'])
dt_hour = x.dt.hour

1 loop, best of 3: 336 ms per loop


In [10]:
%%timeit
split_hour = train['click_time'].apply(lambda x: int(x.split()[1].split(':')[0]))

1 loop, best of 3: 1.09 s per loop


In [11]:
%%timeit
re_hour = train['click_time'].apply(lambda x: int(re.findall(r' (\d\d):', x)[0]))

1 loop, best of 3: 1.61 s per loop


In conclusion, **pd.to_datetime** is the fastest technique here: about 3 times faster than split and almost 5 times faster than the regex in the timings above.
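
If you do want a pure string approach, a vectorized alternative to apply is the .str accessor. The sketch below assumes every string follows the fixed YYYY-MM-DD HH:MM:SS layout, so the hour always sits at character positions 11 and 12. It is not timed here, so measure it on your own machine before relying on it.

```python
import pandas as pd

click_time = pd.Series(['2017-11-06 14:32:21', '2017-11-06 09:05:00'])

# Characters 11 and 12 hold the hour in 'YYYY-MM-DD HH:MM:SS'
str_hour = click_time.str[11:13].astype(int)

# Same values as going through pd.to_datetime
dt_hour = pd.to_datetime(click_time).dt.hour
print(str_hour.tolist())
```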

## Feature engineering using click_time

Now that we can convert and use datetime object within pandas, let's have a look at what we can do with time in the context of this TalkingData competition.

As @avisor_yifan demonstrated in [his post on Kaggle](https://www.kaggle.com/c/talkingdata-adtracking-fraud-detection/discussion/54765), fraudulent clicks from click farms are what we are trying to detect.

The first thing that comes to mind is to compute the time between 2 clicks coming from the same **ip** address: surely a human like me clicks much more slowly than a click farm.

How can we compute this time difference between 2 clicks?

First you have to know that the samples in the training set and test set are ordered by click_time. So it becomes a matter of grouping samples per **ip** address and computing the time difference between 2 consecutive clicks.

First I will demonstrate how you can compute a time difference using the **shift** method of pandas Series 

In [12]:
train['click_time'] = pd.to_datetime(train['click_time'])

Now that **click_time** is a pandas datetime column, let's check what shift does

In [13]:
pd.concat([train['click_time'], train['click_time'].shift(1).head()], axis=1).head(5)

Unnamed: 0,click_time,click_time.1
0,2017-11-06 14:32:21,NaT
1,2017-11-06 14:33:34,2017-11-06 14:32:21
2,2017-11-06 14:34:12,2017-11-06 14:33:34
3,2017-11-06 14:34:52,2017-11-06 14:34:12
4,2017-11-06 14:35:08,2017-11-06 14:34:52


I used **pd.concat** to display the original click_time next to its shifted version so we can really see what **shift** does.

**shift(1)** simply delays samples by 1 row so the time that was at row 0 is now at row 1. The **NaT** is just the NaN version for time and means **Not a Time**.
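
A toy example (values made up) makes the direction of the shift explicit: shift(1) brings the previous row's value down, while shift(-1) brings the next row's value up.

```python
import pandas as pd

s = pd.Series([10, 20, 30, 40])

shifted = pd.DataFrame({
    'value': s,
    'previous': s.shift(1),   # value from the row above
    'next': s.shift(-1),      # value from the row below
})
print(shifted)
```

Note that the shifted integer columns become floats, because the missing value at the edge is stored as NaN.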

Calculating time difference between each row can be done using :

In [14]:
(train['click_time'] - train['click_time'].shift(1)).head()

0        NaT
1   00:01:13
2   00:00:38
3   00:00:40
4   00:00:16
Name: click_time, dtype: timedelta64[ns]

What we see here is the time difference between consecutive rows. The first row displays NaT since there is no previous value to compute a difference with.
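
For reference, pandas also offers a diff method that gives the same result as subtracting the shifted series, and a timedelta can be turned into a plain number of seconds with .dt.total_seconds(). A small sketch on the first five click times of the training file shown earlier:

```python
import pandas as pd

click_time = pd.to_datetime(pd.Series([
    '2017-11-06 14:32:21',
    '2017-11-06 14:33:34',
    '2017-11-06 14:34:12',
    '2017-11-06 14:34:52',
    '2017-11-06 14:35:08',
]))

# Two equivalent ways to get the gap to the previous row
manual = click_time - click_time.shift(1)
with_diff = click_time.diff()

# Convert the timedelta to a float number of seconds
seconds = with_diff.dt.total_seconds()
print(seconds.tolist())  # [nan, 73.0, 38.0, 40.0, 16.0]
```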

Now you need to understand that what we have just done is compute the time difference between each sample in the file and the sample just before it.

However what we want to do is compute the difference between consecutive samples coming from the same **ip** address.

This is where the **groupby** statement will help us.

Let's use the groupby statement on the **ip** address :

In [15]:
ip_groups = train.groupby('ip')
ip_groups

<pandas.core.groupby.DataFrameGroupBy object at 0x0000006AC9329FD0>

This creates a DataFrameGroupBy object. To understand what it is, we will use the **groups** attribute of DataFrameGroupBy object. 

Since we grouped samples by ip, the **groups** property returns a dictionary whose keys are the ip addresses and whose values contain the index of each sample in train that corresponds to that ip address.

In [16]:
for ip in ip_groups.groups.keys():
    print("Indexes for ip %6d : " % (ip), ip_groups.groups[ip])
    break

Indexes for ip      9 :  Int64Index([136178, 145734], dtype='int64')


To avoid getting a very long display I just displayed the first group information.

This group is for **ip** 9 and contains 2 samples whose indexes are 136178 and 145734.
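
The same structure on a tiny made-up frame, where it is easy to check by eye:

```python
import pandas as pd

# Made-up rows; only the ip column matters for the grouping
toy = pd.DataFrame({
    'ip': [9, 27, 9, 25, 27, 27],
    'click': ['a', 'b', 'c', 'd', 'e', 'f'],
})

toy_groups = toy.groupby('ip')

# groups maps each ip to the index labels of its rows
for ip, row_labels in toy_groups.groups.items():
    print("ip %2d -> rows %s" % (ip, list(row_labels)))
```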

Let's check that samples 136178 and 145734 really have their **ip** address equal to 9, using the **.loc** property of pandas DataFrames:

In [17]:
ip_9_indexes = [136178,145734]
train.loc[ip_9_indexes, :]

Unnamed: 0,ip,click_time,is_attributed
136178,9,2017-11-06 16:02:30,0
145734,9,2017-11-06 16:02:41,0


As you can see, these samples have **ip** 9 AND their click_time values are ordered.

Let's check a bigger group like ip 27:

In [18]:
print("Indexes for ip %6d : " % (27), ip_groups.groups[27])

Indexes for ip     27 :  Int64Index([311979, 312579, 312591, 312753, 430663, 430695, 547438, 678126,
            703169, 703170, 708679, 709258, 709430, 709546, 709635, 709654,
            710660, 761864, 773721, 773909, 774178],
           dtype='int64')


In [19]:
ip_27_indexes = [
    311979, 312579, 312591, 312753, 430663, 430695, 547438, 678126,
    703169, 703170, 708679, 709258, 709430, 709546, 709635, 709654,
    710660, 761864, 773721, 773909, 774178
]
train.loc[ip_27_indexes, :]

Unnamed: 0,ip,click_time,is_attributed
311979,27,2017-11-06 16:05:59,0
312579,27,2017-11-06 16:05:59,0
312591,27,2017-11-06 16:05:59,0
312753,27,2017-11-06 16:05:59,0
430663,27,2017-11-06 16:08:26,0
430695,27,2017-11-06 16:08:26,0
547438,27,2017-11-06 16:11:04,0
678126,27,2017-11-06 16:14:07,0
703169,27,2017-11-06 16:14:42,0
703170,27,2017-11-06 16:14:42,0


Again you can see that they all have **ip** 27 and they are ordered by **click_time**

Let's look at another group with ip 25 : 


In [20]:
print("Indexes for ip %6d : " % (25), ip_groups.groups[25])

Indexes for ip     25 :  Int64Index([301564, 328270, 328282, 371311, 374091, 479925, 480071, 549539,
            549571, 652614],
           dtype='int64')


In [21]:
ip_25_indexes = [301564, 328270, 328282, 371311, 374091, 479925, 480071, 549539, 549571, 652614]
train.loc[ip_25_indexes, :]

Unnamed: 0,ip,click_time,is_attributed
301564,25,2017-11-06 16:05:47,0
328270,25,2017-11-06 16:06:16,0
328282,25,2017-11-06 16:06:16,0
371311,25,2017-11-06 16:07:09,0
374091,25,2017-11-06 16:07:12,0
479925,25,2017-11-06 16:09:32,0
480071,25,2017-11-06 16:09:32,0
549539,25,2017-11-06 16:11:07,0
549571,25,2017-11-06 16:11:07,0
652614,25,2017-11-06 16:13:30,0


So the groupby statement creates a group of samples for each unique ip address in the file.
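
Instead of copying the index values by hand as done above, you can also ask the groupby object for a group directly with get_group. A minimal sketch on a few of the ip 25 and 27 rows shown above:

```python
import pandas as pd

toy = pd.DataFrame({
    'ip': [25, 27, 27, 25],
    'click_time': pd.to_datetime([
        '2017-11-06 16:05:47',
        '2017-11-06 16:05:59',
        '2017-11-06 16:05:59',
        '2017-11-06 16:06:16',
    ]),
})

# All rows for ip 25, with their original index labels preserved
ip_25 = toy.groupby('ip').get_group(25)
print(ip_25)
```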

If I want to shift the click_time for each ip, I just need to use the shift method on each group like so:

(I demonstrate this on ip 25 and 27 to make it easier to understand)

In [22]:
# First display ip 25 and 27 in the order of the index
train.loc[sorted(ip_25_indexes + ip_27_indexes)].head(8)

Unnamed: 0,ip,click_time,is_attributed
301564,25,2017-11-06 16:05:47,0
311979,27,2017-11-06 16:05:59,0
312579,27,2017-11-06 16:05:59,0
312591,27,2017-11-06 16:05:59,0
312753,27,2017-11-06 16:05:59,0
328270,25,2017-11-06 16:06:16,0
328282,25,2017-11-06 16:06:16,0
371311,25,2017-11-06 16:07:09,0


Now if we apply the shift method we obtain:

In [23]:
train.loc[sorted(ip_25_indexes + ip_27_indexes)].groupby('ip').shift(1).head(9)

Unnamed: 0,click_time,is_attributed
301564,NaT,
311979,NaT,
312579,2017-11-06 16:05:59,0.0
312591,2017-11-06 16:05:59,0.0
312753,2017-11-06 16:05:59,0.0
328270,2017-11-06 16:05:47,0.0
328282,2017-11-06 16:06:16,0.0
371311,2017-11-06 16:06:16,0.0
374091,2017-11-06 16:07:09,0.0


You can see that the click_time for the 1st occurrence of ip 25 and of ip 27 has been replaced by NaT (Not a Time).

Then for the second occurrence of ip 25 (index 328270), click_time has been replaced by the previous click_time value for ip 25, which was 2017-11-06 16:05:47.

What happens is that the shift has been performed for each ip individually, which is exactly what we wanted.
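
To convince yourself that the groups do not leak into each other, here is the same operation on a tiny frame built from a few of the ip 25 and 27 rows above, keeping only the click_time column:

```python
import pandas as pd

toy = pd.DataFrame({
    'ip': [25, 27, 27, 25, 25],
    'click_time': pd.to_datetime([
        '2017-11-06 16:05:47',
        '2017-11-06 16:05:59',
        '2017-11-06 16:05:59',
        '2017-11-06 16:06:16',
        '2017-11-06 16:06:16',
    ]),
})

# Previous click_time within the same ip; the first click of each ip gets NaT
toy['previous_click'] = toy.groupby('ip')['click_time'].shift(1)
print(toy)
```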

Now to compute the time difference between consecutive samples coming from the same ip, we do:

In [24]:
time_difference = train['click_time'] - train.groupby('ip')['click_time'].shift(1)
time_difference.tail(20) # I use tail since the first samples will all show NaT

999980   00:00:02
999981   00:00:13
999982   00:00:02
999983   00:00:00
999984   00:00:42
999985   00:00:08
999986   00:00:11
999987   00:12:37
999988   00:05:18
999989   00:00:00
999990   00:00:00
999991   00:00:05
999992   00:00:02
999993   00:00:00
999994   00:06:39
999995   00:00:04
999996   00:00:00
999997   00:05:43
999998   00:01:34
999999   00:00:00
Name: click_time, dtype: timedelta64[ns]
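
A model cannot consume a timedelta directly, so one possible follow-up (a suggestion, not part of the cells above) is to convert the gaps to seconds and pick a placeholder for the first click of each ip. The column name seconds_since_prev and the -1 placeholder are arbitrary choices for this sketch.

```python
import pandas as pd

toy = pd.DataFrame({
    'ip': [25, 27, 27, 25],
    'click_time': pd.to_datetime([
        '2017-11-06 16:05:47',
        '2017-11-06 16:05:59',
        '2017-11-06 16:08:26',
        '2017-11-06 16:06:16',
    ]),
})

# Gap to the previous click of the same ip (NaT for the first click)
gap = toy['click_time'] - toy.groupby('ip')['click_time'].shift(1)

# Seconds as a float, with -1 marking "no previous click" (arbitrary choice)
toy['seconds_since_prev'] = gap.dt.total_seconds().fillna(-1)
print(toy)
```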

Please take the time to read the notebook carefully. This is really advanced pandas material!

Be sure to ask me questions on this for points that are not clear to you.