## Introduction

The goal of this notebook is to demonstrate how to use the time features in the TalkingData dataset, and in particular how to create a click_rate feature.

The idea behind the click_rate is that click farms may well be a lot faster than humans, so high click rates may be indicative of fraudulent clicks or downloads.

As a reminder:
 - Please check @advisor_yifan's great [post on kaggle](https://www.kaggle.com/c/talkingdata-adtracking-fraud-detection/discussion/54765)
 - If I had to choose one public kernel, it would be [this feature engineering notebook](https://www.kaggle.com/nanomathias/feature-engineering-importance-testing). It is an extensive evaluation of all sorts of features in the context of TalkingData. In my opinion it is a must-read.

At the end of this notebook you will know how to:
 - convert a date in string format into an easy-to-use pandas datetime feature
 - compute the click_rate for a given ip and app
 - hopefully get a better score on the LB with time features

What is included in this notebook is really advanced. 

Please don't be frustrated if you do not fully understand the pandas code; ask me questions instead. You can reach me directly on Slack at @advisor_olivier!

Have a good read.

In [1]:
import pandas as pd
import numpy as np
from sklearn.metrics import roc_auc_score, log_loss
from sklearn.linear_model import SGDClassifier
import time
import re
import gc

gc.enable()

Please change `file_path` so that it points to the train file on your system.

In [2]:
file_path = "../input/train.csv.zip"

Specify data types to limit memory usage

In [3]:
dtypes = {
        'ip': 'uint32',
        'app': 'uint16',
        'device': 'uint16',
        'os': 'uint16',
        'channel': 'uint16',
        'is_attributed': 'uint8'
    }
cols = [f_ for f_ in dtypes.keys()]
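
To get a feel for why the narrow dtypes matter, here is a minimal sketch on a small made-up frame. The `toy` frame, its values and the row count are assumptions for illustration only, not TalkingData itself.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; cast to int64 explicitly so the baseline
# is the same on every platform.
n_rows = 1000
toy = pd.DataFrame({
    'ip': np.arange(n_rows),
    'app': np.arange(n_rows) % 500,
    'is_attributed': np.zeros(n_rows, dtype=int),
}).astype('int64')

# Same narrow dtypes as in the dtypes dict above (subset of columns)
small_dtypes = {'ip': 'uint32', 'app': 'uint16', 'is_attributed': 'uint8'}
toy_small = toy.astype(small_dtypes)

# Bytes used by the column data only (index excluded)
bytes_default = toy.memory_usage(index=False).sum()
bytes_small = toy_small.memory_usage(index=False).sum()
print(bytes_default, bytes_small)
```

On this toy frame the three int64 columns take 8 bytes per value each, while the narrow versions take 4, 2 and 1 bytes, which is where the saving comes from.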

## Reading the first 10 million rows of TalkingData

In this section we will read the first rows of the TalkingData training file (a bit under 10 million, covering day 6 only) and show how to convert and use the date information they contain.

The time feature in the dataset is called **click_time**.

In [4]:
# Take only day 6
train = pd.read_csv(file_path, nrows=9308568, usecols=['ip', 'app', 'click_time', 'is_attributed'], dtype=dtypes)

Let's look at the time format used in the dataset

In [5]:
train.head(5)

Unnamed: 0,ip,app,click_time,is_attributed
0,83230,3,2017-11-06 14:32:21,0
1,17357,3,2017-11-06 14:33:34,0
2,35810,3,2017-11-06 14:34:12,0
3,45745,14,2017-11-06 14:34:52,0
4,161007,3,2017-11-06 14:35:08,0


Transform the date strings into pandas datetime values

In [6]:
train['click_time'] = pd.to_datetime(train['click_time'], format='%Y-%m-%d %H:%M:%S')
train.head()

Unnamed: 0,ip,app,click_time,is_attributed
0,83230,3,2017-11-06 14:32:21,0
1,17357,3,2017-11-06 14:33:34,0
2,35810,3,2017-11-06 14:34:12,0
3,45745,14,2017-11-06 14:34:52,0
4,161007,3,2017-11-06 14:35:08,0
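
Once the column holds datetime values, the `.dt` accessor and plain subtraction become available. The sketch below shows this on two made-up timestamps (the `toy_times` values are assumptions, written in the same format as `click_time`).

```python
import pandas as pd

# Two hypothetical timestamps in the same string format as click_time
toy_times = pd.Series(['2017-11-06 14:32:21', '2017-11-06 23:59:59'])
parsed = pd.to_datetime(toy_times, format='%Y-%m-%d %H:%M:%S')

# Components are exposed through the .dt accessor
hours = parsed.dt.hour
days = parsed.dt.day

# Subtracting two timestamps gives a Timedelta
gap_seconds = (parsed.iloc[1] - parsed.iloc[0]).total_seconds()
print(hours.tolist(), days.tolist(), gap_seconds)
```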


## Calculating the overall click_rate for an ip and app for each day

To create a click_rate per day for a given **ip** and **app** we need to compute the following:

`click_rate_per_day = (time_of_last_appearance - time_of_first_appearance) / number_of_occurrences` 

Note that this is a time span divided by a count, so it is expressed in seconds per click.

 Therefore we need to compute 
 - the time an ip and app first occurred during the day
 - the time it last occurred during the day
 - the number of occurrences during the day
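
As a quick worked example of the formula, the sketch below plugs in the first time, last time and count of one group that shows up in the outputs further down (ip 100002, app 13, day 6). It uses only the standard library; the variable names are mine.

```python
from datetime import datetime

# Values for ip 100002, app 13, day 6 as shown in the groupby output
first_seen = datetime(2017, 11, 6, 22, 5, 33)
last_seen = datetime(2017, 11, 6, 23, 25, 18)
n_clicks = 2

# Time span in seconds, then divided by the number of occurrences
span_seconds = (last_seen - first_seen).total_seconds()
click_rate_per_day = span_seconds / n_clicks
print(span_seconds, click_rate_per_day)
```

The result matches the value reported for this group in the click rate output.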
 
First let's create the `day` feature

In [7]:
train['day'] = train['click_time'].dt.day

Now we will create a feature that combines `ip`, `app` and `day`; this lets us make the computations for each day separately.

In [8]:
train['ip_app_day'] = train['ip'].apply(lambda x: str(x)) + '_' \
                      + train['app'].apply(lambda x: str(x)) + '_'\
                      + train['day'].apply(lambda x: str(x))  

You may wonder why I use an apply statement with a lambda function to convert `ip`, `app` and `day` into strings instead of a simple `.astype(str)`. The answer can be found in the next cells: with the pandas version used here, apply with a lambda function is a lot quicker!

In [9]:
%%timeit 
train['ip'].apply(lambda x: str(x))

1 loop, best of 3: 4.06 s per loop


In [10]:
%%timeit 
train['ip'].astype(str)

1 loop, best of 3: 14 s per loop
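
If you want to repeat this comparison outside a notebook, the sketch below uses the standard `timeit` module on a smaller random series. The series size and seed are arbitrary assumptions, and the timings will depend on your pandas version and machine, so the ranking seen above may not hold everywhere.

```python
import timeit

import numpy as np
import pandas as pd

# Hypothetical stand-in for the ip column
rng = np.random.RandomState(0)
ips = pd.Series(rng.randint(0, 300000, size=100000).astype('uint32'))

# Both approaches should produce the same strings
via_apply = ips.apply(lambda x: str(x))
via_astype = ips.astype(str)
same_result = via_apply.tolist() == via_astype.tolist()

# Total time for 3 runs of each approach
t_apply = timeit.timeit(lambda: ips.apply(lambda x: str(x)), number=3)
t_astype = timeit.timeit(lambda: ips.astype(str), number=3)
print("identical:", same_result)
print("apply: %.3f s, astype: %.3f s" % (t_apply, t_astype))
```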


Finally, to compute `first_time`, `last_time` and the number of occurrences we can use a groupby statement:

In [11]:
ip_day_stats = train[['ip_app_day', 'click_time']].groupby('ip_app_day').agg(['min', 'max', 'count'])
ip_day_stats.columns = ip_day_stats.columns.droplevel(0)
ip_day_stats.head()

ip_app_day,min,max,count
100002_10_6,2017-11-06 16:11:23,2017-11-06 16:11:23,1
100002_11_6,2017-11-06 16:11:49,2017-11-06 16:11:49,1
100002_12_6,2017-11-06 16:06:53,2017-11-06 23:39:24,44
100002_13_6,2017-11-06 22:05:33,2017-11-06 23:25:18,2
100002_14_6,2017-11-06 16:10:32,2017-11-06 23:39:09,6


What we just did is calculate, for each `ip_app_day` group, the first click time (i.e. the min), the last click time (i.e. the max) and the number of occurrences (i.e. the count).
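
As an aside, pandas can also group on several columns directly, which avoids building a string key. This is an alternative to the approach used in this notebook, not what the notebook does; the sketch below runs on a made-up four-row frame.

```python
import pandas as pd

# Hypothetical toy clicks: ip 1 clicks app 3 three times, ip 2 once
toy = pd.DataFrame({
    'ip': [1, 1, 1, 2],
    'app': [3, 3, 3, 3],
    'day': [6, 6, 6, 6],
    'click_time': pd.to_datetime([
        '2017-11-06 10:00:00', '2017-11-06 10:00:30',
        '2017-11-06 10:01:00', '2017-11-06 12:00:00',
    ]),
})

# Group on the three columns at once; the result has a MultiIndex
stats = toy.groupby(['ip', 'app', 'day'])['click_time'].agg(['min', 'max', 'count'])
print(stats)
```

The groups are then addressed by tuples such as `(1, 3, 6)`. Whether this is faster than the string key on the full data is not something this sketch measures.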

Now we need to compute the click rate from this

In [12]:
# Convert min and max to integer seconds since the epoch
ip_day_stats['max'] = ip_day_stats['max'].astype(np.int64) // 10**9
ip_day_stats['min'] = ip_day_stats['min'].astype(np.int64) // 10**9
# Compute the click rate
ip_day_stats['click_rate_per_ip_app_day'] = (ip_day_stats['max'] - ip_day_stats['min']) / ip_day_stats['count'] 
ip_day_stats['click_rate_per_ip_app_day'].head(10)

ip_app_day
100002_10_6       0.000000
100002_11_6       0.000000
100002_12_6     617.068182
100002_13_6    2392.500000
100002_14_6    4486.166667
100002_15_6     631.477273
100002_18_6     666.500000
100002_1_6     3931.428571
100002_21_6    4101.166667
100002_22_6       0.000000
Name: click_rate_per_ip_app_day, dtype: float64
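
The same quantity can also be obtained without casting to integers, by subtracting the datetime columns and calling `.dt.total_seconds()`. The sketch below is an alternative formulation, checked on two groups copied from the groupby output above; the small `stats` frame is built by hand for illustration.

```python
import pandas as pd

# Two groups copied from the groupby output (min, max, count)
stats = pd.DataFrame({
    'min': pd.to_datetime(['2017-11-06 16:06:53', '2017-11-06 16:11:23']),
    'max': pd.to_datetime(['2017-11-06 23:39:24', '2017-11-06 16:11:23']),
    'count': [44, 1],
}, index=['100002_12_6', '100002_10_6'])

# Subtracting datetimes gives timedeltas; convert them to seconds
span_seconds = (stats['max'] - stats['min']).dt.total_seconds()
rate = span_seconds / stats['count']
print(rate)
```

A group with a single click has identical first and last times, so its value is 0.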

Finally we need to put the calculated rates back into the training set. We will do this using the `.map` method, which looks up each row's `ip_app_day` key in the index of the computed rates.

In [13]:
train['click_rate'] = train['ip_app_day'].map(ip_day_stats['click_rate_per_ip_app_day'])
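
To make the lookup concrete, here is a minimal sketch of `.map` with a Series on made-up keys and rates (all values are assumptions). It also shows what happens to a key that is absent from the lookup, which would matter if the same rates were mapped onto another dataset.

```python
import pandas as pd

# Hypothetical lookup table: index holds the keys, values hold the rates
rates = pd.Series({'1_3_6': 20.0, '2_3_6': 0.0})

# Rows to enrich; the last key does not exist in the lookup
rows = pd.DataFrame({'ip_app_day': ['1_3_6', '2_3_6', '1_3_6', '9_9_6']})
rows['click_rate'] = rows['ip_app_day'].map(rates)
print(rows)
```

Keys missing from the lookup come back as NaN rather than raising an error.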

Now let's check if the click_rate over a day makes sense for the dataset

In [14]:
from sklearn.metrics import roc_auc_score
print("AUC score for click rate over a day : %.6f" % roc_auc_score(train['is_attributed'], train['click_rate']))

AUC score for click rate over a day : 0.213905


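The fact that the AUC score is below 0.5 means there is an inverse relation between the `click_rate` and the target. My reading is that this is expected: the higher the click rate, the more likely it is that a click farm is making the clicks, and such clicks rarely end in a download. One caveat: as computed above, the feature is a time span divided by a count, i.e. seconds per click, and groups with a single click get a value of 0, so this reading deserves a closer check.

Since AUC only depends on how the rows are ranked, using the negated feature as the score gives the actual AUC for this single feature: `1 - 0.213905 = 0.786095`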
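
The `1 - AUC` step can be checked numerically: negating a score reverses the ranking, and the two AUC values add up to 1. The sketch below uses a tiny made-up target and score (both are assumptions) and includes a tie on purpose.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Hypothetical labels and a score that is inversely related to them
y_true = np.array([0, 0, 1, 1, 0, 1])
score = np.array([5.0, 3.0, 1.0, 0.0, 4.0, 3.0])

auc = roc_auc_score(y_true, score)
auc_flipped = roc_auc_score(y_true, -score)
print(auc, auc_flipped, auc + auc_flipped)
```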