## TalkingData AdTracking Fraud Detection Challenge
https://www.kaggle.com/c/talkingdata-adtracking-fraud-detection

1. This competition is a binary classification challenge. For each ad click, Kagglers predict whether the user goes on to download the app ('is_attributed'). Clicks that never lead to a download are the ones suspected of being fraudulent

2. The evaluation metric is ROC-AUC (the area under the Receiver Operating Characteristic curve)

3. There are three main datasets: train, test_supplement, test (for submission)
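
To make the metric concrete, here is a minimal sketch of ROC-AUC computed straight from its ranking interpretation: the probability that a randomly chosen positive is scored above a randomly chosen negative, with ties counting as half. The function name `roc_auc` and the toy labels and scores are made up for illustration. The double loop is quadratic, so on real data a library routine such as `sklearn.metrics.roc_auc_score` is the practical choice.

```python
def roc_auc(y_true, y_score):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half a win
    return wins / (len(pos) * len(neg))


# Toy example (illustrative values, not competition data)
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]
auc = roc_auc(y_true, y_score)
print(auc)  # 3 of 4 pairs are ordered correctly -> 0.75
```

Because the metric depends only on how the predictions are ranked, any monotonic rescaling of the scores leaves it unchanged.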


In [1]:
import pandas as pd
import numpy as np

# klearn import
from klearn.eda import PlotlyPlot

train_columns = ['ip', 'app', 'device', 'os', 'channel', 'click_time', 'is_attributed']
test_columns = ['ip', 'app', 'device', 'os', 'channel', 'click_time', 'click_id']
dtype = {
    'ip': 'uint32',
    'app': 'uint16',
    'device': 'uint16',
    'os': 'uint16',
    'channel': 'uint16',
    'is_attributed': 'uint16',
    'click_id': 'uint32'
}
# read data
df_train = pd.read_csv(
    filepath_or_buffer="../data/train.csv",
    usecols=train_columns,
    dtype=dtype,
    low_memory=True,
    parse_dates=['click_time'],
    infer_datetime_format=True,
#     skiprows=range(1, 59709852),
#     nrows=50000
)
df_test = pd.read_csv(
    filepath_or_buffer="../data/test_supplement.csv",
    usecols=test_columns,
    dtype=dtype,
    low_memory=True,
    parse_dates=['click_time'],
    infer_datetime_format=True,
#     skiprows=range(1,633),
#     nrows=10000
)
df_test_submit = pd.read_csv(
    filepath_or_buffer="../data/test.csv",
    usecols=test_columns,
    dtype=dtype,
    low_memory=True,
    parse_dates=['click_time'],
    infer_datetime_format=True,
)

## Let's look at file size and basic info

In [2]:
df_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 184903890 entries, 0 to 184903889
Data columns (total 7 columns):
ip               uint32
app              uint16
device           uint16
os               uint16
channel          uint16
click_time       datetime64[ns]
is_attributed    uint16
dtypes: datetime64[ns](1), uint16(5), uint32(1)
memory usage: 3.8 GB


In [3]:
df_test.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 57537505 entries, 0 to 57537504
Data columns (total 7 columns):
click_id      uint32
ip            uint32
app           uint16
device        uint16
os            uint16
channel       uint16
click_time    datetime64[ns]
dtypes: datetime64[ns](1), uint16(4), uint32(2)
memory usage: 1.3 GB


In [4]:
df_test_submit.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 18790469 entries, 0 to 18790468
Data columns (total 7 columns):
click_id      uint32
ip            uint32
app           uint16
device        uint16
os            uint16
channel       uint16
click_time    datetime64[ns]
dtypes: datetime64[ns](1), uint16(4), uint32(2)
memory usage: 430.1 MB


Wow! This is some really big data. Those without much memory on their machine will need some tricks (e.g. loading in chunks with `chunksize`, the hashing trick, online gradient descent (OGD)) to get through this challenge.
Here are some notes about the data:
1. We don't need to handle NaNs
2. Apart from 'click_time', all features are integer-typed

## Looking at the columns
According to the data page, our data contains:

- ip: ip address of click
- app: app id for marketing
- device: device type id of user mobile phone (e.g., iphone 6 plus, iphone 7, huawei mate 7, etc.)
- os: os version id of user mobile phone
- channel: channel id of mobile ad publisher
- click_time: timestamp of click (UTC)
- attributed_time: if the user downloads the app after clicking an ad, this is the time of the app download
- is_attributed: the target that is to be predicted, indicating the app was downloaded

In [5]:
df_train.head()

| | ip | app | device | os | channel | click_time | is_attributed |
|---|---|---|---|---|---|---|---|
| 0 | 83230 | 3 | 1 | 13 | 379 | 2017-11-06 14:32:21 | 0 |
| 1 | 17357 | 3 | 1 | 19 | 379 | 2017-11-06 14:33:34 | 0 |
| 2 | 35810 | 3 | 1 | 13 | 379 | 2017-11-06 14:34:12 | 0 |
| 3 | 45745 | 14 | 1 | 13 | 478 | 2017-11-06 14:34:52 | 0 |
| 4 | 161007 | 3 | 1 | 13 | 379 | 2017-11-06 14:35:08 | 0 |


Some notes:
1. 'is_attributed' is the target variable (only 1 and 0)
2. Data is organized in timestamp sequence (sorted by 'click_time' in ascending order)
3. Apart from 'click_time', all other variables are categorical, even though they are stored as integers.

In [6]:
# Let's do unique counts on other variables
cols = ['ip', 'app', 'device', 'os', 'channel']
uniques = [df_train[col].nunique() for col in cols]
log_uniques = [np.log(unique) for unique in uniques]
# plot
plotly_plot = PlotlyPlot(
    title='Number of Unique Values Per Feature', 
    yaxis={'title': 'log(unique count)'}, 
    xaxis={'title': 'features'})
plotly_plot.barplot(
    x=cols, 
    y=log_uniques, 
    text=uniques,
    marker_attribute={
        'color': 'lightblue',
        'line': {'width': 2}
    }
)

If you look at the data samples above, you'll notice that all these variables are encoded - meaning we don't know what the actual value corresponds to - each value has instead been assigned an ID which we're given. This has likely been done because data such as IP addresses are sensitive, although it does unfortunately reduce the amount of feature engineering we can do on these.

## Look at the target variable

In [7]:
true_proportion = (df_train.is_attributed.values == 1).mean()
false_proportion = 1 - true_proportion
data_list = [true_proportion, false_proportion]
text_list = ["{:.2%}".format(v) for v in data_list]
# plot
plotly_plot = PlotlyPlot(
    title='Target Variable (In Training) Distribution',
    yaxis={'title': 'proportion'},
    xaxis={'title': 'Target Value'})
plotly_plot.barplot(
    # is_attributed == 1 means the click led to an app download
    x=['attributed (1)', 'not attributed (0)'],
    y=data_list,
    text=text_list,
    marker_attribute={
        'color': 'lightblue',
        'line': {'width': 2}
    },
    **{'width': [0.5, 0.5]}
)

Wow, that's a really imbalanced dataset. Only about 0.2% of clicks are attributed, i.e. followed by an app download; the remaining clicks never convert. This means that any models we run on the data will either need to be robust against class imbalance or will require some data resampling.
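
Two common ways to act on the imbalance are sketched below: computing a positive-class weight (the kind of ratio passed to options such as `scale_pos_weight` in XGBoost or LightGBM) and downsampling the majority class. The labels are synthetic, drawn at roughly the 0.2% positive rate, and names such as `pos_weight` and `balanced` are illustrative. This is one possible approach, not the method used in this notebook.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Synthetic labels at roughly a 0.2% positive rate (not the real data)
df = pd.DataFrame(
    {'is_attributed': (rng.random(100_000) < 0.002).astype('uint8')}
)

is_pos = df['is_attributed'] == 1
n_pos = int(is_pos.sum())
n_neg = len(df) - n_pos

# Option 1: weight positives by the negative/positive ratio
pos_weight = n_neg / n_pos

# Option 2: keep every positive and an equally sized random sample of negatives
neg_sample = df[~is_pos].sample(n=n_pos, random_state=0)
balanced = pd.concat([df[is_pos], neg_sample])

print(n_pos, n_neg, round(pos_weight, 1), len(balanced))
```

Downsampling discards data but makes training much cheaper. Weighting keeps every row at the cost of training on the full set.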

## Look at some time series patterns

In [8]:
# let's round timestamp to hour
df_train['click_hour'] = df_train['click_time'].dt.round('H')
num_clicks = df_train.groupby('click_hour').size()
# plot
plotly_plot = PlotlyPlot(
    title='Number of Clicks (Train) Time Series', 
    yaxis={'title': 'number of clicks'}, 
    xaxis={'title': 'time stamp'})
plotly_plot.lineplot(
    x=num_clicks.index, 
    y=num_clicks, 

)

There is definitely a pattern in the frequency of clicks based on time of day. This behavior is consistent across the training period and carries over to the out-of-sample (test) data. We might want to add 'hour' as one of the features in the dataset.
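
A minimal sketch of deriving such an hour-of-day feature with the pandas `.dt` accessor is shown here. The three timestamps are made up, and the column name `hour` is an assumption. Unlike the rounded `click_hour` used for plotting, this keeps only the hour (0 to 23) and drops the date, so the same hour on different days maps to the same value.

```python
import pandas as pd

# Made-up timestamps for illustration
df = pd.DataFrame({
    'click_time': pd.to_datetime([
        '2017-11-06 14:32:21',
        '2017-11-06 14:59:59',
        '2017-11-07 03:05:00',
    ])
})

# Hour of day as a small integer feature
df['hour'] = df['click_time'].dt.hour.astype('uint8')
hours = df['hour'].tolist()
print(hours)  # [14, 14, 3]
```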

In [9]:
# let's round timestamp to hour
df_test['click_hour'] = df_test['click_time'].dt.round('H')
num_clicks = df_test.groupby('click_hour').size()
# plot
plotly_plot = PlotlyPlot(
    title='Number of Clicks (Test) Time Series', 
    yaxis={'title': 'number of clicks'}, 
    xaxis={'title': 'time stamp'})
plotly_plot.lineplot(
    x=num_clicks.index, 
    y=num_clicks, 

)

In [10]:
# let's round timestamp to hour
df_test_submit['click_hour'] = df_test_submit['click_time'].dt.round('H')
num_clicks = df_test_submit.groupby('click_hour').size()
# plot
plotly_plot = PlotlyPlot(
    title='Number of Clicks (Test For Submission) Time Series', 
    yaxis={'title': 'number of clicks'}, 
    xaxis={'title': 'time stamp'})
plotly_plot.lineplot(
    x=num_clicks.index, 
    y=num_clicks, 

)

Some hours have far more clicks than others, so there is a strong time pattern in this data. We definitely want to do some feature engineering around this time pattern.
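
As one hedged illustration of what time-based feature engineering could look like, the sketch below builds two features on a toy frame: the number of clicks an ip makes within the same day and hour, and the seconds elapsed since that ip's previous click. The feature names `ip_hour_clicks` and `secs_since_prev` and all the rows are hypothetical. They are offered as starting points, not as features this notebook has validated.

```python
import pandas as pd

# Toy clicks, already sorted by click_time like the real data
df = pd.DataFrame({
    'ip': [1, 1, 2, 1, 2],
    'click_time': pd.to_datetime([
        '2017-11-06 14:00:05',
        '2017-11-06 14:10:00',
        '2017-11-06 14:20:00',
        '2017-11-06 15:01:00',
        '2017-11-06 15:02:00',
    ]),
})

df['day'] = df['click_time'].dt.day
df['hour'] = df['click_time'].dt.hour

# Clicks made by the same ip within the same day and hour
df['ip_hour_clicks'] = (
    df.groupby(['ip', 'day', 'hour'])['click_time'].transform('count')
)

# Seconds since the previous click from the same ip (NaN for the first click)
prev_click = df.groupby('ip')['click_time'].shift(1)
df['secs_since_prev'] = (df['click_time'] - prev_click).dt.total_seconds()

print(df[['ip', 'hour', 'ip_hour_clicks', 'secs_since_prev']])
```

On the full training set these group-bys are memory hungry, so they are usually computed on a reduced set of columns.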