### TalkingData AdTracking Fraud Detection Challenge

https://www.kaggle.com/c/talkingdata-adtracking-fraud-detection

#### Description

Fraud risk is everywhere, but for companies that advertise online, click fraud can happen at an overwhelming volume, resulting in misleading click data and wasted money. Ad channels can drive up costs by simply clicking on the ad at a large scale. With over 1 billion smart mobile devices in active use every month, China is the largest mobile market in the world and therefore suffers from huge volumes of fraudulent traffic.

TalkingData, China’s largest independent big data service platform, covers over 70% of active mobile devices nationwide. They handle 3 billion clicks per day, of which 90% are potentially fraudulent. Their current approach to preventing click fraud for app developers is to measure the journey of a user’s click across their portfolio and flag IP addresses that produce lots of clicks but never end up installing apps. With this information, they've built an IP blacklist and a device blacklist.

While successful, they want to always be one step ahead of fraudsters and have turned to the Kaggle community for help in further developing their solution. In their 2nd competition with Kaggle, you’re challenged to build an algorithm that predicts whether a user will download an app after clicking a mobile app ad. To support your modeling, they have provided a generous dataset covering approximately 200 million clicks over 4 days!

#### Evaluation
Submissions are evaluated on area under the ROC curve between the predicted probability and the observed target.
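
For intuition, the area under the ROC curve equals the probability that a randomly chosen positive row receives a higher score than a randomly chosen negative row. The sketch below computes it by brute force over all pairs on toy data; `roc_auc` is a hypothetical helper written for illustration only (it is quadratic in the number of rows), and a library routine such as `sklearn.metrics.roc_auc_score` would be the practical choice.

```python
def roc_auc(y_true, y_score):
    """AUC as the share of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy labels and predicted probabilities (not competition data)
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]
auc = roc_auc(y_true, y_score)
print(auc)  # 0.75: three of the four positive/negative pairs are ordered correctly
```

Because only the ordering of the scores matters, the metric is unaffected by any monotonic rescaling of the predicted probabilities.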

##### Submission File
For each click_id in the test set, you must predict a probability for the target is_attributed variable. The file should contain a header and have the following format:

| click_id | is_attributed |
|------|------|
| 1 | 0.003 |
| 2 | 0.001 |
| 3 | 0.000 |
| etc. | |
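
A file in this format can be written with pandas. The snippet below is a minimal sketch with made-up predictions; it writes to an in-memory buffer so the result can be inspected, whereas a real submission would pass a file path to `to_csv` instead.

```python
import io

import pandas as pd

# Hypothetical predictions for three test clicks
submission = pd.DataFrame({
    "click_id": [1, 2, 3],
    "is_attributed": [0.003, 0.001, 0.0],
})

# index=False keeps the DataFrame index out of the file, so the
# header is exactly "click_id,is_attributed"
buffer = io.StringIO()
submission.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
print(csv_text)
```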



#### Data Description
**For this competition, your objective is to predict whether a user will download an app after clicking a mobile app advertisement.**

*Note: this is a binary classification problem.*

##### File descriptions
- train.csv - the training set
- train_sample.csv - 100,000 randomly-selected rows of training data, to inspect data before downloading full set
- test.csv - the test set
- sampleSubmission.csv - a sample submission file in the correct format
- UPDATE: test_supplement.csv - This is a larger test set that was unintentionally released at the start of the competition. It is not necessary to use this data, but it is permitted to do so. The official test data is a subset of this data.

##### Data fields
Each row of the training data contains a click record, with the following features.

- ip: ip address of click.
- app: app id for marketing.
- device: device type id of user mobile phone (e.g., iphone 6 plus, iphone 7, huawei mate 7, etc.)
- os: os version id of user mobile phone
- channel: channel id of mobile ad publisher
- click_time: timestamp of click (UTC)
- attributed_time: if the user downloads the app after clicking an ad, this is the time of the app download
- is_attributed: the target that is to be predicted, indicating the app was downloaded

*Note that ip, app, device, os, and channel are encoded.*

The test data is similar, with the following differences:

- click_id: reference for making predictions
- is_attributed: not included

In [1]:
import os
import gc
import pandas as pd
import numpy as np
from datetime import datetime

import warnings
warnings.filterwarnings("ignore")

%matplotlib inline

### Load data into a DataFrame

In [2]:
#define data directory 
file_dir='input/'
train_sample='train_sample.csv'
train='train.csv'
test='test.csv'
sample_submission='sample_submission.csv'
test_suppl='test_supplement.csv'

traincolumns = ['ip','app', 'device', 'os', 'channel', 'click_time', 'is_attributed']
dtypes = {
        'ip'            : 'uint32',
        'app'           : 'uint16',
        'device'        : 'uint16',
        'os'            : 'uint16',
        'channel'       : 'uint16',
        'is_attributed' : 'uint8',
        'click_id'      : 'uint32'
        }


In [3]:
#load the data into a pandas data frame
#df_train=pd.read_csv(file_dir+train_sample,dtype=dtypes, header=0)

df_train = pd.read_csv(file_dir+train, nrows=30000000, 
                    usecols = traincolumns, dtype=dtypes, header=0)

# Row id, used later to re-attach click_time after undersampling
df_train['mark'] = np.arange(len(df_train))
df_train.head()

Unnamed: 0,ip,app,device,os,channel,click_time,is_attributed,mark
0,83230,3,1,13,379,2017-11-06 14:32:21,0,0
1,17357,3,1,19,379,2017-11-06 14:33:34,0,1
2,35810,3,1,13,379,2017-11-06 14:34:12,0,2
3,45745,14,1,13,478,2017-11-06 14:34:52,0,3
4,161007,3,1,13,379,2017-11-06 14:35:08,0,4


In [4]:
#the distribution of target variable
df_train['is_attributed'].value_counts()

0    29925277
1       74723
Name: is_attributed, dtype: int64

In [5]:
from imblearn.under_sampling import RandomUnderSampler
rus = RandomUnderSampler(random_state=0)
X = df_train.drop(columns=['is_attributed','click_time'])
y = df_train['is_attributed']
# fit_resample replaces fit_sample, which newer imbalanced-learn releases no longer provide
X_resampled, y_resampled = rus.fit_resample(X, y)
del X,y,rus

In [6]:
print(len(X_resampled))
print(sum(y_resampled))
del y_resampled

149446
74723
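
The counts above (149,446 rows, 74,723 of them positive) are what random undersampling to a 1:1 ratio produces: every positive row is kept and an equal number of negatives is drawn at random. The same idea can be sketched with pandas alone, as below. This is an illustration on toy data, not the imbalanced-learn implementation, and the column values are invented.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the click data: 1,000 rows, 2% positives
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "ip": rng.integers(1, 500, size=1000),
    "is_attributed": (np.arange(1000) % 50 == 0).astype("uint8"),
})

positives = toy[toy["is_attributed"] == 1]
negatives = toy[toy["is_attributed"] == 0]

# Keep every positive and an equally sized random sample of negatives
sampled_negatives = negatives.sample(n=len(positives), random_state=0)

# sort_index restores the original row order of the selected rows
balanced = pd.concat([positives, sampled_negatives]).sort_index()

print(len(balanced))
print(balanced["is_attributed"].sum())
```

Selecting rows this way keeps every original column, so nothing has to be merged back by a row id afterwards.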


In [7]:
X = pd.DataFrame(data = X_resampled, 
                 columns = ['ip','app', 'device', 'os', 'channel','mark'])
df_train = df_train.drop(columns = ['app', 'device', 'os', 'channel'])

In [8]:
df_train.head()

Unnamed: 0,ip,click_time,is_attributed,mark
0,83230,2017-11-06 14:32:21,0,0
1,17357,2017-11-06 14:33:34,0,1
2,35810,2017-11-06 14:34:12,0,2
3,45745,2017-11-06 14:34:52,0,3
4,161007,2017-11-06 14:35:08,0,4


In [9]:
X.head()

Unnamed: 0,ip,app,device,os,channel,mark
0,165079,2,1,19,477,16040473
1,92237,3,1,13,280,27239904
2,120775,15,1,19,245,26458436
3,177466,3,1,9,173,25001874
4,43462,3,1,16,417,15717795


In [10]:
df_train = X.merge(df_train,how='left',left_on='mark',right_on='mark',validate='one_to_one')

In [11]:
df_train.loc[:,'ip'] =df_train['ip_x'] 
df_train = df_train.drop(columns=['ip_y','ip_x','mark'])

### Feature Engineering

Generate features from the click_time attribute:
- directly extract from click_time
    - the day of week
    - the date (day of year)
    - hour
    - minute
    - second
    
- create a time_segment feature based on hour of the click_time, converting UTC time to Beijing local Time

    - UTC 6 AM - 2 PM (Beijing 2 PM - 10 PM): segment 0
    - UTC 2 PM - 5 PM (Beijing 10 PM - 1 AM): segment 1
    - UTC 5 PM - 2 AM (Beijing 1 AM - 10 AM): segment 2
    - UTC 2 AM - 6 AM (Beijing 10 AM - 2 PM): segment 3
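
The four segments listed above can also be written as one vectorised mapping. The sketch below uses `np.select` on toy timestamps; `time_segment` is a hypothetical helper name, and the Beijing hours in the comments are simply the UTC hour plus 8.

```python
import numpy as np
import pandas as pd


def time_segment(hour_utc):
    """Map a UTC hour (0-23) to the segment ids 0-3 described in the list."""
    hour_utc = np.asarray(hour_utc)
    conditions = [
        (hour_utc >= 6) & (hour_utc < 14),   # Beijing 2 PM - 10 PM
        (hour_utc >= 14) & (hour_utc < 17),  # Beijing 10 PM - 1 AM
        (hour_utc >= 2) & (hour_utc < 6),    # Beijing 10 AM - 2 PM
    ]
    # Everything else (UTC 5 PM - 2 AM, Beijing 1 AM - 10 AM) is segment 2
    return np.select(conditions, [0, 1, 3], default=2).astype("uint8")


clicks = pd.DataFrame({"click_time": pd.to_datetime([
    "2017-11-07 01:56:36",
    "2017-11-07 05:18:37",
    "2017-11-07 09:00:00",
    "2017-11-07 15:30:00",
    "2017-11-07 20:00:00",
])})
clicks["hour"] = clicks["click_time"].dt.hour
clicks["time_segment"] = time_segment(clicks["hour"])
print(clicks)
```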

In [12]:
#generate features from click_time 
df_train['click_time']=pd.to_datetime(df_train['click_time']) #convert the click_time 
df_train['dayofweek'] = df_train['click_time'].dt.dayofweek.astype('uint8')
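# Note: dayofyear exceeds 255 for these November dates, so uint8 wraps around
# (day 311 is stored as 55, as in the output below); use 'uint16' to keep the true value
df_train['date'] = df_train['click_time'].dt.dayofyear.astype('uint8')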
df_train['hour']=df_train['click_time'].dt.hour.astype('uint8')
df_train['minute']=df_train['click_time'].dt.minute.astype('uint8')
df_train['second']= df_train['click_time'].dt.second.astype('uint8')

In [13]:
df_train.loc[:,'time_segment'] = 2
df_train.loc[df_train.loc[:,'hour']<17,'time_segment'] = 1
df_train.loc[df_train['hour']<14,'time_segment'] = 0
df_train.loc[df_train['hour']<6,'time_segment'] = 3
df_train.loc[df_train['hour']<2,'time_segment'] = 2

In [14]:
total_sum = len(df_train)
df_train['app_count'] = df_train.groupby('app')['app'].transform('count')

df_train.loc[:,'app_segment'] = 3
df_train.loc[df_train.loc[:,'app_count']<total_sum*.05,'app_segment'] = 2
df_train.loc[df_train.loc[:,'app_count']<total_sum*.01,'app_segment'] = 1
df_train.loc[df_train.loc[:,'app_count']<total_sum*.005,'app_segment'] = 0
df_train= df_train.drop(['app_count'], axis=1)

In [15]:
df_train['channel_count'] = df_train.groupby('channel')['channel'].transform('count')

df_train.loc[:,'channel_segment'] = 4
df_train.loc[df_train.loc[:,'channel_count']<total_sum*.08,'channel_segment'] = 3
df_train.loc[df_train.loc[:,'channel_count']<total_sum*.03,'channel_segment'] = 2
df_train.loc[df_train.loc[:,'channel_count']<total_sum*.01,'channel_segment'] = 1
df_train.loc[df_train.loc[:,'channel_count']<total_sum*.005,'channel_segment'] = 0
df_train= df_train.drop(['channel_count'], axis=1)

In [16]:
df_train['os_count'] = df_train.groupby('os')['os'].transform('count')

df_train.loc[:,'os_segment'] = 4
df_train.loc[df_train.loc[:,'os_count']<total_sum*.23,'os_segment'] = 3
df_train.loc[df_train.loc[:,'os_count']<total_sum*.2,'os_segment'] = 2
df_train.loc[df_train.loc[:,'os_count']<total_sum*.04,'os_segment'] = 1
df_train.loc[df_train.loc[:,'os_count']<total_sum*.02,'os_segment'] = 0
df_train= df_train.drop(['os_count'], axis=1)

In [17]:
df_train['device_count'] = df_train.groupby('device')['device'].transform('count')

df_train.loc[:,'device_segment'] = 2
df_train.loc[df_train.loc[:,'device_count']<total_sum*.1,'device_segment'] = 1
df_train.loc[df_train.loc[:,'device_count']<total_sum*.04,'device_segment'] = 0
df_train= df_train.drop(['device_count'], axis=1)
df_train.head()

Unnamed: 0,app,device,os,channel,click_time,is_attributed,ip,dayofweek,date,hour,minute,second,time_segment,app_segment,channel_segment,os_segment,device_segment
0,2,1,19,477,2017-11-07 01:56:36,0,165079,1,55,1,56,36,2,3,2,3,2
1,3,1,13,280,2017-11-07 05:18:37,0,92237,1,55,5,18,37,3,3,3,2,2
2,15,1,19,245,2017-11-07 05:04:37,0,120775,1,55,5,4,37,3,3,2,3,2
3,3,1,9,173,2017-11-07 04:40:49,0,177466,1,55,4,40,49,3,3,1,0,2
4,3,1,16,417,2017-11-07 01:50:21,0,43462,1,55,1,50,21,2,3,0,0,2
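
The app, channel, os and device segments above all follow one pattern: compute the share of rows carrying each value, then bucket that share against ascending thresholds. A generic version might look like the sketch below. `frequency_segment` is a hypothetical helper, the toy thresholds are arbitrary, and results could differ from the cells above for shares that sit exactly on a threshold because of floating-point rounding.

```python
import numpy as np
import pandas as pd


def frequency_segment(values, thresholds):
    """Bucket each row by the share of rows that carry the same value.

    thresholds must be ascending. A share below thresholds[0] maps to 0,
    below thresholds[1] to 1, and so on; a share at or above the last
    threshold maps to len(thresholds).
    """
    share = values.map(values.value_counts(normalize=True))
    # side="right" counts how many thresholds are <= the share
    buckets = np.searchsorted(thresholds, share, side="right")
    return pd.Series(buckets, index=values.index, dtype="uint8")


# Toy app ids: app 3 has 60% of rows, app 2 has 30%, app 15 has 10%
apps = pd.Series([3] * 6 + [2] * 3 + [15])
segments = frequency_segment(apps, thresholds=[0.2, 0.5])
print(segments.tolist())
```

With this helper, the app segmentation would correspond to thresholds of `[0.005, 0.01, 0.05]` and the device segmentation to `[0.04, 0.1]`.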


### Group-By-Aggregation
Many of the kernels I've checked out build features with groupby -> count()/var()/mean() etc., so of course those have to be added as well :)
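
As a small illustration of the pattern on toy data (the column values are invented): an aggregate is computed per group and then attached to every row of that group, either with `transform` or by merging the per-group result back.

```python
import pandas as pd

toy = pd.DataFrame({
    "ip":      [1, 1, 1, 2, 2],
    "app":     [3, 3, 9, 3, 3],
    "channel": [10, 11, 10, 10, 10],
})

# Number of clicks per (ip, app) pair, broadcast back to every row
toy["ip_app_count_channel"] = (
    toy.groupby(["ip", "app"])["channel"].transform("count")
)

# Number of distinct channels per ip, computed once per group and merged back
per_ip = (
    toy.groupby("ip")["channel"]
    .nunique()
    .rename("ip_nunique_channel")
    .reset_index()
)
toy = toy.merge(per_ip, on="ip", how="left")
print(toy)
```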

In [18]:
# Define all the groupby transformations
GROUPBY_AGGREGATIONS = [
    
    # V1 - GroupBy Features #
    #########################    
    # Variance in hour, for ip-app-os
    {'groupby': ['ip','app','os'], 'select': 'hour', 'agg': 'var'},
    # Variance in date, for ip-app-os
    {'groupby': ['ip','app','os'], 'select': 'date', 'agg': 'var'},
    # Count, for ip-app
    {'groupby': ['ip', 'app'], 'select': 'channel', 'agg': 'count'},        
    # Count, for ip-app-os
    {'groupby': ['ip', 'app', 'os'], 'select': 'channel', 'agg': 'count'},
    # Mean hour, for ip-app-channel
    {'groupby': ['ip','app','channel'], 'select': 'hour', 'agg': 'mean'}, 
    # Mean date, for ip-app-channel
    {'groupby': ['ip','app','channel'], 'select': 'date', 'agg': 'mean'}, 

    
    # V2 - GroupBy Features #
    #########################
    # Average clicks on app by distinct users; is it an app they return to?
    {'groupby': ['app'], 
     'select': 'ip', 
     'agg': lambda x: float(len(x)) / len(x.unique()), 
     'agg_name': 'AvgViewPerDistinct'
    },
    # How popular is the app or channel?
    {'groupby': ['app'], 'select': 'channel', 'agg': 'count'},
    {'groupby': ['channel'], 'select': 'app', 'agg': 'count'},
    
    # V3 - GroupBy Features                                              #
    # https://www.kaggle.com/bk0000/non-blending-lightgbm-model-lb-0-977 #
    ###################################################################### 
    {'groupby': ['ip'], 'select': 'channel', 'agg': 'nunique'}, 
    {'groupby': ['ip'], 'select': 'app', 'agg': 'nunique'}, 
    {'groupby': ['ip','app'], 'select': 'os', 'agg': 'nunique'}, 
    {'groupby': ['ip'], 'select': 'device', 'agg': 'nunique'}, 
    {'groupby': ['app'], 'select': 'channel', 'agg': 'nunique'}, 
    {'groupby': ['ip', 'device', 'os'], 'select': 'app', 'agg': 'nunique'}, 
    {'groupby': ['ip','device','os'], 'select': 'app', 'agg': 'cumcount'}, 
    {'groupby': ['ip'], 'select': 'app', 'agg': 'cumcount'}, 
    {'groupby': ['ip'], 'select': 'os', 'agg': 'cumcount'}    
]

# Apply all the groupby transformations
for spec in GROUPBY_AGGREGATIONS:
    
    # Name of the aggregation we're applying
    agg_name = spec['agg_name'] if 'agg_name' in spec else spec['agg']
    
    # Name of new feature
    new_feature = '{}_{}_{}'.format('_'.join(spec['groupby']), agg_name, spec['select'])
    
    # Info
    print("Grouping by {}, and aggregating {} with {}".format(
        spec['groupby'], spec['select'], agg_name
    ))
    
    # Unique list of features to select
    all_features = list(set(spec['groupby'] + [spec['select']]))
    
    # Perform the groupby
    gp = df_train[all_features]. \
        groupby(spec['groupby'])[spec['select']]. \
        agg(spec['agg']). \
        reset_index(). \
        rename(index=str, columns={spec['select']: new_feature})
        
    # Merge back into df_train
    if 'cumcount' == spec['agg']:
        df_train[new_feature] = gp[0].values
    else:
        df_train = df_train.merge(gp, on=spec['groupby'], how='left')

    # Clear memory
    del gp
    gc.collect()

df_train.head()

Grouping by ['ip', 'app', 'os'], and aggregating hour with var
Grouping by ['ip', 'app', 'os'], and aggregating date with var
Grouping by ['ip', 'app'], and aggregating channel with count
Grouping by ['ip', 'app', 'os'], and aggregating channel with count
Grouping by ['ip', 'app', 'channel'], and aggregating hour with mean
Grouping by ['ip', 'app', 'channel'], and aggregating date with mean
Grouping by ['app'], and aggregating ip with AvgViewPerDistinct
Grouping by ['app'], and aggregating channel with count
Grouping by ['channel'], and aggregating app with count
Grouping by ['ip'], and aggregating channel with nunique
Grouping by ['ip'], and aggregating app with nunique
Grouping by ['ip', 'app'], and aggregating os with nunique
Grouping by ['ip'], and aggregating device with nunique
Grouping by ['app'], and aggregating channel with nunique
Grouping by ['ip', 'device', 'os'], and aggregating app with nunique
Grouping by ['ip', 'device', 'os'], and aggregating app with cumcount
Grouping

Unnamed: 0,app,device,os,channel,click_time,is_attributed,ip,dayofweek,date,hour,...,channel_count_app,ip_nunique_channel,ip_nunique_app,ip_app_nunique_os,ip_nunique_device,app_nunique_channel,ip_device_os_nunique_app,ip_device_os_cumcount_app,ip_cumcount_app,ip_cumcount_os
0,2,1,19,477,2017-11-07 01:56:36,0,165079,1,55,1,...,3344,1,1,1,1,19,1,0,0,0
1,3,1,13,280,2017-11-07 05:18:37,0,92237,1,55,5,...,6553,6,4,1,1,32,1,0,0,0
2,15,1,19,245,2017-11-07 05:04:37,0,120775,1,55,5,...,4374,9,8,1,1,23,2,0,0,0
3,3,1,9,173,2017-11-07 04:40:49,0,177466,1,55,4,...,751,3,2,2,1,32,1,0,0,0
4,3,1,16,417,2017-11-07 01:50:21,0,43462,1,55,1,...,199,3,2,1,1,32,1,0,0,0


### Time till next click
It might be interesting to know, for example, how long a given ip-app-channel combination takes before its next click, so I'll create some features for that as well.
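
A minimal sketch of the idea on toy data follows. It assumes the rows are first sorted by click_time, so that "next" means next in time, and it uses `total_seconds()` so that gaps of a day or more are not truncated. The timestamps are invented for illustration.

```python
import pandas as pd

toy = pd.DataFrame({
    "ip": [1, 2, 1, 1, 2],
    "click_time": pd.to_datetime([
        "2017-11-07 00:00:10",
        "2017-11-07 00:00:00",
        "2017-11-07 00:00:00",
        "2017-11-08 00:00:10",
        "2017-11-07 00:01:00",
    ]),
})

# Sort chronologically; a stable sort keeps the input order for equal timestamps
toy = toy.sort_values("click_time", kind="stable").reset_index(drop=True)

# shift(-1) within each group yields the timestamp of the following click
next_time = toy.groupby("ip")["click_time"].shift(-1)

# The last click of each ip has no successor and stays NaN
toy["ip_nextClick"] = (next_time - toy["click_time"]).dt.total_seconds()
print(toy)
```

In this toy frame one gap is exactly 24 hours: `total_seconds()` reports it as 86400.0, whereas the `.dt.seconds` accessor would report 0 because it returns only the seconds component of each timedelta.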

In [19]:
GROUP_BY_NEXT_CLICKS = [
    
    # V1
    {'groupby': ['ip']},
    {'groupby': ['ip', 'app']},
    {'groupby': ['ip', 'channel']},
    {'groupby': ['ip', 'os']},
    
    # V3
    {'groupby': ['ip', 'app', 'device', 'os', 'channel']},
    {'groupby': ['ip', 'os', 'device']},
    {'groupby': ['ip', 'os', 'device', 'app']}
]

# Calculate the time to next click for each group
for spec in GROUP_BY_NEXT_CLICKS:
    
    # Name of new feature
    new_feature = '{}_nextClick'.format('_'.join(spec['groupby']))    
    
    # Unique list of features to select
    all_features = spec['groupby'] + ['click_time']
    
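    # Run calculation
    # Note: diff() follows the current row order, so the frame has to be sorted by
    # click_time for this to be a true "next click" gap; .dt.seconds also returns
    # only the seconds component of each timedelta (the days part is dropped).
    print(f">> Grouping by {spec['groupby']}, and saving time to next click in: {new_feature}")
    df_train[new_feature] = (
        df_train[all_features]
        .groupby(spec['groupby'])
        .click_time
        .transform(lambda x: x.diff().shift(-1))
        .dt.seconds
    )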
    

>> Grouping by ['ip'], and saving time to next click in: ip_nextClick
>> Grouping by ['ip', 'app'], and saving time to next click in: ip_app_nextClick
>> Grouping by ['ip', 'channel'], and saving time to next click in: ip_channel_nextClick
>> Grouping by ['ip', 'os'], and saving time to next click in: ip_os_nextClick
>> Grouping by ['ip', 'app', 'device', 'os', 'channel'], and saving time to next click in: ip_app_device_os_channel_nextClick
>> Grouping by ['ip', 'os', 'device'], and saving time to next click in: ip_os_device_nextClick
>> Grouping by ['ip', 'os', 'device', 'app'], and saving time to next click in: ip_os_device_app_nextClick


Unnamed: 0,app,device,os,channel,click_time,is_attributed,ip,dayofweek,date,hour,...,ip_device_os_cumcount_app,ip_cumcount_app,ip_cumcount_os,ip_nextClick,ip_app_nextClick,ip_channel_nextClick,ip_os_nextClick,ip_app_device_os_channel_nextClick,ip_os_device_nextClick,ip_os_device_app_nextClick
0,2,1,19,477,2017-11-07 01:56:36,0,165079,1,55,1,...,0,0,0,,,,,,,
1,3,1,13,280,2017-11-07 05:18:37,0,92237,1,55,5,...,0,0,0,73672.0,73672.0,,73672.0,,73672.0,73672.0
2,15,1,19,245,2017-11-07 05:04:37,0,120775,1,55,5,...,0,0,0,82451.0,,,64185.0,,64185.0,
3,3,1,9,173,2017-11-07 04:40:49,0,177466,1,55,4,...,0,0,0,1053.0,41557.0,,,,,
4,3,1,16,417,2017-11-07 01:50:21,0,43462,1,55,1,...,0,0,0,83702.0,83702.0,,83702.0,,83702.0,83702.0


### Clicks on app ad before & after
Has the user previously or subsequently clicked the exact same app-device-os-channel? I thought that might be an interesting feature to test out as well.
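
On toy data the two counters look like this. The sketch assumes the rows are already in chronological order and uses `cumcount(ascending=False)` for the subsequent clicks, which is one alternative to reversing the frame; the values are invented.

```python
import pandas as pd

# Rows are assumed to be in chronological order
toy = pd.DataFrame({
    "ip":  [1, 1, 2, 1, 2],
    "app": [3, 3, 3, 3, 9],
})
keys = ["ip", "app"]

# Earlier clicks with the same ip-app combination
prev_clicks = toy.groupby(keys).cumcount()

# Later clicks with the same ip-app combination
future_clicks = toy.groupby(keys).cumcount(ascending=False)

toy["prev_app_clicks"] = prev_clicks
toy["future_app_clicks"] = future_clicks
print(toy)
```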

In [20]:
HISTORY_CLICKS = {
    'identical_clicks': ['ip', 'app', 'device', 'os', 'channel'],
    'app_clicks': ['ip', 'app']
}

# Go through different group-by combinations
for fname, fset in HISTORY_CLICKS.items():
    
    # Clicks in the past
    df_train['prev_'+fname] = df_train. \
        groupby(fset). \
        cumcount(). \
        rename('prev_'+fname)
        
    # Clicks in the future
    df_train['future_'+fname] = df_train.iloc[::-1]. \
        groupby(fset). \
        cumcount(). \
        rename('future_'+fname).iloc[::-1]

# Preview the new previous/future click features
df_train.head()

Unnamed: 0,app,device,os,channel,click_time,is_attributed,ip,dayofweek,date,hour,...,ip_app_nextClick,ip_channel_nextClick,ip_os_nextClick,ip_app_device_os_channel_nextClick,ip_os_device_nextClick,ip_os_device_app_nextClick,prev_identical_clicks,future_identical_clicks,prev_app_clicks,future_app_clicks
0,2,1,19,477,2017-11-07 01:56:36,0,165079,1,55,1,...,,,,,,,0,0,0,0
1,3,1,13,280,2017-11-07 05:18:37,0,92237,1,55,5,...,73672.0,,73672.0,,73672.0,73672.0,0,0,0,1
2,15,1,19,245,2017-11-07 05:04:37,0,120775,1,55,5,...,,,64185.0,,64185.0,,0,0,0,0
3,3,1,9,173,2017-11-07 04:40:49,0,177466,1,55,4,...,41557.0,,,,,,0,0,0,1
4,3,1,16,417,2017-11-07 01:50:21,0,43462,1,55,1,...,83702.0,,83702.0,,83702.0,83702.0,0,0,0,1


In [21]:
df_train = df_train.drop(columns=['click_time','date',
                                  'os','channel','device','app','ip'])

In [22]:
df_train.to_csv('input/train_fe.csv')