### TalkingData AdTracking Fraud Detection Challenge

https://www.kaggle.com/c/talkingdata-adtracking-fraud-detection

#### Description

Fraud risk is everywhere, but for companies that advertise online, click fraud can happen at an overwhelming volume, resulting in misleading click data and wasted money. Ad channels can drive up costs by simply clicking on the ad at a large scale. With over 1 billion smart mobile devices in active use every month, China is the largest mobile market in the world and therefore suffers from huge volumes of fraudulent traffic.

TalkingData, China’s largest independent big data service platform, covers over 70% of active mobile devices nationwide. They handle 3 billion clicks per day, of which 90% are potentially fraudulent. Their current approach to preventing click fraud for app developers is to measure the journey of a user’s click across their portfolio, and flag IP addresses that produce lots of clicks but never end up installing apps. With this information, they've built an IP blacklist and device blacklist.

While successful, they want to always be one step ahead of fraudsters and have turned to the Kaggle community for help in further developing their solution. In their 2nd competition with Kaggle, you’re challenged to build an algorithm that predicts whether a user will download an app after clicking a mobile app ad. To support your modeling, they have provided a generous dataset covering approximately 200 million clicks over 4 days!

#### Evaluation
Submissions are evaluated on area under the ROC curve between the predicted probability and the observed target.
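
To make the metric concrete, here is a minimal sketch of how the area under the ROC curve can be computed from ranks (the Mann-Whitney formulation). The function name `roc_auc` and the toy labels and scores are illustrative assumptions, not part of the competition material; Kaggle's own scoring code is not shown in this notebook.

```python
def roc_auc(y_true, y_score):
    """Area under the ROC curve via average ranks (ties share a rank)."""
    pairs = sorted(zip(y_score, y_true))
    n_pos = sum(1 for _, label in pairs if label == 1)
    n_neg = len(pairs) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("AUC needs at least one positive and one negative")

    rank_sum = 0.0  # sum of the (1-based) ranks of the positive examples
    i = 0
    while i < len(pairs):
        # Find the run of tied scores starting at position i.
        j = i
        while j < len(pairs) and pairs[j][0] == pairs[i][0]:
            j += 1
        avg_rank = (i + 1 + j) / 2.0  # mean of ranks i+1 .. j
        positives_in_run = sum(1 for k in range(i, j) if pairs[k][1] == 1)
        rank_sum += avg_rank * positives_in_run
        i = j

    return (rank_sum - n_pos * (n_pos + 1) / 2.0) / (n_pos * n_neg)


# Toy example (hypothetical data): two negatives, two positives.
auc = roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
print(auc)
```

Because only the ordering of the scores matters, any monotonic rescaling of the predicted probabilities leaves this value unchanged.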

##### Submission File
For each click_id in the test set, you must predict a probability for the target is_attributed variable. The file should contain a header and have the following format:

|click_id|is_attributed|
|------|------|
|1|0.003|
|2|0.001|
|3|0.000|
|etc.| |
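
As a rough illustration of the file layout described above, the following sketch writes a header plus one `click_id,is_attributed` row per prediction using only the standard library. The helper name `write_submission`, the three-decimal formatting, and the sample rows are assumptions made for the example; the notebook itself writes its submission with pandas.

```python
import csv
import io


def write_submission(rows, handle):
    """Write (click_id, probability) pairs as a CSV with the required header."""
    writer = csv.writer(handle, lineterminator="\n")
    writer.writerow(["click_id", "is_attributed"])
    for click_id, prob in rows:
        if not 0.0 <= prob <= 1.0:
            raise ValueError(f"probability out of range for click_id {click_id}")
        # Three decimals is only for readability here; keep more digits in practice.
        writer.writerow([click_id, f"{prob:.3f}"])


# Hypothetical predictions, written to an in-memory buffer instead of a file.
buffer = io.StringIO()
write_submission([(1, 0.003), (2, 0.001), (3, 0.0)], buffer)
submission_text = buffer.getvalue()
print(submission_text)
```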



#### Data Description
**For this competition, your objective is to predict whether a user will download an app after clicking a mobile app advertisement.**

*Note: this is a binary classification problem.*

##### File descriptions
- train.csv - the training set
- train_sample.csv - 100,000 randomly-selected rows of training data, to inspect data before downloading full set
- test.csv - the test set
- sampleSubmission.csv - a sample submission file in the correct format
- UPDATE: test_supplement.csv - This is a larger test set that was unintentionally released at the start of the competition. It is not necessary to use this data, but it is permitted to do so. The official test data is a subset of this data.

##### Data fields
Each row of the training data contains a click record, with the following features.

- ip: ip address of click.
- app: app id for marketing.
- device: device type id of user mobile phone (e.g., iphone 6 plus, iphone 7, huawei mate 7, etc.)
- os: os version id of user mobile phone
- channel: channel id of mobile ad publisher
- click_time: timestamp of click (UTC)
- attributed_time: if the user downloads the app after clicking an ad, this is the time of the app download
- is_attributed: the target that is to be predicted, indicating the app was downloaded

*Note that ip, app, device, os, and channel are encoded.*

The test data is similar, with the following differences:

- click_id: reference for making predictions
- is_attributed: not included

In [13]:
import os
import gc
import pandas as pd
import seaborn as sns
import numpy as np
from matplotlib import pyplot as plt
import matplotlib
from datetime import datetime
from sklearn.model_selection import train_test_split
import lightgbm as lgb

import warnings
warnings.filterwarnings("ignore")

%matplotlib inline

In [14]:
print(plt.style.available)
print('pandas version:',pd.__version__)
print('seaborn version:',sns.__version__)
print('matplotlib version:',matplotlib.__version__)

['bmh', '_classic_test', 'fast', 'seaborn-whitegrid', 'seaborn-ticks', 'Solarize_Light2', 'seaborn-white', 'tableau-colorblind10', 'seaborn-talk', 'grayscale', 'seaborn-bright', 'dark_background', 'seaborn-colorblind', 'seaborn-pastel', 'seaborn-paper', 'seaborn-notebook', 'seaborn-dark', 'seaborn-poster', 'classic', 'seaborn', 'seaborn-darkgrid', 'seaborn-dark-palette', 'seaborn-deep', 'ggplot', 'fivethirtyeight', 'seaborn-muted']
pandas version: 0.22.0
seaborn version: 0.8.1
matplotlib version: 2.2.2


### Load data into a DataFrame

In [15]:
#define data directory 
file_dir='input/'
train_sample='train_sample.csv'
train='train.csv'
test='test.csv'
sample_submission='sample_submission.csv'
test_suppl='test_supplement.csv'

dtypes = {
        'ip'            : 'uint32',
        'app'           : 'uint16',
        'device'        : 'uint16',
        'os'            : 'uint16',
        'channel'       : 'uint16',
        'is_attributed' : 'uint8',
        'click_id'      : 'uint32'
        }


In [16]:
#load the data into a pandas data frame
df_train=pd.read_csv(file_dir+train_sample,dtype=dtypes, header=0)
#df_train=pd.read_csv(file_dir+train,dtype=dtypes, header=0,low_memory=True)
test = pd.read_csv(file_dir+test, header=0)
df_train.head()

Unnamed: 0,ip,app,device,os,channel,click_time,attributed_time,is_attributed
0,87540,12,1,13,497,2017-11-07 09:30:38,,0
1,105560,25,1,17,259,2017-11-07 13:40:27,,0
2,101424,12,1,19,212,2017-11-07 18:05:24,,0
3,94584,13,1,13,477,2017-11-07 04:58:08,,0
4,68413,12,1,1,178,2017-11-09 09:00:09,,0


In [10]:
#display the list of fields and the shape of the dataframe
df_train.info() #check the column names, dtypes, and non-null counts

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100000 entries, 0 to 99999
Data columns (total 8 columns):
ip                 100000 non-null uint32
app                100000 non-null uint16
device             100000 non-null uint16
os                 100000 non-null uint16
channel            100000 non-null uint16
click_time         100000 non-null object
attributed_time    227 non-null object
is_attributed      100000 non-null uint8
dtypes: object(2), uint16(4), uint32(1), uint8(1)
memory usage: 2.8+ MB


In [17]:
#the distribution of target variable
df_train['is_attributed'].value_counts()

0    99773
1      227
Name: is_attributed, dtype: int64
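
The counts above show how skewed the target is. A minimal sketch of the arithmetic follows; the variable names are illustrative. The negative-to-positive ratio is a commonly used starting point for class-weight parameters, which is an assumption here rather than something this notebook derives (its LightGBM cell hard-codes a different value).

```python
# Class counts copied from the value_counts() output above.
n_negative = 99773
n_positive = 227

# Share of clicks that led to a download.
positive_rate = n_positive / (n_negative + n_positive)

# How many negatives there are for every positive.
neg_to_pos_ratio = n_negative / n_positive

print(f"positive rate: {positive_rate:.4%}")
print(f"negative/positive ratio: {neg_to_pos_ratio:.1f}")
```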

### Feature Engineering

generate features from the click_time attribute:
- directly extract from click_time
    - the date 
    - hour
    - minute
    
- create a time_segment feature based on hour of the click_time
    - 10 PM - 6 AM: segment 0
    - 6 AM - 9 AM: segment 1
    - 9 AM - 6 PM: segment 2
    - 6 PM - 10 PM: segment 3

In [19]:
#generate features from click_time 
df_train['click_time']=pd.to_datetime(df_train['click_time']) #convert click_time from string to datetime
df_train['dayofweek'] = df_train['click_time'].dt.dayofweek.astype('uint8')
df_train['hour']=df_train['click_time'].dt.hour.astype('uint8')
df_train['minute']=df_train['click_time'].dt.minute.astype('uint8')
df_train['second']= df_train['click_time'].dt.second.astype('uint8')
#click_time is kept: it still appears in the output below and later cells sort on it
#df_train= df_train.drop(['click_time'], axis=1)

df_train.head()

Unnamed: 0,ip,app,device,os,channel,click_time,attributed_time,is_attributed,dayofweek,hour,minute,second
0,87540,12,1,13,497,2017-11-07 09:30:38,,0,1,9,30,38
1,105560,25,1,17,259,2017-11-07 13:40:27,,0,1,13,40,27
2,101424,12,1,19,212,2017-11-07 18:05:24,,0,1,18,5,24
3,94584,13,1,13,477,2017-11-07 04:58:08,,0,1,4,58,8
4,68413,12,1,1,178,2017-11-09 09:00:09,,0,3,9,0,9


In [20]:
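#click_time is in UTC; the hour cut-offs below reproduce the segments listed above for a local clock at UTC-8
df_train['time_segment']=df_train.apply(lambda row: 2 if row.hour<2 else 
                                                      (3 if row.hour<6 else 
                                                       (0 if row.hour<14 else 
                                                        (1 if row.hour<17 else 2)
                                                       ) 
                                                      ), axis=1)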
df_train.head()

Unnamed: 0,ip,app,device,os,channel,click_time,attributed_time,is_attributed,dayofweek,hour,minute,second,time_segment
0,87540,12,1,13,497,2017-11-07 09:30:38,,0,1,9,30,38,0
1,105560,25,1,17,259,2017-11-07 13:40:27,,0,1,13,40,27,0
2,101424,12,1,19,212,2017-11-07 18:05:24,,0,1,18,5,24,2
3,94584,13,1,13,477,2017-11-07 04:58:08,,0,1,4,58,8,3
4,68413,12,1,1,178,2017-11-09 09:00:09,,0,3,9,0,9,0
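
The row-wise `apply` above can be slow on the full training set. One possible vectorized alternative, sketched here with `numpy.select`, uses the same hour cut-offs as the cell above; the function name `time_segment` is an assumption made for this example.

```python
import numpy as np


def time_segment(hour):
    """Map UTC hours (0-23) to the segment ids used by the nested conditional."""
    hour = np.asarray(hour)
    # np.select takes the first condition that is true for each element,
    # which mirrors the order of the chained if/else expressions.
    conditions = [hour < 2, hour < 6, hour < 14, hour < 17]
    choices = [2, 3, 0, 1]
    return np.select(conditions, choices, default=2).astype("uint8")


# Hours of the five rows displayed above.
segments = time_segment([9, 13, 18, 4, 9])
print(segments.tolist())
```

With a DataFrame this would be used as `df_train['time_segment'] = time_segment(df_train['hour'])`.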


In [29]:
df_train['app_count'] = df_train.groupby('app')['app'].transform('count')
df_train['app_segment']=df_train.apply(lambda row: 0 if row.app_count<500 else 
                                                      (1 if row.app_count<1000 else 
                                                       (2 if row.app_count<5000 else 3) 
                                                      ), axis=1)

df_train= df_train.drop(['app_count'], axis=1)
df_train.head()

Unnamed: 0,ip,app,device,os,channel,click_time,attributed_time,is_attributed,dayofweek,hour,minute,second,time_segment,app_segment
0,87540,12,1,13,497,2017-11-07 09:30:38,,0,1,9,30,38,0,3
1,105560,25,1,17,259,2017-11-07 13:40:27,,0,1,13,40,27,0,1
2,101424,12,1,19,212,2017-11-07 18:05:24,,0,1,18,5,24,2,3
3,94584,13,1,13,477,2017-11-07 04:58:08,,0,1,4,58,8,3,2
4,68413,12,1,1,178,2017-11-09 09:00:09,,0,3,9,0,9,0,3
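
The same count-then-bin pattern recurs below for channel, os, and device. A small helper could express it once; this is a sketch under the assumption that every segment boundary is a "count < threshold" test, as in the cells of this notebook. The name `frequency_segment` and the toy data are hypothetical.

```python
import numpy as np
import pandas as pd


def frequency_segment(values, thresholds):
    """Bin each value by how often it occurs in `values`.

    Segment k means thresholds[k-1] <= count < thresholds[k]; counts below
    the first threshold get 0, counts at or above the last get len(thresholds).
    """
    values = pd.Series(values)
    # Number of rows sharing the same value, aligned with the original rows.
    counts = values.groupby(values).transform("count")
    segments = np.digitize(counts.values, thresholds)
    return pd.Series(segments, index=values.index, dtype="uint8")


# Toy data: app 12 occurs 3 times, app 25 once, app 13 twice.
apps = [12, 12, 12, 25, 13, 13]
segments = frequency_segment(apps, thresholds=[2, 3])
print(segments.tolist())
```

On the sample data, `frequency_segment(df_train['app'], [500, 1000, 5000])` should give the same bins as the `app_segment` cell above without the temporary `app_count` column.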


In [30]:
df_train['channel_count'] = df_train.groupby('channel')['channel'].transform('count')
df_train['channel_segment']=df_train.apply(lambda row: 0 if row.channel_count<500 else 
                                                      (1 if row.channel_count<1000 else 
                                                       (2 if row.channel_count<3000 else 
                                                        (3 if row.channel_count<8000 else 4) 
                                                      )
                                                    ), axis=1)

df_train= df_train.drop(['channel_count'], axis=1)
df_train.head()

Unnamed: 0,ip,app,device,os,channel,click_time,attributed_time,is_attributed,dayofweek,hour,minute,second,time_segment,app_segment,channel_segment
0,87540,12,1,13,497,2017-11-07 09:30:38,,0,1,9,30,38,0,3,0
1,105560,25,1,17,259,2017-11-07 13:40:27,,0,1,13,40,27,0,1,3
2,101424,12,1,19,212,2017-11-07 18:05:24,,0,1,18,5,24,2,3,1
3,94584,13,1,13,477,2017-11-07 04:58:08,,0,1,4,58,8,3,2,3
4,68413,12,1,1,178,2017-11-09 09:00:09,,0,3,9,0,9,0,3,2


In [34]:
df_train['os_count'] = df_train.groupby('os')['os'].transform('count')
df_train['os_segment']=df_train.apply(lambda row: 0 if row.os_count<2000 else 
                                                      (1 if row.os_count<4000 else 
                                                       (2 if row.os_count<20000 else 
                                                        (3 if row.os_count<23000 else 4) 
                                                      )
                                                    ), axis=1)

df_train= df_train.drop(['os_count'], axis=1)
df_train.head()

Unnamed: 0,ip,app,device,os,channel,click_time,attributed_time,is_attributed,dayofweek,hour,minute,second,time_segment,app_segment,channel_segment,os_segment
0,87540,12,1,13,497,2017-11-07 09:30:38,,0,1,9,30,38,0,3,0,3
1,105560,25,1,17,259,2017-11-07 13:40:27,,0,1,13,40,27,0,1,3,2
2,101424,12,1,19,212,2017-11-07 18:05:24,,0,1,18,5,24,2,3,1,4
3,94584,13,1,13,477,2017-11-07 04:58:08,,0,1,4,58,8,3,2,3,3
4,68413,12,1,1,178,2017-11-09 09:00:09,,0,3,9,0,9,0,3,2,0


In [39]:
df_train['device_count'] = df_train.groupby('device')['device'].transform('count')
df_train['device_segment']=df_train.apply(lambda row: 0 if row.device_count<4000 else 
                                                      (1 if row.device_count<10000 else 2) 
                                          , axis=1)

df_train= df_train.drop(['device_count'], axis=1)
df_train.head()

Unnamed: 0,ip,app,device,os,channel,click_time,attributed_time,is_attributed,dayofweek,hour,minute,second,time_segment,app_segment,channel_segment,os_segment,device_segment
0,87540,12,1,13,497,2017-11-07 09:30:38,,0,1,9,30,38,0,3,0,3,2
1,105560,25,1,17,259,2017-11-07 13:40:27,,0,1,13,40,27,0,1,3,2,2
2,101424,12,1,19,212,2017-11-07 18:05:24,,0,1,18,5,24,2,3,1,4,2
3,94584,13,1,13,477,2017-11-07 04:58:08,,0,1,4,58,8,3,2,3,3,2
4,68413,12,1,1,178,2017-11-09 09:00:09,,0,3,9,0,9,0,3,2,0,2


create count features:
- number of clicks for the same app before: app_click_count

In [10]:
#sort the data frame and reset the index 
df_train=df_train.sort_values(by=['app','click_time'],ascending=True)
df_train=df_train.reset_index(drop=True)
df_train.head()

Unnamed: 0,ip,app,device,os,channel,click_time,attributed_time,is_attributed,click_date,click_dayofweek,click_hour,click_minute,time_segment
0,66015,1,1,12,134,2017-11-06 16:00:23,,0,2017-11-06,0,16,0,1
1,925,1,1,8,134,2017-11-06 16:01:49,,0,2017-11-06,0,16,1,1
2,121105,1,1,18,134,2017-11-06 16:02:44,,0,2017-11-06,0,16,2,1
3,8694,1,1,18,135,2017-11-06 16:03:47,,0,2017-11-06,0,16,3,1
4,110476,1,1,13,377,2017-11-06 16:03:54,,0,2017-11-06,0,16,3,1


In [11]:
#calculate the number of clicks for the same app on the same day before the current click
prior_count=0
pre_app=0
pre_date=datetime.now().date()

idx=[]
app_click_counts=[]
for index, row in df_train.iterrows():
    cur_app=row['app']
    cur_date=row['click_date']
    
    if cur_app==pre_app and pre_date==cur_date:
        prior_count=prior_count+1
        #print(prior_count)
    else:
        prior_count=0
        pre_app=cur_app
        pre_date=cur_date
    idx.append(index)
    app_click_counts.append(prior_count)  

df_freq=pd.DataFrame({'id':idx,
                        #'tmp_id':tmp_ids,
                       'app_click_count':app_click_counts
                       }
                      )
df_freq.head()


Unnamed: 0,app_click_count,id
0,0,0
1,1,1
2,2,2
3,3,3
4,4,4
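
The `iterrows` loop above counts, for each click, how many earlier clicks share the same app and date. A possible vectorized equivalent, sketched here, is a sort followed by `groupby(...).cumcount()`. The helper name `prior_click_count` and the toy frame are assumptions for illustration; the sketch also assumes the DataFrame index is unique.

```python
import pandas as pd


def prior_click_count(df, keys, time_col="click_time"):
    """Number of earlier rows (by time_col) that share the same key values."""
    keys = list(keys)
    ordered = df.sort_values(keys + [time_col])
    # cumcount numbers the rows of each group 0, 1, 2, ... in their sorted order.
    counts = ordered.groupby(keys).cumcount()
    # Re-align the result with the original row order.
    return counts.reindex(df.index)


# Toy data (hypothetical): three clicks on app 1, one on app 2.
toy = pd.DataFrame({
    "app": [1, 1, 2, 1],
    "click_time": pd.to_datetime([
        "2017-11-07 10:00:00",
        "2017-11-07 09:00:00",
        "2017-11-07 09:30:00",
        "2017-11-08 08:00:00",
    ]),
})
toy["click_date"] = toy["click_time"].dt.date
toy["app_click_count"] = prior_click_count(toy, ["app", "click_date"])
print(toy["app_click_count"].tolist())
```

Passing `['app', 'channel', 'click_date']` or `['ip', 'app', 'device', 'os', 'channel']` as `keys` would cover the other two counting loops in this section, and removes the need for the string `tmp_id` column.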


In [12]:
#attach the app_click_count 
df_train['id']=df_train.index #re-assign id

df_train=pd.merge(df_train, df_freq,
         how='left', on='id')

df_train.head()


df_train=df_train.sort_values(by=['app','channel','click_time'],ascending=True)
df_train=df_train.reset_index(drop=True)
df_train.head()

Unnamed: 0,ip,app,device,os,channel,click_time,attributed_time,is_attributed,click_date,click_dayofweek,click_hour,click_minute,time_segment,id,app_click_count
0,159103,1,1,14,13,2017-11-06 16:53:38,,0,2017-11-06,0,16,53,1,39,39
1,78928,1,1,53,13,2017-11-07 03:43:12,,0,2017-11-07,1,3,43,3,425,221
2,202255,1,1,2,13,2017-11-07 04:36:53,,0,2017-11-07,1,4,36,3,483,279
3,201141,1,1,9,13,2017-11-07 08:43:54,,0,2017-11-07,1,8,43,0,704,500
4,204146,1,1,31,13,2017-11-07 11:57:57,,0,2017-11-07,1,11,57,0,891,687


In [13]:
#create the feature: app_channel_click_count

df_train=df_train.sort_values(by=['app','channel','click_time'],ascending=True)
df_train=df_train.reset_index(drop=True)


prior_count=0
pre_app=0
pre_channel=0
pre_date=datetime.now().date()

idx=[]
app_channel_click_counts=[]
for index, row in df_train.iterrows():
    cur_app=row['app']
    cur_date=row['click_date']
    cur_channel=row['channel']
    
    if cur_app==pre_app and cur_channel==pre_channel and pre_date==cur_date:
        prior_count=prior_count+1
        #print(prior_count)
    else:
        prior_count=0
        pre_app=cur_app
        pre_channel=cur_channel
        pre_date=cur_date
    
    idx.append(index)
    app_channel_click_counts.append(prior_count)  

df_freq=pd.DataFrame({'id':idx,
                        #'tmp_id':tmp_ids,
                       'app_channel_click_count':app_channel_click_counts
                       }
                      )
df_freq.head()



df_train['id']=df_train.index #re-assign id
df_train=pd.merge(df_train, df_freq,
         how='left', on='id')

df_train.head()

Unnamed: 0,ip,app,device,os,channel,click_time,attributed_time,is_attributed,click_date,click_dayofweek,click_hour,click_minute,time_segment,id,app_click_count,app_channel_click_count
0,159103,1,1,14,13,2017-11-06 16:53:38,,0,2017-11-06,0,16,53,1,0,39,0
1,78928,1,1,53,13,2017-11-07 03:43:12,,0,2017-11-07,1,3,43,3,1,221,0
2,202255,1,1,2,13,2017-11-07 04:36:53,,0,2017-11-07,1,4,36,3,2,279,1
3,201141,1,1,9,13,2017-11-07 08:43:54,,0,2017-11-07,1,8,43,0,3,500,2
4,204146,1,1,31,13,2017-11-07 11:57:57,,0,2017-11-07,1,11,57,0,4,687,3


In [14]:
#calculate how many clicks have been made via the same ip + app + device + os + channel
#df_click_freq=df_train_sample.groupby(['ip','app', 'device','os','channel']).size().reset_index(name='prior_click_count')
#df_click_freq['prior_click_count']=df_click_freq['prior_click_count']-1
#df_click_freq.sort_values(by=['prior_click_count'],ascending=False).head()
df_train['tmp_id']=df_train.apply(lambda row: str(row.ip)+'_' 
                                                + str(row.app)+'_'
                                                + str(row.device)+'_'
                                                + str(row.os)+'_'
                                                + str(row.channel), axis=1)

df_train.head()

Unnamed: 0,ip,app,device,os,channel,click_time,attributed_time,is_attributed,click_date,click_dayofweek,click_hour,click_minute,time_segment,id,app_click_count,app_channel_click_count,tmp_id
0,159103,1,1,14,13,2017-11-06 16:53:38,,0,2017-11-06,0,16,53,1,0,39,0,159103_1_1_14_13
1,78928,1,1,53,13,2017-11-07 03:43:12,,0,2017-11-07,1,3,43,3,1,221,0,78928_1_1_53_13
2,202255,1,1,2,13,2017-11-07 04:36:53,,0,2017-11-07,1,4,36,3,2,279,1,202255_1_1_2_13
3,201141,1,1,9,13,2017-11-07 08:43:54,,0,2017-11-07,1,8,43,0,3,500,2,201141_1_1_9_13
4,204146,1,1,31,13,2017-11-07 11:57:57,,0,2017-11-07,1,11,57,0,4,687,3,204146_1_1_31_13


In [15]:
df_train=df_train.sort_values(by=['tmp_id','click_time'],ascending=True)
df_train=df_train.reset_index(drop=True)
df_train.head()

Unnamed: 0,ip,app,device,os,channel,click_time,attributed_time,is_attributed,click_date,click_dayofweek,click_hour,click_minute,time_segment,id,app_click_count,app_channel_click_count,tmp_id
0,100002,3,1,41,280,2017-11-08 02:29:19,,0,2017-11-08,2,2,29,3,22604,926,434,100002_3_1_41_280
1,100005,2,1,17,219,2017-11-07 03:46:59,,0,2017-11-07,1,3,46,3,5978,775,68,100005_2_1_17_219
2,100005,9,1,19,232,2017-11-09 14:33:29,,0,2017-11-09,3,14,33,1,41199,3004,273,100005_9_1_19_232
3,100009,64,1,18,459,2017-11-08 09:18:54,,0,2017-11-08,2,9,18,0,99040,205,205,100009_64_1_18_459
4,100013,13,1,10,477,2017-11-09 07:05:23,,0,2017-11-09,3,7,5,0,64374,287,214,100013_13_1_10_477


In [16]:
#iterate through the dataframe to count how many clicks had been made before the current click by same ip+app+device+os+channel
pre_tmp_id=''
prior_count=0

idx=[]
#tmp_ids=[]
click_counts=[]
for index, row in df_train.iterrows():
    cur_tmp_id=row['tmp_id']
    
    if cur_tmp_id==pre_tmp_id:
        prior_count=prior_count+1
        #print(prior_count)
    else:
        prior_count=0
        pre_tmp_id=cur_tmp_id
    idx.append(index)
    #tmp_ids.append(cur_tmp_id)
    click_counts.append(prior_count)  

    
df_click_freq=pd.DataFrame({'id':idx,
                            #'tmp_id':tmp_ids,
                           'prior_click_count':click_counts
                           }
                          )


#df_train_sample=df_train_sample.drop(['prior_click_count','tmp_id_x','tmp_id_y'],axis=1)
df_train['id']=df_train.index

df_train=pd.merge(df_train, df_click_freq,
         how='left', on='id')



In [17]:
df_train=df_train.drop(['tmp_id'],axis=1)

df_train[df_train['prior_click_count']>1].head()

Unnamed: 0,ip,app,device,os,channel,click_time,attributed_time,is_attributed,click_date,click_dayofweek,click_hour,click_minute,time_segment,id,app_click_count,app_channel_click_count,prior_click_count
226,100182,12,1,19,259,2017-11-08 11:03:07,,0,2017-11-08,2,11,3,0,226,2554,255,2
431,100275,15,1,19,245,2017-11-08 05:40:09,,0,2017-11-08,2,5,40,3,431,857,299,2
450,100275,18,1,13,107,2017-11-08 21:05:35,,0,2017-11-08,2,21,5,2,450,2582,1219,2
451,100275,18,1,13,107,2017-11-09 01:51:41,,0,2017-11-09,3,1,51,2,451,276,143,3
458,100275,18,1,19,121,2017-11-08 17:43:37,,0,2017-11-08,2,17,43,2,458,2482,557,2


In [None]:
# Define all the groupby transformations
GROUPBY_AGGREGATIONS = [
    # Variance in day, for ip-app-channel
    {'groupby': ['ip','app','channel'], 'select': 'day', 'agg': 'var', 'type': 'float32'},
    # Variance in day, for ip-app-device
    {'groupby': ['ip','app','device'], 'select': 'day', 'agg': 'var', 'type': 'float32'},
    # Variance in day, for ip-app-os
    {'groupby': ['ip','app','os'], 'select': 'day', 'agg': 'var', 'type': 'float32'},
    
    # Variance in hour, for ip-app-channel
    #{'groupby': ['ip','app','channel'], 'select': 'hour', 'agg': 'var'},
    # Variance in hour, for ip-app-device
    #{'groupby': ['ip','app','device'], 'select': 'hour', 'agg': 'var'},
    # Variance in hour, for ip-app-os
    #{'groupby': ['ip','app','os'], 'select': 'hour', 'agg': 'var'},

    # Count, for ip-day
    #{'groupby': ['ip','day'], 'select': 'channel', 'agg': 'count'},
    # Count, for ip-day
    #{'groupby': ['ip','day'], 'select': 'device', 'agg': 'count'},
    # Count, for ip-day
    #{'groupby': ['ip','day'], 'select': 'os', 'agg': 'count'},
    
    # Count, for ip-hour
   # {'groupby': ['ip','hour'], 'select': 'channel', 'agg': 'count'},
    # Count, for ip-hour
    #{'groupby': ['ip','hour'], 'select': 'device', 'agg': 'count'},
    # Count, for ip-hour
    #{'groupby': ['ip','hour'], 'select': 'os', 'agg': 'count'},

    # Count, for ip-day-hour
    {'groupby': ['ip','day','hour'], 'select': 'channel', 'agg': 'count', 'type': 'uint32'},
    # Count, for ip-day-hour
    #{'groupby': ['ip','day','hour'], 'select': 'device', 'agg': 'count', 'type': 'uint32'},
    # Count, for ip-day-hour
   # {'groupby': ['ip','day','hour'], 'select': 'os', 'agg': 'count', 'type': 'uint32'},
    
    # Count, for ip-app
    {'groupby': ['ip', 'app'], 'select': 'channel', 'agg': 'count', 'type': 'uint32'},        
    # Count, for ip-app-os
    {'groupby': ['ip', 'app', 'os'], 'select': 'channel', 'agg': 'count', 'type': 'uint32'},
    # Count, for ip-app-day-hour
    {'groupby': ['ip','app','day','hour'], 'select': 'channel', 'agg': 'count', 'type': 'uint32'},
    
    # Mean hour, for ip-app-channel
    {'groupby': ['ip','app','channel'], 'select': 'hour', 'agg': 'mean', 'type': 'float32'}
]
# Apply all the groupby transformations
# NOTE: `train` must be a DataFrame that already has 'day' and 'hour' columns;
# earlier in this notebook `train` is still the file name string.
for spec in GROUPBY_AGGREGATIONS:
    print(f"Grouping by {spec['groupby']}, and aggregating {spec['select']} with {spec['agg']}")
    
    # Unique list of features to select
    all_features = list(set(spec['groupby'] + [spec['select']]))
    # Name of new feature
    new_feature = '{}_{}_{}'.format('_'.join(spec['groupby']), spec['agg'], spec['select'])
    # Perform the groupby
    gp = train[all_features]. \
        groupby(spec['groupby'])[spec['select']]. \
        agg(spec['agg']). \
        reset_index(). \
        rename(columns={spec['select']: new_feature})
    # Cast only the new feature; casting the whole frame would also change the group keys
    gp[new_feature] = gp[new_feature].astype(spec['type'])
    # Merge back to train
    train = train.merge(gp, on=spec['groupby'], how='left')
del gp
gc.collect()
print("End")

In [None]:
train['app']           = train['app'].astype('uint16')
train['channel']       = train['channel'].astype('uint16')
train['device']        = train['device'].astype('uint16')
train['ip']            = train['ip'].astype('uint32')
train['os']            = train['os'].astype('uint16')

### Exploratory Data Analysis

In [32]:
#the distribution of clicks by day of week
df_train['click_dayofweek'].value_counts()

2    34035
1    32393
3    28561
0     5011
Name: click_dayofweek, dtype: int64

### Find Feature Importance

In [None]:
train_X  = train[:len_train].drop(['click_id', 'is_attributed'], axis=1)
train_y  = train[:len_train]['is_attributed'].astype('uint8')
test_X   = train[len_train:].drop(['click_id', 'is_attributed'], axis=1)
test_id  = train[len_train:]['click_id'].astype('int')
del train

In [1]:
#path_train_X = path_out + 'train_X.csv'
#path_train_y = path_out + 'train_y.csv'
#print('Loading the pre training data...')
#train_X = pd.read_csv(path_train_X, header=0)
#train_y = pd.read_csv(path_train_y, header=0)
#print('End loading pre train data...')
predictors  = ['app','device','os', 'channel', 'hour', 'day', 'doy', 'wday','minute','second',
               'ip_app_channel_var_day',
               'ip_app_device_var_day',
               'ip_app_os_var_day',
               'ip_day_hour_count_channel',
               'ip_app_count_channel',
               'ip_app_os_count_channel',
               'ip_app_day_hour_count_channel',
               'ip_app_channel_mean_hour',
              'nip_day_hh']
categorical = ['app','device','os', 'channel', 'hour', 'day', 'doy', 'wday','minute','second']           

In [None]:
metrics = 'auc'
lgb_params = {
        'boosting_type': 'gbdt',
        'objective': 'binary',
        'metric':metrics,
        'learning_rate': 0.05,
        'num_leaves': 7,  # we should let it be smaller than 2^(max_depth)
        'max_depth': 4,  # -1 means no limit
        'min_child_samples': 100,  # Minimum number of data points needed in a child (min_data_in_leaf)
        'max_bin': 100,  # Number of bucketed bins for feature values
        'subsample': 0.7,  # Subsample ratio of the training instances.
        'subsample_freq': 1,  # Frequency of subsampling; <=0 disables it
        'colsample_bytree': 0.7,  # Subsample ratio of columns when constructing each tree.
        'min_child_weight': 0,  # Minimum sum of instance weight(hessian) needed in a child(leaf)
        'min_split_gain': 0,  # lambda_l1, lambda_l2 and min_gain_to_split to regularization
        'nthread': 8,
        'verbose': 0,
        'scale_pos_weight':99.7  # because the training data is extremely imbalanced
}
 
early_stopping_rounds = 100
num_boost_round       = 10000

print("Preparing validation datasets")
# split features and labels in one call so the rows stay aligned (no shuffling)
train_X, val_X, train_y, val_y = train_test_split(train_X, train_y, train_size=.95, shuffle=False)
print("End preparing validation datasets")

xgtrain = lgb.Dataset(train_X[predictors].values, label=train_y,feature_name=predictors,
                       categorical_feature=categorical)
xgvalid = lgb.Dataset(val_X[predictors].values, label=val_y,feature_name=predictors,
                      categorical_feature=categorical)
evals_results = {}
model_lgb     = lgb.train(lgb_params,xgtrain,valid_sets=[xgtrain, xgvalid], 
                          valid_names=['train','valid'], 
                           evals_result=evals_results, 
                           num_boost_round=num_boost_round,
                           early_stopping_rounds=early_stopping_rounds,
                           verbose_eval=10, feval=None)   



In [2]:
print("Features importance...")
gain = model_lgb.feature_importance('gain')
ft = pd.DataFrame({'feature':model_lgb.feature_name(), 
                   'split':model_lgb.feature_importance('split'), 
                   'gain':100 * gain / gain.sum()}).sort_values('gain', ascending=False)
print(ft.head(50))
ft.to_csv('importance_lightgbm.csv',index=True)
plt.figure()
ft = ft.sort_values('gain', ascending=True)
ft[['feature','gain']].head(50).plot(kind='barh', x='feature', y='gain', legend=False, figsize=(10, 10))
plt.gcf().savefig('features_importance.png')

Features importance...


NameError: name 'model_lgb' is not defined

In [None]:
sub = pd.DataFrame()
sub['click_id'] = test_id
print("Sub dimension "    + str(sub.shape))
print("Test_X dimension " + str(test_X.shape))

In [None]:
print("Predicting...")
sub['is_attributed'] = model_lgb.predict(test_X[predictors])  #
print("Writing...")
sub.to_csv('sub_Yatsenko_01.csv',index=False)
print("Done...")