# Project overview

- Problem : Companies advertise online, and click fraud can happen.
    - This results in misleading click data and wasted money.
    - 3 billion clicks per day, of which 90% are potentially fraudulent.
- Current Solution : measure the journey of a user's click across their portfolio, and flag IP addresses that produce lots of clicks.
    - Build an IP blacklist and a device blacklist.
- Challenge : build an algorithm that predicts whether a user will download an app after clicking a mobile app ad.

#### Goal : predict whether a person will download the app after clicking a mobile app advertisement. The data is highly imbalanced: only about 0.25% of the rows belong to the positive class.

## 1. File overview

- train.csv - the training set
- train_sample.csv - 100,000 randomly selected rows of the training data, for inspecting the data before downloading the full set
- test.csv - the test set
- sampleSubmission.csv - a sample submission file in the correct format 
- UPDATE: test_supplement.csv - This is a larger test set that was unintentionally released at the start of the competition. It is not necessary to use this data, but it is permitted to do so. The official test data is a subset of this data.

### Data fields
Each row of the training data contains a click record, with the following features.

- ip: IP address of the click.
- app: app ID for marketing.
- device: device type ID of the user's mobile phone (e.g., iPhone 6 Plus, iPhone 7, Huawei Mate 7, etc.)
    - Could fake clicks be more common on cheap or older devices?
- os: OS version ID of the user's mobile phone
    - Could fake clicks be more common on OS versions that have not been updated?
- channel: channel ID of the mobile ad publisher
    - Could the same channels show up repeatedly?
- click_time: timestamp of the click (UTC)
    - When an IP's click count exceeds some threshold, could smaller time gaps between its clicks indicate fake clicks?
- attributed_time: if the user downloaded the app after clicking an ad, this is the time of the app download
    - Very few values are present (drop this column).
- is_attributed: the target to be predicted, indicating whether the app was downloaded. Note that ip, app, device, os, and channel are encoded.
    - 0: the app was not downloaded (this is where fraudulent clicks fall).
    - 1: the app was downloaded (the marketing had a positive effect).

- What about users who arrive through an ad but do not click?

The test data is similar, with the following differences:

- click_id: reference for making predictions
- is_attributed: not included
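
The click-timing question raised under `click_time` can be explored with two per-IP features: how many clicks an IP produced, and how many seconds passed since that IP's previous click. The sketch below uses a tiny made-up click log; the column names `ip_click_count` and `secs_since_prev` are hypothetical and are not used elsewhere in this notebook.

```python
import pandas as pd

# Tiny made-up click log; the real columns come from train.csv.
clicks = pd.DataFrame({
    "ip": [1, 1, 1, 2, 2],
    "click_time": pd.to_datetime([
        "2017-11-07 09:30:00",
        "2017-11-07 09:30:02",
        "2017-11-07 09:30:03",
        "2017-11-07 10:00:00",
        "2017-11-07 12:00:00",
    ]),
})

# Sort so that consecutive rows within an ip are consecutive in time.
clicks = clicks.sort_values(["ip", "click_time"]).reset_index(drop=True)

# Number of clicks made by each ip (hypothetical feature name).
clicks["ip_click_count"] = clicks.groupby("ip")["click_time"].transform("count")

# Seconds since the previous click from the same ip; NaN for an ip's first click.
clicks["secs_since_prev"] = (
    clicks.groupby("ip")["click_time"].diff().dt.total_seconds()
)

print(clicks)
```

An IP with a high `ip_click_count` and consistently small `secs_since_prev` values would be a candidate for closer inspection. Whether that pattern actually separates fraudulent clicks has to be checked against `is_attributed`.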

In [1]:
import gc
import numpy as np
import pandas as pd
import seaborn as sns
sns.set(style="whitegrid", color_codes=True)
sns.set(font_scale=1)
pd.set_option('display.float_format', lambda x: '%.3f' % x)
import matplotlib.pyplot as plt
%matplotlib inline
import time
from subprocess import check_output
path = '../../../DEVELOPMENT/Fraud Detection/input/'
from sklearn.linear_model import LogisticRegression
from scipy.special import expit, logit
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score,confusion_matrix

In [2]:
pwd

'/Users/mac/Documents/GitHub/Kaggle_project/Fraud Detection'

In [3]:
data = pd.read_csv(path + 'train_sample.csv', parse_dates=['click_time'])
data.dtypes

ip                          int64
app                         int64
device                      int64
os                          int64
channel                     int64
click_time         datetime64[ns]
attributed_time            object
is_attributed               int64
dtype: object

In [4]:
data.head()

Unnamed: 0,ip,app,device,os,channel,click_time,attributed_time,is_attributed
0,87540,12,1,13,497,2017-11-07 09:30:38,,0
1,105560,25,1,17,259,2017-11-07 13:40:27,,0
2,101424,12,1,19,212,2017-11-07 18:05:24,,0
3,94584,13,1,13,477,2017-11-07 04:58:08,,0
4,68413,12,1,1,178,2017-11-09 09:00:09,,0


In [5]:
data.describe()

Unnamed: 0,ip,app,device,os,channel,is_attributed
count,100000.0,100000.0,100000.0,100000.0,100000.0,100000.0
mean,91255.88,12.048,21.771,22.818,268.832,0.002
std,69835.554,14.941,259.668,55.943,129.724,0.048
min,9.0,1.0,0.0,0.0,3.0,0.0
25%,40552.0,3.0,1.0,13.0,145.0,0.0
50%,79827.0,12.0,1.0,18.0,258.0,0.0
75%,118252.0,15.0,1.0,19.0,379.0,0.0
max,364757.0,551.0,3867.0,866.0,498.0,1.0


In [6]:
data.describe(include='all')

Unnamed: 0,ip,app,device,os,channel,click_time,attributed_time,is_attributed
count,100000.0,100000.0,100000.0,100000.0,100000.0,100000,227,100000.0
unique,,,,,,80350,227,
top,,,,,,2017-11-08 12:01:02,2017-11-08 05:11:04,
freq,,,,,,7,1,
first,,,,,,2017-11-06 16:00:00,,
last,,,,,,2017-11-09 15:59:51,,
mean,91255.88,12.048,21.771,22.818,268.832,,,0.002
std,69835.554,14.941,259.668,55.943,129.724,,,0.048
min,9.0,1.0,0.0,0.0,3.0,,,0.0
25%,40552.0,3.0,1.0,13.0,145.0,,,0.0


In [7]:
NAs = data.isnull().sum()
NAs

ip                     0
app                    0
device                 0
os                     0
channel                0
click_time             0
attributed_time    99773
is_attributed          0
dtype: int64
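
The 99,773 missing `attributed_time` values leave 227 non-null rows out of 100,000, or about 0.23% of the sample. A quick way to confirm that these are exactly the positive rows, and to measure the class balance, is sketched below on a made-up four-row frame; the same two expressions should work on `data` as loaded above, assuming the columns are unchanged.

```python
import pandas as pd

# Made-up frame mirroring the two relevant columns of train_sample.csv.
df = pd.DataFrame({
    "attributed_time": [None, "2017-11-08 05:11:04", None, None],
    "is_attributed": [0, 1, 0, 0],
})

# attributed_time should be present exactly when is_attributed == 1.
consistent = (df["attributed_time"].notnull() == (df["is_attributed"] == 1)).all()

# Share of positive rows (downloads).
positive_rate = df["is_attributed"].mean()

print(consistent, positive_rate)
```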

In [8]:
data['click_time_dt']= pd.to_datetime(data['click_time'])
dt= data['click_time_dt'].dt
data['day'] = dt.day.astype('uint8')
data['hour'] = dt.hour.astype('uint8')
data['minute'] = dt.minute.astype('uint8')
data.head()

Unnamed: 0,ip,app,device,os,channel,click_time,attributed_time,is_attributed,click_time_dt,day,hour,minute
0,87540,12,1,13,497,2017-11-07 09:30:38,,0,2017-11-07 09:30:38,7,9,30
1,105560,25,1,17,259,2017-11-07 13:40:27,,0,2017-11-07 13:40:27,7,13,40
2,101424,12,1,19,212,2017-11-07 18:05:24,,0,2017-11-07 18:05:24,7,18,5
3,94584,13,1,13,477,2017-11-07 04:58:08,,0,2017-11-07 04:58:08,7,4,58
4,68413,12,1,1,178,2017-11-09 09:00:09,,0,2017-11-09 09:00:09,9,9,0


In [9]:
data = data.drop(['click_time','click_time_dt','attributed_time'], axis = 1)
data.head()

Unnamed: 0,ip,app,device,os,channel,is_attributed,day,hour,minute
0,87540,12,1,13,497,0,7,9,30
1,105560,25,1,17,259,0,7,13,40
2,101424,12,1,19,212,0,7,18,5
3,94584,13,1,13,477,0,7,4,58
4,68413,12,1,1,178,0,9,9,0


In [10]:
y = data['is_attributed']
x = data.drop(['is_attributed'], axis = 1)

In [11]:
x.head()

Unnamed: 0,ip,app,device,os,channel,day,hour,minute
0,87540,12,1,13,497,7,9,30
1,105560,25,1,17,259,7,13,40
2,101424,12,1,19,212,7,18,5
3,94584,13,1,13,477,7,4,58
4,68413,12,1,1,178,9,9,0


In [12]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Hold out 30% of the sample for validation.
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=42)

# Baseline model: logistic regression on the encoded IDs and the time features.
stack_model = LogisticRegression()
stack_model.fit(x_train, y_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

In [13]:
roc = roc_auc_score(y_test, stack_model.predict_proba(x_test)[:,1])
print('ROC : ',roc)

ROC :  0.797519145314
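
With so few positive rows, a purely random split can leave the two folds with different positive rates. The sketch below shows one possible variation, not what the cells above do: it passes `stratify=y` to `train_test_split` and `class_weight="balanced"` to `LogisticRegression`. The data is synthetic (about 2% positives) so that the block runs on its own, and the resulting score says nothing about the competition data.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic imbalanced data, not the competition data.
n = 5000
X = rng.normal(size=(n, 3))
y = (X[:, 0] + rng.normal(scale=0.5, size=n) > 2.3).astype(int)

# stratify=y keeps the positive rate (almost) identical in both folds.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# class_weight="balanced" reweights samples inversely to class frequency.
model = LogisticRegression(class_weight="balanced", max_iter=1000)
model.fit(X_train, y_train)

auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
print("positive rate train/test:", y_train.mean(), y_test.mean())
print("ROC AUC:", auc)
```

Class weighting changes the predicted probabilities but does not necessarily change their ranking, so any effect on ROC AUC should be measured rather than assumed.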


In [14]:
test_df = pd.read_csv(path+"test.csv")

In [15]:
test_df.head()

Unnamed: 0,click_id,ip,app,device,os,channel,click_time
0,0,5744,9,1,3,107,2017-11-10 04:00:00
1,1,119901,9,1,3,466,2017-11-10 04:00:00
2,2,72287,21,1,19,128,2017-11-10 04:00:00
3,3,78477,15,1,13,111,2017-11-10 04:00:00
4,4,123080,12,1,13,328,2017-11-10 04:00:00


In [16]:
sub = pd.DataFrame()
sub['click_id'] = test_df['click_id'].astype('int')
test_df.drop(['click_id'], axis=1, inplace=True)
gc.collect()

28

In [17]:
test_df['click_time_dt']= pd.to_datetime(test_df['click_time'])
dt= test_df['click_time_dt'].dt
test_df['day'] = dt.day.astype('uint8')
test_df['hour'] = dt.hour.astype('uint8')
test_df['minute'] = dt.minute.astype('uint8')
test_df.head()

Unnamed: 0,ip,app,device,os,channel,click_time,click_time_dt,day,hour,minute
0,5744,9,1,3,107,2017-11-10 04:00:00,2017-11-10 04:00:00,10,4,0
1,119901,9,1,3,466,2017-11-10 04:00:00,2017-11-10 04:00:00,10,4,0
2,72287,21,1,19,128,2017-11-10 04:00:00,2017-11-10 04:00:00,10,4,0
3,78477,15,1,13,111,2017-11-10 04:00:00,2017-11-10 04:00:00,10,4,0
4,123080,12,1,13,328,2017-11-10 04:00:00,2017-11-10 04:00:00,10,4,0


In [18]:
test_df = test_df.drop(['click_time','click_time_dt'], axis = 1)
test_df.head()

Unnamed: 0,ip,app,device,os,channel,day,hour,minute
0,5744,9,1,3,107,10,4,0
1,119901,9,1,3,466,10,4,0
2,72287,21,1,19,128,10,4,0
3,78477,15,1,13,111,10,4,0
4,123080,12,1,13,328,10,4,0
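
The training and test frames go through the same time-feature steps (cells 8 to 9 and cells 17 to 18). One way to keep the two in sync is a small helper; `add_time_features` is a hypothetical name introduced here for illustration.

```python
import pandas as pd


def add_time_features(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with day/hour/minute derived from click_time, and click_time dropped."""
    out = df.copy()
    ts = pd.to_datetime(out["click_time"])
    out["day"] = ts.dt.day.astype("uint8")
    out["hour"] = ts.dt.hour.astype("uint8")
    out["minute"] = ts.dt.minute.astype("uint8")
    return out.drop(columns=["click_time"])


# One-row example shaped like the first row of test.csv (subset of columns).
example = pd.DataFrame({"ip": [5744], "click_time": ["2017-11-10 04:00:00"]})
features = add_time_features(example)
print(features)
```

For the training frame, `attributed_time` would still need to be dropped separately, since it does not exist in the test data.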


In [19]:
# Predicted download probability for every test row.
test_pred = stack_model.predict_proba(test_df)[:,1]

In [20]:
len(test_pred)

18790469

In [21]:
sub['is_attributed'] = test_pred
sub.head()

Unnamed: 0,click_id,is_attributed
0,0,0.0
1,1,0.0
2,2,0.001
3,3,0.001
4,4,0.0


In [None]:
sub.to_csv('Log_sub.csv', float_format='%.8f', index=False)

In [23]:
sub_df = pd.read_csv('Log_sub.csv')

In [24]:
sub_df.tail()

Unnamed: 0,click_id,is_attributed
18790464,18790464,0.0
18790465,18790465,0.0
18790466,18790467,0.0
18790467,18790466,0.001
18790468,18790468,0.0


In [25]:
sub_test_df = pd.read_csv(path+"sample_submission.csv")

In [26]:
sub_test_df.tail()

Unnamed: 0,click_id,is_attributed
18790464,18790464,0
18790465,18790465,0
18790466,18790467,0
18790467,18790466,0
18790468,18790468,0
