<a href="https://colab.research.google.com/github/TheHouseOfVermeulens/wernervermeulen.github.io/blob/master/TalkingData_AdTracking_Fraud_Detection_Challenge.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# DESCRIPTION: 
Fraud risk is everywhere, but for companies that advertise online, click fraud can happen at an overwhelming volume, resulting in misleading click data and wasted money. Ad channels can drive up costs by simply clicking on the ad at a large scale. With over 1 billion smart mobile devices in active use every month, China is the largest mobile market in the world and therefore suffers from huge volumes of fraudulent traffic.

[TalkingData](https://www.talkingdata.com), China’s largest independent big data service platform, covers over 70% of active mobile devices nationwide. They handle 3 billion clicks per day, of which 90% are potentially fraudulent. Their current approach to preventing click fraud for app developers is to measure the journey of a user’s click across their portfolio and flag IP addresses that produce lots of clicks but never end up installing apps. With this information, they've built an IP blacklist and a device blacklist.

While this approach has been successful, they want to stay one step ahead of fraudsters and have turned to the Kaggle community for help in further developing their solution. The challenge is to build an algorithm that predicts whether a user will download an app after clicking a mobile app ad. To support your modeling, they have provided a generous dataset covering approximately 200 million clicks over 4 days!

# INTRODUCTION: 
The goal of the case is to predict if a user will download an app after clicking through an ad. For this case I have used a small sample of the data, dropping 99% of negative records (where the app wasn't downloaded) to make the target more balanced.
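
The sampled file was prepared in advance, and the exact procedure is not shown in this notebook. As a rough sketch of the idea only, assuming a DataFrame with an `is_attributed` column, keeping every positive row and a 1% random sample of the negatives could look like this. The helper name `downsample_negatives`, the toy data, and the seed are illustrative assumptions.

```python
import pandas as pd

def downsample_negatives(df, target="is_attributed", keep_frac=0.01, seed=0):
    """Keep all positive rows and a random fraction of the negative rows."""
    positives = df[df[target] == 1]
    negatives = df[df[target] == 0].sample(frac=keep_frac, random_state=seed)
    # Restore the original row order after combining the two groups
    return pd.concat([positives, negatives]).sort_index()

# Toy data (made up): 1,000 clicks, 10 of which led to a download
toy = pd.DataFrame({"is_attributed": [1] * 10 + [0] * 990})
sampled = downsample_negatives(toy)
print(sampled["is_attributed"].value_counts().to_dict())
```

Downsampling like this changes the class ratio the model sees, so predicted probabilities are not calibrated to the original click stream. Ranking metrics such as AUC are less affected.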

In [0]:
import pandas as pd

click_data = pd.read_csv('../input/feature-engineering-data/train_sample.csv',
                         parse_dates=['click_time'])
click_data.head(10)

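Out[1]

|   | ip | app | device | os | channel | click_time | attributed_time | is_attributed |
|---|----|-----|--------|----|---------|------------|-----------------|---------------|
| 0 | 89489 | 3 | 1 | 13 | 379 | 2017-11-06 15:13:23 | NaN | 0 |
| 1 | 204158 | 35 | 1 | 13 | 21 | 2017-11-06 15:41:07 | 2017-11-07 08:17:19 | 1 |
| 2 | 3437 | 6 | 1 | 13 | 459 | 2017-11-06 15:42:32 | NaN | 0 |
| 3 | 167543 | 3 | 1 | 13 | 379 | 2017-11-06 15:56:17 | NaN | 0 |
| 4 | 147509 | 3 | 1 | 13 | 379 | 2017-11-06 15:57:01 | NaN | 0 |
| 5 | 71421 | 15 | 1 | 13 | 153 | 2017-11-06 16:00:00 | NaN | 0 |
| 6 | 76953 | 14 | 1 | 13 | 379 | 2017-11-06 16:00:01 | NaN | 0 |
| 7 | 187909 | 2 | 1 | 25 | 477 | 2017-11-06 16:00:01 | NaN | 0 |
| 8 | 116779 | 1 | 1 | 8 | 150 | 2017-11-06 16:00:01 | NaN | 0 |
| 9 | 47857 | 3 | 1 | 15 | 205 | 2017-11-06 16:00:01 | NaN | 0 |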

## Baseline Model

The first thing I need to do is construct a baseline model. All new features, processing, encodings, and feature selection should improve upon this baseline model. First I need to do a bit of feature engineering before training the model itself.

### 1) Features from timestamps
From the timestamps, create features for the day, hour, minute and second. Store these as new integer columns `day`, `hour`, `minute`, and `second` in a new DataFrame `clicks`.

In [0]:
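# Start the feature DataFrame as a copy so click_data stays untouched
clicks = click_data.copy()

# Split up the times; every component fits in an unsigned 8-bit integer
click_times = click_data['click_time']
clicks['day'] = click_times.dt.day.astype('uint8')
clicks['hour'] = click_times.dt.hour.astype('uint8')
clicks['minute'] = click_times.dt.minute.astype('uint8')
clicks['second'] = click_times.dt.second.astype('uint8')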
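
To see what the `.dt` accessor produces, here is a small self-contained sketch on two timestamps taken from the sample rows above. The `toy` and `parts` names are illustrative and not part of the notebook.

```python
import pandas as pd

toy = pd.DataFrame({"click_time": pd.to_datetime(["2017-11-06 15:13:23",
                                                  "2017-11-07 08:17:19"])})
parts = toy.copy()
for unit in ["day", "hour", "minute", "second"]:
    # getattr(series.dt, "hour") is the same as series.dt.hour
    parts[unit] = getattr(toy["click_time"].dt, unit).astype("uint8")

print(parts)
```

The `uint8` dtype holds values from 0 to 255, which is enough for days (1 to 31), hours (0 to 23), and minutes and seconds (0 to 59), and it keeps memory use low on large click logs.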

### 2) Label Encoding
For each of the categorical features `['ip', 'app', 'device', 'os', 'channel']`, I use scikit-learn's `LabelEncoder` to create new features in the `clicks` DataFrame. The new column names should be the original column name with `'_labels'` appended, like `ip_labels`.

In [0]:
from sklearn import preprocessing

cat_features = ['ip', 'app', 'device', 'os', 'channel']
label_encoder = preprocessing.LabelEncoder()

# Create new columns in clicks using preprocessing.LabelEncoder()
for feature in cat_features:
    encoded = label_encoder.fit_transform(clicks[feature])
    clicks[feature + '_labels'] = encoded
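
Because the loop refits the same `label_encoder` on each column, the object only remembers the last column it saw. If the mappings need to be reused later, for example to decode labels, one option is to keep a separate encoder per column. The sketch below shows this on made-up data; the `encoders` dictionary is a hypothetical addition, not something the notebook uses.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

toy = pd.DataFrame({"app": [3, 35, 6, 3], "channel": [379, 21, 459, 379]})

encoders = {}  # hypothetical: one fitted encoder per column
for col in ["app", "channel"]:
    encoders[col] = LabelEncoder()
    # Labels are assigned in sorted order of the distinct values: 0, 1, 2, ...
    toy[col + "_labels"] = encoders[col].fit_transform(toy[col])

print(toy)

# Each encoder can map its labels back to the original values
restored = encoders["channel"].inverse_transform(toy["channel_labels"])
print(list(restored))
```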

### 3) One-hot Encoding

Now that I have label-encoded features, does it make sense to use one-hot encoding for the categorical variables `ip`, `app`, `device`, `os`, or `channel`?

Answer: The `ip` column has 58,000 distinct values, so one-hot encoding it would create an extremely sparse matrix with 58,000 columns. That many columns will make the model run very slowly, so in general you want to avoid one-hot encoding features with many levels. LightGBM models work with label-encoded features, so there is no need to one-hot encode the categorical features.
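
A quick way to estimate the cost before encoding is to count the distinct values per column, since each one becomes its own one-hot column. The sketch below uses a tiny made-up frame, not the real data, to show the relationship.

```python
import pandas as pd

toy = pd.DataFrame({"ip": [89489, 204158, 3437, 167543, 89489],
                    "device": [1, 1, 1, 2, 1]})

# Distinct levels per column = number of one-hot columns that column would need
print(toy.nunique())

one_hot = pd.get_dummies(toy, columns=["ip", "device"])
print(one_hot.shape)  # rows x (distinct ips + distinct devices)
```

Here 4 distinct `ip` values and 2 distinct `device` values give 6 one-hot columns for 5 rows. The same arithmetic applied to tens of thousands of IP addresses explains why the one-hot matrix becomes impractical.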

## Train, validation, and test sets
With my baseline features ready, I need to split the data into training and validation sets. I should also hold out a test set to measure the final performance of the model.

### 4) Train/test splits with time series data
This is time series data. Are there any special considerations when creating train/test splits for time series? If so, what and why?

Answer: Since the model is meant to predict events in the future, it must also be validated on events in the future. If the data is mixed up between the training and test sets, then future data will leak into the model and the validation results will overestimate the performance on new data.
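
The difference can be made concrete with a small check on synthetic timestamps. This is an illustrative sketch only: the `overlaps` helper and the toy data are assumptions, and the test simply asks whether any training click is later than the earliest test click.

```python
import pandas as pd

toy = pd.DataFrame({"click_time": pd.date_range("2017-11-06", periods=100, freq="min")})

# Random split: shuffle, then take the last 20% as the test set
shuffled = toy.sample(frac=1, random_state=0)
rand_train, rand_test = shuffled[:80], shuffled[80:]

# Time-based split: sort, then take the most recent 20% as the test set
ordered = toy.sort_values("click_time")
time_train, time_test = ordered[:80], ordered[80:]

def overlaps(train, test):
    """True if any training click happens after the earliest test click."""
    return train["click_time"].max() > test["click_time"].min()

print("random split leaks future data:", overlaps(rand_train, rand_test))
print("time split leaks future data:  ", overlaps(time_train, time_test))
```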

### Create train/validation/test splits

Here I'll create training, validation, and test splits. First, the `clicks` DataFrame is sorted in order of increasing time. The first 80% of the rows are the train set, the next 10% are the validation set, and the last 10% are the test set.

In [0]:
feature_cols = ['day', 'hour', 'minute', 'second', 
                'ip_labels', 'app_labels', 'device_labels',
                'os_labels', 'channel_labels']

valid_fraction = 0.1
clicks_srt = clicks.sort_values('click_time')
valid_rows = int(len(clicks_srt) * valid_fraction)
train = clicks_srt[:-valid_rows * 2]
# valid size == test size, last two sections of the data
valid = clicks_srt[-valid_rows * 2:-valid_rows]
test = clicks_srt[-valid_rows:]
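
The same slicing can be wrapped in a small helper, which also makes one edge case visible: if `valid_rows` rounds down to 0, the slice `[:-0]` returns an empty train set. The function below is a sketch under that observation; `time_split` and its parameters are hypothetical names, and the toy frame is synthetic.

```python
import pandas as pd

def time_split(df, time_col="click_time", valid_fraction=0.1):
    """Return (train, valid, test): oldest rows first, newest rows last."""
    ordered = df.sort_values(time_col)
    n_hold = int(len(ordered) * valid_fraction)
    if n_hold == 0:
        # With n_hold == 0, ordered[:-0] would be empty, so fail loudly instead
        raise ValueError("not enough rows for the requested valid_fraction")
    train_part = ordered[:-2 * n_hold]
    valid_part = ordered[-2 * n_hold:-n_hold]
    test_part = ordered[-n_hold:]
    return train_part, valid_part, test_part

toy = pd.DataFrame({"click_time": pd.date_range("2017-11-06", periods=50, freq="min")})
tr, va, te = time_split(toy)
print(len(tr), len(va), len(te))
```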

### Train with LightGBM

Now I can create LightGBM dataset objects for each of the smaller datasets and train the baseline model.

In [0]:
import lightgbm as lgb

dtrain = lgb.Dataset(train[feature_cols], label=train['is_attributed'])
dvalid = lgb.Dataset(valid[feature_cols], label=valid['is_attributed'])
dtest = lgb.Dataset(test[feature_cols], label=test['is_attributed'])

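param = {'num_leaves': 64, 'objective': 'binary', 'metric': 'auc'}
num_round = 1000
# Stop if validation AUC has not improved for 10 rounds. The callback form is used
# because LightGBM 4.0 removed the early_stopping_rounds argument of lgb.train().
bst = lgb.train(param, dtrain, num_round, valid_sets=[dvalid],
                callbacks=[lgb.early_stopping(stopping_rounds=10)])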

## Evaluate the model
Finally, with the model trained, I'll evaluate its performance on the test set.

In [0]:
from sklearn import metrics

ypred = bst.predict(test[feature_cols])
score = metrics.roc_auc_score(test['is_attributed'], ypred)
print(f"Test score: {score}")

Out[17]
Test score: 0.9726727334566094
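
For intuition about what this number means: ROC AUC equals the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. The brute-force function below is a minimal sketch of that definition on four made-up predictions. `pairwise_auc` is an illustrative name, and the quadratic loop is only suitable for tiny inputs.

```python
def pairwise_auc(y_true, y_score):
    """Fraction of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
print(pairwise_auc(labels, scores))  # 3 of the 4 pairs are ordered correctly -> 0.75
```

Read this way, a score near 0.97 says the model ranks a downloading click above a non-downloading click in roughly 97% of such pairs.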

This is my baseline score for the model. When I transform features, add new ones, or perform feature selection, I should be improving on this score. However, since this is the test set, I only want to look at it at the end of all my manipulations. At the very end of this case I'll look at the test score again to see if I've improved on the baseline model.