<a href="https://colab.research.google.com/github/TheHouseOfVermeulens/wernervermeulen.github.io/blob/master/TalkingData_AdTracking_Fraud_Detection_Challenge.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# DESCRIPTION: 
Fraud risk is everywhere, but for companies that advertise online, click fraud can happen at an overwhelming volume, resulting in misleading click data and wasted money. Ad channels can drive up costs by simply clicking on the ad at a large scale. With over 1 billion smart mobile devices in active use every month, China is the largest
mobile market in the world and therefore suffers from huge volumes of fradulent traffic.

[TalkingData](https://www.talkingdata.com), China’s largest independent big data service platform, covers over 70% of active mobile devices nationwide. They handle 3 billion clicks per day, of which 90% are potentially fraudulent. Their current approach to prevent click fraud for app developers is to measure the journey of a user’s click across their portfolio, and flag IP addresses who produce lots of clicks, but never end up installing apps. With this information, they've built an IP blacklist and device blacklist.

While successful, they want to always be one step ahead of fraudsters and have turned to the Kaggle community for help in further developing their solution. Challenged to build an algorithm that predicts whether a user will download an app after clicking a mobile app ad. To support your modeling, they have provided a generous dataset covering approximately 200 million clicks over 4 days!

# INTRODUCTION: 
The goal of the case is to predict if a user will download an app after clicking through an ad. For this case I have used a small sample of the data, dropping 99% of negative records (where the app wasn't downloaded) to make the target more balanced.

In [0]:
import pandas as pd

click_data = pd.read_csv('../input/feature-engineering-data/train_sample.csv',
                         parse_dates=['click_time'])
click_data.head(10)

Out[1]

In [0]:
      ip 	 app 	device 	os 	channel 	click_time 	       attributed_time 	is_attributed
0 	89489 	 3 	 1 	    13 	 379 	   2017-11-06 15:13:23 	NaN                 	0
1 	204158 	35 	 1 	    13 	  21 	   2017-11-06 15:41:07 	2017-11-07 08:17:19 	1
2 	3437 	   6 	 1 	    13 	 459 	   2017-11-06 15:42:32 	NaN 	                0
3 	167543 	 3 	 1 	    13 	 379 	   2017-11-06 15:56:17 	NaN 	                0
4 	147509 	 3 	 1 	    13 	 379 	   2017-11-06 15:57:01 	NaN 	                0
5 	71421 	15 	 1 	    13 	 153 	   2017-11-06 16:00:00 	NaN 	                0
6 	76953 	14 	 1 	    13 	 379 	   2017-11-06 16:00:01 	NaN 	                0
7 	187909 	 2 	 1 	    25 	 477 	   2017-11-06 16:00:01 	NaN 	                0
8 	116779 	 1 	 1 	     8 	 150 	   2017-11-06 16:00:01 	NaN 	                0
9 	47857 	 3 	 1 	    15 	 205 	   2017-11-06 16:00:01 	NaN 	                0

## Baseline Model

The first thing I need to do is construct a baseline model. All new features, processing, encodings, and feature selection should improve upon this baseline model. First I need to do a bit of feature engineering before training the model itself.

### 1) Features from timestamps
From the timestamps, create features for the day, hour, minute and second. Store these as new integer columns `day`, `hour`, `minute`, and `second` in a new DataFrame `clicks`.

In [0]:
# Split up the times
click_times = click_data['click_time']
clicks['day'] = click_times.dt.day.astype('uint8')
clicks['hour'] = click_times.dt.hour.astype('uint8')
clicks['minute'] = click_times.dt.minute.astype('uint8')
clicks['second'] = click_times.dt.second.astype('uint8')

### 2) Label Encoding
For each of the categorical features `['ip', 'app', 'device', 'os', 'channel']`, I use scikit-learn's `LabelEncoder` to create new features in the `clicks` DataFrame. The new column names should be the original column name with `'_labels'` appended, like `ip_labels`.

In [0]:
from sklearn import preprocessing

cat_features = ['ip', 'app', 'device', 'os', 'channel']
label_encoder = preprocessing.LabelEncoder()

# Create new columns in clicks using preprocessing.LabelEncoder()
for feature in cat_features:
    encoded = label_encoder.fit_transform(clicks[feature])
    clicks[feature + '_labels'] = encoded

### 3) One-hot Encoding

Now I have label encoded features, does it make sense to use one-hot encoding for the categorical variables ip, app, device, os, or channel?

Answer: The ip column has 58,000 values, which means it will create an extremely sparse matrix with 58,000 columns. This many columns will make your model run very slow, so in general you want to avoid one-hot encoding features with many levels. LightGBM models work with label encoded features, so you don't actually need to one-hot encode the categorical features.

## Train, validation, and test sets
With my baseline features ready, we need to split our data into training and validation sets. I should also hold out a test set to measure the final accuracy of the model.

### 4) Train/test splits with time series data
This is time series data. Are there any special considerations when creating train/test splits for time series? If so, what and why?

Answer: Since our model is meant to predict events in the future, we must also validate the model on events in the future. If the data is mixed up between the training and test sets, then future data will leak in to the model and our validation results will overestimate the performance on new data.

### Create train/validation/test splits

Here I'll create training, validation, and test splits. First, `clicks` DataFrame is sorted in order of increasing time. The first 80% of the rows are the train set, the next 10% are the validation set, and the last 10% are the test set.

In [0]:
feature_cols = ['day', 'hour', 'minute', 'second', 
                'ip_labels', 'app_labels', 'device_labels',
                'os_labels', 'channel_labels']

valid_fraction = 0.1
clicks_srt = clicks.sort_values('click_time')
valid_rows = int(len(clicks_srt) * valid_fraction)
train = clicks_srt[:-valid_rows * 2]
# valid size == test size, last two sections of the data
valid = clicks_srt[-valid_rows * 2:-valid_rows]
test = clicks_srt[-valid_rows:]

### Train with LightGBM

Now I can create LightGBM dataset objects for each of the smaller datasets and train the baseline model.

In [0]:
import lightgbm as lgb

dtrain = lgb.Dataset(train[feature_cols], label=train['is_attributed'])
dvalid = lgb.Dataset(valid[feature_cols], label=valid['is_attributed'])
dtest = lgb.Dataset(test[feature_cols], label=test['is_attributed'])

param = {'num_leaves': 64, 'objective': 'binary'}
param['metric'] = 'auc'
num_round = 1000
bst = lgb.train(param, dtrain, num_round, valid_sets=[dvalid], early_stopping_rounds=10)

## Evaluate the model
Finally, with the model trained, I'll evaluate it's performance on the test set. 

In [0]:
from sklearn import metrics

ypred = bst.predict(test[feature_cols])
score = metrics.roc_auc_score(test['is_attributed'], ypred)
print(f"Test score: {score}")

Out[17]
Test score: 0.9726727334566094

This is my baseline score for the model. When I transform features, add new ones, or perform feature selection, I should be improving on this score. However, since this is the test set, I only want to look at it at the end of all our manipulations. At the very end of this case I'll look at the test score again to see if I've improved on the baseline model.

Now I can apply more advanced encodings to encode the categorical variables to improve your classifier model. The encodings I will implement are:

- Count Encoding
- Target Encoding
- Leave-one-out Encoding
- CatBoost Encoding
- Feature embedding with SVD  

Below I'll define a couple functions to help test the new encodings.

In [0]:
def get_data_splits(dataframe, valid_fraction=0.1):
    """ Splits a dataframe into train, validation, and test sets. First, orders by 
        the column 'click_time'. Set the size of the validation and test sets with
        the valid_fraction keyword argument.
    """

    dataframe = dataframe.sort_values('click_time')
    valid_rows = int(len(dataframe) * valid_fraction)
    train = dataframe[:-valid_rows * 2]
    # valid size == test size, last two sections of the data
    valid = dataframe[-valid_rows * 2:-valid_rows]
    test = dataframe[-valid_rows:]
    
    return train, valid, test

def train_model(train, valid, test=None, feature_cols=None):
    if feature_cols is None:
        feature_cols = train.columns.drop(['click_time', 'attributed_time',
                                           'is_attributed'])
    dtrain = lgb.Dataset(train[feature_cols], label=train['is_attributed'])
    dvalid = lgb.Dataset(valid[feature_cols], label=valid['is_attributed'])
    
    param = {'num_leaves': 64, 'objective': 'binary', 
             'metric': 'auc', 'seed': 7}
    num_round = 1000
    print("Training model!")
    bst = lgb.train(param, dtrain, num_round, valid_sets=[dvalid], 
                    early_stopping_rounds=20, verbose_eval=False)
    
    valid_pred = bst.predict(valid[feature_cols])
    valid_score = metrics.roc_auc_score(valid['is_attributed'], valid_pred)
    print(f"Validation AUC score: {valid_score}")
    
    if test is not None: 
        test_pred = bst.predict(test[feature_cols])
        test_score = metrics.roc_auc_score(test['is_attributed'], test_pred)
        return bst, valid_score, test_score
    else:
        return bst, valid_score

# Note here:

> I calculated the encodings from the training set only. If I include data from the validation and test sets into the encodings, I'll overestimate the model's performance. I should in general be vigilant to avoid leakage, that is, including any information from the validation and test sets into the model.



### 2) Count encodings

Here, encode the categorical features `['ip', 'app', 'device', 'os', 'channel']` using the count of each value in the data set. Using `CountEncoder` from the `category_encoders` library, I fit the encoding using the categorical feature columns defined in `cat_features`. Then apply the encodings to the train and validation sets, adding them as new columns with names suffixed `"_count"`.

In [0]:
import category_encoders as ce
from category_encoders import CountEncoder

cat_features = ['ip', 'app', 'device', 'os', 'channel']
train, valid, test = get_data_splits(clicks)

# Create the count encoder
count_enc = CountEncoder(cols=cat_features)

# Learn encoding from the training set
count_enc.fit(train[cat_features])

# Apply encoding to the train and validation sets as new columns
# Make sure to add `_count` as a suffix to the new columns
train_encoded = train.join(count_enc.transform(train[cat_features]).add_suffix('_count'))
valid_encoded = valid.join(count_enc.transform(valid[cat_features]).add_suffix('_count'))

# Train the model on the encoded datasets

In [0]:
# Train the model on the encoded datasets
# This can take around 30 seconds to complete
_ = train_model(train_encoded, valid_encoded)

Out[12]
Training model!
Validation AUC score: 0.9653051135205329

Count encoding improved our model's score!

### 4) Target encoding

Next I try some supervised encodings that use the labels (the targets) to transform categorical features. The first one is target encoding. I create the target encoder from the `category_encoders` library. Then, I learn the encodings from the training dataset, and then apply the encodings to all the datasets and retrain the model.

In [0]:
cat_features = ['ip', 'app', 'device', 'os', 'channel']
train, valid, test = get_data_splits(clicks)

# Create the target encoder. You can find this easily by using tab completion.
# Start typing ce. the press Tab to bring up a list of classes and functions.
target_enc = ce.TargetEncoder(cols=cat_features)

# Learn encoding from the training set. Use the 'is_attributed' column as the target.
target_enc.fit(train[cat_features], train['is_attributed'])

# Apply encoding to the train and validation sets as new columns
# Make sure to add `_target` as a suffix to the new columns
train_encoded = train.join(target_enc.transform(train[cat_features]).add_suffix('_target'))
valid_encoded = valid.join(target_enc.transform(valid[cat_features]).add_suffix('_target'))

train.head()
bst = train_model(train, valid)


Out[13]

In [0]:
Training model!
Validation AUC score: 0.9622743228943659

In [0]:
_ = train_model(train_encoded, valid_encoded)

Out[14]

In [0]:
Training model!
Validation AUC score: 0.9540530347873288

# Note here:
Target encoding attempts to measure the population mean of the target for each level in a categorical feature. This means when there is less data per level, the estimated mean will be further away from the "true" mean, there will be more variance. There is little data per IP address so it's likely that the estimates are much noisier than for the other features. The model will rely heavily on this feature since it is extremely predictive. This causes it to make fewer splits on other features, and those features are fit on just the errors left over accounting for IP address. So, the model will perform very poorly when seeing new IP addresses that weren't in the training data (which is likely most new data). Going forward, we'll leave out the IP feature when trying different encodings.

### 6) CatBoost Encoding

The CatBoost encoder is supposed to working well with the LightGBM model. Encode the categorical features with `CatBoostEncoder` and train the model on the encoded data again.

In [0]:
train, valid, test = get_data_splits(clicks)

# Create the CatBoost encoder
cb_enc = ce.CatBoostEncoder(cols=cat_features, random_state=7)

# Learn encoding from the training set
cb_enc.fit(train[cat_features], train['is_attributed'])

# Apply encoding to the train and validation sets as new columns
# Make sure to add `_cb` as a suffix to the new columns
train_encoded = train.join(cb_enc.transform(train[cat_features]).add_suffix('_cb'))
valid_encoded = valid.join(cb_enc.transform(valid[cat_features]).add_suffix('_cb'))


In [0]:
_ = train_model(train, valid)

Out[15]

In [0]:
Training model!
Validation AUC score: 0.9622743228943659

The CatBoost encodings work the best, so I'll keep those.

In [0]:
encoded = cb_enc.transform(clicks[cat_features])
for col in encoded:
    clicks.insert(len(clicks.columns), col + '_cb', encoded[col])

Now I am ready to generate completely new features from the data itself.

I'll create new features from the existing data. Again I'll compare the score lift for each new feature compared to a baseline model. 

### 1) Add interaction features

Here I add interaction features for each pair of categorical features (ip, app, device, os, channel). The easiest way to iterate through the pairs of features is with `itertools.combinations`. For each new column, join the values as strings with an underscore, so 13 and 47 would become `"13_47"`. As I add the new columns to the dataset, be sure to label encode the values.

In [0]:
import itertools

cat_features = ['ip', 'app', 'device', 'os', 'channel']
interactions = pd.DataFrame(index=clicks.index)
for col1, col2 in itertools.combinations(cat_features, 2):
        new_col_name = '_'.join([col1, col2])

        # Convert to strings and combine
        new_values = clicks[col1].map(str) + "_" + clicks[col2].map(str)

        encoder = preprocessing.LabelEncoder()
        interactions[new_col_name] = encoder.fit_transform(new_values)

In [0]:
clicks = clicks.join(interactions)
print("Score with interactions")
train, valid, test = get_data_splits(clicks)
_ = train_model(train, valid)

Out[20]

In [0]:
Score with interactions
Training model. Hold on a minute to see the validation score
Validation AUC score: 0.9626212895350978

# Generating numerical features

Adding interactions is a quick way to create more categorical features from the data. It's also effective to create new numerical features, I'll typically get a lot of improvement in the model. This takes a bit of brainstorming and experimentation to find features that work well.

For this case I'm going to implement functions that operate on Pandas Series. It can take multiple minutes to run these functions on the entire data set so instead I'll provide feedback by running my function on a smaller dataset.

### 2) Number of events in the past six hours

The first feature I create is the number of events from the same IP in the last six hours. It's likely that someone who is visiting often will download the app.

Implement a function `count_past_events` that takes a Series of click times (timestamps) and returns another Series with the number of events in the last hour. **Tip:** The `rolling` method is useful for this.

In [0]:

    def count_past_events(series, time_window='6H'):
        series = pd.Series(series.index, index=series)
        # Subtract 1 so the current event isn't counted
        past_events = series.rolling(time_window).count() - 1
        return past_events

In [0]:
# Loading in from saved Parquet file
past_events = pd.read_parquet('../input/feature-engineering-data/past_6hr_events.pqt')
clicks['ip_past_6hr_counts'] = past_events

train, valid, test = get_data_splits(clicks)
_ = train_model(train, valid, test)

In [0]:
Training model. Hold on a minute to see the validation score
Validation AUC score: 0.9647255487084245

### 3) Time since last event

Implement a function `time_diff` that calculates the time since the last event in seconds from a Series of timestamps. This will be ran like so:

```python
timedeltas = clicks.groupby('ip')['click_time'].transform(time_diff)
```

In [0]:
def time_diff(series):
    """ Returns a series with the time since the last timestamp in seconds """
    return series.diff().dt.total_seconds()

In [0]:
# Loading in from saved Parquet file
past_events = pd.read_parquet('../input/feature-engineering-data/time_deltas.pqt')
clicks['past_events_6hr'] = past_events

train, valid, test = get_data_splits(clicks.join(past_events))
_ = train_model(train, valid, test)

Out[21]

In [0]:
Training model. Hold on a minute to see the validation score
Validation AUC score: 0.9651116624672765

### 4) Number of previous app downloads

It's likely that if a visitor downloaded an app previously, it'll affect the likelihood they'll download one again. Implement a function `previous_attributions` that returns a Series with the number of times an app has been download (`'is_attributed' == 1`) before the current event.

In [0]:
def previous_attributions(series):
    """ Returns a series with the """
    sums = series.expanding(min_periods=2).sum() - series
    return sums

# Run & check my work

Again loading pre-computed data.

In [0]:
# Loading in from saved Parquet file
past_events = pd.read_parquet('../input/feature-engineering-data/downloads.pqt')
clicks['ip_past_6hr_counts'] = past_events

In [0]:
train, valid, test = get_data_splits(clicks)
_ = train_model(train, valid, test)

Out[23]

In [0]:
Training model. Hold on a minute to see the validation score
Validation AUC score: 0.965236652054989

I frequently want to pare these created features down for modeling. Thus, up next is Feature Selection

# Feature Selection
To find the most important features in the model.

Note here: Often I have seen hundreds and thousands of features after various encodings and feature generation. This can lead to two problems. First, the more features I have, the more likely I am to overfit to the training and validation sets. This will cause my model to perform worse at generalizing to new data.

Secondly, the more features I have, the longer it will take to train my model and optimize hyperparameters. Also, when building user-facing products, I want to make inference as fast as possible. Using fewer features can speed up inference at the cost of predictive performance.

To help with these issues, I want to use feature selection techniques to keep the most informative features for your model.

# Univariate Feature Selection

The simplest and fastest methods are based on univariate statistical tests. For each feature, measure how strongly the target depends on the feature using a statistical test like 𝜒2

or ANOVA.

From the scikit-learn feature selection module, feature_selection.SelectKBest returns the K best features given some scoring function. For our classification problem, the module provides three different scoring functions: 𝜒2

, ANOVA F-value, and the mutual information score. The F-value measures the linear dependency between the feature variable and the target. This means the score might underestimate the relation between a feature and the target if the relationship is nonlinear. The mutual information score is nonparametric and so can capture nonlinear relationships.

With SelectKBest, I define the number of features to keep, based on the score from the scoring function. Using .fit_transform(features, target) I get back an array with only the selected features.

# Note here: Which data to use for feature selection?

Since many feature selection methods require calculating statistics from the dataset, should you use all the data for feature selection?

Answer:: Including validation and test data within the feature selection is a source of leakage. You'll want to perform feature selection on the train set only, then use the results there to remove features from the validation and test sets.


In [0]:
from sklearn.feature_selection import SelectKBest, f_classif
feature_cols = clicks.columns.drop(['click_time', 'attributed_time', 'is_attributed'])
train, valid, test = get_data_splits(clicks)

# Do feature extraction on the training data only!
selector = SelectKBest(f_classif, k=40)
X_new = selector.fit_transform(train[feature_cols], train['is_attributed'])

    # Get back the features we've kept, zero out all other features
selected_features = pd.DataFrame(selector.inverse_transform(X_new), 
                                 index=train.index, 
                                 columns=feature_cols)

    # Dropped columns have values of all 0s, so var is 0, drop them
dropped_columns = selected_features.columns[selected_features.var() == 0]

In [0]:
_ = train_model(train.drop(dropped_columns, axis=1), 
                valid.drop(dropped_columns, axis=1),
                test.drop(dropped_columns, axis=1))

Out[30]

In [0]:
Training model!
Validation AUC score: 0.9625481759576047

# Use L1 regularization for feature selection

Now I try a more powerful approach using L1 regularization. Implementing a function `select_features_l1` that returns a list of features to keep.

Using a `LogisticRegression` classifier model with an L1 penalty to select the features. For the model, set the random state to 7 and the regularization parameter to 0.1. Fit the model then use `SelectFromModel` to return a model with the selected features.

In [0]:
from sklearn.linear_model import LogisticRegression
from sklearn.feature_selection import SelectFromModel

def select_features_l1(X, y):
    """ Return selected features using logistic regression with an L1 penalty """
    logistic = LogisticRegression(C=0.1, penalty="11", random_state=7).fit(X,y)
    model = SelectFromModel(logistic, prefit=True)

    X_new = model.transform(X)

        # Get back the kept features as a DataFrame with dropped columns as all 0s
    selected_features = pd.DataFrame(model.inverse_transform(X_new), 
                                        index=X.index,
                                        columns=X.columns)

        # Dropped columns have values of all 0s, keep other columns 
    cols_to_keep = selected_features.columns[selected_features.var() != 0]

    return cols_to_keep


In [0]:
dropped_columns = feature_cols.drop(selected_features)
_ = train_model(train.drop(dropped_columns, axis=1), 
                valid.drop(dropped_columns, axis=1),
                test.drop(dropped_columns, axis=1))

Out[34]

In [0]:
Training model!
Validation AUC score: 0.9658334271834417