# TalkingData AdTracking Fraud Detection Challenge
# Can you detect fraudulent click traffic for mobile app ads?
# https://www.kaggle.com/c/talkingdata-adtracking-fraud-detection

**This notebook is inspired by an exercise in the [Feature Engineering](https://www.kaggle.com/learn/feature-engineering) course.**  
**You can reference the tutorial at [this link](https://www.kaggle.com/matleonard/baseline-model).**  
**You can reference my notebook at [this link](https://www.kaggle.com/georgezoto/feature-engineering-baseline-model)**  

---


# Introduction

In the exercise, you will work with data from the TalkingData AdTracking competition.  The goal of the competition is to predict if a user will download an app after clicking through an ad. 

<center><a href="https://www.kaggle.com/c/talkingdata-adtracking-fraud-detection"><img src="https://i.imgur.com/srKxEkD.png" width=600px></a></center>

For this course you will use a small sample of the data, dropping 99% of negative records (where the app wasn't downloaded) to make the target more balanced.
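
The downsampling described above can be sketched as follows. This is only an illustration on synthetic data, not the script that produced the course sample; the toy columns, the assumed 0.2% positive rate, and the `random_state` value are all made up for the example.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the click log (assumed ~0.2% positives)
rng = np.random.default_rng(0)
clicks_toy = pd.DataFrame({
    "app": rng.integers(0, 50, size=100_000),
    "is_attributed": (rng.random(100_000) < 0.002).astype(int),
})

positives = clicks_toy[clicks_toy["is_attributed"] == 1]
negatives = clicks_toy[clicks_toy["is_attributed"] == 0]

# Keep every positive record but only 1% of the negatives
negatives_kept = negatives.sample(frac=0.01, random_state=0)
balanced = pd.concat([positives, negatives_kept]).sort_index()

print("positive rate before:", clicks_toy["is_attributed"].mean())
print("positive rate after: ", balanced["is_attributed"].mean())
```

Because only negatives are dropped, the positive rate in the sample is much higher than in the raw log, so predicted probabilities from a model trained on it are not calibrated to the original data.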

After building a baseline model, you'll be able to see how your feature engineering and selection efforts improve the model's performance.

## Setup

Begin by running the code cell below to set up the exercise.

## Baseline Model

The first thing you'll do is construct a baseline model. We'll begin by looking at the data.

Data fields  
Each row of the training data contains a click record, with the following features.  

- ip: ip address of click.
- app: app id for marketing.
- device: device type id of user mobile phone (e.g., iphone 6 plus, iphone 7, huawei mate 7, etc.)
- os: os version id of user mobile phone
- channel: channel id of mobile ad publisher
- click_time: timestamp of click (UTC)
- attributed_time: if the user downloads the app after clicking an ad, this is the time of the app download
- is_attributed: the target that is to be predicted, indicating the app was downloaded  

Note that ip, app, device, os, and channel are encoded.

The test data is similar, with the following differences:
- click_id: reference for making predictions
- is_attributed: not included

In [None]:
import pandas as pd

## ⚠️ Your notebook tried to allocate more memory than is available. It has restarted. ⚠️

In [None]:
#Sample data - (100000, 8)
#click_data = pd.read_csv('../input/talkingdata-adtracking-fraud-detection/train_sample.csv', parse_dates=['click_time'])

#Full data - too large to fit in this notebook's RAM (see the warning above), so only part of it is read

#Read only first limit rows
limit = 20_000_000

#Read only these columns - skip attributed_time 
usecols = ['ip', 'app', 'device', 'os', 'channel', 'click_time', 'is_attributed']

## Competition data

In [None]:
#click_data = pd.read_csv('../input/talkingdata-adtracking-fraud-detection/train.csv', nrows=limit, usecols=usecols, parse_dates=['click_time'])
competition_data = pd.read_csv('../input/talkingdata-adtracking-fraud-detection/train.csv', nrows=limit, usecols=usecols, parse_dates=['click_time'])
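
One way to ease the memory pressure behind the warning above is to pass compact integer dtypes to `read_csv`. The sketch below uses a tiny in-memory CSV as a stand-in for `train.csv`; the rows are made up, and the chosen dtypes are assumptions that would need to be checked against the real value ranges before use.

```python
import io
import pandas as pd

# Tiny stand-in for train.csv (values are made up for illustration)
csv_text = """ip,app,device,os,channel,click_time,attributed_time,is_attributed
83230,3,1,13,379,2017-11-06 14:32:21,,0
17357,3,1,19,379,2017-11-06 14:33:34,,0
35810,3,1,13,379,2017-11-06 14:34:12,2017-11-06 15:01:02,1
"""

# Assumed dtypes: smaller unsigned integers instead of the default int64
dtypes = {
    "ip": "uint32",
    "app": "uint16",
    "device": "uint16",
    "os": "uint16",
    "channel": "uint16",
    "is_attributed": "uint8",
}
usecols = ["ip", "app", "device", "os", "channel", "click_time", "is_attributed"]

sample = pd.read_csv(
    io.StringIO(csv_text),
    usecols=usecols,
    dtype=dtypes,
    parse_dates=["click_time"],
)
print(sample.dtypes)
print(sample.memory_usage(deep=True).sum(), "bytes")
```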

In [None]:
competition_data.describe()

In [None]:
competition_data['is_attributed'].value_counts()

In [None]:
competition_data['is_attributed'].value_counts(normalize=True)

## Feature-engineering data from the Kaggle course

In [None]:
click_data = pd.read_csv('../input/feature-engineering-data/train_sample.csv', nrows=limit, usecols=usecols, parse_dates=['click_time'])

In [None]:
click_data.describe()

In [None]:
click_data['is_attributed'].value_counts()

In [None]:
click_data['is_attributed'].value_counts(normalize=True)

In [None]:
print(click_data.shape)
click_data.head()

### Competition submission step

In [None]:
competition_test_data = pd.read_csv('../input/talkingdata-adtracking-fraud-detection/test.csv', parse_dates=['click_time'])

In [None]:
print(competition_test_data.shape)
competition_test_data.head()

### 1) Construct features from timestamps

Notice that the `click_data` DataFrame has a `'click_time'` column with timestamp data.

Use this column to create features for the corresponding day, hour, minute and second. 

Store these as new integer columns `day`, `hour`, `minute`, and `second` in a new DataFrame `clicks`.

In [None]:
# Add new columns for timestamp features day, hour, minute, and second
clicks = click_data.copy()
clicks['day'] = clicks['click_time'].dt.day.astype('uint8')
# Fill in the rest
clicks['hour'] = clicks['click_time'].dt.hour.astype('uint8')
clicks['minute'] = clicks['click_time'].dt.minute.astype('uint8')
clicks['second'] = clicks['click_time'].dt.second.astype('uint8')

In [None]:
clicks.head()

### Competition submission step

In [None]:
# Add new columns for timestamp features day, hour, minute, and second
competition_test_data = competition_test_data.copy()
competition_test_data['day'] = competition_test_data['click_time'].dt.day.astype('uint8')
# Fill in the rest
competition_test_data['hour'] = competition_test_data['click_time'].dt.hour.astype('uint8')
competition_test_data['minute'] = competition_test_data['click_time'].dt.minute.astype('uint8')
competition_test_data['second'] = competition_test_data['click_time'].dt.second.astype('uint8')

In [None]:
competition_test_data.head()

### Question: is `LabelEncoder` the right tool for input features?

The scikit-learn documentation for `sklearn.preprocessing.LabelEncoder` says:

Encode target labels with value between 0 and n_classes-1.

This transformer should be used to encode **target values**, i.e. y, and not the input X.
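
One possible answer to the question above: for input features, scikit-learn provides `OrdinalEncoder`, which in scikit-learn 0.24 and later can map categories not seen during `fit` to a chosen value. The sketch below is an alternative on toy data, not what this notebook does further down; the frames `train_toy` and `test_toy` are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Toy data: app id 99 appears only in the test frame
train_toy = pd.DataFrame({"app": [3, 3, 9, 12], "os": [13, 19, 13, 22]})
test_toy = pd.DataFrame({"app": [9, 99], "os": [19, 13]})

encoder = OrdinalEncoder(
    handle_unknown="use_encoded_value",  # do not raise on unseen categories
    unknown_value=-1,                    # label assigned to unseen categories
    dtype=np.int64,
)
encoder.fit(train_toy)

train_encoded = encoder.transform(train_toy)
test_encoded = encoder.transform(test_toy)
print(test_encoded)
```

Unlike `LabelEncoder`, a single `OrdinalEncoder` handles all columns at once and keeps a separate category list per column.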

### 2) Label Encoding
For each of the categorical features `['ip', 'app', 'device', 'os', 'channel']`, use scikit-learn's `LabelEncoder` to create new features in the `clicks` DataFrame. The new column names should be the original column name with `'_labels'` appended, like `ip_labels`.

## ⚠️ ValueError: y contains previously unseen labels: [0, 1, 2,... ⚠️

In [None]:
#for feature in cat_features:
    #encoder = preprocessing.LabelEncoder()
    
    #encoded = encoder.fit_transform(clicks[feature])
    #clicks[feature+'_labels'] = encoded
    
    ##encoded[feature+'_labels'] = clicks[feature].apply(encoder.fit_transform) - Incorrect 
    ##ValueError: y should be a 1d array, got an array of shape () instead.
    
    ##Competition submission
    #competition_encoded = encoder.transform(competition_test_data[feature]) 
    ##ValueError: y contains previously unseen labels: [0, 2, 3, 4, 5,
    #competition_test_data[feature+'_labels'] = competition_encoded

## 😀 Workaround #1: encode unseen labels as `unknown_value = -1` 😀
Not the best solution to `ValueError: y contains previously unseen labels: [0, 1, 2,...`, but it works.

- ⚠️ Make sure `unknown_value` is an int (like the other labels) or you will not be able to predict in the end ⚠️
- https://stackoverflow.com/questions/21057621/sklearn-labelencoder-with-never-seen-before-values

## Workaround #2: fit the encoder on train and test together (potential data leakage???)
Source: http://kagglesolutions.com/r/feature-engineering--label-encoding

To resolve this issue we will first concatenate X_train and X_test together and then perform label encoding. You can have everything in a loop for all of your categorical features.

```
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder

X_train = pd.DataFrame({'x1': np.random.random(5), 'x2': ['cat', 'cat', 'dog', 'cat', 'dog']})
X_test = pd.DataFrame({'x1': np.random.random(5), 'x2': ['cat', 'cat', 'dog', 'rat', 'dog']})

categorical_features = ['x2']

# make an encoder object
encoder = LabelEncoder()

# fit and transform feature x2
for col in categorical_features:
    encoder.fit(pd.concat([X_train[col], X_test[col]], axis=0, sort=False))
    X_train[col] = encoder.transform(X_train[col])
    X_test[col] = encoder.transform(X_test[col])
    
print(X_train.head(), '\n')
print(X_test.head(), '\n')
```

In [None]:
# Not the best solution to ValueError: y contains previously unseen labels: [0, 1, 2,...
unknown_value = -1 #Make sure this is int (as other labels) or you will not be able to predict in the end ⚠️

from sklearn import preprocessing

cat_features = ['ip', 'app', 'device', 'os', 'channel']

#encoder = preprocessing.LabelEncoder() - Incorrect, we need a label encoder for each feature
# Create new columns in clicks using preprocessing.LabelEncoder()

for feature in cat_features:
    #New encoder for each feature
    encoder = preprocessing.LabelEncoder()
    #Fit on the values of this feature seen in the training data
    encoder.fit(clicks[feature])
    #Build a dict mapping each original value to its integer label
    le_dict = dict(zip(encoder.classes_, encoder.transform(encoder.classes_)))
    #Encode the training values (none of them can be unknown here)
    encoded = clicks[feature].apply(lambda x: le_dict.get(x, unknown_value))
    clicks[feature+'_labels'] = encoded
    
    #Competition submission
    #The dict lookup avoids "ValueError: y contains previously unseen labels" from encoder.transform:
    #values not seen during fit get the unknown_value label instead
    competition_encoded = competition_test_data[feature].apply(lambda x: le_dict.get(x, unknown_value))
    competition_test_data[feature+'_labels'] = competition_encoded
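
The per-row dictionary lookup with `apply` in the cell above can be slow on millions of rows. A vectorized variant using `Series.map` is sketched below on toy data; `train_col` and `test_col` are hypothetical stand-ins for one feature column, and the dict is built by hand here only to keep the example free of scikit-learn.

```python
import pandas as pd

unknown_value = -1  # same sentinel as in the cell above

train_col = pd.Series([10, 20, 20, 30])
test_col = pd.Series([20, 40, 10])  # 40 is not in the training data

# Sorted unique training values -> consecutive integer labels,
# which is the same order LabelEncoder uses for its classes_
le_dict = {value: label for label, value in enumerate(sorted(train_col.unique()))}

# Series.map yields NaN for keys missing from the dict,
# so fill with the sentinel and cast back to an integer dtype
test_labels = test_col.map(le_dict).fillna(unknown_value).astype("int64")
print(test_labels.tolist())
```

The cast back to `int64` matters: after `fillna` the column is float, and the earlier warning about keeping `unknown_value` an int applies here as well.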

In [None]:
clicks.head()

In [None]:
competition_test_data.head(20)

## How many unknown_value did we get in the test dataset?

In [None]:
train_ip_labels_unknowns = int((clicks['ip_labels'] == unknown_value).sum())
train_ip_labels_unknowns

In [None]:
compet_test_ip_labels_unknowns = int((competition_test_data['ip_labels'] == unknown_value).sum())
compet_test_ip_labels_unknowns

In [None]:
my_own_metrics={'limit': min(limit, clicks.shape[0]),
                'competition_test_data':competition_test_data.shape[0],
                'train ip_labels unknowns': train_ip_labels_unknowns,
                'compet_test ip_labels unknowns':compet_test_ip_labels_unknowns}
my_own_metrics

In [None]:
competition_test_data.shape

## Train, validation, and test sets
With our baseline features ready, we need to split our data into training and validation sets. We should also hold out a test set to measure the final accuracy of the model.

### 4) Train/test splits with time series data
This is time series data. Are there any special considerations when creating train/test splits for time series? If so, what are they?

### Create train/validation/test splits

Here we'll create training, validation, and test splits. First, the `clicks` DataFrame is sorted in order of increasing time. The first 80% of the rows are the train set, the next 10% are the validation set, and the last 10% are the test set.

In [None]:
clicks.head()

In [None]:
feature_cols = ['day', 'hour', 'minute', 'second', 
                'ip_labels', 'app_labels', 'device_labels',
                'os_labels', 'channel_labels']

valid_fraction = 0.1
clicks_srt = clicks.sort_values('click_time')
valid_rows = int(len(clicks_srt) * valid_fraction)
train = clicks_srt[:-valid_rows * 2]
# valid size == test size, last two sections of the data
valid = clicks_srt[-valid_rows * 2:-valid_rows]
test = clicks_srt[-valid_rows:]

In [None]:
print(clicks.shape,'\n',train.shape,'\n',valid.shape,'\n',test.shape)
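
With time-ordered data, the validation and test rows should come after the training rows so the model is never evaluated on clicks that precede what it was trained on. The sketch below repeats the slicing logic from the cell above on a made-up toy frame and checks that the three splits are in chronological order; the `*_toy` names are hypothetical.

```python
import pandas as pd

# Toy click log with shuffled timestamps (made-up data)
toy = pd.DataFrame({
    "click_time": pd.date_range("2017-11-06", periods=20, freq="min"),
}).sample(frac=1, random_state=0)

valid_fraction = 0.1
toy_srt = toy.sort_values("click_time")
valid_rows = int(len(toy_srt) * valid_fraction)
train_toy = toy_srt[:-valid_rows * 2]
valid_toy = toy_srt[-valid_rows * 2:-valid_rows]
test_toy = toy_srt[-valid_rows:]

# Each later split should start no earlier than the previous one ends
assert train_toy["click_time"].max() <= valid_toy["click_time"].min()
assert valid_toy["click_time"].max() <= test_toy["click_time"].min()
print(len(train_toy), len(valid_toy), len(test_toy))
```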

### Train with LightGBM

Now we can create LightGBM dataset objects for each of the smaller datasets and train the baseline model.

In [None]:
import lightgbm as lgb

dtrain = lgb.Dataset(train[feature_cols], label=train['is_attributed'])
dvalid = lgb.Dataset(valid[feature_cols], label=valid['is_attributed'])
dtest = lgb.Dataset(test[feature_cols], label=test['is_attributed'])

param = {'num_leaves': 64, 'objective': 'binary'}
param['metric'] = 'auc'
num_round = 1000
#bst = lgb.train(param, dtrain, num_round, valid_sets=[dvalid], early_stopping_rounds=10)

In [None]:
#type(bst) #lightgbm.basic.Booster

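### Why `lgb.plot_metric(bst, ...)` raises `TypeError: booster must be dict or LGBMModel`

`lgb.plot_metric` does not accept the `Booster` object that `lgb.train()` returns. Its `booster` argument must be the dictionary of evaluation results recorded during training, or an `LGBMModel` instance:

- booster (dict or LGBMModel) – Dictionary returned from lightgbm.train() or LGBMModel instance.
https://lightgbm.readthedocs.io/en/latest/pythonapi/lightgbm.plot_metric.html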

In [None]:
#lgb.plot_metric(bst, metric=metrics.roc_auc_score, dataset_names=[dtrain, dvalid, dtest]) #??? TypeError: booster must be dict or LGBMModel
#, ax=None, xlim=None, ylim=None, title='Metric during training', xlabel='Iterations', ylabel='auto', figsize=None, dpi=None, grid=True)[source]

- evals_result (dict or None, optional (default=None)) – This dictionary used to store all evaluation results of all the items in valid_sets.

https://lightgbm.readthedocs.io/en/latest/pythonapi/lightgbm.train.html

- validation_metrics inspired by:
https://github.com/Microsoft/LightGBM/blob/2e93cdab9eee02d4d7f5cb3b6b31128dec94e25e/examples/python-guide/plot_example.py

In [None]:
#Record eval results for plotting
validation_metrics = {}  

bst = lgb.train(param, 
                dtrain, 
                num_round, 
                valid_sets=[dvalid], 
                early_stopping_rounds=10,
                evals_result=validation_metrics,
                verbose_eval=10)

In [None]:
validation_metrics

### Plot validation AUC during training

In [None]:
import matplotlib.pyplot as plt
plt.rcParams["figure.figsize"] = [16,9]

ax = lgb.plot_metric(validation_metrics, metric='auc');
#plt.show();

## ML Explainability and taking a closer look at feature importance, individual trees
Inspired by: https://github.com/Microsoft/LightGBM/blob/2e93cdab9eee02d4d7f5cb3b6b31128dec94e25e/examples/python-guide/plot_example.py


In [None]:
print('Plot feature importances...')
ax = lgb.plot_importance(bst, max_num_features=15)
plt.show()

In [None]:
bst.num_trees()

In [None]:
tree_index = 0
print('Plot '+str(tree_index)+'th tree...')
ax = lgb.plot_tree(bst, tree_index=tree_index, figsize=(64, 36), show_info=['split_gain'])
plt.show()

In [None]:
print('Plot '+str(tree_index)+'th tree with graphviz...')
graph = lgb.create_tree_digraph(bst, tree_index=tree_index, name='Tree'+str(tree_index))
graph.render(view=True)

### Download 'Tree...gv.pdf' from the working directory --->

## Evaluate the model
Finally, with the model trained, we evaluate its performance on the test set. 

In [None]:
from sklearn import metrics

ypred = bst.predict(test[feature_cols])
score = metrics.roc_auc_score(test['is_attributed'], ypred)
print(f"Test score: {score}")
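
To make the metric concrete: ROC AUC can be read as the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. The pure-Python function below computes it that way on a tiny made-up example. It is a teaching sketch with quadratic cost, not a replacement for `metrics.roc_auc_score`; the helper name `pairwise_auc` is hypothetical.

```python
def pairwise_auc(labels, scores):
    """Share of (positive, negative) pairs ranked correctly; ties count as half."""
    pos = [s for label, s in zip(labels, scores) if label == 1]
    neg = [s for label, s in zip(labels, scores) if label == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
print(pairwise_auc(labels, scores))  # 3 of 4 pairs are ordered correctly -> 0.75
```

A score of 0.5 corresponds to random ranking and 1.0 to a perfect ranking, which is why AUC is a reasonable metric even when the classes are imbalanced.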

In [None]:
my_own_metrics['test score'] = score

In [None]:
my_own_metrics

In [None]:
#plt.bar(my_own_metrics.keys(), my_own_metrics.values())

This will be our baseline score for the model. When we transform features, add new ones, or perform feature selection, we should be improving on this score. However, since this is the test set, we only want to look at it at the end of all our manipulations. At the very end of this course you'll look at the test score again to see if you improved on the baseline model.

# Keep Going
Now that you have a baseline model, you are ready to **[use categorical encoding techniques](https://www.kaggle.com/matleonard/categorical-encodings)** to improve it.

---





# Submit test predictions to TalkingData AdTracking Fraud Detection Challenge competition using the ***limited*** train.csv records from this notebook

In [None]:
feature_cols + ['click_id']

In [None]:
competition_test_data = competition_test_data[feature_cols + ['click_id']]

In [None]:
competition_test_data.shape

In [None]:
competition_test_data.head()

In [None]:
test[feature_cols].head()

In [None]:
competition_predictions = bst.predict(competition_test_data[feature_cols])

In [None]:
type(competition_predictions)

In [None]:
competition_predictions

In [None]:
competition_predictions_df = pd.DataFrame(competition_predictions, columns=['is_attributed'])
competition_predictions_df

In [None]:
competition_predictions_df['click_id'] = competition_test_data['click_id']
competition_predictions_df = competition_predictions_df[['click_id', 'is_attributed']]
competition_predictions_df

In [None]:
competition_predictions_df

In [None]:
competition_predictions_df['is_attributed'].value_counts().sort_index()

In [None]:
pd.cut(competition_predictions_df['is_attributed'], bins=10).value_counts()

In [None]:
pd.cut(competition_predictions_df['is_attributed'], bins=10).value_counts().plot(kind='bar', rot=45);

In [None]:
#competition_predictions_df['is_attributed'].value_counts().sort_index().plot(kind='bar');

In [None]:
#sum(competition_predictions_df['is_attributed'] <= 0.5)/competition_predictions_df.shape[0]

In [None]:
#sum(competition_predictions_df['is_attributed'] > 0.5)/competition_predictions_df.shape[0]

In [None]:
competition_predictions_df.to_csv('submission.csv', index=False)

In [None]:
my_own_metrics['private score'] = 0.83173
my_own_metrics['public score'] = 0.82499
my_own_metrics

# Submit csv to competition


- Baseline Model:
  - https://www.kaggle.com/georgezoto/talkingdata-adtracking-competition-baseline-mode


- Competition data
  - V1 Train Limit:1M/18M UNK:1.5M Test Score:0.8579
    - 'private score': 0.83173
    - 'public score': 0.82499
  - V2 Train Limit:10M/18M UNK:3.7M Test Score:0.8558
    - 'private score': 0.83903
    - 'public score': 0.83123
  - V3 Train Limit:20M/18M UNK:3.7M Test Score:0.8501
    - 'private score': 0.83733
    - 'public score': 0.83034


- Feature-Engineering Data
  - V4FE Train Limit:2M/18M UNK:1.5M Test Score:0.9726
    - 'private score': 0.95686
    - 'public score': 0.95522

![image.png](attachment:image.png)
