# Exercise 1: Building a Baseline Model

In this exercise, you will develop a baseline model for predicting whether a customer will buy the app after clicking through from an ad. With this baseline model, you'll be able to see how your feature engineering and selection efforts improve the model's performance.

In [2]:
# Set up code checking
from learntools.core import binder
binder.bind(globals())
from learntools.machine_learning.ex2 import *

In [3]:
import pandas as pd

click_data = pd.read_csv('../input/train_sample.csv', parse_dates=['click_time'])
click_data.head(10)

Unnamed: 0,ip,app,device,os,channel,click_time,attributed_time,is_attributed
0,89489,3,1,13,379,2017-11-06 15:13:23,,0
1,204158,35,1,13,21,2017-11-06 15:41:07,2017-11-07 08:17:19,1
2,3437,6,1,13,459,2017-11-06 15:42:32,,0
3,167543,3,1,13,379,2017-11-06 15:56:17,,0
4,147509,3,1,13,379,2017-11-06 15:57:01,,0
5,71421,15,1,13,153,2017-11-06 16:00:00,,0
6,76953,14,1,13,379,2017-11-06 16:00:01,,0
7,187909,2,1,25,477,2017-11-06 16:00:01,,0
8,116779,1,1,8,150,2017-11-06 16:00:01,,0
9,47857,3,1,15,205,2017-11-06 16:00:01,,0


## Baseline Model

The first thing you need to do is construct a baseline model. All new features, processing, encodings, and feature selection should improve upon this baseline model. Before training the model itself, you need to do a bit of feature engineering.

### 1) Exercise: Features from timestamps
From the timestamps, create features for the day, hour, minute, and second. Store these as new integer columns `day`, `hour`, `minute`, and `second` in a new DataFrame `clicks`.

In [4]:
# Add new columns for timestamp features day, hour, minute, and second
clicks = ____

In [5]:
#%%RM_IF(PROD)%%
# My solution, there are a lot of ways to do this

# Split up the times
click_times = click_data['click_time']
clicks = click_data.assign(day=click_times.dt.day.astype('uint8'),
                           hour=click_times.dt.hour.astype('uint8'),
                           minute=click_times.dt.minute.astype('uint8'),
                           second=click_times.dt.second.astype('uint8'))

In [6]:
clicks.head()

Unnamed: 0,ip,app,device,os,channel,click_time,attributed_time,is_attributed,day,hour,minute,second
0,89489,3,1,13,379,2017-11-06 15:13:23,,0,6,15,13,23
1,204158,35,1,13,21,2017-11-06 15:41:07,2017-11-07 08:17:19,1,6,15,41,7
2,3437,6,1,13,459,2017-11-06 15:42:32,,0,6,15,42,32
3,167543,3,1,13,379,2017-11-06 15:56:17,,0,6,15,56,17
4,147509,3,1,13,379,2017-11-06 15:57:01,,0,6,15,57,1


### 2) Exercise: Label Encoding
For each of the categorical features `['ip', 'app', 'device', 'os', 'channel']`, use scikit-learn's `LabelEncoder` to create new features in the `clicks` DataFrame. The new column names should be the original column name with `'_labels'` appended, like `ip_labels`.

In [7]:
from sklearn import preprocessing

In [8]:
cat_features = ['ip', 'app', 'device', 'os', 'channel']

# Create new columns in clicks using preprocessing.LabelEncoder()
____




In [9]:
#%%RM_IF(PROD)%%

# Label encode categorical variables 
cat_features = ['ip', 'app', 'device', 'os', 'channel']
for feature in cat_features:
    label_encoder = preprocessing.LabelEncoder()
    new_feature = label_encoder.fit_transform(clicks[feature])
    clicks[feature +'_labels'] = new_feature

In [10]:
clicks.head()

Unnamed: 0,ip,app,device,os,channel,click_time,attributed_time,is_attributed,day,hour,minute,second,ip_labels,app_labels,device_labels,os_labels,channel_labels
0,89489,3,1,13,379,2017-11-06 15:13:23,,0,6,15,13,23,27226,3,1,13,120
1,204158,35,1,13,21,2017-11-06 15:41:07,2017-11-07 08:17:19,1,6,15,41,7,110007,35,1,13,10
2,3437,6,1,13,459,2017-11-06 15:42:32,,0,6,15,42,32,1047,6,1,13,157
3,167543,3,1,13,379,2017-11-06 15:56:17,,0,6,15,56,17,76270,3,1,13,120
4,147509,3,1,13,379,2017-11-06 15:57:01,,0,6,15,57,1,57862,3,1,13,120
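
To see what `LabelEncoder` does on a small scale, here is a minimal sketch on made-up values (the toy `channel` column is an assumption for illustration, not part of the exercise data). Each distinct value is replaced by its position in the sorted list of distinct values.

```python
import pandas as pd
from sklearn import preprocessing

# Made-up values standing in for a categorical column such as 'channel'
toy = pd.DataFrame({"channel": [379, 21, 459, 379, 21]})

encoder = preprocessing.LabelEncoder()
toy["channel_labels"] = encoder.fit_transform(toy["channel"])

# classes_ holds the sorted distinct values; each label is an index into it
print(encoder.classes_.tolist())        # [21, 379, 459]
print(toy["channel_labels"].tolist())   # [1, 0, 2, 1, 0]

# inverse_transform maps labels back to the original values
print(encoder.inverse_transform([0, 1, 2]).tolist())  # [21, 379, 459]
```

The labels carry no meaning beyond identity: label 2 is not "twice" label 1, it is just the third distinct value in sorted order.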


### 3) One-hot encoding?

Now that you have label-encoded features, does it make sense to use one-hot encoding at this point?

**Answer:** The `ip_labels` column has 58,000 values, so one-hot encoding it would create an extremely sparse matrix with 58,000 columns, which is generally a bad idea. Luckily, XGBoost works well with label-encoded features.
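
The column blow-up is easy to see on a small scale. The sketch below uses a made-up column with 1,000 distinct values as a stand-in for `ip_labels` (the size and the `ip` prefix are assumptions for illustration) and one-hot encodes it with `pd.get_dummies`.

```python
import pandas as pd

# Made-up column with 1,000 distinct values, standing in for ip_labels
toy = pd.DataFrame({"ip_labels": range(1000)})

one_hot = pd.get_dummies(toy["ip_labels"], prefix="ip")

print(toy.shape)      # (1000, 1)
print(one_hot.shape)  # (1000, 1000)

# Each one-hot row contains a single 1, so 99.9% of the cells are zero
print(int(one_hot.to_numpy().sum()))  # 1000
```

One column per distinct value means the width of the matrix grows with the cardinality of the feature, while the label-encoded version stays a single column.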

## Train, validation, and test sets
With our baseline features ready, we need to split our data into training and validation sets. We should also hold out a test set to measure the final accuracy of the model.

### 4) Question: train/test splits with time series data
This is time series data. Are there any special considerations when creating train/test splits for time series? If so, what are they, and why do they matter?

**Answer:** Since our model is meant to predict events in the future, we must also validate the model on events in the future. If the data is mixed up between the training and test sets, then future data will leak into the model and our validation results will overestimate the performance on new data.
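
As a rough illustration of the leak, the sketch below counts how many training rows occur after the earliest validation row under two splits of a made-up click log. The helper `rows_from_the_future` is hypothetical, and the interleaved split is a deterministic stand-in for a shuffled split.

```python
import pandas as pd

# Made-up click log: 100 clicks, one per minute
toy = pd.DataFrame({
    "click_time": pd.date_range("2017-11-06 15:00:00", periods=100, freq="min"),
})

def rows_from_the_future(train, valid):
    """Count training rows that occur after the earliest validation row."""
    return int((train["click_time"] > valid["click_time"].min()).sum())

# Mixed split: every fifth row goes to validation (stand-in for shuffling)
is_valid = toy.index % 5 == 4
mixed_train, mixed_valid = toy[~is_valid], toy[is_valid]

# Chronological split: sort by time, then hold out the last 20% of rows
ordered = toy.sort_values("click_time")
n_valid = int(len(ordered) * 0.2)
time_train, time_valid = ordered[:-n_valid], ordered[-n_valid:]

print(rows_from_the_future(mixed_train, mixed_valid))  # 76
print(rows_from_the_future(time_train, time_valid))    # 0
```

In the mixed split, most of the training rows come from after the first validation row, so the model gets to see the period it is being validated on. The chronological split has no such rows.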

### 5) Exercise: Create train/validation/test splits

First, sort the data in order of increasing time. Use the first 80% of the rows as the train set, the next 10% as the validation set, and the last 10% as the test set. Then create XGBoost DMatrix objects for each of the smaller datasets.

In [11]:
import xgboost as xgb

In [12]:
feature_cols = ['day', 'hour', 'minute', 'second', 
                'ip_labels', 'app_labels', 'device_labels',
                'os_labels', 'channel_labels']

# Be sure to sort the data by click time before making the splits
# Your code here: create train, validation, and test sets as xgb.DMatrix objects
dtrain = ____
dvalid = ____
dtest = ____

In [13]:
#%%RM_IF(PROD)%%

feature_cols = ['day', 'hour', 'minute', 'second', 
                'ip_labels', 'app_labels', 'device_labels',
                'os_labels', 'channel_labels']


valid_fraction = 0.1
clicks = clicks.sort_values('click_time')
valid_rows = int(len(clicks) * valid_fraction)
train = clicks[:-valid_rows * 2]
# valid size == test size, last two sections of the data
valid = clicks[-valid_rows * 2:-valid_rows]
test = clicks[-valid_rows:]


dtrain = xgb.DMatrix(train[feature_cols], label=train['is_attributed'])
dvalid = xgb.DMatrix(valid[feature_cols], label=valid['is_attributed'])
dtest = xgb.DMatrix(test[feature_cols], label=test['is_attributed'])



In [22]:
## Fitting the xgb model
num_round = 100
param = {'max_depth': 2, 'eta': 1, 'silent': 1, 'objective': 'binary:logistic'}
evallist = [(dtrain, 'train'), (dvalid, 'eval')]
bst = xgb.train(param, dtrain, num_round, evallist)

[0]	train-error:0.146744	eval-error:0.150185
[1]	train-error:0.06836	eval-error:0.079881
[2]	train-error:0.069302	eval-error:0.080072
[3]	train-error:0.063382	eval-error:0.071104
[4]	train-error:0.056155	eval-error:0.074643
[5]	train-error:0.058005	eval-error:0.081602
[6]	train-error:0.05246	eval-error:0.067588
[7]	train-error:0.049595	eval-error:0.061972
[8]	train-error:0.04936	eval-error:0.061594
[9]	train-error:0.049511	eval-error:0.063454
[10]	train-error:0.04913	eval-error:0.063198
[11]	train-error:0.049594	eval-error:0.064676
[12]	train-error:0.049452	eval-error:0.064297
[13]	train-error:0.048869	eval-error:0.061372
[14]	train-error:0.048923	eval-error:0.062237
[15]	train-error:0.049246	eval-error:0.062315
[16]	train-error:0.049196	eval-error:0.062263
[17]	train-error:0.048634	eval-error:0.061889
[18]	train-error:0.048497	eval-error:0.06155
[19]	train-error:0.048327	eval-error:0.06122
[20]	train-error:0.048046	eval-error:0.061085
[21]	train-error:0.047443	eval-error:0.06022
[22]	
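
The `train-error` and `eval-error` values in the log correspond to XGBoost's `error` metric, which the XGBoost documentation describes as the fraction of misclassified rows when predictions above 0.5 are treated as positive. A minimal sketch of that calculation, on made-up labels and probabilities rather than the exercise data:

```python
import numpy as np

# Made-up labels and predicted probabilities, for illustration only
y_true = np.array([0, 0, 1, 1, 0])
y_prob = np.array([0.2, 0.7, 0.9, 0.4, 0.1])

# Predictions above 0.5 count as positive; error is the share of wrong predictions
y_hat = (y_prob > 0.5).astype(int)
error = float((y_hat != y_true).mean())

print(error)  # 0.4
```

Because this metric depends on a fixed 0.5 threshold, it can move differently from a ranking-based score such as ROC AUC.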

### 6) Exercise: Evaluate the model
Finally, with the model trained, evaluate its performance on the test set `dtest`. Use the ROC AUC score from `sklearn.metrics`.

In [16]:
from sklearn import metrics

In [17]:
# Your code here. Make predictions on the test set with the trained XGBoost model
ypred = ____ 
score = ____

In [23]:
#%%RM_IF(PROD)%%
ypred = bst.predict(dtest)
score = metrics.roc_auc_score(test['is_attributed'], ypred)

In [24]:
print(f"Test score: {score}")

Test score: 0.9681879949707477


This will be our baseline score for the model. When we transform features, add new ones, or perform feature selection, we should be improving on this score.
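
For intuition about what the score measures, ROC AUC can be read as the probability that a randomly chosen positive example receives a higher prediction than a randomly chosen negative one. The sketch below checks that reading against `metrics.roc_auc_score` on four made-up examples (the labels and scores are assumptions for illustration).

```python
import numpy as np
from sklearn import metrics

# Made-up labels and prediction scores
y_true = np.array([0, 0, 1, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8])

pos = y_score[y_true == 1]
neg = y_score[y_true == 0]

# Fraction of (positive, negative) pairs ranked correctly; ties count as half
wins = (pos[:, None] > neg[None, :]).sum() + 0.5 * (pos[:, None] == neg[None, :]).sum()
pairwise_auc = wins / (len(pos) * len(neg))

print(pairwise_auc)                            # 0.75
print(metrics.roc_auc_score(y_true, y_score))  # 0.75
```

Three of the four positive/negative pairs are ordered correctly, giving 0.75. A score of 0.5 would mean the ranking is no better than chance.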