# Exercise 1: Building a Baseline Model

In this exercise, you will develop a baseline model for predicting whether a customer will buy the app after clicking through from an ad. This baseline gives you a reference point for measuring how much your feature engineering and selection efforts improve the model's performance.

In [None]:
# Set up code checking
from learntools.core import binder
binder.bind(globals())
from learntools.feature_engineering.ex1 import *

In [None]:
import pandas as pd

click_data = pd.read_csv('../input/feature-engineering-data/train_sample.csv',
                         parse_dates=['click_time'])
click_data.head(10)

## Baseline Model

The first thing you need to do is construct a baseline model. All new features, processing, encodings, and feature selection should improve upon this baseline model. Before training the model itself, you need to do a bit of feature engineering.

### 1) Exercise: Features from timestamps
From the timestamps, create features for the day, hour, minute and second. Store these as new integer columns `day`, `hour`, `minute`, and `second` in a new DataFrame `clicks`.
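
As a minimal sketch of how the pandas `.dt` accessor pulls components out of a datetime column, here is a toy example. The `toy` DataFrame and its timestamps are made up for illustration and are not taken from the click data.

```python
import pandas as pd

# Hypothetical toy data, not the real click log
toy = pd.DataFrame({
    'click_time': pd.to_datetime(['2017-11-06 14:32:21',
                                  '2017-11-07 03:05:59'])
})

# The .dt accessor exposes the components of a datetime64 column.
# Casting to uint8 keeps memory use low; it is safe here because
# day (1-31) and month (1-12) both fit in the range 0-255.
toy['day'] = toy['click_time'].dt.day.astype('uint8')
toy['month'] = toy['click_time'].dt.month.astype('uint8')
print(toy)
```

The same pattern applies to the other components the accessor exposes.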

In [None]:
# Add new columns for timestamp features day, hour, minute, and second
clicks = click_data.copy()
clicks['day'] = clicks['click_time'].dt.day.astype('uint8')
# Fill in the rest
clicks['hour'] = ____
clicks['minute'] = ____
clicks['second'] = ____

In [None]:
q_1.check()

In [None]:
#%%RM_IF(PROD)%%
# One possible solution; there are many ways to do this
clicks = click_data.copy()

# Split up the times
click_times = click_data['click_time']
clicks['day'] = click_times.dt.day.astype('uint8')
clicks['hour'] = click_times.dt.hour.astype('uint8')
clicks['minute'] = click_times.dt.minute.astype('uint8')
clicks['second'] = click_times.dt.second.astype('uint8')
# clicks = click_data.assign(day=click_times.dt.day.astype('uint8'),
#                            hour=click_times.dt.hour.astype('uint8'), 
#                            minute=click_times.dt.minute.astype('uint8'),
#                            second=click_times.dt.second.astype('uint8'))
q_1.check()

In [None]:
# Uncomment these if you need guidance
# q_1.hint()
# q_1.solution()

In [None]:
# Run this to check your work
q_1.check()

### 2) Exercise: Label Encoding
For each of the categorical features `['ip', 'app', 'device', 'os', 'channel']`, use scikit-learn's `LabelEncoder` to create new features in the `clicks` DataFrame. The new column names should be the original column name with `'_labels'` appended, like `ip_labels`.
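
To illustrate what `LabelEncoder` does, here is a small sketch on a hypothetical `color` column (not one of the exercise's features). It assumes only that scikit-learn and pandas are installed.

```python
import pandas as pd
from sklearn import preprocessing

# Hypothetical toy column, unrelated to the click data
toy = pd.DataFrame({'color': ['red', 'blue', 'red', 'green']})

encoder = preprocessing.LabelEncoder()
# fit_transform learns the sorted unique values and maps each
# one to an integer in the range 0..n_classes-1
toy['color_labels'] = encoder.fit_transform(toy['color'])

print(toy)
print(encoder.classes_)  # the learned classes, in sorted order
```

Because the classes are sorted before integers are assigned, the resulting labels carry no meaning beyond identity: `blue` receiving a lower label than `red` is just a result of alphabetical order.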

In [None]:
from sklearn import preprocessing

In [None]:
cat_features = ['ip', 'app', 'device', 'os', 'channel']

# Create new columns in clicks using preprocessing.LabelEncoder()
for feature in cat_features:
    ____

In [None]:
#%%RM_IF(PROD)%%

# Label encode categorical variables 
cat_features = ['ip', 'app', 'device', 'os', 'channel']
label_encoder = preprocessing.LabelEncoder()
for feature in cat_features:
    new_feature = label_encoder.fit_transform(clicks[feature])
    clicks[feature +'_labels'] = new_feature
    
q_2.check()

In [None]:
# Uncomment these if you need guidance
# q_2.hint()
# q_2.solution()

In [None]:
# Run this cell to check your work
q_2.check()

In [None]:
clicks.head()

### 3) One-hot encoding?

Now that you have label-encoded features, does it make sense to use one-hot encoding at this point?
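
One thing that may help you decide is to count the distinct values in each categorical column, since one-hot encoding creates one new column per distinct value. The sketch below uses a made-up `toy` DataFrame; the column values are hypothetical.

```python
import pandas as pd

# Hypothetical toy data: 'ip' is unique per row, 'device' has two values
toy = pd.DataFrame({'ip': [101, 102, 103, 104, 105, 106],
                    'device': [1, 1, 2, 1, 1, 2]})

# Number of distinct values per column
print(toy.nunique())

# One-hot encoding produces one column per distinct value,
# so the frame grows to 6 + 2 = 8 columns
onehot = pd.get_dummies(toy, columns=['ip', 'device'])
print(onehot.shape)
```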

Uncomment the following line after you've decided your answer.

In [None]:
#q_3.solution()

## Train, validation, and test sets
With our baseline features ready, we need to split our data into training and validation sets. We should also hold out a test set to measure the final performance of the model.

### 4) Question: Train/test splits with time series data
This is time series data. Are there any special considerations when creating train/test splits for time series? If so, what are they, and why do they matter?

Uncomment the following line after you've decided your answer.

In [None]:
#q_4.solution()

### 5) Exercise: Create train/validation/test splits

First, sort the `clicks` DataFrame in order of increasing time. Use the first 80% of the rows as the train set, the next 10% as the validation set, and the last 10% as the test set.
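
The idea of a chronological 80/10/10 split can be sketched on toy data. Everything here (the `toy` DataFrame, its dates, and the `toy_*` names) is hypothetical and chosen so the example does not overwrite the variables the exercise expects.

```python
import pandas as pd

# Hypothetical toy log: 10 daily clicks, deliberately stored newest-first
toy = pd.DataFrame({
    'click_time': pd.date_range('2017-11-01', periods=10, freq='D')[::-1],
    'is_attributed': [0, 0, 1, 0, 0, 0, 1, 0, 0, 0],
})

# Sort by time so that later rows are always later clicks
toy_sorted = toy.sort_values('click_time')
n_holdout = int(len(toy_sorted) * 0.1)  # rows in each of valid and test

toy_train = toy_sorted[:-2 * n_holdout]            # earliest 80%
toy_valid = toy_sorted[-2 * n_holdout:-n_holdout]  # next 10%
toy_test = toy_sorted[-n_holdout:]                 # latest 10%

print(len(toy_train), len(toy_valid), len(toy_test))
```

With this ordering, every training row precedes every validation row, which in turn precedes every test row, so the model is never evaluated on clicks that happened before the ones it was trained on.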

In [None]:
feature_cols = ['day', 'hour', 'minute', 'second', 
                'ip_labels', 'app_labels', 'device_labels',
                'os_labels', 'channel_labels']

# Be sure to sort the data by click time before making the splits
# Your code here: create train, validation, and test sets as dataframes
train = ____
valid = ____
test = ____

In [None]:
# Uncomment these if you need guidance
#q_5.hint()
#q_5.solution()

In [None]:
# Run this cell to check your work
q_5.check()

In [None]:
#%%RM_IF(PROD)%%

feature_cols = ['day', 'hour', 'minute', 'second', 
                'ip_labels', 'app_labels', 'device_labels',
                'os_labels', 'channel_labels']

valid_fraction = 0.1
clicks_srt = clicks.sort_values('click_time')
valid_rows = int(len(clicks_srt) * valid_fraction)
train = clicks_srt[:-valid_rows * 2]
# valid size == test size, last two sections of the data
valid = clicks_srt[-valid_rows * 2:-valid_rows]
test = clicks_srt[-valid_rows:]

q_5.check()

Now we can create LightGBM dataset objects for each of the smaller datasets and train the baseline model.

In [None]:
import lightgbm as lgb

dtrain = lgb.Dataset(train[feature_cols], label=train['is_attributed'])
dvalid = lgb.Dataset(valid[feature_cols], label=valid['is_attributed'])
dtest = lgb.Dataset(test[feature_cols], label=test['is_attributed'])

param = {'num_leaves': 64, 'objective': 'binary'}
param['metric'] = 'auc'
num_round = 1000
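# Stop training once the validation AUC has not improved for 10 rounds.
# The early_stopping callback is used because lgb.train() no longer accepts
# the early_stopping_rounds argument in LightGBM 4.0 and later.
bst = lgb.train(param, dtrain, num_round, valid_sets=[dvalid],
                callbacks=[lgb.early_stopping(stopping_rounds=10)])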

### 6) Exercise: Evaluate the model
Finally, with the model trained, evaluate its performance on the test set. Use the ROC AUC score from `sklearn.metrics`.
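
As a reminder of how `roc_auc_score` is called, here is a minimal sketch on hand-made labels and scores. The values in `y_true` and `y_score` are hypothetical and unrelated to the click data.

```python
from sklearn import metrics

# Hypothetical true labels and predicted probabilities
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]

# ROC AUC is the fraction of (positive, negative) pairs in which the
# positive example gets the higher score: 3 of the 4 pairs here
toy_auc = metrics.roc_auc_score(y_true, y_score)
print(toy_auc)  # 0.75
```

Note that the function takes the true labels first and the predicted scores second, and that it expects scores or probabilities rather than hard 0/1 predictions.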

In [None]:
from sklearn import metrics

In [None]:
# Your code here. Make predictions on the test set with the trained LightGBM model
ypred = ____ 
score = ____

In [None]:
# Uncomment these if you need guidance
#q_6.hint()
#q_6.solution()

In [None]:
# Run this cell to check your work
q_6.check()

In [None]:
#%%RM_IF(PROD)%%
ypred = bst.predict(test[feature_cols])
score = metrics.roc_auc_score(test['is_attributed'], ypred)

q_6.check()

In [None]:
print(f"Test score: {score}")

This will be our baseline score for the model. When we transform features, add new ones, or perform feature selection, we should be improving on this score. However, since this is the test set, we only want to look at it at the end of all our manipulations. At the very end of this course you'll look at the test score again to see if you improved on the baseline model.