# Feature Engineering 

# Introduction

In this exercise, you will develop a baseline model for predicting if a customer will buy an app after clicking on an ad. With this baseline model, you'll be able to see how your feature engineering and selection efforts improve the model's performance.

In [84]:

import pandas as pd

click_data = pd.read_csv('../hitchhikersGuideToMachineLearning/train_sample.csv',parse_dates=['click_time'])
click_data.head(10)

Unnamed: 0,ip,app,device,os,channel,click_time,attributed_time,is_attributed
0,89489,3,1,13,379,2017-11-06 15:13:23,,0
1,204158,35,1,13,21,2017-11-06 15:41:07,2017-11-07 08:17:19,1
2,3437,6,1,13,459,2017-11-06 15:42:32,,0
3,167543,3,1,13,379,2017-11-06 15:56:17,,0
4,147509,3,1,13,379,2017-11-06 15:57:01,,0
5,71421,15,1,13,153,2017-11-06 16:00:00,,0
6,76953,14,1,13,379,2017-11-06 16:00:01,,0
7,187909,2,1,25,477,2017-11-06 16:00:01,,0
8,116779,1,1,8,150,2017-11-06 16:00:01,,0
9,47857,3,1,15,205,2017-11-06 16:00:01,,0


## Baseline Model

The first thing you need to do is construct a baseline model. All new features, processing, encodings, and feature selection should improve upon this baseline model. First you need to do a bit of feature engineering before training the model itself.

### Features from timestamps
From the timestamps, create features for the day, hour, minute and second. Store these as new integer columns `day`, `hour`, `minute`, and `second` in a new DataFrame `clicks`.

In [85]:
# Add new columns for timestamp features day, hour, minute, and second
clicks = click_data.copy()
clicks['day'] = clicks['click_time'].dt.day.astype('uint8')
# Fill in the rest
clicks['hour'] = clicks['click_time'].dt.hour.astype('uint8')
clicks['minute'] = clicks['click_time'].dt.minute.astype('uint8')
clicks['second'] = clicks['click_time'].dt.second.astype('uint8')

### Categorical Data Encoding
For each of the categorical features `['ip', 'app', 'device', 'os', 'channel']`, use scikit-learn's `LabelEncoder` to create new features in the `clicks` DataFrame. The new column names should be the original column name with `'_labels'` appended, like `ip_labels`.

In [86]:
from sklearn import preprocessing

cat_features = ['ip', 'app', 'device', 'os', 'channel']

# Create new columns in clicks using preprocessing.LabelEncoder()
encoder = preprocessing.LabelEncoder()
for feature in cat_features:
    encoded = encoder.fit_transform(clicks[feature])
    clicks[feature + "_labels"] = encoded
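
To make the effect of label encoding concrete, here is a minimal sketch on a handful of made-up channel IDs (the list `channels` is an assumption for illustration, not data from the notebook). It shows that `LabelEncoder` sorts the unique values and replaces each value with its position in that sorted list.

```python
from sklearn.preprocessing import LabelEncoder

# Toy channel IDs; the values are made up for illustration
channels = [379, 21, 459, 379]

encoder = LabelEncoder()
codes = encoder.fit_transform(channels)

# classes_ holds the sorted unique values; each code is a position in it
print(encoder.classes_.tolist())  # [21, 379, 459]
print(codes.tolist())             # [1, 0, 2, 1]
```

The integer codes carry no meaning beyond identity, which is usually acceptable for tree-based models such as the one trained later.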
    


## Train, validation, and test sets
With our baseline features ready, we need to split our data into training and validation sets. We should also hold out a test set to measure the final accuracy of the model.

### Train/test splits with time series data
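This is time series data. Are there any special considerations when creating train/test splits for time series? Since the model is meant to predict future events, it should be validated on clicks that happen after the ones it was trained on; otherwise information from the future leaks into training and the validation score overestimates real performance.
I have explained time series cross-validation in this article (link) and the data leakage that a careless split can cause in this article (link).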
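
As a rough illustration of time series cross-validation, the sketch below uses scikit-learn's `TimeSeriesSplit` on eight toy samples that are assumed to be sorted by time already. The array `X` and the choice of three splits are assumptions for the example; the notebook itself uses a single train/validation/test split.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Eight toy samples, assumed to be already sorted by time
X = np.arange(8).reshape(-1, 1)

splitter = TimeSeriesSplit(n_splits=3)
folds = list(splitter.split(X))

for train_idx, valid_idx in folds:
    # In every fold, all training indices come before all validation indices
    print(train_idx.tolist(), valid_idx.tolist())
```

Each fold trains on an expanding window of the past and validates on the block that follows it, so no future rows are visible during training.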



In [87]:
feature_cols = ['day', 'hour', 'minute', 'second', 
                'ip_labels', 'app_labels', 'device_labels',
                'os_labels', 'channel_labels']

valid_fraction = 0.1
clicks_srt = clicks.sort_values('click_time')
valid_rows = int(len(clicks_srt) * valid_fraction)
train = clicks_srt[:-valid_rows * 2]
valid = clicks_srt[-valid_rows * 2:-valid_rows]
test = clicks_srt[-valid_rows:]

### Train with LightGBM

Now we can create LightGBM dataset objects for each of the smaller datasets and train the baseline model.

In [88]:
import lightgbm as lgb

dtrain = lgb.Dataset(train[feature_cols], label=train['is_attributed'])
dvalid = lgb.Dataset(valid[feature_cols], label=valid['is_attributed'])
dtest = lgb.Dataset(test[feature_cols], label=test['is_attributed'])

param = {'num_leaves': 64, 'objective': 'binary'}
param['metric'] = 'auc'
num_round = 1000
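# Note: LightGBM 4.0 removed the early_stopping_rounds argument from lgb.train;
# on newer versions pass callbacks=[lgb.early_stopping(10)] instead.
bst = lgb.train(param, dtrain, num_round, valid_sets=[dvalid], early_stopping_rounds=10)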

[1]	valid_0's auc: 0.948979
Training until validation scores don't improve for 10 rounds
[2]	valid_0's auc: 0.949235
[3]	valid_0's auc: 0.950126
[4]	valid_0's auc: 0.950072
[5]	valid_0's auc: 0.950536
[6]	valid_0's auc: 0.950943
[7]	valid_0's auc: 0.951453
[8]	valid_0's auc: 0.951518
[9]	valid_0's auc: 0.952385
[10]	valid_0's auc: 0.952434
[11]	valid_0's auc: 0.952465
[12]	valid_0's auc: 0.952638
[13]	valid_0's auc: 0.95266
[14]	valid_0's auc: 0.952766
[15]	valid_0's auc: 0.953203
[16]	valid_0's auc: 0.953503
[17]	valid_0's auc: 0.953793
[18]	valid_0's auc: 0.953966
[19]	valid_0's auc: 0.954184
[20]	valid_0's auc: 0.9543
[21]	valid_0's auc: 0.954305
[22]	valid_0's auc: 0.954536
[23]	valid_0's auc: 0.954748
[24]	valid_0's auc: 0.955142
[25]	valid_0's auc: 0.955493
[26]	valid_0's auc: 0.955611
[27]	valid_0's auc: 0.955708
[28]	valid_0's auc: 0.955795
[29]	valid_0's auc: 0.956172
[30]	valid_0's auc: 0.95623
[31]	valid_0's auc: 0.956477
[32]	valid_0's auc: 0.956606
[33]	valid_0's auc: 0.95

## Evaluate the model
Finally, with the model trained, I'll evaluate its performance on the test set.

In [89]:
from sklearn import metrics

ypred = bst.predict(test[feature_cols])
score = metrics.roc_auc_score(test['is_attributed'], ypred)
print(f"Test score: {score}")

Test score: 0.9726727334566094


# Diving Deeper in Categorical Encoding

In this exercise you'll apply more advanced encodings to the categorical variables to improve your classifier model. The encodings you will implement are:

- Count Encoding
- Target Encoding
- Leave-one-out Encoding
- CatBoost Encoding
- Feature embedding with SVD 

You'll refit the classifier after each encoding to check its performance on hold-out data. First, run the next cell to repeat the work you did in the last exercise.

Since we will be training and testing a lot, let's make some functions! Whenever you use a piece of code more than three times, make it a function.

In [93]:
def get_data_splits(dataframe, valid_fraction=0.1):
    """ Splits a dataframe into train, validation, and test sets. First, orders by 
        the column 'click_time'. Set the size of the validation and test sets with
        the valid_fraction keyword argument.
    """

    dataframe = dataframe.sort_values('click_time')
    valid_rows = int(len(dataframe) * valid_fraction)
    train = dataframe[:-valid_rows * 2]
    # valid size == test size, last two sections of the data
    valid = dataframe[-valid_rows * 2:-valid_rows]
    test = dataframe[-valid_rows:]
    
    return train, valid, test

def train_model(train, valid, test=None, feature_cols=None):
    if feature_cols is None:
        feature_cols = train.columns.drop(['click_time', 'attributed_time',
                                           'is_attributed'])
    dtrain = lgb.Dataset(train[feature_cols], label=train['is_attributed'])
    dvalid = lgb.Dataset(valid[feature_cols], label=valid['is_attributed'])
    
    param = {'num_leaves': 64, 'objective': 'binary', 
             'metric': 'auc', 'seed': 7}
    num_round = 1000
    print("Training model!")
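    # Note: LightGBM 4.0 removed early_stopping_rounds and verbose_eval from lgb.train;
    # on newer versions use callbacks=[lgb.early_stopping(20, verbose=False)] instead.
    bst = lgb.train(param, dtrain, num_round, valid_sets=[dvalid], 
                    early_stopping_rounds=20, verbose_eval=False)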
    
    valid_pred = bst.predict(valid[feature_cols])
    valid_score = metrics.roc_auc_score(valid['is_attributed'], valid_pred)
    print(f"Validation AUC score: {valid_score}")
    
    if test is not None: 
        test_pred = bst.predict(test[feature_cols])
        test_score = metrics.roc_auc_score(test['is_attributed'], test_pred)
        return bst, valid_score, test_score
    else:
        return bst, valid_score

In [100]:
# Count encoding

In [94]:
import category_encoders as ce

cat_features = ['ip', 'app', 'device', 'os', 'channel']
train, valid, test = get_data_splits(clicks)

# Create the count encoder
count_enc = ce.CountEncoder(cols = cat_features)



# Learn encoding from the training set
count_encoded = count_enc.fit(train[cat_features])

# Apply encoding to the train and validation sets as new columns
# Make sure to add `_count` as a suffix to the new columns
train_encoded = train.join(count_enc.transform(train[cat_features]).add_suffix('_count'))

valid_encoded = valid.join(count_enc.transform(valid[cat_features]).add_suffix('_count'))
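
For intuition, count encoding can be written by hand with pandas. The sketch below is a simplified stand-in on made-up data (`train_toy`, `valid_toy` and the column `app_count` are hypothetical names); it is not how `ce.CountEncoder` is implemented, and the library has its own options for unseen values that this sketch does not reproduce.

```python
import pandas as pd

train_toy = pd.DataFrame({"app": [3, 3, 3, 35, 6, 6]})
valid_toy = pd.DataFrame({"app": [3, 6, 99]})  # 99 never appears in training

# Learn the counts from the training rows only
app_counts = train_toy["app"].value_counts()

# Map each value to its training count; unseen values become NaN
train_toy["app_count"] = train_toy["app"].map(app_counts)
valid_toy["app_count"] = valid_toy["app"].map(app_counts)

print(valid_toy)
```

Learning the counts on the training rows alone mirrors the `fit` on `train` followed by `transform` on both sets in the cell above.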


In [95]:
# Train the model on the encoded datasets
# This can take around 30 seconds to complete
_ = train_model(train_encoded, valid_encoded)

Training model!
Validation AUC score: 0.9649930021641716


In [99]:
# Target encoding

In [97]:
cat_features = ['ip', 'app', 'device', 'os', 'channel']
train, valid, test = get_data_splits(clicks)

# Create the target encoder. You can find this easily by using tab completion.
# Start typing ce. then press Tab to bring up a list of classes and functions.
target_enc = ce.TargetEncoder(cols=cat_features)


# Learn encoding from the training set. Use the 'is_attributed' column as the target.
target_enc.fit(train[cat_features], train['is_attributed'])

# Apply encoding to the train and validation sets as new columns
# Make sure to add `_target` as a suffix to the new columns
train_encoded = train.join(target_enc.transform(train[cat_features]).add_suffix('_target'))

valid_encoded = valid.join(target_enc.transform(valid[cat_features]).add_suffix('_target'))
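
The idea behind target encoding is to replace each category with the average target value seen for it in training, pulled toward the global average when the category is rare. The sketch below uses a simple additive smoothing with a hypothetical strength `m`; this is an assumption chosen for readability and not the exact smoothing formula used by `ce.TargetEncoder`.

```python
import pandas as pd

train_toy = pd.DataFrame({
    "app": [3, 3, 3, 3, 35, 6],
    "is_attributed": [0, 0, 1, 0, 1, 0],
})

prior = train_toy["is_attributed"].mean()  # global conversion rate
m = 2.0  # hypothetical smoothing strength, chosen for illustration

stats = train_toy.groupby("app")["is_attributed"].agg(["sum", "count"])

# Blend each category mean with the prior; rare categories lean on the prior
smoothed = (stats["sum"] + m * prior) / (stats["count"] + m)

train_toy["app_target"] = train_toy["app"].map(smoothed)
print(smoothed)
```

With only one row each, apps 35 and 6 end up much closer to the prior than their raw means of 1.0 and 0.0 would suggest.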



In [98]:
_ = train_model(train_encoded, valid_encoded)

Training model!
Validation AUC score: 0.9540530347873288


Try leaving `ip` out of the encoded features and retraining the model with target encoding. You should find that the score increases and is above the baseline score!
There are so many distinct IP addresses that each one has little data behind it, so target encoding them mostly adds noise.

The IP address of an internet transaction
is another example of a large categorical variable. They are categorical variables
because, even though user IDs and IP addresses are numeric, their magnitude is
usually not relevant to the task at hand. For instance, the IP address might be
relevant when doing fraud detection on individual transactions—some IP
addresses or subnets may generate more fraudulent transactions than others. But
a subnet of 164.203.x.x is not inherently more fraudulent than 164.202.x.x; the
numeric value of the subnet does not matter.
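
To see why a very high-cardinality column is risky for target encoding, consider an extreme toy case in which every row has its own identifier and the labels are random. Everything in this sketch (the frame `toy`, the 800/200 split, the seed) is made up for illustration, and the encoding is deliberately unsmoothed.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(7)
n = 1000
toy = pd.DataFrame({
    "ip": np.arange(n),                      # every row has its own "IP"
    "is_attributed": rng.integers(0, 2, n),  # labels unrelated to the IP
})
train_toy, valid_toy = toy.iloc[:800], toy.iloc[800:]

# Unsmoothed target encoding: the mean label per IP, learned on train
ip_means = train_toy.groupby("ip")["is_attributed"].mean()

train_enc = train_toy["ip"].map(ip_means)
valid_enc = valid_toy["ip"].map(ip_means)

# On train the encoding simply copies the label ...
print((train_enc == train_toy["is_attributed"]).all())  # True
# ... while no validation IP was seen before, so nothing carries over
print(valid_enc.isna().all())  # True
```

A model trained on such a feature would look perfect on the training rows and learn nothing that transfers, which is the noise problem in its most exaggerated form.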

In [101]:
train, valid, test = get_data_splits(clicks)
cat_features = ['app', 'device', 'os', 'channel']

# Create the CatBoost encoder
cb_enc = ce.CatBoostEncoder(cols=cat_features, random_state=7)

# Learn encoding from the training set
cb_enc.fit(train[cat_features], train['is_attributed'])

# Apply encoding to the train and validation sets as new columns
# Make sure to add `_cb` as a suffix to the new columns
train_encoded = train.join(cb_enc.transform(train[cat_features]).add_suffix('_cb'))
valid_encoded = valid.join(cb_enc.transform(valid[cat_features]).add_suffix('_cb'))
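
CatBoost encoding is also target-based, but for each row the statistic is computed only from the rows that came before it. The sketch below illustrates that ordered idea on made-up data; the prior weight `a` and the column name `app_cb` are assumptions, and this is not the exact formula or code path of `ce.CatBoostEncoder`.

```python
import pandas as pd

toy = pd.DataFrame({
    "app": [3, 3, 35, 3, 35],
    "is_attributed": [0, 1, 1, 0, 1],
})

prior = toy["is_attributed"].mean()  # global mean used as the prior
a = 1.0  # hypothetical smoothing weight

grouped = toy.groupby("app")["is_attributed"]

# Sum and count of the target over EARLIER rows of the same category only
prev_sum = grouped.cumsum() - toy["is_attributed"]
prev_count = grouped.cumcount()

toy["app_cb"] = (prev_sum + a * prior) / (prev_count + a)
print(toy)
```

Because a row never sees its own label, the first occurrence of every category falls back to the prior, which limits the target leakage that plain target encoding suffers from.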


In [102]:
_ = train_model(train_encoded, valid_encoded)

Training model!
Validation AUC score: 0.9626909733248851


CatBoost encoding scores well above target encoding here (a validation AUC of about 0.963 versus 0.954), so we will keep it!

In [104]:

encoded = cb_enc.transform(clicks[cat_features])
for col in encoded:
    clicks.insert(len(clicks.columns), col + '_cb', encoded[col])

In [106]:
clicks.head(5)

Unnamed: 0,ip,app,device,os,channel,click_time,attributed_time,is_attributed,day,hour,...,second,ip_labels,app_labels,device_labels,os_labels,channel_labels,app_cb,device_cb,os_cb,channel_cb
0,89489,3,1,13,379,2017-11-06 15:13:23,,0,6,15,...,23,27226,3,1,13,120,0.028329,0.152087,0.138712,0.034049
1,204158,35,1,13,21,2017-11-06 15:41:07,2017-11-07 08:17:19,1,6,15,...,7,110007,35,1,13,10,0.995828,0.152087,0.138712,0.950244
2,3437,6,1,13,459,2017-11-06 15:42:32,,0,6,15,...,32,1047,6,1,13,157,0.009261,0.152087,0.138712,0.019384
3,167543,3,1,13,379,2017-11-06 15:56:17,,0,6,15,...,17,76270,3,1,13,120,0.028329,0.152087,0.138712,0.034049
4,147509,3,1,13,379,2017-11-06 15:57:01,,0,6,15,...,1,57862,3,1,13,120,0.028329,0.152087,0.138712,0.034049


## Add interaction features

Here you'll add interaction features for each pair of categorical features (ip, app, device, os, channel). The easiest way to iterate through the pairs of features is with `itertools.combinations`. For each new column, join the values as strings with an underscore, so 13 and 47 would become `"13_47"`. As you add the new columns to the dataset, be sure to label encode the values.

In [107]:
import itertools

cat_features = ['ip', 'app', 'device', 'os', 'channel']
interactions = pd.DataFrame(index=clicks.index)

# Iterate through each pair of features, combine them into interaction features
for col1, col2 in itertools.combinations(cat_features, 2):
    newcolname = col1 + "_" + col2
    new_values = clicks[col1].map(str) + "_" + clicks[col2].map(str)

    encoder = preprocessing.LabelEncoder()
    interactions[newcolname] = encoder.fit_transform(new_values)
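
As a small check of the pairing and string-joining steps, here is a sketch on a made-up three-row frame (`toy` and its values are assumptions for illustration).

```python
import itertools
import pandas as pd

toy = pd.DataFrame({"os": [13, 13, 25], "channel": [47, 379, 47]})

# Five categorical features give C(5, 2) = 10 unordered pairs
pairs = list(itertools.combinations(["ip", "app", "device", "os", "channel"], 2))
print(len(pairs))  # 10

# Join the two values as strings; each distinct pair becomes its own category
toy["os_channel"] = toy["os"].astype(str) + "_" + toy["channel"].astype(str)
print(toy["os_channel"].tolist())  # ['13_47', '13_379', '25_47']
```

The underscore separator matters: without it, the pairs (1, 23) and (12, 3) would both collapse into the string "123".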


In [108]:
clicks = clicks.join(interactions)
print("Score with interactions")
train, valid, test = get_data_splits(clicks)
_ = train_model(train, valid)

Score with interactions
Training model!
Validation AUC score: 0.9627791479263492


# Generating numerical features

Adding interactions is a quick way to create more categorical features from the data. Creating new numerical features is also effective and will typically improve the model a lot. This takes a bit of brainstorming and experimentation to find features that work well.

For these exercises I'm going to have you implement functions that operate on Pandas Series. It can take multiple minutes to run these functions on the entire data set so instead I'll provide feedback by running your function on a smaller dataset.

### Number of events in the past six hours

The first feature you'll be creating is the number of events from the same IP in the last six hours. It's likely that someone who is visiting often will download the app.

Implement a function `count_past_events` that takes a Series of click times (timestamps) and returns another Series with the number of events in the preceding time window (six hours by default). **Tip:** The `rolling` method is useful for this.

In [112]:
def count_past_events(series, time_window='6H'):
    series = pd.Series(series.index, index=series)
    past_events = series.rolling(time_window).count() - 1
    return past_events


In [113]:
past_events = count_past_events(clicks['click_time'])

In [118]:
clicks['ip_past_6hr_counts'] = past_events.values
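
The call above passes the entire `click_time` column, so the resulting counts include clicks from every IP rather than only from the same IP. A per-IP variant could look like the sketch below, which groups by `ip` before counting. The helper name `count_past_events_per_group` and the toy frame are assumptions for illustration, and the sketch expects the rows to be sorted by time.

```python
import pandas as pd

def count_past_events_per_group(timestamps, time_window="6h"):
    """Count earlier events within time_window; expects sorted timestamps."""
    # Index dummy values by time so rolling() can use a time-based window
    ones = pd.Series(1.0, index=pd.DatetimeIndex(timestamps))
    counts = ones.rolling(time_window).count() - 1  # exclude the event itself
    # Restore the original row labels so the result aligns with the frame
    return pd.Series(counts.to_numpy(), index=timestamps.index)

toy = pd.DataFrame({
    "ip": ["A", "B", "A", "B", "A"],
    "click_time": pd.to_datetime([
        "2017-11-06 00:00", "2017-11-06 00:30", "2017-11-06 01:00",
        "2017-11-06 02:00", "2017-11-06 07:30",
    ]),
})

toy = toy.sort_values("click_time")
toy["ip_past_6hr_counts"] = (
    toy.groupby("ip")["click_time"].transform(count_past_events_per_group)
)
print(toy)
```

In this toy data the last click from IP A gets a count of 0, because its two earlier clicks are more than six hours old, whereas an ungrouped count would also pick up the click from IP B at 02:00.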


In [120]:
clicks.head(5)

Unnamed: 0,ip,app,device,os,channel,click_time,attributed_time,is_attributed,day,hour,...,ip_device,ip_os,ip_channel,app_device,app_os,app_channel,device_os,device_channel,os_channel,ip_past_6hr_counts
0,89489,3,1,13,379,2017-11-06 15:13:23,,0,6,15,...,327558,844429,1204595,3543,3973,621,795,1534,1123,0.0
1,204158,35,1,13,21,2017-11-06 15:41:07,2017-11-07 08:17:19,1,6,15,...,110989,324393,473773,3486,3715,561,795,1465,1059,1.0
2,3437,6,1,13,459,2017-11-06 15:42:32,,0,6,15,...,264762,590544,795240,4180,5063,777,795,1570,1154,2.0
3,167543,3,1,13,379,2017-11-06 15:56:17,,0,6,15,...,67781,221287,333763,3543,3973,621,795,1534,1123,3.0
4,147509,3,1,13,379,2017-11-06 15:57:01,,0,6,15,...,44449,166639,260146,3543,3973,621,795,1534,1123,4.0


In [119]:
train, valid, test = get_data_splits(clicks)
_ = train_model(train, valid, test)

Training model!
Validation AUC score: 0.9623695834546713


### Time since last event

Implement a function `time_diff` that calculates the time since the last event in seconds from a Series of timestamps. This will be run like so:

```python
timedeltas = clicks.groupby('ip')['click_time'].transform(time_diff)
```

In [121]:
def time_diff(series):
    """ Returns a series with the time since the last timestamp in seconds """
    return series.diff().dt.total_seconds()
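
To check what the grouped call produces, the sketch below applies `time_diff` per IP on a made-up frame. The frame `toy` and the column name `secs_since_last_click` are assumptions for illustration.

```python
import pandas as pd

def time_diff(series):
    """Returns a series with the time since the last timestamp in seconds."""
    return series.diff().dt.total_seconds()

toy = pd.DataFrame({
    "ip": ["A", "B", "A", "B", "A"],
    "click_time": pd.to_datetime([
        "2017-11-06 00:00", "2017-11-06 00:30", "2017-11-06 01:00",
        "2017-11-06 02:00", "2017-11-06 07:30",
    ]),
})

# Sort by time first so diff() compares each click with the previous one
toy = toy.sort_values("click_time")
toy["secs_since_last_click"] = toy.groupby("ip")["click_time"].transform(time_diff)
print(toy)
```

The first click of each IP has no predecessor and therefore gets `NaN`, which LightGBM can handle as a missing value.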

In [122]:
past_events = time_diff(clicks['click_time'])

array([      nan, 1.664e+03, 8.500e+01, ..., 1.000e+00, 0.000e+00,
       0.000e+00])

In [130]:
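# Note: despite its name, this column stores the seconds since the previous
# click (the output of time_diff), not a count of past events
clicks['past_events_6hr'] = past_events.values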

In [131]:
clicks.columns

Index(['ip', 'app', 'device', 'os', 'channel', 'click_time', 'attributed_time',
       'is_attributed', 'day', 'hour', 'minute', 'second', 'ip_labels',
       'app_labels', 'device_labels', 'os_labels', 'channel_labels', 'app_cb',
       'device_cb', 'os_cb', 'channel_cb', 'ip_app', 'ip_device', 'ip_os',
       'ip_channel', 'app_device', 'app_os', 'app_channel', 'device_os',
       'device_channel', 'os_channel', 'ip_past_6hr_counts',
       'past_events_6hr'],
      dtype='object')

In [133]:

train, valid, test = get_data_splits(clicks)
_ = train_model(train, valid, test)

Training model!
Validation AUC score: 0.9624544752151282


### Which data to use for feature selection?

Since many feature selection methods require calculating statistics from the dataset, should you use all the data for feature selection?

Run the following line after you've decided your answer.