# Feature Generation Exercises

In this set of exercises, you'll create new features from the existing data. Again you'll compare the score lift for each new feature compared to a baseline model. First off, run the cells below to set up a baseline dataset and model.

In [1]:
import numpy as np
import pandas as pd
from sklearn import preprocessing, metrics
import lightgbm as lgb

# Set up code checking
from learntools.core import binder
binder.bind(globals())
from learntools.machine_learning.ex2 import *

# Create features from timestamps
click_data = pd.read_csv('../input/train_sample.csv', parse_dates=['click_time'])
click_times = click_data['click_time']
clicks = click_data.assign(day=click_times.dt.day.astype('uint8'),
                           hour=click_times.dt.hour.astype('uint8'),
                           minute=click_times.dt.minute.astype('uint8'),
                           second=click_times.dt.second.astype('uint8'))

# Label encoding for categorical features
cat_features = ['ip', 'app', 'device', 'os', 'channel']
for feature in cat_features:
    label_encoder = preprocessing.LabelEncoder()
    clicks[feature] = label_encoder.fit_transform(clicks[feature])

This means that in case of installing LightGBM from PyPI via the ``pip install lightgbm`` command, you don't need to install the gcc compiler anymore.
Instead of that, you need to install the OpenMP library, which is required for running LightGBM on the system with the Apple Clang compiler.
You can install the OpenMP library by the following command: ``brew install libomp``.


In [2]:
def get_data_splits(dataframe, valid_fraction=0.1):

    dataframe = dataframe.sort_values('click_time')
    valid_rows = int(len(dataframe) * valid_fraction)
    train = dataframe[:-valid_rows * 2]
    # valid size == test size, last two sections of the data
    valid = dataframe[-valid_rows * 2:-valid_rows]
    test = dataframe[-valid_rows:]
    
    return train, valid, test

def train_model(train, valid, test, feature_cols=None):
    if feature_cols is None:
        feature_cols = train.columns.drop(['click_time', 'attributed_time',
                                           'is_attributed'])
    dtrain = lgb.Dataset(train[feature_cols], label=train['is_attributed'])
    dvalid = lgb.Dataset(valid[feature_cols], label=valid['is_attributed'])
    dtest = lgb.Dataset(test[feature_cols], label=test['is_attributed'])
    
    param = {'num_leaves': 64, 'objective': 'binary', 
             'metric': 'auc', 'seed': 7}
    num_round = 1000
    print("Training model!")
    bst = lgb.train(param, dtrain, num_round, valid_sets=[dvalid], 
                    early_stopping_rounds=20, verbose_eval=False)
    
    ypred = bst.predict(test[feature_cols])
    score = metrics.roc_auc_score(test['is_attributed'], ypred)
    print(f"Model AUC score: {score}")
    
    return bst

In [3]:
print("Baseline model score")
train, valid, test = get_data_splits(clicks)
_ = train_model(train, valid, test)

Baseline model score
Training model!
Model AUC score: 0.9728192258764643


### 1) Add interaction features

Here you'll add interaction features for each pair of categorical features (ip, app, device, os, channel). The easiest way to iterate through the pairs of features is with `itertools.combinations`. For each new column, join the values as strings with an undescore, so 13 and 47 would become `"13_47"`. As you add the new columns to the dataset, be sure to label encode the values.

In [4]:
import itertools

In [5]:
cat_features = ['ip', 'app', 'device', 'os', 'channel']
for col1, col2 in itertools.combinations(cat_features, 2):
    new_col_name = '_'.join([col1, col2])
    paired_values = zip(clicks[col1].apply(str), clicks[col2].apply(str))
    encoder = preprocessing.LabelEncoder()
    clicks[new_col_name] = encoder.fit_transform(np.array(['_'.join([s1, s2]) for s1, s2 in paired_values]))

In [6]:
print("Score with interactions")
train, valid, test = get_data_splits(clicks)
_ = train_model(train, valid, test)

Score with interactions
Training model!
Model AUC score: 0.973121586913155


# Generating numerical features

Adding interactions is a quick way to create more categorical features from the data. It's also effective to create new numerical features, you'll typically get a lot of improvement in the model. This takes a bit of brainstorming and experimentation to find features that work well.

For these exercises I'm going to have you implement functions that operate on Pandas Series. It can take multiple minutes to run these functions on the entire data set so instead I'll provide feedback by running your function on a smaller dataset.

### 2) Number of events in the past six hours

The first feature you'll be creating is the number of events from the same IP in the last six hours. It's likely that someone who is visiting often will download the app.

Implement a function `count_past_events` that takes a Series of click times (timestamps) and returns another Series with the number of events in the last hour. **Tip:** The `rolling` method is useful for this.

In [7]:
def count_past_events(series):
    ____
    return ____

In [8]:
#%%RM_IF(PROD)%%
# My solution
def count_past_events(series, time_window='6H'):
    series = pd.Series(series.index, index=series)
    # Subtract 1 so the current event isn't counted
    past_events = series.rolling(time_window).count() - 1
    return past_events

In [9]:
# NOTE: This takes a while to run
# Thinking about saving the output here as a CSV file and loading it in for the student
# So they don't have to wait for this to run
past_events = clicks.groupby('ip')['click_time'].transform(count_past_events)
clicks['ip_past_6hr_counts'] = past_events

In [10]:
train, valid, test = get_data_splits(clicks)
train_model(train, valid, test)

Training model!
Model AUC score: 0.9747573908203024


<lightgbm.basic.Booster at 0x1a5752f908>

In [None]:
#%%RM_IF(PROD)%%
# keeping this here as it's an implementation that is ~8 times faster
# than with the simple Pandas solution

HOUR_IN_NS = 3600 * 1_000_000_000
def count_past_events(series, past_hours=1):
    """ For each event in a time series, count the number of events in the past N hours """
    # Converting to numpy array here because the operations are 100x faster
    data = series.to_numpy()
    
    # Initialize an array for storing past events
    past_events = np.zeros(data.shape)
    
    # Time delta for the time window we're considering
    hour_delta = np.timedelta64(past_hours * HOUR_IN_NS)
    
    # Use a pointer to keep track of the edge of the time window 
    pointer = 0
    for ix, event_time in enumerate(data):
        # For the very first event there necessary will be zero past events
        if ix == 0:
            continue
            
        window_start = event_time - hour_delta
        # Move the pointer forward until we get to the start of our time window
        while data[pointer + 1] < window_start:
            pointer += 1
        
        past_events[ix] = ix - (pointer + 1)
    return past_events

### 3) Features from future information

In the last exercise you created a feature that looked at past events. You could also make features that use information from events in the future. Should you use future events or not? 

**Answer:** In general, you shouldn't use information from the future. When you're using models like this in a real-world scenario you won't have data from the future. Your model's score will likely be higher when training and testing on historical data, but it will overestimate the performance on real data. I should note that using future data will improve the score on Kaggle competition test data, but avoid it when building machine learning products.

### 4) Time since last event

Implement a function `time_diff` that calculates the time since the last event in seconds from a Series of timestamps. This will be ran like so:

```python
time_deltas = clicks.groupby('ip')['click_time'].transform(time_diff)
```

In [11]:
def time_diff(series):
    """ Returns a series with the time since the last timestamp in seconds """
    ____
    return ____

In [None]:
#%%RM_IF(PROD)%%
# My solution
def time_diff(series):
    return series.diff().dt.seconds

In [12]:
# NOTE: This takes a while to run
# Thinking about saving the output here as a CSV file and loading it in for the student
# So they don't have to wait for this to run
time_deltas = clicks.groupby('ip')['click_time'].transform(time_diff)
clicks['seconds_since_last_event'] = time_deltas

In [14]:
train, valid, test = get_data_splits(clicks)
train_model(train, valid, test)

Training model!
Model AUC score: 0.9756049810177201


<lightgbm.basic.Booster at 0x112eb1860>

### 5) Number of previous app downloads

It's likely that if a visitor downloaded an app previously, it'll affect the likelihood they'll download one again. Implement a function `previous_attributions` that returns a Series with the number of times an app has been download (`'is_attributed' == 1`) before the current event.

In [None]:
def previous_attribution(series):
    """ Returns a series with the """
    ____
    return ____

In [12]:
#%%RM_IF(PROD)%%
def previous_attribution(series):
    # Subtracting raw values so I don't count the current event
    sums = series.expanding(min_periods=2).sum() - series
    return sums

In [13]:
# NOTE: This takes a while to run
# Thinking about saving the output here as a CSV file and loading it in for the student
# So they don't have to wait for this to run
previous_downloads = clicks.groupby('ip')['is_attributed'].transform(previous_attribution)
clicks['previous_downloads'] = previous_downloads

In [16]:
train, valid, test = get_data_splits(clicks)
train_model(train, valid, test)

Training model!
Model AUC score: 0.9761224849037081


<lightgbm.basic.Booster at 0x1a79e39c18>

In [11]:
clicks.head()

Unnamed: 0,ip,app,device,os,channel,click_time,attributed_time,is_attributed,day,hour,...,ip_device,ip_os,ip_channel,app_device,app_os,app_channel,device_os,device_channel,os_channel,ip_past_6hr_counts
0,27226,3,1,13,120,2017-11-06 15:13:23,,0,6,15,...,219682,496314,681060,3631,4100,675,1229,1890,985,0.0
1,110007,35,1,13,10,2017-11-06 15:41:07,2017-11-07 08:17:19,1,6,15,...,14419,39852,57863,3581,3849,625,1229,1867,962,0.0
2,1047,6,1,13,157,2017-11-06 15:42:32,,0,6,15,...,6955,19603,28875,4196,5045,787,1229,1928,1018,0.0
3,76270,3,1,13,120,2017-11-06 15:56:17,,0,6,15,...,300967,792039,1140313,3631,4100,675,1229,1890,985,0.0
4,57862,3,1,13,120,2017-11-06 15:57:01,,0,6,15,...,274929,722619,1041993,3631,4100,675,1229,1890,985,0.0


### 6) Tree-based vs Neural Network Models

So far we've been using LightGBM, a tree-based model. Would these features we've generated work well for neural networks as well as tree-based models?

**Answer:** The features themselves will work for either model. However, numerical inputs to neural networks need to be standardized first. That is, the features need to be scaled such that they have 0 mean and a standard deviation of 1. This can be done using sklearn.preprocessing.StandardScaler.

Now that you've generated a bunch of different features, you'll typically want to remove some of them to reduce the size of the model and potentially improve the performance. Next, I'll show you how to do feature selection using a few different methods such as L1 regression and Boruta.