# Improving Feature Engineering
*Anders Poirel*

In the previous notebook I realized that one-hot encoding resulted in a feature matrix that was far too large to be processed in memory, so a new approach was needed. I either need to make my data loader batch the dataset or come up with a more compact encoding.

In [1]:
import pandas as pd
import numpy as np
import tensorflow as tf

## Improved feature encoding

In [6]:
TRAIN_PATH = '../data/raw/train.csv'
TEST_PATH = '../data/raw/test.csv' 

In [7]:
train = pd.read_csv(TRAIN_PATH)
test = pd.read_csv(TEST_PATH)

In [28]:
train = train.sample(frac = 0.3)

In [29]:
# take an explicit copy so that adding columns later does not write into a slice of `train`
X_train = train.iloc[:, 1:12].copy()

In [30]:
X_train['City'] = train['City']

In [None]:
X_train.columns.values

### Street Names

In [31]:
names = X_train['EntryStreetName'].unique()

Let's take a look at the street names to see if there is some clever feature engineering to do:

In [None]:
for name in names:
    print(name)

In [None]:
X_train.groupby('EntryStreetName')['EntryStreetName'].count()

We have a lot of different entries, so we might want to aggregate by street type (we found these types by looking manually through the data):

In [10]:
street_types = [
    'Boulevard', 'Street', 'Avenue', 'Drive', 'Parkway', 'Road', 'Place', 'Way', 
    'Highway', 'Bridge', 'Tunnel', 'Terrace', 'Square',
    'Connector', 'Lane', 'Broadway', 'Wharf', 'Court', 'Circle',
]

# connector should be tested for before street
# street sometimes spelt st

In [None]:
X_train.head()

In [32]:
def encode_type(street_name):
    street_types = [
        'Boulevard', 'Street', 'Avenue', 'Drive', 'Parkway', 'Road', 'Place', 'Way', 
        'Highway', 'Bridge', 'Tunnel', 'Terrace', 'Square',
        'Connector', 'Lane', 'Broadway', 'Wharf', 'Court', 'Circle',
        ]
    
    if pd.isna(street_name):
        return 'Not reported'
    
    # special cases to deal with redundant street names
    elif 'St' in street_name:
        return 'Street'
    elif 'Pkway' in street_name:
        return 'Parkway'
    
    else:
        for street_type in street_types:
            if street_type in street_name:
                return street_type
            
        return 'Other'
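
One thing to watch with the substring tests above is that `'St' in street_name` is also true for names such as `'Stuart Avenue'`, which would then be labelled `Street`. Below is a minimal sketch of a whole-word variant, under the assumption that street types appear as separate words in the names. `encode_type_by_word`, `STREET_TYPES` and `ABBREVIATIONS` are hypothetical names introduced for illustration; only the `St` and `Pkway` abbreviations and the list of types come from this notebook.

```python
import re

# Street types from the notebook, with 'Connector' moved first so that it
# takes priority over 'Street' when both words appear in a name.
STREET_TYPES = [
    'Connector', 'Boulevard', 'Street', 'Avenue', 'Drive', 'Parkway', 'Road',
    'Place', 'Way', 'Highway', 'Bridge', 'Tunnel', 'Terrace', 'Square',
    'Lane', 'Broadway', 'Wharf', 'Court', 'Circle',
]

# Abbreviations seen in the notebook; extend as needed (assumption).
ABBREVIATIONS = {'St': 'Street', 'Pkway': 'Parkway'}


def encode_type_by_word(street_name):
    """Return the street type, matching whole words instead of substrings."""
    # None and NaN (the only value not equal to itself) count as missing
    if street_name is None or street_name != street_name:
        return 'Not reported'

    # split into alphabetic words and expand known abbreviations
    words = {ABBREVIATIONS.get(w, w) for w in re.findall(r'[A-Za-z]+', street_name)}

    for street_type in STREET_TYPES:
        if street_type in words:
            return street_type
    return 'Other'


examples = ['Stuart Avenue', 'N Main St.', 'Ocean Pkway', 'Broadway', None]
labels = [encode_type_by_word(name) for name in examples]
```

It can be applied in the same way as `encode_type`, for example `X_train['EntryStreetName'].apply(encode_type_by_word)`. Whether it labels the actual data better would need to be checked against the street names themselves.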

In [33]:
X_train['EntryStreetType'] = X_train['EntryStreetName'].apply(encode_type)
X_train['ExitStreetType'] = X_train['ExitStreetName'].apply(encode_type)

That being said, there are only 1700 different street names, so it might be interesting to try fitting a model with one-hot encoding on these, given that some of them appear several thousand times (and so we have a large sample for those categories).

### Directions

Credits to D C Achaira's great Kaggle kernel for ideas on building more informative encodings for `EntryHeading` and `ExitHeading`:
[Feature Engineering and LightGBM](https://www.kaggle.com/dcaichara/feature-engineering-and-lightgbm)
I adapt his method directly here, computing the cardinal entry and exit directions,
as well as the difference between the two.
This encoding is more informative because similar directions get more similar encodings.

In [34]:
def encode_direction(direction):
    encodings = {
        'N': 0,
        'NE': 0.25,
        'E':  0.5,
        'SE': 0.75,
        'S': 1,
        'SW': 1.25,
        'W': 1.5,
        'NW': 1.75,
    }
    return encodings[direction]

In [35]:
X_train['EntryHeading'] = X_train['EntryHeading'].apply(encode_direction)
X_train['ExitHeading'] = X_train['ExitHeading'].apply(encode_direction)

In [36]:
X_train['EntryExitDiff'] = X_train['ExitHeading'] - X_train['EntryHeading']
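
One caveat with the plain difference is that headings are circular: `N` (0) and `NW` (1.75) are neighbours on the compass but 1.75 apart numerically. Below is a minimal sketch of a wrapped difference, assuming that a full turn corresponds to 2 in the encoding above. `wrapped_turn` is a hypothetical helper introduced here, not part of the referenced kernel.

```python
def wrapped_turn(entry_heading, exit_heading):
    """Signed turn from entry to exit, wrapped into the interval [-1, 1).

    Positive values are clockwise turns, negative values are
    counter-clockwise turns, and -1 is a U-turn.
    """
    # shift by half a turn, wrap around the full turn (2), then shift back
    return (exit_heading - entry_heading + 1) % 2 - 1


nw_to_n = wrapped_turn(1.75, 0)   # entering NW, exiting N
n_to_nw = wrapped_turn(0, 1.75)   # entering N, exiting NW
straight = wrapped_turn(0.5, 0.5) # entering and exiting E
```

Because it only uses arithmetic operators, the same function also works on whole columns, for example `wrapped_turn(X_train['EntryHeading'], X_train['ExitHeading'])` once both columns hold the numeric encodings.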

In [37]:
X_train.drop(['EntryStreetName', 'ExitStreetName', 'Path',
              'IntersectionId'], axis = 1, inplace = True)

In [38]:
X_train.head()

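|        | Latitude | Longitude | EntryHeading | ExitHeading | Hour | Weekend | Month | City         | EntryStreetType | ExitStreetType | EntryExitDiff |
|--------|----------|-----------|--------------|-------------|------|---------|-------|--------------|-----------------|----------------|---------------|
| 661578 | 39.96199 | -75.1593  | 1.5          | 1.5         | 2    | 0       | 10    | Philadelphia | Street          | Street         | 0.0           |
| 42200  | 33.74457 | -84.39448 | 0.5          | 0.5         | 6    | 0       | 9     | Atlanta      | Street          | Street         | 0.0           |
| 617796 | 40.03954 | -75.06014 | 1.25         | 1.25        | 12   | 1       | 11    | Philadelphia | Boulevard       | Boulevard      | 0.0           |
| 597658 | 39.95726 | -75.21531 | 0.5          | 0.5         | 9    | 0       | 8     | Philadelphia | Street          | Street         | 0.0           |
| 671832 | 39.93302 | -75.14945 | 1.5          | 1.5         | 14   | 0       | 11    | Philadelphia | Street          | Street         | 0.0           |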


I will deal with encodings for `IntersectionId` and more advanced feature engineering once I get a baseline model up and working.

### Tensorflow boilerplate for preprocessing
I'm still learning my way around TensorFlow's `feature_columns`. A lot of this could probably have been done there, but it's easier for me to do it in pandas and then convert the results using some boilerplate from `feature_columns` for OHE. I'm sure there's a lot of power I'm not leveraging there, though.


This boilerplate is similar to that in the previous notebook so I won't comment on it here

In [39]:
y_train_1 = train['TotalTimeStopped_p20']
y_train_2 = train['TotalTimeStopped_p50']
y_train_3 = train['TotalTimeStopped_p80']

In [40]:
CATEGORICAL_COLUMNS = ['Hour', 'Weekend', 'Month', 'City', 'EntryStreetType', 'ExitStreetType']

NUMERIC_COLUMNS = ['Latitude', 'Longitude', 'EntryHeading', 'ExitHeading', 'EntryExitDiff']

In [41]:
def make_input(X, y, n_epochs = None, shuffle = False):
    def input_f():
        dataset = tf.data.Dataset.from_tensor_slices((dict(X), y)) 
        if shuffle:
            dataset = dataset.shuffle(len(y))
        dataset = dataset.repeat(n_epochs)
        dataset = dataset.batch(len(y))
        return dataset
    return input_f

In [22]:
def one_hot_cat_column(feature_name, vocab):
    return tf.feature_column.indicator_column(
        tf.feature_column.categorical_column_with_vocabulary_list(feature_name, vocab))

feature_columns = []
for feature_name in CATEGORICAL_COLUMNS:
    vocabulary = X_train[feature_name].unique()
    feature_columns.append(one_hot_cat_column(feature_name, vocabulary))
    
for feature_name in NUMERIC_COLUMNS:
    feature_columns.append(tf.feature_column.numeric_column(feature_name,
                                                            dtype = tf.float32))

#### Verification
Let's verify that this all works using the baseline `LinearRegressor` Estimator

## Modeling

In [43]:
from tensorflow.estimator import BoostedTreesRegressor
from sklearn.model_selection import KFold
from sklearn.metrics import mean_squared_error
from hyperopt import hp, fmin, tpe, Trials
from hyperopt.pyll import scope as ho_scope
from hyperopt.pyll.stochastic import sample as ho_sample

We define the objective function for `hyperopt` to optimize, similar to the previous notebook:

In [3]:
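def rmse_cv_score(X, y, n_folds, params):
    scores = []

    for train_index, val_index in KFold(n_splits = n_folds).split(X):
        
        # KFold yields positional indices, so index the pandas objects with .iloc
        X_train, X_val = X.iloc[train_index], X.iloc[val_index]
        y_train, y_val = y.iloc[train_index], y.iloc[val_index]
        
        train_input = make_input(X_train, y_train)
        # a single pass over the validation fold, so that evaluate() terminates
        val_input = make_input(X_val, y_val, n_epochs = 1)
        
        # fit a fresh model for each fold so that no fold is scored by a model
        # that has already been trained on its validation data
        model = BoostedTreesRegressor(**params)
        model.train(train_input, max_steps = 1)
        
        # the loss is the mean squared error
        result = model.evaluate(val_input)
        scores.append(np.sqrt(result['average_loss']))
        
    return np.mean(scores)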

Define the search space for hyperopt. I decided on these parameters by generalizing from my experience in tuning `xgboost`.

In [23]:
param_space = {
    'feature_columns' : feature_columns,
    'n_batches_per_layer': 1,
    'train_in_memory': True,
    'pruning_mode' : 'pre',
    'n_trees' : hp.choice('ntrees', [25, 50, 75, 100, 150]),
    'max_depth': hp.choice('max_depth', list(range(6,18,2))),
    'learning_rate' : hp.uniform('learning_rate', 0.03, 0.12),
    'l2_regularization': hp.uniform('l2_regularization', 10, 200)
}

# try l2 regularization or tree_complexity based on time?

In [26]:
ho_sample(param_space)

{'feature_columns': ((('Hour',
    (0,
     1,
     2,
     3,
     4,
     5,
     6,
     7,
     8,
     9,
     10,
     11,
     12,
     13,
     14,
     15,
     16,
     17,
     18,
     19,
     20,
     21,
     22,
     23),
    tf.int64,
    -1,
    0),),
  (('Weekend', (0, 1), tf.int64, -1, 0),),
  (('Month', (6, 7, 8, 9, 10, 11, 12, 1, 5), tf.int64, -1, 0),),
  (('City',
    ('Atlanta', 'Boston', 'Chicago', 'Philadelphia'),
    tf.string,
    -1,
    0),),
  (('EntryStreetType',
    ('Boulevard',
     'Not reported',
     'Street',
     'Avenue',
     'Road',
     'Lane',
     'Drive',
     'Parkway',
     'Place',
     'Way',
     'Other',
     'Highway',
     'Terrace',
     'Square',
     'Circle',
     'Connector',
     'Broadway',
     'Bridge',
     'Wharf',
     'Court',
     'Tunnel'),
    tf.string,
    -1,
    0),),
  (('ExitStreetType',
    ('Boulevard',
     'Street',
     'Avenue',
     'Road',
     'Lane',
     'Drive',
     'Not reported',
     'Parkway',

In [45]:
# FIXME: Review how to do hyperopt correctly

In [44]:
trials_reg = Trials()
best_20 = fmin(rmse_cv_score, space = param_space, algo = tpe.suggest, max_evals = 25)
best_50 = fmin(rmse_cv_score, space = param_space, algo = tpe.suggest, max_evals = 25)
best_80 = fmin(rmse_cv_score, space = param_space, algo = tpe.suggest, max_evals = 25)

  0%|                                                                             | 0/25 [00:00<?, ?it/s, best loss: ?]


TypeError: rmse_cv_score() missing 4 required positional arguments: 'y', 'n_folds', 'kwargs', and 'feature_columns'
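
The `TypeError` above comes from the way `fmin` calls its objective: it passes a single argument, the point sampled from `space`, so the data and the number of folds have to be bound beforehand. Below is a minimal sketch of such an adapter using `functools.partial`. `cv_objective`, `FIXED_PARAMS` and `toy_score` are hypothetical names, and `toy_score` is only a stand-in for `rmse_cv_score` so that the sketch runs without TensorFlow. Keeping constant settings out of the search space is my assumption about a cleaner setup, prompted by the sampled output above, where the feature columns come back as nested tuples.

```python
from functools import partial

# Settings that never change stay out of the search space
# (values copied from param_space in this notebook).
FIXED_PARAMS = {
    'n_batches_per_layer': 1,
    'train_in_memory': True,
    'pruning_mode': 'pre',
}


def cv_objective(sampled, X, y, n_folds, score_fn):
    """Single-argument objective once X, y, n_folds and score_fn are bound."""
    # merge the sampled hyperparameters with the fixed settings
    params = {**FIXED_PARAMS, **sampled}
    return score_fn(X, y, n_folds, params)


def toy_score(X, y, n_folds, params):
    # hypothetical stand-in for rmse_cv_score: a loss that is lowest
    # when the learning rate equals 0.1
    return abs(params['learning_rate'] - 0.1)


# bind everything except the sampled point
objective = partial(cv_objective, X=[[0.0], [1.0]], y=[0.0, 1.0],
                    n_folds=2, score_fn=toy_score)

# hyperopt would call the objective like this, with one sampled point
loss = objective({'learning_rate': 0.05})
```

With the real data, `score_fn` would be `rmse_cv_score`, `X` and `y` would be `X_train` and one of the targets, `FIXED_PARAMS` would also carry `feature_columns`, and `objective` would take the place of `rmse_cv_score` as the first argument to `fmin`, with `space` reduced to the `hp.*` entries.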

## References

[1] [Feature Engineering and LightGBM](https://www.kaggle.com/dcaichara/feature-engineering-and-lightgbm)

[2] [Tutorial on Hyperopt](https://www.kaggle.com/fanvacoolt/tutorial-on-hyperopt)
