# Introduction to Feature Engineering
**Learning Objectives**
  * Improve the accuracy of a model by using feature engineering
  * Understand that there are two places to do feature engineering in TensorFlow
    1. In the input functions
    2. Using the `tf.feature_column` module
    
Up until now we've been focusing on TensorFlow mechanics to make sure our code works, and we have neglected model performance, which at this point is **9.26 RMSE**.

In this notebook we'll attempt to improve on that using feature engineering.
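As a reminder, RMSE (root mean squared error) is the square root of the average squared difference between the predicted and the actual label, so it is expressed in the same units as `fare_amount`. Here is a minimal plain-Python sketch of the calculation. The function name `rmse` and the example numbers are made up for illustration and are not part of the notebook's pipeline.

```python
import math

def rmse(labels, predictions):
    """Root mean squared error between two equal-length sequences."""
    if len(labels) != len(predictions):
        raise ValueError("labels and predictions must have the same length")
    squared_errors = [(y - y_hat) ** 2 for y, y_hat in zip(labels, predictions)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Hypothetical fares and predictions, for illustration only.
example_labels = [10.0, 20.0, 30.0]
example_predictions = [13.0, 16.0, 30.0]

# Errors are 3, 4 and 0, so the mean squared error is 25 / 3.
example_rmse = rmse(example_labels, example_predictions)
print(example_rmse)
```

Because the errors are squared before averaging, a few large misses raise RMSE more than many small ones.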

In [1]:
import tensorflow as tf
import shutil
print(tf.__version__)


1.12.0


## 1) Load Raw Data

These are the same files created in the `create_datasets.ipynb` notebook.

In [2]:
!gsutil cp gs://cloud-training-demos/taxifare/small/*.csv .
!ls -l *.csv

Copying gs://cloud-training-demos/taxifare/small/taxi-test.csv...
Copying gs://cloud-training-demos/taxifare/small/taxi-train.csv...              
Copying gs://cloud-training-demos/taxifare/small/taxi-valid.csv...              
| [3 files][ 11.3 MiB/ 11.3 MiB]                                                
Operation completed over 3 objects/11.3 MiB.                                     
-rw-r--r-- 1 root root 1867645 Jan  6 17:27 taxi-test.csv
-rw-r--r-- 1 root root 8289592 Jan  6 17:27 taxi-train.csv
-rw-r--r-- 1 root root 1737393 Jan  6 17:27 taxi-valid.csv


## 2) Train and Evaluate Input Functions

These are the same as before with one additional line of code: a call to `add_engineered_features()` from within the `_parse_row()` function.

In [3]:
CSV_COLUMN_NAMES = ['fare_amount','dayofweek','hourofday','pickuplon','pickuplat','dropofflon','dropofflat','passengers']
CSV_DEFAULTS = [[0.0],[1],[0],[-74.0], [40.0], [-74.0], [40.7], [1]]

def read_dataset(csv_path):
    def _parse_row(row):
        # Decode the CSV row into list of TF tensors
        fields = tf.decode_csv(row, record_defaults=CSV_DEFAULTS)

        # Pack the result into a dictionary
        features = dict(zip(CSV_COLUMN_NAMES, fields))
        
        # NEW: Add engineered features
        features = add_engineered_features(features)
        
        # Separate the label from the features
        label = features.pop('fare_amount') # remove label from features and store

        return features, label
    
    # Create a dataset containing the text lines.
    dataset = tf.data.Dataset.list_files(csv_path) # (i.e. data_file_*.csv)
    dataset = dataset.flat_map(lambda filename:tf.data.TextLineDataset(filename).skip(1))

    # Parse each CSV row into correct (features,label) format for Estimator API
    dataset = dataset.map(_parse_row)
    
    return dataset

def train_input_fn(csv_path, batch_size=128):
    #1. Convert CSV into tf.data.Dataset  with (features,label) format
    dataset = read_dataset(csv_path)
      
    #2. Shuffle, repeat, and batch the examples.
    dataset = dataset.shuffle(1000).repeat().batch(batch_size)
   
    return dataset

def eval_input_fn(csv_path, batch_size=128):
    #1. Convert CSV into tf.data.Dataset  with (features,label) format
    dataset = read_dataset(csv_path)

    #2.Batch the examples.
    dataset = dataset.batch(batch_size)
   
    return dataset
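To make the row-parsing step concrete, here is a rough plain-Python analogue of what `_parse_row()` does to a single CSV line: split it into fields, zip the fields with the column names, then pop the label. This is only a sketch. `COLUMN_TYPES`, `parse_row`, and the example row are hypothetical, and unlike `tf.decode_csv` this version does not substitute defaults for missing values.

```python
import csv

CSV_COLUMN_NAMES = ['fare_amount', 'dayofweek', 'hourofday', 'pickuplon',
                    'pickuplat', 'dropofflon', 'dropofflat', 'passengers']

# Assumption: Python types that mirror the dtypes implied by CSV_DEFAULTS.
COLUMN_TYPES = [float, int, int, float, float, float, float, int]

def parse_row(row):
    """Turn one CSV line into a (features, label) pair."""
    fields = next(csv.reader([row]))
    values = [convert(field) for convert, field in zip(COLUMN_TYPES, fields)]

    # Pack the result into a dictionary keyed by column name
    features = dict(zip(CSV_COLUMN_NAMES, values))

    # Separate the label from the features
    label = features.pop('fare_amount')
    return features, label

# Made-up row, for illustration only.
example_row = '12.5,3,17,-73.99,40.75,-73.95,40.78,2'
features, label = parse_row(example_row)
print(label)
print(sorted(features))
```

The important detail is that the label is removed from the dictionary, so the model never sees `fare_amount` as an input feature.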

## 3) Feature Engineering: Feature Columns

There are two places in TensorFlow where we can do feature engineering. The first is using the `tf.feature_column` package. This allows us to easily:

- bucketize continuous features
- one-hot encode categorical features
- create feature crosses
- embed categorical features

For details on the possible `tf.feature_column` transformations and when to use each see the [official guide](https://www.tensorflow.org/guide/feature_columns).

Let's use `tf.feature_column` to create a feature that represents the combination of day of week and hour of day. This will allow our model to easily learn the difference between, say, Wednesday at 5pm (rush hour, expect higher fares) and Sunday at 5pm (light traffic, expect lower fares).

In [4]:
# 1. Represent dayofweek and hourofday as categorical columns so they can be crossed
fc_dayofweek = tf.feature_column.categorical_column_with_identity('dayofweek', num_buckets = 8)
fc_hourofday = tf.feature_column.categorical_column_with_identity('hourofday', num_buckets = 24)

# 2. Cross features to get combination of day and hour
fc_day_hr = tf.feature_column.crossed_column([fc_dayofweek, fc_hourofday], 24 * 7)

# 3. Embed sparse vector into dense representation so that it can be used with DNN
fc_day_hr_embedded = tf.feature_column.embedding_column(fc_day_hr, 3)

feature_cols = [fc_day_hr_embedded]
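To build intuition for what a feature cross represents, the sketch below maps each (day, hour) pair to its own slot in a 24 * 7 = 168 element one-hot vector. It is not how `crossed_column` is implemented: `crossed_column` hashes the combination into `hash_bucket_size` buckets, so two pairs can collide, whereas this version uses exact indexing. The helper names are hypothetical, and the sketch assumes `dayofweek` runs from 1 to 7 and `hourofday` from 0 to 23.

```python
NUM_DAYS = 7
NUM_HOURS = 24

def cross_index(dayofweek, hourofday):
    """Map (dayofweek in 1..7, hourofday in 0..23) to a unique index in 0..167."""
    if not 1 <= dayofweek <= NUM_DAYS:
        raise ValueError("dayofweek must be between 1 and 7")
    if not 0 <= hourofday < NUM_HOURS:
        raise ValueError("hourofday must be between 0 and 23")
    return (dayofweek - 1) * NUM_HOURS + hourofday

def one_hot(index, size):
    """Return a list of zeros with a single 1 at the given index."""
    vector = [0] * size
    vector[index] = 1
    return vector

# The same hour on two different days lands in two different slots,
# so the model can learn a separate weight for each combination.
midweek_5pm = cross_index(4, 17)
first_day_5pm = cross_index(1, 17)
crossed_vector = one_hot(midweek_5pm, NUM_DAYS * NUM_HOURS)
print(midweek_5pm, first_day_5pm, sum(crossed_vector))
```

A 168-wide one-hot vector is sparse, which is why the notebook passes the crossed column through `embedding_column` to obtain a small dense representation for the DNN.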

## 4) Feature Engineering: Input Functions

While feature columns are very powerful, what happens when we want to do something that there isn't a feature column for?

Recall that the input functions receive CSV data, format it, then pass it batch by batch to the model. Here we can inject arbitrary TensorFlow code to manipulate the data.

However, we need to be careful that any transformations we do in one input function, we do for all, otherwise we'll have [training-serving skew](https://developers.google.com/machine-learning/guides/rules-of-ml/#training-serving_skew).

To guard against this we encapsulate all feature engineering in a single function, `add_engineered_features()`, and call this function from every input function.
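The pattern can be sketched in plain Python: both the training path and the serving path call the same function, so the derived features cannot drift apart. Everything in this sketch is a stand-in. `add_engineered_features_sketch`, `training_features`, `serving_features`, and the placeholder feature `latdiff` are hypothetical names used only to show the structure.

```python
def add_engineered_features_sketch(features):
    """Stand-in for add_engineered_features(): adds one derived key."""
    features = dict(features)  # copy so the caller's dictionary is unchanged
    features['latdiff'] = features['pickuplat'] - features['dropofflat']
    return features

def training_features(parsed_row):
    # Training path: features come from a parsed CSV row
    return add_engineered_features_sketch(parsed_row)

def serving_features(request):
    # Serving path: features come from a prediction request
    return add_engineered_features_sketch(request)

# Made-up raw inputs, for illustration only.
raw = {'pickuplat': 40.75, 'dropofflat': 40.78}
train_out = training_features(raw)
serve_out = serving_features(raw)
print(train_out == serve_out)
```

If the transformation were written twice, once per path, a later change to one copy but not the other would silently introduce skew.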

So what feature should we engineer?

Currently our model has pickup and dropoff latitudes and longitudes as features, but feeding these directly to the model isn't very useful. Let's calculate the Euclidean distance between the pickup and dropoff points and feed that as a new feature to our model.

In [5]:
def add_engineered_features(features):
    latdiff = features['pickuplat'] - features['dropofflat']
    londiff = features['pickuplon'] - features['dropofflon']
    euclidean_dist = tf.sqrt(latdiff**2 + londiff**2)
    
    features['euclidean_dist'] = euclidean_dist
    return features

fc_distance = tf.feature_column.numeric_column('euclidean_dist')
feature_cols.append(fc_distance)
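As a quick sanity check of the formula, here is the same calculation in plain Python on a single made-up trip. The coordinates are hypothetical and the function below is a scalar stand-in for the tensor version above. Note that the result is in degrees, not kilometers or miles, and that away from the equator a degree of longitude covers less ground than a degree of latitude, so this feature is best read as a rough proxy for trip length.

```python
import math

def euclidean_dist(pickuplon, pickuplat, dropofflon, dropofflat):
    """Straight-line distance between two points, measured in degrees."""
    latdiff = pickuplat - dropofflat
    londiff = pickuplon - dropofflon
    return math.sqrt(latdiff ** 2 + londiff ** 2)

# Made-up trip: 0.03 degrees of longitude and 0.04 degrees of latitude apart,
# which gives a 3-4-5 right triangle scaled by 0.01.
dist = euclidean_dist(-74.0, 40.7, -73.97, 40.74)
print(round(dist, 6))
```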

## 5) Serving Input Receiver Function

Same as before with one addition: the received tensors are wrapped with `add_engineered_features()` before passing on to the model.

In [6]:
def serving_input_receiver_fn():
    receiver_tensors = {
        'dayofweek' : tf.placeholder(tf.int32, shape=[None]), # shape is vector to allow batch of requests
        'hourofday' : tf.placeholder(tf.int32, shape=[None]),
        'pickuplon' : tf.placeholder(tf.float32, shape=[None]), 
        'pickuplat' : tf.placeholder(tf.float32, shape=[None]),
        'dropofflat' : tf.placeholder(tf.float32, shape=[None]),
        'dropofflon' : tf.placeholder(tf.float32, shape=[None]),
        'passengers' : tf.placeholder(tf.int32, shape=[None]),
    }
    
    features = add_engineered_features(receiver_tensors) # 'features' is what is passed on to the model
    
    return tf.estimator.export.ServingInputReceiver(features, receiver_tensors)

## 6) Monitoring with TensorBoard

*Warning:* There is an issue with DataLab that causes TensorBoard to work correctly only the first time it's launched per session. So if you have issues, try resetting the kernel and launching TensorBoard again.


In [7]:
from google.datalab.ml import TensorBoard
TensorBoard().start('taxi_trained')

74808

## 7) Train and Evaluate

Same as before

In [None]:
%%time
OUTDIR = 'taxi_trained/500'
shutil.rmtree(OUTDIR, ignore_errors = True) # start fresh each time

model = tf.estimator.DNNRegressor(
    hidden_units = [10,10], # specify neural architecture
    feature_columns = feature_cols, 
    model_dir = OUTDIR,
    config = tf.estimator.RunConfig(
          tf_random_seed=1, # for reproducibility
          save_checkpoints_steps=100 # checkpoint every N steps
    ) 
)

# Add custom evaluation metric
def my_rmse(labels, predictions):
    pred_values = tf.squeeze(predictions['predictions'],axis=-1)
    return {'rmse': tf.metrics.root_mean_squared_error(labels, pred_values)}
model = tf.contrib.estimator.add_metrics(model, my_rmse) 
    
train_spec=tf.estimator.TrainSpec(
                   input_fn = lambda:train_input_fn('./taxi-train.csv'),
                   max_steps = 500)

exporter = tf.estimator.FinalExporter('exporter', serving_input_receiver_fn) # export SavedModel once at the end of training
# Note: alternatively use tf.estimator.BestExporter to export at every checkpoint that has lower loss than the previous checkpoint

eval_spec=tf.estimator.EvalSpec(
                   input_fn=lambda:eval_input_fn('./taxi-valid.csv'),
                   steps = None,
                   start_delay_secs=1, # wait at least N seconds before first evaluation (default 120)
                   throttle_secs=1, # wait at least N seconds before each subsequent evaluation (default 600)
                   exporters = exporter) # export SavedModel once at the end of training

tf.logging.set_verbosity(tf.logging.INFO) # so loss is printed during training
shutil.rmtree(OUTDIR, ignore_errors = True) # start fresh each time
tf.estimator.train_and_evaluate(model, train_spec, eval_spec)

## Results

Our RMSE is now 6.67, our first significant improvement! If we look at the RMSE trend in TensorBoard, it appears the model is still learning, so training past 500 steps would likely lower the RMSE even more. Let's run again, this time for 10x as many steps.

In [None]:
%%time
OUTDIR = 'taxi_trained/5000'
shutil.rmtree(OUTDIR, ignore_errors = True) # start fresh each time

model = tf.estimator.DNNRegressor(
    hidden_units = [10,10], # specify neural architecture
    feature_columns = feature_cols, 
    model_dir = OUTDIR,
    config = tf.estimator.RunConfig(
          tf_random_seed=1, # for reproducibility
          save_checkpoints_steps=500 # checkpoint every N steps
    ) 
)

# Add custom evaluation metric
def my_rmse(labels, predictions):
    pred_values = tf.squeeze(predictions['predictions'],axis=-1)
    return {'rmse': tf.metrics.root_mean_squared_error(labels, pred_values)}
model = tf.contrib.estimator.add_metrics(model, my_rmse) 
    
train_spec=tf.estimator.TrainSpec(
                   input_fn = lambda:train_input_fn('./taxi-train.csv'),
                   max_steps = 5000)

exporter = tf.estimator.FinalExporter('exporter', serving_input_receiver_fn) # export SavedModel once at the end of training

eval_spec=tf.estimator.EvalSpec(
                   input_fn=lambda:eval_input_fn('./taxi-valid.csv'),
                   steps = None,
                   start_delay_secs=1, # wait at least N seconds before first evaluation (default 120)
                   throttle_secs=1, # wait at least N seconds before each subsequent evaluation (default 600)
                   exporters = exporter) # export SavedModel once at the end of training

tf.logging.set_verbosity(tf.logging.INFO) # so loss is printed during training
shutil.rmtree(OUTDIR, ignore_errors = True) # start fresh each time
tf.estimator.train_and_evaluate(model, train_spec, eval_spec)

## Results: 5000 steps

Our RMSE is now 4.79! It looks like RMSE is still decreasing, but training is getting slow, so we should move to the cloud if we want to train longer.

Also, we haven't explored our hyperparameters much. Is our neural architecture of [10,10] optimal? Is our learning rate ideal?

In the next notebook we'll show a way to choose the ideal values for all of these hyperparameters automatically.

## 8) Cleanup

In [12]:
if len(TensorBoard.list()) > 0:
    # Stop every running TensorBoard instance
    for pid in TensorBoard.list()['pid']:
        TensorBoard().stop(pid)
else:
    print('No TensorBoard instances to stop')

## Challenge Exercise

Modify your solution to the challenge exercise in c_dataset.ipynb appropriately.

Copyright 2019 Google Inc. Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License