# More Feature Engineering - Wide and Deep models

**Learning Objectives** 
  * Build a Wide and Deep model using the appropriate TensorFlow feature columns
    
In this notebook we'll use what we learned about feature columns to build a Wide & Deep model. 

In [19]:
# Ensure that we have TensorFlow 1.12 installed.
!pip3 freeze | grep tensorflow==1.12.0 || pip3 install tensorflow==1.12.0

tensorflow==1.12.0


In [20]:
import tensorflow as tf
import numpy as np
import shutil
print(tf.__version__)

1.12.0


## Load Raw Data

These are the same files created in the `create_datasets.ipynb` notebook

In [21]:
#!gsutil cp gs://cloud-training-demos/taxifare/small/*.csv .
#!ls -l *.csv

## Train and Evaluate Input Functions

These are the same as before with one additional line of code: a call to `add_engineered_features()` from within the `_parse_row()` function.

In [22]:
CSV_COLUMN_NAMES = ['fare_amount','dayofweek','hourofday','pickuplon','pickuplat','dropofflon','dropofflat']
CSV_DEFAULTS = [[0.0],[1],[0],[-74.0], [40.0], [-74.0], [40.7]]

def read_dataset(csv_path):
    def _parse_row(row):
        # Decode the CSV row into list of TF tensors
        fields = tf.decode_csv(row, record_defaults=CSV_DEFAULTS)

        # Pack the result into a dictionary
        features = dict(zip(CSV_COLUMN_NAMES, fields))
        
        # NEW: Add engineered features
        features = add_engineered_features(features)
        
        # Separate the label from the features
        label = features.pop('fare_amount') # remove label from features and store

        return features, label
    
    # Create a dataset containing the text lines.
    dataset = tf.data.Dataset.list_files(csv_path) # (i.e. data_file_*.csv)
    dataset = dataset.flat_map(lambda filename:tf.data.TextLineDataset(filename).skip(1))

    # Parse each CSV row into correct (features,label) format for Estimator API
    dataset = dataset.map(_parse_row)
    
    return dataset

def train_input_fn(csv_path, batch_size=128):
    #1. Convert CSV into tf.data.Dataset  with (features,label) format
    dataset = read_dataset(csv_path)
      
    #2. Shuffle, repeat, and batch the examples.
    dataset = dataset.shuffle(1000).repeat().batch(batch_size)
   
    return dataset

def eval_input_fn(csv_path, batch_size=128):
    #1. Convert CSV into tf.data.Dataset  with (features,label) format
    dataset = read_dataset(csv_path)

    #2.Batch the examples.
    dataset = dataset.batch(batch_size)
   
    return dataset

## Feature columns for Wide and Deep model

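Recall that a Wide & Deep model trains a linear model (the wide part) and a deep neural network (the deep part) jointly. Sparse categorical features and feature crosses go to the wide part, where a linear model can memorize specific combinations. Dense features, such as numeric columns and embeddings, go to the deep part, which can learn more complex relationships. We start by defining the sparse building blocks: categorical columns for the day and hour, and bucketized latitudes and longitudes.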

In [23]:
# 1. One hot encode dayofweek and hourofday
fc_dayofweek = tf.feature_column.categorical_column_with_identity('dayofweek', num_buckets = 8)
fc_hourofday = tf.feature_column.categorical_column_with_identity('hourofday', num_buckets = 24)

# 2. Bucketize latitudes and longitudes
NBUCKETS = 16
latbuckets = np.linspace(38.0, 42.0, NBUCKETS).tolist()
lonbuckets = np.linspace(-76.0, -72.0, NBUCKETS).tolist()
fc_bucketized_plat = tf.feature_column.bucketized_column(tf.feature_column.numeric_column('pickuplat'), latbuckets)
fc_bucketized_plon = tf.feature_column.bucketized_column(tf.feature_column.numeric_column('pickuplon'), lonbuckets)
fc_bucketized_dlat = tf.feature_column.bucketized_column(tf.feature_column.numeric_column('dropofflat'), latbuckets)
fc_bucketized_dlon = tf.feature_column.bucketized_column(tf.feature_column.numeric_column('dropofflon'), lonbuckets)
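
To make the bucketing concrete, here is a minimal pure-Python sketch of how a value maps to a bucket index given a sorted list of boundaries. It is an illustration only, not TensorFlow's implementation; `linspace` and `bucket_index` are hypothetical helpers written for this example. The behaviour it mimics is the documented one for `bucketized_column`: N boundaries produce N + 1 buckets, and each bucket includes its left boundary.

```python
import bisect

def linspace(start, stop, num):
    """Return `num` evenly spaced values from start to stop (like np.linspace)."""
    step = (stop - start) / (num - 1)
    return [start + i * step for i in range(num)]

def bucket_index(value, boundaries):
    """Index of the bucket that `value` falls into.

    Bucket 0 is (-inf, boundaries[0]), bucket i is
    [boundaries[i-1], boundaries[i]), and the last bucket is
    [boundaries[-1], +inf).
    """
    return bisect.bisect_right(boundaries, value)

NBUCKETS = 16
latbuckets = linspace(38.0, 42.0, NBUCKETS)

# Latitudes below, inside, and above the bucketized range
for lat in (37.0, 40.0, 40.7, 43.0):
    print(lat, "->", bucket_index(lat, latbuckets))
```

Note that 16 boundaries yield 17 possible bucket indices (0 through 16), because values below the first boundary and above the last boundary each get their own bucket.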

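### Engineered features

`add_engineered_features()` derives the latitude and longitude differences between pickup and dropoff, plus the straight-line (Euclidean) distance between them in degrees. It is called from `_parse_row()` and from the serving input receiver function, so training and serving see the same features.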

In [24]:
def add_engineered_features(features):
    features['latdiff'] = features['pickuplat'] - features['dropofflat'] # North/South
    features['londiff'] = features['pickuplon'] - features['dropofflon'] # East/West
    features['euclidean_dist'] = tf.sqrt(features['latdiff']**2 + features['londiff']**2)

    return features
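
As a quick sanity check, the same arithmetic can be reproduced for a single trip in plain Python. This is only a sketch: `engineered_features` is a hypothetical stand-in for the tensor version above, and the coordinates are made up for illustration.

```python
import math

def engineered_features(pickuplat, pickuplon, dropofflat, dropofflon):
    """Plain-Python counterpart of add_engineered_features() for one trip."""
    latdiff = pickuplat - dropofflat   # North/South displacement, in degrees
    londiff = pickuplon - dropofflon   # East/West displacement, in degrees
    euclidean_dist = math.sqrt(latdiff**2 + londiff**2)
    return {'latdiff': latdiff,
            'londiff': londiff,
            'euclidean_dist': euclidean_dist}

# Hypothetical trip: coordinates chosen only for illustration
trip = engineered_features(pickuplat=40.75, pickuplon=-73.99,
                           dropofflat=40.70, dropofflon=-74.01)
print(trip)
```

The distance is measured in degrees rather than kilometres, so it is only a rough proxy for trip length; it is still a useful signal because fare tends to grow with distance.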

### Gather list of feature columns




In [36]:
def get_wide_deep():
    # Feature crosses, defined once so the wide and deep columns can share them
    fc_crossed_dloc = tf.feature_column.crossed_column(keys = [fc_bucketized_dlat, fc_bucketized_dlon],
                                                       hash_bucket_size = NBUCKETS * NBUCKETS)
    fc_crossed_ploc = tf.feature_column.crossed_column(keys = [fc_bucketized_plat, fc_bucketized_plon],
                                                       hash_bucket_size = NBUCKETS * NBUCKETS)
    fc_crossed_pd_pair = tf.feature_column.crossed_column(keys = [fc_crossed_dloc, fc_crossed_ploc],
                                                          hash_bucket_size = NBUCKETS**4)
    fc_crossed_day_hr = tf.feature_column.crossed_column(keys = [fc_dayofweek, fc_hourofday],
                                                         hash_bucket_size = 24*7)

    # Wide columns are sparse, have linear relationship with the output
    wide_columns = [
        # Feature crosses
        fc_crossed_dloc, fc_crossed_ploc, fc_crossed_pd_pair, fc_crossed_day_hr,
        # Sparse columns
        fc_dayofweek, fc_hourofday
    ]
    
    # Continuous columns are deep, have a complex relationship with the output
    deep_columns = [
        # Embedding columns map the sparse crosses to dense 10-dimensional vectors
        tf.feature_column.embedding_column(fc_crossed_pd_pair, 10),
        tf.feature_column.embedding_column(fc_crossed_day_hr, 10),

        # Numeric columns
        tf.feature_column.numeric_column('pickuplat'),
        tf.feature_column.numeric_column('pickuplon'),
        tf.feature_column.numeric_column('dropofflon'),
        tf.feature_column.numeric_column('dropofflat'),
        tf.feature_column.numeric_column('latdiff'),
        tf.feature_column.numeric_column('londiff'),
        tf.feature_column.numeric_column('euclidean_dist'),
        
        # One-hot encoding of the day/hour cross
        tf.feature_column.indicator_column(fc_crossed_day_hr),
    ]
    
    return wide_columns, deep_columns
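
A crossed column turns each combination of its inputs into a single categorical id by hashing the combination and taking the result modulo `hash_bucket_size`. The sketch below imitates that idea in plain Python. It is not TensorFlow's implementation: `crossed_bucket` is a hypothetical helper, and MD5 is used only to get a deterministic hash, so the bucket ids will not match the ones TensorFlow produces.

```python
import hashlib

def crossed_bucket(values, hash_bucket_size):
    """Map a combination of categorical values to one of `hash_bucket_size` ids.

    ASSUMPTION: MD5 stands in for TensorFlow's own fingerprint function,
    so the actual ids produced by crossed_column will differ.
    """
    key = "_X_".join(str(v) for v in values).encode("utf-8")
    return int(hashlib.md5(key).hexdigest(), 16) % hash_bucket_size

NBUCKETS = 16
# Every (lat bucket, lon bucket) pair: 17 x 17, since 16 boundaries give 17 buckets
pairs = [(lat, lon) for lat in range(NBUCKETS + 1) for lon in range(NBUCKETS + 1)]
ids = [crossed_bucket(pair, NBUCKETS * NBUCKETS) for pair in pairs]

print("distinct pairs:   ", len(pairs))
print("distinct hash ids:", len(set(ids)))
```

Because 289 possible pairs are hashed into 256 buckets, some pairs necessarily share a bucket. A larger `hash_bucket_size` reduces such collisions at the cost of more weights to learn.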

## Serving Input Receiver Function 

Same as before except the received tensors are wrapped with `add_engineered_features()`.

In [37]:
def serving_input_receiver_fn():
    receiver_tensors = {
        'dayofweek' : tf.placeholder(tf.int32, shape=[None]), # shape is vector to allow batch of requests
        'hourofday' : tf.placeholder(tf.int32, shape=[None]),
        'pickuplon' : tf.placeholder(tf.float32, shape=[None]), 
        'pickuplat' : tf.placeholder(tf.float32, shape=[None]),
        'dropofflat' : tf.placeholder(tf.float32, shape=[None]),
        'dropofflon' : tf.placeholder(tf.float32, shape=[None]),
    }
    
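    # Pass a copy: add_engineered_features() adds keys in place, and the derived
    # features should not end up in receiver_tensors (the inputs a client sends).
    features = add_engineered_features(dict(receiver_tensors)) # 'features' is what is passed on to the model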
    
    return tf.estimator.export.ServingInputReceiver(features, receiver_tensors)

## Train and Evaluate (500 train steps)

As before, we'll train the model for 500 steps (side note: how many epochs do 500 training steps represent?). Let's see how the engineered features we've added affect performance.
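
The side note comes down to simple arithmetic: each training step consumes one batch, so the number of epochs is steps times batch size divided by the number of training examples. A minimal sketch follows; the row count used here is a placeholder assumption, since the size of `taxi-train.csv` is not stated in this notebook.

```python
def examples_seen(train_steps, batch_size):
    """Total number of examples consumed: one batch per training step."""
    return train_steps * batch_size

def epochs(train_steps, batch_size, num_train_examples):
    """Number of passes over the training set."""
    return examples_seen(train_steps, batch_size) / num_train_examples

BATCH_SIZE = 128             # default batch_size of train_input_fn
NUM_TRAIN_EXAMPLES = 10000   # ASSUMPTION: placeholder, replace with your row count

print(examples_seen(500, BATCH_SIZE))               # examples consumed in 500 steps
print(epochs(500, BATCH_SIZE, NUM_TRAIN_EXAMPLES))  # epochs under the assumed row count
```

With the default batch size of 128, 500 steps consume 64,000 examples; under the assumed 10,000 rows that would be 6.4 epochs. Substitute the real row count of your training file to get the actual figure.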

In [38]:
%%time
OUTDIR = 'taxi_trained_wd/500'
shutil.rmtree(OUTDIR, ignore_errors = True) # start fresh each time

# Collect the wide and deep columns from above
wide_columns, deep_columns = get_wide_deep()

model = tf.estimator.DNNLinearCombinedRegressor(
    model_dir = OUTDIR,
    linear_feature_columns = wide_columns,
    dnn_feature_columns = deep_columns,
    dnn_hidden_units = [10,10], # specify neural architecture
    config = tf.estimator.RunConfig(
          tf_random_seed=1, # for reproducibility
          save_checkpoints_steps=100 # checkpoint every N steps
    ) 
)

# Add custom evaluation metric
def my_rmse(labels, predictions):
    pred_values = tf.squeeze(predictions['predictions'],axis=-1)
    return {'rmse': tf.metrics.root_mean_squared_error(labels, pred_values)}
model = tf.contrib.estimator.add_metrics(model, my_rmse) 
    
train_spec=tf.estimator.TrainSpec(
                   input_fn = lambda:train_input_fn('./taxi-train.csv'),
                   max_steps = 500)

exporter = tf.estimator.FinalExporter('exporter', serving_input_receiver_fn) # export SavedModel once at the end of training
# Note: alternatively use tf.estimator.BestExporter to export at every checkpoint that has lower loss than the previous checkpoint

eval_spec=tf.estimator.EvalSpec(
                   input_fn=lambda:eval_input_fn('./taxi-valid.csv'),
                   steps = None,
                   start_delay_secs=1, # wait at least N seconds before first evaluation (default 120)
                   throttle_secs=1, # wait at least N seconds before each subsequent evaluation (default 600)
                   exporters = exporter) # export SavedModel once at the end of training

tf.logging.set_verbosity(tf.logging.INFO) # so loss is printed during training
shutil.rmtree(OUTDIR, ignore_errors = True) # start fresh each time
tf.estimator.train_and_evaluate(model, train_spec, eval_spec)

INFO:tensorflow:Using config: {'_train_distribute': None, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7ff5e6bc70f0>, '_task_type': 'worker', '_keep_checkpoint_every_n_hours': 10000, '_tf_random_seed': 1, '_master': '', '_global_id_in_cluster': 0, '_model_dir': 'taxi_trained_wd/500', '_device_fn': None, '_eval_distribute': None, '_save_summary_steps': 100, '_save_checkpoints_steps': 100, '_service': None, '_save_checkpoints_secs': None, '_num_worker_replicas': 1, '_session_config': allow_soft_placement: true
graph_options {
  rewrite_options {
    meta_optimizer_iterations: ONE
  }
}
, '_protocol': None, '_experimental_distribute': None, '_is_chief': True, '_task_id': 0, '_keep_checkpoint_max': 5, '_num_ps_replicas': 0, '_log_step_count_steps': 100, '_evaluation_master': ''}
INFO:tensorflow:Using config: {'_train_distribute': None, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7ff5e7c30da0>, '_task_type': 'worker', '_

### Results

Our RMSE for the Wide and Deep model is worse than for the DNN. However, we have only trained for 500 steps and it looks like the model is still learning. As before, let's run again, this time for 10x as many steps, so we can make a fair comparison.
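
For reference, RMSE is the square root of the mean squared difference between labels and predictions. Since evaluation runs over batches, the metric has to be accumulated across them. The sketch below does this in plain Python by keeping a running sum of squared errors and a count. `StreamingRMSE` is a hypothetical helper for illustration, not the TensorFlow implementation, and the fares are made up.

```python
import math

class StreamingRMSE:
    """Accumulates squared errors over batches, then reports the overall RMSE."""

    def __init__(self):
        self.sum_squared_error = 0.0
        self.count = 0

    def update(self, labels, predictions):
        # Add one batch of (label, prediction) pairs to the running totals
        for y, y_hat in zip(labels, predictions):
            self.sum_squared_error += (y - y_hat) ** 2
            self.count += 1

    def result(self):
        return math.sqrt(self.sum_squared_error / self.count)

metric = StreamingRMSE()
metric.update(labels=[10.0, 12.0], predictions=[10.0, 12.0])  # errors: 0, 0
metric.update(labels=[8.0, 20.0], predictions=[5.0, 24.0])    # errors: 3, -4
print(metric.result())  # sqrt((0 + 0 + 9 + 16) / 4) = 2.5
```

Averaging the two per-batch RMSE values instead (0 and about 3.54) would give about 1.77, which is why the squared errors are summed over all batches before taking the square root.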

## Train and Evaluate (5,000 train steps)

Now, just as above, we'll execute a longer training job with 5,000 train steps using our engineered features and assess the performance.

In [32]:
%%time
OUTDIR = 'taxi_trained_wd/5000'
shutil.rmtree(OUTDIR, ignore_errors = True) # start fresh each time

# Collect the wide and deep columns from above
wide_columns, deep_columns = get_wide_deep()

model = tf.estimator.DNNLinearCombinedRegressor(
    model_dir = OUTDIR,
    linear_feature_columns = wide_columns,
    dnn_feature_columns = deep_columns,
    dnn_hidden_units = [10,10], # specify neural architecture
    config = tf.estimator.RunConfig(
          tf_random_seed=1, # for reproducibility
          save_checkpoints_steps=100 # checkpoint every N steps
    ) 
)

# Add custom evaluation metric
def my_rmse(labels, predictions):
    pred_values = tf.squeeze(predictions['predictions'],axis=-1)
    return {'rmse': tf.metrics.root_mean_squared_error(labels, pred_values)}
model = tf.contrib.estimator.add_metrics(model, my_rmse) 
    
train_spec=tf.estimator.TrainSpec(
                   input_fn = lambda:train_input_fn('./taxi-train.csv'),
                   max_steps = 5000)

exporter = tf.estimator.FinalExporter('exporter', serving_input_receiver_fn) # export SavedModel once at the end of training
# Note: alternatively use tf.estimator.BestExporter to export at every checkpoint that has lower loss than the previous checkpoint

eval_spec=tf.estimator.EvalSpec(
                   input_fn=lambda:eval_input_fn('./taxi-valid.csv'),
                   steps = None,
                   start_delay_secs=1, # wait at least N seconds before first evaluation (default 120)
                   throttle_secs=1, # wait at least N seconds before each subsequent evaluation (default 600)
                   exporters = exporter) # export SavedModel once at the end of training

tf.logging.set_verbosity(tf.logging.INFO) # so loss is printed during training
shutil.rmtree(OUTDIR, ignore_errors = True) # start fresh each time
tf.estimator.train_and_evaluate(model, train_spec, eval_spec)

INFO:tensorflow:Using config: {'_train_distribute': None, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7ff5e7c30da0>, '_task_type': 'worker', '_keep_checkpoint_every_n_hours': 10000, '_tf_random_seed': 1, '_master': '', '_global_id_in_cluster': 0, '_model_dir': 'taxi_trained_wd/5000', '_device_fn': None, '_eval_distribute': None, '_save_summary_steps': 100, '_save_checkpoints_steps': 100, '_service': None, '_save_checkpoints_secs': None, '_num_worker_replicas': 1, '_session_config': allow_soft_placement: true
graph_options {
  rewrite_options {
    meta_optimizer_iterations: ONE
  }
}
, '_protocol': None, '_experimental_distribute': None, '_is_chief': True, '_task_id': 0, '_keep_checkpoint_max': 5, '_num_ps_replicas': 0, '_log_step_count_steps': 100, '_evaluation_master': ''}
INFO:tensorflow:Using config: {'_train_distribute': None, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7ff5e7c4e748>, '_task_type': 'worker', '

### Results

Our RMSE is better but still not as good as the DNN we built. The RMSE may still be decreasing, but training is getting slow, so we should move to the cloud if we want to train longer.

Also we haven't explored our hyperparameters much. Is our neural architecture of two layers with 10 nodes each optimal? 

In the next notebook we'll explore this.

Copyright 2019 Google Inc. Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License