<h1> 2c. Refactoring to add batching and feature-creation </h1>

In this notebook, we continue reading the same small dataset but refactor our ML pipeline in two small but significant ways:
<ol>
<li> Refactor the input to read data in batches.
<li> Refactor the feature creation so that it is not one-to-one with inputs.
</ol>

In [1]:
import tensorflow as tf
print(tf.__version__)

1.4.1


In [3]:
import datalab.bigquery as bq
import numpy as np
import shutil

<h2> 1. Refactor the input </h2>

Read data created in Lab1a, but this time make it more general, so that we are reading in batches.  Instead of using Pandas, we will create a TensorFlow Dataset.

In [None]:
CSV_COLUMNS = ['fare_amount', 'pickuplon','pickuplat','dropofflon','dropofflat','passengers', 'key']
LABEL_COLUMN = 'fare_amount'
DEFAULTS = [[0.0], [-74.0], [40.0], [-74.0], [40.7], [1.0], ['nokey']]

def read_dataset(filename, mode, batch_size=512):
    def decode_csv(value_column):
      columns = tf.decode_csv(value_column, record_defaults=DEFAULTS)
      features = dict(zip(CSV_COLUMNS, columns))
      label = features.pop(LABEL_COLUMN)
      return features, label
    
    dataset = tf.data.TextLineDataset(filename).map(decode_csv)
    if mode == tf.estimator.ModeKeys.TRAIN:
        num_epochs = None # indefinitely
        dataset = dataset.shuffle(buffer_size=10*batch_size)
    else:
        num_epochs = 1 # end-of-input after this
 
    dataset = dataset.repeat(num_epochs).batch(batch_size)
    return dataset.make_one_shot_iterator().get_next()
    

def get_train():
  return read_dataset('./taxi-train.csv', mode=tf.estimator.ModeKeys.TRAIN)

def get_valid():
  return read_dataset('./taxi-valid.csv', mode=tf.estimator.ModeKeys.EVAL)

def get_test():
  return read_dataset('./taxi-test.csv', mode=tf.estimator.ModeKeys.EVAL)
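
To make the shuffle, repeat, batch chain concrete, here is a minimal plain-Python sketch of the same idea. It is an illustration only, not how tf.data is implemented: the helper name `batches` is hypothetical, and it shuffles each epoch fully instead of using a bounded shuffle buffer.

```python
import itertools
import random


def batches(records, batch_size, num_epochs=1, shuffle=False, seed=0):
    """Yield lists of up to batch_size records (hypothetical helper).

    num_epochs=None repeats the input indefinitely, as in TRAIN mode;
    num_epochs=1 signals end-of-input after one pass, as in EVAL mode.
    """
    records = list(records)
    rng = random.Random(seed)

    def stream():
        epochs = itertools.count() if num_epochs is None else range(num_epochs)
        for _ in epochs:
            epoch = list(records)
            if shuffle:
                rng.shuffle(epoch)  # simplification: full shuffle, no buffer
            for record in epoch:
                yield record

    source = stream()
    while True:
        batch = list(itertools.islice(source, batch_size))
        if not batch:
            return  # input exhausted
        yield batch


# One pass over five records in batches of two; the last batch is short.
eval_batches = list(batches(range(5), batch_size=2, num_epochs=1))

# Indefinite repetition: take only the first four batches.
train_batches = list(itertools.islice(
    batches(range(5), batch_size=2, num_epochs=None), 4))
```

Because repetition happens before batching, a batch can straddle an epoch boundary (the third training batch here is `[4, 0]`), which matches the `repeat(...).batch(...)` order used in read_dataset.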

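Alternatively, we can add a filename queue to the TensorFlow graph; the queue cycles through the input files num_epochs times. Note that the cell below redefines read_dataset and the get_* functions, so running it replaces the Dataset-based version above.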

In [4]:
CSV_COLUMNS = ['fare_amount', 'pickuplon','pickuplat','dropofflon','dropofflat','passengers', 'key']
LABEL_COLUMN = 'fare_amount'
DEFAULTS = [[0.0], [-74.0], [40.0], [-74.0], [40.7], [1.0], ['nokey']]

def read_dataset(filename, mode, batch_size=512):
    if mode == tf.estimator.ModeKeys.TRAIN:
        num_epochs = None # indefinitely
    else:
        num_epochs = 1 # end-of-input after this
        
    input_file_names = tf.train.match_filenames_once(filename)
    filename_queue = tf.train.string_input_producer(
        input_file_names, num_epochs=num_epochs, shuffle=True)
    reader = tf.TextLineReader()
    _, value = reader.read_up_to(filename_queue, num_records=batch_size)
    if mode == tf.estimator.ModeKeys.TRAIN:
        value = tf.train.shuffle_batch([value], batch_size, capacity=10*batch_size,
                                       min_after_dequeue=batch_size, enqueue_many=True,
                                       allow_smaller_final_batch=False)
    
    value_column = tf.expand_dims(value, -1)
    columns = tf.decode_csv(value_column, record_defaults=DEFAULTS)
    features = dict(zip(CSV_COLUMNS, columns))
    label = features.pop(LABEL_COLUMN)
    return features, label

def get_train():
  return read_dataset('./taxi-train.csv', mode=tf.estimator.ModeKeys.TRAIN)

def get_valid():
  return read_dataset('./taxi-valid.csv', mode=tf.estimator.ModeKeys.EVAL)

def get_test():
  return read_dataset('./taxi-test.csv', mode=tf.estimator.ModeKeys.EVAL)
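
Both versions of read_dataset end the same way: each CSV line is decoded against DEFAULTS, zipped with CSV_COLUMNS, and the label is popped out of the feature dictionary. The plain-Python sketch below mirrors that step for a single line. The helper `decode_line` and the sample row are hypothetical, and the simple `split(',')` does not handle quoted fields, so treat it as an illustration of the idea, not a substitute for tf.decode_csv.

```python
CSV_COLUMNS = ['fare_amount', 'pickuplon', 'pickuplat', 'dropofflon',
               'dropofflat', 'passengers', 'key']
LABEL_COLUMN = 'fare_amount'
DEFAULTS = [[0.0], [-74.0], [40.0], [-74.0], [40.7], [1.0], ['nokey']]


def decode_line(line):
    """Split one CSV line into (features, label); hypothetical helper."""
    values = []
    for raw, default in zip(line.split(','), DEFAULTS):
        cast = type(default[0])  # float or str, taken from the default
        # An empty field falls back to the column's default value.
        values.append(cast(raw) if raw != '' else default[0])
    features = dict(zip(CSV_COLUMNS, values))
    label = features.pop(LABEL_COLUMN)
    return features, label


# Made-up row with the 'passengers' field left empty.
features, label = decode_line('12.5,-73.98,40.75,-73.97,40.76,,abc123')
```

Here the empty passengers field takes its default of 1.0, and fare_amount ends up as the label and is absent from the features.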

<h2> 2. Refactor the way features are created. </h2>

For now, the input columns pass through unchanged (same as the previous lab).  However, routing them through add_more_features() will later let us break the one-to-one relationship between inputs and features.

In [5]:
INPUT_COLUMNS = [
    tf.feature_column.numeric_column('pickuplon'),
    tf.feature_column.numeric_column('pickuplat'),
    tf.feature_column.numeric_column('dropofflat'),
    tf.feature_column.numeric_column('dropofflon'),
    tf.feature_column.numeric_column('passengers'),
]

def add_more_features(feats):
  # nothing to add (yet!)
  return feats

feature_cols = add_more_features(INPUT_COLUMNS)
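
As one illustration of what breaking the one-to-one relationship could look like, the sketch below derives two features from four raw inputs, using plain Python on a single parsed row. The function `add_engineered` and the feature names `latdiff` and `londiff` are hypothetical and do not appear in this notebook; in the TensorFlow pipeline the same idea would be expressed with feature columns and tensor operations instead.

```python
def add_engineered(row):
    """Return a copy of row with two derived features (hypothetical names)."""
    out = dict(row)
    out['latdiff'] = row['pickuplat'] - row['dropofflat']
    out['londiff'] = row['pickuplon'] - row['dropofflon']
    return out


# Made-up input row with the five raw inputs used by the model.
row = {'pickuplon': -73.98, 'pickuplat': 40.75,
       'dropofflon': -73.97, 'dropofflat': 40.76, 'passengers': 1.0}
engineered = add_engineered(row)
```

Five inputs now yield seven features, so the feature set is no longer a direct copy of the input columns.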

<h2> Create and train the model </h2>

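Note that we no longer have a num_steps variable.  Because get_train() sets num_epochs to None (repeat indefinitely), the length of training is bounded by the steps argument passed to train().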

In [9]:
tf.logging.set_verbosity(tf.logging.INFO)
OUTDIR='taxi_trained'
shutil.rmtree(OUTDIR, ignore_errors=True) # start fresh each time
model = tf.estimator.LinearRegressor(
      feature_columns=feature_cols, model_dir=OUTDIR)
model.train(input_fn=get_train, steps=100);

INFO:tensorflow:Using default config.
INFO:tensorflow:Using config: {'_save_checkpoints_secs': 600, '_session_config': None, '_keep_checkpoint_max': 5, '_task_type': 'worker', '_is_chief': True, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7f52df11a3d0>, '_save_checkpoints_steps': None, '_keep_checkpoint_every_n_hours': 10000, '_service': None, '_num_ps_replicas': 0, '_tf_random_seed': None, '_master': '', '_num_worker_replicas': 1, '_task_id': 0, '_log_step_count_steps': 100, '_model_dir': 'taxi_trained', '_save_summary_steps': 100}
INFO:tensorflow:Create CheckpointSaverHook.
INFO:tensorflow:Saving checkpoints for 1 into taxi_trained/model.ckpt.
INFO:tensorflow:loss = 103787.586, step = 1
INFO:tensorflow:Saving checkpoints for 100 into taxi_trained/model.ckpt.
INFO:tensorflow:Loss for final step: 65201.15.


<h3> Evaluate model </h3>

As before, evaluate on the validation data.  We'll do the third refactoring (to move the evaluation into the training loop) in the next lab.

In [7]:
def print_rmse(model, name, input_fn):
  metrics = model.evaluate(input_fn=input_fn, steps=1)
  print('RMSE on {} dataset = {}'.format(name, np.sqrt(metrics['average_loss'])))
print_rmse(model, 'validation', get_valid)

INFO:tensorflow:Starting evaluation at 2018-02-02-16:47:44
INFO:tensorflow:Restoring parameters from taxi_trained/model.ckpt-100
INFO:tensorflow:Evaluation [1/1]
INFO:tensorflow:Finished evaluation at 2018-02-02-16:47:44
INFO:tensorflow:Saving dict for global step 100: average_loss = 126.530785, global_step = 100, loss = 64783.76
RMSE on validation dataset = 11.2485904694
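
The logged numbers can be cross-checked with a little arithmetic. The sketch below assumes that `loss` is the sum of squared errors over the evaluated batch and `average_loss` is the per-example mean; this is consistent with the values in the log but is stated here as an assumption.

```python
import math

# Values copied from the evaluation log above.
loss = 64783.76            # reported loss for the evaluated batch
average_loss = 126.530785  # reported per-example loss

rmse = math.sqrt(average_loss)
examples_evaluated = int(round(loss / average_loss))
```

Under that assumption the ratio is 512, the default batch_size, so with steps=1 the reported RMSE reflects a single batch of validation data and not the full file.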


Copyright 2017 Google Inc. Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.