<h1> 2c. Loading large datasets progressively with the tf.data.Dataset </h1>

In this notebook, we continue reading the same small dataset, but refactor our ML pipeline in two small, but significant, ways:
<ol>
<li> Refactor the input to read data from disk progressively.
<li> Refactor the feature creation so that it is not one-to-one with inputs.
</ol>
<br/>
The Pandas function in the previous notebook first read the whole data into memory -- on a large dataset, this won't be an option.

In [1]:
import datalab.bigquery as bq
import tensorflow as tf
import numpy as np
import shutil
print(tf.__version__)

  from ._conv import register_converters as _register_converters


1.8.0


<h2> 1. Refactor the input </h2>

Read data created in Lab1a, but this time make it more general, so that we can later handle large datasets. We use the Dataset API for this. It ensures that, as data gets delivered to the model in mini-batches, it is loaded from disk only when needed.

In [17]:
CSV_COLUMNS = ['fare_amount', 'pickuplon','pickuplat','dropofflon','dropofflat','passengers', 'key']
DEFAULTS = [[0.0], [-74.0], [40.0], [-74.0], [40.7], [1.0], ['nokey']]

def read_line(fileLine):
  cols = tf.decode_csv(fileLine, record_defaults=DEFAULTS)
  features = dict(zip(CSV_COLUMNS, cols))
  features.pop("key");
  label = features.pop("fare_amount")
  return features, label

# TODO: Create an appropriate input function read_dataset
def read_dataset(filename, mode):
    #TODO Add CSV decoder function and dataset creation and methods
    dataset = tf.data.TextLineDataset(filename) \
      .map(read_line)
    
    batch = 250;
    
    if(mode == tf.estimator.ModeKeys.TRAIN):
      num_epoch = None
      dataset.shuffle(buffer_size = 1000)
    else:
      num_epoch = 1
    return dataset.repeat(num_epoch).batch(batch)
  
def get_train_input_fn():
  return read_dataset('./taxi-train.csv', mode = tf.estimator.ModeKeys.TRAIN)

def get_valid_input_fn():
  return read_dataset('./taxi-valid.csv', mode = tf.estimator.ModeKeys.EVAL)

<h2> 2. Refactor the way features are created. </h2>

For now, pass these through (same as previous lab).  However, refactoring this way will enable us to break the one-to-one relationship between inputs and features.

In [3]:
INPUT_COLUMNS = [
    tf.feature_column.numeric_column('pickuplon'),
    tf.feature_column.numeric_column('pickuplat'),
    tf.feature_column.numeric_column('dropofflat'),
    tf.feature_column.numeric_column('dropofflon'),
    tf.feature_column.numeric_column('passengers'),
]

def add_more_features(feats):
  # Nothing to add (yet!)
  return feats

feature_cols = add_more_features(INPUT_COLUMNS)

<h2> Create and train the model </h2>

Note that we train for num_steps * batch_size examples.

In [18]:
tf.logging.set_verbosity(tf.logging.INFO)
OUTDIR = 'taxi_trained'
shutil.rmtree(OUTDIR, ignore_errors = True) # start fresh each time
model = tf.estimator.LinearRegressor(
      feature_columns = feature_cols, model_dir = OUTDIR)
model.train(input_fn = get_train_input_fn, steps = 2000)

INFO:tensorflow:Using default config.
INFO:tensorflow:Using config: {'_num_worker_replicas': 1, '_keep_checkpoint_every_n_hours': 10000, '_keep_checkpoint_max': 5, '_task_type': 'worker', '_save_summary_steps': 100, '_model_dir': 'taxi_trained', '_task_id': 0, '_save_checkpoints_steps': None, '_global_id_in_cluster': 0, '_is_chief': True, '_train_distribute': None, '_num_ps_replicas': 0, '_session_config': None, '_evaluation_master': '', '_service': None, '_master': '', '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7fd5d8142a90>, '_tf_random_seed': None, '_save_checkpoints_secs': 600, '_log_step_count_steps': 100}
INFO:tensorflow:Calling model_fn.
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Create CheckpointSaverHook.
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Saving checkpoints for 1 into taxi_trained/model.ckpt.
INFO:tensorflow:step = 1, loss = 548

<tensorflow.python.estimator.canned.linear.LinearRegressor at 0x7fd5d8142f98>

<h3> Evaluate model </h3>

As before, evaluate on the validation data.  We'll do the third refactoring (to move the evaluation into the training loop) in the next lab.

In [19]:
metrics = model.evaluate(input_fn = get_valid_input_fn, steps = None)
print('RMSE on dataset = {}'.format(np.sqrt(metrics['average_loss'])))

INFO:tensorflow:Calling model_fn.
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Starting evaluation at 2019-04-25-17:36:44
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Restoring parameters from taxi_trained/model.ckpt-2000
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Finished evaluation at 2019-04-25-17:36:44
INFO:tensorflow:Saving dict for global step 2000: average_loss = 108.8457, global_step = 2000, loss = 25889.729
RMSE on dataset = 10.432914733886719


## Challenge Exercise

Create a neural network that is capable of finding the volume of a cylinder given the radius of its base (r) and its height (h). Assume that the radius and height of the cylinder are both in the range 0.5 to 2.0. Unlike in the challenge exercise for b_estimator.ipynb, assume that your measurements of r, h and V are all rounded off to the nearest 0.1. Simulate the necessary training dataset. This time, you will need a lot more data to get a good predictor.

Hint (highlight to see):
<p style='color:white'>
Create random values for r and h and compute V. Then, round off r, h and V (i.e., the volume is computed from the true value of r and h; it's only your measurement that is rounded off). Your dataset will consist of the round values of r, h and V. Do this for both the training and evaluation datasets.
</p>

Now modify the "noise" so that instead of just rounding off the value, there is up to a 10% error (uniformly distributed) in the measurement followed by rounding off.

In [3]:
import random 
import tensorflow as tf
import shutil

def calc_volume(r, h):
  return 2*3.14*r*r*h

feature_cols = [tf.feature_column.numeric_column("radius"), tf.feature_column.numeric_column("height")];

def train_input_factory(limit):
  def train_input_fn():
    r = [0.5 + 0.01 * random.randint(1, 150) for i in range(limit)]
    h = [0.5 + 0.01 * random.randint(1, 150) for i in range(limit)]
    v = [ calc_volume(r1, h1) for r1, h1 in zip(r, h) ]

    r1 = [round(i, 1) for i in r];
    h1 = [round(i, 1) for i in h];
    
    features = { "radius": r, "height": h}
    return features, v
  return train_input_fn

# shutil.rmtree("DNNModel")
model = tf.estimator.DNNRegressor(feature_columns = feature_cols, 
                                   hidden_units = [6,16,13,2],
                                 model_dir = "DNNModel")

model.train(train_input_factory(100000), steps=1500);

INFO:tensorflow:Using default config.
INFO:tensorflow:Using config: {'_is_chief': True, '_evaluation_master': '', '_save_checkpoints_secs': 600, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7f9041ab84e0>, '_tf_random_seed': None, '_save_checkpoints_steps': None, '_keep_checkpoint_every_n_hours': 10000, '_global_id_in_cluster': 0, '_master': '', '_service': None, '_task_type': 'worker', '_log_step_count_steps': 100, '_keep_checkpoint_max': 5, '_save_summary_steps': 100, '_model_dir': 'DNNModel', '_num_worker_replicas': 1, '_num_ps_replicas': 0, '_train_distribute': None, '_session_config': None, '_task_id': 0}
INFO:tensorflow:Calling model_fn.
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Create CheckpointSaverHook.
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Restoring parameters from DNNModel/model.ckpt-15000
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Saving checkpoints for 1500

In [4]:
import numpy as np

metrics = model.evaluate(input_fn = train_input_factory(100), steps=1);
print(metrics)
print('RMSE on dataset = {}'.format(np.sqrt(metrics['average_loss'])))

INFO:tensorflow:Calling model_fn.
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Starting evaluation at 2019-04-25-18:32:52
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Restoring parameters from DNNModel/model.ckpt-16500
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Evaluation [1/1]
INFO:tensorflow:Finished evaluation at 2019-04-25-18:32:52
INFO:tensorflow:Saving dict for global step 16500: average_loss = 126.82414, global_step = 16500, loss = 12682.414
{'global_step': 16500, 'average_loss': 126.82414, 'loss': 12682.414}
RMSE on dataset = 11.261622428894043


Copyright 2017 Google Inc. Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License