# Introducing tf.estimator

<img src='assets/TFHierarchy.png' width='50%'>
<sup>(image: https://www.tensorflow.org/guide/premade_estimators)</sup>

TensorFlow is a hierarchical framework. The further down the hierarchy you go, the more flexibility you have, but the more code you have to write. Generally you start at the highest level of abstraction, then drop down one layer if you need additional flexibility.

In this notebook we will operate at the highest level of TensorFlow abstraction, using the Estimator API to predict taxi fares on the sampled dataset we created previously.

In [1]:
import tensorflow as tf
import pandas as pd
import shutil

print(tf.__version__)



1.12.0


## 1) Load Raw Data 

We'll load the .csv data created in the previous notebook. Because the files are small we can load them into in-memory Pandas dataframes.

In [12]:
CSV_COLUMN_NAMES = ['fare_amount', 'pickuplon','pickuplat','dropofflon','dropofflat','passengers', 'key']
FEATURE_NAMES = CSV_COLUMN_NAMES[1:-1] # all but first and last columns
LABEL_NAME = CSV_COLUMN_NAMES[0] # first column

df_train = pd.read_csv('./taxi-train.csv', header = None, names = CSV_COLUMN_NAMES)
df_valid = pd.read_csv('./taxi-valid.csv', header = None, names = CSV_COLUMN_NAMES)
df_test = pd.read_csv('./taxi-test.csv', header = None, names = CSV_COLUMN_NAMES)
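To make the column bookkeeping concrete, here is a minimal stdlib-only sketch of the same idea: reading a headerless CSV and attaching column names afterwards. The two sample rows are invented for illustration and do not come from the taxi files; `sample_csv` and `rows` are hypothetical names.

```python
import csv
import io

CSV_COLUMN_NAMES = ['fare_amount', 'pickuplon', 'pickuplat',
                    'dropofflon', 'dropofflat', 'passengers', 'key']
FEATURE_NAMES = CSV_COLUMN_NAMES[1:-1]  # all but first and last columns
LABEL_NAME = CSV_COLUMN_NAMES[0]        # first column

# Hypothetical rows in the same headerless layout (made-up values).
sample_csv = io.StringIO(
    "12.0,-73.99,40.75,-73.98,40.76,1.0,a\n"
    "7.5,-73.97,40.76,-73.95,40.78,2.0,b\n"
)

# DictReader with explicit fieldnames plays a role similar to
# pd.read_csv(..., header=None, names=CSV_COLUMN_NAMES).
rows = list(csv.DictReader(sample_csv, fieldnames=CSV_COLUMN_NAMES))

print(FEATURE_NAMES)
print(rows[0][LABEL_NAME])
```

Unlike Pandas, `csv.DictReader` leaves every value as a string, so a real pipeline would still need to convert the numeric columns.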

## 2) Create Feature Columns

Feature columns make it easy to perform common types of feature engineering on your raw data. For example, you can one-hot encode categorical data, create feature crosses, embeddings and more. We'll cover these later in the course, but if you want a sneak peek, browse the official TensorFlow [feature columns guide](https://www.tensorflow.org/guide/feature_columns).

In our case we won't do any feature engineering. However, we still need to create a list of feature columns because the Estimator we will use requires one. To specify that the numeric values should be passed on without modification, we use `tf.feature_column.numeric_column()`.

We use a [python list comprehension](https://www.pythonforbeginners.com/basics/list-comprehensions-in-python) to create the list of feature columns, which is just an elegant alternative to a for loop.

In [3]:
feature_columns = [tf.feature_column.numeric_column(k) for k in FEATURE_NAMES]
feature_columns

[_NumericColumn(key='pickuplon', shape=(1,), default_value=None, dtype=tf.float32, normalizer_fn=None),
 _NumericColumn(key='pickuplat', shape=(1,), default_value=None, dtype=tf.float32, normalizer_fn=None),
 _NumericColumn(key='dropofflon', shape=(1,), default_value=None, dtype=tf.float32, normalizer_fn=None),
 _NumericColumn(key='dropofflat', shape=(1,), default_value=None, dtype=tf.float32, normalizer_fn=None),
 _NumericColumn(key='passengers', shape=(1,), default_value=None, dtype=tf.float32, normalizer_fn=None)]
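For readers new to list comprehensions, the sketch below shows that the comprehension above is equivalent to an explicit loop. `NumericColumn` is a hypothetical stand-in for `tf.feature_column.numeric_column`, used only so the example runs without TensorFlow installed.

```python
from collections import namedtuple

# Hypothetical stand-in for tf.feature_column.numeric_column.
NumericColumn = namedtuple('NumericColumn', ['key'])

FEATURE_NAMES = ['pickuplon', 'pickuplat', 'dropofflon', 'dropofflat', 'passengers']

# List comprehension, in the same shape as the cell above.
columns_comprehension = [NumericColumn(k) for k in FEATURE_NAMES]

# Equivalent explicit loop.
columns_loop = []
for k in FEATURE_NAMES:
    columns_loop.append(NumericColumn(k))

print(columns_comprehension == columns_loop)
```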

## 3) Define Input Function

Now that your estimator knows what type of data to expect and how to interpret it, you need to actually pass the data to it! This is the job of the input function.

The input function returns a `tf.data.Dataset` that yields a new batch of (features, label) tuples each time the Estimator requests one.

- features: A Python dictionary. Each key is a feature column name and its value is the tensor containing the data for that feature.
- label: A Tensor containing the labels.
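As a rough illustration of that structure, the sketch below builds a (features, label) pair from row-oriented records using plain Python lists where the Estimator would receive tensors. The helper `rows_to_batch` and the sample values are hypothetical.

```python
def rows_to_batch(rows, feature_names, label_name):
    """Turn row-oriented records into a (features, label) pair."""
    features = {name: [row[name] for row in rows] for name in feature_names}
    label = [row[label_name] for row in rows]
    return features, label

# Hypothetical trips with made-up values (only two features for brevity).
rows = [
    {'fare_amount': 12.0, 'pickuplon': -73.99, 'passengers': 1.0},
    {'fare_amount': 7.5, 'pickuplon': -73.97, 'passengers': 2.0},
]

features, label = rows_to_batch(rows, ['pickuplon', 'passengers'], 'fare_amount')
print(features)
print(label)
```

Each feature holds one value per example, and the label list has the same length as every feature list.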

So how do we get from our current Pandas dataframes to (features, label) tuples that return one batch at a time?

The `tf.data` module contains a collection of classes that allow you to easily load data, manipulate it, and pipe it into your model. For details, see https://www.tensorflow.org/guide/datasets_for_estimators

In [4]:
def train_input_fn(df, batch_size=128):
    #1. Convert dataframe into correct (features,label) format for Estimator API
    dataset = tf.data.Dataset.from_tensor_slices((dict(df[FEATURE_NAMES]), df[LABEL_NAME]))
    
    # Note:
    # If we returned now, the Dataset would iterate over the data once  
    # in a fixed order, and only produce a single element at a time.
    
    #2. Shuffle, repeat, and batch the examples.
    dataset = dataset.shuffle(1000).repeat().batch(batch_size)
   
    return dataset

def eval_input_fn(df, batch_size=128):
    #1. Convert dataframe into correct (features,label) format for Estimator API
    dataset = tf.data.Dataset.from_tensor_slices((dict(df[FEATURE_NAMES]), df[LABEL_NAME]))

    #2.Batch the examples.
    dataset = dataset.batch(batch_size)
   
    return dataset

def predict_input_fn(df, batch_size=128):
    #1. Convert dataframe into correct (features) format for Estimator API
    dataset = tf.data.Dataset.from_tensor_slices(dict(df[FEATURE_NAMES])) # no label

    #2.Batch the examples.
    dataset = dataset.batch(batch_size)
   
    return dataset
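To build intuition for what `shuffle().repeat().batch()` does in the training input function, here is a pure-Python sketch of the same pipeline. It is a simplification rather than the `tf.data` implementation: it reshuffles the whole dataset on each pass, whereas `tf.data` shuffles within a fixed-size buffer. The function name `shuffle_repeat_batch` is hypothetical.

```python
import itertools
import random

def shuffle_repeat_batch(examples, batch_size, seed=0):
    """Yield batches forever, reshuffling the examples on every pass."""
    rng = random.Random(seed)

    def endless_stream():
        while True:                 # repeat(): never run out of data
            epoch = list(examples)
            rng.shuffle(epoch)      # shuffle(): new order on each pass
            yield from epoch

    stream = endless_stream()
    while True:                     # batch(): group consecutive items
        yield list(itertools.islice(stream, batch_size))

batches = shuffle_repeat_batch(range(6), batch_size=4)
first_three = list(itertools.islice(batches, 3))
print(first_three)
```

Because the stream never ends, the caller decides when to stop, which is why training has to be given an explicit number of steps. Note also that a batch can straddle two passes over the data when the batch size does not divide the dataset size.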

## 4) Choose Estimator

TensorFlow has several premade estimators for you to choose from:

- LinearClassifier/Regressor
- BoostedTreesClassifier/Regressor
- DNNClassifier/Regressor
- DNNLinearCombinedClassifier/Regressor

If none of these meet your needs, you can implement a custom estimator using `tf.keras`. We'll cover that later in the course.

For now we will use the premade LinearRegressor. To instantiate an estimator, simply pass it the feature columns to expect and specify a directory for its checkpoint files.

In [13]:
OUTDIR = 'taxi_trained'

model = tf.estimator.LinearRegressor(
    feature_columns=feature_columns,
    model_dir = OUTDIR,
    config = tf.estimator.RunConfig(tf_random_seed=1) # for reproducibility
)

INFO:tensorflow:Using config: {'_evaluation_master': '', '_train_distribute': None, '_task_type': 'worker', '_model_dir': 'taxi_trained', '_keep_checkpoint_max': 5, '_session_config': allow_soft_placement: true
graph_options {
  rewrite_options {
    meta_optimizer_iterations: ONE
  }
}
, '_device_fn': None, '_num_ps_replicas': 0, '_save_summary_steps': 100, '_experimental_distribute': None, '_task_id': 0, '_log_step_count_steps': 100, '_is_chief': True, '_save_checkpoints_secs': 600, '_tf_random_seed': 1, '_master': '', '_eval_distribute': None, '_protocol': None, '_save_checkpoints_steps': None, '_num_worker_replicas': 1, '_keep_checkpoint_every_n_hours': 10000, '_global_id_in_cluster': 0, '_service': None, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7f76e6ea66a0>}


## 5) Train

Simply invoke the estimator's `train()` function. Specify the `input_fn` which tells it how to load in data, and specify the number of steps to train for.

By default, estimators check the output directory for checkpoint files before beginning training, so they can pick up where they left off. To prevent this, we'll delete the output directory before each training run.

In [16]:
%%time
tf.logging.set_verbosity(tf.logging.INFO) # so loss is printed during training
shutil.rmtree(OUTDIR, ignore_errors = True) # start fresh each time

model.train(
    input_fn=lambda:train_input_fn(df_train), 
    steps=500)

INFO:tensorflow:Calling model_fn.
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Create CheckpointSaverHook.
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Saving checkpoints for 0 into taxi_trained/model.ckpt.
INFO:tensorflow:loss = 22609.832, step = 1
INFO:tensorflow:global_step/sec: 176.337
INFO:tensorflow:loss = 8806.908, step = 101 (0.569 sec)
INFO:tensorflow:global_step/sec: 160.321
INFO:tensorflow:loss = 10165.997, step = 201 (0.624 sec)
INFO:tensorflow:global_step/sec: 178.758
INFO:tensorflow:loss = 7250.997, step = 301 (0.559 sec)
INFO:tensorflow:global_step/sec: 177.966
INFO:tensorflow:loss = 9367.691, step = 401 (0.562 sec)
INFO:tensorflow:Saving checkpoints for 500 into taxi_trained/model.ckpt.
INFO:tensorflow:Loss for final step: 11392.97.
CPU times: user 3.49 s, sys: 1.5 s, total: 4.99 s
Wall time: 4.59 s
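The effect of deleting the output directory can be sketched without TensorFlow. The toy `train` function below is hypothetical: it only records a step counter in a JSON file, but it resumes from an existing checkpoint in the same way the text describes, so a second call continues from step 500 unless the directory is removed first.

```python
import json
import os
import shutil
import tempfile

def train(model_dir, steps):
    """Hypothetical trainer: resumes from the step count saved in model_dir."""
    os.makedirs(model_dir, exist_ok=True)
    ckpt_path = os.path.join(model_dir, 'checkpoint.json')
    start = 0
    if os.path.exists(ckpt_path):       # resume if a checkpoint exists
        with open(ckpt_path) as f:
            start = json.load(f)['global_step']
    global_step = start + steps         # pretend to train for `steps` steps
    with open(ckpt_path, 'w') as f:
        json.dump({'global_step': global_step}, f)
    return global_step

outdir = os.path.join(tempfile.mkdtemp(), 'taxi_trained')
first_run = train(outdir, steps=500)    # no checkpoint yet, starts at 0
second_run = train(outdir, steps=500)   # finds the checkpoint, resumes at 500
shutil.rmtree(outdir, ignore_errors=True)
fresh_run = train(outdir, steps=500)    # directory deleted, starts at 0 again
print(first_run, second_run, fresh_run)
```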


## 6) Evaluate

Estimators similarly have an `evaluate()` function. In this case we don't need to specify the number of steps because we didn't tell our input function to repeat the data. Once the input function reaches the end of the data, evaluation will end.

Loss is reported as MSE by default so we take the square root before printing.
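As a reminder of how the two metrics relate, the following sketch computes MSE and RMSE by hand on a few made-up label and prediction values (they are illustrative only, not taken from the taxi data).

```python
import math

def rmse(labels, predictions):
    """Root mean squared error: the square root of the mean squared error."""
    squared_errors = [(y - p) ** 2 for y, p in zip(labels, predictions)]
    mse = sum(squared_errors) / len(squared_errors)
    return math.sqrt(mse)

# Made-up values for illustration.
labels = [10.0, 12.0, 8.0]
predictions = [11.0, 12.0, 6.0]
print(rmse(labels, predictions))
```

Taking the square root puts the error back in the units of the label (dollars here), which makes it easier to interpret than MSE.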

In [7]:
def print_rmse(model, df):
  metrics = model.evaluate(input_fn=lambda:eval_input_fn(df))
  print('RMSE on dataset = {}'.format(metrics['average_loss']**.5))
print_rmse(model, df_valid)

INFO:tensorflow:Calling model_fn.
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Starting evaluation at 2018-12-21-15:56:27
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Restoring parameters from taxi_trained/model.ckpt-500
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Finished evaluation at 2018-12-21-15:56:27
INFO:tensorflow:Saving dict for global step 500: average_loss = 109.09749, global_step = 500, label/mean = 11.666427, loss = 12974.808, prediction/mean = 12.183354
INFO:tensorflow:Saving 'checkpoint_path' summary for global step 500: taxi_trained/model.ckpt-500
RMSE on dataset = 10.44497431319581


An RMSE of 10.44 is far worse than our benchmark (RMSE of $6 or so on this data). However, given that we haven't done any feature engineering or hyperparameter tuning, and we're training on a small dataset using a simple linear model, we shouldn't yet expect good performance.

The goal at this point is to demonstrate the mechanics of the Estimator API. In subsequent notebooks we'll improve on the model.

## 7) Predict

In [8]:
predictions = model.predict(input_fn=lambda:predict_input_fn(df_test))
for items in predictions:
  print(items)

INFO:tensorflow:Calling model_fn.
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Restoring parameters from taxi_trained/model.ckpt-500
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
{'predictions': array([12.134162], dtype=float32)}
{'predictions': array([12.131337], dtype=float32)}
{'predictions': array([12.132298], dtype=float32)}
{'predictions': array([12.12957], dtype=float32)}
{'predictions': array([12.133875], dtype=float32)}
{'predictions': array([12.133627], dtype=float32)}
{'predictions': array([12.132052], dtype=float32)}
{'predictions': array([12.132091], dtype=float32)}
{'predictions': array([12.134051], dtype=float32)}
{'predictions': array([12.131647], dtype=float32)}
{'predictions': array([12.134162], dtype=float32)}
{'predictions': array([12.134347], dtype=float32)}
{'predictions': array([12.126886], dtype=float32)}
{'predictions': array([12.131234], dtype=float32)}
...

{'predictions': array([12.126996], dtype=float32)}
{'predictions': array([12.276859], dtype=float32)}
{'predictions': array([12.129967], dtype=float32)}
{'predictions': array([12.133788], dtype=float32)}
{'predictions': array([12.132904], dtype=float32)}
{'predictions': array([12.205458], dtype=float32)}
{'predictions': array([12.276597], dtype=float32)}
{'predictions': array([12.13229], dtype=float32)}
{'predictions': array([12.350427], dtype=float32)}
{'predictions': array([12.202695], dtype=float32)}
{'predictions': array([12.132523], dtype=float32)}
{'predictions': array([12.129102], dtype=float32)}
{'predictions': array([12.11675], dtype=float32)}
{'predictions': array([12.131047], dtype=float32)}
{'predictions': array([12.205647], dtype=float32)}
{'predictions': array([12.132571], dtype=float32)}
{'predictions': array([12.419511], dtype=float32)}
{'predictions': array([12.132815], dtype=float32)}
{'predictions': array([12.133749], dtype=float32)}
...

{'predictions': array([12.132174], dtype=float32)}
{'predictions': array([12.131079], dtype=float32)}
{'predictions': array([12.276454], dtype=float32)}
{'predictions': array([12.132441], dtype=float32)}
{'predictions': array([12.132125], dtype=float32)}
{'predictions': array([12.421764], dtype=float32)}
{'predictions': array([12.130813], dtype=float32)}
{'predictions': array([12.204072], dtype=float32)}
{'predictions': array([12.203921], dtype=float32)}
{'predictions': array([12.133751], dtype=float32)}
{'predictions': array([12.134824], dtype=float32)}
{'predictions': array([12.421853], dtype=float32)}
{'predictions': array([12.131507], dtype=float32)}
{'predictions': array([12.127681], dtype=float32)}
{'predictions': array([12.420049], dtype=float32)}
{'predictions': array([12.421165], dtype=float32)}
{'predictions': array([12.131564], dtype=float32)}
{'predictions': array([12.203569], dtype=float32)}
{'predictions': array([12.128726], dtype=float32)}
...

{'predictions': array([12.419729], dtype=float32)}
{'predictions': array([12.130803], dtype=float32)}
{'predictions': array([12.1338215], dtype=float32)}
{'predictions': array([12.204914], dtype=float32)}
{'predictions': array([12.131798], dtype=float32)}
{'predictions': array([12.132126], dtype=float32)}
{'predictions': array([12.131085], dtype=float32)}
{'predictions': array([12.131921], dtype=float32)}
{'predictions': array([12.13165], dtype=float32)}
{'predictions': array([12.1338625], dtype=float32)}
{'predictions': array([12.129304], dtype=float32)}
{'predictions': array([12.131766], dtype=float32)}
{'predictions': array([12.4234], dtype=float32)}
{'predictions': array([12.131942], dtype=float32)}
{'predictions': array([12.131811], dtype=float32)}
{'predictions': array([12.132921], dtype=float32)}
{'predictions': array([12.272953], dtype=float32)}
{'predictions': array([12.117051], dtype=float32)}
{'predictions': array([12.494161], dtype=float32)}
...

{'predictions': array([12.132079], dtype=float32)}
{'predictions': array([12.133135], dtype=float32)}
{'predictions': array([12.131152], dtype=float32)}
{'predictions': array([12.1310425], dtype=float32)}
{'predictions': array([12.202091], dtype=float32)}
{'predictions': array([12.131085], dtype=float32)}
{'predictions': array([12.493785], dtype=float32)}
{'predictions': array([12.131851], dtype=float32)}
{'predictions': array([12.132384], dtype=float32)}
{'predictions': array([12.13391], dtype=float32)}
{'predictions': array([12.419852], dtype=float32)}
{'predictions': array([12.419988], dtype=float32)}
{'predictions': array([12.4063225], dtype=float32)}
{'predictions': array([12.277951], dtype=float32)}
{'predictions': array([12.275328], dtype=float32)}
{'predictions': array([12.127944], dtype=float32)}
{'predictions': array([12.422157], dtype=float32)}
{'predictions': array([12.204749], dtype=float32)}
{'predictions': array([12.203633], dtype=float32)}
...

{'predictions': array([12.275193], dtype=float32)}
{'predictions': array([12.132175], dtype=float32)}
{'predictions': array([12.205043], dtype=float32)}
{'predictions': array([12.133215], dtype=float32)}
{'predictions': array([12.127126], dtype=float32)}
{'predictions': array([12.132367], dtype=float32)}
{'predictions': array([12.133446], dtype=float32)}
{'predictions': array([12.13137], dtype=float32)}
{'predictions': array([12.417905], dtype=float32)}
{'predictions': array([12.204135], dtype=float32)}
{'predictions': array([12.202544], dtype=float32)}
{'predictions': array([12.132459], dtype=float32)}
{'predictions': array([12.135127], dtype=float32)}
{'predictions': array([12.131594], dtype=float32)}
{'predictions': array([12.132366], dtype=float32)}
{'predictions': array([12.132615], dtype=float32)}
{'predictions': array([12.133183], dtype=float32)}
{'predictions': array([12.279485], dtype=float32)}
{'predictions': array([12.1313925], dtype=float32)}
...

{'predictions': array([12.413527], dtype=float32)}
{'predictions': array([12.276788], dtype=float32)}
{'predictions': array([12.131474], dtype=float32)}
{'predictions': array([12.133037], dtype=float32)}
{'predictions': array([12.274988], dtype=float32)}
{'predictions': array([12.132064], dtype=float32)}
{'predictions': array([12.42094], dtype=float32)}
{'predictions': array([12.132024], dtype=float32)}
{'predictions': array([12.1316595], dtype=float32)}
{'predictions': array([12.205003], dtype=float32)}
{'predictions': array([12.492576], dtype=float32)}
{'predictions': array([12.130113], dtype=float32)}
{'predictions': array([12.129158], dtype=float32)}
{'predictions': array([12.19728], dtype=float32)}
{'predictions': array([12.130149], dtype=float32)}
{'predictions': array([12.13154], dtype=float32)}
{'predictions': array([12.204261], dtype=float32)}
{'predictions': array([12.419963], dtype=float32)}
{'predictions': array([12.271479], dtype=float32)}
...

Further evidence of how primitive our model is: it predicts almost the same amount for every trip!
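One way to quantify this is to look at the spread of the predictions. The sketch below uses a handful of values copied from the output above, deliberately including some of the largest ones, and measures their range and standard deviation.

```python
import statistics

# A handful of predictions copied from the output above.
sample_predictions = [12.134162, 12.131337, 12.132298, 12.12957,
                      12.276859, 12.205458, 12.419511, 12.494161]

spread = max(sample_predictions) - min(sample_predictions)
stdev = statistics.pstdev(sample_predictions)
print('range = {:.3f}, std dev = {:.3f}'.format(spread, stdev))
```

Even across this sample the predictions differ by well under a dollar, so the model is barely using its inputs.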

## 8) Change Estimator Type

One of the payoffs of using the Estimator API is that we can swap in a different model type with just a few lines of code. Let's try a DNN. Note that we now need to specify the number of neurons in each hidden layer.
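To get a feel for what the hidden layer sizes imply, the sketch below counts trainable parameters for a fully connected network. It assumes plain dense layers with biases and a single output unit, which is a simplification and may not match every detail of `DNNRegressor`. The helper `count_dense_parameters` is hypothetical.

```python
def count_dense_parameters(n_inputs, hidden_units, n_outputs=1):
    """Count weights and biases in a stack of fully connected layers."""
    layer_sizes = [n_inputs] + list(hidden_units) + [n_outputs]
    total = 0
    for fan_in, fan_out in zip(layer_sizes[:-1], layer_sizes[1:]):
        total += fan_in * fan_out + fan_out  # weights + biases
    return total

# Five numeric features feeding hidden layers of 32, 8 and 2 neurons.
dnn_params = count_dense_parameters(n_inputs=5, hidden_units=[32, 8, 2])
# No hidden layers at all corresponds to a plain linear model.
linear_params = count_dense_parameters(n_inputs=5, hidden_units=[])
print(dnn_params, linear_params)
```

Under these assumptions the network has far more parameters than the linear model, but extra capacity only helps if the inputs carry signal.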

In [9]:
%%time
tf.logging.set_verbosity(tf.logging.INFO)
shutil.rmtree(OUTDIR, ignore_errors = True) 
model = tf.estimator.DNNRegressor(
    hidden_units = [32, 8, 2], # specify neural architecture
    feature_columns = feature_columns, 
    model_dir = OUTDIR,
    config = tf.estimator.RunConfig(tf_random_seed=1)
)
model.train(
    input_fn=lambda:train_input_fn(df_train), 
    steps=500)
print_rmse(model, df_valid)

INFO:tensorflow:Using config: {'_evaluation_master': '', '_train_distribute': None, '_task_type': 'worker', '_model_dir': 'taxi_trained', '_keep_checkpoint_max': 5, '_session_config': allow_soft_placement: true
graph_options {
  rewrite_options {
    meta_optimizer_iterations: ONE
  }
}
, '_device_fn': None, '_num_ps_replicas': 0, '_save_summary_steps': 100, '_experimental_distribute': None, '_task_id': 0, '_log_step_count_steps': 100, '_is_chief': True, '_save_checkpoints_secs': 600, '_tf_random_seed': 1, '_master': '', '_eval_distribute': None, '_protocol': None, '_save_checkpoints_steps': None, '_num_worker_replicas': 1, '_keep_checkpoint_every_n_hours': 10000, '_global_id_in_cluster': 0, '_service': None, '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7f76f7b3de80>}
INFO:tensorflow:Calling model_fn.
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Create CheckpointSaverHook.
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Running local_init_op.
...

Our performance hasn't improved, which illustrates an important point in machine learning: a more complex model can't outrun bad data.

Currently, since we're not doing any feature engineering, our input data has very little signal to learn from, so using a DNN doesn't help.

## Benchmark dataset

Let's evaluate the model on the benchmark dataset.

In [10]:
import datalab.bigquery as bq

def create_query(phase, EVERY_N):
  #phase: 1 = train, 2 = valid
  base_query = """
  SELECT
      (tolls_amount + fare_amount) AS fare_amount,
      CONCAT(STRING(pickup_datetime), STRING(pickup_longitude), STRING(pickup_latitude), STRING(dropoff_latitude), STRING(dropoff_longitude)) AS key,
      DAYOFWEEK(pickup_datetime)*1.0 AS dayofweek,
      HOUR(pickup_datetime)*1.0 AS hourofday,
      pickup_longitude AS pickuplon,
      pickup_latitude AS pickuplat,
      dropoff_longitude AS dropofflon,
      dropoff_latitude AS dropofflat,
      passenger_count*1.0 AS passengers,
  FROM
      [nyc-tlc:yellow.trips]
  WHERE
      trip_distance > 0
      AND fare_amount >= 2.5
      AND pickup_longitude > -78
      AND pickup_longitude < -70
      AND dropoff_longitude > -78
      AND dropoff_longitude < -70
      AND pickup_latitude > 37
      AND pickup_latitude < 45
      AND dropoff_latitude > 37
      AND dropoff_latitude < 45
      AND passenger_count > 0
  """
    
  if EVERY_N is None:
    if phase < 2:
      # Training
      query = "{0} AND ABS(HASH(pickup_datetime)) % 4 < 2".format(base_query)
    else:
      # Validation
      query = "{0} AND ABS(HASH(pickup_datetime)) % 4 == {1}".format(base_query, phase)
  else:
    query = "{0} AND ABS(HASH(pickup_datetime)) % {1} == {2}".format(base_query, EVERY_N, phase)
    
  return query

query = create_query(2, 100000)
df = bq.Query(query).to_dataframe()
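The `ABS(HASH(pickup_datetime)) % N` clauses in the query implement a deterministic, repeatable split: each row lands in the same bucket every time the query runs. The sketch below mimics the idea in plain Python. It uses MD5 from `hashlib` as a stand-in hash, which is an assumption for illustration and will not reproduce BigQuery's bucket assignments; the timestamps are made up.

```python
import hashlib

def bucket(key, n_buckets):
    """Deterministically map a string key to a bucket in [0, n_buckets)."""
    digest = hashlib.md5(key.encode('utf-8')).hexdigest()
    return int(digest, 16) % n_buckets

# Hypothetical pickup timestamps standing in for pickup_datetime values.
keys = ['2014-01-01 00:{:02d}:00'.format(m) for m in range(60)]

train = [k for k in keys if bucket(k, 4) < 2]    # buckets 0 and 1
valid = [k for k in keys if bucket(k, 4) == 2]   # bucket 2
other = [k for k in keys if bucket(k, 4) == 3]   # bucket 3, unused here

print(len(train), len(valid), len(other))
```

Because the bucket depends only on the key, rows never leak between the training and validation sets across runs. Setting a large modulus, as `create_query(2, 100000)` does, keeps only a small sample of the rows.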

In [11]:
print_rmse(model, df)

INFO:tensorflow:Calling model_fn.
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Starting evaluation at 2018-12-21-15:56:43
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Restoring parameters from taxi_trained/model.ckpt-500
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Finished evaluation at 2018-12-21-15:56:43
INFO:tensorflow:Saving dict for global step 500: average_loss = 89.35371, global_step = 500, label/mean = 11.333685, loss = 11357.085, prediction/mean = 12.227922
INFO:tensorflow:Saving 'checkpoint_path' summary for global step 500: taxi_trained/model.ckpt-500
RMSE on dataset = 9.452708942936056


The RMSE on the benchmark dataset is <b>9.45</b>.

Not only is this much worse than our original benchmark of 6.00, it doesn't even beat our distance-based rule's RMSE of 8.02.

Fear not: you have learned how to write a TensorFlow model, but not yet how to do all the things needed to make your ML model performant. We will cover these in subsequent notebooks.

## Challenge Exercise

Create a neural network that is capable of finding the volume of a cylinder given the radius of its base (r) and its height (h). Assume that the radius and height of the cylinder are both in the range 0.5 to 2.0. Simulate the necessary training dataset.
<p>
Hint (highlight to see):
<p style='color:white'>
The input features will be r and h, and the label will be $\pi r^2 h$.
Create random values for r and h and compute V.
Your dataset will consist of r, h and V.
Then, use a DNN regressor.
Make sure to generate enough data.
</p>
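As a possible starting point for the data simulation step only, here is a minimal stdlib sketch that samples radii and heights uniformly from the stated range and computes the volume. The function name `simulate_cylinders`, the dictionary keys, and the sample count are arbitrary choices for illustration; training the network is left to you.

```python
import math
import random

def simulate_cylinders(n, seed=0):
    """Sample n cylinders with r and h drawn uniformly from [0.5, 2.0]."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        r = rng.uniform(0.5, 2.0)
        h = rng.uniform(0.5, 2.0)
        rows.append({'r': r, 'h': h, 'volume': math.pi * r ** 2 * h})
    return rows

dataset = simulate_cylinders(1000)
print(dataset[0])
```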

Copyright 2019 Google Inc. Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License