<h1>2b. Machine Learning using tf.estimator </h1>

In this notebook, we will create a machine learning model using tf.estimator and evaluate its performance.  The dataset is rather small (7700 samples), so we can do it all in-memory.  We will also simply pass the raw data in as-is. 

In [1]:
import tensorflow as tf
import pandas as pd
import numpy as np
import shutil

print(tf.__version__)

  from ._conv import register_converters as _register_converters


1.8.0


Read data created in the previous chapter.

In [82]:
# In CSV, label is the first column, after the features, followed by the key
CSV_COLUMNS = ['fare_amount', 'pickuplon','pickuplat','dropofflon','dropofflat','passengers', 'key']
FEATURES = CSV_COLUMNS[1:len(CSV_COLUMNS) - 1]
LABEL = CSV_COLUMNS[0]

print(type(CSV_COLUMNS)) # list
print(type(FEATURES)) # list, from pickuplon to key
print(type(LABEL)) #string, fare_amount
print(LABEL) #fare_amount

df_train = pd.read_csv('./taxi-train.csv', header = None, names = CSV_COLUMNS)
df_valid = pd.read_csv('./taxi-valid.csv', header = None, names = CSV_COLUMNS)
df_test = pd.read_csv('./taxi-test.csv', header = None, names = CSV_COLUMNS)

print(df_train.dtypes)


<class 'list'>
<class 'list'>
<class 'str'>
fare_amount
fare_amount    float64
pickuplon      float64
pickuplat      float64
dropofflon     float64
dropofflat     float64
passengers       int64
key              int64
dtype: object


<h2> Train and eval input functions to read from Pandas Dataframe </h2>

In [38]:
# TODO: Create an appropriate input_fn to read the training data
def make_train_input_fn(df, num_epochs):
  return tf.estimator.inputs.pandas_input_fn(
    #ADD CODE HERE
    x = df,
    y = df[LABEL],
    batch_size = 128,
    num_epochs = num_epochs,
    shuffle = True,
    queue_capacity = 1000,
    num_threads = 1
  )

In [51]:
# TODO: Create an appropriate input_fn to read the validation data
def make_eval_input_fn(df):
  return tf.estimator.inputs.pandas_input_fn(
    #ADD CODE HERE
    x = df,
    y = df[LABEL],
    shuffle = True
  )

Our input function for predictions is the same except we don't provide a label

In [42]:
# TODO: Create an appropriate prediction_input_fn
def make_prediction_input_fn(df):
  return tf.estimator.inputs.pandas_input_fn(
    #ADD CODE HERE
    x = df
  )

### Create feature columns for estimator

In [39]:
# TODO: Create feature columns
#feature_columns = {'fare_amount', 'pickuplon','pickuplat','dropofflon','dropofflat','passengers', 'key'}

"""
feature_columns = [
  tf.feature_column.numeric_column("pickuplon"),
  tf.feature_column.numeric_column("pickuplat"),
  tf.feature_column.numeric_column("dropofflon"),
  tf.feature_column.numeric_column("dropofflat"),
  tf.feature_column.numeric_column("passengers"),
  tf.feature_column.numeric_column("key")]
"""

features_columns = [tf.feature_column.numeric_column(k) for k in FEATURES]


<h3> Linear Regression with tf.Estimator framework </h3>

In [59]:
tf.logging.set_verbosity(tf.logging.INFO)

OUTDIR = 'taxi_trained'
shutil.rmtree(OUTDIR, ignore_errors = True) # start fresh each time

# TODO: Train a linear regression model
model = tf.estimator.LinearRegressor(feature_columns, OUTDIR)

print(type(df_train))

model.train(make_train_input_fn(df_train, 10))

INFO:tensorflow:Using default config.
INFO:tensorflow:Using config: {'_task_type': 'worker', '_num_worker_replicas': 1, '_evaluation_master': '', '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7f05a44af550>, '_tf_random_seed': None, '_task_id': 0, '_is_chief': True, '_global_id_in_cluster': 0, '_master': '', '_model_dir': 'taxi_trained', '_save_checkpoints_steps': None, '_log_step_count_steps': 100, '_num_ps_replicas': 0, '_keep_checkpoint_every_n_hours': 10000, '_save_summary_steps': 100, '_session_config': None, '_keep_checkpoint_max': 5, '_service': None, '_train_distribute': None, '_save_checkpoints_secs': 600}
<class 'pandas.core.frame.DataFrame'>
INFO:tensorflow:Calling model_fn.
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Create CheckpointSaverHook.
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Saving checkpoints for 1 into taxi_trained/model.ckpt

<tensorflow.python.estimator.canned.linear.LinearRegressor at 0x7f05a4c44ac8>

Evaluate on the validation data (we should defer using the test data to after we have selected a final model).

In [62]:
def print_rmse(model, df):
  metrics = model.evaluate(input_fn = make_eval_input_fn(df))# dfには、df_validが渡されている
  print('RMSE on dataset = {}'.format(np.sqrt(metrics['average_loss'])))
  
print_rmse(model, df_valid)

INFO:tensorflow:Calling model_fn.
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Starting evaluation at 2020-05-21-15:39:40
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Restoring parameters from taxi_trained/model.ckpt-608
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Finished evaluation at 2020-05-21-15:39:40
INFO:tensorflow:Saving dict for global step 608: average_loss = 109.50922, global_step = 608, loss = 13023.774
RMSE on dataset = 10.464665412902832


This is nowhere near our benchmark (RMSE of $6 or so on this data), but it serves to demonstrate what TensorFlow code looks like.  Let's use this model for prediction.

In [61]:
# TODO: Predict from the estimator model we trained using test dataset
# 要は、googleの方でtrain済みのmodelをここで読み込むらしい

import itertools
model = tf.estimator.LinearRegressor(feature_columns, OUTDIR)
preds_iter = model.predict(make_eval_input_fn(df_valid))
print([pred["predictions"][0] for pred in list(itertools.islice(preds_iter, 5))])


INFO:tensorflow:Using default config.
INFO:tensorflow:Using config: {'_task_type': 'worker', '_num_worker_replicas': 1, '_evaluation_master': '', '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7f05a4d8e198>, '_tf_random_seed': None, '_task_id': 0, '_is_chief': True, '_global_id_in_cluster': 0, '_master': '', '_model_dir': 'taxi_trained', '_save_checkpoints_steps': None, '_log_step_count_steps': 100, '_num_ps_replicas': 0, '_keep_checkpoint_every_n_hours': 10000, '_save_summary_steps': 100, '_session_config': None, '_keep_checkpoint_max': 5, '_service': None, '_train_distribute': None, '_save_checkpoints_secs': 600}
INFO:tensorflow:Calling model_fn.
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Restoring parameters from taxi_trained/model.ckpt-608
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
[10.795773, 10.951951, 10.844597, 10.781398, 10.793583]


This explains why the RMSE was so high -- the model essentially predicts the same amount for every trip.  Would a more complex model help? Let's try using a deep neural network.  The code to do this is quite straightforward as well.

<h3> Deep Neural Network regression </h3>

In [76]:
# TODO: Copy your LinearRegressor estimator and replace with DNNRegressor. 
# Remember to add a list of hidden units i.e. [32, 8, 2]


tf.logging.set_verbosity(tf.logging.INFO)

OUTDIR = 'taxi_trained'
shutil.rmtree(OUTDIR, ignore_errors = True) # start fresh each time

## DNN Regressorを今回は利用する。
model = tf.estimator.DNNRegressor(feature_columns = feature_columns,
                                  hidden_units = [32, 8, 2],
                                    activation_fn = tf.nn.relu,
                                    dropout = 0.2,
                                    optimizer="Adam"
                                    )

print(type(df_train))

model.train(make_train_input_fn(df_train, 100))

print("print_rmse\n")
print_rmse(model, df_valid)


INFO:tensorflow:Using default config.
INFO:tensorflow:Using config: {'_task_type': 'worker', '_num_worker_replicas': 1, '_evaluation_master': '', '_cluster_spec': <tensorflow.python.training.server_lib.ClusterSpec object at 0x7f05c6566cf8>, '_tf_random_seed': None, '_task_id': 0, '_is_chief': True, '_global_id_in_cluster': 0, '_master': '', '_model_dir': '/tmp/tmpg8psl4vb', '_save_checkpoints_steps': None, '_log_step_count_steps': 100, '_num_ps_replicas': 0, '_keep_checkpoint_every_n_hours': 10000, '_save_summary_steps': 100, '_session_config': None, '_keep_checkpoint_max': 5, '_service': None, '_train_distribute': None, '_save_checkpoints_secs': 600}
<class 'pandas.core.frame.DataFrame'>
INFO:tensorflow:Calling model_fn.
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Create CheckpointSaverHook.
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Saving checkpoints for 1 into /tmp/tmpg8psl4vb/mo

We are not beating our benchmark with either model ... what's up?  Well, we may be using TensorFlow for Machine Learning, but we are not yet using it well.  That's what the rest of this course is about!

But, for the record, let's say we had to choose between the two models. We'd choose the one with the lower validation error. Finally, we'd measure the RMSE on the test data with this chosen model.


どちらのモデルでもベンチマークを打ち負かしていない...どうしたの？ 機械学習にTensorFlowを使用している可能性がありますが、まだ十分に使用していません。 それがこのコースの残りの部分です。

ただし、記録のために、2つのモデルから選択する必要があったとします。 検証エラーの少ない方を選択します。 最後に、この選択したモデルを使用して、テストデータのRMSEを測定します。



<h2> Benchmark dataset </h2>

Let's do this on the benchmark dataset.

In [79]:
from google.cloud import bigquery
import numpy as np
import pandas as pd

def create_query(phase, EVERY_N):
  """
  phase: 1 = train 2 = valid
  """
  base_query = """
SELECT
  (tolls_amount + fare_amount) AS fare_amount,
  EXTRACT(DAYOFWEEK FROM pickup_datetime) * 1.0 AS dayofweek,
  EXTRACT(HOUR FROM pickup_datetime) * 1.0 AS hourofday,
  pickup_longitude AS pickuplon,
  pickup_latitude AS pickuplat,
  dropoff_longitude AS dropofflon,
  dropoff_latitude AS dropofflat,
  passenger_count * 1.0 AS passengers,
  CONCAT(CAST(pickup_datetime AS STRING), CAST(pickup_longitude AS STRING), CAST(pickup_latitude AS STRING), CAST(dropoff_latitude AS STRING), CAST(dropoff_longitude AS STRING)) AS key
FROM
  `nyc-tlc.yellow.trips`
WHERE
  trip_distance > 0
  AND fare_amount >= 2.5
  AND pickup_longitude > -78
  AND pickup_longitude < -70
  AND dropoff_longitude > -78
  AND dropoff_longitude < -70
  AND pickup_latitude > 37
  AND pickup_latitude < 45
  AND dropoff_latitude > 37
  AND dropoff_latitude < 45
  AND passenger_count > 0
  """

  if EVERY_N == None:
    if phase < 2:
      # Training
      query = "{0} AND ABS(MOD(FARM_FINGERPRINT(CAST(pickup_datetime AS STRING)), 4)) < 2".format(base_query)
    else:
      # Validation
      query = "{0} AND ABS(MOD(FARM_FINGERPRINT(CAST(pickup_datetime AS STRING)), 4)) = {1}".format(base_query, phase)
  else:
    query = "{0} AND ABS(MOD(FARM_FINGERPRINT(CAST(pickup_datetime AS STRING)), {1})) = {2}".format(base_query, EVERY_N, phase)
    
  return query

query = create_query(2, 100000)
df = bigquery.Client().query(query).to_dataframe()

In [81]:
print(df.dtypes)
print_rmse(model, df)

fare_amount    float64
dayofweek      float64
hourofday      float64
pickuplon      float64
pickuplat      float64
dropofflon     float64
dropofflat     float64
passengers     float64
key             object
dtype: object
INFO:tensorflow:Calling model_fn.
INFO:tensorflow:Done calling model_fn.
INFO:tensorflow:Starting evaluation at 2020-05-21-16:06:15
INFO:tensorflow:Graph was finalized.
INFO:tensorflow:Restoring parameters from /tmp/tmpg8psl4vb/model.ckpt-6071
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.


UnimplementedError: Cast string to float is not supported
	 [[Node: dnn/input_from_feature_columns/input_layer/key/ToFloat = Cast[DstT=DT_FLOAT, SrcT=DT_STRING, _device="/job:localhost/replica:0/task:0/device:CPU:0"](dnn/input_from_feature_columns/input_layer/key/ExpandDims)]]

Caused by op 'dnn/input_from_feature_columns/input_layer/key/ToFloat', defined at:
  File "/usr/local/envs/py3env/lib/python3.5/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/local/envs/py3env/lib/python3.5/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/usr/local/envs/py3env/lib/python3.5/site-packages/ipykernel/__main__.py", line 3, in <module>
    app.launch_new_instance()
  File "/usr/local/envs/py3env/lib/python3.5/site-packages/traitlets/config/application.py", line 658, in launch_instance
    app.start()
  File "/usr/local/envs/py3env/lib/python3.5/site-packages/ipykernel/kernelapp.py", line 486, in start
    self.io_loop.start()
  File "/usr/local/envs/py3env/lib/python3.5/site-packages/tornado/ioloop.py", line 888, in start
    handler_func(fd_obj, events)
  File "/usr/local/envs/py3env/lib/python3.5/site-packages/tornado/stack_context.py", line 277, in null_wrapper
    return fn(*args, **kwargs)
  File "/usr/local/envs/py3env/lib/python3.5/site-packages/zmq/eventloop/zmqstream.py", line 450, in _handle_events
    self._handle_recv()
  File "/usr/local/envs/py3env/lib/python3.5/site-packages/zmq/eventloop/zmqstream.py", line 480, in _handle_recv
    self._run_callback(callback, msg)
  File "/usr/local/envs/py3env/lib/python3.5/site-packages/zmq/eventloop/zmqstream.py", line 432, in _run_callback
    callback(*args, **kwargs)
  File "/usr/local/envs/py3env/lib/python3.5/site-packages/tornado/stack_context.py", line 277, in null_wrapper
    return fn(*args, **kwargs)
  File "/usr/local/envs/py3env/lib/python3.5/site-packages/ipykernel/kernelbase.py", line 283, in dispatcher
    return self.dispatch_shell(stream, msg)
  File "/usr/local/envs/py3env/lib/python3.5/site-packages/ipykernel/kernelbase.py", line 233, in dispatch_shell
    handler(stream, idents, msg)
  File "/usr/local/envs/py3env/lib/python3.5/site-packages/ipykernel/kernelbase.py", line 399, in execute_request
    user_expressions, allow_stdin)
  File "/usr/local/envs/py3env/lib/python3.5/site-packages/ipykernel/ipkernel.py", line 208, in do_execute
    res = shell.run_cell(code, store_history=store_history, silent=silent)
  File "/usr/local/envs/py3env/lib/python3.5/site-packages/ipykernel/zmqshell.py", line 537, in run_cell
    return super(ZMQInteractiveShell, self).run_cell(*args, **kwargs)
  File "/usr/local/envs/py3env/lib/python3.5/site-packages/IPython/core/interactiveshell.py", line 2662, in run_cell
    raw_cell, store_history, silent, shell_futures)
  File "/usr/local/envs/py3env/lib/python3.5/site-packages/IPython/core/interactiveshell.py", line 2785, in _run_cell
    interactivity=interactivity, compiler=compiler, result=result)
  File "/usr/local/envs/py3env/lib/python3.5/site-packages/IPython/core/interactiveshell.py", line 2907, in run_ast_nodes
    if self.run_code(code, result):
  File "/usr/local/envs/py3env/lib/python3.5/site-packages/IPython/core/interactiveshell.py", line 2961, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input-81-c5c04ffae3b1>", line 2, in <module>
    print_rmse(model, df)
  File "<ipython-input-62-a1fce65954e2>", line 2, in print_rmse
    metrics = model.evaluate(input_fn = make_eval_input_fn(df))# dfには、df_validが渡されている
  File "/usr/local/envs/py3env/lib/python3.5/site-packages/tensorflow/python/estimator/estimator.py", line 425, in evaluate
    name=name)
  File "/usr/local/envs/py3env/lib/python3.5/site-packages/tensorflow/python/estimator/estimator.py", line 1087, in _evaluate_model
    features, labels, model_fn_lib.ModeKeys.EVAL, self.config)
  File "/usr/local/envs/py3env/lib/python3.5/site-packages/tensorflow/python/estimator/estimator.py", line 831, in _call_model_fn
    model_fn_results = self._model_fn(features=features, **kwargs)
  File "/usr/local/envs/py3env/lib/python3.5/site-packages/tensorflow/python/estimator/canned/dnn.py", line 494, in _model_fn
    config=config)
  File "/usr/local/envs/py3env/lib/python3.5/site-packages/tensorflow/python/estimator/canned/dnn.py", line 183, in _dnn_model_fn
    logits = logit_fn(features=features, mode=mode)
  File "/usr/local/envs/py3env/lib/python3.5/site-packages/tensorflow/python/estimator/canned/dnn.py", line 91, in dnn_logit_fn
    features=features, feature_columns=feature_columns)
  File "/usr/local/envs/py3env/lib/python3.5/site-packages/tensorflow/python/feature_column/feature_column.py", line 277, in input_layer
    trainable, cols_to_vars)
  File "/usr/local/envs/py3env/lib/python3.5/site-packages/tensorflow/python/feature_column/feature_column.py", line 202, in _internal_input_layer
    trainable=trainable)
  File "/usr/local/envs/py3env/lib/python3.5/site-packages/tensorflow/python/feature_column/feature_column.py", line 2297, in _get_dense_tensor
    return inputs.get(self)
  File "/usr/local/envs/py3env/lib/python3.5/site-packages/tensorflow/python/feature_column/feature_column.py", line 2100, in get
    transformed = column._transform_feature(self)  # pylint: disable=protected-access
  File "/usr/local/envs/py3env/lib/python3.5/site-packages/tensorflow/python/feature_column/feature_column.py", line 2272, in _transform_feature
    return math_ops.to_float(input_tensor)
  File "/usr/local/envs/py3env/lib/python3.5/site-packages/tensorflow/python/ops/math_ops.py", line 841, in to_float
    return cast(x, dtypes.float32, name=name)
  File "/usr/local/envs/py3env/lib/python3.5/site-packages/tensorflow/python/ops/math_ops.py", line 787, in cast
    x = gen_math_ops.cast(x, base_type, name=name)
  File "/usr/local/envs/py3env/lib/python3.5/site-packages/tensorflow/python/ops/gen_math_ops.py", line 1525, in cast
    "Cast", x=x, DstT=DstT, name=name)
  File "/usr/local/envs/py3env/lib/python3.5/site-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
    op_def=op_def)
  File "/usr/local/envs/py3env/lib/python3.5/site-packages/tensorflow/python/framework/ops.py", line 3392, in create_op
    op_def=op_def)
  File "/usr/local/envs/py3env/lib/python3.5/site-packages/tensorflow/python/framework/ops.py", line 1718, in __init__
    self._traceback = self._graph._extract_stack()  # pylint: disable=protected-access

UnimplementedError (see above for traceback): Cast string to float is not supported
	 [[Node: dnn/input_from_feature_columns/input_layer/key/ToFloat = Cast[DstT=DT_FLOAT, SrcT=DT_STRING, _device="/job:localhost/replica:0/task:0/device:CPU:0"](dnn/input_from_feature_columns/input_layer/key/ExpandDims)]]


RMSE on benchmark dataset is <b>9.41</b> (your results will vary because of random seeds).

This is not only way more than our original benchmark of 6.00, but it doesn't even beat our distance-based rule's RMSE of 8.02.

Fear not -- you have learned how to write a TensorFlow model, but not to do all the things that you will have to do to your ML model performant. We will do this in the next chapters. In this chapter though, we will get our TensorFlow model ready for these improvements.

In a software sense, the rest of the labs in this chapter will be about refactoring the code so that we can improve it.

## Challenge Exercise

Create a neural network that is capable of finding the volume of a cylinder given the radius of its base (r) and its height (h). Assume that the radius and height of the cylinder are both in the range 0.5 to 2.0. Simulate the necessary training dataset.
<p>
Hint (highlight to see):
<p style='color:white'>
The input features will be r and h and the label will be $\pi r^2 h$
Create random values for r and h and compute V.
Your dataset will consist of r, h and V.
Then, use a DNN regressor.
Make sure to generate enough data.
</p>

Copyright 2017 Google Inc. Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License