<h1> 2a. Getting started with TensorFlow </h1>

This notebook is Lab2a of CPB 102, Google's course on Machine Learning using Cloud ML.

In this notebook, we will create a machine learning model using tf.learn and evaluate its performance. The dataset is rather small (7700 samples), so we can hold all of it in memory. We will also simply pass the raw data to the model as-is.

In [10]:
import datalab.bigquery as bq
import tensorflow as tf
import pandas as pd
import numpy as np
import shutil

Read data created in Lab1a.

In [11]:
def read_dataset(filename):
  return pd.read_csv(filename, header=None, names=['pickuplon','pickuplat','dropofflon','dropofflat','passengers','fare_amount'])

df_train = read_dataset('../lab1a/taxi-train.csv')
df_valid = read_dataset('../lab1a/taxi-valid.csv')
df_test = read_dataset('../lab1a/taxi-test.csv')
df_train[:5]

   pickuplon  pickuplat  dropofflon  dropofflat  passengers  fare_amount
0 -73.984162  40.767241  -73.967796   40.752417           1          9.7
1 -74.005099  40.719629  -74.010202   40.719718           3          5.3
2 -74.004951  40.748075  -74.013482   40.715892           1          9.5
3 -73.988091  40.733528  -73.939537   40.705488           3         17.5
4 -73.970687  40.764815  -73.984393   40.764038           5          5.3
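As a small, self-contained illustration of what read_dataset does, the sketch below parses two of the rows shown above from an in-memory string instead of '../lab1a/taxi-train.csv'. The in-memory buffer and the names sample_csv and df_sample are assumptions made for this example, and it presumes (as header=None implies) that the CSV files contain no header row. It is written for Python 3.

```python
import io
import pandas as pd

COLUMNS = ['pickuplon', 'pickuplat', 'dropofflon', 'dropofflat',
           'passengers', 'fare_amount']

# Two rows copied from the output above, stored without a header line.
sample_csv = (
    "-73.984162,40.767241,-73.967796,40.752417,1,9.7\n"
    "-74.005099,40.719629,-74.010202,40.719718,3,5.3\n"
)

# header=None tells pandas that the first line is data, not column names;
# names= supplies the column labels instead.
df_sample = pd.read_csv(io.StringIO(sample_csv), header=None, names=COLUMNS)

print(df_sample.shape)
print(df_sample.dtypes)
```

Without header=None, pandas would treat the first data row as column names and silently drop one sample.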


Set up a couple of variables based on the above dataset.

In [12]:
FEATURE_COLS = np.arange(0,5)
TARGET_COL   = 'fare_amount'
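
To make the role of these two variables concrete, the sketch below applies them to a tiny DataFrame built from two of the rows shown earlier. FEATURE_COLS holds the positions 0 through 4, so it is used with iloc, while TARGET_COL is a column label and is used with ordinary column indexing. The DataFrame df_demo is a hypothetical stand-in for df_train.

```python
import numpy as np
import pandas as pd

FEATURE_COLS = np.arange(0, 5)   # positions of the five input columns
TARGET_COL = 'fare_amount'       # label of the column to predict

columns = ['pickuplon', 'pickuplat', 'dropofflon', 'dropofflat',
           'passengers', 'fare_amount']
rows = [
    [-73.984162, 40.767241, -73.967796, 40.752417, 1, 9.7],
    [-74.005099, 40.719629, -74.010202, 40.719718, 3, 5.3],
]
df_demo = pd.DataFrame(rows, columns=columns)

# Positional selection gives a 2-D array with one row per trip.
predictors = df_demo.iloc[:, FEATURE_COLS].values
# Label selection gives a 1-D array of fares.
targets = df_demo[TARGET_COL].values

print(predictors.shape, targets.shape)
```

Selecting by position keeps the target column out of the inputs only as long as fare_amount stays the last column, so the column order in read_dataset matters.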

<h3> Linear Regression with tf.learn Estimators framework </h3>

In [13]:
tf.logging.set_verbosity(tf.logging.INFO)
predictors = df_train.iloc[:,FEATURE_COLS].values # np.ndarray
targets = df_train[TARGET_COL].values
features = tf.contrib.learn.infer_real_valued_columns_from_input(predictors)
shutil.rmtree('taxi_model', ignore_errors=True) # start fresh each time
model = tf.contrib.learn.LinearRegressor(feature_columns=features, model_dir='taxi_model')
model = model.fit(predictors, targets, steps=1000)

INFO:tensorflow:Transforming feature_column _RealValuedColumn(column_name='', dimension=5, default_value=None, dtype=tf.float32)
INFO:tensorflow:Create CheckpointSaver
INFO:tensorflow:Step 1: loss = 211.869
INFO:tensorflow:Step 101: loss = 86.5553
INFO:tensorflow:Step 201: loss = 86.5521
INFO:tensorflow:Saving checkpoints for 300 into taxi_model/model.ckpt.
INFO:tensorflow:Step 301: loss = 86.5501
INFO:tensorflow:Step 401: loss = 86.5485
INFO:tensorflow:Step 501: loss = 86.5471
INFO:tensorflow:Saving checkpoints for 600 into taxi_model/model.ckpt.
INFO:tensorflow:Step 601: loss = 86.546
INFO:tensorflow:Step 701: loss = 86.5451
INFO:tensorflow:Step 801: loss = 86.5441
INFO:tensorflow:Saving checkpoints for 900 into taxi_model/model.ckpt.
INFO:tensorflow:Step 901: loss = 86.5432
INFO:tensorflow:Saving checkpoints for 1000 into taxi_model/model.ckpt.
INFO:tensorflow:Loss for final step: 86.5425.
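
For intuition about what a linear regressor fits, here is a minimal NumPy sketch that solves an ordinary least-squares problem in closed form on synthetic data. This is not what the estimator above does internally (it trains iteratively over 1000 steps). The synthetic data, the weights true_w and true_b, and the use of np.linalg.lstsq are all assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.RandomState(0)

# Synthetic, noise-free data with five input features, like the taxi inputs.
X = rng.normal(size=(200, 5))
true_w = np.array([1.5, -2.0, 0.5, 0.0, 3.0])   # hypothetical weights
true_b = 4.0                                    # hypothetical intercept
y = X.dot(true_w) + true_b

# Append a column of ones so the intercept is learned as an extra weight.
X_aug = np.hstack([X, np.ones((X.shape[0], 1))])

# Closed-form least-squares solution of X_aug * coef = y.
coef, _, _, _ = np.linalg.lstsq(X_aug, y, rcond=None)
w_hat, b_hat = coef[:-1], coef[-1]

# Mean squared error of the fitted line on the data it was fitted to.
mse = np.mean((X_aug.dot(coef) - y) ** 2)
print(w_hat, b_hat, mse)
```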


Evaluate on the validation data (we should defer using the test data until after we have selected a final model).

In [14]:
def compute_rmse(actual, predicted):
  # root mean squared error between two equal-length arrays
  return np.sqrt(np.mean((actual-predicted)**2))

def print_rmse(model):
  # report RMSE on the training and validation sets (the test set is left untouched)
  print("Train RMSE = {0}".format(compute_rmse(df_train[TARGET_COL], model.predict(df_train.iloc[:,FEATURE_COLS].values))))
  print("Valid RMSE = {0}".format(compute_rmse(df_valid[TARGET_COL], model.predict(df_valid.iloc[:,FEATURE_COLS].values))))

print_rmse(model)

INFO:tensorflow:Transforming feature_column _RealValuedColumn(column_name='', dimension=5, default_value=None, dtype=tf.float32)
INFO:tensorflow:Loading model from checkpoint: taxi_model/model.ckpt-1000-?????-of-00001.
INFO:tensorflow:Transforming feature_column _RealValuedColumn(column_name='', dimension=5, default_value=None, dtype=tf.float32)
INFO:tensorflow:Loading model from checkpoint: taxi_model/model.ckpt-1000-?????-of-00001.


Train RMSE = 9.30282133554
Valid RMSE = 9.32620596983
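
As a quick sanity check on compute_rmse, the sketch below evaluates it on a tiny hand-made example and then compares the reported training RMSE with the final training loss logged earlier. The toy arrays are made up for illustration. The square root of the final-step loss (86.5425) lands very close to the Train RMSE, which suggests that the logged loss is a mean squared error. That is an inference from the numbers printed in this notebook, not something the notebook states.

```python
import numpy as np

def compute_rmse(actual, predicted):
    # Same formula as in the cell above.
    return np.sqrt(np.mean((actual - predicted) ** 2))

# Toy example: errors are -2, 2 and -3, so the squared errors are 4, 4 and 9.
actual = np.array([10.0, 20.0, 30.0])
predicted = np.array([12.0, 18.0, 33.0])
toy_rmse = compute_rmse(actual, predicted)   # sqrt(17 / 3)

# Values copied from the training log and the RMSE output above.
final_loss = 86.5425
train_rmse = 9.30282133554

print(toy_rmse)
print(np.sqrt(final_loss), train_rmse)
```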


At roughly $9.30, the RMSE is nowhere near our benchmark (an RMSE of $5.70 or so), but the exercise serves to demonstrate what TensorFlow code looks like. Let's use this model for prediction.

In [15]:
ROWS = np.arange(10,15)
inputs = df_test.iloc[ROWS,FEATURE_COLS]
trainedmodel = tf.contrib.learn.LinearRegressor(
  feature_columns=tf.contrib.learn.infer_real_valued_columns_from_input(inputs.values),
  model_dir='taxi_model')
print(trainedmodel.predict(inputs.values))  # predicted fares for the five rows
print(df_test.iloc[ROWS,:])                 # the same rows, including the actual fare

INFO:tensorflow:Transforming feature_column _RealValuedColumn(column_name='', dimension=5, default_value=None, dtype=tf.float32)
INFO:tensorflow:Loading model from checkpoint: taxi_model/model.ckpt-1000-?????-of-00001.


[ 11.1360836   11.12549973  11.13498116  11.46659088  11.1355896 ]
    pickuplon  pickuplat  dropofflon  dropofflat  passengers  fare_amount
10 -73.994770  40.755957  -73.991973   40.744678           1          7.0
11 -73.137393  41.366138  -73.137393   41.366138           1         13.7
12 -73.984789  40.724415  -73.983028   40.750280           1         10.5
13 -73.989340  40.772857  -73.976058   40.775860           5          6.0
14 -73.983511  40.747200  -73.956757   40.780723           1          9.0
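
Using only the numbers printed above, the following sketch compares the spread of the five predictions with the spread of the five actual fares, and computes the RMSE on these rows. The arrays are copied by hand from the output. As a reference point it also evaluates a constant predictor that always outputs the mean of these five fares; that baseline is an addition for illustration, and five rows are far too few to say anything about the model in general.

```python
import numpy as np

# Copied from the prediction output and the fare_amount column above.
predicted = np.array([11.1360836, 11.12549973, 11.13498116,
                      11.46659088, 11.1355896])
actual = np.array([7.0, 13.7, 10.5, 6.0, 9.0])

# How much do the predictions vary, compared with the real fares?
pred_range = predicted.max() - predicted.min()
fare_range = actual.max() - actual.min()

# RMSE of the model on these five rows.
rmse_rows = np.sqrt(np.mean((actual - predicted) ** 2))

# RMSE of a constant predictor that always outputs the mean fare;
# by definition this equals the population standard deviation of the fares.
const_rmse = np.sqrt(np.mean((actual - actual.mean()) ** 2))

print(pred_range, fare_range)
print(rmse_rows, const_rmse)
```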


This explains why the RMSE was so high -- the model essentially predicts $11 for every trip.  Would a more complex model help? Let's try using a deep neural network.  The code to do this is quite straightforward as well.

<h3> Deep Neural Network regression </h3>

In [16]:
shutil.rmtree('taxi_model', ignore_errors=True) # start fresh each time
model = tf.contrib.learn.DNNRegressor(feature_columns=features, hidden_units=[128, 100, 8], model_dir='taxi_model')
model = model.fit(predictors, targets, steps=1000)
print_rmse(model)

INFO:tensorflow:Transforming feature_column _RealValuedColumn(column_name='', dimension=5, default_value=None, dtype=tf.float32)
INFO:tensorflow:Create CheckpointSaver
INFO:tensorflow:Step 1: loss = 106.131
INFO:tensorflow:Step 101: loss = 86.5586
INFO:tensorflow:Step 201: loss = 86.5562
INFO:tensorflow:Saving checkpoints for 300 into taxi_model/model.ckpt.
INFO:tensorflow:Step 301: loss = 86.5545
INFO:tensorflow:Step 401: loss = 86.553
INFO:tensorflow:Step 501: loss = 86.5517
INFO:tensorflow:Saving checkpoints for 600 into taxi_model/model.ckpt.
INFO:tensorflow:Step 601: loss = 86.5506
INFO:tensorflow:Step 701: loss = 86.5495
INFO:tensorflow:Step 801: loss = 86.5487
INFO:tensorflow:Saving checkpoints for 900 into taxi_model/model.ckpt.
INFO:tensorflow:Step 901: loss = 86.5478
INFO:tensorflow:Saving checkpoints for 1000 into taxi_model/model.ckpt.
INFO:tensorflow:Loss for final step: 86.5469.
INFO:tensorflow:Transforming feature_column _RealValuedColumn(column_name='', dimension=5, def

Train RMSE = 9.30306016354
Valid RMSE = 9.32489015235
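
To get a sense of how much larger the network is than the linear model, the sketch below counts trainable parameters for fully connected layers with 5 inputs, hidden layers of 128, 100 and 8 units, and a single output. It assumes every layer is dense with one bias per unit; the notebook does not confirm the internal structure of DNNRegressor, and count_dense_params is a hypothetical helper written for this example.

```python
def count_dense_params(layer_sizes):
    """Count weights and biases in a stack of fully connected layers.

    layer_sizes lists the width of every layer, inputs first, output last.
    """
    total = 0
    for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]):
        total += n_in * n_out   # one weight per input-output pair
        total += n_out          # one bias per output unit
    return total

linear_params = count_dense_params([5, 1])
dnn_params = count_dense_params([5, 128, 100, 8, 1])

print(linear_params, dnn_params)   # 6 14485
```

Under these assumptions the network has over two thousand times as many parameters as the linear model, yet the validation RMSE above barely moves.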


We are not beating our benchmark with either model ... what's up?  Well, we may be using TensorFlow for Machine Learning, but we are not yet using it well.  That's what the rest of this course is about!

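But, for the record, let's say we had to choose between the two models. We'd choose the one with the lower validation error, which here is the deep neural network (validation RMSE of 9.3249 versus 9.3262 for the linear model), although the difference is negligible. Finally, we'd measure the RMSE on the test data with this chosen model. The variable model still refers to the DNN trained above.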

In [17]:
print("Test RMSE = {0}".format(compute_rmse(df_test[TARGET_COL], model.predict(df_test.iloc[:,FEATURE_COLS].values))))

INFO:tensorflow:Transforming feature_column _RealValuedColumn(column_name='', dimension=5, default_value=None, dtype=tf.float32)
INFO:tensorflow:Loading model from checkpoint: taxi_model/model.ckpt-1000-?????-of-00001.


Test RMSE = 9.66621602842
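
The selection rule described above can be sketched in a few lines. The validation RMSE values are copied from the outputs earlier in this notebook; the dictionary keys 'linear' and 'dnn' are labels made up for this example. Note that the test set plays no part in the choice.

```python
# Validation RMSE for each candidate, copied from the outputs above.
valid_rmse = {
    'linear': 9.32620596983,
    'dnn': 9.32489015235,
}

# Pick the candidate with the lowest validation error.
best_name = min(valid_rmse, key=valid_rmse.get)

# Size of the gap between the two candidates, in dollars.
gap = valid_rmse['linear'] - valid_rmse['dnn']

print(best_name, gap)
```

Only after this choice is fixed is the test set consulted, so the test RMSE remains an estimate that did not influence the decision.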


Copyright 2016 Google Inc. Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License