<h1> 2a. Getting started with TensorFlow </h1>

This notebook is Lab2a of CPB 102, Google's course on Machine Learning using Cloud ML.

In this notebook, we will create a machine learning model using tf.learn and evaluate its performance. The dataset is rather small (7700 samples), so everything fits in memory. We will also pass the raw data in as-is, without any preprocessing.

In [1]:
import datalab.bigquery as bq
import tensorflow as tf
import pandas as pd
import numpy as np
import shutil

Read data created in Lab1a.

In [2]:
def read_dataset(filename):
  return pd.read_csv(filename, header=None, names=['pickuplon','pickuplat','dropofflon','dropofflat','passengers','fare_amount'])

df_train = read_dataset('../lab1a/taxi-train.csv')
df_valid = read_dataset('../lab1a/taxi-valid.csv')
df_test = read_dataset('../lab1a/taxi-test.csv')
df_train[:5]

    pickuplon  pickuplat  dropofflon  dropofflat  passengers  fare_amount
0   -73.99205   40.75138  -73.954217   40.766692           2         16.5
1  -73.993046  40.728159  -73.988222   40.731995           1          3.3
2  -73.955591  40.782526  -73.972944   40.751067           1         10.5
3   -73.96429   40.77335   -73.96544    40.75543           1         10.0
4  -73.969442  40.797912  -73.982227   40.765247           1          9.7
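The call to `pd.read_csv` passes `header=None` and supplies the column names explicitly, which is the usual pattern for a CSV file with no header row. The sketch below illustrates that behavior on an in-memory string rather than the Lab1a files. `sample_csv` and `df_sample` are hypothetical names, and the two rows are copied from the output above.

```python
import io
import pandas as pd

CSV_COLUMNS = ['pickuplon', 'pickuplat', 'dropofflon', 'dropofflat', 'passengers', 'fare_amount']

# In-memory stand-in for a headerless CSV file (rows copied from the df_train output).
sample_csv = u"""-73.99205,40.75138,-73.954217,40.766692,2,16.5
-73.993046,40.728159,-73.988222,40.731995,1,3.3
"""

# header=None: treat the first line as data; names=...: label the columns ourselves.
df_sample = pd.read_csv(io.StringIO(sample_csv), header=None, names=CSV_COLUMNS)

print(df_sample.shape)
print(df_sample.dtypes)
```

Without `header=None`, pandas would consume the first data row as the header, so one trip would silently go missing.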


Set up a couple of variables based on the above dataset.

In [3]:
FEATURE_COLS = np.arange(0,5)
TARGET_COL   = 'fare_amount'
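`FEATURE_COLS` holds column positions rather than names, so it is meant for positional indexing with `iloc`, while `TARGET_COL` is a column label. Here is a minimal sketch on a two-row stand-in frame. `df_demo` is a hypothetical name, and the rows are copied from the output above.

```python
import numpy as np
import pandas as pd

FEATURE_COLS = np.arange(0, 5)  # positions 0..4: the first five columns
TARGET_COL = 'fare_amount'      # the label column, selected by name

columns = ['pickuplon', 'pickuplat', 'dropofflon', 'dropofflat', 'passengers', 'fare_amount']
df_demo = pd.DataFrame(
    [[-73.99205, 40.75138, -73.954217, 40.766692, 2, 16.5],
     [-73.993046, 40.728159, -73.988222, 40.731995, 1, 3.3]],
    columns=columns)

predictors = df_demo.iloc[:, FEATURE_COLS].values  # 2-D array, one row per trip
targets = df_demo[TARGET_COL].values               # 1-D array of fares

print(predictors.shape)
print(targets)
```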

<h3> Linear Regression with tf.learn Estimators framework </h3>

In [4]:
tf.logging.set_verbosity(tf.logging.INFO) # log training progress
predictors = df_train.iloc[:,FEATURE_COLS].values # np.ndarray of the five input columns
targets = df_train[TARGET_COL].values # np.ndarray of fares
features = tf.contrib.learn.infer_real_valued_columns_from_input(predictors)
shutil.rmtree('taxi_model', ignore_errors=True) # start fresh each time
model = tf.contrib.learn.LinearRegressor(feature_columns=features, model_dir='taxi_model')
model = model.fit(predictors, targets, steps=1000)

INFO:tensorflow:Transforming feature_column _RealValuedColumn(column_name='', dimension=5, default_value=None, dtype=tf.float32)
INFO:tensorflow:Create CheckpointSaver
INFO:tensorflow:Step 1: loss = 218.036
INFO:tensorflow:Step 101: loss = 89.9517
INFO:tensorflow:Step 201: loss = 89.9487
INFO:tensorflow:Saving checkpoints for 300 into taxi_model/model.ckpt.
INFO:tensorflow:Step 301: loss = 89.9468
INFO:tensorflow:Step 401: loss = 89.9453
INFO:tensorflow:Step 501: loss = 89.944
INFO:tensorflow:Saving checkpoints for 600 into taxi_model/model.ckpt.
INFO:tensorflow:Step 601: loss = 89.9429
INFO:tensorflow:Step 701: loss = 89.9419
INFO:tensorflow:Step 801: loss = 89.941
INFO:tensorflow:Saving checkpoints for 900 into taxi_model/model.ckpt.
INFO:tensorflow:Step 901: loss = 89.9402
INFO:tensorflow:Saving checkpoints for 1000 into taxi_model/model.ckpt.
INFO:tensorflow:Loss for final step: 89.9394.


Evaluate on the validation data (we should defer using the test data until after we have selected a final model).

In [5]:
def compute_rmse(actual, predicted):
  return np.sqrt(np.mean((actual-predicted)**2))

def print_rmse(model):
  print("Train RMSE = {0}".format(compute_rmse(df_train[TARGET_COL], model.predict(df_train.iloc[:,FEATURE_COLS].values))))
  print("Valid RMSE = {0}".format(compute_rmse(df_valid[TARGET_COL], model.predict(df_valid.iloc[:,FEATURE_COLS].values))))

print_rmse(model)

INFO:tensorflow:Transforming feature_column _RealValuedColumn(column_name='', dimension=5, default_value=None, dtype=tf.float32)
INFO:tensorflow:Loading model from checkpoint: taxi_model/model.ckpt-1000-?????-of-00001.
INFO:tensorflow:Transforming feature_column _RealValuedColumn(column_name='', dimension=5, default_value=None, dtype=tf.float32)
INFO:tensorflow:Loading model from checkpoint: taxi_model/model.ckpt-1000-?????-of-00001.


Train RMSE = 9.48363845152
Valid RMSE = 9.00198329991
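
The metric is simple enough to check by hand. The sketch below reuses the `compute_rmse` definition from the cell above on made-up numbers (the toy arrays are illustrative assumptions, not taxi data). It then takes the square root of the "Loss for final step" value from the training log. That square root lands very close to the reported train RMSE, which suggests, though the log does not say so explicitly, that the logged loss is a mean squared error.

```python
import numpy as np

def compute_rmse(actual, predicted):
  # Same formula as the notebook cell: square root of the mean squared error.
  return np.sqrt(np.mean((actual - predicted) ** 2))

# Toy values (hypothetical, for illustration only).
actual = np.array([10.0, 10.0, 10.0, 10.0])
predicted = np.array([9.0, 11.0, 9.0, 15.0])
toy_rmse = compute_rmse(actual, predicted)  # errors 1, -1, 1, -5 give sqrt(28 / 4)
print("toy RMSE = {0}".format(toy_rmse))

# Values copied from the linear-regression output above.
final_step_loss = 89.9394
reported_train_rmse = 9.48363845152
print("sqrt(final loss) = {0}".format(np.sqrt(final_step_loss)))
print("reported train RMSE = {0}".format(reported_train_rmse))
```

Note how the single error of 5 dominates the toy result: squaring the errors means that a few badly mispredicted trips can account for most of the RMSE.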


A validation RMSE of about $9.00 is nowhere near our benchmark (RMSE of $5.70 or so), but it serves to demonstrate what TensorFlow code looks like. Let's use this model for prediction.

In [6]:
ROWS = np.arange(10,15)
inputs = df_test.iloc[ROWS,FEATURE_COLS]
trainedmodel = tf.contrib.learn.LinearRegressor(
  feature_columns=tf.contrib.learn.infer_real_valued_columns_from_input(inputs.values),
  model_dir='taxi_model')
print(trainedmodel.predict(inputs.values))
print(df_test.iloc[ROWS,:])

INFO:tensorflow:Transforming feature_column _RealValuedColumn(column_name='', dimension=5, default_value=None, dtype=tf.float32)
INFO:tensorflow:Loading model from checkpoint: taxi_model/model.ckpt-1000-?????-of-00001.


[ 11.31593513  11.32413197  11.31696892  11.3170681   11.31684685]
    pickuplon  pickuplat  dropofflon  dropofflat  passengers  fare_amount
10 -73.978947  40.747617  -74.012342   40.701557           1         15.5
11 -73.960992  40.760830  -73.955862   40.772037           3          4.5
12 -73.953117  40.776119  -74.003342   40.743809           1         18.0
13 -73.997927  40.756622  -73.985392   40.745540           1         10.5
14 -73.985128  40.748082  -73.976318   40.765825           1          7.7
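
The two printouts above can be compared numerically. This sketch copies the five predictions and the five `fare_amount` values from the output (the model is not rerun) and measures how much each varies. The variable names are illustrative.

```python
import numpy as np

# Copied from the prediction output above (test rows 10 through 14).
predicted = np.array([11.31593513, 11.32413197, 11.31696892, 11.3170681, 11.31684685])
actual = np.array([15.5, 4.5, 18.0, 10.5, 7.7])

pred_spread = predicted.max() - predicted.min()  # how much the predictions vary
fare_spread = actual.max() - actual.min()        # how much the true fares vary
rmse_5_rows = np.sqrt(np.mean((actual - predicted) ** 2))

print("prediction spread:  {0:.4f}".format(pred_spread))
print("fare spread:        {0:.1f}".format(fare_spread))
print("RMSE on these rows: {0:.2f}".format(rmse_5_rows))
```

On these rows the predictions differ by less than a cent, while the actual fares range from $4.50 to $18.00.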


This explains why the RMSE was so high -- the model essentially predicts $11.32 for every trip.  Would a more complex model help? Let's try using a deep neural network.  The code to do this is quite straightforward as well.

<h3> Deep Neural Network regression </h3>

In [7]:
shutil.rmtree('taxi_model', ignore_errors=True) # start fresh each time
model = tf.contrib.learn.DNNRegressor(feature_columns=features, hidden_units=[128, 100, 8], model_dir='taxi_model')
model = model.fit(predictors, targets, steps=1000)
print_rmse(model)

INFO:tensorflow:Transforming feature_column _RealValuedColumn(column_name='', dimension=5, default_value=None, dtype=tf.float32)
INFO:tensorflow:Create CheckpointSaver
INFO:tensorflow:Step 1: loss = 110.621
INFO:tensorflow:Step 101: loss = 177.964
INFO:tensorflow:Step 201: loss = 162.341
INFO:tensorflow:Saving checkpoints for 300 into taxi_model/model.ckpt.
INFO:tensorflow:Step 301: loss = 151.408
INFO:tensorflow:Step 401: loss = 142.96
INFO:tensorflow:Step 501: loss = 136.117
INFO:tensorflow:Saving checkpoints for 600 into taxi_model/model.ckpt.
INFO:tensorflow:Step 601: loss = 130.419
INFO:tensorflow:Step 701: loss = 125.586
INFO:tensorflow:Step 801: loss = 121.432
INFO:tensorflow:Saving checkpoints for 900 into taxi_model/model.ckpt.
INFO:tensorflow:Step 901: loss = 117.827
INFO:tensorflow:Saving checkpoints for 1000 into taxi_model/model.ckpt.
INFO:tensorflow:Loss for final step: 114.706.
INFO:tensorflow:Transforming feature_column _RealValuedColumn(column_name='', dimension=5, def

Train RMSE = 10.7098367157
Valid RMSE = 10.1138401491


We are not beating our benchmark with either model ... what's up?  Well, we may be using TensorFlow for Machine Learning, but we are not yet using it well.  That's what the rest of this course is about!

Copyright 2016 Google Inc. Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License