<h1> 2b. Working with low-level TensorFlow </h1>

This notebook is Lab2b of CPB 102, Google's course on Machine Learning using Cloud ML.

In this notebook, we will work with relatively low-level TensorFlow functions to implement a linear regression model. We will use this notebook to demonstrate early stopping -- a technique whereby training is stopped once the error on the validation dataset starts to increase. 

In [1]:
import datalab.bigquery as bq
import tensorflow as tf
import pandas as pd
import numpy as np
import shutil

print tf.__version__

1.0.0-rc2


Code to read data and compute error is similar to Lab2a, but we simply load the dataset into memory rather than come up with an input_fn.

In [2]:
def read_dataset(filename):
  return pd.read_csv(filename, header=None, names=['pickuplon','pickuplat','dropofflon','dropofflat','passengers','fare_amount'])

df_train = read_dataset('../lab1a/taxi-train.csv')
df_valid = read_dataset('../lab1a/taxi-valid.csv')
df_test = read_dataset('../lab1a/taxi-test.csv')
df_train[:5]

Unnamed: 0,pickuplon,pickuplat,dropofflon,dropofflat,passengers,fare_amount
0,-74.003375,40.743642,-73.993685,40.728487,1,6.9
1,-73.990978,40.74534,-73.979782,40.75298,2,5.7
2,-74.0114,40.701761,-73.978215,40.764434,1,20.0
3,-73.970957,40.758473,-73.97944,40.759398,1,4.9
4,-73.990905,40.749596,-73.974205,40.756526,1,9.5


In [3]:
FEATURE_COLS = np.arange(0,5)
TARGET_COL   = 'fare_amount'

In [4]:
def compute_rmse(actual, predicted):
  return np.sqrt(np.mean((actual-predicted)**2))

def print_rmse(model):
  print "Train RMSE = {0}".format(compute_rmse(df_train[TARGET_COL], model.predict(df_train.iloc[:,FEATURE_COLS].values)))
  print "Valid RMSE = {0}".format(compute_rmse(df_valid[TARGET_COL], model.predict(df_valid.iloc[:,FEATURE_COLS].values)))

<h3> Linear Regression </h3>

In [7]:
predictors = df_train.iloc[:,FEATURE_COLS].values
targets = df_train[TARGET_COL].values
prev_valid_error = 10000 # huge number
modelprefix = '/tmp/trained_model'
with tf.Session() as sess:
  npredictors = len(FEATURE_COLS)
  noutputs = 1
  feature_data = tf.placeholder("float", [None, npredictors])
  target_data = tf.placeholder("float", [None, noutputs])
  weights = tf.Variable(tf.truncated_normal([npredictors, noutputs], stddev=0.01))
  biases = tf.Variable(tf.ones([noutputs]))
  model = (tf.matmul(feature_data, weights) + biases) # LINEAR REGRESSION
  cost = tf.nn.l2_loss(model - target_data) # Square Error, not RMSE
  saver = tf.train.Saver({'weights' : weights, 'biases' : biases})
    
  training_step = tf.train.AdamOptimizer(learning_rate=0.0001).minimize(cost)
  tf.global_variables_initializer().run()
  for iter in xrange(0, 10000):
    _, trainerr = sess.run([training_step, cost], feed_dict = {
        feature_data : predictors,
        target_data : targets.reshape(len(predictors), noutputs)
      })
    if (iter%100 == 1):
      # early stop if validation error doesn't keep dropping
      preds = sess.run(model, feed_dict = {feature_data : df_valid.iloc[:,FEATURE_COLS].values})
      trmse = np.sqrt(trainerr/len(predictors))
      vrmse = compute_rmse(df_valid[TARGET_COL], preds[0])      
      print 'iter={0} train_error={1} valid_err={2}'.format(iter, trmse, vrmse)
      if vrmse > prev_valid_error:
         print "Early stop!"
         break  # out of iteration loop
      else:
         prev_valid_error = vrmse
         # save the model so that we can read it
         modelfile = saver.save(sess, modelprefix, global_step=iter)
         print 'Model written to {0}'.format(modelfile)

  preds = sess.run(model, feed_dict = {feature_data : df_test.iloc[:,FEATURE_COLS].values})
  testrmse = compute_rmse(df_test[TARGET_COL], preds[0]) 
  print 'Error on Test data = {0}'.format(testrmse)

iter=1 train_error=10.4740593948 valid_err=15.1333942323
Model written to /tmp/trained_model-1
iter=101 train_error=9.31031243471 valid_err=13.4961987771
Model written to /tmp/trained_model-101
iter=201 train_error=8.39193264165 valid_err=12.2002376379
Model written to /tmp/trained_model-201
iter=301 train_error=7.7169462126 valid_err=11.2414351895
Model written to /tmp/trained_model-301
iter=401 train_error=7.25939013787 valid_err=10.5837084169
Model written to /tmp/trained_model-401
iter=501 train_error=6.97592067564 valid_err=10.1681273848
Model written to /tmp/trained_model-501
iter=601 train_error=6.81620878642 valid_err=9.9265273249
Model written to /tmp/trained_model-601
iter=701 train_error=6.73450035362 valid_err=9.7967053956
Model written to /tmp/trained_model-701
iter=801 train_error=6.69655993007 valid_err=9.73157710121
Model written to /tmp/trained_model-801
iter=901 train_error=6.68056697177 valid_err=9.70058509883
Model written to /tmp/trained_model-901
iter=1001 train_e

Notice that the training error can be driven down very low, but it doesn't actually reduce the validation error.  To help prevent over-fitting, the loop above makes use of "early-stopping", to stop the training when the validation error starts to increase.  In tf.learn, we didn't pass in a validation dataset, but we got similar performance on the validation set -- that's because tf.learn uses a different technique called regularization to help prevent over-fitting.

Early stopping and regularization are not that critical in linear regression (because the model is quite simple), but are crucial once you start creating deep neural networks where there are thousands of weights.