<h1> Working with 'raw' TensorFlow </h1>

In [1]:
import datalab.bigquery as bq
import tensorflow as tf
import pandas as pd
import numpy as np
import shutil

Code to read data and compute error is the same as Lab2a.

In [2]:
def read_dataset(filename):
  return pd.read_csv(filename, header=None, names=['hour','dayofweek','pickuplon','pickuplat','dropofflon','dropofflat','passengers','fare_amount'])

df_train = read_dataset('../lab1a/taxi-train.csv')
df_valid = read_dataset('../lab1a/taxi-valid.csv')
df_test = read_dataset('../lab1a/taxi-test.csv')
df_train[:5]

Unnamed: 0,hour,dayofweek,pickuplon,pickuplat,dropofflon,dropofflat,passengers,fare_amount
0,6,2,-73.989192,40.748615,-73.97018,40.75638,1,12.0
1,6,2,-73.974967,40.735102,-73.776493,40.644974,3,49.57
2,17,5,-73.995045,40.725998,-74.004505,40.734823,1,6.1
3,11,0,-73.967158,40.772232,-73.991385,40.748908,1,12.0
4,16,0,-73.977785,40.752055,-73.97914,40.762352,5,6.5


In [3]:
FEATURE_COLS = np.arange(0,7)
TARGET_COL   = 'fare_amount'

In [4]:
def compute_rmse(actual, predicted):
  return np.sqrt(np.mean((actual-predicted)**2))

def print_rmse(model):
  print "Train RMSE = {0}".format(compute_rmse(df_train[TARGET_COL], model.predict(df_train.iloc[:,FEATURE_COLS].values)))
  print "Valid RMSE = {0}".format(compute_rmse(df_valid[TARGET_COL], model.predict(df_valid.iloc[:,FEATURE_COLS].values)))

<h3> Linear Regression </h3>

In [6]:
predictors = df_train.iloc[:,FEATURE_COLS].values
targets = df_train[TARGET_COL].values
prev_valid_error = 10000 # huge number
modelprefix = '/tmp/trained_model'
with tf.Session() as sess:
  npredictors = len(FEATURE_COLS)
  noutputs = 1
  feature_data = tf.placeholder("float", [None, npredictors])
  target_data = tf.placeholder("float", [None, noutputs])
  weights = tf.Variable(tf.truncated_normal([npredictors, noutputs], stddev=0.01))
  biases = tf.Variable(tf.ones([noutputs]))
  model = (tf.matmul(feature_data, weights) + biases) # LINEAR REGRESSION
  cost = tf.nn.l2_loss(model - target_data) # Square Error, not RMSE
  saver = tf.train.Saver({'weights' : weights, 'biases' : biases})
    
  training_step = tf.train.AdamOptimizer(learning_rate=0.0001).minimize(cost)
  tf.initialize_all_variables().run()
  for iter in xrange(0, 10000):
    _, trainerr = sess.run([training_step, cost], feed_dict = {
        feature_data : predictors,
        target_data : targets.reshape(len(predictors), noutputs)
      })
    if (iter%100 == 1):
      # early stop if validation error doesn't keep dropping
      preds = sess.run(model, feed_dict = {feature_data : df_valid.iloc[:,FEATURE_COLS].values})
      trmse = np.sqrt(trainerr/len(predictors))
      vrmse = compute_rmse(df_valid[TARGET_COL], preds[0])      
      print 'iter={0} train_error={1} valid_err={2}'.format(iter, trmse, vrmse)
      if vrmse > prev_valid_error:
         print "Early stop!"
         break  # out of iteration loop
      else:
         prev_valid_error = vrmse
         # save the model so that we can read it
         modelfile = saver.save(sess, modelprefix, global_step=iter)
         print 'Model written to {0}'.format(modelfile)


iter=1 train_error=10.7505159105 valid_err=15.3667935745
Model written to /tmp/trained_model-1
iter=101 train_error=9.48247596285 valid_err=13.4785478256
Model written to /tmp/trained_model-101
iter=201 train_error=8.48463910298 valid_err=12.0009206089
Model written to /tmp/trained_model-201
iter=301 train_error=7.75634470641 valid_err=10.9344600153
Model written to /tmp/trained_model-301
iter=401 train_error=7.26874988588 valid_err=10.2350971161
Model written to /tmp/trained_model-401
iter=501 train_error=6.97228667881 valid_err=9.82477963397
Model written to /tmp/trained_model-501
iter=601 train_error=6.80938532048 valid_err=9.61251634544
Model written to /tmp/trained_model-601
iter=701 train_error=6.72856808981 valid_err=9.5178234541
Model written to /tmp/trained_model-701
iter=801 train_error=6.69235088203 valid_err=9.48333397142
Model written to /tmp/trained_model-801
iter=901 train_error=6.67767691998 valid_err=9.47496511212
Model written to /tmp/trained_model-901
iter=1001 train

Notice that the training error can be driven down very low, but it doesn't actually reduce the validation error.  To help prevent over-fitting, the loop above makes use of "early-stopping", to stop the training when the validation error starts to increase.  In tf.learn, we didn't pass in a validation dataset, but we got similar performance on the validation set -- that's because tf.learn uses a different technique called regularization to help prevent over-fitting.

Early stopping and regularization are not that critical in linear regression (because the model is quite simple), but are crucial once you start creating deep neural networks where there are thousands of weights.