<h1> 2a. Getting started with TensorFlow </h1>

In this notebook, we create a machine learning model using tf.learn and evaluate its performance. The dataset is small (7700 samples), so we can hold all of it in memory. We also pass the raw data in as-is.

In [None]:
import datalab.bigquery as bq
import tensorflow as tf
import pandas as pd
import numpy as np
import shutil

In [None]:
print(tf.__version__)

Read the data created in Lab1a.

In [None]:
FEATURES = ['pickuplon','pickuplat','dropofflon','dropofflat','passengers']
TARGET = 'fare_amount'

def read_dataset(filename):
  columns = list([TARGET])
  columns.extend(FEATURES) # in CSV, the target is the first column, followed by the features
  columns.append('key') # the last column is a row identifier, not a model input
  
  # read with Pandas
  df = pd.read_csv(filename, header=None, names=columns)
  # tensorflow prefers float32; keep only the target and feature columns
  df = df[[TARGET] + FEATURES].astype('float32')
  # create features, columns
  feature_cols = {k: tf.constant(df[k].values) for k in FEATURES}
  labels = tf.constant(df[TARGET].values)
  return feature_cols, labels

def get_train():
  return read_dataset('../lab1a/taxi-train.csv')

def get_valid():
  return read_dataset('../lab1a/taxi-valid.csv')

def get_test():
  return read_dataset('../lab1a/taxi-test.csv')
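
To make the column ordering concrete, here is a minimal sketch that parses headerless CSV text using only the standard library. The two sample rows, the `NUMERIC_FEATURES` name, and the `parse_rows` helper are made up for illustration; the layout (target first, then the features, then a trailing key) is an assumption based on read_dataset above, and the real files come from Lab1a.

```python
import csv
import io

# Assumed layout: target first, then the numeric features, then a key.
TARGET = 'fare_amount'
NUMERIC_FEATURES = ['pickuplon', 'pickuplat', 'dropofflon', 'dropofflat', 'passengers']
COLUMNS = [TARGET] + NUMERIC_FEATURES + ['key']

# Hypothetical rows, not taken from the real dataset.
SAMPLE_CSV = (
    "12.5,-73.98,40.75,-73.99,40.73,1,row-a\n"
    "6.0,-73.95,40.77,-73.96,40.78,2,row-b\n"
)

def parse_rows(text):
    """Return (features, labels) parsed from headerless CSV text."""
    features = {name: [] for name in NUMERIC_FEATURES}
    labels = []
    for row in csv.reader(io.StringIO(text)):
        record = dict(zip(COLUMNS, row))  # name each value by its position
        labels.append(float(record[TARGET]))
        for name in NUMERIC_FEATURES:
            features[name].append(float(record[name]))
    return features, labels

features, labels = parse_rows(SAMPLE_CSV)
print(labels)
print(features['passengers'])
```

The key column is read but deliberately left out of the features, since it identifies a row rather than describing the trip.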

<h3> Linear Regression with tf.learn Estimators framework </h3>

In [None]:
tf.logging.set_verbosity(tf.logging.INFO)
feature_cols = [tf.contrib.layers.real_valued_column(k)
                  for k in FEATURES]
shutil.rmtree('taxi_model', ignore_errors=True) # start fresh each time
model = tf.contrib.learn.LinearRegressor(
      feature_columns=feature_cols, model_dir='taxi_model')
model.fit(input_fn=get_train, steps=1000)

Evaluate on the validation data (we should defer using the test data until after we have selected a final model).

In [None]:
def print_rmse(model, name, input_fn):
  metrics = model.evaluate(input_fn=input_fn, steps=1)
  # the code treats the reported loss as a mean squared error, so its square root is the RMSE
  print('RMSE on {} dataset = {}'.format(name, np.sqrt(metrics['loss'])))
print_rmse(model, 'validation', get_valid)
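
print_rmse takes the square root of the reported loss, which only yields an RMSE if that loss is a mean squared error. As a sanity check on the definition, here is a minimal sketch that computes RMSE directly. The `rmse` helper and the fare values are hypothetical and are not part of the lab code.

```python
import math

def rmse(predictions, targets):
    """Root mean squared error between two equal-length sequences."""
    if len(predictions) != len(targets):
        raise ValueError('predictions and targets must have the same length')
    squared_errors = [(p - t) ** 2 for p, t in zip(predictions, targets)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Hypothetical fares and predictions, in dollars.
actual = [10.0, 12.0, 8.0, 14.0]
predicted = [11.0, 11.0, 11.0, 11.0]
print(rmse(predicted, actual))
```

Because the errors are squared before averaging, a few large misses dominate the result, and the RMSE is expressed in the same units as the fare.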

This is nowhere near our benchmark (RMSE of $6 or so on this data), but it serves to demonstrate what TensorFlow code looks like.  Let's use this model for prediction.

In [None]:
import itertools
feature_cols = [tf.contrib.layers.real_valued_column(k)
                  for k in FEATURES]
# read saved model and use it for prediction
model = tf.contrib.learn.LinearRegressor(
      feature_columns=feature_cols, model_dir='taxi_model')
preds_iter = model.predict(input_fn=get_valid)
print(list(itertools.islice(preds_iter, 5))) # first 5

This explains why the RMSE was so high -- the model essentially predicts $11 for every trip.  Would a more complex model help? Let's try using a deep neural network.  The code to do this is quite straightforward as well.
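
A model that predicts one constant for every trip can do no better than predict the mean fare, and in that case its RMSE equals the standard deviation of the fares. The sketch below illustrates this with made-up fares chosen so that the mean is $11; the numbers are hypothetical and are not drawn from the taxi data.

```python
import math
import statistics

# Hypothetical fares in dollars, chosen so that the mean is 11.
fares = [4.5, 7.0, 9.5, 11.0, 23.0]
constant_prediction = statistics.mean(fares)

# RMSE of predicting the same value for every trip.
mse = sum((constant_prediction - f) ** 2 for f in fares) / len(fares)
constant_rmse = math.sqrt(mse)

# The population standard deviation is the same quantity.
print(constant_prediction, constant_rmse, statistics.pstdev(fares))
```

Any model worth keeping has to beat this baseline, which is why a near-constant prediction is a sign that the model has learned little from the inputs.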

<h3> Deep Neural Network regression </h3>

In [None]:
shutil.rmtree('taxi_model', ignore_errors=True) # start fresh each time
model = tf.contrib.learn.DNNRegressor(hidden_units=[128, 100, 8],
      feature_columns=feature_cols, model_dir='taxi_model')
model.fit(input_fn=get_train, steps=1000)
print_rmse(model, 'validation', get_valid)

We are not beating our benchmark with either model ... what's up?  Well, we may be using TensorFlow for Machine Learning, but we are not yet using it well.  That's what the rest of this course is about!

But, for the record, let's say we had to choose between the two models. We'd choose the one with the lower validation error. Finally, we'd measure the RMSE on the test data with this chosen model.
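
The selection rule can be sketched in a few lines. The RMSE values and the `pick_best` helper below are hypothetical placeholders; in practice the numbers would be the validation results printed by print_rmse for each model.

```python
# Hypothetical validation RMSEs, one per candidate model.
validation_rmse = {'linear': 10.7, 'dnn': 12.3}

def pick_best(scores):
    """Return the name of the model with the lowest validation error."""
    return min(scores, key=scores.get)

best = pick_best(validation_rmse)
print(best)
```

Only after this choice is made would we evaluate the chosen model once on the test data, so that the test RMSE stays an unbiased estimate.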

<h2> Benchmark dataset </h2>

Let's evaluate the model on the benchmark dataset.

In [None]:
import datalab.bigquery as bq
import numpy as np
import pandas as pd


def create_query(phase, EVERY_N):
  """
  phase: 1=train 2=valid
  EVERY_N: if not None, keep only rows whose hash modulo EVERY_N equals phase
  """
  base_query = """
SELECT
  DAYOFWEEK(pickup_datetime)*1.0 AS dayofweek,
  HOUR(pickup_datetime)*1.0 AS hourofday,
  pickup_longitude AS pickuplon, pickup_latitude AS pickuplat, 
  dropoff_longitude AS dropofflon, dropoff_latitude AS dropofflat,
  passenger_count*1.0 AS passengers,
  (tolls_amount + fare_amount) as fare_amount
FROM
  [nyc-tlc:yellow.trips]
WHERE
    trip_distance > 0
    AND fare_amount >= 2.5
    AND pickup_longitude > -78
    AND pickup_longitude < -70
    AND dropoff_longitude > -78
    AND dropoff_longitude < -70
    AND pickup_latitude > 37
    AND pickup_latitude < 45
    AND dropoff_latitude > 37
    AND dropoff_latitude < 45
    AND passenger_count > 0 
  """

  if EVERY_N is None:
    if phase < 2:
      # training
      query = "{0} AND ABS(HASH(pickup_datetime)) % 4 < 2".format(base_query)
    else:
      query = "{0} AND ABS(HASH(pickup_datetime)) % 4 == {1}".format(base_query, phase)
  else:
    query = "{0} AND ABS(HASH(pickup_datetime)) % {1} == {2}".format(base_query, EVERY_N, phase)

  return query


def input_fn():
  query = create_query(2, 100000)
  df = bq.Query(query).to_dataframe()
  df[:] = df[:].astype('float32')
  # create features, columns
  feature_cols = {k: tf.constant(df[k].values) for k in FEATURES}
  labels = tf.constant(df[TARGET].values)
  return feature_cols, labels

print_rmse(model, 'benchmark', input_fn)
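
The query splits rows by hashing pickup_datetime and taking the result modulo a bucket count, so a given row always lands in the same split. Here is a minimal sketch of that idea in plain Python. It uses zlib.crc32 as a stand-in for BigQuery's HASH(), which is an assumption made for illustration: the two functions return different values, so the actual bucket assignments will not match the query's. The `bucket` and `assign_split` helpers and the timestamps are hypothetical.

```python
import zlib
from collections import Counter

def bucket(key, num_buckets):
    """Map a string key to a stable bucket in [0, num_buckets)."""
    # crc32 is deterministic, which is the property the split relies on.
    return zlib.crc32(key.encode('utf-8')) % num_buckets

def assign_split(pickup_datetime):
    """Mirror the EVERY_N=None branch: buckets 0-1 train, bucket 2 valid."""
    b = bucket(pickup_datetime, 4)
    if b < 2:
        return 'train'
    if b == 2:
        return 'valid'
    return 'held out'  # bucket 3 is selected by neither phase 1 nor phase 2

# Hypothetical pickup timestamps, one per minute.
timestamps = ['2014-01-01 00:%02d:00' % m for m in range(60)]
counts = Counter(assign_split(t) for t in timestamps)
print(counts)
```

Hashing a column rather than drawing random numbers makes the split repeatable across runs, and all trips that share a pickup_datetime end up in the same split.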

Copyright 2017 Google Inc. Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License