TF.Learn simplifies training a model.

# Predict the tip for a taxi ride

NYC provides quite a lot of data from taxi trips at http://www.nyc.gov/html/tlc/html/about/trip_record_data.shtml. It's a good exercise to think about how this data can be used to improve the city. You can find more details about the features here: https://data.cityofnewyork.us/view/6phq-6kwz

For this workshop, I thought we could work on a simple problem: predicting the tip for a given ride. To keep the download small, I prepared a subset of the data from January 2016 for us to use. It includes 100,000 rows for training and 10,000 rows for evaluation, with five features. Much richer data is available in the original files. Here's the format of the subset.

| passenger_count | trip_distance | RatecodeID | payment_type | fare_amount | tip_amount |
|-----------------|---------------|------------|--------------|-------------|------------|
| 1               | 2.72          | 1          | 1            | 10.5        | 3.39       |
| 5               | 3.21          | 1          | 1            | 12.5        | 2          |
| 1               | 0.80          | 1          | 1            | 5.5         | 1          |

and so on.

For reference, here are the commands I used to extract this data from https://s3.amazonaws.com/nyc-tlc/trip+data/yellow_tripdata_2016-01.csv

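This command was used to skip the header, select the columns we want, and shuffle the file (since it's sorted by date/time). Note that gshuf is GNU shuf as installed by Homebrew on macOS; on Linux the same command is simply shuf: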

* $ tail -n +2 yellow_tripdata_2016-01.csv | cut -d',' -f4,5,8,12,13,16 | gshuf > cab.csv

These commands were used to select rows for train and test:

* $ head -n 100000 cab.csv > cab_train.csv 

* $ tail -n 10000 cab.csv > cab_test.csv 
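
If the shell tools above aren't available, the same steps can be sketched in Python. This is only an illustration, not the script used to build the workshop files: the function name `extract_and_split` and its parameters are hypothetical, and it assumes the source CSV has at least 16 comma-separated columns in the order implied by the `cut` command. Like the shell version, it reads the whole file into memory, and the train and test sets only stay disjoint if the file has at least `n_train + n_test` rows.

```python
import csv
import random


def extract_and_split(src_path, train_path, test_path,
                      n_train=100000, n_test=10000, seed=0):
    """Skip the header, keep six columns, shuffle, and write train/test CSVs."""
    # 0-based indices equivalent to `cut -f4,5,8,12,13,16`
    keep = [3, 4, 7, 11, 12, 15]
    with open(src_path, newline="") as src:
        reader = csv.reader(src)
        next(reader)  # skip the header row
        # ignore blank or truncated lines that lack the columns we need
        rows = [[row[i] for i in keep] for row in reader if len(row) > max(keep)]

    # shuffle, since the source file is sorted by date/time
    random.Random(seed).shuffle(rows)

    # first n_train rows for training, last n_test rows for evaluation
    train, test = rows[:n_train], rows[-n_test:]
    for path, subset in ((train_path, train), (test_path, test)):
        with open(path, "w", newline="") as out:
            csv.writer(out).writerows(subset)
    return len(train), len(test)
```

A call such as `extract_and_split("yellow_tripdata_2016-01.csv", "cab_train.csv", "cab_test.csv")` would write headerless files in the six-column format shown in the table above.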

In [None]:
import tensorflow as tf
import numpy as np

tf.logging.set_verbosity(tf.logging.ERROR)

In [None]:
# use numpy to read the CSVs
# (the header row was already removed when the files were extracted,
# so no rows need to be skipped here)
train_data = np.loadtxt("cab_train.csv", delimiter=",")
x_data = train_data[:, 0:5]  # first five columns are features
y_data = train_data[:, 5]    # last column is the tip amount

test_data = np.loadtxt("cab_test.csv", delimiter=",")
x_test = test_data[:, 0:5]  # first five columns are features
y_test = test_data[:, 5]    # last column is the tip amount

# used to inform the regressor about our features
# here they're all real-valued
feature_columns = [tf.contrib.layers.real_valued_column('', dimension=5)]

In [None]:
# train and evaluate a linear regression model
R = tf.contrib.learn.LinearRegressor(feature_columns=feature_columns)
R.fit(x_data, y_data, batch_size=100, max_steps=1000)
R.evaluate(x_test, y_test)
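
To put the number that `evaluate` reports in context, it can help to compare it against simple baselines computed directly with numpy. The sketch below assumes the reported loss is a mean squared error; that is an assumption, so check the keys of the returned dictionary for your TensorFlow version. It runs on a small synthetic array standing in for `x_data` and `y_data` (not real rides), and the helper names `mse`, `mean_baseline`, `least_squares_fit`, and `least_squares_predict` are hypothetical.

```python
import numpy as np


def mse(y_true, y_pred):
    """Mean squared error between two 1-D arrays."""
    return float(np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2))


def mean_baseline(y_train, n):
    """Predict the average training tip for every one of n rides."""
    return np.full(n, np.mean(y_train))


def least_squares_fit(x, y):
    """Closed-form linear regression; returns weights with the bias last."""
    x_b = np.hstack([x, np.ones((x.shape[0], 1))])  # append a bias column
    return np.linalg.lstsq(x_b, y, rcond=None)[0]


def least_squares_predict(w, x):
    """Apply weights from least_squares_fit to a 2-D feature array."""
    return np.hstack([x, np.ones((x.shape[0], 1))]).dot(w)


# Synthetic stand-in for the taxi data (an assumption for illustration):
# the "tip" is 20% of the fifth feature plus a little noise.
rng = np.random.RandomState(0)
x_demo = rng.uniform(1, 50, size=(1000, 5))
y_demo = 0.2 * x_demo[:, 4] + rng.normal(0, 0.5, size=1000)

w = least_squares_fit(x_demo[:800], y_demo[:800])
baseline_mse = mse(y_demo[800:], mean_baseline(y_demo[:800], 200))
linear_mse = mse(y_demo[800:], least_squares_predict(w, x_demo[800:]))
print("mean baseline MSE:", baseline_mse)
print("least squares MSE:", linear_mse)
```

Swapping in `x_data`, `y_data`, `x_test`, and `y_test` gives two reference points: a trained model should beat the mean baseline, and a linear model trained for only 1000 steps may still be some way from the closed-form fit.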

In [None]:
# train and evaluate a fully connected deep neural net
# note: the parameters I'm using are silly
# still, the loss should decrease vs. the linear model
D = tf.contrib.learn.DNNRegressor(feature_columns=feature_columns,
                                  hidden_units=[10,20,10])
D.fit(x_data, y_data, batch_size=100, max_steps=1000)
D.evaluate(x_test, y_test)
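
To make `hidden_units=[10,20,10]` concrete, here is a minimal numpy sketch of the layer shapes such a network would have: 5 inputs, three hidden layers of 10, 20, and 10 units, and a single output for the tip. It is not how `DNNRegressor` is implemented. The weights are random and untrained, the ReLU activation is an assumption about the estimator's default, and the function names `init_layers` and `forward` are hypothetical.

```python
import numpy as np


def init_layers(n_inputs, hidden_units, n_outputs=1, seed=0):
    """Create random (untrained) weights and zero biases for each layer."""
    rng = np.random.RandomState(seed)
    sizes = [n_inputs] + list(hidden_units) + [n_outputs]
    return [(rng.normal(0, 0.1, size=(a, b)), np.zeros(b))
            for a, b in zip(sizes[:-1], sizes[1:])]


def forward(layers, x):
    """ReLU on the hidden layers, linear output for regression."""
    h = x
    for w, b in layers[:-1]:
        h = np.maximum(0, h.dot(w) + b)
    w, b = layers[-1]
    return (h.dot(w) + b).ravel()


layers = init_layers(5, [10, 20, 10])
shapes = [w.shape for w, _ in layers]
n_params = sum(w.size + b.size for w, b in layers)
print(shapes)    # [(5, 10), (10, 20), (20, 10), (10, 1)]
print(n_params)  # 501

# one ride in, one (meaningless, untrained) tip estimate out
out = forward(layers, np.array([[1, 14.85, 2, 1, 52]]))
print(out.shape)  # (1,)
```

With only five input features, even this small network has 501 trainable parameters, compared with 6 for the linear model (five weights and a bias).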

In [None]:
# predict the tip for a single ride (the actual tip was $11.67)
# pass a 2-D array of shape [n_rides, n_features], here one ride with five features
D.predict(np.asarray([[1, 14.85, 2, 1, 52]]))