<h1> Scaling up ML using Cloud ML Engine </h1>

Thanks to Google for providing training materials and coursework on TensorFlow, GCP and ML Engine. I used the code they provide at the link below as the starting point and structure for my project:

https://github.com/GoogleCloudPlatform/training-data-analyst/tree/master/courses/machine_learning/deepdive/03_tensorflow

<h2> Environment variables for project and bucket </h2>

In [1]:
import os
PROJECT = 'tnw-nyc-taxi-fare-prediction' # REPLACE WITH YOUR PROJECT ID
BUCKET = 'tnw-nyc-taxi-fare-prediction' # REPLACE WITH YOUR BUCKET NAME
REGION = 'us-central1' # REPLACE WITH YOUR BUCKET REGION e.g. us-central1

In [2]:
# for bash
os.environ['PROJECT'] = PROJECT
os.environ['BUCKET'] = BUCKET
os.environ['REGION'] = REGION
os.environ['TFVERSION'] = '1.9'  # TensorFlow version

In [3]:
%bash
gcloud config set project $PROJECT
gcloud config set compute/region $REGION

Updated property [core/project].
Updated property [compute/region].


Allow the Cloud ML Engine service account to read/write to the bucket containing training data.

In [4]:
%bash
PROJECT_ID=$PROJECT
AUTH_TOKEN=$(gcloud auth print-access-token)
SVC_ACCOUNT=$(curl -X GET -H "Content-Type: application/json" \
    -H "Authorization: Bearer $AUTH_TOKEN" \
    https://ml.googleapis.com/v1/projects/${PROJECT_ID}:getConfig \
    | python -c "import json; import sys; response = json.load(sys.stdin); \
    print(response['serviceAccount'])")

echo "Authorizing the Cloud ML Service account $SVC_ACCOUNT to access files in $BUCKET"
gsutil -m defacl ch -u $SVC_ACCOUNT:R gs://$BUCKET
gsutil -m acl ch -u $SVC_ACCOUNT:R -r gs://$BUCKET  # error message (if bucket is empty) can be ignored
gsutil -m acl ch -u $SVC_ACCOUNT:W gs://$BUCKET

Authorizing the Cloud ML Service account service-861692681329@cloud-ml.google.com.iam.gserviceaccount.com to access files in tnw-nyc-taxi-fare-prediction


No changes to gs://tnw-nyc-taxi-fare-prediction/
No changes to gs://tnw-nyc-taxi-fare-prediction/taxi-test.csv
No changes to gs://tnw-nyc-taxi-fare-prediction/taxi-train.csv
No changes to gs://tnw-nyc-taxi-fare-prediction/taxi-valid.csv
No changes to gs://tnw-nyc-taxi-fare-prediction/test.csv
No changes to gs://tnw-nyc-taxi-fare-prediction/test.json
No changes to gs://tnw-nyc-taxi-fare-prediction/test2.json
No changes to gs://tnw-nyc-taxi-fare-prediction/train.csv
No changes to gs://tnw-nyc-taxi-fare-prediction/datalab-backups/us-central1-a/highmemvm/content/daily-20180817225242
No changes to gs://tnw-nyc-taxi-fare-prediction/datalab-backups/us-central1-a/highmemvm/content/d

<h2> Packaging up the code </h2>

In [5]:
!find taxifare

taxifare
taxifare/PKG-INFO
taxifare/v1-trainer
taxifare/v1-trainer/model-Copy1.py
taxifare/v1-trainer/model.py
taxifare/v1-trainer/__init__.py
taxifare/v1-trainer/task.py
taxifare/v1-trainer/.ipynb_checkpoints
taxifare/trainer.egg-info
taxifare/trainer.egg-info/SOURCES.txt
taxifare/trainer.egg-info/PKG-INFO
taxifare/trainer.egg-info/dependency_links.txt
taxifare/trainer.egg-info/top_level.txt
taxifare/setup.py
taxifare/setup.cfg
taxifare/v2-trainer
taxifare/v2-trainer/model.py
taxifare/v2-trainer/__init__.py
taxifare/v2-trainer/task.py
taxifare/.ipynb_checkpoints


In [6]:
!cat taxifare/v2-trainer/model.py

#!/usr/bin/env python

# Copyright 2017 Google Inc. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#      http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

import tensorflow as tf
import shutil
import pandas as pd
import numpy as np

tf.logging.set_verbosity(tf.logging.INFO)

# List the CSV columns
CSV_COLUMNS = ['key', 'fare_amount', 'pickup_longitude', 'pickup_latitude', 'dropoff_longitude', 

<h2> Find absolute paths to your data </h2>

In [6]:
%bash
echo $PWD
head -1 $PWD/taxi-train.csv
head -1 $PWD/taxi-valid.csv

/content/datalab/notebooks/training-data-analyst/courses/machine_learning/deepdive/03_tensorflow


head: cannot open '/content/datalab/notebooks/training-data-analyst/courses/machine_learning/deepdive/03_tensorflow/taxi-train.csv' for reading: No such file or directory
head: cannot open '/content/datalab/notebooks/training-data-analyst/courses/machine_learning/deepdive/03_tensorflow/taxi-valid.csv' for reading: No such file or directory


<h4> Monitor using TensorBoard </h4>

In [8]:
from google.datalab.ml import TensorBoard
TensorBoard().start('./v2_trained')

  from ._conv import register_converters as _register_converters


4419

<h2> Submit training job using gcloud </h2>

In [10]:
%%bash
OUTDIR=gs://${BUCKET}/taxifare/input/v2_trained
JOBNAME=taxi_v2$(date -u +%y%m%d_%H%M%S)
echo $OUTDIR $REGION $JOBNAME
gcloud ml-engine jobs submit training $JOBNAME \
   --region=$REGION \
   --module-name=v2-trainer.task \
   --package-path=${PWD}/taxifare/v2-trainer \
   --job-dir=$OUTDIR \
   --staging-bucket=gs://$BUCKET \
   --scale-tier=STANDARD_1 \
   --runtime-version=$TFVERSION \
   -- \
   --train_data_paths="gs://${BUCKET}/taxifare/input/taxi-train*" \
   --eval_data_paths="gs://${BUCKET}/taxifare/input/taxi-valid*"  \
   --output_dir=$OUTDIR \
   --train_steps=5300000

gs://tnw-nyc-taxi-fare-prediction/taxifare/input/v2_trained us-central1 taxi_v2180914_034927
jobId: taxi_v2180914_034927
state: QUEUED


Job [taxi_v2180914_034927] submitted successfully.
Your job is still active. You may view the status of your job with the command

  $ gcloud ml-engine jobs describe taxi_v2180914_034927

or continue streaming the logs with the command

  $ gcloud ml-engine jobs stream-logs taxi_v2180914_034927


In [11]:
for pid in TensorBoard.list()['pid']:
  TensorBoard().stop(pid)
  print('Stopped TensorBoard with pid {}'.format(pid))

Stopped TensorBoard with pid 4419


In [12]:
!ls $PWD/v2_trained

ls: cannot access '/content/datalab/notebooks/training-data-analyst/courses/machine_learning/deepdive/03_tensorflow/v2_trained': No such file or directory


## Create and run baseline predictor
Data exploration suggested an average fare of about $3.40 per km of straight-line distance. I'm using that figure as the baseline to beat: I'll import the test data, fill in the `fare_amount` column with `distance_km` * 3.4, and submit the resulting CSV to the [Kaggle competition](https://www.kaggle.com/c/new-york-city-taxi-fare-prediction) to see how well the baseline performs.
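The `distance_km` column is precomputed in the test data, and the notebook does not show how. As a minimal sketch, assuming the straight-line distance is a haversine (great-circle) distance with a mean Earth radius of 6371 km, the baseline for a single trip could look like this. The helper names `haversine_km` and `baseline_fare` are hypothetical, and the coordinates are those of the first test row.

```python
import math

EARTH_RADIUS_KM = 6371.0  # assumption: mean Earth radius
RATE_PER_KM = 3.4         # baseline rate taken from the data exploration


def haversine_km(lon1, lat1, lon2, lat2):
    """Great-circle distance in km between two (lon, lat) points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def baseline_fare(distance_km, rate_per_km=RATE_PER_KM):
    """Baseline prediction: a flat rate per km of straight-line distance."""
    return distance_km * rate_per_km


# Pickup and dropoff coordinates of the first row of the test data.
d = haversine_km(-73.97332, 40.763805, -73.98143, 40.743835)
print('distance_km={:.3f} baseline_fare={:.2f}'.format(d, baseline_fare(d)))
```

For that row the computed distance lands close to the stored `distance_km` value, which suggests (but does not prove) that the column was built in a similar way.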

In [28]:
%bash
echo $BUCKET
gsutil -m cp gs://${BUCKET}/taxi-test.csv ${PWD}/

tnw-nyc-taxi-fare-prediction


Copying gs://tnw-nyc-taxi-fare-prediction/taxi-test.csv...
/ [0/1 files][    0.0 B/  1.1 MiB]   0% Done                                    / [1/1 files][  1.1 MiB/  1.1 MiB] 100% Done                                    
Operation completed over 1 objects/1.1 MiB.                                      


In [32]:
import pandas as pd

test_df = pd.read_csv('taxi-test.csv', index_col=0)  # Remove index_col if taxi-test.csv has been recreated without an index col

Unnamed: 0,key,pickup_longitude,pickup_latitude,dropoff_longitude,dropoff_latitude,passenger_count,hour,day_of_week,day_of_month,week,month,year,distance_km
0,2015-01-27 13:08:24.0000002,-73.97332,40.763805,-73.98143,40.743835,1,13,1,27,5,1,15,2.32326
1,2015-01-27 13:08:24.0000003,-73.986862,40.719383,-73.998886,40.739201,1,13,1,27,5,1,15,2.425353
2,2011-10-08 11:53:44.0000002,-73.982524,40.75126,-73.979654,40.746139,1,11,5,8,40,10,11,0.618628
3,2012-12-01 21:12:12.0000002,-73.98116,40.767807,-73.990448,40.751635,1,21,5,1,48,12,12,1.961033
4,2012-12-01 21:12:12.0000003,-73.966046,40.789775,-73.988565,40.744427,1,21,5,1,48,12,12,5.387301


In [34]:
test_df['fare_amount'] = test_df.distance_km * 3.4

test_df.head()

Unnamed: 0,key,pickup_longitude,pickup_latitude,dropoff_longitude,dropoff_latitude,passenger_count,hour,day_of_week,day_of_month,week,month,year,distance_km,fare_amount
0,2015-01-27 13:08:24.0000002,-73.97332,40.763805,-73.98143,40.743835,1,13,1,27,5,1,15,2.32326,7.899083
1,2015-01-27 13:08:24.0000003,-73.986862,40.719383,-73.998886,40.739201,1,13,1,27,5,1,15,2.425353,8.2462
2,2011-10-08 11:53:44.0000002,-73.982524,40.75126,-73.979654,40.746139,1,11,5,8,40,10,11,0.618628,2.103335
3,2012-12-01 21:12:12.0000002,-73.98116,40.767807,-73.990448,40.751635,1,21,5,1,48,12,12,1.961033,6.667511
4,2012-12-01 21:12:12.0000003,-73.966046,40.789775,-73.988565,40.744427,1,21,5,1,48,12,12,5.387301,18.316824


In [36]:
test_df.to_csv('taxi-predict-baseline.csv', columns=['key', 'fare_amount'], index=False)
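The Kaggle competition scores submissions by root mean squared error (RMSE), so the baseline can also be checked locally before submitting. The sketch below uses a small hypothetical sample of `(distance_km, fare_amount)` pairs; in practice the pairs would come from a labeled set such as the validation data, assuming it carries both columns.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    if len(actual) != len(predicted):
        raise ValueError('actual and predicted must have the same length')
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))


# Hypothetical (distance_km, fare_amount) pairs, for illustration only.
sample = [(1.0, 5.0), (2.0, 7.5), (5.0, 16.0)]

actual = [fare for _, fare in sample]
predicted = [km * 3.4 for km, _ in sample]  # the $3.40/km baseline

score = rmse(actual, predicted)
print('baseline RMSE on sample: {:.3f}'.format(score))
```

Any trained model should beat this number on the same data to be worth deploying.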

<h2> Deploy model </h2>

In [7]:
%bash
gsutil ls gs://${BUCKET}/taxifare/input/v2_trained/export/exporter

gs://tnw-nyc-taxi-fare-prediction/taxifare/input/v2_trained/export/exporter/
gs://tnw-nyc-taxi-fare-prediction/taxifare/input/v2_trained/export/exporter/1536822793/
gs://tnw-nyc-taxi-fare-prediction/taxifare/input/v2_trained/export/exporter/1536823890/
gs://tnw-nyc-taxi-fare-prediction/taxifare/input/v2_trained/export/exporter/1536824393/
gs://tnw-nyc-taxi-fare-prediction/taxifare/input/v2_trained/export/exporter/1536825733/
gs://tnw-nyc-taxi-fare-prediction/taxifare/input/v2_trained/export/exporter/1536826848/
gs://tnw-nyc-taxi-fare-prediction/taxifare/input/v2_trained/export/exporter/1536827455/
gs://tnw-nyc-taxi-fare-prediction/taxifare/input/v2_trained/export/exporter/1536828055/
gs://tnw-nyc-taxi-fare-prediction/taxifare/input/v2_trained/export/exporter/1536828649/
gs://tnw-nyc-taxi-fare-prediction/taxifare/input/v2_trained/export/exporter/1536829247/
gs://tnw-nyc-taxi-fare-prediction/taxifare/input/v2_trained/export/exporter/1536829855/
gs://tnw-nyc-taxi-fare-prediction/taxifare/

In [8]:
%bash
MODEL_NAME="taxi"
MODEL_VERSION="v2"
MODEL_LOCATION=$(gsutil ls gs://${BUCKET}/taxifare/input/v2_trained/export/exporter | tail -1)
echo "Run these commands one-by-one (the very first time, you'll create a model and then create a version)"
#gcloud ml-engine versions delete ${MODEL_VERSION} --model ${MODEL_NAME}
#gcloud ml-engine models delete ${MODEL_NAME}
gcloud ml-engine models create ${MODEL_NAME} --regions $REGION
gcloud ml-engine versions create ${MODEL_VERSION} --model ${MODEL_NAME} --origin ${MODEL_LOCATION} --runtime-version $TFVERSION

Run these commands one-by-one (the very first time, you'll create a model and then create a version)


ERROR: (gcloud.ml-engine.models.create) Resource in project [tnw-nyc-taxi-fare-prediction] is the subject of a conflict: Field: model.name Error: A model with the same name already exists.
- '@type': type.googleapis.com/google.rpc.BadRequest
  fieldViolations:
  - description: A model with the same name already exists.
    field: model.name
Creating version (this might take a few minutes)......
.......................................................................................................done.
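The `tail -1` in the cell above depends on the order in which `gsutil ls` prints its results, and the listing shown earlier ends with a path that is not a timestamped export. A minimal sketch of picking the newest export explicitly, assuming the listing has already been read into a list of strings (for example from the output of `gsutil ls`); `latest_export` is a hypothetical helper:

```python
def latest_export(paths):
    """Return the path whose last component is the largest numeric timestamp.

    Paths that do not end in an all-digit directory name are ignored.
    Returns None when no timestamped export is found.
    """
    stamped = []
    for path in paths:
        name = path.rstrip('/').rsplit('/', 1)[-1]
        if name.isdigit():
            stamped.append((int(name), path))
    return max(stamped)[1] if stamped else None


# A few entries copied from the listing above.
listing = [
    'gs://tnw-nyc-taxi-fare-prediction/taxifare/input/v2_trained/export/exporter/',
    'gs://tnw-nyc-taxi-fare-prediction/taxifare/input/v2_trained/export/exporter/1536822793/',
    'gs://tnw-nyc-taxi-fare-prediction/taxifare/input/v2_trained/export/exporter/1536829855/',
    'gs://tnw-nyc-taxi-fare-prediction/taxifare/',
]
print(latest_export(listing))
```

The returned path is what would be passed as `--origin` when creating the model version.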


<h2> Prediction </h2>
<p>Unfortunately, my modifications to the model code, which were meant to pass the key through the model so it could be rejoined with the prediction, didn't work: the gcloud command below fails with "Unexpected tensor name: key". As the cell after it shows, online (real-time) prediction does work when the JSON is passed in without a key.</p>

In [7]:
%writefile ./test1.json
{"key": "2015-01-27 13:08:24.0000002", "pickup_longitude": -73.9733200073, "pickup_latitude": 40.7638053894, "dropoff_longitude": -73.9814300537, "dropoff_latitude": 40.7438354492, "passenger_count": 1, "hour": 13, "day_of_week": 1, "day_of_month": 27, "week": 5, "month": 1, "year": 15, "distance_km": 2.3232596604}

Overwriting ./test1.json


In [9]:
%bash
gcloud ml-engine predict --model=taxifare --version=v2 --json-instances=./test1.json

{
  "error": "Prediction failed: Unexpected tensor name: key"
}


In [23]:
from googleapiclient import discovery
from oauth2client.client import GoogleCredentials
import json

credentials = GoogleCredentials.get_application_default()
api = discovery.build('ml', 'v1', credentials=credentials,
            discoveryServiceUrl='https://storage.googleapis.com/cloud-ml/discovery/ml_v1_discovery.json')

request_data = {'instances':
  [
      {"pickup_longitude": -73.885262, "pickup_latitude": 40.773008, "dropoff_longitude": -73.987232, "dropoff_latitude": 40.732403, "passenger_count": 2,
 "hour": 14, "day_of_week": 3, "day_of_month": 16, "week": 25, "month": 6, "year": 11, "distance_km": 3.4},
      {"pickup_longitude": -73.885262, "pickup_latitude": 40.773008, "dropoff_longitude": -73.987232, "dropoff_latitude": 40.732403, "passenger_count": 2,
 "hour": 14, "day_of_week": 3, "day_of_month": 16, "week": 25, "month": 6, "year": 11, "distance_km": 3.4}
  ]
}

parent = 'projects/%s/models/%s/versions/%s' % (PROJECT, 'taxifare', 'v2')
response = api.projects().predict(body=request_data, name=parent).execute()
print("response={}".format(response))

response={'predictions': [{'predictions': [21.749521255493164]}, {'predictions': [21.749521255493164]}]}
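Since the deployed model rejects the `key` field, one workaround is to strip the key on the client before calling the API and reattach it afterwards. This is a sketch only: it assumes online predictions come back in the same order as the submitted instances, and the helpers `split_keys` and `rejoin` are hypothetical. The response used here mirrors the shape printed above.

```python
def split_keys(rows, key_field='key'):
    """Separate the key from each row, returning (keys, keyless instances)."""
    keys = [row.get(key_field) for row in rows]
    instances = [{k: v for k, v in row.items() if k != key_field} for row in rows]
    return keys, instances


def rejoin(keys, response):
    """Pair each key with its prediction, assuming the order was preserved."""
    predictions = response['predictions']
    if len(keys) != len(predictions):
        raise ValueError('number of keys and predictions differ')
    return [{'key': key, 'fare_amount': pred['predictions'][0]}
            for key, pred in zip(keys, predictions)]


rows = [
    {'key': '2015-01-27 13:08:24.0000002', 'pickup_longitude': -73.97332,
     'pickup_latitude': 40.763805, 'dropoff_longitude': -73.98143,
     'dropoff_latitude': 40.743835, 'passenger_count': 1, 'hour': 13,
     'day_of_week': 1, 'day_of_month': 27, 'week': 5, 'month': 1,
     'year': 15, 'distance_km': 2.32326},
]
keys, instances = split_keys(rows)

# Stand-in for api.projects().predict(...).execute(); the value is a placeholder.
fake_response = {'predictions': [{'predictions': [21.749521255493164]}]}
print(rejoin(keys, fake_response))
```

`instances` is what would go into `request_data['instances']`; the original `rows` are left untouched.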


In [18]:
%bash
gcloud ml-engine jobs submit prediction v2_prediction_2 \
    --model taxifare \
    --version v2 \
    --data-format TEXT \
    --region us-central1 \
    --input-paths gs://${BUCKET}/taxifare/input/keyless_taxi_test.json \
    --output-path gs://${BUCKET}/taxifare/predictions

jobId: v2_prediction_2
state: QUEUED


Job [v2_prediction_2] submitted successfully.
Your job is still active. You may view the status of your job with the command

  $ gcloud ml-engine jobs describe v2_prediction_2

or continue streaming the logs with the command

  $ gcloud ml-engine jobs stream-logs v2_prediction_2
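The notebook does not show how `keyless_taxi_test.json` was created. A minimal sketch of one way to build such a file from the test CSV, assuming the `TEXT` data format expects one JSON instance per line and that the model takes the same twelve features as the online request above. The helper names and the in-memory sample are hypothetical.

```python
import csv
import io
import json

# Feature names as used in the online prediction request; 'key' is left out.
FEATURES = ['pickup_longitude', 'pickup_latitude', 'dropoff_longitude',
            'dropoff_latitude', 'passenger_count', 'hour', 'day_of_week',
            'day_of_month', 'week', 'month', 'year', 'distance_km']


def to_number(text):
    """Parse a CSV field as int when possible, otherwise as float."""
    try:
        return int(text)
    except ValueError:
        return float(text)


def csv_to_keyless_json_lines(csv_file):
    """Return one JSON string per CSV row, keeping only the model features."""
    reader = csv.DictReader(csv_file)
    return [json.dumps({name: to_number(row[name]) for name in FEATURES})
            for row in reader]


# In-memory stand-in for taxi-test.csv, using the first row shown earlier.
sample_csv = io.StringIO(
    'Unnamed: 0,key,pickup_longitude,pickup_latitude,dropoff_longitude,'
    'dropoff_latitude,passenger_count,hour,day_of_week,day_of_month,week,'
    'month,year,distance_km\n'
    '0,2015-01-27 13:08:24.0000002,-73.97332,40.763805,-73.98143,40.743835,'
    '1,13,1,27,5,1,15,2.32326\n'
)
lines = csv_to_keyless_json_lines(sample_csv)
print('\n'.join(lines))
```

Writing `'\n'.join(lines)` to a file and copying it to the bucket would give the batch job its input. Because the key is dropped, matching batch outputs back to rows would need some other identifier.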


Copyright 2016 Google Inc. Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License