<h1> Scaling up ML using Cloud ML </h1>

This notebook is Lab3a of CPB 102, Google's course on Machine Learning using Cloud ML.

In this notebook, we take a previously developed TensorFlow model that predicts taxi fares and package it up so that it can be run in Cloud ML. For now, we'll run this on a small dataset. The model is rather simple, so its accuracy is not great. However, this notebook illustrates *how* to package up a TensorFlow model to run it within Cloud ML.

<div id="toc"></div>

Later in the course, we will look at ways to make a more effective machine learning model.

In [None]:
%%javascript
$.getScript('https://kmahelona.github.io/ipython_notebook_goodies/ipython_notebook_toc.js')

In [None]:
import google.cloud.ml as ml
import tensorflow as tf
print(tf.__version__)
print(ml.sdk_location)

<h2> Environment variables for project and bucket </h2>

Change the cell below to reflect your Project ID and bucket name. Note that:
<ol>
<li> Your project ID is the *unique* string that identifies your project (not the project name). You can find it on the Home page of the GCP Console dashboard. My dashboard reads: <b>Project ID:</b> cloud-training-demos </li>
<li> Cloud training often involves saving and restoring model files. Therefore, we should <b>create a single-region bucket</b>. If you don't have a bucket already, I suggest that you create one from the GCP console (because it will dynamically check whether the bucket name you want is available). </li>
</ol>

The first cell below sets these values. The second ensures that your bucket is writeable by Cloud ML; you need to run it only once (not once per notebook).

In [None]:
import os
PROJECT = 'cloud-training-demos'    # CHANGE THIS
BUCKET = 'cloud-training-demos-ml'  # CHANGE THIS

os.environ['PROJECT'] = PROJECT # for bash
os.environ['BUCKET'] = BUCKET # for bash

In [None]:
%bash
PROJECT_ID=$PROJECT
AUTH_TOKEN=$(gcloud auth print-access-token)
SVC_ACCOUNT=$(curl -X GET -H "Content-Type: application/json" -H "Authorization: Bearer $AUTH_TOKEN" https://ml.googleapis.com/v1beta1/projects/$PROJECT_ID:getConfig | python -c "import json; import sys; response = json.load(sys.stdin); print(response['serviceAccount'])")
echo "Authorizing Cloud ML service account $SVC_ACCOUNT to write to $BUCKET"
gsutil acl ch -u $SVC_ACCOUNT:WRITE gs://$BUCKET/
gsutil defacl ch -u $SVC_ACCOUNT:O gs://$BUCKET/
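
The one-line Python snippet in the cell above pulls the `serviceAccount` field out of the JSON returned by `getConfig`. A minimal sketch of the same step with an explicit error message is shown below. The helper name `extract_service_account` and the sample response are hypothetical; only the `serviceAccount` key comes from the cell above.

```python
import json

# Hypothetical response body for illustration; the real one comes from getConfig.
sample_response = '{"serviceAccount": "cloud-ml-service@example-project.iam.gserviceaccount.com"}'


def extract_service_account(response_text):
    """Return the 'serviceAccount' field, or raise a clear error if it is absent."""
    response = json.loads(response_text)
    if 'serviceAccount' not in response:
        raise ValueError('No serviceAccount in response: {0}'.format(response_text))
    return response['serviceAccount']


print(extract_service_account(sample_response))
```

Failing loudly here is useful because an expired access token would otherwise leave `SVC_ACCOUNT` empty and the later `gsutil` calls would fail with a less obvious message.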

Verify the project and bucket settings:

In [None]:
%bash
echo "project=$PROJECT"
echo "bucket=$BUCKET"

<h2> Package up TensorFlow model </h2>

The TensorFlow model needs to be packaged up into a Python package. This has a very specific folder structure (you'd typically maintain this exact structure in your source repository). Then, you create an archive of it using the `tar` command:

In [None]:
%bash
rm -rf taxifare.tar.gz taxi_trained
tar cvfz taxifare.tar.gz taxifare

task.py and model.py contain the code that you wrote earlier in Datalab. We moved it into a Python module.
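
The commands in this notebook (`python -m trainer.task` with `PYTHONPATH` ending in `taxifare`, and `tar cvfz taxifare.tar.gz taxifare`) imply a layout of `taxifare/trainer/task.py` and `taxifare/trainer/model.py`. The sketch below rebuilds that layout in a temporary directory and archives it with Python's `tarfile` module, so you can see what the archive should contain. The empty files and the `__init__.py` entry are assumptions for illustration; the real package may hold additional files.

```python
import os
import tarfile
import tempfile

# Build a throwaway copy of the assumed layout (empty placeholder files).
root = tempfile.mkdtemp()
trainer_dir = os.path.join(root, 'taxifare', 'trainer')
os.makedirs(trainer_dir)
for name in ('__init__.py', 'task.py', 'model.py'):
    open(os.path.join(trainer_dir, name), 'w').close()

# Equivalent of: tar cvfz taxifare.tar.gz taxifare
archive_path = os.path.join(root, 'taxifare.tar.gz')
with tarfile.open(archive_path, 'w:gz') as archive:
    archive.add(os.path.join(root, 'taxifare'), arcname='taxifare')

# List the archive contents to confirm the structure.
with tarfile.open(archive_path, 'r:gz') as archive:
    members = sorted(archive.getnames())
print('\n'.join(members))
```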

<h2> Running training locally </h2>

Once you have a packaged TensorFlow model, you can run training by passing in the paths to your data.

Note the absolute paths below. In Datalab, /content is mapped to the folder that the home icon takes you to.

In [None]:
%bash
rm -rf /content/training-data-analyst/CPB102/lab3a/taxi_trained
head -1 /content/training-data-analyst/CPB102/lab1a/taxi-train.csv
head -1 /content/training-data-analyst/CPB102/lab1a/taxi-valid.csv

In [None]:
!gcloud --quiet components update

In [None]:
import tensorflow as tf
print(tf.__version__)

In [None]:
%bash
# Remove only the old output; later cells still need taxifare.tar.gz
rm -rf taxi_trained
export PYTHONPATH=${PYTHONPATH}:/content/training-data-analyst/CPB102/lab3a/taxifare
python -m trainer.task \
   --train_data_paths=/content/training-data-analyst/CPB102/lab1a/taxi-train.csv \
   --eval_data_paths=/content/training-data-analyst/CPB102/lab1a/taxi-valid.csv  \
   --output_dir=/content/training-data-analyst/CPB102/lab3a/taxi_trained \
   --num_epochs=100
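
The flags passed to `trainer.task` above have to be parsed by the module itself. `task.py` is not reproduced in this notebook, so the following is only a hypothetical sketch of how such flags could be read with `argparse`; the function name `parse_args`, the defaults, and the example file names are assumptions.

```python
import argparse


def parse_args(argv=None):
    """Parse the same flag names that the training command above passes in."""
    parser = argparse.ArgumentParser()
    parser.add_argument('--train_data_paths', required=True,
                        help='Path or pattern for the training CSV files')
    parser.add_argument('--eval_data_paths', required=True,
                        help='Path or pattern for the validation CSV files')
    parser.add_argument('--output_dir', required=True,
                        help='Directory for checkpoints and summaries')
    parser.add_argument('--num_epochs', type=int, default=100,
                        help='Number of passes over the training data')
    return parser.parse_args(argv)


# Example invocation with placeholder paths.
args = parse_args(['--train_data_paths=taxi-train.csv',
                   '--eval_data_paths=taxi-valid.csv',
                   '--output_dir=taxi_trained',
                   '--num_epochs=100'])
print(args.output_dir, args.num_epochs)
```

Because the same flag names are reused by `gcloud beta ml local train` after the `--` separator, a module written this way runs unchanged under either command.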

In [None]:
%bash
gcloud beta ml local train \
   --module-name=trainer.task \
   --package-path=/content/training-data-analyst/CPB102/lab3a/taxifare.tar.gz \
   -- \
   --train_data_paths=/content/training-data-analyst/CPB102/lab1a/taxi-train.csv \
   --eval_data_paths=/content/training-data-analyst/CPB102/lab1a/taxi-valid.csv  \
   --output_dir=/content/training-data-analyst/CPB102/lab3a/taxi_trained \
   --num_epochs=100

In [None]:
%%mlalpha train
package_uris: /content/training-data-analyst/CPB102/lab3a/taxifare.tar.gz
python_module: trainer.task
scale_tier: BASIC
region: us-central1
runtime_version: 1.0.0-rc2
args:
  train_data_paths: /content/training-data-analyst/CPB102/lab1a/taxi-train.csv
  eval_data_paths: /content/training-data-analyst/CPB102/lab1a/taxi-valid.csv
  output_dir: /content/training-data-analyst/CPB102/lab3a/taxi_trained
  num_epochs: 100

In [None]:
!ls /content/training-data-analyst/CPB102/lab3a/taxi_trained

In [None]:
%mlalpha summary --dir /content/training-data-analyst/CPB102/lab3a/taxi_trained/summaries  /content/training-data-analyst/CPB102/lab3a/taxi_trained/eval  --name loss accuracy --step

Both curves are root-mean-squared errors (RMSE): the loss is the RMSE on the training dataset, and the quantity labeled accuracy is the RMSE on the validation dataset, so lower is better for both. The loss is reported frequently since it is computed anyway, but we compute the validation RMSE only once every 30s of training, so that curve has fewer points.
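
As a reminder of what RMSE measures, here is a small worked example in plain Python. The fare values are made up for illustration and the helper name `rmse` is hypothetical; the model computes the same quantity inside TensorFlow.

```python
import math


def rmse(labels, predictions):
    """Root-mean-squared error between true values and predictions."""
    squared_errors = [(p - y) ** 2 for y, p in zip(labels, predictions)]
    return math.sqrt(sum(squared_errors) / float(len(squared_errors)))


# Made-up fares in dollars: the predictions are off by 3 and by 4.
true_fares = [10.0, 20.0]
predicted_fares = [13.0, 16.0]
print(rmse(true_fares, predicted_fares))
```

Errors of 3 and 4 dollars give sqrt((9 + 16) / 2), about 3.54, so the RMSE is expressed in the same units as the fare itself.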

In [None]:
%tensorboard start --logdir /content/training-data-analyst/CPB102/lab3a/taxi_trained

In [None]:
%tensorboard stop --pid 6326

<h2> Training on cloud </h2>

In order to train on the cloud, we have to copy the model and data to our bucket on Google Cloud Storage (GCS).

In [None]:
%bash
rm -rf taxifare.tar.gz taxi_trained
tar cvfz taxifare.tar.gz taxifare
gsutil cp taxifare.tar.gz gs://$BUCKET/taxifare/source/taxifare.tar.gz
gsutil cp ../lab1a/*.csv  gs://$BUCKET/taxifare/input/
gsutil -m rm -r -f gs://$BUCKET/taxifare/taxi_preproc
gsutil -m rm -r -f gs://$BUCKET/taxifare/taxi_trained

When you run your preprocessor, you have to change the input and output to be on GCS.  

Using DirectPipelineRunner runs Dataflow locally, but the inputs and outputs are on the cloud. Using BlockingDataflowPipelineRunner will use Cloud Dataflow (and take much longer because of the overhead involved for such a small dataset). To see the status of your BlockingDataflowPipelineRunner job, visit https://console.cloud.google.com/dataflow

In [None]:
# imports
import apache_beam as beam
import google.cloud.ml as ml
import google.cloud.ml.dataflow.io.tfrecordio as tfrecordio
import google.cloud.ml.io as io
import os

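# Change as needed
RUNNER = 'DirectPipelineRunner'  # runs Dataflow locally
#RUNNER = 'BlockingDataflowPipelineRunner'  # runs on Cloud Dataflow

# defines
# TaxifareFeatures is not defined in this notebook; it is assumed to have been
# defined in an earlier cell or lab before this cell is run.
feature_set = TaxifareFeatures()
OUTPUT_DIR = 'gs://{0}/taxifare/taxi_preproc'.format(BUCKET)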

pipeline = beam.Pipeline(argv=['--project', PROJECT,
                               '--runner', RUNNER,
                               '--job_name', 'lab3a',
                               '--extra_package', ml.sdk_location,
                               '--no_save_main_session', 'True',  # to prevent pickling and uploading Datalab itself!
                               '--staging_location', 'gs://{0}/taxifare/staging'.format(BUCKET),
                               '--temp_location', 'gs://{0}/taxifare/temp'.format(BUCKET)])


# preprocessing
training_data = beam.io.TextFileSource(
    'gs://{0}/taxifare/input/taxi-train.csv'.format(BUCKET),
    strip_trailing_newlines=True,
    coder=io.CsvCoder.from_feature_set(feature_set, feature_set.csv_columns))
train = pipeline | beam.Read('ReadTrainingData', training_data)
eval_data = beam.io.TextFileSource(
    'gs://{0}/taxifare/input/taxi-valid.csv'.format(BUCKET),
    strip_trailing_newlines=True,
    coder=io.CsvCoder.from_feature_set(feature_set, feature_set.csv_columns))
# named eval_set to avoid shadowing Python's builtin eval
eval_set = pipeline | beam.Read('ReadEvalData', eval_data)


(metadata, train_features, eval_features) = ((train, eval_set) |
   'Preprocess' >> ml.Preprocess(feature_set))

(metadata
   | 'SaveMetadata'
   >> io.SaveMetadata(os.path.join(OUTPUT_DIR, 'metadata.yaml')))
(train_features
   | 'WriteTraining'
   >> io.SaveFeatures(os.path.join(OUTPUT_DIR, 'features_train')))
(eval_features
   | 'WriteEval'
   >> io.SaveFeatures(os.path.join(OUTPUT_DIR, 'features_eval')))

# run pipeline
pipeline.run()
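
The flag list passed to `beam.Pipeline` above can be collected in a small helper, so that switching runners or buckets is a one-argument change. This is only a sketch: the helper name `dataflow_args` and the placeholder SDK path are hypothetical, while the flags themselves are copied from the cell above.

```python
def dataflow_args(project, bucket, runner, sdk_location, job_name='lab3a'):
    """Build the argv list used to construct the preprocessing pipeline."""
    base = 'gs://{0}/taxifare'.format(bucket)
    return ['--project', project,
            '--runner', runner,
            '--job_name', job_name,
            '--extra_package', sdk_location,
            '--no_save_main_session', 'True',
            '--staging_location', base + '/staging',
            '--temp_location', base + '/temp']


# Placeholder values; in the notebook these would be PROJECT, BUCKET and ml.sdk_location.
argv = dataflow_args('cloud-training-demos', 'cloud-training-demos-ml',
                     'DirectPipelineRunner', '/path/to/cloudml-sdk.tar.gz')
print(argv)
```

In the cell above, the result would be passed as `beam.Pipeline(argv=argv)`.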

In [None]:
%bash
gsutil ls gs://$BUCKET/taxifare/taxi_preproc

Finally, submit the training job to the cloud. Cloud ML jobs usually take hours and are therefore queued, so it may be a couple of minutes before your job starts executing. Because this is a small job, though, it should complete a few seconds after it starts.

In [None]:
# set up parameters for the mlalpha command.
package_uris = 'gs://' + BUCKET + '/taxifare/source/taxifare.tar.gz'
train_data_paths = 'gs://' + BUCKET + '/taxifare/taxi_preproc/features_train*'
eval_data_paths = 'gs://' + BUCKET + '/taxifare/taxi_preproc/features_eval*'
metadata_path = 'gs://' + BUCKET + '/taxifare/taxi_preproc/metadata.yaml'
output_path = 'gs://' + BUCKET + '/taxifare/taxi_trained'

In [None]:
%mlalpha train --cloud
package_uris: $package_uris
python_module: trainer.task
scale_tier: BASIC
region: us-central1
args:
  train_data_paths: $train_data_paths
  eval_data_paths: $eval_data_paths
  metadata_path: $metadata_path
  output_path: $output_path
  max_steps: 1000

In [None]:
%mlalpha jobs --name trainer_task_170114_222854

<h2> Prediction </h2>

Make sure that the training job has completed before proceeding to this step (check the log above).

To predict the taxifare for new inputs, you first have to deploy the trained model (deleting a previous one if necessary):

In [None]:
%bash
# Workaround for https://buganizer.corp.google.com/issues/31730085:
# copy metadata.yaml into the trained model directory before deploying
gsutil cp gs://$BUCKET/taxifare/taxi_preproc/metadata.yaml gs://$BUCKET/taxifare/taxi_trained/model/

In [None]:
%mlalpha delete --name taxifare.v1

In [None]:
%mlalpha deploy --name taxifare.v1 --path gs://$BUCKET/taxifare/taxi_trained/model/

In [None]:
from googleapiclient import discovery
from oauth2client.client import GoogleCredentials
import json

import google.cloud.ml.features as features
from google.cloud.ml import session_bundle

credentials = GoogleCredentials.get_application_default()
api = discovery.build('ml', 'v1beta1', credentials=credentials,
            discoveryServiceUrl='https://storage.googleapis.com/cloud-ml/discovery/ml_v1beta1_discovery.json')

request_data = {'instances':
  [
    {'examples':
      {
        'pickup_longitude': -73.885262,
        'pickup_latitude': 40.773008,
        'dropoff_longitude': -73.987232,
        'dropoff_latitude': 40.732403,
        'passenger_count': 2,
        'fare_amount': -999
      }
    }
  ]
}

parent = 'projects/%s/models/%s/versions/%s' % (PROJECT, 'taxifare', 'v1')
response = api.projects().predict(body=request_data, name=parent).execute()
print("response={0}".format(response))
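
The request body above is a list of instances, each wrapping the feature values under an `examples` key. A small helper makes it easier to build requests for several rides at once. This is a sketch only: the helper name `make_request` is hypothetical, and the `-999` value for `fare_amount` is treated as a placeholder because the fare is the quantity being predicted.

```python
import json


def make_request(rides):
    """Wrap a list of ride dictionaries in the request structure used above."""
    instances = []
    for ride in rides:
        example = dict(ride)
        # Placeholder label, as in the cell above; the model predicts the real fare.
        example.setdefault('fare_amount', -999)
        instances.append({'examples': example})
    return {'instances': instances}


request_body = make_request([
    {'pickup_longitude': -73.885262, 'pickup_latitude': 40.773008,
     'dropoff_longitude': -73.987232, 'dropoff_latitude': 40.732403,
     'passenger_count': 2},
])
print(json.dumps(request_body, indent=2, sort_keys=True))
```

The resulting dictionary can be passed as the `body` argument of the `predict` call shown above.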

Copyright 2016 Google Inc. Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License