<h1> Scaling up ML using Cloud ML </h1>

This notebook is Lab 3a of CPB102, Google's course on Machine Learning using Cloud ML.

In this notebook, we take a previously developed TensorFlow model that predicts taxi fares and package it up so that it can be run in Cloud ML. For now, we'll run it on a small dataset. The model is rather simplistic, so its accuracy is not great either. However, this notebook illustrates *how* to package up a TensorFlow model to run it within Cloud ML.

<div id="toc"></div>

Later in the course, we will look at ways to make a more effective machine learning model.

In [11]:
%%javascript
$.getScript('https://kmahelona.github.io/ipython_notebook_goodies/ipython_notebook_toc.js')

<IPython.core.display.Javascript object>

In [1]:
%bash
# remember to "Reset Session" if you execute this cell -- this is needed to restart the Python kernel with updated package
gsutil cp gs://cloud-ml/sdk/cloudml-0.1.4.tar.gz .
pip install --force-reinstall --upgrade cloudml-0.1.4.tar.gz

Processing ./cloudml-0.1.4.tar.gz
Collecting oauth2client==2.2.0 (from cloudml==0.1.4)
Collecting six>=1.10.0 (from cloudml==0.1.4)
  Using cached six-1.10.0-py2.py3-none-any.whl
Collecting google-cloud-dataflow>=0.4.0 (from cloudml==0.1.4)
Collecting bs4>=0.0.1 (from cloudml==0.1.4)
  Using cached bs4-0.0.1.tar.gz
Collecting numpy>=1.10.4 (from cloudml==0.1.4)
  Using cached numpy-1.11.1-cp27-cp27mu-manylinux1_x86_64.whl
Collecting pillow>=3.2.0 (from cloudml==0.1.4)
  Using cached Pillow-3.3.1-cp27-cp27mu-manylinux1_x86_64.whl
Collecting dpkt>=1.8.7 (from cloudml==0.1.4)
  Using cached dpkt-1.8.8-py2-none-any.whl
Collecting nltk>=3.2.1 (from cloudml==0.1.4)
  Using cached nltk-3.2.1.tar.gz
Collecting httplib2>=0.9.1 (from oauth2client==2.2.0->cloudml==0.1.4)
  Using cached httplib2-0.9.2.zip
Collecting rsa>=3.1.4 (from oauth2client==2.2.0->cloudml==0.1.4)
  Using cached rsa-3.4.2-py2.py3-none-any.whl
Collecting pyasn1>=0.1.7 (from oauth2client==2.2.0->cloudml==0.1.4)
  Using cached p

Copying gs://cloud-ml/sdk/cloudml-0.1.4.tar.gz...
Downloading file://./cloudml-0.1.4.tar.gz:                       545.79 KiB/545.79 KiB


<h2> Write code for preprocessing and feature engineering </h2>

Realistic ML models involve a fair bit of preprocessing and feature engineering. The standard Cloud ML pipeline expects this. We haven't covered this in class yet, so we'll just pull out the input variables and pass them through untransformed (i.e. we will simply do identity() on the columns).
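
As a rough illustration of what "passing the columns through untransformed" means, here is a minimal pure-Python sketch. It does not use the Cloud ML SDK; the `split_row` helper and the sample line are made up for this example, and only the column names come from the lab's CSV files.

```python
import csv
import io

# Column order taken from the taxi CSV files used in this lab.
CSV_COLUMNS = ('pickup_longitude', 'pickup_latitude', 'dropoff_longitude',
               'dropoff_latitude', 'passenger_count', 'fare_amount')
TARGET = 'fare_amount'


def split_row(row):
    """Return (inputs, target) with every input value passed through unchanged."""
    values = dict(zip(CSV_COLUMNS, (float(v) for v in row)))
    target = values.pop(TARGET)
    return values, target


# A made-up sample line, used only to exercise the function.
sample = io.StringIO('-73.99,40.75,-73.98,40.76,1,7.5\n')
for row in csv.reader(sample):
    inputs, target = split_row(row)
    print(inputs, target)
```

The point is only that each input is the raw number from the file; no scaling, bucketing, or combination of columns happens at this stage.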

<br/>

Datalab can generate the following template code for you. Type <b>%ml features</b> into an empty cell, run it, and then fill out the path, headers, target, and id. Running that cell again creates Python code that defines the features, which you can then edit. (Try it out by creating a new code cell that starts with %ml features.)

In [3]:
import google.cloud.ml.features as features

#import google.cloud.ml as ml
#print ml.sdk_location

class TaxifareFeatures(object):
  """This class is generated from command line:
        %ml features
        path: ../lab1a/taxi-train.csv
        headers: pickup_longitude,pickup_latitude,dropoff_longitude,dropoff_latitude,passenger_count,fare_amount
        target: fare_amount
        Please modify it as appropriate!!!
  """
  csv_columns = ('pickup_longitude','pickup_latitude','dropoff_longitude','dropoff_latitude','passenger_count','fare_amount')
  fare_amount = features.target('fare_amount').regression()
  attrs = [
      features.numeric('pickup_longitude').identity(),
      features.numeric('dropoff_longitude').identity(),
      features.numeric('passenger_count').identity(),
      features.numeric('pickup_latitude').identity(),
      features.numeric('dropoff_latitude').identity(),
  ]




<h2> Dataflow pipeline for preprocessing </h2>

Dataflow pipeline code can also be created using code generation in Datalab.  Type <b>%ml preprocess</b> (or <b>%ml preprocess --cloud</b> to create a template with a DataflowPipelineRunner and gs:// paths) into an empty cell, run it, fill in some params and execute it again. (create a new code cell and try it out!)

Note that this code references the features class defined above (TaxifareFeatures).

In [3]:

# header
"""
Following code is generated from command line:
%%ml preprocess
train_data_path: ../lab1a/taxi-train.csv
eval_data_path: ../lab1a/taxi-valid.csv
data_format: CSV
output_dir: ./taxi_preproc
feature_set_class_name: TaxifareFeatures

Please modify as appropriate!!!
"""

# imports
import apache_beam as beam
import google.cloud.ml as ml
import google.cloud.ml.dataflow.io.tfrecordio as tfrecordio
import google.cloud.ml.io as io
import os

# defines
feature_set = TaxifareFeatures()
OUTPUT_DIR = './taxi_preproc'
pipeline = beam.Pipeline('DirectPipelineRunner')


# preprocessing
training_data = beam.io.TextFileSource(
    '../lab1a/taxi-train.csv',
    strip_trailing_newlines=True,
    coder=io.CsvCoder.from_feature_set(feature_set, feature_set.csv_columns))
train = pipeline | beam.Read('ReadTrainingData', training_data)
eval_data = beam.io.TextFileSource(
    '../lab1a/taxi-valid.csv',
    strip_trailing_newlines=True,
    coder=io.CsvCoder.from_feature_set(feature_set, feature_set.csv_columns))
eval = pipeline | beam.Read('ReadEvalData', eval_data)
(metadata, train_features, eval_features) = ((train, eval) |
    ml.Preprocess('Preprocess', feature_set))
train_parameters = tfrecordio.TFRecordParameters(
    file_path_prefix=os.path.join(OUTPUT_DIR, 'features_train'),
    file_name_suffix='',
    shard_file=False,
    compress_file=True)
eval_parameters = tfrecordio.TFRecordParameters(
    file_path_prefix=os.path.join(OUTPUT_DIR, 'features_eval'),
    file_name_suffix='',
    shard_file=False,
    compress_file=True)
(metadata, train_features, eval_features) | (
    io.SavePreprocessed('SavingData', OUTPUT_DIR,
                        file_parameters_list=[
                            os.path.join(OUTPUT_DIR, 'metadata.yaml'),
                            train_parameters, eval_parameters]))

# run pipeline
pipeline.run()


<apache_beam.runners.direct_runner.DirectPipelineResult at 0x7f969b23da50>

Running the above preprocessing code creates TFRecords, an efficient compressed format that is suitable for repeated training, distribution, and hyperparameter tuning. This is what our TensorFlow model receives. In addition, the preprocessing pipeline creates metadata.yaml, a set of statistics computed from the input data that is necessary for many of the input transformations covered in the next chapter.

In [4]:
!ls taxi_preproc

features_eval  features_train  info  metadata.yaml


In [5]:
!head -20 taxi_preproc/metadata.yaml

columns:
  dropoff_latitude:
    max: 41.366138
    mean: 40.751464661690754
    min: 40.514429
    name: dropoff_latitude
    type: numeric
  dropoff_longitude:
    max: -73.137393
    mean: -73.97474299191431
    min: -74.417107
    name: dropoff_longitude
    type: numeric
  fare_amount:
    max: 88.0
    mean: 11.195111969111972
    min: 2.5
    name: fare_amount
    scenario: continuous
    type: target
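
To make the role of these statistics concrete, here is a small sketch of one transformation that needs them: min-max scaling. The choice of min-max scaling and the `scale_to_unit` helper are assumptions made for illustration; the notebook does not say which transformations Cloud ML applies. The min and max values are copied from the listing above.

```python
# Statistics copied from the metadata.yaml listing above.
DROPOFF_LATITUDE = {'min': 40.514429, 'max': 41.366138}


def scale_to_unit(value, stats):
    """Map value linearly so that stats['min'] -> 0.0 and stats['max'] -> 1.0."""
    span = stats['max'] - stats['min']
    return (value - stats['min']) / span


# Scale one latitude that lies inside the observed range.
print(scale_to_unit(40.732403, DROPOFF_LATITUDE))
```

Without the stored min and max, this transformation would require another full pass over the training data each time it is applied.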


<h2> Package up TensorFlow model </h2>

The TensorFlow model needs to be packaged up into a Python module.  This has a very specific folder structure (you'd typically maintain this exact structure in your source repository). Then, you create an archive of it using the 'tar' command:

In [6]:
%bash
rm -rf taxifare.tar.gz taxi_trained
tar cvfz taxifare.tar.gz taxifare

taxifare/
taxifare/PKG-INFO
taxifare/setup.cfg
taxifare/setup.py
taxifare/trainer/
taxifare/trainer/__init__.py
taxifare/trainer/task.py
taxifare/trainer/taxifare.py
taxifare/trainer.egg-info/
taxifare/trainer.egg-info/dependency_links.txt
taxifare/trainer.egg-info/PKG-INFO
taxifare/trainer.egg-info/SOURCES.txt
taxifare/trainer.egg-info/top_level.txt
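
If you would rather build or inspect the archive from Python, the standard-library `tarfile` module can do the same job. The sketch below is an assumption-laden stand-in: it creates a throwaway folder with empty files that mimic part of the structure above, so `make_archive` and `list_archive` are illustrative helpers, not part of the lab.

```python
import os
import tarfile
import tempfile


def make_archive(package_dir, archive_path):
    """Create a gzip-compressed tar of package_dir, similar to `tar cvfz`."""
    with tarfile.open(archive_path, 'w:gz') as tar:
        tar.add(package_dir, arcname=os.path.basename(package_dir))


def list_archive(archive_path):
    """Return the sorted member names stored in the archive."""
    with tarfile.open(archive_path, 'r:gz') as tar:
        return sorted(tar.getnames())


# Build a throwaway stand-in for the taxifare/ folder so the sketch runs anywhere.
workdir = tempfile.mkdtemp()
package_dir = os.path.join(workdir, 'taxifare')
os.makedirs(os.path.join(package_dir, 'trainer'))
for name in ('setup.py', 'trainer/__init__.py',
             'trainer/task.py', 'trainer/taxifare.py'):
    open(os.path.join(package_dir, *name.split('/')), 'w').close()

archive_path = os.path.join(workdir, 'taxifare.tar.gz')
make_archive(package_dir, archive_path)
names = list_archive(archive_path)
print(names)
```

Listing the members before uploading is a cheap way to confirm that setup.py sits at the top of the package and that the trainer folder is included.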


Only three of those files are ones that you would actually edit, and only one of them (taxifare.py) contains meaningful code.

The first is setup.py. You would change it to reflect your module name, author, author_email and description. You might also list any Python packages that your code depends on in REQUIRED_PACKAGES.

In [7]:
!grep -v "^#" taxifare/setup.py


from setuptools import find_packages
from setuptools import setup

REQUIRED_PACKAGES = [
]

setup(
    name='taxifare',
    version='0.1',
    author = 'Google',
    author_email = 'training-feedback@cloud.google.com',
    install_requires=REQUIRED_PACKAGES,
    packages=find_packages(),
    include_package_data=True,
    description='CPB102 taxifare in Cloud ML',
    requires=[]
)


The second file, at least in theory, is task.py. It contains boilerplate code that runs your TensorFlow model: reading data in batches, setting up summary statistics, and so on. Most of this code will, in the future, move out of your Python module. For now, simply replace every occurrence of 'taxifare' with the name of your Python module and add any hyperparameters you have.

In [8]:
!grep -E "add_argument|taxifare" taxifare/trainer/task.py

import taxifare
  parser.add_argument("--train_data_paths", type=str, action='append')
  parser.add_argument("--eval_data_paths", type=str, action='append')
  parser.add_argument("--metadata_path", type=str)
  parser.add_argument("--output_path", type=str)
  parser.add_argument("--max_steps", type=int, default=2000)
  parser.add_argument("--hidden1", type=int, default=300)
  parser.add_argument("--hidden2", type=int, default=200)
  parser.add_argument("--hidden3", type=int, default=100)
  """Train taxifare for a number of steps."""
  # test on taxifare.
      _, train_examples = taxifare.read_examples(
          taxifare.create_inputs(metadata, input_data=train_examples))
      output = taxifare.inference(inputs, metadata, layer_sizes)
      loss = taxifare.loss(output, targets)
      train_op, global_step = taxifare.training(loss,
    placeholder, inputs, _, keys = taxifare.create_inputs(metadata)
    output = taxifare.inference(inputs, metadata, layer_sizes)
    _, 
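
Adding a hyperparameter amounts to adding one more `add_argument` call. The sketch below reproduces a few of the flags from the grep output and adds a hypothetical `--learning_rate` flag that is not in the original task.py. Building `layer_sizes` from the three hidden-layer flags is also an assumption, since the grep output does not show that line.

```python
import argparse


def build_parser():
    parser = argparse.ArgumentParser()
    # Flags that appear in the grep output above.
    parser.add_argument('--train_data_paths', type=str, action='append')
    parser.add_argument('--max_steps', type=int, default=2000)
    parser.add_argument('--hidden1', type=int, default=300)
    parser.add_argument('--hidden2', type=int, default=200)
    parser.add_argument('--hidden3', type=int, default=100)
    # Hypothetical extra hyperparameter, not present in the original task.py.
    parser.add_argument('--learning_rate', type=float, default=0.01)
    return parser


args = build_parser().parse_args([
    '--train_data_paths', 'features_train',
    '--hidden1', '64',
    '--learning_rate', '0.05',
])
# Assumed way of collecting the layer flags into one list.
layer_sizes = [args.hidden1, args.hidden2, args.hidden3]
print(args.train_data_paths, layer_sizes, args.learning_rate)
```

Because `--train_data_paths` uses `action='append'`, the flag can be repeated and the values accumulate into a list, which matches the list form used for that argument in the training configuration.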

The third one is taxifare.py -- this is the real TensorFlow model and the only one for which you have work to do.

<h2> Implementing TensorFlow model </h2>

Here are the methods in taxifare.py that get called from task.py. It's your job to implement them.

In [9]:
!grep def taxifare/trainer/taxifare.py | grep -v "def _"

def read_examples(input_files, batch_size, shuffle, num_epochs=None):
def create_inputs(metadata, input_data=None):
def inference(inputs, metadata, hyperparams):
def loss(output, targets):
def training(loss, learning_rate):


Take the loss function for example.  This should feel familiar:

In [10]:
!grep -A 10 "def loss" taxifare/trainer/taxifare.py

def loss(output, targets):
  """Calculates the loss from the output and the labels.
  Args:
    output: output layer tensor, float - [batch_size].
    targets: Target value tensor, float - [batch_size].
  Returns:
    loss: Loss tensor of type float.
  """
  loss = tf.sqrt(tf.reduce_mean(tf.square(output - targets)), name = 'loss') # RMSE
  return loss
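
For readers who want to check the arithmetic outside TensorFlow, the same root-mean-square error can be sketched in plain Python. The `rmse` helper and the sample numbers are made up for illustration.

```python
import math


def rmse(outputs, targets):
    """Square root of the mean squared difference between two sequences."""
    squared_errors = [(o - t) ** 2 for o, t in zip(outputs, targets)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Made-up predictions and fares, used only to exercise the function.
print(rmse([10.0, 12.0], [7.0, 16.0]))
```

The three steps (square, mean, square root) correspond to `tf.square`, `tf.reduce_mean` and `tf.sqrt` in the TensorFlow version.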



Change the cell above to look at the other functions. Essentially, you'll implement your TensorFlow model in terms of these functions (or refactor an existing monolithic TensorFlow model into them) and put the pieces in the right spots:
<ol>
<li> create_inputs will take the input data and do any input transformations that you want to do. </li>
<li> inference will create the TensorFlow ML model, i.e. the computational graph. </li>
<li> loss will specify what you want to optimize. </li>
<li> training will implement the training loop. You typically don't have to change this from the sample implementation. </li>
<li> read_examples can also be left as-is unless you want to change what preprocessing outputs or how batching happens (you probably don't). </li>
</ol>

<h2> Running training locally </h2>

Once you have a packaged TensorFlow model, you can run training by passing in the paths to your data.

Type %ml train into an empty cell, run it, fill in some params and execute it again. (create a new code cell and try it out!)

In [11]:
%ml train

<table>
<tr><th>Parameters</th><th>Local Run Required</th><th>Cloud Run Required</th><th>Description</th></tr>
<tr><td>package_uris</td><td>True</td><td>True</td><td>A GCS or local (for local run only) path to your python training program package.</td></tr>
<tr><td>python_module</td><td>True</td><td>True</td><td>The module to run.</td></tr>
<tr><td>scale_tier</td><td>False</td><td>True</td><td>Type of resources requested for the job. On local run, BASIC means 1 master process only, and any other values mean 1 master 1 worker and 1 ps processes. But you can also override the values by setting worker_count and parameter_server_count. On cloud, see service definition for possible values.</td></tr>
<tr><td>region</td><td>False</td><td>True</td><td>Where the training job runs. For cloud run only.</td></tr>
<tr><td>args</td><td>False</td><td>False</td><td>Args that will be passed to your training program.</td></tr>
</table>


In [None]:
%%ml train [--cloud]
package_uris: gs://your-bucket/my-training-package.tar.gz
python_module: your_program.your_module
scale_tier: BASIC
region: us-central1
args:
  string_arg: value
  int_arg: value
  appendable_arg:
    - value1
    - value2


In [7]:
%bash
rm -rf /content/CPB102/lab3a/taxi_trained

In [8]:
%%ml train
package_uris: /content/CPB102/lab3a/taxifare.tar.gz
python_module: trainer.task
scale_tier: BASIC
region: us-central1
args:
  train_data_paths:
    - /content/CPB102/lab3a/taxi_preproc/features_train
  eval_data_paths:
    - /content/CPB102/lab3a/taxi_preproc/features_eval
  metadata_path: /content/CPB102/lab3a/taxi_preproc/metadata.yaml
  output_path: /content/CPB102/lab3a/taxi_trained
  max_steps: 600
  hidden1: 64
  hidden2:  8
  hidden3:  4
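
The args block is handed to the training program, which reads it with argparse. The notebook does not show how Datalab performs that conversion, so the following is only an assumed mapping: each key becomes a `--key value` pair, and a list repeats the flag once per element. The `args_to_argv` helper is hypothetical.

```python
def args_to_argv(args):
    """Flatten a dict of training args into a list of command-line tokens."""
    argv = []
    for name in sorted(args):
        value = args[name]
        # A list value repeats the flag once per element, which is the form
        # that argparse's action='append' accumulates into a list.
        values = value if isinstance(value, list) else [value]
        for item in values:
            argv.extend(['--' + name, str(item)])
    return argv


argv = args_to_argv({
    'train_data_paths': ['/content/CPB102/lab3a/taxi_preproc/features_train'],
    'max_steps': 600,
    'hidden1': 64,
})
print(argv)
```

Under this assumption, the list-valued entries in the configuration line up with the flags in task.py that are declared with `action='append'`.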
  

In [9]:
!ls /content/CPB102/lab3a/taxi_trained

eval  logdir  model  summaries


In [10]:
%ml summary --dir /content/CPB102/lab3a/taxi_trained/summaries  /content/CPB102/lab3a/taxi_trained/eval  --name loss error --step

The loss is the RMSE on the training dataset; the error is the RMSE on the validation dataset. The loss is reported frequently because it is computed anyway, but the error is computed only once every 30 seconds of training, so it has fewer points.
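
The difference in point counts can be illustrated with a small simulation. This is a sketch under stated assumptions, not the code in task.py: time is simulated rather than measured, the loss is recorded on every step for simplicity, and the `run_training` helper and its parameters are made up.

```python
def run_training(num_steps, seconds_per_step, eval_interval_secs=30.0):
    """Count how many loss and error points a simulated run would record."""
    loss_points = 0
    error_points = 0
    clock = 0.0       # simulated elapsed training time, in seconds
    last_eval = 0.0   # simulated time of the most recent validation pass
    for _ in range(num_steps):
        clock += seconds_per_step   # pretend one training step just ran
        loss_points += 1            # the loss is a by-product of every step
        if clock - last_eval >= eval_interval_secs:
            error_points += 1       # a validation pass happens only now
            last_eval = clock
    return loss_points, error_points


print(run_training(num_steps=600, seconds_per_step=0.5))
```

With these made-up timings, the error curve ends up with far fewer points than the loss curve, which is the pattern to expect in the summary plot.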

In [18]:
%tensorboard start --logdir /content/CPB102/lab3a/taxi_trained

In [19]:
%tensorboard stop --pid 16178

<h2> Training on cloud </h2>

First of all, we need to set permissions on our bucket so that Cloud ML can read/write to it.  In CloudShell, go to CPB102/lab3a and run ./get_service_account.sh.  Use that account in the following code.

In [21]:
%bash
# you can find the Cloud ML service account by running ./get_service_account.sh
# change the service account and bucket as appropriate
SVCACCT=cloud-ml-service@cml-663413318684.iam.gserviceaccount.com
BUCKET=cloud-training-demos
gsutil acl ch -u $SVCACCT:WRITE gs://$BUCKET/
gsutil defacl ch -u $SVCACCT:O gs://$BUCKET/

No changes to gs://cloud-training-demos/
No changes to gs://cloud-training-demos/


Next, we have to copy the model and data to Google Cloud Storage (GCS).  Change bucket name as appropriate.

In [11]:
%bash
BUCKET=cloud-training-demos
rm -rf taxifare.tar.gz taxi_trained
tar cvfz taxifare.tar.gz taxifare
gsutil cp taxifare.tar.gz gs://$BUCKET/taxifare/source/taxifare.tar.gz
gsutil cp ../lab1a/*.csv  gs://$BUCKET/taxifare/input/
gsutil -m rm -r -f gs://$BUCKET/taxifare/taxi_preproc
gsutil -m rm -r -f gs://$BUCKET/taxifare/taxi_trained

taxifare/
taxifare/PKG-INFO
taxifare/setup.cfg
taxifare/setup.py
taxifare/trainer/
taxifare/trainer/__init__.py
taxifare/trainer/task.py
taxifare/trainer/taxifare.py
taxifare/trainer.egg-info/
taxifare/trainer.egg-info/dependency_links.txt
taxifare/trainer.egg-info/PKG-INFO
taxifare/trainer.egg-info/SOURCES.txt
taxifare/trainer.egg-info/top_level.txt


Copying file://taxifare.tar.gz [Content-Type=application/x-tar]...
Uploading   ...d-training-demos/taxifare/source/taxifare.tar.gz: 6.42 KiB/6.42 KiB
Copying file://../lab1a/taxi-test.csv [Content-Type=text/csv]...
Uploading   ...loud-training-demos/taxifare/input/taxi-test.csv: 79.68 KiB/79.68 KiB
Copying file://../lab1a/taxi-train.csv [Content-Type=text/csv]...
Uploading   ...oud-training-demos/taxifare/input/taxi-train.csv: 215.48 KiB/370.48 KiB
Uploading   ..

When you run your preprocessor, you have to change the input and output to be on GCS.  

Using DirectPipelineRunner runs Dataflow locally, but the inputs & outputs are on the cloud. Using BlockingDataflowPipelineRunner will use Cloud Dataflow (and take much longer because of the overhead involved for such a small dataset). To see the status of your BlockingDataflowPipelineRunner job, visit https://console.cloud.google.com/dataflow 

In [None]:
# imports
import apache_beam as beam
import google.cloud.ml as ml
import google.cloud.ml.dataflow.io.tfrecordio as tfrecordio
import google.cloud.ml.io as io
import os

# Change as needed
BUCKET = 'cloud-training-demos'
PROJECT = 'cloud-training-demos'
RUNNER = 'DirectPipelineRunner'  # RUNNER = 'BlockingDataflowPipelineRunner'

# defines
feature_set = TaxifareFeatures()
OUTPUT_DIR = 'gs://{0}/taxifare/taxi_preproc'.format(BUCKET)

pipeline = beam.Pipeline(argv=['--project', PROJECT,
                               '--runner', RUNNER,
                               '--job_name', 'lab3a',
                               '--extra_package', ml.sdk_location,
                               '--no_save_main_session', 'True',  # to prevent pickling and uploading Datalab itself!
                               '--staging_location', 'gs://{0}/taxifare/staging'.format(BUCKET),
                               '--temp_location', 'gs://{0}/taxifare/temp'.format(BUCKET)])

# preprocessing
training_data = beam.io.TextFileSource(
    'gs://{0}/taxifare/input/taxi-train.csv'.format(BUCKET),
    strip_trailing_newlines=True,
    coder=io.CsvCoder.from_feature_set(feature_set, feature_set.csv_columns))
train = pipeline | beam.Read('ReadTrainingData', training_data)
eval_data = beam.io.TextFileSource(
    'gs://{0}/taxifare/input/taxi-valid.csv'.format(BUCKET),
    strip_trailing_newlines=True,
    coder=io.CsvCoder.from_feature_set(feature_set, feature_set.csv_columns))
eval = pipeline | beam.Read('ReadEvalData', eval_data)
(metadata, train_features, eval_features) = ((train, eval) |
    ml.Preprocess('Preprocess', feature_set))
train_parameters = tfrecordio.TFRecordParameters(
    file_path_prefix=os.path.join(OUTPUT_DIR, 'features_train'),
    file_name_suffix='',
    shard_file=False,
    compress_file=True)
eval_parameters = tfrecordio.TFRecordParameters(
    file_path_prefix=os.path.join(OUTPUT_DIR, 'features_eval'),
    file_name_suffix='',
    shard_file=False,
    compress_file=True)
(metadata, train_features, eval_features) | (
    io.SavePreprocessed('SavingData', OUTPUT_DIR,
                        file_parameters_list=[
                            os.path.join(OUTPUT_DIR, 'metadata.yaml'),
                            train_parameters, eval_parameters]))

# run pipeline
pipeline.run()




In [6]:
%bash
gsutil ls gs://cloud-training-demos/taxifare/taxi_preproc

gs://cloud-training-demos/taxifare/taxi_preproc/features_eval
gs://cloud-training-demos/taxifare/taxi_preproc/features_train
gs://cloud-training-demos/taxifare/taxi_preproc/info
gs://cloud-training-demos/taxifare/taxi_preproc/metadata.yaml


Finally, submit the training job to the cloud. Unlike Dataflow jobs, which usually take minutes, Cloud ML jobs usually take hours and are therefore queued, so it may be a couple of minutes before your job starts executing. Because this is a small job, it should finish within a few seconds of starting.

In [7]:
%%ml train --cloud
package_uris: gs://cloud-training-demos/taxifare/source/taxifare.tar.gz
python_module: trainer.task
scale_tier: BASIC
region: us-central1
args:
  train_data_paths:
    - gs://cloud-training-demos/taxifare/taxi_preproc/features_train
  eval_data_paths:
    - gs://cloud-training-demos/taxifare/taxi_preproc/features_eval
  metadata_path: gs://cloud-training-demos/taxifare/taxi_preproc/metadata.yaml
  output_path: gs://cloud-training-demos/taxifare/taxi_trained
  max_steps: 1000
  hidden1: 64
  hidden2:  8
  hidden3:  4


<h2> Prediction </h2>

Make sure that the training job has completed before proceeding to this step (check the log above).

To predict the taxifare for new inputs, you first have to deploy the trained model (deleting a previous one if necessary):

In [8]:
%ml delete --name taxifare.v1

In [9]:
%ml deploy --name taxifare.v1 --path gs://cloud-training-demos/taxifare/taxi_trained/model/

In [10]:
from googleapiclient import discovery
from oauth2client.client import GoogleCredentials
import json

import google.cloud.ml.features as features
from google.cloud.ml import session_bundle

credentials = GoogleCredentials.get_application_default()
api = discovery.build('ml', 'v1beta1', credentials=credentials,
            discoveryServiceUrl='https://storage.googleapis.com/cloud-ml/discovery/ml_v1beta1_discovery.json')
#request = {'instances': ['-73.885262,40.773008,-73.987232,40.732403,2']}
request = {'instances': [
    {'pickup_longitude': -73.885262,
     'pickup_latitude': 40.773008,
     'dropoff_longitude': -73.987232,
     'dropoff_latitude': 40.732403,
     'passenger_count': 2}]}
parent = 'projects/%s/models/%s/versions/%s' % ('cloud-training-demos', 'taxifare', 'v1')
response = api.projects().predict(body=request, name=parent).execute()
print "response={0}".format(response)

response={u'error': u'Prediction failed: Unable to get element from the feed as bytes.'}
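
The cell above contains two candidate shapes for the request body: the dictionary form that was sent and a comma-separated string form that is commented out. The notebook does not establish which one the deployed model accepts, so the sketch below only shows how to build both from the same values. `dict_request`, `csv_request` and `FEATURE_ORDER` are hypothetical names, and the column order is taken from the CSV headers used earlier.

```python
import json

# Input columns in the order used by the CSV files (target column omitted).
FEATURE_ORDER = ('pickup_longitude', 'pickup_latitude',
                 'dropoff_longitude', 'dropoff_latitude', 'passenger_count')


def dict_request(instance):
    """Request body with one JSON object per instance."""
    return {'instances': [instance]}


def csv_request(instance):
    """Request body with one comma-separated string per instance."""
    line = ','.join(str(instance[name]) for name in FEATURE_ORDER)
    return {'instances': [line]}


instance = {'pickup_longitude': -73.885262, 'pickup_latitude': 40.773008,
            'dropoff_longitude': -73.987232, 'dropoff_latitude': 40.732403,
            'passenger_count': 2}
print(json.dumps(csv_request(instance)))
```

Either body can be passed as the `body` argument of the predict call shown above; trying the string form is one way to investigate the error in the recorded response.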


Copyright 2016 Google Inc. Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License