<h1> Scaling up ML using Cloud ML </h1>

This notebook is Lab3a of CPB 102, Google's course on Machine Learning using Cloud ML.

In this notebook, we take a previously developed TensorFlow model that predicts taxi fares and package it up so that it can be run in Cloud ML. For now, we'll run it on a small dataset. The model is rather simplistic, so its accuracy is not great either. However, this notebook illustrates *how* to package up a TensorFlow model to run it within Cloud ML.

<div id="toc"></div>

Later in the course, we will look at ways to make a more effective machine learning model.

In [None]:
%%javascript
$.getScript('https://kmahelona.github.io/ipython_notebook_goodies/ipython_notebook_toc.js')

In [None]:
%bash
pip install --upgrade tensorflow_transform

In [None]:
import google.cloud.ml as ml
import tensorflow as tf
import tensorflow_transform as tft
print(tf.__version__)
print(ml.sdk_location)

<h2> Environment variables for project and bucket </h2>

Change the cell below to reflect your Project ID and bucket name. Note that:
<ol>
<li> Your project ID is the *unique* string that identifies your project (not the project name). You can find it on the Home page of the GCP Console dashboard. My dashboard reads:  <b>Project ID:</b> cloud-training-demos </li>
<li> Cloud training often involves saving and restoring model files. Therefore, we should <b>create a single-region bucket</b>. If you don't have a bucket already, I suggest that you create one from the GCP console (because it will dynamically check whether the bucket name you want is available). </li>
</ol>

The next two cells set these values and then ensure that your bucket is writable by Cloud ML. You need to run the second of them only once (not once per notebook).

In [None]:
import os
PROJECT = 'cloud-training-demos'    # CHANGE THIS
BUCKET = 'cloud-training-demos-ml'  # CHANGE THIS
REGION = 'us-central1' # CHANGE THIS

os.environ['PROJECT'] = PROJECT # for bash
os.environ['BUCKET'] = BUCKET # for bash
os.environ['REGION'] = REGION # for bash

In [None]:
!gcloud beta ml init-project -q

Verify the project and bucket settings:

In [None]:
%bash
echo "project=$PROJECT"
echo "bucket=$BUCKET"
echo "region=$REGION"
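
The `echo` commands above can see `PROJECT`, `BUCKET` and `REGION` only because the earlier cell copied them into `os.environ`; a plain Python variable is not passed on to child processes. A minimal sketch of that behavior, using a Python child process as a stand-in for a bash cell (the stand-in is an assumption made so the example runs without bash):

```python
import os
import subprocess
import sys

# Same example value as in the cell above; change it to your own bucket.
os.environ['BUCKET'] = 'cloud-training-demos-ml'

# The child process inherits os.environ from this process.
child_code = "import os; print(os.environ['BUCKET'])"
output = subprocess.check_output([sys.executable, '-c', child_code])
print(output.decode('utf-8').strip())
```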

<h2> Packaging up the code </h2>

Take your code and put it into a standard Python package structure. model.py and task.py contain the TensorFlow code from earlier.
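
As a rough sketch, the layout can be checked from Python before running anything. The file list below is inferred from the paths used in this notebook (`taxifare/trainer/model.py` and the module name `trainer.task`); the `__init__.py` entry and the helper name `missing_package_files` are assumptions made for illustration.

```python
import os
import tempfile

# Assumed layout, inferred from the paths used in this notebook.
# __init__.py is an assumption: it is what makes "trainer" an importable package.
EXPECTED_FILES = [
    os.path.join('trainer', '__init__.py'),
    os.path.join('trainer', 'model.py'),
    os.path.join('trainer', 'task.py'),
]

def missing_package_files(package_root):
    """Return the expected files that do not exist under package_root."""
    return [f for f in EXPECTED_FILES
            if not os.path.isfile(os.path.join(package_root, f))]

# Demonstration on a throwaway directory rather than the real taxifare/.
demo_root = os.path.join(tempfile.mkdtemp(), 'taxifare')
os.makedirs(os.path.join(demo_root, 'trainer'))
for name in ('__init__.py', 'model.py'):
    open(os.path.join(demo_root, 'trainer', name), 'w').close()

print(missing_package_files(demo_root))  # task.py has not been created yet
```

An empty list means that every expected file is in place.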

In [None]:
!find taxifare

In [None]:
!cat taxifare/trainer/model.py

<h2> Find absolute paths to your data </h2>

Note the absolute paths below. /content is mapped in Datalab to where the home icon takes you.

In [None]:
%bash
rm -rf /content/training-data-analyst/CPB102/lab3a/taxi_trained
head -1 /content/training-data-analyst/CPB102/lab1a/taxi-train.csv
head -1 /content/training-data-analyst/CPB102/lab1a/taxi-valid.csv

<h2> Running the Python module from the command line </h2>
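
The command in this section passes four flags to `trainer.task`. The real `task.py` is not shown in this notebook, so the following is only a sketch of how such an entry point might parse them with `argparse`; the help strings, the default for `--num_epochs`, and the choice to ignore unknown flags are all assumptions.

```python
import argparse

def parse_args(argv=None):
    """Parse the four flags used on the command line in this section."""
    parser = argparse.ArgumentParser()
    parser.add_argument('--train_data_paths', required=True,
                        help='File or file pattern for the training CSV data')
    parser.add_argument('--eval_data_paths', required=True,
                        help='File or file pattern for the evaluation CSV data')
    parser.add_argument('--output_dir', required=True,
                        help='Directory for checkpoints and the exported model')
    parser.add_argument('--num_epochs', type=int, default=1,
                        help='Number of passes over the training data')
    # parse_known_args ignores flags this sketch does not declare,
    # instead of exiting with an error.
    args, _unknown = parser.parse_known_args(argv)
    return args

args = parse_args([
    '--train_data_paths=taxi-train*',
    '--eval_data_paths=taxi-valid.csv',
    '--output_dir=taxi_trained',
    '--num_epochs=10',
])
print(args.num_epochs)
```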

In [None]:
%bash
rm -rf taxifare.tar.gz taxi_trained
export PYTHONPATH=${PYTHONPATH}:/content/training-data-analyst/CPB102/lab3a/taxifare
python -m trainer.task \
   --train_data_paths="/content/training-data-analyst/CPB102/lab1a/taxi-train*" \
   --eval_data_paths=/content/training-data-analyst/CPB102/lab1a/taxi-valid.csv  \
   --output_dir=/content/training-data-analyst/CPB102/lab3a/taxi_trained \
   --num_epochs=10

In [None]:
!ls /content/training-data-analyst/CPB102/lab3a/taxi_trained

<h2> Running locally using gcloud </h2>

In [None]:
!gcloud components update --quiet

In [None]:
%bash
rm -rf taxifare.tar.gz taxi_trained
gcloud beta ml local train \
   --module-name=trainer.task \
   --package-path=/content/training-data-analyst/CPB102/lab3a/taxifare/trainer \
   -- \
   --train_data_paths=/content/training-data-analyst/CPB102/lab1a/taxi-train.csv \
   --eval_data_paths=/content/training-data-analyst/CPB102/lab1a/taxi-valid.csv  \
   --num_epochs=10 \
   --output_dir=/content/training-data-analyst/CPB102/lab3a/taxi_trained 

In [None]:
from datalab.mlalpha import TensorBoard
TensorBoard().start('/content/training-data-analyst/CPB102/lab3a/taxi_trained')

In [None]:
# 7279 is an example process ID; replace it with the ID of the TensorBoard instance you started
TensorBoard().stop(7279)

In [None]:
!ls /content/training-data-analyst/CPB102/lab3a/taxi_trained

<h2> Submit training job using gcloud </h2>
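
The submission cell in this section builds a unique job name from the current UTC time (`lab3a_$(date -u +%y%m%d_%H%M%S)`). A Python sketch of the same idea follows. The function names are hypothetical, and the validity check encodes an assumption about the naming rule (a leading letter followed by letters, digits and underscores), so treat it as illustrative rather than as the service's definition.

```python
import re
import time

def make_job_name(prefix, utc_time=None):
    """Build a name like lab3a_170301_123456 from a UTC timestamp."""
    # time.gmtime() is the current time in UTC, like `date -u`.
    utc_time = utc_time or time.gmtime()
    return '{0}_{1}'.format(prefix, time.strftime('%y%m%d_%H%M%S', utc_time))

def looks_like_valid_job_name(name):
    """Assumed rule: a letter first, then letters, digits or underscores."""
    return re.match(r'^[A-Za-z][A-Za-z0-9_]*$', name) is not None

name = make_job_name('lab3a')
print(name)
print(looks_like_valid_job_name(name))
```

Because the timestamp has one-second resolution, two submissions within the same second would produce the same name.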

In [None]:
%bash
echo $BUCKET
gsutil rm -rf gs://${BUCKET}/taxifare/smallinput/
gsutil cp /content/training-data-analyst/CPB102/lab1a/*.csv gs://${BUCKET}/taxifare/smallinput/

In [None]:
%%bash
OUTDIR=gs://${BUCKET}/taxifare/smallinput/taxi_trained
JOBNAME=lab3a_$(date -u +%y%m%d_%H%M%S)
echo $OUTDIR $REGION $JOBNAME
gsutil rm -rf $OUTDIR
gcloud beta ml jobs submit training $JOBNAME \
   --region=$REGION \
   --module-name=trainer.task \
   --package-path=/content/training-data-analyst/CPB102/lab3a/taxifare/trainer \
   --job-dir=$OUTDIR \
   --staging-bucket=gs://$BUCKET \
   --scale-tier=BASIC \
   --runtime-version=1.0 \
   -- \
   --train_data_paths="gs://${BUCKET}/taxifare/smallinput/taxi-train*" \
   --eval_data_paths="gs://${BUCKET}/taxifare/smallinput/taxi-valid*"  \
   --output_dir=$OUTDIR \
   --num_epochs=100

<h2> Preprocessing </h2>
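
The preprocessing code in this section parses CSV lines with a coder whose column list puts the label first, followed by `INPUT_COLUMNS`, so fields are matched to names purely by position. The sketch below illustrates that positional matching with the standard `csv` module. It is not the `CsvCoder` implementation; `decode_line` is a hypothetical helper, and the sample row is made up.

```python
import csv

INPUT_COLUMNS = ['pickuplon', 'pickuplat', 'dropofflon', 'dropofflat',
                 'passengers', 'key']
LABEL_COLUMN = 'fare_amount'

def column_names(include_label=True):
    """Label first (when present), then the input columns, as in make_coder."""
    names = [LABEL_COLUMN] if include_label else []
    names.extend(INPUT_COLUMNS)
    return names

def decode_line(line, names):
    """Map one CSV line to a {column name: float} dictionary by position."""
    values = next(csv.reader([line]))
    if len(values) != len(names):
        raise ValueError('expected {0} fields, got {1}'.format(
            len(names), len(values)))
    return dict((name, float(value)) for name, value in zip(names, values))

# Made-up row in label-first order.
row = decode_line('12.0,-73.99,40.75,-73.98,40.76,1,0', column_names())
print(row['fare_amount'])
print(row['passengers'])
```

If the file's column order differs from the list given to the coder, values end up under the wrong names without any error, which is why the order matters.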

In [None]:
!rm -rf /content/training-data-analyst/CPB102/lab3a/taxi_preproc

In [None]:
import apache_beam as beam
import tensorflow as tf

from tensorflow_transform import coders
from tensorflow_transform.beam import impl as tft
from tensorflow_transform.beam import io
from tensorflow_transform.tf_metadata import dataset_metadata
from tensorflow_transform.tf_metadata import dataset_schema

from tensorflow_transform import api
from tensorflow_transform import mappers

INPUT_COLUMNS = ['pickuplon','pickuplat','dropofflon','dropofflat','passengers', 'key']
LABEL_COLUMN = 'fare_amount'

class PathConstants:
  def __init__(self):
    self.TEMP_DIR = 'tmp'
    self.TRANSFORM_FN_DIR = 'transform_fn'
    self.RAW_METADATA_DIR = 'raw_metadata'
    self.TRANSFORMED_METADATA_DIR = 'transformed_metadata'
    self.TRANSFORMED_TRAIN_DATA_FILE_PREFIX = 'features_train'
    self.TRANSFORMED_EVAL_DATA_FILE_PREFIX = 'features_eval'
    self.TRANSFORMED_PREDICT_DATA_FILE_PREFIX = 'features_predict'
    self.TRAIN_RESULTS_FILE = 'train_results'
    self.DEPLOY_SAVED_MODEL_DIR = 'saved_model'
    self.MODEL_EVALUATIONS_FILE = 'model_evaluations'
    self.BATCH_PREDICTION_RESULTS_FILE = 'batch_prediction_results'
    
def make_preprocessing_fn():
  # stop-gap ...
  def _scalar_to_vector(scalar):
    # FeatureColumns expect shape (batch_size, 1), not just (batch_size)
    return api.map(lambda x: tf.expand_dims(x, -1), scalar)
  
  def preprocessing_fn(inputs):
    result = {LABEL_COLUMN: _scalar_to_vector(inputs[LABEL_COLUMN])}
    for name in INPUT_COLUMNS:
      result[name] = _scalar_to_vector(mappers.scale_to_0_1(inputs[name]))

    # use tft.map() to create new columns
    # tft.scale_to_0_1
    # tft.map(tf.sparse_column_with_keys, inputs['gender'], Statistic({'M', 'F'})
    # tft.string_to_int(inputs[name], frequency_threshold=frequency_threshold)
    return result

  return preprocessing_fn

def make_input_schema(mode):
  input_schema = ({} if mode == tf.contrib.learn.ModeKeys.INFER
            else {LABEL_COLUMN: tf.FixedLenFeature(shape=[], dtype=tf.float64)})
  for name in INPUT_COLUMNS:
    input_schema[name] = tf.FixedLenFeature(
        shape=[], dtype=tf.float64, default_value=0)
  input_schema = dataset_schema.from_feature_spec(input_schema)
  return input_schema

def make_coder(schema, mode):
  column_names = [] if mode == tf.contrib.learn.ModeKeys.INFER else [LABEL_COLUMN]
  column_names.extend(INPUT_COLUMNS)
  coder = coders.CsvCoder(column_names, schema)
  return coder

def preprocess(pipeline, training_data, eval_data, predict_data, output_dir, mode=tf.contrib.learn.ModeKeys.TRAIN):
  path_constants = PathConstants()
  work_dir = os.path.join(output_dir, path_constants.TEMP_DIR)
  
  # create schema
  input_schema = make_input_schema(mode)

  # coder
  coder = make_coder(input_schema, mode)

  # read from text and parse each CSV line using the coder
  train_data = (
      pipeline
      | 'ReadTrainingData' >> beam.io.ReadFromText(training_data)
      | 'ParseTrainingCsv' >> beam.Map(coder.decode))

  evaluate_data = (
      pipeline
      | 'ReadEvalData' >> beam.io.ReadFromText(eval_data)
      | 'ParseEvalCsv' >> beam.Map(coder.decode))

  # metadata
  input_metadata = dataset_metadata.DatasetMetadata(schema=input_schema)

  _ = (input_metadata
       | 'WriteInputMetadata' >> io.WriteMetadata(
           os.path.join(output_dir, path_constants.RAW_METADATA_DIR),
           pipeline=pipeline))

  preprocessing_fn = make_preprocessing_fn()
  (train_dataset, train_metadata), transform_fn = (
      (train_data, input_metadata)
      | 'AnalyzeAndTransform' >> tft.AnalyzeAndTransformDataset(
          preprocessing_fn, work_dir))

  # WriteTransformFn writes transform_fn and metadata to fixed subdirectories
  # of output_dir, which are given by path_constants.TRANSFORM_FN_DIR and
  # path_constants.TRANSFORMED_METADATA_DIR.
  transform_fn_is_written = (transform_fn | io.WriteTransformFn(output_dir))

  # TODO(b/34231369) Remember to eventually also save the statistics.

  (evaluate_dataset, evaluate_metadata) = (
      ((evaluate_data, input_metadata), transform_fn)
      | 'TransformEval' >> tft.TransformDataset())

  train_coder = coders.ExampleProtoCoder(train_metadata.schema)
  _ = (train_dataset
       | 'SerializeTrainExamples' >> beam.Map(train_coder.encode)
       | 'WriteTraining'
       >> beam.io.WriteToTFRecord(
           os.path.join(output_dir,
                        path_constants.TRANSFORMED_TRAIN_DATA_FILE_PREFIX),
           file_name_suffix='.tfrecord.gz'))

  evaluate_coder = coders.ExampleProtoCoder(evaluate_metadata.schema)
  _ = (evaluate_dataset
       | 'SerializeEvalExamples' >> beam.Map(evaluate_coder.encode)
       | 'WriteEval'
       >> beam.io.WriteToTFRecord(
           os.path.join(output_dir,
                        path_constants.TRANSFORMED_EVAL_DATA_FILE_PREFIX),
           file_name_suffix='.tfrecord.gz'))

  if predict_data:
    predict_mode = tf.contrib.learn.ModeKeys.INFER
    predict_schema = make_input_schema(mode=predict_mode)
    tsv_coder = make_coder(predict_schema, mode=predict_mode)
    predict_coder = coders.ExampleProtoCoder(predict_schema)
    _ = (pipeline
         | 'ReadPredictData' >> beam.io.ReadFromText(predict_data,
                                                     coder=tsv_coder)
         # TODO(b/35194257) Obviate the need for this explicit serialization.
         | 'EncodePredictData' >> beam.Map(predict_coder.encode)
         | 'WritePredictData' >> beam.io.WriteToTFRecord(
             os.path.join(output_dir,
                          path_constants.TRANSFORMED_PREDICT_DATA_FILE_PREFIX),
             file_name_suffix='.tfrecord.gz'))

  # Workaround b/35366670, to ensure that training and eval don't start before
  # the transform_fn is written.
  train_dataset |= beam.Map(
      lambda x, y: x, y=beam.pvalue.AsSingleton(transform_fn_is_written))
  evaluate_dataset |= beam.Map(
      lambda x, y: x, y=beam.pvalue.AsSingleton(transform_fn_is_written))

  return transform_fn, train_dataset, evaluate_dataset


train_data_paths='/content/training-data-analyst/CPB102/lab1a/taxi-train.csv' 
eval_data_paths='/content/training-data-analyst/CPB102/lab1a/taxi-valid.csv'  
output_dir='/content/training-data-analyst/CPB102/lab3a/taxi_preproc' 
predict_data_paths=None
p = beam.Pipeline()
transform_fn, train_dataset, eval_dataset = preprocess(
      p, train_data_paths, eval_data_paths, predict_data_paths, output_dir)

p.run()
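
In the cell above, `AnalyzeAndTransformDataset` derives `transform_fn` from the training data, and `TransformDataset` then applies that same function to the evaluation data. The plain-Python sketch below illustrates this two-phase idea for a single column, assuming that `scale_to_0_1` is a min-max rescaling, as its name suggests. The function names and the sample values are made up for illustration.

```python
def analyze_min_max(values):
    """Analysis phase: one full pass over the training values."""
    return min(values), max(values)

def make_scale_fn(min_value, max_value):
    """Transform phase: a per-value function with the statistics frozen in."""
    span = float(max_value - min_value)

    def scale(x):
        # A constant column has no spread; map it to 0.0 (a choice made
        # for this sketch only).
        return (x - min_value) / span if span else 0.0

    return scale

train_passengers = [1, 2, 6, 3]  # made-up values
eval_passengers = [2, 4, 7]      # made-up values

# Statistics come from the training data only.
scale = make_scale_fn(*analyze_min_max(train_passengers))

print([scale(x) for x in train_passengers])
print([scale(x) for x in eval_passengers])
```

Note that an evaluation value outside the training range (7 here) lands outside [0, 1], because the minimum and maximum are not recomputed on the evaluation data.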

<h2> Prediction </h2>

Make sure that the training job has completed before proceeding to this step (check the log above).

To predict the taxifare for new inputs, you first have to deploy the trained model (deleting a previous one if necessary):

In [None]:
%bash
# Work around https://buganizer.corp.google.com/issues/31730085
gsutil cp gs://$BUCKET/taxifare/taxi_preproc/metadata.yaml gs://$BUCKET/taxifare/taxi_trained/model/

In [None]:
%mlalpha delete --name taxifare.v1

In [None]:
%mlalpha deploy --name taxifare.v1 --path gs://$BUCKET/taxifare/taxi_trained/model/

In [None]:
from googleapiclient import discovery
from oauth2client.client import GoogleCredentials
import json

import google.cloud.ml.features as features
from google.cloud.ml import session_bundle

credentials = GoogleCredentials.get_application_default()
api = discovery.build('ml', 'v1beta1', credentials=credentials,
            discoveryServiceUrl='https://storage.googleapis.com/cloud-ml/discovery/ml_v1beta1_discovery.json')

request_data = {'instances':
  [
    {'examples':
      {
        'pickup_longitude': -73.885262,
        'pickup_latitude': 40.773008,
        'dropoff_longitude': -73.987232,
        'dropoff_latitude': 40.732403,
        'passenger_count': 2,
        'fare_amount': -999
      }
    }
  ]
}

parent = 'projects/%s/models/%s/versions/%s' % (PROJECT, 'taxifare', 'v1')
response = api.projects().predict(body=request_data, name=parent).execute()
print("response={0}".format(response))
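
The request above has two parts: a resource name of the form `projects/<project>/models/<model>/versions/<version>` and a JSON body with one entry per instance. The sketch below builds both without calling the service, which can help when inspecting a request before sending it. The helper names `version_name` and `make_request_body` are hypothetical, and the field values are copied from the request above.

```python
import json

def version_name(project, model, version):
    """Resource name in the same format as `parent` in the cell above."""
    return 'projects/{0}/models/{1}/versions/{2}'.format(
        project, model, version)

def make_request_body(examples):
    """Wrap each example dictionary the way the request above does."""
    return {'instances': [{'examples': example} for example in examples]}

body = make_request_body([{
    'pickup_longitude': -73.885262,
    'pickup_latitude': 40.773008,
    'dropoff_longitude': -73.987232,
    'dropoff_latitude': 40.732403,
    'passenger_count': 2,
    'fare_amount': -999,  # same placeholder value as in the request above
}])

print(version_name('cloud-training-demos', 'taxifare', 'v1'))
print(json.dumps(body, indent=2, sort_keys=True))
```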

Copyright 2016 Google Inc. Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.