<h1> Scaling up training using Cloud ML Engine </h1>

In this notebook, you will take a previously developed TensorFlow model that predicts taxi fares and package it so that it can run on Cloud Machine Learning Engine (MLE). For now, the model is trained on a small dataset. The model is still rather simplistic, so its accuracy is not great. However, this notebook illustrates *how* to package a TensorFlow model to run on Cloud MLE.

Later in the course, you will look at ways to make a more effective machine learning model.



---
Before you start, **make sure that you are logged in with your student account**. Otherwise you may incur Google Cloud charges for using this notebook. 

---

Also, remember to uncheck "Reset all runtimes before running" when executing the next cell.

Resetting the runtime deletes any files you have on the notebook file system.

![](https://i.imgur.com/9dgw0h0.png)


In [0]:
#@markdown Copy-paste your GCP Project ID in the following field:

PROJECT = "" #@param {type: "string"}

#@markdown Next, use Shift-Enter to run this cell and complete authentication.

try:
  from google.colab import auth
  auth.authenticate_user()
  print("AUTHENTICATED")
except Exception as e:
  # Report the reason instead of silently swallowing every error
  print("FAILED to authenticate:", e)

#Modify the following to use a different bucket and/or region
#for Google Cloud Storage and for Cloud MLE
BUCKET = PROJECT  
REGION = "us-central1"  

# Copy taxi-*.csv files from github if they are missing from the runtime.
!wget --quiet -nc https://github.com/osipov/training-data-analyst/raw/master/bootcamps/serverless_ml/taxi-11k-datasets.zip
!unzip -q -n taxi-11k-datasets.zip 

<h2> Environment variables</h2>

The previous code cell initialized the Python runtime with values for your project ID, Google Cloud Storage bucket ID, and a Google Cloud region where your data is stored and processed by Cloud MLE jobs.

The next code cell takes these default values and uses the Python `os` library to create bash shell environment variables for the project, bucket, region, and TensorFlow version. This lets the bash scripts later in this notebook use the same values.

In [0]:
# for bash
import os
os.environ['PROJECT'] = PROJECT
os.environ['BUCKET'] = BUCKET
os.environ['REGION'] = REGION
os.environ['TF_VERSION'] = '1.12'  # Latest TensorFlow version supported by Cloud MLE
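As a quick check that values set through `os.environ` reach other processes, the sketch below sets a variable and reads it back from a separate process. It assumes that, like a `%%bash` cell, the child process inherits the environment of the Python runtime. The variable name `DEMO_REGION` is hypothetical and used only for illustration.

```python
import os
import subprocess
import sys

# Hypothetical variable name, used only for this illustration.
os.environ["DEMO_REGION"] = "us-central1"

# Launch a separate process; it inherits a copy of the parent's environment.
child = subprocess.run(
    [sys.executable, "-c", "import os; print(os.environ['DEMO_REGION'])"],
    capture_output=True,
    text=True,
    check=True,
)
seen_by_child = child.stdout.strip()
print(seen_by_child)
```

Note that an ordinary Python assignment such as `REGION = "us-central1"` is not visible to child processes; only values stored in `os.environ` are.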

Colab includes the latest version of the [Google Cloud SDK](https://cloud.google.com/sdk), which provides a set of management tools for a Google Cloud account. One of the tools in the SDK is `gcloud`, a general-purpose tool for managing almost all services on Google Cloud. The next code cell uses `gcloud` to configure the shell environment to use your GCP Project ID and the default Google Cloud region.

In [0]:
%%bash
gcloud config set project $PROJECT
gcloud config set compute/region $REGION

The default security settings of Google Cloud do not allow Cloud MLE (or any other service) to access private data stored in your storage bucket. Next, you will run a standard script that lets Cloud MLE read and write data in your project's bucket. This ensures that MLE can read training, validation, and test data from the bucket and write checkpoints of a trained TensorFlow model to the bucket.

In [0]:
%%bash
PROJECT_ID=$PROJECT
AUTH_TOKEN=$(gcloud auth print-access-token)
SVC_ACCOUNT=$(curl -X GET -H "Content-Type: application/json" \
    -H "Authorization: Bearer $AUTH_TOKEN" \
    https://ml.googleapis.com/v1/projects/${PROJECT_ID}:getConfig \
    | python -c "import json; import sys; response = json.load(sys.stdin); \
    print(response['serviceAccount'])")

echo "Authorizing the Cloud ML Service $SVC_ACCOUNT to access files in $BUCKET"
gsutil -m defacl ch -u $SVC_ACCOUNT:R gs://$BUCKET

echo "NOTE: the following CommandException (No URLs matched if bucket is empty) can be ignored"
gsutil -q -m acl ch -u $SVC_ACCOUNT:R -r gs://$BUCKET  # error message (if bucket is empty) can be ignored

gsutil -m acl ch -u $SVC_ACCOUNT:W gs://$BUCKET

Don't worry if you see an exception message about `No URLs matched` in the previous cell. It just means that the bucket was empty when the script ran.
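The inline Python one-liner in the previous cell only pulls the `serviceAccount` field out of the JSON returned by the `getConfig` call. A minimal sketch of that step on its own is shown below. The response body and the account address in it are hypothetical placeholders; a real response contains the values for your own project.

```python
import json


def extract_service_account(response_text):
    """Return the 'serviceAccount' field from a getConfig JSON response."""
    response = json.loads(response_text)
    return response["serviceAccount"]


# Hypothetical response body, for illustration only.
sample_response = '{"serviceAccount": "service-123@example.iam.gserviceaccount.com"}'

svc_account = extract_service_account(sample_response)
print(svc_account)
```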

<h2> Packaging up the code </h2>

The next script takes the code from GitHub and organizes it into a standard Python package structure. The model is now located in `model.py`, while the main entry point into the model is in `task.py`. The `find` command at the end of the script shows the directory structure of the Python package. This matches what you learned earlier about packaging TensorFlow models for Cloud MLE.

In [0]:
%%bash
rm -rf taxifare
mkdir -p taxifare/trainer

for file in taxifare/setup.py \
            taxifare/trainer/__init__.py \
            taxifare/trainer/model.py \
            taxifare/trainer/task.py
do
  wget --quiet -nc \
  https://github.com/osipov/training-data-analyst/raw/master/bootcamps/serverless_ml/cloudmle/$file \
  -O $file
done

find taxifare
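To see the package layout without downloading anything, the sketch below recreates the same four paths as empty placeholder files in a temporary directory and lists them, much as `find` does. The helper names `create_layout` and `list_files` are hypothetical; only the file paths come from the cell above.

```python
import os
import tempfile

# Files that make up the package, as downloaded by the cell above.
PACKAGE_FILES = [
    "taxifare/setup.py",
    "taxifare/trainer/__init__.py",
    "taxifare/trainer/model.py",
    "taxifare/trainer/task.py",
]


def create_layout(root):
    """Create empty placeholder files in the package layout under root."""
    for relative_path in PACKAGE_FILES:
        full_path = os.path.join(root, relative_path)
        os.makedirs(os.path.dirname(full_path), exist_ok=True)
        open(full_path, "w").close()


def list_files(root):
    """Return the relative paths of all files under root, sorted."""
    found = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            relative = os.path.relpath(os.path.join(dirpath, name), root)
            found.append(relative.replace(os.sep, "/"))
    return sorted(found)


with tempfile.TemporaryDirectory() as scratch:
    create_layout(scratch)
    layout = list_files(scratch)

print("\n".join(layout))
```

The `__init__.py` file is what marks `trainer` as a Python package, which is why the module can later be referred to as `trainer.task`.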

Take a few minutes to confirm that `model.py` contains the code you used earlier...


In [0]:
!cat taxifare/trainer/model.py

... and that the code in `task.py` isn't too surprising.

In [0]:
!cat taxifare/trainer/task.py
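For orientation, an entry point of this kind usually does little more than parse command-line flags and hand them to the model code. The sketch below is an assumption about the general shape, not a copy of the downloaded `task.py`. The flag names match those passed to the trainer in the training job later in this notebook; the default for `--train_steps` and the bucket name `my-bucket` are hypothetical.

```python
import argparse


def parse_args(argv=None):
    """Parse the flags that the training job passes to the trainer module."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--train_data_paths", required=True)
    parser.add_argument("--eval_data_paths", required=True)
    parser.add_argument("--output_dir", required=True)
    # Hypothetical default; the job in this notebook sets the value explicitly.
    parser.add_argument("--train_steps", type=int, default=1000)
    # Assumption: the service also forwards --job-dir to the trainer.
    parser.add_argument("--job-dir", default=None)
    return parser.parse_args(argv)


# Example invocation with a hypothetical bucket name.
args = parse_args([
    "--train_data_paths", "gs://my-bucket/taxifare/11k/taxi-train*",
    "--eval_data_paths", "gs://my-bucket/taxifare/11k/taxi-valid*",
    "--output_dir", "gs://my-bucket/taxifare/11k/taxi_trained",
    "--train_steps", "10000",
])
print(args.train_steps, args.output_dir)
```

In a real entry point, the parsed values would then be passed to a training function defined in `model.py`.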

<h2> Submit training job using gcloud </h2>

Copy the training data from the Colab file system to the storage bucket:

In [0]:
%%bash
echo $BUCKET
gsutil -m rm -rf gs://${BUCKET}/taxifare/11k/*.csv
gsutil -m cp ${PWD}/*.csv gs://${BUCKET}/taxifare/11k/

Don't worry if the previous code cell shows a message about files or objects that could not be removed. The message appears because the `gsutil rm` command tries to clean up the directory for the training, validation, and test CSV files, and there is nothing to remove on the first run. The `gsutil` command is another tool from the Google Cloud SDK. It is used to manage storage buckets and to move data to and from buckets over the network.




In the next cell, submit a training job to Cloud MLE using the `gcloud` command. Notice that the command uses the Python package in the `taxifare/trainer` directory of your Colab environment. In contrast, the training and validation data files are read from the storage bucket. The `OUTDIR` variable specifies the location in the storage bucket where MLE stores model checkpoint files. In upcoming notebooks, you will use the trained model from that location.

In [0]:
%%bash
OUTDIR=gs://${BUCKET}/taxifare/11k/taxi_trained

JOBNAME=mle_train_$(date -u +%y%m%d_%H%M%S)

echo $OUTDIR $REGION $JOBNAME

gsutil -m rm -rf $OUTDIR
gcloud ml-engine jobs submit training $JOBNAME \
   --region=$REGION \
   --module-name=trainer.task \
   --package-path=${PWD}/taxifare/trainer \
   --job-dir=$OUTDIR \
   --staging-bucket=gs://$BUCKET \
   --scale-tier=BASIC \
   --runtime-version=${TF_VERSION} \
   -- \
   --train_data_paths="gs://${BUCKET}/taxifare/11k/taxi-train*" \
   --eval_data_paths="gs://${BUCKET}/taxifare/11k/taxi-valid*"  \
   --output_dir=$OUTDIR \
   --train_steps=10000
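To make the structure of the command easier to see, the sketch below assembles the same arguments as a Python list without running anything. The function name `build_submit_command` and the bucket name `my-bucket` are hypothetical; the flags and paths mirror the cell above. Everything before the bare `--` is an option for `gcloud`, and everything after it is passed to `trainer.task`.

```python
import os
from datetime import datetime, timezone


def build_submit_command(bucket, region, tf_version, now=None):
    """Build the gcloud argument list for the training job (does not run it)."""
    now = now or datetime.now(timezone.utc)
    # Same pattern as: mle_train_$(date -u +%y%m%d_%H%M%S)
    job_name = "mle_train_" + now.strftime("%y%m%d_%H%M%S")
    data_dir = "gs://{}/taxifare/11k".format(bucket)
    out_dir = data_dir + "/taxi_trained"
    return [
        "gcloud", "ml-engine", "jobs", "submit", "training", job_name,
        "--region=" + region,
        "--module-name=trainer.task",
        "--package-path=" + os.path.join(os.getcwd(), "taxifare", "trainer"),
        "--job-dir=" + out_dir,
        "--staging-bucket=gs://" + bucket,
        "--scale-tier=BASIC",
        "--runtime-version=" + tf_version,
        "--",  # arguments after this marker go to the trainer module
        "--train_data_paths=" + data_dir + "/taxi-train*",
        "--eval_data_paths=" + data_dir + "/taxi-valid*",
        "--output_dir=" + out_dir,
        "--train_steps=10000",
    ]


# Fixed timestamp and hypothetical bucket so the result is reproducible.
command = build_submit_command(
    "my-bucket", "us-central1", "1.12",
    now=datetime(2019, 1, 2, 3, 4, 5, tzinfo=timezone.utc),
)
print(" ".join(command))
```

Because the job name includes a UTC timestamp down to the second, each submission gets a distinct name, which matters because job names must be unique within a project.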

After you submit the job, you should see a message confirming that your job was QUEUED. To monitor the progress of the job from the GCP user interface, navigate to the [Jobs](https://console.cloud.google.com/mlengine/jobs) page of the Cloud ML Engine service. Use the "View Logs" link to see the details. In the upcoming lab, you will also monitor training details using TensorBoard.

<h2>Recap</h2>

In this notebook, you configured security permissions to allow Cloud MLE to access files on your Google Cloud Storage bucket. Recall that you are using the bucket to support scalability to very large (petabyte sized) datasets. 

Next, you packaged the TensorFlow code in the `taxifare` directory, following Python packaging conventions. You checked that the `task.py` file is the main entry point into your model and that its code passes various parameters to your model.

Finally, you copied the training and validation csv files to your storage bucket, launched the training process, and viewed the logs to confirm that the training started.

<b>Don't wait for training to finish!</b> Once you have confirmed that the training job has started, you're done with this lab and are ready to continue with the rest of this session!

Copyright 2019 Counter Factual .AI LLC. Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License