# LAB 5a: Training a Keras model on Vertex AI

**Learning Objectives**

1. Set up the environment
1. Create trainer module's task.py to hold hyperparameter argparsing code
1. Create trainer module's model.py to hold Keras model code
1. Run trainer module package locally
1. Submit training job to Vertex AI
1. Submit hyperparameter tuning job to Vertex AI


## Introduction
Having tested our training pipeline both locally and in the cloud on a subset of the data, we can submit another (much larger) training job to the cloud. It is also a good idea to run a hyperparameter tuning job to make sure we have optimized the hyperparameters of our model.

In this notebook, we'll be training our Keras model at scale using Vertex AI.

In this lab, we will set up the environment, create the trainer module's task.py to hold hyperparameter argparsing code, create the trainer module's model.py to hold Keras model code, run the trainer module package locally, submit a training job to Vertex AI, and submit a hyperparameter tuning job to Vertex AI.

Each learning objective will correspond to a __#TODO__ in this student lab notebook -- try to complete this notebook first and then review the [solution notebook](../solutions/5a_train_keras_ai_platform_babyweight_vertex.ipynb).

## Set up environment variables and load necessary libraries

First we will install the `cloudml-hypertune` package on our local machine. This is the package which we will use to report hyperparameter tuning metrics to Vertex AI. Installing the package will allow us to test our trainer package locally.

In [None]:
try:
    import hypertune

except ImportError:
    !pip3 install -U cloudml-hypertune --user

    print("Please restart the kernel and re-run the notebook.")

If the above command resulted in an installation, please restart the notebook kernel and re-run the notebook.

Import necessary libraries.

In [None]:
import os

### Lab Task #1: Set environment variables.

Set environment variables so that we can use them throughout the entire lab. We will be using our project name for our bucket, so you only need to change your project and region.

In [None]:
%%bash
export PROJECT=$(gcloud config list project --format "value(core.project)")
echo "Your current GCP Project Name is: "${PROJECT}

In [None]:
# TODO: Change these to try this notebook out
PROJECT = "asl-ml-immersion"  # Replace with your PROJECT
BUCKET = PROJECT  # defaults to PROJECT
REGION = "us-central1"  # Replace with your REGION

In [None]:
os.environ["PROJECT"] = PROJECT
os.environ["BUCKET"] = BUCKET
os.environ["REGION"] = REGION

Create the bucket if it does not exist, and confirm below that the bucket is regional and that its region matches the specified region:

In [None]:
%%bash
if ! gsutil ls | grep -q gs://${BUCKET}/; then
    gsutil mb -l ${REGION} gs://${BUCKET}
fi
gsutil ls -Lb gs://$BUCKET | grep "gs://\|Location"
echo $REGION

In [None]:
%%bash
gcloud config set project ${PROJECT}
gcloud config set ai/region ${REGION}

## Check data exists

Verify that the CSV files we'll be using for training and evaluation already exist. If they do not, go back to lab [1b_prepare_data_babyweight](../solutions/1b_prepare_data_babyweight.ipynb) to create them.

In [None]:
%%bash
gsutil ls gs://${BUCKET}/babyweight/data/*000000000000.csv

Now that we have the [Keras wide-and-deep code](../solutions/4c_keras_wide_and_deep_babyweight.ipynb) working on a subset of the data, we can package the TensorFlow code up as a Python module and train it on Vertex AI.

## Train on Vertex AI

Training on Vertex AI requires:
* Making the code a Python source distribution
* Using gcloud to submit the training code to [Vertex AI](https://console.cloud.google.com/vertex-ai)

Ensure that the Vertex AI API is enabled by going to this [link](https://console.developers.google.com/apis/library/aiplatform.googleapis.com).

### Move code into a Python package

A Python package is simply a collection of one or more `.py` files along with an `__init__.py` file to identify the containing directory as a package. The `__init__.py` sometimes contains initialization code but for our purposes an empty file suffices.
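As a minimal sketch of this idea, the snippet below builds a throwaway package in a temporary directory and imports a module from it. The names `demo_trainer` and `answer` are hypothetical and exist only for this illustration; they are not part of the lab's `trainer` package.

```python
import importlib
import sys
import tempfile
from pathlib import Path

# Build a throwaway package in a temporary directory (hypothetical names).
with tempfile.TemporaryDirectory() as root:
    pkg_dir = Path(root) / "demo_trainer"
    pkg_dir.mkdir()
    (pkg_dir / "__init__.py").touch()  # empty file identifies the package
    (pkg_dir / "model.py").write_text("def answer():\n    return 42\n")

    # Make the parent directory importable, similar in spirit to
    # extending PYTHONPATH before running a module.
    sys.path.insert(0, root)
    importlib.invalidate_caches()
    try:
        model = importlib.import_module("demo_trainer.model")
        result = model.answer()
    finally:
        sys.path.remove(root)

print(result)
```

The same layout (a directory, an `__init__.py`, and one or more `.py` modules) is what `setuptools.find_packages()` looks for when a source distribution is built.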

The bash command `touch` creates an empty file at the specified location. The `mkdir -p` command that precedes it creates the `babyweight/trainer` directory if it does not already exist.

In [None]:
%%bash
mkdir -p babyweight/trainer
touch babyweight/trainer/__init__.py

We then use the `%%writefile` magic to write the contents of the cell below to a file called `task.py` in the `babyweight/trainer` folder.

### Lab Task #2: Create trainer module's task.py to hold hyperparameter argparsing code.

The cell below writes the file `babyweight/trainer/task.py`, which sets up our training job. Here is where we determine which parameters of our model to pass as flags during training, using the `argparse` module. Look at how `batch_size` is passed to the model in the code below. Use this as an example to parse arguments for the following variables:
- `nnsize`, which represents the hidden layer sizes to use for DNN feature columns
- `nembeds`, which represents the embedding size of a cross of n key real-valued parameters
- `num_epochs`, which represents the number of epochs to train for
- `train_examples`, which represents the number of examples (in thousands) to run the training job
- `eval_steps`, which represents the positive number of steps for which to evaluate the model

Be sure to include a default value for each of the parsed arguments above and specify the `type` if necessary.
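As a hedged illustration of `type` and `default` in `argparse`, the sketch below parses one scalar flag and one list-valued flag. The flag names `--learning_rate` and `--layer_sizes` are hypothetical and deliberately different from the ones this task asks for; the default values are arbitrary.

```python
import argparse

parser = argparse.ArgumentParser()

# Hypothetical flags for illustration, not the ones the lab asks for.
parser.add_argument(
    "--learning_rate",
    help="Step size for the optimizer.",
    type=float,  # converts the command-line string to a float
    default=0.001,  # used when the flag is not supplied
)
parser.add_argument(
    "--layer_sizes",
    help="Space-separated hidden layer sizes.",
    type=int,  # applied to each value in the list
    nargs="+",  # accept one or more values
    default=[128, 64],
)

# Passing a list to parse_args simulates a command line.
args = parser.parse_args(["--layer_sizes", "64", "32"])
print(args.learning_rate, args.layer_sizes)
```

Without `type`, every parsed value is a string, which is why numeric flags usually need it.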

In [None]:
%%writefile babyweight/trainer/task.py
import argparse
import json
import os

from trainer import model

import tensorflow as tf

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--train_data_path",
        help="GCS location of training data",
        required=True,
    )
    parser.add_argument(
        "--eval_data_path",
        help="GCS location of evaluation data",
        required=True,
    )
    parser.add_argument(
        "--output_dir",
        help="GCS location to write checkpoints and export models",
        default=os.getenv("AIP_MODEL_DIR"),
    )
    parser.add_argument(
        "--batch_size",
        help="Number of examples to compute gradient over.",
        type=int,
        default=512,
    )

    # TODO: Add nnsize argument

    # TODO: Add nembeds argument

    # TODO: Add num_epochs argument

    # TODO: Add train_examples argument

    # TODO: Add eval_steps argument

    # Parse all arguments
    args = parser.parse_args()
    arguments = args.__dict__

    # Modify some arguments
    arguments["train_examples"] *= 1000

    # Run the training job
    model.train_and_evaluate(arguments)

In the same way we can write to the file `model.py` the model that we developed in the previous notebooks. 

### Lab Task #3: Create trainer module's model.py to hold Keras model code.

Complete the TODOs in the code cell below to create our `model.py`. We'll use the code we wrote for the Wide & Deep model. Look back at your [4c_keras_wide_and_deep_babyweight](../solutions/4c_keras_wide_and_deep_babyweight.ipynb) notebook and copy/paste the necessary code from that notebook into its place in the cell below.

In [None]:
%%writefile babyweight/trainer/model.py
import datetime
import os
import shutil
import numpy as np
import tensorflow as tf
import hypertune

# Determine CSV, label, and key columns
# TODO: Add CSV_COLUMNS and LABEL_COLUMN

# Set default values for each CSV column.
# Treat is_male and plurality as strings.
# TODO: Add DEFAULTS


def features_and_labels(row_data):
    # TODO: Add your code here
    pass


def load_dataset(pattern, batch_size=1, mode=tf.estimator.ModeKeys.EVAL):
    # TODO: Add your code here
    pass


def create_input_layers():
    # TODO: Add your code here
    pass


def categorical_fc(name, values):
    # TODO: Add your code here
    pass


def create_feature_columns(nembeds):
    # TODO: Add your code here
    pass


def get_model_outputs(wide_inputs, deep_inputs, dnn_hidden_units):
    # TODO: Add your code here
    pass


def rmse(y_true, y_pred):
    # TODO: Add your code here
    pass


def build_wide_deep_model(dnn_hidden_units=[64, 32], nembeds=3):
    # TODO: Add your code here
    pass


# Instantiate the HyperTune reporting object
hpt = hypertune.HyperTune()

# Reporting callback
class HPTCallback(tf.keras.callbacks.Callback):
    def on_epoch_end(self, epoch, logs=None):
        global hpt
        hpt.report_hyperparameter_tuning_metric(
            hyperparameter_metric_tag="val_rmse",
            metric_value=logs["val_rmse"],
            global_step=epoch,
        )


def train_and_evaluate(args):
    model = build_wide_deep_model(args["nnsize"], args["nembeds"])
    print("Here is our Wide-and-Deep architecture so far:\n")
    model.summary()  # summary() prints directly and returns None

    trainds = load_dataset(
        args["train_data_path"],
        args["batch_size"],
        tf.estimator.ModeKeys.TRAIN,
    )

    evalds = load_dataset(
        args["eval_data_path"], 1000, tf.estimator.ModeKeys.EVAL
    )
    if args["eval_steps"]:
        evalds = evalds.take(count=args["eval_steps"])

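    # train_examples is the total for the whole run, so each of the
    # num_epochs epochs gets train_examples // (batch_size * num_epochs) steps.
    num_batches = args["batch_size"] * args["num_epochs"]
    steps_per_epoch = args["train_examples"] // num_batches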

    checkpoint_path = os.path.join(
        args["output_dir"], "checkpoints/babyweight"
    )
    cp_callback = tf.keras.callbacks.ModelCheckpoint(
        filepath=checkpoint_path, verbose=1, save_weights_only=True
    )

    history = model.fit(
        trainds,
        validation_data=evalds,
        epochs=args["num_epochs"],
        steps_per_epoch=steps_per_epoch,
        verbose=2,  # 0=silent, 1=progress bar, 2=one line per epoch
        callbacks=[cp_callback, HPTCallback()],
    )

    EXPORT_PATH = os.path.join(
        args["output_dir"], datetime.datetime.now().strftime("%Y%m%d%H%M%S")
    )
    tf.saved_model.save(
        obj=model, export_dir=EXPORT_PATH
    )  # with default serving function

    print("Exported trained model to {}".format(EXPORT_PATH))

## Train locally

After moving the code to a package, make sure it works as a standalone module. Note that we incorporated the `--train_examples` flag so that we don't try to train on the entire dataset while we are developing our pipeline. Once we are sure that everything is working on a subset, we can change it so that we can train on all the data. Even for this subset, this takes about *3 minutes*, during which you won't see any output.
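To see how `--train_examples` limits the amount of training, the sketch below mirrors the arithmetic in `task.py` (which multiplies the flag by 1000) and in `train_and_evaluate` (which divides by `batch_size * num_epochs`). The helper name `steps_per_epoch` and the input values are chosen for illustration only.

```python
def steps_per_epoch(train_examples_thousands, batch_size, num_epochs):
    """Mirror the step arithmetic used by task.py and train_and_evaluate."""
    # task.py scales the flag value from thousands to individual examples.
    train_examples = train_examples_thousands * 1000
    # The examples are spread across all epochs, then grouped into batches.
    return train_examples // (batch_size * num_epochs)


# Illustrative values only.
small_run = steps_per_epoch(1, batch_size=10, num_epochs=1)
larger_run = steps_per_epoch(20, batch_size=32, num_epochs=10)
print(small_run, larger_run)
```

Because the division is integer division, a `train_examples` value that is too small relative to `batch_size * num_epochs` yields zero steps per epoch.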

### Lab Task #4: Run trainer module package locally.

Fill in the missing code in the TODOs below so that we can run a very small training job over a single file with a small batch size, 1 epoch, 1 train example, and 1 eval step.

In [None]:
%%bash
OUTDIR=babyweight_trained
rm -rf ${OUTDIR}
export PYTHONPATH=${PYTHONPATH}:${PWD}/babyweight
python3 -m trainer.task \
    --train_data_path=gs://${BUCKET}/babyweight/data/train*.csv \
    --eval_data_path=gs://${BUCKET}/babyweight/data/eval*.csv \
    --output_dir=${OUTDIR} \
    --batch_size=# TODO: Add batch size
    --num_epochs=# TODO: Add the number of epochs to train for
    --train_examples=# TODO: Add the number of examples to train each epoch for
    --eval_steps=# TODO: Add the number of evaluation batches to run

## Lab Task #5: Training on Vertex AI

Now that we see everything is working locally, it's time to train on the cloud! First, we need to package our code as a source distribution. For this, we can use `setuptools`. 

In [None]:
%%writefile babyweight/setup.py
from setuptools import find_packages
from setuptools import setup

setup(
    name="babyweight_trainer",
    version="0.1",
    packages=find_packages(),
    include_package_data=True,
    description="Babyweight model training application.",
)

In [None]:
%%bash
cd babyweight
python ./setup.py sdist --formats=gztar
cd ..

We will store our package in the Cloud Storage bucket.

In [None]:
%%bash
gsutil cp babyweight/dist/babyweight_trainer-0.1.tar.gz gs://${BUCKET}/babyweight/

To submit to the cloud, we use [`gcloud ai custom-jobs create`](https://cloud.google.com/sdk/gcloud/reference/ai/custom-jobs/create) and specify some additional parameters for the Vertex AI Training Service:
- display-name: A unique identifier for the Cloud job. We usually append system time to ensure uniqueness
- region: Cloud region to train in. See [here](https://cloud.google.com/vertex-ai/docs/general/locations) for supported Vertex AI Training Service regions
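As a minimal sketch of the timestamp convention for `display-name`, the Python below builds a name using the same `%Y%m%d_%H%M%S` UTC format that this lab's bash cells produce with `date -u`. The helper `make_job_name` is a hypothetical name introduced here, and the result is only unique down to one-second resolution.

```python
from datetime import datetime, timezone


def make_job_name(prefix):
    """Append a UTC timestamp so repeated submissions get distinct names."""
    timestamp = datetime.now(timezone.utc).strftime("%Y%m%d_%H%M%S")
    return f"{prefix}_{timestamp}"


job_name = make_job_name("babyweight")
print(job_name)
```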

You might have seen `gcloud ai custom-jobs create` run earlier with the worker pool spec and the pass-through Python arguments specified directly on the command line. Here we will use a YAML file instead, which will make it easier to transition to hyperparameter tuning.

Through the `args:` argument we add in the passed-through arguments for our `task.py` file.
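The pass-through mechanism can be pictured as a list of `--name=value` strings that reaches the module's argument parser unchanged. The sketch below assumes a hypothetical helper `to_cli_args` and made-up values (including the bucket name `example-bucket`); it only illustrates the string format, not how Vertex AI launches the job.

```python
import argparse


def to_cli_args(params):
    """Turn a dict into the --name=value strings listed under args:."""
    return [f"--{name}={value}" for name, value in params.items()]


# Hypothetical values for illustration only.
cli_args = to_cli_args(
    {"batch_size": 64, "output_dir": "gs://example-bucket/out"}
)

# A parser like the one in task.py receives exactly these strings.
parser = argparse.ArgumentParser()
parser.add_argument("--batch_size", type=int, default=512)
parser.add_argument("--output_dir")
parsed = parser.parse_args(cli_args)

print(cli_args)
print(parsed.batch_size, parsed.output_dir)
```

Any flag listed under `args:` that the parser does not define causes `argparse` to exit with an error, so the YAML entries and the parser need to stay in sync.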

Complete the __#TODO__s to make sure you have the necessary user_args for our task.py's parser.

In [None]:
%%bash

TIMESTAMP=$(date -u +%Y%m%d_%H%M%S)
OUTDIR=gs://${BUCKET}/babyweight/trained_model_$TIMESTAMP
JOB_NAME=babyweight_$TIMESTAMP

PYTHON_PACKAGE_URI=gs://${BUCKET}/babyweight/babyweight_trainer-0.1.tar.gz
PYTHON_PACKAGE_EXECUTOR_IMAGE_URI="us-docker.pkg.dev/vertex-ai/training/tf-cpu.2-3:latest"
PYTHON_MODULE=trainer.task

echo > ./config.yaml "workerPoolSpecs:
  machineSpec:
    machineType: n1-standard-4
  replicaCount: 1
  pythonPackageSpec:
    executorImageUri: $PYTHON_PACKAGE_EXECUTOR_IMAGE_URI
    packageUris: $PYTHON_PACKAGE_URI
    pythonModule: $PYTHON_MODULE
    args:
    - --train_data_path=# TODO: Add path to training data in GCS
    - --eval_data_path=# TODO: Add path to evaluation data in GCS
    - --output_dir=$OUTDIR
    - --num_epochs=# TODO: Add the number of epochs to train for
    - --train_examples=# TODO: Add the number of examples to train each epoch for
    - --eval_steps=# TODO: Add the number of evaluation batches to run
    - --batch_size=# TODO: Add batch size
    - --nembeds=# TODO: Add number of embedding dimensions"

gcloud ai custom-jobs create \
  --region=${REGION} \
  --display-name=$JOB_NAME \
  --config=config.yaml

The training job should complete within 10 to 15 minutes. You will need a trained model to complete our next lab.

## Lab Task #6: Hyperparameter tuning

To do hyperparameter tuning, create a YAML file and pass its name with `--config`.
This step could take **hours**; you can increase `--parallel-trial-count` or reduce `--max-trial-count` to get it done faster. Since `--parallel-trial-count` is the number of initial seeds to start searching from, you don't want it to be too large; otherwise, all you have is a random search.
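A rough way to reason about this trade-off is to count how many sequential rounds of trials the job needs. The sketch below is a simplification that assumes every trial takes about the same time and that trials run in full batches; the helper name `sequential_rounds` is hypothetical.

```python
import math


def sequential_rounds(max_trial_count, parallel_trial_count):
    """Approximate number of back-to-back batches of trials."""
    return math.ceil(max_trial_count / parallel_trial_count)


rounds_default = sequential_rounds(20, 5)
rounds_all_parallel = sequential_rounds(20, 20)
print(rounds_default, rounds_all_parallel)
```

With 20 trials run 5 at a time, later rounds can use the results of earlier ones. With all 20 run at once there is a single round, so no trial can benefit from another's result.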

Complete __#TODO__s in the yaml file and gcloud training job bash command so that we can run hyperparameter tuning.

In [None]:
%%bash
TIMESTAMP=$(date -u +%Y%m%d_%H%M%S)
BASE_OUTPUT_DIR=gs://${BUCKET}/babyweight/hp_tuning_$TIMESTAMP
JOB_NAME=babyweight_hpt_$TIMESTAMP

PYTHON_PACKAGE_URI=gs://${BUCKET}/babyweight/babyweight_trainer-0.1.tar.gz
PYTHON_PACKAGE_EXECUTOR_IMAGE_URI="us-docker.pkg.dev/vertex-ai/training/tf-cpu.2-3:latest"
PYTHON_MODULE=trainer.task

echo > ./hyperparam.yaml "displayName: $JOB_NAME
studySpec:
  metrics:
  - metricId: # TODO: Add metric we want to optimize
    goal: # TODO: MAXIMIZE or MINIMIZE?
  parameters:
  - parameterId: batch_size
    # TODO: What datatype (which ValueSpec)?
      minValue: # TODO: Choose a min value
      maxValue: # TODO: Choose a max value
    scaleType: # TODO: UNIT_LINEAR_SCALE or UNIT_LOG_SCALE?
  - parameterId: nembeds
    # TODO: What datatype (which ValueSpec)?
      minValue: # TODO: Choose a min value
      maxValue: # TODO: Choose a max value
    scaleType: # TODO: UNIT_LINEAR_SCALE or UNIT_LOG_SCALE?
  algorithm: ALGORITHM_UNSPECIFIED # results in Bayesian optimization
trialJobSpec:
  baseOutputDirectory:
    outputUriPrefix: $BASE_OUTPUT_DIR
  workerPoolSpecs:
  - machineSpec:
      machineType: n1-standard-8
    pythonPackageSpec:
      executorImageUri: $PYTHON_PACKAGE_EXECUTOR_IMAGE_URI
      packageUris:
      - $PYTHON_PACKAGE_URI
      pythonModule: $PYTHON_MODULE
      args:
      - --train_data_path=gs://${BUCKET}/babyweight/data/train*.csv
      - --eval_data_path=gs://${BUCKET}/babyweight/data/eval*.csv
      - --num_epochs=10
      - --train_examples=5000
      - --eval_steps=100
      - --batch_size=32
      - --nembeds=8
    replicaCount: 1"
        
gcloud beta ai hp-tuning-jobs create \
    --region=$REGION \
    --display-name=$JOB_NAME \
    --# TODO: Add config for hyperparam.yaml
    --max-trial-count=20 \
    --parallel-trial-count=5

## Repeat training

This time, we train with the tuned values for `batch_size` and `nembeds`. Note that your best results may differ from the values below, so be sure to fill in your own.

In [None]:
%%bash
TIMESTAMP=$(date -u +%Y%m%d_%H%M%S)
OUTDIR=gs://${BUCKET}/babyweight/tuned_$TIMESTAMP
JOB_NAME=babyweight_tuned_$TIMESTAMP

PYTHON_PACKAGE_URI=gs://${BUCKET}/babyweight/babyweight_trainer-0.1.tar.gz
PYTHON_PACKAGE_EXECUTOR_IMAGE_URI="us-docker.pkg.dev/vertex-ai/training/tf-cpu.2-3:latest"
PYTHON_MODULE=trainer.task

echo > ./tuned_config.yaml "workerPoolSpecs:
  machineSpec:
    machineType: n1-standard-8
  replicaCount: 1
  pythonPackageSpec:
    executorImageUri: $PYTHON_PACKAGE_EXECUTOR_IMAGE_URI
    packageUris: $PYTHON_PACKAGE_URI
    pythonModule: $PYTHON_MODULE
    args:
    - --train_data_path=gs://${BUCKET}/babyweight/data/train*.csv
    - --eval_data_path=gs://${BUCKET}/babyweight/data/eval*.csv
    - --output_dir=$OUTDIR
    - --num_epochs=10
    - --train_examples=20000
    - --eval_steps=100
    - --batch_size=32
    - --nembeds=8"
    
gcloud ai custom-jobs create \
  --region=${REGION} \
  --display-name=$JOB_NAME \
  --config=tuned_config.yaml

## Lab Summary
In this lab, we set up the environment, created the trainer module's task.py to hold hyperparameter argparsing code, created the trainer module's model.py to hold Keras model code, ran the trainer module package locally, submitted a training job to Vertex AI, and submitted a hyperparameter tuning job to Vertex AI.

Copyright 2021 Google LLC
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
    https://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.