In [None]:
# Copyright 2022 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# Get started with Vertex AI Distributed Training

<table align="left">
      <td>
    <a href="https://colab.research.google.com/github/GoogleCloudPlatform/vertex-ai-samples/blob/main/notebooks/official/training/get_started_with_vertex_distributed_training.ipynb">
      <img src="https://cloud.google.com/ml-engine/images/colab-logo-32px.png" alt="Colab logo"> Run in Colab
    </a>
  </td>
  <td>
    <a href="https://github.com/GoogleCloudPlatform/vertex-ai-samples/blob/main/notebooks/official/training/get_started_with_vertex_distributed_training.ipynb">
      <img src="https://cloud.google.com/ml-engine/images/github-logo-32px.png" alt="GitHub logo">
      View on GitHub
    </a>
  </td>
  <td>
    <a href="https://console.cloud.google.com/vertex-ai/workbench/deploy-notebook?download_url=https://raw.githubusercontent.com/GoogleCloudPlatform/vertex-ai-samples/main/notebooks/official/training/get_started_with_vertex_distributed_training.ipynb">
      <img src="https://lh3.googleusercontent.com/UiNooY4LUgW_oTvpsNhPpQzsstV5W8F7rYgxgGBD85cWJoLmrOzhVs_ksK_vgx40SHs7jCqkTkCk=e14-rj-sc0xffffff-h130-w32" alt="Vertex AI logo">
Open in Vertex AI Workbench
    </a>
  </td>
</table>
<br/><br/><br/>

## Overview

This tutorial demonstrates how to use the Vertex AI Python client library to do distributed training of a TensorFlow model.

*Note:* There are incompatibilities between Colab and Docker and the Docker section may not work until resolved by the platform.

Learn more about [Vertex AI Distributed Training](https://cloud.google.com/vertex-ai/docs/training/distributed-training).

### Objective

In this tutorial, you learn how to use `Vertex AI Distributed Training` when training with `Vertex AI`.

This tutorial uses the following Google Cloud ML services:

- `Vertex AI Distributed Training`
- `Vertex AI Reduction Server`

The steps performed include:

- `MirroredStrategy`: Train on a single VM with multiple GPUs.
- `MultiWorkerMirroredStrategy`: Train on multiple VMs with automatic setup of replicas.
- `MultiWorkerMirroredStrategy`: Train on multiple VMs with fine-grained control of replicas.
- `ReductionServer`: Train on multiple VMs and sync updates across VMs with `Vertex AI Reduction Server`.
- `TPUTraining`: Train with multiple Cloud TPUs.

### Recommendations

When doing E2E MLOps on Google Cloud, the following are best practices for when to use Vertex AI Distributed Training:

**Single VM / Single Device (OneDeviceStrategy)**

You are experimenting and the total training data and number of model parameters is small.

If the number of model parameters is very small, you may not get much benefit from a GPU and may consider using the VM's CPU.

**Single VM / Multiple Compute Devices (MirroredStrategy)**

The number of model parameters is very large, but the total training data is small.

**Multiple VM / Multiple Compute Devices (MultiWorkerMirroredStrategy)**

The number of model parameters is very large and the total training data is very large.

**ReductionServer**

You are training across a large number of VMs and the volume of model parameter updates to synchronize is very large.
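The recommendations above can be read as a mapping from hardware layout to strategy. The following is a minimal sketch of that mapping only; the helper name `recommend_strategy` is hypothetical and is not part of the Vertex AI SDK or TensorFlow, and it ignores the model-size and data-size considerations that also influence the choice.

```python
def recommend_strategy(num_vms: int, devices_per_vm: int) -> str:
    """Map a hardware layout to the strategy named in the recommendations."""
    if num_vms < 1 or devices_per_vm < 1:
        raise ValueError("num_vms and devices_per_vm must both be at least 1")
    if num_vms > 1:
        # Multiple VMs: replicas are mirrored across machines.
        return "MultiWorkerMirroredStrategy"
    if devices_per_vm > 1:
        # One VM with several compute devices: replicas are mirrored locally.
        return "MirroredStrategy"
    # One VM with one compute device.
    return "OneDeviceStrategy"


for layout in [(1, 1), (1, 4), (4, 4)]:
    print(layout, "->", recommend_strategy(*layout))
```

Reduction Server is not a separate strategy in this sketch: it is an optional addition to a multi-VM layout when synchronizing updates becomes the bottleneck.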

### Dataset

The dataset used for this tutorial is the [Boston Housing Prices dataset](https://www.cs.toronto.edu/~delve/data/boston/bostonDetail.html). The version of the dataset you use in this tutorial is built into TensorFlow. The trained model predicts the median price of a house in units of 1K USD.

### Costs
 
This tutorial uses billable components of Google Cloud:

* Vertex AI
* Cloud Storage

Learn about [Vertex AI pricing](https://cloud.google.com/vertex-ai/pricing) and [Cloud Storage pricing](https://cloud.google.com/storage/pricing), and use the [Pricing Calculator](https://cloud.google.com/products/calculator/) to generate a cost estimate based on your projected usage.

## Installation

Install the packages required for executing this notebook.

In [None]:
! pip3 install --upgrade --quiet google-cloud-aiplatform

### Colab only: Uncomment the following cell to restart the kernel.

In [None]:
# Automatically restart kernel after installs so that your environment can access the new packages
# import IPython

# app = IPython.Application.instance()
# app.kernel.do_shutdown(True)

## Before you begin

### Set up your Google Cloud project

**The following steps are required, regardless of your notebook environment.**

1. [Select or create a Google Cloud project](https://console.cloud.google.com/cloud-resource-manager). When you first create an account, you get a $300 free credit towards your compute/storage costs.

2. [Make sure that billing is enabled for your project](https://cloud.google.com/billing/docs/how-to/modify-project).

3. [Enable the Vertex AI API](https://console.cloud.google.com/flows/enableapi?apiid=aiplatform.googleapis.com).

4. If you are running this notebook locally, install the [Cloud SDK](https://cloud.google.com/sdk).

#### Set your project ID

**If you don't know your project ID**, try the following:
* Run `gcloud config list`.
* Run `gcloud projects list`.
* See the support page: [Locate the project ID](https://support.google.com/googleapi/answer/7014113)

In [None]:
PROJECT_ID = "[your-project-id]"  # @param {type:"string"}

# Set the project id
! gcloud config set project {PROJECT_ID}

#### Set the region

**Optional**: Update the 'REGION' variable to specify the region that you want to use. Learn more about [Vertex AI regions](https://cloud.google.com/vertex-ai/docs/general/locations).

In [None]:
REGION = "us-central1"  # @param {type: "string"}

#### UUID

If you are in a live tutorial session, you might be using a shared test account or project. To avoid name collisions between users on resources created, you create a uuid for each instance session, and append it onto the name of resources you create in this tutorial.

In [None]:
import random
import string


# Generate a UUID of a specified length (default=8)
def generate_uuid(length: int = 8) -> str:
    return "".join(random.choices(string.ascii_lowercase + string.digits, k=length))


UUID = generate_uuid()

### Authenticate your Google Cloud account

To authenticate your Google Cloud account, follow the instructions for your Jupyter environment:

**1. Vertex AI Workbench**
<br>You are already authenticated.

**2. Local JupyterLab instance**
<br>Uncomment and run the following code:

In [None]:
# ! gcloud auth login

**3. Colab**
<br>Uncomment and run the following code:

In [None]:
# from google.colab import auth

# auth.authenticate_user()

**4. Service account or other**
* See how to grant Cloud Storage permissions to your service account at https://cloud.google.com/storage/docs/gsutil/commands/iam#ch-examples.

### Create a Cloud Storage bucket

Create a storage bucket to store intermediate artifacts such as datasets.

In [None]:
BUCKET_URI = f"gs://your-bucket-name-{PROJECT_ID}-unique"  # @param {type:"string"}

**Only if your bucket doesn't already exist**: Run the following cell to create your Cloud Storage bucket.

In [None]:
! gsutil mb -l $REGION -p $PROJECT_ID $BUCKET_URI

### Set up variables

Next, set up some variables used throughout the tutorial.

### Import libraries and define constants

In [None]:
import os

import google.cloud.aiplatform as aip

### Initialize Vertex AI SDK for Python

Initialize the Vertex AI SDK for Python for your project and corresponding bucket.

In [None]:
aip.init(project=PROJECT_ID, staging_bucket=BUCKET_URI)

#### Set hardware accelerators

You can set hardware accelerators for training and prediction.

Set the variables `TRAIN_GPU/TRAIN_NGPU` and `DEPLOY_GPU/DEPLOY_NGPU` to use a container image supporting a GPU and the number of GPUs allocated to the virtual machine (VM) instance. For example, to use a GPU container image with 4 NVIDIA Tesla K80 GPUs allocated to each VM, you would specify:

    (aip.gapic.AcceleratorType.NVIDIA_TESLA_K80, 4)


Otherwise specify `(None, None)` to use a container image to run on a CPU.

Learn more about [hardware accelerator support for your region](https://cloud.google.com/vertex-ai/docs/general/locations#accelerators).

*Note*: TF releases before 2.3 for GPU support will fail to load the custom model in this tutorial. It is a known issue and fixed in TF 2.3. This is caused by static graph ops that are generated in the serving function. If you encounter this issue on your own custom models, use a container image for TF 2.3 with GPU support.

In [None]:
if os.getenv("IS_TESTING_TRAIN_GPU"):
    TRAIN_GPU, TRAIN_NGPU = (
        aip.gapic.AcceleratorType.NVIDIA_TESLA_K80,
        int(os.getenv("IS_TESTING_TRAIN_GPU")),
    )
else:
    TRAIN_GPU, TRAIN_NGPU = (aip.gapic.AcceleratorType.NVIDIA_TESLA_K80, 4)

if os.getenv("IS_TESTING_DEPLOY_GPU"):
    DEPLOY_GPU, DEPLOY_NGPU = (
        aip.gapic.AcceleratorType.NVIDIA_TESLA_K80,
        int(os.getenv("IS_TESTING_DEPLOY_GPU")),
    )
else:
    DEPLOY_GPU, DEPLOY_NGPU = (None, None)

#### Set pre-built containers

Set the pre-built Docker container image for training and prediction.


For the latest list, see [Pre-built containers for training](https://cloud.google.com/ai-platform-unified/docs/training/pre-built-containers).


For the latest list, see [Pre-built containers for prediction](https://cloud.google.com/ai-platform-unified/docs/predictions/pre-built-containers).

In [None]:
if os.getenv("IS_TESTING_TF"):
    TF = os.getenv("IS_TESTING_TF")
else:
    TF = "2.5".replace(".", "-")

if TF[0] == "2":
    if TRAIN_GPU:
        TRAIN_VERSION = "tf-gpu.{}".format(TF)
    else:
        TRAIN_VERSION = "tf-cpu.{}".format(TF)
    if DEPLOY_GPU:
        DEPLOY_VERSION = "tf2-gpu.{}".format(TF)
    else:
        DEPLOY_VERSION = "tf2-cpu.{}".format(TF)
else:
    if TRAIN_GPU:
        TRAIN_VERSION = "tf-gpu.{}".format(TF)
    else:
        TRAIN_VERSION = "tf-cpu.{}".format(TF)
    if DEPLOY_GPU:
        DEPLOY_VERSION = "tf-gpu.{}".format(TF)
    else:
        DEPLOY_VERSION = "tf-cpu.{}".format(TF)

TRAIN_IMAGE = "{}-docker.pkg.dev/vertex-ai/training/{}:latest".format(
    REGION.split("-")[0], TRAIN_VERSION
)
DEPLOY_IMAGE = "{}-docker.pkg.dev/vertex-ai/prediction/{}:latest".format(
    REGION.split("-")[0], DEPLOY_VERSION
)

print("Training:", TRAIN_IMAGE, TRAIN_GPU, TRAIN_NGPU)
print("Deployment:", DEPLOY_IMAGE, DEPLOY_GPU, DEPLOY_NGPU)

#### Set machine type

Next, set the machine type to use for training.

- Set the variable `TRAIN_COMPUTE` to configure the compute resources for the VMs you will use for training.
 - `machine type`
     - `n1-standard`: 3.75GB of memory per vCPU.
     - `n1-highmem`: 6.5GB of memory per vCPU
     - `n1-highcpu`: 0.9 GB of memory per vCPU
 - `vCPUs`: number of \[2, 4, 8, 16, 32, 64, 96 \]

*Note: The following is not supported for training:*

 - `standard`: 2 vCPUs
 - `highcpu`: 2, 4 and 8 vCPUs

*Note: You may also use n2 and e2 machine types for training and deployment, but they do not support GPUs*.
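As a small sketch of how the machine type string is composed from a family and a vCPU count, the helper below also rejects the combinations listed above as unsupported for training. The function name `make_machine_type` and the two lookup tables are hypothetical and only encode what this section states; check the Vertex AI documentation for the authoritative list.

```python
# vCPU counts listed in this section.
VALID_VCPUS = (2, 4, 8, 16, 32, 64, 96)

# Combinations this section lists as not supported for training.
UNSUPPORTED = {"n1-standard": {2}, "n1-highcpu": {2, 4, 8}}


def make_machine_type(family: str, vcpus: int) -> str:
    """Join a machine family and vCPU count, e.g. 'n1-standard' + 4."""
    if vcpus not in VALID_VCPUS:
        raise ValueError(f"vCPUs must be one of {VALID_VCPUS}, got {vcpus}")
    if vcpus in UNSUPPORTED.get(family, set()):
        raise ValueError(f"{family}-{vcpus} is not supported for training")
    return f"{family}-{vcpus}"


print(make_machine_type("n1-standard", 4))  # n1-standard-4
```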

In [None]:
if os.getenv("IS_TESTING_TRAIN_MACHINE"):
    MACHINE_TYPE = os.getenv("IS_TESTING_TRAIN_MACHINE")
else:
    MACHINE_TYPE = "n1-standard"

VCPU = "4"
TRAIN_COMPUTE = MACHINE_TYPE + "-" + VCPU
print("Train machine type", TRAIN_COMPUTE)

## Mirrored Strategy

When training on a single VM, you can train with either a single compute device or multiple compute devices on the same VM. With Vertex AI Distributed Training you can specify both the number of compute devices for the VM instance and the type of compute device: CPU or GPU.

Vertex AI Distributed Training supports `tf.distribute.MirroredStrategy` for TensorFlow models. To enable training across multiple compute devices on the same VM, you do the following additional steps in your Python training script:

1. Set the strategy to `tf.distribute.MirroredStrategy`.
2. Compile the model within the scope of `tf.distribute.MirroredStrategy`. *Note:* This tells `MirroredStrategy` which variables to mirror across your compute devices.
3. Scale the batch size to num_devices * per-device batch size, so that each compute device still receives a full per-device batch.

During training, the distribution of batches across the devices is synchronized, as are the updates to the model parameters.
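To make the batch-size scaling and the synchronized update concrete, here is a minimal pure-Python sketch of one synchronous data-parallel step for a one-parameter model `y = w * x`. It does not use TensorFlow; the functions `gradient` and `mirrored_step` are hypothetical stand-ins for what `MirroredStrategy` does internally, and the plain average stands in for the all-reduce.

```python
import math


def gradient(w, xs, ys):
    """Mean-squared-error gradient d/dw for the model y = w * x."""
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)


def mirrored_step(w, xs, ys, num_replicas, lr):
    """One synchronous step: shard the global batch, average the gradients."""
    shard = len(xs) // num_replicas
    grads = [
        gradient(w, xs[i * shard:(i + 1) * shard], ys[i * shard:(i + 1) * shard])
        for i in range(num_replicas)
    ]
    synced = sum(grads) / num_replicas  # stands in for the all-reduce
    return w - lr * synced


per_replica_batch = 4
num_replicas = 2
global_batch = per_replica_batch * num_replicas  # 8 examples per step

xs = [float(i) for i in range(1, global_batch + 1)]
ys = [3.0 * x for x in xs]

w_mirrored = mirrored_step(0.0, xs, ys, num_replicas, lr=0.01)
w_single = 0.0 - 0.01 * gradient(0.0, xs, ys)
print(math.isclose(w_mirrored, w_single))
```

With equally sized shards, averaging the per-replica gradients gives the same update as one device processing the whole global batch, which is why every replica ends the step with identical parameters.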

### Create and run custom training job


To train a custom model, you perform two steps: 1) create a custom training job, and 2) run the job.

#### Create custom training job

A custom training job is created with the `CustomPythonPackageTrainingJob` class, with the following parameters:

- `display_name`: The human readable name for the custom training job.
- `container_uri`: The training container image.
- `python_package_gcs_uri`: The location of the Python training package as a tarball.
- `python_module_name`: The relative path to the training script in the Python package.
- `model_serving_container_image_uri`: The container image for deploying the model.

*Note:* There is no requirements parameter. You specify any requirements in the `setup.py` script in your Python package.

In [None]:
DISPLAY_NAME = "boston_" + UUID

job = aip.CustomPythonPackageTrainingJob(
    display_name=DISPLAY_NAME,
    python_package_gcs_uri=f"{BUCKET_URI}/trainer_boston.tar.gz",
    python_module_name="trainer.task",
    container_uri=TRAIN_IMAGE,
    model_serving_container_image_uri=DEPLOY_IMAGE,
    project=PROJECT_ID,
)

### Examine the training package

#### Package layout

Before you start the training, you will look at how a Python package is assembled for a custom training job. When unarchived, the package contains the following directory/file layout.

- PKG-INFO
- README.md
- setup.cfg
- setup.py
- trainer
  - \_\_init\_\_.py
  - task.py

The files `setup.cfg` and `setup.py` are the instructions for installing the package into the operating environment of the Docker image.

The file `trainer/task.py` is the Python script for executing the custom training job. *Note*: when you refer to it in the job specification (`python_module_name`), you replace the directory slash with a dot (`trainer.task`) and drop the file suffix (`.py`).
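The path-to-module conversion described above can be sketched in a few lines. The helper name `module_name` is hypothetical; it only illustrates the naming rule and is not something the Vertex AI SDK provides.

```python
from pathlib import PurePosixPath


def module_name(script_path: str) -> str:
    """Convert a package-relative script path to a dotted module name."""
    path = PurePosixPath(script_path)
    if path.suffix != ".py":
        raise ValueError(f"expected a .py file, got {script_path!r}")
    # Drop the suffix, then join the path components with dots.
    return ".".join(path.with_suffix("").parts)


print(module_name("trainer/task.py"))  # trainer.task
```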

#### Package Assembly

In the following cells, you will assemble the training package.

In [None]:
# Make folder for Python training script
! rm -rf custom
! mkdir custom

# Add package information
! touch custom/README.md

setup_cfg = "[egg_info]\n\ntag_build =\n\ntag_date = 0"
! echo "$setup_cfg" > custom/setup.cfg

setup_py = "import setuptools\n\nsetuptools.setup(\n\n    install_requires=[\n\n        'tensorflow==2.5.0',\n\n        'tensorflow_datasets==1.3.0',\n\n    ],\n\n    packages=setuptools.find_packages())"
! echo "$setup_py" > custom/setup.py

pkg_info = "Metadata-Version: 1.0\n\nName: Boston Housing cloud\n\nVersion: 0.0.0\n\nSummary: Demonstration training script\n\nHome-page: www.google.com\n\nAuthor: Google\n\nAuthor-email: aferlitsch@google.com\n\nLicense: Public\n\nDescription: Demo\n\nPlatform: Vertex"
! echo "$pkg_info" > custom/PKG-INFO

# Make the training subfolder
! mkdir custom/trainer
! touch custom/trainer/__init__.py

#### Task.py contents

In the next cell, you write the contents of the training script `task.py`. This tutorial doesn't go into detail; the script is there for you to browse. In summary:

- Gets the directory where to save the model artifacts from the command line (`--model-dir`), and if not specified, then from the environment variable `AIP_MODEL_DIR`.
- Loads Boston Housing dataset from TF.Keras builtin datasets
- Builds a simple deep neural network model using TF.Keras model API.
- Compiles the model (`compile()`).
- Sets a training distribution strategy according to the argument `args.distribute`.
- Trains the model (`fit()`) with epochs specified by `args.epochs`.
- Saves the trained model (`save(args.model_dir)`) to the specified model directory.
- Saves the maximum value for each feature `f.write(str(params))` to the specified parameters file.
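The first item in the summary, falling back from the command line to `AIP_MODEL_DIR`, is the part most specific to Vertex AI. The following is a minimal standalone sketch of that fallback; the helper `parse_model_dir`, its `env` parameter, and the `gs://example-bucket/...` paths are hypothetical and exist only so the behavior can be shown without setting real environment variables.

```python
import argparse
import os


def parse_model_dir(argv, env=None):
    """Return --model-dir if given, else the AIP_MODEL_DIR value, else None."""
    env = os.environ if env is None else env
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--model-dir",
        dest="model_dir",
        default=env.get("AIP_MODEL_DIR"),
        type=str,
        help="Model dir.",
    )
    # parse_known_args ignores the other training flags in this sketch.
    args, _ = parser.parse_known_args(argv)
    return args.model_dir


# An explicit flag wins over the environment.
print(parse_model_dir(["--model-dir=gs://example-bucket/model"], env={}))
# Without the flag, the environment variable is used.
print(parse_model_dir([], env={"AIP_MODEL_DIR": "gs://example-bucket/aip"}))
```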

In [None]:
%%writefile custom/trainer/task.py
# Single, Mirrored and MultiWorker Distributed Training

import tensorflow_datasets as tfds
import tensorflow as tf
from tensorflow.python.client import device_lib
import numpy as np
import argparse
import os
import sys
import logging

parser = argparse.ArgumentParser()
parser.add_argument('--model-dir', dest='model_dir',
                    default=os.getenv('AIP_MODEL_DIR'), type=str, help='Model dir.')
parser.add_argument('--lr', dest='lr',
                    default=0.001, type=float,
                    help='Learning rate.')
parser.add_argument('--epochs', dest='epochs',
                    default=10, type=int,
                    help='Number of epochs.')
parser.add_argument('--steps', dest='steps',
                    default=100, type=int,
                    help='Number of steps per epoch.')
parser.add_argument('--batch_size', dest='batch_size',
                    default=16, type=int,
                    help='Size of a batch.')
parser.add_argument('--distribute', dest='distribute', type=str, default='single',
                    help='distributed training strategy')
parser.add_argument('--param-file', dest='param_file',
                    default='/tmp/param.txt', type=str,
                    help='Output file for parameters')
args = parser.parse_args()

logging.info('DEVICES'  + str(device_lib.list_local_devices()))

# Single Machine, single compute device
if args.distribute == 'single':
    if tf.test.is_gpu_available():
        strategy = tf.distribute.OneDeviceStrategy(device="/gpu:0")
    else:
        strategy = tf.distribute.OneDeviceStrategy(device="/cpu:0")
    logging.info("Single device training")
# Single Machine, multiple compute device
elif args.distribute == 'mirrored':
    strategy = tf.distribute.MirroredStrategy()
    logging.info("Mirrored Strategy distributed training")
# Multi Machine, multiple compute device
elif args.distribute == 'multiworker':
    strategy = tf.distribute.MultiWorkerMirroredStrategy()
    logging.info("Multi-worker Strategy distributed training")
    logging.info('TF_CONFIG = {}'.format(os.environ.get('TF_CONFIG', 'Not found')))
# Single Machine, multiple TPU devices
elif args.distribute == 'tpu':
    cluster_resolver = tf.distribute.cluster_resolver.TPUClusterResolver(tpu="local")
    tf.config.experimental_connect_to_cluster(cluster_resolver)
    tf.tpu.experimental.initialize_tpu_system(cluster_resolver)
    strategy = tf.distribute.TPUStrategy(cluster_resolver)
    print("All devices: ", tf.config.list_logical_devices('TPU'))

logging.info('num_replicas_in_sync = {}'.format(strategy.num_replicas_in_sync))

def _is_chief(task_type, task_id):
    ''' Check for primary if multiworker training
    '''
    return (task_type == 'chief') or (task_type == 'worker' and task_id == 0) or task_type is None


def get_data():
    # Scale each Boston Housing feature column by its maximum value
    def scale(feature):
        max_value = np.max(feature)
        feature = (feature / max_value).astype(np.float64)
        return feature, max_value

    (x_train, y_train), (x_test, y_test) = tf.keras.datasets.boston_housing.load_data(
        path="boston_housing.npz", test_split=0.2, seed=113
    )

    params = []
    for i in range(13):
        # Index columns (features), not rows (examples)
        x_train[:, i], max_value = scale(x_train[:, i])
        x_test[:, i], _ = scale(x_test[:, i])
        # Record the normalization value for every feature
        params.append(max_value)

    # store the normalization (max) value for each feature
    with tf.io.gfile.GFile(args.param_file, 'w') as f:
        f.write(str(params))
    return (x_train, y_train), (x_test, y_test)

def get_model():
    model = tf.keras.Sequential([
          tf.keras.layers.Dense(128, activation='relu', input_shape=(13,)),
          tf.keras.layers.Dense(128, activation='relu'),
          tf.keras.layers.Dense(1, activation='linear')
    ])

    model.compile(
          loss='mse',
          optimizer=tf.keras.optimizers.RMSprop(learning_rate=args.lr)
    )
    return model

def train(model, x_train, y_train):
    NUM_WORKERS = strategy.num_replicas_in_sync
    # Here the batch size scales up by number of workers since
    # `tf.data.Dataset.batch` expects the global batch size.
    GLOBAL_BATCH_SIZE = args.batch_size * NUM_WORKERS

    model.fit(x_train, y_train, epochs=args.epochs, batch_size=GLOBAL_BATCH_SIZE)

    if args.distribute == 'multiworker':
        task_type, task_id = (strategy.cluster_resolver.task_type,
                              strategy.cluster_resolver.task_id)
    else:
        task_type, task_id = None, None

    if args.distribute=="tpu":
        save_locally = tf.saved_model.SaveOptions(experimental_io_device='/job:localhost')
        model.save(args.model_dir, options=save_locally)
    # single, mirrored or primary for multiworker
    elif _is_chief(task_type, task_id):
        model.save(args.model_dir)
    # non-primary workers for multi-workers
    else:
        # each worker saves their model instance to a unique temp location
        worker_dir = args.model_dir + '/workertemp_' + str(task_id)
        tf.io.gfile.makedirs(worker_dir)
        model.save(worker_dir)

with strategy.scope():
    # Creation of dataset, and model building/compiling need to be within
    # `strategy.scope()`.
    model = get_model()

(x_train, y_train), (x_test, y_test) = get_data()

train(model, x_train, y_train)

#### Store training script on your Cloud Storage bucket

Next, you package the training folder into a compressed tarball, and then store it in your Cloud Storage bucket.

In [None]:
! rm -f custom.tar custom.tar.gz
! tar cvf custom.tar custom
! gzip custom.tar
! gsutil cp custom.tar.gz $BUCKET_URI/trainer_boston.tar.gz

#### Run the custom Python package training job

Next, you run the custom job to start the training job by invoking the method `run()`. The parameters are the same as when running a CustomTrainingJob.

In [None]:
MODEL_DIR = BUCKET_URI

CMDARGS = ["--epochs=5", "--batch_size=16", "--distribute=mirrored"]

model = job.run(
    model_display_name="boston_" + UUID,
    args=CMDARGS,
    replica_count=1,
    machine_type=TRAIN_COMPUTE,
    accelerator_type=TRAIN_GPU.name,
    accelerator_count=TRAIN_NGPU,
    base_output_dir=MODEL_DIR,
    sync=True,
)

### Delete a custom training job

After a training job is completed, you can delete the training job with the method `delete()`.  Prior to completion, a training job can be canceled with the method `cancel()`.

In [None]:
job.delete()

#### Delete the model

The method `delete()` will delete the model.

In [None]:
model.delete()

## Multi-Worker Mirrored Strategy

With Vertex AI Distributed Training you can train with multiple VM instances.

Vertex AI Distributed Training supports `tf.distribute.MultiWorkerMirroredStrategy` for TensorFlow models. To enable training across multiple VMs, you do the following additional steps in your Python training script:

1. All the additional steps for `MirroredStrategy`, except that `MultiWorkerMirroredStrategy` is set in place of `MirroredStrategy`.
2. Set up the worker pools.
3. Alter the saving of the model so that each non-primary worker saves its model instance to its own unique temporary directory.

*Note:* You do not need to construct the TF_CONFIG environment variable. It is automatically constructed by Vertex AI Distributed Training.

Learn more about [Distributed Training](https://cloud.google.com/vertex-ai/docs/training/distributed-training).
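Although you do not construct `TF_CONFIG` yourself, it can help to see what a replica can read from it. The sketch below parses a `TF_CONFIG`-style JSON string and reports the replica's role. The helper `describe_task` and the host names are hypothetical; the primary check is a slightly stricter variant of `_is_chief` in the training script, in that worker 0 counts as primary only when the cluster has no dedicated chief.

```python
import json


def describe_task(tf_config_json: str) -> dict:
    """Summarize this replica's role from a TF_CONFIG JSON string."""
    tf_config = json.loads(tf_config_json) if tf_config_json else {}
    cluster = tf_config.get("cluster", {})
    task = tf_config.get("task", {})
    task_type = task.get("type")
    task_index = task.get("index")
    is_primary = (
        task_type is None
        or task_type == "chief"
        or (task_type == "worker" and task_index == 0 and "chief" not in cluster)
    )
    return {
        "task_type": task_type,
        "task_index": task_index,
        "num_replicas": sum(len(hosts) for hosts in cluster.values()),
        "is_primary": is_primary,
    }


# Hypothetical value for the second worker in a 1 chief + 3 worker cluster.
example = json.dumps({
    "cluster": {
        "chief": ["host-0:2222"],
        "worker": ["host-1:2222", "host-2:2222", "host-3:2222"],
    },
    "task": {"type": "worker", "index": 1},
})
print(describe_task(example))
```

In a training script you would pass `os.environ.get("TF_CONFIG", "")` instead of the hard-coded example.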

### Worker pools

If you run a distributed training job with Vertex AI, you specify multiple machines (nodes) in a training cluster. The training service allocates the resources for the machine types you specify. Your running job on a given node is called a replica. A group of replicas with the same configuration is called a worker pool.

Each replica in the training cluster is given a single role or task in distributed training. For example:

- **Primary replica**: Exactly one replica is designated the primary replica. This task manages the others and reports status for the job as a whole.

- **Worker(s)**: One or more replicas may be designated as workers. These replicas do their portion of the work as you designate in your job configuration.

- **Parameter server(s)**: If supported by your ML framework, one or more replicas may be designated as parameter servers. These replicas store model parameters and coordinate shared model state between the workers.

- **Evaluator(s)**: If supported by your ML framework, one or more replicas may be designated as evaluators. These replicas can be used to evaluate your model. If you are using TensorFlow, note that TensorFlow generally expects that you use no more than one evaluator.

To configure a distributed training job, define your list of worker pools (workerPoolSpecs[]), designating one WorkerPoolSpec for each type of task:

*Note:* The worker pool is order dependent (0..3):

**workerPoolSpecs[0]**: Primary, chief, scheduler, or "master"

**workerPoolSpecs[1]**: Secondary, replicas, workers

**workerPoolSpecs[2]**: Parameter servers, Reduction Server

**workerPoolSpecs[3]**: Evaluators
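Because the role of a pool is determined purely by its position (0 through 3) in the list, a short sketch can make the ordering explicit. The names `POOL_ROLES` and `label_pools` are hypothetical, the specs are reduced to `replica_count` only, and the empty dictionary used as a placeholder for an unused pool is an assumption of this sketch that you should confirm against the Vertex AI documentation.

```python
POOL_ROLES = (
    "primary",
    "workers",
    "parameter servers or reduction server",
    "evaluators",
)


def label_pools(worker_pool_specs: list) -> list:
    """Pair each worker pool spec with the role implied by its index."""
    if len(worker_pool_specs) > len(POOL_ROLES):
        raise ValueError("at most four worker pools are supported")
    return [
        (index, POOL_ROLES[index], spec.get("replica_count", 0))
        for index, spec in enumerate(worker_pool_specs)
    ]


# Hypothetical layout: one primary, three workers, no third pool, one evaluator.
specs = [
    {"replica_count": 1},
    {"replica_count": 3},
    {},  # placeholder assumed here to keep the evaluator pool at index 3
    {"replica_count": 1},
]
for row in label_pools(specs):
    print(row)
```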

### Distributed training options for Multi-Worker Mirrored Strategy

How you set up the worker pools depends on the Vertex AI method you use for training.

**CustomTrainingJob** / **CustomContainerTrainingJob** / **CustomPythonPackageTrainingJob**

The `replica_count` covers both the primary replica and the secondary replicas (`replica_count - 1` workers), which all share the same machine type and accelerators.

You cannot specify a parameter server or evaluation node.

**CustomJob**

You specify `worker_pool_specs`, where you can specify detailed settings for each of the four worker pools.
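As a minimal sketch of how a single `replica_count` is divided between the first two worker pools when you use one of the training job classes, consider the helper below. The name `split_replica_count` is hypothetical and only restates the rule described in this section.

```python
def split_replica_count(replica_count: int) -> dict:
    """Split a total replica count into one primary plus the remaining workers."""
    if replica_count < 1:
        raise ValueError("replica_count must be at least 1")
    return {"primary": 1, "workers": replica_count - 1}


print(split_replica_count(4))  # {'primary': 1, 'workers': 3}
```

With `CustomJob`, by contrast, each pool has its own `replica_count`, machine type, and accelerators.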

### Create and run custom training job


To train a custom model, you perform two steps: 1) create a custom training job, and 2) run the job.

#### Create custom training job

A custom training job is created with the `CustomPythonPackageTrainingJob` class, with the following parameters:

- `display_name`: The human readable name for the custom training job.
- `container_uri`: The training container image.
- `python_package_gcs_uri`: The location of the Python training package as a tarball.
- `python_module_name`: The relative path to the training script in the Python package.
- `model_serving_container_image_uri`: The container image for deploying the model.

*Note:* There is no requirements parameter. You specify any requirements in the `setup.py` script in your Python package.

In [None]:
DISPLAY_NAME = "boston_" + UUID

job = aip.CustomPythonPackageTrainingJob(
    display_name=DISPLAY_NAME,
    python_package_gcs_uri=f"{BUCKET_URI}/trainer_boston.tar.gz",
    python_module_name="trainer.task",
    container_uri=TRAIN_IMAGE,
    model_serving_container_image_uri=DEPLOY_IMAGE,
    project=PROJECT_ID,
)

#### Run the custom Python package training job

Next, you run the custom job to start the training job by invoking the method `run()`. The parameters are the same as when running a CustomTrainingJob.

In [None]:
MODEL_DIR = BUCKET_URI

CMDARGS = ["--epochs=5", "--batch_size=16", "--distribute=multiworker"]

try:
    model = job.run(
        model_display_name="boston_" + UUID,
        args=CMDARGS,
        replica_count=4,
        machine_type=TRAIN_COMPUTE,
        accelerator_type=TRAIN_GPU.name,
        accelerator_count=TRAIN_NGPU,
        base_output_dir=MODEL_DIR,
        sync=True,
    )
except Exception as e:
    # may fail during model.save() -- seems to be some issue when merging checkpoints from the workers
    print(e)

### Delete a custom training job

After a training job is completed, you can delete the training job with the method `delete()`.  Prior to completion, a training job can be canceled with the method `cancel()`.

In [None]:
job.delete()

### Multiworker distributed training with CustomJob

Multiworker distributed training with `CustomJob` gives you fine-grained control of the primary replica and lets you optionally specify worker pools for parameter servers and evaluators. Creating a `CustomJob` includes the following steps:


1. Specify individual details for each worker pool.
2. Embed training package into Docker image.

### Create a Docker file

To use your own custom training container, you write a Dockerfile and embed your training scripts into the container.

#### Write the Docker file contents

Your first step in containerizing your code is to create a Dockerfile. In your Dockerfile you’ll include all the commands needed to run your container image. It’ll install all the libraries you’re using and set up the entry point for your training code. The Dockerfile does the following:

1. Uses a pre-built TensorFlow container image from the Deep Learning Containers repository as the base image.
2. Copies in the Python training code, to be shown subsequently.
3. Sets the entry point to the Python training script `trainer/task.py`. Note that the script is invoked as the module `trainer.task`, so the `.py` suffix is dropped in the ENTRYPOINT command.

In [None]:
%%writefile custom/Dockerfile

FROM gcr.io/deeplearning-platform-release/tf2-gpu.2-5

WORKDIR /

# Copies the trainer code to the docker image.
COPY trainer /trainer

# Sets up the entry point to invoke the trainer.
ENTRYPOINT ["python", "-m", "trainer.task"]

#### Build the container locally

Next, you will provide a name for your custom container that you will use when you submit it to the Google Container Registry.

In [None]:
TRAIN_IMAGE = "gcr.io/" + PROJECT_ID + "/boston:v1"

Next, build the container.

In [None]:
import sys

IS_COLAB = "google.colab" in sys.modules
if not IS_COLAB:
    ! docker build custom -t $TRAIN_IMAGE
else:
    # install docker daemon
    ! apt-get -qq install docker.io

#### Test the container locally

Run the container within your notebook instance to ensure it’s working correctly. You will run it for 5 epochs.

In [None]:
if not IS_COLAB:
    ! docker run $TRAIN_IMAGE --epochs=5 --model-dir=./

#### Register the custom container

When you’ve finished running the container locally, push it to Google Container Registry.

In [None]:
if not IS_COLAB:
    ! docker push $TRAIN_IMAGE

*Executes in Colab*

In [None]:
%%bash -s $IS_COLAB $TRAIN_IMAGE
if [ $1 == "False" ]; then
  exit 0
fi
set -x
dockerd -b none --iptables=0 -l warn &
for i in $(seq 5); do [ ! -S "/var/run/docker.sock" ] && sleep 2 || break; done
docker build custom -t $2
docker run $2 --epochs=5 --model-dir=./
docker push $2
kill $(jobs -p)

#### Primary worker pool

The primary worker pool (index 0) coordinates the work done by all the other replicas. Set the replicaCount to 1. Since the worker is coordinating and not training, use a general purpose CPU, instead of a GPU.

Learn more about [Machine Types for Training](https://cloud.google.com/vertex-ai/docs/training/configure-compute#machine-types).

In [None]:
PRIMARY_COMPUTE = "n2-highcpu-64"

MODEL_DIR = BUCKET_URI

CMDARGS = [
    "--model-dir=" + MODEL_DIR,
    "--epochs=5",
    "--batch_size=16",
    "--distribute=multiworker",
]

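# `command` overrides the image ENTRYPOINT, so give the full invocation as a list.
CONTAINER_SPEC = {
    "image_uri": TRAIN_IMAGE,
    "command": ["python", "-m", "trainer.task"],
    "args": CMDARGS,
}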

PRIMARY_WORKER_POOL = {
    "replica_count": 1,
    "machine_spec": {"machine_type": PRIMARY_COMPUTE, "accelerator_count": 0},
    "container_spec": CONTAINER_SPEC,
}

WORKER_POOL_SPECS = [PRIMARY_WORKER_POOL]

#### Training worker pool

The secondary worker pool (index 1) performs model training. Each of the replicas will have an instance of your training package installed on it.

Each replica may have one (single device training) or multiple (mirrored) compute devices for training.

In [None]:
TRAIN_WORKER_POOL = {
    "replica_count": 4,
    "machine_spec": {
        "machine_type": TRAIN_COMPUTE,
        "accelerator_count": TRAIN_NGPU,
        "accelerator_type": TRAIN_GPU,
    },
    "container_spec": CONTAINER_SPEC,
}

WORKER_POOL_SPECS.append(TRAIN_WORKER_POOL)

### Create CustomJob with worker pool specifications

Next, you create a `CustomJob` for the multi-worker distributed training job:

- `display_name`: The display name for the custom job.

- `worker_pool_specs`: The detailed specifications for each worker pool.

In [None]:
DISPLAY_NAME = "boston_" + UUID

job = aip.CustomJob(display_name=DISPLAY_NAME, worker_pool_specs=WORKER_POOL_SPECS)

### Run the CustomJob

Next, you run the custom job.

In [None]:
try:
    job.run(sync=True)
except Exception as e:
    # Multi-worker jobs may fail to find the startup script; report the error and continue.
    print(e)

### Delete a custom training job

After a training job completes, you can delete it with the `delete()` method. Before it completes, you can cancel it with the `cancel()` method.

In [None]:
job.delete()
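
Whether to call `cancel()` or `delete()` depends on whether the job has already finished. The sketch below shows one way to make that decision. `cancel_or_delete` is a hypothetical helper, not part of the SDK, and it assumes the job object exposes a `state` whose `name` is one of the `JOB_STATE_*` values. The set of terminal states is not exhaustive, and a stand-in class is used so the example runs without a Google Cloud project.

```python
from types import SimpleNamespace

# Assumed (non-exhaustive) states after which a job can no longer be canceled.
TERMINAL_STATES = {
    "JOB_STATE_SUCCEEDED",
    "JOB_STATE_FAILED",
    "JOB_STATE_CANCELLED",
}


def cancel_or_delete(job):
    """Delete a finished job; otherwise request cancellation."""
    if job.state.name in TERMINAL_STATES:
        job.delete()
        return "deleted"
    job.cancel()
    return "cancel requested"


class FakeJob:
    """Stand-in for a custom job object so the sketch runs offline."""

    def __init__(self, state_name):
        self.state = SimpleNamespace(name=state_name)
        self.calls = []

    def cancel(self):
        self.calls.append("cancel")

    def delete(self):
        self.calls.append("delete")


finished = FakeJob("JOB_STATE_SUCCEEDED")
running = FakeJob("JOB_STATE_RUNNING")
print(cancel_or_delete(finished), "/", cancel_or_delete(running))
```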

## Reduction Server

To speed up training of large models, many engineering teams are adopting distributed training using scale-out clusters of ML accelerators. However, distributed training at scale brings its own set of challenges. Specifically, limited network bandwidth between nodes makes optimizing performance of distributed training inherently difficult, particularly for large cluster configurations.

Vertex AI Reduction Server optimizes bandwidth and latency of multi-node distributed training on NVIDIA GPUs for synchronous data parallel algorithms. Synchronous data parallelism is the foundation of many widely adopted distributed training frameworks, including TensorFlow’s MultiWorkerMirroredStrategy, Horovod, and PyTorch Distributed. By optimizing bandwidth usage and latency of the all-reduce collective operation used by these frameworks, Reduction Server can decrease both the time and cost of large training jobs.

Learn more about [Optimizing training performance using Vertex Reduction Server](https://cloud.google.com/blog/topics/developers-practitioners/optimize-training-performance-reduction-server-vertex-ai).

In [None]:
reduction_server_count = 1
reduction_server_machine_type = "n1-highcpu-16"
reduction_server_image_uri = (
    "us-docker.pkg.dev/vertex-ai-restricted/training/reductionserver:latest"
)

PARAMETER_POOL = {
    "replica_count": reduction_server_count,
    "machine_spec": {
        "machine_type": reduction_server_machine_type,
    },
    "container_spec": {"image_uri": reduction_server_image_uri},
}
WORKER_POOL_SPECS.append(PARAMETER_POOL)
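
Vertex AI assigns meaning to worker pools by their position in the list, so the order of the `append` calls matters: here the Reduction Server pool lands at index 2. The sketch below summarizes a list of pool specifications by position. The `summarize_pools` helper, the role labels, and the sample machine and accelerator values are illustrative assumptions rather than values taken from this tutorial's earlier cells. Index 2 can also hold parameter servers in other configurations.

```python
# Role implied by each list position (descriptive labels, not API values).
POOL_ROLES = ["primary", "workers", "reduction servers", "evaluators"]


def summarize_pools(worker_pool_specs):
    """Return one (role, replicas, machine_type, accelerators) tuple per pool."""
    summary = []
    for index, pool in enumerate(worker_pool_specs):
        machine = pool["machine_spec"]
        replicas = pool["replica_count"]
        # accelerator_count is per replica, so multiply for the pool total.
        accelerators = replicas * machine.get("accelerator_count", 0)
        summary.append(
            (POOL_ROLES[index], replicas, machine["machine_type"], accelerators)
        )
    return summary


# Hypothetical specs mirroring the three-pool layout; container specs omitted.
example_specs = [
    {
        "replica_count": 1,
        "machine_spec": {"machine_type": "n2-highcpu-64", "accelerator_count": 0},
    },
    {
        "replica_count": 4,
        "machine_spec": {"machine_type": "n1-standard-4", "accelerator_count": 1},
    },
    {
        "replica_count": 1,
        "machine_spec": {"machine_type": "n1-highcpu-16"},
    },
]

for row in summarize_pools(example_specs):
    print(row)
```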

### Create CustomJob with worker pool specifications

Next, you create a `CustomJob` for the multi-worker distributed training job:

- `display_name`: The display name for the custom job.

- `worker_pool_specs`: The detailed specifications for each worker pool.

In [None]:
DISPLAY_NAME = "boston_" + UUID

job = aip.CustomJob(display_name=DISPLAY_NAME, worker_pool_specs=WORKER_POOL_SPECS)

### Run the CustomJob

Next, you run the custom job.

In [None]:
try:
    job.run(sync=True)
except Exception as e:
    # Multi-worker jobs may fail to find the startup script; report the error and continue.
    print(e)

### Delete a custom training job

After a training job completes, you can delete it with the `delete()` method. Before it completes, you can cancel it with the `cancel()` method.

In [None]:
job.delete()

## Cloud TPU Training

To further speed up training, your organization can use Google's Cloud Tensor Processing Unit (TPU) Pods.

Cloud TPU is the custom-designed machine learning ASIC that powers Google products like Translate, Photos, Search, Assistant, and Gmail. Cloud TPU is designed to run cutting-edge machine learning models with AI services on Google Cloud. Its custom high-speed network offers over 100 petaflops of performance in a single pod.

Learn more about [Cloud TPU](https://cloud.google.com/tpu).

*Note*: TPU VM Training is currently an opt-in feature. Your Google Cloud project must first be added to the feature allowlist. Email your project information (project ID or number) to vertex-ai-tpu-vm-training-support@google.com to request access. You will receive an email as soon as your project is ready.

### Write the Dockerfile for TPU training

Currently, there is no pre-built Vertex AI Docker image for training with TPUs. Instead, you build your own, as follows:

1. Start from a vanilla Python 3 image (e.g., `python:3.8`).
2. Get and install the TPU library (`libtpu.so`).
3. Copy in your training package.

In [None]:
%%writefile custom/Dockerfile
FROM python:3.8

WORKDIR /

# Copies the trainer code to the docker image.
COPY trainer /trainer

RUN pip3 install tensorflow-datasets

# Install TPU TensorFlow and dependencies.
# libtpu.so must be under the '/lib' directory.
RUN wget https://storage.googleapis.com/cloud-tpu-tpuvm-artifacts/libtpu/20210525/libtpu.so -O /lib/libtpu.so
RUN chmod 777 /lib/libtpu.so

RUN wget https://storage.googleapis.com/cloud-tpu-tpuvm-artifacts/tensorflow/20210525/tf_nightly-2.6.0-cp38-cp38-linux_x86_64.whl
RUN pip3 install tf_nightly-2.6.0-cp38-cp38-linux_x86_64.whl
RUN rm tf_nightly-2.6.0-cp38-cp38-linux_x86_64.whl
# Sets up the entry point to invoke the trainer.
ENTRYPOINT ["python", "-m", "trainer.task"]

### Build and push the Docker image to Container Registry

In [None]:
TRAIN_IMAGE = "gcr.io/" + PROJECT_ID + "/tpu-train:latest"

os.chdir("custom")
! docker build --quiet --tag={TRAIN_IMAGE} .
! docker push {TRAIN_IMAGE}
os.chdir("..")

### TPU worker pool specification

Next, you create the worker pool specification. For TPUs:

- Create only one worker pool (the primary pool).
- Set the machine type to `cloud-tpu`.
- Set the accelerator type to a TPU type.

In [None]:
# Use TPU Accelerators. Temporarily using numeric codes, until types are added to the SDK
#   6 = TPU_V2
#   7 = TPU_V3
TRAIN_TPU, TRAIN_NTPU = (7, 8)
TRAIN_COMPUTE = "cloud-tpu"


if not TRAIN_NTPU or TRAIN_NTPU < 2:
    TRAIN_STRATEGY = "single"
else:
    TRAIN_STRATEGY = "tpu"
print(TRAIN_STRATEGY)

EPOCHS = 20
STEPS = 10000

TRAINER_ARGS = [
    "--epochs=" + str(EPOCHS),
    "--steps=" + str(STEPS),
    "--distribute=" + TRAIN_STRATEGY,
]


WORKER_POOL_SPECS = [
    {
        "container_spec": {
            "args": TRAINER_ARGS,
            "image_uri": TRAIN_IMAGE,
        },
        "replica_count": 1,
        "machine_spec": {
            "machine_type": TRAIN_COMPUTE,
            "accelerator_type": TRAIN_TPU,
            "accelerator_count": TRAIN_NTPU,
        },
    }
]

print(WORKER_POOL_SPECS[0])
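
The `args` list is passed to the container's entry point, so the training script needs to accept each flag. The following is a hedged sketch of what that parsing might look like. The real `trainer/task.py` is defined earlier in the notebook and may differ. The `parse_trainer_args` and `choose_strategy` helpers, the defaults, and the `choices` list are assumptions made for illustration.

```python
import argparse


def parse_trainer_args(argv):
    """Parse the flags that the worker pool spec passes to the container."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--epochs", type=int, default=1)
    parser.add_argument("--steps", type=int, default=100)
    # Strategy names are assumed from the values used in this tutorial.
    parser.add_argument(
        "--distribute",
        choices=["single", "mirrored", "multiworker", "tpu"],
        default="single",
    )
    return parser.parse_args(argv)


def choose_strategy(accelerator_count):
    """Apply the rule from the cell above: fewer than two cores means single-device."""
    if not accelerator_count or accelerator_count < 2:
        return "single"
    return "tpu"


trainer_args = ["--epochs=20", "--steps=10000", "--distribute=" + choose_strategy(8)]
parsed = parse_trainer_args(trainer_args)
print(parsed.epochs, parsed.steps, parsed.distribute)
```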

### Create CustomJob with worker pool specifications

Next, you create a `CustomJob` for the TPU training job:

- `display_name`: The display name for the custom job.

- `worker_pool_specs`: The detailed specifications for each worker pool.

In [None]:
DISPLAY_NAME = "boston_" + UUID

job = aip.CustomJob(display_name=DISPLAY_NAME, worker_pool_specs=WORKER_POOL_SPECS)

### Run the CustomJob

Next, you run the custom job.

In [None]:
try:
    job.run(sync=True)
except Exception as e:
    # The job may fail to find the startup script; report the error and continue.
    print(e)

### Delete a custom training job

After a training job completes, you can delete it with the `delete()` method. Before it completes, you can cancel it with the `cancel()` method.

In [None]:
job.delete()

## Cleaning up

To clean up all Google Cloud resources used in this project, you can [delete the Google Cloud project](https://cloud.google.com/resource-manager/docs/creating-managing-projects#shutting_down_projects) you used for the tutorial.

Otherwise, you can delete the individual resources you created in this tutorial:

- Cloud Storage bucket

In [None]:
# Set this to True only if you want to delete your bucket
delete_bucket = False

if delete_bucket or os.getenv("IS_TESTING"):
    ! gsutil rm -r $BUCKET_URI