In [None]:
# Copyright 2022 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# PyTorch Multi-Node Distributed Training with Torchrun on Vertex AI Training using a Custom Container

<table align="left">

  <td>
    <a href="https://colab.research.google.com/github/GoogleCloudPlatform/vertex-ai-samples/blob/main/notebooks/official/training/multi_node_ddp_nccl_vertex_training/multi_node_ddp_nccl_vertex_training_with_custom_container.ipynb">
      <img src="https://cloud.google.com/ml-engine/images/colab-logo-32px.png" alt="Colab logo"> Run in Colab
    </a>
  </td>
  <td>
    <a href="https://github.com/GoogleCloudPlatform/vertex-ai-samples/blob/main/notebooks/official/training/multi_node_ddp_nccl_vertex_training/multi_node_ddp_nccl_vertex_training_with_custom_container.ipynb">
      <img src="https://cloud.google.com/ml-engine/images/github-logo-32px.png" alt="GitHub logo">
      View on GitHub
    </a>
  </td>
  <td>
    <a href="https://console.cloud.google.com/vertex-ai/workbench/deploy-notebook?download_url=https://github.com/GoogleCloudPlatform/vertex-ai-samples/blob/main/notebooks/official/training/multi_node_ddp_nccl_vertex_training/multi_node_ddp_nccl_vertex_training_with_custom_container.ipynb">
      <img src="https://lh3.googleusercontent.com/UiNooY4LUgW_oTvpsNhPpQzsstV5W8F7rYgxgGBD85cWJoLmrOzhVs_ksK_vgx40SHs7jCqkTkCk=e14-rj-sc0xffffff-h130-w32" alt="Vertex AI logo">
      Open in Vertex AI Workbench
    </a>
  </td>                                                                                               
</table>

## Overview

This tutorial demonstrates how to run a multi-node distributed training job on Vertex AI with Torchrun. You train a PyTorch image classification model on multiple nodes, each configured with GPUs, using Vertex AI Training.

### Objective

In this notebook, you learn how to train an image classification model on multiple nodes using PyTorch's Torchrun.

This tutorial uses the following Google Cloud ML services and resources:
- Vertex AI Training
- Vertex AI Model Registry
- Vertex AI Experiments (Tensorboard)

The steps performed include:
- Install necessary libraries.
- Configure Cloud Storage and TensorBoard for training.
- Create a custom container to train the model using code from PyTorch Elastic's GitHub repository.
- Train the model using multiple nodes with GPUs.
- List the saved model files.

### Dataset

This notebook uses the [Vegetable Image Dataset](https://www.kaggle.com/datasets/misrakahmed/vegetable-image-dataset) for training the image classification model. This dataset consists of 15 classes, with 15,000 images in the training set and 3,000 images in the validation set. Each image is 224×224 pixels in **.jpg** format.

**Note**: The original dataset is restructured and made available in the public Cloud Storage bucket `cloud-samples-data`, from which this notebook reads the data.

### Costs 

This tutorial uses billable components of Google Cloud:

* Vertex AI
* Cloud Storage
* Cloud Build
* Artifact Registry

Learn about [Vertex AI
pricing](https://cloud.google.com/vertex-ai/pricing), [Cloud Storage
pricing](https://cloud.google.com/storage/pricing), [Cloud Build pricing](https://cloud.google.com/build/pricing), [Artifact Registry pricing](https://cloud.google.com/artifact-registry/pricing) and use the [Pricing
Calculator](https://cloud.google.com/products/calculator/)
to generate a cost estimate based on your projected usage.

## Installation

Install the following packages required to execute this notebook. 

In [None]:
# Install the packages
! pip3 install --user --upgrade google-cloud-aiplatform

### Colab only: Uncomment the following cell to restart the kernel.

In [None]:
# Automatically restart kernel after installs so that your environment can access the new packages
# import IPython

# app = IPython.Application.instance()
# app.kernel.do_shutdown(True)

## Before you begin

### Set up your Google Cloud project

**The following steps are required, regardless of your notebook environment.**

1. [Select or create a Google Cloud project](https://console.cloud.google.com/cloud-resource-manager). When you first create an account, you get a $300 free credit towards your compute/storage costs.

2. [Make sure that billing is enabled for your project](https://cloud.google.com/billing/docs/how-to/modify-project).

3. [Enable the Vertex AI API](https://console.cloud.google.com/flows/enableapi?apiid=aiplatform.googleapis.com).

4. If you are running this notebook locally, you need to install the [Cloud SDK](https://cloud.google.com/sdk).

#### Set your project ID

**If you don't know your project ID**, try the following:
* Run `gcloud config list`.
* Run `gcloud projects list`.
* See the support page: [Locate the project ID](https://support.google.com/googleapi/answer/7014113)

In [None]:
PROJECT_ID = "[your-project-id]"  # @param {type:"string"}

# Set the project id
! gcloud config set project {PROJECT_ID}
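
A small guard can catch a placeholder that was never replaced before any `gcloud` command runs. The following is a minimal sketch, not part of the original notebook: `is_valid_project_id` is a hypothetical helper, and it only checks the project ID format (6 to 30 characters, lowercase letters, digits, and hyphens, starting with a letter and not ending with a hyphen). It doesn't verify that the project exists.

```python
import re

# Hypothetical helper (an assumption, not part of the original notebook).
# Pattern: one leading letter, 4 to 28 middle characters, one trailing
# letter or digit, which gives a total length of 6 to 30 characters.
_PROJECT_ID_PATTERN = re.compile(r"[a-z][a-z0-9-]{4,28}[a-z0-9]")


def is_valid_project_id(project_id: str) -> bool:
    """Return True if the string has the shape of a Google Cloud project ID."""
    return _PROJECT_ID_PATTERN.fullmatch(project_id) is not None


# The unreplaced placeholder is rejected; a made-up sample ID is accepted.
print(is_valid_project_id("[your-project-id]"))
print(is_valid_project_id("my-sample-project"))
```

Calling the helper on `PROJECT_ID` before the `gcloud config set project` command is one way to fail early with a clear message.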

#### Region

You can also change the `REGION` variable used by Vertex AI. Learn more about [Vertex AI regions](https://cloud.google.com/vertex-ai/docs/general/locations).

In [1]:
REGION = "us-central1"  # @param {type: "string"}

### Authenticate your Google Cloud account

Depending on your Jupyter environment, you may have to manually authenticate. Follow the relevant instructions below.

**1. Vertex AI Workbench**
* Do nothing as you are already authenticated.

**2. Local JupyterLab instance, uncomment and run:**

In [2]:
# ! gcloud auth login

**3. Colab, uncomment and run:**

In [3]:
# from google.colab import auth
# auth.authenticate_user()

**4. Service account or other**
* See how to grant Cloud Storage permissions to your service account at https://cloud.google.com/storage/docs/gsutil/commands/iam#ch-examples.

### Create a Cloud Storage bucket

Create a storage bucket to store intermediate artifacts such as datasets.

In [None]:
BUCKET_URI = "gs://your-bucket-name-unique"  # @param {type:"string"}

**Only if your bucket doesn't already exist**: Run the following cell to create your Cloud Storage bucket.

In [None]:
! gsutil mb -l $REGION -p $PROJECT_ID $BUCKET_URI

### Enable Artifact Registry API

Enable the Artifact Registry API for your project if you didn't already do so during project setup.

Learn more about [Enabling service](https://cloud.google.com/artifact-registry/docs/enable-service).

In [None]:
! gcloud services enable artifactregistry.googleapis.com

### Create a private Docker repository

Create your own Docker repository in Artifact Registry.

1. Run the `gcloud artifacts repositories create` command to create a new Docker repository with your specified region and description.

2. Run the `gcloud artifacts repositories list` command to verify that your repository was created.

Set `REPOSITORY` to the name of your repository.

In [None]:
REPOSITORY = "[your-repo-name]"  # @param {type:"string"}

if REPOSITORY == "[your-repo-name]":
    REPOSITORY = "torchrun-imageclassify-repo"

In [None]:
# Create the repository in Artifact registry
! gcloud artifacts repositories create {REPOSITORY} --repository-format=docker --location={REGION} --description="Docker repository"

# List all repositories and check your repository
! gcloud artifacts repositories list

### Configure authentication to your private repo

Before you push or pull container images, configure Docker to use the `gcloud` command-line tool to authenticate requests to `Artifact Registry` for your region.

In [None]:
! gcloud auth configure-docker {REGION}-docker.pkg.dev --quiet

### Configure access to Vertex AI Tensorboard

In this notebook, you also create and use a tensorboard instance on Vertex AI to monitor your training process. To do so, you must have the `Vertex AI Tensorboard Web App User` or `Vertex AI Admin` IAM role. 

To access the Vertex TensorBoard web app, grant the above roles to your account through the [IAM Console](https://console.cloud.google.com/iam-admin/iam).

### Import libraries and define constants
Import the required Python libraries.

In [None]:
from datetime import datetime

from google.cloud import aiplatform

Define the constants used in this notebook.

In [None]:
# Set the Cloud Storage path to the dataset
DATASET_PATH = (
    "/gcs/cloud-samples-data/ai-platform-unified/datasets/images/vegetable_images"
)

# Set `content_name` to use as a common display name for the resources being created in the further steps.
content_name = "pytorch-imageclassify-multi-node"

### Initialize Vertex AI SDK

Initialize the Vertex AI SDK for Python for your project and corresponding bucket.

In [None]:
aiplatform.init(
    project=PROJECT_ID,
    staging_bucket=BUCKET_URI,
    location=REGION,
)

## Vertex AI Training using Vertex AI SDK and Custom Container

### Create a Vertex Tensorboard Instance

Create a Vertex AI Tensorboard instance in Vertex AI Experiments to monitor the training.

#### Option: Use a Previously Created Vertex Tensorboard Instance

In case you want to use an already created Tensorboard instance, replace the `tensorboard_name` with yours in the following cells and load the instance as given below.
```
tensorboard_name = "Your Tensorboard Resource Name or Tensorboard ID"
tensorboard = aiplatform.Tensorboard(tensorboard_name=tensorboard_name)
```

In [None]:
# Create the instance
tensorboard = aiplatform.Tensorboard.create(
    display_name=content_name,
)

In [None]:
# Check the resource name
TENSORBOARD_NAME = tensorboard.resource_name
print(TENSORBOARD_NAME)

### Build Custom Container

Next, you build and push a docker image to the created repository using Cloud Build. 

Learn more about the process of [Building and pushing a Docker image with Cloud Build](https://cloud.google.com/build/docs/build-push-docker-image).

In [None]:
CONTAINER_NAME = content_name + "-gpu"
TAG = "latest"
custom_container_image_uri = (
    f"{REGION}-docker.pkg.dev/{PROJECT_ID}/{REPOSITORY}/{CONTAINER_NAME}:{TAG}"
)
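
The URI built above follows the `LOCATION-docker.pkg.dev/PROJECT/REPOSITORY/IMAGE:TAG` layout. The sketch below wraps the same f-string in a function that also rejects empty or placeholder values. `build_image_uri` and the sample values are illustrative assumptions, not part of the original notebook.

```python
def build_image_uri(region, project_id, repository, image, tag="latest"):
    """Assemble an Artifact Registry Docker image URI from its parts.

    Hypothetical helper: it mirrors the f-string used in this notebook and
    rejects empty values or placeholders such as "[your-project-id]".
    """
    parts = {
        "region": region,
        "project_id": project_id,
        "repository": repository,
        "image": image,
        "tag": tag,
    }
    for name, value in parts.items():
        if not value or value.startswith("["):
            raise ValueError(f"{name} is not set: {value!r}")
    return f"{region}-docker.pkg.dev/{project_id}/{repository}/{image}:{tag}"


# Sample values only; substitute your own project and repository.
print(
    build_image_uri(
        "us-central1",
        "my-sample-project",
        "torchrun-imageclassify-repo",
        "pytorch-imageclassify-multi-node-gpu",
    )
)
```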

Building and pushing the Docker image with Cloud Build takes roughly 1 hour, so the timeout in the command below is set to `1h`. Increase the value if the build times out.

In [None]:
%cd trainer-gpu
!gcloud builds submit --timeout="1h" --region={REGION} --tag=$custom_container_image_uri
%cd ..

To confirm that the push succeeded, list the images in the created repository.

In [None]:
! gcloud artifacts docker images list $REGION-docker.pkg.dev/$PROJECT_ID/$REPOSITORY

### Run a CustomJob with GPUs using Vertex AI SDK

Configure your custom training job with GPUs and other required resources. The task is created with a `TIMESTAMP` suffix so that you can run the job multiple times under a different `display-name`.

The custom container takes the following arguments, as defined by the training task in the `trainer-gpu/main.py` file.

- `--data`: Path to dataset.
- `--arch`: Model architecture. Ex: resnet18 and other architectures available from *torchvision.models*.
- `--workers`: Number of data loading workers.
- `--batch-size`: Mini-batch size (default: 32), per worker (GPU).
- `--learning-rate`: Initial learning rate.
- `--weight-decay`: Weight decay (default: 1e-4).
- `--print-freq`: Print frequency (default: 10).
- `--dist-backend`: Distributed backend (`nccl` or `gloo`). This notebook uses `nccl`.
- `--checkpoint-file`: Checkpoint file path, to load and save to.
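
To make the argument list concrete, here is a minimal `argparse` sketch that accepts the same flags. It is not the actual `trainer-gpu/main.py`. Only the defaults stated in the list (batch size 32, weight decay 1e-4, print frequency 10) come from this notebook. Every other default, the types, and the `build_parser` name are assumptions for illustration, and `--epochs` is included only because the job configuration in this notebook passes it.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the flags listed above (not the real main.py)."""
    parser = argparse.ArgumentParser(description="Image classification trainer (sketch)")
    parser.add_argument("--data", required=True, help="Path to dataset")
    # Defaults marked "assumed" are placeholders, not values from the trainer.
    parser.add_argument("--arch", default="resnet18", help="torchvision.models architecture")  # assumed
    parser.add_argument("--workers", type=int, default=4, help="Data loading workers")  # assumed
    parser.add_argument("--epochs", type=int, default=1, help="Number of epochs")  # assumed
    parser.add_argument("--batch-size", type=int, default=32, help="Mini-batch size per GPU")
    parser.add_argument("--learning-rate", type=float, default=0.1, help="Initial learning rate")  # assumed
    parser.add_argument("--weight-decay", type=float, default=1e-4, help="Weight decay")
    parser.add_argument("--print-freq", type=int, default=10, help="Print frequency")
    parser.add_argument("--dist-backend", choices=["nccl", "gloo"], default="nccl")  # default assumed
    parser.add_argument("--checkpoint-file", default="checkpoint.pth.tar")  # assumed
    return parser


# Parse a sample argument list shaped like the one the training job passes.
args = build_parser().parse_args(
    ["--data=/gcs/some-bucket/images", "--batch-size=256", "--dist-backend=nccl"]
)
print(args.batch_size, args.weight_decay, args.print_freq, args.dist_backend)
```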

The training job uses Cloud Storage FUSE to load the dataset and to read and write the model checkpoints. Therefore, the paths use the `/gcs/` prefix rather than `gs://`.
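
The mapping between the two path styles is a plain prefix swap. The sketch below shows one way to derive the `/gcs/` mount path from a `gs://` URI. `gcs_uri_to_mount_path` is a hypothetical helper name, not something the notebook or the trainer defines.

```python
def gcs_uri_to_mount_path(uri: str) -> str:
    """Convert gs://bucket/object into the /gcs/bucket/object mount path.

    Hypothetical helper for illustration; raises ValueError for non-gs:// input.
    """
    prefix = "gs://"
    if not uri.startswith(prefix):
        raise ValueError(f"Expected a gs:// URI, got: {uri!r}")
    return "/gcs/" + uri[len(prefix):]


# The dataset location used in this notebook, written as a gs:// URI.
print(
    gcs_uri_to_mount_path(
        "gs://cloud-samples-data/ai-platform-unified/datasets/images/vegetable_images"
    )
)
```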

In [None]:
TIMESTAMP = datetime.now().strftime("%Y%m%d%H%M%S")
PRIMARY_COMPUTE = "n1-highmem-16"
TRAIN_COMPUTE = "n1-highmem-16"
NUM_CPUS = 14  # Set to a few less than max CPUs per instance for parallel data loading
TRAIN_GPU = "NVIDIA_TESLA_T4"
TRAIN_NGPU = 1
BATCH_SIZE = 256
REPLICAS = 4
EPOCHS = 5

display_name = (
    CONTAINER_NAME
    + f"-{REPLICAS}workers-{TRAIN_NGPU}{TRAIN_GPU}-{BATCH_SIZE}batch-"
    + TIMESTAMP
)
gcs_output_uri_prefix = f"{BUCKET_URI}/{display_name}"

CMDARGS = [
    f"--epochs={EPOCHS}",
    "--arch=resnet18",
    f"--batch-size={BATCH_SIZE}",
    "--dist-backend=nccl",
    f"--data={DATASET_PATH}",
    f"--workers={NUM_CPUS}",
    # Derive the /gcs/ mount path from BUCKET_URI (gs://bucket -> /gcs/bucket)
    f"--checkpoint-file={BUCKET_URI.replace('gs://', '/gcs/', 1)}/checkpoint.pth.tar",
]

CONTAINER_SPEC = {"image_uri": custom_container_image_uri, "args": CMDARGS}

PRIMARY_WORKER_POOL = {
    "replica_count": 1,
    "machine_spec": {
        "machine_type": PRIMARY_COMPUTE,
        "accelerator_count": TRAIN_NGPU,
        "accelerator_type": TRAIN_GPU,
    },
    "container_spec": CONTAINER_SPEC,
}

WORKER_POOL_SPECS = [PRIMARY_WORKER_POOL]

TRAIN_WORKER_POOL = {
    "replica_count": REPLICAS,
    "machine_spec": {
        "machine_type": TRAIN_COMPUTE,
        "accelerator_count": TRAIN_NGPU,
        "accelerator_type": TRAIN_GPU,
    },
    "container_spec": CONTAINER_SPEC,
}

WORKER_POOL_SPECS.append(TRAIN_WORKER_POOL)

job = aiplatform.CustomJob(
    display_name=display_name,
    base_output_dir=gcs_output_uri_prefix,
    worker_pool_specs=WORKER_POOL_SPECS,
)
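
Because the primary pool adds one replica on top of `REPLICAS`, the job spans more nodes than the `REPLICAS` value alone suggests. The sketch below totals nodes and GPUs across the pools and derives a global batch size. It assumes one training process per GPU and that the batch size is per GPU, as the argument list states. `summarize_worker_pools` and `sample_specs` are hypothetical names that copy the shape of the configuration above.

```python
def summarize_worker_pools(worker_pool_specs, per_gpu_batch_size):
    """Total the nodes and GPUs across worker pools and derive the global batch size.

    Assumes one training process per GPU, so the GPU count equals the world size.
    """
    nodes = sum(pool["replica_count"] for pool in worker_pool_specs)
    gpus = sum(
        pool["replica_count"] * pool["machine_spec"].get("accelerator_count", 0)
        for pool in worker_pool_specs
    )
    return {
        "nodes": nodes,
        "gpus": gpus,
        "global_batch_size": gpus * per_gpu_batch_size,
    }


# Sample specs shaped like the cell above: 1 primary node plus 4 workers, 1 GPU each.
machine_spec = {
    "machine_type": "n1-highmem-16",
    "accelerator_count": 1,
    "accelerator_type": "NVIDIA_TESLA_T4",
}
sample_specs = [
    {"replica_count": 1, "machine_spec": machine_spec},
    {"replica_count": 4, "machine_spec": machine_spec},
]
summary = summarize_worker_pools(sample_specs, per_gpu_batch_size=256)
print(summary)
```

With these sample values the job spans five nodes, so a per-GPU batch size of 256 corresponds to 1280 samples per optimization step across the job.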

Run the custom training job on Vertex AI Training with monitoring enabled through the created tensorboard instance.

Links to monitor the progress of the custom training job and the tensorboard instance are generated.

In [None]:
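# Assumption: SERVICE_ACCOUNT is not defined earlier; set it to your service account email.
SERVICE_ACCOUNT = "[your-service-account]"  # @param {type:"string"}
job.run(
    sync=True, tensorboard=tensorboard.resource_name, service_account=SERVICE_ACCOUNT
)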

## Check the Model artifacts

List the Cloud Storage bucket to see the model checkpoint files including the one with the best performance (named as `model_best.pth.tar`).

In [None]:
! gsutil ls -al $BUCKET_URI

## Cleaning up

To clean up all Google Cloud resources used in this project, you can [delete the Google Cloud
project](https://cloud.google.com/resource-manager/docs/creating-managing-projects#shutting_down_projects) you used for the tutorial.

Otherwise, you can delete the individual resources you created in this tutorial:

Set `delete_bucket` to **True** to delete the Cloud Storage bucket.

In [None]:
import os

delete_bucket = False

# Delete artifact repository
! gcloud artifacts repositories delete $REPOSITORY --location=$REGION --quiet

# Delete Tensorboard instance
! gcloud ai tensorboards delete $TENSORBOARD_NAME

# Delete Cloud Storage bucket
if delete_bucket or os.getenv("IS_TESTING"):
    ! gsutil -m rm -r $BUCKET_URI