In [None]:
# Copyright 2022 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# Distributed XGBoost training with Dask

<table align="left">

  <td>
    <a href="https://colab.research.google.com/github/GoogleCloudPlatform/vertex-ai-samples/blob/main/notebooks/official/training/xgboost_data_parallel_training_on_cpu_using_dask.ipynb">
      <img src="https://cloud.google.com/ml-engine/images/colab-logo-32px.png" alt="Colab logo"> Run in Colab
    </a>
  </td>
  <td>
    <a href="https://github.com/GoogleCloudPlatform/vertex-ai-samples/blob/main/notebooks/official/training/xgboost_data_parallel_training_on_cpu_using_dask.ipynb">
      <img src="https://cloud.google.com/ml-engine/images/github-logo-32px.png" alt="GitHub logo">
      View on GitHub
    </a>
  </td>
  <td>
    <a href="https://console.cloud.google.com/vertex-ai/workbench/deploy-notebook?download_url=https://raw.githubusercontent.com/GoogleCloudPlatform/vertex-ai-samples/main/notebooks/official/training/xgboost_data_parallel_training_on_cpu_using_dask.ipynb">
        <img src="https://lh3.googleusercontent.com/UiNooY4LUgW_oTvpsNhPpQzsstV5W8F7rYgxgGBD85cWJoLmrOzhVs_ksK_vgx40SHs7jCqkTkCk=e14-rj-sc0xffffff-h130-w32" alt="Vertex AI logo">
      Open in Vertex AI Workbench
    </a>
  </td>
</table>

## Overview

This tutorial shows you how to create a distributed custom training job using XGBoost with Dask on Vertex AI that can handle large amounts of training data.

Learn more about [Custom training](https://cloud.google.com/vertex-ai/docs/training/custom-training).

### Objective

In this tutorial, you learn how to create a distributed training job using XGBoost with Dask. You build a custom Docker container with a simple Dask configuration to run a custom training job. While your training job is running, you can access the Dask dashboard to monitor the real-time status of your cluster, resources, and computations.

This tutorial uses the following Google Cloud ML services:

- `Vertex AI Training`
- `Artifact Registry`

The steps performed include:

- Configure the `PROJECT_ID` and `REGION` variables for your Google Cloud project.
- Create a Cloud Storage bucket to store your model artifacts.
- Build a custom Docker container that hosts your training code and push the container image to Artifact Registry.
- Run a Vertex AI SDK `CustomContainerTrainingJob`.

### Dataset

This tutorial uses the <a href="https://scikit-learn.org/stable/auto_examples/datasets/plot_iris_dataset.html">Iris dataset</a>, which is used to predict the iris species.


### Costs

This tutorial uses billable components of Google Cloud:

* Vertex AI

* Cloud Storage

* Artifact Registry

Learn about [Vertex AI pricing](https://cloud.google.com/vertex-ai/pricing), [Cloud Storage pricing](https://cloud.google.com/storage/pricing), and [Artifact Registry pricing](https://cloud.google.com/artifact-registry/pricing), and use the [Pricing Calculator](https://cloud.google.com/products/calculator/) to generate a cost estimate based on your projected usage.


## Installation

Install the packages required for executing this notebook.

In [None]:
! pip3 install --upgrade --quiet google-cloud-aiplatform 

### Colab only: Uncomment the following cell to restart the kernel.

In [None]:
# Automatically restart kernel after installs so that your environment can access the new packages
# import IPython

# app = IPython.Application.instance()
# app.kernel.do_shutdown(True)

#### Set your project ID

**If you don't know your project ID**, try the following:
* Run `gcloud config list`.
* Run `gcloud projects list`.
* See the support page: [Locate the project ID](https://support.google.com/googleapi/answer/7014113)

In [None]:
PROJECT_ID = "[your-project-id]"  # @param {type:"string"}

# Set the project id
! gcloud config set project {PROJECT_ID}

#### Set the region

**Optional**: Update the 'REGION' variable to specify the region that you want to use. Learn more about [Vertex AI regions](https://cloud.google.com/vertex-ai/docs/general/locations).

In [None]:
REGION = "us-central1"  # @param {type: "string"}

#### UUID

If you are in a live tutorial session, you might be using a shared test account or project. To avoid name collisions between users on the resources created, you create a UUID for each session and append it to the names of the resources you create in this tutorial.

In [None]:
import random
import string


# Generate a uuid of a specified length (default=8)
def generate_uuid(length: int = 8) -> str:
    return "".join(random.choices(string.ascii_lowercase + string.digits, k=length))


UUID = generate_uuid()

### Authenticate your Google Cloud account

To authenticate your Google Cloud account, follow the instructions for your Jupyter environment:

**1. Vertex AI Workbench**
<br>You are already authenticated.

**2. Local JupyterLab instance**
<br>Uncomment and run the following code:

In [None]:
# ! gcloud auth login

**3. Colab**
<br>Uncomment and run the following code:

In [None]:
# from google.colab import auth

# auth.authenticate_user()

**4. Service account or other**
* See how to grant Cloud Storage permissions to your service account at https://cloud.google.com/storage/docs/gsutil/commands/iam#ch-examples.

### Create a Cloud Storage bucket

Create a storage bucket to store intermediate artifacts such as datasets.

In [None]:
BUCKET_URI = f"gs://your-bucket-name-{PROJECT_ID}-unique"  # @param {type:"string"}

**Only if your bucket doesn't already exist**: Run the following cell to create your Cloud Storage bucket.

In [None]:
! gsutil mb -l $REGION -p $PROJECT_ID $BUCKET_URI

### Import libraries and define constants

In [None]:
import os

from google.cloud import aiplatform

# Create a custom training Python package

Before you can run training, you must create a training script file and a Dockerfile.

Create a `trainer` directory for all of your training code.

In [None]:
PYTHON_PACKAGE_APPLICATION_DIR = "trainer"
!mkdir -p $PYTHON_PACKAGE_APPLICATION_DIR

### Write the Training Script

The `train.py` file checks whether the current node is the chief node or a worker node and runs `dask-scheduler` for the chief node and `dask-worker` for worker nodes. Worker nodes connect to the chief node through the IP address and port number specified in `CLUSTER_SPEC`.

After the Dask scheduler is set up and connected to worker nodes, call `xgb.dask.train` to train a model through Dask. Once model training is complete, the model is uploaded to `AIP_MODEL_DIR`.
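
As a rough illustration of the role check described above, the following sketch parses a `CLUSTER_SPEC`-style JSON string and derives the node's role and the scheduler address. The sample spec, its host names, and the helper name `resolve_role` are assumptions made for this example; the real value is injected into the container at run time.

```python
import json

# Hypothetical CLUSTER_SPEC value for a two-pool job. The host names and the
# port are made up for illustration only.
SAMPLE_CLUSTER_SPEC = json.dumps({
    "cluster": {
        "workerpool0": ["cmle-training-workerpool0-0:2222"],
        "workerpool1": ["cmle-training-workerpool1-0:2222"],
    },
    "task": {"type": "workerpool1", "index": 0},
    "open_ports": [7777],
})


def resolve_role(cluster_spec_json, default_port=7777):
    """Return (role, scheduler_address) derived from a CLUSTER_SPEC string."""
    spec = json.loads(cluster_spec_json)
    cluster = spec["cluster"]
    # Non-distributed jobs list their only node under "chief" instead.
    chief_entry = cluster.get("workerpool0", cluster.get("chief"))[0]
    chief_host = chief_entry.split(":")[0]
    # Fall back to a fixed port when no open port is advertised.
    port = spec.get("open_ports", [default_port])[0]
    role = "scheduler" if spec["task"]["type"] == "workerpool0" else "worker"
    return role, f"{chief_host}:{port}"


role, address = resolve_role(SAMPLE_CLUSTER_SPEC)
print(role, address)
```

With the sample spec, the node belongs to `workerpool1`, so it would act as a Dask worker and connect to the first node of `workerpool0`.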

In [None]:
%%writefile trainer/train.py
from dask.distributed import Client, wait
from xgboost.dask import DaskDMatrix
from google.cloud import storage
import xgboost as xgb
import dask.dataframe as dd
import sys
import os
import subprocess
import time
import json

IRIS_DATA_FILENAME = 'gs://cloud-samples-data/ai-platform/iris/iris_data.csv'
IRIS_TARGET_FILENAME = 'gs://cloud-samples-data/ai-platform/iris/iris_target.csv'
MODEL_FILE = 'model.bst'
MODEL_DIR = os.getenv("AIP_MODEL_DIR")
XGB_PARAMS = {
    'verbosity': 2,
    'learning_rate': 0.1,
    'max_depth': 8,
    'objective': 'reg:squarederror',
    'subsample': 0.6,
    'gamma': 1,
    'verbose_eval': True,
    'tree_method': 'hist',
    'nthread': 1
}


def square(x):
    return x ** 2

def neg(x):
    return -x

def launch(cmd):
    """Run a shell command, such as starting the dask scheduler or a dask worker."""
    return subprocess.check_call(cmd, stdout=sys.stdout, stderr=sys.stderr, shell=True)


def get_chief_ip(cluster_config_dict):
    if 'workerpool0' in cluster_config_dict['cluster']:
      ip_address = cluster_config_dict['cluster']['workerpool0'][0].split(":")[0]
    else:
      # if the job is not distributed, 'chief' will be populated instead of
      # workerpool0.
      ip_address = cluster_config_dict['cluster']['chief'][0].split(":")[0]

    print('The ip address of workerpool 0 is : {}'.format(ip_address))
    return ip_address

def get_chief_port(cluster_config_dict):

    if "open_ports" in cluster_config_dict:
      port = cluster_config_dict['open_ports'][0]
    else:
      # Use any port for the non-distributed job.
      port = 7777
    print("The open port is: {}".format(port))

    return port

if __name__ == '__main__':
    cluster_config_str = os.environ.get('CLUSTER_SPEC')
    cluster_config_dict  = json.loads(cluster_config_str)
    print(json.dumps(cluster_config_dict, indent=2))
    print('The workerpool type is:', flush=True)
    print(cluster_config_dict['task']['type'], flush=True)
    workerpool_type = cluster_config_dict['task']['type']
    chief_ip = get_chief_ip(cluster_config_dict)
    chief_port = get_chief_port(cluster_config_dict)
    chief_address = "{}:{}".format(chief_ip, chief_port)

    if workerpool_type == "workerpool0":
      print('Running the dask scheduler.', flush=True)
      proc_scheduler = launch('dask-scheduler --dashboard --dashboard-address 8888 --port {} &'.format(chief_port))
      print('Started the dask scheduler.', flush=True)

      client = Client(chief_address, timeout=1200)
      print('Waiting for workers to connect to the scheduler.', flush=True)
      client.wait_for_workers(1)

      X = dd.read_csv(IRIS_DATA_FILENAME, header=None)
      y = dd.read_csv(IRIS_TARGET_FILENAME, header=None)
      # persist() returns a new collection, so keep the returned references.
      X = X.persist()
      y = y.persist()
      wait(X)
      wait(y)
      dtrain = DaskDMatrix(client, X, y)

      output = xgb.dask.train(client, XGB_PARAMS, dtrain,  num_boost_round=100, evals=[(dtrain, 'train')])
      print("Output: {}".format(output), flush=True)
      print("Saving file to: {}".format(MODEL_FILE), flush=True)
      output['booster'].save_model(MODEL_FILE)
      bucket_name = MODEL_DIR.replace("gs://", "").split("/", 1)[0]
      folder = MODEL_DIR.replace("gs://", "").split("/", 1)[1]
      bucket = storage.Client().bucket(bucket_name)
      print("Uploading file to: {}/{}{}".format(bucket_name, folder, MODEL_FILE), flush=True)
      blob = bucket.blob('{}{}'.format(folder, MODEL_FILE))
      blob.upload_from_filename(MODEL_FILE)
      print("Saved file to: {}{}".format(MODEL_DIR, MODEL_FILE), flush=True)

      # Keep the job alive for 10 minutes so that you can open the Dask dashboard
      time.sleep(60 * 10)
      client.shutdown()

    else:
      print('Running the dask worker.', flush=True)
      client = Client(chief_address, timeout=1200)
      print('client: {}.'.format(client), flush=True)
      launch('dask-worker {}'.format(chief_address))
      print('Done with the dask worker.', flush=True)

      # Keep the job alive for 10 minutes so that you can open the Dask dashboard
      time.sleep(60 * 10)
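
The upload step in the script assumes that `AIP_MODEL_DIR` contains a path after the bucket name and ends with a trailing slash. A slightly more defensive way to split the URI might look like the following sketch. `split_gcs_uri` is a hypothetical helper that is not part of the script above, and the bucket name is a placeholder.

```python
def split_gcs_uri(uri):
    """Split 'gs://bucket/path' into (bucket_name, object_prefix)."""
    scheme = "gs://"
    if not uri.startswith(scheme):
        raise ValueError(f"Not a Cloud Storage URI: {uri}")
    bucket_name, _, prefix = uri[len(scheme):].partition("/")
    # Make sure a non-empty prefix ends with '/' so that a file name can be
    # appended to it directly.
    if prefix and not prefix.endswith("/"):
        prefix += "/"
    return bucket_name, prefix


bucket_name, folder = split_gcs_uri("gs://my-bucket/output/model")
print(bucket_name, folder + "model.bst")
```

The returned prefix can then be joined with `MODEL_FILE` to form the blob name, whether or not the original URI ended with a slash.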


### Write the Dockerfile
The Dockerfile is used to build the custom training container image, which is then passed to Vertex AI Training.

In [None]:
%%writefile Dockerfile
FROM us-docker.pkg.dev/vertex-ai/training/tf-cpu.2-9:latest
WORKDIR /root

# Update the keyring in order to run apt-get update.
RUN rm -rf /usr/share/keyrings/cloud.google.gpg
RUN rm -rf /etc/apt/sources.list.d/google-cloud-sdk.list
RUN curl https://packages.cloud.google.com/apt/doc/apt-key.gpg | sudo apt-key add -
RUN echo "deb https://packages.cloud.google.com/apt cloud-sdk main" | sudo tee -a /etc/apt/sources.list.d/google-cloud-sdk.list

RUN apt-get update
RUN apt-get install -y telnet netcat iputils-ping  net-tools
RUN python3.8 -m pip install 'xgboost>=1.4.2' 'dask-ml[complete]==2022.5.27' 'dask[complete]==2022.7.1' --upgrade
RUN python3.8 -m pip install dask==2022.7.1 distributed==2022.7.1 bokeh==2.4.3 dask-cuda==22.8.0  --upgrade
RUN python3.8 -m pip install gcsfs --upgrade


# Make sure gsutil will use the default service account
RUN echo '[GoogleCompute]\nservice_account = default' > /etc/boto.cfg

# Copies the trainer code
RUN mkdir /root/trainer
COPY trainer/train.py /root/trainer/train.py

# Sets up the entry point to invoke the trainer.
ENTRYPOINT ["python3.8", "trainer/train.py"]


## Create a custom training job

### Build a custom training container

#### Enable Artifact Registry API
You must enable the Artifact Registry API for your project. You will store your custom training container in Artifact Registry.

<a href="https://cloud.google.com/artifact-registry/docs/enable-service">Learn more about enabling the service</a>.


In [None]:
! gcloud services enable artifactregistry.googleapis.com

### Create a private Docker repository
Your first step is to create a Docker repository in Artifact Registry.

1 - Run the `gcloud artifacts repositories create` command to create a new Docker repository in your region with the description `Docker repository`.

2 - Run the `gcloud artifacts repositories list` command to verify that your repository was created.

In [None]:
PRIVATE_REPO = "my-docker-repo"

if os.getenv("IS_TESTING"):
    ! sudo apt-get update --yes && sudo apt-get --only-upgrade --yes install google-cloud-sdk-cloud-run-proxy google-cloud-sdk-harbourbridge google-cloud-sdk-cbt google-cloud-sdk-gke-gcloud-auth-plugin google-cloud-sdk-kpt google-cloud-sdk-local-extract google-cloud-sdk-minikube google-cloud-sdk-app-engine-java google-cloud-sdk-app-engine-go google-cloud-sdk-app-engine-python google-cloud-sdk-spanner-emulator google-cloud-sdk-bigtable-emulator google-cloud-sdk-nomos google-cloud-sdk-package-go-module google-cloud-sdk-firestore-emulator kubectl google-cloud-sdk-datastore-emulator google-cloud-sdk-app-engine-python-extras google-cloud-sdk-cloud-build-local google-cloud-sdk-kubectl-oidc google-cloud-sdk-anthos-auth google-cloud-sdk-app-engine-grpc google-cloud-sdk-pubsub-emulator google-cloud-sdk-datalab google-cloud-sdk-skaffold google-cloud-sdk google-cloud-sdk-terraform-tools google-cloud-sdk-config-connector
    ! gcloud components update --quiet

! gcloud artifacts repositories create {PRIVATE_REPO} --repository-format=docker --location={REGION} --description="Docker repository"

! gcloud artifacts repositories list

In [None]:
TRAIN_IMAGE = f"{REGION}-docker.pkg.dev/{PROJECT_ID}/{PRIVATE_REPO}/dask_support"
print("Training image:", TRAIN_IMAGE)

## Authenticate Docker to your repository
### Configure authentication to your private repo
Before you can push or pull container images to or from your Artifact Registry repository, you must configure Docker to use the gcloud command-line tool to authenticate requests to Artifact Registry for your region. On Colab, the `docker` command is not available, so you use Cloud Build instead.

In [None]:
import sys

IS_COLAB = "google.colab" in sys.modules
if not IS_COLAB:
    ! gcloud auth configure-docker {REGION}-docker.pkg.dev --quiet

### Build and push the custom Docker container image
If the `docker` command is available (that is, you are not on Colab), build and push the custom Docker container image:

1. Build the image from the Dockerfile and tag it with the Artifact Registry path.
2. Push the image to Artifact Registry.

In [None]:
if not IS_COLAB:
    ! docker build -t $TRAIN_IMAGE -f Dockerfile .
    ! docker push $TRAIN_IMAGE

## Build and push the custom docker container image by using Cloud Build

On Colab, build and push the Docker image with Cloud Build.

In [None]:
if IS_COLAB:
    !  gcloud builds submit --timeout=1800s --region={REGION} --tag $TRAIN_IMAGE

## Run training job with SDK (Option 1) or with gcloud (Option 2)

### 1.1 Initialize Vertex AI SDK

In [None]:
aiplatform.init(
    project=PROJECT_ID,
    staging_bucket=BUCKET_URI,
    location=REGION,
)

### 1.2 Run a Vertex AI SDK CustomContainerTrainingJob

You can specify the fields `enable_web_access` and `enable_dashboard_access`. `enable_web_access` enables the interactive shell for the job, and `enable_dashboard_access` allows the Dask dashboard to be accessed.

In [None]:
gcs_output_uri_prefix = f"{BUCKET_URI}/output"
replica_count = 2
machine_type = "n1-standard-4"
display_name = "test_display_name"
DEPLOY_IMAGE = "us-docker.pkg.dev/vertex-ai/prediction/tf2-cpu.2-8:latest"

custom_container_training_job = aiplatform.CustomContainerTrainingJob(
    display_name=display_name,
    model_serving_container_image_uri=DEPLOY_IMAGE,
    container_uri=TRAIN_IMAGE,
)

custom_container_training_job.run(
    base_output_dir=gcs_output_uri_prefix,
    replica_count=replica_count,
    machine_type=machine_type,
    enable_dashboard_access=True,
    enable_web_access=True,
    sync=False,
)

Wait a few minutes for the custom job to start. The following cell pauses for 10 minutes.

In [None]:
import time

time.sleep(60 * 10)
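
As an alternative to a fixed pause, you could poll the job until it reaches the state you need. The following is only a sketch: `wait_for_state` is a hypothetical helper, and the simulated state sequence stands in for a real state lookup on the training job, which is assumed here rather than shown.

```python
import time


def wait_for_state(get_state, target_states, timeout_s=600, poll_s=30,
                   sleep=time.sleep):
    """Poll get_state() until it returns a value in target_states.

    Raises TimeoutError when no target state is seen after timeout_s seconds
    of accumulated polling delay.
    """
    waited = 0
    while True:
        state = get_state()
        if state in target_states:
            return state
        if waited >= timeout_s:
            raise TimeoutError(f"Still in state {state!r} after {waited}s")
        sleep(poll_s)
        waited += poll_s


# Simulated state sequence. Replace the lambda with a real state lookup.
fake_states = iter(["PENDING", "PENDING", "RUNNING"])
final_state = wait_for_state(
    lambda: next(fake_states),
    target_states={"RUNNING"},
    sleep=lambda seconds: None,  # skip real sleeping in this demonstration
)
print(final_state)
```

Polling returns as soon as the job is running and fails loudly on a timeout, whereas a fixed sleep can be either too long or too short.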

In [None]:
try:
    print(f"Custom Training Job Name: {custom_container_training_job.resource_name}")
    print(f"GCS Output URI Prefix: {gcs_output_uri_prefix}")
except Exception as e:
    print(e)

You can access the link to the Custom Job in the Cloud Console UI here:

In [None]:
try:
    print(
        f"Custom Training Job URI: {custom_container_training_job._custom_job_console_uri()}"
    )
except Exception as e:
    print(e)

Once the job is in the state "RUNNING", you can access the web access and dashboard access URIs here:

In [None]:
try:
    print(
        f"Web Access and Dashboard URIs: {custom_container_training_job.web_access_uris}"
    )
except Exception as e:
    print(e)

The interactive shell has the key with the format "workerpool0-0", while the dashboard URI has the key with the format "workerpool0-0:" + port number (workerpool0-0:8888 in this example). On the page for your Custom Job in the Cloud Console UI, you can also click "Launch web terminal" for "workerpool0-0" for web access, or click "Launch web terminal" for "workerpool0-0:" + port number for dashboard access.
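
To make the key formats concrete, here is a small sketch that separates shell entries from dashboard entries in a dictionary shaped like the one described above. The helper name `split_access_uris` and the sample URLs are placeholders invented for this example.

```python
def split_access_uris(uris):
    """Separate interactive-shell entries from dashboard entries.

    Dashboard keys carry a ':<port>' suffix, while shell keys do not.
    """
    shells = {key: uri for key, uri in uris.items() if ":" not in key}
    dashboards = {key: uri for key, uri in uris.items() if ":" in key}
    return shells, dashboards


# Placeholder URLs. Real values are returned only while the job is running.
sample_uris = {
    "workerpool0-0": "https://example.com/shell",
    "workerpool0-0:8888": "https://example.com/dashboard",
}
shells, dashboards = split_access_uris(sample_uris)
print("Shell:", shells)
print("Dashboard:", dashboards)
```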

Note that you can only access an interactive shell and dashboard while the job is running. If you don't see "Launch web terminal" in the UI or the URIs in the output of the Web Access and Dashboard URIs command, this might be because Vertex AI hasn't started running your job yet, or because the job has already finished or failed. If the job's status is Queued or Pending, wait a minute, then refresh the page or try the command again.

### 2. Run a CustomContainerTraining Job with gcloud

You can also create a training job with the gcloud command. With the gcloud command, you can specify the fields `enableWebAccess` and `enableDashboardAccess`. `enableWebAccess` enables the interactive shell for the job, and `enableDashboardAccess` allows the Dask dashboard to be accessed.

In [None]:
%%bash -s "$BUCKET_URI/output" "$TRAIN_IMAGE"

cat <<EOF >config.yaml
enableDashboardAccess: true
enableWebAccess: true
# Creates two worker pools. The first worker pool is the chief and the second
# is a worker.
workerPoolSpecs:
  - machineSpec:
      machineType: n1-standard-8
    replicaCount: 1
    containerSpec:
      imageUri: $2
  - machineSpec:
      machineType: n1-standard-8
    replicaCount: 1
    containerSpec:
      imageUri: $2
baseOutputDirectory:
  outputUriPrefix: $1
EOF
cat config.yaml

The following command creates a training job.

In [None]:
! gcloud ai custom-jobs create --region={REGION} --config=config.yaml --display-name={display_name}

#### Access the dashboard and interactive shell for a gcloud custom job

Once the job is created, you can access the web access URI and dashboard access URI by using the `gcloud ai custom-jobs describe` command to print the field `webAccessUris`. The interactive shell has the key with the format "workerpool0-0", while the dashboard URI has the key with the format "workerpool0-0:" + port number (workerpool0-0:8888 in this example).
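
One possible way to read that field from Python is sketched below. It assumes an authenticated `gcloud` CLI on the machine that runs it. The helper names and the hand-written sample output are hypothetical, and the job ID passed to `fetch_web_access_uris` would be your own.

```python
import json
import subprocess


def describe_cmd(job_id, region):
    """Build the gcloud command that prints a custom job as JSON."""
    return ["gcloud", "ai", "custom-jobs", "describe", job_id,
            f"--region={region}", "--format=json"]


def extract_web_access_uris(describe_json):
    """Return the webAccessUris mapping, or {} if the job exposes none yet."""
    return json.loads(describe_json).get("webAccessUris", {})


def fetch_web_access_uris(job_id, region):
    """Run gcloud and parse its output (needs an authenticated gcloud CLI)."""
    result = subprocess.run(describe_cmd(job_id, region), check=True,
                            capture_output=True, text=True)
    return extract_web_access_uris(result.stdout)


# Offline demonstration with hand-written sample output and placeholder URLs.
sample_output = json.dumps({
    "displayName": "test_display_name",
    "webAccessUris": {
        "workerpool0-0": "https://example.com/shell",
        "workerpool0-0:8888": "https://example.com/dashboard",
    },
})
uris = extract_web_access_uris(sample_output)
print(sorted(uris))
```

An empty result usually means that the job isn't running yet or has already ended, which matches the behavior described in this section.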

You can also find the links in the Cloud Console UI. In the Vertex AI section, go to Training and then Custom Jobs. Click the name of your custom training job. On the page for your job, click "Launch web terminal" for "workerpool0-0" for web access, or click "Launch web terminal" for "workerpool0-0:" + port number for dashboard access.

Note that you can only access an interactive shell and dashboard while the job is running. If you don't see "Launch web terminal" in the UI or the URIs in the output of the gcloud command, this might be because Vertex AI hasn't started running your job yet, or because the job has already finished or failed. If the job's status is Queued or Pending, wait a minute, then refresh the page or try the gcloud command again.

#### Troubleshooting

The [interactive shell](https://cloud.google.com/vertex-ai/docs/training/monitor-debug-interactive-shell) can be used to debug access to the Dask dashboard. You can get the dashboard port by running the following command.

In [None]:
# Note the following command should run inside the interactive shell.
# printenv | grep AIP_DASHBOARD_PORT

Then you can check whether a dashboard process is listening on that port.

In [None]:
# Note the following command should run inside the interactive shell.
# netstat -ntlp
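
If `netstat` isn't convenient, a similar check can be sketched in Python. `is_port_listening` is a hypothetical helper. Inside the interactive shell you would pass it the port reported by `AIP_DASHBOARD_PORT`, whereas the demonstration below uses a temporary local listener so that it can run anywhere.

```python
import socket


def is_port_listening(port, host="127.0.0.1", timeout_s=1.0):
    """Return True if a TCP connection to host:port succeeds."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(timeout_s)
        # connect_ex returns 0 on success instead of raising an exception.
        return sock.connect_ex((host, port)) == 0


# Self-check against a temporary listener on an OS-assigned port.
with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as server:
    server.bind(("127.0.0.1", 0))
    server.listen(1)
    listening_port = server.getsockname()[1]
    open_result = is_port_listening(listening_port)

# The listener is closed at this point, so the same port should be refused.
closed_result = is_port_listening(listening_port)
print(open_result, closed_result)
```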

If nothing is listening on the port, you can start the dashboard instance manually.

In [None]:
# Note the following command should run inside the interactive shell.
# dask-scheduler --dashboard-address :port_number

### View training output artifact

In [None]:
! gsutil ls $gcs_output_uri_prefix/model/

# Cleaning up

To clean up all Google Cloud resources used in this project, you can [delete the Google Cloud
project](https://cloud.google.com/resource-manager/docs/creating-managing-projects#shutting_down_projects) you used for the tutorial.

Otherwise, you can delete the individual resources you created in this tutorial:

- Cloud Storage bucket
- Vertex AI training job
- Artifact Registry repository

In [None]:
import logging
import traceback

# Set this to true only if you'd like to delete your bucket
delete_bucket = False

! gsutil rm -rf $gcs_output_uri_prefix

if delete_bucket or os.getenv("IS_TESTING"):
    ! gsutil rm -r $BUCKET_URI

try:
    custom_container_training_job.delete()
except Exception as e:
    logging.error(traceback.format_exc())
    print(e)

! gcloud artifacts repositories delete {PRIVATE_REPO} --location={REGION} --quiet