In [None]:
# Copyright 2022 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

This notebook is an updated version of a notebook authored by [Jesus Chavez](https://github.com/jchavezar)

# E2E ML on GCP: MLOps stage 2 : experimentation: Get started with distributed training using DASK

<table align="left">
  <td>
    <a href="https://colab.research.google.com/github/GoogleCloudPlatform/vertex-ai-samples/blob/main/notebooks/community/ml_ops/stage2/get_started_with_dask_xgboost.ipynb">
      <img src="https://cloud.google.com/ml-engine/images/colab-logo-32px.png" alt="Colab logo"> Run in Colab
    </a>
  </td>
  <td>
    <a href="https://github.com/GoogleCloudPlatform/vertex-ai-samples/blob/main/notebooks/community/ml_ops/stage2/get_started_with_dask_xgboost.ipynb">
      <img src="https://cloud.google.com/ml-engine/images/github-logo-32px.png" alt="GitHub logo">
      View on GitHub
    </a>
  </td>
  <td>
    <a href="https://console.cloud.google.com/vertex-ai/workbench/deploy-notebook?download_url=https://raw.githubusercontent.com/GoogleCloudPlatform/vertex-ai-samples/main/notebooks/community/ml_ops/stage2/get_started_with_dask_xgboost.ipynb">
      <img src="https://lh3.googleusercontent.com/UiNooY4LUgW_oTvpsNhPpQzsstV5W8F7rYgxgGBD85cWJoLmrOzhVs_ksK_vgx40SHs7jCqkTkCk=e14-rj-sc0xffffff-h130-w32" alt="Vertex AI logo">
      Open in Vertex AI Workbench
    </a>
  </td>        
</table>
<br/><br/><br/>

## Overview

This tutorial demonstrates how to do distributed training for XGBoost models with `Vertex AI Training`. The tutorial includes using DASK for supporting distributed training for XGBoost and Scikit-learn models, and Flask as a web server for a custom serving container.

### Objective

In this tutorial, you learn how to use `Vertex AI Training` for distributed training of an XGBoost model using the OSS package DASK. Additionally, you learn to construct and deploy a custom serving container using a Flask web server.

This tutorial uses the following Google Cloud ML services and resources:

- `Vertex AI Training`
- `Vertex AI Prediction`
- `Vertex AI Model` resource
- `Vertex AI Endpoint` resource

The steps performed include:

- Construct an XGBoost training script using DASK for distributed training.
- Construct a custom training container.
- Configure a distributed custom training job.
- Execute the custom training job.
- Construct a custom serving container using Flask.
- Upload the trained XGBoost model as a `Vertex AI Model` resource.
- Create a `Vertex AI Endpoint` resource.
- Deploy the `Vertex AI Model` resource to `Vertex AI Endpoint` resource.
- Make a prediction.

### Dataset

The dataset used in this tutorial is the [Forest Cover Type](https://archive.ics.uci.edu/ml/datasets/covertype) dataset. The version used here is stored in CSV format in a public Cloud Storage bucket. The task is to predict the forest cover type from cartographic variables only.

### Costs

This tutorial uses billable components of Google Cloud:

* Vertex AI
* Cloud Storage

Learn about [Vertex AI
pricing](https://cloud.google.com/vertex-ai/pricing) and [Cloud Storage
pricing](https://cloud.google.com/storage/pricing), and use the [Pricing
Calculator](https://cloud.google.com/products/calculator/)
to generate a cost estimate based on your projected usage.

## Installation

Install the following packages to execute this notebook.

In [None]:
import os

# The Vertex AI Workbench Notebook product has specific requirements
IS_WORKBENCH_NOTEBOOK = os.getenv("DL_ANACONDA_HOME")
IS_USER_MANAGED_WORKBENCH_NOTEBOOK = os.path.exists(
    "/opt/deeplearning/metadata/env_version"
)

# Vertex AI Notebook requires dependencies to be installed with '--user'
USER_FLAG = ""
if IS_WORKBENCH_NOTEBOOK:
    USER_FLAG = "--user"

! pip3 install --upgrade google-cloud-aiplatform $USER_FLAG -q

### Restart the kernel

After you install the additional packages, you need to restart the notebook kernel so it can find the packages.

In [None]:
# Automatically restart kernel after installs
import os

if not os.getenv("IS_TESTING"):
    import IPython

    app = IPython.Application.instance()
    app.kernel.do_shutdown(True)

## Before you begin

### Set up your Google Cloud project

**The following steps are required, regardless of your notebook environment.**

1. [Select or create a Google Cloud project](https://console.cloud.google.com/cloud-resource-manager). When you first create an account, you get a $300 free credit towards your compute/storage costs.

2. [Make sure that billing is enabled for your project.](https://cloud.google.com/billing/docs/how-to/modify-project)

3. [Enable the following APIs: Vertex AI APIs, Compute Engine APIs, and Cloud Storage.](https://console.cloud.google.com/flows/enableapi?apiid=aiplatform.googleapis.com,compute_component,storage-component.googleapis.com)

4. If you are running this notebook locally, you need to install the [Cloud SDK](https://cloud.google.com/sdk).

5. Enter your project ID in the cell below. Then run the cell to make sure the
Cloud SDK uses the right project for all the commands in this notebook.

**Note**: Jupyter runs lines prefixed with `!` as shell commands, and it interpolates Python variables prefixed with `$`.

#### Set your project ID

**If you don't know your project ID**, you may be able to get your project ID using `gcloud`.

In [1]:
PROJECT_ID = "[your-project-id]"  # @param {type:"string"}

In [2]:
if PROJECT_ID == "" or PROJECT_ID is None or PROJECT_ID == "[your-project-id]":
    # Get your GCP project id from gcloud
    shell_output = ! gcloud config list --format 'value(core.project)' 2>/dev/null
    PROJECT_ID = shell_output[0]
    print("Project ID:", PROJECT_ID)

Project ID: andy-1234-221921


In [3]:
! gcloud config set project $PROJECT_ID

Updated property [core/project].


#### Region

You can also change the `REGION` variable, which is used for operations
throughout the rest of this notebook. The following regions are supported for Vertex AI. We recommend that you choose the region closest to you.

- Americas: `us-central1`
- Europe: `europe-west4`
- Asia Pacific: `asia-east1`

You may not use a multi-regional bucket for training with Vertex AI. Not all regions provide support for all Vertex AI services.

Learn more about [Vertex AI regions](https://cloud.google.com/vertex-ai/docs/general/locations).

In [4]:
REGION = "[your-region]"  # @param {type: "string"}

if REGION == "[your-region]":
    REGION = "us-central1"

#### Timestamp

If you are in a live tutorial session, you might be using a shared test account or project. To avoid name collisions between users on resources created, you create a timestamp for each instance session, and append the timestamp onto the name of resources you create in this tutorial.

In [5]:
from datetime import datetime

TIMESTAMP = datetime.now().strftime("%Y%m%d%H%M%S")
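
As a small illustration of the naming scheme described above, the sketch below appends a timestamp to a resource name prefix. The helper name `timestamped_name` is hypothetical and is not part of this notebook; a fixed time is used so the result is reproducible.

```python
from datetime import datetime


def timestamped_name(prefix: str, now: datetime) -> str:
    """Append a second-resolution timestamp to a resource name prefix."""
    return f"{prefix}_{now.strftime('%Y%m%d%H%M%S')}"


# Fixed time for a reproducible example; in the notebook you would pass datetime.now().
example = timestamped_name("covertype", datetime(2022, 8, 8, 18, 33, 59))
print(example)
```

Two users who start their sessions at different seconds get different resource names, which is usually enough to avoid collisions in a shared project.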

### Authenticate your Google Cloud account

**If you are using Vertex AI Workbench Notebooks**, your environment is already authenticated. Skip this step.

**If you are using Colab**, run the cell below and follow the instructions when prompted to authenticate your account via OAuth.

**Otherwise**, follow these steps:

In the Cloud Console, go to the [Create service account key](https://console.cloud.google.com/apis/credentials/serviceaccountkey) page.

**Click Create service account**.

In the **Service account name** field, enter a name, and click **Create**.

In the **Grant this service account access to project** section, click the Role drop-down list. Type "Vertex" into the filter box, and select **Vertex AI Administrator**. Type "Storage Object Admin" into the filter box, and select **Storage Object Admin**.

Click Create. A JSON file that contains your key downloads to your local environment.

Enter the path to your service account key as the GOOGLE_APPLICATION_CREDENTIALS variable in the cell below and run the cell.

In [6]:
# If you are running this notebook in Colab, run this cell and follow the
# instructions to authenticate your GCP account. This provides access to your
# Cloud Storage bucket and lets you submit training jobs and prediction
# requests.

import os
import sys

# If on Vertex AI Workbench, then don't execute this code
IS_COLAB = "google.colab" in sys.modules
if not os.path.exists("/opt/deeplearning/metadata/env_version") and not os.getenv(
    "DL_ANACONDA_HOME"
):
    if "google.colab" in sys.modules:
        from google.colab import auth as google_auth

        google_auth.authenticate_user()

    # If you are running this notebook locally, replace the string below with the
    # path to your service account key and run this cell to authenticate your GCP
    # account.
    elif not os.getenv("IS_TESTING"):
        %env GOOGLE_APPLICATION_CREDENTIALS ''

### Create a Cloud Storage bucket

**The following steps are required, regardless of your notebook environment.**

When you initialize the Vertex AI SDK for Python, you specify a Cloud Storage staging bucket. The staging bucket is where all the data associated with your dataset and model resources are retained across sessions.

Set the name of your Cloud Storage bucket below. Bucket names must be globally unique across all Google Cloud projects, including those outside of your organization.
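
Before creating the bucket, it can help to check the name locally. The sketch below is a simplified check that covers only part of the Cloud Storage naming rules (lowercase letters, digits, dashes, and underscores; 3 to 63 characters; starting and ending with a letter or digit; not starting with `goog`). Names containing dots follow additional rules that are not handled here. The function name is hypothetical.

```python
import re

# Simplified pattern: dot-free names only, 3 to 63 characters.
_SIMPLE_BUCKET_NAME = re.compile(r"^[a-z0-9][a-z0-9_-]{1,61}[a-z0-9]$")


def looks_like_valid_bucket_name(name: str) -> bool:
    """Return True if the name passes a simplified subset of the naming rules."""
    if name.startswith("goog"):
        return False
    return bool(_SIMPLE_BUCKET_NAME.match(name))


for candidate in ("andy-1234-221921aip-20220808183359", "[your-bucket-name]", "My_Bucket"):
    print(candidate, looks_like_valid_bucket_name(candidate))
```

Passing this check does not guarantee that the name is available, because uniqueness can only be verified by Cloud Storage itself.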

In [7]:
BUCKET_NAME = "[your-bucket-name]"  # @param {type:"string"}
BUCKET_URI = f"gs://{BUCKET_NAME}"

In [8]:
if BUCKET_NAME == "" or BUCKET_NAME is None or BUCKET_NAME == "[your-bucket-name]":
    BUCKET_NAME = PROJECT_ID + "aip-" + TIMESTAMP
    BUCKET_URI = "gs://" + BUCKET_NAME

**Only if your bucket doesn't already exist**: Run the following cell to create your Cloud Storage bucket.

In [9]:
! gsutil mb -l $REGION $BUCKET_URI

Creating gs://andy-1234-221921aip-20220808183359/...


Finally, validate access to your Cloud Storage bucket by examining its contents:

In [10]:
! gsutil ls -al $BUCKET_URI

### Enable Artifact Registry API

You must enable the Artifact Registry API service for your project.

Learn more about [Enabling service](https://cloud.google.com/artifact-registry/docs/enable-service).

In [11]:
! gcloud services enable artifactregistry.googleapis.com

### Set up variables

Next, set up some variables used throughout the tutorial.

### Import libraries and define constants

In [12]:
import google.cloud.aiplatform as aiplatform

### Initialize Vertex AI SDK for Python

Initialize the Vertex AI SDK for Python for your project and corresponding bucket.

In [13]:
aiplatform.init(project=PROJECT_ID, staging_bucket=BUCKET_URI)

## Introduction to DASK

Excerpts from XGBoost DASK documentation.

### What is DASK?

```
Dask is a parallel computing library built on Python. Dask allows easy management of distributed workers and excels at handling large distributed data science workflows. The implementation in XGBoost originates from dask-xgboost with some extended functionalities and a different interface. 
```

### XGBoost DASK Overview

```
A dask cluster consists of three different components: a centralized scheduler, one or more workers, and one or more clients which act as the user-facing entry point for submitting tasks to the cluster. When using XGBoost with dask, one needs to call the XGBoost dask interface from the client side. Below is a small example which illustrates basic usage of running XGBoost on a dask cluster
```
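
The quoted excerpt refers to an example that is not reproduced in this notebook. A minimal CPU-only sketch in the spirit of the XGBoost DASK tutorial might look like the following. The function name, the synthetic data, the cluster size, and the hyperparameters are illustrative assumptions, and the code requires the `dask`, `distributed`, and `xgboost` packages.

```python
def train_on_local_dask_cluster(n_rows: int = 1000, n_features: int = 10) -> dict:
    """Train a small XGBoost model on a local Dask cluster (CPU only)."""
    # Imported lazily so that defining the function does not require the packages.
    from dask import array as da
    from dask.distributed import Client, LocalCluster
    from xgboost import dask as dxgb

    chunk_rows = max(n_rows // 4, 1)
    # The cluster provides the scheduler and workers; the client submits the work.
    with LocalCluster(n_workers=2, threads_per_worker=1) as cluster:
        with Client(cluster) as client:
            # Synthetic data, split into chunks that Dask distributes to the workers.
            X = da.random.random((n_rows, n_features), chunks=(chunk_rows, n_features))
            y = da.random.randint(0, 2, size=n_rows, chunks=chunk_rows)

            dtrain = dxgb.DaskDMatrix(client, X, y)
            output = dxgb.train(
                client,
                {"objective": "binary:logistic", "tree_method": "hist"},
                dtrain,
                num_boost_round=5,
                evals=[(dtrain, "train")],
            )
            # `output` holds the trained booster and the evaluation history.
            return output["history"]


# Usage (needs dask, distributed and xgboost installed):
# history = train_on_local_dask_cluster()
```

The training script later in this notebook follows the same client-side pattern, but uses a GPU cluster and reads the data from a CSV file.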



Learn more about [XGBoost DASK](https://xgboost.readthedocs.io/en/stable/tutorials/dask.html)

Learn more about [DASK](https://www.dask.org/)

## Introduction to XGBoost training

After you train an XGBoost model, you save it to a Cloud Storage location so that it can subsequently be uploaded as a `Vertex AI Model` resource.
The XGBoost package cannot save a model directly to a Cloud Storage location. Instead, you do the following steps to save to a Cloud Storage location.

1. Save the in-memory model to the local filesystem (e.g., model.bst).
2. Use `google.cloud.storage` to copy the local copy to the specified Cloud Storage location.

### Examine the training package

#### Package layout

Before you start the training, you will look at how a Python package is assembled for a custom training job. When unarchived, the package contains the following directory/file layout.

- PKG-INFO
- README.md
- trainer
  - \_\_init\_\_.py
  - task.py

The file `trainer/task.py` is the Python script for executing the custom training job. *Note:* When you refer to the script in a worker pool specification, you replace the directory slash with a dot (`trainer.task`) and drop the file suffix (`.py`).
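
The conversion from a script path to a module name can be sketched as below. The helper `script_path_to_module` is a hypothetical name used only for illustration.

```python
from pathlib import PurePosixPath


def script_path_to_module(script_path: str) -> str:
    """Convert 'trainer/task.py' into the dotted module name 'trainer.task'."""
    path = PurePosixPath(script_path)
    if path.suffix != ".py":
        raise ValueError(f"Expected a .py file, got: {script_path}")
    # Drop the suffix, then join the directory parts with dots.
    return ".".join(path.with_suffix("").parts)


print(script_path_to_module("trainer/task.py"))
```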

#### Package Assembly

In the following cells, you assemble the training package.

In [14]:
# Make folder for Python training script
! rm -rf custom prediction
! mkdir custom prediction

# Add package information
! touch custom/README.md

pkg_info = "Metadata-Version: 1.0\n\nName: cover_type classification\n\nVersion: 0.0.0\n\nSummary: Demonstration training script\n\nHome-page: www.google.com\n\nAuthor: Google\n\nAuthor-email: aferlitsch@google.com\n\nLicense: Public\n\nDescription: Demo\n\nPlatform: Vertex"
! echo "$pkg_info" > custom/PKG-INFO

# Make the training subfolder
! mkdir custom/trainer
! touch custom/trainer/__init__.py

## Construct the training script with Dask + CUDA (GPU)

Next, you write your custom training script to train an XGBoost model using DASK for distributed training.

- `args`: Arguments passed to the training script:
  - `--dataset-source`: The Cloud Storage location of the CSV file containing the training data.
  - `--model-dir`: The Cloud Storage location to store the model artifacts.
  - `--model-name`: The file name of the model.
  - `--num-gpu-per-worker`: For the scheduler, the number of GPUs per worker. You must subsequently set `accelerator_count` in the job's `run()` method to the same number, so that this many GPUs are allocated.
  - `--threads-per-worker`: For efficiency, you should set the number of threads per worker to the number of GPUs.
- `get_scheduler_info()`: Gets VM/process scheduling related information for setting up a distributed cluster.
- `using_quantile_device_dmatrix()`: Does the distributed training:
  - Note that `client` is the cluster controller for the distributed training.
  - Reads the dataset from a CSV file using `dask_cudf.read_csv()`.
  - Splits the dataset into train/eval and preprocesses it.
  - Loads the datasets for distributed training using `dxgb.DaskDeviceQuantileDMatrix()`.
  - Does distributed training using `xgb.dask.train()`.
- `save_model()`: Uploads the saved model and evaluation metrics to the specified Cloud Storage location.
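
To make the argument handling concrete, the sketch below parses the same flags from an explicit list, so that it can be tried outside the training container. It assumes integer-typed GPU and thread counts; the function name `parse_training_args` is hypothetical.

```python
import argparse
import os
from typing import List, Optional


def parse_training_args(argv: Optional[List[str]] = None) -> argparse.Namespace:
    """Parse the training script flags; argv=None falls back to sys.argv."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--dataset-source", dest="dataset", type=str)
    # Vertex AI sets AIP_MODEL_DIR inside the training container.
    parser.add_argument("--model-dir", default=os.getenv("AIP_MODEL_DIR"))
    parser.add_argument("--model-name", default="custom-train")
    parser.add_argument("--num-gpu-per-worker", type=int, default=2)
    parser.add_argument("--threads-per-worker", type=int, default=4)
    return parser.parse_args(argv)


args = parse_training_args(
    ["--dataset-source", "gs://vtx-datasets-public/cover_type_4Mrows.csv",
     "--num-gpu-per-worker", "4"]
)
print(args.dataset, args.num_gpu_per_worker, args.threads_per_worker)
```

Note that `argparse` converts dashes in flag names to underscores, so `--num-gpu-per-worker` is read as `args.num_gpu_per_worker`.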

In [30]:
%%writefile custom/trainer/task.py

import argparse
import os
import logging
import dask_cudf
import xgboost as xgb
import pandas as pd
import subprocess
import time
from google.cloud import storage

from dask_cuda import LocalCUDACluster
from dask.distributed import Client
from dask.distributed import wait
from dask import array as da
from xgboost import dask as dxgb
from xgboost.dask import DaskDMatrix
from dask.utils import parse_bytes

parser = argparse.ArgumentParser()
parser.add_argument(
    '--dataset-source', dest='dataset',
    type=str,
    help='Dataset.')
parser.add_argument(
    '--model-dir',
    default=os.getenv('AIP_MODEL_DIR'),
    help='GCS location to export models')
parser.add_argument(
    '--model-name',
    default="custom-train",
    help='The name of your saved model')
parser.add_argument(
    '--num-gpu-per-worker', type=int, help='number of GPUs (one Dask worker per GPU)',
    default=2)
parser.add_argument(
    '--threads-per-worker', type=int, help='number of threads per worker',
    default=4)

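args = parser.parse_args()

# Without this, INFO-level messages are dropped by the default root logger.
logging.basicConfig(level=logging.INFO)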


def save_model(model_dir):
    logging.info(f"Reading input job_dir: {model_dir}")
    model_dir = model_dir.split("/")
    bucket_name = model_dir[2]
    object_prefix = "/".join(model_dir[3:]).rstrip("/")
    logging.info(f"Reading object_prefix: {object_prefix}")

    if object_prefix:
        model_path = '{}/{}'.format(object_prefix, "xgboost")
    else:
        model_path = '{}'.format("xgboost")
            
    logging.info(f"The model path is {model_path}")
    bucket = storage.Client().bucket(bucket_name)    
    local_path = os.path.join("/tmp", "xgboost")
    
    files = [f for f in os.listdir(local_path) if os.path.isfile(os.path.join(local_path, f))]
    for file in files:
        local_file = os.path.join(local_path, file)
        blob = bucket.blob("/".join([model_path, file]))
        blob.upload_from_filename(local_file)
        logging.info(f"Uploaded {local_file}")
    logging.info(f"Saved model files in gs://{bucket_name}/{model_path}")

        
def using_quantile_device_dmatrix(client: Client, 
                                  dataset_source: str, 
                                  model_dir: str, 
                                  model_name: str):
    
    start_time = time.time()
    logging.info(f"Importing dataset {dataset_source}")
    df = dask_cudf.read_csv(dataset_source)

    logging.info("Cleaning and standardizing dataset")
    df = df.dropna() 

    logging.info("Splitting dataset")
    df_train, df_eval = df.random_split([0.8, 0.2], random_state=123)
    df_train_features= df_train.drop('Cover_Type', axis=1)
    df_eval_features= df_eval.drop('Cover_Type', axis=1)
    df_train_labels = df_train.pop('Cover_Type')
    df_eval_labels = df_eval.pop('Cover_Type')

    logging.info("Train Dataset for dask")
    dtrain = dxgb.DaskDeviceQuantileDMatrix(client, df_train_features, df_train_labels)
    
    logging.info("Eval Dataset for dask")
    dvalid = dxgb.DaskDeviceQuantileDMatrix(client, df_eval_features, df_eval_labels)
    logging.info("[INFO]: ------ QuantileDMatrix is formed in {} seconds ---".format((time.time() - start_time)))

    del df_train_features
    del df_train_labels
    del df_eval_features
    del df_eval_labels 
    
    start_time = time.time()
    
    logging.info("Training")
    logging.info(f"XGBoost version: {xgb.__version__}")
    output = xgb.dask.train(
        client,
        {
            "verbosity": 2, 
            "tree_method": "gpu_hist", 
            "objective": "multi:softprob",
            "eval_metric": ["mlogloss"],
            "learning_rate": 0.1,
            "gamma": 0.9,
            "subsample": 0.5,
            "max_depth": 9,
            "num_class": 8
        },
        dtrain,
        num_boost_round=10,
        evals=[(dvalid, "valid1")],
        early_stopping_rounds=5
    ) 
    print("[INFO]: ------ Training is completed in {} seconds ---".format((time.time() - start_time)))

    # Saving models and exporting performance metrics
    
    df_eval_metrics = pd.DataFrame(output["history"]["valid1"])
    model = output["booster"]
    # best_iteration is zero-based, so add 1 to include the best round in the slice
    best_model = model[: model.best_iteration + 1]
    print(f"Best model: {best_model}")
    
    temp_dir = "/tmp/xgboost"
    os.makedirs(temp_dir, exist_ok=True)
    best_model.save_model("{}/{}".format(temp_dir, model_name))
    df_eval_metrics.to_json("{}/all_results.json".format(temp_dir))

    save_model(model_dir)
        
def get_scheduler_info():
    scheduler_ip =  subprocess.check_output(['hostname','--all-ip-addresses'])
    scheduler_ip = scheduler_ip.decode('UTF-8').split()[0]
    scheduler_port = '8786'
    scheduler_uri = '{}:{}'.format(scheduler_ip, scheduler_port)
    return scheduler_ip, scheduler_uri

if __name__ == '__main__':
    print("Creating dask cluster")
    
    sched_ip, sched_uri = get_scheduler_info()
    
    print(f"Sched_ip and Sched_uri, {sched_ip}, {sched_uri}")

    print("[INFO]: ------ LocalCUDACluster is being formed ")
    
    with LocalCUDACluster(
        ip=sched_ip,
        n_workers=int(args.num_gpu_per_worker), 
        threads_per_worker=int(args.threads_per_worker) 
    ) as cluster:
        with Client(cluster) as client:
            print('[INFO]: ------ Calling main function ')
            using_quantile_device_dmatrix(client, 
                                          dataset_source=args.dataset, 
                                          model_dir=args.model_dir, 
                                          model_name=args.model_name
                                         )

Overwriting custom/trainer/task.py


### Construct custom training container

Next, you construct a custom (Docker) container for training an XGBoost model with GPU (CUDA) support. As a base image, you use an image from Rapids AI with CUDA support. You then install into the image the packages for XGBoost, `gcsfs`, and Cloud Storage.

*Note:* Currently, the Vertex AI pre-built containers for XGBoost are CPU only; there is no pre-built XGBoost container with GPU support.

In [31]:
%%writefile custom/Dockerfile

FROM rapidsai/rapidsai-nightly:22.04-cuda11.2-base-ubuntu20.04-py3.9

RUN pip install google.cloud[storage] \
  && pip install gcsfs \
  && pip install xgboost --upgrade

COPY trainer trainer/

ENTRYPOINT ["python", "trainer/task.py"]

Overwriting custom/Dockerfile


## Create a private Docker repository

Next, you create your own Docker repository in Google Artifact Registry.

1. Run the `gcloud artifacts repositories create` command to create a new Docker repository in your region with the description "Docker repository".

2. Run the `gcloud artifacts repositories list` command to verify that your repository was created.

In [32]:
PRIVATE_REPO = "my-docker-repo"

! gcloud artifacts repositories create {PRIVATE_REPO} --repository-format=docker --location={REGION} --description="Docker repository"

! gcloud artifacts repositories list

ERROR: (gcloud.artifacts.repositories.create) ALREADY_EXISTS: the repository already exists
Listing items under project andy-1234-221921, across all locations.

                                                                        ARTIFACT_REGISTRY
REPOSITORY                              FORMAT  DESCRIPTION                     LOCATION     LABELS  ENCRYPTION          CREATE_TIME          UPDATE_TIME
custom-preprocess-container-prediction  DOCKER                                  us-central1          Google-managed key  2022-03-11T00:47:46  2022-07-29T18:59:32
my-docker-repo                          DOCKER  Docker repository               us-central1          Google-managed key  2022-08-05T03:10:25  2022-08-08T21:29:53
tpu-training-repository                 DOCKER  Vertex TPU training repository  us-central1          Google-managed key  2022-02-02T21:36:53  2022-02-03T00:15:12
trainings                               DOCKER                                  us-central1       

### Configure authentication to your private repo

Before you push or pull container images, configure Docker to use the `gcloud` command-line tool to authenticate requests to `Artifact Registry` for your region.
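
The registry host and the image URIs used in the following cells all derive from the region. A minimal sketch of that composition is shown below; the helper names are hypothetical, and `my-project` is a placeholder project ID.

```python
def artifact_registry_host(region: str) -> str:
    """Return the regional Artifact Registry Docker host."""
    return f"{region}-docker.pkg.dev"


def image_uri(region: str, project_id: str, repository: str, image: str, tag: str = "latest") -> str:
    """Compose a full image URI: HOST/PROJECT/REPOSITORY/IMAGE:TAG."""
    return f"{artifact_registry_host(region)}/{project_id}/{repository}/{image}:{tag}"


print(artifact_registry_host("us-central1"))
print(image_uri("us-central1", "my-project", "my-docker-repo", "train_gpu_xgb"))
```

The host printed first is the value that must be registered with the Docker credential helper for pushes to that region to succeed.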

In [33]:
! gcloud auth configure-docker {REGION}-docker.pkg.dev --quiet


{
  "credHelpers": {
    "gcr.io": "gcloud",
    "us.gcr.io": "gcloud",
    "eu.gcr.io": "gcloud",
    "asia.gcr.io": "gcloud",
    "staging-k8s.gcr.io": "gcloud",
    "marketplace.gcr.io": "gcloud",
    "us-central1-docker.pkg.dev": "gcloud"
  }
}
Adding credentials for: us-central1-docker.pkg.dev
gcloud credential helpers already registered correctly.


## Build the custom training container

Next, you build your custom training (Docker) image.

In [34]:
TRAIN_IMAGE = (
    f"{REGION}-docker.pkg.dev/{PROJECT_ID}/{PRIVATE_REPO}/train_gpu_xgb:latest"
)

#! gcloud builds submit -t $TRAIN_IMAGE custom/.

! docker build custom -t $TRAIN_IMAGE
! docker push $TRAIN_IMAGE

Sending build context to Docker daemon  10.75kB
Step 1/4 : FROM rapidsai/rapidsai-nightly:22.04-cuda11.2-base-ubuntu20.04-py3.9
 ---> 6f5057ed56a0
Step 2/4 : RUN pip install google.cloud[storage]   && pip install gcsfs   && pip install xgboost --upgrade
 ---> Using cache
 ---> d0505852fede
Step 3/4 : COPY trainer trainer/
 ---> d1181e973754
Step 4/4 : ENTRYPOINT ["python", "trainer/task.py"]
 ---> Running in 2ed6510a8ae3
Removing intermediate container 2ed6510a8ae3
 ---> 4e748cd92992
Successfully built 4e748cd92992
Successfully tagged us-central1-docker.pkg.dev/andy-1234-221921/my-docker-repo/train_gpu_xgb:latest
The push refers to repository [us-central1-docker.pkg.dev/andy-1234-221921/my-docker-repo/train_gpu_xgb]

94499082: Preparing
7412b417: Preparing
d20a3982: Preparing
177f7dba: Preparing
e2396578: Preparing
749f56f5: Preparing
a57855d6: Preparing
eca41527: Preparing
32e4a10b: Preparing
c89280f3: Preparing
d1e3350d: Preparing

## Construct the serving container using Flask

In this tutorial, the model will be served using a custom serving container. You construct the HTTP server for the health and prediction routes using Flask.

*Note:* The project ID is hardcoded in `app.py` below. Replace it with your own project ID before you build the container.

In [45]:
%%writefile prediction/app.py

import os
import logging
import pandas as pd
import xgboost as xgb
from flask import Flask, request, Response, jsonify
from google.cloud import storage

#client = storage.Client(project=os.environ['PROJECT_ID'])
client = storage.Client(project='andy-1234-221921')

# Model Download from gcs

fname = "model.json"

with open(fname, "wb") as model:
    client.download_blob_to_file(
        f"{os.environ['AIP_STORAGE_URI']}/{fname}", model
    )

# Loading model
print("Loading model from: {}".format(fname))
model = xgb.Booster(model_file=fname)

# Creation of the Flask app
app = Flask(__name__)

# Flask route for Liveness checks
@app.route(os.environ['AIP_HEALTH_ROUTE'])
def isalive():
    status_code = Response(status=200)
    return status_code

# Flask route for predictions
@app.route(os.environ['AIP_PREDICT_ROUTE'],methods=['GET','POST'])
def prediction():
    _features = ['Id','Elevation', 'Aspect', 'Slope', 'Horizontal_Distance_To_Hydrology', 'Vertical_Distance_To_Hydrology', 'Horizontal_Distance_To_Roadways',
                          'Hillshade_9am', 'Hillshade_Noon', 'Hillshade_3pm','Horizontal_Distance_To_Fire_Points', 'Wilderness_Area1', 'Wilderness_Area2', 'Wilderness_Area3', 
                          'Wilderness_Area4', 'Soil_Type1', 'Soil_Type2', 'Soil_Type3', 'Soil_Type4', 'Soil_Type5', 'Soil_Type6', 'Soil_Type7', 'Soil_Type8', 'Soil_Type9',
                          'Soil_Type10','Soil_Type11','Soil_Type12','Soil_Type13','Soil_Type14','Soil_Type15','Soil_Type16','Soil_Type17','Soil_Type18','Soil_Type19', 
                          'Soil_Type20', 'Soil_Type21', 'Soil_Type22', 'Soil_Type23', 'Soil_Type24', 'Soil_Type25', 'Soil_Type26', 'Soil_Type27', 'Soil_Type28', 'Soil_Type29',
                          'Soil_Type30', 'Soil_Type31', 'Soil_Type32', 'Soil_Type33', 'Soil_Type34', 'Soil_Type35', 'Soil_Type36', 'Soil_Type37', 'Soil_Type38', 'Soil_Type39', 'Soil_Type40']
    data = request.get_json(silent=True, force=True)
    dmf = xgb.DMatrix(pd.DataFrame(data["instances"], columns=_features))
    response = pd.DataFrame(model.predict(dmf))
    logging.info(f"Response: {response}")
    return jsonify({"Cover Type": str(response.idxmax(axis=1)[0])})

if __name__ == "__main__":
    app.run(debug=True, host='0.0.0.0', port=8080)

Overwriting prediction/app.py
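
The prediction route above returns the class for the first instance only (`idxmax(axis=1)[0]`). The sketch below shows, in plain Python and without Flask or XGBoost, how a request body could be validated and how a class could be picked for every instance. The function names are hypothetical, and the probability rows stand in for the output of `model.predict()`.

```python
from typing import Dict, List, Sequence


def extract_instances(body: Dict) -> List[List[float]]:
    """Return the 'instances' list from a request body, or raise ValueError."""
    instances = body.get("instances") if isinstance(body, dict) else None
    if not isinstance(instances, list) or not instances:
        raise ValueError("Request body must contain a non-empty 'instances' list")
    return instances


def argmax(values: Sequence[float]) -> int:
    """Index of the largest value (the first one wins on ties)."""
    return max(range(len(values)), key=lambda i: values[i])


def to_predicted_classes(probabilities: Sequence[Sequence[float]]) -> List[int]:
    """Pick the most probable class index for each row of class probabilities."""
    return [argmax(row) for row in probabilities]


# Stand-in for model.predict() output with three classes and two instances.
fake_probabilities = [[0.1, 0.7, 0.2], [0.5, 0.3, 0.2]]
print(to_predicted_classes(fake_probabilities))
```

Validating the body before building the `DMatrix` lets the server reject a malformed request with a clear error instead of failing inside the model call.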


### Construct the package requirements for your custom serving container

Next, you construct the requirements file for your custom serving container.

In [46]:
%%writefile prediction/requirements.txt

google-cloud-storage
numpy
pandas
flask
xgboost

Overwriting prediction/requirements.txt


### Construct custom prediction container

Next, you construct a custom (Docker) container for serving predictions from your deployed XGBoost model.


In [47]:
%%writefile prediction/Dockerfile

FROM python:3.7-buster

RUN mkdir my-model

COPY app.py ./app.py
COPY requirements.txt ./requirements.txt
RUN pip install -r requirements.txt 

# Flask Env Variable
ENV FLASK_APP=app

# Expose port 8080
EXPOSE 8080

CMD flask run --host=0.0.0.0 --port=8080

Overwriting prediction/Dockerfile


## Build the custom serving container

Next, you build your custom prediction (Docker) image.

In [48]:
DEPLOY_IMAGE = (
    f"{REGION}-docker.pkg.dev/{PROJECT_ID}/{PRIVATE_REPO}/predict_gpu_xgb:latest"
)

! docker build prediction -t $DEPLOY_IMAGE
! docker push $DEPLOY_IMAGE

Sending build context to Docker daemon   7.68kB
Step 1/8 : FROM python:3.7-buster
 ---> 1902e2432d77
Step 2/8 : RUN mkdir my-model
 ---> Using cache
 ---> ac56a1c65a6c
Step 3/8 : COPY app.py ./app.py
 ---> 2f9cb0c81571
Step 4/8 : COPY requirements.txt ./requirements.txt
 ---> 114316b4750e
Step 5/8 : RUN pip install -r requirements.txt
 ---> Running in 573bb5282aa7
Collecting google-cloud-storage
  Downloading google_cloud_storage-2.5.0-py2.py3-none-any.whl (106 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 107.0/107.0 KB 4.2 MB/s eta 0:00:00
Collecting numpy
  Downloading numpy-1.21.6-cp37-cp37m-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (15.7 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 15.7/15.7 MB 64.1 MB/s eta 0:00:00
Collecting pandas
  Downloading pandas-1.3.5-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (11.3 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 11.3/11.3 MB 74.7 MB/s eta 0:00:00
Collecting flask
  Downloading Flask-2.2.1-py3-none-any.whl (1

#### Store training script on your Cloud Storage bucket

Next, you package the training folder into a compressed tarball, and then store it in your Cloud Storage bucket.

In [49]:
! rm -f custom.tar custom.tar.gz
! tar cvf custom.tar custom
! gzip custom.tar
! gsutil cp custom.tar.gz $BUCKET_URI/trainer_covertype.tar.gz

custom/
custom/PKG-INFO
custom/Dockerfile
custom/README.md
custom/trainer/
custom/trainer/task.py
custom/trainer/__init__.py
Copying file://custom.tar.gz [Content-Type=application/x-tar]...
/ [1 files][  2.5 KiB/  2.5 KiB]                                                
Operation completed over 1 objects/2.5 KiB.                                      
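
The same archive can also be produced from Python with the standard `tarfile` module, which may be convenient where shell commands are unavailable. This is a sketch under that assumption; the helper name `make_tarball` is hypothetical, and the demonstration builds a throwaway copy of the package layout rather than using your real `custom` folder.

```python
import os
import tarfile
import tempfile


def make_tarball(source_dir: str, output_path: str) -> list:
    """Create a gzip-compressed tar archive of source_dir; return its member names."""
    top_level = os.path.basename(os.path.normpath(source_dir))
    with tarfile.open(output_path, "w:gz") as archive:
        archive.add(source_dir, arcname=top_level)
    with tarfile.open(output_path, "r:gz") as archive:
        return sorted(archive.getnames())


# Self-contained demonstration using a temporary directory.
with tempfile.TemporaryDirectory() as workdir:
    package_dir = os.path.join(workdir, "custom")
    os.makedirs(os.path.join(package_dir, "trainer"))
    for relative in ("PKG-INFO", "README.md", "trainer/__init__.py", "trainer/task.py"):
        with open(os.path.join(package_dir, relative), "w") as handle:
            handle.write("")
    members = make_tarball(package_dir, os.path.join(workdir, "custom.tar.gz"))

print(members)
```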


### Create and run custom training job


To train a custom model, you perform two steps: 1) create a custom training job, and 2) run the job.

#### Create custom training job

A custom training job is created with the `CustomContainerTrainingJob` class, with the following parameters:

- `display_name`: The human-readable name for the custom training job.
- `container_uri`: The training container image.
- `command`: The command (e.g., interpreter) and script to invoke within the container.
- `model_serving_container_image_uri`: The corresponding serving container to be used with the model when it is deployed.

*Note:* When provided, the `command` parameter overrides the `ENTRYPOINT` defined in the container image.

In [50]:
DISPLAY_NAME = "covertype_" + TIMESTAMP

job = aiplatform.CustomContainerTrainingJob(
    display_name=DISPLAY_NAME,
    container_uri=TRAIN_IMAGE,
    command=["python3", "trainer/task.py"],
    model_serving_container_image_uri=DEPLOY_IMAGE,
)

#### Run the custom container training job

Next, you run the custom job to start the training job by invoking the method `run()`, with the following parameters:

- `model_display_name`: When the training job is completed, the model artifacts will be automatically uploaded as a `Vertex AI Model` resource, with the specified display name.
- `args`: The command line arguments to pass to the training script:
    - `dataset-source`: The Cloud Storage location to the CSV dataset file.
    - `model-name`: The file name for the model artifacts.
    - `num-gpu-per-worker`: The number of GPUs per VM instance (worker).
    - `threads-per-worker`: The number of training process threads per VM instance.
- `replica_count`: The number of VM instances.
- `machine_type`: The machine type per VM instance.
- `accelerator_type`: The type of GPU, if any.
- `accelerator_count`: The number of GPUs per VM instance, if any.
- `base_output_dir`: The Cloud Storage location to save the model artifacts to.
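
The flags listed above arrive in the training script as ordinary command-line arguments. As a minimal sketch of how a script could read them, the following uses `argparse`. The function name `parse_args` and the default values are assumptions for illustration and are not taken from this tutorial's `trainer/task.py`.

```python
import argparse


def parse_args(argv=None):
    """Parse the flags that `run(args=...)` passes to the training script.

    Hypothetical helper: the defaults below are illustrative assumptions.
    """
    parser = argparse.ArgumentParser(description="Dask XGBoost trainer (sketch)")
    parser.add_argument("--dataset-source", type=str, required=True,
                        help="Cloud Storage location of the CSV dataset file")
    parser.add_argument("--model-name", type=str, default="model.json",
                        help="File name for the model artifacts")
    parser.add_argument("--num-gpu-per-worker", type=int, default=0,
                        help="Number of GPUs per VM instance (worker)")
    parser.add_argument("--threads-per-worker", type=int, default=1,
                        help="Number of training threads per VM instance")
    return parser.parse_args(argv)


# Same flag/value pairs as the CMDARGS list used in this tutorial.
args = parse_args([
    "--dataset-source", "gs://vtx-datasets-public/cover_type_4Mrows.csv",
    "--model-name", "model.json",
    "--num-gpu-per-worker", "4",
    "--threads-per-worker", "4",
])
print(args.dataset_source, args.model_name,
      args.num_gpu_per_worker, args.threads_per_worker)
```

Note that `argparse` converts the dashes in each flag name to underscores, so `--num-gpu-per-worker` is read as `args.num_gpu_per_worker`.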

In [51]:
DATASET_FILE = "gs://vtx-datasets-public/cover_type_4Mrows.csv"

MODEL_DIR = f"{BUCKET_URI}"

NGPU = 4

CMDARGS = [
    "--dataset-source",
    DATASET_FILE,
    "--model-name",
    "model.json",
    "--num-gpu-per-worker",
    str(NGPU),
    "--threads-per-worker",
    "4",
]

model = job.run(
    model_display_name="covertype_" + TIMESTAMP,
    args=CMDARGS,
    replica_count=1,
    machine_type="n1-standard-4",
    accelerator_type="NVIDIA_TESLA_T4",
    accelerator_count=NGPU,
    base_output_dir=MODEL_DIR,
)

Training Output directory:
gs://andy-1234-221921aip-20220808183359 
View Training:
https://console.cloud.google.com/ai/platform/locations/us-central1/training/5005831525049040896?project=759209241365
CustomContainerTrainingJob projects/759209241365/locations/us-central1/trainingPipelines/5005831525049040896 current state:
PipelineState.PIPELINE_STATE_RUNNING
View backing custom job:
https://console.cloud.google.com/ai/platform/locations/us-central1/training/3811682732447105024?project=759209241365
CustomContainerTrainingJob projects/759209241365/locations/us-central1/trainingPipelines/5005831525049040896 current state:
PipelineState.PIPELINE_STATE_RUNNING
CustomContainerTrainingJob projects/759209241365/locations/us-central1/trainingPipelines/5005831525049040896 current state:
PipelineState.PIPELINE_STATE_RUNNING
CustomContainerTrainingJob projects/759209241365/locations/us-central1/trainingPipelines/5005831525049040896 current state:
PipelineState.PIPELINE_STATE_RUNNING
CustomContaine

### Delete a custom training job

After a training job is completed, you can delete the training job with the method `delete()`.  Prior to completion, a training job can be canceled with the method `cancel()`.

In [52]:
job.delete()

Deleting CustomContainerTrainingJob : projects/759209241365/locations/us-central1/trainingPipelines/5005831525049040896
Delete CustomContainerTrainingJob  backing LRO: projects/759209241365/locations/us-central1/operations/2495575079300104192
CustomContainerTrainingJob deleted. . Resource name: projects/759209241365/locations/us-central1/trainingPipelines/5005831525049040896


#### Display location of saved model artifacts

Next, you display the contents of the Cloud Storage location where your training script saved the trained model artifacts.

In [58]:
! gsutil ls {MODEL_DIR}/model

gs://andy-1234-221921aip-20220808183359/model/xgboost/all_results.json
gs://andy-1234-221921aip-20220808183359/model/xgboost/model.json


### Deploy the model

Next, you deploy your XGBoost model, together with its serving container, to a `Vertex AI Endpoint` resource using the `deploy()` method.

In [54]:
endpoint = model.deploy(machine_type="n1-standard-4")

Creating Endpoint
Create Endpoint backing LRO: projects/759209241365/locations/us-central1/endpoints/3938177971809943552/operations/3392917305053675520
Endpoint created. Resource name: projects/759209241365/locations/us-central1/endpoints/3938177971809943552
To use this Endpoint in another session:
endpoint = aiplatform.Endpoint('projects/759209241365/locations/us-central1/endpoints/3938177971809943552')
Deploying model to Endpoint : projects/759209241365/locations/us-central1/endpoints/3938177971809943552
Deploy Endpoint model backing LRO: projects/759209241365/locations/us-central1/endpoints/3938177971809943552/operations/6600606139648311296


FailedPrecondition: 400 Model server terminated: model server container terminated: exit_code: 	 1
reason: "Error"
started_at {
  seconds: 1659999294
}
finished_at {
  seconds: 1659999295
}
. Model server logs can be found at https://console.cloud.google.com/logs/viewer?project=759209241365&resource=aiplatform.googleapis.com%252FEndpoint&advancedFilter=resource.type%3D%22aiplatform.googleapis.com%2FEndpoint%22%0Aresource.labels.endpoint_id%3D%223938177971809943552%22%0Aresource.labels.location%3D%22us-central1%22.

#### Prepare prediction request

Next, you prepare a prediction request. For demonstration purposes, you use the first row (example) from the dataset.

In [56]:
output = ! gsutil cat {DATASET_FILE} | head -n2

print(output[1])

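# Each dataset row is comma-separated text, not JSON, so split it into numeric
# values. Adjust which columns you send (for example, dropping a label column)
# to match what your serving container expects.
instance = [float(value) for value in output[1].split(",")]
print(instance)

0,3189,40,8,30,13,3270,206,234,193,4873,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,1
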

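#### Make the prediction

Finally, you send the prediction request to the deployed model by calling the endpoint's `predict()` method with a list of instances.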

In [None]:
prediction = endpoint.predict(instances=[instance])
print(prediction)
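
For reference, `predict()` sends the instances to the serving container as a JSON object with an `instances` key. The sketch below builds such a request body by hand, which can help when testing a custom serving container locally. The helper `build_request_body` and the shortened three-value row are assumptions for illustration only.

```python
import json


def build_request_body(instances):
    """Wrap a list of instances in the JSON body a serving container receives."""
    return json.dumps({"instances": instances})


# Shortened, made-up row for illustration; a real row has one value per column.
row = "0,3189,40"
example_instance = [float(value) for value in row.split(",")]

body = build_request_body([example_instance])
print(body)  # {"instances": [[0.0, 3189.0, 40.0]]}
```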

#### Undeploy and delete the `Endpoint` resource

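When you are finished making predictions, undeploy all models from the `Endpoint` resource and then delete it.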

In [59]:
# `force=True` undeploys all models from the endpoint before deleting it.
endpoint.delete(force=True)

NameError: name 'endpoint' is not defined

#### Delete the `Model` resource

You can delete a `Model` resource with the `delete()` method.

In [60]:
model.delete()

Deleting Model : projects/759209241365/locations/us-central1/models/3452901917821239296
Delete Model  backing LRO: projects/759209241365/locations/us-central1/operations/6569080942256717824
Model deleted. . Resource name: projects/759209241365/locations/us-central1/models/3452901917821239296


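## Cleanup

To clean up the remaining resources used in this tutorial, delete the Cloud Storage bucket and the local files created for the training and prediction packages.
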
In [61]:
delete_bucket = True
if delete_bucket or os.getenv("IS_TESTING"):
    ! gsutil rm -r {BUCKET_URI}

! rm -rf custom prediction custom.tar.gz

# TODO: delete repo, delete images

Removing gs://andy-1234-221921aip-20220808183359/trainer_covertype.tar.gz#1659997980723690...
Removing gs://andy-1234-221921aip-20220808183359/model/model/xgboost/all_results.json#1659996013500812...
Removing gs://andy-1234-221921aip-20220808183359/model/model/xgboost/model.json#1659996013612869...
Removing gs://andy-1234-221921aip-20220808183359/model/xgboost/all_results.json#1659998221499926...
/ [4 objects]                                                                   
==> NOTE: You are performing a sequence of gsutil operations that may
run significantly faster if you instead use gsutil -m rm ... Please
see the -m section under "gsutil help options" for further information
about when gsutil -m can be advantageous.

Removing gs://andy-1234-221921aip-20220808183359/model/xgboost/model.json#1659998221591249...
/ [5 objects]                                                                   
Operation completed over 5 objects.                                              
Removing g