# Migrating Custom XGBoost Model with Pre-built Training Container

### Learning Objectives

1. Train a model.
2. Upload a model.
3. Make batch and online predictions.
4. Deploy a model.

## Introduction

The dataset used for this tutorial is the [Iris dataset](https://www.tensorflow.org/datasets/catalog/iris) from [TensorFlow Datasets](https://www.tensorflow.org/datasets/catalog/overview). This dataset does not require any feature engineering. The version of the dataset you will use in this tutorial is stored in a public Cloud Storage bucket. The trained model predicts which of three Iris species a flower belongs to: setosa, virginica, or versicolor.

Each learning objective corresponds to a __#TODO__ in the [student lab notebook](../labs/sdk_customm_xgboost.ipynb). Try to complete that notebook before reviewing this solution notebook.

## Installation

Install the latest version of the Vertex SDK for Python.

In [1]:
# import necessary libraries
import os

# Google Cloud Notebook
if os.path.exists("/opt/deeplearning/metadata/env_version"):
    USER_FLAG = "--user"
else:
    USER_FLAG = ""

! pip3 install --upgrade google-cloud-aiplatform $USER_FLAG



Install the latest GA version of the *google-cloud-storage* library as well.

In [2]:
! pip3 install -U google-cloud-storage $USER_FLAG



In [3]:
if os.getenv("IS_TESTING"):
    ! pip3 install --upgrade tensorflow $USER_FLAG

### Restart the kernel

Once you've installed the additional packages, you need to restart the notebook kernel so it can find the packages.

In [4]:
import os

if not os.getenv("IS_TESTING"):
    # Automatically restart kernel after installs
    import IPython

    app = IPython.Application.instance()
    app.kernel.do_shutdown(True)

### Set up your Google Cloud project

In [1]:
PROJECT_ID = "<your-project>"  # replace with your project ID

In [2]:
# Fall back to the gcloud default if the placeholder above was not replaced
if PROJECT_ID in ("", None, "<your-project>", "[your-project-id]"):
    # Get your GCP project id from gcloud
    shell_output = ! gcloud config list --format 'value(core.project)' 2>/dev/null
    PROJECT_ID = shell_output[0]
    print("Project ID:", PROJECT_ID)

In [3]:
! gcloud config set project $PROJECT_ID

Updated property [core/project].


#### Region

You can also change the `REGION` variable, which is used for operations throughout the rest of this notebook. The regions below support Vertex AI. We recommend that you choose the region closest to you.

- Americas: `us-central1`
- Europe: `europe-west4`
- Asia Pacific: `asia-east1`

You may not use a multi-regional bucket for training with Vertex AI. Not all regions provide support for all Vertex AI services.

Learn more about [Vertex AI regions](https://cloud.google.com/vertex-ai/docs/general/locations).

In [4]:
REGION = "us-central1"

#### Timestamp

If you are in a live tutorial session, you might be using a shared test account or project. To avoid name collisions between users on the resources created, you create a timestamp for each session and append it to the names of the resources you create in this tutorial.

In [5]:
from datetime import datetime

TIMESTAMP = datetime.now().strftime("%Y%m%d%H%M%S")
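
As a rough illustration of this naming scheme, the sketch below appends a second-resolution timestamp to a resource prefix. The helper name `unique_name` is hypothetical and is not part of the Vertex SDK.

```python
from datetime import datetime


def unique_name(prefix: str, now: datetime) -> str:
    """Append a second-resolution timestamp to a resource prefix (hypothetical helper)."""
    return prefix + now.strftime("%Y%m%d%H%M%S")


# Two sessions started one second apart get different resource names.
first = unique_name("iris_", datetime(2022, 5, 25, 12, 49, 11))
second = unique_name("iris_", datetime(2022, 5, 25, 12, 49, 12))
print(first)   # iris_20220525124911
print(second)  # iris_20220525124912
```

Note that two sessions started within the same second would still collide, so this is a convenience for tutorials rather than a uniqueness guarantee.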

### Create a Cloud Storage bucket

When you initialize the Vertex SDK for Python, you specify a Cloud Storage staging bucket. The staging bucket is where all the data associated with your dataset and model resources are retained across sessions.

Set the name of your Cloud Storage bucket below. Bucket names must be globally unique across all Google Cloud projects, including those outside of your organization.

In [6]:
BUCKET_NAME = "gs://<your-bucket>" # replace bucket name

In [7]:
# Generate a bucket name if the placeholder above was not replaced
if BUCKET_NAME in ("", None, "gs://<your-bucket>", "gs://[your-bucket-name]"):
    BUCKET_NAME = "gs://" + PROJECT_ID + "aip-" + TIMESTAMP
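
Before creating the bucket, it can help to check that the name is plausible. The sketch below is a partial check only, assuming the basic Cloud Storage naming rules (3 to 63 characters; lowercase letters, digits, dashes, underscores, and dots; starting and ending with a letter or digit). The helper `looks_like_valid_bucket` is hypothetical, and the full rules include further restrictions that it does not cover.

```python
import re

# Partial check only: 3-63 characters; lowercase letters, digits, dashes,
# underscores, and dots; must start and end with a letter or digit.
_BUCKET_RE = re.compile(r"[a-z0-9][a-z0-9._-]{1,61}[a-z0-9]")


def looks_like_valid_bucket(bucket_uri: str) -> bool:
    """Return True if the name after 'gs://' passes the basic naming rules."""
    if not bucket_uri.startswith("gs://"):
        return False
    return _BUCKET_RE.fullmatch(bucket_uri[len("gs://"):]) is not None


print(looks_like_valid_bucket("gs://my-projectaip-20220525124911"))  # True
print(looks_like_valid_bucket("gs://<your-bucket>"))                 # False
print(looks_like_valid_bucket("gs://My_Bucket"))                     # False
```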

**Only if your bucket doesn't already exist**: Run the following cell to create your Cloud Storage bucket.

In [8]:
! gsutil mb -l $REGION $BUCKET_NAME

Creating gs://qwiklabs-gcp-02-c816472a2d85/...


Finally, validate access to your Cloud Storage bucket by examining its contents:

In [9]:
! gsutil ls -al $BUCKET_NAME

### Set up variables

Next, set up some variables used in this notebook.

### Import libraries and define constants

In [10]:
import google.cloud.aiplatform as aip

## Initialize Vertex SDK for Python

Initialize the Vertex SDK for Python for your project and corresponding bucket.

In [11]:
aip.init(project=PROJECT_ID, staging_bucket=BUCKET_NAME)

#### Set pre-built containers

Set the pre-built Docker container images for training and prediction.

For the latest lists, see [Pre-built containers for training](https://cloud.google.com/ai-platform-unified/docs/training/pre-built-containers) and [Pre-built containers for prediction](https://cloud.google.com/ai-platform-unified/docs/predictions/pre-built-containers).

In [12]:
TRAIN_VERSION = "xgboost-cpu.1-1"
DEPLOY_VERSION = "xgboost-cpu.1-1"

TRAIN_IMAGE = "gcr.io/cloud-aiplatform/training/{}:latest".format(TRAIN_VERSION)
DEPLOY_IMAGE = "gcr.io/cloud-aiplatform/prediction/{}:latest".format(DEPLOY_VERSION)

#### Set machine type

Next, set the machine type to use for training and prediction.

- Set the variables `TRAIN_COMPUTE` and `DEPLOY_COMPUTE` to configure the compute resources for the VMs you will use for training and prediction.
 - `machine type`
     - `n1-standard`: 3.75 GB of memory per vCPU
     - `n1-highmem`: 6.5 GB of memory per vCPU
     - `n1-highcpu`: 0.9 GB of memory per vCPU
 - `vCPUs`: one of \[2, 4, 8, 16, 32, 64, 96\]

*Note: The following is not supported for training:*

 - `standard`: 2 vCPUs
 - `highcpu`: 2, 4 and 8 vCPUs

*Note: You may also use n2 and e2 machine types for training and deployment, but they do not support GPUs*.
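
The rules above can be sketched as a small lookup. This is an illustrative helper, not part of the Vertex SDK: the name `describe_machine` is hypothetical, and it assumes that the `standard` and `highcpu` entries in the note refer to the `n1-standard` and `n1-highcpu` families.

```python
# Memory per vCPU (GB) for the machine families listed above.
MEMORY_PER_VCPU_GB = {"n1-standard": 3.75, "n1-highmem": 6.5, "n1-highcpu": 0.9}
VALID_VCPUS = (2, 4, 8, 16, 32, 64, 96)

# Combinations the note above lists as unsupported for training
# (assumed to mean the n1 families).
UNSUPPORTED_FOR_TRAINING = {
    ("n1-standard", 2),
    ("n1-highcpu", 2),
    ("n1-highcpu", 4),
    ("n1-highcpu", 8),
}


def describe_machine(family: str, vcpus: int) -> dict:
    """Build the machine type name and summarize its memory and training support."""
    if family not in MEMORY_PER_VCPU_GB:
        raise ValueError(f"unknown machine family: {family}")
    if vcpus not in VALID_VCPUS:
        raise ValueError(f"unsupported vCPU count: {vcpus}")
    return {
        "name": f"{family}-{vcpus}",
        "memory_gb": MEMORY_PER_VCPU_GB[family] * vcpus,
        "training_ok": (family, vcpus) not in UNSUPPORTED_FOR_TRAINING,
    }


print(describe_machine("n1-standard", 4))
print(describe_machine("n1-highcpu", 8))
```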

In [14]:
import os

if os.getenv("IS_TESTING_TRAIN_MACHINE"):
    MACHINE_TYPE = os.getenv("IS_TESTING_TRAIN_MACHINE")
else:
    MACHINE_TYPE = "n1-standard"

VCPU = "4"
TRAIN_COMPUTE = MACHINE_TYPE + "-" + VCPU
print("Train machine type", TRAIN_COMPUTE)

if os.getenv("IS_TESTING_DEPLOY_MACHINE"):
    MACHINE_TYPE = os.getenv("IS_TESTING_DEPLOY_MACHINE")
else:
    MACHINE_TYPE = "n1-standard"

VCPU = "4"
DEPLOY_COMPUTE = MACHINE_TYPE + "-" + VCPU
print("Deploy machine type", DEPLOY_COMPUTE)

Train machine type n1-standard-4
Deploy machine type n1-standard-4


### Examine the training package

#### Package layout

Before you start the training, you will look at how a Python package is assembled for a custom training job. When unarchived, the package contains the following directory/file layout.

- PKG-INFO
- README.md
- setup.cfg
- setup.py
- trainer
  - \_\_init\_\_.py
  - task.py

The files `setup.cfg` and `setup.py` are the instructions for installing the package into the operating environment of the Docker image.

The file `trainer/task.py` is the Python script for executing the custom training job. *Note*: when you refer to it as a module in a worker pool specification, replace the directory slash with a dot and drop the file suffix (`.py`), which gives `trainer.task`.
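
The path-to-module conversion described above can be sketched as follows. The helper `module_name` is hypothetical and only illustrates the naming rule.

```python
from pathlib import PurePosixPath


def module_name(script_path: str) -> str:
    """Convert a package-relative script path to a dotted module name."""
    path = PurePosixPath(script_path)
    # Drop the .py suffix, then join the path components with dots.
    return ".".join(path.with_suffix("").parts)


print(module_name("trainer/task.py"))  # trainer.task
```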

#### Package Assembly

In the following cells, you will assemble the training package.

In [15]:
# Make folder for Python training script
! rm -rf custom
! mkdir custom

# Add package information
! touch custom/README.md

setup_cfg = "[egg_info]\n\ntag_build =\n\ntag_date = 0"
! echo "$setup_cfg" > custom/setup.cfg

setup_py = "import setuptools\n\nsetuptools.setup(\n\n    install_requires=[\n\n        'tensorflow_datasets==1.3.0',\n\n    ],\n\n    packages=setuptools.find_packages())"
! echo "$setup_py" > custom/setup.py

pkg_info = "Metadata-Version: 1.0\n\nName: Iris tabular classification\n\nVersion: 0.0.0\n\nSummary: Demonstration training script\n\nHome-page: www.google.com\n\nAuthor: Google\n\nAuthor-email: aferlitsch@google.com\n\nLicense: Public\n\nDescription: Demo\n\nPlatform: Vertex"
! echo "$pkg_info" > custom/PKG-INFO

# Make the training subfolder
! mkdir custom/trainer
! touch custom/trainer/__init__.py

In [16]:
%%writefile custom/trainer/task.py
# Single Instance Training for Iris

import datetime
import os
import subprocess
import sys
import pandas as pd
import xgboost as xgb

import argparse

parser = argparse.ArgumentParser()
parser.add_argument('--model-dir', dest='model_dir',
                    default=os.getenv('AIP_MODEL_DIR'), type=str, help='Model dir.')
args = parser.parse_args()

# Download data
iris_data_filename = 'iris_data.csv'
iris_target_filename = 'iris_target.csv'
data_dir = 'gs://cloud-samples-data/ai-platform/iris'

# gsutil outputs everything to stderr so we need to divert it to stdout.
subprocess.check_call(['gsutil', 'cp', os.path.join(data_dir,
                                                    iris_data_filename),
                       iris_data_filename], stderr=sys.stdout)
subprocess.check_call(['gsutil', 'cp', os.path.join(data_dir,
                                                    iris_target_filename),
                       iris_target_filename], stderr=sys.stdout)


# Load data into pandas, then use `.values` to get NumPy arrays
iris_data = pd.read_csv(iris_data_filename).values
iris_target = pd.read_csv(iris_target_filename).values

# Convert one-column 2D array into 1D array for use with XGBoost
iris_target = iris_target.reshape((iris_target.size,))


# Load data into DMatrix object
dtrain = xgb.DMatrix(iris_data, label=iris_target)


# Train XGBoost model
bst = xgb.train({}, dtrain, 20)

# Export the classifier to a file
model_filename = 'model.bst'
bst.save_model(model_filename)

# Upload the saved model file to Cloud Storage
gcs_model_path = os.path.join(args.model_dir, model_filename)
subprocess.check_call(['gsutil', 'cp', model_filename, gcs_model_path],
    stderr=sys.stdout)

Writing custom/trainer/task.py
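
The training script takes its output location from the `--model-dir` flag and falls back to the `AIP_MODEL_DIR` environment variable when the flag is absent. A minimal sketch of that fallback, runnable without Cloud access, is shown below. The function `parse_model_dir` and the example bucket paths are hypothetical.

```python
import argparse
import os


def parse_model_dir(argv, environ=os.environ):
    """Return --model-dir if given, otherwise the AIP_MODEL_DIR value."""
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--model-dir",
        dest="model_dir",
        default=environ.get("AIP_MODEL_DIR"),
        type=str,
        help="Model dir.",
    )
    return parser.parse_args(argv).model_dir


# Hypothetical environment, standing in for the training container.
env = {"AIP_MODEL_DIR": "gs://example-bucket/20220525124911/model"}

from_env = parse_model_dir([], env)
from_flag = parse_model_dir(["--model-dir", "gs://other-bucket/run1"], env)
print(from_env)   # gs://example-bucket/20220525124911/model
print(from_flag)  # gs://other-bucket/run1
```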


#### Store training script on your Cloud Storage bucket

Next, you package the training folder into a compressed tarball, and then store it in your Cloud Storage bucket.

In [17]:
! rm -f custom.tar custom.tar.gz
! tar cvf custom.tar custom
! gzip custom.tar
! gsutil cp custom.tar.gz $BUCKET_NAME/trainer_iris.tar.gz

custom/
custom/trainer/
custom/trainer/__init__.py
custom/trainer/task.py
custom/setup.py
custom/README.md
custom/PKG-INFO
custom/setup.cfg
Copying file://custom.tar.gz [Content-Type=application/x-tar]...
/ [1 files][  1.2 KiB/  1.2 KiB]                                                
Operation completed over 1 objects/1.2 KiB.                                      
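
The same archive can also be built from Python with the standard `tarfile` module instead of shell commands. The sketch below creates a reduced version of the package layout in a temporary directory and archives it. The function `build_package` and the placeholder file contents are assumptions for illustration, and the upload to Cloud Storage is left out.

```python
import tarfile
import tempfile
from pathlib import Path


def build_package(root: Path) -> Path:
    """Create a minimal custom/ tree under root and archive it as custom.tar.gz."""
    pkg = root / "custom"
    (pkg / "trainer").mkdir(parents=True)
    (pkg / "README.md").touch()
    (pkg / "setup.py").write_text(
        "import setuptools\n\n"
        "setuptools.setup(packages=setuptools.find_packages())\n"
    )
    (pkg / "trainer" / "__init__.py").touch()
    (pkg / "trainer" / "task.py").write_text("# training script goes here\n")

    archive = root / "custom.tar.gz"
    with tarfile.open(archive, "w:gz") as tar:
        # arcname keeps paths inside the archive relative to custom/
        tar.add(pkg, arcname="custom")
    return archive


with tempfile.TemporaryDirectory() as tmp:
    archive_path = build_package(Path(tmp))
    with tarfile.open(archive_path, "r:gz") as tar:
        members = sorted(tar.getnames())

print(members)
```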


## Train a model ([training.create-python-pre-built-container](https://cloud.google.com/vertex-ai/docs/training/create-python-pre-built-container))

### Create and run custom training job


To train a custom model, you perform two steps: 1) create a custom training job, and 2) run the job.

#### Create custom training job

A custom training job is created with the `CustomTrainingJob` class, with the following parameters:

- `display_name`: The human readable name for the custom training job.
- `container_uri`: The training container image.
- `requirements`: Package requirements for the training container image (e.g., pandas).
- `script_path`: The relative path to the training script.

In [18]:
# TODO
# constructs a Custom Training Job using a Python script
job = aip.CustomTrainingJob(
    display_name="iris_" + TIMESTAMP,
    script_path="custom/trainer/task.py",
    container_uri=TRAIN_IMAGE,
    requirements=["gcsfs==0.7.1", "tensorflow-datasets==4.4"],
)

print(job)

<google.cloud.aiplatform.training_jobs.CustomTrainingJob object at 0x7fcf426eb350>


#### Run the custom training job

Next, you start the training job by invoking the `run` method, with the following parameters:

- `replica_count`: The number of compute instances for training (replica_count = 1 is single node training).
- `machine_type`: The machine type for the compute instances.
- `base_output_dir`: The Cloud Storage location to write the model artifacts to.
- `sync`: Whether to block until completion of the job.

In [19]:
MODEL_DIR = "{}/{}".format(BUCKET_NAME, TIMESTAMP)


job.run(
    replica_count=1, machine_type=TRAIN_COMPUTE, base_output_dir=MODEL_DIR, sync=True
)

MODEL_DIR = MODEL_DIR + "/model"
model_path_to_deploy = MODEL_DIR

Training script copied to:
gs://qwiklabs-gcp-02-c816472a2d85/aiplatform-2022-05-25-12:52:36.960-aiplatform_custom_trainer_script-0.1.tar.gz.
Training Output directory:
gs://qwiklabs-gcp-02-c816472a2d85/20220525124911 
View Training:
https://console.cloud.google.com/ai/platform/locations/us-central1/training/6485750261657632768?project=781054085410
CustomTrainingJob projects/781054085410/locations/us-central1/trainingPipelines/6485750261657632768 current state:
PipelineState.PIPELINE_STATE_PENDING
CustomTrainingJob projects/781054085410/locations/us-central1/trainingPipelines/6485750261657632768 current state:
PipelineState.PIPELINE_STATE_PENDING
View backing custom job:
https://console.cloud.google.com/ai/platform/locations/us-central1/training/7486651089584914432?project=781054085410
CustomTrainingJob projects/781054085410/locations/us-central1/trainingPipelines/6485750261657632768 current state:
PipelineState.PIPELINE_STATE_RUNNING
CustomTrainingJob projects/781054085410/locations/us

**The custom training job will take some time to complete**.

## Upload the model ([general.import-model](https://cloud.google.com/vertex-ai/docs/general/import-model))

Next, upload your model to a `Model` resource using the `Model.upload()` method, with the following parameters:

- `display_name`: The human readable name for the `Model` resource.
- `artifact_uri`: The Cloud Storage location of the trained model artifacts.
- `serving_container_image_uri`: The serving container image.
- `sync`: Whether to execute the upload synchronously (`True`) or asynchronously (`False`).

If the `upload()` method is run asynchronously, you can subsequently block until completion with the `wait()` method.

In [20]:
# TODO
model = aip.Model.upload(
    display_name="iris_" + TIMESTAMP,
    artifact_uri=MODEL_DIR,
    serving_container_image_uri=DEPLOY_IMAGE,
    sync=False,
)

model.wait()

Creating Model
Create Model backing LRO: projects/781054085410/locations/us-central1/models/6501425449179021312/operations/1923258661199675392
Model created. Resource name: projects/781054085410/locations/us-central1/models/6501425449179021312
To use this Model in another session:
model = aiplatform.Model('projects/781054085410/locations/us-central1/models/6501425449179021312')


## Make batch predictions ([predictions.batch-prediction](https://cloud.google.com/vertex-ai/docs/predictions/batch-predictions))

### Make test items

You will use synthetic data as test data items. Don't be concerned that the data is synthetic; the goal is only to demonstrate how to make a prediction.

In [21]:
INSTANCES = [[1.4, 1.3, 5.1, 2.8], [1.5, 1.2, 4.7, 2.4]]

### Make the batch input file

Now make a batch input file and store it in your Cloud Storage bucket. The instances in the prediction request form a list of lists, and the file holds one instance per line in JSON Lines format:

                        [ [content_1], [content_2] ]

- `content`: The feature values of the test item as a list.

In [22]:
import json

import tensorflow as tf

gcs_input_uri = BUCKET_NAME + "/" + "test.jsonl"
with tf.io.gfile.GFile(gcs_input_uri, "w") as f:
    for i in INSTANCES:
        # Serialize each instance as one JSON value per line (JSON Lines)
        f.write(json.dumps(i) + "\n")

! gsutil cat $gcs_input_uri

[1.4, 1.3, 5.1, 2.8]
[1.5, 1.2, 4.7, 2.4]
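
One way to sanity-check the JSON Lines format before uploading is to write the same instances to a local temporary file and read them back. This sketch uses only the standard library. The local file path is an assumption for illustration and stands in for the Cloud Storage object.

```python
import json
import tempfile
from pathlib import Path

instances = [[1.4, 1.3, 5.1, 2.8], [1.5, 1.2, 4.7, 2.4]]

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "test.jsonl"

    # Write one JSON-encoded instance per line.
    with path.open("w") as f:
        for instance in instances:
            f.write(json.dumps(instance) + "\n")

    # Read the file back, parsing each non-empty line as JSON.
    with path.open() as f:
        round_tripped = [json.loads(line) for line in f if line.strip()]

print(round_tripped == instances)  # True
```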


### Make the batch prediction request

Now that your `Model` resource is uploaded, you can make a batch prediction by invoking the `batch_predict()` method, with the following parameters:

- `job_display_name`: The human readable name for the batch prediction job.
- `gcs_source`: A list of one or more batch request input files.
- `gcs_destination_prefix`: The Cloud Storage location for storing the batch prediction results.
- `instances_format`: The format for the input instances, either 'csv' or 'jsonl'. Defaults to 'jsonl'.
- `predictions_format`: The format for the output predictions, either 'csv' or 'jsonl'. Defaults to 'jsonl'.
- `machine_type`: The type of machine to use for the batch prediction job.
- `sync`: If set to True, the call will block while waiting for the asynchronous batch job to complete.

In [23]:
MIN_NODES = 1
MAX_NODES = 1

# TODO
batch_predict_job = model.batch_predict(
    job_display_name="iris_" + TIMESTAMP,
    gcs_source=gcs_input_uri,
    gcs_destination_prefix=BUCKET_NAME,
    instances_format="jsonl",
    predictions_format="jsonl",
    model_parameters=None,
    machine_type=DEPLOY_COMPUTE,
    starting_replica_count=MIN_NODES,
    max_replica_count=MAX_NODES,
    sync=False,
)

print(batch_predict_job)

Creating BatchPredictionJob
<google.cloud.aiplatform.jobs.BatchPredictionJob object at 0x7fcf033fb350> is waiting for upstream dependencies to complete.
BatchPredictionJob created. Resource name: projects/781054085410/locations/us-central1/batchPredictionJobs/1684343511857496064
To use this BatchPredictionJob in another session:
bpj = aiplatform.BatchPredictionJob('projects/781054085410/locations/us-central1/batchPredictionJobs/1684343511857496064')
View Batch Prediction Job:
https://console.cloud.google.com/ai/platform/locations/us-central1/batch-predictions/1684343511857496064?project=781054085410
BatchPredictionJob projects/781054085410/locations/us-central1/batchPredictionJobs/1684343511857496064 current state:
JobState.JOB_STATE_PENDING
BatchPredictionJob projects/781054085410/locations/us-central1/batchPredictionJobs/1684343511857496064 current state:
JobState.JOB_STATE_RUNNING


**The batch prediction request will take 25-30 minutes to complete**.

### Get the predictions

Next, get the results from the completed batch prediction job.

The results are written to the Cloud Storage location you specified in the batch prediction request. Call the `iter_outputs()` method to iterate over the Cloud Storage files generated with the results. Each file contains one or more prediction results in JSON format, with the following fields:

- `instance`: The prediction request.
- `prediction`: The prediction response.

In [28]:
import json

# The job was submitted with sync=False, so block until it finishes
batch_predict_job.wait()

bp_iter_outputs = batch_predict_job.iter_outputs()

# Collect the names of the output files that hold predictions
prediction_results = list()
for blob in bp_iter_outputs:
    if blob.name.split("/")[-1].startswith("prediction"):
        prediction_results.append(blob.name)

# Print the first record of each prediction file
for prediction_result in prediction_results:
    gfile_name = f"gs://{bp_iter_outputs.bucket.name}/{prediction_result}"
    with tf.io.gfile.GFile(name=gfile_name, mode="r") as gfile:
        for line in gfile.readlines():
            line = json.loads(line)
            print(line)
            break

{'instance': [1.5, 1.2, 4.7, 2.4], 'prediction': 1.9618644714355469}
{'instance': [1.4, 1.3, 5.1, 2.8], 'prediction': 2.0451931953430176}
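
The predictions above are single floating-point values rather than class names. One possible way to read them, sketched below, is to round each value to the nearest class index. This is an illustrative post-processing step, not something the tutorial's model or Vertex AI performs. The helper `nearest_class` is hypothetical, and the label order in `SPECIES` is an assumption because the tutorial does not state how the species are encoded.

```python
import json

# Assumed label order; the tutorial does not state the integer encoding.
SPECIES = ["setosa", "versicolor", "virginica"]

# Result records copied from the output above, re-encoded as JSON.
result_lines = [
    '{"instance": [1.5, 1.2, 4.7, 2.4], "prediction": 1.9618644714355469}',
    '{"instance": [1.4, 1.3, 5.1, 2.8], "prediction": 2.0451931953430176}',
]


def nearest_class(raw_prediction: float) -> int:
    """Round a regression-style output to the nearest valid class index."""
    index = int(round(raw_prediction))
    return min(max(index, 0), len(SPECIES) - 1)


labels = []
for line in result_lines:
    record = json.loads(line)
    label = nearest_class(record["prediction"])
    labels.append(label)
    print(record["instance"], "->", label, SPECIES[label])
```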


## Make online predictions ([predictions.deploy-model-api](https://cloud.google.com/vertex-ai/docs/predictions/deploy-model-api))

## Deploy the model

Next, deploy your model for online prediction. To deploy the model, you invoke the `deploy` method, with the following parameters:

- `deployed_model_display_name`: A human readable name for the deployed model.
- `traffic_split`: Percent of traffic at the endpoint that goes to this model, specified as a dictionary of one or more key/value pairs.
    - If there is only one model, specify `{ "0": 100 }`, where "0" refers to the model being deployed and 100 means 100% of the traffic.
    - If there are existing models on the endpoint among which traffic will be split, specify `{ "0": percent, model_id: percent, ... }`, where `model_id` is the ID of a model already deployed to the endpoint. The percentages must add up to 100.
- `machine_type`: The type of machine to use for serving predictions.
- `min_replica_count`: The number of compute instances to initially provision.
- `max_replica_count`: The maximum number of compute instances to scale to. In this tutorial, only one instance is provisioned.
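
The traffic split rule can be checked locally before calling `deploy`. The sketch below is a hypothetical helper (`validate_traffic_split` is not part of the Vertex SDK) that verifies the keys are strings and that the percentages add up to 100. The second model ID is made up for illustration.

```python
def validate_traffic_split(split: dict) -> dict:
    """Check that a traffic split has string keys and percentages summing to 100."""
    if not split:
        raise ValueError("traffic split must contain at least one entry")
    for key, percent in split.items():
        if not isinstance(key, str):
            raise TypeError(f"model key must be a string: {key!r}")
        if not 0 <= percent <= 100:
            raise ValueError(f"percentage out of range for {key!r}: {percent}")
    total = sum(split.values())
    if total != 100:
        raise ValueError(f"percentages must add up to 100, got {total}")
    return split


# "0" is the model being deployed; "1234567890" is a made-up existing model ID.
print(validate_traffic_split({"0": 100}))
print(validate_traffic_split({"0": 20, "1234567890": 80}))
```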

In [29]:
DEPLOYED_NAME = "iris-" + TIMESTAMP

TRAFFIC_SPLIT = {"0": 100}

MIN_NODES = 1
MAX_NODES = 1

# TODO
endpoint = model.deploy(
    deployed_model_display_name=DEPLOYED_NAME,
    traffic_split=TRAFFIC_SPLIT,
    machine_type=DEPLOY_COMPUTE,
    min_replica_count=MIN_NODES,
    max_replica_count=MAX_NODES,
)

Creating Endpoint
Create Endpoint backing LRO: projects/781054085410/locations/us-central1/endpoints/257575991969316864/operations/7331659601206575104
Endpoint created. Resource name: projects/781054085410/locations/us-central1/endpoints/257575991969316864
To use this Endpoint in another session:
endpoint = aiplatform.Endpoint('projects/781054085410/locations/us-central1/endpoints/257575991969316864')
Deploying model to Endpoint : projects/781054085410/locations/us-central1/endpoints/257575991969316864
Deploy Endpoint model backing LRO: projects/781054085410/locations/us-central1/endpoints/257575991969316864/operations/7588786992431759360
Endpoint model deployed. Resource name: projects/781054085410/locations/us-central1/endpoints/257575991969316864


**Model deployment will take some time to complete**.

### Make test item ([predictions.online-prediction-automl](https://cloud.google.com/vertex-ai/docs/predictions/online-predictions-automl))

You will use synthetic data as a test data item. Don't be concerned that the data is synthetic; the goal is only to demonstrate how to make a prediction.

In [30]:
INSTANCE = [1.4, 1.3, 5.1, 2.8]

### Make the prediction

Now that your `Model` resource is deployed to an `Endpoint` resource, you can do online predictions by sending prediction requests to the `Endpoint` resource.

#### Request

The format of each instance is:

    [feature_list]

Since the `predict()` method can take multiple items (instances), send your single test item as a list of one test item.

#### Response

The `predict()` call returns a `Prediction` object with the following fields:

- `predictions`: The model output for each instance. For this model, each prediction is a single floating-point value rather than a per-class confidence.
- `deployed_model_id`: The Vertex AI identifier for the deployed `Model` resource that served the predictions.
- `explanations`: The explanation results, which are `None` for this model.

In [31]:
instances_list = [INSTANCE]

prediction = endpoint.predict(instances_list)
print(prediction)

Prediction(predictions=[2.045193195343018], deployed_model_id='1444239309409353728', explanations=None)


## Undeploy the model

When you are done making predictions, undeploy the model from the `Endpoint` resource. This deprovisions all compute resources and ends billing for the deployed model.

In [32]:
endpoint.undeploy_all()

Undeploying Endpoint model: projects/781054085410/locations/us-central1/endpoints/257575991969316864
Undeploy Endpoint model backing LRO: projects/781054085410/locations/us-central1/endpoints/257575991969316864/operations/1112610728272986112
Endpoint model undeployed. Resource name: projects/781054085410/locations/us-central1/endpoints/257575991969316864


<google.cloud.aiplatform.models.Endpoint object at 0x7fcf030e12d0> 
resource name: projects/781054085410/locations/us-central1/endpoints/257575991969316864