# Serving NVIDIA HugeCTR model using NVIDIA Triton and Vertex AI Prediction

This notebook demonstrates how to serve NVIDIA HugeCTR deep learning models using NVIDIA Triton Inference Server and Vertex AI Prediction.
The notebook compiles prescriptive guidance for the following tasks:

1. Creating Triton ensemble models that combine NVTabular preprocessing workflows and HugeCTR models
2. Building a Vertex Prediction custom serving container image for serving the ensembles with Triton Inference Server
3. Registering and deploying the ensemble models with Vertex Prediction Models and Endpoints
4. Getting online predictions from the deployed ensembles

To fully benefit from the content covered in this notebook, you should have a solid understanding of key Vertex AI Prediction concepts like models, endpoints, and model deployments. We strongly recommend reviewing [Vertex AI Prediction documentation](https://cloud.google.com/vertex-ai/docs/predictions/getting-predictions) before proceeding.

### Triton Inference Server Overview

[Triton Inference Server](https://github.com/triton-inference-server/server) provides an inferencing solution optimized for both CPUs and GPUs. Triton can run multiple models from the same or different frameworks concurrently on a single GPU or CPU. In a multi-GPU server, it automatically creates an instance of each model on each GPU to increase utilization without extra coding. It supports real-time inferencing, batch inferencing to maximize GPU/CPU utilization, and streaming inference with built-in support for audio streaming input. It also supports model ensembles for use cases that require multiple models to perform end-to-end inference.

The following figure shows the Triton Inference Server high-level architecture.

<img src="./images/triton-architecture.png" alt="Triton Architecture" style="width:70%"/>


- The model repository is a file-system based repository of the models that Triton will make available for inferencing. 
- Inference requests arrive at the server via either HTTP/REST or gRPC and are then routed to the appropriate per-model scheduler. 
- Triton implements multiple scheduling and batching algorithms that can be configured on a model-by-model basis.
- The backend performs inferencing using the inputs provided in the batched requests to produce the requested outputs.


Triton server provides readiness and liveness health endpoints, as well as utilization, throughput, and latency metrics, which enable the integration of Triton into deployment environments, such as Vertex AI Prediction.

Refer to [Triton Inference Server Architecture](https://github.com/triton-inference-server/server/blob/main/docs/architecture.md) for more detailed information.
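
As a minimal sketch of how a deployment environment might probe these endpoints, the snippet below builds the liveness and readiness URLs defined by the inference protocol and checks one with the standard library. The host, the `health_url` and `is_healthy` helpers are assumptions for illustration; port 8000 matches the HTTP port configured for the serving container later in this notebook.

```python
import urllib.error
import urllib.request

# Health routes defined by the KServe v2 inference protocol that Triton implements.
LIVE_ROUTE = "/v2/health/live"
READY_ROUTE = "/v2/health/ready"


def health_url(host: str, port: int, route: str) -> str:
    """Build the full URL of a Triton health endpoint."""
    return f"http://{host}:{port}{route}"


def is_healthy(url: str, timeout: float = 2.0) -> bool:
    """Return True if the endpoint answers with HTTP 200, False otherwise."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeouts, and HTTP errors all count as "not healthy".
        return False


# Hypothetical local Triton instance; adjust host and port to your setup.
ready_url = health_url("localhost", 8000, READY_ROUTE)
print(ready_url)
```

Calling `is_healthy(ready_url)` against a running server would then tell you whether Triton reports itself ready to serve.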

### Triton Inference Server on Vertex AI Prediction



In this section, we describe the deployment of Triton Inference Server on Vertex AI Prediction. Although the focus of this notebook is on demonstrating how to serve an ensemble of an NVTabular preprocessing workflow and a HugeCTR model, the outlined design patterns are applicable to a wider set of serving scenarios. The following figure shows the deployment architecture.

<img src="./images/triton-in-vertex.png" alt="Triton on Vertex AI Prediction" style="width:70%"/>


Triton Inference Server runs inside a container based on a custom serving image. The custom container image is built on top of the [NVIDIA Merlin Inference image](https://ngc.nvidia.com/catalog/containers/nvidia:merlin:merlin-inference) and adds packages and configurations to align with the Vertex AI [requirements for custom serving container images](https://cloud.google.com/vertex-ai/docs/predictions/custom-container-requirements).

An ensemble to be served by Triton is registered with Vertex AI Prediction as a `Model`. The `Model`'s metadata references the location of the ensemble artifacts in Google Cloud Storage (GCS), the custom serving container, and its configuration.

After the model is deployed to a Vertex AI Prediction endpoint, the entrypoint script of the custom container copies the ensemble's artifacts from the GCS location to a local file system in the container. It then starts Triton, referencing a local copy of the ensemble as Triton's model repository. 
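
The entrypoint logic described above can be sketched in Python as two commands. This is an illustration only, not the notebook's `entrypoint.sh`: the `build_entrypoint_commands` helper and the example URI are hypothetical. Per the custom container requirements, Vertex AI passes the artifact location to the container in the `AIP_STORAGE_URI` environment variable, and `--model-repository` is the Triton option that selects the repository. Any additional options the HugeCTR backend needs are omitted here.

```python
import os
from typing import List


def build_entrypoint_commands(storage_uri: str, model_repository_path: str) -> List[List[str]]:
    """Return the copy command and the Triton start command as argument lists."""
    copy_cmd = [
        "gsutil", "-m", "cp", "-r",
        f"{storage_uri.rstrip('/')}/*",  # everything under the artifact URI
        model_repository_path,           # local folder inside the container
    ]
    serve_cmd = ["tritonserver", f"--model-repository={model_repository_path}"]
    return [copy_cmd, serve_cmd]


# AIP_STORAGE_URI is set by Vertex AI inside the serving container.
# The fallback value is a made-up example so the sketch runs anywhere.
storage_uri = os.environ.get("AIP_STORAGE_URI", "gs://example-bucket/models/ensemble")
commands = build_entrypoint_commands(storage_uri, "/models")
for cmd in commands:
    print(" ".join(cmd))
```

The sketch only prints the commands. A real entrypoint would run them in order and keep the Triton process in the foreground.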

Triton loads the models comprising the ensemble and exposes inference, health, and model management REST endpoints using [standard inference protocols](https://github.com/kserve/kserve/tree/master/docs/predict-api/v2). Triton's inference endpoint, `/v2/models/{ENSEMBLE_NAME}/infer`, is mapped to the Vertex AI Prediction predict route and exposed to external clients through the Vertex AI Prediction endpoint. Triton's health endpoint, `/v2/health/ready`, is mapped to the Vertex AI Prediction health route and used by Vertex AI Prediction for [health checks](https://cloud.google.com/vertex-ai/docs/predictions/custom-container-requirements#health).

To invoke the ensemble through the Vertex AI Prediction endpoint, you need to format your request using a [standard Inference Request JSON Object](https://github.com/kserve/kserve/blob/master/docs/predict-api/v2/required_api.md#inference) or an [Inference Request JSON Object with a binary extension](https://github.com/triton-inference-server/server/blob/main/docs/protocol/extension_binary_data.md) and submit the request to the Vertex AI Prediction [REST rawPredict endpoint](https://cloud.google.com/vertex-ai/docs/reference/rest/v1beta1/projects.locations.endpoints/rawPredict). You need to use the `rawPredict` endpoint rather than the `predict` endpoint because the inference request formats used by Triton are not compatible with the Vertex AI Prediction [standard input format](https://cloud.google.com/vertex-ai/docs/predictions/online-predictions-custom-models#formatting-prediction-input).
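
A minimal sketch of building a standard inference request body from column values follows. The `build_infer_request` helper is hypothetical; the field names (`id`, `inputs`, `name`, `shape`, `datatype`, `data`) follow the inference protocol, and the `[batch, 1]` shape and `INT32` datatype mirror the sample request used in the last section of this notebook.

```python
import json
from typing import Dict, List


def build_infer_request(columns: Dict[str, List[int]], request_id: str = "1",
                        datatype: str = "INT32") -> dict:
    """Turn {column name: values} into an inference request with one tensor per column."""
    batch_sizes = {len(values) for values in columns.values()}
    if len(batch_sizes) != 1:
        raise ValueError("All columns must contain the same number of values")
    batch_size = batch_sizes.pop()
    inputs = [
        {"name": name, "shape": [batch_size, 1], "datatype": datatype, "data": list(values)}
        for name, values in columns.items()
    ]
    return {"id": request_id, "inputs": inputs}


# Two columns with a batch of three records each.
request_body = build_infer_request({"I1": [5, 32, 0], "I2": [110, 3, 233]})
print(json.dumps(request_body, indent=2))
```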


### Notebook flow

This notebook assumes that you have access to both a trained HugeCTR model and a fitted NVTabular workflow that converts raw inputs into the inputs required by the model. These artifacts are created by the [01-dataset-preprocessing.ipynb](01-dataset-preprocessing.ipynb) and [02-model-training-hugectr.ipynb](02-model-training-hugectr.ipynb) notebooks.

As you walk through the notebook you will execute the following tasks:

- Configure the notebook environment settings, including GCP project, compute region, and the GCS locations of a HugeCTR trained model and an NVTabular fitted workflow.
- Create an ensemble model consisting of the fitted NVTabular workflow for input preprocessing and the HugeCTR model for generating predictions
- Build a custom Vertex serving container based on NVIDIA NGC Merlin Inference container
- Register the ensemble as a Vertex Prediction model
- Create a Vertex Prediction endpoint
- Deploy the model to the endpoint
- Invoke the deployed ensemble model


## Setup

In this section of the notebook you configure your environment settings, including a GCP project, a Vertex AI compute region, and a Vertex AI staging GCS bucket. 
You also set the locations of a fitted NVTabular workflow, a trained HugeCTR model, and a set of constants that are used to create names and display names of Vertex AI Prediction resources.

Make sure to update the cells below with the values reflecting your environment.

In [None]:
import json
import os
import shutil
import time

from pathlib import Path
from src.serving import export
from src import feature_utils

from google.cloud import aiplatform as vertex_ai

Set the below constants to your project id, a compute region for Vertex AI and a GCS bucket that will be used for Vertex AI staging and storing exported model artifacts.

In [None]:
PROJECT_ID = 'jk-mlops-dev' # Change to your project.
REGION = 'us-central1'  # Change to your region.
STAGING_BUCKET = 'jk-merlin-dev' # Change to your bucket.

`LOCAL_WORKSPACE` is used for staging artifacts that need to be processed on a local file system. `MODEL_ARTIFACTS_REPOSITORY` is a root GCS location where the exported ensemble model artifacts will be stored. If you run this notebook on Vertex Workbench you don't need to change these values.

In [None]:
LOCAL_WORKSPACE = '/home/jupyter/staging'
MODEL_ARTIFACTS_REPOSITORY = f'gs://{STAGING_BUCKET}/models'

The following set of constants will be used to create names and display names of Vertex Prediction resources like models, endpoints, and model deployments. The HugeCTR model trained in the previous notebooks is a *DeepFM* deep learning ranking model so the default model name is set to `deepfm`.

In [None]:
MODEL_NAME = 'deepfm'
MODEL_VERSION = 'v01'
MODEL_DISPLAY_NAME = f'criteo-hugectr-{MODEL_NAME}-{MODEL_VERSION}'
MODEL_DESCRIPTION = 'HugeCTR DeepFM model'
ENDPOINT_DISPLAY_NAME = f'hugectr-{MODEL_NAME}-{MODEL_VERSION}'

The following constants set the name and the location of the Dockerfile for the custom serving container you will build in the following section of the notebook. You don't need to change these values.

In [None]:
IMAGE_NAME = 'triton-deploy-hugectr'
IMAGE_URI = f"gcr.io/{PROJECT_ID}/{IMAGE_NAME}"
DOCKERFILE = 'src/Dockerfile.triton'

And finally, the `WORKFLOW_MODEL_PATH` and the `HUGECTR_MODEL_PATH` should be updated to point to GCS locations of your NVTabular fitted workflow and the trained HugeCTR model generated by the [01-dataset-preprocessing.ipynb](01-dataset-preprocessing.ipynb) and [02-model-training-hugectr.ipynb](02-model-training-hugectr.ipynb) notebooks.

In [None]:
WORKFLOW_MODEL_PATH = "gs://criteo-datasets/criteo_processed_parquet/workflow" # Change to GCS path of the nvt workflow.
HUGECTR_MODEL_PATH = "gs://merlin-models/hugectr_deepfm_21.09" # Change to GCS path of the hugectr trained model.

### Initialize Vertex AI SDK

In [None]:
vertex_ai.init(
    project=PROJECT_ID,
    location=REGION,
    staging_bucket=f'gs://{STAGING_BUCKET}'  # The SDK documents this value as a gs:// URI.
)

## 1. Exporting Triton ensemble model

A Triton ensemble model represents a pipeline of one or more models and the connection of input and output tensors between these models. Ensemble models are intended to encapsulate inference pipelines that involve multiple steps, each performed by a different model, such as the common "data preprocessing -> inference -> data postprocessing" pattern. Using ensemble models for this purpose can avoid the overhead of transferring intermediate tensors between client and serving endpoints and minimize the number of requests that must be sent to Triton.

In our case, an inference pipeline comprises two steps: input preprocessing using a fitted NVTabular workflow and generating predictions using a HugeCTR ranking model.

An ensemble model is not an actual serialized model. There are no additional model artifacts created when an ensemble is defined. It is a configuration that specifies which actual models comprise the ensemble, the execution flow when processing an inference request, and the flow of data between inputs and outputs of the component models. This configuration is defined using the same [protocol buffer](https://developers.google.com/protocol-buffers) based configuration format as used for serving other model types in Triton. Refer to the [Triton Inference Server Model Configuration guide](https://github.com/triton-inference-server/server/blob/main/docs/model_configuration.md) for detailed information about configuring models and model ensembles.
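
To make the idea concrete, the following toy sketch mimics what an ensemble configuration expresses: an ordered list of steps, each with a mapping between ensemble-level tensor names and the step's own inputs and outputs. It is plain Python rather than Triton's configuration format, and the step functions, tensor names, and values are invented for illustration.

```python
from typing import Callable, Dict, List, Tuple

Tensors = Dict[str, list]
# (model function, {model input: ensemble tensor}, {model output: ensemble tensor})
Step = Tuple[Callable[[Tensors], Tensors], Dict[str, str], Dict[str, str]]


def run_ensemble(steps: List[Step], request: Tensors) -> Tensors:
    """Run the steps in order, routing tensors through a shared pool by name."""
    pool = dict(request)
    for model, input_map, output_map in steps:
        model_inputs = {key: pool[ens_name] for key, ens_name in input_map.items()}
        model_outputs = model(model_inputs)
        for key, ens_name in output_map.items():
            pool[ens_name] = model_outputs[key]
    return pool


def preprocess(inputs: Tensors) -> Tensors:
    """Stand-in for the preprocessing workflow: scale raw values."""
    return {"scaled": [x / 10 for x in inputs["raw"]]}


def rank(inputs: Tensors) -> Tensors:
    """Stand-in for the ranking model: clip features into [0, 1]."""
    return {"score": [min(1.0, max(0.0, x)) for x in inputs["features"]]}


steps = [
    (preprocess, {"raw": "INPUT"}, {"scaled": "PREPROCESSED"}),
    (rank, {"features": "PREPROCESSED"}, {"score": "OUTPUT"}),
]
result = run_ensemble(steps, {"INPUT": [5, 20, -3]})
print(result["OUTPUT"])  # [0.5, 1.0, 0.0]
```

The client sends only `INPUT` and reads only `OUTPUT`; the intermediate `PREPROCESSED` tensor never leaves the server, which is the overhead saving described above.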

You can create an ensemble model manually by arranging the component models into the prescribed folder structure and editing the required configuration files. For ensemble models that follow the "NVTabular workflow -> Inference Model" processing pattern, you can use a set of utility functions provided by the `nvtabular.inference.triton` module. Specifically, to create an "NVTabular workflow -> HugeCTR model" ensemble, as used in this notebook, you can use the `nvtabular.inference.triton.export_hugectr_ensemble` function.

We have encapsulated the ensemble export logic in the `src.serving.export.export_ensemble` function. In addition to calling `nvtabular.inference.triton.export_hugectr_ensemble`, the function also creates a JSON configuration file required by Triton when serving HugeCTR models. This file, `ps.json`, specifies the locations of the different components comprising a saved HugeCTR model and is used by the Triton HugeCTR backend to correctly load the saved model and prepare it for serving.

Recall that the entrypoint script in the custom serving container copies the ensemble's model artifacts from a source GCS location, as prepared by Vertex AI Prediction, into the serving container's local file system. The `ps.json` file needs to use paths that correctly point to the saved model artifacts in the container's file system. In addition, some of the configs generated by `nvtabular.inference.triton.export_hugectr_ensemble` embed absolute paths that need to be adjusted. The `src.serving.export.export_ensemble` function handles all of that. You can specify the target root folder in the container's local file system using the `model_repository_path` parameter, and all the paths will be adjusted accordingly.
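
The path adjustment described above amounts to replacing one root prefix with another in every path stored in a configuration. The sketch below shows the idea on a nested dictionary. It is not the code in the `src` package; the `rebase_paths` helper, the key names, and the file names in the example are hypothetical and do not claim to reproduce the real `ps.json` schema.

```python
from typing import Any


def rebase_paths(config: Any, old_root: str, new_root: str) -> Any:
    """Recursively replace the old_root prefix with new_root in all string values."""
    if isinstance(config, dict):
        return {key: rebase_paths(value, old_root, new_root) for key, value in config.items()}
    if isinstance(config, list):
        return [rebase_paths(item, old_root, new_root) for item in config]
    if isinstance(config, str) and config.startswith(old_root):
        return new_root + config[len(old_root):]
    return config


# Made-up configuration that points at a local staging folder.
local_config = {
    "models": [
        {
            "model": "deepfm",
            "dense_file": "/home/jupyter/staging/ensemble/deepfm/1/dense.model",
            "sparse_files": ["/home/jupyter/staging/ensemble/deepfm/1/sparse.model"],
        }
    ]
}

container_config = rebase_paths(local_config, "/home/jupyter/staging/ensemble", "/models")
print(container_config["models"][0]["dense_file"])  # /models/deepfm/1/dense.model
```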



### Copy a HugeCTR saved model and a fitted NVTabular workflow to a local staging folder

The `nvtabular.inference.triton.export_hugectr_ensemble` function does not support GCS. As such, you need to copy the NVTabular workflow and HugeCTR model artifacts to a local file system.

In [None]:
if os.path.isdir(LOCAL_WORKSPACE):
    shutil.rmtree(LOCAL_WORKSPACE)
os.makedirs(LOCAL_WORKSPACE)

!gsutil -m cp -r {WORKFLOW_MODEL_PATH} {LOCAL_WORKSPACE}
!gsutil -m cp -r {HUGECTR_MODEL_PATH} {LOCAL_WORKSPACE}

### Export the ensemble model

The `src.serving.export.export_ensemble` utility function takes a number of arguments that are required to set up a proper flow of tensors between inputs and outputs of the NVTabular workflow and the HugeCTR model.

- `model_name` - The model name that will be used as a prefix for the generated ensemble artifacts.
- `workflow_path` - The local path to the NVTabular workflow
- `saved_model_path` - The local path to the saved HugeCTR model
- `output_path` - The local path to the location where an ensemble will be exported
- `model_repository_path` - The path to use as a root in `ps.json` and other config files
- `max_batch_size` - The maximum size of a serving batch that will be supported by the ensemble


The following settings should match the settings of the NVTabular workflow

- `categorical_columns` - The list of names of categorical input features to the NVTabular workflow
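- `continuous_columns` - The list of names of continuous input features to the NVTabular workflow
- `label_columns` - The list of names of label columns; its length is used as `num_outputs` in the cell below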


The following settings should match the respective settings in the HugeCTR model

- `num_outputs` - The number of outputs from the HugeCTR model
- `embedding_vector_size` - The size of an embedding vector used by the HugeCTR model
- `num_slots` - The number of slots used for sparse features of the HugeCTR model
- `max_nnz` - This value controls how sparse features are coded in the embedding arrays 


As noted before, in this notebook we assume that you generated the NVTabular workflow and the HugeCTR model using the [01-dataset-preprocessing.ipynb](01-dataset-preprocessing.ipynb) and [02-model-training-hugectr.ipynb](02-model-training-hugectr.ipynb) notebooks. The workflow captures the preprocessing logic for the Criteo dataset and the HugeCTR model is an implementation of [the DeepFM CTR model](https://arxiv.org/abs/1703.04247).
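
For orientation, the column lists could look like the sketch below. The actual lists come from `src.feature_utils`, which is not shown in this notebook. The stand-in definitions here assume the Criteo naming used in the sample request in the last section (`I1` to `I13` for continuous and `C1` to `C26` for categorical features) and a hypothetical label name.

```python
# Stand-ins for the lists returned by feature_utils (assumed naming, not the real module).
example_continuous_columns = [f"I{i}" for i in range(1, 14)]   # I1 .. I13
example_categorical_columns = [f"C{i}" for i in range(1, 27)]  # C1 .. C26
example_label_columns = ["label"]                              # hypothetical label name

# num_outputs is derived from the number of label columns, as in the cell below.
example_num_outputs = len(example_label_columns)
print(len(example_continuous_columns), len(example_categorical_columns), example_num_outputs)
```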

In [None]:
NUM_SLOTS = 26
MAX_NNZ = 2
EMBEDDING_VECTOR_SIZE = 11
MAX_BATCH_SIZE = 64

continuous_columns = feature_utils.continuous_columns()
categorical_columns = feature_utils.categorical_columns()
label_columns = feature_utils.label_columns()
num_outputs = len(label_columns)

local_workflow_path = Path(LOCAL_WORKSPACE) / Path(WORKFLOW_MODEL_PATH).parts[-1]
local_saved_model_path = Path(LOCAL_WORKSPACE) / Path(HUGECTR_MODEL_PATH).parts[-1]
local_ensemble_path = Path(LOCAL_WORKSPACE) / f'triton-ensemble-{time.strftime("%Y%m%d%H%M%S")}'
model_repository_path = '/models'

In [None]:
export.export_ensemble(
    model_name=MODEL_NAME,
    workflow_path=local_workflow_path,
    saved_model_path=local_saved_model_path,
    output_path=local_ensemble_path,
    categorical_columns=categorical_columns,
    continuous_columns=continuous_columns,
    label_columns=label_columns,
    num_slots=NUM_SLOTS,
    max_nnz=MAX_NNZ,
    num_outputs=num_outputs,
    embedding_vector_size=EMBEDDING_VECTOR_SIZE,
    max_batch_size=MAX_BATCH_SIZE,
    model_repository_path=model_repository_path
    )

The previous cell created the following local folder structure

In [None]:
! ls -la {local_ensemble_path}

The `deepfm` folder contains artifacts and configurations for the HugeCTR model. The `deepfm_ens` folder contains a configuration for the ensemble model. The `deepfm_nvt` folder contains artifacts and configurations for the NVTabular preprocessing workflow. The `ps.json` file contains information required by Triton's HugeCTR backend.

Notice that the file paths in `ps.json` use the value from `model_repository_path`.


In [None]:
! cat {local_ensemble_path}/ps.json
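
Before uploading, you may want to verify that the export produced the entries described above. The following is a minimal sketch; the `missing_ensemble_entries` helper is hypothetical, and the expected names are taken from the folder listing for a model named `deepfm`.

```python
import tempfile
from pathlib import Path
from typing import List


def missing_ensemble_entries(ensemble_path: Path, model_name: str) -> List[str]:
    """Return the expected top-level entries that are absent from ensemble_path."""
    expected = [model_name, f"{model_name}_ens", f"{model_name}_nvt", "ps.json"]
    return [name for name in expected if not (ensemble_path / name).exists()]


# Demonstration on a throwaway directory that lacks ps.json.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    for folder in ("deepfm", "deepfm_ens", "deepfm_nvt"):
        (root / folder).mkdir()
    demo_missing = missing_ensemble_entries(root, "deepfm")

print(demo_missing)  # ['ps.json']
```

In this notebook you would call `missing_ensemble_entries(Path(local_ensemble_path), MODEL_NAME)` and expect an empty list.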

### Upload the ensemble to GCS

In the later steps you will register the exported ensemble model as a Vertex AI Prediction model resource. Before doing that we need to move the ensemble to GCS.

In [None]:
gcs_ensemble_path = '{}/{}'.format(MODEL_ARTIFACTS_REPOSITORY, Path(local_ensemble_path).parts[-1])

!gsutil -m cp -r {local_ensemble_path}/* {gcs_ensemble_path}/

## 2. Building a custom serving container 

The custom serving container is derived from the NVIDIA NGC Merlin inference container. It adds Google Cloud SDK and an entrypoint script that executes the tasks described in detail in the overview.

In [None]:
! cat {DOCKERFILE}

As described in detail in the overview, the entry point script copies the ensemble artifacts to the serving container's local file system and starts Triton.

In [None]:
! cat src/serving/entrypoint.sh

You use [Cloud Build](https://cloud.google.com/build) to build the serving container and push it to your project's [Container Registry](https://cloud.google.com/container-registry).

In [None]:
! cp {DOCKERFILE} src/Dockerfile
! gcloud builds submit --timeout "2h" --tag {IMAGE_URI} src --machine-type=e2-highcpu-8

## 3. Uploading the model and its metadata to Vertex Models

In the following cell you will register (upload) the ensemble model as a Vertex AI Prediction `Model` resource. 

Refer to [Use a custom container for prediction guide](https://cloud.google.com/vertex-ai/docs/predictions/use-custom-container) for detailed information about creating Vertex AI Prediction `Model` resources.

Notice that the value of `model_repository_path` that was used when exporting the ensemble is passed as a command line parameter to the serving container. The entrypoint script in the container will copy the ensemble artifacts to this location when the container starts. This ensures that the locations of the artifacts in the container's local file system and the paths in the `ps.json` and other configuration files used by Triton match.

In [None]:
health_route = "/v2/health/ready"
predict_route = f"/v2/models/{MODEL_NAME}_ens/infer"
serving_container_ports = [8000]
serving_container_args = [model_repository_path]


model = vertex_ai.Model.upload(
    display_name=MODEL_DISPLAY_NAME,
    description=MODEL_DESCRIPTION,
    serving_container_image_uri=IMAGE_URI,
    serving_container_predict_route=predict_route,
    serving_container_health_route=health_route,
    serving_container_ports=serving_container_ports,
    artifact_uri=gcs_ensemble_path,
    serving_container_args=serving_container_args,
    sync=True
)

model.resource_name

## 4. Deploying the model to Vertex AI Prediction

Deploying a Vertex AI Prediction `Model` is a two-step process. First, you create an endpoint that will expose an external interface to clients consuming the model. After the endpoint is ready, you can deploy multiple versions of a model to the endpoint.

Refer to [Deploy a model using the Vertex AI API guide](https://cloud.google.com/vertex-ai/docs/predictions/deploy-model-api) for more information about the APIs used in the following cells.

### Create the Vertex Endpoint

Before deploying the ensemble model you need to create a Vertex AI Prediction endpoint. 

In [None]:
endpoint = vertex_ai.Endpoint.create(
    display_name=ENDPOINT_DISPLAY_NAME
)

### Deploy the model to Vertex Prediction endpoint

After the endpoint is ready, you can deploy your ensemble model to the endpoint. You will run the ensemble on GPU nodes, each equipped with one NVIDIA Tesla T4 GPU.

Refer to [Deploy a model using the Vertex AI API guide](https://cloud.google.com/vertex-ai/docs/predictions/deploy-model-api) for more information.

In [None]:
traffic_percentage = 100
machine_type = "n1-standard-8"
accelerator_type = "NVIDIA_TESLA_T4"
accelerator_count = 1
min_replica_count = 1
max_replica_count = 3

In [None]:
model.deploy(
    endpoint=endpoint,
    deployed_model_display_name=MODEL_DISPLAY_NAME,
    machine_type=machine_type,
    min_replica_count=min_replica_count,
    max_replica_count=max_replica_count,
    traffic_percentage=traffic_percentage,
    accelerator_type=accelerator_type,
    accelerator_count=accelerator_count,
    sync=True,
)

## 5. Invoking the model

To invoke the ensemble through the Vertex AI Prediction endpoint, you need to format your request using a [standard Inference Request JSON Object](https://github.com/kserve/kserve/blob/master/docs/predict-api/v2/required_api.md#inference) or an [Inference Request JSON Object with a binary extension](https://github.com/triton-inference-server/server/blob/main/docs/protocol/extension_binary_data.md) and submit the request to the Vertex AI Prediction [REST rawPredict endpoint](https://cloud.google.com/vertex-ai/docs/reference/rest/v1beta1/projects.locations.endpoints/rawPredict). You need to use the `rawPredict` endpoint rather than the `predict` endpoint because the inference request formats used by Triton are not compatible with the Vertex AI Prediction [standard input format](https://cloud.google.com/vertex-ai/docs/predictions/online-predictions-custom-models#formatting-prediction-input).

The cell below shows a sample request body formatted as a [standard Inference Request JSON Object](https://github.com/kserve/kserve/blob/master/docs/predict-api/v2/required_api.md#inference). The request encapsulates a batch of three records from the Criteo dataset.


In [None]:
payload = {
    'id': '1',
    'inputs': [
        {'name': 'I1', 'shape': [3, 1], 'datatype': 'INT32', 'data': [5, 32, 0]},
        {'name': 'I2', 'shape': [3, 1], 'datatype': 'INT32', 'data': [110, 3, 233]},
        {'name': 'I3', 'shape': [3, 1], 'datatype': 'INT32', 'data': [0, 5, 1]},
        {'name': 'I4', 'shape': [3, 1], 'datatype': 'INT32', 'data': [16, 0, 146]},
        {'name': 'I5', 'shape': [3, 1], 'datatype': 'INT32', 'data': [0, 1, 1]},
        {'name': 'I6', 'shape': [3, 1], 'datatype': 'INT32', 'data': [1, 0, 0]},
        {'name': 'I7', 'shape': [3, 1], 'datatype': 'INT32', 'data': [0, 0, 0]},
        {'name': 'I8', 'shape': [3, 1], 'datatype': 'INT32', 'data': [14, 61, 99]},
        {'name': 'I9', 'shape': [3, 1], 'datatype': 'INT32', 'data': [7, 5, 7]},
        {'name': 'I10', 'shape': [3, 1], 'datatype': 'INT32', 'data': [1, 0, 0]},
        {'name': 'I11', 'shape': [3, 1], 'datatype': 'INT32', 'data': [0, 1, 1]},
        {'name': 'I12', 'shape': [3, 1], 'datatype': 'INT32', 'data': [306, 3157, 3101]},
        {'name': 'I13', 'shape': [3, 1], 'datatype': 'INT32', 'data': [0, 5, 1]},
        {'name': 'C1', 'shape': [3, 1], 'datatype': 'INT32', 'data': [1651969401, -436994675, 1651969401]},
        {'name': 'C2', 'shape': [3, 1], 'datatype': 'INT32', 'data': [-501260968, -1599406170, -1382530557]},
        {'name': 'C3', 'shape': [3, 1], 'datatype': 'INT32', 'data': [-1343601617, 1873417685, 1656669709]},
        {'name': 'C4', 'shape': [3, 1], 'datatype': 'INT32', 'data': [-1805877297, -628476895, 946620910]},
        {'name': 'C5', 'shape': [3, 1], 'datatype': 'INT32', 'data': [951068488, 1020698403, -413858227]},
        {'name': 'C6', 'shape': [3, 1], 'datatype': 'INT32', 'data': [1875733963, 1875733963, 1875733963]},
        {'name': 'C7', 'shape': [3, 1], 'datatype': 'INT32', 'data': [897624609, -1424560767, -1242174622]},
        {'name': 'C8', 'shape': [3, 1], 'datatype': 'INT32', 'data': [679512323, 1128426537, -772617077]},
        {'name': 'C9', 'shape': [3, 1], 'datatype': 'INT32', 'data': [1189011366, 502653268, 776897055]},
        {'name': 'C10', 'shape': [3, 1], 'datatype': 'INT32', 'data': [771915201, 2112471209, 771915201]},
        {'name': 'C11', 'shape': [3, 1], 'datatype': 'INT32', 'data': [209470001, 1716706404, 209470001]},
        {'name': 'C12', 'shape': [3, 1], 'datatype': 'INT32', 'data': [-1785193185, -1712632281, 309420420]},
        {'name': 'C13', 'shape': [3, 1], 'datatype': 'INT32', 'data': [12976055, 12976055, 12976055]},
        {'name': 'C14', 'shape': [3, 1], 'datatype': 'INT32', 'data': [-1102125769, -1102125769, -1102125769]},
        {'name': 'C15', 'shape': [3, 1], 'datatype': 'INT32', 'data': [-1978960692, -205783399, -150008565]},
        {'name': 'C16', 'shape': [3, 1], 'datatype': 'INT32', 'data': [1289502458, 1289502458, 1289502458]},
        {'name': 'C17', 'shape': [3, 1], 'datatype': 'INT32', 'data': [-771205462, -771205462, -771205462]},
        {'name': 'C18', 'shape': [3, 1], 'datatype': 'INT32', 'data': [-1206449222, -1578429167, 1653545869]},
        {'name': 'C19', 'shape': [3, 1], 'datatype': 'INT32', 'data': [-1793932789, -1793932789, -1793932789]},
        {'name': 'C20', 'shape': [3, 1], 'datatype': 'INT32', 'data': [-1014091992, -20981661, -1014091992]},
        {'name': 'C21', 'shape': [3, 1], 'datatype': 'INT32', 'data': [351689309, -1556988767, 351689309]},
        {'name': 'C22', 'shape': [3, 1], 'datatype': 'INT32', 'data': [632402057, -924717482, 632402057]},
        {'name': 'C23', 'shape': [3, 1], 'datatype': 'INT32', 'data': [-675152885, 391309800, -675152885]},
        {'name': 'C24', 'shape': [3, 1], 'datatype': 'INT32', 'data': [2091868316, 1966410890, 883538181]},
        {'name': 'C25', 'shape': [3, 1], 'datatype': 'INT32', 'data': [809724924, -1726799382, -10139646]},
        {'name': 'C26', 'shape': [3, 1], 'datatype': 'INT32', 'data': [-317696227, -1218975401, -317696227]}]
}

with open('criteo_payload.json', 'w') as f:
    json.dump(payload, f)
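
A quick local check of the payload can catch shape mistakes before a request is sent. The sketch below verifies that the length of each input's `data` equals the product of its `shape`. The `shape_mismatches` helper is hypothetical and checks only this one property.

```python
from math import prod
from typing import List


def shape_mismatches(payload: dict) -> List[str]:
    """Return names of inputs whose data length differs from the product of their shape."""
    return [
        tensor["name"]
        for tensor in payload["inputs"]
        if len(tensor["data"]) != prod(tensor["shape"])
    ]


sample = {
    "id": "1",
    "inputs": [
        {"name": "I1", "shape": [3, 1], "datatype": "INT32", "data": [5, 32, 0]},
        {"name": "I2", "shape": [3, 1], "datatype": "INT32", "data": [110, 3]},  # one value short
    ],
}
print(shape_mismatches(sample))  # ['I2']
```

Applied to the `payload` defined above, the function should return an empty list.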

You can invoke the Vertex AI Prediction `rawPredict` endpoint using any HTTP tool or library, including `curl`.

In [None]:
uri = f'https://{REGION}-aiplatform.googleapis.com/v1/projects/{PROJECT_ID}/locations/{REGION}/endpoints/{endpoint.name}:rawPredict'

! curl -X POST \
-H "Authorization: Bearer $(gcloud auth print-access-token)" \
-H "Content-Type: application/json"  \
{uri} \
-d @criteo_payload.json
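
The same call can be made from Python. The sketch below only constructs the request with the standard library and parses a response, so it runs without credentials. `build_raw_predict_request` and `extract_outputs` are hypothetical helpers, the project, endpoint ID, and token are placeholders (a real token can be obtained with `gcloud auth print-access-token`), and the sample response is made up to follow the shape of an inference protocol response.

```python
import json
import urllib.request
from typing import Dict


def build_raw_predict_request(region: str, project: str, endpoint_id: str,
                              payload: dict, token: str) -> urllib.request.Request:
    """Build (but do not send) a POST request to the rawPredict endpoint."""
    url = (
        f"https://{region}-aiplatform.googleapis.com/v1/projects/{project}"
        f"/locations/{region}/endpoints/{endpoint_id}:rawPredict"
    )
    headers = {"Authorization": f"Bearer {token}", "Content-Type": "application/json"}
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(url, data=body, headers=headers, method="POST")


def extract_outputs(response_body: str) -> Dict[str, list]:
    """Map each output tensor name in an inference response to its data."""
    response = json.loads(response_body)
    return {output["name"]: output["data"] for output in response.get("outputs", [])}


# Placeholder values; substitute your own region, project, endpoint ID, and token.
request = build_raw_predict_request(
    "us-central1", "my-project", "1234567890", {"id": "1", "inputs": []}, "PLACEHOLDER_TOKEN"
)
# To send it: urllib.request.urlopen(request).read().decode("utf-8")

# Made-up response body, used only to exercise the parser.
sample_response = (
    '{"id": "1", "outputs": [{"name": "OUTPUT0", "shape": [3], '
    '"datatype": "FP32", "data": [0.1, 0.2, 0.3]}]}'
)
print(extract_outputs(sample_response))
```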