# **redis-feast-gcp**: 03 - Triton + Vertex AI Prediction Inference Example

In this notebook, we will test the deployed Triton model on the Vertex AI Prediction endpoint.

**This notebook assumes that you've already set up your Feature Store, created your model repo in GCP, and deployed your model on Vertex AI with NVIDIA Triton.**

![architecture](img/redis-feast-gcp-architecture.png)

## Unpacking the Triton Ensemble

Before we test the inference endpoint to forecast Covid vaccinations for the state of Virginia, we will unpack the Triton Ensemble used to create the DAG of operations.

### What is an Ensemble?
An [Ensemble model](https://github.com/triton-inference-server/server/blob/main/docs/user_guide/architecture.md#ensemble-models) represents a pipeline of one or many operations (models) and connects inputs to outputs of each stage. These are useful for inference workflows that involve several stages like data preprocessing, postprocessing, and other transformations or business logic.


![ensemble](./img/RedisFeastTriton.png)

**Check out the structure of the Triton Model repository below.**

In [8]:
!tree ./setup/models

./setup/models
├── ensemble
│   ├── 1
│   └── config.pbtxt
├── fetch-vaccine-features
│   ├── 1
│   │   └── model.py
│   └── config.pbtxt
└── predict-vaccine-counts
    ├── 1
    │   └── xgboost.json
    └── config.pbtxt

6 directories, 5 files


The repository contains two models that do the work:
- `fetch-vaccine-features` - Fetches vaccine count features from Redis with low latency.
- `predict-vaccine-counts` - Uses an XGBoost model (with the [Triton FIL](https://developer.nvidia.com/blog/real-time-serving-for-xgboost-scikit-learn-randomforest-lightgbm-and-more/) backend) to forecast the counts.

The `ensemble` model wraps the other two, creating the pipeline. Each model here has a `config.pbtxt`. Let's look at the ensemble model config below:

In [9]:
!cat ./setup/models/ensemble/config.pbtxt

name: "ensemble"
platform: "ensemble"
max_batch_size: 256
input [
  {
    name: "state"
    data_type: TYPE_STRING
    dims: 1
  }
]
output [
  {
    name: "prediction"
    data_type: TYPE_FP32
    dims: 1
  }
]
ensemble_scheduling {
  step [
    {
      model_name: "fetch-vaccine-features"
      model_version: -1
      input_map {
        key: "state"
        value: "state"
      }
      output_map {
        key: "feature_values"
        value: "feature_values"
      }
    },
    {
      model_name: "predict-vaccine-counts"
      model_version: -1
      input_map {
        key: "feature_values"
        value: "feature_values"
      }
      output_map {
        key: "prediction"
        value: "prediction"
      }
    }
  ]
}
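
To make the data flow in this config concrete, here is a minimal Python sketch of how the two steps chain together through named tensors. It is a toy model of the routing only, not Triton's scheduler (which orders steps by data dependencies rather than list position). The functions `fetch_features`, `predict_counts`, and `run_ensemble` are hypothetical stand-ins that return made-up values; the real steps read from Redis and run XGBoost.

```python
# Toy illustration of the ensemble's tensor routing; NOT Triton's scheduler.
# Both step functions are hypothetical stand-ins that return made-up values.

def fetch_features(inputs):
    # Stand-in for "fetch-vaccine-features" (the real model reads from Redis).
    return {"feature_values": [float(len(inputs["state"]))]}

def predict_counts(inputs):
    # Stand-in for "predict-vaccine-counts" (the real model is XGBoost).
    return {"prediction": sum(inputs["feature_values"])}

# Mirrors ensemble_scheduling.step: (model, input_map, output_map).
# Each map is {model tensor name: ensemble tensor name}, as in config.pbtxt.
STEPS = [
    (fetch_features, {"state": "state"}, {"feature_values": "feature_values"}),
    (predict_counts, {"feature_values": "feature_values"}, {"prediction": "prediction"}),
]

def run_ensemble(request, steps=STEPS):
    tensors = dict(request)  # pool of ensemble-level tensors, keyed by name
    for model, input_map, output_map in steps:
        # Pull this step's inputs out of the shared pool.
        model_inputs = {name: tensors[ens] for name, ens in input_map.items()}
        model_outputs = model(model_inputs)
        # Publish this step's outputs back into the pool for later steps.
        for name, ens in output_map.items():
            tensors[ens] = model_outputs[name]
    return tensors

result = run_ensemble({"state": "Virginia"})
print(result["prediction"])
```

The point to notice is that `feature_values` never leaves the server: it is produced by the first step and consumed by the second, and only `prediction` is declared as an ensemble output.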

## Create Inference Instances

Before we can test the Vertex AI Prediction endpoint, we need to construct a JSON body that represents an inference request. See the example below:

In [None]:
import json

# Create inference instance
payload = {
    "id": "1",
    "inputs": [
        {
            "name": "state",        ## Triton model input name
            "shape": [1, 1],        ## Triton model input shape
            "datatype": "BYTES",    ## Triton model input datatype
            "data": [["Virginia"]]  ## Triton model input data
        }
    ]
}

# Save to file
with open("instances.json", "w") as f:
    json.dump(payload, f)
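
If you want to send more than one state per request, the same body can be generated programmatically. The sketch below uses a hypothetical helper, `build_payload`, and assumes that the feature store holds data for every state you pass in and that the deployed `model.py` (not shown here) handles more than one row per request. The limit of 256 comes from `max_batch_size` in the ensemble config.

```python
import json

MAX_BATCH_SIZE = 256  # matches max_batch_size in the ensemble config.pbtxt

def build_payload(states, request_id="1"):
    """Build an inference request body with one row per state (hypothetical helper)."""
    if not 1 <= len(states) <= MAX_BATCH_SIZE:
        raise ValueError(f"expected 1 to {MAX_BATCH_SIZE} states, got {len(states)}")
    return {
        "id": request_id,
        "inputs": [
            {
                "name": "state",
                "shape": [len(states), 1],       # [batch size, dims]
                "datatype": "BYTES",
                "data": [[s] for s in states],   # one single-element row per state
            }
        ],
    }

# "Ohio" is only an illustrative second value.
batch = build_payload(["Virginia", "Ohio"])
print(json.dumps(batch, indent=2))
```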

## Test Endpoint

You can test the Vertex AI Prediction `rawPredict` endpoint using any HTTP tool or library, including `curl`.

In [None]:
# Log in to GCloud using the CLI and your service account
!gcloud auth activate-service-account $SERVICE_ACCOUNT_EMAIL \
    --key-file=$GOOGLE_APPLICATION_CREDENTIALS \
    --project=$PROJECT_ID

In [None]:
!echo $(gcloud ai endpoints list \
  --region=$GCP_REGION \
  --filter=display_name=vaccine-predictor-endpoint \
  --format="value(name)")

In [None]:
%%bash

# Fetch Token
TOKEN=$(gcloud auth print-access-token)

# Fetch the Endpoint ID
ENDPOINT_ID=$(gcloud ai endpoints list \
  --region=$GCP_REGION \
  --filter=display_name=vaccine-predictor-endpoint \
  --format="value(name)")

# POST to the endpoint to get a response from the Triton ensemble model
curl \
  -X POST \
  -H "Authorization: Bearer ${TOKEN}" \
  -H "Content-Type: application/json" \
  "https://${GCP_REGION}-aiplatform.googleapis.com/v1/projects/${PROJECT_ID}/locations/${GCP_REGION}/endpoints/${ENDPOINT_ID}:rawPredict" \
  -d "@instances.json"
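
The same request can also be made from Python. The following is a minimal sketch using only the standard library; the helper names (`raw_predict_url`, `raw_predict`, `extract_output`) are hypothetical, and the project and endpoint values in the example are placeholders. It assumes the response body follows the KServe v2 inference protocol's JSON layout that Triton uses, with an `outputs` list of named tensors.

```python
import json
import urllib.request

def raw_predict_url(region, project_id, endpoint_id):
    """Build the rawPredict URL; the region appears in both the host and the path."""
    return (
        f"https://{region}-aiplatform.googleapis.com/v1/projects/{project_id}"
        f"/locations/{region}/endpoints/{endpoint_id}:rawPredict"
    )

def raw_predict(url, token, payload):
    """POST the payload and return the decoded JSON response (makes a network call)."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())

def extract_output(response_body, name="prediction"):
    """Return the data of the named output tensor from a v2-style response."""
    for output in response_body.get("outputs", []):
        if output["name"] == name:
            return output["data"]
    raise KeyError(f"output {name!r} not found in response")

# Placeholder values for illustration only.
print(raw_predict_url("us-east1", "my-project", "1234567890"))
```

To call the endpoint, pass the token from `gcloud auth print-access-token` and the payload saved in `instances.json` to `raw_predict`, then hand the result to `extract_output`.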


## Summary

We have just built an end-to-end ML system using Feast, Redis Enterprise, and NVIDIA Triton, all on GCP.

This system generates real-time predictions from up-to-date feature values and lets us build "point-in-time correct" datasets from our offline data source for training and model exploration.

Next steps that you can take after completing this tutorial include:

- Pull this repo and collaborate with your team.
- Use this tutorial to bootstrap a model for your use case by editing the features and model.
- Incorporate the code in this tutorial into your company's batch pipelines by creating stages that perform feature creation and materialization.

**Redis and Triton are a perfect match: bringing the data layer (optimized for fast data access) close to the computing infrastructure (optimized for fast data processing).**