Copyright 2021 NVIDIA Corporation. All Rights Reserved.

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.

<img src="http://developer.download.nvidia.com/compute/machine-learning/frameworks/nvidia_logo.png" style="width: 90px; float: center;">

# Triton server with FIL backend in Vertex AI

## Overview

This notebook shows how to deploy an ensemble of [XGBoost models](https://xgboost.readthedocs.io/en/latest/) in Triton Inference Server with the Forest Inference Library (FIL) backend. The FIL backend allows forest models trained by several popular machine learning frameworks (including XGBoost, LightGBM, Scikit-Learn, and cuML) to be deployed in a Triton inference server using the RAPIDS Forest Inference Library for fast GPU-based inference. Using this backend, forest models can be deployed alongside deep learning models for fast, unified inference pipelines.

The ensemble is built with [Business Logic Scripting (BLS)](https://github.com/triton-inference-server/python_backend/blob/main/README.md#business-logic-scripting) in the Triton Python backend.

This notebook runs in the Triton NGC Docker container: `nvcr.io/nvidia/tritonserver:22.01-py3`

### Contents
* [Train XGBoost model on dummy data](#Train-XGBoost-model)
* [Export, load and deploy XGBoost model in Triton Inference Server](#Export,-load-and-deploy-XGBoost-model-in-Triton-Inference-Server)
* [Determine throughput and latency using Perf Analyzer](#Determine-throughput-and-latency-with-Perf-Analyzer)
* [Find best configuration using Model Analyzer](#Find-best-configuration-using-Model-Analyzer)
* [Deploy model with best configuration](#Deploy-model-with-best-configuration)
* [Triton Client](#Triton-Client)
* [Conclusion](#Conclusion)

## Requirements

* NVIDIA GPU (Pascal or newer; recommended GPUs: T4, V100, or A100)
* [Latest NVIDIA driver](https://docs.nvidia.com/datacenter/tesla/tesla-installation-notes/index.html)
* [Docker](https://docs.docker.com/get-docker/)
* [The NVIDIA container toolkit](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html#docker)

## Setup

To begin, check that the NVIDIA driver has been installed correctly. The `nvidia-smi` command should run and output information about the GPUs on your system:

In [None]:
!nvidia-smi

## Install XGBoost and Sklearn

Install XGBoost and scikit-learn inside the container with the following pip3 command:

In [None]:
# Install scikit-learn, XGBoost, CuPy, and the Vertex AI SDK
!pip3 install -U scikit-learn xgboost cupy-cuda115 google-cloud-aiplatform

## Train XGBoost model

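If you have a pre-trained XGBoost model, save it as `xgboost.model` and skip this step. Otherwise, this section trains an XGBoost model on random data. Because the labels are random, test accuracy should be close to 50%; the goal here is a deployable model, not predictive quality.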

In [1]:
# Import required libraries
import numpy
from numpy import loadtxt
from xgboost import XGBClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

import os
import signal
import subprocess

In [None]:
# Generate dummy data to perform binary classification
seed = 7
features = 9 # number of sample features
samples = 10000 # number of samples
X = numpy.random.rand(samples, features).astype('float32')
Y = numpy.random.randint(2, size=samples)

test_size = 0.33
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=test_size, random_state=seed)

In [None]:
model = XGBClassifier()
model.fit(X_train, y_train)

In [None]:
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print("Test Accuracy: {:.2f}%".format(accuracy * 100.0))

## Export, load and deploy XGBoost model in Triton Inference Server

For deploying the trained XGBoost model in Triton Inference Server, follow the steps below:

**1. Create a model repository and save xgboost model checkpoint:**

We need a model repository containing 11 model directories, `fil_0` through `fil_10`, one per ensemble member. Each directory is laid out as follows:

```
model_repository/
`-- fil_0
    |-- 1
    |   `-- xgboost.model
    `-- config.pbtxt
```

In [None]:
# Create one directory per ensemble member and save the trained model into each.
# os.makedirs is used instead of a shell loop because brace expansion ({0..10})
# is not available in every shell that the notebook may invoke.
# For more information on saving XGBoost models, see https://xgboost.readthedocs.io/en/latest/python/python_intro.html#training
# The model can also be dumped to JSON format.
for i in range(11):
    os.makedirs('./model_repository/fil_{}/1'.format(i), exist_ok=True)
    model.save_model('./model_repository/fil_{}/1/xgboost.model'.format(i))

**Note:**
The FIL backend's testing infrastructure includes a script for generating example models, putting them in the correct directory layout, and generating an associated config file. This can be helpful both for providing a template for your own models and for testing your Triton deployment. Please check this [link](https://github.com/triton-inference-server/fil_backend/blob/main/Example_Models.md) for the sample script.

**2. Create and save config.pbtxt**

To deploy the model in Triton Inference Server, we need to create a protobuf config file called `config.pbtxt` in each model directory (for example `model_repository/fil_0/`). It contains information about the model and the deployment. A sample config file is available [here](https://github.com/triton-inference-server/fil_backend#configuration).

The following parameters need to be updated to match your setup:

```
name: "fil_0"                            # Name of the model directory (fil_0 in this example)
backend: "fil"                           # Triton FIL backend for deploying forest models
max_batch_size: 8192
input [
 {
    name: "input__0"
    data_type: TYPE_FP32
    dims: [ 9 ]                          # Input feature dimensions, in our sample case it's 9
  }
]
output [
 {
    name: "output__0"
    data_type: TYPE_FP32
    dims: [ 1 ]                          # One class label per sample (output_class is true, predict_proba is false)
  }
]
instance_group [{ kind: KIND_GPU }]
parameters [
  {
    key: "model_type"
    value: { string_value: "xgboost" }
  },
  {
    key: "predict_proba"
    value: { string_value: "false" }
  },
  {
    key: "output_class"
    value: { string_value: "true" }
  },
  {
    key: "threshold"
    value: { string_value: "0.5" }
  },
  {
    key: "algo"
    value: { string_value: "ALGO_AUTO" }
  },
  {
    key: "storage_type"
    value: { string_value: "AUTO" }
  },
  {
    key: "blocks_per_sm"
    value: { string_value: "0" }
  }
]
```

Triton server reads this configuration file before loading each XGBoost model for inference and sets up the model according to the parameters in `config.pbtxt`. For more information on model configuration, refer to the [Triton model configuration documentation](https://github.com/triton-inference-server/server/blob/main/docs/model_configuration.md).

The following cell writes the config above to `./model_repository/fil_<i>/config.pbtxt` for each of the 11 models:

In [None]:
%%bash
# Writing config to file
for i in {0..10}
do cat > ./model_repository/fil_$i/config.pbtxt <<EOL 
name: "fil_$i"                           # Name of the model directory (fil_0 through fil_10)
backend: "fil"                           # Triton FIL backend for deploying forest models
max_batch_size: 8192
input [
 {
    name: "input__0"
    data_type: TYPE_FP32
    dims: [ 9 ]                          # Input feature dimensions, in our sample case it's 9
  }
]
output [
 {
    name: "output__0"
    data_type: TYPE_FP32
    dims: [ 1 ]                          # One class label per sample (output_class is true, predict_proba is false)
  }
]
instance_group [{ kind: KIND_GPU }]
parameters [
  {
    key: "model_type"
    value: { string_value: "xgboost" }
  },
  {
    key: "predict_proba"
    value: { string_value: "false" }
  },
  {
    key: "output_class"
    value: { string_value: "true" }
  },
  {
    key: "threshold"
    value: { string_value: "0.5" }
  },
  {
    key: "algo"
    value: { string_value: "ALGO_AUTO" }
  },
  {
    key: "storage_type"
    value: { string_value: "AUTO" }
  },
  {
    key: "blocks_per_sm"
    value: { string_value: "0" }
  }
]

EOL
done
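
The same files can also be generated from Python, which keeps the feature count and model count in one place. The following is a minimal sketch, not part of Triton or the FIL backend: the helper names `render_fil_config` and `write_fil_configs` are hypothetical, and the field values are copied from the config shown above.

```python
from pathlib import Path

# Parameter values copied from the config.pbtxt shown above.
FIL_PARAMETERS = {
    "model_type": "xgboost",
    "predict_proba": "false",
    "output_class": "true",
    "threshold": "0.5",
    "algo": "ALGO_AUTO",
    "storage_type": "AUTO",
    "blocks_per_sm": "0",
}


def render_fil_config(name, num_features, max_batch_size=8192):
    """Return the text of a config.pbtxt for one FIL model (hypothetical helper)."""
    lines = [
        'name: "%s"' % name,
        'backend: "fil"',
        "max_batch_size: %d" % max_batch_size,
        "input [",
        "  {",
        '    name: "input__0"',
        "    data_type: TYPE_FP32",
        "    dims: [ %d ]" % num_features,
        "  }",
        "]",
        "output [",
        "  {",
        '    name: "output__0"',
        "    data_type: TYPE_FP32",
        "    dims: [ 1 ]",
        "  }",
        "]",
        "instance_group [{ kind: KIND_GPU }]",
        "parameters [",
    ]
    entries = [
        '  {\n    key: "%s"\n    value: { string_value: "%s" }\n  }' % (key, value)
        for key, value in FIL_PARAMETERS.items()
    ]
    lines.append(",\n".join(entries))
    lines.append("]")
    return "\n".join(lines) + "\n"


def write_fil_configs(repo_root, num_models, num_features):
    """Write one config.pbtxt per model directory fil_0 ... fil_<num_models - 1>."""
    for i in range(num_models):
        name = "fil_{}".format(i)
        model_dir = Path(repo_root) / name
        model_dir.mkdir(parents=True, exist_ok=True)
        (model_dir / "config.pbtxt").write_text(render_fil_config(name, num_features))
```

Calling `write_fil_configs("./model_repository", 11, 9)` would produce the same 11 config files as the bash cell, without the inline comments.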

The model repository should now contain 11 model directories, `fil_0` through `fil_10`, each laid out as shown below. These 11 XGBoost models are ensembled by a `bls_async` model, which must also be present in the model repository:

```
model_repository/
`-- fil_0
    |-- 1
    |   `-- xgboost.model
    `-- config.pbtxt
```
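
Before starting the server, it can help to confirm that every model directory has both required files. The following is a small sketch under the layout shown above; the helper name `missing_repository_files` is hypothetical and not part of Triton.

```python
from pathlib import Path


def missing_repository_files(repo_root, num_models=11):
    """Return the expected FIL model files that are absent from the repository."""
    missing = []
    for i in range(num_models):
        model_dir = Path(repo_root) / "fil_{}".format(i)
        expected_files = (
            model_dir / "config.pbtxt",
            model_dir / "1" / "xgboost.model",
        )
        for expected in expected_files:
            if not expected.is_file():
                missing.append(str(expected))
    return missing
```

An empty list from `missing_repository_files("./model_repository")` means all 11 FIL model directories are complete. The check does not cover the `bls_async` model.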

To learn more about asynchronous BLS requests in Triton, see the [BLS example README](https://github.com/triton-inference-server/python_backend/blob/main/examples/bls/README.md#asynchronous-bls-requests).
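
This notebook does not state how `bls_async` combines the 11 member outputs. One plausible rule, assumed here purely for illustration, is a majority vote over the class labels that each FIL model returns (each model emits one label per sample because `output_class` is true). The sketch below shows only that combination step in NumPy; it is not the BLS model itself, and `majority_vote` is a hypothetical name.

```python
import numpy as np


def majority_vote(member_outputs):
    """Combine binary class labels from several models by majority vote.

    member_outputs: sequence of arrays, each of shape (batch, 1), holding 0.0 or 1.0.
    Returns a float32 array of shape (batch, 1) with the winning label per sample.
    """
    stacked = np.stack(member_outputs, axis=0)   # shape: (n_models, batch, 1)
    votes_for_one = stacked.sum(axis=0)          # number of models predicting class 1
    n_models = stacked.shape[0]
    # Class 1 wins when strictly more than half of the models vote for it.
    return (votes_for_one * 2 > n_models).astype(np.float32)
```

With an odd number of members, such as the 11 used here, a binary vote cannot tie.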

**3. Deploy the model in Triton Inference Server**

Finally, we can deploy the XGBoost models in Triton Inference Server using the following command:

In [24]:
# Run the Triton Inference Server in a Subprocess from Jupyter notebook

triton_process = subprocess.Popen(["tritonserver", "--model-repository=./model_repository"], stdout=subprocess.PIPE, preexec_fn=os.setsid) 

I0216 20:28:28.824433 9336 metrics.cc:298] Collecting metrics for GPU 0: Tesla T4
I0216 20:28:29.245799 9336 libtorch.cc:1227] TRITONBACKEND_Initialize: pytorch
I0216 20:28:29.245821 9336 libtorch.cc:1237] Triton TRITONBACKEND API version: 1.7
I0216 20:28:29.245842 9336 libtorch.cc:1243] 'pytorch' TRITONBACKEND API version: 1.7
2022-02-16 20:28:29.410012: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
2022-02-16 20:28:29.447749: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
I0216 20:28:29.447820 9336 tensorflow.cc:2176] TRITONBACKEND_Initialize: tensorflow
I0216 20:28:29.447839 9336 tensorflow.cc:2186] Triton TRITONBACKEND API version: 1.7
I0216 20:28:29.447844 9336 tensorflow.cc:2192] 'tensorflow' TRITONBACKEND API version: 1.7
I0216 20:28:29.447848 9336 tensorflow.cc:2216] backend configuration:
{}
I0216 20:28:29.453320 9336 onnxruntime.cc:223

The above command should load the models and log a line such as `successfully loaded 'fil_0' version 1` for each one. Triton server listens on the following endpoints:

```
Port 8000    -> HTTP Service
Port 8001    -> GRPC Service
Port 8002    -> Metrics
```

We can check that the server is ready by running `curl -v <IP of machine>:8000/v2/health/ready`, which should return `HTTP/1.1 200 OK`.

**Note:** In our case, Triton Server and this notebook run on the same machine, so the IP is `localhost`.

In [4]:
!curl -v localhost:8000/v2/health/ready

*   Trying 127.0.0.1:8000...
* TCP_NODELAY set
* Connected to localhost (127.0.0.1) port 8000 (#0)
> GET /v2/health/ready HTTP/1.1
> Host: localhost:8000
> User-Agent: curl/7.68.0
> Accept: */*
> 
* Mark bundle as not supporting multiuse
< HTTP/1.1 200 OK
< Content-Length: 0
< Content-Type: text/plain
< 
* Connection #0 to host localhost left intact
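
Because the server is started in a background subprocess, the readiness endpoint may not respond immediately. The sketch below polls the same `/v2/health/ready` URL from Python until it returns 200. The helper names `is_ready` and `wait_until_ready`, the attempt count, and the delay are assumptions chosen for illustration.

```python
import time
import urllib.error
import urllib.request


def is_ready(url="http://localhost:8000/v2/health/ready", timeout_s=2.0):
    """Return True if the readiness endpoint answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or a non-2xx status: not ready yet.
        return False


def wait_until_ready(probe=is_ready, attempts=30, delay_s=1.0):
    """Call probe() up to `attempts` times, sleeping between failed tries."""
    for _ in range(attempts):
        if probe():
            return True
        time.sleep(delay_s)
    return False
```

Running `wait_until_ready()` right after launching Triton blocks for at most about 30 seconds plus the per-request timeouts, and returns `False` if the server never becomes ready.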


In [5]:
# Install nvidia-pyindex
!pip3 install nvidia-pyindex

Collecting nvidia-pyindex
  Downloading nvidia-pyindex-1.0.9.tar.gz (10 kB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: nvidia-pyindex
  Building wheel for nvidia-pyindex (setup.py) ... done
  Created wheel for nvidia-pyindex: filename=nvidia_pyindex-1.0.9-py3-none-any.whl size=8416 sha256=0042995cf2b16f039b45625917ae738d72bf6c817a7bfb981be66c8658900498
  Stored in directory: /root/.cache/pip/wheels/e0/c2/fb/5cf4e1cfaf28007238362cb746fb38fc2dd76348331a748d54
Successfully built nvidia-pyindex
Installing collected packages: nvidia-pyindex
Successfully installed nvidia-pyindex-1.0.9
You should consider upgrading via the '/usr/bin/python3 -m pip install --upgrade pip' command.


In [6]:
# Install the Triton client with both HTTP and gRPC support (the gRPC client is used below)
!pip3 install tritonclient[all]

Looking in indexes: https://pypi.org/simple, https://pypi.ngc.nvidia.com
Collecting tritonclient[http]
  Downloading tritonclient-2.18.0-py3-none-manylinux1_x86_64.whl (7.8 MB)
     |████████████████████████████████| 7.8 MB 6.4 MB/s            
Collecting python-rapidjson>=0.9.1
  Downloading python_rapidjson-1.5-cp38-cp38-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (1.5 MB)
     |████████████████████████████████| 1.5 MB 32.0 MB/s            
Collecting geventhttpclient>=1.4.4
  Downloading geventhttpclient-1.5.3-cp38-cp38-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_12_x86_64.manylinux2010_x86_64.whl (77 kB)
     |████████████████████████████████| 77 kB 32.2 MB/s            
Collecting gevent>=0.13
  Downloading gevent-21.12.0-cp38-cp38-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (6.5 MB)
     |████████████████████████████████| 6.5 MB 30.2 MB/s            
Collecting brotli
  Downloading Brotli-1.0.9-cp38-cp38-manylinux1_x

## Triton Client

Once the models are deployed in Triton, we can test inference by sending requests from the Triton client and inspecting the responses. For more information on installing the client, see the [Triton Client GitHub repository](https://github.com/triton-inference-server/client).

In [None]:
# Check client library can be imported
import numpy
import tritonclient.http as triton_http

In [16]:
features = 9 # number of sample features
samples = 8192 # number of samples
X = numpy.random.rand(samples, features).astype('float32')
with open('dtest.npy', 'wb') as f:
    numpy.save(f, X)

In [29]:
%%time
import numpy as np
import tritonclient.grpc as triton_grpc

# Load the test batch saved in the previous cell
with open('dtest.npy', 'rb') as f:
    dtest = np.load(f)

# BLS model that ensembles fil_0 ... fil_10; it must be present in the model repository
model_name = "bls_async"

with triton_grpc.InferenceServerClient("localhost:8001") as client:
    # Send the same batch 99 times so the cell timing covers repeated requests
    for n in range(1, 100):
        triton_input = triton_grpc.InferInput('input__0',
                                              (dtest.shape[0], dtest.shape[1]),
                                              'FP32'
                                             )

        triton_input.set_data_from_numpy(dtest)

        triton_output = triton_grpc.InferRequestedOutput('output__0')

        response = client.infer(model_name,
                                inputs=[triton_input],
                                request_id=str(n),
                                outputs=[triton_output]
                               )

    # Keep the predictions from the last response as a NumPy array
    output0_data = response.as_numpy("output__0")

CPU times: user 78.5 ms, sys: 41.2 ms, total: 120 ms
Wall time: 1.32 s
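
The wall time above can be turned into a rough throughput figure. The loop issues 99 requests (`range(1, 100)`) of 8192 samples each, so the sketch below divides the total sample count by the reported 1.32 s. The helper name `summarize_timing` is hypothetical. The wall time also includes loading `dtest.npy` and opening the gRPC connection, so treat the result as a conservative estimate rather than a benchmark.

```python
def summarize_timing(num_requests, batch_size, wall_time_s):
    """Return (samples per second, mean seconds per request) for a timed loop."""
    samples_per_s = num_requests * batch_size / wall_time_s
    mean_request_s = wall_time_s / num_requests
    return samples_per_s, mean_request_s


# Numbers taken from the run above: 99 requests of 8192 samples in 1.32 s.
throughput, latency = summarize_timing(num_requests=99, batch_size=8192, wall_time_s=1.32)
print("{:.0f} samples/s, {:.1f} ms per request".format(throughput, latency * 1000))
```

For this run that works out to roughly 614,000 samples per second and about 13 ms per request.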


To deploy the model above to GCP Vertex AI Prediction, follow the [Triton Vertex notebook](https://github.com/NVIDIA/nvidia-gcp-samples/blob/master/vertex-ai-samples/prediction/triton_inference.ipynb).

In [23]:
# Stop Triton Server before proceeding further
os.killpg(os.getpgid(triton_process.pid), signal.SIGTERM)  # Send SIGTERM to every process in Triton's process group

I0216 20:28:19.448369 6686 server.cc:249] Waiting for in-flight requests to complete.
I0216 20:28:19.448418 6686 model_repository_manager.cc:1026] unloading: fil_9:1
I0216 20:28:19.448620 6686 model_repository_manager.cc:1026] unloading: fil_7:1
I0216 20:28:19.448806 6686 model_repository_manager.cc:1026] unloading: fil_6:1
I0216 20:28:19.448979 6686 instance_finalize.hpp:36] TRITONBACKEND_ModelInstanceFinalize: delete instance state
I0216 20:28:19.449061 6686 model_repository_manager.cc:1026] unloading: fil_5:1
I0216 20:28:19.449156 6686 instance_finalize.hpp:36] TRITONBACKEND_ModelInstanceFinalize: delete instance state
I0216 20:28:19.449304 6686 model_repository_manager.cc:1026] unloading: fil_4:1
I0216 20:28:19.449408 6686 instance_finalize.hpp:36] TRITONBACKEND_ModelInstanceFinalize: delete instance state
I0216 20:28:19.449465 6686 model_finalize.hpp:36] TRITONBACKEND_ModelFinalize: delete model state
I0216 20:28:19.449474 6686 model_finalize.hpp:36] TRITONBACKEND_ModelFinalize: d
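
`SIGTERM` only requests shutdown; as the log shows, Triton then waits for in-flight requests and unloads models. If a later cell needs the ports to be free, it can help to wait for the process to exit and escalate if it does not. The sketch below is one way to do that; the helper name `stop_process_group` and the 30-second default are assumptions, and it expects a process started in its own session, as `triton_process` is.

```python
import os
import signal
import subprocess


def stop_process_group(proc, timeout_s=30.0):
    """Send SIGTERM to proc's process group, wait, and fall back to SIGKILL.

    Returns the exit code of proc (negative signal number if killed by a signal).
    """
    pgid = os.getpgid(proc.pid)
    os.killpg(pgid, signal.SIGTERM)
    try:
        return proc.wait(timeout=timeout_s)
    except subprocess.TimeoutExpired:
        # The process ignored or is still handling SIGTERM: force termination.
        os.killpg(pgid, signal.SIGKILL)
        return proc.wait()
```

In this notebook the call would be `stop_process_group(triton_process)`.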

## Conclusion

The Triton FIL backend can be used to deploy tree-based models trained in frameworks such as XGBoost, LightGBM, Scikit-Learn, and cuML for fast GPU-based inference. Tree-based models can therefore be deployed alongside deep learning models in Triton Inference Server. The Model Analyzer utility can additionally be used to profile the models and find a deployment configuration that satisfies the deployment constraints. The model can then be deployed in Triton using that configuration, and the Triton client can be used to send inference requests.