# High-performance serving with Triton Inference Server (Preview)
Learn how to use [NVIDIA Triton Inference Server](https://aka.ms/nvidia-triton-docs) in Azure Machine Learning with [managed online endpoints](concept-endpoints.md#managed-online-endpoints).

Triton is multi-framework, open-source software that is optimized for inference. It supports popular machine learning frameworks like TensorFlow, ONNX Runtime, PyTorch, NVIDIA TensorRT, and more. It can be used for your CPU or GPU workloads.

In this article, you deploy Triton and a model to a managed online endpoint using the Azure Machine Learning (AML) Python SDK v2 (Preview). This feature is currently in private preview. The preview version is provided without a service-level agreement and isn't recommended for production workloads. Certain features might not be supported or might have constrained capabilities. For more information, see [Supplemental Terms of Use for Microsoft Azure Previews](https://azure.microsoft.com/en-us/support/legal/preview-supplemental-terms/).

## Prerequisites

* To use Azure Machine Learning, you must have an Azure subscription. If you don't have an Azure subscription, create a free account before you begin. Try the [free or paid version of Azure Machine Learning](https://azure.microsoft.com/free/).

* Install and configure the [Python SDK v2](sdk/setup.sh).

* You must have an Azure resource group, and you (or the service principal you use) must have Contributor access to it.

* You must have an Azure Machine Learning workspace. 

* To deploy locally, you must install Docker Engine on your local computer. We highly recommend this option, so it's easier to debug issues.

## 1. Connect to Azure Machine Learning Workspace

The [workspace](https://docs.microsoft.com/en-us/azure/machine-learning/concept-workspace) is the top-level resource for Azure Machine Learning, providing a centralized place to work with all the artifacts you create when you use Azure Machine Learning. In this section, we connect to the workspace in which the endpoint will be deployed.

### 1.1. Import the required libraries

In [None]:
# import required libraries
from azure.ml import MLClient
from azure.ml.entities import ManagedOnlineEndpoint, ManagedOnlineDeployment
from azure.identity import InteractiveBrowserCredential

### 1.2. Configure workspace details and get a handle to the workspace

To connect to a workspace, we need three identifiers: a subscription ID, a resource group, and a workspace name. We pass these to `MLClient` from `azure.ml` to get a handle to the required Azure Machine Learning workspace. This tutorial uses [interactive browser authentication](https://docs.microsoft.com/python/api/azure-identity/azure.identity.interactivebrowsercredential?view=azure-python). For other credential types, see the [`azure.identity` reference](https://docs.microsoft.com/python/api/azure-identity/azure.identity?view=azure-python).

In [None]:
# enter details of your AML workspace
subscription_id = '<SUBSCRIPTION_ID>'
resource_group = '<RESOURCE_GROUP>'
workspace = '<AML_WORKSPACE_NAME>'
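Hard-coding identifiers in a notebook makes it easy to commit them by accident. As an alternative, the following is a minimal sketch that reads the same three values from environment variables. The variable names (`AZURE_SUBSCRIPTION_ID`, `AZURE_RESOURCE_GROUP`, `AZUREML_WORKSPACE_NAME`) and the helper `load_workspace_config` are assumptions made for illustration; the SDK does not require them.

```python
import os
from typing import Dict, Mapping, Optional

# Hypothetical environment variable names, chosen for this sketch only.
_ENV_VARS = {
    "subscription_id": "AZURE_SUBSCRIPTION_ID",
    "resource_group": "AZURE_RESOURCE_GROUP",
    "workspace": "AZUREML_WORKSPACE_NAME",
}


def load_workspace_config(env: Optional[Mapping[str, str]] = None) -> Dict[str, str]:
    """Return the workspace identifiers, raising if any is missing or empty."""
    env = os.environ if env is None else env
    config = {key: env.get(var, "") for key, var in _ENV_VARS.items()}
    missing = [_ENV_VARS[key] for key, value in config.items() if not value]
    if missing:
        raise KeyError("Missing environment variables: " + ", ".join(missing))
    return config
```

The returned dictionary uses the same names as the variables in the cell above, so its values can be assigned to `subscription_id`, `resource_group`, and `workspace` directly.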

In [None]:
# get a handle to the workspace
ml_client = MLClient(
    InteractiveBrowserCredential(), subscription_id, resource_group, workspace
)

## 2. Deploy the online endpoint and online deployment

For Triton no-code deployment, [testing via local endpoints](../managed/online-endpoints-simple-deployment.ipynb) is currently not supported.

### 2.1 Set a base path variable
To avoid typing in a path for multiple commands, set a `base_path` variable. This variable points to the directory where the model and associated YAML configuration files are located:

In [None]:
base_path = "sdk/endpoints/triton/single_model"
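Because every later step builds file paths from `base_path`, it can help to confirm early that the directory contains the files this article relies on. The following is a minimal sketch using only the standard library; the helper name `missing_files` is hypothetical, and the two file names are the YAML files referenced in the later sections.

```python
import os
from typing import List, Sequence

# File names used by the endpoint and deployment steps of this article.
EXPECTED_FILES = (
    "create-managed-endpoint.yaml",
    "create-managed-deployment.yaml",
)


def missing_files(base_path: str, expected: Sequence[str] = EXPECTED_FILES) -> List[str]:
    """Return the expected file names that do not exist under base_path."""
    return [
        name
        for name in expected
        if not os.path.isfile(os.path.join(base_path, name))
    ]
```

An empty list from `missing_files(base_path)` means both configuration files were found; a non-empty list usually indicates that the notebook is running from a different working directory than expected.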

### 2.2 Assign a unique endpoint name
Endpoint names must be unique within an Azure region. Naming rules are defined under [managed online endpoint limits](https://docs.microsoft.com/azure/machine-learning/how-to-manage-quotas#azure-machine-learning-managed-online-endpoints-preview).

In [None]:
import datetime

endpoint_name = 'single-endpt-' + datetime.datetime.now().strftime('%m%d%H%M%f')
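The timestamp suffix above adds 14 characters to the 13-character prefix, so the generated name is 27 characters long. The following sketch wraps the same logic in a function and checks the result against a naming rule. The rule encoded here (starts with a letter, then letters, digits, or hyphens, 3 to 32 characters in total) is an assumption; confirm it against the limits page linked above. The helper name `make_endpoint_name` is hypothetical.

```python
import datetime
import re
from typing import Optional

# Assumed naming rule; verify against the managed online endpoint limits page.
_NAME_PATTERN = re.compile(r"^[A-Za-z][A-Za-z0-9-]{2,31}$")


def make_endpoint_name(
    prefix: str = "single-endpt-",
    now: Optional[datetime.datetime] = None,
) -> str:
    """Build a timestamped endpoint name and validate it against the assumed rule."""
    now = datetime.datetime.now() if now is None else now
    # Month, day, hour, minute, then microseconds (6 digits): 14 characters.
    name = prefix + now.strftime("%m%d%H%M%f")
    if not _NAME_PATTERN.match(name):
        raise ValueError(f"Invalid endpoint name: {name!r}")
    return name
```

Passing an explicit `now` makes the function deterministic, which is convenient in tests; in the notebook, calling `make_endpoint_name()` with no arguments matches the cell above.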

### 2.3 Install required dependencies

In [None]:
!pip install numpy
!pip install tritonclient[http]
!pip install pillow
!pip install gevent

### 2.4 Create a YAML configuration file for the endpoint
Online endpoints require the following configuration details. See the [CLI (v2) Online Endpoint YAML Schema](https://docs.microsoft.com/en-us/azure/machine-learning/reference-yaml-endpoint-online) for more details.

`name`: The name of the endpoint. It must be unique in the Azure region. Naming rules are defined under [managed online endpoint limits](https://docs.microsoft.com/azure/machine-learning/how-to-manage-quotas#azure-machine-learning-managed-online-endpoints-preview).

`auth_mode`: Use `key` for key-based authentication. Use `aml_token` for Azure Machine Learning token-based authentication. A `key` does not expire, but an `aml_token` does.

The YAML file we will be using is located at `sdk/endpoints/triton/single_model/create-managed-endpoint.yaml` and has the following structure:
```YAML
$schema: https://azuremlschemas.azureedge.net/latest/managedOnlineEndpoint.schema.json
name: my-endpoint
auth_mode: aml_token
```

### 2.5 Create a new endpoint using the YAML configuration

The AML Python SDK v2 allows the configuration of online endpoints and other entities through YAML files via the `load` method as below or by passing arguments to the constructor. See [Deploy and score a machine learning model by using an online endpoint](sdk/endpoints/online/online-endpoints-simple-deployment.ipynb) for more details. 

Before deploying the endpoint to Azure via the `ManagedOnlineEndpoint` object's `begin_create_or_update` method, we first set the `name` attribute to the unique value generated above. 

In [None]:
import os.path

yaml_path = os.path.join(base_path, 'create-managed-endpoint.yaml')
endpoint = ManagedOnlineEndpoint.load(yaml_path)
endpoint.name = endpoint_name
endpoint = ml_client.online_endpoints.begin_create_or_update(endpoint)

### 2.6 Create a YAML configuration file for the deployment

The following example configures a deployment named **blue** for the endpoint created in the previous step. The YAML file used in the following cells is located at `sdk/endpoints/triton/single_model/create-managed-deployment.yaml`.

For Triton no-code deployment (NCD) to work, the model's `type` must be set to `triton_model` (`type: triton_model`). For more information, see [CLI (v2) model YAML schema](https://docs.microsoft.com/en-us/azure/machine-learning/reference-yaml-model).

This deployment uses a Standard_NC6s_v3 VM. You may need to request a quota increase for your subscription before you can use this VM. For more information, see [NCv3-series](https://docs.microsoft.com/en-us/azure/virtual-machines/ncv3-series).


```YAML
  $schema: https://azuremlschemas.azureedge.net/latest/managedOnlineDeployment.schema.json
  name: blue
  endpoint_name: my-endpoint
  model:
    name: sample-densenet-onnx-model
    version: 1
    path: ./models
    type: triton_model
  instance_count: 1
  instance_type: Standard_NC6s_v3
```
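Since a missing or misspelled `type` is an easy mistake to make, a quick check before deploying can save a failed deployment. The following is a minimal sketch that inspects a plain dictionary mirroring the YAML above. The helper name `check_triton_deployment` is hypothetical, and the keys it checks are simply those present in the example, not an authoritative schema.

```python
from typing import Any, Dict, List

# Keys that appear in the example deployment YAML (not an official schema).
_EXAMPLE_KEYS = ("name", "endpoint_name", "model", "instance_count", "instance_type")


def check_triton_deployment(config: Dict[str, Any]) -> List[str]:
    """Return a list of problems found in a deployment configuration dict."""
    problems = [f"missing key: {key}" for key in _EXAMPLE_KEYS if key not in config]
    model = config.get("model")
    if not isinstance(model, dict):
        problems.append("model must be an inline mapping for this check")
    elif model.get("type") != "triton_model":
        problems.append("model.type must be 'triton_model' for no-code deployment")
    return problems


# A dictionary with the same content as the YAML file shown above.
deployment_config = {
    "name": "blue",
    "endpoint_name": "my-endpoint",
    "model": {
        "name": "sample-densenet-onnx-model",
        "version": 1,
        "path": "./models",
        "type": "triton_model",
    },
    "instance_count": 1,
    "instance_type": "Standard_NC6s_v3",
}
```

Calling `check_triton_deployment(deployment_config)` returns an empty list for the example configuration; removing the `type` entry produces a message that points at the missing setting.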

### 2.7 Create the online deployment 

As with the online endpoint, after instantiating the `ManagedOnlineDeployment` class from the YAML file, we set its `endpoint_name` attribute to the unique endpoint name generated above.

In [None]:
yaml_path = os.path.join(base_path, 'create-managed-deployment.yaml')
deployment = ManagedOnlineDeployment.load(yaml_path)
deployment.endpoint_name = endpoint_name
deployment = ml_client.begin_create_or_update(deployment)

## 3. Invoke your endpoint

The file `/sdk/endpoints/online/triton/single_model/triton_densenet_scoring.py` is used for scoring. The image passed to the endpoint needs pre-processing to meet the model's size, type, and format requirements, and the response needs post-processing to show the predicted label. The script uses the `tritonclient.http` library to communicate with the Triton Inference Server.
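For a classifier, post-processing usually means reducing the model's output vector to the single most likely class and looking up its label. The following is a minimal sketch of that idea using only the standard library. It is not the code in `triton_densenet_scoring.py`; the function name `top_label` is hypothetical.

```python
from typing import Sequence, Tuple


def top_label(scores: Sequence[float], labels: Sequence[str]) -> Tuple[int, str]:
    """Return the index and label of the highest-scoring class."""
    if not scores:
        raise ValueError("scores must not be empty")
    if len(scores) != len(labels):
        raise ValueError("scores and labels must have the same length")
    # Index of the maximum score (the first one wins on ties).
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, labels[best]
```

With real model output, `scores` would be the flattened output tensor returned by the server and `labels` the class names shipped alongside the model.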

### 3.1 Get the endpoint scoring URI

In [None]:
scoring_uri = endpoint.scoring_uri
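The `tritonclient.http` client expects a `host[:port]` string rather than a full URL, so a scoring script generally has to strip the scheme from the scoring URI. The following is a minimal sketch of that step. The helper name `triton_url_parts` is hypothetical, and the exact shape of the scoring URI returned by the service is not assumed beyond it being an `http` or `https` URL.

```python
from typing import Tuple
from urllib.parse import urlparse


def triton_url_parts(scoring_uri: str) -> Tuple[str, bool]:
    """Split a scoring URI into (host[:port], use_ssl)."""
    parsed = urlparse(scoring_uri)
    if parsed.scheme not in ("http", "https") or not parsed.netloc:
        raise ValueError(f"Not an http(s) URL: {scoring_uri!r}")
    return parsed.netloc, parsed.scheme == "https"
```

The two return values map onto the `url` and `ssl` arguments of `tritonclient.http.InferenceServerClient`.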

### 3.2 Get an authentication token

In [None]:
auth_token = ml_client.online_endpoints.list_keys(endpoint_name).access_token
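Managed online endpoints expect the key or token in an `Authorization: Bearer` header on every request. The following is a minimal sketch of building that header; the helper name `bearer_headers` is hypothetical.

```python
from typing import Dict


def bearer_headers(token: str) -> Dict[str, str]:
    """Build the HTTP headers that carry an endpoint key or token."""
    token = token.strip()
    if not token:
        raise ValueError("token must not be empty")
    return {"Authorization": f"Bearer {token}"}
```

The resulting dictionary can be passed as the `headers` argument that the `tritonclient.http` client methods accept.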

### 3.3 Score data with the endpoint

The script `triton_densenet_scoring.py` submits the image of a peacock to the endpoint.

In [None]:
from single_model import triton_densenet_scoring

triton_densenet_scoring.score(scoring_uri, auth_token)

## 4. Delete your endpoint and model

In [None]:
model_name = deployment.model.name
model_version = deployment.model.version
ml_client.online_endpoints.begin_delete(endpoint_name)

In [None]:
ml_client.models.archive(model_name, model_version)