# Deploy to Triton Inference Server on AKS

description: (preview) deploy a Bi-Directional Attention Flow (BiDAF) question-answering model to V100 GPUs on AKS via Triton

Please note that this Public Preview release is subject to the [Supplemental Terms of Use for Microsoft Azure Previews](https://azure.microsoft.com/support/legal/preview-supplemental-terms/).

In [1]:
!pip install nvidia-pyindex
!pip install --upgrade tritonclient
!pip install azureml_core-1.19.0a1-py3-none-any.whl


Looking in indexes: https://pypi.org/simple, https://pypi.ngc.nvidia.com
Looking in indexes: https://pypi.org/simple, https://pypi.ngc.nvidia.com
Requirement already up-to-date: tritonclient in /home/gopalv/miniconda3/envs/azureml/lib/python3.7/site-packages (2.4.0)
Looking in indexes: https://pypi.org/simple, https://pypi.ngc.nvidia.com


In [2]:
from azureml.core import Workspace

ws = Workspace.from_config()
ws

Failure while loading azureml_run_type_providers. Failed to load entrypoint hyperdrive = azureml.train.hyperdrive:HyperDriveRun._from_run_dto with exception (azureml-core 1.19.0a1 (/home/gopalv/miniconda3/envs/azureml/lib/python3.7/site-packages), Requirement.parse('azureml-core~=1.15.0')).
Failure while loading azureml_run_type_providers. Failed to load entrypoint automl = azureml.train.automl.run:AutoMLRun._from_run_dto with exception (azureml-core 1.19.0a1 (/home/gopalv/miniconda3/envs/azureml/lib/python3.7/site-packages), Requirement.parse('azureml-core~=1.15.0')).
Failure while loading azureml_run_type_providers. Failed to load entrypoint azureml.PipelineRun = azureml.pipeline.core.run:PipelineRun._from_dto with exception (azureml-core 1.19.0a1 (/home/gopalv/miniconda3/envs/azureml/lib/python3.7/site-packages), Requirement.parse('azureml-core~=1.15.0')).
Failure while loading azureml_run_type_providers. Failed to load entrypoint azureml.ReusedStepRun = azureml.pipeline.core.run:St

Workspace.create(name='Inference-PM-AML-Workspace', subscription_id='92c76a2f-0e1c-4216-b65e-abf7a3f34c1e', resource_group='Inference-PM')

## Download model

Triton Inference Server can only load models that are stored in the directory structure it expects, so keep the downloaded layout intact. [Read more about the directory structure that Triton expects](https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/model_repository.html).

In [3]:
from src.model_utils import download_triton_models, delete_triton_models
from pathlib import Path

prefix = Path(".")
download_triton_models(prefix)

successfully downloaded model: densenet_onnx
successfully downloaded model: bidaf-9
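
To check the layout before registering, a small helper can list what sits under each model directory. This is a minimal sketch using only the standard library; `list_model_repository` is a hypothetical helper, not part of `src.model_utils`, and the file names in the comment (`config.pbtxt`, `1/model.onnx`) are assumptions based on Triton's documented layout for ONNX models rather than a listing of what the download actually produces.

```python
from pathlib import Path


def list_model_repository(root):
    """Map each model directory under root to its files (paths relative to the model)."""
    root = Path(root)
    layout = {}
    for model_dir in sorted(p for p in root.iterdir() if p.is_dir()):
        layout[model_dir.name] = sorted(
            f.relative_to(model_dir).as_posix()
            for f in model_dir.rglob("*")
            if f.is_file()
        )
    return layout


# Example call for this notebook (assumes the models were downloaded under ./models/triton):
#   list_model_repository(Path(".").joinpath("models", "triton"))
# Triton expects each model to look roughly like:
#   <model-name>/config.pbtxt      (optional for some backends)
#   <model-name>/1/model.onnx      (one numbered directory per version)
```

The numbered subdirectory is the model version, which is why the inference call later in this notebook passes `model_version="1"`.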


## Register model

In [4]:
from azureml.core.model import Model

model_path = prefix.joinpath("models", "triton")

model = Model.register(
    model_path=model_path,
    model_name="bidaf-9-tutorial",
    tags={"area": "Natural language processing", "type": "Question-answering"},
    description="Question answering from ONNX model zoo",
    workspace=ws,
    model_framework=Model.Framework.MULTI,
)

model

Registering model bidaf-9-tutorial


Model(workspace=Workspace.create(name='Inference-PM-AML-Workspace', subscription_id='92c76a2f-0e1c-4216-b65e-abf7a3f34c1e', resource_group='Inference-PM'), name=bidaf-9-tutorial, id=bidaf-9-tutorial:4, version=4, tags={'area': 'Natural language processing', 'type': 'Question-answering'}, properties={})

## Deploy webservice

Deploy to a pre-created [AksCompute](https://docs.microsoft.com/python/api/azureml-core/azureml.core.compute.aks.akscompute?view=azure-ml-py#provisioning-configuration-agent-count-none--vm-size-none--ssl-cname-none--ssl-cert-pem-file-none--ssl-key-pem-file-none--location-none--vnet-resourcegroup-name-none--vnet-name-none--subnet-name-none--service-cidr-none--dns-service-ip-none--docker-bridge-cidr-none--cluster-purpose-none--load-balancer-type-none-) named `aks-gpu-deploy`. For other options, see [our documentation](https://docs.microsoft.com/azure/machine-learning/how-to-deploy-and-where?tabs=azcli).


In [9]:
from azureml.core.webservice import AksWebservice
from azureml.core.model import InferenceConfig
from random import randint

service_name = "triton-bidaf-9" + str(randint(10000, 99999))

config = AksWebservice.deploy_configuration(
    compute_target_name="aks-gpu-deploy",
    gpu_cores=1,
    cpu_cores=1,
    memory_gb=4,
    auth_enabled=True,
)

service = Model.deploy(
    workspace=ws,
    name=service_name,
    models=[model],
    deployment_config=config,
    overwrite=True,
)

service.wait_for_deployment(show_output=True)

Tips: You can try get_logs(): https://aka.ms/debugimage#dockerlog or local deployment: https://aka.ms/debugimage#debug-locally to debug if deployment takes longer than 10 minutes.
Running......
Succeeded
AKS service creation operation finished, operation "Succeeded"


## Test the webservice

In [34]:
!pip install --upgrade nltk geventhttpclient python-rapidjson

Looking in indexes: https://pypi.org/simple, https://pypi.ngc.nvidia.com
Requirement already up-to-date: nltk in /home/gopalv/miniconda3/envs/azureml/lib/python3.7/site-packages (3.5)
Requirement already up-to-date: geventhttpclient in /home/gopalv/miniconda3/envs/azureml/lib/python3.7/site-packages (1.4.4)
Requirement already up-to-date: python-rapidjson in /home/gopalv/miniconda3/envs/azureml/lib/python3.7/site-packages (0.9.4)


In [47]:
service_key = service.get_keys()[0]
scoring_uri = service.scoring_uri



In [51]:
!curl -v $scoring_uri/v2/health/ready -H 'Authorization: Bearer '"$service_key"''

*   Trying 52.166.71.20...
* TCP_NODELAY set
* Connected to 52.166.71.20 (52.166.71.20) port 80 (#0)
* Connection #0 to host 52.166.71.20 left intact
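
The same readiness probe can be issued from Python instead of `curl`. The sketch below uses only the standard library and mirrors the command above: it appends `/v2/health/ready` to the scoring URI and sends the service key as a bearer token. `build_ready_request` and `is_ready` are hypothetical helper names introduced here for illustration.

```python
import urllib.error
import urllib.request


def build_ready_request(scoring_uri, key):
    """Build a GET request for the readiness endpoint, authenticated with the service key."""
    url = scoring_uri.rstrip("/") + "/v2/health/ready"
    return urllib.request.Request(
        url, headers={"Authorization": f"Bearer {key}"}, method="GET"
    )


def is_ready(scoring_uri, key, timeout=10):
    """Return True if the readiness probe answers with HTTP 200."""
    request = build_ready_request(scoring_uri, key)
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status == 200
    except urllib.error.HTTPError:
        # urlopen raises HTTPError for error statuses (for example 401 or 503)
        return False
```

In this notebook the call would be `is_ready(scoring_uri, service_key)`. Connection failures surface as `urllib.error.URLError`, which this sketch deliberately does not swallow.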


In [11]:
import json

# Using a modified version of tritonhttpclient for Preview, PR is out for review
# https://github.com/triton-inference-server/server/pull/2047
import tritonclient.http as tritonhttpclient
from tritonclient.utils import triton_to_np_dtype

from src.bidaf_utils import preprocess, postprocess

key = service.get_keys()[0]
headers = {"Authorization": f"Bearer {key}"}

# InferenceServerClient expects the URL without a scheme, so strip the leading "http://"
triton_client = tritonhttpclient.InferenceServerClient(service.scoring_uri[7:])

context = "A quick brown fox jumped over the lazy dog."
query = "Which animal was lower?"

model_name = "bidaf-9"

model_metadata = triton_client.get_model_metadata(
    model_name=model_name, headers=headers
)

input_meta = model_metadata["inputs"]
output_meta = model_metadata["outputs"]

# String inputs (Triton datatype BYTES) map to the NumPy object dtype
np_dtype = triton_to_np_dtype(input_meta[0]["datatype"])
cw, cc = preprocess(context, np_dtype)
qw, qc = preprocess(query, np_dtype)

input_mapping = {
    "query_word": qw,
    "query_char": qc,
    "context_word": cw,
    "context_char": cc,
}

inputs = []
outputs = []

# Populate the inputs array
for in_meta in input_meta:
    input_name = in_meta["name"]
    data = input_mapping[input_name]

    # Named infer_input to avoid shadowing the built-in input()
    infer_input = tritonhttpclient.InferInput(
        input_name, data.shape, in_meta["datatype"]
    )

    infer_input.set_data_from_numpy(data, binary_data=False)
    inputs.append(infer_input)

# Populate the outputs array
for out_meta in output_meta:
    output_name = out_meta["name"]
    output = tritonhttpclient.InferRequestedOutput(
        output_name, binary_data=False
    )
    outputs.append(output)

# Run inference
res = triton_client.infer(
    model_name,
    inputs,
    request_id="0",
    outputs=outputs,
    model_version="1",
    headers=headers,
)

result = postprocess(context_words=cw, answer=res)

result

[nltk_data] Downloading package punkt to /home/gopalv/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package punkt to /home/gopalv/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
start is 7, end is 8


[b'lazy', b'dog']
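
The `start is 7, end is 8` line and the final answer are related by a simple inclusive slice over the context tokens. The sketch below illustrates that mapping; it is an assumption about what `postprocess` in `src.bidaf_utils` does, since that helper is not shown here. The token list is written out by hand (lowercased, with the final period as its own token) rather than produced by the notebook's NLTK-based `preprocess`, and `extract_answer` is a hypothetical name.

```python
def extract_answer(context_words, start, end):
    """Return the context tokens from index start to index end, inclusive."""
    if not 0 <= start <= end < len(context_words):
        raise ValueError(f"invalid span ({start}, {end}) for {len(context_words)} tokens")
    return context_words[start : end + 1]


# Hand-written tokens for "A quick brown fox jumped over the lazy dog."
context_words = [
    "a", "quick", "brown", "fox", "jumped", "over", "the", "lazy", "dog", ".",
]

answer = extract_answer(context_words, 7, 8)
print(answer)  # ['lazy', 'dog']
```

The service returns the tokens as bytes (`b'lazy'`, `b'dog'`) because string tensors travel as Triton's BYTES datatype; the plain `str` values here are only for readability.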

## Delete the webservice and the downloaded model

In [12]:
service.delete()
delete_triton_models(prefix)

successfully deleted model: densenet_onnx
successfully deleted model: bidaf-9


## Next steps

Read [our documentation](https://aka.ms/triton-aml-docs) to use Triton with your own models, or check out the other notebooks in this folder for ways to do pre- and post-processing on the server.