# Deploy to Triton Inference Server on AKS

description: (preview) Deploy a Bi-Directional Attention Flow (BiDAF) question-answering model to V100 GPUs on AKS via Triton

Please note that this Public Preview release is subject to the [Supplemental Terms of Use for Microsoft Azure Previews](https://azure.microsoft.com/support/legal/preview-supplemental-terms/).

In [2]:
!pip install nvidia-pyindex
!pip install --upgrade tritonclient
!curl -L https://aka.ms/azureml-core-1.19.a1 --output azureml_core-1.19.0a1-py3-none-any.whl
!pip install azureml_core-1.19.0a1-py3-none-any.whl

Looking in indexes: https://pypi.org/simple, https://pypi.ngc.nvidia.com
Requirement already up-to-date: tritonclient in /home/gopalv/miniconda3/envs/azureml/lib/python3.7/site-packages (2.6.0)


In [11]:
print(service.get_logs())


== Triton Inference Server ==

NVIDIA Release 20.06 (build 13333626)

Copyright (c) 2018-2020, NVIDIA CORPORATION.  All rights reserved.

Various files include modifications (c) NVIDIA CORPORATION.  All rights reserved.
NVIDIA modifications are covered by the license terms that apply to the underlying
project or file.

NOTE: The SHMEM allocation limit is set to the default of 64MB.  This may be
   insufficient for the inference server.  NVIDIA recommends the use of the following flags:
   nvidia-docker run --shm-size=1g --ulimit memlock=-1 --ulimit stack=67108864 ...

2021-01-16 00:21:23.572842: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.11.0
I0116 00:21:23.611734 1 metrics.cc:184] found 1 GPUs supporting NVML metrics
I0116 00:21:23.617109 1 metrics.cc:193]   GPU 0: Tesla V100-PCIE-16GB
I0116 00:21:23.617751 1 server.cc:120] Initializing Triton Inference Server
error: creating server: Internal - failed to stat file 
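
The line that explains the crash is easy to miss among the informational messages. The following is a minimal sketch for pulling out only the error lines. It assumes the text comes from `service.get_logs()` and that Triton's glog-style lines begin with a severity letter (`I`, `W`, `E`, or `F`) followed by the date, as in the output above. `find_error_lines` is a hypothetical helper name, not part of the Azure ML SDK, and `sample` is copied from the log above.

```python
import re

# glog-style prefix seen in the log above, e.g.
# "I0116 00:21:23.617751 1 server.cc:120] Initializing Triton Inference Server"
GLOG_LINE = re.compile(r"^([IWEF])\d{4} \d{2}:\d{2}:\d{2}\.\d+ ")


def find_error_lines(log_text):
    """Return log lines that look like errors: glog E/F lines or 'error:' lines."""
    hits = []
    for line in log_text.splitlines():
        stripped = line.strip()
        match = GLOG_LINE.match(stripped)
        if match and match.group(1) in "EF":
            hits.append(stripped)
        elif stripped.lower().startswith("error:"):
            hits.append(stripped)
    return hits


sample = """I0116 00:21:23.617751 1 server.cc:120] Initializing Triton Inference Server
error: creating server: Internal - failed to stat file """

for line in find_error_lines(sample):
    print(line)
```

In the notebook, pass `service.get_logs()` in place of `sample`.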

In [4]:
from azureml.core import Workspace

ws = Workspace.from_config()
ws

Workspace.create(name='default', subscription_id='6560575d-fa06-4e7d-95fb-f962e74efd7a', resource_group='azureml-examples')

## Download model

Triton Inference Server can only load models that are stored in the directory structure it expects, so keep the layout of the downloaded files intact. [Read more about the directory structure that Triton expects](https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/model_repository.html).
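
As a rough illustration of that layout, the sketch below checks a folder for the general shape described in the linked Triton documentation: one directory per model, each containing at least one numerically named version directory that holds the model file. `check_model_repository` is a hypothetical helper written for this example. It only inspects directory shape and does not validate `config.pbtxt` or the model files themselves.

```python
from pathlib import Path


def check_model_repository(repo_root):
    """Return a list of problems found in a Triton-style model repository."""
    root = Path(repo_root)
    model_dirs = sorted(p for p in root.iterdir() if p.is_dir()) if root.is_dir() else []
    if not model_dirs:
        return [f"{root}: no model directories found"]

    problems = []
    for model_dir in model_dirs:
        # Version directories are numerically named, such as "1" or "2"
        versions = sorted(
            p for p in model_dir.iterdir() if p.is_dir() and p.name.isdigit()
        )
        if not versions:
            problems.append(f"{model_dir.name}: no numeric version directory")
            continue
        for version in versions:
            # Each version directory should hold at least one file (e.g. model.onnx)
            if not any(f.is_file() for f in version.iterdir()):
                problems.append(f"{model_dir.name}/{version.name}: no model file")
    return problems
```

Call it on whichever folder directly contains the per-model directories. An empty list means no layout problems were detected.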

In [6]:
from src.model_utils import download_triton_models, delete_triton_models
from pathlib import Path

prefix = Path(".")
download_triton_models(prefix)

successfully downloaded model: densenet_onnx
successfully downloaded model: bidaf-9


## Register model

In [8]:
from azureml.core.model import Model

model_path = prefix.joinpath("models")

model = Model.register(
    model_path=model_path,
    model_name="bidaf-9-tutorial",
    tags={"area": "Natural language processing", "type": "Question-answering"},
    description="Question answering from ONNX model zoo",
    workspace=ws,
    model_framework=Model.Framework.MULTI,
)

model

Registering model bidaf-9-tutorial


Model(workspace=Workspace.create(name='default', subscription_id='6560575d-fa06-4e7d-95fb-f962e74efd7a', resource_group='azureml-examples'), name=bidaf-9-tutorial, id=bidaf-9-tutorial:805, version=805, tags={'area': 'Natural language processing', 'type': 'Question-answering'}, properties={})

## Deploy webservice

Deploy to a pre-created [AksCompute](https://docs.microsoft.com/python/api/azureml-core/azureml.core.compute.aks.akscompute?view=azure-ml-py#provisioning-configuration-agent-count-none--vm-size-none--ssl-cname-none--ssl-cert-pem-file-none--ssl-key-pem-file-none--location-none--vnet-resourcegroup-name-none--vnet-name-none--subnet-name-none--service-cidr-none--dns-service-ip-none--docker-bridge-cidr-none--cluster-purpose-none--load-balancer-type-none-) named `aks-gpu-deploy`. For other options, see [our documentation](https://docs.microsoft.com/azure/machine-learning/how-to-deploy-and-where?tabs=azcli).


In [10]:
from azureml.core.webservice import AksWebservice
from azureml.core.model import InferenceConfig
from random import randint

service_name = "triton-bidaf-9" + str(randint(10000, 99999))

config = AksWebservice.deploy_configuration(
    compute_target_name="aks-gpu-deploy",
    gpu_cores=1,
    cpu_cores=1,
    memory_gb=4,
    auth_enabled=True,
)

service = Model.deploy(
    workspace=ws,
    name=service_name,
    models=[model],
    deployment_config=config,
    overwrite=True,
)

service.wait_for_deployment(show_output=True)

Tips: You can try get_logs(): https://aka.ms/debugimage#dockerlog or local deployment: https://aka.ms/debugimage#debug-locally to debug if deployment takes longer than 10 minutes.
Running...............................................................................
Failed
Service deployment polling reached non-successful terminal state, current service state: Transitioning
Operation ID: 7a9d54cf-35c3-494d-a552-4fb8c9a73ba0
More information can be found using '.get_logs()'
Error:
{
  "code": "KubernetesDeploymentFailed",
  "statusCode": 400,
  "message": "Kubernetes Deployment failed",
  "details": [
    {
      "code": "CrashLoopBackOff",
      "message": "Your container application crashed. This may be caused by errors in your scoring file's init() function.\nPlease check the logs for your container instance: triton-bidaf-974043. From the AML SDK, you can run print(service.get_logs()) if you have service object to fetch the logs. \nYou can interactively debug your scoring file locall

WebserviceException: WebserviceException:
	Message: Service deployment polling reached non-successful terminal state, current service state: Transitioning
Operation ID: 7a9d54cf-35c3-494d-a552-4fb8c9a73ba0
More information can be found using '.get_logs()'
Error:
{
  "code": "KubernetesDeploymentFailed",
  "statusCode": 400,
  "message": "Kubernetes Deployment failed",
  "details": [
    {
      "code": "CrashLoopBackOff",
      "message": "Your container application crashed. This may be caused by errors in your scoring file's init() function.\nPlease check the logs for your container instance: triton-bidaf-974043. From the AML SDK, you can run print(service.get_logs()) if you have service object to fetch the logs. \nYou can interactively debug your scoring file locally. Please refer to https://docs.microsoft.com/azure/machine-learning/how-to-debug-visual-studio-code#debug-and-troubleshoot-deployments for more information.\nYou can also try to run image nvcr.io/nvidia/tritonserver:20.06-py3 locally. Please refer to https://aka.ms/debugimage#service-launch-fails for more information."
    },
    {
      "code": "DeploymentFailed",
      "message": "Your container endpoint is not available. Please follow the steps to debug:\n1. From the AML SDK, you can run print(service.get_logs()) if you have service object to fetch the logs. Please refer to https://aka.ms/debugimage#dockerlog for more information.\n2. You can also interactively debug your scoring file locally. Please refer to https://docs.microsoft.com/azure/machine-learning/how-to-debug-visual-studio-code#debug-and-troubleshoot-deployments for more information.\n3. View the diagnostic events to check status of container, it may help you to debug the issue. [{\"InvolvedObject\":\"triton-bidaf-974043-5954b7f879-spg4c\",\"InvolvedKind\":\"Pod\",\"Type\":\"Normal\",\"Reason\":\"Pulled\",\"Message\":\"Successfully pulled image \\\"mcr.microsoft.com/azureml/dependency-unpacker:20201117\\\"\",\"LastTimestamp\":\"2021-01-16T00:13:06Z\"},{\"InvolvedObject\":\"triton-bidaf-974043-5954b7f879-spg4c\",\"InvolvedKind\":\"Pod\",\"Type\":\"Normal\",\"Reason\":\"Created\",\"Message\":\"Created container amlappinit\",\"LastTimestamp\":\"2021-01-16T00:13:08Z\"},{\"InvolvedObject\":\"triton-bidaf-974043-5954b7f879-spg4c\",\"InvolvedKind\":\"Pod\",\"Type\":\"Normal\",\"Reason\":\"Started\",\"Message\":\"Started container amlappinit\",\"LastTimestamp\":\"2021-01-16T00:13:08Z\"},{\"InvolvedObject\":\"triton-bidaf-974043-5954b7f879-spg4c\",\"InvolvedKind\":\"Pod\",\"Type\":\"Normal\",\"Reason\":\"Pulling\",\"Message\":\"Pulling image \\\"nvcr.io/nvidia/tritonserver:20.06-py3\\\"\",\"LastTimestamp\":\"2021-01-16T00:13:11Z\"},{\"InvolvedObject\":\"triton-bidaf-974043-5954b7f879-spg4c\",\"InvolvedKind\":\"Pod\",\"Type\":\"Normal\",\"Reason\":\"Pulled\",\"Message\":\"Successfully pulled image \\\"nvcr.io/nvidia/tritonserver:20.06-py3\\\"\",\"LastTimestamp\":\"2021-01-16T00:15:23Z\"},{\"InvolvedObject\":\"triton-bidaf-974043-5954b7f879-spg4c\",\"InvolvedKind\":\"Pod\",\"Type\":\"Warning\",\"Reason\":\"Unhealthy\",\"Message\":\"Readiness probe failed: Get 
http://10.244.65.8:8000/v2/health/ready: dial tcp 10.244.65.8:8000: connect: connection refused\",\"LastTimestamp\":\"2021-01-16T00:15:29Z\"},{\"InvolvedObject\":\"triton-bidaf-974043-5954b7f879-spg4c\",\"InvolvedKind\":\"Pod\",\"Type\":\"Normal\",\"Reason\":\"Created\",\"Message\":\"Created container triton-bidaf-974043\",\"LastTimestamp\":\"2021-01-16T00:15:30Z\"},{\"InvolvedObject\":\"triton-bidaf-974043-5954b7f879-spg4c\",\"InvolvedKind\":\"Pod\",\"Type\":\"Normal\",\"Reason\":\"Started\",\"Message\":\"Started container triton-bidaf-974043\",\"LastTimestamp\":\"2021-01-16T00:15:30Z\"},{\"InvolvedObject\":\"triton-bidaf-974043-5954b7f879-spg4c\",\"InvolvedKind\":\"Pod\",\"Type\":\"Normal\",\"Reason\":\"Pulled\",\"Message\":\"Container image \\\"nvcr.io/nvidia/tritonserver:20.06-py3\\\" already present on machine\",\"LastTimestamp\":\"2021-01-16T00:15:30Z\"},{\"InvolvedObject\":\"triton-bidaf-974043-5954b7f879-spg4c\",\"InvolvedKind\":\"Pod\",\"Type\":\"Warning\",\"Reason\":\"BackOff\",\"Message\":\"Back-off restarting failed container\",\"LastTimestamp\":\"2021-01-16T00:15:35Z\"}]"
    }
  ]
}
	InnerException None

## Test the webservice

In [None]:
!pip install --upgrade nltk geventhttpclient python-rapidjson

In [None]:
service_key = service.get_keys()[0]
scoring_uri = service.scoring_uri

In [None]:
!curl -v $scoring_uri/v2/health/ready -H 'Authorization: Bearer '"$service_key"''
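
The same readiness check can be made from Python. This is a minimal sketch using only the standard library. It assumes the `/v2/health/ready` path and the `Authorization: Bearer` header used in the `curl` command above. `build_ready_request` and `is_ready` are hypothetical helper names.

```python
import urllib.request


def build_ready_request(scoring_uri, service_key):
    """Build a GET request for Triton's readiness endpoint with key-based auth."""
    url = scoring_uri.rstrip("/") + "/v2/health/ready"
    return urllib.request.Request(
        url, headers={"Authorization": f"Bearer {service_key}"}, method="GET"
    )


def is_ready(scoring_uri, service_key, timeout=10):
    """Return True on HTTP 200, False on HTTP errors or connection failures."""
    request = build_ready_request(scoring_uri, service_key)
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status == 200
    except OSError:  # URLError, HTTPError, and socket timeouts all derive from OSError
        return False
```

With the variables defined earlier, `is_ready(scoring_uri, service_key)` reports whether the server is ready to accept inference requests.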

In [None]:
import json

# Using a modified version of tritonhttpclient for Preview; the PR is out for review:
# https://github.com/triton-inference-server/server/pull/2047
import tritonclient.http as tritonhttpclient
from tritonclient.utils import triton_to_np_dtype

from src.bidaf_utils import preprocess, postprocess

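# Authenticate each request with the service's primary key
key = service.get_keys()[0]
headers = {"Authorization": f"Bearer {key}"}

# The client takes the address without a scheme, so drop the leading "http://" (7 characters)
triton_client = tritonhttpclient.InferenceServerClient(service.scoring_uri[7:])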

context = "A quick brown fox jumped over the lazy dog."
query = "Which animal was lower?"

model_name = "bidaf-9"

model_metadata = triton_client.get_model_metadata(
    model_name=model_name, headers=headers
)

input_meta = model_metadata["inputs"]
output_meta = model_metadata["outputs"]

# String (BYTES) inputs map to NumPy's object dtype (np.object_)
np_dtype = triton_to_np_dtype(input_meta[0]["datatype"])
cw, cc = preprocess(context, np_dtype)
qw, qc = preprocess(query, np_dtype)

input_mapping = {
    "query_word": qw,
    "query_char": qc,
    "context_word": cw,
    "context_char": cc,
}

inputs = []
outputs = []

# Populate the inputs array
for in_meta in input_meta:
    input_name = in_meta["name"]
    data = input_mapping[input_name]

    # Named infer_input to avoid shadowing the built-in input()
    infer_input = tritonhttpclient.InferInput(input_name, data.shape, in_meta["datatype"])

    infer_input.set_data_from_numpy(data, binary_data=False)
    inputs.append(infer_input)

# Populate the outputs array
for out_meta in output_meta:
    output_name = out_meta["name"]
    output = tritonhttpclient.InferRequestedOutput(output_name, binary_data=False)
    outputs.append(output)

# Run inference
res = triton_client.infer(
    model_name,
    inputs,
    request_id="0",
    outputs=outputs,
    model_version="1",
    headers=headers,
)

result = postprocess(context_words=cw, answer=res)

result
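
The cell above removes the scheme with `service.scoring_uri[7:]`, which only works when the URI starts with `http://`. A slightly more defensive sketch is shown below. `split_scoring_uri` is a hypothetical helper and the example URI is made up. It returns the address without its scheme and a flag that tells you whether the endpoint uses HTTPS.

```python
from urllib.parse import urlsplit


def split_scoring_uri(scoring_uri):
    """Return (address_without_scheme, uses_https) for a scoring URI."""
    parts = urlsplit(scoring_uri)
    if parts.scheme not in ("http", "https"):
        raise ValueError(f"unexpected scheme in scoring URI: {scoring_uri!r}")
    # Keep host, port, and path; drop only the scheme and any trailing slash
    address = parts.netloc + parts.path.rstrip("/")
    return address, parts.scheme == "https"


address, uses_https = split_scoring_uri("http://10.0.0.1:80/api/v1/service/demo")
print(address, uses_https)
```

The returned address can be passed to `InferenceServerClient` in place of the sliced string. How to enable TLS when `uses_https` is true depends on the installed `tritonclient` version, so check its documentation before relying on it.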

## Delete the webservice and the downloaded model

In [None]:
service.delete()
delete_triton_models(prefix)

## Next steps

Read [our documentation](https://aka.ms/triton-aml-docs) to learn how to use Triton with your own models, or check out the other notebooks in this folder for ways to do pre- and post-processing on the server.