# Deploy to Triton Inference Server locally

description: (preview) Deploy a Bi-Directional Attention Flow (BiDAF) question-answering model locally with Triton Inference Server

Please note that this Public Preview release is subject to the [Supplemental Terms of Use for Microsoft Azure Previews](https://azure.microsoft.com/support/legal/preview-supplemental-terms/).

In [1]:
!pip install nvidia-pyindex
!pip install --upgrade tritonclient

/bin/bash: pip: command not found
/bin/bash: pip: command not found


In [2]:
from azureml.core import Workspace

ws = Workspace.from_config()
ws

Failure while loading azureml_run_type_providers. Failed to load entrypoint hyperdrive = azureml.train.hyperdrive:HyperDriveRun._from_run_dto with exception (azureml-core 1.27.0 (/home/gopalv/miniconda3/envs/azureml/lib/python3.7/site-packages), Requirement.parse('azureml-core~=1.20.0')).
Failure while loading azureml_run_type_providers. Failed to load entrypoint automl = azureml.train.automl.run:AutoMLRun._from_run_dto with exception (azureml-core 1.27.0 (/home/gopalv/miniconda3/envs/azureml/lib/python3.7/site-packages), Requirement.parse('azureml-core~=1.20.0')).
Failure while loading azureml_run_type_providers. Failed to load entrypoint azureml.PipelineRun = azureml.pipeline.core.run:PipelineRun._from_dto with exception (azureml-core 1.27.0 (/home/gopalv/miniconda3/envs/azureml/lib/python3.7/site-packages), Requirement.parse('azureml-core~=1.20.0')).
Failure while loading azureml_run_type_providers. Failed to load entrypoint azureml.ReusedStepRun = azureml.pipeline.core.run:StepRun.

Workspace.create(name='default', subscription_id='6560575d-fa06-4e7d-95fb-f962e74efd7a', resource_group='azureml-examples')

## Download model

For Triton Inference Server to load your model, the model files must follow the repository layout that Triton expects: one directory per model, each containing numbered version subdirectories that hold the model file. [Read more about the directory structure that Triton expects](https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/model_repository.html).
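As a rough illustration of that layout, the sketch below scans a repository root and reports model directories that lack a numeric version subdirectory, or version directories that are empty. The function name `find_layout_problems` is hypothetical and not part of this notebook's `src` helpers; it only checks the directory shape and does not validate the model files themselves.

```python
from pathlib import Path
from typing import List


def find_layout_problems(repo_root: Path) -> List[str]:
    """Return human-readable problems found in a Triton-style model repository.

    Hypothetical helper: it assumes the layout <repo_root>/<model>/<version>/<file>,
    where <version> is a directory named with an integer such as "1".
    """
    problems = []
    model_dirs = sorted(p for p in repo_root.iterdir() if p.is_dir())
    if not model_dirs:
        problems.append(f"{repo_root} contains no model directories")
    for model_dir in model_dirs:
        # Version directories are the numerically named subdirectories.
        versions = sorted(
            p for p in model_dir.iterdir() if p.is_dir() and p.name.isdigit()
        )
        if not versions:
            problems.append(f"{model_dir.name}: no numeric version directory")
        for version_dir in versions:
            # A version directory with no files cannot hold a model.
            if not any(version_dir.iterdir()):
                problems.append(
                    f"{model_dir.name}/{version_dir.name}: version directory is empty"
                )
    return problems
```

Calling `find_layout_problems(Path("models/triton"))` on a well-formed repository would return an empty list; the path here is only an example, so point it at wherever your models were downloaded.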

In [3]:
from src.model_utils import download_triton_models, delete_triton_models
from pathlib import Path

prefix = Path(".")
download_triton_models(prefix)

successfully downloaded model: densenet_onnx
successfully downloaded model: bidaf-9


## Register model

In [4]:
from azureml.core.model import Model

model_path = prefix.joinpath("models")

model = Model.register(
    model_path=model_path,
    model_name="bidaf-9-tutorial",
    tags={"area": "Natural language processing", "type": "Question-answering"},
    description="Question answering from ONNX model zoo",
    workspace=ws,
    model_framework=Model.Framework.MULTI,
)

model

Registering model bidaf-9-tutorial


Model(workspace=Workspace.create(name='default', subscription_id='6560575d-fa06-4e7d-95fb-f962e74efd7a', resource_group='azureml-examples'), name=bidaf-9-tutorial, id=bidaf-9-tutorial:2212, version=2212, tags={'area': 'Natural language processing', 'type': 'Question-answering'}, properties={})

## Deploy webservice

Deploy the registered model as a `LocalWebservice`, which runs the Triton container in Docker on this machine and listens on port 6789. To deploy to a remote target instead, such as a pre-created [AksCompute](https://docs.microsoft.com/python/api/azureml-core/azureml.core.compute.aks.akscompute?view=azure-ml-py#provisioning-configuration-agent-count-none--vm-size-none--ssl-cname-none--ssl-cert-pem-file-none--ssl-key-pem-file-none--location-none--vnet-resourcegroup-name-none--vnet-name-none--subnet-name-none--service-cidr-none--dns-service-ip-none--docker-bridge-cidr-none--cluster-purpose-none--load-balancer-type-none-) cluster, see [our documentation](https://docs.microsoft.com/azure/machine-learning/how-to-deploy-and-where?tabs=azcli).


In [8]:
from azureml.core.webservice import LocalWebservice
from azureml.core.model import InferenceConfig
from random import randint

service_name = "triton-bidaf-9" + str(randint(10000, 99999))

config = LocalWebservice.deploy_configuration(port=6789)

service = Model.deploy(
    workspace=ws,
    name=service_name,
    models=[model],
    deployment_config=config,
    overwrite=True,
)

service.wait_for_deployment(show_output=True)

Downloading model bidaf-9-tutorial:2212 to /tmp/azureml__me4myf2/bidaf-9-tutorial/2212
Generating Docker build context.
Package creation Succeeded
Logging into Docker registry 
Building Docker image from Dockerfile...
Step 1/6 : FROM nvcr.io/nvidia/tritonserver:20.06-py3
 ---> 171a7fd4d078
Step 2/6 : ENV AZUREML_MODEL_DIR=azureml-models/bidaf-9-tutorial/2212
 ---> Running in 34ad55c96afb
 ---> edd10d37090f
Step 3/6 : COPY azureml-app /var/azureml-app
 ---> 46b91b39800c
Step 4/6 : RUN mkdir -p '/var/azureml-app' && echo eyJhY2NvdW50Q29udGV4dCI6eyJzdWJzY3JpcHRpb25JZCI6IjY1NjA1NzVkLWZhMDYtNGU3ZC05NWZiLWY5NjJlNzRlZmQ3YSIsInJlc291cmNlR3JvdXBOYW1lIjoiYXp1cmVtbC1leGFtcGxlcyIsImFjY291bnROYW1lIjoiZGVmYXVsdCIsIndvcmtzcGFjZUlkIjoiMGUxNDk3NjQtMzcyMC00NjEwLWIwZjMtM2UzZjk3NDU0NGFjIn0sIm1vZGVscyI6eyJiaWRhZi05LXR1dG9yaWFsIjp7InZlcnNpb24iOjIyMTIsImlkIjoiYmlkYWYtOS10dXRvcmlhbDoyMjEyIiwiaW50ZXJuYWxJZCI6ImRkMTJjNTUyZGZkNDRlZTRhYjhlYTJlZjA1ZTc4MjdkIn19LCJtb2RlbHNJbmZvIjp7ImJpZGFmLTktdHV0b3JpYWwiOnsiMjIxMiI

In [9]:
print(service.get_logs())


== Triton Inference Server ==

NVIDIA Release 20.06 (build 13333626)

Copyright (c) 2018-2020, NVIDIA CORPORATION.  All rights reserved.

Various files include modifications (c) NVIDIA CORPORATION.  All rights reserved.
NVIDIA modifications are covered by the license terms that apply to the underlying
project or file.

   Use 'nvidia-docker run' to start this container; see
   https://github.com/NVIDIA/nvidia-docker/wiki/nvidia-docker .

NOTE: The SHMEM allocation limit is set to the default of 64MB.  This may be
   insufficient for the inference server.  NVIDIA recommends the use of the following flags:
   nvidia-docker run --shm-size=1g --ulimit memlock=-1 --ulimit stack=67108864 ...

2021-05-13 20:33:26.045755: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.11.0
I0513 20:33:26.995323 1 server.cc:120] Initializing Triton Inference Server
E0513 20:33:26.997570 1 pinned_memory_manager.cc:192] failed to allocate pinned s

## Test the webservice

In [11]:
!pip install --upgrade nltk geventhttpclient python-rapidjson

/bin/bash: pip: command not found


In [13]:
scoring_uri = service.scoring_uri
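
A later cell passes the Triton HTTP client a `host:port` string by slicing the first seven characters (`http://`) off the scoring URI. A slightly more defensive variant, sketched here with the standard library, parses the URI instead, so it would also cope with an `https://` scheme or a trailing path. The helper name `host_and_port` is an assumption for illustration only.

```python
from urllib.parse import urlparse


def host_and_port(uri: str) -> str:
    """Return the "host:port" part of a URI, without scheme or path."""
    parsed = urlparse(uri)
    if not parsed.netloc:
        # Without a scheme such as "http://", urlparse cannot find the host.
        raise ValueError(f"expected a URI with a scheme, got {uri!r}")
    return parsed.netloc


print(host_and_port("http://localhost:6789"))  # localhost:6789
```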

In [14]:
!curl -v $scoring_uri/v2/health/ready

*   Trying 127.0.0.1...
* TCP_NODELAY set
* Connected to localhost (127.0.0.1) port 6789 (#0)

* Connection #0 to host localhost left intact
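
The same readiness probe can be issued from Python rather than `curl`. The following is a minimal sketch using only the standard library; the helper names `ready_url` and `is_ready` are hypothetical, and it assumes the server answers the `/v2/health/ready` path used above with HTTP 200 once it is ready.

```python
import urllib.error
import urllib.request


def ready_url(scoring_uri: str) -> str:
    """Build the readiness URL from the service's base scoring URI."""
    return scoring_uri.rstrip("/") + "/v2/health/ready"


def is_ready(scoring_uri: str, timeout: float = 5.0) -> bool:
    """Return True if the readiness endpoint responds with HTTP 200."""
    try:
        with urllib.request.urlopen(ready_url(scoring_uri), timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or a non-2xx response: treat as not ready.
        return False
```

With the service from this notebook running, `is_ready(scoring_uri)` can gate the inference cell so that it does not run before the container has finished loading its models.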


In [15]:
import json

import tritonclient.http as tritonhttpclient
from tritonclient.utils import triton_to_np_dtype

from src.bidaf_utils import preprocess, postprocess

headers = {}

# The client expects "host:port", so drop the 7-character "http://" prefix
triton_client = tritonhttpclient.InferenceServerClient(service.scoring_uri[7:])

context = "A quick brown fox jumped over the lazy dog."
query = "Which animal was lower?"

model_name = "bidaf-9"

model_metadata = triton_client.get_model_metadata(
    model_name=model_name, headers=headers
)

input_meta = model_metadata["inputs"]
output_meta = model_metadata["outputs"]

# Triton string (BYTES) inputs map to NumPy's object dtype
np_dtype = triton_to_np_dtype(input_meta[0]["datatype"])
cw, cc = preprocess(context, np_dtype)
qw, qc = preprocess(query, np_dtype)

input_mapping = {
    "query_word": qw,
    "query_char": qc,
    "context_word": cw,
    "context_char": cc,
}

inputs = []
outputs = []

# Populate the inputs array
for in_meta in input_meta:
    input_name = in_meta["name"]
    data = input_mapping[input_name]

    # Named infer_input to avoid shadowing the built-in input()
    infer_input = tritonhttpclient.InferInput(input_name, data.shape, in_meta["datatype"])

    infer_input.set_data_from_numpy(data, binary_data=False)
    inputs.append(infer_input)

# Populate the outputs array
for out_meta in output_meta:
    output_name = out_meta["name"]
    output = tritonhttpclient.InferRequestedOutput(output_name, binary_data=False)
    outputs.append(output)

# Run inference
res = triton_client.infer(
    model_name,
    inputs,
    request_id="0",
    outputs=outputs,
    model_version="1",
    headers=headers,
)

result = postprocess(context_words=cw, answer=res)

result

[nltk_data] Downloading package punkt to /home/gopalv/nltk_data...
start is 7, end is 8
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package punkt to /home/gopalv/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


[b'lazy', b'dog']
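
The log line `start is 7, end is 8` and the result `[b'lazy', b'dog']` suggest that the model returns start and end token positions and that post-processing slices the context tokens between them, inclusive. The sketch below illustrates that idea only; it is not the implementation in `src.bidaf_utils`, and the lowercase tokenization shown is an assumption.

```python
from typing import List


def answer_span(tokens: List[str], start: int, end: int) -> List[str]:
    """Return the tokens from start to end, inclusive of both positions."""
    if not 0 <= start <= end < len(tokens):
        raise ValueError(f"invalid span ({start}, {end}) for {len(tokens)} tokens")
    return tokens[start : end + 1]


# Assumed tokenization of the context sentence used in this notebook.
tokens = ["a", "quick", "brown", "fox", "jumped", "over", "the", "lazy", "dog", "."]
print(answer_span(tokens, 7, 8))  # ['lazy', 'dog']
```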

## Delete the webservice and the downloaded model

In [None]:
service.delete()
delete_triton_models(prefix)

## Next steps

Read [our documentation](https://aka.ms/triton-aml-docs) to learn how to use Triton with your own models, or explore the other notebooks in this folder for examples of pre- and post-processing on the server.