# Triton Inference Server Auto-Complete-Config

To simplify a Triton Inference Server `config.pbtxt`, you can use the Auto-Complete-Config feature, which lets Triton infer the model's input and output definitions (including their shapes) from the model file. The `config.pbtxt` then only needs the model name and the platform or backend; other parameters can be added optionally. This sample uses a Transformers model exported to ONNX.

## Setting

This sample uses a SageMaker Classic Notebook Instance with the conda_py3 kernel on a g5.4xlarge instance.

## Setup

In [None]:
#!pip install transformers torch onnx

## Local Inference & ONNX Conversion

In [None]:
from transformers import AutoTokenizer, AutoModel
import torch

# Load model from HuggingFace Hub
tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/msmarco-bert-base-dot-v5")
model = AutoModel.from_pretrained("sentence-transformers/msmarco-bert-base-dot-v5")

#Mean Pooling - Take attention mask into account for correct averaging
def mean_pooling(model_output, attention_mask):
    token_embeddings = model_output.last_hidden_state
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)


# Sentences we want sentence embeddings for
query = "How many people live in London?"
encoded_input = tokenizer(query, padding=True, truncation=True, return_tensors='pt')
#print(encoded_input)
# Compute token embeddings
with torch.no_grad():
    model_output = model(**encoded_input, return_dict=True)
    #print(model_output)
# Perform pooling
embeddings = mean_pooling(model_output, encoded_input['attention_mask'])
#embeddings.numpy()

In [None]:
from pathlib import Path
import transformers
from transformers import AutoModel, AutoTokenizer
from transformers.onnx import FeaturesManager

# load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/msmarco-bert-base-dot-v5")
model = AutoModel.from_pretrained("sentence-transformers/msmarco-bert-base-dot-v5")

# load config
model_kind, model_onnx_config = FeaturesManager.check_supported_model_or_raise(model)
onnx_config = model_onnx_config(model.config)

# export
onnx_inputs, onnx_outputs = transformers.onnx.export(
        preprocessor=tokenizer,
        model=model,
        config=onnx_config,
        opset=13,
        output=Path("model.onnx")
)

## Triton Inference Server Setup

Note that the model artifacts must be arranged in the directory structure that Triton Inference Server expects:
```
- triton-serve-onnx
    - sentence
        - 1
            - model.onnx
        - config.pbtxt (adjusted for auto config)
```
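
Before starting the server, it can help to confirm that the files ended up where the layout above says they should be. The following is a minimal sketch using only the standard library; `check_model_repository` is a hypothetical helper written for this walkthrough, not part of Triton, and it only checks the layout shown above.

```python
from pathlib import Path


def check_model_repository(repo_root, model_name="sentence"):
    """Return a list of problems found in the layout used by this walkthrough.

    Hypothetical helper: it checks for <repo_root>/<model_name>/config.pbtxt
    and for a model.onnx file inside every numeric version directory.
    """
    model_dir = Path(repo_root) / model_name
    if not model_dir.is_dir():
        return [f"missing model directory {model_dir}"]

    problems = []
    if not (model_dir / "config.pbtxt").is_file():
        problems.append(f"missing {model_dir / 'config.pbtxt'}")

    # Version directories are named with integers, for example "1".
    version_dirs = [d for d in model_dir.iterdir() if d.is_dir() and d.name.isdigit()]
    if not version_dirs:
        problems.append(f"no numeric version directory under {model_dir}")
    for version_dir in version_dirs:
        if not (version_dir / "model.onnx").is_file():
            problems.append(f"missing {version_dir / 'model.onnx'}")
    return problems
```

Calling `check_model_repository("triton-serve-onnx")` after the setup cells have run should return an empty list; any returned strings name the missing pieces.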

In [None]:
%%sh
# -p creates the nested directories and does not fail if they already exist
mkdir -p triton-serve-onnx/sentence/1
touch triton-serve-onnx/sentence/config.pbtxt

In [None]:
%%writefile triton-serve-onnx/sentence/config.pbtxt
name: "sentence"
platform: "onnxruntime_onnx"

instance_group {
  count: 1
  kind: KIND_GPU
}
dynamic_batching {
}

In [None]:
!mv model.onnx triton-serve-onnx/sentence/1/

## Sample Inference
We can prepare the payload with the Transformers tokenizer, formatting the inputs as the model expects. We can then send the request with the Python requests library or the Triton client.
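
The request body follows the KServe v2 inference protocol that Triton implements over HTTP: each input is described by its `name`, `shape`, `datatype`, and `data`. The sketch below builds such a body without the tokenizer so the structure is easy to see. `make_input` is a hypothetical helper, and the token ids are hand-written illustrative values, not real tokenizer output.

```python
import json


def make_input(name, rows, datatype="INT64"):
    """Build one entry of the "inputs" list from a batch of equal-length rows."""
    lengths = {len(row) for row in rows}
    if len(lengths) != 1:
        raise ValueError(f"{name}: rows must be non-empty and padded to the same length")
    return {
        "name": name,
        "shape": [len(rows), lengths.pop()],  # [batch_size, sequence_length]
        "datatype": datatype,
        "data": rows,
    }


# Illustrative values only; real ids and masks come from the tokenizer.
payload = {
    "inputs": [
        make_input("input_ids", [[101, 2023, 2003, 102]]),
        make_input("token_type_ids", [[0, 0, 0, 0]]),
        make_input("attention_mask", [[1, 1, 1, 1]]),
    ]
}
body = json.dumps(payload)  # JSON string to send as the HTTP request body
```

The three input names match the ones used by the tokenizer-based payload in this notebook; if the exported model exposes different names, the request has to use those instead.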

Before running inference, start the container with the following command (adjust the path and container image as needed):

```
docker run --gpus=all --shm-size=4G --rm -p8000:8000 -p8001:8001 -p8002:8002 -v/home/ec2-user/SageMaker/triton-serve-onnx:/model_repository nvcr.io/nvidia/tritonserver:23.12-py3 tritonserver --model-repository=/model_repository --exit-on-error=false --log-verbose=1 --strict-model-config=false
```

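Note that the command also sets `--strict-model-config=false`, which allows Triton to auto-complete the parts of the model configuration that `config.pbtxt` leaves out. Recent Triton releases may already enable this behavior by default, in which case the flag is redundant but harmless.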
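
Once the container is running, you can check that the server is ready and look at the configuration Triton generated. The sketch below uses the standard library and two HTTP endpoints from Triton's API, `v2/health/ready` and `v2/models/<name>/config`. The helper names are hypothetical, and the host and port assume the `docker run` mapping above.

```python
import json
import urllib.request


def triton_url(path, host="localhost", port=8000):
    """Build a URL under the server's /v2 HTTP API."""
    return f"http://{host}:{port}/v2/{path.lstrip('/')}"


def server_is_ready(timeout=5.0):
    """Return True if the readiness endpoint answers with HTTP 200."""
    try:
        with urllib.request.urlopen(triton_url("health/ready"), timeout=timeout) as resp:
            return resp.status == 200
    except OSError:  # connection refused, timeout, or an HTTP error status
        return False


def fetch_model_config(model_name="sentence", timeout=5.0):
    """Return the model configuration as reported by the server."""
    url = triton_url(f"models/{model_name}/config")
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.load(resp)


if __name__ == "__main__":
    if server_is_ready():
        config = fetch_model_config()
        # With auto-complete, these sections should be filled in by Triton.
        print(config.get("input"))
        print(config.get("output"))
    else:
        print("Triton is not ready yet")
```

The printed input and output entries are useful for confirming the tensor names and data types that the client payload has to match.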

In [None]:
# prepare client payload
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/msmarco-bert-base-dot-v5")

def tokenize_text(text):
    tokenized_text = tokenizer(text, padding=True, truncation=True, return_tensors='pt')
    payload = {}
    payload["inputs"] = []
    payload["inputs"].append(
        {
            "name": "input_ids",
            "shape": tokenized_text.input_ids.shape,
            "datatype": "INT64",
            "data": tokenized_text.input_ids.tolist(),
        }
    )
    payload["inputs"].append(
        {
            "name": "token_type_ids",
            "shape": tokenized_text.token_type_ids.shape,
            "datatype": "INT64",
            "data": tokenized_text.token_type_ids.tolist(),
        }
    )
    payload["inputs"].append(
        {
            "name": "attention_mask",
            "shape": tokenized_text.attention_mask.shape,
            "datatype": "INT64",
            "data": tokenized_text.attention_mask.tolist(),
        }
    )
    
    return payload
sampPayload = tokenize_text(["This is a test"])
sampPayload

In [None]:
import requests
import json

# Specify the model name and version
model_name = "sentence" #specified in config.pbtxt
model_version = "1"

# Set the inference URL based on the Triton server's address
url = f"http://localhost:8000/v2/models/{model_name}/versions/{model_version}/infer"

In [None]:
# sample invoke onnx model
response = requests.post(url, data=json.dumps(sampPayload))
response.raise_for_status()

# output result
inference_result = response.json()
print(inference_result['outputs'])
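
The server returns token-level outputs, so getting a sentence embedding still requires the same mean pooling used for local inference. The sketch below reshapes one output entry into a NumPy array and pools it with the attention mask. It assumes the exported model exposes an output named `last_hidden_state` (check the names printed above) and that `data` arrives as a row-major list; the response fragment is fabricated for illustration.

```python
import numpy as np


def output_to_array(output):
    """Reshape one entry of the response's "outputs" list into an ndarray."""
    return np.asarray(output["data"], dtype=np.float32).reshape(output["shape"])


def mean_pool(token_embeddings, attention_mask):
    """Average token embeddings over the sequence axis, ignoring padded tokens."""
    mask = np.asarray(attention_mask, dtype=np.float32)[..., None]  # (batch, seq, 1)
    summed = (token_embeddings * mask).sum(axis=1)
    counts = np.clip(mask.sum(axis=1), 1e-9, None)  # avoid division by zero
    return summed / counts


# Fabricated fragment: batch of 1, 2 tokens, hidden size 3 (assumed output name).
fake_output = {
    "name": "last_hidden_state",
    "datatype": "FP32",
    "shape": [1, 2, 3],
    "data": [1.0, 2.0, 3.0, 3.0, 4.0, 5.0],
}
tokens = output_to_array(fake_output)
sentence_embedding = mean_pool(tokens, [[1, 1]])
print(sentence_embedding)  # [[2. 3. 4.]]
```

With a real response, the entry could be selected with `next(o for o in inference_result["outputs"] if o["name"] == "last_hidden_state")` and pooled with the attention mask that was sent in the payload.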