# Triton Inference Server Ensemble Invoke

In this notebook we explore how to use Triton's ensemble mode to stitch multiple models together for inference. We take a sample embeddings model and show how the tokenizer (Python backend) and the embeddings model (ONNX backend) run together as a single ensemble.

## Prerequisites
- Ensure you have run onnx-exporter.ipynb to create model.onnx. The file is also included in this repository, inside the model repository structure.
- Ensure you have built the custom Docker image that has transformers installed. The steps are in the README.md.

## Client Installation

In [None]:
#!pip install nvidia-pyindex
#!pip install tritonclient[http]

## Start Container

Use the following command in the CLI to start the Docker container. Ensure you have already built this image by following the steps in the README.md.

```
docker run --gpus=all --shm-size=4G --rm -p8000:8000 -p8001:8001 -p8002:8002 -v$(pwd)/hf_pipeline:/model_repository custom-triton:latest tritonserver --model-repository=/model_repository --exit-on-error=false --log-verbose=1
```
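
Model loading can take a while after the container starts, so it can help to wait until the server and the ensemble report ready before sending requests. The following is a minimal sketch, not part of the original notebook: `wait_until_ready` is a hypothetical helper name, the timeout values are arbitrary, and it assumes the client exposes `is_server_ready()` and `is_model_ready()` as the tritonclient HTTP client does.

```python
import time


def wait_until_ready(client, model_name, timeout_s=60.0, poll_s=1.0):
    """Poll until the server and the named model report ready, or time out.

    `client` is any object exposing is_server_ready() and
    is_model_ready(name), such as tritonclient.http.InferenceServerClient.
    Returns True if both became ready within timeout_s, otherwise False.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            if client.is_server_ready() and client.is_model_ready(model_name):
                return True
        except Exception:
            # The server may still be starting and refusing connections.
            pass
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_s)
```

With the client created in the Sample Inference section, a call such as `wait_until_ready(triton_client, "ensemble")` should return `True` once the ensemble has loaded.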

## Sample Inference

In [2]:
import tritonclient.http as http_client
triton_client = http_client.InferenceServerClient(url="localhost:8000", verbose=True)

In [8]:
import numpy as np

# Create inputs to send to Triton
model_name = "ensemble"
text_inputs = ["This is the test string"]

# Text is passed to Triton as BYTES
inputs = []
inputs.append(http_client.InferInput("INPUT0", [1], "BYTES"))
input0_real = np.array(text_inputs, dtype=np.object_)
inputs[0].set_data_from_numpy(input0_real)

outputs = []
outputs.append(http_client.InferRequestedOutput("last_hidden_state"))

results = triton_client.infer(model_name=model_name, inputs=inputs, outputs=outputs)
results

POST /v2/models/ensemble/infer, headers {'Inference-Header-Content-Length': 173}
b'{"inputs":[{"name":"INPUT0","shape":[1],"datatype":"BYTES","parameters":{"binary_data_size":27}}],"outputs":[{"name":"last_hidden_state","parameters":{"binary_data":true}}]}\x17\x00\x00\x00This is the test string'
<HTTPSocketPoolResponse status=200 headers={'content-type': 'application/octet-stream', 'inference-header-content-length': '237', 'content-length': '21741'}>
bytearray(b'{"model_name":"ensemble","model_version":"1","parameters":{"sequence_id":0,"sequence_start":false,"sequence_end":false},"outputs":[{"name":"last_hidden_state","datatype":"FP32","shape":[1,7,768],"parameters":{"binary_data_size":21504}}]}')


<tritonclient.http._infer_result.InferResult at 0x7fdd6b8d77c0>
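
The sizes in the verbose log above can be checked by hand. In the binary request body, each BYTES element is written as a 4-byte little-endian length followed by the raw bytes, which is why the payload starts with `\x17\x00\x00\x00` (23) and `binary_data_size` is 27. The sketch below reproduces those numbers; `serialize_bytes_tensor` is a hypothetical helper written for illustration only, since the client performs this encoding itself.

```python
import math
import struct


def serialize_bytes_tensor(strings):
    """Encode each string as a 4-byte little-endian length plus its UTF-8 bytes."""
    chunks = []
    for s in strings:
        raw = s.encode("utf-8")
        chunks.append(struct.pack("<I", len(raw)) + raw)
    return b"".join(chunks)


payload = serialize_bytes_tensor(["This is the test string"])
print(payload)       # b'\x17\x00\x00\x00This is the test string'
print(len(payload))  # 27, matching binary_data_size in the request

# FP32 output of shape [1, 7, 768]: element count times 4 bytes per float
output_bytes = math.prod([1, 7, 768]) * struct.calcsize("<f")
print(output_bytes)  # 21504, matching binary_data_size in the response
```

The response `content-length` of 21741 is this 21504-byte tensor plus the 237-byte JSON header reported in `inference-header-content-length`.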

In [9]:
results.as_numpy('last_hidden_state')

array([[[ 0.36872923, -0.21283835,  0.5032521 , ...,  0.20257561,
         -0.14473006,  0.16335659],
        [ 0.16864908, -0.29112825,  0.43506908, ...,  0.14554416,
          0.04653071,  0.18665251],
        [ 0.02562865, -0.32240435,  0.40868905, ...,  0.10257501,
          0.07761161,  0.35088107],
        ...,
        [ 0.41528454, -0.3950774 ,  0.28445122, ...,  0.26427785,
          0.18659928, -0.6684136 ],
        [ 1.3407743 , -0.2955229 ,  0.2634831 , ...,  0.33415437,
          0.00846357, -0.1535036 ],
        [ 0.10819956,  0.10353471,  0.18187995, ...,  0.37455615,
          0.08028258, -0.06970064]]], dtype=float32)
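
The output has shape `[1, 7, 768]`: one input, 7 tokens, and a 768-dimensional vector per token. If a single vector per input is needed, one common option is to average over the token axis. The sketch below is an assumption rather than part of the original notebook: `mean_pool` is a hypothetical helper, whether mean pooling suits this particular model is not confirmed by the source, and it ignores the attention mask, so it is only reasonable for unpadded inputs like the single string used here.

```python
import numpy as np


def mean_pool(last_hidden_state):
    """Average over the token axis: [batch, tokens, hidden] -> [batch, hidden]."""
    return last_hidden_state.mean(axis=1)


# Stand-in array with the same shape as the output above. In the notebook,
# use results.as_numpy("last_hidden_state") instead.
hidden = np.random.default_rng(0).standard_normal((1, 7, 768)).astype(np.float32)
embedding = mean_pool(hidden)
print(embedding.shape)  # (1, 768)
```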