Description
Related to FastPitch/PyTorch
(https://github.com/NVIDIA/DeepLearningExamples/tree/master/PyTorch/SpeechSynthesis/FastPitch)
Describe the bug
We are using FastPitch to generate mel-spectrograms for Thai. Our inference server has two T4 GPUs, and our approach is to serve the model on both GPUs from a single Triton server, as described in the steps below. When we run only one GPU per Triton server, the TTS model generates mel-spectrograms without issue, but once we use two GPUs per Triton server we hit the following error:
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cuda:1!
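For context, this is the generic error PyTorch raises whenever an operation mixes tensors that live on different GPUs. A minimal standalone sketch of the same failure mode (an illustration only, not the FastPitch code) would be:

import torch

# a tensor pinned to the first GPU, e.g. created with a hard-coded device
weight = torch.randn(4, 4, device="cuda:0")
# an input placed on the second GPU, as happens for the model instance on gpus: [1]
x = torch.randn(4, 4, device="cuda:1")

y = x @ weight  # RuntimeError: Expected all tensors to be on the same device ...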
To Reproduce
Steps to reproduce the behavior:
- Follow the FastPitch Triton example to generate the Triton model and its config.
- Copy the model folder into the Triton model repository directory (for example, ~/models); the expected repository layout is sketched after this list.
- Edit the Triton model configuration config.pbtxt to use two instance groups:
instance_group [
  {
    count: 2
    kind: KIND_GPU
    gpus: [ 0 ]
  },
  {
    count: 2
    kind: KIND_GPU
    gpus: [ 1 ]
  }
]
- Pull the Triton server 21.05 image and run it with CUDA_VISIBLE_DEVICES=0,1 set:
docker run -it --rm --runtime=nvidia --env CUDA_VISIBLE_DEVICES=0,1 --gpus=all \
  -v ~/models:/models -p 8000:8000 -p 8001:8001 -p 8002:8002 \
  nvcr.io/nvidia/tritonserver:21.05-py3 \
  tritonserver --model-repository=/models --strict-model-config=false
- Modify and run a simple client from the client libraries (HTTP or gRPC) to infer the model; a minimal client sketch is included under "Code that relates to the issue" below.
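For reference, the docker command above mounts ~/models into the container as the Triton model repository, so it is expected to follow the standard repository layout. The model directory name fastpitch and the TorchScript file name model.pt below are assumptions; use whatever names the export step produced:

~/models
└── fastpitch
    ├── config.pbtxt        # the edited configuration above
    └── 1
        └── model.pt        # the exported FastPitch model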
Expected behavior
The runtime error shows up in the Triton client-side log as follows:
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cuda:1!
Code that relates to the issue
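A minimal Python HTTP client sketch of the kind used in the last reproduction step. The model name fastpitch, the TorchScript-style tensor names INPUT__0/OUTPUT__0, and the input shape/dtype are assumptions; adjust them to the generated config.pbtxt.

import numpy as np
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")

# dummy token IDs standing in for an encoded Thai sentence (shape/dtype assumed)
text_ids = np.random.randint(1, 100, size=(1, 128)).astype(np.int64)

inputs = [httpclient.InferInput("INPUT__0", list(text_ids.shape), "INT64")]
inputs[0].set_data_from_numpy(text_ids)
outputs = [httpclient.InferRequestedOutput("OUTPUT__0")]

# with instance groups on both GPUs, requests routed to the cuda:1 instance fail
result = client.infer(model_name="fastpitch", inputs=inputs, outputs=outputs)
mel = result.as_numpy("OUTPUT__0")
print(mel.shape)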
Environment
- Container version: nvcr.io/nvidia/tritonserver:21.05-py3 (Triton server image)
- GPUs in the system: 2x Tesla T4-16GB
- CUDA driver version: 515.65.01