
[Feature] triton backend optimization #1309

@zhyncs

Description


Motivation

As described in #1280, the Triton backend currently implemented in this repo has some performance issues: its throughput is not as good as that of the API server. In some companies, Triton Server is closely integrated with the model control platform, so an efficient Triton backend is necessary. @lvhan028 previously replied in that issue that there is currently no extra time available for this optimization.

@ispobock and I therefore plan to re-implement an efficient Triton backend. The details are as follows:

  1. Stream infer
    We will use the Triton Python backend in decoupled mode, which means starting only one model instance, similar to this: https://github.com/triton-inference-server/vllm_backend/blob/main/src/model.py
    This is how we did it internally before the vLLM backend was open sourced, and both throughput and latency met expectations. See NVIDIA Triton support vllm-project/vllm#541 (comment)
    It requires client support for streaming inference. A rough sketch of such a decoupled model is included after this list.

  2. Normal infer
    We will implement it with BLS + stream infer: multiple BLS instances in front of one model instance. It requires Triton Server >= 23.04.
    This has been discussed before in NVIDIA Triton support vllm-project/vllm#541 (comment)
    It does not require client support for streaming inference; normal synchronous inference is sufficient. A rough BLS sketch is also included after this list.
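For illustration only, here is a minimal sketch of what the decoupled-mode Python backend model from item 1 could look like. It follows the layout of the vllm_backend example linked above, but the tensor names (`prompt`, `text_output`) and the `_generate_stream` helper are placeholders rather than the actual implementation; the real model would drive the engine's streaming generator (and, like the vLLM backend, would typically handle each request in a background task instead of synchronously).

```python
# model.py -- minimal sketch of a decoupled Triton Python backend model.
# Tensor names and the generator below are placeholders, not the real backend.
import json

import numpy as np
import triton_python_backend_utils as pb_utils


class TritonPythonModel:
    def initialize(self, args):
        # config.pbtxt must set: model_transaction_policy { decoupled: True }
        self.model_config = json.loads(args["model_config"])

    def execute(self, requests):
        for request in requests:
            # In decoupled mode, responses go through the response sender
            # instead of being returned from execute().
            sender = request.get_response_sender()
            prompt = pb_utils.get_input_tensor_by_name(request, "prompt").as_numpy()[0]

            # Placeholder token stream; a real backend would iterate over the
            # engine's streaming generator here, ideally in a background task.
            for token in self._generate_stream(prompt):
                out = pb_utils.Tensor("text_output", np.array([token], dtype=np.object_))
                sender.send(pb_utils.InferenceResponse(output_tensors=[out]))

            # Signal that this request's response stream is complete.
            sender.send(flags=pb_utils.TRITONSERVER_RESPONSE_COMPLETE_FINAL)
        # Decoupled models return None from execute().
        return None

    def _generate_stream(self, prompt):
        # Stand-in for the actual inference loop.
        yield b"hello"
        yield b" world"
```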
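And a rough sketch of the BLS wrapper from item 2, assuming the streaming model above is deployed under the placeholder name `model_stream` with the same placeholder tensor names. The key piece is `exec(decoupled=True)` on a BLS `InferenceRequest` (available since Triton Server 23.04), which returns the stream of partial responses so the wrapper can collect them and answer a normal synchronous request:

```python
# bls_model.py -- rough sketch of the BLS wrapper that turns the streaming
# model into a normal (synchronous) endpoint. Model/tensor names are placeholders.
import numpy as np
import triton_python_backend_utils as pb_utils


class TritonPythonModel:
    def execute(self, requests):
        responses = []
        for request in requests:
            prompt = pb_utils.get_input_tensor_by_name(request, "prompt")

            # Forward the request to the decoupled streaming model via BLS.
            infer_request = pb_utils.InferenceRequest(
                model_name="model_stream",
                inputs=[prompt],
                requested_output_names=["text_output"],
            )

            # exec(decoupled=True) yields the stream of partial responses
            # (requires Triton Server >= 23.04).
            pieces = []
            for stream_response in infer_request.exec(decoupled=True):
                if stream_response.has_error():
                    raise pb_utils.TritonModelException(stream_response.error().message())
                out = pb_utils.get_output_tensor_by_name(stream_response, "text_output")
                if out is not None:  # the final-flag response may carry no tensor
                    pieces.append(out.as_numpy()[0])

            full_text = b"".join(pieces)
            out_tensor = pb_utils.Tensor("text_output", np.array([full_text], dtype=np.object_))
            responses.append(pb_utils.InferenceResponse(output_tensors=[out_tensor]))
        return responses
```

Multiple instances of this BLS model (instance_group count > 1 in its config.pbtxt) can then serve synchronous clients concurrently while only one instance of the decoupled model runs.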

After completing these tasks, we will provide examples for the Python client and the Java client, as these two clients are the most commonly used in enterprise settings. Performance test data will also be provided to show that throughput is comparable to that of the API server. Cheers. Stay tuned.
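For the client side, the streaming support mentioned in item 1 could look roughly like the following gRPC sketch, reusing the placeholder model and tensor names from above; the actual client examples will be provided with the implementation.

```python
# client_stream.py -- rough sketch of a streaming gRPC client with tritonclient.
# Model/tensor names follow the placeholder sketches above.
from functools import partial
import queue

import numpy as np
import tritonclient.grpc as grpcclient


def callback(result_queue, result, error):
    # Streamed results arrive asynchronously; push them to a queue.
    result_queue.put(error if error else result)


def main():
    result_queue = queue.Queue()
    prompt = np.array([b"Hello, Triton"], dtype=np.object_)

    inputs = [grpcclient.InferInput("prompt", prompt.shape, "BYTES")]
    inputs[0].set_data_from_numpy(prompt)
    outputs = [grpcclient.InferRequestedOutput("text_output")]

    client = grpcclient.InferenceServerClient("localhost:8001")
    client.start_stream(callback=partial(callback, result_queue))
    client.async_stream_infer(model_name="model_stream", inputs=inputs, outputs=outputs)
    client.stop_stream()  # close the stream once the request has been sent
    client.close()

    while not result_queue.empty():
        item = result_queue.get()
        if isinstance(item, Exception):
            raise item
        print(item.as_numpy("text_output")[0].decode())


if __name__ == "__main__":
    main()
```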

@lvhan028 @lzhangzz @AllentDan @grimoire @irexyc Do you have any suggestions? Thanks.

Related resources

No response

Additional context

No response
