
[Feature] triton backend optimization #1309

@zhyncs

Description


Motivation

As described in #1280, the Triton backend currently implemented in this repo has some performance issues: its throughput is not as good as that of the API server. In some companies, Triton Server is closely integrated with the model control platform, so an efficient Triton backend is necessary. @lvhan028 previously replied in that issue that there is currently no extra time available for this optimization.

@ispobock and I therefore plan to re-implement an efficient Triton backend. The details are as follows:

  1. Stream infer
    We will use the Triton Python backend in decoupled mode, which means starting only one model instance, similar to this: https://github.com/triton-inference-server/vllm_backend/blob/main/src/model.py
    This is how we did it internally before the vLLM backend was open sourced, and both throughput and latency met expectations. See NVIDIA Triton support vllm-project/vllm#541 (comment)
    It requires client support for streaming inference. A rough sketch of such a decoupled model is included after this list.

  2. Normal infer
    We will implement it with BLS + stream infer: multiple BLS instances in front of one model instance. It requires Triton Server >= 23.04.
    This has been discussed before in NVIDIA Triton support vllm-project/vllm#541 (comment)
    It does not require client support for streaming inference; normal synchronous inference is sufficient. A rough BLS sketch is also included after this list.
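For illustration only, here is a minimal sketch of what the decoupled-mode Python backend model from item 1 could look like. It follows the layout of the vllm_backend example linked above, but the tensor names (`prompt`, `text_output`) and the `_generate_stream` helper are placeholders rather than the actual implementation; the real model would drive the engine's streaming generator (and, like the vLLM backend, would typically handle each request in a background task instead of synchronously).

```python
# model.py -- minimal sketch of a decoupled Triton Python backend model.
# Tensor names and the generator below are placeholders, not the real backend.
import json

import numpy as np
import triton_python_backend_utils as pb_utils


class TritonPythonModel:
    def initialize(self, args):
        # config.pbtxt must set: model_transaction_policy { decoupled: True }
        self.model_config = json.loads(args["model_config"])

    def execute(self, requests):
        for request in requests:
            # In decoupled mode, responses go through the response sender
            # instead of being returned from execute().
            sender = request.get_response_sender()
            prompt = pb_utils.get_input_tensor_by_name(request, "prompt").as_numpy()[0]

            # Placeholder token stream; a real backend would iterate over the
            # engine's streaming generator here, ideally in a background task.
            for token in self._generate_stream(prompt):
                out = pb_utils.Tensor("text_output", np.array([token], dtype=np.object_))
                sender.send(pb_utils.InferenceResponse(output_tensors=[out]))

            # Signal that this request's response stream is complete.
            sender.send(flags=pb_utils.TRITONSERVER_RESPONSE_COMPLETE_FINAL)
        # Decoupled models return None from execute().
        return None

    def _generate_stream(self, prompt):
        # Stand-in for the actual inference loop.
        yield b"hello"
        yield b" world"
```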
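And a rough sketch of the BLS wrapper from item 2, assuming the streaming model above is deployed under the placeholder name `model_stream` with the same placeholder tensor names. The key piece is `exec(decoupled=True)` on a BLS `InferenceRequest` (available since Triton Server 23.04), which returns the stream of partial responses so the wrapper can collect them and answer a normal synchronous request:

```python
# bls_model.py -- rough sketch of the BLS wrapper that turns the streaming
# model into a normal (synchronous) endpoint. Model/tensor names are placeholders.
import numpy as np
import triton_python_backend_utils as pb_utils


class TritonPythonModel:
    def execute(self, requests):
        responses = []
        for request in requests:
            prompt = pb_utils.get_input_tensor_by_name(request, "prompt")

            # Forward the request to the decoupled streaming model via BLS.
            infer_request = pb_utils.InferenceRequest(
                model_name="model_stream",
                inputs=[prompt],
                requested_output_names=["text_output"],
            )

            # exec(decoupled=True) yields the stream of partial responses
            # (requires Triton Server >= 23.04).
            pieces = []
            for stream_response in infer_request.exec(decoupled=True):
                if stream_response.has_error():
                    raise pb_utils.TritonModelException(stream_response.error().message())
                out = pb_utils.get_output_tensor_by_name(stream_response, "text_output")
                if out is not None:  # the final-flag response may carry no tensor
                    pieces.append(out.as_numpy()[0])

            full_text = b"".join(pieces)
            out_tensor = pb_utils.Tensor("text_output", np.array([full_text], dtype=np.object_))
            responses.append(pb_utils.InferenceResponse(output_tensors=[out_tensor]))
        return responses
```

Multiple instances of this BLS model (instance_group count > 1 in its config.pbtxt) can then serve synchronous clients concurrently while only one instance of the decoupled model runs.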

After completing these tasks, we will provide examples for the Python client and the Java client, as these two clients are the most commonly used in enterprise settings. Performance test data will also be provided to show that throughput is comparable to that of the API server. Cheers. Stay tuned.
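For the client side, the streaming support mentioned in item 1 could look roughly like the following gRPC sketch, reusing the placeholder model and tensor names from above; the actual client examples will be provided with the implementation.

```python
# client_stream.py -- rough sketch of a streaming gRPC client with tritonclient.
# Model/tensor names follow the placeholder sketches above.
from functools import partial
import queue

import numpy as np
import tritonclient.grpc as grpcclient


def callback(result_queue, result, error):
    # Streamed results arrive asynchronously; push them to a queue.
    result_queue.put(error if error else result)


def main():
    result_queue = queue.Queue()
    prompt = np.array([b"Hello, Triton"], dtype=np.object_)

    inputs = [grpcclient.InferInput("prompt", prompt.shape, "BYTES")]
    inputs[0].set_data_from_numpy(prompt)
    outputs = [grpcclient.InferRequestedOutput("text_output")]

    client = grpcclient.InferenceServerClient("localhost:8001")
    client.start_stream(callback=partial(callback, result_queue))
    client.async_stream_infer(model_name="model_stream", inputs=inputs, outputs=outputs)
    client.stop_stream()  # close the stream once the request has been sent
    client.close()

    while not result_queue.empty():
        item = result_queue.get()
        if isinstance(item, Exception):
            raise item
        print(item.as_numpy("text_output")[0].decode())


if __name__ == "__main__":
    main()
```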

@lvhan028 @lzhangzz @AllentDan @grimoire @irexyc Do you have any suggestions? Thanks.

Related resources

No response

Additional context

No response
