When will FasterTransformer support continuous batching and PagedAttention? #696
Comments
I have used FT + Triton Server, TGI, and vLLM. The throughput of vLLM's iterative (token-level) batching is clearly higher than that of request-level batching.
The FastServe paper discusses this problem: [FastServe](Fast Distributed Inference Serving for Large Language Models)
Following
Has anybody tested vLLM's throughput compared with FasterTransformer?
Based on FasterTransformer, we have implemented an efficient inference engine, TurboMind.
I read the documentation you shared. A persistent batch that keeps the KV cache of multi-turn conversations can indeed speed up inference during a dialogue. But it seems different from continuous batching. My understanding of continuous batching is that while a batch of requests is being decoded and a new request arrives, the new request does not have to wait for all requests in the batch to finish; once enough requests in the batch have completed, it is decoded together with the still-unfinished requests.
Requests in the queue will join the batch as long as there are free slots in the persistent batch.
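For anyone trying to picture this scheduling behavior, here is a minimal sketch of iteration-level (continuous) batching with a fixed number of slots. It is illustrative only, not TurboMind's or vLLM's actual code; the `Request`, `ContinuousBatchScheduler`, and `model.step()` names, as well as `EOS_TOKEN`, are assumptions for the example.

```python
# Minimal sketch of continuous (iteration-level) batching with a persistent batch.
# All names here are hypothetical; this is not the TurboMind or vLLM implementation.
from collections import deque
from dataclasses import dataclass, field

EOS_TOKEN = 2  # assumed end-of-sequence token id

@dataclass
class Request:
    prompt_ids: list
    output_ids: list = field(default_factory=list)
    max_new_tokens: int = 128
    finished: bool = False

class ContinuousBatchScheduler:
    def __init__(self, model, max_slots: int = 8):
        self.model = model          # assumed: model.step(batch) returns one new token per request
        self.max_slots = max_slots  # size of the persistent batch
        self.waiting = deque()      # requests not yet admitted
        self.running = []           # requests currently occupying batch slots

    def add_request(self, req: Request):
        self.waiting.append(req)

    def step(self):
        # Slots freed by finished requests are reused immediately, so a new
        # request never waits for the whole batch to drain.
        self.running = [r for r in self.running if not r.finished]
        while self.waiting and len(self.running) < self.max_slots:
            self.running.append(self.waiting.popleft())
        if not self.running:
            return
        # One decoding iteration for every active request.
        new_tokens = self.model.step(self.running)
        for req, tok in zip(self.running, new_tokens):
            req.output_ids.append(tok)
            if tok == EOS_TOKEN or len(req.output_ids) >= req.max_new_tokens:
                req.finished = True

if __name__ == "__main__":
    class DummyModel:
        # Emits a fixed token per step, then EOS; purely for illustration.
        def step(self, batch):
            return [EOS_TOKEN if len(r.output_ids) >= 3 else 100 for r in batch]

    sched = ContinuousBatchScheduler(DummyModel(), max_slots=2)
    for p in ([1, 2], [3], [4, 5, 6]):
        sched.add_request(Request(prompt_ids=list(p)))
    while sched.running or sched.waiting:
        sched.step()
```

The key point is in `step()`: admission happens every decoding iteration, not once per batch, which is what distinguishes this from request-level batching.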
FasterTransformer development has transitioned to TensorRT-LLM. Continuous batching (in-flight batching) and PagedAttention are supported in TensorRT-LLM. Please give it a try.
From this article, I learned that continuous batching and PagedAttention greatly improve the inference performance of large models. I would like to know whether FasterTransformer has plans to support these two features.
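For readers unfamiliar with the second feature, here is a minimal sketch of the PagedAttention idea as described in the vLLM paper: the KV cache is split into fixed-size blocks, and each sequence keeps a block table mapping logical token positions to physical blocks, so memory is allocated on demand rather than for the full maximum sequence length. The class names and `BLOCK_SIZE` value below are illustrative assumptions, not any library's API.

```python
# Minimal sketch of PagedAttention-style KV-cache management (illustrative only).
BLOCK_SIZE = 16  # tokens per KV-cache block (assumed value)

class BlockAllocator:
    def __init__(self, num_blocks: int):
        self.free_blocks = list(range(num_blocks))

    def allocate(self) -> int:
        if not self.free_blocks:
            raise RuntimeError("out of KV-cache blocks")
        return self.free_blocks.pop()

    def free(self, block_id: int):
        self.free_blocks.append(block_id)

class SequenceKVCache:
    """Per-sequence block table: logical token position -> (physical block, offset)."""
    def __init__(self, allocator: BlockAllocator):
        self.allocator = allocator
        self.block_table = []  # physical block ids, in logical order
        self.num_tokens = 0

    def append_token(self):
        # Allocate a new block only when the last one is full, so memory is
        # reserved on demand instead of for the whole max sequence length.
        if self.num_tokens % BLOCK_SIZE == 0:
            self.block_table.append(self.allocator.allocate())
        self.num_tokens += 1

    def physical_slot(self, logical_pos: int):
        block = self.block_table[logical_pos // BLOCK_SIZE]
        return block, logical_pos % BLOCK_SIZE

if __name__ == "__main__":
    alloc = BlockAllocator(num_blocks=64)
    seq = SequenceKVCache(alloc)
    for _ in range(40):           # simulate decoding 40 tokens
        seq.append_token()
    print(seq.block_table)        # 3 blocks cover 40 tokens with BLOCK_SIZE=16
    print(seq.physical_slot(39))  # (third block id, offset 7)
```

Because blocks from finished or preempted sequences can be returned to the allocator and reused, fragmentation and over-reservation of KV-cache memory are greatly reduced, which is where most of the throughput gain comes from.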