When will FasterTransformer support continuous batching and PagedAttention? #696

Open
ppppppppig opened this issue Jun 30, 2023 · 9 comments

@ppppppppig

ppppppppig commented Jun 30, 2023

From this article, I learned that continuous batching and PagedAttention greatly improve the inference performance of large models. I would like to know whether FasterTransformer has plans to support these two features.
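
For reference, my rough understanding of the PagedAttention memory model as a minimal sketch (illustrative only; the block size and names are made up, not FasterTransformer or vLLM code): the KV cache is split into fixed-size blocks, and each sequence keeps a block table mapping its logical positions to physical blocks, so memory is allocated on demand instead of being reserved for the maximum sequence length.

```python
# Minimal sketch of the PagedAttention memory model (illustrative only).
BLOCK_SIZE = 16  # tokens per KV-cache block (made-up value)

class BlockAllocator:
    """Hands out physical KV-cache blocks from a fixed pool."""
    def __init__(self, num_blocks):
        self.free = list(range(num_blocks))

    def alloc(self):
        if not self.free:
            raise RuntimeError("out of KV-cache blocks")
        return self.free.pop()

    def release(self, block_id):
        self.free.append(block_id)

class Sequence:
    """Tracks one request's block table: logical block index -> physical block id."""
    def __init__(self):
        self.num_tokens = 0
        self.block_table = []

    def append_token(self, allocator):
        # A new physical block is allocated only when the current one is full,
        # so memory grows with the generated length instead of being pre-reserved.
        if self.num_tokens % BLOCK_SIZE == 0:
            self.block_table.append(allocator.alloc())
        self.num_tokens += 1
```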

@hudengjunai

I have used FT + Triton Server, TGI, and vLLM; the throughput of vLLM's iterative token-level batching is clearly higher than that of request-level batching.

@hudengjunai

The FastServe paper, "Fast Distributed Inference Serving for Large Language Models", discusses this problem.

@sfc-gh-jhilgart

Following

@gttiankai

Following

@lucasjinreal

Has anybody tested vLLM's throughput compared with FasterTransformer?

@lvhan028

Based on FasterTransformer, we have implemented an efficient inference engine - TurboMind

  • It supports llama and llama-2
  • It models the inference of a conversational LLM as a persistently running batch whose lifetime spans the entire serving process, called the "persistent batch", which is similar to continuous batching
    This document presents the architecture in more detail.

@ppppppppig
Author

I read the document you provided. Having the persistent batch remember the KV cache of multi-turn conversations can indeed effectively speed up inference during a conversation. But it still feels different from continuous batching. My understanding of continuous batching is that while a batch of requests is being processed, a newly arrived request does not have to wait for all requests in that batch to finish; instead, once enough requests in the batch have completed, it is processed together with the batch's unfinished requests.
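
Roughly, a minimal sketch of the scheduling loop I have in mind (hypothetical names, not the API of any particular engine): finished requests are evicted after every decoding step, and waiting requests join the running batch as soon as a slot opens, instead of waiting for the whole batch to drain.

```python
# Illustrative continuous-batching loop (hypothetical engine/request objects).
from collections import deque

def serve(engine, waiting: deque, max_batch_size: int):
    running = []
    while waiting or running:
        # Admit new requests whenever the batch has room; they do not wait
        # for the rest of the batch to finish.
        while waiting and len(running) < max_batch_size:
            running.append(waiting.popleft())

        # One token-level decoding step for the whole batch.
        engine.step(running)

        # Evict finished requests so their slots free up immediately.
        running = [r for r in running if not r.finished]
```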

@lvhan028

Requests in the queue join the batch as long as there are free slots in the persistent batch.
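
A rough sketch of that behavior (illustrative, not the TurboMind source): the slots live for the entire serving process, a queued request takes any free slot at the next step, and a finished request releases its slot.

```python
# Rough sketch of the persistent-batch slot model (not TurboMind source code).
class PersistentBatch:
    def __init__(self, num_slots):
        self.slots = [None] * num_slots  # slots exist for the whole serving process
        self.queue = []                  # requests waiting for a free slot

    def submit(self, request):
        self.queue.append(request)

    def step(self, engine):
        # A queued request joins as soon as any slot is free.
        for i in range(len(self.slots)):
            if self.slots[i] is None and self.queue:
                self.slots[i] = self.queue.pop(0)

        # Decode one step for every occupied slot.
        engine.decode([s for s in self.slots if s is not None])

        # Release the slots of finished requests.
        self.slots = [None if s is not None and s.finished else s
                      for s in self.slots]
```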

@byshiue
Collaborator

byshiue commented Oct 20, 2023

FasterTransformer development has transitioned to TensorRT-LLM. Continuous batching (in-flight batching) and PagedAttention are supported in TensorRT-LLM. Please give it a try.
