PagedAttention tracking issue #11

Closed
28 of 32 tasks
Tracked by #3
EricLBuehler opened this issue Nov 30, 2023 · 4 comments
EricLBuehler (Owner) commented Nov 30, 2023

Tracking issue for PagedAttention

General overview

  • Use the updated, refactored PagedAttention impl from vLLM (a block-table sketch follows this list)
  • Implement attention biases
  • Implement _memory_efficient_attention or an equivalent
  • Implement and use the block cache (Scheduler, CacheEngine) in inference
  • Finalize FFI linking to the compiled CUDA kernels
  • Test PagedAttention in the llama and mistral models
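
To make the vLLM-style design concrete, here is a minimal sketch of the block-table indirection at the heart of PagedAttention, assuming a fixed block size of 16 tokens. The `BlockTable` type and its fields are illustrative, not the actual mistral.rs types.

```rust
// Illustrative only: maps a sequence's logical token positions to
// physical KV-cache blocks, which need not be contiguous.
const BLOCK_SIZE: usize = 16;

struct BlockTable {
    physical_blocks: Vec<usize>, // indices into the global KV-cache pool
}

impl BlockTable {
    /// Physical (block index, in-block offset) holding the KV entry for `pos`.
    fn slot_for(&self, pos: usize) -> (usize, usize) {
        (self.physical_blocks[pos / BLOCK_SIZE], pos % BLOCK_SIZE)
    }
}

fn main() {
    // A 40-token sequence spread over three non-contiguous physical blocks.
    let table = BlockTable { physical_blocks: vec![7, 2, 11] };
    assert_eq!(table.slot_for(0), (7, 0));
    assert_eq!(table.slot_for(17), (2, 1));
    assert_eq!(table.slot_for(39), (11, 7));
    let (blk, off) = table.slot_for(17);
    println!("token 17 lives in physical block {blk}, offset {off}");
}
```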

Sampler

  • Implement according to this (a generic sampling sketch follows)
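
The linked design is not reproduced here, so as a placeholder, this is a minimal temperature + top-k sampling sketch over plain `Vec<f32>` logits. The function name, signature, and the caller-supplied uniform draw `u` are all assumptions for illustration, not the planned API.

```rust
// NOTE: minimal sketch only. Assumes temperature > 0 and that the caller
// supplies a uniform draw `u` in [0, 1), so there is no rand dependency.

/// Temperature + top-k sampling over raw logits; returns a token id.
fn sample(logits: &[f32], temperature: f32, top_k: usize, u: f32) -> usize {
    // Scale by temperature and keep the k highest-scoring candidates.
    let mut scored: Vec<(usize, f32)> = logits
        .iter()
        .map(|&l| l / temperature)
        .enumerate()
        .collect();
    scored.sort_by(|a, b| b.1.total_cmp(&a.1));
    scored.truncate(top_k.max(1));

    // Numerically stable softmax over the survivors.
    let max = scored[0].1;
    let exps: Vec<f32> = scored.iter().map(|&(_, l)| (l - max).exp()).collect();
    let total: f32 = exps.iter().sum();

    // Inverse-CDF sampling with the caller's uniform draw.
    let mut acc = 0.0;
    for (&(id, _), &e) in scored.iter().zip(&exps) {
        acc += e / total;
        if u < acc {
            return id;
        }
    }
    scored.last().unwrap().0 // floating-point fallback
}

fn main() {
    let logits = vec![2.0, 0.5, -1.0, 1.5];
    println!("sampled token id: {}", sample(&logits, 0.8, 2, 0.3));
}
```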

ModulePipeline refactoring

The ModulePipeline design needs to be reworked to integrate with the LLMEngine design. This means that a request output will be returned, and logprobs must be calculated.

  • Calculate logprobs (a log-softmax sketch follows this list)
  • Generate best_of responses (sorted by the logprobs) instead of the current n to check from
  • Use Sampler
  • Do not return a ChatChoices; return something like a RawResponse that contains the logprobs and results
  • Streaming poses a problem; it will need to be handled by the LLMEngine.
  • Overall, convert ModulePipeline to a pure sequence-generating platform
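
A minimal sketch of the logprob side of this, assuming per-step logits arrive as a flat `Vec<f32>` over the vocabulary. The helper name `log_softmax` and the surrounding shape are illustrative, not the final API.

```rust
/// Numerically stable log-softmax:
/// log p_i = x_i - max - ln(sum_j exp(x_j - max)).
fn log_softmax(logits: &[f32]) -> Vec<f32> {
    let max = logits.iter().copied().fold(f32::NEG_INFINITY, f32::max);
    let log_sum: f32 = logits.iter().map(|&x| (x - max).exp()).sum::<f32>().ln();
    logits.iter().map(|&x| x - max - log_sum).collect()
}

fn main() {
    let logits = vec![1.0, 2.0, 3.0];
    let lp = log_softmax(&logits);
    // The sampled token's logprob would be recorded in the RawResponse.
    println!("logprob of token 2: {:.4}", lp[2]);
    // Sanity check: probabilities recovered from the logprobs sum to ~1.
    let total: f32 = lp.iter().map(|&l| l.exp()).sum();
    assert!((total - 1.0).abs() < 1e-5);
}
```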

LLM Engine

Manages calling the model pipeline and linking this with the KV cache.

Tasks

  • Init the cache by profiling model memory usage at maximum-sequence-length input. See huggingface/candle#1412 (Add tracking of memory allocations to Device).
  • Write the .generate function, the decoding entrypoint. The rough call chain, from vLLM, looks like this (a structural sketch follows this list):
    1. .generate: batch the input seqs, calling .add_request
       • Write .add_request
       • Write .generate
    2. .run_engine: .step through each unfinished request, recording output
       • Write .has_unfinished_requests
       • Write .run_engine
    3. .step: first call the Scheduler to manage the seqs to swap for this decoding phase with ._schedule, then execute the model
    4. .execute_model: follow the cache ops from ._schedule, then run the model pipeline with the cache
       • The cache passed to the model pipeline looks like it has shared ownership with the CacheEngine, so this will need Arc plus my interior-mutability setup.
       • Write .execute_model
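
A structural sketch of that call chain, following vLLM's shape. Every type and method here (`Request`, `LlmEngine`, the one-token-per-step stand-in) is a placeholder; the real `.step` would involve the Scheduler and CacheEngine.

```rust
struct Request { prompt: String, output: String, finished: bool }

struct LlmEngine { requests: Vec<Request> }

impl LlmEngine {
    fn add_request(&mut self, prompt: &str) {
        self.requests.push(Request {
            prompt: prompt.to_string(),
            output: String::new(),
            finished: false,
        });
    }

    fn has_unfinished_requests(&self) -> bool {
        self.requests.iter().any(|r| !r.finished)
    }

    /// One decoding step. In the real engine, ._schedule would first pick
    /// sequences and cache ops, then .execute_model would run the pipeline.
    fn step(&mut self) {
        for r in self.requests.iter_mut().filter(|r| !r.finished) {
            r.output.push('x'); // stand-in for one decoded token
            if r.output.len() >= 4 { r.finished = true; }
        }
    }

    /// .generate batches the inputs, then runs the engine until done.
    fn generate(&mut self, prompts: &[&str]) -> Vec<String> {
        for p in prompts { self.add_request(p); }
        while self.has_unfinished_requests() { self.step(); } // .run_engine
        self.requests.iter().map(|r| format!("{} -> {}", r.prompt, r.output)).collect()
    }
}

fn main() {
    let mut engine = LlmEngine { requests: Vec::new() };
    println!("{:?}", engine.generate(&["hello", "world"]));
}
```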

Completed tasks

SequenceGroup

Sequences generated from the same prompt.

Tasks

Cache

The CacheEngine manages the KV cache (a sizing sketch follows the task list).

Tasks

  • Ensure initialization is called
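
As a back-of-the-envelope aid for cache sizing, here is a sketch of how many KV blocks fit in a given memory budget. The block size, the f16 element width, and the 7B-class model shape in `main` are all assumptions for illustration.

```rust
const BLOCK_SIZE: usize = 16; // tokens per physical block (assumed)

/// Bytes one KV block occupies:
/// 2 (K and V) * layers * heads * head_dim * BLOCK_SIZE tokens * 2 bytes (f16).
fn block_bytes(layers: usize, heads: usize, head_dim: usize) -> usize {
    2 * layers * heads * head_dim * BLOCK_SIZE * 2
}

fn main() {
    // Illustrative 7B-class shape: 32 layers, 32 heads, head_dim 128.
    let per_block = block_bytes(32, 32, 128);
    // Memory the profiling step (see the task above) reports as free.
    let free_bytes: usize = 8 * 1024 * 1024 * 1024; // 8 GiB
    println!("{} KV blocks fit ({} KiB each)", free_bytes / per_block, per_block / 1024);
}
```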

BlockSpaceManager

Manages blocks and allocation (an allocator sketch follows the task list).

Deps

  • SequenceGroup

Tasks

  • BlockAllocator
  • PhysicalTokenBlock
  • BlockSpaceManager
    • Allocation, in conjunction with a Scheduler impl
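
A minimal sketch of the BlockAllocator/PhysicalTokenBlock pair named above: a free list of fixed-size blocks with reference counts, so copy-on-write sharing (as in vLLM) stays possible. Field and method names are illustrative.

```rust
/// A fixed-size KV block; its id is its index in the allocator's `blocks`.
struct PhysicalTokenBlock { ref_count: usize }

struct BlockAllocator { free: Vec<usize>, blocks: Vec<PhysicalTokenBlock> }

impl BlockAllocator {
    fn new(num_blocks: usize) -> Self {
        Self {
            free: (0..num_blocks).rev().collect(),
            blocks: (0..num_blocks).map(|_| PhysicalTokenBlock { ref_count: 0 }).collect(),
        }
    }

    /// Hand out a free block, or None if the pool is exhausted
    /// (at which point the Scheduler would have to swap or preempt).
    fn allocate(&mut self) -> Option<usize> {
        let id = self.free.pop()?;
        self.blocks[id].ref_count = 1;
        Some(id)
    }

    /// Drop one reference; the block returns to the pool at zero.
    fn free_block(&mut self, id: usize) {
        self.blocks[id].ref_count -= 1;
        if self.blocks[id].ref_count == 0 {
            self.free.push(id);
        }
    }
}

fn main() {
    let mut alloc = BlockAllocator::new(2);
    let a = alloc.allocate().unwrap();
    let _b = alloc.allocate().unwrap();
    assert!(alloc.allocate().is_none()); // pool exhausted
    alloc.free_block(a);
    assert!(alloc.allocate().is_some()); // block a is reusable again
}
```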

Scheduler

The Scheduler schedules which blocks to swap in, swap out, or copy each decoding step (a sketch of its output follows the task list).

Deps

  • SequenceGroup
  • BlockSpaceManager

Tasks

  • SchedulerConfig
  • BlockSpaceManager (with deps: BlockAllocator)
    • Allocation via BlockSpaceManager
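
A sketch of what ._schedule might hand back to .execute_model: the sequence groups to run this step, plus the cache ops the CacheEngine must apply first. The `SchedulerOutputs` shape mirrors vLLM's; the admission policy shown is a deliberately naive stand-in.

```rust
#[derive(Debug, Default)]
struct SchedulerOutputs {
    scheduled_seq_groups: Vec<usize>,        // seq-group ids to run this step
    blocks_to_swap_in: Vec<(usize, usize)>,  // (CPU block, GPU block)
    blocks_to_swap_out: Vec<(usize, usize)>, // (GPU block, CPU block)
    blocks_to_copy: Vec<(usize, usize)>,     // (src GPU block, dst GPU block), for COW
}

/// Deliberately naive: admit one group per free GPU block, pretending each
/// admitted group needs exactly one newly allocated block this step.
fn schedule(waiting: &[usize], gpu_blocks_free: usize) -> SchedulerOutputs {
    let mut out = SchedulerOutputs::default();
    for (i, &group) in waiting.iter().enumerate() {
        if i < gpu_blocks_free {
            out.scheduled_seq_groups.push(group);
        } else {
            // Did not fit: preempt by swapping this group's (made-up) block out.
            out.blocks_to_swap_out.push((group, group));
        }
    }
    out
}

fn main() {
    // Three waiting groups, two free GPU blocks: groups 0 and 1 run, 2 swaps out.
    let out = schedule(&[0, 1, 2], 2);
    println!("{out:?}");
}
```
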
EricLBuehler changed the title from "support vllm style kv-caching using paged-attention" to "PagedAttention tracking issue" on Nov 30, 2023
EricLBuehler self-assigned this on Nov 30, 2023
EricLBuehler added the tracking label on Nov 30, 2023
EricLBuehler (Owner, Author) commented

_memory_efficient_attention is flash_attn when compiled with --features cuda, otherwise just normal scaled_dot_product_attention.
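
A sketch of that dispatch, assuming the cargo feature named `cuda` gates the flash-attn path as the comment describes. The bodies are stubs; only the cfg-based selection is the point.

```rust
// Hypothetical function name from the task list above; the real signature
// would take q, k, v tensors and an attention mask.
#[cfg(feature = "cuda")]
fn memory_efficient_attention(/* q, k, v, mask */) {
    // Would call into the flash_attn CUDA kernels here.
}

#[cfg(not(feature = "cuda"))]
fn memory_efficient_attention(/* q, k, v, mask */) {
    // Fallback: plain scaled dot-product attention,
    // softmax(q k^T / sqrt(head_dim)) v.
}

fn main() {
    memory_efficient_attention();
}
```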

EricLBuehler (Owner, Author) commented

#13 added PagedAttention.

EricLBuehler (Owner, Author) commented Dec 6, 2023

This commit adds most of the Scheduler functionality.

EricLBuehler (Owner, Author) commented

Closing this pending a rewrite of the scheduler and cache; see the scheduler branch. A new tracking issue will be opened.
