PagedAttention tracking issue #11

Closed
28 of 32 tasks
Tracked by #3
EricLBuehler opened this issue Nov 30, 2023 · 4 comments
EricLBuehler (Owner) commented Nov 30, 2023

Tracking issue for PagedAttention

General overview

  • Use the updated, refactored PagedAttention impl from vLLM (a block-table sketch follows this list)
  • Implement attention biases
  • Implement _memory_efficient_attention or an equivalent
  • Implement and use the block cache (Scheduler, CacheEngine) in inference
  • Finalize FFI linking to the compiled CUDA kernels
  • Test PagedAttention in the llama and mistral models
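
To make the vLLM-style design concrete, here is a minimal sketch of the block-table indirection at the heart of PagedAttention, assuming a fixed block size of 16 tokens. The `BlockTable` type and its fields are illustrative, not the actual mistral.rs types.

```rust
// Illustrative only: maps a sequence's logical token positions to
// physical KV-cache blocks, which need not be contiguous.
const BLOCK_SIZE: usize = 16;

struct BlockTable {
    physical_blocks: Vec<usize>, // indices into the global KV-cache pool
}

impl BlockTable {
    /// Physical (block index, in-block offset) holding the KV entry for `pos`.
    fn slot_for(&self, pos: usize) -> (usize, usize) {
        (self.physical_blocks[pos / BLOCK_SIZE], pos % BLOCK_SIZE)
    }
}

fn main() {
    // A 40-token sequence spread over three non-contiguous physical blocks.
    let table = BlockTable { physical_blocks: vec![7, 2, 11] };
    assert_eq!(table.slot_for(0), (7, 0));
    assert_eq!(table.slot_for(17), (2, 1));
    assert_eq!(table.slot_for(39), (11, 7));
    let (blk, off) = table.slot_for(17);
    println!("token 17 lives in physical block {blk}, offset {off}");
}
```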

Sampler

  • Implement according to this (a generic sampling sketch follows)
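
The linked design is not reproduced here, so as a placeholder, this is a minimal temperature + top-k sampling sketch over plain `Vec<f32>` logits. The function name, signature, and the caller-supplied uniform draw `u` are all assumptions for illustration, not the planned API.

```rust
// NOTE: minimal sketch only. Assumes temperature > 0 and that the caller
// supplies a uniform draw `u` in [0, 1), so there is no rand dependency.

/// Temperature + top-k sampling over raw logits; returns a token id.
fn sample(logits: &[f32], temperature: f32, top_k: usize, u: f32) -> usize {
    // Scale by temperature and keep the k highest-scoring candidates.
    let mut scored: Vec<(usize, f32)> = logits
        .iter()
        .map(|&l| l / temperature)
        .enumerate()
        .collect();
    scored.sort_by(|a, b| b.1.total_cmp(&a.1));
    scored.truncate(top_k.max(1));

    // Numerically stable softmax over the survivors.
    let max = scored[0].1;
    let exps: Vec<f32> = scored.iter().map(|&(_, l)| (l - max).exp()).collect();
    let total: f32 = exps.iter().sum();

    // Inverse-CDF sampling with the caller's uniform draw.
    let mut acc = 0.0;
    for (&(id, _), &e) in scored.iter().zip(&exps) {
        acc += e / total;
        if u < acc {
            return id;
        }
    }
    scored.last().unwrap().0 // floating-point fallback
}

fn main() {
    let logits = vec![2.0, 0.5, -1.0, 1.5];
    println!("sampled token id: {}", sample(&logits, 0.8, 2, 0.3));
}
```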

ModulePipeline refactoring

The ModulePipeline design needs to be reworked to integrate with the LLMEngine design. This means that a request output will be returned, and logprobs must be calculated.

  • Calculate logprobs (a log-softmax sketch follows this list)
  • Generate best_of responses (sorted by the logprobs) instead of the current n to check from
  • Use Sampler
  • Do not return a ChatChoices; return something like a RawResponse that contains the logprobs and results
  • Streaming poses a problem; it will need to be handled by the LLMEngine.
  • Overall, convert ModulePipeline to a pure sequence-generating platform
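
A minimal sketch of the logprob side of this, assuming per-step logits arrive as a flat `Vec<f32>` over the vocabulary. The helper name `log_softmax` and the surrounding shape are illustrative, not the final API.

```rust
/// Numerically stable log-softmax:
/// log p_i = x_i - max - ln(sum_j exp(x_j - max)).
fn log_softmax(logits: &[f32]) -> Vec<f32> {
    let max = logits.iter().copied().fold(f32::NEG_INFINITY, f32::max);
    let log_sum: f32 = logits.iter().map(|&x| (x - max).exp()).sum::<f32>().ln();
    logits.iter().map(|&x| x - max - log_sum).collect()
}

fn main() {
    let logits = vec![1.0, 2.0, 3.0];
    let lp = log_softmax(&logits);
    // The sampled token's logprob would be recorded in the RawResponse.
    println!("logprob of token 2: {:.4}", lp[2]);
    // Sanity check: probabilities recovered from the logprobs sum to ~1.
    let total: f32 = lp.iter().map(|&l| l.exp()).sum();
    assert!((total - 1.0).abs() < 1e-5);
}
```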

LLM Engine

Manages calling the model pipeline and linking this with the KV cache.

Tasks

  • Init the cache by profiling model memory usage at maximum-sequence-length input. See huggingface/candle#1412 (Add tracking of memory allocations to Device).
  • Write the .generate function, the decoding entrypoint. The rough call chain, from vLLM, looks like this (a structural sketch follows this list):
    1. .generate: batch the input seqs, calling .add_request
       • Write .add_request
       • Write .generate
    2. .run_engine: .step through each unfinished request, recording output
       • Write .has_unfinished_requests
       • Write .run_engine
    3. .step: first call the Scheduler to manage the seqs to swap for this decoding phase with ._schedule, then execute the model
    4. .execute_model: follow the cache ops from ._schedule, then run the model pipeline with the cache
       • The cache passed to the model pipeline looks like it has shared ownership with the CacheEngine, so this will need Arc plus my interior-mutability setup.
       • Write .execute_model
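
A structural sketch of that call chain, following vLLM's shape. Every type and method here (`Request`, `LlmEngine`, the one-token-per-step stand-in) is a placeholder; the real `.step` would involve the Scheduler and CacheEngine.

```rust
struct Request { prompt: String, output: String, finished: bool }

struct LlmEngine { requests: Vec<Request> }

impl LlmEngine {
    fn add_request(&mut self, prompt: &str) {
        self.requests.push(Request {
            prompt: prompt.to_string(),
            output: String::new(),
            finished: false,
        });
    }

    fn has_unfinished_requests(&self) -> bool {
        self.requests.iter().any(|r| !r.finished)
    }

    /// One decoding step. In the real engine, ._schedule would first pick
    /// sequences and cache ops, then .execute_model would run the pipeline.
    fn step(&mut self) {
        for r in self.requests.iter_mut().filter(|r| !r.finished) {
            r.output.push('x'); // stand-in for one decoded token
            if r.output.len() >= 4 { r.finished = true; }
        }
    }

    /// .generate batches the inputs, then runs the engine until done.
    fn generate(&mut self, prompts: &[&str]) -> Vec<String> {
        for p in prompts { self.add_request(p); }
        while self.has_unfinished_requests() { self.step(); } // .run_engine
        self.requests.iter().map(|r| format!("{} -> {}", r.prompt, r.output)).collect()
    }
}

fn main() {
    let mut engine = LlmEngine { requests: Vec::new() };
    println!("{:?}", engine.generate(&["hello", "world"]));
}
```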

Completed tasks

SequenceGroup

Sequences generated from the same prompt.

Tasks

Cache

The CacheEngine manages the KV cache (a sizing sketch follows the task list).

Tasks

  • Ensure initialization is called
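
As a back-of-the-envelope aid for cache sizing, here is a sketch of how many KV blocks fit in a given memory budget. The block size, the f16 element width, and the 7B-class model shape in `main` are all assumptions for illustration.

```rust
const BLOCK_SIZE: usize = 16; // tokens per physical block (assumed)

/// Bytes one KV block occupies:
/// 2 (K and V) * layers * heads * head_dim * BLOCK_SIZE tokens * 2 bytes (f16).
fn block_bytes(layers: usize, heads: usize, head_dim: usize) -> usize {
    2 * layers * heads * head_dim * BLOCK_SIZE * 2
}

fn main() {
    // Illustrative 7B-class shape: 32 layers, 32 heads, head_dim 128.
    let per_block = block_bytes(32, 32, 128);
    // Memory the profiling step (see the task above) reports as free.
    let free_bytes: usize = 8 * 1024 * 1024 * 1024; // 8 GiB
    println!("{} KV blocks fit ({} KiB each)", free_bytes / per_block, per_block / 1024);
}
```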

BlockSpaceManager

Manages blocks and allocation (an allocator sketch follows the task list).

Deps

  • SequenceGroup

Tasks

  • BlockAllocator
  • PhysicalTokenBlock
  • BlockSpaceManager
    • Allocation, in conjunction with a Scheduler impl
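
A minimal sketch of the BlockAllocator/PhysicalTokenBlock pair named above: a free list of fixed-size blocks with reference counts, so copy-on-write sharing (as in vLLM) stays possible. Field and method names are illustrative.

```rust
/// A fixed-size KV block; its id is its index in the allocator's `blocks`.
struct PhysicalTokenBlock { ref_count: usize }

struct BlockAllocator { free: Vec<usize>, blocks: Vec<PhysicalTokenBlock> }

impl BlockAllocator {
    fn new(num_blocks: usize) -> Self {
        Self {
            free: (0..num_blocks).rev().collect(),
            blocks: (0..num_blocks).map(|_| PhysicalTokenBlock { ref_count: 0 }).collect(),
        }
    }

    /// Hand out a free block, or None if the pool is exhausted
    /// (at which point the Scheduler would have to swap or preempt).
    fn allocate(&mut self) -> Option<usize> {
        let id = self.free.pop()?;
        self.blocks[id].ref_count = 1;
        Some(id)
    }

    /// Drop one reference; the block returns to the pool at zero.
    fn free_block(&mut self, id: usize) {
        self.blocks[id].ref_count -= 1;
        if self.blocks[id].ref_count == 0 {
            self.free.push(id);
        }
    }
}

fn main() {
    let mut alloc = BlockAllocator::new(2);
    let a = alloc.allocate().unwrap();
    let _b = alloc.allocate().unwrap();
    assert!(alloc.allocate().is_none()); // pool exhausted
    alloc.free_block(a);
    assert!(alloc.allocate().is_some()); // block a is reusable again
}
```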

Scheduler

The Scheduler schedules which blocks to swap in, swap out, or copy each decoding step (a sketch of its output follows the task list).

Deps

  • SequenceGroup
  • BlockSpaceManager

Tasks

  • SchedulerConfig
  • BlockSpaceManager (with deps: BlockAllocator)
    • Allocation via BlockSpaceManager
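
A sketch of what ._schedule might hand back to .execute_model: the sequence groups to run this step, plus the cache ops the CacheEngine must apply first. The `SchedulerOutputs` shape mirrors vLLM's; the admission policy shown is a deliberately naive stand-in.

```rust
#[derive(Debug, Default)]
struct SchedulerOutputs {
    scheduled_seq_groups: Vec<usize>,        // seq-group ids to run this step
    blocks_to_swap_in: Vec<(usize, usize)>,  // (CPU block, GPU block)
    blocks_to_swap_out: Vec<(usize, usize)>, // (GPU block, CPU block)
    blocks_to_copy: Vec<(usize, usize)>,     // (src GPU block, dst GPU block), for COW
}

/// Deliberately naive: admit one group per free GPU block, pretending each
/// admitted group needs exactly one newly allocated block this step.
fn schedule(waiting: &[usize], gpu_blocks_free: usize) -> SchedulerOutputs {
    let mut out = SchedulerOutputs::default();
    for (i, &group) in waiting.iter().enumerate() {
        if i < gpu_blocks_free {
            out.scheduled_seq_groups.push(group);
        } else {
            // Did not fit: preempt by swapping this group's (made-up) block out.
            out.blocks_to_swap_out.push((group, group));
        }
    }
    out
}

fn main() {
    // Three waiting groups, two free GPU blocks: groups 0 and 1 run, 2 swaps out.
    let out = schedule(&[0, 1, 2], 2);
    println!("{out:?}");
}
```
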
EricLBuehler changed the title from "support vllm style kv-caching using paged-attention" to "PagedAttention tracking issue" on Nov 30, 2023
EricLBuehler self-assigned this on Nov 30, 2023
EricLBuehler added the tracking label on Nov 30, 2023
EricLBuehler (Owner, Author) commented

_memory_efficient_attention is flash_attn when compiled with --features cuda, otherwise just normal scaled_dot_product_attention.
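
A sketch of that dispatch, assuming the cargo feature named `cuda` gates the flash-attn path as the comment describes. The bodies are stubs; only the cfg-based selection is the point.

```rust
// Hypothetical function name from the task list above; the real signature
// would take q, k, v tensors and an attention mask.
#[cfg(feature = "cuda")]
fn memory_efficient_attention(/* q, k, v, mask */) {
    // Would call into the flash_attn CUDA kernels here.
}

#[cfg(not(feature = "cuda"))]
fn memory_efficient_attention(/* q, k, v, mask */) {
    // Fallback: plain scaled dot-product attention,
    // softmax(q k^T / sqrt(head_dim)) v.
}

fn main() {
    memory_efficient_attention();
}
```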

EricLBuehler (Owner, Author) commented

#13 added PagedAttention.

EricLBuehler (Owner, Author) commented Dec 6, 2023

This commit adds most of the Scheduler functionality.

EricLBuehler (Owner, Author) commented

Closing this pending a rewrite of the scheduler and cache; see the scheduler branch. A new tracking issue will be opened.
