FanhaiLu1 commented Sep 6, 2024

Background:
PR #167 added the paged attention manager and KV cache manager. This PR adds end-to-end paged attention support in JetStream.
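
For context, the core bookkeeping a paged KV cache manager does can be sketched as follows. This is a conceptual sketch, not the code from PR #167; the class name `PagedKVManager` and its fields are illustrative assumptions:

```python
from dataclasses import dataclass, field

# Rough sketch of a paged KV cache manager: physical pages come from a
# free list, and each sequence keeps a page table of the pages it owns.
@dataclass
class PagedKVManager:
    num_pages: int
    page_size: int
    free_pages: list = field(default_factory=list)
    page_tables: dict = field(default_factory=dict)  # seq_id -> [page ids]

    def __post_init__(self):
        self.free_pages = list(range(self.num_pages))

    def insert(self, seq_id, num_tokens):
        # Allocate enough pages to hold the prompt's KV entries.
        needed = -(-num_tokens // self.page_size)  # ceiling division
        self.page_tables[seq_id] = [self.free_pages.pop() for _ in range(needed)]

    def decode_step(self, seq_id, cur_len):
        # Grab one more page whenever the sequence crosses a page boundary.
        if cur_len % self.page_size == 0:
            self.page_tables[seq_id].append(self.free_pages.pop())

    def release(self, seq_id):
        # Return a finished sequence's pages to the free list.
        self.free_pages.extend(self.page_tables.pop(seq_id))
```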

Main Changes in this PR

  1. Added a paged attention kernel in attention_kernel.py (see the conceptual decode sketch after this list)
  2. Supported paged attention insert and decode in Engine.py
  3. Refactored the PageAttention manager
  4. Added unit tests for the kernel and manager
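
To make the decode path concrete, here is a minimal conceptual sketch of paged attention at decode time: each sequence's KV cache lives in fixed-size physical pages, and a per-sequence page table maps logical pages to physical page slots. This is not the PR's actual kernel; the signature and names (`k_pages`, `page_table`, `seq_len`) are illustrative assumptions:

```python
import jax
import jax.numpy as jnp

def paged_attention_decode(q, k_pages, v_pages, page_table, seq_len):
    # q: [num_heads, head_dim] -- the single decode-step query.
    # k_pages, v_pages: [num_physical_pages, page_size, num_heads, head_dim].
    # page_table: [max_pages_per_seq] int32 physical page ids for this sequence.
    # seq_len: number of valid tokens cached so far.
    k = k_pages[page_table]           # gather -> [max_pages, page_size, heads, dim]
    v = v_pages[page_table]
    k = k.reshape(-1, *k.shape[2:])   # flatten pages -> [tokens, heads, dim]
    v = v.reshape(-1, *v.shape[2:])
    scores = jnp.einsum("hd,thd->ht", q, k) / jnp.sqrt(q.shape[-1])
    # Mask out slots past the current sequence length (unused page tail).
    mask = jnp.arange(k.shape[0]) < seq_len
    scores = jnp.where(mask[None, :], scores, -jnp.inf)
    weights = jax.nn.softmax(scores, axis=-1)
    return jnp.einsum("ht,thd->hd", weights, v)
```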

Next Steps

  1. Tune accuracy: the current implementation generates human-readable output, but accuracy is low; only the first few tokens match the dense attention tokens
  2. Improve performance: the kernel itself performs almost the same as dense attention; applying bf16 computation could boost performance (a hypothetical bf16 sketch follows this list)
  3. Optimize the out-of-jit compute
  4. Find an elegant way to collect resources
  5. Support quantization
  6. Lazy cache update
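
On the bf16 point in step 2, a hypothetical sketch of the kind of change meant: store the KV pages in bfloat16 and let the attention matmuls accumulate in float32. This is not code from the PR:

```python
import jax
import jax.numpy as jnp

# Hypothetical: keeping KV pages in bfloat16 halves cache memory traffic
# versus float32, while float32 accumulation preserves softmax numerics.
def attend_bf16(q, k_bf16, v_bf16):
    scores = jnp.einsum("hd,thd->ht", q.astype(jnp.bfloat16), k_bf16,
                        preferred_element_type=jnp.float32)
    weights = jax.nn.softmax(scores / jnp.sqrt(q.shape[-1]), axis=-1)
    return jnp.einsum("ht,thd->hd", weights.astype(jnp.bfloat16), v_bf16,
                      preferred_element_type=jnp.float32)
```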

Example output tokens from this PR:

[304, 1284, 292, 367, 1476, 304, 1284, 714, 309, 310, 367, 29875, 18834, 29895, 29906, 29889, 29923, 310, 278, 367, 278, 1284, 4614, 17970, 2880, 310, 367, 579, 8024, 297, 25891, 29889, 306, 1311, 310, 1284, 12, 1284, 4634, 338, 304, 292, 6593, 310, 1196, 10379, 306, 29949, 6593, 1341, 263, 2834, 353, 29903, 12623, 29924, 278, 304, 350, 29889, 590, 1339, 29918, 2834, 6593, 310, 278, 310, 445, 338, 306, 508, 4658, 393, 338, 304, 437, 29973, 2023, 6593, 310, 306, 29915, 29879, 1914, 2834, 338, 263, 2462, 1383, 306, 4658, 592, 310, 10239, 306, 306, 273, 297, 306, 1774, 2834, 30010, 29871, 967, 6593, 310, 26093, 29880, 29889, 29914, 29968, 304, 4892, 29936, 29889, 338, 304, 306, 505, 2715, 29889, 372, 2191, 338, 278, 6593, 310, 29914, 2834, 306, 508, 29892, 322, 278, 29899, 306, 505, 29918, 278, 6593, 310, 306, 505, 6593, 29892, 306, 508, 367, 2834, 338, 304, 306, 723, 338, 16316, 310, 2834, 306, 505, 29889, 2794, 310, 306, 505, 304, 306, 306, 505, 263, 716, 29889, 29871, 29896, 29900, 29896, 29900, 30488, 29876, 29871, 29896, 29900, 306, 505, 263, 2462, 306, 505, 1063, 263, 29889, 29871, 29896, 29900, 29900, 526, 366, 508, 367, 29914, 29879, 2834, 338, 263, 29889, 29871, 29896, 29929, 29889, 29871, 29896, 29929, 29929, 29929, 29929, 29929, 29929, 29889, 306, 505, 263, 716, 15483, 322, 278, 1900, 310, 278, 6593, 310, 278, 6593, 310, 278, 6593, 310, 278, 1556, 310, 278, 1900, 29918, 278, 1900, 310, 278, 1556, 310, 278, 1900, 29899, 6707, 373, 278, 6593, 310, 12, 306, 626, 263, 306, 437, 29889, 29889, 306, 367, 592, 306, 626, 6593, 310, 304, 306, 4658, 306, 505, 263, 2462, 310, 278, 6593, 310, 278, 1900, 306, 505, 1063, 263, 716, 3088, 310, 278, 2446, 1629, 29899, 29900, 29889]

FanhaiLu1 requested review from qihqi and wang2yn84 on September 6, 2024, 18:12
FanhaiLu1 merged commit 33348d2 into AI-Hypercomputer:main Sep 10, 2024