Replies: 2 comments
Questions:
@ipiszy Hi, I'm interested in |
This is a discussion of FP8 blockwise scaling in FAv3. Please see this doc for details. The code has been upstreamed to ipiszy/fp8_scaling_recipe.
Motivation
The existing quantization recipe supported by the FAv3 FP8 attention kernel is per-KV-head scaling. This approach gives good kernel performance, but it has two potential drawbacks: 1) for extremely long contexts, per-KV-head granularity may be too coarse and yield unsatisfactory numerics; 2) for long decode or multi-turn cases where KVs are appended dynamically, updating per-KV-head scales on the fly is painful. Blockwise scaling is useful when static per-KV-head scaling is not enough to meet numerical requirements.
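To make the granularity argument concrete, here is a minimal pure-Python sketch that compares one scale per KV head with one scale per block of tokens. It is not the FAv3 implementation: the helper names, the toy block size of 4, and the `amax / FP8_MAX` scale convention are all assumptions chosen for illustration, and the actual block shape and scale layout are described in the linked doc.

```python
FP8_E4M3_MAX = 448.0  # largest finite magnitude representable in FP8 E4M3


def amax_scale(values):
    """Scale that maps the largest magnitude in `values` onto the FP8 range.

    Assumed convention: quantized = value / scale.
    """
    amax = max((abs(v) for v in values), default=0.0)
    return amax / FP8_E4M3_MAX if amax > 0.0 else 1.0


def per_head_scale(kv_head):
    """One scale for the whole sequence of a KV head (list of token rows)."""
    return amax_scale([v for row in kv_head for v in row])


def blockwise_scales(kv_head, block_size):
    """One scale per block of `block_size` consecutive tokens (hypothetical layout)."""
    return [
        amax_scale([v for row in kv_head[i:i + block_size] for v in row])
        for i in range(0, len(kv_head), block_size)
    ]


# Toy KV head: 8 tokens, head dim 2, with a single outlier at token 5.
kv = [[0.5, -0.25]] * 5 + [[112.0, 0.5]] + [[0.5, -0.25]] * 2

head_scale = per_head_scale(kv)                     # dominated by the outlier
block_scales = blockwise_scales(kv, block_size=4)   # outlier only affects block 1

# Appending a new full block adds one scale and leaves the earlier ones untouched.
kv_appended = kv + [[2.0, 1.0]] * 4
appended_scales = blockwise_scales(kv_appended, block_size=4)
```

In this toy case the outlier forces the per-head scale up for every token, while under blockwise scaling the first block keeps a much smaller scale and therefore finer resolution. The append step also shows why dynamically growing KV is easier to handle: existing full blocks never need to be rescaled.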
Design
See the doc for details.
Benchmark Results
Overall, the FP8 attention kernel reduces latency to as low as 62% of the BF16 attention kernel's latency for prefill and as low as 52% for decode, with no performance degradation at short context lengths. See the doc for details.
Current Status
The current code supports only fixed sequence lengths. More tests (and bug fixes) are needed to support variable sequence lengths, paged KV, etc.