Enable Indexer cache for DS v3.2 decoding #3529
Conversation
Thanks for enabling decoding for DeepSeek Sparse Attention and adding the specialized encoding!
Really appreciate the comprehensive testing for prefill and generation (across unit tests, standalone decoding, and the API server). It is impressive that you reproduced a reasonable MMLU-Pro result at large scale with limited resources.
This pull request enables the Indexer cache for DeepSeek V3.2 decoding, which is a critical component for sparse attention bringup in MaxText. It also introduces necessary encoding/decoding logic for DeepSeek-V3.2's specialized output format, including thinking/reasoning blocks, and updates the API server to handle these new fields.
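To make the "thinking/reasoning blocks" handling concrete, here is a minimal, hypothetical sketch of how an API server might split a model response into `reasoning_content` and `content`. It assumes DeepSeek-style output wraps the reasoning trace in `<think>...</think>` tags; the function name and tag format are illustrative assumptions, not the actual MaxText code.

```python
import re

def split_reasoning(text: str):
    # Hypothetical post-processing step: if the model output begins with a
    # <think>...</think> block, split it into (reasoning_content, content);
    # otherwise the whole text is the final answer and reasoning is None.
    match = re.match(r"\s*<think>(.*?)</think>\s*(.*)", text, re.DOTALL)
    if match:
        return match.group(1).strip(), match.group(2).strip()
    return None, text.strip()
```

A server could then populate a `ChatMessage` with both fields, leaving `reasoning_content` unset when no thinking block is present.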
🔍 General Feedback
- Reasoning Content: The addition of `reasoning_content` to `ChatMessage` is a great improvement for supporting modern LLMs that use separate thinking blocks.
- Cache Implementation: The Indexer KV cache implementation follows the existing MLA patterns well and correctly handles masking for uninitialized slots during decoding.
- Model Compatibility: The model name checks in the API server might be too restrictive, potentially missing the official Hugging Face model names. Consider using more robust string matching as suggested.
- Testing: The new unit tests effectively verify both prefill and autoregressive modes for the indexer, covering different sequence length scenarios.
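The "more robust string matching" suggested for the model name checks could look like the following sketch. The helper name, the normalization scheme, and the example repo id are illustrative assumptions, not the API server's actual code:

```python
def is_deepseek_v32(model_name: str) -> bool:
    # Hypothetical matcher: normalize the name so short aliases
    # ("deepseek3.2") and official Hugging Face repo ids
    # (e.g. "deepseek-ai/DeepSeek-V3.2-Exp") are both recognized,
    # instead of comparing against one exact string.
    normalized = (
        model_name.lower().replace("-", "").replace("_", "").replace(".", "")
    )
    return "deepseekv32" in normalized or "deepseek32" in normalized
```

Matching on a normalized substring keeps the check stable across naming variants (case, hyphens, version punctuation) without hard-coding every alias.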
Rohan-Bierneni left a comment
Thank you for making this change! As we discussed in the previous PR, there are some other optimizations that could be made to the cache, but for functionality, reusing the KV cache for the indexer was a good choice.
Also, thank you for adding tests and getting MMLU-Pro to work with the api_server.
LGTM!
Description
This is a clean version of the previous PR after rebase.
Enable the Indexer cache for DS v3.2 decoding, to unblock the eval benchmark for the DS v3.2 model with sparse attention bringup.
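A minimal numpy sketch of the indexer-cache idea described above: a per-slot validity mask lets decoding exclude uninitialized cache entries from the indexer's scoring. This is illustrative only, not the MaxText implementation; the function signatures and shapes are assumptions.

```python
import numpy as np

def init_indexer_cache(batch, max_len, num_heads, head_dim):
    # Zero-filled indexer key cache plus a per-slot validity mask so
    # uninitialized entries can be masked out during decode.
    cache = np.zeros((batch, max_len, num_heads, head_dim), dtype=np.float32)
    valid = np.zeros((batch, max_len), dtype=bool)
    return cache, valid

def update_indexer_cache(cache, valid, key, pos):
    # Write this decode step's indexer key at position `pos` and mark
    # the slot as initialized.
    cache[:, pos] = key
    valid[:, pos] = True
    return cache, valid

def indexer_scores(cache, valid, query):
    # Dot-product score per cached position and head; uninitialized
    # slots are set to -inf so they never enter the sparse top-k
    # selection of attended positions.
    scores = np.einsum("blhd,bhd->blh", cache, query)
    return np.where(valid[:, :, None], scores, -np.inf)
```

In the real model the cache would be a sharded JAX array updated functionally inside the decode step, but the masking logic is the same: only slots written so far contribute to indexer scores.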
- `init_indexer_cache` & `update_indexer_cache` for the indexer cache
- `encoding_dsv32.py` from HF files to enable specific encoding for V3.2

Tests
max_prefill_predict_length=3072 max_target_length=4096: link

Checklist
Before submitting this PR, please make sure (put X in square brackets):
`gemini-review` label.