[None][fix] synchronize MLA cache reuse fallback metadata#14049
[None][fix] synchronize MLA cache reuse fallback metadata#14049DhineshPonnarasan wants to merge 6 commits into
Conversation
Signed-off-by: Dhinesh Ponnarasan <dhineshponnarasan@gmail.com>
📝 WalkthroughWalkthroughThe PR synchronizes KV cache reuse configuration with model engine runtime flags in MLA fallback branches. When unsupported GPU SM versions or KV quantization algorithms trigger cache reuse disablement, both the KV cache manager config and attention runtime metadata are now updated consistently. ChangesMLA Cache Reuse Synchronization
🎯 2 (Simple) | ⏱️ ~12 minutes 🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Tip 💬 Introducing Slack Agent: The best way for teams to turn conversations into code.Slack Agent is built on CodeRabbit's deep understanding of your code, so your team can collaborate across the entire SDLC without losing context.
Built for teams:
One agent for your entire SDLC. Right inside Slack. Comment |
There was a problem hiding this comment.
🧹 Nitpick comments (1)
tests/unittest/_torch/executor/test_py_executor_creator_mla_cache_reuse_sync.py (1)
92-134: ⚡ Quick winAdd one draft-model regression case to lock the two-engine sync contract.
Current tests exercise
speculative_config=None, so they won’t catch regressions where only the main engine’s runtime flag is updated. Please add a case with a draft model enabled and assert draft runtime cache-reuse sync too.As per coding guidelines: “Assess whether new/changed tests cover happy path, important edge cases, and failure modes relevant to the feature or fix.”
Also applies to: 137-208, 211-241
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/unittest/_torch/executor/test_py_executor_creator_mla_cache_reuse_sync.py` around lines 92 - 134, The tests only exercise speculative_config=None, so add a regression case that enables a draft model path and verifies cache-reuse synchronization for both engines; modify or overload the _make_llm_args helper (or add a new helper variant) to return a SimpleNamespace with speculative_config set to a non-None draft configuration (e.g., a SimpleNamespace indicating draft enabled) and then add an assertion in the new test that the draft engine's runtime cache-reuse flag/state is updated in sync with the main engine (referencing _make_llm_args, kv_cache_config.enable_block_reuse, and speculative_config) to lock the two-engine sync contract. Ensure analogous additions are made for the other ranges noted (lines 137-208 and 211-241) to cover both happy-path and draft-case regressions.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Nitpick comments:
In
`@tests/unittest/_torch/executor/test_py_executor_creator_mla_cache_reuse_sync.py`:
- Around line 92-134: The tests only exercise speculative_config=None, so add a
regression case that enables a draft model path and verifies cache-reuse
synchronization for both engines; modify or overload the _make_llm_args helper
(or add a new helper variant) to return a SimpleNamespace with
speculative_config set to a non-None draft configuration (e.g., a
SimpleNamespace indicating draft enabled) and then add an assertion in the new
test that the draft engine's runtime cache-reuse flag/state is updated in sync
with the main engine (referencing _make_llm_args,
kv_cache_config.enable_block_reuse, and speculative_config) to lock the
two-engine sync contract. Ensure analogous additions are made for the other
ranges noted (lines 137-208 and 211-241) to cover both happy-path and draft-case
regressions.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 371dd3a2-f10d-410e-b406-5d4685af671b
📒 Files selected for processing (2)
tensorrt_llm/_torch/pyexecutor/py_executor_creator.pytests/unittest/_torch/executor/test_py_executor_creator_mla_cache_reuse_sync.py
- Add docstrings to all mock classes (_DummyCalibrator, _DummyResourceManager, _DummyPyExecutor, _DummyKvCacheCreator, _DummyModelEngine) - Add docstrings to helper functions (_make_llm_args, _run_create_py_executor) - Add detailed docstrings to all three test functions describing the invariant being verified - Improves docstring coverage to meet 80% threshold per CodeRabbit pre-merge checks Signed-off-by: Dhinesh Ponnarasan <dhineshponnarasan@gmail.com>
…hineshPonnarasan/TensorRT-LLM into fix/13939-mla-cache-reuse-sync Signed-off-by: Dhinesh Ponnarasan <dhineshponnarasan@gmail.com>
f0ccc16 to
19c04e7
Compare
|
@DhineshPonnarasan, |
Signed-off-by: Dhinesh Ponnarasan <dhineshponnarasan@gmail.com>
Hi @karljang, thanks for the heads-up. I reproduced the pre-commit failure locally and fixed it by applying ruff formatting to test_py_executor_creator_mla_cache_reuse_sync.py. I then re-ran both checks locally:
I pushed the fix in commit 15be36d to this PR branch. |
|
/bot run --disable-fail-fast |
|
PR_Github #50414 [ run ] triggered by Bot. Commit: |
|
PR_Github #50414 [ run ] completed with state
|
|
/bot run --disable-fail-fast |
|
PR_Github #51087 [ run ] triggered by Bot. Commit: |
|
PR_Github #51087 [ run ] completed with state
|
|
/bot run --disable-fail-fast |
|
PR_Github #51127 [ run ] triggered by Bot. Commit: |
|
PR_Github #51127 [ run ] completed with state |
|
/bot run --disable-fail-fast |
|
PR_Github #51159 [ run ] triggered by Bot. Commit: |
|
PR_Github #51159 [ run ] completed with state
|
Background / Motivation
Issue #13939 reports a runtime correctness bug in MLA fallback handling inside create_py_executor.
When MLA fallback disables KV cache block reuse, the configuration state was updated but attention runtime metadata could remain stale.
This can create split-brain behavior between scheduler/KV cache manager state and attention runtime feature state.
Summary
This PR fixes MLA cache reuse state synchronization in create_py_executor by ensuring that every post-engine MLA fallback path that sets enable_block_reuse to False also updates model_engine runtime metadata through the existing helper path.
The fix is intentionally minimal and follows existing repository patterns already used in other fallback branches.
The PR also adds focused regression tests covering unsupported SM fallback, unsupported KV quantization fallback, and supported MLA configuration to verify invariant preservation in both negative and positive paths.
Scope
This PR addresses a single concern only: MLA KV cache reuse synchronization correctness in Python executor initialization.
Code Changes
Invariant Protected
For post-engine MLA fallback transitions:
must always remain synchronized.
Functional / Performance Impact
Risk Assessment
Low risk:
Testing
Executed locally:
Test coverage added:
Environment note:
Related Issue
Fixes #13939
Summary by CodeRabbit