
[None][fix] synchronize MLA cache reuse fallback metadata#14049

Open
DhineshPonnarasan wants to merge 6 commits into NVIDIA:main from DhineshPonnarasan:fix/13939-mla-cache-reuse-sync


@DhineshPonnarasan DhineshPonnarasan commented May 12, 2026

Background / Motivation

Issue #13939 reports a runtime correctness bug in MLA fallback handling inside create_py_executor.
When MLA fallback disables KV cache block reuse, the configuration state was updated but attention runtime metadata could remain stale.
This can create split-brain behavior between scheduler/KV cache manager state and attention runtime feature state.

Summary

This PR fixes MLA cache reuse state synchronization in create_py_executor by ensuring that every post-engine MLA fallback path that sets enable_block_reuse to False also updates model_engine runtime metadata through the existing helper path.
The fix is intentionally minimal and follows existing repository patterns already used in other fallback branches.
The PR also adds focused regression tests covering unsupported SM fallback, unsupported KV quantization fallback, and supported MLA configuration to verify invariant preservation in both negative and positive paths.

Scope

This PR addresses a single concern only: MLA KV cache reuse synchronization correctness in Python executor initialization.

Code Changes

  1. Runtime fix in py_executor_creator.py
  • In MLA unsupported SM fallback branch, after disabling kv_cache_config.enable_block_reuse, call _set_model_engines_cache_reuse([model_engine, draft_model_engine], False).
  • In MLA unsupported KV quantization fallback branch, after disabling kv_cache_config.enable_block_reuse, call _set_model_engines_cache_reuse([model_engine, draft_model_engine], False).
  • Reuses the existing helper, which iterates over the engines and safely handles a draft_model_engine of None.
  2. Regression tests in test_py_executor_creator_mla_cache_reuse_sync.py
  • test_mla_unsupported_sm_fallback_syncs_cache_reuse
  • test_mla_unsupported_kv_quant_fallback_syncs_cache_reuse
  • test_mla_supported_configuration_preserves_cache_reuse
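
The runtime fix listed above can be sketched roughly as follows. This is a minimal illustration and not the actual code in py_executor_creator.py: the helper body, the apply_mla_fallback wrapper, and the SimpleNamespace stand-ins for the engines and config are assumptions made for the example. Only the names _set_model_engines_cache_reuse, enable_block_reuse, and attn_runtime_features.cache_reuse come from this PR description.

```python
from types import SimpleNamespace


def _set_model_engines_cache_reuse(model_engines, enabled):
    """Stand-in for the existing helper; the real body may differ."""
    for engine in model_engines:
        if engine is None:  # assumed: no draft engine without speculative decoding
            continue
        engine.attn_runtime_features.cache_reuse = enabled


def apply_mla_fallback(kv_cache_config, model_engine, draft_model_engine, supported):
    """Hypothetical wrapper standing in for one MLA fallback branch."""
    if not supported and kv_cache_config.enable_block_reuse:
        kv_cache_config.enable_block_reuse = False
        # The step this PR adds: propagate the change to runtime metadata.
        _set_model_engines_cache_reuse([model_engine, draft_model_engine], False)


def make_engine():
    """Build a fake engine whose runtime flag starts enabled."""
    return SimpleNamespace(attn_runtime_features=SimpleNamespace(cache_reuse=True))


kv_cache_config = SimpleNamespace(enable_block_reuse=True)
model_engine = make_engine()
apply_mla_fallback(kv_cache_config, model_engine, None, supported=False)
```

After the call, both the config flag and the engine's runtime flag are False, and the None draft engine is skipped without error.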

Invariant Protected

For post-engine MLA fallback transitions:

  • kv_cache_config.enable_block_reuse
  • model_engine.attn_runtime_features.cache_reuse

must always remain synchronized.
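
One way to express this invariant as a check is sketched below. The function cache_reuse_in_sync and the SimpleNamespace objects are hypothetical and exist only for illustration; the two attribute names are taken from the list above.

```python
from types import SimpleNamespace


def cache_reuse_in_sync(kv_cache_config, model_engines):
    """Return True if every non-None engine agrees with the KV cache config."""
    return all(
        engine.attn_runtime_features.cache_reuse == kv_cache_config.enable_block_reuse
        for engine in model_engines
        if engine is not None
    )


def make_engine(cache_reuse):
    """Build a fake engine with the given runtime cache_reuse flag."""
    return SimpleNamespace(attn_runtime_features=SimpleNamespace(cache_reuse=cache_reuse))


config = SimpleNamespace(enable_block_reuse=False)
stale = cache_reuse_in_sync(config, [make_engine(True), None])    # state before the fix
synced = cache_reuse_in_sync(config, [make_engine(False), None])  # state after the fix
```

Here stale evaluates to False because the engine still reports cache reuse while the config has it disabled, which is the split-brain state described in the background section.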

Functional / Performance Impact

  • Functional impact: fixes potential stale runtime metadata during MLA fallback transitions.
  • Performance impact: none expected; the change only corrects state on the fallback paths.
  • API impact: no API signature or user-facing config schema changes.

Risk Assessment

Low risk:

  • Small, localized change in existing fallback logic.
  • Uses existing synchronization helper already used elsewhere in create_py_executor.
  • Includes targeted regression tests for both fallback and supported paths.

Testing

Executed locally:

  • python -m ruff check test_py_executor_creator_mla_cache_reuse_sync.py

Test coverage added:

  • Unsupported SM fallback -> both KV manager and runtime metadata are False
  • Unsupported KV quantization fallback -> both KV manager and runtime metadata are False
  • Supported MLA configuration -> both KV manager and runtime metadata remain True

Environment note:

  • Full pytest execution in this Windows environment is blocked by missing MPI runtime DLLs required by the import chain.
  • The regression tests are included for CI validation in the standard TensorRT-LLM test environment.

Related Issue

Fixes #13939

Summary by CodeRabbit

  • Bug Fixes
    • Fixed synchronization of cache reuse settings for Multi-head Latent Attention (MLA) configurations. When unsupported SM versions or KV cache quantization constraints are detected, cache reuse is now consistently disabled across all affected components.

Signed-off-by: Dhinesh Ponnarasan <dhineshponnarasan@gmail.com>
@coderabbitai
Contributor

coderabbitai Bot commented May 12, 2026

Walkthrough

The PR synchronizes KV cache reuse configuration with model engine runtime flags in MLA fallback branches. When unsupported GPU SM versions or KV quantization algorithms trigger cache reuse disablement, both the KV cache manager config and attention runtime metadata are now updated consistently.

Changes

MLA Cache Reuse Synchronization

| Layer / File(s) | Summary |
| --- | --- |
| MLA fallback cache reuse synchronization (tensorrt_llm/_torch/pyexecutor/py_executor_creator.py) | Two MLA fallback branches (unsupported SM version and unsupported KV cache quantization) now call _set_model_engines_cache_reuse() to propagate enable_block_reuse = False to both main and draft model engines' runtime cache_reuse flags, aligning engine runtime state with KV cache configuration. |
| Cache reuse synchronization test module (tests/unittest/_torch/executor/test_py_executor_creator_mla_cache_reuse_sync.py) | New pytest module that validates MLA cache reuse synchronization across three scenarios: unsupported SM versions, unsupported KV quantization, and supported configurations. Defines lightweight dummy objects (calibrator, resource manager, executor, KV cache creator, model engine) and monkeypatches executor internals to verify both KV cache reuse and runtime cache reuse flags are disabled consistently in unsupported cases and enabled in supported cases. |

🎯 2 (Simple) | ⏱️ ~12 minutes

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 5.56% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title '[None][fix] synchronize MLA cache reuse fallback metadata' is clear and directly related to the main change: fixing synchronization between KV cache config and attention runtime metadata in MLA fallback handling. |
| Description check | ✅ Passed | The PR description is comprehensive and complete, covering background, summary, scope, code changes, invariants, impact, risk assessment, testing, and related issue #13939. All sections are well-populated. |
| Linked Issues check | ✅ Passed | The PR fully addresses the requirements from issue #13939: synchronizes KV cache reuse state by calling _set_model_engines_cache_reuse() in both MLA fallback branches (SM version and KV quantization), adds comprehensive regression tests covering both fallback and supported paths, and prevents split-brain state between config and runtime metadata. |
| Out of Scope Changes check | ✅ Passed | All changes are directly scoped to MLA cache reuse synchronization in create_py_executor and related regression tests. No unrelated modifications or scope creep detected; minimal localized fix with targeted test coverage. |


Contributor

@coderabbitai coderabbitai Bot left a comment

🧹 Nitpick comments (1)
tests/unittest/_torch/executor/test_py_executor_creator_mla_cache_reuse_sync.py (1)

92-134: ⚡ Quick win

Add one draft-model regression case to lock the two-engine sync contract.

Current tests exercise speculative_config=None, so they won’t catch regressions where only the main engine’s runtime flag is updated. Please add a case with a draft model enabled and assert draft runtime cache-reuse sync too.

As per coding guidelines: “Assess whether new/changed tests cover happy path, important edge cases, and failure modes relevant to the feature or fix.”

Also applies to: 137-208, 211-241
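
A draft-model case along the lines suggested here might look like the sketch below. It is only an outline under stated assumptions: the helper body, _make_engine, and the test name are hypothetical, and the fallback is simulated inline. A real test would drive create_py_executor through the module's monkeypatched internals with a non-None speculative_config.

```python
from types import SimpleNamespace


def _set_model_engines_cache_reuse(model_engines, enabled):
    """Stand-in for the real helper; its actual body is assumed here."""
    for engine in model_engines:
        if engine is not None:
            engine.attn_runtime_features.cache_reuse = enabled


def _make_engine():
    """Build a fake engine whose runtime flag starts enabled."""
    return SimpleNamespace(attn_runtime_features=SimpleNamespace(cache_reuse=True))


def test_mla_fallback_syncs_draft_engine_cache_reuse():
    """Hypothetical draft-model case: both engines must follow the config."""
    kv_cache_config = SimpleNamespace(enable_block_reuse=True)
    model_engine, draft_model_engine = _make_engine(), _make_engine()

    # Simulate an MLA fallback branch disabling block reuse.
    kv_cache_config.enable_block_reuse = False
    _set_model_engines_cache_reuse([model_engine, draft_model_engine], False)

    assert kv_cache_config.enable_block_reuse is False
    assert model_engine.attn_runtime_features.cache_reuse is False
    assert draft_model_engine.attn_runtime_features.cache_reuse is False


test_mla_fallback_syncs_draft_engine_cache_reuse()
```

The third assertion is the one the existing tests cannot make, since they run with speculative_config=None and therefore never create a draft engine.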

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In
`@tests/unittest/_torch/executor/test_py_executor_creator_mla_cache_reuse_sync.py`
around lines 92 - 134, The tests only exercise speculative_config=None, so add a
regression case that enables a draft model path and verifies cache-reuse
synchronization for both engines; modify or overload the _make_llm_args helper
(or add a new helper variant) to return a SimpleNamespace with
speculative_config set to a non-None draft configuration (e.g., a
SimpleNamespace indicating draft enabled) and then add an assertion in the new
test that the draft engine's runtime cache-reuse flag/state is updated in sync
with the main engine (referencing _make_llm_args,
kv_cache_config.enable_block_reuse, and speculative_config) to lock the
two-engine sync contract. Ensure analogous additions are made for the other
ranges noted (lines 137-208 and 211-241) to cover both happy-path and draft-case
regressions.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 371dd3a2-f10d-410e-b406-5d4685af671b

📥 Commits

Reviewing files that changed from the base of the PR and between 1118769 and 281de12.

📒 Files selected for processing (2)
  • tensorrt_llm/_torch/pyexecutor/py_executor_creator.py
  • tests/unittest/_torch/executor/test_py_executor_creator_mla_cache_reuse_sync.py

- Add docstrings to all mock classes (_DummyCalibrator, _DummyResourceManager, _DummyPyExecutor, _DummyKvCacheCreator, _DummyModelEngine)
- Add docstrings to helper functions (_make_llm_args, _run_create_py_executor)
- Add detailed docstrings to all three test functions describing the invariant being verified
- Improves docstring coverage to meet 80% threshold per CodeRabbit pre-merge checks

Signed-off-by: Dhinesh Ponnarasan <dhineshponnarasan@gmail.com>
…hineshPonnarasan/TensorRT-LLM into fix/13939-mla-cache-reuse-sync

Signed-off-by: Dhinesh Ponnarasan <dhineshponnarasan@gmail.com>
@DhineshPonnarasan DhineshPonnarasan force-pushed the fix/13939-mla-cache-reuse-sync branch from f0ccc16 to 19c04e7 Compare May 12, 2026 12:47
@svc-trtllm-gh-bot svc-trtllm-gh-bot added the Community want to contribute PRs initiated from Community label May 12, 2026
@karljang
Collaborator

@DhineshPonnarasan,
Thanks for your contributions!
Before we proceed, could you please address the issue with the pre-commit check failure?

Signed-off-by: Dhinesh Ponnarasan <dhineshponnarasan@gmail.com>
@DhineshPonnarasan
Author

@DhineshPonnarasan, Thanks for your contributions! Before we proceed, could you please address the issue with the pre-commit check failure?

Hi @karljang, thanks for the heads-up.

I reproduced the pre-commit failure locally and fixed it by applying ruff formatting to test_py_executor_creator_mla_cache_reuse_sync.py.

I then re-ran both checks locally:

  • ruff format --check: pass
  • ruff check: pass

I pushed the fix in commit 15be36d to this PR branch.
Could you please re-run the pre-commit workflow when you get a chance?

Collaborator

@lancelly lancelly left a comment

LGTM~

@lancelly
Collaborator

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #50414 [ run ] triggered by Bot. Commit: 15be36d Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #50414 [ run ] completed with state SUCCESS. Commit: 15be36d
/LLM/main/L0_MergeRequest_PR pipeline #39937 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

@karljang
Collaborator

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #51087 [ run ] triggered by Bot. Commit: 15be36d Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #51087 [ run ] completed with state SUCCESS. Commit: 15be36d
/LLM/main/L0_MergeRequest_PR pipeline #40527 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

@karljang
Collaborator

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #51127 [ run ] triggered by Bot. Commit: 15be36d Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #51127 [ run ] completed with state FAILURE. Commit: 15be36d

Link to invocation

@karljang
Collaborator

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #51159 [ run ] triggered by Bot. Commit: 15be36d Link to invocation

@tensorrt-cicd
Collaborator

PR_Github #51159 [ run ] completed with state SUCCESS. Commit: 15be36d
/LLM/main/L0_MergeRequest_PR pipeline #40593 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

Labels

Community want to contribute PRs initiated from Community

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Bug]: MLA fallback disables KV cache reuse in config but leaves attention runtime flag stale

5 participants