[None][fix] synchronize MLA cache reuse fallback metadata by DhineshPonnarasan · Pull Request #14049 · NVIDIA/TensorRT-LLM

DhineshPonnarasan · 2026-05-12T12:27:34Z

Background / Motivation

Issue #13939 reports a runtime correctness bug in MLA fallback handling inside create_py_executor.
When MLA fallback disables KV cache block reuse, the configuration state was updated but attention runtime metadata could remain stale.
This can create split-brain behavior between scheduler/KV cache manager state and attention runtime feature state.

Summary

This PR fixes MLA cache reuse state synchronization in create_py_executor by ensuring that every post-engine MLA fallback path that sets enable_block_reuse to False also updates model_engine runtime metadata through the existing helper path.
The fix is intentionally minimal and follows existing repository patterns already used in other fallback branches.
The PR also adds focused regression tests covering unsupported SM fallback, unsupported KV quantization fallback, and supported MLA configuration to verify invariant preservation in both negative and positive paths.

Scope

This PR addresses a single concern only: MLA KV cache reuse synchronization correctness in Python executor initialization.

Code Changes

Runtime fix in py_executor_creator.py

In the MLA unsupported-SM fallback branch, after disabling kv_cache_config.enable_block_reuse, call _set_model_engines_cache_reuse([model_engine, draft_model_engine], False).
In the MLA unsupported-KV-quantization fallback branch, after disabling kv_cache_config.enable_block_reuse, make the same _set_model_engines_cache_reuse([model_engine, draft_model_engine], False) call.
Both branches reuse the existing helper, which iterates over the engine list and so stays safe when draft_model_engine is None.
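
The pattern described above can be sketched roughly as follows. This is a minimal illustration, not the code in py_executor_creator.py: set_model_engines_cache_reuse is a hypothetical stand-in for the real _set_model_engines_cache_reuse helper, apply_mla_fallback and its mla_supported parameter are invented for the example, and the engines are plain SimpleNamespace dummies.

```python
from types import SimpleNamespace


def set_model_engines_cache_reuse(model_engines, enabled):
    """Hypothetical stand-in for _set_model_engines_cache_reuse."""
    for engine in model_engines:
        if engine is None:  # no draft model configured, nothing to update
            continue
        engine.attn_runtime_features.cache_reuse = enabled


def apply_mla_fallback(kv_cache_config, model_engine, draft_model_engine,
                       mla_supported):
    """Assumed shape of a fallback branch: disable reuse, then sync engines."""
    if not mla_supported and kv_cache_config.enable_block_reuse:
        kv_cache_config.enable_block_reuse = False
        # Without this call, the runtime flag would stay True (stale).
        set_model_engines_cache_reuse([model_engine, draft_model_engine], False)


def make_engine():
    """Dummy engine exposing only the runtime cache_reuse flag."""
    return SimpleNamespace(
        attn_runtime_features=SimpleNamespace(cache_reuse=True))


kv_cache_config = SimpleNamespace(enable_block_reuse=True)
main_engine = make_engine()

# draft_model_engine is None here, mirroring a run without speculative decoding.
apply_mla_fallback(kv_cache_config, main_engine, None, mla_supported=False)

print(kv_cache_config.enable_block_reuse,
      main_engine.attn_runtime_features.cache_reuse)  # False False
```

Passing the engines as a list lets one helper cover both the main and the draft engine, which is why a None draft entry needs no special handling at the call site.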

Regression tests in test_py_executor_creator_mla_cache_reuse_sync.py

test_mla_unsupported_sm_fallback_syncs_cache_reuse
test_mla_unsupported_kv_quant_fallback_syncs_cache_reuse
test_mla_supported_configuration_preserves_cache_reuse

Invariant Protected

For post-engine MLA fallback transitions:

kv_cache_config.enable_block_reuse
model_engine.attn_runtime_features.cache_reuse

must always remain synchronized.
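
As an illustration of the invariant, a check along these lines could be used in a test or debug assertion. The function name cache_reuse_in_sync and the SimpleNamespace objects are assumptions made for this sketch; only the two attribute paths come from the PR description.

```python
from types import SimpleNamespace


def cache_reuse_in_sync(kv_cache_config, model_engine):
    """Return True when config and runtime metadata agree on cache reuse."""
    return (kv_cache_config.enable_block_reuse
            == model_engine.attn_runtime_features.cache_reuse)


# Config already disabled by a fallback branch.
kv_cache_config = SimpleNamespace(enable_block_reuse=False)

# Stale engine: the runtime flag was never updated (the bug scenario).
stale_engine = SimpleNamespace(
    attn_runtime_features=SimpleNamespace(cache_reuse=True))

# Synced engine: the runtime flag follows the config.
synced_engine = SimpleNamespace(
    attn_runtime_features=SimpleNamespace(cache_reuse=False))

print(cache_reuse_in_sync(kv_cache_config, stale_engine))   # False
print(cache_reuse_in_sync(kv_cache_config, synced_engine))  # True
```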

Functional / Performance Impact

Functional impact: fixes potential stale runtime metadata during MLA fallback transitions.
Performance impact: none expected beyond correcting fallback state behavior.
API impact: no API signature or user-facing config schema changes.

Risk Assessment

Low risk:

Small, localized change in existing fallback logic.
Uses existing synchronization helper already used elsewhere in create_py_executor.
Includes targeted regression tests for both fallback and supported paths.

Testing

Executed locally:

python -m ruff check test_py_executor_creator_mla_cache_reuse_sync.py

Test coverage added:

Unsupported SM fallback -> both KV manager and runtime metadata are False
Unsupported KV quantization fallback -> both KV manager and runtime metadata are False
Supported MLA configuration -> both KV manager and runtime metadata remain True

Environment note:

Full pytest execution in this Windows environment is blocked by missing MPI runtime DLLs required by the import chain.
The regression tests are included for CI validation in the standard TensorRT-LLM test environment.

Related Issue

Fixes #13939

Summary by CodeRabbit

Bug Fixes
- Fixed synchronization of cache reuse settings for Multi-head Latent Attention (MLA) configurations. When unsupported SM versions or KV cache quantization constraints are detected, cache reuse is now consistently disabled across all affected components.

Signed-off-by: Dhinesh Ponnarasan <dhineshponnarasan@gmail.com>

coderabbitai · 2026-05-12T12:29:40Z

Walkthrough

The PR synchronizes KV cache reuse configuration with model engine runtime flags in MLA fallback branches. When unsupported GPU SM versions or KV quantization algorithms trigger cache reuse disablement, both the KV cache manager config and attention runtime metadata are now updated consistently.

Changes

MLA Cache Reuse Synchronization

| Layer / File(s) | Summary |
|---|---|
| MLA fallback cache reuse synchronization `tensorrt_llm/_torch/pyexecutor/py_executor_creator.py` | Two MLA fallback branches (unsupported SM version and unsupported KV cache quantization) now call `_set_model_engines_cache_reuse()` to propagate `enable_block_reuse = False` to both main and draft model engines' runtime `cache_reuse` flags, aligning engine runtime state with KV cache configuration. |
| Cache reuse synchronization test module `tests/unittest/_torch/executor/test_py_executor_creator_mla_cache_reuse_sync.py` | New pytest module that validates MLA cache reuse synchronization across three scenarios: unsupported SM versions, unsupported KV quantization, and supported configurations. Defines lightweight dummy objects (calibrator, resource manager, executor, KV cache creator, model engine) and monkeypatches executor internals to verify both KV cache reuse and runtime cache reuse flags are disabled consistently in unsupported cases and enabled in supported cases. |

🎯 2 (Simple) | ⏱️ ~12 minutes

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 5.56% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The title '[None][fix] synchronize MLA cache reuse fallback metadata' is clear and directly related to the main change: fixing synchronization between KV cache config and attention runtime metadata in MLA fallback handling. |
| Description check | ✅ Passed | The PR description is comprehensive and complete, covering background, summary, scope, code changes, invariants, impact, risk assessment, testing, and related issue `#13939`. All sections are well-populated. |
| Linked Issues check | ✅ Passed | The PR fully addresses the requirements from issue `#13939`: synchronizes KV cache reuse state by calling _set_model_engines_cache_reuse() in both MLA fallback branches (SM version and KV quantization), adds comprehensive regression tests covering both fallback and supported paths, and prevents split-brain state between config and runtime metadata. |
| Out of Scope Changes check | ✅ Passed | All changes are directly scoped to MLA cache reuse synchronization in create_py_executor and related regression tests. No unrelated modifications or scope creep detected; minimal localized fix with targeted test coverage. |

coderabbitai

🧹 Nitpick comments (1)

tests/unittest/_torch/executor/test_py_executor_creator_mla_cache_reuse_sync.py (1)
92-134: ⚡ Quick win

Add one draft-model regression case to lock the two-engine sync contract.

Current tests exercise speculative_config=None, so they won’t catch regressions where only the main engine’s runtime flag is updated. Please add a case with a draft model enabled and assert draft runtime cache-reuse sync too.

As per coding guidelines: “Assess whether new/changed tests cover happy path, important edge cases, and failure modes relevant to the feature or fix.”

Also applies to: 137-208, 211-241
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In
`@tests/unittest/_torch/executor/test_py_executor_creator_mla_cache_reuse_sync.py`
around lines 92 - 134, The tests only exercise speculative_config=None, so add a
regression case that enables a draft model path and verifies cache-reuse
synchronization for both engines; modify or overload the _make_llm_args helper
(or add a new helper variant) to return a SimpleNamespace with
speculative_config set to a non-None draft configuration (e.g., a
SimpleNamespace indicating draft enabled) and then add an assertion in the new
test that the draft engine's runtime cache-reuse flag/state is updated in sync
with the main engine (referencing _make_llm_args,
kv_cache_config.enable_block_reuse, and speculative_config) to lock the
two-engine sync contract. Ensure analogous additions are made for the other
ranges noted (lines 137-208 and 211-241) to cover both happy-path and draft-case
regressions.
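
A draft-model regression case of the kind requested above might look roughly like the sketch below. It does not use the PR's actual test helpers (_make_llm_args, _run_create_py_executor); every name here is hypothetical, and the "main only" variant exists purely to show that the assertion would catch a fix that forgets the draft engine.

```python
from types import SimpleNamespace


def make_engine():
    """Hypothetical dummy engine exposing only the runtime cache_reuse flag."""
    return SimpleNamespace(
        attn_runtime_features=SimpleNamespace(cache_reuse=True))


def disable_reuse_all(engines):
    """Assumed fixed behavior: update every non-None engine."""
    for engine in engines:
        if engine is not None:
            engine.attn_runtime_features.cache_reuse = False


def disable_reuse_main_only(engines):
    """Deliberately buggy variant: forgets the draft engine."""
    engines[0].attn_runtime_features.cache_reuse = False


def draft_sync_holds(disable_fn):
    """Simulate a fallback with a draft engine and check both runtime flags."""
    kv_cache_config = SimpleNamespace(enable_block_reuse=True)
    main_engine, draft_engine = make_engine(), make_engine()

    # Fallback: disable reuse in the config, then sync the engines.
    kv_cache_config.enable_block_reuse = False
    disable_fn([main_engine, draft_engine])

    expected = kv_cache_config.enable_block_reuse
    return (main_engine.attn_runtime_features.cache_reuse == expected
            and draft_engine.attn_runtime_features.cache_reuse == expected)


def test_draft_engine_cache_reuse_sync():
    # The two-engine contract holds when every engine is updated...
    assert draft_sync_holds(disable_reuse_all)
    # ...and the same check fails if only the main engine is touched.
    assert not draft_sync_holds(disable_reuse_main_only)
```

Checking the draft engine's flag explicitly is what distinguishes this case from the speculative_config=None tests, where a main-only update would pass unnoticed.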

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 371dd3a2-f10d-410e-b406-5d4685af671b

📥 Commits

Reviewing files that changed from the base of the PR and between 1118769 and 281de12.

📒 Files selected for processing (2)

tensorrt_llm/_torch/pyexecutor/py_executor_creator.py
tests/unittest/_torch/executor/test_py_executor_creator_mla_cache_reuse_sync.py

- Add docstrings to all mock classes (_DummyCalibrator, _DummyResourceManager, _DummyPyExecutor, _DummyKvCacheCreator, _DummyModelEngine)
- Add docstrings to helper functions (_make_llm_args, _run_create_py_executor)
- Add detailed docstrings to all three test functions describing the invariant being verified
- Improves docstring coverage to meet 80% threshold per CodeRabbit pre-merge checks

Signed-off-by: Dhinesh Ponnarasan <dhineshponnarasan@gmail.com>

…hineshPonnarasan/TensorRT-LLM into fix/13939-mla-cache-reuse-sync Signed-off-by: Dhinesh Ponnarasan <dhineshponnarasan@gmail.com>

karljang · 2026-05-19T18:53:08Z

@DhineshPonnarasan,
Thanks for your contributions!
Before we proceed, could you please address the issue with the pre-commit check failure?

Signed-off-by: Dhinesh Ponnarasan <dhineshponnarasan@gmail.com>

DhineshPonnarasan · 2026-05-19T19:19:06Z

> @DhineshPonnarasan, Thanks for your contributions! Before we proceed, could you please address the issue with the pre-commit check failure?

Hi @karljang, thanks for the heads-up.

I reproduced the pre-commit failure locally and fixed it by applying ruff formatting to test_py_executor_creator_mla_cache_reuse_sync.py.

I then re-ran both checks locally:

ruff format --check: pass
ruff check: pass

I pushed the fix in commit 15be36d to this PR branch.
Could you please re-run the pre-commit workflow when you get a chance?

lancelly

LGTM~

lancelly · 2026-05-27T00:56:16Z

/bot run --disable-fail-fast

tensorrt-cicd · 2026-05-27T01:01:30Z

PR_Github #50414 [ run ] triggered by Bot. Commit: 15be36d Link to invocation

tensorrt-cicd · 2026-05-27T03:01:33Z

PR_Github #50414 [ run ] completed with state SUCCESS. Commit: 15be36d
/LLM/main/L0_MergeRequest_PR pipeline #39937 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

Please check the failed tests and fix your PR
If you cannot view the failures, ask the CI triggerer to share details
Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

karljang · 2026-05-29T18:04:03Z

/bot run --disable-fail-fast

tensorrt-cicd · 2026-05-29T18:10:41Z

PR_Github #51087 [ run ] triggered by Bot. Commit: 15be36d Link to invocation

tensorrt-cicd · 2026-05-29T23:13:14Z

PR_Github #51087 [ run ] completed with state SUCCESS. Commit: 15be36d
/LLM/main/L0_MergeRequest_PR pipeline #40527 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

Please check the failed tests and fix your PR
If you cannot view the failures, ask the CI triggerer to share details
Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

karljang · 2026-05-30T00:18:57Z

/bot run --disable-fail-fast

tensorrt-cicd · 2026-05-30T00:24:47Z

PR_Github #51127 [ run ] triggered by Bot. Commit: 15be36d Link to invocation

tensorrt-cicd · 2026-05-30T00:29:43Z

PR_Github #51127 [ run ] completed with state FAILURE. Commit: 15be36d

Link to invocation

karljang · 2026-05-30T04:14:09Z

/bot run --disable-fail-fast

tensorrt-cicd · 2026-05-30T04:19:33Z

PR_Github #51159 [ run ] triggered by Bot. Commit: 15be36d Link to invocation

tensorrt-cicd · 2026-05-30T07:38:42Z

PR_Github #51159 [ run ] completed with state SUCCESS. Commit: 15be36d
/LLM/main/L0_MergeRequest_PR pipeline #40593 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

Please check the failed tests and fix your PR
If you cannot view the failures, ask the CI triggerer to share details
Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

[None][fix] synchronize MLA cache reuse fallback metadata

373015f

Signed-off-by: Dhinesh Ponnarasan <dhineshponnarasan@gmail.com>

DhineshPonnarasan requested a review from a team as a code owner May 12, 2026 12:27

DhineshPonnarasan requested a review from lancelly May 12, 2026 12:27

github-actions Bot assigned DhineshPonnarasan May 12, 2026

Merge branch 'main' into fix/13939-mla-cache-reuse-sync

281de12

coderabbitai Bot reviewed May 12, 2026

DhineshPonnarasan added 2 commits May 12, 2026 08:39

Merge branch 'fix/13939-mla-cache-reuse-sync' of https://github.com/D…

19c04e7

…hineshPonnarasan/TensorRT-LLM into fix/13939-mla-cache-reuse-sync Signed-off-by: Dhinesh Ponnarasan <dhineshponnarasan@gmail.com>

DhineshPonnarasan force-pushed the fix/13939-mla-cache-reuse-sync branch from f0ccc16 to 19c04e7 Compare May 12, 2026 12:47

svc-trtllm-gh-bot added the Community want to contribute PRs initiated from Community label May 12, 2026

Merge branch 'main' into fix/13939-mla-cache-reuse-sync

a24b2cb

style: apply ruff-format to MLA cache reuse sync test

15be36d

Signed-off-by: Dhinesh Ponnarasan <dhineshponnarasan@gmail.com>

lancelly approved these changes May 27, 2026
