[#13320][test] Test coverage and repro for #13320 by eopXD · Pull Request #13553 · NVIDIA/TensorRT-LLM

eopXD · 2026-04-28T08:07:30Z

Description

Repro and coverage for #13320. The reporter has empirical evidence that, with a KvCacheConnectorScheduler attached, KVCacheManager::addToken advances GenerationRequest::mNumTokens correctly but WindowBlockManager::adjustBlocksIfNeeded never grows mCacheBlockIds at the tokens_per_block boundary. Decode KV writes consequently overwrite the prefill block in place, silently corrupting attention outputs across nearly every real generation.

A read of the C++ source does not show a branch on mKvCacheConnectorManager along the decode path:

cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:3195-3201 — KVCacheManager::addToken is addNewTokens(1) followed by mBlockManager.adjustBlocksIfNeeded(sequence) unconditionally.
cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:1944-1963 — WindowBlockManager::adjustBlocksIfNeeded checks (getNumTokens() - 1) % getTokensPerBlock() == 0 and calls allocateBlock → addBlockToBeam → sequence.addCacheBlock(...).
mKvCacheConnectorManager is referenced only from onboardAndAllocateBlocks (line 1725, prefill) and the constructor.
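
For readers unfamiliar with the boundary condition quoted above, here is a minimal Python model of it. This is not the TensorRT-LLM implementation: `SequenceModel`, `add_token`, and `block_count_trace` are hypothetical names, and the model starts from an empty sequence for simplicity instead of going through prefill. It only shows when the check `(getNumTokens() - 1) % getTokensPerBlock() == 0` would be expected to grow the block list.

```python
from dataclasses import dataclass, field


@dataclass
class SequenceModel:
    """Toy stand-in for a generation request: a token count plus a block-ID list."""
    tokens_per_block: int
    num_tokens: int = 0
    cache_block_ids: list = field(default_factory=list)


def add_token(seq: SequenceModel, next_block_id: int) -> int:
    """Model of addToken: bump the token count, then allocate on a block boundary.

    Returns the next unused block ID.
    """
    seq.num_tokens += 1  # models addNewTokens(1)
    # Boundary check quoted in the PR description.
    if (seq.num_tokens - 1) % seq.tokens_per_block == 0:
        seq.cache_block_ids.append(next_block_id)
        next_block_id += 1
    return next_block_id


def block_count_trace(tokens_per_block: int, total_tokens: int) -> list:
    """Add tokens one at a time and record the block count after each step."""
    seq = SequenceModel(tokens_per_block)
    next_id = 0
    trace = []
    for _ in range(total_tokens):
        next_id = add_token(seq, next_id)
        trace.append(len(seq.cache_block_ids))
    return trace


# With 4 tokens per block, a new block appears at tokens 1, 5, and 9.
print(block_count_trace(4, 9))  # [1, 1, 1, 1, 2, 2, 2, 2, 3]
```

In this model the symptom reported in #13320 would correspond to the trace staying flat past a boundary while the token count keeps growing.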

Yet the bug is reproducible in production. Existing unit-test coverage does not exercise this surface: cpp/tests/unit_tests/batch_manager/kvCacheManagerTest.cpp references a kvCacheConnectorManager argument exactly once, passing nullptr (line 7535).

This PR adds the missing coverage:

A MockKvCacheConnectorManager implementing the single getNumNewMatchedTokens hook from cpp/include/tensorrt_llm/batch_manager/kvCacheConnector.h, so future regressions on the connector-attached decode path surface in the C++ suite without needing the Python connector stack.
Three TEST_Fs on KVCacheManagerTest:
1. KvCacheConnector_DecodeBlockBoundary_NoExternalMatches — connector attached, getNumNewMatchedTokens returns 0. Decode-time boundary allocation must still fire.
2. KvCacheConnector_DecodeBlockBoundary_WithExternalMatches — connector reports 4 externally matched tokens (the production Dynamo / KVBM path). Boundary allocation across multiple decode boundaries must still grow mCacheBlockIds.
3. KvCacheConnector_DecodeBlockBoundary_ParityWithBaseline — identical decode workload run with and without a connector; per-step block-count traces must be byte-for-byte identical.
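
The shape of the parity test can be sketched in Python as follows. This is an illustrative model, not the C++ test code: `MockConnector`, its `get_num_new_matched_tokens` signature, and `decode_trace` are assumptions made for the sketch, and the real hook's parameters may differ. The model consults the connector only during prefill, mirroring the observation that the decode path does not branch on the connector.

```python
class MockConnector:
    """Hypothetical mock: always reports a fixed number of externally matched tokens."""

    def __init__(self, matched_tokens: int):
        self.matched_tokens = matched_tokens
        self.calls = 0

    def get_num_new_matched_tokens(self, request_id: int, num_computed_tokens: int) -> int:
        # Signature is an assumption for this sketch.
        self.calls += 1
        return self.matched_tokens


def decode_trace(prompt_len, decode_steps, tokens_per_block, connector=None):
    """Return the block count after each decode step for one sequence."""
    if connector is not None:
        # Prefill is the only place the connector is consulted in this model.
        connector.get_num_new_matched_tokens(request_id=0, num_computed_tokens=0)
    num_tokens = prompt_len
    num_blocks = -(-prompt_len // tokens_per_block)  # ceiling division for prefill
    trace = []
    for _ in range(decode_steps):
        num_tokens += 1
        if (num_tokens - 1) % tokens_per_block == 0:
            num_blocks += 1  # decode-time boundary allocation
        trace.append(num_blocks)
    return trace


baseline = decode_trace(prompt_len=6, decode_steps=4, tokens_per_block=4)
with_connector = decode_trace(6, 4, 4, connector=MockConnector(matched_tokens=4))
print(baseline, with_connector)  # [2, 2, 3, 3] [2, 2, 3, 3]
```

A divergence between the two traces is what the parity test is designed to catch.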

These tests are the first step toward fixing #13320: if any of them fails on the canonical CI builders, the failure pinpoints the divergence inside the C++ block manager. If they pass, the bug lives above the C++ layer and the next step is to instrument the Python prepare_resources / KvCacheConnectorManager plumbing.

Test Coverage

cpp/tests/unit_tests/batch_manager/kvCacheManagerTest.cpp::KVCacheManagerTest.KvCacheConnector_DecodeBlockBoundary_NoExternalMatches
cpp/tests/unit_tests/batch_manager/kvCacheManagerTest.cpp::KVCacheManagerTest.KvCacheConnector_DecodeBlockBoundary_WithExternalMatches
cpp/tests/unit_tests/batch_manager/kvCacheManagerTest.cpp::KVCacheManagerTest.KvCacheConnector_DecodeBlockBoundary_ParityWithBaseline

PR Checklist

Please review the following before submitting your PR:

PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.
PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.
Test cases are provided for new code paths (see test instructions)
Any new dependencies have been scanned for license and vulnerabilities
CODEOWNERS updated if ownership changes
Documentation updated as needed
Update tava architecture diagram if there is a significant design change in PR.
The reviewers assigned automatically/manually are appropriate for the PR.
Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

To see a list of available CI bot commands, please comment /bot help.

Summary by CodeRabbit

Tests
- Added unit tests for KV cache memory management during decoding operations to verify correct block allocation behavior.

coderabbitai · 2026-04-28T08:11:31Z

📝 Walkthrough

Adds a comprehensive unit-test suite for KVCacheManager that uses a mocked KvCacheConnectorManager to verify decode-time block allocation behavior across token boundaries in three distinct scenarios involving connector state variations.

Changes

| Cohort / File(s) | Summary |
|---|---|
| KVCacheManager Block Allocation Tests `cpp/tests/unit_tests/batch_manager/kvCacheManagerTest.cpp` | Adds 249 lines of test code: helper functions to construct `KVCacheManager` instances with mock connectors, a helper to drive decoding while recording block ID counts, and three `TEST_F` cases verifying block allocation behavior with zero externally matched tokens, non-zero matched-token counts during prefill, and parity between baseline and connector-attached execution. |

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 66.67% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
|---|---|---|
| Description check | ✅ Passed | The PR description comprehensively explains the issue, root cause analysis, the testing approach, and covers all required template sections including Description, Test Coverage, and a completed PR Checklist. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Title check | ✅ Passed | The title clearly references issue `#13320` and indicates this PR adds test coverage and a reproduction case, which aligns with the primary purpose of adding unit tests for the KVCacheManager decode-time block allocation behavior. |

eopXD · 2026-04-28T08:13:03Z

/bot run --disable-fail-fast

tensorrt-cicd · 2026-04-28T08:20:03Z

PR_Github #45901 [ run ] triggered by Bot. Commit: 93207d7 Link to invocation

tensorrt-cicd · 2026-04-28T13:46:43Z

PR_Github #45901 [ run ] completed with state FAILURE. Commit: 93207d7
/LLM/main/L0_MergeRequest_PR pipeline #36067 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

Please check the failed tests and fix your PR
If you cannot view the failures, ask the CI triggerer to share details
Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

eopXD · 2026-04-30T07:53:18Z

/bot run --disable-fail-fast

tensorrt-cicd · 2026-04-30T08:01:50Z

PR_Github #46339 [ run ] triggered by Bot. Commit: edf0e08 Link to invocation

tensorrt-cicd · 2026-04-30T14:18:58Z

PR_Github #46339 [ run ] completed with state SUCCESS. Commit: edf0e08
/LLM/main/L0_MergeRequest_PR pipeline #36433 completed with status: 'SUCCESS'

CI Report

Link to invocation

Add C++ unit tests covering KvCacheConnector decode-time block allocation.

Issue NVIDIA#13320 reports that with a kv_connector::KvCacheConnectorManager attached, KVCacheManager::addToken advances mNumTokens correctly but WindowBlockManager::adjustBlocksIfNeeded never grows mCacheBlockIds at the tokens_per_block boundary, silently corrupting decode KV writes.

The C++ source (kvCacheManager.cpp:3195-3201, 1944-1963) does not branch on mKvCacheConnectorManager on the decode path, and the existing test suite passes nullptr for kvCacheConnectorManager everywhere.

These tests close that gap by exercising:

1. NoExternalMatches: connector attached, getNumNewMatchedTokens returns 0. Decode-time boundary allocation must still fire.
2. WithExternalMatches: connector reports a non-zero match count (the production Dynamo / KVBM path). Block allocation across multiple boundaries must continue to grow mCacheBlockIds.
3. ParityWithBaseline: identical decode workload run with and without a connector; mCacheBlockIds growth must be byte-for-byte identical step-by-step.

A MockKvCacheConnectorManager class in the test file implements the single virtual hook from cpp/include/tensorrt_llm/batch_manager/kvCacheConnector.h.

Co-Authored-By: Yueh-Ting Chen <yueh.ting.chen@gmail.com>
Signed-off-by: Yueh-Ting Chen <yueh.ting.chen@gmail.com>

eopXD · 2026-05-04T07:13:33Z

/bot run --disable-fail-fast

tensorrt-cicd · 2026-05-04T07:21:35Z

PR_Github #46625 [ run ] triggered by Bot. Commit: 2a49e41 Link to invocation

tensorrt-cicd · 2026-05-04T12:05:35Z

PR_Github #46625 [ run ] completed with state SUCCESS. Commit: 2a49e41
/LLM/main/L0_MergeRequest_PR pipeline #36670 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

Please check the failed tests and fix your PR
If you cannot view the failures, ask the CI triggerer to share details
Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

eopXD · 2026-05-04T13:03:01Z

/bot skip --comment "Given that this MR has earned success in the previous CI run. The latest run passes the run for cpp tests and the failing errors are not related to this merge request. My judgement is that we have a low risk checking in the MR. So skipping and merging the MR here."

tensorrt-cicd · 2026-05-04T13:09:05Z

PR_Github #46640 [ skip ] triggered by Bot. Commit: 2a49e41 Link to invocation

tensorrt-cicd · 2026-05-04T13:18:55Z

PR_Github #46640 [ skip ] completed with state SUCCESS. Commit: 2a49e41
Skipping testing for commit 2a49e41

Link to invocation

… kv_cache_config

When is_mla(config) and enable_flash_mla are both true in py_executor_creator, the MLA path rebinds the local tokens_per_block to 64 but leaves kv_cache_config.tokens_per_block at the user/default value (typically 32). Two consumers then read the same field at different times: KVCacheManager is built with the local 64, while a KvCacheConnectorScheduler subclass instantiated lower in the function via scheduler_cls(llm_args) reads llm_args.kv_cache_config.tokens_per_block and sees the stale 32.

The 2x desync produces a frozen cache_block_ids view to the connector: decode in the engine never crosses the connector's perceived block boundary, decode KV writes overwrite the prefill block, and generation completes with plausible-looking but mathematically corrupted output. This was reproduced on GLM-5.1-FP8 (block-FP8 MLA) with the Dynamo KVBM connector.

The fix mirrors the pattern already used by other overrides in the same is_mla(config) block (e.g. kv_cache_config.enable_block_reuse = False) and writes the effective value back onto the config object.

Test coverage:
- New tests/unittest/_torch/executor/test_py_executor_creator_flash_mla_tokens_per_block.py pins the propagation in source so a future refactor cannot quietly drop it. Complements the C++ regression suite added in NVIDIA#13553.

Fixes NVIDIA#13320

Signed-off-by: Yueh-Ting Chen <yueh.ting.chen@gmail.com>
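
The local-rebind pitfall described in this commit message can be illustrated with a small Python sketch. It does not reproduce py_executor_creator: `KvCacheConfig` here is a hypothetical one-field dataclass, `resolve_tokens_per_block` and its `write_back` flag are invented for the illustration, and the values 32 and 64 are taken from the text above.

```python
from dataclasses import dataclass


@dataclass
class KvCacheConfig:
    """Hypothetical stand-in holding only the field discussed above."""
    tokens_per_block: int = 32


FLASH_MLA_TOKENS_PER_BLOCK = 64  # value stated in the commit message


def resolve_tokens_per_block(cfg: KvCacheConfig, use_flash_mla: bool, write_back: bool) -> int:
    """Return the value the cache manager would be built with.

    With write_back=False only the local name is rebound (the buggy pattern);
    with write_back=True the override is also stored on the shared config.
    """
    tokens_per_block = cfg.tokens_per_block
    if use_flash_mla:
        tokens_per_block = FLASH_MLA_TOKENS_PER_BLOCK  # local rebind only
        if write_back:
            cfg.tokens_per_block = tokens_per_block  # propagate to later readers
    return tokens_per_block


stale = KvCacheConfig()
manager_value = resolve_tokens_per_block(stale, use_flash_mla=True, write_back=False)
print(manager_value, stale.tokens_per_block)  # 64 32 -> a later reader sees the stale 32

fixed = KvCacheConfig()
manager_value = resolve_tokens_per_block(fixed, use_flash_mla=True, write_back=True)
print(manager_value, fixed.tokens_per_block)  # 64 64 -> both consumers agree
```

Any component constructed later from the same config object reads whatever is stored on it, which is why writing the effective value back removes the mismatch.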

eopXD requested a review from a team as a code owner April 28, 2026 08:07

github-actions Bot assigned eopXD Apr 28, 2026

eopXD force-pushed the fix/issue-13320-kv-connector-decode-block-allocation branch from 1ff3bf9 to 93207d7 Compare April 28, 2026 08:10

eopXD force-pushed the fix/issue-13320-kv-connector-decode-block-allocation branch from 93207d7 to edf0e08 Compare April 30, 2026 02:57

nvpohanh approved these changes May 4, 2026

eopXD force-pushed the fix/issue-13320-kv-connector-decode-block-allocation branch from edf0e08 to 2a49e41 Compare May 4, 2026 07:13

eopXD changed the title ~~[None][test] Test coverage and repro for #13320~~ [#13320][test] Test coverage and repro for #13320 May 4, 2026

eopXD enabled auto-merge (squash) May 4, 2026 13:03

eopXD merged commit abe5570 into NVIDIA:main May 4, 2026
9 checks passed

This was referenced May 5, 2026

[None][fix] Fix KVCacheManager constructor call in connector test helper #13749

Merged

[#13320][fix] Propagate FlashMLA tokens_per_block override onto kv_cache_config #13752

Merged

eopXD deleted the fix/issue-13320-kv-connector-decode-block-allocation branch May 7, 2026 06:06
