[#13320][test] Test coverage and repro for #13320#13553
Conversation
1ff3bf9 to
93207d7
Compare
📝 WalkthroughWalkthroughAdds a comprehensive unit-test suite for Changes
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes 🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
|
/bot run --disable-fail-fast |
|
PR_Github #45901 [ run ] triggered by Bot. Commit: |
|
PR_Github #45901 [ run ] completed with state
|
93207d7 to
edf0e08
Compare
|
/bot run --disable-fail-fast |
|
PR_Github #46339 [ run ] triggered by Bot. Commit: |
|
PR_Github #46339 [ run ] completed with state |
Add C++ unit tests covering KvCacheConnector decode-time block allocation. Issue NVIDIA#13320 reports that with a kv_connector::KvCacheConnectorManager attached, KVCacheManager::addToken advances mNumTokens correctly but WindowBlockManager::adjustBlocksIfNeeded never grows mCacheBlockIds at the tokens_per_block boundary, silently corrupting decode KV writes. The C++ source (kvCacheManager.cpp:3195-3201, 1944-1963) does not branch on mKvCacheConnectorManager on the decode path, and the existing test suite passes nullptr for kvCacheConnectorManager everywhere. These tests close that gap by exercising: 1. NoExternalMatches: connector attached, getNumNewMatchedTokens returns 0. Decode-time boundary allocation must still fire. 2. WithExternalMatches: connector reports a non-zero match count (the production Dynamo / KVBM path). Block allocation across multiple boundaries must continue to grow mCacheBlockIds. 3. ParityWithBaseline: identical decode workload run with and without a connector; mCacheBlockIds growth must be byte-for-byte identical step-by-step. A MockKvCacheConnectorManager class in the test file implements the single virtual hook from cpp/include/tensorrt_llm/batch_manager/ kvCacheConnector.h. Co-Authored-By: Yueh-Ting Chen <yueh.ting.chen@gmail.com> Signed-off-by: Yueh-Ting Chen <yueh.ting.chen@gmail.com>
edf0e08 to
2a49e41
Compare
|
/bot run --disable-fail-fast |
|
PR_Github #46625 [ run ] triggered by Bot. Commit: |
|
PR_Github #46625 [ run ] completed with state
|
|
/bot skip --comment "Given that this MR has earned success in the previous CI run. The latest run passes the run for cpp tests and the failing errors are not related to this merge request. My judgement is that we have a low risk checking in the MR. So skipping and merging the MR here." |
|
PR_Github #46640 [ skip ] triggered by Bot. Commit: |
|
PR_Github #46640 [ skip ] completed with state |
… kv_cache_config When is_mla(config) and enable_flash_mla are both true in py_executor_creator, the MLA path rebinds the local tokens_per_block to 64 but leaves kv_cache_config.tokens_per_block at the user/default value (typically 32). Two consumers then read the same field at different times: KVCacheManager is built with the local 64, while a KvCacheConnectorScheduler subclass instantiated lower in the function via scheduler_cls(llm_args) reads llm_args.kv_cache_config.tokens_per_block and sees the stale 32. The 2x desync produces a frozen cache_block_ids view to the connector: decode in the engine never crosses the connector's perceived block boundary, decode KV writes overwrite the prefill block, and generation completes with plausible-looking but mathematically corrupted output. This was reproduced on GLM-5.1-FP8 (block-FP8 MLA) with the Dynamo KVBM connector. The fix mirrors the pattern already used by other overrides in the same is_mla(config) block (e.g. kv_cache_config.enable_block_reuse = False) and writes the effective value back onto the config object. Test coverage: - New tests/unittest/_torch/executor/test_py_executor_creator_flash_mla_tokens_per_block.py pins the propagation in source so a future refactor cannot quietly drop it. Complements the C++ regression suite added in NVIDIA#13553. Fixes NVIDIA#13320 Signed-off-by: Yueh-Ting Chen <yueh.ting.chen@gmail.com>
Description
Repro and coverage for #13320. The reporter has empirical evidence that, with a
KvCacheConnectorSchedulerattached,KVCacheManager::addTokenadvancesGenerationRequest::mNumTokenscorrectly butWindowBlockManager::adjustBlocksIfNeedednever growsmCacheBlockIdsat thetokens_per_blockboundary. Decode KV writes consequently overwrite the prefill block in place, silently corrupting attention outputs across nearly every real generation.A read of the C++ source does not show a branch on
mKvCacheConnectorManageralong the decode path:cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:3195-3201—KVCacheManager::addTokenisaddNewTokens(1)followed bymBlockManager.adjustBlocksIfNeeded(sequence)unconditionally.cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:1944-1963—WindowBlockManager::adjustBlocksIfNeededchecks(getNumTokens() - 1) % getTokensPerBlock() == 0and callsallocateBlock→addBlockToBeam→sequence.addCacheBlock(...).mKvCacheConnectorManageris referenced only fromonboardAndAllocateBlocks(line 1725, prefill) and the constructor.Yet the bug is reproducible in production. Existing unit-test coverage does not exercise this surface:
cpp/tests/unit_tests/batch_manager/kvCacheManagerTest.cppreferences akvCacheConnectorManagerargument exactly once, passingnullptr(line 7535).This PR adds the missing coverage:
MockKvCacheConnectorManagerimplementing the singlegetNumNewMatchedTokenshook fromcpp/include/tensorrt_llm/batch_manager/kvCacheConnector.h, so future regressions on the connector-attached decode path surface in the C++ suite without needing the Python connector stack.TEST_Fs onKVCacheManagerTest:KvCacheConnector_DecodeBlockBoundary_NoExternalMatches— connector attached,getNumNewMatchedTokensreturns 0. Decode-time boundary allocation must still fire.KvCacheConnector_DecodeBlockBoundary_WithExternalMatches— connector reports 4 externally matched tokens (the production Dynamo / KVBM path). Boundary allocation across multiple decode boundaries must still growmCacheBlockIds.KvCacheConnector_DecodeBlockBoundary_ParityWithBaseline— identical decode workload run with and without a connector; per-step block-count traces must be byte-for-byte identical.These tests are the first step toward fixing #13320: if any of them fails on the canonical CI builders, the failure pinpoints the divergence inside the C++ block manager. If they pass, the bug lives above the C++ layer and the next step is to instrument the Python
prepare_resources/KvCacheConnectorManagerplumbing.Test Coverage
cpp/tests/unit_tests/batch_manager/kvCacheManagerTest.cpp::KVCacheManagerTest.KvCacheConnector_DecodeBlockBoundary_NoExternalMatchescpp/tests/unit_tests/batch_manager/kvCacheManagerTest.cpp::KVCacheManagerTest.KvCacheConnector_DecodeBlockBoundary_WithExternalMatchescpp/tests/unit_tests/batch_manager/kvCacheManagerTest.cpp::KVCacheManagerTest.KvCacheConnector_DecodeBlockBoundary_ParityWithBaselinePR Checklist
Please review the following before submitting your PR:
PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.
PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.
Test cases are provided for new code paths (see test instructions)
Any new dependencies have been scanned for license and vulnerabilities
CODEOWNERS updated if ownership changes
Documentation updated as needed
Update tava architecture diagram if there is a significant design change in PR.
The reviewers assigned automatically/manually are appropriate for the PR.
Please check this after reviewing the above items as appropriate for this PR.
GitHub Bot Help
To see a list of available CI bot commands, please comment
/bot help.Summary by CodeRabbit