[None][perf] AutoDeploy: reduce C++ dispatch overhead in decode scheduling loop#13012
Conversation
📝 WalkthroughWalkthroughThese changes optimize position ID handling and KV cache computation in the auto-deployment system by deferring position ID staging and introducing caching for input positions alongside batched KV cache index lookups. Changes
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes 🚥 Pre-merge checks | ✅ 3✅ Passed checks (3 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
🧹 Nitpick comments (1)
tensorrt_llm/_torch/auto_deploy/shim/ad_executor.py (1)
837-855: Useisinstance(kv_cache_manager, MambaHybridCacheManager)instead ofhasattr()for mamba-specific guards.The
hasattr(kv_cache_manager, "mamba_cache_index")check at line 837 works today but is fragile. The attribute is unique toMambaHybridCacheManagerand not exposed on the baseKVCacheManagerclass. However, a direct type check withisinstance()is more explicit, safer if the attribute ever becomes a placeholder elsewhere, and consistent with how other call sites in the codebase guard mamba-specific access (e.g.,disaggregation/transceiver.py,kv_extractor.py,_util.py).🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tensorrt_llm/_torch/auto_deploy/shim/ad_executor.py` around lines 837 - 855, Replace the fragile hasattr guard with an explicit type check: instead of _use_mamba = hasattr(kv_cache_manager, "mamba_cache_index") use _use_mamba = isinstance(kv_cache_manager, MambaHybridCacheManager); update the import or reference so MambaHybridCacheManager is available in this module and keep the rest of the logic (uses of kv_cache_manager.mamba_cache_index and fallbacks) unchanged to match other call sites like disaggregation/transceiver.py and kv_extractor.py.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In `@tensorrt_llm/_torch/auto_deploy/shim/ad_executor.py`:
- Around line 837-855: Replace the fragile hasattr guard with an explicit type
check: instead of _use_mamba = hasattr(kv_cache_manager, "mamba_cache_index")
use _use_mamba = isinstance(kv_cache_manager, MambaHybridCacheManager); update
the import or reference so MambaHybridCacheManager is available in this module
and keep the rest of the logic (uses of kv_cache_manager.mamba_cache_index and
fallbacks) unchanged to match other call sites like
disaggregation/transceiver.py and kv_extractor.py.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro Plus
Run ID: 9bf5195f-cffa-4566-a54a-1f813fc17edb
📒 Files selected for processing (2)
tensorrt_llm/_torch/auto_deploy/custom_ops/attention_interface.pytensorrt_llm/_torch/auto_deploy/shim/ad_executor.py
💤 Files with no reviewable changes (1)
- tensorrt_llm/_torch/auto_deploy/custom_ops/attention_interface.py
74c3bf7 to
e368317
Compare
Three low-overhead changes that reduce Python/C++ dispatch cost per decode step in the ADEngine scheduling loop: 1. get_last_tokens instead of get_token + get_num_tokens (ad_executor.py): Replace the two-call pattern get_token(0, get_num_tokens(0) - 1) with a single get_last_tokens(0) call. Saves one C++ round-trip per non-overlap decode request per step. Aligns with the pattern in model_engine.py. 2. Inline get_num_kv_blocks (ad_executor.py): Replace kv_cache_manager.get_num_kv_blocks(end_compute_i) with the equivalent pure-Python ceildiv expression, saving a function-call overhead per request per step. 3. Hoist mamba hasattr check (ad_executor.py): Move hasattr(kv_cache_manager, "mamba_cache_index") outside the per-request loop into _use_mamba to avoid repeated attribute lookup. 4. Remove redundant position_ids computation (attention_interface.py): position_ids were computed in SequenceInfo.nest_sequences() and then recomputed again downstream. Removing the duplicate saves one NumPy array construction and torch.from_numpy() call on every prefill step. Signed-off-by: Chenghao Zhang <211069071+nvchenghaoz@users.noreply.github.com>
e368317 to
23a78a7
Compare
|
@coderabbitai review the changes |
|
✅ Actions performedReview triggered.
|
- Remove explanatory comment on get_last_tokens (self-evident) - Use isinstance(kv_cache_manager, MambaHybridCacheManager) instead of hasattr() for the mamba guard — more explicit and consistent with other call sites in ad_executor.py (line 295) and the codebase Signed-off-by: Chenghao Zhang <211069071+nvchenghaoz@users.noreply.github.com>
|
Addressed CodeRabbit's |
|
/bot run --extra-stage "DGX_B200-4_GPUs-AutoDeploy-1, DGX_H100-4_GPUs-AutoDeploy-1" |
|
PR_Github #43144 [ run ] triggered by Bot. Commit: |
|
PR_Github #43144 [ run ] completed with state
|
…guard isinstance(kv_cache_manager, MambaHybridCacheManager) breaks unit tests that use _DummyHybridKVCacheManager, which cannot subclass the real class due to its complex constructor (~15 required params). hasattr() is the correct duck-typing approach here. Signed-off-by: Chenghao Zhang <211069071+nvchenghaoz@users.noreply.github.com>
|
Tried switching to |
|
/bot run --extra-stage "DGX_B200-4_GPUs-AutoDeploy-1, DGX_H100-4_GPUs-AutoDeploy-1" |
|
PR_Github #43251 [ run ] triggered by Bot. Commit: |
|
PR_Github #43251 [ run ] completed with state |
|
/bot run --extra-stage "DGX_B200-4_GPUs-AutoDeploy-1, DGX_H100-4_GPUs-AutoDeploy-1" |
|
PR_Github #43541 [ run ] triggered by Bot. Commit: |
|
PR_Github #43541 [ run ] completed with state
|
|
/bot run --extra-stage "DGX_B200-4_GPUs-AutoDeploy-1, DGX_H100-4_GPUs-AutoDeploy-1" |
|
PR_Github #43817 [ run ] triggered by Bot. Commit: |
|
PR_Github #43817 [ run ] completed with state
|
|
/bot run --stage-list "A10-Build_Docs, A10-PackageSanityCheck-PY310-UB2204, A100X-PackageSanityCheck-PY312-UB2404, A30-AutoDeploy-1, H100_PCIe-AutoDeploy-1, DGX_B200-AutoDeploy-1, A100X-PyTorch-1, DGX_H100-4_GPUs-AutoDeploy-1, DGX_B200-4_GPUs-AutoDeploy-1, DGX_H100-4_GPUs-AutoDeploy-Post-Merge-1, DGX_B200-8_GPUs-AutoDeploy-Post-Merge-1" --disable-fail-fast |
|
PR_Github #46689 [ run ] triggered by Bot. Commit: |
Signed-off-by: Chenghao Zhang <211069071+nvchenghaoz@users.noreply.github.com>
|
/bot run --stage-list "A10-Build_Docs, A10-PackageSanityCheck-PY310-UB2204, A100X-PackageSanityCheck-PY312-UB2404, A30-AutoDeploy-1, H100_PCIe-AutoDeploy-1, DGX_B200-AutoDeploy-1, A100X-PyTorch-1, DGX_H100-4_GPUs-AutoDeploy-1, DGX_B200-4_GPUs-AutoDeploy-1, DGX_H100-4_GPUs-AutoDeploy-Post-Merge-1, DGX_B200-8_GPUs-AutoDeploy-Post-Merge-1" --disable-fail-fast |
|
PR_Github #46706 [ run ] triggered by Bot. Commit: |
|
PR_Github #46706 [ run ] completed with state |
|
/bot skip --comment "AD pipeline passed" |
|
PR_Github #46837 [ skip ] triggered by Bot. Commit: |
|
PR_Github #46837 [ skip ] completed with state |
Resolve conflict in ad_executor.py: combine the dual-pool VSWA prelude (kv_group_windows/is_vswa + per-pool dicts) from sg/swa-take2 with the hot-loop locals (_tokens_per_block, _use_mamba) introduced by NVIDIA#13012, and apply NVIDIA#13012's inlined num_active_blocks_i formula in the non-VSWA branch so the perf win still applies on single-pool models. Signed-off-by: Yueh-Ting Chen <yueh.ting.chen@gmail.com>
…uling loop (NVIDIA#13012) Signed-off-by: Chenghao Zhang <211069071+nvchenghaoz@users.noreply.github.com>
Summary
Four targeted changes to reduce Python/C++ API call overhead per decode step in `ADEngine._prepare_inputs()`.
`get_last_tokens` instead of `get_token` + `get_num_tokens`: replace the two-call pattern `get_token(0, get_num_tokens(0) - 1)` with `get_last_tokens(0)` — retrieves the last committed token in one C++ round-trip, saving one call per non-overlap decode request per step. Also aligns `ad_executor.py` with the pattern already used in `pyexecutor/model_engine.py`.
Inline `get_num_kv_blocks`: replace `kv_cache_manager.get_num_kv_blocks(end_compute_i)` with the equivalent pure-Python ceildiv expression, saving a Python function-call overhead per request per step.
Hoist mamba `hasattr` check: move `hasattr(kv_cache_manager, "mamba_cache_index")` outside the per-request loop into `_use_mamba` to avoid repeated attribute lookup.
Remove redundant `position_ids` computation in `SequenceInfo.nest_sequences()`: position_ids were computed here and then recomputed again downstream. Removing the duplicate saves one NumPy array construction and `torch.from_numpy()` call on every prefill step.
Perf
Benchmarked on Gemma4 26B MoE (single H100, ISL=1000/OSL=1000, no MLIR):
The ~60µs absolute saving is consistent at conc=1 where GPU step time is short enough that CPU scheduling is on the critical path. At higher concurrencies the GPU step dominates and scheduling overhead is fully pipelined.
Test plan
🤖 Generated with Claude Code