[https://nvbugs/6221483][fix] AutoDeploy: Fix Eagle metadata host syncs #14714

Open
govind-ramnarayan wants to merge 1 commit into NVIDIA:main from nv-auto-deploy:gramnarayan/nvbug-6221483

@govind-ramnarayan govind-ramnarayan commented May 29, 2026

Summary

  • Make AutoDeploy D2H metadata mirroring blocking so reused pinned host buffers are safe under overlap scheduling.
  • Add active_args_override for metadata mutation helpers so callers can narrow host mirroring to the next consumer's graph inputs.
  • Use the draft model placeholders in the Eagle draft loop, avoiding target-only Mamba host metadata syncs during drafting.
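
The scoping idea in the last two bullets can be pictured with a small pure-Python sketch. It is a toy model under stated assumptions, not the AutoDeploy implementation: `host_args_to_sync` is a hypothetical helper, and the only convention borrowed from this PR is that a host mirror is named `<arg>_host`.

```python
from typing import Iterable, List, Optional, Set


def host_args_to_sync(
    updated_args: Iterable[str],
    active_args: Set[str],
    active_args_override: Optional[Set[str]] = None,
) -> List[str]:
    """Return the `<name>_host` mirrors that would need a D2H copy.

    Hypothetical helper: a mirror is synced only if the metadata it shadows
    was just updated AND the next consumer takes `<name>_host` as an input.
    """
    consumers = active_args if active_args_override is None else active_args_override
    return [f"{name}_host" for name in updated_args if f"{name}_host" in consumers]


# Assumed example: the target graph consumes more host metadata than the draft graph.
target_args = {"input_ids", "cu_seqlen_host", "seq_len_with_cache_host"}
draft_args = {"input_ids", "cu_seqlen_host"}
updated = ["cu_seqlen", "seq_len_with_cache"]

# Without an override, every updated mirror that the active graph consumes is synced.
full_sync = host_args_to_sync(updated, target_args)
# With the draft placeholders as override, the target-only mirror is skipped.
draft_sync = host_args_to_sync(updated, target_args, active_args_override=draft_args)
```

In this toy, narrowing the consumer set to the draft placeholders drops the target-only mirror from the sync list, which is the effect the third bullet describes for the Eagle draft loop.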

Validation

  • CUDA_VISIBLE_DEVICES=0 PYTHONPATH=$PWD pytest -sv tests/unittest/auto_deploy/singlegpu/custom_ops/test_switch_to_generate_inplace.py
  • PYTHONPATH=$PWD LLM_MODELS_ROOT=/lustre/fs1/portfolios/coreai/projects/coreai_comparch_autodeploy/autodeploy_data/llm-models-fake CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 pytest -sv tests/integration/defs/accuracy/test_llm_api_autodeploy.py::TestNemotronSuperV3::test_mtp[fp8_ws8_80gb-trtllm]
  • A/B sanity: reverting this fix reproduces the indexSelectSmallIndex / CUDA error: device-side assert triggered failure on the same full repro.

Summary by CodeRabbit

  • Refactor

    • Optimized device-to-host synchronization in KV-cache inference to selectively sync only required buffers, improving performance for graph-transformed model inference.
  • Tests

    • Added test coverage to verify selective synchronization behavior during inference operations.

coderabbitai Bot commented May 29, 2026

📝 Walkthrough

This PR enables selective host-mirror D2H synchronization in KV-cache inference by adding optional active_args_override parameters to SequenceInfo metadata update methods. Host copies now block, and a helper method computes which host buffers require sync after updates, filtered by consumer argument requirements. Eagle model integration extracts placeholder names and passes them as overrides during draft loops to skip redundant host copies. Tests verify out-of-scope host mirrors remain unsynced.

Changes

Host-mirror selective synchronization

| Layer | File(s) | Summary |
| --- | --- | --- |
| Host buffer sync foundation | tensorrt_llm/_torch/auto_deploy/custom_ops/attention_interface.py | InputBuffer.copy_to_host() switches packed and truncatable tensor copies from non-blocking to blocking mode. SequenceInfo._active_host_update_args() helper computes which host args require D2H sync after metadata updates, optionally filtered by active_args_override. |
| Selective sync in offset_pos_and_cache_() | tensorrt_llm/_torch/auto_deploy/custom_ops/attention_interface.py | Method signature updated to accept optional active_args_override parameter. Implementation uses _active_host_update_args() with override to scope D2H syncs to only host args needed by next consumer. |
| Selective sync in switch_to_generate_() | tensorrt_llm/_torch/auto_deploy/custom_ops/attention_interface.py | Method signature updated to accept optional active_args_override parameter. Implementation delegates to _active_host_update_args() to determine D2H sync scope based on override. |
| Eagle model draft-loop integration | tensorrt_llm/_torch/auto_deploy/models/custom/modeling_eagle.py | EagleWrapper._submodule_placeholder_names() extracts GraphModule placeholder argument names. During draft loop, draft_arg_names computed once from model placeholders and passed as active_args_override to both metadata update calls, skipping redundant host copies. |
| Host-mirror scoping verification tests | tests/unittest/auto_deploy/singlegpu/custom_ops/test_switch_to_generate_inplace.py | Two new tests verify out-of-scope host mirrors remain unchanged: first confirms cu_seqlen_host unchanged during switch_to_generate_() with override limiting to input_ids; second confirms seq_len_with_cache_host not synced when offset_pos_and_cache_() scoped to input_ids. |

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Suggested reviewers

  • bmarimuthu-nv
  • tcherckez-nvidia

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description check | ✅ Passed | The PR description includes a clear summary of changes, validation commands with full reproduction steps, and A/B sanity testing notes, but lacks explicit Test Coverage section. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Title check | ✅ Passed | The title directly addresses the main change: fixing AutoDeploy Eagle metadata host syncs, which aligns with the PR objectives of fixing device-to-host metadata mirroring and narrowing host syncs. |

@coderabbitai coderabbitai Bot left a comment

🧹 Nitpick comments (2)
tests/unittest/auto_deploy/singlegpu/custom_ops/test_switch_to_generate_inplace.py (2)

378-460: QA list update is not needed for this PR scope.

This change is unit-test-only (tests/unittest/...), so no tests/integration/test_lists/qa/* update is required.

As per coding guidelines "If the PR only touches unittest/ or narrow unit scope, say explicitly whether QA list updates are unnecessary or optional."

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In
`@tests/unittest/auto_deploy/singlegpu/custom_ops/test_switch_to_generate_inplace.py`
around lines 378 - 460, The PR only modifies unit tests under tests/unittest/...
(specifically test_switch_to_generate_inplace.py and functions like
test_out_of_scope_active_host_mirror_not_synced_by_switch_to_generate and
related tests), so explicitly state in the PR description that no QA list update
is necessary per the guideline for changes limited to unittest/ or narrow unit
scope; update the PR description (or add a short note to the PR checklist)
saying "QA list update not required for unit-test-only changes" so reviewers see
this decision without changing tests.

378-394: ⚡ Quick win

Add positive scoped-sync cases for active_args_override.

These additions validate the “excluded host arg stays stale” path, but they don’t explicitly validate the complementary “included host arg is synced” path when override contains <arg>_host. Please add one positive scoped test for each API (switch_to_generate_, offset_pos_and_cache_) to lock down both sides of the contract.

As per coding guidelines "Coverage expectations: Assess whether new/changed tests cover happy path, important edge cases, and failure modes relevant to the feature or fix."

Also applies to: 443-460

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In
`@tests/unittest/auto_deploy/singlegpu/custom_ops/test_switch_to_generate_inplace.py`
around lines 378 - 394, The test currently only checks that excluded host args
remain stale after calling switch_to_generate_; add a complementary positive
scoped-sync test that passes an active_args_override including the host
placeholder name (e.g., "input_ids_host") and asserts the host mirror is synced
to staging values (mirror in si._input_buffer.get_host_view("cu_seqlen") matches
expected), and likewise add a matching positive case for offset_pos_and_cache_
to verify that when the override contains the host placeholder the host mirror
is synchronized; locate and modify the test functions around
test_out_of_scope_active_host_mirror_not_synced_by_switch_to_generate and the
analogous block for lines ~443-460 to add these assertions calling
si.switch_to_generate_ and si.offset_pos_and_cache_ with active_args_override
that includes "<arg>_host" and assert the host view equals the expected staged
values.
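
The two sides of the contract requested above (excluded mirror stays stale, included mirror is synced) can be sketched against a toy stand-in. Everything here is hypothetical: `ToySequenceInfo`, its `device` and `host` dicts, and the values it produces are illustration only and do not reflect the real `SequenceInfo` internals or the actual tests in this PR.

```python
from typing import Dict, List, Optional, Set


class ToySequenceInfo:
    """Hypothetical stand-in for SequenceInfo: device values plus host mirrors."""

    def __init__(self) -> None:
        self.active_args: Set[str] = {"input_ids", "cu_seqlen_host"}
        self.device: Dict[str, List[int]] = {"cu_seqlen": [0, 4, 9]}
        self.host: Dict[str, List[int]] = {"cu_seqlen": [0, 4, 9]}

    def switch_to_generate_(self, active_args_override: Optional[Set[str]] = None) -> None:
        # Toy generate phase: one token per sequence, so cu_seqlen becomes 0, 1, 2, ...
        num_seqs = len(self.device["cu_seqlen"]) - 1
        self.device["cu_seqlen"] = list(range(num_seqs + 1))
        consumers = self.active_args if active_args_override is None else active_args_override
        if "cu_seqlen_host" in consumers:
            # Stands in for the blocking D2H copy into the host mirror.
            self.host["cu_seqlen"] = list(self.device["cu_seqlen"])


def test_excluded_host_mirror_stays_stale() -> None:
    si = ToySequenceInfo()
    si.switch_to_generate_(active_args_override={"input_ids"})
    assert si.host["cu_seqlen"] == [0, 4, 9]  # mirror untouched


def test_included_host_mirror_is_synced() -> None:
    si = ToySequenceInfo()
    si.switch_to_generate_(active_args_override={"cu_seqlen_host"})
    assert si.host["cu_seqlen"] == [0, 1, 2]  # mirror matches device values


test_excluded_host_mirror_stays_stale()
test_included_host_mirror_is_synced()
```

Pairing the negative and positive case on the same fixture guards against an implementation that passes the "stale" test simply by never syncing anything.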

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: d2e01141-fcbf-410b-9fdf-cfbcbef6a2f8

📥 Commits

Reviewing files that changed from the base of the PR and between f6ba936 and f74aa4d.

📒 Files selected for processing (3)
  • tensorrt_llm/_torch/auto_deploy/custom_ops/attention_interface.py
  • tensorrt_llm/_torch/auto_deploy/models/custom/modeling_eagle.py
  • tests/unittest/auto_deploy/singlegpu/custom_ops/test_switch_to_generate_inplace.py

@govind-ramnarayan govind-ramnarayan changed the title [NVBUG-6221483][fix] fix AutoDeploy MTP metadata host sync [NVBUG-6221483][fix] AutoDeploy: Fix MTP metadata host syncs May 29, 2026
@govind-ramnarayan govind-ramnarayan changed the title [NVBUG-6221483][fix] AutoDeploy: Fix MTP metadata host syncs [https://nvbugs/6117814][fix] AutoDeploy: Fix MTP metadata host syncs May 29, 2026
@govind-ramnarayan govind-ramnarayan marked this pull request as draft May 29, 2026 00:25
@govind-ramnarayan govind-ramnarayan changed the title [https://nvbugs/6117814][fix] AutoDeploy: Fix MTP metadata host syncs [https://nvbugs/6221483][fix] AutoDeploy: Fix MTP metadata host syncs May 29, 2026
@govind-ramnarayan govind-ramnarayan marked this pull request as ready for review May 29, 2026 00:42
@govind-ramnarayan govind-ramnarayan marked this pull request as draft May 29, 2026 00:42
@govind-ramnarayan govind-ramnarayan changed the title [https://nvbugs/6221483][fix] AutoDeploy: Fix MTP metadata host syncs [https://nvbugs/6221483][fix] AutoDeploy: Fix Eagle metadata host syncs May 29, 2026
@govind-ramnarayan

Note: fold in an explicit error when trying to run flashinfer + Eagle + cudagraph (instead of silently changing to torch-simple)

@govind-ramnarayan govind-ramnarayan force-pushed the gramnarayan/nvbug-6221483 branch 2 times, most recently from 8b58b36 to fde6af6 Compare May 29, 2026 19:16
Comment thread tests/unittest/auto_deploy/singlegpu/shim/test_llm_config.py Outdated
Comment thread tests/unittest/auto_deploy/singlegpu/shim/test_llm_config.py Outdated
@govind-ramnarayan govind-ramnarayan force-pushed the gramnarayan/nvbug-6221483 branch 2 times, most recently from 9c31278 to 7cff219 Compare May 29, 2026 19:56
@govind-ramnarayan

Review follow-up pushed in 7cff2196d0.

Updates:

  • Removed brittle error-message matching in the FlashInfer + speculative + CUDA graph config test.
  • Removed the redundant torch-simple assertion.
  • Added positive scoped-sync tests for both switch_to_generate_() and offset_pos_and_cache_().
  • Added validation that active_args_override is a subset of active graph args, with unit coverage for both mutation APIs.
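
A subset check of the kind described in the last bullet might look like the sketch below. It is an assumption-laden illustration: `resolve_active_args` is a hypothetical name, and raising `ValueError` is a guess at the failure mode, since the PR conversation does not say which exception the real code uses.

```python
from typing import Optional, Set


def resolve_active_args(
    active_args: Set[str], active_args_override: Optional[Set[str]]
) -> Set[str]:
    """Hypothetical validation: an override may narrow the active args, never widen them."""
    if active_args_override is None:
        return set(active_args)
    unknown = set(active_args_override) - set(active_args)
    if unknown:
        # Assumed failure mode; the exception type in the real code is not stated.
        raise ValueError(
            f"active_args_override has names outside the active graph args: {sorted(unknown)}"
        )
    return set(active_args_override)


graph_args = {"input_ids", "cu_seqlen_host"}
narrowed = resolve_active_args(graph_args, {"input_ids"})
```

Rejecting unknown names early means a typo in the override fails loudly rather than silently skipping a host sync the consumer needs.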

QA list update not required: this PR adds focused unittest coverage only and does not add, remove, or rename integration tests/test-list entries.

Signed-off-by: Govind Ramnarayan <105831528+govind-ramnarayan@users.noreply.github.com>
@govind-ramnarayan govind-ramnarayan force-pushed the gramnarayan/nvbug-6221483 branch from 7cff219 to d8b9ad6 Compare May 29, 2026 19:59
@govind-ramnarayan govind-ramnarayan marked this pull request as ready for review May 29, 2026 20:00
@govind-ramnarayan

/bot run --stage-list "A10-Build_Docs, A10-PackageSanityCheck-PY310-UB2204, A100X-PackageSanityCheck-PY312-UB2404, A30-AutoDeploy-1, H100_PCIe-AutoDeploy-1, DGX_B200-AutoDeploy-1, A100X-PyTorch-1, DGX_H100-4_GPUs-AutoDeploy-1, DGX_B200-4_GPUs-AutoDeploy-1, DGX_H100-4_GPUs-AutoDeploy-Post-Merge-1"

@tensorrt-cicd

PR_Github #51100 [ run ] triggered by Bot. Commit: d8b9ad6 Link to invocation

@tensorrt-cicd

PR_Github #51100 [ run ] completed with state FAILURE. Commit: d8b9ad6
/LLM/main/L0_MergeRequest_PR pipeline #40538 (Partly Tested) completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

@govind-ramnarayan

/bot run --stage-list "A10-Build_Docs, A10-PackageSanityCheck-PY310-UB2204, A100X-PackageSanityCheck-PY312-UB2404, A30-AutoDeploy-1, H100_PCIe-AutoDeploy-1, DGX_B200-AutoDeploy-1, A100X-PyTorch-1, DGX_H100-4_GPUs-AutoDeploy-1, DGX_B200-4_GPUs-AutoDeploy-1, DGX_H100-4_GPUs-AutoDeploy-Post-Merge-1"

@tensorrt-cicd

PR_Github #51129 [ run ] triggered by Bot. Commit: d8b9ad6 Link to invocation

@tensorrt-cicd

PR_Github #51129 [ run ] completed with state SUCCESS. Commit: d8b9ad6
/LLM/main/L0_MergeRequest_PR pipeline #40565 (Partly Tested) completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation
