
[https://nvbugs/5637012][fix] Fix helix unit tests #9369

Merged
brb-nv merged 1 commit into NVIDIA:main from brb-nv:user/brb/fix-helix-tests
Nov 24, 2025

Conversation

@brb-nv
Collaborator

@brb-nv brb-nv commented Nov 21, 2025

Description

Helix unit tests are broken. Fortunately, fixing them is just a matter of passing position_ids properly to mla_rope_generation.
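
For illustration, a minimal sketch of the idea (the helper name and demo values are hypothetical; position_ids, cp_size, and mla_rope_generation are the names used in this PR): the helix position offsets are simply the position_ids, but only when context parallelism is active.

from typing import Optional

import torch


def compute_helix_position_offsets(
    position_ids: Optional[torch.Tensor], cp_size: int
) -> Optional[torch.Tensor]:
    # Helix (context-parallel) generation needs explicit position offsets;
    # single-rank runs keep passing None so the non-CP path is unchanged.
    return position_ids if cp_size > 1 else None


# Toy check: offsets mirror position_ids only when cp_size > 1.
pos = torch.tensor([[7, 12]])
assert compute_helix_position_offsets(pos, cp_size=1) is None
assert torch.equal(compute_helix_position_offsets(pos, cp_size=2), pos)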

Test Coverage

$ pytest tests/unittest/_torch/modules/test_mla_helix.py -s -v

PR Checklist

Please review the following before submitting your PR:

  • PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.

  • PR follows the TRT-LLM CODING GUIDELINES to the best of your knowledge.

  • Test cases are provided for new code paths (see test instructions)

  • Any new dependencies have been scanned for license and vulnerabilities

  • CODEOWNERS updated if ownership changes

  • Documentation updated as needed

  • Update tava architecture diagram if there is a significant design change in PR.

  • The reviewers assigned automatically/manually are appropriate for the PR.

  • Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...

Provides a user-friendly way for developers to interact with the Jenkins server.

Run /bot [-h|--help] to print this help message.

See details below for each supported subcommand.

Details

run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug(experimental)]

Launch build/test pipelines. All previously running jobs will be killed.

--reuse-test (optional)pipeline-id (OPTIONAL) : Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline, or from the last pipeline if no pipeline-id is indicated. If the Git commit ID has changed, this option will always be ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.

--disable-reuse-test (OPTIONAL) : Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. This ensures that all builds and tests are run regardless of previous successes.

--disable-fail-fast (OPTIONAL) : Disable fail fast on build/tests/infra failures.

--skip-test (OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.

--stage-list "A10-PyTorch-1, xxx" (OPTIONAL) : Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.

--gpu-type "A30, H100_PCIe" (OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.

--test-backend "pytorch, cpp" (OPTIONAL) : Skip test stages which don't match the specified backends. Only support [pytorch, cpp, tensorrt, triton]. Examples: "pytorch, cpp" (does not run test stages with tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.

--only-multi-gpu-test (OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.

--disable-multi-gpu-test (OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.

--add-multi-gpu-test (OPTIONAL) : Force run the multi-GPU tests in addition to running L0 pre-merge pipeline.

--post-merge (OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.

--extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" (OPTIONAL) : Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".

--detailed-log (OPTIONAL) : Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.

--debug (OPTIONAL) : Experimental feature. Enable access to the CI container for debugging purposes. Note: Specify exactly one stage in the stage-list parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.

For guidance on mapping tests to stage names, see docs/source/reference/ci-overview.md
and the scripts/test_to_stage_mapping.py helper.

kill

kill

Kill all running builds associated with the pull request.

skip

skip --comment COMMENT

Skip testing for the latest commit on the pull request. --comment "Reason for skipping build/test" is required. IMPORTANT NOTE: This is dangerous, since skipping tests without care and validation can break the top of tree.

reuse-pipeline

reuse-pipeline

Reuse a previous pipeline to validate the current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous, since reusing results without care and validation can break the top of tree.

@brb-nv brb-nv requested a review from a team as a code owner November 21, 2025 19:57
@coderabbitai
Contributor

coderabbitai bot commented Nov 21, 2025

📝 Walkthrough

This PR adds runtime logging and threads position_ids and helix_position_offsets through MLA RoPE generation code paths across C++ kernels and Python modules to support context parallelism debugging and observability.

Changes

  • C++ MLA logging (cpp/tensorrt_llm/kernels/mlaKernels.cu, cpp/tensorrt_llm/thop/attentionOp.cpp, cpp/tensorrt_llm/thop/dsv3RopeOp.cpp): Added runtime debug prints reporting whether mla_helix_position_offsets or helix_position_offsets_ptr is set during MLA RoPE generation. Minor comment punctuation adjustment. No control-flow changes.
  • Python MLA RoPE generation logging (tensorrt_llm/_torch/attention_backend/trtllm.py): Added debug print statements for helix_position_offsets in TrtllmAttentionWrapper.mla_rope_generation and TrtllmAttention.mla_rope_generation. No control-flow changes.
  • Python attention module refactoring (tensorrt_llm/_torch/modules/attention.py): Added debugging traces for position_ids in MLA attention forward paths. Expanded position_ids propagation through forward_generation_dsa and forward_absorption_generation. Introduced helix_position_offsets computation for context parallelism (when cp_size > 1) and threaded it through RoPE/MLA pathways.

Sequence Diagram

sequenceDiagram
    participant AttentionModule as attention.py
    participant RopeBackend as trtllm.py<br/>(RoPE Backend)
    participant CppKernel as C++ MLA<br/>Kernel

    rect rgb(200, 220, 255)
    Note over AttentionModule: Generation Path
    AttentionModule->>AttentionModule: Compute helix_position_offsets<br/>(from position_ids if cp_size > 1)
    AttentionModule->>RopeBackend: mla_rope_generation(helix_position_offsets)
    RopeBackend->>RopeBackend: Debug log helix_position_offsets
    RopeBackend->>CppKernel: Invoke MLA kernel
    end

    rect rgb(220, 200, 255)
    Note over CppKernel: C++ Kernel Layer
    CppKernel->>CppKernel: Debug log if<br/>mla_helix_position_offsets is set
    CppKernel->>CppKernel: Execute RoPE generation
    end

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~12 minutes

  • Python attention module (tensorrt_llm/_torch/modules/attention.py): Requires verification of parameter threading logic and helix_position_offsets computation correctness for context parallelism; confirm position_ids propagation through generation paths is complete and backward-compatible.
  • C++ logging additions: Verify debug print statements don't impact performance and log format is consistent across files.

Pre-merge checks and finishing touches

❌ Failed checks (1 inconclusive)
  • Title check — ❓ Inconclusive: The title relates to fixing helix unit tests and references a valid NVBugs ticket, but the raw summary shows the changes are debug-logging additions across multiple files, with no indication that these changes fix the helix unit tests. Resolution: clarify whether the debug-logging additions actually fix the helix unit tests, or update the title to better reflect the debug-logging instrumentation work.
✅ Passed checks (2 passed)
  • Docstring Coverage — ✅ Passed: No functions found in the changed files to evaluate docstring coverage; skipping the docstring coverage check.
  • Description check — ✅ Passed: The PR description clearly explains the issue and solution: helix unit tests are broken and need position_ids passed to mla_rope_generation. A test coverage command is provided.

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

🧹 Nitpick comments (5)
cpp/tensorrt_llm/kernels/mlaKernels.cu (1)

1045-1052: Use TLLM logging (or env‑gated debug) instead of raw printf in invokeMLARopeGeneration

The null check on params.helix_position_offsets is fine, but unconditional printf in this hot launcher will spam stdout and bypass existing TLLM_LOG_* controls. Consider replacing with a single debug log (e.g., TLLM_LOG_DEBUG) or guarding under a dedicated env/debug flag so production runs are not flooded by these messages.

cpp/tensorrt_llm/thop/attentionOp.cpp (1)

228-238: Avoid printf for MLA helix offsets; prefer TLLM logging macros

The logic of wiring mla_helix_position_offsets into mla_params.helix_position_offsets is correct, and leaving the pointer as its default nullptr when has_value() is false is fine. However, the unconditional printf calls in both branches will generate a lot of stdout noise during normal runs.

Please switch these to TLLM_LOG_DEBUG (or another existing logging macro) and/or guard them behind a debug flag so they don’t spam logs in production.

cpp/tensorrt_llm/thop/dsv3RopeOp.cpp (1)

101-108: Consolidate and gate MLA helix‑offset logging in dsv3RopeOp

Wiring of helix_position_offsets_ptr into MlaRopeGenArgs and then into mla_params.helix_position_offsets is correct. However:

  • Both invokeMLARopeGenerationHelper and MLARopeGeneration unconditionally printf whether the pointer is set, so each generation call produces two stdout lines.
  • These logs bypass the existing TLLM_LOG_* infrastructure and will be very noisy in real workloads.

Consider:

  • Emitting a single TLLM_LOG_DEBUG (or similar) log at the top level instead of multiple printfs, and
  • Guarding it behind a debug/env flag so normal runs aren’t flooded.

Also applies to: 109-116, 171-180

tensorrt_llm/_torch/attention_backend/trtllm.py (1)

1736-1737: Replace raw print of helix_position_offsets with structured, gated logging

mla_rope_generation now unconditionally does:

print("[TrtllmAttention::mla_rope_generation] helix_position_offsets", helix_position_offsets)

On real runs this can be extremely noisy and slow, especially if helix_position_offsets is a large tensor and this function is called per layer/step.

Since logger is already available, consider instead:

  • Using logger.debug and logging a summary (e.g., None vs .shape/.dtype), and
  • Gating it behind a debug flag (or only enabling in unit-test mode),

so that production inference/training isn’t flooded with console output.
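
A rough sketch of that suggestion, assuming the module-level logger the review notes is already available (imported here as tensorrt_llm.logger.logger); the TLLM_MLA_DEBUG environment flag and the helper name are hypothetical, not existing code:

import os
from typing import Optional

import torch
from tensorrt_llm.logger import logger

# Hypothetical opt-in gate so production runs stay quiet by default.
_MLA_DEBUG = os.environ.get("TLLM_MLA_DEBUG", "0") == "1"


def log_helix_offsets(helix_position_offsets: Optional[torch.Tensor]) -> None:
    # Log a compact summary (shape/dtype) instead of the full tensor.
    if not _MLA_DEBUG:
        return
    if helix_position_offsets is None:
        logger.debug("mla_rope_generation: helix_position_offsets=None")
    else:
        logger.debug(
            "mla_rope_generation: helix_position_offsets shape=%s dtype=%s"
            % (tuple(helix_position_offsets.shape), helix_position_offsets.dtype)
        )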

tensorrt_llm/_torch/modules/attention.py (1)

1047-1048: Avoid unconditional print in MLA paths; use logger.debug and/or unit‑test gating

There are several new print calls:

  • MLA._attn_forward_gen: prints position_ids every generation call.
  • MLA.forward: prints position_ids at the start of each forward.
  • forward_absorption_generation: computes helix_position_offsets and passes it down; logging is elsewhere but this function is on the hot path.

Direct print of full position_ids (potentially large tensors) in these frequently‑invoked paths will:

  • Severely spam stdout in real runs (and when using multiple layers),
  • Add noticeable overhead, and
  • Bypass the existing tensorrt_llm.logger infrastructure.

Since this module already imports logger, please either:

  • Remove these prints entirely, or
  • Guard them under a debug/test flag (e.g., if self.enable_unit_test:) and switch to logger.debug, logging only a summary such as None vs position_ids.shape rather than the full tensor.

Example pattern:

if self.enable_unit_test:
    logger.debug(
        "[MLA::attn_forward_gen] position_ids=%s",
        None if position_ids is None else tuple(position_ids.shape),
    )

This keeps the helpful diagnostics for unit tests while avoiding noise and overhead in production.

Also applies to: 1716-1719, 1732-1736, 1745-1757, 2088-2090

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 9b2abb8 and 2385790.

📒 Files selected for processing (5)
  • cpp/tensorrt_llm/kernels/mlaKernels.cu (1 hunks)
  • cpp/tensorrt_llm/thop/attentionOp.cpp (1 hunks)
  • cpp/tensorrt_llm/thop/dsv3RopeOp.cpp (2 hunks)
  • tensorrt_llm/_torch/attention_backend/trtllm.py (1 hunks)
  • tensorrt_llm/_torch/modules/attention.py (7 hunks)
🧰 Additional context used
🧠 Learnings (7)
📚 Learning: 2025-09-23T15:01:00.070Z
Learnt from: nv-lschneider
Repo: NVIDIA/TensorRT-LLM PR: 7910
File: cpp/tensorrt_llm/kernels/nccl_device/config.cu:15-17
Timestamp: 2025-09-23T15:01:00.070Z
Learning: In TensorRT-LLM NCCL device kernels, the <sstream> header is not needed as an explicit include in config.cu because it's provided transitively through other headers. Local compilation testing confirms this works without the explicit include.

Applied to files:

  • cpp/tensorrt_llm/kernels/mlaKernels.cu
📚 Learning: 2025-11-14T11:22:03.729Z
Learnt from: nzmora-nvidia
Repo: NVIDIA/TensorRT-LLM PR: 9163
File: tensorrt_llm/_torch/auto_deploy/custom_ops/quant.py:107-113
Timestamp: 2025-11-14T11:22:03.729Z
Learning: In TensorRT-LLM AutoDeploy custom ops, when adding hardware capability checks to select between kernel implementations (e.g., cuBLAS vs. CUDA kernel), use descriptive variable names that identify the specific GPU architectures or families being targeted (e.g., `is_blackwell_geforce_or_ada`) rather than generic names like `enable_cuda_core`. This makes it clear that the code is selecting an implementation path based on hardware capabilities, not enabling/disabling hardware features.

Applied to files:

  • cpp/tensorrt_llm/kernels/mlaKernels.cu
📚 Learning: 2025-08-21T02:39:12.009Z
Learnt from: djns99
Repo: NVIDIA/TensorRT-LLM PR: 7104
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu:1475-1480
Timestamp: 2025-08-21T02:39:12.009Z
Learning: The min latency mode functionality in TensorRT-LLM MOE kernels (cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu) is deprecated and no longer being maintained/updated, as confirmed by djns99. Bug reports and optimization suggestions for the computeStridesTmaWarpSpecializedLowLatencyKernel and related min latency code paths should be deprioritized.

Applied to files:

  • cpp/tensorrt_llm/kernels/mlaKernels.cu
📚 Learning: 2025-08-14T15:38:01.771Z
Learnt from: MatthiasKohl
Repo: NVIDIA/TensorRT-LLM PR: 6904
File: cpp/tensorrt_llm/pybind/thop/bindings.cpp:55-57
Timestamp: 2025-08-14T15:38:01.771Z
Learning: In TensorRT-LLM Python bindings, tensor parameter collections like mla_tensor_params and spec_decoding_tensor_params are kept as required parameters without defaults to maintain API consistency, even when it might affect backward compatibility.

Applied to files:

  • cpp/tensorrt_llm/kernels/mlaKernels.cu
📚 Learning: 2025-08-14T21:04:50.248Z
Learnt from: thorjohnsen
Repo: NVIDIA/TensorRT-LLM PR: 6910
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:0-0
Timestamp: 2025-08-14T21:04:50.248Z
Learning: In KV cache onboarding logic during prefill in cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp, when calculating which blocks fall within the attention window, use getTokensPerBlock() to advance token indices rather than block->getUniqueTokens().size(), because the calculation needs to consider the post-prefill state where blocks will be filled to capacity, not their current token count.

Applied to files:

  • cpp/tensorrt_llm/thop/attentionOp.cpp
📚 Learning: 2025-08-14T15:43:23.107Z
Learnt from: MatthiasKohl
Repo: NVIDIA/TensorRT-LLM PR: 6904
File: tensorrt_llm/_torch/attention_backend/trtllm.py:259-262
Timestamp: 2025-08-14T15:43:23.107Z
Learning: In TensorRT-LLM's attention backend, tensor parameters in the plan() method are assigned directly without validation (dtype, device, contiguity checks). This maintains consistency across all tensor inputs and follows the pattern of trusting callers to provide correctly formatted tensors.

Applied to files:

  • tensorrt_llm/_torch/attention_backend/trtllm.py
📚 Learning: 2025-08-06T13:58:07.506Z
Learnt from: galagam
Repo: NVIDIA/TensorRT-LLM PR: 6487
File: tests/unittest/_torch/auto_deploy/unit/singlegpu/test_ad_trtllm_bench.py:1-12
Timestamp: 2025-08-06T13:58:07.506Z
Learning: In TensorRT-LLM, test files (files under tests/ directories) do not require NVIDIA copyright headers, unlike production source code files. Test files typically start directly with imports, docstrings, or code.

Applied to files:

  • tensorrt_llm/_torch/attention_backend/trtllm.py
🧬 Code graph analysis (2)
tensorrt_llm/_torch/attention_backend/trtllm.py (1)
cpp/tensorrt_llm/kernels/mlaKernels.h (1)
  • helix_position_offsets (109-110)
tensorrt_llm/_torch/modules/attention.py (3)
tensorrt_llm/_utils.py (1)
  • get_sm_version (740-742)
cpp/tensorrt_llm/kernels/mlaKernels.h (1)
  • helix_position_offsets (109-110)
tensorrt_llm/_torch/distributed/communicator.py (1)
  • cp_size (55-56)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check
🔇 Additional comments (1)
tensorrt_llm/_torch/modules/attention.py (1)

1043-1062: Helix position offsets propagation for MLA generation looks consistent with context path

The new wiring of position_ids/helix offsets through the MLA generation stack appears sound:

  • _attn_forward_gen passes helix_position_offsets=position_ids to the backend only when self.mapping.cp_size > 1, matching how context MLA uses position_ids for Helix.
  • forward_generation_dsa now forwards position_ids into forward_absorption_generation, which computes helix_position_offsets = position_ids if self.mapping.cp_size > 1 else None and threads it into self.mqa.mla_rope_generation.
  • The same position_ids tensor is already used for RoPE and is truncated to num_tokens earlier, so indexing in the CUDA side via flattened helix_position_offsets remains consistent with existing context MLA behavior.

No functional issues stand out here; this looks like the right way to expose Helix offsets for generation MLA while keeping non‑CP paths unchanged.

Also applies to: 1716-1718, 1732-1736, 1745-1757, 1377-1395
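
To make that wiring concrete, a self-contained toy sketch of the pattern these bullets describe; the ToyMapping/ToyMQA stand-ins and the simplified argument lists are placeholders, not the real TensorRT-LLM classes:

from dataclasses import dataclass
from typing import Optional

import torch


@dataclass
class ToyMapping:
    cp_size: int  # stand-in for the real Mapping.cp_size


class ToyMQA:
    def mla_rope_generation(self, q, helix_position_offsets=None):
        # The real backend forwards this to the MLA RoPE kernel;
        # the toy just returns what it received.
        return helix_position_offsets


class ToyMLA:
    def __init__(self, mapping: ToyMapping):
        self.mapping = mapping
        self.mqa = ToyMQA()

    def forward_absorption_generation(self, q, position_ids: Optional[torch.Tensor]):
        # Offsets are derived from position_ids only when context
        # parallelism (Helix) is active; otherwise None keeps the
        # non-CP path unchanged.
        helix_position_offsets = (
            position_ids if self.mapping.cp_size > 1 else None
        )
        return self.mqa.mla_rope_generation(
            q, helix_position_offsets=helix_position_offsets
        )


pos = torch.tensor([[3, 4, 5]])
print(ToyMLA(ToyMapping(cp_size=1)).forward_absorption_generation(None, pos))  # None
print(ToyMLA(ToyMapping(cp_size=2)).forward_absorption_generation(None, pos))  # tensor([[3, 4, 5]])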

@brb-nv brb-nv force-pushed the user/brb/fix-helix-tests branch from f20714d to e68fcd1 on November 21, 2025 20:48
@brb-nv
Collaborator Author

brb-nv commented Nov 21, 2025

/bot run --disable-fail-fast

@brb-nv brb-nv force-pushed the user/brb/fix-helix-tests branch from e68fcd1 to 571bdb6 on November 21, 2025 20:52
@brb-nv
Collaborator Author

brb-nv commented Nov 21, 2025

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #25395 [ run ] triggered by Bot. Commit: 571bdb6

@tensorrt-cicd
Collaborator

PR_Github #25396 [ run ] triggered by Bot. Commit: 571bdb6

@tensorrt-cicd
Collaborator

PR_Github #25395 [ run ] completed with state ABORTED. Commit: 571bdb6

@tensorrt-cicd
Collaborator

PR_Github #25396 [ run ] completed with state SUCCESS. Commit: 571bdb6
/LLM/main/L0_MergeRequest_PR pipeline #19214 completed with status: 'SUCCESS'
Pipeline passed with automatic retried tests. Check the rerun report for details.

@brb-nv brb-nv enabled auto-merge (squash) November 22, 2025 03:45
Signed-off-by: Balaram Buddharaju <169953907+brb-nv@users.noreply.github.com>
@brb-nv brb-nv force-pushed the user/brb/fix-helix-tests branch from 571bdb6 to 58ec218 on November 22, 2025 21:19
@brb-nv
Collaborator Author

brb-nv commented Nov 22, 2025

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #25433 [ run ] triggered by Bot. Commit: 58ec218

@tensorrt-cicd
Collaborator

PR_Github #25433 [ run ] completed with state SUCCESS. Commit: 58ec218
/LLM/main/L0_MergeRequest_PR pipeline #19247 completed with status: 'FAILURE'

@yuxianq
Collaborator

yuxianq commented Nov 24, 2025

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #25470 [ run ] triggered by Bot. Commit: 58ec218

@tensorrt-cicd
Collaborator

PR_Github #25470 [ run ] completed with state SUCCESS. Commit: 58ec218
/LLM/main/L0_MergeRequest_PR pipeline #19284 completed with status: 'SUCCESS'
Pipeline passed with automatic retried tests. Check the rerun report for details.

@brb-nv brb-nv merged commit c045e35 into NVIDIA:main Nov 24, 2025
5 checks passed
codego7250 pushed a commit to codego7250/TensorRT-LLM that referenced this pull request Dec 11, 2025
Signed-off-by: Balaram Buddharaju <169953907+brb-nv@users.noreply.github.com>