[https://nvbugs/5637012][fix] Fix helix unit tests #9369
Conversation
📝 Walkthrough: This PR adds runtime logging and threads helix position offsets through the MLA generation path.
Sequence Diagram

```mermaid
sequenceDiagram
    participant AttentionModule as attention.py
    participant RopeBackend as trtllm.py<br/>(RoPE Backend)
    participant CppKernel as C++ MLA<br/>Kernel

    rect rgb(200, 220, 255)
        Note over AttentionModule: Generation Path
        AttentionModule->>AttentionModule: Compute helix_position_offsets<br/>(from position_ids if cp_size > 1)
        AttentionModule->>RopeBackend: mla_rope_generation(helix_position_offsets)
        RopeBackend->>RopeBackend: Debug log helix_position_offsets
        RopeBackend->>CppKernel: Invoke MLA kernel
    end

    rect rgb(220, 200, 255)
        Note over CppKernel: C++ Kernel Layer
        CppKernel->>CppKernel: Debug log if<br/>mla_helix_position_offsets is set
        CppKernel->>CppKernel: Execute RoPE generation
    end
```
Estimated code review effort: 🎯 2 (Simple) | ⏱️ ~12 minutes
Pre-merge checks and finishing touches: ❌ Failed checks (1 inconclusive) | ✅ Passed checks (2 passed)
Actionable comments posted: 0
🧹 Nitpick comments (5)
cpp/tensorrt_llm/kernels/mlaKernels.cu (1)

1045-1052: Use TLLM logging (or env-gated debug) instead of raw `printf` in `invokeMLARopeGeneration`

The null check on `params.helix_position_offsets` is fine, but an unconditional `printf` in this hot launcher will spam stdout and bypass the existing `TLLM_LOG_*` controls. Consider replacing it with a single debug log (e.g., `TLLM_LOG_DEBUG`) or guarding it under a dedicated env/debug flag so production runs are not flooded by these messages.
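A minimal sketch of the suggested replacement inside `invokeMLARopeGeneration`, assuming the usual `TLLM_LOG_DEBUG` macro from `tensorrt_llm/common/logger.h` is available; the surrounding launcher body is elided:

```cpp
#include "tensorrt_llm/common/logger.h"

// Inside invokeMLARopeGeneration(...): emit the diagnostic at DEBUG level so it
// is controlled by the existing log-level machinery instead of always printing.
if (params.helix_position_offsets != nullptr)
{
    TLLM_LOG_DEBUG("invokeMLARopeGeneration: helix_position_offsets is set (ptr=%p)",
        static_cast<void const*>(params.helix_position_offsets));
}
else
{
    TLLM_LOG_DEBUG("invokeMLARopeGeneration: helix_position_offsets is not set");
}
```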
cpp/tensorrt_llm/thop/attentionOp.cpp (1)

228-238: Avoid `printf` for MLA helix offsets; prefer TLLM logging macros

The logic of wiring `mla_helix_position_offsets` into `mla_params.helix_position_offsets` is correct, and leaving the pointer as its default `nullptr` when `has_value()` is false is fine. However, the unconditional `printf` calls in both branches will generate a lot of stdout noise during normal runs.

Please switch these to `TLLM_LOG_DEBUG` (or another existing logging macro) and/or guard them behind a debug flag so they don't spam logs in production.
cpp/tensorrt_llm/thop/dsv3RopeOp.cpp (1)

101-108: Consolidate and gate MLA helix-offset logging in dsv3RopeOp

Wiring of `helix_position_offsets_ptr` into `MlaRopeGenArgs` and then into `mla_params.helix_position_offsets` is correct. However:

- Both `invokeMLARopeGenerationHelper` and `MLARopeGeneration` unconditionally `printf` whether the pointer is set, so each generation call produces two stdout lines.
- These logs bypass the existing `TLLM_LOG_*` infrastructure and will be very noisy in real workloads.

Consider:

- Emitting a single `TLLM_LOG_DEBUG` (or similar) log at the top level instead of multiple `printf`s, and
- Guarding it behind a debug/env flag so normal runs aren't flooded.
Also applies to: 109-116, 171-180
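If an explicit opt-in is preferred over log levels, one possible env-gated pattern is sketched below; the `TLLM_MLA_DEBUG_LOG` variable name and the helper are purely illustrative, not existing flags:

```cpp
#include <cstdlib>

// Illustrative helper: read a hypothetical TLLM_MLA_DEBUG_LOG env var once and
// cache the result, so per-call diagnostics can be enabled without recompiling.
static bool mlaDebugLogEnabled()
{
    static bool const enabled = []
    {
        char const* value = std::getenv("TLLM_MLA_DEBUG_LOG");
        return value != nullptr && value[0] == '1';
    }();
    return enabled;
}

// At the top level of MLARopeGeneration, replacing the two per-call printf lines:
// if (mlaDebugLogEnabled())
// {
//     TLLM_LOG_DEBUG("MLARopeGeneration: helix_position_offsets %s",
//         helix_position_offsets_ptr != nullptr ? "set" : "not set");
// }
```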
tensorrt_llm/_torch/attention_backend/trtllm.py (1)
1736-1737: Replace the raw `helix_position_offsets` print with structured, gated logging

`mla_rope_generation` now unconditionally does:

```python
print("[TrtllmAttention::mla_rope_generation] helix_position_offsets", helix_position_offsets)
```

On real runs this can be extremely noisy and slow, especially if `helix_position_offsets` is a large tensor and this function is called per layer/step.

Since `logger` is already available, consider instead:

- Using `logger.debug` and logging a summary (e.g., `None` vs `.shape`/`.dtype`), and
- Gating it behind a debug flag (or only enabling in unit-test mode),

so that production inference/training isn't flooded with console output.
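A minimal sketch of that pattern, assuming the module-level `logger` from `tensorrt_llm.logger` is the one already imported in `trtllm.py`; the helper names are illustrative, not existing functions:

```python
from typing import Optional

import torch

from tensorrt_llm.logger import logger


def _summarize(tensor: Optional[torch.Tensor]):
    # Log a compact summary (shape/dtype) instead of the full tensor contents.
    return None if tensor is None else (tuple(tensor.shape), tensor.dtype)


def log_helix_offsets(helix_position_offsets: Optional[torch.Tensor]) -> None:
    # Would be called from mla_rope_generation in place of the unconditional print.
    logger.debug(
        f"[TrtllmAttention::mla_rope_generation] helix_position_offsets="
        f"{_summarize(helix_position_offsets)}"
    )
```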
tensorrt_llm/_torch/modules/attention.py (1)
1047-1048: Avoid unconditional prints; prefer `logger.debug` and/or unit-test gating

There are several new prints on hot paths:

- `MLA._attn_forward_gen`: prints `position_ids` every generation call.
- `MLA.forward`: prints `position_ids` at the start of each forward.
- `forward_absorption_generation`: computes `helix_position_offsets` and passes it down; logging is elsewhere but this function is on the hot path.

Direct prints of `position_ids` (potentially large tensors) in these frequently-invoked paths will:

- Severely spam stdout in real runs (and when using multiple layers),
- Add noticeable overhead, and
- Bypass the existing `tensorrt_llm.logger` infrastructure.

Since this module already imports `logger`, please either:

- Remove these prints entirely, or
- Guard them under a debug/test flag (e.g., `if self.enable_unit_test:`) and switch to `logger.debug`, logging only a summary such as `None` vs `position_ids.shape` rather than the full tensor.

Example pattern:

```python
if self.enable_unit_test:
    logger.debug(
        "[MLA::attn_forward_gen] position_ids=%s",
        None if position_ids is None else tuple(position_ids.shape),
    )
```

This keeps the helpful diagnostics for unit tests while avoiding noise and overhead in production.
Also applies to: 1716-1719, 1732-1736, 1745-1757, 2088-2090
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (5)
- `cpp/tensorrt_llm/kernels/mlaKernels.cu` (1 hunks)
- `cpp/tensorrt_llm/thop/attentionOp.cpp` (1 hunks)
- `cpp/tensorrt_llm/thop/dsv3RopeOp.cpp` (2 hunks)
- `tensorrt_llm/_torch/attention_backend/trtllm.py` (1 hunks)
- `tensorrt_llm/_torch/modules/attention.py` (7 hunks)
🧰 Additional context used
🧠 Learnings (7)
📚 Learning: 2025-09-23T15:01:00.070Z
Learnt from: nv-lschneider
Repo: NVIDIA/TensorRT-LLM PR: 7910
File: cpp/tensorrt_llm/kernels/nccl_device/config.cu:15-17
Timestamp: 2025-09-23T15:01:00.070Z
Learning: In TensorRT-LLM NCCL device kernels, the <sstream> header is not needed as an explicit include in config.cu because it's provided transitively through other headers. Local compilation testing confirms this works without the explicit include.
Applied to files:
cpp/tensorrt_llm/kernels/mlaKernels.cu
📚 Learning: 2025-11-14T11:22:03.729Z
Learnt from: nzmora-nvidia
Repo: NVIDIA/TensorRT-LLM PR: 9163
File: tensorrt_llm/_torch/auto_deploy/custom_ops/quant.py:107-113
Timestamp: 2025-11-14T11:22:03.729Z
Learning: In TensorRT-LLM AutoDeploy custom ops, when adding hardware capability checks to select between kernel implementations (e.g., cuBLAS vs. CUDA kernel), use descriptive variable names that identify the specific GPU architectures or families being targeted (e.g., `is_blackwell_geforce_or_ada`) rather than generic names like `enable_cuda_core`. This makes it clear that the code is selecting an implementation path based on hardware capabilities, not enabling/disabling hardware features.
Applied to files:
cpp/tensorrt_llm/kernels/mlaKernels.cu
📚 Learning: 2025-08-21T02:39:12.009Z
Learnt from: djns99
Repo: NVIDIA/TensorRT-LLM PR: 7104
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu:1475-1480
Timestamp: 2025-08-21T02:39:12.009Z
Learning: The min latency mode functionality in TensorRT-LLM MOE kernels (cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu) is deprecated and no longer being maintained/updated, as confirmed by djns99. Bug reports and optimization suggestions for the computeStridesTmaWarpSpecializedLowLatencyKernel and related min latency code paths should be deprioritized.
Applied to files:
cpp/tensorrt_llm/kernels/mlaKernels.cu
📚 Learning: 2025-08-14T15:38:01.771Z
Learnt from: MatthiasKohl
Repo: NVIDIA/TensorRT-LLM PR: 6904
File: cpp/tensorrt_llm/pybind/thop/bindings.cpp:55-57
Timestamp: 2025-08-14T15:38:01.771Z
Learning: In TensorRT-LLM Python bindings, tensor parameter collections like mla_tensor_params and spec_decoding_tensor_params are kept as required parameters without defaults to maintain API consistency, even when it might affect backward compatibility.
Applied to files:
cpp/tensorrt_llm/kernels/mlaKernels.cu
📚 Learning: 2025-08-14T21:04:50.248Z
Learnt from: thorjohnsen
Repo: NVIDIA/TensorRT-LLM PR: 6910
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:0-0
Timestamp: 2025-08-14T21:04:50.248Z
Learning: In KV cache onboarding logic during prefill in cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp, when calculating which blocks fall within the attention window, use getTokensPerBlock() to advance token indices rather than block->getUniqueTokens().size(), because the calculation needs to consider the post-prefill state where blocks will be filled to capacity, not their current token count.
Applied to files:
cpp/tensorrt_llm/thop/attentionOp.cpp
📚 Learning: 2025-08-14T15:43:23.107Z
Learnt from: MatthiasKohl
Repo: NVIDIA/TensorRT-LLM PR: 6904
File: tensorrt_llm/_torch/attention_backend/trtllm.py:259-262
Timestamp: 2025-08-14T15:43:23.107Z
Learning: In TensorRT-LLM's attention backend, tensor parameters in the plan() method are assigned directly without validation (dtype, device, contiguity checks). This maintains consistency across all tensor inputs and follows the pattern of trusting callers to provide correctly formatted tensors.
Applied to files:
tensorrt_llm/_torch/attention_backend/trtllm.py
📚 Learning: 2025-08-06T13:58:07.506Z
Learnt from: galagam
Repo: NVIDIA/TensorRT-LLM PR: 6487
File: tests/unittest/_torch/auto_deploy/unit/singlegpu/test_ad_trtllm_bench.py:1-12
Timestamp: 2025-08-06T13:58:07.506Z
Learning: In TensorRT-LLM, test files (files under tests/ directories) do not require NVIDIA copyright headers, unlike production source code files. Test files typically start directly with imports, docstrings, or code.
Applied to files:
tensorrt_llm/_torch/attention_backend/trtllm.py
🧬 Code graph analysis (2)
tensorrt_llm/_torch/attention_backend/trtllm.py (1)
- cpp/tensorrt_llm/kernels/mlaKernels.h (1)
  - `helix_position_offsets` (109-110)

tensorrt_llm/_torch/modules/attention.py (3)

- tensorrt_llm/_utils.py (1)
  - `get_sm_version` (740-742)
- cpp/tensorrt_llm/kernels/mlaKernels.h (1)
  - `helix_position_offsets` (109-110)
- tensorrt_llm/_torch/distributed/communicator.py (1)
  - `cp_size` (55-56)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
🔇 Additional comments (1)
tensorrt_llm/_torch/modules/attention.py (1)
1043-1062: Helix position offsets propagation for MLA generation looks consistent with the context path

The new wiring of `position_ids`/helix offsets through the MLA generation stack appears sound:

- `_attn_forward_gen` passes `helix_position_offsets=position_ids` to the backend only when `self.mapping.cp_size > 1`, matching how context MLA uses `position_ids` for Helix.
- `forward_generation_dsa` now forwards `position_ids` into `forward_absorption_generation`, which computes `helix_position_offsets = position_ids if self.mapping.cp_size > 1 else None` and threads it into `self.mqa.mla_rope_generation`.
- The same `position_ids` tensor is already used for RoPE and is truncated to `num_tokens` earlier, so indexing on the CUDA side via the flattened `helix_position_offsets` remains consistent with existing context MLA behavior.

No functional issues stand out here; this looks like the right way to expose Helix offsets for generation MLA while keeping non-CP paths unchanged.
Also applies to: 1716-1718, 1732-1736, 1745-1757, 1377-1395
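A condensed, self-contained sketch of the gating described above; the standalone helper is illustrative (not an existing function), while the `cp_size > 1` condition comes from the review:

```python
from typing import Optional

import torch


def helix_offsets_for_generation(position_ids: Optional[torch.Tensor],
                                 cp_size: int) -> Optional[torch.Tensor]:
    """Illustrative helper: helix position offsets are simply position_ids when
    context parallelism is active (cp_size > 1), and None otherwise, so
    non-CP paths stay unchanged."""
    return position_ids if cp_size > 1 else None


# In forward_absorption_generation, the result is then threaded into
# self.mqa.mla_rope_generation(..., helix_position_offsets=...).
```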
f20714d to e68fcd1 (Compare)

/bot run --disable-fail-fast

e68fcd1 to 571bdb6 (Compare)

/bot run --disable-fail-fast

PR_Github #25395 [ run ] triggered by Bot. Commit:

PR_Github #25396 [ run ] triggered by Bot. Commit:

PR_Github #25395 [ run ] completed with state

PR_Github #25396 [ run ] completed with state
Signed-off-by: Balaram Buddharaju <169953907+brb-nv@users.noreply.github.com>
571bdb6 to 58ec218 (Compare)

/bot run --disable-fail-fast

PR_Github #25433 [ run ] triggered by Bot. Commit:

PR_Github #25433 [ run ] completed with state

/bot run --disable-fail-fast

PR_Github #25470 [ run ] triggered by Bot. Commit:

PR_Github #25470 [ run ] completed with state
Signed-off-by: Balaram Buddharaju <169953907+brb-nv@users.noreply.github.com>
Description
Helix unit tests are broken. Fortunately, fixing them is just a matter of passing `position_ids` properly to `mla_rope_generation`.

Test Coverage
PR Checklist
Please review the following before submitting your PR:
PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.
PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.
Test cases are provided for new code paths (see test instructions)
Any new dependencies have been scanned for license and vulnerabilities
CODEOWNERS updated if ownership changes
Documentation updated as needed
Update tava architecture diagram if there is a significant design change in PR.
The reviewers assigned automatically/manually are appropriate for the PR.
Please check this after reviewing the above items as appropriate for this PR.
GitHub Bot Help
`/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...`

Provide a user-friendly way for developers to interact with a Jenkins server.
Run `/bot [-h|--help]` to print this help message. See details below for each supported subcommand.
Details
`run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug(experimental)]`

Launch build/test pipelines. All previously running jobs will be killed.
- `--reuse-test (optional)pipeline-id` (OPTIONAL) : Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline or the last pipeline if no pipeline-id is indicated. If the Git commit ID has changed, this option will be always ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.
- `--disable-reuse-test` (OPTIONAL) : Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensure that all builds and tests are run regardless of previous successes.
- `--disable-fail-fast` (OPTIONAL) : Disable fail fast on build/tests/infra failures.
- `--skip-test` (OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.
- `--stage-list "A10-PyTorch-1, xxx"` (OPTIONAL) : Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.
- `--gpu-type "A30, H100_PCIe"` (OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.
- `--test-backend "pytorch, cpp"` (OPTIONAL) : Skip test stages which don't match the specified backends. Only support [pytorch, cpp, tensorrt, triton]. Examples: "pytorch, cpp" (does not run test stages with tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.
- `--only-multi-gpu-test` (OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.
- `--disable-multi-gpu-test` (OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.
- `--add-multi-gpu-test` (OPTIONAL) : Force run the multi-GPU tests in addition to running L0 pre-merge pipeline.
- `--post-merge` (OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.
- `--extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx"` (OPTIONAL) : Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".
- `--detailed-log` (OPTIONAL) : Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.
- `--debug` (OPTIONAL) : Experimental feature. Enable access to the CI container for debugging purpose. Note: Specify exactly one stage in the `stage-list` parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.
docs/source/reference/ci-overview.mdand the
scripts/test_to_stage_mapping.pyhelper.kill
kill

`kill`

Kill all running builds associated with pull request.
skip
`skip --comment COMMENT`

Skip testing for latest commit on pull request. `--comment "Reason for skipping build/test"` is required. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.

reuse-pipeline
`reuse-pipeline`

Reuse a previous pipeline to validate current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.