[None][feat] Resubmission of the routing refactor in trtllmgen #13328
Funatiq merged 10 commits into NVIDIA:main from
Conversation
/bot help
GitHub Bot Help
Provide a user-friendly way for developers to interact with a Jenkins server. See details below for each supported subcommand.
run
Launch build/test pipelines. All previously running jobs will be killed.
kill
Kill all running builds associated with pull request.
skip
Skip testing for latest commit on pull request.
reuse-pipeline
Reuse a previous pipeline to validate current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.
/bot run --disable-fail-fast
PR_Github #44940 [ run ] triggered by Bot. Commit:
📝 Walkthrough
This PR refactors the TensorRT-LLM MoE routing kernel infrastructure from macro-based dispatch to a unified policy-driven system. It consolidates multiple routing implementations (renormalize, DeepSeek) into a single custom routing framework with configurable preprocess/postprocess policies. Runtime configuration replaces compile-time template parameters for flexibility. Old dispatch macros and separate routing implementations are removed and replaced with new infrastructure (RoutingCustomPolicy, RoutingDevKernel, RoutingFromTopKIds). Python layers and tests are updated to support new routing methods and dtype handling.
Changes
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~75 minutes
Possibly related PRs
Suggested reviewers
🚥 Pre-merge checks | ✅ 3 | ❌ 2
❌ Failed checks (1 warning, 1 inconclusive)
✅ Passed checks (3 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings.
Actionable comments posted: 13
🧹 Nitpick comments (4)
tensorrt_llm/_torch/custom_ops/trtllm_gen_custom_ops.py (1)
92-103: Reuse one dummy correction bias per tuning session.
callable_e_score_correction_bias generates a new random tensor on every apply(), so autotuning can benchmark different tactics against different expert distributions. Caching one dummy bias here keeps tactic selection reproducible.
One way to make the dummy input stable:
     # Get routing method
     routing_cls_kwargs = {}
+    dummy_e_score_correction_bias = None
+
+    if routing_method_type in (RoutingMethodType.DeepSeekV3,
+                               RoutingMethodType.MiniMax2):
+        dummy_e_score_correction_bias = torch.randn(
+            num_experts, dtype=torch.bfloat16, device=hidden_states.device)
+
     if routing_method_type == RoutingMethodType.DeepSeekV3:
         routing_cls_kwargs.update({
             'n_group': n_group,
@@
             'routed_scaling_factor': routed_scaling_factor,
             'is_fused': False,  # fuse_routing_kernel
             'callable_e_score_correction_bias':
-            lambda: torch.randn(
-                num_experts, dtype=torch.bfloat16, device=hidden_states.device)
+            lambda: dummy_e_score_correction_bias
         })
     if routing_method_type == RoutingMethodType.MiniMax2:
         routing_cls_kwargs.update({
             'callable_e_score_correction_bias':
-            lambda: torch.randn(
-                num_experts, dtype=torch.bfloat16, device=hidden_states.device),
+            lambda: dummy_e_score_correction_bias,
             'num_experts': num_experts,
         })
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tensorrt_llm/_torch/custom_ops/trtllm_gen_custom_ops.py` around lines 92 - 103, The current lambda assigned to routing_cls_kwargs['callable_e_score_correction_bias'] creates a new random tensor on every apply(), breaking autotuning reproducibility; instead, allocate one dummy bias tensor once (e.g., dummy_e_score_correction_bias = torch.randn(num_experts, dtype=torch.bfloat16, device=hidden_states.device)) and set the callable to return that same tensor (e.g., lambda: dummy_e_score_correction_bias) so MiniMax2 uses a stable dummy correction bias across the tuning session; update the block handling RoutingMethodType.MiniMax2 to create and close over this cached tensor and keep the existing keys ('callable_e_score_correction_bias', 'num_experts') in routing_cls_kwargs.
tensorrt_llm/_torch/modules/fused_moe/fused_moe_trtllm_gen.py (1)
1096-1100: Reuse _extract_routing_params() in the fake path too.
This branch re-implements the same routing-method split that run_moe() and forward_impl() just centralized, so the next routing-method addition has to update three places again.
♻️ Suggested simplification:
-        is_deepseek_v3_routing = isinstance(self.routing_method,
-                                            DeepSeekV3MoeRoutingMethod)
-        is_minimax_routing = isinstance(self.routing_method,
-                                        MiniMaxM2MoeRoutingMethod)
-        top_k = self.routing_method.routing_impl.top_k if is_deepseek_v3_routing else self.routing_method.top_k
-        routing_bias = self.routing_method.e_score_correction_bias if (
-            is_deepseek_v3_routing or is_minimax_routing) else None
+        routing_params = self._extract_routing_params()
+        top_k = routing_params.top_k
+        routing_bias = routing_params.routing_bias
         return fp4_block_scale_fake_output_without_finalize(
             x, self.num_experts,
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tensorrt_llm/_torch/modules/fused_moe/fused_moe_trtllm_gen.py` around lines 1096 - 1100, The fake/inactive path currently reimplements the routing-method split (calculating is_minimax_routing, top_k, routing_bias) instead of reusing the centralized helper—call the existing _extract_routing_params() helper from the fake path so it returns the same routing parameters used by run_moe() and forward_impl(); replace the duplicated logic that computes is_minimax_routing, top_k and routing_bias with a single call to _extract_routing_params() and use its returned values to drive the fake path behavior.
cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/runner.cu (1)
59-62: clusterSizeInBatchDim is currently a no-op.
The constructor advertises a second tuning parameter, but the value is dropped on the floor here and never participates in workspace sizing or routing launch decisions. Either persist/use it or remove it until the implementation is ready.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/runner.cu` around lines 59 - 62, The constructor Runner::Runner(int32_t tileTokensDim, int32_t clusterSizeInBatchDim) currently drops clusterSizeInBatchDim; persist it (e.g., add a member mClusterSizeInBatchDim and initialize it in the initializer list alongside mTileTokensDim) and then use mClusterSizeInBatchDim in workspace sizing and routing launch decisions (the code paths that compute workspace bytes or choose kernel launch dimensions), or if the tuning parameter is not yet supported remove clusterSizeInBatchDim from the signature and all callsites; ensure references are updated to the chosen approach.cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/routing/RoutingCustomPolicy.cuh (1)
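For reference, a minimal sketch of the "persist it" option; the member name mClusterSizeInBatchDim and the surrounding class layout are assumptions, not the actual runner.cu code:

#include <cstdint>

// Sketch only: persist the tuning parameter so later sizing/launch code can read it.
class RunnerSketch
{
public:
    RunnerSketch(int32_t tileTokensDim, int32_t clusterSizeInBatchDim)
        : mTileTokensDim(tileTokensDim)
        , mClusterSizeInBatchDim(clusterSizeInBatchDim) // previously dropped on the floor
    {
    }

private:
    int32_t mTileTokensDim;
    int32_t mClusterSizeInBatchDim; // to be consumed by workspace sizing / launch config
};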
586-609: Drop the commented-out sample policy block.
Keeping a disabled implementation in a block comment here makes it easy for the example to drift away from the real dispatch path. Either remove it or gate it with #if defined(...) if you want an opt-in sample.
As per coding guidelines, "Do not use comments to disable code in C++; use #if/#endif or avoid dead code entirely".
Verify each finding against the current code and only fix it if needed. In `@cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/routing/RoutingCustomPolicy.cuh` around lines 586 - 609, Remove the large commented-out sample policy block (the FirstKExpertSelect struct and its explicit PolicyTraits specialization referencing TierList/Tier) from RoutingCustomPolicy.cuh; either delete it entirely or wrap it with a clear compile-time guard like `#if` defined(SAMPLE_ROUTING_POLICY) / `#endif` so it is not present as dead code in comments—ensure references to FirstKExpertSelect and the PolicyTraits<T> specialization are handled accordingly and that no dangling commented symbols remain.
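For illustration, a sketch of the opt-in guard pattern suggested above; SAMPLE_ROUTING_POLICY is a hypothetical macro name and the bodies are placeholders rather than the real policy code:

#if defined(SAMPLE_ROUTING_POLICY)
// Sample policy kept compilable but excluded from normal builds.
struct FirstKExpertSelect
{
};

template <typename PolicyT>
struct PolicyTraits; // primary template assumed to exist in the real header

template <>
struct PolicyTraits<FirstKExpertSelect>
{
    static constexpr bool IsSamplePolicy = true; // hypothetical trait member
};
#endif // SAMPLE_ROUTING_POLICY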
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/routing/RoutingCustom.cu`:
- Around line 954-980: The early-return path that calls runPostTopKPipeline when
precomputed topK is present must still validate sizes: move or duplicate the
bounds checks for data.mTopK (<= MaxSupportedTopExperts), data.mNumExperts (<=
MaxSupportedExperts) and the data.mNumExperts % 4 == 0 check to execute before
the early return that handles data.mPtrTopKIds / data.mPtrTopKPacked; keep the
existing TLLM_CHECK_WITH_INFO validating mPtrTopKWeights when mPtrTopKIds is
provided and then call runPostTopKPipeline only after these validations pass so
oversized precomputed inputs fail fast with the same checks as the
non-precomputed path.
- Around line 648-655: The PDL completion trigger is currently invoked before
Phase 5 writes permutation outputs, so when params.mUsePdl is true we must move
the cudaTriggerProgrammaticLaunchCompletion() call to after Phase 5 finishes
writing mPtrExpandedIdxToPermutedIdx and any other permutation outputs; update
the dyn-block path to place the trigger after the final global writes (same
location/order as the block kernel) so downstream kernels cannot consume
partially written routing results.
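To make the requested ordering concrete, a CUDA sketch with a simplified, hypothetical params struct; cudaTriggerProgrammaticLaunchCompletion is the real PDL device call (sm_90+), but the phase structure here only mirrors the comment, not the actual kernel:

#include <cstdint>

struct DynBlockParamsSketch // hypothetical; the real params live in the routing headers
{
    bool mUsePdl;
    int32_t* mPtrExpandedIdxToPermutedIdx;
    int32_t mNumExpandedIdx;
};

__global__ void dynBlockPathSketch(DynBlockParamsSketch params)
{
    // Phases 1-4 (histogram, scan, top-k, offsets) elided.

    // Phase 5: the final global writes of the permutation outputs.
    for (int32_t i = threadIdx.x; i < params.mNumExpandedIdx; i += blockDim.x)
    {
        params.mPtrExpandedIdxToPermutedIdx[i] = i; // stand-in for the real mapping
    }

    // Trigger PDL completion only after the last global write, mirroring the
    // block kernel, so dependent kernels never read partially written results.
    if (params.mUsePdl)
    {
        cudaTriggerProgrammaticLaunchCompletion();
    }
}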
In `@cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/routing/RoutingCustomPolicy.cuh`:
- Around line 744-759: The dispatchRoutingPolicy function currently ignores
Data::mPostprocessType for many cases and silently remaps requested (preprocess,
postprocess) pairs; update dispatchRoutingPolicy (the function handling Data,
Fn, and enums RoutingPreprocessType/RoutingPostprocessType) to match on the full
pair instead of only preprocess: for each supported combination explicitly call
fn(...) with the exact (Preprocess, Postprocess) tuple (e.g.,
SigmoidBiasPreprocess + ScaledSumNormalizePostprocess, SigmoidPreprocess +
SumNormalizePostprocess, SoftmaxPreprocess + NoOpPostprocess, SoftmaxPreprocess
+ SumNormalizePostprocess, NoOpPreprocess + SoftmaxPostprocess), and add a final
else branch that fails fast (throw a std::runtime_error or assert/log + exit)
when an unsupported preprocess/postprocess pair is requested so callers cannot
be silently remapped.
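A self-contained sketch of pair-wise dispatch with a fail-fast fallback; the enum members and policy struct names are taken from the prompt above, while the Data plumbing is reduced to two enum arguments:

#include <stdexcept>

enum class RoutingPreprocessType { NoOp, Softmax, Sigmoid, SigmoidBias };
enum class RoutingPostprocessType { NoOp, Softmax, SumNormalize, ScaledSumNormalize };

struct NoOpPreprocess {};
struct SoftmaxPreprocess {};
struct SigmoidPreprocess {};
struct SigmoidBiasPreprocess {};
struct NoOpPostprocess {};
struct SoftmaxPostprocess {};
struct SumNormalizePostprocess {};
struct ScaledSumNormalizePostprocess {};

template <typename Fn>
void dispatchRoutingPolicySketch(RoutingPreprocessType pre, RoutingPostprocessType post, Fn&& fn)
{
    using P = RoutingPreprocessType;
    using Q = RoutingPostprocessType;
    // Each supported (preprocess, postprocess) pair is matched explicitly.
    if (pre == P::SigmoidBias && post == Q::ScaledSumNormalize)
        fn(SigmoidBiasPreprocess{}, ScaledSumNormalizePostprocess{});
    else if (pre == P::Sigmoid && post == Q::SumNormalize)
        fn(SigmoidPreprocess{}, SumNormalizePostprocess{});
    else if (pre == P::Softmax && post == Q::NoOp)
        fn(SoftmaxPreprocess{}, NoOpPostprocess{});
    else if (pre == P::Softmax && post == Q::SumNormalize)
        fn(SoftmaxPreprocess{}, SumNormalizePostprocess{});
    else if (pre == P::NoOp && post == Q::Softmax)
        fn(NoOpPreprocess{}, SoftmaxPostprocess{});
    else // no silent remapping: unsupported pairs fail fast
        throw std::runtime_error("dispatchRoutingPolicy: unsupported preprocess/postprocess pair");
}

A caller would pass a generic lambda such as [&](auto pre, auto post) { /* launch the kernel templated on decltype(pre)/decltype(post) */ }, so each supported pair instantiates exactly one kernel variant.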
In `@cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/routing/RoutingDeepSeek.cu`:
- Around line 190-228: The routing bias is being narrowed early: biasVal is cast
to OutputT after loadScalar which loses fp32 precision (problematic when
mDtypeOutput is Bfloat16). Change the logic around biasVal/loadScalar so you
keep the bias in float precision for selection and comparison (use a float bias
variable from loadScalar(params.mPtrRoutingBias, params.mDtypeBias) and only
cast to OutputT when storing into outputs if needed). Update usages around
biasVal, scoreBias and any selection code that compares expert scores (e.g.,
where expertSelected, scoreIdx, smemScoreSigmoid are used) to use the float bias
variable so top-k decisions use full fp32 bias precision.
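A minimal sketch of the precision split described above, assuming bf16 outputs; selection always sees fp32, and only the stored value narrows:

#include <cuda_bf16.h>

// Compare/select in fp32 so small bias differences still order experts correctly.
__device__ float biasedScoreFp32(float sigmoidScore, float biasFp32)
{
    return sigmoidScore + biasFp32;
}

// Narrow only at store time; top-k decisions never see bf16 rounding.
__device__ __nv_bfloat16 narrowAtStore(float selectedScore)
{
    return __float2bfloat16(selectedScore);
}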
In `@cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/routing/RoutingDevKernel.h`:
- Around line 117-167: The BFloat16 branches in both
LAUNCH_ROUTING_WITH_POLICIES and LAUNCH_ROUTING_WITH_EXPERT_SELECT incorrectly
accept any mDtypeInput; change the third branch condition in each macro from
"else if (data.mDtypeOutput == tg::Dtype::Bfloat16)" to "else if
(data.mDtypeOutput == tg::Dtype::Bfloat16 && data.mDtypeInput ==
tg::Dtype::Fp32)" so the bf16→bf16 kernel is only selected when input is fp32;
keep the final else that calls TLLM_LOG_ERROR("Unsupported dtypeOutput") so
unsupported input/output combinations are rejected.
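A standalone sketch of the tightened guard, with tg::Dtype reduced to the two members the check needs; the real macros would use this condition in their third branch:

namespace tg { enum class Dtype { Fp32, Bfloat16 }; } // reduced for the sketch

bool selectsBf16OutputKernel(tg::Dtype dtypeInput, tg::Dtype dtypeOutput)
{
    // bf16 output is only valid when the input logits are fp32; anything else
    // must fall through to the TLLM_LOG_ERROR branch.
    return dtypeOutput == tg::Dtype::Bfloat16 && dtypeInput == tg::Dtype::Fp32;
}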
In `@cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/routing/RoutingFromTopKIds.cu`:
- Around line 42-136: Add a fail-fast input validation in runPostTopKPipeline:
after computing useStaticBlock, useDynBlock, useSingleCluster and before any
routingCustom::launch* calls, check for the unsupported "packed-only without
weights" combination (data.mPtrTopKPacked != nullptr && data.mPtrTopKWeights ==
nullptr) that can lead to garbage writes to
mPtrPermutedIdxSize/mPtrNumNonExitingCtas and histogram fallback corruption; if
that condition is true and the code will take a
non-static/non-dyn/non-single-cluster path (i.e., !(useStaticBlock ||
useDynBlock || useSingleCluster) or when useCoop is possible), call
TLLM_CHECK_WITH_INFO(false, "clear message...") to abort early. Ensure the check
references runPostTopKPipeline, data.mPtrTopKPacked, data.mPtrTopKWeights,
mPtrPermutedIdxSize/mPtrNumNonExitingCtas and the boolean flags so the
validation is placed before any routingCustom::launch* invocations.
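A sketch of that fail-fast check, with the routing Data reduced to the two pointers the validation reads; the real code would use TLLM_CHECK_WITH_INFO rather than throw:

#include <cstdint>
#include <stdexcept>

struct PostTopKInputsSketch // reduced stand-in for the routing Data struct
{
    int32_t const* mPtrTopKPacked;
    float const* mPtrTopKWeights;
};

void validatePostTopKInputs(PostTopKInputsSketch const& data, bool useStaticBlock,
    bool useDynBlock, bool useSingleCluster)
{
    bool const packedWithoutWeights
        = data.mPtrTopKPacked != nullptr && data.mPtrTopKWeights == nullptr;
    bool const takesFallbackPath = !(useStaticBlock || useDynBlock || useSingleCluster);
    if (packedWithoutWeights && takesFallbackPath)
    {
        // Abort before any routingCustom::launch* call can write garbage to
        // mPtrPermutedIdxSize / mPtrNumNonExitingCtas.
        throw std::runtime_error(
            "runPostTopKPipeline: packed top-k ids without weights is unsupported "
            "on the histogram/coop fallback path");
    }
}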
In `@cpp/tensorrt_llm/thop/cuteDslMoeUtilsOp.cpp`:
- Around line 79-81: The code sets dtypeRoutingLogits by mapping any non-Float
routing_logits to btg::Dtype::Bfloat16 which silently accepts unsupported
dtypes; update the logic in cuteDslMoeUtilsOp.cpp where dtypeRoutingLogits is
computed (the routing_logits.has_value() branch) to explicitly accept only
at::ScalarType::Float -> btg::Dtype::Fp32 and at::ScalarType::BFloat16 (or the
exact BF16 enum used by your build) -> btg::Dtype::Bfloat16, and otherwise fail
fast (throw an exception or return an error) when routing_logits->scalar_type()
is any other type (e.g., at::ScalarType::Half) so the kernel won't read invalid
data as bf16.
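A sketch of the explicit mapping, assuming the btg::Dtype enum from the surrounding file; TORCH_CHECK makes any other dtype (e.g., Half) fail fast instead of being read as bf16:

#include <ATen/ATen.h>

btg::Dtype toRoutingLogitsDtypeSketch(at::ScalarType t)
{
    if (t == at::ScalarType::Float)
    {
        return btg::Dtype::Fp32;
    }
    if (t == at::ScalarType::BFloat16)
    {
        return btg::Dtype::Bfloat16;
    }
    TORCH_CHECK(false, "Unsupported routing_logits dtype: ", t);
    return btg::Dtype::Fp32; // unreachable; silences missing-return warnings
}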
In `@cpp/tests/unit_tests/kernels/routing/routingCustomTest.cpp`:
- Around line 1262-1279: The test currently only checks kernel launch for
routingCustom::Data (routingData) with mixed bias dtype; instead add numeric
assertions by computing the CPU reference outputs (using the same reference
helper used by other test paths) and compare the device kernel outputs (the
buffer pointed to by routingData.mPtrRoutingBias or the scores output buffer
produced by routingCustom::run) against that CPU reference with an appropriate
tolerance; ensure you exercise the mDtypeBias / loadScalar path by reading back
the device output into host memory via bufferCast and then assert elementwise
equality/near-equality to the CPU reference (use the same tolerance and
comparison helper used elsewhere in these tests) so the mixed-precision behavior
is validated, and keep calls to routingCustom::run(routingData,
this->mStream->get()) and this->mStream->synchronize() before reading back.
- Around line 145-156: The ScaledSumNormalize oracle currently divides by
sumSigmoid without using the test epsilon, so thread routingData.mSumEpsilon
into the ScaledSumNormalize test logic: update the validation in the
RoutingPostprocessType::ScaledSumNormalize branch to divide by (sumSigmoid +
routingData.mSumEpsilon) when computing expected scores (using symbols
sigmoidScores, expIdx, and param.routedScalingFactor), and modify setParams() to
populate routingData.mSumEpsilon with the intended non-zero test values so the
non-zero-epsilon behavior is actually exercised; apply the same change to the
other occurrence mentioned (lines 235-242).
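A sketch of the corrected oracle as a standalone helper; the parameter names mirror the symbols in the prompt (sumSigmoid, routedScalingFactor, mSumEpsilon):

// Divide by (sum + epsilon) so a non-zero test epsilon actually changes the result.
float expectedScaledSumNormalize(
    float sigmoidScore, float sumSigmoid, float routedScalingFactor, float sumEpsilon)
{
    return routedScalingFactor * sigmoidScore / (sumSigmoid + sumEpsilon);
}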
In `@cpp/tests/unit_tests/kernels/routing/routingDeepSeekTest.cpp`:
- Around line 412-459: The test currently only ensures the kernel runs with
mDtypeBias = Fp32; update it to verify correctness by computing a CPU host
reference using the float32 bias (use the same inputs initialized via initData
and float32BiasHost), run moe::dev::routing::routingDeepSeek::run(routingData,
...), copy back the kernel outputs (top-k ids and weights buffers produced by
the test harness) and ASSERT/EXPECT that the device top-k ids/weights match the
host reference within tolerance; locate code around setCommonParams,
routingData, float32BiasHost/float32BiasDevice, routingDeepSeek::run and add the
host-reference computation and comparisons after this->mStream->synchronize() so
the test fails if dtype plumbing is wrong.
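A sketch of the comparison step, assuming the kernel outputs have already been copied back into host vectors; buffer names and the tolerance are illustrative, not the actual test-harness API:

#include <cstdint>
#include <vector>
#include <gtest/gtest.h>

void expectTopKMatchesHostReference(std::vector<int32_t> const& deviceTopKIds,
    std::vector<float> const& deviceTopKWeights, std::vector<int32_t> const& hostTopKIds,
    std::vector<float> const& hostTopKWeights, float tol = 1e-3f)
{
    ASSERT_EQ(deviceTopKIds.size(), hostTopKIds.size());
    for (size_t i = 0; i < hostTopKIds.size(); ++i)
    {
        EXPECT_EQ(deviceTopKIds[i], hostTopKIds[i]) << "top-k id mismatch at " << i;
        EXPECT_NEAR(deviceTopKWeights[i], hostTopKWeights[i], tol)
            << "top-k weight mismatch at " << i;
    }
}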
In `@cpp/tests/unit_tests/kernels/routing/routingTest.cpp`:
- Around line 301-317: The host-side reference in computePermutation() must not
index expertCountsHostPtr or expertScanCountsHostPtr with out-of-range expert
IDs when hasInvalidTopKInput is true; update computePermutation() (the host
oracle that reads expIdxHostPtr entries) to validate each expertIdx (require
expertIdx >= 0 && expertIdx < param.numExperts) before any access to
expertCountsHostPtr or expertScanCountsHostPtr and skip or set outputs for
invalid entries (e.g., produce -1) so the reference no longer walks past the
buffers for expertIdx >= param.numExperts.
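A sketch of the guarded host oracle; the permuted-index formula is illustrative, the point being that expertIdx is range-checked before any access to the count/scan buffers:

#include <cstdint>
#include <vector>

std::vector<int32_t> computePermutationSketch(std::vector<int32_t> const& expIdxHost,
    std::vector<int32_t> const& expertScanCountsHost, int32_t numExperts)
{
    std::vector<int32_t> localOffset(numExperts, 0);
    std::vector<int32_t> permutedIdx(expIdxHost.size(), -1); // -1 marks invalid entries
    for (size_t i = 0; i < expIdxHost.size(); ++i)
    {
        int32_t const expertIdx = expIdxHost[i];
        if (expertIdx < 0 || expertIdx >= numExperts)
        {
            continue; // invalid top-k input: keep the sentinel, touch no buffers
        }
        permutedIdx[i] = expertScanCountsHost[expertIdx] + localOffset[expertIdx]++;
    }
    return permutedIdx;
}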
In `@tests/unittest/_torch/modules/moe/test_moe_module.py`:
- Around line 1225-1229: Replace the TRTLLM-only gate with the suite's full
backend capability check: call the same helper used elsewhere
(backend_type.get_quick_skip_reason or backend_type.can_implement pattern)
passing quant_algo, moe_model_config and routing_method_cls (and the custom
n_group/topk_group settings) and if it returns a reason, pytest.skip(reason); do
not use should_skip_trtllm here so unsupported combos like TRTLLM+QuantAlgo.FP8
or custom DeepSeek group/topk configurations are correctly skipped.
---
Nitpick comments:
In `@cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/routing/RoutingCustomPolicy.cuh`:
- Around line 586-609: Remove the large commented-out sample policy block (the
FirstKExpertSelect struct and its explicit PolicyTraits specialization
referencing TierList/Tier) from RoutingCustomPolicy.cuh; either delete it
entirely or wrap it with a clear compile-time guard like `#if`
defined(SAMPLE_ROUTING_POLICY) / `#endif` so it is not present as dead code in
comments—ensure references to FirstKExpertSelect and the PolicyTraits<T>
specialization are handled accordingly and that no dangling commented symbols
remain.
In `@cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/runner.cu`:
- Around line 59-62: The constructor Runner::Runner(int32_t tileTokensDim,
int32_t clusterSizeInBatchDim) currently drops clusterSizeInBatchDim; persist it
(e.g., add a member mClusterSizeInBatchDim and initialize it in the initializer
list alongside mTileTokensDim) and then use mClusterSizeInBatchDim in workspace
sizing and routing launch decisions (the code paths that compute workspace bytes
or choose kernel launch dimensions), or if the tuning parameter is not yet
supported remove clusterSizeInBatchDim from the signature and all callsites;
ensure references are updated to the chosen approach.
In `@tensorrt_llm/_torch/custom_ops/trtllm_gen_custom_ops.py`:
- Around line 92-103: The current lambda assigned to
routing_cls_kwargs['callable_e_score_correction_bias'] creates a new random
tensor on every apply(), breaking autotuning reproducibility; instead, allocate
one dummy bias tensor once (e.g., dummy_e_score_correction_bias =
torch.randn(num_experts, dtype=torch.bfloat16, device=hidden_states.device)) and
set the callable to return that same tensor (e.g., lambda:
dummy_e_score_correction_bias) so MiniMax2 uses a stable dummy correction bias
across the tuning session; update the block handling RoutingMethodType.MiniMax2
to create and close over this cached tensor and keep the existing keys
('callable_e_score_correction_bias', 'num_experts') in routing_cls_kwargs.
In `@tensorrt_llm/_torch/modules/fused_moe/fused_moe_trtllm_gen.py`:
- Around line 1096-1100: The fake/inactive path currently reimplements the
routing-method split (calculating is_minimax_routing, top_k, routing_bias)
instead of reusing the centralized helper—call the existing
_extract_routing_params() helper from the fake path so it returns the same
routing parameters used by run_moe() and forward_impl(); replace the duplicated
logic that computes is_minimax_routing, top_k and routing_bias with a single
call to _extract_routing_params() and use its returned values to drive the fake
path behavior.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro Plus
Run ID: d0ba2c10-bd7b-4c9e-a975-6b32330215d0
📒 Files selected for processing (48)
- cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/DevKernel.h
- cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/RoutingDeepSeek.cu
- cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/RoutingRenormalize.cu
- cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/routing/IntFastDiv.h
- cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/routing/RoutingCustom.cu
- cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/routing/RoutingCustomPolicy.cuh
- cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/routing/RoutingDeepSeek.cu
- cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/routing/RoutingDevKernel.h
- cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/routing/RoutingFromTopKIds.cu
- cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/routing/RoutingKernel.cuh
- cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/routing/RoutingKernel.h
- cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/routing/RoutingKernelTopK.cuh
- cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/routing/RoutingLlama4.cu
- cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/routingDeepSeek/RoutingDeepSeekCommon.cuh
- cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/routingDeepSeek/launchClusterKernel.cu
- cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/routingDeepSeek/launchCoopKernel.cu
- cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/routingDeepSeek/launchHistogramKernel.cu
- cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/routingDeepSeek/launchInitExpertCounts.cu
- cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/routingDeepSeek/launchMainKernel.cu
- cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/routingDeepSeek/launchOffsetsKernel.cu
- cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/routingRenormalize/RoutingRenormalizeCommon.cuh
- cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/routingRenormalize/launchClusterKernel.cu
- cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/routingRenormalize/launchHistogramKernel.cu
- cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/routingRenormalize/launchHistogramScoresKernel.cu
- cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/routingRenormalize/launchInitExpertCounts.cu
- cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/routingRenormalize/launchOffsetsKernel.cu
- cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/runner.cu
- cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/runner.h
- cpp/tensorrt_llm/thop/cuteDslMoeUtilsOp.cpp
- cpp/tensorrt_llm/thop/fp4BlockScaleMoe.cpp
- cpp/tensorrt_llm/thop/fp8BlockScaleMoe.cpp
- cpp/tensorrt_llm/thop/fp8PerTensorScaleMoe.cpp
- cpp/tensorrt_llm/thop/mxFp4BlockScaleMoe.cpp
- cpp/tests/unit_tests/kernels/CMakeLists.txt
- cpp/tests/unit_tests/kernels/routing/routingCustomTest.cpp
- cpp/tests/unit_tests/kernels/routing/routingDeepSeekTest.cpp
- cpp/tests/unit_tests/kernels/routing/routingLlama4Test.cpp
- cpp/tests/unit_tests/kernels/routing/routingRenormalizeTest.cpp
- cpp/tests/unit_tests/kernels/routing/routingTest.cpp
- cpp/tests/unit_tests/kernels/routing/routingTest.h
- tensorrt_llm/_torch/custom_ops/trtllm_gen_custom_ops.py
- tensorrt_llm/_torch/models/modeling_deepseekv3.py
- tensorrt_llm/_torch/modules/fused_moe/__init__.py
- tensorrt_llm/_torch/modules/fused_moe/fused_moe_trtllm_gen.py
- tensorrt_llm/_torch/modules/fused_moe/routing.py
- tests/integration/defs/accuracy/test_llm_api_pytorch.py
- tests/unittest/_torch/modules/moe/moe_test_utils.py
- tests/unittest/_torch/modules/moe/test_moe_module.py
💤 Files with no reviewable changes (17)
- cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/RoutingDeepSeek.cu
- cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/routingDeepSeek/launchClusterKernel.cu
- cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/DevKernel.h
- cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/routingDeepSeek/launchHistogramKernel.cu
- cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/routingRenormalize/launchOffsetsKernel.cu
- cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/routingDeepSeek/launchOffsetsKernel.cu
- cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/routingDeepSeek/launchInitExpertCounts.cu
- cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/routingRenormalize/launchHistogramKernel.cu
- cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/routingDeepSeek/RoutingDeepSeekCommon.cuh
- cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/routingRenormalize/launchInitExpertCounts.cu
- cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/RoutingRenormalize.cu
- cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/routingRenormalize/launchClusterKernel.cu
- cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/routingDeepSeek/launchMainKernel.cu
- cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/routingRenormalize/launchHistogramScoresKernel.cu
- cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/routingDeepSeek/launchCoopKernel.cu
- cpp/tests/unit_tests/kernels/routing/routingRenormalizeTest.cpp
- cpp/tensorrt_llm/kernels/trtllmGenKernels/blockScaleMoe/routingRenormalize/RoutingRenormalizeCommon.cuh
/bot kill
PR_Github #44965 [ kill ] triggered by Bot. Commit:
PR_Github #44940 [ run ] completed with state
PR_Github #44965 [ kill ] completed with state
Force-pushed from b4ca38d to 5d4d8ce
/bot run --disable-fail-fast
/bot kill
Force-pushed from 5d4d8ce to d6d1600
/bot run --disable-fail-fast
PR_Github #46501 [ run ] triggered by Bot. Commit:
PR_Github #46501 [ run ] completed with state
/bot run --disable-fail-fast/bot run --disable-fail-fast
PR_Github #46545 Bot args parsing error: usage: /bot [-h]
/bot run --disable-fail-fast
PR_Github #46549 [ run ] triggered by Bot. Commit:
PR_Github #46549 [ run ] completed with state
Signed-off-by: Christina Zhang <83400082+ChristinaZ@users.noreply.github.com>
Signed-off-by: Christina Zhang <83400082+ChristinaZ@users.noreply.github.com>
Signed-off-by: Christina Zhang <83400082+ChristinaZ@users.noreply.github.com>
Signed-off-by: Christina Zhang <83400082+ChristinaZ@users.noreply.github.com>
Signed-off-by: Christina Zhang <83400082+ChristinaZ@users.noreply.github.com>
Signed-off-by: Christina Zhang <83400082+ChristinaZ@users.noreply.github.com>
Signed-off-by: Christina Zhang <83400082+ChristinaZ@users.noreply.github.com>
Signed-off-by: Christina Zhang <83400082+ChristinaZ@users.noreply.github.com>
Signed-off-by: Yue Weng <25103990+yweng0828@users.noreply.github.com>
Signed-off-by: Yue Weng <25103990+yweng0828@users.noreply.github.com>
Force-pushed from 6c06f2e to 8d50c57
/bot run --disable-fail-fast
PR_Github #46666 [ run ] triggered by Bot. Commit:
PR_Github #46666 [ run ] completed with state
/bot run --disable-fail-fast
PR_Github #46725 [ run ] triggered by Bot. Commit:
PR_Github #46725 [ run ] completed with state
Summary by CodeRabbit
Release Notes
New Features
Improvements
Description
This PR fixes the issues introduced by #12246.
Previously, I had skipped the C++ unit tests and there were failing cases.
In this PR, I’ve addressed those failures and fixed the related bugs.
Test Coverage
PR Checklist
Please review the following before submitting your PR:
PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.
PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.
Test cases are provided for new code paths (see test instructions)
Any new dependencies have been scanned for license and vulnerabilities
CODEOWNERS updated if ownership changes
Documentation updated as needed
Update tava architecture diagram if there is a significant design change in PR.
The reviewers assigned automatically/manually are appropriate for the PR.
Please check this after reviewing the above items as appropriate for this PR.
GitHub Bot Help
To see a list of available CI bot commands, please comment /bot help.