Enable NVFP4 grouped MLP GLU RHT amax path #3073
Greptile Summary
This PR enables the NVFP4 grouped MLP GLU RHT amax path by adding two new C++ entry points (
Confidence Score: 5/5
The new quantize-with-amax paths correctly place allreduce before cast kernels in both the single-tensor and grouped variants, and empty-rank distributed scenarios are handled properly for all new code paths.
All new C++ and Python code introduced in this PR correctly implements the amax-reduction-before-quantization ordering. The compute_amax=false branch in quantize_impl calls reduce_amaxes() before the cast kernels for both non-empty and empty inputs. The grouped path calls allreduce_nvfp4_amax_tensors explicitly before group_quantize_nvfp4_impl. Pre-existing behavior for the regular quantize path (empty-rank early exit without allreduce) is unchanged.
No files require special attention; the most complex logic is in cast.cpp and quantizer.cpp, both of which correctly sequence allreduce before quantization in all new code paths.
Important Files Changed
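The ordering the summary describes (reduce the amax across ranks first, then cast) can be illustrated with a minimal pure-Python sketch. This is not the Transformer Engine implementation: the function names below are hypothetical, the collective is simulated with a plain list of per-rank values, and the scaling is a simplified single per-tensor scale onto the FP4 E2M1 value grid. It only shows why an empty rank still has to take part in the reduction and why every rank must quantize with the same reduced amax.

```python
# Illustrative sketch only; all names here are hypothetical, not TE APIs.

# Magnitudes representable in FP4 E2M1; 6.0 is the largest.
FP4_VALUES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
FP4_MAX = 6.0


def local_amax(values):
    # An empty shard contributes 0.0 so it can still join the reduction.
    return max((abs(v) for v in values), default=0.0)


def allreduce_max(per_rank_amax):
    # Stand-in for a MAX all-reduce: every rank receives the same result.
    global_amax = max(per_rank_amax)
    return [global_amax] * len(per_rank_amax)


def quantize_fp4(values, amax):
    # Scale so that amax maps to FP4_MAX, then snap each element to the
    # nearest representable magnitude, keeping its sign.
    if amax == 0.0:
        return [0.0 for _ in values], 1.0
    scale = amax / FP4_MAX
    out = []
    for v in values:
        mag = min(FP4_VALUES, key=lambda q: abs(abs(v) / scale - q))
        out.append(mag if v >= 0 else -mag)
    return out, scale


def quantize_all_ranks(shards):
    # Step 1: each rank computes its local amax (empty ranks included).
    amaxes = [local_amax(s) for s in shards]
    # Step 2: the reduction runs BEFORE any cast, so scales agree everywhere.
    reduced = allreduce_max(amaxes)
    # Step 3: each rank casts its shard with the shared, reduced amax.
    return [quantize_fp4(s, a) for s, a in zip(shards, reduced)]
```

Calling `quantize_all_ranks([[1.0, -2.0], [], [12.0, 3.0]])` gives all three simulated ranks the same scale, including the empty one. If the cast ran before the reduction, each rank would scale by its own local amax and the quantized shards would not be comparable across ranks.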
Reviews (8): Last reviewed commit: "Merge branch 'main' into nvfp4-grouped-m..."
vthumbe1503 left a comment
Mostly LGTM. Left a few comments on code duplication and other minor issues.
Signed-off-by: Siddhartha Raman S <sraman@nvidia.com>
for more information, see https://pre-commit.ci
22b9b0e to d3533df
/te-ci pytorch
Description
Please include a brief summary of the changes, relevant motivation and context.
Fixes # (issue)
Type of change
Changes
Please list the changes introduced in this PR:
Checklist: