Add bf16/fp32 token-per-expert to the MoE aux loss kernel #2162
Merged
Conversation
add bf16/fp32 token-per-expert on the moe-loss-computation on router fusion
Signed-off-by: tongliu <tongliu@nvidia.com>
5708f20 to de887b4
[pre-commit.ci] auto fixes from pre-commit.com hooks; for more information, see https://pre-commit.ci
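For context: the fused router computes a switch-style load-balancing (aux) loss from the router probabilities and the tokens-per-expert counts, and this PR lets those counts flow through the kernel in bf16/fp32 rather than only as integers. Below is a minimal PyTorch sketch of that loss; the function name, tensor layout, and `coeff` argument are illustrative assumptions, not the kernel's actual interface.

```python
import torch

def moe_aux_loss(probs: torch.Tensor, tokens_per_expert: torch.Tensor,
                 total_tokens: int, topk: int, coeff: float = 1e-2) -> torch.Tensor:
    """Switch-style load-balancing loss: coeff * E * sum_i f_i * P_i.

    probs:             [tokens, experts] softmaxed router probabilities.
    tokens_per_expert: [experts] dispatch counts; per this PR they may
                       arrive in bf16/fp32 rather than an integer dtype.
    """
    num_experts = probs.size(-1)
    frac_tokens = tokens_per_expert.float() / (total_tokens * topk)  # f_i
    frac_probs = probs.float().mean(dim=0)                           # P_i
    return coeff * num_experts * torch.dot(frac_tokens, frac_probs)
```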
/te-ci
yaox12 approved these changes on Sep 9, 2025
vthumbe1503 added a commit to vthumbe1503/TransformerEngine that referenced this pull request on Sep 19, 2025:
Signed-off-by: vthumbe1503 <vthumbe@nvidia.com>
* Add bf16/fp32 token-per-expert to the MoE aux loss kernel (NVIDIA#2162): adds a bf16/fp32 token-per-expert path to the MoE loss computation in the router fusion. Signed-off-by: tongliu <tongliu@nvidia.com>
* [JAX] Scale swizzling via JAX transpose op (NVIDIA#2163): adds swizzle in JAX, adds outer_impl, and cleans up the FFI. Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>
* Extract cpp distributed tests into a separate project (NVIDIA#2165): removes an obsolete exclusion and runs the L1_cpp_distributed tests when at least 4 GPUs are present. Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>
* Add context parallelism utilities (NVIDIA#2129): utilities for moving CP shards to different ranks and padding sequences to a divisibility factor, plus unit tests (sketched below). Signed-off-by: Jonathan Mitchell <jomitchell@login-eos02.eos.clusters.nvidia.com>
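Of the commits above, the padding half of NVIDIA#2129 is simple enough to sketch. A hedged illustration assuming a [batch, seqlen, ...] layout; the name and signature are hypothetical, not the utility's actual API.

```python
import torch

def pad_to_divisibility_factor(seq: torch.Tensor, factor: int,
                               pad_value: float = 0.0) -> torch.Tensor:
    """Pad dim 1 of a [batch, seqlen, ...] tensor so seqlen % factor == 0.

    For context parallelism the factor is typically 2 * cp_size, so each
    rank can hold two balanced shards of the padded sequence.
    """
    remainder = seq.size(1) % factor
    if remainder == 0:
        return seq
    pad_shape = list(seq.shape)
    pad_shape[1] = factor - remainder
    padding = torch.full(pad_shape, pad_value, dtype=seq.dtype, device=seq.device)
    return torch.cat([seq, padding], dim=1)
```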
vthumbe1503 added a commit that referenced this pull request on Sep 30, 2025:
Squashed from the feature branch; the constituent changes, listed once each:
* Branch work: get the test working as intended; revert an accidental change; restrict unfused quantization to cases cuBLAS does not already handle (some fp8->fp8 casts go through cuBLAS); fix a merge conflict and a missing `}`. Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>
* Add cuBLASMp-backed GEMM-like API to TE common (#1824): picks up cuBLASMp during the build; adds context creation and AG-GEMM, GEMM-RS, and GEMM-AR paths with FP8 support; handles uneven distribution; parametrizes A/B/D dtypes separately; replaces libcal with NCCL; adds NVTX markers to API functions; removes the MPI dependency and comments the API; excludes the CommGemm tests from L0_cppunittest and moves the CI tests to L1; skips the GemmAr test on unsupported hardware; abstracts the cuBLASMp algo behind a TE enum; NVTE_WITH_CUBLASMP is off by default. Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>
* FP8 AllGather in FP8 GroupedGEMM + fix stream usage issue (#2086): supports current-scaling FP8 quantization with a given amax; FP8 all-gather in the forward pass and BF16 reduce-scatter in the backward pass; the workflow is AR-max -> FP8 Quant -> FP8 AG -> FP8 GroupedGEMM (sketched below); adds docs and unit tests (moved to L1), adds all-layout support for Blackwell+, and fixes the wrong stream used by d2d copies in the groupedGEMM FFI. Signed-off-by: Ming Huang <mingh@nvidia.com>
* [JAX] Delay MeshResource validation until first usage (#2124). Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>
* [JAX] Decouple Recipe and ScalingMode (#1728): decouples recipe from scaling mode, exposes the global QuantizeConfig instance as a getter, and renames UsageType to TensorSource. Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>
* [JAX] `dot_1_output` sharding constraint + use AXIS_IS_UNSHARDED (#2128). Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>
* [JAX] Add amax input to DBiasQuantizePrimitive and FFI (#2118): makes sure amax is initialized with zero and fixes the sharding rule. Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>
* Further relax constraints to cuDNN 9.13 for disabling fused attn for KV caching (#2121). Signed-off-by: Kshitij Lakhani <klakhani@nvidia.com>
* Temporarily remove comm_gemm tests (#2133). Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>
* [PyTorch] Disable determinism for sm100 (#2130): disables determinism for sm100+ with cuDNN < 9.14 and removes sm100 from the determinism table. Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>
* [PyTorch] ONNX export of FP8 Current Scaling (#2068): computes amax in the normalization forward for current scaling in untuned kernels. Signed-off-by: Jan Bielak <jbielak@nvidia.com>, Pawel Gadzinski <pgadzinski@nvidia.com>
* [PyTorch][MoE] Tentative fix for experts receiving zero tokens (#2134): uses torch.empty instead of from_blob for empty shapes. Signed-off-by: zhongboz <zhongboz@nvidia.com>
* build: pull cached wheels (#2127): pulls cached wheels and updates setup.py. Signed-off-by: oliver könig <okoenig@nvidia.com>
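The #2086 workflow is worth spelling out. A hedged PyTorch sketch under stated assumptions: `torch.float8_e4m3fn` stands in for TE's FP8 tensor machinery, the grouped GEMM is shown as a dequantize-then-matmul loop for clarity, and all names are illustrative.

```python
import torch
import torch.distributed as dist

def fp8_ag_grouped_gemm_fwd(x_local, expert_weights, group=None):
    """Forward path of #2086: AR-max -> FP8 Quant -> FP8 AG -> GroupedGEMM."""
    FP8_MAX = 448.0  # finite max of float8_e4m3fn

    # 1) All-reduce the local amax so every rank quantizes with one scale.
    amax = x_local.abs().amax().float()
    dist.all_reduce(amax, op=dist.ReduceOp.MAX, group=group)
    scale = FP8_MAX / torch.clamp(amax, min=1e-12)

    # 2) Current-scaling FP8 quantization with the agreed amax.
    x_fp8 = (x_local.float() * scale).to(torch.float8_e4m3fn)

    # 3) All-gather the FP8 bytes: half the traffic of gathering bf16.
    world = dist.get_world_size(group)
    bufs = [torch.empty_like(x_fp8.view(torch.uint8)) for _ in range(world)]
    dist.all_gather(bufs, x_fp8.view(torch.uint8), group=group)
    x_full = torch.cat(bufs).view(torch.float8_e4m3fn)

    # 4) Grouped GEMM, one chunk per expert (dequantized here for clarity).
    x_bf16 = x_full.to(torch.bfloat16) / scale.to(torch.bfloat16)
    chunks = x_bf16.view(-1, x_local.size(-1)).chunk(len(expert_weights))
    return [c @ w for c, w in zip(chunks, expert_weights)]
```

All-reducing the amax first is what keeps the per-rank quantizations consistent: every gathered FP8 shard shares one scale, so the grouped GEMM can consume them directly.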
* feat: Add support for multiple quantization modes in the UB communicators (#2043). Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>
* [Common] Add checks to CUDA kernel launch and CUDA API calls (#2074): adds the checks, removes exceptions from destructors, and fixes a weird dispatch in ln/rmsnorm. Signed-off-by: Xin Yao <xiny@nvidia.com>, Tim Moon <tmoon@nvidia.com>
* [PyTorch] Support bf16+fp8 cudagraph (#2098). Signed-off-by: Robin Zhang <robinz@nvidia.com>
* Dropout with 8-bit RNG (#2014): adds a dropout kernel driven by an 8-bit RNG; does not require the dropout probability to be exactly representable in 8 bits; interprets masks as bytes rather than 16-bit uints; fixes a small statistical bug from using less-equal instead of less-than (see the sketch after this list). Signed-off-by: Tim Moon <tmoon@nvidia.com>
* Create GPU reload buffers on main stream (#2131). Signed-off-by: Selvaraj Anandaraj
* Branch work: mxfp8 unfused quantization support with refined unit tests; removal of unnecessary quantization code; minor bug fixes, including one where the quantizer must not be None for unfused quantization; minor cleanup and review-comment fixes. Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>
* Fix CI failures for UB overlap changes (#2149). Signed-off-by: djns99 <40156487+djns99@users.noreply.github.com>
* [JAX] Fix failing fused attn tests for dropout=0.1 and bias for sm100 (#2135): fixes the skip message; asserts in the fused attn backward pass on sm100; adds a way to get all devices in the process for JAX; represents attn bias with an enum instead of a string. Signed-off-by: Kshitij Lakhani <klakhani@nvidia.com>
* Branch work: initial draft of GPT-OSS-based SwiGLU integration; all gated kernels modified, with pytest passing for OSS SwiGLU; merge-conflict fixes and lint cleanup. Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>
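The statistical detail in the #2014 dropout bullet is easy to miss, so here is a hedged sketch of the byte-mask scheme; illustrative Python, not the kernel's code.

```python
import torch

def dropout_8bit(x: torch.Tensor, p: float):
    """Dropout whose keep/drop decision comes from one random byte per element.

    An element is kept when its byte r (uniform over [0, 256)) satisfies
    r < threshold, giving P(keep) = threshold / 256 exactly. Strict
    less-than matters: with r <= threshold the keep probability would be
    (threshold + 1) / 256, the off-by-one the PR fixed. Since threshold
    is rounded, p itself need not be representable in 8 bits.
    """
    keep_prob = 1.0 - p
    threshold = int(round(keep_prob * 256))
    r = torch.randint(0, 256, x.shape, dtype=torch.uint8, device=x.device)
    mask = r < threshold                      # byte masks, not 16-bit uints
    return x * mask.to(x.dtype) / keep_prob, mask
```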
* [PyTorch][CUDA Graph] Fix FP8 weight quantization cache under CUDA graph (#2119): adds a no-op to the amax computation and fixes the FP8 blockwise recipe. Signed-off-by: zhongboz <zhongboz@nvidia.com>
* [PyTorch] Fix cross entropy vanishing gradients (#2139): only the backward pass was broken; removes an unnecessary grad_output_stride and generalizes the backward kernel to handle both reduced and unreduced loss (sketched below). Signed-off-by: Casper <casperbh.96@gmail.com>, Tim Moon <tmoon@nvidia.com>
* Fix bug when enabling --overlap-grad-reduce in mcore (#2142). Signed-off-by: Hongbin Liu <hongbinl@nvidia.com>
* Fix CUDA version in setup.py (#2132): also re-enables building the comm-gemm tests and adds a workaround for the nvidia-nvshmem package. Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>
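The #2139 generalization reduces to one formula. A hedged PyTorch sketch, assuming `grad_output` already carries any mean/sum reduction factor; this mirrors the math, not the kernel itself.

```python
import torch

def cross_entropy_bwd(logits, target, grad_output):
    """dL/dlogits = grad_output * (softmax(logits) - onehot(target)).

    Handles a reduced loss, where grad_output is a 0-d scalar, and an
    unreduced loss, where grad_output holds one upstream grad per token.
    """
    grad = torch.softmax(logits.float(), dim=-1)
    grad.scatter_add_(
        1, target.unsqueeze(1),
        torch.full((target.size(0), 1), -1.0, device=logits.device))
    if grad_output.dim() == 0:            # reduced: scalar upstream grad
        grad = grad * grad_output.float()
    else:                                  # unreduced: per-token upstream grads
        grad = grad * grad_output.float().unsqueeze(1)
    return grad.to(logits.dtype)
```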
* [JAX] NoScaleTensor wrapper for non-quantized data (#2136): renames HighPrecisionTensor to NoScaleTensor and uses it in the pure-JAX implementations; reuses an existing amax instead of recomputing it for current scaling; fixes a Shardy issue where amax had shape (1, 1, 1) instead of (1,); casts non-quantized kernels to the input dtype in VJPs; adds higher-precision VJP tests to test_distributed_layernorm_mlp. Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>
* [JAX] Fix GroupedScaledTensor creation with keyword arg (#2154). Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>
* Fix a few issues with multi-process launching (#2155). Signed-off-by: Ming Huang <mingh@nvidia.com>
* Update list of authorized CI users (#2152). Signed-off-by: Tim Moon <tmoon@nvidia.com>
* Fused RoPE with combined QKV input (#2122): fused QKV RoPE kernel with rotary percent and margin, CP2, start_positions, and interleaved support; kernel performance optimizations; moves the fused_qkv_rope test into test_fused_rope.py; applies the shared-memory optimization to the separate fused RoPE kernels as well (see the sketch below). Signed-off-by: Vasudevan Rengasamy <vrengasamy@nvidia.com>, Xin Yao <xiny@nvidia.com>
* Branch work: restore an accidentally removed copyright header and accidentally removed activations; fix a minor bug in the templated function; lint and comment fixes; revert changes that belong to another PR. Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>
* Add bf16/fp32 token-per-expert to the MoE aux loss kernel (#2162). Signed-off-by: tongliu <tongliu@nvidia.com>
* [JAX] Scale swizzling via JAX transpose op (#2163). Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>
* Extract cpp distributed tests into a separate project (#2165). Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>
* Add context parallelism utilities: moving CP shards to different ranks and padding sequences to a divisibility factor (#2129): unit tests for the CP utilities and the utilities themselves. Signed-off-by: Jonathan Mitchell
Signed-off-by: Jonathan Mitchell <jomitchell@login-eos02.eos.clusters.nvidia.com> * assert line change Signed-off-by: Jonathan Mitchell <jomitchell@login-eos02.eos.clusters.nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --------- Signed-off-by: Jonathan Mitchell <jomitchell@login-eos02.eos.clusters.nvidia.com> Co-authored-by: Jonathan Mitchell <jomitchell@login-eos02.eos.clusters.nvidia.com> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by: Sudhakar Singh <sudhakars@nvidia.com> * address review comments Signed-off-by: Varun Thumbe <vthumbe@nvidia.com> * cleanup Signed-off-by: Varun Thumbe <vthumbe@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * fix linting error Signed-off-by: Varun Thumbe <vthumbe@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci [PyTorch Debug] Fix issue with negative underflow% stat. (#2107) * fix underflows log issue Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com> * fix Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * fix Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com> * fix Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * fix Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com> * fix Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com> * fix Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * fix Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --------- Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com> Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> * Address review comments, fix mxfp8 kernel bug: was not passing clamped swiglu parameter correctly Signed-off-by: Varun Thumbe <vthumbe@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci Signed-off-by: Varun Thumbe <vthumbe@nvidia.com> Lower precision gated-act to accelerate FP8 current-scaling. (#2153) * Applying the original precision as N…
Description
Add support for two new data types, bf16 and fp32, for the token-per-expert input of the MoE aux-loss computation kernel.
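For reference, the loss this kernel fuses is the standard load-balancing auxiliary loss; the sketch below is a plain-PyTorch equivalent, assuming the usual Switch-style formulation, with the tokens-per-expert counts now allowed to arrive in bf16/fp32. All names here are illustrative, not the Transformer Engine API.

```python
# Minimal sketch, not the actual fused kernel: a pure-PyTorch reference
# for the load-balancing aux loss with float tokens_per_expert.
import torch

def moe_aux_loss_reference(
    probs: torch.Tensor,              # [num_tokens, num_experts] router probabilities
    tokens_per_expert: torch.Tensor,  # [num_experts], may now be bf16 or fp32
    total_num_tokens: int,            # number of routed tokens
    topk: int,                        # experts selected per token
    coeff: float,                     # aux loss coefficient
) -> torch.Tensor:
    # Mean router probability per expert, accumulated in fp32 for stability.
    prob_mean_per_expert = probs.float().mean(dim=0)
    # Fraction of assignments routed to each expert; .float() is a no-op
    # for fp32 inputs and an upcast for bf16 counts.
    frac_per_expert = tokens_per_expert.float() / (total_num_tokens * topk)
    num_experts = probs.size(1)
    # Load-balancing penalty: coeff * num_experts * sum_i(f_i * P_i).
    return coeff * num_experts * torch.dot(frac_per_expert, prob_mean_per_expert)
```

Accepting float counts directly avoids an extra cast or materialization step when an upstream fused router already produces tokens-per-expert in the network's compute dtype.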
Fixes # (issue)
Type of change
Changes
Please list the changes introduced in this PR:
- Add bf16 and fp32 support for the token-per-expert input of the fused MoE aux-loss kernel.
Checklist: