Improved performance of mxfp8 cast kernels #1602

Closed
Oleg-Goncharov wants to merge 6 commits into NVIDIA:main from Oleg-Goncharov:pr_mxfp8_kernels_optimization

@Oleg-Goncharov
Collaborator

Description

Modified and tuned the fused MXFP8 cast kernels for better performance.

Work is still in progress, and some further optimizations will be included in the next few days:

  1. Caching the activations computed in the first pass to avoid recomputing them in the second pass (along the other dimension)
  2. Reducing the block size specifically for the CAST+DBIAS+DACT kernel
  3. Swapping the order of DBIAS computation
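
To illustrate item 1, here is a minimal CPU-side sketch in Python. It is not the kernel code from this PR. The names `gelu` and `amax_both_dims` are hypothetical, and the tanh-approximated GELU only stands in for whichever activation the kernel fuses. The sketch counts activation evaluations when the rowwise and columnwise passes each recompute the activation, versus when the first pass caches the results.

```python
import math


def gelu(x):
    # tanh approximation of GELU; a stand-in activation for this sketch only
    return 0.5 * x * (1.0 + math.tanh(0.7978845608028654 * (x + 0.044715 * x ** 3)))


def amax_both_dims(tile, act, cache=True):
    """Return (rowwise amax, columnwise amax, number of activation calls)."""
    calls = 0

    def apply(v):
        nonlocal calls
        calls += 1
        return act(v)

    rows, cols = len(tile), len(tile[0])
    if cache:
        # The first pass evaluates the activation once per element and keeps it.
        acts = [[apply(v) for v in row] for row in tile]
        row_amax = [max(abs(a) for a in r) for r in acts]
        # The second pass (other dimension) reads the cached values.
        col_amax = [max(abs(acts[i][j]) for i in range(rows)) for j in range(cols)]
    else:
        # Each pass evaluates the activation again for every element.
        row_amax = [max(abs(apply(v)) for v in row) for row in tile]
        col_amax = [max(abs(apply(tile[i][j])) for i in range(rows)) for j in range(cols)]
    return row_amax, col_amax, calls
```

With caching, the activation is evaluated once per element instead of twice, while both passes still see identical values. On a GPU the cache would occupy registers or shared memory, so the real trade-off depends on occupancy, which this sketch does not model.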

The performance comparison data will also be provided.

Fixes # (issue)

Type of change

  • Documentation change (change only to the documentation, either a fix or a new content)
  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Infra/Build change
  • Code refactoring

Changes

  • Increased block size from 64x64 to 128x128
  • Changed how tensor elements are processed. ROWWISE: each thread handles 32 elements (i.e. one scaling factor)
  • Added micro-optimizations (e.g. using MUL2 instructions; fusing MUL and CVT instructions directly in PTX)
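
For readers unfamiliar with the rowwise layout, the following Python sketch shows one way a 32-element block can share a single power-of-two scaling factor. It is a CPU reference under stated assumptions, not the kernel from this PR: the function names are hypothetical, the exponent is chosen so that the block maximum fits within the FP8 E4M3 maximum of 448, and the final rounding to FP8 is omitted. The actual kernel may pick and round the exponent differently.

```python
import math

BLOCK = 32        # elements sharing one scaling factor (from the PR text)
E4M3_MAX = 448.0  # largest finite FP8 E4M3 value


def rowwise_scale_exponents(row):
    """Return one power-of-two exponent per 32-element block of `row`."""
    assert len(row) % BLOCK == 0
    exps = []
    for start in range(0, len(row), BLOCK):
        amax = max(abs(v) for v in row[start:start + BLOCK])
        if amax == 0.0:
            exps.append(0)  # assumption: an all-zero block uses scale 2**0
        else:
            # Smallest exponent e such that amax / 2**e <= E4M3_MAX
            exps.append(math.ceil(math.log2(amax / E4M3_MAX)))
    return exps


def scale_row(row, exps):
    """Divide each element by the scale of its block (no FP8 rounding here)."""
    return [v / 2.0 ** exps[i // BLOCK] for i, v in enumerate(row)]
```

In this scheme, one row of a 128x128 tile holds 128 / 32 = 4 blocks, so it produces 4 scaling factors. A thread that owns a whole block can find the block maximum without exchanging data with other threads.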

Checklist:

  • I have read and followed the contributing guidelines
  • The functionality is complete
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>
Oleg-Goncharov added the performance label Mar 22, 2025
Oleg-Goncharov and others added 4 commits March 31, 2025 12:56
Signed-off-by: Oleg Goncharov <ogoncharov@prenyx0129.a51.clusters.nvidia.com>
Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>

Labels

2.2.0 performance Performance issues

2 participants