Optimize /ops/fuser.py by moving computation from forward to __init__ #1870

Merged

timmoon10 merged 7 commits into NVIDIA:main from janekb04:optimize-ops-fuser on Jun 13, 2025

Conversation

@janekb04 (Collaborator)

Description

This PR moves certain computations performed during the forward pass of te.Sequential from _OperationFuserAutogradFunction.forward and OperationFuser.__call__ into OperationFuser.__init__. Additionally, it caches the result of is_non_tn_fp8_gemm_supported.

Fixes # (issue)

Type of change

  • Documentation change (change only to the documentation, either a fix or new content)
  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Infra/Build change
  • Code refactoring

Changes

  • Flatten list of parameters of basic_ops in OperationFuser.__init__ instead of in OperationFuser.__call__
  • Change interface of _OperationFuserAutogradFunction.forward to take fuser: OperationFuser instead of 7 separate parameters
  • Count parameters of basic_ops in OperationFuser.__init__ instead of in _OperationFuserAutogradFunction.forward
  • Cache is_non_tn_fp8_gemm_supported (see the sketch after this list)
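
As an illustration of the first and third changes, here is a minimal sketch of the "do bookkeeping once in __init__, keep the call path lean" pattern. This is not the actual TransformerEngine implementation: only the names OperationFuser and basic_ops come from this PR; the signatures, bodies, and toy ops below are assumptions.

```python
# Minimal sketch of the refactor pattern in this PR: do bookkeeping once in
# __init__ rather than on every forward call. Only the names OperationFuser
# and basic_ops come from the PR; the bodies here are illustrative.
import torch
import torch.nn as nn


class OperationFuser:
    def __init__(self, basic_ops: list[nn.Module]):
        self.basic_ops = basic_ops
        # Previously recomputed per call; the ops are fixed once the fuser is
        # constructed, so flatten their parameter lists a single time here.
        self.flat_params = [p for op in basic_ops for p in op.parameters()]
        # Likewise, per-op parameter counts (used to split the flat list back
        # up inside the autograd function) can be computed once.
        self.param_counts = [sum(1 for _ in op.parameters()) for op in basic_ops]

    def __call__(self, x: torch.Tensor) -> torch.Tensor:
        # The hot path now only runs the ops; no list flattening or counting.
        for op in self.basic_ops:
            x = op(x)
        return x


# Usage: the one-time cost moves to construction, off the per-step path.
fuser = OperationFuser([nn.Linear(16, 16), nn.ReLU()])
out = fuser(torch.randn(4, 16))
```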

Checklist:

  • I have read and followed the contributing guidelines
  • The functionality is complete
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes (tested test_fusible_ops.py)

janekb04 and others added 4 commits June 11, 2025 20:50
Signed-off-by: Jan Bielak <jbielak@nvidia.com>
(cherry picked from commit 949abe97070721b1da5117903067608250f5fb61)
Signed-off-by: Jan Bielak <jbielak@nvidia.com>
(cherry picked from commit fd830ae24ffbd2d0727010b1a8a119ca72f61ce5)
…utation to __init__

Signed-off-by: Jan Bielak <jbielak@nvidia.com>
(cherry picked from commit fd808991993958b670726896254b82fcb967fa07)
@timmoon10 (Collaborator) left a comment

Overall looks good. Can you quantify how much speedup you observed from each optimization?

Comment thread on transformer_engine/pytorch/ops/fuser.py (outdated)
@janekb04 (Collaborator, Author)

Using my benchmark script, the running time of a GPT encoder transformer layer (averaged over 10k runs) is:

| Layer | main | pre-flatten | 1+cache | 1+2+pre-count |
| --- | --- | --- | --- | --- |
| Fused TE Layer | 1.66 ms | 1.64 ms | 1.65 ms | 1.63 ms |
| Fused TE Layer (FP8) | 2.31 ms | 2.37 ms | 2.17 ms | 2.18 ms |
| Sequential TE Layer | 1.92 ms | 1.91 ms | 1.92 ms | 1.88 ms |
| Sequential TE Layer (FP8) | 3.12 ms | 3.09 ms | 2.92 ms | 2.87 ms |
| Builtin TE TransformerLayer | 2.51 ms | 2.49 ms | 2.50 ms | 2.52 ms |
| Builtin TE TransformerLayer (FP8) | 3.14 ms | 3.12 ms | 2.96 ms | 2.95 ms |

Where:

  • "pre-flatten" (opt. 1) flattens the list of parameters of basic_ops in OperationFuser.__init__ instead of in OperationFuser.__call__
  • "1+cache" (opt. 2) applies opt. 1 and additionally caches is_non_tn_fp8_gemm_supported
  • "1+2+pre-count" (opt. 3) applies opts. 1 and 2, changes the interface of _OperationFuserAutogradFunction.forward to take fuser: OperationFuser instead of 7 separate parameters, and counts the parameters of basic_ops in OperationFuser.__init__ instead of in _OperationFuserAutogradFunction.forward

It appears that the most significant change performance-wise is actually the one-line change of caching is_non_tn_fp8_gemm_supported, as it speeds up all of the FP8 layers.
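
For reference, a common way to get this kind of one-line caching in Python — a sketch, not necessarily how the PR implements it — is functools.lru_cache on an argument-free probe:

```python
# Sketch of caching a capability probe with functools.lru_cache. Only the
# function name is_non_tn_fp8_gemm_supported comes from the PR; the body
# below is invented for illustration.
import functools

import torch


@functools.lru_cache(maxsize=None)
def is_non_tn_fp8_gemm_supported() -> bool:
    # Pretend the probe is expensive (driver/library queries). After the
    # first call, lru_cache turns every later call into a dict lookup.
    if not torch.cuda.is_available():
        return False
    major, _minor = torch.cuda.get_device_capability()
    return major >= 10  # illustrative threshold, not the real criterion
```

Since the answer cannot change within a process, the cache never needs invalidation, which is why a single memoized flag can remove the cost from every FP8 forward pass.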

@timmoon10 (Collaborator)

/te-ci pytorch

timmoon10 merged commit 8d4bdbc into NVIDIA:main on Jun 13, 2025
21 checks passed
janekb04 deleted the optimize-ops-fuser branch on June 13, 2025 at 17:22
