Optimize /ops/fuser.py by moving computation from forward to __init__ #1870

Merged

timmoon10 merged 7 commits into NVIDIA:main from janekb04:optimize-ops-fuser on Jun 13, 2025

Conversation

@janekb04 (Collaborator)

Description

This PR moves certain computations performed during the forward pass of te.Sequential from _OperationFuserAutogradFunction.forward and OperationFuser.__call__ into OperationFuser.__init__. Additionally, it caches the result of is_non_tn_fp8_gemm_supported.

Fixes # (issue)

Type of change

  • Documentation change (change only to the documentation, either a fix or new content)
  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Infra/Build change
  • Code refactoring

Changes

  • Flatten list of parameters of basic_ops in OperationFuser.__init__ instead of in OperationFuser.__call__
  • Change interface of _OperationFuserAutogradFunction.forward to take fuser: OperationFuser instead of 7 separate parameters
  • Count parameters of basic_ops in OperationFuser.__init__ instead of in _OperationFuserAutogradFunction.forward
  • Cache is_non_tn_fp8_gemm_supported (see the sketch after this list)
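
As an illustration of the first and third changes, here is a minimal sketch of the "do bookkeeping once in __init__, keep the call path lean" pattern. This is not the actual TransformerEngine implementation: only the names OperationFuser and basic_ops come from this PR; the signatures, bodies, and toy ops below are assumptions.

```python
# Minimal sketch of the refactor pattern in this PR: do bookkeeping once in
# __init__ rather than on every forward call. Only the names OperationFuser
# and basic_ops come from the PR; the bodies here are illustrative.
import torch
import torch.nn as nn


class OperationFuser:
    def __init__(self, basic_ops: list[nn.Module]):
        self.basic_ops = basic_ops
        # Previously recomputed per call; the ops are fixed once the fuser is
        # constructed, so flatten their parameter lists a single time here.
        self.flat_params = [p for op in basic_ops for p in op.parameters()]
        # Likewise, per-op parameter counts (used to split the flat list back
        # up inside the autograd function) can be computed once.
        self.param_counts = [sum(1 for _ in op.parameters()) for op in basic_ops]

    def __call__(self, x: torch.Tensor) -> torch.Tensor:
        # The hot path now only runs the ops; no list flattening or counting.
        for op in self.basic_ops:
            x = op(x)
        return x


# Usage: the one-time cost moves to construction, off the per-step path.
fuser = OperationFuser([nn.Linear(16, 16), nn.ReLU()])
out = fuser(torch.randn(4, 16))
```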

Checklist:

  • I have read and followed the contributing guidelines
  • The functionality is complete
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes (tested test_fusible_ops.py)

janekb04 and others added 4 commits June 11, 2025 20:50
Signed-off-by: Jan Bielak <jbielak@nvidia.com>
(cherry picked from commit 949abe97070721b1da5117903067608250f5fb61)
Signed-off-by: Jan Bielak <jbielak@nvidia.com>
(cherry picked from commit fd830ae24ffbd2d0727010b1a8a119ca72f61ce5)
…utation to __init__

Signed-off-by: Jan Bielak <jbielak@nvidia.com>
(cherry picked from commit fd808991993958b670726896254b82fcb967fa07)
@timmoon10 (Collaborator) left a comment

Overall looks good. Can you quantify how much speedup you observed from each optimization?

Comment thread on transformer_engine/pytorch/ops/fuser.py (outdated)
@janekb04 (Collaborator, Author)

Using my benchmark script, the running time of a GPT encoder transformer layer (averaged over 10k runs) is:

| Layer | main | pre-flatten | 1+cache | 1+2+pre-count |
| --- | --- | --- | --- | --- |
| Fused TE Layer | 1.66 ms | 1.64 ms | 1.65 ms | 1.63 ms |
| Fused TE Layer (FP8) | 2.31 ms | 2.37 ms | 2.17 ms | 2.18 ms |
| Sequential TE Layer | 1.92 ms | 1.91 ms | 1.92 ms | 1.88 ms |
| Sequential TE Layer (FP8) | 3.12 ms | 3.09 ms | 2.92 ms | 2.87 ms |
| Builtin TE TransformerLayer | 2.51 ms | 2.49 ms | 2.50 ms | 2.52 ms |
| Builtin TE TransformerLayer (FP8) | 3.14 ms | 3.12 ms | 2.96 ms | 2.95 ms |

Where:

  • "pre-flatten" (opt. 1) flattens the list of parameters of basic_ops in OperationFuser.__init__ instead of in OperationFuser.__call__
  • "1+cache" (opt. 2) applies opt. 1 and additionally caches is_non_tn_fp8_gemm_supported
  • "1+2+pre-count" (opt. 3) applies opts. 1 and 2, changes the interface of _OperationFuserAutogradFunction.forward to take fuser: OperationFuser instead of 7 separate parameters, and counts the parameters of basic_ops in OperationFuser.__init__ instead of in _OperationFuserAutogradFunction.forward

It appears that the most significant change performance-wise is actually the one-line change of caching is_non_tn_fp8_gemm_supported, as it speeds up all of the FP8 layers.
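
For reference, a common way to get this kind of one-line caching in Python — a sketch, not necessarily how the PR implements it — is functools.lru_cache on an argument-free probe:

```python
# Sketch of caching a capability probe with functools.lru_cache. Only the
# function name is_non_tn_fp8_gemm_supported comes from the PR; the body
# below is invented for illustration.
import functools

import torch


@functools.lru_cache(maxsize=None)
def is_non_tn_fp8_gemm_supported() -> bool:
    # Pretend the probe is expensive (driver/library queries). After the
    # first call, lru_cache turns every later call into a dict lookup.
    if not torch.cuda.is_available():
        return False
    major, _minor = torch.cuda.get_device_capability()
    return major >= 10  # illustrative threshold, not the real criterion
```

Since the answer cannot change within a process, the cache never needs invalidation, which is why a single memoized flag can remove the cost from every FP8 forward pass.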

@timmoon10 (Collaborator)

/te-ci pytorch

timmoon10 merged commit 8d4bdbc into NVIDIA:main on Jun 13, 2025
21 checks passed
janekb04 deleted the optimize-ops-fuser branch on June 13, 2025 at 17:22
