Optimize /ops/fuser.py by moving computation from forward to __init__ #1870
Merged
timmoon10 merged 7 commits into NVIDIA:main on Jun 13, 2025
Signed-off-by: Jan Bielak <jbielak@nvidia.com> (cherry picked from commit 949abe97070721b1da5117903067608250f5fb61)
Signed-off-by: Jan Bielak <jbielak@nvidia.com> (cherry picked from commit fd830ae24ffbd2d0727010b1a8a119ca72f61ce5)
…utation to __init__ Signed-off-by: Jan Bielak <jbielak@nvidia.com> (cherry picked from commit fd808991993958b670726896254b82fcb967fa07)
for more information, see https://pre-commit.ci
timmoon10 (Collaborator) reviewed on Jun 11, 2025 and left a comment:
Overall looks good. Can you quantify how much speedup you observed from each optimization?
… fuser Signed-off-by: Jan Bielak <jbielak@nvidia.com>
for more information, see https://pre-commit.ci
Author (Collaborator):
Having used my benchmark script, the running time of a GPT encoder transformer layer (averaged over 10k runs) is:
Where:
It appears that the most significant change performance-wise is actually the one line change of caching |
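As a rough illustration of why caching a repeatedly queried check can matter, the sketch below memoizes a check with functools.lru_cache and averages the running time of both variants over many calls. The function slow_capability_check is a hypothetical stand-in for any query whose result does not change during the lifetime of the process. It is not the library's function, and the timings it prints are not the measurements referred to above.

```python
import functools
import time

# Counts how often the underlying (uncached) check actually runs.
calls = {"n": 0}


def slow_capability_check():
    """Hypothetical check whose answer never changes within one process."""
    calls["n"] += 1
    return sum(range(1000)) > 0  # placeholder for the real work


@functools.lru_cache(maxsize=None)
def cached_capability_check():
    """Same check, evaluated once and then served from the cache."""
    return slow_capability_check()


def average_runtime(fn, runs=10_000):
    """Average wall-clock time per call of fn over the given number of runs."""
    start = time.perf_counter()
    for _ in range(runs):
        fn()
    return (time.perf_counter() - start) / runs


uncached = average_runtime(slow_capability_check)
calls_before_cached = calls["n"]
cached = average_runtime(cached_capability_check)
cached_calls = calls["n"] - calls_before_cached

print(f"uncached: {uncached * 1e6:.3f} us per call")
print(f"cached:   {cached * 1e6:.3f} us per call ({cached_calls} real evaluation)")
```

The call counter is the reliable signal here: the cached variant evaluates the underlying check once, however many times it is queried. The absolute timings depend on the machine.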
Collaborator:
/te-ci pytorch
timmoon10 approved these changes on Jun 12, 2025
Description
This PR moves certain computations performed during the forward pass of te.Sequential in _OperationFuserAutogradFunction.forward and OperationFuser.__call__ to OperationFuser.__init__. Additionally, it caches is_non_tn_fp8_gemm_supported.

Fixes # (issue)
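The general pattern described here, hoisting work that depends only on construction-time state out of the per-call path, can be sketched in plain Python. This is a minimal illustration and not the actual Transformer Engine code: ToyFuser, its plan attribute, and _build_plan are hypothetical names chosen for the example.

```python
class ToyFuser:
    """Hypothetical stand-in for an operation fuser (not the real class)."""

    def __init__(self, ops):
        self.ops = list(ops)
        self.plan_builds = 0
        # Built once here, because it depends only on the list of ops
        # and not on any input seen at call time.
        self.plan = self._build_plan()

    def _build_plan(self):
        # Placeholder for per-op bookkeeping that is independent of the input.
        self.plan_builds += 1
        return [(idx, op) for idx, op in enumerate(self.ops)]

    def __call__(self, x):
        # The per-call path only consumes the precomputed plan.
        for _, op in self.plan:
            x = op(x)
        return x


fuser = ToyFuser([lambda v: v + 1, lambda v: v * 2])
results = [fuser(i) for i in range(3)]
print(results, fuser.plan_builds)
```

However many times the object is called, the bookkeeping in _build_plan runs once, so its cost is paid at construction rather than on every forward pass.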
Type of change
Changes
… basic_ops in OperationFuser.__init__ instead of in OperationFuser.__call__
… _OperationFuserAutogradFunction.forward to take fuser: OperationFuser instead of 7 separate parameters
… basic_ops in OperationFuser.__init__ instead of in _OperationFuserAutogradFunction.forward
Cache is_non_tn_fp8_gemm_supported
Checklist:
test_fusible_ops.py)
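The signature change listed under Changes, passing one fuser object instead of several separate parameters, can be sketched without any framework dependency. Everything below is hypothetical: ToyFuserState, forward_many_args, and forward_single_arg only mimic the shape of the change and are not the PR's code.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class ToyFuserState:
    """Hypothetical container for state that is fixed at construction time."""

    ops: List[Callable[[float], float]]
    num_ops: int = field(init=False)

    def __post_init__(self):
        # Derived value computed once, next to the data it depends on.
        self.num_ops = len(self.ops)


def forward_many_args(x, ops, num_ops):
    """Older style: every piece of state is a separate parameter."""
    assert len(ops) == num_ops
    for op in ops:
        x = op(x)
    return x


def forward_single_arg(x, fuser):
    """Newer style: the function receives one object and reads what it needs."""
    for op in fuser.ops:
        x = op(x)
    return x


state = ToyFuserState(ops=[lambda v: v + 1.0, lambda v: v * 3.0])
out_many = forward_many_args(2.0, state.ops, state.num_ops)
out_single = forward_single_arg(2.0, state)
print(out_many, out_single)
```

Both variants compute the same result. The single-argument form keeps the call site short and lets further precomputed fields be added to the object without touching the function signature.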