[PyTorch Debug] Fix the issue with PP by pggPL · Pull Request #1894 · NVIDIA/TransformerEngine

pggPL · 2025-06-25T11:05:58Z

Description

Simple fix for PP (pipeline parallelism).

Fixes # (issue)

Type of change

Documentation change (change only to the documentation, either a fix or a new content)
Bug fix (non-breaking change which fixes an issue)
New feature (non-breaking change which adds functionality)
Breaking change (fix or feature that would cause existing functionality to not work as expected)
Infra/Build change
Code refactoring

Checklist:

I have read and followed the contributing guidelines
The functionality is complete
I have commented my code, particularly in hard-to-understand areas
I have made corresponding changes to the documentation
My changes generate no new warnings
I have added tests that prove my fix is effective or that my feature works
New and existing unit tests pass locally with my changes

pggPL · 2025-06-25T11:10:18Z

/te-ci pytorch

lengerfulluse · 2025-06-25T23:35:55Z

# .internal = True is slightly faster, but results
# in errors when caching the weights.

@pggPL thanks for working on this. I verified that PP > 1 works when I set it to False. Do you have any metrics on the throughput impact, given that the comment describes True as "slightly faster"?

pggPL · 2025-06-26T08:47:15Z

@lengerfulluse we use internal=True to mitigate the CPU overhead of creating tensor objects. Setting it to False should not affect layers in iterations without logging, and for logged layers the added cost should be much smaller than the other overhead of logging.
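For readers who want a rough number on their own machine, the per-object CPU cost discussed here can be estimated with a small micro-benchmark. The sketch below is not Transformer Engine code: `LightStorage`, `FullWrapper`, and `time_construction` are hypothetical names, the two classes only stand in for a minimal storage object and a heavier wrapper, and only the Python standard library is used. It illustrates the measurement pattern and says nothing about the real overhead in TE.

```python
import time


class LightStorage:
    """Hypothetical stand-in for a minimal storage-only object."""

    __slots__ = ("data", "scale")

    def __init__(self, data, scale):
        self.data = data
        self.scale = scale


class FullWrapper:
    """Hypothetical stand-in for a heavier wrapper with extra metadata."""

    def __init__(self, data, scale):
        self.data = data
        self.scale = scale
        # Extra bookkeeping performed on every construction.
        self.meta = {
            "shape": (len(data),),
            "dtype": "float32",
            "requires_grad": False,
        }


def time_construction(factory, n_iters=100_000):
    """Return the average wall-clock seconds per factory(data, scale) call."""
    data = [0.0] * 8
    start = time.perf_counter()
    for _ in range(n_iters):
        factory(data, 1.0)
    return (time.perf_counter() - start) / n_iters


if __name__ == "__main__":
    light = time_construction(LightStorage)
    full = time_construction(FullWrapper)
    print(f"light: {light * 1e9:.0f} ns/object, full: {full * 1e9:.0f} ns/object")
```

To answer the throughput question for an actual training run, the same pattern would have to be applied around the real code path, ideally comparing whole iterations with and without logging enabled rather than isolated object construction.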

* fix

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* fix

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

pggPL and others added 4 commits June 25, 2025 10:58

fix

2183c89

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

fix

d8d1c93

Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>

[pre-commit.ci] auto fixes from pre-commit.com hooks

2103f94

for more information, see https://pre-commit.ci

Merge branch 'main' into pp_nvinspect

52df72c

ksivaman approved these changes Jun 26, 2025

pggPL merged commit 964c2ed into NVIDIA:main Jun 26, 2025
19 of 21 checks passed

pggPL merged 4 commits into NVIDIA:main from pggPL:pp_nvinspect

3 participants
