[PyTorch] Fix CP implementation with FP8 #1483

Merged: xrennvidia merged 19 commits into NVIDIA:main from xrennvidia:xren/cp_v2.0_fix on Feb 20, 2025

@xrennvidia (Collaborator) commented Feb 14, 2025

Description

Fix bugs in quantization, dtype handling, and related areas of the FP8 + context parallelism (CP) implementation.

Type of change

  • Documentation change (change only to the documentation, either a fix or a new content)
  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Infra/Build change
  • Code refactoring

Checklist:

  • I have read and followed the contributing guidelines
  • The functionality is complete
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

xrennvidia and others added 11 commits February 10, 2025 15:13
Signed-off-by: Xiaowei Ren <xren@nvidia.com>
@xrennvidia
Collaborator Author

/te-ci pytorch L1

Signed-off-by: Xiaowei Ren <xren@nvidia.com>
@xrennvidia changed the title from "Fix CP implementation with FP8" to "[PyTorch] Fix CP implementation with FP8" on Feb 14, 2025
@xrennvidia
Collaborator Author

/te-ci pytorch L1

@xrennvidia
Collaborator Author

/te-ci pytorch L1

@ptrendx added the 2.1.0 label on Feb 15, 2025
@xrennvidia requested a review from cyanguwa on February 15, 2025 01:41
@cyanguwa (Collaborator) left a comment

There are some failures in the CI, but it doesn't look like they are related to CP?

@xrennvidia
Collaborator Author

> There are some failures in the CI, but it doesn't look like they are related to CP?

Yeah, I think so. All CP tests on H100 passed. The failed tests are for other functional units.

@xrennvidia
Collaborator Author

/te-ci pytorch L1

@xrennvidia
Collaborator Author

The failed tests are unrelated to CP, and the CP tests passed locally on Blackwell GPUs, so I am merging the PR.

@xrennvidia merged commit 257345a into NVIDIA:main on Feb 20, 2025
@xrennvidia deleted the xren/cp_v2.0_fix branch on February 20, 2025 19:12
timmoon10 pushed a commit that referenced this pull request Feb 21, 2025
* commit some debug code

Signed-off-by: Xiaowei Ren <xren@nvidia.com>

* add more debug info

Signed-off-by: Xiaowei Ren <xren@nvidia.com>

* debug code commit and typo fix

Signed-off-by: Xiaowei Ren <xren@nvidia.com>

* a typo fix

Signed-off-by: Xiaowei Ren <xren@nvidia.com>

* remove debug info

Signed-off-by: Xiaowei Ren <xren@nvidia.com>

* do not return lse

Signed-off-by: Xiaowei Ren <xren@nvidia.com>

* add amax_per_step for quantizers of CP

Signed-off-by: Xiaowei Ren <xren@nvidia.com>

* fix FP8 + CP

Signed-off-by: Xiaowei Ren <xren@nvidia.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* bug fix

Signed-off-by: Xiaowei Ren <xren@nvidia.com>

* bug fix

Signed-off-by: Xiaowei Ren <xren@nvidia.com>

* dtype fix

Signed-off-by: Xiaowei Ren <xren@nvidia.com>

* bug fix

Signed-off-by: Xiaowei Ren <xren@nvidia.com>

---------

Signed-off-by: Xiaowei Ren <xren@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Xiaowei Ren <xren@login-preos01.a51.clusters.nvidia.com>