fp8_group when using FSDP and tensor parallelism #656

Closed
cavdard opened this issue Feb 5, 2024 · 1 comment · Fixed by #852
Labels
documentation Improvements or additions to documentation

Comments

cavdard commented Feb 5, 2024

Hi,

What is the correct fp8_group when using FSDP and tensor parallelism together?
Is it all GPUs, or just within each tensor-parallel group?

Thanks.

@timmoon10 (Collaborator) commented

The process group for FP8 amax reductions (fp8_group) should be the combination of the data-parallel and tensor-parallel groups, which is the world group in your use case. This is because the activation and dgrad tensors need consistent FP8 scaling factors when they are distributed across ranks. In principle we could get away with doing FP8 amax reductions over just the tensor-parallel group in order to reduce the amount of global communication, but that makes the FP8 casts less stable and complicates checkpointing.

I see that the example in the docs uses fp8_group=data_parallel_group instead of world_group. Since that example runs on a single GPU it is technically correct, but it is confusing and should be fixed.
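For reference, here is a minimal sketch (not taken from this issue) of passing the world group as fp8_group. It assumes Transformer Engine's fp8_autocast API and torch.distributed, and uses a plain te.Linear as a stand-in for an FSDP/TP-wrapped model; the group setup would need to be adapted to your launcher and parallelism layout.

```python
# Sketch only: the FSDP + tensor-parallel wrapping itself is omitted for brevity.
import torch
import torch.distributed as dist
import transformer_engine.pytorch as te
from transformer_engine.common.recipe import DelayedScaling, Format

dist.init_process_group(backend="nccl")
torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

# Amax reductions span data-parallel x tensor-parallel ranks, i.e. the world
# group, so every shard sees consistent FP8 scaling factors.
amax_group = dist.group.WORLD

fp8_recipe = DelayedScaling(fp8_format=Format.HYBRID, amax_history_len=16)

# Stand-in layer; in practice this would be the FSDP/TP-wrapped model.
layer = te.Linear(1024, 1024).cuda()
inp = torch.randn(32, 1024, device="cuda")

with te.fp8_autocast(enabled=True, fp8_recipe=fp8_recipe, fp8_group=amax_group):
    out = layer(inp)
out.sum().backward()
```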

@ptrendx added the documentation (Improvements or additions to documentation) label on May 16, 2024