fp8_group when using FSDP and tensor parallelism #656

Closed
cavdard opened this issue Feb 5, 2024 · 1 comment · Fixed by #852
Labels
documentation Improvements or additions to documentation

Comments

cavdard commented Feb 5, 2024

Hi,

What is the correct fp8_group when using FSDP and tensor parallelism together?
Is it all GPUs, or just within each tensor-parallel group?

Thanks.

@timmoon10 (Collaborator) commented

The process group for FP8 amax reductions (fp8_group) should be the combination of the data-parallel and tensor-parallel groups, which is the world group in your use case. This is because the activation and dgrad tensors need consistent FP8 scaling factors when they are distributed across ranks. In principle we could get away with doing FP8 amax reductions over just the tensor-parallel group in order to reduce the amount of global communication, but that makes the FP8 casts less stable and complicates checkpointing.

I see that the example in the docs uses fp8_group=data_parallel_group instead of world_group. Since that example runs on a single GPU it is technically correct, but it is confusing and should be fixed.
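For reference, here is a minimal sketch (not taken from this issue) of passing the world group as fp8_group. It assumes Transformer Engine's fp8_autocast API and torch.distributed, and uses a plain te.Linear as a stand-in for an FSDP/TP-wrapped model; the group setup would need to be adapted to your launcher and parallelism layout.

```python
# Sketch only: the FSDP + tensor-parallel wrapping itself is omitted for brevity.
import torch
import torch.distributed as dist
import transformer_engine.pytorch as te
from transformer_engine.common.recipe import DelayedScaling, Format

dist.init_process_group(backend="nccl")
torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

# Amax reductions span data-parallel x tensor-parallel ranks, i.e. the world
# group, so every shard sees consistent FP8 scaling factors.
amax_group = dist.group.WORLD

fp8_recipe = DelayedScaling(fp8_format=Format.HYBRID, amax_history_len=16)

# Stand-in layer; in practice this would be the FSDP/TP-wrapped model.
layer = te.Linear(1024, 1024).cuda()
inp = torch.randn(32, 1024, device="cuda")

with te.fp8_autocast(enabled=True, fp8_recipe=fp8_recipe, fp8_group=amax_group):
    out = layer(inp)
out.sum().backward()
```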

@ptrendx added the documentation (Improvements or additions to documentation) label on May 16, 2024