2D TP+FSDP with device mesh #126548
Labels: oncall: distributed, release notes: distributed (fsdp), triaged
📚 The doc issue
At https://pytorch.org/docs/stable/fsdp.html, there is an undocumented argument `device_mesh`. This is necessary for DTensor TP, as TP support was added for `device_mesh` but not for `process_group`. Attempting to use `process_group` produces the following error:
`RuntimeError: Attempted to call resize_() on an invalid python storage.`
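For context, here is a minimal sketch of the 2D setup this issue is about, modeled on the pattern in the PyTorch Tensor Parallel tutorial. It assumes a PyTorch release that provides `init_device_mesh`, a job launched with `dp_size * tp_size` processes (for example via `torchrun`), and a caller-supplied `tp_plan` for `parallelize_module`. The helper names `mesh_layout` and `wrap_2d` are hypothetical and exist only for illustration.

```python
def mesh_layout(dp_size, tp_size):
    """Return the row-major rank grid for a (dp, tp) mesh.

    Row d holds the ranks of one TP group; column t holds the ranks of
    one DP (FSDP sharding) group. Pure Python, for illustration only.
    """
    return [[d * tp_size + t for t in range(tp_size)] for d in range(dp_size)]


def wrap_2d(model, tp_plan, dp_size, tp_size):
    """Hypothetical helper: apply TP first, then FSDP over the DP mesh dimension.

    Assumes torch.distributed is usable (e.g. started with torchrun) and
    that CUDA devices are available.
    """
    # Imports are local so that mesh_layout can be used without torch installed.
    from torch.distributed.device_mesh import init_device_mesh
    from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
    from torch.distributed.tensor.parallel import parallelize_module

    mesh_2d = init_device_mesh(
        "cuda", (dp_size, tp_size), mesh_dim_names=("dp", "tp")
    )
    # Shard the chosen submodules across the "tp" dimension with DTensor.
    model = parallelize_module(model, mesh_2d["tp"], tp_plan)
    # Pass the "dp" sub-mesh via device_mesh instead of using process_group.
    return FSDP(model, device_mesh=mesh_2d["dp"], use_orig_params=True)


if __name__ == "__main__":
    # With 8 ranks split as dp=2, tp=4, each row below is one TP group.
    for row in mesh_layout(2, 4):
        print(row)
```

The key point is that FSDP receives the `"dp"` slice of the 2D mesh through `device_mesh`, which is the argument the documentation page does not describe.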
Suggest a potential alternative/fix
Document `device_mesh`. For `process_group`, mention that it will not work with Tensor Parallel.

cc @mrshenli @pritamdamania87 @zhaojuanmao @satgera @gqchen @aazzolini @osalpekar @jiayisuse @H-Huang @kwen2501 @awgu @penguinwu @fegin @XilunWu @wanchaol @fduwjj @wz337 @tianyu-l @wconstab @yf225 @chauhang @d4l3k