With the newest version of the Docker image (tested on 2024-04-22), training Nous-Hermes-13b with thunder.jit (default executors) and FSDP + zero2 fails with:

```
File "/opt/pytorch/lightning-thunder/thunder/core/baseutils.py", line 103, in check
[rank2]:     raise exception_type(s())
[rank2]: RuntimeError: Current sharding requires the first dimension of the parameter 'lm_head.weight' (32001) to be divisible by the world size (8)
```
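For context, the failing check boils down to a divisibility test on the parameter's first dimension: the model's vocabulary size is 32001, which is not a multiple of the world size 8. A minimal sketch of that precondition (illustrative names only, not Thunder's actual code):

```python
# Sketch of the sharding precondition that fails here.
# `check_shardable` is a stand-in for the check in thunder/core/baseutils.py.
def check_shardable(name: str, dim0: int, world_size: int) -> None:
    if dim0 % world_size != 0:
        raise RuntimeError(
            f"Current sharding requires the first dimension of the parameter "
            f"'{name}' ({dim0}) to be divisible by the world size ({world_size})"
        )

check_shardable("lm_head.weight", 32000, 8)  # fine: 32000 % 8 == 0
# check_shardable("lm_head.weight", 32001, 8) would raise: 32001 % 8 == 1
```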
To Reproduce
1. Start the container before every test:

```shell
docker run --gpus all --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 --pull always INTERNAL_ADDRESS:5005/dl/pytorch/update-scripts:pjnl-latest
```
mruberry changed the title from "Nous-Hermes-13b from LitGPT with FSDP + zero2 gives error" to "Feature request: Support sharding parameters where first dimension is not divisible by 8" on Apr 22, 2024.
Thunder:
2. Inside container run:

Inductor:
2. Inside container run:

The Inductor command works; only the Thunder run fails.
Expected behavior
We should be able to train Nous-Hermes-13b with FSDP + zero2 + Thunder.
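One possible way to support this (an assumption about the fix, not a committed design) is to pad the parameter's first dimension up to the next multiple of the world size before sharding, then slice the padding off when gathering. The arithmetic for `lm_head.weight` would look like:

```python
import math

def padded_dim0(dim0: int, world_size: int) -> int:
    """Smallest multiple of world_size that is >= dim0 (illustrative helper)."""
    return math.ceil(dim0 / world_size) * world_size

# lm_head.weight has first dimension 32001; with 8 ranks it would be
# padded to 32008, giving each rank an equal shard of 4001 rows.
print(padded_dim0(32001, 8))       # 32008
print(padded_dim0(32001, 8) // 8)  # 4001
```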
Environment
As in the Docker image. The reproduction was run on 8x A100 GPUs.
Output of nvidia-smi: see the screenshot attached to the original issue.
cc @carmocca @awaelchli @crcrpar