
Feature request: Support sharding parameters where first dimension is not divisible by 8 #248

Closed
mpatel31415 opened this issue Apr 22, 2024 · 1 comment · Fixed by #415
Assignees
Labels
distributed enhancement New feature or request

Comments

@mpatel31415
Contributor

mpatel31415 commented Apr 22, 2024

🐛 Bug

With the newest version of the Docker image (tested on 2024-04-22), training Nous-Hermes-13b with thunder.jit (default executors) under FSDP + zero2 fails with:

File "/opt/pytorch/lightning-thunder/thunder/core/baseutils.py", line 103, in check
[rank2]: raise exception_type(s())
[rank2]: RuntimeError: Current sharding requires the first dimension of the parameter 'lm_head.weight' (32001) to be divisible by the world size (8)
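To make the failure and the requested feature concrete, here is a small sketch (hypothetical helper, not Thunder's actual code) of the divisibility check that fires and the padding arithmetic that would make uneven sharding possible:

```python
# Illustration only: the shard_rows helper is a hypothetical name, not part
# of Thunder's API.

def shard_rows(num_rows: int, world_size: int) -> tuple[int, int]:
    """Return (rows_per_rank, padding_rows) after padding num_rows up to
    the next multiple of world_size."""
    padding = (-num_rows) % world_size
    return (num_rows + padding) // world_size, padding

# lm_head.weight in the error above has 32001 rows; with 8 GPUs they cannot
# be split evenly (32001 % 8 == 1), which is exactly what the check rejects.
assert 32001 % 8 != 0

# Padding with 7 extra rows gives 32008 rows, or 4001 per rank; the padding
# would be dropped again after an all-gather.
rows_per_rank, padding = shard_rows(32001, 8)
assert (rows_per_rank, padding) == (4001, 7)
```

PyTorch's own FSDP handles such parameters by padding internally, so supporting this in Thunder would close the gap with the working Inductor path below.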

To Reproduce

  1. Start container before every test:
docker run --gpus all --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 --pull always INTERNAL_ADDRESS:5005/dl/pytorch/update-scripts:pjnl-latest

Thunder:
2. Inside the container run:

torchrun --nproc-per-node=8 /opt/pytorch/lightning-thunder/thunder/benchmarks/benchmark_litgpt.py --model_name Nous-Hermes-13b --compile thunder --distributed_mode fsdp --shard_mode zero2

Inductor:
2. Inside the container run:

torchrun --nproc-per-node=8 /opt/pytorch/lightning-thunder/thunder/benchmarks/benchmark_litgpt.py --model_name Nous-Hermes-13b --compile inductor --distributed_mode fsdp --shard_mode zero2

This command works.

Expected behavior

We should be able to train Nous-Hermes-13b with FSDP + zero2 + Thunder.

Environment

As in the Docker image. The reproduction examples were run on 8×A100 GPUs.

Output of nvidia-smi: (screenshot attached in the original issue)

cc @carmocca @awaelchli @crcrpar

@mpatel31415 mpatel31415 added the bug Something isn't working label Apr 22, 2024
@mruberry mruberry added triage review distributed enhancement New feature or request and removed triage review bug Something isn't working labels Apr 22, 2024
@mruberry mruberry changed the title Nous-Hermes-13b from LitGPT with FSDP + zero2 gives error Feature request: Support sharding parameters where first dimension is not divisible by 8 Apr 22, 2024
@mruberry
Collaborator

triage review:

  • updated title

Thanks @mpatel31415!
