Using DeepSpeed ZeRO-3 to distribute GPU memory evenly: Signal 7 (SIGBUS) received after running for a while #3747

Closed
1 task done
Labels
wontfix This will not be worked on

Comments


xudongyss commented May 15, 2024

Reminder

  • I have read the README and searched the existing issues.

Reproduction

bash examples/lora_multi_gpu/ds_zero3.sh

ds_zero3.sh:

#!/bin/bash

NPROC_PER_NODE=4

CUDA_VISIBLE_DEVICES=0,1,2,3 python -m torch.distributed.run \
    --nproc_per_node $NPROC_PER_NODE \
    --nnodes 1 \
    --standalone \
    src/train.py examples/lora_multi_gpu/llama3_lora_sft_ds.yaml

llama3_lora_sft_ds.yaml:

model_name_or_path: Meta-Llama-3-8B-Instruct

stage: pt
do_train: true
finetuning_type: lora
lora_target: all

ddp_timeout: 180000000
deepspeed: examples/deepspeed/ds_z3_config.json

dataset: wikipedia_zh_local
cutoff_len: 4096
val_size: 0.1
overwrite_cache: true
preprocessing_num_workers: 16

output_dir: saves/llama3-8b-instruct/lora/sft
logging_steps: 10
save_steps: 500
plot_loss: true
overwrite_output_dir: true

per_device_train_batch_size: 1
gradient_accumulation_steps: 8
learning_rate: 0.0001
num_train_epochs: 3.0
lr_scheduler_type: cosine
warmup_steps: 0.1
fp16: true

per_device_eval_batch_size: 1
evaluation_strategy: steps
eval_steps: 500

Error

WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 1043353 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 1043355 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 1043356 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -7) local_rank: 1 (pid: 1043354) of binary: /home/share/ug9tppkr/home/Cnsunrun/.conda/envs/test-003/bin/python
Traceback (most recent call last):
File "/home/share/ug9tppkr/home/Cnsunrun/.conda/envs/test-003/lib/python3.9/runpy.py", line 197, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/home/share/ug9tppkr/home/Cnsunrun/.conda/envs/test-003/lib/python3.9/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/home/share/ug9tppkr/home/Cnsunrun/.conda/envs/test-003/lib/python3.9/site-packages/torch/distributed/run.py", line 766, in
main()
File "/home/share/ug9tppkr/home/Cnsunrun/.conda/envs/test-003/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/init.py", line 346, in wrapper
return f(*args, **kwargs)
File "/home/share/ug9tppkr/home/Cnsunrun/.conda/envs/test-003/lib/python3.9/site-packages/torch/distributed/run.py", line 762, in main
run(args)
File "/home/share/ug9tppkr/home/Cnsunrun/.conda/envs/test-003/lib/python3.9/site-packages/torch/distributed/run.py", line 753, in run
elastic_launch(
File "/home/share/ug9tppkr/home/Cnsunrun/.conda/envs/test-003/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 132, in call
return launch_agent(self._config, self._entrypoint, list(args))
File "/home/share/ug9tppkr/home/Cnsunrun/.conda/envs/test-003/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 246, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

src/train.py FAILED

Failures:
<NO_OTHER_FAILURES>

Root Cause (first observed failure):
[0]:
time : 2024-05-15_10:04:22
host : whshare-agent-26
rank : 1 (local_rank: 1)
exitcode : -7 (pid: 1043354)
error_file: <N/A>
traceback : Signal 7 (SIGBUS) received by PID 1043354
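
For context, torchrun reports a negative exit code when a worker is killed by a signal, so exitcode -7 means the worker received signal 7, which is SIGBUS on Linux (including aarch64), matching the traceback line above. A quick way to confirm the mapping from a shell:

# Print the name of signal 7; on Linux this outputs "BUS" (i.e. SIGBUS).
kill -l 7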

GPUs

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.85.02 Driver Version: 510.85.02 CUDA Version: 11.6 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA A100-PCI... On | 00000000:01:00.0 Off | 0 |
| N/A 29C P0 35W / 250W | 0MiB / 40960MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 1 NVIDIA A100-PCI... On | 00000000:02:00.0 Off | 0 |
| N/A 29C P0 34W / 250W | 0MiB / 40960MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 2 NVIDIA A100-PCI... On | 00000000:81:00.0 Off | 0 |
| N/A 30C P0 33W / 250W | 0MiB / 40960MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 3 NVIDIA A100-PCI... On | 00000000:82:00.0 Off | 0 |
| N/A 30C P0 33W / 250W | 0MiB / 40960MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+

Expected behavior

No response

System Info

  • transformers version: 4.40.2
  • Platform: Linux-4.19.90-24.4.v2101.ky10.aarch64-aarch64-with-glibc2.28
  • Python version: 3.9.19
  • Huggingface_hub version: 0.23.0
  • Safetensors version: 0.4.3
  • Accelerate version: 0.30.1
  • Accelerate config: not found
  • PyTorch version (GPU?): 1.13.1+cu116 (True)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using GPU in script?: yes
  • Using distributed or parallel set-up in script?: yes

Others

What is causing this error, and how can it be fixed?
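
A common cause of SIGBUS (exit code -7) in multi-worker data loading and multi-process training is exhausting the shared-memory mount at /dev/shm, especially inside containers with a small default --shm-size. The sketch below is a hedged diagnostic, not a confirmed fix for this issue; the 16g remount size is an illustrative assumption:

# Check how much shared memory is available and in use; dataloader worker
# processes pass tensors through /dev/shm, and running it out raises SIGBUS.
df -h /dev/shm

# If the mount is small (e.g. the 64 MB Docker default), enlarge it.
# The 16g value is purely illustrative.
sudo mount -o remount,size=16g /dev/shm

# Config-side mitigation: reduce shared-memory pressure, for example by
# setting dataloader_num_workers: 0 (a standard transformers TrainingArguments
# field) or lowering preprocessing_num_workers in llama3_lora_sft_ds.yaml.

If /dev/shm is not the culprit, a failing memory-mapped dataset file (for example a cached Arrow file on a full or unreliable network filesystem) is another frequent SIGBUS trigger worth ruling out.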

@Parasolation

Is this error raised while saving a checkpoint?

@xudongyss
Author

> Is this error raised while saving a checkpoint?

No.

@xiaotaozi121096

Hello, has this problem been resolved?

@hiyouga added the pending (This problem is yet to be addressed.) label on May 21, 2024
@hiyouga added the wontfix (This will not be worked on) label and removed the pending label on May 29, 2024
@hiyouga closed this as not planned on May 29, 2024