
CUDA error during training #6

Open
Lily920 opened this issue May 6, 2024 · 5 comments

Comments

Lily920 commented May 6, 2024

! sh scripts/finetune_model_TaiyiXL_data_catwoman.sh
When I run it, a CUDA error is raised; even specifying the device doesn't help. Why is that?
RuntimeError: CUDA error: invalid device ordinal
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
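
For context, "invalid device ordinal" normally means a GPU index was requested that this machine does not have. A minimal check (my own sketch, not from this repo) to see which device indices actually exist:

```python
# Sketch: list the CUDA devices PyTorch can see. If the launcher requests an
# index >= device_count(), CUDA raises "invalid device ordinal".
import torch

print("CUDA available:", torch.cuda.is_available())
print("Visible GPUs  :", torch.cuda.device_count())
for i in range(torch.cuda.device_count()):
    print(f"  cuda:{i} -> {torch.cuda.get_device_name(i)}")
```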

wxj630 (Collaborator) commented May 7, 2024

./config/accelerate_config/default_config.yaml specifies single-machine training on 8 GPUs; check whether you actually have that many cards.

Lily920 (Author) commented May 7, 2024

./config/accelerate_config/default_config.yaml specifies single-machine training on 8 GPUs; check whether you actually have that many cards.

Do I really need that many cards? What should I do if I don't have that many resources?

wxj630 (Collaborator) commented May 7, 2024

You don't have to use 8 cards; just set it to however many cards you have.
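
A quick way to check this (my own sketch; it assumes the standard accelerate config layout with a num_processes field, and uses the config path mentioned above):

```python
# Sketch: compare num_processes in the accelerate config with the GPUs
# PyTorch can actually see. Field name and path are assumed from the standard
# accelerate single-machine config; adjust if your file differs.
import torch
import yaml

CONFIG_PATH = "./config/accelerate_config/default_config.yaml"

with open(CONFIG_PATH) as f:
    cfg = yaml.safe_load(f)

num_processes = cfg.get("num_processes")
visible = torch.cuda.device_count()
print(f"num_processes in config: {num_processes}")
print(f"GPUs visible to torch  : {visible}")
if isinstance(num_processes, int) and num_processes > visible:
    print(f"-> lower num_processes to {visible} (or fewer) before launching")
```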

Lily920 (Author) commented May 7, 2024

You don't have to use 8 cards; just set it to however many cards you have.

Traceback (most recent call last):
File "/root/miniconda3/bin/accelerate", line 8, in
sys.exit(main())
File "/root/miniconda3/lib/python3.10/site-packages/accelerate/commands/accelerate_cli.py", line 47, in main
args.func(args)
File "/root/miniconda3/lib/python3.10/site-packages/accelerate/commands/launch.py", line 977, in launch_command
multi_gpu_launcher(args)
File "/root/miniconda3/lib/python3.10/site-packages/accelerate/commands/launch.py", line 646, in multi_gpu_launcher
distrib_run.run(args)
File "/root/miniconda3/lib/python3.10/site-packages/torch/distributed/run.py", line 797, in run
elastic_launch(
File "/root/miniconda3/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in call
return launch_agent(self._config, self._entrypoint, list(args))
File "/root/miniconda3/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

sdxl_train.py FAILED

Failures:
<NO_OTHER_FAILURES>

Root Cause (first observed failure):
[0]:
time : 2024-05-07_17:30:10
host : autodl-container-75ca4f8174-2d098751
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 1778)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
So what is causing this unidentified error? At first I thought it was just that I didn't have enough cards.

wxj630 (Collaborator) commented May 7, 2024

I didn't run into this problem when I trained; maybe take a look at Vision-CAIR/MiniGPT-4#237 (comment).
Try checking the following (a small sketch follows the list):

  • Check whether your torch is the CUDA build.
  • Try reducing batch_size.
  • export CUDA_LAUNCH_BLOCKING=1 to see exactly where the code fails.
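
For the first and third points, a small sketch (my own example, not project code):

```python
# CUDA_LAUNCH_BLOCKING must be set before any CUDA work starts, so set it at
# the very top of the training script (or export it in the shell as above).
import os
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

import torch

print("torch version   :", torch.__version__)
print("built with CUDA :", torch.version.cuda)   # None means a CPU-only build
print("CUDA available  :", torch.cuda.is_available())
```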
