Detection training: single-GPU training works, but multi-GPU training fails with error: unrecognized arguments: --gpus [2,3,4,6] #3327
Comments
The command was wrong, so I changed it to: But now there is a new error:
Killed
Additional info: Paddle was installed with: pip install paddlepaddle-gpu==2.0.1.post110 -f https://paddlepaddle.org.cn/whl/mkl/stable.html
You may need to install 'nccl2' from the NVIDIA official website (https://developer.nvidia.com/nccl/nccl-download) before installing PaddlePaddle. Here is how to install nccl2 on Ubuntu 18.04 from the Aliyun mirror:
curl -fsSL https://mirrors.aliyun.com/nvidia-cuda/ubuntu1804/x86_64/7fa2af80.pub | apt-key add -
echo "deb https://mirrors.aliyun.com/nvidia-cuda/ubuntu1804/x86_64/ ./" > /etc/apt/sources.list.d/cuda.list
apt update
sudo apt install libnccl2=2.9.6-1+cuda11.0 libnccl-dev=2.9.6-1+cuda11.0
Thanks, we are trying your method to solve this problem.
It looks like the Aliyun mirror is temporarily down today. I'll try again tomorrow.
If you mean opening the mirror website in a browser, that has been closed for a long time...
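Hello, I found the corresponding install file there and ran:
root@mk-NF5468M5:/data/Hong/SW# sudo dpkg -i nccl-repo-ubuntu1804-2.8.3-ga-cuda11.2_1-1_amd64.deb
Judging by the output, NCCL seems to have been installed successfully. However, in the conda environment I then ran:
python -c "import paddle; paddle.fluid.install_check.run_check()"
Does this mean NCCL is working? .... In the end the process seems to have been killed.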
The specific error message is as follows: File "tools/train.py", line 102, in main This no longer seems to have much to do with multi-GPU?
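Actually, you only need to change the OS version in the commands... there is no need for a local install... And judging by the package you installed, you are on 18.04 anyway...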
It keeps getting killed. The full printed log is below. Could you please take another look? @chenglong-do
This was handled by measures such as setting shared memory to False, setting worker nums to 0, and clearing shm with: rm -rf /dev/shm/*
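For anyone hitting the same "Killed" symptom, a quick look at how full /dev/shm is can help decide whether the measures above are relevant. This is only a minimal sketch using the Python standard library. The helper names `shm_usage` and `shm_nearly_full` and the 0.9 threshold are illustrative assumptions, not part of PaddleOCR or PaddlePaddle, and a full /dev/shm is only one possible cause of a killed process.

```python
import os
import shutil


def shm_usage(path="/dev/shm"):
    """Return (used_bytes, total_bytes) for the mount at `path`, or None if it does not exist."""
    if not os.path.isdir(path):
        return None
    usage = shutil.disk_usage(path)
    return usage.used, usage.total


def shm_nearly_full(path="/dev/shm", threshold=0.9):
    """Return True when the mount exists and its used fraction is at or above `threshold`.

    The 0.9 default is an arbitrary illustrative value.
    """
    result = shm_usage(path)
    if result is None:
        return False
    used, total = result
    return total > 0 and used / total >= threshold


if __name__ == "__main__":
    info = shm_usage()
    if info is None:
        print("/dev/shm not found on this system")
    else:
        used, total = info
        print(f"/dev/shm: {used / 2**20:.1f} MiB used of {total / 2**20:.1f} MiB")
        print("nearly full:", shm_nearly_full())
```

Running it before and after clearing /dev/shm shows whether stale segments from earlier runs were taking up space.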
@HongChow Hello, where is shared memory set? And was your data fine at the time?
Training command:
python -m paddle.distributed.launch tools/train.py --gpus '2,3,4,6' -c configs/det/det_mv3_db.yml -o Global.pretrain_weights=./pretrain_models/MobileNetV3_large_x1_0_ssld_pretrained/
Error message:
train.py: error: unrecognized arguments: --gpus [2,3,4,6]
INFO 2021-07-13 15:04:06,240 launch_utils.py:307] terminate all the procs
ERROR 2021-07-13 15:04:06,240 launch_utils.py:545] ABORT!!! Out of all 8 trainers, the trainer process with rank=[0, 1, 2, 3, 4, 5, 6, 7] was aborted. Please check its log.
INFO 2021-07-13 15:04:09,243 launch_utils.py:307] terminate all the procs
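The error above is consistent with how launcher-style command lines are usually parsed: options placed after the script path are forwarded to the script, so `--gpus` reaches train.py instead of the launcher, and the launcher starts trainers on all 8 cards. The sketch below imitates that behaviour with `argparse`. `split_launch_args` is a hypothetical stand-in for illustration only, not PaddlePaddle's actual parser.

```python
import argparse


def split_launch_args(argv):
    """Split a command line into (gpus, training_script, script_args).

    Hypothetical imitation of a launcher's parser: --gpus is a launcher-level
    option, and everything after the script path is passed through untouched.
    """
    parser = argparse.ArgumentParser(prog="launch")
    parser.add_argument("--gpus", default=None)  # launcher-level option
    parser.add_argument("training_script")  # e.g. tools/train.py
    parser.add_argument("script_args", nargs=argparse.REMAINDER)
    ns = parser.parse_args(argv)
    return ns.gpus, ns.training_script, ns.script_args


if __name__ == "__main__":
    # --gpus after the script path: it is forwarded to the script.
    after = ["tools/train.py", "--gpus", "2,3,4,6", "-c", "configs/det/det_mv3_db.yml"]
    print(split_launch_args(after))

    # --gpus before the script path: the launcher consumes it.
    before = ["--gpus", "2,3,4,6", "tools/train.py", "-c", "configs/det/det_mv3_db.yml"]
    print(split_launch_args(before))
```

Under this reading, the option belongs between `paddle.distributed.launch` and `tools/train.py` on the command line.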
OS: UBUNTU
CUDA: NVIDIA-SMI 460.39 Driver Version: 460.39 CUDA Version: 11.2
GPUs: 8 in total, indexed 0-7
GeForce RTX 3090
Please help.