
Detection training works on a single GPU, but multi-GPU training fails with error: unrecognized arguments: --gpus [2,3,4,6] #3327

Closed
HongChow opened this issue Jul 13, 2021 · 13 comments
HongChow commented Jul 13, 2021

Training command:
python -m paddle.distributed.launch tools/train.py --gpus '2,3,4,6' -c configs/det/det_mv3_db.yml -o Global.pretrain_weights=./pretrain_models/MobileNetV3_large_x1_0_ssld_pretrained/

Error message:
train.py: error: unrecognized arguments: --gpus [2,3,4,6]
INFO 2021-07-13 15:04:06,240 launch_utils.py:307] terminate all the procs
ERROR 2021-07-13 15:04:06,240 launch_utils.py:545] ABORT!!! Out of all 8 trainers, the trainer process with rank=[0, 1, 2, 3, 4, 5, 6, 7] was aborted. Please check its log.
INFO 2021-07-13 15:04:09,243 launch_utils.py:307] terminate all the procs

OS: Ubuntu

CUDA: NVIDIA-SMI 460.39, Driver Version: 460.39, CUDA Version: 11.2

GPUs: 8 in total (0-7), all GeForce RTX 3090

Any help would be appreciated.

@HongChow (Author)

The command was wrong; I changed it to:
python -m paddle.distributed.launch --gpus '2,3,4,6' tools/train.py -c configs/det/det_mv3_db.yml -o Global.pretrain_weights=./pretrain_models/MobileNetV3_large_x1_0_ssld_pretrained/

But now there is a new error:
W0713 15:35:20.323416 26672 dynamic_loader.cc:207] You may need to install 'nccl2' from NVIDIA official website: https://developer.nvidia.com/nccl/nccl-downloadbefore install PaddlePaddle.
Traceback (most recent call last):
File "tools/train.py", line 125, in
main(config, device, logger, vdl_writer)
File "tools/train.py", line 47, in main
dist.init_parallel_env()
File "/home/admin/anaconda3/envs/PaddleOCR/lib/python3.7/site-packages/paddle/distributed/parallel.py", line 184, in init_parallel_env
parallel_helper._init_parallel_ctx()
File "/home/admin/anaconda3/envs/PaddleOCR/lib/python3.7/site-packages/paddle/fluid/dygraph/parallel_helper.py", line 42, in _init_parallel_ctx
parallel_ctx__clz.init()
RuntimeError: (PreconditionNotMet) The third-party dynamic library (libnccl.so) that Paddle depends on is not configured correctly. (error code is libnccl.so: cannot open shared object file: No such file or directory)
Suggestions:

  1. Check if the third-party dynamic library (e.g. CUDA, CUDNN) is installed correctly and its version is matched with paddlepaddle you installed.
  2. Configure third-party dynamic library environment variables as follows:
  • Linux: set LD_LIBRARY_PATH by export LD_LIBRARY_PATH=...
  • Windows: set PATH by `set PATH=XXX; (at /paddle/paddle/fluid/platform/dynload/dynamic_loader.cc:234)

Killed
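A quick way to check whether the loader can actually see libnccl before reinstalling anything (standard Linux commands, not from this thread; the export path is only an example):

```shell
# Ask the dynamic linker cache whether any libnccl is registered
ldconfig -p | grep -i libnccl || echo "libnccl not found by ldconfig"

# If NCCL lives in a non-standard prefix, point the loader at it before
# launching training (adjust the path to wherever NCCL is installed):
export LD_LIBRARY_PATH=/usr/local/nccl/lib:$LD_LIBRARY_PATH
```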

@HongChow (Author)

Additional info: paddle was installed with pip install paddlepaddle-gpu==2.0.1.post110 -f https://paddlepaddle.org.cn/whl/mkl/stable.html
Single-GPU training runs fine.

@chenglong-do

You may need to install 'nccl2' from NVIDIA official website: https://developer.nvidia.com/nccl/nccl-downloadbefore install PaddlePaddle.

Here is how to install nccl2 from the Aliyun mirror on Ubuntu 18.04:

curl -fsSL https://mirrors.aliyun.com/nvidia-cuda/ubuntu1804/x86_64/7fa2af80.pub | apt-key add -
echo "deb https://mirrors.aliyun.com/nvidia-cuda/ubuntu1804/x86_64/ ./" > /etc/apt/sources.list.d/cuda.list
apt update
sudo apt install libnccl2=2.9.6-1+cuda11.0 libnccl-dev=2.9.6-1+cuda11.0

@HongChow (Author)

Thank you, we are trying to solve this with your method now.

@HongChow (Author)

It seems the Aliyun mirror is temporarily down today; I will try again tomorrow.

@chenglong-do

If you mean opening the mirror site in a browser, that has been closed for quite a while...
You can browse https://mirrors.aliyun.com/nvidia-cuda/ directly to get the different versions.

@HongChow (Author)

Hello, I found the corresponding install package there and ran:

root@mk-NF5468M5:/data/Hong/SW# sudo dpkg -i nccl-repo-ubuntu1804-2.8.3-ga-cuda11.2_1-1_amd64.deb
(Reading database ... 228787 files and directories currently installed.)
Preparing to unpack nccl-repo-ubuntu1804-2.8.3-ga-cuda11.2_1-1_amd64.deb ...
Unpacking nccl-repo-ubuntu1804-2.8.3-ga-cuda11.2 (1-1) over (1-1) ...
Setting up nccl-repo-ubuntu1804-2.8.3-ga-cuda11.2 (1-1) ...
root@mk-NF5468M5:/data/Hong/SW# sudo apt install libnccl2=2.8.3-1+cuda11.2 libnccl-dev=2.8.3-1+cuda11.2
Reading package lists... Done
Building dependency tree
Reading state information... Done
libnccl-dev is already the newest version (2.8.3-1+cuda11.2).
The following packages were automatically installed and are no longer required:
fonts-lato fonts-texgyre javascript-common libjs-jquery libllvm9 libruby2.5 preview-latex-style rake ruby ruby-did-you-mean ruby-minitest ruby-net-telnet ruby-power-assert ruby-test-unit
ruby2.5 rubygems-integration tex-gyre texlive-fonts-recommended texlive-latex-base texlive-latex-extra texlive-latex-recommended texlive-pictures texlive-plain-generic tipa
Use 'sudo apt autoremove' to remove them.
The following packages will be DOWNGRADED:
libnccl2
0 upgraded, 0 newly installed, 1 downgraded, 0 to remove and 75 not upgraded.
Need to get 0 B/40.6 MB of archives.
After this operation, 0 B of additional disk space will be used.
Do you want to continue? [Y/n] y
Get:1 file:/var/nccl-repo-2.8.3-ga-cuda11.2 libnccl2 2.8.3-1+cuda11.2 [40.6 MB]
dpkg: warning: downgrading libnccl2 from 2.8.4-1+cuda11.2 to 2.8.3-1+cuda11.2
(Reading database ... 228787 files and directories currently installed.)
Preparing to unpack .../libnccl2_2.8.3-1+cuda11.2_amd64.deb ...
Unpacking libnccl2 (2.8.3-1+cuda11.2) over (2.8.4-1+cuda11.2) ...
Setting up libnccl2 (2.8.3-1+cuda11.2) ...
Processing triggers for libc-bin (2.27-3ubuntu1.4) ...

Judging by the output, my nccl seems to have installed successfully, but when I ran python -c "import paddle; paddle.fluid.install_check.run_check()"
in the conda environment, the multi-GPU setup still did not look right:
....
W0714 19:52:13.318320 13784 parallel_executor.cc:596] Cannot enable P2P access from 7 to 6
W0714 19:52:38.857322 13784 fuse_all_reduce_op_pass.cc:79] Find all_reduce operators: 2. To make the speed faster, some all_reduce ops are fused during training, after fusion, the number of all_reduce ops is 1.
Your Paddle Fluid works well on MUTIPLE GPU or CPU.
Your Paddle Fluid is installed successfully! Let's start deep Learning with Paddle Fluid now

Does this mean NCCL is now working?
I then ran the training script again: python -m paddle.distributed.launch --gpus '2,3,4,6' tools/train.py -c configs/det/det_mv3_db.yml -o Global.pretrain_weights=./pretrain_models/MobileNetV3_large_x1_0_ssld_pretrained/
Memory usage on GPUs 2, 3, 4 and 6 does grow, but the run seems stuck here for a long time:

....
[2021/07/14 20:12:06] root INFO: shuffle : True
[2021/07/14 20:12:06] root INFO: use_shared_memory : False
[2021/07/14 20:12:06] root INFO: train with paddle 2.0.1 and device CUDAPlace(3)
I0714 20:12:07.002872 26797 nccl_context.cc:189] init nccl context nranks: 4 local rank: 0 gpu id: 3 ring id: 0
W0714 20:12:07.738409 26797 device_context.cc:362] Please NOTE: device: 3, GPU Compute Capability: 8.6, Driver API Version: 11.2, Runtime API Version: 11.0
W0714 20:12:07.830386 26797 device_context.cc:372] device: 3, cuDNN Version: 8.0.
[2021/07/14 20:12:13] root INFO: Initialize indexs of datasets:['../ZM_DATA/Det/zll_erge_2596_new/Label.txt']
[2021/07/14 20:12:13] root INFO: Initialize indexs of datasets:['../ZM_DATA/Det/500/Label.txt']
[2021/07/14 20:12:14] root INFO: load pretrained model from ['./pretrain_models/MobileNetV3_large_x1_0_ssld_pretrained']
[2021/07/14 20:12:14] root INFO: train dataloader has 11 iters
[2021/07/14 20:12:14] root INFO: valid dataloader has 500 iters
[2021/07/14 20:12:14] root INFO: During the training process, after the 0th iteration, an evaluation is run every 100 iterations
[2021/07/14 20:12:14] root INFO: Initialize indexs of datasets:['../ZM_DATA/Det/zll_erge_2596_new/Label.txt']

and in the end it seems to get killed.

@HongChow (Author)

The detailed error is:
2021-07-14 20:17:18,221 - ERROR - DataLoader reader thread raised an exception!
2021-07-14 20:17:18,272 - ERROR - DataLoader reader thread failed() to read data from workers' result queue.
Traceback (most recent call last):
File "tools/train.py", line 125, in
Exception in thread Thread-1:
Traceback (most recent call last):
File "/home/admin/anaconda3/envs/PaddleOCR/lib/python3.7/threading.py", line 926, in _bootstrap_inner
self.run()
File "/home/admin/anaconda3/envs/PaddleOCR/lib/python3.7/threading.py", line 870, in run
self._target(*self._args, **self._kwargs)
File "/home/admin/anaconda3/envs/PaddleOCR/lib/python3.7/site-packages/paddle/fluid/dataloader/dataloader_iter.py", line 616, in _thread_loop
batch = self._get_data()
File "/home/admin/anaconda3/envs/PaddleOCR/lib/python3.7/site-packages/paddle/fluid/dataloader/dataloader_iter.py", line 710, in _get_data
six.reraise(*sys.exc_info())
File "/home/admin/anaconda3/envs/PaddleOCR/lib/python3.7/site-packages/six.py", line 719, in reraise
raise value
File "/home/admin/anaconda3/envs/PaddleOCR/lib/python3.7/site-packages/paddle/fluid/dataloader/dataloader_iter.py", line 684, in _get_data
data = self._data_queue.get(timeout=self._timeout)
File "/home/admin/anaconda3/envs/PaddleOCR/lib/python3.7/multiprocessing/queues.py", line 108, in get
res = self._recv_bytes()
File "/home/admin/anaconda3/envs/PaddleOCR/lib/python3.7/multiprocessing/connection.py", line 216, in recv_bytes
buf = self._recv_bytes(maxlength)
File "/home/admin/anaconda3/envs/PaddleOCR/lib/python3.7/multiprocessing/connection.py", line 411, in _recv_bytes
return self._recv(size)
File "/home/admin/anaconda3/envs/PaddleOCR/lib/python3.7/multiprocessing/connection.py", line 386, in _recv
buf.write(chunk)
MemoryError
main(config, device, logger, vdl_writer)

File "tools/train.py", line 102, in main
eval_class, pre_best_model_dict, logger, vdl_writer)
File "/data/Hong/OCR/PaddleOCR/tools/program.py", line 204, in train
for idx, batch in enumerate(train_dataloader):
File "/home/admin/anaconda3/envs/PaddleOCR/lib/python3.7/site-packages/paddle/fluid/dataloader/dataloader_iter.py", line 779, in next
data = self.reader.read_next_var_list()
SystemError: (Fatal) Blocking queue is killed because the data reader raises an exception.
[Hint: Expected killed
!= true, but received killed_:1 == true:1.] (at /paddle/paddle/fluid/operators/reader/blocking_queue.h:158)

This no longer seems related to multi-GPU, does it?
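Since the DataLoader workers exchange batches through shared memory by default, the state of /dev/shm is worth checking when a MemoryError like this appears (standard Linux commands, not from this run):

```shell
# DataLoader workers that use shared memory fail once /dev/shm fills up.
df -h /dev/shm            # size and usage of the shared-memory mount
ls -la /dev/shm | head    # segments left over from crashed runs linger here
```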

@chenglong-do commented Jul 15, 2021

Actually you only needed to change the OS version inside the commands; there was no need for a local install. And judging from the packages you installed, you are already on 18.04...
Now it looks like a data problem: train dataloader has 11 iters.
With so few iterations, is batch_size set too large, or is the dataset very small?

@HongChow (Author)

Yes, at the moment I am only using about 2,500 samples; the goal is just to get multi-GPU running. batch_size was set rather large at first and I have now reduced it.

@HongChow (Author)

It still keeps getting killed. The full printed log is below; could you please take another look? @chenglong-do
(PaddleOCR) admin@mk-NF5468M5:/data/Hong/OCR/PaddleOCR$ python -m paddle.distributed.launch --gpus '1,2,3,4,5,6,7' tools/train.py -c configs/det/det_mv3_db.yml -o Global.pretrain_weights=./pretrain_models/MobileNetV3_large_x1_0_ssld_pretrained/
/home/admin/anaconda3/envs/PaddleOCR/lib/python3.7/site-packages/paddle/fluid/layers/utils.py:26: DeprecationWarning: np.int is a deprecated alias for the builtin int. To silence this warning, use int by itself. Doing this will not modify any behavior and is safe. When replacing np.int, you may wish to use e.g. np.int64 or np.int32 to specify the precision. If you wish to review your current use, check the release note link for additional information.
Deprecated in NumPy 1.20; for more details and guidance: https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations
def convert_to_list(value, n, name, dtype=np.int):
----------- Configuration Arguments -----------
gpus: 1,2,3,4,5,6,7
heter_worker_num: None
heter_workers:
http_port: None
ips: 127.0.0.1
log_dir: log
nproc_per_node: None
server_num: None
servers:
training_script: tools/train.py
training_script_args: ['-c', 'configs/det/det_mv3_db.yml', '-o', 'Global.pretrain_weights=./pretrain_models/MobileNetV3_large_x1_0_ssld_pretrained/']
worker_num: None
workers:

WARNING 2021-07-15 15:31:10,665 launch.py:316] Not found distinct arguments and compiled with cuda. Default use collective mode
launch train in GPU mode
INFO 2021-07-15 15:31:10,666 launch_utils.py:471] Local start 7 processes. First process distributed environment info (Only For Debug):
+=======================================================================================+
| Distributed Envs Value |
+---------------------------------------------------------------------------------------+
| PADDLE_TRAINER_ID 0 |
| PADDLE_CURRENT_ENDPOINT 127.0.0.1:51779 |
| PADDLE_TRAINERS_NUM 7 |
| PADDLE_TRAINER_ENDPOINTS ... 0.1:38061,127.0.0.1:41553,127.0.0.1:39445|
| FLAGS_selected_gpus 1 |
+=======================================================================================+

INFO 2021-07-15 15:31:10,666 launch_utils.py:475] details abouts PADDLE_TRAINER_ENDPOINTS can be found in log/endpoints.log, and detail running logs maybe found in log/workerlog.0
/home/admin/anaconda3/envs/PaddleOCR/lib/python3.7/site-packages/paddle/fluid/layers/utils.py:26: DeprecationWarning: np.int is a deprecated alias for the builtin int. To silence this warning, use int by itself. Doing this will not modify any behavior and is safe. When replacing np.int, you may wish to use e.g. np.int64 or np.int32 to specify the precision. If you wish to review your current use, check the release note link for additional information.
Deprecated in NumPy 1.20; for more details and guidance: https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations
def convert_to_list(value, n, name, dtype=np.int):
/home/admin/anaconda3/envs/PaddleOCR/lib/python3.7/site-packages/skimage/morphology/_skeletonize.py:241: DeprecationWarning: np.bool is a deprecated alias for the builtin bool. To silence this warning, use bool by itself. Doing this will not modify any behavior and is safe. If you specifically wanted the numpy scalar type, use np.bool_ here.
Deprecated in NumPy 1.20; for more details and guidance: https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations
0, 1, 1, 0, 0, 1, 0, 0, 0], dtype=np.bool)
/home/admin/anaconda3/envs/PaddleOCR/lib/python3.7/site-packages/skimage/morphology/_skeletonize.py:256: DeprecationWarning: np.bool is a deprecated alias for the builtin bool. To silence this warning, use bool by itself. Doing this will not modify any behavior and is safe. If you specifically wanted the numpy scalar type, use np.bool_ here.
Deprecated in NumPy 1.20; for more details and guidance: https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations
0, 0, 0, 0, 0, 0, 0, 0, 0], dtype=np.bool)
[2021/07/15 15:31:12] root INFO: Architecture :
[2021/07/15 15:31:12] root INFO: Backbone :
[2021/07/15 15:31:12] root INFO: model_name : large
[2021/07/15 15:31:12] root INFO: name : MobileNetV3
[2021/07/15 15:31:12] root INFO: scale : 1.0
[2021/07/15 15:31:12] root INFO: Head :
[2021/07/15 15:31:12] root INFO: k : 50
[2021/07/15 15:31:12] root INFO: name : DBHead
[2021/07/15 15:31:12] root INFO: Neck :
[2021/07/15 15:31:12] root INFO: name : DBFPN
[2021/07/15 15:31:12] root INFO: out_channels : 256
[2021/07/15 15:31:12] root INFO: Transform : None
[2021/07/15 15:31:12] root INFO: algorithm : DB
[2021/07/15 15:31:12] root INFO: model_type : det
[2021/07/15 15:31:12] root INFO: Eval :
[2021/07/15 15:31:12] root INFO: dataset :
[2021/07/15 15:31:12] root INFO: data_dir : ../ZM_DATA/Det/500/
[2021/07/15 15:31:12] root INFO: label_file_list : ['../ZM_DATA/Det/500/Label.txt']
[2021/07/15 15:31:12] root INFO: name : SimpleDataSet
[2021/07/15 15:31:12] root INFO: transforms :
[2021/07/15 15:31:12] root INFO: DecodeImage :
[2021/07/15 15:31:12] root INFO: channel_first : False
[2021/07/15 15:31:12] root INFO: img_mode : BGR
[2021/07/15 15:31:12] root INFO: DetLabelEncode : None
[2021/07/15 15:31:12] root INFO: DetResizeForTest :
[2021/07/15 15:31:12] root INFO: image_shape : [736, 1280]
[2021/07/15 15:31:12] root INFO: NormalizeImage :
[2021/07/15 15:31:12] root INFO: mean : [0.485, 0.456, 0.406]
[2021/07/15 15:31:12] root INFO: order : hwc
[2021/07/15 15:31:12] root INFO: scale : 1./255.
[2021/07/15 15:31:12] root INFO: std : [0.229, 0.224, 0.225]
[2021/07/15 15:31:12] root INFO: ToCHWImage : None
[2021/07/15 15:31:12] root INFO: KeepKeys :
[2021/07/15 15:31:12] root INFO: keep_keys : ['image', 'shape', 'polys', 'ignore_tags']
[2021/07/15 15:31:12] root INFO: loader :
[2021/07/15 15:31:12] root INFO: batch_size_per_card : 1
[2021/07/15 15:31:12] root INFO: drop_last : False
[2021/07/15 15:31:12] root INFO: num_workers : 8
[2021/07/15 15:31:12] root INFO: shuffle : False
[2021/07/15 15:31:12] root INFO: use_shared_memory : True
[2021/07/15 15:31:12] root INFO: Global :
[2021/07/15 15:31:12] root INFO: cal_metric_during_train : False
[2021/07/15 15:31:12] root INFO: checkpoints : None
[2021/07/15 15:31:12] root INFO: debug : False
[2021/07/15 15:31:12] root INFO: distributed : True
[2021/07/15 15:31:12] root INFO: epoch_num : 1200
[2021/07/15 15:31:12] root INFO: eval_batch_step : [0, 100]
[2021/07/15 15:31:12] root INFO: infer_img : doc/imgs_en/img_10.jpg
[2021/07/15 15:31:12] root INFO: log_smooth_window : 20
[2021/07/15 15:31:12] root INFO: pretrain_weights : ./pretrain_models/MobileNetV3_large_x1_0_ssld_pretrained/
[2021/07/15 15:31:12] root INFO: pretrained_model : ./pretrain_models/MobileNetV3_large_x1_0_ssld_pretrained
[2021/07/15 15:31:12] root INFO: print_batch_step : 10
[2021/07/15 15:31:12] root INFO: save_epoch_step : 1200
[2021/07/15 15:31:12] root INFO: save_inference_dir : None
[2021/07/15 15:31:12] root INFO: save_model_dir : ./output/db_mv3/
[2021/07/15 15:31:12] root INFO: save_res_path : ./output/det_db/predicts_db.txt
[2021/07/15 15:31:12] root INFO: use_gpu : True
[2021/07/15 15:31:12] root INFO: use_visualdl : False
[2021/07/15 15:31:12] root INFO: Loss :
[2021/07/15 15:31:12] root INFO: alpha : 5
[2021/07/15 15:31:12] root INFO: balance_loss : True
[2021/07/15 15:31:12] root INFO: beta : 10
[2021/07/15 15:31:12] root INFO: main_loss_type : DiceLoss
[2021/07/15 15:31:12] root INFO: name : DBLoss
[2021/07/15 15:31:12] root INFO: ohem_ratio : 3
[2021/07/15 15:31:12] root INFO: Metric :
[2021/07/15 15:31:12] root INFO: main_indicator : hmean
[2021/07/15 15:31:12] root INFO: name : DetMetric
[2021/07/15 15:31:12] root INFO: Optimizer :
[2021/07/15 15:31:12] root INFO: beta1 : 0.9
[2021/07/15 15:31:12] root INFO: beta2 : 0.999
[2021/07/15 15:31:12] root INFO: lr :
[2021/07/15 15:31:12] root INFO: learning_rate : 0.001
[2021/07/15 15:31:12] root INFO: name : Adam
[2021/07/15 15:31:12] root INFO: regularizer :
[2021/07/15 15:31:12] root INFO: factor : 0
[2021/07/15 15:31:12] root INFO: name : L2
[2021/07/15 15:31:12] root INFO: PostProcess :
[2021/07/15 15:31:12] root INFO: box_thresh : 0.01
[2021/07/15 15:31:12] root INFO: max_candidates : 200
[2021/07/15 15:31:12] root INFO: name : DBPostProcess
[2021/07/15 15:31:12] root INFO: thresh : 0.01
[2021/07/15 15:31:12] root INFO: unclip_ratio : 1.5
[2021/07/15 15:31:12] root INFO: Train :
[2021/07/15 15:31:12] root INFO: dataset :
[2021/07/15 15:31:12] root INFO: data_dir : ../ZM_DATA/Det/zll_erge_2596_new/
[2021/07/15 15:31:12] root INFO: label_file_list : ['../ZM_DATA/Det/zll_erge_2596_new/Label.txt']
[2021/07/15 15:31:12] root INFO: name : SimpleDataSet
[2021/07/15 15:31:12] root INFO: ratio_list : [1.0]
[2021/07/15 15:31:12] root INFO: transforms :
[2021/07/15 15:31:12] root INFO: DecodeImage :
[2021/07/15 15:31:12] root INFO: channel_first : False
[2021/07/15 15:31:12] root INFO: img_mode : BGR
[2021/07/15 15:31:12] root INFO: DetLabelEncode : None
[2021/07/15 15:31:12] root INFO: IaaAugment :
[2021/07/15 15:31:12] root INFO: augmenter_args :
[2021/07/15 15:31:12] root INFO: args :
[2021/07/15 15:31:12] root INFO: p : 0.5
[2021/07/15 15:31:12] root INFO: type : Fliplr
[2021/07/15 15:31:12] root INFO: args :
[2021/07/15 15:31:12] root INFO: rotate : [-10, 10]
[2021/07/15 15:31:12] root INFO: type : Affine
[2021/07/15 15:31:12] root INFO: args :
[2021/07/15 15:31:12] root INFO: size : [0.5, 3]
[2021/07/15 15:31:12] root INFO: type : Resize
[2021/07/15 15:31:12] root INFO: EastRandomCropData :
[2021/07/15 15:31:12] root INFO: keep_ratio : True
[2021/07/15 15:31:12] root INFO: max_tries : 50
[2021/07/15 15:31:12] root INFO: size : [640, 640]
[2021/07/15 15:31:12] root INFO: MakeBorderMap :
[2021/07/15 15:31:12] root INFO: shrink_ratio : 0.4
[2021/07/15 15:31:12] root INFO: thresh_max : 0.7
[2021/07/15 15:31:12] root INFO: thresh_min : 0.3
[2021/07/15 15:31:12] root INFO: MakeShrinkMap :
[2021/07/15 15:31:12] root INFO: min_text_size : 8
[2021/07/15 15:31:12] root INFO: shrink_ratio : 0.4
[2021/07/15 15:31:12] root INFO: NormalizeImage :
[2021/07/15 15:31:12] root INFO: mean : [0.485, 0.456, 0.406]
[2021/07/15 15:31:12] root INFO: order : hwc
[2021/07/15 15:31:12] root INFO: scale : 1./255.
[2021/07/15 15:31:12] root INFO: std : [0.229, 0.224, 0.225]
[2021/07/15 15:31:12] root INFO: ToCHWImage : None
[2021/07/15 15:31:12] root INFO: KeepKeys :
[2021/07/15 15:31:12] root INFO: keep_keys : ['image', 'threshold_map', 'threshold_mask', 'shrink_map', 'shrink_mask']
[2021/07/15 15:31:12] root INFO: loader :
[2021/07/15 15:31:12] root INFO: batch_size_per_card : 4
[2021/07/15 15:31:12] root INFO: drop_last : False
[2021/07/15 15:31:12] root INFO: num_workers : 8
[2021/07/15 15:31:12] root INFO: shuffle : False
[2021/07/15 15:31:12] root INFO: use_shared_memory : True
[2021/07/15 15:31:12] root INFO: train with paddle 2.0.1 and device CUDAPlace(1)
W0715 15:31:12.229997 54495 nccl_context.cc:142] Socket connect worker 127.0.0.1:49541 failed, try again after 3 seconds.
I0715 15:31:15.230306 54495 nccl_context.cc:189] init nccl context nranks: 7 local rank: 0 gpu id: 1 ring id: 0
W0715 15:31:16.297904 54495 device_context.cc:362] Please NOTE: device: 1, GPU Compute Capability: 8.6, Driver API Version: 11.2, Runtime API Version: 11.0
W0715 15:31:16.300493 54495 device_context.cc:372] device: 1, cuDNN Version: 8.0.
[2021/07/15 15:31:19] root INFO: Initialize indexs of datasets:['../ZM_DATA/Det/zll_erge_2596_new/Label.txt']
[2021/07/15 15:31:19] root INFO: Initialize indexs of datasets:['../ZM_DATA/Det/500/Label.txt']
[2021/07/15 15:31:19] root INFO: load pretrained model from ['./pretrain_models/MobileNetV3_large_x1_0_ssld_pretrained']
[2021/07/15 15:31:19] root INFO: train dataloader has 93 iters
[2021/07/15 15:31:19] root INFO: valid dataloader has 500 iters
[2021/07/15 15:31:19] root INFO: During the training process, after the 0th iteration, an evaluation is run every 100 iterations
[2021/07/15 15:31:19] root INFO: Initialize indexs of datasets:['../ZM_DATA/Det/zll_erge_2596_new/Label.txt']
2021-07-15 15:31:25,538 - ERROR - DataLoader reader thread raised an exception!
Traceback (most recent call last):
File "tools/train.py", line 125, in
main(config, device, logger, vdl_writer)
File "tools/train.py", line 102, in main
eval_class, pre_best_model_dict, logger, vdl_writer)
File "/data/Hong/OCR/PaddleOCR/tools/program.py", line 204, in train
for idx, batch in enumerate(train_dataloader):
File "/home/admin/anaconda3/envs/PaddleOCR/lib/python3.7/site-packages/paddle/fluid/dataloader/dataloader_iter.py", line 779, in next
data = self._reader.read_next_var_list()
Exception in thread Thread-1:
Traceback (most recent call last):
File "/home/admin/anaconda3/envs/PaddleOCR/lib/python3.7/site-packages/paddle/fluid/dataloader/dataloader_iter.py", line 684, in _get_data
data = self._data_queue.get(timeout=self._timeout)
File "/home/admin/anaconda3/envs/PaddleOCR/lib/python3.7/multiprocessing/queues.py", line 105, in get
raise Empty
_queue.Empty

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/home/admin/anaconda3/envs/PaddleOCR/lib/python3.7/threading.py", line 926, in _bootstrap_inner
self.run()
File "/home/admin/anaconda3/envs/PaddleOCR/lib/python3.7/threading.py", line 870, in run
self._target(*self._args, **self._kwargs)
File "/home/admin/anaconda3/envs/PaddleOCR/lib/python3.7/site-packages/paddle/fluid/dataloader/dataloader_iter.py", line 616, in _thread_loop
batch = self.get_data()
File "/home/admin/anaconda3/envs/PaddleOCR/lib/python3.7/site-packages/paddle/fluid/dataloader/dataloader_iter.py", line 700, in get_data
"pids: {}".format(len(failed_workers), pids))
RuntimeError: DataLoader 8 workers exit unexpectedly, pids: 54735, 54747, 55556, 55790, 55796, 55812, 56261, 56553
SystemError
: (Fatal) Blocking queue is killed because the data reader raises an exception.
[Hint: Expected killed
!= true, but received killed
:1 == true:1.] (at /paddle/paddle/fluid/operators/reader/blocking_queue.h:158)

INFO 2021-07-15 15:31:34,778 launch_utils.py:307] terminate all the procs
ERROR 2021-07-15 15:31:34,778 launch_utils.py:545] ABORT!!! Out of all 7 trainers, the trainer process with rank=[0, 2, 3, 4, 5, 6] was aborted. Please check its log.
INFO 2021-07-15 15:31:37,781 launch_utils.py:307] terminate all the procs

@HongChow (Author)

By setting shared memory to False, setting num_workers to 0, and clearing shm (rm -rf /dev/shm/*),
multi-GPU training now runs normally.
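In terms of the config dump earlier in the thread, the workaround corresponds to loader settings like these in configs/det/det_mv3_db.yml (a sketch of just the two keys changed; apply to the Eval loader as well if needed):

```yaml
Train:
  loader:
    num_workers: 0            # was 8; 0 disables the multiprocess workers
    use_shared_memory: False  # was True; avoids exhausting /dev/shm
```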

@cqray1990
@HongChow Hi, where is shared memory set? And was there nothing wrong with your data at the time?

paddle-bot-old closed this as completed Nov 9, 2021

4 participants