Segmentation fault (core dumped) #57

Closed
chenyangMl opened this issue May 18, 2020 · 5 comments

Comments

@chenyangMl

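Hello, I was following detection.md to train the OCR detection model and ran into the problem below when running the training command, so I am asking for help here.
The problem occurs both when I select the GPU through FLAGS_selected_gpus and when I change the line in train.py to place = fluid.CUDAPlace(3) if use_gpu else fluid.CPUPlace().
It also occurs when running with use_gpu=False.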
FLAGS_selected_gpus=3 && python3 tools/train.py -c configs/det/det_mv3_db.yml -o Optimizer.base_lr=0.0001
grep: warning: GREP_OPTIONS is deprecated; please use an alias or script
2020-05-18 09:29:29,797-INFO: {'Architecture': {'function': 'ppocr.modeling.architectures.det_model,DetModel'}, 'TestReader': {'reader_function': 'ppocr.data.det.dataset_traversal,EvalTestReader', 'single_img_path': None, 'img_set_dir': './train_data/icdar2015/text_localization/', 'label_file_path': './train_data/icdar2015/text_localization/test_icdar2015_label.txt', 'test_image_shape': [736, 1280], 'process_function': 'ppocr.data.det.db_process,DBProcessTest', 'do_eval': True}, 'Backbone': {'function': 'ppocr.modeling.backbones.det_mobilenet_v3,MobileNetV3', 'model_name': 'large', 'scale': 0.5}, 'Head': {'k': 50, 'function': 'ppocr.modeling.heads.det_db_head,DBHead', 'inner_channels': 96, 'model_name': 'large', 'out_channels': 2}, 'Optimizer': {'beta1': 0.9, 'function': 'ppocr.optimizer,AdamDecay', 'base_lr': 0.0001, 'beta2': 0.999}, 'EvalReader': {'test_image_shape': [736, 1280], 'reader_function': 'ppocr.data.det.dataset_traversal,EvalTestReader', 'img_set_dir': './train_data/icdar2015/text_localization/', 'process_function': 'ppocr.data.det.db_process,DBProcessTest', 'label_file_path': './train_data/icdar2015/text_localization/test_icdar2015_label.txt'}, 'Loss': {'function': 'ppocr.modeling.losses.det_db_loss,DBLoss', 'balance_loss': True, 'beta': 10, 'alpha': 5, 'ohem_ratio': 3, 'main_loss_type': 'DiceLoss'}, 'TrainReader': {'reader_function': 'ppocr.data.det.dataset_traversal,TrainReader', 'num_workers': 8, 'img_set_dir': './train_data/icdar2015/text_localization/', 'process_function': 'ppocr.data.det.db_process,DBProcessTrain', 'label_file_path': './train_data/icdar2015/text_localization/train_icdar2015_label.txt'}, 'PostProcess': {'unclip_ratio': 1.5, 'max_candidates': 1000, 'function': 'ppocr.postprocess.db_postprocess,DBPostProcess', 'thresh': 0.3, 'box_thresh': 0.7}, 'Global': {'save_epoch_step': 200, 'save_inference_dir': None, 'eval_batch_step': 5000, 'log_smooth_window': 20, 'algorithm': 'DB', 'epoch_num': 1200, 'use_gpu': True, 
'train_batch_size_per_card': 16, 'image_shape': [3, 640, 640], 'save_model_dir': './output/det_db/', 'save_res_path': './output/det_db/predicts_db.txt', 'checkpoints': None, 'pretrain_weights': './pretrain_models/MobileNetV3_large_x0_5_pretrained/', 'test_batch_size_per_card': 16, 'reader_yml': './configs/det/det_db_icdar15_reader.yml', 'print_batch_step': 2}}
3 640 640
3 640 640
import ujson error: No module named 'ujson' use json
2020-05-18 09:29:33,067-INFO: places would be ommited when DataLoader is not iterable
W0518 09:29:33.928460 5607 device_context.cc:237] Please NOTE: device: 3, CUDA Capability: 75, Driver API Version: 10.2, Runtime API Version: 10.0
W0518 09:29:33.932370 5607 device_context.cc:245] device: 3, cuDNN Version: 7.5.
W0518 09:29:33.932396 5607 device_context.cc:271] WARNING: device: 3. The installed Paddle is compiled with CUDNN 7.6, but CUDNN version in your machine is 7.5, which may cause serious incompatible bug. Please recompile or reinstall Paddle with compatible CUDNN version.
2020-05-18 09:29:35,015-INFO: Loading parameters from ./pretrain_models/MobileNetV3_large_x0_5_pretrained/...
2020-05-18 09:29:35,015-WARNING: ./pretrain_models/MobileNetV3_large_x0_5_pretrained/.pdparams not found, try to load model file saved with [ save_params, save_persistables, save_vars ]
2020-05-18 09:29:35,015-WARNING: ./pretrain_models/MobileNetV3_large_x0_5_pretrained/.pdparams not found, try to load model file saved with [ save_params, save_persistables, save_vars ]
2020-05-18 09:29:35,206-INFO: Finish initing model from ./pretrain_models/MobileNetV3_large_x0_5_pretrained/
I0518 09:29:35.251972 5607 parallel_executor.cc:440] The Program will be executed on CUDA using ParallelExecutor, 8 cards are used, so 8 programs are executed in parallel.
W0518 09:29:48.223284 5607 init.cc:209] Warning: PaddlePaddle catches a failure signal, it may not work properly
W0518 09:29:48.223331 5607 init.cc:211] You could check whether you killed PaddlePaddle thread/process accidentally or report the case to PaddlePaddle
W0518 09:29:48.223345 5607 init.cc:214] The detail failure signal is:
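
A side note on the command line quoted above: in a POSIX shell, `FLAGS_selected_gpus=3 && python3 ...` only sets a shell variable and does not export it, so the Python process would not see it unless the variable was already exported. The thread does not establish whether this contributed to the crash. The sketch below only demonstrates the shell behavior; the helper name `visible_in_child` is hypothetical, and it assumes `sh` and `env` are available on the machine.

```python
import os
import subprocess


def visible_in_child(shell_line: str, name: str) -> bool:
    """Run shell_line under `sh -c` and report whether `name` shows up
    in the environment that the line prints (here via `env`)."""
    # Start from a copy of the current environment without `name`,
    # so a previously exported value cannot mask the effect.
    clean_env = {k: v for k, v in os.environ.items() if k != name}
    result = subprocess.run(
        ["sh", "-c", shell_line],
        env=clean_env,
        capture_output=True,
        text=True,
        check=True,
    )
    return any(line.startswith(name + "=") for line in result.stdout.splitlines())


var = "FLAGS_selected_gpus"
# Assignment followed by `&&`: a plain shell variable, not passed to the child.
print(visible_in_child(f"{var}=3 && env", var))          # False
# Assignment as a prefix of the command: passed to that one child.
print(visible_in_child(f"{var}=3 env", var))             # True
# Exported first: passed to every later child of the shell.
print(visible_in_child(f"export {var}=3 && env", var))   # True
```

Dropping the `&&` or using `export`, as in the reply below, are the two usual ways to make sure the training process actually receives the variable.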

@LDOUBLEV
Collaborator

Change place = fluid.CUDAPlace(3) to place = fluid.CUDAPlace(0). For now, to switch between GPUs, we recommend using:

export CUDA_VISIBLE_DEVICES=3
python3 tools/train.py -c 
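
The reason `CUDAPlace(0)` goes together with `CUDA_VISIBLE_DEVICES=3` is that CUDA renumbers the visible devices from 0 inside the process, in the order they are listed. A minimal sketch of that renumbering follows; `visible_device_map` is a hypothetical helper written for illustration, not part of Paddle or CUDA, and it ignores edge cases such as invalid ids.

```python
def visible_device_map(cuda_visible_devices: str) -> dict:
    """Map in-process device indices to the physical ids listed in
    a CUDA_VISIBLE_DEVICES-style string (simplified illustration)."""
    ids = [part.strip() for part in cuda_visible_devices.split(",") if part.strip()]
    # The first listed device becomes index 0, the second index 1, and so on.
    return {logical: physical for logical, physical in enumerate(ids)}


print(visible_device_map("3"))     # {0: '3'}
print(visible_device_map("3,5"))   # {0: '3', 1: '5'}
```

With only physical GPU 3 exposed, index 0 is the only valid device index in the process, so an index of 3 would point at a device the process cannot see.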

@chenyangMl
Author

CUDA_VISIBLE_DEVICES=3

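Thank you very much, it runs now. I suggest updating the documentation: it currently says that multi-GPU jobs should first set the visible GPU devices with the FLAGS_selected_gpus environment variable, and that the next release will fix the problem of the CUDA_VISIBLE_DEVICES environment variable having no effect.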

@LDOUBLEV
Collaborator

Got it, thanks for the feedback.

@chenyangMl
Author

Hello, the following error occurred during training. How can I resolve it?
2020-05-18 12:07:49,601-INFO: epoch: 156, iter: 4996, 'total_loss': 2.304114, 'loss_threshold_maps': 0.762533, 'loss_shrink_maps': 1.291126, 'lr': 1e-04, 'loss_binary_maps': 0.249221, time: 1.161
2020-05-18 12:07:51,780-INFO: epoch: 156, iter: 4998, 'total_loss': 2.336653, 'loss_threshold_maps': 0.762533, 'loss_shrink_maps': 1.323457, 'lr': 1e-04, 'loss_binary_maps': 0.251576, time: 1.072
2020-05-18 12:07:53,733-INFO: epoch: 156, iter: 5000, 'total_loss': 2.266071, 'loss_threshold_maps': 0.755013, 'loss_shrink_maps': 1.292108, 'lr': 1e-04, 'loss_binary_maps': 0.249244, time: 1.006
Traceback (most recent call last):
  File "tools/train.py", line 112, in <module>
    main()
  File "tools/train.py", line 104, in main
    program.train_eval_det_run(config, exe, train_info_dict, eval_info_dict)
  File "/paddle/PaddleOCR/tools/program.py", line 256, in train_eval_det_run
    metrics = eval_det_run(exe, config, eval_info_dict, "eval")
  File "/paddle/PaddleOCR/tools/eval_utils/eval_det_utils.py", line 126, in eval_det_run
    cal_det_res(exe, config, eval_info_dict)
  File "/paddle/PaddleOCR/tools/eval_utils/eval_det_utils.py", line 48, in cal_det_res
    for data in eval_info_dict['reader']():
  File "/paddle/PaddleOCR/ppocr/data/det/dataset_traversal.py", line 93, in batch_iter_reader
    img = cv2.imread(img_path)
TypeError: bad argument type for built-in operation

@tink2123
Collaborator

This error came from the eval data reader during training. It has been fixed; please update to the latest code.
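
The actual fix is not shown in this thread. As a rough illustration only: in some OpenCV versions, passing a non-string value such as None to `cv2.imread` can raise a TypeError like the one in the traceback, and a small guard in the reader makes that failure easier to diagnose. The helper `checked_image_path` is hypothetical and is not taken from PaddleOCR.

```python
import os


def checked_image_path(img_path):
    """Return img_path as a str if it points to an existing file,
    otherwise raise an error that names the offending value."""
    # Paths read from label files may arrive as bytes; decode them first.
    if isinstance(img_path, bytes):
        img_path = img_path.decode("utf-8")
    if not isinstance(img_path, str):
        raise TypeError(
            f"image path must be str, got {type(img_path).__name__}: {img_path!r}"
        )
    if not os.path.isfile(img_path):
        raise FileNotFoundError(f"image file not found: {img_path}")
    return img_path


# Intended use inside a reader loop (requires OpenCV, so left as a comment):
#   img = cv2.imread(checked_image_path(img_path))
```

Validating the path before the read turns an opaque error from the C extension into a message that shows which value was wrong.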
