Segmentation fault during training #4927

Closed
TTMRonald opened this issue Dec 15, 2021 · 2 comments

TTMRonald commented Dec 15, 2021

Hello, prediction with the pretrained model works fine, but when I train the table structure recognition model I get a Segmentation fault error.

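Training command: python3 tools/train.py -c configs/table/table_mv3.yml
GPU environment: single GPU, cuda=10.1 cudnn=7.6
Paddle versions: paddleocr=2.3.0.2 paddlepaddle-gpu=2.1.2.post101
Changes to table_mv3.yml: only max_len : 800 and num_workers : 0 were changed; everything else keeps the default configuration
Full configuration: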

[2021/12/16 10:25:49] root INFO: Architecture : 
[2021/12/16 10:25:49] root INFO:     Backbone : 
[2021/12/16 10:25:49] root INFO:         disable_se : True
[2021/12/16 10:25:49] root INFO:         model_name : small
[2021/12/16 10:25:49] root INFO:         name : MobileNetV3
[2021/12/16 10:25:49] root INFO:         scale : 1.0
[2021/12/16 10:25:49] root INFO:     Head : 
[2021/12/16 10:25:49] root INFO:         hidden_size : 256
[2021/12/16 10:25:49] root INFO:         l2_decay : 1e-05
[2021/12/16 10:25:49] root INFO:         loc_type : 2
[2021/12/16 10:25:49] root INFO:         name : TableAttentionHead
[2021/12/16 10:25:49] root INFO:     algorithm : TableAttn
[2021/12/16 10:25:49] root INFO:     model_type : table
[2021/12/16 10:25:49] root INFO: Eval : 
[2021/12/16 10:25:49] root INFO:     dataset : 
[2021/12/16 10:25:49] root INFO:         data_dir : dataset/PubTabNet/images/val/
[2021/12/16 10:25:49] root INFO:         label_file_path : dataset/PubTabNet/annotations/PubTabNet_2.0.0_val.jsonl
[2021/12/16 10:25:49] root INFO:         name : PubTabDataSet
[2021/12/16 10:25:49] root INFO:         transforms : 
[2021/12/16 10:25:49] root INFO:             DecodeImage : 
[2021/12/16 10:25:49] root INFO:                 channel_first : False
[2021/12/16 10:25:49] root INFO:                 img_mode : BGR
[2021/12/16 10:25:49] root INFO:             ResizeTableImage : 
[2021/12/16 10:25:49] root INFO:                 max_len : 800
[2021/12/16 10:25:49] root INFO:             TableLabelEncode : None
[2021/12/16 10:25:49] root INFO:             NormalizeImage : 
[2021/12/16 10:25:49] root INFO:                 mean : [0.485, 0.456, 0.406]
[2021/12/16 10:25:49] root INFO:                 order : hwc
[2021/12/16 10:25:49] root INFO:                 scale : 1./255.
[2021/12/16 10:25:49] root INFO:                 std : [0.229, 0.224, 0.225]
[2021/12/16 10:25:49] root INFO:             PaddingTableImage : None
[2021/12/16 10:25:49] root INFO:             ToCHWImage : None
[2021/12/16 10:25:49] root INFO:             KeepKeys : 
[2021/12/16 10:25:49] root INFO:                 keep_keys : ['image', 'structure', 'bbox_list', 'sp_tokens', 'bbox_list_mask']
[2021/12/16 10:25:49] root INFO:     loader : 
[2021/12/16 10:25:49] root INFO:         batch_size_per_card : 8
[2021/12/16 10:25:49] root INFO:         drop_last : False
[2021/12/16 10:25:49] root INFO:         num_workers : 0
[2021/12/16 10:25:49] root INFO:         shuffle : False
[2021/12/16 10:25:49] root INFO: Global : 
[2021/12/16 10:25:49] root INFO:     cal_metric_during_train : True
[2021/12/16 10:25:49] root INFO:     character_dict_path : ppocr/utils/dict/table_structure_dict.txt
[2021/12/16 10:25:49] root INFO:     character_type : en
[2021/12/16 10:25:49] root INFO:     checkpoints : None
[2021/12/16 10:25:49] root INFO:     debug : False
[2021/12/16 10:25:49] root INFO:     distributed : False
[2021/12/16 10:25:49] root INFO:     epoch_num : 50
[2021/12/16 10:25:49] root INFO:     eval_batch_step : [0, 800]
[2021/12/16 10:25:49] root INFO:     infer_img : doc/imgs_words/ch/word_1.jpg
[2021/12/16 10:25:49] root INFO:     infer_mode : False
[2021/12/16 10:25:49] root INFO:     log_smooth_window : 20
[2021/12/16 10:25:49] root INFO:     max_cell_num : 500
[2021/12/16 10:25:49] root INFO:     max_elem_length : 500
[2021/12/16 10:25:49] root INFO:     max_text_length : 100
[2021/12/16 10:25:49] root INFO:     pretrained_model : None
[2021/12/16 10:25:49] root INFO:     print_batch_step : 5
[2021/12/16 10:25:49] root INFO:     process_cut_num : 0
[2021/12/16 10:25:49] root INFO:     process_total_num : 0
[2021/12/16 10:25:49] root INFO:     save_epoch_step : 5
[2021/12/16 10:25:49] root INFO:     save_inference_dir : None
[2021/12/16 10:25:49] root INFO:     save_model_dir : ./output/table_mv3_pubtabnet/
[2021/12/16 10:25:49] root INFO:     use_gpu : True
[2021/12/16 10:25:49] root INFO:     use_visualdl : False
[2021/12/16 10:25:49] root INFO: Loss : 
[2021/12/16 10:25:49] root INFO:     loc_weight : 10000.0
[2021/12/16 10:25:49] root INFO:     name : TableAttentionLoss
[2021/12/16 10:25:49] root INFO:     structure_weight : 100.0
[2021/12/16 10:25:49] root INFO: Metric : 
[2021/12/16 10:25:49] root INFO:     main_indicator : acc
[2021/12/16 10:25:49] root INFO:     name : TableMetric
[2021/12/16 10:25:49] root INFO: Optimizer : 
[2021/12/16 10:25:49] root INFO:     beta1 : 0.9
[2021/12/16 10:25:49] root INFO:     beta2 : 0.999
[2021/12/16 10:25:49] root INFO:     clip_norm : 5.0
[2021/12/16 10:25:49] root INFO:     lr : 
[2021/12/16 10:25:49] root INFO:         learning_rate : 0.001
[2021/12/16 10:25:49] root INFO:     name : Adam
[2021/12/16 10:25:49] root INFO:     regularizer : 
[2021/12/16 10:25:49] root INFO:         factor : 0.0
[2021/12/16 10:25:49] root INFO:         name : L2
[2021/12/16 10:25:49] root INFO: PostProcess : 
[2021/12/16 10:25:49] root INFO:     name : TableLabelDecode
[2021/12/16 10:25:49] root INFO: Train : 
[2021/12/16 10:25:49] root INFO:     dataset : 
[2021/12/16 10:25:49] root INFO:         data_dir : dataset/PubTabNet/images/train/
[2021/12/16 10:25:49] root INFO:         label_file_path : dataset/PubTabNet/annotations/PubTabNet_2.0.0_train.jsonl
[2021/12/16 10:25:49] root INFO:         name : PubTabDataSet
[2021/12/16 10:25:49] root INFO:         transforms : 
[2021/12/16 10:25:49] root INFO:             DecodeImage : 
[2021/12/16 10:25:49] root INFO:                 channel_first : False
[2021/12/16 10:25:49] root INFO:                 img_mode : BGR
[2021/12/16 10:25:49] root INFO:             ResizeTableImage : 
[2021/12/16 10:25:49] root INFO:                 max_len : 800
[2021/12/16 10:25:49] root INFO:             TableLabelEncode : None
[2021/12/16 10:25:49] root INFO:             NormalizeImage : 
[2021/12/16 10:25:49] root INFO:                 mean : [0.485, 0.456, 0.406]
[2021/12/16 10:25:49] root INFO:                 order : hwc
[2021/12/16 10:25:49] root INFO:                 scale : 1./255.
[2021/12/16 10:25:49] root INFO:                 std : [0.229, 0.224, 0.225]
[2021/12/16 10:25:49] root INFO:             PaddingTableImage : None
[2021/12/16 10:25:49] root INFO:             ToCHWImage : None
[2021/12/16 10:25:49] root INFO:             KeepKeys : 
[2021/12/16 10:25:49] root INFO:                 keep_keys : ['image', 'structure', 'bbox_list', 'sp_tokens', 'bbox_list_mask']
[2021/12/16 10:25:49] root INFO:     loader : 
[2021/12/16 10:25:49] root INFO:         batch_size_per_card : 4
[2021/12/16 10:25:49] root INFO:         drop_last : True
[2021/12/16 10:25:49] root INFO:         num_workers : 0
[2021/12/16 10:25:49] root INFO:         shuffle : True

Error log:

[2021/12/15 11:25:48] root INFO: train with paddle 2.1.2 and device CUDAPlace(0)
[2021/12/15 11:25:48] root INFO: Initialize indexs of datasets:dataset/PubTabNet/annotations/PubTabNet_2.0.0_test.jsonl
[2021/12/15 11:25:48] root INFO: Initialize indexs of datasets:dataset/PubTabNet/annotations/PubTabNet_2.0.0_test.jsonl
W1215 11:25:48.538619 168238 device_context.cc:404] Please NOTE: device: 0, GPU Compute Capability: 7.0, Driver API Version: 11.2, Runtime API Version: 10.1
W1215 11:25:48.543943 168238 device_context.cc:422] device: 0, cuDNN Version: 7.6.
[2021/12/15 11:26:04] root INFO: train dataloader has 6 iters
[2021/12/15 11:26:04] root INFO: valid dataloader has 13 iters
[2021/12/15 11:26:04] root INFO: During the training process, after the 0th iteration, an evaluation is run every 5 iterations
[2021/12/15 11:26:04] root INFO: Initialize indexs of datasets:dataset/PubTabNet/annotations/PubTabNet_2.0.0_test.jsonl
[ERROR] 2021-12-15T04:46:31.567873Z, 168330, "Cannot create UVM block on server"
W1215 12:46:31.567991 168330 system_allocator.cc:205] cudaHostAlloc failed.
W1215 12:46:31.568040 168330 naive_best_fit_allocator.cc:519] cudaHostAlloc Cannot allocate 122880000 bytes in CUDAPinnedPlace


--------------------------------------
C++ Traceback (most recent call last):
--------------------------------------
0 std::thread::_Impl<std::_Bind_simple<ThreadPool::ThreadPool(unsigned long)::{lambda()#1} ()> >::_M_run()
1 std::__future_base::_State_baseV2::_M_do_set(std::function<std::unique_ptr<std::__future_base::_Result_base, std::__future_base::_Result_base::_Deleter> ()>*, bool*)
2 paddle::framework::SignalHandle(char const*, int)
3 paddle::platform::GetCurrentTraceBackString[abi:cxx11]()

----------------------
Error Message Summary:
----------------------
FatalError: `Segmentation fault` is detected by the operating system.
[TimeInfo: *** Aborted at 1639543591 (unix time) try "date -d @1639543591" if you are using GNU date ***]
[SignalInfo: *** SIGSEGV (@0x0) received by PID 168238 (TID 0x7fc05d8c5700) from PID 0 ***]
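
As a side check on the log above, the size of the failed pinned-memory request can be related to the configured image size. The sketch below is only arithmetic: it assumes the batch is made of 3-channel images padded to max_len x max_len (800 x 800), which the thread does not confirm, and `images_in_request` is a hypothetical helper name. Under that assumption, 122880000 bytes is exactly 16 such images in float32 or 8 in float64; which of the two (if either) the data loader actually requested is not established here.

```python
# Relate the failed cudaHostAlloc request to the configured tensor shape.
FAILED_BYTES = 122_880_000  # from the cudaHostAlloc warning in the error log
MAX_LEN = 800               # ResizeTableImage max_len from the printed config
CHANNELS = 3                # assumption: 3-channel image (img_mode BGR)


def images_in_request(total_bytes, bytes_per_element):
    """Return how many CHANNELS x MAX_LEN x MAX_LEN images fill the request
    exactly, or None if the request is not a whole number of images."""
    per_image = CHANNELS * MAX_LEN * MAX_LEN * bytes_per_element
    count, remainder = divmod(total_bytes, per_image)
    return count if remainder == 0 else None


if __name__ == "__main__":
    print("float32 images:", images_in_request(FAILED_BYTES, 4))
    print("float64 images:", images_in_request(FAILED_BYTES, 8))
```

The request is therefore roughly 117 MiB, which is small by itself; that it fails is consistent with the allocator running out of address space or pinned memory over time rather than with a single oversized batch.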

UncleLLD commented Dec 22, 2021

I ran into a similar problem.
Reinstalling the Paddle environment with the command below fixed it.

conda install paddlepaddle-gpu==2.2.1 cudatoolkit=10.1 --channel https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/Paddle/
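
After a reinstall like this, it can help to confirm which versions the Python environment actually sees before retrying training. The following is a minimal sketch using only the standard library (Python 3.8+); `installed_version` is a hypothetical helper, and it reports None when a package has no Python distribution metadata in the active environment.

```python
from importlib import metadata


def installed_version(package_name):
    """Return the installed version string of a distribution, or None if absent."""
    try:
        return metadata.version(package_name)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    # Package names taken from the versions listed in the issue.
    for name in ("paddlepaddle-gpu", "paddleocr"):
        print(name, installed_version(name))
```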

@TTMRonald (Author)

> I ran into a similar problem. Reinstalling the Paddle environment with the command below fixed it.
>
> conda install paddlepaddle-gpu==2.2.1 cudatoolkit=10.1 --channel https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/Paddle/

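I tried upgrading paddlepaddle-gpu to the version you suggested, but the same problem persists. In addition, watching memory with top shows that RES stays at about 3.2G while VIRT keeps growing; once it passes 100T, the error Cannot create UVM block on server appears. I suspect there may be a memory leak in the training part of this table recognition model.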
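
The RES/VIRT observation from top can also be logged from inside the process, which makes it easier to see at which step the virtual size starts to climb. Below is a minimal, Linux-only sketch that reads VmSize (VIRT in top) and VmRSS (RES in top) from /proc; `parse_vm_kb` and `read_vm_kb` are hypothetical helpers and are not part of PaddleOCR or Paddle.

```python
import os


def parse_vm_kb(status_text):
    """Extract VmSize and VmRSS (in kB) from the text of a /proc/<pid>/status file."""
    fields = {}
    for line in status_text.splitlines():
        key, _, value = line.partition(":")
        if key in ("VmSize", "VmRSS"):
            fields[key] = int(value.split()[0])  # value looks like "  123456 kB"
    return fields


def read_vm_kb(pid=None):
    """Read VmSize/VmRSS for a process (Linux only); defaults to the current process."""
    pid = os.getpid() if pid is None else pid
    with open(f"/proc/{pid}/status") as f:
        return parse_vm_kb(f.read())


if __name__ == "__main__":
    if os.path.exists("/proc/self/status"):
        print(read_vm_kb())
```

Calling `read_vm_kb()` every few iterations of the training loop and printing the result would show whether VmSize grows steadily per batch while VmRSS stays flat, as described above.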
