
Text detection training exits immediately after printing Initialize indexs of datasets:['./train_data/train_Label.txt'] #2184

Closed
Xu-xunshan opened this issue Mar 6, 2021 · 9 comments

@Xu-xunshan

[2021/03/06 20:22:25] root INFO: shuffle : True
[2021/03/06 20:22:25] root INFO: use_shared_memory : False
[2021/03/06 20:22:25] root INFO: train with paddle 2.0.0 and device CUDAPlace(0)
[2021/03/06 20:22:25] root INFO: Initialize indexs of datasets:['./train_data/train_Label.txt']
[2021/03/06 20:22:25] root INFO: Initialize indexs of datasets:['./train_data/test_Label.txt']
W0306 20:22:25.196138 4376 device_context.cc:362] Please NOTE: device: 0, GPU Compute Capability: 6.1, Driver API Version: 11.2, Runtime API Version: 11.0
W0306 20:22:25.205113 4376 device_context.cc:372] device: 0, cuDNN Version: 8.0.
[2021/03/06 20:22:27] root INFO: load pretrained model from ['./pretrain_models/MobileNetV3_large_x0_5_pretrained']
[2021/03/06 20:22:27] root INFO: train dataloader has 60 iters, valid dataloader has 100 iters
[2021/03/06 20:22:27] root INFO: During the training process, after the 0th iteration, an evaluation is run every 2000 iterations
[2021/03/06 20:22:27] root INFO: Initialize indexs of datasets:['./train_data/train_Label.txt']
The program stalls briefly at this point and then exits.
I checked the GPU memory utilization and it looks normal, but CUDA utilization shows only a momentary spike.
[screenshot]
I'm a beginner here, any help appreciated.

@littletomatodonkey
Collaborator

You can set num_workers to 0 and reduce the batch size, then check whether the problem still occurs.
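For reference, both knobs live under Train.loader in the det YAML configs; this is a sketch assuming the layout of PaddleOCR 2.0-era det configs (key names such as batch_size_per_card may differ in other versions):

```yaml
Train:
  loader:
    shuffle: True
    drop_last: False
    batch_size_per_card: 4   # reduced from the default while debugging
    num_workers: 0           # 0 = load data in the main process
```

The same keys exist under Eval.loader.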

@Xu-xunshan
Author

Training started successfully after I reduced the batch size, thanks.

@Xu-xunshan
Author

I set the batch size to 1 and training ran successfully, but it was interrupted again not long after. What could be causing this?

@Xu-xunshan
Author

Follow-up: after several tests, with batch size 2 training reaches iter 290 of the first epoch before stopping; with batch size 1 it reaches iter 590.

@littletomatodonkey
Collaborator

The eval batch size and num_workers also need to be adjusted. Also, since the model contains batch norm, we don't recommend a training batch size smaller than 16; otherwise accuracy may suffer. num_workers can be set to 0.
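The batch norm concern can be illustrated numerically: BN normalizes each channel with the current batch's mean, and that estimate gets noisier as the batch shrinks. A standalone sketch (not PaddleOCR code):

```python
import numpy as np

# Simulate one activation channel: 10,000 samples from N(0, 1).
rng = np.random.default_rng(0)
data = rng.normal(loc=0.0, scale=1.0, size=10000)

# For each batch size, measure how noisy the per-batch mean is.
# Batch norm normalizes with exactly this statistic, so a noisy
# estimate at batch size 1-2 perturbs training.
for bs in (2, 16, 64):
    n_batches = len(data) // bs
    batch_means = data[: n_batches * bs].reshape(n_batches, bs).mean(axis=1)
    print(f"batch size {bs:2d}: std of batch means = {batch_means.std():.3f}")
```

The noise falls roughly as 1/sqrt(batch size), which is why very small batches can hurt models that rely on batch statistics.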

@Xu-xunshan
Author

With batch size 16 I'm back to the original problem: the program exits immediately. A 1060 shouldn't struggle with this. Also, don't the comments say the eval batch size must be 1?

@JescalLin

I ran into the same problem: training suddenly stops.

[2021/03/31 15:35:12] root INFO: train with paddle 2.0.1 and device CPUPlace
[2021/03/31 15:35:12] root INFO: Initialize indexs of datasets:['./train_data/train/Label.txt']
[2021/03/31 15:35:12] root INFO: Initialize indexs of datasets:['./train_data/test/Label.txt']
[2021/03/31 15:35:12] root INFO: load pretrained model from ['./pretrain_models/MobileNetV3_large_x0_5_pretrained']
[2021/03/31 15:35:12] root INFO: train dataloader has 11 iters, valid dataloader has 3 iters
[2021/03/31 15:35:12] root INFO: During the training process, after the 3000th iteration, an evaluation is run every 2000 iterations
[2021/03/31 15:35:12] root INFO: Initialize indexs of datasets:['./train_data/train/Label.txt']
[2021/03/31 15:35:19] root INFO: epoch: [1/1200], iter: 2, lr: 0.000045, loss: 9.355393, loss_shrink_maps: 4.752926, loss_threshold_maps: 3.717568, loss_binary_maps: 0.955332, reader_cost: 0.04814 s, batch_cost: 3.45816 s, samples: 3, ips: 0.43376
[2021/03/31 15:35:23] root INFO: epoch: [1/1200], iter: 4, lr: 0.000091, loss: 9.207070, loss_shrink_maps: 4.692245, loss_threshold_maps: 3.555828, loss_binary_maps: 0.945580, reader_cost: 0.00000 s, batch_cost: 2.14012 s, samples: 2, ips: 0.46726
[2021/03/31 15:35:27] root INFO: epoch: [1/1200], iter: 6, lr: 0.000136, loss: 9.149884, loss_shrink_maps: 4.658204, loss_threshold_maps: 3.498812, loss_binary_maps: 0.935853, reader_cost: 0.00000 s, batch_cost: 2.14013 s, samples: 2, ips: 0.46726
[2021/03/31 15:35:32] root INFO: epoch: [1/1200], iter: 8, lr: 0.000182, loss: 8.971068, loss_shrink_maps: 4.629452, loss_threshold_maps: 3.473349, loss_binary_maps: 0.926134, reader_cost: 0.00000 s, batch_cost: 2.11335 s, samples: 2, ips: 0.47318
[2021/03/31 15:35:36] root INFO: epoch: [1/1200], iter: 10, lr: 0.000227, loss: 8.894274, loss_shrink_maps: 4.586464, loss_threshold_maps: 3.338689, loss_binary_maps: 0.918020, reader_cost: 0.00000 s, batch_cost: 2.10887 s, samples: 2, ips: 0.47419

Then it just stops, back at the prompt:
(ocr-gpu) D:\PaddleOCR>

@JescalLin

@Xu-xunshan
My problem is solved. In tools/program.py:

if idx >= len(train_dataloader):
    break

# change the check above to the following
if idx >= len(train_dataloader) - 1:
    break
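The effect of that one-character change can be sketched in plain Python (a hypothetical stand-in for the epoch loop in tools/program.py, not the real PaddleOCR code): with the `- 1`, the loop breaks before consuming the final batch, which in the reports above was the point where the data reader hung or crashed.

```python
# Hypothetical stand-in for the epoch loop in tools/program.py:
# the workaround breaks one iteration early, so the final batch
# (the one that hung or crashed the data reader) is never consumed.
def run_epoch(batches, stop_early):
    seen = []
    for idx, batch in enumerate(batches):
        limit = len(batches) - 1 if stop_early else len(batches)
        if idx >= limit:
            break
        seen.append(batch)
    return seen

batches = ["b0", "b1", "b2", "b3"]
print(run_epoch(batches, stop_early=False))  # consumes all 4 batches
print(run_epoch(batches, stop_early=True))   # skips the last batch
```

Note that this trades one batch of training data per epoch for stability; it is a workaround, not a root-cause fix.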

@littletomatodonkey
Collaborator

Got it. I believe other users have run into this before as well; we'll make a note of it. Thanks for the feedback!

an1018 pushed a commit to an1018/PaddleOCR that referenced this issue Aug 17, 2022