
Text detection training exits immediately after printing Initialize indexs of datasets:['./train_data/train_Label.txt'] #2184

Closed
Xu-xunshan opened this issue Mar 6, 2021 · 9 comments

@Xu-xunshan

[2021/03/06 20:22:25] root INFO: shuffle : True
[2021/03/06 20:22:25] root INFO: use_shared_memory : False
[2021/03/06 20:22:25] root INFO: train with paddle 2.0.0 and device CUDAPlace(0)
[2021/03/06 20:22:25] root INFO: Initialize indexs of datasets:['./train_data/train_Label.txt']
[2021/03/06 20:22:25] root INFO: Initialize indexs of datasets:['./train_data/test_Label.txt']
W0306 20:22:25.196138 4376 device_context.cc:362] Please NOTE: device: 0, GPU Compute Capability: 6.1, Driver API Version: 11.2, Runtime API Version: 11.0
W0306 20:22:25.205113 4376 device_context.cc:372] device: 0, cuDNN Version: 8.0.
[2021/03/06 20:22:27] root INFO: load pretrained model from ['./pretrain_models/MobileNetV3_large_x0_5_pretrained']
[2021/03/06 20:22:27] root INFO: train dataloader has 60 iters, valid dataloader has 100 iters
[2021/03/06 20:22:27] root INFO: During the training process, after the 0th iteration, an evaluation is run every 2000 iterations
[2021/03/06 20:22:27] root INFO: Initialize indexs of datasets:['./train_data/train_Label.txt']
The program stalls briefly at this point and then exits.
I checked the GPU memory utilization and it looks normal, but CUDA utilization shows only a momentary spike.
[screenshot]
I'm a beginner here, any help appreciated.

@littletomatodonkey
Collaborator

You can set num_workers to 0 and reduce the batch size, then check whether the problem still occurs.
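For reference, both knobs live under Train.loader in the det YAML configs; this is a sketch assuming the layout of PaddleOCR 2.0-era det configs (key names such as batch_size_per_card may differ in other versions):

```yaml
Train:
  loader:
    shuffle: True
    drop_last: False
    batch_size_per_card: 4   # reduced from the default while debugging
    num_workers: 0           # 0 = load data in the main process
```

The same keys exist under Eval.loader.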

@Xu-xunshan
Author

Training started successfully after I reduced the batch size, thanks.

@Xu-xunshan
Author

I set the batch size to 1 and training ran successfully, but it was interrupted again not long after. What could be causing this?

@Xu-xunshan
Author

Follow-up: after several tests, with batch size 2 training reaches iter 290 of the first epoch before stopping; with batch size 1 it reaches iter 590.

@littletomatodonkey
Collaborator

The eval batch size and num_workers also need to be adjusted. Also, since the model contains batch norm, we don't recommend a training batch size smaller than 16; otherwise accuracy may suffer. num_workers can be set to 0.
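The batch norm concern can be illustrated numerically: BN normalizes each channel with the current batch's mean, and that estimate gets noisier as the batch shrinks. A standalone sketch (not PaddleOCR code):

```python
import numpy as np

# Simulate one activation channel: 10,000 samples from N(0, 1).
rng = np.random.default_rng(0)
data = rng.normal(loc=0.0, scale=1.0, size=10000)

# For each batch size, measure how noisy the per-batch mean is.
# Batch norm normalizes with exactly this statistic, so a noisy
# estimate at batch size 1-2 perturbs training.
for bs in (2, 16, 64):
    n_batches = len(data) // bs
    batch_means = data[: n_batches * bs].reshape(n_batches, bs).mean(axis=1)
    print(f"batch size {bs:2d}: std of batch means = {batch_means.std():.3f}")
```

The noise falls roughly as 1/sqrt(batch size), which is why very small batches can hurt models that rely on batch statistics.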

@Xu-xunshan
Author

With batch size 16 I'm back to the original problem: the program exits immediately. A 1060 shouldn't struggle with this. Also, don't the comments say the eval batch size must be 1?

@JescalLin

I ran into the same problem: training suddenly stops.

[2021/03/31 15:35:12] root INFO: train with paddle 2.0.1 and device CPUPlace
[2021/03/31 15:35:12] root INFO: Initialize indexs of datasets:['./train_data/train/Label.txt']
[2021/03/31 15:35:12] root INFO: Initialize indexs of datasets:['./train_data/test/Label.txt']
[2021/03/31 15:35:12] root INFO: load pretrained model from ['./pretrain_models/MobileNetV3_large_x0_5_pretrained']
[2021/03/31 15:35:12] root INFO: train dataloader has 11 iters, valid dataloader has 3 iters
[2021/03/31 15:35:12] root INFO: During the training process, after the 3000th iteration, an evaluation is run every 2000 iterations
[2021/03/31 15:35:12] root INFO: Initialize indexs of datasets:['./train_data/train/Label.txt']
[2021/03/31 15:35:19] root INFO: epoch: [1/1200], iter: 2, lr: 0.000045, loss: 9.355393, loss_shrink_maps: 4.752926, loss_threshold_maps: 3.717568, loss_binary_maps: 0.955332, reader_cost: 0.04814 s, batch_cost: 3.45816 s, samples: 3, ips: 0.43376
[2021/03/31 15:35:23] root INFO: epoch: [1/1200], iter: 4, lr: 0.000091, loss: 9.207070, loss_shrink_maps: 4.692245, loss_threshold_maps: 3.555828, loss_binary_maps: 0.945580, reader_cost: 0.00000 s, batch_cost: 2.14012 s, samples: 2, ips: 0.46726
[2021/03/31 15:35:27] root INFO: epoch: [1/1200], iter: 6, lr: 0.000136, loss: 9.149884, loss_shrink_maps: 4.658204, loss_threshold_maps: 3.498812, loss_binary_maps: 0.935853, reader_cost: 0.00000 s, batch_cost: 2.14013 s, samples: 2, ips: 0.46726
[2021/03/31 15:35:32] root INFO: epoch: [1/1200], iter: 8, lr: 0.000182, loss: 8.971068, loss_shrink_maps: 4.629452, loss_threshold_maps: 3.473349, loss_binary_maps: 0.926134, reader_cost: 0.00000 s, batch_cost: 2.11335 s, samples: 2, ips: 0.47318
[2021/03/31 15:35:36] root INFO: epoch: [1/1200], iter: 10, lr: 0.000227, loss: 8.894274, loss_shrink_maps: 4.586464, loss_threshold_maps: 3.338689, loss_binary_maps: 0.918020, reader_cost: 0.00000 s, batch_cost: 2.10887 s, samples: 2, ips: 0.47419

Then it just stops, back at the prompt:
(ocr-gpu) D:\PaddleOCR>

@JescalLin

@Xu-xunshan
My problem is solved. In tools/program.py:

if idx >= len(train_dataloader):
    break

# change the check above to the following
if idx >= len(train_dataloader) - 1:
    break
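The effect of that one-character change can be sketched in plain Python (a hypothetical stand-in for the epoch loop in tools/program.py, not the real PaddleOCR code): with the `- 1`, the loop breaks before consuming the final batch, which in the reports above was the point where the data reader hung or crashed.

```python
# Hypothetical stand-in for the epoch loop in tools/program.py:
# the workaround breaks one iteration early, so the final batch
# (the one that hung or crashed the data reader) is never consumed.
def run_epoch(batches, stop_early):
    seen = []
    for idx, batch in enumerate(batches):
        limit = len(batches) - 1 if stop_early else len(batches)
        if idx >= limit:
            break
        seen.append(batch)
    return seen

batches = ["b0", "b1", "b2", "b3"]
print(run_epoch(batches, stop_early=False))  # consumes all 4 batches
print(run_epoch(batches, stop_early=True))   # skips the last batch
```

Note that this trades one batch of training data per epoch for stability; it is a workaround, not a root-cause fix.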

@littletomatodonkey
Collaborator

Got it. I believe other users have run into this before as well; we'll make a note of it. Thanks for the feedback!

an1018 pushed a commit to an1018/PaddleOCR that referenced this issue Aug 17, 2022