Training hangs during the validation stage #7

Open
magiczixiao opened this issue Aug 3, 2021 · 3 comments

Comments

@magiczixiao

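Hello!
I built my own training and validation sets by mixing data from the CASIA dataset. Training runs, but the validation stage after each epoch hangs intermittently.
The function called during validation is trainer\trainer.py: validation(self, epoch). Do you have a solution for this?
Thank you!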

@JusperLee
Owner

Hello, do you have any error message you could share for reference?

@magiczixiao
Author

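Hello, thank you for your reply! There is no error message. The symptom is that CPU utilization drops straight to 0, and killing the process does not return any exception information either.
I ran some experiments and found that lowering num_worker to below 16 reduces the probability of the problem occurring. The environment is an Intel(R) Xeon(R) Gold 5218R CPU. Could it be a scheduling problem during data loading?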
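Since the hang produces no error output, one way to see where the process is stuck is Python's standard `faulthandler` module, which can periodically dump the stack of every thread. The following is a minimal sketch, not part of the repository: the helper name `hang_watchdog` and the timeout value are assumptions chosen for illustration.

```python
import faulthandler
import sys
from contextlib import contextmanager


@contextmanager
def hang_watchdog(timeout_s: float = 600.0, file=sys.stderr):
    """Dump all thread stacks every `timeout_s` seconds while the block runs.

    `hang_watchdog` is a hypothetical helper name. `file` must be a real
    file object with a file descriptor (for example sys.stderr or an
    opened log file).
    """
    # Schedule a repeating traceback dump; it fires only if the block
    # is still running after `timeout_s` seconds.
    faulthandler.dump_traceback_later(timeout_s, repeat=True, file=file)
    try:
        yield
    finally:
        # Cancel the pending dump once the block finishes normally.
        faulthandler.cancel_dump_traceback_later()
```

Wrapping the validation call, for example `with hang_watchdog(600): self.validation(epoch)`, would then print tracebacks for the main process if validation stalls longer than the timeout. This only covers the process that installs it; DataLoader worker processes would need their own hook.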

@JusperLee
Owner

This may be because the configured num_workers exceeds the number of CPU threads, which causes the worker processes to block.
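If that is the cause, a simple guard is to clamp the requested worker count to the number of logical CPUs the process can actually use. The sketch below uses only the standard library; the helper names `available_cpus` and `cap_num_workers` and the `reserve` parameter are assumptions for illustration, not part of the repository or of PyTorch.

```python
import os


def available_cpus() -> int:
    """Return the number of logical CPUs this process may run on."""
    if hasattr(os, "sched_getaffinity"):
        # Linux: honours the CPU affinity mask (e.g. set via taskset).
        return len(os.sched_getaffinity(0))
    # Fallback for platforms without sched_getaffinity.
    return os.cpu_count() or 1


def cap_num_workers(requested: int, reserve: int = 1) -> int:
    """Clamp `requested` to the available CPUs, keeping `reserve` free.

    `reserve` leaves room for the main training process (an assumption;
    tune it for your setup).
    """
    limit = max(available_cpus() - reserve, 0)
    return max(0, min(requested, limit))
```

The result can be passed as the `num_workers` argument of `torch.utils.data.DataLoader`, for example `DataLoader(dataset, num_workers=cap_num_workers(32))`. A value of 0 makes the DataLoader load data in the main process, which is also a quick way to check whether the hang is related to worker processes at all.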
