fix uie dataloader memory overflow #3381
model_zoo/uie/utils.py
@@ -229,8 +229,7 @@ def reader(data_path, max_seq_len=512):
     cur_result_list = []

     for result in result_list:
-        if result['start'] + 1 <= max_content_len < result[
-                'end']:
+        if result['start'] + 1 <= max_content_len < result['end'] and result['end'] - result['start'] <= max_content_len :
When the length of an extraction target, result['end'] - result['start'], exceeds the maximum length limit, it would probably be better to emit a warning.
The latest commit adds the corresponding warning.
LGTM
PR types
Bug fixes
PR changes
Models
Description
I analyzed the cause of the bug:
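When an input is too long, the reader splits the instance into several instances during data loading. Normally max_content_len does not change; its job is to cut content into two parts, cur_content = content[:max_content_len] and res_content = content[max_content_len:].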
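However, when the start index of a target result is less than max_content_len and its end index is greater than max_content_len, the result would be spread across both cur_content and res_content. To keep the whole result in one place there is a special case: the code enters the if result['start'] + 1 <= max_content_len < result['end'] branch and sets max_content_len = result['start'], so that the target lands at the beginning of the next instance.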
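The boundary shift described above can be illustrated with a minimal sketch. This is not the code in model_zoo/uie/utils.py; choose_split_point is a hypothetical helper, and the content string and span offsets are made-up values chosen only to show the behaviour.

```python
def choose_split_point(result_list, max_content_len):
    """Return the index at which to cut the content (hypothetical helper)."""
    for result in result_list:
        # The span starts inside the current chunk but ends beyond it,
        # so cut just before the span instead of through it.
        if result["start"] + 1 <= max_content_len < result["end"]:
            return result["start"]
    # No span straddles the boundary: cut at the normal position.
    return max_content_len


content = "abcdefghijklmnopqrst"          # 20 characters, illustrative only
results = [{"start": 8, "end": 12}]      # the span covers "ijkl"

cut = choose_split_point(results, max_content_len=10)
cur_content, res_content = content[:cut], content[cut:]
```

With these values the cut moves from 10 to 8, so the span "ijkl" sits entirely at the start of res_content rather than being split in two.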
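The problem is that if the target result is itself longer than max_content_len, then even after the previous step the updated start and end indices still fall in two different parts, so the code enters the if result['start'] + 1 <= max_content_len < result['end'] branch again. This repeats forever, produces an unbounded number of empty instances, and eventually exhausts memory.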
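The conclusion is that we have to allow for a target result longer than max_content_len. In that case no single instance can contain the whole result, so the condition for resetting max_content_len needs an extra restriction: if result['start'] + 1 <= max_content_len < result['end'] and result['end'] - result['start'] <= max_content_len.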
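To make the termination argument concrete, here is a simplified simulation of the splitting loop with and without the extra restriction. It is only a sketch under assumptions: count_chunks, its parameters, and the iteration cap are hypothetical and do not appear in the PR, the loop tracks a single span, and it models only the boundary arithmetic, not the real reader.

```python
def count_chunks(content_len, span, max_content_len, guarded, max_iters=50):
    """Count chunks produced by a simplified splitting loop (illustrative only).

    Returns None if the loop has not finished after max_iters iterations,
    which stands in for the endless loop described in the PR.
    """
    start, end = span
    remaining = content_len
    chunks = 0
    for _ in range(max_iters):
        cut = max_content_len
        straddles = start + 1 <= cut < end
        fits = end - start <= max_content_len
        # Original condition: straddles. Fixed condition: straddles and fits.
        if straddles and (fits or not guarded):
            cut = start
        cut = min(cut, remaining)
        chunks += 1
        # Shift the span offsets so they are relative to the remaining content.
        remaining -= cut
        start -= cut
        end -= cut
        if remaining == 0:
            return chunks
    return None


# A 25-character span cannot fit into a 10-character chunk.
unguarded = count_chunks(40, (5, 30), 10, guarded=False)
guarded = count_chunks(40, (5, 30), 10, guarded=True)
```

In the unguarded run the span's start becomes 0 after the first cut, so every later cut has length 0 and the remaining content never shrinks. With the restriction, the oversized span is simply cut across chunks and the loop finishes.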
After testing on the bad case with this change, the memory overflow no longer occurs.