fix uie dataloader memory overflow #3381

Merged
merged 5 commits into from
Oct 13, 2022

westfish
Contributor

PR types

Bug fixes

PR changes

Models

Description

An analysis of the cause of the bug:
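When the input is too long, the reader splits the input example during data loading and returns multiple instances. Normally max_content_len does not change; its role is to split content into two parts: cur_content = content[:max_content_len] and res_content = content[max_content_len:].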
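However, when the target span result has a start index less than max_content_len and an end index greater than max_content_len, the result would be split across cur_content and res_content. To keep the whole result in one place, there is a special case: the code enters the if result['start'] + 1 <= max_content_len < result['end'] branch and sets max_content_len = result['start'], so that the target span lands at the beginning of the next instance.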
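The boundary adjustment described above can be sketched in isolation. This is a minimal illustration, not the reader's actual code: the helper name `adjust_split_point` and the sample data are hypothetical, and only the condition itself comes from the PR.

```python
def adjust_split_point(result_list, max_content_len):
    """Hypothetical helper: return the index at which to cut the content.

    If a result straddles max_content_len, the cut moves left to the
    result's start so the whole span falls into the next instance.
    """
    for result in result_list:
        if result['start'] + 1 <= max_content_len < result['end']:
            return result['start']
    return max_content_len


content = "abcdefghij"
results = [{'start': 3, 'end': 7}]      # span "defg" straddles index 5

cut = adjust_split_point(results, 5)    # moved from 5 down to 3
cur_content = content[:cut]             # "abc"
res_content = content[cut:]             # "defghij", span is now at the front
```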
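The problem is that if the result itself is longer than max_content_len, then even after the step above its updated start and end indices still fall on opposite sides of the split point, so the if result['start'] + 1 <= max_content_len < result['end'] branch is entered again. This repeats indefinitely, producing an unbounded number of empty instances until memory overflows.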
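A simplified model makes the stall visible. It assumes, beyond what the description states, that offsets are shifted by the length of each emitted chunk and that the limit is restored before the next pass; `split_unguarded` and the `max_iters` cap are hypothetical and exist only so the demonstration terminates.

```python
def split_unguarded(content, result, limit, max_iters=10):
    """Simplified model of the pre-fix loop for a single result.

    max_iters is an artificial cap so the demonstration stops.
    """
    result = dict(result)               # do not mutate the caller's dict
    chunks = []
    for _ in range(max_iters):
        cut = limit                     # limit is restored on every pass
        if result['start'] + 1 <= cut < result['end']:
            cut = result['start']       # may become 0: no progress is made
        chunks.append(content[:cut])
        content = content[cut:]
        result['start'] -= cut          # shift offsets into the remainder
        result['end'] -= cut
        if len(content) <= limit:
            chunks.append(content)
            break
    return chunks


# A 10-character result cannot fit into a 5-character window.
chunks = split_unguarded("x" * 20, {'start': 2, 'end': 12}, limit=5)
empty_count = sum(1 for c in chunks if c == "")
```

After the first cut the result starts at offset 0, so every later pass cuts at 0 and emits an empty chunk; only the cap stops the loop.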
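The conclusion is that the reader must account for a result that is longer than max_content_len. In that case no instance can contain the whole result, so the condition for resetting max_content_len needs an additional constraint: if result['start'] + 1 <= max_content_len < result['end'] and result['end'] - result['start'] <= max_content_len.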
Tested against the bad case: with the added check in place, the memory overflow no longer occurs.
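The effect of the extra constraint can be checked on a simplified model of the splitting loop. The function `split_guarded` is hypothetical and handles a single result; it assumes offsets are shifted by each chunk's length and the limit is restored on every pass. Only the guarded condition is taken from the PR.

```python
def split_guarded(content, result, limit):
    """Simplified model of the loop with the guarded condition."""
    result = dict(result)               # do not mutate the caller's dict
    chunks = []
    while True:
        cut = limit
        fits = result['end'] - result['start'] <= cut
        if result['start'] + 1 <= cut < result['end'] and fits:
            cut = result['start']       # only move the cut if the span fits
        chunks.append(content[:cut])
        content = content[cut:]
        result['start'] -= cut
        result['end'] -= cut
        if len(content) <= limit:
            chunks.append(content)
            break
    return chunks


# Over-long result: it is split across chunks, but the loop terminates.
long_chunks = split_guarded("x" * 20, {'start': 2, 'end': 12}, limit=5)

# Short result: still kept whole at the start of the second chunk.
short_chunks = split_guarded("abcdefghijkl", {'start': 3, 'end': 7}, limit=5)
# short_chunks == ["abc", "defgh", "ijkl"]
```

The trade-off is that an over-long result is cut at the plain window boundary rather than kept intact, which is unavoidable when it cannot fit in any single instance.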

@@ -229,8 +229,7 @@ def reader(data_path, max_seq_len=512):
                     cur_result_list = []
 
                     for result in result_list:
-                        if result['start'] + 1 <= max_content_len < result[
-                                'end']:
+                        if result['start'] + 1 <= max_content_len < result['end'] and result['end'] - result['start'] <= max_content_len :
Contributor

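For cases where the extraction target's length, result['end'] - result['start'], exceeds the maximum length limit, it would probably be better to emit a warning message.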

Contributor Author

The latest commit adds the corresponding warning.
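The thread does not show the warning that was added, so the following is only a sketch of what such a check might look like. It uses Python's standard `logging` module rather than any project-specific logger, and the function name `warn_if_too_long`, the logger name, and the message text are all assumptions.

```python
import logging

# Assumption: a plain stdlib logger stands in for the project's own logger.
logger = logging.getLogger("uie_reader_sketch")


def warn_if_too_long(result, max_content_len):
    """Hypothetical check: warn when a result cannot fit in one instance."""
    length = result['end'] - result['start']
    if length > max_content_len:
        logger.warning(
            "result of length %d exceeds max_content_len %d; "
            "it cannot be contained in a single instance",
            length, max_content_len)
        return True
    return False


too_long = warn_if_too_long({'start': 0, 'end': 10}, 5)   # True, logs a warning
fits = warn_if_too_long({'start': 3, 'end': 7}, 5)        # False, silent
```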

Contributor

@linjieccc linjieccc left a comment

LGTM

@westfish westfish changed the title fix dataloader memory overflow fix uie dataloader memory overflow Oct 12, 2022
@westfish westfish merged commit c65dbb4 into PaddlePaddle:develop Oct 13, 2022
2 participants