
When processing mixed Chinese and English text, is the handling of each word's start_pos and end_pos wrong? #13

Open
erichuazhou opened this issue May 15, 2020 · 1 comment
Labels
bug

Comments

@erichuazhou

In the _get_words_start_end_pos function, start_pos and end_pos are incremented by accumulating len(w). However, when w is English, BERT tokenizes it with the WordPiece algorithm rather than character by character. This causes the length of w in _get_words_start_end_pos (e.g. len('15*9mm') == 6) to disagree with the length of w in convert_examples_to_features (e.g. len(['15', '*', '9', '##mm']) == 4).
As a result, the start_pos and end_pos produced in the data-processing stage are already wrong.
It's also possible that my analysis is mistaken. Please take a look.
@LiangsLi
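
A minimal sketch of the mismatch, assuming the Hugging Face `transformers` BertTokenizer (the repo may load its tokenizer differently; the exact WordPieces depend on the vocabulary, so the tokens shown are illustrative):

```python
from transformers import BertTokenizer

# Assumption: the checkpoint name is illustrative; any BERT vocab shows the effect.
tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")

w = "15*9mm"
pieces = tokenizer.tokenize(w)

print(len(w))       # 6 -- character count, as accumulated in _get_words_start_end_pos
print(pieces)       # e.g. ['15', '*', '9', '##mm']
print(len(pieces))  # e.g. 4 -- WordPiece count, as seen in convert_examples_to_features
```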

@LiangsLi
Contributor

Hi @erichuazhou, I've checked, and this is indeed a bug. The current input implementation is overly complex and ugly; later I will pass the tokenizer in so that the correct word boundaries and maximum sentence length are obtained. However, I'm busy with my graduation project at the moment, and this part requires changes in many places, so I'm not sure when it will be done.
Thanks for helping review the code!
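
A sketch of the direction described above (a hypothetical helper, not the repo's actual implementation): pass the tokenizer into the offset computation and measure word spans in WordPiece tokens, so both stages count the same units.

```python
from transformers import BertTokenizer

def get_words_start_end_pos(words, tokenizer):
    # Hypothetical replacement for _get_words_start_end_pos: spans are measured
    # in WordPiece tokens, not characters, so they line up with the token
    # sequence built in convert_examples_to_features.
    spans, pos = [], 0
    for w in words:
        n = len(tokenizer.tokenize(w))  # token count for this word
        spans.append((pos, pos + n - 1))
        pos += n
    return spans

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
print(get_words_start_end_pos(["宽", "15*9mm"], tokenizer))
# e.g. [(0, 0), (1, 4)] -- '15*9mm' occupies four WordPiece positions
```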

LiangsLi added the bug label May 15, 2020