In the `_get_words_start_end_pos` function, `start_pos` and `end_pos` are incremented by accumulating `len(w)`. However, when `w` is English text, BERT tokenizes it with the WordPiece algorithm, not character by character. As a result, the length of `w` in `_get_words_start_end_pos` (e.g. `len('15*9mm') = 6`) does not match the length of `w` in `convert_examples_to_features` (e.g. `len(['15', '*', '9', '##mm']) = 4`). Therefore the `start_pos` and `end_pos` computed in the data-processing stage are wrong. It is also possible that my analysis is mistaken; please take a look. @LiangsLi
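To make the mismatch concrete, here is a minimal, self-contained sketch (not the repository's actual code). The WordPiece split of `'15*9mm'` is taken from the example in the issue and hard-coded as a stand-in for a real BERT tokenizer:

```python
def char_length(word):
    # What _get_words_start_end_pos effectively assumes:
    # one token position per raw character of the word.
    return len(word)

def wordpiece_length(pieces):
    # What convert_examples_to_features actually produces:
    # one token position per WordPiece sub-word.
    return len(pieces)

word = "15*9mm"
pieces = ["15", "*", "9", "##mm"]  # assumed WordPiece output, per the issue

print(char_length(word))         # 6
print(wordpiece_length(pieces))  # 4
```

Because the two counts differ (6 vs. 4 here), every `start_pos`/`end_pos` computed after such a word is shifted relative to the real token sequence.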
Hi @erichuazhou, I checked, and this is indeed a bug. The current input implementation is overly complicated and ugly. Later I will pass the tokenizer in, to ensure correct word boundaries and maximum sentence length. However, I'm currently busy with my graduation project, and this part requires many changes, so I'm not sure when the fix will be done. Thanks for helping review the code!
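The fix the maintainer describes, deriving word boundaries from the tokenizer's output instead of from `len(w)`, could look roughly like the following. This is a hypothetical sketch, not the project's eventual implementation; `tokenize` here is a toy callable standing in for the BERT tokenizer that would be passed in:

```python
def get_word_spans(words, tokenize):
    """Return (start, end) token indices per word, end-exclusive.

    Positions are advanced by the number of sub-word tokens each word
    produces, so the spans stay aligned with the tokenized sequence.
    """
    spans = []
    pos = 0
    for w in words:
        n = len(tokenize(w))  # count sub-word tokens, not characters
        spans.append((pos, pos + n))
        pos += n
    return spans

# Toy stand-in for a WordPiece tokenizer (assumed splits for illustration).
toy_pieces = {"size": ["size"], "15*9mm": ["15", "*", "9", "##mm"]}
spans = get_word_spans(["size", "15*9mm"], lambda w: toy_pieces[w])
print(spans)  # [(0, 1), (1, 5)]
```

Counting sub-word tokens also makes the maximum-sequence-length check honest, since truncation happens in token space rather than character space.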