
When processing mixed Chinese and English text, is the handling of each word's start_pos and end_pos wrong? #13

Open
erichuazhou opened this issue May 15, 2020 · 1 comment
Labels
bug

Comments

@erichuazhou

In the _get_words_start_end_pos function, start_pos and end_pos are incremented by accumulating len(w). However, when w is English, BERT tokenizes it with the WordPiece algorithm rather than character by character. This causes the length of w in _get_words_start_end_pos (e.g. len('15*9mm') == 6) to disagree with the length of w in convert_examples_to_features (e.g. len(['15', '*', '9', '##mm']) == 4).
As a result, the start_pos and end_pos produced in the data-processing stage are already wrong.
It's also possible that my analysis is mistaken. Please take a look.
@LiangsLi
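
A minimal sketch of the mismatch, assuming the Hugging Face `transformers` BertTokenizer (the repo may load its tokenizer differently; the exact WordPieces depend on the vocabulary, so the tokens shown are illustrative):

```python
from transformers import BertTokenizer

# Assumption: the checkpoint name is illustrative; any BERT vocab shows the effect.
tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")

w = "15*9mm"
pieces = tokenizer.tokenize(w)

print(len(w))       # 6 -- character count, as accumulated in _get_words_start_end_pos
print(pieces)       # e.g. ['15', '*', '9', '##mm']
print(len(pieces))  # e.g. 4 -- WordPiece count, as seen in convert_examples_to_features
```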

@LiangsLi
Contributor

Hi @erichuazhou, I've checked, and this is indeed a bug. The current input implementation is overly complex and ugly; later I will pass the tokenizer in so that the correct word boundaries and maximum sentence length are obtained. However, I'm busy with my graduation project at the moment, and this part requires changes in many places, so I'm not sure when it will be done.
Thanks for helping review the code!
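
A sketch of the direction described above (a hypothetical helper, not the repo's actual implementation): pass the tokenizer into the offset computation and measure word spans in WordPiece tokens, so both stages count the same units.

```python
from transformers import BertTokenizer

def get_words_start_end_pos(words, tokenizer):
    # Hypothetical replacement for _get_words_start_end_pos: spans are measured
    # in WordPiece tokens, not characters, so they line up with the token
    # sequence built in convert_examples_to_features.
    spans, pos = [], 0
    for w in words:
        n = len(tokenizer.tokenize(w))  # token count for this word
        spans.append((pos, pos + n - 1))
        pos += n
    return spans

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
print(get_words_start_end_pos(["宽", "15*9mm"], tokenizer))
# e.g. [(0, 0), (1, 4)] -- '15*9mm' occupies four WordPiece positions
```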

LiangsLi added the bug label May 15, 2020