IK analyzer produces incorrect offsets when processing Chinese text #1022

DemosHume · 2023-09-26T06:16:25Z

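I ran into a problem when using the IK analyzer on Chinese text. I have a field, recommend_tags, whose value is "贝尔法斯特号". When I try to insert this record into my index, I get an error saying that startOffset must be non-negative, endOffset must be >= startOffset, and offsets must not go backwards. The error message is as follows: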

('1 document(s) failed to index.', [{'index': {'_index': 'image_test_6_8_0', '_type': 'sql_record', '_id': 'WcgY0IoB7W0KhcCALXYf', 'status': 400, 'error': {'type': 'illegal_argument_exception', 'reason': "startOffset must be non-negative, and endOffset must be >= startOffset, and offsets must not go backwards startOffset=2,endOffset=3,lastStartOffset=3 for field 'recommend_tags'"}, 'data': {'recommend_tags': '贝尔法斯特号'}}}])
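
To see which of the three conditions in that message actually fails, the numbers can be pulled out of the reason string and checked one by one. This is only a sketch: the helper name `failed_conditions` is hypothetical, and the three checks are a literal reading of the wording of the message, not code taken from Elasticsearch or Lucene.

```python
import re

# The 'reason' string copied from the error above.
reason = (
    "startOffset must be non-negative, and endOffset must be >= startOffset, "
    "and offsets must not go backwards "
    "startOffset=2,endOffset=3,lastStartOffset=3 for field 'recommend_tags'"
)

def failed_conditions(reason):
    """Return the conditions named in the message that the reported values violate.

    Hypothetical helper: the checks follow the wording of the message only.
    """
    # Collect every "<name>Offset=<int>" pair, e.g. startOffset=2.
    values = {name: int(num) for name, num in re.findall(r"(\w+Offset)=(-?\d+)", reason)}
    start = values["startOffset"]
    end = values["endOffset"]
    last_start = values["lastStartOffset"]

    failed = []
    if start < 0:
        failed.append("startOffset is negative")
    if end < start:
        failed.append("endOffset < startOffset")
    if start < last_start:
        failed.append("offsets go backwards")
    return failed

print(failed_conditions(reason))
```

With the values reported here (startOffset=2, endOffset=3, lastStartOffset=3), only the last check fails: the current token starts at 2 while the previous one started at 3.
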
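When I analyze the text manually with the analyze API, the problem appears to be with the tokens "法" and "斯". "斯" (start_offset 3, end_offset 4) is emitted at position 3, and the next token, "法", has start_offset 2 and end_offset 3, so the start offset moves back from 3 to 2. This violates the rule that offsets must not go backwards. Here is the analysis result: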

{'tokens': [{'token': '贝尔法斯特', 'start_offset': 0, 'end_offset': 5, 'type': 'CN_WORD', 'position': 0}, {'token': '贝尔法', 'start_offset': 0, 'end_offset': 3, 'type': 'CN_WORD', 'position': 1}, {'token': '贝尔', 'start_offset': 0, 'end_offset': 2, 'type': 'CN_WORD', 'position': 2}, {'token': '斯', 'start_offset': 3, 'end_offset': 4, 'type': 'CN_CHAR', 'position': 3}, {'token': '法', 'start_offset': 2, 'end_offset': 3, 'type': 'CN_CHAR', 'position': 4}, {'token': '斯', 'start_offset': 3, 'end_offset': 4, 'type': 'CN_CHAR', 'position': 5}, {'token': '特号', 'start_offset': 4, 'end_offset': 6, 'type': 'CN_WORD', 'position': 6}]}
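
The same observation can be checked mechanically by walking the token list in emitted order and flagging any token whose start_offset is smaller than that of the token before it. A minimal sketch, assuming the tokens are available as Python dicts like the output above (the `type` and `position` keys are dropped for brevity, and the function name `find_backward_offsets` is hypothetical):

```python
# Tokens copied from the ik_max_word output above, in emitted order.
tokens = [
    {"token": "贝尔法斯特", "start_offset": 0, "end_offset": 5},
    {"token": "贝尔法", "start_offset": 0, "end_offset": 3},
    {"token": "贝尔", "start_offset": 0, "end_offset": 2},
    {"token": "斯", "start_offset": 3, "end_offset": 4},
    {"token": "法", "start_offset": 2, "end_offset": 3},
    {"token": "斯", "start_offset": 3, "end_offset": 4},
    {"token": "特号", "start_offset": 4, "end_offset": 6},
]

def find_backward_offsets(tokens):
    """Return (previous, current) pairs where start_offset decreases."""
    problems = []
    previous = None
    for current in tokens:
        if previous is not None and current["start_offset"] < previous["start_offset"]:
            problems.append((previous, current))
        previous = current
    return problems

for previous, current in find_backward_offsets(tokens):
    print(
        f"{previous['token']} starts at {previous['start_offset']}, "
        f"then {current['token']} starts at {current['start_offset']}"
    )
```

For this token list the check flags exactly one pair: the first "斯" (start_offset 3) followed by "法" (start_offset 2).
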

The IK analyzer version information is as follows:
description=IK Analyzer for Elasticsearch version=6.8.0

The index field mapping is as follows:

"recommend_tags": { "type": "text", "analyzer": "ik_max_word" }


DemosHume · 2023-09-26T06:29:58Z

The text "萨尔瓦多共和国" also triggers the problem:
{'tokens': [{'token': '萨尔瓦多', 'start_offset': 0, 'end_offset': 4, 'type': 'CN_WORD', 'position': 0}, {'token': '萨尔瓦', 'start_offset': 0, 'end_offset': 3, 'type': 'CN_WORD', 'position': 1}, {'token': '萨尔', 'start_offset': 0, 'end_offset': 2, 'type': 'CN_WORD', 'position': 2}, {'token': '瓦', 'start_offset': 2, 'end_offset': 3, 'type': 'CN_CHAR', 'position': 3}, {'token': '多', 'start_offset': 3, 'end_offset': 4, 'type': 'CN_CHAR', 'position': 4}]}

lizongbo · 2023-10-21T06:23:15Z

I verified this on ES 8.10.2 and the result is normal.

AnalyzeRequest: POST /_analyze {"analyzer":"ik_max_word","text":["贝尔法斯特"]}======
AnalyzeResponse: {"tokens":[{"end_offset":5,"position":0,"start_offset":0,"token":"贝尔法斯特","type":"CN_WORD"},{"end_offset":3,"position":1,"start_offset":0,"token":"贝尔法","type":"CN_WORD"},{"end_offset":2,"position":2,"start_offset":0,"token":"贝尔","type":"CN_WORD"},{"end_offset":3,"position":3,"start_offset":2,"token":"法","type":"CN_CHAR"},{"end_offset":4,"position":4,"start_offset":3,"token":"斯","type":"CN_CHAR"},{"end_offset":5,"position":5,"start_offset":4,"token":"特","type":"CN_CHAR"}]}

AnalyzeRequest: POST /_analyze {"analyzer":"ik_max_word","text":["萨尔瓦多"]}======
AnalyzeResponse: {"tokens":[{"end_offset":4,"position":0,"start_offset":0,"token":"萨尔瓦多","type":"CN_WORD"},{"end_offset":3,"position":1,"start_offset":0,"token":"萨尔瓦","type":"CN_WORD"},{"end_offset":2,"position":2,"start_offset":0,"token":"萨尔","type":"CN_WORD"},{"end_offset":3,"position":3,"start_offset":2,"token":"瓦","type":"CN_CHAR"},{"end_offset":4,"position":4,"start_offset":3,"token":"多","type":"CN_CHAR"}]}
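
As a quick sanity check, the start_offset sequences from the two 8.10.2 responses can be compared with the sequence from the 6.8.0 report. This is a sketch only: the lists below were copied by hand from the outputs in this thread, and `is_non_decreasing` is a hypothetical helper name. Note that the 8.10.2 request analyzes "贝尔法斯特", without the trailing "号" of the originally reported value, so it is not an exact reproduction of the failing input.

```python
# start_offset values in emitted order, copied from the outputs in this thread.
starts_8_10_2_belfast = [0, 0, 0, 2, 3, 4]      # "贝尔法斯特" on ES 8.10.2
starts_8_10_2_salvador = [0, 0, 0, 2, 3]        # "萨尔瓦多" on ES 8.10.2
starts_6_8_0_belfast = [0, 0, 0, 3, 2, 3, 4]    # "贝尔法斯特号" on IK 6.8.0

def is_non_decreasing(values):
    """True if no element is smaller than the one before it."""
    return all(a <= b for a, b in zip(values, values[1:]))

print(is_non_decreasing(starts_8_10_2_belfast))
print(is_non_decreasing(starts_8_10_2_salvador))
print(is_non_decreasing(starts_6_8_0_belfast))
```

The two 8.10.2 sequences never decrease, while the 6.8.0 sequence drops from 3 to 2.
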
