
Fixes bugs from word delimiter graph filter #3237

Closed

mboynes wants to merge 1 commit into 10up:develop from mboynes:hotfix/dont-preserve-original-with-word-delimiter-graph

Conversation

@mboynes

@mboynes mboynes commented Dec 28, 2022

Description of the Change

This removes the preserve_original option for the word_delimiter_graph token filter. As noted in the ES docs:

Setting this parameter to true produces multi-position tokens, which are not supported by indexing.

If this parameter is true, avoid using this filter in an index analyzer or use the flatten_graph filter after this filter to make the token stream suitable for indexing.

As it was set, this was producing multi-position tokens, which could lead to unexpected results and, potentially, indexing errors. I observed this specifically with a search_as_you_type field, which uses shingle token filters to create 2- and 3-gram sub-fields. Combined with word_delimiter_graph.preserve_original = true, if the field text is a word like "WordPress" and the analyzed token count is <= the gram size, the tokens can end up with a negative token position and indexing fails.
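To see the multi-position tokens directly, you can run _analyze against an analyzer that uses word_delimiter_graph with preserve_original enabled (this assumes an index set up like the sayt-demo example in the test steps below):

```
GET sayt-demo/_analyze
{
  "analyzer": "default",
  "text": "WordPress"
}
```

With preserve_original enabled, the output should include the original token "WordPress" at position 0 spanning two positions, alongside "Word" (position 0) and "Press" (position 1); that overlapping original token is what the shingle filter trips over.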

For what it's worth, I tried using flatten_graph as the docs suggest as a workaround, and that didn't work 🤷.
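For reference, the flatten_graph attempt looked roughly like this (flatten_graph is a built-in filter, so it only needs to be appended to the chain; this is a sketch of what I tried, not a working fix):

```
"analyzer": {
  "default": {
    "tokenizer": "standard",
    "filter": [
      "word_delimiter",
      "flatten_graph"
    ]
  }
}
```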

How to test the Change

To replicate the error in isolation:

PUT sayt-demo
{
  "settings": {
    "analysis": {
      "filter": {
        "word_delimiter": {
          "type": "word_delimiter_graph",
          "preserve_original": true
        }
      },
      "analyzer": {
        "default": {
          "tokenizer": "standard",
          "filter": [
            "word_delimiter"
          ]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "test": {
        "type": "search_as_you_type"
      }
    }
  }
}

PUT sayt-demo/_doc/1?refresh
{
  "test": "WordPress"
}

RESPONSE:
{
  "error" : {
    "root_cause" : [
      {
        "type" : "illegal_argument_exception",
        "reason" : "first position increment must be > 0 (got 0) for field 'test._2gram'"
      }
    ],
    "type" : "illegal_argument_exception",
    "reason" : "first position increment must be > 0 (got 0) for field 'test._2gram'"
  },
  "status" : 400
}

GET /sayt-demo/_analyze
{
  "field": "test._2gram",
  "text": "WordPress"
}

RESPONSE:
{
  "tokens" : [
    {
      "token" : "Word Press",
      "start_offset" : 0,
      "end_offset" : 9,
      "type" : "shingle",
      "position" : -1
    }
  ]
}

Observe that when removing "preserve_original": true, the document indexes as expected and the position is correctly calculated as 0, not -1.
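For comparison, the fixed settings simply drop the parameter. A sketch of the same demo index (the sayt-demo-fixed name is only illustrative):

```
PUT sayt-demo-fixed
{
  "settings": {
    "analysis": {
      "filter": {
        "word_delimiter": {
          "type": "word_delimiter_graph"
        }
      },
      "analyzer": {
        "default": {
          "tokenizer": "standard",
          "filter": [ "word_delimiter" ]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "test": { "type": "search_as_you_type" }
    }
  }
}
```

With this mapping, indexing {"test": "WordPress"} should succeed, since no multi-position tokens reach the shingle sub-fields.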

Changelog Entry

Fixed - Potential indexing error due to word_delimiter_graph.preserve_original: true, which Elastic notes should not be set for analysis during indexing.

Credits

Props @mboynes, @dlh01

Checklist:

  • I agree to follow this project's Code of Conduct.
  • I have updated the documentation accordingly.
  • I have added tests to cover my change.
  • All new and existing tests pass.

@felipeelia
Member

Thanks for the PR @mboynes. We are aware of this problem already and it will be fixed via #2911, applying the synonyms only during search time.

@felipeelia closed this Jan 3, 2023
@mboynes
Author

mboynes commented Jan 3, 2023

@felipeelia the conflict isn't only with synonyms; synonyms are just another example, so that PR will not solve the root issue. The PR description includes a lot of detail explaining why this setting needs to change; please review it and let me know if you have any questions.

@felipeelia
Member

@mboynes yeah, I got it. The problem with simply removing the preserve_original value is that a token for "WordPress" is never created (during search or index time). In this specific case it doesn't matter much, but for SKUs, for example, having the original string does make a difference.

I know synonyms are just an example but that will be the only case where this will be a problem for out-of-the-box users. That said, we have two options:

  1. All mapping files would need a new analysis.analyzer.default_search, so different filters with different configurations are applied during index and search time OR
  2. Developers creating new fields could do that manually according to their needs

Right now we are leaning toward option 1, but that can obviously change.
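A sketch of what option 1 might look like (the filter names and exact configuration here are assumptions for illustration, not the planned change):

```
"analysis": {
  "filter": {
    "word_delimiter": {
      "type": "word_delimiter_graph"
    },
    "word_delimiter_search": {
      "type": "word_delimiter_graph",
      "preserve_original": true
    }
  },
  "analyzer": {
    "default": {
      "tokenizer": "standard",
      "filter": [ "word_delimiter" ]
    },
    "default_search": {
      "tokenizer": "standard",
      "filter": [ "word_delimiter_search" ]
    }
  }
}
```

Elasticsearch applies the default_search analyzer at query time only, so multi-position tokens from preserve_original would never reach the index.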

@mboynes
Author

mboynes commented Jan 3, 2023

The problem with simply removing the preserve_original value is that a token for WordPress is never created (during search or index time.) In this specific case it doesn't matter much but for SKUs, for example, having the original string does make a difference.

Since we're talking about analyzed fields, that wouldn't be much of a problem. The field is analyzed the same way at search time as at index time, so a SKU (or some other string containing a mixture of letters and numbers) that happened to exist in an analyzed field would still match practically 100% of the time. There are exceptions to this statement, but there always are exceptions with ES analyzers 😅.
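To illustrate the symmetry (the SKU value here is hypothetical):

```
GET /_analyze
{
  "tokenizer": "standard",
  "filter": [ { "type": "word_delimiter_graph" } ],
  "text": "AB-1234"
}
```

Both at index and at search time, "AB-1234" should analyze to the tokens "AB" and "1234", so a match query for the full SKU still finds the document; only queries that depend on the unsplit original token behave differently.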

Regardless, the status quo is invalid and, in my opinion, should not be an option.

