
Fixes bugs from word delimiter graph filter #3237

Closed

mboynes wants to merge 1 commit into 10up:develop from mboynes:hotfix/dont-preserve-original-with-word-delimiter-graph

Conversation

@mboynes

@mboynes mboynes commented Dec 28, 2022

Description of the Change

This removes the preserve_original option for the word_delimiter_graph token filter. As noted in the ES docs:

Setting this parameter to true produces multi-position tokens, which are not supported by indexing.

If this parameter is true, avoid using this filter in an index analyzer or use the flatten_graph filter after this filter to make the token stream suitable for indexing.

As it was set, this was producing multi-position tokens, which could lead to unexpected results and, potentially, indexing errors. I observed this specifically with a search_as_you_type field, which uses shingle token filters to create 2- and 3-gram sub-fields. Combined with word_delimiter_graph.preserve_original = true, if the field text is a word like "WordPress" and the analyzed token count is <= the gram size, the tokens can end up with a negative token position and indexing fails.
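To see the multi-position tokens directly, you can run _analyze against an analyzer that uses word_delimiter_graph with preserve_original enabled (this assumes an index set up like the sayt-demo example in the test steps below):

```
GET sayt-demo/_analyze
{
  "analyzer": "default",
  "text": "WordPress"
}
```

With preserve_original enabled, the output should include the original token "WordPress" at position 0 spanning two positions, alongside "Word" (position 0) and "Press" (position 1); that overlapping original token is what the shingle filter trips over.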

For what it's worth, I tried using flatten_graph as the docs suggest as a workaround, and that didn't work 🤷.
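For reference, the flatten_graph attempt looked roughly like this (flatten_graph is a built-in filter, so it only needs to be appended to the chain; this is a sketch of what I tried, not a working fix):

```
"analyzer": {
  "default": {
    "tokenizer": "standard",
    "filter": [
      "word_delimiter",
      "flatten_graph"
    ]
  }
}
```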

How to test the Change

To replicate the error in isolation:

PUT sayt-demo
{
  "settings": {
    "analysis": {
      "filter": {
        "word_delimiter": {
          "type": "word_delimiter_graph",
          "preserve_original": true
        }
      },
      "analyzer": {
        "default": {
          "tokenizer": "standard",
          "filter": [
            "word_delimiter"
          ]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "test": {
        "type": "search_as_you_type"
      }
    }
  }
}

PUT sayt-demo/_doc/1?refresh
{
  "test": "WordPress"
}

RESPONSE:
{
  "error" : {
    "root_cause" : [
      {
        "type" : "illegal_argument_exception",
        "reason" : "first position increment must be > 0 (got 0) for field 'test._2gram'"
      }
    ],
    "type" : "illegal_argument_exception",
    "reason" : "first position increment must be > 0 (got 0) for field 'test._2gram'"
  },
  "status" : 400
}

GET /sayt-demo/_analyze
{
  "field": "test._2gram",
  "text": "WordPress"
}

RESPONSE:
{
  "tokens" : [
    {
      "token" : "Word Press",
      "start_offset" : 0,
      "end_offset" : 9,
      "type" : "shingle",
      "position" : -1
    }
  ]
}

Observe that when removing "preserve_original": true, the document indexes as expected and the position is correctly calculated as 0, not -1.
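For comparison, the fixed settings simply drop the parameter. A sketch of the same demo index (the sayt-demo-fixed name is only illustrative):

```
PUT sayt-demo-fixed
{
  "settings": {
    "analysis": {
      "filter": {
        "word_delimiter": {
          "type": "word_delimiter_graph"
        }
      },
      "analyzer": {
        "default": {
          "tokenizer": "standard",
          "filter": [ "word_delimiter" ]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "test": { "type": "search_as_you_type" }
    }
  }
}
```

With this mapping, indexing {"test": "WordPress"} should succeed, since no multi-position tokens reach the shingle sub-fields.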

Changelog Entry

Fixed - Potential indexing error due to word_delimiter_graph.preserve_original: true, which Elastic notes should not be set for analysis during indexing.

Credits

Props @mboynes, @dlh01

Checklist:

  • I agree to follow this project's Code of Conduct.
  • I have updated the documentation accordingly.
  • I have added tests to cover my change.
  • All new and existing tests pass.

@felipeelia
Member

Thanks for the PR @mboynes. We are aware of this problem already and it will be fixed via #2911, applying the synonyms only during search time.

@felipeelia closed this Jan 3, 2023
@mboynes
Author

mboynes commented Jan 3, 2023

@felipeelia the conflict isn't only with synonyms; synonyms are just another example, so that PR will not solve the root issue. The PR description includes a lot of detail explaining why this setting needs to change; please review it and let me know if you have any questions.

@felipeelia
Member

@mboynes yeah, I got it. The problem with simply removing the preserve_original value is that a token for "WordPress" is never created (during search or index time). In this specific case it doesn't matter much, but for SKUs, for example, having the original string does make a difference.

I know synonyms are just an example but that will be the only case where this will be a problem for out-of-the-box users. That said, we have two options:

  1. All mapping files would need a new analysis.analyzer.default_search, so different filters with different configurations are applied during index and search time OR
  2. Developers creating new fields could do that manually according to their needs

Right now we are leaning toward option 1, but that can obviously change.
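A sketch of what option 1 might look like (the filter names and exact configuration here are assumptions for illustration, not the planned change):

```
"analysis": {
  "filter": {
    "word_delimiter": {
      "type": "word_delimiter_graph"
    },
    "word_delimiter_search": {
      "type": "word_delimiter_graph",
      "preserve_original": true
    }
  },
  "analyzer": {
    "default": {
      "tokenizer": "standard",
      "filter": [ "word_delimiter" ]
    },
    "default_search": {
      "tokenizer": "standard",
      "filter": [ "word_delimiter_search" ]
    }
  }
}
```

Elasticsearch applies the default_search analyzer at query time only, so multi-position tokens from preserve_original would never reach the index.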

@mboynes
Author

mboynes commented Jan 3, 2023

The problem with simply removing the preserve_original value is that a token for WordPress is never created (during search or index time.) In this specific case it doesn't matter much but for SKUs, for example, having the original string does make a difference.

Since we're talking about analyzed fields, that wouldn't be much of a problem. The field is analyzed the same way at search time as at index time, so a SKU (or some other string containing a mixture of letters and numbers) that happened to exist in an analyzed field would still match practically 100% of the time. There are exceptions to this statement, but there always are exceptions with ES analyzers 😅.
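To illustrate the symmetry (the SKU value here is hypothetical):

```
GET /_analyze
{
  "tokenizer": "standard",
  "filter": [ { "type": "word_delimiter_graph" } ],
  "text": "AB-1234"
}
```

Both at index and at search time, "AB-1234" should analyze to the tokens "AB" and "1234", so a match query for the full SKU still finds the document; only queries that depend on the unsplit original token behave differently.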

Regardless, the status quo is invalid and, in my opinion, should not be an option.

