Multiple tokens at the same position not working correctly with match query if AND operator is used #3881

lukas-vlcek · 2013-10-10T19:51:49Z

If multiple tokens are output at the same position, match queries do not work correctly when the AND operator is used.

I first noticed this issue when using the Hunspell token filter (something similar has been reported in LUCENE-5057, but it is not really a Lucene issue). With Hunspell it is possible to get multiple output tokens from a single input token, all at the same position. However, the client query usually contains only one of those tokens, or a token that produces a different set of tokens. When using a match query with the AND operator, the document does not match (while it should).
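
To make "multiple tokens at the same position" concrete, here is a minimal sketch in plain Java with no Lucene dependency. The names `Tok` and `groupByPosition` are hypothetical, and the token order (original term first, synonym stacked with a position increment of 0) is an assumption about what an analyzer would emit for "fast" when "fast, quick" are synonyms:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical stand-in for an analyzed token: a term plus its position increment. */
final class Tok {
    final String term;
    final int posInc; // 0 means "same position as the previous token"

    Tok(String term, int posInc) {
        this.term = term;
        this.posInc = posInc;
    }
}

final class StackedTokenDemo {
    /**
     * Groups terms by position. Increments greater than 1 (gaps) are
     * treated like 1 in this sketch.
     */
    static List<List<String>> groupByPosition(List<Tok> stream) {
        List<List<String>> positions = new ArrayList<>();
        for (Tok t : stream) {
            // A zero increment stacks the token on the previous position;
            // the very first token always opens a new position.
            if (t.posInc > 0 || positions.isEmpty()) {
                positions.add(new ArrayList<>());
            }
            positions.get(positions.size() - 1).add(t.term);
        }
        return positions;
    }

    public static void main(String[] args) {
        // Assumed analysis result for the query text "fast".
        List<Tok> analyzed = Arrays.asList(new Tok("fast", 1), new Tok("quick", 0));
        System.out.println(groupByPosition(analyzed)); // [[fast, quick]]
    }
}
```

Both terms end up in a single position, which is the case where a plain AND over all tokens becomes too strict.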

I also think that this can impact other linguistics packages (like Basis's RBL?).

A similar situation can be simulated using the synonym filter. Imagine that we are using query-time synonyms.

Let's say we index a simple document:

{ text : "Quick brown fox" }

and we define the query-time synonym "quick, fast". Now let's see what we can do with this in the following recreation script (using ES 0.90.5); the output is commented below:

#!/bin/sh

echo "Elasticsearch version"
curl localhost:9200; echo; echo;

echo "Delete index"; curl -X DELETE 'localhost:9200/i'; echo; echo;

echo "Create index with analysis and mappings"; curl -X PUT 'localhost:9200/i' -d '{
  "settings" : {
    "analysis" : {
      "analyzer" : {
        "index" : {
          "type" : "custom",
          "tokenizer" : "standard",
          "filter" : ["lowercase"]
        },
        "search" : {
          "type" : "custom",
          "tokenizer" : "standard",
          "filter" : ["lowercase","synonym"]
        }
      },
      "filter" : {
        "synonym" : {
          "type" : "synonym",
          "synonyms" : [
            "fast, quick"
          ]
  }}},
  "mappings" : {
    "t" : {
      "properties" : {
        "text" : {
          "type" : "string",
          "index_analyzer" : "index",
          "search_analyzer" : "search"
}}}}}}'; echo; echo;

# Wait for all the index shards to be allocated
curl -s -X GET 'http://localhost:9200/_cluster/health?wait_for_status=yellow&timeout=5s' > /dev/null

echo "Test synonyms for 'fast': should output two tokens"; curl -X POST 'localhost:9200/i/_analyze?analyzer=search&format=text&text=fast'; echo; echo;

echo "Index data: 'Quick brown fox'"; curl -X POST 'localhost:9200/i/t' -d '{
  "text" : "Quick brown fox"
}'; echo; echo;

echo "Refresh Lucene reader"; curl -X POST 'localhost:9200/i/_refresh'; echo; echo;

echo "Testing search";
echo ===========================
echo "1) query_string: quick";
curl -X GET 'localhost:9200/_search' -d '{"query":{"query_string":{"query":"quick","default_field":"text"}}}'; echo; echo;

echo "2) query_string: fast - is search_analyzer used?";
curl -X GET 'localhost:9200/_search' -d '{"query":{"query_string":{"query":"fast","default_field":"text"}}}'; echo; echo;

echo "2.5) query_string: fast - forcing search_analyzer";
curl -X GET 'localhost:9200/_search' -d '{"query":{"query_string":{"query":"fast","default_field":"text","analyzer":"search"}}}'; echo; echo;

echo "3) query_string: fast - forcing search_analyzer, forcing AND operator";
curl -X GET 'localhost:9200/_search' -d '{"query":{"query_string":{"query":"fast","default_field":"text","analyzer":"search","default_operator":"AND"}}}'; echo; echo;

echo "4) match query: quick";
curl -X GET 'localhost:9200/_search' -d '{"query":{"match":{"text":{"query":"quick","analyzer":"search"}}}}'; echo; echo;

echo "5) match query: fast";
curl -X GET 'localhost:9200/_search' -d '{"query":{"match":{"text":{"query":"fast","analyzer":"search"}}}}'; echo; echo;

echo "6) match query: fast - forcing AND operator";
curl -X GET 'localhost:9200/_search' -d '{"query":{"match":{"text":{"query":"fast","analyzer":"search","operator":"AND"}}}}'; echo; echo;

Output of queries:

1) query_string: quick
{"took":4,"timed_out":false,"_shards":{"total":5,"successful":5,"failed":0},"hits":{"total":1,"max_score":0.15342641,"hits":[{"_index":"i","_type":"t","_id":"0N2FX_vxR5qsMTYczFPl1w","_score":0.15342641, "_source" : {
  "text" : "Quick brown fox"
}}]}}

2) query_string: fast - is search_analyzer used?
{"took":2,"timed_out":false,"_shards":{"total":5,"successful":5,"failed":0},"hits":{"total":0,"max_score":null,"hits":[]}}

2.5) query_string: fast - forcing search_analyzer
{"took":3,"timed_out":false,"_shards":{"total":5,"successful":5,"failed":0},"hits":{"total":1,"max_score":0.04500804,"hits":[{"_index":"i","_type":"t","_id":"0N2FX_vxR5qsMTYczFPl1w","_score":0.04500804, "_source" : {
  "text" : "Quick brown fox"
}}]}}

3) query_string: fast - forcing search_analyzer, forcing AND operator
{"took":2,"timed_out":false,"_shards":{"total":5,"successful":5,"failed":0},"hits":{"total":1,"max_score":0.04500804,"hits":[{"_index":"i","_type":"t","_id":"0N2FX_vxR5qsMTYczFPl1w","_score":0.04500804, "_source" : {
  "text" : "Quick brown fox"
}}]}}

4) match query: quick
{"took":2,"timed_out":false,"_shards":{"total":5,"successful":5,"failed":0},"hits":{"total":1,"max_score":0.04500804,"hits":[{"_index":"i","_type":"t","_id":"0N2FX_vxR5qsMTYczFPl1w","_score":0.04500804, "_source" : {
  "text" : "Quick brown fox"
}}]}}

5) match query: fast
{"took":3,"timed_out":false,"_shards":{"total":5,"successful":5,"failed":0},"hits":{"total":1,"max_score":0.04500804,"hits":[{"_index":"i","_type":"t","_id":"0N2FX_vxR5qsMTYczFPl1w","_score":0.04500804, "_source" : {
  "text" : "Quick brown fox"
}}]}}

6) match query: fast - forcing AND operator
{"took":4,"timed_out":false,"_shards":{"total":5,"successful":5,"failed":0},"hits":{"total":0,"max_score":null,"hits":[]}}

My comments on results:

(Note that comment 2 raises a question about a separate, unrelated issue.)

1) query_string for query "quick" works as expected.
2) query_string for query "fast" does not seem to work. According to the documentation, I expected the search_analyzer defined in the string type mapping to be used. But anyway, this should not be the topic of this issue... 😄
2.5) query_string for query "fast" works (if I explicitly force the search analyzer), so we can say the query-time synonym works fine.
3) The same situation as in 2.5), except we are forcing the AND operator. It should work, and it does.
4) Now, let's use a match query and query for "quick". It works fine.
5) Again a match query, but querying for "fast". It works, so far so good.
6) The same as in 5), except we are forcing the AND operator. It should work (I hope), but it does not.

If I could speculate about why this is happening:

a) MatchQueryParser does something like:

... if ("and".equalsIgnoreCase(op)) {
    matchQuery.setOccur(BooleanClause.Occur.MUST);
} ...

b) and MatchQuery does not take the position of tokens into account. It simply stacks all incoming tokens into a BooleanQuery. It contains patterns similar to the following excerpt:

BooleanQuery q = new BooleanQuery(positionCount == 1);
for (int i = 0; i < numTokens; i++) {
    boolean hasNext = buffer.incrementToken();
    assert hasNext == true;
    final Query currentQuery = newTermQuery(mapper, new Term(field, termToByteRef(termAtt)));
    q.add(currentQuery, occur);
}

The position of tokens is not taken into account, which would explain why this does not work as expected in combination with the AND operator in the situations described above.
I think that if incoming tokens share the same position, it should generate a Boolean subquery with the OR operator (?).
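
The difference between the two behaviours can be sketched without Lucene. The following is only an illustration under stated assumptions, not the actual MatchQuery code: `QueryToken`, `matchesFlatAnd` and `matchesPositionAwareAnd` are hypothetical names, a document is reduced to its set of indexed terms, and the query "fast" is assumed to analyze to "fast" plus a stacked "quick" (position increment 0):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class PositionAwareMatch {
    /** Hypothetical analyzed query token: term plus position increment (0 = stacked). */
    static final class QueryToken {
        final String term;
        final int posInc;

        QueryToken(String term, int posInc) {
            this.term = term;
            this.posInc = posInc;
        }
    }

    /** Flat conjunction: every token is required, regardless of its position. */
    static boolean matchesFlatAnd(Set<String> docTerms, List<QueryToken> tokens) {
        for (QueryToken t : tokens) {
            if (!docTerms.contains(t.term)) {
                return false;
            }
        }
        return true;
    }

    /** Position-aware conjunction: AND across positions, OR within one position. */
    static boolean matchesPositionAwareAnd(Set<String> docTerms, List<QueryToken> tokens) {
        // Group stacked tokens (posInc == 0) with the position they share.
        List<List<String>> positions = new ArrayList<>();
        for (QueryToken t : tokens) {
            if (t.posInc > 0 || positions.isEmpty()) {
                positions.add(new ArrayList<>());
            }
            positions.get(positions.size() - 1).add(t.term);
        }
        // Every position must be satisfied by at least one of its alternatives.
        for (List<String> alternatives : positions) {
            boolean any = false;
            for (String term : alternatives) {
                any |= docTerms.contains(term);
            }
            if (!any) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // Terms of the indexed document "Quick brown fox" after lowercasing.
        Set<String> doc = new HashSet<>(Arrays.asList("quick", "brown", "fox"));
        List<QueryToken> fast =
                Arrays.asList(new QueryToken("fast", 1), new QueryToken("quick", 0));
        System.out.println(matchesFlatAnd(doc, fast));          // false
        System.out.println(matchesPositionAwareAnd(doc, fast)); // true
    }
}
```

The flat variant mirrors the empty result of query 6), because "fast" itself is never indexed. The position-aware variant only requires one of the stacked alternatives per position, so the document matches.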

s1monw · 2013-10-11T13:00:36Z

> I think if incoming tokens share the same position it should generate Boolean subquery with OR operator (?).

I agree!

lukas-vlcek · 2013-10-11T13:43:13Z

btw @s1monw I do not want to hijack this issue, but what do you think about my comment no. 2? (To me it seems that the search analyzer is not used while it should be, no?) Is it worth opening a new issue, or am I misunderstanding something here?

s1monw · 2013-10-11T15:02:58Z

I updated the PR with a test for your issue no. 2, but I can't reproduce it. It works just fine and uses the right filter, or am I missing something?

lukas-vlcek · 2013-10-11T15:19:13Z

If my recreation script returns one hit for the second query for you, then it has probably been fixed already (or it is hard to say...). Just ignore it...
Thanks!

s1monw · 2013-10-11T15:24:43Z

I will try to recreate it via REST; maybe there is some problem there. I don't think I will get to it today, so I will update it later!

SynonymFilter produces token streams with stacked tokens, so conjunction queries need to be parsed in a special way: the stacked tokens are added as an inner disjunction. Closes #3881

ghost assigned s1monw Oct 11, 2013

s1monw mentioned this issue Oct 11, 2013

Add match query support for stacked tokens #3897

Merged

lukas-vlcek mentioned this issue Oct 14, 2013

search_analyzer does not seem to kick in as expected #3903

Closed

s1monw closed this as completed in 7a7370e Oct 14, 2013
