Multiple tokens at the same position not working correctly with match query if AND operator is used #3881

Closed
lukas-vlcek opened this issue Oct 10, 2013 · 5 comments · Fixed by #3897

Comments

@lukas-vlcek
Contributor

When an analyzer outputs multiple tokens at the same position, match queries do not work correctly if the AND operator is used.

I first noticed this issue when using the Hunspell token filter (something similar has been reported in LUCENE-5057, but it is not really a Lucene issue). With Hunspell, a single input token can produce multiple output tokens, all at the same position. However, the client query usually contains only one of those tokens, or a token that produces a different set of tokens. When using a match query with the AND operator, the document does not match (while it should).

I also think that this can affect other linguistics packages (like Basis's RBL?).

A similar situation can be simulated using a synonym filter. Imagine that we are using query-time synonyms.

Let's say we index a simple document:

{ text : "Quick brown fox" }

and we define the query-time synonym "quick, fast". Now let's see what happens in the following recreation script (using ES 0.90.5); the output and my comments follow below:

#!/bin/sh

echo "Elasticsearch version"
curl localhost:9200; echo; echo;

echo "Delete index"; curl -X DELETE 'localhost:9200/i'; echo; echo;

echo "Create index with analysis and mappings"; curl -X PUT 'localhost:9200/i' -d '{
  "settings" : {
    "analysis" : {
      "analyzer" : {
        "index" : {
          "type" : "custom",
          "tokenizer" : "standard",
          "filter" : ["lowercase"]
        },
        "search" : {
          "type" : "custom",
          "tokenizer" : "standard",
          "filter" : ["lowercase","synonym"]
        }
      },
      "filter" : {
        "synonym" : {
          "type" : "synonym",
          "synonyms" : [
            "fast, quick"
          ]
  }}},
  "mappings" : {
    "t" : {
      "properties" : {
        "text" : {
          "type" : "string",
          "index_analyzer" : "index",
          "search_analyzer" : "search"
}}}}}}'; echo; echo;

# Wait for all the index shards to be allocated
curl -s -X GET 'http://localhost:9200/_cluster/health?wait_for_status=yellow&timeout=5s' > /dev/null

echo "Test synonyms for 'fast': should output two tokens"; curl -X POST 'localhost:9200/i/_analyze?analyzer=search&format=text&text=fast'; echo; echo;

echo "Index data: 'Quick brown fox'"; curl -X POST 'localhost:9200/i/t' -d '{
  "text" : "Quick brown fox"
}'; echo; echo;

echo "Refresh Lucene reader"; curl -X POST 'localhost:9200/i/_refresh'; echo; echo;

echo "Testing search";
echo ===========================
echo "1) query_string: quick";
curl -X GET 'localhost:9200/_search' -d '{"query":{"query_string":{"query":"quick","default_field":"text"}}}'; echo; echo;

echo "2) query_string: fast - is search_analyzer used?";
curl -X GET 'localhost:9200/_search' -d '{"query":{"query_string":{"query":"fast","default_field":"text"}}}'; echo; echo;

echo "2.5) query_string: fast - forcing search_analyzer";
curl -X GET 'localhost:9200/_search' -d '{"query":{"query_string":{"query":"fast","default_field":"text","analyzer":"search"}}}'; echo; echo;

echo "3) query_string: fast - forcing search_analyzer, forcing AND operator";
curl -X GET 'localhost:9200/_search' -d '{"query":{"query_string":{"query":"fast","default_field":"text","analyzer":"search","default_operator":"AND"}}}'; echo; echo;

echo "4) match query: quick";
curl -X GET 'localhost:9200/_search' -d '{"query":{"match":{"text":{"query":"quick","analyzer":"search"}}}}'; echo; echo;

echo "5) match query: fast";
curl -X GET 'localhost:9200/_search' -d '{"query":{"match":{"text":{"query":"fast","analyzer":"search"}}}}'; echo; echo;

echo "6) match query: fast - forcing AND operator";
curl -X GET 'localhost:9200/_search' -d '{"query":{"match":{"text":{"query":"fast","analyzer":"search","operator":"AND"}}}}'; echo; echo;

Output of queries:

1) query_string: quick
{"took":4,"timed_out":false,"_shards":{"total":5,"successful":5,"failed":0},"hits":{"total":1,"max_score":0.15342641,"hits":[{"_index":"i","_type":"t","_id":"0N2FX_vxR5qsMTYczFPl1w","_score":0.15342641, "_source" : {
  "text" : "Quick brown fox"
}}]}}

2) query_string: fast - is search_analyzer used?
{"took":2,"timed_out":false,"_shards":{"total":5,"successful":5,"failed":0},"hits":{"total":0,"max_score":null,"hits":[]}}

2.5) query_string: fast - forcing search_analyzer
{"took":3,"timed_out":false,"_shards":{"total":5,"successful":5,"failed":0},"hits":{"total":1,"max_score":0.04500804,"hits":[{"_index":"i","_type":"t","_id":"0N2FX_vxR5qsMTYczFPl1w","_score":0.04500804, "_source" : {
  "text" : "Quick brown fox"
}}]}}

3) query_string: fast - forcing search_analyzer, forcing AND operator
{"took":2,"timed_out":false,"_shards":{"total":5,"successful":5,"failed":0},"hits":{"total":1,"max_score":0.04500804,"hits":[{"_index":"i","_type":"t","_id":"0N2FX_vxR5qsMTYczFPl1w","_score":0.04500804, "_source" : {
  "text" : "Quick brown fox"
}}]}}

4) match query: quick
{"took":2,"timed_out":false,"_shards":{"total":5,"successful":5,"failed":0},"hits":{"total":1,"max_score":0.04500804,"hits":[{"_index":"i","_type":"t","_id":"0N2FX_vxR5qsMTYczFPl1w","_score":0.04500804, "_source" : {
  "text" : "Quick brown fox"
}}]}}

5) match query: fast
{"took":3,"timed_out":false,"_shards":{"total":5,"successful":5,"failed":0},"hits":{"total":1,"max_score":0.04500804,"hits":[{"_index":"i","_type":"t","_id":"0N2FX_vxR5qsMTYczFPl1w","_score":0.04500804, "_source" : {
  "text" : "Quick brown fox"
}}]}}

6) match query: fast - forcing AND operator
{"took":4,"timed_out":false,"_shards":{"total":5,"successful":5,"failed":0},"hits":{"total":0,"max_score":null,"hits":[]}}

My comments on results:

(Note that comment no. 2 raises a question about a separate, unrelated issue.)

1) query_string for query "quick" works as expected.

2) query_string for query "fast" does not seem to work. According to the documentation, I expected the search_analyzer defined in the string type mapping to be used. But anyway, this should not be the topic of this issue... 😄

2.5) query_string for query "fast" works (if I explicitly force the search analyzer), so we can say the query-time synonym works fine.

3) The same situation as in 2.5), except we are forcing the AND operator. It should work and it does.

4) Now, let's use a match query and query for "quick". It works fine.

5) Again a match query, but querying for "fast". It works, so far so good.

6) The same as in 5), except we are forcing the AND operator. It should work (I hope), but it does not.

If I could speculate about why this is happening:

a) MatchQueryParser does something like:

... if ("and".equalsIgnoreCase(op)) {
    matchQuery.setOccur(BooleanClause.Occur.MUST);
} ...

b) and MatchQuery does not take the position of tokens into account. It simply adds all incoming tokens to a single BooleanQuery. It contains patterns similar to the following excerpt:

BooleanQuery q = new BooleanQuery(positionCount == 1);
for (int i = 0; i < numTokens; i++) {
    boolean hasNext = buffer.incrementToken();
    assert hasNext == true;
    final Query currentQuery = newTermQuery(mapper, new Term(field, termToByteRef(termAtt)));
    q.add(currentQuery, occur);
}

The position of tokens is not taken into account, which would explain why this does not work as expected in combination with the AND operator in the situations described above.
I think that if incoming tokens share the same position, it should generate a Boolean subquery with the OR operator (?).
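
To make the proposal concrete, here is a minimal, self-contained Java sketch. It is not the Elasticsearch or Lucene implementation: the class StackedTokenQuery, the Token holder, and all method names are hypothetical, and a document is modelled as a plain set of indexed terms. It only illustrates the difference between requiring every term (flat AND) and requiring at least one term per position (OR inside a position, AND across positions).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Hypothetical illustration only; none of these names exist in Elasticsearch or Lucene.
class StackedTokenQuery {

    /** A token as an analyzer might emit it; posInc == 0 means "stacked on the previous token". */
    static final class Token {
        final String term;
        final int posInc;

        Token(String term, int posInc) {
            this.term = term;
            this.posInc = posInc;
        }
    }

    /** Groups terms by position: each inner list holds the alternatives stacked at one position. */
    static List<List<String>> groupByPosition(List<Token> tokens) {
        List<List<String>> groups = new ArrayList<>();
        for (Token t : tokens) {
            // A positive increment (or the very first token) starts a new position.
            if (t.posInc > 0 || groups.isEmpty()) {
                groups.add(new ArrayList<>());
            }
            groups.get(groups.size() - 1).add(t.term);
        }
        return groups;
    }

    /** Flat conjunction, as speculated above: every emitted term must be in the document. */
    static boolean matchesFlatAnd(List<Token> tokens, Set<String> docTerms) {
        return tokens.stream().allMatch(t -> docTerms.contains(t.term));
    }

    /** Position-aware conjunction: every position needs at least one of its stacked terms. */
    static boolean matchesGroupedAnd(List<Token> tokens, Set<String> docTerms) {
        return groupByPosition(tokens).stream()
                .allMatch(group -> group.stream().anyMatch(docTerms::contains));
    }

    public static void main(String[] args) {
        // Indexed terms of "Quick brown fox" after lowercasing.
        Set<String> doc = Set.of("quick", "brown", "fox");
        // Query "fast" after the search analyzer: "fast" and "quick" at the same position.
        List<Token> query = List.of(new Token("fast", 1), new Token("quick", 0));

        System.out.println(groupByPosition(query));        // [[fast, quick]]
        System.out.println(matchesFlatAnd(query, doc));    // false, like query 6
        System.out.println(matchesGroupedAnd(query, doc)); // true, the expected result
    }
}
```

With the flat conjunction, the document would need to contain both "fast" and "quick", which mirrors the zero hits of query 6. With the grouped variant, the AND operator still applies across positions, so a multi-word query keeps its stricter semantics while synonyms at one position stay interchangeable.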

@s1monw
Contributor

s1monw commented Oct 11, 2013

I think that if incoming tokens share the same position, it should generate a Boolean subquery with the OR operator (?).

I agree!

@ghost ghost assigned s1monw Oct 11, 2013
@lukas-vlcek
Contributor Author

btw @s1monw I do not want to hijack this issue, but what do you think about my comment no. 2 (to me it seems that the search analyzer is not used while it should be, no?). Is it worth opening a new issue, or am I misunderstanding something here?

@s1monw
Contributor

s1monw commented Oct 11, 2013

I updated the PR with a test for your issue no. 2, but I can't reproduce it. It works just fine and uses the right filter, or am I missing something?

@lukas-vlcek
Contributor Author

If my recreation script returns one hit for the second query for you, then it has probably been fixed already (or it is hard to say...). Just ignore it...
Thanks!

@s1monw
Contributor

s1monw commented Oct 11, 2013

I will try to recreate it via REST; maybe there is some problem there. I don't think I will get to it today, so I will update it later!

s1monw added a commit that referenced this issue Oct 14, 2013
SynonymFilters produce token streams with stacked tokens such that
conjunction queries need to be parsed in a special way such that the
stacked tokens are added as an inner disjunction.

Closes #3881
@s1monw s1monw closed this as completed in 7a7370e Oct 14, 2013
mute pushed a commit to mute/elasticsearch that referenced this issue Jul 29, 2015
SynonymFilters produce token streams with stacked tokens such that
conjunction queries need to be parsed in a special way such that the
stacked tokens are added as an inner disjunction.

Closes elastic#3881