Fuzzy query ranks misspellings over exact for repeated "close" tokens #22745
Comments
Fuzziness is behaving correctly here. The idea is that words will be spelled correctly most of the time, with a few misspellings, so with fuzziness we give a slight edge to terms that appear more frequently, as this is more likely to be the correct spelling. If your correct spelling is the only occurrence in the document collection, then it's going to rank more poorly.
Let me rephrase so I am sure I understand: this is effectively overriding an exact match with a fuzzy match, which runs counter to what both #9105 (match/multi_match + fuzzy replace FLT) and #5883 (don't rank fuzzy above exact match) were saying... Is there a way to configure this behaviour? My exact use case is searching for cars, and this bug breaks the search for Porsche cars, ranking the …
The more I think about this, the worse it sounds. So please consider this a feature request to be able to tweak the relevance model.
@jeantil it doesn't work like that. What it should do is say:
So this is working correctly. However, I think there is a different bug: the more fuzzy terms that match, the higher the score. With explain, the score for the first two matching documents includes this:
While the last matching doc has just this:
That's just wrong. Instead, we should be taking the max score for each of these fuzzy "synonyms". @jimczi what do you think?
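The difference between the two scoring strategies can be sketched in a few lines. This is only an illustration of the idea, not Elasticsearch's actual scoring code; the per-expansion scores are made-up numbers:

```python
# Sketch: summing fuzzy "synonym" scores vs. taking the max.
# Hypothetical per-expansion scores for the fuzzy expansions of one
# input token (both terms are within edit distance 1 of the input).
expansion_scores = {"aae": 0.47, "abb": 0.47}

summed = sum(expansion_scores.values())  # behavior described above: matching twice doubles the score
maxed = max(expansion_scores.values())   # proposed: one input token contributes at most one score
```

Under the summing behavior a document matching two fuzzy expansions of the same input token outscores a document matching only one, which is what the explain output above shows.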
Oh I'm stupid - Not a bug. Closing |
Right, the query matches the fuzzy terms twice. It's more obvious with the
The boost is computed based on the distance from the exact string.
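As a rough sketch of that distance-based boost: if I recall correctly, Lucene's fuzzy term expansion scales the boost down proportionally to the edit distance over the shorter term's length, so an exact match gets 1.0 and each edit costs a fraction. The exact formula here is an assumption (check Lucene's FuzzyTermsEnum for the real one):

```python
def edit_distance(a: str, b: str) -> int:
    # Classic Levenshtein distance, single-row dynamic programming.
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1, prev + (ca != cb))
    return dp[len(b)]

def fuzzy_boost(query_term: str, index_term: str) -> float:
    # Assumed shape of the distance-based boost: 1.0 for an exact match,
    # reduced by distance relative to the shorter of the two terms.
    d = edit_distance(query_term, index_term)
    return 1.0 - d / min(len(query_term), len(index_term))
```

With the three-letter tokens from this issue, a one-edit match like aab vs abb would get a boost of about 0.67 rather than 1.0, which is why the fuzzy expansions score slightly below the exact term but still contribute.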
While I understand the current logic, I still argue that it effectively ends up ranking an exact match below a fuzzy match, because the same input token is matched twice against different document tokens, even when one of those document tokens has an exact match with a different input token. Here abb in the input is an exact match for abb in the document, yet aab is fuzzy-matched with both aae and abb?

@clintongormley Is there any way to tweak the ranking algorithm to avoid scoring an input token against a document token which already has an exact match (or anything which would always result in full exact matches being ranked above a partial exact match + fuzzy match, really)?

@jimczi the validate result seems very weird, as it doesn't account for the exact match abb<->abb. Does this stem from using a specific rewrite
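The double match described above can be made concrete with the tokens from the repro. This is only a brute-force sketch of which (input token, document token) pairs fall within one edit, not how Lucene enumerates fuzzy terms:

```python
from itertools import product

def within_one_edit(a: str, b: str) -> bool:
    # Brute-force check for Levenshtein distance <= 1 between short tokens.
    if a == b:
        return True
    if abs(len(a) - len(b)) > 1:
        return False
    if len(a) == len(b):  # same length: exactly one substitution allowed
        return sum(x != y for x, y in zip(a, b)) == 1
    # lengths differ by one: deleting one char of the longer must yield the shorter
    s, l = (a, b) if len(a) < len(b) else (b, a)
    return any(l[:i] + l[i + 1:] == s for i in range(len(l)))

query_tokens = ["aab", "abb"]
doc_tokens = ["aae", "abb"]
matches = [(q, d) for q, d in product(query_tokens, doc_tokens) if within_one_edit(q, d)]
# aab fuzzily matches both aae and abb; abb matches abb exactly,
# so the document token abb is scored twice.
```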
@jeantil the validate query uses |
yet I did provide a very real-world example of why this issue is important to me. We have a search for cars from all manufacturers. You have probably heard of the Porsche 911 in the real world. Except that's a fairly old line of cars, so Porsche created multiple versions of the Porsche 911:
According to our domain experts, people "in the know" talk about and search for these cars using

This is not the only brand causing issues. BMW classifies cars as Serie 1, Serie 2, Serie 3, etc., then declines each series into different categories: (E30) Berline, (E36) Berline, (E46) cabrio, etc. People use these to refer to their cars and therefore to search for them. I'll let you guess what happened to my test cases when I switched from 1.7.x with the now retired Fuzzy Like This to 5.x with a fuzzy multi_match.

At the same time, in the same index, we have brands with much longer names in which typos are easily made. I took the time to create an easily reproducible test case, but it was driven by a very real-world issue. By the way, our current query is already heavily tweaked:
Sorry, my wording was not clear enough. You should not try to apply fuzziness to a three-letter word. If people search for
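One way to act on this advice is to restrict fuzziness by term length. Newer Elasticsearch versions accept AUTO fuzziness with custom length thresholds, AUTO:[low],[high], where terms shorter than low must match exactly. A sketch of such a query (the field name and thresholds are made up for illustration):

```json
{
  "query": {
    "match": {
      "model": {
        "query": "porsche 911",
        "fuzziness": "AUTO:4,8"
      }
    }
  }
}
```

Here a short token like 911 would only match exactly, while longer brand names would still tolerate one or two edits.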
Thanks for taking the time. I fail to understand how I can "remove the problematic words", since they convey 100% of the information I am trying to retrieve. I have created a topic on the forum to continue this discussion. Thanks again for your help.
Elasticsearch version: 5.1.2
Plugins installed: [] (no plugins installed)
JVM version: OpenJDK 64-Bit Server VM/1.8.0_111/25.111-b14
OS version: MacOS sierra 10.12.2 (16C67)
Description of the problem including expected versus actual behavior:
Despite #9103, match/multi_match ranks misspellings over exact matches under some fairly precise conditions.
Steps to reproduce:
I expect
I get
Looking at the explain output, the norms factor is always the same: 0.89722675 (I'll call it $norms).

It seems that for

aab abb

the score is 0.9479706.

Whereas for

aae abb
I get a score of 0.9419285 as