Some partial search results in description not boosted by relevancy #1335
Additional test cases from Alan Elbaum:
Set min gram size to 2, preserveOriginal=true for solr edge ngram filter (#1335)
@blms The search results are still not boosted by relevance: as you can see, result 4 should have come before results 2 and 3.
@amelbensalim You're right! @rlskoeser any idea off the bat about what might be causing this? Looks like it might be
Maybe we can boost the phrase matching more strongly? In the PUL Solr config they have phrase slop matching boosted by crazy amounts. I forget, do we have edge ngrams set to 1 or 2? Is this behavior specific to one-character searches or does it also happen with two?
@rlskoeser good call! We set it to 2, but I'm not sure how to test it with 1 vs 2. I'll look into the phrase match boosting.
Improve boosts for description exact matches (#1335)
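For readers unfamiliar with the phrase boosting discussed above: Solr's edismax parser accepts `qf` (fields searched term by term), `pf` (fields where documents get an extra boost when the query terms occur together as a phrase), and `ps` (how much slop the phrase match tolerates). The sketch below only builds the request parameters. The field names and boost values are hypothetical placeholders, not this project's actual configuration.

```python
def build_edismax_params(user_query, phrase_boost=50, slop=2):
    """Build request parameters for a Solr edismax query.

    The field names (description_txt, description_exact) and the boost
    values are assumptions for illustration only.
    """
    return {
        "defType": "edismax",
        "q": user_query,
        # Fields matched term by term.
        "qf": "description_txt description_exact^2",
        # Phrase fields: documents where the terms occur together score higher.
        "pf": f"description_exact^{phrase_boost}",
        # Phrase slop: how far apart the terms may be and still count.
        "ps": str(slop),
        "sort": "score desc",
    }


params = build_edismax_params("Harun b. Y")
print(params["pf"])
```

Raising `phrase_boost` relative to the `qf` weights is the knob that should push full-phrase matches above documents that match only one of the terms.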
@blms So searching Harun b. Y* gives me the correct results and the whole name is highlighted. Searching Abu l-Mun* also gives me the correct results sorted by relevancy, but it only highlights the "Abu."
@kseniaryzhova Unfortunately that's a limitation with the dash: the highlighting works correctly if you type We do some processing where we join all dashes into single words for the exact matches; if you type If the only concern here is phrase highlighting, is it acceptable as is, or would it be worth the tradeoff of treating all hyphenated phrases as single unbroken words? I'd be curious to hear @rlskoeser's thoughts on this as well.
@blms yes it's good to close, we'll just add help text to the website! Thanks!
Phrase matching needs to be boosted higher as, for example,
Update on that: Solr edismax phrase matching does not seem to like the asterisk. However,
@blms after the first 4 results, results are no longer boosted by relevancy and I'm only getting partial matches for one or the other word (there are at least 8 instances of phrases including אלמרכב אלצ in close proximity).
@kseniaryzhova Can you link the four that do not show up so I can investigate?
@blms מרכב אלצלטאן does not show up at all even though it appears four times https://test-geniza.cdh.princeton.edu/en/documents/?q=%D7%9E%D7%A8%D7%9B%D7%91+%D7%90%D7%9C%D7%A6%D7%9C%D7%98%D7%90%D7%9F&docdate_0=&docdate_1=&sort=relevance
I'm not sure if it should consider
@blms it's a partial search though, so I'm not sure why it wouldn't recognize the super close match? Maybe this is an issue of knowing it's Hebrew or JA.
@kseniaryzhova FWIW,
@blms they do bring up the correct search results, I just think it is very confusing for users to know when to add the definite article and when not to. The word remains the same (with some implications for grammar: one is "a small ship" and the other is "the ship of the sultan", so there is a grammatical difference), so partial search results should capture it, right? Is this something that tagging the text as Hebrew/JA could help with? I mean, when I search in English, I don't expect grammar to always be followed, just word order. Although in English word order is grammar, so that may not even be a helpful comparison.
It's hard to say, though it seems the issue here is that your search term includes more characters than are present in the result. @rlskoeser I see in the indexing code that we are only applying edge ngram filtering on index, but not on query. Could this be related? And what was the reasoning behind that?
@kseniaryzhova I would say almost certainly yes, because this seems like something that would not be possible to know without some knowledge of the language.
@kseniaryzhova To give a little explanation about how the partial matching works right now:
If the system knew your search term was Hebrew, it could try to separate
@kseniaryzhova OK to close this one since the results, when they appear at all, are correctly sorted by relevance? And then the remaining issue is handled by #1582?
The goal was for search terms that are partial matches of the contents to return results without the user having to add wildcards. I think in most cases applying the edge ngram filtering on the query would have very confusing results. It does sound like the best solution for this particular use case would be to move towards language-specific indexing.
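To make the index-only edge ngram behavior concrete, here is a toy Python model of it. It is a sketch, not the Solr implementation: the whitespace tokenizer, the `max_gram=15` default, and the helper names are assumptions. Only the minimum gram size of 2 and `preserveOriginal=true` come from the thread.

```python
def edge_ngrams(token, min_gram=2, max_gram=15, preserve_original=True):
    """Return the prefixes of token between min_gram and max_gram characters."""
    grams = [token[:n] for n in range(min_gram, min(len(token), max_gram) + 1)]
    # Mimic preserveOriginal: keep tokens shorter than min_gram or longer than max_gram.
    if preserve_original and token not in grams:
        grams.append(token)
    return grams


def indexed_terms(text):
    """Index-time analysis: lowercase, split on whitespace, expand to edge ngrams."""
    terms = set()
    for token in text.lower().split():
        terms.update(edge_ngrams(token))
    return terms


def matches(query, text, ngram_query=False):
    """True if every query token hits the index (hypothetical AND semantics)."""
    index = indexed_terms(text)
    for token in query.lower().split():
        candidates = edge_ngrams(token) if ngram_query else [token]
        if not any(c in index for c in candidates):
            return False
    return True


# A partial word matches without a wildcard because its prefix was indexed.
print(matches("Har", "Harun b. Yaish"))  # True
# A query with extra leading characters (the definite article) is not a prefix of anything indexed.
print(matches("אלמרכב", "מרכב אלצלטאן"))  # False
# With ngrams applied to the query too, "Harun" matches "Hassan" via the shared prefix "ha".
print(matches("Harun", "Hassan b. Levi", ngram_query=True))  # True
```

The second case mirrors the missing Hebrew results, and the third shows the kind of confusing match that query-side ngrams would introduce.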
@blms per discussion above, closing this! Thank you!
testing notes (QA - round two)
In the QA site document search:
אלמרכב AND אלצ
should return the results, in correct order for relevance, that Ksenia was trying to achieve with *אלמרכב אלצ
testing notes (QA)
In the QA site document search, try the following kinds of searches:
Harun b. Y should be written Harun b. Y*.
"Ḥārūn b. Yaʿīsh" should work as expected (that search should bring up PGPID 32249)
"Moshe b. Levi"
Note: "Naṣr b. Sālim" still will not work due to #1475. There might be other records where the space is a unicode character. If something fails and you're not sure, I can check.
Note: shelfmark:"T-S 12.30" is also still not putting the most relevant result first, but not sure why; seems like a relevance score issue, and we're having issues with the shelfmark scoped search anyway. Created a new issue for that as well: #1476
Describe the bug
Searching for Harun b. Y (partial name) should yield at least 5 relevant results, with those 5 appearing first. Currently I don't even get matches for Harun in the first page of the results in the public site search:
To reproduce
Steps to reproduce the behavior:
Expected behavior
I want partial word search in descriptions, sorted by relevancy, to bring up the most relevant matches first. For this particular case the five Harun b. Y names should appear first, followed by all the other Harun matches, and then everything else.
Additional context
The Abu l-Mun partial search (which was the previous test case) still works fine.
dev notes