Some partial search results in description not boosted by relevancy #1335

Closed
4 of 5 tasks
kseniaryzhova opened this issue Feb 28, 2023 · 22 comments
@kseniaryzhova

kseniaryzhova commented Feb 28, 2023

testing notes (QA - round two)

In the QA site document search:

  • Searching אלמרכב AND אלצ should return the results, in correct order for relevance, that Ksenia was trying to achieve with *אלמרכב אלצ

testing notes (QA)

In the QA site document search, try the following kinds of searches:

  • For partial names where you only have one letter of a longer word, use an asterisk. For example, the partial search Harun b. Y should be written Harun b. Y*.
    • This search should bring up the correct results, above "Yevr" shelfmarks.
  • Searches with single character words in them, such as "Ḥārūn b. Yaʿīsh", should work as expected (that search should bring up PGPID 32249)
    • The "b." should be highlighted in results, including in searches like "Moshe b. Levi"

Note: "Naṣr b. Sālim" still will not work due to #1475. There might be other records where the space is a unicode character. If something fails and you're not sure, I can check.

Note: shelfmark:"T-S 12.30" is also still not putting the most relevant result first, but not sure why; seems like a relevance score issue, and we're having issues with the shelfmark scoped search anyway. Created a new issue for that as well: #1476


Describe the bug
Searching for Harun b. Y (partial name) should yield at least 5 relevant results, with those 5 appearing first. Currently I don't even get matches for Harun on the first page of results in the public site search.

To reproduce
Steps to reproduce the behavior:

  1. Go to public site search.
  2. Search for Harun b. Y and sort by Relevance.
  3. See results: https://geniza.princeton.edu/en/documents/?q=Harun+b.+Y&docdate_0=&docdate_1=&sort=relevance
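For anyone scripting this check, here is a minimal Python sketch that rebuilds the search URL from step 3. The parameter names (`q`, `docdate_0`, `docdate_1`, `sort`) are copied from the link above; the `search_url` helper itself is hypothetical and not part of the project.

```python
from urllib.parse import urlencode

# Base URL taken from the link in the reproduction steps.
BASE_URL = "https://geniza.princeton.edu/en/documents/"


def search_url(query: str, sort: str = "relevance") -> str:
    """Build a document search URL with the same parameters as the reported link."""
    # Empty date range fields are kept because the reported link includes them.
    params = {"q": query, "docdate_0": "", "docdate_1": "", "sort": sort}
    return f"{BASE_URL}?{urlencode(params)}"


print(search_url("Harun b. Y"))
```

The same helper can be called with `"Harun b. Y*"` to try the wildcard form mentioned in the testing notes.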

Expected behavior
I want partial word search in descriptions, sorted by relevancy, to bring up the most relevant matches first. For this particular case, the five Harun b. Y names should appear first, followed by all the other Haruns and then everything else.

Additional context
The Abu l-Mun partial search works fine still (which was the previous test case).

dev notes

  • maybe worth trying edge ngram indexing with min 1 to see if it resolves this. I think it should match without that, as we are keeping the unsliced words as well, but it seems like a potentially simple solution that may help other search problems too
@kseniaryzhova kseniaryzhova added 🐛 bug Something isn't working performant Tasks for or taken on by Performant labels Feb 28, 2023
@blms blms changed the title Partial search results in description not boosted by relevancy Some partial search results in description not boosted by relevancy Feb 28, 2023
@blms blms self-assigned this Mar 2, 2023
blms added a commit that referenced this issue Mar 2, 2023
blms added a commit that referenced this issue Mar 9, 2023
@blms
Contributor

blms commented Jul 31, 2023

Additional test cases from Alan Elbaum:

  • "Naṣr b. Sālim" should bring up PGPID 17583
    • Some with "b." do seem to work, such as "Moshe b. Levi", though highlighting is broken
  • T-S shelfmarks, for example shelfmark:"T-S 12.30" should bring up PGPID 3122

blms added a commit that referenced this issue Oct 18, 2023
Set min gram size to 2, preserveOriginal=true for solr edge ngram filter (#1335)
@blms blms added the 🗜️ awaiting testing Implemented and ready to be tested label Oct 18, 2023
@amelbensalim

@blms The search results are still not boosted by relevance -- as you can see, result 4 should've been before results 2 and 3.
image

image

@blms
Contributor

blms commented Oct 18, 2023

@amelbensalim You're right!

@rlskoeser any idea off the bat about what might be causing this? Looks like it might be Yevr. in the shelfmark that's matching Y* and throwing it off, but I would think the presence of Hārūn in the 4th result would boost the description relevance enough. Maybe we need to tweak the boosts for multiple matched words appearing in the description? I have a feeling it's going to be more solr admin deep dives for me 😆

@blms blms added ⚠️ tested needs attention Has been through acceptance testing and needs additional work and removed 🗜️ awaiting testing Implemented and ready to be tested labels Oct 18, 2023
@rlskoeser
Contributor

maybe we can boost the phrase matching more strongly? in the PUL solr config they have phrase slop matching boosted crazy amounts

I forget, do we have edge ngrams set to 1 or 2? is this behavior specific to one-character searches or does it also happen with two?

@blms
Contributor

blms commented Oct 18, 2023

@rlskoeser good call! We set it to 2, but I'm not sure how to test it with 1 vs 2. I'll look into the phrase match boosting.
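To make the phrase boosting idea concrete, here is a rough Python sketch of the request parameters involved. `defType`, `qf`, `pf`, and `ps` are standard Solr edismax parameters; the field names (`description_txt`, `shelfmark_txt`) and the boost values are assumptions for illustration only, not the project's actual configuration.

```python
def edismax_params(query: str, phrase_boost: int = 50, phrase_slop: int = 2) -> dict:
    """Assemble edismax parameters that boost whole-phrase matches in the description."""
    return {
        "defType": "edismax",
        "q": query,
        # Fields searched term by term (hypothetical field names and weights).
        "qf": "description_txt shelfmark_txt^0.5",
        # Extra boost when all query terms appear together as a phrase.
        "pf": f"description_txt^{phrase_boost}",
        # How many positions apart the phrase terms may be and still count.
        "ps": str(phrase_slop),
    }


print(edismax_params("Harun b. Y"))
```

Raising the `pf` boost relative to `qf` is one way a description containing the whole name could outrank a record that only matches Y* in the shelfmark.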

blms added a commit that referenced this issue Feb 12, 2024
Improve boosts for description exact matches (#1335)
@blms blms removed the ⚠️ tested needs attention Has been through acceptance testing and needs additional work label Feb 12, 2024
@blms blms added the 🗜️ awaiting testing Implemented and ready to be tested label Feb 23, 2024
@kseniaryzhova
Author

@blms So searching Harun b. Y* gives me the correct results and the whole name is highlighted. But searching Abu l-Mun* also gives me the correct results sorted by relevancy but only highlights the "Abu."
image

@kseniaryzhova kseniaryzhova added ⚠️ tested needs attention Has been through acceptance testing and needs additional work and removed 🗜️ awaiting testing Implemented and ready to be tested labels Feb 26, 2024
@blms
Contributor

blms commented Feb 26, 2024

@kseniaryzhova Unfortunately that's a limitation with the dash—the highlighting works correctly if you type Abu l Mun*. The index is treating the dash as a separator, so it doesn't think the whole phrase matched when a dash is joining the search query terms.

We do some processing where we join all dashes into single words for the exact matches; if you type "Abu l-Muna" with double quotes you can see that working correctly. One option would be to always join dashes like this, even on non-exact searches. My only concern is that it could break the case for which the separator behavior was originally designed: English! In English we often use dashes that aren't part of a single word; they join multiple words.

If the only concern here is phrase highlighting, is it acceptable as is, or would it be worth the tradeoff of treating all hyphenated phrases as single unbroken words?

I'd be curious to hear @rlskoeser's thoughts on this as well.
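The two dash treatments described above can be sketched in Python as follows. This is only an illustration of the tokenizations, not of Solr's highlighting; both helper functions are hypothetical, and the exact joining rule used for exact matches is an assumption.

```python
import re


def tokenize(text: str) -> list[str]:
    """Split on whitespace and dashes, i.e. treat the dash as a separator."""
    return [t for t in re.split(r"[\s\-]+", text) if t]


def join_dashes(text: str) -> str:
    """Collapse hyphenated words into single words (assumed exact-match behavior)."""
    return text.replace("-", "")


# Dash as separator: "l-Mun" becomes two tokens, same as typing "Abu l Mun".
print(tokenize("Abu l-Mun"))
# Dashes joined first: "l-Muna" stays one token.
print(tokenize(join_dashes("Abu l-Muna")))
# The tradeoff: an English compound also collapses into one token.
print(tokenize(join_dashes("well-known")))
```

The last line shows the concern: with dashes always joined, a search for "known" would no longer line up with a token from "well-known".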

@blms blms added the ❓ question Further information is requested label Feb 26, 2024
@kseniaryzhova
Author

@blms yes it's good to close, we'll just add help text to the website! Thanks!

@blms blms removed ❓ question Further information is requested ⚠️ tested needs attention Has been through acceptance testing and needs additional work labels Mar 11, 2024
@kseniaryzhova kseniaryzhova reopened this Mar 25, 2024
@blms
Contributor

blms commented Mar 25, 2024

Phrase matching needs to be boosted higher as, for example, *אלמרכב אלצ results in matches without the full phrase appearing above matches with the full phrase.

@blms
Contributor

blms commented Mar 25, 2024

Update on that: solr edismax phrase matching does not seem to like the asterisk. However, אלמרכב אלצ does appear to bring up the results in the correct order. (אלמרכב AND אלצ after next update)

@blms blms added the 🗜️ awaiting testing Implemented and ready to be tested label Apr 19, 2024
@kseniaryzhova
Author

@blms after the first 4 results, results are no longer boosted by relevancy and I'm only getting partial matches for one or the other word (There are at least 8 instances of phrases including אלמרכב אלצ in close proximity)

@kseniaryzhova kseniaryzhova added ⚠️ tested needs attention Has been through acceptance testing and needs additional work and removed 🗜️ awaiting testing Implemented and ready to be tested labels Apr 22, 2024
@blms
Contributor

blms commented Apr 22, 2024

@kseniaryzhova Can you link the four that do not show up so I can investigate?

@kseniaryzhova
Author

@blms
Contributor

blms commented Apr 22, 2024

I'm not sure if it should consider מרכב to match a search for אלמרכב, or how we would tell it to do so. Is it something that would be improved if the system knew this were Hebrew or Judaeo-Arabic?

@kseniaryzhova
Author

@blms it's a partial search though so I'm not sure why it wouldn't recognize the super close match? Maybe this is an issue of knowing it's Hebrew or JA

@blms
Contributor

blms commented Apr 22, 2024

@kseniaryzhova FWIW, מרכב AND אלצ seems to bring up the correct results, I think. Would that be an acceptable solution?

@kseniaryzhova
Author

kseniaryzhova commented Apr 22, 2024

@blms they do bring up the correct search results, I just think this is very confusing for users, when to add in the definite article and when not to add it. The word remains the same (with some implications for grammar, one is "a small ship" and the other is "the ship of the sultan" so there is a grammatical difference), so partial search results should capture it, right? Would this be something that could be helped with tagging this as Hebrew/JA?

I mean, when I search in English, I don't expect grammar to always be followed, just word order. Although with English, word order is grammar, so that may not even be a helpful comparison.

@blms
Contributor

blms commented Apr 22, 2024

> @blms it's a partial search though so I'm not sure why it wouldn't recognize the super close match? Maybe this is an issue of knowing it's Hebrew or JA

It's hard to say, though it seems the issue here is that your search term includes more characters than are present in the result.

@rlskoeser I see in the indexing code that we are only applying edge ngram filtering on index, but not on query. Could this be related? And what was the reasoning behind that?

> @blms they do bring up the correct search results, I just think this is very confusing for users, when to add in the definite article and when not to add it. The word remains the same (with some implications for grammar, one is "a small ship" and the other is "the ship of the sultan" so there is a grammatical difference), so partial search results should capture it, right? Would this be something that could be helped with tagging this as Hebrew/JA?

@kseniaryzhova I would say almost certainly yes, because this seems like something that would not be possible to know without some knowledge of the language.

@blms
Contributor

blms commented Apr 22, 2024

@kseniaryzhova To give a little explanation about how the partial matching works right now:

  • the word מרכב appears in the transcription to be indexed
  • in indexing, solr tokenizes it as מר, מרכ, מרכב
  • you search אלמרכב and solr tries to match it against one of those tokens exactly: no match
  • you search מרכב and it matches the last of those three tokens (the full word) exactly

If the system knew your search term were Hebrew, it could try to separate מרכב as a stem for the word, and consider that as the actual token to match. But since we don't have that capability right now, it's only capable of matching the word itself as entered, or smaller.
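The steps above can be sketched in Python. This is a simplified model of index-time edge n-grams, assuming a minimum gram size of 2 (as in the commit on this issue), no upper limit on gram size, and that words shorter than the minimum are kept whole; the function names are hypothetical and this is not the actual Solr analyzer.

```python
def edge_ngrams(word: str, min_gram: int = 2) -> list[str]:
    """Return the prefixes of `word` from `min_gram` characters up to the full word."""
    if len(word) < min_gram:
        # Mimics preserveOriginal: a too-short word is still indexed as itself.
        return [word]
    return [word[:n] for n in range(min_gram, len(word) + 1)]


def matches(query_term: str, indexed_word: str) -> bool:
    """A query term matches only if it equals one of the index-time tokens exactly."""
    return query_term in edge_ngrams(indexed_word)


# Tokens stored for the indexed word: two-letter prefix, three-letter prefix, full word.
print(edge_ngrams("מרכב"))
# The full word and any stored prefix match.
print(matches("מרכב", "מרכב"))
print(matches("מר", "מרכב"))
# A query with the definite article is longer than every stored token, so no match.
print(matches("אלמרכב", "מרכב"))
```

Because every stored token is a prefix of the indexed word, a query can only be as long as that word or shorter; extra leading characters such as the article always miss.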

@blms
Contributor

blms commented Apr 23, 2024

@kseniaryzhova Ok to close this one since the results—when they appear at all—are correctly sorted by relevance? And then the remaining issue is handled by #1582?

@rlskoeser
Contributor

> @rlskoeser I see in the indexing code that we are only applying edge ngram filtering on index, but not on query. Could this be related? And what was the reasoning behind that?

The goal was for search terms that are partial matches of the contents to return results without the user having to add wildcards. I think in most cases applying the edge ngram filtering on the query would have very confusing results.

It does sound like the best solution for this particular use case would be to move towards language-specific indexing.
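As a rough illustration of why query-side edge n-grams would be confusing, here is a small Python comparison of the two setups. It uses the same simplified prefix model as the thread (minimum gram size 2, assumed); "Halfon" is a made-up example word and the function names are hypothetical.

```python
def edge_ngrams(word: str, min_gram: int = 2) -> set[str]:
    """Prefixes of `word` from `min_gram` characters up to the full word."""
    if len(word) < min_gram:
        return {word}
    return {word[:n] for n in range(min_gram, len(word) + 1)}


def match_index_only(query: str, indexed: str) -> bool:
    """Current setup: n-grams at index time, query term matched as typed."""
    return query in edge_ngrams(indexed)


def match_both_sides(query: str, indexed: str) -> bool:
    """Alternative: n-grams on both sides, so any shared prefix counts as a match."""
    return bool(edge_ngrams(query) & edge_ngrams(indexed))


# Index-only: "Harun" does not match an unrelated word.
print(match_index_only("Harun", "Halfon"))
# Both sides: the shared prefix "Ha" is enough to match, which is very noisy.
print(match_both_sides("Harun", "Halfon"))
# Even with both sides n-grammed, the definite-article case still does not match.
print(match_both_sides("אלמרכב", "מרכב"))
```

Under this model, query-side n-grams add a lot of noise without helping the אלמרכב case, since none of the query's prefixes coincide with a prefix of מרכב.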

@kseniaryzhova
Author

kseniaryzhova commented Apr 24, 2024

@blms per discussion above, closing this! Thank you!

@blms blms removed the ⚠️ tested needs attention Has been through acceptance testing and needs additional work label Apr 24, 2024
5 participants