Some partial search results in description not boosted by relevancy #1335

kseniaryzhova · 2023-02-28T20:16:34Z

testing notes (QA - round two)

Searching אלמרכב AND אלצ should return the results, in correct order for relevance, that Ksenia was trying to achieve with *אלמרכב אלצ

testing notes (QA)

In the QA site document search, try the following kinds of searches:

For partial names where you only have one letter of a longer word, use an asterisk. For example, the partial search Harun b. Y should be written Harun b. Y*.
- This search should bring up the correct results, above "Yevr" shelfmarks.
Searches with single character words in them, such as "Ḥārūn b. Yaʿīsh", should work as expected (that search should bring up PGPID 32249)
- The "b." should be highlighted in results, including in searches like "Moshe b. Levi"

Note: "Naṣr b. Sālim" still will not work due to #1475. There might be other records where the space is a unicode character. If something fails and you're not sure, I can check.

Note: shelfmark:"T-S 12.30" is also still not putting the most relevant result first, but not sure why; seems like a relevance score issue, and we're having issues with the shelfmark scoped search anyway. Created a new issue for that as well: #1476

Describe the bug
Searching for Harun b. Y (partial name) should yield at least 5 relevant results, with those 5 appearing first. Currently I don't even get matches for Harun on the first page of results in the public site search.

To reproduce
Steps to reproduce the behavior:

Go to public site search.
Search for Harun b. Y and sort by Relevance.
See results: https://geniza.princeton.edu/en/documents/?q=Harun+b.+Y&docdate_0=&docdate_1=&sort=relevance

Expected behavior
I want partial word search in descriptions, sorted by relevance, to bring up the most relevant matches first. In this particular case the five Harun b. Y names should appear first, followed by all the other Haruns and then everything else.

Additional context
The Abu l-Mun partial search works fine still (which was the previous test case).

dev notes

Maybe worth trying edge ngram indexing with a minimum gram size of 1 to see if it resolves this. I think it should match without single-character grams, since we are keeping the unsliced words as well, but it seems like a potentially simple solution that may help other search problems too.
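To make the min 1 vs. min 2 question concrete, here is a minimal Python sketch that imitates what an edge ngram filter emits for a single word. It is not the project's Solr config: the function name `edge_ngrams`, the `max_gram` value, and the sample words are assumptions for illustration, and the `preserve_original` flag is modeled on my understanding of Solr's `preserveOriginal` option (keep the original term when it falls outside the gram size range).

```python
def edge_ngrams(term, min_gram=2, max_gram=15, preserve_original=True):
    """Return the leading-edge prefixes of `term`, roughly as an edge ngram filter would.

    max_gram=15 is an assumed value, not taken from the project config.
    """
    upper = min(len(term), max_gram)
    grams = [term[:n] for n in range(min_gram, upper + 1)]
    # Keep the whole term if it was too short or too long to appear as a gram.
    if preserve_original and term not in grams:
        grams.append(term)
    return grams


# With a minimum gram size of 1, a one-letter query like "Y" is itself an indexed token.
min1 = edge_ngrams("Yevr", min_gram=1)
# With a minimum gram size of 2, "Y" is not indexed, so a bare "Y" needs a wildcard (Y*).
min2 = edge_ngrams("Yevr", min_gram=2)

print("Y" in min1, "Y" in min2)

# A one-character word such as the "b" in "b." only survives min_gram=2
# when the original term is preserved.
print(edge_ngrams("b", min_gram=2, preserve_original=True))
print(edge_ngrams("b", min_gram=2, preserve_original=False))
```

If this model is roughly right, a minimum of 1 would make every word starting with "Y" (including "Yevr" shelfmarks) an exact token match for a bare "Y", which may be part of the relevance trade-off.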


blms · 2023-07-31T14:00:31Z

Additional test cases from Alan Elbaum:

"Naṣr b. Sālim" should bring up PGPID 17583
- Some with "b." do seem to work, such as "Moshe b. Levi", though highlighting is broken
T-S shelfmarks, for example shelfmark:"T-S 12.30" should bring up PGPID 3122

Set min gram size to 2, preserveOriginal=true for solr edge ngram filter (#1335)

amelbensalim · 2023-10-18T19:27:37Z

@blms The search results are still not boosted by relevance -- as you can see, result 4 should've been before results 2 and 3.

blms · 2023-10-18T19:34:00Z

@amelbensalim You're right!

@rlskoeser any idea off the bat about what might be causing this? Looks like it might be Yevr. in the shelfmark that's matching Y* and throwing it off, but I would think the presence of Hārūn in the 4th result would boost the description relevance enough. Maybe we need to tweak the boosts for multiple matched words appearing in the description? I have a feeling it's going to be more solr admin deep dives for me 😆

rlskoeser · 2023-10-18T19:38:53Z

maybe we can boost the phrase matching more strongly? in the PUL solr config they have phrase slop matching boosted crazy amounts

I forget, do we have edge ngrams set to 1 or 2? is this behavior specific to one-character searches or does it also happen with two?
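For reference, a rough sketch of what stronger phrase boosting could look like as edismax request parameters, written in Python. `defType`, `qf`, `pf`, and `ps` are standard edismax parameters; the field names (`description`, `shelfmark`) and every boost value here are placeholders, not the values in this project's or PUL's config.

```python
from urllib.parse import urlencode


def build_params(query, phrase_boost=50, slop=2):
    """Build hypothetical edismax parameters that favor whole-phrase matches."""
    return {
        "defType": "edismax",
        "q": query,
        # Per-term matching: which fields to search, with assumed weights.
        "qf": "description^2 shelfmark^1",
        # Phrase matching: extra boost when the query terms appear together.
        "pf": f"description^{phrase_boost}",
        # Phrase slop: how far apart the terms may be and still count as a phrase.
        "ps": str(slop),
    }


params = build_params("Moshe b. Levi")
print(urlencode(params))
```

The idea is that a document whose description contains all the query terms close together gets the `pf` boost on top of its per-term score, so it should outrank a document that only matches one term in a shelfmark.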

blms · 2023-10-18T19:41:48Z

@rlskoeser good call! We set it to 2, but I'm not sure how to test it with 1 vs 2. I'll look into the phrase match boosting.

Improve boosts for description exact matches (#1335)

kseniaryzhova · 2024-02-26T02:02:36Z

@blms So searching Harun b. Y* gives me the correct results and the whole name is highlighted. But searching Abu l-Mun* also gives me the correct results sorted by relevancy but only highlights the "Abu."

blms · 2024-02-26T16:54:48Z

@kseniaryzhova Unfortunately that's a limitation with the dash—the highlighting works correctly if you type Abu l Mun*. The index is treating the dash as a separator, so it doesn't think the whole phrase matched when a dash is joining the search query terms.

For exact matches, we do some processing that joins hyphenated terms into single words; if you type "Abu l-Muna" with double quotes you can see that working correctly. One option would be to always join dashes like this, even on non-exact searches. My only concern is that this could break the case it was originally designed for: English! In English, of course, dashes are frequently not part of a single word; they often join multiple words.
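A small Python sketch of the trade-off, under stated assumptions: the `tokenize` helper below is hypothetical, and it models "joining" as simply deleting the dash before splitting, which may not be exactly what our processing does.

```python
import re


def tokenize(text, join_hyphens=False):
    """Split text into lowercase tokens, optionally joining hyphenated words first."""
    if join_hyphens:
        # Assumed behavior: drop the dash so both halves become one token.
        text = text.replace("-", "")
    # Otherwise the dash acts as a separator, just like whitespace.
    return [t.lower() for t in re.split(r"[\s\-]+", text) if t]


# Dash as separator: "l" and "Muna" are independent tokens.
print(tokenize("Abu l-Muna"))
# Dash joined: "l-Muna" becomes one token.
print(tokenize("Abu l-Muna", join_hyphens=True))

# The English concern: after joining, a search for "known" no longer
# matches "well-known" as a whole token.
print(tokenize("well-known author", join_hyphens=True))
```

With joining always on, the Arabic name behaves as one unit, but hyphenated English compounds stop matching on their individual words.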

If the only concern here is phrase highlighting, is it acceptable as is, or would it be worth the tradeoff of treating all hyphenated phrases as single unbroken words?

I'd be curious to hear @rlskoeser's thoughts on this as well.

kseniaryzhova · 2024-03-11T16:50:10Z

@blms yes it's good to close, we'll just add help text to the website! Thanks!

blms · 2024-03-25T16:57:19Z

Phrase matching needs to be boosted higher as, for example, *אלמרכב אלצ results in matches without the full phrase appearing above matches with the full phrase.

blms · 2024-03-25T20:41:34Z

Update on that: solr edismax phrase matching does not seem to like the asterisk. However, אלמרכב אלצ does appear to bring up the results in the correct order. (אלמרכב AND אלצ after next update)

kseniaryzhova · 2024-04-22T20:08:32Z

@blms after the first 4 results, results are no longer boosted by relevancy and I'm only getting partial matches for one or the other word (There are at least 8 instances of phrases including אלמרכב אלצ in close proximity)

blms · 2024-04-22T20:10:58Z

@kseniaryzhova Can you link the four that do not show up so I can investigate?

kseniaryzhova · 2024-04-22T20:12:37Z

@blms מרכב אלצלטאן does not show up at all even though it appears four times https://test-geniza.cdh.princeton.edu/en/documents/?q=%D7%9E%D7%A8%D7%9B%D7%91+%D7%90%D7%9C%D7%A6%D7%9C%D7%98%D7%90%D7%9F&docdate_0=&docdate_1=&sort=relevance

blms · 2024-04-22T20:14:40Z

I'm not sure if it should consider מרכב to match a search for אלמרכב, or how we would tell it to do so. Is it something that would be improved if the system knew this were Hebrew or Judaeo-Arabic?

kseniaryzhova · 2024-04-22T20:15:58Z

@blms it's a partial search though so I'm not sure why it wouldn't recognize the super close match? Maybe this is an issue of knowing it's Hebrew or JA

blms · 2024-04-22T20:16:00Z

@kseniaryzhova FWIW, מרכב AND אלצ seems to bring up the correct results, I think. Would that be an acceptable solution?

kseniaryzhova · 2024-04-22T20:24:18Z

@blms they do bring up the correct search results, I just think this is very confusing for users, when to add in the definite article and when not to add it. The word remains the same (with some implications for grammar, one is "a small ship" and the other is "the ship of the sultan" so there is a grammatical difference), so partial search results should capture it, right? Would this be something that could be helped with tagging this as Hebrew/JA?

I mean, when I search in English, I don't expect grammar to always be followed, just word order. Although with English word order is grammar, so that may not even be a helpful comparison.

blms · 2024-04-22T20:24:47Z

> @blms it's a partial search though so I'm not sure why it wouldn't recognize the super close match? Maybe this is an issue of knowing it's Hebrew or JA

It's hard to say, though it seems the issue here is that your search term includes more characters than are present in the result.

@rlskoeser I see in the indexing code that we are only applying edge ngram filtering on index, but not on query. Could this be related? And what was the reasoning behind that?

> @blms they do bring up the correct search results, I just think this is very confusing for users, when to add in the definite article and when not to add it. The word remains the same (with some implications for grammar, one is "a small ship" and the other is "the ship of the sultan" so there is a grammatical difference), so partial search results should capture it, right? Would this be something that could be helped with tagging this as Hebrew/JA?

@kseniaryzhova I would say almost certainly yes, because this seems like something that would not be possible to know without some knowledge of the language.

blms · 2024-04-22T20:35:53Z

@kseniaryzhova To give a little explanation about how the partial matching works right now:

- The word מרכב appears in the transcription to be indexed.
- During indexing, Solr tokenizes it as מר, מרכ, מרכב.
- You search אלמרכב and Solr tries to match it exactly against one of those tokens: no match.
- You search מרכב and it exactly matches the last of those three tokens, the full word.

If the system knew your search term were Hebrew, it could try to treat מרכב as the stem of the word and use that as the token to match. Since we don't have that capability right now, a search term only matches if it is the indexed word itself or a shorter prefix of it.
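The steps above can be sketched in a few lines of Python. This only imitates the behavior described here (edge ngrams with a minimum size of 2 applied at index time, exact token comparison at query time); `edge_ngrams` and `matches` are illustrative helpers, not Solr calls.

```python
def edge_ngrams(term, min_gram=2):
    """Leading-edge prefixes of a word, from min_gram characters up to the full word."""
    return [term[:n] for n in range(min_gram, len(term) + 1)]


# Index time: the transcription word is expanded into its prefixes.
index_tokens = set(edge_ngrams("מרכב"))


def matches(query):
    """Query time: the search term is compared as-is against the indexed tokens."""
    return query in index_tokens


print(matches("מרכב"))     # the word itself is one of the tokens
print(matches("מר"))       # a shorter prefix is also a token
print(matches("אלמרכב"))   # longer than the indexed word: nothing to match
```

The search term with the definite article is longer than anything in the index for that word, so there is no token it can equal.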

blms · 2024-04-23T19:55:53Z

@kseniaryzhova Ok to close this one since the results—when they appear at all—are correctly sorted by relevance? And then the remaining issue is handled by #1582?

rlskoeser · 2024-04-23T20:15:59Z

> @rlskoeser I see in the indexing code that we are only applying edge ngram filtering on index, but not on query. Could this be related? And what was the reasoning behind that?

The goal was for search terms that are partial matches of the contents to return results without the user having to add wildcards. I think in most cases applying the edge ngram filtering on the query would have very confusing results.

It does sound like the best solution for this particular use case would be to move towards language-specific indexing.
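As a rough illustration of why query-side ngrams would be confusing, here is a Python sketch that compares the two setups on three words taken from this thread. The `prefixes` and `search` helpers are hypothetical and only approximate the matching; they are not how Solr scores or retrieves documents.

```python
def prefixes(term, min_gram=2):
    """All leading-edge prefixes of a word with at least min_gram characters."""
    return {term[:n] for n in range(min_gram, len(term) + 1)}


# Words standing in for indexed transcription content.
indexed_words = ["מרכב", "אלצלטאן", "אלמרכב"]


def search(query, ngram_query):
    """Return indexed words sharing at least one token with the query."""
    # Index-only ngrams: the query is a single token, used as typed.
    # Query-side ngrams: the query is also expanded into its prefixes.
    query_tokens = prefixes(query) if ngram_query else {query}
    return [w for w in indexed_words if query_tokens & prefixes(w)]


print(search("אלמרכב", ngram_query=False))
print(search("אלמרכב", ngram_query=True))
```

In this toy model, expanding the query makes it match every word that begins with the article אל, including the unrelated אלצלטאן, while still missing מרכב without the article. So it adds noise without solving the case discussed here.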

kseniaryzhova · 2024-04-24T01:24:59Z

@blms per discussion above, closing this! Thank you!

kseniaryzhova added 🐛 bug Something isn't working performant Tasks for or taken on by Performant labels Feb 28, 2023

blms changed the title ~~Partial search results in description not boosted by relevancy~~ Some partial search results in description not boosted by relevancy Feb 28, 2023

blms mentioned this issue Mar 1, 2023

As a public site user, I want to be able to search descriptions for words/phrases in quotations, so that I can find exact matches for my search terms. #1255

Closed

2 tasks

blms self-assigned this Mar 2, 2023

blms added a commit that referenced this issue Mar 2, 2023

Set min gram size to 1 for solr edge ngram filter

69349d6

ref #1335

blms mentioned this issue Mar 2, 2023

Set min gram size to 2, preserveOriginal=true for solr edge ngram filter (#1335) #1339

Merged

blms added a commit that referenced this issue Mar 9, 2023

Revert minGramSize to 2 and add preserveOriginal

0c1d354

ref #1335

blms added a commit that referenced this issue Oct 13, 2023

Add unit test for ngram highlighting (#1335)

32cef31

blms added a commit that referenced this issue Oct 18, 2023

Merge pull request #1339 from Princeton-CDH/bugfix/1335-ngram

e8b51fe

Set min gram size to 2, preserveOriginal=true for solr edge ngram filter (#1335)

blms added the 🗜️ awaiting testing Implemented and ready to be tested label Oct 18, 2023

blms added ⚠️ tested needs attention Has been through acceptance testing and needs additional work and removed 🗜️ awaiting testing Implemented and ready to be tested labels Oct 18, 2023

blms added a commit that referenced this issue Oct 25, 2023

Improve boosts for partial searches (#1335)

461368b

blms mentioned this issue Oct 26, 2023

Improve boosts for description exact matches (#1335) #1491

Merged

blms added this to the Performant Q1 2024 high priority bugs milestone Jan 5, 2024

richmanrachel modified the milestones: Performant Q1 2024 high priority bugs, Search Feb 6, 2024

blms added a commit that referenced this issue Feb 8, 2024

Boost desc exact matches above shelfmark partial (#1335)

7506481

blms added a commit that referenced this issue Feb 12, 2024

Merge pull request #1491 from Princeton-CDH/bugfix/1335-phrase-match

dae2f98

Improve boosts for description exact matches (#1335)

blms removed the ⚠️ tested needs attention Has been through acceptance testing and needs additional work label Feb 12, 2024

blms added the 🗜️ awaiting testing Implemented and ready to be tested label Feb 23, 2024

kseniaryzhova added ⚠️ tested needs attention Has been through acceptance testing and needs additional work and removed 🗜️ awaiting testing Implemented and ready to be tested labels Feb 26, 2024

blms added the ❓ question Further information is requested label Feb 26, 2024

kseniaryzhova closed this as completed Mar 11, 2024

blms removed ❓ question Further information is requested ⚠️ tested needs attention Has been through acceptance testing and needs additional work labels Mar 11, 2024

blms mentioned this issue Mar 11, 2024

Update "how to search" page with information about Arabic/JA exact matches #1547

Open

kseniaryzhova reopened this Mar 25, 2024

blms added the 🗜️ awaiting testing Implemented and ready to be tested label Apr 19, 2024

kseniaryzhova added ⚠️ tested needs attention Has been through acceptance testing and needs additional work and removed 🗜️ awaiting testing Implemented and ready to be tested labels Apr 22, 2024

blms mentioned this issue Apr 23, 2024

When searching in Hebrew, search results are excluded when the keyword searched is longer than the word that appears in transcriptions #1582

Open

kseniaryzhova closed this as completed Apr 24, 2024

blms removed the ⚠️ tested needs attention Has been through acceptance testing and needs additional work label Apr 24, 2024
