Incorrect words are highlighted in complete word quotation search (Hebrew script) #1570

Closed
1 task done
kseniaryzhova opened this issue Mar 25, 2024 · 4 comments
Assignees
Labels
🐛 bug Something isn't working performant Tasks for or taken on by Performant

Comments

@kseniaryzhova

kseniaryzhova commented Mar 25, 2024

testing notes

In the QA site:

  • The below search "אלממ" should highlight as expected in search results, when using double quotes.

Describe the bug
Quotation search of words I know exist in the corpus brings up inexact matches as well.

To reproduce
Steps to reproduce the behavior:

  1. Go to '...'
  2. Search for "אלממ" in double quotes
  3. Look at results 8, 10, 16 (1 and 3 are weird too)

Expected behavior
Quotation search in any script should bring up exact matches only.

@kseniaryzhova kseniaryzhova added 🐛 bug Something isn't working performant Tasks for or taken on by Performant labels Mar 25, 2024
@blms
Contributor

blms commented Mar 25, 2024

This is a challenging Solr highlighting issue. We have an exact match (unstemmed) field, and exact searches are matched against that. The exact search does bring up the correct results, and attempts to highlight them.

But in the highlighting portion of that query, the highlighter uses edismax to search all query fields:

search = search.raw_query_parameters(
    **{
        "hl.q": "{!type=edismax qf=$keyword_qf pf=$keyword_pf v=$hl_query}",
        "hl_query": self.highlight_query,
        "hl.qparser": "lucene",
    }
)
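
To make the effect of that `hl.q` concrete, here is a minimal sketch of how the raw highlighting parameters could be scoped differently for exact searches. It is an illustration under stated assumptions, not the fix that was committed for this issue: `highlight_params` is a hypothetical helper, and `description_nostem` / `transcription_nostem` are placeholder field names that would have to exist in the schema as stored fields for Solr to return highlights on them. The Solr parameters themselves (`hl.q`, `hl.fl`, `hl.method`, `hl.requireFieldMatch`, `hl.qparser`) are standard.

```python
def highlight_params(hl_query: str, exact: bool) -> dict:
    """Build raw Solr highlighting parameters for a keyword search.

    Hypothetical helper: for exact (quoted) searches, both the highlight
    query and the highlighted fields are limited to unstemmed fields.
    """
    if exact:
        # Assumed field names; they must be stored for highlighting to work.
        fields = ["description_nostem", "transcription_nostem"]
        qf = " ".join(fields)
        hl_q = f"{{!type=edismax qf='{qf}' v=$hl_query}}"
    else:
        # Same highlight query as in the snippet above: all keyword fields.
        fields = ["description", "transcription", "translation"]
        hl_q = "{!type=edismax qf=$keyword_qf pf=$keyword_pf v=$hl_query}"
    return {
        "hl": "true",
        "hl.q": hl_q,
        "hl_query": hl_query,
        "hl.qparser": "lucene",
        "hl.fl": ",".join(fields),
        "hl.method": "unified",
        "hl.requireFieldMatch": "true",
    }


params = highlight_params('"אלממ"', exact=True)
print(params["hl.q"])
```

With `qf=$keyword_qf`, every keyword field (stemmed ones included) can contribute highlight terms, which is consistent with the partial matches reported here.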

Also, the content_nostem field isn't present in the highlighted fields:

.highlight(
    "description",
    snippets=3,
    method="unified",
    requireFieldMatch=True,
)
# return smaller chunk of highlighted text for transcriptions/translations
# since the lines are often shorter, resulting in longer text
.highlight(
    "transcription",
    method="unified",
    fragsize=150,  # try including more context
    requireFieldMatch=True,
)
.highlight(
    "translation",
    method="unified",
    fragsize=150,
    requireFieldMatch=True,
)

This is a problem, because while the results are correct, it's highlighting partial matches within them when it should only be highlighting the exact matches. Simply adding content_nostem with .highlight() doesn't do the trick, no matter how I mess with the highlight query formation.
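
As a plain-Python illustration of the difference between the two behaviors (this is not how Solr highlights internally, only a sketch of the expected output), the hypothetical `highlight` function below wraps either whole-token matches or any substring match in `<em>` tags. The sample text is made up for the example, and tokens are assumed to be delimited by whitespace only, so punctuation handling is deliberately ignored.

```python
import re


def highlight(text: str, term: str, exact: bool = True) -> str:
    """Wrap matches of ``term`` in <em> tags.

    exact=True  -> only whitespace-delimited tokens equal to ``term``
    exact=False -> any occurrence of ``term``, including inside longer words
    """
    escaped = re.escape(term)
    # (?<!\S) and (?!\S) require whitespace or a string edge on each side.
    pattern = rf"(?<!\S){escaped}(?!\S)" if exact else escaped
    return re.sub(pattern, lambda m: f"<em>{m.group(0)}</em>", text)


# Made-up sample: the search term alone, then a longer word that starts with it.
sample = "אלממ אלממלכה"
print(highlight(sample, "אלממ", exact=True))   # one highlighted token
print(highlight(sample, "אלממ", exact=False))  # also highlights inside the longer word
```

The second call shows the kind of partial highlighting described in this issue; the first is what a quoted search should produce.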

@rlskoeser Any ideas off the top of your head as to how I might approach this? Is there a way we can get the exact match highlighted when partial matches are also present in the result?

@rlskoeser
Contributor

@blms I suspect you are not getting highlighting back on content_nostem because it's indexed but not stored; I'm looking at the fields in the managed-schema file because I had to refresh my memory on this. You also have to have an active keyword search on the field. Even if you change that field so Solr stores it, I'm not sure that fully solves your problem, because right now you have a single content_nostem field. I think you would need to revise the copy field behavior and define separate fields (description_nostem and transcription_nostem); then you have to query across all of them, return highlighting for all of them, and display them in priority order (nostem matches first if there are any, then the other copy of the field).
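
The "display in priority order" step could look roughly like the sketch below. It assumes the per-document highlighting returned by Solr has already been parsed into a dict mapping field names to lists of snippets, and that each unstemmed copy is named `<field>_nostem` as suggested above; `pick_highlights` is a hypothetical helper, not code from this repository.

```python
def pick_highlights(doc_highlighting: dict,
                    fields=("description", "transcription")) -> dict:
    """Choose which snippets to display for each field of one document.

    Snippets from the unstemmed copy (``<field>_nostem``) win when present;
    otherwise fall back to the regular (stemmed) field's snippets.
    """
    chosen = {}
    for field in fields:
        snippets = (doc_highlighting.get(f"{field}_nostem")
                    or doc_highlighting.get(field))
        if snippets:
            chosen[field] = snippets
    return chosen


# Placeholder snippets: one field has an exact-match copy, the other does not.
doc = {
    "description": ["partial <em>match</em>es here"],
    "description_nostem": ["exact <em>match</em> here"],
    "transcription": ["only a stemmed <em>snippet</em>"],
}
print(pick_highlights(doc))
```

An empty list from the nostem field is treated the same as a missing key, so the fallback applies in both cases.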

It does make me feel like we're approaching the problem wrong. It seems to go against how Solr is meant to be used, and it seems increasingly complicated; although I do remember that it took us a while to arrive at the nostem field as a solution for this problem.

Because of this complexity there may be some trade-offs with the exact searching and highlighting that will be difficult to solve.

@blms
Contributor

blms commented Mar 29, 2024

That's super helpful, thank you @rlskoeser! I'd totally forgotten about the indexed vs stored concept in Solr. Great point also about the two separate fields for exact matches.

It does make me feel like we're approaching the problem wrong. It seems to go against how Solr is meant to be used, and it seems increasingly complicated; although I do remember that it took us a while to arrive at the nostem field as a solution for this problem.

I agree with that, and I do think there are going to be additional problems; advanced users are really looking for exactly what was on PGPv3, a search that uses only exact character sequence matching, so anything we try to do with solr to match that is inevitably going to come up short. I do think the exact matching with double quotes, at least, is something that everyday users might expect to work, but some of the more complex use cases of this might not really be feasible with solr. I'll be curious to hear your thoughts on all this in our meeting Thursday!

@blms blms self-assigned this Apr 2, 2024
blms added a commit that referenced this issue Apr 18, 2024
…highlight

Highlight nostem matches on exact searches (#1570)
@blms blms added the 🗜️ awaiting testing Implemented and ready to be tested label Apr 19, 2024
@kseniaryzhova
Author

@blms works as it should! Also tested a few other words like "אלקמח" (which I know for a fact appear in the corpus)! Closing!

@blms blms removed the 🗜️ awaiting testing Implemented and ready to be tested label Apr 24, 2024
3 participants