As a user I would like to see transcription excerpts in my search results so I can tell which records have a transcription and can see some of the content. #299
Comments
Increasing estimate from 3 to 5 to account for language ISO code work |
@mrustow @richmanrachel I've been working on transcription search & keywords in context and have some questions about language (particularly language metadata & transcription language). Some context: ideally, we need to add a language code attribute in the html when we display the transcription text to differentiate from whatever the default language is for the rest of the page content (e.g., English). This is particularly important for search engines and screen readers. I've revised our I was thinking that I should be able to use the primary language field for the code of the transcription — but in a lot of cases, that isn't set for documents with a transcription, and in other cases there are multiple primary languages. My questions:
FWIW: probably only the general approach and multiple primary languages question are urgent; the rest are not blockers — if the language isn't specified, I'm currently setting the language attribute to an empty string to indicate it isn't the same language as the rest of the page. However, if there are multiple primary languages it's possible I'm setting the wrong value. (Which, maybe is fine to live with for now, as long as we have a plan to address.) |
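A minimal sketch of the fallback described above, for illustration only. The function names, the `transcription` class, and the choice to fall back to an empty string when there are several primary languages are assumptions, not the project's actual code. In HTML, `lang=""` means the language is unknown, which at least stops the transcription from inheriting the page language.

```python
from html import escape


def transcription_lang(primary_language_codes):
    """Pick a value for the HTML lang attribute on transcription text.

    `primary_language_codes` is a hypothetical list of language codes
    taken from the document's primary language field(s).
    """
    codes = [code for code in primary_language_codes if code]
    if len(codes) == 1:
        return codes[0]
    # No language set, or several candidates (assumption: do not guess):
    # an empty lang attribute marks the language as unknown instead of
    # inheriting the page language.
    return ""


def transcription_html(text, primary_language_codes):
    """Wrap transcription text in an element carrying its own lang."""
    lang = transcription_lang(primary_language_codes)
    return f'<div class="transcription" lang="{escape(lang)}">{escape(text)}</div>'
```

With this sketch, a document with a single primary language gets that code, and anything ambiguous gets `lang=""` until there is a plan for texts in more than one language.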
@rlskoeser - first for the testing rounds.
|
Yeah, this was the weird behavior I was seeing too. I was hoping it was something transitory! Can you tell if it's happening on anything besides transcription searches? We haven't applied the fonts yet since they are still being finalized. |
Oh wait - @rlskoeser - it might only be highlighting from the description, not transcription.... |
Hmm, interesting. The transcription indexing right now is pretty simple and not language-specific (which is what we agreed on for the MVP), so it could be something related to that. We could look at the indexing analysis together at some point if that would be useful. |
@rlskoeser - realizing I can still probably answer some of your questions now!
Do you know if there will be any major problems created by treating all Hebrew-script languages as "Hebrew" for the sake of the ISO? (Idk how they work in terms of trying to potentially correct spelling or anything). |
The script matters for font and formatting, but the language does matter also. I want to be sure to tag Judaeo-Arabic differently from Hebrew when we know that's what it is. IDK if there are any screen readers that handle Judaeo-Arabic (kind of hard to imagine?), but telling them that it's Hebrew seems like a bad idea! When we get to the point of customizing the search indexing to be language-specific, it will matter there too. e.g., for Judaeo-Arabic I'm hoping we'll be able to adapt the NLP work to convert to Arabic so it can be indexed and stemmed as Arabic, which should make the search more powerful. This is a good reminder that it will be important for our permanent transcription solution to handle the language tagging within the text, since they can be so mixed! Good to know bulk update doesn't make sense – I think that's ok, since we can at least mark it as different from the main text language. I should revisit how I'm handling texts with multiple languages, and I'll look into including script information so we can take advantage of that for formatting and display (which, as you point out, will be useful). |
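One way to carry both language and script in a single tag, sketched under assumptions: the helper name is hypothetical, and it presumes the language metadata can supply an ISO 639 language code and an ISO 15924 script code. In BCP 47 the script subtag follows the language subtag, so Judaeo-Arabic written in Hebrew script would be `jrb-Hebr`.

```python
def language_tag(language_code, script_code=None):
    """Build a BCP 47 style tag such as "jrb-Hebr" from language and script.

    Both arguments are assumed to come from language metadata; returns an
    empty string when the language is unknown.
    """
    if not language_code:
        return ""
    language = language_code.lower()
    if script_code:
        # Script subtags are conventionally written in title case (e.g. Hebr).
        return f"{language}-{script_code.title()}"
    return language
```

Keeping the script in the tag means formatting can key off the script (font, text direction) while screen readers and indexing can still tell Judaeo-Arabic apart from Hebrew.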
@richmanrachel could you test this again? I was trying to duplicate the weird behavior we saw before and can't; if you're able to, please document the search terms that cause problems. I'm wondering if maybe there was a lag with synchronizing the solr configset change (now that we're using solr replication issue with solr cloud) when we were first testing it. |
@rlskoeser - It's working better but still not fully. For example, the highlighting only seems to work on the first 10 results. It's unclear if the last bit is working since the Mirador is pulling the wrong transcription text to check if the sample text is from the beginning. |
Ooh, good catch on the highlighting + pagination. It's entirely possible we're not doing something correctly for pages of results after the first. Were you able to duplicate any of the weird behavior we saw before? |
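For reference, a sketch of how highlighting and pagination can be kept in step. This is a guess at the general shape of the problem, not the actual cause or fix. It assumes a Solr JSON response where `highlighting` is keyed by document id, an `id` unique key, and a `transcription` field; the function names are hypothetical. The query parameters `start`, `rows`, `hl`, and `hl.fl` are standard Solr parameters.

```python
def solr_params(query, page, per_page=10, field="transcription"):
    """Build query parameters so highlighting covers the requested page."""
    return {
        "q": query,
        "start": (page - 1) * per_page,  # offset of the first result on this page
        "rows": per_page,
        "hl": "true",
        "hl.fl": field,
    }


def attach_highlights(docs, highlighting, field="transcription"):
    """Pair each document with its snippets by id, not by list position."""
    results = []
    for doc in docs:
        snippets = highlighting.get(doc["id"], {}).get(field, [])
        results.append({**doc, "highlights": snippets})
    return results
```

Looking up snippets by id means a document on page two gets its own highlights as long as the highlighting request used the same `start` and `rows` as the result list.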
I'm trying to repeat some of my earlier searches, and I see there's some larger break in logic after the 10th entry. Here, I typed in the Arabic word for God, and as you can see in the screenshot, up through #10, the results are correct and as expected. But after 10, it switches the Hebrew text with unclear logic: |
I'm not having the same Solr issues as before with documents not showing up, thankfully. And not all of the results after 10 are wrong, but they definitely don't have the highlighter feature... |
@richmanrachel great, thank you — this is helpful. |
@rlskoeser - You're welcome! Sorry I just get to point out the problems and don't know how to fix them, haha |
@richmanrachel your insight about the highlighting stopping after the first ten and not working on subsequent pages helped me identify and fix the problem! Please confirm. |
@rlskoeser - Hooray! I think it's working properly now (certainly the issue with 10 is resolved). I'm just confused by your final test query. Could you please clarify?
|
@richmanrachel if your search term matches a document somewhere but not the text of the transcription, then instead of keywords in context you should see the beginning of the transcription; this is what we show on search results that don't have keywords in context for transcription. Maybe you could test by searching for a Hebrew term that occurs in a description rather than a transcription to see this? Or search on a tag that will bring back documents with transcriptions? Basically, you're checking that in adding the keywords in context highlighting for transcriptions, we haven't broken the previous functionality. |
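The expected behavior described above can be sketched roughly as follows. The function name, the `[…]` separator, and the 150-character cutoff are all assumptions for illustration, not the site's actual implementation.

```python
def transcription_excerpt(highlights, transcription, max_chars=150):
    """Choose the transcription text to show for one search result.

    Prefer keywords-in-context snippets; otherwise fall back to the
    beginning of the transcription, as before highlighting was added.
    """
    if highlights:
        # The search term matched the transcription: show the snippets.
        return " […] ".join(highlights)
    if not transcription:
        # The document has no transcription at all.
        return ""
    if len(transcription) <= max_chars:
        return transcription
    # No match in the transcription: show its beginning, truncated.
    return transcription[:max_chars].rstrip() + "…"
```

Testing both branches (a term that matches the transcription, and a term that matches only the description or a tag) checks that the fallback still works alongside the new highlighting.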
@rlskoeser - there's rarely Hebrew text in the description that's not in the document, but I do think it's working? These are the last two entries from a search for אמת (truth), and it's correctly highlighting the search word (however, the last entry doesn't have a transcription): I think I feel comfortable closing, if you do? |
I think it is working too — thanks for your careful testing. I think we're good to finally close this one. Hooray! 🎉 |
testing notes
test using the public document search on the test site: https://test-geniza.cdh.princeton.edu/en/documents/
dev notes
index_data
(exclude line numbers and labels)