As a user I would like to see transcription excerpts in my search results so I can tell which records have a transcription and can see some of the content. #299

Closed
7 tasks done
gissoo opened this issue Oct 12, 2021 · 22 comments
@gissoo
Contributor

gissoo commented Oct 12, 2021

testing notes

test using the public document search on the test site: https://test-geniza.cdh.princeton.edu/en/documents/

  • when no keyword search terms are entered, documents that have transcription text associated should display an excerpt from the beginning of the transcription (if you sort by scholarship records most - least you should see some); NOTE, this is only for transcription text synced via the new script, which currently doesn't include everything due to the mismatches we still need to resolve
  • you should now be able to search on transcription text (you'll probably want to switch sorting to relevance)
  • if a keyword search term matches the transcription text, you should see an excerpt from the transcription around that keyword, with the term highlighted
  • if a keyword search term does not match a document with transcription text, you should see the excerpted text from the beginning of the transcription, as on the search without a keyword term

dev notes

  • include transcription content in document index_data (exclude line numbers and labels)
  • add transcription excerpt to search result, similar to existing description excerpt
  • set language code on the block based on document language (where possible)
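
A minimal sketch of the first dev note, assuming (this is not confirmed in the thread) that transcriptions are stored as HTML in which each line is an `<li>` inside an ordered list and section labels are headings. Under that assumption, line numbers are never part of the text, and labels can be skipped by collecting only text inside `<li>` elements. The class and function names are hypothetical.

```python
from html.parser import HTMLParser


class TranscriptionTextExtractor(HTMLParser):
    """Collect the text of <li> elements and ignore everything else
    (e.g. heading labels). Assumes one transcription line per <li>."""

    def __init__(self):
        super().__init__()
        self._li_depth = 0
        self.lines = []

    def handle_starttag(self, tag, attrs):
        if tag == "li":
            self._li_depth += 1
            self.lines.append("")  # start a new transcription line

    def handle_endtag(self, tag):
        if tag == "li" and self._li_depth:
            self._li_depth -= 1

    def handle_data(self, data):
        # Only keep text that appears inside a list item.
        if self._li_depth:
            self.lines[-1] += data


def transcription_index_text(html):
    """Return transcription text suitable for indexing:
    one line per <li>, without labels or line numbers."""
    parser = TranscriptionTextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(line.strip() for line in parser.lines if line.strip())
```

The returned string could then be added to the document's index data alongside the description; the field name used for it would depend on the Solr schema.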
@rlskoeser rlskoeser changed the title from "As a user I would like to see transcription excerpts for all my search results that have a transcription" to "As a user I would like to see transcription excerpts in my search results so I can tell which records have a transcription and can see some of the content." Oct 14, 2021
@rlskoeser rlskoeser added this to the PGP v4.0 (MVP) milestone Oct 14, 2021
@rlskoeser rlskoeser self-assigned this Nov 16, 2021
@rlskoeser
Contributor

Increasing estimate from 3 to 5 to account for language ISO code work

@rlskoeser
Contributor

@mrustow @richmanrachel I've been working on transcription search & keywords in context and have some questions about language (particularly language metadata & transcription language).

Some context: ideally, we need to add a language code attribute in the html when we display the transcription text to differentiate from whatever the default language is for the rest of the page content (e.g., English). This is particularly important for search engines and screen readers.

I've revised our Language+Script model to add a field for ISO Codes, and I've written a migration that will populate the codes for languages used on documents that currently have transcriptions (and it will be viewable & editable in admin).

I was thinking that I should be able to use the primary language field for the code of the transcription — but in a lot of cases, that isn't set for documents with a transcription, and in other cases there are multiple primary languages.

My questions:

  1. Am I thinking about primary language differently than you are? Does it make sense to use this as the language of the transcription?
  2. Why do some documents have so many primary languages? Any thoughts on how to label the language of transcription for these?
  3. It seems like a lot of documents with existing transcriptions do not have a primary language set. Can we remedy this? Is there any logic that would let us do a bulk update?

FWIW: probably only the general approach and multiple primary languages question are urgent; the rest are not blockers — if the language isn't specified, I'm currently setting the language attribute to an empty string to indicate it isn't the same language as the rest of the page. However, if there are multiple primary languages it's possible I'm setting the wrong value. (Which, maybe is fine to live with for now, as long as we have a plan to address.)
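
A minimal sketch of the language attribute logic described above, assuming the ISO codes for a document's primary languages are available as a list of strings. It differs from the behavior described in the comment in one respect, which is a design choice of this sketch: when there is more than one distinct code, it falls back to an empty `lang` value rather than picking one. The function names and the `transcription` class are hypothetical.

```python
from html import escape


def transcription_lang(iso_codes):
    """Return the value for the HTML lang attribute.

    Uses the code only when exactly one distinct, non-empty code is
    known; otherwise returns an empty string, which in HTML marks the
    language as unknown instead of inheriting the page language.
    """
    unique = list(dict.fromkeys(code for code in iso_codes if code))
    if len(unique) == 1:
        return unique[0]
    return ""


def transcription_block(text, iso_codes):
    """Wrap escaped transcription text in a div with a lang attribute."""
    lang = transcription_lang(iso_codes)
    return '<div class="transcription" lang="%s">%s</div>' % (
        escape(lang, quote=True),
        escape(text),
    )
```

With a single primary language, `transcription_block("...", ["he"])` yields a block tagged `lang="he"`; with no language or several, the block is tagged `lang=""`.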

@rlskoeser rlskoeser added ❓ question Further information is requested 🗜️ awaiting testing Implemented and ready to be tested labels Nov 29, 2021
@richmanrachel

@rlskoeser - first for the testing rounds.

  1. The transcriptions show up and they look good! (I think the font is still wrong, but the spacing and layout looks good to me).
  2. Searching for the transcription text isn't perfect. My second search worked, but the first one returned a record count (which seemed plausible) without displaying any of the documents:
    image

@rlskoeser
Contributor

Yeah, this was the weird behavior I was seeing too. I was hoping it was something transitory! Can you tell if it's happening on anything besides transcription searches?

We haven't applied the fonts yet since they are still being finalized.

@richmanrachel

I'll keep testing some more things... but here I noticed that while Hebrew words were not getting highlighted in my last search, the Arabic word for God was successful:
image

@richmanrachel

Oh wait - @rlskoeser - it might only be highlighting from the description, not transcription....

@richmanrachel

Here Allah shows up in the transcription but is not highlighted...
image

@rlskoeser
Contributor

Hmm, interesting. The transcription indexing right now is pretty simple and not language-specific (which is what we agreed on for the MVP), so it could be something related to that. We could look at the indexing analysis together at some point if that would be useful.

@richmanrachel

@rlskoeser - realizing I can still probably answer some of your questions now!

Am I thinking about primary language differently than you are? Does it make sense to use this as the language of the transcription?

  • Yes, we are thinking about primary language differently, because for us the difference between Judaeo-Arabic and Hebrew really matters (so that the right researchers can look at the document) but for you they're probably the same (because they both use Hebrew characters), correct?

Why do some documents have so many primary languages? Any thoughts on how to label the language of transcription for these?

  • Legal documents in particular use many languages because they are referring to legal precedents and arguments that took place over centuries from the Torah and Mishnah (Hebrew), Talmud (Aramaic), and local legal systems (Judaeo-Arabic, Greek, etc). I think script will probably be more helpful for you than language, as that will be more consistent.

It seems like a lot of documents with existing transcriptions do not have a primary language set. Can we remedy this? Is there any logic that would let us do a bulk update?

  • Unfortunately there is no logic for a bulk update with regard to language. If we do need to just give you a bulk update for script, I think that assuming the writing is in Hebrew script will work for most documents that don't have "Arabic" in the description?

Do you know if there will be any major problems created by treating all Hebrew-script languages as "Hebrew" for the sake of the ISO? (Idk how they work in terms of trying to potentially correct spelling or anything).

@rlskoeser
Contributor

The script matters for font and formatting, but the language does matter also. I want to be sure to tag Judaeo-Arabic differently from Hebrew when we know that's what it is. IDK if there are any screen readers that handle Judaeo-Arabic (kind of hard to imagine?), but telling them that it's Hebrew seems like a bad idea!

When we get to the point of customizing the search indexing to be language-specific, it will matter there too. e.g., for Judaeo-Arabic I'm hoping we'll be able to adapt the NLP work to convert to Arabic so it can be indexed and stemmed as Arabic, which should make the search more powerful.

This is a good reminder that it will be important for our permanent transcription solution to handle the language tagging within the text, since they can be so mixed!

Good to know bulk update doesn't make sense – I think that's ok, since we can at least mark it as different from the main text language. I should revisit how I'm handling texts with multiple languages, and I'll look into including script information so we can take advantage of that for formatting and display (which, as you point out, will be useful).

@rlskoeser rlskoeser added ⚠️ tested needs attention Has been through acceptance testing and needs additional work and removed 🗜️ awaiting testing Implemented and ready to be tested labels Dec 13, 2021
@rlskoeser
Contributor

@richmanrachel could you test this again? I was trying to duplicate the weird behavior we saw before and can't; if you're able to, please document the search terms that cause problems.

I'm wondering if maybe there was a lag with synchronizing the solr configset change (now that we're using solr replication issue with solr cloud) when we were first testing it.

@richmanrachel

@rlskoeser - It's working better but still not fully. For example, the highlighting only seems to work on the first 10 results.

It's unclear whether the last test case is working: Mirador is pulling the wrong transcription text, so I can't check whether the sample text is from the beginning.

@rlskoeser
Contributor

ooh, good catch on the highlighting + pagination; it's entirely possible we're not doing something correctly for pages of results after the first

were you able to duplicate any of the weird behavior we saw before?

@richmanrachel

I'm trying to repeat some of my earlier searches, and I see there's a larger break in logic after the 10th entry. Here, I typed in the Arabic word for God, and as you can see in the screenshot, up through #10, the results are correct and as expected. But after 10, it switches to Hebrew text with unclear logic:
image

@richmanrachel

I'm not having the same Solr issues as before with documents not showing up, thankfully.

And not all of the results after 10 are wrong, but they definitely don't have the highlighter feature...

@rlskoeser
Contributor

@richmanrachel great, thank you — this is helpful.

@richmanrachel

@rlskoeser - You're welcome! Sorry I just get to point out the problems and don't know how to fix them, haha

@rlskoeser
Contributor

@richmanrachel your insight about the highlighting stopping after the first ten and not working on subsequent pages helped me identify and fix the problem! Please confirm.

@rlskoeser rlskoeser added 🗜️ awaiting testing Implemented and ready to be tested and removed ❓ question Further information is requested ⚠️ tested needs attention Has been through acceptance testing and needs additional work labels Dec 16, 2021
@richmanrachel

@rlskoeser - Hooray! I think it's working properly now (certainly the issue with 10 is resolved).

I'm just confused by your final test query. Could you please clarify?

if a keyword search term does not match a document with transcription text, you should see the excerpted text from the beginning of the transcription, as on the search without a keyword term

@rlskoeser
Contributor

@richmanrachel if your search term matches a document somewhere but not the text of the transcription, then instead of keywords in context you should see the beginning of the transcription; this is what we show on search results that don't have keywords in context for the transcription. Maybe you could test by searching for a Hebrew term that occurs in a description rather than a transcription to see this? Or search on a tag that will bring back documents with transcriptions?

Basically, you're checking that in adding the keywords in context highlighting for transcriptions, we haven't broken the previous functionality.
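
The fallback being tested can be sketched roughly as follows. This is an illustration under stated assumptions, not the project's code: `highlights` stands for the list of keyword-in-context snippets the search engine returned for a document's transcription field (empty or `None` when the term did not match the transcription), and the function name, the `max_chars` limit, and the whitespace collapsing are all hypothetical.

```python
def transcription_excerpt(transcription, highlights=None, max_chars=150):
    """Pick the text to show for a transcription in a search result.

    If the search engine returned highlighted snippets for the
    transcription, show the first one; otherwise fall back to the
    beginning of the transcription, as on a search with no keywords.
    """
    if highlights:
        return highlights[0]
    if not transcription:
        return ""
    # Collapse line breaks and runs of spaces for a compact excerpt.
    text = " ".join(transcription.split())
    if len(text) <= max_chars:
        return text
    return text[:max_chars].rstrip() + "…"
```

Because the lookup is per document, the same function applies to every result on every page, whether or not that result has keyword matches in its transcription.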

@richmanrachel

@rlskoeser - there's rarely Hebrew text in the description that's not in the document, but I do think it's working? These are the last two entries from a search for אמת (truth), and it's correctly highlighting the search word (however, the last entry doesn't have a transcription):
image

I think I feel comfortable closing, if you do?

@rlskoeser
Contributor

I think it is working too. Thanks for your careful testing. I think we're good to finally close this one. Hooray! 🎉

@rlskoeser rlskoeser removed the 🗜️ awaiting testing Implemented and ready to be tested label Dec 16, 2021