Identifying / Highlight subsections of a Segment #1359

Open
EliezerIsrael opened this issue May 8, 2023 · 4 comments
Labels
Sefaria Discussion Prompt A prompt for discussion initiated by the Sefaria team Zone: API

Comments

@EliezerIsrael
Member

In the context of a thread on refining our text API, this request came up -

We use the text API extensively to link to content that we do not have natively in our app. We make a text API call and present the content in a local popup within our app. The problem is that the text returned is often very long, and not precise enough to help a person zero in on the relevant part of the reference.

The way we deal with this for internal links to our own content is that, besides the Page (the main entry in our database), we also include the IDs of a range of phrases or words, or a list of multiple phrases or words, and our internal engine returns the "Page" with those words or phrases highlighted. That way you can see what the author was referring to in the context of the whole source.

In our internal texts we have IDs on either every word or every phrase. The Sefaria texts do not have IDs at that level.

It would be nice if we could specify words 20-25 within a source and have those words wrapped in a tag as the selection that we could render as we please.
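For illustration only, here is a minimal client-side sketch of that request in Python. It assumes the segment text is plain text with no inline HTML, that words are whitespace-separated, and that word numbers are 1-based and inclusive. The function name `wrap_word_range` and the `mark` tag are hypothetical, not part of any Sefaria API.

```python
import re


def wrap_word_range(text, start, end, tag="mark"):
    """Wrap words start..end (1-based, inclusive) of text in <tag>...</tag>.

    Assumes plain text: a "word" is any run of non-whitespace characters.
    """
    words = list(re.finditer(r"\S+", text))
    if not (1 <= start <= end <= len(words)):
        raise ValueError(f"word range {start}-{end} is outside 1-{len(words)}")
    a = words[start - 1].start()  # character offset of the first selected word
    b = words[end - 1].end()      # character offset just past the last selected word
    return f"{text[:a]}<{tag}>{text[a:b]}</{tag}>{text[b:]}"


print(wrap_word_range("a b c d e", 2, 3))
```

Original whitespace is preserved because the function slices the source string rather than splitting and rejoining it.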

Originally posted by @mayerpasternak in #1343 (comment)

@EliezerIsrael
Member Author

Some follow-up from @mayerpasternak -

Word-level highlighting could be done externally, but your website does not show word numbers, so our scholars would be working blind unless we create a new interface to your texts that exposes the number of each word. We would also need to take your response from the text API and assign a number to each word in order to create the highlighting. We could possibly build all of this outside of your system, but I suspect that there are other users who could benefit from this, so it might make sense to add it as an option instead of everyone building their own system.

@EliezerIsrael
Member Author

EliezerIsrael commented May 8, 2023

Thinking about the best way to highlight subphrases of a segment in Sefaria.
We have an inherent limitation in our architecture: each word does not have its own ID. Given that, I see two paths (are there more?), each with some brittle elements.

  1. Specify range of word numbers
  2. Specify start and end phrases (or even the entire subphrase)

Option 1 is brittle in that the segments on Sefaria may change. Even a minor edit (adding a dash or a missing word) will throw off the word counts. Some texts in our system are well defined and locked, but others do receive edits as we get corrections in, find better source material, etc.

Option 2 is brittle for the same reason, but less so. If the text of the opening or closing phrase changes, then the match can be thrown off. But in this case, it will be obvious that the match is not correct, and the error can be handled. In the case of Option 1, the error in word count would pass silently.

Option 2 has another downside - there may be multiple matches to an opening and closing phrase. Presumably, in all realistic cases, one can choose a long enough string to uniquely identify the passage.

Following this line of thinking, I would imagine that a request for a segment with opening and closing phrases of the subsection to be highlighted would be the best approach.
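A rough Python sketch of the opening/closing phrase approach, to make the failure modes concrete. Everything here is an assumption for illustration: the names `highlight_between` and `AnchorError`, the `mark` tag, and the choice to require exactly one match of the opening phrase. It also assumes plain text without inline HTML.

```python
class AnchorError(LookupError):
    """Raised when the anchor phrases cannot be resolved unambiguously."""


def highlight_between(text, opening, closing, tag="mark"):
    """Wrap the span from `opening` through `closing` in <tag>...</tag>.

    Fails loudly instead of guessing: the opening phrase must occur exactly
    once, and the closing phrase must occur somewhere after it.
    """
    count = text.count(opening)
    if count == 0:
        raise AnchorError("opening phrase not found (has the text been edited?)")
    if count > 1:
        raise AnchorError(f"opening phrase is ambiguous ({count} matches)")
    a = text.index(opening)
    close_at = text.find(closing, a + len(opening))
    if close_at == -1:
        raise AnchorError("closing phrase not found after the opening phrase")
    b = close_at + len(closing)
    return f"{text[:a]}<{tag}>{text[a:b]}</{tag}>{text[b:]}"


verse = "In the beginning God created the heaven and the earth"
print(highlight_between(verse, "God created", "the heaven"))
```

Unlike a word-count range, an edit to the underlying text that touches either anchor surfaces as an `AnchorError` that the caller can handle, rather than silently highlighting the wrong words.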

Interested to hear feedback.

@EliezerIsrael EliezerIsrael added Zone: API Sefaria Discussion Prompt A prompt for discussion initiated by the Sefaria team labels May 8, 2023
@ronshapiro
Contributor

Option 3: pass the text that you want highlighted, together with start and end {character counts, word counts, or percentage counts}.

I have implemented this for talmud.page. For the same reasons highlighted above, I save as much spatial context as I can and hope that I can later find the right placement even if the text changes.
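As a hedged illustration of Option 3 (not the talmud.page implementation, which is not shown in this thread), the saved text can be searched for in the current segment, with the saved character offset used only as a hint to choose between multiple occurrences. The function name `locate` and its parameters are hypothetical.

```python
def locate(segment, saved_text, saved_start):
    """Return (start, end) character offsets of saved_text in segment.

    If saved_text occurs more than once, pick the occurrence whose start is
    closest to the offset recorded when the highlight was saved.
    Returns None if the text no longer appears verbatim.
    """
    positions = []
    i = segment.find(saved_text)
    while i != -1:
        positions.append(i)
        i = segment.find(saved_text, i + 1)
    if not positions:
        return None
    start = min(positions, key=lambda p: abs(p - saved_start))
    return start, start + len(saved_text)


print(locate("the cat and the dog", "the", 11))
```

Because the offset is only a tie-breaker, a small edit earlier in the segment shifts the match along with the text instead of breaking it.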

You can make this more robust by adding a normalization step (remove punctuation and vocalization/trope). You could even use an edit-distance algorithm with a cutoff to find a best-effort match.
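One possible shape for that normalization and best-effort matching, sketched in Python with only the standard library. The assumptions are worth stating: combining marks (Unicode category `Mn`, which covers Hebrew vowel points and cantillation marks) are dropped, punctuation is turned into spaces, and `difflib.SequenceMatcher.ratio()` stands in for a true edit distance. The names `normalize` and `best_match` and the 0.8 cutoff are illustrative choices, not anything Sefaria provides.

```python
import difflib
import unicodedata


def normalize(s):
    """Drop combining marks, turn punctuation into spaces, collapse whitespace."""
    out = []
    for ch in unicodedata.normalize("NFD", s):
        cat = unicodedata.category(ch)
        if cat == "Mn":  # nonspacing marks: vowel points, trope, accents
            continue
        out.append(" " if cat.startswith("P") else ch)
    return " ".join("".join(out).split())


def best_match(segment, saved, cutoff=0.8):
    """Return the (start, end) word indices in segment that best match saved.

    Indices refer to whitespace-separated words of the original segment
    (0-based, end exclusive). Returns None if no window scores >= cutoff.
    """
    words = segment.split()
    target = normalize(saved)
    n = len(saved.split())
    best_score, best_span = 0.0, None
    # Try windows one word shorter or longer than the saved text as well,
    # so a single inserted or deleted word does not defeat the match.
    for size in range(max(1, n - 1), n + 2):
        for i in range(len(words) - size + 1):
            candidate = normalize(" ".join(words[i:i + size]))
            score = difflib.SequenceMatcher(None, candidate, target).ratio()
            if score > best_score:
                best_score, best_span = score, (i, i + size)
    return best_span if best_score >= cutoff else None


segment = "In the beginning, God created the heaven and the earth."
print(best_match(segment, "beginning God created"))
```

The same functions apply to pointed Hebrew text, since the points and trope are combining marks. This brute-force scan is quadratic in segment length, which is likely acceptable for single segments but would need rethinking for longer passages.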

But word-level IDs would be the best. Probably the best way to do this is to emit IDs as span tags without any styling (similar to the <i data-commentator="..."> approach for Shulchan Arukh). And then you could pass those IDs back. This would probably require some good tooling for the Sefaria team that manages text updates, though.
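To show roughly what unstyled word-level span tags could look like, here is a small Python sketch. The attribute name `data-word-id`, the `w` prefix, and the function name `emit_word_spans` are all hypothetical. Note that the IDs below are just positions, so on their own they are as brittle as word counts; stable IDs would have to be stored and maintained through text edits, which is the tooling concern raised above.

```python
import html
import re


def emit_word_spans(text, prefix="w"):
    """Wrap each whitespace-separated word in an unstyled <span> with an ID.

    Assumes plain-text input; each word is HTML-escaped before wrapping.
    """
    counter = 0

    def wrap(match):
        nonlocal counter
        counter += 1
        word = html.escape(match.group())
        return f'<span data-word-id="{prefix}{counter}">{word}</span>'

    return re.sub(r"\S+", wrap, text)


print(emit_word_spans("a b"))
```

A client could then send back the IDs of the first and last selected words and let the server, or its own renderer, style everything in between.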

@mayerpasternak

mayerpasternak commented May 21, 2023 via email

3 participants