As a user, I want to search on metadata and transcriptions together so that I can find records by description or content. #10

Closed
rlskoeser opened this issue Oct 5, 2020 · 7 comments
Labels
🧪 experiment Prototypes that support future work

Comments

@rlskoeser
Contributor

No description provided.

rlskoeser added the 🛠️ chore (One-off task or update) label Oct 5, 2020
@kmcelwee
Contributor

Intersection of Cambridge PGPIDs that have links for which we have transcripts. Ignores 'b' and '-deleted' files.
cam-link-transcription-pgpids.txt

^^ Test set for converting transcriptions in search prototype
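
A rough sketch of how a test set like this could be computed. The filename convention here is an assumption made for illustration (`<pgpid>.xml`, with variants such as `<pgpid>b.xml` and `<pgpid>-deleted.xml` skipped), and both function names are hypothetical; this is not the script that produced the attached file.

```python
from pathlib import PurePath


def pgpid_from_filename(name):
    """Return the PGPID encoded in a transcription filename, or None to skip it.

    Assumed (hypothetical) naming: "<pgpid>.xml". Anything whose stem is not
    purely numeric, such as "123b.xml" or "123-deleted.xml", is ignored.
    """
    stem = PurePath(name).stem
    return int(stem) if stem.isdecimal() else None


def linked_with_transcripts(linked_pgpids, transcription_files):
    """Intersect PGPIDs that have links with PGPIDs that have transcripts."""
    transcribed = {pgpid_from_filename(name) for name in transcription_files}
    transcribed.discard(None)  # drop the skipped 'b' and '-deleted' files
    return sorted(set(linked_pgpids) & transcribed)
```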

rlskoeser changed the title from "incorporate transcription data from TEI in search prototype" to "As a user, I want to search on metadata and transcriptions together so that I can find records by description or content." Oct 28, 2020
rlskoeser added the 🧪 experiment (Prototypes that support future work) label and removed the 🛠️ chore (One-off task or update) label Oct 28, 2020
rlskoeser added a commit that referenced this issue Nov 3, 2020
@rlskoeser
Contributor Author

The search prototype now includes text from the TEI transcriptions in the index. It's a fairly "dumb" implementation, just a first pass so we can start searching on metadata & transcriptions together. I'm including the labels from the TEI as well as the transcription text. Solr doesn't know what language the transcriptions are in, so it's treating them like English text for now. (In particular, this means stemming won't work, and tokenization may not work properly in all cases.)

When your search terms match transcription text, you should see a line of context with your search term highlighted.

Try searching on transcription terms alone and in combination with metadata searches.
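
To make the "first pass" concrete, here is a minimal sketch of pulling labels and line text out of a TEI file as one plain-text value for indexing. It assumes the transcriptions use `<label>` and `<l>` elements in the standard TEI namespace, which the thread does not confirm; the function name is hypothetical and this is not the prototype's actual code.

```python
import xml.etree.ElementTree as ET

# Assumption: files use the standard TEI namespace with <label> and <l> elements.
TEI = "{http://www.tei-c.org/ns/1.0}"
INDEXED_TAGS = {TEI + "label", TEI + "l"}


def transcription_text(tei_xml):
    """Collect label and line text in document order, one entry per line."""
    root = ET.fromstring(tei_xml)
    parts = []
    for element in root.iter():
        if element.tag in INDEXED_TAGS:
            text = "".join(element.itertext()).strip()
            if text:
                parts.append(text)
    # Newlines are kept so that per-line context can be recovered later.
    return "\n".join(parts)
```

The returned string would go into a single text field alongside the metadata fields; no language-specific analysis is applied at this stage.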

rlskoeser added the 🗜️ awaiting testing (Implemented and ready to be tested) label Nov 3, 2020
@sluescher

sluescher commented Nov 4, 2020

I do see the line of context. I can search for terms and metadata and it shows fine. However, I am still puzzled by the way it orders results.

For example: "מרכב illness bedbound Ramle". The first word is from the transcription; illness, bedbound, and Ramle are tags or in the description. However, the first two results have no Hebrew (the result I was looking for is result #3).

@rlskoeser
Contributor Author

The relevance score between the first few is not very different; I think the exact match on the short tag is continuing to be scored as "more relevant" vs the Hebrew term occurring once in a larger text field. Perhaps it's exaggerated because Ramle is a less common term. (That seems to be the case from what I can tell.)
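
A textbook BM25 calculation illustrates why that can happen. This is a sketch only: the thread does not say which similarity the prototype's Solr instance uses, and all the numbers below are made up for illustration.

```python
import math


def bm25_term_score(tf, field_len, avg_field_len, docs_with_term, total_docs,
                    k1=1.2, b=0.75):
    """Textbook BM25 score for one term in one field (illustrative only)."""
    # Rarer terms get a larger inverse document frequency.
    idf = math.log(1 + (total_docs - docs_with_term + 0.5) / (docs_with_term + 0.5))
    # Longer-than-average fields are penalized by length normalization.
    length_norm = 1 - b + b * field_len / avg_field_len
    return idf * tf * (k1 + 1) / (tf + k1 * length_norm)


# Made-up numbers: one exact hit in a one-word tag vs. one hit in a long transcription.
short_tag = bm25_term_score(tf=1, field_len=1, avg_field_len=2,
                            docs_with_term=5, total_docs=1000)
long_text = bm25_term_score(tf=1, field_len=400, avg_field_len=200,
                            docs_with_term=5, total_docs=1000)
```

With the same term frequency and rarity, the hit in the short field scores higher than the hit in the long field, and lowering `docs_with_term` raises both scores, which matches the guess about a rare tag like Ramle.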

You can use boolean operators, like "מרכב AND illness bedbound Ramle"; you can also try exact phrase and proximity searching within the transcriptions.
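
For reference, a small sketch of how those query strings are put together in standard Lucene/Solr query syntax (`AND`, quoted phrases, and `"..."~N` for proximity). The helper names are hypothetical and not part of the prototype.

```python
def phrase(words, slop=0):
    """Quote words as an exact phrase; slop > 0 makes it a proximity search."""
    escaped = [w.replace("\\", "\\\\").replace('"', '\\"') for w in words]
    quoted = '"' + " ".join(escaped) + '"'
    return f"{quoted}~{slop}" if slop else quoted


def all_of(*clauses):
    """Require every clause to match."""
    return " AND ".join(clauses)


exact = phrase(["illness", "bedbound"])          # words must be adjacent
nearby = phrase(["illness", "bedbound"], slop=3)  # words within 3 positions
combined = all_of("Ramle", exact)
```

Only backslashes and double quotes are escaped here; other query-syntax characters (such as `*` or `:`) are passed through unchanged.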

@sluescher

Good morning!

I tried searching for words split across two lines and it works (or I assume it does, since I only see one line in the result, but the right document appears).

In one case, no line appeared at all. This search returns the document I want at the bottom (T-S NS J24), with the two words running from line 23 to line 24. However, no transcription line appears.

When I search for the shelfmark plus the two words (not in quotation marks), the text highlighted in the Hebrew script line is only one of the two words. I am assuming it looks for the first match? When I search with the Hebrew script in quotation marks, again no line appears?

Another thing that is somewhat annoying: when switching to Hebrew script (or Arabic), the direction of writing changes, and adding quotation marks, *, or other characters becomes a pain because they end up on the wrong side. This is an issue that Google search still has, so I am not sure whether it's fixable, but I thought I should mention it.

Boolean operators work well too. No negative reports.

@rlskoeser
Contributor Author

I noticed the direction of writing change when trying to input Hebrew search terms as well! It is indeed annoying. I'm going to put this on our list of questions coming out of the prototyping; hopefully we can get @gissoo to do some work on a better solution for the future.

Right now the transcription search works across lines, but the highlighting only applies to individual lines. I did fix it so it now shows up to three lines of matching context instead of just one, but an exact phrase that runs across lines won't ever match a single line for highlighting, which I think explains what you're seeing.

The reason I limited the highlighting to single lines was that, with whitespace preserved, the highlighted term context could be quite large for some records (many short lines). I think it's ok for now, but I'm going to add this to the prototype questions document too.
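
The mismatch between cross-line search and per-line highlighting can be shown with a toy example. This is a simplified sketch using plain substring matching; it is not how Solr's highlighter works internally, and the function names are hypothetical.

```python
def phrase_in_text(transcription, phrase):
    """Search across line breaks by treating all whitespace alike."""
    return " ".join(phrase.split()) in " ".join(transcription.split())


def highlighted_lines(transcription, terms, max_lines=3):
    """Per-line highlighting: return up to max_lines lines containing any term."""
    hits = []
    for line in transcription.split("\n"):
        if any(term in line for term in terms):
            hits.append(line)
            if len(hits) == max_lines:
                break
    return hits


text = "alpha beta\ngamma delta"
found = phrase_in_text(text, "beta gamma")          # phrase spans lines 1 and 2
context = highlighted_lines(text, ["beta gamma"])   # no single line holds the phrase
```

Here the document matches the phrase, but no individual line contains it, so there is no context line to show. Searching the two words separately would return both lines.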

@sluescher

Sounds good! Happy to sign off

3 participants