As a user, I want to search on metadata and transcriptions together so that I can find records by description or content. #10
Comments
Intersection of Cambridge PGPIDs that have links for which we have transcripts. Ignores 'b' and '-deleted' files. ^^ Test set for converting transcriptions in search prototype
Basic tei transcription conversion & search #10
The search prototype now includes text from the TEI transcriptions in the index. It's a fairly "dumb" implementation, just as a first pass so we can start searching on metadata & transcriptions together; I'm including the labels from the TEI as well as transcription text, and Solr doesn't know what language the transcriptions are, so it's treating them like English text for now. (In particular, this means stemming won't work, and tokenization may not work properly in all cases.) When your search terms match transcription text, you should see a line of context with your search term highlighted. Try searching on transcription terms alone and in combination with metadata searches.
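For anyone curious what "labels plus transcription text" might look like in practice, here is a minimal sketch, not the prototype's actual code. It assumes the transcriptions use the standard TEI namespace and keep their content in `<label>` and `<l>` elements; the function name `tei_to_index_text` and the sample document are hypothetical.

```python
import xml.etree.ElementTree as ET

# Standard TEI namespace; that the transcription files use it is an assumption.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}


def tei_to_index_text(tei_xml: str) -> str:
    """Collect label and line text from a TEI document, one entry per line.

    Hypothetical helper: assumes content lives in <label> and <l> elements
    under <text>, and that these elements are not nested inside each other.
    """
    root = ET.fromstring(tei_xml)
    text_el = root.find(".//tei:text", TEI_NS)
    if text_el is None:
        return ""
    lines = []
    for el in text_el.iter():
        # Strip the "{namespace}" prefix to get the local tag name.
        local_name = el.tag.rsplit("}", 1)[-1]
        if local_name in ("label", "l"):
            # Join nested text and collapse internal whitespace.
            content = " ".join("".join(el.itertext()).split())
            if content:
                lines.append(content)
    # Keep line breaks so each transcription line can be highlighted separately.
    return "\n".join(lines)


# Made-up sample document, for illustration only.
SAMPLE = (
    '<TEI xmlns="http://www.tei-c.org/ns/1.0"><text><body><div>'
    '<label>Recto</label><l n="1">שלום</l><l n="2">עליכם</l>'
    "</div></body></text></TEI>"
)
print(tei_to_index_text(SAMPLE))
```

The resulting string would then be sent to a single text field in the index; since no language analysis is configured, nothing here attempts stemming or script-aware tokenization.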
I do see the line of context. I can search for terms and metadata together, and the results display fine. However, I am still puzzled by the way it orders results. For example: "מרכב illness bedbound Ramle". The first word is from the transcription, while "illness", "bedbound", and "Ramle" are tags or appear in the description. However, the first two results have no Hebrew (the result I was looking for is result #3).
The relevance scores of the first few results are not very different; I think the exact match on the short tag is continuing to be scored as "more relevant" than the Hebrew term occurring once in a larger text field. Perhaps it's exaggerated because Ramle is a less common term. (That seems to be the case from what I can tell.) You can use boolean operators, like "מרכב AND illness bedbound Ramle"; you can also try exact phrase and proximity searching within the transcriptions.
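To illustrate why a match in a short tag can outrank a single match in a long transcription, here is a rough sketch using a textbook BM25-style term score. It is only an illustration: the prototype's actual similarity settings are not stated in this thread, and every number below (field lengths, document counts) is made up.

```python
import math


def bm25_term_score(tf, field_len, avg_field_len, doc_freq, num_docs,
                    k1=1.2, b=0.75):
    """Textbook BM25-style score for one term in one field.

    All parameters are illustrative; k1 and b are common default values,
    not values confirmed for this prototype.
    """
    # Rarer terms (low doc_freq) get a larger inverse document frequency.
    idf = math.log(1 + (num_docs - doc_freq + 0.5) / (doc_freq + 0.5))
    # Longer-than-average fields are penalized by the length normalization.
    norm = tf * (k1 + 1) / (tf + k1 * (1 - b + b * field_len / avg_field_len))
    return idf * norm


# One occurrence in a two-word tag field vs. one in a 400-word transcription.
short_field = bm25_term_score(tf=1, field_len=2, avg_field_len=3,
                              doc_freq=50, num_docs=1000)
long_field = bm25_term_score(tf=1, field_len=400, avg_field_len=300,
                             doc_freq=50, num_docs=1000)

# Same field, but a rare term vs. a common one.
rare_term = bm25_term_score(tf=1, field_len=2, avg_field_len=3,
                            doc_freq=5, num_docs=1000)
common_term = bm25_term_score(tf=1, field_len=2, avg_field_len=3,
                              doc_freq=500, num_docs=1000)

print(short_field > long_field, rare_term > common_term)
```

Under these assumptions, both effects push in the same direction: a rare term such as a place name matching a short tag gets a higher score than a single transcription word buried in a long text field.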
Good morning! I tried searching for words split across two lines and it works (or I assume it does, since I only see one line in the result, but the right document appears). In one case, no line appeared at all. This search returns the document I want at the bottom, T-S NS J24, with the two words running from line 23 to line 24. However, no transcription appears. When I search for the shelfmark plus the two words (not in quotation marks), only one of the two words is highlighted in the Hebrew script line. I am assuming it looks for the first match? When I search with the Hebrew script in quotation marks, again no line appears. Another thing that is somewhat annoying: when switching to Hebrew script (or Arabic), the direction of writing changes, and adding quotation marks, *, or other characters becomes a pain because they are added on the wrong side. This is an issue that Google search still has, so I am not sure whether it's fixable, but I thought I should mention it. Boolean operators work well too. No negative reports.
I noticed the direction of writing change when trying to input Hebrew search terms as well! It is indeed annoying. I'm going to put this on our list of questions coming out of the prototyping; hopefully we can get @gissoo to do some work on a better solution for the future. Right now the transcription search works across lines, but the highlighting only covers individual lines. I did fix it so it now shows up to three lines of matching context instead of just one, but an exact phrase that runs across lines won't ever match a single line, which I think explains what you're seeing. The reason I limited the highlighting to single lines was that, with whitespace preserved, the highlighted term context could be quite large for some records (many short lines). I think it's ok for now, but I'm going to add this to the prototype questions document too.
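The line-by-line highlighting behaviour can be sketched in plain Python. This is a stand-in for illustration, not the search engine's own highlighter; the function name `highlight_lines` and the `<em>` markup are assumptions.

```python
import re


def highlight_lines(transcription, terms, max_lines=3):
    """Return up to max_lines lines that contain a term, with matches marked.

    Hypothetical helper: each line is checked on its own, so a phrase that
    spans a line break can never produce a highlighted line.
    """
    terms = [t for t in terms if t]
    if not terms:
        return []
    # One pass over all terms avoids re-matching inside inserted markup.
    pattern = re.compile("|".join(re.escape(t) for t in terms))
    snippets = []
    for line in transcription.splitlines():
        if pattern.search(line):
            snippets.append(
                pattern.sub(lambda m: f"<em>{m.group(0)}</em>", line)
            )
            if len(snippets) >= max_lines:
                break
    return snippets


text = "alpha beta\ngamma delta\nepsilon"

# Separate words are found on their own lines.
print(highlight_lines(text, ["beta", "gamma"]))

# The phrase "beta gamma" crosses a line break, so no single line matches it.
print(highlight_lines(text, ["beta gamma"]))
```

This mirrors the situation described above: the document is still found because the search runs over the whole transcription, but the per-line highlighter has nothing to show for a phrase split across two lines.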
Sounds good! Happy to sign off.