FoLiA alignments in OCR output #44
Comments
Nice word choice, "lamented". It is a serious issue. At CLIN 2018 you explained that FoLiA is a format for machines. Still, users need to be able to see the output and have an indication of its quality. I would think that all the information is kept in the output, but that there is a means to select the information you want to see in a viewer.
Yes, I agree, viewers should ideally allow filtering the necessary information and present only what the user asks for. That's what FLAT does too (there are currently still issues visualising TICCL output, but at least the link is now set up). An additional plain-text output in PICCL sounds like a good idea and is simple to implement, let's see what @martinreynaert says.
Yes, that's what you want. All details available, but 'filtered out' when not needed. @kdepuydt Be glad that you don't see the hOCR files, because those really are something to lament. :)
Regarding the alignments:
As per RFC 3023 I guess we'd get:
I have now implemented the improved `href` and `format` attributes for both `hocr` and `page`.
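For readers who want to check what an alignment points at, here is a minimal sketch of reading the `href` and `format` attributes with Python's standard library. Everything in the sample fragment is an assumption made for illustration: the element names (`alignment`, `aref`), the IDs, the file names and the media type values are hypothetical, and the FoLiA default namespace is left out to keep the example short. It is not the actual ticcltools or foliautils output.

```python
import xml.etree.ElementTree as ET

XLINK = "http://www.w3.org/1999/xlink"

# Hypothetical fragment: element names, IDs, file names and media types
# are illustrative assumptions, not real pipeline output.
SAMPLE = """
<w xmlns:xlink="http://www.w3.org/1999/xlink" id="doc.w.1">
  <t>example</t>
  <alignment xlink:href="page0001.hocr" format="text/html">
    <aref id="word_1_1"/>
  </alignment>
  <alignment xlink:href="page0001.xml" format="application/xml">
    <aref id="w1"/>
  </alignment>
</w>
"""


def list_alignments(xml_text):
    """Return (href, format, [target ids]) for every alignment element."""
    root = ET.fromstring(xml_text)
    result = []
    for alignment in root.iter("alignment"):
        # xlink:href is a namespaced attribute, so it needs the {uri}name form.
        href = alignment.get(f"{{{XLINK}}}href")
        fmt = alignment.get("format")
        targets = [aref.get("id") for aref in alignment.iter("aref")]
        result.append((href, fmt, targets))
    return result


if __name__ == "__main__":
    for href, fmt, targets in list_alignments(SAMPLE):
        print(href, fmt, targets)
```

With a `format` value present, a consumer can decide per alignment whether it knows how to resolve the referenced ID in the target file, instead of guessing from the file extension.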
I'm not sure to what extent this issue is still open/relevant? I know there have been quite some changes in the ticcltools output.
This may be more of a Ticcltools or foliautils issue, but I'll post it here as it is the outcome of the pipeline. When running a document through OCR, we obtain very verbose untokenised FoLiA output as follows:
My question is about the alignments here. They refer to TIF images and mention an ID. I realize you want to tie each word to its occurrence in the image, but I don't think the TIF file contains this information (being just a bitmap afaik). Shouldn't this link to the hOCR output instead? (Or is ALTO XML still involved here, and should it link to that?) (@kosloot I'd suggest adding a format attribute on the alignment to make clear what kind of file (mimetype) it links to.)
Moreover, is this intermediate output that the PICCL OCR pipeline should publish as output for the user? Because it currently doesn't. And linking to something you don't output seems fairly useless.
During our last meeting @kdepuydt lamented that the FoLiA XML output of TICCL was not very human-readable. She has a point, but it is also kind of inevitable if you want to include all this higher-order information. The question is whether everybody wants that. A possible suggestion here could be to make outputting certain information (such as the substrings and alignments) optional. Still, I'd rather include too much information than too little.
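To illustrate the idea of optional information, here is a rough sketch of stripping substrings and alignments from a document after the fact, using only Python's standard library. The tag names (`str`, `alignment`, `aref`, `t`), the IDs and the absence of a namespace are assumptions made for the example. It is a post-processing sketch, not how PICCL or ticcltools implement or would implement such an option.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment: tag names and IDs are illustrative assumptions.
SAMPLE = """
<p id="doc.p.1">
  <t>an example</t>
  <str id="doc.p.1.str.1">
    <t>an</t>
    <alignment format="text/html"><aref id="word_1_1"/></alignment>
  </str>
  <str id="doc.p.1.str.2">
    <t>example</t>
  </str>
</p>
"""


def strip_elements(elem, tags):
    """Remove descendants of `elem` whose tag is in `tags`, in place.

    Returns the number of subtrees removed; elements nested inside an
    already removed subtree are not counted separately.
    """
    removed = 0
    for child in list(elem):  # copy, because we mutate while iterating
        if child.tag in tags:
            elem.remove(child)
            removed += 1
        else:
            removed += strip_elements(child, tags)
    return removed


if __name__ == "__main__":
    root = ET.fromstring(SAMPLE)
    count = strip_elements(root, {"str", "alignment"})
    print(count, "subtrees removed")
    print(ET.tostring(root, encoding="unicode"))
```

Filtering afterwards keeps the full output as the default, in line with preferring too much information over too little, while still giving users a leaner document when they ask for one.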