FoLiA alignments in OCR output #44

Closed
proycon opened this issue Nov 21, 2018 · 7 comments

@proycon
Member

proycon commented Nov 21, 2018

This may be more of a Ticcltools or foliautils issue, but I'll post it here as it is the outcome of the pipeline. When running a document through OCR, we obtain very verbose untokenised FoLiA output as follows:

<p xml:id="FH-OllevierGeets-001-000.tif.text.par_1_10">
 <t class="OCR">
   <t-str id="FH-OllevierGeets-001-000.tif.text.par_1_10.word_1_13">DISEASES</t-str>
   <t-str id="FH-OllevierGeets-001-000.tif.text.par_1_10.word_1_14">OF</t-str>
   <t-str id="FH-OllevierGeets-001-000.tif.text.par_1_10.word_1_15">AQUATIC</t-str>
   <t-str id="FH-OllevierGeets-001-000.tif.text.par_1_10.word_1_16">ORGANISMS</t-str>
   <t-str id="FH-OllevierGeets-001-000.tif.text.par_1_10.word_1_17">Dis.</t-str>
   <t-str id="FH-OllevierGeets-001-000.tif.text.par_1_10.word_1_18">aquat.</t-str>
   <t-str id="FH-OllevierGeets-001-000.tif.text.par_1_10.word_1_19">Org.</t-str>
 </t>
 <str annotator="folia-hocr" datetime="2018-11-19T20:47:13" xml:id="FH-OllevierGeets-001-000.tif.text.par_1_10.word_1_13"><t class="OCR" offset="0">DISEASES</t>
   <alignment xlink:href="FH-OllevierGeets-001-000.tif" xlink:type="simple">
    <aref id="word_1_13" type="str"/>
 </alignment>
</str>

My question is about the alignments here. They refer to TIF images and mention an ID. I realize you want to tie each word to its occurrence in the image, but I don't think the TIF file contains this information (being just a bitmap, afaik). Shouldn't this link to the hOCR output instead (or, if ALTO XML is still involved here, to that)? (@kosloot I'd suggest adding a format attribute on the alignment to make clear what kind of file (MIME type) it links to.)
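
To check which files the alignments in a document point at, something like the following could be used. It is only a minimal sketch: the sample is a shortened, namespaced version of the output above, and the name and placement of the `format` attribute (unqualified, on the `alignment` element) are assumptions based on the suggestion in this comment, not a confirmed part of the format.

```python
import xml.etree.ElementTree as ET

FOLIA_NS = "http://ilk.uvt.nl/folia"
XLINK_NS = "http://www.w3.org/1999/xlink"

# Shortened sample modelled on the output above; IDs are abbreviated.
SAMPLE = """<p xmlns="http://ilk.uvt.nl/folia"
   xmlns:xlink="http://www.w3.org/1999/xlink"
   xml:id="par_1_10">
 <str xml:id="par_1_10.word_1_13">
  <t class="OCR" offset="0">DISEASES</t>
  <alignment xlink:href="FH-OllevierGeets-001-000.tif" xlink:type="simple">
   <aref id="word_1_13" type="str"/>
  </alignment>
 </str>
</p>"""


def list_alignments(xml_text):
    """Return (href, format, [aref ids]) for every alignment element."""
    root = ET.fromstring(xml_text)
    found = []
    for alignment in root.iter(f"{{{FOLIA_NS}}}alignment"):
        href = alignment.get(f"{{{XLINK_NS}}}href")
        # Assumed attribute name; None when the attribute is absent.
        fmt = alignment.get("format")
        arefs = [a.get("id") for a in alignment.iter(f"{{{FOLIA_NS}}}aref")]
        found.append((href, fmt, arefs))
    return found


alignments = list_alignments(SAMPLE)
for href, fmt, arefs in alignments:
    print(href, fmt, arefs)
```

An alignment whose href ends in `.tif` and that carries no format information is the case questioned here.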

Moreover, is this intermediate output that the PICCL OCR pipeline should publish as output for the user? Because it currently doesn't. And linking to something you don't output seems fairly useless.

During our last meeting @kdepuydt lamented that the FoLiA XML output of TICCL was not very human-readable. She has a point, but it is also kind of inevitable if you want to include all this higher-order information. The question is whether everybody wants to. One option could be to make outputting certain information optional (such as the substrings and alignments). Still, I'd rather include too much information than too little.

@kdepuydt

Nice word choice, "lamented". It is a serious issue. At CLIN 2018 you explained that FoLiA is a format for machines. Still, users need to be able to see the output and get an indication of its quality. I would think that all the information is kept in the output, but that there is a means to select the information you want to see in a viewer, e.g.:
view 1: text only
view 2: show the TICCL layer below each text line
view 3: show PoS and lemma for each word below the TICCL layer, similar to what is implemented in Nederlab

@proycon
Member Author

proycon commented Nov 21, 2018

Yes, I agree, viewers should ideally allow users to filter the information and present only what they ask for. That's what FLAT does too (there are still issues visualising TICCL output currently, but at least the link is now set up).

An additional plain-text output in PICCL sounds like a good idea and is simple to implement; let's see what @martinreynaert says.
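
As a rough illustration of how small such a plain-text export could be, here is a sketch that pulls the text of one class out of each paragraph. It assumes the structure shown in the first comment (a `t` element per paragraph with `t-str` children) and is not how PICCL or foliautils actually do it; the function name `paragraph_text` is made up for this example.

```python
import xml.etree.ElementTree as ET

FOLIA_NS = "http://ilk.uvt.nl/folia"

# Shortened sample modelled on the output in the first comment.
SAMPLE = """<p xmlns="http://ilk.uvt.nl/folia" xml:id="par_1_10">
 <t class="OCR">
   <t-str id="word_1_13">DISEASES</t-str>
   <t-str id="word_1_14">OF</t-str>
   <t-str id="word_1_15">AQUATIC</t-str>
   <t-str id="word_1_16">ORGANISMS</t-str>
 </t>
</p>"""


def paragraph_text(xml_text, text_class="OCR"):
    """Return one whitespace-normalised line of text per paragraph."""
    root = ET.fromstring(xml_text)
    lines = []
    for par in root.iter(f"{{{FOLIA_NS}}}p"):
        # Only look at text elements that are direct children of the paragraph.
        for t in par.findall(f"{{{FOLIA_NS}}}t"):
            if t.get("class") == text_class:
                # Join all nested text and collapse runs of whitespace.
                lines.append(" ".join("".join(t.itertext()).split()))
    return lines


text_lines = paragraph_text(SAMPLE)
print("\n".join(text_lines))
```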

@kosloot

kosloot commented Nov 21, 2018

Yes, that's what you want: all details available, but 'filtered out' when not needed.

@kdepuydt Be glad that you don't see the HOCR files, because those are really something to lament. :)
For instance, a SINGLE space somewhere in the file:

<span class='ocr_line' id='line_1_1' title="bbox 0 859 68 1017; baseline 0 -98"><span class='ocrx_word' id='word_1_1' title='bbox 0 859 68 1017; x_wconf 95' lang='deu-frak' dir='ltr'>   </span> 
  </span>
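
Words like this can be spotted with a few lines of standard-library Python. This is only a sketch: it assumes `ocrx_word` spans never contain nested `span` elements, and the second word in the sample is invented purely for contrast.

```python
from html.parser import HTMLParser

# First word copied from the example above; word_1_2 is made up for contrast.
SAMPLE = """<span class='ocr_line' id='line_1_1' title="bbox 0 859 68 1017; baseline 0 -98"><span class='ocrx_word' id='word_1_1' title='bbox 0 859 68 1017; x_wconf 95' lang='deu-frak' dir='ltr'>   </span>
<span class='ocrx_word' id='word_1_2' title='bbox 70 859 140 1017; x_wconf 90'>Dis.</span>
</span>"""


class EmptyWordFinder(HTMLParser):
    """Collect the ids of ocrx_word spans that hold only whitespace."""

    def __init__(self):
        super().__init__()
        self.empty_ids = []
        self._word_id = None  # id of the word span currently open, if any
        self._text = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if tag == "span" and "ocrx_word" in classes:
            self._word_id = attrs.get("id")
            self._text = []

    def handle_data(self, data):
        if self._word_id is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        # Assumes no nested spans inside a word, so this closes the word.
        if tag == "span" and self._word_id is not None:
            if not "".join(self._text).strip():
                self.empty_ids.append(self._word_id)
            self._word_id = None


finder = EmptyWordFinder()
finder.feed(SAMPLE)
finder.close()
print(finder.empty_ids)
```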

Regarding the alignments:
@proycon You are right, they should refer to the HOCR file, not the TIFF.
I'll fix this and, while I'm at it, add a format attribute. Does HOCR have a special Mime type?

@proycon
Member Author

proycon commented Nov 21, 2018

> Does HOCR have a special Mime type?

Following the +xml suffix convention of RFC 3023, I guess we'd get application/hocr+xml or text/hocr+xml.

@kosloot kosloot added the test label Nov 22, 2018
@kosloot

kosloot commented Nov 22, 2018

I have now implemented the improved 'href' and 'format' attributes for both 'hocr' and 'page'.

@proycon
Member Author

proycon commented Feb 14, 2020

I'm not sure to what extent this issue is still open/relevant? I know there have been quite some changes in the ticcltools output.

@proycon
Member Author

proycon commented Apr 15, 2020

(expired)

@proycon proycon closed this as completed Apr 15, 2020