FoLiA alignments in OCR output #44

proycon · 2018-11-21T20:40:06Z

This may be more of a Ticcltools or foliautils issue, but I'll post it here as it is the outcome of the pipeline. When running a document through OCR, we obtain very verbose untokenised FoLiA output as follows:

<p xml:id="FH-OllevierGeets-001-000.tif.text.par_1_10">
  <t class="OCR">
    <t-str id="FH-OllevierGeets-001-000.tif.text.par_1_10.word_1_13">DISEASES</t-str>
    <t-str id="FH-OllevierGeets-001-000.tif.text.par_1_10.word_1_14">OF</t-str>
    <t-str id="FH-OllevierGeets-001-000.tif.text.par_1_10.word_1_15">AQUATIC</t-str>
    <t-str id="FH-OllevierGeets-001-000.tif.text.par_1_10.word_1_16">ORGANISMS</t-str>
    <t-str id="FH-OllevierGeets-001-000.tif.text.par_1_10.word_1_17">Dis.</t-str>
    <t-str id="FH-OllevierGeets-001-000.tif.text.par_1_10.word_1_18">aquat.</t-str>
    <t-str id="FH-OllevierGeets-001-000.tif.text.par_1_10.word_1_19">Org.</t-str>
  </t>
  <str annotator="folia-hocr" datetime="2018-11-19T20:47:13" xml:id="FH-OllevierGeets-001-000.tif.text.par_1_10.word_1_13">
    <t class="OCR" offset="0">DISEASES</t>
    <alignment xlink:href="FH-OllevierGeets-001-000.tif" xlink:type="simple">
      <aref id="word_1_13" type="str"/>
    </alignment>
  </str>

My question is about the alignments here. They refer to tif images and mention an ID. I realize you want to tie each word to its occurrence in the image. But I don't think the TIF file contains this information (being just a bitmap afaik). Shouldn't this link to the hOCR output instead? (or is ALTO XML still involved here and should it be that?). (@kosloot I'd suggest adding a format attribute on the alignment to make clear to what kind of file (mimetype) it links)
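
To check what the alignments in a document actually point at, something along these lines could be used. It is only a sketch under assumptions: the namespace declarations on the sample are added here so that the fragment parses on its own, the format attribute is the one suggested above (not something the current output contains), and list_alignment_targets is a hypothetical helper, not part of foliautils.

```python
import xml.etree.ElementTree as ET

FOLIA_NS = "http://ilk.uvt.nl/folia"
XLINK_NS = "http://www.w3.org/1999/xlink"

# Reduced copy of the fragment above; the xmlns declarations are added
# for this sketch so that the snippet is parseable on its own.
SAMPLE = """\
<p xmlns="http://ilk.uvt.nl/folia" xmlns:xlink="http://www.w3.org/1999/xlink"
   xml:id="FH-OllevierGeets-001-000.tif.text.par_1_10">
  <str xml:id="FH-OllevierGeets-001-000.tif.text.par_1_10.word_1_13">
    <t class="OCR" offset="0">DISEASES</t>
    <alignment xlink:href="FH-OllevierGeets-001-000.tif" xlink:type="simple">
      <aref id="word_1_13" type="str"/>
    </alignment>
  </str>
</p>
"""


def list_alignment_targets(xml_text):
    """Return (href, format, aref id) for every aref in every alignment."""
    root = ET.fromstring(xml_text)
    targets = []
    for alignment in root.iter(f"{{{FOLIA_NS}}}alignment"):
        href = alignment.get(f"{{{XLINK_NS}}}href")
        # 'format' is the proposed attribute; it is None when absent.
        fmt = alignment.get("format")
        for aref in alignment.iter(f"{{{FOLIA_NS}}}aref"):
            targets.append((href, fmt, aref.get("id")))
    return targets


ALIGNMENT_TARGETS = list_alignment_targets(SAMPLE)
for href, fmt, ref_id in ALIGNMENT_TARGETS:
    print(href, fmt, ref_id)
```

On the sample this reports the .tif file as the target with no format, which is exactly the situation questioned here.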

Moreover, is this intermediate output that the PICCL OCR pipeline should publish as output for the user? Because it currently doesn't. And linking to something you don't output seems fairly useless.

During our last meeting @kdepuydt lamented that the FoLiA XML output of TICCL is not very human-readable. She has a point, but it is also more or less inevitable if you want to include all this higher-order information. The question is whether everybody wants that. One option would be to make certain information optional in the output (such as the substrings and alignments). Still, I'd rather include too much information than too little.

kdepuydt · 2018-11-21T21:22:44Z

Nice word choice, "lamented". It is a serious issue. At CLIN 2018 you explained that FoLiA is a format for machines. Still, users need to be able to see the output and get an indication of its quality. I would think that all the information is kept in the output, but that a viewer offers a means to select the information you want to see.
E.g. view 1: text only
view 2: show the TICCL layer below each text line
view 3: show PoS and lemma for each word below the TICCL layer, similar to what is implemented in Nederlab

proycon · 2018-11-21T21:35:26Z

Yes, I agree, viewers should ideally let the user filter the information and present only what is asked for. That's what FLAT does too (there are currently still issues visualising TICCL output, but at least the link is now set up).

An additional plain-text output in PICCL sounds like a good idea and is simple to implement; let's see what @martinreynaert says.
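
As a rough illustration of how little is needed for such a plain-text output, here is a minimal sketch. It assumes the paragraph-level text content with class OCR is what should be exported, uses shortened IDs for readability, and plain_text is a hypothetical helper rather than anything PICCL or foliautils provides.

```python
import xml.etree.ElementTree as ET

FOLIA_NS = "http://ilk.uvt.nl/folia"

# Shortened, self-contained sample modelled on the OCR output in this
# issue; the IDs are abbreviated and the namespace is declared here.
SAMPLE = """\
<text xmlns="http://ilk.uvt.nl/folia">
  <p xml:id="par_1_10">
    <t class="OCR">
      <t-str id="par_1_10.word_1_13">DISEASES</t-str>
      <t-str id="par_1_10.word_1_14">OF</t-str>
      <t-str id="par_1_10.word_1_15">AQUATIC</t-str>
      <t-str id="par_1_10.word_1_16">ORGANISMS</t-str>
    </t>
  </p>
</text>
"""


def plain_text(xml_text, textclass="OCR"):
    """Return one line of plain text per paragraph for the given text class."""
    root = ET.fromstring(xml_text)
    lines = []
    for par in root.iter(f"{{{FOLIA_NS}}}p"):
        # Only direct <t> children of the paragraph, so the <t> elements
        # nested inside <str> are not counted a second time.
        for t in par.findall(f"{{{FOLIA_NS}}}t"):
            if t.get("class") == textclass:
                # itertext() also yields the whitespace between <t-str>
                # children; split/join collapses it to single spaces.
                lines.append(" ".join("".join(t.itertext()).split()))
    return "\n".join(lines)


PLAIN = plain_text(SAMPLE)
print(PLAIN)
```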

kosloot · 2018-11-21T21:47:44Z

Yes, that's what you want: all details available, but 'filtered out' when not needed.

@kdepuydt Be glad that you don't see the hOCR files, because those are really something to lament. :)
For instance, a SINGLE space somewhere in the file:

<span class='ocr_line' id='line_1_1' title="bbox 0 859 68 1017; baseline 0 -98"><span class='ocrx_word' id='word_1_1' title='bbox 0 859 68 1017; x_wconf 95' lang='deu-frak' dir='ltr'>   </span> 
  </span>
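
Such whitespace-only words can be spotted mechanically. The following is a sketch only: it assumes the fragment parses as XML without a namespace (a complete hOCR file is normally XHTML and would need the XHTML namespace in the tag names), and parse_title and empty_words are hypothetical helper names.

```python
import xml.etree.ElementTree as ET

# The fragment shown above, copied as a self-contained string.
SAMPLE = """<span class='ocr_line' id='line_1_1' title="bbox 0 859 68 1017; baseline 0 -98"><span class='ocrx_word' id='word_1_1' title='bbox 0 859 68 1017; x_wconf 95' lang='deu-frak' dir='ltr'>   </span>
  </span>"""


def parse_title(title):
    """Split an hOCR title attribute into {property: [values]}."""
    props = {}
    for part in title.split(";"):
        fields = part.split()
        if fields:
            props[fields[0]] = fields[1:]
    return props


def empty_words(hocr_fragment):
    """Return (id, bbox, x_wconf) for ocrx_word spans without visible text."""
    root = ET.fromstring(hocr_fragment)
    found = []
    for span in root.iter("span"):
        if span.get("class") == "ocrx_word" and not (span.text or "").strip():
            props = parse_title(span.get("title", ""))
            found.append((span.get("id"), props.get("bbox"), props.get("x_wconf")))
    return found


EMPTY_WORDS = empty_words(SAMPLE)
print(EMPTY_WORDS)
```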

Regarding the alignments:
@proycon You are right, they should refer to the hOCR file, not the TIFF.
I'll fix this and will add a format attribute while I'm at it. Does hOCR have a special MIME type?

proycon · 2018-11-21T22:03:26Z

> Does hOCR have a special MIME type?

As per RFC 3023, I guess we'd get application/hocr+xml or text/hocr+xml.

kosloot · 2018-11-22T13:19:29Z

I have now implemented the improved 'href' and 'format' attributes for both 'hocr' and 'page'.

proycon · 2020-02-14T12:38:55Z

I'm not sure to what extent this issue is still open/relevant. I know there have been quite a few changes in the ticcltools output.

proycon · 2020-04-15T15:38:14Z

(expired)

proycon assigned martinreynaert and kosloot Nov 21, 2018

kosloot added the test label Nov 22, 2018

proycon added the expired label Apr 15, 2020

proycon closed this as completed Apr 15, 2020
