ocrd-fileformat-transform does not add an ALTO Processing tag #35

mikegerber · 2021-11-08T18:31:30Z

I believe it would be helpful if the ocrd-fileformat-transform PAGE → ALTO transformation added a <Processing> tag. I looked into the file to figure out whether https://github.com/kba/page-to-alto was used for the conversion and did not find a processing tag for the conversion, just for segmentation/binarization/OCR.

mikegerber · 2021-11-08T18:32:00Z

(Alternatively, page-to-alto could add it, of course.)

kba · 2021-11-09T11:51:55Z

Can you provide an example of PAGE input and how you'd like to see it converted? page-to-alto should convert processing metadata, cf. https://github.com/kba/page-to-alto/blob/master/ocrd_page_to_alto/convert.py#L248-L265

mikegerber · 2021-11-09T12:21:23Z

Yes it does convert the processing metadata correctly, but does not add itself as a processing step - which would have been helpful as I was investigating whether page-to-alto was used for the conversion using ocrd-fileformat-transform. Here is an example, this was converted using ocrd-fileformat-transform:

    <Processing ID="ocrd-eynollah-segment-0">
      <processingStepDescription>layout/segmentation/region</processingStepDescription>
      <processingStepSettings>{"models": "/data/default", "dpi": "0", "full_layout": "True", "curved_line": "False", "allow_scaling": "False", "headers_off": "False"}</processingStepSettings>
      <processingSoftware>
        <softwareName>ocrd-eynollah-segment</softwareName>
      </processingSoftware>
    </Processing>
    <Processing ID="ocrd-sbb-binarize-1">
      <processingStepDescription>preprocessing/optimization/binarization</processingStepDescription>
      <processingStepSettings>{"model": "/data/sbb_binarization/models", "operation_level": "page"}</processingStepSettings>
      <processingSoftware>
        <softwareName>ocrd-sbb-binarize</softwareName>
      </processingSoftware>
    </Processing>
    <Processing ID="ocrd-tesserocr-recognize-2">
      <processingStepDescription>layout/segmentation/region</processingStepDescription>
      <processingStepSettings>{"model": "deu", "dpi": "0", "padding": "0", "segmentation_level": "word", "textequiv_level": "word", "overwrite_segments": "False", "overwrite_text": "True", "shrink_polygons": "False", "block_polygons": "False", "find_tables": "True", "sparse_text": "False", "raw_lines": "False", "char_whitelist": "", "char_blacklist": "", "char_unblacklist": "", "tesseract_parameters": "{}", "xpath_parameters": "{}", "xpath_model": "{}", "auto_model": "False", "oem": "DEFAULT"}</processingStepSettings>
      <processingSoftware>
        <softwareName>ocrd-tesserocr-recognize</softwareName>
      </processingSoftware>
    </Processing>

Full PAGE + ALTO:
example.zip

mikegerber · 2021-11-09T12:26:29Z

What I would expect is an additional processing step like this (entirely made up):

    <Processing ID="ocrd-fileformat-transform-3">
      <processingStepDescription>conversion</processingStepDescription>
      <processingStepSettings>{"backend": "page-to-alto"}</processingStepSettings>
      <processingSoftware>
        <softwareName>ocrd-fileformat-transform</softwareName>
      </processingSoftware>
    </Processing>

I know this is extra work but it's very useful to answer the question of how a file was created exactly.
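To illustrate the provenance question, here is a minimal Python sketch that lists the software names recorded in the `<Processing>` entries of an ALTO file. It assumes an ALTO v4 namespace and the element layout shown in the excerpts above; the function names `processing_software_names` and `was_processed_by` are hypothetical helpers, not part of page-to-alto or ocrd_fileformat.

```python
import xml.etree.ElementTree as ET
from typing import List

# Assumption: the file uses the ALTO v4 namespace.
ALTO_NS = {"alto": "http://www.loc.gov/standards/alto/ns-v4#"}


def processing_software_names(alto_xml: str) -> List[str]:
    """Return the softwareName of every Processing step, in document order."""
    root = ET.fromstring(alto_xml)
    path = "alto:Description/alto:Processing/alto:processingSoftware/alto:softwareName"
    return [(el.text or "").strip() for el in root.findall(path, ALTO_NS)]


def was_processed_by(alto_xml: str, software_name: str) -> bool:
    """True if some Processing step names the given software."""
    return software_name in processing_software_names(alto_xml)


if __name__ == "__main__":
    # Tiny hand-written sample, not taken from example.zip.
    sample = (
        '<alto xmlns="http://www.loc.gov/standards/alto/ns-v4#"><Description>'
        '<Processing ID="ocrd-sbb-binarize-1"><processingSoftware>'
        "<softwareName>ocrd-sbb-binarize</softwareName>"
        "</processingSoftware></Processing>"
        "</Description></alto>"
    )
    print(processing_software_names(sample))
    print(was_processed_by(sample, "ocrd-fileformat-transform"))
```

With the current behaviour, such a check cannot tell whether the conversion step ran, because the converter leaves no entry of its own.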

kba · 2021-11-09T13:10:33Z

Gotcha, yes this makes sense, at least for the OCR-D processor interface.

mikegerber · 2021-11-09T14:16:25Z

I would argue that it also makes sense for page-to-alto alone, as this conversion is a big processing step.

bertsky · 2021-11-30T21:42:51Z

> I would argue that it also makes sense for page-to-alto alone, as this conversion is a big processing step.

I'd argue to the contrary, that page-to-alto's job (despite being nontrivial) is to do exactly as it is told, not add provenance or other traces. It will be most versatile that way. Then in ocrd-fileformat-transform, we can fully inform about the processor and its options. (While in other use cases, we might want to hide the conversion.)

bertsky · 2021-11-30T22:02:08Z

But then again, doing this from page-to-alto or ocr-fileformat/script/transform/page__alto is much easier than from ocrd_fileformat. In the latter case, one would have to

- check the target format
- in the case of PAGE-XML, add a /pc:PcGts/pc:Metadata/pc:MetadataItem (as in ocrd-olena-binarize)
- in the case of ALTO-XML >= 4, add a /alto/Description/Processing (as outlined above)
- in the case of ALTO-XML < 4, add a /alto/Description/OCRProcessing/postProcessingStep (in an analogous way)
- in the case of hOCR...?

These editing commands should be done by a true XML editor, like xmlstarlet. That would have to be added to the system dependencies.
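As a rough illustration of the ALTO-XML >= 4 case only, the following Python sketch appends a `<Processing>` element to `/alto/Description` using the standard library instead of xmlstarlet. It assumes the ALTO v4 namespace and the ID pattern seen in the example above (software name plus a running index). The function name `add_processing_step` is hypothetical and not part of ocrd_fileformat or page-to-alto.

```python
import json
import xml.etree.ElementTree as ET

# Assumption: the target file uses the ALTO v4 namespace.
ALTO_NS = "http://www.loc.gov/standards/alto/ns-v4#"
ET.register_namespace("", ALTO_NS)  # serialize without a namespace prefix


def _q(tag: str) -> str:
    """Qualify a tag name with the ALTO namespace."""
    return f"{{{ALTO_NS}}}{tag}"


def add_processing_step(root, software_name, description, settings):
    """Append a Processing step to /alto/Description and return it."""
    desc = root.find(_q("Description"))
    if desc is None:
        raise ValueError("ALTO document has no Description element")
    # Running index as in the example: ...-0, ...-1, ...-2, then ...-3.
    index = len(desc.findall(_q("Processing")))
    step = ET.SubElement(desc, _q("Processing"), {"ID": f"{software_name}-{index}"})
    ET.SubElement(step, _q("processingStepDescription")).text = description
    ET.SubElement(step, _q("processingStepSettings")).text = json.dumps(settings)
    software = ET.SubElement(step, _q("processingSoftware"))
    ET.SubElement(software, _q("softwareName")).text = software_name
    return step


if __name__ == "__main__":
    doc = ET.fromstring(f'<alto xmlns="{ALTO_NS}"><Description/></alto>')
    # Values taken from the made-up example earlier in this thread.
    add_processing_step(doc, "ocrd-fileformat-transform", "conversion",
                        {"backend": "page-to-alto"})
    print(ET.tostring(doc, encoding="unicode"))
```

The PAGE-XML, ALTO-XML < 4 and hOCR cases would each need their own branch, which is the extra effort described above.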

Perhaps one should even offer a parameter to make this postprocessing/annotation optional.

mikegerber · 2021-12-02T10:45:09Z

> > I would argue that it also makes sense for page-to-alto alone, as this conversion is a big processing step.
>
> I'd argue to the contrary, that page-to-alto's job (despite being nontrivial) is to do exactly as it is told, not add provenance or other traces. It will be most versatile that way.

It does processing, so why should it not add processing info? I think it's not correct to omit it.
