ocrd-fileformat-transform does not add an ALTO Processing tag #35

Open
mikegerber opened this issue Nov 8, 2021 · 9 comments

@mikegerber

I believe it would be helpful if the ocrd-fileformat-transform PAGE → ALTO transformation added a <Processing> tag. I looked into the file to figure out whether https://github.com/kba/page-to-alto was used for the conversion and did not find a processing tag for the conversion, only for segmentation/binarization/OCR.

@mikegerber
Author

(Alternatively, page-to-alto could add it, of course.)

@kba
Member

kba commented Nov 9, 2021

Can you provide an example of PAGE input and how you'd like to see it converted? page-to-alto should convert processing metadata, cf. https://github.com/kba/page-to-alto/blob/master/ocrd_page_to_alto/convert.py#L248-L265

@mikegerber
Author

Yes, it does convert the processing metadata correctly, but it does not add itself as a processing step. That would have been helpful, as I was investigating whether page-to-alto was used for the conversion by ocrd-fileformat-transform. Here is an example that was converted using ocrd-fileformat-transform:

    <Processing ID="ocrd-eynollah-segment-0">
      <processingStepDescription>layout/segmentation/region</processingStepDescription>
      <processingStepSettings>{"models": "/data/default", "dpi": "0", "full_layout": "True", "curved_line": "False", "allow_scaling": "False", "headers_off": "False"}</processingStepSettings>
      <processingSoftware>
        <softwareName>ocrd-eynollah-segment</softwareName>
      </processingSoftware>
    </Processing>
    <Processing ID="ocrd-sbb-binarize-1">
      <processingStepDescription>preprocessing/optimization/binarization</processingStepDescription>
      <processingStepSettings>{"model": "/data/sbb_binarization/models", "operation_level": "page"}</processingStepSettings>
      <processingSoftware>
        <softwareName>ocrd-sbb-binarize</softwareName>
      </processingSoftware>
    </Processing>
    <Processing ID="ocrd-tesserocr-recognize-2">
      <processingStepDescription>layout/segmentation/region</processingStepDescription>
      <processingStepSettings>{"model": "deu", "dpi": "0", "padding": "0", "segmentation_level": "word", "textequiv_level": "word", "overwrite_segments": "False", "overwrite_text": "True", "shrink_polygons": "False", "block_polygons": "False", "find_tables": "True", "sparse_text": "False", "raw_lines": "False", "char_whitelist": "", "char_blacklist": "", "char_unblacklist": "", "tesseract_parameters": "{}", "xpath_parameters": "{}", "xpath_model": "{}", "auto_model": "False", "oem": "DEFAULT"}</processingStepSettings>
      <processingSoftware>
        <softwareName>ocrd-tesserocr-recognize</softwareName>
      </processingSoftware>
    </Processing>

Full PAGE + ALTO:
example.zip

@mikegerber
Author

mikegerber commented Nov 9, 2021

What I would expect is an additional processing step like this (entirely made up):

    <Processing ID="ocrd-fileformat-transform-3">
      <processingStepDescription>conversion</processingStepDescription>
      <processingStepSettings>{"backend": "page-to-alto"}</processingStepSettings>
      <processingSoftware>
        <softwareName>ocrd-fileformat-transform</softwareName>
      </processingSoftware>
    </Processing>

I know this is extra work but it's very useful to answer the question of how a file was created exactly.
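
To answer that "how was this file created" question programmatically, a minimal sketch along these lines could list the steps an ALTO file records. It uses only Python's standard `xml.etree.ElementTree`. The helper names `local_name` and `processing_software` and the inline `SAMPLE` document are made up for illustration, and elements are matched by local name so the check does not depend on the ALTO namespace version.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip a leading '{namespace}' from an ElementTree tag."""
    return tag.rsplit("}", 1)[-1]


def processing_software(alto_xml):
    """Return the softwareName of every <Processing> step, in document order."""
    root = ET.fromstring(alto_xml)
    names = []
    for elem in root.iter():
        if local_name(elem.tag) != "Processing":
            continue
        for child in elem.iter():
            if local_name(child.tag) == "softwareName":
                names.append((child.text or "").strip())
    return names


# Hypothetical, heavily shortened ALTO 4 document for demonstration only.
SAMPLE = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v4#">
  <Description>
    <Processing ID="ocrd-tesserocr-recognize-2">
      <processingSoftware>
        <softwareName>ocrd-tesserocr-recognize</softwareName>
      </processingSoftware>
    </Processing>
  </Description>
</alto>"""

names = processing_software(SAMPLE)
print(names)
# The conversion step is missing, which is the situation described in this issue.
print("ocrd-fileformat-transform" in names)
```

With the extra `<Processing>` entry proposed above, the second check would come back true.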

@kba
Member

kba commented Nov 9, 2021

Gotcha, yes this makes sense, at least for the OCR-D processor interface.

@mikegerber
Author

I would argue that it also makes sense for page-to-alto alone, as this conversion is a big processing step.

@bertsky
Contributor

bertsky commented Nov 30, 2021

> I would argue that it also makes sense for page-to-alto alone, as this conversion is a big processing step.

I'd argue to the contrary, that page-to-alto's job (despite being nontrivial) is to do exactly as it is told, not add provenance or other traces. It will be most versatile that way. Then in ocrd-fileformat-transform, we can fully inform about the processor and its options. (While in other use cases, we might want to hide the conversion.)

@bertsky
Contributor

bertsky commented Nov 30, 2021

But then again, doing this from page-to-alto or ocr-fileformat/script/transform/page__alto is much easier than from ocrd_fileformat. In the latter case, one would have to

  • check the target format
  • in the case of PAGE-XML, add a /pc:PcGts/pc:Metadata/pc:MetadataItem (as in ocrd-olena-binarize)
  • in the case of ALTO-XML >= 4, add a /alto/Description/Processing (as outlined above)
  • in the case of ALTO-XML < 4, add a /alto/Description/OCRProcessing/postProcessingStep (in an analogous way)
  • in the case of hOCR...?

These editing commands should be done by a true XML editor, like xmlstarlet. That would have to be added to the system dependencies.

Perhaps one should even offer a parameter to make this postprocessing/annotation optional.
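
For illustration, the format check and the ALTO >= 4 case from the list above might look roughly like this. Note that this is a Python sketch using the standard library's `xml.etree.ElementTree` rather than xmlstarlet, and it is not how ocrd_fileformat currently works. The function names `target_format` and `add_conversion_step`, the `conversion` step description, and the `<software>-<index>` ID scheme (copied from the example IDs earlier in this thread) are all assumptions. The PAGE-XML, ALTO < 4 and hOCR cases are deliberately left unimplemented.

```python
import json
import xml.etree.ElementTree as ET

ALTO4_NS = "http://www.loc.gov/standards/alto/ns-v4#"


def target_format(root):
    """Classify a parsed document by its root element (the 'check the target format' step)."""
    namespace, _, name = root.tag.lstrip("{").rpartition("}")
    if name == "PcGts":
        return "page"
    if name == "alto":
        return "alto4" if namespace == ALTO4_NS else "alto-legacy"
    return "unknown"


def add_conversion_step(xml_text, software, settings):
    """Append a <Processing> entry under /alto/Description and return the new XML."""
    ET.register_namespace("", ALTO4_NS)  # serialize ALTO as the default namespace
    root = ET.fromstring(xml_text)
    fmt = target_format(root)
    if fmt != "alto4":
        # PAGE-XML, ALTO < 4 and hOCR would each need their own rule.
        raise NotImplementedError(f"no annotation rule sketched for {fmt!r}")
    description = root.find(f"{{{ALTO4_NS}}}Description")
    if description is None:
        raise ValueError("ALTO document has no Description element")
    # Assumed ID scheme: software name plus the number of steps already present.
    count = len(description.findall(f"{{{ALTO4_NS}}}Processing"))
    step = ET.SubElement(
        description, f"{{{ALTO4_NS}}}Processing", {"ID": f"{software}-{count}"}
    )
    ET.SubElement(step, f"{{{ALTO4_NS}}}processingStepDescription").text = "conversion"
    ET.SubElement(step, f"{{{ALTO4_NS}}}processingStepSettings").text = json.dumps(settings)
    sw = ET.SubElement(step, f"{{{ALTO4_NS}}}processingSoftware")
    ET.SubElement(sw, f"{{{ALTO4_NS}}}softwareName").text = software
    return ET.tostring(root, encoding="unicode")


# Hypothetical, heavily shortened ALTO 4 input for demonstration only.
SAMPLE = f"""<alto xmlns="{ALTO4_NS}">
  <Description>
    <Processing ID="ocrd-tesserocr-recognize-0">
      <processingSoftware><softwareName>ocrd-tesserocr-recognize</softwareName></processingSoftware>
    </Processing>
  </Description>
</alto>"""

print(add_conversion_step(SAMPLE, "ocrd-fileformat-transform", {"backend": "page-to-alto"}))
```

The optional parameter mentioned above could then simply decide whether `add_conversion_step` is called at all.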

@mikegerber
Author

> > I would argue that it also makes sense for page-to-alto alone, as this conversion is a big processing step.
>
> I'd argue to the contrary, that page-to-alto's job (despite being nontrivial) is to do exactly as it is told, not add provenance or other traces. It will be most versatile that way.

It does processing, so why should it not add processing info? I think it's not correct to omit it.
