Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

workspace_validator: allow skipping pcgtsid check #1066

Merged
merged 4 commits into from
Jul 10, 2023

Conversation

bertsky
Copy link
Collaborator

@bertsky bertsky commented Jun 29, 2023

This is necessary as some of our processors do not use set_pcGtsId(file.ID), IINM currently:

  • all ocrd_segment processors
  • ocrd-keraslm-rate (was already correct)
  • ocrd-repair-inconsistencies
  • ocrd-detectron2-segment
  • ocrd-fileformat-transform (hard to enforce in submodule transform, only for PAGE output)
  • ocrd-im6convert (does not apply: only image output)
  • ocrd-page-transform

Copy link
Member

@kba kba left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense to have this check optional since we cannot always enforce it (though we should, as you have started, ensure that OCR-D processors handle this correctly).

Minor note: Perhaps just pcgtsid instead of mets_fileid_page_pcgtsid which is easier to type?

@bertsky
Copy link
Collaborator Author

bertsky commented Jul 7, 2023

Minor note: Perhaps just pcgtsid instead of mets_fileid_page_pcgtsid which is easier to type?

It's not clear what is meant by just pcgtsid, and it's also not obvious that we are trying to enforce mets:file/@ID=pc:PcGts/@pcGtsId. So I still think the longer identifier is warranted. (We could try to improve the filtering options and docstrings though...)

- let `download=True` only download URLs to temporary location on demand (like processors do), not convert to local refs (like `workspace find --download` would)
- instead of adding notices to the validation report for remote files not downloaded, just add a warning to the logger
@bertsky
Copy link
Collaborator Author

bertsky commented Jul 9, 2023

@kba in 6991166 I also changed the behaviour of download=True: IMO it was not correct for the workspace validator to convert remote refs to local – instead, that option now does what all our processors do, i.e. downloading on demand. Also, I fixed #450 (which had been closed earlier by accident): not downloaded files must not add to the report, these belong to the logger.

@bertsky
Copy link
Collaborator Author

bertsky commented Jul 9, 2023

But actually IMHO the workspace validator should be rewritten even more: It does not make sense to iterate over the same workspace for each PAGE XML test independently. Instead, imagefilename, dimension and page should enter as a page_checks post-parsing step (as I have started with mets_fileid_page_pcgtsid) when the PAGE XML has been downloaded and opened.

@bertsky
Copy link
Collaborator Author

bertsky commented Jul 9, 2023

But actually IMHO the workspace validator should be rewritten even more: It does not make sense to iterate over the same workspace for each PAGE XML test independently. Instead, imagefilename, dimension and page should enter as a page_checks post-parsing step (as I have started with mets_fileid_page_pcgtsid) when the PAGE XML has been downloaded and opened.

see #1070 – to be discussed

@kba
Copy link
Member

kba commented Jul 10, 2023

OK, let's merge this now and continue the discussion on improving the validator (which has not been substantially updated in too long a time) in #1070 .

@kba kba merged commit eede09e into master Jul 10, 2023
2 checks passed
@bertsky
Copy link
Collaborator Author

bertsky commented Jul 10, 2023

OK, let's merge this now and continue the discussion on improving the validator (which has not been substantially updated in too long a time) in #1070 .

I thought we do it the other way round, hence #1070 was against this PR branch, not master. Since AFAIK one cannot re-target PRs, I'll close and open a new one against master.

@kba
Copy link
Member

kba commented Jul 10, 2023

Since AFAIK one cannot re-target PR

Oh, I did not realize, sry about that.

@kba kba deleted the workspace-validator-skip-pcgtsid branch July 10, 2023 12:05
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants