Validation of suspiciously small text regions #118

Open
tboenig opened this issue Jul 11, 2019 · 5 comments

@tboenig
Contributor

tboenig commented Jul 11, 2019

Validation of PAGE should make sure that there are no regions, lines etc. whose outlines degenerate to lines or points, or that are very small.

Any suggestions on realistic dimensions for raising a warning? For example, less than 10 pixels wide or high?

OCR-D/assets#28
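
A minimal sketch of the proposed check, assuming the outline is given as a PAGE-style `points` string (`x1,y1 x2,y2 ...`). The function names and the default of 10 pixels are illustrative only (the 10 is the figure floated above), and this is not the validator's actual API:

```python
from typing import List, Tuple


def bbox_size(points: str) -> Tuple[int, int]:
    """Return (width, height) of the bounding box of a PAGE-style points string."""
    xy: List[Tuple[int, int]] = []
    for pair in points.split():
        x, y = pair.split(",")
        xy.append((int(x), int(y)))
    xs = [x for x, _ in xy]
    ys = [y for _, y in xy]
    return max(xs) - min(xs), max(ys) - min(ys)


def is_suspiciously_small(points: str, min_px: int = 10) -> bool:
    """True if the bounding box is narrower or lower than min_px (hypothetical threshold)."""
    width, height = bbox_size(points)
    return width < min_px or height < min_px
```

For instance, `is_suspiciously_small("0,0 100,0 100,5 0,5")` flags a box that is 100 pixels wide but only 5 pixels high.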

@tboenig tboenig assigned kba and wrznr Jul 11, 2019
@bertsky
Collaborator

bertsky commented Jul 11, 2019

Yes, again, a case for OCR-D/core#252

@cneud
Member

cneud commented Jul 31, 2019

Wait, what are "suspiciously small" regions? Will this not get hairy fast with heuristics based on dimensions? What about e.g. thin separator lines or punctuation marks?

@bertsky
Collaborator

bertsky commented Jul 31, 2019

I think what @tboenig meant was suspiciously small text regions (and lines).

And yes, that would have to depend on the DPI of the input, too.
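
One way to make the threshold resolution-independent would be to state it in physical units and convert it per image. This is only a sketch: `min_pixels` is a hypothetical helper, and it assumes the DPI of the input image is known and reliable:

```python
MM_PER_INCH = 25.4


def min_pixels(min_mm: float, dpi: float) -> float:
    """Convert a physical minimum size in millimetres to pixels at the given DPI."""
    if dpi <= 0:
        raise ValueError("dpi must be positive")
    return min_mm * dpi / MM_PER_INCH
```

With this, the same physical limit yields a pixel threshold at 600 DPI that is twice the one at 300 DPI.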

And yes, it could still get hairy with single-region punctuation marks or page numbers like "I" – but too many warnings in the validator are still better than searching the complete haystack by hand, right? Perhaps geometry heuristics should differentiate between forbidden and suspicious?
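
The forbidden/suspicious split could look roughly like this. It is a sketch under assumptions: outlines that degenerate to a point or a line (fewer than three points, or zero area) count as forbidden, and outlines with a bounding box below a pixel threshold count as suspicious. The names `Verdict` and `classify` and the default threshold are made up for illustration:

```python
from enum import Enum
from typing import List, Tuple

Point = Tuple[int, int]


class Verdict(Enum):
    OK = "ok"
    SUSPICIOUS = "suspicious"  # worth a warning
    FORBIDDEN = "forbidden"    # geometrically degenerate


def polygon_area(xy: List[Point]) -> float:
    """Absolute polygon area via the shoelace formula."""
    n = len(xy)
    twice = sum(
        xy[i][0] * xy[(i + 1) % n][1] - xy[(i + 1) % n][0] * xy[i][1]
        for i in range(n)
    )
    return abs(twice) / 2


def classify(xy: List[Point], min_px: int = 10) -> Verdict:
    """Classify an outline as forbidden, suspicious or ok (hypothetical rules)."""
    if len(xy) < 3 or polygon_area(xy) == 0:
        return Verdict.FORBIDDEN
    xs = [x for x, _ in xy]
    ys = [y for _, y in xy]
    if max(xs) - min(xs) < min_px or max(ys) - min(ys) < min_px:
        return Verdict.SUSPICIOUS
    return Verdict.OK
```

A page number like "I" would then typically come out as suspicious rather than forbidden, so it produces a warning that a human can dismiss.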

@cneud cneud changed the title Validation of suspiciously small regions Validation of suspiciously small text regions Aug 1, 2019
@cneud
Member

cneud commented Aug 1, 2019

@bertsky Thanks, I've updated the title accordingly. Anyway, for all "validations" that are not directly related to violations of the PAGE schema, I would expect a warning or suspicious flag rather than error or forbidden.

@bertsky
Collaborator

bertsky commented Aug 2, 2019

@cneud Absolutely! This is not about the XML syntax, but about our (application-specific) semantic constraints. So maybe we should call this whole thing evaluation instead of validation, and have the report give a score instead of a boolean? (We could even offer different metrics for different situations, as in PRImA's layout evaluation profiles...)
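
To illustrate the score idea, here is a rough sketch that turns per-element findings into a number between 0 and 1. The labels and penalty weights are invented for the example and would differ per profile; nothing here reflects an existing report format:

```python
from typing import Dict, Iterable

# Hypothetical penalty per finding; a different profile could weight these differently.
DEFAULT_PENALTY: Dict[str, float] = {"ok": 0.0, "suspicious": 0.5, "forbidden": 1.0}


def evaluation_score(
    findings: Iterable[str], penalty: Dict[str, float] = DEFAULT_PENALTY
) -> float:
    """Return 1.0 for a clean page, decreasing with the share of penalised elements."""
    findings = list(findings)
    if not findings:
        return 1.0
    total = sum(penalty[f] for f in findings)
    return 1.0 - total / len(findings)
```

Passing a different `penalty` mapping is one simple way to model different metrics for different situations.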


5 participants