Validation of suspiciously small text regions #118
Comments
Yes, again, a case for OCR-D/core#252
Wait, what are "suspiciously small" regions? Will this not get hairy fast with heuristics based on dimensions? What about e.g. thin separator lines or punctuation marks?
I think what @tboenig meant was suspiciously small text regions (and lines). And yes, that would have to depend on the DPI of the input, too. And yes, it could still get hairy with single-region punctuation marks or page numbers like "I", but too many warnings in the validator are still better than searching the complete haystack by hand, right? Perhaps geometry heuristics should differentiate between
@bertsky Thanks, I've updated the title accordingly. Anyway, for all "validations" that are not directly related to violations of the PAGE schema I would expect a
@cneud Absolutely! This is not about the XML syntax, but about our (application-specific) semantic constraints. So maybe we should call this whole thing evaluation instead of validation, and have the report give a score instead of a boolean? (We could even offer different metrics for different situations, as in PRImA's layout evaluation profiles...)
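To illustrate the "score instead of a boolean" idea, here is a minimal sketch. The function name `size_score`, the input format (a list of width/height pairs in pixels) and the default threshold are assumptions made for this example; nothing here is part of the OCR-D validator or of PRImA's evaluation profiles.

```python
def size_score(sizes, min_size=10):
    """Return the fraction of elements that are NOT suspiciously small.

    `sizes` is a list of (width, height) pairs in pixels (hypothetical input
    format). An element counts as suspicious if either dimension is below
    `min_size`. An empty page scores 1.0, since there is nothing to flag.
    """
    if not sizes:
        return 1.0
    ok = sum(1 for width, height in sizes if width >= min_size and height >= min_size)
    return ok / len(sizes)


# Two of these four elements are thinner than 10 px in one dimension.
print(size_score([(500, 200), (390, 5), (4, 40), (100, 30)]))
```

A report could then show this ratio next to the individual warnings, so that a page with one tiny page number is not treated the same as a page full of degenerate regions.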
Validation of PAGE should make sure that there are no regions, lines etc. that are degenerate (a mere line or point) or very small.
Any suggestions on realistic dimensions to raise a warning? Less than 10 pixels wide or high.
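A rough sketch of what such a check could look like, using only the Python standard library. It relies on the PAGE convention that `Coords/@points` holds space-separated `x,y` pairs. The function names, the 300 DPI reference used for scaling and the inline sample are assumptions for illustration, not existing OCR-D code; the 10 px threshold is just the value floated above.

```python
import xml.etree.ElementTree as ET

MIN_SIZE_PX = 10  # threshold suggested in this issue, not an established value


def min_size_for_dpi(dpi, base_px=MIN_SIZE_PX, base_dpi=300):
    """Scale the pixel threshold with resolution (300 DPI reference is an assumption)."""
    return round(base_px * dpi / base_dpi)


def bounding_box(points):
    """Return (width, height) for a points string 'x1,y1 x2,y2 ...', or None if empty."""
    coords = [tuple(int(v) for v in pair.split(",")) for pair in points.split()]
    if not coords:
        return None
    xs = [x for x, _ in coords]
    ys = [y for _, y in coords]
    return max(xs) - min(xs), max(ys) - min(ys)


def local_name(tag):
    """Strip an XML namespace prefix like '{uri}TextLine' down to 'TextLine'."""
    return tag.rsplit("}", 1)[-1]


def find_small_elements(root, min_size=MIN_SIZE_PX):
    """Collect (tag, id, width, height) for text regions/lines below min_size."""
    warnings = []
    for elem in root.iter():
        name = local_name(elem.tag)
        if name not in ("TextRegion", "TextLine"):
            continue
        for child in elem:
            if local_name(child.tag) != "Coords":
                continue
            box = bounding_box(child.get("points", ""))
            if box and (box[0] < min_size or box[1] < min_size):
                warnings.append((name, elem.get("id"), box[0], box[1]))
            break  # only the element's own Coords, not those of nested elements
    return warnings


# Simplified sample without a namespace declaration; real PAGE files have one.
SAMPLE = """<PcGts><Page>
<TextRegion id="r1"><Coords points="0,0 500,0 500,200 0,200"/>
<TextLine id="l1"><Coords points="10,10 400,10 400,15 10,15"/></TextLine>
</TextRegion>
<TextRegion id="r2"><Coords points="600,600 604,600 604,640 600,640"/></TextRegion>
</Page></PcGts>"""

for warning in find_small_elements(ET.fromstring(SAMPLE)):
    print("suspiciously small:", warning)
```

This only looks at the axis-aligned bounding box, so a long, thin diagonal polygon would pass unnoticed. It also restricts itself to `TextRegion` and `TextLine`, which leaves separator regions out of the check.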
OCR-D/assets#28