add semantics to coordinate system #13

Closed
bertsky opened this issue Jul 17, 2019 · 4 comments · Fixed by #16

Comments

@bertsky
Contributor

bertsky commented Jul 17, 2019

Coordinates are at the heart of stand-off annotation formats. In PAGE-XML, all visible elements must have a CoordsType, which must have a @points. There is even some syntax for that enforced by a regular expression. However, the standard lacks any semantics for the coordinate system whatsoever. There is not even a comment about this, so with luck, at least all implementors guessed consistently.

IMO we need to specify that:

  1. @points always describes (a list of x-y pairs of) absolute pixel coordinates ("absolute" meaning they refer to the root image in PageType/@imageFilename with the upper left corner as 0,0)
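
A minimal sketch of how an implementor might read `@points` under this interpretation. The regular expression is an assumption about the schema's syntax check (space-separated pairs of non-negative integers, at least two pairs), and `parse_points` is a hypothetical helper name, not part of any PAGE library:

```python
import re

# Assumed syntax: "x1,y1 x2,y2 ..." with non-negative integers only.
POINTS_PATTERN = re.compile(r"^[0-9]+,[0-9]+( [0-9]+,[0-9]+)+$")


def parse_points(points):
    """Turn a @points string into a list of (x, y) integer tuples.

    The values are taken as absolute pixel coordinates in the root image,
    with (0, 0) at the upper left corner.
    """
    if not POINTS_PATTERN.match(points):
        raise ValueError(f"malformed @points value: {points!r}")
    return [tuple(int(v) for v in pair.split(",")) for pair in points.split(" ")]


coords = parse_points("0,0 10,0 10,5")
```

Negative values fail the pattern, which mirrors the non-negative restriction mentioned above.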

Moreover, we should clarify whether:

  1. @points has a topology of
    • (unordered) sets of points, or
    • a single (open or closed) path, or
    • multiple closed paths (and if so, whether orientation is relevant as in e.g. left=inside / right=outside)
  2. @points must obey certain constraints like
    • are paths allowed to leave the parent element's polygon outline / bounding box, or maybe even the page's bounding box (i.e. become negative, which is currently forbidden by syntax)? And if not:
    • must they be closed along the parent element's polygon outline / bounding box, or may they stay open when intersecting it?
    • are paths required to be planar (i.e. have no cross-sections)? And if not:
    • how does the content area compute,
      • by union, or
      • by difference, or
      • by orientation (left-of-path or right-of-path)?
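
To make the orientation and planarity questions concrete, here is a rough sketch in plain Python. It assumes a single path that is closed implicitly (last point connects back to the first), and the function names are hypothetical. The crossing test only detects proper crossings, so touching or collinear overlapping edges are not reported:

```python
def signed_area(polygon):
    """Shoelace formula over an implicitly closed path.

    With the y axis pointing down (image coordinates), a positive result
    means the points run clockwise on screen.
    """
    total = 0
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        total += x1 * y2 - x2 * y1
    return total / 2


def _cross(o, a, b):
    """Cross product of the vectors o->a and o->b."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def _segments_cross(p1, p2, q1, q2):
    """True if segment p1-p2 properly crosses segment q1-q2."""
    d1 = _cross(q1, q2, p1)
    d2 = _cross(q1, q2, p2)
    d3 = _cross(p1, p2, q1)
    d4 = _cross(p1, p2, q2)
    return d1 * d2 < 0 and d3 * d4 < 0


def is_planar(polygon):
    """True if no two non-adjacent edges of the closed path cross."""
    n = len(polygon)
    edges = [(polygon[i], polygon[(i + 1) % n]) for i in range(n)]
    for i in range(n):
        for j in range(i + 2, n):
            if i == 0 and j == n - 1:
                continue  # first and last edge share the start point
            if _segments_cross(*edges[i], *edges[j]):
                return False
    return True


square = [(0, 0), (10, 0), (10, 10), (0, 10)]
bowtie = [(0, 0), (10, 10), (10, 0), (0, 10)]
```

For the bowtie, the two lobes have opposite orientation, so the signed area cancels out. This is one reason the union, difference and orientation readings can give different content areas for the same `@points`.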

This is highly relevant for implementors, especially when polygon processing and AlternativeImage processing on multiple hierarchy levels in the presence of skew become common practice – which is currently happening within OCR-D (for showcases see our Tesseract and our Ocropy preprocessing and segmentation wrappers).

(Cf. altoxml/schema#49)

@chris1010010
Contributor

Good points ;-)

  1. Agreed
  • What would an unordered set of points represent?
  • At the moment, it's intended as single path (closed in case of regions etc. open in case of baseline)
  • In our understanding (although never specified in the format except the non-negative check) the paths should stay inside the page / parent object and they should be non-self-overlapping. Obviously that can't be enforced in XML, but in Aletheia we use higher-level validation to check for such things. If paths self-overlap we convert to a union shape I think. We don't crop polygons if they are outside their parent, but there are tools for that in Aletheia

@bertsky
Contributor Author

bertsky commented Jul 18, 2019

  • Agreed

Splendid! Would you like me to do a PR?

* What would an unordered set of points represent?

I don't know. It just seemed like the minimal option. As in: "no interpretation is guaranteed, help yourself!" Or as in having no specification at all. Implementors could try to always compute the outer hull, or try their luck with path interpretations...

* At the moment, it's intended as single path (closed in case of regions etc. open in case of baseline)

Ok, fair enough. (Closed by description – at least one pair must repeat – or closed by convention – the first pair is meant to be repeated?)

But what about cases where the region is non-contiguous, because e.g. a TextRegion gets flowed over by an ImageRegion, or a TextLine by a GraphicRegion? In that case, only having a single path necessitates including the intruders, so the only way to get rid of them for further processing (layout / dewarping / recognition) would be to offer an AlternativeImage where they get clipped to white. See here for example images on this approach.
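
The clipping idea can be illustrated with a toy raster. This is only a sketch under simplifying assumptions: the image is a grayscale list of rows, the intruder is an implicitly closed polygon, pixels are sampled at their centres, and `white_out` is a hypothetical name. A real processor would use an image library and reference the result as an AlternativeImage:

```python
WHITE = 255


def point_in_polygon(x, y, polygon):
    """Even-odd ray casting test for an implicitly closed polygon."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Only edges that straddle the horizontal line through y can be hit.
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def white_out(image, intruder):
    """Return a copy of image with pixels inside intruder set to white."""
    result = [row[:] for row in image]
    for y, row in enumerate(result):
        for x in range(len(row)):
            if point_in_polygon(x + 0.5, y + 0.5, intruder):
                row[x] = WHITE
    return result


region_image = [[0] * 4 for _ in range(4)]
intruder = [(1, 1), (3, 1), (3, 3), (1, 3)]
clipped = white_out(region_image, intruder)
```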

* In our understanding (although never specified in the format except the non-negative check) the paths should stay inside the page / parent object and they should be non-self-overlapping.

I was hoping you would say so. (But I do get non-planar polygons from Tesseract sometimes, and some contour libraries never bother to close their paths.)

* Obviously that can't be enforced in XML, but in Aletheia we use higher-level validation to check for such things.

Yes, as with most of the semantics, this would be a matter of some non-XSD validation. In OCR-D, we are planning to write one using geometry heuristics. Is there some place I can look at the respective rules in Aletheia?
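
As a rough illustration of what such a non-XSD check could look like (this is not Aletheia's rule set nor the planned OCR-D validator, and all names are hypothetical), a containment heuristic based on bounding boxes might be sketched as follows. Whether a coordinate equal to the image width or height should count as inside is itself an assumption here:

```python
def bounding_box(polygon):
    """Return (min_x, min_y, max_x, max_y) of a list of (x, y) points."""
    xs = [x for x, _ in polygon]
    ys = [y for _, y in polygon]
    return min(xs), min(ys), max(xs), max(ys)


def check_containment(child, parent, page_width, page_height):
    """Collect human-readable problems instead of failing on the first one."""
    problems = []
    cx0, cy0, cx1, cy1 = bounding_box(child)
    # Assumption: coordinates up to and including width/height are accepted.
    if cx0 < 0 or cy0 < 0 or cx1 > page_width or cy1 > page_height:
        problems.append("child leaves the page")
    px0, py0, px1, py1 = bounding_box(parent)
    if cx0 < px0 or cy0 < py0 or cx1 > px1 or cy1 > py1:
        problems.append("child leaves the parent's bounding box")
    return problems


region = [(10, 10), (100, 10), (100, 100), (10, 100)]
line_ok = [(20, 20), (90, 20), (90, 40), (20, 40)]
line_bad = [(20, 20), (120, 20), (120, 40), (20, 40)]
```

A bounding box test is cheaper but weaker than a true polygon containment test, so it can only flag violations, not prove that a child stays inside its parent's outline.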

* If paths self-overlap we convert to a union shape I think.

That is also what Tesseract does itself (if asked to return a raw image of blocks from layout analysis). But if this is forbidden by the schema, it's totally up to the processor/library trying to produce PAGE to handle this case if it does arise internally. (Maybe it's still worth commenting on in the schema, though.)

@chris1010010
Contributor

chris1010010 commented Jul 18, 2019

Closed by convention (first pair repeats).
Yes, "intruders" are a problem, but simplicity was favored over being able to cover all use cases (this was a decision by the creators). PAGE was never intended for pixel-accurate description.
There's a list in the Aletheia user guide (page 118).
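
The closing convention described in this comment could be handled by a consumer roughly like this. It is a sketch that assumes regions are treated as closed and baselines as open, with `segments` as a hypothetical helper name:

```python
def segments(points, closed):
    """Return the list of edges of a path as ((x1, y1), (x2, y2)) pairs.

    For closed paths the edge from the last point back to the first is
    added by convention, unless the producer already repeated the first
    pair explicitly.
    """
    edges = list(zip(points, points[1:]))
    if closed and len(points) > 1 and points[0] != points[-1]:
        edges.append((points[-1], points[0]))
    return edges


outline = [(0, 0), (4, 0), (4, 3)]
region_edges = segments(outline, closed=True)
baseline_edges = segments(outline, closed=False)
```

Tolerating an explicitly repeated first pair keeps the helper usable with producers that close their paths by description rather than by convention.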

@bertsky
Contributor Author

bertsky commented Jul 18, 2019

Thanks a lot for all the clarification! I hope the PR meets your approval.
