"Load Result" with parsererror #326

Closed
l0rn0r opened this issue Nov 8, 2022 · 6 comments
Labels
Type: Bug Indicates an unexpected problem or unintended behavior.

Comments

@l0rn0r

l0rn0r commented Nov 8, 2022

Hello
I'm running the OCR4all Docker container on Ubuntu 20.04.
It works quite well, but I get an error when I try to load a PageXML file in LAREX.

I have a page open in the LAREX editor that has gone through every OCR4all step up to recognition. When I try to load an already existing PageXML file for this page (to check whether I can load ground-truth text for training), I get the error message:
"Couldn't retrieve annotations from file."

The console says
"request:/file/upload/annotations - fail 'parsererror'"
which points to
Larex/resources/js/viewer/communicator.js, line 17 (the failed POST request).
The write permissions on the server's data folder should be fine (777).
The PageXML file is version 2013-07-15.

Any hint for this problem?
Or any hint how to load ground truth from existing PageXMLs to train a new model?

@bertsky

bertsky commented Nov 8, 2022

Don't remember anything about OCR4all integration (request API), but I often see this error with valid PAGE-XML files when

  • a TextEquiv has no Unicode or Plaintext element
  • some @regionRef does not exist
  • some OrderedGroup(Indexed) or UnorderedGroup(Indexed) is empty (has no child elements)
  • some @points are negative or float (which is also invalid by schema)

(This is because the PRImA parser is not very robust and does not expose the internal cause of the error correctly.)
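
For anyone hitting the same error, here is a minimal Python sketch that scans a PAGE-XML file for the four cases listed above. It is not part of LAREX, OCR4all, or the PRImA tools. The function name `find_parser_traps`, the sample document, and the points regex are assumptions made for illustration, and "no Unicode child" is only one reading of the first bullet. It uses the standard library only and does not perform schema validation.

```python
import re
import xml.etree.ElementTree as ET

# Assumption: points are non-negative integer "x,y" pairs separated by single spaces.
POINTS_RE = re.compile(r"\d+,\d+( \d+,\d+)*")

GROUP_TAGS = {"OrderedGroup", "OrderedGroupIndexed",
              "UnorderedGroup", "UnorderedGroupIndexed"}


def local(tag):
    """Strip the '{namespace}' prefix from an ElementTree tag."""
    return tag.rsplit("}", 1)[-1]


def find_parser_traps(root):
    """Return human-readable findings for the four cases listed above."""
    findings = []
    ids = {el.get("id") for el in root.iter() if el.get("id")}
    for el in root.iter():
        name = local(el.tag)
        # Case 1: TextEquiv without a Unicode child.
        if name == "TextEquiv" and not any(local(c.tag) == "Unicode" for c in el):
            findings.append("TextEquiv without Unicode child")
        # Case 2: regionRef pointing at an id that does not exist.
        ref = el.get("regionRef")
        if ref is not None and ref not in ids:
            findings.append(f"{name}: regionRef '{ref}' matches no id")
        # Case 3: reading-order group with no child elements.
        if name in GROUP_TAGS and len(el) == 0:
            findings.append(f"{name} '{el.get('id')}' has no children")
        # Case 4: negative or float coordinates.
        points = el.get("points")
        if points is not None and not POINTS_RE.fullmatch(points):
            findings.append(f"{name}: suspicious points '{points}'")
    return findings


# Made-up sample that triggers cases 1, 2 and 4.
SAMPLE = """<PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15">
  <Page imageFilename="p.png" imageWidth="100" imageHeight="100">
    <ReadingOrder>
      <OrderedGroup id="ro1">
        <RegionRefIndexed index="0" regionRef="r9"/>
      </OrderedGroup>
    </ReadingOrder>
    <TextRegion id="r1">
      <Coords points="0,0 10.5,0 10,-3"/>
      <TextEquiv><PlainText>abc</PlainText></TextEquiv>
    </TextRegion>
  </Page>
</PcGts>"""

for finding in find_parser_traps(ET.fromstring(SAMPLE)):
    print(finding)
```

To check a real file, replace `ET.fromstring(SAMPLE)` with `ET.parse(path).getroot()`.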

@maxnth maxnth added the Type: Bug label Feb 20, 2023
@maxnth
Member

maxnth commented Feb 20, 2023

Excuse the late reply, I somehow totally overlooked this issue.
As already mentioned above, this is most likely caused by a PAGE XML file which isn't valid according to the schema.
If you could upload the XML file which causes the error, I'll have a look at it.

@bertsky

bertsky commented Feb 26, 2023

Except for the last point (@points format), these are all cases which do not violate the schema. It's only the PRImA parser that fails. This is reproducible with all PRImA tools (editor, converter, viewer, layout evaluation), too.

I don't have examples readily available, but it should be straightforward to construct some from your existing test cases.

@maxnth
Member

maxnth commented Feb 26, 2023

Except for the last point (@points format), these are all cases which do not violate the schema.

I'm not an XML schema expert, so the following train of thought might be flawed, but I'd be interested to know why the above-mentioned cases wouldn't make the XML invalid.

  • @regionRef has IDREF as type and AFAIK this should always require the referenced ID to be present in the document according to the XML Schema Definition to make the document valid, doesn't it?
  • e.g. OrderedGroup requires minOccurs="1" for either RegionRefIndexed / OrderedGroupIndexed / UnorderedGroupIndexed so it being completely empty shouldn't be valid according to the schema
  • As there isn't any minOccurs value explicitly set for Unicode elements in a TextEquiv it defaults to minOccurs="1" and therefore should be mandatory to make the document valid
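
The minOccurs default in the last bullet can be illustrated with a small Python sketch. The XSD fragment below is a simplified, hypothetical stand-in, not a copy of the PAGE schema, and `effective_min_occurs` is a made-up helper. The only rule it relies on is that XML Schema treats an absent minOccurs as 1.

```python
import xml.etree.ElementTree as ET

XS = "{http://www.w3.org/2001/XMLSchema}"

# Hypothetical, simplified fragment; NOT copied from the PAGE schema.
XSD_FRAGMENT = """
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:complexType name="TextEquivType">
    <xs:sequence>
      <xs:element name="PlainText" type="xs:string" minOccurs="0"/>
      <xs:element name="Unicode" type="xs:string"/>
    </xs:sequence>
  </xs:complexType>
</xs:schema>
"""


def effective_min_occurs(schema_root):
    """Map each declared element name to its minOccurs, applying the XSD default of 1."""
    return {
        el.get("name"): int(el.get("minOccurs", "1"))
        for el in schema_root.iter(XS + "element")
    }


# An element declared without minOccurs ends up with 1, i.e. it is mandatory.
print(effective_min_occurs(ET.fromstring(XSD_FRAGMENT)))
```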

@bertsky

bertsky commented Feb 26, 2023

  • @regionRef has IDREF as type and AFAIK this should always require the referenced ID to be present in the document according to the XML Schema Definition to make the document valid, doesn't it?

You're right. A dangling IDREF should make the document invalid according to the XML specification. I had based my judgement on the behaviour of the libxml2 implementation, which does not check IDREF.

  • e.g. OrderedGroup requires minOccurs="1" for either RegionRefIndexed / OrderedGroupIndexed / UnorderedGroupIndexed so it being completely empty shouldn't be valid according to the schema

Right again, my bad.

  • As there isn't any minOccurs value explicitly set for Unicode elements in a TextEquiv it defaults to minOccurs="1" and therefore should be mandatory to make the document valid

Again, you're spot on. Sorry for my sloppy nonsense! (I carried this misconception with me for quite some time...)

@maxnth
Member

maxnth commented Feb 27, 2023

I'll close this for now, feel free to reopen this @l0rn0r if the issue still persists and isn't caused by invalid PAGE XML (or if the invalid PAGE XML is produced by OCR4all).

@maxnth maxnth closed this as completed Feb 27, 2023
3 participants