Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

EMLParser misses validity problems #328

Closed
mbjones opened this issue Dec 2, 2018 · 6 comments
Closed

EMLParser misses validity problems #328

mbjones opened this issue Dec 2, 2018 · 6 comments
Assignees
Labels
Milestone

Comments

@mbjones
Copy link
Contributor

@mbjones mbjones commented Dec 2, 2018

The EMLParser fails to detect invalid EML documents during the test phase. Clean up the build and test system, and ensure that all files that are not following the EML rules fail the validation and that the tests detect that. This includes:

  • Add invalid test files for id/references checks
  • Add invalid files for annotations
  • Add test output to understand what is failing
  • Stop duplicating files in multiple locations unnecessarily
@mbjones mbjones added this to the EML2.2.0 milestone Dec 2, 2018
@mbjones mbjones self-assigned this Dec 2, 2018
mbjones added a commit that referenced this issue Dec 5, 2018
Also add annotation tests for invalid files. Related to issue #328.
@mbjones

This comment has been minimized.

Copy link
Contributor Author

@mbjones mbjones commented Jan 31, 2019

I completely rewrote the validation in a new EMLValidator class that is massively more efficient. For some reason, the old EMLParser class was repeatedly parsing the XML document for every single XPath lookup (of which there were lots). There was also a lit of complexity in the validation code ad config files that was unneeded, and has now been boiled down to a simple validate function.

Need to compare the results of these tests with the independently implemented tests in the R emld package. @cboettig, could you please take a look and see if it makes sense to you how I did it in Java? And do you get the same results against the test files in the EML src/test/resources directory?

@cboettig

This comment has been minimized.

Copy link
Member

@cboettig cboettig commented Feb 1, 2019

@mbjones thanks! emld includes a copy of the src/test/resources directory already that it tests against. I'll sync over these updates and the test suite should mostly just pick these up. Will need to probably write a separate test script for the invalidEML cases. I'm a bit backlogged as usual but hope to get through that soon.

emld is currently nearly through rOpenSci review, which means we should be able to get a candidate on CRAN and update EML on CRAN relatively soon too (hoping to have this done close to when EML 2.2.0 ships!)

@mbjones

This comment has been minimized.

Copy link
Contributor Author

@mbjones mbjones commented Feb 1, 2019

Thanks @cboettig. You can see how I'm using the three directories for testing in the EMLValidator test, but it boils down to:

  • src/test/resources: contains valid EML documents exercising a bunch of features (I wish they were ore systematic about it, but its a hodgepodge)
  • src/test/resources/invalidEML: contains EML docs that have one (and only one) problem with one of the extended validation rules (these are still schema valid, just not following EMLs validation rules)
  • src/test/resources/moduleEML: contains docs that exercise one particular EML module and are not rooted at eml, so they wouldn't normally be considered valid EML, but they are schema valid against the XSD schemas
@mbjones

This comment has been minimized.

Copy link
Contributor Author

@mbjones mbjones commented Feb 3, 2019

After discussion, I have reintroduced the EMLParser class for backwards compatibility with older versions of EML (prior to 2.2.0). Now, calling EMLParser(file) and EMLParser(string) will delegate validation to EMLValidator if the version of the EML file is 2.2.0 or later. Otherwise, it uses the original, buggy logic and implementation within EMLParser. This will allow implementations that rely on the older, buggy logic to continue to function. However, new EML documents SHOULD be validated using the EMLValidator.validate() method to ensure they conform to all EML rules.

Because the old EMLParser still uses its original logic for versions 2.1.1 and before, it is still very slow. For version 2.2.0 and after, it should be faster now due to the new delegated implementation.

At this point, once people review the validation logic, this bug can be closed.

@mbjones

This comment has been minimized.

Copy link
Contributor Author

@mbjones mbjones commented Feb 11, 2019

Jing built and tested the validator and things seem to be working well in terms of backwards compatibility. There are still some issues around getting EML 2.2.0 into Metacat, but it seems the validator is working. See Jing's email below for details.

I built the datamanager-0.9.1.jar file on the eml 2.2 branch and used it
to build a Metacat. Then I run the Metacat junit tests against the new
Metacat. All tests passed except the expected two failures of the
integration replication tests. This is good.

I also used Morpho 1.11.0 to generate EML objects on the Metacat and it
worked. Then I used dataone command to create EML objects in the three
scenarios:

  1. Valid eml 2.1.1 object - the creation succeeded.

  2. Invalid eml 2.1.1 object with a wrong element name - the creation failed

  3. Invalid eml 2.1.1 object with duplicated id - the creation failed.

So the tests result shows the new datamanager-0.9.1.jar file in Metacat
works good with released eml namespaces.

I also imported EML 2.2.0 schema to my local Metacat and tried the above
three scenario with eml 2.2.0:

  1. Valid eml 2.2.0 object - the creation succeeded.

  2. Invalid eml 2.2.0 object with a wrong element name - the creation failed

  3. Invalid eml 2.2.0 object with duplicated id - the creation succeeded.

The scenario 3 should fail but it succeeded. I took a look at Metacat
code and found it is an issue on Metacat itself - it hasn't the
mechanism to launch the eml parser for objects with the eml 2.2.0
namespace. I input a ticket for this issue:

NCEAS/metacat#1316

I also set up an eml parser servlet service. But the service failed when
I tried to parse an eml object. Please see this ticket:

#331

@mbjones

This comment has been minimized.

Copy link
Contributor Author

@mbjones mbjones commented Feb 11, 2019

Closing validation as complete. If further bugs surface, please enter a new ticket to describe the issue.

@mbjones mbjones closed this Feb 11, 2019
@mbjones mbjones removed the needs-review label Feb 11, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
2 participants
You can’t perform that action at this time.