Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

EML parser passes records which are missing customUnit definitions #355

Open
mobb opened this issue May 1, 2020 · 6 comments
Open

EML parser passes records which are missing customUnit definitions #355

mobb opened this issue May 1, 2020 · 6 comments

Comments

@mobb
Copy link
Contributor

mobb commented May 1, 2020

The EML parser: https://knb.ecoinformatics.org/emlparser/parse
no longer traps datasets which are missing their customUnit definitions in metadata. I.e., this passes the parser (no additionalMetadata section)

:

<eml:eml ...
  <dataset>
   ....
              <unit>
                <customUnit>millimetersPerYear</customUnit>
              </unit>
  ...
  </dataset>
</eml:eml>

Test doc is uploaded (with extension .txt for github)
edi.9.0_2.1.1_custom_unit_missing.txt

mbjones added a commit that referenced this issue May 1, 2020
Tests are now failing for the EML parser and validator and need to be fixed.
@mbjones
Copy link
Member

mbjones commented May 2, 2020

I can verify this for both EML 2.1.1 that @mobb sent, as well as for EML 2.2.0 documents. The issue seems to arise from customUnit using a different XPath selector for its references than other key/keyref pairs in the document, and so unit validation was skipped in the new EMLValidator class.

The original EMLParser class (now Deprecated) configuration included the following key/keyref pair check for units:

  <key name="unitIdentifierKey">
    <selector xpath="//unitList/unit"/>
    <field xpath="@id"/>
  </key>
  <keyref name="unitReference" refer="unitIdentifierKey">
    <selector xpath="//unit/customUnit"/>
    <field xpath="."/>
  </keyref>

This was not implemented in EMLValidator and needs to be added. I committed two test files to the develop branch (for 2.1.1 and 2.2.0) that illustrate the problem. The travis checks for EML on the develop branch will start failing until we fix this.

@mbjones
Copy link
Member

mbjones commented May 2, 2020

I commited a fix for this to the develop branch in SHA 4cd1f9d.

In the process, I found several erroneous test documents in the test suite, which are now also fixed in the branch. I updated the validation rules documentation to reflect this additional rule which got lost in translation from the previous release. The validation algorithm for units is now:

        // If a customUnit is referenced in a document, it must have a
        // corresponding STMML unit definition in the document with a matching
        // @id
        ArrayList<String> unitIdentifiers = getXPathValues("//*[local-name() = 'unitList']/*[local-name() = 'unit']/@id");
        ArrayList<String> unitReferences = getXPathValues("//unit/customUnit");
        for (String s : unitReferences) {
            if (!unitIdentifiers.contains(s)) {
                errors.add("Invalid: Custom unit definition missing: " + s);
                isValid = false;
            }
        }

@cboettig and @amoeba and @mobb Please review and let me know if that looks correct to you. If so, we'll need to be sure to add a corresponding validation check to emld in R.

This set of changes does not change the EML schema, but it does affect any system that might have done a validation check before accepting content, such as Metacat. @taojing2002 will need to incorporate a new built version of the eml.jar into Metacat, and we'll need to redeploy the online parser on the KNB. We will also need to redeploy the EML site documentation to reflect the updated validation documentation.

Comments appreciated.

@mobb
Copy link
Contributor Author

mobb commented May 2, 2020

This fix looks like it will catch docs with missing unit definitions.

HOWEVER: I am pretty sure that the previous parse behavior was to match the <customUnit> to the XPath unitList/unit/@name, and not to unitList/unit/@id. I only know this as a user, having been bit by it several times. Is there a way to confirm that?

IMO, id is a better choice. But if this is a change in behavior, we should be prepared for some docs that passed the parser previously to now fail.

@mbjones
Copy link
Member

mbjones commented May 4, 2020

Hey @mobb thanks for the feedback. I accidentally started with @name, and found that produces validation errors in EMLParser. The XPath that I showed above from the configuration file shows that we previously used @id, so I stuck with that. The one thing I did do was make the XPath agnostic with respect to namespace, as I found live examples that were both explicit about the STMML namespace and others that omitted it. So, now both are allowed in my revised XPath via use of local-name().

@mbjones
Copy link
Member

mbjones commented May 4, 2020

@taojing2002 When this upgrade is applied to Metacat on our repositories and for DataONE, it may find that documents that previously passed the EMLValidator no longer do so.... which concerns me, and we need a strategy. It would be interesting to find out which documents fail before we deploy it widely (which we might be able to do with a modified quality check with @gothub that runs the new validator on all DataONE content).

@amoeba
Copy link
Contributor

amoeba commented Mar 4, 2021

Sam Csik found this example of a doc missing a few custom unit defs that passed Metacat's validation so I'll add it here https://search.dataone.org/view/doi:10.18739/A29882N5H.

@mbjones mbjones added this to the EML3.0.0 milestone Mar 4, 2021
servilla added a commit to PASTAplus/PASTA that referenced this issue Sep 21, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants