Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

clarify validation rules #306

Closed
mbjones opened this issue May 14, 2018 · 5 comments
Closed

clarify validation rules #306

mbjones opened this issue May 14, 2018 · 5 comments
Assignees
Labels
Milestone

Comments

@mbjones
Copy link
Contributor

@mbjones mbjones commented May 14, 2018

EML has had additional rules for validation that were somewhat opaquely described in section 3.3 of the spec. Clarify and update these rules to make it crystal clear what a conformant parser should be testing for.

@mbjones mbjones added this to the EML2.2.0 milestone May 14, 2018
@mbjones mbjones self-assigned this May 14, 2018
@mbjones mbjones added in progress and removed next labels Jul 3, 2018
mbjones added a commit that referenced this issue Nov 27, 2018
Also provided a pseudo-code example of validation logic, as requested in issue #306.
@mbjones

This comment has been minimized.

Copy link
Contributor Author

@mbjones mbjones commented Nov 28, 2018

I reworked the validation rules, which are now much more explicit in the specification. Please review here: https://github.com/NCEAS/eml/blob/BRANCH_EML_2_2/docs/eml-validation-refs.md

@cboettig and @csjx Could you take a look at these and make sure everything seems complete?

@mbjones mbjones added needs-review and removed in progress labels Nov 28, 2018
@cboettig

This comment has been minimized.

Copy link
Member

@cboettig cboettig commented Nov 29, 2018

See my take on the validation rules in ropensci/emld#26 (as code and with a few minor questions)

@srearl

This comment has been minimized.

Copy link
Contributor

@srearl srearl commented Nov 29, 2018

reads clearly; made a few grammatical edits

@cboettig

This comment has been minimized.

Copy link
Member

@cboettig cboettig commented Dec 3, 2018

@mbjones Can you clarify a few things for me:

  • must packageId also be distinct from any id (+system) attribute, or is that okay?
  • The validation rules don't say anything about units. @amoeba tells me that checking <standardUnit> unit names against standardUnits list and customUnit against a provided unitList should be part of the standard validation (which makes a lot of sense). Shouldn't this be added to the above list?
  • is <alternateIdentifier> subject to any identifier uniqueness checks?
  • Your prose version of the validation rule says explicitly that id must be unique. That's what I've done in emld now too; but I thought technically it's a combination of id and system that has to be unique? (I'd much prefer being able to ignore system and scope logic though!)
  • I'm not sure the algorithm-version entirely matches the text version with respect to the rules regarding annotation id. (I believe the algorithm only invalidates case where annotation has no subject, i.e. no id and no reference. It should also invalidate having both parent id and child reference. Also, I think it would be easier for the reader to interpret the rule if you added the clause "an annotation must have one and only subject, coming either the parent id or child reference.)

An aside, but the way EML sometimes specifies IDs as attributes and sometimes as node content leaves some open questions in my mind as well. (The first gotcha I learned was that we should probably assert trimming whitespace around ids given as node values.) Not sure if there are other risks here with regard to parsing rules (e.g. thinking about things like xs:type and if any coercion rules need to be defined, probably all taken care of in the schema but makes me nervous).

Lastly, note that I think one of the current test files does not satisfy these validation rules; the annotation:

<annotation>
<propertyURI label="Subject">http://purl.org/dc/elements/1.1/subject</propertyURI>
<valueURI label="grassland biome">http://purl.obolibrary.org/obo/ENVO_01000177</valueURI>
</annotation>

appears as the child of a dataset element which lacks and id.

@mbjones

This comment has been minimized.

Copy link
Contributor Author

@mbjones mbjones commented Dec 4, 2018

@cboettig Thanks for the feedback...

must packageId also be distinct from any id (+system) attribute, or is that okay?

This is open to discussion, but my thought is yes -- it represents the whole package, whereas an id on a subtree of the document would represent a part of the package, and so they are necessarily different subjects and should have unique identifiers.

The validation rules don't say anything about units. @amoeba tells me that checking <standardUnit> unit names against standardUnits list and customUnit against a provided unitList should be part of the standard validation (which makes a lot of sense). Shouldn't this be added to the above list?

The schema contains an enumeration of standard unit names which is used to validate that field with the overall XML validation, but I agree it would be good to check customUnit against the provided unitList.

is <alternateIdentifier> subject to any identifier uniqueness checks?

Not currently. Do you think it should?

Your prose version of the validation rule says explicitly that id must be unique. That's what I've done in emld now too; but I thought technically it's a combination of id and system that has to be unique? (I'd much prefer being able to ignore system and scope logic though!)

Yeah, I'm torn on this. We originally planned to have the system affect the uniqueness constraint so that an identifier could be linked to an external system even if it wasn't globally unique (e.g., two mjones user identifiers in different authentication systems). In reality we never got that implemented in the validator, and EML docs have been de facto dealing with the constraint that all ids are unique. So, we could change the language to deprecate the system attribute wrt scoping and uniqueness. Certainly easier to implement that way. Or we could remove it but that would be backwards incompatible syntactically -- better to just ignore it. Technically it would be a backwards incompatible change, but in practice the full uniqueness check has been enforced, so documents should in general be valid wrt that rule. We may run into some problems if people haven't been validating with the EMLParser. Other opinions welcomed on this issue.

I'm not sure the algorithm-version entirely matches the text version with respect to the rules regarding annotation id. I believe the algorithm only invalidates case where annotation has no subject, i.e. no id and no reference. It should also invalidate having both parent id and child reference. Also, I think it would be easier for the reader to interpret the rule if you added the clause "an annotation must have one and only subject, coming either the parent id or child reference.)

Possibly. I meant for the step "If the element containing the id contains a references element as an immediate child then the document is invalid." to handle the case of both having a parent id and child references. Does that miss some edge cases?

Lastly, note that I think one of the current test files does not satisfy these validation rules;

Yeah, I agree, and I have fixed that and will look for others. When I finish the new java-based parser (soon) I will make sure that all the test docs pass, except the ones in the invalidEML folder. I also plan to add a bunch of other invalid EML documents to be sure the validator is picking up these cases. It will be good to see if the java and R validation produces the same results. Feel free to add your own test cases. The test files could use a much better organization and naming scheme.

@mbjones mbjones closed this Jan 31, 2019
@mbjones mbjones removed the needs-review label Jan 31, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
3 participants
You can’t perform that action at this time.