Rewrote validation section to be much clearer.
Also provided a pseudo-code example of validation logic, as requested in issue #306.
mbjones committed Nov 26, 2018
1 parent ea2acf7 commit 7360165
Showing 1 changed file with 112 additions and 69 deletions.
181 changes: 112 additions & 69 deletions docs/ch3-spec-architecture.md
@@ -1,74 +1,115 @@
# Chapter 3. Technical Architecture (Normative)

## Introduction {#introduction-1 .title style="clear: both"}

This section explains the rules of EML. There are some rules that cannot
be written directly into the XML Schemas nor enforced by an XML parser.
These are guidelines that every EML package must follow in order for it
to be considered EML compliant.

## Module Structure

Each EML module, with the exception of \"eml\" itself, has a top level
choice between the structured content of that modules or a
\"references\" field. This enables the reuse of content previously
defined elsewhere in the document. Methods for defining and referencing
content are described in the
[next](#reusableContent "3.3. Reusable Content") section

## Reusable Content

EML allows the reuse of previously defined structured content (DOM
sub-trees) through the use of key/keyRef type references. In order for
an EML package to remain cohesive and to allow for the cross platform
compatibility of packages, the following rules with respect to packaging
must be followed.

- An ID is required on the eml root element.
- Elements which contain an `annotation` element require an ID, which
defines the subject of the annotation.
- IDs are optional on all other elements.
- If an ID is not provided, that content must be interpreted as representing a distinct object.
- If an ID is provided for content then that content is distinct from
all other content except for that content that references its ID.
- If a user wants to reuse content to indicate the repetition of an
object, a reference must be used. Two identical ids with the same
system attribute cannot exist in a single document.
- \"Document\" scope is defined as identifiers unique only to a single
instance document (if a document does not have a system attribute or
if scope is set to \'document\' then all IDs are defined as distinct
content).
- \"System\" scope is defined as identifiers unique to an entire data
management system (if two documents share a system string, then any
IDs in those two documents that are identical refer to the same
object).
- If an element references another element, it must not have an ID
itself. The system attribute must have the same value in both the
target and referencing elements or it must be absent in both.
- All EML packages must have the \'eml\' module as the root.
- The system and scope attribute are always optional except for at the
\'eml\' module where the scope attribute is fixed as \'system\'. The
scope attribute defaults to \'document\' for all other modules.

### ID and Scope Examples

#### EML Parser
# Validation and Content references

This section explains the validation rules of EML. While most of the validation
rules are expressed as constraints within the XML Schema definition files, there are
some rules that cannot be written directly into the XML Schemas nor enforced
by an XML parser. Every EML package MUST also satisfy these additional validation
rules in order to be considered EML-compliant.

## Validation rules

For a document to be EML-valid, all of the following constraints must hold true:

- The document MUST validate using a compliant XML Schema validating parser
- All EML documents MUST have the 'eml' module as the root
- A `packageId` attribute MUST be present on the root `eml` element
- All `id` attributes within the document MUST be unique
- Elements which contain an `annotation` child element MUST contain an `id` attribute,
unless that `annotation` element contains a `references` attribute
- If an element references another using a child `references` element,
another element with that value in its `id` attribute MUST exist in the document
- When `references` is used, the `system` attribute MUST have
the same value in both the target and source elements, or it MUST be absent in both.
In practice it is frequently absent in both.
- If an element references another using a child `references` element,
it MUST NOT have an `id` attribute itself
- If an `additionalMetadata` element references another using a child `describes` element,
another element with that value in its `id` attribute MUST exist in the document

## Validation algorithm

One reasonable algorithm for assessing these constraints without loading the XML into
a DOM structure is to check `id` and `references` fields while parsing the document,
storing their values in `identifiersHash` and `referencesHash` data structures, and
then to run a final consistency check over both. For example, in pseudocode:

- Parse the XML document using an XML Schema-compliant parser
    - If the root element is not `eml`, then the document is invalid
- For each element, record whether it has an `id` attribute or not
    - If an element does not contain an `id`, but it has a child `annotation`
      element, and that child annotation does not contain a `references` attribute, then the document is invalid
- For each `id` attribute
    - If `id` is not in `identifiersHash` then add it as the key of `identifiersHash`, with its `system` as the value
    - If `id` is already in `identifiersHash` then the document is invalid
    - If the element containing the `id` contains a `references` element as an
      immediate child then the document is invalid
- For each `references` element
    - If the `references` key is not in `referencesHash`,
      then add it as a key with the `system` value to `referencesHash`
    - If the `references` key is in `referencesHash`, but the current `system`
      value does not match the value for that key, then the document is invalid
- For each `references` attribute on an `annotation` element
    - If the `references` key is not in `referencesHash`,
      then add it as a key with the empty string '' value to `referencesHash`
- For each `describes` element within an `additionalMetadata` element
    - If the `describes` key is not in `referencesHash`,
      then add it as a key with the empty string '' value to `referencesHash`
- Once document processing is complete, for each `key` in `referencesHash`
    - If `!identifiersHash.hasKey(key) OR referencesHash[key] != identifiersHash[key]` then the document is invalid
- If no validity errors are found above or by the parser, then the document is valid
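
As a rough illustration, the steps above could be sketched in Python using only the
standard library. This is a sketch under stated assumptions, not the parser shipped
with EML: the function name `check_eml_references` and the error strings are
hypothetical, XML Schema validation is skipped entirely (the standard library has no
schema validator), the `system` attribute is assumed to be read from the `references`
element itself, and a missing `system` is treated as the empty string. It also loads
the whole document into memory rather than streaming it.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Drop the '{namespace}' prefix that ElementTree puts on qualified tags."""
    return tag.rsplit("}", 1)[-1]


def check_eml_references(xml_text):
    """Return a list of error messages; an empty list means no problems found.

    Only the id/references rules are checked. Schema validation is NOT done here.
    """
    root = ET.fromstring(xml_text)
    if local_name(root.tag) != "eml":
        return ["root element is not 'eml'"]

    errors = []
    if root.get("packageId") is None:
        errors.append("root 'eml' element has no packageId attribute")

    identifiers = {}  # id value -> system value ('' when absent)
    referenced = {}   # referenced id -> system value ('' when absent)

    for elem in root.iter():
        name = local_name(elem.tag)
        child_names = [local_name(child.tag) for child in elem]
        elem_id = elem.get("id")

        if elem_id is None:
            # An annotated element needs an id unless the annotation names
            # its subject through its own 'references' attribute.
            for child in elem:
                if (local_name(child.tag) == "annotation"
                        and child.get("references") is None):
                    errors.append(f"<{name}> has an annotation but no id")
        else:
            if elem_id in identifiers:
                errors.append(f"duplicate id '{elem_id}'")
            else:
                identifiers[elem_id] = elem.get("system", "")
            if "references" in child_names:
                errors.append(f"<{name}> has both id '{elem_id}' and references")

        if name == "references":
            key = (elem.text or "").strip()
            # Assumption: 'system' is read from the <references> element itself.
            system = elem.get("system", "")
            if key not in referenced:
                referenced[key] = system
            elif referenced[key] != system:
                errors.append(f"conflicting system values for reference '{key}'")
        elif name == "annotation" and elem.get("references") is not None:
            # Recorded with '' as the system value, as in the pseudocode.
            referenced.setdefault(elem.get("references"), "")
        elif name == "additionalMetadata":
            for child in elem:
                if local_name(child.tag) == "describes":
                    referenced.setdefault((child.text or "").strip(), "")

    # Final consistency check: every referenced id must exist with the same system.
    for key, system in referenced.items():
        if key not in identifiers:
            errors.append(f"reference to non-existent id '{key}'")
        elif identifiers[key] != system:
            errors.append(f"system mismatch for reference '{key}'")
    return errors
```

Calling `check_eml_references` on a document string returns an empty list when none of
these checks fail. Collecting messages instead of stopping at the first failure is a
design choice of this sketch, so that an author can see every problem in one pass.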

## Content references

Each EML module, with the exception of "eml" itself, has a top level
choice between the structured content of that element or a
"references" field. This enables the reuse of content previously
defined elsewhere in the document. This allows, for example, an author to
create a single `<creator id='m.jones'>` element with all of its child detail,
and then reference that as `<contact><references>m.jones</references></contact>`
to indicate that the same person is both the creator and contact. This creates
an unambiguous linkage via the `id` field, showing that the two elements refer to the same
entity, in this case a person, and avoids having to re-enter the same information
multiple times in the document. Another common location for re-use is when a single
`attributeList` is defined with a set of variables and their metadata, and then
that list is referenced in multiple `dataTable` elements to show that they are
structured identically.

The reuse of structured content is accomplished through the
use of `id`/`references` pairs. Each element that is to be reused will contain a
unique `id` attribute on the element. Because this identifier is guaranteed to
be unique within the EML document, any other location that wants to point at that
content can do so using the `references` element, as shown in the example above.
These types of references can also be used in the `references` attribute of
`annotation` elements, and in the `describes` element within the `additionalMetadata`
element.
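
To make the `id`/`references` pairing concrete, here is a minimal Python sketch of how
a consumer might resolve each `references` child back to the element it points at. The
helper name `resolve_references` and the `surName` value in the sample fragment are
hypothetical, the fragment is not a complete EML document, and the sketch ignores the
`system` attribute and the `annotation` and `describes` forms.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Drop the '{namespace}' prefix that ElementTree puts on qualified tags."""
    return tag.rsplit("}", 1)[-1]


def resolve_references(root):
    """Return (referencing element, target element) pairs.

    The target is None when no element carries the referenced id.
    """
    by_id = {el.get("id"): el for el in root.iter() if el.get("id") is not None}
    pairs = []
    for el in root.iter():
        for child in el:
            if local_name(child.tag) == "references":
                pairs.append((el, by_id.get((child.text or "").strip())))
    return pairs


# Illustrative fragment based on the creator/contact example in the text.
doc = """<dataset>
  <creator id="m.jones">
    <individualName><surName>Jones</surName></individualName>
  </creator>
  <contact><references>m.jones</references></contact>
</dataset>"""

for source, target in resolve_references(ET.fromstring(doc)):
    print(local_name(source.tag), "->", local_name(target.tag), target.get("id"))
```

Building the `id` index first keeps the lookup independent of document order, so a
`references` element may appear before or after the element it points at.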

If an `id` attribute is provided for content, then that content is considered
to represent a different entity than all other elements that are defined in
the document, except for those that include its `id` in the `references` child.
This is useful to indicate, for example, that two people with similar names
(e.g., "D. Clark" and "D. Clark") are in fact distinct individuals
(e.g., "Deborah Clark" and "David Clark"), or that two variables with the same
`attributeName` are in fact different variables. While it would be bad practice
to reuse attribute names like this, it does happen and EML needs to be able to document it
when it does.

## EML Validity Parser

Because some of these rules cannot be enforced in XML-Schema, we have
written a parser which checks the validity of the references and IDs
used in your document. This parser is included with the 2.1.0 release of
EML. To run the parser, you must have Java 1.3.1 or higher. To execute
written a parser which checks the validity of the references and `id`s
used in a document. This parser is included with the release of
EML. To run the parser, you must have Java installed. To execute
it change into the lib directory of the release and run the
\'runEMLParser\' script passing your EML instance file as a parameter.
There is also an [online
'runEMLParser' script passing your EML instance file as a parameter.
There may also be an [online
version](https://knb.ecoinformatics.org/emlparser) of this parser which
is publicly accessible. The online parser will both validate your XML
document against the schema as well as check the integrity of your
references.

#### Example Documents
## id and Scope Examples

**Example 3.1. Invalid EML due to duplicate identifiers**
**Example: Invalid EML due to duplicate identifiers**

```xml
<?xml version="1.0"?>
@@ -88,7 +129,7 @@ references.
</creator>
<creator id="23445" scope="document">
<individualName>
<surName>Myer</surName>
<surName>Smith</surName>
</individualName>
</creator>
...
@@ -99,7 +140,7 @@ references.
This instance document is invalid because both creator elements have the
same id. No two elements can have the same string as an id.

**Example 3.2. Invalid EML due to a non-existent reference**
**Example: Invalid EML due to a non-existent reference**

```xml
<?xml version="1.0"?>
@@ -130,10 +171,10 @@ same id. No two elements can have the same string as an id.
```

This instance document is invalid because the contact element references
an id that does not exist. Any referenced id must exist.
an `id` that does not exist. Any referenced `id` must exist in the document.

**Example 3.3. Invalid EML due to a conflicting id attribute and a
\<references\> element**
**Example: Invalid EML due to a conflicting id attribute and a
`<references>` element**

```xml
<?xml version="1.0"?>
@@ -168,7 +209,7 @@ references another element and has an id itself. If an element
references another element, it may not have an id. This prevents
circular references.

**Example 3.4. A valid EML document**
**Example: A valid EML document**

```xml
<?xml version="1.0"?>
@@ -202,4 +243,6 @@ circular references.
```

This instance document is valid. Each contact is referencing one of the
creators above and all the ids are unique.
creators above and all the ids are unique. The fact that each creator has its own `id`
indicates that they are different people, even though they have the same
`surName` and there is no other distinguishing metadata.
