change EML namespace and elementFormDefault #334
Comments
Hi there @antibozo,
EML follows the Namespaces in XML recommendation, particularly section 6.2 regarding default namespaces. When you declare a namespace on an element, that namespace applies to all non-prefixed child elements. So, to be clear, we use the default namespace, not the "empty" namespace. XML parsers handle this quite well.
The namespace follows the spec in that it must be a URI. I'd agree that it is idiosyncratic (done 20 years ago when XML was fairly young). That said, it is basically just a URI string, and processors handle it just fine. I've seen the use of
As I'm sure you are well aware, there are security implications of dereferencing XML Schema documents from the Suffice it to say that we (at NCEAS, I can't speak for other EML producers) have opted to use the As an aside, the value you suggest (
Actually, you do not use a default namespace. There is no default namespace. There is a namespace only on the root "eml:eml" element. Here is some code demonstrating that, using the LibXML2 library to parse the XML:
This code assigns the temporary prefix "x" to the namespace matching the EML URI. So now I can use that prefix to search for all elements in the EML namespace:
This search works exactly as expected. I find one node, the root "eml:eml" node of the document.
This search does not work as expected based on your description above. The expectation is that all of the nodes in the document have the default namespace, namely, "eml://ecoinformatics.org/eml-2.1.1", as above.
So now I am searching for the same nodes, but with no namespace. I get exactly one node, the node child of the eml:eml node.
And this results in an error, because, well, because the node and all other nodes in the document do not have a namespace.
This can also be shown on the Linux command line using xml_grep. A search for the root node finds it:
A search for the child dataset node does not:
Again, that's because the dataset node has no namespace. And searching for a non-namespaced dataset node finds it:
Aside from the xsi:schemaLocation error in antibozo's proffered declaration, it is correct. Here is another correct declaration that does not try to address his other observations:
When I change the declaration of the EML namespace thus, and remove the eml: prefix from the root node, then the namespace works as you think it should. That is, a search for '//x:eml' finds the root node, a search for '//x:dataset' finds the child "dataset" node of the root node, and a search for "//x:para" in my example document (the science metadata from doi:10.18739/A2222R550) finds 28 matches. Another possible implementation is this:
Where every node in the EML namespace has the prefix "eml". That is also a correct implementation.
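The namespace searches described in this comment can be approximated with Python's standard library instead of LibXML2. This is a minimal sketch, not the code from the comment: the three documents are hypothetical cut-down stand-ins for the three declaration styles discussed (prefix on the root only, a default namespace, and a prefix on every element), and `count_matches` is an illustrative helper name.

```python
import xml.etree.ElementTree as ET

EML_NS = "eml://ecoinformatics.org/eml-2.1.1"

# Hypothetical cut-down documents, one per declaration style discussed above.
DOCS = {
    "prefix on root only": (
        f'<eml:eml xmlns:eml="{EML_NS}">'
        "<dataset><para>p</para></dataset></eml:eml>"
    ),
    "default namespace": (
        f'<eml xmlns="{EML_NS}">'
        "<dataset><para>p</para></dataset></eml>"
    ),
    "prefix on every element": (
        f'<eml:eml xmlns:eml="{EML_NS}">'
        "<eml:dataset><eml:para>p</eml:para></eml:dataset></eml:eml>"
    ),
}


def count_matches(xml_text, local_name, namespace):
    """Count elements (root included) with this local name in this namespace.

    An empty namespace string means "in no namespace".
    """
    root = ET.fromstring(xml_text)
    # ElementTree stores qualified names as "{uri}local".
    wanted = f"{{{namespace}}}{local_name}" if namespace else local_name
    return sum(1 for el in root.iter() if el.tag == wanted)


for label, doc in DOCS.items():
    in_eml = count_matches(doc, "dataset", EML_NS)
    in_none = count_matches(doc, "dataset", "")
    print(f"{label}: dataset in EML namespace={in_eml}, in no namespace={in_none}")
```

With the prefix declared on the root only, `dataset` is found in no namespace; with either of the other two styles it is found in the EML namespace.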
In
you aren't declaring a namespace for child elements; you are declaring only a namespace prefix, which you apply solely to the root element. The default namespace of child elements is unaffected by a declaration of this kind, and remains the empty namespace. To put child elements in the EML namespace in this syntax, you must add a namespace prefix to them just as you have done with the root element, e.g. To declare a default namespace, you must write something like:
Note the To see this, take a sample document and pass it through something that prints out the namespace of each element. Here is a version of the sample document found at the end of chapter 2 in your specification, with typos corrected (
and here is a simple XSL that prints out the namespace URI and local name of each element in document order:
Process the sample XML through this XSL and you will get:
in which you see that every element except the root If you correct the sample XML to define a default namespace, thus:
and process this through the same XSL, this yields:
Because of the way you have written this specification, along with EML documents we are actually observing, it is not possible when processing mixed documents to distinguish EML elements from elements in the empty namespace.
As for XML injection, you aren't going to protect people from XML injection by using weird schemes in your URIs. It is up to people to protect themselves with appropriate steps, such as disabling external entity resolution. And, after all, Regardless of the correct value for
Yes, we have all done things 20 years ago that needed to be modified as things became clearer. But fixing mistakes is still a good thing. It is commonplace to update namespace URIs when new versions are created, so perhaps this is a good time to do so.
One other note: it is perfectly fine to use
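The per-element namespace listing described in this comment can also be reproduced without XSL. The following is a minimal Python sketch under stated assumptions: `namespace_listing` is a hypothetical helper, and the two documents are shortened stand-ins, not the sample from chapter 2 of the specification.

```python
import xml.etree.ElementTree as ET

EML_NS = "eml://ecoinformatics.org/eml-2.1.1"


def namespace_listing(xml_text):
    """Return (namespace URI, local name) for every element, in document order."""
    listing = []
    for el in ET.fromstring(xml_text).iter():
        if el.tag.startswith("{"):
            # ElementTree writes qualified names as "{uri}local".
            uri, local = el.tag[1:].split("}", 1)
        else:
            uri, local = "", el.tag
        listing.append((uri, local))
    return listing


# Shortened, hypothetical documents in the two forms being compared.
as_written = (
    f'<eml:eml xmlns:eml="{EML_NS}">'
    "<dataset><title>x</title></dataset></eml:eml>"
)
with_default = (
    f'<eml:eml xmlns:eml="{EML_NS}" xmlns="{EML_NS}">'
    "<dataset><title>x</title></dataset></eml:eml>"
)

for name, doc in (("as written", as_written), ("with default namespace", with_default)):
    print(name)
    for uri, local in namespace_listing(doc):
        print(f"  {{{uri}}}{local}")
```

In the first document only the root carries the EML namespace URI; in the second, every element does.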
One other note about default namespaces: Again, consider the sample document from chapter 2 of the specification:
I wrote earlier that elements such as
then Hopefully it is altogether clear by now why this is a real flaw in the specification.
… and, following the concern about the weird scheme: again, it is possible Use
Hi Jeff and John, Ah, yes, I stand corrected on the default namespace issue. After reading your explanations, and looking at this more closely, I see that only the root element is namespaced in these instance documents, and that the children have no namespace. This perplexed me a bit since the Xerces-J parser we tend to use to validate these documents validates them fine, and in fact they are completely valid with regard to adhering to the EML schema. As you point out, the main issue is that we (Arctic Data Center) omitted the default namespace attribute on the
It doesn't require any namespace change to the root So, I will bring this up with the ADC group, and we'll discuss adding in the default namespace attribute in future documents.
That said, for your processing purposes, I think you will need to inject the
So, before we close this issue, I think the summary of the three items raised is:
The immediate problem is that any use case that uses namespaces to locate elements will have a problem with putting the child elements in the originally intended namespace. You have both a backward and forward compatibility issue. If someone has, for example, written an EML processor that searches for
The correct thing to do, I believe, is to explicitly say that all elements defined in the EML specification are in the EML namespace. The usual way to make this clear in a specification is to use a namespace prefix on every element, and not to rely on default namespaces, because default namespaces have scope. I.e., your sample record would look like:
To be clear, this is going to break existing processors that were handling the EML records that actually exist by finding child elements in the empty namespace. You'll see in this example I have also increased the version number in the namespace URI to accommodate this.
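The compatibility problem can be illustrated with a small Python sketch. Everything here is an assumption for illustration: `find_children` is a hypothetical helper, the two documents are cut-down stand-ins, and the corrected document reuses the 2.1.1 URI rather than the bumped version number mentioned above.

```python
import xml.etree.ElementTree as ET

EML_NS = "eml://ecoinformatics.org/eml-2.1.1"


def find_children(parent, local_name, namespace=EML_NS):
    """Hypothetical tolerant lookup: direct children with this local name,
    either in the given namespace or in no namespace at all."""
    qualified = f"{{{namespace}}}{local_name}"
    return [child for child in parent if child.tag in (qualified, local_name)]


# Existing style: only the root is in the EML namespace.
existing = ET.fromstring(f'<eml:eml xmlns:eml="{EML_NS}"><dataset/></eml:eml>')
# Corrected style: every element carries the prefix.
corrected = ET.fromstring(f'<eml:eml xmlns:eml="{EML_NS}"><eml:dataset/></eml:eml>')

# A processor written against existing records looks in the empty namespace
# and stops finding anything once the children are namespaced.
print(len(existing.findall("dataset")), len(corrected.findall("dataset")))

# The tolerant helper finds the child in both styles.
print(len(find_children(existing, "dataset")), len(find_children(corrected, "dataset")))
```

A lookup like this keeps old and new records readable by one processor, at the cost of again treating unnamespaced elements as EML, which is the ambiguity in mixed documents that this thread is about.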
Heh - disregard that last comment (I deleted it) - wrong ticket. 🙂
Clarifications
A couple of points of clarification, just so we are on the same page:
I've spent some time looking at this proposal, and talking to a few folks about it. While our current use of
Proposal
That said, I think it's a good idea for a 3.0.0 release, and here is what I think we should change.
I think at that point, people could set a default namespace on their documents and it would be used for any elements not explicitly prefixed. The only place it would need to be changed would be in
One outstanding question for me is:
In any case, due to the compatibility issue, I will retarget this to milestone 3.0.0. Comments appreciated.
It appears that the EML specification defines a namespace for the root element of a document, but proceeds to use the empty namespace for every other element. This is very weird, and makes the use of any namespace at all seemingly useless.
In addition, the namespace used for the root element uses an idiosyncratic scheme "eml:". This is also very weird.
Furthermore, the xsi:schemaLocation URI also uses this otherwise-undefined eml: scheme, making it unretrievable for any automated schema validation code.
Example from current spec:
Here the root element is {eml://ecoinformatics.org/eml-2.1.1}eml, but the first child is {}dataset.
The normal way to do this, which doesn't perplex everyone who tries to process an XML document written to spec, would be:
Here, all of the elements, from the root down, are in the {http://ecoinformatics.org/eml-2.1.1} namespace. This assures that elements intended to mean EML things can be distinguished from elements with the same local name that mean something else, e.g. "title".
Please rewrite the spec to use namespaces in a non-eyebrow-raising manner, or explain why you have written it as it is.
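The point about telling an EML "title" apart from some other "title" can be sketched in Python. This is an illustration under stated assumptions: the `record` wrapper element and both documents are hypothetical, the http namespace URI is the one proposed above, and Dublin Core stands in for "something else" with the same local name.

```python
import xml.etree.ElementTree as ET

OLD_EML_NS = "eml://ecoinformatics.org/eml-2.1.1"
NEW_EML_NS = "http://ecoinformatics.org/eml-2.1.1"  # URI proposed in this issue
DC_NS = "http://purl.org/dc/elements/1.1/"

# Hypothetical mixed document in the current style: EML children have no namespace.
current_style = ET.fromstring(
    "<record><title>wrapper title</title>"
    f'<eml:eml xmlns:eml="{OLD_EML_NS}">'
    "<dataset><title>EML title</title></dataset></eml:eml></record>"
)

# Hypothetical mixed document in the proposed style: a default namespace on the EML root.
proposed_style = ET.fromstring(
    f'<record xmlns:dc="{DC_NS}"><dc:title>Dublin Core title</dc:title>'
    f'<eml xmlns="{NEW_EML_NS}">'
    "<dataset><title>EML title</title></dataset></eml></record>"
)

# Current style: both titles are in no namespace, so one search returns both.
ambiguous = [el.text for el in current_style.findall(".//title")]

# Proposed style: each title can be selected by its own namespace.
prefixes = {"eml": NEW_EML_NS, "dc": DC_NS}
eml_titles = [el.text for el in proposed_style.findall(".//eml:title", prefixes)]
dc_titles = [el.text for el in proposed_style.findall(".//dc:title", prefixes)]

print(ambiguous)
print(eml_titles, dc_titles)
```

In the current style the search cannot separate the wrapper's title from the EML one; in the proposed style each query returns only its own element.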