semantic metadata module/extensions #25
Comments
Original Redmine Comment Changing QA contact to the list for all current EML bugs so that people can |
Original Redmine Comment Original Bugzilla ID was 277 |
Added new schema file eml-semantics.xsd for providing a new SemanticAnnotation type. Needs to be tested, reviewed, and incorporated into the other schemas. |
@mobb and @mpsaloha I wanted to bring this semantic extension for EML to your attention in particular. I'm just starting to think about how this would work, but for now I committed a new eml-semantics.xsd file with a SemanticAnnotation ComplexType in commit sha 1dacda8 in the EML 2.2 branch. My thought is that I will add optional annotation elements using the SemanticAnnotation type in key structures in EML, particularly in the following places:
I would appreciate your thoughts on this. You can view the xsd file in the branch: https://github.com/NCEAS/eml/blob/BRANCH_EML_2_2/xsd/eml-semantics.xsd |
@mobb, @amoeba, @mpsaloha, @csjx, @cboettig -- The new fields for populating semantic annotations are now present in the EML schemas in the BRANCH_EML_2_2, and I have linked them into three locations -- ResourceGroup, EntityGroup, and AttributeType. So, now you can add zero or more Could you please review, comment, and revise? Including the element and type documentation in the xsd files? Here's an excerpt from the eml-sample.xml document that shows the annotations in use:
<?xml version="1.0"?>
<eml:eml
packageId="eml.1.1" system="knb"
xmlns:eml="eml://ecoinformatics.org/eml-2.2.0"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xmlns:stmml="http://www.xml-cml.org/schema/stmml-1.1"
xsi:schemaLocation="eml://ecoinformatics.org/eml-2.2.0 ../../../xsd/eml.xsd">
<dataset>
<title>Data from Cedar Creek LTER on productivity and species richness
for use in a workshop titled "An Analysis of the Relationship between
Productivity and Diversity using Experimental Results from the Long-Term
Ecological Research Network" held at NCEAS in September 1996.</title>
<creator id="clarence.lehman">
<individualName>
<salutation>Mr.</salutation>
<givenName>Clarence</givenName>
<surName>Lehman</surName>
</individualName>
...
</creator>
...
<keywordSet>
<keyword>Old field grassland</keyword>
<keyword>biomass</keyword>
<keyword>productivity</keyword>
<keyword>species-area</keyword>
<keyword>species richness</keyword>
</keywordSet>
<annotation>
<termURI>http://purl.obolibrary.org/obo/ENVO_01000177</termURI>
<termLabel>grassland biome</termLabel>
</annotation>
<contact>
<references>clarence.lehman</references>
</contact>
<contact>
<references>richard.inouye</references>
</contact>
<dataTable id="xyz">
<entityName>CDR LTER-patterns among communities.txt</entityName>
<entityDescription>patterns among communities at CDR</entityDescription>
<physical>
...
</physical>
<annotation>
<termURI>http://purl.obolibrary.org/obo/ENVO_00000260</termURI>
<termLabel>prairie</termLabel>
</annotation>
<attributeList id="at.1">
...
<attribute id="att.12">
<attributeName>biomass</attributeName>
<attributeLabel>Biomass</attributeLabel>
<attributeDefinition>The total biomass measured in this field
</attributeDefinition>
<storageType>float</storageType>
<measurementScale>
<ratio>
<unit><customUnit>gramsPerSquareMeter</customUnit></unit>
<precision>0.01</precision>
<numericDomain id="nd.6">
<numberType>real</numberType>
<bounds>
<minimum exclusive="true">0</minimum>
</bounds>
</numericDomain>
</ratio>
</measurementScale>
<annotation>
<termURI>http://ecoinformatics.org/oboe/oboe.1.2/oboe-characteristics.owl#Mass</termURI>
<termLabel>Mass</termLabel>
</annotation>
<annotation>
<termURI>http://ecoinformatics.org/oboe/oboe.1.2/oboe-standards.owl#Kilogram</termURI>
<termLabel>Kilogram</termLabel>
</annotation>
<annotation>
<termURI>http://example.com/example-vocab-1.owl#PlantSample</termURI>
<termLabel>Plant Sample</termLabel>
</annotation>
</attribute>
...
</attributeList>
<caseSensitive>no</caseSensitive>
<numberOfRecords>22</numberOfRecords>
</dataTable>
</dataset>
<additionalMetadata>
<metadata>
<stmml:unitList xmlns:stmml="http://www.xml-cml.org/schema/stmml-1.1"
xsi:schemaLocation="http://www.xml-cml.org/schema/stmml-1.1 ../../../xsd/stmml.xsd">
<!--note that the unitTypes here are taken from the eml-unitDictionary.xml-->
<stmml:unit name="gramsPerSquareMeter" unitType="arealMassDensity" id="gramsPerSquareMeter" parentSI="kilogramsPerSquareMeter" multiplierToSI=".001"/>
<stmml:unit name="speciesPerSquareMeter" unitType="arealDensity" id="speciesPerSquareMeter" parentSI="numberPerSquareMeter" multiplierToSI="1"/>
</stmml:unitList>
</metadata>
</additionalMetadata>
</eml:eml> |
@gastil-buhl thanks so much for the comments. I think the challenges you describe are very real, and we'll need to work on good documentation, including a primer. But I think they will be present for any semantically-precise implementation we might choose for EML. But re-reading the documentation I wrote, it's clear that I could be more concrete in describing just how the annotation abstraction (and it is definitely an abstraction) works. It's a very meta-level concept. In short, each annotation asserts some information about a part of an EML document, and that information is expressed as a property and a value, both of which are drawn from controlled vocabularies. For example, I might want an annotation to say:
variable1 hasStorageType float
In this annotation, variable1 is the EML attribute that is being annotated (i.e., we are saying something about it), and the property that we are asserting about variable1 is hasStorageType, which has the value float. But the words 'hasStorageType' and 'float' are semantically ambiguous, in that there can be multiple definitions of those words. So, rather than using the human readable (and ambiguous) word 'hasStorageType', we instead use the URI for that term that provides a formal definition in its controlled vocabulary (something like http://example.com/vocab1/hasStorageType). So 'hasStorageType' is just a label we use to display the term defined by the URI. Similarly, the word 'float' is just a label used to display the more precise term that it represents (e.g., http://example.com/vocab1/float). So, in reality, the true annotation is expressed using URIs, not labels:
variable1 http://example.com/vocab1/hasStorageType http://example.com/vocab1/float
So the labels are just human readable strings to substitute for the controlled term URI when displaying the information. Maybe this helps? Clearly we'll need help writing clear documentation. The challenge will be in being both clear and concise. A primer will allow us to be more complete than we can be in the EML specification itself. |
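The label-versus-URI distinction described in this comment can be sketched in a few lines of Python. This is only an illustration and not part of EML or any EML tooling: the `Annotation` class and its `triple` and `display` methods are hypothetical names, and the `example.com` URIs are the placeholder URIs used in the comment itself.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotation:
    """One assertion about a subject: a property and a value, both URIs."""
    subject: str          # the annotated EML element, e.g. "variable1"
    property_uri: str     # formal definition of the property
    value_uri: str        # formal definition of the value
    property_label: str   # display string only, carries no semantics
    value_label: str      # display string only, carries no semantics

    def triple(self):
        """The true annotation, expressed with URIs rather than labels."""
        return (self.subject, self.property_uri, self.value_uri)

    def display(self):
        """Substitute the labels for the URIs when showing the annotation."""
        return f"{self.subject} {self.property_label} {self.value_label}"


ann = Annotation(
    subject="variable1",
    property_uri="http://example.com/vocab1/hasStorageType",
    value_uri="http://example.com/vocab1/float",
    property_label="hasStorageType",
    value_label="float",
)
print(ann.display())
print(ann.triple())
```

Two annotations with different labels but the same URIs would compare as the same statement via `triple()`, which is the point of treating labels as display-only.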
Some people have requested that the definition of the annotation type include mention of the ability to include it in the |
Yes Matt that does help. How you explained it there would be useful text for a guide.
On Thu, Jan 4, 2018 at 11:07 PM, Matt Jones wrote:
@gastil-buhl <https://github.com/gastil-buhl> thanks so much for the
comments. I think the challenges you describe are very real, and we'll need
to work on good documentation, including a primer. But I think they will be
present for *any* semantically-precise implementation we might choose for
EML. But re-reading the documentation I wrote, its clear that I could be
more concrete in describing just how the annotation abstraction (and it
is definitely an abstraction) works. Its a very meta-level concept. In
short, each annotation asserts some information about a part of an eml
document, and that information is expressed as a property and a value, both
of which are drawn from controlled vocabularies.
For example, I might want an annotation to say:
variable1 hasStorageType float
In this annotation, variable1 is the EML attribute that is being
annotated (i.e., we are saying something about it), the property that we
are asserting about variable1 is hasStorageType, which has the value float.
But the words 'hasStorageType' and 'float' are semantically ambiguous, in
that there can be multiple definitions of those words. So, rather than
using the human readable (and ambiguous) word 'hasStorageType', we instead
use the URI for that term the provides a formal definition in its
controlled vocabulary (something like http://example.com/vocab1/hasStorageType).
So 'hasStorageType' is just a label we use to display
the term defined by the URI. Similarly, the word 'float' is just a label
used to display the more precise term that it represents (e.g.,
http://example.com/vocab1/float). So, in reality, the true annotation is
expressed using URIs, not labels:
variable1 http://example.com/vocab1/hasStorageType http://example.com/vocab1/float
So the labels are just human readable strings to substitute for the
controlled term URI when displaying the information.
Maybe this helps? Clearly we'll need help writing clear documentation. The
challenge will be in being both clear and concise. A primer will allow us
to be more complete than we can be in the EML specification itself.
Hey @mbjones this looks pretty good. A few thoughts came to mind:
Regarding your above points,
I think so.
Yes, though there is a chance for the label to be out of sync with the URI if a user mis-types the information or if the
I think so. Putting the annotations in-line reduces the need for the user to look in multiple places in the document for the information. And allowing additional annotations to be put into the |
After discussion with @mpsaloha and @mobb, we agreed that it would be good to make |
@amoeba Thanks for the comments.
I'm unclear on what your final point is about how we |
I coulda made that comment more clear. I was referring to this:
This means my XPath to grab all annotations has to look for both the |
I might just not be following things correctly at this stage, but I think I'm not in favor of this new proposal. I don't like not having a property URI, or having a property URI that is only defined by some implicit convention (let alone two separate default conventions depending on context). I think it's fine if user-facing tooling wants to permit a default property to make it 'easy' to tag an I would like to be able to treat any EML node as the subject (the parent node of the annotation, as in turtle or JSON-LD, or RDFa), and always have the predicate/property URI and object/value of the triple clearly stated. (To me, RDFa still seems like the most obvious way to add semantics to XML, and permits existing technology (any RDFa parser) to extract the semantic annotations with minimum fuss, though I don't particularly like RDFa notation). Unrelated issue I probably should have asked earlier, but I admit I'm also lost as to why you enforce that the value is a URI at all -- why not permit Literal valued objects? |
You know, after catching up a bit on this thread, I do wonder why we are limiting annotations to resources, entities, and attributes, other than the very practical reasons that it limits the scope which limits the implementation changes. I certainly get that. I think I agree with @cboettig here that it would be nice to apply annotations to any element in the EML. I like that @mbjones has put the time into defining the
<eml packageId="4cdb6dd6-66c2-478a-af66-9969a3142813" ...>
<dataset>
<title>We heart data science</title>
<creator id="12345">
<individualName><surName id="54321">Mecum</surName></individualName>
</creator>
<contact><references>12345</references></contact>
</dataset>
<annotations>
<annotation>
<references>54321</references>
<propertyURI label="a">http://www.w3.org/1999/02/22-rdf-syntax-ns#type</propertyURI>
<valueURI label="familyName">http://xmlns.com/foaf/0.1/#familyName</valueURI>
</annotation>
</annotations>
</eml>
The crux here is to allow for the XML One advantage of doing this is that we might be able to convert these non-compliant triple statements into RDF-compliant statements by concatenating some dereference-able URI and the
Hmm, now that I look at that, we'd have a problem interpreting the Well, food for thought. |
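As a rough sketch of how a consumer might resolve such a `references` pointer back to the annotated element, the following Python uses only the standard library. The document string is a trimmed copy of the example above with the `...` elision dropped from the root element so that it parses; the `by_id` index and the `resolved` list are hypothetical names, and nothing here is an existing EML API.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the example in this comment (root attributes shortened).
EML = """\
<eml packageId="4cdb6dd6-66c2-478a-af66-9969a3142813">
  <dataset>
    <title>We heart data science</title>
    <creator id="12345">
      <individualName><surName id="54321">Mecum</surName></individualName>
    </creator>
    <contact><references>12345</references></contact>
  </dataset>
  <annotations>
    <annotation>
      <references>54321</references>
      <propertyURI label="a">http://www.w3.org/1999/02/22-rdf-syntax-ns#type</propertyURI>
      <valueURI label="familyName">http://xmlns.com/foaf/0.1/#familyName</valueURI>
    </annotation>
  </annotations>
</eml>
"""

root = ET.fromstring(EML)

# Index every element that carries an id attribute.
by_id = {el.get("id"): el for el in root.iter() if el.get("id") is not None}

# Resolve each annotation's <references> to the element it annotates.
resolved = []
for ann in root.findall("./annotations/annotation"):
    target = by_id[ann.findtext("references").strip()]
    resolved.append(
        (target.tag, target.text, ann.findtext("propertyURI"), ann.findtext("valueURI"))
    )

print(resolved)
```

A `KeyError` from the `by_id` lookup would signal an annotation that points at an id not present in the document, which is one of the checks a validating parser would presumably need.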
@cboettig wrote:
|
Love @csjx suggestion of having an option for an The NeXML schema in phylogenetics works this way via RDFa I see what Chris's example is trying to say, but I find it confusing to think of I imagine something like:
which would contain the single triple
Note the above
which has the advantage that the triple could be extracted by any existing RDFa->RDF stylesheet and doesn't involve creating any new syntax. (Though actually I think it makes more sense to interpret the whole document as triples, like this) Okay, maybe I'm way off the deep end now, feel free to pull me back. |
Ah, I see what you mean. @cboettig wrote:
I guess I was trying to disambiguate the EML Regarding your example of putting the |
Thanks for all of the input, @cboettig, @csjx, and @amoeba. Good stuff. Regarding the optionality of the property field: I agree, but @mpsaloha and @mobb felt strongly in the other direction, so I was trying to accommodate their desire to have a default property. I agree with you that having the property be explicit is important and far more manageable within the context of EML. I'll wait for some feedback from the others, but I think I will plan to move it back to having property be required. Getting more voices on this issue would be helpful. Regarding serialization, I like Bryce's suggestion of embedding the property and value labels as attributes in the parent element, and will plan on making that change in the next revision. Regarding the use of annotations in
Finally, @cboettig brought up a really new example with his use of a typed literal as the value of one of his example statements. I had been specifically avoiding the use of literals, as once we go down that slope we are really just re-inventing the RDF model within EML. The reason for the As these issues are getting complex, and this conversation is dragging out, I think we should schedule a call to discuss the merits of the various proposals for |
@mbjones great points all around, and agree this would be nice to hash out in a more real-time discussion on the slack channel. Meanwhile I'm going to take the liberty of jotting a few notes here just in case scheduling on slack doesn't work out for me.
Okay, bigger issues (in which I'm probably jousting at windmills, but anyway)... Right, I appreciate that the objective of the semantic extensions, as envisioned here, is really to provide some more precise semantics to existing literals such as measurements and not to open Pandora's box to arbitrary RDF statements. I'm not entirely clear that using semantic annotations and restricting those annotations to URI types is the best way to accomplish that, though -- there's still a lot that can be expressed as URIs that goes outside of this scope, and it still means that EML is inherently a mixed-type model from which I can neither easily extract generic RDF statements that have much meaning nor predict all of the properties. If the goal is narrowly to, say, define attributes in terms of OBOE properties, I wonder if we shouldn't be XML-izing OBOE and extending EML explicitly with those terms rather than adding arbitrary URIs? (That sounds complicated and personally I don't advocate for that path, but just throwing it out there as a thought experiment). I see/agree that adding literals and permitting annotations on arbitrary nodes would allow arbitrary semantic extensions without extending EML. Likewise, I agree that in such a case, it makes sense to treat all EML as RDF, and not just a few random semantic nodes (e.g. in my example, it makes little sense to extend I see that if any random RDF is suddenly valid EML that we've pretty much lost any advantage in having a well-defined schema and we're pretty much back in a mess where you don't know where to find the title of the dataset (is it
I think such an approach makes it more obvious to developers and consumers how to interact with a semantic layer in EML, or more generally, how to interact in EML semantically, without making arbitrary semantic extensions into first-class citizens that any parser must suddenly be able to deal with. Anyway, treating all of EML as JSON-LD or RDF is pretty far from the topic here, so like I've said before, forge ahead with the practical, but I do think it provides a nice illustration of something that is both extensible and flexible but doesn't lose any of the power we gained in the first place from a rigidly-defined schema -- after all you can always transform into the XSD-valid schema representation. I've mentioned this before and pestered @amoeba with a proof-of-principle to transform between RDF, JSON, and EML-valid XML: https://github.com/cboettig/emld. (Really this is just the back-end for a re-write of the EML package that uses lists instead of S4, which provides what I hope will be a lot more intuitive interface for most R users or developers. https://github.com/cboettig/eml2) okay, probably lost everyone now, so better turn in for the night. |
Hi, I'm not an XML maven like most of you here, so I don't have an opinion or insight on the specifics of serialization solutions from the XML guts of EML. However, Matt asked me to look over the latest comments on semanticizing EML and here are my thoughts, fwiw: the proposed "valueURI" should probably be renamed as "objectURI" as it appears consistently to be used as an object for an RDF triple-- and as such it can contain either literals (i.e. values) or URIs -- Carl's concern that "foaf:name" doesn't seem like a "value" (triple object) is reasonable, since that foaf term is indeed a property (predicate). Looking at csjx suggestion about using the EML 'id' (comment from 13 days ago), I think a desired triplification {s,p,o} would be: <URI for EML element ID><foaf:name><"Mecum"> (where the challenge may be minting that httpURI in the subject position? and we'd hope that rather than <foaf:familyName>="Mecum", things will eventually devolve to <https://cn.dataone.org/cn/v2/resolve/4cdb6dd6-66c2-478a-af66-9969a3142813#54321> (above isn't quite correct since foaf doesn't offer appropriate property "foaf:orcidID", although it has "foaf:skypeID" and "foaf:icqchatID" etc.!) For many of our Use Cases, however, I think we frequently "simply" (ha!) want to mint an http URI that points to the specific element in EML for which we want to add semantics. So, going back to Matt's early comments from Jan. 4 <"variable1"> <http://example.com/vocab1/hasStorageType> <http://example.com/vocab1/float> what most excites me is being able to frequently assert: (...although note again that we'd need to mint an httpURI in the subject position.) Also, this is why in discussion with Matt and Margaret, I had suggested that, at least for attribute metadata, a default propertyURI could be "rdf:type", simply asserting class membership of some URI-specified instance ("variable1" in this case), as a member of Class "measurementTypeXXX".
That "measurementTypeXXX" would be defined in our ECSO ontology and accessible with its PIRI GUID (PURL, that is) --with, e.g. an rdf:label or skos:prefLabel of "air temperature"-- and appropriate axioms about what characteristic ("temperature"), what entity ("air"), and potentially dimensions ("degrees Celsius") etc describe that MeasurementTypeXXX. But all that additional information would/could be garnered from dereferencing the ECSO URI in the object position of the triple. Well, hope this isn't too muddled or trivial...it's kind of turtle-ish. |
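The idea of minting an http URI for the subject can be sketched as plain string concatenation of a resolver base, the package identifier, and the element id. The resolver base and package identifier below are taken from this thread; the function names, the `variable1` element id used as a fragment, and the `example.com` URI standing in for "measurementTypeXXX" are all hypothetical, and whether such a fragment URI would actually dereference is an open question in the discussion.

```python
from urllib.parse import quote

# Resolver base as it appears in the comment above.
RESOLVE_BASE = "https://cn.dataone.org/cn/v2/resolve/"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"

# Hypothetical placeholder for the ECSO class discussed above.
MEASUREMENT_TYPE = "http://example.com/ecso/measurementTypeXXX"


def mint_subject_uri(package_id, element_id, base=RESOLVE_BASE):
    """Build a subject URI: resolver base + package id + '#' + element id."""
    return f"{base}{quote(package_id, safe='')}#{quote(element_id, safe='')}"


def to_ntriples(subject_uri, property_uri, value_uri):
    """Serialize one URI-only statement as a single N-Triples line."""
    return f"<{subject_uri}> <{property_uri}> <{value_uri}> ."


subject = mint_subject_uri("4cdb6dd6-66c2-478a-af66-9969a3142813", "variable1")
line = to_ntriples(subject, RDF_TYPE, MEASUREMENT_TYPE)
print(line)
```

With a minted subject, the default `rdf:type` assertion suggested for attribute metadata becomes an ordinary triple that generic RDF tooling could consume.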
After several offline conversations, we have reached consensus on implementing annotations using just property and value URIs, which in turn can be located in 5 locations in the EML document:
We've also agreed to embed the label in the element for readability. So a typical annotation would look like:
<annotation>
<propertyURI label="uses unit">http://ecoinformatics.org/oboe/oboe.1.2/oboe-core.owl#usesStandard</propertyURI>
<valueURI label="Kilogram">http://ecoinformatics.org/oboe/oboe.1.2/oboe-standards.owl#Kilogram</valueURI>
</annotation>
In that case, the For annotations in
<annotations>
<annotation references="CDR-biodiv-table">
<propertyURI label="Subject">http://purl.org/dc/elements/1.1/subject</propertyURI>
<valueURI label="grassland biome">http://purl.obolibrary.org/obo/ENVO_01000177</valueURI>
</annotation>
<annotation references="adam.shepherd">
<propertyURI label="is a">http://www.w3.org/1999/02/22-rdf-syntax-ns#type</propertyURI>
<valueURI label="Person">https://schema.org/Person</valueURI>
</annotation>
<annotation references="adam.shepherd">
<propertyURI label="member of">https://schema.org/memberOf</propertyURI>
<valueURI label="BCO-DMO">https://doi.org/10.17616/R37P4C</valueURI>
</annotation>
</annotations>
For annotations in
<additionalMetadata>
<describes>adam.shepherd</describes>
<metadata>
<annotation>
<propertyURI label="member of">https://schema.org/memberOf</propertyURI>
<valueURI label="BCO-DMO">https://doi.org/10.17616/R37P4C</valueURI>
</annotation>
</metadata>
</additionalMetadata>
That should wrap up implementation of the annotation field. Merge commit is SHA fbafee0. |
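To make the three placements above concrete, here is a minimal Python sketch that collects (subject, property URI, value URI) triples from a single document. It rests on several assumptions that are not confirmed by this thread: that the subject of an inline annotation is the `id` of its containing element, that the sample skeleton (which is not schema-valid EML and omits namespaces) is an acceptable stand-in, and that `extract_triples` is a hypothetical helper rather than anything in EML Parser.

```python
import xml.etree.ElementTree as ET

# Simplified, non-schema-valid skeleton assembled from the fragments above.
EML = """\
<eml>
  <dataset id="CDR-biodiv-table">
    <creator id="adam.shepherd"/>
    <attribute id="att.12">
      <annotation>
        <propertyURI label="uses unit">http://ecoinformatics.org/oboe/oboe.1.2/oboe-core.owl#usesStandard</propertyURI>
        <valueURI label="Kilogram">http://ecoinformatics.org/oboe/oboe.1.2/oboe-standards.owl#Kilogram</valueURI>
      </annotation>
    </attribute>
  </dataset>
  <annotations>
    <annotation references="CDR-biodiv-table">
      <propertyURI label="Subject">http://purl.org/dc/elements/1.1/subject</propertyURI>
      <valueURI label="grassland biome">http://purl.obolibrary.org/obo/ENVO_01000177</valueURI>
    </annotation>
  </annotations>
  <additionalMetadata>
    <describes>adam.shepherd</describes>
    <metadata>
      <annotation>
        <propertyURI label="member of">https://schema.org/memberOf</propertyURI>
        <valueURI label="BCO-DMO">https://doi.org/10.17616/R37P4C</valueURI>
      </annotation>
    </metadata>
  </additionalMetadata>
</eml>
"""


def _triple(subject_id, ann):
    return (subject_id, ann.findtext("propertyURI"), ann.findtext("valueURI"))


def extract_triples(root):
    """Collect annotation triples from all three placements."""
    triples = []
    # 1. Inline: assume the subject is the id of the containing element.
    for parent in root.iter():
        if parent.tag in ("annotations", "metadata"):
            continue  # handled by the two cases below
        for ann in parent.findall("annotation"):
            triples.append(_triple(parent.get("id"), ann))
    # 2. <annotations>: the subject is named by the references attribute.
    for ann in root.findall("./annotations/annotation"):
        triples.append(_triple(ann.get("references"), ann))
    # 3. <additionalMetadata>: the subject is named by <describes>.
    for extra in root.findall("./additionalMetadata"):
        subject = extra.findtext("describes")
        for ann in extra.findall("./metadata/annotation"):
            triples.append(_triple(subject, ann))
    return triples


triples = extract_triples(ET.fromstring(EML))
for t in triples:
    print(t)
```

Handling the three placements in one function illustrates the earlier concern about needing more than one lookup path to gather every annotation in a document.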
After discussion, we agreed to add language to conditionally require the use of the We decided that making This requires an addition to EML Parser. Reopening until I can update this documentation and EMLParser. |
This sounds good to me, but I think we need to clarify the constraints, if
any, on the contents of the "value=" fields for the two URI elements. Are
these free text or {should | must} these be populated by an rdfs:label or
skos:prefLabel if such exist? Although we have (I believe) encountered
cases where there is no helpful Annotation Property of this sort and some
natural language semantics is implied by the URI itself...
for the two label fields, my opinion is that it's a { should } be populated by an rdfs:label or Mainly, because to say {must} would mean that we ought to be able to confirm the label is correct, which is not practical. Communities may want to do their own checking however, which would be tied to specific vocabularies. |
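The community-level checking mentioned here could look something like the sketch below, which compares an annotation's label against a locally maintained table of preferred labels. The table, the `check_label` function, and the three result strings are hypothetical; the two entries reuse URI and label pairs that appear in examples earlier in this thread, and a real check would presumably load labels from the vocabulary itself.

```python
# Hypothetical local table of preferred labels, keyed by term URI.
KNOWN_LABELS = {
    "http://purl.obolibrary.org/obo/ENVO_01000177": "grassland biome",
    "http://ecoinformatics.org/oboe/oboe.1.2/oboe-standards.owl#Kilogram": "Kilogram",
}


def check_label(uri, label, vocabulary=KNOWN_LABELS):
    """Return 'ok', 'mismatch', or 'unknown' for one URI/label pair."""
    expected = vocabulary.get(uri)
    if expected is None:
        return "unknown"  # term not in the local table, nothing to compare
    if expected.casefold() == label.strip().casefold():
        return "ok"
    return "mismatch"


print(check_label("http://purl.obolibrary.org/obo/ENVO_01000177", "grassland biome"))
print(check_label("http://purl.obolibrary.org/obo/ENVO_01000177", "grassland"))
print(check_label("http://example.com/vocab1/float", "float"))
```

Reporting "unknown" rather than failing matches the {should} reading: a label that cannot be confirmed is not treated as an error.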
At the LTER ASM breakout discussion on vocabularies, there was great interest in how to use/substitute formally defined (i.e. by specifying a dereferenceable GUID from a term in an (approved) thesaurus or ontology) terms as EML KEYWORDS. Some discussion ensued that semantic annotation at the level of dataset and entity essentially constitute EML KEYWORDS describing the object at that level. We and potentially the LTER Community need to agree on best practices in this process. Clearly having well-conceived EML KEYWORDS will be a major boon (and possibly opens up some interesting uses for Object Properties). |
Author Name: Matt Jones (Matt Jones)
Original Redmine Issue: 277, https://projects.ecoinformatics.org/ecoinfo/issues/277
Original Date: 2001-08-31
Original Assignee: Matt Jones
Need to extend EML, either by adding a new module or extending the current
entity/attribute system, so that semantic metadata can be accommodated.
Basically, this means being able to enter terms from an ontology (see bug 274)
so that a particular data table attribute can be tied into the ontology. See
the KDI proposal on canonical variables for more information.