semantic metadata module/extensions #25

Closed
mbjones opened this issue Mar 12, 2017 · 37 comments

@mbjones mbjones commented Mar 12, 2017


Author Name: Matt Jones (Matt Jones)
Original Redmine Issue: 277, https://projects.ecoinformatics.org/ecoinfo/issues/277
Original Date: 2001-08-31
Original Assignee: Matt Jones


Need to extend EML, either by adding a new module or extending the current
entity/attribute system, so that semantic metadata can be accommodated.
Basically, this means being able to enter terms from an ontology (see bug 274)
so that a particular data table attribute can be tied into the ontology. See
the KDI proposal on canonical variables for more information.

@mbjones mbjones commented Mar 12, 2017


Original Redmine Comment
Author Name: Matt Jones (Matt Jones)
Original Date: 2004-09-02T16:38:17Z


Changing QA contact to the list for all current EML bugs so that people can
track what is happening.

@mbjones mbjones commented Mar 12, 2017


Original Redmine Comment
Author Name: Redmine Admin (Redmine Admin)
Original Date: 2013-03-27T21:13:50Z


Original Bugzilla ID was 277

@mbjones mbjones added this to the Postpone milestone Mar 12, 2017
@mbjones mbjones added this to TODO in EML 2.2.0 Release Mar 12, 2017
@mbjones mbjones added bug and removed Status: In Progress labels Mar 12, 2017
@mbjones mbjones moved this from TODO to High priority in EML 2.2.0 Release Apr 22, 2017
@mbjones mbjones self-assigned this Apr 22, 2017
@mbjones mbjones moved this from High priority to In progress in EML 2.2.0 Release Apr 22, 2017

@mbjones mbjones commented Jul 7, 2017

Added new schema file eml-semantics.xsd for providing a new SemanticAnnotation type. Needs to be tested, reviewed, and incorporated into the other schemas.

@mbjones mbjones commented Jul 7, 2017

@mobb and @mpsaloha I wanted to bring this semantic extension for EML to your attention in particular. I'm just starting to think about how this would work, but for now I committed a new eml-semantics.xsd file with a SemanticAnnotation complexType in commit sha 1dacda8 in the EML 2.2 branch. My thought is that I will add optional annotation elements using the SemanticAnnotation type in key structures in EML, particularly in the following places:

  • ResourceGroup group, to get all resource types, including /eml/dataset
  • EntityGroup group in eml-entity.xsd (covers dataTable, otherEntity, spatialRaster, ...)
  • AttributeType complexType in eml-attribute.xsd

I would appreciate your thoughts on this. You can view the xsd file in the branch: https://github.com/NCEAS/eml/blob/BRANCH_EML_2_2/xsd/eml-semantics.xsd

mbjones added a commit that referenced this issue Jul 24, 2017
The `SemanticAnnotation` type was defined as a sequence nested inside of a choice, which
was not needed. See issue #25.
mbjones added a commit that referenced this issue Jul 24, 2017
Continuing work on issue #25, added in fields of type sem:SemanticAnnotation in three
key schemas.
mbjones added a commit that referenced this issue Jul 24, 2017
Continuing work on issue #25, this commit now allows the new annotation
fields to be demonstrated. The eml-sample.xml document validates using xerces.

@mbjones mbjones commented Jul 24, 2017

@mobb, @amoeba, @mpsaloha, @csjx, @cboettig -- The new fields for populating semantic annotations are now present in the EML schemas in the BRANCH_EML_2_2, and I have linked them into three locations -- ResourceGroup, EntityGroup, and AttributeType. So, now you can add zero or more annotation fields to each of those structures in EML. We would typically be using the annotation in eml-attribute to attach OBOE-style annotations to the attributes in a data set. But you can also attach more general annotations to, for example, /eml/dataset and /eml/dataset/dataTable, which makes it broadly applicable as a semantic tagging module.

Could you please review, comment, and revise? Including the element and type documentation in the xsd files? Here's an excerpt from the eml-sample.xml document that shows the annotations in use:

<?xml version="1.0"?>
<eml:eml
    packageId="eml.1.1" system="knb"
    xmlns:eml="eml://ecoinformatics.org/eml-2.2.0"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xmlns:stmml="http://www.xml-cml.org/schema/stmml-1.1"
    xsi:schemaLocation="eml://ecoinformatics.org/eml-2.2.0 ../../../xsd/eml.xsd">

<dataset>
  <title>Data from Cedar Creek LTER on productivity and species richness
  for use in a workshop titled "An Analysis of the Relationship between
  Productivity and Diversity using Experimental Results from the Long-Term
  Ecological Research Network" held at NCEAS in September 1996.</title>
  <creator id="clarence.lehman">
    <individualName>
      <salutation>Mr.</salutation>
      <givenName>Clarence</givenName>
      <surName>Lehman</surName>
    </individualName>
    ...
  </creator>
  ...
  <keywordSet>
    <keyword>Old field grassland</keyword>
    <keyword>biomass</keyword>
    <keyword>productivity</keyword>
    <keyword>species-area</keyword>
    <keyword>species richness</keyword>
  </keywordSet>
  <annotation>
      <termURI>http://purl.obolibrary.org/obo/ENVO_01000177</termURI>
      <termLabel>grassland biome</termLabel>
  </annotation>
  <contact>
    <references>clarence.lehman</references>
  </contact>
  <contact>
    <references>richard.inouye</references>
  </contact>
  <dataTable id="xyz">
    <entityName>CDR LTER-patterns among communities.txt</entityName>
    <entityDescription>patterns among communities at CDR</entityDescription>
    <physical>
        ...
    </physical>
    <annotation>
        <termURI>http://purl.obolibrary.org/obo/ENVO_00000260</termURI>
        <termLabel>prairie</termLabel>
    </annotation>
    <attributeList id="at.1">
      ...
      <attribute id="att.12">
        <attributeName>biomass</attributeName>
        <attributeLabel>Biomass</attributeLabel>
        <attributeDefinition>The total biomass measured in this field
        </attributeDefinition>
        <storageType>float</storageType>
        <measurementScale>
          <ratio>
            <unit><customUnit>gramsPerSquareMeter</customUnit></unit>
            <precision>0.01</precision>
            <numericDomain id="nd.6">
              <numberType>real</numberType>
              <bounds>
                <minimum exclusive="true">0</minimum>
              </bounds>
            </numericDomain>
          </ratio>
        </measurementScale>
        <annotation>
            <termURI>http://ecoinformatics.org/oboe/oboe.1.2/oboe-characteristics.owl#Mass</termURI>
            <termLabel>Mass</termLabel>
        </annotation>
        <annotation>
            <termURI>http://ecoinformatics.org/oboe/oboe.1.2/oboe-standards.owl#Kilogram</termURI>
            <termLabel>Kilogram</termLabel>
        </annotation>
        <annotation>
            <termURI>http://example.com/example-vocab-1.owl#PlantSample</termURI>
            <termLabel>Plant Sample</termLabel>
        </annotation>
      </attribute>
...
    </attributeList>
    <caseSensitive>no</caseSensitive>
    <numberOfRecords>22</numberOfRecords>
  </dataTable>
</dataset>
<additionalMetadata>
<metadata>
<stmml:unitList xmlns:stmml="http://www.xml-cml.org/schema/stmml-1.1"
    xsi:schemaLocation="http://www.xml-cml.org/schema/stmml-1.1 ../../../xsd/stmml.xsd">
    <!--note that the unitTypes here are taken from the eml-unitDictionary.xml-->
    <stmml:unit name="gramsPerSquareMeter" unitType="arealMassDensity" id="gramsPerSquareMeter" parentSI="kilogramsPerSquareMeter" multiplierToSI=".001"/>
    <stmml:unit name="speciesPerSquareMeter" unitType="arealDensity" id="speciesPerSquareMeter" parentSI="numberPerSquareMeter" multiplierToSI="1"/>
  </stmml:unitList>
  </metadata>
</additionalMetadata>
</eml:eml>
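A minimal sketch, in Python, of how a consumer might read inline annotations like the ones in this excerpt. It assumes the `termURI`/`termLabel` element names shown above and that annotated parents may or may not carry an `id`; the `collect_annotations` helper and the trimmed `SAMPLE` document are hypothetical and not part of any EML tooling.

```python
import xml.etree.ElementTree as ET

# Trimmed, hypothetical stand-in for the eml-sample.xml excerpt above.
SAMPLE = """<eml:eml xmlns:eml="eml://ecoinformatics.org/eml-2.2.0" packageId="eml.1.1" system="knb">
  <dataset>
    <annotation>
      <termURI>http://purl.obolibrary.org/obo/ENVO_01000177</termURI>
      <termLabel>grassland biome</termLabel>
    </annotation>
    <dataTable id="xyz">
      <attributeList id="at.1">
        <attribute id="att.12">
          <annotation>
            <termURI>http://ecoinformatics.org/oboe/oboe.1.2/oboe-characteristics.owl#Mass</termURI>
            <termLabel>Mass</termLabel>
          </annotation>
        </attribute>
      </attributeList>
    </dataTable>
  </dataset>
</eml:eml>"""


def collect_annotations(root):
    """Return (parent tag, parent id, term URI, term label) for each annotation."""
    found = []
    for parent in root.iter():  # document order, root included
        for ann in parent.findall("annotation"):  # direct children only
            found.append((
                parent.tag,
                parent.get("id"),  # None when the parent has no id attribute
                ann.findtext("termURI", default="").strip(),
                ann.findtext("termLabel", default="").strip(),
            ))
    return found


annotations = collect_annotations(ET.fromstring(SAMPLE))
for row in annotations:
    print(row)
```

Keeping the parent tag and id next to each term matters because the parent is the thing being annotated; the term alone does not say what it describes.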
@mbjones mbjones moved this from In progress to Completed in EML 2.2.0 Release Jul 24, 2017
@mbjones mbjones removed this from the Postpone milestone Oct 30, 2017

@mbjones mbjones commented Jan 5, 2018

@gastil-buhl thanks so much for the comments. I think the challenges you describe are very real, and we'll need to work on good documentation, including a primer. But I think they will be present for any semantically-precise implementation we might choose for EML. But re-reading the documentation I wrote, it's clear that I could be more concrete in describing just how the annotation abstraction (and it is definitely an abstraction) works. It's a very meta-level concept. In short, each annotation asserts some information about a part of an EML document, and that information is expressed as a property and a value, both of which are drawn from controlled vocabularies.

For example, I might want an annotation to say:

variable1 hasStorageType float

In this annotation, variable1 is the EML attribute that is being annotated (i.e., we are saying something about it), the property that we are asserting about variable1 is hasStorageType, which has the value float.

But the words 'hasStorageType' and 'float' are semantically ambiguous, in that there can be multiple definitions of those words. So, rather than using the human readable (and ambiguous) word 'hasStorageType', we instead use the URI for that term that provides a formal definition in its controlled vocabulary (something like http://example.com/vocab1/hasStorageType). So 'hasStorageType' is just a label we use to display the term defined by the URI. Similarly, the word 'float' is just a label used to display the more precise term that it represents (e.g., http://example.com/vocab1/float). So, in reality, the true annotation is expressed using URIs, not labels:

variable1 http://example.com/vocab1/hasStorageType http://example.com/vocab1/float

So the labels are just human readable strings to substitute for the controlled term URI when displaying the information.
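To make the label-versus-URI distinction concrete, here is a small Python sketch. The `Annotation` class and its field names are hypothetical illustrations, not the EML schema; the URIs are the example.com placeholders used above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotation:
    """One statement about a part of an EML document (hypothetical model)."""
    subject: str          # the element being annotated, e.g. an attribute id
    property_uri: str     # formal definition of the property
    property_label: str   # display string only
    value_uri: str        # formal definition of the value
    value_label: str      # display string only

    def as_triple(self):
        """The statement itself: expressed with URIs, never labels."""
        return (self.subject, self.property_uri, self.value_uri)

    def as_display(self):
        """Human-readable rendering: labels stand in for the URIs."""
        return f"{self.subject} {self.property_label} {self.value_label}"


ann = Annotation(
    subject="variable1",
    property_uri="http://example.com/vocab1/hasStorageType",
    property_label="hasStorageType",
    value_uri="http://example.com/vocab1/float",
    value_label="float",
)
print(ann.as_display())
print(ann.as_triple())
```

Two annotations with different labels but the same URIs would make the same assertion, which is why only `as_triple` should feed any reasoning or search.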

Maybe this helps? Clearly we'll need help writing clear documentation. The challenge will be in being both clear and concise. A primer will allow us to be more complete than we can be in the EML specification itself.

@mbjones mbjones commented Feb 14, 2018

Some people have requested that the definition of the annotation type include mention of the ability to include it in the additionalMetadata field of EML. In this case, the describes element would be used to define the subject of the annotation triple. Once I add that documentation, I think this feature is ready for release. We should, however, open another ticket for a primer document.

@gastil-buhl gastil-buhl commented Feb 14, 2018

@amoeba amoeba commented Feb 16, 2018

Hey @mbjones this looks pretty good. A few thoughts came to mind:

  • Did we ever talk about this XML structure as an alternative?

    <annotation>
      <property label="ofCharacteristic">http://ecoinformatics.org/oboe/oboe.1.2/oboe-core.owl#ofCharacteristic</property>
      <value label="Temperature">http://ecoinformatics.org/oboe/oboe.1.2/oboe-characteristics.owl#Temperature</value>
    </annotation>

    which feels a little more XML-ish. The label attribute could be optional. How it is now is fine with me though. I tend to shy away from XML attributes.

  • In the examples above, I see different types of things put in the property* and term* elements. I'm not sure what the correct way to use this is. In some examples, property contains a predicate (e.g., oboe:ofCharacteristic) and in others, it uses a class (e.g., oboe:Characteristic). In your most recent example, you put predicates in property which makes the most sense.

    In this case, when we annotate an attribute with annotations of ofCharacteristic X and ofEntity Y (as in your above example), are we implying that the attribute is of type oboe:Measurement, but only through induction? Is there any benefit to explicitly stating this through another annotation with an rdf:type property and the class as the value?

  • In terms of workflows, I wonder about how this fits into the existing EML R package. Currently, a data.frame is used as a universal data structure for working with attributes in R, with each row corresponding to one attribute and each column describing the information about that attribute.

    # eg
    data.frame(attributeName = ..., 
               attributeLabel = ..., 
               measurementScale = ..., 
               domain = ...)

    Since each attribute can have zero or more annotations, we'd either (1) have to break the one-row-per-attribute model, (2) use list columns, or (3) describe each attribute's annotation in a separate data structure. Options (2) and (3) seem reasonable and achievable given the proposed schema so I don't see any issues. I'd probably go with (3). Just wanted to get that out there.
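Option (3) can be sketched as two flat tables joined on the attribute name. This is a rough Python illustration only, not the EML R package API: the column names mirror the data.frame fragment above, the `propertyURI`/`valueURI` columns and the `annotations_for` helper are assumptions, and the OBOE URIs are composed from the namespaces quoted in this thread.

```python
# One row per attribute, as in the existing model (hypothetical values).
attributes = [
    {"attributeName": "biomass", "attributeLabel": "Biomass"},
    {"attributeName": "site", "attributeLabel": "Site"},
]

# Separate structure: one row per annotation, keyed by attribute name.
annotations = [
    {"attributeName": "biomass",
     "propertyURI": "http://ecoinformatics.org/oboe/oboe.1.2/oboe-core.owl#ofCharacteristic",
     "valueURI": "http://ecoinformatics.org/oboe/oboe.1.2/oboe-characteristics.owl#Mass"},
    {"attributeName": "biomass",
     "propertyURI": "http://ecoinformatics.org/oboe/oboe.1.2/oboe-core.owl#ofEntity",
     "valueURI": "http://example.com/example-vocab-1.owl#PlantSample"},
]


def annotations_for(name):
    """Look up the zero or more (property, value) pairs for one attribute."""
    return [(a["propertyURI"], a["valueURI"])
            for a in annotations if a["attributeName"] == name]


counts = {row["attributeName"]: len(annotations_for(row["attributeName"]))
          for row in attributes}
print(counts)
```

The attribute table keeps its one-row-per-attribute shape, and an attribute with no annotations simply has no rows in the second table.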

Regarding your above points,

does the PropertyURI/valueURI approach work generically

I think so.

is including a single label for each URI sensible so that resolution is not needed to produce more human readable output

Yes, though there is a chance for the label to be out of sync with the URI if a user mis-types the information or if the rdfs:label of a term changes in the ontology.

Are the places that I allowed annotations adequate? 1) Resource/Dataset, 2) Entity, 3) Attribute, given that people could add additional annotations through additionalMetadata as desired using the describes element

I think so. Putting the annotations in-line reduces the need for the user to look in multiple places in the document for the information. And allowing additional annotations to be put into the additionalMetadata sets aside a catch-all place for annotations that don't belong elsewhere. Why did we decide to use a separate element though? It might be nice if I could just run the XPath //annotation to grab all the annotations in a document.

@mbjones mbjones commented Feb 24, 2018

After discussion with @mpsaloha and @mobb, we agreed that it would be good to make propertyURI and propertyLabel optional to create a simpler case when someone wants to just generically tag a resource, entity, or attribute. I made this change in SHA 733a650, and provided documentation that indicates the default property in the case one is omitted. For resource and entity subclasses, the default property is Dublin Core Element Set dc:subject, which lets us indicate generally the topic associated with a data set or a data table. For attribute elements, the default property is oboe:MeasurementType, which lets us associate attribute semantics with the variable.
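As a sketch of what this default rule would ask of every consumer, the Python below resolves the effective property for an annotation. The `effective_property` function and the `parent_kind` values are hypothetical. The dc:subject URI is the Dublin Core Element Set term; the OBOE URI is an assumption composed from the oboe-core namespace quoted earlier in this thread.

```python
# Dublin Core Element Set "subject" term.
DC_SUBJECT = "http://purl.org/dc/elements/1.1/subject"
# Assumed URI for oboe:MeasurementType (oboe-core namespace + local name).
OBOE_MEASUREMENT_TYPE = (
    "http://ecoinformatics.org/oboe/oboe.1.2/oboe-core.owl#MeasurementType"
)


def effective_property(parent_kind, property_uri=None):
    """Return the explicit property if given, else the documented default."""
    if property_uri:
        return property_uri
    if parent_kind == "attribute":
        return OBOE_MEASUREMENT_TYPE
    if parent_kind in ("resource", "entity"):
        return DC_SUBJECT
    raise ValueError(f"no default property defined for {parent_kind!r}")


print(effective_property("entity"))
print(effective_property("attribute"))
print(effective_property("attribute", "http://example.com/vocab1/hasStorageType"))
```

The defaults live only in code and documentation here, not in the instance document, so every parser has to carry the same table.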

@mbjones mbjones commented Feb 24, 2018

@amoeba Thanks for the comments.

  • Your proposal to nest the label in property and value elements is nice and readable and compact. I considered it, and decided initially not to because it's harder to access attributes using XPath. However, seeing it written out the way you did makes the whole block much more readable, so I am inclined to agree this would be good.
  • I'll have to look through the examples, but they should be consistent in the docs and examples. I agree that property should contain something that is like a predicate, whereas value should contain some class. Feel free to point out or fix specific places where the examples in the EML directory are wrong (the github ticket has earlier versions so I wouldn't worry about that per se).
  • We'll definitely need to discuss how to handle this in the EML R package functions.

I'm unclear on what your final point is about how we decided to use a separate element? They are all of the same type, and //annotation should work fine, although it loses the resource/entity/attribute parent that is critical to knowing the subject of the triple to be generated.

@amoeba amoeba commented Feb 24, 2018

I coulda made that comment more clear. I was referring to this:

through additionalMetadata as desired using the describes element

This means my XPath to grab all annotations has to look for both the annotation and the describes tag unless I misunderstand something.

@cboettig cboettig commented Feb 24, 2018

I might just not be following things correctly at this stage, but I think I'm not in favor of this new proposal. I don't like not having a property URI, or having a property URI that is only defined by some implicit convention (let alone two separate default conventions depending on context). I think it's fine if user-facing tooling wants to permit a default property to make it 'easy' to tag an entity or attribute, but I think the property URI should be written explicitly into the EML. I think the EML schema itself is not the best way to establish this kind of implicit or default property (partly because I don't see where that definition exists, other than in documentation), and I think it is asking for trouble at some stage.

I would like to be able to treat any EML node as the subject (the parent node of the annotation, as in turtle or JSON-LD, or RDFa), and always have predicate/property URI and object/value of the triple clearly stated.

(To me, RDFa still seems like the most obvious way to add semantics to XML, and permits existing technology (any RDFa parser) to extract the semantic annotations with minimum fuss, though I don't particularly like RDFa notation). Unrelated issue I probably should have asked earlier, but I admit I'm also lost as to why you enforce that the value is a URI at all -- why not permit Literal valued objects?

@csjx csjx commented Feb 24, 2018

You know, after catching up a bit on this thread, I do wonder why we are limiting annotations to resources, entities, and attributes, other than the very practical reasons that it limits the scope which limits the implementation changes. I certainly get that.

I think I agree with @cboettig here that it would be nice to apply annotations to any element in the EML. I like that @mbjones has put the time into defining the eml-annotation.xsd module, and that seems like how we can validate annotation syntax. But I've always thought they would be handled similarly to the customUnits that we drop into /eml/additionalMetadata. But, to avoid the pain that is xs:any content, I'm wondering about adding a top level optional /eml/annotations element, which would be a list of 0..n annotation elements however we define them per the discussion above. The annotation so far provides the predicate and object of the triple, and so I'm thinking that we could define the subject as a references element that points to the id of the element being annotated. Something like:

<eml packageId="4cdb6dd6-66c2-478a-af66-9969a3142813" ...>
    <dataset>
        <title>We heart data science</title>
        <creator id="12345">
            <individualName><surName id="54321">Mecum</surName></individualName>
        </creator>
        <contact><references>12345</references></contact>
    </dataset>
    <annotations>
        <annotation>
            <references>54321</references>
            <propertyURI label="a">http://www.w3.org/1999/02/22-rdf-syntax-ns#type</propertyURI>
            <valueURI label="familyName">http://xmlns.com/foaf/0.1/#familyName</valueURI>
        </annotation>
    </annotations>
</eml>

The crux here is to allow for the XML id attribute on effectively any [or almost any] element defined in the schema. As I understand it, we have only added an id attribute to certain elements for referencing within the instance documents, but this would be a uniform, optional, backward-compatible change (I think).

One advantage of doing this is that we might be able to convert these non-compliant triple statements into RDF-compliant statements by concatenating some dereference-able URI and the id attribute to create a subject URI, like:

https://cn.dataone.org/cn/v2/resolve/4cdb6dd6-66c2-478a-af66-9969a3142813#54321

Hmm, now that I look at that, we'd have a problem interpreting the pid from the document fragment reference. But you get what I mean. The ids are all unique references into the XML document as anchors.

Well, food for thought.
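The id-to-subject idea can be sketched in Python as follows. This is a rough illustration under the assumptions of the example above: a top-level `annotations` element, a `references` child pointing at an `id`, and a subject URI formed by concatenating the resolve URL, the `packageId`, and the id as a fragment. The `annotation_triples` helper is hypothetical, and the fragment-interpretation caveat mentioned above still applies.

```python
import xml.etree.ElementTree as ET

# Base taken from the example subject URI above.
RESOLVE_BASE = "https://cn.dataone.org/cn/v2/resolve/"

# Trimmed copy of the example document above.
DOC = """<eml packageId="4cdb6dd6-66c2-478a-af66-9969a3142813">
  <dataset>
    <creator id="12345">
      <individualName><surName id="54321">Mecum</surName></individualName>
    </creator>
  </dataset>
  <annotations>
    <annotation>
      <references>54321</references>
      <propertyURI label="a">http://www.w3.org/1999/02/22-rdf-syntax-ns#type</propertyURI>
      <valueURI label="familyName">http://xmlns.com/foaf/0.1/#familyName</valueURI>
    </annotation>
  </annotations>
</eml>"""


def annotation_triples(root):
    """Turn /eml/annotations entries into (subject, predicate, object) URIs."""
    package_id = root.get("packageId")
    known_ids = {el.get("id") for el in root.iter() if el.get("id") is not None}
    triples = []
    for ann in root.findall("annotations/annotation"):
        ref = ann.findtext("references").strip()
        if ref not in known_ids:
            # A dangling reference has no subject, so refuse to emit a triple.
            raise ValueError(f"references {ref!r} matches no id in the document")
        subject = f"{RESOLVE_BASE}{package_id}#{ref}"
        triples.append((
            subject,
            ann.findtext("propertyURI").strip(),
            ann.findtext("valueURI").strip(),
        ))
    return triples


triples = annotation_triples(ET.fromstring(DOC))
for t in triples:
    print(t)
```

Checking the reference against the ids actually present in the document is what makes the subject well defined; without it the statement would point at nothing.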

@amoeba amoeba commented Feb 24, 2018

@cboettig wrote:

but I think the property URI should be written explicitly into the EML.

👍 I didn't notice that, if this is actually the case, when I read over the proposal.

I would like to be able to treat any EML node as the subject (the parent node of the annotation, as in turtle or JSON-LD, or RDFa), and always have predicate/property URI and object/value of the triple clearly stated.

👍 Though what, in this case, is the URI of the Subject? i.e., can we export all of the annotations in an EML document into a triple store?

@cboettig cboettig commented Feb 24, 2018

Love @csjx's suggestion of having an option for an id attribute on every element (at least every complex type). That also addresses @amoeba's second question, since the id of the parent node is the subject URI.

The NeXML schema in phylogenetics works this way via RDFa meta elements with the about attribute to refer to any node in the XML document, where all nodes can have an id.

I see what Chris's example is trying to say, but I find it confusing to think of foaf:familyName as a valueURI instead of a propertyURI, and I can't quite figure out the resulting triples. Basically I don't think this makes sense for simple types, which take values rather than nodes as their argument. (i.e. in JSON-LD, you cannot have both @id and @value).

I imagine something like:

<?xml version="1.0" encoding="UTF-8"?>
<eml:eml xmlns:eml="eml://ecoinformatics.org/eml-2.1.1" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="eml://ecoinformatics.org/eml-2.1.1 eml.xsd"
    packageId="http://dataone.org/abc123">
  <title>Sample Dataset Description</title>
  <creator id="23445" scope="document">
    <annotation>
      <propertyURI>https://schema.org/birthDate</propertyURI>
      <valueLiteral typeOf="xs:Date">1980-02-02</valueLiteral>
    </annotation>
    <individualName>
      <surName>Smith</surName>
    </individualName>
  </creator>
  <contact>
    <references>23445</references>
  </contact>
</eml:eml>

which would contain the single triple

<http://dataone.org/abc123#23445> <https://schema.org/birthDate> "1980-02-02"^^Date

Note the above <annotation> is clearly equivalent to

<meta property="https://schema.org/birthDate" content= "1980-02-02" typeOf="xs:Date">

which has the advantage that the triple could be extracted by any existing RDFa->RDF stylesheet and doesn't involve creating any new syntax.
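For illustration, the typed-literal triple above could be serialized as an N-Triples line with a few lines of Python. This is a sketch under assumptions: the subject is formed as packageId plus `#` plus the node id, as in the example, and the datatype is written as the full XML Schema `date` IRI rather than the abbreviated `Date`. The `literal_triple` and `escape_literal` helpers are hypothetical.

```python
XSD = "http://www.w3.org/2001/XMLSchema#"


def escape_literal(text):
    """Escape the characters N-Triples requires escaping inside a quoted literal."""
    return (text.replace("\\", "\\\\")
                .replace('"', '\\"')
                .replace("\n", "\\n")
                .replace("\r", "\\r"))


def literal_triple(package_id, node_id, property_uri, value, xsd_type):
    """Build one N-Triples statement with a typed literal as the object."""
    subject = f"{package_id}#{node_id}"
    return (f'<{subject}> <{property_uri}> '
            f'"{escape_literal(value)}"^^<{XSD}{xsd_type}> .')


line = literal_triple(
    "http://dataone.org/abc123", "23445",
    "https://schema.org/birthDate", "1980-02-02", "date",
)
print(line)
```

The object here is a quoted string with a datatype, not an IRI, which is exactly the case a URI-only value element cannot express.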

(Though actually I think it makes more sense to interpret the whole document as triples, like this)

Okay, maybe I'm way off the deep end now, feel free to pull me back. 🏊

@csjx csjx commented Feb 25, 2018

Ah, I see what you mean. @cboettig wrote:

I see what Chris's example is trying to say, but I find it confusing to think of foaf:familyName as a valueURI instead of a propertyURI

I guess I was trying to disambiguate the EML surName term and assert that it is of rdf:type foaf:familyName, but that raises the question of how you reference the "Mecum" value in order to annotate it. I like your idea that the parent element with an attached id references the value contained within that element. Yeah, I don't see any other way to identify the content uniquely.

Regarding your example of putting the annotation element as a child of the creator: that would assume we would change the content model of all complexTypes in EML and add in an optional annotation element. This can be done, but to me it has more overall impact across all modules (I guess in a visual sense). I suggested that we consolidate all annotations inside of /eml/annotations (somewhat less obtrusive), so I think this is a point to raise with others to see what people like. I do in fact like the annotation co-occurring right with the element it is annotating. In my example it is a step removed, so may be harder to grok when perusing the EML, which I'm sure we all like to do on a Saturday evening. 😜

@mbjones mbjones commented Feb 27, 2018

Thanks for all of the input, @cboettig, @csjx, and @amoeba. Good stuff.

Regarding the optionality of the property field: I agree, but @mpsaloha and @mobb felt strongly in the other direction, so I was trying to accommodate their desire to have a default property. I agree with you that having the property be explicit is important and far more manageable within the context of EML. I'll wait for some feedback from the others, but I think I will plan to move it back to having property be required. Getting more voices on this issue would be helpful.

Regarding serialization, I like Bryce's suggestion of embedding the property and value labels as attributes in the parent element, and will plan on making that change in the next revision.

Regarding the use of annotations in additionalMetadata, that was what I was intending all along. That is functionally equivalent to what @csjx proposed with the <annotations> element that spans the document. The nice thing about the <annotations> element is that it indicates explicitly in the EML schema that providing annotations on any element with an id is intended behavior, whereas it is only implicit in additionalMetadata. This is the same reason why I think it is helpful to provide an explicit annotation element for the major places where we will really look for semantic clarifications, mainly the dataset, entity, and attribute elements. By making the possibility of the annotation clear, I think people will be more likely to provide them, which is particularly important for attribute elements. So, what I take from this is that we should allow annotations in 3 types of places, and harvest them all up at the time of schema parsing:

  • in attribute, entity, and dataset (or other resource) elements
  • in an /eml/annotations root element
  • in /eml/additionalMetadata
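Harvesting from these three places could look roughly like the Python sketch below. It is only an illustration under several assumptions: annotations use `propertyURI`/`valueURI` children, the proposed `/eml/annotations` element identifies its subject with `references`, and annotations inside `additionalMetadata` sit under `metadata` with the subject given by `describes`. The `harvest` function and the sample document are hypothetical.

```python
import xml.etree.ElementTree as ET

DOC = """<eml packageId="pkg.1">
  <dataset id="ds.1">
    <annotation>
      <propertyURI>http://purl.org/dc/elements/1.1/subject</propertyURI>
      <valueURI>http://purl.obolibrary.org/obo/ENVO_01000177</valueURI>
    </annotation>
    <dataTable id="dt.1">
      <attributeList>
        <attribute id="att.12"/>
      </attributeList>
    </dataTable>
  </dataset>
  <annotations>
    <annotation>
      <references>dt.1</references>
      <propertyURI>http://purl.org/dc/elements/1.1/subject</propertyURI>
      <valueURI>http://purl.obolibrary.org/obo/ENVO_00000260</valueURI>
    </annotation>
  </annotations>
  <additionalMetadata>
    <describes>att.12</describes>
    <metadata>
      <annotation>
        <propertyURI>http://ecoinformatics.org/oboe/oboe.1.2/oboe-core.owl#ofCharacteristic</propertyURI>
        <valueURI>http://ecoinformatics.org/oboe/oboe.1.2/oboe-characteristics.owl#Mass</valueURI>
      </annotation>
    </metadata>
  </additionalMetadata>
</eml>"""


def _pair(ann):
    """Extract the (property URI, value URI) pair from one annotation."""
    return (ann.findtext("propertyURI").strip(), ann.findtext("valueURI").strip())


def harvest(root):
    """Collect (subject id, property URI, value URI) from all three locations."""
    rows = []
    # 1. Inline annotations: the subject is the parent element's id.
    for parent in root.find("dataset").iter():
        for ann in parent.findall("annotation"):
            rows.append((parent.get("id"), *_pair(ann)))
    # 2. /eml/annotations: the subject is named by <references>.
    for ann in root.findall("annotations/annotation"):
        rows.append((ann.findtext("references").strip(), *_pair(ann)))
    # 3. additionalMetadata: the subject is named by each <describes>.
    for block in root.findall("additionalMetadata"):
        for target in block.findall("describes"):
            for ann in block.findall("metadata/annotation"):
                rows.append((target.text.strip(), *_pair(ann)))
    return rows


rows = harvest(ET.fromstring(DOC))
for row in rows:
    print(row)
```

All three sources reduce to the same row shape, so downstream code does not need to know where an annotation was written.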

Finally, @cboettig brought up a really new example with his use of a typed literal as the value of one of his example statements. I had been specifically avoiding the use of literals, as once we go down that slope we are really just re-inventing the RDF model within EML. The reason for the annotation element, in my mind, is to clarify the semantics of the existing literal values in an EML document. So, when we have an attribute with the literal attributeName=littermass, we can semantically clarify that the property measured might be Biomass. In contrast, Carl's example adds a whole new literal value (birthdate) to the ResponsibleParty type without extending the EML -- it basically would mean that people could add any literal to any EML element, and at that point we might as well just eliminate the XML serialization for EML and move to a full RDF serialization, which would be far easier to process than a mixed model. If we are to stick with an XSD schema for EML, I think the literal values should in general be modeled as extensions to the EML types. This is why I wrote the value element as having type xsd:anyURI, which precludes it from being a literal. I'm sure this will engender discussion.

As these issues are getting complex, and this conversation is dragging out, I think we should schedule a call to discuss the merits of the various proposals for annotation and come to some decisions so we can move forward with this. I will try to find a time this week on the EML slack channel (available on https://slack.nceas.ucsb.edu).

@cboettig cboettig commented Feb 27, 2018

@mbjones great points all around, and agree this would be nice to hash out in a more real-time discussion on the slack channel. Meanwhile I'm going to take the liberty of jotting a few notes here just in case scheduling on slack doesn't work out for me.

  • 👍 from me for keeping property explicit; though I'd love to hear the perspectives from others on that. I feel there are some really important points being made there, but also that they are better addressed at a user tooling level rather than in the schema itself.

  • 👍 on @amoeba's label-as-attribute notation.

  • 👍 on consolidating all the annotations.

  • Would love to hear more discussion on @csjx's proposal of expanding the use of an id attribute to more complexTypes. I think this opens the door to more semantic annotation use cases we might not be thinking of right now (maybe not everyone sees that as a good thing), but more generally I think the ability to use id to reference nodes is a very useful feature.

Okay, bigger issues (in which I'm probably jousting at windmills, but anyway)...

Right, I appreciate that the objective of the semantic extensions, as envisioned here, is really to provide some more precise semantics to existing literals such as measurements and not to open Pandora's box to arbitrary RDF statements. I'm not entirely clear that using semantic annotations and restricting those annotations to URI types is the best way to accomplish that, though -- there's still a lot that can be expressed as URIs that go outside of this scope, and it still means that EML is inherently a mixed-type model from which I can neither easily extract generic RDF statements that have much meaning nor predict all of the properties. If the goal is narrowly to, say, define attributes in terms of OBOE properties, I wonder if we shouldn't be XML-izing OBOE and extending EML explicitly with those terms rather than adding arbitrary URIs? (That sounds complicated and personally I don't advocate for that path, but just throwing it out there as a thought experiment).

I see/agree that adding literals and permitting annotations on arbitrary nodes would allow arbitrary semantic extensions without extending EML. Likewise, I agree that in such case, it makes sense to treat all EML as RDF, and not just a few random semantic nodes (e.g. in my example, it makes little sense to extend creator with a birthDate as a semantic annotation if the rest of the creator metadata is not also accessible as the obvious triples). However, I don't think this means abandoning the XSD schema.

I see that if any random RDF is suddenly valid EML, we've pretty much lost any advantage in having a well-defined schema and we're back in a mess where you don't know where to find the title of the dataset (is it dc:title or schema:name?), let alone anything more complex, and I'm not advocating for that at all. I currently think of EML as functionally equivalent to JSON-LD modulo some syntax: defining nested objects in a predictable structure in a well-defined context (i.e. the EML namespace). (Since JSON-LD has a 1:1 map to RDF, this is semantic, but it also has a pretty obvious 1:1 map to XML and maintains the notion of schema validity.) I think of any semantic annotations on top of EML as necessarily being outside of the EML @context (in the JSON-LD sense), i.e. a tool should always be able to ignore these and still get a meaningful picture, but a particular family of tools could also be defined to work on an extended @context, e.g. EML + oboe extensions. I think this allows meaningful semantic extensions that:

  • (a) can see the EML in which they are embedded semantically,
  • (b) obey all the standard syntax and rules of semantic markup anywhere else, with no arbitrary gotchas such as default types or URI-only values,
  • (c) are clearly scoped and defined in an 'extended' context, and
  • (d) can easily be stripped off, so that what remains can be validated against the EML schema.

I think such an approach makes it more obvious to developers and consumers how to interact with a semantic layer in EML, or more generally, how to interact in EML semantically, without making arbitrary semantic extensions into first-class citizens that any parser must suddenly be able to deal with.
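
To make the "strip off whatever is outside the base context" idea concrete, here is a minimal sketch in Python. It assumes a hypothetical dict representation in which plain EML keys are unprefixed and extension keys carry a vocabulary prefix such as `schema:`. That convention, the `strip_extensions` name, and the sample values are illustrative assumptions, not how emld or any EML tool actually represents documents.

```python
def strip_extensions(node, base_prefix="eml"):
    """Recursively drop keys whose prefix is not the base (EML) prefix.

    Assumption: extension terms are written as "prefix:term"; unprefixed
    keys are treated as belonging to the base EML context.
    """
    if isinstance(node, dict):
        return {
            key: strip_extensions(value, base_prefix)
            for key, value in node.items()
            if ":" not in key or key.split(":", 1)[0] == base_prefix
        }
    if isinstance(node, list):
        return [strip_extensions(item, base_prefix) for item in node]
    return node


# Hypothetical document: a creator extended with a non-EML property.
extended = {
    "dataset": {
        "title": "Example dataset",
        "creator": {
            "individualName": {"surName": "Example"},
            "schema:birthDate": "1900-01-01",  # placeholder extension value
        },
    }
}

base_only = strip_extensions(extended)
```

After stripping, `base_only` holds only unprefixed keys, which is the part one would hand to a schema validator.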

Anyway, treating all of EML as JSON-LD or RDF is pretty far from the topic here, so like I've said before, forge ahead with the practical, but I do think it provides a nice illustration of something that is both extensible and flexible but doesn't lose any of the power we gained in the first place from a rigidly-defined schema -- after all you can always transform into the XSD-valid schema representation. I've mentioned this before and pestered @amoeba with a proof-of-principle to transform between RDF, JSON, and EML-valid XML: https://github.com/cboettig/emld. (Really this is just the back-end for a re-write of the EML package that uses lists instead of S4, which provides what I hope will be a lot more intuitive interface for most R users or developers. https://github.com/cboettig/eml2)

okay, probably lost everyone now, so better turn in for the night. 🌔 🛌

@mpsaloha mpsaloha commented Mar 9, 2018

Hi, I'm not an XML maven like most of you here, so I don't have an opinion or insight on the specifics of serialization solutions from the XML guts of EML. However, Matt asked me to look over the latest comments on semanticizing EML and here are my thoughts, fwiw:

The proposed "valueURI" should probably be renamed "objectURI", as it appears consistently to be used as the object of an RDF triple, and as such it can contain either literals (i.e. values) or URIs, unlike subjects and predicates, which can only be represented by URIs.

Carl's concern that "foaf:name" doesn't seem like a "value" (triple object) is reasonable, since that foaf term is indeed a property (predicate). Looking at csjx's suggestion about using the EML 'id' (comment from 13 days ago), I think a desired triplification {s,p,o} would be:

<URI for EML element ID><foaf:name><"Mecum">
or
<https://cn.dataone.org/cn/v2/resolve/4cdb6dd6-66c2-478a-af66-9969a3142813#54321>
<foaf:familyName> <"Mecum">

(where the challenge may be minting that httpURI in the subject position?)

and we'd hope that rather than <foaf:familyName>="Mecum", things will eventually devolve to

<https://cn.dataone.org/cn/v2/resolve/4cdb6dd6-66c2-478a-af66-9969a3142813#54321>
<foaf:orcidID><http://orcid.org/0000-0002-0381-3766>

(above isn't quite correct since foaf doesn't offer an appropriate property "foaf:orcidID", although it has "foaf:skypeID" and "foaf:icqChatID" etc.!)

For many of our Use Cases, however, I think we frequently "simply" (ha!) want to mint an http URI that points to the specific element in EML for which we want to add semantics. So, going back to Matt's early comments from Jan. 4

<"variable1"> <http://example.com/vocab1/hasStorageType> <http://example.com/vocab1/float>

what most excites me is being able to frequently assert:
<"variable1"> <rdf:type> <http:MeasurementTypeXXX>

(...although note again that we'd need to mint an httpURI in the subject position.)
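
For anyone who wants to see those triples spelled out, here is a small Python sketch that serializes them as N-Triples-style lines, with the object written either as a quoted literal or as a URI. The `term` and `ntriple` helpers and the `example.org` URIs are hypothetical stand-ins for illustration. The DataONE subject URI and foaf:familyName come from the examples above.

```python
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
FOAF_FAMILY_NAME = "http://xmlns.com/foaf/0.1/familyName"


def term(value, is_literal=False):
    """Format one RDF term: a quoted literal or an angle-bracketed URI."""
    if is_literal:
        escaped = value.replace("\\", "\\\\").replace('"', '\\"')
        return f'"{escaped}"'
    return f"<{value}>"


def ntriple(subject, predicate, obj, literal_object=False):
    """Subjects and predicates are always URIs here; only the object may be a literal."""
    return f"{term(subject)} {term(predicate)} {term(obj, literal_object)} ."


# Subject minted from a document URI plus the EML element id as a fragment.
person = "https://cn.dataone.org/cn/v2/resolve/4cdb6dd6-66c2-478a-af66-9969a3142813#54321"
name_triple = ntriple(person, FOAF_FAMILY_NAME, "Mecum", literal_object=True)

# Hypothetical URIs standing in for "variable1" and MeasurementTypeXXX.
type_triple = ntriple(
    "https://example.org/dataset#variable1",
    RDF_TYPE,
    "https://example.org/vocab/MeasurementTypeXXX",
)
```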

Also, this is why in discussion with Matt and Margaret, I had suggested that, at least for attribute metadata, a default propertyURI could be "rdf:type", simply asserting class membership of some URI-specified instance ("variable1" in this case), as a member of Class "measurementTypeXXX".

That "measurementTypeXXX" would be defined in our ECSO ontology and accessible with its PIRI GUID (PURL, that is) --with, e.g. an rdfs:label or skos:prefLabel of "air temperature"-- and appropriate axioms about what characteristic ("temperature"), what entity ("air"), and potentially dimensions ("degrees Celsius"), etc., describe that MeasurementTypeXXX. But all that additional information would/could be garnered from dereferencing the ECSO URI in the object position of the triple.

Well, hope this isn't too muddled or trivial...it's kind of turtle-ish.

mbjones added a commit that referenced this issue Jun 29, 2018

@mbjones mbjones commented Jun 29, 2018

After several offline conversations, we have reached consensus on implementing annotations using just property and value URIs, which in turn can be located in 5 locations in the EML document:

  • in attribute, entity, and dataset (or other resource) elements
  • in an /eml/annotations root element
  • in /eml/additionalMetadata

We've also agreed to embed the label in the element for readability. So a typical annotation would look like:

<annotation>
    <propertyURI label="uses unit">http://ecoinformatics.org/oboe/oboe.1.2/oboe-core.owl#usesStandard</propertyURI>
    <valueURI label="Kilogram">http://ecoinformatics.org/oboe/oboe.1.2/oboe-standards.owl#Kilogram</valueURI>
</annotation>

In that case, the annotation is embedded in a containing EML attribute element, and so the annotation's subject is that attribute. Constructing a URI for the subject can be done by appending the element identifier onto the document URI with a fragment identifier.
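
As a rough illustration of that subject construction, the following Python sketch reads an annotation embedded in an attribute and emits (subject, property, value) triples. The attribute id `att.4`, the attribute name, the document URI, and the `embedded_triples` helper are assumptions made up for the example. Only the annotation itself is taken from above.

```python
import xml.etree.ElementTree as ET

# Hypothetical attribute wrapping the annotation shown above.
ATTRIBUTE_XML = """
<attribute id="att.4">
    <attributeName>biomass</attributeName>
    <annotation>
        <propertyURI label="uses unit">http://ecoinformatics.org/oboe/oboe.1.2/oboe-core.owl#usesStandard</propertyURI>
        <valueURI label="Kilogram">http://ecoinformatics.org/oboe/oboe.1.2/oboe-standards.owl#Kilogram</valueURI>
    </annotation>
</attribute>
"""

# Assumed document URI; a real one would come from the repository.
DOCUMENT_URI = "https://example.org/datasets/example-dataset"


def embedded_triples(element, document_uri):
    """The containing element is the implied subject: document URI + '#' + id."""
    subject = f"{document_uri}#{element.get('id')}"
    triples = []
    for annotation in element.findall("annotation"):  # direct children only
        prop = annotation.findtext("propertyURI").strip()
        value = annotation.findtext("valueURI").strip()
        triples.append((subject, prop, value))
    return triples


triples = embedded_triples(ET.fromstring(ATTRIBUTE_XML), DOCUMENT_URI)
```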

For annotations in /eml/annotations, the subject of the annotation is established using a references attribute that points at the id of the subject of the annotation. In working through the implementation of the 'annotations' element at the top level EML module, I decided it's cleaner to treat references as an attribute, so that the annotations list ends up like this:

<annotations>
    <annotation references="CDR-biodiv-table">
        <propertyURI label="Subject">http://purl.org/dc/elements/1.1/subject</propertyURI>
        <valueURI label="grassland biome">http://purl.obolibrary.org/obo/ENVO_01000177</valueURI>
    </annotation>
    <annotation references="adam.shepherd">
        <propertyURI label="is a">http://www.w3.org/1999/02/22-rdf-syntax-ns#type</propertyURI>
        <valueURI label="Person">https://schema.org/Person</valueURI>
    </annotation>
    <annotation references="adam.shepherd">
        <propertyURI label="member of">https://schema.org/memberOf</propertyURI>
        <valueURI label="BCO-DMO">https://doi.org/10.17616/R37P4C</valueURI>
    </annotation>
</annotations>
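
A sketch of how a consumer might resolve those references attributes, again in Python: it collects every id in the document, then turns each top-level annotation into a triple, setting aside any annotation whose references value matches no id. The simplified, namespace-free document, the `no.such.id` entry, the document URI, and the `top_level_triples` helper are all hypothetical.

```python
import xml.etree.ElementTree as ET

# Simplified, namespace-free stand-in for an EML document (assumption).
EML_XML = """
<eml>
    <dataset>
        <creator id="adam.shepherd"><individualName><surName>Shepherd</surName></individualName></creator>
        <dataTable id="CDR-biodiv-table"><entityName>biodiv</entityName></dataTable>
    </dataset>
    <annotations>
        <annotation references="CDR-biodiv-table">
            <propertyURI label="Subject">http://purl.org/dc/elements/1.1/subject</propertyURI>
            <valueURI label="grassland biome">http://purl.obolibrary.org/obo/ENVO_01000177</valueURI>
        </annotation>
        <annotation references="adam.shepherd">
            <propertyURI label="is a">http://www.w3.org/1999/02/22-rdf-syntax-ns#type</propertyURI>
            <valueURI label="Person">https://schema.org/Person</valueURI>
        </annotation>
        <annotation references="no.such.id">
            <propertyURI label="is a">http://www.w3.org/1999/02/22-rdf-syntax-ns#type</propertyURI>
            <valueURI label="Person">https://schema.org/Person</valueURI>
        </annotation>
    </annotations>
</eml>
"""


def top_level_triples(root, document_uri):
    """Resolve each annotation's references attribute against the ids in the document."""
    known_ids = {el.get("id") for el in root.iter() if el.get("id")}
    triples, dangling = [], []
    for annotation in root.findall("annotations/annotation"):
        ref = annotation.get("references")
        if ref not in known_ids:
            dangling.append(ref)  # no element carries this id
            continue
        prop = annotation.findtext("propertyURI").strip()
        value = annotation.findtext("valueURI").strip()
        triples.append((f"{document_uri}#{ref}", prop, value))
    return triples, dangling


resolved, unresolved = top_level_triples(
    ET.fromstring(EML_XML), "https://example.org/datasets/example-dataset"
)
```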

For annotations in /eml/additionalMetadata, the subject is the element that has the id listed within the associated describes element:

<additionalMetadata>
    <describes>adam.shepherd</describes>
    <metadata>
        <annotation>
            <propertyURI label="member of">https://schema.org/memberOf</propertyURI>
            <valueURI label="BCO-DMO">https://doi.org/10.17616/R37P4C</valueURI>
        </annotation>
    </metadata>
</additionalMetadata>
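
The additionalMetadata case can be sketched the same way. Here the subject ids come from the describes elements, and the sketch assumes that when several describes elements are present, each annotation applies to every listed id. That reading, the document URI, and the `described_triples` helper are assumptions for illustration.

```python
import xml.etree.ElementTree as ET

ADDITIONAL_METADATA_XML = """
<additionalMetadata>
    <describes>adam.shepherd</describes>
    <metadata>
        <annotation>
            <propertyURI label="member of">https://schema.org/memberOf</propertyURI>
            <valueURI label="BCO-DMO">https://doi.org/10.17616/R37P4C</valueURI>
        </annotation>
    </metadata>
</additionalMetadata>
"""


def described_triples(block, document_uri):
    """Pair every annotation under metadata with every id listed in describes."""
    subject_ids = [d.text.strip() for d in block.findall("describes") if d.text]
    triples = []
    for annotation in block.findall("metadata/annotation"):
        prop = annotation.findtext("propertyURI").strip()
        value = annotation.findtext("valueURI").strip()
        triples.extend((f"{document_uri}#{sid}", prop, value) for sid in subject_ids)
    return triples


member_triples = described_triples(
    ET.fromstring(ADDITIONAL_METADATA_XML),
    "https://example.org/datasets/example-dataset",  # assumed document URI
)
```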

That should wrap up the implementation of the annotation field. Merge commit is SHA fbafee0.

@mbjones mbjones closed this Jun 29, 2018
@mbjones mbjones removed the needs-review label Jun 29, 2018

@mbjones mbjones commented Jul 25, 2018

After discussion, we agreed to add language to conditionally require the use of the id on elements that contain annotation elements with an implied subject. The id would then be used to construct a subject URI based on the document's base URI plus a fragment identifier, such as https://dataone.org/datasets/{dataset-identifier}#element-id
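
The conditional requirement could be checked along these lines. This Python sketch is not the EML Parser implementation; it only illustrates the rule under stated assumptions: any element that directly contains an annotation needs an id, except the annotations and metadata containers, where the subject comes from references or describes instead. The sample document and the `elements_missing_id` helper are hypothetical.

```python
import xml.etree.ElementTree as ET

# Containers whose annotations get their subject from references/describes.
EXPLICIT_SUBJECT_CONTAINERS = {"annotations", "metadata"}

SAMPLE_XML = """
<eml>
    <dataset id="ds.1">
        <annotation/>
        <dataTable>
            <attributeList>
                <attribute><annotation/></attribute>
                <attribute id="att.2"><annotation/></attribute>
            </attributeList>
        </dataTable>
    </dataset>
    <annotations>
        <annotation references="ds.1"/>
    </annotations>
</eml>
"""


def elements_missing_id(root):
    """Return tags of elements that hold an annotation but have no id for the subject URI."""
    missing = []
    for parent in root.iter():
        if parent.tag in EXPLICIT_SUBJECT_CONTAINERS:
            continue
        has_annotation = parent.find("annotation") is not None  # direct child only
        if has_annotation and not parent.get("id"):
            missing.append(parent.tag)
    return missing


problems = elements_missing_id(ET.fromstring(SAMPLE_XML))
```

In the sample, only the first attribute is flagged, since the dataset and the second attribute both carry an id.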

We decided that making id mandatory everywhere would be backwards incompatible and therefore undesirable, despite the benefits of having unique ids to reference document elements.

This requires an addition to EML Parser.

Reopening until I can update this documentation and EMLParser.

@mbjones mbjones reopened this Jul 25, 2018
@mbjones mbjones added the in progress label Jul 25, 2018

@mpsaloha mpsaloha commented Jul 25, 2018

@mobb mobb commented Jul 26, 2018

For the two label fields, my opinion is that each {should} be populated by an rdfs:label or skos:prefLabel if one exists.

Mainly, because to say {must} would mean that we ought to be able to confirm the label is correct, which is not practical. Communities may want to do their own checking however, which would be tied to specific vocabularies.

@mpsaloha mpsaloha commented Oct 2, 2018

At the LTER ASM breakout discussion on vocabularies, there was great interest in how to use/substitute formally defined terms (i.e. terms specified by a dereferenceable GUID from an (approved) thesaurus or ontology) as EML KEYWORDS. Some discussion ensued that semantic annotations at the level of dataset and entity essentially constitute EML KEYWORDS describing the object at that level. We, and potentially the LTER Community, need to agree on best practices in this process. Clearly, having well-conceived EML KEYWORDS will be a major boon (and possibly opens up some interesting uses for Object Properties).
