Semantic Annotation Primer

Introduction

A semantic annotation is the attachment of semantic metadata to a resource - in this case, a dataset. It provides precise definitions of concepts and clarifies the relationships between concepts in a machine-readable way. The process of creating semantic annotations may seem tedious, but the payoff is enhanced discovery and reuse of your data.

The main differences between semantic annotation and simply adding keywords are:

annotations can be read and interpreted by computers
annotations describe the relationship between a specific part of the metadata and an external vocabulary

Benefits of annotation: Annotations vastly enhance data discovery and interpretation. Semantic annotations will make it easier for others to find and reuse data (and thus give proper credit), including the following cases:

Equivalent concepts: Assume one dataset uses the phrase "carbon dioxide flux" and another dataset "CO2 flux". An information system is able to recognize that these datasets are about equivalent concepts, if the datasets were annotated with the same identifier for that measurement.
Disambiguation: Assume you are searching for datasets about "litter" (as in "plant litter"). If datasets have been annotated, the system will be able to understand the difference between your meaning and other meanings (e.g., "garbage", a "group of animals born together", a "device for transporting the wounded", etc.). Each type of "litter" would be associated with a different identifier, and connected to related concepts.
Hierarchical searches: If you search for datasets about "carbon flux", then datasets about "carbon dioxide flux" can also be returned because "carbon dioxide flux" is a type of "carbon flux". This is possible because the concepts came from a structured system where "carbon dioxide flux" is lower in the hierarchy than "carbon flux".
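The hierarchical case can be illustrated with a toy sketch. The identifiers prefixed with "ex:", the BROADER table, and the dataset names are all hypothetical stand-ins; a real system would use HTTP URIs and read the hierarchy from a vocabulary.

```python
# Hypothetical concept identifiers; real annotations would use HTTP URIs.
# Each concept maps to its parent (broader) concept in the hierarchy.
BROADER = {
    "ex:carbon_dioxide_flux": "ex:carbon_flux",
    "ex:carbon_flux": "ex:flux",
}

# Hypothetical datasets, each annotated with one concept identifier.
DATASETS = {
    "dataset-A": "ex:carbon_dioxide_flux",
    "dataset-B": "ex:carbon_flux",
    "dataset-C": "ex:plant_litter",
}


def is_a(concept, target):
    """Return True if concept equals target or sits below it in the hierarchy."""
    while concept is not None:
        if concept == target:
            return True
        concept = BROADER.get(concept)
    return False


def search(target):
    """Return the names of datasets annotated with target or a narrower concept."""
    return sorted(name for name, concept in DATASETS.items() if is_a(concept, target))


print(search("ex:carbon_flux"))
```

A search for the broader concept returns both flux datasets, while a search for "ex:carbon_dioxide_flux" returns only dataset-A.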

EML 2.2.0 now provides ways to embed references to external vocabularies using HTTP uniform resource identifiers (or URIs). The process is called semantic annotation, and provides a rigorous, expressive and consistent interpretation of the metadata. Usually the external reference (or annotation) is to a knowledge graph, sometimes called a controlled vocabulary or ontology. The annotation provides a computer-usable pointer (the HTTP URI) that resolves (and dereferences) to a useful description, definition or other relationships for that annotated resource.

Take-home messages

Semantic statements must be logically consistent, as they are not simply a set of loosely structured keywords.
EML 2.2.0 has five places or methods to add annotations.
The best place for advice and feedback on EML annotations is your data management community.

Organization of this document

The purpose of this Primer is to introduce how semantic annotations are structured in EML documents. It assumes you already have some familiarity with the EML schema, and focuses on explanation and examples of annotations in EML. This Primer is divided into three major sections. You should be able to create EML annotations immediately, using only the main section Semantic Annotations in EML 2.2.0, and referring to the Appendix when you would like a longer explanation.

Introduction: (this section)
Semantic Annotations in EML 2.2.0, with examples. Where used, EML elements are shown as inline code blocks (elementName).
Appendix: additional information on specific related topics, linked from the Introduction and Semantic Annotations in EML 2.2.0 section.
- Glossary: Glossary of terms, linked from text
- Semantic triples: details on their structure, and how that structure is leveraged by annotations with examples of their power
- URIs: defined, and as components of semantic triples
- RDF model: the W3C's RDF model with example graphs based on EML annotations
- Logical consistency: Common mistakes and how to check for them
- Vocabularies and repositories used in examples: Descriptions and links out to explore further
- Supplemental background information: The EML annotation approach here is compatible with recommendations by the World Wide Web Consortium (W3C) for construction of the Semantic Web. A wealth of material is available; a few selected ones are here.
- Frequently asked questions: Some questions asked by readers, and their answers

Semantic Annotations in EML 2.2.0

In EML 2.2.0 there are 5 places where annotation elements can appear in an EML document:

top-level resource -- an annotation element is a child of the dataset, literature, software, or protocol elements
entity-level -- an annotation element is a child of a dataset's entity (e.g., dataTable )
attribute -- an annotation element is a child of a dataset entity's attribute element
eml/annotations -- a container for a group of annotation elements, using references
eml/additionalMetadata -- annotation elements that reference a main-body element by its id attribute

Annotation element structure

All annotation nodes are defined as an XML type, so they have the same structure anywhere they appear in the EML record. Here is the basic structure. Sections below have more examples.

    <annotation>
        <propertyURI label="property label here">property URI here</propertyURI>
        <valueURI label="value label here">value URI here</valueURI>
    </annotation>

An annotation element always has a parent EML element, which is the 'thing' being annotated, or the subject (e.g., dataset, attribute; see above). The annotation element has two required child elements, propertyURI and valueURI. Together, the subject, propertyURI, and valueURI form a "semantic statement" that can become a "semantic triple". The concept of a triple is covered in more detail below (see Semantic triples in the Appendix). Here, we concentrate on the structure of an annotation within the EML document itself:

propertyURI and valueURI elements
- the element's text is the URI for the concept in an external vocabulary. The identifier represents a precise definition, relationships to other concepts, etc.
- the XML attribute, label is required
  - it should be suitable for application interfaces to display to humans
  - should be populated by values from the referenced vocabulary's label field (e.g., rdfs:label or skos:prefLabel). Note that this assumes the referenced vocabulary is stored as an RDF document, which is best practice for vocabularies.
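
As a minimal sketch of this structure, the following Python builds an annotation element with the standard library's xml.etree.ElementTree. The helper name make_annotation is hypothetical and not part of any EML tooling; the URIs and labels are the ones used in Example 1.

```python
import xml.etree.ElementTree as ET


def make_annotation(property_uri, property_label, value_uri, value_label):
    """Build an <annotation> element with its two required children.

    Hypothetical helper for illustration; not part of any EML library.
    """
    annotation = ET.Element("annotation")
    # The URI is the element text; the human-readable label is an XML attribute.
    prop = ET.SubElement(annotation, "propertyURI", {"label": property_label})
    prop.text = property_uri
    value = ET.SubElement(annotation, "valueURI", {"label": value_label})
    value.text = value_uri
    return annotation


annotation = make_annotation(
    "http://purl.obolibrary.org/obo/IAO_0000136", "is about",
    "http://purl.obolibrary.org/obo/ENVO_01000177", "grassland biome",
)
xml_text = ET.tostring(annotation, encoding="unicode")
print(xml_text)
```

The resulting element still has to be attached to a parent element (the subject) before it expresses a complete statement.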

When are IDs required in the EML doc? To be precise, all annotations must have an unambiguous subject. At the dataset, entity, or attribute level, the parent element is the subject. So, if an element has an annotation child, it must also have an id (i.e. the subject, or parent element must have an id attribute value). Annotations at eml/annotations or eml/additionalMetadata will have subjects defined with a references attribute or describes element. As for other internal EML references, an id is required. With EML 2.2.0, the parser will check that an id attribute is present on elements with annotation children. As a reminder, the id must be unique within an EML document. See examples below.
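
The id rule can be checked before a document is submitted. The sketch below is an assumption about how such a check might look, not the EML parser's own implementation. It assumes child elements carry no XML namespace, and the fragment is deliberately incomplete: its dataTable has an annotation but no id.

```python
import xml.etree.ElementTree as ET

# Containers whose annotations name their subject another way
# (a references attribute or a describes element), so no id is expected.
CONTAINERS = {"annotations", "metadata"}

# Illustrative fragment only: the dataTable is missing its id on purpose.
EML_FRAGMENT = """
<dataset id="dataset-01">
    <annotation>
        <propertyURI label="is about">http://purl.obolibrary.org/obo/IAO_0000136</propertyURI>
        <valueURI label="grassland biome">http://purl.obolibrary.org/obo/ENVO_01000177</valueURI>
    </annotation>
    <dataTable>
        <entityName>example.csv</entityName>
        <annotation>
            <propertyURI label="is about">http://purl.obolibrary.org/obo/IAO_0000136</propertyURI>
            <valueURI label="Mammalia">http://purl.obolibrary.org/obo/NCBITaxon_40674</valueURI>
        </annotation>
    </dataTable>
</dataset>
"""


def parents_missing_id(root):
    """Return the tags of elements that have an annotation child but no id."""
    missing = []
    for parent in root.iter():
        if parent.tag in CONTAINERS:
            continue
        # find() looks at direct children only, which is what we want here.
        has_annotation = parent.find("annotation") is not None
        if has_annotation and not parent.get("id"):
            missing.append(parent.tag)
    return missing


missing = parents_missing_id(ET.fromstring(EML_FRAGMENT))
print(missing)
```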

Top-level resource, entity-level, and attribute annotations

Annotations for top-level resources, entities, and attributes follow the same general pattern.

The subject of the semantic statement is the parent element of the annotation. It must have an id attribute.

Example 1: Top-level resource annotation (dataset)

In the following dataset annotation, the semantic statement can be read as "the dataset with the id 'dataset-01' is about grassland biome(s)".

the subject of the semantic statement is the dataset element containing the id attribute value "dataset-01"
the annotation itself has 2 parts:
- propertyURI is 'http://purl.obolibrary.org/obo/IAO_0000136', and explicates the relationship, using a term from the Information Artifact Ontology (IAO).
- valueURI is 'http://purl.obolibrary.org/obo/ENVO_01000177', which resolves to the "grassland biome" term in the Environment Ontology (EnvO).

<dataset id="dataset-01">
    <title>Data from Cedar Creek LTER on productivity and species richness for use in a workshop titled
    "An Analysis of the Relationship between Productivity and Diversity using Experimental Results from
    the Long-Term Ecological Research Network" held at NCEAS in September 1996.</title>
    <creator id="clarence.lehman">
        <individualName>
            <salutation>Mr.</salutation>
            <givenName>Clarence</givenName>
            <surName>Lehman</surName>
        </individualName>
    </creator>
    ...    
    <coverage> 
        ...
    </coverage>    
    <annotation>
        <propertyURI label="is about">http://purl.obolibrary.org/obo/IAO_0000136</propertyURI>
        <valueURI label="grassland biome">http://purl.obolibrary.org/obo/ENVO_01000177</valueURI>
    </annotation>
      ...    
</dataset>

Example 2: Entity-level annotation

In the following entity-level annotation, the semantic statement can be read as "the entity with the id 'urn:uuid:9f0eb128-aca8-4053-9dda-8e7b2c43a81b' is about Mammalia".

The subject of the semantic statement is the otherEntity with id attribute value, "urn:uuid:9f0eb128-aca8-4053-9dda-8e7b2c43a81b".
The annotation itself has 2 parts
- propertyURI is "http://purl.obolibrary.org/obo/IAO_0000136", which resolves to "is about", from IAO
- valueURI is "http://purl.obolibrary.org/obo/NCBITaxon_40674", which resolves to "Mammalia" in the NCBI Taxon ontology.

<otherEntity id="urn:uuid:9f0eb128-aca8-4053-9dda-8e7b2c43a81b" scope="document">
    <entityName>DBO_MMWatch_SWL2016_MooreGrebmeierVagle.xlsx</entityName>
    <entityDescription>Data contained in the file DBO_MMWatch_SWL2016_MooreGrebmeierVagle.xlsx are marine mammal observations and observation conditions from CCGS Sir Wilfrid Laurier July 10-20, 2016.  Data observations and locations are part of the Distributed Biological Observatory (DBO).</entityDescription>
    <physical scope="document">
        <objectName>DBO_MMWatch_SWL2016_MooreGrebmeierVagle.xlsx</objectName>
        <size unit="bytes">24635</size>
    </physical>
    <entityType>Other</entityType>
    <annotation>
        <propertyURI label="is about">http://purl.obolibrary.org/obo/IAO_0000136</propertyURI>
        <valueURI label="Mammalia">http://purl.obolibrary.org/obo/NCBITaxon_40674</valueURI>  
    </annotation>
</otherEntity>

Example 3: Attribute annotation

In the following attribute annotation, the semantic statement can be read as "the attribute with the id 'att.4' contains measurements of type plant cover percentage"

The subject of the semantic statement is the attribute element with the id value "att.4".
The annotation itself has 2 parts
- propertyURI is "http://ecoinformatics.org/oboe/oboe.1.2/oboe-core.owl#containsMeasurementsOfType", from the Extensible Ontology for Observations (OBOE)
- valueURI is "http://purl.dataone.org/odo/ECSO_00001197", which resolves to "Plant Cover Percentage" in the Ecosystem Ontology (ECSO)

<attribute id="att.4">
    <attributeName>pctcov</attributeName>
    <attributeLabel>percent cover</attributeLabel>
    <attributeDefinition>The percent ground cover on the field</attributeDefinition>
    <annotation>
        <propertyURI label="contains measurements of type">http://ecoinformatics.org/oboe/oboe.1.2/oboe-core.owl#containsMeasurementsOfType</propertyURI>
        <valueURI label="Plant Cover Percentage">http://purl.dataone.org/odo/ECSO_00001197</valueURI>
    </annotation>
</attribute>

Example 3 as an RDF graph

`eml/annotations` element annotation

An annotation in the annotations element differs from Examples 1-3 above, because the subject is directly referred to by a references attribute. Each annotation element has a references attribute that points to the id attribute of the element being annotated. Stated another way, what is listed in the references attribute is the id of the subject of the semantic annotation. Any of the EML modules may be referenced by the references attribute and because ids are unique within an EML document, this is a single subject.

The subject of the semantic statement is implicitly the element containing the referenced id.

Example 4: `annotations` element annotation

All the annotations for a resource can be grouped together under an annotations element. If you use this construct, each annotation must have its subject specifically identified with a references attribute that points to the subject's id. The annotations element is a child of the root eml element; in the example below it follows the dataset element.

Example 4 contains 3 different annotations.

In the first, the subject is the dataTable element with the id attribute "CDR-biodiv-table". Its annotation components are analogous to Example 1 above, again referencing terms in IAO and ENVO. The semantic statement can be read as

"the dataTable with the id 'CDR-biodiv-table' is about grassland biome(s)".

The second and third annotations both have an individual as their subjects -- the creator element that has the id "adam.shepherd".

Respectively, their semantic statements can be read as

"'adam.shepherd', the creator (of the dataset), is a person".
"'adam.shepherd', the creator (of the dataset), is a member of BCO-DMO".

The ontologies used for adam.shepherd are

in the second annotation
- propertyURI : an RDF built-in type, "is a" (as in, the subject is an instance of a class)
- valueURI : schema.org's concept of a "person"
in the third annotation
- propertyURI : another schema.org concept for a relationship, "is a member of"
- valueURI : the DOI, which is managed by re3data.org, for the organization BCO-DMO.

<eml>
   ...
    <dataset id="dataset-01">
        <title>Data from Cedar Creek LTER on productivity and species richness for use in a workshop titled "An Analysis of the Relationship between Productivity and Diversity using Experimental Results from the Long-Term Ecological Research Network" held at NCEAS in September 1996.</title>
        <creator id="adam.shepherd">
            <individualName>
                <salutation>Mr.</salutation>
                <givenName>Adam</givenName>
                <surName>Shepherd</surName>
            </individualName>
        </creator>
        <dataTable id="CDR-biodiv-table">
            <entityName>CDR LTER-patterns among communities.txt</entityName>  
        ...
       </dataTable>  
    </dataset>
    ...
    <annotations>
        <annotation references="CDR-biodiv-table">
            <propertyURI label="is about">http://purl.obolibrary.org/obo/IAO_0000136</propertyURI>
            <valueURI label="grassland biome">http://purl.obolibrary.org/obo/ENVO_01000177</valueURI>
        </annotation>
        <annotation references="adam.shepherd">
            <propertyURI label="is a">http://www.w3.org/1999/02/22-rdf-syntax-ns#type</propertyURI>
            <valueURI label="Person">https://schema.org/Person</valueURI>
        </annotation>
        <annotation references="adam.shepherd">
            <propertyURI label="member of">https://schema.org/memberOf</propertyURI>
            <valueURI label="BCO-DMO">https://doi.org/10.17616/R37P4C</valueURI>
        </annotation>
    </annotations>
   ...
</eml>

See Example 4 as an RDF graph
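
To show how a references attribute leads back to its subject, here is a short sketch over a trimmed copy of Example 4. The function name resolve_annotations is hypothetical, and the code assumes child elements carry no XML namespace.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of Example 4: only the ids and the annotations are kept.
EML_DOC = """
<eml>
    <dataset id="dataset-01">
        <creator id="adam.shepherd"/>
        <dataTable id="CDR-biodiv-table"/>
    </dataset>
    <annotations>
        <annotation references="CDR-biodiv-table">
            <propertyURI label="is about">http://purl.obolibrary.org/obo/IAO_0000136</propertyURI>
            <valueURI label="grassland biome">http://purl.obolibrary.org/obo/ENVO_01000177</valueURI>
        </annotation>
        <annotation references="adam.shepherd">
            <propertyURI label="is a">http://www.w3.org/1999/02/22-rdf-syntax-ns#type</propertyURI>
            <valueURI label="Person">https://schema.org/Person</valueURI>
        </annotation>
        <annotation references="adam.shepherd">
            <propertyURI label="member of">https://schema.org/memberOf</propertyURI>
            <valueURI label="BCO-DMO">https://doi.org/10.17616/R37P4C</valueURI>
        </annotation>
    </annotations>
</eml>
"""


def resolve_annotations(root):
    """Pair each grouped annotation with the element its references attribute names."""
    # ids are unique within an EML document, so a plain dict is enough.
    by_id = {el.get("id"): el for el in root.iter() if el.get("id")}
    statements = []
    for annotation in root.findall("annotations/annotation"):
        ref = annotation.get("references")
        if ref not in by_id:
            raise ValueError(f"references={ref!r} matches no id in the document")
        subject = by_id[ref]
        statements.append((
            subject.tag,
            ref,
            annotation.find("propertyURI").get("label"),
            annotation.find("valueURI").get("label"),
        ))
    return statements


statements = resolve_annotations(ET.fromstring(EML_DOC))
for tag, subject_id, predicate, obj in statements:
    print(f"{tag} '{subject_id}' -- {predicate} -- {obj}")
```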

`eml/additionalMetadata` element annotation

If an additionalMetadata section holds a semantic annotation, it must have a describes element (to hold the subject) with a metadata element containing at least one annotation element.

The subject of the semantic statement has its id contained in the describes element.
The annotation itself is within the additionalMetadata metadata section
Multiple annotation elements may be embedded in the same metadata element to assert multiple semantic statements about the same subject.
To annotate different subjects it's best to use additional additionalMetadata sections, each with a single subject

Example 5: `additionalMetadata` element annotation

Example 5 shows one of the same annotations as Example 4, but this time, it is contained in an additionalMetadata section.

The semantic statement can be read as "'adam.shepherd', the creator (of the dataset), is a member of BCO-DMO".

The subject of the semantic statement is the creator element with the id attribute "adam.shepherd".
The annotation itself has 2 parts
- propertyURI is "https://schema.org/memberOf", which resolves to "is a member of", from schema.org
- valueURI is "https://doi.org/10.17616/R37P4C", a DOI which resolves to "BCO-DMO".

<eml>
    ...
    <dataset id="dataset-01">
        <title>Data from Cedar Creek LTER on productivity and species richness for use in a workshop titled "An Analysis of the Relationship between Productivity and Diversity using Experimental Results from the Long-Term Ecological Research Network" held at NCEAS in September 1996.</title>
        <creator id="adam.shepherd">
            <individualName>
                <salutation>Mr.</salutation>
                <givenName>Adam</givenName>
                <surName>Shepherd</surName>
            </individualName>
        </creator>
        <dataTable id="CDR-biodiv-table">
            <entityName>CDR LTER-patterns among communities.txt</entityName>  
         ...
       </dataTable>  
    </dataset>
    ...
     <additionalMetadata>
         <describes>adam.shepherd</describes>
         <metadata>
             <annotation>
                 <propertyURI label="member of">https://schema.org/memberOf</propertyURI>
                 <valueURI label="BCO-DMO">https://doi.org/10.17616/R37P4C</valueURI>
             </annotation>
         </metadata>
     </additionalMetadata>
</eml>

Appendix

Semantic triples

Semantic annotations enable the creation of what are called triples, which are 3-part statements conforming to the W3C recommended RDF data model (learn more: https://www.w3.org/TR/rdf11-primer/).

A triple is composed of three parts: a subject, a predicate (object property or datatype property), and an object.

[subject] [predicate] [object]

These components are analogous to parts of a sentence: the subject and object can be thought of as nouns in the sentence and the predicate (object property or datatype property) is akin to a verb or relationship that connects the subject and object. The semantic triple expresses a statement about the associated resource, that is generally the subject.

There are (perhaps unfortunately) several other ways that the components of an RDF statement are sometimes described. One popular "synonymy" for subject-predicate-object is resource-property-value, i.e. the subject is referred to as the resource, the predicate a property, and the object a value. This can be confusing, since the usual definition of a resource is any identifiable 'thing' or object, especially one assigned a URI; and by this definition, resources can and often do occur in all three components of a triple. But thinking of a triple as a resource-property-value does provide an indication of the directionality of the semantics of an RDF statement. This latter terminology is also somewhat similar to how analogous components are named in JSON-LD. Note that JSON-LD is closely compatible with RDF, and one format can often be readily translated to the other (although there are some exceptions).

Semantic annotations added to an EML document can be extracted and processed into a semantic web format, such as RDF/XML. These "semantic" statements, i.e. RDF triples, are interpretable by any machines that can process the W3C standard of RDF. Those RDF statements contribute to the Semantic Web.

URIs

Ideally, the components of the semantic triple should be globally unique and persistent (unchanging), and consist of resolvable/dereferenceable HTTP uniform resource identifiers (URIs; or more formally, IRIs). The subjects of most EML semantic annotations will likely be HTTP URIs that identify the dataset resource itself, or specific attributes or other features within a dataset. The objects of EML semantic annotations, as well as the predicates that relate the subject to the object, will most typically be HTTP URI references to terms in controlled vocabularies (also called "knowledge graphs", or "ontologies") accessible through the Web, so that users (or computers) can dereference the URIs and look up precise definitions and relationships of these resources to other terms.

An example of a URI is "http://purl.obolibrary.org/obo/ENVO_00000097". When entered into the address bar of a web browser, it resolves to the term with a label of "desert area" in the Environment Ontology (EnvO). Users can learn what this URI indicates and explore how the term is related to other terms in the ontology simply by dereferencing its URI in a web browser. All the other aspects you see on the Web page describing "http://purl.obolibrary.org/obo/ENVO_00000097" come from other RDF statements (triples) related to "ENVO_00000097" that have been rendered into HTML. From here, you might decide, e.g., that "http://purl.obolibrary.org/obo/ENVO_00000172" ("sandy desert") is a better annotation for your object.

An RDF triple can be constructed as follows, with subject URI, predicate URI, and object URI:

<https://doi.org/10.6073/pasta/06db7b16fe62bcce4c43fd9ddbe43575> <http://purl.obolibrary.org/obo/RO_0001025> <http://purl.obolibrary.org/obo/ENVO_00000097> .

... indicating that the referenced dataset (subject/resource) was "located in" (predicate/property) a "desert area" (object/value). Note that whitespace must separate the subject from the predicate and the predicate from the object, and that a period completes the triple. This is a valid RDF triple, expressed in N-Triples syntax. RDF is most often serialized into XML, however, as Web browsers and many applications are good at parsing XML.
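
The formatting rules just described are simple enough to sketch in a few lines of Python. The function name to_ntriple is hypothetical, and the sketch handles only triples whose three parts are all URIs (no literals or blank nodes).

```python
def to_ntriple(subject_uri, predicate_uri, object_uri):
    """Format three URIs as a single N-Triples line.

    Illustrative only: literals and blank nodes are not handled.
    """
    for uri in (subject_uri, predicate_uri, object_uri):
        # Reject characters that cannot appear inside the angle brackets.
        if any(ch in uri for ch in '<>" \n\t'):
            raise ValueError(f"not usable as a URI in N-Triples: {uri!r}")
    # Each URI goes in angle brackets, separated by spaces, ending with a period.
    return f"<{subject_uri}> <{predicate_uri}> <{object_uri}> ."


line = to_ntriple(
    "https://doi.org/10.6073/pasta/06db7b16fe62bcce4c43fd9ddbe43575",
    "http://purl.obolibrary.org/obo/RO_0001025",
    "http://purl.obolibrary.org/obo/ENVO_00000097",
)
print(line)
```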

While our focus here is on the semantic annotation of EML documents, it is easy to see how the RDF statements can be used to describe and inter-relate any resources that have unique, persistent HTTP URIs!

Note that the above RDF triple consists of three HTTP URIs. While the exact distinction among what is a URI, a URN, and a URL can be debated, for our purposes, these HTTP URIs can be considered both the name and web location of a resource. Content negotiation between a Web server and a client (which might be a browser, or a Python or R script) can enable an HTTP URI to dereference in ways optimized for the requesting client -- e.g. in one case, presenting a human-readable view of metadata for a dataset, and in another, activating a download of that dataset for import into a script.
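
On the client side, content negotiation amounts to sending an Accept header with the request. The sketch below only prepares the requests and does not send them; the helper name build_request is hypothetical, and whether a given server returns a different representation for each media type is an assumption that depends on that server.

```python
from urllib.request import Request


def build_request(uri, media_type):
    """Prepare (but do not send) a GET request asking for one representation."""
    return Request(uri, headers={"Accept": media_type})


uri = "http://purl.obolibrary.org/obo/ENVO_00000097"

# A browser-like client asks for HTML; a script might ask for RDF/XML instead.
html_request = build_request(uri, "text/html")
rdf_request = build_request(uri, "application/rdf+xml")

print(html_request.get_header("Accept"))
print(rdf_request.get_header("Accept"))
```

Sending either request (for example with urllib.request.urlopen) is left to the caller.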

The software needed to extract semantic annotations out of EML, and convert these into valid RDF triples, is under development at NCEAS and EDI, and through the rOpenSci project. The RDF triple described above, however, hopefully gives an idea of how such triples, constructed of dereferenceable HTTP URIs, can be very useful.

RDF Graphs

A graph consists of resources linked to other resources. Thus, the simplest graph structure is when you specify how one resource (node) is linked to another resource (node).

The parts of a triple (subject, predicate, and object) become nodes and links in a graph. Below are examples of how annotations can be converted to RDF triples in RDF/XML, so that the RDF information is now computer-readable. Be aware that there are several formats for serializing RDF, including RDF/XML, Turtle, N-Triples, and N3, that vary in the level of how human-readable they are.

This process of converting a semantic annotation in EML into RDF is done by parsing applications under development at EDI, NCEAS, rOpenSci, and other data repositories. Careful examination of the examples below also shows references to "owl:Class", "owl:ObjectProperty", and other statements that may not be familiar. These are fundamental entities or building blocks in W3C-recommended Semantic Web languages, and are determined by the relationships that the triple component identifiers (HTTP URIs) have within their native knowledge graph/ontology.

Related FAQ: What is RDFS?

Graph from Example 3 (attribute annotation): (back to Example 3 XML)

<rdf:RDF
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:owl="http://www.w3.org/2002/07/owl#">
    
    <rdf:Description rdf:about="att.4"> <!-- See note below -->
        <owl:ObjectProperty rdf:about="http://ecoinformatics.org/oboe/oboe.1.2/oboe-core.owl#containsMeasurementsOfType">
            <owl:Class rdf:about="http://purl.dataone.org/odo/ECSO_00001197" />
        </owl:ObjectProperty>
    </rdf:Description>
</rdf:RDF>

Note: The subject described in the rdf:Description about attribute should actually be a globally unique HTTP URI for the attribute, rather than 'att.4'. The details of how this HTTP URI GUID is constructed are being developed by EDI, NCEAS, and others.

Graph from Example 4 (using `annotations` element): (back to Example 4 XML)

<rdf:RDF
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:owl="http://www.w3.org/2002/07/owl#">
    
    <rdf:Description rdf:about="adam.shepherd"> <!-- See note below -->
        <owl:ObjectProperty rdf:about="http://www.w3.org/1999/02/22-rdf-syntax-ns#type">
            <owl:Class rdf:about="https://schema.org/Person" />
        </owl:ObjectProperty>
        <owl:ObjectProperty rdf:about="https://schema.org/memberOf">
            <owl:Class rdf:about="https://doi.org/10.17616/R37P4C" />
        </owl:ObjectProperty>
    </rdf:Description>
    
</rdf:RDF>

Note: The subject described in the rdf:Description about attribute should actually be the globally unique URI issued for 'adam.shepherd'. The details of how this HTTP URI GUID is constructed are being developed by EDI, NCEAS, and others.

Check for Logical Consistency

With semantic annotation, you are adding precise definitions of concepts and relationships that can be traversed with computer logic. Annotations are not simply a set of loosely structured keywords! This is a really powerful addition to EML, and so it comes with some risk. The main thing you should ensure is that your annotations are logically consistent.

The simplest way to check your logic is to write out the RDF triple components and see if it makes sense as a sentence.

[subject (element-id)]  [predicate (propertyURI)]          [object (valueURI)]
[att.4]                 [contains measurements of type]    [plant cover percentage]
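
This check is easy to script. The sketch below reads the labels from the attribute in Example 3 and prints the bracketed form shown above. The function name read_as_sentences is hypothetical, and the code assumes elements carry no XML namespace.

```python
import xml.etree.ElementTree as ET

# The attribute element from Example 3.
ATTRIBUTE_XML = """
<attribute id="att.4">
    <attributeName>pctcov</attributeName>
    <attributeLabel>percent cover</attributeLabel>
    <attributeDefinition>The percent ground cover on the field</attributeDefinition>
    <annotation>
        <propertyURI label="contains measurements of type">http://ecoinformatics.org/oboe/oboe.1.2/oboe-core.owl#containsMeasurementsOfType</propertyURI>
        <valueURI label="Plant Cover Percentage">http://purl.dataone.org/odo/ECSO_00001197</valueURI>
    </annotation>
</attribute>
"""


def read_as_sentences(parent):
    """Return one '[subject] [predicate] [object]' line per annotation child."""
    lines = []
    for annotation in parent.findall("annotation"):
        predicate = annotation.find("propertyURI").get("label")
        obj = annotation.find("valueURI").get("label")
        lines.append(f"[{parent.get('id')}] [{predicate}] [{obj}]")
    return lines


sentences = read_as_sentences(ET.fromstring(ATTRIBUTE_XML))
for sentence in sentences:
    print(sentence)
```

The script only assembles the sentence; deciding whether it is true still takes a human reader.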

The graph examples (Example 3 RDF graph, Example 4 RDF graph) make 'true' statements that are logically consistent:

att.4 contains measurements of the type plant cover percentage
adam.shepherd is a person
adam.shepherd is a member of BCO-DMO

However, below is the kind of statement you would NOT want to make:

[adam.shepherd] [is a type of] [measurement]

If you suspect your RDF triple might look like this, you should go back and examine the way you structured the annotation.

Things to check:

Be sure you have used the right classes, properties, or vocabularies for your annotation components
Become familiar with the vocabularies in your annotation, especially definitions and relationships
Check with your community for specific recommendations on the best vocabularies to use for annotations at different levels. Our examples use well-constructed vocabularies.
In additionalMetadata, don't combine annotations with more than one describes element. EML allows 1:many describes elements in a single additionalMetadata section. So if you have 2 describes and 2 annotations, you will have 4 RDF statements. Make sure they are all true, and if not, break them up into multiple additionalMetadata sections.
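
The multiplication described in the last item can be made concrete. The additionalMetadata section below is a deliberately bad, hypothetical example with two describes elements and two annotations; the function name implied_statements is also hypothetical.

```python
import xml.etree.ElementTree as ET
from itertools import product

# Deliberately problematic: two subjects share two annotations.
ADDITIONAL_METADATA = """
<additionalMetadata>
    <describes>adam.shepherd</describes>
    <describes>CDR-biodiv-table</describes>
    <metadata>
        <annotation>
            <propertyURI label="is a">http://www.w3.org/1999/02/22-rdf-syntax-ns#type</propertyURI>
            <valueURI label="Person">https://schema.org/Person</valueURI>
        </annotation>
        <annotation>
            <propertyURI label="is about">http://purl.obolibrary.org/obo/IAO_0000136</propertyURI>
            <valueURI label="grassland biome">http://purl.obolibrary.org/obo/ENVO_01000177</valueURI>
        </annotation>
    </metadata>
</additionalMetadata>
"""


def implied_statements(section):
    """Pair every describes subject with every annotation in the section."""
    subjects = [d.text.strip() for d in section.findall("describes")]
    annotations = section.findall("metadata/annotation")
    return [
        (subject,
         annotation.find("propertyURI").get("label"),
         annotation.find("valueURI").get("label"))
        for subject, annotation in product(subjects, annotations)
    ]


implied = implied_statements(ET.fromstring(ADDITIONAL_METADATA))
for subject, predicate, obj in implied:
    print(f"[{subject}] [{predicate}] [{obj}]")
```

Four statements are printed, including "[CDR-biodiv-table] [is a] [Person]", which is exactly the kind of false statement that splitting the section into one additionalMetadata per subject avoids.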

Glossary

dereference: To interpret a URI and retrieve information about a resource stored in another location

knowledge graph: Any knowledge base that is represented as a mathematical graph. A graph is a structure for a set of objects, where some pairs of the objects are in some sense related. The objects are called nodes or vertices and are interconnected by a set of lines called edges. For a semantic triple, the subject and object may be considered nodes and the relationship between the nodes as an edge.

ontology: A knowledge graph representation of a set of terms, including their names, and descriptions of the categories, properties, and relationships among those terms.

pointer: A kind of reference to a datum stored in computer memory.

resolve: To interpret a URI and determine a course of action for dereferencing the URI.

Resource Description Framework (RDF): A family of World Wide Web Consortium (W3C) recommendations that enable the encoding, exchange, and reuse of structured metadata. The RDF data model employs semantic triples composed of a subject, predicate, and object to share and integrate data across different applications and communities through the Web.

uniform resource identifier (URI): A string of characters that unambiguously identifies a particular resource. For semantic annotations, the components of semantic triples are ideally HTTP URIs that resolve and describe precise definitions and relationships to other terms, using Web technology.

Vocabularies and repositories used in examples

Communities using EML annotations will develop recommendations for suitable vocabularies, based on their own requirements (e.g., domain coverage, structure, adaptability, reliability and maintenance model). The following ontologies are already widely used, and were employed in the examples above:

ECSO (Ecosystem Ontology) (https://github.com/DataONEorg/sem-prov-ontologies/tree/master/observation). An ontology for ecosystem measurements under development by the Arctic Data Center and DataONE.

EnvO (Environment Ontology) (http://www.obofoundry.org/ontology/envo.html) An OBO Foundry ontology for the concise, controlled description of environments.

IAO (Information Artifact Ontology) (http://www.obofoundry.org/ontology/iao.html) An OBO Foundry ontology of information entities.

NCBITaxon Ontology (http://www.obofoundry.org/ontology/ncbitaxon.html) An OBO Foundry ontology representation of the National Center for Biotechnology Information organismal taxonomy.

OBOE (Extensible Ontology for Observations) (https://github.com/NCEAS/oboe) An ontology for scientific observations and measurements developed by DataONE and NCEAS.

re3data.org (Registry of Research Data Repositories) (https://www.re3data.org) A global registry of research data repositories spanning all academic disciplines.

schema.org (https://schema.org/) An initiative to create and support a common set of schemas for structured data markup on web pages. Extensions work with the core vocabulary to provide more specialized and/or deeper vocabularies.

Additional background information

Following are tutorials and supplemental background reading

LinkedDataTools tutorial: http://www.linkeddatatools.com/introducing-rdf
RDF data model: https://www.w3.org/TR/WD-rdf-syntax-971002/
W3C RDF primer: https://www.w3.org/TR/rdf11-primer/
A tidyverse lover’s intro to RDF: https://ropensci.github.io/rdflib/articles/rdf_intro.html

Tim Berners-Lee's article on the semantic web: Berners-Lee, T., Hendler, J., & Lassila, O. (2001). The semantic web. Scientific American, 284(5), 34-43.

Frequently asked questions

Below are answers to questions some readers had, which may be helpful to you. If you have additional questions, please bring them up in your community for feedback.

Q: What does ‘dereferenced’ mean?

A: Within the context of semantic annotation, "dereferencing" refers to the process of interpreting a URI, and providing "useful information" back about the Resource of interest. The phrase "resolving a URI" is often used synonymously with "dereferencing", but technically "resolution" refers to the process of determining HOW and WHAT to do with the URI, whereas "dereferencing" is explicitly about the action taken, which is typically retrieving a representation of the Resource of interest. The formal specification for these terms and what they mean is found in the IETF's (Internet Engineering Task Force) RFC (Request for Comment) 3986 (https://tools.ietf.org/html/rfc3986).

Q: What is the difference between a URI and a URL? Sample URIs look a lot like URLs...

A: The distinctions among URIs (Uniform Resource Identifiers), URLs (Uniform Resource Locators), and URNs (Uniform Resource Names), relate to differentiating the functionalities of identifying a Resource, as opposed to locating a Resource, or doing both. URLs are all URIs (with some edge case exceptions subject to argument), and URNs are also URIs. In many cases, URIs serve both to name and locate a Resource.

Within the vision of the Semantic Web, URIs are ideally unique, persistent URNs identifying some Web Resource, that can also serve to locate and retrieve (dereference) a representation of that Resource (URLs). The formal specification of these terms is found in the IETF's RFC 3986, section 1.1.3 (https://tools.ietf.org/html/rfc3986#section-1.1.3). Another acronym one may encounter with increasing frequency is IRI (Internationalized Resource Identifier), which extends the concept of a URI to allow the full Unicode character set, rather than just ASCII, in its construction (https://tools.ietf.org/html/rfc3987).
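As a rough illustration of the naming versus locating distinction, the following Python sketch inspects the scheme of a few identifiers with the standard library's `urllib.parse`. The `describe` helper and its scheme-based rule of thumb are assumptions made for this example, not something RFC 3986 defines; the ISBN URN and the `/resource/Zürich` path are illustrative sample values.

```python
from urllib.parse import quote, urlsplit

def describe(identifier):
    """Rule of thumb (an assumption for this sketch): guess the role from the scheme."""
    scheme = urlsplit(identifier).scheme.lower()
    if scheme == "urn":
        return "name (URN)"
    if scheme in ("http", "https"):
        return "name and locator (HTTP URI)"
    return "other URI scheme" if scheme else "not an absolute URI"

print(describe("http://purl.dataone.org/odo/ECSO_00000536"))
print(describe("urn:isbn:0451450523"))

# An IRI may contain non-ASCII characters; percent-encoding the UTF-8 bytes
# of the path yields a plain-ASCII URI path.
print(quote("/resource/Zürich"))  # /resource/Z%C3%BCrich
```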

Q: What is SKOS?

A: SKOS (Simple Knowledge Organization System) is a W3C recommendation for organizing a vocabulary in thesauri, taxonomies, and other classification schemes. SKOS provides a set of concepts and properties that, when expressed in a formal RDF-compatible syntax, can assist with interpreting the relationship of terms with one another, such as defining some category as broader than another. For example, one could state in SKOS syntax that "animals" is a broader concept than "mammals". The definitive specification of SKOS can be found at https://www.w3.org/TR/2009/REC-skos-reference-20090818/. SKOS does not provide strong semantics (see the RDFS answer below), but SKOS concepts and properties can be used within more expressive knowledge organization frameworks, such as RDFS/OWL ontologies.
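The "broader than" relationship can be sketched in a few lines of Python. The concept names and the `BROADER` mapping below are hypothetical stand-ins for what `skos:broader` statements would assert; walking the chain upward corresponds to what SKOS calls `skos:broaderTransitive`, since `skos:broader` by itself is not defined as transitive.

```python
# Hypothetical concept scheme: each concept maps to its directly broader concept,
# mirroring what skos:broader statements would assert.
BROADER = {
    "mammals": "animals",
    "lizards": "reptiles",
    "reptiles": "animals",
}

def broader_chain(concept):
    """Follow broader links upward until a top concept is reached."""
    chain = []
    while concept in BROADER:
        concept = BROADER[concept]
        chain.append(concept)
    return chain

print(broader_chain("mammals"))  # ['animals']
print(broader_chain("lizards"))  # ['reptiles', 'animals']
```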

Q: What is RDFS?

A: RDFS (Resource Description Framework Schema; https://www.w3.org/TR/rdf-schema/) is a W3C recommendation that extends the formal vocabulary for describing Resources expressed in an RDF data model (i.e., in a graph). "Base" RDF (https://www.w3.org/TR/2014/REC-rdf11-concepts-20140225/) provides a set of concepts for creating a graph model of data, consisting of one or more triples, each relating a subject, predicate, and object. RDFS adds to the base RDF model by specifying a number of well-defined concepts and properties, such as rdfs:Class and rdfs:subClassOf. These and other RDFS classes and properties enable data and knowledge modellers to express many relationships between the Subject and Object of a Triple.

In the context of the Semantic Web, the RDF model relies extensively on dereferenceable URIs in the subject and predicate positions, and URIs or literals in the object position (there are small formal exceptions to this not immediately relevant here). RDF triples can be expressed in several syntaxes, including XML, JSON-LD, and Turtle, among others. RDFS then can be used to enrich the precision and expressivity of the components of a triple, as well as clarify the relationships among these.
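A small sketch may help show what rdfs:subClassOf adds. The Python below stores triples as plain tuples and answers "which things are instances of this class?" by also following subclass links, which is the idea behind hierarchical searches. The `ex:` names are hypothetical and use prefixed names instead of full URIs for brevity; a real application would normally rely on an RDF library and a reasoner instead of this hand-rolled loop.

```python
RDF_TYPE = "rdf:type"
SUBCLASS_OF = "rdfs:subClassOf"

# Hypothetical triples: (subject, predicate, object).
triples = {
    ("ex:CarbonDioxideFlux", SUBCLASS_OF, "ex:CarbonFlux"),
    ("ex:CarbonFlux", SUBCLASS_OF, "ex:Flux"),
    ("ex:column7", RDF_TYPE, "ex:CarbonDioxideFlux"),
}

def superclasses(cls, graph):
    """Return every class reachable from cls via rdfs:subClassOf."""
    found, stack = set(), [cls]
    while stack:
        current = stack.pop()
        for s, p, o in graph:
            if s == current and p == SUBCLASS_OF and o not in found:
                found.add(o)
                stack.append(o)
    return found

def instances_of(cls, graph):
    """Subjects typed as cls directly or through one of its subclasses."""
    return {
        s for s, p, o in graph
        if p == RDF_TYPE and (o == cls or cls in superclasses(o, graph))
    }

print(instances_of("ex:CarbonFlux", triples))  # {'ex:column7'}
```

Here `ex:column7` is only asserted to be a carbon dioxide flux, yet a search for carbon flux still finds it.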

Q: Are all EML dataTable attributes "measurements"?

A: Yes, in the context of a data table and for annotation purposes, any attribute (observation or column of data) can be considered ‘a measurement’. A philosopher might disagree, saying that unique identifiers are not really measurements; but many “nominals”, i.e. text strings identifying some class types (e.g. predator, lizard, tundra) imply quantification.

Q: Can you provide an example of a controlled vocabulary with an rdfs:label or a SKOS label?

A: Most Semantic Web vocabularies make extensive use of rdfs:label or SKOS label properties. For example, this URI: http://purl.dataone.org/odo/ECSO_00000536 is from the ECSO ontology, under development at NCEAS by NSF's DataONE and Arctic Data Center. Within that ontology, the URI is associated with an rdfs:label of "Carbon Dioxide Flux", and a skos:altLabel of "CO2 flux". If you dereference the URI, you will see how the BioPortal ontology repository displays this information-- providing a human-readable representation of the underlying RDF/OWL language in which the ontology is stored.
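The following Python sketch shows why having both labels attached to one URI matters for search. The label values are the ones quoted in the answer above; the `LABELS` index and the `find_concepts` helper are hypothetical and only illustrate the lookup, not how any particular repository implements it.

```python
# Labels as quoted in the answer above; the index structure is hypothetical.
LABELS = {
    "http://purl.dataone.org/odo/ECSO_00000536": {
        "rdfs:label": "Carbon Dioxide Flux",
        "skos:altLabel": "CO2 flux",
    },
}

def find_concepts(term):
    """Return URIs whose preferred or alternative label matches term, ignoring case."""
    needle = term.casefold()
    return [
        uri for uri, labels in LABELS.items()
        if needle in (value.casefold() for value in labels.values())
    ]

# Both phrasings resolve to the same identifier.
print(find_concepts("carbon dioxide flux"))
print(find_concepts("CO2 flux"))
```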

Q: An image of an RDF graph is great, but a computer doesn't parse that. What does the RDF look like?

A: As mentioned above, RDF is a data model based on triples, each of which consists of a subject, predicate, and object. In order to function interoperably on the Web, however, there is the need for these triple components to be constructed of dereferenceable URIs, although the object value can also be a literal. RDF triples can be "serialized" in several syntaxes, including XML, JSON-LD, Turtle, N-Triples, and others. These syntaxes are isomorphic, such that translations of RDF graphs from one serialization to another are available-- enabling consistent interpretation by machines.

Perhaps the most straightforward serialization of RDF graphs for human interpretation is N-Triples, where an RDF triple could look like this:

<http://purl.obolibrary.org/obo/CHEBI_16526> <http://purl.obolibrary.org/obo/RO_0000087> <http://purl.obolibrary.org/obo/CHEBI_76413> .

These are three URIs, representing the Subject, Predicate, and Object of a Triple. The "." indicates the end of the Triple. By dereferencing these URIs (e.g., in a Web browser or specialized application), one can see that this Triple represents the statement:

"Carbon dioxide"(Subject) "has role"(Predicate) "Greenhouse Gas"(Object)

Although the phrasing sounds a bit awkward, the meaning becomes clear simply by displaying the rdfs:label values of those terms from the ChEBI (Chemical Entities of Biological Interest) and RO (Relation) ontologies, both of which are robust OBO Foundry ontologies.
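For readers who want to see a machine do the same translation, here is a minimal Python sketch. Standard N-Triples syntax wraps each URI in angle brackets, so the input line is written that way. The regular expression handles only statements whose three terms are all URIs (no literals or blank nodes), and the `LABELS` table is filled in by hand from the labels given above; a real tool would use an RDF parser and obtain labels by dereferencing.

```python
import re

# Matches one N-Triples statement whose three terms are all URIs in angle brackets.
TRIPLE = re.compile(r"^\s*<([^>]*)>\s+<([^>]*)>\s+<([^>]*)>\s*\.\s*$")

# Labels copied from the text above (hand-entered for this sketch).
LABELS = {
    "http://purl.obolibrary.org/obo/CHEBI_16526": "Carbon dioxide",
    "http://purl.obolibrary.org/obo/RO_0000087": "has role",
    "http://purl.obolibrary.org/obo/CHEBI_76413": "Greenhouse Gas",
}

def parse_triple(line):
    """Split a URI-only N-Triples statement into (subject, predicate, object)."""
    match = TRIPLE.match(line)
    if match is None:
        raise ValueError(f"not a URI-only N-Triples statement: {line!r}")
    return match.groups()

line = (
    "<http://purl.obolibrary.org/obo/CHEBI_16526> "
    "<http://purl.obolibrary.org/obo/RO_0000087> "
    "<http://purl.obolibrary.org/obo/CHEBI_76413> ."
)
subject, predicate, obj = parse_triple(line)
print(LABELS[subject], "|", LABELS[predicate], "|", LABELS[obj])
```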

As another example:

<http://purl.obolibrary.org/obo/NCIT_C20461> <http://purl.org/dc/elements/1.1/creator> <https://orcid.org/0000-0003-1279-3709> .

that asserts:

"World Wide Web"(Subject) "creator"(Predicate) "Timothy Berners Lee"(Object) .

...although some semantic purists might question whether the Dublin Core property "Creator" can be used in this way as an RDF predicate, since it is not semantically defined-- would its rdfs:label be "creatorOf" or "hasCreator"? (Dublin Core does not say!). Regardless of the formal semantic well-formedness of this Triple, however, one can see the expressive power of the RDF data model, and the value of dereferenceable URIs.

A better solution would be to use the semantically defined term from SIO (the Semanticscience Integrated Ontology), http://semanticscience.org/resource/SIO_000364, as the predicate. It has the rdfs:label "has creator":

<http://purl.obolibrary.org/obo/NCIT_C20461> <http://semanticscience.org/resource/SIO_000364> <https://orcid.org/0000-0003-1279-3709> .

...that would translate as (based on content of the rdfs:label):

World Wide Web(Subject) has creator(Predicate) Tim Berners Lee(Object)

or inversely, one could use http://semanticscience.org/resource/SIO_000365 as the predicate, which has the rdfs:label "is creator of":

Tim Berners Lee(Subject) is creator of(Predicate) World Wide Web(Object)

<https://orcid.org/0000-0003-1279-3709> <http://semanticscience.org/resource/SIO_000365> <http://purl.obolibrary.org/obo/NCIT_C20461> .

Within the SIO ontology, SIO_000364 and SIO_000365 are defined as inverses of one another. This enables one to ask both questions, "who created the Web?" (A: Tim Berners Lee) and "what did Tim Berners Lee create?" (A: the Web), even though only one of the Triples was asserted.
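The effect of declaring two properties as inverses can be sketched as follows. The URIs are the ones used in the example above; the `INVERSES` table and the `with_inverses` function are hypothetical stand-ins for what an OWL reasoner would derive from the ontology's inverse declaration.

```python
HAS_CREATOR = "http://semanticscience.org/resource/SIO_000364"
IS_CREATOR_OF = "http://semanticscience.org/resource/SIO_000365"
INVERSES = {HAS_CREATOR: IS_CREATOR_OF, IS_CREATOR_OF: HAS_CREATOR}

WEB = "http://purl.obolibrary.org/obo/NCIT_C20461"
TBL = "https://orcid.org/0000-0003-1279-3709"

# Only one triple is asserted.
asserted = {(WEB, HAS_CREATOR, TBL)}

def with_inverses(graph):
    """Return the graph plus the inverse of every triple whose predicate has one."""
    inferred = set(graph)
    for s, p, o in graph:
        if p in INVERSES:
            inferred.add((o, INVERSES[p], s))
    return inferred

graph = with_inverses(asserted)

# "Who created the Web?"
print([o for s, p, o in graph if s == WEB and p == HAS_CREATOR])
# "What did Tim Berners Lee create?"
print([o for s, p, o in graph if s == TBL and p == IS_CREATOR_OF])
```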

Finally, it is worth noting that one's choice of which Ontologies to use is important. Within the Ecological and Environmental sciences, there are several highly-recommended vocabularies, including those from the OBO Foundry (e.g. ChEBI, EnvO), as well as SIO. Specifically for annotating scientific measurements, the Arctic Data Center and DataONE are developing an Ontology for Ecosystem Measurements, ECSO. We have used all these in the examples.

Q: Are there tools available to help data managers select subjects, predicates, and objects to annotate with?

A: Yes, tools are being built to assist with the semantic annotation of EML documents, within the DataONE and Arctic Data Center data repository projects, and others.
