Commit

shortened intro
mpsaloha committed Mar 20, 2019
1 parent 1fadecb commit 14c45c5
Showing 1 changed file with 18 additions and 15 deletions.
33 changes: 18 additions & 15 deletions docs/eml-semantic-annotations-primer.md
@@ -1,11 +1,13 @@
# Semantic Annotation Primer

## Introduction
The purpose of this primer is to provide an introduction to how semantic annotations are structured
in EML documents. It is expected that you have some familiarity with the EML schema prior to reading this document. It is important to note that our approach of using annotations structured in the Resource Description Framework (RDF) specification is based on recommendations from the World Wide Web Consortium (W3C) about how a Semantic Web should be constructed.
If you want to read more about the W3C's RDF data model, graphs or the semantic web, there is supplemental material at the bottom of this primer.

A semantic annotation involves the attachment ("annotation") of semantic metadata to a resource -- which in this context would be an EML element. A semantic annotation provides a pointer (HTTP universal resource identifier; URI) that should resolve (and dereference) to useful descriptions, definitions, or relationships that the annotated resource has, relative to other terms or resources, and do so in a computer-usable way. These "pointers" reference terms organized into web-accessible *knowledge graphs* (also called "ontologies"). The process of creating semantic annotations may seem tedious, but the payoff is vastly enhanced information discovery and interpretation. Semantic annotations will make it easier for others to find and reuse your data (and thus give you credit).
EML 2.2.0 now provides ways to embed HTTP URIs into several EML elements, where they serve as semantic annotations of those elements.

The purpose of this primer is to introduce how these semantic annotations are structured
in EML documents. It assumes that you have some familiarity with the EML schema. Our approach of structuring annotations according to the Resource Description Framework (RDF) specification follows recommendations from the World Wide Web Consortium (W3C) about how a Semantic Web should be constructed. If you want to read more about the W3C's RDF data model, graphs, or the Semantic Web, there is supplemental material at the bottom of this primer.

A semantic annotation involves the attachment ("annotation") of semantic metadata to a resource -- which in this context would be an EML element. A semantic annotation provides a pointer (an HTTP uniform resource identifier; URI) that should resolve (dereference) to useful descriptions, definitions, or relationships that the annotated resource has with other terms or resources, and do so in a computer-usable way. These "pointers" reference terms within web-accessible *knowledge graphs* (also called "ontologies"). The process of creating semantic annotations may seem tedious, but the payoff is vastly enhanced information discovery and interpretation. Semantic annotations will make it easier for others to find and reuse your data (and thus give you credit).

For example, if one dataset is annotated as being about "carbon dioxide flux" and another dataset is annotated as being about
"CO2 flux", the information system can recognize that these datasets are about equivalent concepts, because this equivalence can be indicated in a "computer-usable" way through the semantic annotation -- e.g. by sharing the same HTTP URI for their annotation.
@@ -24,23 +26,22 @@ If you already understand the basics of how RDF and the Semantic Web work, and a

Semantic annotations enable the creation of what are called *triples*, which are three-part statements conforming to the W3C-recommended *RDF data model* (learn more: <https://www.w3.org/TR/rdf11-primer/>).

A *triple* is composed of three parts: a **subject**, a **predicate (object property or data property)**, and an **object**.
A *triple* is composed of three parts: a **subject**, a **predicate (object property or datatype property)**, and an **object**.

```
[subject] [predicate] [object]
```

These components are analogous to parts of a sentence: the **subject** and **object** can be thought of as nouns in the sentence and the **predicate** (object property or data property) is akin to a verb or relationship that connects the **subject** and **object**. The semantic triple expresses a statement about the associated resource, that is generally the **subject**.
These components are analogous to parts of a sentence: the **subject** and **object** can be thought of as nouns in the sentence, and the **predicate** (object property or datatype property) is akin to a verb or relationship that connects the **subject** and **object**. The semantic triple expresses a statement about the associated resource, which is generally the **subject**.

There are (perhaps unfortunately) several other ways that the components of an RDF statement are sometimes described. One popular "synonymy" for **subject-predicate-object** is **resource-property-value**, i.e. the subject is referred to as the **resource**, the predicate a **property**, and the object a **value**. This can be confusing, since the usual definition of a *resource* is any identifiable 'thing' or object, especially one assigned a URI; and by this definition, *resources* can and often do occur in all three components of a triple. But thinking of a triple as a *resource-property-value* does provide an indication of the directionality of the semantics of an RDF statement. This latter terminology is also somewhat similar to how analogous components are named in JSON-LD. Note that JSON-LD is closely compatible with RDF, and one format can often be readily translated to the other (although there are some exceptions).

Semantic annotations added to an EML document can be extracted and processed into a semantic web format, such as RDF/XML, such that the semantic statement(s), i.e. RDF triples, become interpretable by any machines that can process the W3C standard of RDF. Those RDF statements contribute to the Semantic Web.
Semantic annotations added to an EML document can be extracted and processed into a semantic web format, such as RDF/XML. These "semantic" statements, i.e. RDF triples, are interpretable by any machine that can process the W3C standard of RDF. Those RDF statements contribute to the Semantic Web.

#### URIs
Ideally, the components of the semantic triple should be globally
unique and persistent (unchanging), and consist of resolvable/dereferenceable HTTP uniform resource identifiers (URIs; or more formally, IRI's). The *subjects* of most EML semantic annotations will likely be HTTP URI's that identify the dataset resource itself, or specific attributes or other features within a dataset. The *objects* of EML semantic annotations, as well as the *predicates* that relate the subject to the object, will most typically be HTTP URI references to terms in controlled vocabularies (also called "knowledge graphs", or "ontologies") accessible through the Web, so that users (or computers) can dereference the URI's and look up precise definitions and relationships of these resources to other terms.
Ideally, the components of the semantic triple should be globally unique and persistent (unchanging), and consist of resolvable/dereferenceable HTTP uniform resource identifiers (URIs; or more formally, IRIs). The *subjects* of most EML semantic annotations will likely be HTTP URIs that identify the dataset resource itself, or specific attributes or other features within a dataset. The *objects* of EML semantic annotations, as well as the *predicates* that relate the subject to the object, will most typically be HTTP URI references to terms in controlled vocabularies (also called "knowledge graphs", or "ontologies") accessible through the Web, so that users (or computers) can dereference the URIs and look up precise definitions and relationships of these resources to other terms.

An example of a URI is "http://purl.obolibrary.org/obo/ENVO_00000097", when entered into the address bar of a web browser, resolves to the term with a label of "desert area" in the Environment Ontology (EnvO). Users can learn what this URI indicates and explore how the term is related to other terms in the ontology simply by dereferencing its URI in a web browser. All those other aspects you see on the Web page describing "http://purl.obolibrary.org/obo/ENVO_00000097" are from RDF statements (triples) that have been rendered into HTML. From here, you might realize that "http://purl.obolibrary.org/obo/ENV0_00000172" ("sandy desert") is a better annotation for your object.
An example of a URI is "http://purl.obolibrary.org/obo/ENVO_00000097", which, when entered into the address bar of a web browser, resolves to the term with a label of "desert area" in the Environment Ontology (EnvO). Users can learn what this URI indicates and explore how the term is related to other terms in the ontology simply by dereferencing its URI in a web browser. The other details you see on the Web page describing "http://purl.obolibrary.org/obo/ENVO_00000097" come from other RDF statements (triples) about "ENVO_00000097" that have been rendered into HTML. From here, you might decide, for example, that "http://purl.obolibrary.org/obo/ENVO_00000172" ("sandy desert") is a better annotation for your object.

An RDF triple can be constructed as follows, with subject URI, predicate URI, and object URI:

@@ -52,16 +53,18 @@ An RDF triple can be constructed as follows, with subject URI, predicate URI, an

.

... indicating that the referenced *dataset* (subject/resource) was *"located in"* (predicate/property) a *"desert area"* (object/value).
Note that a blank-space must separate the subject, from the predicate, from the object, and that a "period" completes the triple. This is a valid RDF triple, expressed in N-Triple syntax. RDF is most often serialized into XML, however, as Web browsers and many applications are good at parsing XML.
... indicating that the referenced *dataset* (subject/resource) was *"located in"* (predicate/property) a *"desert area"* (object/value). Note that a blank space must separate the subject from the predicate and the predicate from the object, and that a period completes the triple. This is a valid RDF triple, expressed in N-Triples syntax. RDF is most often serialized into XML, however, as Web browsers and many applications are good at parsing XML.
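
The N-Triples layout described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the output of any EML tool: the subject and predicate URIs below are hypothetical placeholders, the `to_ntriple` helper is invented for this example, and only the object URI ("desert area") comes from the text. In N-Triples, each URI is written between angle brackets.

```python
def to_ntriple(subject_uri, predicate_uri, object_uri):
    """Serialize three URIs as a single N-Triples statement."""
    for uri in (subject_uri, predicate_uri, object_uri):
        # Rough sanity check only; full IRI validation is more involved.
        if any(ch in uri for ch in ' <>"'):
            raise ValueError(f"not a plain URI: {uri!r}")
    # Angle-bracketed terms separated by spaces, closed by a period.
    return f"<{subject_uri}> <{predicate_uri}> <{object_uri}> ."


SUBJECT = "https://example.org/datasets/my-dataset"       # hypothetical dataset URI
PREDICATE = "https://example.org/terms/locatedIn"         # hypothetical "located in" property
OBJECT = "http://purl.obolibrary.org/obo/ENVO_00000097"   # "desert area" (EnvO), from the text

triple = to_ntriple(SUBJECT, PREDICATE, OBJECT)
print(triple)
```

In practice, an RDF library would normally handle serialization and validation; the point here is only to show how little structure a single triple needs.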

While the *essence* of the RDF data model is as simple as having URI's indicating the subject, predicate, and object constituting a *triple*, there are also *blank nodes* that can occur in the subject and object positions, and *literals* can occur as objects-- but these are complexities beyond the scope of this Primer, and not necessary to know in order to do extremely useful semantic annotation of EML elements. While our focus here is on semantic annotation of EML documents, it is easy to see how the RDF model can be used to describe, in a triple, any resource that has a URI.
While our focus here is on the semantic annotation of EML documents, it is easy to see how RDF statements can be used to describe and inter-relate any resources that have unique, persistent HTTP URIs!

Note that the above *RDF triple* consists of three HTTP URIs. While the exact distinction among what is a URI, a URN, and a URL can be debated, essentially all URLs (Uniform Resource Locators) are URIs -- they point to a location where some resource exists (in the case of an HTTP URL, on the Web) and can be resolved or dereferenced. But a URI can also serve as, ideally, a (globally) *unique and persistent name* of a resource, i.e., it is a URN (Uniform Resource Name). While URIs, URNs, and URLs don't necessarily have to work with the HTTP protocol, for practical purposes in the present, these are most useful if they work well with the Web, and thus HTTP. Having an HTTP URI, however, does not mean that these are only useful for viewing in a Web browser. Content negotiation between a Web server and a client (which might be a browser, or a Python or R script) can enable an HTTP URI to dereference in ways optimized for the requesting client -- e.g. in one case, presenting a human-readable view of metadata for a dataset, and in another, activating a download of that dataset for import into a script.
Note that the above *RDF triple* consists of three HTTP URIs. While the exact distinction among what is a URI, a URN, and a URL can be debated, for our purposes, these HTTP URIs can be considered both the *name* and the *web location* of a resource. Content negotiation between a Web server and a client (which might be a browser, or a Python or R script) can enable an HTTP URI to dereference in ways optimized for the requesting client -- e.g. in one case, presenting a human-readable view of metadata for a dataset, and in another, activating a download of that dataset for import into a script.
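
As a rough sketch of what content negotiation looks like from the client side, the Python snippet below builds two requests for the same URI that differ only in their `Accept` header. The requests are constructed but not sent, and the `build_request` helper is invented for this example. Whether a given server actually returns different representations for these media types is an assumption that depends on that server's configuration.

```python
from urllib.request import Request

URI = "http://purl.obolibrary.org/obo/ENVO_00000097"  # "desert area" (EnvO), from the text


def build_request(uri, media_type):
    """Build (but do not send) a GET request asking for one representation."""
    return Request(uri, headers={"Accept": media_type})


# Same URI, two preferred representations.
html_request = build_request(URI, "text/html")            # human-readable page
rdf_request = build_request(URI, "application/rdf+xml")   # machine-readable RDF/XML

for req in (html_request, rdf_request):
    print(req.get_method(), req.full_url, "Accept:", req.get_header("Accept"))
```

Passing either request to `urllib.request.urlopen` would perform the actual HTTP exchange; a script would typically ask for the RDF form, while a browser asks for HTML.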

EML 2.2.0 now provides ways to embed HTTP URI's into several EML elements, thus serving as semantic annotations of those elements, such that those EML snippets can be extracted and serialized into the formal Semantic Web vocabulary of RDF. These latter functions, i.e. extracting the semantic annotations and converting these into valid RDF triples as in the example provided above, will rely on software tools that are under development at NCEAS and EDI, and through the rOpenSci project. The RDF triple described above, however, hopefully gives an idea of how triples, constructed of dereferenceable HTTP URI's, can be very useful. The sections below describe the exact syntax for embedding annotations in EML 2.2.0 documents.
The software needed to extract semantic annotations from EML, and convert these into valid RDF triples, is under development at NCEAS and EDI, and through the rOpenSci project. The RDF triple described above, however, hopefully gives an idea of how such triples, constructed of dereferenceable HTTP URIs, can be very useful.

The sections below describe the exact syntax for embedding annotations in EML 2.2.0 documents.

## Semantic Annotations in EML 2.2.0

In **EML 2.2.0** there are 5 places where annotation elements can appear in an EML document:

- **top-level resource** -- an `annotation` element is a child of the `dataset`, `literature`, `software`, or `protocol` elements