Annotations in NexSON

Jim Allman edited this page Mar 26, 2014 · 69 revisions

The primary NexSON documentation is at http://opentree.wikispaces.com/NexSON.

This page is a short-term scratchpad/sandbox for fleshing out the details of how annotations that are added to NexSON files during a process of commenting or validation should be structured.

1 Background

NexSON is a JSON format derived from NexML using BadgerFish conventions. NexML is intended to be used to encode phylogenetic trees, alignments, and associated information. NexSON implements several of the NexML first-class object types including a top-level study object (the nexml object), which may contain a number of other first-class objects of the types: otus, otu, trees, tree, node, and edge. For more information, see the other NexSON pages and the NexML schema.

1.1 The meta element

As a means to store additional metadata, the NexML schema defines a meta tag, which may be a child of any first-class NexML object. These meta tags may contain arbitrarily complex, discretionary data structures, and are intended to contain metadata annotating the first-class element to which they are attached. In NexSON, sets of meta tags are represented as a JSON array, stored under the meta key in the first-class object to which the meta tags apply. For example:

{  
    "tree": {
        "meta": [
            {
                "@property": "the meta element property type",
                "childElement": { "$": "inner text of this child element" },
                "aDifferentChild": { "$": "foo" }
            },
            {
                "@property": "a different property type",
                "arrayValue": [
                    {"$": "inner text of an 'arrayValue' child of this meta element" },
                    {"$": "another arrayValue" },
                    {"$": "and a final one" }
                ]
            }
        ]
    }
}
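As a rough illustration of reading such a structure, the sketch below collects the meta entries of an object by their @property value. The helper name find_meta is hypothetical (it is not part of any NexSON tooling), and the handling of a lone meta object instead of an array is an assumption based on how BadgerFish represents a single child element.

```python
def find_meta(obj, prop):
    """Return every meta entry of `obj` whose @property equals `prop`."""
    meta = obj.get("meta", [])
    # Assumption: a single meta child may appear as an object, not a list.
    if isinstance(meta, dict):
        meta = [meta]
    return [m for m in meta if m.get("@property") == prop]


# A cut-down version of the example above.
tree = {
    "meta": [
        {"@property": "the meta element property type",
         "childElement": {"$": "inner text of this child element"}},
        {"@property": "a different property type",
         "arrayValue": [{"$": "another arrayValue"}]},
    ]
}

matches = find_meta(tree, "a different property type")
print(matches[0]["arrayValue"][0]["$"])
```

An empty list is returned when no entry matches, so callers can treat "no meta key" and "no matching property" the same way.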

The content of these meta tags is unconstrained by the NexML schema. For example, phenoscape embeds data structured according to other XML schemas inside meta tags. See some example phenoscape files here.

1.2 NexSON annotations for the Open Tree of Life Project

This document proposes a standardized model to facilitate efficient and straightforward storage and retrieval of human- and machine-generated annotation metadata regarding a NexSON study and its contained objects. The goals of this proposed model are limited to the scope of the Open Tree of Life project. Thus, no attempt is made to generalize a model suitable for all conceivable annotation purposes under the sun. Rather, the concepts are tailored to suit the activities expected to occur as part of the OToL workflow, including (but not necessarily limited to):

1.2A Example use-cases for NexSON annotations

  1. Study curation
  2. NexSON structural validation
  3. Data quality assessment
  4. Metadata persistence
  5. Cross-purpose communication of OToL tools, including external tools intended to complement and extend OToL tools.

Extensions to this model, or other models, may be required for other purposes outside the defined scope.

1.3 Historical information

Much of the content documented here was originally discussed in the Annotations thread on the opentreeoflife-software@googlegroups.com list. Where appropriate, attempts have been made to incorporate concepts from related projects addressing the formalization of annotation data, including Open Annotation, Annotation Ontology, and the W3C PROV Ontology.

2 Primary annotation objects

Three primary annotation object classes are proposed: annotationEvent, agent, and message. This three-part breakdown corresponds to the PROV data model, with its Activity, Agent, and Entity types corresponding to NexSON annotationEvent, agent, and message objects respectively. The roles of these object types are defined as follows:

2.1 annotationEvent class

An annotationEvent is a one-time event, during which an agent generates one or more messages related to a study or element(s) within it. Each annotationEvent should generate one or more message objects. annotationEvent objects relate message objects to associated agent objects and contain information about the event itself, including the date [more info here].

After some discussion (Jan 13, 2014 software G+ hangout), we decided to put the message objects inside their containing annotationEvent object.

2.2 agent class

An agent is a person or program that creates annotations, possibly acting on behalf of another agent. agent objects contain information identifying and describing real-world annotating agents, including names, URLs, information about the execution environment (for automated agents), version (for automated agents), etc. For more information, refer to the agent syntax below.

2.3 message class

A message is a simple data structure that provides information about a particular target object or set of objects. Messages are generalized, and contain features to accommodate diverse annotation data. For more information, refer to the message object syntax below.

3 Storage conventions

Annotation information may conceivably be stored anywhere (e.g. within single-file NexSON documents, or externally accessible via URL). For convenience and simplicity, at this time we propose storing annotations within NexSON documents themselves.

3.1 Container elements

Two top-level NexSON meta element containers are proposed to store collections of primary annotation objects. These container elements are in fact NexSON meta elements, whose @property value may be equal to "ot:annotationEvents" or "ot:agents". Corresponding annotation elements of each respective type should be stored in the appropriate meta container. (The "ot:messages" container is now deprecated, in favor of storing messages inside annotation events.)

3.2 Storage of annotationEvent and agent objects

Exactly one meta element with the property "ot:annotationEvents" and one with "ot:agents" should exist for a given study, as children of the nexml object itself. These containers should contain all of the annotationEvent and agent objects associated with message objects applied to elements within the study.
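The "exactly one of each" rule can be checked mechanically. The following is a minimal sketch, assuming the study has already been parsed into a Python dict and that the containers sit in the meta list of the nexml object; the function names are hypothetical.

```python
REQUIRED_CONTAINERS = ("ot:annotationEvents", "ot:agents")


def count_containers(nexml):
    """Count the annotation containers in the meta list of the nexml object."""
    meta = nexml.get("meta", [])
    if isinstance(meta, dict):  # assumption: a lone meta may not be wrapped in a list
        meta = [meta]
    counts = {prop: 0 for prop in REQUIRED_CONTAINERS}
    for entry in meta:
        prop = entry.get("@property")
        if prop in counts:
            counts[prop] += 1
    return counts


def has_valid_containers(nexml):
    """True when each required container occurs exactly once."""
    return all(n == 1 for n in count_containers(nexml).values())


study = {
    "meta": [
        {"@property": "ot:annotationEvents", "@xsi:type": "nex:ResourceMeta",
         "annotation": []},
        {"@property": "ot:agents", "@xsi:type": "nex:ResourceMeta",
         "agent": []},
    ]
}
print(has_valid_containers(study))
```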

3.3 Storage and placement of message objects

message objects are stored inside the annotationEvent that created them.

Deprecated recommendation: The message objects themselves should be stored in "ot:messages" containers that are attached to the least inclusive NexSON element to which the information in the message applies. Thus, one "ot:messages" container may exist as a child of each annotated object (see 3.3A for a list of annotatable object types) in the study. meta containers of the "ot:messages" type should only be assigned to the following first-class NexSON objects:

3.3A NexSON elements suitable for storing "ot:messages" meta containers

N.B. This entire section is deprecated, in favor of storing messages inside annotation events.

  1. nexml (the study itself)
  2. tree
  3. node
  4. edge
  5. otu

Determining the best location to attach these "ot:messages" containers may be a rather arbitrary choice in many cases, but placements facilitating ease of interpretation and semantic consistency are encouraged. By convention, the "ot:messages" element attached to top-level nexml element should contain message objects that describe information about the study itself, or about one of its associated annotationEvent or agent objects; "ot:messages" containers attached to a tree element should contain message objects specific to that tree, but message objects specific to a single node within that tree should be stored in a "ot:messages" container attached to the node itself; etc.

It may be instructive to consider a negative example: it is possible to store every message object in the "ot:messages" meta container attached to the nexml element itself, and simply use the "refersTo" field (see below) to associate message objects with the NexSON objects to which they pertain. This usage pattern is discouraged since it complicates the association of the message objects to their relevant NexSON elements. With that in mind, it is worth recognizing that there may be rare cases where it is appropriate to store all of the message objects associated with a given annotationEvent in the "ot:messages" container attached to the nexml object. For instance, when every message object refers to both tree and otu objects, or to other annotationEvent objects, then the most inclusive placement of each message is the nexml object.


JA: If we put warnings/queries/errors in the meta that is inside the top-level nexml object, then the curator application could grab the annotations, and quickly ascertain which parts of the study have problems. This will require that app to hold onto tree, node, edge and otu IDs until it has the data to instantiate objects of those types. But that does not seem too onerous.

CEH: I think we want to avoid the need for the curator app (or any other app for that matter) to download and parse the entire NexSON study and/or entire set of messages in order to find the relevant ones. I would suggest that implementing services (such as OTI) capable of returning the information based on queries (e.g. "return all warnings/queries/errors for anything in study X") would be more scalable than searching the NexSON for them on every load. In this case, the placement of the messages within the file is arbitrary. I would argue that storing them as children of the objects to which they most closely pertain makes more intuitive sense than not doing so, and that it will be easier to parse in many cases (e.g. no need to hold onto node, edge, otu, etc. ids as mentioned above). So this is my recommendation.


4 Syntax for meta container objects

In accordance with BadgerFish conventions (for XML compatibility), each container object in the JSON representation will contain an array of objects of the corresponding type under the defined key. Each element of these arrays corresponds to a single tag in the XML with the same type name as the array key in the BadgerFished JSON (e.g. annotationEvent, agent, or message).

4.1 annotationEvents collection

tag | legal value(s) | explanation
--- | --- | ---
@property | "ot:annotationEvents" |
@xsi:type | "nex:ResourceMeta" |
annotation | list of annotationEvent elements | See details below

4.2 agents collection

tag | legal value(s) | explanation
--- | --- | ---
@property | "ot:agents" |
@xsi:type | "nex:ResourceMeta" |
agent | list of agent elements | See details below

4.3 messages collection (deprecated in favor of message objects inside annotationEvents)

tag | legal value(s) | explanation
--- | --- | ---
@property | "ot:messages" |
@xsi:type | "nex:ResourceMeta" |
message | list of message elements | See details below.

5 Primary object syntax

5.1 annotationEvent object

tag | legal value(s) | explanation
--- | --- | ---
@id | string | unique among the set of IDs used in this file (not necessarily globally unique)
@description | string | human-readable description of the type of annotation performed (e.g. "NexSON validation" or "treemachine import check")
@wasAssociatedWithAgentId | string | id of the agent (person or tool; see below) that created the annotationEvent
@dateCreated | string in ISO 8601 format | date that the annotationEvent occurred
@passedChecks | boolean | True by default. False indicates that the author is a validating service (rather than just a commenting tool), and some aspect of the validation procedure failed in some serious way. The details should be in the messages.
@preserve | boolean | False by default. True serves as a flag to future invocations of the same tool (software agent), indicating that the message should be retained
otherProperty | array of otherProperty elements | Optional. See below for additional information
message | list of message elements | See details below.
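A minimal sketch of assembling an annotationEvent with the fields listed above. The function name and its parameters are hypothetical; the only firm points taken from the table are the field names and the use of an ISO 8601 string for @dateCreated.

```python
from datetime import datetime, timezone


def make_annotation_event(event_id, description, agent_id, messages,
                          passed_checks=True, preserve=False, when=None):
    """Build an annotationEvent dict; `when` defaults to the current UTC time."""
    when = when or datetime.now(timezone.utc)
    return {
        "@id": event_id,
        "@description": description,
        "@wasAssociatedWithAgentId": agent_id,
        "@dateCreated": when.isoformat(),  # ISO 8601 string
        "@passedChecks": passed_checks,
        "@preserve": preserve,
        "message": list(messages),  # messages live inside their event
    }


event = make_annotation_event(
    "anno1", "Open Tree NexSON validation", "agentX", [],
    passed_checks=False,
    when=datetime(2014, 3, 26, 12, 0, tzinfo=timezone.utc),
)
print(event["@dateCreated"])
```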

5.2 agent object

An Agent can be a human author or a program. (Is there a standard way of describing a software tool that we should be using here? <-- Yes, we are adapting this from the PROV model.) Here is the basic info we want:

tag | legal value(s) | explanation
--- | --- | ---
@id | string | unique among the set of IDs used in this file (not necessarily globally unique)
@name | string | Name of software that produced the annotation, or authorized user (GitHub username or email)
@url | string | URL of service or page that describes the tool (blank for a human)
@description | string | human-readable description of the tool, or full name for a human
@version | string | version number string of the authoring tool (blank for a human)
invocation | object | Only applicable to automated (i.e. software) agents. invocation object that contains relevant info about the execution environment and operating parameters
otherProperty | array of otherProperty objects | Optional. See below for more information

5.2.1 invocation object (sub-object of agent)

tag | legal value(s) | explanation
--- | --- | ---
commandLine | list of strings | (optional) args
method | string | GET, PUT... for web services
data | string | data parameters passed to the web-services call
checksPerformed | list of strings | list of Message Codes (see below) that the service claims to have checked for
otherProperty | array of otherProperty objects for additional information | Optional. See below for more information

5.3 message object

tag | legal value(s) | explanation
--- | --- | ---
@id | string | unique among the set of IDs used in this file (not necessarily globally unique)
@wasGeneratedById | string | Deprecated: no longer used because message objects now occur inside the annotation event that generates them. The id of the annotationEvent object with which this message is associated
wasAttributedToId | string | Optional. The id of an agent object that this message is attributed to, which may be different from the agent associated with the generating annotationEvent. For example, "wasAttributedToId" could identify a human agent operating a software agent with which the annotationEvent itself may be associated.
@severity | string | one of the defined Severity values (like logger message levels; see below)
@code | string | one of the Message Codes (see below)
@humanMessageType | string | one of the Message Types (see below). Optional if the Message Code indicates that a front end should be able to generate a message from the code (see below).
@humanMessage | string | human-interpretable message (i.e. no NexSON IDs). Optional if the Message Code indicates that a front end should be able to generate a message from the code (see below).
dataAnnotation | string | Optional. More precise message for machine consumption
data | object | fields depend on the Message Code (see below)
refersTo | path object (see below) | path to the object that the message refers to (see path syntax below)
other | object | object (key to string, number, or boolean) for additional information
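To make the field list concrete, here is a small sketch that builds a message and rejects unknown severity values. The helper make_message is hypothetical, and defaulting @humanMessageType to "NOTE" when a human message is supplied without a type is an assumption of this sketch, not something the proposal specifies.

```python
SEVERITIES = ("ERROR", "WARNING", "INFO")


def make_message(message_id, severity, code, refers_to, data=None,
                 human_message=None, human_message_type=None):
    """Build a message dict using the field names from the table above."""
    if severity not in SEVERITIES:
        raise ValueError(f"unknown severity: {severity!r}")
    message = {
        "@id": message_id,
        "@severity": severity,
        "@code": code,
        "data": data if data is not None else {},
        "refersTo": refers_to,
    }
    if human_message is not None:
        # Assumption: fall back to NOTE when no message type is given.
        message["@humanMessageType"] = human_message_type or "NOTE"
        message["@humanMessage"] = human_message
    return message


msg = make_message(
    "message1", "WARNING", "NO_ROOT_NODE",
    refers_to={"top": "trees", "treesID": "trees1003", "treeID": "tree1945"},
)
print(msg["@severity"], msg["@code"])
```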

6 Secondary/supporting object syntax

6.1 otherProperty object

These objects are used to designate optional properties. They are intended to be used as a catch-all for necessary round-trip information that does not belong in any pre-defined property for a given object. This feature is intentionally restrictive to reduce complexity and increase consistency/adherence to the annotation spec.

tag | explanation
--- | ---
name | the name of this property
value | a value of one of the predefined value types below

6.2 Value types

Acceptable values are defined by JSON.

tag | explanation
--- | ---
STRING | a string wrapped in quotes.
NUMBER | a floating point or integer value.
BOOLEAN | a boolean value. Acceptable values are the JSON literals true and false (written without quotes).
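A short sketch of how a tool might check an otherProperty object against these value types after the JSON has been parsed. The function name is hypothetical; requiring both name and value to be present is an assumption drawn from the two-row table in 6.1.

```python
def is_valid_other_property(prop):
    """Check that `prop` has a string name and a STRING, NUMBER, or BOOLEAN value."""
    if not isinstance(prop, dict):
        return False
    if "name" not in prop or "value" not in prop:
        return False
    if not isinstance(prop["name"], str):
        return False
    # Parsed JSON strings, numbers, and booleans map to these Python types;
    # null, arrays, and objects (None, list, dict) are rejected.
    return isinstance(prop["value"], (str, int, float, bool))


print(is_valid_other_property({"name": "retries", "value": 3}))
print(is_valid_other_property({"name": "nested", "value": {"a": 1}}))
```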

6.3 Message levels

tag | explanation
--- | ---
ERROR | an error. Generally designates a fail condition
WARNING | a warning. Designates a condition that is not encouraged but is not generally a fail condition
INFO | neither a warning nor an error

6.4 Message types

The following message types (borrowed from the Open Annotation and Annotation Ontology projects) are used to define different cases for the human-readable message (if there is one):

tag | explanation
--- | ---
NONE | there is no human-readable message
NOTE | a general, human-readable note
COMMENT | this suggests editorial intent
REPLY | points to another annotation (Note, Comment, or Reply)
EXPLANATORY_NOTE | by the curator?
QUESTION | specifically asks for reply or clarification
ERRATUM | identifies an error (added by curator to historical stuff? or by a reviewer?)

7 Referring to other objects from within message objects

Because message objects may relate to more than a single NexSON element, we define here a lightweight, NexSON-specific method of describing the paths to the objects they refer to.

Avoiding strict use of a JSON version of XPath avoids having to parse a path string and deal with unusual IDs (which are legal but could make naive parsing hard to implement).

7.1 Path syntax

Used in the refersTo field to indicate the target of the comment. It seems like we can just expand the parts of an absolute path expression (taking advantage of Ids and the fact that NexSONs are not that "deep").

tag | legal value(s) | explanation
--- | --- | ---
@idref | string | ID of the object referred to. This ID will also be found in one of the subsequent fields, but duplicating it here makes it easy for an id->object map to quickly interpret this path blob
@top | "meta", "otus", or "trees" | child of the nexml element
@otusID | string | only if otus is top
@otuID | string | only if otus is top. Optional
@treesID | string | only if trees is top
@treeID | string | only if trees is top. Optional
@edgeID | string | only if trees is top and treeID is specified. Optional
@nodeID | string | only if trees is top and treeID is specified. Optional
@metaID | string | only if meta is top (useful for replies)
@annotationID | string | only if meta is top
@messageID | string | only if meta is top; NOTE that message may be "localized" anywhere in the study!
@property | string | optional. Property of the element referenced by the preceding parts of the path
@inMeta | bool | The property is in the meta list of the element referenced by the preceding parts of the path

The pseudocode for processing one of these paths would be something like this (assuming that [] looks up a property or a contained Id in an object):

// Return the entry of meta_list whose property matches prop.
function find_prop_in_meta(meta_list, prop) {
  for (element in meta_list) {
    if (element.property == prop) {
      return element
    }
  }
  throw InvalidPathException()
}

// Walk from the nexml element down to the object that path refers to.
function resolve(nexml, path) {
  if (path.top == "meta") {
    el = nexml.meta
  } else if (path.top == "otus") {
    otus = nexml.otus[path.otusID]
    if (defined(path.otuID)) {
      el = otus[path.otuID]
    } else {
      el = otus
    }
  } else if (path.top == "trees") {
    trees = nexml.trees[path.treesID]
    if (defined(path.treeID)) {
      tree = trees[path.treeID]
      if (defined(path.nodeID)) {
        el = tree[path.nodeID]
      } else if (defined(path.edgeID)) {
        el = tree[path.edgeID]
      } else {
        el = tree
      }
    } else {
      el = trees
    }
  } else {
    throw InvalidPathException()
  }
  // inMeta is a boolean flag, so test its value and not just its presence.
  if (defined(path.inMeta) && path.inMeta) {
    return find_prop_in_meta(el.meta, path.property)
  }
  if (defined(path.property)) {
    return el[path.property]
  } else {
    return el
  }
}
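For comparison, the same walk can be sketched in Python with the ID lookups spelled out. This is only a sketch under several assumptions: otus and trees are taken to be objects with an "@id" and a child list ("otu", "tree"), trees hold "node" and "edge" lists, path keys are written without the @ prefix (as in the pseudocode), and all helper names are hypothetical.

```python
class InvalidPathError(Exception):
    """Raised when a refersTo path does not lead to an object."""


def _as_list(value):
    # Assumption: a lone child may appear as an object, not a list.
    if value is None:
        return []
    return value if isinstance(value, list) else [value]


def _by_id(items, wanted):
    """Return the item whose @id equals `wanted`, or raise."""
    if wanted is not None:
        for item in _as_list(items):
            if item.get("@id") == wanted:
                return item
    raise InvalidPathError(f"no element with id {wanted!r}")


def _find_meta(el, prop):
    for entry in _as_list(el.get("meta")):
        if entry.get("@property") == prop:
            return entry
    raise InvalidPathError(f"no meta entry with property {prop!r}")


def resolve(nexml, path):
    top = path.get("top")
    if top == "meta":
        if "property" in path:
            return _find_meta(nexml, path["property"])
        return _as_list(nexml.get("meta"))
    if top == "otus":
        el = _by_id(nexml.get("otus"), path.get("otusID"))
        if "otuID" in path:
            el = _by_id(el.get("otu"), path["otuID"])
    elif top == "trees":
        el = _by_id(nexml.get("trees"), path.get("treesID"))
        if "treeID" in path:
            el = _by_id(el.get("tree"), path["treeID"])
            if "nodeID" in path:
                el = _by_id(el.get("node"), path["nodeID"])
            elif "edgeID" in path:
                el = _by_id(el.get("edge"), path["edgeID"])
    else:
        raise InvalidPathError(f"unknown top element: {top!r}")
    if "property" not in path:
        return el
    if path.get("inMeta"):  # flag is tested by value, not presence
        return _find_meta(el, path["property"])
    if path["property"] not in el:
        raise InvalidPathError(f"no property {path['property']!r}")
    return el[path["property"]]


nexml = {
    "otus": {"@id": "otus1", "otu": [{"@id": "otu1"}]},
    "trees": {"@id": "trees1003", "tree": [{
        "@id": "tree1945",
        "node": [{"@id": "node503822",
                  "meta": {"@property": "ot:tag", "$": "x"}}],
        "edge": [],
    }]},
}
path = {"inMeta": False, "top": "trees",
        "treeID": "tree1945", "treesID": "trees1003"}
print(resolve(nexml, path)["@id"])
```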

8 Lists of OpenTree message codes

This is intended to be an extensible, controlled vocabulary of the types of messages that we anticipate seeing or generating. Preferably, many of the codes, along with the data blob in the message, will be rich enough to create a meaningful user interface for the message (without forcing the UI to simply display the messageForUser and hope that the user will know how to react to it).

8.1 Codes that we should anticipate seeing regularly during curation

code name | data contents | explanation
--- | --- | ---
REFERENCED_ID_NOT_FOUND | {key: string, value: string} | The NexSON attribute with the name key refers to an ID, but the ID is not in the NexSON. We have about 3000 cases of this with @otu in nodes or @source in edge objects not matching.
TIP_WITHOUT_OTU | {} | refersTo object is a node that is a tip on the tree, but is not mapped to any OTU object. This is a NexSON error, not failure to map to OTT. We have about 3000 cases
UNRECOGNIZED_PROPERTY_VALUE | {key: string, value: string} | the meta array contains a key-value pair in which the key is recognized, but the value is not valid. We have about 51 cases of this in which key is "ot:branchLengthMode" and value is "ot:years" (which is deprecated, I think).
MISSING_OPTIONAL_KEY | string | the attribute is not found. Used to report lack of "ot:dataDeposit", "ot:focalClade", "ot:inGroupClade", "ot:ottolid", and "ot:studyPublication" fields. So we have about 32 thousand of these
NO_ROOT_NODE | {} | tree that is refersTo has no node flagged as the root. We have 12 cases
TIP_WITHOUT_OTT_ID | {} | refersTo is a node with an otu, but the otu has no OTT ID. (about 31 thousand cases)
MULTIPLE_TIPS_MAPPED_TO_OTT_ID | {nodes:[list of IDs]} | refersTo is a tree; the nodes listed are tips in the tree that map to the same OTT ID (about 31 thousand nodes)
MULTIPLE_TREES | {} | trees element is refersTo and it has multiple trees with no indication of which one treemachine should prefer to use
UNRECOGNIZED_TAG | string | value of an ot:tag meta is not understood. This is not unexpected at all (and this sort of message will probably be suppressed), but the validator does emit it currently so we can see what tags are being used.
UNVALIDATED_ANNOTATION | {key: string, value: string} | an object in the meta list had an unrecognized key. Not surprising (will be suppressed).
CONFLICTING_PROPERTY_VALUES | list of key-value pairs that conflict | Flags with conflicting meanings, for example the "delete me" and the "choose me" tags
NO_TREES | {} | file contains no trees that are not flagged for deletion
NON_MONOPHYLETIC_TIPS_MAPPED_TO_OTT_ID | list of lists of IDs. Each sublist is a set of nodes that are monophyletic on the tree and for which all the tips have the same OTT ID | This code is more serious than MULTIPLE_TIPS_MAPPED_TO_OTT_ID because it indicates cases in which different arbitrary prunings could lead to different phylogenetic statements
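As an illustration of how one of these codes might be produced, the sketch below reports REFERENCED_ID_NOT_FOUND for nodes whose @otu points at an unknown OTU. It is not the Open Tree validator: the function name is hypothetical, the otus/trees layout (objects with "@id" plus "otu"/"tree" lists) is an assumption, and the refersTo keys follow the unprefixed style of the path example in section 9.4.

```python
def find_missing_otu_refs(otus, trees):
    """Return one message stub per node whose @otu is not a known OTU id."""
    known = {otu["@id"] for otu in otus.get("otu", [])}
    messages = []
    for tree in trees.get("tree", []):
        for node in tree.get("node", []):
            ref = node.get("@otu")
            if ref is not None and ref not in known:
                messages.append({
                    "@code": "REFERENCED_ID_NOT_FOUND",
                    "data": {"key": "@otu", "value": ref},
                    "refersTo": {
                        "idref": node["@id"],
                        "top": "trees",
                        "treesID": trees["@id"],
                        "treeID": tree["@id"],
                        "nodeID": node["@id"],
                    },
                })
    return messages


otus = {"@id": "otus1", "otu": [{"@id": "otu1"}]}
trees = {"@id": "trees1", "tree": [{
    "@id": "tree1",
    "node": [
        {"@id": "node1", "@otu": "otu1"},
        {"@id": "node2", "@otu": "otu99"},  # dangling reference
        {"@id": "node3"},                   # internal node, no @otu
    ],
}]}
found = find_missing_otu_refs(otus, trees)
print([m["refersTo"]["nodeID"] for m in found])
```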

8.2 Codes emitted when POSTing NexSON (when we support it)

These indicate serious problems with the NexSON (and we can probably be unfriendly about them in terms of UI, because they'll probably be encountered by developers):

code name | data contents | explanation
--- | --- | ---
MISSING_MANDATORY_KEY | string - key name | refersTo object lacks a mandatory attribute.
UNRECOGNIZED_KEY | string - key name | refersTo object has an attribute that is not allowed by the NeXML schema
MISSING_LIST_EXPECTED | ? | element (e.g. edge) that should be a list was not
DUPLICATING_SINGLETON_KEY | string | the attribute specified was encountered more than one time, though it should have been found only once (e.g. a doi)
REPEATED_ID | string | ID found more than once
MULTIPLE_ROOT_NODES | {} | tree has more than one node marked as root
MULTIPLE_EDGES_FOR_NODES | {} | node has more than one edge to parent
CYCLE_DETECTED | {node : id string} | tree has a cycle (including the referenced node)
DISCONNECTED_GRAPH_DETECTED | {} | tree is not a connected graph
INCORRECT_ROOT_NODE_LABEL | {} | the node labelled as the root has a parent
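Several of the tree-level codes above can be detected from the edge list alone. The following sketch is not the Open Tree validator; it assumes each tree has "node" and "edge" lists, that edges run from "@source" (parent) to "@target" (child), and that the root is flagged with a truthy "@root" attribute. The function name is hypothetical.

```python
def check_tree_structure(tree):
    """Return a list of (code, node_id) findings; node_id is None for tree-level codes."""
    nodes = tree.get("node", [])
    findings = []

    # Map each child node id to the ids of its parents.
    parents = {}
    for edge in tree.get("edge", []):
        parents.setdefault(edge["@target"], []).append(edge["@source"])

    for node_id, parent_ids in parents.items():
        if len(parent_ids) > 1:
            findings.append(("MULTIPLE_EDGES_FOR_NODES", node_id))

    roots = [n["@id"] for n in nodes if n.get("@root")]
    if not roots:
        findings.append(("NO_ROOT_NODE", None))
    elif len(roots) > 1:
        findings.append(("MULTIPLE_ROOT_NODES", None))
    for root_id in roots:
        if root_id in parents:
            findings.append(("INCORRECT_ROOT_NODE_LABEL", root_id))

    # Follow parent links upward; revisiting a node on the current walk is a cycle.
    done = set()
    for node in nodes:
        on_path = set()
        cur = node["@id"]
        while cur is not None and cur not in done:
            if cur in on_path:
                findings.append(("CYCLE_DETECTED", cur))
                break
            on_path.add(cur)
            cur = parents.get(cur, [None])[0]
        done |= on_path
    return findings


good = {
    "node": [{"@id": "r", "@root": True}, {"@id": "x"}, {"@id": "y"}],
    "edge": [{"@source": "r", "@target": "x"}, {"@source": "r", "@target": "y"}],
}
cyclic = {
    "node": [{"@id": "a"}, {"@id": "b"}, {"@id": "c"}],
    "edge": [{"@source": "a", "@target": "b"},
             {"@source": "b", "@target": "c"},
             {"@source": "c", "@target": "a"}],
}
print(check_tree_structure(good))
print(check_tree_structure(cyclic))
```

Each cycle is reported once, at the node where the upward walk first closes on itself. DISCONNECTED_GRAPH_DETECTED is left out of this sketch.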

8.3 Codes generated by the opentree web app (curation UI)

code name | data contents | explanation
--- | --- | ---
OTU_MAPPING_HINTS | object | Object describing 'searchContext' (string) and required 'substitutions' (sub-objects)
SUPPORTING_FILE_INFO | object | Object describing 'files' (sub-objects)

9 Examples

9.1 Use cases

Presumably the curator app (see the "curator" subdir of the opentree repo) will try to render a subset of this information to curators. Specifically, the annotations could be warnings, error messages, and queries to the curators.

Some annotations could also be "extra" contributions to the study data, that need not be shown to curators. These could still be useful for users of the git repo of the studies (currently this is the treenexus repo, but that name will probably change soon).

9.2 An example annotation object representation

Taken from study 1003:

{
    "id": "anno1",
    "description": "Open Tree NexSON validation", 
    "agent": "agentX",
    "checksPassed": false
}

9.3 An example agent object representation

{
    "id": "agentX",
    "description": "validator of NexSON constraints as well as constraints that would allow a study to be imported into the Open Tree of Life's phylogenetic synthesis tools", 
    "invocation": {
        "checksPerformed": [
            "MISSING_MANDATORY_KEY", 
            "MISSING_OPTIONAL_KEY", 
            "UNRECOGNIZED_KEY", 
            "MISSING_LIST_EXPECTED", 
            "DUPLICATING_SINGLETON_KEY", 
            "REFERENCED_ID_NOT_FOUND", 
            "REPEATED_ID", 
            "MULTIPLE_ROOT_NODES", 
            "NO_ROOT_NODE", 
            "MULTIPLE_EDGES_FOR_NODES", 
            "CYCLE_DETECTED", 
            "DISCONNECTED_GRAPH_DETECTED", 
            "INCORRECT_ROOT_NODE_LABEL", 
            "TIP_WITHOUT_OTU", 
            "TIP_WITHOUT_OTT_ID", 
            "MULTIPLE_TIPS_MAPPED_TO_OTT_ID", 
            "NON_MONOPHYLETIC_TIPS_MAPPED_TO_OTT_ID", 
            "INVALID_PROPERTY_VALUE", 
            "PROPERTY_VALUE_NOT_USEFUL", 
            "UNRECOGNIZED_PROPERTY_VALUE", 
            "MULTIPLE_TREES", 
            "UNRECOGNIZED_TAG", 
            "CONFLICTING_PROPERTY_VALUES", 
            "NO_TREES"
        ], 
        "commandLine": [
            "--validate"
        ]
    }, 
    "name": "normalize_ot_nexson.py", 
    "url": "https://github.com/OpenTreeOfLife/api.opentreeoflife.org", 
    "version": "0.0.1a"
}

9.4 An example message object representation

{
    "parentAnnotationId": "anno1",
    "code": "NON_MONOPHYLETIC_TIPS_MAPPED_TO_OTT_ID",
    "comment": "Multiple nodes that do not form the tips of a clade are mapped to the OTT ID \"210453\". The clades are \"node503822\" +++ \"node503824\" +++ \"node503827\" +++ \"node503832\" in \"tree(id=tree1945)\"\n", 
    "data": {
        "nodes": [
            [
                "node503822"
            ], 
            [
                "node503824"
            ], 
            [
                "node503827"
            ], 
            [
                "node503832"
            ]
        ]
    }, 
    "preserve": false, 
    "refersTo": {
        "inMeta": false, 
        "top": "trees", 
        "treeID": "tree1945", 
        "treesID": "trees1003"
    }, 
    "severity": "WARNING"
}