HoneyBadgerFish

Mark T. Holder edited this page Mar 20, 2015 · 82 revisions

Background

The NeXML standard (https://github.com/nexml/nexml/wiki/NeXML-Manual) describes how to express the core data of a phylogenetic study in XML. The standard also allows arbitrary key-value pairs to be added to any entity through the use of meta child elements. Each meta can either be of type LiteralMeta or ResourceMeta.

Because the Open Tree study curation app's manipulations are primarily the addition, deletion, and changing of these meta elements, it makes sense for us to make them easily accessible. In a naive transformation of NeXML to JSON, finding a meta property requires iterating through every child meta object, checking the "@property" for the desired property name, and then looking for the value in one of a few places ("@content" or "$" for LiteralMeta elements, and "@href" or "$" for ResourceMeta).
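To make the cost of the naive form concrete, here is a minimal sketch of the lookup it forces on clients. It assumes a direct BadgerFish-style translation in which the meta children sit in a "meta" property; the helper name find_meta_naive is hypothetical, and reading the property name of a ResourceMeta from "@rel" is an assumption (the text above names only "@property").

```python
def find_meta_naive(obj, prop_name):
    """Return every value stored under prop_name in a BadgerFish-style object.

    Hypothetical helper: it has to scan each child meta object in turn.
    """
    metas = obj.get("meta", [])
    if isinstance(metas, dict):  # BadgerFish maps a single child to a bare object
        metas = [metas]
    found = []
    for meta in metas:
        # Assumption: a ResourceMeta names its property in "@rel".
        name = meta.get("@property", meta.get("@rel"))
        if name != prop_name:
            continue
        # The value may live in any one of several places.
        for key in ("@content", "@href", "$"):
            if key in meta:
                found.append(meta[key])
                break
    return found


otu = {
    "@id": "otu88801",
    "meta": [
        {"@property": "ot:ottId", "@xsi:type": "nex:LiteralMeta", "$": "415973"},
        {"@rel": "ot:studyPublication", "@xsi:type": "nex:ResourceMeta",
         "@href": "http://dx.doi.org/10.3732/ajb.94.12.2026"},
    ],
}
ott_ids = find_meta_naive(otu, "ot:ottId")
```

Every lookup is a linear scan, and the caller still has to coerce the string it gets back.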

The ot:* key-value pairs that the Open Tree project is using to add extra info are documented on the NexSON page.

Rules for translating a NeXML element to NexSON with @nexml2json=1.0.*

The NexSON files are produced using a syntactic convention based on the BadgerFish convention (see below).

The XML tree will be mirrored as a tree of JS objects. The topmost object contains the root of the XML tree. Each element in the NeXML is processed using the following rules, such that an XML element becomes a JS object inside its parent.

The first 4 rules only deviate slightly from BadgerFish (see Note in rule #3).

  1. The XML element name becomes the name of the property in the parent JS object.

  2. The text value of the XML element is contained in the $ property of the object. Whitespace is stripped from the ends. If the text value of an XML element is broken up by intervening child elements, the $ of the object is produced by stripping leading and trailing whitespace from each fragment and concatenating fragments.

  3. The child elements in XML map to an array of objects. Note: in BadgerFish single elements are mapped to single JS objects. In the NeXML schema, all of the core objects can be repeated, so an array (of any length) is a more natural mapping. Missing elements are omitted (not written as empty arrays).

  4. XML attributes become properties of the object, with a name formed by prefixing @ to the attribute name. So: <alice charlie="david">bob</alice> at the top level would become:

{"alice": [{
    "$" : "bob", 
    "@charlie" : "david" }]}
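Rules 1-4 can be sketched with Python's standard xml.etree.ElementTree module. This is an illustration only, not the project's translator: the function names are hypothetical, and the sketch ignores namespaces and the special handling of meta elements described in the later rules.

```python
import xml.etree.ElementTree as ET


def element_to_obj(elem):
    """Map one XML element to a JS-style object following rules 2-4."""
    obj = {}
    # Rule 4: attributes become "@"-prefixed properties.
    for name, value in elem.attrib.items():
        obj["@" + name] = value
    # Rule 2: strip each text fragment, then concatenate the fragments.
    fragments = [elem.text or ""] + [child.tail or "" for child in elem]
    text = "".join(fragment.strip() for fragment in fragments)
    if text:
        obj["$"] = text
    # Rule 3: child elements are always collected into arrays.
    for child in elem:
        obj.setdefault(child.tag, []).append(element_to_obj(child))
    return obj


def document_to_obj(xml_text):
    """Rule 1: the element name becomes the property name in the parent."""
    root = ET.fromstring(xml_text)
    return {root.tag: [element_to_obj(root)]}


converted = document_to_obj('<alice charlie="david">bob</alice>')
```

Here converted is the same structure as the JSON shown above: an "alice" array holding one object with "$" and "@charlie" properties.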

Rules 5 and 6 deal with XML namespaces. They mainly differ from BadgerFish in that the namespaces are only added to the root object:

  5. The default namespace becomes the $ property of an @xmlns object, and other namespaces become properties of that object. The names of the properties are the names of XML namespaces without the "xmlns:" qualifier. So
<alice xmlns="http://some-namespace" xmlns:charlie="http://some-other-namespace">bob</alice>

as a top-level object becomes:

{"alice": [{
    "$" : "bob", 
    "@xmlns" : {
        "$": "http://some-namespace",
        "charlie": "http://some-other-namespace"}}]}

Unlike BadgerFish, this @xmlns is only added to the root object.

  6. A prefix in an element or attribute name is just treated as part of the name (no substitution of the URL or cropping of the element name to exclude the prefix).

Rules 7-9 are special case handling of meta elements:

  7. If an element has a meta child element with xsi:type="nex:LiteralMeta" then it must have
  • a property attribute; we will call the value of this attribute prop-value;
  • a datatype attribute specifying which xsd: datatype the element holds; we will call the value of this attribute datatype-value; and
  • the data in a content attribute OR in the text content of the element; we will call this the content-value.

This sort of meta element will appear in the parent object under a name with a ^ prefix followed by prop-value. The content-value will be coerced to the JavaScript type that corresponds to datatype-value.

The exact representation of the property depends on what needs to be conveyed:

  • Rule 7A: If there are no other attributes of the meta element needing to be mapped, then the key-value pair will have a JS primitive type as its value.
  • Rule 7B: If there are other attributes that need to be written (such as an id attribute), then the value will be a JS object with content-value stored in the $ field.
  8. If an element has a meta child element with xsi:type="nex:ResourceMeta" then it must have
  • a rel attribute; we will call the value of this attribute prop-value;
  • the data in an href attribute OR a nested meta element; we will call this the content-value.

This sort of meta element will appear in the parent object under a name with a ^ prefix followed by prop-value. The value will be a JS object with:

  • Rule 8A: if the data is in an href attribute, then an @href property will hold the href string.
  • Rule 8B: if a nested meta element holds the data, then a $ property will map to a JavaScript object that holds the representation of the inner meta.
  9. Many of the meta properties can only occur once per element. To streamline the meta encoding (and as an exception to Rule 3 above) we use the BadgerFish convention for dealing with cardinality:
  • Rule 9A: If there is one element that maps to a property name, the value is the object described above (either a primitive for simple nex:LiteralMeta-type elements, or a full JS object otherwise).
  • Rule 9B: If there are multiple elements that map to a property name, then the value of the property is an array which holds each of the object representations as described above.

Note that the type hints (datatype and xsi:type attributes) are not present in the JSON. Reverse translation is possible by relying on the following rules:

  • If the value is a primitive, then nex:LiteralMeta will be used.
  • If the value is an object with a $ that is a primitive, then nex:LiteralMeta will be used.
  • If the value is an object with an @href property, then nex:ResourceMeta will be used.
  10. If there is an about attribute with a value that refers to the same element's id, then the @about is omitted from the JSON.

  11. The top-level object in JSON will have a @nexml2json property that maps to a version string such as "1.0.0a" or "1.0.0". Direct BadgerFish translations to JSON will lack this property, or will have a version string that starts with "0." (because most projects tweak the BadgerFish rules at least a little bit, it seems like a good idea to leave some room in the 0.* namespace for distinguishing between versions of JSON produced by those conventions).
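The meta rules (7-9) can be illustrated with a short sketch that encodes already-parsed meta attributes. Everything here is hypothetical scaffolding rather than the project's code: the function names are invented, the COERCIONS table is an assumed mapping from xsd: datatypes to Python types, and nested ResourceMeta elements (Rule 8B) are not handled.

```python
# Assumed coercions; the datatypes the real translator handles are not listed here.
COERCIONS = {
    "xsd:integer": int, "xsd:int": int, "xsd:long": int,
    "xsd:float": float, "xsd:double": float,
}


def encode_literal_meta(attrs, text=""):
    """Rules 7A/7B: return (key, value) for a nex:LiteralMeta element."""
    raw = attrs["content"] if "content" in attrs else text.strip()
    value = COERCIONS.get(attrs.get("datatype"), str)(raw)
    handled = ("property", "datatype", "content", "xsi:type")
    extra = {"@" + k: v for k, v in attrs.items() if k not in handled}
    if extra:                       # Rule 7B: wrap the value in an object
        extra["$"] = value
        value = extra
    return "^" + attrs["property"], value


def encode_resource_meta(attrs):
    """Rule 8A: return (key, value) for a nex:ResourceMeta with an href."""
    handled = ("rel", "xsi:type")
    value = {"@" + k: v for k, v in attrs.items() if k not in handled}
    return "^" + attrs["rel"], value


def add_meta(parent, key, value):
    """Rule 9: a bare value for one occurrence, an array for several."""
    if key not in parent:
        parent[key] = value                 # Rule 9A
    elif isinstance(parent[key], list):
        parent[key].append(value)           # Rule 9B, third or later occurrence
    else:
        parent[key] = [parent[key], value]  # Rule 9B, second occurrence
    return parent


root = {}
for tag in ("cpDNA", "ingroup added"):
    attrs = {"property": "ot:tag", "datatype": "xsd:string"}
    add_meta(root, *encode_literal_meta(attrs, tag))
```

After the loop, root holds a single "^ot:tag" key whose value is an array of the two tag strings.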

Caveats

There are three ways (that we are aware of) that roundtrip of XML -> JSON -> XML might not result in identical syntax:

  1. The attribute and element order is not preserved. This is a trivial barrier to using diff to test roundtrips, but not a serious issue.

  2. Introspection will provide the datatype of nex:LiteralMeta elements. This means that xsd:integer and xsd:float values will be used for integer and floating point numbers. Thus the details of the meta properties (e.g. integer vs long or float vs double) may not be "round-trip-able". We do not know of cases in NeXML documents in which these fine-grained distinctions of type are needed.

  3. A LiteralMeta form of meta can store its value in a content attribute or the text body of the element. Both of these map to $ in JSON, so the exact placement cannot be recovered. This is not a substantive concern, as there is no indication in the NeXML standard that the two locations for the data should affect handling of the data.

Example

The NeXML snippet below was pieced together from multiple files, so it does not make sense biologically. It was constructed to be valid NeXML and to show a diversity of the meta cases that introduce complexity.

The version-controlled home for the file is at https://github.com/OpenTreeOfLife/api.opentreeoflife.org/blob/roundtrip2xml/nexson-validator/tests/nexml/otu.xml

<?xml version="1.0" encoding="UTF-8"?>
<nex:nexml
  xmlns:nex="http://www.nexml.org/2009"
  xmlns="http://www.nexml.org/2009"
  version="0.9"
  xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
  xmlns:ot="http://purl.org/opentree/nexson"
  xmlns:xsd="http://www.w3.org/2001/XMLSchema#"
  xmlns:tb="http://purl.org/phylo/treebase/2.0/terms#"
  xmlns:skos="http://www.w3.org/2004/02/skos/core#">
   <meta property="ot:tag" xsi:type="nex:LiteralMeta" datatype="xsd:string">cpDNA</meta>
   <meta property="ot:tag" xsi:type="nex:LiteralMeta" datatype="xsd:string">ingroup added</meta>
   <meta property="ot:candidateTreeForSynthesis" xsi:type="nex:LiteralMeta" datatype="xsd:string">tr1</meta>
   <otus id="ob1">
    <otu about="#otu88801" id="otu88801" label="Ancyromonas sigmoides">
      <meta property="ot:ottId" xsi:type="nex:LiteralMeta" datatype="xsd:integer">415973</meta>
      <meta property="ot:originalLabel" id="bogus" xsi:type="nex:LiteralMeta" datatype="xsd:string">Ancyromonas sigmoides</meta>
      <meta href="http://dx.doi.org/10.3732/ajb.94.12.2026" rel="ot:studyPublication" xsi:type="nex:ResourceMeta"/>
      <meta content="7002" datatype="xsd:long" id="m0" property="tb:identifier.taxon" xsi:type="nex:LiteralMeta"/>
      <meta href="http://purl.uniprot.org/taxonomy/94215" id="meta4912509" rel="skos:closeMatch" xsi:type="nex:ResourceMeta"/>
      <meta href="http://purl.uniprot.org/taxonomy/102624" id="meta4912517" rel="skos:closeMatch" xsi:type="nex:ResourceMeta"/>
    </otu>
  </otus>
  <trees id="tb1" otus="ob1">
    <tree id="tr1"  xsi:type="nex:FloatTree">
      <node id="n1" otu="otu88801"/>
      <node id="n0"/>
      <edge id="e0" source="n0" target="n1"/>
    </tree>
  </trees>
</nex:nexml>

will be represented as (there is not much of interest after the otu object):

{
  "nex:nexml": {
    "@version": "0.9", 
    "@xmlns": {
      "$": "http://www.nexml.org/2009", 
      "nex": "http://www.nexml.org/2009", 
      "ot": "http://purl.org/opentree/nexson", 
      "skos": "http://www.w3.org/2004/02/skos/core#", 
      "tb": "http://purl.org/phylo/treebase/2.0/terms#", 
      "xsd": "http://www.w3.org/2001/XMLSchema#", 
      "xsi": "http://www.w3.org/2001/XMLSchema-instance"
    }, 
    "^ot:candidateTreeForSynthesis": "tr1",  # Rule 7A, 9A
    "^ot:tag": ["cpDNA", "ingroup added"],   # Rule 7A, 9B
    "otus": [{
        "@id": "ob1", 
        "otu": [{
            "@id": "otu88801", 
            "@label": "Ancyromonas sigmoides", 
            "^ot:originalLabel": {                        # Rule 7B, 9A
               "$": "Ancyromonas sigmoides",
               "@id": "bogus"
            },
            "^ot:ottId": 415973,                          # Rule 7A, 9A
            "^ot:studyPublication": {                     # Rule 8A, 9A
              "@href": "http://dx.doi.org/10.3732/ajb.94.12.2026"
            }, 
            "^skos:closeMatch": [{                        # Rule 8A, 9B
              "@href": "http://purl.uniprot.org/taxonomy/94215", 
              "@id": "meta4912509"},{
              "@href": "http://purl.uniprot.org/taxonomy/102624", 
              "@id": "meta4912517"
            }], 
            "^tb:identifier.taxon": {                     # Rule 7B, 9A
              "$": 7002,
              "@id": "m0"
            }
          }
        ]
      }], 
    "trees": [{
        "@id": "tb1", 
        "@otus": "ob1", 
        "tree": [{
            "@id": "tr1", 
            "@xsi:type": "nex:FloatTree", 
            "edge": [{
                "@id": "e0", 
                "@source": "n0", 
                "@target": "n1"
              }
            ], 
            "node": [{
                "@id": "n1", 
                "@otu": "otu88801"
              },{
                "@id": "n0"
              }
            ]
          }
        ]
      }
    ]
  }
}
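Reading meta values from this form might look like the following sketch. The names meta_values and infer_meta_type are hypothetical; the second function applies only the three reverse-translation cases listed in the rules and returns None for anything else (for example a nested meta under Rule 8B).

```python
def meta_values(obj, prop_name):
    """Return the values of a meta property as a list (rules 9A and 9B)."""
    key = "^" + prop_name
    if key not in obj:
        return []
    value = obj[key]
    return value if isinstance(value, list) else [value]


def infer_meta_type(value):
    """Guess the xsi:type to emit when translating one value back to NeXML."""
    if not isinstance(value, dict):
        return "nex:LiteralMeta"      # bare primitive (Rule 7A)
    if "@href" in value:
        return "nex:ResourceMeta"     # Rule 8A
    if "$" in value and not isinstance(value["$"], (dict, list)):
        return "nex:LiteralMeta"      # Rule 7B
    return None                       # not covered by the three listed cases


# A subset of the otu object from the example above.
otu = {
    "@id": "otu88801",
    "^ot:ottId": 415973,
    "^ot:originalLabel": {"$": "Ancyromonas sigmoides", "@id": "bogus"},
    "^skos:closeMatch": [
        {"@href": "http://purl.uniprot.org/taxonomy/94215", "@id": "meta4912509"},
        {"@href": "http://purl.uniprot.org/taxonomy/102624", "@id": "meta4912517"},
    ],
}
match_types = [infer_meta_type(v) for v in meta_values(otu, "skos:closeMatch")]
```

Wrapping single values in a list lets client code treat the 9A and 9B cases uniformly.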

Rules for translating a NeXML element to NexSON with @nexml2json=1.1.*

We can probably avoid supporting this form - it was proposed in email, but not implemented.

This representation is very similar to @nexml2json=1.0.*, with the following exception: a "byId" representation is used for some fields rather than an array. In this representation:

"byID" representation v1.2:

  • A single object is used in place of an array in the 1.0.0 syntax.
  • The only permitted keys in the object are the id attributes of the elements.
  • The value associated with each key is an object identical to the 1.0.0 representation, except that the @id is not included.
  • The NeXML form of the object is a sequence of elements, one for each key-value pair.
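The conversion between the array form and the "byId" form can be sketched as follows. Both function names are hypothetical. The id order is returned separately because a JSON object carries no guaranteed key order.

```python
def array_to_by_id(elements):
    """Convert a 1.0.*-style array of objects into (by_id, id_order)."""
    by_id = {}
    order = []
    for element in elements:
        element_id = element["@id"]
        # Copy everything except "@id", which becomes the key.
        by_id[element_id] = {k: v for k, v in element.items() if k != "@id"}
        order.append(element_id)
    return by_id, order


def by_id_to_array(by_id, order=None):
    """Rebuild the array form, restoring "@id" on each element."""
    ids = order if order is not None else list(by_id)
    return [{"@id": element_id, **by_id[element_id]} for element_id in ids]


nodes = [{"@id": "n1", "@otu": "otu88801"}, {"@id": "n0"}]
node_by_id, node_order = array_to_by_id(nodes)
```

Passing node_by_id and node_order back to by_id_to_array reproduces the original array.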

Specifically:

  1. Instead of node and edge arrays, the tree representation is expressed as:
    • internalEdge and terminalEdge arrays instead of edge (which, if concatenated, would recreate the edge array of the 1.0.* representation).
    • leafById and internalNodeById objects are used instead of a node array.
    • The ^ot:isLeaf field is omitted (since presence in leafById conveys this info).
    • An otuByID object replaces an otu array.
    • An otusByID object replaces an otus array, and the parent (nexml) object will have a ^ot:otusElementOrder key with an array of otus IDs to supply the order of the otus elements.
    • A treesByID object replaces a trees array, and the parent (nexml) object will have a ^ot:treesElementOrder key with an array of trees IDs to supply the order of the trees elements.
    • A trees group object will have a ^ot:treeElementOrder key with an array of tree IDs to supply the order of the tree elements.

Rules for translating a NeXML element to NexSON with @nexml2json=1.2.*

This is the form (1.2.1) that MTH thinks should be stored in serialized form, but on-the-fly translation could make that decision less important for tools other than the api.opentree.org services.

This is the same as syntax 1.1.* except:

  1. The internalEdge and terminalEdge arrays are replaced by an edgeBySourceId object with the following rules:
  • The only permitted keys in the object are the @source attributes of the edges.
  • The value associated with each key is an object whose keys are the ids of the edges that have that @source. The @source would not need to be included in a minimally sized representation, but it is retained because most clients will want to create "edgeById" and/or "edgeByTargetId" maps; the duplication here allows all 3 maps to share references to the same object. Note: in 1.2.0 the value was an array of edges; that is no longer supported by peyotl.
  2. Each object in the tree array will have a "^ot:rootNodeId" property that holds the ID of the node of the tree that is not the @target of any edge. The @root property is still retained in that node. The "^ot:specifiedRoot" is not identical to this, because that property is used to determine if the rooting is arbitrary.

  3. Instead of leafById and internalNodeById there is just a nodeById object; there is still no ^ot:isLeaf required because internal node ids will be keys in edgeBySourceId, enabling a fast answer to the "isLeaf" question.
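As an illustration of the edgeBySourceId and "^ot:rootNodeId" rules, the sketch below derives both from a 1.0.*-style edge array. The function names are hypothetical, and dropping @id from each stored edge is an assumption carried over from the "byId" convention.

```python
def edges_by_source_id(edges):
    """Group a 1.0.*-style edge array by @source, then by edge id."""
    by_source = {}
    for edge in edges:
        # "@source" stays in the stored object so other maps can share it;
        # "@id" is dropped because it becomes the inner key (assumption).
        stored = {k: v for k, v in edge.items() if k != "@id"}
        by_source.setdefault(edge["@source"], {})[edge["@id"]] = stored
    return by_source


def find_root_node_id(node_ids, edges):
    """Return the one node id that is not the @target of any edge."""
    targets = {edge["@target"] for edge in edges}
    roots = [node_id for node_id in node_ids if node_id not in targets]
    if len(roots) != 1:
        raise ValueError("expected exactly one root, found %d" % len(roots))
    return roots[0]


def is_leaf(node_id, by_source):
    """A node is a leaf when it is not the source of any edge."""
    return node_id not in by_source


edges = [{"@id": "e0", "@source": "n0", "@target": "n1"}]
edge_by_source = edges_by_source_id(edges)
root_id = find_root_node_id(["n1", "n0"], edges)
```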

This representation allows for a very rapid construction of the tree:

  • Start at "^ot:rootNodeId"
  • build the tree in preorder by looking up all of the outgoing edges in edgeBySourceId

Each of these lookups can be done in constant time, so the tree can be built in O(N) time without any code to deal with partially connected trees during the building process and without any additional memory. Subtrees can also be built by starting at the MRCA.
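The traversal described above can be sketched with a small generator. The function name and the toy tree are hypothetical. Only the keys "^ot:rootNodeId", "edgeBySourceId", "@source" and "@target" come from the text, and the sketch keeps an explicit stack of pending node ids for simplicity.

```python
def preorder_node_ids(tree, start_id=None):
    """Yield node ids in preorder, starting at ^ot:rootNodeId by default."""
    edge_by_source = tree["edgeBySourceId"]
    stack = [tree["^ot:rootNodeId"] if start_id is None else start_id]
    while stack:
        node_id = stack.pop()
        yield node_id
        # Leaves are absent from edgeBySourceId, so they add no children.
        out_edges = edge_by_source.get(node_id, {})
        children = [edge["@target"] for edge in out_edges.values()]
        # Push in reverse so children are visited in their stored order.
        stack.extend(reversed(children))


tree = {
    "^ot:rootNodeId": "n0",
    "edgeBySourceId": {
        "n0": {"e0": {"@source": "n0", "@target": "n1"},
               "e1": {"@source": "n0", "@target": "n2"}},
        "n2": {"e2": {"@source": "n2", "@target": "n3"},
               "e3": {"@source": "n2", "@target": "n4"}},
    },
}
visit_order = list(preorder_node_ids(tree))
```

Passing a start_id (for example the id of an MRCA node) traverses only the subtree below that node.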

BadgerFish

BadgerFish is one of several schemes for rendering XML as JSON. Several sites, including a site that appears to be the original, and several refinements were consulted in developing the mapping appropriate for NeXML.

Correctness of translation was verified by using a backtranslator and validating the resulting XML using the validator on the NeXML home page.

Justification for straying from BadgerFish

We were straying from strict BadgerFish by not emitting the active XML namespaces in each object, and occasionally omitting the "datatype" for "meta" elements.

Given that roundtripping a file required special tools, we decided to take the leap and clean up several aspects of the BadgerFish mapping to make data access easier on clients and reduce the size of NexSON.

BadgerFish version of NeXML @nexml2json=0.* or @nexml2json absent

MTH intends to add logic to the API code to produce our old form (close to a straight BadgerFish conversion) via the API layer if a call includes an output_nexml2json=0.* argument.

Contributors

Jim Allman, Karen Cranston, Cody Hinchliff, Mark Holder, Peter Midford, and Jonathan Rees participated in discussions and design of NexSON.