
Bulk TNRS (name resolution) tool

Jim Allman edited this page Apr 5, 2019 · 8 revisions

By popular demand, we're cloning the name-resolution feature from our curation web-app to support pre-publication scenarios. This will also be useful for authors of multiple studies that share common, lab-specific nomenclature, so that they don't need to repeat the mapping chores for each study.

The standalone TNRS tool will build and save a "name set" archive that includes all input files (lists of names in expected formats), output files (original vs. mapped names, in expected formats), and a JSON document that captures details of the mapping process to date, including

  • contributors/authors, possibly with github ids
  • free-form description in markdown
  • number of duplicate names squashed into one, and possibly the test for duplicate names:
    • trimmed/normalized whitespace?
    • removed capitalization?
    • forced diacritical characters to their Latin-1 or ASCII equivalents?
  • name mapping hints (same as in OTU mapping in curation tool)
    • taxonomic search context (e.g. "All life", "Mammals")
    • use exact or fuzzy matching
    • regexp substitutions to remove numeric ids, lab-specific identifiers, etc.
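The duplicate tests listed above could be sketched roughly as follows. This is only an illustration: `normalize_name` and `squash_duplicates` are hypothetical names, not part of the curation app, and the choice to keep the first spelling seen is an assumption.

```python
import unicodedata

def normalize_name(name, fold_case=True, strip_diacritics=True):
    """Build a comparison key for spotting duplicate names (hypothetical helper)."""
    # Trim the ends and collapse internal runs of whitespace to one space.
    key = " ".join(name.split())
    if strip_diacritics:
        # Decompose accented characters, then drop the combining marks.
        decomposed = unicodedata.normalize("NFKD", key)
        key = "".join(c for c in decomposed if not unicodedata.combining(c))
    if fold_case:
        key = key.casefold()
    return key

def squash_duplicates(names, **options):
    """Return (names kept, number of duplicates squashed)."""
    unique = {}  # comparison key -> first original spelling seen
    for name in names:
        unique.setdefault(normalize_name(name, **options), name)
    kept = list(unique.values())
    return kept, len(names) - len(kept)

kept, squashed = squash_duplicates(
    ["Homo sapiens", "  homo   sapiens ", "Hómo sapiens", "Pan troglodytes"]
)
```

The options passed to `normalize_name` are the kind of detail the JSON document could record alongside the duplicate count.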

Where possible, this JSON file will use Nexson conventions, but without some of the Badgerfish cruft required for NEXML conversion. For example, mapping hints would use this streamlined version (compare to its more elaborate source Nexson).

"mappingHints": {
    "description": "Aids for mapping listed names to OTT taxa", 
    "searchContext": "All life",
    "useFuzzyMatching": true,
    "substitutions": [
        {
            "active": true, 
            "old": ".*group ",
            "new": "", 
            "valid": true
        }, 
        {
            "active": false, 
            "old": ".* ([A-Z][a-z]+ [a-z.]+ [A-Z 0-9]+)$",
            "new": "$1", 
            "valid": true
        }, 
        {
            "active": true, 
            "old": ".* ([A-Z][a-z]+ [a-z.]+ [A-Za-z 0-9]+)$",
            "new": "$1", 
            "valid": true
        }
    ]
}
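As a rough sketch of how these hints might be applied to a label, the Python below runs each active, valid substitution in list order. It assumes that `"$1"` is a JavaScript-style reference to the first capture group and that substitutions are cumulative; `apply_substitutions` is a hypothetical name, not a function from the curation app.

```python
import re

def apply_substitutions(label, substitutions):
    """Apply each active, valid substitution in order (hypothetical helper)."""
    for sub in substitutions:
        if not (sub.get("active") and sub.get("valid")):
            continue
        # Assumption: "$1" is a JavaScript-style group reference, so it is
        # translated to Python's "\g<1>" form before substituting.
        replacement = re.sub(r"\$(\d+)", r"\\g<\1>", sub["new"])
        label = re.sub(sub["old"], replacement, label)
    return label

substitutions = [
    {"active": True, "old": ".*group ", "new": "", "valid": True},
    {"active": False, "old": ".* ([A-Z][a-z]+ [a-z.]+ [A-Z 0-9]+)$",
     "new": "$1", "valid": True},
    {"active": True, "old": ".* ([A-Z][a-z]+ [a-z.]+ [A-Za-z 0-9]+)$",
     "new": "$1", "valid": True},
]

adjusted = apply_substitutions(
    "Bacteria Proteobacteria impatiens DSM 12546", substitutions
)
# adjusted == "Proteobacteria impatiens DSM 12546"
```

The inactive second pattern is skipped, so only the first and third substitutions are tried here.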

The top-level structure of main.json is pretty simple, with some metadata and a list of names (streamlined from otusById in our study Nexson):

{
    "metadata": {
        "name": "Untitled nameset",
        "description": "",
        "authors": [ "jimallman", "kcranston" ],
        "date_created": "2015-09-23T15:27:43.298Z",
        "last_saved": "2015-09-25T12:11:22.8979"
    },
    "mappingHints": {
        // see above excerpt
    },
    "names": [
        // a typical example with arbitrary/serial ID
        {
            "id": "name23",
            "originalLabel": "Bacteria Proteobacteria impatiens DSM 12546",
            "adjustedLabel": "Proteobacteria",  /* WAS '^ot:altLabel' */
            "ottTaxonName": "Saccharospirillum impatiens DSM 12546",
            "ottId": 132751,
            "taxonomicSources": ["silva:A16379/#1", "ncbi:2", "worms:6", "gbif:3", "irmng:13"]
        },
        ...
    ]
}
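A minimal sketch of assembling this structure in Python might look like the following. The helper names `new_nameset` and `add_name`, the serial `nameN` id scheme, and the ISO 8601 UTC timestamps are all assumptions made for illustration; only the field names come from the example above.

```python
import json
from datetime import datetime, timezone

def new_nameset(name="Untitled nameset", authors=(), mapping_hints=None):
    """Create an empty name set with the top-level keys shown above."""
    now = datetime.now(timezone.utc).isoformat()
    return {
        "metadata": {
            "name": name,
            "description": "",
            "authors": list(authors),
            "date_created": now,
            "last_saved": now,
        },
        "mappingHints": mapping_hints or {},
        "names": [],
    }

def add_name(nameset, original_label, **mapped):
    """Append one name entry; the serial id scheme is an assumption."""
    entry = {
        "id": "name%d" % (len(nameset["names"]) + 1),
        "originalLabel": original_label,
    }
    entry.update(mapped)  # e.g. adjustedLabel, ottTaxonName, ottId
    nameset["names"].append(entry)
    return entry

nameset = new_nameset(authors=["jimallman", "kcranston"])
add_name(nameset, "Bacteria Proteobacteria impatiens DSM 12546", ottId=132751)
main_json = json.dumps(nameset, indent=4)
```

Unlike the annotated excerpt above, the serialized output contains no comments, so it can be parsed by any strict JSON reader.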

The name-set data will be saved as a basic ZIP archive with this structure:

README.txt (OR README.md)
main.json (OR nameset.json)
input/
  so-many-turtles.txt
  more-turtles.csv
  even-more-turtles.tsv
output/
  mapped-names.txt
  mapped-names.tsv
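
The archive layout above could be written with Python's standard `zipfile` module, roughly as sketched here. The function name `write_nameset_archive` and its arguments are hypothetical, and the sketch arbitrarily picks `README.txt` and `main.json` from the alternatives listed.

```python
import io
import json
import zipfile

def write_nameset_archive(target, nameset, readme, inputs, outputs):
    """Write a name-set ZIP; inputs/outputs map file names to text content."""
    with zipfile.ZipFile(target, "w", zipfile.ZIP_DEFLATED) as archive:
        archive.writestr("README.txt", readme)
        archive.writestr("main.json", json.dumps(nameset, indent=4))
        for filename, text in inputs.items():
            archive.writestr("input/" + filename, text)
        for filename, text in outputs.items():
            archive.writestr("output/" + filename, text)

# Usage: build the archive in memory and list its members.
buffer = io.BytesIO()
write_nameset_archive(
    buffer,
    nameset={"metadata": {}, "mappingHints": {}, "names": []},
    readme="Example name set\n",
    inputs={"so-many-turtles.txt": "Chelonia mydas\n"},
    outputs={"mapped-names.tsv": "Chelonia mydas\tChelonia mydas\n"},
)
members = zipfile.ZipFile(buffer).namelist()
```

Passing a file path instead of the in-memory buffer would write the same archive to disk.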