jhpoelen edited this page Sep 12, 2019 · 13 revisions

TODO format page and document existing and future quality control methods.

potential data quality problems

Problems could originate at the source or happen during GloBI integration

  • incorrect association type
    • wrong association type reported at source, e.g., pollination reported when only flower visitation was observed
    • association type information is misinterpreted or mistranscribed when data are imported into GloBI
  • incorrect geographic data
  • incorrect temporal data
  • incorrect taxonomic data
    • taxon misidentified at source
    • suboptimal taxon name used at source (synonym, misspelling, unresolved name)
    • poor taxonomic information from GloBI taxonomy providers (e.g., a name is mapped as a synonym when it is considered a valid name by stakeholders)
  • incorrect provenance data (contributors, publishers, sources, references, etc.)

identify metrics

  • track number of erroneous records reported by user community over time
  • track rate of resolution of errors reported by user community
  • track proportion of georeferenced records
  • track proportion of records derived from scholarly sources
  • track proportion of vouchered records (specimens, media)
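The proportion-based metrics above could be computed along the following lines. This is a minimal sketch, not GloBI's implementation: the `InteractionRecord` layout and its field names are hypothetical, and treating the presence of a DOI as a sign of a scholarly source is an assumption made only for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Optional


@dataclass
class InteractionRecord:
    """Hypothetical record layout; field names are illustrative, not GloBI's schema."""
    latitude: Optional[float] = None
    longitude: Optional[float] = None
    doi: Optional[str] = None         # assumption: a DOI marks a scholarly source
    voucher_id: Optional[str] = None  # assumption: specimen or media identifier


def proportion(records: Iterable[InteractionRecord],
               predicate: Callable[[InteractionRecord], bool]) -> float:
    """Return the fraction of records that satisfy predicate (0.0 if there are none)."""
    records = list(records)
    if not records:
        return 0.0
    return sum(1 for r in records if predicate(r)) / len(records)


def is_georeferenced(record: InteractionRecord) -> bool:
    """A record counts as georeferenced when both coordinates are present."""
    return record.latitude is not None and record.longitude is not None


def is_scholarly(record: InteractionRecord) -> bool:
    """Assumption: a non-empty DOI is used as a proxy for a scholarly source."""
    return bool(record.doi)


def is_vouchered(record: InteractionRecord) -> bool:
    """A record counts as vouchered when it carries a voucher identifier."""
    return bool(record.voucher_id)
```

Calling `proportion(records, is_georeferenced)` on each archived GloBI version would give one data point per version, which is enough to follow a metric over time.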

existing quality control methods

  • all taxonomic names are cross-referenced against many external taxonomies / naming services, including EOL, Global Names, ITIS, WoRMS, GulfBase and GBIF. Names that do not match any existing name are reported in the unmatchedTaxa.csv report.
  • all latitude/longitude locations are validated to be in the [-90, 90] and [-180, 180] ranges, respectively.
  • various interactive data browsers (e.g. http://globalbioticinteractions.org, http://gomexsi.tamucc.edu, http://eol.org/) are available to spot possible errors. Reporting of suspicious interactions is facilitated by GitHub issues and an easily shareable URL syntax (e.g. https://github.com/globalbioticinteractions/globalbioticinteractions.github.io/issues/59).
  • citations and DOIs associated with interaction data are resolved using http://crossref.org, making it possible to detect citations that are potentially invalid or malformed.
  • the use of open data and an open source approach makes it possible to openly track and resolve issues, creating a sense of shared ownership and trust.
  • the integration of all data into GloBI is fully automated, as is the generation of GloBI data products such as the Darwin Core archive and the Neo4j graph database instance. This makes it possible to propagate bug fixes and data record updates in a reproducible manner.
  • all interaction data in GloBI can be traced to its source, making it possible for a data consumer to contact the curator or author of the source data to propose changes or discuss issues.
  • GloBI products are versioned and archived, making it easier to reproduce and detect data errors that were observed at a specific time.
  • GloBI can be queried through an open data browser at http://neo4j.globalbioticinteractions.org using the Cypher query language. This enables the creation of custom queries that are specifically designed to detect data issues.
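The latitude/longitude range check listed above can be illustrated with a short sketch. The function name `is_valid_coordinate` is hypothetical and the code is not taken from the GloBI code base; it only shows the range rule, with an added assumption that non-numeric and non-finite values are rejected as well.

```python
import math


def is_valid_coordinate(latitude, longitude) -> bool:
    """Return True when latitude is in [-90, 90] and longitude is in [-180, 180].

    Assumption: values that are not finite numbers (None, strings, NaN,
    infinity) are treated as invalid rather than raising an error.
    """
    for value in (latitude, longitude):
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            return False
        if not math.isfinite(value):
            return False
    return -90.0 <= latitude <= 90.0 and -180.0 <= longitude <= 180.0
```

A common source of out-of-range values is swapped latitude and longitude, so a record that fails this check but passes with the arguments reversed may be worth reporting separately.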

ideas for quality control methods

  • flag data sets with large numbers of errors reported
  • develop a system for flagging outliers/omitting records based on:
    • taxonomy, e.g., all spider prey in the database is in Arthropoda, except for three records (octopus, fish, bird), which need to be curated. (see here for a query that shows all spider prey that are not arthropods, including the associated study and source).
    • TraitBank terrestrial/marine data, e.g., there should not be any marine organisms in the middle of continents (no sea cucumbers in Hungary) or terrestrial organisms in the middle of the sea. (see this Cypher query example for getting a list of eco-regions in which sea cucumbers (Holothuroidea) were reported).
    • TraitBank first/last occurrence (in geologic time), e.g., there should not be associations between organisms that have non-overlapping time horizons.
  • create a curated list of invalid data for particular association types based on taxonomy, e.g.:
    • plants do not prey on or pollinate animals
    • only flowering plants can have flower visitors
    • only animals can visit flowers
    • vertebrates do not parasitize invertebrates and cannot be considered pathogens
  • identify gaps in data coverage
    • broad taxonomic survey: which species/genera/families have no/low/high association coverage?
      • would be more powerful based on a comprehensive taxonomy, but we could get a head start with one of the current larger reference taxonomies (e.g., Catalogue of Life)
    • surveys based on other variables recorded in TraitBank, e.g.:
      • what proportion of known herbivores/carnivores/decomposers have diet/host-related association data?
      • what proportion of known parasites and pathogens have host-related association data?
      • what proportion of flowering plants has flower visitation/pollination data?
      • what proportion of land plants has herbivore data?
      • what proportion of mycorrhizal fungi has host plant data?
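As a rough illustration of the taxonomy-based flagging ideas above, a curated rule list could be expressed as small functions that each return a reason when a record looks suspicious. Everything here is a sketch under assumptions: the `Interaction` layout, the lineage tuples, and the interaction type labels (`preysOn`, `pollinates`, `visitsFlowersOf`) are illustrative and are not claimed to match GloBI's data model. Records are flagged for curation rather than removed.

```python
from typing import Callable, List, NamedTuple, Optional, Tuple


class Interaction(NamedTuple):
    """Hypothetical record: lineages are tuples of taxon names, broadest first."""
    source_lineage: Tuple[str, ...]
    interaction_type: str
    target_lineage: Tuple[str, ...]


def plant_preys_on_or_pollinates_animal(i: Interaction) -> Optional[str]:
    if ("Plantae" in i.source_lineage
            and i.interaction_type in {"preysOn", "pollinates"}
            and "Animalia" in i.target_lineage):
        return "plants do not prey on or pollinate animals"
    return None


def non_animal_flower_visitor(i: Interaction) -> Optional[str]:
    if i.interaction_type == "visitsFlowersOf" and "Animalia" not in i.source_lineage:
        return "only animals can visit flowers"
    return None


def spider_prey_outside_arthropoda(i: Interaction) -> Optional[str]:
    if ("Araneae" in i.source_lineage
            and i.interaction_type == "preysOn"
            and "Arthropoda" not in i.target_lineage):
        return "spider prey outside Arthropoda"
    return None


# Curated rule list; rules are evaluated in this order.
RULES: List[Callable[[Interaction], Optional[str]]] = [
    plant_preys_on_or_pollinates_animal,
    non_animal_flower_visitor,
    spider_prey_outside_arthropoda,
]


def flag_interaction(interaction: Interaction) -> List[str]:
    """Return the reasons (possibly none) why a record should be reviewed."""
    reasons = []
    for rule in RULES:
        reason = rule(interaction)
        if reason is not None:
            reasons.append(reason)
    return reasons
```

Keeping each rule as a separate function makes it straightforward to add, remove, or discuss individual rules in an issue, and the returned reasons can be attached to the flagged record for whoever curates it.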