Skip to content
Dan Leehr edited this page Jan 15, 2015 · 13 revisions

Data Import

TraitDB loads data in from CSV files. There are some high-level requirements to the structure of the CSV files, but most of the import is controlled by writing one or more Import Configs.

High-level requirements

File Encoding

CSV files should be UTF-8 encoded. A common workflow is to use Microsoft Excel to build up a spreadsheet and save as CSV. Sometimes, these files are not UTF-8 encoded, but you can use a free application like TextWrangler or gedit to open a CSV file and save in UTF-8 format.

Layout / Formatting

TraitDB expects CSV files to be laid out as follows:

  1. The first row must be a header row - to provide column names.
  2. Column names must be unique
  3. Any row after the first row must be a data row.

Example CSV file

Family Genus Species Chromosome Number source: Chromosome Number Karyotype source: Karyotype Comments Other
Rutaceae Citrus xlimon 36 doi:10.9999/abc123 ZW http://karyotypes.io Fascinating! abc
Rutaceae Citrus cavaleriei 40 doi:10.9999/xyz456 ZW http://karyotypes.io Not so fascinating xyz

Columns

There are three types of columns that are imported by TraitDB: Taxonomy, Traits, and Metadata. The taxonomy is a series of ordered names used to identify an operational unit, to which trait observations can be recorded. Traits are categorical or continuous values that have been observed of the taxonomic unit. Each trait column can have associated source/reference columns as well as free-text note columns. Metadata is used to capture any other data that should be recorded for the unit, but is not necessarily a trait. An Import Config instructs the TraitDB importer which columns are which. Columns not configured are ignored.

1. Taxonomy

As described earlier, the taxonomy columns are used to identify an operational taxonomic unit (OTU).

Family Genus Species
Rutaceae Citrus xlimon
Rutaceae Citrus cavaleriei

In this example we have two OTUs. Rutaceae-Citrus-xlimon and Rutaceae-Citrus-cavaleriei. Taxonomy is hierarchical, so these OTUs will share the same Family (Rutaceae) and Genus (Citrus).

For best results, the taxonomy in the dataset should not have any empty cells. TraitDB treats each row as an independent piece of data, so each row should contain every level of the taxonomy known, to ensure the taxonomic hierarchy is consistent.

TraitDB has an internal registry of taxonomic level names. Any taxonomy columns in your dataset must be mapped to an internal name by the Import Config.

Name Level
kingdom 100
htg 200
order 300
family 400
genus 500
species 600
species_author 700
infraspecific 800

In our example, the mapping would be

  • family -> Family
  • genus -> Genus
  • species -> Species

The names are case sensitive. The registry exists in the database and can be altered after installation.

2. Traits

Traits are recorded observations of an OTU. A trait column can either be continuous (numeric) or categorical (discrete values). Free-text traits are not supported.

Chromosome Number source: Chromosome Number Karyotype source: Karyotype
36 doi:10.9999/abc123 ZW http://karyotypes.io
40 doi:10.9999/xyz456 ZW http://karyotypes.io

In this case, Chromosome Number is a continuous trait and Karyotype is a categorical trait.

The continuous value type (floating point or integer) as well as the allowed categorical trait values are configured in the Import Config. Categorical traits support multiple values, separated by a delimeter. Continuous trait values are stored in the database as floating point values, but formatted on search/download.

If configured to require source information, your dataset should have a source column for each trait column. In our example, data in the source: Karyotype column corresponds to values in the Karyotype column. Source information is recorded as text. If the cell contains a URI or DOI, they will be hyperlinked when browsing results.

The TraitDB import process will report an error for a row if it includes source information for a trait but no trait value.

In addition to trait-specific source columns, TraitDB supports trait-specific notes columns. In this case, we could add a column called notes: Karyotype, and include notes specific to the Karyotype trait for an OTU.

If an OTU is not coded for a given trait, the cell for that trait should be empty in the CSV file. Additionally, the source and notes cells for that trait/OTU should be empty.

The import process is extremely aggressive in validating data sets for valid trait values. Whitespace will be trimmed around values, but non-numeric values in continuous traits or non-matching categorical trait values will cause the entire row to be skipped at upload.

3. Metadata

Metadata values in TraitDB represent non-trait data that describe an OTU. They are freeform text, so they can be used for notes or additional annotations. Metadata column names must be identified in the Import Config.

Comments
Fascinating!
Not so fascinating

Like taxonomy, the Import Config maps column names in your CSV file to an internal name in the database. This allows you two map columns in different datasets to the same metadata field. For example, one CSV file may have a "notes" column, and another may have "comments". By mapping both notes and comments to the same internal name, you can store this data in the same field without altering your CSV file. Unlike taxonomy, the mapping targets aren't predefined.

In our example, we would map

  • comments -> Comments

4. Ignored

Any columns in your spreadsheet that are not covered by one of the above categories/behaviors will be ignored. This is a design decision that allows you to specify exactly which columns get imported into TraitDB. A side effect of this decision is that columns that contain important data but have typos or case-mismatches in their names will be missed.

Other
abc
xyz

If the Other column is not identified as taxonomy, trait, or metadata, it will be skipped entirely.

Import Configs

In order for TraitDB to import your datasets, it needs some information about their formatting and conventions. This is accomplished by writing a YAML that describes your dataset and uploading it to the project. Import configs can be uploaded to TraitDB by administrators.

Simple Example

Below is a simple example of an Import Config:

traitdb_spreadsheet_template:
  template_name: Sample Config
  # All taxonomy columns specified here are required
  taxonomy_columns: # keys refer to internal TraitDB identifiers.  Values are the column names in CSV
    family: Family
    genus: Genus
    species: species
  metadata_columns: # keys again are the internal TraitDB identifiers.
    comments: Comments
  trait_options:
    source_prefix: "source: " # include the space after the colon
    require_source: true # source rules for formatting?
    notes_prefix: "notes: " #include the space after the colon
    value_separator: "|" # what separates multiple values in an entry
  categorical_trait_columns:
    -
      name: Karyotype
      values:
        - ZO
        - ZW
        - XY
      summarization_method: "collect" # Optional, method for aggregating trait values when summarizing over higher level taxa
      name: Hybrid
      values:
        - yes
        - no
  continuous_trait_columns:
    -
      name: Chromosome Number
      format: integer # formats are only for display
      summarization_method: "avg" # Optional, method for aggregating trait values when summarizing over higher level taxa
    -
      name: Avg Mass
      format: float

Basic structure

traitdb_spreadsheet_template:
  template_name: Sample Config

The YAML file containing your config shall have a root key traitdb_spreasheet_template. The value for this key should be additional key/value pairs (template_name,taxonomy_columns, etc).

Taxonomy Columns

  taxonomy_columns: # keys refer to internal TraitDB identifiers.  Values are the column names in CSV
    family: Family
    genus: Genus
    species: species

The taxonomy columns should contain a set of key-value pairs, where the keys are names of IcznGroup objects in the database, and the values are the names of columns in your CSV files. The default IcznGroup names are below:

IcznGroup Name
kingdom
htg
order
family
genus
species
species_author
infraspecific

This allows flexibility in the CSV datasets, while maintaining consistency within TraitDB. Your CSV files can use different names than TraitDB's internal identifiers. CSV files can use any subset of these levels, and can map different names to the same IcznGroup. For example, you may have one class of datasets that uses Taxa Group and another that uses Higher Taxonomic Group. Both columns refer to the same concept, so both would be mapped to the htg IcznGroup in their respective configs.

Metadata Columns

  metadata_columns: # keys again are the internal TraitDB identifiers.
    comments: Comments

The metadata columns should contain a set of key-value pairs where the keys are names of OtuMetadataField objects in the database, and the values are the names of columns in your CSV files. There are no pre-defined OtuMetadataField objects, they are entirely project-specific. Like IcznGroup, this design allows datasets with differing column names for the same concept (e.g. Notes, Comments) to be mapped into a single field internally: (e.g. comments)

Trait Options

  trait_options:
    source_prefix: "source: " # include the space after the colon
    require_source: true # source rules for formatting?
    notes_prefix: "notes: " #include the space after the colon
    value_separator: "|" # what separates multiple values in an entry

Trait options provide control for how column headers and data are interpreted. There are 4 supported trait options:

  1. source_prefix - The value should be a string indicating how columns that include source or reference information are prefixed. In this example, the source_prefix is "source: ". Given a trait column named Hybrid, the source of the hybrid data is expected to be in a column called source: Hybrid
  2. require_source - The value should be true or false, indicating whether or not source information is required. If source information is required, the importer will ignore trait values that do not source data.
  3. notes_prefix - Like source_prefix, the value of this should be a string indicating how columns that contain trait-specific notes are prefixed. In this example, the notes_prefix is "notes: ". Given a trait column named Hybrid, any notes about the trait coding are expected to be in a column called notes: Hybrid.
  4. value_separator - When multiple values are coded for a trait (e.g. woody and shrub, they should be delimited with a separator. In this example, the separator is the pipe | character, so the value would be woody|shrub to indicate both codings.

Categorical Trait Columns

  categorical_trait_columns:
    -
      name: Karyotype
      values:
        - ZO
        - ZW
        - XY
      summarization_method: "collect" # Optional, method for aggregating trait values when summarizing
    -
      name: Hybrid
      values:
        - yes
        - no

Categorical Trait Columns should contain a YAML array of key/value sets. Each item in the array defines the name of the trait as well as the possible trait values. Categorical trait values are stored as strings, so any numeric (0,1,2) or boolean (true, false) values in your import config should be wrapped in double quotes.

The summarization_method is optional, and defaults to collect. See Summarization Methods below.

Continuous Trait Columns

    -
      name: Chromosome Number
      format: integer # formats are only for display
      summarization_method: "avg" # Optional, method for aggregating trait values when summarizing
    -
      name: Avg Mass
      format: float

Continuous Trait Columns should contain a YAML array of key/value sets. Each item in the array defines the name of the trait as well as the display format. Continuous Traits are stored as decimal numbers, but their presentation (integer, float) can be configured.

The summarization_method is optional, and defaults to collect. See Summarization Methods below.

Summarization Methods

Summarization Methods are is used when searching with the Return results summarized as option. This option allows the user to specify an ICZN Group level (e.g. family) and view the results as one row per family. The OTUs in each family are grouped together and their trait values are aggregated to display in a single cell. The summarization method specifies the function used to aggregate those values.

Summarization method is specified per trait, and should name one of the options in merge_trait_hashes.rb. The methods are implemented as simple ruby code. The initial methods are meant to serve as examples:

Method Description
avg Averages the trait values. Only valid for continuous traits, whose values are numbers that can be added and divided
collect Collects all unique trait values and returns an array of unique values. This is the default method
first Returns the first trait value found
last Returns the last trait value found
min Returns the minimum trait value. Valid for both continuous and categorical but may deliver unexpected results with categorical. Simply uses Ruby's Enumerable#minmethod.
max Returns the maximum trait value. Valid for both continuous and categorical but may deliver unexpected results with categorical. Simply uses Ruby's Enumerable#max method.

Testing with ruby tools

Import configs must be valid YAML files. YAML can be checked with YAML Lint.

The code that handles the import config YAML files and CSV files is located in the lib/traitdb_import directory. The code does not depend on Rails, so it can be used to parse a dataset and check it through an import config. is standalone and can be run from the command-line:

    ruby lib/traitdb_import/init.rb Sample.yml Sample.csv

Advanced features

Trait Groups

Trait groups are used to group traits together for specific taxonomies when searching. For example, some traits are only relevant to Plants. If a user is only searching for Animal taxa, it would not be helpful to display traits that are only relevant to plants.

To support Trait groups, you add a key in the import config to define the groupings (see TreeOfSex.yml):

  trait_groups:
    -
      name: Plants
      taxonomic_rank: kingdom
      taxon_name: Plants
    -
      name: Animals
      taxonomic_rank: kingdom
      taxon_name: Animals

Also, each trait must indicate which groups it is a member of:

  categorical_trait_columns:
    -
      name: Hybrid # Denoted as hybrid based on species name; add reference to notes column if known from other data
      values: ["0","1"]
      groups: [Plants]

Trait Sets

Trait sets are an experimental feature that allows traits to be organized hierarchically. For an example, see SocialTraits-Example-TraitSets.yml

Trait Sets were initially developed for a dataset with over 200 traits with a repeated hierarchical structure. The dataset has since been reduced to 41 traits, so the hierarchical trait sets aren't used.

Uploading an Import Config File

After the import config has been written, Administrators can upload a config by clicking Admin->Import Configs, and Upload New CSV Import Config. After uploading the YAML file, the import config will be available to logged-in users for uploading

Note: Users cannot upload data until an import config exists

Downloading a system-generated template CSV File

TraitDB provides a few enhancements to facilitate the upload and import of your datasets. These include an outline view of the import config, as well as generation of blank CSV files from your import config.

On the Upload page, see the link in the notification banner. You will be presented with links to review config info (including expected taxonomy, allowed categorical trait values, metadata column names) as well as downloading the ready-to-fill CSV files for these datasets.

If your import config uses Trait Sets, you will see a different link for each set. For example, the Animals CSV template will only include traits that reference the Animals trait set.

Uploading CSV Files

To import data into TraitDB, you must upload a CSV file. Click the Upload link in the navigation bar, select your CSV file, and click the Upload button.

As you progress through the steps, you must select an import config. If the dataset passes the initial checks, TraitDB will report on any inconsistencies before providing the option to import.

Please note that any cells that fail validation will cause the entire row to be skipped.

Constraints, limitations, and workarounds

  • The source columns are designed to be specific to each trait. Some users may prefer to document a single source for an entire OTU row. The workaround for this is to set require_source to false in Trait Options, and add a metadata field for your source column. Source data will appear in searches, but will not be uniqued/cleaned up as it would otherwise.
  • Each row/otu is taken all or none, and there is no mechanism for merging/eliminating duplicate records.
  • Data within a project (OTUs, traits) is siloed, so the same trait name will be represented by a different database object across projects
  • The taxonomy levels used within a project should be consistent, otherwise data may be obscured when searching
  • By design, the import process ignores rows that are not mentioned in the import config. This means that a typo or mixed case in your import config (or dataset) could render an entire dataset invisible.