This repository has been archived by the owner on Sep 24, 2019. It is now read-only.

Adding new Namespace datasets

William Hayes edited this page Oct 26, 2015 · 18 revisions

File format

A 'standard' format is available to use for adding new namespaces to the resource-generation pipeline. This format is a tab-delimited text file with the following columns:

  1. ID - a unique identifier for the namespace value (required)
  2. ALTIDS - any alternative IDs
  3. LABEL - the preferred label for the namespace value (required)
  4. SYNONYM - alternative labels, pipe-delimited
  5. DESCRIPTION - documentation text
  6. TYPE - the encoding for the namespace value, e.g., 'O' for pathology, 'C' for complex (required)
  7. SPECIES - the species associated with the namespace value, if any
  8. XREF - equivalent values from other BEL namespaces, pipe-delimited. Each value must include a recognized prefix in order to be used for generating equivalences
  9. OBSOLETE - flag obsolete values with '1'
  10. PARENTS - any parent terms, valid for ID isA PARENT
  11. CHILDREN - any child terms, valid for CHILD isA ID

General information can be included at the top of the file, but each such line must be preceded by a '#'.
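
To make the column layout concrete, here is a minimal reading sketch. It is not the pipeline's own parser (that lives in parsers.py). It assumes the columns appear in the order listed above, that an optional header row starting with "ID" may be present, and that XREF values look like `prefix:id`. The sample values are made up.

```python
import csv
import io

# Column order as listed above.
COLUMNS = ["ID", "ALTIDS", "LABEL", "SYNONYM", "DESCRIPTION", "TYPE",
           "SPECIES", "XREF", "OBSOLETE", "PARENTS", "CHILDREN"]
REQUIRED = ("ID", "LABEL", "TYPE")
PIPE_DELIMITED = ("SYNONYM", "XREF")


def read_namespace_rows(handle):
    """Yield one dict per data row, skipping '#' lines and blank lines."""
    data_lines = (line for line in handle
                  if line.strip() and not line.startswith("#"))
    reader = csv.reader(data_lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    for fields in reader:
        if fields[0] == "ID":
            continue  # assumption: an optional header row repeats the column names
        fields += [""] * (len(COLUMNS) - len(fields))  # pad short rows
        row = dict(zip(COLUMNS, fields))
        missing = [col for col in REQUIRED if not row[col]]
        if missing:
            raise ValueError("row %r is missing required columns: %s"
                             % (row["ID"], ", ".join(missing)))
        for col in PIPE_DELIMITED:
            row[col] = [value for value in row[col].split("|") if value]
        row["OBSOLETE"] = row["OBSOLETE"] == "1"
        yield row


# Made-up sample data, for illustration only.
sample = (
    "# Example namespace data\n"
    + "\t".join(COLUMNS) + "\n"
    + "MY:1\t\tExample Complex\tEx Complex|Example Cplx\t"
      "A made-up entry\tC\t\tchebi:12345\t\t\t\n"
)
rows = list(read_namespace_rows(io.StringIO(sample)))
```

Each yielded row is a plain dictionary keyed by column name, with SYNONYM and XREF split into lists and OBSOLETE converted to a boolean.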

Example data

Examples of namespace data in this format can be found for the following namespaces:

  1. SFAM - Selventa protein families
  2. SCHEM - Selventa legacy chemical names
  3. SCOMP - Selventa named complexes
  4. SDIS - Selventa legacy diseases

Integration into resource-generator pipeline

To add a namespace dataset in this format to your resource-generator pipeline, the following steps are required:

  • Add to configuration.py:
    • Initialize the data object. NOTE - the data object is expected to be named using the prefix for your namespace, followed by "_data":
      my_data = StandardCustomData(name='my-namespace-name', prefix='my', domain=['my-namespace-domain'])
    • Configure the dataset by adding it to baseline_data. baseline_data is an ordered dictionary containing information for all of the data files used by gp_baseline.py. It maps each data file name to a tuple containing [1] the file location, [2] the file parser (in parsers.py), and [3] the data object that stores the parsed data:
      baseline_data['my_file_name'] = ('file_location', parsers.NamespaceParser, my_data)
  • Create header templates for .belns and .beleq files:
    • The templates are optional for running the pipeline, but without them the headers will need to be added manually to your .belns and .beleq files to run current versions of the framework.
    • Add to templates and name as follows:
      • my-namespace-name.belns
      • my-namespace-name.beleq
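
The two configuration.py additions fit together as sketched below. So that the snippet runs outside the repository, `StandardCustomData` and `NamespaceParser` are replaced here by empty stand-in classes; in the real configuration.py they come from the pipeline's own modules, and their actual behavior is not reproduced. The placeholder strings are the ones used in the steps above.

```python
from collections import OrderedDict


class StandardCustomData:
    """Stand-in only: stores the constructor arguments shown in the steps above."""

    def __init__(self, name, prefix, domain):
        self.name = name
        self.prefix = prefix
        self.domain = domain


class NamespaceParser:
    """Stand-in only: represents parsers.NamespaceParser."""


# In configuration.py this dictionary already exists; it is created here
# only to keep the sketch self-contained.
baseline_data = OrderedDict()

# Data object named <prefix>_data, as the pipeline expects.
my_data = StandardCustomData(name='my-namespace-name', prefix='my',
                             domain=['my-namespace-domain'])

# file name -> (file location, parser class, data object)
baseline_data['my_file_name'] = ('file_location', NamespaceParser, my_data)

for file_name, (location, parser, data) in baseline_data.items():
    print(file_name, location, parser.__name__, data.prefix)
```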

Optional - Add .belns and .beleq files for IDs as well as labels

  1. Add to configuration.py after your dataset initialization: my_data.ids = True
  2. Create templates my-namespace-name-ids.belns and my-namespace-name-ids.beleq

Optional - Create equivalences to an existing BEL namespace using XREFS in your dataset

Note - the default is to generate .beleq files with a new UUID for each value in your namespace.

  1. Confirm that the root namespace is included in the equiv_root_data list (add if necessary). Any namespace that you are equivalencing to must generate .beleq files prior to your namespace.
  2. Add to the equiv function in equiv.py:
       elif str(d) == 'my':
           resolve_xrefs(d, 'chebi', 'chebi_id_eq', verbose)
     Here, 'chebi' is the prefix of the xref data, and chebi_id_eq is an equivalence dictionary created within the equiv module.
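
The idea behind XREF-based equivalencing can be sketched as follows. This is not the repository's `resolve_xrefs` implementation; the function name, the `prefix:id` xref form, and the dictionary shapes are all assumptions made for illustration. It shows why the root namespace must be processed first: its equivalence dictionary has to exist before values can be matched against it, and any unmatched value falls back to a new UUID.

```python
import uuid


def assign_equivalence_ids(values, root_prefix, root_eq):
    """Assign an equivalence UUID to each namespace value.

    values      -- mapping of label -> list of xrefs such as 'chebi:12345'
    root_prefix -- prefix of the namespace being equivalenced to
    root_eq     -- mapping of root-namespace ID -> UUID string (built earlier)
    """
    assigned = {}
    for label, xrefs in values.items():
        matched = None
        for xref in xrefs:
            prefix, _, ref_id = xref.partition(":")
            if prefix.lower() == root_prefix and ref_id in root_eq:
                matched = root_eq[ref_id]  # reuse the root namespace's UUID
                break
        # Default behavior: a new UUID when no xref resolves.
        assigned[label] = matched or str(uuid.uuid4())
    return assigned


# Made-up data for illustration.
root_eq = {"12345": str(uuid.uuid4())}
values = {
    "Example A": ["chebi:12345"],   # resolves to the root UUID
    "Example B": ["chebi:99999"],   # unknown ID, gets a new UUID
    "Example C": [],                # no xrefs, gets a new UUID
}
assigned = assign_equivalence_ids(values, "chebi", root_eq)
```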

If you need to add a namespace data set in your own format

(This is a high-level overview)

  1. Configure the dataset in configuration.py
  2. Write a parser for parsers.py
  3. Write the dictionary format for the parsed data object in parsed.py
  4. Create a data object class in datasets.py inheriting from NamespaceDataSet (or format your parsed dictionary to use the StandardCustomData format)
  5. Add (if desired) handling of your dataset in equiv.py
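
When the custom source is simple, it may be easier to convert it to the standard tab-delimited layout described under File format and then integrate it as a standard dataset, rather than writing a new parser and data class. The sketch below assumes a hypothetical comma-separated source with `accession`, `name`, and `aliases` columns (aliases separated by semicolons); none of these names come from the pipeline.

```python
import csv
import io

STANDARD_COLUMNS = ["ID", "ALTIDS", "LABEL", "SYNONYM", "DESCRIPTION", "TYPE",
                    "SPECIES", "XREF", "OBSOLETE", "PARENTS", "CHILDREN"]


def convert_to_standard(source, target, value_type):
    """Write rows from a hypothetical CSV source in the standard column order."""
    # Record the column names in a '#' line, which the format permits.
    target.write("# " + "\t".join(STANDARD_COLUMNS) + "\n")
    writer = csv.writer(target, delimiter="\t", lineterminator="\n")
    for record in csv.DictReader(source):
        row = dict.fromkeys(STANDARD_COLUMNS, "")
        row["ID"] = record["accession"]
        row["LABEL"] = record["name"]
        aliases = [a.strip() for a in record["aliases"].split(";")]
        row["SYNONYM"] = "|".join(a for a in aliases if a)
        row["TYPE"] = value_type
        writer.writerow([row[col] for col in STANDARD_COLUMNS])


# Made-up input for illustration.
source = io.StringIO("accession,name,aliases\n"
                     "X1,Example Complex,ExCplx; Example Cplx\n")
target = io.StringIO()
convert_to_standard(source, target, value_type="C")
output_lines = target.getvalue().splitlines()
```

The TYPE value is passed in by the caller because the source format in this sketch has no equivalent column.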