Add support for per-identifier properties #237

@gaurav

With PR #211, we now include taxa in the clique and synonym information. In order to fully implement #155, we will need to refactor compendium and synonym files to support multiple properties side-by-side.

Compendium files

I propose that Compendium files have a p slot for each identifier and a properties slot for the entire clique. Here is an example:

{
  "type": "biolink:Gene",
  "identifiers": [
    { "i": "NCBIGene:5367", "l": "PMCH", "p": { "in_taxon": ["NCBITaxon:9606"] }},
    { "i": "ENSEMBL:ENSG00000183395", "p": {}},
    { "i": "HGNC:9109", "l": "PMCH", "p": {}},
    { "i": "OMIM:176795", "p": { "description": "The melanin-concentrating hormone (MCH) is a cyclic neuropeptide isolated initially from salmon pituitary gland and later from rat hypothalamus. In mammals, MCH perikarya are confined largely to the lateral hypothalamus and zona incerta area with extensive neuronal projections throughout the brain, including the neurohypophysis. The anatomic distribution suggests a neurotransmitter or neuromodulator role for MCH in a broad array of neuronal functions directed toward the regulation of goal-directed behavior, such as food intake, and general arousal. MCH and 2 other putative neuropeptides, NEI and NGE, are encoded by the same precursor and appear colocalized in nerve cells and in many instances within the projections. The precursor is designated pro-melanin-concentrating hormone (PMCH) (summary by [Nahon et al., 1992](https://www.omim.org/entry/176795#3))." }},
    { "i": "UMLS:C1418669", "l": "PMCH gene", "p": {}}
  ],
  "properties": {
    "description": "The melanin-concentrating hormone (MCH) is a cyclic neuropeptide isolated initially from salmon pituitary gland and later from rat hypothalamus. In mammals, MCH perikarya are confined largely to the lateral hypothalamus and zona incerta area with extensive neuronal projections throughout the brain, including the neurohypophysis. The anatomic distribution suggests a neurotransmitter or neuromodulator role for MCH in a broad array of neuronal functions directed toward the regulation of goal-directed behavior, such as food intake, and general arousal. MCH and 2 other putative neuropeptides, NEI and NGE, are encoded by the same precursor and appear colocalized in nerve cells and in many instances within the projections. The precursor is designated pro-melanin-concentrating hormone (PMCH) (summary by [Nahon et al., 1992](https://www.omim.org/entry/176795#3)).",
    "in_taxon": ["NCBITaxon:9606"],
    "information_content": "100"
  }
}
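
As a rough sketch of how the clique-level properties slot could relate to the per-identifier p slots, the Python below folds the p dictionaries together. The merge rule is an assumption, not something this proposal specifies: list values are unioned in first-seen order, and scalar values keep the first one encountered. The function name `merge_clique_properties` and the shortened description are hypothetical. Properties such as information_content that do not come from any single identifier would still need to be added separately.

```python
def merge_clique_properties(identifiers):
    """Fold per-identifier `p` slots into one clique-level dict.

    Assumed rule (not specified in the proposal): list values are unioned in
    first-seen order; scalar values keep the first value encountered.
    """
    merged = {}
    for ident in identifiers:
        for key, value in ident.get("p", {}).items():
            if isinstance(value, list):
                existing = merged.setdefault(key, [])
                # Only extend if the key has held a list so far.
                if isinstance(existing, list):
                    for item in value:
                        if item not in existing:
                            existing.append(item)
            else:
                merged.setdefault(key, value)
    return merged


# Trimmed-down version of the example record above (description shortened).
example_identifiers = [
    {"i": "NCBIGene:5367", "l": "PMCH", "p": {"in_taxon": ["NCBITaxon:9606"]}},
    {"i": "ENSEMBL:ENSG00000183395", "p": {}},
    {"i": "OMIM:176795", "p": {"description": "The melanin-concentrating hormone (MCH) is ..."}},
]
clique_properties = merge_clique_properties(example_identifiers)
```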

The property keys should be documented in Babel, and each property should be mappable to an RDF property for exports.
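
One way to make the mapping explicit is a lookup table from property key to RDF predicate, sketched below. The table name `PROPERTY_TO_RDF`, the helper `to_rdf_pairs`, and the specific IRIs chosen are illustrative assumptions, not a decided mapping; keys with no entry are returned separately so they can be reviewed.

```python
# Hypothetical mapping from Babel property keys to RDF predicate IRIs.
# The IRIs are illustrative choices, not a decided mapping.
PROPERTY_TO_RDF = {
    "in_taxon": "https://w3id.org/biolink/vocab/in_taxon",
    "description": "http://purl.org/dc/terms/description",
}


def to_rdf_pairs(properties):
    """Return ((predicate, value) pairs, keys that have no mapping)."""
    pairs = []
    unmapped = []
    for key, value in properties.items():
        predicate = PROPERTY_TO_RDF.get(key)
        if predicate is None:
            unmapped.append(key)
            continue
        # Multi-valued properties become one pair per value.
        values = value if isinstance(value, list) else [value]
        for single in values:
            pairs.append((predicate, single))
    return pairs, unmapped


pairs, unmapped = to_rdf_pairs(
    {"in_taxon": ["NCBITaxon:9606"], "information_content": "100"}
)
```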

In Redis

We currently store descriptions and information content in ideqids. We should move them into a separate database that stores the properties as a JSON object indexed by the primary identifier. That will let us look up the properties quickly once we've resolved an identifier, so we can return them to the user.
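
A minimal sketch of that lookup, assuming the properties are serialized with `json` and keyed by the primary identifier. `InMemoryStore` is a hypothetical stand-in that exposes only `get` and `set`, so the example runs without a Redis server; a real Redis client offering the same two calls could be passed in instead. The function names are also hypothetical.

```python
import json


class InMemoryStore:
    """Hypothetical stand-in exposing the get/set subset of a Redis client."""

    def __init__(self):
        self._data = {}

    def set(self, key, value):
        self._data[key] = value

    def get(self, key):
        return self._data.get(key)


def store_properties(db, primary_id, properties):
    """Write one clique's properties as a JSON string under its primary ID."""
    db.set(primary_id, json.dumps(properties))


def load_properties(db, primary_id):
    """Read the properties back; an unknown identifier yields an empty dict."""
    raw = db.get(primary_id)
    return json.loads(raw) if raw is not None else {}


properties_db = InMemoryStore()
store_properties(
    properties_db,
    "NCBIGene:5367",
    {"in_taxon": ["NCBITaxon:9606"], "information_content": "100"},
)
loaded = load_properties(properties_db, "NCBIGene:5367")
```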

Synonym files

Similarly, synonym files should represent properties in the same way as Compendium files: a single properties key that holds a dictionary of key/value pairs. Since synonym files drop the per-identifier clique information, we only store clique-level properties here.

  • Need to figure out if we can index properties.curie_suffix in Solr in the same way that we currently index curie_suffix.
{
  "curie": "NCBIGene:5367",
  "preferred_name": "PMCH", 
  "names": ["MCH", "PMCH", "ppMCH", "pro-MCH", "PMCH gene", "prepro-MCH", "MELANIN-CONCENTRATING HORMONE", "pro-melanin concentrating hormone", "pro-melanin-concentrating hormone", "PRO-MELANIN-CONCENTRATING HORMONE", "prepro-melanin-concentrating hormone"],
  "types": ["Gene", "GeneOrGeneProduct", "GenomicEntity", "ChemicalEntityOrGeneOrGeneProduct", "PhysicalEssence", "OntologyClass", "BiologicalEntity", "ThingWithTaxon", "NamedThing", "Entity", "PhysicalEssenceOrOccurrent", "MacromolecularMachineMixin"],
  "properties": {
    "shortest_name_length": 3,
    "curie_suffix": 5367,
    "in_taxa": ["NCBITaxon:9606"]
  }
}
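
For illustration, the derived values in the properties slot above could be computed as in the sketch below. The helper name `synonym_properties` is hypothetical, and the handling of edge cases is an assumption: curie_suffix is only set when the part after the first colon is purely numeric, and shortest_name_length is omitted when there are no names.

```python
def synonym_properties(curie, names, clique_properties):
    """Build the `properties` dict for one synonym-file record.

    Assumptions: `curie_suffix` is set only for purely numeric suffixes, and
    `shortest_name_length` is omitted when `names` is empty.
    """
    props = dict(clique_properties)  # copy so the caller's dict is untouched
    if names:
        props["shortest_name_length"] = min(len(name) for name in names)
    suffix = curie.split(":", 1)[1] if ":" in curie else ""
    if suffix.isascii() and suffix.isdigit():
        props["curie_suffix"] = int(suffix)
    return props


pmch_properties = synonym_properties(
    "NCBIGene:5367",
    ["MCH", "PMCH", "ppMCH", "pro-MCH", "PMCH gene"],
    {"in_taxa": ["NCBITaxon:9606"]},
)
```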

Other exports

  • In KGX, we can export this as node properties (e.g. {"id": "NCBIGene:5367", "name": "PMCH", "category": "biolink:Gene", "equivalent_identifiers": ["NCBIGene:5367", ...], "in_taxa": ["NCBITaxon:9606"], ...}).
    • We will need to handle properties that don't have Biolink equivalents, like information content.
  • In SSSOM, this information can be stored as pipe-delimited key-value pairs in the other slot.
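
Both exports could be sketched roughly as follows. The function names, the `biolink_keys` parameter, and the `key=value` syntax used inside the SSSOM other slot are assumptions for illustration. Properties without a Biolink equivalent are returned separately rather than dropped, and values that themselves contain a pipe would need escaping, which this sketch does not handle.

```python
def to_kgx_node(curie, name, category, equivalent_identifiers, properties, biolink_keys):
    """Return (KGX node dict, properties with no Biolink equivalent)."""
    node = {
        "id": curie,
        "name": name,
        "category": category,
        "equivalent_identifiers": list(equivalent_identifiers),
    }
    leftover = {}
    for key, value in properties.items():
        if key in biolink_keys:
            node[key] = value
        else:
            leftover[key] = value
    return node, leftover


def to_sssom_other(properties):
    """Serialize properties as pipe-delimited key=value pairs (assumed syntax)."""
    parts = []
    for key, value in properties.items():
        if isinstance(value, list):
            value = ",".join(str(item) for item in value)
        parts.append(f"{key}={value}")
    return "|".join(parts)


node, leftover = to_kgx_node(
    "NCBIGene:5367",
    "PMCH",
    "biolink:Gene",
    ["NCBIGene:5367", "HGNC:9109"],
    {"in_taxa": ["NCBITaxon:9606"], "information_content": "100"},
    biolink_keys={"in_taxa"},
)
other = to_sssom_other(leftover)
```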
