TaylorResearchLab/CFDE_DataDistillery

CFDE Data Distillery Project

UBKG

The Unified Biomedical Knowledge Graph (UBKG) is a knowledge graph database that represents a set of interrelated concepts from biomedical ontologies and vocabularies. The UBKG combines information from the National Library of Medicine's Unified Medical Language System (UMLS) with assertions from “non-UMLS” ontologies or vocabularies, including:

  • Ontologies published in references such as the NCBO BioPortal and the OBO Foundry.
  • Custom ontologies derived from data sources such as UniProtKB.
  • Other custom ontologies, such as those for the HuBMAP platform.

An important goal of the UBKG is to establish connections between ontologies. For example, if information on the relationships between proteins and genes described in one ontology can be connected to information on the relationships between genes and diseases described in another ontology, it may be possible to identify previously unknown relationships between proteins and diseases.
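
The idea of connecting ontologies can be illustrated with a minimal Python sketch. The identifiers and assertions below are invented for illustration and are not taken from the UBKG; the sketch only shows how chaining protein-gene and gene-disease assertions through a shared gene yields candidate protein-disease links.

```python
from collections import defaultdict

# Hypothetical assertions from two separate ontologies (illustrative values only).
protein_to_gene = [
    ("PROTEIN:P1", "GENE:G1"),
    ("PROTEIN:P2", "GENE:G2"),
]
gene_to_disease = [
    ("GENE:G1", "DISEASE:D1"),
    ("GENE:G1", "DISEASE:D2"),
    ("GENE:G3", "DISEASE:D3"),
]


def infer_protein_disease(protein_gene, gene_disease):
    """Chain protein->gene and gene->disease pairs through the shared gene."""
    diseases_by_gene = defaultdict(set)
    for gene, disease in gene_disease:
        diseases_by_gene[gene].add(disease)

    inferred = set()
    for protein, gene in protein_gene:
        # .get avoids creating empty entries for genes with no known disease
        for disease in diseases_by_gene.get(gene, ()):
            inferred.add((protein, disease))
    return sorted(inferred)


candidate_links = infer_protein_disease(protein_to_gene, gene_to_disease)
print(candidate_links)
```

In a graph database the same chaining is a two-hop path query; the dictionary lookup here merely stands in for that traversal.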

Components and generation frameworks

The primary components of the UBKG are:

  • a graph database, deployed in neo4j
  • a REST API that provides access to the information in the graph database

The UBKG database is populated by loading a set of CSV files with [neo4j-admin import](https://neo4j.com/docs/operations-manual/current/tutorial/neo4j-admin-import/). The set of CSV import files is the product of two frameworks: the source framework and the generation framework.
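
To make the CSV-based load more concrete, the sketch below builds a tiny node file and relationship file using the header conventions that neo4j-admin import documents (`:ID`, `:LABEL`, `:START_ID`, `:END_ID`, `:TYPE`). The labels, property names, and rows are hypothetical and do not reflect the layout of the actual UBKG import files.

```python
import csv
import io


def to_csv(header, rows):
    """Serialize a header row and data rows to CSV text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
    return buffer.getvalue()


# Hypothetical nodes: an ID column, one property, and a label column.
nodes_csv = to_csv(
    ["conceptId:ID", "name", ":LABEL"],
    [
        ["C1", "example concept one", "Concept"],
        ["C2", "example concept two", "Concept"],
    ],
)

# Hypothetical relationship linking the two nodes by their IDs.
relationships_csv = to_csv(
    [":START_ID", ":END_ID", ":TYPE"],
    [["C1", "C2", "RELATED_TO"]],
)

print(nodes_csv)
print(relationships_csv)
```

The import tool matches `:START_ID` and `:END_ID` values against the `:ID` column of the node files, which is why every relationship row must refer to identifiers that exist in a node file.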

UBKG API

The UBKG prohibits direct Cypher access to the neo4j knowledge graph database. The UBKG API is a REST API with endpoints that can be used to return information from the UBKG.

The UBKG API is described in this SmartAPI page.

Source framework

The source framework is a combination of manual and automated processes that obtain the base set of nodes (entities) and edges (relationships) of the UBKG graph.

The source framework is also known as the UMLS-Graph.

  • Information on the concepts in the ontologies and vocabularies that are integrated into the UMLS Metathesaurus can be downloaded using the MetamorphoSys application. MetamorphoSys can be configured to download subsets of the entire UMLS.
  • Additional semantic information related to the UMLS can be downloaded manually from the Semantic Network.

The result of the Metathesaurus and Semantic Network downloads is a set of files in Rich Release Format (RRF). The RRF files contain information on source vocabularies or ontologies, codes, terms, and relationships both with other codes in the same vocabularies and with UMLS concepts.

The RRF files are loaded into a data mart. A Python script then executes SQL scripts that extract, transform, and load (ETL) the RRF data into a set of twelve temporary tables. These tables are exported as CSV files, which become the UMLS CSVs.
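
The actual ETL runs as SQL inside the data mart. As a simplified stand-in, the Python sketch below shows the general shape of the transformation: parsing a pipe-delimited RRF row and exporting selected columns as CSV. The column list follows the documented layout of MRCONSO.RRF; the sample row and the choice of exported columns are assumptions made for illustration.

```python
import csv
import io

# Column order of MRCONSO.RRF as documented for the UMLS Metathesaurus.
MRCONSO_COLUMNS = [
    "CUI", "LAT", "TS", "LUI", "STT", "SUI", "ISPREF", "AUI", "SAUI",
    "SCUI", "SDUI", "SAB", "TTY", "CODE", "STR", "SRL", "SUPPRESS", "CVF",
]


def parse_rrf_line(line, columns):
    """Split one pipe-delimited RRF line; RRF lines end with a trailing pipe."""
    fields = line.rstrip("\n").split("|")
    if fields and fields[-1] == "":
        fields = fields[:-1]  # drop the empty field after the trailing pipe
    if len(fields) != len(columns):
        raise ValueError(f"expected {len(columns)} fields, got {len(fields)}")
    return dict(zip(columns, fields))


# Illustrative row with made-up identifiers (not real UMLS content).
sample = "C0000001|ENG|P|L0000001|PF|S0000001|Y|A0000001||||EXAMPLE_SAB|PT|X001|example term|0|N||\n"

row = parse_rrf_line(sample, MRCONSO_COLUMNS)

# Export a hypothetical subset of columns to CSV text.
export_columns = ["CUI", "SAB", "CODE", "STR"]
buffer = io.StringIO()
writer = csv.DictWriter(
    buffer, fieldnames=export_columns, extrasaction="ignore", lineterminator="\n"
)
writer.writeheader()
writer.writerow(row)
csv_text = buffer.getvalue()
print(csv_text)
```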

[Figure: source framework]

Generation framework

The UMLS CSVs can be loaded into neo4j to build a graph version of the UMLS, including concepts and relationships from over 150 vocabularies and ontologies that are integrated into the UMLS, such as SNOMED CT, ICD10, and NCI.

The UBKG extends the UMLS graph by integrating additional concepts and relationships from sources outside of the UMLS, among them a number of standard biomedical ontologies that are published in NCBO BioPortal. Integrated sources include:

| Ontology or Source | Description |
| --- | --- |
| PATO | Phenotypic Quality Ontology |
| UBERON | Uber Anatomy Ontology |
| CL | Cell Ontology |
| DOID | Human Disease Ontology |
| OBI | Ontology for Biomedical Investigations |
| EDAM | EDAM |
| HSAPDV | Human Developmental Stages Ontology |
| SBO | Systems Biology Ontology |
| MI | Molecular Interactions |
| CHEBI | Chemical Entities of Biological Interest Ontology |
| MP | Mammalian Phenotype Ontology |
| ORDO | Orphanet Rare Disease Ontology |
| UO | Units of Measurement Ontology |
| UNIPROTKB | Protein-gene relationships from UniProtKB |
| HUSAT | HuBMAP Samples Added Terms |
| HUBMAP | Application ontology supporting the infrastructure of the HuBMAP Consortium |
| CCF | Human Reference Atlas Common Coordinate Framework Ontology |
| MONDO | MONDO Disease Ontology |
| EFO | Experimental Factor Ontology |
| SENNET | Application ontology supporting the infrastructure of the SenNet Consortium |

The generation framework is a suite of scripts that:

  • extract information on assertions (also known as triples, or subject-predicate-object relationships) found in ontologies or derived from other sources
  • iteratively add assertion information to the base set of UMLS CSVs to create a set of ontology CSVs.

Once a set of ontology CSVs is ready, they can be imported into a neo4j database to form a new instance of the UBKG.
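
The iterative step can be sketched in Python as follows. This is a minimal illustration, not the generation framework's code: the triples, source names, and the rule of skipping exact duplicates are all assumptions, and in-memory lists stand in for the CSV files.

```python
def add_assertions(base_edges, new_assertions):
    """Return a new list: base_edges extended with triples not already present."""
    merged = list(base_edges)
    seen = set(base_edges)
    for triple in new_assertions:
        if triple not in seen:
            seen.add(triple)
            merged.append(triple)
    return merged


# Hypothetical base edges standing in for the content of the UMLS CSVs.
base = [("UMLS:C1", "isa", "UMLS:C2")]

# Hypothetical assertion sets (subject, predicate, object), one per ontology.
ontology_assertions = {
    "ONTOLOGY_A": [("A:1", "part_of", "A:2"), ("UMLS:C1", "isa", "UMLS:C2")],
    "ONTOLOGY_B": [("B:1", "has_role", "A:1")],
}

# Each pass builds on the result of the previous one, so later ontologies
# can refer to nodes introduced by earlier ones.
edges = base
for source, assertions in ontology_assertions.items():
    edges = add_assertions(edges, assertions)
    print(f"{source}: {len(edges)} edges after integration")
```

Because each pass starts from the output of the previous pass, the order in which ontologies are integrated can matter when one ontology references concepts defined in another.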

The generation framework can work with:

  • data from ontologies published in Web Ontology Language (OWL) files that conform to the principles of the OBO Foundry
  • data from private or custom ontologies that are in the SimpleKnowledge format. (SimpleKnowledge is a lightweight, spreadsheet-based ontology editor developed by Pitt UBMI.)
  • assertion data that conforms to the UBKG Edge/Node format.

PheKnowLator and OWL-NETS

The generation framework obtains assertion data from OWL files with scripts that are based on the Phenotype Knowledge Translator (PheKnowLator) application. PheKnowLator converts information from an OWL file into the OWL-NETS (OWL NEtwork Transformation for Statistical learning) format.
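
To give a rough sense of what turning an OWL file into subject-predicate-object triples involves, the sketch below extracts `rdfs:subClassOf` assertions from a small RDF/XML snippet using only the Python standard library. It is a deliberately simplified stand-in and is neither PheKnowLator nor the OWL-NETS algorithm; the example ontology and its IRIs are hypothetical.

```python
import xml.etree.ElementTree as ET

NS = {
    "owl": "http://www.w3.org/2002/07/owl#",
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "rdfs": "http://www.w3.org/2000/01/rdf-schema#",
}
RDF_ABOUT = "{http://www.w3.org/1999/02/22-rdf-syntax-ns#}about"
RDF_RESOURCE = "{http://www.w3.org/1999/02/22-rdf-syntax-ns#}resource"

# Hypothetical two-class ontology in RDF/XML.
OWL_SNIPPET = """<?xml version="1.0"?>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"
         xmlns:owl="http://www.w3.org/2002/07/owl#">
  <owl:Class rdf:about="http://example.org/onto#Child">
    <rdfs:subClassOf rdf:resource="http://example.org/onto#Parent"/>
  </owl:Class>
  <owl:Class rdf:about="http://example.org/onto#Parent"/>
</rdf:RDF>
"""


def extract_subclass_triples(owl_xml):
    """Collect (subject, predicate, object) triples for named subClassOf links."""
    root = ET.fromstring(owl_xml)
    triples = []
    for cls in root.findall("owl:Class", NS):
        subject = cls.get(RDF_ABOUT)
        for parent in cls.findall("rdfs:subClassOf", NS):
            obj = parent.get(RDF_RESOURCE)
            # Anonymous superclasses (e.g. restrictions) have no rdf:resource
            # and are skipped in this simplified sketch.
            if subject and obj:
                triples.append((subject, "rdfs:subClassOf", obj))
    return triples


triples = extract_subclass_triples(OWL_SNIPPET)
print(triples)
```

Real OWL files also encode relationships through restrictions and other axioms that this sketch skips; handling them is a large part of what makes a dedicated OWL-to-network conversion tool necessary.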

[Figure: generation framework]