This project generates listings of biological identifiers that can be used in OpenBEL scripts. Biological identifiers may come from online databases (e.g. Entrez Gene, UniProt) or from custom databases read from CSV files.
It generates four types of files:
- BEL Namespace
  - A listing of biological identifiers from the same database (e.g. GO, Entrez Gene). Identifiers encode specific types of biological concepts in OpenBEL.
  - Files end with a .belns extension.
- BEL Equivalence
  - A listing of equivalent namespace identifiers across multiple sources. These files reference UUID values that join equivalent identifiers from different sources. For example, equivalence relationships can be made across UniProt, Entrez Gene, and HGNC for the human AKT1 protein.
- BEL Annotation
  - A listing of biological annotations used to describe biological interactions. For example, NCBI Taxonomy can be used to annotate species, and Disease Ontology can be used to annotate a disease state.
- RDF format (unified)
  - An RDF file that unifies namespaces, annotations, and equivalences. This output represents the complete set of identifiers and equivalence relationships.
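As a rough illustration of how a shared UUID joins equivalent identifiers, the sketch below groups a small table by UUID. The two-column layout, the UUID values, the UniProt accession row, and the AKT2 rows are assumptions made up for this example; this is not the actual BEL Equivalence file format.

```shell
# Hypothetical data: one "identifier<TAB>uuid" pair per line.
# The UUIDs are invented; the layout is NOT the real equivalence file format.
equivalences() {
  printf '%s\t%s\n' \
    'EntrezGene:207' '11111111-1111-1111-1111-111111111111' \
    'HGNC:AKT1'      '11111111-1111-1111-1111-111111111111' \
    'UniProt:P31749' '11111111-1111-1111-1111-111111111111' \
    'EntrezGene:208' '22222222-2222-2222-2222-222222222222' \
    'HGNC:AKT2'      '22222222-2222-2222-2222-222222222222'
}

# Print every identifier that shares a UUID with the given identifier.
equivalent_to() {
  equivalences | awk -F '\t' -v id="$1" '
    { uuid[$1] = $2 }                 # remember the UUID of each identifier
    END {
      if (id in uuid)                 # unknown identifiers print nothing
        for (k in uuid)
          if (k != id && uuid[k] == uuid[id]) print k
    }' | sort
}

# Usage:
#   equivalent_to HGNC:AKT1
# prints EntrezGene:207 and UniProt:P31749, one per line.
```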
The goal of the resource generator is to aggregate many sources of biological identifiers to provide:
- Identifiers, names, and synonyms for biological entities or annotations.
- Exact biological identifier matches (e.g. EntrezGene:207 is the same as HGNC:AKT1).
- Close biological identifier matches (e.g. DiseaseOntology:"hemophagocytic lymphohistiocytosis" is a close match to Mesh:"Hemophagocytic lymphohistiocytosis, familial, 2").
- Orthologous biological identifier matches (e.g. HGNC:AKT1 is orthologous to RGD:Akt1).
The BEL Namespace, BEL Equivalence, and BEL Annotation files are used by the legacy OpenBEL Framework.
The RDF format includes all of the above data elements in a single file. This RDF data can be used with the next-generation suite, the OpenBEL Platform. Currently the OpenBEL Platform provides the bel.rb and OpenBEL API tools, which can use the RDF data directly.
The resource generator should be run frequently to:
- Keep up with upstream changes from online databases
- Correctly map biological identifiers from new research
- Reflect changes in your custom databases
You may decide that running it weekly, monthly, or as needed is appropriate for you.
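One way to decide when a rerun is due is to check how old the generated files are. The sketch below is only an assumption about how you might automate that decision; the function name needs_regeneration and the 7-day window are hypothetical and not part of the resource generator.

```shell
# Succeed (exit 0) when no regular file under the directory was modified
# within the last N days, i.e. the generated output looks stale.
needs_regeneration() {
  dir=$1
  days=$2
  # -mtime "-N" matches files modified less than N days ago.
  recent=$(find "$dir" -type f -mtime "-$days" 2>/dev/null | head -n 1)
  [ -z "$recent" ]
}

# Usage (assumes RG_JAVA_OUTPUT points at the generated output directory):
#   if needs_regeneration "$RG_JAVA_OUTPUT" 7; then
#     echo "output is older than a week; rerun the resource generator"
#   fi
```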
Quick Start
- Download the BEL Resource Generator.
- Extract the download with: tar xzf Java-BEL-Resource-Generator-v0.1.2.tar.gz
- Download Apache Jena.
- Extract Apache Jena with: tar xzf apache-jena-MAJOR.MINOR.PATCH.tar.gz
- Download the BEL RDF Artifacts.
- Extract the BEL RDF Artifacts:
  - bunzip2 biological-concepts-rdf.ttl.bz2
  - bunzip2 bel_chembl.ttl.bz2
- Create the working directories and set the environment variables:
  mkdir tdb-data output templates
  export RG_TDB_DATA=$(pwd)/tdb-data
  export RG_JAVA_OUTPUT=$(pwd)/output
  export RG_JAVA_TEMPLATES=$(pwd)/templates
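Later steps rely on these three variables, so it can help to confirm they are set and point at real directories before loading any data. The following is a minimal sketch; check_rg_env is a hypothetical helper name, not something shipped with the resource generator.

```shell
# Check that every variable named in the arguments is set and
# refers to an existing directory. Returns non-zero otherwise.
check_rg_env() {
  status=0
  for name in "$@"; do
    eval "value=\${$name:-}"          # indirect lookup, POSIX sh compatible
    if [ -z "$value" ]; then
      echo "missing: $name is not set" >&2
      status=1
    elif [ ! -d "$value" ]; then
      echo "missing: $name=$value is not a directory" >&2
      status=1
    fi
  done
  return "$status"
}

# Usage:
#   check_rg_env RG_TDB_DATA RG_JAVA_OUTPUT RG_JAVA_TEMPLATES || echo "fix the setup first"
```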
Load the BEL RDF artifacts into Apache Jena. Note that the first command uses Apache Jena’s tdbloader2:
./apache-jena-3.0.0/bin/tdbloader2 --loc $RG_TDB_DATA biological-concepts-rdf.ttl
The second command uses Apache Jena’s tdbloader:
./apache-jena-3.0.0/bin/tdbloader --loc $RG_TDB_DATA bel_chembl.ttl
Load custom RDF artifacts into Apache Jena:
for x in custom/results/*.ttl; do
  ./apache-jena-3.0.0/bin/tdbloader --loc $RG_TDB_DATA "$x"
done
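If custom/results contains no .ttl files, a plain shell loop passes the unexpanded pattern to the loader. The sketch below guards against that and stops on the first failed load. The function name load_custom_ttl and the TDB_LOADER override are assumptions introduced for illustration; the default loader path is the one used in the commands above.

```shell
# Load every .ttl file in a directory into the TDB store at $RG_TDB_DATA.
# TDB_LOADER is a hypothetical override; by default the tdbloader path
# from the commands above is used.
load_custom_ttl() {
  dir=$1
  loader=${TDB_LOADER:-./apache-jena-3.0.0/bin/tdbloader}
  count=0
  for x in "$dir"/*.ttl; do
    [ -e "$x" ] || continue           # the glob matched nothing
    "$loader" --loc "$RG_TDB_DATA" "$x" || return 1
    count=$((count + 1))
  done
  echo "loaded $count file(s) from $dir"
}

# Usage:
#   load_custom_ttl custom/results
```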
At this point, the size of the Apache Jena TDB data should be approximately 3.1 GB.
du -sh $RG_TDB_DATA
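The size check can also be scripted so that a clearly incomplete load is caught early. This is a sketch only: tdb_size_at_least is a hypothetical helper, and the threshold in the usage comment is an arbitrary lower bound chosen to sit below the approximate size quoted above.

```shell
# Succeed when the directory occupies at least min_kb kilobytes on disk.
tdb_size_at_least() {
  dir=$1
  min_kb=$2
  actual_kb=$(du -sk "$dir" | awk '{ print $1 }')
  [ "$actual_kb" -ge "$min_kb" ]
}

# Usage (2500000 KB is an assumed threshold, not a documented value):
#   tdb_size_at_least "$RG_TDB_DATA" 2500000 || echo "TDB data looks too small"
```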
cp Java-BEL-Resource-Generator-v0.1.2/java/templates/* $RG_JAVA_TEMPLATES
The integrity check takes several seconds. It can be run at any time without affecting the generated data.
./Java-BEL-Resource-Generator-v0.1.2/java/scripts/integrity-check.sh
Assigning UUIDs takes approximately 10 minutes on modern hardware.
./Java-BEL-Resource-Generator-v0.1.2/java/scripts/assign-uuids.sh
Loading the assigned UUIDs into Apache Jena takes approximately 2 minutes on modern hardware.
./apache-jena-3.0.0/bin/tdbloader --loc $RG_TDB_DATA $RG_JAVA_OUTPUT/uuids.nt
At this point, the size of the Apache Jena TDB data should be approximately 4.4 GB.
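Because the integrity check, UUID assignment, and UUID load run in sequence, it can be convenient to wrap each command so that a failure stops the chain with a clear message. The sketch below assumes nothing about the scripts beyond their exit status; run_step is a hypothetical helper name.

```shell
# Run one named step and report whether it succeeded.
# On failure, return the command's exit status so a "&&" chain stops.
run_step() {
  name=$1
  shift
  echo "==> $name"
  if "$@"; then
    echo "ok: $name"
  else
    code=$?
    echo "failed: $name (exit $code)" >&2
    return "$code"
  fi
}

# Usage:
#   run_step "integrity check" ./Java-BEL-Resource-Generator-v0.1.2/java/scripts/integrity-check.sh &&
#   run_step "assign UUIDs" ./Java-BEL-Resource-Generator-v0.1.2/java/scripts/assign-uuids.sh
```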