Extract ontology terms referenced in PubMed abstracts from the MEDLINE/PubMed Baseline Repository by running SciGraph against a set of ontologies.
Using OmniCorp requires the following open source tools:
- Scala and sbt
On macOS, these can be installed using Homebrew by running:
brew install git maven scala sbt wget
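Before building, it can help to confirm that the tools are actually on your PATH. The helper below is a minimal sketch and not part of OmniCorp: the function name missing_tools is hypothetical, and it only checks that a command exists, not which version is installed. Note that the Homebrew maven package provides the mvn command.

```shell
# missing_tools TOOL...: print each named tool that is not on PATH, one per line.
# Hypothetical helper for illustration; prints nothing when every tool is found.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || echo "$tool"
    done
}

# Usage: missing_tools git mvn scala sbt wget
```

If the function prints nothing, all of the listed commands were found.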
Setting up SciGraph
We need to use a specially modified version of SciGraph in order to carry out text annotations.
To install this version locally, run make SciGraph. This will download, compile and install the customized SciGraph we use.
You will then need to run make omnicorp-scigraph to generate the SciGraph instance for the ontologies specified in ontologies.ofn.
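The two targets have to run in order, and the second is only useful if the first succeeded. As a rough sketch, assuming you want to script this step, a small wrapper could stop at the first failure. The run_targets function and the MAKE override are illustrative assumptions, not something the Makefile provides.

```shell
# run_targets TARGET...: run each make target in order, stopping at the first failure.
# Set MAKE to another command (for example MAKE=echo) to preview the targets
# without building anything. Hypothetical helper for illustration.
run_targets() {
    for target in "$@"; do
        "${MAKE:-make}" "$target" || return 1
    done
}

# Usage: run_targets SciGraph omnicorp-scigraph
```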
Extract ontology terms used in the COVID-19 Open Research Dataset (CORD) as tab-delimited files for further processing in COVID-KOP.
In order to generate OmniCORD output files, you should:
- Update the CORD-19 release date in the Makefile (referred to below as $ROBOCORD_DATE). You can look up the latest CORD-19 release date on the CORD-19 website.
- Download the CORD-19 dataset by running make robocord-download. This will automatically create a directory in the robocord-datas directory and download the CORD-19 dataset for $ROBOCORD_DATE into that directory.
- Uncompress the dataset by running
- Test the extraction program by running make robocord-test. This will extract data from some articles in order to ensure that the program is working correctly. It will also create a directory in the robocord-outputs directory to store the results in. It's usually a good idea to clear the robocord-output directory after running the test and ensuring that the output files look correct.
- Use robocord.job to attempt to run all the jobs on a SLURM cluster. Any number of jobs can be specified, but values of around 4000 seem to work well. Example:
sbatch --array=0-3999 robocord.job
- Use RoboCORDManager to re-run any jobs that failed to complete. You can use the --dry-run option to see what jobs will be executed before they are run. Jobs are executed using the robocord-sbatch.sh script, so modify that if necessary. Example:
srun sbt "runMain org.renci.robocord.RoboCORDManager --job-size 20"
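SLURM job arrays are numbered from zero in the example above, so 4000 jobs correspond to the range 0-3999. The sketch below derives that range from a job count to avoid off-by-one mistakes when trying other values. The array_spec function is a hypothetical helper, not part of OmniCorp, and it assumes the job count is a positive integer written without leading zeros.

```shell
# array_spec N: print the zero-based range covering N array jobs.
# Hypothetical helper for illustration; rejects empty input, non-digits,
# zero, and numbers with leading zeros.
array_spec() {
    n=$1
    case $n in
        ''|*[!0-9]*|0*)
            echo "array_spec: expected a positive integer, got '$n'" >&2
            return 1
            ;;
    esac
    echo "0-$((n - 1))"
}

# Usage: sbatch --array="$(array_spec 4000)" robocord.job
```

With a count of 4000 this produces the same range as the sbatch example above.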
Currently, we look for terms from the following ontologies:
- Uberon (base) (OWL)
- ChEBI (OWL)
- Cell Ontology (OWL)
- Environment Ontology (OWL)
- Gene Ontology (plus) (OWL)
- NCBITaxon (OWL)
- Relation Ontology (OWL)
- PRotein Ontology (PRO) (OWL)
- Biological Spatial Ontology (OWL)
- Mondo Disease Ontology (OWL)
- The Human Phenotype Ontology (OWL)
- Ontology for Biomedical Investigations (OWL)
- Sequence Ontology (OWL)
- HUGO Gene Nomenclature Committee (OWL)
- Experimental Factor Ontology (OWL)