An analysis-centric knowledge graph framework for untargeted metabolomics.
MetaBoKG turns the heterogeneous outputs of public mass-spectrometry repositories — spectra, features, GNPS molecular-network jobs, library annotations, confidence evidence, sample metadata, environmental and taxonomic context — into a single, queryable knowledge graph. It is designed to keep the link between every annotation and the analytical artifact, sample, and study it came from explicit, so that biochemical questions can be asked across hundreds of analyses at once instead of one job at a time.
Public infrastructures (GNPS/MassIVE, MetaboLights, Metabolomics Workbench, Pan-ReDU) have made raw data and study metadata broadly reusable, and recent graphs such as ENPKG and METRIN-KG have shown the value of semantic integration for compound-centric reasoning. The analytical layer, however, stays fragmented: spectra, features, workflow outputs, annotations, confidence evidence, and sample context live in different tables, with different IDs, and rarely point to each other. MetaBoKG addresses that fragmentation with three contributions:
- A transformation workflow that preserves links between repository exports, analytical files, spectra, features, and annotation results — from raw download all the way to SPARQL.
- A semantic model grounded in PROV-O and SIO, aligned with the Mass Spectrometry ontology (MS), ChEBI, NCBITaxon, ENVO, and NCIT — so provenance, analytical evidence, metadata attributes, and controlled-vocabulary terms all live in the same graph.
- A Universal Annotation Identifier (UAI) strategy that extends the Universal Spectrum Identifier (USI) with workflow-specific components, enabling late binding, incremental ingestion, and post-hoc linkage across analyses.
The current release scales to 680 GNPS molecular-networking jobs and is evaluated through a battery of competency questions on biochemical enrichment, environmental specificity, and cross-instrument analytical variation.
┌────────────────────────────────────────────────────────────┐
│ Public sources │
│ PubMed / iCite ── PMC ── GNPS/MassIVE ── ReDU │
└────────────────────────────────────────────────────────────┘
│
▼
┌───────────────────────────────────────────────────────────────────┐
│ Pipeline (main.py) │
│ │
│ fetch ─► extract ─► jobs ─► map ─► load ─► cq │
│ PMIDs GNPS/MassIVE GNPS morph-kgc Virtuoso SPARQL │
│ PDFs Zenodo IDs archives + RML graphs + CQ CSVs │
└───────────────────────────────────────────────────────────────────┘
│
▼
┌────────────────────────────────────────────────────────────┐
│ Knowledge graph │
│ SIO · MS · ENVO · NCBITaxon · NCIT · Uberon · │
│ PROV-O · DCAT · CHMO · AFO │
│ anchored on Universal Annotation Identifiers (UAI) │
└────────────────────────────────────────────────────────────┘
A single entry point, main.py, drives six stages:
- fetch (data_retriever/pmid.py): Pull PMIDs citing a seed paper from the NIH iCite API, fetch PubMed metadata, and download open-access PDFs from the PMC AWS S3 mirror.
- extract (data_retriever/find_massive_gnps.py): Convert PDFs to markdown with docling; regex-mine GNPS task IDs (with OCR-tolerant fuzzy matching), MassIVE accessions, and Zenodo records.
- jobs (data_retriever/job_download.py): Download and extract the GNPS job archives (classical molecular networking and feature-based molecular networking) referenced in every paper.
- map (mapping/script.py): Materialise per-job RDF with morph-kgc and RML mappings from mapping/rml/; write Turtle to mapping/kg/.
- load (mapping/load_to_virtuoso.py): Start OpenLink Virtuoso 7 via docker compose; bulk-load /data (per-job KGs), /schema (project schema), and /ontology (external OWL files) into named graphs. The Ontology/ directory is not tracked in git; pass --populate-ontology once to download SIO, MS, ENVO, NCBITaxon, NCIT, Uberon, PROV-O, DCAT, CHMO, and AFO from their canonical web URLs.
- cq (mapping/load_and_query_kg.py): Run the competency-question SPARQL suite against the live endpoint and write results to a local CQ/ directory.
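The identifier mining done by the extract stage can be pictured with a small sketch. This is not the code in data_retriever/find_massive_gnps.py. It assumes that GNPS task IDs are 32 hexadecimal characters following task=, that MassIVE accessions are MSV followed by nine digits, and that Zenodo records appear as zenodo.<number>. The OCR repair table and the mine_identifiers name are hypothetical.

```python
import re

# Assumed identifier shapes; the real extractor may use different patterns.
TASK_RE = re.compile(r"task=([0-9A-Za-z]{32})")
MASSIVE_RE = re.compile(r"\bMSV\d{9}\b")
ZENODO_RE = re.compile(r"zenodo\.(\d+)", re.IGNORECASE)
HEX32_RE = re.compile(r"[0-9a-f]{32}")

# Hypothetical OCR repairs: letters commonly misread in place of digits.
OCR_FIXES = str.maketrans({"O": "0", "o": "0", "l": "1", "I": "1"})


def mine_identifiers(text):
    """Return the GNPS task IDs, MassIVE accessions and Zenodo record numbers found in text."""
    tasks = set()
    for raw in TASK_RE.findall(text):
        candidate = raw.translate(OCR_FIXES).lower()
        # Keep the candidate only if the repaired string is valid hexadecimal.
        if HEX32_RE.fullmatch(candidate):
            tasks.add(candidate)
    return {
        "gnps_tasks": sorted(tasks),
        "massive": sorted(set(MASSIVE_RE.findall(text))),
        "zenodo": sorted(set(ZENODO_RE.findall(text))),
    }
```

Repairing the candidate first and validating it afterwards keeps the tolerance narrow: a string that is still not hexadecimal after the substitutions is dropped rather than guessed at.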
- Docker (Virtuoso RDF store)
- Python 3.10+
- uv for dependency management
git clone https://github.com/HolobiomicsLab/MetaBoKG.git
cd MetaBoKG
uv sync
uv run python main.py all
Or run any subset, in any order. Extra arguments after the stage name are forwarded to the underlying script:
uv run python main.py fetch
uv run python main.py extract --workers 8
uv run python main.py jobs --gnps-version 1 --max-workers 8
uv run python main.py map
uv run python main.py load --populate-ontology # first run: fetch external ontologies
uv run python main.py load --reload # later runs: refresh /data only
uv run python main.py cq --only CQ1 CQ2 CQ3
# chain a subset (extras forwarded to every stage)
uv run python main.py map load cq
fetch reads its seed PMID from SOURCE_PMID in data_retriever/pmid.py; edit it there to retarget.
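The forwarding rule described above (leading stage names, then extras passed to every selected stage) can be sketched as follows. This is an illustration of the behaviour, not the contents of main.py. The stage-to-script table is taken from the stage list in this README; plan_commands and its parsing strategy are assumptions.

```python
import sys

# Stage -> script mapping as listed in this README.
STAGES = {
    "fetch": "data_retriever/pmid.py",
    "extract": "data_retriever/find_massive_gnps.py",
    "jobs": "data_retriever/job_download.py",
    "map": "mapping/script.py",
    "load": "mapping/load_to_virtuoso.py",
    "cq": "mapping/load_and_query_kg.py",
}


def plan_commands(argv):
    """Split argv into leading stage names and trailing extras; build one command per stage."""
    stages = []
    index = 0
    # Consume stage names until the first token that is not a stage (the start of the extras).
    while index < len(argv) and (argv[index] in STAGES or argv[index] == "all"):
        stages.extend(STAGES if argv[index] == "all" else [argv[index]])
        index += 1
    if not stages:
        raise SystemExit("no stage given")
    extras = argv[index:]
    # Every selected stage receives the same forwarded extras.
    return [[sys.executable, STAGES[stage], *extras] for stage in stages]
```

Each returned command could then be handed to subprocess.run(command, check=True), so that a failing stage stops the chain.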
Every annotation, sample, feature, and scan minted by MetaBoKG is anchored
on an MBS:UAI node carrying a strict, machine-checkable subset of
properties (MBS:collectionID, MBS:mzml, MBS:annotation, MBS:hit,
MBS:featureTable, MBS:feature, MBS:scan). This design yields three benefits:
- Late binding: an annotation row can be ingested before its sample metadata is available, and re-linked when ReDU lands.
- Incremental ingestion: re-running a GNPS job overwrites only its own named slice of the graph (main.py load --reload).
- Post-hoc linkage: two annotations on the same sample, same feature, or same compound across two different studies become a single SPARQL join (see CQ1 and CQ4 defined in mapping/load_and_query_kg.py).
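A minimal sketch of what "machine-checkable" could mean in practice is given here. It treats a UAI node as a plain dictionary from property name to value, which is an assumption made for illustration; the real nodes are RDF resources. The property names come from the list above, while the helper names and the sample values are hypothetical.

```python
# Property names taken from this README; a UAI node may carry any subset of them.
ALLOWED_UAI_PROPERTIES = frozenset({
    "MBS:collectionID", "MBS:mzml", "MBS:annotation", "MBS:hit",
    "MBS:featureTable", "MBS:feature", "MBS:scan",
})


def unexpected_properties(node):
    """Return the property names on a node that fall outside the allowed set, sorted."""
    return sorted(set(node) - ALLOWED_UAI_PROPERTIES)


def shared_keys(first, second):
    """Return the properties on which two nodes hold the same value (candidate join keys)."""
    return sorted(k for k in first.keys() & second.keys() if first[k] == second[k])


# Late binding: the annotation is recorded first, the mzML link is added later.
annotation = {"MBS:collectionID": "collection-1", "MBS:annotation": "annotation-7"}
annotation["MBS:mzml"] = "sample_a.mzML"
```

With this representation, two nodes that share a collection and an mzML file expose those keys through shared_keys, which mirrors the join that the post-hoc linkage relies on.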
| Component | Location | Notes |
|---|---|---|
| Per-job materialised TTLs | mapping/kg/ | One TTL per GNPS / ReDU artifact |
| Project schema (classes + props) | Schema/ | Full MetaBoKG schema (classes, properties, ReDU hierarchies) |
| Imported ontologies | Ontology/ | SIO, MS, ENVO, NCBITaxon, NCIT, Uberon, PROV-O, DCAT, CHMO, AFO (web-fetched) |
| RML mappings | mapping/rml/ | One template per source (GNPS, FBMN, MN, ReDU) |
| Competency questions | mapping/load_and_query_kg.py | SPARQL queries; results written locally to CQ/ |
Schema/ holds the full schema for MetaBoKG: class
declarations, property declarations, and the ReDU class hierarchy
(metabokg.ttl, reDU_extraction_collection.ttl,
reDU_internal_standard.ttl, reDU_organism.ttl,
reDU_sample_type.ttl). Two namespaces are used throughout:
| Prefix | IRI | Role |
|---|---|---|
| MBS: | <https://ns.inria.fr/metaboKG/schema/> | Schema-level terms (classes, props) |
| MBD: | <https://ns.inria.fr/metaboKG/data/> | Instance-level resources (data) |
A complete visual model of the knowledge graph (entities, properties, and external-ontology anchors) is available at doc/MetaboKG.svg.
The evaluation suite in mapping/load_and_query_kg.py answers the four competency questions defined in the paper:
- CQ1 — Do GNPS annotations land on samples whose biological and environmental context has been harmonized in Pan-ReDU? (joins each annotation to its sample, collection, and NCBITaxon source through the Universal Annotation Identifier.)
- CQ2 — How does spectral match quality vary across studies? (stratifies annotations by MQScore and shared peaks attached to the identification activity.)
- CQ3 — For a given annotation, are the ClassyFire and NPClassifier taxonomies consistent, and which pairs co-occur most often?
- CQ4 — How well is each compound covered across the reference spectral libraries? (counts distinct libraries per InChIKey to flag compounds that rely on a single source.)
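The logic behind CQ4 can be illustrated outside SPARQL. The sketch below is not the query shipped in mapping/load_and_query_kg.py: it assumes the annotations have already been flattened into (InChIKey, library) pairs, and the function name and the toy rows in the usage note are made up for illustration.

```python
from collections import defaultdict


def single_library_compounds(rows):
    """Given (inchikey, library) pairs, return the InChIKeys seen in exactly one library."""
    libraries = defaultdict(set)
    for inchikey, library in rows:
        # A set per compound keeps only distinct libraries.
        libraries[inchikey].add(library)
    return sorted(key for key, libs in libraries.items() if len(libs) == 1)
```

For example, with the rows ("KEY-1", "library-a"), ("KEY-1", "library-b"), and ("KEY-2", "library-a"), only KEY-2 is flagged, because KEY-1 is covered by two distinct libraries.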
main.py load wraps everything below. The raw commands are documented here
in case you want to drive Virtuoso directly.
Volumes mounted by docker-compose.virtuoso.yml:
| Host | Container | Contents |
|---|---|---|
| mapping/kg/ | /data | morph-kgc materialised TTLs (one per GNPS / ReDU) |
| Schema/ | /schema | Project schema (metabokg.ttl and ReDU mappings) |
| Ontology/ | /ontology | External ontologies (sio.owl, envo.owl, ncbitaxon.owl, …) |
docker compose -f docker-compose.virtuoso.yml up -d
SPARQL endpoint at http://localhost:8890/sparql, ISQL on port 1111. State persists in the virtuoso-db named volume.
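Once the endpoint is up, it can be queried from Python with the standard library alone. The sketch below assumes the endpoint accepts a GET request with a query parameter and honours the application/sparql-results+json Accept header, as the SPARQL protocol describes. The graph-size query and the function names are illustrative and are not part of the competency-question suite.

```python
import json
import urllib.parse
import urllib.request

ENDPOINT = "http://localhost:8890/sparql"

# Illustrative query: number of triples per named graph.
GRAPH_SIZES = """
SELECT ?g (COUNT(*) AS ?triples)
WHERE { GRAPH ?g { ?s ?p ?o } }
GROUP BY ?g
ORDER BY DESC(?triples)
"""


def build_request(query, endpoint=ENDPOINT):
    """Build a GET request carrying the query and asking for JSON results."""
    url = endpoint + "?" + urllib.parse.urlencode({"query": query})
    return urllib.request.Request(
        url, headers={"Accept": "application/sparql-results+json"}
    )


def run_query(query, endpoint=ENDPOINT):
    """Send the query and return the list of result bindings (needs a running endpoint)."""
    with urllib.request.urlopen(build_request(query, endpoint), timeout=60) as response:
        return json.load(response)["results"]["bindings"]
```

Calling run_query(GRAPH_SIZES) after a load should list the main, schema, and ontology graphs if they were populated.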
docker exec -i metabokg-virtuoso isql 1111 dba dba <<'SQL'
ld_dir('/data', '*.ttl', 'https://ns.inria.fr/metaboKG/graph/main');
ld_dir('/schema', '*.ttl', 'https://ns.inria.fr/metaboKG/graph/schema');
ld_dir('/ontology', '*.ttl', 'https://ns.inria.fr/metaboKG/graph/ontology');
ld_dir('/ontology', '*.owl', 'https://ns.inria.fr/metaboKG/graph/ontology');
rdf_loader_run();
checkpoint;
SELECT ll_state, count(*) FROM DB.DBA.LOAD_LIST GROUP BY ll_state;
SQL
Notes:
- rdf_loader_run() blocks until all queued files complete. For the full GNPS set, expect several minutes. Don't Ctrl-C; to watch progress, open a second ISQL session and re-run the SELECT ll_state, count(*) … query.
- States in DB.DBA.LOAD_LIST: 0 queued, 1 in progress, 2 loaded. To reset rows stuck at 1 after an aborted run: UPDATE DB.DBA.LOAD_LIST SET ll_state = 0, ll_started = NULL WHERE ll_state = 1;.
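The progress query returns one row per state, which is easy to turn into a readable summary. The helper below is a hypothetical convenience, not part of the repository; it assumes the rows have already been read into (ll_state, count) pairs and uses the state meanings given in the notes above.

```python
# State codes as described for DB.DBA.LOAD_LIST in the notes above.
STATE_LABELS = {0: "queued", 1: "in progress", 2: "loaded"}


def summarise_load(rows):
    """Turn (ll_state, count) pairs into labelled counts plus the share of files loaded."""
    summary = {label: 0 for label in STATE_LABELS.values()}
    for state, count in rows:
        summary[STATE_LABELS[int(state)]] += int(count)
    total = sum(summary.values())
    # Avoid dividing by zero when the load list is empty.
    summary["percent loaded"] = round(100 * summary["loaded"] / total, 1) if total else 0.0
    return summary
```

A nonzero "in progress" count that stays constant after the loader has stopped suggests rows left over from an aborted run.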
ld_dir does not deduplicate against previous loads. To rebuild /data
after a fresh materialisation while keeping schema/ontology graphs in place:
uv run python main.py load --reload
or by hand:
docker exec -i metabokg-virtuoso isql 1111 dba dba <<'SQL'
SPARQL CLEAR GRAPH <https://ns.inria.fr/metaboKG/graph/main>;
DELETE FROM DB.DBA.LOAD_LIST WHERE ll_graph = 'https://ns.inria.fr/metaboKG/graph/main';
ld_dir('/data', '*.ttl', 'https://ns.inria.fr/metaboKG/graph/main');
rdf_loader_run();
checkpoint;
SQL
To wipe everything and start fresh:
docker compose -f docker-compose.virtuoso.yml down -v && docker compose -f docker-compose.virtuoso.yml up -d.
Virtuoso generates virtuoso.ini from env vars only on first boot and
reads it from the persistent /database volume on every subsequent start.
If you add a new mount after the volume already exists, DirsAllowed won't
pick it up. Patch the ini in place:
docker exec metabokg-virtuoso sh -c \
"sed -i 's|^DirsAllowed.*|DirsAllowed = ., /opt/virtuoso-opensource/share/virtuoso/vad, /data, /schema, /ontology|' /database/virtuoso.ini"
docker restart metabokg-virtuoso
Contributions are welcome: open a pull request or start a discussion if you want to extend the ontology, the RML mappings, or the extractor heuristics, or to add a competency question. Bug reports with a reproducible job ID are especially appreciated.
Licensed under the Apache License, Version 2.0.
Matthieu Feraud — matthieu.feraud@univ-cotedazur.fr.