HolobiomicsLab/MetaBoKG

MetaBoKG

An analysis-centric knowledge graph framework for untargeted metabolomics.

MetaBoKG turns the heterogeneous outputs of public mass-spectrometry repositories — spectra, features, GNPS molecular-network jobs, library annotations, confidence evidence, sample metadata, environmental and taxonomic context — into a single, queryable knowledge graph. It is designed to keep the link between every annotation and the analytical artifact, sample, and study it came from explicit, so that biochemical questions can be asked across hundreds of analyses at once instead of one job at a time.

Why another metabolomics KG?

Public infrastructures (GNPS/MassIVE, MetaboLights, Metabolomics Workbench, Pan-ReDU) have made raw data and study metadata broadly reusable, and recent graphs such as ENPKG and METRIN-KG have shown the value of semantic integration for compound-centric reasoning. The analytical layer, however, stays fragmented: spectra, features, workflow outputs, annotations, confidence evidence, and sample context live in different tables, with different IDs, and rarely point to each other. MetaBoKG addresses that fragmentation with three contributions:

  1. A transformation workflow that preserves links between repository exports, analytical files, spectra, features, and annotation results — from raw download all the way to SPARQL.
  2. A semantic model grounded in PROV-O and SIO, aligned with the Mass Spectrometry ontology (MS), ChEBI, NCBITaxon, ENVO, and NCIT — so provenance, analytical evidence, metadata attributes, and controlled-vocabulary terms all live in the same graph.
  3. A Universal Annotation Identifier (UAI) strategy that extends the Universal Spectrum Identifier (USI) with workflow-specific components, enabling late binding, incremental ingestion, and post-hoc linkage across analyses.

The current release scales to 680 GNPS molecular-networking jobs and is evaluated through a battery of competency questions on biochemical enrichment, environmental specificity, and cross-instrument analytical variation.

Architecture

       ┌────────────────────────────────────────────────────────────┐
       │  Public sources                                            │
       │   PubMed / iCite ── PMC ── GNPS/MassIVE ── ReDU            │
       └────────────────────────────────────────────────────────────┘
                                  │
                                  ▼
  ┌───────────────────────────────────────────────────────────────────┐
  │  Pipeline (main.py)                                               │
  │                                                                   │
  │   fetch ─► extract ─► jobs ─► map ─► load ─► cq                   │
  │   PMIDs   GNPS/MassIVE   GNPS    morph-kgc   Virtuoso   SPARQL    │
  │   PDFs    Zenodo IDs    archives  + RML       graphs    + CQ CSVs │
  └───────────────────────────────────────────────────────────────────┘
                                  │
                                  ▼
       ┌────────────────────────────────────────────────────────────┐
       │  Knowledge graph                                           │
       │   SIO · MS · ENVO · NCBITaxon · NCIT · Uberon ·            │
       │   PROV-O · DCAT · CHMO · AFO                               │
       │   anchored on Universal Annotation Identifiers (UAI)       │
       └────────────────────────────────────────────────────────────┘

Pipeline

A single entry point, main.py, drives six stages:

  1. fetch (data_retriever/pmid.py)
    Pull PMIDs citing a seed paper from the NIH iCite API, fetch PubMed metadata, and download open-access PDFs from the PMC AWS S3 mirror.
  2. extract (data_retriever/find_massive_gnps.py)
    Convert PDFs to markdown with docling; regex-mine GNPS task IDs (with OCR-tolerant fuzzy matching), MassIVE accessions, and Zenodo records.
  3. jobs (data_retriever/job_download.py)
    Download and extract the GNPS job archives (classical molecular networking and feature-based molecular networking) referenced in every paper.
  4. map (mapping/script.py)
    Materialise per-job RDF with morph-kgc and RML mappings from mapping/rml/; write Turtle to mapping/kg/.
  5. load (mapping/load_to_virtuoso.py)
    Start OpenLink Virtuoso 7 via docker compose; bulk-load /data (per-job KGs), /schema (project schema), and /ontology (external OWL files) into named graphs. The Ontology/ directory is not tracked in git; pass --populate-ontology once to download SIO, MS, ENVO, NCBITaxon, NCIT, Uberon, PROV-O, DCAT, CHMO, and AFO from their canonical web URLs.
  6. cq (mapping/load_and_query_kg.py)
    Run the competency-question SPARQL suite against the live endpoint and write results to a local CQ/ directory.
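
As a rough illustration of the identifier mining in the extract stage, the sketch below pulls GNPS task IDs, MassIVE accessions, and Zenodo record numbers out of markdown text. The regular expressions and the function name mine_identifiers are assumptions about what these identifiers usually look like; they are not copied from find_massive_gnps.py, and the OCR-tolerant fuzzy matching mentioned above is left out.

```python
import re

# Assumed identifier shapes (illustrative, not taken from the repository):
#   GNPS task ID      -> 32 hex characters following "task="
#   MassIVE accession -> "MSV" followed by 9 digits
#   Zenodo record     -> digits after "zenodo.org/record(s)/" or "zenodo." (DOI form)
GNPS_TASK_RE = re.compile(r"task=([0-9a-f]{32})\b", re.IGNORECASE)
MASSIVE_RE = re.compile(r"\bMSV\d{9}\b")
ZENODO_RE = re.compile(r"zenodo(?:\.org/records?/|\.)(\d+)", re.IGNORECASE)


def mine_identifiers(markdown: str) -> dict[str, list[str]]:
    """Return de-duplicated identifiers in order of first appearance."""

    def unique(values):
        # dict.fromkeys keeps insertion order while dropping repeats
        return list(dict.fromkeys(values))

    return {
        "gnps_tasks": unique(t.lower() for t in GNPS_TASK_RE.findall(markdown)),
        "massive": unique(MASSIVE_RE.findall(markdown)),
        "zenodo": unique(ZENODO_RE.findall(markdown)),
    }


# Made-up sample text, for illustration only.
sample = (
    "Networks: https://gnps.ucsd.edu/ProteoSAFe/status.jsp?"
    "task=0123456789abcdef0123456789ABCDEF (data: MSV000012345; "
    "doi:10.5281/zenodo.1234567, https://zenodo.org/records/1234567)."
)
print(mine_identifiers(sample))
```

Exact patterns like these miss identifiers damaged by PDF conversion, which is presumably why the real extractor adds fuzzy matching on top.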

Getting started

Prerequisites

  • Docker (Virtuoso RDF store)
  • Python 3.10+
  • uv for dependency management

Install

git clone https://github.com/HolobiomicsLab/MetaBoKG.git
cd MetaBoKG
uv sync

Run the full pipeline

uv run python main.py all

Or run any subset, in any order. Extra arguments after the stage name are forwarded to the underlying script:

uv run python main.py fetch
uv run python main.py extract --workers 8
uv run python main.py jobs --gnps-version 1 --max-workers 8
uv run python main.py map
uv run python main.py load --populate-ontology   # first run: fetch external ontologies
uv run python main.py load --reload              # later runs: refresh /data only
uv run python main.py cq --only CQ1 CQ2 CQ3

# chain a subset (extras forwarded to every stage)
uv run python main.py map load cq

fetch reads its seed PMID from SOURCE_PMID in data_retriever/pmid.py; edit it there to retarget.
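
The forwarding behaviour described above can be pictured with a small sketch. It is not the code in main.py: the function split_cli and the rule that stage names must come first, with everything after them treated as extras, are assumptions made for illustration.

```python
# Stage names as listed in this README, in pipeline order.
STAGES = ("fetch", "extract", "jobs", "map", "load", "cq")


def split_cli(argv: list[str]) -> tuple[list[str], list[str]]:
    """Split argv into leading stage names and the extra arguments after them."""
    stages: list[str] = []
    i = 0
    while i < len(argv) and (argv[i] in STAGES or argv[i] == "all"):
        if argv[i] == "all":
            # "all" expands to every stage in pipeline order
            stages.extend(s for s in STAGES if s not in stages)
        elif argv[i] not in stages:
            stages.append(argv[i])
        i += 1
    # Everything after the last stage name is forwarded unchanged
    return stages, argv[i:]


print(split_cli(["extract", "--workers", "8"]))
print(split_cli(["map", "load", "cq"]))
```

Under this assumption, a chained call such as map load cq followed by extra flags would hand the same extras to each selected stage, which matches the comment in the command listing.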

The Universal Annotation Identifier (UAI)

Every annotation, sample, feature, and scan minted by MetaBoKG is anchored on a MBS:UAI node carrying a strict, machine-checkable subset of properties (MBS:collectionID, MBS:mzml, MBS:annotation, MBS:hit, MBS:featureTable, MBS:feature, MBS:scan). This anchoring yields three benefits:

  • Late binding — an annotation row can be ingested before its sample metadata is available, and re-linked when ReDU lands.
  • Incremental ingestion — re-running a GNPS job overwrites only its own named slice of the graph (main.py load --reload).
  • Post-hoc linkage — two annotations on the same sample, same feature, or same compound across two different studies become a single SPARQL join (see CQ1 and CQ4 defined in mapping/load_and_query_kg.py).
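
To make the idea of a machine-checkable property subset concrete, the sketch below validates a candidate UAI node against the seven properties named above. The function check_uai, the rule that values must be non-empty, and the sample values are assumptions for illustration; the project's actual validation rules are not described in this README.

```python
# The seven UAI properties named in this README.
UAI_PROPERTIES = frozenset({
    "MBS:collectionID", "MBS:mzml", "MBS:annotation", "MBS:hit",
    "MBS:featureTable", "MBS:feature", "MBS:scan",
})


def check_uai(properties: dict[str, str]) -> list[str]:
    """Return a list of problems; an empty list means the node passes this sketch."""
    problems = []
    for name, value in sorted(properties.items()):
        if name not in UAI_PROPERTIES:
            problems.append(f"unknown property: {name}")
        elif not value:
            problems.append(f"empty value: {name}")
    return problems


# Hypothetical node: the values are placeholders, not real identifiers.
node = {"MBS:collectionID": "MSV000000000", "MBS:mzml": "sample_01.mzML", "MBS:scan": "42"}
print(check_uai(node))                      # no problems reported
print(check_uai({**node, "MBS:foo": "x"}))  # flags the unknown property
```

Because a node may carry only some of the seven properties, a partial node can be ingested first and completed later, which is the late-binding behaviour described above.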

Knowledge graph layout

| Component | Location | Notes |
|---|---|---|
| Per-job materialised TTLs | mapping/kg/ | One TTL per GNPS / ReDU artifact |
| Project schema (classes + props) | Schema/ | Full MetaBoKG schema (classes, properties, ReDU hierarchies) |
| Imported ontologies | Ontology/ | SIO, MS, ENVO, NCBITaxon, NCIT, Uberon, PROV-O, DCAT, CHMO, AFO (web-fetched) |
| RML mappings | mapping/rml/ | One template per source (GNPS, FBMN, MN, ReDU) |
| Competency questions | mapping/load_and_query_kg.py | SPARQL queries; results written locally to CQ/ |

Schema and prefixes

Schema/ holds the full schema for MetaBoKG: class declarations, property declarations, and the ReDU class hierarchy (metabokg.ttl, reDU_extraction_collection.ttl, reDU_internal_standard.ttl, reDU_organism.ttl, reDU_sample_type.ttl). Two namespaces are used throughout:

| Prefix | IRI | Role |
|---|---|---|
| MBS: | <https://ns.inria.fr/metaboKG/schema/> | Schema-level terms (classes, props) |
| MBD: | <https://ns.inria.fr/metaboKG/data/> | Instance-level resources (data) |

A complete visual model of the knowledge graph (entities, properties, and external-ontology anchors) is available at doc/MetaboKG.svg.

Competency questions

The evaluation suite in mapping/load_and_query_kg.py answers the four competency questions defined in the paper:

  1. CQ1 — Do GNPS annotations land on samples whose biological and environmental context has been harmonized in Pan-ReDU? (joins each annotation to its sample, collection, and NCBITaxon source through the Universal Annotation Identifier.)
  2. CQ2 — How does spectral match quality vary across studies? (stratifies annotations by MQScore and shared peaks attached to the identification activity.)
  3. CQ3 — For a given annotation, are the ClassyFire and NPClassifier taxonomies consistent, and which pairs co-occur most often?
  4. CQ4 — How well is each compound covered across the reference spectral libraries? (counts distinct libraries per InChIKey to flag compounds that rely on a single source.)
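
The logic behind CQ4 can be restated in a few lines of Python. This is not the SPARQL query shipped in mapping/load_and_query_kg.py; it is a sketch that assumes the query results have already been reduced to (InChIKey, library) pairs. The rows below are placeholders, not real results.

```python
from collections import defaultdict


def single_library_compounds(rows):
    """Return the InChIKeys that appear in exactly one reference library."""
    libraries = defaultdict(set)
    for inchikey, library in rows:
        libraries[inchikey].add(library)  # sets collapse repeated hits
    return sorted(key for key, libs in libraries.items() if len(libs) == 1)


# Placeholder rows for illustration only.
rows = [
    ("KEY-A", "LIBRARY-1"),
    ("KEY-A", "LIBRARY-2"),
    ("KEY-B", "LIBRARY-1"),
    ("KEY-B", "LIBRARY-1"),  # duplicate hit in the same library
]
print(single_library_compounds(rows))
```

Counting distinct libraries rather than raw hits matters here: a compound matched many times in one library still relies on a single source.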

Running Virtuoso by hand

main.py load wraps everything below. The raw commands are documented here in case you want to drive Virtuoso directly.

Volumes mounted by docker-compose.virtuoso.yml:

| Host | Container | Contents |
|---|---|---|
| mapping/kg/ | /data | morph-kgc materialised TTLs (one per GNPS / ReDU) |
| Schema/ | /schema | Project schema (metabokg.ttl and ReDU mappings) |
| Ontology/ | /ontology | External ontologies (sio.owl, envo.owl, ncbitaxon.owl, …) |

Start Virtuoso

docker compose -f docker-compose.virtuoso.yml up -d

The SPARQL endpoint is available at http://localhost:8890/sparql, and ISQL listens on port 1111. State persists in the virtuoso-db named volume.

Full bulk load

docker exec -i metabokg-virtuoso isql 1111 dba dba <<'SQL'
ld_dir('/data',     '*.ttl', 'https://ns.inria.fr/metaboKG/graph/main');
ld_dir('/schema',   '*.ttl', 'https://ns.inria.fr/metaboKG/graph/schema');
ld_dir('/ontology', '*.ttl', 'https://ns.inria.fr/metaboKG/graph/ontology');
ld_dir('/ontology', '*.owl', 'https://ns.inria.fr/metaboKG/graph/ontology');
rdf_loader_run();
checkpoint;
SELECT ll_state, count(*) FROM DB.DBA.LOAD_LIST GROUP BY ll_state;
SQL

Notes:

  • rdf_loader_run() blocks until all queued files complete — for the full GNPS set expect several minutes. Don't Ctrl-C; to watch progress, open a second ISQL session and re-run the SELECT ll_state, count(*) … query.
  • States in DB.DBA.LOAD_LIST: 0 queued, 1 in progress, 2 loaded. To reset rows stuck at 1 after an aborted run: UPDATE DB.DBA.LOAD_LIST SET ll_state = 0, ll_started = NULL WHERE ll_state = 1;.
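
One way to confirm that a bulk load actually populated a graph is to count its triples over HTTP. The sketch below uses only the Python standard library, the endpoint URL given in this README, and the main graph IRI from the ld_dir commands; the helper names count_query, build_request, and run_select are hypothetical, and the endpoint is assumed to accept unauthenticated GET queries.

```python
import json
import urllib.parse
import urllib.request

ENDPOINT = "http://localhost:8890/sparql"
MAIN_GRAPH = "https://ns.inria.fr/metaboKG/graph/main"


def count_query(graph_iri: str) -> str:
    """Build a SPARQL query that counts the triples in one named graph."""
    return f"SELECT (COUNT(*) AS ?n) WHERE {{ GRAPH <{graph_iri}> {{ ?s ?p ?o }} }}"


def build_request(query: str, endpoint: str = ENDPOINT) -> urllib.request.Request:
    """Encode the query as a GET request asking for JSON results."""
    url = endpoint + "?" + urllib.parse.urlencode({"query": query})
    return urllib.request.Request(
        url, headers={"Accept": "application/sparql-results+json"}
    )


def run_select(query: str, endpoint: str = ENDPOINT) -> list[dict]:
    """Send the query and return the result bindings (needs a running endpoint)."""
    with urllib.request.urlopen(build_request(query, endpoint), timeout=60) as resp:
        return json.load(resp)["results"]["bindings"]


if __name__ == "__main__":
    # Only the query text is printed here; call run_select() once Virtuoso is up.
    print(count_query(MAIN_GRAPH))
```

With the container running, run_select(count_query(MAIN_GRAPH)) should return a single binding whose ?n value is the triple count of the main graph.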

Reload only the data graph

ld_dir does not deduplicate against previous loads. To rebuild /data after a fresh materialisation while keeping schema/ontology graphs in place:

uv run python main.py load --reload

or by hand:

docker exec -i metabokg-virtuoso isql 1111 dba dba <<'SQL'
SPARQL CLEAR GRAPH <https://ns.inria.fr/metaboKG/graph/main>;
DELETE FROM DB.DBA.LOAD_LIST WHERE ll_graph = 'https://ns.inria.fr/metaboKG/graph/main';
ld_dir('/data', '*.ttl', 'https://ns.inria.fr/metaboKG/graph/main');
rdf_loader_run();
checkpoint;
SQL

To wipe everything and start fresh: docker compose -f docker-compose.virtuoso.yml down -v && docker compose -f docker-compose.virtuoso.yml up -d.

Editing mounts on an existing volume

Virtuoso generates virtuoso.ini from env vars only on first boot and reads it from the persistent /database volume on every subsequent start. If you add a new mount after the volume already exists, DirsAllowed won't pick it up. Patch the ini in place:

docker exec metabokg-virtuoso sh -c \
  "sed -i 's|^DirsAllowed.*|DirsAllowed = ., /opt/virtuoso-opensource/share/virtuoso/vad, /data, /schema, /ontology|' /database/virtuoso.ini"
docker restart metabokg-virtuoso
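
The sed command above overwrites the whole DirsAllowed line. If you would rather keep whatever entries are already there and only append missing ones, the same patch can be expressed as a merge. The sketch below works on the ini text as a string; the function name add_allowed_dirs is hypothetical, and reading and writing /database/virtuoso.ini inside the container is left to the caller.

```python
import re


def add_allowed_dirs(ini_text: str, new_dirs: list[str]) -> str:
    """Append missing directories to the DirsAllowed line, keeping existing ones."""

    def patch(match: re.Match) -> str:
        current = [d.strip() for d in match.group(2).split(",") if d.strip()]
        merged = current + [d for d in new_dirs if d not in current]
        return match.group(1) + ", ".join(merged)

    # (?m) lets ^ and $ match at line boundaries inside the ini text
    return re.sub(r"(?m)^(DirsAllowed\s*=\s*)(.*)$", patch, ini_text)


# Minimal made-up ini fragment for illustration.
ini = "[Parameters]\nDirsAllowed = ., /data\nOther = 1\n"
print(add_allowed_dirs(ini, ["/data", "/schema", "/ontology"]))
```

As with the sed variant, the container still has to be restarted before Virtuoso picks up the change.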

Contributing

Contributions are welcome — open a pull request or start a discussion if you want to extend the ontology, the RML mappings, the extractor heuristics, or add a competency question. Bug reports with a reproducible job ID are especially appreciated.

License

Licensed under the Apache License, Version 2.0.

Contact

Matthieu Feraud — matthieu.feraud@univ-cotedazur.fr.
