# Retrieving metabolite information

Adding metabolite-related information to biomedical knowledge graphs is increasingly important for describing and learning about genome druggability and disease pathology, with applications such as drug repurposing and disease biomarker discovery. In this BioHackathon, we investigated how to retrieve metabolite associations from curated databases, focusing on the use case of drug repurposing for inborn errors of metabolism.

(uniprot)=
## UniProt

See [the intro](intro.md) for background. UniProt is a comprehensive knowledge base whose metabolite content focuses on natural products.

[Example SPARQL notebook for SIB](https://github.com/biosoda/tutorial_orthology/blob/master/Orthology_SPARQL_Notebook.ipynb)

[Another example of SPARQL notebook for BIND](https://github.com/LUMC-BioSemantics/bind/blob/main/queries/bind_wp_queries_model_v1.ipynb)

### Exploring genome druggability


#### Retrieving drug candidates based on activity metabolite similarity (UniProt, Rhea, ...)
1. In UniProt, query for catalytic reactions (examples #39 and #40 at https://sparql.uniprot.org/.well-known/sparql-examples/). First retrieve all enzymes, then each catalytic reaction and its accession ID, and finally filter for human enzymes.

In [1]:
# first imports to use SPARQL endpoints of each source and execute SPARQL queries
import sys, os, time
!{sys.executable} -m pip install SPARQLWrapper
from SPARQLWrapper import SPARQLWrapper, JSON
import pandas as pd

# always display full column results (don't truncate output);
# None means "no limit" (recent pandas versions reject the old value -1)
pd.set_option('display.max_colwidth', None)

# define the endpoints as wrappers for executing SPARQL queries
sparql_uniprot = SPARQLWrapper("https://sparql.uniprot.org/sparql")

# convert the JSON results of a SPARQL SELECT query into a Pandas DataFrame
def pretty_print(results):
    # column names come from the result head, so empty result sets and
    # unbound OPTIONAL variables still yield every selected column
    header = results["head"]["vars"]

    # the solutions of the query are available in the "results", "bindings" entry
    table = []
    for entry in results["results"]["bindings"]:
        # unbound variables are absent from the entry and become None
        row = [entry[column]["value"] if column in entry else None for column in header]
        table.append(row)
    return pd.DataFrame(table, columns=header)





In [2]:
# Queries in UniProt-RDF
# Query for human enzymes and retrieve the catalytic reaction(s) IDs in Rhea

query_human_enzymes = """
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
PREFIX wikibase: <http://wikiba.se/ontology#>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX vg: <http://biohackathon.org/resource/vg#>
PREFIX up: <http://purl.uniprot.org/core/>
PREFIX uniprotkb: <http://purl.uniprot.org/uniprot/>
PREFIX uberon: <http://purl.obolibrary.org/obo/uo#>
PREFIX taxon: <http://purl.uniprot.org/taxonomy/>
PREFIX sp: <http://spinrdf.org/sp#>
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
PREFIX sio: <http://semanticscience.org/resource/>
PREFIX sh: <http://www.w3.org/ns/shacl#>
PREFIX schema: <http://schema.org/>
PREFIX sachem: <http://bioinfo.uochb.cas.cz/rdf/v1.0/sachem#>
PREFIX rh: <http://rdf.rhea-db.org/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX pubmed: <http://rdf.ncbi.nlm.nih.gov/pubmed/>
PREFIX ps: <http://www.wikidata.org/prop/statement/>
PREFIX pq: <http://www.wikidata.org/prop/qualifier/>
PREFIX patent: <http://data.epo.org/linked-data/def/patent/>
PREFIX p: <http://www.wikidata.org/prop/>
PREFIX owl: <http://www.w3.org/2002/07/owl#>
PREFIX orthodbGroup: <http://purl.orthodb.org/odbgroup/>
PREFIX orthodb: <http://purl.orthodb.org/>
PREFIX orth: <http://purl.org/net/orth#>
PREFIX obo: <http://purl.obolibrary.org/obo/>
PREFIX np: <http://nextprot.org/rdf#>
PREFIX nextprot: <http://nextprot.org/rdf/entry/>
PREFIX mnx: <https://rdf.metanetx.org/schema/>
PREFIX mnet: <https://rdf.metanetx.org/mnet/>
PREFIX mesh: <http://id.nlm.nih.gov/mesh/>
PREFIX lscr: <http://purl.org/lscr#>
PREFIX lipidmaps: <https://www.lipidmaps.org/rdf/>
PREFIX keywords: <http://purl.uniprot.org/keywords/>
PREFIX insdcschema: <http://ddbj.nig.ac.jp/ontologies/nucleotide/>
PREFIX insdc: <http://identifiers.org/insdc/>
PREFIX identifiers: <http://identifiers.org/>
PREFIX glyconnect: <https://purl.org/glyconnect/>
PREFIX glycan: <http://purl.jp/bio/12/glyco/glycan#>
PREFIX genex: <http://purl.org/genex#>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX faldo: <http://biohackathon.org/resource/faldo#>
PREFIX eunisSpecies: <http://eunis.eea.europa.eu/rdf/species-schema.rdf#>
PREFIX ensembltranscript: <http://rdf.ebi.ac.uk/resource/ensembl.transcript/>
PREFIX ensemblterms: <http://rdf.ebi.ac.uk/terms/ensembl/>
PREFIX ensemblprotein: <http://rdf.ebi.ac.uk/resource/ensembl.protein/>
PREFIX ensemblexon: <http://rdf.ebi.ac.uk/resource/ensembl.exon/>
PREFIX ensembl: <http://rdf.ebi.ac.uk/resource/ensembl/>
PREFIX ec: <http://purl.uniprot.org/enzyme/>
PREFIX dcterms: <http://purl.org/dc/terms/>
PREFIX dc: <http://purl.org/dc/terms/>
PREFIX chebislash: <http://purl.obolibrary.org/obo/chebi/>
PREFIX chebihash: <http://purl.obolibrary.org/obo/chebi#>
PREFIX cco: <http://rdf.ebi.ac.uk/terms/chembl#>
PREFIX busco: <http://busco.ezlab.org/schema#>
PREFIX bibo: <http://purl.org/ontology/bibo/>
PREFIX allie: <http://allie.dbcls.jp/>
PREFIX SWISSLIPID: <https://swisslipids.org/rdf/SLM_>
PREFIX GO: <http://purl.obolibrary.org/obo/GO_>
PREFIX ECO: <http://purl.obolibrary.org/obo/ECO_>
PREFIX CHEBI: <http://purl.obolibrary.org/obo/CHEBI_>
SELECT DISTINCT 
  ?protein 
  ?rhea
WHERE {
  ?protein up:annotation/up:catalyticActivity/up:catalyzedReaction ?rhea ;
    up:organism taxon:9606 .
}
"""

In [3]:
# set the query to be executed against the UniProt endpoint and set the return format to JSON
sparql_uniprot.setQuery(query_human_enzymes)
sparql_uniprot.setReturnFormat(JSON)

NUM_EXAMPLES=3
results_human_enzymes = sparql_uniprot.query().convert()
pretty_print(results_human_enzymes).head(NUM_EXAMPLES)

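This query covers every human enzyme, so it may run for a long time on the public endpoint. A minimal sketch for exploratory runs, assuming a capped number of rows is acceptable: the helper `with_limit` appends a `LIMIT` clause, and `run_select` posts the query with a client-side timeout using only the standard library. Both helper names are hypothetical and are not part of SPARQLWrapper.

```python
import json
import re
import urllib.parse
import urllib.request


def with_limit(query: str, limit: int) -> str:
    """Append 'LIMIT n' unless the query text already ends with a LIMIT clause.

    Note: a query ending in 'LIMIT n OFFSET m' is not detected by this sketch.
    """
    if limit <= 0:
        raise ValueError("limit must be a positive integer")
    stripped = query.rstrip()
    if re.search(r"\bLIMIT\s+\d+\s*$", stripped, flags=re.IGNORECASE):
        return stripped
    return f"{stripped}\nLIMIT {limit}"


def run_select(endpoint: str, query: str, timeout: float = 60.0) -> dict:
    """POST a SELECT query and return the parsed SPARQL JSON results."""
    data = urllib.parse.urlencode({"query": query}).encode("utf-8")
    request = urllib.request.Request(
        endpoint,
        data=data,
        headers={"Accept": "application/sparql-results+json"},
    )
    # urlopen raises an error if the server does not answer within `timeout` seconds
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)


# Example (requires network access), reusing the query string defined above:
# results = run_select("https://sparql.uniprot.org/sparql",
#                      with_limit(query_human_enzymes, 100))
```

The returned dictionary has the same layout as the output of `sparql_uniprot.query().convert()`, so it can be passed to `pretty_print` unchanged.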

2. In Rhea, query for the metabolites participating in catalytic reactions (example #13). With the reaction IDs from the previous step, query Rhea to retrieve the reaction participants involved in metabolic pathways, and get the ChEBI IDs of both reactants and products. Then cross these identifiers with IDSM Sachem to expand the known ChEBI compounds with similar compounds in PubChem. From PubChem we can retrieve three types of compounds: (1) known drugs that are active against protein targets (useful for link prediction); (2) ChEMBL targets with known sensitivity; (3) compounds of unknown relevance.

In [11]:
# Federated query from UniProt-RDF to Rhea
# For human enzymes, retrieve the catalytic reaction(s) in Rhea and
# the metabolites (ChEBI) participating in those reactions

query_metabolites = """
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
PREFIX wikibase: <http://wikiba.se/ontology#>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX vg: <http://biohackathon.org/resource/vg#>
PREFIX up: <http://purl.uniprot.org/core/>
PREFIX uniprotkb: <http://purl.uniprot.org/uniprot/>
PREFIX uberon: <http://purl.obolibrary.org/obo/uo#>
PREFIX taxon: <http://purl.uniprot.org/taxonomy/>
PREFIX sp: <http://spinrdf.org/sp#>
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
PREFIX sio: <http://semanticscience.org/resource/>
PREFIX sh: <http://www.w3.org/ns/shacl#>
PREFIX schema: <http://schema.org/>
PREFIX sachem: <http://bioinfo.uochb.cas.cz/rdf/v1.0/sachem#>
PREFIX rh: <http://rdf.rhea-db.org/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX pubmed: <http://rdf.ncbi.nlm.nih.gov/pubmed/>
PREFIX ps: <http://www.wikidata.org/prop/statement/>
PREFIX pq: <http://www.wikidata.org/prop/qualifier/>
PREFIX patent: <http://data.epo.org/linked-data/def/patent/>
PREFIX p: <http://www.wikidata.org/prop/>
PREFIX owl: <http://www.w3.org/2002/07/owl#>
PREFIX orthodbGroup: <http://purl.orthodb.org/odbgroup/>
PREFIX orthodb: <http://purl.orthodb.org/>
PREFIX orth: <http://purl.org/net/orth#>
PREFIX obo: <http://purl.obolibrary.org/obo/>
PREFIX np: <http://nextprot.org/rdf#>
PREFIX nextprot: <http://nextprot.org/rdf/entry/>
PREFIX mnx: <https://rdf.metanetx.org/schema/>
PREFIX mnet: <https://rdf.metanetx.org/mnet/>
PREFIX mesh: <http://id.nlm.nih.gov/mesh/>
PREFIX lscr: <http://purl.org/lscr#>
PREFIX lipidmaps: <https://www.lipidmaps.org/rdf/>
PREFIX keywords: <http://purl.uniprot.org/keywords/>
PREFIX insdcschema: <http://ddbj.nig.ac.jp/ontologies/nucleotide/>
PREFIX insdc: <http://identifiers.org/insdc/>
PREFIX identifiers: <http://identifiers.org/>
PREFIX glyconnect: <https://purl.org/glyconnect/>
PREFIX glycan: <http://purl.jp/bio/12/glyco/glycan#>
PREFIX genex: <http://purl.org/genex#>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX faldo: <http://biohackathon.org/resource/faldo#>
PREFIX eunisSpecies: <http://eunis.eea.europa.eu/rdf/species-schema.rdf#>
PREFIX ensembltranscript: <http://rdf.ebi.ac.uk/resource/ensembl.transcript/>
PREFIX ensemblterms: <http://rdf.ebi.ac.uk/terms/ensembl/>
PREFIX ensemblprotein: <http://rdf.ebi.ac.uk/resource/ensembl.protein/>
PREFIX ensemblexon: <http://rdf.ebi.ac.uk/resource/ensembl.exon/>
PREFIX ensembl: <http://rdf.ebi.ac.uk/resource/ensembl/>
PREFIX ec: <http://purl.uniprot.org/enzyme/>
PREFIX dcterms: <http://purl.org/dc/terms/>
PREFIX dc: <http://purl.org/dc/terms/>
PREFIX chebislash: <http://purl.obolibrary.org/obo/chebi/>
PREFIX chebihash: <http://purl.obolibrary.org/obo/chebi#>
PREFIX cco: <http://rdf.ebi.ac.uk/terms/chembl#>
PREFIX busco: <http://busco.ezlab.org/schema#>
PREFIX bibo: <http://purl.org/ontology/bibo/>
PREFIX allie: <http://allie.dbcls.jp/>
PREFIX SWISSLIPID: <https://swisslipids.org/rdf/SLM_>
PREFIX GO: <http://purl.obolibrary.org/obo/GO_>
PREFIX ECO: <http://purl.obolibrary.org/obo/ECO_>
PREFIX CHEBI: <http://purl.obolibrary.org/obo/CHEBI_>
SELECT DISTINCT 
  ?protein 
  ?rhea
  ?chebi
WHERE {
  ?protein up:annotation/up:catalyticActivity/up:catalyzedReaction ?rhea ;
    up:organism taxon:9606 .
  # Query Rhea for the compounds (ChEBI classes) participating in each reaction
  SERVICE <https://sparql.rhea-db.org/sparql> {
        ?rhea rh:side/rh:contains/rh:compound/rdfs:subClassOf ?chebi .
      }
}
"""

In [12]:
# set the query to be executed against the UniProt endpoint and set the return format to JSON
sparql_uniprot.setQuery(query_metabolites)
sparql_uniprot.setReturnFormat(JSON)

NUM_EXAMPLES=3
results_metabolites = sparql_uniprot.query().convert()
pretty_print(results_metabolites).head(NUM_EXAMPLES)

| | chebi | protein | rhea |
|---|---|---|---|
| 0 | http://rdf.rhea-db.org/SmallMolecule | http://purl.uniprot.org/uniprot/B2RB89 | http://rdf.rhea-db.org/17989 |
| 1 | http://purl.obolibrary.org/obo/CHEBI_30616 | http://purl.uniprot.org/uniprot/B2RB89 | http://rdf.rhea-db.org/17989 |
| 2 | http://rdf.rhea-db.org/GenericPolypeptide | http://purl.uniprot.org/uniprot/B2RB89 | http://rdf.rhea-db.org/17989 |
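The sample rows above mix ChEBI IRIs with Rhea class IRIs such as `SmallMolecule` and `GenericPolypeptide`, presumably because each compound is also declared a subclass of its Rhea compound type. A minimal sketch, assuming only ChEBI-identified participants are wanted downstream: the hypothetical helper `chebi_accession` keeps IRIs in the `CHEBI_` namespace and turns them into compact accessions.

```python
from typing import Optional

CHEBI_PREFIX = "http://purl.obolibrary.org/obo/CHEBI_"


def chebi_accession(iri: str) -> Optional[str]:
    """Return 'CHEBI:<number>' for a ChEBI IRI, or None for any other IRI."""
    if not iri.startswith(CHEBI_PREFIX):
        return None
    number = iri[len(CHEBI_PREFIX):]
    return f"CHEBI:{number}" if number.isdigit() else None


# The three ?chebi values from the sample output above.
values = [
    "http://rdf.rhea-db.org/SmallMolecule",
    "http://purl.obolibrary.org/obo/CHEBI_30616",
    "http://rdf.rhea-db.org/GenericPolypeptide",
]

# Keep only the values that are real ChEBI identifiers.
accessions = [acc for acc in map(chebi_accession, values) if acc is not None]
print(accessions)
```

The same restriction could probably be pushed into the query itself with a standard SPARQL filter such as `FILTER(STRSTARTS(STR(?chebi), "http://purl.obolibrary.org/obo/CHEBI_"))`, which would also reduce the number of rows transferred.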


#### Retrieving drug candidates based on structural metabolite similarity
Starting from the user's metabolite drug candidates, retrieve similar ligands in PDB.

#### Explore DBCLS databases for ligands and druggability

### Exploring disease pathology

#### Retrieval of curated disease-metabolite associations
From IMCD, retrieve a curated dataset of metabolite-disease associations. The Recon models (Leiden) are another data source rich in this metabolite content.

#### Retrieval of isoforms


#### Retrieval of disease enzymes that interact with Cholestane metabolite substructure

_Author_: Jerven Bolleman, Principal Software Engineer at SIB

_Source_: https://www.linkedin.com/feed/update/activity:7171566017431724033/?trk=feed_main-feed-card_social-actions-comments

Here Bolleman et al. show the power of combining the IDSM (small molecules, https://lnkd.in/e2JWCd2e), Rhea (https://lnkd.in/e93Pzuva) and UniProt (https://lnkd.in/emDE6CKi) SPARQL endpoints.

The following query uses these three independent SPARQL endpoints:
https://lnkd.in/ejXrwUH3

It searches for all human UniProtKB entries that are annotated as involved in a disease and that are enzymes catalyzing reactions in which a substrate or product has a Cholestane substructure.

We only need the SMILES representation of a Cholestane (https://lnkd.in/e83mrQMr) to start the search.
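Since the search starts from nothing but a SMILES string, the Sachem part of the query can be generated for any structure. The sketch below builds the same `sachem:substructureSearch` pattern that the query in this section uses. The helper names `sparql_string` and `substructure_service` are hypothetical, and escaping only backslashes and double quotes is an assumption about what SMILES input needs.

```python
IDSM_CHEBI_ENDPOINT = "https://idsm.elixir-czech.cz/sparql/endpoint/chebi"


def sparql_string(value: str) -> str:
    """Quote a Python string as a SPARQL double-quoted literal."""
    # Backslashes must be escaped first, then double quotes.
    escaped = value.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def substructure_service(smiles: str, variable: str = "chebi") -> str:
    """Return a SERVICE block binding ?variable to ChEBI hits for the SMILES.

    The enclosing query must declare the `sachem:` prefix.
    """
    return (
        f"SERVICE <{IDSM_CHEBI_ENDPOINT}> {{\n"
        f"  ?{variable} sachem:substructureSearch [\n"
        f"    sachem:query {sparql_string(smiles)}\n"
        f"  ] .\n"
        f"}}"
    )


# A SMILES with a bond-direction backslash, to show the escaping.
print(substructure_service("C/C=C\\C"))
```

The escaping matters because SMILES can contain backslashes for double-bond stereochemistry, which would otherwise be read as escape sequences inside a SPARQL string literal.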

In [2]:
# From UniProt, retrieve all human disease enzymes related to Cholestane metabolite
# query (based on Jerven Bolleman's post on LinkedIn)
query_disease_enzymes_cholestane = """
    PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
    PREFIX wikibase: <http://wikiba.se/ontology#>
    PREFIX wdt: <http://www.wikidata.org/prop/direct/>
    PREFIX wd: <http://www.wikidata.org/entity/>
    PREFIX vg: <http://biohackathon.org/resource/vg#>
    PREFIX up: <http://purl.uniprot.org/core/>
    PREFIX uniprotkb: <http://purl.uniprot.org/uniprot/>
    PREFIX uberon: <http://purl.obolibrary.org/obo/uo#>
    PREFIX taxon: <http://purl.uniprot.org/taxonomy/>
    PREFIX sp: <http://spinrdf.org/sp#>
    PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
    PREFIX sio: <http://semanticscience.org/resource/>
    PREFIX sh: <http://www.w3.org/ns/shacl#>
    PREFIX schema: <http://schema.org/>
    PREFIX sachem: <http://bioinfo.uochb.cas.cz/rdf/v1.0/sachem#>
    PREFIX rh: <http://rdf.rhea-db.org/>
    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
    PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
    PREFIX pubmed: <http://rdf.ncbi.nlm.nih.gov/pubmed/>
    PREFIX ps: <http://www.wikidata.org/prop/statement/>
    PREFIX pq: <http://www.wikidata.org/prop/qualifier/>
    PREFIX patent: <http://data.epo.org/linked-data/def/patent/>
    PREFIX p: <http://www.wikidata.org/prop/>
    PREFIX owl: <http://www.w3.org/2002/07/owl#>
    PREFIX orthodbGroup: <http://purl.orthodb.org/odbgroup/>
    PREFIX orthodb: <http://purl.orthodb.org/>
    PREFIX orth: <http://purl.org/net/orth#>
    PREFIX obo: <http://purl.obolibrary.org/obo/>
    PREFIX np: <http://nextprot.org/rdf#>
    PREFIX nextprot: <http://nextprot.org/rdf/entry/>
    PREFIX mnx: <https://rdf.metanetx.org/schema/>
    PREFIX mnet: <https://rdf.metanetx.org/mnet/>
    PREFIX mesh: <http://id.nlm.nih.gov/mesh/>
    PREFIX lscr: <http://purl.org/lscr#>
    PREFIX lipidmaps: <https://www.lipidmaps.org/rdf/>
    PREFIX keywords: <http://purl.uniprot.org/keywords/>
    PREFIX insdcschema: <http://ddbj.nig.ac.jp/ontologies/nucleotide/>
    PREFIX insdc: <http://identifiers.org/insdc/>
    PREFIX identifiers: <http://identifiers.org/>
    PREFIX glyconnect: <https://purl.org/glyconnect/>
    PREFIX glycan: <http://purl.jp/bio/12/glyco/glycan#>
    PREFIX genex: <http://purl.org/genex#>
    PREFIX foaf: <http://xmlns.com/foaf/0.1/>
    PREFIX faldo: <http://biohackathon.org/resource/faldo#>
    PREFIX eunisSpecies: <http://eunis.eea.europa.eu/rdf/species-schema.rdf#>
    PREFIX ensembltranscript: <http://rdf.ebi.ac.uk/resource/ensembl.transcript/>
    PREFIX ensemblterms: <http://rdf.ebi.ac.uk/terms/ensembl/>
    PREFIX ensemblprotein: <http://rdf.ebi.ac.uk/resource/ensembl.protein/>
    PREFIX ensemblexon: <http://rdf.ebi.ac.uk/resource/ensembl.exon/>
    PREFIX ensembl: <http://rdf.ebi.ac.uk/resource/ensembl/>
    PREFIX ec: <http://purl.uniprot.org/enzyme/>
    PREFIX dcterms: <http://purl.org/dc/terms/>
    PREFIX dc: <http://purl.org/dc/terms/>
    PREFIX chebislash: <http://purl.obolibrary.org/obo/chebi/>
    PREFIX chebihash: <http://purl.obolibrary.org/obo/chebi#>
    PREFIX cco: <http://rdf.ebi.ac.uk/terms/chembl#>
    PREFIX busco: <http://busco.ezlab.org/schema#>
    PREFIX bibo: <http://purl.org/ontology/bibo/>
    PREFIX allie: <http://allie.dbcls.jp/>
    PREFIX SWISSLIPID: <https://swisslipids.org/rdf/SLM_>
    PREFIX GO: <http://purl.obolibrary.org/obo/GO_>
    PREFIX ECO: <http://purl.obolibrary.org/obo/ECO_>
    PREFIX CHEBI: <http://purl.obolibrary.org/obo/CHEBI_>
    SELECT DISTINCT 
      ?protein 
      ?disease
      ?rhea
      ?chebi
      ?omim
    WHERE {
      # find complete ChEBIs with a Cholestane skeleton, via the Czech Elixir node IDSM Sachem 
      # chemical substructure search.
      SERVICE <https://idsm.elixir-czech.cz/sparql/endpoint/chebi> {
        ?chebi sachem:substructureSearch [
            sachem:query
            "[H][C@@]1(CC[C@@]2([H])[C@]3([H])CCC4CCCC[C@]4(C)[C@@]3([H])CC[C@]12C)[C@H](C)CCCC(C)C"
        ] .
      }
      # Use the fact that UniProt catalytic activities are annotated using Rhea
      # Map the found ChEBIs to Rhea reactions
      SERVICE <https://sparql.rhea-db.org/sparql> {
        ?rhea rh:side/rh:contains/rh:compound/rdfs:subClassOf ?chebi .
      }
      # Match the found Rhea reactions with human UniProtKB proteins
      ?protein up:annotation/up:catalyticActivity/up:catalyzedReaction ?rhea ;
         up:organism taxon:9606 .
      # Find only those human entries that have an annotated related disease, and optionally 
      # map these to OMIM
      ?protein up:annotation/up:disease ?disease .
      OPTIONAL {
            ?disease rdfs:seeAlso ?omim .
            ?omim up:database <http://purl.uniprot.org/database/MIM> 
      }
    } 
"""

In [3]:
# set the query to be executed against the UniProt endpoint and set the return format to JSON
sparql_uniprot.setQuery(query_disease_enzymes_cholestane)
sparql_uniprot.setReturnFormat(JSON)

NUM_EXAMPLES=3
results_disease_enzymes = sparql_uniprot.query().convert()
pretty_print(results_disease_enzymes).head(NUM_EXAMPLES)

| | disease | chebi | omim | protein | rhea |
|---|---|---|---|---|---|
| 0 | http://purl.uniprot.org/diseases/330 | http://purl.obolibrary.org/obo/CHEBI_2288 | http://purl.uniprot.org/mim/235555 | http://purl.uniprot.org/uniprot/P51857 | http://rdf.rhea-db.org/46632 |
| 1 | http://purl.uniprot.org/diseases/330 | http://purl.obolibrary.org/obo/CHEBI_2290 | http://purl.uniprot.org/mim/235555 | http://purl.uniprot.org/uniprot/P51857 | http://rdf.rhea-db.org/46640 |
| 2 | http://purl.uniprot.org/diseases/330 | http://purl.obolibrary.org/obo/CHEBI_16074 | http://purl.uniprot.org/mim/235555 | http://purl.uniprot.org/uniprot/P51857 | http://rdf.rhea-db.org/11524 |
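The result rows carry full IRIs. For reporting, it can help to shorten them and to group the reactions per protein and disease. The sketch below does this for the three sample rows shown above, using only the standard library; `local_id` and the grouping layout are illustrative choices, not part of any of the endpoints.

```python
from collections import defaultdict


def local_id(iri: str) -> str:
    """Return the last path segment of an IRI, e.g. '.../uniprot/P51857' -> 'P51857'."""
    return iri.rstrip("/").rsplit("/", 1)[-1]


# Sample rows copied from the output above: (protein, disease, omim, rhea).
rows = [
    ("http://purl.uniprot.org/uniprot/P51857", "http://purl.uniprot.org/diseases/330",
     "http://purl.uniprot.org/mim/235555", "http://rdf.rhea-db.org/46632"),
    ("http://purl.uniprot.org/uniprot/P51857", "http://purl.uniprot.org/diseases/330",
     "http://purl.uniprot.org/mim/235555", "http://rdf.rhea-db.org/46640"),
    ("http://purl.uniprot.org/uniprot/P51857", "http://purl.uniprot.org/diseases/330",
     "http://purl.uniprot.org/mim/235555", "http://rdf.rhea-db.org/11524"),
]

# Group Rhea reactions by (protein accession, disease ID, OMIM number).
reactions = defaultdict(list)
for protein, disease, omim, rhea in rows:
    # ?omim comes from an OPTIONAL block, so it may be missing in real results
    omim_id = local_id(omim) if omim else None
    key = (local_id(protein), local_id(disease), omim_id)
    reactions[key].append("RHEA:" + local_id(rhea))

for (protein, disease, omim_id), rheas in reactions.items():
    print(protein, disease, omim_id, ", ".join(rheas))
```

With real results, the rows could be taken from the DataFrame returned by `pretty_print` instead of the hard-coded list.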


(pubchem)=
## PubChem

See [the intro](intro.md) for background. PubChem is a comprehensive knowledge base whose metabolite content draws on several curated databases such as HMDB. Because of the large amount of data and the complexity of PubChem's structure, information retrieval has to be tailored to the user's use case. Once the use case is clear, PubChem offers a set of tools that help users retrieve this information more systematically.


### Exploring disease pathology

#### Retrieval of curated disease-metabolite associations
Subset the human metabolites from HMDB and cross them with ChEBI (metabolite role) via JOIN or UNION.
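The choice between the two ways of crossing the data matters: a JOIN keeps only metabolites present in both sources, while a UNION keeps everything found in either. A toy sketch in plain Python, keyed on ChEBI accession, that roughly mirrors the two behaviours. All records below are made-up placeholders, not real HMDB or ChEBI content.

```python
# Hypothetical placeholder records keyed by ChEBI accession (not real data).
hmdb_human = {"CHEBI:X1": "HMDB:A", "CHEBI:X2": "HMDB:B"}
chebi_roles = {"CHEBI:X2": "human metabolite", "CHEBI:X3": "drug"}

# JOIN: only accessions found in both sources, with fields from each.
joined = {
    acc: {"hmdb": hmdb_human[acc], "role": chebi_roles[acc]}
    for acc in sorted(hmdb_human.keys() & chebi_roles.keys())
}

# UNION: every accession from either source; missing fields stay None.
unioned = {
    acc: {"hmdb": hmdb_human.get(acc), "role": chebi_roles.get(acc)}
    for acc in sorted(hmdb_human.keys() | chebi_roles.keys())
}

print(joined)
print(list(unioned))
```

A JOIN is the stricter option and suits cases where the ChEBI role is required for every metabolite; a UNION preserves HMDB entries without a ChEBI role at the cost of incomplete rows.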

## Other databases 

We also asked the plant and microbiome communities whether they use other databases to retrieve metabolite-related data. The plant community is focusing mainly on sequence and phenomic data for now. The microbiome community uses KEGG (the enzyme sequence list of each microbe).

For MS data, they are not aware of many MS databases for microbes, and there may be no standard one. They usually obtain metabolomics data by mass spectrometry, such as GC-MS, LC-MS or CE-TOF/MS. "Metabolite-related" data also includes metabolic pathways, enzymes and the reactions they are involved in; for pathways, KEGG or MetaCyc would work, and for predicting enzymes, DeepGO or a similar system.