# Use Case: Connecting EPA AOP-DB Data to Extant Databases with RDF and SPARQL <a name="top"></a>

<div class="alert alert-success">
    The following Python code executes <strong>SPARQL</strong> queries on <strong>AOP-DB</strong> and related databases. The first example queries AOP-DB alone. The second example is a <strong>federated</strong> SPARQL query, that is, a single query that retrieves linked RDF data from multiple databases.
<br><br>
This document shows how SPARQL can be used to access AOP-DB RDF data and link it with extant datasets. For more detailed information on RDF, SPARQL, and other semantic technologies, please visit the <a href="https://www.w3.org/standards/semanticweb/#w3c_overview">official semantic web documentation</a>.
</div>

<div class="alert alert-info">
The <strong>SPARQLWrapper</strong> package allows SPARQL query execution in a Python environment. The <strong>pandas</strong> package is used for data manipulation and display. 
</div>

In [1]:
from SPARQLWrapper import SPARQLWrapper, JSON
import pandas as pd

<div class="alert alert-info">
The functions defined below run a query, request its results as JSON, and convert that JSON into a pandas dataframe.
</div>

In [2]:
# This function takes a SPARQLWrapper object with a query already set, requests JSON results,
# runs the query, and returns the response parsed into a Python dict
def convertjson(sparql):
    sparql.setReturnFormat(JSON)
    return sparql.query().convert()

In [3]:
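# This function unpacks the queried data into a pandas dataframe
def sparql_to_df(results):
    # Column names are the variables listed in the SELECT clause
    head = [str(header) for header in results["head"]["vars"]]

    # One list of values per column
    dbdict = {column: [] for column in head}

    for result in results["results"]["bindings"]:
        for item in head:
            # A variable with no value in this row is absent from the binding,
            # so fall back to None instead of raising a KeyError
            dbdict[item].append(result[item]["value"] if item in result else None)

    return pd.DataFrame.from_dict(dbdict)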
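
To make the input to `sparql_to_df` concrete, here is a minimal sketch of the SPARQL JSON results layout it walks: `head.vars` lists the selected variables, and each entry in `results.bindings` maps a variable name to a small object whose `value` field holds the data. The sample dict is written by hand for illustration, and the helper name `bindings_to_rows` is an assumption, not part of the original notebook. The last row leaves `ToxCast_assay` out on purpose to show how an unbound variable appears.

```python
# Hand-written stand-in for the dict that a SPARQL JSON response parses to.
# Values are illustrative; the last row omits ToxCast_assay to mimic an unbound variable.
sample_results = {
    "head": {"vars": ["gene", "ToxCast_assay"]},
    "results": {
        "bindings": [
            {
                "gene": {"type": "literal", "value": "150"},
                "ToxCast_assay": {"type": "literal", "value": "NVS_GPCR_hAdra2A"},
            },
            {
                "gene": {"type": "literal", "value": "154"},
                "ToxCast_assay": {"type": "literal", "value": "NVS_GPCR_hAdrb2"},
            },
            {"gene": {"type": "literal", "value": "1956"}},
        ]
    },
}


def bindings_to_rows(results):
    """Flatten SPARQL JSON results into one plain dict per result row."""
    variables = results["head"]["vars"]
    rows = []
    for binding in results["results"]["bindings"]:
        # Unbound variables are absent from the binding, so default to None
        rows.append({var: binding.get(var, {}).get("value") for var in variables})
    return rows


rows = bindings_to_rows(sample_results)
for row in rows:
    print(row)
```

A list of dicts such as `rows` can also be passed directly to `pd.DataFrame` to build the same table.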

<div class="alert alert-info">
    Now we will use SPARQLWrapper to connect to <a href="http://81.169.200.64:8892/sparql/">AOP-DB's SPARQL endpoint</a>.
</div>

In [4]:
aopdb = SPARQLWrapper("http://81.169.200.64:8892/sparql/")

<div class="alert alert-info">
    The SPARQL query in <strong><em>example one</em></strong> (the query itself is the text in triple quotes) retrieves AOP-DB genes and related EPA ToxCast assay information. The variables returned in a SPARQL query are denoted by a leading question mark. In this case, these variables are <strong>?gene</strong> and <strong>?ToxCast_assay</strong>.
</div>

### Example 1: Simple SPARQL Query of AOP-DB 
Retrieving Genes and ToxCast Assays

In [5]:
aopdb.setQuery("""
PREFIX mmo: <http://purl.obolibrary.org/obo/MMO_>
PREFIX edam: <http://edamontology.org/>
PREFIX dc: <http://purl.org/dc/elements/1.1/>

SELECT DISTINCT ?gene ?ToxCast_assay {

    ?gene_id mmo:0000441 ?assay_id;
            edam:data_1027 ?gene.
          
    ?assay_id dc:title ?ToxCast_assay.
    
    }LIMIT 50 
""")

In [6]:
gene_assay_Query = sparql_to_df(convertjson(aopdb))

<div class="alert alert-info">
   Below is a sample of the National Center for Biotechnology Information (NCBI) gene identification numbers,
    and their associated assays, returned by the SPARQL query of AOP-DB in example one.
</div>

In [7]:
gene_assay_Query.sort_values('gene').head(5)

| | gene | ToxCast_assay |
|---|---|---|
| 0 | 150 | NVS_GPCR_hAdra2A |
| 1 | 154 | NVS_GPCR_hAdrb2 |
| 31 | 1588 | TOX21_Aromatase_Inhibition |
| 30 | 1588 | NVS_ADME_hCYP19A1 |
| 46 | 1956 | NVS_ENZ_hEGFR |
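
Because one gene can map to several assays (gene 1588 appears twice in the sample), it is often handy to collapse the result to one entry per gene. The following is a minimal sketch using only the standard library, with the five sample rows typed in by hand; the name `assays_by_gene` is hypothetical. On the dataframe itself, a pandas `groupby` on the `gene` column would serve the same purpose.

```python
from collections import defaultdict

# The five (gene, ToxCast_assay) rows from the sample output, entered by hand
rows = [
    ("150", "NVS_GPCR_hAdra2A"),
    ("154", "NVS_GPCR_hAdrb2"),
    ("1588", "TOX21_Aromatase_Inhibition"),
    ("1588", "NVS_ADME_hCYP19A1"),
    ("1956", "NVS_ENZ_hEGFR"),
]

# Collect every assay under its NCBI gene ID, keeping the order the rows arrived in
assays_by_gene = defaultdict(list)
for gene, assay in rows:
    assays_by_gene[gene].append(assay)

for gene, assays in assays_by_gene.items():
    print(gene, assays)
```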


<div class="alert alert-info">
    Example one is a simple demonstration of how SPARQL queries retrieve information from RDF data. <strong><em>Example two</em></strong>, a federated SPARQL query, retrieves linked data from four databases: <strong>AOP-DB</strong>, <strong>AOP-Wiki</strong>, <strong>Protein Ontology</strong>, and <strong>WikiPathways</strong>.
    <br><br>
   This second example begins much like the first, retrieving gene and ToxCast assay information from AOP-DB. It then extends into the <a href="https://aopwiki.rdf.bigcat-bioinformatics.org/sparql">AOP-Wiki SPARQL endpoint</a> using the <strong>SERVICE</strong> keyword. Because AOP-DB and AOP-Wiki hold RDF data describing different attributes of the same genes, the query can retrieve those genes' protein and key event information from AOP-Wiki. The query then makes another SERVICE call to the <a href="https://lod.proconsortium.org/yasgui.html">Protein Ontology SPARQL endpoint</a> to obtain detailed descriptions of each protein. Finally, it calls the <a href="http://sparql.wikipathways.org/sparql">WikiPathways</a> endpoint to retrieve the biological pathways that the genes are part of, along with descriptions of those pathways.
</div>
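
Conceptually, each SERVICE block produces its own set of solution rows, and SPARQL joins those sets wherever they share a variable with the same value. The sketch below imitates that join in plain Python for two made-up result sets that share a `gene` variable. Every value and the `join_on_shared` helper are hypothetical; in a real federated query the endpoint performs this join itself.

```python
def join_on_shared(left_rows, right_rows):
    """Join two lists of solution rows on the variables they have in common."""
    joined = []
    for left in left_rows:
        for right in right_rows:
            shared = left.keys() & right.keys()
            # Rows are compatible when every shared variable has the same value
            if all(left[var] == right[var] for var in shared):
                joined.append({**left, **right})
    return joined


# Made-up rows standing in for what two different endpoints might return
local_rows = [
    {"gene": "gene:A", "assay": "assay_1"},
    {"gene": "gene:B", "assay": "assay_2"},
]
remote_rows = [
    {"gene": "gene:A", "protein": "protein_X"},
    {"gene": "gene:C", "protein": "protein_Y"},
]

combined = join_on_shared(local_rows, remote_rows)
print(combined)
```

Only `gene:A` appears in both sets, so it is the only row that survives. This is why a federated query built from plain SERVICE blocks returns results only for genes that every queried database describes.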

### Example 2: Federated SPARQL Query of AOP-DB, AOP-Wiki, Protein Ontology, and WikiPathways
Retrieving AOP, Key Event, Gene, Protein, and Biological Pathway Data

In [8]:
aopdb.setQuery("""
PREFIX mmo: <http://purl.obolibrary.org/obo/MMO_>
PREFIX pato: <http://purl.obolibrary.org/obo/PATO_>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX aopo: <http://aopkb.org/aop_ontology#>
PREFIX edam: <http://edamontology.org/>
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
PREFIX obo: <http://purl.obolibrary.org/obo/>
PREFIX dc: <http://purl.org/dc/elements/1.1/>
PREFIX wp: <http://vocabularies.wikipathways.org/wp#>
PREFIX dcterms: <http://purl.org/dc/terms/> 


SELECT DISTINCT ?aop_id ?ke_id ?key_event ?gene_NCBI ?protein ?description ?wp_pathname ?wp_description  {

          ?gene mmo:0000441 ?assay.
          ?assay dc:title ?assayId.
    
    SERVICE <https://aopwiki.rdf.bigcat-bioinformatics.org/sparql> {
    
          ?gene a edam:data_1027 ;
                dc:identifier ?gene_NCBI .
                
          ?object dc:title ?protein;
                  skos:exactMatch ?gene.
          
          ?ke pato:0001241 ?object; 
              dc:title ?key_event; 
              rdfs:label ?ke_id.
          
          ?aop a aopo:AdverseOutcomePathway ;
              rdfs:label ?aop_id;
              aopo:has_key_event ?ke.
          }

    SERVICE <https://sparql.proconsortium.org/virtuoso/sparql> {

          ?object rdfs:label ?_PRO_label ;
                  obo:IAO_0000115 ?description .
       
          BIND(STR(?_PRO_label) AS ?PRO_label) .
          }
        
    SERVICE <http://sparql.wikipathways.org/sparql> {
    
          ?wp_gene wp:bdbEntrezGene ?gene;
                  dcterms:isPartOf ?wpPath .
                 
          ?wpPath dcterms:identifier ?pathway ;
                  dcterms:description ?wp_description ;
                  dc:title ?wp_pathname .
                
          BIND(STR(?pathway) AS ?pathway_str) .
          }
          
    }LIMIT 50 
""")

In [9]:
fedQuery = sparql_to_df(convertjson(aopdb))

<div class="alert alert-info">
   Below is a sample of the adverse outcome pathway, key event, gene, protein, and biological pathway information combined using the SPARQL query in example two.
</div>

In [10]:
fedQuery.head(5)

| | aop_id | ke_id | key_event | gene_NCBI | protein | description | wp_pathname | wp_description |
|---|---|---|---|---|---|---|---|---|
| 0 | AOP 10 | KE 667 | Binding at picrotoxin site, iGABAR chloride ch... | ncbigene:780973 | gamma-aminobutyric acid receptor subunit alpha-1 | A GABA(A) receptor protein that is a translati... | Iron uptake and transport | The transport of iron between cells is mediate... |
| 1 | AOP 10 | KE 667 | Binding at picrotoxin site, iGABAR chloride ch... | ncbigene:780973 | gamma-aminobutyric acid receptor subunit alpha-1 | A GABA(A) receptor protein that is a translati... | SIDS Susceptibility Pathways | In this model, we provide an integrated view o... |
| 2 | AOP 52 | KE 111 | Agonism, Estrogen receptor | ncbigene:407238 | estrogen receptor | A protein that is a translation product of the... | Integrated Breast Cancer Pathway | This pathway incorporates the most important p... |
| 3 | AOP 67 | KE 658 | Decreased testosterone by the fetal Leydig cel... | ncbigene:407238 | estrogen receptor | A protein that is a translation product of the... | Integrated Breast Cancer Pathway | This pathway incorporates the most important p... |
| 4 | AOP 165 | KE 1046 | Suppression, Estrogen receptor (ER) activity | ncbigene:407238 | estrogen receptor | A protein that is a translation product of the... | Integrated Breast Cancer Pathway | This pathway incorporates the most important p... |
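
The `gene_NCBI` values in this sample carry an `ncbigene:` prefix, whereas Example 1 returned bare numbers in its `gene` column. If the two result sets are to be compared or merged, the identifiers first need a common form. A minimal sketch, assuming the prefix is always separated from the ID by a colon; `strip_curie_prefix` is a hypothetical helper, not part of the original notebook.

```python
def strip_curie_prefix(identifier):
    """Return the part after the first colon, or the input unchanged if there is none."""
    _prefix, sep, local_id = identifier.partition(":")
    return local_id if sep else identifier


print(strip_curie_prefix("ncbigene:407238"))
print(strip_curie_prefix("1588"))
```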


<div class="alert alert-success" role='alert'>
In approximately 20 lines of Python and about 35 lines of SPARQL, this query retrieves data on eight variables from four databases. There are many more variables in AOP-DB, AOP-Wiki, Protein Ontology, and WikiPathways that can be combined in a SPARQL query, and there are also many more databases with relevant information that could be included. Having AOP-DB data available as RDF presents a broad range of opportunities for research and data access.
</div>

<hr></hr>

<table style="border:2px solid #add8e6;margin-left:0;margin-right:auto;text-align:left">
    <tr>
        <td style='text-align:left'>Author:</td><td> </td><td>Wes Slaughter</td>
    </tr>
    <tr>
        <td style='text-align:left'>Date:</td><td> </td><td>5/28/2020</td>
    </tr>
    <tr>
        <td style='text-align:left'>ORISE Fellow</td><td> </td><td>US EPA ORD CPHEA</td>
    </tr>
    
</table>

<br>
<br>