# Querying external database sources of interest

* Enable users to integrate data from external databases of interest into the BBP KG
* Use the Nexus Forge interface and the BMO vocabulary as much as possible
* Benefit from out-of-the-box (meta)data transformation that makes the data ready for BBP internal pipelines and applications
* Demo with MouseLight, NeuroElectro and UniProt

In [1]:
import json

from kgforge.core import KnowledgeGraphForge
from kgforge.specializations.resources import Dataset

In [2]:
endpoint = "https://staging.nise.bbp.epfl.ch/nexus/v1"
BUCKET = "neurosciencegraph/datamodels"
forge = KnowledgeGraphForge("../../configurations/database-sources/prod-nexus-sources.yml", endpoint=endpoint, bucket=BUCKET)

# List of data sources

In [3]:
forge.db_sources(pretty=True)

Available Database sources:
UniProt
NeuroElectro
MouseLight


In [4]:
sources = forge.db_sources(pretty=False)

In [5]:

data = {
    "store": {
        "name": "DemoStore"
    },
    "model": {
        "name": "DemoModel",
        "origin": "directory",
        "source": "../../../tests/data/demo-model/"
    }
}


In [6]:
from kgforge.specializations.resources import DatabaseSource
ds = DatabaseSource(forge, name="DemoDB", from_forge=False, **data)

In [7]:
# print(ds)

In [8]:
forge.db_sources(pretty=True)

Available Database sources:
UniProt
NeuroElectro
MouseLight
DemoDB


# Data source metadata

In [9]:
mouselight = sources["MouseLight"]

## Name, description, URL, license, protocol (more can be added through configuration)

In [10]:
print(mouselight.name)
print(mouselight.protocol)
print(mouselight.license)

MouseLight
https://www.janelia.org/project-team/mouselight/resources
{'id': 'https://creativecommons.org/licenses/by-nc/4.0', 'label': 'CC BY-NC 4.0'}


## Get data mappings (which hold the transformation logic) per data type

* Data mappings transform the results obtained from external data sources so that they are ready for consumption by BBP tools
* They also perform automatic ontology linking
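
To make the idea of a data mapping concrete, here is a minimal sketch in plain Python. It is not how kgforge implements `DictionaryMapping`. The `apply_mapping` helper, the rule table and the sample record are hypothetical and only imitate the rule style printed later in this notebook.

```python
# Hypothetical raw record, shaped like the payload the MouseLight mapping reads from.
record = {
    "neurons": [
        {
            "idString": "AA0001",
            "allenLabel": "Simple lobule",
            "soma": {"x": 1.0, "y": 2.0, "z": 3.0, "allenId": 1007},
        }
    ]
}


def apply_mapping(rules, x):
    """Recursively evaluate a nested dictionary of rules against a source record x."""
    out = {}
    for key, rule in rules.items():
        if isinstance(rule, dict):
            out[key] = apply_mapping(rule, x)  # nested target structure
        elif callable(rule):
            out[key] = rule(x)  # value computed from the source record
        else:
            out[key] = rule  # constants are copied as-is
    return out


# Hypothetical rules: constants, nested structures and per-record lookups.
rules = {
    "type": ["Dataset", "NeuronMorphology"],
    "brainLocation": {
        "type": "BrainLocation",
        "brainRegion": {
            "id": lambda x: "http://api.brain-map.org/api/v2/data/Structure/"
            + str(x["neurons"][0]["soma"]["allenId"]),
            "label": lambda x: x["neurons"][0]["allenLabel"],
        },
        "coordinatesInBrainAtlas": {
            "valueX": lambda x: x["neurons"][0]["soma"]["x"],
        },
    },
}

mapped = apply_mapping(rules, record)
print(mapped["brainLocation"]["brainRegion"])
```

The point of the sketch is that the target shape is declared once, as data, and then applied to every record returned by the source.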

In [11]:
forge.mappings("MouseLight", pretty=False)

{'NeuronMorphology': ['DictionaryMapping']}

In [12]:
forge.mappings('UniProt', pretty=True)

Managed mappings for the data source per entity type and mapping type:
   - Gene:
        * DictionaryMapping
   - Protein:
        * DictionaryMapping


In [13]:
forge.mappings('NeuroElectro', pretty=True)

Managed mappings for the data source per entity type and mapping type:
   - ElectrophysiologicalFeatureAnnotation:
        * DictionaryMapping
   - ParameterAnnotation:
        * DictionaryMapping
   - ParameterBody:
        * DictionaryMapping
   - ScholarlyArticle:
        * DictionaryMapping
   - SeriesBody:
        * DictionaryMapping


In [14]:
from kgforge.specializations.mappings import DictionaryMapping
mapping = forge.mapping("NeuronMorphology", "MouseLight", type=DictionaryMapping)
direct_mapping = mouselight.mapping("NeuronMorphology", type=DictionaryMapping)

In [15]:
print(mapping)

{
    id: forge.format("identifier", "neuronmorphologies/mouselight", x.neurons[0]["idString"])
    type:
    [
        Dataset
        NeuronMorphology
    ]
    brainLocation:
    {
        type: BrainLocation
        brainRegion:
        {
            id: f"http://api.brain-map.org/api/v2/data/Structure/{x.neurons[0]['soma']['allenId']}"
            label: x.neurons[0]["allenLabel"]
        }
        coordinatesInBrainAtlas:
        {
            valueX: x.neurons[0]["soma"]["x"]
            valueY: x.neurons[0]["soma"]["y"]
            valueZ: x.neurons[0]["soma"]["z"]
        }
    }
    contribution:
    {
        type: Contribution
        agent:
        {
            id: https://www.grid.ac/institutes/grid.443970.d
            type: Organization
            label: Janelia Research Campus
        }
    }
    dateCreated: x.neurons[0]["sample"]["date"]
    description: x.neurons[0]["annotationSpace"]["description"]
    distribution: forge.attach(f"./mouselight/{x.neurons[0]['idSt

In [16]:
print(direct_mapping)

{
    id: forge.format("identifier", "neuronmorphologies/mouselight", x.neurons[0]["idString"])
    type:
    [
        Dataset
        NeuronMorphology
    ]
    brainLocation:
    {
        type: BrainLocation
        brainRegion:
        {
            id: f"http://api.brain-map.org/api/v2/data/Structure/{x.neurons[0]['soma']['allenId']}"
            label: x.neurons[0]["allenLabel"]
        }
        coordinatesInBrainAtlas:
        {
            valueX: x.neurons[0]["soma"]["x"]
            valueY: x.neurons[0]["soma"]["y"]
            valueZ: x.neurons[0]["soma"]["z"]
        }
    }
    contribution:
    {
        type: Contribution
        agent:
        {
            id: https://www.grid.ac/institutes/grid.443970.d
            type: Organization
            label: Janelia Research Campus
        }
    }
    dateCreated: x.neurons[0]["sample"]["date"]
    description: x.neurons[0]["annotationSpace"]["description"]
    distribution: forge.attach(f"./mouselight/{x.neurons[0]['idSt

In [17]:
forge.db_sources(with_datatype='NeuronMorphology', pretty=True)

Available Database sources:
MouseLight


In [18]:
ne = sources['NeuroElectro']

# Search and Access data from data source

* Mappings are automatically applied to search results
* A search currently takes about a minute; work is ongoing to make it faster

In [19]:
# Type, source or target brain region, ...
filters = {"type": "ScholarlyArticle"}
# map=True, use_cache=True, download=True
resources = forge.search(filters, db_source="NeuroElectro", limit=2)
# TODO: add a function for checking data source health (requires a health URL from the database)


In [20]:
len(resources)

2

In [21]:
print(resources[0])

{
    context: https://bbp.neuroshapes.org
    id: https://bbp.epfl.ch/neurosciencegraph/data/scholarlyarticles/34164
    type:
    [
        Entity
        ScholarlyArticle
    ]
    abstract: Striatal spiny projection (SP) neurons control movement initiation by integrating cortical inputs and inhibiting basal ganglia outputs. Central to this control lies a "microcircuit" that consists of a feedback pathway formed by axon collaterals between GABAergic SP neurons and a feedforward pathway from fast spiking (FS) GABAergic interneurons to SP neurons. Here, somatically evoked postsynaptic potentials (PSPs) and currents (PSCs) were compared for both pathways with dual whole cell patch recording in voltage- and current-clamp mode using cortex-striatum-substantia nigra organotypic cultures. On average, feedforward inputs were 1 ms earlier, more reliable, and about twice as large in amplitude compared with most feedback inputs. On the other hand, both pathways exhibited widely varying, partia

In [22]:
uquery = """
PREFIX up: <http://purl.uniprot.org/core/>
SELECT ?protein
WHERE {
  ?protein a up:Protein ;
  up:reviewed true.
}
"""

In [23]:
uresources = forge.sparql(query=uquery, db_source='UniProt', limit=10, debug=True)

query in sparql 
PREFIX up: <http://purl.uniprot.org/core/>
SELECT ?protein
WHERE {
  ?protein a up:Protein ;
  up:reviewed true.
}

Submitted query:
   
   PREFIX up: <http://purl.uniprot.org/core/>
   SELECT ?protein
   WHERE {
     ?protein a up:Protein ;
     up:reviewed true.
   }
     LIMIT 10

amount of results = 10


In [24]:
len(uresources)

10

In [25]:
# uresources

In [26]:
from kgforge.core.wrappings.paths import Filter, FilterOperator

In [27]:
proteins = forge.search({'type': 'Protein', 'up:reviewed': True}, db_source='UniProt', limit=10)

query in sparql SELECT ?id WHERE {?id type Protein;
 up:reviewed ?v1 . 
 FILTER(?v1 = 'true'^^xsd:boolean)
}
amount of results = 10
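
The debug line above shows a filter dictionary being turned into a SPARQL query. The sketch below illustrates that kind of translation in plain Python. It is an assumption-laden illustration, not kgforge's query builder: `filters_to_sparql` and `to_literal` are hypothetical helpers, and terms are assumed to be already prefixed (for example `up:Protein`).

```python
def to_literal(value):
    """Render a Python value as a SPARQL literal (hypothetical helper)."""
    if isinstance(value, bool):  # test bool first: bool is a subclass of int
        return f"'{str(value).lower()}'^^xsd:boolean"
    if isinstance(value, (int, float)):
        return str(value)
    escaped = str(value).replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def filters_to_sparql(filters, limit=None):
    """Build a SELECT query from a flat {property: value} dictionary."""
    patterns = []
    constraints = []
    for index, (prop, value) in enumerate(filters.items()):
        if prop == "type":
            patterns.append(f"a {value}")
        else:
            var = f"?v{index}"
            patterns.append(f"{prop} {var}")
            constraints.append(f"FILTER({var} = {to_literal(value)})")
    body = " ;\n    ".join(patterns)
    lines = ["SELECT ?id WHERE {", f"  ?id {body} ."]
    lines += [f"  {constraint}" for constraint in constraints]
    lines.append("}")
    if limit is not None:
        lines.append(f"LIMIT {limit}")
    return "\n".join(lines)


query = filters_to_sparql({"type": "up:Protein", "up:reviewed": True}, limit=10)
print(query)
```

The generated text still needs `PREFIX` declarations for `up` and `xsd` before an endpoint would accept it.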


In [28]:
uniprot = sources['UniProt']

In [29]:
uniprot._store.context.prefixes

{'up': 'http://purl.uniprot.org/core/',
 'owl': 'http://www.w3.org/2002/07/owl#',
 'owl2xml': 'http://www.w3.org/2006/12/owl2-xml#',
 'swrlb': 'http://www.w3.org/2003/11/swrlb#',
 'protege': 'http://protege.stanford.edu/plugins/owl/protege#',
 'swrl': 'http://www.w3.org/2003/11/swrl#',
 'xsd': 'http://www.w3.org/2001/XMLSchema#',
 'skos': 'http://www.w3.org/2004/02/skos/core#',
 'rdfs': 'http://www.w3.org/2000/01/rdf-schema#',
 'dc11': 'http://purl.org/dc/terms/',
 'rdf': 'http://www.w3.org/1999/02/22-rdf-syntax-ns#',
 'foaf': 'http://xmlns.com/foaf/0.1/'}
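
These prefixes are what allow a compact term such as `up:reviewed` to stand for a full IRI. A minimal sketch of that expansion, using a hypothetical `expand` helper and two entries copied from the dictionary above:

```python
prefixes = {
    "up": "http://purl.uniprot.org/core/",
    "xsd": "http://www.w3.org/2001/XMLSchema#",
}


def expand(term, prefixes):
    """Expand 'prefix:local' to a full IRI; return the term unchanged if the prefix is unknown."""
    prefix, separator, local = term.partition(":")
    if separator and prefix in prefixes:
        return prefixes[prefix] + local
    return term


print(expand("up:reviewed", prefixes))
print(expand("xsd:boolean", prefixes))
```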

In [38]:
genes = forge.search({'type': 'Gene'}, db_source='UniProt', limit=10)

query in sparql SELECT ?id WHERE {?id type Gene . 
 
}
amount of results = 10


In [39]:
genes[0]

Resource(_last_action=None, _validated=False, _synchronized=False, _store_metadata=None, id=Resource(_last_action=None, _validated=False, _synchronized=False, _store_metadata=None, id='http://purl.uniprot.org/uniprot/H5SQ95#gene-MD58301D33AF640374C84A4DA4CAF383BE6', _inner_sync=False, annotationScore=2.0, comments=[{'texts': [{'evidences': [{'evidenceCode': 'ECO:0000305', 'source': 'PubMed', 'id': '28087277'}], 'value': 'May function as a protein modifier covalently attached to lysine residues of substrate proteins. This may serve to target the modified proteins for degradation by proteasomes'}], 'commentType': 'FUNCTION'}, {'texts': [{'evidences': [{'evidenceCode': 'ECO:0000255', 'source': 'HAMAP-Rule', 'id': 'MF_02133'}], 'value': 'Belongs to the ubiquitin-like protein UBact family'}], 'commentType': 'SIMILARITY'}], entryAudit={'firstPublicDate': '2017-10-25', 'lastAnnotationUpdateDate': '2022-05-25', 'lastSequenceUpdateDate': '2012-04-18', 'entryVersion': 15, 'sequenceVersion': 1}, 

# Save in BBP KG (Nexus)

In [31]:
# forge.register(resources)

## Access

### Set filters

In [32]:
_type = "NeuronMorphology"
filters = {"type": _type}

### Run Query

In [33]:
limit = 10  # You can limit the number of results, pass `None` to fetch all the results

data = forge.search(filters, db_source='MouseLight', limit=limit)

print(f"{len(data)} dataset(s) of type {_type} found")

10 dataset(s) of type NeuronMorphology found


### Display the results as pandas dataframe

In [34]:
property_to_display = [
    "id", "name", "subject",
    "brainLocation.brainRegion.id", "brainLocation.brainRegion.label",
    "brainLocation.layer.id", "brainLocation.layer.label",
    "contribution",
    "distribution.name", "distribution.contentUrl", "distribution.encodingFormat",
]
reshaped_data = forge.reshape(data, keep=property_to_display)

forge.as_dataframe(reshaped_data)

Unnamed: 0,id,brainLocation.brainRegion.id,brainLocation.brainRegion.label,contribution.type,contribution.agent.id,contribution.agent.type,distribution.contentUrl,distribution.encodingFormat,distribution.name,name,...,subject.strain,brainLocation.layer,contribution.agent.label,subject.age.period,subject.age.unitCode,subject.age.value,subject.identifier,subject.name,subject.sex.label,subject.strain.label
0,https://bbp.epfl.ch/neurosciencegraph/data/neu...,http://api.brain-map.org/api/v2/data/Structure...,Simple lobule,Contribution,https://www.grid.ac/institutes/grid.443970.d,Organization,https://staging.nise.bbp.epfl.ch/nexus/v1/file...,application/swc,AA1015.swc,AA1015,...,Sim1-Cre,,,,,,,,,
1,https://bbp.epfl.ch/neurosciencegraph/data/neu...,http://api.brain-map.org/api/v2/data/Structure...,Ansiform lobule,Contribution,https://www.grid.ac/institutes/grid.443970.d,Organization,https://staging.nise.bbp.epfl.ch/nexus/v1/file...,application/swc,AA1024.swc,AA1024,...,Sim1-Cre,,,,,,,,,
2,https://bbp.epfl.ch/neurosciencegraph/data/neu...,http://api.brain-map.org/api/v2/data/Structure...,MTG,Contribution,https://www.grid.ac/institutes/grid.417881.3,Organization,https://staging.nexus.ocp.bbp.epfl.ch/v1/files...,application/swc,reconstruction.swc,H17.03.010.11.13.01,...,,3,Allen Institute for Brain Science,Post-natal,yrs,38.0,601901227.0,H17.03.010,Female,
3,https://bbp.epfl.ch/neurosciencegraph/data/neu...,http://api.brain-map.org/api/v2/data/Structure...,FroL,Contribution,https://www.grid.ac/institutes/grid.417881.3,Organization,https://staging.nexus.ocp.bbp.epfl.ch/v1/files...,application/swc,reconstruction.swc,H16.06.007.01.07.02,...,,3,Allen Institute for Brain Science,Post-natal,yrs,26.0,518229880.0,H16.06.007,Male,
4,https://bbp.epfl.ch/neurosciencegraph/data/neu...,http://api.brain-map.org/api/v2/data/Structure...,MTG,Contribution,https://www.grid.ac/institutes/grid.417881.3,Organization,https://staging.nexus.ocp.bbp.epfl.ch/v1/files...,application/swc,reconstruction.swc,H16.06.008.01.26.04,...,,5,Allen Institute for Brain Science,Post-natal,yrs,24.0,527747035.0,H16.06.008,Female,
5,https://bbp.epfl.ch/neurosciencegraph/data/neu...,http://api.brain-map.org/api/v2/data/Structure/74,VISl6a,Contribution,https://www.grid.ac/institutes/grid.417881.3,Organization,https://staging.nexus.ocp.bbp.epfl.ch/v1/files...,application/swc,reconstruction.swc,Rorb-IRES2-Cre-D;Ai14-234246.02.02.01,...,,6a,Allen Institute for Brain Science,Post-natal,,,503871576.0,Rorb-IRES2-Cre-D;Ai14-234246,,Rorb-IRES2-Cre
6,https://bbp.epfl.ch/neurosciencegraph/data/neu...,http://api.brain-map.org/api/v2/data/Structure...,VISp2/3,Contribution,https://www.grid.ac/institutes/grid.417881.3,Organization,https://staging.nexus.ocp.bbp.epfl.ch/v1/files...,application/swc,reconstruction.swc,Chrna2-Cre_OE25;Ai14(IVSCC)-294465.05.01.01,...,,2/3,Allen Institute for Brain Science,Post-natal,,,565087402.0,Chrna2-Cre_OE25;Ai14(IVSCC)-294465,,Chrna2-Cre_OE25
7,https://bbp.epfl.ch/neurosciencegraph/data/neu...,http://api.brain-map.org/api/v2/data/Structure...,VISp5,Contribution,https://www.grid.ac/institutes/grid.417881.3,Organization,https://staging.nexus.ocp.bbp.epfl.ch/v1/files...,application/swc,reconstruction.swc,Cux2-CreERT2;Ai14-205530.03.02.01,...,,5,Allen Institute for Brain Science,Post-natal,,,485250100.0,Cux2-CreERT2;Ai14-205530,,Cux2-CreERT2
8,https://bbp.epfl.ch/neurosciencegraph/data/neu...,http://api.brain-map.org/api/v2/data/Structure/33,VISp6a,Contribution,https://www.grid.ac/institutes/grid.417881.3,Organization,https://staging.nexus.ocp.bbp.epfl.ch/v1/files...,application/swc,reconstruction.swc,Ctgf-2A-dgCre;Ai14(IVSCC)-229233.03.02.01,...,,6a,Allen Institute for Brain Science,Post-natal,,,501715368.0,Ctgf-2A-dgCre;Ai14(IVSCC)-229233,,Ctgf-T2A-dgCre
9,https://bbp.epfl.ch/neurosciencegraph/data/neu...,http://api.brain-map.org/api/v2/data/Structure...,MTG,Contribution,https://www.grid.ac/institutes/grid.417881.3,Organization,https://staging.nexus.ocp.bbp.epfl.ch/v1/files...,application/swc,reconstruction.swc,H17.06.005.12.15.01,...,,4,Allen Institute for Brain Science,Post-natal,yrs,38.0,571364629.0,H17.06.005,Male,
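
The dotted column names in the table, such as `brainLocation.brainRegion.label`, are paths into nested resources. The sketch below shows one way such flattening can work in plain Python. The `flatten` helper is hypothetical and is not the implementation behind `forge.as_dataframe`. The sample values are taken from the first row above.

```python
def flatten(value, parent=""):
    """Flatten nested dictionaries into a single level keyed by dotted paths."""
    flat = {}
    for key, item in value.items():
        path = f"{parent}.{key}" if parent else key
        if isinstance(item, dict):
            flat.update(flatten(item, path))
        else:
            flat[path] = item
    return flat


resource = {
    "name": "AA1015",
    "brainLocation": {"brainRegion": {"label": "Simple lobule"}},
    "distribution": {"name": "AA1015.swc", "encodingFormat": "application/swc"},
}

row = flatten(resource)
for column, cell in row.items():
    print(column, "=", cell)
```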


### Download

In [35]:
dirpath = "./downloaded/"
forge.download(data, "distribution.contentUrl", dirpath)

<action> send
<error> ConnectionError: HTTPSConnectionPool(host='staging.nexus.ocp.bbp.epfl.ch', port=443): Max retries exceeded with url: /v1/files/dke/kgforge/f0c59cf7-3b1b-4cb9-b81b-62fefd5bbd0f (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x7fb1281dcc10>: Failed to establish a new connection: [Errno 8] nodename nor servname provided, or not known'))



### Try query

In [36]:
mquery = """
# PREFIXES
SELECT ?id WHERE {
    ?id a nsg:DetailedCircuit
} LIMIT 100
"""

forge.sparql(mquery, debug=True, rewrite=False)

Submitted query:
   
   # PREFIXES
   SELECT ?id WHERE {
       ?id a nsg:DetailedCircuit
   } LIMIT 100

<action> _sparql
<error> QueryingError: 400 Client Error: Bad Request for url: https://staging.nise.bbp.epfl.ch/nexus/v1/views/neurosciencegraph/datamodels/https%3A%2F%2Fbluebrain.github.io%2Fnexus%2Fvocabulary%2FdefaultSparqlIndex/sparql
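
A 400 response here is consistent with the `nsg:` prefix being undeclared: the submitted query is identical to the input, and the `# PREFIXES` line is only a comment. This diagnosis is an assumption, since the server's error body is not shown. The sketch below declares the prefix explicitly. `with_prefixes` is a hypothetical helper, and the `nsg` namespace IRI is assumed to be `https://neuroshapes.org/`.

```python
def with_prefixes(query, prefixes):
    """Prepend PREFIX declarations to a SPARQL query string."""
    declarations = "\n".join(
        f"PREFIX {name}: <{iri}>" for name, iri in prefixes.items()
    )
    return f"{declarations}\n{query.strip()}"


mquery = """
SELECT ?id WHERE {
    ?id a nsg:DetailedCircuit
} LIMIT 100
"""

# Assumed namespace for the nsg prefix; check it against the Forge context in use.
full_query = with_prefixes(mquery, {"nsg": "https://neuroshapes.org/"})
print(full_query)
```

The resulting string could then be passed to `forge.sparql(full_query, debug=True, rewrite=False)`.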




