# Querying external database sources of interest

* Enable users to integrate data from external databases of interest within BBP KG
* While using the Nexus Forge interface and BMO vocabulary as much as possible
* While benefiting from out of the box (meta)data transformation to make them ready for BBP internal pipelines and applications
* Demo with Mouselight, NeuroElectro, UniProt

In [1]:
import os
import uuid
import json

from kgforge.core import KnowledgeGraphForge
from kgforge.specializations.resources import Dataset

In [2]:
# import getpass
# TOKEN = getpass.getpass()

In [3]:
endpoint = "https://staging.nise.bbp.epfl.ch/nexus/v1"
BUCKET = "bbp/atlas"
forge = KnowledgeGraphForge("../../configurations/database-sources/prod-nexus-sources.yml",
                            # endpoint=endpoint,
                            bucket=BUCKET, debug=True)

# List of data sources

In [4]:
forge.dataset_sources(pretty=True)

Available Database sources:
UniProt
NeuroElectro
NeuroMorpho


In [5]:
sources = forge.dataset_sources()

In [6]:

data = {
    'origin': 'store',
    'source': 'DemoStore',
    'model': {
        'name': 'DemoModel',
        'origin': 'directory',
        'source': "../../../tests/data/demo-model/"
    }
}


In [7]:
from kgforge.specializations.databases import StoreDatabase
ds = StoreDatabase(forge, name="DemoDB", **data)

In [8]:
forge.add_dataset_source(ds)

In [9]:
forge.dataset_sources(pretty=True)

Available Database sources:
UniProt
NeuroElectro
NeuroMorpho
DemoDB


# Data source metadata

In [10]:
neuroelectro = sources['NeuroElectro']

## Get data mappings (which hold the transformation logic) per data type

* Data mappings are used to transform results obtained from the external data sources so that they are ready for consumption by BBP tools
* Perform automatic ontology linking

In [11]:
forge.mappings(source="NeuroElectro")

Managed mappings for the data source per entity type and mapping type:
   - ElectrophysiologicalFeatureAnnotation:
        * DictionaryMapping
   - ParameterAnnotation:
        * DictionaryMapping
   - ParameterBody:
        * DictionaryMapping
   - ScholarlyArticle:
        * DictionaryMapping
   - SeriesBody:
        * DictionaryMapping


In [12]:
forge.mappings('UniProt')

Managed mappings for the data source per entity type and mapping type:
   - Gene:
        * DictionaryMapping
   - Protein:
        * DictionaryMapping


In [13]:
from kgforge.specializations.mappings import DictionaryMapping
mapping = forge.mapping(entity="ScholarlyArticle", source="NeuroElectro")

In [14]:
print(mapping)

{
    id: forge.format("identifier", "scholarlyarticles", x.id)
    type:
    [
        Entity
        ScholarlyArticle
    ]
    abstract: x.abstract
    author: x.authors_shaped
    datePublished: x.date_issued
    identifier: x.identifiers
    isPartOf:
    {
        type: Periodical
        issn: x.issn
        name: x.journal
        publisher: x.publisher
    }
    name: f"article_{x.id}"
    sameAs: x.full_text_link
    title: x.title
    url: x.full_text_link
}
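
To make the role of such rules concrete, here is a minimal, self-contained sketch of how rules written as expressions over `x` could be applied to a raw record. It is not kgforge's implementation: `apply_rules`, `format_identifier` and the sample record are hypothetical, and the base URL is an assumption inferred from the identifiers that appear in this notebook's outputs.

```python
from types import SimpleNamespace

# Assumed base URL, inferred from identifiers shown in this notebook's outputs.
BASE = "https://bbp.epfl.ch/neurosciencegraph/data"

def format_identifier(kind, value):
    """Hypothetical stand-in for forge.format("identifier", kind, value)."""
    return f"{BASE}/{kind}/{value}"

# Each rule is a function of the raw record `x`; nested dicts are mapped recursively.
RULES = {
    "id": lambda x: format_identifier("scholarlyarticles", x.id),
    "type": lambda x: ["Entity", "ScholarlyArticle"],
    "name": lambda x: f"article_{x.id}",
    "title": lambda x: x.title,
    "isPartOf": {
        "type": lambda x: "Periodical",
        "name": lambda x: x.journal,
    },
}

def apply_rules(rules, record):
    """Evaluate every rule against `record` and return the mapped dictionary."""
    mapped = {}
    for key, rule in rules.items():
        if isinstance(rule, dict):
            mapped[key] = apply_rules(rule, record)
        else:
            mapped[key] = rule(record)
    return mapped

# Illustrative raw record (made-up values, not NeuroElectro data).
raw = SimpleNamespace(id=35177, title="An example title", journal="An example journal")
article = apply_rules(RULES, raw)
print(article["id"])
```

Keeping the rules as data, separate from the code that applies them, is what allows one mapping per entity type and per source.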


In [15]:
forge.dataset_sources(type_='Gene', pretty=True)

Available Database sources:
UniProt


# Search and access data from a data source

* Mappings are automatically applied to search results
* A search currently takes about a minute; work is ongoing to make it faster

In [16]:
filters = {"type": "ScholarlyArticle"}
# map=True, use_cache=True, download=True
resources = forge.search(filters, dataset_source="NeuroElectro", limit=2, debug=True)
# TODO: add a function for checking data source health (requires a health URL from the database)


Submitted query:
   PREFIX bmc: <https://bbp.epfl.ch/ontologies/core/bmc/>
   PREFIX bmo: <https://bbp.epfl.ch/ontologies/core/bmo/>
   PREFIX commonshapes: <https://neuroshapes.org/commons/>
   PREFIX datashapes: <https://neuroshapes.org/dash/>
   PREFIX dc: <http://purl.org/dc/elements/1.1/>
   PREFIX dcat: <http://www.w3.org/ns/dcat#>
   PREFIX dcterms: <http://purl.org/dc/terms/>
   PREFIX mba: <http://api.brain-map.org/api/v2/data/Structure/>
   PREFIX nsg: <https://neuroshapes.org/>
   PREFIX nxv: <https://bluebrain.github.io/nexus/vocabulary/>
   PREFIX oa: <http://www.w3.org/ns/oa#>
   PREFIX obo: <http://purl.obolibrary.org/obo/>
   PREFIX owl: <http://www.w3.org/2002/07/owl#>
   PREFIX prov: <http://www.w3.org/ns/prov#>
   PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
   PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
   PREFIX schema: <http://schema.org/>
   PREFIX sh: <http://www.w3.org/ns/shacl#>
   PREFIX shsh: <http://www.w3.org/ns/shacl-shacl#>
   PREFI

In [17]:
len(resources)

2

In [18]:
print(resources[0])

{
    context: https://bbp.neuroshapes.org
    id: https://bbp.epfl.ch/neurosciencegraph/data/scholarlyarticles/35177
    type:
    [
        Entity
        ScholarlyArticle
    ]
    abstract: On the one hand, neuronal activity can cause changes in pH; on the other hand, changes in pH can modulate neuronal activity. Consequently, the pH of the brain is regulated at various levels. Here we show that steady-state pH and acid extrusion were diminished in cultured hippocampal neurons of mice with a targeted disruption of the Na(+)-driven Cl(-)/HCO(3)(-) exchanger Slc4a8. Because Slc4a8 was found to predominantly localize to presynaptic nerve endings, we hypothesize that Slc4a8 is a key regulator of presynaptic pH. Supporting this hypothesis, spontaneous glutamate release in the CA1 pyramidal layer was reduced but could be rescued by increasing the intracellular pH. The reduced excitability in vitro correlated with an increased seizure threshold in vivo. Together with the altered kinetics 

In [19]:
uquery = """
PREFIX up: <http://purl.uniprot.org/core/>
SELECT ?protein
WHERE {
  ?protein a up:Protein ;
  up:reviewed true.
}
"""

In [20]:
uresources = forge.sparql(query=uquery, dataset_source='UniProt', limit=10, debug=True)

Submitted query:
   
   PREFIX up: <http://purl.uniprot.org/core/>
   SELECT ?protein
   WHERE {
     ?protein a up:Protein ;
     up:reviewed true.
   }
     LIMIT 10



In [21]:
len(uresources)

10

In [22]:
uresources[0]

Resource(_last_action=None, _validated=False, _synchronized=False, _store_metadata=None, _inner_sync=False, protein='http://purl.uniprot.org/uniprot/A0B137')
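
Each returned `Resource` carries one attribute per variable of the `SELECT` clause (here `protein`). The sketch below illustrates that idea on a response in the standard SPARQL 1.1 JSON results format. `bindings_to_records` and the hand-written payload are illustrative assumptions, not the code path kgforge uses.

```python
import json
from types import SimpleNamespace

# Hand-written payload in the SPARQL 1.1 Query Results JSON format.
payload = json.loads("""
{
  "head": {"vars": ["protein"]},
  "results": {"bindings": [
    {"protein": {"type": "uri", "value": "http://purl.uniprot.org/uniprot/A0B137"}}
  ]}
}
""")

def bindings_to_records(response):
    """Turn each result binding into an object with one attribute per bound variable."""
    records = []
    for binding in response["results"]["bindings"]:
        values = {variable: term["value"] for variable, term in binding.items()}
        records.append(SimpleNamespace(**values))
    return records

records = bindings_to_records(payload)
print(records[0].protein)
```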

## Use Filters to search

In [23]:
from kgforge.core.wrappings.paths import Filter, FilterOperator

In [24]:

proteins = forge.search({'type': 'Protein', 'up:reviewed': True}, dataset_source='UniProt', limit=10, debug=True)

Submitted query:
   PREFIX up: <http://purl.uniprot.org/core/>
   PREFIX owl: <http://www.w3.org/2002/07/owl#>
   PREFIX owl2xml: <http://www.w3.org/2006/12/owl2-xml#>
   PREFIX swrlb: <http://www.w3.org/2003/11/swrlb#>
   PREFIX protege: <http://protege.stanford.edu/plugins/owl/protege#>
   PREFIX swrl: <http://www.w3.org/2003/11/swrl#>
   PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
   PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
   PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
   PREFIX dc11: <http://purl.org/dc/terms/>
   PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
   PREFIX foaf: <http://xmlns.com/foaf/0.1/>
   SELECT ?id WHERE {?id rdf:type up:Protein;
    up:reviewed ?v1 . 
    FILTER(?v1 = 'true'^^xsd:boolean)
   }  LIMIT 10
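
The submitted query suggests how a dictionary of filters is turned into SPARQL: the `type` key becomes an `rdf:type` triple, and every other key becomes a triple plus a `FILTER`. The following sketch reproduces that shape for flat dictionaries only. `filters_to_sparql` and `to_literal` are hypothetical helpers. The sketch assumes the `type` value lives in the `up:` namespace, and it omits the prefix declarations and everything else the real query builder handles.

```python
def to_literal(value):
    """Render a Python value as a SPARQL literal (booleans, numbers and strings only)."""
    if isinstance(value, bool):  # test bool first: bool is a subclass of int
        return f"'{str(value).lower()}'^^xsd:boolean"
    if isinstance(value, (int, float)):
        return str(value)
    return '"' + str(value).replace('"', '\\"') + '"'

def filters_to_sparql(filters, limit=None, type_prefix="up"):
    """Build a SELECT query from a non-empty, flat {property: value} dictionary."""
    patterns = []
    conditions = []
    for index, (key, value) in enumerate(filters.items()):
        if key == "type":
            patterns.append(f"rdf:type {type_prefix}:{value}")
        else:
            variable = f"?v{index}"
            patterns.append(f"{key} {variable}")
            conditions.append(f"FILTER({variable} = {to_literal(value)})")
    lines = ["SELECT ?id WHERE {", "  ?id " + " ;\n      ".join(patterns) + " ."]
    lines += [f"  {condition}" for condition in conditions]
    lines.append("}")
    if limit is not None:
        lines.append(f"LIMIT {limit}")
    return "\n".join(lines)

query = filters_to_sparql({"type": "Protein", "up:reviewed": True}, limit=10)
print(query)
```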



In [25]:
proteins[0]

Resource(_last_action=None, _validated=False, _synchronized=False, _store_metadata=None, id='http://purl.uniprot.org/uniprot/A0B137', _inner_sync=False)

# Map resources

In [26]:
uniprot = sources['UniProt']

In [27]:
complete_query = """
PREFIX up: <http://purl.uniprot.org/core/>
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
SELECT ?id ?gene ?label ?subject ?gene_label
WHERE {
  ?id a up:Protein ;
  up:reviewed true ;
  up:encodedBy ?gene ;
  up:recommendedName / up:fullName ?label ;
  up:organism / up:scientificName ?subject .
  ?gene skos:prefLabel ?gene_label .
}
"""

In [28]:
raw_proteins = uniprot.sparql(complete_query)

In [29]:
new_resource = uniprot.map(raw_proteins[0], 'Protein')

In [30]:

print(json.dumps(forge.as_jsonld(new_resource), indent=4))

[
    {
        "@context": "https://bbp.neuroshapes.org",
        "@id": "https://bbp.epfl.ch/neurosciencegraph/data/proteins/P0DJN9",
        "@type": [
            "Entity",
            "Protein"
        ],
        "encodedBy": {
            "@id": "http://purl.uniprot.org/uniprot/P0DJN9#gene-MD5A00DD99270221B359AB0AE338E423668",
            "label": "acsF"
        },
        "identifier": {
            "propertyID": "UniProtKB",
            "value": "P0DJN9"
        },
        "name": "Protein P0DJN9 from UniProtKB",
        "label": "Aerobic magnesium-protoporphyrin IX monomethyl ester [oxidative] cyclase",
        "subject": {
            "label": "Rubrivivax gelatinosus"
        }
    }
]


### The same result can be obtained from a dictionary and a DictionaryMapping instance

In [31]:
dict_resource = forge.as_json(raw_proteins[0])
mapping = DictionaryMapping.load("../../database-sources/UniProt/mappings/DictionaryMapping/Protein.hjson")
print(json.dumps(forge.as_jsonld(uniprot.map(dict_resource, mapping)), indent=4))

[
    {
        "@context": "https://bbp.neuroshapes.org",
        "@id": "https://bbp.epfl.ch/neurosciencegraph/data/proteins/P0DJN9",
        "@type": [
            "Entity",
            "Protein"
        ],
        "encodedBy": {
            "@id": "http://purl.uniprot.org/uniprot/P0DJN9#gene-MD5A00DD99270221B359AB0AE338E423668",
            "label": "acsF"
        },
        "identifier": {
            "propertyID": "UniProtKB",
            "value": "P0DJN9"
        },
        "name": "Protein P0DJN9 from UniProtKB",
        "label": "Aerobic magnesium-protoporphyrin IX monomethyl ester [oxidative] cyclase",
        "subject": {
            "label": "Rubrivivax gelatinosus"
        }
    }
]


## Query the NeuroMorpho WebService

In [32]:
neuromorpho = sources['NeuroMorpho']

In [33]:
nmo_filters = {"species": "rat,mouse,human", "response_loc": ["_embedded", "neuronResources"]}

In [56]:
nmo_resources = forge.search(nmo_filters, dataset_source='NeuroMorpho', size=3, searchendpoint='select_query', q="species:mouse")



In [50]:
forge.resolve(nmo_resources[0].brain_region[0], scope='ontology', strategy='BEST_MATCH')

Resource(_last_action=None, _validated=False, _synchronized=False, _store_metadata=None, id='http://purl.obolibrary.org/obo/UBERON_0002301', type='Class', label='Neocortical Layer', _inner_sync=False, prefLabel='Neocortical Layer', subClassOf='bmo:BrainLayer')

In [51]:
nmo_resources[0].brain_region

['neocortex', 'occipital', 'layer 6']
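
`brain_region` holds free-text labels, while the resolver returns an ontology term. As a rough illustration of best-match resolution, the sketch below picks the closest label from a tiny hand-made vocabulary using `difflib` from the standard library. It says nothing about how `forge.resolve` ranks candidates. Apart from the UBERON identifier shown in the output above, the vocabulary entries and their identifiers are invented placeholders.

```python
import difflib

VOCABULARY = {
    "Neocortical Layer": "http://purl.obolibrary.org/obo/UBERON_0002301",
    "Hippocampus": "example:hippocampus",  # placeholder identifier
    "Cerebellum": "example:cerebellum",    # placeholder identifier
}

def best_match(text, vocabulary):
    """Return (label, identifier) of the vocabulary label most similar to `text`."""
    by_lowercase = {label.lower(): label for label in vocabulary}
    # cutoff=0.0 keeps every candidate, so the best-scoring label is always returned
    closest = difflib.get_close_matches(text.lower(), list(by_lowercase), n=1, cutoff=0.0)
    label = by_lowercase[closest[0]]
    return label, vocabulary[label]

label, identifier = best_match("neocortex", VOCABULARY)
print(label, identifier)
```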

### Format date

In [35]:
from datetime import datetime

In [36]:
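for resource in nmo_resources:
    # Give each resource a fresh identifier
    resource.bbpID = str(uuid.uuid4())
    # deposition_date looks like 'YYYY-MM-DD'; turn it into an ISO 8601 date-time string
    date_ints = [int(p) for p in resource.deposition_date.split('-')]
    deposition_date = datetime(*date_ints)
    resource.date_formatted = deposition_date.strftime("%Y-%m-%dT%H:%M:%S")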
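
The loop above turns a `YYYY-MM-DD` deposition date into an ISO 8601 date-time string. The same conversion can be checked in isolation. The sketch assumes the input is always a plain `YYYY-MM-DD` string, and `to_iso_datetime` is a name introduced here for illustration.

```python
from datetime import datetime

def to_iso_datetime(date_text):
    """Convert 'YYYY-MM-DD' to 'YYYY-MM-DDTHH:MM:SS' with the time set to midnight."""
    parsed = datetime.strptime(date_text, "%Y-%m-%d")
    return parsed.strftime("%Y-%m-%dT%H:%M:%S")

print(to_iso_datetime("2012-07-01"))
```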

In [37]:
new_morphologies = neuromorpho.map(nmo_resources, 'NeuronMorphology')

In [38]:
format_file = neuromorpho.service.files_download['endpoint'] + "/{}/CNG version/{}.CNG.swc"

## Attach files as distributions to the morphologies

In [39]:
import morphio

In [40]:
for morphology in new_morphologies:
    url = format_file.format(morphology.archive.lower(), morphology.name)
    # Drop the '.CNG.swc' suffix, then keep the last path component as the local file name
    base_name = ''.join(url.split('.')[:-2])
    file_path = f"./downloaded/{base_name.split('/')[-1]}"
    swc = file_path + '.swc'
    neuromorpho.download(url, swc)
    neuromorpho.attach_file(morphology, swc, content_type='application/swc')
    # Generate the other morphology formats from the downloaded SWC file
    m = morphio.mut.Morphology(swc)
    for extension in ['h5', 'asc']:
        path = f"{file_path}.{extension}"
        m.write(path)
        neuromorpho.attach_file(morphology, path, content_type=f'application/{extension}')
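
Deriving the local file name by splitting the URL on dots works for names such as `TypeA-10.CNG.swc`, but it would silently drop any dot inside a neuron name. A sketch of an alternative using only the standard library follows. `local_stem` is a hypothetical helper, the host in the example is a placeholder, and the `.CNG.swc` suffix is the one used in the URL template of this notebook.

```python
from pathlib import PurePosixPath
from urllib.parse import unquote, urlsplit

def local_stem(url, suffix=".CNG.swc"):
    """Return the file name found in `url` without the trailing `suffix`."""
    file_name = PurePosixPath(unquote(urlsplit(url).path)).name
    if file_name.endswith(suffix):
        file_name = file_name[: -len(suffix)]
    return file_name

# Placeholder host; only the path layout mirrors the template used in this notebook.
url = "https://example.org/scanziani/CNG version/TypeA-10.CNG.swc"
stem = local_stem(url)
paths = [f"./downloaded/{stem}.{extension}" for extension in ("swc", "h5", "asc")]
print(paths)
```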



In [44]:
forge.register(new_morphologies[0], schema_id="datashapes:neuronmorphology")

<action> _register_one
<succeeded> False
<error> RegistrationError: 400 Client Error: Bad Request for url: https://bbp.epfl.ch/nexus/v1/resources/bbp/atlas/datashapes%3Aneuronmorphology


In [42]:
# forge.validate(new_morphologies[0], type_="NeuronMorphology")

In [45]:
print(new_morphologies[0])

{
    id: https://bbp.epfl.ch/neurosciencegraph/data/neuronmorphologies/neuromorpho/0b869f02-85b3-4c49-9cb3-5d6f12bbde28
    type:
    [
        Dataset
        NeuronMorphology
    ]
    archive: Scanziani
    brainLocation:
    {
        type: BrainLocation
    }
    contribution:
    {
        type: Contribution
        agent:
        {
            id: https://ror.org/02jqj7156
            type: Organization
            label: George Mason University
        }
    }
    dateCreated:
    {
        type: xsd:dateTime
        @value: 2012-07-01T00:00:00
    }
    description: Morphology of TypeA-10 obtained from NeuroMorpho API.
    distribution:
    [
        {
            type: DataDownload
            atLocation:
            {
                type: Location
                location: file:///gpfs/bbp.cscs.ch/data/project/proj39/nexus/bbp/atlas/0/4/8/5/1/9/5/c/TypeA-10.swc
                store:
                {
                    id: https://bbp.epfl.ch/neurosciencegraph/data/2f120