## Querying the Nexus knowledge graph and downloading data

The goal of this notebook is to demonstrate how to query data from Nexus and download the associated files.


## Prerequisites

This notebook assumes you have targeted an organisation and project from which to query and download data.

## Notes

Many of the functions used in this notebook will soon be integrated into our SDK.

## Overview

You'll work through the following steps:

1. Create a Blue Brain Nexus client using the Python SDK
2. Create a query wrapper around the project to query
3. Search for data by type and get download URLs

## Step 1: Create a Blue Brain Nexus client using the Python SDK

In [None]:
#Set a token to authenticate to Nexus
import getpass
token = getpass.getpass()

In [2]:
#Configuration for the Nexus deployment
import nexussdk as nexus

#Configure a nexus client
nexus_environment = "https://bbp.epfl.ch/nexus/v1"
org ="org"
project ="project"

nexus.config.set_environment(nexus_environment)
nexus.config.set_token(token)

vocab = "%s/vocabs/%s/%s/"%(nexus_environment, org, project)


## Step 2: Create a query wrapper around the project to query

Every project in Blue Brain Nexus comes with two default views, each exposing a specific query interface for searching the data:

- ElasticSearch view: data can be searched in a document-oriented way
- SPARQL view: data can be searched and navigated as a graph

In this notebook we'll target the SPARQL view.

In [3]:
# Install SPARQLWrapper, a Python wrapper for querying SPARQL endpoints
!pip install git+https://github.com/RDFLib/sparqlwrapper



In [22]:
# Utility functions to create sparql wrapper around a sparql endpoint

from SPARQLWrapper import SPARQLWrapper, JSON, POST, GET, POSTDIRECTLY, CSV
import requests

def create_sparql_client(sparql_endpoint, http_query_method=POST, result_format= JSON, token=None):
    sparql_client = SPARQLWrapper(sparql_endpoint)
    #sparql_client.addCustomHttpHeader("Content-Type", "application/sparql-query")
    if token:
        sparql_client.addCustomHttpHeader("Authorization","Bearer {}".format(token))
    sparql_client.setMethod(http_query_method)
    sparql_client.setReturnFormat(result_format)
    if http_query_method == POST:
        sparql_client.setRequestMethod(POSTDIRECTLY)
    
    return sparql_client

In [23]:
# Utility functions
import pandas as pd

# Show full cell contents (long URLs); recent pandas versions expect None rather than -1 here
pd.set_option('display.max_colwidth', None)

# Convert SPARQL results into a Pandas data frame
def sparql2dataframe(json_sparql_results):
    cols = json_sparql_results['head']['vars']
    out = []
    for row in json_sparql_results['results']['bindings']:
        item = []
        for c in cols:
            item.append(row.get(c, {}).get('value'))
        out.append(item)
    return pd.DataFrame(out, columns=cols)

# Send a query using a sparql wrapper 
def query_sparql(query, sparql_client):
    sparql_client.setQuery(query)
    result_object = sparql_client.query()
    if sparql_client.returnFormat == JSON:
        return result_object._convertJSON()
    return result_object.convert()
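
To make the conversion concrete, here is a minimal, self-contained sketch that applies the same row-flattening logic to a hand-written payload in the SPARQL JSON results layout (`head.vars` plus `results.bindings`). The sample URIs and names are invented for illustration and do not come from Nexus. The second row deliberately lacks the optional `downloadUrl` binding, so it shows up as a missing value in the data frame.

```python
import pandas as pd

def sparql2dataframe(json_sparql_results):
    # Column order follows the variables listed in the SELECT clause
    cols = json_sparql_results["head"]["vars"]
    rows = [
        # An unbound (OPTIONAL) variable is simply absent from the binding
        [binding.get(col, {}).get("value") for col in cols]
        for binding in json_sparql_results["results"]["bindings"]
    ]
    return pd.DataFrame(rows, columns=cols)

# Hypothetical payload, shaped like a SPARQL SELECT response
sample_results = {
    "head": {"vars": ["entity", "name", "downloadUrl"]},
    "results": {
        "bindings": [
            {
                "entity": {"type": "uri", "value": "https://example.org/data/1"},
                "name": {"type": "literal", "value": "example_a"},
                "downloadUrl": {"type": "uri", "value": "https://example.org/files/1"},
            },
            {
                "entity": {"type": "uri", "value": "https://example.org/data/2"},
                "name": {"type": "literal", "value": "example_b"},
            },
        ]
    },
}

sample_df = sparql2dataframe(sample_results)
print(sample_df)
```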

In [24]:
# Let's create a SPARQL wrapper around the project's SPARQL view
sparqlview_endpoint = nexus_environment + "/views/" + org + "/" + project + "/graph/sparql"
sparqlview_wrapper = create_sparql_client(sparql_endpoint=sparqlview_endpoint, token=token, http_query_method=POST, result_format=JSON)


## Step 3: Search for data by type and get download URLs

Let's get a list of all entities of a given type as a pandas dataframe. For each entity, we'll retrieve its id (entity), its name (name) and a link to download it (downloadUrl).

In [29]:
#Enter your type here
entity_type = "nsg:NISSLImageDataLayer"

max_total_result = 200000
page_size = 5000
offset = 0

count = 0
nexus_df=None
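# Use a strict comparison so that no extra page is fetched once max_total_result rows have been requested
while count < max_total_result: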
    
    
    select_query = """
                    PREFIX nsg: <https://neuroshapes.org/>
                    PREFIX schema: <http://schema.org/>
                    PREFIX prov: <http://www.w3.org/ns/prov#>
                    
                    SELECT DISTINCT ?entity ?name ?downloadUrl
                    WHERE
                    {
                      ?entity a %type%.

                      ?entity schema:name ?name.
                      OPTIONAL {
                       ?entity schema:distribution/schema:contentUrl ?downloadUrl
                      }
                    }
                    OFFSET %offset%
                    LIMIT %page_size%
                   """
    select_query = select_query.replace("%type%",entity_type ).replace("%offset%",str(offset)).replace("%page_size%",str(page_size))
    #print(select_query)

    nexus_results = query_sparql(select_query,sparqlview_wrapper)

    result_df =sparql2dataframe(nexus_results)
    #print(len(result_df.index))
    if len(result_df.index) > 0:
        if nexus_df is None:
            #print("None")
            nexus_df = pd.DataFrame(result_df)
        else:
            nexus_df = pd.concat([nexus_df,result_df],ignore_index=True)
            #print(len(nexus_df.index))
        count = count + page_size
        offset = offset+page_size
    else:
        break


if nexus_df is not None:
    print("Results stats: ")
    display(nexus_df.describe())

    print("(5) first Results : ")
    display(nexus_df.head(5))
    entities = set(nexus_df["entity"])
    print("Number of entities of type %s: %s" % (entity_type,len(entities)))
else:
    print("No result was found")


Results stats: 


| | entity | name | downloadUrl |
|---|---|---|---|
| count | 2 | 2 | 2 |
| unique | 2 | 2 | 2 |
| top | https://bbp.epfl.ch/neurosciencegraph/data/c1f768f8-5bbb-46ea-aeee-bc92909a0b52 | ara_nissl_25 | https://bbp.epfl.ch/nexus/v1/files/bbp/atlas/782d7f45-fe72-4eea-b28a-e99090188f09 |
| freq | 1 | 1 | 1 |


(5) first Results : 


| | entity | name | downloadUrl |
|---|---|---|---|
| 0 | https://bbp.epfl.ch/neurosciencegraph/data/9768fb11-6461-4705-995f-7ad40b3aab77 | ara_nissl_25 | https://bbp.epfl.ch/nexus/v1/files/bbp/atlas/782d7f45-fe72-4eea-b28a-e99090188f09 |
| 1 | https://bbp.epfl.ch/neurosciencegraph/data/c1f768f8-5bbb-46ea-aeee-bc92909a0b52 | ara_nissl_10 | https://bbp.epfl.ch/nexus/v1/files/bbp/atlas/28c8c251-cb7c-4248-bbb5-a2d36bd13bd6 |


Number of entities of type nsg:NISSLImageDataLayer: 2
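
The paging logic used above (request a page, append it, advance the offset, stop on an empty page or when the cap is reached) can be factored into a small helper. The following is a minimal sketch, not part of the Nexus SDK: `fetch_all_pages`, `fetch_page` and the in-memory `fake_store` are hypothetical names introduced here so the example runs without a Nexus deployment.

```python
def fetch_all_pages(fetch_page, page_size, max_total):
    """Collect rows page by page until a page comes back short or max_total is reached."""
    rows = []
    offset = 0
    while len(rows) < max_total:
        # Never ask for more rows than the remaining budget allows
        limit = min(page_size, max_total - len(rows))
        page = fetch_page(offset, limit)
        rows.extend(page)
        if len(page) < limit:
            # A short (or empty) page means the source is exhausted
            break
        offset += len(page)
    return rows

# Stand-in for a real query: pretend the store holds 12 rows
fake_store = ["row-%d" % i for i in range(12)]

def fake_fetch_page(offset, limit):
    return fake_store[offset:offset + limit]

all_rows = fetch_all_pages(fake_fetch_page, page_size=5, max_total=100)
capped_rows = fetch_all_pages(fake_fetch_page, page_size=5, max_total=7)
print(len(all_rows), len(capped_rows))
```

In this notebook, `fetch_page` could wrap `query_sparql` and `sparql2dataframe` with the `OFFSET` and `LIMIT` values substituted into the query. Since SPARQL does not guarantee a result order without an `ORDER BY` clause, adding one (for example on `?entity`) is generally advisable when paging, so that consecutive pages neither overlap nor skip rows.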


In [30]:
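# Keep each download URL paired with its name row by row: building two separate sets
# and zipping them would pair URLs and names arbitrarily, because sets are unordered.
links_df = nexus_df[["downloadUrl", "name"]].dropna(subset=["downloadUrl"]).drop_duplicates()

entries = list(links_df.itertuples(index=False, name=None))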
print("Number of download links: %s" % (len(entries)))

Number of download links: 2
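
With the download links in hand, the files themselves can be fetched over HTTP. The sketch below is one possible way to do it with `requests`, not an SDK function: `download_file` and its parameters are hypothetical, and it assumes the file URL returns the binary content when the bearer token is sent with an `Accept: */*` header. Please check this against your Nexus deployment before relying on it.

```python
import os
import requests

def download_file(url, dest_dir, filename, token=None, session=None, chunk_size=1 << 20):
    """Stream the content behind `url` to dest_dir/filename and return the local path."""
    http = session or requests
    # Assumption: asking for any content type yields the file bytes rather than its metadata
    headers = {"Accept": "*/*"}
    if token:
        headers["Authorization"] = "Bearer {}".format(token)
    response = http.get(url, headers=headers, stream=True)
    response.raise_for_status()  # fail loudly on 401/403/404 and similar errors
    os.makedirs(dest_dir, exist_ok=True)
    path = os.path.join(dest_dir, filename)
    with open(path, "wb") as out:
        # Write in chunks so large files are never held fully in memory
        for chunk in response.iter_content(chunk_size=chunk_size):
            if chunk:
                out.write(chunk)
    return path

# Possible usage with (url, name) pairs such as `entries`; left commented out
# because it needs network access and a valid token:
# for url, name in entries:
#     print(download_file(url, "./downloads", name, token=token))
```

The entity names returned by the query carry no file extension, so you may want to append one, or derive the file name from the file metadata, depending on the data type.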
