## Step-by-step building of a knowledge graph

The goal of this notebook is to build an example knowledge graph step by step.


## Prerequisites

This notebook assumes you've created a project within the sandbox deployment of Nexus. If not, follow the Blue Brain Nexus [Quick Start tutorial](https://bluebrain.github.io/nexus/docs/tutorial/getting-started/quick-start/index.html).

## Overview
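This notebook models a small research domain: people, the organizations they are affiliated with, and the scholarly articles they author. Entities are described with schema.org types and serialized as JSON-LD.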

You'll work through the following steps:

1. Create and configure a Blue Brain Nexus client
2. Create a Person entity
3. Create a ScholarlyArticle entity and link it to the Person entity as author
4. Update the Person entity to add its affiliation (an Organization entity)
5. Update the EPFL Organization entity with its logo
6. Explore and navigate the created knowledge graph

## Step 1: Create and configure a Nexus client

In [78]:
#install the Blue Brain Nexus python SDK
!pip install -U nexus-sdk

Requirement already up-to-date: nexus-sdk in /Users/mfsy/anaconda3/envs/bbpworkshop/lib/python3.5/site-packages (0.2.1)
Requirement not upgraded as not directly required: requests in /Users/mfsy/anaconda3/envs/bbpworkshop/lib/python3.5/site-packages (from nexus-sdk) (2.22.0)
Requirement not upgraded as not directly required: certifi>=2017.4.17 in /Users/mfsy/anaconda3/envs/bbpworkshop/lib/python3.5/site-packages (from requests->nexus-sdk) (2018.8.24)
Requirement not upgraded as not directly required: urllib3!=1.25.0,!=1.25.1,<1.26,>=1.21.1 in /Users/mfsy/anaconda3/envs/bbpworkshop/lib/python3.5/site-packages (from requests->nexus-sdk) (1.25.3)
Requirement not upgraded as not directly required: idna<2.9,>=2.5 in /Users/mfsy/anaconda3/envs/bbpworkshop/lib/python3.5/site-packages (from requests->nexus-sdk) (2.8)
Requirement not upgraded as not directly required: chardet<3.1.0,>=3.0.2 in /Users/mfsy/anaconda3/envs/bbpworkshop/lib/python3.5/site-packages (from requests->nexus-sdk) (3.0.4)


In [79]:
#Set a token to authenticate to Nexus
import getpass
token = getpass.getpass()


········


In [80]:
#Configure a nexus client
nexus_environment = "https://sandbox.bluebrainnexus.io/v1"
org ="demo"
project ="testdemo"

import nexussdk as nexus
nexus.config.set_environment(nexus_environment)
nexus.config.set_token(token)



In [81]:
from pygments.lexers import JsonLdLexer
from pygments import highlight
from pygments.formatters import TerminalFormatter, TerminalTrueColorFormatter
import json
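
The imports above are presumably used by the `pretty_print` helper called in later cells, which is not defined in this notebook. Here is a minimal sketch of what such a helper could look like. The function names `to_pretty_json` and `pretty_print` and their behavior are assumptions, and the colorized output is skipped when pygments is not installed.

```python
import json


def to_pretty_json(document):
    """Serialize a JSON-LD document (a plain dict) with two-space indentation."""
    return json.dumps(document, indent=2, ensure_ascii=False)


def pretty_print(document):
    """Print a JSON-LD document, colorized when pygments is available."""
    text = to_pretty_json(document)
    try:
        from pygments import highlight
        from pygments.formatters import TerminalFormatter
        from pygments.lexers import JsonLdLexer
        text = highlight(text, JsonLdLexer(), TerminalFormatter())
    except ImportError:
        pass  # pygments is optional: fall back to plain text
    print(text)


# Hypothetical payload, for illustration only
pretty_print({"@type": "Person", "jobTitle": "Engineer"})
```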

In [19]:
#install ontoviz
!pip install git+https://github.com/usc-isi-i2/ontology-visualization.git

Collecting git+https://github.com/usc-isi-i2/ontology-visualization.git
  Cloning https://github.com/usc-isi-i2/ontology-visualization.git to /private/var/folders/f0/hkfhswz16gj0bsvl1hmtbw6h0000gn/T/pip-req-build-szvpuec4
    Complete output from command python setup.py egg_info:
    Traceback (most recent call last):
      File "<string>", line 1, in <module>
      File "/Users/mfsy/anaconda3/lib/python3.6/tokenize.py", line 452, in open
        buffer = _builtin_open(filename, 'rb')
    FileNotFoundError: [Errno 2] No such file or directory: '/private/var/folders/f0/hkfhswz16gj0bsvl1hmtbw6h0000gn/T/pip-req-build-szvpuec4/setup.py'
    
    ----------------------------------------
Command "python setup.py egg_info" failed with error code 1 in /private/var/folders/f0/hkfhswz16gj0bsvl1hmtbw6h0000gn/T/pip-req-build-szvpuec4/
You are using pip version 10.0.1, however version 19.1.1 is available.
You should consider upgrading via the 'pip install --upgrade pip' command.


In [85]:
from sparqlendpointhelper import SparqlViewHelper

%load_ext autoreload
%autoreload 1
%aimport sparqlendpointhelper
%aimport utils

sparqlview_endpoint = nexus_environment+"/views/"+org+"/"+project+"/graph/sparql"
sparqlviewhelper = SparqlViewHelper(sparqlview_endpoint,nexus_environment, org, project, token)

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload
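
The `SparqlViewHelper` class comes from a local module that is not shown in this notebook. To give an idea of what querying the view endpoint might involve, the sketch below only prepares an HTTP request with the standard library and does not send it. The `build_sparql_request` name, the `application/sparql-query` content type and the bearer-token header are assumptions, not taken from the helper's source.

```python
import urllib.request


def build_sparql_request(endpoint, query, token):
    """Prepare (but do not send) a POST request carrying a SPARQL query."""
    return urllib.request.Request(
        endpoint,
        data=query.encode("utf-8"),
        headers={
            # Assumed content type for a raw SPARQL query body
            "Content-Type": "application/sparql-query",
            # Assumed authentication scheme: bearer token
            "Authorization": "Bearer " + token,
        },
        method="POST",
    )


# Same URL shape as sparqlview_endpoint above; the token is a placeholder
endpoint = "https://sandbox.bluebrainnexus.io/v1/views/demo/testdemo/graph/sparql"
request = build_sparql_request(endpoint, "SELECT * WHERE { ?s ?p ?o } LIMIT 5", "my-token")
print(request.get_method(), request.full_url)
# Sending it would be: urllib.request.urlopen(request)
```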


In [86]:
%reload_ext autoreload

## Step 2: Create a Person entity

Let's define an entity of type Person as follows:

* A person is an entity of type Person (@type value)
* A person has an identifier (@id value)
* A person has a family name (familyName value) and a given name (givenName value)
* A person has a job, described by its title (jobTitle value)
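
The properties listed above map directly onto a JSON-LD payload. As a minimal sketch, a hypothetical `make_person` helper (not part of the notebook or of the Nexus SDK) could assemble the payload and reject identifiers that are not absolute IRIs. The names used in the example call are placeholders.

```python
def make_person(identifier, family_name, given_name, job_title,
                context="https://bbp.neuroshapes.org"):
    """Build the JSON-LD payload for a Person entity."""
    if not identifier.startswith(("http://", "https://")):
        raise ValueError("@id should be an absolute IRI, got: " + identifier)
    return {
        "@context": context,
        "@id": identifier,
        "@type": "Person",
        "familyName": family_name,
        "givenName": given_name,
        "jobTitle": job_title,
    }


# Placeholder names; use your own ORCID or GitHub identifier
person_example = make_person(
    "https://orcid.org/0000-0002-4603-9838", "Doe", "Jane", "Engineer")
print(sorted(person_example))
```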


In [87]:
# Use an ORCID identifier if you have one, or your GitHub id
person = {
    "@context": "https://bbp.neuroshapes.org",
    "@id": "https://orcid.org/0000-0002-4603-9838",
    "@type": "Person",
    "familyName": "your family name here",
    "givenName": "your given name here",
    "jobTitle": "Engineer"
}

# create_resource takes the payload as its only positional argument
response = create_resource(person)
pretty_print(response)


## Step 3: Create an article entity and link it to the Person entity as author

The knowledge graph now contains a single entity of type Person. Let's add one scholarly article (publication) authored by that person:

* A scholarly article is an entity of type ScholarlyArticle (@type value)
* A scholarly article has an identifier (@id value)
* A scholarly article has a name (name value)
* A scholarly article has a publishing date (datePublished value)


###  Create a ScholarlyArticle entity

In [71]:
# Let's create an entity describing a publication with a DOI: https://doi.org/10.1186/1471-2105-13-s1-s4
scholarly_article = {
    "@context": "https://bbp.neuroshapes.org",
    "@type": "ScholarlyArticle",
    "@id": "https://doi.org/10.1186/1471-2105-13-s1-s4",
    "name": "User centered and ontology based information retrieval system for life sciences",
    "datePublished": "2012-12"
}

# create_resource takes the payload as its only positional argument
response = create_resource(scholarly_article)
pretty_print(response)

###  Link the Person and the ScholarlyArticle entity with authorship

In [70]:
# A reference to the Person identifier (value of @id) is enough to link with the article
# Note the revision value change (should be "_rev": 2) after an update
# Run the Person cell first so that `person` is defined
scholarly_article["author"] = person["@id"]
response = update_resource(scholarly_article["@id"], scholarly_article)
pretty_print(response)
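
To make the link-by-reference and the revision bump concrete, here is a toy in-memory model. It is not the Nexus API: `ToyResourceStore` is a hypothetical stand-in, and the rule that an update must quote the current revision is an assumption about how revision-based stores typically guard against conflicting writes.

```python
class ToyResourceStore:
    """In-memory stand-in for a resource store; NOT the Nexus API."""

    def __init__(self):
        self._resources = {}

    def create(self, data):
        # A freshly created resource starts at revision 1
        stored = dict(data, _rev=1)
        self._resources[data["@id"]] = stored
        return dict(stored)

    def update(self, resource_id, data, rev):
        current = self._resources[resource_id]
        if rev != current["_rev"]:
            raise ValueError(
                "stale revision %s, current is %s" % (rev, current["_rev"]))
        # Each successful update increments the revision
        stored = dict(data, _rev=rev + 1)
        self._resources[resource_id] = stored
        return dict(stored)


store = ToyResourceStore()
article = {"@id": "https://doi.org/10.1186/1471-2105-13-s1-s4",
           "@type": "ScholarlyArticle"}
created = store.create(article)

# Link to the author by reference: only the Person's @id is stored
article["author"] = "https://orcid.org/0000-0002-4603-9838"
updated = store.update(article["@id"], article, rev=created["_rev"])
print(created["_rev"], updated["_rev"])  # 1 2
```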

###  Fetch the ScholarlyArticle by identifier to view the update 

In [69]:
# Run the ScholarlyArticle cells first so that `scholarly_article` is defined
response = fetch_resource(scholarly_article["@id"])
pretty_print(response)

##  Step 4: Update the Person entity to add its affiliation

Let's update the person entity with affiliation information. We will use EPFL as the affiliation and link it to the person entity via the "affiliation" property.

In this step, we'll:

* Search for the EPFL organization entity and retrieve its identifier
* Update the person entity affiliation to point to the EPFL entity
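
SPARQL endpoints commonly return results in the W3C SPARQL JSON results format, where each row of `results.bindings` maps a variable name to a typed value. As a sketch of how an identifier could be pulled out of such a response, the hypothetical `binding_values` function below collects the values of one variable. The sample response and its identifier are made up.

```python
def binding_values(results, variable):
    """Collect the values bound to one variable in SPARQL JSON results."""
    return [
        row[variable]["value"]
        for row in results["results"]["bindings"]
        if variable in row  # a variable may be unbound in some rows
    ]


# Hypothetical response: the identifier below is a placeholder, not real data
sample_results = {
    "head": {"vars": ["s", "name_value"]},
    "results": {"bindings": [
        {"s": {"type": "uri", "value": "https://example.org/organizations/epfl"},
         "name_value": {"type": "literal", "value": "EPFL"}},
    ]},
}
org_ids = binding_values(sample_results, "s")
print(org_ids[0])
```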

### Search for the EPFL organization entity and retrieve its identifier and name

In [64]:
_type = "Organization"
orgs_df = sparqlviewhelper.get_entity_by_path_value(path="acronym",literal_value="EPFL",_type=_type, 
                                                    retrieve_properties=["grid_id","name"],
                                                    result_format = "DATAFRAME")
orgs_df.head()

SparqlQueryException: Failed to execute the query PREFIX schema: <http://schema.org/> 
PREFIX prov: <http://www.w3.org/ns/prov#> 
PREFIX sh: <http://www.w3.org/ns/shacl#> 
PREFIX vann: <http://purl.org/vocab/vann/> 
PREFIX skos: <http://www.w3.org/2004/02/skos/core#> 
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> 
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#> 
PREFIX xml: <http://www.w3.org/XML/1998/namespace> 
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#> 
PREFIX nsg: <https://neuroshapes.org/> 
PREFIX nxv: <https://bluebrain.github.io/nexus/vocabulary/> 
PREFIX dcterms: <http://purl.org/dc/terms/> 
PREFIX dc: <http://purl.org/dc/elements/1.1/> 
PREFIX owl: <http://www.w3.org/2002/07/owl#> 
PREFIX context: <https://incf.github.io/neuroshapes/contexts/> 
PREFIX vocab: <https://sandbox.bluebrainnexus.io/v1/demo/testdemo/_/> 
Select DISTINCT * WHERE {
                     ?s vocab:acronym vocab:EPFL.
                     ?s a <{'@id': 'nsg:Organization'}>.
                     ?s vocab:grid_id ?grid_id_value . 
?s {'@id': 'schema:name'} ?name_value . 

                     
               }
               .
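
The generated query in the error above is malformed in several visible ways: Python dictionaries such as `{'@id': 'nsg:Organization'}` leaked into the type and property positions, a stray `.` follows the closing brace, and the acronym is rendered as the IRI `vocab:EPFL` where a string literal was probably intended. The sketch below shows what a well-formed version could look like. `build_entity_query` is a hypothetical function, not the actual `SparqlViewHelper` code, and the choice of prefixes for each property simply mirrors the failed query.

```python
def build_entity_query(path, literal_value, type_curie, properties):
    """Build a SELECT query for entities of a type with a given literal value.

    `properties` maps a result variable name to a prefixed property name.
    """
    # Escape backslashes and double quotes so the literal stays well-formed
    escaped = literal_value.replace("\\", "\\\\").replace('"', '\\"')
    lines = [
        "PREFIX schema: <http://schema.org/>",
        "PREFIX nsg: <https://neuroshapes.org/>",
        "PREFIX vocab: <https://sandbox.bluebrainnexus.io/v1/demo/testdemo/_/>",
        "SELECT DISTINCT * WHERE {",
        '  ?s %s "%s" .' % (path, escaped),
        "  ?s a %s ." % type_curie,
    ]
    for variable, prop in properties.items():
        lines.append("  ?s %s ?%s ." % (prop, variable))
    lines.append("}")
    return "\n".join(lines)


query = build_entity_query(
    path="vocab:acronym",
    literal_value="EPFL",
    type_curie="nsg:Organization",
    properties={"grid_id_value": "vocab:grid_id", "name_value": "schema:name"},
)
print(query)
```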

### Update the person entity affiliation to point to the EPFL entity

In [None]:
# Note the revision value change after an update
# orgs_df["s"] is a pandas Series: use the first match as the affiliation identifier
person["affiliation"] = orgs_df["s"].iloc[0]
response = update_resource(person["@id"], person)
pretty_print(response)

## Step 5: Update the EPFL organization entity with its logo file


Wouldn't it be great to add the EPFL logo as a preview image of the EPFL organization entity? It turns out the EPFL entity is available in the Wikidata open knowledge graph with the logo we need.
Wikidata is the knowledge graph version of Wikipedia, with a public SPARQL endpoint.

Let's query the Wikidata SPARQL endpoint to fetch the logo of the EPFL entity using its GRID identifier.
[Try the query](https://query.wikidata.org/#SELECT%20%3Fepfl_logo%20WHERE%20%7B%0A%20%20%3Fepfl%20wdt%3AP2427%20%22grid.5333.6%22.%0A%20%20OPTIONAL%20%7B%20%3Fepfl%20wdt%3AP154%20%3Fepfl_logo.%20%7D%0A%7D) in the Wikidata SPARQL playground.



### Create a wrapper around the wikidata sparql endpoint

In [10]:
wikidata_sparql_endpoint = "https://query.wikidata.org/sparql"
wikidata_sparql_helper = SparqlViewHelper(wikidata_sparql_endpoint,token)

### Get the EPFL logo url

In the query below:

* wdt:P2427 is the Wikidata property for GRID ID
* wdt:P154 is the Wikidata property for logo image URL
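
As a sketch of how the two properties combine, the hypothetical functions below build the logo query for a given GRID ID and percent-encode it into a link for the Wikidata query service UI. The GRID ID `grid.5333.6` is the one used in the "Try the query" link of this step.

```python
from urllib.parse import quote


def wikidata_logo_query(grid_id):
    """Return a query selecting the logo of the entity with this GRID ID."""
    return (
        "SELECT ?epfl_logo WHERE {\n"
        '  ?epfl wdt:P2427 "%s".\n'
        "  OPTIONAL { ?epfl wdt:P154 ?epfl_logo. }\n"
        "}"
    ) % grid_id


def playground_url(query):
    """Build a link that opens the query in the Wikidata query service UI."""
    # safe="" also encodes "/" so the whole query stays inside the URL fragment
    return "https://query.wikidata.org/#" + quote(query, safe="")


url = playground_url(wikidata_logo_query("grid.5333.6"))
print(url)
```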

In [None]:
# wdt:P2427 is the Wikidata property for GRID ID
epfl_logo_query = """
SELECT *
WHERE
{
    ?epfl wdt:P2427 "%s".
    ?epfl wdt:P154 ?logo.
}
""" % (orgs_df["grid_id"].iloc[0])  # first matching GRID ID, not the whole Series
wiki_results = query_sparql(epfl_logo_query, wikidata_sparql_helper)
wiki_df = sparql2dataframe(wiki_results)
wiki_df.head()

## Step 6: Explore and navigate the created knowledge graph

### Search Entity by type

In [66]:
_type = "Organization"
results_df = sparqlviewhelper.get_entity_by_type(_type, result_format = "DATAFRAME")
results_df.head()

NameError: name 'aURI' is not defined

### Explore and navigate data using the SPARQL query language


Let's write our first query.

In [None]:
select_all_query = """
SELECT ?s ?p ?o
WHERE
{
  ?s ?p ?o
}
OFFSET 0
LIMIT 5
"""

nexus_results = query_sparql(select_all_query,sparqlview_wrapper)

nexus_df =sparql2dataframe(nexus_results)
nexus_df.head()

Most SPARQL queries you'll see have the anatomy above:
* a **SELECT** clause that lets you select the variables you want to retrieve
* a **WHERE** clause defining a set of constraints that the variables should satisfy to be retrieved
* **LIMIT** and **OFFSET** clauses to enable pagination
* constraints that are usually graph patterns in the form of **triples** (?s for subject, ?p for property and ?o for object)
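
To build intuition for how a triple pattern selects data, here is a toy matcher in plain Python. It is only an illustration of the idea, not how a SPARQL engine is implemented, and the graph identifiers are made up. Strings starting with `?` play the role of variables, and `offset`/`limit` mimic the pagination clauses.

```python
def match(pattern, triple):
    """Match one triple against a pattern; strings starting with '?' are variables."""
    binding = {}
    for term, value in zip(pattern, triple):
        if term.startswith("?"):
            if binding.get(term, value) != value:
                return None  # same variable bound to two different values
            binding[term] = value
        elif term != value:
            return None  # constant term does not match
    return binding


def select(triples, pattern, offset=0, limit=None):
    """Return the bindings of all matching triples, paginated."""
    rows = [b for b in (match(pattern, t) for t in triples) if b is not None]
    end = None if limit is None else offset + limit
    return rows[offset:end]


# Toy graph; the identifiers are made up for illustration
graph = [
    ("ex:alice", "rdf:type", "schema:Person"),
    ("ex:alice", "schema:affiliation", "ex:epfl"),
    ("ex:epfl", "rdf:type", "schema:Organization"),
]
print(select(graph, ("?s", "?p", "?o"), offset=0, limit=2))
print(select(graph, ("?s", "rdf:type", "?o")))
```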

Multiple triples can be provided as a graph pattern to match, but each triple should end with a period. As an example, let's retrieve 5 movies (?movie) along with their titles (?title).

In [None]:
movie_with_title = """
PREFIX vocab: <https://nexus-sandbox.io/v1/vocabs/%s/%s/>
PREFIX nxv: <https://bluebrain.github.io/nexus/vocabulary/>
Select ?movie ?title
 WHERE  {
    ?movie a vocab:Movie.
    ?movie vocab:title ?title.
} LIMIT 5
"""%(org,project)

nexus_results = query_sparql(movie_with_title,sparqlview_wrapper)

nexus_df =sparql2dataframe(nexus_results)
nexus_df.head()

Note the PREFIX clauses: they are a way to shorten URIs within a SPARQL query. Without them, we would have to write the full URI for every property.
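
The effect of a PREFIX declaration can be sketched as a simple string expansion: the prefix is replaced by its namespace URI and the local name is appended. The `expand` function below is a hypothetical illustration, not part of any SPARQL library.

```python
def expand(curie, prefixes):
    """Expand a prefixed name such as 'schema:name' into a full URI."""
    prefix, _, local_name = curie.partition(":")
    if prefix not in prefixes:
        raise KeyError("undeclared prefix: " + prefix)
    return prefixes[prefix] + local_name


prefixes = {
    "schema": "http://schema.org/",
    "nxv": "https://bluebrain.github.io/nexus/vocabulary/",
}
print(expand("schema:name", prefixes))  # http://schema.org/name
```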

The ?movie variable is bound to a URI (the internal Nexus id). For simplicity, let's retrieve the movieId instead, just like in the MovieLens CSV files.

In [None]:
movie_with_title = """
PREFIX vocab: <https://nexus-sandbox.io/v1/vocabs/%s/%s/>
PREFIX nxv: <https://bluebrain.github.io/nexus/vocabulary/>
Select ?movieId ?title
 WHERE  {
    
    # Select movies
    ?movie a vocab:Movie.

    # Select their movieId value
    ?movie vocab:movieId ?movieId.
    
    # Select their title
    ?movie vocab:title ?title.
    
} LIMIT 5
"""%(org,project)

nexus_results = query_sparql(movie_with_title,sparqlview_wrapper)

nexus_df =sparql2dataframe(nexus_results)
nexus_df.head()

In the above query, movies are things (or entities) of type vocab:Movie.
This is a typical instance query, where entities are filtered by their type(s) and then some of their properties are retrieved (here ?title).

Let's retrieve everything that is linked (outgoing) to the movies.
The * character in the SELECT clause means that all variables are retrieved: ?movie, ?p, ?o

In [None]:
movie_with_properties = """
PREFIX vocab: <https://nexus-sandbox.io/v1/vocabs/%s/%s/>
PREFIX nxv: <https://bluebrain.github.io/nexus/vocabulary/>
Select *
 WHERE  {
    ?movie a vocab:Movie.
    ?movie ?p ?o.
} LIMIT 20
"""%(org,project)

nexus_results = query_sparql(movie_with_properties,sparqlview_wrapper)

nexus_df =sparql2dataframe(nexus_results)
nexus_df.head(20)

As a little exercise, write a query retrieving incoming entities to movies. You can copy and paste the query above and modify it.

Hint: ?s ?p ?o can be read as: ?s has an outgoing link ?p to ?o.

Do you get any results?

In [None]:
#Your query here


Let's retrieve the movie ratings.

In [None]:
movie_with_properties = """
PREFIX vocab: <https://nexus-sandbox.io/v1/vocabs/%s/%s/>
PREFIX nxv: <https://bluebrain.github.io/nexus/vocabulary/>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
Select ?userId ?movieId ?rating ?timestamp
 WHERE  {
    ?movie a vocab:Movie.
    ?movie vocab:movieId ?movieId.
    
    
    ?ratingNode vocab:movieId ?ratingmovieId.
    ?ratingNode vocab:rating ?rating.
    ?ratingNode vocab:userId ?userId.
    ?ratingNode vocab:timestamp ?timestamp.
    
    # pandas stored movieId as a double for ratings, hence the cast to integer
    FILTER(xsd:integer(?ratingmovieId) = ?movieId)
    
} LIMIT 20
"""%(org,project)

nexus_results = query_sparql(movie_with_properties,sparqlview_wrapper)

nexus_df =sparql2dataframe(nexus_results)
nexus_df.head(20)

As a little exercise, write a query retrieving the movie tags along with the user id and timestamp. You can copy and paste the query above and modify it.


In [None]:
#Your query here



### Aggregate queries

[Aggregates](https://www.w3.org/TR/sparql11-query/#aggregates) apply some operations over a group of solutions.
Available aggregates are: COUNT, SUM, MIN, MAX, AVG, GROUP_CONCAT, and SAMPLE.

We will not see them all but we'll look at some examples.
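
Before running aggregates against the endpoint, it can help to see what they compute on a handful of rows. The plain-Python sketch below mimics GROUP BY with COUNT, AVG and GROUP_CONCAT over made-up rating rows. It only illustrates the semantics and is unrelated to how the SPARQL engine evaluates them.

```python
from collections import defaultdict

# Made-up (movieId, userId, rating) rows, for illustration only
rows = [(1, "u1", 4.0), (1, "u2", 3.0), (2, "u1", 5.0)]

# GROUP BY ?movieId
groups = defaultdict(list)
for movie_id, user_id, rating in rows:
    groups[movie_id].append((user_id, rating))

summary = {
    movie_id: {
        "count": len(members),                               # COUNT
        "avg": sum(r for _, r in members) / len(members),    # AVG
        "users": ",".join(sorted({u for u, _ in members})),  # GROUP_CONCAT(DISTINCT ...)
    }
    for movie_id, members in groups.items()
}
print(summary)
```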

The next query will compute the average rating score for 'funny' movies.

In [None]:
tag_value = "funny"
movie_avg_ratings = """
PREFIX vocab: <https://nexus-sandbox.io/v1/vocabs/%s/%s/>
PREFIX nxv: <https://bluebrain.github.io/nexus/vocabulary/>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>

Select ( AVG(?ratingvalue) AS ?score)
 WHERE  {
    # Select movies
    ?movie a vocab:Movie.

    # Select their movieId value
    ?movie vocab:movieId ?movieId.

    ?tag vocab:movieId ?movieId.
    ?tag vocab:tag ?tagvalue.
    FILTER(?tagvalue = "%s").

    # Keep movies with ratings
    ?rating vocab:movieId ?ratingmovidId.
    FILTER(xsd:integer(?ratingmovidId) = xsd:integer(?movieId))
    ?rating vocab:rating ?ratingvalue.

}
""" %(org,project,tag_value)

nexus_results = query_sparql(movie_avg_ratings,sparqlview_wrapper)

nexus_df =sparql2dataframe(nexus_results)
display(nexus_df.head(20))
nexus_df=nexus_df.astype(float)


The next query retrieves the number of tags per movie. It can be slow depending on the size of your data.

In [None]:
nbr_tags_per_movie = """
PREFIX vocab: <https://nexus-sandbox.io/v1/vocabs/%s/%s/>
PREFIX nxv: <https://bluebrain.github.io/nexus/vocabulary/>

Select ?title (COUNT(?tagvalue) as ?tagnumber)
 WHERE  {
    # Select movies
    ?movie a vocab:Movie.
    # Select their movieId value
    ?movie vocab:movieId ?movieId.
    
    ?tag a vocab:Tag.
    ?tag vocab:movieId ?tagmovieId.
    FILTER(?tagmovieId = ?movieId)
    ?movie vocab:title ?title.
    ?tag vocab:tag ?tagvalue.
}

GROUP BY ?title 
ORDER BY DESC(?tagnumber)
LIMIT 10
""" %(org,project)

nexus_results = query_sparql(nbr_tags_per_movie,sparqlview_wrapper)

nexus_df =sparql2dataframe(nexus_results)
display(nexus_df.head(20))


In [None]:
# Let's plot the result
import pandas as pd

nexus_df.tagnumber = pd.to_numeric(nexus_df.tagnumber)
nexus_df.plot(x="title", y="tagnumber", kind="bar")


The next query retrieves movies along with the users that tagged them, as a comma-separated list.

In [None]:
# Group Concat

movie_tag_users = """
PREFIX vocab: <https://nexus-sandbox.io/v1/vocabs/%s/%s/>
PREFIX nxv: <https://bluebrain.github.io/nexus/vocabulary/>

Select ?movieId (group_concat(DISTINCT ?userId;separator=",") as ?users)
 WHERE  {
    # Select movies
    ?movie a vocab:Movie.

    # Select their movieId value
    ?movie vocab:movieId ?movieId.

    ?tag vocab:movieId ?movieId.
    ?tag vocab:userId ?userId.

  
}
GROUP BY ?movieId
LIMIT 10
"""%(org,project)

nexus_results = query_sparql(movie_tag_users,sparqlview_wrapper)

nexus_df =sparql2dataframe(nexus_results)
nexus_df.head(20)