# Knowledge and Data 2020: Practical Assignment 2
## Manipulate local and external RDF Knowledge Graphs 

YOUR NAME: Tom Slik

YOUR VUNetID: tsk910

*(If you do not provide your name and VUNetID we will not accept your submission).*

### Learning objectives

At the end of this exercise you should be able to perform some simple manipulations of RDF Data using the rdflib library. You should be able to: 

1. Add and retrieve information from a local RDF database
2. Represent RDF data in other formats, such as the .dot format for graph visualisation
3. Retrieve information from an RDF database created from Web Data
4. Query information from the Web with SPARQL

### Practicalities

Follow this Notebook step-by-step. 

Of course, you can do the exercises in any Programming Editor of your liking. 
But you do not have to. Feel free to simply write code in the Notebook. When 
everything is filled in and works, save the Notebook and submit it 
as a Jupyter Notebook, i.e. with an .ipynb extension. Please name the 
Notebook studentID+Assignment2.ipynb.  

Unlike in courses dedicated to programming, we will not evaluate the style
of the programs. But we will test your programs on data other than what we provide, 
and your program should give the correct answers on that test data as well. 

# A. Tasks related to local RDF Knowledge Graphs

The first cell below imports the rdflib library and defines helper functions to load, print and save a graph; we will use them to open the file 'example-from-slides.ttl'. The first Practical Assignment should have taught you that manipulating symbols as strings is a major pain. 

Programming libraries, such as **rdflib**, help you with this mess once and for all, by parsing the files, creating appropriate data structures (Graph()) and providing useful functions (such as parse(), serialize() and much more). 
Check the website of rdflib http://rdflib.readthedocs.io/: this library does most of the hard work for you.

Before starting with the tasks of this assignment, do not forget to install **rdflib** so we can start using it. Installing libraries in Python is very simple. Just open your terminal and write the following command:

    $ pip install rdflib

For more details on how to install pip and Python libraries, you can check the [preliminaries document](https://github.com/ucds-vu/knowledge-data-vu/blob/master/Tutorials/Preliminaries/preliminaries.md).  

In [129]:
from rdflib import Graph, RDF, Namespace, Literal, URIRef

g = Graph()

EX = Namespace('http://example.com/kad2020/')
g.bind('ex',EX)

def serialize_graph():
    # g.serialize() returns a string
    print(g.serialize(format='turtle'))

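def save_graph(filename):
    # Pass the file name so rdflib opens the file itself; the N-Triples
    # serializer may write bytes, which a text-mode file handle rejects.
    g.serialize(destination=filename, format='nt')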
        
def load_graph(filename):
    with open(filename, 'r') as f:
        g.parse(f, format='turtle')   

The file 'example-from-slides.ttl' formalises the knowledge base from the slides from Module 1, and a bit more. 

Here is how it looks when you load it into your program and serialise it with rdflib in turtle. 

In [130]:
load_graph('example-from-slides.ttl')
serialize_graph()

@prefix ex: <http://example.com/kad2020/> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .

ex:Germany a ex:EuropeanCountry .

ex:Netherlands a ex:Country ;
    ex:has_Capital ex:Amsterdam ;
    ex:has_Name "The Netherlands" ;
    ex:neighbours ex:Belgium .

ex:hasCapital rdfs:range ex:Capital ;
    rdfs:subPropertyOf ex:containsCity .

ex:Amsterdam a ex:Capital .

ex:Belgium a ex:Country .

ex:EuropeanCountry rdfs:subClassOf ex:Country .

ex:containsCity rdfs:domain ex:Country ;
    rdfs:range ex:City .

ex:Capital rdfs:subClassOf ex:City .




Now, we can manipulate the graph very easily, e.g. with the following very simple loop, which prints the predicate(s) that relate a subject to a literal object: 

In [131]:
for s,p,o in g:
    if type(o) is Literal:
        print(p)

http://example.com/kad2020/has_Name


### - Task 1: (1 Point) Add information to an RDF graph

Add 10 triples to the knowledge graph. Make sure that they have the right namespaces. 

Similarly to the triples already present in the file 'example-from-slides.ttl', add at least:
- a new country with its name and capital 
- one triple with a new predicate

Check: http://rdflib.readthedocs.io/en/stable/intro_to_creating_rdf.html

In [132]:
ex = Namespace("http://example.com/kad2020/")
owl = Namespace('http://www.w3.org/2002/07/owl#')


# add 10 triples here to the graph 'g' (do not forget the namespaces).

g.add((ex.Belgium, ex.has_Name, Literal('Belgium')))
g.add((ex.Brussels, RDF.type, ex.Capital))
g.add((ex.France, RDF.type, ex.EuropeanCountry))
g.add((ex.France, ex.has_Name, Literal('France')))
g.add((ex.France, ex.neighbours, ex.France))
g.add((ex.Belgium, ex.neighbours, ex.Germany))
g.add((ex.Germany, ex.has_Name, Literal('Germany')))
g.add((ex.Germany, ex.has_SurfaceArea, Literal(357386)))
g.add((ex.Iceland, ex.has_Name, Literal('Iceland')))
g.add((ex.Iceland, ex.neighbours, ex.the_Ocean))



print(g.serialize(format='turtle'))

@prefix ex: <http://example.com/kad2020/> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

ex:Brussels a ex:Capital .

ex:Iceland ex:has_Name "Iceland" ;
    ex:neighbours ex:the_Ocean .

ex:Netherlands a ex:Country ;
    ex:has_Capital ex:Amsterdam ;
    ex:has_Name "The Netherlands" ;
    ex:neighbours ex:Belgium .

ex:hasCapital rdfs:range ex:Capital ;
    rdfs:subPropertyOf ex:containsCity .

ex:Amsterdam a ex:Capital .

ex:Belgium a ex:Country ;
    ex:has_Name "Belgium" ;
    ex:neighbours ex:Germany .

ex:France a ex:EuropeanCountry ;
    ex:has_Name "France" ;
    ex:neighbours ex:France .

ex:Germany a ex:EuropeanCountry ;
    ex:has_Name "Germany" ;
    ex:has_SurfaceArea 357386 .

ex:containsCity rdfs:domain ex:Country ;
    rdfs:range ex:City .

ex:EuropeanCountry rdfs:subClassOf ex:Country .

ex:Capital rdfs:subClassOf ex:City .




*After you have run the previous code (adding triples), the next cells will be executed on your extended graph. That is ok.*

### - Task 2a: (1 Point) Get structured information from an RDF graph (all Literals)

Use the functions available in the RDFLib library. Write a small function to print all Literals. 

Hint: there is a function in rdflib to test the type of an object (check previous examples in this notebook)

In [133]:
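def print_literals(graph):
    # Print every object in the graph that is a Literal.
    for s, p, o in graph:
        if isinstance(o, Literal):
            print(o)

print_literals(g)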

France
Iceland
The Netherlands
Belgium
Germany
357386


### - Task 2b: (1 Point) Get structured information from an RDF graph (all unique Predicates)

Please provide another function that gives a **unique** list of the predicates, ordered by occurrence (most occurring first). The answer will look like this: 
<br>http://www.w3.org/2000/01/rdf-schema#label
<br>http://www.w3.org/1999/02/22-rdf-syntax-ns#type
<br>http://example.com/sw2016/locatedIn
<br>http://www.w3.org/2000/01/rdf-schema#range

In [134]:
predicates = []
for s, p, o in g:
    predicates.append(p)

predicates.sort()
#print(predicates)
indexed_Pred = []
counter = 1
prev = "none"
for pred in predicates:
    if prev == pred:
        counter += 1
        indexed_Pred[-1] = [pred, counter]
        prev = pred
    else:
        indexed_Pred.append([pred, 1])
        prev = pred
        counter = 1

from operator import itemgetter

indexed_Pred.sort(key=itemgetter(1), reverse=True)

for iPred in indexed_Pred:
    print(iPred[0])

http://www.w3.org/1999/02/22-rdf-syntax-ns#type
http://example.com/kad2020/has_Name
http://example.com/kad2020/neighbours
http://www.w3.org/2000/01/rdf-schema#range
http://www.w3.org/2000/01/rdf-schema#subClassOf
http://example.com/kad2020/has_Capital
http://example.com/kad2020/has_SurfaceArea
http://www.w3.org/2000/01/rdf-schema#domain
http://www.w3.org/2000/01/rdf-schema#subPropertyOf
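
The counting and sorting above can also be expressed with `collections.Counter` from the standard library. The following is only a sketch under stated assumptions: `predicates_by_frequency` is a hypothetical helper name, and the sample data are plain string tuples standing in for the triples of a graph. Ties are broken alphabetically, which matches the order that the sort-then-count approach above produces.

```python
from collections import Counter

def predicates_by_frequency(triples):
    """Return the distinct predicates, most frequent first."""
    counts = Counter(p for _, p, _ in triples)
    # Descending count first, then alphabetical order so ties are deterministic.
    return sorted(counts, key=lambda p: (-counts[p], p))

# Hypothetical stand-in data: plain strings instead of rdflib terms.
sample = [
    ("ex:Netherlands", "rdf:type", "ex:Country"),
    ("ex:Belgium", "rdf:type", "ex:Country"),
    ("ex:Netherlands", "ex:neighbours", "ex:Belgium"),
]

for predicate in predicates_by_frequency(sample):
    print(predicate)
```

Because the function returns a list instead of printing, the result can be reused, for example to inspect only the most frequent predicate.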


# B. Tasks related to Graph visualisations 

### - Task 3a: (1 Point) From RDF to .dot 


In the lecture, we have seen two ways of writing a knowledge graph (simple n-triples, and simple turtle). Let us consider a 3rd syntax, this time a syntax that is useful for visualisation. One standard for visualising graphs is the .dot format.

Print the knowledge graph in .dot file format. Check https://graphviz.gitlab.io/documentation/ for the documentation. You will only need very little of this information, and the most relevant information can be found in the examples that are given. 

<br>Basically, an RDF graph in .dot format starts with 
<br>digraph G { 
    and then a list of links of the following form 
<br>s -> o [label="p"]
    for every (s p o) in KG, separated by ;
<br>Do not forget to end with a closing bracket: }

An example is 
     
     digraph G { s1 -> o1 [label="p1"] ; s2 -> o2 [label="p2"] } 
     
for an RDF graph {(s1 p1 o1),(s2 p2 o2)}

*You can check what your graph looks like. Just copy & paste your output into: http://www.webgraphviz.com/*

In [135]:
dotFormat = "digraph G { "

for s,p,o in g:
    s = s.split('/')[-1]
    p = p.replace('#', '/')
    p = p.split('/')[-1]
    o = o.split('/')[-1]
    dotFormat += s + ' -> ' + o + ' [label="' + p + '"] ; '

dotFormat = dotFormat[:-3] + ' }'

print(dotFormat)

digraph G { hasCapital -> containsCity [label="subPropertyOf"] ; Brussels -> Capital [label="type"] ; EuropeanCountry -> Country [label="subClassOf"] ; hasCapital -> Capital [label="range"] ; France -> France [label="neighbours"] ; France -> France [label="has_Name"] ; Belgium -> Country [label="type"] ; Iceland -> Iceland [label="has_Name"] ; Iceland -> the_Ocean [label="neighbours"] ; Belgium -> Germany [label="neighbours"] ; Netherlands -> Belgium [label="neighbours"] ; Netherlands -> The Netherlands [label="has_Name"] ; Amsterdam -> Capital [label="type"] ; France -> EuropeanCountry [label="type"] ; containsCity -> City [label="range"] ; Belgium -> Belgium [label="has_Name"] ; Netherlands -> Country [label="type"] ; Capital -> City [label="subClassOf"] ; Germany -> Germany [label="has_Name"] ; Germany -> 357386 [label="has_SurfaceArea"] ; containsCity -> Country [label="domain"] ; Netherlands -> Amsterdam [label="has_Capital"] ; Germany -> EuropeanCountry [label="type"] }
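
In the output above, the literal "The Netherlands" appears as an unquoted node name containing a space, which the DOT syntax does not accept as a single identifier. One possible way around this is to wrap every node name in double quotes. The sketch below assumes plain string triples as stand-in data; `dot_id`, `local_name` and `to_dot` are hypothetical helper names. It only addresses quoting and does not try to keep a literal such as "France" apart from the node France.

```python
def dot_id(term):
    """Wrap a name in double quotes (escaping embedded quotes) so spaces are allowed."""
    return '"' + str(term).replace('"', '\\"') + '"'

def local_name(term):
    """Keep the part of a URI after the last '#' or '/'."""
    return str(term).replace('#', '/').split('/')[-1]

def to_dot(triples):
    """Build a 'digraph G { ... }' string with one labelled edge per triple."""
    edges = [
        f'{dot_id(local_name(s))} -> {dot_id(local_name(o))} [label={dot_id(local_name(p))}]'
        for s, p, o in triples
    ]
    # Joining with ' ; ' avoids having to trim a trailing separator afterwards.
    return 'digraph G { ' + ' ; '.join(edges) + ' }'

# Hypothetical stand-in data: plain strings instead of rdflib terms.
sample = [
    ("http://example.com/kad2020/Netherlands",
     "http://example.com/kad2020/has_Name",
     "The Netherlands"),
    ("http://example.com/kad2020/Capital",
     "http://www.w3.org/2000/01/rdf-schema#subClassOf",
     "http://example.com/kad2020/City"),
]

print(to_dot(sample))
```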


### - Task 3b: (1 Point) From RDF to .dot with "semantic information"

There is a conceptual distinction between properties, instances and classes (sets of instances). A simple way of checking is the following:

1. in a triple (s a o), with predicate a (which is a special abbreviation for the predicate rdf:type), the s is an Instance, and o is a Class. 
2. in a triple (s rdfs:subClassOf o) both s and o are Classes. 
3. in a triple (p rdfs:domain o) p is a Property and o is a Class. 
4. in a triple (p rdfs:range o)  p is a Property and o is a Class. 

Make a .dot representation for an RDF graph that distinguishes between types of links (RDF vocabulary vs others) and types of nodes (Classes versus Instances) via different colors. 

*You can check what your graph looks like. Just copy & paste your output into: http://www.webgraphviz.com/*
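
The four rules listed above can be turned into a small classification step that runs before any colouring is done. This is a sketch only: `classify_nodes` is a hypothetical helper name, the sample triples are plain strings standing in for graph data, and predicates are compared as strings via `str(p)`.

```python
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
RDFS = "http://www.w3.org/2000/01/rdf-schema#"

def classify_nodes(triples):
    """Return (classes, instances) according to the four rules."""
    classes, instances = set(), set()
    for s, p, o in triples:
        p = str(p)
        if p == RDF_TYPE:                      # rule 1: s is an instance, o a class
            instances.add(s)
            classes.add(o)
        elif p == RDFS + "subClassOf":         # rule 2: both s and o are classes
            classes.update((s, o))
        elif p in (RDFS + "domain", RDFS + "range"):  # rules 3 and 4: o is a class
            classes.add(o)
    return classes, instances

# Hypothetical stand-in data: plain strings instead of rdflib terms.
sample = [
    ("ex:Amsterdam", RDF_TYPE, "ex:Capital"),
    ("ex:Capital", RDFS + "subClassOf", "ex:City"),
    ("ex:containsCity", RDFS + "domain", "ex:Country"),
    ("ex:Netherlands", "ex:neighbours", "ex:Belgium"),
]

classes, instances = classify_nodes(sample)
print(sorted(classes))
print(sorted(instances))
```

Collecting the sets first means each node needs to be given a colour only once when the .dot text is written, instead of once per triple it occurs in.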

In [136]:
dotFormat = "digraph G { "

for s,p,o in g:
    s = s.split('/')[-1]
    p = p.replace('#', '/')
    p = p.split('/')[-1]
    o = o.split('/')[-1]
    
    
    if p == 'type':
        dotFormat += s + ' -> ' + o + ' [label="' + p + '"]; '
        dotFormat += s + ' [color=blue, style=filled]; '  # Make instance Blue
        dotFormat += o + ' [color=red, style=filled]; ' # Make class red
    elif p == 'subClassOf':
        dotFormat += s + ' -> ' + o + ' [label="' + p + '"]; '
        dotFormat += s + ' [color=red, style=filled]; ' # Make class red
        dotFormat += o + ' [color=red, style=filled]; '# Make class red
    elif p == 'domain':
        dotFormat += s + ' -> ' + o + ' [label="' + p + '", color=green]; ' # Make property green
        dotFormat += o + ' [color=red, style=filled]; ' # Make class red
    elif p == 'range':
        dotFormat += s + ' -> ' + o + ' [label="' + p + '", color=green]; ' # Make property green
        dotFormat += o + ' [color=red, style=filled]; ' # Make class red
    else:
        dotFormat += s + ' -> ' + o + ' [label="' + p + '"]; '


dotFormat = dotFormat[:-2] + ' }'

print(dotFormat)

digraph G { hasCapital -> containsCity [label="subPropertyOf"]; Brussels -> Capital [label="type"]; Brussels [color=blue, style=filled]; Capital [color=red, style=filled]; EuropeanCountry -> Country [label="subClassOf"]; EuropeanCountry [color=red, style=filled]; Country [color=red, style=filled]; hasCapital -> Capital [label="range", color=green]; Capital [color=red, style=filled]; France -> France [label="neighbours"]; France -> France [label="has_Name"]; Belgium -> Country [label="type"]; Belgium [color=blue, style=filled]; Country [color=red, style=filled]; Iceland -> Iceland [label="has_Name"]; Iceland -> the_Ocean [label="neighbours"]; Belgium -> Germany [label="neighbours"]; Netherlands -> Belgium [label="neighbours"]; Netherlands -> The Netherlands [label="has_Name"]; Amsterdam -> Capital [label="type"]; Amsterdam [color=blue, style=filled]; Capital [color=red, style=filled]; France -> EuropeanCountry [label="type"]; France [color=blue, style=filled]; EuropeanCountry [color=red

### - Task 4a: (1 Point) Deriving implicit knowledge (a bit of schema)

We will look into Schema information in later modules, but let us already try to find some implicit information with a first bit of inferencing: whenever there are two statements (s a o) and (o rdfs:subClassOf o2) we can derive (and later prove) that (s a o2). 

Write a procedure that adds all implied triples to our knowledge graph. 
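
As an illustration of the rule itself, here is a minimal sketch on plain string tuples (stand-in data, not an rdflib graph); `infer_types` is a hypothetical helper name. Only triples whose predicate is rdf:type trigger the rule, and the implied triples are collected in a separate set, so nothing is modified while it is being iterated over. With rdflib, one would compare against the `RDF.type` and `RDFS.subClassOf` constants instead of the strings used here.

```python
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
SUBCLASS_OF = "http://www.w3.org/2000/01/rdf-schema#subClassOf"
RANGE = "http://www.w3.org/2000/01/rdf-schema#range"

def infer_types(triples):
    """Return the (s, rdf:type, o2) triples implied by one subclass step and not yet stated."""
    triples = set(triples)
    # Map each class to its directly stated superclasses.
    superclasses = {}
    for s, p, o in triples:
        if p == SUBCLASS_OF:
            superclasses.setdefault(s, set()).add(o)
    implied = set()
    for s, p, o in triples:
        if p == RDF_TYPE:
            for parent in superclasses.get(o, ()):
                implied.add((s, RDF_TYPE, parent))
    return implied - triples

# Hypothetical stand-in data: plain strings instead of rdflib terms.
sample = {
    ("ex:Germany", RDF_TYPE, "ex:EuropeanCountry"),
    ("ex:EuropeanCountry", SUBCLASS_OF, "ex:Country"),
    ("ex:hasCapital", RANGE, "ex:Capital"),
    ("ex:Capital", SUBCLASS_OF, "ex:City"),
}

for triple in sorted(infer_types(sample)):
    print(triple)
```

Adding the result to the data and calling the function again until it returns an empty set would also cover chains of several subclass steps.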

In [137]:
for s,p,o in g:
    s = s.split('/')[-1]
    p = p.replace('#', '/')
    p = p.split('/')[-1]
    o = o.split('/')[-1]
    #print(s + p + o)
    if p == 'subClassOf':
        s1 = s
        o1 = o
        print(s1 + ' ' + o1)

        for s,p,o in g:
            s = s.split('/')[-1]
            p = p.replace('#', '/')
            p = p.split('/')[-1]
            o = o.split('/')[-1]

            if o == s1:
                print(s+p+o)
                print(s+p+o1)
                g.add((ex[s], ex.impType, ex[o1]))

EuropeanCountry Country
FrancetypeEuropeanCountry
FrancetypeCountry
GermanytypeEuropeanCountry
GermanytypeCountry
Capital City
BrusselstypeCapital
BrusselstypeCity
hasCapitalrangeCapital
hasCapitalrangeCity
AmsterdamtypeCapital
AmsterdamtypeCity


### - Task 4b: (Optional - 1 Extra Point) Visualising implicit knowledge

Produce a .dot version of the graph with the implied triples, and mark the edges of those triples with a different color or arrow style. 

In [138]:
dotFormat = "digraph G { "

for s,p,o in g:
    s = s.split('/')[-1]
    p = p.replace('#', '/')
    p = p.split('/')[-1]
    o = o.split('/')[-1]
    
    
    if p == 'type':
        dotFormat += s + ' -> ' + o + ' [label="' + p + '"]; '
        dotFormat += s + ' [color=blue, style=filled]; '  # Make instance Blue
        dotFormat += o + ' [color=red, style=filled]; ' # Make class red
    elif p == 'impType':
        dotFormat += s + ' -> ' + o + ' [label="' + p + '",color=purple]; '
    elif p == 'subClassOf':
        dotFormat += s + ' -> ' + o + ' [label="' + p + '"]; '
        dotFormat += s + ' [color=red, style=filled]; ' # Make class red
        dotFormat += o + ' [color=red, style=filled]; '# Make class red
    elif p == 'domain':
        dotFormat += s + ' -> ' + o + ' [label="' + p + '", color=green]; ' # Make property green
        dotFormat += o + ' [color=red, style=filled]; ' # Make class red
    elif p == 'range':
        dotFormat += s + ' -> ' + o + ' [label="' + p + '", color=green]; ' # Make property green
        dotFormat += o + ' [color=red, style=filled]; ' # Make class red
    else:
        dotFormat += s + ' -> ' + o + ' [label="' + p + '"]; '


dotFormat = dotFormat[:-2] + ' }'

print(dotFormat)

digraph G { hasCapital -> containsCity [label="subPropertyOf"]; Brussels -> Capital [label="type"]; Brussels [color=blue, style=filled]; Capital [color=red, style=filled]; EuropeanCountry -> Country [label="subClassOf"]; EuropeanCountry [color=red, style=filled]; Country [color=red, style=filled]; hasCapital -> Capital [label="range", color=green]; Capital [color=red, style=filled]; Amsterdam -> City [label="impType",color=purple]; France -> France [label="neighbours"]; France -> France [label="has_Name"]; Belgium -> Country [label="type"]; Belgium [color=blue, style=filled]; Country [color=red, style=filled]; Iceland -> Iceland [label="has_Name"]; Iceland -> the_Ocean [label="neighbours"]; Belgium -> Germany [label="neighbours"]; Netherlands -> Belgium [label="neighbours"]; Netherlands -> The Netherlands [label="has_Name"]; France -> Country [label="impType",color=purple]; Brussels -> City [label="impType",color=purple]; Amsterdam -> Capital [label="type"]; Amsterdam [color=blue, styl

# C. Tasks related to local copies of external RDF Datasets using SPARQL

Until now, we have manipulated local knowledge graphs, but as we claimed in the lectures, the advantage of knowledge graphs is that they can easily be linked with other datasets on the Web. 

In the remaining 3 tasks, we will manipulate data from the Web, and ask complex queries over this web data. 

In the first task, we will access web data, make a local copy of it, and then query it. In the other two tasks, we will query live data directly from web Knowledge Graphs (in this case, the SPARQL endpoint of DBpedia). 

### - Task 5: (1 Point) Show and manipulate data about RDF resources on the Web 

With rdflib we can easily load a local graph, but we can just as well retrieve a graph from the Web. Here, we will do so using the *requests* library, which allows us to send a request to any server and/or SPARQL endpoint and to capture the response. The following snippet does so for the DBpedia resource Amsterdam: it uses the 'DESCRIBE' keyword to retrieve all triples about Amsterdam, and then loads them into an rdflib Graph object.

In [139]:
# install the library
#%pip install requests

In [140]:
import requests

endpoint = "https://dbpedia.org/sparql"
query = 'DESCRIBE <http://dbpedia.org/resource/Amsterdam>'

payload = {'query':query, 'format':'text/turtle'}
response = requests.post(endpoint, data = payload)

g = Graph()
g.parse(data=response.text, format='ttl')

<Graph identifier=N90d67b929af44d2e98baa67ccf98cf99 (<class 'rdflib.graph.Graph'>)>

Now do the same for Rotterdam

In [141]:
query = 'DESCRIBE <http://dbpedia.org/resource/Rotterdam>'

payload = {'query':query, 'format':'text/turtle'}
response = requests.post(endpoint, data = payload)

g.parse(data=response.text, format='ttl')  # calling parse again merges the graphs

<Graph identifier=N90d67b929af44d2e98baa67ccf98cf99 (<class 'rdflib.graph.Graph'>)>
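
The Amsterdam and Rotterdam cells differ only in the resource URI. A minimal sketch of a helper that builds the request fields for any resource follows; `describe_payload` is a hypothetical name, not part of requests or rdflib, and the block only builds the dictionary without contacting the endpoint.

```python
def describe_payload(resource_uri, result_format="text/turtle"):
    """Build the form fields for a SPARQL DESCRIBE request on one resource."""
    return {"query": f"DESCRIBE <{resource_uri}>", "format": result_format}

payload = describe_payload("http://dbpedia.org/resource/Rotterdam")
print(payload["query"])
```

The resulting dictionary could then be passed as `data=` to `requests.post(endpoint, ...)`, in the same way as in the cells above.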

Let us start by showing diverse bits of information about Amsterdam and Rotterdam in DBpedia. It is very similar to Task 1, but now with Web Data. 

First, query the graph g (now containing the DBpedia information about Amsterdam and Rotterdam) and check whether you can find someone who was born in Amsterdam (is dbo:birthPlace of) and died in Rotterdam (is dbo:deathPlace of).

In [142]:
qres = g.query(
   """
    PREFIX dbr: <http://dbpedia.org/resource/>
    PREFIX dbo: <http://dbpedia.org/ontology/>
    SELECT ?s
        WHERE {
            ?s dbo:birthPlace dbr:Amsterdam .
            ?s dbo:deathPlace dbr:Rotterdam .
        }
        LIMIT 10
       """)
for row in qres:
    print("%s" % row)

http://dbpedia.org/resource/Haya_van_Someren
http://dbpedia.org/resource/Gabriël_Vigeveno
http://dbpedia.org/resource/Jan_Stolker
http://dbpedia.org/resource/Nans_van_Leeuwen


Write a query to check whether there is an album that was recorded both in Rotterdam and Amsterdam. You need to look at the data to know which property you should check for. 

To get an intuition of what is in the knowledge graph, you might want to look at the human-readable rendering at: http://dbpedia.org/resource/Amsterdam

In [143]:
qres = g.query(
   """
    PREFIX dbr: <http://dbpedia.org/resource/>
    PREFIX dbo: <http://dbpedia.org/ontology/>
    SELECT ?s
        WHERE {
            ?s dbo:recordedIn dbr:Amsterdam .
            ?s dbo:recordedIn dbr:Rotterdam .
        }
        LIMIT 10
       """)
for row in qres:
    print("%s" % row)

### - Task 6: (2 Points) Ask SPARQL against live data using Yasgui

Yasgui (http://yasgui.org/) is a nice graphical interface for asking queries.

Run a new query against http://dbpedia.org/sparql that does the following:

- Find all languages spoken in countries that are not official languages of that country
- The query should return two columns: the country, and the number of languages.
- Order the countries by the number of unofficial languages, from high to low.

In [144]:
# I am not sure how this has to be done, here is my take:
'''
PREFIX dbr: <http://dbpedia.org/resource/>
PREFIX db: <http://dbpedia.org/>
PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

SELECT ?country (COUNT(DISTINCT ?language) AS ?numLanguages)
WHERE
{
  ?language dbo:spokenIn ?country .
  ?country dbo:wikiPageWikiLink dbr:Sovereign_state .

  MINUS { ?country dbo:officialLanguage ?language . }
}
GROUP BY ?country
ORDER BY DESC(?numLanguages)
'''


'\nPREFIX dbr: <http://dbpedia.org/resource/>\nPREFIX db: <http://dbpedia.org/>\nPREFIX dbo: <http://dbpedia.org/ontology/>\nPREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>\nPREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>\n\nSELECT ?country (COUNT(DISTINCT ?language) AS ?numLanguages)\nWHERE\n{\n  ?language dbo:spokenIn ?country .\n  ?country dbo:wikiPageWikiLink dbr:Sovereign_state .\n\n  MINUS { ?country dbo:officialLanguage ?language . }\n}\nGROUP BY ?country\nORDER BY DESC(?numLanguages)\n'

### - Task 7: (1 Point) SPARQL 

Write a SPARQL query which returns all relationship(s) between the series "The Office (UK)" and the actor "Ricky Gervais" (literally).

Be careful to check for relations in both directions (but not necessarily the same relation in both directions).  

Use Yasgui to design the correct SPARQL query, and copy paste it in the cell below. 

In [145]:
# Add here the SPARQL query (not Python) code. (copy & paste from Yasgui)
# When executing against Yasgui you should get an answer. 
# Don't worry that executing this cell will return an error message (a SPARQL query is not a Python code, so it should give an error message here).