# Knowledge and Data 2020: Practical Assignment 2
## Manipulate local and external RDF Knowledge Graphs 

YOUR NAME: Michael Moor

YOUR VUNetID: mmr497

*(If you do not provide your name and VUNetID we will not accept your submission).*

### Learning objectives

At the end of this exercise you should be able to perform some simple manipulations of RDF Data using the rdflib library. You should be able to: 

1. Add and retrieve information from a local RDF database
2. Represent RDF data in other formats, such as the .dot format for graph visualisation
3. Retrieve information from an RDF database created from Web Data
4. Query information from the Web with SPARQL

### Practicalities

Follow this Notebook step-by-step. 

You may do the exercises in any programming editor you like, but you do not 
have to: feel free to write your code directly in the Notebook. When 
everything is filled in and works, save the Notebook and submit it 
as a Jupyter Notebook, i.e. with an .ipynb extension. Please name the 
Notebook your studentID+Assignment2.ipynb.  

Unlike in courses dedicated to programming, we will not evaluate the style
of your programs. However, we will test your programs on data other than the data we provide, 
and your programs should give the correct answers on that test data as well. 

# A. Tasks related to local RDF Knowledge Graphs

The first cells below open the file 'example-from-slides.ttl' using the rdflib library. The first Practical Assignment should have taught you that manipulating symbols as strings is a major pain. 

Programming libraries, such as **rdflib**, help you with this mess once and for all, by parsing the files, creating appropriate data structures (Graph()) and providing useful functions (such as serialize(), save() and much more). 
Check the rdflib website at http://rdflib.readthedocs.io/: this library does most of the hard work for you.

Before starting with the tasks of this assignment, do not forget to install **rdflib** so we can start using it. Installing libraries in Python is very simple. Just open your terminal and write the following command:

    $ pip install rdflib

For more details on how to install pip and Python libraries, you can check the [preliminaries document](https://github.com/raadjoe/knowledge-data-vu/blob/master/Tutorials/Preliminaries/preliminaries.md).  

In [1]:
from rdflib import Graph, RDF, Namespace, Literal, URIRef

g = Graph()

EX = Namespace('http://example.com/kad2020/')
g.bind('ex',EX)

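def serialize():
    # Depending on the rdflib version, g.serialize() returns a byte string (b'...')
    # or a regular string; a byte string is decoded into a python3 string
    data = g.serialize(format='turtle')
    print(data.decode("utf-8") if isinstance(data, bytes) else data)

def save(filename):
    # Let rdflib open and write the file itself, in N-Triples format
    g.serialize(destination=filename, format='nt')

def load(filename):
    # parse() reads the file and adds its triples to the graph
    g.parse(filename, format='turtle')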

The file 'example-from-slides.ttl' formalises the knowledge base from the slides from Module 1, and a bit more. 

Here is how it looks when you load it into your program and serialise it with rdflib in turtle. 

In [2]:
load('example-from-slides.ttl')
serialize()

@prefix ex: <http://example.com/kad2020/> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .

ex:Germany a ex:EuropeanCountry .

ex:Netherlands a ex:Country ;
    ex:has_Capital ex:Amsterdam ;
    ex:has_Name "The Netherlands" ;
    ex:neighbours ex:Belgium .

ex:hasCapital rdfs:range ex:Capital ;
    rdfs:subPropertyOf ex:containsCity .

ex:Amsterdam a ex:Capital .

ex:Belgium a ex:Country .

ex:EuropeanCountry rdfs:subClassOf ex:Country .

ex:containsCity rdfs:domain ex:Country ;
    rdfs:range ex:City .

ex:Capital rdfs:subClassOf ex:City .




Now we can manipulate the graph very easily, e.g. as in the following simple loop, which prints the predicate(s) that relate a subject to a literal object: 

In [3]:
for s,p,o in g:
    if type(o) == Literal:
        print(p)

http://example.com/kad2020/has_Name


### - Task 1: (1 Point) Add information to an RDF graph

Add 10 triples to the knowledge graph. Make sure that they have the right namespaces. 

Similarly to the triples already present in the file 'example-from-slides.ttl', add at least:
- a new country with its name and capital 
- one triple with a new predicate

Check: http://rdflib.readthedocs.io/en/stable/intro_to_creating_rdf.html

In [4]:
ex = Namespace("http://example.com/kad2020/")
owl = Namespace('http://www.w3.org/2002/07/owl#')

# add 10 triples here to the graph 'g' (do not forget the namespaces).
g.add((ex.France, ex.has_Name, Literal("France")))
g.add((ex.France, ex.has_Capital, ex.Paris))
g.add((ex.France, ex.has_Language, ex.French))
g.add((ex.France, RDF.type, ex.EuropeanCountry))
g.add((ex.France, ex.neighbours, ex.Spain))
g.add((ex.France, ex.neighbours, ex.Germany))
g.add((ex.France, ex.neighbours, ex.Belgium))
g.add((ex.Netherlands, ex.has_Language, ex.Dutch))
g.add((ex.Germany, ex.has_Language, ex.German))
g.add((ex.Germany, ex.has_Name, Literal("Germany")))



print(g.serialize(format='turtle').decode("utf-8"))

@prefix ex: <http://example.com/kad2020/> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .

ex:France a ex:EuropeanCountry ;
    ex:has_Capital ex:Paris ;
    ex:has_Language ex:French ;
    ex:has_Name "France" ;
    ex:neighbours ex:Belgium,
        ex:Germany,
        ex:Spain .

ex:Netherlands a ex:Country ;
    ex:has_Capital ex:Amsterdam ;
    ex:has_Language ex:Dutch ;
    ex:has_Name "The Netherlands" ;
    ex:neighbours ex:Belgium .

ex:hasCapital rdfs:range ex:Capital ;
    rdfs:subPropertyOf ex:containsCity .

ex:Amsterdam a ex:Capital .

ex:Germany a ex:EuropeanCountry ;
    ex:has_Language ex:German ;
    ex:has_Name "Germany" .

ex:containsCity rdfs:domain ex:Country ;
    rdfs:range ex:City .

ex:Belgium a ex:Country .

ex:Capital rdfs:subClassOf ex:City .

ex:EuropeanCountry rdfs:subClassOf ex:Country .




*After you have run the previous code (adding triples), the next cells will be executed on your extended graph. That is ok.*

### - Task 2a: (1 Point) Get structured information from an RDF graph (all Literals)

Use the functions available in the RDFLib library. Write a small function to print all Literals. 

Hint: there is a function in rdflib to test the type of an object (check previous examples in this notebook)

In [5]:
def print_literals(g):
    for s,p,o in g:
        if type(o) == Literal:
            print(o) 

print_literals(g)

France
The Netherlands
Germany


### - Task 2b: (1 Point) Get structured information from an RDF graph (all unique Predicates)

Please provide another function that gives a **unique** list of the predicates, ordered by occurrence (most occurring first). The answer will look like this: 
<br>http://www.w3.org/2000/01/rdf-schema#label
<br>http://www.w3.org/1999/02/22-rdf-syntax-ns#type
<br>http://example.com/sw2016/locatedIn
<br>http://www.w3.org/2000/01/rdf-schema#range

In [6]:
from collections import Counter

def print_predicates(g):
    # Count how often each predicate occurs in the graph
    counts = Counter(p for s, p, o in g)
    # most_common() lists each distinct predicate once, most frequent first
    for p, count in counts.most_common():
        print(p)

print_predicates(g)


SyntaxError: invalid syntax (<ipython-input-6-269a3b9d545c>, line 1)

# B. Tasks related to Graph visualisations 

### - Task 3a: (1 Point) From RDF to .dot 


In the lecture, we have seen two ways of writing a knowledge graph (simple n-triples, and simple turtle). Let us consider a 3rd syntax, this time a syntax that is useful for visualisation. One standard for visualising graphs is the .dot format.

Print the knowledge graph in .dot file format. Check https://graphviz.gitlab.io/documentation/ for the documentation. You will only need very little of this information, and the most relevant information can be found in the examples that are given. 

<br>Basically, an RDF graph in .dot format starts with 
<br>digraph G { 
<br>followed by a list of links of the form 
<br>s -> o [label="p"]
<br>for every (s p o) in the KG, separated by ;
<br>Do not forget to end with a closing bracket }

An example is 
     
     digraph G { s1 -> o1 [label="p1"] ; s2 -> o2 [label="p2"] } 
     
for an RDF graph {(s1 p1 o1),(s2 p2 o2)}
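
The format described above can be sketched in a few lines of plain Python. This is a minimal sketch, assuming the triples are already available as tuples of short strings; `triples_to_dot` is a hypothetical helper name, not an rdflib function.

```python
def triples_to_dot(triples):
    """Return a .dot digraph with one labelled edge per (s, p, o) triple."""
    # One 's -> o [label="p"]' statement per triple
    edges = ['%s -> %s [label="%s"]' % (s, o, p) for s, p, o in triples]
    # Statements are separated by ' ; ' and wrapped in 'digraph G { ... }'
    return 'digraph G { ' + ' ; '.join(edges) + ' }'


example = [('s1', 'p1', 'o1'), ('s2', 'p2', 'o2')]
print(triples_to_dot(example))
```

Using `join` places the separator only between statements, so no special case is needed for the last edge.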

*You can check what your graph looks like. Just copy & paste your output into: http://www.webgraphviz.com/*

In [7]:
def local_name(term):
    # Keep only the part after the last '#' or '/' of a URI
    return str(term).split('#')[-1].split('/')[-1]

def node_name(term):
    # Literals are shown in full, between double quotes
    if isinstance(term, Literal):
        return '"' + str(term) + '"'
    return local_name(term)

dot_list = []
for s, p, o in g:
    dot_list.append('\t' + node_name(s) + ' -> ' + node_name(o) + ' [label="' + local_name(p) + '"]')

# join() puts ' ;' between the statements, so the last one gets no separator
print("digraph G { \n" + ' ;\n'.join(dot_list) + "\n}")

digraph G { 
	Netherlands -> Country [label="type"] ;
	France -> Germany [label="neighbours"] ;
	France -> Belgium [label="neighbours"] ;
	France -> "France" [label="has_Name"] ;
	Netherlands -> "The Netherlands" [label="has_Name"] ;
	France -> Spain [label="neighbours"] ;
	containsCity -> Country [label="domain"] ;
	hasCapital -> containsCity [label="subPropertyOf"] ;
	France -> French [label="has_Language"] ;
	Belgium -> Country [label="type"] ;
	containsCity -> City [label="range"] ;
	Netherlands -> Amsterdam [label="has_Capital"] ;
	EuropeanCountry -> Country [label="subClassOf"] ;
	Capital -> City [label="subClassOf"] ;
	Germany -> EuropeanCountry [label="type"] ;
	Netherlands -> Dutch [label="has_Language"] ;
	hasCapital -> Capital [label="range"] ;
	Amsterdam -> Capital [label="type"] ;
	Netherlands -> Belgium [label="neighbours"] ;
	Germany -> German [label="has_Language"] ;
	France -> Paris [label="has_Capital"] ;
	Germany -> "Germany" [label="has_Name"] ;
	France -> EuropeanC

### - Task 3b: (1 Point) From RDF to .dot with "semantic information"

There is a conceptual distinction between properties, instances and classes (sets of instances). A simple way of checking is the following:

1. in a triple (s a o), with predicate a (which is a special abbreviation for the predicate rdf:type), the s is an Instance, and o is a Class. 
2. in a triple (s rdfs:subClassOf o) both s and o are Classes. 
3. in a triple (p rdfs:domain o) p is a Property and o is a Class. 
4. in a triple (p rdfs:range o)  p is a Property and o is a Class. 
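
The four rules above can be sketched as a small classification function. This is a minimal sketch in plain Python, assuming triples are tuples of prefixed names written as strings; the helper name `classify` is hypothetical, and the sample triples are taken from the example graph.

```python
def classify(triples):
    """Sort the nodes of a graph into classes, instances and properties."""
    classes, instances, properties = set(), set(), set()
    for s, p, o in triples:
        if p == 'rdf:type':                         # rule 1
            instances.add(s)
            classes.add(o)
        elif p == 'rdfs:subClassOf':                # rule 2
            classes.update((s, o))
        elif p in ('rdfs:domain', 'rdfs:range'):    # rules 3 and 4
            properties.add(s)
            classes.add(o)
    return classes, instances, properties


triples = [
    ('ex:Amsterdam', 'rdf:type', 'ex:Capital'),
    ('ex:Capital', 'rdfs:subClassOf', 'ex:City'),
    ('ex:containsCity', 'rdfs:domain', 'ex:Country'),
    ('ex:containsCity', 'rdfs:range', 'ex:City'),
]
classes, instances, properties = classify(triples)
print(sorted(classes))
print(sorted(instances))
print(sorted(properties))
```

Once the three sets are known, each node can be given a color according to the set it belongs to.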

Make a .dot representation for an RDF graph that distinguishes between types of links (RDF vocabulary vs others) and types of nodes (Classes versus Instances) via different colors. 

*You can check what your graph looks like. Just copy & paste your output into: http://www.webgraphviz.com/*

In [8]:
from rdflib import RDF, RDFS, Literal

# For each RDF(S) vocabulary predicate: (edge color, subject color, object color).
# Instances are yellow, classes orange and properties cyan.
COLORS = {
    RDF.type:        ('red',    'yellow', 'orange'),
    RDFS.domain:     ('blue',   'cyan',   'orange'),
    RDFS.range:      ('green',  'cyan',   'orange'),
    RDFS.subClassOf: ('purple', 'orange', 'orange'),
}

def local_name(term):
    # Keep only the part after the last '#' or '/' of a URI
    return str(term).split('#')[-1].split('/')[-1]

dot_list = []
for s, p, o in g:
    s_name = local_name(s)
    # Literals are shown in full, between double quotes
    o_name = '"' + str(o) + '"' if isinstance(o, Literal) else local_name(o)
    edge = s_name + ' -> ' + o_name + ' [label="' + local_name(p) + '"'
    if p in COLORS:
        pcolor, scolor, ocolor = COLORS[p]
        # Declare the colors of both nodes, then the colored edge
        dot_list.append('\t' + s_name + ' [color=' + scolor + '] ' + o_name + ' [color=' + ocolor + '] '
                        + edge + ' color=' + pcolor + ']')
    else:
        # Links outside the RDF(S) vocabulary keep the default color
        dot_list.append('\t' + edge + ']')

# join() puts ' ;' between the statements, so the last one gets no separator
print("digraph G { \n" + ' ;\n'.join(dot_list) + "\n}")

digraph G { 
	Netherlands [color=yellow] Country [color=orange] Netherlands -> Country [label="type" color=red] ;
	France -> Germany [label="neighbours"] ;
	France -> Belgium [label="neighbours"] ;
	France -> "France" [label="has_Name"] ;
	Netherlands -> "The Netherlands" [label="has_Name"] ;
	France -> Spain [label="neighbours"] ;
	containsCity [color=cyan] Country [color=orange] containsCity -> Country [label="domain" color=blue] ;
	hasCapital -> containsCity [label="subPropertyOf"] ;
	France -> French [label="has_Language"] ;
	Belgium [color=yellow] Country [color=orange] Belgium -> Country [label="type" color=red] ;
	containsCity [color=cyan] City [color=orange] containsCity -> City [label="range" color=green] ;
	Netherlands -> Amsterdam [label="has_Capital"] ;
	EuropeanCountry [color=orange] Country [color=orange] EuropeanCountry -> Country [label="subClassOf" color=purple] ;
	Capital [color=orange] City [color=orange] Capital -> City [label="subClassOf" color=purple] ;
	Germany [

### - Task 4a: (1 Point) Deriving implicit knowledge (a bit of schema)

We will look into Schema information in later modules, but let us already try to find some implicit information with a first bit of inferencing: whenever there are two statements (s a o) and (o rdfs:subClassOf o2) we can derive (and later prove) that (s a o2). 

Write a procedure that adds all implied triples to our knowledge graph. 
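
The rule can be sketched on plain tuples before applying it to an rdflib graph. This is a minimal sketch, assuming triples are tuples of prefixed names; the helper name `add_implied_types` and the class `ex:Place` are hypothetical additions for illustration. The rule is repeated until nothing new is derived, so chains of subclasses are followed as well.

```python
def add_implied_types(triples):
    """Return the triples plus every (s, 'rdf:type', o2) implied by subclassing."""
    result = set(triples)
    changed = True
    while changed:  # repeat until no new triple is derived
        changed = False
        # Snapshot both patterns first, so the set is never modified while iterated
        subclass_pairs = [(s, o) for s, p, o in result if p == 'rdfs:subClassOf']
        typed = [(s, o) for s, p, o in result if p == 'rdf:type']
        for inst, cls in typed:
            for sub, sup in subclass_pairs:
                new = (inst, 'rdf:type', sup)
                if cls == sub and new not in result:
                    result.add(new)
                    changed = True
    return result


facts = {
    ('ex:Amsterdam', 'rdf:type', 'ex:Capital'),
    ('ex:Capital', 'rdfs:subClassOf', 'ex:City'),
    ('ex:City', 'rdfs:subClassOf', 'ex:Place'),  # hypothetical extra class
}
closed = add_implied_types(facts)
for triple in sorted(closed - facts):
    print(triple)
```

Collecting the matching triples in lists before adding anything avoids changing a collection while looping over it, which matters just as much when working on a graph object.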

In [9]:
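from rdflib import RDF, RDFS

def add_implied_types(g):
    # Repeat until no new triple is found, so that chains of subclasses are followed too
    changed = True
    while changed:
        changed = False
        # Copy to a list first: the graph must not be modified while iterating over it
        for sub, sup in list(g.subject_objects(RDFS.subClassOf)):
            for inst in list(g.subjects(RDF.type, sub)):
                if (inst, RDF.type, sup) not in g:
                    g.add((inst, RDF.type, sup))
                    changed = True

add_implied_types(g)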

print(g.serialize(format='turtle').decode("utf-8"))

@prefix ex: <http://example.com/kad2020/> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .

ex:France a ex:Country,
        ex:EuropeanCountry ;
    ex:has_Capital ex:Paris ;
    ex:has_Language ex:French ;
    ex:has_Name "France" ;
    ex:neighbours ex:Belgium,
        ex:Germany,
        ex:Spain .

ex:Netherlands a ex:Country ;
    ex:has_Capital ex:Amsterdam ;
    ex:has_Language ex:Dutch ;
    ex:has_Name "The Netherlands" ;
    ex:neighbours ex:Belgium .

ex:hasCapital rdfs:range ex:Capital ;
    rdfs:subPropertyOf ex:containsCity .

ex:Amsterdam a ex:Capital,
        ex:City .

ex:Germany a ex:Country,
        ex:EuropeanCountry ;
    ex:has_Language ex:German ;
    ex:has_Name "Germany" .

ex:containsCity rdfs:domain ex:Country ;
    rdfs:range ex:City .

ex:Belgium a ex:Country .

ex:Capital rdfs:subClassOf ex:City .

ex:EuropeanCountry rdfs:subClassOf ex:Country .




### - Task 4b: (Optional - 1 Extra Point) Visualising implicit knowledge

Produce a .dot version of the graph that includes those implied triples, and mark the edges of those triples with a different color or arrow style. 

In [10]:
# Your code here
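
One possible approach, sketched under the assumption that the explicit and the implied triples are kept in two separate lists of short names: draw the explicit triples as plain edges and the implied ones as dashed red edges. The helper name `dot_with_implied` and the choice of color and style are hypothetical.

```python
def dot_with_implied(explicit, implied):
    """Draw explicit triples as plain edges and implied ones as dashed red edges."""
    lines = ['\t%s -> %s [label="%s"]' % (s, o, p) for s, p, o in explicit]
    # Implied edges get extra attributes so that they stand out in the picture
    lines += ['\t%s -> %s [label="%s" color=red style=dashed]' % (s, o, p)
              for s, p, o in implied]
    return 'digraph G {\n' + ' ;\n'.join(lines) + '\n}'


explicit = [('Amsterdam', 'type', 'Capital'), ('Capital', 'subClassOf', 'City')]
implied = [('Amsterdam', 'type', 'City')]
dot_text = dot_with_implied(explicit, implied)
print(dot_text)
```

To obtain the implied list in practice, the triples present before the inference step can be compared with those present afterwards.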

# C. Tasks related to local copies of external RDF Datasets using SPARQL

Until now, we have manipulated local knowledge graphs, but as we claimed in the lectures, the advantage of knowledge graphs is that they can easily be linked with other datasets on the Web. 

In the remaining 3 tasks, we will manipulate data from the Web, and ask complex queries over this web data. 

In the first task, we will access web data, make a local copy of it, and then query it. In the other two tasks, we will query live data directly from web Knowledge Graphs (in this case, the SPARQL endpoint of DBPedia). 

### - Task 5: (1 Point) Show and manipulate data about RDF resources on the Web 

With rdflib we can easily load a graph, even if it comes from a source on the Web. The following snippet loads into a graph the information about the resources Amsterdam and Rotterdam from DBPedia.

In [11]:
import rdflib
from rdflib import Literal, URIRef

g = rdflib.Graph()
# parse() fetches each resource from the Web and adds its triples to the graph
g.parse('http://dbpedia.org/resource/Amsterdam')
g.parse('http://dbpedia.org/resource/Rotterdam')

Let us start by showing various bits of information about Amsterdam and Rotterdam in DBPedia. It is very similar to Task 1, but now with Web Data. 

First, query the graph g (now containing the DBPedia information about Amsterdam and Rotterdam) and check whether you can find someone who was born in Amsterdam (is dbo:birthPlace of) and died in Rotterdam (is dbo:deathPlace of).

In [12]:
qres = g.query(
   """
    PREFIX dbr: <http://dbpedia.org/resource/>
    PREFIX dbo: <http://dbpedia.org/ontology/>
    SELECT ?s
        WHERE {
            ?s dbo:birthPlace dbr:Amsterdam .
            ?s dbo:deathPlace dbr:Rotterdam .
        }
        LIMIT 10

       """)
for row in qres:
    print("%s" % row)

http://dbpedia.org/resource/Jan_van_Beveren
http://dbpedia.org/resource/Anthony_Sweijs
http://dbpedia.org/resource/Jan_Stolker
http://dbpedia.org/resource/Haya_van_Someren


Write a query to check whether there is an album that was recorded in both Rotterdam and Amsterdam. You need to look at the data to know which property you should check for. 

To get an intuition of what is in the knowledge graph, you might want to look at the human-readable rendering at http://dbpedia.org/resource/Amsterdam

In [13]:
qres = g.query(
   """
    PREFIX dbr: <http://dbpedia.org/resource/>
    PREFIX dbo: <http://dbpedia.org/ontology/>
    SELECT ?s
        WHERE {
            ?s dbo:recordedIn dbr:Amsterdam .
            ?s dbo:recordedIn dbr:Rotterdam .
        }
        LIMIT 10
       """)
for row in qres:
    print("%s" % row)

http://dbpedia.org/resource/With_or_Without_You_(album)


### - Task 6: (2 Points) Ask SPARQL against live data using Yasgui

Yasgui (http://yasgui.org/) is a nice graphical interface for asking queries.

Run a new query against http://dbpedia.org/sparql that does the following:

- Find all languages spoken in countries that are not official languages of that country
- The query should return two columns: the country, and the number of languages.
- Order the countries by the number of unofficial languages, from high to low.
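
The logic behind these three requirements can be sketched in plain Python before writing the SPARQL query: filter out the official languages (FILTER NOT EXISTS in SPARQL), count per country (COUNT with GROUP BY), and sort from high to low (ORDER BY DESC). This is a minimal sketch; the helper name `count_unofficial_languages` and the placeholder country and language names are hypothetical and are not DBpedia data.

```python
from collections import Counter


def count_unofficial_languages(spoken, official):
    """Both arguments are collections of (country, language) pairs."""
    official = set(official)
    # Keep only the pairs that are not official, then count them per country
    counts = Counter(country for country, language in set(spoken)
                     if (country, language) not in official)
    # most_common() orders the countries by count, from high to low
    return counts.most_common()


# Hypothetical placeholder data, for illustration only
spoken = [
    ('CountryA', 'lang1'), ('CountryA', 'lang2'), ('CountryA', 'lang3'),
    ('CountryB', 'lang1'), ('CountryB', 'lang4'),
]
official = [('CountryA', 'lang1'), ('CountryB', 'lang1')]
ranking = count_unofficial_languages(spoken, official)
print(ranking)
```

As in the query pattern, a country without any unofficial language does not appear in the result at all.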

In [14]:
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX dbr: <http://dbpedia.org/resource/>
PREFIX dbo: <http://dbpedia.org/ontology/>

SELECT ?country (COUNT(?language) AS ?frequency) WHERE {
  ?country rdf:type dbo:Country ;
           dbo:language ?language .
  FILTER NOT EXISTS {?country dbo:officialLanguage ?language}
}
GROUP BY ?country
ORDER BY DESC(?frequency)

SyntaxError: invalid syntax (<ipython-input-14-ad29a4084161>, line 1)

### - Task 7: (1 Point) SPARQL 

Write a SPARQL query which returns all relationship(s) between the series "The Office (UK)" and the actor "Ricky Gervais" (literally).

Be careful to check for relations in both directions (but not necessarily the same relation in both directions).  
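
Checking both directions amounts to taking the union of two patterns, one with each resource in the subject position. A minimal sketch in plain Python, assuming triples are tuples of strings; the helper name `relations_between` and all names in the sample data are hypothetical placeholders, not DBpedia properties.

```python
def relations_between(triples, a, b):
    """Return the distinct predicates linking a and b in either direction."""
    forward = {p for s, p, o in triples if s == a and o == b}
    backward = {p for s, p, o in triples if s == b and o == a}
    # The union removes duplicates, like SELECT DISTINCT over a UNION
    return forward | backward


# Hypothetical placeholder data, for illustration only
triples = [
    ('ex:Series', 'ex:starring', 'ex:Actor'),
    ('ex:Actor', 'ex:knownFor', 'ex:Series'),
    ('ex:Series', 'ex:genre', 'ex:Comedy'),
]
relations = relations_between(triples, 'ex:Series', 'ex:Actor')
print(sorted(relations))
```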

Use Yasgui to design the correct SPARQL query, and copy and paste it into the cell below. 

In [15]:
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX dbr: <http://dbpedia.org/resource/>
PREFIX dbo: <http://dbpedia.org/ontology/>

SELECT DISTINCT ?relation WHERE {
  {dbr:The_Office_\(UK_TV_series\) ?relation dbr:Ricky_Gervais .}
  UNION
  {dbr:Ricky_Gervais ?relation dbr:The_Office_\(UK_TV_series\) .}
}

SyntaxError: invalid syntax (<ipython-input-15-854916869f59>, line 1)