# Knowledge and Data: Practical Assignment 3 
## RDF Data, RDFS knowledge and inferencing 

YOUR NAME: Kevin Schönhage

YOUR VUNetID: OMN496

*(If you do not provide your name and VUNetID we will not accept your submission).* 

### Learning objectives

At the end of this exercise you should be able to:

1. Access local and external data via SPARQL, both from within a Python programming environment and stand-alone with a GUI such as [YASGUI](https://yasgui.triply.cc/), and in this way integrate data from different sources  
2. Model your own first knowledge base, in this case an RDF Schema knowledge graph
3. Implement inference rules 

Follow this Notebook step-by-step. 

Of course, you can do the exercises in any programming editor of your liking, 
but you do not have to: feel free to simply write code in the Notebook. When 
everything is filled in and works, save the Notebook and submit it 
as a Jupyter Notebook, i.e. with an ipynb extension. Please name the 
Notebook studentID+Assignment3.ipynb, using your own student ID.  


We will not evaluate the programming style of your solutions. We do, however, check whether your solutions show understanding and whether they yield the correct output.

Note that all notebooks will automatically be checked for plagiarism: while similar answers can be expected, it is not allowed to directly copy the solutions from fellow students or TAs, or from the examples discussed during the lectures. Similarly, sharing your solutions with your peers is not allowed.

**IMPORTANT: Submit this notebook after finishing the assignment. It is not necessary to submit the created turtle files**

Before you start, you need to:

- **Install the *rdflib* Python package:** *pip install rdflib* (should already be installed from the previous assignment)
- **Install the *SPARQLWrapper* Python package:** *pip install SPARQLWrapper*
- **Install the free edition of the GraphDB Triplestore:** please follow this short [GraphDB tutorial](https://github.com/ucds-vu/knowledge-data-vu/blob/master/Tutorials/Preliminaries/tutorial-GraphDB.md). 

Then, add the file example-from-slides.ttl to a newly created database, say called assignment-3. 

**Note that you should have an active internet connection to run the code in this notebook. If, for some external reason (i.e. internet and/or system issues), you cannot access the SPARQL endpoint, then report this to a TA as soon as possible!**

In [18]:
# install library
%pip install SPARQLWrapper


[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m23.0.1[0m[39;49m -> [0m[32;49m25.0.1[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpip install --upgrade pip[0m
Note: you may need to restart the kernel to use updated packages.


## Task 1: (35 points) Integrate Local and External Data

You can integrate SPARQL queries into your Python code by using the *RDFLib* and *SPARQLWrapper* libraries. 

The following code accesses the DBpedia knowledge graph through its SPARQL endpoint and returns the result of a SPARQL query requesting all the labels asserted for Amsterdam (test it!)  

In [19]:
# This code only works if you are online.
# If, for some reason, you cannot get this to work, then please contact a TA

from rdflib import Graph, RDF, RDFS, Namespace, Literal, URIRef
from SPARQLWrapper import SPARQLWrapper, JSON

sparql = SPARQLWrapper("http://dbpedia.org/sparql")
sparql.setQuery("""
    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
    SELECT ?cityName
    WHERE { 
        <http://dbpedia.org/resource/Amsterdam> rdfs:label ?cityName 
    }
""")
sparql.setReturnFormat(JSON)
results = sparql.query().convert()
for result in results["results"]["bindings"]:
    print(result["cityName"]["value"])  

Amsterdam
أمستردام
Amsterdam
Amsterdam
Amsterdam
Άμστερνταμ
Amsterdamo
Ámsterdam
Amsterdam
Amstardam
Amsterdam
Amsterdam
Amsterdam
암스테르담
アムステルダム
Amsterdam
Amsterdam
Амстердам
Amesterdão
Amsterdam
Амстердам
阿姆斯特丹
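
The JSON returned by the endpoint follows the SPARQL JSON results layout used in the loop above: each entry of `results.bindings` maps a variable name to a dict holding a `value` and, for language-tagged literals, an `xml:lang` key. As a minimal sketch (the helper name `labels_by_language` and the hand-written `sample` dict are assumptions made for illustration, and no network access is needed), the labels could be grouped by language tag like this:

```python
def labels_by_language(results, var):
    """Group the values bound to `var` by language tag ('' if untagged)."""
    grouped = {}
    for binding in results["results"]["bindings"]:
        if var not in binding:  # the variable is unbound in this row
            continue
        cell = binding[var]
        lang = cell.get("xml:lang", "")
        grouped.setdefault(lang, []).append(cell["value"])
    return grouped


# Hand-written sample in the same shape as an endpoint response (assumption)
sample = {
    "head": {"vars": ["cityName"]},
    "results": {"bindings": [
        {"cityName": {"type": "literal", "xml:lang": "en", "value": "Amsterdam"}},
        {"cityName": {"type": "literal", "xml:lang": "ru", "value": "Амстердам"}},
    ]},
}

print(labels_by_language(sample, "cityName"))
```

Grouping by language makes it easier to see why the plain loop prints "Amsterdam" many times: the same spelling is asserted under several language tags.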


For your convenience, we already wrote the following functions that might be useful to complete this task. 
In addition, we have loaded and printed the 'example-from-slides.ttl' dataset.

In [20]:
from rdflib import Graph, RDF, Namespace, Literal, URIRef
from SPARQLWrapper import SPARQLWrapper, JSON


# Loads the data from a certain file given as input in Turtle syntax into the Graph g  
# -------------------------
def load_graph(graph, filename):
    with open(filename, 'r') as f:
        graph.parse(f, format='turtle')
        

# Prints a certain graph given as input in Turtle syntax
# if your output shows a byte string (i.e., b'...') you must add '.decode()' to the print statements:
#    print(myGraph.serialize(format='turtle').decode())
# -------------------------
def serialize_graph(myGraph):
    print(myGraph.serialize(format='turtle'))
        

# Saves the Graph g in Turtle syntax to a certain file given as input
# -------------------------
def save_graph(myGraph, filename):
    # serialize() opens and writes the destination file itself
    myGraph.serialize(destination=filename, format='turtle')
        
    
# Changes the namespace of a certain URI given as input to a DBpedia URI 
# Example: transformToDBR("http://example.com/kad2020/Amsterdam") returns "http://dbpedia.org/resource/Amsterdam"
# -------------------------
def transformToDBR(uri):
    if isinstance(uri, Literal):
        # changes the literal to uppercase so that the object with the same name refers to an object and not the string
        return uri.upper()
    components = g.namespace_manager.compute_qname(uri)
    return "http://dbpedia.org/resource/%s"%(components[2])

# -------------------------

g = Graph()
load_graph(g, 'example-from-slides.ttl')
serialize_graph(g)


# Don't forget to run this cell before continuing the task.


@prefix ex: <http://example.com/kad/> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .

ex:Netherlands a ex:Country ;
    ex:contains ex:Ijsselmeer ;
    ex:containsCity ex:Rotterdam ;
    ex:hasCapital ex:Amsterdam ;
    ex:hasName "The Netherlands" ;
    ex:neighbours ex:Belgium .

ex:hasCapital rdfs:range ex:Capital ;
    rdfs:subPropertyOf ex:containsCity .

ex:neighbours rdfs:subPropertyOf ex:closeBy .

ex:Amsterdam a ex:Capital ;
    ex:closeBy ex:Germany .

ex:Belgium a ex:Country .

ex:EuropeanCountry rdfs:subClassOf ex:Country .

ex:Germany a ex:EuropeanCountry ;
    ex:hasCapital ex:Berlin .

ex:closeBy rdfs:domain ex:Location ;
    rdfs:range ex:Location .

ex:containsCity rdfs:domain ex:Country ;
    rdfs:range ex:City ;
    rdfs:subPropertyOf ex:contains .

ex:Capital rdfs:subClassOf ex:City .

ex:City rdfs:subClassOf ex:Location .

ex:Country rdfs:subClassOf ex:Location .




### A: Write a SPARQL query that finds all the cities in the dataset

As you cannot directly use class City, you will have to find those cities in the dataset (example-from-slides.ttl) using implicit information that can be deduced from the domains and ranges of the relations (e.g. things in a hasCapital relation are capitals and a capital is a city, etc.).

Save all the cities returned from the SPARQL query into the empty set "cities". 

In [21]:
cities = set()

qres = g.query("""
PREFIX ex:   <http://example.com/kad/>
PREFIX rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

SELECT DISTINCT ?city WHERE {
  { ?c ex:hasCapital ?city }                                  # capitals
  UNION
  { ?city rdf:type ex:Capital }                               # explicit Capital
  UNION
  { ?city rdf:type ?t . ?t rdfs:subClassOf ex:City }          # type via subclass
}
""")

for row in qres:
    cities.add(row.city)

for city in cities:
    print(city)


http://example.com/kad/Berlin
http://example.com/kad/Amsterdam
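
One way to read "implicit information" more generally is to follow every property whose range is City or a subclass of City, together with the subproperties of such properties. The sketch below illustrates that idea on plain Python tuples rather than an rdflib graph; the function names `descendants` and `find_cities` and the shortened predicate strings are assumptions made for illustration, and the triples are a hand-copied subset of example-from-slides.ttl.

```python
EX = "http://example.com/kad/"
TYPE, RANGE = "rdf:type", "rdfs:range"
SUBCLASS, SUBPROP = "rdfs:subClassOf", "rdfs:subPropertyOf"

# Hand-copied subset of the example dataset (assumption: enough for the idea)
triples = {
    (EX + "Netherlands", EX + "contains", EX + "Ijsselmeer"),
    (EX + "Netherlands", EX + "containsCity", EX + "Rotterdam"),
    (EX + "Netherlands", EX + "hasCapital", EX + "Amsterdam"),
    (EX + "Germany", EX + "hasCapital", EX + "Berlin"),
    (EX + "Amsterdam", TYPE, EX + "Capital"),
    (EX + "hasCapital", RANGE, EX + "Capital"),
    (EX + "hasCapital", SUBPROP, EX + "containsCity"),
    (EX + "containsCity", RANGE, EX + "City"),
    (EX + "containsCity", SUBPROP, EX + "contains"),
    (EX + "Capital", SUBCLASS, EX + "City"),
}


def descendants(triples, root, predicate):
    """Return root plus every x that reaches root via `predicate` edges."""
    found, frontier = {root}, [root]
    while frontier:
        current = frontier.pop()
        for s, p, o in triples:
            if p == predicate and o == current and s not in found:
                found.add(s)
                frontier.append(s)
    return found


def find_cities(triples, city_class):
    """Collect resources that are cities by type or by a property's range."""
    city_classes = descendants(triples, city_class, SUBCLASS)
    city_props = set()
    for s, p, o in triples:
        if p == RANGE and o in city_classes:
            # subproperties inherit the range of their superproperty
            city_props |= descendants(triples, s, SUBPROP)
    cities = set()
    for s, p, o in triples:
        if p == TYPE and o in city_classes:
            cities.add(s)
        elif p in city_props:
            cities.add(o)
    return cities


for city in sorted(find_cities(triples, EX + "City")):
    print(city)
```

On this subset the sketch also returns Rotterdam, because ex:containsCity has range ex:City, while Ijsselmeer is left out because ex:contains has no declared range.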


### B: For each city, find from DBpedia its longitude & latitude, and its number of inhabitants (if available)

Don't forget to adapt the namespace of the cities in your dataset when querying DBpedia, using the above function *transformToDBR(uri)*. Also note that the DBpedia namespace URIs use the *http* protocol, not *https*.

The initially empty graph h should contain only the triples extracted from DBpedia, with the URIs from the 'ex' namespace as subjects. 
An example of a triple in h is the following triple: 
       
       ex:Amsterdam dbo:populationTotal "872680"^^xsd:nonNegativeInteger .
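
The namespace swap works in both directions: from 'ex' to DBpedia for querying, and back to 'ex' for storing the result in h. A minimal string-based sketch of that round trip follows; the helper `swap_namespace` is hypothetical and is not the provided *transformToDBR(uri)* function.

```python
EX_NS = "http://example.com/kad/"
DBR_NS = "http://dbpedia.org/resource/"


def swap_namespace(uri, old_ns, new_ns):
    """Replace the namespace part of `uri`, leaving the local name untouched."""
    uri = str(uri)
    if not uri.startswith(old_ns):
        raise ValueError(f"{uri} is not in namespace {old_ns}")
    return new_ns + uri[len(old_ns):]


dbr_uri = swap_namespace(EX_NS + "Amsterdam", EX_NS, DBR_NS)  # used in the DBpedia query
ex_uri = swap_namespace(dbr_uri, DBR_NS, EX_NS)               # used as subject in h
print(dbr_uri)
print(ex_uri)
```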

In [22]:
h   = Graph()
dbo = Namespace("http://dbpedia.org/ontology/")
geo = Namespace("http://www.w3.org/2003/01/geo/wgs84_pos#")
ex  = Namespace("http://example.com/kad/")

def localname(u):
    s = str(u)
    return s.rsplit('#', 1)[-1].rsplit('/', 1)[-1]

sparql = SPARQLWrapper("http://dbpedia.org/sparql")

for city in cities:
    name    = localname(city)
    dbp_uri = f"http://dbpedia.org/resource/{name}"

    sparql.setQuery(f"""
      PREFIX dbo: <http://dbpedia.org/ontology/>
      PREFIX geo: <http://www.w3.org/2003/01/geo/wgs84_pos#>
      SELECT ?pop ?lat ?long WHERE {{
        OPTIONAL {{ <{dbp_uri}> dbo:populationTotal ?pop . }}
        OPTIONAL {{ <{dbp_uri}> geo:lat  ?lat .  }}
        OPTIONAL {{ <{dbp_uri}> geo:long ?long . }}
      }} LIMIT 1
    """)
    sparql.setReturnFormat(JSON)
    res = sparql.query().convert()

    for b in res["results"]["bindings"]:
        # copy each bound value onto the ex: URI, keeping DBpedia's datatype
        for var, prop in (("pop", dbo.populationTotal), ("lat", geo.lat), ("long", geo.long)):
            if var in b:
                dt = URIRef(b[var]["datatype"]) if "datatype" in b[var] else None
                h.add((ex[name], prop, Literal(b[var]["value"], datatype=dt)))

serialize_graph(h)


@prefix ns1: <http://dbpedia.org/ontology/> .
@prefix ns2: <http://www.w3.org/2003/01/geo/wgs84_pos#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

<http://example.com/kad/Amsterdam> ns1:populationTotal "907976"^^xsd:nonNegativeInteger ;
    ns2:lat "52.36666488647461"^^xsd:float ;
    ns2:long "4.900000095367432"^^xsd:float .

<http://example.com/kad/Berlin> ns1:populationTotal "3677472"^^xsd:nonNegativeInteger ;
    ns2:lat "52.52000045776367"^^xsd:float ;
    ns2:long "13.40499973297119"^^xsd:float .




### C: Save your results

- Merge the triples from example-from-slides.ttl with the information extracted from DBpedia. See the [documentation](https://rdflib.readthedocs.io/en/stable/merging.html) on how to accomplish this.
- Save all these triples into a new file 'extended-example.ttl'. **It is not necessary to submit this file**
- Print all triples in Turtle Syntax.


In [23]:
g_all = Graph()
for t in g:
    g_all.add(t)
for t in h:
    g_all.add(t)

g_all.serialize(destination='extended-example.ttl', format='turtle')

# serialize() may return bytes (older rdflib) or str (newer rdflib)
turtle = g_all.serialize(format='turtle')
print(turtle.decode() if isinstance(turtle, bytes) else turtle)


@prefix ns1: <http://dbpedia.org/ontology/> .
@prefix ns2: <http://example.com/kad/> .
@prefix ns3: <http://www.w3.org/2003/01/geo/wgs84_pos#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

ns2:Netherlands a ns2:Country ;
    ns2:contains ns2:Ijsselmeer ;
    ns2:containsCity ns2:Rotterdam ;
    ns2:hasCapital ns2:Amsterdam ;
    ns2:hasName "The Netherlands" ;
    ns2:neighbours ns2:Belgium .

ns2:hasCapital rdfs:range ns2:Capital ;
    rdfs:subPropertyOf ns2:containsCity .

ns2:neighbours rdfs:subPropertyOf ns2:closeBy .

ns2:Amsterdam a ns2:Capital ;
    ns1:populationTotal "907976"^^xsd:nonNegativeInteger ;
    ns2:closeBy ns2:Germany ;
    ns3:lat "52.36666488647461"^^xsd:float ;
    ns3:long "4.900000095367432"^^xsd:float .

ns2:Belgium a ns2:Country .

ns2:Berlin ns1:populationTotal "3677472"^^xsd:nonNegativeInteger ;
    ns3:lat "52.52000045776367"^^xsd:float ;
    ns3:long "13.40499973297119"^^xsd:float .

ns2:Euro

## Task 2: (25 points)  Implement Basic Inferencing Rules 

In the lecture we showed that the RDFS inference rules can be used to infer new knowledge. For example, infer class membership based on _rdfs:domain_ or infer relationships between subjects and objects based on _rdfs:subPropertyOf_. 

Create rules to infer class membership based on the RDF Schema language features 
*	For example: infer that an instance belongs to a class because of domain and range restrictions
*	For example: infer that an instance belongs to a (super)class because it also belongs to a subclass

We implemented the __rdfs2__ rule. You should implement the 5 remaining rules from the list below:  

*     (rdfs2) If G contains the triples (aaa rdfs:domain xxx.) and (uuu aaa yyy.) then infer the triple (uuu rdf:type xxx.)
*     (rdfs3) If G contains the triples (aaa rdfs:range xxx.) and (uuu aaa vvv.) then infer the triple (vvv rdf:type xxx.)
*     (rdfs5) If G contains the triples (uuu rdfs:subPropertyOf vvv.) and (vvv rdfs:subPropertyOf xxx.) then infer the triple (uuu rdfs:subPropertyOf xxx.)
*     (rdfs7) If G contains the triples (aaa rdfs:subPropertyOf bbb.) and (uuu aaa yyy.) then infer the triple (uuu bbb yyy.)
*     (rdfs9) If G contains the triples (uuu rdfs:subClassOf xxx.) and (vvv rdf:type uuu.) then infer the triple (vvv rdf:type xxx.) (this one was not mentioned in the lecture, but it is a very important one)
*     (rdfs11) If G contains the triples (uuu rdfs:subClassOf vvv.) and (vvv rdfs:subClassOf xxx.) then infer the triple (uuu rdfs:subClassOf xxx.)
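
To make the pattern matching behind the six rules concrete, here is a sketch that applies them to a set of plain Python tuples instead of an rdflib graph. The function names `apply_rules_once` and `rdfs_closure`, the shortened predicate strings, and the four sample facts are assumptions made for illustration. Repeating the rules until nothing new appears goes beyond what the assignment asks for, so its results will generally differ from a single-pass count.

```python
TYPE = "rdf:type"
DOMAIN, RANGE = "rdfs:domain", "rdfs:range"
SUBPROP, SUBCLASS = "rdfs:subPropertyOf", "rdfs:subClassOf"


def apply_rules_once(triples):
    """Return the (rule, triple) pairs derivable from `triples` in one pass."""
    inferred = set()
    for s, p, o in triples:            # first premise of a rule
        for s2, p2, o2 in triples:     # second premise of a rule
            if p == DOMAIN and p2 == s:
                inferred.add(("rdfs2", (s2, TYPE, o)))
            if p == RANGE and p2 == s:
                inferred.add(("rdfs3", (o2, TYPE, o)))
            if p == SUBPROP and p2 == SUBPROP and s2 == o:
                inferred.add(("rdfs5", (s, SUBPROP, o2)))
            if p == SUBPROP and p2 == s:
                inferred.add(("rdfs7", (s2, o, o2)))
            if p == SUBCLASS and p2 == TYPE and o2 == s:
                inferred.add(("rdfs9", (s2, TYPE, o)))
            if p == SUBCLASS and p2 == SUBCLASS and s2 == o:
                inferred.add(("rdfs11", (s, SUBCLASS, o2)))
    return inferred


def rdfs_closure(triples):
    """Apply the rules repeatedly until no new triple is produced."""
    closed = set(triples)
    while True:
        new = {triple for _, triple in apply_rules_once(closed)} - closed
        if not new:
            return closed
        closed |= new


facts = {
    ("ex:hasCapital", SUBPROP, "ex:containsCity"),
    ("ex:containsCity", RANGE, "ex:City"),
    ("ex:City", SUBCLASS, "ex:Location"),
    ("ex:Netherlands", "ex:hasCapital", "ex:Amsterdam"),
}

for triple in sorted(rdfs_closure(facts) - facts):
    print(triple)
```

In this sample a single pass finds only the rdfs7 inference; the two rdf:type triples for ex:Amsterdam appear in later passes, because they depend on the triple that rdfs7 produced.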


Run your rule reasoner on your knowledge graph. If you have implemented everything correctly, you should find exactly 17 inferences.

In [51]:
from rdflib import URIRef

def myRDFSreasoner(myGraph):
    inferredTriples = 0
    RDFS = "http://www.w3.org/2000/01/rdf-schema#"
    RDF  = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"

    for sbj, prd, obj in myGraph:

        # --- rdfs2 ---
        if prd.eq(URIRef(RDFS + "domain")):
            generator = myGraph.subject_objects(URIRef(sbj))
            for s, o in generator:
                inferredTriples += 1
                print("(rdfs2)", s, "rdf:type", obj)

        # --- rdfs3 ---
        elif prd.eq(URIRef(RDFS + "range")):
            generator = myGraph.subject_objects(URIRef(sbj))  # uuu sbj vvv
            for s, o in generator:
                inferredTriples += 1
                print("(rdfs3)", o, "rdf:type", obj)

        # --- rdfs5 ---
        elif prd.eq(URIRef(RDFS + "subPropertyOf")):
            # (sbj ⊑ obj) & (obj ⊑ z)  =>  (sbj ⊑ z)
            for _, _, z in myGraph.triples((obj, URIRef(RDFS + "subPropertyOf"), None)):
                inferredTriples += 1
                print("(rdfs5)", sbj, "rdfs:subPropertyOf", z)

        # --- rdfs7 ---
        else:
            # (sbj prd obj) & (prd ⊑ q)  =>  (sbj q obj)
            for _, _, q in myGraph.triples((prd, URIRef(RDFS + "subPropertyOf"), None)):
                inferredTriples += 1
                print("(rdfs7)", sbj, q, obj)

        # --- rdfs11 ---
        # (sbj ⊑ obj) & (obj ⊑ z)  =>  (sbj ⊑ z)
        if prd.eq(URIRef(RDFS + "subClassOf")):
            for _, _, z in myGraph.triples((obj, URIRef(RDFS + "subClassOf"), None)):
                inferredTriples += 1
                print("(rdfs11)", sbj, "rdfs:subClassOf", z)

        # --- rdfs9 ---
        # (sbj rdf:type obj) & (obj ⊑ sup)  =>  (sbj rdf:type sup)
        if prd.eq(URIRef(RDF + "type")):
            for _, _, sup in myGraph.triples((obj, URIRef(RDFS + "subClassOf"), None)):
                inferredTriples += 1
                print("(rdfs9)", sbj, "rdf:type", sup)

    print("--------------------------------")
    print("Number of inferred triples:", inferredTriples)
    print("--------------------------------")

myRDFSreasoner(g)

(rdfs11) http://example.com/kad/EuropeanCountry rdfs:subClassOf http://example.com/kad/Location
(rdfs5) http://example.com/kad/hasCapital rdfs:subPropertyOf http://example.com/kad/contains
(rdfs3) http://example.com/kad/Amsterdam rdf:type http://example.com/kad/Capital
(rdfs3) http://example.com/kad/Berlin rdf:type http://example.com/kad/Capital
(rdfs7) http://example.com/kad/Netherlands http://example.com/kad/contains http://example.com/kad/Rotterdam
(rdfs11) http://example.com/kad/Capital rdfs:subClassOf http://example.com/kad/Location
(rdfs2) http://example.com/kad/Amsterdam rdf:type http://example.com/kad/Location
(rdfs9) http://example.com/kad/Belgium rdf:type http://example.com/kad/Location
(rdfs3) http://example.com/kad/Germany rdf:type http://example.com/kad/Location
(rdfs9) http://example.com/kad/Netherlands rdf:type http://example.com/kad/Location
(rdfs7) http://example.com/kad/Netherlands http://example.com/kad/containsCity http://example.com/kad/Amsterdam
(rdfs2) http://exa

## Task 3: (20 points) Build your very own RDFS knowledge graph. 


Define a small RDF Schema vocabulary in Turtle. You can choose your own domain (e.g. movies, geography, sports), as long as it hasn't been used as an example during the lectures. The following rules must be respected:
*	The schema should define at least 4 classes, 4 properties, and 4 instances.
*   The properties should be used to relate the instances (i.e., object-type relations)
*	The instances should be members of at least one of the 4 defined classes
*	All resources should have an rdfs:label attribute in a suitable language.

You should use (at least) the following language features of RDF and RDFS:
* 	rdf:type (or 'a')
* 	rdfs:subClassOf
* 	rdfs:subPropertyOf
* 	rdfs:domain and rdfs:range
*	rdfs:label

Be sure to define the 'rdf:' and 'rdfs:' namespace prefixes for RDF and RDF Schema in your file (perhaps have a look at http://prefix.cc)

For creating your vocabulary you should add the axioms directly (programmatically) to your Knowledge Graph as you did last week. 

Play around with the inference rules you have created in the previous task to make sure that you added some implicit knowledge, that becomes "visible" via inferencing (this will be useful for the next task). 

Finally:
- Add the knowledge you created into the RDFlib graph data structure *myRDFSgraph*, 
- Print *myRDFSgraph* in Turtle so that we can check your "design"
- Save *myRDFSgraph* into a new file 'myRDFSgraph.ttl' (it is not necessary to submit this file)
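
The requirement that all resources carry an rdfs:label can be checked mechanically before submitting. The sketch below does so on plain tuples; the function name `resources_without_label`, the `ex:` prefix convention, and the sample triples are assumptions made for illustration. With rdflib the same idea would apply to the triples of *myRDFSgraph*.

```python
LABEL = "rdfs:label"


def resources_without_label(triples, prefix="ex:"):
    """Return every term in the given namespace that lacks an rdfs:label."""
    labelled = {s for s, p, _ in triples if p == LABEL}
    # classes, properties and instances all count as resources here
    mentioned = {term for triple in triples for term in triple
                 if term.startswith(prefix)}
    return mentioned - labelled


# Hypothetical, deliberately incomplete vocabulary
sample = {
    ("ex:Dog", "rdf:type", "rdfs:Class"),
    ("ex:Dog", "rdfs:subClassOf", "ex:Mammal"),
    ("ex:Dog", LABEL, '"Dog"@en'),
    ("ex:Buddy", "rdf:type", "ex:Dog"),
    ("ex:Buddy", "ex:hasMother", "ex:Luna"),
    ("ex:Buddy", LABEL, '"Buddy"@en'),
}

print(sorted(resources_without_label(sample)))
```

Properties are included in the check because they appear in predicate position, so a property without a label is reported just like a class or an instance.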

In [56]:
ttl = """
@prefix ex:   <http://example.com/kad/> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .

# Classes and subclass chain  (rdfs9 + rdfs11)
ex:Animal  a rdfs:Class .
ex:Mammal  a rdfs:Class ; rdfs:subClassOf ex:Animal .
ex:Dog     a rdfs:Class ; rdfs:subClassOf ex:Mammal .

# Properties with domain/range and subProperty chain  (rdfs2, rdfs3, rdfs5, rdfs7)
ex:relatedTo  a rdf:Property .
ex:hasParent  a rdf:Property ;
              rdfs:subPropertyOf ex:relatedTo ;
              rdfs:domain ex:Animal ;
              rdfs:range  ex:Animal .
ex:hasMother  a rdf:Property ;
              rdfs:subPropertyOf ex:hasParent .

# Instances (data)
ex:Buddy  a ex:Dog ;
          ex:hasMother ex:Luna .

ex:Luna   a ex:Mammal .

# extra instance to make rdfs2 explicitly visible
ex:Max    ex:hasParent ex:Luna .
"""

g3 = Graph()
g3.parse(data=ttl, format="turtle")

myRDFSreasoner(g3)


(rdfs7) http://example.com/kad/Buddy http://example.com/kad/hasParent http://example.com/kad/Luna
(rdfs5) http://example.com/kad/hasMother rdfs:subPropertyOf http://example.com/kad/relatedTo
(rdfs2) http://example.com/kad/Max rdf:type http://example.com/kad/Animal
(rdfs7) http://example.com/kad/Max http://example.com/kad/relatedTo http://example.com/kad/Luna
(rdfs9) http://example.com/kad/Buddy rdf:type http://example.com/kad/Mammal
(rdfs9) http://example.com/kad/Luna rdf:type http://example.com/kad/Animal
(rdfs11) http://example.com/kad/Dog rdfs:subClassOf http://example.com/kad/Animal
(rdfs3) http://example.com/kad/Luna rdf:type http://example.com/kad/Animal
--------------------------------
Number of inferred triples: 8
--------------------------------


## Task 4 (20 points) Compare local inferences with GraphDB results

Upload *myRDFSgraph.ttl* to GraphDB (check [the GraphDB tutorial](https://github.com/ucds-vu/knowledge-data-vu/blob/master/Tutorials/Preliminaries/tutorial-GraphDB.md) before starting to work with GraphDB).

Formulate two different SPARQL queries, and write Python code that executes these queries over your GraphDB SPARQL endpoint (check the example in Task 1).

**Each SPARQL query should return a _different type_ of inferred knowledge** (at least one triple that was not explicitly asserted in the graph).

Specify below, next to your query (using a comment '# ...'), which RDFS rule the GraphDB reasoner uses to infer this answer (rdfs2, rdfs3, rdfs5, rdfs7, rdfs9, rdfs11). 
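
A simple way to test whether a single inferred triple is visible at the endpoint is a SPARQL ASK query. The sketch below only builds the query string, so it runs without a GraphDB instance; the helper `ask_query` and the chosen triple are assumptions for illustration, and the commented-out lines assume a repository created with an RDFS ruleset and the `sparql` object from the cell below.

```python
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
EX = "http://example.com/kad/"


def ask_query(subject, predicate, obj):
    """Build a SPARQL ASK query that tests whether one triple holds."""
    return f"ASK {{ <{subject}> <{predicate}> <{obj}> }}"


# Hypothetical check for a triple that is not asserted explicitly
query = ask_query(EX + "Buddy", RDF_TYPE, EX + "Animal")
print(query)

# With a running repository the query could then be sent like this (assumption):
#   sparql.setQuery(query)
#   sparql.setReturnFormat(JSON)
#   answer = sparql.query().convert()["boolean"]
```

Comparing the boolean answer with the output of the local reasoner shows whether both agree on that triple; a SELECT query is still needed when the task asks for all inferred answers of a kind.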

In [None]:
# Get your GraphDB repository URL (setup -> repositories -> repository url) and assign it to the variable 'myEndpoint' below. 
# It should be similar to this: 

myEndpoint = "http://127.0.0.1:7200/repositories/KnowledgeAndData"  # KnowledgeAndData is the name of the repository
sparql = SPARQLWrapper(myEndpoint)

In [None]:
# Query 1 - Specify which RDFS rule you are testing: 

# Check example of Task 1 on how to query remote SPARQL endpoints

sparql.setQuery("""

""")




In [None]:
# Query 2 - Specify which RDFS rule you are testing: 

# Check example of Task 1 on how to query remote SPARQL endpoints

sparql.setQuery("""

""")



## Submitting the assignment

Please submit this notebook (.ipynb) once you're finished with the assignment. It is not necessary to submit the created turtle files.