# Managing RDF data in Python


 1. Introduction to RDFlib
 2. Querying RDF data
 3. Creating RDF triples
 4. Saving and storing RDF data
 


In [None]:
# all imports
import pprint
import rdflib
from rdflib import URIRef, Literal, Namespace
from rdflib.namespace import XSD, RDFS, DCTERMS

## 1. Introduction to RDFlib

RDFlib is a Python library for managing RDF data, either locally or in server-side web applications. It allows one to:

 * parse RDF data stored in files (either local or online) with several syntaxes
 * manipulate RDF data by creating an in-memory graph
 * query RDF data by means of SPARQL queries and other built-in constructs
 * serialize and store RDF data in files.

#### How does it work? 

Conceptually, a graph is treated as an unordered collection of tuples. Each tuple has three elements: subject, predicate, and object. In our examples the subject and the predicate are URIs, and the object is often a URI as well.

In [None]:
subj1 = '<http://example.org/robert_capa>'
subj2 = '<http://example.org/gerda_taro>'
pred1 = '<http://example.org/knows>'
pred2 = '<http://example.org/hasSpouse>'
obj1 = '<http://example.org/henri_cartier_bresson>'

# the basic behaviour in RDFlib
example_graph = [
    (subj1, pred1, obj1),
    (subj2, pred2, subj1)
]

NB. An object can be either a URI or a literal (e.g. a string, a date, or an integer). Have a look at the [RDFlib supported terms](https://rdflib.readthedocs.io/en/stable/rdf_terms.html).

#### Parse RDF data

If the data are stored in a file or a **string**, they must be parsed and transformed into an in-memory graph (i.e. the aforementioned list of tuples). For instance:

In [None]:
example_data = """
    <http://example.org/robert_capa> <http://example.org/knows> <http://example.org/henri_cartier_bresson> .
    <http://example.org/gerda_taro> <http://example.org/hasSpouse> <http://example.org/robert_capa> .
    """

example_g = rdflib.Graph()
example_result = example_g.parse(data=example_data, format='nt')

# get the number of statements (triples)
print(len(example_g))

#### Parse ARTchives data

In our case, we parse data from a **file**.

Our file includes **several graphs**, hence we need to parse our data and distribute them over several lists of tuples, one per graph. In RDFlib you do that by using the class `ConjunctiveGraph`, as follows.

In [None]:
# create an empty Graph
g = rdflib.ConjunctiveGraph()

# parse a local RDF file into the graph, specifying its format
result = g.parse("../resources/artchives.nq", format='nquads')

Now the graph **g** includes all the triples and the named graphs described in the file, and we can apply both Python and RDFlib methods to it. For instance:

In [None]:
## the number of triples (quadruples)
print(len(g))

## 2. Querying RDF data

Now that our graph `g` is ready, we can query the data. Queries can be written in SPARQL (we will see it in the next tutorial) or performed with Python and RDFlib methods for iterating over tuples. 

#### Iterate over all the graphs

In [None]:
for n_graph in g.contexts():
    pprint.pprint(n_graph)

#### Iterate over all the triples + context graph (i.e. quads)

In [None]:
for quad in g.quads():
    pprint.pprint(quad)

#### Iterate over all terms of the quads

In [None]:
# iterate over all terms of the quads
for subj, pred, obj, context in g.quads():
    pprint.pprint(obj)

#### Manipulate single elements of the triple/quad

For instance, we can collect the **objects** in a set and work on their unique values. To this end, we can mix RDFlib methods and Python built-in methods to work on RDF data as on any other type of data.

In [None]:
unique_objs = set()
for subj, pred, obj, context in g.quads():
    unique_objs.add(obj)
    
for obj in unique_objs:
    pprint.pprint(obj)

Look at the results: RDFlib classifies URIs as `rdflib.URIRef` and strings as `rdflib.Literal`, which are two of the aforementioned terms that RDFlib handles. We can rely on these classes to query data in more precise ways.

#### Replace variables with real values

When iterating over triples, we can replace placeholders with real values. 
For instance, we can get **all the triples that have as subject the URI identifying Federico Zeri** (`http://www.wikidata.org/entity/Q1089074`).

In [None]:
from rdflib import URIRef, Literal
# all the properties and objects having Federico Zeri as subject
for s, p, o, c in g.quads():
    if s == URIRef('http://www.wikidata.org/entity/Q1089074'): # the URI representing Federico Zeri
        print(p, o)

Now, imagine we want only the object of a specific property, for instance the biography of Federico Zeri, which is the object of the property `http://purl.org/dc/terms/description`. To get that, we need to specify the triple pattern as follows:

In [None]:
for s, p, o, c in g.quads():
    if s == URIRef('http://www.wikidata.org/entity/Q1089074') \
    and p == URIRef('http://purl.org/dc/terms/description'): # subject: Federico Zeri, predicate: dcterms:description
        print(o)

RDFlib offers alternative, **shorter methods to access triples** in the graphs. 

 * Instead of iterating over all the quads we can iterate over all the **triples** belonging to any graph (basically, we ignore the fourth variable `c`) 
 * We use the *placeholder* **None** in the lookup triple pattern.

The method that allows us to iterate over all triples is called `triples()` and it accepts a tuple `()` including three placeholders, respectively for the subject, the predicate, and the object. When one or more of the terms in the triple pattern are not known (e.g. in our case we don't know the *object* of the property `dcterms:description`) we use as placeholder the term `None`. For instance:

In [None]:
for s, p, o in g.triples((URIRef('http://www.wikidata.org/entity/Q1089074'), URIRef('http://purl.org/dc/terms/description'), None )):
    print(o)

#### Namespaces

As you can see, the triple pattern can be very long if we have one or more URIs to specify. 

RDFlib provides means to shorten URIs by using their **namespace** followed by the short ID of the term. For instance:

```
http://purl.org/dc/terms/ -> DCTERMS
http://purl.org/dc/terms/description -> DCTERMS.description
```

Moreover, RDFlib has built-in namespaces for the most common vocabularies, such as RDF, RDFS, FOAF, and DCTERMS (our case!). Therefore we can shorten the prior code in a more readable way as follows.

In [None]:
from rdflib.namespace import DCTERMS

zeri = URIRef('http://www.wikidata.org/entity/Q1089074')
for s, p, o in g.triples((zeri, DCTERMS.description, None)):
    print(o)

However, in ARTchives there are also non-standard vocabularies, such as:

 * Wikidata classes and individuals, having prefix `http://www.wikidata.org/entity/`
 * Wikidata properties, having prefix `http://www.wikidata.org/prop/direct/`
 * ARTchives terms (classes, properties and individuals), having prefix `https://w3id.org/artchives/`
 
For these cases, RDFlib allows you to specify your own namespaces as follows: 

In [None]:
from rdflib import Namespace

# assign prefixes to namespaces
wd = Namespace("http://www.wikidata.org/entity/") # remember that a namespace covers the URI up to the last slash (or hash #)
wdt = Namespace("http://www.wikidata.org/prop/direct/")
art = Namespace("https://w3id.org/artchives/")

For instance, our prior example on Federico Zeri now looks like the following:

In [None]:
for s, p, o in g.triples((wd.Q1089074, DCTERMS.description, None)):
    print(o)

### Exercise (15min)

In ARTchives, all the topics (people, organisations, movements, genres) studied by (or somehow related to) historians are linked to the historian by means of the predicate `http://www.wikidata.org/prop/direct/P921`. 

Every topic is identified by a URI, which is in turn associated with a label (a human-readable string) by means of the property `http://www.w3.org/2000/01/rdf-schema#label`. RDFlib provides a built-in namespace called RDFS that lets you refer to this predicate more concisely.

**Exercise: Extract the list of labels of all the topics related to Federico Zeri**

Tips:

 * At the beginning of your file import RDFlib and the `Namespace` class
 * Iterate over the triples having (a) as subject the URI identifying Zeri, and (b) as predicate `wdt.P921` (remember to declare the namespaces and assign them to a prefix!)
 * For each object of the prior triple pattern, look up a new triple pattern, having now as subject the variable identifying the topic, and as predicate `RDFS.label` (remember to import RDFS along with the built-in namespaces such as DCTERMS).
 * Print the list

In [None]:
# solution placeholder
from rdflib.namespace import RDFS
unique_topics = set()

for s,p,o in g.triples((wd.Q1089074, wdt.P921, None)):
    for s1,p1,o1 in g.triples((o, RDFS.label, None)):
        unique_topics.add(o1.strip())
        
for topic in unique_topics:
    print(topic)

## 3. Creating RDF triples

At this point you know how to parse an RDF file in which data are organised in named graphs (although you worked only on triples and didn't need to worry about the graphs) and how to perform simple iterations over data. RDFlib also allows you to create graphs and add triples to them.

For instance, in ARTchives there is no information about historians' birth places, birth dates, or sex. 
In case of Federico Zeri (`http://www.wikidata.org/entity/Q1089074`), we can have a look at his record on [Wikidata](https://www.wikidata.org/wiki/Q1089074), where we will find out that he was born in **Rome**, his birthday is **12 August 1921**, and he is **male**.

 * The city of Rome is identified in Wikidata by the URI `http://www.wikidata.org/entity/Q220`. 
 * The birth date is treated as an `xsd:date` (YYYY-MM-DD).
 * The sex is identified by a URI as well, but for the sake of the example we treat it as a string.

To add this information to the graph, we first need to create the objects as RDFlib terms, declaring the datatype or the language of literals where needed. 

In [None]:
from rdflib.namespace import XSD
from rdflib import Literal

# entity URIs use http:// (matching the wd namespace), unlike the https:// web pages
birthplace = URIRef("http://www.wikidata.org/entity/Q220")
birthdate = Literal('1921-08-12',datatype=XSD.date) # notice the datatype parameter
sex = Literal("male",lang="en") # notice the lang parameter


Then we need to choose the predicates to which these objects must be linked. We decide to reuse Wikidata predicates:

 * birth place: `http://www.wikidata.org/prop/direct/P19` -> wdt.P19
 * birth date: `http://www.wikidata.org/prop/direct/P569` -> wdt.P569
 * sex: `http://www.wikidata.org/prop/direct/P21` -> wdt.P21

**Be aware that URIs of Wikidata web pages differ from the URIs of the classes/individuals/predicates described there. Only the last part (after the last slash) is shared by web pages and persistent URIs.**

Lastly, we use the method `add()` to insert a new triple (a tuple) into the graph.

In [None]:
g.add(( wd.Q1089074, wdt.P19, birthplace ))
g.add(( wd.Q1089074, wdt.P569, birthdate ))
g.add(( wd.Q1089074, wdt.P21, sex ))

Now we can double-check if the new data has been correctly added. Let's query our graph for Federico Zeri's birth place, date, and sex.

In [None]:
check_properties = [wdt.P19, wdt.P569, wdt.P21]

for prop in check_properties:
    for s,p,o in g.triples((wd.Q1089074, prop, None )):
        print(s, prop, o)

Apparently there was already some information about Federico Zeri's birth date. Let's analyse it:

In [None]:
for s,p,o in g.triples((wd.Q1089074, wdt.P569, None )):
    print(o, type(o), o.datatype)

The existing birth date has `XSD.gYear` as datatype, that is, only the year is meaningful. The value 1921-01-01 may be misleading, since it also shows a month and a day: that is why it is important to know the datatypes!

We can remove the duplicate year by using the method `remove()`, which lets us specify all the terms of the pattern that we want to remove. In our case we want to remove one specific triple, hence we specify all three terms of the triple pattern.

In [None]:
g.remove((wd.Q1089074, wdt.P569, Literal("1921-01-01",datatype=XSD.gYear) ))

# check if it has been deleted
for s,p,o in g.triples((wd.Q1089074, wdt.P569, None )):
    print(s, p, o)

## 4. Saving and storing RDF data


The changes that we made live only in an in-memory graph, that is, once the Python program stops running the information added or removed is lost. To prevent this, we can save the data in a new file (as a good practice, never overwrite the original data, you never know!). 

First, you need to choose the serialization. As mentioned above, RDF data can be written in several syntaxes (e.g. n3, turtle, XML, nquads). In our case, we keep the original serialization of our data, i.e. `nquads`.

Second, you need to choose the file where the data will be saved. In our case we dump the data in the folder `resources`, and we call our file `artchives_enhanced.nq`.

In [None]:
g.serialize(destination='../resources/artchives_enhanced.nq', format='nquads')

Useful links:
 
 * [RDFlib official documentation](https://rdflib.readthedocs.io/en/stable/gettingstarted.html)
 * [YouTube series of tutorials on working on RDF with Python](https://www.youtube.com/watch?v=sCU214rbRZ0) Pay attention! In this tutorial the speaker works with **only one graph**, hence he uses the RDFlib class `Graph()`, while we worked on several graphs included in one file, hence we used the class `ConjunctiveGraph()`. It is worth noting that `ConjunctiveGraph()` extends `Graph()`, so the methods for iterating over subjects, objects, subjects and predicates, or predicates and objects should be available on it as well (applied to the triples of all graphs), in addition to the graph-aware methods `quads()` and `contexts()`.
 * **There is also a JavaScript library called [rdflib](https://linkeddata.github.io/rdflib.js/doc/) to handle data client-side (in the browser).** We will not use this in our classes, but you are welcome to explore it for future use and possibly for your project!

## Homework

Fill in the [questionnaire](https://forms.gle/uWW3j7sXNZ8WF3B89)