In [1]:
from DataCitationFramework.SPARQLAPI import SPARQLAPI
from DataCitationFramework.QueryUtils import Query
import DataCitationFramework.QueryStore as qs
import DataCitationFramework.Citing as ct
import queries_for_testing as q
from datetime import datetime, timedelta, timezone
from IPython.display import Image

# Preparing the data and query store

## R1 - Data Versioning

Coming from relational databases, a triple store can be imagined as a table with three fields: subject, predicate and object. If one wanted to annotate a triple with a timestamp or another label, no additional "column" is available. A triple store only allows additional rows, which makes it hard to reference a specific triple. All that could be done is to insert another triple that somehow references the target triple. However, what would this new triple point at, and what would be its subject? Figure 1 illustrates this problem.


In [None]:
Image(filename='figures/figure 1.png')

One can easily see that the placeholder ? can hold only one piece of information (the subject or the object) of the target triple, which does not suffice to reference the triple as a whole. However, if a whole triple could be nested within the subject of another triple, we would have the right tool to address data versioning within triple stores. In fact, RDF* and SPARQL*, as extensions of RDF and SPARQL respectively, are capable of exactly this. Figure 2 shows the solution for the example from figure 1 in actual RDF* syntax.


In [None]:
Image(filename='figures/figure 2.png')

This capability of nesting triples is only part of the solution. It does, however, allow us to apply methods that are also used in relational databases. One way is to use a start date and an end date for each triple. As an initial operation, all triples are annotated with the current timestamp as the start date and an end date far in the future, e.g. 9999-12-31T00:00:00.000. From here we have to distinguish between insert, update and delete operations. Finally, to retrieve data as it existed at a certain point in time, we use simple filter operations on the start date and end date annotations.
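The point-in-time retrieval described above can be sketched in plain Python. This is a minimal illustration, not the framework's implementation: the tuple layout, the function name `as_of` and the half-open validity interval (start date inclusive, end date exclusive) are assumptions made for this example.

```python
from datetime import datetime, timedelta, timezone

TZ = timezone(timedelta(hours=2))
END_OF_TIME = datetime(9999, 12, 31, tzinfo=TZ)

# Each entry is (triple, valid_from, valid_until); the triples are illustrative.
annotated_triples = [
    (("mention_a", "hasInstance", "person_1"),
     datetime(2020, 9, 6, 10, 46, 9, tzinfo=TZ),
     datetime(2020, 10, 4, 12, 42, 36, tzinfo=TZ)),
    (("mention_b", "hasInstance", "person_1"),
     datetime(2020, 9, 6, 10, 46, 9, tzinfo=TZ),
     END_OF_TIME),
]


def as_of(rows, timestamp):
    """Return the triples whose validity interval contains `timestamp`."""
    return [triple for triple, valid_from, valid_until in rows
            if valid_from <= timestamp < valid_until]


before = as_of(annotated_triples, datetime(2020, 9, 8, 12, 11, 21, tzinfo=TZ))
after = as_of(annotated_triples, datetime(2020, 10, 4, 18, 11, 21, tzinfo=TZ))
print(len(before), len(after))
```

The first timestamp lies inside both validity intervals, while the second lies after the end date of the first triple, so only one triple remains visible.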


### Insert

Whenever a new triple is inserted, two additional triples are inserted with it: one marking the start date of that triple (e.g. the current timestamp) and one setting the end date to 9999-12-31T00:00:00.000 to mark it as valid until further notice.


In [None]:
Image(filename='figures/figure 3.png')
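As a rough sketch of the insert operation, the following helper builds a SPARQL* update string that adds a triple together with its two annotations. The helper name `build_versioned_insert` and the predicates `citing:valid_from` and `citing:valid_until` are assumptions for illustration; the statements actually generated by the framework may look different. Prefix declarations would have to be placed on top of the statement.

```python
from datetime import datetime, timedelta, timezone

# Far-future end date marking a triple as valid until further notice.
END_OF_TIME = '"9999-12-31T00:00:00.000+02:00"^^xsd:dateTime'


def build_versioned_insert(subject, predicate, obj, start):
    """Build a SPARQL* update inserting a triple plus start and end date annotations."""
    start_literal = '"{0}"^^xsd:dateTime'.format(
        start.isoformat(timespec="milliseconds"))
    triple = "{0} {1} {2}".format(subject, predicate, obj)
    return (
        "insert data {{\n"
        "    {t} .\n"
        "    << {t} >> citing:valid_from {s} .\n"
        "    << {t} >> citing:valid_until {e} .\n"
        "}}"
    ).format(t=triple, s=start_literal, e=END_OF_TIME)


start = datetime(2020, 9, 6, 10, 46, 9, 33000, timezone(timedelta(hours=2)))
statement = build_versioned_insert(
    "<http://ontology.ontotext.com/resource/tsm835hi3s3k>",
    "pub:memberOfPoliticalParty",
    "<http://ontology.ontotext.com/resource/Q460035S071C8FD6-DA5F-4189-81A7-D589D13B2D09>",
    start)
print(statement)
```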

### Update

The update operation is a combination of a delete operation followed by an insert. We first must make clear what is actually updated. Here it again helps to think in terms of relational databases: the subject can be seen as a table, the predicate as an attribute of that table and the object as a particular value of that attribute. An update therefore changes the object (the value) of the triple. We first delete the end date annotation of the target triple. Subsequently, we give the target triple a new end date, for example the timestamp at which the operation is performed. So far, we have only outdated the old value. What is left to do is to insert the triple with the new value as described in the Insert section.

In [None]:
Image(filename='figures/figure 4.png')

### Delete

The delete operation corresponds to the first part of the update operation. First, the target triple's end date annotation is deleted. Second, the target triple is given a new end date, e.g. the timestamp of the operation. The triple itself stays in the store and is merely marked as outdated.


In [None]:
Image(filename='figures/figure 7.png')
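The three operations can be summarized with a toy in-memory model. The class `VersionedTriples` and its record layout are hypothetical and only mirror the start date and end date logic described above; they are not part of the framework.

```python
from datetime import datetime, timedelta, timezone

TZ = timezone(timedelta(hours=2))
END_OF_TIME = datetime(9999, 12, 31, tzinfo=TZ)


class VersionedTriples:
    """Toy store: each record is [subject, predicate, object, valid_from, valid_until]."""

    def __init__(self):
        self.records = []

    def insert(self, s, p, o, now):
        # A new triple is valid from now until further notice.
        self.records.append([s, p, o, now, END_OF_TIME])

    def delete(self, s, p, o, now):
        # Replace the open end date with the operation timestamp.
        for record in self.records:
            if record[:3] == [s, p, o] and record[4] == END_OF_TIME:
                record[4] = now

    def update(self, s, p, old_o, new_o, now):
        # Outdate the old value, then insert the new one.
        self.delete(s, p, old_o, now)
        self.insert(s, p, new_o, now)


store = VersionedTriples()
t1 = datetime(2020, 9, 6, 10, 0, tzinfo=TZ)
t2 = datetime(2020, 10, 4, 12, 0, tzinfo=TZ)
store.insert("person_1", "memberOfPoliticalParty", "party_a", t1)
store.update("person_1", "memberOfPoliticalParty", "party_a", "party_b", t2)
for record in store.records:
    print(record[2], record[3].isoformat(), record[4].isoformat())
```

After the update, the old value is still present but expires at `t2`, while the new value is valid from `t2` onwards.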

### Proof of concept
The following test case demonstrates that the solution discussed above makes it possible to retrieve earlier states of the data.

In [3]:
# set up endpoints
sparqlapi = SPARQLAPI('http://192.168.0.241:7200/repositories/DataCitation', #GET
                      'http://192.168.0.241:7200/repositories/DataCitation/statements') #POST


The query and its prefixes must be kept separate. This makes it easier to nest only the query within
the SPARQL template for timestamping queries (see R7 below), while the prefixes are placed on top.

In [4]:
# Prepare test query and timestamps
prefixes = {'citing': 'http://ontology.ontotext.com/citing/',
            'pub': 'http://ontology.ontotext.com/taxonomy/',
            'xsd': 'http://www.w3.org/2001/XMLSchema#',
            'publishing': 'http://ontology.ontotext.com/publishing#'}

original_query = """
select ?mention ?party ?person ?document ?personLabel ?value {
    ?mention publishing:hasInstance ?person .
    ?document publishing:containsMention ?mention . 
    ?person pub:memberOfPoliticalParty ?party .
    ?person pub:preferredLabel ?personLabel .
    ?party pub:hasValue ?value .
    ?value pub:preferredLabel "Democratic Party"@en .
    filter(?personLabel = "Judy Chu"@en)
}
"""

vieTZObject = timezone(timedelta(hours=2))
timestamp1 = datetime(2020, 9, 8, 12, 11, 21, 941000, vieTZObject)
timestamp2 = datetime(2020, 10, 4, 18, 11, 21, 941000, vieTZObject)
timestamp3 = datetime(2020, 10, 5, 18, 11, 21, 941000, vieTZObject)

In [5]:
query = Query(original_query, prefixes)
timestamped_query_1 = query.extend_query_with_timestamp(timestamp1)
sparqlapi.get_data(timestamped_query_1, prefixes) # dataframe

Unnamed: 0,mention,party,person,document,personLabel,value,TimeOfCiting,valid_from_0,valid_until_0,valid_from_1,valid_until_1,valid_from_2,valid_until_2,valid_from_3,valid_until_3,valid_from_4,valid_until_4,valid_from_5,valid_until_5
0,http://data.ontotext.com/publishing#Mention-dbaa4de4563be5f6b927c87e09f90461c09451296f4b52b1f80dcb6e941a5acd,http://ontology.ontotext.com/resource/Q460035S071C8FD6-DA5F-4189-81A7-D589D13B2D09,http://ontology.ontotext.com/resource/tsm835hi3s3k,http://www.reuters.com/article/2014/10/10/us-usa-california-mountains-idUSKCN0HZ0U720141010,Judy Chu@en,http://ontology.ontotext.com/resource/tsk5a9unh5a8,2020-09-08T12:11:21.941000+02:00 [http://www.w3.org/2001/XMLSchema#dateTime],2020-09-06T10:46:09.033+02:00 [http://www.w3.org/2001/XMLSchema#dateTime],2020-10-04T12:42:36.401+02:00 [http://www.w3.org/2001/XMLSchema#dateTime],2020-09-06T10:46:09.033+02:00 [http://www.w3.org/2001/XMLSchema#dateTime],2020-10-05T00:31:47.190+02:00 [http://www.w3.org/2001/XMLSchema#dateTime],2020-09-06T10:46:09.033+02:00 [http://www.w3.org/2001/XMLSchema#dateTime],9999-12-31T00:00:00.000+02:00 [http://www.w3.org/2001/XMLSchema#dateTime],2020-09-06T10:46:09.033+02:00 [http://www.w3.org/2001/XMLSchema#dateTime],9999-12-31T00:00:00.000+02:00 [http://www.w3.org/2001/XMLSchema#dateTime],2020-09-06T10:46:09.033+02:00 [http://www.w3.org/2001/XMLSchema#dateTime],9999-12-31T00:00:00.000+02:00 [http://www.w3.org/2001/XMLSchema#dateTime],2020-09-06T10:46:09.033+02:00 [http://www.w3.org/2001/XMLSchema#dateTime],9999-12-31T00:00:00.000+02:00 [http://www.w3.org/2001/XMLSchema#dateTime]
1,http://data.ontotext.com/publishing#Mention-f5eb5422f2a33ff188b30ab0b983ab18b33f39b383be10f812c153230b73865d,http://ontology.ontotext.com/resource/Q460035S071C8FD6-DA5F-4189-81A7-D589D13B2D09,http://ontology.ontotext.com/resource/tsm835hi3s3k,http://www.reuters.com/article/2014/10/10/us-usa-california-mountains-idUSKCN0HZ0U720141010,Judy Chu@en,http://ontology.ontotext.com/resource/tsk5a9unh5a8,2020-09-08T12:11:21.941000+02:00 [http://www.w3.org/2001/XMLSchema#dateTime],2020-09-06T10:46:09.033+02:00 [http://www.w3.org/2001/XMLSchema#dateTime],2020-10-04T20:26:07.663+02:00 [http://www.w3.org/2001/XMLSchema#dateTime],2020-09-06T10:46:09.033+02:00 [http://www.w3.org/2001/XMLSchema#dateTime],2020-10-05T00:31:47.190+02:00 [http://www.w3.org/2001/XMLSchema#dateTime],2020-09-06T10:46:09.033+02:00 [http://www.w3.org/2001/XMLSchema#dateTime],9999-12-31T00:00:00.000+02:00 [http://www.w3.org/2001/XMLSchema#dateTime],2020-09-06T10:46:09.033+02:00 [http://www.w3.org/2001/XMLSchema#dateTime],9999-12-31T00:00:00.000+02:00 [http://www.w3.org/2001/XMLSchema#dateTime],2020-09-06T10:46:09.033+02:00 [http://www.w3.org/2001/XMLSchema#dateTime],9999-12-31T00:00:00.000+02:00 [http://www.w3.org/2001/XMLSchema#dateTime],2020-09-06T10:46:09.033+02:00 [http://www.w3.org/2001/XMLSchema#dateTime],9999-12-31T00:00:00.000+02:00 [http://www.w3.org/2001/XMLSchema#dateTime]
2,http://data.ontotext.com/publishing#Mention-69cd96b08117cf60238c40a3d95c949a778288a68458716fb5f1400a4fdd30dc,http://ontology.ontotext.com/resource/Q460035S071C8FD6-DA5F-4189-81A7-D589D13B2D09,http://ontology.ontotext.com/resource/tsm835hi3s3k,http://www.reuters.com/article/2014/10/10/us-usa-california-mountains-idUSKCN0HZ0U720141010,Judy Chu@en,http://ontology.ontotext.com/resource/tsk5a9unh5a8,2020-09-08T12:11:21.941000+02:00 [http://www.w3.org/2001/XMLSchema#dateTime],2020-09-06T10:46:09.033+02:00 [http://www.w3.org/2001/XMLSchema#dateTime],2020-10-04T20:26:07.663+02:00 [http://www.w3.org/2001/XMLSchema#dateTime],2020-09-06T10:46:09.033+02:00 [http://www.w3.org/2001/XMLSchema#dateTime],2020-10-05T00:31:47.190+02:00 [http://www.w3.org/2001/XMLSchema#dateTime],2020-09-06T10:46:09.033+02:00 [http://www.w3.org/2001/XMLSchema#dateTime],9999-12-31T00:00:00.000+02:00 [http://www.w3.org/2001/XMLSchema#dateTime],2020-09-06T10:46:09.033+02:00 [http://www.w3.org/2001/XMLSchema#dateTime],9999-12-31T00:00:00.000+02:00 [http://www.w3.org/2001/XMLSchema#dateTime],2020-09-06T10:46:09.033+02:00 [http://www.w3.org/2001/XMLSchema#dateTime],9999-12-31T00:00:00.000+02:00 [http://www.w3.org/2001/XMLSchema#dateTime],2020-09-06T10:46:09.033+02:00 [http://www.w3.org/2001/XMLSchema#dateTime],9999-12-31T00:00:00.000+02:00 [http://www.w3.org/2001/XMLSchema#dateTime]
3,http://data.ontotext.com/publishing#Mention-1a777ce18aed8d438a8f292f82520d87755062fec533c97ba61cb0f3fbc22e19,http://ontology.ontotext.com/resource/Q460035S071C8FD6-DA5F-4189-81A7-D589D13B2D09,http://ontology.ontotext.com/resource/tsm835hi3s3k,http://www.reuters.com/article/2014/10/10/us-usa-california-mountains-idUSKCN0HZ0U720141010,Judy Chu@en,http://ontology.ontotext.com/resource/tsk5a9unh5a8,2020-09-08T12:11:21.941000+02:00 [http://www.w3.org/2001/XMLSchema#dateTime],2020-09-06T10:46:09.033+02:00 [http://www.w3.org/2001/XMLSchema#dateTime],2020-10-04T20:26:07.663+02:00 [http://www.w3.org/2001/XMLSchema#dateTime],2020-09-06T10:46:09.033+02:00 [http://www.w3.org/2001/XMLSchema#dateTime],2020-10-05T00:31:47.190+02:00 [http://www.w3.org/2001/XMLSchema#dateTime],2020-09-06T10:46:09.033+02:00 [http://www.w3.org/2001/XMLSchema#dateTime],9999-12-31T00:00:00.000+02:00 [http://www.w3.org/2001/XMLSchema#dateTime],2020-09-06T10:46:09.033+02:00 [http://www.w3.org/2001/XMLSchema#dateTime],9999-12-31T00:00:00.000+02:00 [http://www.w3.org/2001/XMLSchema#dateTime],2020-09-06T10:46:09.033+02:00 [http://www.w3.org/2001/XMLSchema#dateTime],9999-12-31T00:00:00.000+02:00 [http://www.w3.org/2001/XMLSchema#dateTime],2020-09-06T10:46:09.033+02:00 [http://www.w3.org/2001/XMLSchema#dateTime],9999-12-31T00:00:00.000+02:00 [http://www.w3.org/2001/XMLSchema#dateTime]


In [6]:
# Timestamp 2 > timestamp 1
timestamped_query_2 = query.extend_query_with_timestamp(timestamp2)
sparqlapi.get_data(timestamped_query_2, prefixes)

Unnamed: 0,mention,party,person,document,personLabel,value,TimeOfCiting,valid_from_0,valid_until_0,valid_from_1,valid_until_1,valid_from_2,valid_until_2,valid_from_3,valid_until_3,valid_from_4,valid_until_4,valid_from_5,valid_until_5
0,http://data.ontotext.com/publishing#Mention-f5eb5422f2a33ff188b30ab0b983ab18b33f39b383be10f812c153230b73865d,http://ontology.ontotext.com/resource/Q460035S071C8FD6-DA5F-4189-81A7-D589D13B2D09,http://ontology.ontotext.com/resource/tsm835hi3s3k,http://www.reuters.com/article/2014/10/10/us-usa-california-mountains-idUSKCN0HZ0U720141010,Judy Chu@en,http://ontology.ontotext.com/resource/tsk5a9unh5a8,2020-10-04T18:11:21.941000+02:00 [http://www.w3.org/2001/XMLSchema#dateTime],2020-09-06T10:46:09.033+02:00 [http://www.w3.org/2001/XMLSchema#dateTime],2020-10-04T20:26:07.663+02:00 [http://www.w3.org/2001/XMLSchema#dateTime],2020-09-06T10:46:09.033+02:00 [http://www.w3.org/2001/XMLSchema#dateTime],2020-10-05T00:31:47.190+02:00 [http://www.w3.org/2001/XMLSchema#dateTime],2020-09-06T10:46:09.033+02:00 [http://www.w3.org/2001/XMLSchema#dateTime],9999-12-31T00:00:00.000+02:00 [http://www.w3.org/2001/XMLSchema#dateTime],2020-09-06T10:46:09.033+02:00 [http://www.w3.org/2001/XMLSchema#dateTime],9999-12-31T00:00:00.000+02:00 [http://www.w3.org/2001/XMLSchema#dateTime],2020-09-06T10:46:09.033+02:00 [http://www.w3.org/2001/XMLSchema#dateTime],9999-12-31T00:00:00.000+02:00 [http://www.w3.org/2001/XMLSchema#dateTime],2020-09-06T10:46:09.033+02:00 [http://www.w3.org/2001/XMLSchema#dateTime],9999-12-31T00:00:00.000+02:00 [http://www.w3.org/2001/XMLSchema#dateTime]
1,http://data.ontotext.com/publishing#Mention-69cd96b08117cf60238c40a3d95c949a778288a68458716fb5f1400a4fdd30dc,http://ontology.ontotext.com/resource/Q460035S071C8FD6-DA5F-4189-81A7-D589D13B2D09,http://ontology.ontotext.com/resource/tsm835hi3s3k,http://www.reuters.com/article/2014/10/10/us-usa-california-mountains-idUSKCN0HZ0U720141010,Judy Chu@en,http://ontology.ontotext.com/resource/tsk5a9unh5a8,2020-10-04T18:11:21.941000+02:00 [http://www.w3.org/2001/XMLSchema#dateTime],2020-09-06T10:46:09.033+02:00 [http://www.w3.org/2001/XMLSchema#dateTime],2020-10-04T20:26:07.663+02:00 [http://www.w3.org/2001/XMLSchema#dateTime],2020-09-06T10:46:09.033+02:00 [http://www.w3.org/2001/XMLSchema#dateTime],2020-10-05T00:31:47.190+02:00 [http://www.w3.org/2001/XMLSchema#dateTime],2020-09-06T10:46:09.033+02:00 [http://www.w3.org/2001/XMLSchema#dateTime],9999-12-31T00:00:00.000+02:00 [http://www.w3.org/2001/XMLSchema#dateTime],2020-09-06T10:46:09.033+02:00 [http://www.w3.org/2001/XMLSchema#dateTime],9999-12-31T00:00:00.000+02:00 [http://www.w3.org/2001/XMLSchema#dateTime],2020-09-06T10:46:09.033+02:00 [http://www.w3.org/2001/XMLSchema#dateTime],9999-12-31T00:00:00.000+02:00 [http://www.w3.org/2001/XMLSchema#dateTime],2020-09-06T10:46:09.033+02:00 [http://www.w3.org/2001/XMLSchema#dateTime],9999-12-31T00:00:00.000+02:00 [http://www.w3.org/2001/XMLSchema#dateTime]
2,http://data.ontotext.com/publishing#Mention-1a777ce18aed8d438a8f292f82520d87755062fec533c97ba61cb0f3fbc22e19,http://ontology.ontotext.com/resource/Q460035S071C8FD6-DA5F-4189-81A7-D589D13B2D09,http://ontology.ontotext.com/resource/tsm835hi3s3k,http://www.reuters.com/article/2014/10/10/us-usa-california-mountains-idUSKCN0HZ0U720141010,Judy Chu@en,http://ontology.ontotext.com/resource/tsk5a9unh5a8,2020-10-04T18:11:21.941000+02:00 [http://www.w3.org/2001/XMLSchema#dateTime],2020-09-06T10:46:09.033+02:00 [http://www.w3.org/2001/XMLSchema#dateTime],2020-10-04T20:26:07.663+02:00 [http://www.w3.org/2001/XMLSchema#dateTime],2020-09-06T10:46:09.033+02:00 [http://www.w3.org/2001/XMLSchema#dateTime],2020-10-05T00:31:47.190+02:00 [http://www.w3.org/2001/XMLSchema#dateTime],2020-09-06T10:46:09.033+02:00 [http://www.w3.org/2001/XMLSchema#dateTime],9999-12-31T00:00:00.000+02:00 [http://www.w3.org/2001/XMLSchema#dateTime],2020-09-06T10:46:09.033+02:00 [http://www.w3.org/2001/XMLSchema#dateTime],9999-12-31T00:00:00.000+02:00 [http://www.w3.org/2001/XMLSchema#dateTime],2020-09-06T10:46:09.033+02:00 [http://www.w3.org/2001/XMLSchema#dateTime],9999-12-31T00:00:00.000+02:00 [http://www.w3.org/2001/XMLSchema#dateTime],2020-09-06T10:46:09.033+02:00 [http://www.w3.org/2001/XMLSchema#dateTime],9999-12-31T00:00:00.000+02:00 [http://www.w3.org/2001/XMLSchema#dateTime]


In [None]:
# Timestamp 3 > timestamp 2
timestamped_query_3 = query.extend_query_with_timestamp(timestamp3)
sparqlapi.get_data(timestamped_query_3, prefixes)

Let us now assume that new triples which affect the original query are added to the triple store. While the original query will then return a different result, the timestamped query will always yield the same result.

In [None]:
mention = "<http://data.ontotext.com/publishing#Mention-dbaa4de4563be5f6b927c87e09f90461c09451296f4b52b1f80dcb6e941a5acd>"
hasInstance = "publishing:hasInstance"
person = "<http://ontology.ontotext.com/resource/tsm835hi3s3k>"

# Re-insert the mention -> person triple
sparqlapi.insert_triple((mention, hasInstance, person), prefixes)

document = "<http://www.reuters.com/article/2014/10/10/us-usa-california-mountains-idUSKCN0HZ0U720141010>"
containsMention = "publishing:containsMention"

# Re-insert the document -> mention triple
sparqlapi.insert_triple((document, containsMention, mention), prefixes)

In [None]:
sparqlapi.get_data(timestamped_query_3, prefixes)

We see that the timestamped dataset, as before, yields no rows, because all rows concerning this subset were deleted between timestamp2 and timestamp3.

In [None]:
sparqlapi.get_data(original_query, prefixes)

However, the actual dataset has been modified by our insert statements above, and the original query therefore returns a row. Not all triples matched by the original query had to be inserted, only those whose deletion caused the join to find no matches anymore. It therefore suffices to "back insert" only the triples that had previously been deleted.

In [None]:
sparqlapi._delete_triples((mention, hasInstance, person), prefixes)
sparqlapi._delete_triples((document, containsMention, mention), prefixes)

The last piece of code just makes sure to delete the previously added triples and their annotations to make this demonstration re-executable.

## R2 - Timestamping 

Operations on data should be timestamped. If we look at the suggested solution for R1, we see that timestamps are already applied with every operation. One could therefore use the operation timestamp for both purposes: versioning the data and timestamping the operation itself. The proposed solution for R1 allows retrieving each operation's timestamp as well as a specific version of the dataset (as it was at a specific point in time). The start date annotation also tells us the timestamp of the insert or update operation. If a semantically valid end date (not 9999-12-31T00:00:00.000) is used to mark triples as outdated, it can also be seen as the timestamp of the delete operation.
However, if one wants to distinguish between a real delete and a delete that is part of an update, this approach leads to ambiguity. Therefore, another proposed solution is to use a further annotation "operation_date" to stamp the operation itself. This solution has the following benefit: a timestamp other than the operation timestamp can be used for versioning the data. Some use cases might require versioning data with specific timestamps (e.g. the end of the day on which the operation was performed).
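The separation of operation timestamp and version timestamp can be illustrated as follows. This is a sketch under the assumption that the version timestamp is set to the end of the operation's calendar day; the helper `end_of_day` and the dictionary layout are hypothetical.

```python
from datetime import datetime, time, timedelta, timezone


def end_of_day(operation_date):
    """Return the last representable moment of the operation's calendar day."""
    return datetime.combine(operation_date.date(), time.max,
                            tzinfo=operation_date.tzinfo)


operation_date = datetime(2020, 10, 4, 12, 42, 36, 401000,
                          timezone(timedelta(hours=2)))

# The operation is stamped with the real time, the version with the end of that day.
annotations = {
    "operation_date": operation_date,
    "valid_from": end_of_day(operation_date),
}
print(annotations["operation_date"].isoformat())
print(annotations["valid_from"].isoformat())
```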


## R3 - Query Store Facilities

The query store is a means to store queries and their associated metadata. For the query store it is not important whether the underlying database is a graph database, a document-based database or a relational database. In all cases the query itself is just a string of characters, so a simple database like SQLite in combination with Python's object-relational mapper SQLAlchemy is proposed as the query store. An examination of query store implementations by EODC, VMC, CCCA, EHR and VAMDC revealed the following commonly used metadata, which are proposed to be covered by default by the framework:

In [None]:
Image(filename='figures/figure 6.png', width=800, height=680)
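To give an impression of such a store, the following sketch uses Python's built-in `sqlite3` module directly instead of SQLAlchemy, so that it runs without additional dependencies. The table and column names are assumptions chosen for this example and do not necessarily match the metadata shown in the figure or the framework's schema.

```python
import sqlite3

# In-memory database so the sketch leaves no files behind.
connection = sqlite3.connect(":memory:")
connection.execute("""
    create table query_store (
        query_pid           text primary key,
        query_checksum      text unique not null,
        result_set_checksum text not null,
        original_query      text not null,
        normalized_query    text not null,
        citation_timestamp  text not null,
        citation_text       text
    )
""")


def store_query(row):
    """Insert one query record; `row` holds the seven column values in order."""
    connection.execute(
        "insert into query_store values (?, ?, ?, ?, ?, ?, ?)", row)


def lookup(query_checksum):
    """Return (query_pid, citation_timestamp) or None if the checksum is unknown."""
    cursor = connection.execute(
        "select query_pid, citation_timestamp from query_store "
        "where query_checksum = ?", (query_checksum,))
    return cursor.fetchone()


store_query(("pid-1", "abc", "def", "select * { ?s ?p ?o }",
             "select ?a ?b ?c where { ?a ?b ?c }",
             "2020-09-08T12:11:21.941+02:00", "Example citation text"))
print(lookup("abc"))
print(lookup("unknown"))
```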

# Persistently Identify Specific Data Sets

In [None]:
Image(filename='figures/figure 5.png')

Every time a data set is to be persisted, the following operations are performed. First, the query is normalized (R4) and the version timestamp (R7) is embedded in the query text. The latter operation ensures that two identical queries with different version timestamps get different PIDs. Note that the version timestamp is not the execution timestamp but the timestamp of the last write operation in the returned dataset. This means that two queries with different execution timestamps can result in the same checksum if no write operations were made between the two executions. Consequently, two different version timestamps on an otherwise identical query will result in different result sets.
Then, the normalized query is extended with a sorting operation (R5), which sorts the data by the columns of the select clause in alphabetical order.
Next, a query checksum is computed (R4). At the same time, the query can be executed to compute the result set checksum as well. The query checksum is looked up in the query store to check whether this query has been stored before.
* If yes, the computed result set checksum is compared with the stored checksum from the query store. If the hash values are equal, the citation text (which was created at an earlier time) is returned from the query store. If they differ, a genuine error has occurred. **By design, it is not possible that a timestamped query returns a different data set at a later execution date.** Every row is versioned with a timestamp, and if data changes, the new or updated rows also get a new version. The formerly valid row simply expires and is marked with an expiration date. Neither the outdated row nor the expiration date will change. However, a change can happen if the row or the expiration date is manually updated with functions not provided by the framework. Such exceptions should result in a notification to the system administrator. The user, however, should still get the stored citation text. The citation text can vary from community to community and is therefore produced by a function that can be customized within the framework.
* Otherwise, a new PID is generated for that query (R8). All metadata is stored as described in R3 (R9). The newly generated PID is embedded into the citation text, which is defined by a custom function, and returned to the user (R10).
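The decision logic of the two cases above can be sketched as follows. The dictionary `query_store`, the function names and the use of a UUID as PID are stand-ins chosen for this example; the framework's query store and PID scheme are not shown here.

```python
import hashlib
import uuid

# Hypothetical in-memory stand-in: query checksum -> stored record.
query_store = {}


def checksum(text):
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def cite(query_for_execution, result_rows, make_citation_text):
    """Return the citation text, reusing the stored one for known queries."""
    query_checksum = checksum(query_for_execution)
    result_checksum = checksum(repr(result_rows))
    stored = query_store.get(query_checksum)
    if stored is not None:
        if stored["result_checksum"] != result_checksum:
            # Should not happen by design; report it but still serve the user.
            print("notify administrator: result set changed for", stored["pid"])
        return stored["citation_text"]
    pid = str(uuid.uuid4())  # assumption: any unique identifier serves as PID
    citation_text = make_citation_text(pid)
    query_store[query_checksum] = {"pid": pid,
                                   "result_checksum": result_checksum,
                                   "citation_text": citation_text}
    return citation_text


def simple_citation(pid):
    return "Data set retrieved via query PID {0}".format(pid)


rows = [("mention_a", "person_1")]
first = cite("select ?a ?b where { ?a ?p ?b } order by ?a ?b", rows, simple_citation)
second = cite("select ?a ?b where { ?a ?p ?b } order by ?a ?b", rows, simple_citation)
print(first == second)
```

The second call finds the query checksum in the store and returns the stored citation text instead of generating a new PID.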

A user retrieves data via a user interface, either by executing a query or by using graphical features to build the data set (which, in fact, executes a query in the background). At any point the user can request the citation text by means of graphical features (e.g. a button) or terminal commands. The following example demonstrates the citation process as illustrated in figure 5.

In [None]:
query = Query(original_query, prefixes)
print(query.sparql_prefixes)
print(query.query)

## R4a - Query uniqueness: Normalize query algebra

| Number | Statement description | Normalization measure   |
|------|------|------|
|1 | WHERE clause is optional | A where clause will always be inserted |
|2 | "rdf:type" predicate can be replaced by "a" | rdf:type will always be replaced by "a" |
|3 | If all variables are selected one does not need to write a variable in the form ?s but can just use an asterisk | Variables will always be explicitly mentioned and ordered alphabetically |
|4 | If the same subject is used multiple times in subsequent triples separated by a dot it can be simplified by writing just the first subject variable name, separating the triples by semicolons and leaving out the other variable names that are equal to the one in the first triple | Triples will never be simplified and simplified triples will be made explicit |
|5 | The order of triple statements does not affect the outcome | Triple statements will be ordered alphabetically by subject, predicate, object |
|6 | Aliases via BIND keyword just rename variables but the query result stays the same | No solution yet |
|7 | Variable names in general can differ between two queries without changing the outcome. | Variables will be replaced by letters from the alphabet. Each variable is assigned a letter, starting with 'a' and continuing in sequence. Should there be more than 26 variables, two-letter combinations are assigned in the same sequence. |
|8 | Finding variables that are not bound can be written in two ways: 1. with optional keyword adding the optional triplet combined with filter condition !bound(?var); 2. with "filter not exists (triplet)" | No solution yet |
|9 | Inverting the order of the triplet (object predicate subject instead of subject predicate object) using "^" gives the same results | Inverted triples will be back-inverted and "^" will thereby be removed |
|10 | sequence paths can reduce the number of triplets in the query statement and are commonly used. | Sequence paths will be made explicit in form triple statements |
|11 | Prefixes can be interchanged in the prefix section before the query and subsequently in the query without changing the outcome. | All prefixes will be substituted with their underlying URIs |

To prove that the normalized query objects of semantically identical queries yield the same checksum, a set of queries that are semantically identical with respect to the statements above is prepared. The queries are then normalized by normalizing their individual "algebra trees", and their checksums are computed. We can see that all checksums are equal, which is the expected result.

In [None]:
prefixes = {'citing': 'http://ontology.ontotext.com/citing/',
            'pub': 'http://ontology.ontotext.com/taxonomy/',
            'xsd': 'http://www.w3.org/2001/XMLSchema#',
            'publishing': 'http://ontology.ontotext.com/publishing#'}

#1: WHERE clause is optional
query_1 = """
select ?document ?mention ?party ?person ?personLabel ?value {
    ?mention publishing:hasInstance ?person .
    ?document publishing:containsMention ?mention . 
    ?person pub:memberOfPoliticalParty ?party .
    ?person pub:preferredLabel ?personLabel .
    ?party pub:hasValue ?value .
    ?value pub:preferredLabel "Democratic Party"@en .
    filter(?personLabel = "Judy Chu"@en)
}
"""

#3: if all variables are selected one does not need to write a 
# variable in the form ?s but can just use an asterisk
query_3 = """
select * {
    ?mention publishing:hasInstance ?person .
    ?document publishing:containsMention ?mention . 
    ?person pub:memberOfPoliticalParty ?party .
    ?person pub:preferredLabel ?personLabel .
    ?party pub:hasValue ?value .
    ?value pub:preferredLabel "Democratic Party"@en .
    filter(?personLabel = "Judy Chu"@en)
}
"""

#5: The order of triple statements does not affect the outcome
query_5 = """
select ?document ?mention ?party ?person ?personLabel ?value {
    ?document publishing:containsMention ?mention . 
    ?person pub:memberOfPoliticalParty ?party .
    ?person pub:preferredLabel ?personLabel .
    ?mention publishing:hasInstance ?person .
    ?party pub:hasValue ?value .
    ?value pub:preferredLabel "Democratic Party"@en .
    filter(?personLabel = "Judy Chu"@en)
}
"""

#7: Variable names in general can differ between two queries without changing the outcome
query_7 = """
select ?mention ?x ?party ?person ?personLabel ?value {
    ?x publishing:containsMention ?mention . 
    ?person pub:memberOfPoliticalParty ?party .
    ?person pub:preferredLabel ?personLabel .
    ?mention publishing:hasInstance ?person .
    ?party pub:hasValue ?value .
    ?value pub:preferredLabel "Democratic Party"@en .
    filter(?personLabel = "Judy Chu"@en)
}
"""

queries = {"query1": query_1, 
           "query3": query_3,
           "query5": query_5, 
           "query7": query_7}

In [None]:
def query_checksum_workflow(query_string, query_name, prefixes):
    query = Query(query_string, prefixes)
    normalized_query_algebra = query.normalize_query_tree()
    query.compute_checksum("query", normalized_query_algebra)
    print("{0}: \t checksum {1} ".format(query_name, query.query_checksum))
    # print(query.normalized_query_algebra)

for query_name, query_string in queries.items():
    query_checksum_workflow(query_string, query_name, prefixes)
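To make normalization measure 7 from the table above more tangible, the following sketch renames variables on the raw query string and hashes the result. It is a deliberate simplification: the framework normalizes the algebra tree, whereas this example uses a regular expression, assigns letters in order of first occurrence and would also rewrite question marks inside literals or IRIs. The function names are hypothetical.

```python
import hashlib
import re


def letter_name(index):
    """Map 0 -> 'a', ..., 25 -> 'z', 26 -> 'aa', 27 -> 'ab', and so on."""
    if index < 26:
        return chr(97 + index)
    index -= 26
    return chr(97 + index // 26) + chr(97 + index % 26)


def rename_variables(query):
    """Replace every variable by a letter, in order of first occurrence."""
    mapping = {}

    def replace(match):
        name = match.group(0)
        if name not in mapping:
            mapping[name] = "?" + letter_name(len(mapping))
        return mapping[name]

    return re.sub(r"\?\w+", replace, query)


def checksum(query):
    return hashlib.sha256(rename_variables(query).encode("utf-8")).hexdigest()


query_a = "select ?document ?mention { ?document publishing:containsMention ?mention . }"
query_b = "select ?x ?y { ?x publishing:containsMention ?y . }"
print(rename_variables(query_a))
print(checksum(query_a) == checksum(query_b))
```

Both queries are rewritten to the same string and therefore share one checksum. Queries that also differ in the order of their select variables or triple statements additionally need the other measures from the table.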

In [None]:
# Normalize query - #1, #3, #5 and #7 from table above
normalized_query_algebra = query.normalize_query_tree()
normalized_query_algebra

## R4b - Query uniqueness: Compute query checksum

The checksum of the normalized query string is computed. The query string contains the version timestamp as a bound variable with the label TimeOfCiting. The data versioning query extensions are not needed for the checksum computation, and the SPARQL* syntax is in any case not supported by Python's query-to-algebra translator.

In [None]:
query.compute_checksum(query_or_result="query", citation_object=normalized_query_algebra)
print(query.query_checksum)

## R7 - Query Timestamping

The query is extended with the version timestamp, which is based on the last update to the selection of data affected by the query.

In [None]:
timestamp = datetime(2020, 9, 6, 12, 11, 21, 941000, vieTZObject) # candidate version timestamp (not used below); how to derive it from the last update is still open
timestamped_query_colored = query.extend_query_with_timestamp(timestamp1, colored=True) # colored output, for presentation purposes only
timestamped_query = query.extend_query_with_timestamp(timestamp1, colored=False) # plain query, used in the following steps

print(timestamped_query_colored)
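How the version timestamp is derived is left open above. One possible reading of "the timestamp of the last write operation in the returned dataset" is sketched below: take the latest start date or end date among the annotations of the returned rows that is not later than the execution time. The function `version_timestamp` and this interpretation are assumptions, not the framework's behavior.

```python
from datetime import datetime, timedelta, timezone

TZ = timezone(timedelta(hours=2))


def version_timestamp(annotations, execution_time):
    """Latest start or end date that is not later than the execution time."""
    candidates = [stamp
                  for valid_from, valid_until in annotations
                  for stamp in (valid_from, valid_until)
                  if stamp <= execution_time]
    return max(candidates) if candidates else None


# (valid_from, valid_until) pairs in the style of the result sets shown earlier.
annotations = [
    (datetime(2020, 9, 6, 10, 46, 9, 33000, TZ),
     datetime(2020, 10, 4, 12, 42, 36, 401000, TZ)),
    (datetime(2020, 9, 6, 10, 46, 9, 33000, TZ),
     datetime(9999, 12, 31, tzinfo=TZ)),
]

execution_time = datetime(2020, 9, 8, 12, 11, 21, 941000, TZ)
print(version_timestamp(annotations, execution_time))
```

With this reading, two executions between which no write happened obtain the same version timestamp and thus the same query checksum.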

## R5 - Stable Sorting

The normalized and timestamped query is now extended with the "order by" operation. The columns in this operation are placed in the same order as they are in the select-clause. 

In [None]:
query_for_execution_colored = query.extend_query_with_sort_operation(timestamped_query, colored=True) # colored output, for presentation purposes only
query.query_for_execution = query.extend_query_with_sort_operation(timestamped_query, colored=False) # plain query, used for execution

print(query_for_execution_colored) 
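The sort extension can be pictured with a small string-based sketch. The function `add_order_by` is hypothetical and assumes a normalized query, i.e. one with explicitly listed select variables and no existing "order by" clause; it is not the framework's `extend_query_with_sort_operation`.

```python
import re


def add_order_by(query):
    """Append an order by clause listing the select variables in select order."""
    select_clause = re.search(r"select\s+(.*?)\s*(?:where\s*)?\{", query,
                              flags=re.IGNORECASE | re.DOTALL)
    variables = re.findall(r"\?\w+", select_clause.group(1))
    return query.rstrip() + "\norder by " + " ".join(variables)


example_query = "select ?a ?b where { ?a pub:hasValue ?b . }"
print(add_order_by(example_query))
```

Sorting by all selected columns gives the rows a reproducible order, which is a prerequisite for a stable result set checksum.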

### Execute Query


In [None]:
# Execute the timestamped and sorted query
query_result = sparqlapi.get_data(query.query_for_execution, prefixes)
query_result

## R6 - Compute Result set checksum

Now, we compute the checksum of the result set by using the same function as for query checksum computation just with a different configuration. 

In [None]:
result_set_checksum = query.compute_checksum(query_or_result="result", object=query_result)
print(result_set_checksum)
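A result set checksum can be sketched as a hash over the rows in their delivered order. The function below is a hypothetical stand-in for the framework's `compute_checksum`; the choice of SHA-256, the separator characters and the list-of-dictionaries input are assumptions for this example.

```python
import hashlib


def result_set_checksum(rows, columns):
    """Hash the rows in their given order, reading cells in a fixed column order."""
    digest = hashlib.sha256()
    for row in rows:
        line = "\x1f".join(str(row[column]) for column in columns)
        digest.update(line.encode("utf-8"))
        digest.update(b"\n")
    return digest.hexdigest()


columns = ["mention", "person"]
rows = [
    {"mention": "mention_a", "person": "person_1"},
    {"mention": "mention_b", "person": "person_1"},
]

same_order = result_set_checksum(rows, columns)
reversed_order = result_set_checksum(list(reversed(rows)), columns)
print(same_order == result_set_checksum(rows, columns))
print(same_order == reversed_order)
```

Because the hash depends on the row order, the same rows delivered in a different order produce a different checksum, which is why the stable sorting from R5 has to precede this step.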

In **R4a - Query uniqueness: Normalize query algebra** we defined some queries in order to prove that they are semantically identical by computing and comparing their checksums. Now we can further support the proof by showing that the result set checksums are also equal to each other. We also see that the query to which we assigned a different citation date yields the same results. This can, of course, happen, as a slightly different date does not necessarily yield a different result. The only thing that must be assured is that two semantically identical queries return matching results.

In [None]:
def workflow_result_set_checksum(query_string, query_name, timestamp):
    query = Query(query_string, prefixes)
    timestamped_query = query.extend_query_with_timestamp(timestamp)
    query.query_for_execution = query.extend_query_with_sort_operation(timestamped_query)
    query_result = sparqlapi.get_data(query.query_for_execution, prefixes)
    result_set_checksum = query.compute_checksum(query_or_result="result", object=query_result)
    print("query {0}: \t result_set_checksum: {1}".format(query_name, result_set_checksum))

for query_name, query_string in queries.items():
    workflow_result_set_checksum(query_string, query_name, timestamp1)
    
workflow_result_set_checksum(query_71, "query71", timestamp1)

### Search for query checksum in query store

In [2]:
# The checksum 123 is a placeholder; in the full workflow query.query_checksum is looked up instead
query_store = qs.QueryStore("DataCitationFramework/Query store/query_store.db")
stored_query = query_store.lookup(123)
if stored_query is None:
    print("Query not found")
else:
    print(stored_query.citation_timestamp)


#### Case 1: Query not found

#### Case 2: Query found

# Resolving PIDs and Retrieving the Data

In [None]:
# Retrieve the data set using the Query PID