SPARQL Transformer evaluation
=========================

This notebook contains some quantitative measures for the evaluation of SPARQL Transformer.

In [1]:
import json
import os
import time
import numpy as np
import pandas as pd

from ipywidgets import FloatProgress
from IPython.display import display

from SPARQLWrapper import SPARQLWrapper, JSON
from SPARQLTransformer import sparqlTransformer

In [2]:
input_folder = './sparql'
ENDPOINT = 'http://dbpedia.org/sparql'

In [3]:
# collect the JSON query files and derive the matching .rq file names
json_queries_files = sorted(f for f in os.listdir(input_folder) if f.endswith('.json'))
rq_queries_files = [f.replace('.json', '.rq') for f in json_queries_files]

def read_query(filename):
    # read a query file, closing the handle afterwards
    with open(os.path.join(input_folder, filename), 'r') as fp:
        return fp.read()

json_queries = [json.loads(read_query(f)) for f in json_queries_files]
rq_queries = [read_query(f) for f in rq_queries_files]

json_queries_files

['1.Born_in_Berlin.json',
 '2.German_musicians.json',
 '3.Musicians_born_in_Berlin.json',
 '4.Soccer_players.json',
 '5.Games.json']

The test queries have been taken from the __[DBpedia wiki](https://wiki.dbpedia.org/OnlineAccess)__.

These SELECT queries have been manually converted into JSON queries, making sure that each transformed query is equivalent to the original one (apart from variable names).

The following table shows, for each query:
- `n vars`, how many variables are selected
- `levels`, how many levels are present in the JSON prototype, where `1` refers to a flat object (all properties attached to the root) and `2` to an object with one level of nesting
- `features`, the SPARQL features included in the query
        
| name                     | n vars | levels | features             |
|--------------------------|--------|--------|----------------------|
|1.Born_in_Berlin          |   4    |   1    | filter, orderby      |
|2.German_musicians        |   4    |   1    | lang filter, optional|
|3.Musicians_born_in_Berlin|   4    |   1    | lang filter          |
|4.Soccer_players          |   5    |   2    | filter, orderby      |
|5.Games                   |   2    |   1    | orderby              |
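
The `levels` column can also be computed mechanically. The following is a minimal sketch, assuming the prototype is available as a plain Python dict; `count_levels` and the two sample prototypes are hypothetical illustrations, not the content of the query files used in this notebook.

```python
def count_levels(proto):
    """Return the nesting depth of a prototype: 1 for a flat object."""
    if isinstance(proto, list):
        # a list wraps its items without adding a level of its own
        return max((count_levels(item) for item in proto), default=0)
    if not isinstance(proto, dict):
        # plain values (strings, numbers) add no level
        return 0
    return 1 + max((count_levels(v) for v in proto.values()), default=0)

# hypothetical prototypes, only meant to illustrate the two cases
flat = {'id': '?person', 'name': '?name'}
nested = {'id': '?player', 'team': {'id': '?team', 'name': '?teamName'}}

print(count_levels(flat), count_levels(nested))  # prints 1 2
```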

Functions for executing the query and returning the bindings.

- For JSON queries, we use **SPARQLTransformer**.
- For SPARQL queries, we use **SPARQLWrapper** (which is also internally used by SPARQLTransformer).

In [4]:
def sparql_exec(query):
    sparql = SPARQLWrapper(ENDPOINT)
    sparql.setQuery(query)
    sparql.setReturnFormat(JSON)
    result = sparql.query().convert()
    return result["results"]["bindings"]

def json_exec(query, debug=False):
    return sparqlTransformer(query, {'endpoint': ENDPOINT, 'debug': debug})

Functions for running the test for a particular query (sparql or json).

The test measures the **execution time** of the query (including any parsing task) and the **number of results**.

In [24]:
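def test_atom(query, typ='sparql'):
    # perf_counter is a monotonic clock, better suited than time.time for measuring durations
    start = time.perf_counter()
    if typ == 'sparql':
        r = sparql_exec(query)
    else:
        r = json_exec(query)

    timing = time.perf_counter() - start

    # number of results and elapsed time in seconds
    return len(r), timing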

We execute the test multiple times for each query, in order to obtain an average result that depends as little as possible on the network and server workload.

In particular, each test is executed `num_iteration` times, and each pair of consecutive iterations is separated by `sleep_time` seconds.
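
The repeat-and-pause scheme can be sketched in isolation as follows. This is an illustrative helper, not the code used for the measurements in this notebook: `average_time` and its `timer` and `sleep` parameters are assumptions introduced here so that the logic can be tried without contacting the endpoint.

```python
import time

def average_time(run, num_iteration, sleep_time,
                 timer=time.perf_counter, sleep=time.sleep):
    """Call `run` num_iteration times and return the mean duration in seconds."""
    total = 0.0
    for j in range(num_iteration):
        if j > 0:
            # pause only between two consecutive iterations
            sleep(sleep_time)
        start = timer()
        run()
        total += timer() - start
    return total / num_iteration

# usage with a dummy workload and no real pause
mean = average_time(lambda: sum(range(1000)), num_iteration=5, sleep_time=0)
print(mean >= 0)
```

Only the call to `run` is timed, so the pauses do not inflate the average.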

In [37]:
num_iteration = 100
sleep_time = 5

In [38]:
test_results = []

for i, (json_query, rq_query) in enumerate(zip(json_queries, rq_queries)):
    title = rq_queries_files[i].replace('.rq', '')
    print(title)

    # progress bars
    fs = FloatProgress(min=0, max=num_iteration, description='SPARQL test:')
    display(fs)
    fj = FloatProgress(min=0, max=num_iteration, description='JSON test:')
    display(fj)

    # cumulative times; the result counts keep the value of the last iteration
    sparql_time = 0
    sparql_results = 0
    json_time = 0
    json_results = 0

    for j in range(num_iteration):
        # no pause before the very first request of the whole test
        if (i + j) > 0:
            time.sleep(sleep_time)
        sparql_results, t = test_atom(rq_query, typ='sparql')
        sparql_time += t
        fs.value += 1

    for j in range(num_iteration):
        time.sleep(sleep_time)
        json_results, t = test_atom(json_query, typ='json')
        json_time += t
        fj.value += 1

    # difference relative to the mean of the two total times, in percent
    # (positive when the JSON query is faster than the SPARQL one)
    time_diff = 100 * (sparql_time - json_time) / ((sparql_time + json_time) / 2)

    test_results.append({
        'name': title,
        'time_sparql': sparql_time / num_iteration,
        'result_sparql': sparql_results,
        'time_json': json_time / num_iteration,
        'result_json': json_results,
        'time_diff': '{0:.2g}%'.format(time_diff)
    })

1.Born_in_Berlin
2.German_musicians
3.Musicians_born_in_Berlin
4.Soccer_players
5.Games

In [40]:
pd.DataFrame.from_dict(test_results)

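|   | name                     | result_json | result_sparql | time_diff | time_json | time_sparql |
|---|--------------------------|-------------|---------------|-----------|-----------|-------------|
| 0 |1.Born_in_Berlin          | 573         | 1132          | 18%       | 0.333181  | 0.400445    |
| 1 |2.German_musicians        | 278         | 317           | -21%      | 0.276772  | 0.223354    |
| 2 |3.Musicians_born_in_Berlin| 154         | 227           | 16%       | 0.261305  | 0.307008    |
| 3 |4.Soccer_players          | 77          | 85            | 8.7%      | 0.135661  | 0.148074    |
| 4 |5.Games                   | 1043        | 1084          | -27%      | 0.333539  | 0.255249    |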


The table gives us two different pieces of information.

#### Time difference

The execution time of JSON queries (`time_json`) is quite close to that of the SPARQL queries (`time_sparql`). The percentage difference (`time_diff`) was expected to never exceed 10%, which usually corresponds to a few hundredths of a second.

> This is not in line with the results above, where `time_diff` ranges from -27% to +18%. These results look anomalous; I will repeat the test outside the EURECOM network.
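
As a worked example of how `time_diff` is obtained, the sketch below applies the formula from the test loop to the mean times of `1.Born_in_Berlin` reported in the table. The helper name `time_diff_percent` is introduced here for illustration only. The loop uses total rather than mean times, but the ratio is the same because both are divided by `num_iteration`.

```python
def time_diff_percent(time_sparql, time_json):
    """Difference relative to the mean of the two times, in percent."""
    mean = (time_sparql + time_json) / 2
    return 100 * (time_sparql - time_json) / mean

# mean times (in seconds) of 1.Born_in_Berlin, taken from the table above
diff = time_diff_percent(0.400445, 0.333181)
print('{0:.2g}%'.format(diff))  # prints 18%
```

A positive value means that the JSON query was faster than the SPARQL one, a negative value that it was slower.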

#### Result difference

The number of results returned by SPARQL Transformer (`result_json`) is always lower than the number of bindings returned by the endpoint (`result_sparql`). This is because the endpoint represents every combination of values as a distinct binding, while SPARQL Transformer aggregates the bindings that share the same id into a single result.

### Example of result for `1.Born_in_Berlin`.

An interesting case is the 2nd result, about [Prince Adalbert of Prussia](http://dbpedia.org/resource/Prince_Adalbert_of_Prussia_(1811–1873)), which has 4 names and 2 differently formatted death dates. The endpoint represents it with 4 * 2 = 8 bindings, which SPARQL Transformer merges into a single object.
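
The merging step can be illustrated with a simplified sketch. This is not SPARQL Transformer's actual implementation: `merge_bindings` is a hypothetical helper that groups flat bindings by the value of one variable and collects the distinct values of the others. Language tags and datatypes are dropped for brevity, and the sample values are those of the result discussed above.

```python
from itertools import product

def merge_bindings(bindings, id_var):
    """Group flat SPARQL bindings by id_var, collecting distinct values per variable."""
    merged = {}
    for binding in bindings:
        key = binding[id_var]['value']
        obj = merged.setdefault(key, {'id': key})
        for var, cell in binding.items():
            if var == id_var:
                continue
            values = obj.setdefault(var, [])
            if cell['value'] not in values:
                values.append(cell['value'])
    # unwrap the properties that ended up with a single value
    return [
        {k: (v[0] if isinstance(v, list) and len(v) == 1 else v) for k, v in obj.items()}
        for obj in merged.values()
    ]

uri = 'http://dbpedia.org/resource/Prince_Adalbert_of_Prussia_(1811–1873)'
names = ['()', '(Henry William Adalbert)', 'Adalbert of Prussia', 'Prince Adalbert']
deaths = ['1873-06-06', '1873-6-6']

# the endpoint returns one binding per combination: 4 names * 2 death dates
bindings = [
    {'person': {'value': uri}, 'birth': {'value': '1811-10-29'},
     'death': {'value': d}, 'name': {'value': n}}
    for n, d in product(names, deaths)
]

result = merge_bindings(bindings, 'person')
print(len(bindings), len(result))  # prints 8 1
```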

In [5]:
# SPARQL query
sparql_exec(rq_queries[0])[1:9]

[{'birth': {'datatype': 'http://www.w3.org/2001/XMLSchema#date',
   'type': 'typed-literal',
   'value': '1811-10-29'},
  'death': {'datatype': 'http://www.w3.org/2001/XMLSchema#date',
   'type': 'typed-literal',
   'value': '1873-06-06'},
  'name': {'type': 'literal', 'value': '()', 'xml:lang': 'en'},
  'person': {'type': 'uri',
   'value': 'http://dbpedia.org/resource/Prince_Adalbert_of_Prussia_(1811–1873)'}},
 {'birth': {'datatype': 'http://www.w3.org/2001/XMLSchema#date',
   'type': 'typed-literal',
   'value': '1811-10-29'},
  'death': {'datatype': 'http://www.w3.org/2001/XMLSchema#date',
   'type': 'typed-literal',
   'value': '1873-6-6'},
  'name': {'type': 'literal', 'value': '()', 'xml:lang': 'en'},
  'person': {'type': 'uri',
   'value': 'http://dbpedia.org/resource/Prince_Adalbert_of_Prussia_(1811–1873)'}},
 {'birth': {'datatype': 'http://www.w3.org/2001/XMLSchema#date',
   'type': 'typed-literal',
   'value': '1811-10-29'},
  'death': {'datatype': 'http://www.w3.org/2001/XM

In [6]:
# JSON query
json_exec(json_queries[0])[1]

{'birth': '1811-10-29',
 'death': ['1873-06-06', '1873-6-6'],
 'id': 'http://dbpedia.org/resource/Prince_Adalbert_of_Prussia_(1811–1873)',
 'name': [{'language': 'en', 'value': '()'},
  {'language': 'en', 'value': '(Henry William Adalbert)'},
  {'language': 'en', 'value': 'Adalbert of Prussia'},
  {'language': 'en', 'value': 'Prince Adalbert'}]}