# Reproducible Workflow
This notebook is intended to be a walkthrough of the paper results with examples that convey the main idea. To load the LUBM graph we use a public endpoint on [Dydra](http://dydra.com). For larger test cases, however, we run our tests on Apache Jena. If you have LUBM on your own local or public endpoint, you can load it from there as well. Please note that the LUBM graph we use is _materialized_, and the inferencing is **RDFS**.

# Relaxation
Relaxation is the standard baseline for reformulating SPARQL queries. A lot of related work exists on reformulating SPARQL queries. The ideas are inherently based on *flexible* querying. In this notion, the different conditions in the input query are *loosened* or *relaxed* to give more results. This can also be looked at as *exploratory* querying. 

## SPARQL Query Relaxation
Let's have a look at this hierarchy from LUBM covering all teaching faculty in a university.
* Employee 
* * Faculty 
* * * Professor 
* * * * VisitingProfessor 
* * * * FullProfessor 
* * * * Dean 
* * * * Chair 
* * * * AssociateProfessor 
* * * * AssistantProfessor 
* * * * PostDoc 
* * * * Lecturer 

Let's look at how relaxation helps in getting more answers. Consider the following query on LUBM:


```
Select ?teacher where {
    ?teacher a Lecturer .
}
```

Let's see the results that we get from LUBM.

In [15]:
from rdflib import Graph
g = Graph()
g.parse('lubm_saturated.nt',format="nt")
print("Total triple statements in LUBM is " + str(len(g)))

Total triple statements in LUBM is 144299


In [2]:
from SPARQLWrapper import SPARQLWrapper, JSON
from IPython.display import display
import pandas as pd
import json
import numpy as np
from pandas import json_normalize  # top-level import; the pandas.io.json path is deprecated since pandas 1.0

pd.options.display.max_colwidth = 100
pd.options.display.max_rows = 999

#Procedure to execute a SPARQL Query and get a pandas object out of it
def execute_query(sparqlQuery):
    sparql = SPARQLWrapper("https://dydra.com/amarviswanathan/lubm/sparql")
    sparql.setQuery(sparqlQuery)
    sparql.setReturnFormat(JSON)
    results = sparql.query().convert()
    results_df = json_normalize(results["results"]["bindings"])
    return results_df
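
The column names used later in this notebook (for example `course.value` and `o.type`) come from how `json_normalize` flattens the nested SPARQL JSON bindings. A minimal pure-Python sketch of that flattening follows. The helper name `flatten_bindings` is hypothetical, and the sample binding is hand-written for illustration rather than copied from endpoint output.

```python
def flatten_bindings(bindings):
    """Flatten SPARQL JSON bindings into one flat dict per result row.

    Each binding maps a variable name to a dict such as
    {"type": "uri", "value": "..."}; the flattened keys take the form
    "<variable>.<field>", mirroring the columns json_normalize produces.
    """
    rows = []
    for binding in bindings:
        row = {}
        for variable, term in binding.items():
            for field, content in term.items():
                row[f"{variable}.{field}"] = content
        rows.append(row)
    return rows


# Hand-written sample in the SPARQL JSON results shape (illustrative only).
sample = [
    {"course": {"type": "uri",
                "value": "http://www.Department14.University0.edu/Course4"}}
]
flat = flatten_bindings(sample)
print(flat[0]["course.value"])
```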



In [3]:
%%time
query  = """

PREFIX  ub: <http://swat.cse.lehigh.edu/onto/univ-bench.owl#>
Select ?teacher where {
    ?teacher a ub:Lecturer .
}

"""
results_df = execute_query(query)

#Show the top-5 results
results_df.head()

CPU times: user 4 ms, sys: 4 ms, total: 8 ms
Wall time: 396 ms


In [4]:
#Total number of lecturers
shape = results_df.shape
print("There are " + str(shape[0]) + " lecturers")

There are 93 lecturers


But if the user is not satisfied with these 93 lecturers and wants to find more of them, a simple way is to relax by moving up in the hierarchy. So we move from Lecturer to Professor. This gives us the following query : 

``` 
Select ?teacher where {
    ?teacher a ub:Professor .
}
```

In [5]:
%%time 
relaxed_query  = """

PREFIX  ub: <http://swat.cse.lehigh.edu/onto/univ-bench.owl#>
Select ?teacher where {
    ?teacher a ub:Professor .
}
"""
results_df = execute_query(relaxed_query)

#Show the top-5 results
results_df.head()

CPU times: user 16 ms, sys: 0 ns, total: 16 ms
Wall time: 588 ms


In [6]:
#Total number of professors
shape = results_df.shape
print("There are " + str(shape[0]) + " Professors")

There are 447 Professors


The above result gives us 447 Professors, each of whom may be any type under the hierarchy of **Professor**.

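The step from `Lecturer` to `Professor` can be sketched as a lookup in a child-to-parent map. This is only an illustration under stated assumptions: the `PARENT` map is copied by hand from the hierarchy as listed in this notebook (not read from the ontology), and the helper names `relax_type` and `type_query` are hypothetical.

```python
# Child -> parent links, copied from the hierarchy as listed in this notebook.
PARENT = {
    "Lecturer": "Professor",
    "PostDoc": "Professor",
    "FullProfessor": "Professor",
    "Professor": "Faculty",
    "Faculty": "Employee",
}

UB_PREFIX = "PREFIX  ub: <http://swat.cse.lehigh.edu/onto/univ-bench.owl#>"


def relax_type(class_name):
    """Return the direct parent class, or None when there is nothing above it."""
    return PARENT.get(class_name)


def type_query(class_name):
    """Build the 'all instances of this class' query used in the cells above."""
    return (
        f"{UB_PREFIX}\n"
        "Select ?teacher where {\n"
        f"    ?teacher a ub:{class_name} .\n"
        "}"
    )


relaxed_class = relax_type("Lecturer")
print(type_query(relaxed_class))
```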
## Instance Query 
Let us look at an instance of a **Professor** i.e. **<http://www.Department14.University0.edu/FullProfessor4>** and see what courses this Professor teaches.

In [7]:
%%time 
entity_query = """
PREFIX  ub: <http://swat.cse.lehigh.edu/onto/univ-bench.owl#>
Select ?course where {
   <http://www.Department14.University0.edu/FullProfessor4> ub:teacherOf  ?course .
}
"""

results_df = execute_query(entity_query)
results_df = results_df[['course.value']]
# Show all the results
results_df

CPU times: user 4 ms, sys: 4 ms, total: 8 ms
Wall time: 389 ms


So **FullProfessor4** teaches `2 Graduate Courses` and `1 Course`. **If** the user wants more answers, an automatic way would be to relax the query. Thus the system would relax this entity value. However, the entity has **no hierarchy**, which means it ends up being relaxed to a variable. This is known as `simple relaxation`. The relaxed query then becomes


```
Select ?course where {
    ?teacher ub:teacherOf ?course .
}
```


In [8]:
%%time 
entity_relaxed_query = """
PREFIX  ub: <http://swat.cse.lehigh.edu/onto/univ-bench.owl#>
Select ?teacher ?course where {
   ?teacher ub:teacherOf  ?course .
}
"""

relax_df = execute_query(entity_relaxed_query)

# Show all the results
relax_df
print("The total courses are " + str(relax_df.shape[0]))

relax_df.head()

The total courses are 1627
CPU times: user 68 ms, sys: 0 ns, total: 68 ms
Wall time: 1.88 s

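Simple relaxation, as used in the cell above, amounts to replacing the entity with a variable in the triple pattern. A minimal sketch, assuming triples are represented as plain `(subject, predicate, object)` tuples of strings; the helpers `relax_entity` and `to_pattern` are hypothetical names introduced here:

```python
def relax_entity(triple, entity, variable="?teacher"):
    """Replace every occurrence of `entity` in a (s, p, o) triple with a variable."""
    return tuple(variable if term == entity else term for term in triple)


def to_pattern(triple):
    """Render a (s, p, o) tuple as a SPARQL triple pattern."""
    return " ".join(triple) + " ."


entity = "<http://www.Department14.University0.edu/FullProfessor4>"
original = (entity, "ub:teacherOf", "?course")
relaxed = relax_entity(original, entity)
print(to_pattern(relaxed))  # ?teacher ub:teacherOf ?course .
```

Because the entity has no class hierarchy to climb, the variable is the only available generalization, which is why the result set grows so sharply.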

### Motivation
In this case, we end up relaxing the query to find _anybody who teaches any course_. This gives `1627 results` and is very _generalized_. While this is logically right, wouldn't it be more beneficial if the system returned courses that are more similar to what **FullProfessor4** teaches?


# Goal

To address this issue, we present a technique where we utilize the _entity_ statements present in the graph to suggest reformulations. Let us see how this makes sense. Entities have properties (_predicate_) and values (_object_) in the graph. For example, the entity **FullProfessor4** has this triple associated with it:

| Subject        | Predicate           | Object  |
| ------------- |:-------------:| -----:|
|**FullProfessor4**|teacherOf|GraduateCourse5|


One could utilize these values much more effectively to create _triple patterns_ that can be appended back to the original query. This can then be used to suggest reformulations. We call a **predicate** and **object** value pair a _feature_. Since we utilize these features to create reformulations, we call our method
**Feature based reformulation of entities in triple pattern queries**.

The features provide more _information_ and _context_ about an entity. Let us see how features can be used. To do that we print out the features of **FullProfessor4** from LUBM.

In [9]:
entity_statement_query = """
PREFIX  ub: <http://swat.cse.lehigh.edu/onto/univ-bench.owl#>
Select ?p ?o where {
   <http://www.Department14.University0.edu/FullProfessor4> ?p ?o .
}
"""

results_df = execute_query(entity_statement_query)

# Show all the results
results_df


Unnamed: 0,o.datatype,o.type,o.value,p.type,p.value
0,,uri,http://www.w3.org/2000/01/rdf-schema#Resource,uri,http://www.w3.org/1999/02/22-rdf-syntax-ns#type
1,,uri,http://swat.cse.lehigh.edu/onto/univ-bench.owl#FullProfessor,uri,http://www.w3.org/1999/02/22-rdf-syntax-ns#type
2,,uri,http://swat.cse.lehigh.edu/onto/univ-bench.owl#Faculty,uri,http://www.w3.org/1999/02/22-rdf-syntax-ns#type
3,,uri,http://swat.cse.lehigh.edu/onto/univ-bench.owl#Person,uri,http://www.w3.org/1999/02/22-rdf-syntax-ns#type
4,,uri,http://swat.cse.lehigh.edu/onto/univ-bench.owl#Professor,uri,http://www.w3.org/1999/02/22-rdf-syntax-ns#type
5,,uri,http://swat.cse.lehigh.edu/onto/univ-bench.owl#Employee,uri,http://www.w3.org/1999/02/22-rdf-syntax-ns#type
6,http://www.w3.org/2001/XMLSchema#string,literal,FullProfessor4,uri,http://swat.cse.lehigh.edu/onto/univ-bench.owl#name
7,,uri,http://www.Department14.University0.edu/GraduateCourse5,uri,http://swat.cse.lehigh.edu/onto/univ-bench.owl#teacherOf
8,,uri,http://www.Department14.University0.edu/Course4,uri,http://swat.cse.lehigh.edu/onto/univ-bench.owl#teacherOf
9,,uri,http://www.Department14.University0.edu/GraduateCourse4,uri,http://swat.cse.lehigh.edu/onto/univ-bench.owl#teacherOf


From the above result we see that the `literal` values don't add more information to the triple except that they are string values for an entity. Moreover, they don't have any statements associated with them. So we filter the literal values out first.

In [10]:
results_df = results_df[results_df['o.type'] == 'uri']
results_df = results_df[['p.value','o.value']]
results_df

Unnamed: 0,p.value,o.value
0,http://www.w3.org/1999/02/22-rdf-syntax-ns#type,http://www.w3.org/2000/01/rdf-schema#Resource
1,http://www.w3.org/1999/02/22-rdf-syntax-ns#type,http://swat.cse.lehigh.edu/onto/univ-bench.owl#FullProfessor
2,http://www.w3.org/1999/02/22-rdf-syntax-ns#type,http://swat.cse.lehigh.edu/onto/univ-bench.owl#Faculty
3,http://www.w3.org/1999/02/22-rdf-syntax-ns#type,http://swat.cse.lehigh.edu/onto/univ-bench.owl#Person
4,http://www.w3.org/1999/02/22-rdf-syntax-ns#type,http://swat.cse.lehigh.edu/onto/univ-bench.owl#Professor
5,http://www.w3.org/1999/02/22-rdf-syntax-ns#type,http://swat.cse.lehigh.edu/onto/univ-bench.owl#Employee
7,http://swat.cse.lehigh.edu/onto/univ-bench.owl#teacherOf,http://www.Department14.University0.edu/GraduateCourse5
8,http://swat.cse.lehigh.edu/onto/univ-bench.owl#teacherOf,http://www.Department14.University0.edu/Course4
9,http://swat.cse.lehigh.edu/onto/univ-bench.owl#teacherOf,http://www.Department14.University0.edu/GraduateCourse4
10,http://swat.cse.lehigh.edu/onto/univ-bench.owl#undergraduateDegreeFrom,http://www.University214.edu


The element in `row 0` is too generic because it tags anything as a resource. So instead, let's eyeball some interesting properties and change our initial query. For the sake of this example I pick

| Predicate        | Object           |
| ------------- |:-------------:|
|**mastersDegreeFrom**|University912.edu|
|**memberOf**|Department14.University0.edu|

The above two properties say something about the entity **FullProfessor4**: it has `mastersDegreeFrom University912.edu` and is `memberOf Department14.University0.edu`. To utilize these features in a query, one just has to replace the entity with a variable so that each statement becomes a valid _triple pattern_. This is shown in the table below:

<table>
<tr><th>Entity Statements </th><th> Entity Patterns</th></tr>
<tr><td>

|Entity| Predicate | Value|
|--|--|--|
|**FullProfessor4**| **mastersDegreeFrom**|University912.edu|
|**FullProfessor4**| **memberOf**|Department14.University0.edu|

</td><td>

|Variable|Predicate|Value| 
|--|--|--|
|?x|**mastersDegreeFrom**|University912.edu|
|?x|**memberOf**|Department14.University0.edu|

</td></tr>
</table>

Now let us pick the first pattern and add it back to the original query. Then let's add the second pattern to the original query independently. This results in two reformulated queries that look like:

```
Select ?course where {
   ?x ub:teacherOf  ?course .
   ?x ub:mastersDegreeFrom <http://www.University912.edu> .
}
```

```
Select ?course where {
   ?x ub:teacherOf  ?course .
   ?x ub:memberOf <http://www.Department14.University0.edu> .
}
```
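
Building such reformulated queries from features can be sketched as follows. This is an illustration under stated assumptions rather than the implementation used for the paper: a feature is taken to be a `(predicate, object)` pair of strings, and `feature_to_pattern` and `reformulate` are hypothetical helper names.

```python
UB_PREFIX = "PREFIX  ub: <http://swat.cse.lehigh.edu/onto/univ-bench.owl#>"


def feature_to_pattern(feature, variable="?x"):
    """Turn a (predicate, object) feature into a triple pattern on `variable`."""
    predicate, obj = feature
    return f"{variable} {predicate} {obj} ."


def reformulate(base_pattern, feature, select="?course", variable="?x"):
    """Append one feature pattern to the relaxed base pattern and build a query."""
    body = "\n   ".join([base_pattern, feature_to_pattern(feature, variable)])
    return f"{UB_PREFIX}\nSelect {select} where {{\n   {body}\n}}"


# The two features picked by hand in this notebook.
features = [
    ("ub:mastersDegreeFrom", "<http://www.University912.edu>"),
    ("ub:memberOf", "<http://www.Department14.University0.edu>"),
]

# One reformulation per feature, each applied independently to the base pattern.
queries = [reformulate("?x ub:teacherOf ?course .", f) for f in features]
print(queries[0])
```

Each resulting string could be passed to `execute_query` in the same way as the hand-written queries.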


The initial query was:

* Select courses taught by **FullProfessor4**

Adding the two new features the query becomes :
* Select courses taught by ?x who has a mastersDegreeFrom `University912.edu` 
* Select courses taught by ?x who is a member of `Department14.University0.edu` .

Both of the above queries are more contextual and give more precise answers than the initial relaxation, which read as
* Select all courses taught by any teacher

So let's run the reformulations to see the results.

In [11]:
%%time 
entity_reformulated_query = """
PREFIX  ub: <http://swat.cse.lehigh.edu/onto/univ-bench.owl#>
Select ?course where {
   ?x ub:teacherOf ?course .
   ?x ub:mastersDegreeFrom <http://www.University912.edu> .
   
}
"""

ref_1 = execute_query(entity_reformulated_query)

# Show all the results
print("The number of courses now is " + str(ref_1.shape[0]))
ref_1.head()

The number of courses now is 5
CPU times: user 8 ms, sys: 0 ns, total: 8 ms
Wall time: 566 ms


In [12]:
%%time 
entity_reformulated_query = """
PREFIX  ub: <http://swat.cse.lehigh.edu/onto/univ-bench.owl#>
Select ?course where {
   ?x ub:teacherOf ?course .
   ?x ub:memberOf <http://www.Department14.University0.edu> .
   
}
"""

ref_2 = execute_query(entity_reformulated_query)

# Show all the results
print("The number of courses now is " + str(ref_2.shape[0]))
ref_2.head()

The number of courses now is 97
CPU times: user 8 ms, sys: 0 ns, total: 8 ms
Wall time: 813 ms


Now let's find the common results between the two reformulations.

In [13]:
overall_ref = pd.merge(ref_1,ref_2,how='inner',on=['course.type','course.value'])
overall_ref

Unnamed: 0,course.type,course.value
0,uri,http://www.Department14.University0.edu/GraduateCourse5
1,uri,http://www.Department14.University0.edu/Course4
2,uri,http://www.Department14.University0.edu/GraduateCourse4


Let's find the results that are not common between the two reformulations.



In [14]:
merged_df = pd.concat([ref_1,ref_2])
merged_df  = merged_df.drop_duplicates(keep=False)
merged_df.shape[0]

96

So we now have the following comparisons between relaxation and reformulation

### Results Comparison
| Original Query        | **Relaxation**           | Ref-1 | Ref-2 | Combined |
| ------------- |:-------------:|:--------:|:-------:|:---------|
|3|1627|5|97|99|

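The `Combined` column follows from inclusion-exclusion over the two reformulation result sets: 5 + 97 - 3 = 99, and the 96 results that are not common are the 99 combined minus the 3 shared ones. A small sketch of the same bookkeeping with Python sets, where `compare_results` is a hypothetical helper and the toy identifiers are placeholders rather than LUBM data:

```python
def compare_results(first, second):
    """Count common, non-common, and combined items of two result collections."""
    a, b = set(first), set(second)
    return {
        "common": len(a & b),      # intersection, like the inner merge
        "not_common": len(a ^ b),  # symmetric difference
        "combined": len(a | b),    # union
    }


# Toy identifiers for illustration only.
toy_ref_1 = {"c1", "c2", "c3"}
toy_ref_2 = {"c2", "c3", "c4", "c5"}
summary = compare_results(toy_ref_1, toy_ref_2)
print(summary)

# The same identities applied to the counts reported in this notebook.
combined = 5 + 97 - 3
not_common = combined - 3
print(combined, not_common)
```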
### Time Comparison
| **Relaxation** | Ref-1 | Ref-2 |
|:-------------:|:--------:|:-------:|
|1880ms|566ms|813ms|

* The relaxation returns far more results (1627) than either reformulation (5 and 97), which makes this kind of reformulation more precise.
* In addition, the timings show that the reformulated queries also ran in less time than the relaxed version of the query.

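To put numbers on both points, the ratios relative to the relaxed query can be computed directly from the two tables. This is a rough sketch based on the single-run wall times and result counts reported in this notebook; `ratio_to_baseline` is a hypothetical helper name.

```python
# Wall-clock times (ms) and result counts as reported in this notebook.
times_ms = {"Relaxation": 1880, "Ref-1": 566, "Ref-2": 813}
result_counts = {"Relaxation": 1627, "Ref-1": 5, "Ref-2": 97}


def ratio_to_baseline(values, baseline="Relaxation"):
    """Return baseline / value for every non-baseline entry."""
    base = values[baseline]
    return {name: base / v for name, v in values.items() if name != baseline}


speedups = ratio_to_baseline(times_ms)
reductions = ratio_to_baseline(result_counts)
for name in speedups:
    print(f"{name}: {speedups[name]:.2f}x faster, "
          f"{reductions[name]:.1f}x fewer results")
```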
This can be visualized as 

![Chart comparing relaxation and reformulation](files/images/Chart.png)
