# 📘 KG course SPARQL notebook

A notebook to run SPARQL queries for the KG course at UM DACS.

1. Update the `g.parse()` calls in the first cell to import your RDF files.
2. In the same folder as the notebook (or in a subfolder such as `queries/`), create files with your SPARQL queries (e.g. `q1.rq`), and execute them with `run_query(g, 'q1.rq')`, passing the path relative to the notebook.

Use the `.rq` file extension to get SPARQL syntax coloration.

In [1]:
import sys
!{sys.executable} -m pip install pandas oxrdflib Pygments

Collecting oxrdflib
  Downloading oxrdflib-0.4.0-py3-none-any.whl.metadata (6.5 kB)
Collecting pyoxigraph~=0.4.2 (from oxrdflib)
  Downloading pyoxigraph-0.4.8-cp38-abi3-win_amd64.whl.metadata (5.6 kB)
Downloading oxrdflib-0.4.0-py3-none-any.whl (10 kB)
Downloading pyoxigraph-0.4.8-cp38-abi3-win_amd64.whl (4.8 MB)
Installing collected packages: pyoxigraph, oxrdflib
Successfully installed oxrdflib-0.4.0 pyoxigraph-0.4.8


In [2]:

import pandas as pd
from IPython.display import display, HTML
from pygments import highlight
from pygments.lexers import SparqlLexer
from pygments.formatters import HtmlFormatter
from rdflib import Graph

def run_query(graph, query_path):
    """Run the SPARQL SELECT query stored in `query_path` against `graph` and display the result."""
    try:
        with open(query_path, 'r', encoding='utf-8') as file:
            query = file.read()
    except OSError as e:
        # Report the underlying reason (missing file, permissions, ...) instead of hiding it
        print(f"Could not read {query_path}: {e}")
        return
    results = graph.query(query)
    # Display the SPARQL query with syntax highlighting
    formatted_query = highlight(query, SparqlLexer(), HtmlFormatter(style='solarized-dark', full=True, nobackground=True))
    display(HTML(formatted_query))
    # Convert results to a Pandas DataFrame, keeping the column names even when no rows are returned
    columns = [str(var) for var in results.vars]
    res_list = [[str(item) for item in row] for row in results]
    df = pd.DataFrame(res_list, columns=columns)
    # Display the DataFrame as a table in Jupyter Notebook
    display(HTML(df.to_html()))

g = Graph(store="Oxigraph")


g.parse("data/food_kg.ttl")

print(f"Working with {len(g)} triples")

Failed to convert Literal lexical form to value. Datatype=http://www.w3.org/2001/XMLSchema#duration, Converter=<function parse_xsd_duration at 0x0000022CF53A9EE0>
Traceback (most recent call last):
  File "c:\Users\domin\anaconda3\envs\kgraph\Lib\site-packages\rdflib\term.py", line 2163, in _castLexicalToPython
    return conv_func(lexical)  # type: ignore[arg-type]
  File "c:\Users\domin\anaconda3\envs\kgraph\Lib\site-packages\rdflib\xsd_datetime.py", line 433, in parse_xsd_duration
    raise ValueError("Unable to parse duration string " + dur_string)
ValueError: Unable to parse duration string nan

Working with 1801 triples


1. Identify one type of quality check different than above, write and run SPARQL to implement the check and return the violating entities.

## Accuracy Quality Assessment

In [3]:
run_query(g, 'queries/q1.rq')

| | recipe | cookTime | prepTime |
|---|---|---|---|
| 0 | http://kg-course/food-nutrition/recipe/41 | PT20M | P1D |
| 1 | http://kg-course/food-nutrition/recipe/148683 | PT2H | P0D |


### Explanation
This query retrieves recipes whose `cookTime` or `prepTime` falls outside a plausible range (less than 1 minute or more than 12 hours). Two recipes are returned. Recipe 148683 has a `prepTime` of `P0D`, i.e. zero. That could be legitimate for some recipes, but this one is a soup that requires chopping vegetables, so a prep time of zero is unrealistic at best. Recipe 41 has a `prepTime` of `P1D`. A prep time of one day is long but not necessarily wrong, so 12 hours may be too low as an upper threshold.
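
The logic of such a range check can be illustrated outside SPARQL with a small Python sketch. It is not the content of `q1.rq`: the helper names `duration_minutes` and `is_implausible` are hypothetical, and the pattern assumes durations use only the day, hour, minute, and second designators that appear in the output above.

```python
import re

# Illustrative pattern covering only the D/H/M/S designators (no years, months, or weeks).
DURATION = re.compile(r"^P(?:(\d+)D)?(?:T(?:(\d+)H)?(?:(\d+)M)?(?:(\d+)S)?)?$")

MIN_MINUTES = 1        # lower bound mentioned in the explanation
MAX_MINUTES = 12 * 60  # upper bound mentioned in the explanation


def duration_minutes(value):
    """Return the duration in minutes, or None if the string does not match the pattern."""
    match = DURATION.match(value)
    # "P" alone or a trailing "T" has no component and is not a usable duration.
    if match is None or value == "P" or value.endswith("T"):
        return None
    days, hours, minutes, seconds = (int(part or 0) for part in match.groups())
    return days * 24 * 60 + hours * 60 + minutes + seconds / 60


def is_implausible(value):
    """Flag parseable durations that fall outside the plausible range."""
    total = duration_minutes(value)
    return total is not None and not (MIN_MINUTES <= total <= MAX_MINUTES)


for value in ["PT20M", "P1D", "PT2H", "P0D"]:
    print(value, duration_minutes(value), is_implausible(value))
```

Raising `MAX_MINUTES` to a day or more would stop `P1D` from being flagged while still catching `P0D`.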


2. Identify a second type of quality check different than above, write and run SPARQL to implement the check and return the violating entities.

In [4]:
run_query(g, 'queries/q2.rq')

| | recipe | property | valueStr |
|---|---|---|---|
| 0 | http://kg-course/food-nutrition/recipe/48 | cookTime | |
| 1 | http://kg-course/food-nutrition/recipe/46 | cookTime | |
| 2 | http://kg-course/food-nutrition/recipe/337283 | cookTime | |
| 3 | http://kg-course/food-nutrition/recipe/280584 | cookTime | |
| 4 | http://kg-course/food-nutrition/recipe/162371 | cookTime | |


### Explanation
This query looks for recipes whose `cookTime` or `prepTime` value contains the string "nan". It correctly identifies five recipes, all through their `cookTime`.

3. Identify a third type of quality check different than above, write and run SPARQL to implement the check and return the violating entities.

In [5]:
run_query(g, 'queries/q3.rq')

| | recipe | ingredient |
|---|---|---|
| 0 | http://kg-course/food-nutrition/recipe/41 | 1 mushrooms |
| 1 | http://kg-course/food-nutrition/recipe/323316 | 1 confectioners' sugar |
| 2 | http://kg-course/food-nutrition/recipe/305119 | 1 pecans |
| 3 | http://kg-course/food-nutrition/recipe/121241 | 1 eggs |
| 4 | http://kg-course/food-nutrition/recipe/100573 | 1 pecans |


### Explanation
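This query looks for grammatical mistakes in the ingredients: specifically, ingredients with a quantity of 1 that are nevertheless written in the plural, such as "1 eggs". The query returned 5 results, which look like valid hits, although "1 confectioners' sugar" is arguably a false positive because the plural belongs to the ingredient's name.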
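
A rough version of this heuristic can be sketched in Python. This is an illustration under stated assumptions, not the logic of `q3.rq`: the function name is hypothetical, only the last word of the ingredient is inspected, and words that end in "s" in the singular (for example "asparagus") would still be flagged.

```python
def looks_like_plural_mismatch(ingredient):
    """Heuristic: quantity is exactly 1 but the last word looks plural."""
    parts = ingredient.split()
    if len(parts) < 2 or parts[0] != "1":
        return False
    # "1 1/2 cups" is a mixed fraction, not a quantity of exactly 1.
    if parts[1][0].isdigit():
        return False
    last = parts[-1].lower()
    # Skip words ending in "ss" such as "glass", which are singular.
    return last.endswith("s") and not last.endswith("ss")


samples = ["1 eggs", "1 egg", "2 eggs", "1 pecans", "1 confectioners' sugar", "1 1/2 cups"]
flagged = [s for s in samples if looks_like_plural_mismatch(s)]
print(flagged)
```

Because only the last word is checked, "1 confectioners' sugar" is not flagged by this variant.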

4. Identify a fourth type of quality check different than above, write and run SPARQL to implement the check and return the violating entities.

In [6]:
run_query(g, 'queries/q4.rq')

| | recipe | property | durStr |
|---|---|---|---|
| 0 | http://kg-course/food-nutrition/recipe/74837 | prepTime | 45 |
| 1 | http://kg-course/food-nutrition/recipe/48 | cookTime | |
| 2 | http://kg-course/food-nutrition/recipe/46 | cookTime | |
| 3 | http://kg-course/food-nutrition/recipe/41 | prepTime | P1D |
| 4 | http://kg-course/food-nutrition/recipe/337283 | cookTime | |
| 5 | http://kg-course/food-nutrition/recipe/280584 | cookTime | |
| 6 | http://kg-course/food-nutrition/recipe/162371 | cookTime | |


### Explanation - consistency issue
This query checks for recipes whose `cookTime` or `prepTime` does not conform to the ISO 8601 duration format. It returns 7 recipes, 5 of which were already found by the previous query, which makes this a more extensive check for the same issue. Note that `P1D` (recipe 41) is itself a valid ISO 8601 duration, so that row looks like a false positive of the pattern rather than a real violation.
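
To make the conformance check concrete, here is a minimal Python sketch of a duration pattern. It is an assumption for illustration, not the pattern used in `q4.rq`; the name `ISO_DURATION` is hypothetical, and the pattern is a simplification that accepts the `PnYnMnWnDTnHnMnS` shape with at least one component.

```python
import re

# Simplified ISO 8601 duration pattern (illustrative, not a full validator).
ISO_DURATION = re.compile(
    r"^P(?=\d|T\d)"                          # at least one component must follow "P"
    r"(?:\d+Y)?(?:\d+M)?(?:\d+W)?(?:\d+D)?"  # date part
    r"(?:T(?=\d)(?:\d+H)?(?:\d+M)?(?:\d+(?:\.\d+)?S)?)?$"  # optional time part
)

values = ["PT20M", "P1D", "P0D", "45", "nan", "PT"]
non_conforming = [v for v in values if ISO_DURATION.match(v) is None]
print(non_conforming)
```

With this pattern, a bare number such as `45` and the placeholder `nan` fail, while `P1D` passes.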

5. Identify a fifth type of quality check different than above, write and run SPARQL to implement the check and return the violating entities.

In [7]:
run_query(g, 'queries/q5.rq')

| | recipe | property | numValues |
|---|---|---|---|
| 0 | http://kg-course/food-nutrition/recipe/42/nutrition | https://schema.org/sugarContent | 4 |
| 1 | http://kg-course/food-nutrition/recipe/42/nutrition | https://schema.org/cholesterolContent | 3 |
| 2 | http://kg-course/food-nutrition/recipe/48/nutrition | https://schema.org/fiberContent | 6 |
| 3 | http://kg-course/food-nutrition/recipe/48/nutrition | https://schema.org/proteinContent | 6 |
| 4 | http://kg-course/food-nutrition/recipe/42/nutrition | https://schema.org/carbohydrateContent | 4 |
| 5 | http://kg-course/food-nutrition/recipe/48/nutrition | https://schema.org/sugarContent | 6 |
| 6 | http://kg-course/food-nutrition/recipe/48/nutrition | https://schema.org/cholesterolContent | 6 |
| 7 | http://kg-course/food-nutrition/recipe/42/nutrition | https://schema.org/calories | 4 |
| 8 | http://kg-course/food-nutrition/recipe/48/nutrition | https://schema.org/carbohydrateContent | 6 |
| 9 | http://kg-course/food-nutrition/recipe/42/nutrition | https://schema.org/sodiumContent | 4 |


### Explanation

This query checks for properties that have multiple values on a single node. Properties such as `schema:calories` should typically have a single numeric value, not several. The results show that the nutrition nodes of 2 recipes (42 and 48) carry multiple values for their nutrition properties (see `numValues` above).
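
The counting behind such a cardinality check can be sketched in Python. The triples below are hypothetical placeholders, not data from the graph, and the approach only mirrors in spirit what a SPARQL `GROUP BY` with a `HAVING` filter would do.

```python
from collections import Counter

# Hypothetical triples (subject, property, value); names and numbers are placeholders.
triples = [
    ("recipe/A/nutrition", "schema:calories", "250"),
    ("recipe/A/nutrition", "schema:calories", "300"),
    ("recipe/A/nutrition", "schema:sugarContent", "12"),
    ("recipe/B/nutrition", "schema:calories", "410"),
]

# Count distinct values per (subject, property) pair and keep pairs with more than one.
value_counts = Counter((s, p) for s, p, _ in set(triples))
multi_valued = {pair: n for pair, n in value_counts.items() if n > 1}
print(multi_valued)
```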

6. Identify a sixth type of quality check different than above, write and run SPARQL to implement the check and return the violating entities.

In [8]:
run_query(g, 'queries/q6.rq')

| | recipe | nutritionType |
|---|---|---|
| 0 | http://kg-course/food-nutrition/recipe/46 | http://dbpedia.org/ontology/Nutrition |


### Explanation - consistency issue

Two ontology classes represent the same concept but come from different vocabularies: `schema:NutritionInformation` and `dbo:Nutrition`.
The query checks which of the two classes each recipe's nutrition information uses. In the output, recipe 46 is the one using `dbo:Nutrition`.
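
One way to think about this consistency check is to treat the most frequently used class as canonical and flag every node typed with something else. The Python sketch below illustrates that idea on hypothetical type assertions; choosing the majority class as canonical is an assumption made for the example, not something the query is known to do.

```python
from collections import Counter

# Hypothetical type assertions: node -> class IRI (node names are placeholders).
types = {
    "recipe/A/nutrition": "https://schema.org/NutritionInformation",
    "recipe/B/nutrition": "https://schema.org/NutritionInformation",
    "recipe/C/nutrition": "http://dbpedia.org/ontology/Nutrition",
}

# Treat the most used class as canonical and report nodes typed differently.
class_usage = Counter(types.values())
canonical, _ = class_usage.most_common(1)[0]
outliers = sorted(node for node, cls in types.items() if cls != canonical)
print(canonical, outliers)
```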

7. Identify a seventh type of quality check different than above, write and run SPARQL to implement the check and return the violating entities.

In [9]:
run_query(g, 'queries/q7.rq')

| | recipe | nutritionProperty | value |
|---|---|---|---|
| 0 | http://kg-course/food-nutrition/recipe/45/nutrition | https://schema.org/proteinContent | -4.2 |
| 1 | http://kg-course/food-nutrition/recipe/41/nutrition | https://schema.org/fiberContent | -0.2 |


### Explanation - semantic accuracy

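This query checks for negative values of nutrition properties, which cannot be correct for quantities such as protein or fiber content. The results show 2 cases: a `proteinContent` of -4.2 (recipe 45) and a `fiberContent` of -0.2 (recipe 41).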
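
The same check is easy to express in Python, which also makes explicit what happens to values that are not numeric at all. The rows and the function name `negative_values` are hypothetical; skipping non-numeric values is a design choice for this sketch, on the assumption that they are better handled by a separate check.

```python
def negative_values(rows):
    """Return (node, property, value) for every numeric value below zero."""
    flagged = []
    for node, prop, raw in rows:
        try:
            value = float(raw)
        except ValueError:
            continue  # non-numeric values are left to a separate check
        if value < 0:
            flagged.append((node, prop, value))
    return flagged


# Hypothetical rows; node names and numbers are placeholders.
rows = [
    ("recipe/A/nutrition", "schema:proteinContent", "-4.2"),
    ("recipe/A/nutrition", "schema:fiberContent", "3.1"),
    ("recipe/B/nutrition", "schema:calories", "unknown"),
]
print(negative_values(rows))
```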

## Conciseness

8. Orphan nodes

In [83]:
run_query(g, 'queries/q8.rq')

| | orphan |
|---|---|
| 0 | http://kg-course/food-nutrition/recipe/49/nutrition |
| 1 | http://kg-course/food-nutrition/recipe/48/nutrition |
| 2 | http://kg-course/food-nutrition/recipe/98664 |
| 3 | http://kg-course/food-nutrition/recipe/88096 |
| 4 | http://kg-course/food-nutrition/recipe/88095/ |
| 5 | http://kg-course/food-nutrition/recipe/74837 |
| 6 | http://kg-course/food-nutrition/recipe/58643 |
| 7 | http://kg-course/food-nutrition/recipe/57879 |
| 8 | http://kg-course/food-nutrition/recipe/57856 |
| 9 | http://kg-course/food-nutrition/recipe/55724 |


### Explanation
This SPARQL query investigates orphan nodes, i.e. nodes that have no incoming or outgoing references to other nodes. We check for them because they do not contribute to the purpose of a knowledge graph, which is to represent relations. Datatype properties, classes, and object properties are not counted as orphan nodes, since they are used for definitions. Most of the orphan nodes are recipes that are never referenced and have no link to a nutrition node. There are only 2 orphan nutrition nodes (for recipes 48 and 49). Despite having a recipe number in their IRI, the corresponding recipes are not correctly linked to them.
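
The orphan definition can be made concrete with a set-based Python sketch. It rests on assumptions that are not confirmed by `q8.rq`: a node counts as an orphan when it has no links to or from other resources, and type assertions and literal values do not count as links. The triples are hypothetical placeholders.

```python
RDF_TYPE = "rdf:type"

# Hypothetical triples: (subject, predicate, object, object_is_iri).
triples = [
    ("recipe/A", "rdf:type", "schema:Recipe", True),
    ("recipe/A", "schema:nutrition", "recipe/A/nutrition", True),
    ("recipe/A/nutrition", "schema:calories", "250", False),
    ("recipe/B", "rdf:type", "schema:Recipe", True),
    ("recipe/B", "schema:name", "Soup", False),
    ("recipe/B/nutrition", "schema:calories", "90", False),
]

subjects = {s for s, _, _, _ in triples}
# Keep only links between resources; type assertions and literals are ignored.
links = [(s, o) for s, p, o, is_iri in triples if is_iri and p != RDF_TYPE]
linked = {s for s, _ in links} | {o for _, o in links}
orphans = sorted(subjects - linked)
print(orphans)
```

Here `recipe/B` and `recipe/B/nutrition` are both orphans: each is described, but nothing connects them to each other or to any other resource.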

9. Duplicate Nodes

In [65]:
run_query(g, 'queries/q9.rq')


| | name | duplicateRecipeCount |
|---|---|---|
| 0 | Best Lemonade | 3 |
| 1 | Butter Pecan Cookies | 7 |
| 2 | Boston Cream Pie | 20 |
| 3 | Cabbage Soup | 16 |
| 4 | Biryani | 2 |


In [49]:
run_query(g, 'queries/q10.rq')

| | nutrition | duplicateRecipeCount |
|---|---|---|
| 0 | http://kg-course/food-nutrition/recipe/43/nutrition | 2 |


### Explanation
I used 2 different methods to check for duplicates in the graph. The first checks for duplicate names. It returned multiple results, but the recipes are completely different despite sharing the same name, so they are not considered duplicates. The second query checks for multiple recipes linking to the same nutrition node. Its output does indicate a problem: 2 different recipes linking to the same nutrition node means that either the recipes are duplicates or one of them references the wrong nutrition node. Upon further inspection, it is the latter: recipe 88095 references the nutrition node of recipe 43.
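
The second method amounts to grouping recipes by the nutrition node they point to and keeping the groups with more than one member. A minimal Python sketch of that grouping, using hypothetical placeholder links rather than data from the graph:

```python
from collections import defaultdict

# Hypothetical (recipe, nutrition node) links; identifiers are placeholders.
links = [
    ("recipe/A", "recipe/A/nutrition"),
    ("recipe/B", "recipe/A/nutrition"),  # points at another recipe's nutrition node
    ("recipe/C", "recipe/C/nutrition"),
]

# Group recipes by the nutrition node they reference.
recipes_by_nutrition = defaultdict(set)
for recipe, nutrition in links:
    recipes_by_nutrition[nutrition].add(recipe)

# A nutrition node shared by several recipes signals a duplicate or a wrong link.
shared = {n: sorted(r) for n, r in recipes_by_nutrition.items() if len(r) > 1}
print(shared)
```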

10. Failed transitivity

In [110]:
run_query(g, 'queries/q11.rq')

| | recipe | nutrition |
|---|---|---|
| 0 | http://kg-course/food-nutrition/recipe/49 | http://example.org/nonexistent/Nutrition |
| 1 | http://kg-course/food-nutrition/recipe/48 | cholesterol |


### Explanation
This query checks for failed transitivity by testing whether each recipe links to an existing nutrition node. As the output shows, there are 2 cases where the recipe references a nonexistent nutrition node. Unsurprisingly, the nutrition nodes of these recipes (48 and 49) also appear among the orphan nodes. This means an error was made when the recipe was created: the nutrition node for the recipe exists, but the link to it was not made.
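
A dangling-link check of this kind can be sketched in Python as follows. The data is hypothetical, and the sketch assumes the `<recipe IRI>/nutrition` naming convention visible in the orphan output in order to suggest which existing node the recipe probably should have referenced.

```python
# Hypothetical nodes that are actually described in the graph (appear as subjects).
described = {"recipe/A/nutrition", "recipe/B/nutrition"}

# Hypothetical (recipe, nutrition target) links; identifiers are placeholders.
links = [
    ("recipe/A", "recipe/A/nutrition"),
    ("recipe/B", "http://example.org/nonexistent/Nutrition"),
    ("recipe/C", "cholesterol"),
]

dangling = []
for recipe, target in links:
    if target not in described:
        # Assumed naming convention: the nutrition node lives at "<recipe>/nutrition".
        expected = recipe + "/nutrition"
        hint = expected if expected in described else None
        dangling.append((recipe, target, hint))

print(dangling)
```

A non-empty hint means the intended nutrition node exists but was never linked, which matches the situation described above.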