Serializing SPARQL Query Results with Aggregates over Variables from Optional Graph Pattern #2229

Closed
prohde opened this issue Feb 14, 2023 · 3 comments · Fixed by #2448
Labels
bug Something isn't working confirmation needed The issue raises a potential bug that needs to be confirmed.

Comments

@prohde

prohde commented Feb 14, 2023

I ran into an issue when serializing the results of SPARQL queries that aggregate over variables from optional graph patterns, i.e., variables that might be unbound. I am using rdflib==6.2.0.

The query in question is:

SELECT DISTINCT ?x (COUNT(DISTINCT ?inst) AS ?cnt) WHERE {
  ?x a <http://swat.cse.lehigh.edu/onto/univ-bench.owl#GraduateStudent> 
  OPTIONAL {
    VALUES ?inst { <http://www.University0.edu> <http://www.University1.edu> }. 
    ?x <http://swat.cse.lehigh.edu/onto/univ-bench.owl#undergraduateDegreeFrom> ?inst .
  }
} GROUP BY ?x

For each graduate student, I want to know how many undergraduate degrees they hold from the universities listed in the VALUES clause. I am using OPTIONAL here because I also want to get 0 if the student has no degree from any of the specified universities.

The query runs fine on my SPARQL endpoint, but when I use rdflib as an in-memory RDF graph, I get the following exception:

Traceback (most recent call last):
  File "/home/my_path/venv/lib/python3.9/site-packages/rdflib/plugins/sparql/evalutils.py", line 68, in _eval
    return ctx[expr]
  File "/home/my_path/venv/lib/python3.9/site-packages/rdflib/plugins/sparql/sparql.py", line 175, in __getitem__
    return self.ctx.initBindings[key]  # type: ignore[index]
KeyError: rdflib.term.Variable('inst')

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/my_path/graph_test.py", line 33, in run_query
    res_json = res.serialize(format='json')
  File "/home/my_path/venv/lib/python3.9/site-packages/rdflib/query.py", line 252, in serialize
    serializer.serialize(stream2, encoding=encoding, **args)  # type: ignore
  File "/home/my_path/venv/lib/python3.9/site-packages/rdflib/plugins/sparql/results/jsonresults.py", line 43, in serialize
    self._bindingToJSON(x) for x in self.result.bindings
  File "/home/my_path/venv/lib/python3.9/site-packages/rdflib/query.py", line 184, in bindings
    self._bindings += list(self._genbindings)
  File "/home/my_path/venv/lib/python3.9/site-packages/rdflib/plugins/sparql/evaluate.py", line 541, in evalDistinct
    for x in res:
  File "/home/my_path/venv/lib/python3.9/site-packages/rdflib/plugins/sparql/evaluate.py", line 550, in <genexpr>
    return (row.project(project.PV) for row in res)
  File "/home/my_path/venv/lib/python3.9/site-packages/rdflib/plugins/sparql/evaluate.py", line 100, in evalExtend
    for c in evalPart(ctx, extend.p):
  File "/home/my_path/venv/lib/python3.9/site-packages/rdflib/plugins/sparql/evaluate.py", line 100, in evalExtend
    for c in evalPart(ctx, extend.p):
  File "/home/my_path/venv/lib/python3.9/site-packages/rdflib/plugins/sparql/evaluate.py", line 453, in evalAggregateJoin
    aggregator.update(row)
  File "/home/my_path/venv/lib/python3.9/site-packages/rdflib/plugins/sparql/aggregates.py", line 256, in update
    if acc.use_row(row):
  File "/home/my_path/venv/lib/python3.9/site-packages/rdflib/plugins/sparql/aggregates.py", line 68, in use_row
    return self.eval_row(row) not in self.seen
  File "/home/my_path/venv/lib/python3.9/site-packages/rdflib/plugins/sparql/aggregates.py", line 62, in eval_row
    return _eval(self.expr, row)
  File "/home/my_path/venv/lib/python3.9/site-packages/rdflib/plugins/sparql/evalutils.py", line 71, in _eval
    raise NotBoundError("Variable %s is not bound" % expr)
rdflib.plugins.sparql.sparql.NotBoundError: Variable inst is not bound
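
The traceback suggests the following chain: `evalAggregateJoin` passes each row to the aggregator, the accumulator's `use_row` calls `eval_row`, and `_eval` raises `NotBoundError` because `?inst` is missing from the row. The sketch below is a simplified stand-in model of that chain in plain Python, not rdflib's actual code. The class `DistinctCounter`, the helper `eval_var`, and the `skip_unbound` flag are hypothetical names, used only to show where a guard could sit; it is not necessarily how the fix was implemented.

```python
class NotBoundError(Exception):
    """Stand-in for rdflib.plugins.sparql.sparql.NotBoundError."""


def eval_var(name, row):
    # Mirrors the behaviour seen in the traceback: a missing variable raises.
    try:
        return row[name]
    except KeyError:
        raise NotBoundError("Variable %s is not bound" % name) from None


class DistinctCounter:
    """Hypothetical model of a COUNT(DISTINCT ?var) accumulator."""

    def __init__(self, var, skip_unbound):
        self.var = var
        self.skip_unbound = skip_unbound
        self.seen = set()

    def use_row(self, row):
        try:
            return eval_var(self.var, row) not in self.seen
        except NotBoundError:
            if self.skip_unbound:
                return False  # an unbound value contributes nothing to COUNT
            raise

    def update(self, row):
        if self.use_row(row):
            self.seen.add(eval_var(self.var, row))

    @property
    def value(self):
        return len(self.seen)


# One group where the OPTIONAL part did not match, so ?inst is absent.
rows = [{"x": "student3"}]

unguarded = DistinctCounter("inst", skip_unbound=False)
try:
    for row in rows:
        unguarded.update(row)
    outcome = "no error"
except NotBoundError as exc:
    outcome = str(exc)

guarded = DistinctCounter("inst", skip_unbound=True)
for row in rows:
    guarded.update(row)

print(outcome)        # Variable inst is not bound
print(guarded.value)  # 0
```

With the guard in place, the group without a matching university yields a count of 0, which is the result the query is meant to produce.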

I tried rewriting my original query as follows to work around the issue, but without success.

SELECT DISTINCT ?x (IF(bound(?inst), COUNT(DISTINCT ?inst), 0) AS ?cnt) WHERE {
  ?x a <http://swat.cse.lehigh.edu/onto/univ-bench.owl#GraduateStudent> 
  OPTIONAL {
    VALUES ?inst { <http://www.University0.edu> <http://www.University1.edu> }. 
    ?x <http://swat.cse.lehigh.edu/onto/univ-bench.owl#undergraduateDegreeFrom> ?inst .
  }
}

As I understand it, the error occurs in the COUNT, which should not be evaluated at all because of the IF expression.
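
One possible explanation for why the IF does not help: in the SPARQL 1.1 algebra, aggregates are computed in a separate aggregation step, and the projected expression is only evaluated afterwards, with the aggregate's result substituted in. The traceback is consistent with that order, since `evalExtend` calls into `evalAggregateJoin`. The plain-Python sketch below models the two steps; `aggregate_step`, `extend_step`, and `count_distinct_strict` are hypothetical names, not rdflib functions.

```python
class NotBoundError(Exception):
    """Stand-in for the error rdflib raises for an unbound variable."""


def count_distinct_strict(rows, var):
    # Models the failing behaviour: touching an unbound variable raises.
    seen = set()
    for row in rows:
        if var not in row:
            raise NotBoundError("Variable %s is not bound" % var)
        seen.add(row[var])
    return len(seen)


def aggregate_step(rows, var):
    # Step 1: the aggregate is computed over the whole group.
    return {"__agg_1__": count_distinct_strict(rows, var)}


def extend_step(agg_row, guard_is_bound):
    # Step 2: the IF only chooses between the finished aggregate and 0.
    return agg_row["__agg_1__"] if guard_is_bound else 0


group = [{"x": "student3"}]  # OPTIONAL did not match, so ?inst is absent

steps_reached = []
error = None
try:
    agg_row = aggregate_step(group, "inst")
    steps_reached.append("aggregate")
    cnt = extend_step(agg_row, guard_is_bound=False)
    steps_reached.append("extend")
except NotBoundError as exc:
    error = str(exc)

print(steps_reached)  # []
print(error)          # Variable inst is not bound
```

In this model the error is raised in step 1, so the guard in step 2 is never reached.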

Many thanks in advance!

@aucampia aucampia added bug Something isn't working confirmation needed The issue raises a potential bug that needs to be confirmed. labels Mar 25, 2023
@WhiteGobo
Contributor

WhiteGobo commented May 20, 2023

I couldn't reproduce that error (with Python 3.11). My script:

#import sys
#sys.path.insert(0,"path/to/rdflib-6.2")
import rdflib

query = """
SELECT DISTINCT ?x (COUNT(DISTINCT ?inst) as ?cnt)
WHERE {
    ?x a ex:a
    OPTIONAL {
        VALUES ?inst {ex:b ex:c}.
        ?x ex:d ?inst.
    }
}  GROUP BY ?x
"""

ex = rdflib.Namespace("http://example.com/")
g = rdflib.Graph()
g.bind("ex", ex)
g.parse(format="ttl", data="""@prefix ex: <http://example.com/>.
        <1> a ex:a;
            ex:d ex:b.
        <2> a ex:a;
            ex:d ex:c;
            ex:d ex:b.
        <3> a ex:a;
            ex:d ex:c.
""")
print(list(g.query(query)))

@prohde
Author

prohde commented May 22, 2023

Hi @WhiteGobo,

This is because all of your instances (<1>, <2>, and <3>) are connected to at least one of ex:b or ex:c via ex:d. If you remove the last triple, so that the expected count for <3> is 0 (zero), the same error appears.

import rdflib

query = """
SELECT DISTINCT ?x (COUNT(DISTINCT ?inst) as ?cnt)
WHERE {
    ?x a ex:a
    OPTIONAL {
        VALUES ?inst {ex:b ex:c}.
        ?x ex:d ?inst.
    }
}  GROUP BY ?x
"""

ex = rdflib.Namespace("http://example.com/")
g = rdflib.Graph()
g.bind("ex", ex)
g.parse(format="ttl", data="""@prefix ex: <http://example.com/>.
        <1> a ex:a;
            ex:d ex:b.
        <2> a ex:a;
            ex:d ex:c;
            ex:d ex:b.
        <3> a ex:a.
""")
print(list(g.query(query)))
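
For reference, the result this query should produce on the graph above can be worked out by hand. The sketch below does so in plain Python without rdflib, assuming the usual SPARQL semantics: OPTIONAL acts as a left join, and COUNT only counts bound values. The tuple representation and all variable names are illustrative only.

```python
# Triples of the reproduction graph, written as plain tuples.
triples = [
    ("1", "a", "ex:a"), ("1", "ex:d", "ex:b"),
    ("2", "a", "ex:a"), ("2", "ex:d", "ex:c"), ("2", "ex:d", "ex:b"),
    ("3", "a", "ex:a"),
]
values = {"ex:b", "ex:c"}  # the VALUES ?inst list

# Required pattern: ?x a ex:a
subjects = [s for s, p, o in triples if p == "a" and o == "ex:a"]

# OPTIONAL behaves like a left join: keep ?x even when no ?inst matches.
rows = []
for x in subjects:
    matches = [o for s, p, o in triples if s == x and p == "ex:d" and o in values]
    if matches:
        rows.extend({"x": x, "inst": inst} for inst in matches)
    else:
        rows.append({"x": x})  # ?inst stays unbound

# GROUP BY ?x with COUNT(DISTINCT ?inst): unbound values are not counted.
groups = {}
for row in rows:
    bucket = groups.setdefault(row["x"], set())
    if "inst" in row:
        bucket.add(row["inst"])
expected = {x: len(insts) for x, insts in groups.items()}

print(expected)  # {'1': 1, '2': 2, '3': 0}
```

So <3> should appear in the result with a count of 0 rather than triggering an error.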

@WhiteGobo
Contributor

OK, I have made a fix in a separate branch. That should resolve the error.
Link to the branch

I need some time to turn this into a PR, because I want to add a test, and I still have to look at plugins/sparql/aggregates.py: the code there tries to catch the NotBoundError, and I'm not sure whether that line is ever reached.

WhiteGobo pushed a commit to WhiteGobo/rdflib that referenced this issue Jun 14, 2023
testing counting of optional nodes. Zero optional nodes may throw a
NotBoundError
Added fix for NotBoundError, for this test.