rdflib_hdt.optimize_sparql() produces incorrect results #14

Open
chiarcos opened this issue Mar 7, 2022 · 5 comments
Labels
bug Something isn't working

Comments

@chiarcos

chiarcos commented Mar 7, 2022

Describe the bug

When querying in SPARQL mode, a triple pattern for which no variable binding can be found binds to all triples in the graph. It should bind to none. This behavior appears after calling rdflib_hdt.optimize_sparql().

To Reproduce
Steps to reproduce the behavior:

  1. Unzip all.hdt.zip
  2. Run python3 hdt-test.py

Expected behavior
The last query should not return any results

System:

  • OS: Ubuntu 20.04 LTS
  • Python 3.8.10
  • hdt 2.3
  • rdflib 6.0.1
Callidon added the bug label Mar 7, 2022
@donpellegrino

It looks like the focus for this issue is https://github.com/RDFLib/rdflib-hdt/blob/master/rdflib_hdt/sparql_op.py. At only 43 lines, it seems to be an innocent enough function. Nothing jumps out at me as triggering this issue. A next step might be to step through the execution of the test case.

@donpellegrino

I was able to confirm that, within def __evalbgp__(ctx: QueryContext, bgp: BGP), adding the following code, which passes what should be the same pattern to store.hdt_document.search(), does give the expected empty result set:

        results = store.hdt_document.search(
            (None,
             URIRef('http://www.w3.org/1999/02/22-rdf-syntax-ns#type'),
             URIRef('http://purl.org/acoli/open-ie/types:no-entity'))
        )

Using URIRef('http://purl.org/acoli/open-ie/types:entity') gives 794 results.

So, it looks like the problem is inside store.hdt_document.search_join() somewhere.
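
To make this kind of comparison repeatable, the two code paths can be put side by side in a small differential check. The sketch below is only an illustration: `compare_code_paths`, `count`, and the two callables it receives are hypothetical names, and the callables are assumed to be thin wrappers that return an iterable of matches (they are not the actual rdflib-hdt signatures).

```python
from typing import Callable, Dict, Iterable, Optional, Sequence, Tuple

# A triple pattern; None marks a wildcard position (assumption for this sketch).
Pattern = Tuple[Optional[str], Optional[str], Optional[str]]


def count(results: Iterable) -> int:
    """Count the items of any iterable without keeping them in memory."""
    return sum(1 for _ in results)


def compare_code_paths(
    pattern: Pattern,
    search: Callable[[Pattern], Iterable],
    search_join: Callable[[Sequence[Pattern]], Iterable],
) -> Dict[str, object]:
    """Run one pattern through both paths and report whether the counts agree.

    `search` stands for the single-pattern path and `search_join` for the
    join path called with a one-element list of patterns. Both are
    hypothetical wrappers supplied by the caller.
    """
    single = count(search(pattern))
    joined = count(search_join([pattern]))
    return {"search": single, "search_join": joined, "agree": single == joined}
```

For a one-pattern query both paths should report the same count, so an `agree` value of `False` would point at the join path, in line with the observation above.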

@donpellegrino

Execution seems to be picking up a debugging statement from https://github.com/rdfhdt/hdt-cpp/blob/develop/libhdt/src/sparql/QueryProcessor.cpp#L117

@donpellegrino

Using the command line hdtSearch all.hdt built from the same hdt-cpp-1.3.3 directory used when building rdflib-hdt, the pattern results seem to work as expected:

>> ? http://www.w3.org/1999/02/22-rdf-syntax-ns#type http://purl.org/acoli/open-ie/types:entity
794 results in 97 ms 496 us
>> ? http://www.w3.org/1999/02/22-rdf-syntax-ns#type http://purl.org/acoli/open-ie/types:no-entity
0 results in 43 us

Therefore, my current guess is that something weird is happening when Python hands off the pattern to the C++ QueryProcessor::searchJoin() function.

@donpellegrino

The hdtSearch CLI tool calls hdt::search(), whereas QueryProcessor::searchJoin() has odd semantics and follows a different code path that does not call hdt::search(). Some reconciliation is needed to align rdflib's BGP approach with how individual patterns should be handled in the SPARQL optimization function. Either the hdt-cpp QueryProcessor::searchJoin() is fundamentally broken, which could be the case given the print lines in there, or hdt::search() is a better fit and multiple patterns should be ANDed together in a different way.
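
As a rough illustration of the second option, a BGP can be evaluated by calling a single-pattern search once per pattern and joining the results on shared variables. This is a minimal sketch under stated assumptions, not the rdflib-hdt or hdt-cpp implementation: `make_search` is a hypothetical in-memory stand-in for a single-pattern search, `eval_bgp` and `is_var` are made-up names, terms are plain strings, and variables are strings starting with `?`.

```python
from typing import Callable, Dict, Iterable, List, Optional, Sequence, Tuple

Triple = Tuple[str, str, str]
Pattern = Tuple[str, str, str]  # terms starting with "?" are variables
Lookup = Tuple[Optional[str], Optional[str], Optional[str]]  # None = wildcard
Bindings = Dict[str, str]


def is_var(term: str) -> bool:
    return term.startswith("?")


def make_search(triples: Sequence[Triple]) -> Callable[[Lookup], Iterable[Triple]]:
    """Hypothetical stand-in for a single-pattern search over a triple list."""
    def search(lookup: Lookup) -> Iterable[Triple]:
        for triple in triples:
            if all(want is None or want == got for want, got in zip(lookup, triple)):
                yield triple
    return search


def eval_bgp(patterns: Sequence[Pattern],
             search: Callable[[Lookup], Iterable[Triple]]) -> List[Bindings]:
    """AND the patterns together with a nested-loop join over search()."""
    solutions: List[Bindings] = [{}]
    for pattern in patterns:
        extended_solutions: List[Bindings] = []
        for binding in solutions:
            # Bound variables become constants, unbound ones become wildcards.
            lookup = tuple(binding.get(t) if is_var(t) else t for t in pattern)
            for triple in search(lookup):
                extended = dict(binding)
                # Reject triples that bind one variable to two different values.
                consistent = all(
                    extended.setdefault(term, value) == value
                    for term, value in zip(pattern, triple)
                    if is_var(term)
                )
                if consistent:
                    extended_solutions.append(extended)
        solutions = extended_solutions
        if not solutions:
            break  # a pattern without matches empties the whole BGP
    return solutions
```

In this sketch a pattern that matches nothing yields an empty solution list instead of falling back to every triple, which is the behavior the issue expects. A nested-loop join is the simplest choice here and is not tuned for large HDT files.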
