HDT support #894

Closed
chrysn opened this issue Jan 20, 2019 · 6 comments
Labels
enhancement New feature or request store Related to a store.
Comments

@chrysn
Contributor

chrysn commented Jan 20, 2019

The HDT format appears to be a promising way to access large static databases. There is a Python wrapper around the standard HDT library. (It has an open issue about result correctness, and at first glance I couldn't find how blank nodes or literals are handled, but it is actively maintained, so it can grow to fit.)

I'm not well versed in rdflib internals, but it appears that one could construct a read-only store in a fairly straightforward fashion (mainly by implementing .triples()).
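As a rough illustration of the read-only idea, here is a minimal sketch that mirrors the shape of rdflib's `Store` interface without importing rdflib. A real integration would presumably subclass `rdflib.store.Store`; `FakeHDTDocument` and its `search` method are hypothetical stand-ins for an HDT file handle, not the actual wrapper API, and terms are plain strings here.

```python
from typing import Iterator, Optional, Tuple

Term = Optional[str]            # None acts as a wildcard in a pattern
Triple = Tuple[str, str, str]


class FakeHDTDocument:
    """Hypothetical stand-in for an opened HDT file (not the real wrapper API)."""

    def __init__(self, triples):
        self._triples = list(triples)

    def search(self, s: Term, p: Term, o: Term) -> Iterator[Triple]:
        # Yield every stored triple that matches the non-wildcard positions.
        for triple in self._triples:
            if all(q is None or q == v for q, v in zip((s, p, o), triple)):
                yield triple


class ReadOnlyHDTStore:
    """Sketch of a read-only store; only lookups are supported."""

    def __init__(self, document):
        self._doc = document

    def triples(self, pattern, context=None):
        # rdflib stores yield (triple, contexts) pairs; this sketch has no
        # named graphs, so the contexts iterator is always empty.
        s, p, o = pattern
        for triple in self._doc.search(s, p, o):
            yield triple, iter(())

    def __len__(self):
        return sum(1 for _ in self._doc.search(None, None, None))

    def add(self, triple, context=None, quoted=False):
        raise TypeError("HDT files are read-only")

    def remove(self, pattern, context=None):
        raise TypeError("HDT files are read-only")
```

Everything except `triples()` and `__len__()` either raises or is left out, which is roughly what "read-only" would mean for such a store.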

The main questions (were I to find the time to implement this myself) would be:

  • how to tell the query processor how many triples are expected to be returned from a triples iterator (HDT gives an estimate of that; this might help the query processor to decide which branch to follow first before iterating over thousands of results), and
  • whether the HDT file's internal numeric resource identifiers can be leveraged in any fashion (e.g. by ensuring that each term returned from an HDT file is annotated with a weakref to its original file object and its numeric ID, so that when the term is used in the next query against the same file, the ID doesn't need to be looked up again).
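
To make the two questions concrete, here is a small sketch of both ideas, independent of rdflib and of any HDT library. `order_by_estimate`, `TermIdCache`, `FakeDoc`, and the `estimate`/`resolve` callables are all hypothetical names introduced for illustration; how rdflib's query processor could actually consume such hints is exactly the open question.

```python
import weakref
from typing import Callable, List, Optional, Tuple

Pattern = Tuple[Optional[str], Optional[str], Optional[str]]


def order_by_estimate(
    patterns: List[Pattern], estimate: Callable[[Pattern], int]
) -> List[Pattern]:
    """Sort patterns so the one expected to match the fewest triples runs first."""
    return sorted(patterns, key=estimate)


class FakeDoc:
    """Hypothetical stand-in for an opened HDT file object."""


class TermIdCache:
    """Remember term -> numeric ID per document, without keeping documents alive."""

    def __init__(self):
        # Weak keys: once a document object is garbage collected,
        # its cached IDs disappear with it.
        self._by_doc = weakref.WeakKeyDictionary()

    def lookup(self, doc, term: str, resolve: Callable[[str], int]) -> int:
        ids = self._by_doc.setdefault(doc, {})
        if term not in ids:
            ids[term] = resolve(term)  # the expensive dictionary lookup, done once
        return ids[term]
```

Keeping the cache outside the terms, keyed weakly by file object, is one alternative to annotating every returned term with a weakref; either way an ID is only valid for the file it came from.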
@FlorianLudwig
Contributor

@chrysn I made a quick proof of concept here: https://github.com/FlorianLudwig/rdflib-hdt/blob/master/rdflib_hdt.py without any optimizations - since they would need knowledge of how the query processor works which I don't have (yet :)). But it is already quite fast: I used the wikidata set for a quick benchmark and it looked quite good!

@nicholascar
Member

I'm interested in this so I've allocated this to the 6.0.0 release (hopefully July 2020).

@FlorianLudwig if you're interested in progressing this to a full Store implementation within rdflib, please let me (one of the new rdflib maintainers) know!

@nicholascar nicholascar added enhancement New feature or request store Related to a store. labels Mar 16, 2020
@FlorianLudwig
Contributor

Hi @nicholascar, what would it take to make this a full store implementation? While my implementation is really simple, it worked well enough for my purposes, so I never looked into any optimizations.

Creating HDT files might be of interest, but I don't think that fits well into rdflib's model/API, as HDT doesn't allow single writes, only writing everything at once.

@nicholascar
Member

@FlorianLudwig perhaps not much more than what you already have is needed! We should just document it nicely, especially if it needs to declare itself read-only etc. (as the SPARQL Store does), and ensure that we have a copy of the HDT code that is bundled or managed in some way, so we aren't exposed to the repo going away. I might invite @Callidon to bring his HDT library into the rdflib family!

Give me a couple of weeks until 5.0.0 is out and I'll get back to you about this. If you did want to put in a PR for this against master in the meanwhile, that would be great. We will just flag it for 6.0.0 as I've done for this Issue.

@Callidon

Hi everyone,

Thank you for all the interest in pyHDT, it's amazing to see a small side project like that turning into something bigger 😄
I'm on board for the integration and I will gladly contribute!

In my opinion, the first piece of work is to adapt the inputs and outputs of pyHDT to use the data model of RDFlib (Literal, URIRef, BNode), because it currently only uses strings to represent RDF terms. I'm unsure if we should do this directly in pyHDT (in the C++ binding code) or as a layer over it, as part of the "rdflib integration".
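
For the "layer over it" option, a rough sketch of what the conversion step could look like is below. It assumes terms arrive as N-Triples-like strings (bare IRIs, `_:`-prefixed blank nodes, quoted literals with an optional `@lang` or `^^<datatype>` suffix); that input format is an assumption, and `ParsedTerm` and `parse_term` are hypothetical names. A real integration would return RDFlib's `URIRef`, `BNode`, and `Literal` objects instead of this stand-in tuple, and would also need to handle escape sequences.

```python
import re
from typing import NamedTuple, Optional


class ParsedTerm(NamedTuple):
    kind: str                      # "uri", "bnode" or "literal"
    value: str
    datatype: Optional[str] = None
    language: Optional[str] = None


# Quoted lexical form, optionally followed by @lang or ^^<datatype>.
_LITERAL = re.compile(
    r'^"(?P<value>.*)"(?:@(?P<lang>[A-Za-z0-9-]+)|\^\^<(?P<dt>[^>]*)>)?$',
    re.DOTALL,
)


def parse_term(raw: str) -> ParsedTerm:
    """Classify a raw term string; escape sequences are left untouched."""
    if raw.startswith("_:"):
        return ParsedTerm("bnode", raw[2:])
    match = _LITERAL.match(raw)
    if match:
        return ParsedTerm("literal", match["value"], match["dt"], match["lang"])
    # Anything else is treated as an IRI, with or without angle brackets.
    return ParsedTerm("uri", raw.strip("<>"))
```

Doing this in Python keeps the C++ binding untouched, at the cost of parsing every term string on each result.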

Concerning the creation of HDT files, I agree with @FlorianLudwig that it might not fit well in the whole rdflib model, as an HDT file cannot be modified after creation.

@nicholascar
Member

We have HDT handling within RDFlib via the rdflib-hdt project. Please raise all RDFlib + HDT issues there!
