HDT support #894

chrysn · 2019-01-20T13:39:26Z

The HDT format appears to be a promising way to access large static databases. There exists a Python wrapper around the standard HDT library. (It has an open issue about result correctness, and at first glance I didn't find how blank nodes or literals are handled, but it's actively maintained, so it can grow to fit.)

I'm not well versed in rdflib internals, but it appears that one could construct a read-only store in a fairly straightforward fashion (mainly by implementing .triples()).
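To make the shape of such a store concrete, here is a minimal sketch using only the standard library. `ReadOnlyTripleStore` is a hypothetical name, and a tuple of triples stands in for an HDT file. It is not rdflib's actual `Store` interface, whose `triples()` also takes a context argument and yields each triple together with its contexts. As in rdflib, `None` in a pattern acts as a wildcard.

```python
class ReadOnlyTripleStore:
    """Toy read-only store; a tuple of triples stands in for an HDT file."""

    def __init__(self, triples):
        # Freeze the data: nothing can be added or removed afterwards.
        self._triples = tuple(triples)

    def triples(self, pattern):
        """Yield every triple matching (s, p, o); None matches anything."""
        for triple in self._triples:
            if all(want is None or want == got
                   for want, got in zip(pattern, triple)):
                yield triple

    def add(self, triple):
        # Writes are rejected, since the backing data is immutable.
        raise TypeError("store is read-only")

    def remove(self, pattern):
        raise TypeError("store is read-only")

    def __len__(self):
        return len(self._triples)


store = ReadOnlyTripleStore([
    ("a", "knows", "b"),
    ("a", "knows", "c"),
    ("b", "knows", "c"),
])
matches = list(store.triples(("a", None, None)))
```

A real implementation would delegate the pattern match to the HDT library instead of scanning a tuple.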

The main questions (were I to find the time to implement this myself) would be:

- how to tell the query processor how many triples are expected to be returned from a triples iterator (HDT gives an estimate of that; this might help the query processor decide which branch to follow first before iterating over thousands of results), and
- whether the HDT file's internal numeric resource identifiers can be leveraged in any fashion (e.g. by ensuring that each term returned from an HDT file is annotated with a weakref to its original file object and its numeric ID, so that when it is used in the next query against the same file the ID doesn't need to be looked up again).
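Both ideas can be illustrated with a toy model, sketched here with the standard library only. `ToyFile`, `AnnotatedTerm` and `order_patterns` are hypothetical names, not part of rdflib or any HDT wrapper, and the toy "estimate" is an exact count, whereas HDT only promises an estimate.

```python
import weakref


class AnnotatedTerm(str):
    """A term string that remembers its source file (weakly) and numeric ID."""

    def __new__(cls, value, source, term_id):
        term = super().__new__(cls, value)
        term.source = weakref.ref(source)  # does not keep the file alive
        term.term_id = term_id
        return term


class ToyFile:
    """Hypothetical stand-in for an open HDT file: a dictionary plus ID triples."""

    def __init__(self, triples):
        terms = sorted({term for triple in triples for term in triple})
        self._ids = {term: i for i, term in enumerate(terms, start=1)}
        self._terms = {i: term for term, i in self._ids.items()}
        self._rows = [tuple(self._ids[t] for t in triple) for triple in triples]
        self.lookups = 0  # counts string-to-ID dictionary lookups

    def resolve(self, term):
        """Return the numeric ID for a term, reusing a cached ID if possible."""
        if term is None:
            return None  # wildcard
        source = getattr(term, "source", None)
        if source is not None and source() is self:
            return term.term_id  # term came from this file: no lookup needed
        self.lookups += 1
        return self._ids.get(term, 0)  # 0 matches nothing

    def _matches(self, pattern):
        ids = [self.resolve(term) for term in pattern]
        return [row for row in self._rows
                if all(want is None or want == got
                       for want, got in zip(ids, row))]

    def estimate(self, pattern):
        """Number of triples the pattern is expected to return."""
        return len(self._matches(pattern))

    def triples(self, pattern):
        for row in self._matches(pattern):
            yield tuple(AnnotatedTerm(self._terms[i], self, i) for i in row)


def order_patterns(source, patterns):
    """Put the pattern with the smallest estimated result first."""
    return sorted(patterns, key=source.estimate)
```

A query processor could call something like `order_patterns` before joining, so that the most selective pattern is evaluated first, and terms produced by one query skip the dictionary lookup when they are fed back into the same file.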


FlorianLudwig · 2019-11-11T09:55:48Z

@chrysn I made a quick proof of concept, without any optimizations, here: https://github.com/FlorianLudwig/rdflib-hdt/blob/master/rdflib_hdt.py Optimizations would need knowledge of how the query processor works, which I don't have (yet :)). But it is already quite fast: I used the Wikidata set for a quick benchmark and it looked quite good!

nicholascar · 2020-03-16T00:38:59Z

I'm interested in this so I've allocated this to the 6.0.0 release (hopefully July 2020).

@FlorianLudwig if you're interested in progressing this to a full Store implementation within rdflib, please let me (one of the new rdflib maintainers) know!

FlorianLudwig · 2020-03-16T08:28:51Z

Hi @nicholascar, what would it take to make this a full store implementation? While my implementation is really simple, it worked well enough for my purposes, so I never looked into any optimizations.

Creating HDT files might be of interest, but I don't think that fits well into rdflib's model/API, as HDT doesn't allow single writes, only writing everything at once.

nicholascar · 2020-03-17T05:11:46Z

@FlorianLudwig perhaps nothing much more than what you already have is needed! We should just document it nicely, especially if it needs to declare itself read-only etc. (as the SPARQL Store does), and ensure that we have a copy of the HDT code that is bundled or managed in some way so we aren't exposed to the repo going away. I might invite @Callidon to bring his HDT library into the rdflib family!

Give me a couple of weeks until 5.0.0 is out and I'll get back to you about this. If you did want to put in a PR for this against master in the meantime, that would be great. We will just flag it for 6.0.0, as I've done for this issue.

Callidon · 2020-03-18T08:55:30Z

Hi everyone,

Thank you for all the interest in pyHDT, it's amazing to see a small side-project like that turning into something bigger 😄
I'm on board for the integration and I will gladly contribute!

In my opinion, the first piece of work is to adapt the inputs and outputs of pyHDT to use the data model of RDFlib (Literal, URIRef, BNode), because it currently only uses strings to represent RDF terms. I'm unsure if we should do this directly in pyHDT (in the C++ binding code) or as a layer over it, as part of the "rdflib integration".
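As a rough illustration of the "layer over it" option, the sketch below turns string terms into typed objects. It assumes N-Triples-like strings (`_:label` for blank nodes, a quoted lexical form with an optional `@lang` or `^^<datatype>` suffix for literals, anything else an IRI); the exact string format pyHDT emits is an assumption here. `IRI`, `Blank` and `Lit` are hypothetical stand-ins for rdflib's `URIRef`, `BNode` and `Literal`, so that the example needs only the standard library. Escape sequences inside literals are not handled.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class IRI:
    value: str


@dataclass(frozen=True)
class Blank:
    label: str


@dataclass(frozen=True)
class Lit:
    lexical: str
    language: Optional[str] = None
    datatype: Optional[str] = None


def parse_term(text):
    """Convert one string term into an IRI, Blank or Lit (assumed format)."""
    if text.startswith("_:"):
        return Blank(text[2:])
    if text.startswith('"'):
        # The last quote closes the lexical form; what follows is the suffix.
        end = text.rfind('"')
        lexical, suffix = text[1:end], text[end + 1:]
        if suffix.startswith("@"):
            return Lit(lexical, language=suffix[1:])
        if suffix.startswith("^^"):
            return Lit(lexical, datatype=suffix[2:].strip("<>"))
        return Lit(lexical)
    # Accept IRIs with or without surrounding angle brackets.
    return IRI(text.strip("<>"))


def parse_triple(triple):
    """Convert a (subject, predicate, object) tuple of strings."""
    return tuple(parse_term(term) for term in triple)
```

Doing the conversion in Python keeps the binding code unchanged, at the cost of parsing every term string on each result; doing it in the C++ binding would avoid that overhead.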

Concerning the creation of HDT files, I agree with @FlorianLudwig that it might not fit well in the whole rdflib model, as an HDT file cannot be modified after creation.

nicholascar · 2020-08-27T05:47:09Z

We now have HDT handling for RDFlib in the rdflib-hdt project. Please take up all RDFlib + HDT issues there!

nicholascar added this to the rdflib 6.0.0 milestone Mar 16, 2020

nicholascar mentioned this issue Mar 16, 2020

Provide Store backend for HDT files #972

Closed

nicholascar added the enhancement (New feature or request) and store (Related to a store.) labels Mar 16, 2020

nicholascar closed this as completed Aug 27, 2020
