Find solution to Virtuoso SPARQL troubles #30

Open
amoeba opened this issue May 4, 2021 · 2 comments
Assignees: amoeba
Labels: enhancement, help wanted, salmon data
Milestone: 0.3.0

Comments


amoeba commented May 4, 2021

Years ago, back when we set up d1lod, we decided to handle inserting RDF data into whichever triplestore we used in an agnostic fashion so the triplestore could be swapped out without too much work. So we settled on inserting data via SPARQL INSERT statements.
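For context, the INSERT-based approach amounts to something like this (a minimal sketch using `requests`; the endpoint URL and graph IRI are placeholders, not our actual configuration):

```python
import requests

ENDPOINT = "http://localhost:8890/sparql"    # placeholder SPARQL endpoint
GRAPH = "http://example.org/graph/datasets"  # placeholder named graph

def insert_ntriples(ntriples: str) -> None:
    """Wrap already-serialized N-Triples in an INSERT DATA update and POST it."""
    update = f"INSERT DATA {{ GRAPH <{GRAPH}> {{ {ntriples} }} }}"
    # SPARQL 1.1 Protocol: updates can be sent as a form-encoded `update` parameter
    resp = requests.post(ENDPOINT, data={"update": update})
    resp.raise_for_status()

insert_ntriples(
    '<http://example.org/ds/1> <http://purl.org/dc/terms/title> "Example dataset" .'
)
```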

While revisiting d1lod and repurposing it for Slinky, I've run into two related issues with this approach:

  1. Virtuoso has an arbitrary, undocumented size limit on SPARQL statements. Their SPARQL engine simply errors out once the query string exceeds a certain length. I don't think we ran into this during the GeoLink work, and I only noticed it because a particular dataset got turned into a too-long SPARQL INSERT query.
  2. If I split the query up and insert it in batches, we run into another problem: blank nodes. If a query references a bnode as an object but the definition of that bnode (where it's the subject) ends up in the next query, Virtuoso complains. AFAICT this is a Virtuoso Open Source bug and may not apply to other triplestores.
    • This makes sense, because I don't think bnodes really work across multiple queries. I considered making each bnode a proper HTTP IRI (skolemizing; see the sketch after this list) but wanted to avoid that because I want our output to still match science-on-schema.
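For reference, here's roughly what skolemizing would look like with rdflib (just a sketch; the authority IRI is illustrative only):

```python
from rdflib import Graph

g = Graph()
g.parse(
    data="""
        @prefix schema: <https://schema.org/> .
        [] a schema:Dataset ; schema:name "Example" .
    """,
    format="turtle",
)

# skolemize() returns a copy of the graph with every blank node replaced by an
# IRI minted under `authority` (the value here is just a placeholder).
skolemized = g.skolemize(authority="https://example.org")
print(skolemized.serialize(format="nt"))
```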

SPARQL may just not be the right thing for this workload. I considered using the alternative RDF data loading methods Virtuoso provides, but it looks like all they have is a system that loads data from the local filesystem via isql commands.

I had been meaning to look at Blazegraph for a few years, and I see that it has a nice HTTP bulk data loading REST API where you can just send serialized RDF to an endpoint. We aren't using any special functionality from Virtuoso, so this might be a good point to switch.
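Roughly what that would look like (sketch only; the endpoint below is Blazegraph's stock NanoSparqlServer default, not anything we run):

```python
import requests

# Default Blazegraph namespace endpoint; an actual deployment would differ.
BLAZEGRAPH = "http://localhost:9999/blazegraph/namespace/kb/sparql"

# POST serialized RDF directly to the endpoint with an appropriate Content-Type.
with open("dataset.ttl", "rb") as f:
    resp = requests.post(
        BLAZEGRAPH,
        data=f.read(),
        headers={"Content-Type": "text/turtle"},
    )
resp.raise_for_status()
print(resp.text)  # Blazegraph responds with a short report of what was modified
```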

Feedback or thoughts welcomed. I'll update here with what I figure out.

amoeba added the enhancement and help wanted labels May 4, 2021
amoeba self-assigned this May 4, 2021

amoeba commented May 4, 2021

@ThomasThelen mentioned two things to consider if we were to swap to Blazegraph:

  1. Authentication: Virtuoso offers some authentication support we might lose if we switch
  2. GeoSPARQL support: It looks like Blazegraph supports some spatial functionality but not quite GeoSPARQL. Not sure if this is really a problem but it's worth considering. See https://github.com/blazegraph/database/wiki/GeoSpatial.

amoeba added a commit that referenced this issue May 18, 2021
Closes #30

I couldn't find a way to send very large SPARQL queries to Virtuoso, but Virtuoso does have an HTTP API that takes Turtle/N-Triples/etc. Since this is specific to Virtuoso, I've made it a separate model from SparqlTripleStore.
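For reference, loading over Virtuoso's SPARQL 1.1 Graph Store HTTP endpoint looks roughly like this (sketch only; the endpoint path, graph IRI, and credentials are placeholders and depend on how the instance is configured):

```python
import requests
from requests.auth import HTTPDigestAuth

# Placeholders: Virtuoso's authenticated Graph Store endpoint and a target graph.
# POST appends triples to the graph; PUT would replace its contents.
CRUD = "http://localhost:8890/sparql-graph-crud-auth"
GRAPH = "http://example.org/graph/datasets"

with open("dataset.ttl", "rb") as f:
    resp = requests.post(
        CRUD,
        params={"graph": GRAPH},
        data=f.read(),
        headers={"Content-Type": "text/turtle"},
        auth=HTTPDigestAuth("dba", "dba"),  # placeholder credentials
    )
resp.raise_for_status()
```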
@ThomasThelen (Member) commented:

An alternative to SPARQL requests is...

  1. Write the Turtle to disk
  2. Upload the file to WebDAV
  3. Let Virtuoso know to ingest the file into the graph (see the sketch below)
    More info here and here

Alternatively, we could possibly share storage between the triplifier and the graph; there may be a way to let Virtuoso know the path of the file on disk (you can do this with GraphDB).
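A rough sketch of that WebDAV-plus-LOAD route (all URLs, paths, and credentials are placeholders, and the auth details depend on how the Virtuoso instance is configured):

```python
import requests

# All placeholders: a DAV path Virtuoso can read, its SPARQL endpoint, a target graph.
DAV_URL = "http://localhost:8890/DAV/home/dba/dataset.ttl"
SPARQL = "http://localhost:8890/sparql"
GRAPH = "http://example.org/graph/datasets"
AUTH = ("dba", "dba")  # placeholder credentials

# Steps 1-2: write the Turtle to disk, then upload it over WebDAV (plain HTTP PUT).
with open("dataset.ttl", "rb") as f:
    requests.put(DAV_URL, data=f.read(), auth=AUTH).raise_for_status()

# Step 3: ask Virtuoso to ingest the uploaded file via a SPARQL 1.1 LOAD update.
update = f"LOAD <{DAV_URL}> INTO GRAPH <{GRAPH}>"
requests.post(SPARQL, data={"update": update}, auth=AUTH).raise_for_status()
```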

amoeba added this to the 0.3.0 milestone Feb 26, 2022