Years ago, back when we set up d1lod, we decided to handle inserting RDF data into whichever triplestore we used in an agnostic fashion so the triplestore could be swapped out without too much work. So we settled on inserting data via SPARQL INSERT statements.
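For context, a minimal sketch of that pattern (not our exact code; the endpoint URL and graph IRI below are made up for illustration):

```python
# Sketch of triplestore-agnostic insertion via SPARQL 1.1 Update.
# Endpoint URL and graph IRI are hypothetical, not Slinky's actual config.
import requests

ENDPOINT = "http://localhost:8890/sparql"  # any SPARQL 1.1 Update endpoint
GRAPH = "http://example.org/slinky"        # hypothetical named graph

def insert_ntriples(ntriples: str) -> None:
    """Wrap serialized triples in INSERT DATA and POST a form-encoded update."""
    update = f"INSERT DATA {{ GRAPH <{GRAPH}> {{ {ntriples} }} }}"
    response = requests.post(ENDPOINT, data={"update": update})
    response.raise_for_status()

insert_ntriples('<http://example.org/ds/1> <http://schema.org/name> "A dataset" .')
```

The upside is that any SPARQL 1.1 endpoint can accept this; the downside is that the whole dataset rides inside one query string.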
While revisiting d1lod and repurposing it for Slinky, I've run into two related issues with this approach:
1. Virtuoso has some sort of arbitrary, undocumented size limit on SPARQL statements: their SPARQL engine just pukes once you get over a certain query string length. I don't think we ran into this during the GeoLink work, and I only noticed it because a particular dataset got turned into a too-long SPARQL INSERT query.
2. If I choose to split the query up and insert it in batches, we run into another problem: blank nodes. If a query references a bnode as an object but the definition of that bnode (where it's a subject) ends up in the next query, Virtuoso complains. AFAICT this is a Virtuoso Open Source bug and may not apply to other triplestores.
This makes sense because I don't think bnodes really work across multiple queries. I considered making each bnode a proper HTTP IRI (skolemizing?) but wanted to avoid that because I want our output to still match science-on-schema.
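For reference, a sketch of what skolemizing would look like with rdflib (the authority IRI below is made up; rdflib mints `.well-known/genid` IRIs under it). It works, but the output then carries skolem IRIs instead of the bnodes the science-on-schema examples use:

```python
# Sketch: rdflib can rewrite blank nodes as skolem IRIs so a graph can be
# split across INSERT batches safely. The authority below is hypothetical.
from rdflib import Graph

g = Graph()
g.parse(
    data="""
        @prefix schema: <http://schema.org/> .
        [] a schema:Dataset ; schema:name "A dataset" .
    """,
    format="turtle",
)

# Every bnode becomes an IRI like <https://example.org/.well-known/genid/...>
skolemized = g.skolemize(authority="https://example.org")
for triple in skolemized:
    print(triple)
```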
SPARQL may just not be the right thing for this workload. I considered using the alternative RDF data loading methods Virtuoso provides, but it looks like all they have is a system that loads data from the local filesystem via ISQL commands.
I had been meaning to look at Blazegraph for a few years and I see that it has a nice HTTP bulk data loading REST API where you can just send serialized RDF to an endpoint. We aren't using any special functionality from Virtuoso so this might be a good point to switch.
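Based on their wiki, something like the following should be the whole loading path (host, port, and namespace are Blazegraph's documented defaults, not anything we run yet):

```python
# Sketch of bulk loading into Blazegraph over its REST API: POST serialized
# RDF to the namespace's SPARQL endpoint with the matching Content-Type.
# The URL uses Blazegraph's documented defaults, not a real deployment.
import requests

BLAZEGRAPH = "http://localhost:9999/blazegraph/namespace/kb/sparql"

def load_turtle(turtle: str) -> None:
    response = requests.post(
        BLAZEGRAPH,
        data=turtle.encode("utf-8"),
        headers={"Content-Type": "text/turtle"},
    )
    response.raise_for_status()

load_turtle('<http://example.org/ds/1> <http://schema.org/name> "A dataset" .')
```

No query string length to worry about, and all of a dataset's bnodes stay inside a single request.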
Feedback or thoughts welcomed. I'll update here with what I figure out.
@ThomasThelen mentioned two things to consider if we were to swap to Blazegraph:
- Authentication: Virtuoso offers some authentication support we might lose if we switch.
- GeoSPARQL support: It looks like Blazegraph supports some spatial functionality but not quite GeoSPARQL. Not sure if this is really a problem, but it's worth considering. See https://github.com/blazegraph/database/wiki/GeoSpatial.
Closes #30
I couldn't find a way to send very large SPARQL queries to Virtuoso, but Virtuoso does have an HTTP API that takes Turtle/N-Triples/etc. Since this is specific to Virtuoso, I've made a separate model from SparqlTripleStore.
Let Virtuoso know to ingest the file into the graph
More info here and here
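A sketch of what that call looks like, assuming it's Virtuoso's SPARQL 1.1 Graph Store Protocol endpoint at /sparql-graph-crud (host and graph IRI are made up; the -auth variant plus credentials applies when authentication is enabled):

```python
# Sketch: load serialized RDF into a named graph via Virtuoso's Graph Store
# Protocol endpoint. Host and graph IRI below are hypothetical.
import requests

CRUD_ENDPOINT = "http://localhost:8890/sparql-graph-crud"
GRAPH = "http://example.org/slinky"

def load_turtle(turtle: str) -> None:
    response = requests.post(
        CRUD_ENDPOINT,
        params={"graph-uri": GRAPH},
        data=turtle.encode("utf-8"),
        headers={"Content-Type": "text/turtle"},
    )
    response.raise_for_status()
```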
Alternatively, we can possibly share storage between the triplifier and the graph; there may be a way to give Virtuoso the path of the file on disk (you can do this with GraphDB).