Find solution to Virtuoso SPARQL troubles #30

Open
amoeba opened this issue May 4, 2021 · 2 comments
Assignees: amoeba
Labels: enhancement, help wanted, salmon data
Milestone: 0.3.0

Comments


amoeba commented May 4, 2021

Years ago, back when we set up d1lod, we decided to handle inserting RDF data into whichever triplestore we used in an agnostic fashion so the triplestore could be swapped out without too much work. So we settled on inserting data via SPARQL INSERT statements.
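For context, the INSERT-based approach amounts to something like this (a minimal sketch using `requests`; the endpoint URL and graph IRI are placeholders, not our actual configuration):

```python
import requests

ENDPOINT = "http://localhost:8890/sparql"    # placeholder SPARQL endpoint
GRAPH = "http://example.org/graph/datasets"  # placeholder named graph

def insert_ntriples(ntriples: str) -> None:
    """Wrap already-serialized N-Triples in an INSERT DATA update and POST it."""
    update = f"INSERT DATA {{ GRAPH <{GRAPH}> {{ {ntriples} }} }}"
    # SPARQL 1.1 Protocol: updates can be sent as a form-encoded `update` parameter
    resp = requests.post(ENDPOINT, data={"update": update})
    resp.raise_for_status()

insert_ntriples(
    '<http://example.org/ds/1> <http://purl.org/dc/terms/title> "Example dataset" .'
)
```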

While revisiting d1lod and repurposing it for Slinky, I've run into two related issues with this approach:

  1. Virtuoso has an arbitrary, undocumented size limit on SPARQL statements. Their SPARQL engine simply errors out once the query string exceeds a certain length. I don't think we ran into this during the GeoLink work, and I only noticed it because a particular dataset got turned into a too-long SPARQL INSERT query.
  2. If I split the query up and insert it in batches, we run into another problem: blank nodes. If a query references a bnode as an object but the definition of that bnode (where it's the subject) ends up in the next query, Virtuoso complains. AFAICT this is a Virtuoso Open Source bug and may not apply to other triplestores.
    • This makes sense, because I don't think bnodes really work across multiple queries. I considered making each bnode a proper HTTP IRI (skolemizing; see the sketch after this list) but wanted to avoid that because I want our output to still match science-on-schema.
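For reference, here's roughly what skolemizing would look like with rdflib (just a sketch; the authority IRI is illustrative only):

```python
from rdflib import Graph

g = Graph()
g.parse(
    data="""
        @prefix schema: <https://schema.org/> .
        [] a schema:Dataset ; schema:name "Example" .
    """,
    format="turtle",
)

# skolemize() returns a copy of the graph with every blank node replaced by an
# IRI minted under `authority` (the value here is just a placeholder).
skolemized = g.skolemize(authority="https://example.org")
print(skolemized.serialize(format="nt"))
```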

SPARQL may just not be the right thing for this workload. I considered using the alternative RDF data loading methods Virtuoso provides, but it looks like all they have is a system that loads data from the local filesystem via isql commands.

I had been meaning to look at Blazegraph for a few years, and I see that it has a nice HTTP bulk data loading REST API where you can just send serialized RDF to an endpoint. We aren't using any special functionality from Virtuoso, so this might be a good point to switch.
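Roughly what that would look like (sketch only; the endpoint below is Blazegraph's stock NanoSparqlServer default, not anything we run):

```python
import requests

# Default Blazegraph namespace endpoint; an actual deployment would differ.
BLAZEGRAPH = "http://localhost:9999/blazegraph/namespace/kb/sparql"

# POST serialized RDF directly to the endpoint with an appropriate Content-Type.
with open("dataset.ttl", "rb") as f:
    resp = requests.post(
        BLAZEGRAPH,
        data=f.read(),
        headers={"Content-Type": "text/turtle"},
    )
resp.raise_for_status()
print(resp.text)  # Blazegraph responds with a short report of what was modified
```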

Feedback or thoughts welcomed. I'll update here with what I figure out.

amoeba added the enhancement and help wanted labels May 4, 2021
amoeba self-assigned this May 4, 2021

amoeba commented May 4, 2021

@ThomasThelen mentioned two things to consider if we were to swap to Blazegraph:

  1. Authentication: Virtuoso offers some authentication support we might lose if we switch
  2. GeoSPARQL support: It looks like Blazegraph supports some spatial functionality but not quite GeoSPARQL. Not sure if this is really a problem but it's worth considering. See https://github.com/blazegraph/database/wiki/GeoSpatial.

amoeba added a commit that referenced this issue May 18, 2021
Closes #30

I couldn't find a way to send very large SPARQL queries to Virtuoso, but Virtuoso does have an HTTP API that takes Turtle/N-Triples/etc. Since this is specific to Virtuoso, I've made it a separate model from SparqlTripleStore.
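For reference, loading over Virtuoso's SPARQL 1.1 Graph Store HTTP endpoint looks roughly like this (sketch only; the endpoint path, graph IRI, and credentials are placeholders and depend on how the instance is configured):

```python
import requests
from requests.auth import HTTPDigestAuth

# Placeholders: Virtuoso's authenticated Graph Store endpoint and a target graph.
# POST appends triples to the graph; PUT would replace its contents.
CRUD = "http://localhost:8890/sparql-graph-crud-auth"
GRAPH = "http://example.org/graph/datasets"

with open("dataset.ttl", "rb") as f:
    resp = requests.post(
        CRUD,
        params={"graph": GRAPH},
        data=f.read(),
        headers={"Content-Type": "text/turtle"},
        auth=HTTPDigestAuth("dba", "dba"),  # placeholder credentials
    )
resp.raise_for_status()
```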
@ThomasThelen (Member) commented:

An alternative to SPARQL requests is...

  1. Write the Turtle to disk
  2. Upload the file to WebDAV
  3. Let Virtuoso know to ingest the file into the graph (see the sketch below)
    More info here and here

Alternatively, we could possibly share storage between the triplifier and the graph; there may be a way to let Virtuoso know the path of the file on disk (you can do this with GraphDB).
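A rough sketch of that WebDAV-plus-LOAD route (all URLs, paths, and credentials are placeholders, and the auth details depend on how the Virtuoso instance is configured):

```python
import requests

# All placeholders: a DAV path Virtuoso can read, its SPARQL endpoint, a target graph.
DAV_URL = "http://localhost:8890/DAV/home/dba/dataset.ttl"
SPARQL = "http://localhost:8890/sparql"
GRAPH = "http://example.org/graph/datasets"
AUTH = ("dba", "dba")  # placeholder credentials

# Steps 1-2: write the Turtle to disk, then upload it over WebDAV (plain HTTP PUT).
with open("dataset.ttl", "rb") as f:
    requests.put(DAV_URL, data=f.read(), auth=AUTH).raise_for_status()

# Step 3: ask Virtuoso to ingest the uploaded file via a SPARQL 1.1 LOAD update.
update = f"LOAD <{DAV_URL}> INTO GRAPH <{GRAPH}>"
requests.post(SPARQL, data={"update": update}, auth=AUTH).raise_for_status()
```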

amoeba added this to the 0.3.0 milestone Feb 26, 2022