Shelve basicindex #830

Open
wants to merge 6 commits into
base: main
Conversation

jpmccu
Contributor

@jpmccu jpmccu commented May 22, 2018

Shelve is a local key-value store in Python that takes native Python objects. I have made a simple store with it that optimizes for read and write, while keeping indexing simple and local. The data structure is a tree that looks like this (including the native store):

{ "context" : { "subject": { "predicate" : set(["object"]) } } }

The context-level is the unit of storage in Shelve, and so operations are performed by reading and writing whole contexts in memory, then storing them as a unit to disk on each mutation. This gets shaky with really big contexts, but is performant when contexts aren't huge. Additionally, an LRU cache is enabled so that sequential and near-sequential mutations to the context don't require lots of disk reads. Searches that require grabbing all of a subtree from the data structure should be pretty fast. Finding all the subjects with a matching object will be an order N operation over the graph, which is worst case performance.
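
As a rough illustration of the design described above (not the code in this PR), here is a minimal sketch of a store that keeps one `shelve` entry per context, rewrites the whole context on every mutation, and keeps recently used contexts in a small LRU cache. The class name `ContextShelf`, its methods, and the `cache_size` parameter are hypothetical names chosen for the example.

```python
import os
import shelve
import tempfile
from collections import OrderedDict


class ContextShelf:
    """Hypothetical sketch: one shelve entry per context, plus a small LRU cache."""

    def __init__(self, path, cache_size=8):
        self._db = shelve.open(path)
        # context name -> {subject: {predicate: set(objects)}}
        self._cache = OrderedDict()
        self._cache_size = cache_size

    def _load(self, context):
        # Serve from the cache when possible; otherwise read the whole context from disk.
        if context in self._cache:
            self._cache.move_to_end(context)
        else:
            self._cache[context] = self._db.get(context, {})
            if len(self._cache) > self._cache_size:
                self._cache.popitem(last=False)  # evict the least recently used context
        return self._cache[context]

    def add(self, context, s, p, o):
        tree = self._load(context)
        tree.setdefault(s, {}).setdefault(p, set()).add(o)
        self._db[context] = tree  # write the whole context back on every mutation

    def objects(self, context, s, p):
        return set(self._load(context).get(s, {}).get(p, set()))

    def close(self):
        self._db.close()


# Usage: write a triple, close the store, reopen it, and read the triple back.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "graph")
    store = ContextShelf(path)
    store.add("ctx1", "alice", "knows", "bob")
    store.close()

    reopened = ContextShelf(path)
    found = reopened.objects("ctx1", "alice", "knows")
    reopened.close()
```

Because every `add` pickles the entire context, the write cost grows with the size of the context, which is the "shaky with really big contexts" trade-off mentioned above.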

Order-N transformations like data generation, simple filtering, format conversion, etc. are therefore optimal, but don't go doing complex graph queries with it. It might work for a linked data server if you don't supply a SPARQL endpoint on top of it.
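
To make the cost difference concrete, here is a small sketch over plain nested dicts shaped like the tree above. The sample data and the helper `subjects_with_object` are made up for illustration. Walking down from context to subject to predicate is a few dictionary lookups, while finding subjects by object has to visit every branch because nothing indexes objects.

```python
def subjects_with_object(graph, target):
    """Visit every context, subject and predicate; cost grows with the graph size."""
    return {
        s
        for context in graph.values()
        for s, predicates in context.items()
        for objects in predicates.values()
        if target in objects
    }


# Illustrative data in the {context: {subject: {predicate: set(objects)}}} shape.
graph = {
    "ctx1": {
        "alice": {"knows": {"bob"}, "name": {"Alice"}},
        "carol": {"knows": {"bob", "dave"}},
    },
    "ctx2": {"dave": {"knows": {"alice"}}},
}

# Cheap: follow the tree down from context to subject to predicate.
alice_knows = graph["ctx1"]["alice"]["knows"]

# Expensive: a full scan, since there is no object-first index.
knows_bob = subjects_with_object(graph, "bob")
```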

@jpmccu
Contributor Author

jpmccu commented May 23, 2018

Additional issues: changes needed for dbm on 2.7 mean breaking 3, since dbm doesn't like unicode keys. Also, whatever backend is being used in my local test doesn't scale well, so I need to find a better one. Watch this space (or not) for updates.

@gromgull added the enhancement (New feature or request), fix-in-progress, and store (Related to a store.) labels Oct 27, 2018
@RDFLib RDFLib deleted a comment from coveralls Oct 29, 2018
@white-gecko white-gecko added this to the rdflib 5.1.0 milestone Mar 16, 2020
@nicholascar
Member

@jimmccusker would you be interested in updating this PR to work with 5.0.0+? I see that it previously passed only the Python 2.7 tests and none of the 3.x tests. In 5.0.0+, you might get this to pass 3.5+ only.

@white-gecko white-gecko modified the milestones: rdflib 5.1.0, rdflib 6.0.0 May 1, 2020
@nicholascar
Member

Hi @jimmccusker, are you interested in getting this to work in Python 3.6+ / RDFlib 5.0.0? We are keen to see a couple more store implementations, as we are planning on killing off the old in-memory store that doesn't support a lot of expected features (like Turtle parsing!).

@jpmccu
Contributor Author

jpmccu commented Jul 30, 2020

I'm actually thinking of trying again with SQLite so I can use its full-text search. It should match the syntax I'm working on for a Fuseki implementation too. Do you mean the in-memory store that's the default store, or is that a different one?

@nicholascar
Member

Of the two stores in memory.py, Memory & IOMemory, I think, from memory (!) that one of them doesn't support all features, perhaps IOMemory. The performance advantages it has over Memory might be negated with a change to Python 3.6 dicts, so then no need to maintain IOMemory if it was faster but supported fewer features.

Can't quite remember all this though so will have to test out the stores' features and speeds first.

@jpmccu
Contributor Author

jpmccu commented Jul 30, 2020 via email

@nicholascar
Member

@jimmccusker are you still keen on this store? I've got a PR in to update the Sleepycat store to the newer version of (Python's wrapper for) BerkeleyDB and that works well. I'm keen to see more stores added.

@jpmccu
Contributor Author

jpmccu commented Jul 2, 2021 via email

@nicholascar nicholascar mentioned this pull request Aug 28, 2021
@westurner
Contributor

Would there be advantages to this approach instead of just https://github.com/RDFLib/rdflib-sqlalchemy with SQLite?

@nicholascar
Member

@jimmccusker I don't know much about Shelve but when you say "It would probably be SQLite based" what do you mean? Is it that you would reuse the SQLite Store, as indicated by @westurner or perhaps something else?

Wouldn't Shelve be quite a different thing to SQLite? I'm really keen for options here and need to brush up on my use of SQLite. SQLite doesn't appear much, or perhaps at all, in the main RDFlib docco.

It would be good to do a stock take of all the Stores and to perhaps list their similarities & differences in mainline RDFlib docco.

@jpmccu
Contributor Author

jpmccu commented Aug 29, 2021 via email

@jpmccu
Contributor Author

jpmccu commented Aug 29, 2021 via email

@nicholascar
Member

going through the sqlalchemy layer is probably not as performant as it could be

Well that’s the thing: I assumed a Shelve implementation using the native Shelve API would be best, but then you’d have to invent (or borrow, if you could copy from BerkeleyDB) all the CRUD equivalent functions in RDFlib-speak as well as all the SPO, PSO etc indexing. That’s what I thought you wanted to do!
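
For readers unfamiliar with what "SPO, PSO etc indexing" involves, here is a minimal in-memory sketch of permuted triple indexes. It is not RDFLib's or BerkeleyDB's implementation: the class `TripleIndex`, its methods, and the choice of the SPO, POS and OSP permutations are assumptions made for illustration.

```python
from collections import defaultdict


class TripleIndex:
    """Hypothetical sketch of permuted triple indexes (not RDFLib's code)."""

    def __init__(self):
        # Each index maps first key -> second key -> set of third components.
        self.spo = defaultdict(lambda: defaultdict(set))
        self.pos = defaultdict(lambda: defaultdict(set))
        self.osp = defaultdict(lambda: defaultdict(set))

    def add(self, s, p, o):
        # Every mutation has to update all permutations to keep them consistent.
        self.spo[s][p].add(o)
        self.pos[p][o].add(s)
        self.osp[o][s].add(p)

    def remove(self, s, p, o):
        self.spo[s][p].discard(o)
        self.pos[p][o].discard(s)
        self.osp[o][s].discard(p)

    def subjects(self, p, o):
        # Answered from the POS index without scanning all triples.
        return set(self.pos.get(p, {}).get(o, set()))


# Usage: look up subjects by predicate and object, before and after a removal.
index = TripleIndex()
index.add("alice", "knows", "bob")
index.add("carol", "knows", "bob")
before = index.subjects("knows", "bob")
index.remove("carol", "knows", "bob")
after = index.subjects("knows", "bob")
```

The extra indexes trade write cost and storage for lookups that do not need a full scan, which is the work a native Shelve store would have to take on.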

Perhaps we really do need a Store features and performance comparison table. Then we will know what, if anything, is missing.

Is this something an RPI student might be able to do, @jimmccusker?

@westurner
Contributor

westurner commented Aug 29, 2021

Here are the sqla Tables and Indexes: https://github.com/RDFLib/rdflib-sqlalchemy/blob/develop/rdflib_sqlalchemy/tables.py

Tests for SQLite:
https://github.com/RDFLib/rdflib-sqlalchemy/blob/develop/test/test_sqlalchemy_sqlite.py

500-25K triples:
https://github.com/RDFLib/rdflib-sqlalchemy/blob/develop/test/test_store_performance.py

Is shelve similar to LevelDB in key/value interface?

From https://github.com/jsonpickle/jsonpickle#security :

Security
jsonpickle should be treated the same as the Python stdlib pickle module from a security perspective.

Warning

The jsonpickle module is not secure. Only unpickle data you trust.

It is possible to construct malicious pickle data which will execute arbitrary code during unpickling. Never unpickle data that could have come from an untrusted source, or that could have been tampered with.

Consider signing data with an HMAC if you need to ensure that it has not been tampered with.

HMACs and Merkle hashes help with data integrity, but not [cryptographic] identity (which we now have W3C ld-proofs for part of)

Safer deserialization approaches, such as reading JSON directly, may be more appropriate if you are processing untrusted data.
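
The HMAC suggestion quoted above can be sketched with the standard library as follows. The key value and the helper names `dumps_signed` and `loads_signed` are hypothetical. Note that this only detects tampering by someone who does not hold the key; it does not make unpickling data from an untrusted source safe.

```python
import hashlib
import hmac
import pickle

SECRET_KEY = b"replace-with-a-real-secret"  # hypothetical key, for illustration only


def dumps_signed(obj, key=SECRET_KEY):
    payload = pickle.dumps(obj)
    tag = hmac.new(key, payload, hashlib.sha256).digest()
    return tag + payload  # 32-byte SHA-256 tag followed by the pickle bytes


def loads_signed(blob, key=SECRET_KEY):
    tag, payload = blob[:32], blob[32:]
    expected = hmac.new(key, payload, hashlib.sha256).digest()
    # Verify the tag in constant time before touching pickle.loads.
    if not hmac.compare_digest(tag, expected):
        raise ValueError("signature mismatch; refusing to unpickle")
    return pickle.loads(payload)


# Usage: a round trip succeeds, while a blob with one flipped bit is rejected.
blob = dumps_signed({"alice": {"knows": {"bob"}}})
restored = loads_signed(blob)

tampered = blob[:-1] + bytes([blob[-1] ^ 1])
try:
    loads_signed(tampered)
    rejected = False
except ValueError:
    rejected = True
```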

https://pypi.org/project/ijson/#performance-tips

https://github.com/simdjson/simdjson#performance-results

@westurner
Contributor

westurner commented Aug 29, 2021

... FWIW, Arrow + Parquet is definitely faster than pickle or JSON (because Parquet is already shaped like it needs to be in RAM); but serialization of arbitrary types does not make a proper DBMS with ACID guarantees, like SQLite.

https://arrow.apache.org/docs/python/pandas.html#handling-pandas-indexes

datetime64 indexes, too?
https://arrow.apache.org/docs/python/pandas.html#date-types


Edit: Shelve solves for persisting a dict of python objects; for when there's not enough RAM. But:

  • SEC: shelve executes unsigned code due to pickle,
  • PERF,SCAL,BUG: shelve doesn't do ordered transactions, so shelve is not safe for parallel use: if there are e.g. writes during reads, the behavior is nondeterministic due to lack of (database transaction) Isolation. From https://en.wikipedia.org/wiki/ACID re Isolation:

depending on the method used, the effects of an incomplete transaction might not even be visible to other transactions

Shouldn't rdflib-sqlalchemy already be wrapping the SQL statements that mutate the database in transaction BEGIN and COMMIT statements?
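
For comparison with shelve, here is a minimal sketch of transactional writes using Python's standard `sqlite3` module. It does not show what rdflib-sqlalchemy does internally, and the `triples` table is a made-up schema for the example. Using the connection as a context manager commits on success and rolls back if an exception escapes the block.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE triples (s TEXT, p TEXT, o TEXT)")

# The connection context manager commits on success and rolls back on error.
with conn:
    conn.execute("INSERT INTO triples VALUES (?, ?, ?)", ("alice", "knows", "bob"))

try:
    with conn:
        conn.execute("INSERT INTO triples VALUES (?, ?, ?)", ("carol", "knows", "dave"))
        raise RuntimeError("simulated failure mid-transaction")
except RuntimeError:
    pass  # the second insert was rolled back

rows = conn.execute("SELECT s, p, o FROM triples").fetchall()
conn.close()
```

Only the first insert survives, so a reader never sees the half-finished second write.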

@jpmccu
Contributor Author

jpmccu commented Aug 29, 2021 via email

@nicholascar
Member

I can see if there's an undergraduate who'd like a project like this that I can mentor

Great, well I'm happy to assist too with students and anyone else! I already have an undergrad working on RDFlib-related things now too but he is consumed with Units of Measure. Going forward, I think I want to do more at the fundamental RDFlib end - all this Store stuff, JSON-LD & Turtle serialization improvements etc. as there really are some critical issues that lots of people would benefit from solutions to.

@jpmccu
Contributor Author

jpmccu commented Aug 30, 2021 via email

@nicholascar
Member

Thanks for the link to that work Jamie: yes I know about OM and have communicated a bit with Hajo Rijgersberg about it and I'm glad to have your Python converter in mind now.

The current work we are doing is building a Python UCUM converter, based on the JavaScript one. Once that's done, we will work on a QUDT converter, which should be pretty easy compared to the UCUM one, given the conversion vectors present in QUDT. The goal is to have multi-system unit conversions available in RDFlib.

@westurner
Contributor

westurner commented Aug 30, 2021 via email

@ghost ghost mentioned this pull request May 19, 2023
8 tasks
@nicholascar
Member

@jpmccu @westurner RDFLib's gone up a couple of versions now, any continued interest here?

@jpmccu
Contributor Author

jpmccu commented Mar 20, 2024 via email

5 participants