Shelve basicindex #830

jpmccu · 2018-05-22T22:59:58Z

Shelve is a local key-value store in Python that takes native Python objects. I have made a simple store with it that optimizes for read and write, while keeping indexing simple and local. The data structure is a tree that looks like this (including the native store):

{ "context" : { "subject": { "predicate" : set(["object"]) } } }

The context level is the unit of storage in Shelve, so operations are performed by reading and writing whole contexts in memory, then storing them as a unit to disk on each mutation. This gets shaky with really big contexts, but is performant when contexts aren't huge. Additionally, an LRU cache is enabled so that sequential and near-sequential mutations to a context don't require lots of disk reads. Searches that require grabbing a whole subtree from the data structure should be pretty fast. Finding all the subjects with a matching object is an order-N operation over the graph, which is the worst case.

Order-N transformations like data generation, simple filtering, format conversion, etc. are therefore optimal, but don't go doing complex graph queries with it. It might work for a linked data server if you don't supply a SPARQL endpoint on top of it.
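To make the layout described above concrete, here is a minimal sketch of a context-per-key store on top of the standard-library `shelve` module. It is not the code in this PR: `ContextStore`, `add`, `objects`, and `subjects_with_object` are hypothetical names chosen for illustration, plain strings stand in for RDF terms, and the LRU cache is left out.

```python
import os
import shelve
import tempfile


class ContextStore:
    """Hypothetical sketch: one shelve entry per context.

    Each stored value has the shape {subject: {predicate: set(objects)}}.
    """

    def __init__(self, path):
        self.db = shelve.open(path)

    def add(self, context, s, p, o):
        # Read the whole context, mutate it in memory, write it back as a unit.
        tree = self.db.get(context, {})
        tree.setdefault(s, {}).setdefault(p, set()).add(o)
        self.db[context] = tree

    def objects(self, context, s, p):
        # Subtree lookup: one read of the context, then plain dict lookups.
        return self.db.get(context, {}).get(s, {}).get(p, set())

    def subjects_with_object(self, o):
        # There is no object index, so this scans every context: O(N).
        for context in self.db:
            for s, preds in self.db[context].items():
                if any(o in objs for objs in preds.values()):
                    yield context, s

    def close(self):
        self.db.close()


def demo():
    path = os.path.join(tempfile.mkdtemp(), "store")
    store = ContextStore(path)
    store.add("g1", "alice", "knows", "bob")
    store.add("g1", "alice", "knows", "carol")
    store.add("g2", "dave", "knows", "bob")
    found = sorted(store.subjects_with_object("bob"))
    objs = store.objects("g1", "alice", "knows")
    store.close()
    return objs, found
```

Every `add` rewrites the full context, which is why large contexts get expensive, and the object lookup has to unpickle every context in turn.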

…ut not arbitrary search.

jpmccu · 2018-05-23T00:24:24Z

Additional issues: changes needed for dbm on 2.7 mean breaking 3, since dbm doesn't like unicode keys. Also, whatever backend is being used in my local test doesn't scale well, so I need to find a better one. Watch this space (or not) for updates.

nicholascar · 2020-05-01T11:23:59Z

@jimmccusker would you be interested in updating this PR to work with 5.0.0+? I see that it previously passed only the Python 2.7 tests and none of the 3.x tests. In 5.0.0+, you might get this to pass 3.5+ only.

nicholascar · 2020-07-30T04:33:49Z

Hi @jimmccusker, are you interested in getting this to work in Python 3.6+ / RDFlib 5.0.0? We are keen to see a couple more store implementations as we are planning on killing off the old in-memory store that doesn't support a lot of expected features (like Turtle parsing!).

jpmccu · 2020-07-30T13:44:30Z

I'm actually thinking of trying again with sqlite so I can use its full text search. It should match the syntax I'm working on for a fuseki implementation too. Do you mean the in-memory store that's the default store, or is that a different one?

nicholascar · 2020-07-30T16:30:33Z

Of the two stores in memory.py, Memory & IOMemory, I think, from memory (!) that one of them doesn't support all features, perhaps IOMemory. The performance advantages it has over Memory might be negated with a change to Python 3.6 dicts, so then no need to maintain IOMemory if it was faster but supported fewer features.

Can't quite remember all this though so will have to test out the stores' features and speeds first.

jpmccu · 2020-07-30T17:14:23Z

If you're worried about performance, for what it's worth I've brought the default memory store in py3 up over 1 billion triples.


nicholascar · 2021-07-02T11:33:20Z

@jimmccusker are you still keen on this store? I've got a PR in to update the Sleepycat store to the newer version of (Python's wrapper for) BerkeleyDB and that works well. I'm keen to see more stores added.

jpmccu · 2021-07-02T13:47:45Z

I haven't had the chance to work on it. It would probably be SQLite based, as I've found some unexpectedly expensive operations with shelve. Jamie


westurner · 2021-08-28T12:46:03Z

Would there be advantages to this approach instead of just https://github.com/RDFLib/rdflib-sqlalchemy with SQLite?

nicholascar · 2021-08-29T01:37:58Z

@jimmccusker I don't know much about Shelve but when you say "It would probably be SQLite based" what do you mean? Is it that you would reuse the SQLite Store, as indicated by @westurner or perhaps something else?

Wouldn't Shelve be quite a different thing to SQLite? I'm really keen for options here and need to brush up on my use of SQLite. The SQLite store doesn't appear much, or perhaps at all, in the main RDFlib docco.

It would be good to do a stock take of all the Stores and to perhaps list their similarities & differences in mainline RDFlib docco.

jpmccu · 2021-08-29T02:35:58Z

If there's a SQLite store, it's probably already doing better what I'd try to do. If benchmarks say it's comparable to sleepycat, we should just go with that.


jpmccu · 2021-08-29T02:37:10Z

But I'm guessing that going through the sqlalchemy layer is probably not as performant as it could be. I haven't investigated, though.


nicholascar · 2021-08-29T03:56:12Z

> going through the sqlalchemy layer is probably not as performant as it could be

Well that’s the thing: I assumed a Shelve implementation using the native Shelve API would be best, but then you’d have to invent (or borrow, if you could copy from BerkeleyDB) all the CRUD equivalent functions in RDFlib-speak as well as all the SPO, PSO etc indexing. That’s what I thought you wanted to do!
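For readers unfamiliar with the indexing mentioned here, a minimal in-memory sketch of permuted triple indexes follows. It only illustrates the general idea and is not RDFLib's or the BerkeleyDB store's actual code: `TripleIndex` and its methods are hypothetical names, the choice of SPO, POS, and OSP orderings is an assumption, and `None` stands for a wildcard.

```python
from collections import defaultdict


class TripleIndex:
    """Hypothetical sketch: the same triples kept under three orderings."""

    def __init__(self):
        self.spo = defaultdict(lambda: defaultdict(set))
        self.pos = defaultdict(lambda: defaultdict(set))
        self.osp = defaultdict(lambda: defaultdict(set))

    def add(self, s, p, o):
        # Every write touches all three indexes.
        self.spo[s][p].add(o)
        self.pos[p][o].add(s)
        self.osp[o][s].add(p)

    def triples(self, s=None, p=None, o=None):
        # Pick the index whose first key is bound, so no full scan is needed.
        if s is not None:
            for p2, objs in self.spo.get(s, {}).items():
                if p is None or p2 == p:
                    for o2 in objs:
                        if o is None or o2 == o:
                            yield (s, p2, o2)
        elif p is not None:
            for o2, subjs in self.pos.get(p, {}).items():
                if o is None or o2 == o:
                    for s2 in subjs:
                        yield (s2, p, o2)
        elif o is not None:
            for s2, preds in self.osp.get(o, {}).items():
                for p2 in preds:
                    yield (s2, p2, o)
        else:
            # Nothing bound: enumerate everything from one index.
            for s2, preds in self.spo.items():
                for p2, objs in preds.items():
                    for o2 in objs:
                        yield (s2, p2, o2)
```

The trade-off is the usual one: each extra ordering costs more writes and storage, and in return a pattern with any single bound term avoids a scan of the whole graph.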

Perhaps we really do need a Store features and performance comparison table. Then we will know what, if anything, is missing.

Is this something an RPI student might be able to do, @jimmccusker?

westurner · 2021-08-29T05:08:01Z

Here are the sqla Tables and Indexes: https://github.com/RDFLib/rdflib-sqlalchemy/blob/develop/rdflib_sqlalchemy/tables.py

Tests for SQLite:
https://github.com/RDFLib/rdflib-sqlalchemy/blob/develop/test/test_sqlalchemy_sqlite.py

500-25K triples:
https://github.com/RDFLib/rdflib-sqlalchemy/blob/develop/test/test_store_performance.py

Is shelve similar to LevelDB in key/value interface?

https://github.com/python/cpython/blob/main/Lib/shelve.py
- Someday pickle could be extended to load data but not code: https://github.com/python/cpython/blob/main/Lib/pickle.py#L1497-L1539
- jsonpickle
- ijson +& simdjson for performance
https://github.com/RDFLib/rdflib-leveldb/blob/master/rdflib_leveldb/leveldbstore.py
https://github.com/cosmos/iavl

From https://github.com/jsonpickle/jsonpickle#security :

Security
jsonpickle should be treated the same as the Python stdlib pickle module from a security perspective.

Warning

The jsonpickle module is not secure. Only unpickle data you trust.

It is possible to construct malicious pickle data which will execute arbitrary code during unpickling. Never unpickle data that could have come from an untrusted source, or that could have been tampered with.

Consider signing data with an HMAC if you need to ensure that it has not been tampered with.

HMACs and Merkle hashes help with data integrity, but not [cryptographic] identity (which we now have W3C ld-proofs for part of)

Safer deserialization approaches, such as reading JSON directly, may be more appropriate if you are processing untrusted data.
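As a rough illustration of the HMAC suggestion quoted above, here is a minimal sketch using only the standard library. `dumps_signed` and `loads_signed` are hypothetical helper names, not part of pickle, jsonpickle, or shelve, and the sketch assumes the secret key is kept away from whoever can write the data.

```python
import hashlib
import hmac
import pickle

TAG_LEN = hashlib.sha256().digest_size  # 32 bytes for SHA-256


def dumps_signed(obj, key):
    # Prefix the pickled payload with an HMAC-SHA256 tag over that payload.
    payload = pickle.dumps(obj)
    tag = hmac.new(key, payload, hashlib.sha256).digest()
    return tag + payload


def loads_signed(blob, key):
    # Verify the tag before unpickling; compare_digest avoids timing leaks.
    tag, payload = blob[:TAG_LEN], blob[TAG_LEN:]
    expected = hmac.new(key, payload, hashlib.sha256).digest()
    if not hmac.compare_digest(tag, expected):
        raise ValueError("signature mismatch; refusing to unpickle")
    return pickle.loads(payload)
```

This only detects tampering by someone who does not hold the key. It does not make unpickling safe if the key leaks or if the signer itself pickled something malicious.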

https://pypi.org/project/ijson/#performance-tips

https://github.com/simdjson/simdjson#performance-results

westurner · 2021-08-29T05:20:17Z

... FWIW, Arrow + Parquet is definitely faster than pickle or JSON (because Parquet is already shaped like it needs to be in RAM); but serialization of arbitrary types does not make a proper DBMS with ACID guarantees, like SQLite.

https://arrow.apache.org/docs/python/pandas.html#handling-pandas-indexes

datetime64 indexes, too?
https://arrow.apache.org/docs/python/pandas.html#date-types

Edit: Shelve solves for persisting a dict of python objects; for when there's not enough RAM. But:

SEC: shelve executes unsigned code due to pickle,
PERF,SCAL,BUG: shelve doesn't do ordered transactions, so shelve is not safe for parallel use: if there are e.g. writes during reads, the behavior is nondeterministic due to lack of (database transaction) Isolation. From https://en.wikipedia.org/wiki/ACID re Isolation:

depending on the method used, the effects of an incomplete transaction might not even be visible to other transactions

Shouldn't rdflib-sqlalchemy already be wrapping the SQL statements that mutate the database in transaction BEGIN and COMMIT?
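For comparison, this is roughly what transactional writes look like with the standard-library `sqlite3` module. It is a sketch only and says nothing about what rdflib-sqlalchemy actually does: the `triples` table and the `add_triples` helper are hypothetical.

```python
import sqlite3


def add_triples(conn, triples):
    # "with conn" commits on success and rolls back if an exception escapes.
    with conn:
        conn.executemany(
            "INSERT INTO triples (s, p, o) VALUES (?, ?, ?)", triples
        )


def demo():
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE triples (s TEXT, p TEXT, o TEXT, UNIQUE (s, p, o))"
    )
    add_triples(conn, [("a", "knows", "b")])
    try:
        # The second row violates the UNIQUE constraint, so the whole
        # batch is rolled back, including the first row.
        add_triples(conn, [("c", "knows", "d"), ("a", "knows", "b")])
    except sqlite3.IntegrityError:
        pass
    rows = conn.execute("SELECT s, p, o FROM triples ORDER BY s").fetchall()
    conn.close()
    return rows
```

A failed batch leaves the table as it was before the batch started, which is the atomicity that a plain shelve file does not provide.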

jpmccu · 2021-08-29T15:17:48Z

The major issue with shelve is that it is expensive to iterate keys. The documentation doesn't explain how bad, but I've definitely seen performance be far worse than iterating through a key list file. I can see if there's an undergraduate who'd like a project like this that I can mentor. Thanks, Jamie


nicholascar · 2021-08-30T00:14:09Z

> I can see if there's an undergraduate who'd like a project like this that I can mentor

Great, well I'm happy to assist too with students and anyone else! I already have an undergrad working on RDFlib-related things now too but he is consumed with Units of Measure. Going forward, I think I want to do more at the fundamental RDFlib end - all this Store stuff, JSON-LD & Turtle serialization improvements etc. as there really are some critical issues that lots of people would benefit from solutions to.

jpmccu · 2021-08-30T02:22:55Z

We have something we did for units of measure conversion, but it's Ontology specific. https://pypi.org/project/whyis-unit-converter/


nicholascar · 2021-08-30T03:13:05Z

Thanks for the link to that work Jamie: yes I know about OM and have communicated a bit with Hajo Rijgersberg about it and I'm glad to have your Python converter in mind now.

The current work we are doing is building a Python UCUM converter, based on the JavaScript one. Once that's done, we will work on a QUDT converter, which should be pretty easy compared to the UCUM one, given the conversion vectors present in QUDT. The goal is to have multi-system unit conversions available in RDFlib.

westurner · 2021-08-30T07:17:51Z

- [ ] how to publish a Dataset as CSVW Linked Data with units of measure
- [ ] how to specify the units of measure of a CSV/CSVW *column* as a URI
  - http://wrdrd.github.io/docs/consulting/units#csvw-and-units
  - http://wrdrd.github.io/docs/consulting/linkedreproducibility#csv-csvw-and-metadata-rows

    ## CSV, CSVW, and metadata rows
    A data table with 7 metadata header rows (column label, property URI path, DataType, unit, accuracy, precision, significant figures)

- [ ] how to publish a ScholarlyArticle of StructuredPremises like Datasets {from a Jupyter-Book in a repo2docker REES container image}
  - "#LinkedReproducibility"
  - #LinkedResearch


nicholascar · 2024-03-20T02:51:08Z

@jpmccu @westurner RDFLib's gone up a couple of versions now, any continued interest here?

jpmccu · 2024-03-20T03:19:46Z

Not in the near future. I've been using the OxiGraph store for similar use cases.

jpmccu added 6 commits May 22, 2018 17:07

Added tests for sleepycat and a new store optimized for read/write, b…

d2744bc

…ut not arbitrary search.

Fixes for all tests passing on the Shelf store.

2ae62f2

Including the actual module.

3d1c28e

Added unicode conversion for some backends, directory-based storage.

984f693

Added cleaner unicode conversion.

3745857

Looks like bsddb isn't available on travis, missed a unicode conversion.

9005622

gromgull added the enhancement, fix-in-progress, and store labels Oct 27, 2018

RDFLib deleted a comment from coveralls Oct 29, 2018

white-gecko added this to the rdflib 5.1.0 milestone Mar 16, 2020

white-gecko modified the milestones: rdflib 5.1.0, rdflib 6.0.0 May 1, 2020

nicholascar mentioned this pull request Aug 28, 2021

Configure pytest #1268

Closed

white-gecko modified the milestones: rdflib 6.x.x, 2022 June release Jun 20, 2022

ghost mentioned this pull request May 19, 2023

Sqlitedbstore #2380

Closed

8 tasks
