Shelve basicindex #830
…ut not arbitrary search.
Additional issues: changes needed for dbm on 2.7 mean breaking 3, since dbm doesn't like unicode keys. Also, whatever backend is being used in my local test doesn't scale well, so I need to find a better one. Watch this space (or not) for updates. |
@jimmccusker would you be interested in updating this PR to work with 5.0.0+? I see that it previously passed only the Python 2.7 tests and none of the 3.x tests. In 5.0.0+, you might get this to pass 3.5+ only. |
Hi @jimmccusker, are you interested in getting this to work in Python 3.6+ / RDFlib 5.0.0? We are keen to see a couple more store implementations, as we are planning on killing off the old in-memory store that doesn't support a lot of expected features (like Turtle parsing!). |
I'm actually thinking of trying again with sqlite, to use its full text search. It should match the syntax I'm working on for a fuseki implementation too. Do you mean the in-memory store that's the default store, or is that a different one? |
Of the two stores in memory.py, Memory & IOMemory, I think, from memory (!), that one of them doesn't support all features, perhaps IOMemory. The performance advantages it has over Memory might be negated with a change to Python 3.6 dicts, so then no need to maintain IOMemory if it was faster but supported fewer features. Can't quite remember all this though, so will have to test out the stores' features and speeds first. |
If you're worried about performance, for what it's worth I've brought the
default memory store in py3 up over 1 billion triples.
--
Jim McCusker
Director, Data Operations
Tetherless World Constellation
Rensselaer Polytechnic Institute
mccusj2@rpi.edu <mccusj@cs.rpi.edu>
http://tw.rpi.edu
|
@jimmccusker are you still keen on this store? I've got a PR in to update the Sleepycat store to the newer version of (Python's wrapper for) BerkeleyDB and that works well. I'm keen to see more stores added. |
I haven't had the chance to work on it. It would probably be SQLite based,
as I've found some unexpectedly expensive operations with shelve.
Jamie
--
Jamie McCusker (she/they)
Director, Data Operations
Tetherless World Constellation
Rensselaer Polytechnic Institute
***@***.*** ***@***.***>
http://tw.rpi.edu
|
Would there be advantages to this approach instead of just https://github.com/RDFLib/rdflib-sqlalchemy with SQLite? |
@jimmccusker I don't know much about Shelve, but when you say "It would probably be SQLite based", what do you mean? Is it that you would reuse the SQLite Store, as indicated by @westurner, or perhaps something else? Wouldn't Shelve be quite a different thing to SQLite? I'm really keen for options here and need to brush up on my use of SQLite. The SQLite Store doesn't appear much, or perhaps at all, in the main RDFlib docco. It would be good to do a stocktake of all the Stores and perhaps list their similarities & differences in mainline RDFlib docco. |
If there's a SQLite store, it's probably already doing better what I'd try
to do. If benchmarks say it's comparable to sleepycat, we should just go
with that.
|
But I'm guessing that going through the sqlalchemy layer is probably not as
performant as it could be. I haven't investigated, though.
|
Well that’s the thing: I assumed a Shelve implementation using the native Shelve API would be best, but then you’d have to invent (or borrow, if you could copy from BerkeleyDB) all the CRUD-equivalent functions in RDFlib-speak, as well as all the SPO, PSO etc. indexing. That’s what I thought you wanted to do! Perhaps we really do need a Store features and performance comparison table. Then we will know what, if anything, is missing. Is this something an RPI student might be able to do, @jimmccusker? |
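For anyone following along, here is a rough sketch of what "all the SPO, PSO etc. indexing" amounts to. It is not taken from this PR or from any existing RDFlib store; the class name `TripleIndex` and the choice of three nested-dict indexes (SPO, POS, OSP) are assumptions made purely for illustration.

```python
from collections import defaultdict


class TripleIndex:
    """Hypothetical in-memory index: three nested dicts so that any pattern
    with at least one bound term can start from a direct lookup."""

    def __init__(self):
        self.spo = defaultdict(lambda: defaultdict(set))  # subject -> predicate -> objects
        self.pos = defaultdict(lambda: defaultdict(set))  # predicate -> object -> subjects
        self.osp = defaultdict(lambda: defaultdict(set))  # object -> subject -> predicates

    def add(self, s, p, o):
        # Every triple is written to all three indexes.
        self.spo[s][p].add(o)
        self.pos[p][o].add(s)
        self.osp[o][s].add(p)

    def remove(self, s, p, o):
        # discard() is a no-op when the triple is absent.
        self.spo[s][p].discard(o)
        self.pos[p][o].discard(s)
        self.osp[o][s].discard(p)

    def triples(self, s=None, p=None, o=None):
        """Yield matching (s, p, o) tuples; None acts as a wildcard."""
        if s is not None:
            # .get() avoids creating empty entries in the defaultdict.
            for p2, objs in self.spo.get(s, {}).items():
                if p is None or p2 == p:
                    for o2 in objs:
                        if o is None or o2 == o:
                            yield (s, p2, o2)
        elif p is not None:
            for o2, subjs in self.pos.get(p, {}).items():
                if o is None or o2 == o:
                    for s2 in subjs:
                        yield (s2, p, o2)
        elif o is not None:
            for s2, preds in self.osp.get(o, {}).items():
                for p2 in preds:
                    yield (s2, p2, o)
        else:
            # Nothing bound: full scan of the SPO index.
            for s2, preds in self.spo.items():
                for p2, objs in preds.items():
                    for o2 in objs:
                        yield (s2, p2, o2)
```

The cost of this design is that every write touches three structures, which is the part a key-value backend such as Shelve would have to persist somehow.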
Here are the sqla Tables and Indexes: https://github.com/RDFLib/rdflib-sqlalchemy/blob/develop/rdflib_sqlalchemy/tables.py
Tests for SQLite: 500-25K triples:
Is shelve similar to LevelDB in key/value interface?
From https://github.com/jsonpickle/jsonpickle#security :
HMACs and Merkle hashes help with data integrity, but not [cryptographic] identity (which we now have W3C ld-proofs for part of)
|
... FWIW, Arrow + Parquet is definitely faster than pickle or JSON (because Parquet is already shaped like it needs to be in RAM); but serialization of arbitrary types does not make a proper DBMS with ACID guarantees, like SQLite. https://arrow.apache.org/docs/python/pandas.html#handling-pandas-indexes datetime64 indexes, too? https://arrow.apache.org/docs/python/pandas.html#date-types Edit: Shelve solves for persisting a dict of Python objects, for when there's not enough RAM. But:
rdflib-sqlalchemy should already be wrapping with a transaction |
The major issue with shelve is that it is expensive to iterate keys. The
documentation doesn't explain how bad, but I've definitely seen performance
be far worse than iterating through a key list file. I can see if there's
an undergraduate who'd like a project like this that I can mentor.
Thanks,
Jamie
|
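One way to read the "key list file" comparison above is sketched below: keep an append-only text file of keys next to the shelf, so listing keys never has to walk the dbm backend. This is only an illustration, not code from the PR. The class name `IndexedShelf` and the `.keys` file suffix are made up, and the sketch assumes keys contain no newlines and are never deleted.

```python
import os
import shelve
import tempfile


class IndexedShelf:
    """Hypothetical wrapper: a shelf plus a plain-text, append-only key list."""

    def __init__(self, path):
        self._shelf = shelve.open(path)
        self._keys_path = path + ".keys"  # assumed naming, one key per line
        self._keys = set()
        # Reload the key list left behind by a previous session, if any.
        if os.path.exists(self._keys_path):
            with open(self._keys_path, encoding="utf-8") as f:
                self._keys = {line.rstrip("\n") for line in f}

    def __setitem__(self, key, value):
        if key not in self._keys:
            # Only new keys are appended; overwrites leave the file alone.
            with open(self._keys_path, "a", encoding="utf-8") as f:
                f.write(key + "\n")
            self._keys.add(key)
        self._shelf[key] = value

    def __getitem__(self, key):
        return self._shelf[key]

    def keys(self):
        """Return the known keys without iterating the shelf itself."""
        return sorted(self._keys)

    def close(self):
        self._shelf.close()


# Usage: write two entries, then list keys from the side file.
demo_path = os.path.join(tempfile.mkdtemp(), "store")
demo = IndexedShelf(demo_path)
demo["ctx1"] = {"a": 1}
demo["ctx2"] = {"b": 2}
demo_keys = demo.keys()
demo.close()
```

Whether this actually beats iterating the shelf depends on the dbm backend in use, so it would need benchmarking before anyone relies on it.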
Great, well I'm happy to assist too with students and anyone else! I already have an undergrad working on RDFlib-related things now too but he is consumed with Units of Measure. Going forward, I think I want to do more at the fundamental RDFlib end - all this Store stuff, JSON-LD & Turtle serialization improvements etc. as there really are some critical issues that lots of people would benefit from solutions to. |
We have something we did for units of measure conversion, but it's Ontology
specific.
https://pypi.org/project/whyis-unit-converter/
|
Thanks for the link to that work, Jamie: yes, I know about OM and have communicated a bit with Hajo Rijgersberg about it, and I'm glad to have your Python converter in mind now. The current work we are doing is building a Python UCUM converter, based on the JavaScript one (https://github.com/lhncbc/ucum-lhc). Once that's done, we will work on a QUDT (http://qudt.org/) converter, which should be pretty easy compared to the UCUM one, given the conversion vectors present in QUDT. The goal is to have multi-system unit conversions available in RDFlib. |
- [ ] how to publish a Dataset as CSVW Linked Data with units of measure
- [ ] how to specify the units of measure of a CSV/CSVW *column* as a URI
- http://wrdrd.github.io/docs/consulting/units#csvw-and-units
http://wrdrd.github.io/docs/consulting/linkedreproducibility#csv-csvw-and-metadata-rows
## CSV, CSVW, and metadata rows
A data table with 7 metadata header rows (column label, property URI
path, DataType, unit, accuracy, precision, significant figures)
- [ ] how to publish a ScholarlyArticle of StructuredPremises like Datasets
{from a Jupyter-Book in a repo2docker REES container image}
- "#LinkedReproducibility"
- #LinkedResearch
|
@jpmccu @westurner RDFLib's gone up a couple of versions now, any continued interest here? |
Not in the near future. I've been using the OxiGraph store for similar use cases.
|
Shelve is a local key-value store in Python that holds native Python objects. I have made a simple store with it that optimizes for read and write, while keeping indexing simple and local. The data structure is a tree that looks like (including the native store):
The context level is the unit of storage in Shelve, so operations are performed by reading and writing whole contexts in memory, then storing them as a unit to disk on each mutation. This gets shaky with really big contexts, but is performant when contexts aren't huge. Additionally, an LRU cache is enabled so that sequential and near-sequential mutations to the context don't require lots of disk reads. Searches that require grabbing all of a subtree from the data structure should be pretty fast. Finding all the subjects with a matching object will be an O(N) operation over the graph, which is the worst-case performance.
Order-N transformations like data generation, simple filtering, format conversion, etc. are therefore optimal, but don't go doing complex graph queries with it. It might work for a linked data server if you don't supply a SPARQL endpoint on top of it.
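To make the description above concrete, here is a minimal sketch of the general idea: one shelf entry per context, a small LRU cache of loaded contexts, and a write of the whole context on every mutation. It is not the code in this PR. The tree layout is not shown above, so the layout used here (context -> subject -> predicate -> set of objects), the class name `ContextShelf`, and the `cache_size` parameter are all assumptions.

```python
import os
import shelve
import tempfile
from collections import OrderedDict


class ContextShelf:
    """Hypothetical store: each context is one pickled tree in the shelf."""

    def __init__(self, path, cache_size=4):
        self._shelf = shelve.open(path)
        self._cache = OrderedDict()  # context id -> tree, least recently used first
        self._cache_size = cache_size

    def _load(self, ctx):
        if ctx in self._cache:
            self._cache.move_to_end(ctx)  # mark as most recently used
        else:
            self._cache[ctx] = self._shelf.get(ctx, {})  # one disk read per miss
            if len(self._cache) > self._cache_size:
                # Safe to evict: writes below are write-through.
                self._cache.popitem(last=False)
        return self._cache[ctx]

    def add(self, ctx, s, p, o):
        tree = self._load(ctx)
        tree.setdefault(s, {}).setdefault(p, set()).add(o)
        self._shelf[ctx] = tree  # the whole context is re-pickled on each mutation

    def triples(self, ctx, s=None, p=None, o=None):
        """Yield (ctx, s, p, o) matches within one context; None is a wildcard."""
        tree = self._load(ctx)
        for s2 in ([s] if s is not None else list(tree)):
            preds = tree.get(s2, {})
            for p2 in ([p] if p is not None else list(preds)):
                for o2 in preds.get(p2, ()):
                    # With only the object bound, this scans the whole context.
                    if o is None or o2 == o:
                        yield (ctx, s2, p2, o2)

    def close(self):
        self._shelf.close()


# Usage: subject-bound lookups walk one subtree; object-bound lookups scan.
demo_store = ContextShelf(os.path.join(tempfile.mkdtemp(), "graph"))
demo_store.add("g1", "a", "knows", "b")
demo_hits = list(demo_store.triples("g1", s="a"))
demo_store.close()
```

In this sketch a lookup with a bound subject only touches that subject's subtree, while a lookup by object alone has to walk every subject and predicate in the context, which is the order-N case described above.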