Discovery: data consistency and resource usage #259
Even though it's early in the process of this discovery, I'd appreciate any (and early) feedback. Thank you :)
You still seem to be missing the most obvious optimisation: gitoxide currently
maps both the index and the data file eagerly. There is no good reason to do
this, except for very specialised use cases (e.g. implementing `git repack`). I
think it is fair to say that the majority of applications will **not** know
upfront which objects (and therefore which packs) they will access. Those
applications will most likely benefit from mapping data files on demand, even if
you don't provide a thread-safe implementation and don't account for concurrent
modifications of the odb.
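The on-demand mapping suggested above could be sketched along these lines. This is a hypothetical illustration, not gitoxide's actual API: the `Pack` type and its methods are invented for the example, and a `Vec<u8>` stands in for a real memory map (which would come from something like the `memmap2` crate):

```rust
use std::sync::OnceLock;

/// Hypothetical sketch of lazy pack-data loading: the index would be
/// available immediately, while the data "map" is produced only on
/// first access and cached for all later accesses.
struct Pack {
    id: u32,
    data: OnceLock<Vec<u8>>, // stand-in for a lazily created memory map
}

impl Pack {
    fn new(id: u32) -> Self {
        Pack { id, data: OnceLock::new() }
    }

    /// Map (here: fabricate) the data file on demand; subsequent calls
    /// return the already-initialized map without redoing the work.
    fn data(&self) -> &[u8] {
        self.data.get_or_init(|| {
            // A real implementation would open and mmap the pack file here.
            vec![self.id as u8; 4]
        })
    }

    fn is_loaded(&self) -> bool {
        self.data.get().is_some()
    }
}

fn main() {
    let pack = Pack::new(7);
    assert!(!pack.is_loaded()); // nothing mapped yet
    assert_eq!(pack.data(), &[7, 7, 7, 7]); // mapped on first use
    assert!(pack.is_loaded());
    println!("lazy load ok");
}
```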
Also, your definition of scalability does not consider that resource accounting
is independent of the number of available cores, so I would find it appropriate
not to make bold claims to that end.
…which definitely is the use case for the current implementation. It's clear it would benefit from lazy loading too, even though that's not the most important thing right now.
Thanks for your help, it's appreciated :).
To me, avoiding unconditional interior mutability while supporting lazy loading of packs is what allows thread-safe sharing of memory maps, scaling object access with the number of cores. Maybe this is a special case, or an optimization many will skip. I highlight the word 'unconditional' because I, too, believe lazy loading should be supported, probably even as the default, but it should not be the only option. The use cases I mention in the discovery will try to make a case for different options to handle this, and I hope to have a technical sketch for a solution that serves all of them soon.
I'd be glad if you could point me to this statement so it can be adjusted, as I agree that it shouldn't even be implied that resource accounting is somehow dependent on the number of cores.
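To illustrate the kind of sharing meant here: once a map is loaded, it can be handed to readers behind an `Arc`, so concurrent object access scales with cores without locking or copying the map itself. This is a hedged, self-contained sketch in which a `Vec<u8>`-backed slice stands in for a real memory map:

```rust
use std::sync::Arc;
use std::thread;

fn main() {
    // A "loaded pack map", shared immutably across threads.
    let map: Arc<[u8]> = Arc::from(vec![1u8, 2, 3, 4].into_boxed_slice());

    let handles: Vec<_> = (0..4)
        .map(|_| {
            let map = Arc::clone(&map);
            // Each thread reads the shared map concurrently; there is no
            // mutation, so no interior mutability is needed after loading.
            thread::spawn(move || map.iter().map(|&b| b as u32).sum::<u32>())
        })
        .collect();

    for h in handles {
        assert_eq!(h.join().unwrap(), 10);
    }
    println!("shared maps across threads ok");
}
```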
Even though I definitely don't like all that complexity :/
…259) Because it really is a storage location for shared data that doesn't do anything on its own.
But the best part is that auto-refreshing Policies should be the new default.
Already showing some issues, but I think it can be done smartly nonetheless.
However, dyn traits can't be combined with non-auto traits, which prevents this. Maybe a macro would do, though.
The goal is to make thread-safety togglable via cargo features, getting type complexity down to just a single type in `git-repository` without needing more feature toggles in parent crates. This should actually work; this is an intermediate commit. Something that will change is the settings of `Easy`, as there will probably only be an `Easy` and an `EasyShared`. I don't know yet how the types can be filled in based on a feature toggle, though; maybe it can work with more typedefs.
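One way such a feature toggle could collapse everything to a single exported type is a conditional type alias. The `threading` feature name and the `MapRef`/`Policy` names below are purely illustrative assumptions: without the feature the alias compiles to `Rc`, with it to `Arc`, and downstream code only ever sees `MapRef`:

```rust
// Hypothetical sketch: one exported alias, switched by a cargo feature,
// so parent crates never need their own feature toggles.
#[cfg(feature = "threading")]
pub type MapRef<T> = std::sync::Arc<T>;
#[cfg(not(feature = "threading"))]
pub type MapRef<T> = std::rc::Rc<T>;

// Downstream code is written once against the alias.
pub struct Policy {
    pub packs: MapRef<Vec<u32>>,
}

fn main() {
    let policy = Policy { packs: MapRef::new(vec![1, 2, 3]) };
    // Cloning the handle is cheap in either mode (Rc or Arc refcount bump).
    let shared = MapRef::clone(&policy.packs);
    assert_eq!(shared.as_slice(), &[1, 2, 3]);
    println!("single exported type works in either mode");
}
```

Since `Rc` is not `Send`/`Sync`, this only makes the non-threaded build lighter; it cannot make one binary support both modes at once, which is presumably why the `Easy`/`EasyShared` split still exists.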
…but it shows that ideally we just export the correct `Policy` type (i.e. one with `Send` and `Sync` requirements as needed) to avoid having to do some conversion acrobatics when we want to use it later.
After testing performance, we can either decide to allow multiple impls thanks to type parameters, or just use one and keep them out of it.
…hrough a reference (#259) This is quite an amazing result, and maybe it's just an M1 thing, but it clearly shows that modern CPUs can handle this pretty well. We still keep alive the idea of making thread-safety switchable, to avoid forcing overhead onto other architectures or onto those who don't want to use threads.
… no-arc (#259) However, that cost only shows up when doing contains checks, which is usually not what one does; in standard object-access operations, going through an `Arc` (and even an `Arc`-wrapped lock) isn't a problem. Going through a `Mutex` is slow, though, so it's probably better to use read locks when possible and upgrade them only when needed.
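The 'read lock, upgrade when needed' idea could look roughly like the following double-checked pattern over `std::sync::RwLock`. All names here are hypothetical, and `Vec<u8>` again stands in for a mapped pack:

```rust
use std::sync::RwLock;

/// Hypothetical sketch: object access takes a cheap read lock; only a
/// miss (pack not yet loaded) falls back to the write lock to load it.
struct PackSlot {
    data: RwLock<Option<Vec<u8>>>,
}

impl PackSlot {
    fn get_or_load(&self, load: impl Fn() -> Vec<u8>) -> Vec<u8> {
        // Fast path: many readers proceed in parallel.
        if let Some(d) = self.data.read().unwrap().as_ref() {
            return d.clone();
        }
        // Slow path: "upgrade" by re-acquiring as a writer, and re-check
        // so we don't load twice if another thread won the race.
        let mut guard = self.data.write().unwrap();
        if guard.is_none() {
            *guard = Some(load());
        }
        guard.as_ref().unwrap().clone()
    }
}

fn main() {
    let slot = PackSlot { data: RwLock::new(None) };
    let d = slot.get_or_load(|| vec![42]);
    assert_eq!(d, vec![42]); // loaded once, served from the slot after
    println!("read-mostly locking ok");
}
```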
#259) For this we skip the eager portion and go straight to lazy loading, while introducing a flag that should help keep packs (as well as indices) available indefinitely. I imagine the coordinator of readers (upload-pack) would check the number of open handles and the amount of reuse from time to time, and replace it with a new one to clear handles. This also means this must not be implemented as a per-call flag, but as a setting on the policy.
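A speculative sketch of that coordinator idea: the number of live `Arc` references approximates the open handles, and swapping in a fresh policy "clears" them from the coordinator's point of view, while old handles keep their policy alive until they are dropped. All names below are assumptions for illustration:

```rust
use std::sync::Arc;

/// Hypothetical policy with the keep-alive behaviour as a policy-level
/// setting, not a per-call flag.
struct Policy {
    keep_packs_alive: bool,
}

struct Coordinator {
    policy: Arc<Policy>,
}

impl Coordinator {
    /// `strong_count - 1` approximates the number of live reader
    /// handles, since the coordinator itself holds one reference.
    fn open_handles(&self) -> usize {
        Arc::strong_count(&self.policy) - 1
    }

    /// Replace the shared policy: new readers get the fresh one, old
    /// handles keep the previous policy alive until dropped.
    fn refresh(&mut self) {
        self.policy = Arc::new(Policy { keep_packs_alive: true });
    }
}

fn main() {
    let mut coordinator = Coordinator {
        policy: Arc::new(Policy { keep_packs_alive: true }),
    };
    assert!(coordinator.policy.keep_packs_alive);
    let reader = Arc::clone(&coordinator.policy); // one reader handle
    assert_eq!(coordinator.open_handles(), 1);
    coordinator.refresh(); // "clears" handles from the coordinator's view
    assert_eq!(coordinator.open_handles(), 0);
    drop(reader); // the old policy is freed once its last handle goes away
    println!("coordinator refresh ok");
}
```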
…entual pack lookup (#259)
…pack loading states need some more detail to be able to decide what to do.
Probably rare, but maybe the reconciliation can handle that without overhead, and if so, it's just a benefit.
…led (#259) Now even in the Store, packs belong to indices. Indices are sparse in the Policy but dense in the Store. Packs for multi-pack indices are always dense at first but become less dense over time. Not quite sure how changes are communicated in the case of multi-pack indices.
…and run into a hopefully fixable lifetime issue
Since packs can only be loaded after loading indices, one will have a marker that has to be used to check for changes that can't be reconciled. Note that this is not to protect against querying packs that aren't available on disk anymore, as we will never unload them anyway when in a mode that needs pack-ids to remain stable.
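The marker could be as simple as a generation counter on the shared store, checked by each handle before it trusts cached pack-ids. This is a speculative sketch with all names assumed, not the actual design:

```rust
use std::sync::atomic::{AtomicU32, Ordering};

/// Hypothetical shared store that bumps a generation whenever indices
/// change in a way handles must notice (e.g. after a repack).
struct Store {
    generation: AtomicU32,
}

/// A handle remembers the generation it last reconciled against.
struct Handle<'a> {
    store: &'a Store,
    seen_generation: u32,
}

impl<'a> Handle<'a> {
    fn new(store: &'a Store) -> Self {
        Handle {
            store,
            seen_generation: store.generation.load(Ordering::Acquire),
        }
    }

    /// Before trusting cached pack-ids, check whether the store changed
    /// underneath us, in which case a refresh/reconciliation is needed.
    fn is_stale(&self) -> bool {
        self.store.generation.load(Ordering::Acquire) != self.seen_generation
    }
}

fn main() {
    let store = Store { generation: AtomicU32::new(0) };
    let handle = Handle::new(&store);
    assert!(!handle.is_stale());
    store.generation.fetch_add(1, Ordering::AcqRel); // e.g. a repack happened
    assert!(handle.is_stale()); // the handle must now reconcile
    println!("generation marker ok");
}
```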
Concurrent writes to packs observable by applications pose a challenge to the current implementation and we need to find ways around that.
Motivation
Use this document to shed light on the entire problem space surrounding data consistency and resource usage of packed objects, to aid in finding solutions
that are best for various use cases without committing to high costs in one case or another.