Skip to content

Conversation

Byron
Copy link
Member

@Byron Byron commented Nov 19, 2021

Concurrent writes to packs observable by applications pose a challenge to the current implementation and we need to find ways around that.

Motivation

Use this document to shed light on the entire problem space surrounding data consistency and resource usage of packed objects to aid in finding solutions
that are best for various use cases without committing to high costs in one case or another.

@Byron
Copy link
Member Author

Byron commented Nov 19, 2021

Even though it's early in the process of this discovery I'd appreciate any (and early) feedback. Thank you :)

CC @kim @joshtriplett

@kim
Copy link
Contributor

kim commented Nov 19, 2021 via email

@Byron Byron changed the title Discovery: data consistency and resource usage of packed objects Discovery: data consistency and resource usage Nov 20, 2021
@Byron
Copy link
Member Author

Byron commented Nov 21, 2021

Thanks for your help, it's appreciated :).

You still seem to be missing the most obvious optimisation: gitoxide currently
maps both the index and the data file eagerly. There is no good reason to do
this, except for very specialised use cases. […]

To me, avoiding unconditional interior mutability to support lazy loading of packs allows for thread-safe sharing of memory maps to scale object access with cores. Maybe this is a special case or an optimization many will skip. I highlight the word 'unconditional' because also I believe lazy loading should be supported, probably even as the default, but not be the only option. The use-cases I mention in the discovery will try to make a case for different options to handle this, and I hope to have a technical sketch for a solution that serves all of them soon.

Also your definition of scalability does not consider that resource accounting
is independent of the number of available cores, so I would find it appropriate
to not make bold claims to that end.

I'd be glad if you could point me to this statement so it can be adjusted as I agree that it shouldn't even be implied that resource accounting is somehow dependent on the number of cores.

…259)

Because it really is a storage location for shared data that doesn't do
anything on its own.
But the best of it is that auto-refreshing Policies should
be the new default.
Already showing some issues but I think it can be done smartly
nonetheless.
However, dyn-traits can't be combined with non-auto traits which
prevents this. Maybe a macro would do though.
The goal is to make thread-safety toggalble with features to get type
complexity down to just a single type in git-repository without needing
more feature toggles in parent crates.

This should work actually, this is an intermediate commit.

Something that will change is the settings of Easy, as there probably
will only be an Easy and EasyShared, don't know how the types
can be filled in based on a feature toggle though.
Maybe it can work with more typedefs.
…but it shows that ideally we just export the correct Policy type (i.e.
one with Send and Sync requirements as needed) to avoid having
to do some conversion acrobatics when we want to use it later.
Byron added 26 commits November 23, 2021 16:04
After testing performance, we can either decide to allow multiple impls
thanks to type parameters, or just use one and keep them out of it.
…hrough a reference (#259)

This is quite an amazing result and maybe it's just an M1 thing, but
it's clearly showing that modern CPUs can handle that pretty well.

We still keep the idea alive to make thread-safety switchable to avoid
forcing overhead on other architectures or those who don't want to use
threads.
… no-arc (#259)

However, that can only be caused when doing contains checks, which
probably are not usually what one does, and in standard object access
operation going through an arc (and even an arc-lock) isn't a problem.

Going through a Mutex is slow though, so it's probably better to use
read-locks when possible and upgrade them when needed.
#259)

For this we skip the Eager-portion and go straight to lazy loading,
while introducing a flag that should help to keep packs available
(as well as indices) indefinitely.

I imagine the coordinator of readers (upload-pack) to check for the
amount of open handles from time to time and the amount of reuse,
and replace it with a new one to clear handles.

This also means this must not be implemented as a flag per call,
but as a setting on the policy.
)

We keep them like a normal index, which may mean we have HAD multiple
of them. Thus we can't make it special, I think, but keep all in the
same index-providing array.
…pack loading states need some more detail to be able to decide what to
do.
)

This enables a single repository instance to be used for pack receivers
and pack writers.
Probably rare, but maybe the reconcilation can handle that without
overhead, and if so, it's just a benefit.
…led (#259)

Now even in the Store packs belong to indices. Indices are sparse in the
Policy but are dense in the store.

Packs for multi-pack indices are always dense at first but get less
dense over time.

Note quite sure how changes are communicated in case of multi-pack
indices.
…and run into a hopefully fixable lifetime issue
@Byron Byron merged commit a14e4d0 into main Nov 25, 2021
Byron added a commit that referenced this pull request Nov 25, 2021
Since packs can only be loaded after loading indices, one will have a
marker that has to be used to check for changes that can't be
reconciled.

Note that this is not to protect from querying packs that aren't
available on disk anymore as we will never unload them anyway when
in a mode that needs pack-ids to remain stable.
@Byron Byron deleted the pack-consistency branch January 10, 2022 08:48
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants