Discovery: data consistency and resource usage #259
Even though it's early in the process of this discovery, I'd appreciate any (and early) feedback. Thank you :)
You still seem to be missing the most obvious optimisation: gitoxide currently
maps both the index and the data file eagerly. There is no good reason to do
this, except for very specialised use cases (e.g. implementing `git repack`). I
think it is fair to say that the majority of applications will **not** know
upfront which objects (and therefore which packs) they will access. Those
applications will most likely benefit from mapping data files on demand, even if
you don't provide a thread-safe implementation and don't account for concurrent
modifications of the odb.
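The on-demand mapping suggested above could be sketched along these lines. This is a hypothetical illustration, not gitoxide's actual API: the `Pack` type and its methods are invented for the example, and a `Vec<u8>` stands in for a real memory map (which would come from something like the `memmap2` crate):

```rust
use std::sync::OnceLock;

/// Hypothetical sketch of lazy pack-data loading: the index would be
/// available immediately, while the data "map" is produced only on
/// first access and cached for all later accesses.
struct Pack {
    id: u32,
    data: OnceLock<Vec<u8>>, // stand-in for a lazily created memory map
}

impl Pack {
    fn new(id: u32) -> Self {
        Pack { id, data: OnceLock::new() }
    }

    /// Map (here: fabricate) the data file on demand; subsequent calls
    /// return the already-initialized map without redoing the work.
    fn data(&self) -> &[u8] {
        self.data.get_or_init(|| {
            // A real implementation would open and mmap the pack file here.
            vec![self.id as u8; 4]
        })
    }

    fn is_loaded(&self) -> bool {
        self.data.get().is_some()
    }
}

fn main() {
    let pack = Pack::new(7);
    assert!(!pack.is_loaded()); // nothing mapped yet
    assert_eq!(pack.data(), &[7, 7, 7, 7]); // mapped on first use
    assert!(pack.is_loaded());
    println!("lazy load ok");
}
```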
Also, your definition of scalability does not consider that resource accounting
is independent of the number of available cores, so I would find it appropriate
not to make bold claims to that end.
…which definitely is the use case for the current implementation. It's clear it would benefit from lazy loading too, even though that's not the most important thing right now.
Thanks for your help, it's appreciated :).
To me, avoiding unconditional interior mutability while supporting lazy loading of packs is what allows thread-safe sharing of memory maps, scaling object access with the number of cores. Maybe this is a special case, or an optimization many will skip. I highlight the word 'unconditional' because I, too, believe lazy loading should be supported, probably even as the default, but it should not be the only option. The use cases I mention in the discovery will try to make a case for different options to handle this, and I hope to have a technical sketch for a solution that serves all of them soon.
I'd be glad if you could point me to this statement so it can be adjusted, as I agree that it shouldn't even be implied that resource accounting is somehow dependent on the number of cores.
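To illustrate the kind of sharing meant here: once a map is loaded, it can be handed to readers behind an `Arc`, so concurrent object access scales with cores without locking or copying the map itself. This is a hedged, self-contained sketch in which a `Vec<u8>`-backed slice stands in for a real memory map:

```rust
use std::sync::Arc;
use std::thread;

fn main() {
    // A "loaded pack map", shared immutably across threads.
    let map: Arc<[u8]> = Arc::from(vec![1u8, 2, 3, 4].into_boxed_slice());

    let handles: Vec<_> = (0..4)
        .map(|_| {
            let map = Arc::clone(&map);
            // Each thread reads the shared map concurrently; there is no
            // mutation, so no interior mutability is needed after loading.
            thread::spawn(move || map.iter().map(|&b| b as u32).sum::<u32>())
        })
        .collect();

    for h in handles {
        assert_eq!(h.join().unwrap(), 10);
    }
    println!("shared maps across threads ok");
}
```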
Even though I definitely don't like all that complexity :/
…259) Because it really is a storage location for shared data that doesn't do anything on its own.
But the best part is that auto-refreshing Policies should be the new default.
Already showing some issues, but I think it can be done smartly nonetheless.
However, dyn traits can't be combined with non-auto traits, which prevents this. Maybe a macro would do, though.
The goal is to make thread-safety togglable via cargo features, getting type complexity down to just a single type in `git-repository` without needing more feature toggles in parent crates. This should actually work; this is an intermediate commit. Something that will change is the settings of `Easy`, as there will probably only be an `Easy` and an `EasyShared`. I don't know yet how the types can be filled in based on a feature toggle, though; maybe it can work with more typedefs.
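One way such a feature toggle could collapse everything to a single exported type is a conditional type alias. The `threading` feature name and the `MapRef`/`Policy` names below are purely illustrative assumptions: without the feature the alias compiles to `Rc`, with it to `Arc`, and downstream code only ever sees `MapRef`:

```rust
// Hypothetical sketch: one exported alias, switched by a cargo feature,
// so parent crates never need their own feature toggles.
#[cfg(feature = "threading")]
pub type MapRef<T> = std::sync::Arc<T>;
#[cfg(not(feature = "threading"))]
pub type MapRef<T> = std::rc::Rc<T>;

// Downstream code is written once against the alias.
pub struct Policy {
    pub packs: MapRef<Vec<u32>>,
}

fn main() {
    let policy = Policy { packs: MapRef::new(vec![1, 2, 3]) };
    // Cloning the handle is cheap in either mode (Rc or Arc refcount bump).
    let shared = MapRef::clone(&policy.packs);
    assert_eq!(shared.as_slice(), &[1, 2, 3]);
    println!("single exported type works in either mode");
}
```

Since `Rc` is not `Send`/`Sync`, this only makes the non-threaded build lighter; it cannot make one binary support both modes at once, which is presumably why the `Easy`/`EasyShared` split still exists.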
…but it shows that ideally we just export the correct `Policy` type (i.e. one with `Send` and `Sync` requirements as needed) to avoid having to do some conversion acrobatics when we want to use it later.
After testing performance, we can either decide to allow multiple impls thanks to type parameters, or just use one and keep them out of it.
…hrough a reference (#259) This is quite an amazing result, and maybe it's just an M1 thing, but it clearly shows that modern CPUs can handle this pretty well. We still keep alive the idea of making thread-safety switchable, to avoid forcing overhead onto other architectures or onto those who don't want to use threads.
… no-arc (#259) However, that cost only shows up when doing contains checks, which is usually not what one does; in standard object-access operations, going through an `Arc` (and even an `Arc`-wrapped lock) isn't a problem. Going through a `Mutex` is slow, though, so it's probably better to use read locks when possible and upgrade them only when needed.
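The 'read lock, upgrade when needed' idea could look roughly like the following double-checked pattern over `std::sync::RwLock`. All names here are hypothetical, and `Vec<u8>` again stands in for a mapped pack:

```rust
use std::sync::RwLock;

/// Hypothetical sketch: object access takes a cheap read lock; only a
/// miss (pack not yet loaded) falls back to the write lock to load it.
struct PackSlot {
    data: RwLock<Option<Vec<u8>>>,
}

impl PackSlot {
    fn get_or_load(&self, load: impl Fn() -> Vec<u8>) -> Vec<u8> {
        // Fast path: many readers proceed in parallel.
        if let Some(d) = self.data.read().unwrap().as_ref() {
            return d.clone();
        }
        // Slow path: "upgrade" by re-acquiring as a writer, and re-check
        // so we don't load twice if another thread won the race.
        let mut guard = self.data.write().unwrap();
        if guard.is_none() {
            *guard = Some(load());
        }
        guard.as_ref().unwrap().clone()
    }
}

fn main() {
    let slot = PackSlot { data: RwLock::new(None) };
    let d = slot.get_or_load(|| vec![42]);
    assert_eq!(d, vec![42]); // loaded once, served from the slot after
    println!("read-mostly locking ok");
}
```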
#259) For this we skip the eager portion and go straight to lazy loading, while introducing a flag that should help keep packs (as well as indices) available indefinitely. I imagine the coordinator of readers (upload-pack) would check the number of open handles and the amount of reuse from time to time, and replace it with a new one to clear handles. This also means this must not be implemented as a per-call flag, but as a setting on the policy.
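A speculative sketch of that coordinator idea: the number of live `Arc` references approximates the open handles, and swapping in a fresh policy "clears" them from the coordinator's point of view, while old handles keep their policy alive until they are dropped. All names below are assumptions for illustration:

```rust
use std::sync::Arc;

/// Hypothetical policy with the keep-alive behaviour as a policy-level
/// setting, not a per-call flag.
struct Policy {
    keep_packs_alive: bool,
}

struct Coordinator {
    policy: Arc<Policy>,
}

impl Coordinator {
    /// `strong_count - 1` approximates the number of live reader
    /// handles, since the coordinator itself holds one reference.
    fn open_handles(&self) -> usize {
        Arc::strong_count(&self.policy) - 1
    }

    /// Replace the shared policy: new readers get the fresh one, old
    /// handles keep the previous policy alive until dropped.
    fn refresh(&mut self) {
        self.policy = Arc::new(Policy { keep_packs_alive: true });
    }
}

fn main() {
    let mut coordinator = Coordinator {
        policy: Arc::new(Policy { keep_packs_alive: true }),
    };
    assert!(coordinator.policy.keep_packs_alive);
    let reader = Arc::clone(&coordinator.policy); // one reader handle
    assert_eq!(coordinator.open_handles(), 1);
    coordinator.refresh(); // "clears" handles from the coordinator's view
    assert_eq!(coordinator.open_handles(), 0);
    drop(reader); // the old policy is freed once its last handle goes away
    println!("coordinator refresh ok");
}
```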
…entual pack lookup (#259)
…pack loading states need some more detail to be able to decide what to do.
Probably rare, but maybe the reconciliation can handle that without overhead, and if so, it's just a benefit.
…led (#259) Now even in the Store, packs belong to indices. Indices are sparse in the Policy but dense in the Store. Packs for multi-pack indices are always dense at first but become less dense over time. Not quite sure how changes are communicated in the case of multi-pack indices.
…and run into a hopefully fixable lifetime issue
Since packs can only be loaded after loading indices, one will have a marker that has to be used to check for changes that can't be reconciled. Note that this is not to protect against querying packs that aren't available on disk anymore, as we will never unload them anyway when in a mode that needs pack-ids to remain stable.
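The marker could be as simple as a generation counter on the shared store, checked by each handle before it trusts cached pack-ids. This is a speculative sketch with all names assumed, not the actual design:

```rust
use std::sync::atomic::{AtomicU32, Ordering};

/// Hypothetical shared store that bumps a generation whenever indices
/// change in a way handles must notice (e.g. after a repack).
struct Store {
    generation: AtomicU32,
}

/// A handle remembers the generation it last reconciled against.
struct Handle<'a> {
    store: &'a Store,
    seen_generation: u32,
}

impl<'a> Handle<'a> {
    fn new(store: &'a Store) -> Self {
        Handle {
            store,
            seen_generation: store.generation.load(Ordering::Acquire),
        }
    }

    /// Before trusting cached pack-ids, check whether the store changed
    /// underneath us, in which case a refresh/reconciliation is needed.
    fn is_stale(&self) -> bool {
        self.store.generation.load(Ordering::Acquire) != self.seen_generation
    }
}

fn main() {
    let store = Store { generation: AtomicU32::new(0) };
    let handle = Handle::new(&store);
    assert!(!handle.is_stale());
    store.generation.fetch_add(1, Ordering::AcqRel); // e.g. a repack happened
    assert!(handle.is_stale()); // the handle must now reconcile
    println!("generation marker ok");
}
```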
Concurrent writes to packs observable by applications pose a challenge to the current implementation and we need to find ways around that.
Motivation
Use this document to shed light on the entire problem space surrounding data consistency and resource usage of packed objects, to aid in finding solutions
that are best for various use cases without committing to high costs in one case or another.