Investigate git clone --reference[-if-able] when using --path-cache/--name-cache #625

mbolivar-nordic · 2023-02-13T21:04:20Z

In general, investigate and benchmark using .git/objects/info/alternates to point to the cache repository instead of cloning the cache repository. This may be faster when the clone has to cross a file system boundary, which is frequently the case when --path-cache and --name-cache are used in CI environments.

Reference to Discord discussion:

https://discordapp.com/channels/720317445772017664/906521547672522752/1074769891850211398
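For readers unfamiliar with the mechanism, here is a minimal sketch of what "pointing to the cache repository" means at the file level. The alternates file lists object directories, one path per line, and `git clone --reference` writes it for you. The helper name `add_alternate` and the directory layout are hypothetical illustrations, not west code.

```python
from pathlib import Path


def add_alternate(repo_git_dir, cache_git_dir):
    """Register the cache's object store as an alternate of the repo.

    Both arguments are git directories (``<worktree>/.git`` for a normal
    clone, or the repository directory itself for a bare cache).
    Hypothetical helper, shown for illustration only.
    """
    # The alternates file holds paths to *objects* directories, one per line.
    cache_objects = (Path(cache_git_dir) / "objects").resolve()
    alternates = Path(repo_git_dir) / "objects" / "info" / "alternates"
    alternates.parent.mkdir(parents=True, exist_ok=True)

    existing = alternates.read_text().splitlines() if alternates.exists() else []
    entry = str(cache_objects)
    if entry not in existing:  # keep the file free of duplicate entries
        existing.append(entry)
        alternates.write_text("\n".join(existing) + "\n")
    return alternates
```

Note that a repository borrowing objects this way depends on the cache: if objects are pruned from the cache, the borrowing repository can become corrupt.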

marc-hb · 2023-02-13T21:16:46Z

I hope the description is complete because the discord link is behind a "registration wall". Just FYI.

mbolivar-nordic · 2023-02-13T21:20:38Z

I think there's enough in there, thanks.

mbolivar-ampere · 2023-10-26T03:02:23Z

@hzarnani since you mentioned this is a dupe of now-closed #695, I sure would like to see some performance benchmarks if you have any showing that this is a win.

hzarnani · 2023-10-26T04:12:52Z

> @hzarnani since you mentioned this is a dupe of now-closed #695, I sure would like to see some performance benchmarks if you have any showing that this is a win.

@mbolivar-ampere, I'd be happy to do that. I was looking at the discord discussion to see what metrics you were interested in and got a good idea. And I can probably add more.

The thing, though, is that unless one is dealing with large repositories containing GBs of objects and has many active workspaces, it'd be hard to see the value, and in some cases the necessity, of object sharing. That's what I was trying to convey (unsuccessfully) in #695. So, in short, scale is important, both in the size of the individual repositories and in the number of workspaces that clone them. This is particularly important in CI, but it also matters for interactive users.

Now, I'm sure that an immediately obvious question to some readers after seeing "GBs of objects" is "why do you even have such large repositories? Do you version large binaries in Git, which is really not intended for that purpose? Understand that Git isn't really intended to store binaries. Consider using LFS instead, or truncate and rewrite old history." All those are valid questions and options, but making any of those options happen in production is no easy task.

Bottom line is this -- a tool built on top of Git should offer both of the primary mechanisms provided by Git for dealing with many clones of very large repositories -- shallow depth and object sharing.
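As a rough sketch of how the two mechanisms map onto `git clone` options: `--depth` gives a shallow clone, while `--reference` and `--reference-if-able` borrow objects from another local repository. The function name `build_clone_cmd` and its parameters are hypothetical, and this is not how west builds its commands. Whether combining both options pays off is not something this thread establishes.

```python
def build_clone_cmd(url, dest, cache_repo=None, depth=None, required=False):
    """Return the argv for a git clone using the requested mechanisms.

    Hypothetical helper for illustration; the flag names are real git
    options, everything else is an assumption.
    """
    cmd = ["git", "clone"]
    if depth is not None:
        if depth < 1:
            raise ValueError("depth must be a positive integer")
        # Shallow depth: truncate history to the given number of commits.
        cmd += ["--depth", str(depth)]
    if cache_repo is not None:
        # Object sharing: --reference aborts if the cache is unusable,
        # --reference-if-able warns and falls back to a normal clone.
        flag = "--reference" if required else "--reference-if-able"
        cmd += [flag, str(cache_repo)]
    cmd += [url, str(dest)]
    return cmd
```

The resulting list can be passed to `subprocess.run(cmd, check=True)`.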

marc-hb · 2023-10-26T14:25:10Z

Engineering is always about trade-offs and numbers. Performance even more.

You can almost always find a specific use case that will benefit from pretty much any optimization. The question is always "is it worth it?". In other words, is the extra code and corresponding extra maintenance cost worth the benefits? Of course this is not an exact science: we'll never know exactly how many people use big git repos with west and how big they are. But that should not stop us from looking at some examples and incomplete data. It's still better than no data at all because performance is always full of surprises.

For the same reason we need some estimation of the complexity of the code changes and corresponding maintenance burden. Considering the extremely limited manpower, significantly affecting the maintenance of "mainstream" features for very few people using git "the wrong way" could be a blocker.

karhama · 2023-11-24T12:12:22Z

I performed some tests with my draft PR.

Here is an explanation of the two legends in the graph below:
reference repo = PR #697 as is (commit 1bf94b0)
local clone = PR #697, but using a local clone for projects as in the main branch of west (only submodule update uses --reference)

West update times were captured from our dev CI system. Each west update execution ran on a clean Azure F16s_v2 VM (with an ephemeral disk and some host cache prepping included). Even with such a limited number of samples, I think the current local clone approach is significantly faster than the reference repo approach I used. The state of the cache was the same for all executions; it was not up to date for all projects, but quite recent anyway. I believe this is quite close to what we would see in actual CI.

On my laptop, with a fully up-to-date cache, the local clone approach is also significantly faster than using a reference. If anyone has ideas on how to improve the reference case, please let me know.
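For anyone who wants to repeat this kind of comparison locally, a minimal timing sketch follows. It is not the harness used for the numbers above; `time_command` and `summarize` are hypothetical names, and each run of a real `west update` would need a fresh workspace and a consistent cache state for the samples to be comparable.

```python
import statistics
import subprocess
import time


def time_command(argv, runs=3):
    """Run argv `runs` times and return the wall-clock durations in seconds."""
    durations = []
    for _ in range(runs):
        start = time.perf_counter()
        # check=True makes a failed run raise instead of being timed silently.
        subprocess.run(argv, check=True,
                       stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
        durations.append(time.perf_counter() - start)
    return durations


def summarize(durations):
    """Reduce a list of samples to min, median, and max."""
    return {
        "min": min(durations),
        "median": statistics.median(durations),
        "max": max(durations),
    }
```

With few samples, reporting min, median, and max side by side is usually more honest than a single average.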

mbolivar-nordic added the enhancement (New feature or request) and help wanted (Implementation help is requested) labels Feb 13, 2023

marc-hb mentioned this issue Apr 26, 2023

west update: Default behavior should fetch only --depth 1 zephyrproject-rtos/zephyr#34757

Closed

marc-hb added the performance (How long things take) label May 5, 2023

hzarnani mentioned this issue Oct 26, 2023

Allow using reference repositories to share objects #695

Closed

karhama mentioned this issue Nov 15, 2023

Extend cache support for submodules #697

Merged
