Investigate git clone --reference[-if-able] when using --path-cache/--name-cache #625

Open
mbolivar-nordic opened this issue Feb 13, 2023 · 6 comments
Labels
enhancement New feature or request help wanted Implementation help is requested performance How long things take

Comments

@mbolivar-nordic
Contributor

In general, investigate and benchmark using .git/objects/info/alternates to point to the cache repository instead of cloning the cache repository. This may be faster when the clone has to cross a file system boundary, which is frequently the case when --path-cache and --name-cache are used in CI environments.
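For readers unfamiliar with the mechanism, here is a minimal sketch of what pointing a repository at a cache through `.git/objects/info/alternates` involves. It is not how west does or would do it. The helper name `add_alternate` and both path arguments are hypothetical. The sketch only assumes the documented file format: one object-directory path per line.

```python
from pathlib import Path


def add_alternate(git_dir, cache_git_dir):
    """Register the cache's object store as an alternate of git_dir.

    git_dir and cache_git_dir are hypothetical paths to the two
    repositories' git directories (".git" for a normal clone, the
    repository root for a bare one).
    """
    # Alternates entries point at an "objects" directory, not at the repo.
    cache_objects = Path(cache_git_dir).resolve() / "objects"

    info_dir = Path(git_dir) / "objects" / "info"
    info_dir.mkdir(parents=True, exist_ok=True)
    alternates = info_dir / "alternates"

    # Keep existing entries and avoid adding the same path twice.
    entries = alternates.read_text().splitlines() if alternates.exists() else []
    if str(cache_objects) not in entries:
        entries.append(str(cache_objects))
        alternates.write_text("\n".join(entries) + "\n")
    return alternates
```

A repository set up this way depends on the cache staying intact: if objects are pruned from the cache, the borrowing repository can break. That risk is part of the trade-off to weigh alongside any speed-up.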

Reference to Discord discussion:

https://discordapp.com/channels/720317445772017664/906521547672522752/1074769891850211398

@mbolivar-nordic mbolivar-nordic added enhancement New feature or request help wanted Implementation help is requested labels Feb 13, 2023
@marc-hb
Collaborator

marc-hb commented Feb 13, 2023

I hope the description is complete because the discord link is behind a "registration wall". Just FYI.

@mbolivar-nordic
Contributor Author

I think there's enough in there, thanks.

@mbolivar-ampere
Collaborator

@hzarnani since you mentioned this is a dupe of now-closed #695, I sure would like to see some performance benchmarks if you have any showing that this is a win.

@hzarnani

> @hzarnani since you mentioned this is a dupe of now-closed #695, I sure would like to see some performance benchmarks if you have any showing that this is a win.

@mbolivar-ampere, I'd be happy to do that. I was looking at the discord discussion to see what metrics you were interested in and got a good idea. And I can probably add more.

The thing, though, is that unless one is dealing with large repositories that contain GBs of objects and has many active workspaces, it is hard to see the value, and in some cases the necessity, of object sharing. That's what I was trying to convey (unsuccessfully) in #695. In short, scale matters, both in the size of the individual repositories and in the number of workspaces that clone them. This is particularly important in CI, but it also applies to interactive users.

Now, I'm sure that an immediately obvious question to some readers after seeing "GBs of objects" is "why do you even have such large repositories? Do you version large binaries in Git, which isn't really intended for that purpose? Consider using LFS instead, or truncate and rewrite old history." Those are all valid questions and options, but making any of them happen in production is no easy task.

Bottom line is this -- a tool built on top of Git should offer both of the primary mechanisms provided by Git for dealing with many clones of very large repositories -- shallow depth and object sharing.
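To make the two mechanisms concrete, here is a small sketch of how a wrapper could assemble the corresponding `git clone` command lines. `--depth`, `--reference` and `--reference-if-able` are real `git clone` options. The function `clone_command` and its parameters are hypothetical and do not reflect west's actual code or options.

```python
def clone_command(url, dest, depth=None, reference=None, if_able=True):
    """Build a `git clone` argument list (hypothetical helper).

    depth:     if set, request a shallow clone with that many commits.
    reference: if set, borrow objects from this local repository.
    if_able:   use --reference-if-able, which falls back to a normal
               clone with a warning when the reference repo is unusable.
    """
    cmd = ["git", "clone"]
    if depth is not None:
        cmd += ["--depth", str(depth)]
    if reference is not None:
        flag = "--reference-if-able" if if_able else "--reference"
        cmd += [flag, str(reference)]
    # "--" keeps git from parsing the URL or destination as an option.
    cmd += ["--", str(url), str(dest)]
    return cmd


if __name__ == "__main__":
    # Placeholder URL and paths, for illustration only.
    print(clone_command("https://example.com/repo.git", "repo", depth=1))
    print(clone_command("https://example.com/repo.git", "repo",
                        reference="/cache/repo"))
```

Whether either option pays off for a given workspace is exactly what would need to be measured.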

@marc-hb
Collaborator

marc-hb commented Oct 26, 2023

Engineering is always about trade-offs and numbers. Performance even more.

You can almost always find a specific use case that will benefit from pretty much any optimization. The question is always "is it worth it?". In other words, is the extra code and corresponding extra maintenance cost worth the benefits? Of course this is not an exact science: we'll never know exactly how many people use big git repos with west and how big they are. But that should not stop us from looking at some examples and incomplete data. It's still better than no data at all because performance is always full of surprises.

For the same reason we need some estimation of the complexity of the code changes and corresponding maintenance burden. Considering the extremely limited manpower, significantly affecting the maintenance of "mainstream" features for very few people using git "the wrong way" could be a blocker.

@karhama
Contributor

karhama commented Nov 24, 2023

I performed some tests with my draft PR.

Here is an explanation of the two legend entries in the graph below:
reference repo = PR #697 as such (commit 1bf94b0)
local clone = PR #697 but using local clone for projects as in main branch of west (only submodule update uses --reference)

West update times were captured from our dev CI system. Each west update execution ran on a clean Azure F16s_v2 VM (with an ephemeral disk and some host cache prepping included). Even with such a limited number of samples, I think the current local clone approach is significantly faster than the reference repo approach I used. The state of the cache was the same for all executions. It was not up to date for all projects, but it was quite recent anyway, so I believe this is quite close to what we would see in actual CI.

[image: graph of west update times for the "reference repo" and "local clone" configurations]

On my laptop, with a fully up-to-date cache, the local clone approach is also significantly faster than using a reference. If anyone has ideas on how to improve the reference case, please let me know.
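For anyone who wants to reproduce this kind of comparison locally, a minimal timing sketch follows. It is not the harness used for the numbers above. The helpers `time_command` and `summarize` are hypothetical, and the caller is assumed to reset the workspace and cache state between runs, since that state strongly affects the result.

```python
import statistics
import subprocess
import sys
import time


def time_command(cmd, runs=3):
    """Run cmd `runs` times and return the wall-clock durations in seconds."""
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        subprocess.run(cmd, check=True,
                       stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
        samples.append(time.perf_counter() - start)
    return samples


def summarize(samples):
    """Reduce a list of durations to min, median and max."""
    return {
        "min": min(samples),
        "median": statistics.median(samples),
        "max": max(samples),
    }


if __name__ == "__main__":
    # Stand-in command; replace it with the update or clone invocation
    # under test, and clean the destination between runs.
    print(summarize(time_command([sys.executable, "-c", "pass"], runs=3)))
```

With only a handful of samples, reporting the spread alongside the median helps show how much of a difference is noise.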
