
Unbounded disk space used by package's metadata #19597

Closed
wavexx opened this issue Dec 14, 2016 · 7 comments
Labels
domain:packages Package management and loading

Comments

@wavexx
Contributor

wavexx commented Dec 14, 2016

Since we use git internally to sync package metadata and packages themselves, I'd like to point out that the blobs present in the repositories themselves will keep growing, even though we don't need history for those. We're just using git to pull updates efficiently. In doing so though, we're wasting space over time.

This is a common problem with most modern package managers that use git as the underlying sync mechanism, so please consider this a general observation rather than a critique of Julia or Pkg itself. I'm pointing it out because I noticed my ~/.julia directory grew by several GB due to the long history of some projects in there, and combined with other package managers and projects that do the same thing, it's a tremendous waste of space.

I wrote to the git mailing list a long time ago, when this pattern first became popular; I found the message again here:

http://git.661346.n2.nabble.com/Thinning-a-repository-td7622038.html

As a dirty solution, I wrote a simple shell script that takes a bare repository, kills the reflog, and repacks and shrinks it back down to a shallow repository of exactly one commit. The result is identical to a clone --depth 1, with the advantage that no network connection is required to perform the operation, while still allowing efficient fast-forwards.

Should we consider adopting something like this? The package manager doesn't need the history.
Unfortunately, libgit2 doesn't support shallow repositories yet. This technique worked when git was used as an external command, but cannot work now.

@KristofferC
Member

#17963

@tkelman
Contributor

tkelman commented Dec 14, 2016

Pkg3 should support downloading a package just as a tarball of a particular tag - https://github.com/JuliaLang/Juleps/blob/master/Pkg3.md. It may or may not be able to do that in the initial implementation, but it's a design goal to keep that possible.

@wavexx
Contributor Author

wavexx commented Dec 14, 2016

Thanks, #17963 didn't pop up in my searches. The issue itself seems to have drifted into a lot of speculation halfway through, and it's too early for me to comment on Pkg3 after reading through it quickly.

I understand the need to switch to a content hash instead of a commit hash, but to be clear: currently, using commit hashes, we can clone and re-shallow repositories with git very efficiently, as outlined here. No additional changes are needed, aside from shallow support in libgit2.

Fast-forwarding from a shallow repository is perfectly efficient on both sides. Initial cloning is heavier on the remote side (as packs are recomputed), but it also brings massive network savings that cannot be ignored. Considering the ratio of updates to initial clones, I wouldn't even consider this an issue on the server side. I stopped deep-cloning as soon as committing to shallow repositories was implemented and never looked back.

@tkelman
Contributor

tkelman commented Dec 14, 2016

There was the case of cocoapods' repo shallow clones overloading github servers (don't have a link handy, it was a long discussion but worth reading just the comments by github infra engineers). We're not at that scale yet, but for METADATA or very widely-used packages the server-side load of shallow clones might not be negligible.

Pkg2 does actually need the history of packages in order to checkout, pin, or downgrade to older versions of a package.

@wavexx
Contributor Author

wavexx commented Dec 14, 2016

I'm running a server-side git instance for a local, large (multi-GB) binary repo (similarly, we use git to sync because it efficiently supports renames, something rsync doesn't). You can compute checkpoint packs on the server to make initial clones faster; we were doing that weekly. I'm surprised GitHub doesn't do something like this already.
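The "checkpoint packs" idea can be approximated with stock git: a periodic repack that consolidates everything into a single pack with a reachability bitmap, which makes object counting for subsequent clones much cheaper. A sketch, assuming a cron-driven weekly run (the repository path is hypothetical):

```shell
#!/bin/sh
# Weekly server-side maintenance for a served bare repository:
# consolidate all objects into one pack and write a bitmap index,
# so that clone/fetch object enumeration is fast.
REPO=/srv/git/project.git   # hypothetical path
git -C "$REPO" repack -a -d --write-bitmap-index
```

Bitmaps only help when the repository is mostly in a single pack, which `-a -d` guarantees; that is the reason for recomputing them on a schedule rather than on every push.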

Shouldn't checkout/pin still be perfectly fine with shallow clones as long as the commit is in the allowed range? The main issue I see is downgrade to a revision you don't have, in which case you would need to re-clone the repository, analogous to what you'd do with a static version archive.

@JeffBezanson added the domain:packages (Package management and loading) label Dec 14, 2016
@wavexx
Contributor Author

wavexx commented Apr 5, 2017

By the way, I was able to shrink my ~/.julia directory from 4 GB down to 350 MB by running git gc --aggressive in my ~/.julia/{.cache,v*} directories :/

To gauge the extra gains from shallow repositories, I tried clone --bare --no-single-branch --mirror --depth 1 followed by a repack of each repository, which shrank the entire hierarchy to a mere 80 MB.
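The gc sweep described in this comment can be scripted over the same directory layout. A sketch; `JULIA_DIR` is a hypothetical override (defaulting to ~/.julia) added here only so the walk is testable:

```shell
#!/bin/sh
# Aggressively gc every git repository under the package directory,
# mirroring the ~/.julia/{.cache,v*} layout mentioned above.
root=${JULIA_DIR:-$HOME/.julia}
for d in "$root"/.cache/* "$root"/v*/*; do
    [ -d "$d" ] || continue
    # Skip entries that are not git repositories.
    git -C "$d" rev-parse --git-dir >/dev/null 2>&1 || continue
    git -C "$d" gc --aggressive --quiet
done
```

This only repacks history; the bigger 80 MB figure in the comment additionally required re-shallowing each repository to a single commit.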

@simonbyrne
Contributor

Largely fixed by the new Pkg. If it becomes an issue again, please open an issue at https://github.com/JuliaLang/Pkg.jl
