
Unbounded disk space used by package's metadata #19597

Closed
wavexx opened this issue Dec 14, 2016 · 7 comments
Labels
domain:packages Package management and loading

Comments

@wavexx
Contributor

wavexx commented Dec 14, 2016

Since we use git internally to sync package metadata and packages themselves, I'd like to point out that the blobs present in the repositories themselves will keep growing, even though we don't need history for those. We're just using git to pull updates efficiently. In doing so though, we're wasting space over time.

This is a common problem with most modern package managers that use git as the underlying sync mechanism, so please consider this a general observation rather than a critique of Julia or Pkg itself. I'm pointing it out because I noticed my ~/.julia directory grew by several GB due to the long history of some projects in there, and combined with other package managers and projects that do the same thing, it's a tremendous waste of space.

I wrote to the git mailing list a long time ago, when this pattern first became popular; I found the message again here:

http://git.661346.n2.nabble.com/Thinning-a-repository-td7622038.html

As a dirty solution, I wrote a simple shell script that takes a bare repository, kills the reflog, and repacks and shrinks it back down to a shallow repository of exactly one commit. The result is identical to a clone --depth 1, with the advantage that no network connection is required to perform the operation, while still allowing efficient fast-forwards.

Should we consider adopting something like this? The package manager doesn't need the history.
Unfortunately, libgit2 doesn't support shallow repositories yet. This technique worked when git was used as an external command, but cannot work now.

@KristofferC
Member

#17963

@tkelman
Contributor

tkelman commented Dec 14, 2016

Pkg3 should support downloading a package just as a tarball of a particular tag - https://github.com/JuliaLang/Juleps/blob/master/Pkg3.md. It may or may not be able to do that in the initial implementation, but it's a design goal to keep that possible.

@wavexx
Contributor Author

wavexx commented Dec 14, 2016

Thanks, #17963 didn't pop up in my searches. The issue itself seems to have drifted into a lot of speculation halfway through, and it's too early for me to comment on Pkg3 after reading through it quickly.

I understand the need to switch to a content hash instead of a commit hash, but to be clear: currently, using commit hashes, we can clone and re-shallow repositories with git very efficiently, as outlined here. No additional changes are needed, aside from shallow support in libgit2.

Fast-forwarding from a shallow repository is perfectly efficient on both sides. Initial cloning is heavier on the remote side (as packs are recomputed), but it also brings massive network savings that cannot be ignored. Considering the ratio of updates to initial clones, I wouldn't even consider this an issue on the server side. I stopped deep-cloning as soon as committing to shallow repositories was implemented and never looked back.

@tkelman
Contributor

tkelman commented Dec 14, 2016

There was the case of cocoapods' repo shallow clones overloading github servers (don't have a link handy, it was a long discussion but worth reading just the comments by github infra engineers). We're not at that scale yet, but for METADATA or very widely-used packages the server-side load of shallow clones might not be negligible.

Pkg2 does actually need the history of packages in order to checkout, pin, or downgrade to older versions of a package.

@wavexx
Contributor Author

wavexx commented Dec 14, 2016

I'm running a server-side git instance for a local, large (multi-GB) binary repo (similarly, we use git to sync because it efficiently supports renames, something rsync doesn't). You can compute checkpoint packs on the server to make initial clones faster; we were doing that weekly. I'm surprised GitHub doesn't do something like this already.
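The "checkpoint packs" idea can be approximated with stock git: a periodic repack that consolidates everything into a single pack with a reachability bitmap, which makes object counting for subsequent clones much cheaper. A sketch, assuming a cron-driven weekly run (the repository path is hypothetical):

```shell
#!/bin/sh
# Weekly server-side maintenance for a served bare repository:
# consolidate all objects into one pack and write a bitmap index,
# so that clone/fetch object enumeration is fast.
REPO=/srv/git/project.git   # hypothetical path
git -C "$REPO" repack -a -d --write-bitmap-index
```

Bitmaps only help when the repository is mostly in a single pack, which `-a -d` guarantees; that is the reason for recomputing them on a schedule rather than on every push.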

Shouldn't checkout/pin still be perfectly fine with shallow clones as long as the commit is in the allowed range? The main issue I see is downgrade to a revision you don't have, in which case you would need to re-clone the repository, analogous to what you'd do with a static version archive.

@JeffBezanson added the domain:packages (Package management and loading) label Dec 14, 2016
@wavexx
Contributor Author

wavexx commented Apr 5, 2017

By the way, I was able to shrink my ~/.julia directory from 4 GB down to 350 MB by running git gc --aggressive in my ~/.julia/{.cache,v*} directories :/

To gauge the extra gains from shallow repositories, I tried clone --bare --no-single-branch --mirror --depth 1 followed by a repack of each repository, which shrank the entire hierarchy to a mere 80 MB.
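The gc sweep described in this comment can be scripted over the same directory layout. A sketch; `JULIA_DIR` is a hypothetical override (defaulting to ~/.julia) added here only so the walk is testable:

```shell
#!/bin/sh
# Aggressively gc every git repository under the package directory,
# mirroring the ~/.julia/{.cache,v*} layout mentioned above.
root=${JULIA_DIR:-$HOME/.julia}
for d in "$root"/.cache/* "$root"/v*/*; do
    [ -d "$d" ] || continue
    # Skip entries that are not git repositories.
    git -C "$d" rev-parse --git-dir >/dev/null 2>&1 || continue
    git -C "$d" gc --aggressive --quiet
done
```

This only repacks history; the bigger 80 MB figure in the comment additionally required re-shallowing each repository to a single commit.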

@simonbyrne
Contributor

Largely fixed by the new Pkg. If it becomes an issue again, please open an issue at https://github.com/JuliaLang/Pkg.jl
