Investigate in using shallow-clone / tar balls for Pkg.add and Pkg.clone #17963

KristofferC · 2016-08-11T08:27:59Z

Currently, Pkg.add and Pkg.clone just get the whole git repo from the server. Having the full git repo is nice if you want to develop the package since you can just start gitting away. It does, however, come with drawbacks. My v0.5 folder is 2.0 GB and contains 100 000+ files. Some packages also have quite large git repos (Plots.jl is over 300MB), so cloning these takes considerable time.

As Julia matures, the number of users per package should go up compared to the number of developers. This means that having the full git repo locally becomes, on average, less and less important. A user who only wants the latest release of a package would be just as happy getting the latest tarball of the package. This should also be significantly faster.

The things I propose are the following:

First step: investigate using shallow clones for getting new packages. This should reduce the time and disk space needed to add a new package while still leaving the possibility of going back to the full repo with a simple git fetch --unshallow. According to @yuyichao there can sometimes be problems with the server when using shallow repos, but it should be workable. This issue is worth looking at: "Issues Cloning Spec repo - GitHub taking a very long time to download changes to the Specs Repo", CocoaPods/CocoaPods#4989 (comment).
Second step: look into the possibility of just getting the tarball for packages by default, and maybe move the git-based add and clone, which fetch the full git repos, to PkgDev.

One question is what happens with dependency resolution if we don't have the full git repo. I am not sure how the resolution is done, but if we at least have the shallow git repo and find that we need to check out a tag that does not exist locally, maybe we can just fetch back far enough to get that tag. If we only have the tarball, I guess we could just get the tarball for the tag we need and then set that one as "active" somehow.

I am not very involved with how the whole package system works, and maybe these ideas have been discussed and dismissed previously, but I think doing something like the above could improve the package experience for normal users without making it significantly worse for developers.

cc @wildart @carlobaldassi as the Pkg experts :)


tkelman · 2016-08-11T08:28:50Z

libgit2 doesn't support shallow clones.

oxinabox · 2016-08-11T08:33:40Z

Cross Ref: libgit2/libgit2#3058

KristofferC · 2016-08-11T08:35:49Z

From the issue I linked:

Our advice would be for CocoaPods to stop using any kind of shallow feature from Git altogether. Users should perform a full clone of the repository, and then fetch into it as usual. Simply performing that change should significantly soften the load on our fileservers.

So using GitHub as a CDN with shallow clones does not seem to make them very happy, at least if you are big, which we aim to be!

simonster · 2016-08-11T15:32:12Z

I think that METADATA.jl is also unsustainable in the long run, since it carries information about every version of every package ever produced. While the folder structure is useful for version control, I suspect it's hell for the file system. Right now, there are >15,000 files in there and nearly 10,000 directories. Given that the allocation block size on HFS+ is 4K, every time anyone tags anything, it costs me >8K of disk space.
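The arithmetic behind that last figure can be sketched as follows. This is a rough back-of-the-envelope model, assuming each tag adds two small files (sha1 and requires, as in the current METADATA layout) and that every file occupies at least one allocation block. Directory entries are ignored, and the function name is made up for illustration.

```julia
# Rough lower bound on the disk space taken by `nfiles` small files when
# every file occupies at least one allocation block.
# Assumption: each file is smaller than one block; directories are ignored.
min_disk_usage(nfiles::Integer; block_size::Integer = 4096) = nfiles * block_size

# One tag adds two small files (sha1 and requires).
per_tag = min_disk_usage(2)

# All files currently in METADATA (">15,000" in the comment above).
total = min_disk_usage(15_000)

println("per tag: ", per_tag, " bytes")
println("whole tree: at least ", round(total / 2^20, digits = 1), " MiB")
```

The new directory created for each version adds to the per-tag cost, which is presumably why the figure quoted is "more than" 8K.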

tkelman · 2016-08-11T15:37:14Z

I've said elsewhere, for Pkg3 I think we should seriously restructure the way METADATA works. One toml (or json or something) file per package with appended information per tag would probably be worth it in terms of being easier on the filesystem. Would need a little bit of parsing, but probably better overall. And we'll need a real migration story so we can come up with a systematic way of archiving old history of package versions and metadata versions, probably with periodic new-branch resets?
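To make the one-file-per-package idea concrete, here is a minimal sketch of what such a file could look like and how it might be read. The layout, the field names, and the package data are all invented for illustration and are not a proposed Pkg3 format. The TOML module used here is the standard library that ships with Julia 1.6 and later.

```julia
using TOML  # standard library since Julia 1.6

# Hypothetical single-file layout: one table per tagged version.
# The package name, field names, and values are all invented for illustration.
example = """
name = "Example"
url = "https://example.invalid/Example.jl.git"

[versions."0.1.0"]
sha1 = "0000000000000000000000000000000000000001"
requires = ["julia 0.4"]

[versions."0.2.0"]
sha1 = "0000000000000000000000000000000000000002"
requires = ["julia 0.5", "Compat 0.8"]
"""

pkg = TOML.parse(example)

# Collect the tagged versions in sorted order.
versions = sort!([VersionNumber(v) for v in keys(pkg["versions"])])

for v in versions
    entry = pkg["versions"][string(v)]
    println(v, " => ", entry["sha1"], " requires ", join(entry["requires"], ", "))
end
```

With a layout like this, tagging a new version appends one table to an existing file instead of creating a new directory and two new files.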

lobingera · 2016-08-11T15:42:43Z

Is Pkg3 only a working title, or a repository?

tkelman · 2016-08-11T15:50:50Z

https://www.youtube.com/watch?v=4Fk1WOO0Lqk

StefanKarpinski · 2016-08-11T18:46:46Z

@wildart and I are working on the basic design of Pkg3. When it's somewhat more complete, we'll make a Julep and people can comment and debate it.

KristofferC · 2016-08-15T11:11:07Z

Closing this because it feels a bit too speculative. I will open a new issue if I have time to play around with the package system and have something concrete to try out.

KristofferC · 2016-08-26T12:55:39Z

So I looked a little bit at this today.

I first tried to address @simonster's comment about two files per version. I wrote this script: https://gist.github.com/KristofferC/df418a78e3485658c1a533b66191de89 which takes the existing METADATA repository and condenses everything into one file per package. The resulting repo can be seen at: https://github.com/KristofferC/METADATA_compressed. This is still a very naive format made to be easy to read by a human.

To test the performance of this I wrote a new available() function: https://gist.github.com/KristofferC/63720b50b7a93cfca82e6a98bcf1c6c9. The new one is about 3 times as fast as the old one on my Linux machine. I have not benchmarked on Windows, but since I have heard that reading files is even more expensive there, it should have an even larger impact.

tkelman · 2016-08-26T13:17:31Z

We should also look into sharding the repository with a bit more structure in terms of where each package gets placed. That may help git out a bit, and make things aesthetically nicer to navigate on github.

KristofferC · 2016-08-26T13:25:26Z

"A", "B", "C" folders etc for start of package name?

tkelman · 2016-08-26T13:31:21Z

That would be the simplest thing. It leads to a bit of imbalance since packages aren't evenly distributed across the alphabet, but it's better than nothing.
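A minimal sketch of that first-letter scheme, which also makes the imbalance easy to measure. The function names and the sample package list are assumptions for illustration only.

```julia
# Place each package under a folder named after the uppercased first
# character of its name, e.g. "Plots" -> "P/Plots".
# Assumption: package names are non-empty.
shard_path(pkg::AbstractString) = joinpath(uppercase(string(first(pkg))), pkg)

# Count packages per shard to see how uneven the split is.
function shard_sizes(pkgs)
    counts = Dict{String,Int}()
    for pkg in pkgs
        shard = dirname(shard_path(pkg))
        counts[shard] = get(counts, shard, 0) + 1
    end
    return counts
end

# A handful of names for illustration only, not the real registry contents.
sample = ["Plots", "PyCall", "PyPlot", "Compat", "DataFrames", "Gadfly"]
sizes = shard_sizes(sample)
```

Running shard_sizes over the real list of registered names would show how lopsided the folders get in practice.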

KristofferC · 2016-08-26T14:02:03Z

https://github.com/KristofferC/METADATA_compressed updated to see how it feels with folders

KristofferC · 2016-08-28T21:14:26Z

So I implemented the thing some people have talked about, which is reading directly from the git blobs instead of the actual files. I added some convenience functions to LibGit2 in KristofferC@2155145 and the new available() is:

import Base.LibGit2: GitRepo, GitTree, GitBlob, filename, peel, object, content

function available(repo::GitRepo)
    pkgs = Dict{String,Dict{VersionNumber,Available}}()
    head = LibGit2.head(repo)
    ht = LibGit2.peel(LibGit2.GitTree, head)
    for pkg in ht # Package folders
        !isdir(pkg) && continue
        pkg_name = filename(pkg)
        startswith(pkg_name, '.') && continue
        for package_dir_entry in peel(GitTree, object(repo, pkg))
            entry_name = filename(package_dir_entry)
            !isdir(package_dir_entry) && continue # probably the url file so skip
            entry_name != "versions" && continue  # skip non "versions" folders
            # Loop over the folders in "versions" now
            for ver in peel(GitTree, object(repo, package_dir_entry))
                ver_name = filename(ver)
                !ismatch(Base.VERSION_REGEX, ver_name) && continue
                sha_str = ""
                requires_str = ""
                for ver_file in peel(GitTree, object(repo, ver))
                    !isfile(ver_file) && continue
                    ver_file_name = filename(ver_file)
                    blob = peel(GitBlob, object(repo, ver_file))
                    if ver_file_name == "requires"
                        requires_str = unsafe_string(convert(Cstring, content(blob)))
                    elseif ver_file_name == "sha1"
                        sha_str = unsafe_string(convert(Cstring, content(blob)))
                    end
                end
                haskey(pkgs, pkg_name) || (pkgs[pkg_name] = Dict{VersionNumber,Available}())
                pkgs[pkg_name][convert(VersionNumber, ver_name)] =
                    Available(strip(sha_str), Reqs.parse(split(requires_str, '\n')))
            end
        end
    end
    return pkgs
end

Benchmarking shows that this is about 2x faster than the previous ones. Note that today was the first time I even looked at libgit2 and I have basically no concept of what is expensive, so the above code might do something really bad and there are probably improvements that can be made. What is good is that no changes are required to METADATA, and it should be possible to use a bare clone to save on disk size.

KristofferC · 2016-08-28T21:29:01Z

The timings are from a Linux computer with an SSD, so maybe there is a larger performance gain on Windows or on slower hard drives?

wildart · 2016-08-28T22:07:17Z

Some time ago I posted a benchmark comparing bare vs. checked-out METADATA parsing: reading the bare metadata repo always beats the checked-out one, see #9944.
But that is not the issue. The first optimization required is to only read dependency metadata when it is needed, which means avoiding calls to Pkg.available altogether.

KristofferC · 2016-08-29T06:24:18Z

Cache the result and use it if the METADATA repo SHA is the same + repo not dirty?

KristofferC · 2016-08-29T06:26:29Z

Serialize the result to disk together with the SHA and then only reread the package folders that changed in METADATA from that commit?
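A minimal sketch of the whole-result cache, assuming the caller supplies the current METADATA HEAD SHA and skips the cache entirely when the repo is dirty. It does not attempt the incremental "only reread changed folders" part. All names are placeholders, and it uses the Serialization standard library of Julia 1.x rather than the 0.5-era Base.

```julia
using Serialization  # standard library

# Return the cached value if it was computed for `head_sha`; otherwise call
# `compute()`, store the result together with the SHA, and return it.
# `cache_file`, `head_sha`, and `compute` are placeholders: in Pkg they would
# be a file next to METADATA, the SHA of its HEAD commit, and the function
# that builds the result of available().
function cached_available(compute, cache_file::AbstractString, head_sha::AbstractString)
    if isfile(cache_file)
        sha, value = deserialize(cache_file)
        sha == head_sha && return value
    end
    value = compute()
    serialize(cache_file, (head_sha, value))
    return value
end

# Usage with stand-in data: the second call hits the cache and never runs
# the compute function.
cache_file = joinpath(mktempdir(), "available.cache")
calls = Ref(0)
build() = (calls[] += 1; Dict("Example" => ["0.1.0", "0.2.0"]))

first_result = cached_available(build, cache_file, "abc123")
second_result = cached_available(build, cache_file, "abc123")
third_result = cached_available(build, cache_file, "def456")  # SHA changed: recompute
```

The cache is only as trustworthy as the key, so any local modification to METADATA that does not change the HEAD SHA has to be detected separately, which is what the dirty check is for.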

KristofferC · 2016-08-29T12:20:12Z

So I combined the compressed METADATA with the read-blob strategy and took away all the parsing, so that the only part being benchmarked is going through the files and arranging their contents in the right form to hand over to the parse method. The new way is about 6x faster than the current one and takes 0.04 seconds to go through all of METADATA. If anyone is interested, here is the blob reader for the compressed METADATA: https://gist.github.com/KristofferC/d4e3acbda9a5845dfc0738171c2f039d.

The no overhead libgit2 version of the current metadata is around 0.16 seconds so 2x of current in Base.

KristofferC · 2016-08-29T20:32:02Z

Seems that git operations in general are quite slow on METADATA. isdirty takes 0.11 seconds. On my compressed branch it takes 0.002. Is that because of the difference in length of history? I can do a git status and yell out the answer in less than 0.11 seconds.

tkelman · 2016-08-29T20:43:46Z

number of files probably makes a difference

KristofferC · 2016-08-29T20:45:06Z

Yes, a completely new repo made from a copy of METADATA still takes 0.11 seconds for an isdirty. Crazy. There's gotta be a faster way?

KristofferC · 2016-09-01T12:12:46Z

For fun I created a branch at https://github.com/KristofferC/julia/tree/kc/metadata_v3 which uses the new compressed METADATA format I posted about above that can be seen at https://github.com/KristofferC/METADATA.jl for the Pkg operations. I have a cron job that syncs the current METADATA with that one. Things in general feel a bit snappier but I haven't really benchmarked properly so maybe it is just in my head :P It is nice that the METADATA repo website doesn't lag so much though.

It is probably not worth swapping to if the plan for pkg3 is to land in 0.6 but maybe some inspiration can be taken from it.

KristofferC · 2016-09-01T12:58:51Z

Updating the METADATA format shouldn't be that bad though. Just write a script that transforms the new format to the old one and run it from a cron job pretty frequently. Update PkgDev to generate the new format. Tell everyone to use the new PkgDev version when tagging or registering packages. This shows the advantage of having PkgDev separated from Base.

tkelman · 2016-09-01T13:19:06Z

Also update all the verification code. Unless we do bidirectional mirroring, we should only take PRs to one of the branches.

KristofferC · 2016-09-01T13:25:12Z

Yeah, that's what I meant with

Tell everyone to use the new PkgDev version

My point was mostly that because PkgDev is decoupled from Base, a potential swap of METADATA format is not actually that intrusive to either users or developers.

tkelman · 2016-09-01T13:27:40Z

Some people will likely still need to use Julia 0.4 for a while. Making it impossible to tag packages coming from there would be a bit unfortunate, but I guess we could actually implement the new format within PkgDev on a different branch to make the package 0.4-compatible.

KristofferC · 2016-09-01T13:31:23Z

It is of course easy to have bidirectional mirroring.. Just more work for the METADATA reviewers (aka you).

tkelman · 2016-09-01T13:37:56Z

That's what I meant by verification code - we'd need to check against the new format being submitted to the old branch or vice versa.

KristofferC added the domain:packages Package management and loading label Aug 11, 2016

KristofferC closed this as completed Aug 15, 2016

oxinabox mentioned this issue Nov 6, 2016

Repository Cleaning SciML/DifferentialEquations.jl#107

Closed

KristofferC mentioned this issue Dec 14, 2016

Unbounded disk space used by package's metadata #19597

Closed
