Investigate in using shallow-clone / tar balls for Pkg.add and Pkg.clone #17963

Closed
KristofferC opened this issue Aug 11, 2016 · 30 comments
Labels
domain:packages Package management and loading

Comments

@KristofferC
Sponsor Member

Currently, Pkg.add and Pkg.clone just get the whole git repo from the server. Having the full git repo is nice if you want to develop the package since you can just start gitting away. It does, however, come with drawbacks. My v0.5 folder is 2.0 GB and contains 100,000+ files. Some packages also have quite large git repos (Plots.jl is over 300 MB), so cloning these takes considerable time.

As Julia matures, the number of users per package should go up compared to the number of developers. This means that the reason for having the full git repo locally becomes, on average, less and less important. A user who only wants the latest release of a package would be just as happy getting the latest tarball of the package. This should also be significantly faster.

The things I propose are the following:

One question is what happens with dependency resolution if we don't have the full git repo. I am not sure how the resolution is done, but if we at least have the shallow git repo and find that we need to check out a tag that does not exist locally, maybe we can just fetch back far enough to get that tag. If we only have the tarball, I guess we could get the tarball for the tag we need and then set that one as "active" somehow.

I am not very involved with how the whole package system works, and maybe these ideas have been discussed and dismissed previously, but I think doing something like the above could improve the package experience for normal users without making it significantly worse for developers.

cc @wildart @carlobaldassi as the Pkg experts :)

@KristofferC KristofferC added the domain:packages Package management and loading label Aug 11, 2016
@tkelman
Contributor

tkelman commented Aug 11, 2016

libgit2 doesn't support shallow clones.

@oxinabox
Contributor

Cross Ref: libgit2/libgit2#3058

@KristofferC
Sponsor Member Author

From the issue I linked:

Our advice would be for CocoaPods to stop using any kind of shallow feature from Git altogether. Users should perform a full clone of the repository, and then fetch into it as usual. Simply performing that change should significantly soften the load on our fileservers.

So using GitHub as a CDN with shallow clones does not seem to make them very happy, at least if you are big, which we aim to be!

@simonster
Member

simonster commented Aug 11, 2016

I think that METADATA.jl is also unsustainable in the long run, since it carries information about every version of every package ever produced. While the folder structure is useful for version control, I suspect it's hell for the file system. Right now, there are >15,000 files in there and nearly 10,000 directories. Given that the allocation block size on HFS+ is 4K, every time anyone tags anything, it costs me >8K of disk space.
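
The arithmetic behind that last figure can be sketched roughly as follows, assuming each tag adds two small files (`sha1` and `requires`) and that every non-empty file occupies at least one 4K allocation block. The helper name `allocated_bytes` and the file sizes are made up for illustration; the new directory entries would come on top of this, which is presumably where the ">" comes from.

```julia
# Hypothetical helper: bytes actually consumed on disk by a file of
# `size` bytes when the filesystem allocates space in whole blocks.
allocated_bytes(size::Integer; block::Integer = 4096) =
    size == 0 ? 0 : cld(size, block) * block

# Assumed sizes in bytes for one tag: a sha1 file and a requires file.
file_sizes = [41, 20]

# Each tiny file still costs a full block.
cost = sum(allocated_bytes, file_sizes)
println("disk cost per tag: ", cost, " bytes")
```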

@tkelman
Contributor

tkelman commented Aug 11, 2016

I've said elsewhere, for Pkg3 I think we should seriously restructure the way METADATA works. One toml (or json or something) file per package with appended information per tag would probably be worth it in terms of being easier on the filesystem. Would need a little bit of parsing, but probably better overall. And we'll need a real migration story so we can come up with a systematic way of archiving old history of package versions and metadata versions, probably with periodic new-branch resets?
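
To make the one-file-per-package idea a bit more concrete, here is a minimal sketch of what such a file could look like and how little parsing it needs. The layout, the key names (`name`, `url`, `versions`, `sha1`, `requires`), the URL and the hashes are all invented for this example, not a proposed or actual Pkg3 format. It uses the `TOML` standard library that ships with Julia 1.6 and later.

```julia
using TOML  # standard library in Julia 1.6 and later

# Hypothetical per-package file: one table per tagged version,
# appended to whenever a new tag is registered.
example = """
name = "Example"
url = "https://example.invalid/Example.jl.git"

[versions."0.1.0"]
sha1 = "0000000000000000000000000000000000000001"
requires = ["julia 0.4"]

[versions."0.2.0"]
sha1 = "0000000000000000000000000000000000000002"
requires = ["julia 0.5", "Compat 0.8"]
"""

meta = TOML.parse(example)

# Collect the tagged versions in sorted order.
versions = sort!(VersionNumber.(collect(keys(meta["versions"]))))
```

A single read and a single parse then cover every version of the package, instead of two file reads per version.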

@lobingera

Is Pkg3 only a working title, or a repository?

@tkelman
Contributor

tkelman commented Aug 11, 2016

@StefanKarpinski
Sponsor Member

@wildart and I are working on the basic design of Pkg3. When it's somewhat more complete, we'll make a Julep and people can comment and debate it.

@KristofferC
Sponsor Member Author

Closing this because it feels a bit too speculative. Will open a new issue if I have time to play around with the package system and have something concrete to try out.

@KristofferC
Sponsor Member Author

KristofferC commented Aug 26, 2016

So I looked a little bit at this today.

I first tried to address @simonster's comment about two files per version. I wrote this script: https://gist.github.com/KristofferC/df418a78e3485658c1a533b66191de89 which takes the existing METADATA repository and condenses everything into one file per package. The resulting repo can be seen at: https://github.com/KristofferC/METADATA_compressed. This is still a very naive format made to be easy to read by a human.

To test the performance of this I wrote a new available() function: https://gist.github.com/KristofferC/63720b50b7a93cfca82e6a98bcf1c6c9. The new one is about 3 times as fast as the old one on my Linux machine. I have not benchmarked on Windows, but since I have heard that reading files is even more expensive there, it should have an even larger impact.

@tkelman
Contributor

tkelman commented Aug 26, 2016

We should also look into sharding the repository with a bit more structure in terms of where each package gets placed. That may help git out a bit, and make things aesthetically nicer to navigate on github.

@KristofferC
Sponsor Member Author

KristofferC commented Aug 26, 2016

"A", "B", "C" folders etc for start of package name?

@tkelman
Contributor

tkelman commented Aug 26, 2016

That would be the simplest thing. It leads to a bit of imbalance since packages aren't evenly distributed across the alphabet, but it's better than nothing.
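
A minimal sketch of that first-letter scheme, with a small counter to make the imbalance visible. The function names and the package list are just for illustration.

```julia
# Hypothetical sharding rule: the first character of the package name, uppercased.
shard(pkg::AbstractString) = string(uppercase(first(pkg)))

# Where a package's metadata would live inside the repository.
shard_path(pkg::AbstractString) = joinpath(shard(pkg), pkg)

# Count packages per shard to see how unevenly they are spread.
function shard_counts(pkgs)
    counts = Dict{String,Int}()
    for p in pkgs
        counts[shard(p)] = get(counts, shard(p), 0) + 1
    end
    return counts
end

pkgs = ["Plots", "PyCall", "PyPlot", "DataFrames", "Zlib"]
counts = shard_counts(pkgs)
```

With this sample, "P" ends up holding three of the five packages while "D" and "Z" hold one each.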

@KristofferC
Sponsor Member Author

https://github.com/KristofferC/METADATA_compressed updated to see how it feels with folders

@KristofferC
Sponsor Member Author

So I implemented the thing some people have talked about, which is reading directly from the git blobs instead of the actual files. I added some convenience functions to LibGit2 in KristofferC@2155145 and the new available() is:

import Base.LibGit2: GitRepo, GitTree, GitBlob, filename, peel, object, content

function available(repo::GitRepo)
    pkgs = Dict{String,Dict{VersionNumber,Available}}()
    head = LibGit2.head(repo)
    ht = LibGit2.peel(LibGit2.GitTree, head)
    for pkg in ht # Package folders
        !isdir(pkg) && continue
        pkg_name = filename(pkg)
        startswith(pkg_name, '.') && continue
        for package_dir_entry in peel(GitTree, object(repo, pkg))
            entry_name = filename(package_dir_entry)
            !isdir(package_dir_entry) && continue # probably the url file so skip
            entry_name != "versions" && continue  # skip non "versions" folders
            # Now loop over the folders in "versions"
            for ver in peel(GitTree, object(repo, package_dir_entry))
                ver_name = filename(ver)
                !ismatch(Base.VERSION_REGEX, ver_name) && continue
                sha_str = ""
                requires_str = ""
                for ver_file in peel(GitTree, object(repo, ver))
                    !isfile(ver_file) && continue
                    ver_file_name = filename(ver_file)
                    blob = peel(GitBlob, object(repo, ver_file))
                    if ver_file_name == "requires"
                        requires_str = unsafe_string(convert(Cstring, content(blob)))
                    elseif ver_file_name == "sha1"
                        sha_str = unsafe_string(convert(Cstring, content(blob)))
                    end
                end
                haskey(pkgs, pkg_name) || (pkgs[pkg_name] = Dict{VersionNumber,Available}())
                pkgs[pkg_name][convert(VersionNumber, ver_name)] =
                    Available(strip(sha_str), Reqs.parse(split(requires_str, '\n')))
            end
        end
    end
    return pkgs
end

Benchmarking shows that this is about 2x faster than the previous ones. Note that today was the first time I even looked at libgit2 and I have basically no concept of what is expensive, so the above code might do something really bad and there are probably improvements that can be made. What is good is that no changes are required to METADATA, and it should be possible to use a bare clone to save on disk size.

@KristofferC
Sponsor Member Author

The timings are from a Linux computer with an SSD, so maybe there is a bigger performance gain on Windows / slower hard drives?

@wildart
Member

wildart commented Aug 28, 2016

Some time ago I posted a benchmark comparing bare vs checked-out METADATA parsing: reading the bare METADATA repo always beats the checked-out one, see #9944.
But that is not an issue. The first optimization that is required is to read dependency metadata only when it is needed, which means avoiding calls to Pkg.available entirely.

@KristofferC
Sponsor Member Author

Cache the result and use it if the METADATA repo SHA is the same + repo not dirty?

@KristofferC
Sponsor Member Author

KristofferC commented Aug 29, 2016

Serialize the result to disk together with the SHA and then only reread the package folders that changed in METADATA from that commit?
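
A rough sketch of the simpler half of that idea: store the parsed result on disk next to the SHA it was built from, and reuse it while the SHA matches. `cached_available`, the cache file and the `parse_metadata` callback are placeholders, not existing Pkg functions. The dirty check and the incremental "only reread what changed" part are left out, so this version reparses everything whenever the SHA differs.

```julia
using Serialization  # standard library

# Hypothetical cache wrapper. `parse_metadata` is whatever expensive function
# builds the result; `head_sha` is the current METADATA HEAD, obtained elsewhere.
function cached_available(parse_metadata, cache_file::AbstractString,
                          head_sha::AbstractString)
    if isfile(cache_file)
        sha, data = open(deserialize, cache_file)
        sha == head_sha && return data   # cache is still valid
    end
    data = parse_metadata()              # cache miss: do the full parse
    open(io -> serialize(io, (head_sha, data)), cache_file, "w")
    return data
end

# Usage with a stand-in for the expensive parse.
calls = Ref(0)
expensive() = (calls[] += 1; Dict("Example" => ["0.1.0"]))

cache_file = joinpath(mktempdir(), "available.cache")
a = cached_available(expensive, cache_file, "abc123")
b = cached_available(expensive, cache_file, "abc123")  # same SHA: served from disk
c = cached_available(expensive, cache_file, "def456")  # SHA changed: reparsed
```

Here the expensive function runs twice in total: once for the first call and once after the SHA changes.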

@KristofferC
Sponsor Member Author

KristofferC commented Aug 29, 2016

So I combined the compressed METADATA with the read-blob strategy and took away all the parsing, so that the only part being benchmarked is going through the files and arranging their contents to hand over to the parse method. The new way is about 6x faster than the current one and takes 0.04 seconds to go through all of METADATA. If anyone is interested, here is the blob reader for the compressed METADATA: https://gist.github.com/KristofferC/d4e3acbda9a5845dfc0738171c2f039d.

The no-overhead libgit2 version for the current METADATA format takes around 0.16 seconds, so roughly 2x faster than what is currently in Base.

@KristofferC
Sponsor Member Author

Seems that git stuff in general is quite slow on METADATA. isdirty takes 0.11 seconds. On my compressed branch it takes 0.002. Is that because of the difference in length of history? I can do a git status and yell out the answer in less than 0.11 seconds.

@tkelman
Contributor

tkelman commented Aug 29, 2016

number of files probably makes a difference

@KristofferC
Sponsor Member Author

Yes, a completely new repo made from a copy of METADATA still takes 0.11 seconds for an isdirty. Crazy. There's gotta be a faster way?

@KristofferC
Sponsor Member Author

For fun I created a branch at https://github.com/KristofferC/julia/tree/kc/metadata_v3 which uses the new compressed METADATA format I posted about above that can be seen at https://github.com/KristofferC/METADATA.jl for the Pkg operations. I have a cron job that syncs the current METADATA with that one. Things in general feel a bit snappier but I haven't really benchmarked properly so maybe it is just in my head :P It is nice that the METADATA repo website doesn't lag so much though.

It is probably not worth swapping to if the plan is for Pkg3 to land in 0.6, but maybe some inspiration can be taken from it.

@KristofferC
Sponsor Member Author

KristofferC commented Sep 1, 2016

Updating the METADATA format shouldn't be that bad though. Just write a script that transforms the new format to the old one and run it from a cron job pretty frequently. Update PkgDev to generate the new format. Tell everyone to use the new PkgDev version when tagging or registering packages. This shows the advantage of having PkgDev separated from Base.
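
The core of such a transformation script could look roughly like this. The output layout (`<Pkg>/url` plus `<Pkg>/versions/<version>/sha1` and `requires`) follows what the current METADATA uses; the input shape (a Dict from version string to a `(sha1, requires)` tuple), the function name and the sample data are assumptions made for this sketch.

```julia
# Hypothetical converter: write one package's entries in the old
# directory-per-version layout.
function write_old_layout(root, pkg, url, versions)
    pkgdir = joinpath(root, pkg)
    mkpath(pkgdir)
    write(joinpath(pkgdir, "url"), url * "\n")
    for (ver, (sha1, requires)) in versions
        verdir = joinpath(pkgdir, "versions", ver)
        mkpath(verdir)
        write(joinpath(verdir, "sha1"), sha1 * "\n")
        # Only write a requires file when there is something to require.
        isempty(requires) ||
            write(joinpath(verdir, "requires"), join(requires, "\n") * "\n")
    end
    return pkgdir
end

# Usage with made-up data, written into a temporary directory.
root = mktempdir()
write_old_layout(root, "Example", "git://example.invalid/Example.jl.git",
    Dict("0.1.0" => ("0000000000000000000000000000000000000001", ["julia 0.4"])))
```

The cron job would then loop this over every package file in the new-format repo and commit the result to the old-format branch.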

@tkelman
Contributor

tkelman commented Sep 1, 2016

Also update all the verification code. Unless we do bidirectional mirroring, we should only take PRs to one of the branches.

@KristofferC
Sponsor Member Author

Yeah, that's what I meant with

Tell everyone to use the new PkgDev version

My point was mostly that because PkgDev is decoupled from Base, a potential swap of METADATA format is not actually that intrusive to either users or developers.

@tkelman
Contributor

tkelman commented Sep 1, 2016

Some people will likely still need to use Julia 0.4 for a while. Making it impossible to tag packages coming from there would be a bit unfortunate, but I guess we could actually implement the new format within PkgDev on a different branch to make the package 0.4-compatible.

@KristofferC
Sponsor Member Author

KristofferC commented Sep 1, 2016

It is of course easy to have bidirectional mirroring.. Just more work for the METADATA reviewers (aka you).

@tkelman
Contributor

tkelman commented Sep 1, 2016

That's what I meant by verification code - we'd need to check against the new format being submitted to the old branch or vice versa.

7 participants