Investigate using shallow clones / tarballs for `Pkg.add` and `Pkg.clone` #17963
libgit2 doesn't support shallow clones.
Cross ref: libgit2/libgit2#3058
From the issue I linked:
So using GitHub as a CDN and using shallow clones seems to not make them so happy, at least if you are big, which we aim to be!
I think that METADATA.jl is also unsustainable in the long run, since it carries information about every version of every package ever produced. While the folder structure is useful for version control, I suspect it's hell for the file system. Right now, there are >15,000 files in there and nearly 10,000 directories. Given that the allocation block size on HFS+ is 4K, every time anyone tags anything, it costs me >8K of disk space.
I've said elsewhere that for Pkg3 I think we should seriously restructure the way METADATA works. One TOML (or JSON or something) file per package, with information appended per tag, would probably be worth it in terms of being easier on the filesystem. It would need a little bit of parsing, but is probably better overall. We'll also need a real migration story so we can come up with a systematic way of archiving old history of package versions and METADATA versions, probably with periodic new-branch resets?
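As a rough sketch of what a one-file-per-package entry could look like, here is a hedged, hypothetical generator (the field names, `[[version]]` stanzas, and the `package_entry` helper are all assumptions for illustration, not a proposed final format):

```julia
# Hypothetical generator for a condensed one-file-per-package entry:
# one header with name/url, then one appended stanza per tagged version.
function package_entry(name::AbstractString, url::AbstractString,
                       versions::Dict{VersionNumber,<:Tuple})
    io = IOBuffer()
    println(io, "name = ", repr(name))
    println(io, "url = ", repr(url))
    # Sort by version so a tag always appends in a predictable place.
    for (v, (sha, reqs)) in sort!(collect(versions); by = first)
        println(io)
        println(io, "[[version]]")
        println(io, "version = ", repr(string(v)))
        println(io, "sha1 = ", repr(sha))
        println(io, "requires = ", repr(reqs))
    end
    return String(take!(io))
end
```

Tagging a new version would then be a pure append to one file instead of creating a new directory with two files in it.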
Is Pkg3 only a working title, or a repository?
@wildart and I are working on the basic design of Pkg3. When it's somewhat more complete, we'll make a Julep and people can comment and debate it.
Closing this because it feels a bit too speculative. I will open a new issue if I have time to play around with the package system and have something concrete to try out.
So I looked a little bit at this today. I first tried to address @simonster's comment about two files per version. I wrote this script: https://gist.github.com/KristofferC/df418a78e3485658c1a533b66191de89, which takes the existing METADATA repository and condenses everything into one file per package. The resulting repo can be seen at https://github.com/KristofferC/METADATA_compressed. This is still a very naive format, made to be easy for a human to read. To test the performance of this I wrote a new …
We should also look into sharding the repository with a bit more structure in terms of where each package gets placed. That may help git out a bit, and make things aesthetically nicer to navigate on GitHub.
"A", "B", "C" folders etc. for the start of the package name?
That would be the simplest thing. It leads to a bit of imbalance since packages aren't evenly distributed across the alphabet, but it's better than nothing.
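A first-letter sharding scheme could be as simple as this sketch (`shard_path` is a hypothetical helper; the `_` bucket for names not starting with a letter is an assumption):

```julia
# Hypothetical helper: put each package under a one-letter shard folder,
# e.g. "JSON" -> "J/JSON", so no single directory holds every package.
function shard_path(pkg::AbstractString)
    c = uppercase(first(pkg))
    return joinpath(isletter(c) ? string(c) : "_", pkg)
end
```

The imbalance mentioned above could later be addressed by sharding on a hash prefix instead of the first letter, at the cost of making the layout opaque to humans.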
https://github.com/KristofferC/METADATA_compressed has been updated so you can see how it feels with folders.
So I implemented the thing some people have talked about, which is reading directly from the git blobs instead of the checked-out files. I added some convenience functions:

```julia
import Base.LibGit2: GitRepo, GitTree, GitBlob, filename, peel, object, content

function available(repo::GitRepo)
    pkgs = Dict{String,Dict{VersionNumber,Available}}()
    head = LibGit2.head(repo)
    ht = LibGit2.peel(LibGit2.GitTree, head)
    for pkg in ht # package folders
        !isdir(pkg) && continue
        pkg_name = filename(pkg)
        startswith(pkg_name, '.') && continue
        for package_dir_entry in peel(GitTree, object(repo, pkg))
            entry_name = filename(package_dir_entry)
            !isdir(package_dir_entry) && continue # probably the url file, so skip
            entry_name != "versions" && continue  # skip non-"versions" folders
            # loop over the folders in "versions"
            for ver in peel(GitTree, object(repo, package_dir_entry))
                ver_name = filename(ver)
                !ismatch(Base.VERSION_REGEX, ver_name) && continue
                sha_str = ""
                requires_str = ""
                for ver_file in peel(GitTree, object(repo, ver))
                    !isfile(ver_file) && continue
                    ver_file_name = filename(ver_file)
                    blob = peel(GitBlob, object(repo, ver_file))
                    if ver_file_name == "requires"
                        requires_str = unsafe_string(convert(Cstring, content(blob)))
                    elseif ver_file_name == "sha1"
                        sha_str = unsafe_string(convert(Cstring, content(blob)))
                    end
                end
                haskey(pkgs, pkg_name) || (pkgs[pkg_name] = Dict{VersionNumber,Available}())
                pkgs[pkg_name][convert(VersionNumber, ver_name)] =
                    Available(strip(sha_str), Reqs.parse(split(requires_str, '\n')))
            end
        end
    end
    return pkgs
end
```

Benchmarking shows that this is about 2x faster than the previous approaches. Note that today was the first time I ever looked at libgit2, and I have basically no concept of what is expensive, so the above code might do something really bad; there are probably improvements to be made. What is good is that no changes to METADATA are required, and it should be possible to use a bare clone to save on disk size.
The timings are on a Linux computer with an SSD, so maybe there is a bigger performance gain on Windows / worse hard drives?
Some time ago I posted a benchmark comparing parsing of a bare vs. a checked-out METADATA repo; reading the bare repo always beats the checked-out one, see #9944.
Cache the result and use it if the METADATA repo SHA is the same and the repo is not dirty?
Serialize the result to disk together with the SHA, and then only reread the package folders that changed in METADATA since that commit?
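The caching idea above could look something like this minimal sketch (everything here is an assumption for illustration: `parse_metadata` stands in for the real parser, the cache file location is arbitrary, and it shells out to the git CLI rather than using libgit2):

```julia
using Serialization  # stdlib

# Hedged sketch of the cache-by-SHA idea: store the parsed result next to
# the HEAD SHA it was computed from, and reuse it while the SHA matches and
# the repo is clean. `parse_metadata` is a stand-in for the real parser.
function cached_available(repo_path::AbstractString, cachefile::AbstractString,
                          parse_metadata)
    sha = readchomp(`git -C $repo_path rev-parse HEAD`)
    dirty = !isempty(read(`git -C $repo_path status --porcelain`, String))
    if !dirty && isfile(cachefile)
        cached_sha, pkgs = open(deserialize, cachefile)
        cached_sha == sha && return pkgs            # cache hit: skip reparsing
    end
    pkgs = parse_metadata(repo_path)                # cache miss: parse and store
    open(io -> serialize(io, (sha, pkgs)), cachefile, "w")
    return pkgs
end
```

The finer-grained variant (only rereading the package folders that changed) would instead diff the cached SHA against HEAD and reparse just the touched paths.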
So I combined the compressed METADATA with the read-blob strategy and took away all the parsing, so that the only part being benchmarked is going through the files and arranging them to hand over to the parse method. The new way is about 6x faster than the current one and takes 0.04 seconds to go through all of METADATA. If anyone is interested, here is the blob reader for the compressed METADATA: https://gist.github.com/KristofferC/d4e3acbda9a5845dfc0738171c2f039d. The no-overhead libgit2 version on the current METADATA format takes around 0.16 seconds, so about 2x faster than what is currently in Base.
It seems that git operations in general are quite slow on METADATA.
The number of files probably makes a difference.
Yes, a completely new repo created from a copy of METADATA still takes 0.11 seconds for a …
For fun I created a branch at https://github.com/KristofferC/julia/tree/kc/metadata_v3 which uses, for the Pkg operations, the new compressed METADATA format I posted about above; it can be seen at https://github.com/KristofferC/METADATA.jl. I have a cron job that syncs the current METADATA with that one. Things in general feel a bit snappier, but I haven't really benchmarked properly, so maybe it is just in my head :P It is nice that the METADATA repo website doesn't lag so much, though. It is probably not worth swapping to if the plan is for Pkg3 to land in 0.6, but maybe some inspiration can be taken from it.
Updating the METADATA format shouldn't be that bad though. Just write a script that transforms the new format to the old, run on a cron job pretty frequently. Update PkgDev to generate the new format. Tell everyone to use the new PkgDev version when tagging / registering packages. It shows the advantage of having PkgDev separated from Base.
Also update all the verification code. Unless we do bidirectional mirroring, we should only take PRs to one of the branches.
Yeah, that's what I meant with …
My point was mostly that because PkgDev is decoupled from Base, a potential swap of METADATA format is not actually that intrusive to either users or developers.
Some people will likely still need to use Julia 0.4 for a while. Making it impossible to tag packages coming from there would be a bit unfortunate, but I guess we could actually implement the new format within PkgDev on a different branch to make the package 0.4-compatible.
It is of course easy to have bidirectional mirroring; it's just more work for the METADATA reviewers (aka you).
That's what I meant by verification code: we'd need to check against the new format being submitted to the old branch, or vice versa.
Currently, `Pkg.add` and `Pkg.clone` just get the whole git repo from the server. Having the full git repo is nice if you want to develop the package, since you can just start gitting away. It does however come with drawbacks. My v0.5 folder is 2.0 GB and contains 100 000+ files. Some packages also have quite large git repos (Plots.jl is over 300 MB), so cloning these takes considerable time.

As Julia matures, the number of users / packages should go up compared to the number of developers. This means that the reason for having the full git repo locally becomes on average less and less important. A user who only wants the latest release of a package would be just as happy getting the latest tarball of the package. This should also be significantly faster.

The things I propose are the following:

- Use `shallow-clone` for getting new packages. This should reduce the time and disk size needed to add a new package while still keeping the possibility of going back to the full repo with a simple `git fetch --unshallow`. According to @yuyichao there can sometimes be problems with the server when using shallow repos, but it should be workable. This issue is worth looking at: "Issues Cloning Spec repo - GitHub taking a very long time to download changes to the Specs Repo", CocoaPods/CocoaPods#4989 (comment).
- Move the versions of `add` and `clone` that get the full git repos to `PkgDev`.

One question is what happens with dependency resolution if we don't have the full git repo. I am not sure how the resolution is done, but if we at least have a shallow git repo and find that we need to check out a tag that does not exist locally, maybe we can just fetch back far enough to get that tag. If we just have the tarball, I guess we could get the tarball for the tag we need and then set that one as "active" somehow.

I am not very involved with how the whole package system works, and maybe these ideas have been discussed and dismissed previously, but I think doing something like the above could improve the package experience for normal users while not making it significantly worse for developers.

cc @wildart @carlobaldassi as the `Pkg` experts :)
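The shallow-clone part of the proposal can be sketched as follows. Since libgit2 does not support shallow clones (libgit2/libgit2#3058), this hedged sketch shells out to the git CLI; the function names are placeholders, not real Pkg API:

```julia
# Hypothetical sketch: shallow-clone a package on add, keeping only the
# latest commit. libgit2 can't do this, so we shell out to the git CLI.
function shallow_clone(url::AbstractString, dest::AbstractString)
    run(`git clone --quiet --depth 1 $url $dest`)
end

# The "escape hatch" for developers: recover the full history later.
unshallow(dest::AbstractString) = run(`git -C $dest fetch --quiet --unshallow`)
```

After `shallow_clone`, `git log` in `dest` shows a single commit; `unshallow` then fetches the rest of the history, giving back the full repo for development.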