Large JLD2 file has invaded our git history #509

Closed
ali-ramadhan opened this issue Oct 28, 2019 · 17 comments · Fixed by #524 or #558
Labels
package 📦 Quite meta

Comments

@ali-ramadhan
Member

I noticed that our git repo has ballooned in size some time in the past week.

Someone, possibly me, committed a 52 MiB ocean_wind_mixing_and_convection.jld2 file, possibly generated by running the example. But docs/src/generated is in .gitignore, so I'm not sure how it made it in.

Either way, I think we should scrub it because the repo size has increased by an order of magnitude...


Here's a list of files over 500 KiB:

d277a4e5393b  650KiB test/regression_tests/data/data_rayleigh_benard_regression.jld2
b125bc6f8e9d  709KiB test/regression_tests/data/ocean_large_eddy_simulation_VerstappenAnisotropicMinimumDissipation_10000.jld2
f5c1a7736324  709KiB test/regression_tests/data/ocean_large_eddy_simulation_VerstappenAnisotropicMinimumDissipation_10010.jld2
0b493fa7dd14  709KiB test/regression_tests/data/ocean_large_eddy_simulation_SmagorinskyLilly_10000.jld2
ad020f12370b  709KiB test/regression_tests/data/ocean_large_eddy_simulation_SmagorinskyLilly_10010.jld2
0ee7298c84ad  731KiB test/thermal_bubble_golden_master_model_checkpoint_10.jld
eeeca1f2b394  2.4MiB test/deep_convection_golden_master_model_checkpoint_10.jld
4eb0499aa289   52MiB dev/generated/ocean_wind_mixing_and_convection.jld2
5b613ce426d5   52MiB v0.14.1/generated/ocean_wind_mixing_and_convection.jld2
7fddefca8cc0   52MiB dev/generated/ocean_wind_mixing_and_convection.jld2
b5c2ca7312e5   52MiB dev/generated/ocean_wind_mixing_and_convection.jld2
d1ee57ba2365   52MiB dev/generated/ocean_wind_mixing_and_convection.jld2
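
A listing like the one above can be produced from any clone with git plumbing commands. The following is a minimal sketch, not necessarily how the original listing was generated. It assumes a 500 KiB threshold, and the throwaway repository and file names are hypothetical stand-ins so the example runs on its own.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Build a throwaway repository so the sketch is self-contained (hypothetical contents).
demo=$(mktemp -d)
git init -q "$demo"
cd "$demo"
git config user.name demo
git config user.email demo@example.com
head -c 600000 /dev/urandom > big_output.jld2   # stand-in for a large data file
echo "small" > README.md
git add big_output.jld2 README.md
git commit -q -m "Add files"

# List every blob reachable from any ref whose uncompressed size exceeds the threshold.
# Output columns: abbreviated blob id, size in bytes, path (first word only if the path has spaces).
threshold=$((500 * 1024))
large_blobs=$(
  git rev-list --objects --all |
    git cat-file --batch-check='%(objecttype) %(objectname) %(objectsize) %(rest)' |
    awk -v min="$threshold" '$1 == "blob" && $3 >= min { print substr($2, 1, 12), $3, $4 }' |
    sort -k2,2n
)
echo "$large_blobs"
```

Because `git rev-list --objects --all` walks every ref, this also reports blobs that live only on branches such as gh-pages and never appear in the checked-out working tree.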
ali-ramadhan added the package 📦 Quite meta label Oct 28, 2019
@glwagner
Member

Bah, that's annoying.

@glwagner
Member

The file is in dev/generated/. Is that different from docs/generated?

@glwagner
Member

But also we ignore jld2, right?

@ali-ramadhan
Member Author

> The file is in dev/generated/. Is that different from docs/generated?

Yeah, although I can't find a dev directory on any branch, and git show, git blame, etc. aren't turning up anything.

> But also we ignore jld2, right?

Yeah so I'm not sure how the file made it in...

@navidcy
Collaborator

navidcy commented Oct 31, 2019

oh, probably that's why it took so long to clone the repo today..

@ali-ramadhan
Member Author

Sorry about that!

We could probably just use BFG Repo-Cleaner to get rid of it and be more careful in the future...

@navidcy
Collaborator

navidcy commented Oct 31, 2019

Something similar happened in FourierFlows.jl and if I recall correctly bfg-repo-cleaner is how we dealt with it.

@ali-ramadhan
Member Author

This has gotten really bad: our repo size is now 585 MB, which indicates that the file keeps changing. I looked into this again and it wasn't any of us. It was actually the documentation build, which includes the output of the ocean_wind_mixing_and_convection.jl example on the gh-pages branch: https://github.com/climate-machine/Oceananigans.jl/tree/gh-pages/dev

I'll open a PR that deletes JLD2 files generated by examples in docs/make.jl.
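
The cleanup itself is simple; what matters is that it runs before the built site is committed to gh-pages. A minimal shell sketch of the idea follows, assuming the build output lands in a hypothetical docs/build directory (the actual change would live in docs/make.jl and is not shown in this thread).

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical build directory populated with stand-in files so the sketch runs on its own.
build_dir="$(mktemp -d)/docs/build"
mkdir -p "$build_dir/generated"
printf 'binary data' > "$build_dir/generated/ocean_wind_mixing_and_convection.jld2"
printf '<html></html>' > "$build_dir/generated/index.html"

# Remove simulation output from the built site, printing each path as it is deleted.
find "$build_dir" -type f -name '*.jld2' -print -delete

# Everything else (HTML, assets) is left in place for deployment.
find "$build_dir" -type f
```

If this step runs only after the deployment commit has been made, the data file still ends up in the branch history, so the ordering is the part worth checking in CI logs.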

@ali-ramadhan
Member Author

PS: I think this issue is why Julia TagBot is failing to tag v0.15.0: JuliaRegistries/General#4989

@ali-ramadhan
Member Author

ali-ramadhan commented Nov 3, 2019

IMPORTANT: @glwagner @suyashbire1 we should delete all old repos and clone fresh.

From the BFG website:

"At this point, you're ready for everyone to ditch their old copies of the repo and do fresh clones of the nice, new pristine data. It's best to delete all old clones, as they'll have dirty history that you don't want to risk pushing back into your newly cleaned repo. "


I used BFG Repo Cleaner to delete all files larger than 1 MB in git history. This deleted two files: deep_convection_golden_master_model_checkpoint_10.jld and ocean_wind_mixing_and_convection.jld2.

I have a backup of the old "dirty" repository in case we need it for any reason.

Before:

$ du -hs Oceananigans.jl.git
620M    Oceananigans.jl.git

After:

$ du -hs Oceananigans.jl.git
14M     Oceananigans.jl.git

BFG log:

(base) [aramadhan@login-1 ~]$ java -jar bfg-1.13.0.jar --strip-blobs-bigger-than 1M Oceananigans.jl.git    

Using repo : /home/gridsan/aramadhan/Oceananigans.jl.git

Scanning packfile for large blobs: 17491
Scanning packfile for large blobs completed in 209 ms.
Found 17 blob ids for large blobs - biggest=54015055 smallest=2524792
Total size (unpacked)=56539847
Found 155 objects to protect
Found 19 tag-pointing refs : refs/tags/v0.10.0, refs/tags/v0.10.1, refs/tags/v0.11.0, ...
Found 273 commit-pointing refs : HEAD, refs/heads/ar/lid-driven-cavity, refs/heads/ar/more-solvers, ...

Protected commits
-----------------

These are your protected commits, and so their contents will NOT be altered:

 * commit 91e5626e (protected by 'HEAD')

Cleaning
--------

Found 3270 commits
Cleaning commits:       100% (3270/3270)
Cleaning commits completed in 35,468 ms.

Updating 255 Refs
-----------------

        Ref                                                        Before     After   
        ------------------------------------------------------------------------------
        refs/heads/ar/lid-driven-cavity                          | 8eae1762 | 0401753a
        refs/heads/ar/more-solvers                               | 5446ae47 | 4cf1d809
        refs/heads/ar/vertically-stretched-grid                  | 695eb278 | 50fbc9d0
        refs/heads/arbitrary-tracers-inner-loops                 | 2440af95 | c3c4ce7b
        refs/heads/forced-flow-test                              | 5355044d | 4771446c
        refs/heads/gh-pages                                      | 3ecbff38 | da919955
        refs/heads/glw/beaufort-gyre-example                     | 0e54d846 | 6fa7725a
        refs/heads/glw/circulation-experiment                    | de5764b9 | a53c1648
        refs/heads/glw/craik-leibovich-terms                     | 972fbd17 | 956d4739
        refs/heads/glw/eady-example                              | c53227c8 | d8d29f47
        refs/heads/glw/fix-eady-typos                            | 84e7d974 | f43c5dde
        refs/heads/glw/mesoscale-closures                        | 1d0090ec | 62b6fc9c
        refs/heads/glw/mesoscale-closures-biharmonic-diffusivity | 7a765f2a | fe555ef7
        refs/heads/glw/mesoscale-closures-kernel-refactor        | 43886b98 | 187742f0
        refs/heads/glw/mesoscale-closures-leith                  | 54bcd904 | f1f57ae2
        ...

Updating references:    100% (255/255)
...Ref update completed in 3,524 ms.

Commit Tree-Dirt History
------------------------

        Earliest                                              Latest
        |                                                          |
        ................DmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmmDDDD

        D = dirty commits (file tree fixed)
        m = modified commits (commit message or parents changed)
        . = clean commits (no changes to file tree)

                                Before     After   
        -------------------------------------------
        First modified commit | a8b6b6cf | 69bcf932
        Last dirty commit     | 3ecbff38 | da919955

Deleted files
-------------

        Filename                                                Git id                                     
        ---------------------------------------------------------------------------------------------------
        deep_convection_golden_master_model_checkpoint_10.jld | eeeca1f2 (2.4 MB)                          
        ocean_wind_mixing_and_convection.jld2                 | b5c2ca73 (51.5 MB), 5b613ce4 (51.5 MB), ...


In total, 2276 object ids were changed. Full details are logged here:

        /home/gridsan/aramadhan/Oceananigans.jl.git.bfg-report/2019-11-03/11-36-30

BFG run is complete! When ready, run: git reflog expire --expire=now --all && git gc --prune=now --aggressive

Testing:

$ git clone git@github.com:climate-machine/Oceananigans.jl.git
Cloning into 'Oceananigans.jl'...
remote: Enumerating objects: 2107, done.
remote: Counting objects: 100% (2107/2107), done.
remote: Compressing objects: 100% (2007/2007), done.
remote: Total 16677 (delta 93), reused 2093 (delta 83), pack-reused 14570
Receiving objects: 100% (16677/16677), 11.63 MiB | 10.26 MiB/s, done.
Resolving deltas: 100% (10969/10969), done.

ali-ramadhan mentioned this issue Nov 3, 2019
@c42f

c42f commented Nov 3, 2019

If you have a large file which exists only on a branch (and has never made it into master) it suffices to delete the branch. After that, fresh clones will not include the large file. So in this case I think you could have deleted or rewritten gh-pages and that would have fixed things with less disruption. (IIRC the way gh-pages usually works is a special branch which is detached from the rest of the history, which makes rewriting gh-pages in isolation even easier.)

Note that testing branch deletion with a local git repo can be misleading for repo size - you need some extra steps to remove references to the old branch HEAD from the reflog and gc before testing the size of the .git directory (something like git reflog expire --all; git gc).
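
The effect described here can be reproduced in a throwaway repository. The sketch below is an illustration under assumptions: the branch and file names are hypothetical, and it uses `--expire=now` and `--prune=now` so that unreachable objects are dropped immediately instead of waiting out git's default grace periods.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Throwaway repository with a small main branch (hypothetical names throughout).
demo=$(mktemp -d)
git init -q "$demo"
cd "$demo"
git symbolic-ref HEAD refs/heads/main
git config user.name demo
git config user.email demo@example.com
echo "source" > README.md
git add README.md
git commit -q -m "Initial commit"

# A large file that exists only on a side branch.
git checkout -q -b docs-output
head -c 2000000 /dev/urandom > output.bin
git add output.bin
git commit -q -m "Add large output"
blob=$(git rev-parse HEAD:output.bin)

# Delete the branch. The blob is still in the object store because the reflog references it.
git checkout -q main
git branch -D docs-output > /dev/null
if git cat-file -e "$blob" 2>/dev/null; then before=present; else before=pruned; fi

# Expire the reflog and garbage-collect; now the blob is actually removed.
git reflog expire --expire=now --all
git gc -q --prune=now
if git cat-file -e "$blob" 2>/dev/null; then after=present; else after=pruned; fi

echo "after branch deletion: $before, after reflog expire + gc: $after"
```

Fresh clones never receive unreachable objects, which is why the extra steps matter only when measuring an existing local clone.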

@navidcy
Collaborator

navidcy commented Nov 14, 2019

navid: $ git clone https://github.com/climate-machine/Oceananigans.jl.git
Cloning into 'Oceananigans.jl'...
remote: Enumerating objects: 191, done.
remote: Counting objects: 100% (191/191), done.
remote: Compressing objects: 100% (106/106), done.
remote: Total 17116 (delta 113), reused 136 (delta 85), pack-reused 16925
Receiving objects: 100% (17116/17116), 127.75 MiB | 9.56 MiB/s, done.
Resolving deltas: 100% (11195/11195), done.
navid: $ du -sh Oceananigans.jl
134M	Oceananigans.jl

@ali-ramadhan, are the 134MB due to .jld2 files again?

@c42f

c42f commented Nov 15, 2019

If you clone only the master branch, you will see that it is quite clean:

$ git clone https://github.com/climate-machine/Oceananigans.jl.git --single-branch
Cloning into 'Oceananigans.jl'...
remote: Enumerating objects: 19, done.
remote: Counting objects: 100% (19/19), done.
remote: Compressing objects: 100% (19/19), done.
remote: Total 12346 (delta 0), reused 19 (delta 0), pack-reused 12327
Receiving objects: 100% (12346/12346), 10.25 MiB | 1.02 MiB/s, done.
Resolving deltas: 100% (8872/8872), done.
$ du -sh Oceananigans.jl/.git
12M	Oceananigans.jl/.git

So only 12M of (compressed) files are downloaded.

On the other hand, cloning the documentation branch gh-pages downloads 117 M:

$ git clone https://github.com/climate-machine/Oceananigans.jl.git --single-branch --branch=gh-pages
Cloning into 'Oceananigans.jl'...
remote: Enumerating objects: 4040, done.
remote: Total 4040 (delta 0), reused 0 (delta 0), pack-reused 4040
Receiving objects: 100% (4040/4040), 117.40 MiB | 6.62 MiB/s, done.
Resolving deltas: 100% (1819/1819), done.

So this is nothing to worry too much about :-) A simple non-disruptive solution is to change your hosting solution for the docs to use something other than the main git repository, and to delete the gh-pages branch.
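
The same comparison can be approximated inside an existing clone, without downloading each branch separately, by summing the sizes of all objects reachable from a branch. This is a rough sketch: `branch_size` is a hypothetical helper, the demo repository is a stand-in, and the totals are uncompressed object sizes, so they will be larger than the packed download sizes reported by `git clone`.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical helper: total uncompressed bytes of all objects reachable from a ref.
branch_size() {
  git rev-list --objects "$1" |
    cut -d' ' -f1 |
    git cat-file --batch-check='%(objectsize)' |
    awk '{ total += $1 } END { print total + 0 }'
}

# Throwaway repository: a small main branch and an unrelated gh-pages branch with a big file.
demo=$(mktemp -d)
git init -q "$demo"
cd "$demo"
git symbolic-ref HEAD refs/heads/main
git config user.name demo
git config user.email demo@example.com
echo "source" > README.md
git add README.md
git commit -q -m "Initial commit"

git checkout -q --orphan gh-pages        # history detached from main
git rm -q -rf .
head -c 1000000 /dev/urandom > movie.mp4
git add movie.mp4
git commit -q -m "Deploy docs"
git checkout -q main

main_size=$(branch_size main)
pages_size=$(branch_size gh-pages)
echo "main:     $main_size bytes"
echo "gh-pages: $pages_size bytes"
```

Running `branch_size` over each remote branch gives a quick ranking of where the weight sits before deciding which branch to rewrite.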

@ali-ramadhan
Member Author

Thanks for finding this @navidcy!

Ah, this is annoying but this wasn't actually fixed... Apparently the commit to gh-pages is made during makedocs while the example file is deleted afterwards.

See: https://travis-ci.com/climate-machine/Oceananigans.jl/jobs/262276959#L755-L756

I wonder if it's just better to delete the output file as part of the example, i.e. do it in ocean_wind_mixing_and_convection.jl.

Thankfully the issue is isolated in the gh-pages branch. Thanks for the advice @c42f! Will see if I can rewrite the gh-pages branch which sounds like it should be minimally disruptive.

ali-ramadhan reopened this Dec 2, 2019
@c42f

c42f commented Dec 3, 2019

Yeah, rewriting the gh-pages branch should be good enough 👍 Generally I'd only rewrite a master branch as a tool of last resort because it can be quite disruptive :-)

@navidcy
Collaborator

navidcy commented Dec 31, 2019

This is still an issue and it's been escalating...

navid:/ $ git clone https://github.com/climate-machine/Oceananigans.jl.git
Cloning into 'Oceananigans.jl'...
remote: Enumerating objects: 453, done.
remote: Counting objects: 100% (453/453), done.
remote: Compressing objects: 100% (227/227), done.
remote: Total 20837 (delta 204), reused 318 (delta 120), pack-reused 20384
Receiving objects: 100% (20837/20837), 331.98 MiB | 2.05 MiB/s, done.
Resolving deltas: 100% (13268/13268), done.
navid:Research/ $ du -sh Oceananigans.jl 
343M	Oceananigans.jl

@ali-ramadhan
Member Author

Thanks for the heads up again @navidcy!

Indeed #558 didn't exactly solve the problem as it only deleted the JLD2 file after it was already pushed to gh-pages by makedocs.

I added a .gitignore to the gh-pages branch a while back so JLD2 and NetCDF files aren't pushed, but didn't purge the JLD2 files.
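
For illustration, a .gitignore along these lines keeps new data files out of a branch. The patterns are assumptions (`*.jld2` for JLD2 and `*.nc` for NetCDF), since the actual file on gh-pages is not shown in this thread, and the demo repository and file names are hypothetical.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Throwaway repository standing in for the gh-pages branch.
demo=$(mktemp -d)
git init -q "$demo"
cd "$demo"

# Assumed ignore patterns for JLD2 and NetCDF output.
printf '%s\n' '*.jld2' '*.nc' > .gitignore

mkdir -p generated
printf 'data' > generated/output.jld2
printf 'data' > generated/output.nc
printf '<html></html>' > generated/index.html

# Staging everything skips the ignored data files.
git add .
git ls-files
```

A .gitignore only stops untracked files from being added. Files that are already tracked, or already present in earlier commits, stay in the history until it is rewritten, which is why the purge was still needed.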

I just did it again and repo size is down to 53 MB now. As all the files were on gh-pages, no pull requests should be affected.

Just to be safe @glwagner @suyashbire1 @sandreza we should probably git clone a fresh copy of the repository.

Cloning just the master branch is ~20 MB uncompressed mostly because of regression test files (although they're all <1 MB each).

Cloning just the gh-pages branch is ~80 MB because each tutorial/example in the documentation has an embedded mp4. The branch will grow in size with time, but we can revisit the issue of what to do about documentation size in the future.

PS: Thanks again for the tips @c42f and for the branch size measuring commands!
