This repository has been archived by the owner on May 18, 2023. It is now read-only.

Large files #88

Closed
naharrison opened this issue Mar 22, 2018 · 20 comments

Comments

@naharrison
Member

Below is a list of all files larger than 5 MB. Most of them we probably don't want to keep, and getting rid of them should significantly reduce the size of the repository (currently about 1 GB). I've done some experimenting with the BFG Repo Cleaner (https://rtyley.github.io/bfg-repo-cleaner/) and it should be able to do exactly what we need. @heddle @zieglerv @baltzell @raffaelladevita Can you comment on how you'd like to proceed?

12M etc/data/magfield/clas12-fieldmap-torus.dat
8.3M etc/data/magfield/clas12-fieldmap-solenoid.dat
30M common-tools/cnuphys/coatjava/lib/clas/coat-libs-5.1-SNAPSHOT.jar
12M common-tools/cnuphys/coatjava/etc/data/magfield/clas12-fieldmap-torus.dat
8.3M common-tools/cnuphys/coatjava/etc/data/magfield/clas12-fieldmap-solenoid.dat
88M common-tools/cnuphys/magfieldC/data/clas12_torus_fieldmap_binary.dat
8.3M common-tools/cnuphys/magfieldC/data/solenoid-srr.dat
27M common-tools/cnuphys/swimmer/data/smallAsciiMap.txt
88M common-tools/cnuphys/swimmer/data/clas12_torus_fieldmap_binary.dat
8.3M common-tools/cnuphys/swimmer/data/solenoid-srr.dat
13M common-tools/cnuphys/swimmer/data/clas12_small_torus.dat
88M common-tools/cnuphys/ced/data/clas12-fieldmap-torus.dat
8.3M common-tools/cnuphys/ced/data/clas12-fieldmap-solenoid.dat

@baltzell
Collaborator

For the first two, I'd think your build-coatjava.sh script could/should download them for us (if they don't already exist locally in the proper directory). Looks like clara gets them from coatjava-XXX.tar.gz, so that will take care of itself, sort of.

I've no idea about the rest of them. Looks like lots of duplicates, and a jar!

@raffaelladevita
Collaborator

I also don't know about the other files.

Related to cleanup, in the external-dependencies folder there are two jars we could remove:
external-dependencies/KPP-Monitoring-1.0.jar
external-dependencies/KPP-Plots-1.0.jar
The first is the original mon12 and the second was used to make reconstruction plots. They are probably both compiled with some old version of coatjava and would not work now.

@naharrison
Member Author

Sorry to keep picking on @heddle , but most of these are in cnuphys. Dave, can we try to find a solution to these large files? At least the duplicate ones?

@heddle
Collaborator

heddle commented Mar 23, 2018 via email

@baltzell
Collaborator

Nathan, can you test Dave's accepted deletions on a fork with BFG and see what the resulting repo size is?

@naharrison
Member Author

Have a look at https://github.com/naharrison/clas12-offline-software. I made sure to start even with JeffersonLab:master and now it says the branch is 1175 commits ahead and 1174 commits behind. It's a little scary to change the repo history so drastically, but everything seems to be working normally. Here's a summary of the size improvement:
initial size: 850M
after git rm-ing some big files: 619M
after bfg --strip-blobs-bigger-than 5M: 424M
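
For reference, the sequence described above can be sketched roughly as follows. This is a dry-run sketch: each command is echoed rather than executed, since it needs bfg.jar (downloaded separately from the BFG site) and a clone of the repo, and the file shown for git rm is just one illustrative example of the agreed deletions. The reflog-expire/gc step is the standard follow-up the BFG documentation recommends.

```shell
# Dry-run sketch of the cleanup: echo each command instead of running it.
run() { echo "+ $*"; }

run git rm common-tools/cnuphys/ced/data/clas12-fieldmap-torus.dat
run git commit -m "Remove large field maps"
run java -jar bfg.jar --strip-blobs-bigger-than 5M clas12-offline-software.git
run git reflog expire --expire=now --all
run git gc --prune=now --aggressive
```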
Thoughts?

@drewkenjo
Collaborator

It's not as scary as it sounds: 1175 ahead and 1174 behind means it's effectively just one commit (the BFG commit) ahead. You get a similar message after modifying git history with the official git filter-branch tool; it's simply a consequence of rewriting history.
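
The arithmetic is easy to reproduce in a throwaway repo (a minimal sketch, not the real repo): rewrite the only commit, as a history rewrite does to every commit it touches, then add one new commit, and the branch reports two ahead, one behind relative to the old tip.

```shell
# Throwaway demo: a history rewrite plus one new commit reads as
# "N ahead, N-1 behind" relative to the old branch tip.
set -e
tmp=$(mktemp -d) && cd "$tmp"
git init -q demo && cd demo
git -c user.name=t -c user.email=t@t commit -q --allow-empty -m "original"
git branch before    # remember the pre-rewrite history
git -c user.name=t -c user.email=t@t commit -q --allow-empty --amend -m "rewritten"
git -c user.name=t -c user.email=t@t commit -q --allow-empty -m "extra commit"
git rev-list --left-right --count HEAD...before    # ahead/behind counts: 2 and 1
```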

First you manually remove 8 files using "git rm" - that's your last "manual" commit.
Then you run BFG; it modifies your last commit and shows 22 changed files:

  • 2 KPP files are below the 5M limit, so they are simply carried over into the modified commit as "git rm"'d:
    • external-dependencies/KPP-Monitoring-1.0.jar
    • external-dependencies/KPP-Plots-1.0.jar
  • 6 other files from your manual "git rm" command are removed from the history by BFG. The removal shows up as 6 changed files, each named after the original file with a ".REMOVED.git-id" suffix. They correspond to the files Dave agreed to remove from the cnuphys package:
    • common-tools/cnuphys/ced/data/clas12-fieldmap-solenoid.dat.REMOVED.git-id
    • common-tools/cnuphys/ced/data/clas12-fieldmap-torus.dat.REMOVED.git-id
    • common-tools/cnuphys/swimmer/data/clas12_small_torus.dat.REMOVED.git-id
    • common-tools/cnuphys/swimmer/data/clas12_torus_fieldmap_binary.dat.REMOVED.git-id
    • common-tools/cnuphys/swimmer/data/smallAsciiMap.txt.REMOVED.git-id
    • common-tools/cnuphys/swimmer/data/solenoid-srr.dat.REMOVED.git-id
  • The 14 remaining changed files are the 7 files above 5 MB that we haven't removed, plus their 7 history ids. BFG understands that we don't want them removed, but it cleans the history anyway, and to "save" them it adds them back in the latest commit, as if we had added these files then. This is an internal BFG mechanic discussed here: Files from protected commits loose their history, show up as if in last commit only rtyley/bfg-repo-cleaner#53. Basically BFG removes all files above 5 MB from the history, then commits the protected ones back at the end:
    • common-tools/cnuphys/coatjava/etc/data/magfield/clas12-fieldmap-solenoid.dat
    • common-tools/cnuphys/coatjava/etc/data/magfield/clas12-fieldmap-torus.dat
    • common-tools/cnuphys/coatjava/lib/clas/coat-libs-5.1-SNAPSHOT.jar
    • common-tools/cnuphys/magfieldC/data/clas12_torus_fieldmap_binary.dat
    • common-tools/cnuphys/magfieldC/data/solenoid-srr.dat
    • etc/data/magfield/clas12-fieldmap-solenoid.dat
    • etc/data/magfield/clas12-fieldmap-torus.dat

Again, these last 7 files are not deleted; they are kept, but their history is removed and they are re-added in the last commit. Hence 22 changed files.

@naharrison
Member Author

Thanks Andrey! I'll submit a pull request now.

@tylern4
Contributor

tylern4 commented Apr 2, 2018

It may be helpful to start using git lfs for storing these files so that this does not become an issue in the future.

@naharrison
Member Author

As pointed out by Sebastian, a good solution to this is to just use git rm (without BFG) and then use shallow clones when cloning. This can also speed up Travis. Looking at some recent Travis logs, it seems the default depth used is 50. Does anyone see any reason not to change the depth to 1?
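
If we go that route, the change would be a small fragment in .travis.yml (the `git: depth` key is standard Travis CI configuration; 1 is the value proposed above):

```yaml
# .travis.yml fragment: limit the clone depth (the Travis default is 50)
git:
  depth: 1
```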

@smancill
Contributor

Queued jobs may fail when the depth is 1 (actually, whenever the depth is not enough; the documentation is confusing).

You may experiment with a depth number that works, or just leave the default of 50 and keep pushing new changes. Eventually Travis should stop fetching commits that contain large files.

@heddle
Collaborator

heddle commented Apr 12, 2018 via email

@baltzell
Collaborator

Dave, how about just put it somewhere publicly accessible for now (e.g. www.jlab.org/~heddle), with a different filename than the existing fieldmap.

There's been talk about changing to automatically retrieving the maps from a directory next to (or inside?) our maven repo (and/or a separate github repo), instead of storing them in clas12-offline-software. Could be done in build-coatjava.sh, or probably a more proper maven way.

Regarding multiple map versions,

  • I see environment variables TORUSMAP and SOLENOIDMAP are set up to allow determining the filenames at run time (although they are duplicated in many scripts and should be centralized).

  • Dave, does cnuphys honor those env vars, or not? I see evidence of both.

  • But I don't see if/how those variables get set when we run inside CLARA? Doesn't appear to be the yaml files.
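
One way to centralize those defaults would be a single snippet that every run script sources (a sketch only; the variable names are the ones above, and the default filenames are the maps listed at the top of this issue):

```shell
# Hypothetical shared env-setup snippet, sourced by the run scripts instead of
# each duplicating the defaults. Values already set in the environment win.
: "${TORUSMAP:=clas12-fieldmap-torus.dat}"
: "${SOLENOIDMAP:=clas12-fieldmap-solenoid.dat}"
export TORUSMAP SOLENOIDMAP
echo "torus=$TORUSMAP solenoid=$SOLENOIDMAP"
```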

@heddle
Collaborator

heddle commented Apr 13, 2018 via email

@naharrison
Member Author

From what I've been reading, managing large binary files is a common problem without any particularly good solutions. (In addition to field maps we also have some big data files that are used for validation, currently kept here). Here are a few ideas:

  1. Keep these files in the repository, but don't save their histories. Unfortunately, I don't think this is possible (see here and here).

  2. Git LFS is theoretically exactly what we want, but it has very limited storage space and bandwidth, and also might complicate forks and pull requests (see here and here).

  3. Keep the binary files in a separate github repo; developers will then need to download both repositories. This is not the most elegant solution, but it seems pretty simple and foolproof.

  4. Keep the binary files in a maven repository and use the maven resources plugin to download them. In my opinion this is essentially the same as 3 but with fancier tools, plus the annoyance of maintaining the maven repo.

  5. I've come across various other solutions such as using git-submodules, git-annex, and something called orphan branches, but I think all of these would create unnecessary complications.

I like 3 the best due to its simplicity.

@baltzell
Collaborator

I've no preference between 3 and 4 (although changing fieldmaps should be rare, and a worthwhile maintenance headache), and agree that, in theory, fieldmaps probably don't belong in this repo. Whatever option is chosen needs to be automated, via the build script or maven.

But I also wonder, if we just judiciously prune all the unused fieldmaps, data files, and jars from this repo, whether that might not be good enough in the short term.

Note that we do need to use a new ~100 MB torus fieldmap from Dave.

Probably more important than repository size is software support for different fieldmaps at runtime, determined from environment variables and yaml. Currently it's determined at compile time, or by overwriting a fieldmap file with a particular but very generic name, and the env vars in coatjava/bin may or may not be honored. I think it's more important to sort this out than to slim the repo.

@naharrison
Member Author

Agreed on the priority, but aren't these issues pretty independent? Anyway, I think we can resolve the repo size today if we can agree on a strategy. I like 3 better than 4 because in my experience github is more reliable than keeping things on a lab filesystem where things sometimes get accidentally deleted, the disks are sometimes down for maintenance, you can have disk quota issues, you have to enter your password twice if you're offsite... Here's my proposal:

  • Create a new github repo - JeffersonLab/clas12-offline-resources which can have different branches, e.g. field-maps, validation-files, etc.
  • In build-coatjava.sh add the line wget http://github.com/JeffersonLab/clas12-offline-resources/field-maps.zip plus 2-3 more lines to unpack and move the files to where they need to go. Maybe also add a new flag to the build script to avoid re-downloading the maps once they're downloaded.

My one concern was that wget-ing files from github to a jlab machine can sometimes be extremely slow for some reason; but I just tested this idea and it's working fine.
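
A download-if-missing helper in build-coatjava.sh could look roughly like this (a sketch only: the repo URL and branch follow the proposal above, and the directory and helper name are illustrative, not existing code):

```shell
# Hypothetical helper for build-coatjava.sh: fetch a field map only when it is
# not already present, so repeated builds skip the download.
MAPDIR=etc/data/magfield

fetch_map() {
    # $1 = map filename; no-op if the file already exists locally
    if [ ! -f "$MAPDIR/$1" ]; then
        mkdir -p "$MAPDIR"
        wget -q -O "$MAPDIR/$1" \
            "https://github.com/JeffersonLab/clas12-offline-resources/raw/field-maps/$1"
    fi
}

# in the build:
#   fetch_map clas12-fieldmap-torus.dat
#   fetch_map clas12-fieldmap-solenoid.dat
```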

@zieglerv
Collaborator

zieglerv commented Apr 16, 2018 via email

@naharrison
Member Author

Great! I'll set it up and submit a pull request by the end of the day.

@baltzell
Collaborator

Some torus map details I accumulated from Dave (he may want to confirm):

The 88 MB map (currently in this repo) is the 12-fold symmetry, high-res version, that corresponds with what gemc currently uses.

There are at least two as-surveyed 136 MB maps around (neither is in this repo):

All three of these torus maps should go in the new repo with appropriate filenames.
