This repository has been archived by the owner on May 18, 2023. It is now read-only.

Large files #88

Closed
naharrison opened this issue Mar 22, 2018 · 20 comments

Comments

@naharrison
Member

Below is a list of all files larger than 5 MB. Most of them we probably don't want to keep, and getting rid of them should significantly reduce the size of the repository (currently about 1 GB). I've done some experimenting with the BFG Repo Cleaner (https://rtyley.github.io/bfg-repo-cleaner/) and it should be able to do exactly what we need. @heddle @zieglerv @baltzell @raffaelladevita Can you comment on how you'd like to proceed?

12M etc/data/magfield/clas12-fieldmap-torus.dat
8.3M etc/data/magfield/clas12-fieldmap-solenoid.dat
30M common-tools/cnuphys/coatjava/lib/clas/coat-libs-5.1-SNAPSHOT.jar
12M common-tools/cnuphys/coatjava/etc/data/magfield/clas12-fieldmap-torus.dat
8.3M common-tools/cnuphys/coatjava/etc/data/magfield/clas12-fieldmap-solenoid.dat
88M common-tools/cnuphys/magfieldC/data/clas12_torus_fieldmap_binary.dat
8.3M common-tools/cnuphys/magfieldC/data/solenoid-srr.dat
27M common-tools/cnuphys/swimmer/data/smallAsciiMap.txt
88M common-tools/cnuphys/swimmer/data/clas12_torus_fieldmap_binary.dat
8.3M common-tools/cnuphys/swimmer/data/solenoid-srr.dat
13M common-tools/cnuphys/swimmer/data/clas12_small_torus.dat
88M common-tools/cnuphys/ced/data/clas12-fieldmap-torus.dat
8.3M common-tools/cnuphys/ced/data/clas12-fieldmap-solenoid.dat

@baltzell
Collaborator

For the first two, I'd think your build-coatjava.sh script could/should download them for us (if they don't already exist locally in the proper directory). Looks like clara gets them from coatjava-XXX.tar.gz, so that will take care of itself, sort of.

I've no idea about the rest of them. Looks like lots of duplicates, and a jar!

@raffaelladevita
Collaborator

I also don't know about the other files.

Related to cleanup, in the external-dependencies folder there are two jars we could remove:
external-dependencies/KPP-Monitoring-1.0.jar
external-dependencies/KPP-Plots-1.0.jar
The first is the original mon12 and the second was used to make reconstruction plots. They are probably both compiled with some old version of coatjava and would not work now.

@naharrison
Member Author

Sorry to keep picking on @heddle , but most of these are in cnuphys. Dave, can we try to find a solution to these large files? At least the duplicate ones?

@heddle
Collaborator

heddle commented Mar 23, 2018 via email

@baltzell
Collaborator

Nathan, can you test Dave's accepted deletions on a fork with BFG and see what the resulting repo size is?

@naharrison
Member Author

Have a look at https://github.com/naharrison/clas12-offline-software. I made sure to start even with JeffersonLab:master and now it says the branch is 1175 commits ahead and 1174 commits behind. It's a little scary to change the repo history so drastically, but everything seems to be working normally. Here's a summary of the size improvement:
initial size: 850M
after git rm-ing some big files: 619M
after bfg --strip-blobs-bigger-than 5M: 424M
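
For reference, the sequence described above can be sketched roughly as follows. This is a dry-run sketch: each command is echoed rather than executed, since it needs bfg.jar (downloaded separately from the BFG site) and a clone of the repo, and the file shown for git rm is just one illustrative example of the agreed deletions. The reflog-expire/gc step is the standard follow-up the BFG documentation recommends.

```shell
# Dry-run sketch of the cleanup: echo each command instead of running it.
run() { echo "+ $*"; }

run git rm common-tools/cnuphys/ced/data/clas12-fieldmap-torus.dat
run git commit -m "Remove large field maps"
run java -jar bfg.jar --strip-blobs-bigger-than 5M clas12-offline-software.git
run git reflog expire --expire=now --all
run git gc --prune=now --aggressive
```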
Thoughts?

@drewkenjo
Collaborator

It's not as scary as it sounds: 1175 ahead and 1174 behind means it's effectively just one commit (the BFG commit) ahead. You get a similar message after modifying git history with the official git filter-branch tool; it's simply a consequence of rewriting history.
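
The arithmetic is easy to reproduce in a throwaway repo (a minimal sketch, not the real repo): rewrite the only commit, as a history rewrite does to every commit it touches, then add one new commit, and the branch reports two ahead, one behind relative to the old tip.

```shell
# Throwaway demo: a history rewrite plus one new commit reads as
# "N ahead, N-1 behind" relative to the old branch tip.
set -e
tmp=$(mktemp -d) && cd "$tmp"
git init -q demo && cd demo
git -c user.name=t -c user.email=t@t commit -q --allow-empty -m "original"
git branch before    # remember the pre-rewrite history
git -c user.name=t -c user.email=t@t commit -q --allow-empty --amend -m "rewritten"
git -c user.name=t -c user.email=t@t commit -q --allow-empty -m "extra commit"
git rev-list --left-right --count HEAD...before    # ahead/behind counts: 2 and 1
```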

First you manually remove 8 files using "git rm" - that's your last "manual" commit.
Then you run BFG; it modifies your last commit and shows 22 changed files:

  • 2 KPP files are below the 5M limit, so they are simply carried over into the modified commit as "git rm"'d:
    • external-dependencies/KPP-Monitoring-1.0.jar
    • external-dependencies/KPP-Plots-1.0.jar
  • 6 other files from your manual "git rm" command are removed from the history by BFG. The removal shows up as 6 changed files, each named after the original file with a ".REMOVED.git-id" suffix. They correspond to the files Dave agreed to remove from the cnuphys package:
    • common-tools/cnuphys/ced/data/clas12-fieldmap-solenoid.dat.REMOVED.git-id
    • common-tools/cnuphys/ced/data/clas12-fieldmap-torus.dat.REMOVED.git-id
    • common-tools/cnuphys/swimmer/data/clas12_small_torus.dat.REMOVED.git-id
    • common-tools/cnuphys/swimmer/data/clas12_torus_fieldmap_binary.dat.REMOVED.git-id
    • common-tools/cnuphys/swimmer/data/smallAsciiMap.txt.REMOVED.git-id
    • common-tools/cnuphys/swimmer/data/solenoid-srr.dat.REMOVED.git-id
  • The 14 remaining changed files are the 7 files above 5 MB that we haven't removed, plus their 7 history ids. BFG understands that we don't want them removed, but it cleans the history anyway, and to "save" them it adds them back in the latest commit, as if we had added these files then. This is an internal BFG mechanic discussed here: Files from protected commits loose their history, show up as if in last commit only rtyley/bfg-repo-cleaner#53. Basically BFG removes all files above 5 MB from the history, then commits the protected ones back at the end:
    • common-tools/cnuphys/coatjava/etc/data/magfield/clas12-fieldmap-solenoid.dat
    • common-tools/cnuphys/coatjava/etc/data/magfield/clas12-fieldmap-torus.dat
    • common-tools/cnuphys/coatjava/lib/clas/coat-libs-5.1-SNAPSHOT.jar
    • common-tools/cnuphys/magfieldC/data/clas12_torus_fieldmap_binary.dat
    • common-tools/cnuphys/magfieldC/data/solenoid-srr.dat
    • etc/data/magfield/clas12-fieldmap-solenoid.dat
    • etc/data/magfield/clas12-fieldmap-torus.dat

Again, these last 7 files are not deleted; they are kept, but their history is removed and they are re-added in the last commit. Hence 22 changed files.

@naharrison
Member Author

Thanks Andrey! I'll submit a pull request now.

@tylern4
Contributor

tylern4 commented Apr 2, 2018

It may be helpful to start using git lfs for storing these files so that this does not become an issue in the future.

@naharrison
Member Author

As pointed out by Sebastian, a good solution to this is to just use git rm (without BFG) and then use shallow clones when cloning. This can also speed up Travis. Looking at some recent Travis logs, it seems the default depth used is 50. Does anyone see any reason not to change the depth to 1?
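
If we go that route, the change would be a small fragment in .travis.yml (the `git: depth` key is standard Travis CI configuration; 1 is the value proposed above):

```yaml
# .travis.yml fragment: limit the clone depth (the Travis default is 50)
git:
  depth: 1
```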

@smancill
Contributor

Queued jobs may fail when the depth is 1 (actually, whenever the depth is not enough; the documentation is confusing).

You may experiment with a depth number that works, or just leave the default of 50 and keep pushing new changes. Eventually Travis should stop fetching commits that contain large files.

@heddle
Collaborator

heddle commented Apr 12, 2018 via email

@baltzell
Collaborator

Dave, how about just put it somewhere publicly accessible for now (e.g. www.jlab.org/~heddle), with a different filename than the existing fieldmap.

There's been talk about changing to automatically retrieving the maps from a directory next to (or inside?) our maven repo (and/or a separate github repo), instead of storing them in clas12-offline-software. Could be done in build-coatjava.sh, or probably a more proper maven way.

Regarding multiple map versions,

  • I see environment variables TORUSMAP and SOLENOIDMAP are set up to allow determining the filenames at run time (although they are duplicated in many scripts and should be centralized).

  • Dave, does cnuphys honor those env vars, or not? I see evidence of both.

  • But I don't see if/how those variables get set when we run inside CLARA? Doesn't appear to be the yaml files.
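
One way to centralize those defaults would be a single snippet that every run script sources (a sketch only; the variable names are the ones above, and the default filenames are the maps listed at the top of this issue):

```shell
# Hypothetical shared env-setup snippet, sourced by the run scripts instead of
# each duplicating the defaults. Values already set in the environment win.
: "${TORUSMAP:=clas12-fieldmap-torus.dat}"
: "${SOLENOIDMAP:=clas12-fieldmap-solenoid.dat}"
export TORUSMAP SOLENOIDMAP
echo "torus=$TORUSMAP solenoid=$SOLENOIDMAP"
```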

@heddle
Collaborator

heddle commented Apr 13, 2018 via email

@naharrison
Member Author

From what I've been reading, managing large binary files is a common problem without any particularly good solutions. (In addition to field maps we also have some big data files that are used for validation, currently kept here). Here are a few ideas:

  1. Keep these files in the repository, but don't save their histories. Unfortunately, I don't think this is possible (see here and here).

  2. Git LFS is theoretically exactly what we want, but it has very limited storage space and bandwidth, and also might complicate forks and pull requests (see here and here).

  3. Keep the binary files in a separate github repo; developers will then need to download both repositories. This is not the most elegant solution, but it seems pretty simple and foolproof.

  4. Keep the binary files in a maven repository and use the maven resources plugin to download them. In my opinion this is essentially the same as 3 but with fancier tools, plus the annoyance of maintaining the maven repo.

  5. I've come across various other solutions such as using git-submodules, git-annex, and something called orphan branches, but I think all of these would create unnecessary complications.

I like 3 the best due to its simplicity.

@baltzell
Collaborator

I've no preference between 3 and 4 (although changing fieldmaps should be rare, and a worthwhile maintenance headache), and agree that, in theory, fieldmaps probably don't belong in this repo. Whatever option is chosen needs to be automated, via the build script or maven.

But I also wonder, if we just judiciously prune all the unused fieldmaps, data files, and jars from this repo, whether that might not be good enough in the short term.

Note that we do need to use a new ~100 MB torus fieldmap from Dave.

Probably more important than repository size is software support for different fieldmaps at runtime, determined from environment variables and yaml. Currently it's determined at compile time, or by overwriting a fieldmap file with a particular but very generic name, and the env vars in coatjava/bin may or may not be honored. I think it's more important to sort this out than to slim the repo.

@naharrison
Member Author

Agreed on the priority, but aren't these issues pretty independent? Anyway, I think we can resolve the repo size today if we can agree on a strategy. I like 3 better than 4 because in my experience github is more reliable than keeping things on a lab filesystem where things sometimes get accidentally deleted, the disks are sometimes down for maintenance, you can have disk quota issues, you have to enter your password twice if you're offsite... Here's my proposal:

  • Create a new github repo - JeffersonLab/clas12-offline-resources which can have different branches, e.g. field-maps, validation-files, etc.
  • In build-coatjava.sh add the line wget http://github.com/JeffersonLab/clas12-offline-resources/field-maps.zip plus 2-3 more lines to unpack and move the files to where they need to go. Maybe also add a new flag to the build script to avoid re-downloading the maps once they're downloaded.

My one concern was that wget-ing files from github to a jlab machine can sometimes be extremely slow for some reason; but I just tested this idea and it's working fine.
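
A download-if-missing helper in build-coatjava.sh could look roughly like this (a sketch only: the repo URL and branch follow the proposal above, and the directory and helper name are illustrative, not existing code):

```shell
# Hypothetical helper for build-coatjava.sh: fetch a field map only when it is
# not already present, so repeated builds skip the download.
MAPDIR=etc/data/magfield

fetch_map() {
    # $1 = map filename; no-op if the file already exists locally
    if [ ! -f "$MAPDIR/$1" ]; then
        mkdir -p "$MAPDIR"
        wget -q -O "$MAPDIR/$1" \
            "https://github.com/JeffersonLab/clas12-offline-resources/raw/field-maps/$1"
    fi
}

# in the build:
#   fetch_map clas12-fieldmap-torus.dat
#   fetch_map clas12-fieldmap-solenoid.dat
```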

@zieglerv
Collaborator

zieglerv commented Apr 16, 2018 via email

@naharrison
Member Author

Great! I'll set it up and submit a pull request by the end of the day.

@baltzell
Collaborator

Some torus map details I accumulated from Dave (he may want to confirm):

The 88 MB map (currently in this repo) is the 12-fold symmetry, high-res version, that corresponds with what gemc currently uses.

There are at least two as-surveyed 136 MB maps around (neither is in this repo):

All three of these torus maps should go in the new repo with appropriate filenames.
