Large files #88
Comments
For the first two, I'd think your build-coatjava.sh script could/should download them for us (if they don't already exist locally in the proper directory). Looks like clara gets them from coatjava-XXX.tar.gz, so that will take care of itself, sort of. I've no idea about the rest of them. Looks like lots of duplicates, and a jar! |
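A minimal sketch of what that existence check in build-coatjava.sh could look like; the directory, filename, and download URL below are placeholders, not the project's actual ones:

```bash
# Hypothetical fragment for build-coatjava.sh: fetch a field map only if it
# is not already present in the expected local directory.
MAPDIR="etc/data/magfield"                        # assumed destination
MAPFILE="clas12-fieldmap-torus.dat"               # assumed filename
MAPURL="https://example.org/fieldmaps/${MAPFILE}" # placeholder URL

if [ ! -f "${MAPDIR}/${MAPFILE}" ]; then
    mkdir -p "${MAPDIR}"
    wget -O "${MAPDIR}/${MAPFILE}" "${MAPURL}"
fi
```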
I also don't know about the other files. Related to cleanup, in the external dependencies folder there are two jars we could remove: |
Sorry to keep picking on @heddle , but most of these are in cnuphys. Dave, can we try to find a solution to these large files? At least the duplicate ones? |
Yes, as far as cnuphys is concerned, delete
27M common-tools/cnuphys/swimmer/data/smallAsciiMap.txt
88M common-tools/cnuphys/swimmer/data/clas12_torus_fieldmap_binary.dat
8.3M common-tools/cnuphys/swimmer/data/solenoid-srr.dat
13M common-tools/cnuphys/swimmer/data/clas12_small_torus.dat
88M common-tools/cnuphys/ced/data/clas12-fieldmap-torus.dat
8.3M common-tools/cnuphys/ced/data/clas12-fieldmap-solenoid.dat
but preserve
88M common-tools/cnuphys/magfieldC/data/clas12_torus_fieldmap_binary.dat
8.3M common-tools/cnuphys/magfieldC/data/solenoid-srr.dat
if that is acceptable.
I have seen other solutions to this sort of problem, where the large data files are in a separate repo. But if just cleaning up solves the problem I would prefer that route.
And stop picking on me! I'm quite sensitive!
|
Nathan, can you test Dave's accepted deletions on a fork with BFG and see what the resulting repo size is? |
Have a look at https://github.com/naharrison/clas12-offline-software. I made sure to start even with JeffersonLab:master and now it says the branch is 1175 commits ahead and 1174 commits behind. It's a little scary to change the repo history so drastically, but everything seems to be working normally. Here's a summary of the size improvement: |
It's not as scary as it sounds: 1175 ahead and 1174 behind means it is effectively just one commit (the BFG commit) ahead. After you modify git history with the official git filter-branch tool you get a similar message; that's a consequence of meddling with git history, apparently. First you manually remove 8 files using `git rm`; that's your last "manual" commit.
Again, these last 7 files are not deleted: they are preserved, but their history is removed and they are newly added in the last commit. So we have 22 changed files. |
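For reference, this is roughly the kind of BFG run being described; the size threshold and cleanup commands are an illustrative sketch, not necessarily the exact commands used on the fork:

```bash
# Work on a fresh mirror clone so the original repository is untouched.
git clone --mirror https://github.com/JeffersonLab/clas12-offline-software.git

# Remove all blobs larger than 5 MB from history (illustrative threshold);
# by default BFG leaves files in the current HEAD commit alone.
java -jar bfg.jar --strip-blobs-bigger-than 5M clas12-offline-software.git

# Expire old references and repack so the space is actually reclaimed.
cd clas12-offline-software.git
git reflog expire --expire=now --all
git gc --prune=now --aggressive
```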
Thanks Andrey! I'll submit a pull request now. |
It may be helpful to start using git lfs for storing these files so that this does not become an issue in the future. |
As pointed out by Sebastian, a good solution to this is to just use |
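Regarding the git lfs suggestion above, a minimal sketch of how tracking could be set up; the `*.dat` pattern is an assumption about which files we would track:

```bash
# One-time setup per clone.
git lfs install

# Track the large binary map files via LFS (pattern is an assumption).
git lfs track "*.dat"

# The tracking rules are stored in .gitattributes, which must be committed.
git add .gitattributes
git commit -m "Track field map binaries with Git LFS"
```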
Queued jobs may fail when the depth is 1 (actually, when the depth is not enough; the documentation is confusing, see travis-ci/travis-ci#8321). You may experiment with a depth number that works, or just leave the default of 50 and keep pushing new changes. Eventually Travis should stop fetching commits that contain large files. |
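Under the hood this amounts to Travis doing a shallow clone; a rough sketch of the equivalent command (the depth comes from the `git: depth:` setting in .travis.yml, 50 by default):

```bash
# Roughly what Travis runs for each build: a shallow clone that fetches only
# the most recent commits, so old large blobs are never downloaded.
git clone --depth=50 https://github.com/JeffersonLab/clas12-offline-software.git
```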
So I have been asked to create a new binary field map. Am I supposed to
check that in?
|
Dave, how about just putting it somewhere publicly accessible for now (e.g. www.jlab.org/~heddle), with a different filename than the existing fieldmap. There's been talk about changing to automatically retrieving the maps from a directory next to (or inside?) our maven repo (and/or a separate github repo), instead of storing them in clas12-offline-software. Could be done in build-coatjava.sh, or probably a more proper maven way. Regarding multiple map versions:
- I see environment variables TORUSMAP and SOLENOIDMAP are set up to allow determining the filenames at run time (although they are duplicated in many scripts and should be centralized).
- Dave, does cnuphys honor those env vars, or not? I see evidence of both.
- But I don't see if/how those variables get set when we run inside CLARA? It doesn't appear to be the yaml files.
|
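A sketch of how those defaults might be centralized, e.g. a single file sourced by the wrapper scripts in coatjava/bin; the file name and default map names here are hypothetical:

```bash
# Hypothetical coatjava/bin/env.sh, sourced by every wrapper script so the
# map filenames are defined in exactly one place.
: "${TORUSMAP:=clas12-fieldmap-torus.dat}"        # default used only if unset
: "${SOLENOIDMAP:=clas12-fieldmap-solenoid.dat}"  # default used only if unset
export TORUSMAP SOLENOIDMAP
```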
It should honor those env vars, I'll double check. There are programmatic ways to specify the paths too. I'll write something up.
|
From what I've been reading, managing large binary files is a common problem without any particularly good solutions. (In addition to field maps we also have some big data files that are used for validation, currently kept here). Here are a few ideas:
I like 3 the best due to its simplicity. |
I've no preference between 3 and 4 (although changing fieldmaps should be rare and a worthwhile maintenance headache), and agree that, in theory, fieldmaps probably don't belong in this repo. Whatever option is chosen needs to be automated, via the build script or maven. But I also wonder, if we just judiciously prune all the unused fieldmaps, data files, and jars from this repo, whether that might not be good enough in the short term. Note that we do need to use a new ~100 MB torus fieldmap from Dave. Probably more important than repository size is software support for different fieldmaps at runtime, determined from environment variables and yaml. Currently it's determined at compile time, or by overwriting a fieldmap file with a particular but very generic name, and the env vars in coatjava/bin may or may not be honored. I think it's more important to sort this out than to slim the repo. |
Agreed on the priority, but aren't these issues pretty independent? Anyway, I think we can resolve the repo size today if we can agree on a strategy. I like 3 better than 4 because in my experience github is more reliable than keeping things on a lab filesystem, where things sometimes get accidentally deleted, the disks are sometimes down for maintenance, you can have disk quota issues, and you have to enter your password twice if you're offsite. Here's my proposal:
* Create a new github repo, JeffersonLab/clas12-offline-resources, which can have different branches, e.g. field-maps, validation-files, etc.
* In build-coatjava.sh add the line `wget http://github.com/JeffersonLab/clas12-offline-resources/field-maps.zip` plus 2-3 more lines to unpack and move the files to where they need to go. Maybe also add a new flag to the build script to avoid re-downloading the maps once they're downloaded.
My one concern was that wget-ing files from github to a jlab machine can sometimes be extremely slow for some reason, but I just tested this idea and it's working fine. |
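A sketch of the couple of lines being proposed for build-coatjava.sh; the archive layout, destination directory, and the file used to decide whether to skip the download are assumptions, not settled choices:

```bash
# Hypothetical addition to build-coatjava.sh: download and unpack the maps
# from the proposed resources repo, skipping the download if they are
# already in place.
if [ ! -f etc/data/magfield/clas12-fieldmap-torus.dat ]; then
    wget -O field-maps.zip \
        https://github.com/JeffersonLab/clas12-offline-resources/archive/field-maps.zip
    unzip field-maps.zip
    mkdir -p etc/data/magfield
    mv clas12-offline-resources-field-maps/* etc/data/magfield/
    rm -rf field-maps.zip clas12-offline-resources-field-maps
fi
```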
I agree that keeping files on github may be the safest.
I'd say since this strategy seems to be working as Nathan pointed out, let's go with it.
|
Great! I'll set it up and submit a pull request by the end of the day. |
Some torus map details I accumulated from Dave (he may want to confirm): the 88 MB map (currently in this repo) is the 12-fold symmetry, high-res version, which corresponds to what gemc currently uses. There are at least two as-surveyed 136 MB maps around (neither is in this repo):
All three of these torus maps should go in the new repo with appropriate filenames. |
Below is a list of all files larger than 5 MB. Most of them we probably don't want to keep, and getting rid of them should significantly reduce the size of the repository (currently about 1 GB). I've done some experimenting with the BFG Repo Cleaner (https://rtyley.github.io/bfg-repo-cleaner/) and it should be able to do exactly what we need. @heddle @zieglerv @baltzell @raffaelladevita Can you comment on how you'd like to proceed?
12M etc/data/magfield/clas12-fieldmap-torus.dat
8.3M etc/data/magfield/clas12-fieldmap-solenoid.dat
30M common-tools/cnuphys/coatjava/lib/clas/coat-libs-5.1-SNAPSHOT.jar
12M common-tools/cnuphys/coatjava/etc/data/magfield/clas12-fieldmap-torus.dat
8.3M common-tools/cnuphys/coatjava/etc/data/magfield/clas12-fieldmap-solenoid.dat
88M common-tools/cnuphys/magfieldC/data/clas12_torus_fieldmap_binary.dat
8.3M common-tools/cnuphys/magfieldC/data/solenoid-srr.dat
27M common-tools/cnuphys/swimmer/data/smallAsciiMap.txt
88M common-tools/cnuphys/swimmer/data/clas12_torus_fieldmap_binary.dat
8.3M common-tools/cnuphys/swimmer/data/solenoid-srr.dat
13M common-tools/cnuphys/swimmer/data/clas12_small_torus.dat
88M common-tools/cnuphys/ced/data/clas12-fieldmap-torus.dat
8.3M common-tools/cnuphys/ced/data/clas12-fieldmap-solenoid.dat