Refactor r.abin and r.terabin #69
Codecov Report
@@ Coverage Diff @@
## master #69 +/- ##
==========================================
+ Coverage 68.97% 68.98% +0.01%
==========================================
Files 38 38
Lines 5689 5688 -1
==========================================
Hits 3924 3924
+ Misses 1765 1764 -1
c21820c to 125f88d
@suchanj Can you take a look? There aren't many functional changes, but I moved a lot of code into separate functions for better readability, so it's probably easier to read the scripts line by line than to look at the diff. Regarding
# Example SGE params for PHOTOX clusters
#$ -V -cwd
##$ -q aq -pe shm 3
#$ -V -cwd -notify
-notify is needed so that the qdel command sends the SIGUSR2 signal, which we can trap; see below.
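To illustrate the mechanism, here is a minimal sketch of how a bash script reacts to SIGUSR2. As far as the SGE `qsub` documentation describes it, `-notify` makes the scheduler send SIGUSR2 shortly before the final SIGKILL; in this sketch the script simply signals itself to stand in for the scheduler. The handler name `on_usr2` and the variable `got_signal` are hypothetical and are not taken from r.abin or r.terabin.

```shell
#!/bin/bash
# Hypothetical handler, used only for this demonstration.
got_signal=no
on_usr2() {
  got_signal=yes
}

# In bash, USR2 and SIGUSR2 name the same signal.
trap on_usr2 USR2

# Stand-in for the scheduler's warning: signal this very script.
kill -USR2 $$

# The handler has run by the time the next command starts.
echo "got_signal=$got_signal"
```

Without a trap, the default action for SIGUSR2 is to terminate the process, so the script would die with no chance to clean up.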
# TeraChem SETUP
TC_INPUT=tera.inp
TC_IN=tc.inp
Hmm, I can change it back. I changed it because I accidentally deleted the r.terabin script that I'd been working on in a separate folder. It went something like this:
# Need to test TeraChem separately
launchTERA tera.inp nq-gpu
# Okay, that worked, let's clean up the auxiliary files
rm r.tera*
# Dammit!
I see, let's keep the change. It might help distinguish ABIN+TC inputs from TC-only inputs. And I don't expect that many calls for help, since a lot of people will still use 'locate r.terabin' instead of the repository folder ...
copy_to_scrdir

cd $SCRDIR
trap copy_from_scrdir EXIT SIGUSR2
By trapping SIGUSR2, we can ensure that the data are copied from scratch back to the launching directory when the user issues the qdel command (and that the scratch dir is removed). But it is open to discussion whether we actually want this behavior, or whether it is potentially confusing. What is the typical user expectation when they qdel a job?
CC @suchanj
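For reference, a minimal sketch of the pattern under discussion, not the actual code in this PR. It assumes a `copy_from_scrdir` that copies everything back and then removes the scratch dir. The `LAUNCH_DIR` variable, the temporary directories, the file name and the `copied` guard are all hypothetical. The guard is there because the same function is registered for both USR2 and EXIT, so it can be called twice.

```shell
#!/bin/bash
# Hypothetical directories, created only for this demonstration.
LAUNCH_DIR=$(mktemp -d)
SCRDIR=$(mktemp -d)
echo "partial results" > "$SCRDIR/output.dat"

copied=no
copy_from_scrdir() {
  # Run the body only once, even if both USR2 and EXIT fire.
  if [ "$copied" = yes ]; then
    return 0
  fi
  copied=yes
  # Copy the whole scratch content back, then remove the scratch dir.
  cp -pr "$SCRDIR"/. "$LAUNCH_DIR"/
  rm -rf "$SCRDIR"
}
trap copy_from_scrdir EXIT USR2

# Stand-in for qdel with -notify: signal this very script.
kill -USR2 $$

ls "$LAUNCH_DIR"
```

When the script later exits, the EXIT trap calls the function a second time and the guard turns that call into a no-op.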
@danielhollas I think this is an interesting idea. 1) Removing the scratch dir will alleviate our problems with years of undeleted data on our scratch systems. However, this could (and should) be solved by our cluster administrator. Based on discussion with others, we agreed on this point. 2) Usually, when someone issues the qdel command, they deem the job and the generated data useless, but we disagreed on whether to copy the data back. If we do so, no information is lost, but there is a concern about huge unwanted files being copied back to home and overflowing the quota. We might copy back only the output, movie and restart files, but this might not prove useful when combined with TC (I think there are some extra files needed for a proper restart?). (This reminds me of another problem: SGE CUDA counters breaking down when "qdeling" ABIN+TC jobs.) I cannot settle on a solution. qdel is not Exitabin.sh, but in instances of very long ab initio steps we might want it to behave like that.
Thanks @suchanj. I would be strongly against any complicated solution (e.g. copying only some files); either we copy everything or nothing. Also, if we don't copy, we probably can't delete the scratch dir, to prevent accidental data loss. So I am leaning towards always copying the data and cleaning the scratch (note that e.g. in LAUNCH/G09 we're already doing that). You can then always remove any unwanted data in your home dir, right? I would also guess that the case where you completely throw out some big simulation without any inspection should be fairly rare, hopefully?
I agree that Exitabin.sh is much cleaner and is the preferred method if people are aware of it, regardless of what we decide here.
One thing should be noted (and this is not new behavior): only newer files are copied back (i.e. those with timestamps newer than what is on the master node). This is the -u option to cp or rsync. This is needed so that if you e.g. modify input.in locally, and then issue qdel or Exitabin.sh, it won't get overwritten.
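A small sketch of the `-u` semantics, using GNU `cp -u` on throwaway directories. The directory variables and file contents are made up for the demonstration; movie.xyz is just an illustrative file name.

```shell
#!/bin/bash
MASTER=$(mktemp -d)   # stands in for the launch directory
SCRATCH=$(mktemp -d)  # stands in for the scratch directory

# input.in was edited locally, so the scratch copy is the older one.
echo "old input" > "$SCRATCH/input.in"
echo "edited input" > "$MASTER/input.in"
touch -t 202001010000 "$SCRATCH/input.in"
touch -t 202101010000 "$MASTER/input.in"

# movie.xyz was updated by the job, so the scratch copy is the newer one.
echo "new frames" > "$SCRATCH/movie.xyz"
echo "old frames" > "$MASTER/movie.xyz"
touch -t 202101010000 "$SCRATCH/movie.xyz"
touch -t 202001010000 "$MASTER/movie.xyz"

# -u copies a file only if the destination is missing or older.
# With rsync, the equivalent would be: rsync -au "$SCRATCH"/ "$MASTER"/
cp -u "$SCRATCH"/* "$MASTER"/

cat "$MASTER/input.in"
cat "$MASTER/movie.xyz"
```

The locally edited input.in survives, while movie.xyz is replaced by the newer scratch copy.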
OK, I agree this change is reasonable and simple; let's go with it.
- move code into BASH functions for better readability
- automatically exit on error for better debugging/robustness
- use rsync instead of cp for safer and faster sync from scrdir
- r.terabin: By default do not use hydra_nameserver
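As an illustration of the "exit on error" item, here is a sketch that compares a script with and without bash's `set -euo pipefail`. Whether the PR uses exactly these options is an assumption. The function name `failing_step` is hypothetical, and each `bash -c` child plays the role of a small job script.

```shell
#!/bin/bash
# Without strict mode, the failing step is ignored and the script carries on.
lenient=$(bash -c 'failing_step() { false; }; failing_step; echo "continued after failure"')

# With strict mode, the script stops at the first failing command,
# including one that fails inside a function.
strict=$(bash -c 'set -euo pipefail; failing_step() { false; }; failing_step; echo "continued after failure"' || true)

echo "lenient: '$lenient'"
echo "strict:  '$strict'"
```

In a job script this means a failed copy to the scratch dir stops the job immediately, instead of launching the calculation in the wrong place.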