Skip to content

Refactor r.abin and r.terabin#69

Merged
danielhollas merged 1 commit intomasterfrom
terabin
Jun 13, 2021
Merged

Refactor r.abin and r.terabin#69
danielhollas merged 1 commit intomasterfrom
terabin

Conversation

@danielhollas
Copy link
Copy Markdown
Contributor

@danielhollas danielhollas commented May 30, 2021

  • move code into BASH functions for better readability

  • automatically exit on error for better debugging/robustnesss

  • use rsync instead of cp for safer and faster sync from scrdir

  • r.terabin: By default do not use hydra_nameserver

@danielhollas danielhollas self-assigned this May 30, 2021
@codecov
Copy link
Copy Markdown

codecov Bot commented May 30, 2021

Codecov Report

Merging #69 (2226181) into master (b5f4933) will increase coverage by 0.01%.
The diff coverage is 100.00%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master      #69      +/-   ##
==========================================
+ Coverage   68.97%   68.98%   +0.01%     
==========================================
  Files          38       38              
  Lines        5689     5688       -1     
==========================================
  Hits         3924     3924              
+ Misses       1765     1764       -1     
Impacted Files Coverage Δ
src/tera_mpi_api.F90 98.30% <100.00%> (ø)
src/potentials.F90 18.34% <0.00%> (+0.08%) ⬆️

@danielhollas danielhollas force-pushed the terabin branch 3 times, most recently from c21820c to 125f88d Compare June 1, 2021 21:08
@danielhollas danielhollas changed the title WIP: Refactor r.abin and r.terabin Refactor r.abin and r.terabin Jun 2, 2021
@danielhollas danielhollas marked this pull request as ready for review June 2, 2021 18:52
@danielhollas danielhollas requested a review from suchanj June 2, 2021 18:52
@danielhollas
Copy link
Copy Markdown
Contributor Author

@suchanj Can you take a look? There's not many functional changes, but I moved a lot of code around into separate functions for better readability so it's probably easier to look at the scripts line by line instead of looking at the diff.

Regarding r.terabin on PHOTOX clusters, I figured out that it works if I recompile mpich-3.1.3 with GCC-7. Sadly, if I recompile higher versions of mpich, I'm getting that weird SGE error that we discussed in the previous PR. I have no idea why, but at least now we know what works. I will install the new version into /usr/local/programs/custom/ before we release the new ABIN version on the clusters.

Comment thread utils/r.abin Outdated
# Example SGE params for PHOTOX clusters
#$ -V -cwd
##$ -q aq -pe shm 3
#$ -V -cwd -notify
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

-notify is needed so that qdel command sends the SIGUSR2 signal that we can trap, see below.

Comment thread utils/r.terabin

# TeraChem SETUP
TC_INPUT=tera.inp
TC_IN=tc.inp
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ಠ_ಠ

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, I can change it back. I changed it because I accidentally deleted the r.terabin script that I 've been working in a separate folder. It went something like this:

# Need to test TeraChem separately
launchTERA tera.inp nq-gpu
# Okay, that worked, let's clean up the auxiliary files
rm r.tera*
# Dammit!

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see, lets keep the change. It might distinguish ABIN+TC/TC inputs. And I don't expect that many calls for help since a lot of people will still use 'locate r.terabin' instead of repository folder ...

Comment thread utils/r.abin
copy_to_scrdir

cd $SCRDIR
trap copy_from_scrdir EXIT SIGUSR2
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

By trapping the SIGUSR2, we can ensure that the data are copied from scratch back to the launching directory when the user issues the qdel command (and also ensures that we remove the scratch dir). But it is up to a discussion whether we actually want this behavior, or whether it is potentially confusing? What is the typical user expectation when they qdel the job?

CC @suchanj

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@danielhollas I think this is an interesting idea. 1) Removing the scratch dir will alleviate our problems with years of undeleted data on our scratch systems. However this could (and should) be solved by our cluster administrator. Based on discussion with others, we agreed on this point. 2) Usually when one issues the qdel command, he deems the job and generated data useless, but we disagreed whether to copy the data back. If we do so, no information is lost, but there is a concern of huge unwated files copying back to home and overflowing the quota. We might copy back only the output, movie and restart files, but this might not prove useful when combined with TC (I think there are some extra files needed for proper restart?). (This reminds me another problem of SGE CUDA counters breaking down when "qdeling" ABIN+TC jobs.) I cannot conclude the solution. qdel is not Exitabin.sh, but in instances of very long ab initio steps we might want it to behave like that.

Copy link
Copy Markdown
Contributor Author

@danielhollas danielhollas Jun 6, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @suchanj. I would be strongly against any complicated solutions, (e.g.copying only some files), either we copy everything or nothing. Also, if we don't copy, we probably can't delete the scratch dir to prevent accidental data loss. So I am leaning towards always copying the data and cleaning the scratch (note that e.g. in LAUNCH/G09 we're already doing that). You can then always remove any unwanted data in your home dir, right? I would also guess that the case where you completely throw out some big simulation without any inspection should be fairly rare, hopefully?

I agree that Exitabin.sh is much cleaner and is the preferred method if people are aware of it, regardless of what we decide here.

One thing should be noted (and this is not a new behavior), only newer files are copied back (i.e. those with timestamps newer than what is on the master node). This is the -u option to cp or rsync. This is needed so that if you e.g. modify input.in locally, and then issue qdel or Exitabin.sh, it won't get overwritten.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, I agree this change is reasonable, it is simple, lets go with it.

- move code into BASH functions for better readability

- automatically exit on error for better debugging/robustnesss

- use rsync instead of cp for safer and faster sync from scrdir

- r.terabing: By default do not use hydra_nameserver
@danielhollas danielhollas merged commit 89ef40d into master Jun 13, 2021
@danielhollas danielhollas deleted the terabin branch June 13, 2021 19:33
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants