improve sync_output_to_gdata.sh and sync_restarts_to_gdata.sh safety #59

Open
aekiss opened this issue Nov 29, 2017 · 6 comments

@aekiss
Contributor

aekiss commented Nov 29, 2017

At present sync_output_to_gdata.sh and sync_restarts_to_gdata.sh rely on the user to remember to set GDATADIR in sync_output_to_gdata.sh to a path that does not clash with anything already existing. This is dangerous: I typically start a new experiment from a previous one, and if I forget to update GDATADIR the previous experiment's output will (I think!) be overwritten.

I can't immediately think of a safer / more foolproof way to do it.

  • Is there a way to have rsync abort before writing anything if any of the source files exist in the destination?
  • or make cunning use of rsync options like --dry-run, --itemize-changes, --ignore-existing, etc
  • or should we use the current branch name to automatically set GDATADIR and assume that the user knows to checkout a new branch for each experiment (and choose a branch name that is useable as a dir name)? - sounds even worse...
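A minimal sketch of the first bullet, assuming a plain-shell pre-check is acceptable in place of a single rsync flag. `find_clashes` and `safe_sync` are hypothetical helper names, not part of the existing scripts, and the rsync flags in `safe_sync` are an assumption about what the scripts would use.

```shell
#!/bin/bash
# Sketch only: find_clashes and safe_sync are hypothetical helpers.

# Print every regular file under $1 whose relative path already exists under $2.
find_clashes() {
    local src=$1 dst=$2
    (cd "$src" && find . -type f) | while IFS= read -r rel; do
        if [ -e "$dst/$rel" ]; then
            printf '%s\n' "${rel#./}"
        fi
    done
}

# Refuse to copy anything if even one source file is already in the destination.
safe_sync() {
    local src=$1 dst=$2 clashes
    clashes=$(find_clashes "$src" "$dst")
    if [ -n "$clashes" ]; then
        echo "Refusing to sync: these files already exist in $dst:" >&2
        printf '%s\n' "$clashes" >&2
        return 1
    fi
    # Assumed flags; only reached when the destination has no clashing files.
    rsync -a --ignore-existing "$src/" "$dst/"
}
```

Something like `safe_sync archive "$GDATADIR"` would then abort before writing anything. Note that this strict check would also refuse a legitimate repeat sync of the same experiment, so on its own it only suits the first sync to a fresh GDATADIR.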
@aekiss
Contributor Author

aekiss commented Nov 29, 2017

Also, sync_restarts_to_gdata.sh currently only copies restarts ending in 0 or 5 to avoid clutter (like payu sweep). This needs to be tweaked so that it also always copies the most recent restart.
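One possible shape for that tweak, as a sketch only. It assumes restart directories are zero-padded and named like `restart000`, `restart001`, ... so that sorted glob order matches run order; `select_restarts` is a hypothetical helper name.

```shell
#!/bin/bash
# Sketch only: select_restarts is a hypothetical helper.
# Lists the restart directories under $1 that should be copied:
# every one whose name ends in 0 or 5, plus the most recent one.
select_restarts() {
    local archive=$1 dir name latest=""
    for dir in "$archive"/restart[0-9]*; do
        [ -d "$dir" ] || continue   # also skips the unexpanded pattern when nothing matches
        name=${dir##*/}
        latest=$name                # glob results are sorted, so the last one seen is the newest
        case $name in
            *[05]) printf '%s\n' "$name" ;;
        esac
    done
    # Add the newest restart unless the 0/5 rule already selected it.
    case $latest in
        ""|*[05]) ;;
        *) printf '%s\n' "$latest" ;;
    esac
}
```

The resulting list could then be fed to rsync one directory at a time, or turned into include rules.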

@russfiedler

I presume you're moving things to a directory with a similar name to the current one, so something like
GDATADIR=/g/data3/hh5/tmp/cosima/$(basename "$PWD")

Though it's probably better to send some arguments to the script in order to do some simple sanity checking.
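As a rough illustration of that kind of sanity check, the sketch below takes the destination as an argument and rejects it unless its last path component matches the experiment directory name. `check_gdatadir` is a hypothetical helper, and the matching rule is just one assumption about what "sane" could mean here.

```shell
#!/bin/bash
# Sketch only: check_gdatadir is a hypothetical helper.
# $1 = destination passed on the command line
# $2 = experiment directory (defaults to the current directory)
check_gdatadir() {
    local gdatadir=$1 expdir=${2:-$PWD}
    if [ -z "$gdatadir" ]; then
        echo "No destination given" >&2
        return 2
    fi
    if [ "$(basename "$gdatadir")" != "$(basename "$expdir")" ]; then
        echo "Destination $gdatadir does not match experiment $(basename "$expdir")" >&2
        return 1
    fi
}
```

A sync script could call `check_gdatadir "$1" || exit 1` before touching anything, so forgetting to update the destination after copying an experiment becomes an error rather than a silent clash.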

@aidanheerdegen
Contributor

You're right that this isn't a great solution, more of a hack that has propagated. I would say --ignore-existing is the very minimum, but if you have a transfer that stuffed up (timed out) I think this option would prevent a half-transferred file from being completed. Happy to be corrected on that.

I don't know that there is a great way around this. I am on record (with Marshall at least, if he remembers) as wanting to uniquely name outputs with git runlog hashes, e.g.
output052.dfc037e
so it still lists ok, but gives some uniqueness to the names. He wasn't a fan of this idea IIRC
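For illustration only, the naming scheme could look something like the sketch below. `hashed_name` is a hypothetical helper, and where the hash comes from (here assumed to be the commit of the run log repository) is an assumption, not something payu does.

```shell
#!/bin/bash
# Sketch only: hashed_name is a hypothetical helper.
# Appends the first 7 characters of a commit hash to an output directory name.
hashed_name() {
    local name=$1 hash=$2
    printf '%s.%s\n' "$name" "${hash:0:7}"
}

# Possible usage inside the experiment's git repository (not run here):
#   hashed_name output052 "$(git rev-parse HEAD)"
```

Because the hash comes after the run number, the directories would still sort in run order.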

@aidanheerdegen
Contributor

As for the restart issue, some of this could be fixed by redoing some of the payu archiving routines which have sort of stagnated but could be revived.

aekiss added a commit to COSIMA/01deg_jra55_ryf that referenced this issue Dec 5, 2017
aekiss added a commit to COSIMA/1deg_jra55_ryf that referenced this issue Dec 5, 2017
aekiss added a commit to COSIMA/1deg_core_nyf that referenced this issue Dec 5, 2017
aekiss added a commit to COSIMA/025deg_jra55_ryf that referenced this issue Dec 5, 2017
@aekiss
Contributor Author

aekiss commented Dec 5, 2017

The commits above have added --ignore-existing to sync_output_to_gdata.sh and sync_restarts_to_gdata.sh, plus some extra-shouty warnings to users, as a stopgap until we think of something better.

@aekiss
Contributor Author

aekiss commented Dec 5, 2017

The trouble with --ignore-existing is that any file whose name already exists in the destination is silently skipped, so new outputs and restarts will not actually be synced if a user reuses an existing dir for GDATADIR. That is a problem if the user thinks they are backing up restarts. But that's probably better than overwriting somebody else's output by mistake.
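One way to make those silent skips visible, sketched under the assumption that a byte-for-byte comparison is affordable: list every file that exists on both sides but differs, since those are the ones --ignore-existing leaves un-backed-up. `report_stale_skips` is a hypothetical helper name.

```shell
#!/bin/bash
# Sketch only: report_stale_skips is a hypothetical helper.
# Prints files under $1 that also exist under $2 with different content,
# i.e. files that rsync --ignore-existing would skip without copying.
report_stale_skips() {
    local src=$1 dst=$2
    (cd "$src" && find . -type f) | while IFS= read -r rel; do
        if [ -f "$dst/$rel" ] && ! cmp -s "$src/$rel" "$dst/$rel"; then
            printf '%s\n' "${rel#./}"
        fi
    done
}
```

Running this after a sync and warning loudly on any output would tell the user that their "backup" is really somebody else's older data. `cmp` reads both files in full, so on large model output a cheaper size or timestamp comparison may be preferable.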


3 participants