-
Notifications
You must be signed in to change notification settings - Fork 14
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
run an HTCondor cluster on AWS #490
Conversation
8fdc513 (ENH: Added new aws-condor resource to manage HTCondor cluster resources in the AWS Cloud, 2019-11-19) moved this method to session.put_text(), but a new call to _put_text() was added to master in the meantime and the semantic conflict wasn't caught when those changes were brought in with ea19888 (Fixing merge conflict).
Codecov Report
@@ Coverage Diff @@
## master #490 +/- ##
=======================================
Coverage 89.35% 89.35%
=======================================
Files 149 149
Lines 12267 12267
=======================================
Hits 10961 10961
Misses 1306 1306 Continue to review full report at Codecov.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks. I haven't tried to run anything, but I've left a few comments from an initial read through.
reproman/support/jobs/job_templates/cluster/condor_config.local.j2
Outdated
Show resolved
Hide resolved
…nto enh-cluster
… across the cluster to/from the master node
…s the job on all cluster nodes.
* origin/master: RF: external_versions: Move repeated run() call pattern to helper ENH: external_versions: Call run expecting failure/stderr RF: external_versions: Prefer plain list to str.split() ENH: cmd: Log at debug on FileNotFoundError if expecting failure TST: skip: Don't assume program is available in Runner-based checks TST: skip: Quiet Runner-based dependency checks NF: run: Add slurm submitter RF: orchestrators: Move bulk of datalad checks to fixtures ENH: skip: Add condition for Slurm testing container TST: travis: Set up container for testing Slurm submitter installing the alternate env command as an instance variable MNT: conda: Unpin conda URL RF: conda: Avoid calling 'conda install' on root environment TST: conda: Update some versions used in install test OPT: conda: Avoid a call to pip RF: conda: Fill in editable packages RF: conda: Assume environments are in $root/envs BF: conda: Add extension to temporary environment file ENH: Fall back to perl if env -0 is not supported. Conflicts: reproman/resource/session.py - minor in imports, will RF them a bit anyways next
This looks fine to me and works with datalad-pair-run and plain. The Travis test time out, but tests pass for me. What else does this need? Anything else to test? |
I'll take a look tomorrow to see if I can figure out what's stalling. |
Looks like it's diff --git a/reproman/support/jobs/job_templates/submission/condor.template b/reproman/support/jobs/job_templates/submission/condor.template
index bdaf040ad..c125fe507 100644
--- a/reproman/support/jobs/job_templates/submission/condor.template
+++ b/reproman/support/jobs/job_templates/submission/condor.template
@@ -9,6 +9,8 @@ environment = ""
Output = {{ _meta_directory }}/stdout.$(Process)
Error = {{ _meta_directory }}/stderr.$(Process)
Log = {{ _meta_directory }}/log.$(Process)
+should_transfer_files = Yes
+when_to_transfer_output = ON_EXIT
{#
TODO: Need to check spec form compatibility between different batch The changes in this PR do not make any of the tests stall for me locally, so perhaps we're looking at git-annex-related ssh stalling that we've been dealing with on DataLad's end. That was specific to Xenial, so I've triggered a job with Bionic to see if the stall still happens there. In my local runs, the above change leads to a failure in |
ad85f13 (ENH: Added options to HTCondor config template to handle moving files across the cluster to/from the master node, 2020-02-27) instructed condor to transfer files from and back to the submitting machine. This configuration results in two issues: * test_orc_datalad_run[sub:condor-orc:pair] fails on my local machine. It completes its first job and then fails when preparing the second because the remote repository is unexpectedly dirty. The untracked files are the previous stdout and stderr that condor transfers back at exit, at which point the runscript has already switched away from the HEAD checked out to execute the job. * test_orc_datalad_concurrent[sub:condor-orc:pair-run] stalls for an unknown reason. https://travis-ci.org/github/ReproNim/reproman/jobs/714911219 As far as I understand, adding this configuration was a false start for an issue that was then solved by 3d93c7e (ENH: Added NFS to AWS HTCondor cluster management, 2020-04-06), so let's remove it.
Works on macOS. |
Great, thanks for checking. The Travis job no longer stalls, and the job with condor enabled passes. Another job fails, but it's in the |
Fixes #392
Basic functionality. Still a work in progress