run an HTCondor cluster on AWS #490

Merged: 17 commits into ReproNim:master on Aug 13, 2020

Conversation

mjtravers (Contributor)

Fixes #392

Basic functionality. Still a work in progress.

mjtravers added the WiP label Dec 2, 2019
8fdc513 (ENH: Added new aws-condor resource to manage HTCondor
cluster resources in the AWS Cloud, 2019-11-19) moved this method to
session.put_text(), but a new call to _put_text() was added to master
in the meantime and the semantic conflict wasn't caught when those
changes were brought in with ea19888 (Fixing merge conflict).
codecov bot commented Dec 2, 2019

Codecov Report

Merging #490 into master will not change coverage.
The diff coverage is n/a.

@@           Coverage Diff           @@
##           master     #490   +/-   ##
=======================================
  Coverage   89.35%   89.35%           
=======================================
  Files         149      149           
  Lines       12267    12267           
=======================================
  Hits        10961    10961           
  Misses       1306     1306

@kyleam (Contributor) left a comment

Thanks. I haven't tried to run anything, but I've left a few comments from an initial read-through.

Review threads (all resolved):
reproman/resource/aws_condor.py (five threads; four now outdated)
reproman/resource/session.py
mjtravers removed the WiP label Apr 16, 2020
yarikoptic and others added 4 commits May 7, 2020 13:30
* origin/master:
  RF: external_versions: Move repeated run() call pattern to helper
  ENH: external_versions: Call run expecting failure/stderr
  RF: external_versions: Prefer plain list to str.split()
  ENH: cmd: Log at debug on FileNotFoundError if expecting failure
  TST: skip: Don't assume program is available in Runner-based checks
  TST: skip: Quiet Runner-based dependency checks
  NF: run: Add slurm submitter
  RF: orchestrators: Move bulk of datalad checks to fixtures
  ENH: skip: Add condition for Slurm testing container
  TST: travis: Set up container for testing Slurm submitter
  installing the alternate env command as an instance variable
  MNT: conda: Unpin conda URL
  RF: conda: Avoid calling 'conda install' on root environment
  TST: conda: Update some versions used in install test
  OPT: conda: Avoid a call to pip
  RF: conda: Fill in editable packages
  RF: conda: Assume environments are in $root/envs
  BF: conda: Add extension to temporary environment file
  ENH: Fall back to perl if env -0 is not supported.

 Conflicts:
	reproman/resource/session.py - minor conflict in imports; will RF them a bit anyway next
@chaselgrove (Contributor)

This looks fine to me and works with datalad-pair-run and plain. The Travis tests time out, but they pass for me.

What else does this need? Anything else to test?

@kyleam (Contributor) commented Aug 4, 2020

> The Travis tests time out

I'll take a look tomorrow to see if I can figure out what's stalling.

@kyleam (Contributor) commented Aug 5, 2020

> The Travis tests time out
>
> I'll take a look tomorrow to see if I can figure out what's stalling.

Looks like it's test_orchestrators.py::test_orc_datalad_concurrent[sub:condor-orc:pair-run]. Without the changes in this PR, the Travis job doesn't stall. The likely culprit from this PR is

diff --git a/reproman/support/jobs/job_templates/submission/condor.template b/reproman/support/jobs/job_templates/submission/condor.template
index bdaf040ad..c125fe507 100644
--- a/reproman/support/jobs/job_templates/submission/condor.template
+++ b/reproman/support/jobs/job_templates/submission/condor.template
@@ -9,6 +9,8 @@ environment  = ""
 Output  = {{ _meta_directory }}/stdout.$(Process)
 Error   = {{ _meta_directory }}/stderr.$(Process)
 Log     = {{ _meta_directory }}/log.$(Process)
+should_transfer_files   = Yes
+when_to_transfer_output = ON_EXIT
 
 {#
   TODO: Need to check spec form compatibility between different batch

The changes in this PR do not make any of the tests stall for me locally, so perhaps we're looking at git-annex-related ssh stalling that we've been dealing with on DataLad's end. That was specific to Xenial, so I've triggered a job with Bionic to see if the stall still happens there.

In my local runs, the above change leads to a failure in test_orc_datalad_run[sub:condor-orc:pair]. Interestingly, that seems to have passed in the stalled job on Travis. I need to look into it more.

ad85f13 (ENH: Added options to HTCondor config template to handle
moving files across the cluster to/from the master node, 2020-02-27)
instructed condor to transfer files from and back to the submitting
machine.  This configuration results in two issues:

  * test_orc_datalad_run[sub:condor-orc:pair] fails on my local
    machine.  It completes its first job and then fails when preparing
    the second because the remote repository is unexpectedly dirty.
    The untracked files are the previous stdout and stderr that condor
    transfers back at exit, at which point the runscript has already
    switched away from the HEAD checked out to execute the job.

  * test_orc_datalad_concurrent[sub:condor-orc:pair-run] stalls for an
    unknown reason.

    https://travis-ci.org/github/ReproNim/reproman/jobs/714911219

As far as I understand, adding this configuration was a false start
for an issue that was then solved by 3d93c7e (ENH: Added NFS to AWS
HTCondor cluster management, 2020-04-06), so let's remove it.
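
For reference, removing that configuration presumably amounts to reverting the hunk quoted in the earlier comment; this is a sketch reconstructed from that diff, with the index line and hunk header elided:

--- a/reproman/support/jobs/job_templates/submission/condor.template
+++ b/reproman/support/jobs/job_templates/submission/condor.template
 Output  = {{ _meta_directory }}/stdout.$(Process)
 Error   = {{ _meta_directory }}/stderr.$(Process)
 Log     = {{ _meta_directory }}/log.$(Process)
-should_transfer_files   = Yes
-when_to_transfer_output = ON_EXIT
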
@kyleam (Contributor) commented Aug 12, 2020

I just pushed a commit (ec48e84) that removes those condor settings. As I mentioned in that commit, those settings are from trying to get things working without NFS, and I don't think they make sense as of 3d93c7e.

@chaselgrove (Contributor)

Works on macOS.

@kyleam (Contributor) commented Aug 12, 2020

> Works on macOS.

Great, thanks for checking.

The Travis job no longer stalls, and the job with condor enabled passes. Another job fails, but it's in the make -C ... phase after the tests. I haven't seen that in recent Travis runs, I don't see it locally, and it seems unlikely to be related to the PR, so I'd say we should wait to worry about it until it pops up again.

@chaselgrove (Contributor)

@kyleam I think we overlapped earlier; I started the macOS test on 929119e, before your message. Do your changes need another test?

@kyleam (Contributor) commented Aug 12, 2020

> @kyleam I think we overlapped earlier; I started the macOS test on 929119e, before your message.

Ah, thought that was fast :)

> Do your changes need another test?

If you don't mind, it'd be good to confirm. Thanks.

@chaselgrove (Contributor)

@kyleam ec48e84 works on macOS.

yarikoptic merged commit 0402131 into ReproNim:master Aug 13, 2020

Successfully merging this pull request may close these issues: Prepare cluster AMI (#392).

4 participants