run an HTCondor cluster on AWS #490

Merged: 17 commits into ReproNim:master on Aug 13, 2020

Conversation

mjtravers (Contributor)

Fixes #392

Basic functionality. Still a work in progress.

mjtravers added the WiP label Dec 2, 2019
8fdc513 (ENH: Added new aws-condor resource to manage HTCondor
cluster resources in the AWS Cloud, 2019-11-19) moved this method to
session.put_text(), but a new call to _put_text() was added to master
in the meantime and the semantic conflict wasn't caught when those
changes were brought in with ea19888 (Fixing merge conflict).
codecov bot commented Dec 2, 2019

Codecov Report

Merging #490 into master will not change coverage.
The diff coverage is n/a.

@@           Coverage Diff           @@
##           master     #490   +/-   ##
=======================================
  Coverage   89.35%   89.35%           
=======================================
  Files         149      149           
  Lines       12267    12267           
=======================================
  Hits        10961    10961           
  Misses       1306     1306

@kyleam (Contributor) left a comment

Thanks. I haven't tried to run anything, but I've left a few comments from an initial read-through.

Review threads (all resolved):
reproman/resource/aws_condor.py (five threads; four now outdated)
reproman/resource/session.py
mjtravers removed the WiP label Apr 16, 2020
yarikoptic and others added 4 commits May 7, 2020 13:30
* origin/master:
  RF: external_versions: Move repeated run() call pattern to helper
  ENH: external_versions: Call run expecting failure/stderr
  RF: external_versions: Prefer plain list to str.split()
  ENH: cmd: Log at debug on FileNotFoundError if expecting failure
  TST: skip: Don't assume program is available in Runner-based checks
  TST: skip: Quiet Runner-based dependency checks
  NF: run: Add slurm submitter
  RF: orchestrators: Move bulk of datalad checks to fixtures
  ENH: skip: Add condition for Slurm testing container
  TST: travis: Set up container for testing Slurm submitter
  installing the alternate env command as an instance variable
  MNT: conda: Unpin conda URL
  RF: conda: Avoid calling 'conda install' on root environment
  TST: conda: Update some versions used in install test
  OPT: conda: Avoid a call to pip
  RF: conda: Fill in editable packages
  RF: conda: Assume environments are in $root/envs
  BF: conda: Add extension to temporary environment file
  ENH: Fall back to perl if env -0 is not supported.

 Conflicts:
	reproman/resource/session.py - minor conflict in imports; will RF them a bit anyway next
@chaselgrove (Contributor)

This looks fine to me and works with datalad-pair-run and plain. The Travis tests time out, but they pass for me.

What else does this need? Anything else to test?

@kyleam (Contributor) commented Aug 4, 2020

> The Travis tests time out

I'll take a look tomorrow to see if I can figure out what's stalling.

@kyleam (Contributor) commented Aug 5, 2020

> The Travis tests time out
>
> I'll take a look tomorrow to see if I can figure out what's stalling.

Looks like it's test_orchestrators.py::test_orc_datalad_concurrent[sub:condor-orc:pair-run]. Without the changes in this PR, the Travis job doesn't stall. The likely culprit from this PR is

diff --git a/reproman/support/jobs/job_templates/submission/condor.template b/reproman/support/jobs/job_templates/submission/condor.template
index bdaf040ad..c125fe507 100644
--- a/reproman/support/jobs/job_templates/submission/condor.template
+++ b/reproman/support/jobs/job_templates/submission/condor.template
@@ -9,6 +9,8 @@ environment  = ""
 Output  = {{ _meta_directory }}/stdout.$(Process)
 Error   = {{ _meta_directory }}/stderr.$(Process)
 Log     = {{ _meta_directory }}/log.$(Process)
+should_transfer_files   = Yes
+when_to_transfer_output = ON_EXIT
 
 {#
   TODO: Need to check spec form compatibility between different batch

The changes in this PR do not make any of the tests stall for me locally, so perhaps we're looking at git-annex-related ssh stalling that we've been dealing with on DataLad's end. That was specific to Xenial, so I've triggered a job with Bionic to see if the stall still happens there.

In my local runs, the above change leads to a failure in test_orc_datalad_run[sub:condor-orc:pair]. Interestingly, that seems to have passed in the stalled job on Travis. I need to look into it more.

ad85f13 (ENH: Added options to HTCondor config template to handle
moving files across the cluster to/from the master node, 2020-02-27)
instructed condor to transfer files from and back to the submitting
machine.  This configuration results in two issues:

  * test_orc_datalad_run[sub:condor-orc:pair] fails on my local
    machine.  It completes its first job and then fails when preparing
    the second because the remote repository is unexpectedly dirty.
    The untracked files are the previous stdout and stderr that condor
    transfers back at exit, at which point the runscript has already
    switched away from the HEAD checked out to execute the job.

  * test_orc_datalad_concurrent[sub:condor-orc:pair-run] stalls for an
    unknown reason.

    https://travis-ci.org/github/ReproNim/reproman/jobs/714911219

As far as I understand, adding this configuration was a false start
for an issue that was then solved by 3d93c7e (ENH: Added NFS to AWS
HTCondor cluster management, 2020-04-06), so let's remove it.
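
For reference, removing that configuration presumably amounts to reverting the hunk quoted in the earlier comment; this is a sketch reconstructed from that diff, with the index line and hunk header elided:

--- a/reproman/support/jobs/job_templates/submission/condor.template
+++ b/reproman/support/jobs/job_templates/submission/condor.template
 Output  = {{ _meta_directory }}/stdout.$(Process)
 Error   = {{ _meta_directory }}/stderr.$(Process)
 Log     = {{ _meta_directory }}/log.$(Process)
-should_transfer_files   = Yes
-when_to_transfer_output = ON_EXIT
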
@kyleam (Contributor) commented Aug 12, 2020

I just pushed a commit (ec48e84) that removes those condor settings. As I mentioned in that commit, those settings are from trying to get things working without NFS, and I don't think they make sense as of 3d93c7e.

@chaselgrove (Contributor)

Works on macOS.

@kyleam (Contributor) commented Aug 12, 2020

> Works on macOS.

Great, thanks for checking.

The Travis job no longer stalls, and the job with condor enabled passes. Another job fails, but it's in the make -C ... phase after the tests. I haven't seen that in recent Travis runs, I don't see it locally, and it seems unlikely to be related to the PR, so I'd say we should wait to worry about it until it pops up again.

@chaselgrove (Contributor)

@kyleam I think we overlapped earlier; I started the macOS test on 929119e, before your message. Do your changes need another test?

@kyleam (Contributor) commented Aug 12, 2020

> @kyleam I think we overlapped earlier; I started the macOS test on 929119e, before your message.

Ah, thought that was fast :)

> Do your changes need another test?

If you don't mind, it'd be good to confirm. Thanks.

@chaselgrove (Contributor)

@kyleam ec48e84 works on macOS.

yarikoptic merged commit 0402131 into ReproNim:master Aug 13, 2020

Successfully merging this pull request may close these issues: Prepare cluster AMI (#392).

4 participants