run: Parallel subjobs #408

Merged
merged 36 commits into ReproNim:master from kyleam:run-subjobs on Aug 9, 2019

Conversation

@kyleam
Contributor

kyleam commented Apr 25, 2019

Rework the orchestrators and submitters to support concurrent subjobs.

In terms of the run interface, this introduces a --batch-spec flag and, for simple cases, a --batch-parameter flag. These define the parameters for each subjob, and thus how many subjobs are run.

An example invocation would be:

reproman run --follow -r ss --sub condor --orc datalad-pair-run --bp 'thing=thing-*' \
  --input '{p[thing]}' sh -c 'cat {p[thing]} {p[thing]} >doubled-{p[thing]}'

Assuming [thing-a, thing-b, ...] are globbed, that's equivalent to a --batch-spec with

thing:
- thing-a
- thing-b
[...]
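
For illustration only (not ReproMan's actual implementation), here is a minimal Python sketch of that glob-to-spec expansion; the helper name is made up:

  import glob
  import yaml  # PyYAML, assumed available for the illustration

  def bp_glob_to_spec(bp):
      # Expand a --bp value like "thing=thing-*" into the spec above.
      name, _, pattern = bp.partition("=")
      return {name: sorted(glob.glob(pattern))}

  # With thing-a and thing-b in the current directory:
  print(yaml.safe_dump(bp_glob_to_spec("thing=thing-*")))
  # thing:
  # - thing-a
  # - thing-b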
@kyleam kyleam force-pushed the kyleam:run-subjobs branch 3 times, most recently from 588b933 to 0cb146c May 13, 2019
@kyleam kyleam force-pushed the kyleam:run-subjobs branch 3 times, most recently from b11f8b7 to 0a1875f May 23, 2019
@kyleam kyleam force-pushed the kyleam:run-subjobs branch 3 times, most recently from 3067a4c to ef71f7e Jun 4, 2019
@kyleam kyleam force-pushed the kyleam:run-subjobs branch 3 times, most recently from fc9149b to d24c301 Jun 12, 2019
@codecov
codecov bot commented Jun 19, 2019

Codecov Report

Merging #408 into master will decrease coverage by 11.12%.
The diff coverage is 61.81%.


@@             Coverage Diff             @@
##           master     #408       +/-   ##
===========================================
- Coverage   90.04%   78.91%   -11.13%     
===========================================
  Files         148      148               
  Lines       11820    11910       +90     
===========================================
- Hits        10643     9399     -1244     
- Misses       1177     2511     +1334
Impacted Files Coverage Δ
reproman/support/jobs/submitters.py 55.47% <0%> (-18.28%) ⬇️
reproman/interface/run.py 100% <100%> (ø) ⬆️
reproman/support/jobs/tests/test_orchestrators.py 33.22% <50%> (-65.31%) ⬇️
reproman/support/jobs/orchestrators.py 46.68% <51.96%> (-36.37%) ⬇️
reproman/interface/tests/test_run.py 99.56% <98%> (-0.44%) ⬇️
reproman/distributions/tests/test_docker.py 22.22% <0%> (-77.78%) ⬇️
reproman/resource/tests/test_singularity.py 25% <0%> (-75%) ⬇️
reproman/resource/tests/test_ssh.py 27.53% <0%> (-72.47%) ⬇️
reproman/distributions/tests/test_singularity.py 31.03% <0%> (-68.97%) ⬇️
reproman/distributions/tests/test_venv.py 31.73% <0%> (-66.35%) ⬇️
... and 35 more

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update c291a61...d24c301. Read the comment docs.

@codecov
codecov bot commented Jun 19, 2019

Codecov Report

Merging #408 into master will increase coverage by 3.27%.
The diff coverage is 87.04%.


@@            Coverage Diff             @@
##           master     #408      +/-   ##
==========================================
+ Coverage    85.8%   89.08%   +3.27%     
==========================================
  Files         148      148              
  Lines       11860    12082     +222     
==========================================
+ Hits        10177    10763     +586     
+ Misses       1683     1319     -364
Impacted Files Coverage Δ
reproman/support/jobs/tests/test_orchestrators.py 92.18% <100%> (+61.05%) ⬆️
reproman/interface/run.py 100% <100%> (ø) ⬆️
reproman/support/jobs/orchestrators.py 79.09% <77.7%> (+35.3%) ⬆️
reproman/support/jobs/submitters.py 76.35% <78.26%> (+22.45%) ⬆️
reproman/interface/tests/test_run.py 99.56% <98%> (-0.44%) ⬇️
reproman/resource/tests/test_singularity.py 26.31% <0%> (-73.69%) ⬇️
reproman/distributions/tests/test_singularity.py 33.33% <0%> (-66.67%) ⬇️
reproman/distributions/singularity.py 74.66% <0%> (-20%) ⬇️
reproman/resource/singularity.py 79.04% <0%> (-12.39%) ⬇️
reproman/distributions/conda.py 94.16% <0%> (ø) ⬆️
... and 15 more

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 7f2a506...a823a4e. Read the comment docs.

@kyleam kyleam force-pushed the kyleam:run-subjobs branch 3 times, most recently from 330c74e to fae8d9a Jun 19, 2019
kyleam added 13 commits May 10, 2019
An upcoming commit will need this functionality, but for a file that
should not be executable.
Doing so makes it easier to debug and read through the run script
because these variables are used in place of the expanded values.
Using the expanded values directly makes it less obvious what the
values are and makes the script harder to read because the values can
be quite long.
With concurrent jobs, the status method should be capable of getting
the status for any subjob.  Keep the .status property because there
are still places that use it, though it of course isn't very helpful
in the case of concurrent jobs.  These spots should be reworked.
We'll need this for the local parallel submitter in order to get a
path name we know we don't have to quote.
At this point, there is still only one job, but plumbing through the
handling for this suffix starts to prepare the setup for concurrent
jobs.
There's no reason not to do this anyway, but it's an especially good
idea given that this script is going to get more complicated when it
supports parallel jobs.
We override status() for the DataLad orchestrator because the status
file may be out-of-tree (e.g., because a later run used a HEAD before
the result commit).  But this applies only to the -pair variants, so
move the status() override and the _execute_in_wdir() helper to
PrepareRemoteDataladMixin.
Once we support concurrent jobs, we need a way for each subjob to
indicate that it failed.  Teach failing jobs to touch a file under
"METADIR/failed/".  We could instead check the content of status.N,
but that could mean checking the contents of a large number of files
just to obtain the status of (hopefully) a small number of failures.

Update log_failed() to look at this directory rather than .status,
which won't be meaningful for concurrent jobs.
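
A minimal sketch of what reading that directory could look like (the function name is hypothetical; the real orchestrator code differs):

  import os

  def failed_subjobs(metadir):
      # Subjobs signal failure by touching a file under METADIR/failed/,
      # so a directory listing is enough to find the failures -- no need
      # to read every status.N file.
      fail_dir = os.path.join(metadir, "failed")
      if not os.path.isdir(fail_dir):
          return []
      return sorted(os.listdir(fail_dir))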
We can't have every job touch the same sentinel because then we would
only catch the files that arrived after the last job started.
Instead, make each process touch its own file and then use the
earliest timestamp from those files for the comparison.
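
Roughly, that comparison point could be computed like this (a sketch; the "launched.*" naming is made up):

  import glob
  import os.path as op

  def earliest_start(metadir):
      # Every process touches its own sentinel at startup; the earliest
      # mtime among them is the safe reference point for deciding which
      # files arrived after the run began.
      sentinels = glob.glob(op.join(metadir, "launched.*"))
      return min(op.getmtime(s) for s in sentinels)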
Once we support concurrent jobs, we'll need to either copy the status
and output files for all of the jobs or for the failures.  Do the
latter since there may be many jobs.
The formatting will get more complicated when we support concurrent
jobs, so the static method will make testing the different cases less
expensive.
kyleam added 12 commits May 23, 2019
Only the datalad-based orchestrators (1) expand globs in inputs and
outputs and (2) format the command with them.  Ideally the plain
orchestrator would do both these things, although it's not a priority
given we recommend using datalad-based orchestrators.  We can do no.
1 without introducing/duplicating any new code (we already took in
DataLad's GlobbedPaths via the 3rd branch).  Avoid no. 2 for now
because that would involve more work (e.g., bringing in DataLad's
SequenceFormatter).
Once we add concurrent jobs, this will be useful for getting the
inputs and outputs both for all subjobs and for particular subjobs.
The latest datalad-container release passes the image as an
extra_input.
The message won't be pretty in cases where datalad-run results use a
tuple of (format, *args), but the reason for the failure should still
be clear.
The immediate use for this is that I need to use a command other than
qsub for the PBS system I have access to.  More thought should be
given to making the submit command customizable, but for now just look
in the job spec for the (undocumented) "submit_command" key.  Note
that this will work only from --job-spec, not --job-parameter, because
it needs to come in as a list.
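
To illustrate the idea (the dict layout and the default shown are assumptions, not reproman's code):

  DEFAULT_SUBMIT_COMMAND = ["qsub"]  # hypothetical default

  def get_submit_command(job_spec):
      # job_spec is the dict loaded from the --job-spec YAML file.
      # "submit_command" must already be a list, which is why it can
      # only come in via --job-spec, not --job-parameter.
      return job_spec.get("submit_command") or DEFAULT_SUBMIT_COMMAND

  # e.g., a job spec file containing:  submit_command: [qsub-alt, -q, batch]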
Wire up the DataLad orchestrators and the local and condor submitters
to support concurrent jobs.  Using a list of records, we expand the
command, as well as the inputs and outputs, into a list of commands,
with a submitter-specific mechanism for registering each subjob.

Comments mention a bunch of rough edges.  Two particularly important
ones are:

  * Spotty PBS support

    I adjusted this for the system I have access to.  In the event of
    an error, the separator used for the stderr file is wrong, but
    aside from that it seems to work.

    But based on playing around with other Torque variants available
    through Docker images, at least some of them don't seem to provide
    easy access, so this might fail for many PBS systems.

    One option might be to work around this by relying on DRMAA
    support, but unfortunately the main PBS system I have access
    to (and one we'd want to support), doesn't seem to have DRMAA
    support.

  * Inappropriate run record

    For datalad-{local,pair}-run, the run record is no longer accurate
    because the record doesn't represent a command that could be
    executed from that record alone (datalad-run, for example,
    wouldn't know how to do the batch expansion).  One possibility
    would be to update the record to store the reproman run
    invocation, but we'd have to quote the placeholders so that
    datalad-run wouldn't complain about unknown ones.

In addition to code FIXMEs, support for concurrent jobs should also
have more dedicated tests.
The orchestrator has been wired up to generate an array of commands
based on records.  For example, with

  [{"name": "foo"},
   {"name": "bar"}]

and the command "touch {p[name]}", this would expand to two concurrent calls

  touch foo
  touch bar

Add an option to run() that allows the caller to point to a file that
contains these records.  This could be a YAML/JSON file, or it could
be a Python file that generates the records, but let's go with a YAML
file for now: it's a simple step from a Python script to dumping a
YAML file, and that way Python isn't forced on the user.

The next commit will add an alternative option that allows specifying
simple combinations via the command line without the need for a yaml
file.
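
A toy Python sketch of the expansion described above (the record list and command mirror the example; this is not the orchestrator's actual code):

  records = [{"name": "foo"},
             {"name": "bar"}]
  command = "touch {p[name]}"
  commands = [command.format(p=rec) for rec in records]
  # -> ["touch foo", "touch bar"]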
As an alternative to --batch-spec, add an option for constructing the
batch parameters by combining lists of values.  Right now, we only
support taking the product of these values, but we could consider
adding a flag specifying that the i-th elements should instead be
paired.  (See the code comment for what would need to be adjusted.)

This syntax is pretty limited, but for more complicated things the
caller can use the --batch-spec option.
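
For illustration, a sketch of the product-style resolution, assuming a comma-separated name=value1,value2 syntax (the function name is made up):

  from itertools import product

  def resolve_batch_parameters(params):
      # params like ["a=1,2", "b=x,y"]; only the product is supported
      # here -- pairing i-th elements would be the possible extension.
      names, values = zip(*(p.split("=", 1) for p in params))
      value_lists = [v.split(",") for v in values]
      return [dict(zip(names, combo)) for combo in product(*value_lists)]

  # resolve_batch_parameters(["a=1,2", "b=x,y"])
  # -> [{'a': '1', 'b': 'x'}, {'a': '1', 'b': 'y'},
  #     {'a': '2', 'b': 'x'}, {'a': '2', 'b': 'y'}]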
We're going to add a --batch-parameter/--bp option, so use --jp rather
than -p for consistency.
This is useful for generating batch parameters based on a set of file
names.
@kyleam kyleam force-pushed the kyleam:run-subjobs branch from fae8d9a to ae58628 Jul 11, 2019
@kyleam kyleam marked this pull request as ready for review Jul 11, 2019
kyleam added 2 commits Jul 22, 2019
datalad-container added an "img_dspath" placeholder in 0.4.0.  As
noted in the code comment, ideally containers-run would expose this
logic so we don't need to duplicate it and worry about it becoming
stale.
tar v1.30 and later will fail with the following error:

  tar: The following options were used after any non-optional
  arguments in archive create or update mode.  These options are
  positional and affect only arguments that follow them.  Please,
  rearrange them properly.
kyleam added 7 commits Aug 5, 2019
As of fd9bf94 (Added session handling of 'cwd' when executing a
command, 2019-03-21), we can use the `cwd` parameter.  Note that, if
for some reason we need to revert this change, shlex.quote() should be
used when constructing the prefix.

Keep _execute_in_wdir() around rather than calling execute_command()
directly because the helper is still useful for recasting
CommandErrors as OrchestratorErrors.
We'll use this to get a list of subdatasets so that we can trigger
git-annex to autoenable remotes, and it should also be useful for
providing better reactions and errors to our remote datalad calls.
If the repository uses an autoenabled special remote (e.g.,
///repronim/containers), the only source of the input may be the
special remote, in which case getting files will fail because the
create-sibling/publish sequence we use doesn't trigger the
autoenabling.

Add a (probably temporary) workaround.
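
The gist of the workaround, sketched locally with plain subprocess calls (in the PR this runs through the remote session; `git annex init` is what performs the autoenabling):

  import subprocess

  def autoenable_remotes(dataset_paths):
      # `git annex init` (re)runs autoenabling for special remotes
      # marked autoenable=true, which the create-sibling/publish
      # sequence alone does not trigger.
      for path in dataset_paths:
          subprocess.check_call(["git", "-C", path, "annex", "init"])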
Outputs may be directed to a subdataset.
We'll need to extend this with another workaround that executes for
each subdataset, so let's separate out this logic for readability.
We call `datalad {create-sibling,publish} --recursive` and, aside from
checking out the expected commit in the top-level dataset, behave as
though the hierarchy is in the desired state, but that's not
necessarily the case.  It isn't even always the case for a fresh setup
because publishing a subdataset with a branch other than master
checked out leaves the subdataset on an unborn master branch with an
empty working tree [*].

Go through each of the installed subdatasets and check out the
recorded commit.  Note that this is wasteful because it executes a
command on the resource even if the subdataset is already in the
desired state, but detecting whether the checkout is necessary would
only add to the problem as we'd have to execute an additional command.
But this workaround will hopefully be short-lived because we should be
able to drop this code once DataLad implements the ability to reset an
entire hierarchy to a desired state (issue 1424).

Closes #444.

[*] To add to the confusion, Git doesn't mark this subdataset as
    modified (likely a bug in Git), which means our dirty check
    doesn't catch the unintended state.
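
A rough local sketch of the checkout loop described above, relying on the gitlink entries Git records for subdatasets (not the PR's actual code, which runs via the remote session):

  import os
  import subprocess

  def checkout_recorded_commits(superds):
      # Gitlink ("commit") entries in the superdataset's tree record the
      # commit each registered subdataset should be at; check each out.
      out = subprocess.check_output(
          ["git", "-C", superds, "ls-tree", "-r", "HEAD"], text=True)
      for line in out.splitlines():
          meta, path = line.split("\t", 1)
          mode, otype, sha = meta.split()
          if otype == "commit" and os.path.isdir(os.path.join(superds, path)):
              subprocess.check_call(
                  ["git", "-C", os.path.join(superds, path), "checkout", sha])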
@yarikoptic (Member) left a comment

Just a minor comment, lots of good code so let's merge ;-)

def test_resolve_batch_params_eq(tmpdir, params, spec):
    fname = op.join(str(tmpdir), "spec.yml")
    with open(fname, "w") as fh:
        fh.write(spec)

@yarikoptic (Member) commented Aug 9, 2019

Minor FYI: there is create_tree, which is used by the with_tree decorator.

@kyleam (Author, Contributor) commented Aug 9, 2019

Yes, I even use create_tree in this PR (ae58628) :)

I do like {create,with}_tree, but I also don't think there's much value in replacing every simple write call with it.

@yarikoptic yarikoptic merged commit 0b5bc61 into ReproNim:master Aug 9, 2019
3 checks passed
codecov/patch 87.04% of diff hit (target 85.8%)
codecov/project 89.08% (+3.27%) compared to 7f2a506
continuous-integration/travis-ci/pr The Travis CI build passed
@kyleam kyleam deleted the kyleam:run-subjobs branch Aug 9, 2019