
TaskTrees, Workflows, Parallelization, Multiprocessing #54

Closed
matthiaskoenig opened this issue Oct 1, 2017 · 10 comments

Comments

@matthiaskoenig
Collaborator

One important feature to support is very large simulation experiments, and experiments that can be parallelized. It must be possible to execute a SED-ML simulation in parallel, based on information about which tasks can be parallelized and which depend on each other.

This requires something like TaskTrees and an indication of dependent and independent tasks, but also the same information on the level of the data generators, i.e. which tasks have to be executed before a DataGenerator can be calculated.

  • parallelization on the task level, parallelization on the DataGenerator level

The idea is to have a SED-ML file whose execution is automatically parallelized based on the information within it.
The data generators already indicate which tasks they depend on via the variables they use (which are associated with tasks). RepeatedTasks indicate which Task/RepeatedTask they depend on, and the reset flag indicates dependencies between iterations within a RepeatedTask.

We still need to check what additional information would help during the task execution phase.
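To illustrate the point, here is a hedged sketch of how a tool could derive a parallel execution plan from dependency information like the above. The task/DataGenerator ids and the dict representation are hypothetical, not a real libSEDML API; the scheduling itself uses Python's standard-library `graphlib`:

```python
# Sketch: derive parallel execution stages from SED-ML-style dependency
# information. Identifiers and the dict representation are hypothetical;
# a real tool would extract them from the document.
from graphlib import TopologicalSorter

# task -> tasks it depends on; top-level tasks have no dependencies,
# a RepeatedTask depends on the tasks its subtasks reference.
task_deps = {
    "task1": set(),
    "task2": set(),
    "repeatedTask1": {"task1"},
}

# data generator -> tasks referenced by its variables
datagen_deps = {
    "dg_time": {"task1"},
    "dg_combined": {"task1", "task2"},
}

# Each call to get_ready() yields tasks whose dependencies are all
# satisfied -- these form one stage that could run in parallel.
stages = []
ts = TopologicalSorter(task_deps)
ts.prepare()
while ts.is_active():
    ready = sorted(ts.get_ready())
    stages.append(ready)
    for task in ready:
        ts.done(task)

def calculable(dg, finished):
    """A DataGenerator can be computed once all its tasks are done."""
    return datagen_deps[dg] <= finished
```

Here `stages` comes out as `[["task1", "task2"], ["repeatedTask1"]]`: the two independent top-level tasks form a parallel stage, and the RepeatedTask runs afterwards.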

@fbergmann
Member

I'm not sure this is part of the specification ... tools should be free to schedule simulations to run as they see fit.

@matthiaskoenig
Collaborator Author

Yes, I feel the same; this is not really a specification issue.
I just wrote it down to raise awareness, so that we keep in mind not to make decisions that would block parallelization.

@jonrkarr
Contributor

The dependencies are already encoded in the relationships between SED objects. Tools can use the existing relationships to schedule the execution not only of SED tasks, but also the calculation of data generators and the generation of outputs. For example, we're using this to begin producing outputs as soon as the tasks they depend on are complete, even before all tasks have completed.

I would vote against adding additional attributes that promote this information to the level of SED tasks. This would lose the current granularity of the dependencies.

Regarding the relationship with workflow systems such as CWL or GitHub Actions, SED tasks, SED data generator calculations, and the generation of SED outputs could each be mapped to workflow tasks. This would allow workflow engines to manage the execution dependencies between all elements of a SED document.
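The "produce outputs as soon as their tasks finish" idea above can be sketched as follows. This is a hedged illustration, not the actual implementation being described: `run_task`, the task names, and `output_deps` are all stand-ins.

```python
# Sketch: run independent tasks concurrently and emit each output as
# soon as the tasks it depends on have finished, instead of waiting
# for the whole document. All names here are illustrative stand-ins.
from concurrent.futures import ThreadPoolExecutor, as_completed

def run_task(name):               # stand-in for executing one SED task
    return f"results-of-{name}"

output_deps = {                    # output -> tasks it needs
    "report1": {"task1"},
    "plot1": {"task1", "task2"},
}

results, emitted = {}, []
with ThreadPoolExecutor() as pool:
    futures = {pool.submit(run_task, t): t for t in ("task1", "task2")}
    for fut in as_completed(futures):
        results[futures[fut]] = fut.result()
        for out, needs in output_deps.items():
            if out not in emitted and needs <= results.keys():
                emitted.append(out)   # generate this output now
```

`report1` is emitted as soon as `task1` finishes, while `plot1` waits for both tasks; no output waits for tasks it does not reference.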

@luciansmith
Contributor

I think this can be resolved with new text in the spec that is more explicit about what can be parallel and what cannot.

@jonrkarr
Contributor

In my opinion, SED-ML does not quite capture enough information about subtasks, because subtasks can start from the end state of previous subtasks. Addressing this requires another attribute.

@luciansmith
Contributor

I'm happy to be more explicit in L2, but here's what I think L1v4 specifies:

  • All tasks are independent, and can be run in parallel or not. If one task changes the state of a model, it must be reset before another task using the same model is run.
  • All subtasks are assumed to be independent if they have no 'order' attribute defined, or if their 'order' attribute has the same value as another subtask's. If this is not true, it is a bug in the design of the SED-ML file itself. Unlike tasks, model states are not reset unless explicitly requested with the 'resetModel=true' attribute.

We could be more explicit about this in the text, and even add a 'validation rule' (if anyone ever implements validation rules): "Any two subtasks that reference the same model must define the 'order' attribute, and must have unique values. This is because tasks change the state of the model, and the model state must be unambiguous for every subtask."
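That validation rule could be sketched roughly like this, assuming a hypothetical `(model_id, order)` tuple representation of a RepeatedTask's subtasks (this is not part of any SED-ML library):

```python
# Sketch of the proposed validation rule: any two subtasks that
# reference the same model must define unique 'order' values, so the
# model state seen by each subtask is unambiguous. The tuple
# representation is hypothetical.
def validate_subtask_orders(subtasks):
    """subtasks: list of (model_id, order), where order may be None."""
    seen = {}
    errors = []
    for model, order in subtasks:
        if model in seen and (order is None or order in seen[model]):
            errors.append(f"ambiguous model state for model '{model}'")
        seen.setdefault(model, set())
        if order is not None:
            seen[model].add(order)
    return errors
```

Two subtasks on the same model with distinct orders pass; a duplicated or missing order on a shared model is flagged.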

@matthiaskoenig
Collaborator Author

@luciansmith This reads great and should be added to the spec. It would solve my issue here by explicitly stating what can be parallelized.

@jonrkarr
Contributor

@luciansmith's last statement is unnecessarily restrictive. The key thing is the order of simulations with respect to model changes. If multiple subtasks execute from the same model state (i.e. same model with no additional changes), they could be run in parallel.

@luciansmith
Contributor

If there are tasks that don't change the model state, then it's true that the last statement is too restrictive. I couldn't think of any, which is why I put it in. Are there some? Or, I suppose, might there be some in the future? I'm fine with not having it; it's just a validation rule, after all.

luciansmith added a commit that referenced this issue Jun 11, 2021
Follows the philosophy of 'give hints in the spec about parallelizing things, but don't prescribe anything'.
@luciansmith
Contributor

The latest updates to the spec call out the possibility of parallelization in a few places. In listOfTasks:

"Each top-level task is defined such that its execution is independent of the others: if one task is executed
after another, the states of the models must be completely reset so there’s no cross-contamination of one
task to the next. This means that the top-level tasks are particularly well suited to being executed in
parallel, should that be desired."

in 'resetModel':

"When the resetModel attribute is set to "true", the individual repeats of the task may be parallelizable,
assuming any child Range of the RepeatedTask does not depend on the results of any individual repeat
(as is theoretically possible for the FunctionalRange, but not for the other Range types)."

in 'listOfSubTasks':

"In some cases, it may be possible to run some subtasks in parallel. Interpreters may use the order
attribute as a hint in making the decision about what to parallelize, but in general, should be prepared
to examine which Model each SubTask is modifying and which model or models are being used as input."

(I changed to this from my previous list of requirements, which was too restrictive. Instead, I'm basically just punting to interpreters to determine this on their own.)

in 'order':

"Leaving the order undefined for a SubTask implies that the SubTask may be executed before or after
any other SubTask. Giving the same order to multiple SubTask elements is an explicit statement that
each SubTask in the group may be executed before or after any other SubTask in the group. It is
recommended that users always explicitly set the order attribute for this reason.

Any order value does not imply whether the SubTask may be executed in parallel with other SubTask
elements. Interpreters who wish to parallelize subtasks should operate from the assumption that in the
default case, each SubTask would be executed in some order, and adjust accordingly."
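The 'order' semantics quoted above can be sketched as grouping: subtasks sharing an order value form a group whose members may run in any order relative to each other (and are thus candidates for parallel execution), groups run in ascending order, and subtasks without an order are unconstrained. The function and tuple representation below are hypothetical:

```python
# Sketch of order-based subtask grouping per the quoted spec text:
# equal 'order' values form an unconstrained group, groups execute in
# ascending order, and order=None means fully unconstrained placement.
from itertools import groupby

def execution_groups(subtasks):
    """subtasks: list of (name, order); order None = unconstrained."""
    unordered = [name for name, order in subtasks if order is None]
    ordered = sorted((s for s in subtasks if s[1] is not None),
                     key=lambda s: s[1])
    groups = [[name for name, _ in grp]
              for _, grp in groupby(ordered, key=lambda s: s[1])]
    return groups, unordered
```

For subtasks `[("a", 1), ("b", 2), ("c", 1), ("d", None)]`, this yields the groups `[["a", "c"], ["b"]]` plus the unconstrained subtask `"d"`; within each group an interpreter may still need to inspect which models the subtasks modify before actually parallelizing.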
