TaskTrees, Workflows, Parallelization, Multiprocessing #54
Comments
I'm not sure this is part of the specification ... tools should be free to schedule simulations to run as they see fit.
Yes, I feel the same. It is not really a specification issue.
The dependencies are already encoded into the relationships between SED objects. Tools can use the existing relationships to determine how to schedule the execution, not only of SED tasks, but also of the calculation of data generators and the generation of outputs. For example, we're using this to begin producing outputs as soon as the tasks they depend on are complete, even before all tasks have completed. I would vote against adding additional attributes beyond this to promote this information to the level of SED tasks. This would lose the current granularity of dependencies. Regarding the relationship with workflow systems such as CWL or GitHub Actions, SED tasks, SED data generator calculations, and the generation of SED outputs could each be mapped to workflow tasks. This would allow workflow engines to manage the execution dependencies between all elements of a SED document.
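To illustrate the point about scheduling from existing relationships, here is a minimal sketch in Python. It assumes the dependencies between tasks, data generators, and outputs have already been extracted into a plain dictionary; the element ids (`task1`, `dg_task1`, `plot1`, and so on) and the `execution_waves` helper are hypothetical and not part of SED-ML or of any tool. It uses the standard-library `graphlib.TopologicalSorter` (Python 3.9+) to group elements into waves that could run concurrently.

```python
from graphlib import TopologicalSorter

# Hypothetical experiment: each key is a SED element id, and each value is
# the set of element ids it depends on (tasks, data generators, outputs).
dependencies = {
    "task1": set(),
    "task2": set(),
    "dg_task1": {"task1"},
    "dg_task2": {"task2"},
    "plot1": {"dg_task1"},
    "report_all": {"dg_task1", "dg_task2"},
}


def execution_waves(deps):
    """Group elements into waves; elements within a wave do not depend on each other."""
    sorter = TopologicalSorter(deps)
    sorter.prepare()  # raises graphlib.CycleError if the dependencies are cyclic
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # everything whose dependencies are done
        waves.append(ready)
        sorter.done(*ready)  # mark the whole wave as finished
    return waves


waves = execution_waves(dependencies)
for index, wave in enumerate(waves):
    print(index, wave)
```

Grouping into waves is a simplification: a scheduler that wants to emit `plot1` as soon as `task1` and `dg_task1` finish, without waiting for `task2`, would call `done()` for each element individually as it completes instead of once per wave.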
I think this can be resolved with new text in the spec that is more explicit about what can be parallel and what cannot.
In my opinion, SED-ML does not quite capture enough with respect to subtasks because of the notion that subtasks can start from the end state of previous subtasks. Addressing this requires another attribute.
I'm happy to be more explicit in L2, but here's what I think L1v4 specifies:
We could be more explicit about this in the text, and even add a 'validation rule' (if anyone ever implements validation rules): "Any two subtasks that reference the same model must define the 'order' attribute, and must have unique values. This is because tasks change the state of the model, and the model state must be unambiguous for every subtask."
@luciansmith This reads great. This should be added to the spec. This would solve my issue here by explicitly stating what could be parallelized.
@luciansmith's last statement is unnecessarily restrictive. The key thing is the order of simulations with respect to model changes. If multiple subtasks execute from the same model state (i.e. same model with no additional changes), they could be run in parallel.
If there are tasks that don't change the model state, then it's true that the last statement is too restrictive. I couldn't think of any, which is why I put it in. Are there some? Or, I suppose, might there be some in the future? I'm fine with not having it; it's just a validation rule, after all.
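As a rough illustration of how the validation rule quoted above could be checked, here is a small Python sketch. The data layout is an assumption made for the example: subtasks are plain dictionaries with a `task` id and an optional `order`, and `task_models` maps each task id to the model it references. The function name `check_subtask_orders` is hypothetical and does not come from any SED-ML library.

```python
from collections import defaultdict


def check_subtask_orders(subtasks, task_models):
    """Return error messages for subtasks that share a model without a unique order.

    subtasks:    list of dicts with a 'task' id and an optional 'order' value
    task_models: mapping from task id to the id of the model the task references
    """
    by_model = defaultdict(list)
    for subtask in subtasks:
        by_model[task_models[subtask["task"]]].append(subtask)

    errors = []
    for model, group in by_model.items():
        if len(group) < 2:
            continue  # the proposed rule only concerns subtasks sharing a model
        seen = {}  # order value -> task id that first used it
        for subtask in group:
            order = subtask.get("order")
            if order is None:
                errors.append(
                    f"subtask '{subtask['task']}' shares model '{model}' but defines no order"
                )
            elif order in seen:
                errors.append(
                    f"subtasks '{seen[order]}' and '{subtask['task']}' share model "
                    f"'{model}' and order {order}"
                )
            else:
                seen[order] = subtask["task"]
    return errors


task_models = {"t1": "m1", "t2": "m1", "t3": "m2"}
print(check_subtask_orders(
    [{"task": "t1", "order": 1}, {"task": "t2", "order": 1}, {"task": "t3"}],
    task_models,
))
```

Here `t3` is not reported even though it has no order, because it is the only subtask that references `m2`.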
The latest updates to the spec call out the possibility of parallelization in a few places.
In 'listOfTasks': "Each top-level task is defined such that its execution is independent of the others: if one task is executed
In 'resetModel': "When the resetModel attribute is set to "true", the individual repeats of the task may be parallelizable,
In 'listOfSubmodels': "In some cases, it may be possible to run some subtasks in parallel. Interpreters may use the order
(I changed to this from my previous list of requirements, which was too restrictive. Instead, I'm basically just punting to interpreters to determine this on their own.)
In 'order': "Leaving the order undefined for a SubTask implies that the SubTask may be executed before or after Any order value does not imply whether the SubTask may be executed in parallel with other SubTask
It is important to support very large simulation experiments and experiments that can be parallelized. It must be possible to execute a SED-ML simulation in parallel, based on information about which tasks can be parallelized and which depend on each other.
This requires something like TaskTrees and an indication of dependent and independent tasks, also at the level of the data generators, i.e. which tasks have to be executed before a DataGenerator can be calculated.
The idea is to have an SED-ML file with execution automatically parallelized based on the information within it.
The data generators already indicate which tasks they depend on via the variables they use (which are associated with tasks). RepeatedTasks indicate which Task/RepeatedTask they depend on, and the reset flag indicates the dependency between iterations within a repeatedTask. It remains to check what additional information is needed to help the task execution phase.
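A minimal Python sketch of the idea in the last paragraph: given the tasks referenced by a data generator's variables, follow the RepeatedTask-to-subtask links to find every task that has to run before the generator can be calculated. The inputs are assumed to be already extracted from the SED-ML document into plain Python structures; the ids and the `tasks_required` function are hypothetical.

```python
def tasks_required(referenced_tasks, subtasks_of):
    """Return every task that must run before a data generator can be calculated.

    referenced_tasks: task ids referenced by the data generator's variables
    subtasks_of:      mapping from a repeated task id to the task ids it runs
    """
    required = set()
    stack = list(referenced_tasks)
    while stack:
        task = stack.pop()
        if task in required:
            continue  # already visited, also guards against accidental cycles
        required.add(task)
        # A repeated task depends on its subtasks; a simple task has none.
        stack.extend(subtasks_of.get(task, []))
    return required


# Hypothetical nesting: repeat_outer runs repeat_inner, which runs task1.
subtasks_of = {"repeat_outer": ["repeat_inner"], "repeat_inner": ["task1"]}
print(sorted(tasks_required(["repeat_outer"], subtasks_of)))
```

Data generators whose required task sets do not overlap are candidates for independent execution; whether the iterations inside a repeated task can also be run independently would additionally depend on the reset flag.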