Auto-resources plan.
Auto-resources in libEnsemble should be reviewed. I propose that it be updated in the following ways:
- To be able to work independently of jobcontroller.
- To work under all communicators.
- To be more flexible for different configurations and dynamic updates.
The third point can be improved with some changes under the current scheme. However, a larger restructuring to employ manager-side resources is required to fully satisfy all three.
Following are the current considerations and some possible solutions. Please offer opinions.
Motivation for Auto-resources:
- Some systems will launch jobs on top of each other within an active node allocation (unless specific nodes are specified through node-lists or machine files). This includes Bebop and any system lacking a proper application-level scheduler. It is also possible that a system may launch applications onto different nodes (from different workers) when in fact we want them to share nodes.
- We wish to maintain a capability for a given worker to be associated with a given resource (nodes/set of nodes), with possible access to persistent node level data storage (e.g. a mesh).
- Different node types could exist in an allocation. It is possible that a given worker wants to be associated with a given type of node.
Note: The role of auto-resources can be taken on by the user themselves to get around such issues. However, this places a significant burden on them.
Limitations of the current auto-resources implementation:
- Tied to jobcontroller
- Potential waste in default setup
- Does not work with TCP.
- Requires update to work properly with worker-blocking.
The current implementation of auto-resources uses the jobcontroller and interrogates global resource information from the worker. This does not work in, for example, the TCP case, when the worker may not have this information. This may be resolved by switching to a manager-side partitioning, which would send a resource partition to the worker for each sim.
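A minimal sketch of what manager-side partitioning could look like, assuming the manager holds the global node list and ships each worker's slice with the work unit (none of these names are existing libEnsemble API):

```python
# Hypothetical manager-side partitioning: divide the node list evenly
# among workers, then ship each worker's slice with every sim.

def partition_nodes(global_nodelist, num_workers):
    """Split the allocation's node list into one sub-list per worker."""
    per_worker = len(global_nodelist) // num_workers
    return {w: global_nodelist[(w - 1) * per_worker: w * per_worker]
            for w in range(1, num_workers + 1)}

partition = partition_nodes(['n01', 'n02', 'n03', 'n04'], 2)
# In the allocation function, worker w's slice would then be attached to
# the work unit (e.g. Work[w]['libE_info']['resources'] = partition[w]),
# so a TCP worker needs no global resource knowledge of its own.
```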
While working with TCP requires manager-side resourcing, I think the other issues above could be made to work with either approach. That is, working with the current worker-blocking feature should require only an addition to the worker code to sample the 'blocked' worker list and re-compute the current worker's resources (if the list has changed). An option for re-partitioning resources to workers while running a zero-resource generator could also be implemented with the necessary gen options and an addition to the allocation functions. The latter would be a synchronisation point, as resources could only be safely re-partitioned when all the affected workers are available. There would also need to be a way of updating when the generator is finished.
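As a sketch of the worker-side addition mentioned above (names illustrative, not current API): when the blocked list changes, the worker extends its own node list with the nodes of the workers it has blocked:

```python
def recompute_local_nodelist(worker_nodelists, my_id, blocked):
    """Combine this worker's nodes with those of the workers it blocks."""
    local = list(worker_nodelists[my_id])
    for w in blocked:
        local.extend(worker_nodelists[w])
    return local

# e.g. worker 1 runs a two-node sim by blocking worker 2:
recompute_local_nodelist({1: ['n01'], 2: ['n02']}, 1, [2])  # ['n01', 'n02']
```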
Supporting info:
Requires update to work properly with worker-blocking:
A scheme currently exists to enable generators to specify a resource requirement for a particular sim (e.g. a 'num_nodes' variable can be specified in sim_specs['in'] and gen_specs['out'], with the values hence provided to sim_f via H).
The allocation function (e.g. give_sim_work_first) can use these to 'block' a sufficient number of available workers to provide the resources, and assign the work to one of the workers (the workerIDs for 'stolen' resources are supplied to that worker through libe_info). Note that the resource share per worker gives a minimum granularity at which resources can be re-assigned under this scheme, although a worker does not have to run on all of its assigned resources.
Different ways of specifying the resource requirement are possible with co-operation of the generator and alloc_func (e.g. number of workers to use, proportion of resources, and potentially type of node). Note that there is currently only one example doing this in the regression tests, and it does not make use of the jobcontroller or auto-resources; instead, it takes the libEnsemble machinefile as an argument. I intend to create a forces example variant which will make use of this feature.
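For concreteness, a hedged sketch of the specs involved (my_gen and my_sim are hypothetical user functions, and the 'blocking' key is an assumption about how the alloc_func could supply the stolen workerIDs):

```python
def my_gen(H, persis_info, gen_specs, libE_info): ...   # hypothetical
def my_sim(H, persis_info, sim_specs, libE_info): ...   # hypothetical

gen_specs = {'gen_f': my_gen,
             'out': [('x', float, 2),
                     ('num_nodes', int)]}  # per-sim resource request

sim_specs = {'sim_f': my_sim,
             'in': ['x', 'num_nodes'],     # request reaches sim_f via H
             'out': [('f', float)]}

# In the alloc_func, the blocked workerIDs might be shipped as e.g.:
#   Work[w]['libE_info']['blocking'] = [3, 4]  # workers whose resources are 'stolen'
```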
For this feature to work with the current auto-resources implementation, an addition to the worker code is required to sample the blocked worker list and re-compute the current worker's resources if the list has changed.
To do this with a non-jobcontroller-based manager-side resource setup, the required resources for each calc will be determined in the allocation function (by accessing the resource module) and provided to workers via libe_info.
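A minimal sketch of such a manager-side resource module, assuming a flat pool of free nodes tracked on the manager (all names illustrative):

```python
class ManagerResources:
    """Track free nodes on the manager; the alloc function claims nodes
    per calc and attaches them to the work unit via libe_info."""

    def __init__(self, nodelist):
        self.free = list(nodelist)

    def claim(self, num_nodes):
        """Remove num_nodes from the free pool and return them."""
        assert len(self.free) >= num_nodes, 'Insufficient free nodes'
        claimed, self.free = self.free[:num_nodes], self.free[num_nodes:]
        return claimed

    def release(self, nodes):
        """Return nodes to the pool when the calc's results come back."""
        self.free += nodes
```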
Potential waste in default setup
One simple frustration with the default behaviour of dividing up resources among workers is that a share of resources is also applied to workers running a generator. Often these require no resources (outside of the libEnsemble code itself), and their share is wasted while the generator is running. In some examples the generator is very quick, so this does not impact resource usage much. For longer-running or persistent generators, this could be a significant waste.
If a worker running a generator (possibly persistent) does not require resources, the default division of resources should subtract one from the worker count when calculating the other workers' share. Note that even with persistent workers, this is not necessarily a permanent state. Doing this safely, in the context of statically controlled application scheduling (i.e. explicitly managing resources and not over-subscribing), implies a synchronisation point for all the remaining workers. That is, instead of stealing worker resources and blocking, a re-partition command would be sent (via libe_info), but this would have to cycle until all those workers were available, to prevent potential clashes of resources.
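To illustrate the adjusted share arithmetic (the numbers are made up):

```python
num_nodes = 8
num_workers = 5
num_zero_resource = 1                     # e.g. one persistent generator
active = num_workers - num_zero_resource
nodes_per_worker = num_nodes // active    # 2 nodes each, vs. 8 // 5 = 1
```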
A repartition from a configuration with a zero-resource persistent worker back to one without would likely cause an uneven (and possibly unachievable) partition. The user would need to be aware of this.
Limitations of manager-side auto-resources:
Issue of probing on-node resources
To determine on-node resources such as core count, currently workers examine available information and if necessary launch a probe onto the nodes to perform a count. Under some circumstances this may be either costly or impossible to do from the manager directly. So on-node resource determination may still require a worker-side component.
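A hedged sketch of such a worker-side probe, assuming a Slurm system (other launchers would need a different command line):

```python
import subprocess

def probe_cores(node):
    """Launch a one-line Python command onto a node to count its cores."""
    cmd = ['srun', '-w', node, '-N', '1', '-n', '1', 'python', '-c',
           'import multiprocessing; print(multiprocessing.cpu_count())']
    return int(subprocess.check_output(cmd))
```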
Pools of resources
For scenarios with different node types or other heterogeneous resources (including the possibility of different machines over TCP), separation of available resources into pools may be required. Again, generator options and co-operating allocation functions could be used to assign workers to pools.
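A sketch of what such pools might look like, with workers assigned to pools via generator/alloc options (all names illustrative):

```python
pools = {'cpu': ['node01', 'node02', 'node03'],
         'gpu': ['gpunode01', 'gpunode02']}

worker_pool = {1: 'gpu', 2: 'cpu', 3: 'cpu'}  # e.g. set from gen options

def nodes_for_worker(w):
    """Resources available to worker w, restricted to its pool."""
    return pools[worker_pool[w]]
```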