Is your feature request related to a problem? Please describe.
The core ado Ray actors started for an explore operation can be scheduled to any Ray worker. However, if an experiment is scheduled to the same worker and, for example, goes OOM, the core actors on that worker are killed as well, and the whole ado Ray job crashes or hangs with no chance of recovery.
Describe the solution you'd like.
The core ado Ray actors are scheduled to a node that is isolated from experiments. This could be the head node.
This requires tagging a Ray cluster worker with a custom resource that can be requested when starting the actors, e.g. "operation-actors" or "cluster-head-node", AND tagging the corresponding KubeRay worker with the same resource label.
The easiest option would be the head node.
Additional context.
- The operation must still run even if the resource is not available, e.g. probe for the resource "cluster-head-node" and, if it is not available, fall back to the current behaviour.
- The core actors are: discovery-space-manager, actuators, operator.
- There is a case where the actuators do not spawn experiments remotely, so the isolated node itself could go OOM, killing all core actors. If this node is also the head node, the cluster will go down.
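
The probe-and-fall-back behaviour described above could look roughly like the sketch below. It assumes the isolated node advertises a custom resource named "cluster-head-node" (for example via `ray start --resources='{"cluster-head-node": 1}'`). The helper names `core_actor_options` and `start_core_actor`, and the tiny requested amount, are hypothetical and not part of ado.

```python
def core_actor_options(cluster_resources, resource_name="cluster-head-node", amount=0.001):
    """Return Ray actor options that pin a core actor to the isolated node.

    `cluster_resources` is a dict such as the one returned by
    `ray.cluster_resources()`. If the custom resource is not advertised,
    an empty dict is returned so scheduling falls back to current behaviour.
    """
    if cluster_resources.get(resource_name, 0) > 0:
        # Request a small fraction so many core actors fit on the tagged node.
        return {"resources": {resource_name: amount}}
    return {}


def start_core_actor(actor_cls, *args, **kwargs):
    """Hypothetical helper: start a core actor, isolated when possible.

    `actor_cls` is assumed to be a class decorated with `@ray.remote`.
    """
    import ray  # imported lazily so the probe logic can be tested without Ray

    opts = core_actor_options(ray.cluster_resources())
    return actor_cls.options(**opts).remote(*args, **kwargs)
```

With this shape, the discovery-space-manager, actuators and operator actors would all go through `start_core_actor`, and a cluster without the tagged node behaves exactly as it does today. It does not address the case where actuators run experiments in-process on the isolated node.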