feat: isolate core ado ray actors during explore operation

**Is your feature request related to a problem? Please describe.** 

The core ado Ray actors started for an explore operation can be scheduled on any Ray worker. However, if an experiment is scheduled on the same worker and, for example, goes OOM, the core actors are killed along with it, and the whole ado Ray job crashes or hangs with no chance of recovery.

**Describe the solution you'd like**. 

Core ado Ray actors are scheduled to a node isolated from experiments. This could be the head node.

This requires tagging a Ray cluster worker with a custom resource that can be requested when starting the actors, e.g. "operation-actors" or "cluster-head-node", AND tagging a KubeRay worker with the same resource label.

The easiest option would be the head node.
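A minimal sketch of the two sides of this tagging, assuming the resource is called "cluster-head-node" (one of the names suggested above). The helper names `node_resources_json` and `isolated_actor_options` are hypothetical and not part of ado. The sketch only builds the values; it does not start Ray.

```python
import json

# Assumed resource name, taken from the suggestions in this issue.
ISOLATION_RESOURCE = "cluster-head-node"


def node_resources_json(capacity: int = 1) -> str:
    """Return the JSON string a node would advertise as its custom resources.

    This is the value one would pass to `ray start --resources=...`
    on the node that should host the core actors.
    """
    return json.dumps({ISOLATION_RESOURCE: capacity})


def isolated_actor_options(amount: float = 0.001) -> dict:
    """Return keyword arguments for `ActorClass.options(...)`.

    Requesting a small fraction of the custom resource restricts the
    actor to nodes that advertise it, without exhausting the resource.
    """
    return {"resources": {ISOLATION_RESOURCE: amount}}


if __name__ == "__main__":
    print(node_resources_json())
    print(isolated_actor_options())
```

On a plain Ray cluster the first value would be given to the tagged node at start-up, e.g. `ray start --head --resources='{"cluster-head-node": 1}'`. With KubeRay, the same resource would presumably be declared in the start parameters of the corresponding head or worker group.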

**Additional context**. 

- The operation must still run even if the resource is not available, e.g. probe for the resource "cluster-head-node" and, if it is not available, fall back to the current behaviour
- The core actors are: discovery-space-manager, actuators, operator
- There is a case where the actuators do not spawn experiments remotely, so the isolated node itself could go OOM, killing all core actors. If this node is also the head node, the cluster will go down.
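The probe-and-fall-back behaviour from the first point could look roughly like the sketch below. It assumes the cluster's resources are available as a mapping of resource name to capacity, which is the shape returned by `ray.cluster_resources()`. The function name `core_actor_options` and the default resource name are hypothetical.

```python
from typing import Mapping


def core_actor_options(
    cluster_resources: Mapping[str, float],
    resource_name: str = "cluster-head-node",  # assumed name from this issue
    amount: float = 0.001,
) -> dict:
    """Pick scheduling options for a core actor.

    If some node advertises the isolation resource, request a small
    fraction of it so the actor lands on that node. Otherwise return
    no options, which keeps the current behaviour (any worker).
    """
    if cluster_resources.get(resource_name, 0) > 0:
        return {"resources": {resource_name: amount}}
    return {}


if __name__ == "__main__":
    # Cluster with a tagged node: the actor is pinned to it.
    print(core_actor_options({"CPU": 16.0, "cluster-head-node": 1.0}))
    # Cluster without the tag: fall back to default scheduling.
    print(core_actor_options({"CPU": 16.0}))
```

In ado this would be used along the lines of `SomeCoreActor.options(**core_actor_options(ray.cluster_resources())).remote()`, where `SomeCoreActor` stands in for the discovery-space-manager, actuator, or operator actor classes. The sketch does not address the last point: if experiments run inside the actuators, pinning them to the head node concentrates the OOM risk there.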


feat: isolate core ado ray actors during explore operation #1001
