Conversation

@amaslenn
Contributor

Summary

Remove the node list definition from the Slurm partition.

  1. Added get_nodes_by_spec() with logic shared by many CmdGen strategies (see the sketch after this list).
  2. Changed the updated() logic to append found nodes to a list instead of updating a pre-defined one.
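For illustration, a minimal sketch of how such a spec-resolution helper could look; the signature, the hostlist expansion, and the handling of the "partition:group:count" form are assumptions based on the scenario below, not the actual CloudAI code:

import re
from typing import List


def expand_hostlist(spec: str) -> List[str]:
    # Expand 'node-[012,026]' or 'node1,node2' into individual node names.
    match = re.fullmatch(r"(.+)\[(.+)\]", spec)
    if match:
        prefix, suffixes = match.groups()
        return [f"{prefix}{s}" for s in suffixes.split(",")]
    return spec.split(",")


def get_nodes_by_spec(specs: List[str]) -> List[str]:
    # Resolve specs like 'node1,node2', 'node-[012,026]', or 'partition:group:count'.
    nodes: List[str] = []
    for spec in specs:
        if spec.count(":") == 2:
            # 'partition:group:count' form; the real implementation would query
            # Slurm for free nodes in that group instead of this placeholder.
            partition, group, count = spec.split(":")
            nodes.append(f"<{count} node(s) from {partition}/{group}>")
        else:
            nodes.extend(expand_hostlist(spec))
    return nodes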

Test Plan

  1. CI with extended tests.
  2. Manual, see below.

Scenario based on sleep test:

[[Tests]]
id = "Tests.2"
test_name = "sleep_5"
num_nodes = 1
time_limit = "00:05:00"

[[Tests]]
id = "Tests.3"
test_name = "sleep_5"
nodes = ["node1,node2"]
time_limit = "00:05:00"

[[Tests]]
id = "Tests.4"
nodes = ["node-[012,026]"]
test_name = "sleep_5"
time_limit = "00:05:00"

[[Tests]]
id = "Tests.5"
nodes = ["batch:grp1:max_avail"]
test_name = "sleep_5"
time_limit = "00:05:00"

[[Tests]]
id = "Tests.6"
nodes = ["batch:grp1:1"]
test_name = "sleep_5"
time_limit = "00:05:00"

System config has this group:

[[partitions.groups]]
name = "grp1"
nodes = ["node1,node2"]

Tests.5 and Tests.6 did not run because the cluster was quite busy, but the errors are expected and the other cases were not affected. Here is an error example:

[ERROR] Error occurred while allocating nodes from group 'grp1' in partition 'batch': CloudAI is requesting 1 nodes from the group 'grp1', but only 0 nodes are available. Please review the available nodes in the system and ensure there are enough resources to meet the requested node count. Additionally, verify that the system can accommodate the number of nodes required by the test scenario.
Traceback (most recent call last):
  File ".../cloudai/src/cloudai/systems/slurm/slurm_system.py", line 370, in get_available_nodes_from_group
    allocated_nodes = self.allocate_nodes(grouped_nodes, number_of_nodes, group_name)
  File ".../cloudai/src/cloudai/systems/slurm/slurm_system.py", line 467, in allocate_nodes
    raise ValueError(
ValueError: CloudAI is requesting 1 nodes from the group 'grp1', but only 0 nodes are available. Please review the available nodes in the system and ensure there are enough resources to meet the requested node count. Additionally, verify that the system can accommodate the number of nodes required by the test scenario.

Additional Notes

Added get_nodes_by_spec() with logic that is used by many CmdGen strategies.
TaekyungHeo previously approved these changes Feb 28, 2025
@amaslenn dismissed TaekyungHeo’s stale review February 28, 2025 14:31

The merge-base changed after approval.

@TaekyungHeo
Member

Tests.5 and Tests.6 did not run because the cluster was quite busy, but the errors are expected and the other cases were not affected. Here is an error example:

Could you please clarify this? Does this PR fail to retrieve the node list when Slurm is busy? If so, we need to consider retrying multiple times. I recall that squeue does not work when Slurm is busy, and I had to retry multiple times.

@amaslenn
Contributor Author

amaslenn commented Mar 3, 2025

Tests.5 and Tests.6 did not run because the cluster was quite busy, but the errors are expected and the other cases were not affected. Here is an error example:

Could you please clarify this? Does this PR fail to retrieve the node list when Slurm is busy? If so, we need to consider retrying multiple times. I recall that squeue does not work when Slurm is busy, and I had to retry multiple times.

This logic is unchanged by this PR; this is what we currently have: if the user asks for a particular number of nodes in a group, we check that (1) there are enough nodes in the group and (2) there are enough nodes available for allocation right now; otherwise an exception is raised. I think we decided some time ago that this is what we want.
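For context, here is a rough, illustrative sketch of that existing two-step check; the argument names and the split into two lists are simplified assumptions, not the exact allocate_nodes() signature from slurm_system.py:

from typing import List


def allocate_nodes(group_nodes: List[str], available_nodes: List[str], requested: int, group_name: str) -> List[str]:
    # (1) the group itself must contain enough nodes
    if requested > len(group_nodes):
        raise ValueError(f"Group '{group_name}' has only {len(group_nodes)} nodes, but {requested} were requested.")
    # (2) enough of those nodes must be free for allocation right now
    if requested > len(available_nodes):
        raise ValueError(
            f"CloudAI is requesting {requested} nodes from the group '{group_name}', "
            f"but only {len(available_nodes)} nodes are available."
        )
    return available_nodes[:requested]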

Contributor

@srivatsankrishnan left a comment


Looks like this just removes the node list but mostly retains the existing functionality. The existing CW/EoS TOML will need to be updated to remove the static node list. I will do it for CW/EoS in the cloudAIx repo.
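For illustration only, a hypothetical before/after of such a system TOML change; the partition keys and the node list below are made-up examples based on the group snippet above, not the verified CloudAI schema:

# Before: the partition carries a static node list.
[[partitions]]
name = "batch"
nodes = ["node-[001-064]"]

# After this PR: the static node list is dropped; nodes are discovered from Slurm at runtime.
[[partitions]]
name = "batch"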

@amaslenn merged commit 266e3b2 into main Mar 10, 2025
2 checks passed
@amaslenn deleted the am/slurm-system branch March 10, 2025 10:20
@TaekyungHeo
Member

@amaslenn, could you please update CloudAIX accordingly for all system configuration files? I am running some commands on EOS with the current system schema and have found that the existing system schema files do not work—specifically, the install command (I have not tested others). This PR will impact our users.
