Conversation

@amaslenn
Contributor

Summary

Remove the node list definition from the Slurm partition.

  1. Added get_nodes_by_spec() with logic shared by many CmdGen strategies (see the sketch after this list).
  2. Changed the updated() logic to append found nodes to a list instead of updating a pre-defined one.
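For illustration, a minimal sketch of how such a spec-resolution helper could look; the signature, the hostlist expansion, and the handling of the "partition:group:count" form are assumptions based on the scenario below, not the actual CloudAI code:

import re
from typing import List


def expand_hostlist(spec: str) -> List[str]:
    # Expand 'node-[012,026]' or 'node1,node2' into individual node names.
    match = re.fullmatch(r"(.+)\[(.+)\]", spec)
    if match:
        prefix, suffixes = match.groups()
        return [f"{prefix}{s}" for s in suffixes.split(",")]
    return spec.split(",")


def get_nodes_by_spec(specs: List[str]) -> List[str]:
    # Resolve specs like 'node1,node2', 'node-[012,026]', or 'partition:group:count'.
    nodes: List[str] = []
    for spec in specs:
        if spec.count(":") == 2:
            # 'partition:group:count' form; the real implementation would query
            # Slurm for free nodes in that group instead of this placeholder.
            partition, group, count = spec.split(":")
            nodes.append(f"<{count} node(s) from {partition}/{group}>")
        else:
            nodes.extend(expand_hostlist(spec))
    return nodes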

Test Plan

  1. CI with extended tests.
  2. Manual, see below.

Scenario based on sleep test:

[[Tests]]
id = "Tests.2"
test_name = "sleep_5"
num_nodes = 1
time_limit = "00:05:00"

[[Tests]]
id = "Tests.3"
test_name = "sleep_5"
nodes = ["node1,node2"]
time_limit = "00:05:00"

[[Tests]]
id = "Tests.4"
nodes = ["node-[012,026]"]
test_name = "sleep_5"
time_limit = "00:05:00"

[[Tests]]
id = "Tests.5"
nodes = ["batch:grp1:max_avail"]
test_name = "sleep_5"
time_limit = "00:05:00"

[[Tests]]
id = "Tests.6"
nodes = ["batch:grp1:1"]
test_name = "sleep_5"
time_limit = "00:05:00"

System config has this group:

[[partitions.groups]]
name = "grp1"
nodes = ["node1,node2"]

Tests.5 and Tests.6 did not run because the cluster was quite busy, but the errors are expected and the other cases were not affected. Here is an error example:

[ERROR] Error occurred while allocating nodes from group 'grp1' in partition 'batch': CloudAI is requesting 1 nodes from the group 'grp1', but only 0 nodes are available. Please review the available nodes in the system and ensure there are enough resources to meet the requested node count. Additionally, verify that the system can accommodate the number of nodes required by the test scenario.
Traceback (most recent call last):
  File ".../cloudai/src/cloudai/systems/slurm/slurm_system.py", line 370, in get_available_nodes_from_group
    allocated_nodes = self.allocate_nodes(grouped_nodes, number_of_nodes, group_name)
  File ".../cloudai/src/cloudai/systems/slurm/slurm_system.py", line 467, in allocate_nodes
    raise ValueError(
ValueError: CloudAI is requesting 1 nodes from the group 'grp1', but only 0 nodes are available. Please review the available nodes in the system and ensure there are enough resources to meet the requested node count. Additionally, verify that the system can accommodate the number of nodes required by the test scenario.

Additional Notes

Added get_nodes_by_spec() with logic that is used by many CmdGen strategies.
TaekyungHeo previously approved these changes Feb 28, 2025
@amaslenn dismissed TaekyungHeo’s stale review February 28, 2025 14:31

The merge-base changed after approval.

@TaekyungHeo
Member

Tests.5 and Tests.6 did not run because the cluster was quite busy, but the errors are expected and the other cases were not affected. Here is an error example:

Could you please clarify this? Does this PR fail to retrieve the node list when Slurm is busy? If so, we need to consider retrying multiple times. I recall that squeue does not work when Slurm is busy, and I had to retry multiple times.

@amaslenn
Contributor Author

amaslenn commented Mar 3, 2025

Tests.5 and Tests.6 did not run because the cluster was quite busy, but the errors are expected and the other cases were not affected. Here is an error example:

Could you please clarify this? Does this PR fail to retrieve the node list when Slurm is busy? If so, we need to consider retrying multiple times. I recall that squeue does not work when Slurm is busy, and I had to retry multiple times.

This logic is unchanged by this PR; this is what we currently have: if the user asks for a particular number of nodes in a group, we check that (1) there are enough nodes in the group and (2) there are enough nodes available for allocation right now; otherwise an exception is raised. I think we decided some time ago that this is what we want.
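For context, here is a rough, illustrative sketch of that existing two-step check; the argument names and the split into two lists are simplified assumptions, not the exact allocate_nodes() signature from slurm_system.py:

from typing import List


def allocate_nodes(group_nodes: List[str], available_nodes: List[str], requested: int, group_name: str) -> List[str]:
    # (1) the group itself must contain enough nodes
    if requested > len(group_nodes):
        raise ValueError(f"Group '{group_name}' has only {len(group_nodes)} nodes, but {requested} were requested.")
    # (2) enough of those nodes must be free for allocation right now
    if requested > len(available_nodes):
        raise ValueError(
            f"CloudAI is requesting {requested} nodes from the group '{group_name}', "
            f"but only {len(available_nodes)} nodes are available."
        )
    return available_nodes[:requested]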

Contributor

@srivatsankrishnan left a comment


Looks like this just removes the node list but mostly retains the existing functionality. The existing CW/EoS TOML will need to be updated to remove the static node list. I will do it for CW/EoS in the cloudAIx repo.
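For illustration only, a hypothetical before/after of such a system TOML change; the partition keys and the node list below are made-up examples based on the group snippet above, not the verified CloudAI schema:

# Before: the partition carries a static node list.
[[partitions]]
name = "batch"
nodes = ["node-[001-064]"]

# After this PR: the static node list is dropped; nodes are discovered from Slurm at runtime.
[[partitions]]
name = "batch"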

@amaslenn merged commit 266e3b2 into main Mar 10, 2025
2 checks passed
@amaslenn deleted the am/slurm-system branch March 10, 2025 10:20
@TaekyungHeo
Member

@amaslenn, could you please update CloudAIX accordingly for all system configuration files? I am running some commands on EOS with the current system schema and have found that the existing system schema files do not work—specifically, the install command (I have not tested others). This PR will impact our users.
