Remove node list definition from slurm partition #385
Conversation
Added `get_nodes_by_spec()` with logic that is used by many CmdGen strategies.
Could you please clarify this? Does this PR fail to retrieve the node list when Slurm is busy? If so, we need to consider retrying multiple times. I recall that `squeue` does not work when Slurm is busy, and I had to retry multiple times.
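For reference, a retry wrapper along the lines suggested above might look like this minimal sketch (the function name, retry count, and delay are assumptions for illustration, not actual CloudAI code):

```python
import subprocess
import time


def run_squeue_with_retries(retries: int = 5, delay_s: float = 10.0) -> str:
    """Run squeue, retrying on failure as suggested above (illustrative only)."""
    last_err: Exception | None = None
    for _ in range(retries):
        try:
            result = subprocess.run(
                ["squeue", "--noheader"],
                capture_output=True,
                text=True,
                check=True,
                timeout=60,
            )
            return result.stdout
        except (subprocess.CalledProcessError, subprocess.TimeoutExpired) as err:
            last_err = err
            time.sleep(delay_s)  # back off before the next attempt
    raise RuntimeError(f"squeue failed after {retries} attempts") from last_err
```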
This logic is unchanged by this PR; it is what we currently have: if the user asks for a particular number of nodes in a group, we check that (1) there are enough nodes in the group and (2) there are enough nodes available for allocation right now; otherwise an exception is raised. I think we decided this is what we want some time ago.
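To make the two checks concrete, here is a minimal sketch of the behavior described above, assuming hypothetical names (`pick_nodes_from_group`, `group_nodes`, and `available_nodes` are illustrative, not the actual `get_nodes_by_spec()` implementation):

```python
from typing import List


def pick_nodes_from_group(
    group_nodes: List[str], available_nodes: List[str], num_nodes: int
) -> List[str]:
    """Select num_nodes from a group, validating group size and current availability."""
    # (1) the group must define enough nodes at all
    if num_nodes > len(group_nodes):
        raise ValueError(
            f"Requested {num_nodes} nodes, but the group only has {len(group_nodes)}."
        )

    # (2) enough of the group's nodes must be free for allocation right now
    free = [node for node in group_nodes if node in available_nodes]
    if num_nodes > len(free):
        raise ValueError(
            f"Requested {num_nodes} nodes, but only {len(free)} are currently available."
        )

    return free[:num_nodes]
```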
Looks like this just removes the node list but mostly retains the existing functionality. The existing CW/EoS Toml will need to be updated to remove the static node list. I will do it for CW/EoS on the cloudAIx repo.
@amaslenn, could you please update CloudAIX accordingly for all system configuration files? I am running some commands on EOS with the current system schema and have found that the existing system schema files do not work, specifically with the install command (I have not tested others). This PR will impact our users.
Summary
Remove node list definition from Slurm partition.
- Added `get_nodes_by_spec()` with logic that is used by many CmdGen strategies.
- Updated the logic to add found nodes into a list instead of updating a pre-defined one.

Test Plan
Scenario based on sleep test:
System config has this group:
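(The group definition itself was not captured here; purely for illustration, a group in a Slurm system-config TOML might look like the sketch below. The partition name, group name, and node range are assumptions, not the actual EOS configuration.)

```toml
# Hypothetical system-config excerpt; all names and ranges are illustrative.
[[partitions]]
name = "partition_1"

[[partitions.groups]]
name = "group_1"
nodes = ["node-[001-004]"]
```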
`Tests.5` and `Tests.6` didn't run as the cluster is quite busy, but the errors are fine and the other cases were not affected; here is an error example:

Additional Notes