Added cluster_info #752

Merged: 21 commits from get-cluster-stats into master on Apr 22, 2022
Conversation

@lukew3 (Contributor) commented Mar 24, 2022

I can't test the current version because it uses the call method, which requires loading the Slurm config at /etc/slurm/slurm.conf, but it was tested and works with regular backtick calls. I'd also like feedback on my comments.


@lukew3 requested a review from johrstrom on March 24, 2022
@lukew3 (Contributor, Author) commented Mar 25, 2022

Tested and updated the call methods, which are now working as intended.

Comment on lines 105 to 107
sinfo_out2 = call("sinfo", "-Nhao %n/%G/%T").lines.uniq
gpu_total = sinfo_out2.grep(/gpu:/).count
gpu_free = sinfo_out2.grep(/gpu:/).grep(/idle/).count
@johrstrom (Contributor) commented Mar 31, 2022

sinfo -Nhao %n/%G/%T: I think this is just node status. I don't know if "allocated" means the GRES was also allocated. Meaning, I could have a job on a node that has a GPU without having requested or allocated that GPU.

I think we actually want something more like sinfo -Nha -O 'GresUsed', which indicates whether the GRES is in use or not (as opposed to whether the node is in use or not).

@treydock please advise: is that the right way to get the total GPUs in use vs. the total GPUs?

Contributor:

Oh I see, we're returning gpu_nodes_active. No, I think the better information is active GPUs: there are multiple GPUs on a given node.

I think reporting the number of GPU nodes is secondary to reporting the number of GPUs themselves.

@treydock (Contributor) commented Mar 31, 2022

Collecting the GRES in use with sinfo is actually a complicated process; I ran into this with the Slurm exporter. Information about used GRES is only available via the --Format flag, so the printf-style %G can only report what's available, not what's used.

First you must find the longest GRES line (they can get very long). You can likely cache this, since GRES definitions only change when slurm.conf and gres.conf are changed and slurmctld is restarted:

sinfo -o '%G'

Then get the GRES data:

sinfo -a -h --Node --Format=nodehost,gres:%d,gresused:%d

Replace %d with the length of the longest GRES string, plus a few characters if you want a buffer.

Contributor:

In case it's not clear: for the sinfo -o '%G' step you have to iterate over all lines, find the longest one, and use that length with --Format.
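
For illustration, the two-step query described above might look like the following Ruby sketch. This is not code from the PR: it shells out to sinfo directly rather than going through the adapter's call helper, and the +5 buffer and hash keys are arbitrary choices.

require "open3"

def sinfo(*args)
  out, _err, _status = Open3.capture3("sinfo", *args)
  out
end

# Step 1: find the longest GRES string so the --Format columns are wide enough.
# -h drops the header line; this result is cacheable, since GRES definitions
# only change when slurm.conf/gres.conf change and slurmctld is restarted.
gres_width = sinfo("-h", "-o", "%G").lines.map { |l| l.strip.length }.max.to_i + 5

# Step 2: pull per-node total GRES and used GRES at that width.
nodes = sinfo("-a", "-h", "--Node",
              "--Format=nodehost,gres:#{gres_width},gresused:#{gres_width}").lines.map do |line|
  host, gres, gres_used = line.split(/\s+/, 3)
  { host: host, gres: gres, gres_used: gres_used.to_s.strip }
end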

@lukew3 requested a review from johrstrom on April 5, 2022
@lukew3 changed the title from "Added cluster_stats method" to "Added cluster_info" on Apr 5, 2022
Comment on lines 104 to 106
def gpus_from_gres(gres)
  gres.to_s.scan(/gpu:[^,]*(\d+)/).flatten.map(&:to_i).sum
end
Contributor:

I don't think you need this redefinition here anymore.

Contributor:

If that is kept, I think that regex is wrong. Where it might cause issues:

gres/gpu:v100-32g=2

That's a GRES definition on Pitzer; there are numbers in the subtype. Depending on where you query the GRES, i.e. which command, you could also split on = and take the last element.

Contributor:

Also, less complicated clusters will have something like gres/gpu:a100=4 (for Ascend). This is why I think a regex is going to cause more problems than simply splitting on the known delimiters.
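
For illustration, the delimiter-splitting approach described here could look like the following Ruby sketch (not code from the PR), assuming entries of the gres/gpu:<subtype>=<count> form shown above:

# Illustrative only: take the count after the known "=" delimiter instead of using a regex.
def gpu_count_from_entry(entry)
  entry.split("=").last.to_i
end

gpu_count_from_entry("gres/gpu:v100-32g=2")  #=> 2
gpu_count_from_entry("gres/gpu:a100=4")      #=> 4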

@lukew3 (Contributor, Author) commented Apr 7, 2022

The regex still works on these cases (#754): it captures the last run of digits before a comma (or the end of the string) when preceded by gpu:, so this holds.
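
For illustration, here is how the PR's gpus_from_gres regex behaves on the GRES strings mentioned above (a quick sketch, not part of the PR's tests): the greedy [^,]* swallows the subtype digits, leaving only the trailing count for the capture group.

def gpus_from_gres(gres)
  gres.to_s.scan(/gpu:[^,]*(\d+)/).flatten.map(&:to_i).sum
end

gpus_from_gres("gres/gpu:v100-32g=2")        #=> 2 (subtype digits are not captured)
gpus_from_gres("gres/gpu:a100=4")            #=> 4
gpus_from_gres("gpu:v100-32g=2,gpu:a100=4")  #=> 6 (sums across comma-separated entries)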

Contributor Author:

> I don't think you need this redefinition here anymore.

gpus_from_gres isn't defined here since it's defined in Slurm while get_cluster_info is defined in Batch, which is nested in Slurm. Should Batch extend Slurm?

Contributor:

Lol that's hilarious. This'll do just fine in a pinch.

def gpus_from_gres
  @slurm.gpus_from_gres
end


Contributor Author:

> This'll do just fine in a pinch.
>
> def gpus_from_gres
>   @slurm.gpus_from_gres
> end

That doesn't work either:

undefined method `gpus_from_gres' for nil:NilClass

Contributor:

Right, because I've got it backwards. Yeah, I'd say let's move that function into Batch and then just reference it through @slurm.gpus_from_gres. Other adapters have a helper class, which is where it'd go, but alas, we did not use that here.

@lukew3 (Contributor, Author) commented Apr 8, 2022

> Right, because I've got it backwards. Yeah, I'd say let's move that function into Batch and then just reference it through @slurm.gpus_from_gres. Other adapters have a helper class, which is where it'd go, but alas, we did not use that here.

That doesn't work either: Slurm.parse_job_info can't call @batch.gpus_from_gres because it has no access to the Batch instance. To fix this, I defined gpus_from_gres just above the Batch class definition, so that it is accessible everywhere in both classes. 7eca11b
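
For illustration, a hypothetical sketch of that layout (class and method names follow this discussion, not the literal ood_core source): defining the helper at the adapter level, above the nested Batch class, lets both classes reach it without an @slurm or @batch handle, for example as a class-level method.

class Slurm
  # Shared helper defined above the nested class, callable as Slurm.gpus_from_gres.
  def self.gpus_from_gres(gres)
    gres.to_s.scan(/gpu:[^,]*(\d+)/).flatten.map(&:to_i).sum
  end

  class Batch
    def gpu_total(gres_string)
      Slurm.gpus_from_gres(gres_string)  # no instance reference needed
    end
  end
end

Slurm::Batch.new.gpu_total("gpu:v100-32g=2,gpu:a100=4")  #=> 6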

Co-authored-by: Jeff Ohrstrom <johrstrom@osc.edu>
@johrstrom merged commit 07aff91 into master on Apr 22, 2022
@johrstrom deleted the get-cluster-stats branch on April 22, 2022