Added cluster_info #752

Merged: 21 commits from get-cluster-stats into master on Apr 22, 2022
Conversation

@lukew3 (Contributor) commented Mar 24, 2022

I can't test the current version because it uses the call method, which requires loading the Slurm config at /etc/slurm/slurm.conf, but it was tested and works with regular backtick calls. I'd also like feedback on my comments.


@lukew3 requested a review from johrstrom on March 24, 2022
@lukew3 (Contributor, Author) commented Mar 25, 2022

Tested and updated the call methods, which are now working as intended.

Comment on lines 105 to 107
sinfo_out2 = call("sinfo", "-Nhao %n/%G/%T").lines.uniq
gpu_total = sinfo_out2.grep(/gpu:/).count
gpu_free = sinfo_out2.grep(/gpu:/).grep(/idle/).count
@johrstrom (Contributor) commented Mar 31, 2022

sinfo -Nhao %n/%G/%T: I think this is just node status. I don't know if "allocated" means the GRES was also allocated. Meaning, I could have a job on a node that has a GPU without having requested or allocated that GPU.

I think we actually want something more like sinfo -Nha -O 'GresUsed', which indicates whether the GRES is in use or not (as opposed to whether the node is in use or not).

@treydock please advise: is that the right way to get the total GPUs in use vs. the total GPUs?

Contributor:

Oh I see, we're returning gpu_nodes_active. No, I think the better information is active GPUs: there are multiple GPUs on a given node.

I think reporting the number of GPU nodes is secondary to reporting the number of GPUs themselves.

@treydock (Contributor) commented Mar 31, 2022

Collecting the GRES in use with sinfo is actually a complicated process; I ran into this with the Slurm exporter. Information about used GRES is only available via the --Format flag, so the printf-style %G can only report what's available, not what's used.

First you must find the longest GRES line (they can get very long). You can likely cache this, since GRES definitions only change when slurm.conf and gres.conf are changed and slurmctld is restarted:

sinfo -o '%G'

Then get the GRES data:

sinfo -a -h --Node --Format=nodehost,gres:%d,gresused:%d

Replace %d with the length of the longest GRES string, plus a few characters if you want a buffer.

Contributor:

In case it's not clear: for the sinfo -o '%G' step you have to iterate over all lines, find the longest one, and use that length with --Format.
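
For illustration, the two-step query described above might look like the following Ruby sketch. This is not code from the PR: it shells out to sinfo directly rather than going through the adapter's call helper, and the +5 buffer and hash keys are arbitrary choices.

require "open3"

def sinfo(*args)
  out, _err, _status = Open3.capture3("sinfo", *args)
  out
end

# Step 1: find the longest GRES string so the --Format columns are wide enough.
# -h drops the header line; this result is cacheable, since GRES definitions
# only change when slurm.conf/gres.conf change and slurmctld is restarted.
gres_width = sinfo("-h", "-o", "%G").lines.map { |l| l.strip.length }.max.to_i + 5

# Step 2: pull per-node total GRES and used GRES at that width.
nodes = sinfo("-a", "-h", "--Node",
              "--Format=nodehost,gres:#{gres_width},gresused:#{gres_width}").lines.map do |line|
  host, gres, gres_used = line.split(/\s+/, 3)
  { host: host, gres: gres, gres_used: gres_used.to_s.strip }
end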

@lukew3 requested a review from johrstrom on April 5, 2022
@lukew3 changed the title from "Added cluster_stats method" to "Added cluster_info" on Apr 5, 2022
Comment on lines 104 to 106
def gpus_from_gres(gres)
  gres.to_s.scan(/gpu:[^,]*(\d+)/).flatten.map(&:to_i).sum
end
Contributor:

I don't think you need this redefinition here anymore.

Contributor:

If that is kept, I think that regex is wrong. Where it might cause issues:

gres/gpu:v100-32g=2

That's a GRES definition on Pitzer; there are numbers in the subtype. Depending on where you query the GRES, i.e. which command, you could also split on = and take the last element.

Contributor:

Also, less complicated clusters will have something like gres/gpu:a100=4 (for Ascend). This is why I think a regex is going to cause more problems than simply splitting on the known delimiters.
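
For illustration, the delimiter-splitting approach described here could look like the following Ruby sketch (not code from the PR), assuming entries of the gres/gpu:<subtype>=<count> form shown above:

# Illustrative only: take the count after the known "=" delimiter instead of using a regex.
def gpu_count_from_entry(entry)
  entry.split("=").last.to_i
end

gpu_count_from_entry("gres/gpu:v100-32g=2")  #=> 2
gpu_count_from_entry("gres/gpu:a100=4")      #=> 4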

@lukew3 (Contributor, Author) commented Apr 7, 2022

The regex still works on these cases (#754): it captures the last run of digits before a comma (or the end of the string) when preceded by gpu:, so this holds.
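
For illustration, here is how the PR's gpus_from_gres regex behaves on the GRES strings mentioned above (a quick sketch, not part of the PR's tests): the greedy [^,]* swallows the subtype digits, leaving only the trailing count for the capture group.

def gpus_from_gres(gres)
  gres.to_s.scan(/gpu:[^,]*(\d+)/).flatten.map(&:to_i).sum
end

gpus_from_gres("gres/gpu:v100-32g=2")        #=> 2 (subtype digits are not captured)
gpus_from_gres("gres/gpu:a100=4")            #=> 4
gpus_from_gres("gpu:v100-32g=2,gpu:a100=4")  #=> 6 (sums across comma-separated entries)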

Contributor Author:

> I don't think you need this redefinition here anymore.

gpus_from_gres isn't defined here since it's defined in Slurm while get_cluster_info is defined in Batch, which is nested in Slurm. Should Batch extend Slurm?

Contributor:

Lol that's hilarious. This'll do just fine in a pinch.

def gpus_from_gres
  @slurm.gpus_from_gres
end


Contributor Author:

> This'll do just fine in a pinch.
>
> def gpus_from_gres
>   @slurm.gpus_from_gres
> end

That doesn't work either:

undefined method `gpus_from_gres' for nil:NilClass

Contributor:

Right, because I've got it backwards. Yeah, I'd say let's move that function into Batch and then just reference it through @slurm.gpus_from_gres. Other adapters have a helper class, which is where it'd go, but alas, we did not use that here.

@lukew3 (Contributor, Author) commented Apr 8, 2022

> Right, because I've got it backwards. Yeah, I'd say let's move that function into Batch and then just reference it through @slurm.gpus_from_gres. Other adapters have a helper class, which is where it'd go, but alas, we did not use that here.

That doesn't work either: Slurm.parse_job_info can't call @batch.gpus_from_gres because it has no access to the Batch instance. To fix this, I defined gpus_from_gres just above the Batch class definition, so that it is accessible everywhere in both classes. 7eca11b
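
For illustration, a hypothetical sketch of that layout (class and method names follow this discussion, not the literal ood_core source): defining the helper at the adapter level, above the nested Batch class, lets both classes reach it without an @slurm or @batch handle, for example as a class-level method.

class Slurm
  # Shared helper defined above the nested class, callable as Slurm.gpus_from_gres.
  def self.gpus_from_gres(gres)
    gres.to_s.scan(/gpu:[^,]*(\d+)/).flatten.map(&:to_i).sum
  end

  class Batch
    def gpu_total(gres_string)
      Slurm.gpus_from_gres(gres_string)  # no instance reference needed
    end
  end
end

Slurm::Batch.new.gpu_total("gpu:v100-32g=2,gpu:a100=4")  #=> 6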

Co-authored-by: Jeff Ohrstrom <johrstrom@osc.edu>
@johrstrom merged commit 07aff91 into master on Apr 22, 2022
@johrstrom deleted the get-cluster-stats branch on April 22, 2022