Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Return gpu stats as json #61

Closed
elukey opened this issue Apr 11, 2019 · 13 comments

Comments

@elukey
Copy link

commented Apr 11, 2019

Hi!

I am wondering if rocm_smi.py could be extended with an option to return data formatted as json string. Something similar to what --save returns, but rather than emitting metrics to a file I'd like to get them directly to stdout. I am thinking about creating a Prometheus exporter for AMD gpus to export metrics, this could help a lot. Would you consider a pull request if I made one?

Thanks!

@jlgreathouse

This comment has been minimized.

Copy link
Contributor

commented Apr 11, 2019

If you want to get the stats to stdout instead of a file, have you tried rocm-smi --save /dev/stdout? This will mix the normal rocm-smi printouts with what you want, so you could instead do, e.g.:

filename=`mktemp -u`
rocm-smi --save ${filename} > /dev/null
cat ${filename}
rm -f ${filename}
@elukey

This comment has been minimized.

Copy link
Author

commented Apr 12, 2019

@jlgreathouse thanks for the answer! Tested the stdout solution that you proposed but I'd prefer to get json only, the mixed content is a bit cumbersome to parse. The solution about the filename is ok (and it was more or less what I thought in the first place as solution) but having something like --print-json or similar would surely bypass the need to create temporary files. Not a big deal but it would allow a lighter logic in the Prometheus exporter in my opinion, and probably not a big change in the code? (the logic to dump the json could be easily reused to print to stdout).

Moreover, little nit that I realized now - the json content of --save doesn't include the gpu's temperature, is it on purpose?

elukey added a commit to elukey/ROC-smi that referenced this issue Apr 12, 2019
Add functionality to print json-encode GPU metrics to stdout
This change factors out the json encoding bits of save() in order
to allow print_json() to reuse the same logic.
Drawback: as consequence of this refactor a log related to what
setting is saved to a file (when using --save) has been removed.

GH: RadeonOpenCompute#61
elukey added a commit to elukey/ROC-smi that referenced this issue Apr 12, 2019
Add functionality to print json-encode GPU metrics to stdout
This change factors out the json encoding bits of save() in order
to allow print_json() to reuse the same logic.
Drawback: as consequence of this refactor a log related to what
setting is saved to a file (when using --save) has been removed.

GH: RadeonOpenCompute#61
@elukey

This comment has been minimized.

Copy link
Author

commented Apr 12, 2019

Basically my idea is the following:

elukey@86232e5

Not sure if it makes sense or not for you, let me know :)

@kentrussell

This comment has been minimized.

Copy link
Contributor

commented Apr 18, 2019

I actually was doing this task to support JSON output, and just saw your patch. I'll see if I can merge the two to allow for something that we both are happy with :)

My thought was to JSON-ize the output, so it only gives you what you want to have (so you could do something like rocm-smi --json -i -t -P" to get a JSON values for GPU ID/temperature/power. It will definitely take a bit more work, but means that we don't have unwanted values if we can help it, which is also along the lines of the original ask. If you're good with that too, I'll hopefully be able to get it into 2.5

@elukey

This comment has been minimized.

Copy link
Author

commented Apr 18, 2019

@kentrussell it is definitely good for me, I am not in a real hurry. We are currently building our GPU infrastructure and wondering what it is needed (monitoring is a good one for example!).

Let me know if I can help!

@kentrussell

This comment has been minimized.

Copy link
Contributor

commented Apr 18, 2019

Thanks @elukey . I've actually got a patch up internally for review, so I should be able to get this into 2.5 . I think that they've started testing 2.4 so I may have missed the cutoff, but I can try push it in

@kentrussell

This comment has been minimized.

Copy link
Contributor

commented May 10, 2019

This will be in 2.5

@kentrussell

This comment has been minimized.

Copy link
Contributor

commented Jun 7, 2019

In 2.5

@kentrussell kentrussell closed this Jun 7, 2019

@elukey

This comment has been minimized.

Copy link
Author

commented Jul 9, 2019

@kentrussell Hi! I just tested 2.5 and I found something strange:

~$ /opt/rocm/bin/rocm-smi  --json


========================ROCm System Management Interface========================
ERROR: Cannot print JSON output for concise output (no flags)

But if I add other flags like:

~$ /opt/rocm/bin/rocm-smi --showtemp --showuse --json --showperflevel --showpower


========================ROCm System Management Interface========================
{"card1": {"Temperature (Sensor #1)": "28.0 c", "Current GPU use": "0%", "Current Performance Level": "auto", "Average Graphics Package Power": "7.0W"}}
==============================End of ROCm SMI Log ==============================

Everything works fine. I would suggest two improvements if possible:

  1. the === ROCm System Management etc.. head/tail should not be present if --json is used (otherwise one has to discard part of the output instead of directly interpreting it as json).
  2. allow --json to work alone (without other flags).

Not sure if I am missing some context, in case apologies, let me know your thoughts :)

Thanks for your work!

@elukey

This comment has been minimized.

Copy link
Author

commented Jul 9, 2019

Another comment after trying to parse this with Python. It would be more convenient if the JSON keys/values could follow a format, and avoid extra info. For example:

  • Temperature, "28.0 c" could become only "28.0" (assuming a format of Celsius)
  • "Current GPU use" could be "Current GPU use percent" and leave the value only numeric
  • Same thing for the Power and other values..

Moreover, keys like "Temperature (Sensor #1)" are a bit misleading.. Should we expect more than one sensor? If so, could we have something like temperature_sensor_1: 28.0, etc.. ?
Same for the other keys:

  • Current GPU use -> gpu_usage_percent
  • Current Performance Level -> current_perf_level
  • etc..

I don't mean to nitpick, just to be more concise when emitting data in JSON format, since it will be parsed by some tool and probably not by humans (they will use the regular CLI tool view).

Let me know your thoughts :) I can help and send pull requests if needed!

@kentrussell

This comment has been minimized.

Copy link
Contributor

commented Jul 15, 2019

@elukey Nitpicking is encouraged, we need this to be useful and accurate and usable. I'll get on this for 2.7, as well as ensuring that I add it to my test list before submitting patches that add or change output,

Certain GPUs have multiple temperature sensors (Vega20+), so that will print out something different, but I can try to change the formatting a bit to reflect that. For example on a VG20 card:

GPU[1] : Temperature (edge): 31.0 c
GPU[1] : Temperature (junction): 32.0 c
GPU[1] : Temperature (mem): 31.0 c

If the GPU doesn't have the sensor label (2.6-and-newer), it just says the sensor number, so there is a bit of backwards compatibility required as well.

As for --json working without any other flags, it SHOULD work that way, I obviously missed something :)

@elukey

This comment has been minimized.

Copy link
Author

commented Jul 15, 2019

@kentrussell thanks a lot! In the meantime, this is a first dashboard created with Prometheus metrics from rocm-smi:

https://grafana.wikimedia.org/d/ZAX3zaIWz/amd-rocm-gpu

@kentrussell

This comment has been minimized.

Copy link
Contributor

commented Jul 19, 2019

So I have done some preliminary work for this. I just wanted to clarify one thing. The --json with no other flags isn't expected to work, as rocm-smi with no flags prints out a specially-formatted string with critical info. --json -a will get everything, and --json $flags will get it for those specific flags, but the concise format is a special case. I will probably try to revisit it later on though.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
3 participants
You can’t perform that action at this time.