Make output fields self-identifying? #41

lars-t-hansen · 2023-06-15T08:28:13Z

As we experiment with sonar, add more fields, and move it to new systems, there will come a time when some log data are older and some are newer, and data of different ages will have different columns. For a while it will be good enough to always add new columns at the end of the line, but in the long term this will likely become brittle. It may be good to add an identifier to each field to reduce the impact of this problem. The identifier can be a short name and can prefix the field, for example,

date=...,host=...,cores=...,user=...,job=...,cmd=...,cpu%=...,kib=...,gpus=...,gpu%=...,gpumem%=...,gpukib=...

would do it for now. Names (columns) could be removed if they turn out to be worthless, but a name would never be reused with a different meaning.
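
A minimal sketch of how a log consumer might read such records, assuming plain comma-separated `name=value` fields whose values contain no commas. The function name `parse_record` and the sample values are hypothetical and not part of sonar:

```python
def parse_record(line):
    """Parse one 'name=value,name=value,...' record into a dict.

    Assumption: no field value contains a comma.
    """
    record = {}
    for field in line.strip().split(","):
        # Split on the first '=' only, so a value may itself contain '='.
        name, sep, value = field.partition("=")
        if not sep:
            raise ValueError(f"field without a name: {field!r}")
        record[name] = value
    return record


# Records of different ages can carry different columns; lookup is by name,
# so column order and missing columns no longer matter.
old = parse_record("host=n1,user=alice,cpu%=12.5")
new = parse_record("host=n1,cores=8,user=alice,cpu%=12.5,kib=2048")
```

With positional columns, `old` and `new` would need different parsers; with named fields, a consumer can use `record.get("cores")` and treat a missing name as "not recorded".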

bast · 2023-06-15T12:26:29Z

That's a great idea to make backwards compatibility easier. It could also add the sonar version to the line so that we can make "database" migrations.

bast · 2023-06-16T08:24:45Z

I am working on this ...

bast · 2023-06-22T17:20:29Z

What is the difference between gpumem%=... and gpukib=...? And what is an example for gpus=? (I can also try to figure it out from the source code.)

lars-t-hansen · 2023-06-23T05:45:18Z

At the moment, gpus= is a binary bitmask of the cards used by the process for this sample. For example, the value for a process that uses cards 0, 1 and 3 would be 1011. This is not the only possible representation and not even necessarily the best one, see comments in the code, but it was pretty convenient to emit and to parse (for up to 64 cards per node...). For zombie processes that are known to use the GPUs, but where it can't be determined which GPUs, it's set to !0, but frankly I don't like this representation much, even if it works in practice. Happy to change/discuss, this is a good time to do so. Some of this discussion may touch on matters of what it is sonar is supposed to detect and report.
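
To make the bitmask representation concrete, here is a small sketch of a decoder. It assumes the value is logged as a string of binary digits with card 0 in the least significant position, as the `1011` example suggests; `decode_gpu_mask` is a hypothetical helper, not sonar code:

```python
def decode_gpu_mask(value):
    """Return the set of card indices encoded in a binary bitmask string.

    Returns None for the '!0' marker (GPUs in use, but unknown which ones).
    """
    if value == "!0":
        return None
    mask = int(value, 2)  # parse the digits as a base-2 number
    # Bit i set means card i was used by the process in this sample.
    return {i for i in range(mask.bit_length()) if (mask >> i) & 1}


cards = decode_gpu_mask("1011")
```

Here `cards` is the set of cards 0, 1 and 3, matching the example above.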

The difference between gpumem% and gpukib is that on some cards some of the time it is possible to determine one of these but not the other, and vice versa. For example, on the NVIDIA cards we have we can read both quantities for running processes but only gpukib for some zombies. Since we can detect the total amount of memory here we could translate gpukib into gpumem%, though. On the other hand, on our AMD cards there is no support for detecting the absolute amount of memory used, nor the total amount of memory on the cards, only the percentage of gpu memory used. Rather than encoding the logic for dealing with this mess in sonar, it seemed better - certainly for the time being - to report what we can report and let the analyzer sort it out.
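
As an illustration of letting the analyzer sort it out, the following sketch normalizes the two fields into a single percentage. The function name, its parameters, and the idea that the analyzer knows the card's total memory are assumptions for the example only:

```python
def gpu_mem_percent(gpumem_pct, gpukib, total_kib):
    """Return GPU memory use in percent, or None if it cannot be determined.

    gpumem_pct: reported percentage, or None if not available
    gpukib:     reported absolute use in KiB, or None if not available
    total_kib:  total card memory in KiB, or None if unknown (assumed input)
    """
    if gpumem_pct is not None:
        return gpumem_pct  # the percentage was reported directly
    if gpukib is not None and total_kib:
        return 100.0 * gpukib / total_kib  # derive it from the absolute value
    return None  # neither route is available


derived = gpu_mem_percent(None, 4 * 1024 * 1024, 16 * 1024 * 1024)
```

In this sketch a record with only `gpukib` (such as some zombie processes on NVIDIA cards) still yields a percentage when the total is known, while a record with only `gpumem%` (the AMD case) passes through unchanged.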

lars-t-hansen · 2023-06-23T05:48:22Z

Which also reminds me, there's probably no need to report subsecond precision for the sonar record timestamp, it just takes up space in the log.
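
For illustration, one way to drop the subsecond part when formatting a timestamp. This is a generic sketch that assumes an ISO 8601 style timestamp; it does not show how sonar itself formats the field, and `record_timestamp` is a hypothetical name:

```python
from datetime import datetime, timezone


def record_timestamp(now):
    """Format a datetime with whole-second precision."""
    return now.isoformat(timespec="seconds")


stamp = record_timestamp(
    datetime(2023, 6, 23, 5, 48, 22, 123456, tzinfo=timezone.utc)
)
```

The microseconds are simply omitted from the output, which saves several bytes per record.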

bast · 2023-06-23T10:05:16Z

Thanks for the explanations!

bast · 2023-07-27T17:30:45Z

> Which also reminds me, there's probably no need to report subsecond precision for the sonar record timestamp, it just takes up space in the log.

This is now fixed in #65

bast · 2023-07-27T17:45:51Z

> At the moment, gpus= is a binary bitmask of the cards used by the process for this sample. For example, the value for a process that uses cards 0, 1 and 3 would be 1011. This is not the only possible representation and not even necessarily the best one, see comments in the code, but it was pretty convenient to emit and to parse (for up to 64 cards per node...). For zombie processes that are known to use the GPUs, but where it can't be determined which GPUs, it's set to !0, but frankly I don't like this representation much, even if it works in practice. Happy to change/discuss, this is a good time to do so. Some of this discussion may touch on matters of what it is sonar is supposed to detect and report.

We could also do it like this:

...,gpus="0,1,3",...
...,gpus="unknown",...

lars-t-hansen · 2023-07-28T06:26:39Z

Pedantically, for CSV the correct format is

...,"gpus=0,1,3",...
...,gpus=unknown,...

That said, I'm fine with this format for the data.
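
The quoting rule can be checked with a standard CSV library. The sketch below uses Python's `csv` module, which by default quotes a whole field when it contains the delimiter; the field values are made up for the example and `format_record` is a hypothetical helper:

```python
import csv
import io


def format_record(fields):
    """Join fields into one CSV line, quoting only where CSV requires it."""
    buf = io.StringIO()
    csv.writer(buf, lineterminator="\n").writerow(fields)
    return buf.getvalue().rstrip("\n")


line = format_record(["host=n1", "gpus=0,1,3", "user=alice"])
# Reading the line back recovers the original three fields.
fields = next(csv.reader([line]))
```

The writer puts the quotes around the entire `gpus=0,1,3` field, name included, and leaves fields without commas (such as `gpus=unknown`) unquoted.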

bast · 2023-07-28T08:45:14Z

Thanks for the clarification! I am also fine with the bitmask, but the "csv inside csv" seems more general and needs less explanation, though it is maybe also uglier? But we don't need to parse it by eye, only with scripts.

lars-t-hansen · 2023-07-28T09:25:23Z

I think the generality of list-of-numbers-or-special-token wins here over the bitmask, which is going to bite us at some point. And it's a good time to make the change, when we're introducing field names.
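
A sketch of parsing the list-of-numbers-or-special-token form. It assumes the token is the literal `unknown` used in the examples in this thread (the token in the final format may differ), and `parse_gpus` is a hypothetical helper:

```python
def parse_gpus(value):
    """Parse a gpus value such as '0,1,3' into a set of card indices.

    Returns None for the special token 'unknown' (assumed spelling).
    """
    if value == "unknown":
        return None
    return {int(card) for card in value.split(",")}


cards = parse_gpus("0,1,3")
```

Unlike a 64-bit mask, this form has no built-in limit on the card index: `parse_gpus("70")` is just the set containing 70.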

bast self-assigned this Jun 16, 2023

bast added a commit that referenced this issue Jul 29, 2023

make output fields self-identifying; closes #41

e308b58

lars-t-hansen mentioned this issue Aug 7, 2023

Print raw cpu time #68

Closed

bast closed this as completed in 1594b7c Aug 10, 2023

bast added a commit that referenced this issue Aug 10, 2023

Merge pull request #66 from NordicHPC/radovan/output-format

a824fd6

make output fields self-identifying; closes #41
