Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make output fields self-identifying? #41

Closed
lars-t-hansen opened this issue Jun 15, 2023 · 11 comments
Closed

Make output fields self-identifying? #41

lars-t-hansen opened this issue Jun 15, 2023 · 11 comments
Assignees

Comments

@lars-t-hansen
Copy link
Collaborator

As we experiment with sonar and add more fields and move it to new systems there will come a time when some log data are older and some are newer and data of different ages will have different columns. For a while it will be good enough to always add new columns at the end of the line, but in the long term this will likely become brittle. It may be good to add an identifier to each field to reduce the impact of this problem. The identifier can be a short name and can prefix the field, for example,

date=...,host=...,cores=...,user=...,job=...,cmd=...,cpu%=...,kib=...,gpus=...,gpu%=...,gpumem%=...,gpukib=...

would do it for now. Names (columns) could be removed if they turn out to be worthless but names would never be reused with a different meaning.

@bast
Copy link
Member

bast commented Jun 15, 2023

That's a great idea to make backwards compatibility easier. It could also add sonar version to the line so that we can make "database" migrations.

@bast bast self-assigned this Jun 16, 2023
@bast
Copy link
Member

bast commented Jun 16, 2023

I am working on this ...

@bast
Copy link
Member

bast commented Jun 22, 2023

What is the difference between gpumem%=... and gpukib=...? And what is an example for gpus=? (I can also try to figure out from the source code)

@lars-t-hansen
Copy link
Collaborator Author

lars-t-hansen commented Jun 23, 2023

At the moment, gpus= is a binary bitmask of the cards used by the process for this sample. For example, the value for a process that uses cards 0, 1 and 3 would be 1011. This is not the only possible representation and not even necessarily the best one, see comments in the code, but it was pretty convenient to emit and to parse (for up to 64 cards per node...). For zombie processes that are known to use the GPUs, but where it can't be determined which GPUs, it's set to !0, but frankly I don't like this representation much, even if it works in practice. Happy to change/discuss, this is a good time to do so. Some of this discussion may touch on matters of what it is sonar is supposed to detect and report.

The difference between gpumem% and gpukib is that on some cards some of the time it is possible to determine one of these but not the other, and vice versa. For example, on the NVIDIA cards we have we can read both quantities for running processes but only gpukib for some zombies. Since we can detect the total amount of memory here we could translate gpukib into gpumem%, though. On the other hand, on our AMD cards there is no support for detecting the absolute amount of memory used, nor the total amount of memory on the cards, only the percentage of gpu memory used. Rather than encoding the logic for dealing with this mess in sonar, it seemed better - certainly for the time being - to report what we can report and let the analyzer sort it out.

@lars-t-hansen
Copy link
Collaborator Author

Which also reminds me, there's probably no need to report subsecond precision for the sonar record timestamp, it just takes up space in the log.

@bast
Copy link
Member

bast commented Jun 23, 2023

Thanks for explanations!

@bast
Copy link
Member

bast commented Jul 27, 2023

Which also reminds me, there's probably no need to report subsecond precision for the sonar record timestamp, it just takes up space in the log.

This is now fixed in #65

@bast
Copy link
Member

bast commented Jul 27, 2023

At the moment, gpus= is a binary bitmask of the cards used by the process for this sample. For example, the value for a process that uses cards 0, 1 and 3 would be 1011. This is not the only possible representation and not even necessarily the best one, see comments in the code, but it was pretty convenient to emit and to parse (for up to 64 cards per node...). For zombie processes that are known to use the GPUs, but where it can't be determined which GPUs, it's set to !0, but frankly I don't like this representation much, even if it works in practice. Happy to change/discuss, this is a good time to do so. Some of this discussion may touch on matters of what it is sonar is supposed to detect and report.

We could do it also like this:

...,gpus="0,1,3",...
...,gpus="unknown",...

@lars-t-hansen
Copy link
Collaborator Author

Pedantically, for CSV the correct format is

...,"gpus=0,1,3",...
...,gpus=unknown,...

That said, I'm fine with this format for the data.

@bast
Copy link
Member

bast commented Jul 28, 2023

Thanks for clarification! I am also fine with bitmask but the "csv inside csv" seems more general and less explaining needed but maybe also uglier? But we don't need to parse it by eye, only with scripts.

@lars-t-hansen
Copy link
Collaborator Author

I think the generality of list-of-numbers-or-special-token wins here over the bitmask, which is going to bite us at some point. And it's a good time to make the change, when we're introducing field names.

@bast bast closed this as completed in 1594b7c Aug 10, 2023
bast added a commit that referenced this issue Aug 10, 2023
make output fields self-identifying; closes #41
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants