-
Notifications
You must be signed in to change notification settings - Fork 5
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
make output fields self-identifying; closes #41 #66
Conversation
Also if you could test the branch on the GPUs you have access to, that would help a lot. I haven't tested this outside of CPUs. But the tests are preserved. |
README.md
Outdated
|
||
`gpus` are GPU devices: | ||
- If a process would use GPUs 1, 3, and 7: `"gpus={1, 3, 7}"` | ||
- If a process would use no or unknown GPUs: `gpus={}` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's a difference between "no gpus" and "unknown gpus". The reason for this is that in the code for nvidia cards, most information is grabbed using nvidia-smi pmon
, and this gives us precise information, but it does not report everything, in particular, it does not report defunct or zombie processes that hold onto GPU memory. For that we use nvidia-smi --query-compute-apps
, but that command does not give us information about the cards being used, and so if there are any of those we use the "unknown" tag.
While it would be nice to clean that up by using the programmatic interface to the nvidia cards (and I really should do that...), I think it is useful to preserve the distinction between "no gpus" and "unknown gpus" here, it allows us to express more.
I don't understand why the braces are desirable in the syntax here. From a parsing point of view, the field value "gpus=1,3,7" seems almost ideal because it's just a string split operation in the ingestor. And with that background, we could use eg "gpus=none" for the "no gpu" case (though see below) and "gpus=unknown" for the unknown case, it's easy to distinguish those two cases from the list-of-card-numbers case.
On that topic, a useful optimization for the log writer is the use of default values that are agreed between the logger and the ingestor. The default value for gpus
would be none
and sensibly the field does not appear in the output at all. For several other fields, a sensible default is zero and if the value is zero we could (optionally of course) omit the field. The ingestor will have to deal with missing fields and obsoleted fields and newly-introduced fields anyway, so this adds very little complexity.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks! I will change that.
About not printing values if they are default: then it's possibly not anymore CSV. It's not anymore tabular. But perhaps it does not have to be if we are explicit about it that a CSV parser trying to read an entire file might then fail since it would get confused about where the columns are.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah... if the fields are named they can be in any order, though, and my assumption with having named fields is that fields can be introduced and removed when needed. And indeed a log file for a given host and day can have records with and without a particular field for that reason. The ingestion code I have in the sonarlog library handles that properly (and by design). I think we're going to be adding quite a number of fields over time and having defaults would be nice.
The alternative is running some migration pass every time we upgrade sonar. This sounds unpleasant - pause all cronjobs on all nodes, upgrade all log files, upgrade sonar, restart cronjobs - and error prone. But is this what you had in mind?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I now also think that unordered named comma-separated fields is better than forcing it into a tabular format just for the sake of file suffix which in the example I should perhaps change away from .csv
.
The version is there just in case somebody in future decides to migrate some data but I don't plan that on any regular basis. This is just to make it possible and us less worried about breaking format.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The high bit is that we agree that the syntax is CSV with its comma-separation and detailed rules for quoting, so that we can use existing high-quality CSV parsers, even if the particular semantic detail of the tabular format is ignored. Both Rust and Go have such parsers that appear to be flexible about the tabular format.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just to add further to the documentation burden, as discussed on #68 the cpu%
field is not a sample, it is a running average, while the cpukib
value /is/ a sample. For the GPUs I'm not sure what the gpu%
number represents, but I'm investigating. For JobAnalyzer it's almost certainly better for these values to be either samples or something that can easily be converted to samples (eg, cumulative CPU/GPU time).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To close the loop on the gpu fields: they are samples (or close enough) and fine as it is. The only problems we have are with cpu%, as discussed at length on bug #68.
Thanks for the feedback! I will adjust. |
Will do so once the discussion has settled down. |
The patch (as it stands) works fine with the GPUs. |
76b39c1
to
1594b7c
Compare
This is still not done. I only resolved conflicts and rebased. |
Implementing defaults is a good idea but postponed to #74 otherwise I will never complete this PR :-) Remaining here:
|
I suggest we merge although this is far from perfect:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM modulo comments that ought to be fixed and the tweak for writing "none" where it's easy to do so.
Summary:
Please see the change in README.md for how the columns are defined. @lars-t-hansen can you please browse through this change carefully?
I am unsure about the
gpus
part."gpus={1, 3, 7}"
gpus={}
Do we want to distinguish no GPUs and unknown GPUs? If yes, how?