make output fields self-identifying; closes #41 #66

bast · 2023-07-29T16:02:25Z

Summary:

changes output format (which is explained in README.md)
bumps version to 0.7.0

Please see the change in README.md for how the columns are defined. @lars-t-hansen can you please browse through this change carefully?

I am unsure about the gpus part.

If a process would use GPUs 1, 3, and 7: "gpus={1, 3, 7}"
If a process would use no or unknown GPUs: gpus={}

Do we want to distinguish no GPUs and unknown GPUs? If yes, how?

bast · 2023-07-29T16:13:44Z

Also if you could test the branch on the GPUs you have access to, that would help a lot. I haven't tested this outside of CPUs. But the tests are preserved.

lars-t-hansen · 2023-07-31T08:27:07Z

README.md

+
+`gpus` are GPU devices:
+- If a process would use GPUs 1, 3, and 7: `"gpus={1, 3, 7}"`
+- If a process would use no or unknown GPUs: `gpus={}`


There's a difference between "no gpus" and "unknown gpus". In the code for NVIDIA cards, most information comes from nvidia-smi pmon, which gives us precise information but does not report everything: in particular, it does not report defunct or zombie processes that hold onto GPU memory. For those we use nvidia-smi --query-compute-apps, but that command does not tell us which cards are being used, so if there are any such processes we use the "unknown" tag.

While it would be nice to clean that up by using the programmatic interface to the NVIDIA cards (and I really should do that...), I think it is useful to preserve the distinction between "no gpus" and "unknown gpus" here; it allows us to express more.

I don't understand why the braces are desirable in the syntax here. From a parsing point of view, the field value "gpus=1,3,7" seems almost ideal because it's just a string split operation in the ingestor. Given that, we could use e.g. "gpus=none" for the "no gpu" case (though see below) and "gpus=unknown" for the unknown case; both are easy to distinguish from the list-of-card-numbers case.
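To illustrate the parsing argument, here is a minimal sketch of how an ingestor might decode the value part of such a field. The `Gpus` type, its variant names, and `parse_gpus` are hypothetical names chosen for this example; they are not taken from sonar or sonarlog.

```rust
use std::collections::BTreeSet;

/// GPU usage as an ingestor might represent it. The three-way split mirrors
/// the proposal above; the type and variant names are illustrative only.
#[derive(Debug, PartialEq)]
enum Gpus {
    /// The process uses no GPUs ("none").
    None,
    /// The process uses GPUs, but we cannot tell which ("unknown").
    Unknown,
    /// The process uses exactly these card numbers.
    Cards(BTreeSet<u32>),
}

/// Parse the value part of a `gpus=...` field: "none", "unknown",
/// or a comma-separated list of card numbers such as "1,3,7".
fn parse_gpus(value: &str) -> Result<Gpus, std::num::ParseIntError> {
    match value {
        "none" => Ok(Gpus::None),
        "unknown" => Ok(Gpus::Unknown),
        list => {
            // A plain string split; any non-numeric element is an error.
            let cards = list
                .split(',')
                .map(|s| s.trim().parse::<u32>())
                .collect::<Result<BTreeSet<u32>, _>>()?;
            Ok(Gpus::Cards(cards))
        }
    }
}
```

With this sketch, `parse_gpus("1,3,7")` yields the card set {1, 3, 7}, while "none" and "unknown" map to their own variants, so the three cases never collide.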

On that topic, a useful optimization for the log writer is the use of default values that are agreed between the logger and the ingestor. The default value for gpus would be none, and sensibly the field would then not appear in the output at all. For several other fields, a sensible default is zero, and if the value is zero we could (optionally of course) omit the field. The ingestor will have to deal with missing fields, obsoleted fields, and newly introduced fields anyway, so this adds very little complexity.
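A minimal sketch of what the writer side of this optimization could look like, assuming each field is given as a (name, value, default) triple. The helper names and the `gpu%` default of zero are assumptions made for illustration; this is not sonar's actual output code.

```rust
/// Quote a field CSV-style if it contains a comma or a double quote,
/// doubling any embedded quotes.
fn csv_quote(field: &str) -> String {
    if field.contains(',') || field.contains('"') {
        format!("\"{}\"", field.replace('"', "\"\""))
    } else {
        field.to_string()
    }
}

/// Build one record from (name, value, default) triples, skipping every
/// field whose value equals its agreed default.
fn format_record(fields: &[(&str, &str, &str)]) -> String {
    fields
        .iter()
        .filter(|(_, value, default)| value != default)
        .map(|(name, value, _)| csv_quote(&format!("{name}={value}")))
        .collect::<Vec<_>>()
        .join(",")
}
```

A record for a CPU-only process would then carry no gpus field at all, while a record with `gpus=1,3,7` is quoted as a whole field because its value contains commas.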

Thanks! I will change that.

About not printing values if they are default: then it's possibly no longer CSV, since it's no longer tabular. But perhaps it does not have to be, as long as we are explicit that a CSV parser trying to read an entire file might fail because it would get confused about where the columns are.

Yeah... if the fields are named they can be in any order, though, and my assumption with having named fields is that fields can be introduced and removed when needed. And indeed a log file for a given host and day can have records with and without a particular field for that reason. The ingestion code I have in the sonarlog library handles that properly (and by design). I think we're going to be adding quite a number of fields over time and having defaults would be nice.
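The following sketch shows one way an ingestor could tolerate reordered, missing, and unknown fields. It is not the sonarlog implementation; the function names and the `user` field in the usage note are hypothetical, and the hand-written line splitter is only a standard-library stand-in for a real CSV parser.

```rust
use std::collections::HashMap;

/// Split one CSV line into fields, honouring double-quoted fields and the
/// `""` escape. A simplified stand-in for a real CSV parser.
fn split_csv_line(line: &str) -> Vec<String> {
    let mut fields = Vec::new();
    let mut current = String::new();
    let mut in_quotes = false;
    let mut chars = line.chars().peekable();
    while let Some(c) = chars.next() {
        match c {
            // An escaped quote inside a quoted field.
            '"' if in_quotes && chars.peek() == Some(&'"') => {
                current.push('"');
                chars.next();
            }
            '"' => in_quotes = !in_quotes,
            ',' if !in_quotes => fields.push(std::mem::take(&mut current)),
            _ => current.push(c),
        }
    }
    fields.push(current);
    fields
}

/// Turn a record into a name -> value map. Field order does not matter, and
/// fields without a `=` are skipped rather than treated as errors.
fn parse_record(line: &str) -> HashMap<String, String> {
    split_csv_line(line)
        .iter()
        .filter_map(|f| f.split_once('='))
        .map(|(k, v)| (k.to_string(), v.to_string()))
        .collect()
}

/// Look up a field, falling back to the agreed default when it is absent.
fn field_or<'a>(record: &'a HashMap<String, String>, name: &str, default: &'a str) -> &'a str {
    record.get(name).map(String::as_str).unwrap_or(default)
}
```

For a line such as `user=bob,cpukib=2048`, `field_or(&record, "gpus", "none")` falls back to "none", so records written before and after a field was introduced can sit in the same log file.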

The alternative is running some migration pass every time we upgrade sonar. This sounds unpleasant (pause all cronjobs on all nodes, upgrade all log files, upgrade sonar, restart cronjobs) and error-prone. But is this what you had in mind?

I now also think that unordered, named, comma-separated fields are better than forcing the data into a tabular format just for the sake of the file suffix, which I should perhaps change away from .csv in the example.

The version is there just in case somebody decides to migrate some data in the future, but I don't plan to do that on any regular basis. It is there to make migration possible and to make us less worried about breaking the format.

The high bit is that we agree that the syntax is CSV with its comma-separation and detailed rules for quoting, so that we can use existing high-quality CSV parsers, even if the particular semantic detail of the tabular format is ignored. Both Rust and Go have such parsers that appear to be flexible about the tabular format.

Just to add further to the documentation burden, as discussed on #68 the cpu% field is not a sample, it is a running average, while the cpukib value /is/ a sample. For the GPUs I'm not sure what the gpu% number represents, but I'm investigating. For JobAnalyzer it's almost certainly better for these values to be either samples or something that can easily be converted to samples (eg, cumulative CPU/GPU time).
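To make the last point concrete, here is a minimal sketch of how a cumulative CPU-time counter can be converted into a per-interval sample. It assumes each record carried a timestamp and the CPU time consumed so far, which is the suggestion above and not something the current format is stated to provide; the `Observation` type and its field names are hypothetical.

```rust
/// One observation of a process: the wall-clock time of the observation and
/// the CPU time the process has consumed so far, both in seconds.
#[derive(Clone, Copy)]
struct Observation {
    timestamp_sec: f64,
    cpu_time_sec: f64,
}

/// CPU utilization, as a percentage of one core, over the interval between
/// two observations. Returns None if the interval is empty or reversed, or
/// if the counter went backwards (for example after a pid was reused).
fn cpu_percent_between(prev: Observation, cur: Observation) -> Option<f64> {
    let dt = cur.timestamp_sec - prev.timestamp_sec;
    let dcpu = cur.cpu_time_sec - prev.cpu_time_sec;
    if dt <= 0.0 || dcpu < 0.0 {
        return None;
    }
    Some(100.0 * dcpu / dt)
}
```

A process that accumulates 150 seconds of CPU time over a 300-second interval comes out at 50%, regardless of how long it ran before the first observation. A running average since process start cannot be turned into such a per-interval figure without also knowing the elapsed time it was averaged over.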

To close the loop on the gpu fields: they are samples (or close enough) and fine as they are. The only problems we have are with cpu%, as discussed at length on bug #68.

bast · 2023-07-31T11:38:47Z

Thanks for the feedback! I will adjust.

lars-t-hansen · 2023-08-03T11:58:11Z

> Also if you could test the branch on the GPUs you have access to, that would help a lot. I haven't tested this outside of CPUs. But the tests are preserved.

Will do so once the discussion has settled down.

lars-t-hansen · 2023-08-07T12:09:38Z

> Also if you could test the branch on the GPUs you have access to, that would help a lot. I haven't tested this outside of CPUs. But the tests are preserved.

> Will do so once the discussion has settled down.

The patch (as it stands) works fine with the GPUs.

bast · 2023-08-10T07:42:11Z

This is still not done. I only resolved conflicts and rebased.

bast · 2023-08-10T07:46:42Z

Implementing defaults is a good idea but is postponed to #74; otherwise I will never complete this PR :-)

Remaining here:

Change the output of gpus and then we do the rest as follow-up.
Update README.md after recent changes.

bast · 2023-08-10T08:11:32Z

I suggest we merge, although this is far from perfect:

Defaults are not properly implemented or documented; this is followed up in #74 (Define, implement, and document defaults for values where it makes sense).
The gpus field is implemented as a HashSet, which will need to be modified to support "unknown" and "none" (currently neither is implemented); this is followed up in #75 (For gpus implement "unknown").
Although the gpus code is now "wrong", the tests don't catch it, so we will need to adapt them.
I am pressed by deadlines and think that 80% OK code on main is better than a PR that is never finished.
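As a rough sketch of the follow-up described for the gpus field, the plain HashSet could be wrapped in a type that also carries the "none" and "unknown" cases. Everything here (the `GpuSet` name, its variants, `from_cards`) is hypothetical and only illustrates the idea; it is not the code that was later written for #75.

```rust
use std::collections::HashSet;
use std::fmt;

/// Writer-side representation of the GPUs a process uses.
#[derive(Debug, Clone, PartialEq)]
enum GpuSet {
    /// No GPUs in use; printed as "none".
    None,
    /// GPUs in use, but the card numbers could not be determined.
    Unknown,
    /// Exactly these card numbers.
    Cards(HashSet<usize>),
}

impl GpuSet {
    /// Wrap an existing set of card numbers, mapping the empty set to `None`
    /// so that an empty list is never printed.
    fn from_cards(cards: HashSet<usize>) -> GpuSet {
        if cards.is_empty() {
            GpuSet::None
        } else {
            GpuSet::Cards(cards)
        }
    }
}

impl fmt::Display for GpuSet {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            GpuSet::None => write!(f, "none"),
            GpuSet::Unknown => write!(f, "unknown"),
            GpuSet::Cards(cards) => {
                // Sort so the output is deterministic; HashSet iteration order is not.
                let mut ids: Vec<_> = cards.iter().collect();
                ids.sort();
                let ids: Vec<String> = ids.iter().map(|id| id.to_string()).collect();
                write!(f, "{}", ids.join(","))
            }
        }
    }
}
```

Sorting before printing also makes the field easy to test, since a HashSet alone gives no guarantee about the order in which card numbers appear.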

lars-t-hansen

LGTM modulo comments that ought to be fixed and the tweak for writing "none" where it's easy to do so.

lars-t-hansen mentioned this pull request Jul 31, 2023

Add support for ingesting data with tagged fields NAICNO/Jobanalyzer#8

Closed

lars-t-hansen reviewed Jul 31, 2023

lars-t-hansen reviewed Jul 31, 2023

lars-t-hansen mentioned this pull request Jul 31, 2023

Tidy up the meaning of the GPU fields NAICNO/Jobanalyzer#10

Closed

make output fields self-identifying; closes #41

1594b7c

bast force-pushed the radovan/output-format branch from 76b39c1 to 1594b7c August 10, 2023 07:41

bast mentioned this pull request Aug 10, 2023

Define, implement, and document defaults for values where it makes sense #74

Closed

bast added 2 commits August 10, 2023 09:59

update readme

ed1abe8

print gpu ids comma-separated

b39b4a0

lars-t-hansen approved these changes Aug 10, 2023

bast mentioned this pull request Aug 10, 2023

For gpus implement "unknown" #75

Closed

bast added 3 commits August 10, 2023 11:10

implement code review suggestions

a3d38a9

run "cargo fmt"

4d4b484

apply suggestions from "cargo clippy"

970cc98

bast merged commit a824fd6 into main Aug 10, 2023
1 check passed

bast deleted the radovan/output-format branch August 10, 2023 09:14
