Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

make output fields self-identifying; closes #41 #66

Merged
merged 6 commits into from
Aug 10, 2023
Merged

Conversation

bast
Copy link
Member

@bast bast commented Jul 29, 2023

Summary:

  • changes output format (which is explained in README.md)
  • bumps version to 0.7.0

Please see the change in README.md for how the columns are defined. @lars-t-hansen can you please browse through this change carefully?

I am unsure about the gpus part.

  • If a process would use GPUs 1, 3, and 7: "gpus={1, 3, 7}"
  • If a process would use no or unknown GPUs: gpus={}

Do we want to distinguish no GPUs and unknown GPUs? If yes, how?

@bast
Copy link
Member Author

bast commented Jul 29, 2023

Also if you could test the branch on the GPUs you have access to, that would help a lot. I haven't tested this outside of CPUs. But the tests are preserved.

README.md Outdated Show resolved Hide resolved
README.md Outdated

`gpus` are GPU devices:
- If a process would use GPUs 1, 3, and 7: `"gpus={1, 3, 7}"`
- If a process would use no or unknown GPUs: `gpus={}`
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's a difference between "no gpus" and "unknown gpus". The reason for this is that in the code for nvidia cards, most information is grabbed using nvidia-smi pmon, and this gives us precise information, but it does not report everything, in particular, it does not report defunct or zombie processes that hold onto GPU memory. For that we use nvidia-smi --query-compute-apps, but that command does not give us information about the cards being used, and so if there are any of those we use the "unknown" tag.

While it would be nice to clean that up by using the programmatic interface to the nvidia cards (and I really should do that...), I think it is useful to preserve the distinction between "no gpus" and "unknown gpus" here, it allows us to express more.

I don't understand why the braces are desirable in the syntax here. From a parsing point of view, the field value "gpus=1,3,7" seems almost ideal because it's just a string split operation in the ingestor. And with that background, we could use eg "gpus=none" for the "no gpu" case (though see below) and "gpus=unknown" for the unknown case, it's easy to distinguish those two cases from the list-of-card-numbers case.

On that topic, a useful optimization for the log writer is the use of default values that are agreed between the logger and the ingestor. The default value for gpus would be none and sensibly the field does not appear in the output at all. For several other fields, a sensible default is zero and if the value is zero we could (optionally of course) omit the field. The ingestor will have to deal with missing fields and obsoleted fields and newly-introduced fields anyway, so this adds very little complexity.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! I will change that.

About not printing values if they are default: then it's possibly not anymore CSV. It's not anymore tabular. But perhaps it does not have to be if we are explicit about it that a CSV parser trying to read an entire file might then fail since it would get confused about where the columns are.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah... if the fields are named they can be in any order, though, and my assumption with having named fields is that fields can be introduced and removed when needed. And indeed a log file for a given host and day can have records with and without a particular field for that reason. The ingestion code I have in the sonarlog library handles that properly (and by design). I think we're going to be adding quite a number of fields over time and having defaults would be nice.

The alternative is running some migration pass every time we upgrade sonar. This sounds unpleasant - pause all cronjobs on all nodes, upgrade all log files, upgrade sonar, restart cronjobs - and error prone. But is this what you had in mind?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I now also think that unordered named comma-separated fields is better than forcing it into a tabular format just for the sake of file suffix which in the example I should perhaps change away from .csv.

The version is there just in case somebody in future decides to migrate some data but I don't plan that on any regular basis. This is just to make it possible and us less worried about breaking format.

Copy link
Collaborator

@lars-t-hansen lars-t-hansen Aug 4, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The high bit is that we agree that the syntax is CSV with its comma-separation and detailed rules for quoting, so that we can use existing high-quality CSV parsers, even if the particular semantic detail of the tabular format is ignored. Both Rust and Go have such parsers that appear to be flexible about the tabular format.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just to add further to the documentation burden, as discussed on #68 the cpu% field is not a sample, it is a running average, while the cpukib value /is/ a sample. For the GPUs I'm not sure what the gpu% number represents, but I'm investigating. For JobAnalyzer it's almost certainly better for these values to be either samples or something that can easily be converted to samples (eg, cumulative CPU/GPU time).

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To close the loop on the gpu fields: they are samples (or close enough) and fine as it is. The only problems we have are with cpu%, as discussed at length on bug #68.

README.md Outdated Show resolved Hide resolved
README.md Outdated Show resolved Hide resolved
@bast
Copy link
Member Author

bast commented Jul 31, 2023

Thanks for the feedback! I will adjust.

@lars-t-hansen
Copy link
Collaborator

Also if you could test the branch on the GPUs you have access to, that would help a lot. I haven't tested this outside of CPUs. But the tests are preserved.

Will do so once the discussion has settled down.

@lars-t-hansen
Copy link
Collaborator

Also if you could test the branch on the GPUs you have access to, that would help a lot. I haven't tested this outside of CPUs. But the tests are preserved.

Will do so once the discussion has settled down.

The patch (as it stands) works fine with the GPUs.

@bast
Copy link
Member Author

bast commented Aug 10, 2023

This is still not done. I only resolved conflicts and rebased.

@bast
Copy link
Member Author

bast commented Aug 10, 2023

Implementing defaults is a good idea but postponed to #74 otherwise I will never complete this PR :-)

Remaining here:

  • Change the output of gpus and then we do the rest as follow-up.
  • Update README.md after recent changes.

@bast
Copy link
Member Author

bast commented Aug 10, 2023

I suggest we merge although this is far from perfect:

  • Defaults are not properly implemented and documented but followed up in Define, implement, and document defaults for values where it makes sense #74
  • gpus field is implemented as a HashSet and this will need to be modified for "unknown" and "none" (currently neither "unknown" nor "none" is implemented) followed up in For gpus implement "unknown" #75
  • Although the gpus code is now "wrong", tests don't catch it anyway. We will need to adapt the tests.
  • I am pressed on deadlines and think that 80% OK code on main is better than PR never finished.

Copy link
Collaborator

@lars-t-hansen lars-t-hansen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM modulo comments that ought to be fixed and the tweak for writing "none" where it's easy to do so.

README.md Outdated Show resolved Hide resolved
README.md Show resolved Hide resolved
src/ps.rs Outdated Show resolved Hide resolved
src/ps.rs Show resolved Hide resolved
src/nvidia.rs Outdated Show resolved Hide resolved
src/ps.rs Outdated Show resolved Hide resolved
@bast bast mentioned this pull request Aug 10, 2023
@bast bast merged commit a824fd6 into main Aug 10, 2023
1 check passed
@bast bast deleted the radovan/output-format branch August 10, 2023 09:14
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants