Print raw cpu time #68 (Closed)

lars-t-hansen opened this issue Aug 7, 2023 · 11 comments
Labels: enhancement (New feature or request)

@lars-t-hansen (Collaborator)

Currently sonar logs the pcpu field from the ps output, but this is a tricky number. It is not a sample: if I read the manual page correctly, it is the ratio of the process's consumed CPU time to the elapsed time since the process started. As time moves forward, it takes ever greater changes in process behavior to move this number at all. (If a process sits around doing nothing for 24 hours and then runs at 100% for an hour, this number moves only from 0% to 4%.) I think that what we want to log instead is the cputimes field, which is the CPU time consumed up to that point. Given two consecutive log records we can then say something about how busy the process was during the last sampling interval, and it becomes meaningful to look at averages, trends, and so on. Some of these are core use cases for NAICNO/Jobanalyzer.

(Given enough precision in the output it might be possible to compute the desired value from the pcpu value, but ps prints only one digit after the decimal point, so this is unlikely to work well in practice.)
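
To illustrate the intended use: a minimal sketch (hypothetical types and field names, not sonar's actual code) of computing average utilization over a sampling interval from two consecutive records that carry a cumulative CPU-time field:

```rust
// Hypothetical record shape; sonar's real output format may differ.
struct Record {
    timestamp_sec: f64, // wall-clock time of the sample
    cputime_sec: f64,   // cumulative CPU time consumed so far
}

// Average utilization between two consecutive samples of the same process
// (1.0 == one full core busy for the whole interval).
fn utilization(prev: &Record, cur: &Record) -> f64 {
    let wall = cur.timestamp_sec - prev.timestamp_sec;
    if wall <= 0.0 {
        return 0.0;
    }
    (cur.cputime_sec - prev.cputime_sec) / wall
}
```

Unlike pcpu, this value is insensitive to how long the process has been alive: only behavior within the window matters.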

Once we have support for named fields in the sonar output (#41) we can easily add this field, and it shouldn't take any extra effort during sampling to generate the data.

Also see NAICNO/Jobanalyzer#27.

@bast (Member) commented Aug 7, 2023

Great catch! Thank you!

lars-t-hansen changed the title from "Log raw cpu time" to "Print raw cpu time" on Aug 8, 2023
@lars-t-hansen (Collaborator, Author) commented Aug 8, 2023

I'm not actually sure this patch solves the problem I'm trying to solve. Consider create_snapshot(). It obtains a ps listing (mapping a user/pid/cmd tuple identifying the process to information about the process), first merges entries for the same user/pid/cmd tuple, and then maps each pid to a job ID and merges process values into a single job. If a job consists of a changing set of processes (some are created, some exit), then any measure that is not a pure sample value will be very hard to deal with. Suppose process A runs for the entire length of the job, B is started later and terminates before A, and both use one full core. If A runs for a second, and then A and B run in parallel for 2 seconds before a sample is taken, then sonar will report 5 seconds of CPU time for the job (the sum of cputimes for A and B at that point). But then B terminates, A runs for another second, and another sample is taken. Sonar will now report 4 seconds of CPU time for the job (the time A has run), and the time B used will be lost.

I'll pull the patch while I ponder this further.

@lars-t-hansen (Collaborator, Author) commented Aug 8, 2023

Since the cpu% field we're currently using is just another view of the cputimes field I'm trying to use, we already have this problem. And I believe the problem is the same for single-process jobs with many threads that start and exit as it is for multi-process jobs.

The fundamental problem is information loss at the accumulation stage: we have no information about processes that have exited. If we had a fairly pure sample (i.e., CPU time since the last sample), this is probably something we could live with. In the example above, the first sample at t=3 would report "time since last sample: 5s", and the second sample at t=4 would report "time since last sample: 1s". The latter is an underreport, since B could have exited anywhere in that 1s window and its CPU time in that window is lost, but at least the recordings approximate the truth.

There are a couple of ways to address this, I guess. Basically it comes down to knowing which processes go into computing the total CPU time and what their usage is, so that it is possible to observe that they exit and to ensure that the computations representing samples are performed with that knowledge. This means either that sonar logs the individual processes and their CPU times so that the postprocessor (sonalyze, jobgraph) can perform the calculations, or that sonar has a memory so that it can perform the calculations itself. In the former case, the record for the job would presumably contain an array of pairs (pid, cputime) instead of the cputimes field. In the latter case, the sonar memory would contain those data.

Temperamentally I'm probably inclined to put the data in the sonar log, if the data volume does not become too outrageous. I'll experiment some.
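
As a sketch of the first option (hypothetical types, not an actual sonar schema), the per-job record would carry the per-process breakdown so the postprocessor can detect exits:

```rust
// Hypothetical log-record shape for the "log individual processes" design.
struct JobSample {
    job_id: u32,
    timestamp_sec: f64,
    // One (pid, cumulative CPU seconds) entry per live process in the job
    // at sampling time, replacing the single accumulated cputimes field.
    per_process_cputime: Vec<(u32, f64)>,
}
```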

@lars-t-hansen (Collaborator, Author)

It looks like the solution here is to add the --cumulative switch to ps, on top of the existing patch. If B is a subprocess of A, then B's CPU time is added to A's cutime field when B dies (until B dies, B is accounted for on its own and handled by sonar), and the --cumulative switch ensures that A's cutime value is picked up when reporting cputime. In this manner, the cputime reported for A is monotonically nondecreasing, incorporating the time of its children as necessary. I will verify this with an experiment.

A's threads are already rolled into A's data as presented by ps, their stats persist when the threads exit, and they present no additional problem. This is as expected, but I've verified it with an experiment.

@lars-t-hansen (Collaborator, Author)

A lovely theory, except it turns out --cumulative does not modify the cputimes output. It does modify %cpu and several other fields, but those are mostly ratios or rounded in ways we don't want. However, it also modifies bsdtime in the way we want, and that is probably good enough, as this field does not lose data. bsdtime has the form m...m:ss, where the minutes field can grow arbitrarily large and the seconds field is always two digits. It must be parsed, but we can do that.
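
Parsing that format is simple enough; a sketch (hypothetical helper, not sonar's code):

```rust
// Parse ps's bsdtime format "m...m:ss" (e.g. "5299:03") into seconds:
// arbitrarily many minute digits, always exactly two second digits.
fn parse_bsdtime(s: &str) -> Option<u64> {
    let (mins, secs) = s.split_once(':')?;
    if secs.len() != 2 {
        return None;
    }
    let mins: u64 = mins.parse().ok()?;
    let secs: u64 = secs.parse().ok()?;
    if secs >= 60 {
        return None;
    }
    Some(mins * 60 + secs)
}
```

One cost of this route: bsdtime has only one-second resolution, so short sampling intervals will see coarse deltas.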

@lars-t-hansen (Collaborator, Author)

(Presumably just a matter of time before we go directly to /proc/pid/stat and do the calculations ourselves.)
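
For reference, a minimal sketch of that approach (error handling mostly elided, not production code): in /proc/<pid>/stat, utime, stime, cutime, and cstime are fields 14 through 17, in clock ticks, and the comm field may itself contain spaces or parentheses, so it is safest to parse from the last ')'.

```rust
use std::fs;

// Read cumulative CPU times for a process from /proc/<pid>/stat.
// Returns (utime, stime, cutime, cstime) in clock ticks; divide by
// sysconf(_SC_CLK_TCK) (typically 100) to convert to seconds.
fn cumulative_ticks(pid: u32) -> Option<(u64, u64, u64, u64)> {
    let stat = fs::read_to_string(format!("/proc/{}/stat", pid)).ok()?;
    // comm (field 2) may contain spaces, so split after the last ')'.
    let rest = stat.get(stat.rfind(')')? + 2..)?;
    let f: Vec<&str> = rest.split_whitespace().collect();
    // After comm, field 3 ("state") is f[0], so field 14 ("utime") is f[11].
    let utime: u64 = f.get(11)?.parse().ok()?;
    let stime: u64 = f.get(12)?.parse().ok()?;
    let cutime: u64 = f.get(13)?.parse().ok()?; // waited-for children's utime
    let cstime: u64 = f.get(14)?.parse().ok()?; // waited-for children's stime
    Some((utime, stime, cutime, cstime))
}
```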

@lars-t-hansen (Collaborator, Author)

The fix that I have will probably not do the right thing for a job that has multiple independent processes created ex nihilo by some supernatural agent ("slurm"), but I'm not sure whether that's really a problem. In any case, those processes should carry the same slurm job ID and will still be rolled together, as they should be.

@bast (Member) commented Aug 9, 2023

> /proc/pid/stat

Good idea, I think. I am surprised that there is nothing simple that tells us the CPU percentage right now. Sonar should IMO ideally not have its own memory. I need to learn what the different numbers in /proc/pid/stat mean, but if one of them corresponds to CPU percentage right now, then this is a good route. A few more files to open and close, but that's fine.

@lars-t-hansen (Collaborator, Author)

None of the /proc values correspond to CPU percentage, either as a sample or over a (sensible) sample window; it's all cumulative, from what I can tell (and I've been doing quite a bit of reading).

While it's not enormously appealing for the consumer of the sonar log to have to reconstruct sample values from the cumulative values, it does work (or I have faith that it will work once I've written the code), and moreover, that reconstruction can be hidden in the sonarlog library if we want. The consumer will only ever need to see the sample values.
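
As a sketch of what that hidden reconstruction could look like (hypothetical types; this assumes the log carries per-process cumulative values, per the earlier discussion):

```rust
use std::collections::HashMap;

// Total CPU seconds consumed by a job between two consecutive samples,
// given per-process cumulative values. A process missing from `cur` has
// exited and stops contributing (its final sliver of time is lost, as
// discussed above); a pid present in `cur` with a smaller value than
// before is assumed to be a reused pid and is counted as a new process.
fn interval_cpu_secs(
    prev: &HashMap<u32, f64>, // pid -> cumulative CPU secs at sample t-1
    cur: &HashMap<u32, f64>,  // pid -> cumulative CPU secs at sample t
) -> f64 {
    cur.iter()
        .map(|(pid, &now)| {
            let before = prev.get(pid).copied().unwrap_or(0.0);
            if now >= before { now - before } else { now }
        })
        .sum()
}
```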

@lars-t-hansen (Collaborator, Author)

Adding to my previous comment: to create a CPU "sample" (i.e., %cpu used "now") there must be a time window, since at any given instant a CPU is being used 100% by one process and 0% by every other process. %cpu only makes sense averaged over a time window. The %cpu field of the ps output takes the window to be the lifetime of the process. If /proc or ps were to provide another field with a different time window, what would that window be? It would have to be either arbitrary (i.e., pre-set by the kernel) or configurable, and I've seen nothing of either kind. I could imagine a profiling API that would allow something like that to exist, but /proc probably isn't the natural place for it.

In addition, I suspect that in an ideal world what we really want for %cpu is "CPU usage since the last sample", not a "CPU usage now" datum. The latter is better for determining instantaneous load, but if that's really what we want then we should capture rusage-style data (or get similar data from /proc). The former is much better for determining system utilization over time.

@bast (Member) commented Aug 10, 2023

Thanks! That makes lots of sense.
