Print raw cpu time #68
Comments
Great catch! Thank you!
I'm not actually sure this patch solves the problem I'm trying to solve. I'll pull the patch while I ponder this further.
The fundamental problem is information loss at the accumulation stage: we have no information about processes that exited. If we had a fairly pure sample (i.e. CPU time since the last sample) then this is probably something we could live with. In the example above, the first sample at t=3 would report "time since last sample 5s", and the second sample at t=4 would report "time since last sample 1s". The latter is an underreport, since B could have exited anywhere in that 1s window and its CPU time in that window is lost, but at least the recordings approximate the truth.

There are a couple of ways to address this. Basically it comes down to knowing which processes go into computing the total CPU time usage, and their individual usage, so that it is possible to observe that they exit and to ensure that the computations representing samples are performed with that knowledge. This means sonar can either log the individual processes and their CPU times so that the postprocessor (sonalyze, jobgraph) can perform the calculations, or sonar must have a memory so that it can perform the calculations itself. In the former case, the record for the job would presumably contain an array of pairs.

Temperamentally I'm probably inclined to put the data in the sonar log, if the data volume does not become too outrageous. I'll experiment some.
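To make the accounting problem above concrete, here is a minimal sketch in Python. The record format (a map from pid to cumulative CPU seconds per sample) and the function name are assumptions for illustration, not sonar's actual log format.

```python
# Hypothetical sketch of per-interval CPU accounting from cumulative
# per-process samples. Assumed record format: {pid: cumulative cpu seconds}.

def interval_cpu(prev: dict[int, float], curr: dict[int, float]) -> float:
    """CPU seconds consumed between two samples.

    A pid present in `prev` but absent from `curr` has exited; whatever
    it used after the previous sample is simply not counted -- the
    information-loss problem described above.
    """
    total = 0.0
    for pid, cum in curr.items():
        total += cum - prev.get(pid, 0.0)
    return total

# At t=3, pids 100 (A) and 200 (B) have each used 2s of CPU.
# At t=4, B has exited; only A remains, having used 3s total.
s_t3 = {100: 2.0, 200: 2.0}
s_t4 = {100: 3.0}
print(interval_cpu(s_t3, s_t4))  # 1.0 -- B's final-interval usage is lost
```

This is why the comment argues that either the per-process data must be logged (so the postprocessor can notice B disappearing) or sonar itself must remember the previous sample.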
It looks like the solution here is to add the …

A's threads are already rolled into A's data as presented by ps, their stats persist when the threads exit, and they present no additional problem. This is as expected, but I've verified it with an experiment.
A lovely theory, except it turns out …
(Presumably just a matter of time before we go directly to …)
The fix that I have will probably not do the right thing for a job that has multiple independent processes created ex nihilo by some supernatural agent ("slurm"), but I'm not sure whether that's really a problem. In that case they should carry the same slurm job ID and will still be rolled together, as they should be.
Good idea, I think. I am surprised that there is nothing simple that tells us the CPU percentage right now. Sonar should IMO ideally not have its own memory. I need to learn what the different numbers in … mean.
None of the /proc values correspond to CPU percentage as a sample or over a (sensible) sample window; it's all cumulative, from what I can tell (and I've been doing quite a bit of reading). While it's not enormously appealing for the consumer of the sonar log to have to reconstruct sample values from the cumulative values, it does work (or I have faith that it will work once I've written the code), and moreover, that reconstruction can be hidden in the sonarlog library if we want. The consumer would then only see the sample values.
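The reconstruction described above is straightforward to hide behind a small helper. This is a sketch under assumed names and record shapes, not sonarlog's actual API: it turns a time-ordered series of cumulative CPU readings for one process into per-interval samples.

```python
# Hypothetical sketch: converting cumulative CPU-time records into
# per-interval samples, the kind of reconstruction the comment suggests
# hiding in a consumer library. Record shape (timestamp, cumulative
# cpu seconds) is an assumption for illustration.

def samples_from_cumulative(series):
    """Yield (timestamp, cpu seconds used since the previous record)."""
    prev_cum = None
    for t, cum in series:
        if prev_cum is not None:
            yield (t, cum - prev_cum)
        prev_cum = cum

records = [(0, 0.0), (60, 30.0), (120, 90.0)]
print(list(samples_from_cumulative(records)))
# [(60, 30.0), (120, 60.0)]
```

The consumer then sees only the per-interval values, as the comment proposes.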
Adding to my previous comment: to create a CPU "sample" (i.e. %cpu used "now") there must be a time window, since the nature of a CPU is such that, at a given instant, one process is using it 100% and every other process is using it 0%. %cpu only makes sense averaged over a time window. The %cpu field of the ps output takes the window to be the lifetime of the process. If there were to be any other time window for some other field provided by /proc or ps, what would it be? It would have to be either arbitrary (i.e. pre-set by the kernel) or configurable, and I've seen nothing, either way. I could imagine a profiling API that would allow something like that to exist, but /proc probably isn't the natural place for it.

In addition, I suspect that in an ideal world what we really want for %cpu is "cpu usage since the last sample", not a "cpu usage now" datum. The latter is better for determining instantaneous load, but if that's really what we want then we run rusage and capture its output (or get similar data from /proc). The former is much better for determining system utilization over time.
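The window-averaged %cpu described above reduces to one division. A minimal sketch, with an assumed function name, of computing it from two cumulative readings taken at the sample boundaries:

```python
# Sketch: average %cpu over a sampling window, computed from two
# cumulative CPU-time readings. Function name is hypothetical.

def pct_cpu_in_window(cpu_prev, cpu_now, t_prev, t_now):
    """Average CPU utilization (percent) over [t_prev, t_now].

    cpu_prev/cpu_now: cumulative CPU seconds at the two sample times.
    """
    return 100.0 * (cpu_now - cpu_prev) / (t_now - t_prev)

# 30 CPU-seconds consumed across a 60-second window -> 50% average
print(pct_cpu_in_window(100.0, 130.0, 0.0, 60.0))  # 50.0
```

Note that the result can exceed 100% for a multithreaded process, for the same reason ps's %cpu can.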
Thanks! That makes lots of sense. |
Currently sonar logs the `pcpu` field from the `ps` output, but this is a pretty tricky number. It is not a sample: it is the ratio of the process's consumed CPU time to the elapsed time since the process started, if I read the manual page correctly. As time moves forward, it will take ever greater changes in process behavior to move this number at all. (If a process sits around doing nothing for 24 hours and then runs 100% for an hour, this number will move from 0% to 4%.) I think that what we instead want to log is the `cputimes` field, which is the consumed CPU time up until that point. Given two consecutive log records we can then say something about how busy the process was during the last sampling interval, and it will be meaningful to look at averages, trends, and so on. Some of these are core use cases for NAICNO/Jobanalyzer.

(Given enough precision in the output it may be possible to compute the desired value from the `pcpu` value, but ps prints only one digit after the decimal point and this is unlikely to be very good in practice.)
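The idle-then-busy scenario above can be checked with a little arithmetic. This sketch (function names are hypothetical) contrasts the lifetime ratio that `ps` reports with a window-based measure built from consecutive cumulative records:

```python
# Sketch contrasting ps-style lifetime %cpu with a per-window measure.
# Function names are hypothetical; the arithmetic mirrors the scenario
# described in the issue text.

def pcpu_lifetime(cpu_s, elapsed_s):
    """ps-style %cpu: total CPU time over total elapsed time."""
    return 100.0 * cpu_s / elapsed_s

def pcpu_window(cpu_delta_s, window_s):
    """%cpu over one sampling window, from the change in cputimes."""
    return 100.0 * cpu_delta_s / window_s

# 24 hours idle, then 1 hour at 100%:
print(pcpu_lifetime(3600, 25 * 3600))  # 4.0  -- barely registers
print(pcpu_window(3600, 3600))         # 100.0 -- the last hour's true load
```

The lifetime number dilutes the busy hour across the whole 25-hour lifetime, while two consecutive `cputimes` records recover the actual utilization in the sampling interval.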
Once we have support for named fields in the sonar output (#41) we can easily add this field, and it shouldn't take any extra effort during sampling to generate the data.
Also see NAICNO/Jobanalyzer#27.