Negative rcpu reported around time of disk shutdown #63
rcpu is cpu_util_pct / numcores, where numcores = 64 for ml6, so this should give cpu_util_pct = 64 * -175 = -11200. Indeed, after changing the print format for cpu from usize to isize, it prints -11227. One thing we probably should do is to uniformly print float values as isize / i32 instead of usize / u32, to catch bugs like this (though in this case it's the only one I found). Looking at the code for postprocess_log only, cpu_util_pct can become negative only if two samples in a stream have descending timestamps (which should not be possible, since we've just sorted the data) or if the cputime_sec field of the later record is smaller than that of the earlier record. In this case, it's the delta cpu that is negative.
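For concreteness, here is a minimal sketch of how that delta goes negative and of how an unsigned cast hides the sign while a signed one exposes it. This is not the actual postprocess_log code; the `Sample` struct, its field names, and the numbers are assumptions based on the description above.

```rust
/// Hypothetical per-record sample; field names follow the discussion above,
/// not necessarily the actual record layout.
struct Sample {
    timestamp: i64,   // seconds since epoch, taken when the record was *printed*
    cputime_sec: f64, // cumulative CPU time carried by the record
}

/// Utilization in percent between two consecutive samples of one stream.
fn cpu_util_pct(earlier: &Sample, later: &Sample) -> f64 {
    let dt = (later.timestamp - earlier.timestamp) as f64;
    let dcpu = later.cputime_sec - earlier.cputime_sec; // < 0 if the later record carries less CPU time
    100.0 * dcpu / dt
}

fn main() {
    // The later record was printed a minute later, but its measurements were
    // obtained earlier, so its cumulative CPU time is smaller.
    let a = Sample { timestamp: 1_000, cputime_sec: 500.0 };
    let b = Sample { timestamp: 1_060, cputime_sec: 480.0 };
    let pct = cpu_util_pct(&a, &b);
    println!("{}", pct as i64); // signed cast: prints -33, the bug is visible
    println!("{}", pct as u64); // unsigned cast saturates the negative float to 0, silently hiding the sign
}
```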
What's happening here is that a lot of jobs were waiting for the disk to come back before they could progress, and many of them emitted records with the same timestamp; the fix in #60 dealt with this. But some jobs did not run until the next minute mark and therefore have a different timestamp. The jobs ran in some arbitrary order, and some of the measurement data at the later timestamp have lower values than the corresponding data at the earlier timestamp, because the timestamp reflects the moment the data were printed, not when they were obtained. Since the data come from sonar, this is a problem sonar should likely fix, and I'll file a bug for that (NordicHPC/sonar#100). But we should also fix the problem here, by detecting this situation and avoiding creating data that exhibit it. Even clamping the data to 0 would be better than making them negative. In addition, there's a related bug: the …
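A minimal sketch of what the clamping could look like; the function name and signature are hypothetical, not the actual fix:

```rust
/// Clamp the computed utilization at 0 so that records whose measurements
/// went "backwards" contribute 0 rather than a negative value.
fn clamped_cpu_util_pct(earlier_cputime_sec: f64, later_cputime_sec: f64, dt_sec: f64) -> f64 {
    let dcpu = later_cputime_sec - earlier_cputime_sec;
    (100.0 * dcpu / dt_sec).max(0.0)
}

fn main() {
    // Later record carries less cumulative CPU time: clamped to 0, not negative.
    assert_eq!(clamped_cpu_util_pct(500.0, 480.0, 60.0), 0.0);
    // Normal case: positive delta passes through unchanged.
    assert_eq!(clamped_cpu_util_pct(480.0, 500.0, 60.0), 100.0 * 20.0 / 60.0);
}
```

Dropping or flagging such records outright would also work; clamping is just the cheapest way to keep negative values out of the derived data.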
See #59 (comment). This is probably a calculation error of some sort but it could also be garbage in the logs. It is obviously of interest that this happened as the disks came back up after 2300 local time (21:00 UTC in the report). That the "cpu" value is zero is also indicative of something. Reproducing here for completeness: