Negative rcpu reported around time of disk shutdown #63

Closed

lars-t-hansen opened this issue Sep 8, 2023 · 2 comments

Labels: component:sonarlog (sonalyze/sonarlog/*), task:bug (Something isn't working)

lars-t-hansen commented Sep 8, 2023

See #59 (comment). This is probably a calculation error of some sort, but it could also be garbage in the logs. It is clearly relevant that this happened as the disks came back up after 23:00 local time (21:00 UTC in the report). That the "cpu" value is zero in the same record is also suggestive. Reproducing here for completeness:

bash-4.4$ sonalyze load --config-file ~/sonar/ml-nodes.json --data-path ~/sonar/data --host ml6 --from 1d --fmt=date,time,cpu,rcpu,mem,rmem,gpu,rgpu,gpumem,rgpumem,gpus
HOST: ml6.hpc.uio.no
date        time   cpu   rcpu  mem  rmem  gpu  rgpu  gpumem  rgpumem  gpus
2023-09-07  10:00  6182  97    125  49    28   4     15      17       0,1,2,3
2023-09-07  11:00  3285  51    76   30    25   3     11      13       0,1,2,3,5,6,7
2023-09-07  12:00  527   8     75   29    62   8     13      15       1,5,6,7
2023-09-07  13:00  1861  29    60   23    29   4     10      12       1,6,7
2023-09-07  14:00  433   7     52   20    26   3     10      12       1,6
2023-09-07  21:00  0     -175  60   23    13   2     9       10       1,4,6
2023-09-07  22:00  624   10    70   27    25   3     11      14       1,4,6
2023-09-07  23:00  229   4     70   27    26   3     11      14       1,4,6
2023-09-08  00:00  187   3     65   26    17   2     11      13       1,4,6
2023-09-08  01:00  105   2     45   18    0    0     9       11       1,4
2023-09-08  02:00  105   2     37   15    0    0     8       10       1
2023-09-08  03:00  105   2     37   15    0    0     8       10       1
2023-09-08  04:00  105   2     37   15    0    0     8       10       1
2023-09-08  05:00  105   2     37   15    0    0     8       10       1
2023-09-08  06:00  105   2     37   15    0    0     8       10       1
2023-09-08  07:00  105   2     37   15    0    0     8       10       1
2023-09-08  08:00  105   2     37   15    0    0     8       10       1
2023-09-08  09:00  105   2     37   15    0    0     8       10       1
2023-09-08  10:00  105   2     37   15    0    0     8       10       1
@lars-t-hansen lars-t-hansen added the task:bug Something isn't working label Sep 8, 2023
@lars-t-hansen lars-t-hansen self-assigned this Sep 8, 2023
lars-t-hansen commented Sep 8, 2023

rcpu is cpu_util_pct / numcores, where numcores = 64 for ml6, so this should give cpu_util_pct = 64 * -175 = -11200. Indeed, after changing the format from usize to isize for cpu, it prints -11227.

One thing we should probably do is print uniformly as isize / i32 instead of usize / u32 for float values, to catch bugs like this (though in this case it's the only one I found).
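
For illustration only (this is a guess at the formatting path, not the actual sonalyze code): in Rust, casting a negative float to an unsigned integer saturates at 0, which is exactly how an underlying cpu_util_pct of -11227 can be rendered as a "cpu" of 0.

// Hypothetical illustration; the real sonalyze formatting code may differ.
fn main() {
    let cpu_util_pct = -11227.0_f64;       // the value postprocess_log ended up with
    println!("{}", cpu_util_pct as usize); // prints 0: negative float-to-unsigned casts saturate
    println!("{}", cpu_util_pct as isize); // prints -11227: the sign is preserved
}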

Looking at the code for postprocess_log only, cpu_util_pct can be negative if two samples in a stream have descending timestamps (which should not be possible, because we've just sorted the data) or if the cputime_sec field of the later record is smaller than that of the earlier record. In this case, it's the delta cpu that is negative.
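
A minimal sketch of that failure mode, with made-up field and function names (not the actual postprocess_log code): when utilization is computed as a delta between consecutive samples, a later record carrying a smaller accumulated cputime_sec yields a negative result.

// Sketch only; the record layout and names are assumptions.
struct Sample {
    timestamp_sec: u64, // when the record was printed
    cputime_sec: f64,   // accumulated CPU time reported for the job
}

// cpu_util_pct computed from two consecutive samples in a stream.
fn cpu_util_pct(earlier: &Sample, later: &Sample) -> f64 {
    let dt = (later.timestamp_sec - earlier.timestamp_sec) as f64;
    let dcpu = later.cputime_sec - earlier.cputime_sec; // negative when the later record
                                                        // reports less accumulated CPU time
    100.0 * dcpu / dt
}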


lars-t-hansen commented Sep 8, 2023

What's happening here is that a lot of jobs were waiting on the disk to come back before they could progress, and many of them emitted records with the same timestamp; the fix in #60 dealt with that. But some jobs were not run until the next minute mark and so have a different timestamp. Yet the jobs were run in some arbitrary order, and some of the measurement data at the later timestamp had lower values than the corresponding data at the earlier timestamp, because the timestamp reflects the moment the data were printed, not when they were obtained.

Since the data come from sonar, this is a problem sonar should probably fix, and I'll file a bug for that (NordicHPC/sonar#100). But we should also guard against it here by detecting this case and not producing data with negative utilization. Even clamping the delta to 0 would be better than making it negative.
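
One possible shape for that guard, continuing the sketch above (a sketch only, not a patch against the real code):

// Clamp the per-interval CPU delta so bad input cannot produce negative utilization.
fn cpu_util_pct_clamped(earlier: &Sample, later: &Sample) -> f64 {
    let dt = (later.timestamp_sec - earlier.timestamp_sec) as f64;
    let dcpu = (later.cputime_sec - earlier.cputime_sec).max(0.0);
    if dt > 0.0 { 100.0 * dcpu / dt } else { 0.0 }
}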

In addition, there's a related bug: the "now" formatting spec obtains a fresh timestamp for every record. It should probably use the same timestamp for all records printed by the same invocation of sonalyze. I'll file a separate bug for that (#64).
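
Roughly what that fix could look like (hypothetical sketch; the actual formatting machinery in sonalyze may be organized differently): resolve the clock once per invocation and reuse that value when formatting every record.

use std::time::{SystemTime, UNIX_EPOCH};

// Hypothetical sketch: one "now" per invocation, shared by all printed records.
fn main() {
    let now = SystemTime::now()
        .duration_since(UNIX_EPOCH)
        .expect("clock before epoch")
        .as_secs();
    let records = ["rec-a", "rec-b", "rec-c"]; // stand-ins for formatted records
    for r in &records {
        println!("{} now={}", r, now); // every record gets the same timestamp
    }
}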
