Negative rcpu reported around time of disk shutdown #63
rcpu is cpu_util_pct / numcores, where numcores = 64 for ml6, so this should give cpu_util_pct = 64 * -175 = -11200. Indeed, after changing the print format for cpu from usize to isize, it prints -11227. One thing we probably should do is to uniformly print float values as isize / i32 instead of usize / u32, to catch bugs like this (though in this case it's the only one I found). Looking at the code for postprocess_log only, cpu_util_pct can become negative only if two samples in a stream have descending timestamps (which should not be possible, since we've just sorted the data) or if the cputime_sec field of the later record is smaller than that of the earlier record. In this case, it's the delta cpu that is negative.
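For concreteness, here is a minimal sketch of how that delta goes negative and of how an unsigned cast hides the sign while a signed one exposes it. This is not the actual postprocess_log code; the `Sample` struct, its field names, and the numbers are assumptions based on the description above.

```rust
/// Hypothetical per-record sample; field names follow the discussion above,
/// not necessarily the actual record layout.
struct Sample {
    timestamp: i64,   // seconds since epoch, taken when the record was *printed*
    cputime_sec: f64, // cumulative CPU time carried by the record
}

/// Utilization in percent between two consecutive samples of one stream.
fn cpu_util_pct(earlier: &Sample, later: &Sample) -> f64 {
    let dt = (later.timestamp - earlier.timestamp) as f64;
    let dcpu = later.cputime_sec - earlier.cputime_sec; // < 0 if the later record carries less CPU time
    100.0 * dcpu / dt
}

fn main() {
    // The later record was printed a minute later, but its measurements were
    // obtained earlier, so its cumulative CPU time is smaller.
    let a = Sample { timestamp: 1_000, cputime_sec: 500.0 };
    let b = Sample { timestamp: 1_060, cputime_sec: 480.0 };
    let pct = cpu_util_pct(&a, &b);
    println!("{}", pct as i64); // signed cast: prints -33, the bug is visible
    println!("{}", pct as u64); // unsigned cast saturates the negative float to 0, silently hiding the sign
}
```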
What's happening here is that a lot of jobs were waiting for the disk to come back before they could progress, and many of them emitted records with the same timestamp; the fix in #60 dealt with this. But some jobs did not run until the next minute mark and therefore have a different timestamp. The jobs ran in some arbitrary order, and some of the measurement data at the later timestamp have lower values than the corresponding data at the earlier timestamp, because the timestamp reflects the moment the data were printed, not when they were obtained. Since the data come from sonar, this is a problem sonar should likely fix, and I'll file a bug for that (NordicHPC/sonar#100). But we should also fix the problem here, by detecting this situation and avoiding creating data that exhibit it. Even clamping the data to 0 would be better than making them negative. In addition, there's a related bug: the …
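A minimal sketch of what the clamping could look like; the function name and signature are hypothetical, not the actual fix:

```rust
/// Clamp the computed utilization at 0 so that records whose measurements
/// went "backwards" contribute 0 rather than a negative value.
fn clamped_cpu_util_pct(earlier_cputime_sec: f64, later_cputime_sec: f64, dt_sec: f64) -> f64 {
    let dcpu = later_cputime_sec - earlier_cputime_sec;
    (100.0 * dcpu / dt_sec).max(0.0)
}

fn main() {
    // Later record carries less cumulative CPU time: clamped to 0, not negative.
    assert_eq!(clamped_cpu_util_pct(500.0, 480.0, 60.0), 0.0);
    // Normal case: positive delta passes through unchanged.
    assert_eq!(clamped_cpu_util_pct(480.0, 500.0, 60.0), 100.0 * 20.0 / 60.0);
}
```

Dropping or flagging such records outright would also work; clamping is just the cheapest way to keep negative values out of the derived data.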
See #59 (comment). This is probably a calculation error of some sort but it could also be garbage in the logs. It is obviously of interest that this happened as the disks came back up after 2300 local time (21:00 UTC in the report). That the "cpu" value is zero is also indicative of something. Reproducing here for completeness: