cpuload: Workaround for wrong idle thread load #22653
Merged
Solved Problem
When the CPU load monitor is started while it is already running, the idle thread's last_times[0] is reset to cover only the last 1 second, rather than the time since the CPU load monitor was last started. This can happen if the calls to cpuload_monitor_start() and cpuload_monitor_stop() are unbalanced. In my test case, the cpuload monitor is never disabled.
This results in the idle thread having a much lower CPU load, with the remaining CPU time being wrongly attributed as scheduler load. The remaining threads are not impacted, since their last_times[i] is reset to zero for every sample init.
This causes the preflight and postflight cpuload measurements to wrongly show a >30% scheduler load, which raises alarms in our monitoring suite.
See #22655.
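The misattribution can be illustrated with a toy calculation. This is only a sketch and not the PX4 cpuload code: the struct, the function name and the numbers are made up for illustration, and it assumes the load is computed as runtime delta divided by the measurement interval, with the scheduler load being whatever is left over.

```cpp
#include <cstdint>

// Toy model only: names and structure are illustrative, not the PX4 cpuload code.
struct LoadSplit {
	uint32_t idle_permille;   // share attributed to the idle thread
	uint32_t tasks_permille;  // share attributed to all other threads
	uint32_t sched_permille;  // remainder, reported as scheduler load
};

// interval_us:    length of the measurement interval used as the denominator
// idle_delta_us:  idle runtime counted since the idle baseline was last taken
// tasks_delta_us: runtime of all other threads counted over the full interval
inline LoadSplit split_load(uint64_t interval_us, uint64_t idle_delta_us, uint64_t tasks_delta_us)
{
	LoadSplit s{};
	s.idle_permille = static_cast<uint32_t>(idle_delta_us * 1000 / interval_us);
	s.tasks_permille = static_cast<uint32_t>(tasks_delta_us * 1000 / interval_us);
	// Anything not attributed to a thread ends up as scheduler load.
	s.sched_permille = 1000 - s.idle_permille - s.tasks_permille;
	return s;
}
```

With a 3 s interval in which the idle thread really ran for 2.7 s and the other threads for 0.24 s, split_load(3000000, 2700000, 240000) leaves 2% for the scheduler. If the idle baseline only covers the last second (0.9 s of idle runtime) while the interval is still 3 s, split_load(3000000, 900000, 240000) reports 30% idle and 62% scheduler load, even though nothing about the system changed.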
Solution
Fix the unbalanced calls in logger.cpp.
The solution only resets the idle thread CPU last_times[0]. This fixes the immediate issue; however, if the cpuload monitor is not reset at some point, the idle thread load will be averaged over a much longer time than the remaining threads, resulting in wrong numbers later on.
Alternatives
The cpuload API is not salvageable; it needs to be rewritten with a sliding sample window so that multiple requests can instantly get the CPU load for the last 1s window. Pretty much everything in that API is subtly broken and inefficient, with implicit and explicit calls to cpuload_monitor_start()/cpuload_monitor_stop() spread all over the logger and top command, so unbalanced calls are easy to make.
Test coverage
This behavior can be provoked by opening and closing logs with preflight and postflight CPU load requests in quick succession, so that the calls to start/stop overlap in time.
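For comparison, overlapping requests like these would not interfere with a sliding sample window as proposed under Alternatives, because readers never reset shared state. The following is a minimal sketch of that idea under stated assumptions: SlidingLoadWindow, push() and load_permille() are hypothetical names, a single sampler is assumed to push snapshots at a fixed rate, and the synchronization a real implementation would need is omitted.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Hypothetical sketch: keeps periodic snapshots of one thread's cumulative
// runtime so any caller can read the load over the most recent window
// without calling start/stop.
template <std::size_t N>
class SlidingLoadWindow
{
public:
	// Record a snapshot; intended to be called at a fixed rate by one sampler.
	void push(uint64_t now_us, uint64_t total_runtime_us)
	{
		_samples[_head] = {now_us, total_runtime_us};
		_head = (_head + 1) % N;

		if (_count < N) {
			++_count;
		}
	}

	// Load in permille between the oldest and newest stored snapshot.
	// Read-only, so overlapping callers cannot disturb each other.
	// Returns -1 until two snapshots with different timestamps exist.
	int load_permille() const
	{
		if (_count < 2) {
			return -1;
		}

		const Sample &newest = _samples[(_head + N - 1) % N];
		const Sample &oldest = _samples[(_head + N - _count) % N];
		const uint64_t dt = newest.time_us - oldest.time_us;

		if (dt == 0) {
			return -1;
		}

		return static_cast<int>((newest.runtime_us - oldest.runtime_us) * 1000 / dt);
	}

private:
	struct Sample {
		uint64_t time_us;
		uint64_t runtime_us;
	};

	std::array<Sample, N> _samples{};
	std::size_t _head{0};
	std::size_t _count{0};
};
```

With three snapshots taken 0.5 s apart, a SlidingLoadWindow<3> always spans the last second, and two back-to-back calls to load_permille() return the same value regardless of how preflight and postflight requests interleave.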
Context
See #22655.