Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

cpuload: Workaround for wrong idle thread load #22653

Merged
merged 1 commit into from
Jan 31, 2024
Merged

Conversation

niklaut
Copy link
Contributor

@niklaut niklaut commented Jan 18, 2024

Solved Problem

When the CPU load monitor is started while already running, then the idle thread last_times[0] is reset to the last 1 second,
rather than since when the CPU load monitor was last started. This can happen if the calls to cpuload_monitor_start() and cpuload_monitor_stop() are unbalanced. In my test case, the cpuload monitor is never disabled.

This results in the idle thread having a much lower CPU load, with the remaining CPU time being wrongly attributed as scheduler load. The remaining threads are not impacted, since their last_times[i] is reset to zero for every sample init.

This causes the preflight and postflight cpuload measurements to wrongly show a >30% scheduler load, which raises alarms in our monitoring suite.

See #22655.

Solution

Fix the unbalanced calls in logger.cpp.

The solution only resets the idle thread CPU last_times[0]. This fixes the immediate issue, however, if the cpuload monitor is not reset at some point, the idle thread load will be averaged over a much longer time than the remaining threads, thus resulting in wrong numbers later on.

Alternatives

The cpuload API is not salvagable, it needs to be rewritten with a sliding sample window so that multiple requests can instantly get the CPU load for the last 1s window. Pretty much everything in that API is subtly broken and inefficient with implicit and explicit calls to cpuload_monitor_start()/cpuload_monitor_stop() spread all over the logger and top command, so unbalanced calls are easily done.

Test coverage

This behavior be provoked by doing opening and closing logs with preflight and postflight cpu load requests in quick order, so that calls to start/stop overlap within the same time.

# Arming must fail, either hardcode it to fail, or disconnect all servo to fail preflight test.
commander arm -f
# fails and closes the current log
commander arm -f
# fails and opens a new log and then immediately closes it
top once
# Displays <2% idle thread load and <30% sched load
top
# display the error once, then recovers the next times

Context

See #22655.

@dagar
Copy link
Member

dagar commented Jan 18, 2024

The cpuload API is not salvagable,

👍

Copy link
Member

@bkueng bkueng left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For the unbalanced calls, it's due to log streaming overlapping with file logging. Can you add this?

diff --git a/src/modules/logger/logger.cpp b/src/modules/logger/logger.cpp
index ecb05295d6..f9c3f66800 100644
--- a/src/modules/logger/logger.cpp
+++ b/src/modules/logger/logger.cpp
@@ -1610,6 +1610,11 @@ void Logger::print_load_callback(void *user)

 void Logger::initialize_load_output(PrintLoadReason reason)
 {
+    // If already in progress, don't try to start again
+    if (_next_load_print != 0) {
+        return;
+    }
+
     init_print_load(&_load);

     if (reason == PrintLoadReason::Watchdog) {

platforms/nuttx/src/px4/common/print_load.cpp Outdated Show resolved Hide resolved
platforms/nuttx/src/px4/common/print_load.cpp Outdated Show resolved Hide resolved
When the CPU load monitor is started while already running, then the
idle thread last_times[0] is reset to the last 1 second, rather than
since when the CPU load monitor was last started. The remaining threads
are not impacted, since their last_times[i] is reset to zero here.

This results in the idle thread having a lower than real CPU load, with
the remaining CPU time being wrongly attributed as scheduler load.
@niklaut
Copy link
Contributor Author

niklaut commented Jan 23, 2024

For the unbalanced calls, it's due to log streaming overlapping with file logging.

Yes, that fixes it and it's a better fix!

@bkueng bkueng merged commit 103ddb5 into PX4:main Jan 31, 2024
88 checks passed
@niklaut niklaut deleted the cpuload_idle branch January 31, 2024 08:22
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants