cpuload: Workaround for wrong idle thread load #22653

niklaut · 2024-01-18T15:19:22Z

Solved Problem

When the CPU load monitor is started while it is already running, the idle thread's last_times[0] is reset to cover only the last 1 second rather than the time since the CPU load monitor was last started. This can happen if the calls to cpuload_monitor_start() and cpuload_monitor_stop() are unbalanced. In my test case, the cpuload monitor is never disabled.

This results in the idle thread having a much lower CPU load, with the remaining CPU time being wrongly attributed as scheduler load. The remaining threads are not impacted, since their last_times[i] is reset to zero for every sample init.

This causes the preflight and postflight cpuload measurements to wrongly show a >30% scheduler load, which raises alarms in our monitoring suite.
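
The misattribution described above can be illustrated with a toy calculation. This is only a sketch under stated assumptions, not the code in print_load.cpp: the names `LoadSample` and `compute_load` are hypothetical, and it assumes the scheduler load is simply whatever share of the interval is left after the idle thread and all other threads are accounted for.

```cpp
#include <cassert>
#include <cstdint>

// Toy model only: names and formulas are illustrative, not taken from print_load.cpp.
struct LoadSample {
	int idle_permille;  // idle thread share of the interval
	int tasks_permille; // all other threads combined
	int sched_permille; // assumed: whatever is left over
};

// idle_runtime_us / tasks_runtime_us: runtime deltas seen by the sampler.
// interval_us: the time span the sampler divides by.
inline LoadSample compute_load(uint64_t idle_runtime_us, uint64_t tasks_runtime_us, uint64_t interval_us)
{
	assert(interval_us > 0);
	assert(idle_runtime_us + tasks_runtime_us <= interval_us);

	LoadSample s{};
	s.idle_permille = static_cast<int>(idle_runtime_us * 1000 / interval_us);
	s.tasks_permille = static_cast<int>(tasks_runtime_us * 1000 / interval_us);
	s.sched_permille = 1000 - s.idle_permille - s.tasks_permille;
	return s;
}
```

With consistent windows over 3 s, `compute_load(2700000, 240000, 3000000)` yields 90.0% idle, 8.0% tasks and 2.0% scheduler. If the idle delta only covers the last 1 s while everything else covers 3 s, `compute_load(900000, 240000, 3000000)` yields 30.0% idle and 62.0% scheduler, which is the kind of skew reported here.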

See #22655.

Solution

Fix the unbalanced calls in logger.cpp.

The solution only resets the idle thread's last_times[0]. This fixes the immediate issue. However, if the cpuload monitor is not reset at some point, the idle thread load will be averaged over a much longer time than the remaining threads, resulting in wrong numbers later on.

Alternatives

The cpuload API is not salvageable; it needs to be rewritten with a sliding sample window so that multiple requests can instantly get the CPU load for the last 1s window. Pretty much everything in that API is subtly broken and inefficient. Implicit and explicit calls to cpuload_monitor_start()/cpuload_monitor_stop() are spread all over the logger and the top command, so unbalanced calls happen easily.
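
To make the sliding-window idea concrete, here is a minimal sketch for a single thread. It is not a proposed PX4 API: `SlidingLoadWindow`, `push()` and `load_permille()` are hypothetical names, and it assumes one periodic sampler that records the thread's cumulative runtime. Readers only look at the stored snapshots and never start, stop or reset anything.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical sketch: a fixed-size ring of cumulative runtime snapshots for one thread.
template <std::size_t N>
class SlidingLoadWindow
{
	static_assert(N >= 2, "need at least two snapshots to form a window");

	struct Snapshot {
		uint64_t time_us;
		uint64_t runtime_us;
	};

public:
	// Called periodically by a single sampler with the thread's cumulative runtime.
	void push(uint64_t now_us, uint64_t total_runtime_us)
	{
		assert(_count == 0 || now_us >= _samples[(_head + N - 1) % N].time_us);
		_samples[_head] = Snapshot{now_us, total_runtime_us};
		_head = (_head + 1) % N;

		if (_count < N) {
			++_count;
		}
	}

	// Load in permille over the span covered by the stored snapshots,
	// or -1 if there is not enough data yet. Readers never modify state.
	int load_permille() const
	{
		if (_count < 2) {
			return -1;
		}

		const Snapshot &newest = _samples[(_head + N - 1) % N];
		const Snapshot &oldest = _samples[(_head + N - _count) % N];
		const uint64_t span_us = newest.time_us - oldest.time_us;

		if (span_us == 0) {
			return -1;
		}

		return static_cast<int>((newest.runtime_us - oldest.runtime_us) * 1000 / span_us);
	}

private:
	std::array<Snapshot, N> _samples{};
	std::size_t _head{0};
	std::size_t _count{0};
};
```

With `SlidingLoadWindow<3>` sampled every 500 ms, the window always spans the last 1 s once it is full, so any number of callers can query it concurrently with the sampler's schedule without unbalanced start/stop calls being possible. A real implementation would still need to handle thread creation and deletion and synchronize the sampler with readers, which this sketch leaves out.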

Test coverage

This behavior can be provoked by opening and closing logs with preflight and postflight CPU load requests in quick succession, so that the calls to start/stop overlap in time.

# Arming must fail: either hardcode it to fail, or disconnect all servos so that the preflight test fails.
commander arm -f
# Fails and closes the current log
commander arm -f
# Fails, opens a new log and then immediately closes it
top once
# Displays <2% idle thread load and <30% sched load
top
# Displays the error once, then recovers on the following updates

Context

See #22655.

dagar · 2024-01-18T16:30:02Z

The cpuload API is not salvageable,

👍

bkueng

For the unbalanced calls, it's due to log streaming overlapping with file logging. Can you add this?

diff --git a/src/modules/logger/logger.cpp b/src/modules/logger/logger.cpp
index ecb05295d6..f9c3f66800 100644
--- a/src/modules/logger/logger.cpp
+++ b/src/modules/logger/logger.cpp
@@ -1610,6 +1610,11 @@ void Logger::print_load_callback(void *user)

 void Logger::initialize_load_output(PrintLoadReason reason)
 {
+    // If already in progress, don't try to start again
+    if (_next_load_print != 0) {
+        return;
+    }
+
     init_print_load(&_load);

     if (reason == PrintLoadReason::Watchdog) {
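
The guard in the diff above can be read as making the initialization idempotent while a measurement is pending. The following is a standalone sketch of that pattern, not the PX4 `Logger` class: `LoadPrinter`, `initialize()`, `finish()` and the 1 s window are assumptions made for illustration, and only the `_next_load_print != 0` check mirrors the diff.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for the logger's load-printing state; not the PX4 class.
class LoadPrinter
{
public:
	// Returns true if a new measurement was started, false if one was already pending.
	bool initialize(uint64_t now_us)
	{
		// If already in progress, don't try to start again
		if (_next_load_print != 0) {
			return false;
		}

		++_init_count;                       // stands in for init_print_load()
		_next_load_print = now_us + 1000000; // assumed 1 s measurement window
		return true;
	}

	// Called once the pending measurement has been printed.
	void finish()
	{
		assert(_next_load_print != 0); // finishing without a pending measurement is a caller bug
		_next_load_print = 0;
	}

	int init_count() const { return _init_count; }

private:
	uint64_t _next_load_print{0}; // 0 means "no measurement pending"
	int _init_count{0};
};
```

Two overlapping requests, for example one from file logging and one from log streaming, then result in a single initialization; a second `initialize()` before `finish()` returns false and leaves the running sample untouched.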

platforms/nuttx/src/px4/common/print_load.cpp

When the CPU load monitor is started while already running, then the idle thread last_times[0] is reset to the last 1 second, rather than since when the CPU load monitor was last started. The remaining threads are not impacted, since their last_times[i] is reset to zero here. This results in the idle thread having a lower than real CPU load, with the remaining CPU time being wrongly attributed as scheduler load.

niklaut · 2024-01-23T10:03:34Z

For the unbalanced calls, it's due to log streaming overlapping with file logging.

Yes, that fixes it and it's a better fix!

niklaut mentioned this pull request Jan 18, 2024

[Bug] CPU scheduling load of 30% in pre-/postflight log via top once #22655

Closed

niklaut requested a review from bkueng January 18, 2024 16:01

niklaut force-pushed the cpuload_idle branch from 132e9ff to 08d8773 Compare January 18, 2024 16:10

niklaut marked this pull request as ready for review January 18, 2024 16:26

bkueng reviewed Jan 22, 2024


niklaut force-pushed the cpuload_idle branch from 08d8773 to 2444143 Compare January 23, 2024 09:33

niklaut force-pushed the cpuload_idle branch from 2444143 to 0f43594 Compare January 23, 2024 09:55

bkueng approved these changes Jan 31, 2024


bkueng merged commit 103ddb5 into PX4:main Jan 31, 2024
88 checks passed

niklaut deleted the cpuload_idle branch January 31, 2024 08:22
