Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Commander: cpuResourcesCheck: add 2 seconds hysteresis to trigger failure #22304

Merged
merged 3 commits into from
Nov 7, 2023

Conversation

sfuhrer
Copy link
Contributor

@sfuhrer sfuhrer commented Nov 2, 2023

Solved Problem

Momentary CPU spikes are not an issue and should not prevent arming or alarm the operator.

Solution

This PR adds a 2 second hysteresis to the high CPU load preflight check.

Changelog Entry

For release notes:

Improvement: add 2s hysteresis to CPU load check to not trigger on load spikes.

Alternatives

Have high_cpu_load_start_time_us_ as as class member instead of a static variable?

Testing

Tested on a PH4, with COM_CPU_MAX set to 60 and then doing some param changes. On main this then throws warnings, with this PR not anymore.
image

@dagar
Copy link
Member

dagar commented Nov 3, 2023

Alternatively you could add a load average to load_mon.

@tstastny
Copy link
Contributor

tstastny commented Nov 3, 2023

Alternatively you could add a load average to load_mon.

so e.g. putting a filter in load_mon and adding filtered_load field to CpuLoad.msg, then subscribing and keeping the thresholds minus hysteresis in the preflight check?

I don't exactly remember the thought process that brought us to hysteresis over a filter before - but in general I don't have strong opinions on this one.

Just as long as we're targeting the particular behavior we care about - i.e. that momentary spikes shouldn't be flagged, but consistent high load should. Hysteresis seems the most direct way to do that, as it requires the load to be constantly over the threshold for x time. But if you would also want to flag "mostly" over the thresshold for some time (e.g. with some couple down spikes over that hysteresis period) then a filter would be the way to go. If using the filter, then for sure would want it logged, as I suppose we wont expose the filter gains for reconstruction post processing.
^ @bkueng @dagar I would defer to your opinions here.

@MaEtUgR
Copy link
Member

MaEtUgR commented Nov 6, 2023

I'd vote for the filter because of the exact case where you could have constant 100% CPU load with some short downward spikes in between and that would pass the check.

MaEtUgR
MaEtUgR previously approved these changes Nov 6, 2023
Copy link
Member

@MaEtUgR MaEtUgR left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

with some short downward spikes in between and that would pass the check

I covered the case where the CPU is high for 2 seconds and then has downwards spikes by making the hysteresis apply both ways.

MaEtUgR
MaEtUgR previously approved these changes Nov 6, 2023
@tstastny
Copy link
Contributor

tstastny commented Nov 6, 2023

I covered the case where the CPU is high for 2 seconds and then has downwards spikes by making the hysteresis apply both ways.

so this works if the true state was already reached. but not if we have e.g. high (1 sec), single low. high (1 sec). Probably we'd still want to flag this 2 second portion, but in the current case it will reset the timer after the single low.

if we're worried about this being the false negative report.. then probably as @MaEtUgR said the filter is the better option.

@sfuhrer
Copy link
Contributor Author

sfuhrer commented Nov 7, 2023

Thanks for the clean up @MaEtUgR. To me this symmetrical 2s hysteresis seems to be a good compromise. For further refinement of the check I would try to pull in the learnings of real-world failure cases.

@MaEtUgR MaEtUgR merged commit c53e0c4 into main Nov 7, 2023
43 of 87 checks passed
@MaEtUgR MaEtUgR deleted the port-cpu-hysteresis-main branch November 7, 2023 10:19
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants