-
Notifications
You must be signed in to change notification settings - Fork 13.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
FW hardfault on beta #4530
Comments
Is it reproducible? This was probably indoors with no gps? I've seen an ekf_att_pos_estimator hard fault intermittently with fixedwing HIL when I connect QGC and the ekf finally gets a position. I haven't looked into it, but it might be effectively the same conditions. What's at 0x0808be22 in your build? |
Occurred at least twice yesterday under a metal-roofed shelter (fairly good gps signal) (gdb) info line *0x0808be22 |
crash is consistently reproducible on the bench by booting then faking gps at the console: crashes immediately after starting gps |
We're so tight on memory that a few extra lines in your mixer are this much of a problem? I've never looked at the mixer code, but at the very least we could add a conservative length check. |
The description of that commit says "unused lines... clobbering RAM". Seems like that should crash the IO processor though, not the FMU. |
@AndreasAntener I guess I shouldn't have called f7ac1f0 a "bugfix", since it changes the input files, not the parser (or whatever was blowing RAM). Is is also necessry to "strip" the comment lines from a custom mixer to avoid corrupting RAM? I'd be willing to spend some time on this if it will provide better protection from crashes due to bad mixer files. Especially since custom mixer files are the only mechanism for adjusting servo travel and endpoints. |
@dagar Apparently, the hardfault is not related to the mixer issues. I'm still seeing it when GPS is either lost or re-acquired (can't tell which):
|
For the mixers in ROMFS/px4fmu_common/mixers they're stripped when copied To sidetrack this bug further, it would be great to have a tool to help On Sat, May 14, 2016 at 9:58 AM, Mark Whitehorn notifications@github.com
|
I'm fairly certain this is the same bug I get with fixed wing HIL fairly On Sat, May 14, 2016 at 10:21 AM, Mark Whitehorn notifications@github.com
|
Do you see it as a significant risk? Or do you still have manual control available with the FMU crashed? |
Just verified that the IO coprocessor is still working after the hardfault, and manual RC control is OK. Except my gimbal doesn't work after removing those lines from the main mixer. Will try putting them back now. |
EKF crash fixed by 237bdfd Closing but note there are unanswered questions about custom mixers |
Reopening: observed another hardfault on the bench while trying to get RC failsafe configured for OBC rules:
|
|
@LorenzMeier I guess the two (identical?) hardfaults above are also due to RAM corruption in EKF, but I don't see a good way to track it down. Looks like a subscription handle was trashed. Is there a way to monitor stack limits? |
What about CONFIG_ARMV7M_STACKCHECK? |
Thanks, I was just trying to build with that; is that the only file that needs to change? |
If you're using a pixhawk you'll likely exceed the size limit and need to disable several modules. mpu9250 and uavcan are good candidates. |
I'm getting a hardfault early in the boot process:
probably don't have the right compiler switches set; how do I look at the actual compiler commands generated by cmake? |
From the actual build directory
then
or
It's entirely possible this is going to find lots of little potential issues that haven't bitten us yet. If that's the case once we get a hardware test harness back in the builds (#4532) we should have a variation with STACKCHECK enabled. |
it looks like these switches are missing: (from nsh/Make.defs)
|
@dagar I've had no luck so far trying to get those compiler switches applied. Can you give any advice on how to accomplish it? |
I can give it a try tonight. |
This last fault isn't occurring very often, and I don't know what triggers it. But I was talking about building with stack checking; neither I nor google search is up to the task of figuring out why the build doesn't have those compiler switches specified. |
I was going to try both. For stack check I'm guessing we just need to get the instrumentation flags into the rest of the PX4 cmake build. |
Thanks, nothing I tried had any effect on the cmake build. |
Thanks @davids5 that seems to be working. |
@davids5 Thanks, I was able to understand the patch.
|
I started fixing all the stack sizes in the stackcheck branch on master - https://github.com/PX4/Firmware/tree/stackcheck I had to strip down px4fmu-v2_default to just fixed wing controllers, and estimator to get it building. At the moment there's still a weird hard fault when the sensors main exits. I don't know if this is getting us closer to fixing your issue, but the number of issues is concerning. We should think about either keeping stack check on for px4fmu-v4 or have a special _stackcheck build that's auto tested. @davids5 is there a way to get a good stack trace where the hard fault occurs? |
Observed another hardfault on FW in ekf_att_pos_estimator context (local branch rebased on recent master ):
|
master's version of the PR is wrong - it is missing the second half of the changes beta does not have the PR Please push the rebased version you are working on - so I can pull and test. Also please tell me what Also any tips on get to the crash |
@LorenzMeier, @kd0aij, @dagar @bkueng If the build and testing was done for stackchecking before #4593 Any mismatch in Nuttx CONFIG_ARMV7M_STACKCHECK and px4 notion of the instrumentation setting is highly unstable to say the least. |
@davids5 I think it's okay. It looks like the stack check instrument_flags were lost in a bad cherry-pick 5da9e7e#diff-23a5ada890263edf0da49d40d9c58f8c which happened after the stack checking. I verified a few of the recent stack changes with #4593. I should emphasize again that this was not at all exhaustive and I only checked a FW config at startup. We should still run through FW, MC, VTOL with stack check enabled and actual usage. |
@davids5 I created the PR for which my faulting branch was intended: #4594 fault occurred with a px4fmu-v2_default build, SYS_AUTOSTART=2101 |
@kd0aij Can you please upload the binary next time this happens? |
Hunted down to and fixed in fe60a43. Testing now. |
Confirmed to fix it. @kd0aij Please let me know if any issues remain and please try to test. |
running beta now |
@LorenzMeier Regarding the use of an uninitialized pointer as an ORB topic handle, the result in the publish method would be to overwrite an arbitrary memory location, not "stack smashing", if I understand the term correctly. Perhaps the ORB should be maintaining a list of valid handles in order to perform run-time validation, since that seems to be the only way to guarantee that it doesn't trash memory. |
I realize this isn't helpful now, but out of curiosity I built and ran beta in the SITL gazebo simulation with address sanitizer enabled and our typical optimization level (-0s) and it caught the problem immediately. Another entry for the testing wishlist...
|
@dagar Is an "address sanitizer" able to catch all usages of uninitialized pointers? I assume the answer is no, else the compiler would be doing it. |
Address santizier (ASan) adds instrumentation, but you need to actually run the code (http://clang.llvm.org/docs/AddressSanitizer.html). The tool can detect the following types of bugs:
ASan happened to make this bug obvious (just running the right sitl simulation in gdb would have too), but memory sanitizer is actually what we'd want for catching uninitialized reads. |
observed on bench while setting up for a test flight
https://drive.google.com/open?id=0Bw3digSMQXDuQjR1MHJTWDNnbGM
The text was updated successfully, but these errors were encountered: