Unpredictable compilation behavior #20105
Comments
Until we have a better understanding, I would try to collect enough information to debug these as separate issues. For the H7 there was an alignment fix that went into v1.13: #19724
Thanks for the reply. We're going to apply the H7 fix to a firmware that is known bad to see whether it resolves the issue. For the Kakute F7, I have an ST-Link and an FTDI adapter, so I can debug using either method. I will hopefully do that this weekend.
Here is the serial console log of the Kakute F7 boot (the board crashes when you try to use the shell):

NuttShell (NSH) NuttX-10.2.0
ERROR [mixer] failed to load mixer
NuttShell (NSH) NuttX-10.2.0
The NuttX shell works perfectly fine over the serial console, by the way.
If I try to use the shell over MAVLink, I get a hard fault on the serial console (below). According to the PC (R15), this happens in nsh_session where the greeting is written to the output stream; the calling function, according to R14, is nsh_consolemain.

arm_hardfault: Hard Fault:
Removing the mixer module and using the control allocator module instead makes the shell over MAVLink work without a problem. It still gave a hard fault when clicking through the tabs of QGroundControl (see log below). According to the PC (R15) in the dump and the objdump of the ELF file, it crashed inside memset. The calling function according to R14 is uORB::DeviceNode::Write.

arm_hardfault: Hard Fault:
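For anyone reproducing this analysis: mapping a faulting PC or LR back to a function is typically done with `arm-none-eabi-addr2line -e` against the build's ELF, or by scanning a sorted symbol table from `arm-none-eabi-nm -n`. The sketch below shows the latter lookup in Python; the symbol addresses and names are made up for illustration, not taken from the real firmware.

```python
import bisect

# Hypothetical symbol table, as produced by `arm-none-eabi-nm -n px4.elf`
# and sorted by address (addresses here are illustrative only).
SYMBOLS = [
    (0x08020000, "Reset_Handler"),
    (0x080A1000, "memset"),
    (0x080B2000, "uORB::DeviceNode::write"),
    (0x080C3000, "nsh_session"),
    (0x080C4000, "nsh_consolemain"),
]

def symbol_for(pc):
    """Return the name of the symbol whose range contains address `pc`,
    i.e. the last symbol starting at or before `pc`."""
    addrs = [addr for addr, _name in SYMBOLS]
    i = bisect.bisect_right(addrs, pc) - 1
    if i < 0:
        return None  # pc lies before the first known symbol
    return SYMBOLS[i][1]

# A faulting PC from a register dump falls inside memset's range:
print(symbol_for(0x080A1042))  # -> memset
```

This is the same reasoning used above to attribute the crash to memset via R15 and to uORB::DeviceNode::Write via R14, just automated.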
@dagar Any idea what could cause different threads to run out of memory after changing a (supposedly) completely unrelated part of the code?
Describe the bug
We're using a build server that uses the official Docker containers to build the firmware and does a clean pull every time a build is started. Lately, we've noticed unpredictable behavior in the builds. We're building v1.11.3 with some additions for our own drone, which runs a Cube Orange. We've also tried different machines, but they all show the same problems.
Two firmware builds whose only difference is the values of some tuning parameters in the startup script behave very differently. The version before the change works fine, no issues. The version after the change gives a lot of UAVCAN errors.
We've noticed that the binaries aren't the same size (the second one is 8 bytes larger). If we take the working version and perform the same 8-byte shift at the same location in flash before uploading (adjusting the address pointers so the binaries are identical apart from git hash and build timestamps), the exact same problem occurs.
The 8-byte difference turned out to come from the compilation timestamp of certain C files being embedded in the binary. If the timestamp (second accuracy) is the same for all of them, they all point to a single timestamp string in the binary; if they differ, 9 bytes are added to store the second timestamp, and because of alignment a single padding byte is removed at 0x166b8b, for a net difference of 8 bytes. The timestamp was added at 0x139872 (relative to 0x08020000). These offsets might be case-specific, but they might also indicate a problem somewhere.
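Locating differences like this comes down to byte-diffing the two images (e.g. `cmp -l a.bin b.bin`, or a short script). A minimal sketch of that comparison, with toy "binaries" standing in for the real firmware images:

```python
def diff_runs(a: bytes, b: bytes):
    """Yield (offset, length) for each run of differing bytes.

    Sketch only: compares the common prefix byte by byte and reports any
    trailing size difference as one final run. A real firmware diff also
    has to account for insertions (like the 8 extra bytes described
    above), which shift every byte after the insertion point.
    """
    n = min(len(a), len(b))
    start = None
    for i in range(n):
        if a[i] != b[i]:
            if start is None:
                start = i
        elif start is not None:
            yield (start, i - start)
            start = None
    if start is not None:
        yield (start, n - start)
    if len(a) != len(b):
        yield (n, abs(len(a) - len(b)))  # size mismatch at the tail

# Two toy images that differ only in an embedded one-second timestamp:
img1 = b"\x00" * 16 + b"12:00:01" + b"\x00" * 16
img2 = b"\x00" * 16 + b"12:00:02" + b"\x00" * 16
print(list(diff_runs(img1, img2)))  # -> [(23, 1)]
```

Once the differing offsets are known, mapping them back to sections and symbols (via the linker map file or objdump) identifies what the bytes belong to, which is how a stray `__TIME__`-style timestamp shows itself.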
Another way we notice this: a later build with similarly minimal changes seems to freeze at touchdown. We've made RAM dumps and verified with a debugger that at least most of the PX4 code is still running (i.e. interrupts and at least some threads are still active). There is just no communication with the outside world: UAVCAN stops sending messages, the UARTs stop sending, and the logger stops as well.
Using the RAM dumps, we were able to verify that incoming messages are still being put into the rx buffers of the UARTs (which indicates that the peripherals and DMAs were still running after outward communication stopped). The tx buffer of the telemetry radio, for instance, contains only the messages up to the last one we received over telemetry. We've also checked the BSRAM to see if any hard faults were logged, but it was empty.
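The rx-buffer check described above amounts to reading a circular buffer's write and read indices out of the RAM dump and seeing whether data is piling up. A minimal sketch of that occupancy calculation, assuming a head/tail ring buffer of the kind NuttX serial drivers use (the helper itself is illustrative, not lifted from the NuttX source):

```python
def pending_bytes(head: int, tail: int, size: int) -> int:
    """Bytes waiting in a circular buffer where `head` is the write
    index (advanced by the DMA/ISR) and `tail` is the read index
    (advanced by the consumer). Wrap-around is handled by the modulo."""
    return (head - tail) % size

# Consistent with the RAM dump: the DMA kept advancing `head` while the
# consumer's `tail` stayed frozen, so unread data accumulates in rx.
print(pending_bytes(head=300, tail=44, size=512))  # -> 256
```

A frozen tail with a still-moving head points at the consumer side (threads draining the buffers) rather than the peripherals, matching the observation that interrupts and DMA were alive while outward traffic stopped.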
I personally also have a PX4 drone running 1.13 on a Holybro Kakute F7 with some modules turned off to make room for the EKF. Minor changes in these builds also result in unpredictable behavior: sometimes the GNSS driver refuses to start, or the entire NuttX shell stops working. Trying to reach the NuttX shell through QGroundControl then results in a hard crash of the flight controller, in which the USB connection is dropped by the flight controller. These firmwares are built on a completely different machine.
At first we suspected our own code changes, but since the same thing happens on my personal drone with different hardware, and we've heard other people on Slack reporting the same problems, we believe something is going fundamentally wrong in the build process.
How can we best proceed to further investigate the cause of this issue?