Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Unpredictable compilation behavior #20105

Open
Etruscian opened this issue Aug 24, 2022 · 7 comments
Open

Unpredictable compilation behavior #20105

Etruscian opened this issue Aug 24, 2022 · 7 comments

Comments

@Etruscian
Copy link
Contributor

Etruscian commented Aug 24, 2022

Describe the bug

We're using a build server that uses the official docker containers to build the firmware and does a clean pull every time a build is started. Lately, we've noticed unpredictable behavior with builds. We're building v1.11.3 with some additions for our own drone that runs a cube orange. We've also tried different machines but they all give the same problems.

Two firmwares where the only difference is the values of some tuning parameters in the start up script behave very differently. The version before the change works fine, no issues. The version after the change gives a lot of UAVCAN errors.

We've noticed that the binaries aren't the same size (the second one is 8 bytes larger). If we take the working version and perform the same shift at the same location in the flash before uploading (with modifying the address pointers so the binaries are identical apart from git hash and build timestamps), the exact same problem occurs.

The difference of 8 bytes is determined to be due to the timestamp of compilation of certain c files being set in the binary. If this is the same timestamp (second accuracy), they all point to 1 timestamp in the binary. if these are different, 9 bytes get added to store the second timestamp. This timestamp was added at 0x139872 (relative to 0x08020000) and because of alignment, a single padding byte is removed at 0x166b8b. These offsets might be case specific, but might also indicate a problem somewhere.

Another way we notice this is that a later build with also minimal changes seems to freeze at touch down. We've made ram dumps and verified with a debugger that at least most of the PX4 code is still running (i.e. interrupts and at least some threads are still running). There is just no communication with the outside world. UAVCAN stops sending messages, uarts stop sending messages and the logger stops as well.

Using the ram dumps, we were able to verify that the incoming messages are still put in the rx buffers of the uarts (which indicates that the peripherals and DMA's are still running when the outward communication stopped). The tx buffer of the telemetry radio for instance contains only all the messages up to the last one we received over telemetry. We've also checked the BSRAM to see if any hard faults are logged but it was empty.

I personally also have a PX4 drone that runs 1.13 using a holybro kakute f7 with some modules turned off to enable the ekf. Minor changes in these builds also result in unpredictable behavior where sometimes the gnss driver refuses to start or the entire nuttx shell stops working. Trying to communicate to the nuttx shell using qgroundcontrol then results in a hard crash of the flight controller where the usb connection is stopped by the flight controller. These firmwares are built on a completely different machine.

At first we were thinking of it being due to our code changes, but since it is also happening for my personal drone with different hardware and we've heard other people in slack noting the same problems, we believe that something is fundamentally going wrong in the build process.

How can we best proceed to further investigate the cause of this issue?

@dagar
Copy link
Member

dagar commented Aug 24, 2022

Until we have a better understanding I would try to collect enough information to debug these as separate issues.

For H7 there was an alignment fix that went into v1.13. #19724

  • Cube Orange
    • can you reproduce with v1.13 or current main branch?
  • Kakute f7
    • would you be able to connect a serial console (and possibly debugger) to take a closer look?
    • nuttx shell crashing could be completely unrelated (we might need to increase the stack)

@Etruscian
Copy link
Contributor Author

Thanks for the reply.

We're going to try to apply the fix for the H7 on a firmware that is known bad to see if it fixes the issue.
Reproducing the uavcan issue on 1.13 should be doable, we'll try. The freeze issue is harder since we need to fly to reproduce it and we would need to port our changes to 1.13 before we can fly safely.

For the kakute f7, I have an st link and ftdi adapter, so I can debug using either method. I will do that this weekend hopefully

@Etruscian
Copy link
Contributor Author

Etruscian commented Aug 26, 2022

Here is the serial console log of the boot of the kakute f7 (where it crashes when you try to use the shell.

NuttShell (NSH) NuttX-10.2.0
nsh>
ERROR [PX4_MTD] failed to locate spi bus
ERROR [PX4_MTD] mtd failure: -6 bus 2 address 0 class 1
nsh: ver: command not found
nsh: hardfault_log: command not found
nsh: mft: command not found
nsh: mft: command not found
INFO [param] selected parameter backup file /fs/microsd/parameters_backup.bson
nsh: ver: command not found
Board architecture defaults: /etc/init.d/rc.board_arch_defaults
Board defaults: /etc/init.d/rc.board_defaults
ERROR [param] Parameter ATT_ACC_COMP not found.
ERROR [param] Parameter ATT_W_ACC not found.
ERROR [param] Parameter ATT_W_GYRO_BIAS not found.
Loading airframe: /etc/init.d/airframes/4001_quad_x
nsh: tone_alarm: command not found
sercon: Registering CDC/ACM serial driver
sercon: Successfully registered the CDC/ACM serial driver
INFO [dataman] data manager file '/fs/microsd/dataman' size is 62560 bytes
nsh: send_event: command not found
nsh: rgbled: command not found
nsh: rgbled_ncp5623c: command not found
Board sensors: /etc/init.d/rc.board_sensors
mpu6000 #0 on SPI bus 4 rotation 6
bmp280 #0 on I2C bus 1 (external) address 0x76
nsh: icm20948_i2c_passthrough: command not found
nsh: hmc5883: command not found
nsh: ist8308: command not found
nsh: ist8310: command not found
nsh: lis2mdl: command not found
nsh: lis3mdl: command not found
qmc5883l #0 on I2C bus 1 (external) address 0xD
nsh: rm3100: command not found
nsh: ak09916: command not found
nsh: pwm_out: command not found
ERROR [parameters] get: param 65535 invalid
ERROR [parameters] get: param 65535 invalid
ERROR [parameters] get: param 65535 invalid
ERROR [parameters] get: param 65535 invalid
ERROR [mixer] can't open /dev/pwm_output0

ERROR [mixer] failed to load mixer
ERROR [init] Failed loading mixer: /etc/mixers/quad_x.main.mix
nsh: tune_control: command not found
ekf2 [361:237]
nsh: mc_hover_thrust_estimator: command not found
INFO [mavlink] mode: Normal, data rate: 1200 B/s on /dev/ttyS0 @ 115200B
Board extras: /etc/init.d/rc.board_extras
INFO [logger] logger started (mode=all)
INFO [logger] Start file log (type: full)

NuttShell (NSH) NuttX-10.2.0
nsh> INFO [logger] [logger] /fs/microsd/log/sess142/log100.ulg
INFO [logger] Opened full log file: /fs/microsd/log/sess142/log100.ulg

@Etruscian
Copy link
Contributor Author

The nuttx shell works perfectly fine over the serial console btw

@Etruscian
Copy link
Contributor Author

Etruscian commented Aug 26, 2022

If I try to use the shell over mavlink, I get a hard fault over the serial console (below).

According to the PC(R15), this happens in nsh_session where the greeting is put in the output stream. The calling method is nsh_consolemain according to R14.

INFOarm_hardfault: Hard Fault:
arm_hardfault: IRQ: 3 regs: 0x2004696c
arm_hardfault: BASEPRI: 000000f0 PRIMASK: 00000000 IPSR: 00000003 CONTROL: 00000000
arm_hardfault: CFAULTS: 00008200 HFAULTS: 40000000 DFAULTS: 00000000 BFAULTADDR: 00000294 AFAULTS: 00000000
arm_hardfault: PANIC!!! Hard fault: 40000000
up_assert: Assertion failed at file:armv7-m/arm_hardfault.c line: 135 task: mavlink_shell
up_registerdump: R0: 00000000 00000001 00000000 00000000 00000000 00000000 00000000 00000000
up_registerdump: R8: 00000000 00000000 00000000 00000000 00000000 20046a40 0801d513 0801dd82
up_registerdump: xPSR: 21000000 BASEPRI: 000000f0 CONTROL: 00000000
up_registerdump: EXC_RETURN: ffffffe9
up_dumpstate: sp: 20020b30
up_dumpstate: IRQ stack:
up_dumpstate: base: 200209c0
up_dumpstate: size: 00000200
up_dumpstate: used: 00000188
up_stackdump: 20020b20: 20020b30 20020bc0 080d3d91 08021e1f 000000f0 00000000 00000000 00000000
up_stackdump: 20020b40: 20046a40 0801d513 0801dd82 0000002d e000ed2c e000ed2c 00000294 00000000
up_stackdump: 20020b60: 00000000 00000000 00000000 00000000 00000000 08019af7 0801a6e1 080195a3
up_stackdump: 20020b80: 40000000 00000000 00000294 00000000 20020bc0 00000003 00000003 0801ad7f
up_stackdump: 20020ba0: 000000f0 08018329 000000f0 2004696c 00000000 00000000 00000000 08018217
up_dumpstate: sp: 20046a40
up_dumpstate: User stack:
up_dumpstate: base: 200462b0
up_dumpstate: size: 000007e8
up_dumpstate: used: 00000134
up_stackdump: 20046a40: 00000000 00000001 00000000 00000000 08075aad 00000000 00000000 0801d513
up_stackdump: 20046a60: 00000000 00000000 20024ae0 08075abf 00000000 080244d5 08075aad 00000001
up_stackdump: 20046a80: 00000000 0801bebb 00000000 00000000 00000000 00000000 00000260 00000820
up_taskdump: Idle Task: PID=0 Stack Used=352 of 726
up_taskdump: hpwork: PID=1 Stack Used=332 of 1264
up_taskdump: lpwork: PID=2 Stack Used=892 of 1616
up_taskdump: init: PID=3 Stack Used=2168 of 2928
up_taskdump: wq:manager: PID=4 Stack Used=420 of 1256
up_taskdump: dataman: PID=133 Stack Used=820 of 1208
up_taskdump: wq:lp_default: PID=135 Stack Used=756 of 1920
up_taskdump: wq:hp_default: PID=168 Stack Used=1156 of 1904
up_taskdump: gps: PID=553 Stack Used=1228 of 1680
up_taskdump: wq:nav_and_controllers: PID=266 Stack Used=1188 of 2240
up_taskdump: wq:rate_ctrl: PID=267 Stack Used=2268 of 3152
up_taskdump: wq:INS0: PID=268 Stack Used=4372 of 6000
up_taskdump: wq:SPI4: PID=173 Stack Used=1472 of 2336
up_taskdump: navigator: PID=782 Stack Used=1068 of 1920
up_taskdump: wq:I2C1: PID=175 Stack Used=876 of 2336
up_taskdump: mavlink_if0: PID=752 Stack Used=1932 of 2728
up_taskdump: mavlink_rcv_if0: PID=754 Stack Used=1476 of 4576
up_taskdump: mavlink_if1: PID=851 Stack Used=1860 of 2824
up_taskdump: mavlink_rcv_if1: PID=852 Stack Used=1756 of 4576
up_taskdump: commander: PID=278 Stack Used=1308 of 3224
up_taskdump: mavlink_shell: PID=855 Stack Used=308 of 2024
up_taskdump: logger: PID=824 Stack Used=2668 of 3648
up_taskdump: log_writer_file: PID=825 Stack Used=596 of 1176
up_taskdump: wq:ttyS4: PID=764 Stack Used=772 of 1632

@Etruscian
Copy link
Contributor Author

Removing the mixer module and instead using the control allocator module makes the shell over mavlink work without a problem. It still gave a hard fault when clicking through the tabs of qgroundcontrol (see log below).

According to the PC (R15) in the dump and the obj dump of the elf file, it crashed inside memset. The calling method according to R14 is uORB::DeviceNode::Write.

arm_hardfault: Hard Fault:
arm_hardfault: IRQ: 3 regs: 0x20049eb4
arm_hardfault: BASEPRI: 000000f0 PRIMASK: 00000000 IPSR: 00000003 CONTROL: 00000000
arm_hardfault: CFAULTS: 00000400 HFAULTS: 40000000 DFAULTS: 00000000 BFAULTADDR: 00000000 AFAULTS: 00000000
arm_hardfault: PANIC!!! Hard fault: 40000000
up_assert: Assertion failed at file:armv7-m/arm_hardfault.c line: 135 task: mavlink_rcv_if1
up_registerdump: R0: 00000000 00000000 000001b8 00000000 000001c0 00000000 00000000 00000000
up_registerdump: R8: 2004a110 00000001 00000001 00000001 000001c0 20049f88 080ca675 08025bf0
up_registerdump: xPSR: 21000200 BASEPRI: 000000f0 CONTROL: 00000000
up_registerdump: EXC_RETURN: ffffffe9
up_dumpstate: sp: 20020b30
up_dumpstate: IRQ stack:
up_dumpstate: base: 200209c0
up_dumpstate: size: 00000200
up_dumpstate: used: 00000188
up_stackdump: 20020b20: 20020b30 20020bc0 080d3b39 08021e1f 000000f0 00000000 00000001 000001c0
up_stackdump: 20020b40: 20049f88 080ca675 08025bf0 0000002d e000ed2c e000ed2c 00000000 00000000
up_stackdump: 20020b60: 00000000 2004a110 00000001 00000001 00000001 08019af7 0801a6e1 080195a3
up_stackdump: 20020b80: 40000000 00000000 00000000 00000000 20020bc0 00000003 00000003 0801ad7f
up_stackdump: 20020ba0: 000000f0 08018329 000000f0 20049eb4 00000000 00000000 00000000 08018217
up_dumpstate: sp: 20049f88
up_dumpstate: User stack:
up_dumpstate: base: 200491e8
up_dumpstate: size: 000011e0
up_dumpstate: used: 000006d4
up_stackdump: 20049f80: 80000010 2004a110 00000038 2002c420 2004a110 00000038 00000000 080ca675
up_stackdump: 20049fa0: 2002c420 000001c0 2004a110 080ca62d 080fd804 00000000 000000f1 080ca78f
up_stackdump: 20049fc0: 080f6a38 20048414 2004a110 08073b55 20047eb8 2004a0ec 2004a264 08085ee9
up_stackdump: 20049fe0: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
up_stackdump: 2004a000: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
up_stackdump: 2004a020: 00000000 00000000 00000000 00000000 00000000 2004a108 000000f0 00000000
up_stackdump: 2004a040: 20049040 000000f0 20049074 2004a130 00000001 0000000a 00000000 ffffffe9
up_stackdump: 2004a060: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
up_stackdump: 2004a080: 00000000 2004a158 000000f0 20048600 20047eb8 20048600 00000000 00000000
up_stackdump: 2004a0a0: 00000000 0000002c 00000000 ffffffe9 2004a264 20048724 00000000 2004a264
up_stackdump: 2004a0c0: 20047eb8 000000f1 00000000 00000036 00000000 0000002c 00000000 080861a7
up_stackdump: 2004a0e0: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
up_stackdump: 2004a100: 00000000 00000000 010100f1 01000000 03aca4db 00000000 00000000 00000000
up_stackdump: 2004a120: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 000000f1
up_stackdump: 2004a140: beff0101 00000100 2004a264 20047eb8 20048600 080889ed 00000000 20024fe8
up_stackdump: 2004a160: 20047eb8 08089157 0000002b 00000001 deadbeef deadbeef 000028ce 00000000
up_stackdump: 2004a180: 03ac8b66 00000000 deadbeef deadbeef deadbeef deadbeef deadbeef deadbeef
up_stackdump: 2004a1a0: deadbeef 00000003 00000101 00000000 00000000 00000000 000020fd 4cbeffe7
up_stackdump: 2004a1c0: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00f10000
up_stackdump: 2004a1e0: 36d10101 00000000 00000000 00000000 00000000 0101ff00 03ac8b66 00000000
up_stackdump: 2004a200: 000000b5 00000005 0000009c 000000c1 00000000 009c001e 0000001a 00000000
up_stackdump: 2004a220: deadbeef deadbeef deadbeef deadbeef deadbeef deadbeef deadbeef deadbeef
up_stackdump: 2004a240: deadbeef deadbeef deadbeef deadbeef deadbeef deadbeef deadbeef deadbeef
up_stackdump: 2004a260: deadbeef 20fd36d1 ffe70000 00004cbe 00000000 00000000 00000000 00000000
up_stackdump: 2004a280: 00000000 00000000 00000000 010100f1 00000000 00000000 00000000 00000000
up_stackdump: 2004a2a0: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 01000000
up_stackdump: 2004a2c0: 00000001 00000000 00000000 00000000 00000000 00000000 00000000 00000000
up_stackdump: 2004a2e0: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
up_stackdump: 2004a300: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
up_stackdump: 2004a320: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
up_stackdump: 2004a340: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
up_stackdump: 2004a360: 00000000 00000000 00000000 00000000 00000000 00000000 000036d1 00000000
up_stackdump: 2004a380: 00000000 20000000 200242d0 00000000 00000000 00000000 00000000 00000000
up_stackdump: 2004a3a0: 00000000 00000000 00000000 00000000 00000000 08089343 0808933d 08024311
up_stackdump: 2004a3c0: 08024309 00000000 00000040 80001200 080f5b5c 20000001 00000000 20046b80
up_taskdump: Idle Task: PID=0 Stack Used=352 of 726
up_taskdump: hpwork: PID=1 Stack Used=332 of 1264
up_taskdump: lpwork: PID=2 Stack Used=908 of 1616
up_taskdump: init: PID=3 Stack Used=2168 of 2928
up_taskdump: wq:manager: PID=4 Stack Used=452 of 1256
up_taskdump: wq:I2C1: PID=165 Stack Used=840 of 2336
up_taskdump: mavlink_rcv_if0: PID=742 Stack Used=1572 of 4576
up_taskdump: commander: PID=263 Stack Used=1308 of 3224
up_taskdump: logger: PID=813 Stack Used=2596 of 3648
up_taskdump: log_writer_file: PID=814 Stack Used=652 of 1176
up_taskdump: gps: PID=527 Stack Used=1228 of 1680
up_taskdump: wq:ttyS4: PID=754 Stack Used=764 of 1632
up_taskdump: mavlink_if1: PID=820 Stack Used=1844 of 2824
up_taskdump: mavlink_rcv_if1: PID=821 Stack Used=1748 of 4576
up_taskdump: dataman: PID=118 Stack Used=820 of 1208
up_taskdump: mavlink_shell: PID=823 Stack Used=1028 of 2024
up_taskdump: wq:hp_default: PID=152 Stack Used=1024 of 1904
up_taskdump: navigator: PID=761 Stack Used=1092 of 1920
up_taskdump: wq:nav_and_controllers: PID=250 Stack Used=1276 of 2240
up_taskdump: wq:rate_ctrl: PID=251 Stack Used=2268 of 3152
up_taskdump: wq:INS0: PID=252 Stack Used=4372 of 6000
up_taskdump: wq:lp_default: PID=125 Stack Used=868 of 1920
up_taskdump: wq:SPI4: PID=158 Stack Used=1500 of 2336
up_taskdump: mavlink_if0: PID=735 Stack Used=1844 of 2728

@Etruscian
Copy link
Contributor Author

@dagar Any idea what could cause different threads to go out of memory by changing a (supposedly) completely unrelated part of the code?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants