Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Copter: spi_timeout on 4.0rc2 #12773

Closed
andyp1per opened this issue Nov 5, 2019 · 63 comments
Closed

Copter: spi_timeout on 4.0rc2 #12773

andyp1per opened this issue Nov 5, 2019 · 63 comments

Comments

@andyp1per
Copy link
Collaborator

andyp1per commented Nov 5, 2019

I have a user who is reporting the error:
“prearm: Internal Error 0x4000”
This is spi_fail which is triggered by an spi timeout. The code says this can never happen but I know that tridge just changed the timeout value. This user also had telemetry issues.

He was using 4.0rc2 with my FFT PR so it's possible that the PR is the culprit, but I thought that @tridge and @rmackay9 woudl want to know just in case it's something serious

Bug report

Issue details

Please describe the problem

Version
Copter 4.0rc2 + PR11886

Platform
[ ] All
[ ] AntennaTracker
[ X] Copter
[ ] Plane
[ ] Rover
[ ] Submarine

Airframe type
Quad

Hardware type
OmnibusNano v6

Logs
2019-11-04 16-57-37CRASH.log

@andyp1per andyp1per added the BUG label Nov 5, 2019
@tridge
Copy link
Contributor

tridge commented Nov 5, 2019

I don't see that pre-arm error in that log?

@1wick
Copy link

1wick commented Nov 5, 2019

Hello,
It was me.. I did not have prearm logging enabled.
I put 4.0rc2 on it after the trouble, and it seems ok.
I will put Andy's firmware back on it later when I have some time to see if I get the same results. If so, I'll be sure to enable prearm logging and send in a log.
I got a couple internal error 0x8000 and a bunch of 0x4000's. Telemetry was buggy.. Not all info sent. Downloading parameters took a long time.

Also, there was no "crash".. I thought I had lost RC control of the copter while it was conveniently drifting back towards my yard, and I killed the motors when it was over my yard. But, what was actually happening is I lost telemetry, and a GCS Failsafe was triggered. I didn't see that on my OSD, and had no idea what was going on, so basically manually crashed the copter to avoid what I though was a flyaway. But, turns out it was just trying to fly itself home....
However, I think the buggy telemetry is what triggered the GCS failsafe.. My RC/Crossfire logs show I had a great signal the entire time, and I wasn't far from myself, so I don't think it was a radio issue.

@andyp1per
Copy link
Collaborator Author

@tridge FFT was disabled so no thread there. Internal errors also include 0x8000!
2019-11-04 16-49-34CRASH.tlog.zip
2019-11-04 16-57-37CRASH.bin.zip

@1wick
Copy link

1wick commented Nov 5, 2019

The 0x8000 error happened after I calibrated level. But then I rebooted, and got the 0x4000 errors.
Then I rebooted again, and was able to arm.
FFT was disabled.. I was going to test ESC based filter. (Which is working great in regular 4.0rc2)

FYI, if you look at the tlog, RSSI is odd because I use crossfire.. It has "mode2" which goes up to 200.. Then mode1 is 0-100, so that's why it's over 100 sometimes.
And the telemetry signal is never over about 20, which is also a crossfire thing.. That's basically full signal.

@andyp1per
Copy link
Collaborator Author

It would be useful to know if this is consistent with my changes in which case the onus is on me to find the culprit :)

@1wick
Copy link

1wick commented Nov 5, 2019

I just had a couple flights, and got several GCS failsafes on both. This is on the official 4.0rc2.
I have been using crossfire for a couple years and have never gotten a GCS failsafe. And while it happens, my reception is fine. I don't know exactly what triggers the failsafe, but something changed.
No 0x#### errors.
I'll load your firmware later and see if I get them w/ it again.

@andyp1per
Copy link
Collaborator Author

rc2 includes the failsafe refactoring work. I have a feeling this made GCS failsafe more of a thing but haven't looked too closely

@1wick
Copy link

1wick commented Nov 5, 2019

Oh, yeah. First thing he mentioned in the rc2 post.
OK, good to know.

@andyp1per
Copy link
Collaborator Author

I ran this on my KakuteF7Mini - absolutely no issues at all. FFT tracked the ESC's nicely, no errors. No internal errors, so I conclude that this is hardware related. Log attached.
19-11-05_18-35-38_2.bin.zip

@andyp1per
Copy link
Collaborator Author

@1wick The OmnibusNano V6 has a specific telemetery UART like the Kakute but on the Kakute we disable DMA because of BLHeli issues - I wonder if this is related, would you be ok trying a build that disables this on the Nano?

@1wick
Copy link

1wick commented Nov 5, 2019

I want to put the 'problem' firmware back on it, to see if the issue is repeatable.
Then sure, I'd be happy to try the nodma version.
Busy day though.. Not sure I'll get to it for a day or two.

@andyp1per
Copy link
Collaborator Author

I tried this on an Omnibus F4 Pro - no issues at all.

@Pedals2Paddles
Copy link
Contributor

I have nothing useful to add to this but wanted to say yay for the new GCS Failsafe working!

It will trigger when the GCS heartbeat is lost for > 5000ms. So likely a symptom of your telemetry issues.

@andyp1per
Copy link
Collaborator Author

andyp1per commented Nov 10, 2019

@tridge it seems this is repeatable on vanilla 4.0rc2, see #12800

Turning off DMA makes no difference, so something else going on here.

@tridge
Copy link
Contributor

tridge commented Nov 11, 2019

thanks, I've ordered a nano v6 to test with
meanwhile, can someone please test with:
define HAL_SPI_CHECK_CLOCK_FREQ 1
in hwdef.dat?
I suspect we have a bad DMA channel that we need to exclude

@tridge
Copy link
Contributor

tridge commented Nov 11, 2019

with HAL_SPI_CHECK_CLOCK_FREQ set I'd like to see startup msgs on USB

@andyp1per
Copy link
Collaborator Author

../../libraries/AP_HAL_ChibiOS/SPIDevice.cpp: In static member function 'static void ChibiOS::SPIDevice::test_clock_freq()':
../../libraries/AP_HAL_ChibiOS/SPIDevice.cpp:429:61: error: format '%u' expects argument of type 'unsigned int', but argument 3 has type 'uint32_t {aka long unsigned int}' [-Werror=format=]
hal.console->printf("Waiting %u\n", AP_HAL::millis());

Oops!

@1wick
Copy link

1wick commented Nov 11, 2019

with HAL_SPI_CHECK_CLOCK_FREQ set I'd like to see startup msgs on USB

Tridge, What do I use to get the startup messages? Will mavproxy get a log of what you need?

@andyp1per
Copy link
Collaborator Author

I've fixed the compilation error in the PR and removed the NODMA.

@1wick
Copy link

1wick commented Nov 12, 2019

@tridge OK, Andy compiled a version w/ the requested changes.
Mission planner is the same. If I calibrate level, I get 0x8000 errors.

I wasn't sure how to get the USB startup messages you mentioned.. But I did connect the FC to mavproxy and did a simple accel calibration, and it acted odd.. Didn't print a 0x8000 error, but the FC kept resetting.
It's a new FC with nothing attached, so the throttle, compass, and battery warnings are expected.

STABILIZE> accelcalsimple
STABILIZE> Simple accel calAPM: Initialising APM
**
PASSED
Got MAVLink msg: COMMAND_ACK {command : 241, result : 0}
Calibrated
APM: EKF2 IMU0 forced reset
APM: EKF2 IMU0 tilt alignment complete
APM: PreArm: Throttle below Failsafe
APM: PreArm: Compass not healthy
APM: PreArm: Battery 1 low voltage failsafe
APM: PreArm: Throttle below Failsafe
APM: PreArm: EKF attitude is bad
APM: PreArm: Compass not healthy
APM: PreArm: EKF2 still initialising
APM: PreArm: Battery 1 low voltage failsafe
APM: EKF2 IMU0 forced reset
APM: EKF2 IMU0 tilt alignment complete
APM: EKF2 IMU0 forced reset
APM: EKF2 IMU0 tilt alignment complete
APM: PreArm: Throttle below Failsafe
APM: PreArm: DCM Roll/Pitch inconsistent by 26 deg
APM: PreArm: EKF attitude is bad
APM: PreArm: Compass not healthyge failsafe
APM: PreArm: EKF2 still initialising
APM: PreArm: Battery 1 low voltage failsafe
APM: EKF2 IMU0 forced reset
APM: EKF2 IMU0 tilt alignment complete
APM: EKF2 IMU0 forced reset
APM: EKF2 IMU0 tilt alignment complete
APM: EKF2 IMU0 forced reset
APM: PreArm: Throttle below Failsafe
APM: PreArm: DCM Roll/Pitch inconsistent by 25 deg
APM: PreArm: EKF attitude is bad
APM: PreArm: Compass not healthy
APM: PreArm: EKF2 still initialising
APM: PreArm: Battery 1 low voltage failsafe
APM: EKF2 IMU0 tilt alignment complete
APM: EKF2 IMU0 forced reset
APM: EKF2 IMU0 tilt alignment complete
APM: PreArm: Throttle below Failsafe
APM: PreArm: DCM Roll/Pitch inconsistent by 24 deg
APM: PreArm: Compass not healthy
APM: PreArm: Battery 1 low voltage failsafe
no link
link 1 down
link 1 OK
heartbeat OK
APM: EKF2 IMU0 forced reset
APM: EKF2 IMU0 tilt alignment complete

0x8000error

If anyone can give me any clues about how to see the other info you want, I'll get it.

@1wick
Copy link

1wick commented Nov 13, 2019

I want to mention.. I had an issue today that I thought might have been just on Andy's firmware w/ the FFT, but it just did it with regular 4.0rc2.

I often plug my copter in as I'm charging its battery.. Then the FC warms up and it gets GPS lock, etc.
Well, w/ 4.0, after the copter is on for a while its telemetry link is bad..
If I plug it in and connect, it works fine.
If I plug it in for 10 minutes, then try connecting, the parameter download is very slow, I get repeated "Already got [parameter]" messages, and when it finally does connect, a lot of the telemetry info in MP doesn't show up.. This time, voltage and the HUD were working. But none of the gps or altitude or speed, etc, parameters were updating.
It's ok again after a reboot.

-I use crossfire for telemetry which does things a bit differently, so it might be related to crossfire. But, it worked fine in 3.6 and 3.7(from about September).

@andyp1per
Copy link
Collaborator Author

I see a similar thing with logging. If I leave it powered but sitting there for too long I quite often get bad logging errors that only go away when I reboot.

@andyp1per
Copy link
Collaborator Author

@tridge I was not able to get any debug output on an Omnibus F4 Pro with this setting.

I ran:

mavproxy.py --console --master /dev/ttyS14

and did not see the debug, but saw other hal.console->printf() messages. I tried adding a 5s delay, but still no luck. Any other way of getting the console output?

@Huibean
Copy link
Contributor

Huibean commented Dec 2, 2019

I also getting this error in Copter 4.0

@1wick
Copy link

1wick commented Dec 2, 2019

It's 100% repeatable for me..
Calibrate level, get the errors.

But, it's fine after a hard reboot. So, not affecting flight or anything. When I initially had the error, I had other issues too which were unrelated..

@Huibean
Copy link
Contributor

Huibean commented Dec 2, 2019

this is USB output for SPI clock freq test

Waiting for USB
Waiting 1002
Waiting 2002
Waiting 3002
CLOCKS=
1:1080000002:540000003:540000004:1080000005:1080000006:108000000
SPI[1] clock=1685596
SPI[4] FAIL 0x20001110 0x20001518

@andyp1per
Copy link
Collaborator Author

SPI4 is not defined in the OmnibusNanov6 hwdef.dat - maybe that is the problem?

@Huibean
Copy link
Contributor

Huibean commented Dec 2, 2019

@andyp1per suppose not, this is a loop with defined spi_devices, I have defined SPI4 in my case

@andyp1per
Copy link
Collaborator Author

@Huibean please can you try with the vanilla hwdef.dat? A customzied version will not really tell us anything!

@Huibean
Copy link
Contributor

Huibean commented Dec 3, 2019

thanks @rmackay9 , I am trying to find out why the timeout happened

@Huibean
Copy link
Contributor

Huibean commented Dec 3, 2019

@tridge @rmackay9 finally figure out why the error happen
I have print every osalSysLock and osalSysUnlock in i2c and SPI loop, when SPI timeout happens it shows SPI transform exec in i2c's locking process, so that's why it takes so long to finish and make timeout

i2c osalSysLock
i2c osalSysUnlock
�3	�Barometer 1 calibration complete��i2c osalSysLock
i2c osalSysUnlock
SPI osalSysLock
i2c osalSysLock
i2c osalSysUnlock
i2c osalSysLock
i2c osalSysUnlock
i2c osalSysLock
i2c osalSysUnlock
i2c osalSysLock
i2c osalSysUnlock
i2c osalSysLock
i2c osalSysUnlock
SPI osalSysUnlock
SPI timeout!

@andyp1per
Copy link
Collaborator Author

@Huibean on the devcall last night we discussed this - it's clear its very specific to the STM32 chip version, hwdef.dat and hardware - it's very easy to mess this up. For the OmnibusNanov6 we think these are all defined correctly and it might be an error in the STM32 datasheet. For your case it's not so clear. Please can you post all of these?

@andyp1per
Copy link
Collaborator Author

@1wick we really need the debug log to find out which SPI device is causing the issue. Are you running on windows/linux/macos?

@1wick
Copy link

1wick commented Dec 3, 2019

Windows and Linux. I usually use Windows w/ my flight stuff because that's what's on my laptop.. But my desktop is linux.
What do I need to do?
(And like I said before, since this issue only occurs for me during a level calibration, it's not really a big deal.. But if you wan to fix it, I'll definitely help get you some data)

@andyp1per
Copy link
Collaborator Author

@1wick so @Huibean seems to think if you do it on Linux you can cat the serial port at startup and see the debug. I am able to see really early debug through a combination of mavproxy and trying to connect just as the board has booted, but cat might be better because it will probably just block waiting for output. cat /dev/ttyusb or something (not sure what the usb port would be called on linux).

@1wick
Copy link

1wick commented Dec 3, 2019

Well, getting closer..
Found the USB port, cat. I get output, but not sure it's anything useful.

-This is on my spare FC w/ nothing attached.

Waiting for USB
Waiting 1003
Waiting 2004
Waiting 3005
CLOCKS=
1:84000000 2:42000000 3:42000000 4:84000000 5:84000000 6:84000000
SPI[1] clock=1311139
SPI[2] clock=1310929
SPI[3] clock=1311139
Storage: Using flash pages 1 and 2

Init ArduCopter V4.0.0-dev (5f202a47)

Free RAM: 109536
FW Ver: 120

load_all took 16986us
0 0 0 AP_OSD: osd font is up to date.
Started MAX7456 OSD
�Initialising APM
�SPI Flash 0x00EF4018 found pages=65536 erase=64k
AP_Logger_Block: buffer size=16384
�3��AK09916: Reset Failed
No Compass backends available
Compass initialisation failed

MPU: temp reset IMU[0] -1012 -66
*�3�Initialising APM�s�oi�T]r

�u��3��f�3�EKF2 IMU0 tilt alignment complete
�� o):��3/����3:����3E�'p�No��e�
(�3O�EKF2 IMU0 forced reset
$[�3`��4�do{�
��3k�,��3v�"
���3��BH�3��

@andyp1per
Copy link
Collaborator Author

@1wick Ok so that's weird - there is no SPItimeout, where is the internal error? I don't see it. It's important to get the logging for a situation where the error occurs.

@1wick
Copy link

1wick commented Dec 4, 2019

The error only occurs after a level calibration, and is fine after a reboot..
So, I don't think there will be a timeout during booting?

And I already uploaded the mavpoxy logs and MP screen shot of the error after doing a calibration, then the FC acting strange afterwords.

After a calibration, then a hot reboot (telling the FC to restart from w/in MP) I do get more errors.. Maybe I can try to do that from MavProxy on Linux while watching the USB port..

@1wick
Copy link

1wick commented Dec 4, 2019

Nope, not possible.. As soon as I connect w/ APM Planner or MavProxy, the CAT session ends.

The error goes..
Calibrate level, get 0x8000. Hot reboot, I get 0x4000. Cold reboot, it's fine.

@1wick
Copy link

1wick commented Dec 4, 2019

I've been trying to catch it, but can't.. Now, more often than not, the FC totally locks up after the hot reboot, and I can't connect to see the output.
The newly updated MP displays "Main Thread Stuck" after doing the calibration.. which is I guess what 0x8000 means. Then I tell it to Preflight_reboot_shutdown, and it locks up.

@1wick
Copy link

1wick commented Dec 5, 2019

Andy,
FYI, I figured out the console in APM Planner gets the startup messages.. If you open the console, then plug in the board and immediately connect in the console, it displays the "waiting for USB" and subsequent messages.

@andyp1per
Copy link
Collaborator Author

The error goes..
Calibrate level, get 0x8000. Hot reboot, I get 0x4000. Cold reboot, it's fine.

@tridge I'm not sure if you are following this but the omnibus nano issue appears more involved - there is no SPI error at startup, only after a level calibration

@1wick
Copy link

1wick commented Dec 5, 2019

I was on an old version of 3.7 before upgrading to 4.0, and 3.7 from last year did not have this issue.
I never really ran 3.6.. But, I could put the official 3.6 on it to see if it has the same issue. And I can install some other previous firmwares, if they are on the firmware site. I'll do some more testing later.
I have pretty frequent telemetry issues, and occasional bootup lockups w/ 4.0, too, which I don't recall having w/ 3.7. I did catch a boot lockup yesterday while I was monitoring the USB port, and the last line it printed was about preparing the logging system.

-and like I said, don't worry about it just for me, but I'm happy to help if it's something you want to figure out. I don't think many people are using this board, and I'm about to build a new copter w/ the nano kakute f7

@andyp1per
Copy link
Collaborator Author

If you can build now then git bisect on the problem would probably be really useful!

@giacomo892
Copy link
Contributor

I had an occurrence of this on 14e01e2 running on a F405-STD board.
When I landed I wasn't able to rearm and looks like it is connected with a failsafe here too.

14/06/2020 09:07:26 : PreArm: Internal errors 0x4000 l:211 ,spi_fail
14/06/2020 09:07:26 : PreArm: Radio failsafe on
14/06/2020 09:07:26 : Radio Failsafe - Disarming
14/06/2020 09:05:50 : Mission: 6 WP
14/06/2020 09:05:50 : Mission: 5 ChangeSpeed
14/06/2020 09:05:50 : Reached command #4
14/06/2020 09:05:44 : Mission: 4 WP
14/06/2020 09:05:44 : Reached command #3

I'm "pretty" sure I didn't turn my radio off as wanted to take off again.

Already send the log to @tridge

@shellixyz
Copy link
Contributor

shellixyz commented Aug 12, 2020

I just updated one of my quads to master and got this issue the first time I tried to fly with it. Only had it once. I just power cycled the board and it was gone. I had never got this issue with the previous revision I was using. The board is a Matek F405-CTR.

Old revision: cefd991
New revision: e137698

photo_2020-08-12_12-16-32

@peterbarker
Copy link
Contributor

Does anybody have a repeatable test case? @shellixyz has given us endpoints for good-and-bad, but a repeatable setup is required for a bisect...

@1wick are you in a position to bisect given the known-good/known-bad commits? Sounds like you've got a reproducible case.

@vierfuffzig
Copy link
Contributor

vierfuffzig commented Oct 5, 2020

i‘ve seen the int spi error just recently on a friend‘s omnibusnanov6 running plane master from ~3 weeks ago. it just happened once after boot, didn‘t reoccur for multiple flight hours through the day. minimal setup, only GPS and RC. i myself run two omnibus nanos on current plane master too and haven‘t seen it happen yet.

@shellixyz
Copy link
Contributor

Seems sporadic, I can't reproduce the issue reliably

@Huibean
Copy link
Contributor

Huibean commented Oct 6, 2020

this happen frequently after I active the optical flow pixart driver
image

@Huibean
Copy link
Contributor

Huibean commented Oct 8, 2020

in my case, expect_delay_start value is always 0 and trigger the spi timeout

if (expect_delay_start != 0) {

@giacomo892
Copy link
Contributor

Just had it on latest master on Plane. I enabled TERRAIN for the very first time ever, and got it on the first time I tried to arm the plane. Rebooted and all good.
Again not able to reproduce.

@IamPete1
Copy link
Member

IamPete1 commented Aug 5, 2022

Any updates to this issue? There don't seem to be any new reports, can we close?

@shellixyz
Copy link
Contributor

It can probably be closed. I haven't got the issue since my comment

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests