-
Notifications
You must be signed in to change notification settings - Fork 16.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
UAVCAN message corruption on ArduPlane #8166
Comments
If can H and can L wires are not the exact same length the signal from can H and can L will be out of sync and messages will be corrupted. I see this on motorcycles all the time when someone puts tall handlebars on there bikes. Not saying that is your issue. But it's worth mentioning |
It has nothing to do with corrupted CAN frames; if that were the case, you wouldn't see anything on the bus monitor in the first place. My bet is that the problem is caused by the TX buffer getting overrun. When that happens, the CAN stack has to triage the outgoing CAN frames, letting higher priority frames out while sacrificing lower priority ones, hence broken transfers. The TX queue is being read by the driver from one end and written by the stack from the other end; apparently the read/write rate difference is such that it takes about a minute for the queue to overrun (judging by the description from @JamesStewy ). I can propose two solutions:
|
This does line up with our experience that their is a cumulative nature to it.
Is there a particular mechanism which would cause this to become a permanent issue? As soon as the issue occurs, even if you back off the data rate, it does not recover until a reboot. |
Could be that the driver's inner state gets somehow damaged. I think you should get a debugger and make sure that the driver handles queue overrun properly. |
This should be tested on a ChibiOS build. The CPU usage is much lower, I bet the problem goes away. However, if we are hitting the queue limits then it may not matter. Just too much data and the bus is busy. Question: how many motors is it considered for? |
I have seen issues with the UAVCAN GUI tool where it cannot keep up and starts to behave this way. This is due to a pyuavcan bug where the data type signature is computed on every transfer (OpenCyphal/pycyphal#31) Even after fixing this bug, at very high bus utilization (4000+ frames per second) I've seen the UAVCAN GUI tool start to display node errors. Restarting the GUI tool seems to fix it temporarily. Let's confirm that this is actually an ardupilot issue. |
We can absolutly do this test but what should we be looking out for? How will we know if the driver (I assume that means libuavcan) handles queue overrun successfully?
All of our testing was done with 8 outputs including one fixed wing motor and 4 quad motors.
What versions of ArduPlane should we use to test with ChibiOS?
We have already seen that the problem goes away if you reduce the data rate (and presumably the CPU load as well at the same time). However, the lingering threat of the CAN side of ardupilot failing terminally worries us.
This doesn't line up with our testing. Restarting the GUI tool did not subside the errors even temporarily. It should also be noted that while the issue is occuring, the GUI tool sometimes crashes due to collecting to many CAN errors.
We have also tested using a simple board with an ST microcontroller running libuavcan connected on the bus. The board converts the last received However, despite saying all this, when I next get a chance I will run the GUI tool with (OpenCyphal/pycyphal#31) applied and report back if it makes a difference. |
Per @magicrub: should just be master, built for fmuv2 |
Can we get your parameters? We are trying to reproduce your problem. |
These parameters cause the issue for us: quadplane - high SCHED_LOOP_RATE.txt The same parameters but with |
Tested using the GUI tool with (OpenCyphal/pycyphal#31) applied and it didn't make a difference. Also tested using master ardupilot and it ultimately didn't fix the problem. A combination of providing constant transmitter inputs and shaking the pixhawk in FWBA a bit was enough to cause the failure. Over multiple tests it did appear to take longer (maybe 50% longer) to fail. |
Completed some more tests. ArduCopter 3.5.5 and 3.6.0-rc1 both have the same issue (@rmackay9). Tried ArduPlane 3.7.1 as well as using PX4 but both did not output 'uavcan.equipment.actuator.ArrayCommand' messages which likely reduced the amount of CAN frames enough for the issue to not occur. Is there anything else that anyone would like to see tested to try to narrow down on the issue? |
just to confirm if it's a bandwdith vs cpu problem, can you reduce the CAN_D1_UC_ESC_BM and CAN_D1_UC_SRV_BM masks. Right now they're both set to forward any servo data to the UAVCAN bus. It is a bitfield. Can you try reducing the bits down one by and see if it suddenly stops? if so, it's probably a bandwidth issue. Also, are you using terminators on your bus? |
Tested by shaking constantly shaking the Pixhawk in FBWA runnning master ArduPlane.
Yes |
Next test would be to use ChibiOS. It's a new OS we're switching too which is much faster. http://firmware.us.ardupilot.org/Plane/2018-04/2018-04-27-21:04/fmuv3/ Use the apj file. Make sure to back up your params before uploading this firmware. I'm not sure anyone has done a vtol flight with it so I'd suggest just a bench test for now. |
I loaded the apj file using Mission Planner and set all the |
And you've loaded all your old params and rebooted? |
My steps:
On the GUI Tool Is there a difference between the apj file and building and flashing ardupilot master using waf? |
Depends how you're building. We are moving away from using NuttX OS, which is the "px4-v2" (or v3) target and changing it to "fmuv2" (or v3) which is the ChibiOS target. The ChibiOS build will have a different file extension but otherwise should run the same way. |
Ok. That makes sense. Up until now we have been building master against NuttX with px4-v3. I rebuilt using fmuv3 and flashed using waf and had the same issue as described in my previous comment. I connected up a debugger and found that the ChibiOS version doesn't appear to send the channel values though to the AP_UAVCAN stuff. So I implemented that based off what is in AP_HAL_PX4: https://github.com/MonashUAS/ardupilot/tree/uavcan_esc_chibios. Running that new version with |
@JamesStewy thanks for your patch! |
My patch, like HAL_PX4, unconditionally sends all the channel values to AP_UAVCAN however the I do agree that a parameter to set the CAN actuator and ESC output rate would be useful. |
Agreed, an output rate would be nice. Doesn't really solve the problem though because, like in Tridge's example but with CAN ESCs, the vtol motors would probably want to update faster than the fwd motors so they would all end up running at 300 anyway. If the problem only happens after a few minutes, and nuttxx Gail's sooner than ChibiOS, it smells like a memory leak. Tridge, can you check the mem/stack usage with UAVCAN sending to an esc? |
@JamesStewy we should really set these automatically based on what nodes are detected on the bus. |
@JamesStewy any chance you could test #7882 ? That removes the need for your PR for HAL_ChibiOS/RCOutput |
@JamesStewy if you set either CAN_D1_UC_ESC_BM or CAN_D1_UC_SRV_BM to zero then does that prevent the issue? I'm thinking we should change the default to zero for both of those, and ask users of UAVCAN servos or ESCs to set them correctly. |
@tridge @magicrub @pavel-kirienko @EShamaev @jschall We would really like to see this issue resolved as we at MonashUAS truly believe that CAN ESCs and actuators provide an enormous amount of flexibility. However, currently it is our view that using CAN ESCs and Actuators with any build of ArduPilot is dangerous and we are not willing to fly with it until we can be relatively confident that the issues outlined in this thread are resolved. Current documentation suggests that
and as far as we can tell, anybody who takes this suggestion is unknowingly taking a significant risk and could experience a terminal failure, with very little warning, which results in a complete loss of control. We have resources available which can be put towards solving this issue. However, @JMare and I are not as familiar with the ArduPilot and UAVCAN codebases as you are nor their design decisions and goals. We would really appreciate some guidance on how to test for and identify the root cause of this issue. @pavel-kirienko suggested at the top of this thread that the UAVCAN driver could be overrunning and/or having its internal state damaged under high load. Is this code in ArduPilot or libuavcan? @pavel-kirienko is there a way to test libuavcan in isolation to see if it is the cause of the issue? @tridge @magicrub, as I have said before I think a fixed, configurable UAVCAN output loop is a good idea long term and it is something I am willing to implement. Should the loop be done using AP_SCHEDULER, a separate thread or something else? Should this be done on top of #7882? As, @tridge pointed out this isn’t a solution to this issue though as if it is set too high (say 300Hz) then the issue will still occur. @tridge I saw your suggestion in the ArduPilot gitter that the ESC and Actuator bitmasks should default to 0 and I agree that is a good solution for the short term. Is there anything stopping #7882 getting merged that we can help with? @jschall where you able to replicate the issue using the parameters I posted? Thanks, |
That particular example does. There's a similar command for RX queue.
May I ask you to submit a pull request to PyUAVCAN increasing the maximum queue size to a much larger value (say, 1 million) for non-OSX systems? |
I didn't try with ifconfig, but tried with ip and it didn't have a rxqueuelen. Everything I found on the internet said that didn't exist and you had to use SO_RCVBUF.
Sure, probably not today, I'm tired of looking at CAN 😄 |
this is to address ArduPilot#8166
@pavel-kirienko As requested: OpenCyphal/pycyphal#33 (damn, that was a fast merge 😃) I have more info in the SocketCAN case: I had an error in my code so it wasn't actually increasing the buffer. Unfortunately, even after increasing it by more than 38x I could still get errors after a while - it took much longer for that to happen though so it seems to be related to full buffer as well (when I comment the buffer increase code it just takes a second to get errors) - I should also mention that increasing the buffer size requires elevated privileges so it's not a good solution. I'll recommend that you add a reading thread in there too. |
I encourage all to contribute that. Fast merge guaranteed. ;) |
this is to address ArduPilot#8166
@JamesStewy would you be able to test the latest changes I made on PR #7882 |
I also found my USBTin adapter, so now I can see the error messages in the GUI as well |
after discussions with @pavel-kirienko I've changed the timeout to 1ms |
@tridge I tried #7882 using Out of intrest, where was the memory leak? |
@JamesStewy it was a tricky thing where resends would get queued unnecessarily |
I've made a fair bit more progress today. I ended up writing new two UAVCAN sniffers. One is here: uavcan.equipment.air_data.StaticPressure: 29 that is printed at 1Hz, showing the count of message types. I have pushed some improvements to the PR #7882 that prevent the issues that you have been seeing. It now works nicely on HAL_ChibiOS and HAL_PX4 (its better on ChibiOS, but HAL_PX4 isn't too bad). I've run with the new PR for quite a long time under both HAL_PX4 and HAL_ChibiOS with both servo and ESC bitmasks at 255. It seems reliable for me. |
I'm going to close this now, as I believe all the required fixes are merged in ArduPilot master.
With all the changes I can run a copter with HAL_ChibiOS (fmuv3) with SCHED_LOOP_RATE=800 and get 800Hz CAN ESC msgs out reliably with both CAN ports enabled, and still have plenty of spare CPU left. I've left it to run for over an hour with no issues. I can do the same with HAL_PX4, although it does get a fair number of scheduling misses at 800Hz. The CAN ESC msgs do get out though (at around 790Hz). It would be nice to work out how to make the Linux sniffers handle high packet rates, but that isn't an ArduPilot bug. Meanwhile we can use the UAVCAN_sniffer firmware I put in AP_UAVCAN/examples to sniff at high rates without msg loss. @JamesStewy many thanks for reporting this, it was a complex problem and led to some really significant improvements in our uavcan support |
Perhaps slightly off-topic, but has this been tested with live UAVCAN servos/actuators? The closest I've found is the CBS-15 produced by Currawong, which was originally going to include support for UAVCAN, but after a lack of customer interest, UAVCAN was dropped (I am currently querying the potential for a continuation, however). Just let me know if this question is better suited to the discussion forum. |
@Evan1010 I heard that these support UAVCAN https://www.uavcomp.com/electronics/micro-uartcan-servos/ |
Hitech make UAVCAN servos |
@proficnc Hitech, or Hitec? @pavel-kirienko Thank you for the lead - I've contacted them for more information. |
Hitec sorry about the spelling! |
MD250MW-CAN |
Awesome |
Where did you find that this is a UAVCAN servo? I only found that it supports CAN. Also do you know when they should be available? |
@antoinealb I just started talking to them, and they have responded positively that UAVCAN is an available option. That said, it could still be a misunderstanding (i.e., they referred to them at one point as "CAN UAV servos"). I likely can't quote availability, but they've been very good about responding. @pavel-kirienko @proficnc I was asked "the CAN Sample Point and the Baud-Rate" for the system. I understand that Baud-Rate is probably 1 Mbit/s, but I can't think what "CAN Sample Point" means. Perhaps its PWM Resolution (i.e., 2048 steps), or update rate (50Hz - 300Hz)? |
@Evan1010 it is the point within a single bit when the signal is sampled. The recommendations for UAVCAN are listed here: http://uavcan.org/Specification/8._Hardware_design_recommendations/#can-bus-parameters |
Hi @tridge, sorry for disappearing. Thank you for fixing this issue. Do you currently have a time frame for a beta of ArduPlane 3.9 with this fix in it? Thanks, |
Issue details
We are experiencing an issue where UAVCAN messages sent from ArduPilot have a very high corruption rate as seen on a Zubax Babel. Exactly why this starts happening is a little unclear however when it starts working incorrectly it does so with a small percentage of messages sent from ArduPilot corrupting and over a period of about 1-10 seconds this rate of corruption increases to nearly every message. Once the message corruption starts we have not found a way to get it to stop apart from a cold reboot. This corruption appears to affect all sent messages from ArduPilot including
uavcan.equipment.actuator.ArrayCommand
,uavcan.equipment.esc.RawCommand
,uavcan.protocol.NodeStatus
anduavcan.protocol.GetNodeInfo
responses. Corrupted messages appear in the UAVCAN GUI Tool Bus Monitor as messages whose value is not present and replaced withTransfer could not be decoded
along with errors such asCRC mismatch: expected xxxx, got xxxx for payload byte
,Toggle bit value x incorrect on frame x
, orEOT not found
.We have found that this issue does not occur as described above in
Manual
mode, but testing inAuto
,FBWA
andRTL
has all resulted in corruption occurring. We have also found that theSCHED_LOOP_RATE
parameter has a big effect. WithSCHED_LOOP_RATE=300
(recommend value for quadplanes), the safety switch in and with constant RC inputs, ArduPilot puts out around 1800 CAN frames per second and will fail normally within a minute (although we have had it take up to a few minutes to fail). The same test but withSCHED_LOOP_RATE=50
results in ArduPilot only outputting around 300 CAN frames per second and it does not fail (we ran multiple tests including one over 40 minutes and everything continued to work). In these tests constant RC inputs are important as the esc/actuator message output rate from ArduPilot appears to be proportional to the amount of ‘activity’ that is occurring.All our testing to date has been on multiple Pixhawk 2.1s running ArduPlane 3.8.4. To confirm this isn’t an issue with the CAN network we have tested on a network with other UAVCAN devices all of which continue to work even when ArduPilot starts sending corrupted messages. We have also tested ArduPilot alone with just a Babel connected and the corruption still occurs.
Steps to replicate
SCHED_LOOP_RATE
to300
.FBWA
.Another Issue
This issue is potentially unrelated but, in our testing, we also found that with UAVCAN enabled it was possible to force ArduPlane into io failsafe due to fmu shutdown by pressing the safety switch too soon after booting. We found that waiting 10 seconds after all the starting noises had completed was enough to mitigate the issue.
Version
ArduPlane 3.8.4
Platform
[ ] All
[ ] AntennaTracker
[ ] Copter
[x] Plane
[ ] Rover
[ ] Submarine
Airframe type
Quadplane
Hardware type
Pixhawk 2.1
The text was updated successfully, but these errors were encountered: