-
-
Notifications
You must be signed in to change notification settings - Fork 752
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We鈥檒l occasionally send you account related emails.
Already on GitHub? Sign in to your account
Call correct interrupt to prevent Nonce slip (master) #943
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Very nice work! Thanks for fixing this!
Brilliant piece of work! Capn, send over your gear, I'll flash it back to 1.1 for you! :-) |
Thank you for the hard work.. |
Amazing work as always Cap! This bug has been annoying us for awhile! |
Tested with a tx/rx connected at 500hz, 10mW, 1:4 tlm. And another 8 TX on unique passphrases 500hz, 50mW, no tlm. The rx LQ bounced around 70-90. 7hrs later and zero issues 馃挭 |
Reorganize the Radio ISR code to prevent the wrong interrupt routine from being called on the TX which is causing the NonceTX to advance prematurely, such as in #934. This dangerous bug affects all regulatory domains, and both 1.x and master.
Details
We've seen plenty of symptoms of Nonce desync, where a receiver suddenly will drop to under 75 LQ even with good RSSI. This can either be healed by a sync packet to bring it back in step, but on 1.0 if it gets out of sync by 2 (50 LQ) it will never recover. This is something we haven't been able to reproduce in test setups so it has been difficult to track down and why it took me dozens of hours to narrow in on the cause. This is due to the fact that this bug can only present itself with multiple transmitters running at the same time.
The slip (actually the TX is double-advancing) is caused by the reception of multiple packets by the TX module during the telemetry interval. Once the nonce slips one position though, it is highly likely to slip again, because the transmitter will be listening for telemetry on a clear FHSS channel, which increases the likelyhood of hearing another transmitter which is not synced to our schedule and will generate a bad interrupt at a time to exploit the bug again.
What's happening:
Solution
The meat of this PR is to change the ISR handler from a stateful version that calls RXISR or TXISR depending on the value we set, to using the RF chip's status register to tell us what interrupt it is. Replacing code like this:
with code like this
Other changes
WaitRXresponse
that also relies on the NonceTX staying frozen for two intervals with a proper 3 state enum.SendRCdataToRF
would check if the Nonce was a TLM nonce, clear the flag, leave the nonce its previous value, and return. The next Tock, the Nonce would still be the previous Nonce, which was interpreted to mean we just did telemetry, so the NonceTX would be incremented now, and later in the same period incremented again by the next TXDoneISR. Not confusing at all!crsf.JustSentRFpacket();
has been moved fromSendRCdataToRF
to the Tock handler. This was broken. If we didn't have fresh RC channels data, the OpenTX sync variable would not update and could cause mayhem in the sync calculation. I don't think anyone ever has run into this because that would mean our receiving data from OpenTX wasn't working but for some reason we would still be sending sync packets ok? Not likely.The best way to test
The bug addressed by this PR can only ever occur when there are multiple transmitters running the same RF Rate. I've run for days with a single TX and never had any problems with Nonce sync. Add a second TX and it happens somewhat rarely. To make this happen frequently, increase the telemetry ratio on the pair under test (I used 1:8). A higher packet rate helps as well (I used 250Hz) but all transmitters must be set to this rate. To increase the chance of collisions on the same FHSS, I also reduced Team2.4's hop list down to 10 channels. This presented the bug usually within 2-5 minutes.
I've tested using two Team2.4 setups to verify it works on systems both with and without a busy pin. All devices of the same band were active at the same time
In addition I've tested Team900
Backporting
I will backport these changes to 1.1-maintenance as well in a separate PR, since I'm worried just trying to cherry-pick this could merge poorly in a way that might not be obvious. The changes themselves are straightforward, but I am not looking forward to having to flash allllllllll this gear back to 1.1 to run all the testing again. It is not fun 馃ぎ