Call correct interrupt to prevent Nonce slip (master) #943

CapnBry · 2021-10-01T18:29:53Z

Reorganize the Radio ISR code to prevent the wrong interrupt routine from being called on the TX which is causing the NonceTX to advance prematurely, such as in #934. This dangerous bug affects all regulatory domains, and both 1.x and master.

Details

We've seen plenty of symptoms of Nonce desync, where a receiver suddenly will drop to under 75 LQ even with good RSSI. This can either be healed by a sync packet to bring it back in step, but on 1.0 if it gets out of sync by 2 (50 LQ) it will never recover. This is something we haven't been able to reproduce in test setups so it has been difficult to track down and why it took me dozens of hours to narrow in on the cause. This is due to the fact that this bug can only present itself with multiple transmitters running at the same time.

The slip (actually the TX is double-advancing) is caused by the reception of multiple packets by the TX module during the telemetry interval. Once the nonce slips one position though, it is highly likely to slip again, because the transmitter will be listening for telemetry on a clear FHSS channel, which increases the likelyhood of hearing another transmitter which is not synced to our schedule and will generate a bad interrupt at a time to exploit the bug again.

What's happening:

TX module goes into receive mode to receive telemetry. Regardless of if it receive the TLM packet or not, it stays in continuous receive mode until the next packet is ready to transmit.
During this time, a second transmitter sends a packet with the same RF parameters so the RF chip starts receiving this second packet.
The TX module switches to TX mode in our code
The RF chip throws an interrupt to let us know that there's a new received packet available
Because we've set our state to be "Transmit", the interrupt is interpreted as a TXDone interrupt, which advances the NonceTX even though the packet we have to send hasn't been sent yet.
The packet actually transmits, generates an actual TXDone interrupt, and now has advanced the NonceTX twice in one period.

Solution

The meat of this PR is to change the ISR handler from a stateful version that calls RXISR or TXISR depending on the value we set, to using the RF chip's status register to tell us what interrupt it is. Replacing code like this:

void ICACHE_RAM_ATTR SX1280Hal::dioISR()
{
    if (instance->InterruptAssignment == SX1280_INTERRUPT_RX_DONE)
    {
        RXdoneCallback();
    }
    else if (instance->InterruptAssignment == SX1280_INTERRUPT_TX_DONE)
    {
        TXdoneCallback();
    }
}

with code like this

void ICACHE_RAM_ATTR SX1280Driver::IsrCallback()
{
    uint16_t irqStatus = instance->GetIrqStatus();
    instance->ClearIrqStatus(SX1280_IRQ_RADIO_ALL);
    if ((irqStatus & SX1280_IRQ_TX_DONE))
        instance->TXnbISR();
    else if ((irqStatus & SX1280_IRQ_RX_DONE))
        instance->RXnbISR();
}

Other changes

The NonceTX++ has been removed from being in two places in the transmitter: the TXDone interrupt, and the implied telemetry happened handler. Now it just lives in the Tock handler and is always advanced one time per Tock.
To make it clearer how the Telemetry receive progresses, I've replaced the single boolean WaitRXresponse that also relies on the NonceTX staying frozen for two intervals with a proper 3 state enum.
- The way it used to work: The TXDoneISR would call HandleTLM which would set the boolean, then the next Tock's SendRCdataToRF would check if the Nonce was a TLM nonce, clear the flag, leave the nonce its previous value, and return. The next Tock, the Nonce would still be the previous Nonce, which was interpreted to mean we just did telemetry, so the NonceTX would be incremented now, and later in the same period incremented again by the next TXDoneISR. Not confusing at all!
- The way it works now: The TXDoneISR, sets the state variable indicating it is ready to receive telemetry. Now the Tock sees that it is receiving telemetry, advances the state, returns. The next Tock advances the state again back to normal Transmit. The NonceTX is consistent across all of these operations, ticking once per Tock trigger.
crsf.JustSentRFpacket(); has been moved from SendRCdataToRF to the Tock handler. This was broken. If we didn't have fresh RC channels data, the OpenTX sync variable would not update and could cause mayhem in the sync calculation. I don't think anyone ever has run into this because that would mean our receiving data from OpenTX wasn't working but for some reason we would still be sending sync packets ok? Not likely.

The best way to test

The bug addressed by this PR can only ever occur when there are multiple transmitters running the same RF Rate. I've run for days with a single TX and never had any problems with Nonce sync. Add a second TX and it happens somewhat rarely. To make this happen frequently, increase the telemetry ratio on the pair under test (I used 1:8). A higher packet rate helps as well (I used 250Hz) but all transmitters must be set to this rate. To increase the chance of collisions on the same FHSS, I also reduced Team2.4's hop list down to 10 channels. This presented the bug usually within 2-5 minutes.

I've tested using two Team2.4 setups to verify it works on systems both with and without a busy pin. All devices of the same band were active at the same time

SIYI FM30 TX + BetaFPV 2400 RX (no Busy pin test)
SIYI FR Mini RX as TX + EP2 RX (Busy pin test)
In addition I've tested Team900
HappyModel ES900TX + FrSky R9MM
TTGO Lora32 V1 just set to blast out packets continuously

Backporting

I will backport these changes to 1.1-maintenance as well in a separate PR, since I'm worried just trying to cherry-pick this could merge poorly in a way that might not be obvious. The changes themselves are straightforward, but I am not looking forward to having to flash allllllllll this gear back to 1.1 to run all the testing again. It is not fun 🤮

cruwaller

Very nice work! Thanks for fixing this!

0crap · 2021-10-01T20:36:59Z

Brilliant piece of work!
And very important for ExpressLRS in general.
Reliability is key for any RC link.

Capn, send over your gear, I'll flash it back to 1.1 for you! :-)

deadbytefpv · 2021-10-01T23:52:53Z

Thank you for the hard work..
I'll chug a beer down in your honor..

AlessandroAU · 2021-10-02T14:03:23Z

Amazing work as always Cap! This bug has been annoying us for awhile!

JyeSmith · 2021-10-02T19:40:57Z

Tested with a tx/rx connected at 500hz, 10mW, 1:4 tlm. And another 8 TX on unique passphrases 500hz, 50mW, no tlm. The rx LQ bounced around 70-90.

7hrs later and zero issues 💪

CapnBry added 2 commits October 1, 2021 13:38

Only fire interupt that actually interrupted

922029e

WaitRXresponse -> TelemetryRcvPhase

803e12f

CapnBry added the release notes Should be added to release notes label Oct 1, 2021

cruwaller approved these changes Oct 1, 2021

View reviewed changes

AlessandroAU approved these changes Oct 3, 2021

View reviewed changes

JyeSmith approved these changes Oct 3, 2021

View reviewed changes

JyeSmith merged commit 862d94a into ExpressLRS:master Oct 3, 2021

CapnBry added a commit to CapnBry/ExpressLRS that referenced this pull request Oct 3, 2021

Backport ExpressLRS#943

7c939ed

CapnBry mentioned this pull request Oct 3, 2021

Call correct interrupt to prevent Nonce slip (1.1) #946

Merged

JyeSmith pushed a commit that referenced this pull request Oct 5, 2021

Backport #943 (#946)

7c5af6d

CapnBry mentioned this pull request Oct 9, 2021

Idle callback must follow same hop logic (master) #972

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Call correct interrupt to prevent Nonce slip (master) #943

Call correct interrupt to prevent Nonce slip (master) #943

CapnBry commented Oct 1, 2021 •

edited

cruwaller left a comment

0crap commented Oct 1, 2021

deadbytefpv commented Oct 1, 2021

AlessandroAU commented Oct 2, 2021

JyeSmith commented Oct 2, 2021

Call correct interrupt to prevent Nonce slip (master) #943

Call correct interrupt to prevent Nonce slip (master) #943

Conversation

CapnBry commented Oct 1, 2021 • edited

Details

Solution

Other changes

The best way to test

Backporting

cruwaller left a comment

Choose a reason for hiding this comment

0crap commented Oct 1, 2021

deadbytefpv commented Oct 1, 2021

AlessandroAU commented Oct 2, 2021

JyeSmith commented Oct 2, 2021

CapnBry commented Oct 1, 2021 •

edited