Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We鈥檒l occasionally send you account related emails.

Already on GitHub? Sign in to your account

Call correct interrupt to prevent Nonce slip (master) #943

Merged
merged 2 commits into from
Oct 3, 2021

Conversation

CapnBry
Copy link
Member

@CapnBry CapnBry commented Oct 1, 2021

Reorganize the Radio ISR code to prevent the wrong interrupt routine from being called on the TX which is causing the NonceTX to advance prematurely, such as in #934. This dangerous bug affects all regulatory domains, and both 1.x and master.

Details

We've seen plenty of symptoms of Nonce desync, where a receiver suddenly will drop to under 75 LQ even with good RSSI. This can either be healed by a sync packet to bring it back in step, but on 1.0 if it gets out of sync by 2 (50 LQ) it will never recover. This is something we haven't been able to reproduce in test setups so it has been difficult to track down and why it took me dozens of hours to narrow in on the cause. This is due to the fact that this bug can only present itself with multiple transmitters running at the same time.

The slip (actually the TX is double-advancing) is caused by the reception of multiple packets by the TX module during the telemetry interval. Once the nonce slips one position though, it is highly likely to slip again, because the transmitter will be listening for telemetry on a clear FHSS channel, which increases the likelyhood of hearing another transmitter which is not synced to our schedule and will generate a bad interrupt at a time to exploit the bug again.

What's happening:

  • TX module goes into receive mode to receive telemetry. Regardless of if it receive the TLM packet or not, it stays in continuous receive mode until the next packet is ready to transmit.
  • During this time, a second transmitter sends a packet with the same RF parameters so the RF chip starts receiving this second packet.
  • The TX module switches to TX mode in our code
  • The RF chip throws an interrupt to let us know that there's a new received packet available
  • Because we've set our state to be "Transmit", the interrupt is interpreted as a TXDone interrupt, which advances the NonceTX even though the packet we have to send hasn't been sent yet.
  • The packet actually transmits, generates an actual TXDone interrupt, and now has advanced the NonceTX twice in one period.

Solution

The meat of this PR is to change the ISR handler from a stateful version that calls RXISR or TXISR depending on the value we set, to using the RF chip's status register to tell us what interrupt it is. Replacing code like this:

void ICACHE_RAM_ATTR SX1280Hal::dioISR()
{
    if (instance->InterruptAssignment == SX1280_INTERRUPT_RX_DONE)
    {
        RXdoneCallback();
    }
    else if (instance->InterruptAssignment == SX1280_INTERRUPT_TX_DONE)
    {
        TXdoneCallback();
    }
}

with code like this

void ICACHE_RAM_ATTR SX1280Driver::IsrCallback()
{
    uint16_t irqStatus = instance->GetIrqStatus();
    instance->ClearIrqStatus(SX1280_IRQ_RADIO_ALL);
    if ((irqStatus & SX1280_IRQ_TX_DONE))
        instance->TXnbISR();
    else if ((irqStatus & SX1280_IRQ_RX_DONE))
        instance->RXnbISR();
}

Other changes

  • The NonceTX++ has been removed from being in two places in the transmitter: the TXDone interrupt, and the implied telemetry happened handler. Now it just lives in the Tock handler and is always advanced one time per Tock.
  • To make it clearer how the Telemetry receive progresses, I've replaced the single boolean WaitRXresponse that also relies on the NonceTX staying frozen for two intervals with a proper 3 state enum.
    • The way it used to work: The TXDoneISR would call HandleTLM which would set the boolean, then the next Tock's SendRCdataToRF would check if the Nonce was a TLM nonce, clear the flag, leave the nonce its previous value, and return. The next Tock, the Nonce would still be the previous Nonce, which was interpreted to mean we just did telemetry, so the NonceTX would be incremented now, and later in the same period incremented again by the next TXDoneISR. Not confusing at all!
    • The way it works now: The TXDoneISR, sets the state variable indicating it is ready to receive telemetry. Now the Tock sees that it is receiving telemetry, advances the state, returns. The next Tock advances the state again back to normal Transmit. The NonceTX is consistent across all of these operations, ticking once per Tock trigger.
  • crsf.JustSentRFpacket(); has been moved from SendRCdataToRF to the Tock handler. This was broken. If we didn't have fresh RC channels data, the OpenTX sync variable would not update and could cause mayhem in the sync calculation. I don't think anyone ever has run into this because that would mean our receiving data from OpenTX wasn't working but for some reason we would still be sending sync packets ok? Not likely.

The best way to test

The bug addressed by this PR can only ever occur when there are multiple transmitters running the same RF Rate. I've run for days with a single TX and never had any problems with Nonce sync. Add a second TX and it happens somewhat rarely. To make this happen frequently, increase the telemetry ratio on the pair under test (I used 1:8). A higher packet rate helps as well (I used 250Hz) but all transmitters must be set to this rate. To increase the chance of collisions on the same FHSS, I also reduced Team2.4's hop list down to 10 channels. This presented the bug usually within 2-5 minutes.

I've tested using two Team2.4 setups to verify it works on systems both with and without a busy pin. All devices of the same band were active at the same time

  • SIYI FM30 TX + BetaFPV 2400 RX (no Busy pin test)
  • SIYI FR Mini RX as TX + EP2 RX (Busy pin test)
    In addition I've tested Team900
  • HappyModel ES900TX + FrSky R9MM
  • TTGO Lora32 V1 just set to blast out packets continuously

Backporting

I will backport these changes to 1.1-maintenance as well in a separate PR, since I'm worried just trying to cherry-pick this could merge poorly in a way that might not be obvious. The changes themselves are straightforward, but I am not looking forward to having to flash allllllllll this gear back to 1.1 to run all the testing again. It is not fun 馃ぎ

@CapnBry CapnBry added the release notes Should be added to release notes label Oct 1, 2021
Copy link
Collaborator

@cruwaller cruwaller left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very nice work! Thanks for fixing this!

@0crap
Copy link

0crap commented Oct 1, 2021

Brilliant piece of work!
And very important for ExpressLRS in general.
Reliability is key for any RC link.

Capn, send over your gear, I'll flash it back to 1.1 for you! :-)

@deadbytefpv
Copy link
Contributor

Thank you for the hard work..
I'll chug a beer down in your honor..

@AlessandroAU
Copy link
Member

Amazing work as always Cap! This bug has been annoying us for awhile!

@JyeSmith
Copy link
Member

JyeSmith commented Oct 2, 2021

Tested with a tx/rx connected at 500hz, 10mW, 1:4 tlm. And another 8 TX on unique passphrases 500hz, 50mW, no tlm. The rx LQ bounced around 70-90.

7hrs later and zero issues 馃挭

@JyeSmith JyeSmith merged commit 862d94a into ExpressLRS:master Oct 3, 2021
CapnBry added a commit to CapnBry/ExpressLRS that referenced this pull request Oct 3, 2021
JyeSmith pushed a commit that referenced this pull request Oct 5, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
release notes Should be added to release notes
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants