RX2 fallback leads to endless downlinks #767

stevmei · 2019-08-14T06:19:25Z

Summary

If a MAC command sent in RX2/SF9 is not answered by an end node, the NS uses RX2/SF12 as a fallback. If the end node IS listening on RX2/SF9 and just missed the MAC command (due to temporary unavailability or poor reachability), it will never receive the fallback MAC commands on RX2/SF12. In that case the end node again sends no answer and the NS starts repeating the requests (LinkADRReq and RXParamSetupReq). That produces a huge number of downlinks that will never be received by the end node, which is still listening on RX2/SF9.

Steps to Reproduce

Start end node, join by OTAA
Wait for MAC command using RX2/SF9
Skip the answer or shield the end node during the RX2 window
Watch MAC command using RX2/SF12
See endless repeated downlinks in RX2/SF12

What is already there? What do you see now?

What is missing? What do you want to see?

Not the same downlink again and again.

Environment

Not relevant.

How do you propose to implement this?

The RX2 fallback should not be used more than 2 times (for example). After 2 missed answers, the RX2 should be changed back to SF9 again. After 3 cycles of RX2/SF9-RX2/SF12 changes some notification should be generated ("end node not answering any MAC") and the downlinks should be stopped. A restart of the procedure after an amount of time may be possible.
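A minimal sketch of how such a bounded retry could look, not taken from the TTN code base. All names (`rx2Retry`, `attemptsPerPhase`, `maxCycles`) are hypothetical, and it assumes that the same limit of 2 unanswered downlinks applies to both the SF9 phase and the SF12 phase.

```go
package main

import "fmt"

// Hypothetical constants; none of these names exist in the TTN code base.
const (
	primaryDR        = "SF9BW125"  // RX2 data rate the device was configured for
	fallbackDR       = "SF12BW125" // data rate used by the fallback
	attemptsPerPhase = 2           // unanswered downlinks before switching data rate
	maxCycles        = 3           // SF9/SF12 cycles before giving up
)

// rx2Retry tracks unanswered MAC downlinks for a single device.
type rx2Retry struct {
	missed int // unanswered downlinks since the last answer
}

// Next returns the RX2 data rate for the next MAC downlink, or ok=false
// once the retry budget is used up and the NS should stop and notify.
func (r *rx2Retry) Next() (dr string, ok bool) {
	if r.missed >= 2*attemptsPerPhase*maxCycles {
		return "", false
	}
	// Even phases use the configured data rate, odd phases the fallback.
	if (r.missed/attemptsPerPhase)%2 == 0 {
		return primaryDR, true
	}
	return fallbackDR, true
}

// Missed records that the last downlink was not answered.
func (r *rx2Retry) Missed() { r.missed++ }

// Answered resets the state after the device answers a MAC command.
func (r *rx2Retry) Answered() { r.missed = 0 }

func main() {
	var r rx2Retry
	for {
		dr, ok := r.Next()
		if !ok {
			fmt.Println("end node not answering any MAC; stopping downlinks")
			break
		}
		fmt.Printf("attempt %d: RX2 %s\n", r.missed+1, dr)
		r.Missed()
	}
}
```

With these values the sketch sends 12 downlinks (SF9, SF9, SF12, SF12, three times over) and then stops. Restarting after some time would amount to calling `Answered()` from a timer.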

Can you do this yourself and submit a Pull Request?

No.

htdvisser · 2019-08-14T14:18:49Z

This is an issue for https://github.com/TheThingsNetwork/ttn, not https://github.com/TheThingsNetwork/lorawan-stack. I'll transfer it.

mpouillot · 2019-09-11T12:21:39Z

This is an important bug; I have a few customers who lost their devices because of it. I think that with OTAA there is no need to change the RX2 SF after the join-accept. If the device receives the join-accept, then it is configured to receive on RX2 at SF9.

For ABP, I think it would be better to send the configuration in the RX1 window.

htdvisser · 2019-09-12T13:59:36Z

If you have problems with this in a private deployment, you can configure it to use a different RX2 data rate with the --eu-rx2-dr flag (or corresponding environment / yml config).

Since the core team is working full-time on v3, you may want to contribute a fix for this yourself.

My suggestion would be to add an "activated at" field to the device so that the NS can detect if it's an OTAA device or not. This field can then be set by HandleActivate after which the fallback can be bypassed for devices that have been activated with OTAA.
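A rough sketch of that idea, assuming a simplified device record. The `device` type, `handleActivate` and `rx2DataRate` below are hypothetical stand-ins for illustration and are not the actual TTN types or handlers.

```go
package main

import (
	"fmt"
	"time"
)

// device is a hypothetical, simplified device record.
type device struct {
	// ActivatedAt is nil until an OTAA activation has been recorded.
	ActivatedAt *time.Time
}

// handleActivate stands in for the activation handler: it stamps the
// device when an OTAA join succeeds.
func handleActivate(d *device) {
	now := time.Now()
	d.ActivatedAt = &now
}

// rx2DataRate picks the RX2 data rate for a MAC downlink. The SF12
// fallback is applied only when the previous command went unanswered
// AND the device has no recorded OTAA activation.
func rx2DataRate(d *device, configured string, unanswered bool) string {
	if unanswered && d.ActivatedAt == nil {
		return "SF12BW125"
	}
	return configured
}

func main() {
	abp := &device{}
	otaa := &device{}
	handleActivate(otaa)

	fmt.Println("ABP: ", rx2DataRate(abp, "SF9BW125", true))
	fmt.Println("OTAA:", rx2DataRate(otaa, "SF9BW125", true))
}
```

In this sketch an unanswered ABP device still falls back to SF12, while the OTAA device stays on the configured SF9.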


stevmei · 2019-09-13T10:32:53Z

I don't know if everything is correct and compiles (my first contribution in Go 😀), but I've implemented @htdvisser's suggestions.

Some things to note: the activatedAt field is never reset to nil. I don't know whether that has to be done if the device is switched over to ABP, or whether a new device is registered in that case; I couldn't find any source code for that case. Additionally, I found an activation constraint (here) that already identifies whether the device was activated via OTAA. Maybe that can be used for the bypass, too.

Thanks so far!

jpmeijers · 2019-09-28T21:24:14Z

I'm currently also affected by this. One of my devices joined using OTAA. Then we had the recent gateway outage and the device never received downlinks. The device (ADR) therefore moved all the way to SF12. Using an SDR I can see it is transmitting on SF12 and the downlinks are also sent on SF12, not on SF9 like it should be.

@sigmaroot @htdvisser how can I assist getting this bugfix tested and out on production TTN?

jpmeijers · 2019-09-28T21:36:07Z

Looking at the fallback code, the line that does the fallback magic is:
loraSettings.DataRate = "SF12BW125"

Can't we change that line to:

if loraSettings.DataRate == "SF9BW125" {
    loraSettings.DataRate = "SF12BW125"
} else {
    loraSettings.DataRate = "SF9BW125"
}

With this change (if I understand the code correctly) we will alternate between the incorrect (SF12) downlink and the correct (SF9) downlink.

jpmeijers · 2019-10-02T10:08:10Z

I'm still stuck with one deployment because of this bug.

Steps to reproduce:

Only one gateway should be in reach of the device
Make the device join using OTAA and make sure ADR is switched on
Over night switch off the gateway
The next day switch the gateway back on
The gateway will receive the device on SF12 and downlinks will be sent (ADR, ACK)
Downlinks are ignored/not received by the device

I can't find a workaround to get the device to receive the downlinks again other than to power cycle it to force a re-join. Is there any other way I can force downlinks back to SF9?

stevmei · 2019-10-02T10:59:11Z

@jpmeijers
Please have a look into my commit mentioned earlier: edbe65d

terrillmoore · 2019-10-10T19:49:26Z

I think there are two different discussions here. One is regarding privately hosted work, another is regarding existing apps deployed using TTN. As I don't have a privately-hosted V2 network, changes for private networks are interesting but don't address our issues. @htdvisser Will pushing a change request on the V2 code make it to production, or is TTN V2 basically frozen at this point, or is there some other way to fix it? Those of us with deployed systems need to understand so we can explain to our stakeholders. It would be good to avoid visits to the devices, and better to avoid firmware updates in devices; but it would be great, no matter how we mitigate, to make sure we only visit devices once.

Thanks!

htdvisser · 2019-10-11T13:08:16Z

As you can imagine, we're trying to spend all our time on v3 and only do really critical things on v2. Of course we do welcome contributions, so if you're willing to implement a fix and test it on a private network, I'll make sure it gets merged and deployed to the public community network.


jpmeijers · 2019-10-14T19:25:58Z

I see this as a critical issue as it causes some of my devices to be stuck on SF12, sending retries and draining their batteries. It also prevents me from having remote control of these devices.

Because of this I tried the suggestions in this thread, implemented them on a private instance of the stack on my local computer, and tested. PR #768 contains a working fix.

mpouillot · 2019-10-15T05:03:24Z

Exactly, it is a critical issue on public TTN V2. We have lost downlink contact with a lot of devices.


stevmei · 2019-10-15T06:16:28Z

Thanks for contributing 😄
I'm going to test this fix with our test devices over the next couple of days, and restart our production devices afterwards.

jpmeijers · 2019-10-15T09:12:34Z

I just realised this fix will only apply to devices that join after the fix is deployed. Devices that are using a session from a join before today will still suffer from this bug. On my side that means half of my devices (those that automatically rejoin periodically) are now working as expected, while the other half (which never perform a re-join) will still not receive downlinks.

avbentem · 2020-03-16T10:10:14Z

As ADR-enabled OTAA devices that joined before October 2019 and all ADR-enabled ABP devices are still running into this^†, I wonder whether it would be worthwhile to fix this for those in V2 too.

ONLY IF YES, then I wonder:

Is there any reason why ABP devices were excluded from making the fallback less persistent?

@sigmaroot suggested:

The RX2 fallback should not be used more than 2 times (for example). After 2 missed answers, the RX2 should be changed back to SF9 again. After 3 cycles of RX2/SF9-RX2/SF12 changes some notification should be generated ("end node not answering any MAC") and the downlinks should be stopped. A restart of the procedure after an amount of time may be possible.

@htdvisser suggested what was actually implemented:

My suggestion would be to add an "activated at" field to the device so that the NS can detect if it's an OTAA device or not. This field can then be set by HandleActivate after which the fallback can be bypassed for devices that have been activated with OTAA.

I assume this seemed best/cleanest because OTAA devices should never need the fallback.

@jpmeijers suggested something that would work for ABP as well (and, in retrospect, would also work for OTAA devices that joined before October 2019):
Looking at the fallback code, the line that does the fallback magic is: loraSettings.DataRate = "SF12BW125"

Can't we change that line to:
```
if loraSettings.DataRate == "SF9BW125" {
    loraSettings.DataRate = "SF12BW125"
} else {
    loraSettings.DataRate = "SF9BW125"
}
```
With this change (if I understand the code correctly) we will alternate between the incorrect (SF12) downlink and the correct (SF9) downlink.
So, what if @jpmeijers' suggestion were added as well? Or, if that would not work, what if the fallback were only used for, say, odd or even values of the downlink (or uplink) counter?
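The counter-based variant could be sketched roughly as follows. The function name and signature are hypothetical, and the choice of odd counter values for the fallback is arbitrary.

```go
package main

import "fmt"

// rx2DataRate is a hypothetical helper: when the previous MAC command
// went unanswered, it applies the SF12 fallback only on odd downlink
// counter values, so every other retry still uses the configured rate.
func rx2DataRate(configured string, unanswered bool, fCntDown uint32) string {
	if unanswered && fCntDown%2 == 1 {
		return "SF12BW125"
	}
	return configured
}

func main() {
	for fCnt := uint32(10); fCnt < 14; fCnt++ {
		fmt.Printf("FCntDown %d: RX2 %s\n", fCnt, rx2DataRate("SF9BW125", true, fCnt))
	}
}
```

Because the decision depends only on the frame counter, this variant needs no extra per-device state, which is why it would also cover ABP devices and sessions that predate any new field.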

^† A recent case from Slack shows an Elsys OTAA device that dropped to SF12 after an outage and has since missed 39,000+ (and counting) RX2 downlinks on SF12, with an FOpts of 0503D2AD840345FF0001 telling it to use DR4/SF8. That single user has about 20 such devices installed.
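For reference, the FOpts value in the footnote can be decoded by hand. The sketch below assumes the LoRaWAN 1.0.x MAC command layouts (RXParamSetupReq, CID 0x05: DLSettings, 3-byte little-endian frequency in units of 100 Hz; LinkADRReq, CID 0x03: DataRate_TXPower, 2-byte ChMask, Redundancy) and handles only these two commands. The function name `decodeFOpts` is made up for this example.

```go
package main

import (
	"encoding/hex"
	"fmt"
)

// decodeFOpts describes the RXParamSetupReq and LinkADRReq commands
// found in a downlink FOpts field. Other CIDs are reported as errors.
func decodeFOpts(b []byte) ([]string, error) {
	var out []string
	for len(b) > 0 {
		if len(b) < 5 {
			return out, fmt.Errorf("truncated command with CID 0x%02X", b[0])
		}
		switch b[0] {
		case 0x05: // RXParamSetupReq: DLSettings(1) Frequency(3)
			rx1Offset := (b[1] >> 4) & 0x07
			rx2DR := b[1] & 0x0F
			freqHz := (uint32(b[2]) | uint32(b[3])<<8 | uint32(b[4])<<16) * 100
			out = append(out, fmt.Sprintf(
				"RXParamSetupReq RX1DRoffset=%d RX2DataRate=DR%d Frequency=%d Hz",
				rx1Offset, rx2DR, freqHz))
		case 0x03: // LinkADRReq: DataRate_TXPower(1) ChMask(2) Redundancy(1)
			dr := b[1] >> 4
			txPower := b[1] & 0x0F
			chMask := uint16(b[2]) | uint16(b[3])<<8
			nbTrans := b[4] & 0x0F
			out = append(out, fmt.Sprintf(
				"LinkADRReq DataRate=DR%d TXPower=%d ChMask=0x%04X NbTrans=%d",
				dr, txPower, chMask, nbTrans))
		default:
			return out, fmt.Errorf("unsupported CID 0x%02X", b[0])
		}
		b = b[5:] // both commands are 5 bytes including the CID
	}
	return out, nil
}

func main() {
	raw, err := hex.DecodeString("0503D2AD840345FF0001")
	if err != nil {
		panic(err)
	}
	cmds, err := decodeFOpts(raw)
	if err != nil {
		panic(err)
	}
	for _, c := range cmds {
		fmt.Println(c)
	}
}
```

This prints an RXParamSetupReq with RX2 at DR3 on 869525000 Hz and a LinkADRReq with DR4. In EU868 numbering, DR3 is SF9BW125 and DR4 is SF8BW125, so the NS keeps asking the device to use RX2/SF9 while transmitting that request on RX2/SF12.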

johanstokking assigned rvolosatovs Aug 14, 2019

htdvisser transferred this issue from TheThingsNetwork/lorawan-stack Aug 14, 2019

htdvisser unassigned rvolosatovs Aug 14, 2019

htdvisser added bug c/backend labels Aug 14, 2019

stevmei added a commit to stevmei/ttn that referenced this issue Sep 13, 2019

Bypass RX2 fallback when device was activated with OTAA method

edbe65d

Fixes TheThingsArchive#767

johanstokking assigned htdvisser Sep 29, 2019

johanstokking added the discussion label Sep 29, 2019

jpmeijers added a commit to jpmeijers/ttn that referenced this issue Oct 14, 2019

Do not fall back to SF12 when activated using OTAA. Fixes TheThingsAr…

66d3bea

…chive#767

jpmeijers mentioned this issue Oct 15, 2019

Do not fall back to SF12 when activated using OTAA #768

Merged

htdvisser closed this as completed in #768 Oct 15, 2019

jpmeijers mentioned this issue Nov 8, 2019

Setting pre-JOIN RX2 download for TTN mcci-catena/arduino-lmic#474

Closed

avbentem mentioned this issue Apr 3, 2020

Downlinks rejected by gateways for certain TX powers TheThingsNetwork/lorawan-stack#2106

Closed
