OTA Bootloader Support #10

Closed
projectgus opened this issue Jun 7, 2015 · 17 comments

@projectgus
Contributor

Be able to compile OTA-compatible image pairs that work with an OTA-type bootloader.

AFAIK the esp_iot_rtos_sdk does not support OTA at the moment.

The rBoot open source OTA bootloader looks like an excellent option:
http://richard.burtons.org/2015/05/18/rboot-a-new-boot-loader-for-esp8266/
http://richard.burtons.org/2015/05/17/decompiling-the-esp8266-boot-loader-v1-3b3/

This might be possible just with linker script changes to the existing code base, or it might require open source startup code (#2) to change some of the early runtime behaviour.

Also need some library functions to download OTA updates, verify image, etc.

@projectgus
Contributor Author

OTA support in progress in https://github.com/SuperHouse/esp-open-rtos/tree/ota

Can build an OTA compatible image that works with @raburton's open source rBoot bootloader. Thanks Richard!

Still need to add support inside esp-open-sdk for downloading & flashing OTA images, and changing the target image slot (basically this means porting rboot-ota).

@projectgus projectgus self-assigned this Jul 21, 2015
@raburton

> Can build an OTA compatible image that works with @raburton's open source rBoot bootloader. Thanks Richard!

No problem, hope you find it useful. Let me know if you need any assistance with it.

projectgus added a commit that referenced this issue Jul 29, 2015
ota_basic example can receive new image via TCP.

However - writing to flash with interrupts disabled causes data loss,
and the TCP flow is very slow to recover. Linux sender quickly ramps up
RTT timer to very long retry intervals, crippling performance &
throughput.

Running the update without the flash writes causes the data to be
received quickly, so this is definitely an issue with the time taken for
the erase cycle.

Progress towards #10
@projectgus
Contributor Author

As noted in recent commit, OTA support "works" but the TCP throughput is terrible.

A not-ideal workaround is probably to just erase the whole slot preemptively, before any network traffic comes through. sdk_spi_flash_write() can be called with small increments so hopefully will have less impact on packet throughput.

The only downside that I can think of is losing that slot as a potential fallback in case of flash corruption, but the most likely case I can think of for flash corruption is during an OTA update anyhow - in which case that slot will be at least partially erased anyhow.

It might be possible to tune this out via LWIP anyhow. TCP Fast Retransmit should apply in these cases (they just look like momentary packet loss or receiving buffer overflow), and it seems like Linux does a fast retransmit to begin with but eventually gives up and starts incrementally backing off.
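The pre-erase workaround can be modelled on a host machine. The following is a minimal sketch, not the esp-open-rtos code: `FakeFlash`, `pre_erase_slot`, `write_incrementally` and the 256-byte `WRITE_SIZE` are all hypothetical names and values chosen for illustration, and the slot address is assumed to be sector-aligned. It only shows the ordering (erase every sector of the slot up front, then program in small increments) and the NOR property that a write can only clear bits, which is why the erase has to come first.

```python
SECTOR_SIZE = 4096   # erase granularity mentioned in the thread
WRITE_SIZE = 256     # hypothetical small write increment


class FakeFlash:
    """Host-side stand-in for NOR flash (illustration only)."""

    def __init__(self, size):
        self.data = bytearray(size)   # all 0x00, i.e. a "dirty" slot
        self.erase_calls = 0
        self.write_calls = 0

    def erase_sector(self, sector):
        # Erasing sets every byte of the sector to 0xFF.
        start = sector * SECTOR_SIZE
        self.data[start:start + SECTOR_SIZE] = b"\xff" * SECTOR_SIZE
        self.erase_calls += 1

    def write(self, addr, buf):
        # Programming can only turn 1 bits into 0 bits.
        for i, b in enumerate(buf):
            self.data[addr + i] &= b
        self.write_calls += 1


def pre_erase_slot(flash, slot_addr, slot_size):
    """Erase every sector of the slot before any network traffic arrives."""
    first = slot_addr // SECTOR_SIZE          # assumes slot_addr is sector-aligned
    count = (slot_size + SECTOR_SIZE - 1) // SECTOR_SIZE
    for sector in range(first, first + count):
        flash.erase_sector(sector)


def write_incrementally(flash, slot_addr, image):
    """Program the image in small steps so each blocking call stays short."""
    for off in range(0, len(image), WRITE_SIZE):
        flash.write(slot_addr + off, image[off:off + WRITE_SIZE])
```

In this model, writing an image into a slot that has not been erased leaves the slot unchanged (all zero), while pre-erasing first lets the image read back intact.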

@raburton

That's interesting. What sort of performance are you seeing? I'm flashing a ~256k rom to the device with OTA in approx 1-2 seconds typically, on the LAN served from IIS. I've always been happy enough with that, so never looked at the performance.

You could erase the whole thing in advance, as soon as the content length header arrives, before the real data. But you've already started the recv then, so that may still not be ideal. The other option would be to work out the slot size based on the next address in the rom table, or even modify the rboot config struct to include the rom sizes (I forget why I didn't include that in the first place, but I did consider it at the time so there was probably some reason).
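Working out the slot size from the next address in the rom table might look like the sketch below. The function name `slot_sizes` and the example addresses are assumptions for illustration, not values read from the rBoot config struct. The last slot is assumed to run to the end of flash, which overestimates it on real layouts that reserve trailing sectors for SDK parameters.

```python
def slot_sizes(rom_addrs, flash_size):
    """Map each rom start address to the gap before the next rom (or end of flash)."""
    ordered = sorted(rom_addrs)
    sizes = {}
    for i, addr in enumerate(ordered):
        # Upper bound is the next rom's start, or the flash size for the last rom.
        end = ordered[i + 1] if i + 1 < len(ordered) else flash_size
        sizes[addr] = end - addr
    return sizes


# Illustrative two-slot table on a 1 MB flash (addresses are hypothetical).
example = slot_sizes([0x002000, 0x082000], 0x100000)
```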

@projectgus
Contributor Author

Yeah, what I'm seeing is a lot worse than that. If I receive the data but then discard it then the performance is similar to what you're seeing, but when calling spi_flash_erase_sector every 4096 bytes the packet loss starts kicking in and the RTT time on the Linux side quickly climbs up to 20 seconds plus, so we're talking several minutes to transfer the data! (most of which is spent with both the ESP8266 and the sender doing nothing, until the Linux PC's RTT timer expires and it re-sends some packets.) I should try a Windows host and see if there's any difference.
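The climb to retry intervals of 20 seconds or more is consistent with ordinary exponential backoff of the sender's retransmission timer. A rough model, assuming a 0.2 s starting timeout and a 120 s cap (illustrative values, not measurements from this thread), with `rto_schedule` being a hypothetical helper name:

```python
def rto_schedule(initial_rto, retries, max_rto=120.0):
    """Return the wait before each successive retransmission of one segment."""
    waits = []
    rto = initial_rto
    for _ in range(retries):
        waits.append(rto)
        rto = min(rto * 2, max_rto)   # the timer doubles after every timeout
    return waits


waits = rto_schedule(0.2, 8)
# waits[-1] is 25.6 s: the eighth wait alone exceeds 20 seconds,
# and the eight waits add up to about 51 seconds of idle time.
```

Under these assumptions, a handful of consecutive losses of the same segment is enough to leave both ends idle for tens of seconds.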

Reducing the TCP max window size, buffer sizes, etc., doesn't seem to make a big difference either.

Regarding pre-erase, thanks for the good suggestions. For now I've been working on an assumption of 1 megabyte slot offsets, so as long as that assumption continues then it should be reasonably easy to erase in advance. Not a lot of modules with <16MBit of flash any more, and it keeps the linker script situation nice and simple - as you pointed out in the rboot docs!

I'm guessing that the whole "disabling interrupts drops packets" thing is fundamentally a bug of some kind. I might try writing a test program on the newest Espressif RTOS SDK version, and also try some packet captures against the IoT SDK, and see if I can find out anything useful by comparison. The bug could be in the from-source LWIP implementation in esp-open-rtos, or it could be in the binary libraries somewhere.

@raburton

Wow, that's quite a difference. How are you disabling interrupts? I can try that on mine and see what happens. I've not done that myself before. I assumed that it wasn't necessary because any read/write/erase/query on the flash causes it to be unmapped so I assumed the SDK code that did that must be able to handle it. I've never had a problem. Someone using the arduino sdk said they had to disable interrupts, but I don't know what that really means in that sdk. My advice was to shut down any user processing while the ota is done. Perhaps completely disabling interrupts is excessive, instead just halting user functions is probably sufficient (e.g. unsetting any gpio interrupts and generally stopping doing normal user code stuff).

@projectgus
Contributor Author

The RTOS SDK spi_flash_erase_sector/spi_flash_write functions disable/enable interrupts already as part of their implementation, so that happens automatically for RTOS-based usage. I believe the equivalent functions in the IoT SDK are ets_intr_lock() & ets_intr_unlock().

I'd be really interested to see if adding ets_intr_lock/unlock makes much difference with your test program, thanks!

The key difference between the RTOS SDK (and RTOS SDK derivations like esp-open-rtos), compared to the IoT SDK, is that there's a lot of functions in the ESP8266 mask ROM that aren't called when running as an RTOS, because they're not compatible. This means it's hard to get as much functionality running just from IRAM/mask ROM, because more code needs to go into IRAM.

Also because the RTOS SDK has pre-emptive multitasking, an interrupt can cause a context switch to any other task.

As a result, the RTOS SDK design (as I understand it) has the NMI (which is higher priority and can't be disabled) dealing with WiFi functions from the radio up to the MAC layer, and then it hands over to the higher layers via a soft interrupt (at interrupt level 1, ie the level you can disable).

(Which is a double problem, because I think if the MAC layer just rejected some of these frames that arrive during SPI flash operations then things might be fine - the WiFi layer would retry them. But instead they go into the MAC layer and then they get stalled (or dropped) before being passed to the network layer.)

All that said, I have source versions of all of the exception handlers at the moment so it might be possible to selectively route exceptions, and move just enough of the IP layer to IRAM that we can keep it running while SPI operations take place. I don't really like my chances though, the very simple ota_basic example uses 29KB of IRAM already!

I did some more experiments today, tried pre-emptively erasing all the flash then just calling write() with 256 byte (flash page size) buffers. This is better, erasing a 4KB sector takes ~40ms but writing a page only takes 500us. However it still starts dropping packets at some point, and the retry timer exponential backoff slowdowns kick in.
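To put those timings in proportion, here is a back-of-envelope calculation using the figures quoted above (~40 ms per 4 KB sector erase, ~500 us per 256-byte page write) and the ~256k image size mentioned earlier in the thread. `flash_blocking_time` is a hypothetical helper and the result is an estimate, not a measurement:

```python
SECTOR_SIZE = 4096
PAGE_SIZE = 256
ERASE_S = 0.040        # ~40 ms per sector erase (figure from the thread)
PAGE_WRITE_S = 0.0005  # ~500 us per page write (figure from the thread)


def flash_blocking_time(image_size):
    """Return (total erase seconds, total write seconds) for one image."""
    sectors = -(-image_size // SECTOR_SIZE)   # ceiling division
    pages = -(-image_size // PAGE_SIZE)
    return sectors * ERASE_S, pages * PAGE_WRITE_S


erase_s, write_s = flash_blocking_time(256 * 1024)
# 64 sectors -> about 2.56 s of erasing; 1024 pages -> about 0.51 s of writing
```

The total blocking time is dominated by erases, and each erase is a single 40 ms stall, whereas page writes split their share into 0.5 ms pieces. That matches the observation that pre-erasing helps, without explaining why packets are still dropped afterwards.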

More investigation & understanding required I think...

@projectgus
Contributor Author

Seeing you mention Arduino made me go and take a look there, too. They have their own OTA update protocol over UDP, which I'm guessing will pretty much avoid the problem. It just sends a chunk of the image as a UDP datagram, then waits for an "ack" datagram, then sends the next sector, etc. I'm almost certain this is because they've seen the same problem with dropped packets causing death-by-retry-timer. Maybe @igrr has some insight he can share here?

Rolling a custom UDP protocol might be the solution for now as well, although I'd like to support OTA updates via HTTPS some day (at least updates that don't take 15 minutes to complete).

@igrr

igrr commented Jul 30, 2015

In Arduino core OTA update doesn't really happen fully over UDP. We have a UDP socket listening on ESP8266, waiting for an "invitation" packet all the time. When IDE wants to update the sketch, it sends this invitation packet (via python script) which contains sketch size and TCP port number. ESP catches this packet and connects to the TCP socket set up by python script (IP is taken from UDP header, port is passed in the packet itself). Once TCP connection is established, python script starts sending data.
The reason for using UDP for handshake is that on ESP, UDP socket takes up much less RAM than a TCP one, and also the number of TCP sockets is limited to 5, so we don't want to occupy one all the time.
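The handshake described above can be sketched as follows. The packet layout here (a 32-bit sketch size followed by a 16-bit TCP port) is purely hypothetical and is not the format used by the Arduino core's script. The point is only that the payload carries the size and the port, while the host to connect back to comes from the UDP source address.

```python
import struct

# Hypothetical layout: u32 sketch size, u16 TCP port, network byte order.
INVITE = struct.Struct("!IH")


def build_invitation(sketch_size, tcp_port):
    """Payload the updater would send to the device's UDP listener."""
    return INVITE.pack(sketch_size, tcp_port)


def parse_invitation(payload, sender_addr):
    """Return (host, tcp_port, sketch_size) for the TCP connect-back.

    sender_addr is the (ip, udp_port) pair returned by recvfrom(); only the
    IP is used, because the TCP port travels inside the packet.
    """
    sketch_size, tcp_port = INVITE.unpack(payload)
    host, _udp_port = sender_addr
    return host, tcp_port, sketch_size
```

A device-side loop would call `recvfrom()` on its UDP socket, pass the result to `parse_invitation`, and then open a TCP connection to the returned host and port, so no TCP socket is held open while idle.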

@projectgus
Contributor Author

Thanks for filling me in @igrr. My mistake for not reading the Python script more closely. Nice trick saving a TCP socket.

As I understand it, the TCP protocol still pauses after each "chunk" waiting for a 4 byte response from the ESP, so I'm guessing the effect is still similar to how I saw it - there's no sliding window in effect, and any time the ESP disables interrupts and might drop packets, there are none in flight (because the client is still waiting for the in-band ack packet). Do you know if that's deliberate because of problems with dropped packets and retransmit times, if you didn't have the in-band acks?
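A minimal host-side sketch of that chunk-then-ack pattern, using a local socket pair in place of a real device. The chunk size, the meaning of the 4-byte response (a running byte count) and all function names are assumptions for illustration, not the Arduino protocol:

```python
import socket
import struct
import threading

CHUNK = 1024                # hypothetical chunk size
ACK = struct.Struct("!I")   # 4-byte in-band response; layout is an assumption


def recv_exact(sock, n):
    """Read exactly n bytes or raise if the peer closes first."""
    buf = bytearray()
    while len(buf) < n:
        part = sock.recv(n - len(buf))
        if not part:
            raise ConnectionError("peer closed early")
        buf.extend(part)
    return bytes(buf)


def receive_image(sock, total, flash):
    """Device side: store each chunk, then ack it."""
    received = 0
    while received < total:
        chunk = recv_exact(sock, min(CHUNK, total - received))
        flash.extend(chunk)                 # stands in for the blocking flash write
        received += len(chunk)
        sock.sendall(ACK.pack(received))    # only now may the sender continue


def send_image(sock, image):
    """Sender side: never send the next chunk until the previous one is acked."""
    for off in range(0, len(image), CHUNK):
        chunk = image[off:off + CHUNK]
        sock.sendall(chunk)
        (acked,) = ACK.unpack(recv_exact(sock, ACK.size))
        if acked != off + len(chunk):
            raise ConnectionError("unexpected ack value")


def demo(image):
    """Run both sides over a local socket pair and return what was stored."""
    a, b = socket.socketpair()
    flash = bytearray()
    worker = threading.Thread(target=receive_image, args=(b, len(image), flash))
    worker.start()
    try:
        send_image(a, image)
    finally:
        a.close()
        worker.join()
        b.close()
    return bytes(flash)
```

Because the sender blocks on the ack, nothing is in flight while the receiver is in its "flash write" step, which is the property discussed above. The cost is one round trip per chunk.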

@igrr

igrr commented Jul 30, 2015

It used to work for me even without in-band ACKs, but some people reported that with other services running in background (e.g. mDNS) there were intermittent failures, which they resolved by adding these ACKs. I'm sure this was an ad-hoc solution, no one cared to analyze it down to network level, but overall you're right, it reduces the chance of dropped packets while writing to flash.

@projectgus
Contributor Author

Thanks @igrr. Interesting, that seems to dovetail with the stuff you mentioned in the other Issue.

I thought about this some more. I think the "correct" way to fix the problem is to transmit an 802.11 null frame with the power management flag set, which tells the AP not to send any more frames to us until after we transmit again.

Something like:

disable_interrupts()
send_nullframe(PM flag = 1)
do_flash_ops()
enable_interrupts()
send_nullframe(PM flag = 0)

Hacking around, I can call pm_send_nullfunc(1) (libpp.a) and successfully see a single nullframe sent with the PM flag set. However sending the nullframe changes something in the internal state, the WiFi keeps working normally but further calls to pm_send_nullfunc(0) & pm_send_nullfunc(1) don't transmit anything.

There is also ieee80211_send_nulldata(node) in libnet80211, ie higher layer version of the same thing. Calling this function I can send null frames any number of times, but the PM flag is always cleared in them. In the internals of ieee80211_send_nulldata() I can see that the PM flag (0x20) is set in the frame header depending on at least two other pieces of internal state, but I can't figure out how these get set.

There may be similar ways to get the same result just by disabling the wifi radio (even without setting the PM flag, if the 802.11 frames aren't ACKed then the AP should retry them a few more times at short intervals before dropping them entirely, all at the frame layer.) I can't find a legit way to disable the radio temporarily though, pm_set_sleep(4) gets called during ADC operation (test_tout) but doesn't seem to do much. Ditto wDev_DisableRx & wDev_DisableTransmit.

I'm going to drop this approach for now and probably go for a synchronous protocol with in-band ACKs, similar to Arduino, but it'd be good to revisit some time in the future.

@projectgus
Contributor Author

OTA updates are now supported via TFTP server on the ESP, working with rboot. Seems stable.

Have written the code in a way that adding TFTP client support (fetching firmware images from a third party TFTP server) should be easy, although I haven't done this because I don't have time to test it right now.
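TFTP fits the synchronous approach because it is lock-step by design: each 512-byte DATA block must be acknowledged before the next is sent, and a block shorter than 512 bytes ends the transfer. The sketch below builds the relevant RFC 1350 packets on the host side. It is a generic illustration, not the esp-open-rtos TFTP code, and the filename used in the example is hypothetical.

```python
import struct

# Opcodes from RFC 1350.
OP_RRQ, OP_WRQ, OP_DATA, OP_ACK, OP_ERROR = 1, 2, 3, 4, 5
BLOCK_SIZE = 512


def build_wrq(filename, mode="octet"):
    """Write request: opcode, filename, NUL, mode, NUL."""
    return (struct.pack("!H", OP_WRQ)
            + filename.encode("ascii") + b"\0"
            + mode.encode("ascii") + b"\0")


def build_ack(block):
    """ACK for a given block number (block 0 acknowledges the WRQ itself)."""
    return struct.pack("!HH", OP_ACK, block)


def data_packets(image):
    """Yield DATA packets in order; the last one carries fewer than 512 bytes.

    If the image length is an exact multiple of 512, a final empty DATA
    packet is emitted so the receiver can tell the transfer is complete.
    """
    block = 1
    off = 0
    while True:
        payload = image[off:off + BLOCK_SIZE]
        yield struct.pack("!HH", OP_DATA, block & 0xFFFF) + payload
        if len(payload) < BLOCK_SIZE:
            return
        off += BLOCK_SIZE
        block += 1


wrq = build_wrq("firmware.bin")   # hypothetical filename
```

A sender would transmit one packet from `data_packets`, wait for the matching ACK, and only then send the next, so the receiver is free to write flash between blocks.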

I'm going to open another issue to track being able to do HTTP/HTTPS OTA updates, but that's more of a long term goal for now.

@raburton

raburton commented Aug 5, 2015

Looks like a nice solution. I found you'd fixed a typo in my rBoot ota header that I didn't know about so I've fixed that upstream. I see you've written your own Cache_Read_Enable function, which is interesting. Did you have any problem getting it to compile in? I did when I wrote my wrapper. I note you've written it in assembler, did you know there is a c wrapper for doing the same included in rBoot, it may have been added after you took a copy. https://github.com/raburton/esp8266/blob/master/rboot/rboot-bigflash.c Also, if you'd prefer to replace the whole function I've got a c version of the original rom function (but doing it that way uses more ram than writing a wrapper for the original rom function, so I wouldn't recommend it).

@raburton

raburton commented Aug 5, 2015

Just occurred to me, you're using rtos, so presumably you didn't have any trouble putting in the wrapper, because you weren't trying to override something from the original sdk.

@projectgus
Contributor Author

Hi Richard,

> I see you've written your own Cache_Read_Enable function, which is interesting. Did you have any problem getting it to compile in?
> ...
> Just occurred to me, you're using rtos, so presumably you didn't have any trouble putting in the wrapper, because you weren't trying to override something from the original sdk.

Yeah, once you control both the linker script and the build process it's pretty easy! I just renamed the ROM-based Cache_Read_Enable in the linker script so it moves out of the way, and the linker picks up the new one.

> did you know there is a c wrapper for doing the same included in rBoot

I did, thanks for that! Without all your documenting and reverse engineering of Cache_Read_Enable this would have been a massively bigger task!

esp-open-rtos uses the 0.9.9 SDK which predates Cache_Read_Enable_New, so I was forced to replace the original Cache_Read_Enable.

I originally wrote a replacement in C (very similar to your replacement) but I found it crashed on startup. Looking at the ROM implementation of Cache_Read_Enable I saw it didn't use the stack, so my working theory was that the first time Cache_Read_Enable is called, during the startup process, the stack hadn't been properly initialised. Hence I rewrote mine in assembler to explicitly not use the stack, and that seemed to work.

That theory may be totally wrong though, especially if you've written a Cache_Read_Enable in C from scratch. It's possible there was just a bug in my C version that went away when I rewrote it in assembler.

> Also, if you'd prefer to replace the whole function I've got a c version of the original rom function (but doing it that way uses more ram than writing a wrapper for the original rom function, so I wouldn't recommend it).

I'd be interested to see this, just out of general curiosity!

Have you successfully linked it into an SDK program and had it run as normal?

Cheers,

Angus

@raburton

raburton commented Aug 7, 2015

Posted it on my blog for you (and anyone else who wants it) http://richard.burtons.org/2015/08/07/c-version-of-cache_read_enable-for-esp8266/

Pretty sure I did run it for a while to test it out, but it uses more iram than writing a wrapper so I soon ditched it. Not had time to play with it again to double check it, or to look at your stack theory, as I'm changing job this week and my fiancée has returned from a week away, so I've been pretty busy.
