ESP8266 send returns WOULD_BLOCK error when busy #9051

michalpasztamobica · 2018-12-11T12:21:56Z

Description

In case ESP8266 reports "busy" status we need to return WOULD_BLOCK error from the send function.

Pull request type

[x] Fix
[ ] Refactor
[ ] Target update
[ ] Functionality change
[ ] Docs update
[ ] Test update
[ ] Breaking change

michalpasztamobica · 2018-12-11T12:22:18Z

@SeppoTakalo @VeijoPesonen @kjbracey-arm @KariHaapalehto , please review

0xc0170 · 2018-12-11T12:30:19Z

Should this be tested with the client, or has it already been?

michalpasztamobica · 2018-12-11T12:32:30Z

@teetak01 is working on this: https://github.com/ARMmbed/mbed-client-testapp/pull/1156

kjbracey · 2018-12-11T12:35:00Z

components/wifi/esp8266-driver/ESP8266Interface.cpp

@@ -479,6 +481,11 @@ int ESP8266Interface::socket_send(void *handle, const void *data, unsigned size)

    status = _esp.send(socket->id, data, size);

+    if (status == NSAPI_ERROR_WOULD_BLOCK)


If event is the thing that leads to sigio, this should really be timed, or you're just immediately re-entering, probably. I would expect this to be using "call_in" for the event, whereas the other timeout stuff in the PR could just be checking elapsed time in the send call, I think. (although I'm not 100% sure what that's doing)

NSAPI_ERROR_WOULD_BLOCK will only be returned from ESP8266::send if both _busy and _busy_timeout_reached are set.
To set the _busy_timeout_reached flag we need to call oob_busy_timeout(), which is only called from ESP8266Interface::_oob_busy_timeout(), and that in turn is scheduled via call_in from _oob_busy_detected().
So, indirectly, this is timed. If the send returns a valid status within 1/10 s, the fact that there was a busy flag is ignored.
Does this make sense?

Um, not sure. Seems too complicated, and not sure there's a need.

If the device does promptly return BUSY or OK after each command, then can just finish on either of those. Timeout shouldn't matter if it's working.

After you get BUSY, check overall elapsed time to determine whether you want to try again, or think it's time to return EWOULDBLOCK. Just look at Kernel::get_ms_count() to note time - don't need a callback, or any special handling in the busy OOB (just set busy flag and abort wait for OK).
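The elapsed-time approach suggested above can be sketched roughly as follows. This is a minimal, standalone illustration and not the driver's code: `send_with_deadline`, `AtResult`, and the injected `now_ms` / `try_once` callables are hypothetical names, with `now_ms` standing in for Kernel::get_ms_count() and `try_once` standing in for one AT send attempt. The numeric error values are the ones that appear in this thread's logs.

```cpp
#include <cstdint>
#include <functional>

// Values match the NSAPI codes quoted in this thread's logs.
constexpr int NSAPI_ERROR_OK = 0;
constexpr int NSAPI_ERROR_WOULD_BLOCK = -3001;
constexpr int NSAPI_ERROR_DEVICE_ERROR = -3012;

// Hypothetical outcome of one AT send attempt.
enum class AtResult { SendOk, Busy, Error };

// Hypothetical helper: retry while the modem answers "busy", bounded by
// overall elapsed time rather than by a timer callback.
// now_ms stands in for Kernel::get_ms_count(); try_once stands in for
// issuing the send command and waiting for SEND OK or the busy OOB.
int send_with_deadline(const std::function<uint64_t()> &now_ms,
                       const std::function<AtResult()> &try_once,
                       uint64_t busy_budget_ms)
{
    const uint64_t start = now_ms();
    for (;;) {
        switch (try_once()) {
            case AtResult::SendOk:
                return NSAPI_ERROR_OK;
            case AtResult::Error:
                return NSAPI_ERROR_DEVICE_ERROR;
            case AtResult::Busy:
                // Give up once the busy budget is spent; the caller can
                // retry after a timed sigio.
                if (now_ms() - start >= busy_budget_ms) {
                    return NSAPI_ERROR_WOULD_BLOCK;
                }
                break; // budget left: try again
        }
    }
}
```

With this shape the busy OOB handler only needs to set a flag and abort the wait; no callback or extra timeout flag is involved.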

kjbracey · 2018-12-11T12:37:44Z

components/wifi/esp8266-driver/ESP8266/ESP8266.cpp

@@ -951,6 +974,11 @@ void ESP8266::_oob_busy()
        MBED_ERROR(MBED_MAKE_ERROR(MBED_MODULE_DRIVER, MBED_ERROR_CODE_ENOMSG), \
                   "ESP8266::_oob_busy() AT timeout\n");
    }
+    if (!_busy)


We may as well just abort the current command here, right? If it says busy we know it's not going to say "OK". If the outer loop wants to retry immediately it can check elapsed time and reconsider. Else fall back to EWOULDBLOCK and retry after timed sigio.

There's probably never any need to be waiting inside the driver itself, unless we really are expecting another command to work if we just wait a few milliseconds.

If we abort here then do we need to set the 1/10s timeout?

Timeouts would ideally be a "something's gone wrong" situation, and should never be hit. Timeout on recv("SEND OK") mattered before when we weren't detecting the BUSY, but if we now positively reckon we'll see BUSY or SEND OK or ERROR(?), timeout shouldn't really matter. Can be high if we never hit it in practice.

adbridge · 2018-12-11T13:00:16Z

@kjbracey-arm can you please review latest commit?

VeijoPesonen · 2018-12-11T12:59:28Z

components/wifi/esp8266-driver/ESP8266/ESP8266.cpp

@@ -573,12 +575,22 @@ nsapi_error_t ESP8266::send(int id, const void *data, uint32_t amount)
            if (_serial_rts == NC) {
                while (_parser.process_oob()); // Drain USART receive register
            }
+            _busy = false;


This should be done before calling send(), after acquiring the mutex. Another request might have triggered the busy OOB so can't rely on the fact that this would be false.
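A minimal sketch of the ordering described above, assuming a stripped-down stand-in for the driver object. The `Modem` class and its `send(bool)` signature are hypothetical and only illustrate clearing the flag while holding the lock, before the request is made.

```cpp
#include <mutex>

// Hypothetical, stripped-down stand-in for the driver object.
class Modem {
public:
    // Called by the OOB handler when a busy message is seen.
    void on_busy() { _busy = true; }

    // Returns true if this request (not an earlier one) hit busy.
    // device_says_busy simulates the modem's reaction to this request.
    bool send(bool device_says_busy)
    {
        std::lock_guard<std::mutex> lock(_smutex);
        _busy = false;      // clear stale state left by another request
        if (device_says_busy) {
            on_busy();      // in the driver this would arrive via the parser
        }
        return _busy;
    }

private:
    std::mutex _smutex;
    bool _busy = false;
};
```

Because the flag is reset only after the mutex is taken, a busy OOB triggered by some other request cannot leak into this one.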

VeijoPesonen · 2018-12-11T13:01:23Z

components/wifi/esp8266-driver/ESP8266/ESP8266.cpp

        if (_error) {
            _error = false;
        }
+        if (_busy && _busy_timeout_reached) {
+            tr_debug("returning NSAPI_ERROR_WOULD_BLOCK");
+            _busy = false;


Unnecessary, as every caller that checks for busy must set this to false before making a request.

Otherwise could follow the same pattern as with
nsapi_error_t ESP8266::connect(const char *ap, const char *passPhrase)

VeijoPesonen · 2018-12-11T13:18:10Z

components/wifi/esp8266-driver/ESP8266/ESP8266.cpp

+void ESP8266::oob_busy_timeout()
+{
+    tr_debug("oob_busy_timeout called");
+    if (_busy)


Unnecessary check because busy has occurred if this function gets called

Yes, but we might have gotten "busy..", registered the callback and then managed to actually send the data, which would clear the _busy flag. Does this make sense?

But the lock is released in between?

VeijoPesonen · 2018-12-11T13:19:30Z

components/wifi/esp8266-driver/ESP8266Interface.cpp

+
+void ESP8266Interface::_oob_busy_detected()
+{
+    _global_event_queue->call_in(ESP8266_OOB_BUSY_TIMEOUT_MS, callback(this, &ESP8266Interface::_oob_busy_timeout));


An event should be added to the queue only if we are certain that "busy s..." or "busy p..." has occurred.

_oob_busy_detected is called from ESP8266::_oob_busy via a callback.
In the _oob_busy callback I can see only three ways: "busy s...", "busy p...", or MBED_ERROR, which would halt the system.
So I think the requirement is fulfilled?
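The three outcomes discussed above can be illustrated with a small classifier. This is a sketch under assumptions: `classify_busy` and `BusyKind` are hypothetical names, and the real handler reads from the AT parser rather than from a complete string.

```cpp
#include <string>

enum class BusyKind { Sending, Processing, Unknown };

// Hypothetical classifier for a full OOB line such as "busy s...".
BusyKind classify_busy(const std::string &line)
{
    const std::string prefix = "busy ";
    if (line.compare(0, prefix.size(), prefix) != 0) {
        return BusyKind::Unknown;
    }
    const std::string rest = line.substr(prefix.size());
    if (rest.rfind("s...", 0) == 0) {
        return BusyKind::Sending;       // "busy s..."
    }
    if (rest.rfind("p...", 0) == 0) {
        return BusyKind::Processing;    // "busy p..."
    }
    return BusyKind::Unknown;           // the driver treats this as an error
}
```

Under this sketch, a follow-up event would be queued only for `Sending` or `Processing`, never for `Unknown`.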

adbridge · 2018-12-11T13:46:29Z

@michalpasztamobica could you address the review comments please

ciarmcom · 2018-12-11T14:00:47Z

@michalpasztamobica, thank you for your changes.
@ARMmbed/mbed-os-ipcore @ARMmbed/mbed-os-maintainers please review.

VeijoPesonen · 2018-12-11T15:05:49Z

components/wifi/esp8266-driver/ESP8266/ESP8266.cpp

+        if (_busy && _busy_timeout_reached) {
+            tr_debug("returning NSAPI_ERROR_WOULD_BLOCK");
+            _busy_timeout_reached = false;
+            _smutex.unlock();


set_timeout() is missing from here. Doesn't make a difference as the default and SEND-timeout are basically the same...

kjbracey · 2018-12-11T15:09:40Z

I'm still not really understanding the approach - can you look at my comments from a couple of hours ago?

VeijoPesonen · 2018-12-11T15:09:51Z

components/wifi/esp8266-driver/ESP8266/ESP8266.cpp

@@ -576,6 +578,7 @@ nsapi_error_t ESP8266::send(int id, const void *data, uint32_t amount)
    for (unsigned i = 0; i < 2; i++) {


This should become unnecessary.

VeijoPesonen · 2018-12-11T15:31:41Z

components/wifi/esp8266-driver/ESP8266Interface.cpp

@@ -520,6 +522,11 @@ int ESP8266Interface::socket_send(void *handle, const void *data, unsigned size)

    status = _esp.send(socket->id, data, size);

+    if (status == NSAPI_ERROR_WOULD_BLOCK)
+    {
+        event();


Shouldn't this be the function which gets posted to the event queue?
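To illustrate the difference between calling event() immediately and deferring it, here is a rough, self-contained sketch. `TinyQueue` is a hypothetical miniature with a manually advanced clock that stands in for the event queue's call_in; `finish_send` and its parameters are likewise made up for the example.

```cpp
#include <cstdint>
#include <functional>
#include <map>
#include <utility>

// Hypothetical miniature of a delayed-call queue; time is advanced manually.
class TinyQueue {
public:
    void call_in(uint64_t delay_ms, std::function<void()> fn)
    {
        _pending.emplace(_now + delay_ms, std::move(fn));
    }

    // Advance the fake clock and run everything that has become due.
    void advance(uint64_t ms)
    {
        _now += ms;
        while (!_pending.empty() && _pending.begin()->first <= _now) {
            auto fn = std::move(_pending.begin()->second);
            _pending.erase(_pending.begin());
            fn();
        }
    }

private:
    uint64_t _now = 0;
    std::multimap<uint64_t, std::function<void()>> _pending;
};

constexpr int NSAPI_ERROR_WOULD_BLOCK = -3001;

// Hypothetical tail of socket_send: on WOULD_BLOCK, schedule the
// sigio-style notification for later instead of firing it immediately.
int finish_send(int status, TinyQueue &q, uint64_t retry_ms,
                std::function<void()> notify)
{
    if (status == NSAPI_ERROR_WOULD_BLOCK) {
        q.call_in(retry_ms, std::move(notify));
    }
    return status;
}
```

Deferring the notification gives the modem time to leave the busy state before the application re-enters send.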

adbridge · 2018-12-11T16:03:16Z

@VeijoPesonen could you double check the latest commits please

cmonr · 2018-12-11T16:03:44Z

And @kjbracey-arm if you like.

adbridge · 2018-12-11T16:43:12Z

CI started

michalpasztamobica · 2018-12-11T16:44:17Z

It seems that after the recent refactoring the fix is not doing what we wanted it to any more:
18:31:37 16:31:37.750 | D1 <-- DutThread: [01243][DBG ][ESPA]: busy s...
18:31:37 16:31:37.750 | D1 <-- DutThread: [01244][DBG ][ESPA]: busy s...
18:31:42 16:31:41.880 | D1 <-- DutThread: [01192][DBG ][PAL ]: pal_plat_send status -3012

I could previously see (with the complex callback implementation) that -3001 (WOULD_BLOCK) was returned, while now it is back to DEVICE_ERROR... I am not sure, but perhaps I introduced a bug with the recent changes?

cmonr · 2018-12-11T16:46:12Z

CI started

adbridge · 2018-12-11T16:47:20Z

CI started

@cmonr looks like CI needs to be stopped again :(

cmonr · 2018-12-11T16:47:51Z

Stopping both jobs...

michalpasztamobica · 2018-12-11T16:48:04Z

Yes, I am sorry, but the logs are showing that something is wrong now.

cmonr · 2018-12-11T16:48:46Z

@michalpasztamobica It's all good. Better to know sooner than after the jobs have completed 😄

SeppoTakalo · 2018-12-11T16:49:33Z

Where did you spot that?
The Arm compiler builds were able to pass tests.

michalpasztamobica · 2018-12-11T16:50:46Z

https://jenkins-internal.mbed.com/job/ARMmbed/job/mbed-client-testapp/job/PR-1156/14/consoleFull

mbed-ci · 2018-12-11T16:51:36Z

Test run: FAILED

Summary: 1 of 1 test jobs failed
Build number : 1
Build artifacts

Failed test jobs:

jenkins-ci/mbed-os-ci_unittests

michalpasztamobica · 2018-12-11T16:52:51Z

Looking at the code, if I abort the parser it might trigger the return from recv etc. inside ESP8266::send, which would mean that the function is returning but the _busy flag wasn't set yet. I am not sure how the context would be switched.
Surely, we eliminate this race condition if _busy is set before the call to parser.abort().
Trying this out: https://jenkins-internal.mbed.com/job/ARMmbed/job/mbed-client-testapp/job/PR-1156/15/console

kjbracey · 2018-12-11T17:08:55Z

The abort isn't synchronous, it just sets a flag. It doesn't cause early return from your OOB handler.

(It was a backwards-compatible alternative to giving the OOB handlers a return value - the parser checks to see if they called abort after they return).
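The flag-based abort described above can be pictured with a toy parser. This is only an illustration of the mechanism as explained in the comment, not the actual parser implementation; `TinyParser` and its methods are hypothetical.

```cpp
#include <functional>
#include <string>
#include <utility>
#include <vector>

// Hypothetical miniature of an AT parser, to illustrate flag-based abort.
class TinyParser {
public:
    void oob(const std::string &prefix, std::function<void()> handler)
    {
        _oobs.push_back({prefix, std::move(handler)});
    }

    // Only records the request; control returns to the handler as usual.
    void abort() { _aborted = true; }

    // Feed lines until `expected` arrives. Returns false if a handler
    // asked for an abort; the flag is examined after the handler returns.
    bool recv(const std::vector<std::string> &lines, const std::string &expected)
    {
        _aborted = false;
        for (const auto &line : lines) {
            for (auto &o : _oobs) {
                if (line.compare(0, o.prefix.size(), o.prefix) == 0) {
                    o.handler();
                    if (_aborted) {
                        return false;
                    }
                }
            }
            if (line == expected) {
                return true;
            }
        }
        return false;
    }

private:
    struct Oob {
        std::string prefix;
        std::function<void()> handler;
    };
    std::vector<Oob> _oobs;
    bool _aborted = false;
};
```

In this model a handler that calls abort() first and sets its own busy flag afterwards still finishes before recv returns, so the order of the two statements inside the handler has no effect on what the caller observes.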

michalpasztamobica · 2018-12-11T17:17:10Z

19:12:07 17:12:06.972 | D1 <-- DutThread: [01277][DBG ][ESPA]: returning WOULD_BLOCK
19:12:07 17:12:06.972 | D1 <-- DutThread: Enqueuing the event call[01278][DBG ][PAL ]: pal_plat_send status -3001
(source: https://jenkins-internal.mbed.com/job/ARMmbed/job/mbed-client-testapp/job/PR-1156/18/console)

OK, our software runs fine. It seems the order in which the flag is set made a difference in the end...

cmonr · 2018-12-11T17:18:03Z

CI started

adbridge · 2018-12-11T17:38:19Z

@michalpasztamobica So is this working consistently for us for both older and latest versions of the driver ?

mbed-ci · 2018-12-11T19:51:59Z

Test run: SUCCESS

Summary: 11 of 11 test jobs passed
Build number : 2
Build artifacts

michalpasztamobica · 2018-12-12T07:13:21Z

@adbridge this fix is a part of the new mbed-os ESP8266 driver only. I did not apply this patch to the old mbed-os ESP8266 driver. Perhaps you meant the newer and older AT firmware versions (1.3.0, 1.6.0, 1.7.0)? I think we should also agree on what "working" means.
Teemu pointed out a failing test. Surely, our software (mbed-os ESP8266 driver) was not returning the correct value in case the ESP chip was responding with a "busy..." message. Both the new and old version returned a generic DEVICE_ERROR instead of WOULD_BLOCK error. With this fix we return that correct value (NSAPI_ERROR_WOULD_BLOCK). Thanks to that, even if ESP chip runs into the busy state, the application can act accordingly (perhaps it wants to wait, or try to reset the chip and resend the message?).
Despite the fix in our software we can see the test is still failing. The ESP8266 chip is sometimes not recovering from the busy state. Teemu checked that using a newer 1.7.0 version of the AT firmware (not to be confused with mbed-os ESP8266 driver) let him pass the test.
I therefore suspect that the test revealed a weakness in mbed-os (fixed with this commit) and a weakness in the ESP8266 chip itself. It seems that the app is working fine if it can pass the test with the newer firmware.
@SeppoTakalo , @teetak01, please correct me if I am wrong in any of the above statements :)
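The point above about the application acting on the returned value can be sketched as a small decision function. `decide` and `NextStep` are hypothetical names; the two error values are the ones seen in this thread's logs (-3001 for WOULD_BLOCK, -3012 for DEVICE_ERROR).

```cpp
constexpr int NSAPI_ERROR_WOULD_BLOCK = -3001;
constexpr int NSAPI_ERROR_DEVICE_ERROR = -3012;

// Hypothetical application-level reactions to a send() result.
enum class NextStep { Continue, RetryLater, ResetOrFail };

NextStep decide(int send_status)
{
    if (send_status >= 0) {
        return NextStep::Continue;      // data was accepted
    }
    if (send_status == NSAPI_ERROR_WOULD_BLOCK) {
        return NextStep::RetryLater;    // modem busy: wait for sigio, resend
    }
    return NextStep::ResetOrFail;       // e.g. NSAPI_ERROR_DEVICE_ERROR
}
```

Before the fix, a busy modem would have ended up in the last branch, leaving the application unable to tell a transient condition from a hard failure.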

kjbracey · 2018-12-12T07:42:11Z

A couple of questions for future investigation:

How did reordering those 2 lines (abort, busy) make a difference? Are you sure they did? If so, implies some sort of undefined behaviour or compiler bug being triggered somehow - corruption/locking/uninitialised variables.
Is "the chip" really busy (transiently or stuck)? Or is "BUSY" just a "buffer full" flow control response on an attempt to send to a TCP socket? If the latter, then the problem is more localised, and the loop would not be justified at all. The retry loop is only justified if it's a potential random transient error so it's like doing a radio retry due to lack of ack.

michalpasztamobica · 2018-12-12T08:39:34Z

@kjbracey-arm , you are right, after looking into the code I also do not see how reordering the two events made a difference. I assumed that there is some blocking operation awaiting a change of _abort variable, but it is not the case...

I extracted the hashes used to run the test.
This hash made pal_plat_send return status -3012
This hash made pal_plat_send return status -3001
The only difference is the reordering of the abort and the flag.

Addressing the second point.
From the scarce documentation I could find, it seems that the latter is the case: "busy" rather means "buffer full", when too much data has been fed in. This is the full quotation from the official AT firmware spec:

busy s... Busy sending. The system is sending data now, cannot accept the newly input.
busy p... Busy processing. The system is in process of handling the previous command, cannot accept the newly input.

Then in the examples documentation I found this:

• If the number of bytes inputted are more than the size defined (n):
  - the system will reply busy, and send the first n bytes.
  - and after sending the first n bytes, the system will reply SEND OK.

Looking at our driver, we should never run into a situation where we input more bytes than defined with +CIPSEND. We could truncate it to match the 2048B buffer, but never exceed it.
So it sounds like the documentation and examples are not giving the full picture of what "busy" means.
Especially since we do not come across this issue with AT firmware 1.7.0...
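The idea of never feeding more than the declared amount can be sketched as follows. The 2048-byte figure is taken from the comment above; `plan_chunks` and `MAX_SEND` are hypothetical names, and this is not how the driver is actually structured.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Assumption: 2048 bytes is the per-command limit mentioned above.
constexpr std::size_t MAX_SEND = 2048;

// Hypothetical helper: split a payload length into per-command amounts,
// none of which exceeds MAX_SEND.
std::vector<std::size_t> plan_chunks(std::size_t total)
{
    std::vector<std::size_t> chunks;
    while (total > 0) {
        const std::size_t n = std::min(total, MAX_SEND);
        chunks.push_back(n);
        total -= n;
    }
    return chunks;
}
```

If each send command declares exactly one of these amounts and then writes exactly that many bytes, the documented "more bytes than defined" cause of busy should not apply, which is why the busy responses seen here look unexplained by the spec.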

@VeijoPesonen , please correct me if I am wrong.

SeppoTakalo · 2018-12-12T09:04:51Z

What the documentation says and what the device actually does are not the same thing.
We have sometimes seen devices using an older AT command syntax, so you need to do a bit of exploring to make the ESP work.

Therefore I assume that "busy s" might sometimes be just a generic error code from the device.

0xc0170 added the needs: review label Dec 11, 2018

0xc0170 requested review from SeppoTakalo, VeijoPesonen and kjbracey December 11, 2018 12:29

kjbracey reviewed Dec 11, 2018

michalpasztamobica force-pushed the esp8266_busy_signal branch from efb7ebe to 4ac8c27 Compare December 11, 2018 12:54

adbridge added the release-version: 5.11.0-rc3 label Dec 11, 2018

SeppoTakalo approved these changes Dec 11, 2018

michalpasztamobica force-pushed the esp8266_busy_signal branch from 4ac8c27 to 9571f46 Compare December 11, 2018 13:20

VeijoPesonen suggested changes Dec 11, 2018

ciarmcom requested review from a team December 11, 2018 14:00

michalpasztamobica force-pushed the esp8266_busy_signal branch 2 times, most recently from fb45c4e to 0a66812 Compare December 11, 2018 14:59

VeijoPesonen reviewed Dec 11, 2018

michalpasztamobica force-pushed the esp8266_busy_signal branch from 0a66812 to f8ecba1 Compare December 11, 2018 16:01

michalpasztamobica force-pushed the esp8266_busy_signal branch from f8ecba1 to 0e0cd95 Compare December 11, 2018 16:03

kjbracey reviewed Dec 11, 2018

ESP8266 send returns WOULD_BLOCK error when busy

d6e385b

cmonr added needs: work and removed needs: CI labels Dec 11, 2018

michalpasztamobica force-pushed the esp8266_busy_signal branch from d03f870 to d6e385b Compare December 11, 2018 16:49

cmonr added the needs: CI label Dec 11, 2018

cmonr removed the needs: work label Dec 11, 2018

cmonr merged commit 0a832dd into ARMmbed:master Dec 11, 2018

cmonr removed the needs: CI label Dec 11, 2018

ccli8 mentioned this pull request Dec 12, 2018

M487: Crash report test failed in IAR #9069

Closed

michalpasztamobica deleted the esp8266_busy_signal branch December 12, 2018 07:13
