Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ESP8266 send returns WOULD_BLOCK error when busy #9051

Merged
merged 1 commit into from
Dec 11, 2018

Conversation

michalpasztamobica
Copy link
Contributor

Description

In case ESP8266 reports "busy" status we need to return WOULD_BLOCK error from the send function.

Pull request type

[x] Fix
[ ] Refactor
[ ] Target update
[ ] Functionality change
[ ] Docs update
[ ] Test update
[ ] Breaking change

@michalpasztamobica
Copy link
Contributor Author

@SeppoTakalo @VeijoPesonen @kjbracey-arm @KariHaapalehto , please review

@0xc0170
Copy link
Contributor

0xc0170 commented Dec 11, 2018

Should this or already was tested with the client ?

@michalpasztamobica
Copy link
Contributor Author

@@ -479,6 +481,11 @@ int ESP8266Interface::socket_send(void *handle, const void *data, unsigned size)

status = _esp.send(socket->id, data, size);

if (status == NSAPI_ERROR_WOULD_BLOCK)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If event is the thing that leads to sigio, this should really be timed, or you're just immediately re-entering, probably. I would expect this to be using "call_in" for the event, whereas the other timeout stuff in the PR could just be checking elapsed time in the send call, I think. (although I'm not 100% sure what that's doing)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the status == NSAPI_ERROR_WOULD_BLOCK will only be returned from ESP8266::send function if both _busy and _busy_timeout_reached are set.
To set the _busy_timeout_reached flag we need to call the oob_busy_timeout() function which is only called in ESP8266Interface::_oob_busy_timeout() and that is called via a call_in from _oob_busy_detected().
So indirectly - this is timed. If the send returns with a valid return error within 1/10s the fact that there was a busy flag will be ignored.
Does this make sense?

Copy link
Contributor

@kjbracey kjbracey Dec 11, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Um, not sure. Seems too complicated, and not sure there's a need.

If the device does promptly return BUSY or OK after each command, then can just finish on either of those. Timeout shouldn't matter if it's working.

After you get BUSY, check overall elapsed time to determine whether you want to try again, or think it's time to return EWOULDBLOCK. Just look at Kernel::get_ms_count() to note time - don't need a callback, or any special handling in the busy OOB (just set busy flag and abort wait for OK).

@@ -951,6 +974,11 @@ void ESP8266::_oob_busy()
MBED_ERROR(MBED_MAKE_ERROR(MBED_MODULE_DRIVER, MBED_ERROR_CODE_ENOMSG), \
"ESP8266::_oob_busy() AT timeout\n");
}
if (!_busy)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We may as well just abort the current command here, right? If it says busy we know it's not going to say "OK". It the outer loop wants to retry immediately it can check elapsed time and reconsider. Else fall back to EWOULDBLOCK and retry after timed sigio.

There's probably never any need to be waiting inside the driver itself, unless we really are expecting another command to work if we just wait a few milliseconds.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we abort here then do we need to set the 1/10s timeout?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Timeouts would ideally be a "something's gone wrong" situation, and should never be hit. Timeout on recv("SEND OK") mattered before when we weren't detecting the BUSY, but if we now positively reckon we'll see BUSY or SEND OK or ERROR(?), timeout shouldn't really matter. Can be high if we never hit it in practice.

@adbridge
Copy link
Contributor

@kjbracey-arm can you please review latest commit?

@@ -573,12 +575,22 @@ nsapi_error_t ESP8266::send(int id, const void *data, uint32_t amount)
if (_serial_rts == NC) {
while (_parser.process_oob()); // Drain USART receive register
}
_busy = false;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be done before calling send(), after acquiring the mutex. Another request might have triggered the busy OOB so can't rely on the fact that this would be false.

if (_error) {
_error = false;
}
if (_busy && _busy_timeout_reached) {
tr_debug("returning NSAPI_ERROR_WOULD_BLOCK");
_busy = false;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unnecessary as each who is checking for busy must set this to false before making a request

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Otherwise could follow the same pattern as with
nsapi_error_t ESP8266::connect(const char *ap, const char *passPhrase)

void ESP8266::oob_busy_timeout()
{
tr_debug("oob_busy_timeout called");
if (_busy)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unnecessary check because busy has occurred if this function gets called

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, but we might have gotten "busy..", registered the callback and then managed to actually send the data, which would clear the _busy flag. Does this make sense?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But the lock is released in between?


void ESP8266Interface::_oob_busy_detected()
{
_global_event_queue->call_in(ESP8266_OOB_BUSY_TIMEOUT_MS, callback(this, &ESP8266Interface::_oob_busy_timeout));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

An event should be added to the queue only if we are certain that "busy s..." or "busy p..." has occurred.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

_oob_busy_detected is called from ESP8266::_oob_busy via a callback.
In the _oob_busy callback I can see only three way - "busy s...", "busy... p" or MBED_ERROR which would halt the system.
So I think the requirement is fulfilled?

@adbridge
Copy link
Contributor

@michalpasztamobica could you address the review comments please

@ciarmcom ciarmcom requested review from a team December 11, 2018 14:00
@ciarmcom
Copy link
Member

@michalpasztamobica, thank you for your changes.
@ARMmbed/mbed-os-ipcore @ARMmbed/mbed-os-maintainers please review.

@michalpasztamobica michalpasztamobica force-pushed the esp8266_busy_signal branch 2 times, most recently from fb45c4e to 0a66812 Compare December 11, 2018 14:59
if (_busy && _busy_timeout_reached) {
tr_debug("returning NSAPI_ERROR_WOULD_BLOCK");
_busy_timeout_reached = false;
_smutex.unlock();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

set_timeout() is missing from here. Doesn't make a difference as the default and SEND-timeout are basically the same...

@kjbracey
Copy link
Contributor

I'm still not really understanding the approach - can you look at my comments from a couple of hours ago?

@@ -576,6 +578,7 @@ nsapi_error_t ESP8266::send(int id, const void *data, uint32_t amount)
for (unsigned i = 0; i < 2; i++) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should become unnecessary.

@@ -520,6 +522,11 @@ int ESP8266Interface::socket_send(void *handle, const void *data, unsigned size)

status = _esp.send(socket->id, data, size);

if (status == NSAPI_ERROR_WOULD_BLOCK)
{
event();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't this be the function which gets thrown to the event queue

@adbridge
Copy link
Contributor

@VeijoPesonen could you double check the latest commits please

@cmonr
Copy link
Contributor

cmonr commented Dec 11, 2018

And @kjbracey-arm if you like.

@adbridge
Copy link
Contributor

CI started

@michalpasztamobica
Copy link
Contributor Author

It seems that after the recent refactoring the fix is not doing what we wanted it to any more:
18:31:37 16:31:37.750 | D1 <-- DutThread: [01243][DBG ][ESPA]: busy s...
18:31:37 16:31:37.750 | D1 <-- DutThread: [01244][DBG ][ESPA]: busy s...
18:31:42 16:31:41.880 | D1 <-- DutThread: [01192][DBG ][PAL ]: pal_plat_send status -3012

I could previously see (with the complex callback implementation) that -3001 (WOULD_BLOCK) was returned, while now it is back to DEVICE_ERROR... I am not sure, but perhaps I introduced a bug with the recent changes?

@cmonr
Copy link
Contributor

cmonr commented Dec 11, 2018

CI started

@adbridge
Copy link
Contributor

adbridge commented Dec 11, 2018

CI started

@cmonr looks like CI needs to be stopped again :(

@cmonr
Copy link
Contributor

cmonr commented Dec 11, 2018

Stopping both jobs...

@michalpasztamobica
Copy link
Contributor Author

Yes, I am sorry, but the logs are showing that something is wrong now.

@cmonr
Copy link
Contributor

cmonr commented Dec 11, 2018

@michalpasztamobica It's all good. Better to know sooner than after the jobs have completed 😄

@SeppoTakalo
Copy link
Contributor

Where did you spot that?
The Arm compiler builds were able to pass tests.

@michalpasztamobica
Copy link
Contributor Author

@mbed-ci
Copy link

mbed-ci commented Dec 11, 2018

Test run: FAILED

Summary: 1 of 1 test jobs failed
Build number : 1
Build artifacts

Failed test jobs:

  • jenkins-ci/mbed-os-ci_unittests

@michalpasztamobica
Copy link
Contributor Author

michalpasztamobica commented Dec 11, 2018

Looking at the code, if I abort the parser it might trigger the return from recv etc. inside ESP8266::send, which would mean that the function is returning, but _busy flag wasn't set yet. not sure how the context would be switched.
Surely, we eliminate this race condition if _busy is set before the call to parser.abort()
Trying this out: https://jenkins-internal.mbed.com/job/ARMmbed/job/mbed-client-testapp/job/PR-1156/15/console

@kjbracey
Copy link
Contributor

kjbracey commented Dec 11, 2018

The abort isn't synchronous, it just sets a flag. It doesn't cause early return from your OOB handler.

(It was a backwards-compatible alternative to giving the OOB handlers a return value - the parser checks to see if they called abort after they return).

@michalpasztamobica
Copy link
Contributor Author

michalpasztamobica commented Dec 11, 2018

19:12:07 17:12:06.972 | D1 <-- DutThread: [01277][DBG ][ESPA]: returning WOULD_BLOCK
19:12:07 17:12:06.972 | D1 <-- DutThread: Enqueuing the event call[01278][DBG ][PAL ]: pal_plat_send status -3001
(source: https://jenkins-internal.mbed.com/job/ARMmbed/job/mbed-client-testapp/job/PR-1156/18/console)

Ok our software runs fine. It seems the flag setting order made a difference in the end...

@cmonr
Copy link
Contributor

cmonr commented Dec 11, 2018

CI started

@adbridge
Copy link
Contributor

@michalpasztamobica So is this working consistently for us for both older and latest versions of the driver ?

@cmonr cmonr removed the needs: work label Dec 11, 2018
@mbed-ci
Copy link

mbed-ci commented Dec 11, 2018

Test run: SUCCESS

Summary: 11 of 11 test jobs passed
Build number : 2
Build artifacts

@cmonr cmonr merged commit 0a832dd into ARMmbed:master Dec 11, 2018
@cmonr cmonr removed the needs: CI label Dec 11, 2018
@michalpasztamobica
Copy link
Contributor Author

@adbridge this fix is a part of the new mbed-os ESP8266 driver only. I did not apply this patch to the old mbed-os ESP8266 driver. Perhaps you meant the newer and older AT firmware version (1.3.0, 1.6.0, 1.7.0?) I think we should also agree on what "working" means.
Teemu pointed out a failing test. Surely, our software (mbed-os ESP8266 driver) was not returning the correct value in case the ESP chip was responding with a "busy..." message. Both the new and old version returned a generic DEVICE_ERROR instead of WOULD_BLOCK error. With this fix we return that correct value (NSAPI_ERROR_WOULD_BLOCK). Thanks to that, even if ESP chip runs into the busy state, the application can act accordingly (perhaps it wants to wait, or try to reset the chip and resend the message?).
Despite the fix in our software we can see the test is still failing. The ESP8266 chip is sometimes not recovering from the busy state. Teemu checked that using a newer 1.7.0 version of the AT firmware (not to be confused with mbed-os ESP8266 driver) let him pass the test.
I therefore suspect that the test revealed a weakness in the mbed-os (fixed with this commit) and a weakness in the ESP8266 chip itself. Seems that the app is working fine if it can pass the test with the newer firmware.
@SeppoTakalo , @teetak01, please correct me if I am wrong in any of the above statements :)

@michalpasztamobica michalpasztamobica deleted the esp8266_busy_signal branch December 12, 2018 07:13
@kjbracey
Copy link
Contributor

A couple of questions for future investigation:

  • How did reordering those 2 lines (abort, busy) make a difference? Are you sure they did? If so, implies some sort of undefined behaviour or compiler bug being triggered somehow - corruption/locking/uninitialised variables.
  • Is "the chip" really busy (transiently or stuck)? Or is "BUSY" just a "buffer full" flow control response on an attempt to send to a TCP socket? If the latter, then the problem is more localised, and the loop would not be justified at all. The retry loop is only justified if it's a potential random transient error so it's like doing a radio retry due to lack of ack.

@michalpasztamobica
Copy link
Contributor Author

@kjbracey-arm , you are right, after looking into the code I also do not see how reordering the two events made a difference. I assumed that there is some blocking operation awaiting a change of _abort variable, but it is not the case...

I extracted the hashes used to run the test.
This hash made pal_plat_send return status -3012
This hash made pal_plat_send return status -3001
The only difference is the reordering of the abort and the flag.

Addressing the second point.
From the scarce documentation I could find, it seems that latter is the case - "busy" rather means a "buffer full", when too much data was fed. This is the full quotation from the official AT firmware spec:

busy s... Busy sending. The system is sending data now, cannot accept the newly input.
busy p... Busy processing. The system is in process of handling the previous command,
cannot accept the newly input.

Then in the examples documentation I found this:

• If the number of bytes inputted are more than the size defined (n):

  • the system will reply busy, and send the first n bytes.
  • and after sending the first n bytes, the system will reply SEND OK.

Looking at our driver, we should never run into a situation where we input more bytes than defined with +CIPSEND. We could truncate it to match the 2048B buffer, but never exceed it.
So sounds like the documentation and examples are not giving the full picture of what "busy" means.
Especially, that we do not come across this issue in AT firmware 1.7.0....

@VeijoPesonen , please correct me if I am wrong.

@SeppoTakalo
Copy link
Contributor

What documentation says and what the device actually does are not the same thing.
We have seen sometimes devices using previous AT command syntax, so you need to do a bit of exploring to make ESP work.

Therefore I assume that “busy s” might be sometimes just a generic error code from device.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

9 participants