Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Serial not concurrency safe? #1157

Closed
tomas-mbit opened this issue Aug 31, 2018 · 27 comments
Closed

Serial not concurrency safe? #1157

tomas-mbit opened this issue Aug 31, 2018 · 27 comments

Comments

@tomas-mbit
Copy link

Describe the bug

Mulitiple forever loops writing to serial produces erroneous output.

To Reproduce
Run the simple program at https://makecode.microbit.org/_WEga5JYiX1x7

Three forever loops run in parallell. The body of each loop repeats three times the sequence of pausing one second and writing to serial. Each time a different string is written - loop 1 writes SL1a, SL1b, SL1c, loop 2 writes SL2a, SL2b... and so on. Output on Teraterm is incorrectly SL1a, SL2b SL3c, SL1a, SL2b, SL3c, SL1a...

Picture of block program:
serial output - wrong execution

Producing this incorrect output:
bild

Expected behavior
For each loop, all the text should be output, and in the correct order (though it could be interleaved with the output from the other loops). E.g. SL1a, SL2a, SL3a, SL1b, SL2b, SL3b.... and so on.

When a "gate" function is added to make sure that serial is entered only by one forever loop at a time, correct output is produced, e.g. this program: https://makecode.microbit.org/_hDo7Ea59THbY
Writing to the LED matrix instead of to serial also works corretly (without any "gate")

Picture of program with "gate":
thisworks

Producing this correct output:
bild

Could this be an error in restoring an execution pointer in a context switch? Or an overwrite problem in some serial buffer?
The problems occurred in a rather complex program that I was debugging. The more debugging output I put into the program, the more confused I got and it took a lot of time. Seems rather unfortunate that the tool used for debugging introduces bugs itself.
After suspecting that the strange behavious actually wasn't my fault, I was eventually able to reproduce the problem in the simple programs above.

--Tomas

Desktop :

  • OS: Windows 7
  • Browser Firefox
  • Version 61.0.2 (64-bitars)
@guillaumejenkins
Copy link
Contributor

guillaumejenkins commented Aug 31, 2018

I think this is not a "Forever" concurrency issue.

See the following program - https://makecode.microbit.org/_71UgTKfrw4jh
image

The first 3 Forever loops increment or decrement the shared variable, and the last one prints it.

The first "iteration" adds a net +1 to the variable (-1 + 1 +1 = 1), the second "iteration" adds a net +2 (2 - 2 +2 = 2), the third, +3 (3 + 3 - 3 = 3), and the fourth resets the variable to 0.

On both the simulator and the device, I see the sequence 0 -> 1 -> 3 -> 6 -> 0, which means all 3 Forever loops execute all of their instructions in order (otherwise I would see a different sequence of numbers).

This means that the problem is necessarily with USB reading, or with the serial module.

@tomas-mbit
Copy link
Author

Yes, it appears that the statements in the loops are indeed executed.
I added counters to my test program

https://makecode.microbit.org/_b6vi4diPYbqy

bild

concurrencytest4 output

and they are apparently correctly incremented, even though the corresponding output to serial does not always show up on the PC. Maybe there is some buffer overwrite problem somewhere?

However, if i take the statements of the three serial forever loops and merge them into a single loop in the same order that they (presumably) are executed by the microbit when in the separate concurrent loops, the problem disappears.

https://makecode.microbit.org/_DuHaeLhDbKxm

bild

output of concurrencytest5

So it seems to me that concurrency/context switch possibly plays at least some part in this even though it might not be at fault.

--Tomas

@martinwork
Copy link
Contributor

In the DAL, the serial transmit functions can be used by only one fibre at a time. So I think the first forever loop sleeps inside write line until the transmission completes, with a lock set on the transmit functions. When the second loop calls transmit, the lock causes write line to fail. Which loop gets to transmit at each stage is a matter of timing.

@tomas-mbit
Copy link
Author

Assuming that the serial package uses the default mode of SYNC_SLEEP when calling e.g. "Send" in the DAL, then a forever loop calling serial.WriteLine while it is busy will simply not have the statement executed. The DAL returns the error code MICROBIT_SERIAL_IN_USE instead of sending anything.

The desired behaviour really should be that the fiber instead sleeps, the serial send operation is carried out when serial is no longer busy and then the fiber is woken and resumes execution. This would ensure correct execution of all calls to serial. I suppose that the most efficient implementation would be inside the DAL, but it could also be made in the mapping from javascript to DAL services. (I'm speculating a bit here, as I know very little aboout the internals).

If this cannot be fixed, i think it is important to document the concurrency-unsafe behavior of serial. When serial is used for debugging, a text not printed in the terminal window will be taken as evidence that that part of the program for some reason was not executed. The coder will then (as I did) scratch her head, understand nothing and insert further debugging printouts in the code to shed light on what's going on. This may lead to even more confusion as possibly other debug printouts will now be omitted.
A warning in the documentation about serial-debugging concurrent programs would save a lot of frustration. And if one is aware of the problem, it is possible to work around it by coding your own lock/wait functionality around the calls to serial (similar to what I did in the second example).

--Tomas

@pelikhan
Copy link
Member

pelikhan commented Sep 13, 2018

@pelikhan
Copy link
Member

@finneyj is visiting our offices next week. I'll chat about this with him.

@tomas-mbit
Copy link
Author

>@tomas-mbit could you take screenshots in English? It helps finding which block you are using

I didn't know how to do this, but soon found out. Here they are.
Original program:
bild

Version with counters:
bild

I suppose that if you use the program links that I provided, it will come up at your end in whatever language is in your settings?

--Tomas

@philipphenkel
Copy link
Contributor

SYNC_SLEEP is great for reading operations.

SYNC_SPINWAIT would solve the concurrency issue for write operations. From a students perspective, it seems more important to deliver reliably vs freeing up the CPU while transferring. Of course, I'm assuming that usually only small amounts of data are transferred.

Thus, a change of the default parameters for write operations might already help a lot.

@finneyj
Copy link

finneyj commented Oct 3, 2018

@philipphenkel

thanks for reporting. For the nrf51 on micro:bit I think this makes a lot of sense - it has no DMA support and the transmit buffer is very small (just a handful of bytes). We should also fix the bug though - I guess we missed a mutex lock in there!

@philipphenkel
Copy link
Contributor

philipphenkel commented Oct 3, 2018

@finneyj Should I prepare a PR for the microbit-dal that sets the default parameter for write to SYNC_SPINWAIT? Or should I update serial.cpp to use SYNC_SPINWAIT?
I would recommend doing the later because that's more explicit.

@philipphenkel
Copy link
Contributor

@finneyj @tomas-mbit
I did some tests and with the change of serial.cpp I get the expected results:

SL2a
SL3a
SL1a
SL2b
SL3b
SL1b
SL2c
SL3c
SL1c

SL2a
SL3a
SL1a
SL2b
SL3b
SL1b
SL2c
SL3c
SL1c

SL2a
SL3a
SL1a
SL2b
SL3b
SL1b
SL2c
SL3c
SL1c

philipphenkel added a commit to philipphenkel/pxt-microbit that referenced this issue Oct 3, 2018
@finneyj
Copy link

finneyj commented Oct 3, 2018

Thanks @tomas-mbit

yes, please pop that in as a PR for now, which keeps things nice and safe until we get chance to drop in a mutex and test it through.

@philipphenkel
Copy link
Contributor

@finneyj Done :-)

@finneyj
Copy link

finneyj commented Oct 3, 2018

thanks!

@tomas-mbit
Copy link
Author

@philipphenkel @finneyj Thanks for fixing this. I hope that my initial attempts at testing and analysis were of some help.
Forgive my ignorance, but how/when does a fix like this one make its way into the actual "production environment"?
--Tomas

@pelikhan
Copy link
Member

pelikhan commented Oct 4, 2018

Right now we are waiting on @finneyj to finish review this change. After that, the changes should make it to /beta within a day.

@philipphenkel
Copy link
Contributor

@tomas-mbit Without your report and the analysis, I would have never connected it to the issue I struggled with. Also, your test program made it pretty convenient to verify the potential fix. Super helpful!

@philipphenkel
Copy link
Contributor

@finneyj Do you approve philipphenkel@b72ad9a for merge?

@philipphenkel
Copy link
Contributor

@pelikhan I would love to see this fixed. Let me know if I should re-create my PR #1339

@pelikhan
Copy link
Member

pelikhan commented Jun 7, 2019

@finneyj did you review philipphenkel@b72ad9a ?

@finneyj
Copy link

finneyj commented Jun 7, 2019

@pelikhan Apologies.

Just read it over. As a stop gap, updating the send() functions to be synchronous seems fine. I'll look into the concurrency issue next time I release microbit-dal, then we can revert back to the more efficient implementation. IMHO the calling fiber should block until the resource is ready.

I'm not sure we want to make the receive / receiveUntil busy waiting too though? That could lock out the the CPU for an arbitrary period of time. Do we know there's a similar concurrency issue on the receive side?

@pelikhan
Copy link
Member

pelikhan commented Jun 7, 2019

This needs a thorough investigation before going live. I am leaning on adding an optional argument to the serial APIs call that allow to specify the mode.

@finneyj
Copy link

finneyj commented Jun 7, 2019

Actually, I just misread the code - my bad. The read calls are SYNC_SLEEP, not SYNC_SPIN_WAIT. That's the existing default behaviour, so for receive() calls we're just being more explicit about what we already do.

send() calls should be OK with SPIN_WAIT as that is a much more bounded use case (the fiber can only hog the CPU until its buffer runs out...)

@philipphenkel
Copy link
Contributor

I intended to make the code more explicit so that it is easier to understand. The semantic change should be with send only.

@abchatra
Copy link
Contributor

This is on the DAL side and not makecode. Correct me if I am wrong.

@finneyj
Copy link

finneyj commented Mar 20, 2020

@abchatra

Yes - DAL side.

@philipphenkel
Copy link
Contributor

The issue does still exist.

I tested with
makecode.microbit.org version: 3.0.67
Microsoft MakeCode version: 6.0.39
microbit runtime version: v2.2.0-rc6

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
8 participants