Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Broken pipe (close_after=true) or endless retry loop (close_after=false) when exchanging classical messages #58

Open
brunorijsman opened this issue Dec 6, 2019 · 2 comments
Assignees

Comments

@brunorijsman
Copy link

There is a race condition in the way classical channels are handled in Simulaqron that causes BrokenPipe errors.

It happens when node A sends two classical messages to node B in succession.

So, in node A we have:
A1) sendClassical("B", "msg1")
A2) sendClassical("B", "msg2")

And in node B we have:
B1) msg1 = recvClassical()
B2) msg2 = recvClassical()

We have close_after set to the default value False in all calls.

The following sequence of events takes place:

In step A1, node A tries to connect to node B. This fails because node A has not yet created a listening socket. A keeps retrying.

In step B1, node B creates a listening socket "s" and waits for incoming connections. An incoming connection from A does indeed arrive, and node B accepts it which creates a connection socket "conn" (in addition to the listening socket).

In step A1, node A finally connects successfully and sends msg1.

In step A1, node A closes the connection to B.

In step A2, node A tries to connect to node B to send the second message msg2. This succeeds, because node B is still in step B1 and has not yet closed the listening socket for the first message.

In step B1, node B, receives msg1 from node A. After receiving msg1, node B closes connection socket conn (in function closeClassicalServer) but the listen socket is not explicitly closed. The listening socket which is still open still has a pending incoming connection from event 5 above.

In step B2, calls startClassicalServer again for the second incoming message. This creates a new listening socket "s" for the same port. This evidently has the side effect of causing the first listening socket which was never explicitly closed to be closed implicitly. Whatever, at this point node B gets a BrokenPipe error because the pending connection failed to complete.

Actually, the listening socket is probably closed when "s" goes out of scope when startClassicalServer exits the function. But still, the net result is the same. By this time, node A has already started sending msg2 and has already connected to this listening socket before it was closed. Thus, node A will get the BrokenPipe when the listening socket (which was intended for msg1 but which accidentally used for msg2 as well) gets closed.

Simulaqron PythonLib should not be creating new listening sockets for every received message. There should only be one listening socket, and it should be used for the lifetime of the node.

Note: if I set close_after to False in sendClassical and recvClassical, we run into another bug. We get stuck in an endless loop "App Alice: Could not open classical channel to Eve, trying again.." This is because PythonLib cannot handle having two simultaneous TCP connections to two different neighbor nodes (e.g. Eve having two connections, one to Alice and one to Bob) if close_after is False.

In general, the handling of sockets for classical messages is in need of some cleaning up.

PS: All this opening and closing of sockets probably makes the protocols run quite slowly. We should make close_after=False the default and fix the code so that it can handle multiple simultaneous connections to multiple neighboring nodes.

@AckslD If you would like me to give shot at fixing this myself, feel free to assign the issue to me. Or assign it to yourself if you prefer handling it yourself. If the former, it appears that there is no unit test library for pythonLib... is that correct?

@AckslD
Copy link
Member

AckslD commented Dec 11, 2019

Hi @brunorijsman! Sorry for the late reply! I am quite unhappy with the built-in classical communcation in general. It's slow and not robust and something I quickly did before the first SimulaQron competition. I definitely agree that it needs cleaning up. If you have any thoughts or suggestions on how to do it I'm very happy to hear it and I'm also very happy if you have the time to start it :)

Regarding the tests, I have recently started added a test framework for the pythonLib since this was all left in SimulaQron after the split. It's however quite small at the moment but you can find the existing test here https://github.com/SoftwareQuTech/CQC-Python/tree/master/tests

@brunorijsman
Copy link
Author

I will take a shot at fixing the issues, cleaning up the code, and getting to better performance in the process. Expect pull request to review, hopefully within the next week or two.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants