-
Notifications
You must be signed in to change notification settings - Fork 8
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Heartbeats between master and worker processes. #18
Comments
@JeffBezanson This seems like a post 0.2 thing. Thoughts? Let's keep thisPR around. |
This is definitely not in scope for 0.2 |
This is not yet a PR. Will convert it into a PR if folks are OK with the general design. |
Hey @amitmurthy , I'm thinking of trying to make a PR for heartbeat functionality based on the new message-passing idiom. Would that be stepping on your toes? Do you have any additional thoughts about heartbeat support since this issue was created? |
Perish the thought. No issues whatsoever. I would initially focus just on the distributed "progress meter". Removing failed from the cluster currently happens when the remote socket connection is closed. This workers pretty well in most cases - the exception being when different transports are used (which is not very widespread). Restarting workers is not straightforward since we have to recreate worker global state if any. Reconnecting in the event of broken connections (due to serialization/deserialization) errors can and should be implemented. This would require
Note : I am currently traveling and will not have reliable internet over the next few days - my responses may be delayed. |
OK, that makes sense. But wouldn't it to be simpler to just not close the connection in the first place in the event of a serialization error? If a worker can't deserialize a message from pid1 (ie, I guess I'm not really understanding why exceptions caused by evaluation of the thunk the worker has been told to evaluate should be treated so differently than exceptions caused by deserializing pid1's message. (Understand you're traveling and am not expecting an immediate response) |
Oh, is it because the state of the stream is then in a corrupted state so serializing/deserializing messages over it is no longer possible? In that case, one unifying solution might be to have a second stream to each worker, an out-of-band "status" stream. It could both be used for heartbeats and for indicating that the main message stream is no longer usable because of a serialization error and has to be reset on both sides. |
|
Ya, I see that out-of-band would have to be on a separate thread. IJulia has example multithreaded heartbeat code (https://github.com/JuliaLang/IJulia.jl/blob/3a1ad9eac8c259c79a6a1c09342eac262b16f34b/src/heartbeat.jl). It might require a new C function somewhere in I see the complexity of dealing with a deserialization error. If we're throwing out possible schemes, we can take inspiration from http multipart encoding -
The receiver first reads the token, then deserializes the message. If it encounters an error during deserialization, it just reads the bytes from the stream until it finds the boundary token so it can start processing the next message. |
If you're interested in a way to frame messages in a stream I've always been a fan of COBS, though my applications have usually been wire protocols on embedded systems where the messages are pretty short (~100 bytes or less). It's nice because you're guaranteed the frame delimiter won't show up in the data, so if you get into a confused state you know that the next delimiter you see is a new frame. In the past I've used protocols where the framing byte just gets escaped when it appears in the stream, but that can cause big overheads if the frame byte shows up a lot in the stream. Receiving COBS can be done on-the-fly without keeping a buffer. Encoding could have a stream interface but would require a buffer of 254 bytes. I'm slammed for the next couple days but after that if you're interested I could write a COBS stream wrapper type that acts like a stream with an additional |
That looks cool, but why not just use a random 10 bytes as the frame On Tue, Oct 27, 2015 at 7:46 PM Spencer Russell notifications@github.com
|
Proposed implementation here for reference: On Tue, Oct 27, 2015 at 7:55 PM Jonathan Malmaud malmaud@gmail.com wrote:
|
It also doesn't seem like it would work here because it needs to know the On Tue, Oct 27, 2015 at 7:56 PM Jonathan Malmaud malmaud@gmail.com wrote:
|
I'll preface this by saying that I'm not entirely convinced either that this is a better way to go, and it's possible that my ignorance of the larger context is getting in the way. Here are a couple of things that concern me with the random delimiter scheme though:
Re: needing to know the number of frame bytes: that's why you'd need the 254-byte buffer on the writer side. You only need to lookahead for a Anyways, mostly I just wanted to throw it out there as an alternative, not advocate to change course, so if these points aren't convincing I think the random delimiter is also a good way to go. |
reading my comment, "code byte" is probably a better term than "frame byte". The frame delimiter is a literal |
Oh ya, I see. Do you mind if I implement it before you? Now I'm excited about it. |
I wouldn't be able to get to it until the beginning of next week, so go for it! |
That's a clever way to do this. |
Alright, JuliaLang/julia#13795 can now recover gracefully from deserialize failures by keeping a log of ACKed messages and using frame delimiters to reset the message stream to a working state. |
Why do we need both? Won't just using frame delimiters be enough? As for the possibility of a (extremely unlikely) collision, it is to be noted that it exists only in the failure case, i.e., upon a deserialization error. Chances of errors during serialization have reduced after JuliaLang/julia@77b2527 |
Well, you want a way for the worker to signal to the client that a Plus I'm hoping it might lay some groundwork for other applications that On Wed, Oct 28, 2015 at 11:53 PM Amit Murthy notifications@github.com
|
It may be better to split the message into a "header" and "body" and deserialize in two steps. First we deserialize the header (which should never fail - if it does fail, treat it as a fatal error) and then the function/args. Errors while deserializing function and args are treated the same as errors while executing |
Makes sense. If the header is going to be message-agnostic, there'd need to Otherwise each message type could have its own corresponding header type, On Thu, Oct 29, 2015 at 12:14 AM Amit Murthy notifications@github.com
|
Yes. The header needs to contain just the |
Alright, JuliaLang/julia#13795 implements the header mechanism. |
Heartbeats between master and workers have been mentioned before. Seeing JuliaLang/IJulia.jl#8 triggered it again.
Suggesting a design :
accept_handler
inmulti.jl
waits for connection requests from peer processes.create_message_handler_loop
as it currently doesshow_workers()
can pretty print this information.The whole heartbeat mechanism can be switched on via a command line argument to the julia executable (default is off)
It seems workable enough and we don't need the admin thread to listen on a different port. Do let me know what you guys think.
The text was updated successfully, but these errors were encountered: