fault tolerance #9

Closed
SEAPUNK opened this issue Sep 29, 2016 · 3 comments

SEAPUNK commented Sep 29, 2016

Here are a few cases I need to figure out:


Scenario 1: The server is started with a persistent backend, runners and jobs are created, the server abruptly crashes, and then it restarts immediately afterward.

Scenario 1 questions:

  1. Do we recover/keep the jobs?
  2. Do we recover/keep the runners?

Scenario 2: The server is started, runners and jobs are added, the server abruptly crashes, and it does not get restarted.

Scenario 2 questions:

  1. How do the clients handle this?
  2. What do the runners do when they cannot connect to the server, and as a result, cannot send job responses?

SEAPUNK commented Sep 29, 2016

With the system described in the close reason for #7, I can answer this pretty easily:

Scenario 1:

  1. Yes. There are two timeouts: a job timeout (the maximum time a job can run, period) and a "client" timeout (the maximum time a client can be unresponsive before we consider it dead, whether by not querying the server in a timely fashion or, if we do something like a websocket transport, by not responding to the server's transport query in time). Those timeouts determine external job failure.
  2. Yes. We store the last time the runner contacted the server, and that time is used to calculate whether the runner timed out. In the case of a server failure, the runner timeout is effectively reset, because the check is (now - Math.max(lastRunnerMessage, serverCreated)) > timeout; see the sketch below.
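For illustration, here is a minimal sketch of those checks. The names (`lastRunnerMessage`, `serverCreated`, the timeout constants) and values are assumptions for this sketch, not the actual implementation:

```js
// Hypothetical timeout bookkeeping; all names and values are illustrative.
const JOB_TIMEOUT = 60 * 1000;    // max time a job may run, period
const CLIENT_TIMEOUT = 15 * 1000; // max time a client may stay unresponsive

// A job has externally failed if it exceeded its own timeout...
function jobTimedOut(job, now = Date.now()) {
  return now - job.startedAt > JOB_TIMEOUT;
}

// ...or if the client that submitted it has gone quiet for too long.
function clientTimedOut(client, now = Date.now()) {
  return now - client.lastSeen > CLIENT_TIMEOUT;
}

// Runner timeout, implicitly reset on server restart: measure silence from
// the later of the runner's last message and the server's creation time.
function runnerTimedOut(runner, serverCreated, timeout, now = Date.now()) {
  return now - Math.max(runner.lastRunnerMessage, serverCreated) > timeout;
}
```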

SEAPUNK commented Sep 29, 2016

Scenario 2:

  1. Clients just emit an "error" event on that handler every time a status check fails. Job timeout checking should be done by the server, to maintain consistency. The user can implement custom error-handling logic.
  2. Runners build an unbounded buffer of messages to send to the server, so a runner can run out of memory and crash if the buffer grows too large. (Runners should be separate processes anyway, since they can crash, etc.) A rough sketch follows below.
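A rough sketch of that runner-side buffering, assuming a hypothetical injected `sendFn` transport callback (nothing here is the actual API):

```js
// Hypothetical runner-side outbox; the transport is an injected async sendFn.
class Outbox {
  constructor(sendFn, retryDelayMs = 1000) {
    this.sendFn = sendFn;
    this.retryDelayMs = retryDelayMs;
    this.buffer = [];      // unbounded: grows until the server is reachable
    this.flushing = false; // guard so only one flush loop runs at a time
  }

  push(message) {
    this.buffer.push(message);
    this.flush();
  }

  async flush() {
    if (this.flushing) return;
    this.flushing = true;
    while (this.buffer.length > 0) {
      try {
        await this.sendFn(this.buffer[0]); // deliver oldest message first
        this.buffer.shift();               // drop it only after success
      } catch (err) {
        // Server unreachable: keep the buffer intact and retry later.
        // Memory use is unbounded, which is why runners should live in
        // separate processes the host can afford to lose.
        this.flushing = false;
        setTimeout(() => this.flush(), this.retryDelayMs);
        return;
      }
    }
    this.flushing = false;
  }
}
```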

SEAPUNK commented Sep 29, 2016

I think this answers most of the fault tolerance questions.
