fault tolerance #9

Closed
SEAPUNK opened this issue Sep 29, 2016 · 3 comments

SEAPUNK commented Sep 29, 2016

Here are a few cases I need to figure out:


Scenario 1: The server is started with a persistent backend, runners and jobs are created, the server abruptly crashes, and then it restarts immediately afterward.

Scenario 1 questions:

  1. Do we recover/keep the jobs?
  2. Do we recover/keep the runners?

Scenario 2: The server is started, runners and jobs are added, the server abruptly crashes, and it does not get restarted.

Scenario 2 questions:

  1. How do the clients handle this?
  2. What do the runners do when they cannot connect to the server, and as a result, cannot send job responses?

SEAPUNK commented Sep 29, 2016

With the system described in the close reason for #7, I can answer this pretty easily:

Scenario 1:

  1. Yes. There are two timeouts: a job timeout (the maximum time a job can run, period) and a "client" timeout (the maximum time a client can be unresponsive before we consider it dead, whether by not querying the server in a timely fashion or, if we do something like a websocket transport, by not responding to the server's transport query in time). Those timeouts determine external job failure.
  2. Yes. We store the last time the runner contacted the server, and that time is used to calculate whether the runner timed out. In the case of a server failure, the runner timeout is effectively reset, because the check is (now - Math.max(lastRunnerMessage, serverCreated)) > timeout; see the sketch below.
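For illustration, here is a minimal sketch of those checks. The names (`lastRunnerMessage`, `serverCreated`, the timeout constants) and values are assumptions for this sketch, not the actual implementation:

```js
// Hypothetical timeout bookkeeping; all names and values are illustrative.
const JOB_TIMEOUT = 60 * 1000;    // max time a job may run, period
const CLIENT_TIMEOUT = 15 * 1000; // max time a client may stay unresponsive

// A job has externally failed if it exceeded its own timeout...
function jobTimedOut(job, now = Date.now()) {
  return now - job.startedAt > JOB_TIMEOUT;
}

// ...or if the client that submitted it has gone quiet for too long.
function clientTimedOut(client, now = Date.now()) {
  return now - client.lastSeen > CLIENT_TIMEOUT;
}

// Runner timeout, implicitly reset on server restart: measure silence from
// the later of the runner's last message and the server's creation time.
function runnerTimedOut(runner, serverCreated, timeout, now = Date.now()) {
  return now - Math.max(runner.lastRunnerMessage, serverCreated) > timeout;
}
```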

SEAPUNK commented Sep 29, 2016

Scenario 2:

  1. Clients just emit an "error" event on that handler every time a status check fails. Job timeout checking should be done by the server, to maintain consistency. The user can implement custom error-handling logic.
  2. Runners build an unbounded buffer of messages to send to the server, so a runner can run out of memory and crash if the buffer grows too large. (Runners should be separate processes anyway, since they can crash, etc.) A rough sketch follows below.
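A rough sketch of that runner-side buffering, assuming a hypothetical injected `sendFn` transport callback (nothing here is the actual API):

```js
// Hypothetical runner-side outbox; the transport is an injected async sendFn.
class Outbox {
  constructor(sendFn, retryDelayMs = 1000) {
    this.sendFn = sendFn;
    this.retryDelayMs = retryDelayMs;
    this.buffer = [];      // unbounded: grows until the server is reachable
    this.flushing = false; // guard so only one flush loop runs at a time
  }

  push(message) {
    this.buffer.push(message);
    this.flush();
  }

  async flush() {
    if (this.flushing) return;
    this.flushing = true;
    while (this.buffer.length > 0) {
      try {
        await this.sendFn(this.buffer[0]); // deliver oldest message first
        this.buffer.shift();               // drop it only after success
      } catch (err) {
        // Server unreachable: keep the buffer intact and retry later.
        // Memory use is unbounded, which is why runners should live in
        // separate processes the host can afford to lose.
        this.flushing = false;
        setTimeout(() => this.flush(), this.retryDelayMs);
        return;
      }
    }
    this.flushing = false;
  }
}
```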

SEAPUNK commented Sep 29, 2016

I think this answers most of the fault tolerance questions.
