Redeploys cause downtime in production #24

Closed
Tracked by #55
0xOlias opened this issue Dec 29, 2022 · 0 comments · Fixed by #104

0xOlias commented Dec 29, 2022

When using a cloud platform that uses git deployments (like Render or Railway), a new deployment is created for each new commit. Once the new deployment is deemed healthy by the platform (by responding to the health check path with a 2XX), the platform starts serving traffic to the new deployment and shuts down the old one. Cloud platforms will normally wait between 5 and 15 minutes for a healthy response, after which they consider the deployment failed and shut it down.

Ponder currently sends a healthy response as soon as the GraphQL server starts up, which happens immediately on ponder start. This is problematic because the backfill (and/or the handler processing) will not be done yet, so the entity tables are not up to date but the service is saying it's healthy.
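Here's a minimal sketch of that current behavior, assuming an Express-style server (not Ponder's actual code; the route and port are placeholders):

```ts
// Sketch of the current (problematic) behavior: the health check returns 200
// as soon as the process is listening, even if the backfill is still running.
import express from "express";

const app = express();

app.get("/health", (_req, res) => {
  // Always "healthy" once the server is up; the backfill state is ignored.
  res.status(200).send("ok");
});

app.listen(42069); // port is arbitrary for this sketch
```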

New solution circa 2/20

The service should respond as unhealthy until either:

  1. The backfill + log handling is complete (the entity database is reloaded and ready to serve requests)
  2. 4 minutes have passed

Here are the scenarios that come to mind, and how they are handled (a sketch of this behavior follows the list):

  1. First deployment for a big app where the backfill takes an hour (sad but common at the moment). The service will respond as unhealthy for 4 minutes as the backfill is going. Then, the service will start responding as healthy on the health check path, but the GraphQL path will return 5xx until it's ready to serve requests.
  2. Redeployment for a normal app where the reload takes <4 minutes. The service will start responding as healthy after a short time, and a happy zero downtime deployment will occur.
  3. Redeployment for an app where the reload takes >4 minutes. This sucks, and apps like this will not be able to have zero downtime deployments. From Ponder's internal perspective, this looks just like scenario 1. If users ask, Ponder could introduce a config option that allows the GraphQL server to serve incomplete data. This could mitigate the pain of getting a 5xx, and might be acceptable for some apps.
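Here's a rough sketch of the behavior described above, assuming an Express-style server (this is not Ponder's actual implementation; the route names, port, and the markIndexingComplete hook are assumptions for illustration):

```ts
// Sketch of the proposed health check behavior: report unhealthy until either
// the backfill + log handling is complete or 4 minutes have passed.
import express from "express";

const GRACE_PERIOD_MS = 4 * 60 * 1000; // the ~4 minute cutoff from this issue
const startedAt = Date.now();
let isIndexingComplete = false;

// Hypothetical hook the indexer would call once the entity tables are up to date.
export function markIndexingComplete() {
  isIndexingComplete = true;
}

const app = express();

app.get("/health", (_req, res) => {
  const gracePeriodExpired = Date.now() - startedAt > GRACE_PERIOD_MS;
  if (isIndexingComplete || gracePeriodExpired) {
    // Healthy: either the data is ready, or we must start passing the platform's
    // health check so the deployment isn't marked as failed (scenario 1).
    res.status(200).send("ok");
  } else {
    // Still reloading (scenario 2): hold the deploy until the data is ready.
    res.status(503).send("not ready");
  }
});

app.use("/graphql", (_req, res, next) => {
  if (!isIndexingComplete) {
    // The health check may already be passing, but the data isn't ready yet,
    // so the GraphQL path returns a 5xx (scenario 1).
    res.status(503).send("Backfill in progress");
    return;
  }
  next();
});

app.listen(42069); // port is a placeholder
```

The key detail is that the health check and the GraphQL path are decoupled: after 4 minutes the health check passes so the platform doesn't kill the deployment, while the GraphQL path keeps returning 5xx until the data is actually ready.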

Original solution circa 12/29

Solution (for now):

  • If the backfill + indexing will take >4 minutes (not sure what heuristic to use for this yet), enter a "backfilling" state, where the service responds as healthy, but requests to /graphql respond with an error message explaining that the backfill is in progress.
  • If the backfill + indexing will take <4 minutes, enter a "reloading" state, where the service does not respond as healthy until the backfill + indexing are complete. This should enable zero downtime deploys for deployments after the first one.

The reason to use ~4 minutes as the cutoff is to play nice with the common cloud platform behavior of giving new deployments ~5 minutes to become healthy.
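As a rough sketch of that original two-state idea (the estimation heuristic is left open above, so the estimateBackfillMs input below is purely hypothetical):

```ts
// Sketch of the original 12/29 proposal: pick a serving mode based on an
// estimated backfill + indexing duration. The estimate itself is hypothetical;
// no heuristic has been chosen yet.
type ServingMode = "backfilling" | "reloading";

const CUTOFF_MS = 4 * 60 * 1000; // ~4 minutes, to fit inside the ~5 minute platform window

function chooseServingMode(estimateBackfillMs: number): ServingMode {
  if (estimateBackfillMs > CUTOFF_MS) {
    // Long backfill: report healthy right away, but /graphql responds with an
    // error explaining that the backfill is in progress.
    return "backfilling";
  }
  // Short reload: hold the health check until indexing completes, which is what
  // enables zero downtime deploys after the first one.
  return "reloading";
}
```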

Drawbacks with this approach:

  • I'm not sure what heuristic to use to estimate the backfill + indexing time.
  • If a user makes a change to an existing service that adds a large backfilling load (such as adding a new contract to ponder.config.js), the service will enter the "backfilling" state and the production server will stop responding to requests.