Redeploys cause downtime in production #24

Closed
Tracked by #55
0xOlias opened this issue Dec 29, 2022 · 0 comments · Fixed by #104

0xOlias commented Dec 29, 2022

When using a cloud platform that uses git deployments (like Render or Railway), a new deployment is created for each new commit. Once the new deployment is deemed healthy by the platform (by responding to the health check path with a 2XX), the platform starts serving traffic to the new deployment and shuts down the old one. Cloud platforms will normally wait between 5 and 15 minutes for a healthy response, after which they consider the deployment failed and shut it down.

Ponder currently sends a healthy response as soon as the GraphQL server starts up, which happens immediately on ponder start. This is problematic because the backfill (and/or the handler processing) will not be done yet, so the entity tables are not up to date but the service is saying it's healthy.
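Here's a minimal sketch of that current behavior, assuming an Express-style server (not Ponder's actual code; the route and port are placeholders):

```ts
// Sketch of the current (problematic) behavior: the health check returns 200
// as soon as the process is listening, even if the backfill is still running.
import express from "express";

const app = express();

app.get("/health", (_req, res) => {
  // Always "healthy" once the server is up; the backfill state is ignored.
  res.status(200).send("ok");
});

app.listen(42069); // port is arbitrary for this sketch
```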

New solution circa 2/20

The service should respond as unhealthy until either:

  1. The backfill + log handling is complete (the entity database is reloaded and ready to serve requests)
  2. 4 minutes have passed

Here are the scenarios that come to mind, and how they are handled (a sketch of this behavior follows the list):

  1. First deployment for a big app where the backfill takes an hour (sad but common at the moment). The service will respond as unhealthy for 4 minutes as the backfill is going. Then, the service will start responding as healthy on the health check path, but the GraphQL path will return 5xx until it's ready to serve requests.
  2. Redeployment for a normal app where the reload takes <4 minutes. The service will start responding as healthy after a short time, and a happy zero downtime deployment will occur.
  3. Redeployment for an app where the reload takes >4 minutes. This sucks, and apps like this will not be able to have zero downtime deployments. From Ponder's internal perspective, this looks just like scenario 1. If users ask, Ponder could introduce a config option that allows the GraphQL server to serve incomplete data. This could mitigate the pain of getting a 5xx, and might be acceptable for some apps.
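Here's a rough sketch of the behavior described above, assuming an Express-style server (this is not Ponder's actual implementation; the route names, port, and the markIndexingComplete hook are assumptions for illustration):

```ts
// Sketch of the proposed health check behavior: report unhealthy until either
// the backfill + log handling is complete or 4 minutes have passed.
import express from "express";

const GRACE_PERIOD_MS = 4 * 60 * 1000; // the ~4 minute cutoff from this issue
const startedAt = Date.now();
let isIndexingComplete = false;

// Hypothetical hook the indexer would call once the entity tables are up to date.
export function markIndexingComplete() {
  isIndexingComplete = true;
}

const app = express();

app.get("/health", (_req, res) => {
  const gracePeriodExpired = Date.now() - startedAt > GRACE_PERIOD_MS;
  if (isIndexingComplete || gracePeriodExpired) {
    // Healthy: either the data is ready, or we must start passing the platform's
    // health check so the deployment isn't marked as failed (scenario 1).
    res.status(200).send("ok");
  } else {
    // Still reloading (scenario 2): hold the deploy until the data is ready.
    res.status(503).send("not ready");
  }
});

app.use("/graphql", (_req, res, next) => {
  if (!isIndexingComplete) {
    // The health check may already be passing, but the data isn't ready yet,
    // so the GraphQL path returns a 5xx (scenario 1).
    res.status(503).send("Backfill in progress");
    return;
  }
  next();
});

app.listen(42069); // port is a placeholder
```

The key detail is that the health check and the GraphQL path are decoupled: after 4 minutes the health check passes so the platform doesn't kill the deployment, while the GraphQL path keeps returning 5xx until the data is actually ready.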

Original solution circa 12/29

Solution (for now):

  • If the backfill + indexing will take >4 minutes (not sure what heuristic to use for this yet), enter a "backfilling" state, where the service responds as healthy, but requests to /graphql respond with an error message explaining that the backfill is in progress.
  • If the backfill + indexing will take <4 minutes, enter a "reloading" state, where the service does not respond as healthy until the backfill + indexing are complete. This should enable zero downtime deploys for deployments after the first one.

The reason to use ~4 minutes as the cutoff is to play nice with the common cloud platform behavior of giving new deployments ~5 minutes to become healthy.
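As a rough sketch of that original two-state idea (the estimation heuristic is left open above, so the estimateBackfillMs input below is purely hypothetical):

```ts
// Sketch of the original 12/29 proposal: pick a serving mode based on an
// estimated backfill + indexing duration. The estimate itself is hypothetical;
// no heuristic has been chosen yet.
type ServingMode = "backfilling" | "reloading";

const CUTOFF_MS = 4 * 60 * 1000; // ~4 minutes, to fit inside the ~5 minute platform window

function chooseServingMode(estimateBackfillMs: number): ServingMode {
  if (estimateBackfillMs > CUTOFF_MS) {
    // Long backfill: report healthy right away, but /graphql responds with an
    // error explaining that the backfill is in progress.
    return "backfilling";
  }
  // Short reload: hold the health check until indexing completes, which is what
  // enables zero downtime deploys after the first one.
  return "reloading";
}
```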

Drawbacks with this approach:

  • I'm not sure what heuristic to use to estimate the backfill + indexing time.
  • If a user makes a change to an existing service that adds a large backfilling load (such as adding a new contract to ponder.config.js), the service will enter the "backfilling" state and the production server will stop responding to requests.