
feat: minimal health check #2092

Merged: 7 commits into PostgREST:main on Dec 23, 2021

Conversation

@steve-chavez (Member) commented Dec 17, 2021

Closes #1933.

When the admin-server-port config is set, it enables a <host>:<admin_server_port>/health endpoint that replies with 200 OK when postgrest is healthy and with 503 when it's not. In both cases, the response doesn't have a body.

Steps

  • Add another server on the admin port (done as a background thread with forkIO); see the sketch below.
  • When db-channel-enabled=True (the default), do the health check based on the LISTENer state to avoid sending extra queries.
  • When db-channel-enabled=False, fall back to doing the health check with a select 1 query.
  • Add io tests.
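
For illustration, here is a minimal sketch of that shape with warp/wai used directly; the module layout and names such as `isHealthy` are placeholders, not PostgREST's actual code:

```haskell
{-# LANGUAGE OverloadedStrings #-}

import           Control.Concurrent       (forkIO)
import qualified Network.HTTP.Types       as HTTP
import qualified Network.Wai              as Wai
import qualified Network.Wai.Handler.Warp as Warp

-- Hypothetical probe: the LISTENer state when db-channel-enabled=True,
-- or a SELECT 1 round trip when it's False.
isHealthy :: IO Bool
isHealthy = pure True

-- /health replies 200 when healthy, 503 when not; no body in either case.
adminApp :: Wai.Application
adminApp req respond =
  case Wai.pathInfo req of
    ["health"] -> do
      healthy <- isHealthy
      let status = if healthy then HTTP.status200 else HTTP.status503
      respond $ Wai.responseLBS status [] mempty
    _ -> respond $ Wai.responseLBS HTTP.status404 [] mempty

main :: IO ()
main = do
  -- the admin server runs on a background thread, next to the main app
  _ <- forkIO $ Warp.run 3001 adminApp
  -- stand-in for the main PostgREST app on port 3000
  Warp.run 3000 $ \_ respond ->
    respond $ Wai.responseLBS HTTP.status200 [] "main app"
```

In the actual PR the checks read PostgREST's shared AppState rather than a standalone `isHealthy` action.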

@steve-chavez (Member Author)

The admin server is implemented now. It's only activated when a port is specified in admin-server-port; otherwise it's disabled.

The health check is responding with a 200 status when postgrest is healthy and with 503 when the connection is down.

There's no body sent. Also, I'm not logging the health endpoint hits.

@steve-chavez (Member Author)

I thought we could consolidate both cases of db-channel-enabled = True/False by using the connectionWorker state: when it's working (recovering the connection) the health check gives 503, and when it's not working it gives 200.

However, the above is wrong because, as mentioned in automatic connection recovery, when db-channel-enabled = False the connectionWorker is only activated when a request arrives. So postgrest could be idle with a down connection and the health check would keep wrongly giving a 200 until a new request comes in.

So a SELECT 1 is better in that regard since it ensures a working connection, although it could give a healthy response while the schema cache is not up to date.
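
For reference, a rough sketch of that SELECT 1 fallback, assuming a hasql connection pool is at hand (`checkWithQuery` is an illustrative name, not the actual implementation):

```haskell
{-# LANGUAGE OverloadedStrings #-}

import qualified Hasql.Pool    as Pool
import qualified Hasql.Session as Session

-- Healthy only if a trivial query round trip through the pool succeeds.
checkWithQuery :: Pool.Pool -> IO Bool
checkWithQuery pool = do
  result <- Pool.use pool (Session.sql "SELECT 1")
  pure $ either (const False) (const True) result
```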

@steve-chavez (Member Author)

There are some codecov errors because currently our io tests don't test a down connection. I've added the manual tests here for now.

@steve-chavez marked this pull request as ready for review on December 22, 2021 03:02
@steve-chavez (Member Author) commented Dec 22, 2021

I've named the endpoint as /health based on this pattern https://microservices.io/patterns/observability/health-check-api.html

Also, I decided to not do any logging when hitting /health because that would flood the logs. Additionally, it seems useless to log them.

```diff
@@ -208,6 +209,7 @@ listener appState = do
 handleFinally dbChannel _ = do
   -- if the thread dies, we try to recover
   AppState.logWithZTime appState $ "Retrying listening for notifications on the " <> dbChannel <> " channel.."
+  AppState.putIsListenerOn appState False
```
@steve-chavez (Member Author):

The Codecov warning should be gone once #1766 (comment) is done.

@steve-chavez merged commit ac3655d into PostgREST:main on Dec 23, 2021
@wolfgangwalther (Member)

Although it could give a healthy response while the schema cache is not up to date.

That's not something we can guarantee anyway. So the healthcheck will always be only about proper connection status.

@wolfgangwalther (Member)

The current logic only tests the database connection, but does not really make sure whether the main port / socket is still open. Does it respond with an unhealthy status for #2042 for sure? How about other cases?

This was also mentioned in #1933 (comment):

Ensuring that the health check is actually tied to the health of the other service / database connection is key too. I have seen applications where the "main" service goes down, but the health check stays up / happy 🙈 😢

I think this is basically an issue with the two-ports approach. There will also be additional effort to implement unix sockets for that.

I still think the config option to set an endpoint for those types of checks on the main server would be much better, as it would test the connection both ways (http connection and database connection).

@steve-chavez (Member Author)

Does it respond with an unhealthy status for #2042 for sure?

Hm, I thought the max number of open files was defined per process(due to ulimit), so if the 3000 port reaches EMFILE then the 3001 port should reach it as well. I could be really wrong about this though, will have to test.

@wolfgangwalther (Member)

Does it respond with an unhealthy status for #2042 for sure?

Hm, I thought the max number of open files was defined per process(due to ulimit), so if the 3000 port reaches EMFILE then the 3001 port should reach it as well. I could be really wrong about this though, will have to test.

Yeah, you're probably right here. However, I meant to use this as an example for the more general problem of how to make sure that we are really accepting connections properly on the main port. This should be a core part of the health check - and in my opinion can only be done reliably by connecting to that port...

@wolfgangwalther (Member)

Although it could give a healthy response while the schema cache is not up to date.

That's not something we can guarantee anyway. So the healthcheck will always be only about proper connection status.

However, what we can and should do is: keep track of the status of the last schema cache refresh - and if that fails, e.g. because of something like #2024, we should return 503. Currently we are still returning 200 when the schema cache fails.

You can test this by adding the SQL code in #2102 (but not the fix in the query) to your fixtures - then run PGRST_ADMIN_SERVER_PORT=3001 postgrest-with-postgresql-14 postgrest-run. Then query the health endpoint and you will get 200 even though the postgrest process is going crazy because it's trying to refresh the schema cache over and over again. Same thing happens when adding the code later and reloading the schema cache just once - still returns with 200.

@wolfgangwalther (Member)

Does it respond with an unhealthy status for #2042 for sure?

Hm, I thought the max number of open files was defined per process(due to ulimit), so if the 3000 port reaches EMFILE then the 3001 port should reach it as well. I could be really wrong about this though, will have to test.

Yeah, you're probably right here. However, I meant to use this as an example for the more general problem of how to make sure that we are really accepting connections properly on the main port. This should be a core part of the health check - and in my opinion can only be done reliably by connecting to that port...

Here's a better example. Maybe a bit made up, but still valid: run postgrest on a unix socket, then remove the socket file manually. The health endpoint will still return "healthy" / 200 - even though nobody can connect to postgrest anymore. This is not healthy.

Overall, the health endpoint only really makes sense if it can reliably tell you whether postgrest is healthy or not. And in the current implementation this is not yet the case.

@steve-chavez (Member Author) commented Dec 28, 2021

Then query the health endpoint and you will get 200 even though the postgrest process is going crazy because it's trying to refresh the schema cache over and over again.

Aha, I actually saw a similar case of a failed schema cache load. A db had a crazy view (maybe autogenerated) that not even psql could show with \d+ <view> or \sv <view>; the postgrest schema cache would fail to load as well due to a statement timeout, but it kept its port open. I think the correct thing to do here would be to just panic and die, and then the health check will follow.

Run postgrest on a unix socket, then remove the socket file manually. The health endpoint will still return "healty" / 200 - even though nobody can connect to postgrest anymore.

Perhaps we can catch that error with setOnClose? Then update the health check state accordingly.

Edit: Here we could keep it simple and also die.

and in my opinion can only be done reliably by connecting to that port...

Maybe Warp functions can help us tie the main app port state to the health check port.


Disadvantages of doing the health check on the main app port:

  • The special header on root won't work with some monitoring tools that cannot send headers (prometheus IIRC).
  • The JWT validation comes before any request, so the health check would have to be done with a JWT, which is undesirable. Not doing this would require an escape hatch in the code, which I found really intrusive (I tried to do that on the first commits of this PR).
  • All requests on the main app are tied to the db pool, so when the connection was down the error came from the pool and not from the health check logic. Fixing this would also require an escape hatch on the codebase.

Because of those, I think it's better to keep the health check port separate from the main app port.

@wolfgangwalther WDYT?

@wolfgangwalther (Member) commented Dec 28, 2021

Aha, I actually saw a similar case of a failed schema cache load. A db had a crazy view(maybe autogenerated) that not even psql could show with \d+ <view> or \sv <view>, postgrest schema cache would fail to load as well due to a statement timeout but it kept its port open.

Ah, can you still get a hold of that view definition? If possible I'd like to see whether we can do anything to improve the view parsing to avoid that.

I think the correct thing to do here would be to just panic and die and then the health check will follow.

If we think that further, we will eventually end up dying on everything that would otherwise make the health check respond unhealthy - I don't think that makes too much sense. IMHO, we should only panic once we are in a state from which we can't recover anymore. A failed schema cache reload is not that. Once the view definition that causes the schema cache to fail is removed, we can safely reload the schema cache and resume operation.

Or, to put it the other way around: A panic + restart won't help solve the problem either. In most setups it will just lead to an endless restart-panic loop, too.

We should not retry the schema cache load in an endless loop, once it's failed. We should stick around in an unhealthy state, report that state on the health endpoint, report the failed schema cache reload in the log once and wait for a signal / notification for reload.
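
A tiny sketch of what keeping that state around could look like; `SchemaCacheStatus` is a hypothetical stand-in for the relevant piece of PostgREST's AppState:

```haskell
import Data.IORef (IORef, readIORef, writeIORef)

-- Hypothetical stand-in for the relevant piece of application state.
newtype SchemaCacheStatus = SchemaCacheStatus { loadedOk :: IORef Bool }

-- Record the outcome of the last schema cache (re)load once, instead of retrying in a loop.
recordReload :: SchemaCacheStatus -> Either err a -> IO ()
recordReload st = writeIORef (loadedOk st) . either (const False) (const True)

-- The health/readiness check can then report it alongside the connection status.
isReady :: SchemaCacheStatus -> IO Bool -> IO Bool
isReady st connectionUp = (&&) <$> readIORef (loadedOk st) <*> connectionUp
```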

Run postgrest on a unix socket, then remove the socket file manually. The health endpoint will still return "healty" / 200 - even though nobody can connect to postgrest anymore.

Perhaps we can catch that error with setOnClose? Then update the health check state accordingly.

Again, this was an example to point out the benefit of having the healthcheck on the main port. The idea here is, that whatever problem comes up with the main server, this will catch it. I don't think we should try to find out about every problem that can happen and fix those manually.

Edit: Here we could keep it simple and also die.

In this specific case it would probably make sense to die, because recovering from a lost unix socket is easily done by restarting. However, given that this case is so unlikely, I don't know whether we need to add code for that. Having the health endpoint not respond should be fine, as this will 99% of the time be something that somebody did once and does not need to be automated.

  • The special header on root won't work with some monitoring tools that cannot send headers(prometheus IIRC).

That's why the proposal was to use a config option to set up an endpoint, which is then used as the health endpoint - this can then work the same way as the health endpoint you have here, just on the main port.

  • The JWT validation comes before any request, so the health check would have to be done with a JWT, which is undesirable. Not doing this would require an escape hatch on the code, which I found really intrusive(tried to do that on the first commits of this PR).

  • All requests on the main app are tied to the db pool, so when the connection was down the error came from the pool and not from the health check logic. Fixing this would also require an escape hatch on the codebase.

I imagine both of those could be solved once we have JWT validation moved to a Middleware in #1988. We could then make the health endpoint just another Middleware that's loaded before anything else. This can then check if the request was made on the health endpoint and respond accordingly. If it's not, it can trigger the main app and all the other Middlewares, including the logger - this way you avoid logging health requests automatically, too.
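
A sketch of what such a Middleware could look like (the path and the `isReady` action are placeholders; this isn't tied to #1988's actual design):

```haskell
import           Data.Text          (Text)
import qualified Network.HTTP.Types as HTTP
import qualified Network.Wai        as Wai

-- Runs before JWT validation, logging and the rest of the pipeline:
-- answer health requests directly, pass everything else through.
healthMiddleware :: [Text] -> IO Bool -> Wai.Middleware
healthMiddleware healthPath isReady app req respond
  | Wai.pathInfo req == healthPath = do
      ok <- isReady
      let status = if ok then HTTP.status200 else HTTP.status503
      respond $ Wai.responseLBS status [] mempty
  | otherwise = app req respond
```

Wired as `healthMiddleware ["health"] check mainApp`, requests to the health path would never reach JWT validation or the logger.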

@steve-chavez (Member Author) commented Dec 28, 2021

We should stick around in an unhealthy state, report that state on the health endpoint, report the failed schema cache reload in the log once and wait for a signal / notification for reload.

Yeah, agree. I'll improve the health check for these cases.

That's why the proposal was to use a config option to set up an endpoint, which is then used as the health endpoint

A disadvantage if we go this route: when db-channel-enabled is false, every request to the health endpoint (which doesn't need to be authenticated) will do a select 1, and that could be used for a DoS attack.

Also, the admin port provides a path forward for doing #1526, which is also meant to be used in an internal network. It'd somehow feel off doing another config path for metrics.

Edit: Related SO question.

I imagine both of that could be solved once we have JWT validation moved to a Middleware in #1988. We could then make the health endpoint just another Middleware that's loaded before anything else

True, I forgot about that PR 🤦‍♂️; that would be a nice way to do it. That said, I still see more benefits in doing the health check on the admin port, and it's also less invasive to the main app logic.

@steve-chavez (Member Author)

I want to note that I'm not opposed to a health check on a custom endpoint; I think that could be an additional feature. It would save users who want to expose the health check externally from having to configure a proxy.

@wolfgangwalther (Member)

A disadvantage If we go this route. When db-channel-enabled is false, then every request(which doesn't need to be authenticated) to the health endpoint will do a select 1 and that could be used for DoS attack.

I think that would be a matter of blocking access to the /health endpoint at the proxy layer for public access. That would be something to do in such a setup anyway, I guess.

Also, the admin port provides a path forward for doing #1526, which is also meant to be used in an internal network. It'd somehow feel off doing another config path for metrics.

We could plan ahead for that and use a config option for an internal endpoint, of which /<internal>/health is a subpath. So that's just a design question, I think.

That being said, I still see more benefits in doing the health check with the admin port and it's also less invasive to the main app logic.

I think overall, most of the questions are just a matter of design and implementation and don't really have any effect on the feature or the user. However:

  • Using the admin port requires less setup at the proxy level to block public access.
  • Using the main port tests the http connection implicitly, at the same time.

So, in the end it's just a matter of "slightly easier to use" with the admin port or "slightly more likely to provide accurate health status" with the main port.

Since the health check is mostly useful in more complex setups, adding some additional proxy config seems to be easily doable. Personally, I'd always take the more reliable approach here.

@steve-chavez (Member Author) commented Dec 29, 2021

Personally, I'd always take the more reliable approach here.

Hm, I really tried to take the simplest approach here (the change needed for the feature is minimal) while maintaining a level of reliability we can improve further. I do assume that EMFILE is the only thing that can happen to Warp, given our historic issues (listed on #2042 (comment)) and also the various production instances I've been debugging. Thus, the health check should be "reliable" (assuming no unknown unknowns later on).

We should not retry the schema cache load in an endless loop, once it's failed. We should stick around in an unhealthy state, report that state on the health endpoint, report the failed schema cache reload in the log once and wait for a signal / notification for reload.

Yeah, I can reliably reproduce this with an alter role postgrest_test_authenticator set statement_timeout to 1;. I think something consistent we can do here is to retry the schema cache load with exponential backoff (same as connection retries; it caps at 32 seconds), so that we don't flood the logs or spend a lot of CPU.

Edit: Scratch that; it's messy to do a second worker for the schema cache load, so instead I'll follow your exact suggestion. I reached the same conclusion: that's the best behavior.

@steve-chavez (Member Author)

Or, to put it the other way around: A panic + restart won't help solve the problem either. In most setups it will just lead to an endless restart-panic loop, too.

True. I wonder if it's worth having a different response for a failed schema cache load. Similar to how Kubernetes has liveness and readiness probes: Liveness(/health) could be the healthy connection and readiness(/ready) would be the loaded schema cache.

@laurenceisla (Member)

Does it respond with an unhealthy status for #2042 for sure?

Hm, I thought the max number of open files was defined per process(due to ulimit), so if the 3000 port reaches EMFILE then the 3001 port should reach it as well. I could be really wrong about this though, will have to test.

Can confirm this behavior. I tested 1500 requests to a function that selects a pg_sleep(3600) to simulate the open connections. This causes PostgREST to throw this error:

Network.Socket.accept: resource exhausted (Too many open files)

A new curl localhost:3000 is left hanging with no response, while doing a curl -I localhost:3001/health responds with:

curl: (56) Recv failure: Connection reset by peer

Repeating the health checks will throw:

curl: (7) Failed to connect to localhost port 3001 after 0 ms: Connection refused

Meanwhile, new curl localhost:3000 requests still hang.

@wolfgangwalther (Member)

Or, to put it the other way around: A panic + restart won't help solve the problem either. In most setups it will just lead to an endless restart-panic loop, too.

True. I wonder if it's worth to have a different response for a failed schema cache load. Similar to how Kubernetes has liveness and readiness probes. Liveness(/health) could be the healthy connection and readiness(/ready) would be the loaded schema cache.

Hm. I think the article you cited is a pretty valuable read. I might draw a slightly different conclusion, depending on what you meant to say exactly with "Liveness(/health) could be the healthy connection".

Shooting Yourself in the Foot with Liveness Probes

Recall that a liveness-probe failure will result in the container being restarted. Unlike a readiness probe, it is not idiomatic to check dependencies in a liveness probe. A liveness probe should be used to check if the container itself has become unresponsive.

One problem with a liveness probe is that the probe may not actually verify the responsiveness of the service. For example, if the service hosts two web servers—one for the service routes and one for status routes, like readiness and liveness probes, or metrics collection—the service can be slow or unresponsive, while the liveness probe route returns just fine. To be effective, the liveness probe must exercise the service in a similar manner to dependent services.

This is basically the argument I made all along - with the difference that I did not separate liveness from readiness. The /health check you implemented here seems to be a readiness check in all respects:

  • It checks dependencies (database)
  • It checks the schema cache availability (readiness to serve requests)

I think we should stop using the term health and instead start using the terms live and ready. Also see the kubernetes docs:

The Kubernetes API server provides 3 API endpoints (healthz, livez and readyz) to indicate the current status of the API server. The healthz endpoint is deprecated (since Kubernetes v1.16), and you should use the more specific livez and readyz endpoints instead.

It seems the vague term of health had been a problem for others, too.


Overall, I think we should do the following, which should solve the whole discussion we had:

  • Keep the admin-port health check you have implemented here.
  • Rename the admin-port check to /ready and make it check database connection and schema cache as discussed.
  • Add a server-live-endpoint="live" config option (better name anyone?) to set up a liveness endpoint. This will just return 200 and do no checks otherwise on /live.

The /live endpoint can then be used for the liveness probe. It will not respond properly when EMFILE is reached - and the container will be restarted, solving that issue.

The :<admin-port>/ready check can be used for the readiness probe. It will make sure that PostgREST is actually ready to serve requests with a fresh schema cache present.

This avoids the whole problem of DDoS-ing the /ready probe on the public port, since /live is the cheapest request possible and everything else is internal.
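
A rough sketch of that split, reusing the WAI shapes from above; the `connectionUp`/`schemaCacheLoaded` actions and the config-driven live path are illustrative only:

```haskell
{-# LANGUAGE OverloadedStrings #-}

import           Data.Text          (Text)
import qualified Network.HTTP.Types as HTTP
import qualified Network.Wai        as Wai

-- :<admin-port>/ready: database connection and schema cache must both be good.
readyApp :: IO Bool -> IO Bool -> Wai.Application
readyApp connectionUp schemaCacheLoaded req respond =
  case Wai.pathInfo req of
    ["ready"] -> do
      ok <- (&&) <$> connectionUp <*> schemaCacheLoaded
      let status = if ok then HTTP.status200 else HTTP.status503
      respond $ Wai.responseLBS status [] mempty
    _ -> respond $ Wai.responseLBS HTTP.status404 [] mempty

-- /live on the main port (path taken from the proposed server-live-endpoint):
-- no checks at all, a bare 200 is proof that the main server still answers.
liveMiddleware :: [Text] -> Wai.Middleware
liveMiddleware livePath app req respond
  | Wai.pathInfo req == livePath =
      respond $ Wai.responseLBS HTTP.status200 [] mempty
  | otherwise = app req respond
```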

@steve-chavez (Member Author)

I think we should stop using the term health and instead start using the terms live and ready
Rename the admin-port check to /ready and make it check database connection and schema cache as discussed.

Fully agree!

Add a server-live-endpoint="live" config option (better name anyone?) to set up a liveness endpoint. This will just return 200 and do no checks otherwise on /live.

I see, so this one will not check for any internal state; it will only reply with 200 to check if postgrest is "alive" (or running).

What I don't see is the need for it to be only on the main app port - 3000. As Laurence confirmed above, if there's an EMFILE on port 3000, the 3001 port will also fail to respond.

The disadvantages of adding a special route on the main app (mentioned above) still apply; it's more complex for the codebase.

So, how about adding the /live path to the admin server? I see a few lines of code in comparison to the server-live-endpoint route.

Add a server-live-endpoint="live"

However, that could be done as a later enhancement as well; it would be useful for users that want an external live endpoint (without extra proxy config).

WDYT?

@wolfgangwalther (Member)

I see, so this one will not check for any internal state, it will only reply with 200 to check if postgrest is "alive"(or running).

Correct.

What I don't see is the need for it to be only on the main app port - 3000. As Laurence confirmed above, if there's an EMFILE on port 3000, the 3001 port will also fail to respond.

As I said earlier, this is not only about this one specific problem. I gave another example with deleting the unix socket. Mainly this is about those things that are still unknown to us. Those are the things that, if at all, we can only recover from with a restart. The article you cited makes the same point:

One problem with a liveness probe is that the probe may not actually verify the responsiveness of the service. For example, if the service hosts two web servers—one for the service routes and one for status routes, like readiness and liveness probes, or metrics collection—the service can be slow or unresponsive, while the liveness probe route returns just fine. To be effective, the liveness probe must exercise the service in a similar manner to dependent services.

I don't know what else to say, I am convinced by my own argument ;)

The disadvantages of adding an special route(mentioned above) on the main app still apply, it's more complex for the codebase.

Every feature adds complexity, no? Implementing the live check on the main port should really be simple: we should not even bother with checking the JWT at all, because we don't want to test the internal state / configuration that it depends on. So this will be as simple as a Middleware that just responds with 200 on a given port. This can't be much harder (actually much simpler...) than the Middleware we are using to serve the favicon right now? (The fact that the icon is dysfunctional right now is a different matter.)

So, how about adding the /live path to the admin server? I see a few lines of code in comparison to the server-live-endpoint route.

I see a few lines of code in both cases. And I see no additional benefit compared to the ready route, when done on the admin port.

@steve-chavez (Member Author)

Hm, I guess the OpenAPI output would have to consider the live endpoint as well.

To be effective, the liveness probe must exercise the service in a similar manner to dependent services.

True. I'd still prefer not polluting the API URL paths though; it seems a lot cleaner.

So I wonder if there's a way to check for the main app port internally on the admin server port. Perhaps by doing a request to a localhost:3000/unexistent/path/for/testing/purposes and waiting for a 404? Assuming a path like that won't hit the db or demand jwt validation. Maybe there's a cleaner way.. If there's not, then the server-live-endpoint should be the way to go.

@wolfgangwalther (Member)

Hm, I guess the OpenAPI output would have to consider the live endpoint as well.

I wouldn't consider the /live endpoint to be part of the API we're serving, so I wouldn't do that either.

To be effective, the liveness probe must exercise the service in a similar manner to dependent services.

True. I'd still prefer not polluting the API URL paths though; it seems a lot cleaner.

So I wonder if there's a way to check for the main app port internally on the admin server port. Perhaps by doing a request to a localhost:3000/unexistent/path/for/testing/purposes and waiting for a 404? Assuming a path like that won't hit the db or demand jwt validation. Maybe there's a cleaner way.. If there's not, then the server-live-endpoint should be the way to go.

Unfortunately, we can't know for sure whether a path is non-existent without relying on the schema cache or hitting the DB. But we can't do any of the two for liveness. So even if we did it that way, we'd need a reserved path on the main port for that purpose, that is known to return 404 in all cases. The only difference would be, that the 404 for that would be rewritten to a 200. But that would still pollute the API URL path namespace in a way. So it seems a lot cleaner to just add the config option, which allows people to opt-in to this pollution and to change the name of the endpoint to a non-breaking one for their individual setup.

@steve-chavez (Member Author) commented Dec 31, 2021

I wouldn't consider the /live endpoint to be part of the API we're serving, so I wouldn't do that either.

Yeah, agree.

Maybe there's a cleaner way..

Turns out we can connect to the main app socket with raw Network.Socket and check if it indeed connects. When no connection is possible (tried by removing the socket file), I get the same low-level Connection refused error mentioned above.

Got the hint from this SO question and got the sample code from the intro in https://hackage.haskell.org/package/network-2.6.3.6/docs/Network-Socket.html.

Will make a PR soon. Edit: Done #2109
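
For the record, a minimal sketch of that raw-socket check for a unix socket; `mainAppReachable` is an illustrative name and the real change lives in #2109:

```haskell
import           Control.Exception (IOException, bracket, try)
import qualified Network.Socket    as NS

-- True if something is accepting connections on the main app's unix socket.
mainAppReachable :: FilePath -> IO Bool
mainAppReachable socketPath = do
  result <- try $ bracket
    (NS.socket NS.AF_UNIX NS.Stream NS.defaultProtocol)
    NS.close
    (\sock -> NS.connect sock (NS.SockAddrUnix socketPath))
  pure $ case (result :: Either IOException ()) of
    Left _   -> False  -- e.g. the low-level error above when the socket file is gone
    Right () -> True
```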

@wolfgangwalther (Member)

Maybe there's a cleaner way..

Turns out we can connect to the main app socket with raw Network.Socket and check if it indeed connects. When no connection is possible(tried by removing the socket file), I get the same low level Connection refused error mentioned above.

I like it. I thought about just using a TCP liveness probe in Kubernetes, but that wouldn't work with the main app on a unix socket, I think. Your proposal works around that nicely.

Successfully merging this pull request may close these issues: Add a proper / minimal health check endpoint.