Implement `/readyz` and `/livez` endpoints #3949

LesnyRumcajs · 2024-02-12T11:46:35Z

Issue summary

To make operations easier with Forest, there should be an endpoint that clearly defines whether Forest is healthy, alive and ready to serve RPC requests.

In the Kubernetes world (subjectively, a good source of best practices when it comes to managing services), there are the notions of liveness and readiness probes. There is also a `/healthz` endpoint, which is deprecated, but we could still provide it.

The goal is for the orchestrator, be it Kubernetes, Docker Compose or any custom contraption to know whether the service is up or down (and should be restarted) and ready to serve requests (for load-balancing purposes).

Task summary

Implement a `/livez` endpoint: it should return 200 if the node is live. The endpoint should accept a `?verbose` query parameter, which lists the checks that were made. Sample checks:
- Prometheus server is up,
- connected to at least 1 other peer,
- other tasks in the main loop have started.
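A minimal sketch of the `/livez` logic in Rust, assuming the node already knows whether the Prometheus server is up, how many peers it has, and whether the main-loop tasks have started. The names `Check`, `livez_checks` and `render` are hypothetical, not Forest's API. No HTTP framework is assumed: `render` only computes the status code and body. The verbose format is loosely modeled on the `[+]name ok` lines that Kubernetes prints for `?verbose`.

```rust
use std::fmt::Write;

/// Outcome of one named health check (hypothetical type, not Forest's API).
#[derive(Debug, Clone, PartialEq)]
pub struct Check {
    pub name: &'static str,
    pub passed: bool,
}

/// Sample liveness checks from the issue. The inputs are assumed to be
/// gathered elsewhere by the node; only the pass/fail logic is shown here.
pub fn livez_checks(prometheus_up: bool, peer_count: usize, tasks_started: bool) -> Vec<Check> {
    vec![
        Check { name: "prometheus_server_up", passed: prometheus_up },
        Check { name: "has_peers", passed: peer_count >= 1 },
        Check { name: "main_loop_tasks_started", passed: tasks_started },
    ]
}

/// Turns check results into an HTTP status code and a response body:
/// 200 if every check passed, 503 otherwise. With `verbose` (the `?verbose`
/// query parameter), every check is listed as `[+]name ok` or `[-]name failed`.
pub fn render(endpoint: &str, checks: &[Check], verbose: bool) -> (u16, String) {
    let all_passed = checks.iter().all(|c| c.passed);
    let mut body = String::new();
    if verbose {
        for c in checks {
            let (mark, verdict) = if c.passed { ("+", "ok") } else { ("-", "failed") };
            // Writing to a String cannot fail, so unwrap is safe here.
            writeln!(body, "[{mark}]{} {verdict}", c.name).unwrap();
        }
    }
    let summary = if all_passed { "passed" } else { "failed" };
    write!(body, "{endpoint} check {summary}").unwrap();
    (if all_passed { 200 } else { 503 }, body)
}
```

Returning 503 on failure is a choice made for this sketch; Kubernetes HTTP probes treat any status outside 200-399 as a failure, so any such code would work.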
Implement a `/readyz` endpoint: it should return 200 if the node is ready to serve RPC requests (note that the checks may differ for an offline node). The endpoint should accept a `?verbose` query parameter, which lists the checks that were made. Sample checks:
- RPC server is up and responding,
- (online node only) the node is not too far behind the expected time, e.g., `current_timestamp <= genesis_time + block_interval * epoch + epsilon`, where `epoch` is the node's head epoch and `epsilon` is an arbitrarily chosen grace period, for example, six block intervals. This may need to be disabled for devnets.
- (online node only) the node is in follow mode.
  ~~- [ ] Implement forest-cli node info node-ready|node-live subcommands to wrap the above.~~
Add these checks to some of the existing tests.
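The `/readyz` checks can be sketched the same way. The lag check below treats the node as caught up when `current_timestamp <= genesis_time + block_interval * epoch + epsilon`, with `epsilon` expressed as a number of block intervals; `None` disables it, e.g., for devnets. Online-only checks are skipped for an offline node. All names (`ReadinessInputs`, `is_caught_up`, `readyz_checks`, `is_ready`) are hypothetical, and how the node obtains these values is assumed.

```rust
/// Hypothetical readiness inputs; how the node collects them is assumed.
pub struct ReadinessInputs {
    pub rpc_responding: bool,
    pub offline: bool,
    pub follow_mode: bool,
    pub genesis_timestamp: u64, // Unix seconds
    pub block_delay_secs: u64,  // block interval in seconds
    pub head_epoch: u64,
    pub now: u64, // current Unix time in seconds
    /// Grace period in epochs; `None` disables the lag check (e.g., on devnets).
    pub tolerance_epochs: Option<u64>,
}

/// True if the head is at most `tolerance_epochs` block intervals behind `now`.
pub fn is_caught_up(genesis: u64, block_delay: u64, head_epoch: u64, now: u64, tolerance_epochs: u64) -> bool {
    // Timestamp at which the head epoch was expected to be produced.
    let head_time = genesis.saturating_add(block_delay.saturating_mul(head_epoch));
    // Grace period (epsilon) in seconds.
    let epsilon = block_delay.saturating_mul(tolerance_epochs);
    now <= head_time.saturating_add(epsilon)
}

/// Named readiness checks; online-only checks are skipped for an offline node.
pub fn readyz_checks(i: &ReadinessInputs) -> Vec<(&'static str, bool)> {
    let mut checks = vec![("rpc_server_up", i.rpc_responding)];
    if !i.offline {
        if let Some(tolerance) = i.tolerance_epochs {
            let caught_up = is_caught_up(i.genesis_timestamp, i.block_delay_secs, i.head_epoch, i.now, tolerance);
            checks.push(("epoch_up_to_date", caught_up));
        }
        checks.push(("follow_mode", i.follow_mode));
    }
    checks
}

/// The node is ready only if every applicable check passed.
pub fn is_ready(i: &ReadinessInputs) -> bool {
    readyz_checks(i).iter().all(|&(_, passed)| passed)
}
```

With illustrative numbers (genesis at 1,000,000 s, 30 s blocks, head epoch 100, six-epoch grace period), the head corresponds to 1,003,000 s and `epsilon` is 180 s, so the lag check passes at `now = 1_003_100` but fails at `now = 1_003_200`.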

Feel free to come up with more checks. They don't necessarily have to be implemented (at least not all of them); these may come as follow-up tasks.

Other information and links
https://kubernetes.io/docs/reference/using-api/health-checks/#api-endpoints-for-health


LesnyRumcajs added the RPC and Ready labels Feb 12, 2024 (Ready: issue is ready for work and anyone can freely assign it to themselves)

ruseinov assigned ruseinov and unassigned ruseinov Feb 12, 2024

LesnyRumcajs self-assigned this Mar 14, 2024

LesnyRumcajs mentioned this issue Apr 8, 2024

implement healthcheck endpoints #4156 (Merged, 4 tasks)

LesnyRumcajs closed this as completed in #4156 Apr 10, 2024
