Skip to content

Healthcheck service #24817

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 26 commits into from
Jun 24, 2024
Merged

Healthcheck service #24817

merged 26 commits into from
Jun 24, 2024

Conversation

vinzscam
Copy link
Member

@vinzscam vinzscam commented May 17, 2024

Hey, I just made a Pull Request!

This PR adds a new healthcheck service which adds a bunch of endpoints to the root HTTP Router

Closes #24548

✔️ Checklist

  • A changeset describing the change and affected packages. (more info)
  • Added or updated documentation
  • Tests for new functionality and regression tests for bug fixes
  • Screenshots attached (for UI changes)
  • All your commits have a Signed-off-by line in the message. (more info)

@vinzscam vinzscam requested review from a team as code owners May 17, 2024 09:42
@vinzscam vinzscam requested review from Rugvip and camilaibs May 17, 2024 09:42
@backstage-goalie
Copy link
Contributor

backstage-goalie bot commented May 17, 2024

Changed Packages

Package Name Package Path Changeset Bump Current Version
@backstage/backend-defaults packages/backend-defaults patch v0.3.0
@backstage/backend-plugin-api packages/backend-plugin-api patch v0.6.19
@backstage/backend-test-utils packages/backend-test-utils patch v0.4.0

Copy link
Member

@Rugvip Rugvip left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice! 🎉

Couple of initial high level bits

const router = Router();
let handler = () => Promise.resolve({ status: 'ok' });

router.get('/healthcheck', async (_request: Request, response: Response) => {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thinking we should have this separated from the main routes, probably something like /.backstage/health, but potentially grouped a bit

export interface HttpRouterHealthCheckConfig {
handler: () => Promise<any>;
}

/**
* @public
*/
export interface HttpRouterService {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What do you think about keeping health checks at the root level instead? Something like /.backstage/health/v1/readiness for checking whether the entire instance is up, and /.backstage/health/v1/<plugin-id>/readiness for individual plugins?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah nice!

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

but the other way around, /.backstage/health/v1/readiness/<plugin-id>

// @public (undocumented)
export interface HttpRouterService {
// (undocumented)
addAuthPolicy(policy: HttpRouterServiceAuthPolicy): void;
// (undocumented)
healthCheckConfig(healthCheckOptions: HttpRouterHealthCheckConfig): void;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure this should live on the http router tbh. Or framed a bit differently: if we can make this a separate service we probably should. I'm thinking this should probably be additive as well so that it's possible to add health checks from modules etc.

// @public (undocumented)
export interface HttpRouterHealthCheckConfig {
// (undocumented)
handler: () => Promise<any>;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's important to consider a liveness + readiness split. It can help avoid situations where instances are shut down because they timed out at startup, while still being able to be pretty aggressive about timeouts when it comes to whether the service is functional at all.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do you have any suggestion for a possible root level readiness implementation?

Copy link
Contributor

@camilaibs camilaibs left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you 🎉, left minor comments. Could we also deprecate the legacy handler from the backend-common package as well?

@vinzscam vinzscam changed the title Healthcheck endpoint Healthcheck service May 17, 2024
@vinzscam vinzscam force-pushed the healthcheck-router branch 3 times, most recently from 3ede1f0 to 74a6836 Compare May 24, 2024 11:54
Copy link
Contributor

@camilaibs camilaibs left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great, left minor comments 🙌🏻

@@ -58,6 +59,7 @@ export const defaultServiceFactories = [
userInfoServiceFactory(),
urlReaderServiceFactory(),
eventsServiceFactory(),
healthServiceFactory(),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe we have also to provide a mock implementation here

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess this is not used anymore, right?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess this file can be deleted, right?

'@backstage/backend-app-api': patch
---

Added a new health service which adds new endpoints for health checks.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be nice add this service to the core services docs:
https://backstage.io/docs/backend-system/core-services/index

@github-actions github-actions bot added the area:documentation Improvements or additions to documentation label Jun 3, 2024
@vinzscam vinzscam force-pushed the healthcheck-router branch from cf48e42 to 8c5ca35 Compare June 3, 2024 12:11
@github-actions github-actions bot added the area:catalog Related to the Catalog Project Area label Jun 3, 2024
@vinzscam vinzscam force-pushed the healthcheck-router branch from 8c5ca35 to 926d6c9 Compare June 3, 2024 12:14
@github-actions github-actions bot removed the area:catalog Related to the Catalog Project Area label Jun 3, 2024
Copy link
Member

@Rugvip Rugvip left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice! 👍

Couple of nits

* The service reference for the plugin scoped {@link HealthService}.
*/
export const health = createServiceRef<
import('./HealthService').HealthService
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's prolly best to keep this as void instead, since now you can pass anything instead. We can still add methods in the future even if it starts out as void.

Comment on lines 34 to 43
router.get('/v1/readiness', async (_request: Request, response: Response) => {
if (!isRunning) {
throw new Error('Backend has not started yet');
}
response.json({ status: 'ok' });
});

router.get('/v1/liveness', async (_request: Request, response: Response) => {
response.json({ status: 'ok' });
});
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you add a quick description to each of these with what their purpose is?


router.get('/v1/readiness', async (_request: Request, response: Response) => {
if (!isRunning) {
throw new Error('Backend has not started yet');
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thinking this should be a 503 ServiceUnavailableError. Perhaps we shouldn't throw an error in this case either, so that we can have a more consistent response? Instead setting an explicit status code and returning whatever the opposite of status: 'ok' is

@vinzscam vinzscam force-pushed the healthcheck-router branch from 4fd88dc to cdc62c9 Compare June 5, 2024 11:29
Comment on lines 82 to 110
* The service reference for the plugin scoped {@link HealthService}.
*/
export const health = createServiceRef<
import('./HealthService').HealthService
>({ id: 'core.health', scope: 'root' });
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
* The service reference for the plugin scoped {@link HealthService}.
*/
export const health = createServiceRef<
import('./HealthService').HealthService
>({ id: 'core.health', scope: 'root' });
* The service reference for the root scoped {@link RootHealthService}.
*/
export const rootHealth = createServiceRef<
import('./RootHealthService').RootHealthService
>({ id: 'core.rootHealth', scope: 'root' });

Every single other root scoped thing is named like this - we should probably follow suit here. Make sure to change to that naming pattern everywhere else too.

I'd also be interested in what the future might be in terms of per-plugin liveness/health. That's sort of where the most interesting developments might be in the future. This root scoped one could at most do simple lifecycle matching like it does today, or possibly offer some form of rollup of the states of all installed plugins (not even sure that's a good idea though). I mean, how often do we really desire to look at the liveness of effectively a machine that hosts many plugins? What does that even mean?

@Rugvip Rugvip added the workflow:after-release This is a bit too scary to merge until after the next release label Jun 14, 2024
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I beleive should deprecate these functions and types as well in backend common:

export function createStatusCheckRouter(options: {
logger: LoggerService;
path?: string;
statusCheck?: StatusCheck;
}): Promise<express.Router>;

export type StatusCheck = () => Promise<any>;

export function statusCheckHandler(
options?: StatusCheckHandlerOptions,
): Promise<RequestHandler>;

export interface StatusCheckHandlerOptions {
statusCheck?: StatusCheck;
}

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good catch, I'm gonna do it on a following PR

vinzscam added 7 commits June 20, 2024 10:28
Signed-off-by: Vincenzo Scamporlino <vincenzos@spotify.com>
Signed-off-by: Vincenzo Scamporlino <vincenzos@spotify.com>
Signed-off-by: Vincenzo Scamporlino <vincenzos@spotify.com>
Signed-off-by: Vincenzo Scamporlino <vincenzos@spotify.com>
Signed-off-by: Vincenzo Scamporlino <vincenzos@spotify.com>
Signed-off-by: Vincenzo Scamporlino <vincenzos@spotify.com>
Signed-off-by: Vincenzo Scamporlino <vincenzos@spotify.com>
vinzscam added 12 commits June 20, 2024 10:32
Signed-off-by: Vincenzo Scamporlino <vincenzos@spotify.com>
Signed-off-by: Vincenzo Scamporlino <vincenzos@spotify.com>
Signed-off-by: Vincenzo Scamporlino <vincenzos@spotify.com>
Signed-off-by: Vincenzo Scamporlino <vincenzos@spotify.com>
Signed-off-by: Vincenzo Scamporlino <vincenzos@spotify.com>
Signed-off-by: Vincenzo Scamporlino <vincenzos@spotify.com>
Signed-off-by: Vincenzo Scamporlino <vincenzos@spotify.com>
Signed-off-by: Vincenzo Scamporlino <vincenzos@spotify.com>
Signed-off-by: Vincenzo Scamporlino <vincenzos@spotify.com>
Signed-off-by: Vincenzo Scamporlino <vincenzos@spotify.com>
Signed-off-by: Vincenzo Scamporlino <vincenzos@spotify.com>
Signed-off-by: Vincenzo Scamporlino <vincenzos@spotify.com>
@vinzscam vinzscam force-pushed the healthcheck-router branch from cb2689f to 2c97234 Compare June 20, 2024 10:11
vinzscam added 2 commits June 20, 2024 12:20
Signed-off-by: Vincenzo Scamporlino <vincenzos@spotify.com>
Signed-off-by: Vincenzo Scamporlino <vincenzos@spotify.com>
@vinzscam vinzscam force-pushed the healthcheck-router branch from 2c97234 to e36e507 Compare June 20, 2024 10:20
Signed-off-by: Vincenzo Scamporlino <vincenzos@spotify.com>
const backend = createBackend();

class MyRootHealthService implements RootHealthService {
async getLiveness(): Promise<{ status: number; payload?: any }> {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the payloads could have been JsonValue

});

it(`should return a 500 response if the server has stopped`, async () => {
const lifecycle = mockServices.rootLifecycle.mock();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you can also do

const lifecycle = mockServices.rootLifecycle.mock({
  addStartupHook: jest.fn(() => {
    ...
  },

const { indexPath, configure = defaultConfigure } = options ?? {};
const logger = rootLogger.child({ service: 'rootHttpRouter' });
const app = express();

const router = DefaultRootHttpRouter.create({ indexPath });
const middleware = MiddlewareFactory.create({ config, logger });
const routes = router.handler();

const healthRouter = createHealthRouter({ health });
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think that this file should have any changes at all.

The rootHealthServiceFactory should have a dependency on coreServices.rootHttpRouter and just call .use on it with the right path.

Your DefaultRootHealthService could have a create method that accepts the root http router service as an argument and does that work. You could also make your rootHealthServiceFactory be of the options callback form, where you can pass in custom liveness and readiness callbacks as an adopter, and then the DefaultRootHealthService uses those instead of the default ones if they were given.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe this approach simplifies matters in the event that the adopter wishes to provide their own implementation of the health service. With your approach, they would essentially have to replicate what the default implementation of the healthServiceFactory is doing, remembering the path where the middleware should be added. This could potentially lead to problems if they attach a middleware to a non-standard path.

This alternative approach simplifies things: they only need to provide their own implementation of the getLiveness and getReadiness methods. No raw operations on the root router are necessary.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What I meant was,

backend.add(
  rootHealthServiceFactory({
    getLiveness: async () => ...,
  })
)

as the first level of adaptation, and if you need more,

backend.add(
  createServiceFactory({
    service: coreServices.rootHealth,
    deps: {
      rootHttpRouter: coreServices.rootHttpRouter,
    },
    async factory({ rootHttpRouter }) {
      // performs all registration etc inside create
      return DefaultRootHealthService.create({
        rootHttpRouter,
        getLiveness: async () => ...,
      });
    }
  })
)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

of course, this is all moot once we have service modules :)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But yeah I see that we did the same type of thing for the auth and lifecycle middlewares in httpRouterServiceFactory. So maybe this is where we were going with all this anyway.

Copy link
Member Author

@vinzscam vinzscam Jun 20, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this approach simplifies the customization process for adopters and provides a more opinionated method for exposing the metrics

/**
* Get the liveness status of the backend.
*/
getLiveness(): Promise<{ status: number; payload?: any }>;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah again, making these payloads JsonObject conveys the limitation that they are being serialized as a response body.

vinzscam and others added 4 commits June 20, 2024 15:22
Co-authored-by: Fredrik Adelöw <freben@gmail.com>
Signed-off-by: Vincenzo Scamporlino <vincenzos@spotify.com>
Co-authored-by: Fredrik Adelöw <freben@gmail.com>
Signed-off-by: Vincenzo Scamporlino <vincenzos@spotify.com>
Signed-off-by: Vincenzo Scamporlino <vincenzos@spotify.com>
Signed-off-by: Vincenzo Scamporlino <vincenzos@spotify.com>
@freben freben merged commit d776d36 into master Jun 24, 2024
28 checks passed
@freben freben deleted the healthcheck-router branch June 24, 2024 11:39
Copy link
Contributor

Thank you for contributing to Backstage! The changes in this pull request will be part of the 1.29.0 release, scheduled for Tue, 16 Jul 2024.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area:documentation Improvements or additions to documentation workflow:after-release This is a bit too scary to merge until after the next release
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Create a replacement for the status check router in the new system and deprecated it from the commons
5 participants