
feat(healthcheck) adding visibility to target health status #3232

Merged
2 commits merged into master from feat/healthcheck-admin-api on Feb 21, 2018

Conversation

hishamhm
Contributor

hishamhm commented Feb 16, 2018

Note: This is a work-in-progress.

This adds /upstreams/:upstream_id/health, an endpoint to report health of the targets of an upstream.

It returns the same contents of /upstreams/:upstream_id/targets, with the addition of a "health" field, which may return one of four values:

  • "not in balancer" - target failed to be inserted into the load balancer (usually because of a DNS resolution error), target is therefore not in use by the load balancer
  • "healthchecks disabled" - healthchecks are disabled, target is in use by the load balancer
  • "healthy" - healthchecks are enabled, target is considered healthy by healthchecks, target is in use by the load balancer
  • "unhealthy" - healthchecks are enabled, target is considered unhealthy by healthchecks, target is not in use by the load balancer

Note that if DNS for the target did not resolve, it will display as "not in balancer" regardless of healthchecks being enabled or disabled.
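To make the mapping concrete, here is a rough sketch of how the four values relate to balancer and healthchecker state. This is not the code in this PR; describe_target_health, in_balancer, healthchecks_enabled and checker_status are assumed names for illustration only.

-- Rough sketch (assumed names, not the actual implementation) of how the
-- four values described above could be derived for a single target.
local function describe_target_health(in_balancer, healthchecks_enabled, checker_status)
  if not in_balancer then
    return "not in balancer"        -- e.g. DNS resolution failed
  elseif not healthchecks_enabled then
    return "healthchecks disabled"  -- target is still used by the balancer
  else
    return checker_status           -- "healthy" or "unhealthy"
  end
end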

@hishamhm
Contributor Author

@Tieske I think you'll be interested in discussing this as well, especially the "not in balancer" state. Perhaps it would be desirable for us to be able to report here why exactly the (non-zero-weight) target is not in the balancer? I'm not sure DNS errors are the only possible cause for this state.

@coopr
Contributor

coopr commented Feb 16, 2018

Does this response distinguish between targets that have been manually "marked healthy" or "marked unhealthy" vs. those that have been ascertained to be healthy or unhealthy by health checks? (Not suggesting that it should, at least not yet, need to think about it more - just wondering)

@hishamhm
Contributor Author

@coopr no, it doesn't -- as listed above, there is a single "healthy" and a single "unhealthy" state. The backend this API is exposing does not store this distinction.

@kikito
Member

kikito commented Feb 16, 2018

If possible, I would rather have a string (i.e. "healthchecks disabled") instead of null. I think that would make it a bit easier to deal with: if a client wants to use this API it saves one conditional, and there's less risk of getting a "null is not an object" JavaScript error.
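(For illustration, a hypothetical Lua consumer of the endpoint, where res is an assumed decoded response: with a fixed string every branch stays a plain string comparison, with no separate null check needed.)

-- hypothetical client-side handling of the "health" field
for _, entry in ipairs(res.data) do
  if entry.health == "healthy" then
    -- target is in use and passing healthchecks
  elseif entry.health == "unhealthy" or entry.health == "not in balancer" then
    -- target is not receiving traffic
  else
    -- "healthchecks disabled": in use, but not checked
  end
end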

@coopr
Contributor

coopr commented Feb 16, 2018

Does anything special need to be in the response to indicate that it is specific to the Kong node that is reporting it? I ask because all other parts of the Admin API will respond with cluster-wide details, while in this case, the details are node-specific. Perhaps healthcheck visibility should be added to https://getkong.org/docs/0.12.x/admin-api/#information-routes instead (or in addition)?

active_n = active_n + 1
active[active_n] = entry
GET = function(self, dao_factory)
self.params.active = nil
Member

Is this a result of a copy/paste or does unsetting it have any influence over one of the below methods? None of them seems to at a glance.

entry.health = health_info[entry.target] or "not in balancer"
else
-- Healthchecks disabled for this upstream
entry.health = ngx.null
Member

Should we here distinguish cases where the above get_upstream_health returned an error, from cases where the healthchecks were disabled? I feel like logging an error in Kong might not be enough, the user probably should be informed that "something went wrong" while retrieving the health of those particular targets.
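(One possible shape for that distinction, purely as a sketch; targets and the "error retrieving health" value are made up for the example, while balancer.get_upstream_health and the health_info lookup are as in the diff.)

-- illustrative only: surfacing a retrieval error instead of conflating it
-- with disabled healthchecks
local health_info, err = balancer.get_upstream_health(upstream_id)
for _, entry in ipairs(targets) do
  if err then
    entry.health = "error retrieving health"   -- hypothetical value
  elseif not health_info then
    entry.health = "healthchecks disabled"
  else
    entry.health = health_info[entry.target] or "not in balancer"
  end
end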

@thibaultcha
Member

I think @coopr's point is particularly interesting - this is one case of a node-specific data endpoint (but others already exist, like / and /status for example). This could be a matter of documentation, or of stronger rules about how we name our endpoints (/cluster/... vs. /node/...).

@hishamhm
Contributor Author

hishamhm commented Feb 19, 2018

@coopr @thibaultcha I thought about that as well. Now that #3234 is merged and we have a node id available, I added the node id as a wrapping key to the response as a whole, indicating that the response pertains to that node:

{
    "a2ad0188-c249-4f3d-b62b-f3605324031c": {
        "data": [
            {
                "created_at": 1519078028938, 
                "health": "healthy", 
                "id": "37135a85-9752-4fbb-86ac-513ea8f8335c", 
                "target": "127.0.0.1:2112", 
                "upstream_id": "5136b1f1-1f06-4c6f-860c-cdc629044fbb", 
                "weight": 100
            }
        ], 
        "total": 1
    }
}

@hishamhm
Contributor Author

I'm a bit torn between that (which makes it explicit that the data is "scoped" by the node) and just adding a "node_id" key alongside "data" (which would make the meaning of the UUID more evident). Opinions?

@coopr
Contributor

coopr commented Feb 19, 2018

I'm still leaning towards incorporating this information into the existing node-specific endpoints that @thibaultcha mentioned #3232 (comment)

@thibaultcha
Member

I'm still leaning towards incorporating this information into the existing node-specific endpoints that @thibaultcha mentioned #3232 (comment)

I like it better too but I am not sure if now is the time to make such paradigm shifts in the Admin API yet... I think it makes sense eventually, but that such a change is beyond the scope of this PR (or current items at work)?

My favorite (for now) would be:

{
        "node_id": "a2ad0188-c249-4f3d-b62b-f3605324031c",
        "data": [
            {
                "created_at": 1519078028938, 
                "health": "healthy", 
                "id": "37135a85-9752-4fbb-86ac-513ea8f8335c", 
                "target": "127.0.0.1:2112", 
                "upstream_id": "5136b1f1-1f06-4c6f-860c-cdc629044fbb", 
                "weight": 100
            }
        ], 
        "total": 1
}

@Tieske
Member

Tieske commented Feb 20, 2018

The logic of having the node_id as a key would be that it makes it easy to consolidate data from multiple nodes, since it is all namespaced by node_id. But in hindsight that might not make sense: if you want to integrate, you would want status per target, so a target/node_id/data structure would make more sense, and by then the "easy integration" argument is already lost.

As an alternative to @thibaultcha's proposal we could index the data entries by the target field instead of making it an array. But that would be more lua-ish than json-ish. So I'm with @thibaultcha on this one, my 2cts.
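(For comparison, the target-indexed alternative would look roughly like this; purely a sketch of the shape, reusing the values from the example above:)

{
    "node_id": "a2ad0188-c249-4f3d-b62b-f3605324031c",
    "data": {
        "127.0.0.1:2112": {
            "created_at": 1519078028938,
            "health": "healthy",
            "id": "37135a85-9752-4fbb-86ac-513ea8f8335c",
            "upstream_id": "5136b1f1-1f06-4c6f-860c-cdc629044fbb",
            "weight": 100
        }
    },
    "total": 1
}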

As for the endpoint:

I'm still leaning towards incorporating this information into the existing node-specific endpoints that @thibaultcha mentioned #3232 (comment)

I like it better too but I am not sure if now is the time to make such paradigm shifts in the Admin API yet... I think it makes sense eventually, but that such a change is beyond the scope of this PR (or current items at work)?

As long as we have an idea what the design looks like, now is as good a time to start as any...

@hishamhm
Contributor Author

hishamhm commented Feb 20, 2018

Implemented the second option listed here (the one in @thibaultcha's example), with "node_id". Reshuffling the data around isn't hard if one needs to consolidate responses later.

Also, changed the nil state to "healthchecks disabled" for now.

@hishamhm force-pushed the feat/healthcheck-admin-api branch 2 times, most recently from a89608f to 9ffe79a on February 20, 2018 13:19
@hishamhm
Contributor Author

Update: changed the behavior so that failure to resolve DNS (which causes the target to not be included in the balancer) returns "not in balancer", whether healthchecks for the upstream are on or off.

@hishamhm
Contributor Author

I'd like to have some opinions on what we should call the four states. Things to consider:

  • currently, the "not in balancer" state is always caused by DNS failing to resolve
    • "unresolved"? "DNS error"?
    • would it be useful to include "DNS" explicitly? (however, what if other situations appear in the future that cause targets not to be added?)
  • these strings are going to be string-matched by other tools
    • should we prefer single-word results, like "unresolved", "unavailable", "unknown"?
    • should we do it "enum-style", like "HEALTHCHECKS_OFF" and "DNS_UNRESOLVED"? (is there any precedent for this anywhere else in the Admin API?)
  • when healthchecks are disabled, the balancer works as if all targets were always healthy. Still, I think having separate states for "healthchecker presumes healthy" and "healthchecks disabled" is important for diagnostics.
    • reducing the "healthchecks disabled" state to one word is complicated, because "health": "disabled" may sound like the healthchecker determined the target to be unhealthy and disabled it (likewise for "unavailable", etc.) -- the only single-word option that seems unequivocal to me would be "health": "unchecked"

tl;dr:

  • what to call the "not in balancer" state?
  • what to call the "healthchecks disabled" state?

@thibaultcha
Member

As long as we have an idea what the design looks like, now is as good a time to start as any...

@Tieske Not really - we are designing this on the go which we want to avoid for such a massive change, but more importantly we have our work and priorities already set for the next few weeks, with 2 releases that need to be completed, documented, tested and shipped - there probably isn't time to start changing this now if we want to stick to our target dates.

currently, the "not in balancer" state is always caused by DNS failing to resolve

@hishamhm My thought while reading this: not in balancer (DNS ERROR)

should we do it "enum-style"

It does indeed look more appropriate for parsing... Maybe "not in balancer" isn't a state, but other states (such as DNS_ERROR) are documented as resulting in the Target not being included in the balancer?

  • HEALTHY
  • UNHEALTHY
  • HEALTHCHECKS_OFF
  • DNS_ERROR
  • ... (other states which result in "not in balancer")

It'd be nice to also keep track of the error that triggered the DNS_ERROR state, like: "error": "dns server error: 3 name error" - but if not doable without non-trivial effort, no biggie.
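(A single target entry under that scheme might then look something like this; a hypothetical shape, including the optional "error" field:)

{
    "target": "127.0.0.1:2112",
    "weight": 100,
    "health": "DNS_ERROR",
    "error": "dns server error: 3 name error"
}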

@hishamhm
Contributor Author

but if not doable without non-trivial effort, no biggie.

Yeah, an extra description field crossed my mind as well, but that error happens inside lua-resty-dns-client and the end result is that it just doesn't launch the callback, which means kong.core.balancer and healthchecks don't get notified about the target at all. I infer the DNS error from the fact that the target is missing from the balancer table (that's why I was reluctant to call it "DNS error" at first). FWIW, lua-resty-dns-client does retry by itself, meaning that if DNS is down and it eventually resumes working, the library will issue the callback and the target will change state.

If we were to add a detailed diagnostics argument, it would be nice to also display the reason why the healthchecker switched state as well (HTTP status, timeout, etc.) but the library doesn't store that at this point, so file this under non-trivial too. :)

@thibaultcha
Member

@hishamhm ditto: can we avoid splitting a feat's test into a separate commit? I think it is also more trouble for you to maintain this way when tweaking the patch...

-- meaning that if DNS is down and it eventually resumes working, the
-- library will issue the callback and the target will change state.
entry.health = health_info[entry.target] or "DNS_ERROR"
else
Member

style: it'd be nice to keep the line break above this just as usual - other branches in this file follow it

local health_info
health_info, err = balancer.get_upstream_health(upstream_id)
if err then
ngx.log(ngx.ERR, "[healthchecks] failed getting upstream health: ", err)
Member

[healthchecks] is the namespace we chose for the actual health-checker, but I am thinking maybe this one should be different since it is an error in the Admin API and not in the proxy itself (less worrying for users especially since no stacktrace is printed by ngx.log).

If we use an "Admin API" namespace we could also add it to the above error log (or no namespace at all would be fine too imho).
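(For example, something along these lines; the exact prefix is just a placeholder suggestion, not an agreed-upon namespace:)

-- hypothetical prefix; anything that identifies the Admin API would do
ngx.log(ngx.ERR, "[admin api] failed getting upstream health: ", err)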

This adds `/upstreams/:upstream_id/health`, an endpoint to report health of
the targets of an upstream.

It returns the same contents of `/upstreams/:upstream_id/targets`, with the
addition of a `"health"` field, which may return one of four values:

* `"DNS_ERROR"` - target failed to be inserted into the load balancer
(because of a DNS resolution error), target is therefore **not** in use by the
load balancer
* `"HEALTHCHECKS_OFF"` - healthchecks are disabled, target is in use by the
load balancer
* `"HEALTHY"` - healthchecks are enabled, target is considered healthy by
healthchecks, target is in use by the load balancer
* `"UNHEALTHY"` - healthchecks are enabled, target is considered unhealthy by
healthchecks, target is **not** in use by the load balancer

Note that if DNS for the target did not resolve, it will display as
`"DNS_ERROR"` regardless of healthchecks being enabled or disabled.
hishamhm merged commit bee5580 into master on Feb 21, 2018
Tieske deleted the feat/healthcheck-admin-api branch on February 21, 2018 17:14