
feat(healthcheck) adding visibility to target health status #3232

Merged
2 commits merged into master from feat/healthcheck-admin-api on Feb 21, 2018

Conversation

hishamhm
Contributor

hishamhm commented Feb 16, 2018

Note: This is a work-in-progress.

This adds /upstreams/:upstream_id/health, an endpoint to report health of the targets of an upstream.

It returns the same contents of /upstreams/:upstream_id/targets, with the addition of a "health" field, which may return one of four values:

  • "not in balancer" - target failed to be inserted into the load balancer (usually because of a DNS resolution error), target is therefore not in use by the load balancer
  • "healthchecks disabled" - healthchecks are disabled, target is in use by the load balancer
  • "healthy" - healthchecks are enabled, target is considered healthy by healthchecks, target is in use by the load balancer
  • "unhealthy" - healthchecks are enabled, target is considered unhealthy by healthchecks, target is not in use by the load balancer

Note that if DNS for the target did not resolve, it will display as "not in balancer" regardless of healthchecks being enabled or disabled.
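To make the mapping concrete, here is a rough sketch of how the four values relate to balancer and healthchecker state. This is not the code in this PR; describe_target_health, in_balancer, healthchecks_enabled and checker_status are assumed names for illustration only.

-- Rough sketch (assumed names, not the actual implementation) of how the
-- four values described above could be derived for a single target.
local function describe_target_health(in_balancer, healthchecks_enabled, checker_status)
  if not in_balancer then
    return "not in balancer"        -- e.g. DNS resolution failed
  elseif not healthchecks_enabled then
    return "healthchecks disabled"  -- target is still used by the balancer
  else
    return checker_status           -- "healthy" or "unhealthy"
  end
end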

@hishamhm
Contributor Author

@Tieske I think you'll be interested in discussing this as well, especially the "not in balancer" state. Perhaps it would be desirable for us to be able to report here why exactly the (non-zero-weight) target is not in the balancer? I'm not sure DNS errors are the only possible cause for this state.

@coopr
Contributor

coopr commented Feb 16, 2018

Does this response distinguish between targets that have been manually "marked healthy" or "marked unhealthy" vs. those that have been ascertained to be healthy or unhealthy by health checks? (Not suggesting that it should, at least not yet, need to think about it more - just wondering)

@hishamhm
Contributor Author

@coopr no, it doesn't -- as listed above, there is a single "healthy" and a single "unhealthy" state. The backend this API is exposing does not store this distinction.

@kikito
Member

kikito commented Feb 16, 2018

If possible, I would rather have a string (i.e. "healthchecks disabled") instead of null. I think that would make it a bit easier to deal with: if a client wants to use this API it saves one conditional, and there's less risk of getting a "null is not an object" JavaScript error.
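(For illustration, a hypothetical Lua consumer of the endpoint, where res is an assumed decoded response: with a fixed string every branch stays a plain string comparison, with no separate null check needed.)

-- hypothetical client-side handling of the "health" field
for _, entry in ipairs(res.data) do
  if entry.health == "healthy" then
    -- target is in use and passing healthchecks
  elseif entry.health == "unhealthy" or entry.health == "not in balancer" then
    -- target is not receiving traffic
  else
    -- "healthchecks disabled": in use, but not checked
  end
end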

@coopr
Contributor

coopr commented Feb 16, 2018

Does anything special need to be in the response to indicate that it is specific to the Kong node that is reporting it? I ask because all other parts of the Admin API will respond with cluster-wide details, while in this case, the details are node-specific. Perhaps healthcheck visibility should be added to https://getkong.org/docs/0.12.x/admin-api/#information-routes instead (or in addition)?

active_n = active_n + 1
active[active_n] = entry
GET = function(self, dao_factory)
self.params.active = nil
Member

Is this a result of a copy/paste or does unsetting it have any influence over one of the below methods? None of them seems to at a glance.

entry.health = health_info[entry.target] or "not in balancer"
else
-- Healthchecks disabled for this upstream
entry.health = ngx.null
Member

Should we here distinguish cases where the above get_upstream_health returned an error, from cases where the healthchecks were disabled? I feel like logging an error in Kong might not be enough, the user probably should be informed that "something went wrong" while retrieving the health of those particular targets.
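(One possible shape for that distinction, purely as a sketch; targets and the "error retrieving health" value are made up for the example, while balancer.get_upstream_health and the health_info lookup are as in the diff.)

-- illustrative only: surfacing a retrieval error instead of conflating it
-- with disabled healthchecks
local health_info, err = balancer.get_upstream_health(upstream_id)
for _, entry in ipairs(targets) do
  if err then
    entry.health = "error retrieving health"   -- hypothetical value
  elseif not health_info then
    entry.health = "healthchecks disabled"
  else
    entry.health = health_info[entry.target] or "not in balancer"
  end
end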

@thibaultcha
Member

I think @coopr's point is particularly interesting - this is one case of a node-specific data endpoint (but others already exist, like / and /status for example). This could be a matter of documentation, or of stronger rules about how we name our endpoints (/cluster/... vs. /node/...).

@hishamhm
Contributor Author

hishamhm commented Feb 19, 2018

@coopr @thibaultcha I thought about that as well. Now that #3234 is merged and we have a node id available, I added the node id as a wrapping key to the response as a whole, indicating that the response pertains to that node:

{
    "a2ad0188-c249-4f3d-b62b-f3605324031c": {
        "data": [
            {
                "created_at": 1519078028938, 
                "health": "healthy", 
                "id": "37135a85-9752-4fbb-86ac-513ea8f8335c", 
                "target": "127.0.0.1:2112", 
                "upstream_id": "5136b1f1-1f06-4c6f-860c-cdc629044fbb", 
                "weight": 100
            }
        ], 
        "total": 1
    }
}

@hishamhm
Contributor Author

I'm a bit torn between that (which makes it explicit that the data is "scoped" by the node) and just adding a "node_id" key alongside "data" (which would make the meaning of the UUID more evident). Opinions?

@coopr
Contributor

coopr commented Feb 19, 2018

I'm still leaning towards incorporating this information into the existing node-specific endpoints that @thibaultcha mentioned #3232 (comment)

@thibaultcha
Member

I'm still leaning towards incorporating this information into the existing node-specific endpoints that @thibaultcha mentioned #3232 (comment)

I like it better too but I am not sure if now is the time to make such paradigm shifts in the Admin API yet... I think it makes sense eventually, but that such a change is beyond the scope of this PR (or current items at work)?

My favorite (for now) would be:

{
        "node_id": "a2ad0188-c249-4f3d-b62b-f3605324031c",
        "data": [
            {
                "created_at": 1519078028938, 
                "health": "healthy", 
                "id": "37135a85-9752-4fbb-86ac-513ea8f8335c", 
                "target": "127.0.0.1:2112", 
                "upstream_id": "5136b1f1-1f06-4c6f-860c-cdc629044fbb", 
                "weight": 100
            }
        ], 
        "total": 1
}

@Tieske
Member

Tieske commented Feb 20, 2018

The logic of having the node_id as a key would be that it makes it easy to consolidate data from multiple nodes, since it is all namespaced by node_id. But in hindsight that might not make sense: if you want to integrate, you would want status per target, so a target/node_id/data structure would make more sense, and by then the "easy integration" argument is already lost.

As an alternative to @thibaultcha's proposal we could index the data entries by the target field instead of making it an array. But that would be more lua-ish than json-ish. So I'm with @thibaultcha on this one, my 2cts.
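(For comparison, the target-indexed alternative would look roughly like this; purely a sketch of the shape, reusing the values from the example above:)

{
    "node_id": "a2ad0188-c249-4f3d-b62b-f3605324031c",
    "data": {
        "127.0.0.1:2112": {
            "created_at": 1519078028938,
            "health": "healthy",
            "id": "37135a85-9752-4fbb-86ac-513ea8f8335c",
            "upstream_id": "5136b1f1-1f06-4c6f-860c-cdc629044fbb",
            "weight": 100
        }
    },
    "total": 1
}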

As for the endpoint:

I'm still leaning towards incorporating this information into the existing node-specific endpoints that @thibaultcha mentioned #3232 (comment)

I like it better too but I am not sure if now is the time to make such paradigm shifts in the Admin API yet... I think it makes sense eventually, but that such a change is beyond the scope of this PR (or current items at work)?

As long as we have an idea what the design looks like, now is as good a time to start as any...

@hishamhm
Contributor Author

hishamhm commented Feb 20, 2018

Implemented the second option listed here (the one in @thibaultcha's example), with "node_id". Reshuffling the data around isn't hard if one needs to consolidate responses later.

Also, changed the nil state to "healthchecks disabled" for now.

@hishamhm force-pushed the feat/healthcheck-admin-api branch 2 times, most recently from a89608f to 9ffe79a on February 20, 2018 13:19
@hishamhm
Contributor Author

Update: changed the behavior so that failure to resolve DNS (which causes the target to not be included in the balancer) returns "not in balancer", whether healthchecks for the upstream are on or off.

@hishamhm
Contributor Author

I'd like to have some opinions on what we should call the four states. Things to consider:

  • currently, the "not in balancer" state is always caused by DNS failing to resolve
    • "unresolved"? "DNS error"?
    • would it be useful to include "DNS" explicitly? (however, what if other situations appear in the future that cause targets not to be added?)
  • these strings are going to be string-matched by other tools
    • should we prefer single-word results, like "unresolved", "unavailable", "unknown"?
    • should we do it "enum-style", like "HEALTHCHECKS_OFF" and "DNS_UNRESOLVED"? (is there any precedent for this anywhere else in the Admin API?)
  • when healthchecks are disabled, the balancer works as if all targets were always healthy. Still, I think having separate states for "healthchecker presumes healthy" and "healthchecks disabled" is important for diagnostics.
    • reducing the "healthchecks disabled" state to one word is complicated, because "health": "disabled" may sound like the healthchecker determined the target to be unhealthy and disabled it (likewise for "unavailable", etc.) -- the only single-word option that seems unequivocal to me would be "health": "unchecked"

tl;dr:

  • what to call the "not in balancer" state?
  • what to call the "healthchecks disabled" state?

@thibaultcha
Member

As long as we have an idea what the design looks like, now is as good a time to start as any...

@Tieske Not really - we are designing this on the go which we want to avoid for such a massive change, but more importantly we have our work and priorities already set for the next few weeks, with 2 releases that need to be completed, documented, tested and shipped - there probably isn't time to start changing this now if we want to stick to our target dates.

currently, the "not in balancer" state is always caused by DNS failing to resolve

@hishamhm My thought while reading this: not in balancer (DNS ERROR)

should we do it "enum-style"

It does indeed look more appropriate for parsing... Maybe "not in balancer" isn't a state, but other states (such as DNS_ERROR) are documented as resulting in the Target not being included in the balancer?

  • HEALTHY
  • UNHEALTHY
  • HEALTHCHECKS_OFF
  • DNS_ERROR
  • ... (other states which result in "not in balancer")

It'd be nice to also keep track of the error that triggered the DNS_ERROR state, like: "error": "dns server error: 3 name error" - but if not doable without non-trivial effort, no biggie.
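(A single target entry under that scheme might then look something like this; a hypothetical shape, including the optional "error" field:)

{
    "target": "127.0.0.1:2112",
    "weight": 100,
    "health": "DNS_ERROR",
    "error": "dns server error: 3 name error"
}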

@hishamhm
Contributor Author

but if not doable without non-trivial effort, no biggie.

Yeah, an extra description field crossed my mind as well, but that error happens inside lua-resty-dns-client and the end result is that it just doesn't launch the callback, which means kong.core.balancer and healthchecks don't get notified about the target at all. I infer the DNS error from the fact that the target is missing from the balancer table (that's why I was reluctant to call it "DNS error" at first). FWIW, lua-resty-dns-client does retry by itself, meaning that if DNS is down and it eventually resumes working, the library will issue the callback and the target will change state.

If we were to add a detailed diagnostics argument, it would be nice to also display the reason why the healthchecker switched state as well (HTTP status, timeout, etc.) but the library doesn't store that at this point, so file this under non-trivial too. :)

@thibaultcha
Member

@hishamhm ditto: can we avoid splitting a feat's test into a separate commit? I think it is also more trouble for you to maintain this way when tweaking the patch...

-- meaning that if DNS is down and it eventually resumes working, the
-- library will issue the callback and the target will change state.
entry.health = health_info[entry.target] or "DNS_ERROR"
else
Member

style: it'd be nice to keep the line break above this just as usual - other branches in this file follow it

local health_info
health_info, err = balancer.get_upstream_health(upstream_id)
if err then
ngx.log(ngx.ERR, "[healthchecks] failed getting upstream health: ", err)
Member

[healthchecks] is the namespace we chose for the actual health-checker, but I am thinking maybe this one should be different since it is an error in the Admin API and not in the proxy itself (less worrying for users especially since no stacktrace is printed by ngx.log).

If we use an "Admin API" namespace we could also add it to the above error log (or no namespace at all would be fine too imho).
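(For example, something along these lines; the exact prefix is just a placeholder suggestion, not an agreed-upon namespace:)

-- hypothetical prefix; anything that identifies the Admin API would do
ngx.log(ngx.ERR, "[admin api] failed getting upstream health: ", err)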

This adds `/upstreams/:upstream_id/health`, an endpoint to report health of
the targets of an upstream.

It returns the same contents of `/upstreams/:upstream_id/targets`, with the
addition of a `"health"` field, which may return one of four values:

* `"DNS_ERROR"` - target failed to be inserted into the load balancer
(because of a DNS resolution error), target is therefore **not** in use by the
load balancer
* `"HEALTHCHECKS_OFF"` - healthchecks are disabled, target is in use by the
load balancer
* `"HEALTHY"` - healthchecks are enabled, target is considered healthy by
healthchecks, target is in use by the load balancer
* `"UNHEALTHY"` - healthchecks are enabled, target is considered unhealthy by
healthchecks, target is **not** in use by the load balancer

Note that if DNS for the target did not resolve, it will display as
`"DNS_ERROR"` regardless of healthchecks being enabled or disabled.
hishamhm merged commit bee5580 into master on Feb 21, 2018
Tieske deleted the feat/healthcheck-admin-api branch on February 21, 2018 17:14