
bug: Passive healthcheck can't disable #3304

Closed
mengskysama opened this issue Mar 16, 2018 · 8 comments · Fixed by #3319

Comments

@mengskysama
Contributor

mengskysama commented Mar 16, 2018

Summary

Upstream passive healthcheck can't be disabled.

Steps To Reproduce

  1. Configure the upstream to disable all health checks, as described in https://getkong.org/docs/0.12.x/health-checks-circuit-breakers/:

{
    "created_at": 1521134937862,
    "hash_on": "none",
    "id": "35adaf0b-3cd3-4271-bec0-617f1b948146",
    "healthchecks": {
        "active": {
            "unhealthy": {
                "http_statuses": [
                    429,
                    404,
                    500,
                    501,
                    502,
                    503,
                    504,
                    505
                ],
                "tcp_failures": 0,
                "timeouts": 0,
                "http_failures": 0,
                "interval": 0
            },
            "http_path": "/",
            "healthy": {
                "http_statuses": [
                    200,
                    302
                ],
                "interval": 0,
                "successes": 0
            },
            "timeout": 1,
            "concurrency": 10
        },
        "passive": {
            "unhealthy": {
                "http_failures": 0,
                "http_statuses": [
                    429,
                    500,
                    503
                ],
                "tcp_failures": 0,
                "timeouts": 0
            },
            "healthy": {
                "successes": 0,
                "http_statuses": [
                    200,
                    201,
                    202,
                    203,
                    204,
                    205,
                    206,
                    207,
                    208,
                    226,
                    300,
                    301,
                    302,
                    303,
                    304,
                    305,
                    306,
                    307,
                    308
                ]
            }
        }
    },
    "name": "ip_qingting_fm",
    "hash_fallback": "none",
    "slots": 100
}
  2. Stop the upstream service.
  3. Send requests:
curl  x -H "Host: x.fm"
The upstream server is timing out
curl  x -H "Host: x.fm"
The upstream server is timing out
curl  x -H "Host: x.fm"
{"message":"failure to get a peer from the ring-balancer"}
  4. Resume the upstream service... but:
curl  x -H "Host: x.fm"
{"message":"failure to get a peer from the ring-balancer"}
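The configuration in step 1 is meant to disable every check by zeroing all intervals and success/failure counters. As a reference, a small helper that builds such a healthchecks table (my own sketch mirroring the JSON above, not Kong code) could look like:

```python
def disabled_healthchecks():
    """Build a Kong 0.12.x-style healthchecks table with every check
    disabled: all intervals and success/failure counters are 0,
    mirroring the JSON configuration shown above."""
    return {
        "active": {
            "healthy": {"interval": 0, "successes": 0},
            "unhealthy": {"interval": 0, "tcp_failures": 0,
                          "timeouts": 0, "http_failures": 0},
        },
        "passive": {
            "healthy": {"successes": 0},
            "unhealthy": {"tcp_failures": 0, "timeouts": 0,
                          "http_failures": 0},
        },
    }
```

With every counter at 0, no amount of failed requests should ever eject a target from the ring-balancer.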

Additional Details & Logs

Kong version (0.12.3)

2018/03/16 06:16:40 [warn] 922#0: *4158563 [lua] healthcheck.lua:957: log(): [healthcheck] (ip_qingting_fm) unhealthy TCP increment (4/2) for 192.168.50.132:80 while connecting to upstream, client: 60.25.21.33, server: kong, request: "GET /ip HTTP/1.1", upstream: "http://192.168.50.132:80/ip", host: "x.fm", referrer: "httpsies/3617?appversion=6.2.6"
2018/03/16 06:16:40 [error] 922#0: *4158563 [lua] init.lua:319: balancer(): failed to retry the dns/balancer resolver for ip_qingting_fm' with: failure to get a peer from the ring-balancer while connecting to upstream, client: 60.25.21.33, server: kong, request: "GET /ip HTTP/1.1", upstream: "http://192.168.50.132:80/ip", host: "x.fm", referrer: "httpsies/3617?appversion=6.2.6"
@mengskysama
Contributor Author

btw, in https://getkong.org/docs/0.12.x/health-checks-circuit-breakers/ the target-enable URL http://localhost:8001/upstream/my_upstream/targets/10.1.2.3:1234/healthy should be http://localhost:8001/upstreams/my_upstream/targets/10.1.2.3:1234/healthy

@thibaultcha
Member

@mengskysama Thanks for the report.

Pushed Kong/docs.konghq.com#627 for now - we will look at your PR soon!

thibaultcha added a commit to Kong/docs.konghq.com that referenced this issue Mar 17, 2018
@hishamhm
Contributor

@mengskysama I left a comment in your PR!

@thibaultcha thank you for the doc fix!

@mengskysama
Contributor Author

mengskysama commented Mar 20, 2018

@hishamhm Thanks for your reply!

Passive health check can be disabled on a single instance, but not in a cluster.

Here is some code that helps reproduce the problem.

Modify these to match your setup:

cluster_admin_api = 'http://192.168.50.135:8001'
# same instance as cluster_admin_api
cluster0_api = 'http://192.168.50.135'
# different instance
cluster1_api = 'http://192.168.50.136'

Delete all of upstream and api before run

import requests
import json
import time


s = requests.session()
s.headers['Content-Type'] = 'application/json'

cluster_admin_api = 'http://192.168.50.135:8001'
# same instance as cluster_admin_api
cluster0_api = 'http://192.168.50.135'
# different instance
cluster1_api = 'http://192.168.50.136'


def init_kong():
    # create api
    data = {
        "strip_uri": True,
        "hosts": [
            "api.test.com"
        ],
        "name": "test",
        "methods": [
            "GET",
            "HEAD"
        ],
        "http_if_terminated": True,
        "https_only": False,
        "retries": 1,
        "uris": [
            "/"
        ],
        "preserve_host": False,
        "upstream_connect_timeout": 1000,
        "upstream_read_timeout": 3000,
        "upstream_send_timeout": 3000,
        "upstream_url": "http://192.168.50.132"
    }
    r = s.post('%s/apis' % cluster_admin_api, data=json.dumps(data)).json()
    return r['id']


def update_upstream(api_id):
    # upstream
    data = {
        "name": "test"
    }
    ret = s.post('%s/upstreams' % cluster_admin_api, data=json.dumps(data)).json()
    upstream_id = ret['id']

    # add target
    data = {
        "target": "192.168.50.132:80"
    }
    s.post('%s/upstreams/%s/targets' % (cluster_admin_api, upstream_id), data=json.dumps(data))

    # update upstream
    data = {
        "id": api_id,
        "created_at": int(time.time()),
        "strip_uri": True,
        "hosts": [
            "api.test.com"
        ],
        "name": "test",
        "methods": [
            "GET",
            "HEAD"
        ],
        "http_if_terminated": True,
        "https_only": False,
        "retries": 1,
        "uris": [
            "/"
        ],
        "preserve_host": False,
        "upstream_connect_timeout": 1000,
        "upstream_read_timeout": 3000,
        "upstream_send_timeout": 3000,
        "upstream_url": "http://test",
    }

    s.put('%s/apis' % cluster_admin_api, data=json.dumps(data))


api_id = init_kong()
print('created api %s' % api_id)
update_upstream(api_id)

# wait for Kong to pull the configuration from the DB
time.sleep(10)

for i in range(10):
    r = s.get(cluster0_api, headers={'Host': 'api.test.com'})
    print("cluster0_api ret = %s" % r.text)
    r = s.get(cluster1_api, headers={'Host': 'api.test.com'})
    print("cluster1_api ret = %s" % r.text)

output

cluster0_api ret = The upstream server is timing out
cluster1_api ret = The upstream server is timing out
cluster0_api ret = The upstream server is timing out
cluster1_api ret = The upstream server is currently unavailable
cluster0_api ret = The upstream server is timing out
cluster1_api ret = {"message":"failure to get a peer from the ring-balancer"}
cluster0_api ret = The upstream server is timing out
cluster1_api ret = {"message":"failure to get a peer from the ring-balancer"}
cluster0_api ret = The upstream server is timing out
cluster1_api ret = {"message":"failure to get a peer from the ring-balancer"}
cluster0_api ret = The upstream server is timing out
cluster1_api ret = {"message":"failure to get a peer from the ring-balancer"}
Please let me know if you can reproduce this problem.

@hishamhm
Contributor

@mengskysama Thank you for the test case! I will try to reproduce it.

hishamhm added a commit that referenced this issue Mar 20, 2018
In the upstream event handler, `create_balancer` was being called
with the object received via the event, which contains "id" and
"name" only, and not the entire entity table containing the rest
of the upstream fields. This caused it to create a healthchecker with
an empty configuration (ignoring the user's configuration),
which then fell back into the lua-resty-healthcheck defaults.

This fix obtains the proper entity object from the id and passes
it to `create_balancer`.

A regression test is included, which spawns two Kong instances
and reproduces the error scenario described by @mengskysama.

Fixes #3304.
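In other words, the event handler received only a partial table, so the user's healthchecker configuration was silently dropped in favor of library defaults. A minimal Python illustration of this failure mode (names and values are mine, not Kong's):

```python
# Stand-in for the lua-resty-healthcheck defaults.
LIBRARY_DEFAULTS = {"passive": {"unhealthy": {"tcp_failures": 2}}}

# Full upstream entity as stored in the database.
DB = {
    "35adaf0b": {
        "id": "35adaf0b",
        "name": "test",
        "healthchecks": {"passive": {"unhealthy": {"tcp_failures": 0}}},
    }
}

def create_balancer(upstream):
    # Falls back to library defaults when "healthchecks" is absent.
    return upstream.get("healthchecks", LIBRARY_DEFAULTS)

# The event payload carries only "id" and "name" -- no "healthchecks".
event = {"id": "35adaf0b", "name": "test"}

buggy = create_balancer(event)            # defaults win: limit 2
fixed = create_balancer(DB[event["id"]])  # full entity: configured 0 wins
```

The fix corresponds to the second call: look up the complete entity by id before building the balancer, so the configured values survive.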
@hishamhm
Contributor

@mengskysama I reproduced your test case and submitted a PR with the fix! Thank you once again! If possible, please download the PR branch (fix/upstream-event) and let me know your results.

thibaultcha pushed a commit that referenced this issue Mar 20, 2018
@mengskysama
Contributor Author

mengskysama commented Mar 21, 2018

@hishamhm Great! I double-checked that fix/upstream-event fixes it perfectly 👍
I will spend some time understanding the balancer module logic.

@hishamhm
Contributor

hishamhm commented Mar 21, 2018 via email

hishamhm added a commit that referenced this issue May 11, 2018
The regression test for issue #3304 was flaky because it launched two
Kong nodes and waited for the second one to be ready by reading the
logs. This is not a reliable way of determining whether a node is
immediately ready to proxy a configured route.

Reversing the order of proxy calls in the test made it fail more
consistently, which helped debug the issue.

This changes the check to verify if the router has been rebuilt,
using a dummy route for triggering the routing rebuild before the
proper test starts. (Thanks @thibaultcha for the idea!)

The changes are also backported to `spec-old-api/`.
thibaultcha pushed a commit that referenced this issue May 11, 2018
From #3454
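The general pattern behind the fix is to poll for an observable effect (the dummy route responding) instead of reading logs. A generic sketch of such a poll loop (an assumed helper, not the actual spec code):

```python
import time

def wait_until(predicate, timeout=10.0, interval=0.2):
    """Poll predicate() until it returns a truthy value or the
    timeout elapses; returns True on success, False on timeout."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if predicate():
            return True
        time.sleep(interval)
    return False
```

In the test, this would wrap a request to the dummy route, treating a successful response as proof that the router has been rebuilt before the real assertions run.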