
Performance degradation in 0.10.0 due to DNS #2303

Closed
Vermeille opened this issue Mar 30, 2017 · 20 comments · Fixed by #2434

@Vermeille
Contributor

Vermeille commented Mar 30, 2017

Hello,

Here are some performance tests we ran on 0.9.9. Our setup is quite simple: a client queries a backend either directly ("Without Kong") or through Kong ("With Kong"). With upstream keep-alive properly set, the latency induced by Kong is approximately 1 ms (networking time being negligible in our setup).

[chart: latency distribution on 0.9.9, with and without Kong]

We were really looking forward to deploying Kong 0.10 because of the dynamic upstream keep-alive configuration. That being said, we ran the very same performance tests and got these results:

[chart: latency distribution on 0.10.0, with and without Kong]

As you can see, the latency distribution has clearly shifted to the right, with the mode of the distribution now at 15 ms. After investigating, it turns out this time is spent in Kong, and more precisely in this call stack:

kong.access()
-> kong.core.before()
--> kong.balancer.execute()
---> toip()
----> resolve()

It seems toip() performs a DNS query far too frequently. Is it a misconfiguration on my side? Is it expected? Is there an issue with how resolved IPs are cached?

I'm still investigating and will let you know my results; I'll be more efficient with your insights :)

@Tieske
Member

Tieske commented Mar 30, 2017

@Vermeille what is the TTL of your DNS records?

@Vermeille
Contributor Author

The TTL is 3600 seconds.

@Vermeille
Contributor Author

Vermeille commented Mar 31, 2017

The DNS call is always performed because, somehow, singletons.dao.upstreams:find_all() returns an empty table. That makes upstream_id always nil here. So get_balancer() returns false, balancer is false, and execution falls through to toip().
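In pseudo-code, the path I'm observing looks roughly like this (a simplified sketch of the fall-through, not the actual kong.core.balancer code; upstreams and dns_toip are stand-in arguments):

-- Sketch only: how a request falls through to plain DNS resolution
-- when no upstream entity matches the hostname.
local function execute(target, upstreams, dns_toip)
  local upstream = upstreams[target.host]  -- find_all() returned nothing, so this lookup is nil
  if not upstream then
    -- no ring-balancer for this host: every single request ends up in toip()
    return dns_toip(target.host, target.port)
  end
  -- otherwise the ring-balancer for this upstream would pick a target here
end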

Continuing my investigations. I'm trying to understand how upstreams_dict is meant to work.

@Vermeille
Contributor Author

Okaaaaaaaaaaaaay. singletons.dao.upstreams:find_all() queries a new /upstreams endpoint that was not mentioned in the UPGRADE.md document and that somehow requires special attention. The migration apparently did not do The Expected Thing, which is, if my understanding is correct, to migrate each API's upstream_url into that table.

Should we write a migration procedure?
Is my understanding totally wrong?
Or should the code handle the "base/simple" case (nothing specified in /upstreams)?

I'll gladly help, but I need your guidance to fully understand what /upstreams is all about.

@Tieske
Member

Tieske commented Mar 31, 2017

The DNS call is always performed because, somehow, singletons.dao.upstreams:find_all() returns an empty table. That makes upstream_id always nil here. So get_balancer() returns false, balancer is false, and execution falls through to toip().

This is completely logical; it should, as it tests whether the hostname refers to an upstream entity. If it doesn't, then there is no ring-balancer, and hence it must invoke the DNS resolver, which results in the toip() call.

@Tieske
Member

Tieske commented Mar 31, 2017

Okaaaaaaaaaaaaay. singletons.dao.upstreams:find_all() queries a new /upstreams endpoint that was not mentioned in the UPGRADE.md document and that somehow requires special attention. The migration apparently did not do The Expected Thing, which is, if my understanding is correct, to migrate each API's upstream_url into that table.

No. upstreams is a completely different mechanism for doing load balancing.

Should we write a migration procedure?

No

Is my understanding totally wrong?

Yes 😄

Or should the code handle the "base/simple" case (nothing specified in /upstreams)?

Yes. The upstream name is used as a virtual host name (see my previous comment). So it first checks whether a ring-balancer for that upstream is available; if not, it falls through to regular DNS resolution.

I'll gladly help, but need your guidance to fully understand what /upstreams is all about

see https://getkong.org/docs/0.10.x/loadbalancing/

@Vermeille
Contributor Author

Vermeille commented Mar 31, 2017

This is completely logical; it should, as it tests whether the hostname refers to an upstream entity. If it doesn't, then there is no ring-balancer, and hence it must invoke the DNS resolver, which results in the toip() call.

Okay, I think I got the general idea. However, it seems the IP isn't cached, considering how long name resolution takes. Why so? Is that really a good default behavior?

My question essentially is: how do I stop spending that much time in DNS resolution and get back to my awesome 0.9.9 performance?

@coopr
Contributor

coopr commented Mar 31, 2017

(Somewhat off topic) @Vermeille what are you using for load testing and generating those pretty charts? Is your testing script something you could share? Thanks!

@Vermeille
Contributor Author

Vermeille commented Mar 31, 2017

@coopr
We are using JMeter. Kong and the upstream server run in Docker containers.

JMeter 3 has super sexy charts. If you can't or don't use JMeter 3: we analyze the CSV produced by JMeter 2 with a Python/NumPy script. I'll see on Monday if I'm allowed to open-source that script (I see no reason they would refuse, but procedures...).

@Tieske
Member

Tieske commented Apr 3, 2017

@Vermeille

kong.access()
-> kong.core.before()
--> kong.balancer.execute()
---> toip()
----> resolve()

toip() will

  • get a record through resolve()
  • do a round-robin over that DNS record

resolve() will

  • identify which record type to query (the last successful one)
  • try each record type through lookup() until successful

lookup() will

  • query the cache for the given name and type
  • if not found, query the server
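A rough sketch of that flow (illustration only, not the actual lua-resty-dns-client source; cache and query_dns are stand-ins):

-- lookup(): check the cache first, then ask the DNS server.
local function lookup(cache, query_dns, qname, qtype)
  local key = qtype .. ":" .. qname
  local record = cache[key]
  if not record then
    record = query_dns(qname, qtype)  -- cache miss: ask the server
    cache[key] = record
  end
  return record
end

-- resolve(): try the last successful type first, then the remaining types.
local function resolve(cache, query_dns, qname, types)
  for _, qtype in ipairs(types) do
    local record = lookup(cache, query_dns, qname, qtype)
    if record and #record > 0 then
      return record
    end
  end
  return nil, "could not resolve " .. qname
end

-- toip(): resolve, then pick one entry.
local function toip(cache, query_dns, qname, types)
  local record, err = resolve(cache, query_dns, qname, types)
  if not record then
    return nil, err
  end
  return record[1].address  -- the real code round-robins over the record entries
end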

What do your DNS records look like, for SRV, A and CNAME queries? E.g. dig <name> SRV, etc.

@Tieske
Member

Tieske commented Apr 3, 2017

An interesting experiment would be to add

    order = { "last", "A", "SRV", "AAAA", "CNAME" },

to the opts table here
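If that opts table ends up in the DNS client's init() call, it would look something like this (a sketch, assuming the standard resty.dns.client init options):

-- Sketch only: initialize lua-resty-dns-client with a custom query order,
-- trying the last successful type first and preferring A over SRV.
local dns_client = require "resty.dns.client"

assert(dns_client.init {
  order = { "last", "A", "SRV", "AAAA", "CNAME" },
})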

@Vermeille
Contributor Author

Vermeille commented Apr 3, 2017

My DNS record is a CNAME that points to an A record.

My guess now is that there is a mistake in the cache.
Say dig a.com returns

a.com CNAME b.com
b.com A 1.2.3.4

When calling resolve("a.com") with an empty cache, it performs a DNS resolution. The DNS answer is:

a.com CNAME b.com
b.com A 1.2.3.4

And that will be cached, meaning the cache now contains

"5:a.com" => "CNAME b.com"
"1:b.com" => "A 1.2.3.4"

plus the "last" entry

"a.com" => 1

Which is the mistake.

The next query will then look up "a.com" in the cache, get the last successful type, which in this case is 1, and check for "1:a.com" in the cache, which doesn't exist, so a full DNS query is performed. I guess you meant to save a.com => 5 as the last successful query, so that the cache is checked for "5:a.com" and then, following the CNAME, for "1:b.com".
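To make the mismatch concrete, the cache state after the first resolution looks roughly like this (illustrative values only, not the library's own structures; 1 and 5 are the A and CNAME DNS type codes):

-- Illustration of the cache state described above.
local cache = {
  ["5:a.com"] = { { name = "a.com", type = 5, cname = "b.com" } },
  ["1:b.com"] = { { name = "b.com", type = 1, address = "1.2.3.4" } },
  ["a.com"]   = 1,  -- "last successful type" entry: claims A (1) was the last success
}

-- The next resolve("a.com") builds the key "1:a.com", misses the cache, and
-- performs a full DNS query again. Storing 5 (CNAME) instead would hit
-- "5:a.com" and then follow the chain to "1:b.com".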

@Vermeille
Contributor Author

The mischievous line is this one. I guess the correct second argument is not qtype but records[1].type or something like this.

@Vermeille
Contributor Author

Okay, I have a quick fix, I guess. Certainly not the best solution, though.

Adding this after this line caches the first A response returned by the DNS server, after following CNAMEs.

if answers and (#answers > 0) then
  answers[1].name = qname               -- relabel the record with the original query name
  cacheinsert(answers, qname, qtype)    -- cache it under "<qtype>:<qname>" so the next lookup hits
end

By doing so, the cache entry for "1:a.com" is populated.
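With the fix applied, the cache sketched earlier would also hold the entry the next lookup actually asks for (illustrative values again):

local cache_after_fix = {
  ["5:a.com"] = { { name = "a.com", type = 5, cname = "b.com" } },
  ["1:b.com"] = { { name = "b.com", type = 1, address = "1.2.3.4" } },
  ["1:a.com"] = { { name = "a.com", type = 1, address = "1.2.3.4" } },  -- added by the quick fix
  ["a.com"]   = 1,  -- the "last" entry now matches a key that exists in the cache
}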

@Tieske
Member

Tieske commented Apr 4, 2017

@Vermeille can you give this branch a try: https://github.com/Mashape/lua-resty-dns-client/tree/fix/cname-caching

Check it out using git, and then run luarocks make from the repo directory.

@Vermeille
Contributor Author

That works super well for me, thanks!

@Tieske
Member

Tieske commented Apr 6, 2017

@Vermeille dunno how much work is involved, but could you share the updated graphs from the original post as a comparison?

@Vermeille
Contributor Author

That fix, my lord, is amazing.
[chart: latency distribution with the fix applied, with and without Kong]

@Tieske
Member

Tieske commented Apr 6, 2017

thx! I love the way those graphs look ❤️

@Vermeille
Contributor Author

You're the real MVP!

Many thanks!

Tieske added a commit that referenced this issue Apr 21, 2017
Fixes an issue with the dns resolver not properly caching
CNAME records.

fixes #2303