cloudflare throttling for DNS api #1941
1. fix #1977 2. the cached response is too long to save as a single line in the conf file
it's reverted. |
Neilpang: from what i infer, you attempted to cache the domain data and eliminate the duplicate queries in response to #1941, and that caused #1977/#1980 to appear. so, 'revert fix for #1941' seems to indicate that #1941 stands uncorrected and unmodified. |
@vonp |
i saw your problem. i may have a working suggestion available if i can
get CF in motion.
when i discovered the forced latency, i revived an effort i had started
with CF several years ago, because i have an on-going need to handle
obtaining initial certs. i do pro bono systems work for a little over
300 non-profits (NGOs), and i also mentor 2-3 dozen military veterans
(vets) per year in linux systems admin. naturally, i always stress the
need for system-wide
TLS and other security measures. the advent of letsencrypt has made the
TLS part free, your efforts have made that availability utile, and CF
offers both low latency and bandwidth conservation as an affordable
(or, even, free) possibility.
you may have missed the CF proc buried in their voluminous API docs, or
you may have experienced the same thing i did when you previously tried
it and the proc failed. the proc permits one to "bundle" all the RRs
into a bind9-formatted config file and submit it in one go. when i first
gave it a try, i seem to remember the problem was that it would not
handle MX records, and nobody in CF's customer service could figure out
who could or would fix it at the engineering level. thus, i stayed with
the same RR-by-RR creation method that you are using now.
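as a sketch of what that one-shot submission looks like: the curl call below follows my understanding of CF's zone-file import endpoint, but the zone id, token, and file name are placeholders and the command is only printed (a dry run), so verify the endpoint against CF's API docs before relying on it.

```shell
#!/bin/sh
# placeholders -- substitute real values before actually submitting
ZONE_ID="0123456789abcdef"
ZONE_FILE="example.com.zone"
API="https://api.cloudflare.com/client/v4"

# build the one-shot import request: the whole bind9 zone file is
# uploaded as multipart form data in a single call
CMD="curl -s -X POST $API/zones/$ZONE_ID/dns_records/import -H 'Authorization: Bearer \$CF_API_TOKEN' -F 'file=@$ZONE_FILE'"

# dry run: show the command rather than hitting the API
echo "$CMD"
```

one submission per zone, instead of one call per RR.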
what i do now is start with a mysql template and create a cert-unique
table containing all the RRs. another script then extracts the RRs and
formats them into a bind9 config file. this file is then verified with
bind9's zone-checking utils before being submitted to CF. also, since
CF handles NS and SOA creation, these have to be stripped out of the
valid "named-type" file before it goes to CF. other than those changes,
CF now works with anything bind9 supports!
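a minimal sketch of that verify-then-strip step, with illustrative zone contents; it runs bind9's named-checkzone only when it is actually installed:

```shell
#!/bin/sh
# illustrative zone file (single-line records, RFC 1035 master format)
cat > example.com.zone <<'EOF'
$ORIGIN example.com.
$TTL 300
@    IN SOA ns1.example.com. hostmaster.example.com. ( 1 7200 3600 1209600 300 )
@    IN NS  ns1.example.com.
ns1  IN A   192.0.2.53
@    IN A   192.0.2.10
www  IN A   192.0.2.10
@    IN MX  10 mail.example.com.
EOF

# verify the full zone first, while the SOA/NS are still in place
if command -v named-checkzone >/dev/null 2>&1; then
    named-checkzone example.com example.com.zone
fi

# CF creates NS and SOA itself, so strip them before submission;
# $3 is the record-type field in these single-line records
awk '$3 != "SOA" && $3 != "NS"' example.com.zone > example.com.cf
```

the stripped example.com.cf is what would go to CF's import proc.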
although the NGOs have only limited support for or need of mysql, and
almost no need of bind9, i just install-and-forget these on their
systems to support the cert business. the vets, of course, go on to
commercial work where these apps are far more prevalent or can be
installed, so their training in them is vital.
for your purposes the mysql step should be handled in some different,
script-like way of getting to the bind9 config file. not only is the
existence of mysql in question, but sys admins are unlikely to grant
you access to mysql anyway. i do know there are apps that can create a
mysql-like table file, but this is probably overkill for what you need,
and it is actually not a necessity, since it merely reflects the way i
created the process.
i would have to get guidance from ISC as to whether their zone-checking
utils are truly "stand-alone" or not. the zone-check step not only
checks syntax, it also validates params (such as the validity of MX
addresses, for one). so, even though you will need few of the checks
ISC provides, the little 41 KB util is still worth putting in your
install pkg to guarantee that the CF submission succeeds, and you do
not have to duplicate what already works perfectly.
is this all worth it? i went back and analyzed the last 41 certs we
recently handled. due to the high average number of sub-domains
involved, i figure that it took 18,942 api queries to CF to do that
part. with my limited knowledge of what letsencrypt is doing, i
estimate that the bind9 process would have cut your CF queries to only
164. the JSON response has all the wealth of CF data (RR id, dates, …)
which could be left in a file in the domain's directory you create.
however, your sole interest is in the last JSON field:
'"success": true'. for us, we update the mysql tables for persistence
and add some of the data to a multi-dimensional, domain-indexed array
for sub-millisecond access (i.e., without the mysql overhead).
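checking that single flag from a shell script might look like this; the response body is an abbreviated, made-up sample, not a verbatim CF payload:

```shell
#!/bin/sh
# made-up sample of a CF-style reply; only the "success" flag matters here
RESPONSE='{"result":{"id":"abc123","name":"www.example.com"},"success":true,"errors":[],"messages":[]}'

# a real JSON parser (jq, python) is safer in general, but for this one
# boolean flag a pattern match is enough
if printf '%s' "$RESPONSE" | grep -q '"success": *true'; then
    STATUS=ok
else
    STATUS=failed
fi
echo "$STATUS"
```

the rest of the JSON (RR id, dates, …) can simply be dropped into a file in the domain's directory, as described above.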
right now i am awaiting a CF response to several questions. obviously,
the proc works for loading all the RRs of a domain which is virgin or
has had all its RRs deleted, which matches our own set-up use-case. your
use-cases are incremental/decremental, and those questions remain to be
answered by CF.
### QUESTIONS TO CF (open since 2018.DEC.01):
A.) will CF's proc take a bind9 file consisting of only additions and
fold them in? (THIS IS PART OF acme.sh's USE-CASE.)
B.) last year (mid 2017) someone in CF engineering said that api work
was being done to handle mass deletes; i neither heard back nor
remembered to follow up. what is the status of this work? (THIS IS PART
OF acme.sh's USE-CASE.)
C.) if the bind9 file contains matching RRs, what will your proc do
(i.e., reject everything, reject only dups, update existing from dups
and process additions, ignore dups and process additions, delete
everything and start with only what is in the file, or …)? (this
primarily relates to our own use-cases. it might apply to acme.sh if RR
deletes can somehow be handled through a bind9 update submission.)
on my side i certainly can test QUESTION A and let you know what i
find. i cannot directly test QUESTION B, since there is nothing yet in
CF's docs about how this can be done through the API. of course, i
could play around with variants of curl's '-X DELETE' option to see if
multiple RRs can be packed into the 'data' object, but this might not
be how CF intends to implement this proc. as to QUESTION C, while it
might not impact you directly, i would be happy, if you desire, to
update you when i get something definitive from CF.
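for reference, this is what deletes look like today: one api call per RR id, per CF's documented per-record DELETE endpoint. the zone id and RR ids are placeholders, and the commands are only echoed, not issued:

```shell
#!/bin/sh
ZONE_ID="0123456789abcdef"   # placeholder zone id
DELETED=0

# one DELETE per record id -- there is no documented bulk delete here
for RR_ID in rr-id-1 rr-id-2 rr-id-3; do
    # dry run: echo the call instead of issuing it
    echo "curl -s -X DELETE https://api.cloudflare.com/client/v4/zones/$ZONE_ID/dns_records/$RR_ID -H 'Authorization: Bearer \$CF_API_TOKEN'"
    DELETED=$((DELETED + 1))
done

echo "issued $DELETED delete calls"
```

with 70-84 sub-domains per apex, that per-record pattern is exactly what inflates the api call count.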
i laud your extensive efforts on acme.sh. please feel free to contact
me re any of the foregoing.
--
Thank you,
Johann
On Sat, 29 Dec 2018 14:00:57 +0000 (UTC) neil ***@***.***> wrote:
@vonp
Yes, it's reverted.
I tried to cache the response, but the response is too long to cache,
and we must use the api with `name=example.com`. We will see if
there is anything we can do for this issue.
|
let's keep it open. |
FYI, i am making some progress, since i am now dealing with a CF engineer who actually works on the bind9 part of the API. here is what i can report: adds are already on the table, with edits coming on-line in the near future. hopefully, deletes will also join the adds/edits in the bind9 method, so that there will be a universal method for submitting multiple RRs in one "throw". the only note of caution re this method is that out-of-zone domains must be handled separately … more-or-less as they are handled now. |
we have been dealing with senior CS people at cloudflare (CF) since early 2018.OCT, but they must then turn around and deal with actual network engineers. this has made the process protracted and somewhat opaque.
we had noticed that some apps suddenly were consuming huge amounts of wall-clock time for very little being done through CF. this is in a DC where CF has a POP and we normally experience sub-millisecond RTTs. research revealed that CF was tacking 5000 ms onto each query through their maintenance api. the engineers have not been able to pinpoint what triggers this event. we have noticed that the throttling goes away within 24 hours, and the duration seems independent of whether there is any use of CF's api after it starts.
this has also started up during the use of acme.sh for several domains where each of them had 70-84 wildcard sub-domains. we noticed from the logging of the transactions that there was a query for the zone data for each sub-domain since acme.sh does not cache the initial response. it would not be unheard-of for a system-protection mechanism such as throttling to be triggered by many duplicate queries in a short time-frame.
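a sketch of the kind of per-apex caching that would avoid those duplicate queries: fetch the zone data once per apex domain and reuse it for every sub-domain. here _fetch_zone is a stand-in for the real CF query and just counts how often it is called.

```shell
#!/bin/sh
# stand-in for the real CF zone query; counts invocations for the demo
FETCHES=0
_fetch_zone() {
    FETCHES=$((FETCHES + 1))
    echo "zone-id-for-$1"
}

# simple file-backed cache, keyed by apex domain
CACHE_DIR="${TMPDIR:-/tmp}/cf_zone_cache.$$"
mkdir -p "$CACHE_DIR"

get_zone_id() {
    domain="$1"
    cache="$CACHE_DIR/$domain"
    if [ ! -f "$cache" ]; then
        _fetch_zone "$domain" > "$cache"
    fi
    cat "$cache"
}

# 70-84 wildcard sub-domains of one apex would normally mean that many
# zone queries; with the cache, the apex is fetched exactly once
for sub in www api mail vpn; do
    get_zone_id "example.com" >/dev/null
done
echo "fetches: $FETCHES"
```

one remote query per apex, no matter how many sub-domains follow, which should stay well clear of any duplicate-query heuristic on CF's side.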
whatever the cause may be, people should be aware that something is causing CF to begin throttling queries when a large number of sub-domains is being processed from a single domain.tld base.
attached is a commented log of a sub-domain transaction that was submitted to CF engineering that highlights the latency problem.
CF_RR_latency_acme.txt