MySQL backend restart issue #3824
|
Hi manique, this is fixed in the current master (in #3666). We will release alpha3 soon, which should land in Ubuntu Xenial. If you want to use PowerDNS 4, I would recommend using the master packages at https://repo.powerdns.com/ for now. |
|
I was experiencing the issue in #3535, so I installed the alpha3 version. It seems like PDNS loses its connection to MySQL for some reason: I run a query, it returns REFUSED; I run another query, it returns the correct data. Is it supposed to be losing the DB connection so frequently? |
|
No, it's not supposed to lose it frequently. It is also not supposed to use REFUSED for that. Reopening this ticket - if you have any logs or other information, please provide them. |
|
If you are also experiencing #3535 in alpha3, please let us know. |
|
Here's my logfile. The resulting queries: |
|
@willtorres we need output from reliable tools like |
|
These were done in succession, about 2 seconds apart |
|
Ok - now we see |
|
Well, in my case this was also SERVFAIL. I think we need more system logs here. |
|
BTW, can anyone please say which exact Debian package version has this fix? I would still like to move 4.x to production ;) |
|
In my case, it does not go down very often. I will look for alpha3 in Debian. Thanks. |
|
alpha3 is now in Debian sid, and of course also at https://repo.powerdns.com/ |
|
Recently performed an in-place upgrade of PowerDNS 3.3 (Ubuntu 14.04/Trusty build) to PowerDNS 4.0.0-alpha3 (PowerDNS Repository build) on an Ubuntu 14.04/Trusty host; also seeing intermittent MySQL backend drops. I have not previously experienced this with PowerDNS 3.3, and no other moving parts of the setup have changed. MySQL libraries on client: MySQL server: The servers this has been occurring on are very lightly loaded (an average of under 5 queries/sec), and the backend drops have been occurring at a frequency of no more than once a day so far. Connectivity appears to be recovered automatically on a subsequent query. |
|
5 qps and lost connections for TCP looks like the MySQL server is closing idle TCP backend connections after MySQL's wait_timeout. |
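For reference, the server-side timeout in question can be inspected and raised like this (the values shown are examples, not recommendations; as discussed elsewhere in this thread, raising it only papers over the underlying reconnect problem):

```sql
-- Check the idle-connection timeout the server applies (in seconds)
SHOW GLOBAL VARIABLES LIKE 'wait_timeout';

-- Raise it globally (example value: 1 day); only affects new sessions
SET GLOBAL wait_timeout = 86400;

-- Or persist it in my.cnf under [mysqld]:
--   wait_timeout = 86400
```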
|
@mind04 Thank you so much! I've increased the wait_timeout to the maximum, and my connections are staying alive. |
|
Upon further investigation, it appears that the second log line is significant. While the server in question handles an average of 5 queries/sec, the bulk of those queries are UDP. The 5 queries/second load keeps the MySQL connections of the 3 default distributor threads sufficiently utilized to prevent time-outs even with a fairly aggressive wait_timeout setting (although this may not be the case on even quieter servers). The TCP receiver, however, maintains its own backend thread. In the above scenario, it sees just a fraction of the load, the time-out is hit fairly regularly, and the connection is lost. Receiving the first TCP query in a long while is what triggered the issue.

This leaves open the question of why a lost connection would result in an error in the first place. After all, the gmysqlbackend makes use of the libmysqlclient MYSQL_OPT_RECONNECT option, which should result in a transparent reconnect after a time-out, rather than in an error. The answer may lie in the documentation of this feature itself. Amongst the caveats, the following is listed:

A cursory look through the code suggests that gsql's setDB function prepares the statements ahead of time when a database backend is initialized. Perhaps when a previously prepared statement is executed after a silent reconnect by libmysqlclient, the prepared statement, as per the documentation, is no longer available, and an error occurs, resulting in the behaviour outlined in this issue? |
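The failure mode described above can be illustrated with a toy simulation. This is not real libmysqlclient code; the classes below are stand-ins that mimic the documented behaviour: statements are prepared once at init, the client reconnects silently (as with MYSQL_OPT_RECONNECT), and server-side prepared-statement handles do not survive the new session.

```python
class FakeServer:
    """Stand-in for a MySQL server enforcing wait_timeout."""
    def __init__(self):
        self.session = 0
        self.prepared = set()        # (session, stmt_id) handles

    def drop_idle_connection(self):  # wait_timeout fired
        self.session += 1            # old session, and its handles, are gone
        self.prepared = set()

class FakeClient:
    """Mimics MYSQL_OPT_RECONNECT: reconnects silently, hiding the drop."""
    def __init__(self, server):
        self.server = server
        self.session = server.session

    def prepare(self, stmt_id):
        self._reconnect_if_needed()
        self.server.prepared.add((self.session, stmt_id))
        return (self.session, stmt_id)

    def execute(self, handle):
        self._reconnect_if_needed()  # transparent reconnect...
        if handle not in self.server.prepared:
            raise RuntimeError("ERROR 1243: Unknown prepared statement handler")
        return "rows"

    def _reconnect_if_needed(self):
        if self.session != self.server.session:
            self.session = self.server.session   # silently reconnected

server = FakeServer()
client = FakeClient(server)
handle = client.prepare("lookup-soa")     # prepared once at backend init
assert client.execute(handle) == "rows"   # fine while the session lives

server.drop_idle_connection()             # wait_timeout expires
try:
    client.execute(handle)                # reconnects, but the handle is stale
except RuntimeError as e:
    print(e)                              # the "unknown prepared statement" error
```

The point of the sketch: the reconnect itself succeeds, so the application sees a healthy connection right up until it executes a statement handle that the new session has never heard of.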
|
Yes, it certainly is looking like we may need to work on re-preparing the statements, or perhaps even just handling the reconnection ourselves. |
|
There is no obvious way to trigger on the reconnection attempt and re-prepare the queries in-query, since the raison d'être of MYSQL_OPT_RECONNECT is to hide what has transpired from the application. There are, however, a number of ways to work around this:

1. Switch from query preparation at backend thread init to a just-in-time model, and ping the database with mysql_ping() before executing each query. By keeping track of and comparing the mysql_thread_id before and after mysql_ping(), a reconnect can be detected and the queries re-prepared prior to execution.
2. Drop MYSQL_OPT_RECONNECT and mask the problem with a regular mysql_ping(). By simply firing off a mysql_ping() on a timer with intervals shorter than wait_timeout, it should be possible to prevent session time-outs in the first place. In such a scenario, MYSQL_OPT_RECONNECT can be disabled, with the ultimate fallback becoming a full backend restart.

Since the client can get and set the wait_timeout variable for its own session, gmysqlbackend could also first determine the current value in order to configure its mysql_ping() timer, or override it during init to a sensible value in order to reduce noise. |
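The ping-and-re-prepare approach above can be sketched as follows. Again, this is a toy simulation, not PowerDNS code: the Connection class is a stand-in for mysql_ping()/mysql_thread_id() with auto-reconnect enabled, and the statement names and SQL are placeholders.

```python
class Connection:
    """Stand-in for a MySQL connection whose ping() auto-reconnects."""
    _next_id = 1

    def __init__(self):
        self._connect()
        self.prepared = {}

    def _connect(self):
        self.thread_id = Connection._next_id   # new mysql_thread_id
        Connection._next_id += 1
        self.alive = True

    def drop(self):                 # server closed us (wait_timeout)
        self.alive = False

    def ping(self):                 # mysql_ping() with reconnect enabled
        if not self.alive:
            self._connect()
            self.prepared = {}      # new session: old handles are gone

    def prepare(self, name, sql):
        self.prepared[name] = sql

    def execute(self, name):
        if name not in self.prepared:
            raise RuntimeError("unknown prepared statement")
        return "rows"

STATEMENTS = {"soa": "SELECT ... FROM domains ..."}  # placeholder SQL

def safe_execute(conn, name):
    before = conn.thread_id
    conn.ping()                          # may silently reconnect
    if conn.thread_id != before:         # thread id changed: reconnect detected
        for n, sql in STATEMENTS.items():
            conn.prepare(n, sql)         # re-prepare everything
    return conn.execute(name)

conn = Connection()
for n, sql in STATEMENTS.items():
    conn.prepare(n, sql)
assert safe_execute(conn, "soa") == "rows"
conn.drop()                              # idle timeout hits
assert safe_execute(conn, "soa") == "rows"   # survives the reconnect
```

The cost of this scheme is one extra round-trip (the ping) per query; the benefit is that a reconnect is detected deterministically rather than surfacing as a failed lookup.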
|
Partially fixed by #3937 |
|
Should be fixed in #3937 |
|
I'm still having this problem on 4.0.1 with TCP connections on my master server (only serving slaves). The first slave that connects for an AXFR after more than 'wait_timeout' seconds (a MySQL setting, default 8 hours) gets a disconnected TCP session and misses the update. I will work around this by increasing the wait_timeout setting, but I think pdns-gmysql should prevent the connection/transfer from failing, e.g. by a keepalive ping to MySQL or a reconnect before failing the TCP session, just as @wk suggested. logs from master: logs from failing slave: |
|
This indeed appears to not function ideally in 4.0.1, and may be unintentional. I can replicate this behaviour. What appears to have happened is this:
As a result of both these commits being merged, a situation emerged in which the precondition for MYSQL_OPT_RECONNECT being functional has in fact been resolved, but the feature itself has been forcefully disabled, leading to the outcome @CaptainQwark is seeing. |
|
I am seeing this on pdns 4.0.3, mysql (or Percona, have tried both) 5.7.17. This causes the AXFR from the slave to fail - Slave log Master log The next AXFR query is successful, as it opens a new connection. We already have a Any chance this will be fixed in 4.0.4? Are there any other workarounds other than increasing the Can the slave be configured to run the AXFR on a schedule that is more frequent than the connect timeout? I guess that doesn't guarantee that the connections in the retrieval pool will be fresh, though. Would setting |
|
As we have been unable to figure out a reliable way to fix this, and it appears this only affects machines with very low traffic, I am sad to say I am removing the 4.1 milestone from this. |
|
That's a shame. Any chance we could get the slave to retry the AXFR query? It would only need to retry once, from what I can see. |
|
In my case, although not particularly high traffic, dnsdist is performing aggressive caching so that exacerbates the issue. |
|
Just curious, are there any downsides to running, for example, That should fix the issue, shouldn't it? |
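The command elided above is presumably a periodic dig against the server itself, which the next reply confirms works in practice. A hypothetical crontab entry for such a keepalive might look like this (the zone name, interval, and user are placeholders, not from this thread):

```
# Hypothetical keepalive: query the local authoritative server every
# minute so the backend connections stay under wait_timeout.
# "example.com" is a placeholder zone; adjust user/interval as needed.
* * * * * root dig @127.0.0.1 example.com SOA > /dev/null 2>&1
* * * * * root dig @127.0.0.1 example.com SOA +tcp > /dev/null 2>&1
```

The second, +tcp line matters because, as noted earlier in the thread, the TCP receiver maintains its own backend connection, which UDP traffic alone will not keep alive.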
|
@stecklars that's what I've ended up doing, and it's working fine in my case. No more errors for over a week. |
|
We do this as part of our monitoring to check a couple of records we are authoritative for and also some we are not to test that recursion is also working, but not specifically as a keepalive on the mysql connection. The AXFR is trickier as there is a separate backend/thread pool for this. We were already checking that the SOA serial matches when querying the master and slave using |
|
#5245 should be of extreme interest to you and testing would be appreciated. |
|
Ping! We cannot fix bugs if you don't test our fixes! |
|
Not sure, but my problem may have nothing to do with wait_timeout (I have it set at 10 sec), as I get this error at a frequency usually as high as one every one to three seconds:
This started as soon as we upgraded to version 4 and has been happening ever since, and the dig trick makes no difference. The database is also used by PHP + a webserver without any problem. |
|
A 10 second |
|
How can I get the path once I have the repo set up? |
|
I'm getting my lab set up now, and will test this fix as soon as I'm done. I will update this comment accordingly. Thanks, and sorry for the delay. |
|
Testing packages (based on the master branch + the fixes in #5245) are now available at https://downloads.powerdns.com/autobuilt/. Browse to your flavour, then find the files with 'authsqlconnectionreset' in their name. |
|
I've been testing this for the last few days. Before, the connection dropped within minutes... we're now at over a day without the same error. Looks good. |
|
Hi, sorry, but... I'm unable to find out whether 4.0.4 has this fix? Best regards, |
|
@Poil this fix is not in 4.0.4. It is on the master branch, and will be in 4.1.0. A release candidate for 4.1.0 should come within one or two weeks. |
|
I'm running |
|
When I was using PowerDNS, I configured the session timeout to 1 year. (I no longer work for the same company, and I didn't keep a record of how I did that.) |
|
@Poil thanks for the tip. What is strange is that it only happens on zone transfers. The server is running without any issues and answering dig queries just fine; however, on AXFR this error happens. |
|
Yes, it's because the slave doesn't retry (and gets an error when the session has timed out); it comes back only after the SOA retry interval if there is no new notify (a change to the zone).
Did you consider using the 4.1 packages we provide at https://repo.powerdns.com? |
|
Running PDNS 4.0.4 as the authoritative DNS server for quite a number of domains. The issue only happens on TCP queries (UDP queries are so frequent that they don't cause the session timeout to hit).

I use the following workaround to increase the session timeout for MySQL connections used by PDNS: add a section in /etc/my.cnf, and in pdns.conf set "gmysql-group": This causes pdns to read the [powerdns] section in my.cnf and set a long session timeout. In my case, 1 day was enough to also cover the less frequent TCP queries.

UDP and TCP queries use individual database connections (running in separate threads?), so keeping UDP alive with sufficient queries still will not fix the issue with TCP time-outs. Since the above-mentioned change, I have not observed any further database connection timeouts / reconnects.

/Thomas |
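A sketch of the workaround described above. The gmysql-group setting is a real PowerDNS option; the exact my.cnf contents are an assumption based on this description (setting the session wait_timeout via the client library's init-command option), and the 86400-second value is the "1 day" mentioned:

```ini
# /etc/my.cnf -- option group read by PowerDNS's MySQL client library.
# The init-command below is an assumed way to raise the session timeout;
# 86400 s = 1 day, as in the comment above.
[powerdns]
init-command=SET SESSION wait_timeout=86400
```

```ini
# pdns.conf -- tell the gmysql backend to read the [powerdns] group
# instead of the default [client] group
gmysql-group=powerdns
```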
|
@rgacogne so can I just pin the repos and run |
Please visit https://www.powerdns.com/opensource.html to see how you can reach us for support. |
Hi there!
Having problems with restarting the MySQL backend while PowerDNS is running (Ubuntu 16.04, PowerDNS 4.0.0~alpha2-3build1).
After a restart, both MySQL and PowerDNS are up, and I see the following statements in journalctl:
Afterwards I restart MySQL. It breaks everything down to pieces:
May 06 00:09:11 vova pdns[30268]: Backend reported condition which prevented lookup
There is only one cure: restart PowerDNS. That doesn't look like a good solution.