Auth: Incremental backoff for failed slave checks #4953

Merged
4 changes: 4 additions & 0 deletions docs/markdown/authoritative/modes-of-operation.md
@@ -65,6 +65,10 @@ is higher, the domain is retrieved and inserted into the database. In any case,
after the check the domain is declared 'fresh', and will only be checked again
after '**refresh**' seconds have passed.

When the freshness of a domain cannot be checked, e.g. because the master is offline, PowerDNS will retry the domain after [`slave-cycle-interval`](settings.md#slave-cycle-interval) seconds.
Every time the domain fails its freshness check, PowerDNS will hold back on checking the domain for `number of consecutive failures * slave-cycle-interval` seconds, with a maximum of [`soa-retry-default`](settings.md#soa-retry-default) seconds between checks.
With default settings, this means that PowerDNS will back off for 1, then 2, then 3 etc. minutes, to a maximum of 60 minutes between checks.

**Warning**: Slave support is OFF by default, turn it on by adding [`slave`](settings.md#slave) to the configuration.
**Note**: When running PowerDNS via the provided systemd service file, [`ProtectSystem`](http://www.freedesktop.org/software/systemd/man/systemd.exec.html#ProtectSystem=) is set to `full`. This means PowerDNS cannot write to e.g. `/etc` and `/home`, so it may be unable to write AXFR'd zones.
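
The back-off schedule described in the documentation change can be checked with a few lines of arithmetic. This is only a sketch: `backoffSeconds` is a hypothetical helper that does not exist in PowerDNS, and the values 60 and 3600 are assumed defaults for `slave-cycle-interval` and `soa-retry-default`, inferred from the "1, then 2, then 3 etc. minutes, to a maximum of 60 minutes" wording.

```cpp
#include <cstdint>

// Hypothetical helper, not part of PowerDNS: seconds to wait after the
// Nth consecutive failed freshness check, capped at retryCap.
constexpr std::uint64_t backoffSeconds(std::uint64_t failures,
                                       std::uint64_t cycleInterval,
                                       std::uint64_t retryCap)
{
  // Same result as std::min(failures * cycleInterval, retryCap) in the patch.
  return failures * cycleInterval < retryCap ? failures * cycleInterval : retryCap;
}

// Assumed defaults: slave-cycle-interval=60, soa-retry-default=3600.
static_assert(backoffSeconds(1, 60, 3600) == 60, "first failure: 1 minute");
static_assert(backoffSeconds(3, 60, 3600) == 180, "third failure: 3 minutes");
static_assert(backoffSeconds(60, 60, 3600) == 3600, "cap reached at 60 failures");
```

Under those assumed defaults the delay grows linearly and stops growing once the failure count reaches 60.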

5 changes: 5 additions & 0 deletions pdns/communicator.hh
@@ -217,6 +217,11 @@ private:
bool d_masterschanged, d_slaveschanged;
bool d_preventSelfNotification;

// Used to keep some state on domains that failed their freshness checks.
// uint64_t == counter of the number of failures (increased by 1 every consecutive slave-cycle-interval that the domain fails)
// time_t == wait at least until this time before attempting a new check
map<DNSName, pair<uint64_t, time_t> > d_failedSlaveRefresh;

struct RemoveSentinel
{
explicit RemoveSentinel(const DNSName& dn, CommunicatorClass* cc) : d_dn(dn), d_cc(cc)
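
To see how the `d_failedSlaveRefresh` map is meant to be used across refresh cycles, here is a minimal standalone sketch. `FailedRefreshTracker` and its method names are hypothetical and not part of the patch, and zone names are plain `std::string` values instead of `DNSName` so the example compiles on its own.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <ctime>
#include <map>
#include <string>
#include <utility>

// Hypothetical stand-in for the d_failedSlaveRefresh member and the logic around it.
class FailedRefreshTracker
{
public:
  FailedRefreshTracker(std::uint64_t cycleInterval, std::uint64_t retryCap)
    : d_interval(cycleInterval), d_cap(retryCap)
  {
    assert(cycleInterval > 0);
  }

  // True while the zone is still inside its back-off window.
  bool shouldSkip(const std::string& zone, std::time_t now) const
  {
    const auto it = d_failed.find(zone);
    return it != d_failed.end() && now < it->second.second;
  }

  // Record one more consecutive failure and return the new failure count.
  std::uint64_t recordFailure(const std::string& zone, std::time_t now)
  {
    auto& entry = d_failed[zone]; // value-initialised to {0, 0} on first use
    entry.first += 1;
    entry.second = now + static_cast<std::time_t>(std::min(entry.first * d_interval, d_cap));
    return entry.first;
  }

  // A successful check clears the state; erase() on a missing key is a no-op.
  void recordSuccess(const std::string& zone) { d_failed.erase(zone); }

private:
  std::uint64_t d_interval;
  std::uint64_t d_cap;
  // failure counter, earliest time of the next check
  std::map<std::string, std::pair<std::uint64_t, std::time_t>> d_failed;
};
```

Keeping the counter and the next-check time in one entry means a single lookup answers both "how often has this failed" and "may we try again yet".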
22 changes: 21 additions & 1 deletion pdns/slavecommunicator.cc
@@ -748,8 +748,13 @@ void CommunicatorClass::slaveRefresh(PacketHandler *P)
{
Lock l(&d_lock);
domains_by_name_t& nameindex=boost::multi_index::get<IDTag>(d_suckdomains);
time_t now = time(0);

for(DomainInfo& di : rdomains) {
const auto failed = d_failedSlaveRefresh.find(di.zone);
if (failed != d_failedSlaveRefresh.end() && now < failed->second.second )
// If the domain has failed before and the time before the next check has not expired, skip this domain
continue;
std::vector<std::string> localaddr;
SuckRequest sr;
sr.domain=di.zone;
@@ -828,6 +833,7 @@ void CommunicatorClass::slaveRefresh(PacketHandler *P)
L<<Logger::Warning<<"Received serial number updates for "<<ssr.d_freshness.size()<<" zone"<<addS(ssr.d_freshness.size())<<", had "<<ifl.getTimeouts()<<" timeout"<<addS(ifl.getTimeouts())<<endl;

typedef DomainNotificationInfo val_t;
time_t now = time(0);
for(val_t& val : sdomains) {
DomainInfo& di(val.di);
// might've come from the packethandler
@@ -836,8 +842,22 @@ void CommunicatorClass::slaveRefresh(PacketHandler *P)
continue;
}

-if(!ssr.d_freshness.count(di.id)) // what does this mean? XXX
+if(!ssr.d_freshness.count(di.id)) { // If we don't have an answer for the domain
uint64_t newCount = 1;
const auto failedEntry = d_failedSlaveRefresh.find(di.zone);
if (failedEntry != d_failedSlaveRefresh.end())
newCount = d_failedSlaveRefresh[di.zone].first + 1;
time_t nextCheck = now + std::min(newCount * d_tickinterval, (uint64_t)::arg().asNum("soa-retry-default"));
d_failedSlaveRefresh[di.zone] = {newCount, nextCheck};
if (newCount == 1 || newCount % 10 == 0)
L<<Logger::Warning<<"Unable to retrieve SOA for "<<di.zone<<", this was the "<<(newCount == 1 ? "first" : std::to_string(newCount) + "th")<<" time."<<endl;
continue;
}

const auto wasFailedDomain = d_failedSlaveRefresh.find(di.zone);
if (wasFailedDomain != d_failedSlaveRefresh.end())
d_failedSlaveRefresh.erase(di.zone);

uint32_t theirserial = ssr.d_freshness[di.id].theirSerial, ourserial = di.serial;

if(rfc1982LessThan(theirserial, ourserial) && ourserial != 0) {
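
The warning added in this hunk is rate-limited: it fires on the first failure and then on every 10th. A small sketch of that rule, with hypothetical helper names (`shouldLogFailure`, `failureMessage`) and a plain `std::string` in place of `DNSName`:

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Hypothetical helper: log the first failure and every 10th one after that.
bool shouldLogFailure(std::uint64_t count)
{
  return count == 1 || count % 10 == 0;
}

// Hypothetical helper building the same wording as the warning in the patch.
// The "th" suffix is always right here because, apart from 1, only
// multiples of 10 are ever logged.
std::string failureMessage(const std::string& zone, std::uint64_t count)
{
  assert(count > 0); // the counter starts at 1 on the first failure
  const std::string ordinal = (count == 1) ? "first" : std::to_string(count) + "th";
  return "Unable to retrieve SOA for " + zone + ", this was the " + ordinal + " time.";
}
```

With this rule a zone whose master stays offline produces one warning immediately and then one per ten failed checks, rather than one per check.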