Master caches stale records, causing slaves to skip AXFR #427

Closed
Habbie opened this Issue Apr 26, 2013 · 5 comments

@Habbie
Member
Habbie commented Apr 26, 2013

I recently tracked down a bug that causes slaves to fail to update a zone from a PowerDNS master. The issue was very hard to reproduce because it appeared seemingly at random, only about 10% of the time. I finally managed to capture a dump of the DNS traffic while the bug was occurring.

This is how the bug usually appears:

  1. The PowerDNS master detects a SOA change and notifies all slaves.
  2. The slaves connect back to the master and request the zone's SOA.
  3. The master replies to the slaves with the old, incorrect SOA.
  4. The slaves do not update as they believe that they already have the newest version of the zone.
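
To make step 4 concrete, here is a minimal sketch of the check a slave performs after a NOTIFY. It assumes the slave compares 32-bit SOA serials using RFC 1982 serial number arithmetic; the function names are hypothetical and are not taken from BIND or PowerDNS.

```python
def serial_is_newer(candidate: int, current: int) -> bool:
    """Return True if `candidate` is newer than `current` (RFC 1982, 32-bit)."""
    diff = (candidate - current) & 0xFFFFFFFF
    # A difference of exactly 2**31 is undefined in RFC 1982; treat it as "not newer".
    return diff != 0 and diff < 0x80000000


def should_transfer(local_serial: int, master_serial: int) -> bool:
    """Decide whether a slave would start an AXFR after a NOTIFY."""
    return serial_is_newer(master_serial, local_serial)


# The master's backend already holds serial 2013042602, but the cached SOA
# still carries 2013042601, which is what the slave already has.
print(should_transfer(local_serial=2013042601, master_serial=2013042601))
print(should_transfer(local_serial=2013042601, master_serial=2013042602))
```

With the stale answer the slave sees no newer serial and skips the transfer; with the fresh answer it would transfer.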

Fixing this bug also fixes many other cache-related problems, such as clients and slaves being served differing SOA records, and the master serving outdated records even after it has notified slaves. (The bug prevents the cache from being cleared when a new SOA is detected.)

To reproduce this bug locally, with BIND slave(s):

  1. Set cache-ttl high (such as 300)
  2. Increment the SOA of the domain, and wait for the slaves to update (this update will happen correctly, because a SOA record hasn't been cached yet)
  3. Before the cache TTL expires, increment the SOA again

After step 3, you'll notice that the slaves are notified and acknowledge the notification, but they don't AXFR. This is because they received the stale SOA from the cache.
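
The timeline above can be sketched as a toy simulation. This is a simplified model, not PowerDNS or BIND code: `ToyMaster`, `ToySlave`, the `purge_on_change` switch, and the plain `>` serial comparison are all assumptions made for illustration, and the AXFR is modelled as reading the backend directly.

```python
class ToyMaster:
    """Master with a single cached SOA answer and a cache TTL in seconds."""

    def __init__(self, serial, cache_ttl, purge_on_change):
        self.serial = serial                  # serial stored in the backend
        self.cache_ttl = cache_ttl
        self.purge_on_change = purge_on_change
        self._cached = None                   # (serial, expires_at) or None

    def bump_serial(self):
        self.serial += 1
        if self.purge_on_change:              # the behaviour the fix restores
            self._cached = None

    def query_soa(self, now):
        if self._cached is not None and now < self._cached[1]:
            return self._cached[0]            # answered from the cache
        self._cached = (self.serial, now + self.cache_ttl)
        return self.serial


class ToySlave:
    def __init__(self, serial):
        self.serial = serial
        self.transfers = 0

    def on_notify(self, master, now):
        if master.query_soa(now) > self.serial:
            self.serial = master.serial       # toy AXFR, bypasses the cache
            self.transfers += 1


def reproduce(purge_on_change):
    master = ToyMaster(serial=1, cache_ttl=300, purge_on_change=purge_on_change)
    slave = ToySlave(serial=1)
    master.bump_serial()                      # step 2: first increment
    slave.on_notify(master, now=0)
    master.bump_serial()                      # step 3: second increment, TTL not expired
    slave.on_notify(master, now=60)
    return slave.serial, slave.transfers


print(reproduce(purge_on_change=False))  # cache never cleared: slave stays behind
print(reproduce(purge_on_change=True))   # cache cleared on change: slave catches up
```

In this model the first update always succeeds because nothing is cached yet, and only the second update, made inside the TTL window, is missed.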

This bug is mitigated or made harder to reproduce by having a shorter cache TTL, preventing slaves from periodically checking the SOA, and/or preventing direct querying of the master server. However, all current installations where PowerDNS is the master are susceptible to this problem (assuming the slaves check the SOA before they AXFR).

This bug appears to have been introduced in svn revision 1221, so it is likely present in all PowerDNS versions compiled since June 2008.

I have written two separate patches, depending on how you would like to fix this. The "clean" patch, which I would prefer, splits PacketCache::purge into two functions: one that takes no arguments and clears the entire cache, and one that takes a const string argument and clears all cache entries related to the zone it names. The "ugly" patch is much shorter (one line); it just inserts a "dummy" argument into the temporary vector passed to PacketCache::purge.
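
As a rough illustration of the shape of the "clean" approach, here is a small sketch in Python rather than the C++ of the actual patch. `ToyPacketCache`, `purge_all`, and `purge_zone` are hypothetical names, and the suffix-matching rule for "entries related to the zone" is an assumption; the real change is in the linked purge_clean.patch.

```python
class ToyPacketCache:
    """Toy cache keyed by (qname, qtype); not the PowerDNS PacketCache."""

    def __init__(self):
        self._entries = {}

    @staticmethod
    def _norm(name):
        # Compare names case-insensitively and without the trailing dot.
        return name.lower().rstrip(".")

    def insert(self, qname, qtype, answer):
        self._entries[(self._norm(qname), qtype)] = answer

    def lookup(self, qname, qtype):
        return self._entries.get((self._norm(qname), qtype))

    def purge_all(self):
        """No-argument variant: clear the whole cache, return entries removed."""
        removed = len(self._entries)
        self._entries.clear()
        return removed

    def purge_zone(self, zone):
        """Zone variant: remove the zone apex and every name below it."""
        zone = self._norm(zone)
        doomed = [key for key in self._entries
                  if key[0] == zone or key[0].endswith("." + zone)]
        for key in doomed:
            del self._entries[key]
        return len(doomed)


cache = ToyPacketCache()
cache.insert("example.com.", "SOA", "serial 1")
cache.insert("www.example.com.", "A", "192.0.2.1")
cache.insert("notexample.com.", "A", "192.0.2.2")
print(cache.purge_zone("example.com."))   # removes the apex and www only
print(cache.lookup("notexample.com", "A"))
```

Keeping the two operations separate means a caller that wants to purge one zone cannot accidentally fall into a different code path because of how an argument list happens to be built.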

@Habbie Habbie was assigned Apr 26, 2013
@Habbie Habbie closed this Apr 26, 2013
Habbie commented Apr 26, 2013

Attachment: purge_clean.patch (https://gist.github.com/5466729)

Habbie commented Apr 26, 2013

Attachment: purge_ugly.patch (https://gist.github.com/5466730)

Habbie commented Apr 26, 2013

Author: anon
I forgot to mention: this was the bug I mentioned in #powerdns a few days ago. (I'm mr_flea)

Habbie commented Apr 26, 2013

Author: anon
A small correction:

This issue is also avoided if the cache-ttl option is set to 0, as that prevents anything from being cached at all. (This is obviously not desirable for servers that accept requests from the internet.)

  • Keith Buck <mr_flea at esper.net>

Habbie commented Apr 26, 2013

Author: ahu
Applause! Thank you very much, I merged the pretty patch.
