Duplicate entry 'ifojedaijee.example' for key 'name_index' #453

Closed
Habbie opened this Issue Apr 26, 2013 · 15 comments

Projects

None yet

1 participant

@Habbie
Member
Habbie commented Apr 26, 2013

Hi,

running a local compile of PowerDNS 3.1 RC2 20120420.2574 as a slave with gmysql backend. Master is the same software, and a script creates quite some traffic on the database.

The slave sporadically dies with "Duplicate entry 'ifojedaijee.example' for key 'name_index'". Log entries, grepped for the domain name in question:

root@pdns-debug:/home/mhaber# grep ifojedaijee.example /var/log/syslog
Apr 24 14:24:43 pdns-debug pdns-pdns-gmysql-master[10622]: Queued notification of domain 'ifojedaijee.example' to 192.168.1.1
Apr 24 14:24:43 pdns-debug pdns-pdns-gmysql-master[10622]: Queued notification of domain 'ifojedaijee.example' to 192.168.1.2
Apr 24 14:25:20 pdns-debug pdns-pdns-gmysql-slave[10610]: Received NOTIFY for ifojedaijee.example from 192.168.1.1 for which we are not authoritative
Apr 24 14:25:22 pdns-debug pdns-pdns-gmysql-master[10622]: Received NOTIFY for ifojedaijee.example from 192.168.1.1 which is not a master
Apr 24 14:25:23 pdns-debug pdns-pdns-gmysql-slave[10610]: Received NOTIFY for ifojedaijee.example from 192.168.1.1 for which we are not authoritative
Apr 24 14:25:25 pdns-debug pdns-pdns-gmysql-master[10622]: Received NOTIFY for ifojedaijee.example from 192.168.1.1 which is not a master
Apr 24 14:25:28 pdns-debug pdns-pdns-gmysql-slave[10610]: Received NOTIFY for ifojedaijee.example from 192.168.1.1 for which we are not authoritative
Apr 24 14:25:30 pdns-debug pdns-pdns-gmysql-master[10622]: Received NOTIFY for ifojedaijee.example from 192.168.1.1 which is not a master
Apr 24 14:25:35 pdns-debug pdns-pdns-gmysql-slave[10610]: Created new slave zone 'ifojedaijee.example' from supermaster 192.168.1.1, queued axfr
Apr 24 14:25:36 pdns-debug pdns-pdns-gmysql-slave[10610]: Initiating transfer of 'ifojedaijee.example' from remote '192.168.1.1'
Apr 24 14:25:36 pdns-debug pdns-pdns-gmysql-master[10622]: AXFR of domain 'ifojedaijee.example' initiated by 192.168.1.2
Apr 24 14:25:36 pdns-debug pdns-pdns-gmysql-master[10622]: AXFR of domain 'ifojedaijee.example' allowed: client IP 192.168.1.2 is in allow-axfr-ips
Apr 24 14:25:37 pdns-debug pdns-pdns-gmysql-slave[10610]: AXFR started for 'ifojedaijee.example', transaction started
Apr 24 14:25:37 pdns-debug pdns-pdns-gmysql-slave[10610]: AXFR done for 'ifojedaijee.example', zone committed with serial number 3
Apr 24 14:25:37 pdns-debug pdns-pdns-gmysql-master[10622]: AXFR of domain 'ifojedaijee.example' to 192.168.1.2 finished
Apr 24 14:25:39 pdns-debug pdns-pdns-gmysql-master[10622]: Removed from notification list: 'ifojedaijee.example' to 192.168.1.2:53 (was acknowledged)
Apr 24 14:25:43 pdns-debug pdns-pdns-gmysql-master[10622]: Received unsuccessful notification report for 'ifojedaijee.example' from 192.168.1.1:53, rcode: 5
Apr 24 14:25:43 pdns-debug pdns-pdns-gmysql-master[10622]: Removed from notification list: 'ifojedaijee.example' to 192.168.1.1:53
Apr 24 14:25:51 pdns-debug pdns-pdns-gmysql-master[10622]: Received spurious notify answer for 'ifojedaijee.example' from 192.168.1.2:53
Apr 24 14:26:00 pdns-debug pdns-pdns-gmysql-master[10622]: Received unsuccessful notification report for 'ifojedaijee.example' from 192.168.1.1:53, rcode: 5
Apr 24 14:26:00 pdns-debug pdns-pdns-gmysql-master[10622]: Received spurious notify answer for 'ifojedaijee.example' from 192.168.1.1:53
Apr 24 14:26:02 pdns-debug pdns-pdns-gmysql-master[10622]: Received spurious notify answer for 'ifojedaijee.example' from 192.168.1.2:53
Apr 24 14:26:22 pdns-debug pdns-pdns-gmysql-slave[10610]: Communicator thread died because of error: Database error trying to insert new slave 'ifojedaijee.example': Failed to execute mysql_query, perhaps connection died? Err=1: Duplicate entry 'ifojedaijee.example' for key 'name_index'
root@pdns-debug:/home/mhaber#

in this (lab) case, Master and Slave are running on the same box (but on different IP addresses and on different databases, albeit in the same MySQL instance).

The Master continues to run just fine, so I suggest that there is no duplicate entry. Maybe the slave struggles about being notified three times in a row?

Please also notice that the MySQL error happens like ~ 30 seconds after the domain was touched by the slave for the last time, and 20 seconds after the domain was touched for the last time at all.

@Habbie Habbie was assigned Apr 26, 2013
@Habbie Habbie closed this Apr 26, 2013
@Habbie
Member
Habbie commented Apr 26, 2013

Attachment '' (pdns-syslog) https://gist.github.com/5466738

@Habbie
Member
Habbie commented Apr 26, 2013

Attachment '' (zone) https://gist.github.com/5466739

@Habbie
Member
Habbie commented Apr 26, 2013

Attachment 'supermaster anti-race workaround' (supermaster-race.patch) https://gist.github.com/5466740

@Habbie
Member
Habbie commented Apr 26, 2013

Author: anon
Ouch. I'll attach the log as a file.

@Habbie
Member
Habbie commented Apr 26, 2013

Author: peter
supermaster-race.patch should limit the impact of this issue. It is not a real fix, though.

@Habbie
Member
Habbie commented Apr 26, 2013

Author: ahu
This patch has been applied, and effectively solves the issue.

@Habbie
Member
Habbie commented Apr 26, 2013

Author: anon
I am having the same issue with 3.1, so I do not believe this is fixed.

I have a server setup as a slave which receives updates from a supermaster. Updates to existing domains appear to work ok, however if a new zone is created on the supermaster, it crashes the slave when the slave commits the new zone, see below:

Jan 19 16:31:08 ns2 pdns[24478]: AXFR started for 'example.net', transaction started
Jan 19 16:31:08 ns2 pdns[24478]: Exiting because communicator thread died with error: Database error trying to insert new slave 'example.net': Failed to execute mysql_query, perhaps connection died? Err=1: Duplicate entry 'example.net' for key 'name_index'
Jan 19 16:31:08 ns2 pdns[24478]: AXFR done for 'example.net', zone committed with serial number 2013011902

pdns on the slave is dead at that point and needs to be started again.

upon looking at the mysql database, I can see that the domains and records tables have been populated with the correct data, so it does indeed seem like a race of sorts.

Any ideas on how to fix the issue?

@Habbie
Member
Habbie commented Apr 26, 2013

Author: peter
Hello,

the fix is in 3.2, not in 3.1. Can you try with 3.1?

@Habbie
Member
Habbie commented Apr 26, 2013

Author: peter
Uh. I meant - can you try with 3.2? :)

@Habbie
Member
Habbie commented Apr 26, 2013

Author: anon
Hi Peter,

I've tried 3.2-rc4 (from freebsd ports). The same error is there but at least the server doesnt crash. The conversation looks a bit messy! There is a subsequent axfr because the version of the SOA on the slave is 0??


Jan 20 16:08:45 ns2 pdns[51746]: Done launching threads, ready to distribute questions
Jan 20 16:15:39 ns2 pdns[51746]: Received NOTIFY for tittybumbum.com from 65.19.176.90 for which we are not authoritative
Jan 20 16:15:39 ns2 pdns[51746]: Received NOTIFY for tittybumbum.com from 2001:470:1:203::1002 for which we are not authoritative
Jan 20 16:15:40 ns2 pdns[51746]: Created new slave zone 'tittybumbum.com' from supermaster 65.19.176.90, queued axfr
Jan 20 16:15:40 ns2 pdns[51746]: Initiating transfer of 'tittybumbum.com' from remote '65.19.176.90'
Jan 20 16:15:40 ns2 pdns[51746]: Database error trying to create tittybumbum.com for potential supermaster 2001:470:1:203::1002: Database error trying to insert new slave 'tittybumbum.com': Failed to execute mysql_query, perhaps connection died? Err=1: Duplicate entry 'tittybumbum.com' for key 'name_index'
Jan 20 16:15:40 ns2 pdns[51746]: 1 slave domain needs checking, 0 queued for AXFR
Jan 20 16:15:40 ns2 pdns[51746]: AXFR started for 'tittybumbum.com', transaction started
Jan 20 16:15:40 ns2 pdns[51746]: Received serial number updates for 1 zones, had 0 timeouts
Jan 20 16:15:40 ns2 pdns[51746]: Domain 'tittybumbum.com' is stale, master serial 2013012002, our serial 0
Jan 20 16:15:40 ns2 pdns[51746]: Initiating transfer of 'tittybumbum.com' from remote '65.19.176.90'
Jan 20 16:15:40 ns2 pdns[51746]: 1 slave domain needs checking, 0 queued for AXFR
Jan 20 16:15:40 ns2 pdns[51746]: Received serial number updates for 1 zones, had 0 timeouts
Jan 20 16:15:40 ns2 pdns[51746]: Domain 'tittybumbum.com' is stale, master serial 2013012002, our serial 0
Jan 20 16:15:40 ns2 pdns[51746]: AXFR started for 'tittybumbum.com', transaction started
Jan 20 16:15:40 ns2 pdns[51746]: AXFR done for 'tittybumbum.com', zone committed with serial number 2013012002
Jan 20 16:15:40 ns2 pdns[51746]: Initiating transfer of 'tittybumbum.com' from remote '65.19.176.90'
Jan 20 16:15:40 ns2 pdns[51746]: AXFR started for 'tittybumbum.com', transaction started
Jan 20 16:15:40 ns2 pdns[51746]: AXFR done for 'tittybumbum.com', zone committed with serial number 2013012002
Jan 20 16:15:40 ns2 pdns[51746]: AXFR done for 'tittybumbum.com', zone committed with serial number 2013012002
Jan 20 16:15:42 ns2 pdns[51746]: Received NOTIFY for tittybumbum.com from 2001:470:1:203::1002 which is not a master
Jan 20 16:15:56 ns2 last message repeated 2 times

@Habbie
Member
Habbie commented Apr 26, 2013

Author: anon
both supermaster and slave are running the same version of powerdns.

@Habbie
Member
Habbie commented Apr 26, 2013

Author: anon
there seems to be some ipv4 and ipv6 weirdness in that log...

ie:

a) Received NOTIFY for tittybumbum.com from 2001:470:1:203::1002 for which we are not authoritative

b) Created new slave zone 'tittybumbum.com' from supermaster 65.19.176.90, queued axfr Jan 20 16:15:40

Received notify via ipv6, axfr done via ipv4..

c) Database error trying to create tittybumbum.com for potential supermaster 2001:470:1:203::1002:

There seems to be a bit of flip flopping going on between v4 and v6.

@Habbie
Member
Habbie commented Apr 26, 2013

Author: peter
There is no flipflopping - it receives both notifies but can only act on one. The initial v4 notify manages to trigger the supermaster entry's creation, which means the v6 will be ignored for that domain from now on.

The 0 serial means "domain not present yet", this is normal before the first AXFR.

And finally, 'not crashing' is what the fix for this issue promised. Avoiding all race conditions in this area is hard; not crashing is what we currently have to offer for this.

Thank you for checking! Closing ticket.

@Habbie
Member
Habbie commented Apr 26, 2013

Author: anon
I think perhaps using MySQL replication might be a better option. Thanks.

@Habbie
Member
Habbie commented Apr 26, 2013

Author: peter
Probably cleanest for now. Ticket #506 is tracking the general messiness of dual-stack slaving, by the way.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment