Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

IXFR serial quirks when NSD restarts and only have old zone file on disk #227

Closed
klaus-nicat opened this issue Aug 17, 2022 · 13 comments
Closed
Assignees

Comments

@klaus-nicat
Copy link

Setup: NSD 4.6 (compiled with --disable-radix-tree and --enable-packed).

server:
        # the database to use
        # database: "/dns/nsd/nsd.db"
        database: ""

        # Number of NSD servers to fork.
        server-count: 1

        # Verbosity level.
        verbosity: 2

        # The directory for zonefile: files.
        zonesdir: "/var/lib/nsd/"

zone:
        name: example.net
        zonefile: %s
        allow-axfr-fallback: yes
        allow-notify: 136.244.78.244 NOKEY
        request-xfr:  136.244.78.244 NOKEY
        max-refresh-time: 300

NSD is Secondary. Serial on Primary is 1, Serial on NSD is 1.

root@klaus-nsd-slave:~# nsd-control zonestatus
zone:   example.net
        state: ok
        served-serial: "1 since 2022-08-17T20:00:26"
        commit-serial: "1 since 2022-08-17T20:00:26"
        wait: "292 sec between attempts"

Then, write the zonefile:

# nsd-control write

Now, upate the zone on the Primary to serial=2. NSD will IXFR the new zone, but does not write it to disk:

Aug 17 20:05:43 klaus-nsd-slave nsd[15797]: xfrd: zone example.net committed "received update to serial 2 at 2022-08-17T20:05:43 from 136.244.78.244"
Aug 17 20:05:43 klaus-nsd-slave nsd[15810]: zone example.net. received update to serial 2 at 2022-08-17T20:05:43 from 136.244.78.244 of 235 bytes in 3.8e-05 seconds
Aug 17 20:05:43 klaus-nsd-slave nsd[15797]: zone example.net serial 1 is updated to 2

# nsd-control zonestatus
zone:   example.net
        state: ok
        served-serial: "2 since 2022-08-17T20:05:43"
        commit-serial: "2 since 2022-08-17T20:05:43"
        wait: "281 sec between attempts"

Now restart NSD. NSD will serve the old version with serial=1, as this is the latest version on disk.

Aug 17 20:07:33 klaus-nsd-slave nsd[15861]: nsd started (NSD 4.6.0), pid 15845
Aug 17 20:07:33 klaus-nsd-slave nsd[15845]: zone example.net serial 2 is updated to 1

# nsd-control zonestatus
zone:   example.net
        state: ok
        served-serial: "1 since 2022-08-17T20:07:33"
        commit-serial: "2 since 2022-08-17T20:07:33"
        wait: "275 sec between attempts"

So, now NSD serves serial=1, but when checking against the Primary, NSD uses serial=2. NSD should use serial=1, as serial=1 is the version that NSD has on disk. (btw: is there somewhere a description what the commit-serial is?)

# nsd-control transfer example.net

--> IXFR request decoded by tcpdump:
IXFR? example.net. ns: example.net. [30s] SOA secdev1-tst.rcode0.net. rcodezero-soa.ipcom.at. 2 10800 3600 604800 30 (120)
                                                                                              ^
                                                                                              |

Now, when the serial is increased to 3, NSD requests IXFR with serial=2. Hence, NSD receives the differences from 2-to-3, and then detects that the DIFF can not be applied to the local existing zone with serial=1:

root@klaus-nsd-slave:~# nsd-control zonestatus
zone:   example.net
        state: ok
        served-serial: "1 since 2022-08-17T20:07:33"
        commit-serial: "2 since 2022-08-17T20:15:03"
        wait: "12998 sec between attempts"

Aug 17 20:20:58 klaus-nsd-slave nsd[15862]: notify for example.net. from 136.244.78.244 serial 3
Aug 17 20:20:58 klaus-nsd-slave nsd[15845]: xfrd: zone example.net committed "received update to serial 3 at 2022-08-17T20:20:58 from 136.244.78.244"
Aug 17 20:20:58 klaus-nsd-slave nsd[15845]: xfrd: zone example.net: soa serial 3 update failed, restarting transfer (notified zone)

root@klaus-nsd-slave:~# nsd-control zonestatus
zone:   example.net
        state: refreshing
        served-serial: "1 since 2022-08-17T20:07:33"
        commit-serial: "1 since 2022-08-17T20:07:33"
        wait: "26537 sec between attempts"

Proof, that NSD requests IXFR with serial=2, although it only has serial=1 available local:

 Aug 17 20:20:58 klaus-bind-master named[17796]: client @0x7eff1c032790 78.141.197.185#35992 (example.net): transfer of 'example.net/IN': IXFR started (serial 2 -> 3)

Now, as NSD "falled back" to served-serial=commit-serial=1, it can recover on the next check:

root@klaus-nsd-slave:~# nsd-control transfer example.net
root@klaus-nsd-slave:~# Aug 17 20:21:23 klaus-nsd-slave nsd[15845]: xfrd: zone example.net committed "received update to serial 3 at 2022-08-17T20:21:23 from 136.244.78.244"
Aug 17 20:21:23 klaus-nsd-slave nsd[15861]: zone example.net. received update to serial 3 at 2022-08-17T20:21:23 from 136.244.78.244 of 325 bytes in 4.8e-05 seconds
Aug 17 20:21:23 klaus-nsd-slave nsd[15845]: zone example.net serial 1 is updated to 3

# nsd-control zonestatus
zone:   example.net
        state: ok
        served-serial: "3 since 2022-08-17T20:21:23"
        commit-serial: "3 since 2022-08-17T20:21:23"
        wait: "285 sec between attempts"

So. When NSD restarts and the served serial goes backwards due to "only old zone on disk", also the "committed-serial" (what is the commited serial?) should be lowered to the served serial, so that IXFR requests use the proper serial.

@anandb-ripencc
Copy link
Contributor

This is a very good analysis @klaus-nicat! I think the commit-serial is from the database days. NSD should not update its value until it has saved the zone to disk in some way. So this certainly looks like a bug.

@wtoorop
Copy link
Member

wtoorop commented Aug 18, 2022

Indeed excellent analysis @klaus-nicat !

So, now NSD serves serial=1, but when checking against the Primary, NSD uses serial=2. NSD should use serial=1, as serial=1 is the version that NSD has on disk. (btw: is there somewhere a description what the commit-serial is?)

commit-serial is the serial received from transfer and stored on disk in a transfer file to be processed.
Sending that along in the IXFR is indeed a bug! Thanks for finding that!

wtoorop added a commit that referenced this issue Aug 18, 2022
TODO: create a tpkg test to be certain
@klaus-nicat
Copy link
Author

@wtoorop We have a zone which gets updates every few seconds, and we regularly also see "update failed". It seems that happens when multiple IXFRs are applied - see log below. The weird thing is that NSD activates the new zone and then the "update failed" is about on old serial of this zone. Can it be, that this is caused by the same bug? If not, I would open another issue.

13:11:04 nsd[2935091]: zone eu. received update to serial 1083742837 at 2022-08-18T13:11:04 from 83.136.34.4 TSIG verified with key rcode0-distribution of 4331 bytes in 0.000121 seconds
13:11:04 nsd[2935090]: zone eu serial 1083742836 is updated to 1083742837

13:11:09 nsd[341245]: notify for eu. from 83.136.34.4 serial 1083742840
13:11:09 nsd[2935090]: xfrd: zone eu committed "received update to serial 1083742840 at 2022-08-18T13:11:09 from 83.136.34.4 TSIG verified with key rcode0-distribution"
13:11:09 nsd[2935091]: zone eu. received update to serial 1083742840 at 2022-08-18T13:11:09 from 83.136.34.4 TSIG verified with key rcode0-distribution of 9261 bytes in 0.000124 seconds
13:11:09 nsd[2935090]: zone eu serial 1083742837 is updated to 1083742840

13:11:19 nsd[341320]: notify for eu. from 83.136.34.4 serial 1083742841
13:11:19 nsd[2935090]: xfrd: zone eu committed "received update to serial 1083742841 at 2022-08-18T13:11:19 from 83.136.34.4 TSIG verified with key rcode0-distribution"
13:11:19 nsd[2935091]: zone eu. received update to serial 1083742841 at 2022-08-18T13:11:19 from 83.136.34.4 TSIG verified with key rcode0-distribution of 4386 bytes in 0.000106 seconds
13:11:19 nsd[2935090]: zone eu serial 1083742840 is updated to 1083742841
13:11:20 nsd[2935091]: writing zone eu to file eu.zone

13:11:24 nsd[341655]: notify for eu. from 2a02:850:8::5 serial 1083742842
13:11:24 nsd[2935090]: xfrd: zone eu committed "received update to serial 1083742842 at 2022-08-18T13:11:24 from 2a02:850:8::5 TSIG verified with key rcode0-distribution"
13:11:29 nsd[341655]: notify for eu. from 83.136.34.4 serial 1083742843
13:11:30 nsd[2935090]: xfrd: zone eu committed "received update to serial 1083742843 at 2022-08-18T13:11:29 from 83.136.34.4 TSIG verified with key rcode0-distribution"
13:11:34 nsd[2935091]: zone eu. received update to serial 1083742842 at 2022-08-18T13:11:24 from 2a02:850:8::5 TSIG verified with key rcode0-distribution of 4390 bytes in 5.2e-05 seconds
13:11:34 nsd[2935091]: zone eu. received update to serial 1083742843 at 2022-08-18T13:11:29 from 83.136.34.4 TSIG verified with key rcode0-distribution of 4306 bytes in 5.6e-05 seconds
13:11:35 nsd[2935090]: zone eu serial 1083742841 is updated to 1083742843

==== HERE UPDATE FAILED =====
13:11:35 nsd[2935090]: xfrd: zone eu: soa serial 1083742842 update failed, restarting transfer (notified zone)
==== HERE UPDATE FAILED =====

13:11:36 nsd[342099]: notify for eu. from 83.136.34.4 serial 1083742844
13:11:37 nsd[2935090]: xfrd: zone eu committed "received update to serial 1083742844 at 2022-08-18T13:11:37 from 83.136.34.4 TSIG verified with key rcode0-distribution"
13:11:37 nsd[2935091]: zone eu. received update to serial 1083742844 at 2022-08-18T13:11:37 from 83.136.34.4 TSIG verified with key rcode0-distribution of 4268 bytes in 0.000115 seconds
13:11:37 nsd[2935090]: zone eu serial 1083742843 is updated to 1083742844

13:11:41 nsd[342113]: notify for eu. from 83.136.34.4 serial 1083742845
13:11:41 nsd[2935090]: xfrd: zone eu committed "received update to serial 1083742845 at 2022-08-18T13:11:41 from 83.136.34.4 TSIG verified with key rcode0-distribution"
13:11:42 nsd[2935091]: zone eu. received update to serial 1083742845 at 2022-08-18T13:11:41 from 83.136.34.4 TSIG verified with key rcode0-distribution of 4340 bytes in 0.000135 seconds
13:11:42 nsd[2935090]: zone eu serial 1083742844 is updated to 1083742845

@klaus-nicat
Copy link
Author

Your suggested fix does not work. Starting with serial=1, then write, then increase serial=2, then NSD restart:

a) after NSD restart, now NSD requests IXFR with serial 1 (the last written to disk, that is good), but the served zone is not updated. NSD sill serves serial=1
b) when updating the serial=3, NSD requests IXFR with serial=1, but then fails to apply the IXFR:

Aug 18 14:52:17 klaus-nsd-slave nsd[18051]: notify for example.net. from 136.244.78.244 serial 3
Aug 18 14:52:17 klaus-nsd-slave nsd[18037]: xfrd: zone example.net bad transfer 3 from 136.244.78.244

@wtoorop
Copy link
Member

wtoorop commented Aug 18, 2022

Can it be, that this is caused by the same bug? If not, I would open another issue.

It does look related. Your situation at 13:11:24 would be a reason to send along the committed serial (because the served serial is not yet 1083742842). So let's try to take it along in this issue for the moment :)

@wtoorop
Copy link
Member

wtoorop commented Aug 18, 2022

b) when updating the serial=3, NSD requests IXFR with serial=1, but then fails to apply the IXFR:

That's even worse! I'll change the title of the PR! Thanks for trying it out immediately.

@bleve
Copy link

bleve commented Aug 19, 2022

I guess current lazy write to disk is not possible with IXFR and zone must be written to disk immediately. OR there need to be two copies of data for IXFR - one on disk, written when zone is written to disk and other version which is in ram and reflects zone status in ram.

@klaus-nicat
Copy link
Author

Hi! Any news? Can we help getting this fixed? Thanks

@wtoorop
Copy link
Member

wtoorop commented Oct 27, 2022

Not yet. Sorry @klaus-nicat . @k0ekk0ek do you think you can make a bit of time to have a look?

@k0ekk0ek
Copy link
Contributor

@wtoorop, @klaus-nicat, I'll have a look.

@k0ekk0ek k0ekk0ek self-assigned this Nov 1, 2022
@k0ekk0ek
Copy link
Contributor

k0ekk0ek commented Nov 1, 2022

@klaus-nicat, I've picked this one up. I'm creating a small reproducer first for easy validation and to avoid regressions in the future and then implement a fix. I'll report back ASAP.

@k0ekk0ek
Copy link
Contributor

k0ekk0ek commented Nov 2, 2022

@klaus-nicat, I've added a test to verify correct behavior in the future (so it currently fails 😅). I'll start working on the actual fix tomorrow.

@klaus-nicat
Copy link
Author

That's a smart idea ;-) Thanks

k0ekk0ek added a commit to k0ekk0ek/nsd that referenced this issue Nov 3, 2022
k0ekk0ek added a commit to k0ekk0ek/nsd that referenced this issue Nov 3, 2022
k0ekk0ek added a commit to k0ekk0ek/nsd that referenced this issue Nov 4, 2022
k0ekk0ek added a commit to k0ekk0ek/nsd that referenced this issue Nov 4, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants