systemd nss-lookup.target is reached before unbound can successfully answer queries #296

Closed
xnoreq opened this issue Aug 30, 2020 · 7 comments


xnoreq commented Aug 30, 2020

Running unbound 1.11.0 (pkg) on Arch Linux.
Other services rely on nss-lookup.target being reached only once name resolution is actually working.

To demonstrate the issue I've created the following "check-name-resolution.service":

[Unit]
Description=check name resolution
After=network.target network-online.target nss-lookup.target

[Service]
Type=oneshot
User=nobody
ExecStart=/usr/bin/drill google.com

[Install]
WantedBy=multi-user.target

After a reboot (!) the journal shows this:

-- Reboot --
systemd[1]: Starting Validating, recursive, and caching DNS resolver...
unbound[329]: [329:0] notice: init module 0: subnet
unbound[329]: [329:0] notice: init module 1: validator
unbound[329]: [329:0] notice: init module 2: iterator
unbound[329]: [329:0] info: start of service (unbound 1.11.0).
systemd[1]: Started Validating, recursive, and caching DNS resolver.
systemd[1]: Reached target Host and Network Name Lookups.
systemd[1]: Starting check name resolution...
sh[340]: ;; ->>HEADER<<- opcode: QUERY, rcode: SERVFAIL, id: 60132
sh[340]: ;; flags: qr rd ra ; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 0
sh[340]: ;; QUESTION SECTION:
sh[340]: ;; google.com.        IN        A
sh[340]: ;; ANSWER SECTION:
sh[340]: ;; AUTHORITY SECTION:
sh[340]: ;; ADDITIONAL SECTION:
sh[340]: ;; Query time: 29 msec
sh[340]: ;; SERVER: ::1
sh[340]: ;; WHEN: Sun Aug 30 15:47:16 2020
sh[340]: ;; MSG SIZE  rcvd: 28
systemd[1]: check-name-resolution.service: Succeeded.
systemd[1]: Finished check name resolution.
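
Note that the log above shows the check unit reported as "Succeeded" even though the answer was SERVFAIL, so the unit's exit status does not reflect the DNS result. A minimal sketch of a stricter check follows, assuming a resolver listening on 127.0.0.1 port 53 over UDP (the log above shows ::1, so adjust as needed). The names `build_query`, `rcode_of` and `check` are hypothetical helpers written for this illustration; they are not part of unbound or drill.

```python
import socket
import struct

# A few RCODE values from the DNS header; anything else is shown numerically.
RCODE_NAMES = {0: "NOERROR", 1: "FORMERR", 2: "SERVFAIL", 3: "NXDOMAIN", 5: "REFUSED"}


def build_query(name: str, query_id: int = 0x1234) -> bytes:
    """Build a minimal DNS query for an A record with the RD flag set."""
    header = struct.pack("!HHHHHH", query_id, 0x0100, 1, 0, 0, 0)
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in name.rstrip(".").split(".")
    ) + b"\x00"
    return header + qname + struct.pack("!HH", 1, 1)  # QTYPE=A, QCLASS=IN


def rcode_of(response: bytes) -> str:
    """Return the RCODE name taken from the low 4 bits of the header flags."""
    if len(response) < 12:
        raise ValueError("truncated DNS header")
    (flags,) = struct.unpack("!H", response[2:4])
    code = flags & 0x000F
    return RCODE_NAMES.get(code, f"RCODE{code}")


def check(server: str = "127.0.0.1", name: str = "google.com",
          timeout: float = 3.0) -> int:
    """Query `server` once; return 0 only for NOERROR, 1 otherwise.

    The server address and port are assumptions for this sketch.
    A timeout raises an OSError subclass, which the caller must handle.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        sock.sendto(build_query(name), (server, 53))
        response, _ = sock.recvfrom(4096)
    code = rcode_of(response)
    print(f"{name}: {code}")
    return 0 if code == "NOERROR" else 1
```

A oneshot unit could run a script that ends with `sys.exit(check())`, so that a SERVFAIL answer marks the unit as failed instead of succeeded.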
@wcawijngaards
Member

If you want unbound to be delayed until nss-lookup is done, you should make it wait for that? I mean, make nss-lookup.target wait for the unbound unit to be done? Or use a pseudo target for name resolution, like network-online.target? I think systemd does this by making the target require the other unit that it needs to wait for. You then need to add those ordering and requirement lines to the systemd configuration so that it starts the servers in the sequence you need.

It is more of an issue with the systemd and Arch Linux setup, so I think you should really ask there. If you do not know where, perhaps try our users mailing list to see if other users know about the systemd setup.

@xnoreq
Author

xnoreq commented Aug 31, 2020

The service file, from unbound itself (see contrib/unbound.service.in), already does this:

[Unit]
Description=Validating, recursive, and caching DNS resolver
Documentation=man:unbound(8)
After=network.target
Before=network-online.target nss-lookup.target

And an excerpt of the service part:

[Service]
ExecReload=+/bin/kill -HUP $MAINPID
ExecStart=@UNBOUND_SBIN_DIR@/unbound -d -p
Type=notify

From systemd manual:

Behavior of notify is similar to exec; however, it is expected that the service sends a notification message via sd_notify(3) or an equivalent call when it has finished starting up. systemd will proceed with starting follow-up units after this notification message has been sent. If this option is used, NotifyAccess= (see below) should be set to open access to the notification socket provided by systemd. If NotifyAccess= is missing or set to none, it will be forcibly set to main

So from what I can see unbound seems to send the notification (that it started and is ready to service requests) too early.
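
For readers unfamiliar with the mechanism, the following is a rough sketch of what a `READY=1` notification amounts to on the wire: a datagram sent to the Unix socket named in the `NOTIFY_SOCKET` environment variable. Unbound itself is written in C and calls `sd_notify`; this Python function, `notify_ready`, is a hypothetical stand-in for illustration only. Nothing in the message says anything about whether upstream queries will succeed.

```python
import os
import socket


def notify_ready(message: bytes = b"READY=1") -> bool:
    """Send a readiness datagram to the socket named in $NOTIFY_SOCKET.

    Returns False when the variable is unset (for example, when the
    process was not started by systemd with Type=notify), and True once
    the datagram has been sent.
    """
    path = os.environ.get("NOTIFY_SOCKET")
    if not path:
        return False
    if path.startswith("@"):
        # A leading '@' denotes a Linux abstract socket: replace it with NUL.
        address = b"\0" + path[1:].encode()
    else:
        address = path
    with socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM) as sock:
        sock.sendto(message, address)
    return True
```

The service manager only learns that the process considers itself started. Whether a query sent a moment later returns NOERROR or SERVFAIL depends on things outside this message.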

@wcawijngaards
Member

Unbound sends the notification that the server is up. But the rest of the system is not, which is why you get a response from unbound with SERVFAIL, a failure. This means unbound is up and responding. The SERVFAIL means that unbound cannot get content for the response. Unbound indicates that it has started, but that carries no information about whether content is going to be available. When unbound starts and indicates that it has come up, it cannot figure out whether its later lookups are going to return content for queries.

For that you need the network to be up and responding. I see network-online in there, but there are no good responses (yet?). If you make unbound wait for the things that let its queries get responses, then it would work when you start it, I guess?

@xnoreq
Author

xnoreq commented Aug 31, 2020

Maybe this line here is called too early:

ret = sd_notify(0, "READY=1");

?

As you can see from my logs above, after unbound sends the notification that it's started (which means we reach nss-lookup.target), DNS resolution fails with SERVFAIL.
So it seems that something is not set up completely yet.

@wcawijngaards
Member

No, the notification is for server start. I see you want no-servfail resolution, but that is not what unbound indicates with its 'the server has started' notification.

For no-servfail resolution, you need unbound started, but you also need, apparently, something else: something that ensures that when unbound makes queries, those queries get responses. If you made unbound's service file wait for that, it would likely work.
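
The distinction between "server started" and "no-servfail resolution" can also be handled on the consumer side. As a sketch of a different technique from unit ordering, a dependent service could poll until a probe query returns NOERROR. The function `wait_until_resolving` and its `probe` callback are hypothetical names for this illustration; `probe` is assumed to return an RCODE name such as "NOERROR" or "SERVFAIL" and may raise `OSError` on timeouts.

```python
import time
from typing import Callable


def wait_until_resolving(probe: Callable[[], str], attempts: int = 10,
                         delay: float = 1.0,
                         sleep: Callable[[float], None] = time.sleep) -> bool:
    """Call `probe` until it reports NOERROR or the attempts run out.

    Returns True as soon as one probe succeeds, False otherwise.
    `sleep` is injectable so the loop can be exercised without waiting.
    """
    for attempt in range(attempts):
        try:
            if probe() == "NOERROR":
                return True
        except OSError:
            pass  # e.g. timeout or network unreachable; try again
        if attempt + 1 < attempts:
            sleep(delay)
    return False
```

This only works around the ordering problem for one consumer. Making the unbound unit itself wait for the network, as suggested in this comment, addresses it for every unit ordered after nss-lookup.target.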

@xnoreq
Author

xnoreq commented Aug 31, 2020

I was too quick with my previous reply. You are probably right, but I still need to test this.
I will report back with the required changes to the service file.

@xnoreq
Author

xnoreq commented Aug 31, 2020

Works with this service file:

--- /usr/lib/systemd/system/unbound.service     2020-08-08 10:50:44.000000000 +0200
+++ /etc/systemd/system/unbound.service 2020-08-31 11:33:25.698267256 +0200
@@ -42,9 +42,9 @@
 [Unit]
 Description=Validating, recursive, and caching DNS resolver
 Documentation=man:unbound(8)
-After=network.target
-Before=network-online.target nss-lookup.target
-Wants=nss-lookup.target
+After=network-online.target
+Before=nss-lookup.target
+Wants=network-online.target nss-lookup.target

 [Install]
 WantedBy=multi-user.target

xnoreq closed this as completed Aug 31, 2020
wcawijngaards added a commit that referenced this issue Aug 31, 2020
  - Fix #296: systemd nss-lookup.target is reached before unbound can successfully answer queries. Changed contrib/unbound.service.in.
jedisct1 added a commit to jedisct1/unbound that referenced this issue Sep 2, 2020