Service Failed with 'timeout' #56

Closed
combivanCoder opened this issue Aug 7, 2019 · 19 comments
@combivanCoder

I'm seeing serious issues with unbound where the service simply restarts itself intermittently. I say intermittently, but I'm beginning to think it's related to the service start timeout plus the restart delay.

When I attempt to start the service normally via systemctl start unbound or systemctl restart unbound the prompt hangs and waits. Eventually there's an error message along the lines of:

Job for unbound.service failed because a timeout was exceeded. See "systemctl status unbound.service" and "journalctl -xe" for details.

In the unbound log I can see that it reaches info: start of service (unbound 1.9.0) within milliseconds of the service start, though.

tls-upstream is set to yes, and looking at the more verbose outputs, requests are getting resolved. However, when the service restarts the cache is dropped, which is one of the main reasons for using unbound on this LAN (as well as part of a wider security & privacy design).

I'm using DNS-over-TLS upstreams only, configured as per this example:

forward-zone:
        name: "."
        forward-tls-upstream: yes

        ### straight connect instead of via stubby
        ## https://ripe78.ripe.net/on-site/tech-info/dns-over-tls-resolvers/
        forward-addr:   193.0.31.237@853#nscache.ripemtg.ripe.net
        forward-addr:   193.0.31.238@853#nscache.ripemtg.ripe.net

I know this generally works because if I start the service directly (stopping the systemd service) via /usr/sbin/unbound -dvv, I see it working as expected. No restarts or drop-outs.

It's almost like the systemd service thinks the startup hasn't completed, assumes a timeout, and forces the restart based on the unbound.service Restart=on-failure setting. If I remove the Restart=... line from the service I see the following in the systemctl status:

unbound.service - Unbound DNS server
Loaded: loaded (/lib/systemd/system/unbound.service; enabled; vendor preset: enabled)
Active: failed (Result: timeout) since Wed 2019-08-07 18:21:22 CEST; 38s ago
Docs: man:unbound(8)
Process: 6990 ExecStartPre=/usr/lib/unbound/package-helper chroot_setup (code=exited, status=0/SUCCESS)
Process: 6993 ExecStartPre=/usr/lib/unbound/package-helper root_trust_anchor_update (code=exited, status=0/SUCCESS)
Process: 6997 ExecStart=/usr/sbin/unbound -d $DAEMON_OPTS (code=exited, status=0/SUCCESS)
Main PID: 6997 (code=exited, status=0/SUCCESS)

Aug 07 18:19:51 ServerName systemd[1]: Starting Unbound DNS server...
Aug 07 18:19:51 ServerName package-helper[6993]: /var/lib/unbound/root.key has content
Aug 07 18:19:51 ServerName package-helper[6993]: success: the anchor is ok
Aug 07 18:21:21 ServerName systemd[1]: unbound.service: Start operation timed out. Terminating.
Aug 07 18:21:22 ServerName systemd[1]: unbound.service: Failed with result 'timeout'.
Aug 07 18:21:22 ServerName systemd[1]: Failed to start Unbound DNS server.

The pre-reqs for chroot and trust anchors appear to operate as expected; it's the actual unbound service which seems to time out as a systemd service.

The server in question is a new build replacing an older server running another Debian derivative with unbound 1.6.0, which uses Stubby and DnsCrypt as its forwarders. I've tried swapping out the DoT configuration for a straight UDP forwarder with known-good DNS (Cloudflare in the test), and that exhibits the same symptoms.

Version in use: 1.9.0.2
OS: Debian Buster (stable) arm64

I enabled a higher verbosity and captured part of the log for reference - it shows the brief time unbound was actually up, through to the timeout 'happening' and the service dropping. Just after the systemctl start unbound I hit it with a dig google.com +dnssec:

[1565196026] unbound[7414:0] debug: validator[module 1] operate: extstate:module_wait_subquery event:module_event_pass
[1565196026] unbound[7414:0] info: validator operate: query google.com. A IN
[1565196026] unbound[7414:0] debug: val handle processing q with state VAL_FINDKEY_STATE
[1565196026] unbound[7414:0] info: validator: FindKey google.com. A IN
[1565196026] unbound[7414:0] info: current keyname com. DNSKEY IN
[1565196026] unbound[7414:0] info: target keyname google.com. DNSKEY IN
[1565196026] unbound[7414:0] debug: striplab 0
[1565196026] unbound[7414:0] info: next keyname google.com. DNSKEY IN
[1565196026] unbound[7414:0] info: DS RRset com. DS IN
[1565196026] unbound[7414:0] info: generate request google.com. DS IN
[1565196026] unbound[7414:0] debug: mesh_run: validator module exit state is module_wait_subquery
[1565196026] unbound[7414:0] debug: subnet[module 0] operate: extstate:module_state_initial event:module_event_pass
[1565196026] unbound[7414:0] info: subnet operate: query google.com. DS IN
[1565196026] unbound[7414:0] debug: subnet: pass to next module
[1565196026] unbound[7414:0] debug: mesh_run: subnet module exit state is module_wait_module
[1565196026] unbound[7414:0] debug: validator[module 1] operate: extstate:module_state_initial event:module_event_pass
[1565196026] unbound[7414:0] info: validator operate: query google.com. DS IN
[1565196026] unbound[7414:0] debug: validator: pass to next module
[1565196026] unbound[7414:0] debug: mesh_run: validator module exit state is module_wait_module
[1565196026] unbound[7414:0] debug: iterator[module 2] operate: extstate:module_state_initial event:module_event_pass
[1565196026] unbound[7414:0] debug: process_request: new external request event
[1565196026] unbound[7414:0] debug: iter_handle processing q with state INIT REQUEST STATE
[1565196026] unbound[7414:0] info: resolving google.com. DS IN
[1565196026] unbound[7414:0] debug: request has dependency depth of 0
[1565196026] unbound[7414:0] debug: forwarding request
[1565196026] unbound[7414:0] debug: iter_handle processing q with state QUERY TARGETS STATE
[1565196026] unbound[7414:0] info: processQueryTargets: google.com. DS IN
[1565196026] unbound[7414:0] debug: processQueryTargets: targetqueries 0, currentqueries 0 sentcount 0
[1565196026] unbound[7414:0] info: DelegationPoint<.>: 0 names (0 missing), 7 addrs (0 result, 7 avail) parentNS
[1565196026] unbound[7414:0] debug: [dns.larsdebruin.net] ip4 51.15.70.167 port 853 (len 16)
[1565196026] unbound[7414:0] debug: [dnsovertls.sinodun.com] ip4 145.100.185.15 port 853 (len 16)
[1565196026] unbound[7414:0] debug: [dns.cmrg.net] ip4 199.58.81.218 port 443 (len 16)
[1565196026] unbound[7414:0] debug: [dnsovertls.sinodun.com] ip4 145.100.185.15 port 443 (len 16)
[1565196026] unbound[7414:0] debug: [getdnsapi.net] ip4 185.49.141.37 port 853 (len 16)
[1565196026] unbound[7414:0] debug: [nscache.ripemtg.ripe.net] ip4 193.0.31.238 port 853 (len 16)
[1565196026] unbound[7414:0] debug: [nscache.ripemtg.ripe.net] ip4 193.0.31.237 port 853 (len 16)
[1565196026] unbound[7414:0] debug: attempt to get extra 3 targets
[1565196026] unbound[7414:0] debug: servselect ip4 193.0.31.237 port 853 (len 16)
[1565196026] unbound[7414:0] debug: rtt=1504
[1565196026] unbound[7414:0] debug: servselect ip4 193.0.31.238 port 853 (len 16)
[1565196026] unbound[7414:0] debug: rtt=1504
[1565196026] unbound[7414:0] debug: servselect ip4 185.49.141.37 port 853 (len 16)
[1565196026] unbound[7414:0] debug: rtt=649
[1565196026] unbound[7414:0] debug: servselect ip4 145.100.185.15 port 443 (len 16)
[1565196026] unbound[7414:0] debug: rtt=569
[1565196026] unbound[7414:0] debug: servselect ip4 199.58.81.218 port 443 (len 16)
[1565196026] unbound[7414:0] debug: rtt=1147
[1565196026] unbound[7414:0] debug: servselect ip4 145.100.185.15 port 853 (len 16)
[1565196026] unbound[7414:0] debug: rtt=598
[1565196026] unbound[7414:0] debug: selrtt 376
[1565196026] unbound[7414:0] info: sending query: google.com. DS IN
[1565196026] unbound[7414:0] debug: sending to target: <.> 145.100.185.15#443
[1565196026] unbound[7414:0] debug: dnssec status: not expected
[1565196026] unbound[7414:0] debug: qname perturbed to GOogLE.cOM.
[1565196026] unbound[7414:0] debug: comm point start listening 39
[1565196026] unbound[7414:0] debug: mesh_run: iterator module exit state is module_wait_reply
[1565196026] unbound[7414:0] info: mesh_run: end 2 recursion states (1 with reply, 0 detached), 1 waiting replies, 1 recursion replies sent, 0 replies dropped, 0 states jostled out
[1565196026] unbound[7414:0] info: average recursion processing time 4.325272 sec
[1565196026] unbound[7414:0] info: histogram of recursion processing times
[1565196026] unbound[7414:0] info: [25%]=0 median[50%]=0 [75%]=0
[1565196026] unbound[7414:0] info: lower(secs) upper(secs) recursions
[1565196026] unbound[7414:0] info: 4.000000 8.000000 1
[1565196026] unbound[7414:0] info: 0vRDCD mod2 google.com. DS IN
[1565196026] unbound[7414:0] info: 1RDdc mod1 rep google.com. A IN
[1565196026] unbound[7414:0] debug: cache memory msg=68665 rrset=73891 infra=9590 val=69820 subnet=74504
[1565196026] unbound[7414:0] debug: svcd callbacks end
[1565196026] unbound[7414:0] debug: close fd 38
[1565196026] unbound[7414:0] debug: comm point listen_for_rw 39 0
[1565196026] unbound[7414:0] debug: peer certificate:
Issuer: C=US, O=Let's Encrypt, CN=Let's Encrypt Authority X3
Validity
Not Before: Jul 8 10:15:11 2019 GMT
Not After : Oct 6 10:15:11 2019 GMT
Subject: CN=dnsovertls.sinodun.com
X509v3 extensions:
X509v3 Key Usage: critical
Digital Signature, Key Encipherment
X509v3 Extended Key Usage:
TLS Web Server Authentication, TLS Web Client Authentication
X509v3 Basic Constraints: critical
CA:FALSE
X509v3 Subject Key Identifier:
05:3C:F3:1F:64:B2:4C:57:F0:79:8B:0D:55:7B:33:7B:EF:B6:CA:1F
X509v3 Authority Key Identifier:
keyid:A8:4A:6A:63:04:7D:DD:BA:E6:D1:39:B7:A6:45:65:EF:F3:A8:EC:A1

        Authority Information Access: 
            OCSP - URI:http://ocsp.int-x3.letsencrypt.org
            CA Issuers - URI:http://cert.int-x3.letsencrypt.org/

        X509v3 Subject Alternative Name: 
            DNS:dnsovertls.sinodun.com
        X509v3 Certificate Policies: 
            Policy: 2.23.140.1.2.1
            Policy: 1.3.6.1.4.1.44947.1.1.1
              CPS: http://cps.letsencrypt.org

        CT Precertificate SCTs: 
            Signed Certificate Timestamp:
                Version   : v1 (0x0)
                Log ID    : 74:7E:DA:83:31:AD:33:10:91:21:9C:CE:25:4F:42:70:
                            C2:BF:FD:5E:42:20:08:C6:37:35:79:E6:10:7B:CC:56
                Timestamp : Jul  8 11:15:11.999 2019 GMT
                Extensions: none
                Signature : ecdsa-with-SHA256
                            30:44:02:20:4C:D9:6C:B3:34:4B:7E:D9:E7:90:A4:D5:
                            AB:1D:13:B0:3F:6C:16:CA:93:3C:F6:AC:1D:62:3F:96:
                            5B:E3:F0:1B:02:20:59:0F:B5:B1:8F:FB:74:4F:CE:56:
                            D5:E1:27:C4:C7:98:99:20:2C:12:DF:94:1C:09:50:FD:
                            88:B4:03:E1:65:40
            Signed Certificate Timestamp:
                Version   : v1 (0x0)
                Log ID    : 29:3C:51:96:54:C8:39:65:BA:AA:50:FC:58:07:D4:B7:
                            6F:BF:58:7A:29:72:DC:A4:C3:0C:F4:E5:45:47:F4:78
                Timestamp : Jul  8 11:15:12.030 2019 GMT
                Extensions: none
                Signature : ecdsa-with-SHA256
                            30:46:02:21:00:B8:23:CD:36:89:77:64:0A:88:1B:68:
                            10:F9:D2:8A:27:28:75:F9:13:AF:58:5F:A6:40:F6:CD:
                            41:92:59:70:34:02:21:00:8B:71:0F:CD:DE:E7:9A:54:
                            6D:05:BD:7D:95:DE:A4:7C:ED:1E:65:3C:5F:BC:48:F6:
                            91:67:50:5A:AD:97:57:CB

[1565196026] unbound[7414:0] debug: SSL connection to dnsovertls.sinodun.com authenticated ip4 145.100.185.15 port 443 (len 16)
[1565196026] unbound[7414:0] debug: comm point listen_for_rw 39 1
[1565196026] unbound[7414:0] debug: comm point stop listening 39
[1565196026] unbound[7414:0] debug: comm point start listening 39
[1565196026] unbound[7414:0] debug: Reading ssl tcp query of length 766
[1565196026] unbound[7414:0] debug: comm point stop listening 39
[1565196026] unbound[7414:0] debug: outnettcp cb
[1565196026] unbound[7414:0] debug: measured TCP-time at 207 msec
[1565196026] unbound[7414:0] debug: svcd callbacks start
[1565196026] unbound[7414:0] debug: good 0x20-ID in reply qname
[1565196026] unbound[7414:0] debug: worker svcd callback for qstate 0xaaab0d9901b0
[1565196026] unbound[7414:0] debug: mesh_run: start
[1565196026] unbound[7414:0] debug: iterator[module 2] operate: extstate:module_wait_reply event:module_event_reply
[1565196026] unbound[7414:0] info: iterator operate: query google.com. DS IN
[1565196026] unbound[7414:0] debug: process_response: new external response event
[1565196026] unbound[7414:0] info: scrub for . NS IN
[1565196026] unbound[7414:0] info: response for google.com. DS IN
[1565196026] unbound[7414:0] info: reply from <.> 145.100.185.15#443
[1565196026] unbound[7414:0] info: incoming scrubbed packet: ;; ->>HEADER<<- opcode: QUERY, rcode: NOERROR, id: 0
;; flags: qr rd ra ; QUERY: 1, ANSWER: 0, AUTHORITY: 6, ADDITIONAL: 0
;; QUESTION SECTION:
google.com. IN DS

;; ANSWER SECTION:

;; AUTHORITY SECTION:
com. 1200 IN SOA a.gtld-servers.net. nstld.verisign-grs.com. 1565195898 1800 900 604800 86400
com. 1200 IN RRSIG SOA 8 1 900 20190814163818 20190807152818 17708 com. b/iaRYjfVFnFfMeWeTvihmSBwHG0FxpjtSHMjwYxoj0H4zzga/JM8wzO3Bj4/XXaJZA6wKWvR58znpfhUkqS/U3jbsPwRTH6Zavl/0IZ08b4iTwGUFmiMOAnddkzpjqNaIxfnC/py5k3Nd4iNY0Pz/PydcQhL4viSS1bg33JVS4= ;{id = 17708}
CK0POJMG874LJREF7EFN8430QVIT8BSM.com. 1200 IN NSEC3 1 1 0 - ck0q1gin43n1arrc9osm6qpqr81h5m9a NS SOA RRSIG DNSKEY NSEC3PARAM ;{flags: optout}
CK0POJMG874LJREF7EFN8430QVIT8BSM.com. 1200 IN RRSIG NSEC3 8 2 86400 20190811044603 20190804033603 17708 com. DFIdCVmw76xkgFouDSnCX8/TUMjZOkvQ/VmNcdSW3tE2rTDsD6VszVwNp5vV0iBTyXFk1lM3F4edA+zWR+fyCMwFfQs0vGdL4PHYbYMfxlY9cswx36W5suvp6oZGEMapfhDFRCcQG88kmIYjGKLjo9QAeTQKEStghE7DWD5SO3M= ;{id = 17708}
S84BDVKNH5AGDSI7F5J0O3NPRHU0G7JQ.com. 1200 IN NSEC3 1 1 0 - s84cfh3a62n0fjpc5d9ij2vjr71oglv5 NS DS RRSIG ;{flags: optout}
S84BDVKNH5AGDSI7F5J0O3NPRHU0G7JQ.com. 1200 IN RRSIG NSEC3 8 2 86400 20190812045756 20190805034756 17708 com. tCi2xBJY/XO0kpl2Sj23eyW/dFAgmeY11F+zcaeuLdCrwHzP4EM3yLq/XgbQV3L+9icbvcpqCHPv6B3461NGQ5HRvq4wrBgtoFMi0oFp2hZRHaVIMrJqTyerTH+dnoQioY+62bLf8tJ7A/BZVfpe28fg5JNHe3XbgAFVrVgKyWE= ;{id = 17708}

;; ADDITIONAL SECTION:
;; MSG SIZE rcvd: 749

[1565196026] unbound[7414:0] debug: iter_handle processing q with state QUERY RESPONSE STATE
[1565196026] unbound[7414:0] info: query response was nodata ANSWER
[1565196026] unbound[7414:0] debug: iter_handle processing q with state FINISHED RESPONSE STATE
[1565196026] unbound[7414:0] info: finishing processing for google.com. DS IN
[1565196026] unbound[7414:0] debug: mesh_run: iterator module exit state is module_finished
[1565196026] unbound[7414:0] debug: validator[module 1] operate: extstate:module_wait_module event:module_event_moddone
[1565196026] unbound[7414:0] info: validator operate: query google.com. DS IN
[1565196026] unbound[7414:0] debug: validator: nextmodule returned
[1565196026] unbound[7414:0] debug: not validating response, is valrec(validation recursion lookup)
[1565196026] unbound[7414:0] debug: mesh_run: validator module exit state is module_finished
[1565196026] unbound[7414:0] debug: subnet[module 0] operate: extstate:module_wait_module event:module_event_moddone
[1565196026] unbound[7414:0] info: subnet operate: query google.com. DS IN
[1565196026] unbound[7414:0] debug: mesh_run: subnet module exit state is module_finished
[1565196026] unbound[7414:0] info: validator: inform_super, sub is google.com. DS IN
[1565196026] unbound[7414:0] info: super is google.com. A IN
[1565196026] unbound[7414:0] info: verify rrset CK0POJMG874LJREF7EFN8430QVIT8BSM.com. NSEC3 IN
[1565196026] unbound[7414:0] debug: verify sig 17708 8
[1565196026] unbound[7414:0] debug: verify result: sec_status_secure
[1565196026] unbound[7414:0] info: verify rrset S84BDVKNH5AGDSI7F5J0O3NPRHU0G7JQ.com. NSEC3 IN
[1565196026] unbound[7414:0] debug: verify sig 17708 8
[1565196026] unbound[7414:0] debug: verify result: sec_status_secure
[1565196026] unbound[7414:0] debug: nsec3: keysize 1024 bits, max iterations 150
[1565196026] unbound[7414:0] info: ce candidate com. TYPE0 CLASS0
[1565196026] unbound[7414:0] info: NSEC3s for the referral proved no DS.
[1565196026] unbound[7414:0] debug: validator[module 1] operate: extstate:module_wait_subquery event:module_event_pass
[1565196026] unbound[7414:0] info: validator operate: query google.com. A IN
[1565196026] unbound[7414:0] debug: val handle processing q with state VAL_VALIDATE_STATE
[1565196026] unbound[7414:0] info: Verified that unsigned response is INSECURE
[1565196026] unbound[7414:0] debug: val handle processing q with state VAL_FINISHED_STATE
[1565196026] unbound[7414:0] debug: mesh_run: validator module exit state is module_finished
[1565196026] unbound[7414:0] debug: subnet[module 0] operate: extstate:module_wait_module event:module_event_moddone
[1565196026] unbound[7414:0] info: subnet operate: query google.com. A IN
[1565196026] unbound[7414:0] debug: mesh_run: subnet module exit state is module_finished
[1565196026] unbound[7414:0] debug: query took 0.794191 sec
[1565196026] unbound[7414:0] info: mesh_run: end 0 recursion states (0 with reply, 0 detached), 0 waiting replies, 2 recursion replies sent, 0 replies dropped, 0 states jostled out
[1565196026] unbound[7414:0] info: average recursion processing time 2.559731 sec
[1565196026] unbound[7414:0] info: histogram of recursion processing times
[1565196026] unbound[7414:0] info: [25%]=0 median[50%]=0 [75%]=0
[1565196026] unbound[7414:0] info: lower(secs) upper(secs) recursions
[1565196026] unbound[7414:0] info: 0.524288 1.000000 1
[1565196026] unbound[7414:0] info: 4.000000 8.000000 1
[1565196026] unbound[7414:0] debug: cache memory msg=68965 rrset=75290 infra=9590 val=70000 subnet=74504
[1565196026] unbound[7414:0] debug: svcd callbacks end
[1565196026] unbound[7414:0] debug: close fd 39
[1565196053] unbound[7414:0] info: service stopped (unbound 1.9.0).
[1565196053] unbound[7414:0] debug: stop threads
[1565196053] unbound[7414:0] debug: join 1
[1565196053] unbound[7414:1] debug: got control cmd quit
[1565196053] unbound[7414:0] debug: join success 1
[1565196053] unbound[7414:0] debug: join 2
[1565196053] unbound[7414:2] debug: got control cmd quit
[1565196053] unbound[7414:0] debug: join success 2
[1565196053] unbound[7414:0] debug: cleanup.

==> /var/log/syslog <==
Aug 7 18:40:53 localhost systemd[1]: unbound.service: Start operation timed out. Terminating.

==> /etc/unbound/logs/default.log <==
[1565196053] unbound[7414:0] info: server stats for thread 0: 2 queries, 0 answers from cache, 2 recursions, 0 prefetch, 0 rejected by ip ratelimiting
[1565196053] unbound[7414:0] info: server stats for thread 0: requestlist max 0 avg 0 exceeded 0 jostled 0
[1565196053] unbound[7414:0] info: mesh has 0 recursion states (0 with reply, 0 detached), 0 waiting replies, 2 recursion replies sent, 0 replies dropped, 0 states jostled out
[1565196053] unbound[7414:0] info: average recursion processing time 2.559731 sec
[1565196053] unbound[7414:0] info: histogram of recursion processing times
[1565196053] unbound[7414:0] info: [25%]=0 median[50%]=0 [75%]=0
[1565196053] unbound[7414:0] info: lower(secs) upper(secs) recursions
[1565196053] unbound[7414:0] info: 0.524288 1.000000 1
[1565196053] unbound[7414:0] info: 4.000000 8.000000 1
[1565196053] unbound[7414:0] debug: cache memory msg=66072 rrset=66072 infra=9590 val=70000 subnet=74504
[1565196054] unbound[7414:0] info: server stats for thread 1: 1 queries, 0 answers from cache, 1 recursions, 0 prefetch, 0 rejected by ip ratelimiting
[1565196054] unbound[7414:0] info: server stats for thread 1: requestlist max 0 avg 0 exceeded 0 jostled 0
[1565196054] unbound[7414:0] info: mesh has 0 recursion states (0 with reply, 0 detached), 0 waiting replies, 1 recursion replies sent, 0 replies dropped, 0 states jostled out
[1565196054] unbound[7414:0] info: average recursion processing time 9.752778 sec
[1565196054] unbound[7414:0] info: histogram of recursion processing times
[1565196054] unbound[7414:0] info: [25%]=0 median[50%]=0 [75%]=0
[1565196054] unbound[7414:0] info: lower(secs) upper(secs) recursions
[1565196054] unbound[7414:0] info: 8.000000 16.000000 1
[1565196054] unbound[7414:0] debug: cache memory msg=66072 rrset=66072 infra=9590 val=70000 subnet=74504
[1565196054] unbound[7414:0] info: server stats for thread 2: 0 queries, 0 answers from cache, 0 recursions, 0 prefetch, 0 rejected by ip ratelimiting
[1565196054] unbound[7414:0] info: server stats for thread 2: requestlist max 0 avg 0 exceeded 0 jostled 0
[1565196054] unbound[7414:0] info: mesh has 0 recursion states (0 with reply, 0 detached), 0 waiting replies, 0 recursion replies sent, 0 replies dropped, 0 states jostled out
[1565196054] unbound[7414:0] debug: cache memory msg=66072 rrset=66072 infra=9590 val=70000 subnet=74504
[1565196054] unbound[7414:0] debug: Exit cleanup.
[1565196054] unbound[7414:0] debug: switching log to stderr

==> /var/log/syslog <==
Aug 7 18:40:54 localhost systemd[1]: unbound.service: Failed with result 'timeout'.
Aug 7 18:40:54 localhost systemd[1]: Failed to start Unbound DNS server.

@gthess
Member

gthess commented Aug 8, 2019

I see that you are using Debian; is this a packaged unbound, or did you build your own from source?
If you built it yourself: unbound does support systemd, but it needs to be compiled with --enable-systemd.
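One way to check whether a given build has systemd support is the configure line that unbound -V reports; on systemd-aware builds it typically includes --enable-systemd (a quick check, assuming unbound is on the PATH):

```shell
# Print version and build details, then grep for the systemd flag in
# the reported configure line (exit status 1 means it was not found).
unbound -V | grep -o 'enable-systemd'
```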

@combivanCoder
Author

Yes - it's straight from the repo:
https://tracker.debian.org/pkg/unbound

@gthess
Member

gthess commented Aug 8, 2019

I am not able to reproduce the problem with a fresh installation of debian buster. This included installing the OS and apt installing unbound.
The only added configuration was a forward-zone as stated above to try and mimic the behavior.

I cannot reproduce the problem with any combination of forwarding servers and DNS over TLS.

Have you installed unbound in another way, or do you use a more complicated configuration?

By the way, if I recall correctly the RIPE name servers that you use were only set up for that conference, as they are not responding now (this does not cause any problems here with regard to systemd, except that queries are not answered).
For an updated list of DoT capable nameservers you can use:

@combivanCoder
Author

combivanCoder commented Aug 9, 2019

Thanks for looking into this.

That was a sample of two to demonstrate the configuration - there are many other DoT upstreams configured in the "." forward zone, and resolution is 100% operational during the time unbound is actually up.

Resolution is not the issue; the service start / timeout is the problem. I deployed unbound using standard repo tools, i.e. apt install unbound.

Have you tried using Debian Buster on an arm64 architecture device? Do you have any recommendations for me to try?

@combivanCoder
Author

I wondered if there might have been an issue resolved during the transition of Buster from testing to stable, as unbound was originally installed when Buster was still in testing.

So I just did an apt install --reinstall unbound - unfortunately that hasn't changed anything. Let me know if you think I should try something else, or provide further logs or configs.

@gthess
Member

gthess commented Aug 9, 2019

I could also try an arm64 installation, but I am not expecting to find anything different. Before I go down that road it could be useful if you share your configuration (minus any sensitive settings).

@combivanCoder
Author

Sure - configs below. I've variously tried removing specific config files to see whether any of them contain settings that would cause this, but the problem did not resolve in any of those tests.

unbound.conf

# remote - enabled for the ad-list parsers to reload the configuration when they run
remote-control:
	control-enable: yes
	control-interface: <iface>
	control-port: <port>

# base configuration of server
server:
	## general configuration
	chroot: "/etc/unbound"
	directory: "/etc/unbound"
	username: "<uname>"
	logfile: "logs/default.log"
	log-queries: no
	log-replies: no
	log-servfail: no
	hide-identity: yes
	hide-version: yes
	hide-trustanchor: no
	do-not-query-localhost: no
	rrset-roundrobin: yes
	
	## hardening - https://calomel.org/unbound_dns.html
	harden-dnssec-stripped: yes
	harden-glue: yes
	harden-large-queries: yes
	harden-short-bufsize: yes
	harden-algo-downgrade: yes
	use-caps-for-id: yes
	harden-below-nxdomain: yes
	identity: "<id>"

	# Enable ratelimiting of queries (per second) sent to nameserver for
	# performing recursion. More queries are turned away with an error
	# (servfail). This stops recursive floods (e.g., random query names), but
	# not spoofed reflection floods. Cached responses are not rate limited by
	# this setting. Experimental option.
	#ratelimit: 3000
	# TODO Disabled - Assess whether this is the setting causing all the restart issues with the service (and therefore losing the cache)	

	# Set the total number of unwanted replies to keep track of in every thread.
	# When it reaches the threshold, a defensive action of clearing the rrset
	# and message caches is taken, hopefully flushing away any poison.
	# Unbound suggests a value of 10 million.
	unwanted-reply-threshold: 1000000
	
	## private addresses
	private-address: <lan cidr 1>
	private-address: <lan cidr 2>
	private-domain: <priv. d. name>

# See the unbound.conf(5) man page.
#
# See /usr/share/doc/unbound/examples/unbound.conf for a commented
# reference config file.
#
# The following line includes additional configuration files from the
# /etc/unbound/unbound.conf.d directory.
include: "/etc/unbound/unbound.conf.d/*.conf"

unbound.conf.d/base.conf

# base configuration of server for LAN
server:
	interface: 127.0.0.1
	access-control: <lan cidr 1> allow
	access-control: <lan cidr 2> allow
	access-control:	<control cidr> allow
	access-control:	<wan cidr> deny
	#verbosity: 4
	do-ip4: yes
	do-ip6: no
	do-udp: yes
	do-tcp: yes
	# Aggressive NSEC uses the DNSSEC NSEC chain to synthesize NXDOMAIN
	# and other denials, using information from previous NXDOMAIN answers.
	# Default is no. It helps to reduce the query rate towards targets
	# that get a very high nonexistent-name lookup rate.
	aggressive-nsec: yes
	
	# standard port:
	interface: 0.0.0.0@53

unbound.conf.d/caching.conf

# cache-specific settings
server:
	# https://nlnetlabs.nl/documentation/unbound/howto-optimise/

	# the time to live (TTL) value lower bound, in seconds. Default 0.
	# Setting it higher than an hour could easily give trouble due to stale data.
	cache-min-ttl: 1200

	# the time to live (TTL) value cap for RRsets and messages in the
	# cache. Items are not cached for longer. In seconds.
	cache-max-ttl: 172800

	prefetch:		yes
	prefetch-key:		yes
	num-threads:		3

	msg-cache-slabs:	4
	rrset-cache-slabs:	4
	infra-cache-slabs:	4
	key-cache-slabs:	4

	num-queries-per-thread:	4096
	outgoing-range:		8192

	msg-cache-size:		131721898
	rrset-cache-size:	263443797
	
	serve-expired:		yes

	so-reuseport:		yes

	#edns-buffer-size:	10m

unbound.conf.d/tls.conf

# configuration to enable requests on 853
server:
	# enable upstream TLS requests
	tls-upstream: yes
	tcp-upstream: yes

	# enable listen on a TLS port with a local certificate
	tls-cert-bundle: "path/to/ca-certificates.crt"
	tls-service-key: "path/to/<server pk>.pem"
	tls-service-pem: "path/to/<server cert>.pem"
	tls-win-cert: no
	tls-port: 853

	# DNS over TLS
	minimal-responses: yes
	rrset-roundrobin: yes
	use-caps-for-id: yes
	
	# apply port for TLS
	interface: 0.0.0.0@853

	# define ciphers
	tls-ciphers: "<list of tls 1.2 compatible EC ciphers>"

unbound.conf.d/root-auto-trust-anchor-file.conf

server:
	# The following line will configure unbound to perform cryptographic
	# DNSSEC validation using the root trust anchor.
	auto-trust-anchor-file: "path/to/root.key"
	trust-anchor-signaling: yes
	root-key-sentinel: yes

unbound.conf.d/qname-min.conf

server:
    # Send minimum amount of information to upstream servers to enhance
    # privacy. Only sends minimum required labels of the QNAME and sets
    # QTYPE to NS when possible.

    # See RFC 7816 "DNS Query Name Minimisation to Improve Privacy" for
    # details.

    qname-minimisation: yes

@gthess
Member

gthess commented Aug 13, 2019

I was finally able to reproduce the issue. The offending option was chroot: "/etc/unbound".
The unbound.service file shipped with Debian uses Type=notify for communication between unbound and systemd. This uses an environment option NOTIFY_SOCKET that points to a socket in the /run/systemd directory. This location is out of the chroot directory and as a result unbound cannot send status back to systemd.
That is why systemd thinks unbound reached the startup timeout and tries to restart it again and again.

I would advise removing the chroot: "/etc/unbound" option, or explicitly setting it to chroot: "" (also the Debian default).

Can you verify that without chroot it works as expected?

I will consider opening an issue so that unbound informs the user when such a configuration is encountered and terminates immediately, to avoid this cryptic behavior.
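In unbound.conf terms, the advised change is simply (a minimal sketch; all other settings unchanged):

```
server:
	# An empty chroot leaves unbound able to reach systemd's notify
	# socket under /run/systemd (the Debian default setting).
	chroot: ""
```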

@combivanCoder
Author

Sorry for taking so long to reply; I was away for a few days and could not find a secure network.

I've just tried it without chroot and you're right - it works without issue. systemctl restart unbound returns almost instantly, and the service appears to stay up. Well spotted.

Confining DNS via chroot is useful from a security perspective due to natural exposure of such a system. I've used bind mounts to deal with log and cert folders within the confines of the chroot - is there something similar I should do for this socket?

@gthess
Member

gthess commented Aug 19, 2019

I see two options. Neither is ideal, but both work around the issue:

  1. Bind mount the /run/systemd directory. systemd creates the socket file in this directory. This is not ideal because all the other files there are not relevant for unbound.
  2. You could change Type=notify to Type=simple in the unbound.service file. This will not use a socket for communication and instead "will consider the unit started immediately after the main service process has been forked off."[1]. This is not ideal because the file may be replaced between system upgrades. Creating your own custom service file and disabling the default unbound one should help with that.
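If option 2 is chosen, a systemd drop-in is one way to keep the change across package upgrades without replacing the shipped unit (a sketch using standard systemd override mechanics, not something from this thread):

```ini
# /etc/systemd/system/unbound.service.d/override.conf
# Create with: systemctl edit unbound
[Service]
Type=simple
```

followed by systemctl daemon-reload and a restart of the service.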

@combivanCoder
Author

Hi - option #1 is the most viable to ensure upgrades are not affected / do not affect the operation of the service.

I tried binding /run/systemd to a sub-folder of the /etc/unbound chroot, but this still didn't work - I'm guessing something needs to be configured to tell the unbound service to look in a particular folder for the socket? (otherwise it'll continue looking in /run/systemd)

I couldn't find any references in man for unbound or systemd service unit files - did I miss something?

Essentially this is a defect with the chroot option, which needs resolution by:

  • Documenting a workable approach to using chroot with unbound (which isn't documented currently)
  • Adding a story to the backlog for future development which matches the above workaround but eliminates the need for extra configuration, e.g. the bind to /run/systemd in this case

@gthess
Copy link
Member

gthess commented Sep 25, 2019

Hi, now it is my time to apologize for the late response :)

When you tried binding the folder did you end up with /etc/unbound/run/systemd ?

@Maryse47
Copy link
Contributor

Bind-mounting the systemd socket should be fixed by ff8fd0b. I don't know if Debian will use the upstream service file, though.

@combivanCoder
Copy link
Author

Hi, now it is my time to apologize for the late response :)

When you tried binding the folder did you end up with /etc/unbound/run/systemd ?

No problem at all :)

Did some digging and I'd made a mistake: I hadn't created the path structure correctly for the bind mount (looks like I'd forgotten the systemd subfolder). Corrected it so that the mount destination was /etc/unbound/run/systemd, and it appears to be working.

Appears to be running within chroot with the workaround you've described. Thanks for your help getting this operational.
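For reference, the working setup described above can be sketched as follows (assuming the chroot is at /etc/unbound, as in this thread; these commands require root):

```shell
# Create the mount point inside the chroot, mirroring /run/systemd
mkdir -p /etc/unbound/run/systemd

# Bind-mount the real directory so unbound can still reach systemd's
# notify socket after it chroots
mount --bind /run/systemd /etc/unbound/run/systemd
```

An equivalent /etc/fstab entry can make the bind mount persistent across reboots (a later comment in this thread shows one for the Debian chroot location).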

@combivanCoder
Copy link
Author

combivanCoder commented Sep 26, 2019

Bind-mounting the systemd socket should be fixed by ff8fd0b. I don't know if Debian will use the upstream service file, though.

Thanks for that - looks like 1.9.3-1 recently got pulled into Debian testing, so will keep an eye out for versions with your fix.

A side note: I didn't appear to need to bind /dev/urandom within the chroot folder and I'm not seeing any errors. What would this omission inhibit?

@Maryse47
Copy link
Contributor

A side note: I didn't appear to need to bind /dev/urandom within the chroot folder and I'm not seeing any errors. What would this omission inhibit?

I think at some point unbound was taught to read from /dev/urandom before chrooting. Some docs still point to bind-mounting /dev/urandom, and I added it just to be safe. I don't know if it's still useful, but it shouldn't do any harm.
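If you do want the device available inside the chroot anyway, a minimal sketch is to bind-mount the device node onto a placeholder file (the /etc/unbound chroot path is the one used in this thread; requires root):

```shell
# Create a placeholder file inside the chroot to mount onto
mkdir -p /etc/unbound/dev
touch /etc/unbound/dev/urandom

# Bind-mount the real device node onto the placeholder
mount --bind /dev/urandom /etc/unbound/dev/urandom
```

As noted in the next comment, this is likely unnecessary on Linux kernels 3.17+, where the getrandom syscall is used instead.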

@Maryse47
Copy link
Contributor

On Linux kernels 3.17+ it will also call the getrandom syscall instead of reading /dev/urandom.

@MeganerdNL
Copy link

MeganerdNL commented May 2, 2020

In Debian 10/Buster the standard location for the chroot is /var/lib/unbound, while the configuration files stay in /etc/unbound. For the chroot to work I bind-mounted /run/systemd as suggested and made it permanent in /etc/fstab by adding

/run/systemd /var/lib/unbound/run/systemd none defaults,bind 0 0

When unbound is started, the configuration files are copied from /etc/unbound to /var/lib/unbound in some magical way (I can't figure out how), and it works. Unbound is running chrooted.

None of this is mentioned in the Debian documentation, nor is the correct location of the chroot (/var/lib/unbound). If you use /etc/unbound (as suggested) as the chroot location, unbound won't start and complains there is no config in /etc/unbound//etc/unbound.

Thanks.
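To activate that fstab entry without rebooting, a sketch using the paths from the comment above (requires root):

```shell
# Create the mount point inside the Debian chroot location
mkdir -p /var/lib/unbound/run/systemd

# Mount everything listed in /etc/fstab that isn't already mounted
mount -a

# Verify the bind mount is in place
findmnt /var/lib/unbound/run/systemd
```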

@DFazeli
Copy link

DFazeli commented May 1, 2022

In Debian 10/Buster the standard location for the chroot is /var/lib/unbound, while the configuration files stay in /etc/unbound. For the chroot to work I bind-mounted /run/systemd as suggested and made it permanent in /etc/fstab by adding

/run/systemd /var/lib/unbound/run/systemd none defaults,bind 0 0

When unbound is started, the configuration files are copied from /etc/unbound to /var/lib/unbound in some magical way (I can't figure out how), and it works. Unbound is running chrooted.

None of this is mentioned in the Debian documentation, nor is the correct location of the chroot (/var/lib/unbound). If you use /etc/unbound (as suggested) as the chroot location, unbound won't start and complains there is no config in /etc/unbound//etc/unbound.

Thanks.
Is it possible to explain, step by step, exactly what you did? (with commands, please)
