Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

processQueryResponse() THROWAWAY should be mindful of fail_reply #923

Closed
jarodnash opened this issue Aug 18, 2023 · 1 comment
Closed

processQueryResponse() THROWAWAY should be mindful of fail_reply #923

jarodnash opened this issue Aug 18, 2023 · 1 comment

Comments

@jarodnash
Copy link

Oracle Solaris ships unbound on SPARC platforms built with ADI (Application Data Integrity) enabled. This SPARC HW feature picks up on memory corruption/violations, and it's this feature which is triggering unbound SEGVs.

NOTE: In the following examples, "XXX" is a redaction.

The resulting core file when analysed points to a problem with attempting to log a SERVFAIL. The unbound configuration has:

log-servfail: yes

Disabling ADI (by using elfedit(1) on the unbound binary) instead results in the SERVFAIL messages being logged. A real world example:

Jul 8 05:15:01 XXX unbound: [ID 993594 daemon.error] [11684:0] error: SERVFAIL <XXX. A IN>: all the configured stub or forward servers failed, at zone . from (inet_ntop_error) upstream server timeout

The "inet_ntop_error" is the message of interest here.

This comes from addr_to_str(). We can see addr_to_str() in the stacktrace in the core file:

$ mdb core.unbound.26674.1682413768
Loading modules: [ libc.so.1 ld.so.1 ]
unbound:core> ::status
debugging core file of unbound (64-bit) from XXX
file: /usr/sbin/unbound
initial argv: /usr/sbin/unbound
status: process terminated by SIGSEGV (Segmentation Fault) code=5 (SEGV_ADIPERR), pc=7fff590d7b918c, ADI version c mismatch for VA 7fff590bfdfa28
unbound:core> $c
addr_to_str+8(7fff590bfdfa28?, 7fff59?, 7fff590bfdefd0?, 100?, 7fff590bfdedd0?, d00000890beaaf68?)
errinf_reply+0x12c(d00000890beaa2a0?, d00000890beaa6d8?, d00000890beaa870?, d00000890beaa870?, 1?, 7fff590bfdefd0?)
processQueryTargets+0x189c(d00000890beaa2a0?, d00000890beaa6d8?, b000008909ed9ee0?, 1?, 1000000000000?, d00000890beaa870?)
iter_handle+0x544(d00000890beaa2a0?, d00000890beaa6d8?, b000008909ed9ee0?, 1?, 7fff590d741e70?, 7fff590d77cf18?)
iter_operate+0x384(d00000890beaa2a0?, 3?, 1?, d00000890beaaef8?, d00000890beaa2a4?, d00000890beaa2a0?)
mesh_run+0x80(400000890b9772a0?, d00000890beaa250?, 3?, d00000890beaaef8?, 7fff590d744bb0?, d00000890beaa2a0?)
...

Running through the call sequence, errinf_reply() is attempting to "add response specific error information for log servfail". It calls addr_to_str() passing "fail_reply" (a copy of a pointer to a struct comm_reply). In turn, addr_to_str() calls inet_ntop(), which first validates the address family; failure to validate means inet_ntop() returns NULL, and it's this NULL that results in addr_to_str() producing the "(inet_ntop_error)" string.

So why does the address family validation fail?

Using debug logging it appeared that lookups were failing with both THROWAWAY and timeouts. Code inspection lead to the following few lines:

iterator/iterator.c

static int
processQueryTargets(struct module_qstate* qstate, struct iter_qstate* iq,
	struct iter_env* ie, int id)
{
...
	} else if(type == RESPONSE_TYPE_THROWAWAY) {
		/* LAME and THROWAWAY responses are handled the same way. 
		 * In this case, the event is just sent directly back to 
		 * the QUERYTARGETS_STATE without resetting anything, 
		 * because, clearly, the next target must be tried. */
		verbose(VERB_DETAIL, "query response was THROWAWAY");
	} else {

Namely the "without resetting anything" comment.

Rather than attempt to craft a DNS environment which results in response_type_from_server() returning RESPONSE_TYPE_THROWAWAY, response_type_from_server() was modified to always return RESPONSE_TYPE_THROWAWAY.

Then set unbound.conf to have two forward-addr settings: one for a working DNS server, the other for a machine with no DNS service.

Finally, a script which fires a number of dig(1) queries at unbound completes the test case.

Without ADI enabled, log messages seen were of the likes:

[1692363449] unbound[1500:0] error: SERVFAIL <XXX. A IN>: all the configured stub or forward servers failed, at zone . from (inet_ntop_error) upstream server timeout

Which seems pretty close to the original failure. The suspect code/comment suggests what's happening is:

  • query sent to working configured DNS server
  • response passed to response_type_from_server(), which artificially always returning RESPONSE_TYPE_THROWAWAY
  • processQueryResponse() as cited above moves to try the next target
  • no response from second target (it doesn't have a DNS service)
  • logging of SERVFAIL ends up in errinf_reply() which finds a "fail_reply" pointer and attempts to use it
  • however, this pointer was for the THROWAWAY response, and the underlying memory has since been reused
  • without ADI, inet_ntop() fails with the described address family validation
  • with ADI the HW SEGVs the process with "this isn't the memory you are looking for"

Changing processQueryResponse() and clearing "fail_reply", ie:

               * because, clearly, the next target must be tried. */
               iq->fail_reply = NULL;
               verbose(VERB_DETAIL, "query response was THROWAWAY");
       } else {

gives us a quick fix, as we aren't leaving an old pointer lying around.

Note...the change to response_type_from_server() is an ugly hack to make reproducing the circumstances easier, obviously in the real world the DNS environment was resulting in a RESPONSE_TYPE_THROWAWAY return value from time to time.

Finally, as a quick fix I'm sure there's a more elegant/complete solution which can be implemented.

@wcawijngaards
Copy link
Member

The commit fixes it by having a copy of the address in_addr or in6_addr stored for later usage. This makes the address available also after the callback has completed. Thanks for the bug report! The commit should also fix it for that HW feature, but it was tested with some debug tests to replicate it.

jedisct1 added a commit to jedisct1/unbound that referenced this issue Sep 4, 2023
* nlnet/master: (44 commits)
  - Fix NLnetLabs#927: unbound 1.18.0 make test error. Fix make test without SHA1.
  - Fix autoconf 2.69 warnings in configure.
  - Fix for WKS call to getservbyname that creates allocation on exit   in unit test by testing numbers first and testing from the services   list later.
  Tag 1.18.0rc1 became the 1.18.0 release on 30 aug 2023, with the fix from 25 aug, fix compile on NetBSD included. The repository continues with version 1.18.1.
  - Fix for version generation race condition that ignored changes.
  - Fix compile error on NetBSD in util/netevent.h.
  - Tag for 1.18.0rc1 release.
  - Set version number to 1.18.0.
  - Fix unit test for unbound-control to work when threads are disabled,   and fix cache dump check.
  - Fix NLnetLabs#923: processQueryResponse() THROWAWAY should be mindful of   fail_reply.
  - Fix for NLnetLabs#925: unbound.service: Main process exited, code=killed,   status=11/SEGV. Fixes cachedb configuration handling.
  - Fix windows ci workflow to install bison and flex.
  Further debug for windows ci workflow.
  - Debug Windows ci workflow.
  - Fix stat_values test to work with dig that enables DNS cookies.
  - Fix uninitialized memory passed in padding bytes of cmsg to sendmsg.
  Changelog for commit. - Fix for iter_dec_attempts that could cause a hang, part of   capsforid and qname minimisation, depending on the settings.
  - Fix for iter_dec_attempts that could cause a hang, part of   capsforid and qname minimisation, depending on the settings.
  - Fix ip_ratelimit test to work with dig that enables DNS cookies.
  - Fix regional_alloc_init for potential unaligned source of the copy.
  ...
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants