Skip to content

processQueryResponse() THROWAWAY should be mindful of fail_reply #923

@jarodnash

Description

@jarodnash

Oracle Solaris ships unbound on SPARC platforms built with ADI (Application Data Integrity) enabled. This SPARC HW feature picks up on memory corruption/violations, and it's this feature which is triggering unbound SEGVs.

NOTE: In the following examples, "XXX" is a redaction.

The resulting core file when analysed points to a problem with attempting to log a SERVFAIL. The unbound configuration has:

log-servfail: yes

Disabling ADI (by using elfedit(1) on the unbound binary) instead results in the SERVFAIL messages being logged. A real world example:

Jul 8 05:15:01 XXX unbound: [ID 993594 daemon.error] [11684:0] error: SERVFAIL <XXX. A IN>: all the configured stub or forward servers failed, at zone . from (inet_ntop_error) upstream server timeout

The "inet_ntop_error" is the message of interest here.

This comes from addr_to_str(). We can see addr_to_str() in the stacktrace in the core file:

$ mdb core.unbound.26674.1682413768
Loading modules: [ libc.so.1 ld.so.1 ]
unbound:core> ::status
debugging core file of unbound (64-bit) from XXX
file: /usr/sbin/unbound
initial argv: /usr/sbin/unbound
status: process terminated by SIGSEGV (Segmentation Fault) code=5 (SEGV_ADIPERR), pc=7fff590d7b918c, ADI version c mismatch for VA 7fff590bfdfa28
unbound:core> $c
addr_to_str+8(7fff590bfdfa28?, 7fff59?, 7fff590bfdefd0?, 100?, 7fff590bfdedd0?, d00000890beaaf68?)
errinf_reply+0x12c(d00000890beaa2a0?, d00000890beaa6d8?, d00000890beaa870?, d00000890beaa870?, 1?, 7fff590bfdefd0?)
processQueryTargets+0x189c(d00000890beaa2a0?, d00000890beaa6d8?, b000008909ed9ee0?, 1?, 1000000000000?, d00000890beaa870?)
iter_handle+0x544(d00000890beaa2a0?, d00000890beaa6d8?, b000008909ed9ee0?, 1?, 7fff590d741e70?, 7fff590d77cf18?)
iter_operate+0x384(d00000890beaa2a0?, 3?, 1?, d00000890beaaef8?, d00000890beaa2a4?, d00000890beaa2a0?)
mesh_run+0x80(400000890b9772a0?, d00000890beaa250?, 3?, d00000890beaaef8?, 7fff590d744bb0?, d00000890beaa2a0?)
...

Running through the call sequence, errinf_reply() is attempting to "add response specific error information for log servfail". It calls addr_to_str() passing "fail_reply" (a copy of a pointer to a struct comm_reply). In turn, addr_to_str() calls inet_ntop(), which first validates the address family; failure to validate means inet_ntop() returns NULL, and it's this NULL that results in addr_to_str() producing the "(inet_ntop_error)" string.

So why does the address family validation fail?

Using debug logging it appeared that lookups were failing with both THROWAWAY and timeouts. Code inspection lead to the following few lines:

iterator/iterator.c

static int
processQueryTargets(struct module_qstate* qstate, struct iter_qstate* iq,
	struct iter_env* ie, int id)
{
...
	} else if(type == RESPONSE_TYPE_THROWAWAY) {
		/* LAME and THROWAWAY responses are handled the same way. 
		 * In this case, the event is just sent directly back to 
		 * the QUERYTARGETS_STATE without resetting anything, 
		 * because, clearly, the next target must be tried. */
		verbose(VERB_DETAIL, "query response was THROWAWAY");
	} else {

Namely the "without resetting anything" comment.

Rather than attempt to craft a DNS environment which results in response_type_from_server() returning RESPONSE_TYPE_THROWAWAY, response_type_from_server() was modified to always return RESPONSE_TYPE_THROWAWAY.

Then set unbound.conf to have two forward-addr settings: one for a working DNS server, the other for a machine with no DNS service.

Finally, a script which fires a number of dig(1) queries at unbound completes the test case.

Without ADI enabled, log messages seen were of the likes:

[1692363449] unbound[1500:0] error: SERVFAIL <XXX. A IN>: all the configured stub or forward servers failed, at zone . from (inet_ntop_error) upstream server timeout

Which seems pretty close to the original failure. The suspect code/comment suggests what's happening is:

  • query sent to working configured DNS server
  • response passed to response_type_from_server(), which artificially always returning RESPONSE_TYPE_THROWAWAY
  • processQueryResponse() as cited above moves to try the next target
  • no response from second target (it doesn't have a DNS service)
  • logging of SERVFAIL ends up in errinf_reply() which finds a "fail_reply" pointer and attempts to use it
  • however, this pointer was for the THROWAWAY response, and the underlying memory has since been reused
  • without ADI, inet_ntop() fails with the described address family validation
  • with ADI the HW SEGVs the process with "this isn't the memory you are looking for"

Changing processQueryResponse() and clearing "fail_reply", ie:

               * because, clearly, the next target must be tried. */
               iq->fail_reply = NULL;
               verbose(VERB_DETAIL, "query response was THROWAWAY");
       } else {

gives us a quick fix, as we aren't leaving an old pointer lying around.

Note...the change to response_type_from_server() is an ugly hack to make reproducing the circumstances easier, obviously in the real world the DNS environment was resulting in a RESPONSE_TYPE_THROWAWAY return value from time to time.

Finally, as a quick fix I'm sure there's a more elegant/complete solution which can be implemented.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions