Hang in Unbound 1.13.1 (and 1.13.0) #439

@jcjones

We're occasionally seeing Unbound 1.13.1 (as well as 1.13.0 and 1.13.1rc1) hang: it takes all available CPU time and stops servicing queries. Because of automated watchdog mechanisms it took me a while to capture a core dump in the stuck state, but I now have at least one, from the 1.13.1 release running on CentOS 7. The core dump shows two threads active:

(gdb) thread 1
[Switching to thread 1 (Thread 0x7f4744f8b840 (LWP 5614))]
#0  0x000055cf3785f241 in reuse_cmp (key1=0x7fff77f41758, key2=0x55cffef85878) at services/outside_network.c:162
162	{
(gdb) bt
#0  0x000055cf3785f241 in reuse_cmp (key1=0x7fff77f41758, key2=0x55cffef85878) at services/outside_network.c:162
#1  0x000055cf3781a45e in rbtree_find_less_equal (rbtree=rbtree@entry=0x55cfe7436428, key=key@entry=0x7fff77f41758, result=result@entry=0x7fff77f41738)
    at util/rbtree.c:527
#2  0x000055cf3785f99c in reuse_tcp_find (outnet=outnet@entry=0x55cfe7436320, addr=addr@entry=0x55d003791910, addrlen=16, use_ssl=<optimized out>)
    at services/outside_network.c:487
#3  0x000055cf37860a65 in use_free_buffer (outnet=outnet@entry=0x55cfe7436320) at services/outside_network.c:740
#4  0x000055cf37860ffb in outnet_tcp_cb (c=0x55cffeff7670, arg=0x55cffeff7540, error=<optimized out>, reply_info=0x0) at services/outside_network.c:1112
#5  0x00007f47446c7a14 in event_base_loop () from /lib64/libevent-2.0.so.5
#6  0x000055cf378569cc in comm_base_dispatch (b=<optimized out>) at util/netevent.c:246
#7  0x000055cf377d3779 in worker_work (worker=<optimized out>) at daemon/worker.c:1949
#8  0x000055cf377c7f31 in daemon_fork (daemon=daemon@entry=0x55cf390be030) at daemon/daemon.c:700
#9  0x000055cf377c3790 in run_daemon (need_pidfile=1, debug_mode=1, cmdline_verbose=0, cfgfile=0x55cf378765f0 "/etc/unbound/unbound.conf") at daemon/unbound.c:707
#10 main (argc=<optimized out>, argv=<optimized out>) at daemon/unbound.c:808
(gdb) list
157		return 0;
158	}
159	
160	int
161	reuse_cmp(const void* key1, const void* key2)
162	{
163		int r;
164		r = reuse_cmp_addrportssl(key1, key2);
165		if(r != 0)
166			return r;
(gdb) 

and

(gdb) thread 2
[Switching to thread 2 (Thread 0x7f4742218700 (LWP 5616))]
#0  0x000055cf3781a458 in rbtree_find_less_equal (rbtree=rbtree@entry=0x7f468e428198, key=key@entry=0x7f4675d87748, result=result@entry=0x7f4742217c20)
    at util/rbtree.c:527
527			r = rbtree->cmp(key, node->key);
(gdb) bt
#0  0x000055cf3781a458 in rbtree_find_less_equal (rbtree=rbtree@entry=0x7f468e428198, key=key@entry=0x7f4675d87748, result=result@entry=0x7f4742217c20)
    at util/rbtree.c:527
#1  0x000055cf3781a4ec in rbtree_search (rbtree=rbtree@entry=0x7f468e428198, key=key@entry=0x7f4675d87748) at util/rbtree.c:285
#2  0x000055cf3781a530 in rbtree_delete (rbtree=rbtree@entry=0x7f468e428198, key=key@entry=0x7f4675d87748) at util/rbtree.c:333
#3  0x000055cf3785f2a5 in reuse_tcp_remove_tree_list (outnet=0x7f468e428090, reuse=0x7f4675d87748) at services/outside_network.c:867
#4  0x000055cf37860d63 in decommission_pending_tcp (outnet=outnet@entry=0x7f468e428090, pend=pend@entry=0x7f4675d87730) at services/outside_network.c:922
#5  0x000055cf37860e72 in reuse_cb_and_decommission (outnet=outnet@entry=0x7f468e428090, pend=0x7f4675d87730, error=error@entry=-2)
    at services/outside_network.c:973
#6  0x000055cf37861400 in outnet_tcptimer (arg=0x7f466fa3eb30) at services/outside_network.c:2004
#7  0x00007f47446c7a14 in event_base_loop () from /lib64/libevent-2.0.so.5
#8  0x000055cf378569cc in comm_base_dispatch (b=<optimized out>) at util/netevent.c:246
#9  0x000055cf377d3779 in worker_work (worker=worker@entry=0x55cf39159e30) at daemon/worker.c:1949
#10 0x000055cf377c74cf in thread_start (arg=0x55cf39159e30) at daemon/daemon.c:540
#11 0x00007f474403fea5 in start_thread () from /lib64/libpthread.so.0
#12 0x00007f4743d689fd in ?? () from /lib64/libc.so.6
#13 0x0000000000000000 in ?? ()
(gdb) list
522		*result = NULL;
523		fptr_ok(fptr_whitelist_rbtree_cmp(rbtree->cmp));
524	
525		/* While there are children... */
526		while (node != RBTREE_NULL) {
527			r = rbtree->cmp(key, node->key);
528			if (r == 0) {
529				/* Exact match */
530				*result = node;
531				return 1;
(gdb)

At the time this core was taken, unbound had been running at 100% of CPU for about six hours (watchdogs turned off).
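
For illustration only, here is a sketch of how a descent loop shaped like the one at util/rbtree.c:526 could spin without ever returning. This is not Unbound's code, and it is not a diagnosis of this core. The node type, the `NIL` sentinel, the step bound, and the `search_healthy`/`search_corrupt` helpers are all hypothetical, and the branches after the exact-match case are an assumption, since the listing above stops before them. The point is only that such a loop terminates when every path reaches the sentinel, and does not when a child pointer leads back up the tree.

```c
#include <stddef.h>

/* Hypothetical, simplified node; not Unbound's rbnode type. */
struct node {
    int key;
    struct node *left;
    struct node *right;
};

/* Sentinel standing in for RBTREE_NULL. */
static struct node nil_node = { 0, &nil_node, &nil_node };
#define NIL (&nil_node)

/*
 * Find-less-or-equal descent with an added step bound.
 * Returns 1 on an exact match, 0 when the descent reaches NIL
 * without one, and -1 if NIL is not reached within max_steps.
 */
static int
find_less_equal_bounded(struct node *root, int key,
                        struct node **result, size_t max_steps)
{
    struct node *n = root;
    size_t steps = 0;

    *result = NULL;
    while (n != NIL) {
        if (steps++ >= max_steps)
            return -1;          /* an unbounded loop would spin here */
        if (key == n->key) {
            *result = n;        /* exact match */
            return 1;
        }
        if (key < n->key) {
            n = n->left;
        } else {
            *result = n;        /* best candidate so far */
            n = n->right;
        }
    }
    return 0;
}

/* Healthy two-node tree: 10 -> right -> 20. */
static int
search_healthy(void)
{
    struct node a = { 10, NIL, NIL };
    struct node b = { 20, NIL, NIL };
    struct node *res;

    a.right = &b;
    return find_less_equal_bounded(&a, 25, &res, 64);
}

/* Same tree, but 20's left child points back at 10 (a cycle). */
static int
search_corrupt(void)
{
    struct node a = { 10, NIL, NIL };
    struct node b = { 20, NIL, NIL };
    struct node *res;

    a.right = &b;
    b.left = &a;
    return find_less_equal_bounded(&a, 15, &res, 64);
}
```

In this sketch `search_healthy()` returns 0 because the descent reaches the sentinel, while `search_corrupt()` returns -1 because the search for 15 bounces between the two nodes until the bound trips. Without the bound, the second case would burn CPU indefinitely, which is at least consistent with the symptom described here.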

I'm afraid I was unable to attach the debugger during the hang itself, but will attempt that next time.

The configuration matches that of #411, and this hang occurs intermittently alongside the same segfaults reported in that issue.
