Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

crash due to rbtree_remove of a node never inserted #558

Closed
glebius opened this issue Oct 25, 2021 · 9 comments
Closed

crash due to rbtree_remove of a node never inserted #558

glebius opened this issue Oct 25, 2021 · 9 comments
Assignees

Comments

@glebius
Copy link

glebius commented Oct 25, 2021

Unbound crashes due to NULL pointer dereference. This definitely is associated with TCP requests, however my reading of the code says the bug is in the event library and is not specific to TCP.

#3  0x000000080115f531 in comm_point_stop_listening (c=0x801dd0000)
    at unbound/util/netevent.c:4107
#4  0x00000008011458fa in tcp_req_info_setup_listen (req=0x801c55100, req@entry=0x1)
    at unbound/services/listen_dnsport.c:1941
#5  0x0000000801145aaf in tcp_req_info_handle_readdone (req=<optimized out>)
    at unbound/services/listen_dnsport.c:2079
#6  0x00000008011619ca in tcp_callback_reader (c=c@entry=0x801dd0000)
    at unbound/util/netevent.c:1168
#7  0x000000080115f8bc in comm_point_tcp_handle_read (fd=fd@entry=0xd, c=c@entry=0x801dd0000, short_ok=short_ok@entry=0x0)
    at unbound/util/netevent.c:1758
#8  0x000000080115efe1 in comm_point_tcp_handle_callback (fd=0xd, event=<optimized out>, arg=0x801dd0000)
    at unbound/util/netevent.c:2152
#9  0x000000080115348c in handle_select (base=0x801c70780, wait=<optimized out>)
    at unbound/util/mini_event.c:220
#10 minievent_base_dispatch (base=0x801c70780) at unbound/util/mini_event.c:242
#11 0x000000000104077a in ub_event_base_dispatch (base=0x801ca10c0)
    at unbound/util/ub_event.c:280
#12 0x000000080115ce21 in comm_base_dispatch (b=<optimized out>)
    at unbound/util/netevent.c:256
#13 0x000000000104574e in worker_work (worker=<optimized out>)
    at unbound/daemon/worker.c:1940
#14 0x0000000001035563 in daemon_fork (daemon=<optimized out>, daemon@entry=0x801c1a000)
    at unbound/daemon/daemon.c:701
#15 0x0000000001040d4f in run_daemon (cfgfile=0x7fffffffeef7 "/unbound.conf", cmdline_verbose=0x0, debug_mode=0x0, need_pidfile=0x1)
    at unbound/daemon/unbound.c:736
#16 main (argc=<optimized out>, argv=<optimized out>) at unbound/daemon/unbound.c:837

Looking into the node we see it was not removed before, since removed node would have pointers set to <rbtree_null_node>.

(gdb) frame 0
(gdb) p *to_delete
$16 = {
  parent = 0x0,
  left = 0x0,
  right = 0x0,
  key = 0x801ca10c0,
  color = 0x0
}

Thus, node was never inserted into the tree.

The node insertion predicate is timeout pointer passed and EV_TIMEOUT flag set in event_add() at mini_event.c:311:

        if(tv && (ev->ev_events&EV_TIMEOUT)) {
#ifndef S_SPLINT_S
                struct timeval *now = ev->ev_base->time_tv;
                ev->ev_timeout.tv_sec = tv->tv_sec + now->tv_sec;
                ev->ev_timeout.tv_usec = tv->tv_usec + now->tv_usec;
                while(ev->ev_timeout.tv_usec >= 1000000) {
                        ev->ev_timeout.tv_usec -= 1000000;
                        ev->ev_timeout.tv_sec++;
                }
#endif
                (void)rbtree_insert(ev->ev_base->times, &ev->node);
        }

The node removal predicate is just the flag, see event_del() at mini_event.c:332.

So potentially we would create offending event if event_add() is ever called with NULL tv. This is possible in netevent.c:4165.

To reproduce
We don't have a reproduce case for this. Happens not so often at a large fleet of machines.

Expected behavior
Not crash!

System:
OS: FreeBSD 14
local-unbound -V
Version 1.13.1

Configure line: --with-ssl=/usr --with-libexpat=/usr --disable-dnscrypt --disable-dnstap --enable-ecdsa --disable-event-api --enable-gost --with-libevent --disable-subnet --disable-tfo-client --disable-tfo-server --with-pthreads--prefix=/usr --localstatedir=/var/unbound --mandir=/usr/share/man --build=freebsd
Linked libs: mini-event internal (it uses select), OpenSSL 1.1.1l-freebsd 24 Aug 2021
Linked modules: dns64 respip validator iterator

BSD licensed, see LICENSE in source package for details.
Report bugs to unbound-bugs@nlnetlabs.nl or https://github.com/NLnetLabs/unbound/issues

@gthess
Copy link
Member

gthess commented Oct 26, 2021

Hi there,

Thanks for reporting this!
If this is TCP related then 1.13.2 could help since it solved some cases with random crashes on the TCP reuse code.

You mention that this is maybe related to event handling but I don't have the time to look closer atm; will do later today or tomorrow. From a quick look I see that you compiled with libevent support but unbound couldn't find it and instead uses the builtin mini-event. Another option to test is to actually use libevent and see if you could hit the same bug.

If you can reproduce it in the meantime, that would be great!

@glebius
Copy link
Author

glebius commented Oct 26, 2021

I meant embedded minievent actually. Please take a look at the pull request, I think I have nailed the problem.

Can you please point at 1.13.2 commit that fixed random crashes in TCP reuse code?

@gthess
Copy link
Member

gthess commented Oct 26, 2021

I will. Could you reproduce (or just suffer) the problem in the meantime and verified that the change solves that?

Sure! There were a bunch of them (some also part of 1.13.1):

@glebius
Copy link
Author

glebius commented Oct 26, 2021

These mostly work on the reuse rbtree, and we have crash in timeout rbtree. Some, e.g. 7396eff we already have in our tree. We use the unbound that is supplied with FreeBSD.

@gthess
Copy link
Member

gthess commented Nov 1, 2021

I had a closer look at this and tried to reproduce it (not successfully). Indeed this is different than the issues fixed previously. This seems to be for in the out-of-order processing for clients, whereas the previous fixes were for the stream reuse for upstreams.

Were you able to stumble upon it again or reproduce it?
Could you provide frames 0-2 in your initial post and any logging at the time of the crash?
Also configuration may help if you are able to post it.

Calling ub_event_add() with NULL timeval is fine. In that case the event does not have the UB_EV_TIMEOUT bit set from the caller.

I wouldn't want to use the fix in the PR since it changes the event bits in the add routine which is something unexpected to calling code. But, did it solve the issue for you?

@glebius
Copy link
Author

glebius commented Nov 1, 2021

Frames 0-2. Not much info here...

(gdb) frame 0
#0  rbtree_delete (rbtree=0x801dc5880, key=key@entry=0x801ca10c0) at FreeBSD/contrib/unbound/util/rbtree.c:341
(gdb) info local
smright = 0x0
to_delete = 0x801ca10c0
child = <optimized out>
(gdb) p *to_delete
$1 = {
  parent = 0x0,
  left = 0x0,
  right = 0x0,
  key = 0x801ca10c0,
  color = 0x0
}
(gdb) frame 1
#1  0x0000000801153856 in minievent_del (ev=0x801ca10c0)at FreeBSD/contrib/unbound/util/mini_event.c:334
(gdb) info locals
No locals.
(gdb) p *ev
$3 = {
  node = {
    parent = 0x0,
    left = 0x0,
    right = 0x0,
    key = 0x801ca10c0,
    color = 0x0
  },
  added = 0x1,
  ev_base = 0x801c70780,
  ev_fd = 0xd,
  ev_events = 0x13,
  ev_timeout = {
    tv_sec = 0x61742a49,
    tv_usec = 0xeabf4
  },
  ev_callback = 0x80115eec0 <comm_point_tcp_handle_callback>,
  ev_arg = 0x801dd0000
}
(gdb) frame 2
#2  0x000000000104090a in ub_event_del (ev=0x801ca10c0) at FreeBSD/contrib/unbound/util/ub_event.c:395
(gdb) info locals
No locals.

@glebius
Copy link
Author

glebius commented Nov 1, 2021

I now have only single core and don't have logs. But I've been told that within fleet several machines had unbound crashed.
I yet can't confirm whether my patch helps or not. We have rather long release cycle, and patched version isn't yet widely deployed.

gthess added a commit that referenced this issue Nov 5, 2021
… reclaimed

  more than once during callbacks.
@gthess
Copy link
Member

gthess commented Nov 5, 2021

I believe I solved the issue with the above commits. For me the problem was really solved when I fixed the potential loop in the comm_point->tcp_free list. I also included changes to clear the UB_EV_TIMEOUT bit on event creation because it is a nice to have fix.

I am closing the related PR but I am leaving this issue open; it would be good if we get feedback that these changes solve the issue for you as well.

jedisct1 added a commit to jedisct1/unbound that referenced this issue Nov 18, 2021
* nlnet/master: (23 commits)
  Document PR NLnetLabs#563 to changelog
  Clarify KEEPALIVE EDNS0 option operation
  Make explicit whether edns options are parsed from queries or responses
  add missing return code
  Remove wrongly added EDE comments
  Update util/data/msgparse.c
  add potential EDE spots
  complete renaming of the modules edns list
  Apply suggestions from code review
  Changelog note for NLnetLabs#565 - Merge NLnetLabs#565: unbound.service.in: Disable ProtectKernelTunables again.
  - Fix to remove unused code from rpz resolve client and action   function.
  - Fix analyzer review failure in rpz action override code to not   crash on unlocking the local zone lock.
  - Fix for NLnetLabs#558: clear the UB_EV_TIMEOUT bit before adding an event.
  - Fix for NLnetLabs#558: fix loop in comm_point->tcp_free when a comm_point is reclaimed   more than once during callbacks.
  - Fix that forward-zone name is documented as the full name of the   zone. It is not relative but a fully qualified domain name.
  Disable ProtectKernelTunables again
  - Fix NLnetLabs#552: Unbound assumes index.html exists on RPZ host.
  Fix keepalive logic
  Move option handling to parse-time
  split edns_data.opt_list in opt_list_in and opt_list_out
  ...
@gthess gthess self-assigned this Mar 17, 2023
@gthess
Copy link
Member

gthess commented Mar 17, 2023

Closing this as inactive; the observed issue is already resolved.

@gthess gthess closed this as completed Mar 17, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants