Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Unbound 1.13.2 crashes due to p->pc is NULL in serviced_udp_callback #588

Closed
dyunwei opened this issue Dec 15, 2021 · 4 comments
Closed
Assignees

Comments

@dyunwei
Copy link
Contributor

dyunwei commented Dec 15, 2021

Below is a part of the call stack:

#0  0x00000000004f1cf7 in serviced_udp_callback (c=0x445ca490, arg=0x84ecef40, error=0, rep=0x7ffcfa077c40) at services/outside_network.c:3112
#1  0x00000000004ed10d in outnet_udp_cb (c=0x445ca490, arg=0x44286b50, error=0, reply_info=0x7ffcfa077c40) at services/outside_network.c:1424

At frame 0 in the call stack , i printed "struct pending* p" as below:

(gdb) p	*p
$2 = {node = {parent = 0x442f84d0, left = 0x0, right = 0x0, key = 0x863e1ca0, color = 1 '\001'}, id = 7761632, addr = {ss_family = 16520,
    __ss_padding = "...", '\000' <repeats 47 times>, __ss_align = 0}, addrlen = 0, pc = 0x0, timer = 0x0, cb = 0x0, cb_arg = 0x0, outnet = 0x0,	sq = 0x0, next_waiting = 0x0,
  timeout = 16,	pkt = 0x71d54668 "j\210", pkt_len = 56}

From this display, it can be known that the p->pc is NULL which causes the crash.

Looking into frame 1 in the call stack, and printed "struct pending* p" and "p->sq" in outnet_udp_cb.

(gdb) p	*p
$32 = {node = {parent = 0x766ee0 <rbtree_null_node>, left = 0x766ee0 <rbtree_null_node>, right = 0x766ee0 <rbtree_null_node>, key = 0x7c3d7ea0,	color = 0 '\000'}, id = 16484, addr = {ss_family = 2,
    __ss_padding = "\000\065\300)\242\036", '\000' <repeats 111	times>,	__ss_align = 0}, addrlen = 16, pc = 0x445ca460,	timer = 0x56b6c180, cb = 0x4f1c8b <serviced_udp_callback>, cb_arg = 0x84ecef40,
  outnet = 0x44286b50, sq = 0x84ecef40,	next_waiting = 0x0, timeout = 0, pkt = 0x0, pkt_len = 0}

(gdb) p	*p->sq
$30 = {node = {parent = 0x6eecc6f0, left = 0x766ee0 <rbtree_null_node>,	right = 0x766ee0 <rbtree_null_node>, key = 0x84ecef40, color = 1 '\001'}, qbuf = 0x5faf9550 "",	qbuflen = 43, dnssec = 32784,
  want_dnssec =	1, nocaps = 0, tcp_upstream = 0, ssl_upstream = 0, tls_auth_name = 0x0,	addr = {ss_family = 2,
    __ss_padding = "...", __ss_align = 6364733720421584647}, addrlen = 16,
  zone = 0x68a30580 "\003com", zonelen = 5, qtype = 65,	status = serviced_query_TCP_EDNS, to_be_deleted = 0, retry = 0,	last_sent_time = {tv_sec = 1639352958, tv_usec = 178860}, last_rtt = 383,
  edns_lame_known = 1, opt_list = 0x0, outnet = 0x44286b50, cblist = 0x69b1a6c0, pending = 0x71d54530, padding_block_size = 0}

From the two displays above, i find two strange questions:

  1. the value of p is 0x7c3d7ea0, but p->sq->pending is 0x71d54530, shouldn't the two pointers always be the same value?
  2. the p->sq->status is "serviced_query_TCP_EDNS", shoudn't it always be serviced_query_UDP_* in udp callback?

By the way, it is not easy to reproduce the problem.

@susnux
Copy link

susnux commented Dec 22, 2021

This issue is reported by some users of openDKIM, there is a stack trace on the openSUSE bugzilla.

The crash happens on

struct port_if* pi = p->pc->pif;

Maybe this is the UDP equivalent of 71f311d ?

@dyunwei
Copy link
Contributor Author

dyunwei commented Dec 24, 2021

Thank you for your reply. Indeed, the crashes experienced by the user of openDKIM and i seemed to occur in the same place.

The null pointer assignment can be easily fixed through a method similar to 71f311d that you mentioned.

However the two questions that i mentioned above still remain. i am not sure whether there is an unintentional purpose causing the null pointer.

@gthess gthess self-assigned this Jan 25, 2022
@gthess gthess closed this as completed in 5c85615 Jan 25, 2022
@gthess
Copy link
Member

gthess commented Jan 25, 2022

The above commit solves the issue for legitimate cases where p->pc is NULL i.e., when a UDP query waiting to be sent fails before sending.
The above PR should solve the strange cases:

From the two displays above, i find two strange questions:

  1. the value of p is 0x7c3d7ea0, but p->sq->pending is 0x71d54530, shouldn't the two pointers always be the same value?

When a UDP query needs to fallback to TCP, sq->pending will point to the new pending which is now struct waiting_tcp. But the UDP pending will be deleted right after that.

  1. the p->sq->status is "serviced_query_TCP_EDNS", shoudn't it always be serviced_query_UDP_* in udp callback?

It makes sense to be "serviced_query_TCP_EDNS" when falling back to TCP but it doesn't make sense to appear in the udp callback.

I believe the last two cases above are a byproduct of the race condition bug that could lead to memory corruption.

jedisct1 added a commit to jedisct1/unbound that referenced this issue Jan 25, 2022
* nlnet/master:
  Changelog note for NLnetLabs#612: - Merge PR NLnetLabs#612: TCP race condition.
  - Fix NLnetLabs#588: Unbound 1.13.2 crashes due to p->pc is NULL in   serviced_udp_callback.
  - Better bookkeeping when reclaiming the TCP buffer.
  - Mark waiting_tcp and serviced_query as being in the   cb_and_decommission stage to signal later code about their state;   prevents premature item deletion.
  Changelog note for NLnetLabs#610 - Fix NLnetLabs#610: Undefine-shift in sldns_str2wire_hip_buf.
  - Fix NLnetLabs#610: Undefine-shift in sldns_str2wire_hip_buf.
  - Add serviced_query timer to send upstream queries outside of the mesh   flow to prevent race conditions.
  - For dnstap, do not wakeupnow right there. Instead zero the timer to   force the wakeup callback asap.
  - For NLnetLabs#602: Allow the module-config "subnetcache validator cachedb   iterator".
  - Add rpz: for-downstream: yesno option, where the RPZ zone is   authoritatively answered for, so the RPZ zone contents can be   checked with DNS queries directed at the RPZ zone.
  Changelog note for NLnetLabs#605: - Merge PR NLnetLabs#605: Fix EDNS to upstream where the same option could be   attached more than once.
  - Make sure callback changes for EDNS are not lost.
  - Fix EDNS to upstream where the same option could be attached more than   once. - Add a region to serviced_query for allocations.
@gthess
Copy link
Member

gthess commented Jan 26, 2022

Just to note that the crash part of the bug only relates to dnstap configured Unbound.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants