Unbound-1.13.1 crashed by SIGABRT #469
Comments
Hello, thanks for the backtrace! This seems to be the same as #411 and #439, and the backtrace will hopefully help narrow down the cause of the corruption.
No, scratch that, this is just a node and having
Btw, can you somehow reproduce the issue reliably-ish?
In my case I can see that key=NULL too:
By the way, I have another core dump where the key is NULL: this node shows up in the while(node != RBTREE_NULL) loop (in the rbtree_insert() function in util/rbtree.c), and the program terminates abnormally when rbtree->cmp(data->key, node->key) is called in frame 0.
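To illustrate why a node with key=NULL is fatal on the insert path, here is a minimal sketch of a tree descent that calls a comparator on every node it passes. It is not Unbound's rbtree code: the names `demo_node`, `demo_insert` and `demo_str_cmp` are hypothetical, there is no coloring or rebalancing, and the NULL-key check exists only to make the corrupted state visible instead of crashing.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical, simplified node: no colors or rebalancing, only the descent. */
struct demo_node {
    struct demo_node *left, *right;
    const void *key;
};

typedef int (*demo_cmp_fn)(const void *a, const void *b);

/* Comparator that dereferences both keys, as a real comparator would. */
static int demo_str_cmp(const void *a, const void *b)
{
    return strcmp((const char *)a, (const char *)b);
}

/*
 * Insert `data` below `*root`. Returns 1 on success, 0 for a duplicate key,
 * and -1 if a node with a NULL key is met on the descent path. Without that
 * check the comparator would dereference NULL, which is the crash pattern
 * described above.
 */
static int demo_insert(struct demo_node **root, struct demo_node *data,
                       demo_cmp_fn cmp)
{
    struct demo_node **link = root;

    assert(data != NULL && data->key != NULL);
    while (*link != NULL) {
        int r;
        if ((*link)->key == NULL)
            return -1;          /* stale node is still linked in the tree */
        r = cmp(data->key, (*link)->key);
        if (r == 0)
            return 0;
        link = (r < 0) ? &(*link)->left : &(*link)->right;
    }
    data->left = data->right = NULL;
    *link = data;
    return 1;
}
```

The point of the sketch is that the comparator trusts every node that is still reachable, so a node whose key was cleared without unlinking it only fails later, on an unrelated insert.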
It would be hard. Unbound crashes under massive TCP requests, when the server generates a huge number of iterative requests to upstreams. So I will need to study this traffic first in order to reproduce the issue in my sandbox.
There is a possible fix on master branch (ff6b527) for this.
Hi @iruzanov, just checking if you were able to test with the aforementioned fix.
Hello colleagues! I will definitely test your patch on my loaded Unbound resolvers this week, and then I will report the results. Thank you very much for the help!
* nlnet/master:
- zonemd-check: yesno option, default no, enables the processing of ZONEMD records for that zone.
- Merge NLnetLabs#496 from banburybill: Use build system endianness if available, otherwise try to work it out.
- For NLnetLabs#492: Fix font highlighting for the man page on emacs.
- Fix NLnetLabs#492: module-config respip missing in unbound.conf.5.in man page. Merges NLnetLabs#494 from he32. Remove comment line (?) from man page. Transplant parts of the contributed RPZ documentation.
- Move the NSEC3 max iterations count in line with the 150 value used by BIND, Knot and PowerDNS. This sets the default value for it in the configuration to 150 for all key sizes.
- Test code has -q option for quiet output.
- Fix for NLnetLabs#411, NLnetLabs#439, NLnetLabs#469: Reset the DNS message ID when moving queries between TCP streams.
- Refactor for uniform way to produce random DNS message IDs. Fix date in changelog.
- Fix NLnetLabs#489: Compile using MSYS2 MinGW 64-bit.
- Fix that auth-zone zonefiles use last TTL if no TTL is specified. Changelog note for NLnetLabs#487
- Merge PR NLnetLabs#487: ifdef RLIMIT_AS in recently added check.
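The changelog entry "Reset the DNS message ID when moving queries between TCP streams" can be pictured with the sketch below. It is an illustration under assumptions, not the actual patch: `demo_stream`, `demo_id_taken`, `demo_id_set` and `demo_move_query` are hypothetical names, a bitmap stands in for whatever per-stream ID tree the real code keeps, and `rand()` stands in for a proper random source. The idea shown is only that an ID which was unique on the old stream may collide on the new one, so a fresh ID is chosen and written into the packet header.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical per-stream state: a bitmap of the 16-bit IDs in flight. */
struct demo_stream {
    uint8_t in_use[65536 / 8];
    unsigned count;
};

static int demo_id_taken(const struct demo_stream *s, uint16_t id)
{
    return (s->in_use[id >> 3] >> (id & 7)) & 1;
}

static void demo_id_set(struct demo_stream *s, uint16_t id, int on)
{
    uint8_t bit = (uint8_t)(1u << (id & 7));
    if (on)
        s->in_use[id >> 3] |= bit;
    else
        s->in_use[id >> 3] &= (uint8_t)~bit;
}

/*
 * Move the query in `pkt` (DNS wire format, ID in the first two bytes)
 * from `src` to `dst`. Returns the new ID, or -1 if `dst` has no free ID,
 * in which case nothing is changed.
 */
static int demo_move_query(struct demo_stream *src, struct demo_stream *dst,
                           uint8_t *pkt)
{
    uint16_t old_id = (uint16_t)((pkt[0] << 8) | pkt[1]);
    unsigned start, i;

    assert(demo_id_taken(src, old_id));
    if (dst->count >= 65536)
        return -1;
    demo_id_set(src, old_id, 0);
    src->count--;

    /* rand() is only a stand-in; a resolver would use a secure generator. */
    start = (unsigned)rand();
    for (i = 0; i < 65536; i++) {
        uint16_t id = (uint16_t)((start + i) & 0xffff);
        if (!demo_id_taken(dst, id)) {
            demo_id_set(dst, id, 1);
            dst->count++;
            pkt[0] = (uint8_t)(id >> 8);
            pkt[1] = (uint8_t)(id & 0xff);
            return id;
        }
    }
    return -1; /* not reached while dst->count < 65536 */
}
```

Probing linearly from a random start guarantees termination; keeping the old ID instead would risk two queries with the same ID on one stream, which is one way a lookup structure keyed by ID could end up inconsistent.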
Today I patched one of my loaded Unbound resolvers. My plan is:
Thank you very much for the big help!
The patched Unbound on one of the most heavily loaded servers is still running without any core dumps. Looks good. In the coming days I am going to patch more loaded servers.
Thanks for letting us know, very good news!
Hi!
I have patched three more Unbound resolvers that run under heavy load. None of the processes has crashed with a core dump. I would like some more time to observe the patched software, and then I will let you know my final conclusion.
Hello colleagues! Unfortunately I have detected core dumps from two Unbound processes.
I.e. we are dealing with the same issue related to key2=NULL:
I applied the patch from this resource:
Oh, bad news but thanks for the information as always!
I will dig through the Unbound logs. Maybe I will find some common points that describe the traffic under which the program crashed. I will also recompile libevent and Unbound itself with the -O0 flag to get the maximum debug info.
Would you be able to share your config file?
python: dnstap: remote-control: stub-zone:
As for traffic similarity: I have grepped the Unbound logs on two servers. The log records (tons of them) written just before Unbound crashed and dumped core look like:
I.e. these are tons of outbound requests over TCP to upstreams (probably fake upstreams obtained from crafted DNS packets sent by spoofed clients or intruders). The code responsible for this functionality is in the outside_network.c source file.
Hello, colleagues! Last weekend I detected that Unbound had crashed, and there is an interesting core dump: (gdb) fr 1 As we can see, there are nodes colored BLACK with non-NULL keys and one node colored RED with a NULL key.
And one more thing: our traffic includes tons of queries to which Unbound replies with SERVFAIL status.
Do you mean that you see an increase of SERVFAIL responses with the affected versions or that your usual traffic includes a lot of SERVFAIL responses?
The second: our usual traffic includes a lot of SERVFAIL responses.
Hello colleagues! Last week I patched 6 servers. They are working FINE for now! ;) So my plan is to upgrade Unbound on all the other servers and monitor my service during this week. Next Monday I will report the results.
Hi Igor, this is good news for now!
Yes, I will! Is this a normal situation?
Hello, colleagues!
Hi Igor, great news! The error message is a bit worrying since it reveals a situation that shouldn't happen normally. But the non-crashing part seems to indicate that this may happen during a callback and no extra harm at this point (like unbound/services/outside_network.c Line 1264 in 8e538dc
unbound/services/outside_network.c Line 2590 in 8e538dc
Unfortunately I haven't come across this in any of my tests to provide more useful information. Before debugging further I would wait for you to test the full 1.13.2 release because as I understand, you test 1.13.1 with patches from the related commits currently. If you want to debug further you can configure unbound with This WILL NOT produce a core dump. Lines 222 to 226 in 8e538dc
Thank you very much! Well, tomorrow I will be on vacation until August 29. On Monday, August 30, I will return and start step 1: upgrade Unbound to 1.13.2. I hope this release will be available ;)
Actually it was released not more than 30 minutes ago :) Thank you for your help thus far and enjoy your vacation!
Hello colleagues! It is my pleasure to use Unbound 1.13.2, and the program is working fine, except for one more bug that I detected a month ago: Unbound crashes with SIGSEGV. So I recompiled the program with the following flags: "-g -O0 -fsanitize=address -fno-omit-frame-pointer". Below is the full stack trace of the problem thread caught by libasan: (gdb) thr 3 Now, if we look at the point in the thread where the sanitizer terminated Unbound: we see that the thread crashes in the for() loop (line 577 in util/rbtree.c). As far as I understand, free() was called somewhere before this loop, so the operation node = node->right provoked the crash. Could you please look at the stack trace above? Maybe it will point to the place in your code with the root cause. Big thanks in advance!
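The pattern suspected here (an element is freed while a loop still needs its link to the next element) can be shown on a much simpler structure. The sketch below uses a singly linked list instead of a red-black tree, and all names (`demo_item`, `demo_push`, `demo_prune`) are hypothetical. It shows the safe ordering: read the successor first, then free.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical list element; `expired` marks entries that must be freed. */
struct demo_item {
    struct demo_item *next;
    int expired;
};

/* Push a new element at the head; returns it, or NULL if malloc fails. */
static struct demo_item *demo_push(struct demo_item **head, int expired)
{
    struct demo_item *it = malloc(sizeof(*it));
    if (it == NULL)
        return NULL;
    it->expired = expired;
    it->next = *head;
    *head = it;
    return it;
}

/* Unlink and free all expired elements; returns how many were freed. */
static size_t demo_prune(struct demo_item **head)
{
    struct demo_item **link = head;
    size_t freed = 0;

    assert(head != NULL);
    while (*link != NULL) {
        struct demo_item *cur = *link;
        if (cur->expired) {
            *link = cur->next;  /* read the successor BEFORE free() */
            free(cur);          /* cur->next must not be touched after this */
            freed++;
        } else {
            link = &cur->next;
        }
    }
    return freed;
}
```

Writing the loop as `for (cur = head; cur; cur = cur->next)` with a `free(cur)` in the body would read freed memory in the step expression, which is the kind of access AddressSanitizer reports as a heap use-after-free.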
Hi @iruzanov!
Sure, I am totally at your service! ;)
Hello, @gthess! A couple of days ago I received an email with the subject "Unbound DoS vulnerability (only with debugging enabled!)" from your colleagues. In short, it is a warning to use "--disable-debug --disable-checking" when compiling Unbound, so that the resolver does not terminate when it receives a crafted packet from an upstream. It affects versions 1.13.x and 1.14.0 of the program. So my question is: does it relate to the race conditions in the TCP code that you mentioned above? And are there any ready fixes that I could test myself in the Unbound (1.13.2) running on my servers? ;)
Using '--disable-debug --disable-checking' explicitly turns off debugging (if you are not sure what is happening with the configure line in your environment). If you don't specify them at all, debugging is off by default.
Thanks for the detailed clarification!
Hi @iruzanov,
Hello @gthess! I will gladly test your fixes mentioned above, BIG thank you! And next week I will report my first observations of Unbound on a couple of my loaded servers.
Just a heads up that the branch is now merged to master which also includes other (relevant) fixes.
Ohh, ok! So I need to recompile my first patched Unbound again, because yesterday I fetched the master branch with the fixes from #612 :)
I would just go for the master branch as is since it contains other fixes. For this case, 8e76eb9 addresses a dnstap issue that could trigger something similar.
Thank you @gthess!
Hello @gthess! My first resolver with the new code from the master branch has worked fine over the last week. And today I upgraded the next two of my loaded resolvers to the latest code.
Hi, @gthess!
That's always good to hear, thanks for always reporting back!
Hi, @gthess! Sorry for the delay. I was just waiting for a core dump on one of the loaded resolvers that I had not patched yet. And today such a resolver crashed ;) So, a few hours ago I patched the next three loaded resolvers. The resolvers patched last week work fine.
Hello, @gthess!
Hi! Btw, are you subscribed to the unbound-users mailing list? Releases (and release candidates) are announced early there.
Thank you for your answer! No, I am not subscribed yet. Usually I go to your website to check the latest news about NSD and Unbound ;) OK, I will.
In that case nsd-users can also be useful to you.
Closing this as resolved by now.
Hello, Wouter!
I am actively using unbound-1.13.1 (with our DNSTAP patches, issue #367). Sometimes my Unbound crashes under high load with massive recursive TCP requests. All of the abnormal terminations are caused by code in services/outside_network.c. I now have one such core dump:
(gdb) bt
#0 0x0000000800955c2a in thr_kill () from /lib/libc.so.7
#1 0x0000000800954084 in raise () from /lib/libc.so.7
#2 0x00000008008ca279 in abort () from /lib/libc.so.7
#3 0x0000000800464641 in ?? () from /usr/local/lib/libevent-2.1.so.7
#4 0x0000000800464939 in event_errx () from /usr/local/lib/libevent-2.1.so.7
#5 0x000000080045ec54 in evmap_io_del_ () from /usr/local/lib/libevent-2.1.so.7
#6 0x0000000800457e8f in event_del_nolock_ () from /usr/local/lib/libevent-2.1.so.7
#7 0x000000080045ada8 in event_del () from /usr/local/lib/libevent-2.1.so.7
#8 0x000000000030e25b in ub_event_del (ev=<optimized out>) at ./util/ub_event.c:395
#9 comm_point_close (c=0xdc97b7c00) at ./util/netevent.c:3860
#10 0x0000000000315bab in decommission_pending_tcp (outnet=<optimized out>, pend=0xdc9494980)
at ./services/outside_network.c:945
#11 0x00000000003147d6 in reuse_cb_and_decommission (outnet=0x18e75, pend=0x6, error=-2)
at ./services/outside_network.c:986
#12 0x0000000000317491 in outnet_tcptimer (arg=0xee67c2300) at ./services/outside_network.c:2033
#13 0x000000080045e0ed in ?? () from /usr/local/lib/libevent-2.1.so.7
#14 0x000000080045a09c in event_base_loop () from /usr/local/lib/libevent-2.1.so.7
#15 0x000000000024dc54 in thread_start (arg=0x8014c0800) at ./util/ub_event.c:280
#16 0x0000000800780fac in ?? () from /lib/libthr.so.3
#17 0x0000000000000000 in ?? ()
Backtrace stopped: Cannot access memory at address 0x7fffdf7fa000
(gdb)
If we enter frame 12 (outnet_tcptimer) and print the pend structure, we see the following:
(gdb) print pend
$15 = (struct pending_tcp *) 0x6
(gdb) print *pend
Cannot access memory at address 0x6
(gdb)
And this corrupt pend structure is passed to the reuse_cb_and_decommission() function (frame 11) and further up the stack trace above.
In the outnet_tcptimer() function we can see the following code (in services/outside_network.c):
/* it was in use */
struct pending_tcp* pend=(struct pending_tcp*)w->next_waiting;
But w->next_waiting is of type waiting_tcp:
(gdb) print w->next_waiting
$18 = (struct waiting_tcp *) 0xdc9494980
(gdb)
So my question: is the type cast in the outnet_tcptimer() function correct? And does this corrupt pend structure cause event_errx() in libevent?
If it helps, I found a structure of type pending_tcp in the w structure:
(gdb) print w->outnet->tcp_free
$23 = (struct pending_tcp *) 0xdc9494980
(gdb)
(gdb) print *w->outnet->tcp_free
$24 = {next_free = 0xdc9493e40, pi = 0xd7da2c000, c = 0xdc97b7c00, query = 0x0, reuse = {node = {parent = 0xdc94953a0,
left = 0x3287d0 <rbtree_null_node>, right = 0x3287d0 <rbtree_null_node>, key = 0x0, color = 1 '\001'}, addr = {
ss_len = 0 '\000', ss_family = 2 '\002', __ss_pad1 = "\000\065X\320\017\067", __ss_align = 0,
__ss_pad2 = "\000\000\000\000\000\000\000\016", '\000' <repeats 103 times>}, addrlen = 16, is_ssl = 0,
lru_next = 0xdc9494ae0, lru_prev = 0x0, item_on_lru_list = 0, pending = 0xdc9494980, cp_more_read_again = 0,
cp_more_write_again = 0, tree_by_id = {root = 0x3287d0 <rbtree_null_node>, count = 0,
cmp = 0x3133e0 <reuse_id_cmp>}, write_wait_first = 0x0, write_wait_last = 0x0, outnet = 0xd7d805000}}
(gdb)
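One possible reading of the cast, offered as an assumption rather than a statement about Unbound's design: the next_waiting field may be deliberately dual-use, holding a list link while the query waits and a pointer to the owning pending_tcp once the query is in flight. That would fit the output above, where w->next_waiting, w->outnet->tcp_free and the pend argument in frame 10 all have the value 0xdc9494980, so pend=0x6 in frame 11 might be an artifact of optimized code rather than real corruption. The sketch below shows the dual-use pattern with hypothetical names (`demo_waiting`, `demo_pending`, `demo_attach`, `demo_owner`).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the structure that owns an open TCP stream. */
struct demo_pending {
    struct demo_pending *next_free;
    int fd;
};

/* Hypothetical stand-in for a query that waits for, or uses, a stream. */
struct demo_waiting {
    /*
     * Dual-use field (assumption): a list link while on the waiting list,
     * otherwise the owning demo_pending stored behind a cast.
     */
    struct demo_waiting *next_waiting;
    int on_waiting_list;
};

/* Hand the query to a stream: the field now stores the owner. */
static void demo_attach(struct demo_waiting *w, struct demo_pending *p)
{
    assert(w != NULL && p != NULL);
    w->next_waiting = (struct demo_waiting *)p;
    w->on_waiting_list = 0;
}

/* Return the owning stream, or NULL while the query is still waiting. */
static struct demo_pending *demo_owner(const struct demo_waiting *w)
{
    if (w->on_waiting_list)
        return NULL;
    return (struct demo_pending *)w->next_waiting;
}
```

A debugger always prints such a field with its declared type, so `print w->next_waiting` would show `struct waiting_tcp *` even when the value really is a pending_tcp pointer. The cast is only safe if the flag and the field are always updated together.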
Big thank you in advance!
PS: I did not send the core file itself because it is 31GB in size.