Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

unbound crashing with SIGABRT #38

Closed
bleve opened this issue Jun 6, 2019 · 39 comments
Closed

unbound crashing with SIGABRT #38

bleve opened this issue Jun 6, 2019 · 39 comments

Comments

@bleve
Copy link

bleve commented Jun 6, 2019

Jun 06 07:46:51 mx3-fi1 unbound[20686]: *** stack smashing detected ***: /usr/sbin/unbound terminated
Jun 06 07:46:51 mx3-fi1 unbound[20686]: ======= Backtrace: =========
Jun 06 07:46:51 mx3-fi1 unbound[20686]: /lib64/libc.so.6(__fortify_fail+0x37)[0x7fcfe5cf0b67]
Jun 06 07:46:51 mx3-fi1 unbound[20686]: /lib64/libc.so.6(+0x117b22)[0x7fcfe5cf0b22]
Jun 06 07:46:51 mx3-fi1 unbound[20686]: /usr/sbin/unbound(+0x2f351)[0x55cc1f6e1351]
Jun 06 07:46:51 mx3-fi1 unbound[20686]: /usr/sbin/unbound(+0x36822)[0x55cc1f6e8822]
Jun 06 07:46:51 mx3-fi1 unbound[20686]: /usr/sbin/unbound(+0x2e703)[0x55cc1f6e0703]
Jun 06 07:46:51 mx3-fi1 unbound[20686]: /usr/sbin/unbound(+0x2a23e)[0x55cc1f6dc23e]
Jun 06 07:46:51 mx3-fi1 unbound[20686]: /usr/sbin/unbound(+0x3cdd8)[0x55cc1f6eedd8]
Jun 06 07:46:51 mx3-fi1 unbound[20686]: /usr/sbin/unbound(iter_operate+0x335)[0x55cc1f6efb05]
Jun 06 07:46:51 mx3-fi1 unbound[20686]: /usr/sbin/unbound(+0x4d782)[0x55cc1f6ff782]
Jun 06 07:46:51 mx3-fi1 unbound[20686]: /usr/sbin/unbound(worker_handle_request+0x1b45)[0x55cc1f6da055]
Jun 06 07:46:51 mx3-fi1 unbound[20686]: /usr/sbin/unbound(+0xc91db)[0x55cc1f77b1db]
Jun 06 07:46:51 mx3-fi1 unbound[20686]: /usr/sbin/unbound(+0xc5868)[0x55cc1f777868]
Jun 06 07:46:51 mx3-fi1 unbound[20686]: /usr/sbin/unbound(comm_point_tcp_handle_callback+0xf4)[0x55cc1f777ba4]
Jun 06 07:46:51 mx3-fi1 unbound[20686]: /lib64/libevent-2.0.so.5(event_base_loop+0x774)[0x7fcfe6a00a14]
Jun 06 07:46:51 mx3-fi1 unbound[20686]: /usr/sbin/unbound(+0xc2eac)[0x55cc1f774eac]
Jun 06 07:46:51 mx3-fi1 unbound[20686]: /usr/sbin/unbound(+0x1ddf1)[0x55cc1f6cfdf1]
Jun 06 07:46:51 mx3-fi1 unbound[20686]: /usr/sbin/unbound(+0x1995f)[0x55cc1f6cb95f]
Jun 06 07:46:51 mx3-fi1 unbound[20686]: /lib64/libc.so.6(__libc_start_main+0xf5)[0x7fcfe5bfb495]
Jun 06 07:46:51 mx3-fi1 unbound[20686]: /usr/sbin/unbound(+0x1a542)[0x55cc1f6cc542]

This is new issue, This didn't happen yet with svn trunk@5175 but happens with git a4f4d7b

@bleve
Copy link
Author

bleve commented Jun 6, 2019

svn trunk@5175 is git 2a78803

@wcawijngaards
Copy link
Member

Hi bleve,
Could you get a stack trace with debug symbols available? (perhaps install the debug symbol package). Or compile with debug symbols. Is this reproducible somehow, eg. in valgrind or gdb? If so that would be nice, or perhaps verbosity 4 logging of the events preceding the failure?

@bleve
Copy link
Author

bleve commented Jun 6, 2019

Problem is I only see that on heavily loaded production email servers where I can't experiment. Upgraded to new snapshot from previous one last night and more than half of the servers had unbound crash this morning.

@wcawijngaards
Copy link
Member

The debug symbol package should only affect the printout and not the running server. On some package systems this is just a package you can install next to it, with the debug symbols, it would make all the anonymous unbound+xx functions have a name.

@bleve
Copy link
Author

bleve commented Jun 6, 2019

I know - but I can't run crashing version on production. I downgaded already to previous working commit.

@bleve
Copy link
Author

bleve commented Jun 6, 2019

Ok. I downgraded unbound on one server to crashing version and installed debuginfos - we can only hope it hits the issue.

@wcawijngaards
Copy link
Member

Thanks!

@bleve
Copy link
Author

bleve commented Jun 6, 2019

Crahser happened this fast first time but no better backtrace, installing abrtd to get proper coredump.

@bleve
Copy link
Author

bleve commented Jun 6, 2019

(gdb) bt full
#0 0x00007f0d130b22c7 in __GI_raise (sig=sig@entry=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:55
resultvar = 0
pid = 25268
selftid = 25273
#1 0x00007f0d130b39b8 in __GI_abort () at abort.c:90
save_stage = 2
act = {__sigaction_handler = {sa_handler = 0x7f0d13204071, sa_sigaction = 0x7f0d13204071}, sa_mask = {__val = {3, 139694080928315, 5, 139694132179495, 1,
139694132188209, 3, 139694080928292, 12, 139694132188213, 2, 139694132188213, 2, 139694080928976, 139694080928976, 139694080930736}}, sa_flags = 4,
sa_restorer = 0x7f0d10122ed0}
sigs = {__val = {32, 0 <repeats 15 times>}}
#2 0x00007f0d130f4e17 in __libc_message (do_abort=do_abort@entry=2, fmt=fmt@entry=0x7f0d13205492 "*** %s ***: %s terminated\n")
at ../sysdeps/unix/sysv/linux/libc_fatal.c:196
ap = {{gp_offset = 32, fp_offset = 48, overflow_arg_area = 0x7f0d101233c0, reg_save_area = 0x7f0d101232d0}}
ap_copy = {{gp_offset = 16, fp_offset = 48, overflow_arg_area = 0x7f0d101233c0, reg_save_area = 0x7f0d101232d0}}
fd = 2
on_2 =
list =
nlist =
cp =
written =
#3 0x00007f0d13193b67 in __GI___fortify_fail (msg=msg@entry=0x7f0d1320547a "stack smashing detected") at fortify_fail.c:30
No locals.
#4 0x00007f0d13193b22 in __stack_chk_fail () at stack_chk_fail.c:28
No locals.
#5 0x0000556a51981351 in dname_query_hash (dname=0x7f0cf875fcb2 "", h=654970236) at util/data/dname.c:295
labuf = "\240\373u\370\f\177\000\000"\000\000\000\000\000\000\000\020\000\001", '\000' <repeats 14 times>, "\001", '\000' <repeats 29 times>
lablen =
i =
#6 0x0000000000000000 in ?? ()
No symbol table info available.

@wcawijngaards
Copy link
Member

Hi bleve,
Another person has reported a similar bug report. His bug report included information that makes me conclude there was a build failure. Did you make clean? Can you rebuild in a new directory? Can you disable link time optimizations with --disable-flto option for configure. Or disable more optimizations with CFLAGS="-g" and (if you passed custom compile flags) not pass other optimizer flags?
That should give a cleaner stack trace or maybe fix the bug if it is related to compiler processing. With CFLAGS="-g3" for configure there is even more debug information. The performance is not really that different (I mean, not like high levels of logging and so on, the compiler optimizer is obviously good at making it faster, but the non-optimized code can run too).

@bleve
Copy link
Author

bleve commented Jun 6, 2019

I build rpm packages in mock build which is completely clean for all builds.

@bleve
Copy link
Author

bleve commented Jun 6, 2019

And previous commit id works just fine with exactly same build falgs.

@wcawijngaards
Copy link
Member

That is excellent. So I would like to disable the optimizations, eg. custom CFLAGS with optimization options, and with --disable-flto passed to configure.... what really?

@wcawijngaards
Copy link
Member

You mean that last commit, I mean that one commit is the cause of the problem? Or you mean the older version that you cited above built fine. Yeah I guess the code changed in the meantime...

@bleve
Copy link
Author

bleve commented Jun 6, 2019

I mean the previous non-problematic commit id works.

@wcawijngaards
Copy link
Member

Ok, but I still think it could be the optimizer. Could try optimization options disabled. Or could try to 'bisect' to search which commit(s) cause the problem, eg. what caused the problem to appear.

@bleve
Copy link
Author

bleve commented Jun 6, 2019

Used build flags are:

CFLAGS='-O2 -g -pipe -Wall -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -fstack-protector-strong --param=ssp-buffer-size=4 -grecord-gcc-switches -specs=/usr/lib/rpm/redhat/redhat-hardened-cc1 -m64 -mtune=generic -fPIE -pie'
CXXFLAGS='-O2 -g -pipe -Wall -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -fstack-protector-strong --param=ssp-buffer-size=4 -grecord-gcc-switches -specs=/usr/lib/rpm/redhat/redhat-hardened-cc1 -m64 -mtune=generic -fPIE -pie'
LDFLAGS='-Wl,-z,relro,-z,now -pie'

@bleve
Copy link
Author

bleve commented Jun 6, 2019

I cna bisect that but it takes some time, it takes about one hour to crash and there is about five steps to bisect.

@wcawijngaards
Copy link
Member

Okay thanks for the build flags. So the --disable-flto flag to configure disables another optimization argument that is added by configure and not in the list put there. Trying that or the bisect could maybe show up something, or show a better stack trace (i.e. #6 is not ??).

@bleve
Copy link
Author

bleve commented Jun 7, 2019

I removed -O2 and added --disable-flto for bisect builds. d4f697f was good.

@bleve
Copy link
Author

bleve commented Jun 7, 2019

Tested with latest git to make sure I get crash with new build options.

(gdb) bt full
#0 x86_64_fallback_frame_state (context=0x7f5f300758c0, context=0x7f5f300758c0, fs=0x7f5f300759b0) at ./md-unwind-support.h:58
pc = 0x100000000 <Address 0x100000000 out of bounds>
sc =
new_cfa =
#1 uw_frame_state_for (context=context@entry=0x7f5f300758c0, fs=fs@entry=0x7f5f300759b0) at ../../../libgcc/unwind-dw2.c:1253
fde = 0x0
cie =
aug =
insn =
end =
#2 0x00007f5f2f670fb9 in _Unwind_Backtrace (trace=0x7f5f330e2f20 <backtrace_helper>, trace_argument=0x7f5f30075b70) at ../../../libgcc/unwind.inc:290
fs = {regs = {reg = {{loc = {reg = 0, offset = 0, exp = 0x0}, how = REG_UNSAVED} <repeats 18 times>}, prev = 0x0, cfa_offset = 0, cfa_reg = 0,
cfa_exp = 0x0, cfa_how = CFA_UNSET}, pc = 0x0, personality = 0x0, data_align = 0, code_align = 0, retaddr_column = 0,
fde_encoding = 0 '\000', lsda_encoding = 0 '\000', saw_z = 0 '\000', signal_frame = 0 '\000', eh_ptr = 0x0}
context = {reg = {0x0, 0x0, 0x0, 0x7f5f300764e8, 0x0, 0x0, 0x7f5f300764f0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x7f5f30076410, 0x7f5f30076418,
0x7f5f30076420, 0x7f5f30076428, 0x7f5f300764f8, 0x0}, cfa = 0x7f5f30076500, ra = 0x100000000, lsda = 0x0, bases = {tbase = 0x0, dbase = 0x0,
func = 0x5574cf3d0710 <dname_query_hash>}, flags = 4611686018427387904, version = 0, args_size = 0, by_value = '\000' <repeats 17 times>}
code =
#3 0x00007f5f330e3096 in __GI___backtrace (array=array@entry=0x7f5f30075d50, size=size@entry=64) at ../sysdeps/x86_64/backtrace.c:107
arg = {array = 0x7f5f30075d50, cfa = 140046804411504, cnt = 4, size = 64}
once = 2
#4 0x00007f5f33047e34 in __libc_message (do_abort=do_abort@entry=2, fmt=fmt@entry=0x7f5f33158492 "*** %s ***: %s terminated\n")
at ../sysdeps/unix/sysv/linux/libc_fatal.c:176
addrs = {0x7f5f33047e34 <__libc_message+1012>, 0x7f5f330e6b67 <__GI___fortify_fail+55>, 0x7f5f330e6b22, 0x5574cf3d07d7 <dname_query_hash+199>,
0x1, 0x7f5f336bae68 <int_rsa_verify+808>, 0x7f5f30075ed0, 0x7f5f198859e0, 0x339f7f00, 0x7f5f1987f160, 0x33, 0x339f7f0000000020,
0xed1bffff7bfac1d3, 0x7f5f19885a13, 0x7f5f1987f160, 0xb120852805306200, 0x7f5f198851a0, 0x7f5f19590ec0, 0x20,
0x5574cf420151 <ub_crypto_lock_cb+162>, 0x80, 0x1871987c97d, 0x7f5f337a6c8f, 0xa0000000a, 0xa, 0x337a6c8f, 0x7f5f337a6c8f,
0x7f5f33625f56 <CRYPTO_add_lock+102>, 0x7f5f1986f510, 0x18719590ec0, 0x7f5f19884540, 0x7f5f336dd589 <getrn+137>, 0x7f5f19884540,
0x5574d042edb0, 0x6, 0x5574cf420151 <ub_crypto_lock_cb+162>, 0x7f5f19885200, 0x152198851a0, 0x7f5f3375e50e, 0xa00000002, 0x5574d120e590, 0x6,
0x6, 0x5574cf420151 <ub_crypto_lock_cb+162>, 0x6, 0x20530076bc8, 0x7f5f3375e50e, 0x7f5f19864eb0, 0x1, 0x7f5f30076bc8, 0x0, 0x7f5f19886100,
0x7f5f30075fd0, 0x5574cf3d0198 <query_dname_compare+154>, 0x5574cf4e2ad1, 0x0, 0x0, 0x0, 0x7f5f30075f60,
0x5574cf3d8471 <query_info_compare+116>, 0x7f5f19238220, 0x7f5f300760c0, 0x7f5f18000960, 0xffffffff5d0331cf}
n =
ap = {{gp_offset = 32, fp_offset = 48, overflow_arg_area = 0x7f5f30076440, reg_save_area = 0x7f5f30076350}}
ap_copy = {{gp_offset = 16, fp_offset = 48, overflow_arg_area = 0x7f5f30076440, reg_save_area = 0x7f5f30076350}}
fd = 2
on_2 =
list =
nlist =
cp =
written =
#5 0x00007f5f330e6b67 in __GI___fortify_fail (msg=msg@entry=0x7f5f3315847a "stack smashing detected") at fortify_fail.c:30
No locals.
#6 0x00007f5f330e6b22 in __stack_chk_fail () at stack_chk_fail.c:28
No locals.
#7 0x00005574cf3d07d7 in dname_query_hash (dname=0x7f5f188a4156 "", h=569085103) at util/data/dname.c:295
labuf = "\177", '\000' <repeats 22 times>, "\b", '\000' <repeats 39 times>
lablen = 0 '\000'
i = 127
#8 0x0000000100000000 in ?? ()
No symbol table info available.
#9 0x0000000000000000 in ?? ()
No symbol table info available.

@wcawijngaards
Copy link
Member

So that #8 looks like it is no information at all. Is it stripping information from the executable or something? Or the other options damage the stack trace somehow? CFLAGS="-g" without the other options (that excludes the frame protection), perhaps that gives a better backtrace? Perhaps try that after the bisect.

@bleve
Copy link
Author

bleve commented Jun 7, 2019

There is no other app, just unbound caching dns.

@Antonio-Prado
Copy link

over here we decided to downgrade to 1.8.3 in the meantime

@bjovereinder
Copy link
Member

bjovereinder commented Jun 7, 2019

Hi Antonio,

Thanks for reporting. You are downgrading to 1.8.3 for the same reason (that is SIGABRT)?

@bleve
Copy link
Author

bleve commented Jun 8, 2019

git bisect start
good: [2a78803] - Fix wrong query name in local zone redirect answers with a CNAME, the copy of the local alias is in unpacked form.
git bisect good 2a78803
bad: [a4f4d7b] - Fix to guard _OPENBSD_SOURCE from redefinition.
git bisect bad a4f4d7b
good: [d4f697f] Merge branch 'master' into features/XoT
git bisect good d4f697f
good: [779b86f] - Fix doxygen output error on readme markdown vignettes.
git bisect good 779b86f
good: [3103961] - Fix #31: swig 4.0 and python module.
git bisect good 3103961
bad: [a03f0a3] - Fix double file close in tcp pipelined response code.
git bisect bad a03f0a3
bad: [0b77c9d] - Fix that spoolbuf is not used to store tcp pipelined response between mesh send and callback end.
git bisect bad 0b77c9d
skip: [1c3ba0c] - Note that so-reuseport at extreme load is better turned off, otherwise queries are not distributed evenly, on Linux 4.4.x.
git bisect skip 1c3ba0c
only skipped commits left to test
possible first bad commit: [0b77c9d] - Fix that spoolbuf is not used to store tcp pipelined response between mesh send and callback end.
possible first bad commit: [1c3ba0c] - Note that so-reuseport at extreme load is better turned off, otherwise queries are not distributed evenly, on Linux 4.4.x.
possible first bad commit: [3103961] - Fix #31: swig 4.0 and python module.
possible first bad commit: [a08fe8c] - Attempt to fix malformed tcp response.
possible first bad commit: [a95f5fd] - Squelch log messages from tcp send about connection reset by peer. They can be enabled with verbosity at higher values for diagnosing network connectivity issues.
possible first bad commit: [a922a19] - Revert fix for oss-fuzz, error is in that build script that unconditionally includes .o files detected by configure, also when the machine architecture uses different LIBOBJS files.
possible first bad commit: [a8d0177] - Attempt to fix build failure in oss-fuzz
good: [1c3ba0c] - Note that so-reuseport at extreme load is better turned off, otherwise queries are not distributed evenly, on Linux 4.4.x.
git bisect good 1c3ba0c
first bad commit: [0b77c9d] - Fix that spoolbuf is not used to store tcp pipelined response between mesh send and callback end.

@bleve
Copy link
Author

bleve commented Jun 8, 2019

Now testing with commit 0b77c9d reverted to verify bisect.

@Antonio-Prado
Copy link

Thanks for reporting. You are downgrading to 1.8.3 for the same reason (that is SIGABRT)?

yes, for the very same reason on both our 1.9.1 instances

@bleve
Copy link
Author

bleve commented Jun 8, 2019

This issue is not in 1.9.1. You are talking about different issue which has been fixed already.

@Antonio-Prado
Copy link

you mean this one is a different issue?

May 29 11:11:44 ns2-rec kernel: pid 78000 (unbound), uid 59: exited on signal 6

@bleve
Copy link
Author

bleve commented Jun 8, 2019

That is different because that issue we are talking here is introduced in last two weeks in git version.

@bleve
Copy link
Author

bleve commented Jun 8, 2019

Can you get a backtrace from your crasher? And if it's not same as this one, you could report that as a separate issue - and of course if it's same, please give that information too.

@Antonio-Prado
Copy link

well, as previously mentioned, we just downgraded to 1.8.3 so currently we're no more observing that issue.
will report updates, of course

@bleve
Copy link
Author

bleve commented Jun 9, 2019

No crashes since reverting bisected bad commit.

@wcawijngaards
Copy link
Member

Thank you very much for the bisect! The commit has to be removed. Issue is that the bug it fixes needs a different fix now...

@wcawijngaards
Copy link
Member

Fix with the original fix and a fix for the fix, this makes several error cases also use the correct buffer. In 6067ce6
That should work without the bisect commit removed. And I intend to release that in the rc2, and that to become the release. (Further fixes either as rc3 or for the release after, I want to get the current bugfixes out).

@bleve
Copy link
Author

bleve commented Jun 11, 2019

Testing fix, I'll report back in few hours.

@bleve
Copy link
Author

bleve commented Jun 11, 2019

Looks good in production.

@wcawijngaards
Copy link
Member

Wow Thanks! That is good to hear, since rc2 already contains it, that should go to the released version soon. I'll close the issue then.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants