New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

SIGSEGV running Rust tests under kcov #212

Open
mthebridge opened this Issue Jul 31, 2017 · 34 comments

Comments

Projects
None yet
7 participants
@mthebridge

mthebridge commented Jul 31, 2017

I'm hitting an issue where running a Rust test binary under kcov segfaults, but running the binary natively is fine:

$ LD_LIBRARY_PATH=/opt/rust/.multirust/toolchains/stable-x86_64-unknown-linux-gnu/lib/ kcov --exclude-pattern=/opt/rust/.cargo --verify /data/mlt/rustproj/slog-extlog/slog-extlog-derive/target/cov/slog_extlog_derive-33e2f62c06364b1a/ /data/mlt/rustproj/slog-extlog/target/debug/deps/slog_extlog_derive-33e2f62c06364b1a 
kcov: 94 invalid breakpoints skipped in /opt/rust/.multirust/toolchains/stable-x86_64-unknown-linux-gnu/lib/libstd-f4594d3e53dcb114.so
kcov: Process exited with signal 11 (SIGSEGV) at 0x7ffff7105131

Obviously due to the way kcov works I can't run it in a debugger, but if I dump a core and load that in gdb I can see the segfault is somewhere in jemalloc in thread initialization.

Unfortunately I can't seem to generate a simple example for repro, and I can't share the underlying code, but running under kcov v33 works as expected, indicating this is a regression - and indeed git bisect found that the commit at fault is e620da4, which fixed #180. Annoyingly, the reason I want to run off master and not from the v33 release is because I want to be able to exclude lines within my project!

I appreciate there's not a lot to go on here - let me know if there's any useful diagnsotics I can send!

@SimonKagstrom

This comment has been minimized.

Owner

SimonKagstrom commented Aug 1, 2017

Don't be so humble! You've even found the faulty commit! That said, it's a bit surprising that this particular commit causes a crash: it shouldn't change breakpoint setting very much.

Do you use the #180 feature? I.e., via LCOV_EXCL_START?

Is this consistent across every run, or does it sometimes work?

@mthebridge

This comment has been minimized.

mthebridge commented Aug 1, 2017

Thanks for the speedy reply! It doesn't matter whether I'm using LCOV_EXCL_START or not, and yes it's 100% consistent - always fails on master, always works with v33...

@SimonKagstrom

This comment has been minimized.

Owner

SimonKagstrom commented Aug 1, 2017

And the commit immediately before e620da4 (i.e., 860a324) always work? Sorry for doubting git bisect a bit ;-)

@mthebridge

This comment has been minimized.

mthebridge commented Aug 1, 2017

Yes. I just manually checked out those two revs and compared.

On e620da4:

$ cargo kcov
error: failed to get coverage
caused by: signal: 11

On 860a324:

$ cargo kcov
kcov: 0 invalid breakpoints skipped...
<snipped>

test result: ok. 2 passed; 0 failed; 0 ignored; 0 measured
@SimonKagstrom

This comment has been minimized.

Owner

SimonKagstrom commented Aug 1, 2017

OK, great!

If you run kcov with --debug=15, you should be able to see how/if the breakpoint setting changes between the two versions. I think there must be something wrong related to breakpoint setting, so that a breakpoint gets misplaced (i.e. what --verify works around) and therefore causes the crash.

The output will be quite long, but should be possible to pass to diff.

@mthebridge

This comment has been minimized.

mthebridge commented Aug 1, 2017

Hmm - if I set debug=15 then both cases fail! The previously working case gives me:

$ cargo kcov -- --debug=15 > kcov_debug.out
error: failed to get coverage
caused by: exit code: 255

And the end of the debug output for this case is:

PT stopped PID 8671 0x0000057f
PT BP at 0x7ffff71050cd:-1 for 8671
REPORT hit at 0x7ffff71050cd
PT continuing 8671 with signal 0
PT stopped PID 8671 0x0000057f
PT BP at 0x7ffff71050db:-1 for 8671
REPORT hit at 0x7ffff71050db
PT continuing 8671 with signal 0
PT stopped PID 8671 0x00000b7f
PT signal 11 at 0x7ffff71050da for 8671
PT continuing 8671 with signal 11
PT stopped PID 8671 0x0000000b
PT terminating signal 11 at 0x7ffff71050da for 8671
PT continuing 8671 with signal 0
PT error for 8671: -1
Returning error

The previously broken case has 7000 lines of debug output, the previously working case 220000 - I assume you don't want to see 200k lines of diff! The 7000 lines in the "broken" case are the same as the first 7000 in the "working" case, but for the "broken" one the output just stops mid-line. No obvious errors in the "broken" version.

@SimonKagstrom

This comment has been minimized.

Owner

SimonKagstrom commented Aug 1, 2017

OK, strange. That almost sounds like kcov itself is crashing (as in #207) if it stops mid-line. Do you get a core file for kcov in that case?

I'm busy with real-life things right now (repainting a room), but I'll take a deeper look at the commit after that to see if I can figure out why it breaks things.

@mthebridge

This comment has been minimized.

mthebridge commented Aug 2, 2017

No problem about the delay - enjoy your redecorating!

You're right - running it with debug causes kcov itself to crash, and I do get a core.

Output from loading in gdb (running kcov at e620da4, Linux x86_64):

$ gdb `which kcov` core.7908 
<snip>
Core was generated by `kcov --debug=15 /data/mlt/rustproj/slog-extlog/slog-extlog-derive/target/cov/sl'.
Program terminated with signal 11, Segmentation fault.
#0  0x00007f7e4b234295 in std::string::find(char const*, unsigned long, unsigned long) const () from /lib64/libstdc++.so.6
(gdb) bt
#0  0x00007f7e4b234295 in std::string::find(char const*, unsigned long, unsigned long) const () from /lib64/libstdc++.so.6
#1  0x0000000000430fd1 in find (__pos=0, __str="LCOV_EXCL_LINE", this=0x0) at /usr/include/c++/4.8.2/bits/basic_string.h:1848
#2  match (line=<error reading variable: Cannot access memory at address 0x0>, lineNr=<optimized out>, filePath=..., this=0x24eb0d0) at /data/mlt/kcov/src/filter.cc:144
#3  Filter::runLineFilters (this=<optimized out>, filePath=..., lineNr=<optimized out>, line=<error reading variable: Cannot access memory at address 0x0>) at /data/mlt/kcov/src/filter.cc:107
#4  0x000000000043dcca in Reporter::onLine (this=0x24eae90, file="/checkout/src/libcore/lib.rs", lineNr=1, addr=93824992326304) at /data/mlt/kcov/src/reporter.cc:328
#5  0x0000000000450892 in ElfInstance::onLine (this=0x85c020 <g_instance>, file=..., lineNr=1, addr=94880) at /data/mlt/kcov/src/parsers/elf-parser.cc:692
#6  0x0000000000453d7d in kcov::DwarfParser::forEachLine (this=this@entry=0x7fff4bebd430, listener=...) at /data/mlt/kcov/src/parsers/dwarf.cc:112
#7  0x00000000004509b0 in ElfInstance::parseOneDwarf (this=this@entry=0x85c020 <g_instance>, relocation=relocation@entry=93824992231424) at /data/mlt/kcov/src/parsers/elf-parser.cc:420
#8  0x0000000000452b2d in ElfInstance::doParse (this=0x85c020 <g_instance>, relocation=93824992231424) at /data/mlt/kcov/src/parsers/elf-parser.cc:305
#9  0x000000000045535b in checkSolibData (this=0x24eb9d0) at /data/mlt/kcov/src/solib-handler.cc:228
#10 SolibHandler::onTick (this=0x24eb9d0) at /data/mlt/kcov/src/solib-handler.cc:78
#11 0x000000000041ab0c in tick (this=0x24eb220) at /data/mlt/kcov/src/collector.cc:72
#12 Collector::run (this=0x24eb220, filename=...) at /data/mlt/kcov/src/collector.cc:53
#13 0x0000000000418d6a in main (argc=<optimized out>, argv=<optimized out>) at /data/mlt/kcov/src/main.cc:298

Looks like a null pointer dereference to me - seems that FileLineHandler::match doesn't cope with being passed a null string for line. Do shout if you want me to send you the core (suspect it;s too large to atatch here).

@SimonKagstrom

This comment has been minimized.

Owner

SimonKagstrom commented Aug 2, 2017

Yes, that sounds like it!

Is the /checkout/src/libcore/lib.rs present in your system?

Anyway, with that finding, it should be pretty easy to fix the problem. Good catching!

@SimonKagstrom

This comment has been minimized.

Owner

SimonKagstrom commented Aug 2, 2017

Hmm... match takes a reference for line, so it really shouldn't be null. Could you mail me the source code of lib.rs? simon.kagstrom@gmail.com is the address.

I'll see if I can reproduce this in a unit test, although I'll have to warn you that it might be a few days until I look at this again.

@mthebridge

This comment has been minimized.

mthebridge commented Aug 2, 2017

Well, libcore/lib.rs is from the core Rust library, so not something I have my own copy of. I imagine it's this file though!

@SimonKagstrom

This comment has been minimized.

Owner

SimonKagstrom commented Aug 8, 2017

Well, something is strange if the source code isn't present on your system: Then it should not be able to read the file to start with. I've read the code of source-file-cache.cc again, but I can't see that it should be possible to end up in this situation.

Could you mail me the core file?

@mthebridge

This comment has been minimized.

mthebridge commented Aug 8, 2017

Thanks Simon - I've sent you a mail.

@SimonKagstrom

This comment has been minimized.

Owner

SimonKagstrom commented Aug 8, 2017

Unfortunately, I guess I would need your kcov binary as well (matching the core file), I don't get any symbols in the backtrace. One other thing though: Could you build kcov in debug-mode, i.e., configure it with

cmake -DCMAKE_BUILD_TYPE=Debug <...>

? Perhaps the backtrace would be easier to follow in this case. At least the "optimized out" stuff should be gone then.

@mthebridge

This comment has been minimized.

mthebridge commented Aug 8, 2017

The debug build is a good call - I may have got somewhere with that:

(gdb) up 3
#3  0x00000000004f2bfb in Reporter::onLine (this=0x1e14f20, file="/checkout/src/libcore/lib.rs", lineNr=1, addr=93824992326304) at /data/mlt/kcov/src/reporter.cc:328
328					!m_filter.runLineFilters(file, lineNr, lines[lineNr - 1]))
(gdb) p file
$10 = "/checkout/src/libcore/lib.rs"
(gdb) p *fp
$11 = {m_fileHash = 6018595602966914235, m_lines = std::vector of length 2, capacity 2 = {0x0, 0x2a451c0}, m_nrLines = 1}
(gdb) p *line
$12 = {m_addrs = std::vector of length 0, capacity 0, m_lineId = 2419036335428861953, m_order = 0, m_unreachable = false}
(gdb) p lines
$13 = std::vector of length 0, capacity 0

I wonder if the fact that file doesn't actually exist at the path specified is the problem? The /checkout/src/libcore is part of the Rust compiler code itself, not my code - but my Rust library does have a compiler plugin.

Looks like if that is the case, then the call to ISourceFileCache::getInstance().getLines(file) in reporter.cc is going to return an empty vector...

@SimonKagstrom

This comment has been minimized.

Owner

SimonKagstrom commented Aug 8, 2017

@mthebridge

This comment has been minimized.

mthebridge commented Aug 8, 2017

OK, so this issue with the null reference is a bit of a red herring. :(

The test I see at that commit is:

if (lines.size() >= lineNr - 1 &&
				!m_filter.runLineFilters(file, lineNr, lines[lineNr - 1]))

And lines.size() will be 0, but lineNr is 1. If I change the code to explicitly also cover the case when `lines.size() is 0, then everything passes...

BUT you've fixed this in master, by switching to a for loop:

for (unsigned int nr = 1; nr <= lines.size(); nr++) {
				if (!m_filter.runLineFilters(file, lineNr, lines[nr - 1]))

And running from the master, kcov itself doesn't segfault, the instrumented binary does.

However I switched back to running with --debug=15 and that fails even back obn v33!

So I suspect that this second bug isn't a regression, it's always been a problem, but somehow before the introduction of the line filters it only manifests with debug tracing on.

The end of the debug trace shows:

REPORT hit at 0x7ffff71050db
PT continuing 13400 with signal 0
PT stopped PID 13400 0x00000b7f
PT signal 11 at 0x7ffff71050da for 13400
PT continuing 13400 with signal 11
PT stopped PID 13400 0x0000008b
PT terminating signal 11 at 0x7ffff71050da for 13400

And 0x7ffff71050db is inside malloc()...

@SimonKagstrom

This comment has been minimized.

Owner

SimonKagstrom commented Aug 9, 2017

OK, too bad - I had forgot that change myself. Sounds like some kind of race then, but difficult to say where it might be.

Things to test:

  • Run kcov with --skip-solibs

  • Does the crashing code have source code? (I guess the one calling malloc from Rust libs) If so, maybe try filtering it out with --exclude-path

Is the crash the same with --debug=15 for both v33 and master?

@mthebridge

This comment has been minimized.

mthebridge commented Aug 9, 2017

skip-solibs doesn't crash, but that's because of how Rust test binaries are built - by default everything is a library, and the test harness builds a minimal wrapper around it, so while it doesn't crash the coverage report is empty.

Yes, I get the same problem on master and v33.

Using --exclude-path makes no difference, but using --include-path $(pwd) to only cover my code DOES work. That's a workaround I can use easily enough - so if you want to close this issue, then go ahead!

One other thought - it's possible that this is actually a Rust bug of course, and rustc is generating bad debugging information in the binary...

@SimonKagstrom

This comment has been minimized.

Owner

SimonKagstrom commented Aug 14, 2017

Well, normally you'd want to only cover your own code anyway, but it should still work I think.

Does --verify report any skipped breakpoints?

@mthebridge

This comment has been minimized.

mthebridge commented Aug 16, 2017

Sorry for delayed replay - --verify does report skipped breakpoints, but that happens for all Rust crates. I wonder if this is just another symptom of rust-lang/rust#27324 ?

@mthebridge mthebridge changed the title from SIGSEGV running Rust tests under kcov - regression to SIGSEGV running Rust tests under kcov Aug 16, 2017

@SimonKagstrom

This comment has been minimized.

Owner

SimonKagstrom commented Aug 16, 2017

OK, good. At least that's working as it should then.

It almost sounds like a timing issue I'd say. kcov (as you've no doubt seen!) tends to slow down execution quite a bit, and maybe that triggers some race somewhere?

@zetok

This comment has been minimized.

zetok commented Sep 4, 2017

Hi, just dropping by to say that I'm too affected by this. Looks like rustc bump (1.19.0 → 1.20.0) broke things completely:

Last working build: https://travis-ci.org/zetok/tox/jobs/270343484#L2215
Build with segfault: https://travis-ci.org/zetok/tox/jobs/271103907#L2367

@SimonKagstrom

This comment has been minimized.

Owner

SimonKagstrom commented Sep 4, 2017

Does the workaround @mthebridge uses (--include-pattern=<path-to-your-own-source-code>) help for you as well?

zetok added a commit to zetok/tox that referenced this issue Sep 4, 2017

chore(travis): fix coveralls again
Looks like rustc bump (1.19.0 → 1.20.0) broke things.

Use modified script from
https://github.com/gimli-rs/gimli/blob/master/coverage
@ de7eeff87fa21264e41f2b2e0b82f45f88851b2d and a workaround from
SimonKagstrom/kcov#212 (comment)
to make it work again.

New script isn't really needed, but it also makes kcov cover tests that
previously weren't checked by it.
@zetok

This comment has been minimized.

zetok commented Sep 4, 2017

@SimonKagstrom That works, thanks!

@luser

This comment has been minimized.

luser commented Dec 21, 2017

One other thought - it's possible that this is actually a Rust bug of course, and rustc is generating bad debugging information in the binary...

FWIW, I dealt with a similar situation with Rust code in Firefox, and even contributed a patch to rustc to make things better. Generics in Rust do not get actually compiled until you instantiate them with a concrete type, and the Rust standard library has a lot of generics. This means that when you use like Option<Foo>, where Foo is defined in your code, the Rust compile will compile that object code into your binary when your crate is compiled. Additionally, since Rust doesn't support dynamic linking, the rest of the standard library will be statically linked into your final binary. This means that there are a lot of things that wind up in your binary for which the source paths will map back to the Rust standard library in whatever path it was built for the Rust toolchain you're using, and you are not going to have source for those.

It's possible to encounter this in C code, but less likely due to the fact that many C libraries are dynamically linked. If you were to link an external static library that contained debug info I'm pretty sure you'd hit the same issue, though.

@SimonKagstrom

This comment has been minimized.

Owner

SimonKagstrom commented Dec 21, 2017

Thanks for the insight into rust!

Still, I don't think the missing source code as such is really the problem - that's quite common for C/C++ as well, and present no problem by itself. But the volume of instrumented code might certainly increase the chances of triggering a case of bad debug info.

This problem would have been so much simpler if x86 - with variable instruction lengths - weren't so prevalent.

@Dushistov

This comment has been minimized.

Dushistov commented Mar 4, 2018

I hitted by this issue today.

The steps to reproduce are simple:

$ cargo new --lib foo && cd foo && cargo test
...
$ export TEST_BIN=`find target/debug/ -executable -type f | grep 'target/debug/foo'`
$ /tmp/kcov/build/src/kcov target/cov $TEST_BIN
kcov: Process exited with signal 11 (SIGSEGV) at 0x5555555b9497

version of software that I use:

$ cargo --version
cargo 0.26.0-nightly (1d6dfea44 2018-01-26)
$ /tmp/kcov/build/src/kcov --version
kcov v34-25-g894e
$ uname -a
Linux retina 4.14.22-1-lts #1 SMP Sun Feb 25 15:13:49 CET 2018 x86_64 GNU/Linux
@SimonKagstrom

This comment has been minimized.

Owner

SimonKagstrom commented Mar 4, 2018

@Dushistov try running kcov with --verify, it seems to be needed pretty much always with rust binaries.

@Dushistov

This comment has been minimized.

Dushistov commented Mar 4, 2018

@SimonKagstrom

Sorryk forgot to mention, I've tried both variants,
but I hope that and without any flag it should not crash.

$ /tmp/kcov/build/src/kcov --verify  target/cov target/debug/foo-7cf2e66c10f1fab7
kcov: 213 invalid breakpoints skipped in /tmp/foo/target/debug/foo-7cf2e66c10f1fab7

running 1 test
test tests::it_works ... ok
test result: ok. 1 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out
$ /tmp/kcov/build/src/kcov --include-path .  target/cov target/debug/foo-7cf2e66c10f1fab7

running 1 test
test tests::it_works ... ok
test result: ok. 1 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out
@SimonKagstrom

This comment has been minimized.

Owner

SimonKagstrom commented Mar 4, 2018

So it works with --verify and limiting the source path? Issue #82 explains a bit more in detail why verify is needed, it's basically a workaround for broken DWARF info. Limiting the source path probably means that the locations where the DWARF info points to the middle of instructions are simply ignored.

I agree it should work without any extra options, but since this appears to be quite common (not only with Rust), the only alternative would be to make --verify default. That would slow down execution fairly significantly though for the cases where it's not needed.

@Arzte

This comment has been minimized.

Arzte commented Mar 5, 2018

Perhaps a warning noting that if it's not needed --noverify can be passed?

@fschutt

This comment has been minimized.

fschutt commented Sep 26, 2018

The segfault occurs when you don't have libiberity-dev in your packages. To fix it:

addons:
  apt:
     update: true
     - libiberity-dev

For some reason, kcov can build without libiberity, but the segfaults at runtime.

@SimonKagstrom

This comment has been minimized.

Owner

SimonKagstrom commented Sep 27, 2018

@fschutt kcov detects if binutils-dev is installed, and only then builds with disassembly, so it's optional.

It gives a warning though if you run with --verify without the disassembler. Rust binaries appear to be especially sensitive to this, so they typically don't work at all without discarding the invalid breakpoints.

@Arzte running without --verify is the default, so if you don't give any options it will run without verification. Sorry about not seeing your reply until now!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment