Skip to content

Support log rotation, don't append crash.log files but use per-peer.#1847

Merged
cdecker merged 3 commits into
ElementsProject:masterfrom
rustyrussell:logfiles
Aug 23, 2018
Merged

Support log rotation, don't append crash.log files but use per-peer.#1847
cdecker merged 3 commits into
ElementsProject:masterfrom
rustyrussell:logfiles

Conversation

@rustyrussell
Copy link
Copy Markdown
Contributor

@rustyrussell rustyrussell commented Aug 15, 2018

This is based on ~~~#1846~~~ #1863 because I needed the start flag to get_node(), but it's basically the last two commits.

@rustyrussell rustyrussell requested a review from cdecker August 15, 2018 11:08
@rustyrussell rustyrussell added this to the v0.6.1 milestone Aug 20, 2018
@rustyrussell rustyrussell force-pushed the logfiles branch 2 times, most recently from d6be27f to f5d5624 Compare August 22, 2018 10:07
@cdecker
Copy link
Copy Markdown
Member

cdecker commented Aug 22, 2018

Rebased after applying #1863

Comment thread tests/fixtures.py Outdated
# determine whether we succeeded or failed.
if request.node.rep_call.outcome == 'passed':
shutil.rmtree(directory)
pass #shutil.rmtree(directory)
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's probably not intended to be in this queue

Comment thread tests/test_misc.py Outdated
logpath = os.path.join(l1.daemon.lightning_dir, 'logfile')
logpath_moved = os.path.join(l1.daemon.lightning_dir, 'logfile_moved')

# FIXME: I couldn't get super(TailableProc, l1.daemon).start() to work?
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I found super to be rather obscure. This should work however:

TailableProc.start(l1.daemon)

Since it'll take the classmethod start from TailableProc and passes in l1.daemon as self

Comment thread lightningd/log.c Outdated
static void handle_sighup(int sig)
{
/* This may fail if we're hammered with SIGHUP. We don't care. */
if (write(signalfds[1], "", 1));
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, this took me a while to figure out. This'll just write a 0x00 byte, since any C string is null-terminated, right?

Comment thread lightningd/log.c Outdated
fd = open(logfile, O_WRONLY|O_CREAT, 0600);
}
/* We expect to be in config dir. */
snprintf(logfile, sizeof(logfile), "crash.log.%u", getpid());
Copy link
Copy Markdown
Member

@cdecker cdecker Aug 22, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm wondering if pid is the best identifier here, we might want to use something like strftime(buffer, 14, "%Y%m%d%H%M%S", tm_info); in order to easily identify the last crash or correlate a crash dump with external monitoring.

Comment thread lightningd/log.c
if (write(signalfds[1], "", 1));
/* Writes a single 0x00 byte to the signalfds pipe. This may fail if
* we're hammered with SIGHUP. We don't care. */
if (write(signalfds[1], "", 1))
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

clang will complain if this is not on a separate line to signal explicit opt-out of this case.

@cdecker
Copy link
Copy Markdown
Member

cdecker commented Aug 22, 2018

I took the liberty to add fixup! commits for the two suggestions in my comments. Other than these changes the PR is good, and if you're happy with the changes I proposed, feel free to squash and merge :-)

ACK 734b5f8

@cdecker
Copy link
Copy Markdown
Member

cdecker commented Aug 22, 2018

Actually Travis-CI keeps complaining about one of the two tests (it fails test_crashlog but the error message related to the fake-bitcoin-cli of test_logging, which is really strange).

/tmp/ltests-0xxoonjf/test_logging_1/lightning-1/fake-bitcoin-cli exec failed: No such file or directory
lightningd: Fatal signal 6 (version 36db31f)
lightningd/lightningd: libbacktrace: no debug info in ELF executable
lightningd/lightningd: libbacktrace: no debug info in ELF executable

This happens both before and after my datetime crashlog commit, and I can't figure out why...

@cdecker
Copy link
Copy Markdown
Member

cdecker commented Aug 22, 2018

This might actually be valgrind intercepting the crash and thus not producing the crashlog:

------------------------------- Valgrind errors --------------------------------
Valgrind error file: valgrind-errors.2382
==2382== Jump to the invalid address stated on the next line
==2382==    at 0x0: ???
==2382==    by 0x4C5B96: backtrace_full (backtrace.c:127)
==2382==    by 0x430F6B: crashdump (daemon.c:42)
==2382==    by 0x56634AF: ??? (in /lib/x86_64-linux-gnu/libc-2.23.so)
==2382==    by 0x47A658: brute_force_first (timer.c:211)
==2382==    by 0x47A7BE: get_first (timer.c:246)
==2382==    by 0x47A7D8: update_first (timer.c:251)
==2382==    by 0x47A829: timer_earliest (timer.c:264)
==2382==    by 0x46CC99: io_loop (poll.c:272)
==2382==    by 0x4152E8: main (lightningd.c:455)
==2382==  Address 0x0 is not stack'd, malloc'd or (recently) free'd
==2382== 
==2382== 
==2382== Process terminating with default action of signal 11 (SIGSEGV)
==2382==  Bad permissions for mapped region at address 0x0
==2382==    at 0x0: ???
==2382==    by 0x4C5B96: backtrace_full (backtrace.c:127)
==2382==    by 0x430F6B: crashdump (daemon.c:42)
==2382==    by 0x56634AF: ??? (in /lib/x86_64-linux-gnu/libc-2.23.so)
==2382==    by 0x47A658: brute_force_first (timer.c:211)
==2382==    by 0x47A7BE: get_first (timer.c:246)
==2382==    by 0x47A7D8: update_first (timer.c:251)
==2382==    by 0x47A829: timer_earliest (timer.c:264)
==2382==    by 0x46CC99: io_loop (poll.c:272)
==2382==    by 0x4152E8: main (lightningd.c:455)
--------------------------------------------------------------------------------

@rustyrussell
Copy link
Copy Markdown
Contributor Author

Backtrace does not play nicely with valgrind... We disable it in dev mode, but not in non-dev mode:

#if DEVELOPER
	/* Suppresses backtrace (breaks valgrind) */
	if (!getenv("LIGHTNINGD_DEV_NO_BACKTRACE"))
		backtrace_state = backtrace_create_state(argv0, 0, NULL, NULL);
#else
	backtrace_state = backtrace_create_state(argv0, 0, NULL, NULL);
#endif

I'll work around it in the test itself...

Closes: ElementsProject#1623
Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
rustyrussell and others added 2 commits August 23, 2018 11:39
Someone had a 21GB crash.log, which doesn't help anyone!

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
This should make it easier to identify the latest crash file and correlate
crashes with external monitoring tools.
@cdecker
Copy link
Copy Markdown
Member

cdecker commented Aug 23, 2018

The remaining test failure seems to be an instance of the flaky test tracked in #1866, restarting to see if it unflakes :-)

@cdecker
Copy link
Copy Markdown
Member

cdecker commented Aug 23, 2018

ACK d5425bc

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants