
Crash handler hangs in Docker container (linux) #269

Closed
hoditohod opened this issue May 22, 2018 · 10 comments

@hoditohod
Contributor

Hi Kjell,
Applications using g3log hang (do not exit) when they encounter a crash (SIGFPE, SIGSEGV, etc...) while running in a Docker container.

Here's my understanding of the crash handling mechanism (square brackets indicate the thread involved):

  • [main] signal handler is invoked and captures stacktrace
  • [main] passes on fatal log to logworker
  • [main] enters a sleep loop: while(1) sleep(1);
  • [logworker] passes fatal log to sink
  • [sink] writes log to output
  • [logworker] deletes (and flushes) all sinks
  • [logworker] restores the original sighandler (default in older versions)
  • [logworker] re-emits the signal with kill() then calls exit()

When the app runs on a normal host (non-container), execution stops at kill(), and the default handler eventually terminates the app. In a container the app runs as PID 1, and different signal handling rules apply: the default action for signals sent by kill() is to ignore them (I haven't found the exact specification for this in the Linux documentation, only blog posts). So execution passes kill() and hangs in exit(). I don't know exactly why exit() hangs, but I suspect it has something to do with the pending signal handler on the main thread.

For me, replacing exit() with _exit() solves the problem. _exit() is similar to exit(), but it does not call atexit() callbacks (not a problem for abnormal termination) and does not flush the stdio streams. In our application streams are flushed in the sink. So _exit() works for me, but I don't know about the other platforms g3log supports. (_exit() is POSIX, but there's an equivalent C99 call: _Exit().)
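
Roughly, the change I have in mind looks like this (an illustrative sketch of the fatal exit path, not the exact g3log source):

#include <signal.h>   // signal(), kill()
#include <unistd.h>   // getpid(), _exit()

// Illustrative approximation of the fatal exit path, not the exact g3log code.
void exitWithDefaultSignalHandler(int signal_number) {
    // Restore the previously installed (or default) handler for this signal.
    signal(signal_number, SIG_DFL);

    // Re-emit the fatal signal. On a normal host the default action terminates
    // the process right here; as PID 1 in a container the signal is ignored.
    kill(getpid(), signal_number);

    // Proposed change: _exit() instead of exit(). It skips atexit() callbacks
    // and stdio flushing, so it avoids whatever exit() ends up blocking on.
    _exit(signal_number);
}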

What do you think?

@KjellKod
Owner

Interesting. I’ve never encountered this before. It sounds like your app is doing something specific - not Docker - but I haven’t verified the Docker behavior yet.

The reason I think that is that all pull requests have had their unit tests run on a Docker platform for a couple of years now, and a hanging loop at exit has never happened... I can’t say with 100% certainty that a fatal crash has happened in those runs. I will certainly try to replicate it.

---
Regardless, one option is to insert a custom “exit function” that is triggered once a fatal crash is detected.

See the examples for g3::setFatalPreLoggingHook() (unfortunately I haven’t added this to the documentation yet).
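
A minimal sketch of using that hook (assuming the usual g3log setup calls and that the hook takes a std::function<void(void)>; check the examples for the exact signature):

#include <functional>
#include <unistd.h>                 // _exit()
#include <g3log/g3log.hpp>
#include <g3log/logworker.hpp>

int main() {
    auto worker = g3::LogWorker::createLogWorker();
    auto handle = worker->addDefaultLogger("example", "/tmp");
    g3::initializeLogging(worker.get());

    // Hook called on the crashing thread once a fatal signal is detected,
    // before g3log's own shutdown and exit path runs.
    std::function<void(void)> preFatalHook = [] {
        // Terminating here is the bluntest option; note that the fatal stack
        // dump may then never reach the sinks (see the follow-up below).
        _exit(1);
    };
    g3::setFatalPreLoggingHook(preFatalHook);

    // ... application code ...
    return 0;
}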

@hoditohod
Contributor Author

One important note if you try to reproduce it: the application must be PID 1. This is the PID that has special signal handling semantics in the kernel (usually taken by init)
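
A quick way to confirm that precondition when reproducing (hypothetical snippet; the only point is that getpid() must return 1 inside the container, e.g. when started directly by docker run without an init wrapper such as --init/tini):

#include <unistd.h>   // getpid()
#include <cstdio>

int main() {
    // Inside the container this must print 1 for the special signal
    // handling rules to apply.
    std::printf("pid = %d\n", (int)getpid());

    // Deliberately trigger a fatal signal to exercise the crash handler.
    volatile int* p = nullptr;
    *p = 42;   // SIGSEGV
    return 0;
}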

@KjellKod
Owner

It seems that PID 1 is a special Linux thing, not Docker:
“Well PID 1 is special in Linux, amongst other things it ignores any signals unless a handler for that signal is explicitly declared. ”

https://hackernoon.com/my-process-became-pid-1-and-now-signals-behave-strangely-b05c52cc551c

Option 1: declare your own signal handler for the signal emitted with exit()
Option 2: exit in a custom way with the hook called out above
Option 3: override the G3log signal handling (see the code, tests, and examples) and handle the signals yourself or with the overridden customization
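
A sketch of option 1, assuming the custom handler is installed before g3log's own signal setup, so that it is the "original" handler that gets restored before the signal is re-emitted:

#include <initializer_list>
#include <signal.h>   // signal(), SIG* constants
#include <unistd.h>   // _exit()

// Handler for the re-emitted fatal signal. As PID 1 the default action would
// be ignored, but an explicitly installed handler still runs, so the process
// can terminate itself here.
extern "C" void pid1FatalHandler(int signal_number) {
    _exit(128 + signal_number);   // conventional "killed by signal N" exit code
}

void installPid1Handlers() {
    // Installed before g3log's signal setup, so these count as the
    // "original" handlers restored before the re-emit.
    for (int sig : {SIGSEGV, SIGFPE, SIGABRT, SIGILL, SIGTERM}) {
        signal(sig, pid1FatalHandler);
    }
}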

@KjellKod
Owner

Option 4: custom exit handling can possibly be added with std::atexit

http://en.cppreference.com/w/cpp/utility/program/atexit
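
A sketch of option 4, assuming the atexit callback runs early enough inside exit() to pre-empt whatever it later hangs on (untested for the PID 1 case):

#include <atomic>
#include <cstdlib>    // std::atexit
#include <unistd.h>   // _exit()

// Set from a fatal path (e.g. the pre-logging hook above) so that normal
// exits still flush streams and run the usual cleanup.
std::atomic<bool> fatal_exit{false};

int main() {
    // exit() runs atexit callbacks before flushing stdio, so a callback that
    // calls _exit() on the crash path cuts the normal exit sequence short.
    std::atexit([] {
        if (fatal_exit.load()) {
            _exit(1);
        }
    });

    // ... initialize g3log, run application ...
    return 0;
}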

@KjellKod
Owner

I don’t see any action needed on the G3log side regarding this. Although new to me, this is expected Linux behavior if you run the process as PID 1. Please see options 1-4 and let me know how it goes. I’ll keep this open for a while longer.

@hoditohod
Contributor Author

  • Stock g3log hangs in a container -> I do see a g3log-side action: at least mention it in the documentation
  • using exit() for fatal program termination is plain wrong. exit() is for normal termination and it invokes atexit() callbacks that expect the application to be in a sane state. There's not much point in starting a proper cleanup/shutdown when a thread has crashed with a segfault.
  • the exit() in exitWithDefaultSignalHandler() is never reached for a non-PID1 process: the program is terminated at kill(). And it hangs for PID 1. So it doesn't make sense in either case. _exit(), on the other hand, works fine in the latter case.

Regarding the options:

  • Option 1: I don't see exit() emitting any signals. exit.c
  • Option 2: if I terminate with the fatalPreLoggingHook, the stackdump won't get to the sink
  • Option 3: this could work, but it seems like over-engineering to me
  • Option 4: see above, atexit() is for normal termination

Anyway, if you don't see this as an issue, it's ok. I just wanted to let you know.

@KjellKod
Owner

I do see it as an issue - for PID 1 processes only.

To reiterate how the fatal handling currently works:

Once G3log has successfully shut down sinks etc., the original fatal handler is restored, if there was any, and the fatal signal is re-emitted.

I.e. if you are using G3log on a non-PID1 system, the behavior should be as close to normal “fatal signal” handling as possible.

The default signal exit function does:

kill(getpid(), signal_number);
exit(signal_number);

The signal from the kill() is re-emitted and will either directly kill the process or be caught in a (non-G3log) custom signal handler that can do whatever it wants - call _exit(...), for example.

The reasoning was that if the kill is caught and not handled properly (i.e. the PID1 situation), the exit should “nicely” exit the process with the information from the signal, i.e. the true signal number becomes the exit code.

An easy solution for you:

  1. Have a custom signal handler and deal with the fatal signal once G3log is done and re-emits the signal. It’s the same solution you would have to implement if G3log wasn’t there.

  2. Replace the exit() with something else, like the _exit() you suggested, or add something more forceful that also covers PID 1.

I’m not against 2. However, I want to keep G3log to standard library and cross-platform calls as much as possible. So what other exit calls work apart from the non-standard _exit()?

Does std::terminate work?
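
For what it's worth, one reading of "something more forceful that also covers PID 1" from point 2 above is an exit path that only falls back to the harder termination when the process actually is PID 1 (an illustrative sketch assuming detection via getpid(), not a patch):

#include <signal.h>   // signal(), kill()
#include <stdlib.h>   // exit()
#include <unistd.h>   // getpid(), _exit()

// Illustrative variant of the default signal exit function.
void exitWithDefaultSignalHandler(int signal_number) {
    signal(signal_number, SIG_DFL);
    kill(getpid(), signal_number);   // terminates the process on non-PID1 hosts

    if (getpid() == 1) {
        // The re-emitted signal was ignored (PID 1 semantics): terminate
        // immediately, skipping atexit() callbacks and stdio flushing.
        _exit(signal_number);
    }
    exit(signal_number);   // keep the current behavior everywhere else
}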

@KjellKod KjellKod self-assigned this May 23, 2018
@KjellKod
Owner

My understanding of PID1-type processes is that they are very special beasts and just exiting one rarely makes any sense. It makes more sense to reboot or to shut down the whole system.

I.e. if you have decided to use PID 1, then it’s good coding and design practice to have your own defaults to deal with fatal exits/signals. This (option 1 in my latest reply) is probably the path to go instead of relying on G3log to handle the PID1 exit logic.

@hoditohod
Contributor Author

std::terminate calls std::abort, which raises SIGABRT, which will get the same treatment as the kill() one line above.

exit() conforms to: POSIX.1-2001, POSIX.1-2008, C89, C99, SVr4, 4.3BSD.
_exit(): POSIX.1-2001, POSIX.1-2008, SVr4, 4.3BSD
_Exit(): C99
The latter two are equivalent, provide the exit code, and C99 isn't cutting edge nowadays.

As for PID 1, it's not so special any more: containerized applications run as PID 1 by default (PID namespace), and this configuration is recommended by Docker. I'm not insisting on fixing this in g3log, though I did spend some time figuring out why my application hangs in a container and operates fine otherwise.

Having this thread here for others running into this, with all the information, is fine for me.

@KjellKod
Owner

Added information about recommended PID1 fatal handling to the API documentation. Ref: 01be7d4

Repository owner locked as resolved and limited conversation to collaborators May 24, 2018