Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

2.11 RC1: Nessus Scan crash the Windows-Client. #7431

Closed
stevie-sy opened this issue Aug 19, 2019 · 21 comments

Comments

@stevie-sy
Copy link

commented Aug 19, 2019

Describe the bug

We tested the 2.11 RC1 against the Nessus Scan again. So it's a litte bit a follow up from #6559: the Linux Client survived on CentOS 7 - Great! But not on our Windows System. Here the service is stopped after the scan. 2.11 RC1 is installed on a Windows Server 2012 R2 Standard.

If you Need something more pls tell us.

To Reproduce

Scan with a security scanner against a Windows system

Expected behavior

The Windows Client should also not Crash like the Linux Client.

Your Environment

Include as many relevant details about the environment you experienced the problem in

  • Version used (icinga2 --version): 2.11 RC1
  • Operating System and version: Windows Server 2012 R2 Standard with .NET Framework 4.8
@stevie-sy

This comment has been minimized.

Copy link
Author

commented Aug 19, 2019

Addition: To be sure, we have scanned 3 times.

@dnsmichi

This comment has been minimized.

Copy link
Member

commented Aug 19, 2019

Since we don't have access to your environment (this implies an NDA/support contract), we need all the logs, as always. Also, the entries from the windows event console. Are you 100% certain that you did restart the agent after upgrading to 2.11 RC?

@stevie-sy

This comment has been minimized.

Copy link
Author

commented Aug 20, 2019

Yes I did also a reinstall of the windows Client to be sure.
You get the logs as soon as possible.

@stevie-sy

This comment has been minimized.

Copy link
Author

commented Aug 20, 2019

My college did the security scan yesterday between 15:31 and 15:44 for the test environment. I filtered and anonymize the icinga2.log for this period. There was also only one entry in the Windows Event log. Icinga didn't produce a crash log.
We have no debug.log yet. But if this two log aren't enough, we can do - of course - an another scan to produce one. Also if you need a more detailed Windows Event log. But for that I have to "clean" the server a little bit. Because there is also other software installed and running for testing (which is also monitored by icinga) and nessus is also scanning this one at the same time of course.
event_icinga2_nessus.txt

icinga2_nessus.log

For me it looks like, that the windows client has problems with the GET-requests at 15:39:21, like GET %24%7B%7B57550614+16044095%7D%7D//. After that the client stopped working.

@dnsmichi

This comment has been minimized.

Copy link
Member

commented Aug 20, 2019

Thanks, the logs unfortunately don't tell much except for the connections and requests made. The last log line before the restart doesn't unveil much.

I've performed the HTTP GET request against my Windows 10 VM, works as expected.

$ curl -k 'https://192.168.243.157:5665/%24%7B%7B57550614+16044095%7D%7D//'
<h1>Unauthorized. Please check your user credentials.</h1>

Just our of interest - I read that Nessus is a free scanner, but obviously has enterprise modules and so on. Which version of it are you using and do you think that there is a chance to reproduce this just with using the free scanner?

@dnsmichi dnsmichi added this to the 2.11.0 milestone Aug 20, 2019
@dnsmichi dnsmichi referenced this issue Aug 20, 2019
54 of 54 tasks complete
@stevie-sy

This comment has been minimized.

Copy link
Author

commented Aug 20, 2019

Ok, than I ask my college for another scan and I try to "clean" (and stop all other Services from our other software on that Server) to get good logs; inclusive debug.log. Also I ask him which version and modules we use from Nessus. What I know we use the commercial version.

@stevie-sy

This comment has been minimized.

Copy link
Author

commented Aug 20, 2019

So this is the Information I got from my colleague about the version:

Nessus Enterprise
Tenable.sc
Version: 5.10.1
Nessus Free (for scanning up to 16 IPs)
„Nessus Essentials“
https://www.tenable.com/products/nessus/nessus-essentials
If free and Enterprise version scan in the same manner could only tell Tenable Support.

logs are coming later

@stevie-sy

This comment has been minimized.

Copy link
Author

commented Aug 21, 2019

So my colleague from the security department and I did a lot of scans with various scenarios.

note:
First of all we noticed, that we have also the same problem with the debuglog on Windows, described in #7388 . We also install our windows agents with the PS Module. And it's correct what you wrote there, the api feature is disabled. But with a manual enabled api feature we also don't get a debug.log file. So what we do now to produce this file for you is to start manually an extra daemon with
icinga2 daemon --log-level debug > C:\ProgramData\icinga2\var\log\icinga2\debug.log

Scans
With this knowledge we build some diffrent scans today. That means we disable/enable the api and/or the debuglog feature, start an extra daemon to get the debug log. After that we start icinga and the scan again. This resulted in this Matrix (the time is for your to find this better in the logs):
image

Logs for this:
icinga2.log - this is not seperated, the complete scans are in this file
debug_1.log - scan with extra process and enabled api/debuglog feature
debug_2.log - scan with extra process and disabled api and enabled debuglog feature
debug_3.log - scan with extra process and disabled api/debuglog feature

an interesting fact from yesterday's scan (before wie read the issue with the debuglog)
With the extra process for the debug log icinga survied - every process! But today not. So we don't know why and can't reproduce this. That are the logs from this scan:
icinga2.log
debug.log

Conclusion:
Every time when the icinga daemon crashed there is an entry in the windows event log with the faulting module is C:\Program Files\ICINGA2\ucrtbase.DLL with the exeption code: 0xc0000409 (STATUS_STACK_BUFFER_OVERRUN) and the offset 0x000000000006f08e
image

image

I hope this helps you to find the bug. If you need more, pls tell us.

@dnsmichi dnsmichi self-assigned this Aug 21, 2019
@stevie-sy

This comment has been minimized.

Copy link
Author

commented Aug 22, 2019

I did a little reasearch about this dll. It's from Microsoft so if this is the fault auf the crash, it's maybe a little bit difficult to fix.
image

With google I found some postings in forums, that also other people has problems with this DLL. They also have crashes. They didn't write the reason. But maybe this is helpful for you:

@dnsmichi

This comment has been minimized.

Copy link
Member

commented Aug 23, 2019

Thanks, I'll talk to you on Tuesday when back from vacation.

@dnsmichi

This comment has been minimized.

Copy link
Member

commented Aug 27, 2019

We'll continue offline. I'll update the issue once all data is collected and whether we've found something from the analysis.

@dnsmichi dnsmichi pinned this issue Aug 28, 2019
@dnsmichi

This comment has been minimized.

Copy link
Member

commented Aug 28, 2019

Both Stefan and Philipp are doing a marvelous job here - they provided me with a full self-written PDF on how to install and configure Nessus Essentials to trigger the full scan.

Previously I tested with a macOS install, but with a wrong subnet Nessus locked me out of the trial (yeah, enterprise software which always wants you to buy something). Now I am using a CentOS 7 Vagrant box.

I can see that Icinga stops running when such a scan happens. This is good news, so I can dig deeper.

For security reasons, I am not sharing any details about Nessus and its configuration nor the exact logs for triggering things until I have found and implemented a fix.

@dnsmichi dnsmichi added bug and removed needs-feedback labels Aug 28, 2019
Al2Klimov added a commit that referenced this issue Aug 28, 2019
Al2Klimov added a commit that referenced this issue Aug 28, 2019
Al2Klimov added a commit that referenced this issue Aug 28, 2019
@Al2Klimov Al2Klimov self-assigned this Aug 28, 2019
@dnsmichi

This comment has been minimized.

Copy link
Member

commented Aug 30, 2019

The referenced PR doesn't entirely solve the problem. We're now somewhere in the event polling loops in Boost Asio on Windows with socket cancel actions. So this definitely only is a Windows specific problem, only with the RC1 and Git Master.

@stevie-sy

This comment has been minimized.

Copy link
Author

commented Aug 30, 2019

That are a Little bit bad news. Because that means for you more hard work to fix. But you can do that, we're sure.
But on the other hand it's good to know that the Linux version is stable after the rework of the network stack. Very good work. In any case, the hard work has paid off.

@dnsmichi

This comment has been minimized.

Copy link
Member

commented Aug 30, 2019

http::async_read_header() may return an error during liveness disconnect. For some reason, reading the header doesn't return in this case. Calling cancel() in Disconnect() does not cancel running operations, nor would it react shuttingDown. Either we're overlapping buffers here (which I highly doubt), or there's a race condition involved where Boost asio fails. Or something fails with coroutines and strands here.

Upgrade to Boost 1.71 would also be a test option for next week. Code resides in bugfix/windows-nessus-crash-http now @Al2Klimov.

Last resort: Disable IOCP on Windows and force to use select() for IO handling with Boost Asio.

The error is somewhat different, at some point the thrown error is somehow broken. Maybe there is an overlap with the error codes as well. Looked into ceph's handling here too https://github.com/ceph/ceph/blob/6171399fdedd928b4249d135b4036e3de25079aa/src/rgw/rgw_asio_frontend.cc

Screen Shot 2019-08-30 at 16 51 32

Screen Shot 2019-08-30 at 17 05 42
Screen Shot 2019-08-30 at 17 09 37

Btw, you cannot even go into the Windows specific DLLs to read source code. Only Assembler. Closed source sucks.

@stevie-sy

This comment has been minimized.

Copy link
Author

commented Sep 5, 2019

@dnsmichi We'll keep our fingers crossed, that your ideas are helpful!

@dnsmichi

This comment has been minimized.

Copy link
Member

commented Sep 5, 2019

Dynamically link Boost DLLs

Still crashes, but at least a crash log is now written with an unhandled SEH exception.

C:\Program Files\ICINGA2\sbin> Get-Content "C:\ProgramData\icinga2\var\log\icinga2/crash/report.1567670652.985000"
  Application version: v2.11.0-rc1-93-gcd5aa4390

System information:
  Platform: Windows
  Platform version: 8
  Kernel: Windows
  Kernel version: 6.2
  Architecture: x86_64

Build information:
  Compiler: MSVC 19.22.27905.0
  Build host: WINMIF

Application information:

General paths:
  Config directory: C:\ProgramData\icinga2\etc\icinga2
  Data directory: C:\ProgramData\icinga2\var\lib\icinga2
  Log directory: C:\ProgramData\icinga2\var\log\icinga2
  Cache directory: C:\ProgramData\icinga2\var\cache\icinga2
  Spool directory: C:\ProgramData\icinga2\var\spool\icinga2
  Run directory: C:\ProgramData\icinga2\var\run\icinga2

Old paths (deprecated):
  Installation root: C:\Program Files\ICINGA2\
  Sysconf directory: C:\ProgramData\icinga2\etc
  Run directory (base): C:\ProgramData\icinga2\var\run
  Local state directory: C:\ProgramData\icinga2\var

Internal paths:
  Package data directory: C:\Program Files\ICINGA2\\share\icinga2
  State path: C:\ProgramData\icinga2\var\lib\icinga2/icinga2.state
  Modified attributes path: C:\ProgramData\icinga2\var\lib\icinga2/modified-attributes.conf
  Objects path: C:\ProgramData\icinga2\var\cache\icinga2/icinga2.debug
  Vars path: C:\ProgramData\icinga2\var\cache\icinga2/icinga2.vars
  PID path: C:\ProgramData\icinga2\var\run\icinga2/icinga2.pid
Caught unhandled SEH exception.
Current time: 2019-09-05 10:04:13 +0200

Stacktrace:

        (0): free+40
        (1): (unknown function)
        (2): (unknown function)
        (3): (unknown function)
        (4): (unknown function)
        (5): (unknown function)
        (6): (unknown function)
        (7): (unknown function)
        (8): (unknown function)
        (9): (unknown function)
        (10): (unknown function)
        (11): (unknown function)
        (12): (unknown function)
        (13): (unknown function)
        (14): (unknown function)
        (15): (unknown function)
        (16): (unknown function)
        (17): (unknown function)
        (18): (unknown function)
        (19): (unknown function)
        (20): (unknown function)
        (21): (unknown function)
        (22): (unknown function)
        (23): (unknown function)
        (24): (unknown function)
        (25): (unknown function)
        (26): (unknown function)
        (27): (unknown function)
        (28): (unknown function)
        (29): (unknown function)
        (30): (unknown function)
        (31): (unknown function)
        (32): (unknown function)
        (33): register_onexit_function+236
        (34): BaseThreadInitThunk+20
        (35): RtlUserThreadStart+33
***
* This would indicate a runtime problem or configuration error. If you believe this is a bug in Icinga 2
* please submit a bug report at https://github.com/Icinga/icinga2 and include this stack trace as well as any other
* information that might be useful in order to reproduce this problem.
***
dnsmichi added a commit that referenced this issue Sep 6, 2019
dnsmichi added a commit that referenced this issue Sep 6, 2019
When run within a coroutine, exceptions on Windows may influence
bad behaviour here. Instead, we'll check for the error code
and extract the message from memory. In contrast to exceptions
which are stored on the stack frame and then return, this costs
a little more memory but simplifies the logic.

This doesn't fix the linked issue, but is related to the analysis.

refs #7431
@dnsmichi dnsmichi added the core/crash label Sep 6, 2019
dnsmichi added a commit that referenced this issue Sep 6, 2019
…them in sbin/

This comes with the motivation to avoid bundling Boost context
into our main binary. Whenever the context library allocates
stack memory, or does something weird, we want to see this
in the stack traces as origin from an external DLL and not the
main icinga2.exe binary.

TL;DR - this doesn't fix #7431 but makes debugging and development
a bit more easy. And it follows the Windows approach with DLLs.
dnsmichi added a commit that referenced this issue Sep 9, 2019
Exceptions in Disconnect() might be thrown (this has been reworked
into error_code locally) which are swallowed inside the Destructor
for being dangerous. On the other hand, swallowing them may
corrupt the stack unwinding operation from the coroutine layer.

The best is to avoid Defer inside lib/remote and call Disconnect()
directly after breaking from other operations.

refs #7351
refs #7431
dnsmichi added a commit that referenced this issue Sep 9, 2019
Throwing local exceptions unnecessarily pollutes the exception
stack with immediate unwinding. Avoid this pattern at all cost within
Boost.Coroutines. MSVC may handle exceptions differently and cause
problems with stack unwinding.

refs #7431
refs #7351
dnsmichi added a commit that referenced this issue Sep 9, 2019
dnsmichi added a commit that referenced this issue Sep 9, 2019
dnsmichi added a commit that referenced this issue Sep 9, 2019
…eptions

This is required to

- catch all exceptions and wrap them into Boost exceptions. They
are the only ones allowed with Boost.Coroutine.
- set a dedicated coroutine stack size for Windows.

refs #7431
dnsmichi added a commit that referenced this issue Sep 9, 2019
…eptions

This is required to

- catch all exceptions and wrap them into Boost exceptions. They
are the only ones allowed with Boost.Coroutine.
- set a dedicated coroutine stack size for Windows.

refs #7431
@dnsmichi

This comment has been minimized.

Copy link
Member

commented Sep 9, 2019

TL;DR - 2 weeks with emotional ups and downs and a lot of technical background learned. Icinga now survives a Nessus scan.

Screen Shot 2019-09-09 at 17 46 50

Analysis

Setup

Scanner Setup

Both Nessus and OpenVAS work. Thanks to Stefan & Philipp for providing a setup documentation for Nessus Essential. Won't be provided for security reasons here.

Wireshark TCP Dump

https://www.netways.de/blog/2018/07/26/ssl-geknackt-naja-fast/

Change the cipher_list, add the node's private key for following the TLS traffic, extract the sent malformed requests.
This worked to mitigate the request which caused the immediate crash. Unfortunately this isn't always the same, and we found another way
to trigger the crash as well.

Request Analysis

Only failing requests would lead to stack buffer underrun or memory corruption errors.

Memory Analysis

  • ucrtbase.dll is the Windows Universal C Runtime. Something like libstdc++ on Linux, where all memory allocations etc. are provided.
  • A crash within new() or free() typically means that the application does not crash from a current problem, but memory is already corrupted

Origin

Boost Beast HTTP Parser

http::async_read_header may throw exceptions leading into crashes. In order not to populate the exception
stack, it sounds reasonable to move to error_code handling anyways.

Boost ASIO Socket IO

Same strand in m_IoStrand as with other operations. No special influence here.

boostorg/asio#154
https://www.boost.org/doc/libs/1_47_0/doc/html/boost_asio/overview/core/strands.html

http://www.crazygaze.com/blog/2016/03/17/how-strands-work-and-why-you-should-use-them/

Boost Coroutines

ASIO wraps them and makes them easier to use.

https://stackoverflow.com/questions/30557582/what-does-boostasiospawn-do

Coroutine Stack Unwinding on Complete

https://stackoverflow.com/questions/9286550/is-there-any-method-that-causes-whole-stack-frame-unwinding-in-c-except-usin

However, I've found some critical flaw. Any user code that "swallow exception" as below will completely break the assumption of cooperative scheduling.

try {
    ...
} catch(...) { // ContextClosingException 
    // do nothing, just swallow exception.
}

So I need to find other way to invoke stack unwinding (or any other way to destruct stack object in the continuation). Standard conformance way would be nice - but the continuation implementation itself is dependent on platform specific API, so non-portable way would be acceptable. (I'm using win32)

Swallowing exceptions is evil with Boost Coroutines.

Root Cause

Coroutines work in the way that they allocate their own stack. Whenever an exception is thrown,
it is expected to be caught or to terminate the coroutine.

Exceptions are asynchronous and put onto the stack. This stack is also used by the Coroutine to
store the function pointer when yielding the operation.

At some point, the thrown exceptions corrupt the stack. Only on Windows, only with Visual Studio as compiler.

In order to trigger this problem, either fire many Nessus Scan Attacks against the HTTP API.
Or you'll go with something via curl/debug console which always triggers an exception.

The easiest way to reproduce this is to use the debug console and evaluate a script expression
which is incorrect. At first glance, the generated ScriptError exception somehow works. Later iterations
always lead into a crash.

C:\Program Files\ICINGA2\sbin> .\icinga2.exe console --connect 'https://root:icinga@localhost:5665/' --eval 'get_host("bla)'

Note

Rewriting the code parts which invoke std::invalid_argument do not fix this, the
memory is still corrupted. Also, EnsureValidHeaders() is passed without throwing
any exception.

We really need to work on the exceptions themselves.

Coroutines and Context Stack

Our (exception) stack is foobar at some point. Either Beast exceptions throw, our own ScriptError exceptions throw and something asio related itself.

One can manipulate the coroutine stack size allocated by Boost.Context. This defaults to 64KB.

Default constructor using boost::context::default_stacksize(), does unwind the stack after coroutine/generator is complete.

https://gist.github.com/chfast/bfc1e1af4176960b4722#file-context-cpp-L85

typedef ctx::simple_stack_allocator<
    8 * 1024 * 1024, // 8MB
    64 * 1024, // 64kB
    8 * 1024 // 8kB
>       stack_allocator;

https://emcl-gitlab.iwr.uni-heidelberg.de/hiflow3.org/hiflow3/blob/c7a3ea7216a9cb15a638112dfb17732e932cc265/contrib/boost_libraries/libs/context/performance/winfiber/performance.cpp

Sets the default stack size to 8 MB. 64MB. Does not immediately crash, but later.

https://www.boost.org/doc/libs/1_68_0/libs/context/doc/html/context/stack.html

Exceptions

http://boost.2283326.n4.nabble.com/corountine2-context-forced-unwind-exception-on-Windows-with-TDM-GCC-td4706583.html

forced_unwind is part of the internal plumbing of Boost.Context and can 
be thrown from any yield point when the coroutine is destroyed while 
suspended.  It's required that this exception is not swallowed and the 
whole call stack between the yield point and the "outside" of the 
coroutine is not noexcept. 

raptorjit/raptorjit#110

https://asio-users.narkive.com/QP9g9dj9/can-t-catch-boost-asio-exceptions-inside-coroutines

Are you linking to pre-built boost DLL's provided by MXE? If so, the
DLL is probably not seeing the exception handler created by your
program, probably because the compiler is not creating the exception
handler in the way expected by the DLL. If that is the case, there
might be some compiler command line options that would fix it, or
possibly the compiler was not built correctly to support exceptions
coming from DLL's. Linking to a static boost library instead of a DLL
might fix it though either way.

http://boost.2283326.n4.nabble.com/context-Crash-in-case-of-exception-from-a-context-td4672953.html

[context] Crash in case of exception from a context

I use boost::context to implement cooperative multitasking. At the moment I use boost 1.57 on Windows 7 with VisualStudio 2013. 

Everything works fine until I throw an exception from a context created by "boost::context::make_fcontext(.......)".  I have seen
someone had the same problem and it was caused by the Windows OS, that checks the exception hierarchy in the stack and if it finds
that the starting point is not the one that should come from the OS then it terminates the process.

It suspects some malitious code that corrupted the stack. 

https://www.acodersjourney.com/top-15-c-exception-handling-mistakes-avoid/

Beyond knowing how to use the try/catch syntax, one of the fundamental concepts to know regarding C++ exception handling is the concept
of Stack Unwinding.

When an exception is thrown and control passes from a try block to a handler, the C++ run time calls destructors for all automatic objects
constructed since the beginning of the try block. This process is called stack unwinding. The automatic objects are destroyed in reverse order
of their construction. If an exception is thrown during construction of an object consisting of sub-objects or array elements, destructors are
only called for those sub-objects or array elements successfully constructed before the exception was thrown.

Why should you know this ? Because this'll help you understand the exception handling tips and tricks for making your code robust and efficient.
A full discussion of the Stack Unwinding process is beyond the scope of this article – but here is an excellent reference from msdn : 
https://msdn.microsoft.com/en-us/library/hh254939.aspx.

https://docs.microsoft.com/en-us/cpp/cpp/exceptions-and-stack-unwinding-in-cpp?view=vs-2019

Mistake # 15: Swallowing Exceptions
Swallowing critical exceptions will cause your program to do either of two things –  to fail in unexpected ways downstream or prevent the program from fulfilling it’s purpose. Sometimes programmers will catch any exception via catch(…) and then swallow them . This is usually done for exceptions that the programmer did not foresee happening. However, this can lead to downstream failure – sometimes with no obvious reason for the failure since the stacktrace disappears with the swallowed exception.

If you must swallow exceptions, make sure that you Log the exception as well as document them in code and in your documentation. 

https://groups.google.com/forum/#!topic/boost-list/6oHRdLmuQ5k

The documentation for boost context contains the following warning: "Do not jump from inside a catch block and then re-throw the exception in another continuation. "
I can attempt to answer "why" from a high-level point of view: I think is that it could be summarized as: the exception machinery is not reentrant. Throwing and catching exceptions performs a lot of messy manipulations with the data on the stack. During exception unwinding, you have to traverse the stack (a single-linked list) during which 1) a check is made at each level for whether the current stack frame has an exception handler, and if not 2) call destructors for local objects. Stack traversal is implemented by the runtime and even possibly relies on the OS' data structures (in case of Windows) and *might* use also thread-local storage (e.g., exception_ptr).


Switching the stack aborts this process without informing the run-time exception machinery and.. what happens when you resume it? What if another exceptions has been thrown and handled on the same thread before resuming a suspended exception handler? A handler which is written with the assumption that only a single active invocation can exist on any particular thread?


There are other concerns too: on Windows you can get asynchronous callbacks from the kernel (akin to unix signals, just nicer) and if you switch the stack when called from the kernel, you can really hose your process.

Incompatibility between std, boost and ScriptError exceptions?

I've found a library which wraps the coroutine exceptions from their different types.

https://github.com/niekbouman/commelec-api/blob/master/commelec-api/coroutine-exception.hpp#L35

The implementation is safe and good to use - since I've now learned that coroutines only support
Boost exceptions and nothing else.

Compiler Optimization

The Visual Studio Compiler optimizes several bits in a way we as developers do not expect it.
Keep in mind, the underlaying code works 100% on Linux/Unix when compiled with GCC.

In terms of Boost, we need to disable certain flags already:

# Disable optimization for Boost::context
# https://www.boost.org/doc/libs/1_69_0/libs/context/doc/html/context/overview.html
# https://docs.microsoft.com/en-us/cpp/build/reference/gl-whole-program-optimization?view=vs-2017
set(CMAKE_C_FLAGS "${CMAKE_C_FLAGS} /bigobj /GL- /EHs")
set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} /bigobj /GL- /EHs")

# detect if 32-bit target
if(CMAKE_VS_PLATFORM_NAME STREQUAL "Win32")
  # SAFESEH is not supported in Boost on Windows x86
  # maybe it is when Boost is compiled with it...
  # https://lists.boost.org/Archives/boost/2013/10/206720.php
  # https://docs.microsoft.com/en-us/cpp/build/reference/safeseh-image-has-safe-exception-handlers?view=vs-2017
  set(CMAKE_EXE_LINKER_FLAGS "${CMAKE_EXE_LINKER_FLAGS} /SAFESEH:NO")
endif()

These optimizations don't have much of influence here.

Instead, I've found some compiler optimizations going on to reduce the binary sizes.
This includes improving the exception handling and stack unwinding operations.

https://habr.com/en/company/microsoft/blog/442996/
https://developercommunity.visualstudio.com/content/problem/273518/incorrect-stack-unwinding-after-throwing-exception.html

https://www.boost.org/doc/libs/1_68_0/libs/context/doc/html/context/ff/implementations__fcontext_t__ucontext_t_and_winfiber.html

Because the TIB (thread information block on Windows) is not fully described in the MSDN, it might be possible that not all required TIB-parts are swapped. Using WinFiber implementation migh be an alternative.

ASIO IO Threads manipulating memory?

https://stackoverflow.com/questions/14981291/boostasio-internal-threads

On Windows, defining BOOST_ASIO_DISABLE_IOCP will cause Asio to use the select-based implementation, which is slower but should not spawn additional threads. 

Tested this, but it did not fix the crash. Won't apply the patch on Windows therefore.

Windows Stack

Well, turns out that the default allocated stack size for a coroutine is 64 KB. Whenever it has exceptions inside, this stack may be too low.

https://stackoverflow.com/questions/51110650/minimal-possible-stack-size-on-windows-when-using-c-exceptions-using-boost-co

Raising the default coroutine stack on Windows actually fixes the problem.

https://docs.microsoft.com/en-us/cpp/build/exception-handling-x64?view=vs-2019
https://devblogs.microsoft.com/cppblog/making-cpp-exception-handling-smaller-x64/

Final Analysis

Statement: The crash only happens when an error is raised.

Whenever valid requests are made, everything is fine.

Statement: When something doesn't work, it throws an exception.

Boost Beast http::async_read_header can throw an exception. We can re-build this into error codes. Doesn't change.
The debug console triggers a ScriptError raised from the config compiler.
Wrong header reading triggers std::invalid_argument()

Not all of them are boost compatible exceptions.

Statement: Whenever an exception happens, the coroutine returns where called and unwinds the stack.

This includes a stack deallocate() call which calls std::free(). This crashes with pointing to memory which is corrupted.

Statement: Throwing std::invalid_argument() on Windows corrupts the exception stack.

Yes. Rewriting EnsureValidHeaders() to not use these exceptions allows the code to complete.

No fucking idea what the Visual Studio compiler renders from this code bit. It may originate from static INLINE too.

Statement: ScriptError still causes the application to crash.

The stack size with 64 KB is too low. Why? Look into what's stored inside the exception. The whole DebugInfo struct is massively bloated.

It may also be the case that exceptions inside coroutines cause an overhead which is
timing related to many requests exceeding the limit.

Statement: Swallowing exceptions with catch(...) is evil.

Yes, Boost Coroutine doesn't work with the unwind exception.

Statement: Defer-Disconnect with exceptions swalled in the Destructor don't work.

Yes, these two patterns are incompatible within the new network stack. Avoid them at all cost.

Statement: It is unclear which coroutine stack size is needed for Windows and MSVC.

Tests have proven something between 8 and 64 MB. This still is too high, but mitigates the issue.
With all the error handling and code alignments we are even more safe, also in regard to Linux/Unix.
A windows agent typically doesn't expose much HTTP traffic (only scanners), JSON-RPC also is low (2 parent satellites).

Fixes

Ensure that Coroutine Stack unwinding is fully supported within the API.

  • HttpServerConnection: Prefer error codes over Boost exceptions #7477
  • JsonRpcConnection: Don't swallow exceptions in Boost.Coroutine #7483
  • Rewrite error handling in HttpServerConnection#EnsureValidHeaders() #7486
  • Avoid the Defer-Disconnect destructor pattern with Boost.Coroutines #7485
  • Re-throw boost::coroutines::detail::forced_unwind #7351

Ensure that HTTP Disconnects don't fire I/O.

  • API: Avoid I/O on shutdown #7487

Wrap thrown exceptions into Boost compatible exceptions in boost::asio::spawn().
Windows coroutine stack size: Increase stack size to prevent crashes with MSVC.

  • Introduce IoEngine::SpawnCoroutine wrapping asio::spawn and Boost exceptions #7491

Side Fixes

  • Quality: Replace deprecated Boost IO service code #7490
  • Dev Docs: Update Windows to Visual Studio 2019 #7480

References and URLs

Everything I have had opened and shed some light into finding the cause.

https://geidav.wordpress.com/2013/03/21/anatomy-of-dynamic-stack-allocations/
https://docs.microsoft.com/en-us/windows/uwp/cpp-and-winrt-apis/concurrency-2
https://www.boost.org/doc/libs/1_62_0/libs/fiber/doc/html/fiber/stack.html
https://stackoverflow.com/a/45027110/1821348
https://www.boost.org/doc/libs/1_54_0/doc/html/boost_asio/overview/core/spawn.html
https://docs.microsoft.com/de-de/cpp/cpp/exception-specifications-throw-cpp?view=vs-2019
https://stackoverflow.com/questions/9286550/is-there-any-method-that-causes-whole-stack-frame-unwinding-in-c-except-usin
https://gist.github.com/chfast/bfc1e1af4176960b4722#file-context-cpp-L85
boostorg/beast#1055
https://stackoverflow.com/questions/41030285/boostproperty-treeread-xml-segfaults-in-an-asio-handler-spawned-using-boost
https://www.boost.org/doc/libs/1_66_0/libs/coroutine/doc/html/coroutine/attributes.html
boostorg/beast#1209
https://www.boost.org/doc/libs/1_68_0/libs/context/doc/html/context/stack.html
boostorg/coroutine2#19
http://www.mikemarcin.com/post/coroutine_a_million_stacks/
https://docs.microsoft.com/en-us/windows/win32/memory/memory-limits-for-windows-releases#memory_limits
mxe/mxe#1559
http://www.blinnov.com/en/2013/04/11/english-boostcontext-and-seh/
http://boost.2283326.n4.nabble.com/context-Crash-in-case-of-exception-from-a-context-td4672953.html
https://news.ycombinator.com/item?id=11781201
https://rethinkdb.com/blog/improving-a-large-c-project-with-coroutines/
https://github.com/facebook/folly/tree/master/folly/fibers
https://stackoverflow.com/questions/46738872/weird-after-exit-error-on-coroutine-test-module-on-windows-x64
https://stackoverflow.com/questions/50429345/fast-fibers-coroutines-under-x64-windows
https://www.boost.org/doc/libs/1_65_1/libs/fiber/doc/html/fiber/performance.html
https://paoloseverini.wordpress.com/tag/fibers/
https://news.ycombinator.com/item?id=11781201
https://stackoverflow.com/questions/49036542/visual-c-2017-bug-compiler-optimizes-away-expression?noredirect=1#comment85079616_49036542
https://docs.microsoft.com/de-de/cpp/build/reference/f-set-stack-size?view=vs-2019
https://developercommunity.visualstudio.com/content/problem/345025/heap-corruption-in-stack-unwinding-when-inlining-f.html
https://www-user.tu-chemnitz.de/~heha/viewchm.php/hs/Win32SEH.chm/

Libraries

https://github.com/owt5008137/libcopp
https://github.com/yyzybb537/libgo

Google Searches

Newest first.

6.9.2019 

boost coroutine std::exception problem
boost thread default stack size windows
boost fibers coroutine stack size
windows c++ disable exception hierarchy in the stack
BOOST_ASIO_DISABLE_THREADS
boost coroutine exceptions windows
windows export linked libraries of binary
boost coroutine windows visual studio std::invalid_argument exceptions stack corruption site:stackoverflow.com
boost coroutine windows visual studio std::invalid_argument exceptions stack corruption
windows visual studio std::invalid_argument exceptions stack corruption
boost coroutine stack with exception templates windows
boost coroutine stack with exception templates
boost coroutine STATIC_POOL_SIZE
BOOST_PROPERTY_TREE_RAPIDXML_STATIC_POOL_SIZE
visual studio threadsanitizer
windows visual studio use clang++ cmake
boost context windows visual studio stack corruption
boost context windows stack corruption
windows visual studio stackful coroutines
detached coroutine boost asio
boost asio socket cancel io strand
boost asio spawn avoid exceptions windows visual studio
boost asio spawn avoid exceptions windows
coroutine do not unwind stack windows misaligned
coroutine do not unwind stack windows
boost coroutines stack unwind option
boost windows fibers stack size performance
windows visual studio optimizer couroutines stack size
windows visual studio compiler stack guards optimization
windows visual studio compiler stack optimization
cmake windows disable stack guard
fno stack protector
windows exception stack guards
windows allocate stack guards
boost::context::default_stacksize
boost::context::default_stacksize windows
boost::coroutines::attributes
boost::asio::spawn default coroutines stack size
boost coroutine exceptions windows
boost beast exception coroutine windows stack
boost asio coroutine exception stack windows
boost asio coroutine exception stack
boost asio io strand spawn
boost asio io strand spawn windows exception stack broken
visual studio optimization boost asio spawn
visual studio c++ lambda context optimized unused
visual studio lambda context optimized unused
windows crash in register_onexit_function
error C2491: 'boost::unit_test::unit_test_main': definition of dllimport function not allowed
boost::uuids BCryptOpenAlgorithmProvider
windows boost Boost_USE_STATIC_LIBS
windows boost context static lib crash
Og compiler flag
visual studio disable seh exceptions boost asio
visual studio disable seh exceptions
boost asio spawn exception memory violation windows
visual studio c++ exception types
visual studio exception types
visual studio set exception handling boost context asio
boost coroutine exceptions windows
boost coroutines basic_standard_stack_allocator deallocate memory violation
CMAKE_CXX_FLAGS cmake windows wrong cache
cmake visual studio compiler flags
cmake visual studio compile /GL-
windows msvc turn off global program optimization (/GL)
boost context stack unwound windows
visual studio boost context /Og
boost context increase stack size windows
windows c++ set stack size
windows exception execution_context_v2 in Boost Context
boost binaries context safeseh
disable compiler optimization visual studio 2017 c++
boost context visual studio 2017 crash
cmake windows override /EHsc
basic_standard_stack_allocator boost windows crash
compile boost context windows
visual studio 2017 memory profiler
windows c++ print memory usage from self
windows boost asio context out of memory
boost asio spawn shared_from_this
C++ exception: boost::wrapexcept<boost::system::system_error> at memory location
visual studio exception thrown at memory location
endless http::async_read_header
boost beast async_read_header utf8 problem
 Microsoft C++ exception: boost::system::system_error at memory location
boost asio error code
boost::asio::error::operation_aborted
BOOST_ASIO_DISABLE_IOCP
windows boost asio bad_address
windows bad_address EFAULT
boost::system::error_code bad_address windows
boost::system::error_code access violation windows
boost beast win_iocp
boost beast http::async_read_header ec null pointer
boost beast http::async_read exception
boost asio http::async_read exception windows invalid memory access
Microsoft C++ exception: boost::system::system_error at memory location

28.8.2019

lowest_layer().cancel()
Der E/A-Vorgang wurde wegen eines Threadendes oder einer Anwendungsanforderung abgebrochen
boost http::async_read_header ucrtbase crash windows
windows analyse wer dump file visual studio 2017
reset nessus You've now scanned over 16 different IP addresses over time, and Nessus will not let you scan any additi... 


@stevie-sy

This comment has been minimized.

Copy link
Author

commented Sep 10, 2019

Thank you very much @dnsmichi, @Al2Klimov and everybody in background who helped to fix this. That's really amazaing that you find the cause of this Problem. Thank you for your hard and awesome work! You have our big respect! Chapeau!

@dnsmichi

This comment has been minimized.

Copy link
Member

commented Sep 12, 2019

Thanks for the kind feedback, now we can start preparing the 2.11 release.

As written on Twitter already, the intention with all the details is to not only help an Icinga developer, but also everyone developing with Boost, Coroutines, etc. on Windows and debugging the same issue. Sharing knowledge is what makes open source strong, not only the source code.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
3 participants
You can’t perform that action at this time.