Skip to content

Timeout: lift VERIFY -> ASSERT to prevent crashes in release builds#10789

Merged
julianbrost merged 3 commits intomasterfrom
otel-fixes
Apr 16, 2026
Merged

Timeout: lift VERIFY -> ASSERT to prevent crashes in release builds#10789
julianbrost merged 3 commits intomasterfrom
otel-fixes

Conversation

@yhabteab
Copy link
Copy Markdown
Member

@yhabteab yhabteab commented Apr 13, 2026

strand.running_in_this_thread() relies on thread-local storage internally, and may return false positives if the coroutine is resumed on a different thread than it was suspended in. In debug builds, this is not a problem, since there's no TLS optimization done by the compiler, but in release builds, the compiler might cache the address of the thread-local variable read before the coroutine suspension, and thus potentially reuse the same address in a different thread after resumption, which would cause running_in_this_thread() to return false or even crash (but we didn't see any crashes related to this). So, perform the assertion only in debug builds to prevent potential wrong usages of the Timeout class. For more details, see 123.

To verify the bug, you can apply the following patch to the OTel class:

diff --git a/lib/otel/otel.cpp b/lib/otel/otel.cpp
index 42be64f3a..1ed39d589 100644
--- a/lib/otel/otel.cpp
+++ b/lib/otel/otel.cpp
@@ -282,6 +282,7 @@ void OTel::Connect(boost::asio::yield_context& yc)
 				boost::system::error_code ec;
 				m_RetryExportAndConnTimer.expires_after(Backoff(attempt));
 				m_RetryExportAndConnTimer.async_wait(yc[ec]);
+				VERIFY(m_Strand.running_in_this_thread());
 			}
 		}
 	}
@@ -314,6 +315,7 @@ void OTel::ExportLoop(boost::asio::yield_context& yc)
 		// avoid waiting indefinitely in that case.
 		while (!m_Request && !m_Stopped) {
 			m_ExportAsioCV.Wait(yc);
+			VERIFY(m_Strand.running_in_this_thread());
 		}

 		if (m_Stopped) {
Crush Dumps
[2026-04-13 09:31:27 +0200] information/OTelExporter: Connecting to OpenTelemetry backend on host 'localhost:8428'.
[2026-04-13 09:31:27 +0200] critical/OTelExporter: Cannot connect to OpenTelemetry backend 'localhost:8428' (attempt #1): Connection refused [system:61 at /opt/homebrew/include/boost/asio/detail/reactive_socket_connect_op.hpp:97:37 in function 'do_complete']
/Users/yhabteab/Workspace/icinga2/lib/base/io-engine.hpp:240: assertion failed: strand.running_in_this_thread()
Caught SIGABRT.
Current time: 2026-04-13 09:31:27 +0200

[2026-04-13 09:31:27 +0200] critical/Application: Icinga 2 has terminated unexpectedly. Additional information can be found in '/Users/yhabteab/Workspace/icinga2/prefix/var/log/icinga2/crash/report.1776065487.747801'
[2026-04-13 09:31:27 +0200] notice/cli: Seamless worker (PID 14160) stopped, stopping as well
Stacktrace:
 0# icinga::Application::SigAbrtHandler(int) in /Users/yhabteab/Workspace/icinga2/prefix/lib/icinga2/sbin/icinga2
 1# _sigtramp in /usr/lib/system/libsystem_platform.dylib
 2# pthread_kill in /usr/lib/system/libsystem_pthread.dylib
 3# abort in /usr/lib/system/libsystem_c.dylib
 4# icinga::IoEngine::Get() in /Users/yhabteab/Workspace/icinga2/prefix/lib/icinga2/sbin/icinga2
 5# icinga::OTel::Connect(boost::asio::basic_yield_context<boost::asio::executor>&) in /Users/yhabteab/Workspace/icinga2/prefix/lib/icinga2/sbin/icinga2
 6# icinga::OTel::ExportLoop(boost::asio::basic_yield_context<boost::asio::executor>&) in /Users/yhabteab/Workspace/icinga2/prefix/lib/icinga2/sbin/icinga2
 7# void boost::context::detail::fiber_entry<boost::context::detail::fiber_record<boost::context::fiber, boost::context::basic_fixedsize_stack<boost::context::stack_traits>, boost::asio::detail::spawned_fiber_thread::entry_point<boost::asio::detail::spawn_entry_point<boost::asio::io_context::strand, void icinga::IoEngine::SpawnCoroutine<boost::asio::io_context::strand, icinga::OTel::Start()::$_0>(boost::asio::io_context::strand&, icinga::OTel::Start()::$_0)::'lambda'(boost::asio::basic_yield_context<boost::asio::executor>), boost::asio::detail::detached_handler>>>>(boost::context::detail::transfer_t) in /Users/yhabteab/Workspace/icinga2/prefix/lib/icinga2/sbin/icinga2

fixes #10783

Footnotes

  1. https://github.com/chriskohlhoff/asio/issues/1366

  2. https://bugs.llvm.org/show_bug.cgi?id=19177

  3. https://gcc.gnu.org/bugzilla/show_bug.cgi?id=26461

@yhabteab yhabteab added this to the 2.16.0 milestone Apr 13, 2026
@yhabteab yhabteab added bug Something isn't working area/opentelemetry Metrics to OpenTelemetry. labels Apr 13, 2026
@cla-bot cla-bot Bot added the cla/signed label Apr 13, 2026
@yhabteab yhabteab changed the title OTel: fix problematic Timeout class usages Timeout: lift VERIFY -> ASSERT to prevent crashes in release builds Apr 15, 2026
@yhabteab
Copy link
Copy Markdown
Member Author

After all the debugging and head-scratching about this in the last 2 days offline, the problem was spotted by @jschmidt-icinga and the mysterious crash exclusively on aarch64 is caused by compiler optimizations and Asio's use of thread-local storage with Fibers that can migrate between threads. See the updated PR description for more details.

Instead of completely removing the assertion, I have lifted it to ASSERT as requested internally by @julianbrost, so please have a look at the updated PR now. Ty.

Comment thread lib/base/io-engine.hpp Outdated
`strand.running_in_this_thread()` relies on thread-local storage
internally, and may return false positives if the coroutine is resumed
in a different thread than it was suspended in. In debug builds, this is
not problem, since there's no TLS optimization done by the compiler, but
in release builds, the compiler might cache the address of the
thread-local variable read before the coroutine suspension, and thus
potentially reuse the same address in a different thread after
resumption, which would cause `running_in_this_thread()` to return false
or even crash (but we didn't see any crashes related to this). So,
perform the assertion only in debug builds to prevent potential wrong
usages of the `Timeout` class. For more details, see [^1][^2][^3].

[^1]: chriskohlhoff/asio#1366
[^2]: https://bugs.llvm.org/show_bug.cgi?id=19177
[^3]: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=26461
Copy link
Copy Markdown
Contributor

@jschmidt-icinga jschmidt-icinga left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks fine to me. 👍

@julianbrost julianbrost merged commit 647b2b2 into master Apr 16, 2026
28 checks passed
@julianbrost julianbrost deleted the otel-fixes branch April 16, 2026 10:48
@julianbrost
Copy link
Copy Markdown
Member

I hate computers, and C++ in particular.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area/opentelemetry Metrics to OpenTelemetry. bug Something isn't working cla/signed

Projects

None yet

Development

Successfully merging this pull request may close these issues.

OTLPMetricsWriter VictoriaMetrics - Error: Broken pipe

3 participants