Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[5.0] don't run EOS VM OC's monitor compile task callback when socket being dtored #1827

Merged
merged 1 commit into from
Oct 26, 2023

Conversation

spoonincode
Copy link
Member

This may resolve #1823 & #1794 based on core dump analysis of an oc-monitor crash.

A single OC monitor process is created at launch of a parent process which plans to use OC (nodeos, a unit test application, etc). Critically, a single io_context is used in this single monitor process. As code_caches are created on the parent process, compile_monitor_sessions are created in the monitor process for each of those code_cache instances.

When a code_cache needs to compile a contract, it sends a message to its out of process compile_monitor_session. This will then call off to another process -- the OC trampoline -- but the most important actions here to note is that the compile_monitor_session notes in its current_compiles a socket to receive a reply on from the trampoline,

current_compiles.emplace_front(code_id, std::move(response_socket));

and then goes off to do an async_wait() on that socket,
socket.async_wait(local::datagram_protocol::socket::wait_read, [this, current_compile_it](auto ec) {

Consider that current_compiles holds some outstanding sockets because compiles are ongoing. Now consider that code_cache is destroyed. When that happens the read_message_from_nodeos() (which is a misnomer: it's really read_message_from_codecache) will indicate a signal,

void read_message_from_nodeos() {
_nodeos_instance_socket.async_wait(local::datagram_protocol::socket::wait_read, [this](auto ec) {
if(ec) {
connection_dead_signal();
return;

this signal is an indication that the compile_monitor_session can be destroyed (since its code_cache pairing has been destroyed). So.. it's destroyed,
_compile_sessions.front().connection_dead_signal.connect([&, it = _compile_sessions.begin()]() {
ctx.post([&]() {
_compile_sessions.erase(it);
});
});

As compile_monitor_session is destroyed its current_compiles is first to be dtored. As part of this destruction the local::datagram_protocol::socket will be destroyed ergo it will be cancel()ed egro it will make a callback to the async_wait() with boost::asio::error::operation_aborted. That's not strictly a problem, but things fall apart quickly here because the async_wait() handler then accesses an iterator to the currently-being-destroyed current_compiles, along with potentially using the currently-being-destroyed socket.

So, bail doing anything in async_wait()'s callback if there is an error.

Unfortunately while all the above makes sense, I can't really explain why the failure doesn't happen more often with just nodeos. My hunch is that a crashing oc-monitor on nodeos shutdown is simply never noticed nor has any negative side effects. It's only in these new tests that expect oc-monitor to (as designed) be long lived where something gets angry that oc-monitor has disappeared (because it crashed)

@heifner heifner changed the base branch from main to release/5.0 October 26, 2023 17:14
// for now just consider any error as being due to cancellation at dtor time and completely bail out (there aren't many other
// potential errors for an asnyc_wait)
if(ec)
return;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you want to log this for potential debugging or just too much noises?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No higher than debug if you do log it; don't think we need it.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Logging config won't be active in the monitor process since it's forked off before main() -- so debug won't be visible, and any error/warn/info messages won't honor the user's logging config either. So probably best to not log anything here.

@spoonincode spoonincode merged commit f371af5 into release/5.0 Oct 26, 2023
29 checks passed
@spoonincode spoonincode deleted the oc_monitor_skip_cb_on_dtor_5x branch October 26, 2023 18:04
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
3 participants