[5.0] refactor threading of `snapshot_scheduler_test` #1821

spoonincode · 2023-10-25T00:22:44Z

In the process of investigating #1794 I got my build into a state (via various compile options and compilers, I guess) where snapshot_scheduler_test would consistently fail with

test_snapshot_scheduler.cpp(115): fatal error: in "producer_snapshot_scheduler_tests/snapshot_scheduler_test": critical check validate_snapshot_request(4, 12, 10, true) has failed

On closer inspection, some aspects of the threading in this test look invalid and racy. After this refactor my troublesome build always passes. I'd like to consider this as resolving #1794 until we know otherwise (we should know shortly 🙂)

In particular,
There is a race between the app->exec() and the chain_plug->chain().block_start.connect() -- blocks could be created before that connection is made and thus missed by the connection. I resolved this by moving the connection to occur before app->exec(). I believe this is where my local failure stumbled.

pp->schedule_snapshot() and pp->get_snapshot_requests() touch variables that are not guarded by any mutex. These should only be called by the "main thread" (in this case the app_thread), so I wrapped them in a .post() to the app.
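As a rough illustration of that pattern (not the test's actual code), the sketch below posts every access to unguarded state onto a single loop thread and hands results back through a future. `main_loop`, `scheduler`, `post_and_wait`, and `demo` are hypothetical names invented for this example; they only stand in for the app's executor and the plugin's snapshot request list, and nothing here is the appbase API.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <deque>
#include <functional>
#include <future>
#include <memory>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical stand-in for the application's single-threaded executor.
class main_loop {
public:
   // Queue a task to run on whichever thread is inside exec().
   void post(std::function<void()> task) {
      {
         std::lock_guard<std::mutex> g(mtx_);
         tasks_.push_back(std::move(task));
      }
      cv_.notify_one();
   }
   // Ask the loop to exit after the tasks queued so far have run.
   void quit() { post([this] { done_ = true; }); }
   // Run queued tasks in FIFO order on the calling thread until quit() is processed.
   void exec() {
      while (!done_) {
         std::function<void()> task;
         {
            std::unique_lock<std::mutex> lk(mtx_);
            cv_.wait(lk, [this] { return !tasks_.empty(); });
            task = std::move(tasks_.front());
            tasks_.pop_front();
         }
         task();
      }
   }
private:
   std::mutex mtx_;
   std::condition_variable cv_;
   std::deque<std::function<void()>> tasks_;
   bool done_ = false; // read and written only on the loop thread
};

// Hypothetical unguarded state: safe only because every access happens on the loop thread.
struct scheduler {
   std::vector<int> requests;
   void schedule(int block_spacing) { requests.push_back(block_spacing); }
   std::size_t request_count() const { return requests.size(); }
};

// Run f on the loop thread and block the caller until its (non-void) result is ready.
template <typename F>
auto post_and_wait(main_loop& loop, F f) -> decltype(f()) {
   using R = decltype(f());
   // shared_ptr keeps the promise alive until the posted task has finished with it.
   auto p = std::make_shared<std::promise<R>>();
   auto fut = p->get_future();
   loop.post([p, f]() mutable { p->set_value(f()); });
   return fut.get();
}

// The test thread never touches `sched` directly; it only posts work to the loop thread.
std::size_t demo() {
   main_loop loop;
   scheduler sched;
   std::thread app_thread([&] { loop.exec(); });
   loop.post([&] { sched.schedule(100); });
   loop.post([&] { sched.schedule(200); });
   std::size_t n = post_and_wait(loop, [&] { return sched.request_count(); });
   loop.quit();
   app_thread.join();
   return n;
}
```

Because the queue is FIFO and only one thread drains it, the count query runs after both `schedule` calls, so no mutex is needed around `scheduler` itself.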

I also replaced what was effectively a sleep(10) with a wait for block 20 (however long that takes).
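The connection-ordering fix and the wait-for-block change can be sketched together. This is a minimal illustration under stated assumptions, not the code in the test: `block_signal` is a hypothetical stand-in for the chain's block_start signal, `run_until` is an invented helper, and the 1 ms pause merely imitates a block interval.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <cstdint>
#include <functional>
#include <future>
#include <thread>
#include <vector>

// Hypothetical stand-in for the chain's block_start signal.
struct block_signal {
   std::vector<std::function<void(uint32_t)>> slots;
   void connect(std::function<void(uint32_t)> s) { slots.push_back(std::move(s)); }
   void emit(uint32_t block_num) const {
      for (const auto& s : slots) s(block_num);
   }
};

// Start a producer thread, wait until `target` has been seen, and return every block observed.
std::vector<uint32_t> run_until(uint32_t target) {
   block_signal block_start;
   std::vector<uint32_t> seen; // written only by the producer thread, read after join()
   std::promise<void> reached;
   std::future<void> reached_fut = reached.get_future();

   // Connect BEFORE the producer thread exists, so block 1 cannot be missed.
   block_start.connect([&](uint32_t n) {
      seen.push_back(n);
      if (n == target) reached.set_value();
   });

   std::atomic<bool> stop{false};
   std::thread producer([&] {
      for (uint32_t n = 1; !stop; ++n) {
         block_start.emit(n);
         std::this_thread::sleep_for(std::chrono::milliseconds(1)); // imitates a block interval
      }
   });

   reached_fut.wait(); // blocks for however long it takes, instead of a fixed sleep
   stop = true;
   producer.join();
   return seen;
}
```

Calling `run_until(20)` returns a vector that starts at block 1 and contains block 20. Waiting on the future ties the test's progress to the event it cares about rather than to wall-clock time, which is what makes it robust on slow or heavily loaded CI machines.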

linh2931

Great fix!

The "failed to read response from monitor process" failure now appears in another test: https://github.com/AntelopeIO/leap/actions/runs/6634407607/job/18023991204. I added some debug logging to compile_monitor.cpp in the oc_compile_monitor_debug branch and ran CI/CD manually hoping to see some logs, but the logs I added did not show up. I thought the monitor process might have already exited before get_connection_to_compile_monitor was called.

I could not reproduce it locally.

spoonincode · 2023-10-25T02:31:53Z

Not sure... I've been working under the premise that the error message was actually indicative of a crash, since that's what we were thinking with #1736/#1766. But the tests we're seeing the error message on are among the few that reuse appbase in this manner... so maybe it's something about the way plugins are getting destroyed/recreated that's messing something up in OC's external process glue.

refactor threading of snapshot_scheduler_test

841d547

linh2931 approved these changes Oct 25, 2023

heifner approved these changes Oct 25, 2023

heifner linked an issue Oct 25, 2023 that may be closed by this pull request

Test Failure: failed to read response from monitor process in snapshot_scheduler_test #1794

Closed

spoonincode merged commit 81a9d5c into release/5.0 Oct 25, 2023
29 checks passed

spoonincode deleted the snapschedthreads_5x branch October 25, 2023 13:14

This was referenced Oct 25, 2023

[5.0 -> main] refactor threading of snapshot_scheduler_test #1822

Merged

Test Failure: failed to read response from monitor process in snapshot_scheduler_test #1794

Closed

linh2931 mentioned this pull request Oct 25, 2023

Test Failure: "failed to read response from monitor process" in read_only_trxs/with_3_read_only_threads #1823

Closed
