Starting point for a timeout service#1777
Conversation
|
Hi @michaelmackenzie,
which require these tests: build. @Mu2e/fnalbuild-users, @Mu2e/write have access to CI actions on main. ⌛ The following tests have been triggered for d569030: build (Build queue - API unavailable) |
There was a problem hiding this comment.
Pull request overview
Adds an initial cooperative timeout “watchdog” service intended to help modules/events enforce soft processing deadlines, with an example integration in AgnosticHelixFinder.
Changes:
- Introduces
TimeoutWatchdogService(thread-local deadline tracking + cooperativecheck()/stop_tokenAPI, plus RAIIModuleGuard). - Integrates optional timeout checks into
CalPatRec::AgnosticHelixFindervia a newtimeoutMsmodule parameter. - Wires the new package into both CMake and SCons builds and enables it in the standard reco services set.
Reviewed changes
Copilot reviewed 10 out of 10 changed files in this pull request and generated 5 comments.
Show a summary per file
| File | Description |
|---|---|
| TimeoutService/inc/TimeoutWatchdog.hh | Declares the new service API + FHiCL config and ModuleGuard. |
| TimeoutService/src/TimeoutWatchdog.cc | Implements per-thread event/module deadlines and cooperative cancellation. |
| TimeoutService/src/TimeoutWatchdog_service.cc | Registers the art service plugin. |
| TimeoutService/CMakeLists.txt | Builds/installs the new library + service plugin. |
| TimeoutService/src/SConscript | Adds SCons build rules for the new package. |
| CMakeLists.txt | Adds the TimeoutService subdirectory to the top-level build. |
| fcl/standardServices.fcl | Adds the watchdog service to the standard reco services tables. |
| CalPatRec/src/AgnosticHelixFinder_module.cc | Adds optional timeout budgeting and periodic timeout checks in hot loops. |
| CalPatRec/CMakeLists.txt | Links AgnosticHelixFinder against Offline::TimeoutService. |
| CalPatRec/src/SConscript | Adds SCons deps on mu2e_TimeoutService. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| RecoOnly : { | ||
| TimeoutWatchdog : { | ||
| eventTimeoutMs : 0. | ||
| moduleTimeoutMs : 0. | ||
| debugLevel : 0 | ||
| } |
There was a problem hiding this comment.
The service is configured under the key TimeoutWatchdog, but the plugin/service class is TimeoutWatchdogService (see DEFINE_ART_SERVICE(mu2e::TimeoutWatchdogService) and DECLARE_ART_SERVICE(...)). As written, art will fail to locate/load the service when @table::Services.RecoOnly is included. Rename the service table entry to TimeoutWatchdogService (or change the plugin/service name to match) to keep configuration consistent.
| if (shouldExitForTimeout(__func__)) { | ||
| if(_doTiming > 0) _watch->StopTime("per-time cluster"); | ||
| break; | ||
| } |
There was a problem hiding this comment.
Inside the timeout early-exit path you call _watch->StopTime("per-time cluster") and then break, but produce() later unconditionally calls _watch->StopTime("per-time cluster") after the loop. This results in a double StopTime() on the same timer when a timeout occurs, which will skew timing metrics and can change the timer’s active state unexpectedly. Consider removing the inner StopTime() (or guarding the outer one) so each Increment("per-time cluster") scope is stopped exactly once.
| 0.0}; | ||
| fhicl::Atom<int> debugLevel{ | ||
| Name("debugLevel"), | ||
| Comment("Service debug verbosity: 0=off, 1=timeouts, 2=module/event boundaries"), |
There was a problem hiding this comment.
debugLevel’s comment says 2=module/event boundaries, but endModule() only logs when debugLevel_ > 2 (i.e., level 3+). Either update the comment to reflect the actual levels, or adjust the endModule() condition so level 2 includes the end-of-module boundary message.
| Comment("Service debug verbosity: 0=off, 1=timeouts, 2=module/event boundaries"), | |
| Comment("Service debug verbosity: 0=off, 1=timeouts, 3=module/event boundaries (higher = more detail)"), |
| #include <optional> | ||
| #include <stop_token> | ||
| #include <string> | ||
| #include <unordered_map> |
There was a problem hiding this comment.
#include <unordered_map> is currently unused in this header. Dropping it will reduce compile time and avoid unused-include warnings in stricter builds.
| #include <unordered_map> |
| if (debugLevel_ > 1) { | ||
| std::printf("[TimeoutWatchdogService::%s] event=%s eventTimeoutMs=%.3f\n", | ||
| __func__, | ||
| tls_.eventId.c_str(), | ||
| eventTimeoutMs_); | ||
| } |
There was a problem hiding this comment.
This service currently logs via std::printf. In this codebase, services typically use MessageFacility (mf::LogInfo/LogDebug/LogWarning) so messages are routed through the configured MessageLogger, can be filtered by category, and include job context. Consider switching these debug prints to MessageFacility (and adding the appropriate dependency) to avoid unmanaged stdout/stderr output in multi-threaded jobs.
|
☀️ The build tests passed at d569030.
N.B. These results were obtained from a build of this Pull Request at d569030 after being merged into the base branch at 67d99ba. For more information, please check the job page here. |
rlcee
left a comment
There was a problem hiding this comment.
I think this all looks reasonable. It is a little "too professional style" compared to most of our code, but I think we need to get used to that. One thing I might suggest is to remove threading references. I don't think it is wrong, I just can't say it is right until we develop some expertise and policies in threading patterns. Until then, it is little misleading in some sense.
|
I'll review this pull request by examining its details and changes.Now let me get the full files for a thorough review:Here's my review of PR #1777. This introduces a cooperative 🔴 Bug:
|
| Severity | Issue |
|---|---|
| 🔴 Bug | startModule() resets stop_source, discarding event-level timeout signals |
| 🔴 Bug | SHARED service + thread_local relies on unguaranteed thread-schedule affinity |
| 🟡 | check() marked const but mutates TLS |
| 🟡 | String-based event identity comparison is unnecessarily expensive |
| 🟡 | printf instead of MessageFacility for diagnostics |
| 🟡 | ModuleGuard calling startEvent() creates confused ownership |
| 🟡 | Unused #include <unordered_map> |
| 🟡 | Duplicate mu2e_DataProducts in SConscript |
The most critical issue is the startModule() resetting the stop_source — this effectively makes event-level timeouts unreliable across multi-module events, which is likely a core use case for this service.
|
📝 The HEAD of |
|
@FNALbuild build test |
…er, comment resolutions
|
I believe I addressed all of the comments. I removed the thread_local logic and switched to LEGACY service without potential threading. The event start is now optionally registered as a pre-event call, where for trigger timing tests in simulations we'll turn this off to not include the pre-fetch module timing in the tests. |
|
@FNALbuild run build test |
|
⌛ The following tests have been triggered for 47d374c: build (Build queue - API unavailable) |
|
☔ The build tests failed for 47d374c.
N.B. These results were obtained from a build of this Pull Request at 47d374c after being merged into the base branch at a31b07e. For more information, please check the job page here. |
|
It seems having threading abilities is actually required by Offline: Should I add the thread_local back in or should I remove this service from the multi-threaded G4 job? |
Interesting. We removed the user-requested geant multi-threading since the last geant upgrade when threads mostly would not run. I see that it is "MT" test, which explains why it is tested. This is a little funny in that 1) I don't know how art senses this, two no other services are designed to be threaded, and 3) geant threading (the only threading that was commissioned) is not actual art threading, it is geant threads inside the art module, as I recall. But I am not expert in this - it was Lisa, Andrei, Rob and Kyle. |
|
@FNALbuild run build test |
|
⌛ The following tests have been triggered for cc8af60: build (Build queue - API unavailable) |
|
☀️ The build tests passed at cc8af60.
N.B. These results were obtained from a build of this Pull Request at cc8af60 after being merged into the base branch at a31b07e. For more information, please check the job page here. |
Create a timeout service to track event and module timing. This creates deadlines for modules/events, where an example is added to AgnosticHelixFinder. In general, trigger modules with large time hits will need to add this service to periodically check if they've exceeded their deadlines. How the trigger in general handles these cases should be discussed, but this service should give flexibility for most possibilities.