Skip to content

Starting point for a timeout service#1777

Merged
oksuzian merged 3 commits intoMu2e:mainfrom
michaelmackenzie:TimeoutService
Apr 6, 2026
Merged

Starting point for a timeout service#1777
oksuzian merged 3 commits intoMu2e:mainfrom
michaelmackenzie:TimeoutService

Conversation

@michaelmackenzie
Copy link
Copy Markdown
Contributor

@michaelmackenzie michaelmackenzie commented Apr 1, 2026

Create a timeout service to track event and module timing. This creates deadlines for modules/events, where an example is added to AgnosticHelixFinder. In general, trigger modules with large time hits will need to add this service to periodically check if they've exceeded their deadlines. How the trigger in general handles these cases should be discussed, but this service should give flexibility for most possibilities.

@FNALbuild
Copy link
Copy Markdown
Collaborator

Hi @michaelmackenzie,
You have proposed changes to files in these packages:

  • fcl
  • CalPatRec
  • /
  • TimeoutService

which require these tests: build.

@Mu2e/fnalbuild-users, @Mu2e/write have access to CI actions on main.

⌛ The following tests have been triggered for d569030: build (Build queue - API unavailable)

About FNALbuild. Code review on Mu2e/Offline.

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds an initial cooperative timeout “watchdog” service intended to help modules/events enforce soft processing deadlines, with an example integration in AgnosticHelixFinder.

Changes:

  • Introduces TimeoutWatchdogService (thread-local deadline tracking + cooperative check()/stop_token API, plus RAII ModuleGuard).
  • Integrates optional timeout checks into CalPatRec::AgnosticHelixFinder via a new timeoutMs module parameter.
  • Wires the new package into both CMake and SCons builds and enables it in the standard reco services set.

Reviewed changes

Copilot reviewed 10 out of 10 changed files in this pull request and generated 5 comments.

Show a summary per file
File Description
TimeoutService/inc/TimeoutWatchdog.hh Declares the new service API + FHiCL config and ModuleGuard.
TimeoutService/src/TimeoutWatchdog.cc Implements per-thread event/module deadlines and cooperative cancellation.
TimeoutService/src/TimeoutWatchdog_service.cc Registers the art service plugin.
TimeoutService/CMakeLists.txt Builds/installs the new library + service plugin.
TimeoutService/src/SConscript Adds SCons build rules for the new package.
CMakeLists.txt Adds the TimeoutService subdirectory to the top-level build.
fcl/standardServices.fcl Adds the watchdog service to the standard reco services tables.
CalPatRec/src/AgnosticHelixFinder_module.cc Adds optional timeout budgeting and periodic timeout checks in hot loops.
CalPatRec/CMakeLists.txt Links AgnosticHelixFinder against Offline::TimeoutService.
CalPatRec/src/SConscript Adds SCons deps on mu2e_TimeoutService.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread fcl/standardServices.fcl
Comment on lines +57 to +62
RecoOnly : {
TimeoutWatchdog : {
eventTimeoutMs : 0.
moduleTimeoutMs : 0.
debugLevel : 0
}
Copy link

Copilot AI Apr 1, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The service is configured under the key TimeoutWatchdog, but the plugin/service class is TimeoutWatchdogService (see DEFINE_ART_SERVICE(mu2e::TimeoutWatchdogService) and DECLARE_ART_SERVICE(...)). As written, art will fail to locate/load the service when @table::Services.RecoOnly is included. Rename the service table entry to TimeoutWatchdogService (or change the plugin/service name to match) to keep configuration consistent.

Copilot uses AI. Check for mistakes.
Comment on lines +485 to +488
if (shouldExitForTimeout(__func__)) {
if(_doTiming > 0) _watch->StopTime("per-time cluster");
break;
}
Copy link

Copilot AI Apr 1, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Inside the timeout early-exit path you call _watch->StopTime("per-time cluster") and then break, but produce() later unconditionally calls _watch->StopTime("per-time cluster") after the loop. This results in a double StopTime() on the same timer when a timeout occurs, which will skew timing metrics and can change the timer’s active state unexpectedly. Consider removing the inner StopTime() (or guarding the outer one) so each Increment("per-time cluster") scope is stopped exactly once.

Copilot uses AI. Check for mistakes.
0.0};
fhicl::Atom<int> debugLevel{
Name("debugLevel"),
Comment("Service debug verbosity: 0=off, 1=timeouts, 2=module/event boundaries"),
Copy link

Copilot AI Apr 1, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

debugLevel’s comment says 2=module/event boundaries, but endModule() only logs when debugLevel_ > 2 (i.e., level 3+). Either update the comment to reflect the actual levels, or adjust the endModule() condition so level 2 includes the end-of-module boundary message.

Suggested change
Comment("Service debug verbosity: 0=off, 1=timeouts, 2=module/event boundaries"),
Comment("Service debug verbosity: 0=off, 1=timeouts, 3=module/event boundaries (higher = more detail)"),

Copilot uses AI. Check for mistakes.
Comment thread TimeoutService/inc/TimeoutWatchdog.hh Outdated
#include <optional>
#include <stop_token>
#include <string>
#include <unordered_map>
Copy link

Copilot AI Apr 1, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

#include <unordered_map> is currently unused in this header. Dropping it will reduce compile time and avoid unused-include warnings in stricter builds.

Suggested change
#include <unordered_map>

Copilot uses AI. Check for mistakes.
Comment on lines +43 to +48
if (debugLevel_ > 1) {
std::printf("[TimeoutWatchdogService::%s] event=%s eventTimeoutMs=%.3f\n",
__func__,
tls_.eventId.c_str(),
eventTimeoutMs_);
}
Copy link

Copilot AI Apr 1, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This service currently logs via std::printf. In this codebase, services typically use MessageFacility (mf::LogInfo/LogDebug/LogWarning) so messages are routed through the configured MessageLogger, can be filtered by category, and include job context. Consider switching these debug prints to MessageFacility (and adding the appropriate dependency) to avoid unmanaged stdout/stderr output in multi-threaded jobs.

Copilot uses AI. Check for mistakes.
@FNALbuild
Copy link
Copy Markdown
Collaborator

☀️ The build tests passed at d569030.

Test Result Details
test with Command did not list any other PRs to include
merge Merged d569030 at 67d99ba
build (prof) Log file. Build time: 08 min 26 sec
ceSimReco Log file.
g4test_03MT Log file.
transportOnly Log file.
POT Log file.
g4study Log file.
cosmicSimReco Log file.
cosmicOffSpill Log file.
ceSteps Log file.
ceDigi Log file.
muDauSteps Log file.
ceMix Log file.
rootOverlaps Log file.
g4surfaceCheck Log file.
trigger Log file. Return Code 1.
check_cmake Log file.
FIXME, TODO TODO (0) FIXME (0) in 4 files
clang-tidy ➡️ 5 errors 106 warnings
whitespace check no whitespace errors found

N.B. These results were obtained from a build of this Pull Request at d569030 after being merged into the base branch at 67d99ba.

For more information, please check the job page here.
Build artifacts are deleted after 5 days. If this is not desired, select Keep this build forever on the job page.

Copy link
Copy Markdown
Collaborator

@rlcee rlcee left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this all looks reasonable. It is a little "too professional style" compared to most of our code, but I think we need to get used to that. One thing I might suggest is to remove threading references. I don't think it is wrong, I just can't say it is right until we develop some expertise and policies in threading patterns. Until then, it is little misleading in some sense.

@oksuzian
Copy link
Copy Markdown
Collaborator

oksuzian commented Apr 4, 2026

I'll review this pull request by examining its details and changes.Now let me get the full files for a thorough review:Here's my review of PR #1777. This introduces a cooperative TimeoutWatchdogService for tracking event/module timing, with an example integration in AgnosticHelixFinder. I found several issues:


🔴 Bug: startModule resets the event-level stop_source, breaking event deadlines

void TimeoutWatchdogService::startModule(std::string const& moduleLabel,
                                         std::optional<double> allowedTimeMs) {
  // Module scope starts a fresh stop token so checks are local to this invocation.
  tls_.moduleLabel = moduleLabel;
  tls_.stopSource = std::stop_source{};
  tls_.stopToken = tls_.stopSource.get_token();

Every call to startModule() replaces tls_.stopSource with a fresh instance. If a previous module already triggered an event-level timeout (which called tls_.stopSource.request_stop()), that cancellation state is silently discarded. The next module in the same event will see stop_requested() == false even though the event deadline has already passed.

Fix: Either (a) use separate stop_source/stop_token pairs for event vs. module cancellation, or (b) don't reset the stop_source in startModule() if the event-level timeout has already fired. Consider checking tls_.eventDeadline before resetting.


🔴 Bug: DECLARE_ART_SERVICE(..., SHARED) with thread_local state is fragile

DECLARE_ART_SERVICE(mu2e::TimeoutWatchdogService, SHARED)

The service is declared SHARED (one instance across all schedules), but all mutable state lives in thread_local. This works as long as art pins each schedule to a unique thread for the duration of an event, but art does not guarantee this. If art ever migrates an event between threads (or reuses threads across schedules non-deterministically), the thread_local state will be stale or belong to a different event.

Consider either declaring LEGACY (if single-threaded is acceptable), or keying state by art::ScheduleID instead of thread-local storage.


🟡 Issue: check() is const but mutates thread_local state

bool TimeoutWatchdogService::check() const {
  ...
  tls_.stopSource.request_stop();

check() is marked const but calls tls_.stopSource.request_stop(), which mutates the thread_local state. While technically legal in C++ (the const only covers this members, not TLS), it's semantically misleading. Callers may assume check() is side-effect-free. Consider either dropping the const or documenting this side-effect clearly in the header.


🟡 Issue: Event identity comparison using string is inefficient

static std::string eventIdString_(art::Event const& e) {
  std::ostringstream os;
  os << e.id().run() << ":" << e.id().subRun() << ":" << e.id().event();
  return os.str();
}

A string allocation + ostringstream construction happens on every startEvent() call just to compare event identity. Since art::EventID should support equality comparison directly, you could store the art::EventID (or its numeric components) and compare those, avoiding heap allocations on every event.


🟡 Issue: Diagnostic output uses printf instead of art's MessageFacility

Throughout both TimeoutWatchdog.cc and the AgnosticHelixFinder integration, diagnostics use std::printf(...). The Mu2e/art convention is to use mf::LogInfo, mf::LogDebug, etc. from the MessageFacility library. Using printf bypasses the configurable message filtering and log-file routing that the framework provides.


🟡 Issue: ModuleGuard calls startEvent() — responsibility confusion

  {
    svc_.startEvent(e);       // idempotent per-event if you call it for every module
    svc_.startModule(moduleLabel, allowedTimeMs);
  }

The ModuleGuard constructor calls startEvent() "idempotently," but this conflates two levels of concern: each module guard silently owns event-level setup. If no module creates a guard (e.g., timeout is disabled for all modules but event-level is desired), the event deadline is never set. Consider hooking startEvent() via an art::ActivityRegistry callback (sPreProcessEvent) instead, so event timing is always tracked if configured.


🟡 Issue: Unused #include <unordered_map> in header

#include <unordered_map>

<unordered_map> is included but never used. This should be removed.


🟡 Issue: Duplicate dependency in CalPatRec/src/SConscript

                                 'mu2e_DataProducts',
                                 'mu2e_ConfigTools',
                                 'mu2e_TimeoutService',
                                 'mu2e_DataProducts',

mu2e_DataProducts appears twice in the mainlib dependency list (lines 29 and 31 in the patched file). The new mu2e_TimeoutService line was inserted between the duplicate entries.


🟢 Minor: _timeoutMs stored as both fhicl config output and a class member

In AgnosticHelixFinder_module.cc, the field _timeoutMs is populated via the OptionalAtom config extraction and then also passed to ModuleGuard. This works but the value could be read directly from the config once and passed through without a redundant member variable. Very minor.


Summary

Severity Issue
🔴 Bug startModule() resets stop_source, discarding event-level timeout signals
🔴 Bug SHARED service + thread_local relies on unguaranteed thread-schedule affinity
🟡 check() marked const but mutates TLS
🟡 String-based event identity comparison is unnecessarily expensive
🟡 printf instead of MessageFacility for diagnostics
🟡 ModuleGuard calling startEvent() creates confused ownership
🟡 Unused #include <unordered_map>
🟡 Duplicate mu2e_DataProducts in SConscript

The most critical issue is the startModule() resetting the stop_source — this effectively makes event-level timeouts unreliable across multi-module events, which is likely a core use case for this service.

@FNALbuild
Copy link
Copy Markdown
Collaborator

📝 The HEAD of main has changed to a31b07e. Tests are now out of date.

@oksuzian
Copy link
Copy Markdown
Collaborator

oksuzian commented Apr 4, 2026

@FNALbuild build test

@michaelmackenzie
Copy link
Copy Markdown
Contributor Author

I believe I addressed all of the comments. I removed the thread_local logic and switched to LEGACY service without potential threading. The event start is now optionally registered as a pre-event call, where for trigger timing tests in simulations we'll turn this off to not include the pre-fetch module timing in the tests.

@michaelmackenzie
Copy link
Copy Markdown
Contributor Author

@FNALbuild run build test

@FNALbuild
Copy link
Copy Markdown
Collaborator

⌛ The following tests have been triggered for 47d374c: build (Build queue - API unavailable)

@FNALbuild
Copy link
Copy Markdown
Collaborator

☔ The build tests failed for 47d374c.

Test Result Details
test with Command did not list any other PRs to include
merge Merged 47d374c at a31b07e
build (prof) Log file. Build time: 08 min 26 sec
ceSimReco Log file.
g4test_03MT Log file. Return Code 1.
transportOnly Log file.
POT Log file.
g4study Log file.
cosmicSimReco Log file.
cosmicOffSpill Log file.
ceSteps Log file.
ceDigi Log file.
muDauSteps Log file.
ceMix Log file.
rootOverlaps Log file.
g4surfaceCheck Log file.
trigger Log file.
check_cmake Log file.
FIXME, TODO ➡️ TODO (0) FIXME (1) in 5 files
clang-tidy ➡️ 7 errors 209 warnings
whitespace check no whitespace errors found

N.B. These results were obtained from a build of this Pull Request at 47d374c after being merged into the base branch at a31b07e.

For more information, please check the job page here.
Build artifacts are deleted after 5 days. If this is not desired, select Keep this build forever on the job page.

@michaelmackenzie
Copy link
Copy Markdown
Contributor Author

It seems having threading abilities is actually required by Offline:

---- OtherArt BEGIN
  ServiceCreation
  ---- Configuration BEGIN
    The service 'TimeoutWatchdog' is a legacy service,
    which can be used with only one schedule and one thread.
    This job uses 5 schedules and 5 threads.
    Please reconfigure your job to use only one schedule/thread.
  ---- Configuration END
  cet::exception caught during construction of service type mu2e::TimeoutWatchdogService
---- OtherArt END

Should I add the thread_local back in or should I remove this service from the multi-threaded G4 job?

@rlcee
Copy link
Copy Markdown
Collaborator

rlcee commented Apr 4, 2026

It seems having threading abilities is actually required by Offline:

Interesting. We removed the user-requested geant multi-threading since the last geant upgrade when threads mostly would not run. I see that it is "MT" test, which explains why it is tested. This is a little funny in that 1) I don't know how art senses this, two no other services are designed to be threaded, and 3) geant threading (the only threading that was commissioned) is not actual art threading, it is geant threads inside the art module, as I recall. But I am not expert in this - it was Lisa, Andrei, Rob and Kyle.
But it look like we are declaring our services "shared" like here so if you add that, then I suspect art will accept it. Frankly, I forgot about that, and I'm not sure what it means and why it's required. I'll have to dig up the docs and references..

@michaelmackenzie
Copy link
Copy Markdown
Contributor Author

@FNALbuild run build test

@FNALbuild
Copy link
Copy Markdown
Collaborator

⌛ The following tests have been triggered for cc8af60: build (Build queue - API unavailable)

@FNALbuild
Copy link
Copy Markdown
Collaborator

☀️ The build tests passed at cc8af60.

Test Result Details
test with Command did not list any other PRs to include
merge Merged cc8af60 at a31b07e
build (prof) Log file. Build time: 04 min 11 sec
ceSimReco Log file.
g4test_03MT Log file.
transportOnly Log file.
POT Log file.
g4study Log file.
cosmicSimReco Log file.
cosmicOffSpill Log file.
ceSteps Log file.
ceDigi Log file.
muDauSteps Log file.
ceMix Log file.
rootOverlaps Log file.
g4surfaceCheck Log file.
trigger Log file.
check_cmake Log file.
FIXME, TODO ➡️ TODO (0) FIXME (1) in 5 files
clang-tidy ➡️ 7 errors 209 warnings
whitespace check no whitespace errors found

N.B. These results were obtained from a build of this Pull Request at cc8af60 after being merged into the base branch at a31b07e.

For more information, please check the job page here.
Build artifacts are deleted after 5 days. If this is not desired, select Keep this build forever on the job page.

@oksuzian oksuzian merged commit 2967415 into Mu2e:main Apr 6, 2026
14 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants