
RSDK-10722: Prevent spawning infinite restart handlers when module exits quickly #4997


Merged
merged 26 commits into viamrobotics:main from RSDK-10722-1 on May 29, 2025

Conversation

jmatth
Member

@jmatth jmatth commented May 19, 2025

This fixes an issue where OUE (OnUnexpectedExit) handlers would proliferate endlessly as long as the module kept crashing. It also significantly changes the way modmanager handles crashing modules.

Module startup

New behavior

Calling Manager.Add will attempt to start a module. This includes starting the process as well as several checks to ensure the gRPC connection is working. If any part of this process fails we kill the process and return an error.
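As a rough sketch of this startup contract, with illustrative names (startModuleProcess appears in the diff below; module, checkReady, and stopProcess are stand-ins, not the actual rdk code):

package modmanagerexample

import (
  "context"
  "fmt"
)

type module struct{} // stand-in for the real module struct

func (m *module) checkReady(ctx context.Context) error { return nil } // gRPC readiness check
func (m *module) stopProcess() error                   { return nil } // kill the module process

type Manager struct{}

func (mgr *Manager) startModuleProcess(mod *module) error { return nil }

// addModule: start the process, verify the gRPC connection is usable, and
// kill the process (returning an error) if any step fails.
func (mgr *Manager) addModule(ctx context.Context, mod *module) error {
  if err := mgr.startModuleProcess(mod); err != nil {
    return fmt.Errorf("starting module process: %w", err)
  }
  if err := mod.checkReady(ctx); err != nil {
    // Don't leave a half-started process behind.
    if stopErr := mod.stopProcess(); stopErr != nil {
      err = fmt.Errorf("%w; also failed to stop process: %v", err, stopErr)
    }
    return fmt.Errorf("module failed startup checks: %w", err)
  }
  return nil
}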

Old behavior

Pretty much the same, but there was a race: if a module process crashed during startup, the OUE handler could keep trying to restart it even though Add had returned an error.

Module restarts

New behavior

If a module process crashes we will attempt to restart it indefinitely with a 5 second backoff between retries. During this time the module and its resources will remain in the Manager + resource graph. Calls to Remove, Reconfigure, Close, etc. cancel a context to terminate the module restart attempts.
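A minimal sketch of this restart behavior, with illustrative names (the real logic lives in the modmanager OUE handler and is guarded by the manager mutex):

package modmanagerexample

import (
  "context"
  "time"
)

type module struct{} // stand-in for the real module struct

// restartLoop retries indefinitely with a fixed 5-second backoff, stopping
// only when the module's restart context is canceled (e.g. by Remove,
// Reconfigure, or Close).
func restartLoop(ctx context.Context, mod *module, attemptRestart func(context.Context, *module) error) {
  const backoff = 5 * time.Second
  for {
    if err := attemptRestart(ctx, mod); err == nil {
      return // the module process is running again
    }
    select {
    case <-ctx.Done():
      // The module was removed or reconfigured away; stop retrying.
      return
    case <-time.After(backoff):
      // Wait out the backoff, then try again.
    }
  }
}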

Old behavior

If a module crashed we tried to restart it 3 times, then removed it + its resources as if they had been reconfigured away.

@jmatth jmatth requested a review from dgottlieb May 19, 2025 16:32
@jmatth jmatth marked this pull request as ready for review May 19, 2025 17:55
@jmatth
Member Author

jmatth commented May 19, 2025

With the change from a bounded restart loop to retrying indefinitely with backoff in the OUE handler, I think the existing test coverage is sufficient and this is ready for review.

@@ -914,8 +920,17 @@ func (mgr *Manager) newOnUnexpectedExitHandler(mod *module) pexec.UnexpectedExit
// and we should remove orphaned resources. Since we handle process
// restarting ourselves, return false here so goutils knows not to attempt
// a process restart.
blockSubRestarts := make(chan struct{})
Member

I think we can delete this now?

@@ -678,7 +678,7 @@ func TestModuleReloading(t *testing.T) {
test.That(t, logs.FilterMessageSnippet("Module successfully restarted").Len(),
test.ShouldEqual, 0)
test.That(t, logs.FilterMessageSnippet("Error while restarting crashed module").Len(),
-			test.ShouldEqual, 3)
+			test.ShouldBeGreaterThanOrEqualTo, 1)
Member

A rare inspiration of reviewing the test for me. I know you didn't write this, but since it was touched, now I'm curious:

  • what's the test setup that results in the module restart failing?
  • Is this log line guaranteed to happen before the assertion, or is this a race?

Member

(hidden diff -- GitHub's PR UI is not a professional CR product)

		// Remove testmodule binary, so process cannot be successfully restarted
		// after crash.

(thanks to the later test for inspiring me to look for filepath manipulation -- Windows lets one remove a file that's being executed?). Consider the first question answered.

Still curious about the second question.

Member Author

This isn't a race because the WaitForAssertion earlier in the test is waiting for a line that comes after all the lines in this group of assertions. If we made it past that, we're guaranteed that this set of assertions is accurate.

Member

Ah, I saw that, but now it makes sense. Because* the module won't restart, we go ahead with removing (permanently).
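To make the ordering argument above concrete, a rough hypothetical sketch; waitFor is a stand-in for the repo's WaitForAssertion-style helper, not its actual API:

package modmanagertest

import (
  "testing"
  "time"
)

// waitFor polls until check passes or the deadline expires.
func waitFor(t *testing.T, timeout time.Duration, check func() bool) {
  t.Helper()
  deadline := time.Now().Add(timeout)
  for time.Now().Before(deadline) {
    if check() {
      return
    }
    time.Sleep(10 * time.Millisecond)
  }
  t.Fatal("condition not met before deadline")
}

// The ordering trick: first block on a log line that is only emitted *after*
// the lines being counted, e.g.
//
//   waitFor(t, 10*time.Second, func() bool {
//     return logs.FilterMessageSnippet("some later log line").Len() > 0
//   })
//
// so the earlier "Error while restarting crashed module" count is already
// stable by the time the assertions run.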

}
} else {
break
blockRestart := make(chan struct{})
Member Author

@dgottlieb I know you didn't like this pattern last time so I included an alternate implementation here: 7841737. Personally I prefer this because the code to modify the OUE behavior is in one contiguous block, while the other version involves a branch in a deferred call and manipulating state on the module struct. On the other hand the other implementation builds on the existing context and lock usage rather than introducing another synchronization primitive so 🤷.

There's also a third option to rely on each chained OUE handler to perform the next restart attempt rather than looping in the first OUE, but I think the loop is more readable.

Member

I think I understand the logic here, but the pattern that's being used here is a little hard for me to follow. I think what you have is fine, but could you leave a comment above L989 to give a little more context on how this works (blocking restart and returning false on lack of success in particular).

Member Author

@benjirewis I added a comment explaining the problem this is solving in the most recent commit. To expand on it a bit, consider the following code:

func startProc() pexec.ManagedProcess {
  cfg := pexec.ProcessConfig{
    //...
    OnUnexpectedExit: func(_ int) bool {
      startProc()
      return false
    },
  }
  return pexec.NewManagedProcess(cfg, logger)
}

proc := startProc()
// some other code
proc.Stop()

If the process crashes before you call proc.Stop() you just leaked a process because the OUE handler created a new process that you don't have any reference to. Obviously this code is simplified but it's representative of how the "two running module processes" bug occurred in the first place. The channel + wrapper function is one place this PR fixes the issue: the channel allows us to block the new OUE from executing until the surrounding function returns, and the function wrapping the real OUE returns early without attempting a restart if the call to attemptRestart ultimately returned an error.
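For illustration, a rough sketch of the blocking pattern described above; the names (startWithGuardedOUE, blockRestart, restartAllowed) are illustrative, and the real handler also coordinates with the manager mutex and the module's restart context:

package modmanagerexample

// startWithGuardedOUE wraps a start function so that any OUE that fires
// while we're still starting (or deciding that the start failed) blocks on a
// channel, and bails out entirely if the caller already reported failure.
func startWithGuardedOUE(start func(oue func(exitCode int) bool) error) error {
  blockRestart := make(chan struct{})
  restartAllowed := false

  oue := func(exitCode int) bool {
    // Block until the surrounding call has returned and recorded whether
    // the (re)start attempt succeeded.
    <-blockRestart
    if !restartAllowed {
      // The caller already reported failure; don't spawn another process.
      return false
    }
    // ... normal crash handling / restart scheduling would go here ...
    return false
  }

  err := start(oue)
  // Record the outcome, then unblock any OUE that fired in the meantime.
  restartAllowed = err == nil
  close(blockRestart)
  return err
}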

Member

Awesome. The explanation here makes sense to me, and the comment you left is great. Thanks!

@jmatth jmatth requested review from benjirewis and dgottlieb May 28, 2025 14:16
Member

@benjirewis benjirewis left a comment

Thanks for looping me in! I have some nits/questions, but I know I don't have the full context on these changes, so feel free to push back on anything. I have not looked at your changes to the testing files yet.

@@ -353,7 +353,9 @@ func (mgr *Manager) startModule(ctx context.Context, mod *module) error {
ctx, "Waiting for module to complete startup and registration", "module", mod.cfg.Name, mod.logger)
defer cleanup()

if err := mgr.startModuleProcess(mod); err != nil {
var restartCtx context.Context
Member

[nit] Could we maybe call this moduleRestartCtx? I like the idea of having module-specific restart contexts, but I just want to distinguish it from the manager-wide restartCtx.

orphanedResourceNames = append(orphanedResourceNames, name)
continue
}
_, err = mod.client.AddResource(ctx, &pb.AddResourceRequest{Config: confProto, Dependencies: res.deps})
Member

Two questions:

  • Why switch to mod.client.AddResource instead of using mgr.addResource?
  • Why not remove orphaned resources from mgr.rMap and mod.resources as we were doing before?

Member Author

Not using mgr.addResource: primarily because mgr.addResource performs more work that we don't need to repeat: populating some maps and constructing a resource.Resource. In a crashed state we already did this, and users of Manager may have cached instances of resource.Resource that will start working again once the module is successfully restarted. Looking at it again, I don't think calling mgr.addResource would actually break anything, but it would waste time doing repeat work, instantiating a resource.Resource that will just immediately be GCed, and emitting log lines that may just be noise.

Not removing resources: the high-level abstract answer is because we're still trying to restart the module, so until a reconfigure removes the module they're not orphaned. More concretely, so that clients who try to use resources on a crashed module won't see "this resource doesn't exist" and instead see "this resource exists but the connection failed", which more closely matches reality.

Member

Why not remove orphaned resources from mgr.rMap and mod.resources as we were doing before?

Adding on to @jmatth's response:

Josh came to me with how orphaned resources with respect to the resource graph ought to behave in these module crashing states. And I correlated that to remote resources when connections to remotes go offline (the "will a video stream come back when the network heals" problem).

I think my perspective was also the simpler one to code towards. But happy to go over this decision more thoughtfully.

Member

Your answers to both my questions make sense to me, so this seems totally fine. I had already assumed based on the code here we were moving toward the "treat crashed modules as downed remotes" strategy, so I'm happy to hear that that was an offline discussion already.

-	mod.logger.Infow("Module resources successfully re-added after module restart", "module", mod.cfg.Name)
+	mod.logger.Warnw("Module resources successfully re-added after module restart",
+		"module", mod.cfg.Name,
+		"resources", restoredResourceNamesStr)
return
}
}

// attemptRestart will attempt to restart the module up to three times and
Member

up to three times

No longer true, right?

@@ -985,12 +977,6 @@ func (mgr *Manager) attemptRestart(ctx context.Context, mod *module) []resource.
}
}()

if ctx.Err() != nil {
Member

[q] Why remove this early exit from attemptRestart? Shouldn't we abandon the attempt if the ctx passed into attemptRestart has already errored?

Member Author

It's superfluous: we already check if the context was cancelled after we take the lock in the OUE handler that's calling this function.


mgr.modules.Delete(m.cfg.Name)
m.deregisterResources()
if err := m.sharedConn.Close(); err != nil {
m.logger.Warnw("Error closing connection to crashed module", "error", err)
Member

[nit] I think this could still be an unchecked error, but seems fine.

Member Author

@jmatth jmatth May 28, 2025

This was copied from the old version of the OUE handler. It kind of is unchecked: all we do with err is log it and continue with the rest of the cleanup steps. I'm inclined to leave it that way unless we think it will cause too much log noise.

Edit: I just dug into utils.UncheckedError and see it does not just throw away the error like I assumed, but produces a special global log message at the debug level. Are you saying we should do that rather than log via the module-scoped logger here?

Member

Are you saying we should do that rather than log via the module scoped logger here?

Whoops, no. I also forgot what utils.UncheckedError here means. I meant that we should completely ignore the error and not log it at all. It's not a super strong opinion, nor do I think totally ignoring the error is critical. This is just a pattern NetCode comes up against often: a failure to close something, we've found, is rarely an interesting thing to report to a user, especially at the Error log level.

}
return true
})
mgr.modules.Delete(m.cfg.Name)
Member

[q] Again, why remove code around deletion from maps like mgr.modules and mgr.rMap?

Member Author

Partially for the same reason as my other comment: the new behavior is to retain modules and their resources until they are reconfigured away rather than implicitly removing them. We need to retain a reference to the module for another reason though: if a crashing module is being removed or reconfigured, we need to terminate its OUE handler, and to do that we need a reference to the module struct that holds the cancel function for the module's restart context. If Manager drops its references to a crashing module then it has no way to cancel that context, and we just leaked the goroutine executing the OUE handler.

Member

Per other comment, this reasoning also makes sense to me (plus the extra reasoning around maintaining a handle to the module struct). Thanks!
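A minimal sketch of the ownership relationship described above, with hypothetical types modeled loosely on the diff (mod.restartCancel, mgr.modules):

package modmanagerexample

import (
  "context"
  "sync"
)

// module keeps the cancel function for its own restart context so the
// Manager can stop an in-flight restart loop when the module is removed or
// reconfigured.
type module struct {
  restartCancel context.CancelFunc
}

// Manager holds on to crashing modules (rather than deleting them from its
// maps) precisely so it can still reach restartCancel later.
type Manager struct {
  mu      sync.Mutex
  modules map[string]*module
}

// Remove cancels the module's restart context before dropping the module,
// terminating the OUE handler's restart loop instead of leaking it.
func (mgr *Manager) Remove(name string) {
  mgr.mu.Lock()
  defer mgr.mu.Unlock()
  if mod, ok := mgr.modules[name]; ok {
    mod.restartCancel()
    delete(mgr.modules, name)
  }
}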

@@ -353,7 +353,9 @@ func (mgr *Manager) startModule(ctx context.Context, mod *module) error {
ctx, "Waiting for module to complete startup and registration", "module", mod.cfg.Name, mod.logger)
defer cleanup()

if err := mgr.startModuleProcess(mod); err != nil {
var moduleRestartCtx context.Context
moduleRestartCtx, mod.restartCancel = context.WithCancel(mgr.restartCtx)
Member

It's pretty straightforward, but let's document motivation, lifetime and contract of the moduleRestartCtx.

Member

I see it is documented below. I think the only piece missing is that the manager mutex is required for reading the context state as well as for canceling the context.

@@ -494,12 +495,11 @@ func (mgr *Manager) closeModule(mod *module, reconfigure bool) error {

mod.deregisterResources()

-	mgr.rMap.Range(func(r resource.Name, m *module) bool {
+	for r, m := range mgr.rMap.Range {
Member

Is this from the new Go custom range/iterator implementations?

Member Author

Yes. I think an earlier version of these changes modified the code surrounding this loop, so it wasn't just a random drive-by syntax update. Happy to remove it if it's just cluttering the diff now.

Member

Seems fine to me; I think it's much easier to read.
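For anyone unfamiliar with the Go 1.23 feature referenced in this thread: any method of the form func(yield func(K, V) bool) can be ranged over directly, which is why the callback form and the for-range form here are equivalent. A small self-contained illustration (not rdk code):

package main

import "fmt"

// pairMap mimics a sync-map-style container whose Range method takes a
// callback, like mgr.rMap in the diff.
type pairMap struct {
  data map[string]int
}

// Range has the func(yield func(K, V) bool) shape, so Go 1.23+ lets callers
// use it directly in a for-range statement.
func (p *pairMap) Range(yield func(string, int) bool) {
  for k, v := range p.data {
    if !yield(k, v) {
      return
    }
  }
}

func main() {
  p := &pairMap{data: map[string]int{"a": 1, "b": 2}}

  // Old style: explicit callback.
  p.Range(func(k string, v int) bool {
    fmt.Println(k, v)
    return true
  })

  // New style (Go 1.23+): range over the method value.
  for k, v := range p.Range {
    fmt.Println(k, v)
  }
}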

mgr.ftdc.Remove(mod.getFTDCName())
}
if !cleanupPerformed {
mod.cleanupAfterCrash(mgr)
Member

I assume it's a precondition that cleanupAfterCrash is called before attemptRestart?

I see attemptRestart also calls cleanupAfterCrash when it errors.

Given we have the manager mutex (which I assume prevents concurrent "create stuff"), do we need this cleanupPerformed reference inside the for-loop? I imagine the goal is that we want to first check the context before cleaning things up?

And if we pulled the cleanupAfterCrash out before the for loop, we'd have to also pull out the context check?

I'm just kind of weirded out about this important sounding cleanup method invocation that I'm not particularly familiar with.


}
lock()
defer unlock()

Member

If I understand correctly, the context can only be canceled by the goroutine holding the manager mutex?

I think it'd be slightly better to always follow the lock call with a context check. I agree that's functionally what we have now. But I think having them literally back to back would be better.

Right now it looks like the context check is an optimization to bail and save time. But it's actually a correctness requirement to avoid double-starting.
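A sketch of the "check the context immediately after taking the lock" convention being suggested here, with hypothetical names rather than the actual modmanager code:

package modmanagerexample

import (
  "context"
  "sync"
)

// withRestartLock runs fn while holding the manager mutex, but only if the
// module's restart context is still alive. Because the context can only be
// canceled by a goroutine holding the same mutex, checking it immediately
// after locking is what prevents a canceled restart from double-starting a
// module -- a correctness requirement, not just an optimization.
func withRestartLock(mu *sync.Mutex, restartCtx context.Context, fn func()) bool {
  mu.Lock()
  defer mu.Unlock()
  if restartCtx.Err() != nil {
    // Remove/Reconfigure/Close already canceled this module's restarts.
    return false
  }
  fn()
  return true
}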

Member

@benjirewis benjirewis left a comment

Pretty much LGTM! Just a question about not lowering the OUE interval in the test. Thanks for diving into this stuff; it's super hairy.


defer func(oriOrigVal time.Duration) {
oueRestartInterval = oriOrigVal
}(oueRestartInterval)
oueRestartInterval = 10 * time.Millisecond
Member

Does not lowering this interval make the test quite long? Or, now that we don't have the three restart attempts, is this interval irrelevant?

Member Author

Since we don't have 3 restart attempts, the test had to be adjusted and the interval is irrelevant. With the new code it actually caused a lot of log noise, as it failed to restart the module every 10ms for most of the lifetime of the test.

@jmatth jmatth merged commit 95716ec into viamrobotics:main May 29, 2025
18 checks passed
@jmatth jmatth deleted the RSDK-10722-1 branch May 29, 2025 20:28