
RSDK-10722: Prevent spawning infinite restart handlers when module exits quickly #4997


Merged
merged 26 commits into viamrobotics:main from RSDK-10722-1 on May 29, 2025

Conversation

jmatth
Member

@jmatth jmatth commented May 19, 2025

This fixes an issue where OUE (OnUnexpectedExit) handlers would proliferate endlessly as long as the module kept crashing. It also significantly changes the way modmanager handles crashing modules.

Module startup

New behavior

Calling Manager.Add will attempt to start a module. This includes starting the process as well as several checks to ensure the gRPC connection is working. If any part of this process fails we kill the process and return an error.
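As a rough sketch of this startup contract, with illustrative names (startModuleProcess appears in the diff below; module, checkReady, and stopProcess are stand-ins, not the actual rdk code):

package modmanagerexample

import (
  "context"
  "fmt"
)

type module struct{} // stand-in for the real module struct

func (m *module) checkReady(ctx context.Context) error { return nil } // gRPC readiness check
func (m *module) stopProcess() error                   { return nil } // kill the module process

type Manager struct{}

func (mgr *Manager) startModuleProcess(mod *module) error { return nil }

// addModule: start the process, verify the gRPC connection is usable, and
// kill the process (returning an error) if any step fails.
func (mgr *Manager) addModule(ctx context.Context, mod *module) error {
  if err := mgr.startModuleProcess(mod); err != nil {
    return fmt.Errorf("starting module process: %w", err)
  }
  if err := mod.checkReady(ctx); err != nil {
    // Don't leave a half-started process behind.
    if stopErr := mod.stopProcess(); stopErr != nil {
      err = fmt.Errorf("%w; also failed to stop process: %v", err, stopErr)
    }
    return fmt.Errorf("module failed startup checks: %w", err)
  }
  return nil
}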

Old behavior

Pretty much the same, but there was a race: if a module process crashed during startup, the OUE handler could keep trying to restart it even though Add had returned an error.

Module restarts

New behavior

If a module process crashes we will attempt to restart it indefinitely with a 5 second backoff between retries. During this time the module and its resources will remain in the Manager + resource graph. Calls to Remove, Reconfigure, Close, etc. cancel a context to terminate the module restart attempts.
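A minimal sketch of this restart behavior, with illustrative names (the real logic lives in the modmanager OUE handler and is guarded by the manager mutex):

package modmanagerexample

import (
  "context"
  "time"
)

type module struct{} // stand-in for the real module struct

// restartLoop retries indefinitely with a fixed 5-second backoff, stopping
// only when the module's restart context is canceled (e.g. by Remove,
// Reconfigure, or Close).
func restartLoop(ctx context.Context, mod *module, attemptRestart func(context.Context, *module) error) {
  const backoff = 5 * time.Second
  for {
    if err := attemptRestart(ctx, mod); err == nil {
      return // the module process is running again
    }
    select {
    case <-ctx.Done():
      // The module was removed or reconfigured away; stop retrying.
      return
    case <-time.After(backoff):
      // Wait out the backoff, then try again.
    }
  }
}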

Old behavior

If a module crashed we tried to restart it 3 times, then removed it + its resources as if they had been reconfigured away.

@jmatth jmatth requested a review from dgottlieb May 19, 2025 16:32
@jmatth jmatth marked this pull request as ready for review May 19, 2025 17:55
@jmatth
Member Author

jmatth commented May 19, 2025

With the change from a bounded restart loop to retrying indefinitely with backoff in the OUE handler, I think the existing test coverage is sufficient and this is ready for review.

@@ -914,8 +920,17 @@ func (mgr *Manager) newOnUnexpectedExitHandler(mod *module) pexec.UnexpectedExit
// and we should remove orphaned resources. Since we handle process
// restarting ourselves, return false here so goutils knows not to attempt
// a process restart.
blockSubRestarts := make(chan struct{})
Member

I think we can delete this now?

@@ -678,7 +678,7 @@ func TestModuleReloading(t *testing.T) {
test.That(t, logs.FilterMessageSnippet("Module successfully restarted").Len(),
test.ShouldEqual, 0)
test.That(t, logs.FilterMessageSnippet("Error while restarting crashed module").Len(),
-			test.ShouldEqual, 3)
+			test.ShouldBeGreaterThanOrEqualTo, 1)
Member

A rare inspiration of reviewing the test for me. I know you didn't write this, but since it was touched, now I'm curious:

  • what's the test setup that results in the module restart failing?
  • Is this log line guaranteed to happen before the assertion, or is this a race?

Member

(hidden diff -- GitHub's PR UI is not a professional CR product)

		// Remove testmodule binary, so process cannot be successfully restarted
		// after crash.

(thanks to the later test for inspiring me to look for filepath manipulation -- Windows lets one remove a file that's being executed?). Consider the first question answered.

Still curious about the second question.

Member Author

This isn't a race because the WaitForAssertion earlier in the test is waiting for a line that comes after all the lines in this group of assertions. If we made it past that, we're guaranteed that this set of assertions is accurate.

Member

Ah, I saw that, but now it makes sense. Because* the module won't restart, we go ahead with removing (permanently).
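To make the ordering argument above concrete, a rough hypothetical sketch; waitFor is a stand-in for the repo's WaitForAssertion-style helper, not its actual API:

package modmanagertest

import (
  "testing"
  "time"
)

// waitFor polls until check passes or the deadline expires.
func waitFor(t *testing.T, timeout time.Duration, check func() bool) {
  t.Helper()
  deadline := time.Now().Add(timeout)
  for time.Now().Before(deadline) {
    if check() {
      return
    }
    time.Sleep(10 * time.Millisecond)
  }
  t.Fatal("condition not met before deadline")
}

// The ordering trick: first block on a log line that is only emitted *after*
// the lines being counted, e.g.
//
//   waitFor(t, 10*time.Second, func() bool {
//     return logs.FilterMessageSnippet("some later log line").Len() > 0
//   })
//
// so the earlier "Error while restarting crashed module" count is already
// stable by the time the assertions run.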

}
} else {
break
blockRestart := make(chan struct{})
Member Author

@dgottlieb I know you didn't like this pattern last time so I included an alternate implementation here: 7841737. Personally I prefer this because the code to modify the OUE behavior is in one contiguous block, while the other version involves a branch in a deferred call and manipulating state on the module struct. On the other hand the other implementation builds on the existing context and lock usage rather than introducing another synchronization primitive so 🤷.

There's also a third option to rely on each chained OUE handler to perform the next restart attempt rather than looping in the first OUE, but I think the loop is more readable.

Member

I think I understand the logic here, but the pattern that's being used here is a little hard for me to follow. I think what you have is fine, but could you leave a comment above L989 to give a little more context on how this works (blocking restart and returning false on lack of success in particular).

Member Author

@benjirewis I added a comment explaining the problem this is solving in the most recent commit. To expand on it a bit, consider the following code:

func startProc() pexec.ManagedProcess {
  cfg := pexec.ProcessConfig{
    //...
    OnUnexpectedExit: func(_ int) bool {
      startProc()
      return false
    },
  }
  return pexec.NewManagedProcess(cfg, logger)
}

proc := startProc()
// some other code
proc.Stop()

If the process crashes before you call proc.Stop() you just leaked a process because the OUE handler created a new process that you don't have any reference to. Obviously this code is simplified but it's representative of how the "two running module processes" bug occurred in the first place. The channel + wrapper function is one place this PR fixes the issue: the channel allows us to block the new OUE from executing until the surrounding function returns, and the function wrapping the real OUE returns early without attempting a restart if the call to attemptRestart ultimately returned an error.
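For illustration, a rough sketch of the blocking pattern described above; the names (startWithGuardedOUE, blockRestart, restartAllowed) are illustrative, and the real handler also coordinates with the manager mutex and the module's restart context:

package modmanagerexample

// startWithGuardedOUE wraps a start function so that any OUE that fires
// while we're still starting (or deciding that the start failed) blocks on a
// channel, and bails out entirely if the caller already reported failure.
func startWithGuardedOUE(start func(oue func(exitCode int) bool) error) error {
  blockRestart := make(chan struct{})
  restartAllowed := false

  oue := func(exitCode int) bool {
    // Block until the surrounding call has returned and recorded whether
    // the (re)start attempt succeeded.
    <-blockRestart
    if !restartAllowed {
      // The caller already reported failure; don't spawn another process.
      return false
    }
    // ... normal crash handling / restart scheduling would go here ...
    return false
  }

  err := start(oue)
  // Record the outcome, then unblock any OUE that fired in the meantime.
  restartAllowed = err == nil
  close(blockRestart)
  return err
}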

Member

Awesome. The explanation here makes sense to me, and the comment you left is great. Thanks!

@jmatth jmatth requested review from benjirewis and dgottlieb May 28, 2025 14:16
Member

@benjirewis benjirewis left a comment

Thanks for looping me in! I have some nits/questions, but I know I don't have the full context on these changes, so feel free to push back on anything. I have not looked at your changes to the testing files yet.

@@ -353,7 +353,9 @@ func (mgr *Manager) startModule(ctx context.Context, mod *module) error {
ctx, "Waiting for module to complete startup and registration", "module", mod.cfg.Name, mod.logger)
defer cleanup()

if err := mgr.startModuleProcess(mod); err != nil {
var restartCtx context.Context
Member

[nit] Could we maybe call this moduleRestartCtx? I like the idea of having module-specific restart contexts, but I just want to distinguish it from the manager-wide restartCtx.

orphanedResourceNames = append(orphanedResourceNames, name)
continue
}
_, err = mod.client.AddResource(ctx, &pb.AddResourceRequest{Config: confProto, Dependencies: res.deps})
Member

Two questions:

  • Why switch to mod.client.AddResource instead of using mgr.addResource?
  • Why not remove orphaned resources from mgr.rMap and mod.resources as we were doing before?

Member Author

Not using mgr.addResource: primarily because mgr.addResource performs more work that we don't need to repeat: populating some maps and constructing a resource.Resource. In a crashed state we already did this, and users of Manager may have cached instances of resource.Resource that will start working again once the module is successfully restarted. Looking at it again, I don't think calling mgr.addResource would actually break anything, but it would waste time doing repeat work, instantiating a resource.Resource that will just immediately be GCed, and emitting log lines that may just be noise.

Not removing resources: the high-level abstract answer is because we're still trying to restart the module, so until a reconfigure removes the module they're not orphaned. More concretely, so that clients who try to use resources on a crashed module won't see "this resource doesn't exist" and instead see "this resource exists but the connection failed", which more closely matches reality.

Member

Why not remove orphaned resources from mgr.rMap and mod.resources as we were doing before?

Adding on to @jmatth's response:

Josh came to me with how orphaned resources with respect to the resource graph ought to behave in these module crashing states. And I correlated that to remote resources when connections to remotes go offline (the "will a video stream come back when the network heals" problem).

I think my perspective was also the simpler one to code towards. But happy to go over this decision more thoughtfully.

Member

Your answers to both my questions make sense to me, so this seems totally fine. I had already assumed based on the code here we were moving toward the "treat crashed modules as downed remotes" strategy, so I'm happy to hear that that was an offline discussion already.

-	mod.logger.Infow("Module resources successfully re-added after module restart", "module", mod.cfg.Name)
+	mod.logger.Warnw("Module resources successfully re-added after module restart",
+		"module", mod.cfg.Name,
+		"resources", restoredResourceNamesStr)
return
}
}

// attemptRestart will attempt to restart the module up to three times and
Member

up to three times

No longer true, right?

@@ -985,12 +977,6 @@ func (mgr *Manager) attemptRestart(ctx context.Context, mod *module) []resource.
}
}()

if ctx.Err() != nil {
Member

[q] Why remove this early exit from attemptRestart? Shouldn't we abandon the attempt if the ctx passed into attemptRestart has already errored?

Member Author

It's superfluous: we already check if the context was cancelled after we take the lock in the OUE handler that's calling this function.


mgr.modules.Delete(m.cfg.Name)
m.deregisterResources()
if err := m.sharedConn.Close(); err != nil {
m.logger.Warnw("Error closing connection to crashed module", "error", err)
Member

[nit] I think this could still be an unchecked error, but seems fine.

Member Author

@jmatth jmatth May 28, 2025

This was copied from the old version of the OUE handler. It kind of is unchecked: all we do with err is log it and continue with the rest of the cleanup steps. I'm inclined to leave it that way unless we think it will cause too much log noise.

Edit: I just dug into utils.UncheckedError and see it does not just throw away the error like I assumed, but produces a special global log message at the debug level. Are you saying we should do that rather than log via the module-scoped logger here?

Member

Are you saying we should do that rather than log via the module scoped logger here?

Whoops, no. I also forgot what utils.UncheckedError here means. I meant that we should completely ignore the error and not log it at all. It's not a super strong opinion, nor do I think totally ignoring the error is critical. This is just a pattern NetCode comes up against often: a failure to close something, we've found, is rarely an interesting thing to report to a user, especially at the Error log level.

}
return true
})
mgr.modules.Delete(m.cfg.Name)
Member

[q] Again, why remove code around deletion from maps like mgr.modules and mgr.rMap?

Member Author

Partially for the same reason as my other comment: the new behavior is to retain modules and their resources until they are reconfigured away rather than implicitly removing them. We need to retain a reference to the module for another reason though: if a crashing module is being removed or reconfigured, we need to terminate its OUE handler, and to do that we need a reference to the module struct that holds the cancel function for the module's restart context. If Manager drops its references to a crashing module then it has no way to cancel that context, and we just leaked the goroutine executing the OUE handler.

Member

Per other comment, this reasoning also makes sense to me (plus the extra reasoning around maintaining a handle to the module struct). Thanks!
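A minimal sketch of the ownership relationship described above, with hypothetical types modeled loosely on the diff (mod.restartCancel, mgr.modules):

package modmanagerexample

import (
  "context"
  "sync"
)

// module keeps the cancel function for its own restart context so the
// Manager can stop an in-flight restart loop when the module is removed or
// reconfigured.
type module struct {
  restartCancel context.CancelFunc
}

// Manager holds on to crashing modules (rather than deleting them from its
// maps) precisely so it can still reach restartCancel later.
type Manager struct {
  mu      sync.Mutex
  modules map[string]*module
}

// Remove cancels the module's restart context before dropping the module,
// terminating the OUE handler's restart loop instead of leaking it.
func (mgr *Manager) Remove(name string) {
  mgr.mu.Lock()
  defer mgr.mu.Unlock()
  if mod, ok := mgr.modules[name]; ok {
    mod.restartCancel()
    delete(mgr.modules, name)
  }
}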

@@ -353,7 +353,9 @@ func (mgr *Manager) startModule(ctx context.Context, mod *module) error {
ctx, "Waiting for module to complete startup and registration", "module", mod.cfg.Name, mod.logger)
defer cleanup()

if err := mgr.startModuleProcess(mod); err != nil {
var moduleRestartCtx context.Context
moduleRestartCtx, mod.restartCancel = context.WithCancel(mgr.restartCtx)
Member

It's pretty straightforward, but let's document motivation, lifetime and contract of the moduleRestartCtx.

Member

I see it is documented below. I think the only piece missing is that the manager mutex is required for reading the context state as well as for canceling the context.

@@ -494,12 +495,11 @@ func (mgr *Manager) closeModule(mod *module, reconfigure bool) error {

mod.deregisterResources()

-	mgr.rMap.Range(func(r resource.Name, m *module) bool {
+	for r, m := range mgr.rMap.Range {
Member

Is this from the new Go custom range/iterator implementations?

Member Author

Yes. I think an earlier version of these changes modified the code surrounding this loop, so it wasn't just a random drive-by syntax update. Happy to remove it if it's just cluttering the diff now.

Member

Seems fine to me; I think it's much easier to read.
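For anyone unfamiliar with the Go 1.23 feature referenced in this thread: any method of the form func(yield func(K, V) bool) can be ranged over directly, which is why the callback form and the for-range form here are equivalent. A small self-contained illustration (not rdk code):

package main

import "fmt"

// pairMap mimics a sync-map-style container whose Range method takes a
// callback, like mgr.rMap in the diff.
type pairMap struct {
  data map[string]int
}

// Range has the func(yield func(K, V) bool) shape, so Go 1.23+ lets callers
// use it directly in a for-range statement.
func (p *pairMap) Range(yield func(string, int) bool) {
  for k, v := range p.data {
    if !yield(k, v) {
      return
    }
  }
}

func main() {
  p := &pairMap{data: map[string]int{"a": 1, "b": 2}}

  // Old style: explicit callback.
  p.Range(func(k string, v int) bool {
    fmt.Println(k, v)
    return true
  })

  // New style (Go 1.23+): range over the method value.
  for k, v := range p.Range {
    fmt.Println(k, v)
  }
}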

mgr.ftdc.Remove(mod.getFTDCName())
}
if !cleanupPerformed {
mod.cleanupAfterCrash(mgr)
Member

I assume it's a precondition that cleanupAfterCrash is called before attemptRestart?

I see attemptRestart also calls cleanupAfterCrash when it errors.

Given we have the manager mutex (which I assume prevents concurrent "create stuff"), do we need this cleanupPerformed reference inside the for-loop? I imagine the goal is that we want to first check the context before cleaning things up?

And if we pulled the cleanupAfterCrash out before the for loop, we'd have to also pull out the context check?

I'm just kind of weirded out about this important sounding cleanup method invocation that I'm not particularly familiar with.


}
lock()
defer unlock()

Member

If I understand correctly, the context can only be canceled by the goroutine holding the manager mutex?

I think it'd be slightly better to always follow the lock call with a context check. I agree that's functionally what we have now. But I think having them literally back to back would be better.

Right now it looks like the context check is an optimization to bail and save time. But it's actually a correctness requirement to avoid double-starting.
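A sketch of the "check the context immediately after taking the lock" convention being suggested here, with hypothetical names rather than the actual modmanager code:

package modmanagerexample

import (
  "context"
  "sync"
)

// withRestartLock runs fn while holding the manager mutex, but only if the
// module's restart context is still alive. Because the context can only be
// canceled by a goroutine holding the same mutex, checking it immediately
// after locking is what prevents a canceled restart from double-starting a
// module -- a correctness requirement, not just an optimization.
func withRestartLock(mu *sync.Mutex, restartCtx context.Context, fn func()) bool {
  mu.Lock()
  defer mu.Unlock()
  if restartCtx.Err() != nil {
    // Remove/Reconfigure/Close already canceled this module's restarts.
    return false
  }
  fn()
  return true
}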

Member

@benjirewis benjirewis left a comment

Pretty much LGTM! Just a question about not lowering the OUE interval in the test. Thanks for diving into this stuff; it's super hairy.


defer func(oriOrigVal time.Duration) {
oueRestartInterval = oriOrigVal
}(oueRestartInterval)
oueRestartInterval = 10 * time.Millisecond
Member

Does not lowering this interval make the test quite long? Or, now that we don't have the three restart attempts, is this interval irrelevant?

Member Author

Since we don't have 3 restart attempts, the test had to be adjusted and the interval is irrelevant. With the new code it actually caused a lot of log noise, as it failed to restart the module every 10ms for most of the lifetime of the test.

@jmatth jmatth merged commit 95716ec into viamrobotics:main May 29, 2025
18 checks passed
@jmatth jmatth deleted the RSDK-10722-1 branch May 29, 2025 20:28