RSDK-10722: Prevent spawning infinite restart handlers when module exits quickly #4997
Conversation
With the change from attempting a bounded number of restarts in a loop to retrying indefinitely with backoff in the OUE handler, I think the existing test coverage is sufficient and this is ready for review.
module/modmanager/manager.go
Outdated
@@ -914,8 +920,17 @@ func (mgr *Manager) newOnUnexpectedExitHandler(mod *module) pexec.UnexpectedExit
	// and we should remove orphaned resources. Since we handle process
	// restarting ourselves, return false here so goutils knows not to attempt
	// a process restart.
	blockSubRestarts := make(chan struct{})
I think we can delete this now?
module/modmanager/manager_test.go
Outdated
@@ -678,7 +678,7 @@ func TestModuleReloading(t *testing.T) {
	test.That(t, logs.FilterMessageSnippet("Module successfully restarted").Len(),
		test.ShouldEqual, 0)
	test.That(t, logs.FilterMessageSnippet("Error while restarting crashed module").Len(),
		test.ShouldEqual, 3)
		test.ShouldBeGreaterThanOrEqualTo, 1)
A rare inspiration to review the test on my part. I know you didn't write this, but since it was touched, now I'm curious:
- what's the test setup that results in the module restart to fail?
- Is this log line guaranteed to happen before the assertion or is this a race?
(hidden diff -- GitHub PRs is not a professional CR product)
// Remove testmodule binary, so process cannot be successfully restarted
// after crash.
(thanks to the later test for inspiring me to look for filepath manipulation -- does Windows let one remove a file that is being executed?). Consider the first question answered.
Still curious about the second question.
This isn't a race: the WaitForAssertion earlier in the test waits for a line that is logged after all of the lines in this group of assertions. If we made it past that, we're guaranteed that this set of assertions is accurate.
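To make the ordering argument concrete, here's a rough sketch (it assumes the goutils-style testutils.WaitForAssertion helper used elsewhere in these tests; the "later log line" snippet is a placeholder, not the real message):
testutils.WaitForAssertion(t, func(tb testing.TB) {
	tb.Helper()
	// Block until a message that is only logged *after* the failed restart attempts.
	test.That(tb, logs.FilterMessageSnippet("<later log line>").Len(),
		test.ShouldBeGreaterThanOrEqualTo, 1)
})
// By the time the later line exists, the earlier failure lines must already be in
// the observed logs, so this count cannot race with the module manager.
test.That(t, logs.FilterMessageSnippet("Error while restarting crashed module").Len(),
	test.ShouldBeGreaterThanOrEqualTo, 1)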
Ah, I saw that, but now it makes sense. Because the module won't restart, we go ahead with removing it (permanently).
}
} else {
	break
blockRestart := make(chan struct{})
@dgottlieb I know you didn't like this pattern last time so I included an alternate implementation here: 7841737. Personally I prefer this because the code to modify the OUE behavior is in one contiguous block, while the other version involves a branch in a deferred call and manipulating state on the module struct. On the other hand the other implementation builds on the existing context and lock usage rather than introducing another synchronization primitive so 🤷.
There's also a third option to rely on each chained OUE handler to perform the next restart attempt rather than looping in the first OUE, but I think the loop is more readable.
I think I understand the logic here, but the pattern that's being used is a little hard for me to follow. I think what you have is fine, but could you leave a comment above L989 to give a little more context on how this works (blocking the restart and returning false on lack of success, in particular)?
@benjirewis I added a comment explaining the problem this is solving in the most recent commit. To expand on it a bit, consider the following code:
func startProc() pexec.ManagedProcess {
	cfg := pexec.ProcessConfig{
		// ...
		OnUnexpectedExit: func(_ int) bool {
			startProc()
			return false
		},
	}
	return pexec.NewManagedProcess(cfg, logger)
}
proc := startProc()
// some other code
proc.Stop()
If the process crashes before you call proc.Stop(), you just leaked a process because the OUE handler created a new process that you don't have any reference to. Obviously this code is simplified, but it's representative of how the "two running module processes" bug occurred in the first place. The channel + wrapper function is one place this PR fixes the issue: the channel allows us to block the new OUE from executing until the surrounding function returns, and the function wrapping the real OUE returns early without attempting a restart if the call to attemptRestart ultimately returned an error.
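To make that concrete, here's a minimal, self-contained sketch of the channel + wrapper pattern; the names (restartWithBlockedOUE, start, realOUE) are illustrative, not the PR's actual identifiers:
func restartWithBlockedOUE(start func(onExit func(int) bool) error, realOUE func(int) bool) error {
	blockRestart := make(chan struct{})
	restartFailed := false

	oue := func(exitCode int) bool {
		// If the new process crashes immediately, park here until the caller
		// below has decided whether the restart attempt succeeded.
		<-blockRestart
		if restartFailed {
			// The surrounding attempt already gave up; don't chain another restart.
			return false
		}
		return realOUE(exitCode)
	}

	// Start the new process with the wrapped handler as its OnUnexpectedExit.
	err := start(oue)
	restartFailed = err != nil
	// Only now may a pending OUE invocation proceed.
	close(blockRestart)
	return err
}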
Awesome. The explanation here makes sense to me, and the comment you left is great. Thanks!
Thanks for looping me in! I have some nits/questions, but I know I don't have the full context on these changes, so feel free to push back on anything. I have not looked at your changes to the testing files yet.
module/modmanager/manager.go
Outdated
@@ -353,7 +353,9 @@ func (mgr *Manager) startModule(ctx context.Context, mod *module) error {
	ctx, "Waiting for module to complete startup and registration", "module", mod.cfg.Name, mod.logger)
	defer cleanup()

	if err := mgr.startModuleProcess(mod); err != nil {
	var restartCtx context.Context
[nit] Could we maybe call this moduleRestartCtx? I like the idea of having module-specific restart contexts, but I just want to distinguish it from the manager-wide restartCtx.
		orphanedResourceNames = append(orphanedResourceNames, name)
		continue
	}
	_, err = mod.client.AddResource(ctx, &pb.AddResourceRequest{Config: confProto, Dependencies: res.deps})
Two questions:
- Why switch to mod.client.AddResource instead of using mgr.addResource?
- Why not remove orphaned resources from mgr.rMap and mod.resources as we were doing before?
Not using mgr.addResource: primarily because mgr.addResource performs more work that we don't need to repeat: populating some maps and constructing a resource.Resource. In a crashed state we already did this, and users of Manager may have cached instances of resource.Resources that will start working again once the module is successfully restarted. Looking at it again, I don't think calling mgr.addResource would actually break anything, but it would waste time doing repeat work, instantiating a resource.Resource that will just immediately be GCed, and emitting log lines that may just be noise.
Not removing resources: the high-level abstract answer is because we're still trying to restart the module, so until a reconfigure removes the module they're not orphaned. More concretely, so that clients who try to use resources on a crashed module won't see "this resource doesn't exist" and instead see "this resource exists but the connection failed", which more closely matches reality.
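For readers without the full diff, the restore path being described looks roughly like this. It is a sketch only: the converter helper and the res.conf field are assumptions, though mod.client.AddResource, pb.AddResourceRequest, and orphanedResourceNames appear in the diff above.
for name, res := range mod.resources {
	confProto, err := config.ComponentConfigToProto(&res.conf) // assumed converter helper
	if err != nil {
		orphanedResourceNames = append(orphanedResourceNames, name)
		continue
	}
	// Re-register the resource with the restarted module process directly; no new
	// resource.Resource is constructed and mgr.rMap is left untouched.
	_, err = mod.client.AddResource(ctx, &pb.AddResourceRequest{Config: confProto, Dependencies: res.deps})
	if err != nil {
		orphanedResourceNames = append(orphanedResourceNames, name)
	}
}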
> Why not remove orphaned resources from mgr.rMap and mod.resources as we were doing before?

Adding on to @jmatth's response:
Josh came to me with how orphaned resources with respect to the resource graph ought to behave in these module crashing states. And I correlated that to remote resources when connections to remotes go offline (the "will a video stream come back when the network heals" problem).
I think my perspective was also the simpler one to code towards. But happy to go over this decision more thoughtfully.
Your answers to both my questions make sense to me, so this seems totally fine. I had already assumed based on the code here we were moving toward the "treat crashed modules as downed remotes" strategy, so I'm happy to hear that that was an offline discussion already.
module/modmanager/manager.go
Outdated
mod.logger.Infow("Module resources successfully re-added after module restart", "module", mod.cfg.Name) | ||
mod.logger.Warnw("Module resources successfully re-added after module restart", | ||
"module", mod.cfg.Name, | ||
"resources", restoredResourceNamesStr) | ||
return | ||
} | ||
} | ||
|
||
// attemptRestart will attempt to restart the module up to three times and |
> up to three times
No longer true, right?
@@ -985,12 +977,6 @@ func (mgr *Manager) attemptRestart(ctx context.Context, mod *module) []resource.
	}
	}()

	if ctx.Err() != nil {
[q] Why remove this early exit from attemptRestart? Shouldn't we abandon the attempt if the ctx passed into attemptRestart has already errored?
It's superfluous: we already check if the context was cancelled after we take the lock in the OUE handler that's calling this function.
	mgr.modules.Delete(m.cfg.Name)
	m.deregisterResources()
	if err := m.sharedConn.Close(); err != nil {
		m.logger.Warnw("Error closing connection to crashed module", "error", err)
[nit] I think this could still be an unchecked error, but seems fine.
This was copied from the old version of the OUE handler. It kind of is unchecked: all we do with err is log it and continue with the rest of the cleanup steps. I'm inclined to leave it that way unless we think it will cause too much log noise.
Edit: I just dug into utils.UncheckedError and see that it does not just throw away the error like I assumed, but produces a special global log message at the debug level. Are you saying we should do that rather than log via the module-scoped logger here?
> Are you saying we should do that rather than log via the module scoped logger here?

Whoops, no. I also forgot what utils.UncheckedError here means. I meant that we should completely ignore the error and not log it at all. It's not a super strong opinion, nor do I think totally ignoring the error is critical. This is just a pattern NetCode comes up against often: a failure to close something, we've found, is rarely an interesting thing to report to a user, especially at the Error log level.
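The two options being weighed, as short sketches (assuming the utils.UncheckedError helper from goutils and the module fields shown in the diff):
// Option A: keep handling the error, but log it at a quieter level via the module logger.
func closeCrashedConnLogged(m *module) {
	if err := m.sharedConn.Close(); err != nil {
		m.logger.Debugw("Error closing connection to crashed module", "error", err)
	}
}

// Option B: deliberately drop the error; goutils only surfaces it as a debug-level message.
func closeCrashedConnUnchecked(m *module) {
	utils.UncheckedError(m.sharedConn.Close())
}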
	}
	return true
})
mgr.modules.Delete(m.cfg.Name)
[q] Again, why remove code around deletion from maps like mgr.modules and mgr.rMap?
Partially for the same reason as my other comment: the new behavior is to retain modules and their resources until they are reconfigured away rather than implicitly removing them. We need to retain a reference to the module for another reason though: if a crashing module is being removed or reconfigured, we need to terminate its OUE handler, and to do that we need a reference to the module struct that holds the cancel function for the module's restart context. If Manager drops its references to a crashing module, then it has no way to cancel that context and we just leaked the goroutine executing the OUE handler.
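A rough sketch of the lifecycle being described (field and method names other than restartCancel, which appears in the diff, are illustrative):
type module struct {
	restartCancel context.CancelFunc // cancels this module's restart context
	// ... process handle, client, resources, etc.
}

type Manager struct {
	modules map[string]*module
	// ...
}

// Removing or reconfiguring a crashing module must go through the retained
// *module handle so the OUE handler's retry goroutine can be told to stop.
func (mgr *Manager) removeModule(name string) {
	mod, ok := mgr.modules[name]
	if !ok {
		return
	}
	if mod.restartCancel != nil {
		mod.restartCancel() // terminate any in-flight restart attempts
	}
	delete(mgr.modules, name)
	// ... deregister resources, stop the process, close connections ...
}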
Per other comment, this reasoning also makes sense to me (plus the extra reasoning around maintaining a handle to the module struct). Thanks!
@@ -353,7 +353,9 @@ func (mgr *Manager) startModule(ctx context.Context, mod *module) error {
	ctx, "Waiting for module to complete startup and registration", "module", mod.cfg.Name, mod.logger)
	defer cleanup()

	if err := mgr.startModuleProcess(mod); err != nil {
	var moduleRestartCtx context.Context
	moduleRestartCtx, mod.restartCancel = context.WithCancel(mgr.restartCtx)
It's pretty straightforward, but let's document the motivation, lifetime, and contract of the moduleRestartCtx.
I see it is documented below. I think the only piece missing is that the manager mutex is required for reading the context state as well as for canceling the context.
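One possible shape for that documentation (the wording below is a suggestion, not the PR's actual comment):
// moduleRestartCtx is derived from the manager-wide restartCtx and governs only
// this module's crash/restart handling.
//
// Lifetime: created when the module process is first started, and canceled via
// mod.restartCancel when the module is removed or reconfigured or the manager
// closes, which terminates the OUE handler's restart attempts.
//
// Contract: the manager mutex must be held both to read this context's
// cancellation state and to call mod.restartCancel, so a restart attempt cannot
// interleave with a concurrent removal.
var moduleRestartCtx context.Context
moduleRestartCtx, mod.restartCancel = context.WithCancel(mgr.restartCtx)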
@@ -494,12 +495,11 @@ func (mgr *Manager) closeModule(mod *module, reconfigure bool) error {

	mod.deregisterResources()

	mgr.rMap.Range(func(r resource.Name, m *module) bool {
	for r, m := range mgr.rMap.Range {
Is this from the new Go custom range/iterator implementations?
Yes. I think an earlier version of these changes modified the code surrounding this loop so it wasn't just a random drive by syntax update. Happy to remove it if it's just cluttering the diff now.
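For anyone unfamiliar with the feature, here's a tiny self-contained illustration of why the new syntax works (Go 1.23+ range-over-func; the map type here is a stand-in, not rdk's actual rMap):
package main

import "fmt"

type syncMap[K comparable, V any] struct {
	m map[K]V
}

// Range has the classic callback shape: func(yield func(K, V) bool).
func (s *syncMap[K, V]) Range(yield func(K, V) bool) {
	for k, v := range s.m {
		if !yield(k, v) {
			return
		}
	}
}

func main() {
	sm := &syncMap[string, int]{m: map[string]int{"a": 1, "b": 2}}

	// Old style: pass the callback explicitly.
	sm.Range(func(k string, v int) bool {
		fmt.Println(k, v)
		return true
	})

	// New style: a method with that signature is already an iterator, so it can
	// be ranged over directly.
	for k, v := range sm.Range {
		fmt.Println(k, v)
	}
}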
Seems fine to me; I think it's much easier to read.
	mgr.ftdc.Remove(mod.getFTDCName())
}
if !cleanupPerformed {
	mod.cleanupAfterCrash(mgr)
I assume it's a precondition that cleanupAfterCrash is called before attemptRestart? I see attemptRestart also calls cleanupAfterCrash when it errors.
Given we have the manager mutex (which I assume prevents concurrent "create stuff"), do we need this cleanupPerformed reference inside the for-loop? I imagine the goal is that we want to first check the context before cleaning things up?
And if we pulled the cleanupAfterCrash out before the for loop, we'd have to also pull out the context check?
I'm just kind of weirded out about this important-sounding cleanup method invocation that I'm not particularly familiar with.
}
lock()
defer unlock()
If I understand correctly, the context can only be canceled by the goroutine holding the manager mutex?
I think it'd be slightly better to always follow the lock call with a context check. I agree that's functionally what we have now, but I think having them literally back to back would be better.
Right now it looks like the context check is an optimization to bail and save time. But it's actually a correctness requirement to avoid double-starting.
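Concretely, the suggested shape is something like this (a sketch with illustrative names and an assumed mgr.mu field, not the PR's exact code):
func (mgr *Manager) attemptRestartUnderLock(moduleRestartCtx context.Context, mod *module) bool {
	mgr.mu.Lock()
	defer mgr.mu.Unlock()

	// Check cancellation immediately after taking the lock: cancellation only
	// happens while the lock is held elsewhere, so this check is what prevents
	// a double start after Remove/Reconfigure/Close, not just a fast path.
	if moduleRestartCtx.Err() != nil {
		return false
	}
	// ... attempt the restart while still holding the lock ...
	return true
}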
Pretty much LGTM! Just a question about not lowering the OUE interval in the test. Thanks for diving into this stuff; it's super hairy.
defer func(oriOrigVal time.Duration) {
	oueRestartInterval = oriOrigVal
}(oueRestartInterval)
oueRestartInterval = 10 * time.Millisecond
Does not lowering this interval make the test quite long? Or, now that we don't have the three restart attempts, is this interval irrelevant?
Since we no longer have 3 restart attempts, the test had to be adjusted and the interval is irrelevant. With the new code it actually caused a lot of log noise, as it failed to restart the module every 10ms for most of the lifetime of the test.
This fixes an issue where OUE handlers would proliferate endlessly as long as the module kept crashing, but also significantly changes the way modmanager handles crashing modules.
Module startup
New behavior
Calling Manager.Add will attempt to start a module. This includes starting the process as well as several checks to ensure the gRPC connection is working. If any part of this process fails, we kill the process and return an error.
Old behavior
Pretty much the same, but there was a race where, if a module process crashed during startup, the OUE handler could keep trying to restart it even though Add had returned an error.
Module restarts
New behavior
If a module process crashes, we will attempt to restart it indefinitely with a 5 second backoff between retries (see the sketch at the end of this description). During this time the module and its resources will remain in the Manager and the resource graph. Calls to Remove, Reconfigure, Close, etc. cancel a context to terminate the module restart attempts.
Old behavior
If a module crashed we tried to restart it 3 times, then removed it + its resources as if they had been reconfigured away.
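A compact sketch of the new restart flow described above (helper names and structure are illustrative; the real logic lives in the OUE handler in module/modmanager/manager.go):
func restartLoop(ctx context.Context, attemptRestart func(context.Context) error, backoff time.Duration) {
	for {
		if err := attemptRestart(ctx); err == nil {
			return // module restarted successfully
		}
		select {
		case <-ctx.Done():
			// Remove/Reconfigure/Close canceled the module's restart context;
			// stop retrying so the handler goroutine isn't leaked.
			return
		case <-time.After(backoff): // e.g. the 5 second backoff between attempts
		}
	}
}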