fix(sandbox): retry rm -rf during delete to handle kill -9 file-handle race#122
fix(sandbox): retry rm -rf during delete to handle kill -9 file-handle race#122renecannao wants to merge 1 commit into
Conversation
…e race Closes #121. `RemoveSandbox` (and `RemoveCustomSandbox`, plus an earlier custom-delete path) runs the sandbox's stop / send_kill script and then immediately `rm -rf`s the sandbox dir. The `destroy` mode of `send_kill` does `kill -9 $MYPID`, which doesn't synchronously wait for the killed process's open file descriptors to be released. During that brief window, the dying mysqld's last buffered I/O can land inside `master/data/` after `rm`'s readdir() walk has already passed — making the final `rmdir(2)` return ENOTEMPTY: rm: cannot remove '<sandbox>/master/data': Directory not empty error while deleting sandbox <sandbox> Recently observed on: - master Integration Tests run 25839621095 (`MariaDB (10.11.9)`) - PR #119 hit the same flake on 4 consecutive runs before clearing on the 5th retry - PR #114 hit it during initial review Fix: wrap the rm in a small exponential-backoff retry loop (200ms + 400ms + 800ms + 1.6s + 3.2s ≈ 6.2s max wait). The race resolves in milliseconds in practice; retries handle it without changing the stop-script contract or template scripts across topologies. A more thorough fix would be to have `send_kill` wait for the killed PIDs to disappear from /proc before returning, but that's a larger change touching template scripts across single, replication, multiple, tidb, etc. The localized retry in `RemoveSandbox` solves the user-visible symptom with one focused change. Applies to all four `rm -rf` call-sites in sandbox/sandbox.go: - The early custom-delete path (around line 213) used by sandboxDef recreation, both sandbox dir and log directory. - The synchronous (non-concurrent) branch of `RemoveSandbox`. - The synchronous branch of `RemoveCustomSandbox`. The concurrent-mode branches (which enqueue an `ExecCommand` into the parallel execution list) are unchanged — those run later and their race window is already wider by virtue of scheduling.
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Organization UI Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (1)
📝 WalkthroughWalkthroughThe PR introduces a new ChangesDirectory removal with exponential backoff
Estimated code review effort🎯 2 (Simple) | ⏱️ ~8 minutes Possibly related issues
Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Code Review
This pull request introduces a removeDirWithRetry function to handle race conditions during directory deletion by implementing an exponential backoff retry mechanism. This helper replaces direct calls to the external rm command in checkDirectory, RemoveSandbox, and RemoveCustomSandbox. Feedback suggests using the more idiomatic os.RemoveAll function to avoid process overhead and points out a calculation error in the docstring regarding the total maximum wait time.
| func removeDirWithRetry(target string) error { | ||
| backoff := 200 * time.Millisecond | ||
| var lastErr error | ||
| for attempt := 0; attempt < 5; attempt++ { | ||
| if attempt > 0 { | ||
| time.Sleep(backoff) | ||
| backoff *= 2 | ||
| } | ||
| if _, err := common.RunCmdWithArgs("rm", []string{"-rf", target}); err == nil { | ||
| return nil | ||
| } else { | ||
| lastErr = err | ||
| } | ||
| } | ||
| return lastErr | ||
| } |
There was a problem hiding this comment.
Using os.RemoveAll is more idiomatic and efficient in Go than spawning an external rm process, as it avoids the overhead of process creation and is more portable. Additionally, the loop logic can be simplified by removing the redundant else block. Note that the total sleep time with 5 attempts is actually 3 seconds (200+400+800+1600ms), which differs from the 6.2s mentioned in the docstring.
func removeDirWithRetry(target string) error {
backoff := 200 * time.Millisecond
var err error
for attempt := 0; attempt < 5; attempt++ {
if attempt > 0 {
time.Sleep(backoff)
backoff *= 2
}
if err = os.RemoveAll(target); err == nil {
return nil
}
}
return err
}
Summary
Closes #121.
RemoveSandbox,RemoveCustomSandbox, and an earlier custom-delete path insandbox/sandbox.goall do:send_killscript (thedestroymode ofsend_killdoeskill -9 $MYPID).rm -rfthe sandbox directory.kill -9doesn't synchronously wait for the killed process to release its open file descriptors. During that brief window, the dying mysqld's last buffered I/O can land insidemaster/data/afterrm'sreaddir()walk has already passed — making the finalrmdir(2)return ENOTEMPTY:This is an intermittent flake but persistent enough to bite:
25839621095failed onMariaDB (10.11.9).Fix
Wrap the
rm -rfin a small exponential-backoff retry loop (200ms + 400ms + 800ms + 1.6s + 3.2s ≈ 6.2s max wait). The race resolves in milliseconds in practice; retries handle it without changing the stop-script contract or templates across topologies.A more thorough fix would be for
send_killto wait for the killed PIDs to disappear from/procbefore returning, but that's a larger change touching templates for single/replication/multiple/tidb/etc. The localized retry inRemoveSandboxsolves the user-visible symptom with one focused change.Files
sandbox/sandbox.go— adds aremoveDirWithRetry(target string) errorhelper, then routes all four synchronousrm -rfcall-sites through it:RemoveSandbox.RemoveCustomSandbox.The concurrent-mode branches that enqueue an
ExecCommandinto the parallel execution list are unchanged.Test plan
go build ./...clean.go test ./sandbox/...— the three failures (TestUseTemplate_MySQL,TestUseTemplate_MariaDB,TestReplicationStopAndUse_DelegateToSingleTemplates) are pre-existing on master HEAD without my change. They require real MySQL binaries in$SANDBOX_BINARYto pass and are auto-skipped bytest/go-unit-tests.shin CI. Nothing related to the rm/delete code I touched.MariaDB (10.11.9)is the most reliable repro of the race the new retry is supposed to handle.Note
The retry is bounded (max ~6s) and the per-attempt delay is short, so this doesn't measurably slow down healthy deletes — the rm normally succeeds on the first attempt and we never hit a sleep.
Summary by CodeRabbit