fix(sandbox): retry rm -rf during delete to handle kill -9 file-handle race#122

Open
renecannao wants to merge 1 commit into
masterfrom
fix/delete-sandbox-race-on-rm


@renecannao renecannao commented May 14, 2026

Summary

Closes #121.

RemoveSandbox, RemoveCustomSandbox, and an earlier custom-delete path in sandbox/sandbox.go all do:

  1. Run the sandbox's stop / send_kill script (the destroy mode of send_kill does kill -9 $MYPID).
  2. Immediately rm -rf the sandbox directory.

kill -9 doesn't synchronously wait for the killed process to release its open file descriptors. During that brief window, the dying mysqld's last buffered I/O can land inside master/data/ after rm's readdir() walk has already passed — making the final rmdir(2) return ENOTEMPTY:

rm: cannot remove '<sandbox>/master/data': Directory not empty
error while deleting sandbox <sandbox>

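This is an intermittent flake, but persistent enough to bite. Recently observed on:

  • master Integration Tests run 25839621095 (MariaDB (10.11.9)).
  • PR #119, which hit the same flake on 4 consecutive runs before clearing on the 5th retry.
  • PR #114, which hit it during initial review.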

Fix

Wrap the rm -rf in a small exponential-backoff retry loop (200ms + 400ms + 800ms + 1.6s + 3.2s ≈ 6.2s max wait). The race resolves in milliseconds in practice; retries handle it without changing the stop-script contract or templates across topologies.

A more thorough fix would be for send_kill to wait for the killed PIDs to disappear from /proc before returning, but that's a larger change touching templates for single/replication/multiple/tidb/etc. The localized retry in RemoveSandbox solves the user-visible symptom with one focused change.

Files

sandbox/sandbox.go — adds a removeDirWithRetry(target string) error helper, then routes all four synchronous rm -rf call-sites through it:

  • Early custom-delete path (around line 213) — both sandbox dir and log directory.
  • The synchronous branch of RemoveSandbox.
  • The synchronous branch of RemoveCustomSandbox.

The concurrent-mode branches that enqueue an ExecCommand into the parallel execution list are unchanged.

Test plan

  • go build ./... clean.
  • go test ./sandbox/... — the three failures (TestUseTemplate_MySQL, TestUseTemplate_MariaDB, TestReplicationStopAndUse_DelegateToSingleTemplates) are pre-existing on master HEAD without my change. They require real MySQL binaries in $SANDBOX_BINARY to pass and are auto-skipped by test/go-unit-tests.sh in CI. None of them is related to the rm/delete code I touched.
  • CI should run the full Integration Tests matrix; in particular MariaDB (10.11.9) is the most reliable repro of the race the new retry is supposed to handle.

Note

The retry is bounded (max ~6s) and the per-attempt delay is short, so this doesn't measurably slow down healthy deletes — the rm normally succeeds on the first attempt and we never hit a sleep.

Summary by CodeRabbit

  • Bug Fixes
    • Improved reliability of sandbox directory cleanup by adding retry logic with exponential backoff to handle transient deletion failures.


…e race

Closes #121.

`RemoveSandbox` (and `RemoveCustomSandbox`, plus an earlier custom-delete
path) runs the sandbox's stop / send_kill script and then immediately
`rm -rf`s the sandbox dir. The `destroy` mode of `send_kill` does
`kill -9 $MYPID`, which doesn't synchronously wait for the killed
process's open file descriptors to be released. During that brief
window, the dying mysqld's last buffered I/O can land inside
`master/data/` after `rm`'s readdir() walk has already passed —
making the final `rmdir(2)` return ENOTEMPTY:

    rm: cannot remove '<sandbox>/master/data': Directory not empty
    error while deleting sandbox <sandbox>

Recently observed on:
- master Integration Tests run 25839621095 (`MariaDB (10.11.9)`)
- PR #119 hit the same flake on 4 consecutive runs before clearing on
  the 5th retry
- PR #114 hit it during initial review

Fix: wrap the rm in a small exponential-backoff retry loop
(200ms + 400ms + 800ms + 1.6s + 3.2s ≈ 6.2s max wait). The race
resolves in milliseconds in practice; retries handle it without
changing the stop-script contract or template scripts across topologies.

A more thorough fix would be to have `send_kill` wait for the killed
PIDs to disappear from /proc before returning, but that's a larger
change touching template scripts across single, replication, multiple,
tidb, etc. The localized retry in `RemoveSandbox` solves the
user-visible symptom with one focused change.

Applies to all four `rm -rf` call-sites in sandbox/sandbox.go:
- The early custom-delete path (around line 213) used by sandboxDef
  recreation, both sandbox dir and log directory.
- The synchronous (non-concurrent) branch of `RemoveSandbox`.
- The synchronous branch of `RemoveCustomSandbox`.

The concurrent-mode branches (which enqueue an `ExecCommand` into the
parallel execution list) are unchanged — those run later and their
race window is already wider by virtue of scheduling.
@coderabbitai

coderabbitai Bot commented May 14, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 4f4f390a-2f99-459e-a968-3304e8132a08

📥 Commits

Reviewing files that changed from the base of the PR and between a7d55f9 and e8e39f6.

📒 Files selected for processing (1)
  • sandbox/sandbox.go

📝 Walkthrough

Walkthrough

The PR introduces a new removeDirWithRetry helper that replaces direct rm -rf calls in three locations—checkDirectory (force overwrite), RemoveSandbox (deprecated sequential cleanup), and RemoveCustomSandbox (sequential cleanup)—to retry failed directory removal with exponential backoff and handle transient ENOTEMPTY race conditions.

Changes

Directory removal with exponential backoff

| Layer / File(s) | Summary |
|---|---|
| removeDirWithRetry helper function (sandbox/sandbox.go) | New helper executes rm -rf with exponential backoff sleep and up to 5 retries, returning the last error if all attempts fail. |
| Integration at deletion call sites (sandbox/sandbox.go) | checkDirectory (force overwrite), RemoveSandbox (non-concurrent path), and RemoveCustomSandbox (non-concurrent path) now call removeDirWithRetry instead of direct rm -rf, preserving error wrapping on failure. |

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~8 minutes

Poem

🐰 With retry and backoff, our dirs now persist,
Racing conditions—we're hardened 'gainst this!
Five attempts and exponential grace,
ENOTEMPTY won't win this race. ✨

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 40.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title accurately describes the main change: adding retry logic to rm -rf operations in sandbox deletion to handle a race condition caused by kill -9 not synchronously releasing file descriptors. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |


@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request introduces a removeDirWithRetry function to handle race conditions during directory deletion by implementing an exponential backoff retry mechanism. This helper replaces direct calls to the external rm command in checkDirectory, RemoveSandbox, and RemoveCustomSandbox. Feedback suggests using the more idiomatic os.RemoveAll function to avoid process overhead and points out a calculation error in the docstring regarding the total maximum wait time.

Comment thread sandbox/sandbox.go
Comment on lines +1191 to +1206
func removeDirWithRetry(target string) error {
	backoff := 200 * time.Millisecond
	var lastErr error
	for attempt := 0; attempt < 5; attempt++ {
		if attempt > 0 {
			time.Sleep(backoff)
			backoff *= 2
		}
		if _, err := common.RunCmdWithArgs("rm", []string{"-rf", target}); err == nil {
			return nil
		} else {
			lastErr = err
		}
	}
	return lastErr
}
medium

Using os.RemoveAll is more idiomatic and efficient in Go than spawning an external rm process, as it avoids the overhead of process creation and is more portable. Additionally, the loop logic can be simplified by removing the redundant else block. Note that the total sleep time with 5 attempts is actually 3 seconds (200+400+800+1600ms), which differs from the 6.2s mentioned in the docstring.

func removeDirWithRetry(target string) error {
	backoff := 200 * time.Millisecond
	var err error
	for attempt := 0; attempt < 5; attempt++ {
		if attempt > 0 {
			time.Sleep(backoff)
			backoff *= 2
		}
		if err = os.RemoveAll(target); err == nil {
			return nil
		}
	}
	return err
}

Development

Successfully merging this pull request may close these issues.

dbdeployer delete intermittently fails with 'Directory not empty' after kill -9

1 participant