fix(sandbox): retry rm -rf during delete to handle kill -9 file-handle race by renecannao · Pull Request #122 · ProxySQL/dbdeployer

renecannao · 2026-05-14T08:59:09Z

Summary

Closes #121.

RemoveSandbox, RemoveCustomSandbox, and an earlier custom-delete path in sandbox/sandbox.go all do:

Run the sandbox's stop / send_kill script (the destroy mode of send_kill does kill -9 $MYPID).
Immediately rm -rf the sandbox directory.

kill -9 doesn't synchronously wait for the killed process to release its open file descriptors. During that brief window, the dying mysqld's last buffered I/O can land inside master/data/ after rm's readdir() walk has already passed — making the final rmdir(2) return ENOTEMPTY:

rm: cannot remove '<sandbox>/master/data': Directory not empty
error while deleting sandbox <sandbox>

This is an intermittent flake but persistent enough to bite:

master Integration Tests run 25839621095 failed on MariaDB (10.11.9).
PR fix(postgresql): avoid port collisions across single/replication/multiple deploys #119 hit the same flake on 4 consecutive reruns before clearing on the 5th.
PR fix: actionable instructions when sandbox-binary directory is missing #114 hit it during initial review (cleared on first retry).

Fix

Wrap the rm -rf in a small exponential-backoff retry loop (200ms + 400ms + 800ms + 1.6s + 3.2s ≈ 6.2s max wait). The race resolves in milliseconds in practice; retries handle it without changing the stop-script contract or templates across topologies.

A more thorough fix would be for send_kill to wait for the killed PIDs to disappear from /proc before returning, but that's a larger change touching templates for single/replication/multiple/tidb/etc. The localized retry in RemoveSandbox solves the user-visible symptom with one focused change.

Files

sandbox/sandbox.go — adds a removeDirWithRetry(target string) error helper, then routes all four synchronous rm -rf call-sites through it:

Early custom-delete path (around line 213) — both sandbox dir and log directory.
The synchronous branch of RemoveSandbox.
The synchronous branch of RemoveCustomSandbox.

The concurrent-mode branches that enqueue an ExecCommand into the parallel execution list are unchanged.

Test plan

go build ./... clean.
go test ./sandbox/... — the three failures (TestUseTemplate_MySQL, TestUseTemplate_MariaDB, TestReplicationStopAndUse_DelegateToSingleTemplates) are pre-existing on master HEAD without my change. They require real MySQL binaries in $SANDBOX_BINARY to pass and are auto-skipped by test/go-unit-tests.sh in CI. Nothing related to the rm/delete code I touched.
CI should run the full Integration Tests matrix; in particular MariaDB (10.11.9) is the most reliable repro of the race the new retry is supposed to handle.

Note

The retry is bounded (max ~6s) and the per-attempt delay is short, so this doesn't measurably slow down healthy deletes — the rm normally succeeds on the first attempt and we never hit a sleep.

Summary by CodeRabbit

Bug Fixes
- Improved reliability of sandbox directory cleanup by adding retry logic with exponential backoff to handle transient deletion failures.

…e race

Closes #121.

`RemoveSandbox` (and `RemoveCustomSandbox`, plus an earlier custom-delete path) runs the sandbox's stop / send_kill script and then immediately `rm -rf`s the sandbox dir. The `destroy` mode of `send_kill` does `kill -9 $MYPID`, which doesn't synchronously wait for the killed process's open file descriptors to be released. During that brief window, the dying mysqld's last buffered I/O can land inside `master/data/` after `rm`'s readdir() walk has already passed, making the final `rmdir(2)` return ENOTEMPTY:

rm: cannot remove '<sandbox>/master/data': Directory not empty
error while deleting sandbox <sandbox>

Recently observed on:
- master Integration Tests run 25839621095 (`MariaDB (10.11.9)`)
- PR #119 hit the same flake on 4 consecutive runs before clearing on the 5th retry
- PR #114 hit it during initial review

Fix: wrap the rm in a small exponential-backoff retry loop (200ms + 400ms + 800ms + 1.6s + 3.2s ≈ 6.2s max wait). The race resolves in milliseconds in practice; retries handle it without changing the stop-script contract or template scripts across topologies.

A more thorough fix would be to have `send_kill` wait for the killed PIDs to disappear from /proc before returning, but that's a larger change touching template scripts across single, replication, multiple, tidb, etc. The localized retry in `RemoveSandbox` solves the user-visible symptom with one focused change.

Applies to all four `rm -rf` call-sites in sandbox/sandbox.go:
- The early custom-delete path (around line 213) used by sandboxDef recreation, both sandbox dir and log directory.
- The synchronous (non-concurrent) branch of `RemoveSandbox`.
- The synchronous branch of `RemoveCustomSandbox`.

The concurrent-mode branches (which enqueue an `ExecCommand` into the parallel execution list) are unchanged; those run later and their race window is already wider by virtue of scheduling.

coderabbitai · 2026-05-14T08:59:30Z

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 4f4f390a-2f99-459e-a968-3304e8132a08

📥 Commits

Reviewing files that changed from the base of the PR and between a7d55f9 and e8e39f6.

📒 Files selected for processing (1)

sandbox/sandbox.go

📝 Walkthrough

Walkthrough

The PR introduces a new removeDirWithRetry helper that replaces direct rm -rf calls in three locations—checkDirectory (force overwrite), RemoveSandbox (deprecated sequential cleanup), and RemoveCustomSandbox (sequential cleanup)—to retry failed directory removal with exponential backoff and handle transient ENOTEMPTY race conditions.

Changes

Directory removal with exponential backoff

| Layer / File(s) | Summary |
|---|---|
| removeDirWithRetry helper function `sandbox/sandbox.go` | New helper executes `rm -rf` with exponential backoff sleep and up to 5 retries, returning the last error if all attempts fail. |
| Integration at deletion call sites `sandbox/sandbox.go` | `checkDirectory` (force overwrite), `RemoveSandbox` (non-concurrent path), and `RemoveCustomSandbox` (non-concurrent path) now call `removeDirWithRetry` instead of direct `rm -rf`, preserving error wrapping on failure. |

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~8 minutes

Possibly related issues

dbdeployer delete intermittently fails with 'Directory not empty' after kill -9 #121: Directly related; implements the same retry-on-failure rm -rf fix in sandbox/sandbox.go for RemoveSandbox and RemoveCustomSandbox to address ENOTEMPTY errors after kill -9.

Poem

🐰 With retry and backoff, our dirs now persist,
Racing conditions—we're hardened 'gainst this!
Five attempts and exponential grace,
ENOTEMPTY won't win this race. ✨

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 40.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title accurately describes the main change: adding retry logic to rm -rf operations in sandbox deletion to handle a race condition caused by kill -9 not synchronously releasing file descriptors. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |


gemini-code-assist

Code Review

This pull request introduces a removeDirWithRetry function to handle race conditions during directory deletion by implementing an exponential backoff retry mechanism. This helper replaces direct calls to the external rm command in checkDirectory, RemoveSandbox, and RemoveCustomSandbox. Feedback suggests using the more idiomatic os.RemoveAll function to avoid process overhead and points out a calculation error in the docstring regarding the total maximum wait time.

gemini-code-assist · 2026-05-14T09:00:42Z

+func removeDirWithRetry(target string) error {
+	backoff := 200 * time.Millisecond
+	var lastErr error
+	for attempt := 0; attempt < 5; attempt++ {
+		if attempt > 0 {
+			time.Sleep(backoff)
+			backoff *= 2
+		}
+		if _, err := common.RunCmdWithArgs("rm", []string{"-rf", target}); err == nil {
+			return nil
+		} else {
+			lastErr = err
+		}
+	}
+	return lastErr
+}


Using os.RemoveAll is more idiomatic and efficient in Go than spawning an external rm process, as it avoids the overhead of process creation and is more portable. Additionally, the loop logic can be simplified by removing the redundant else block. Note that the total sleep time with 5 attempts is actually 3 seconds (200+400+800+1600ms), which differs from the 6.2s mentioned in the docstring.

func removeDirWithRetry(target string) error {
	backoff := 200 * time.Millisecond
	var err error
	for attempt := 0; attempt < 5; attempt++ {
		if attempt > 0 {
			time.Sleep(backoff)
			backoff *= 2
		}
		if err = os.RemoveAll(target); err == nil {
			return nil
		}
	}
	return err
}

renecannao wants to merge 1 commit into master from fix/delete-sandbox-race-on-rm
