dbdeployer delete intermittently fails with 'Directory not empty' after kill -9

## Symptom

`dbdeployer delete all --skip-confirm` (and the underlying `RemoveSandbox` flow) intermittently fails on CI with:

```
executing "send_kill" on slave 1
Terminating the server immediately --- kill -9 3446
...
cmd:    rm
err:    exit status 1
stderr: rm: cannot remove '/home/runner/sandboxes/rsandbox_10_11_9/master/data': Directory not empty
error while deleting sandbox error while deleting sandbox /home/runner/sandboxes/rsandbox_10_11_9
```

The replication / sandbox test that runs immediately before this *succeeds* ("OK: MariaDB replication works"), so the failure is purely in the cleanup step.

## Reproduction

This is intermittent. Recent occurrences:

- Master CI run `25839621095` (Integration Tests, 2026-05-14T03:18) — `MariaDB (10.11.9)` failed at cleanup.
- PR #119 hit the same failure on 4 consecutive runs before clearing on the 5th retry.
- PR #114 hit it once during initial review (cleared on first retry).

So at minimum it's a real intermittent flake on master, and at worst it can hit several reruns in a row.

## Root cause

`sandbox/sandbox.go::RemoveSandbox` (and the analogous `RemoveCustomSandbox`):

1. Runs the sandbox's `stop` / `send_kill_all` script. The `destroy` mode of `send_kill` (`sandbox/templates/single/send_kill.gotxt`) does `kill -9 $MYPID` to terminate `mysqld` and then unlinks the PID and socket files. **It does not wait for the killed `mysqld` to actually finish exiting.**
2. Immediately afterward, `RemoveSandbox` runs `rm -rf <fullPath>` (sandbox/sandbox.go:1276).

After `kill -9`, the process does not vanish instantly: the kernel still needs a brief window to tear down its threads and release its open file descriptors. During that window, `mysqld`'s last in-flight I/O can still create an entry inside `master/data/` after `rm`'s `readdir()` walk has already passed that directory, so the final `rmdir(2)` returns `ENOTEMPTY` ("Directory not empty"). When several sandbox directories are removed concurrently (in `RUN_CONCURRENTLY=1` mode), the race is more likely.

## Suggested fix

Make the `rm -rf` in `RemoveSandbox`/`RemoveCustomSandbox` retry on failure with a short exponential backoff (e.g. up to 5 retries after the first attempt, waiting 200ms/400ms/800ms/1.6s/3.2s, about 6.2s in the worst case). The race resolves in milliseconds in practice; retrying handles it without changing the stop-script contract.

A more thorough fix would be for `send_kill` to wait for the killed PIDs to disappear from `/proc` before returning, but that's a larger change touching template scripts across all topologies. The retry in `RemoveSandbox` solves the user-visible symptom with one localized change.
