dbdeployer delete intermittently fails with 'Directory not empty' after kill -9 #121

@renecannao

Symptom

dbdeployer delete all --skip-confirm (and the underlying RemoveSandbox flow) intermittently fails on CI with:

executing "send_kill" on slave 1
Terminating the server immediately --- kill -9 3446
...
cmd:    rm
err:    exit status 1
stderr: rm: cannot remove '/home/runner/sandboxes/rsandbox_10_11_9/master/data': Directory not empty
error while deleting sandbox error while deleting sandbox /home/runner/sandboxes/rsandbox_10_11_9

The replication / sandbox test that runs immediately before this succeeds ("OK: MariaDB replication works") — the failure is purely in the cleanup step.

Reproduction

This is intermittent. Recent occurrences:

So at minimum it's a real intermittent flake on master, and at worst it can hit several reruns in a row.

Root cause

sandbox/sandbox.go::RemoveSandbox (and the analogous RemoveCustomSandbox):

  1. Runs the sandbox's stop / send_kill_all script. The destroy mode of send_kill (sandbox/templates/single/send_kill.gotxt) does kill -9 $MYPID to terminate mysqld and then unlinks the PID and socket files. It does not wait for the killed mysqld to actually finish exiting.
  2. Immediately afterward, RemoveSandbox runs rm -rf <fullPath> (sandbox/sandbox.go:1276).

kill -9 returns as soon as the signal has been sent, not when the process has exited: the kernel still needs a brief window to tear down the process and release its open file descriptors. During that window, mysqld's last buffered I/O can land inside master/data/ after rm's readdir() walk has already passed that directory, so the final rmdir(2) returns ENOTEMPTY ("Directory not empty"). With concurrent rm -rf runs across multiple sandbox directories (RUN_CONCURRENTLY=1 mode), the race is more likely.
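The sequence described above can be re-enacted deterministically. The following is only an illustrative sketch, not dbdeployer code: it assumes a Unix-like system, the file names are made up, and the "late write" is done by the same program instead of a dying mysqld. It lists a directory, lets a new file appear, deletes only what was listed, and then checks that the final rmdir reports ENOTEMPTY.

```go
package main

import (
	"errors"
	"fmt"
	"os"
	"path/filepath"
	"syscall"
)

// reenactRace mimics the ordering that makes "rm -rf" fail: the directory is
// listed first, a new file appears afterwards, and the final rmdir then finds
// the directory non-empty. It reports whether rmdir failed with ENOTEMPTY.
// All file names here are hypothetical.
func reenactRace() (bool, error) {
	dir, err := os.MkdirTemp("", "datadir-demo")
	if err != nil {
		return false, err
	}
	defer os.RemoveAll(dir) // clean up whatever is left behind

	// A file that exists before the directory walk starts.
	if err := os.WriteFile(filepath.Join(dir, "existing.dat"), []byte("x"), 0o600); err != nil {
		return false, err
	}

	// Step 1: snapshot the directory contents, like rm's readdir() walk.
	entries, err := os.ReadDir(dir)
	if err != nil {
		return false, err
	}

	// Step 2: a late write lands in the directory after the snapshot.
	if err := os.WriteFile(filepath.Join(dir, "late.dat"), []byte("y"), 0o600); err != nil {
		return false, err
	}

	// Step 3: delete only the entries seen in the snapshot.
	for _, e := range entries {
		if err := os.Remove(filepath.Join(dir, e.Name())); err != nil {
			return false, err
		}
	}

	// Step 4: the final rmdir fails because late.dat is still there.
	rmErr := syscall.Rmdir(dir)
	return errors.Is(rmErr, syscall.ENOTEMPTY), nil
}

func main() {
	notEmpty, err := reenactRace()
	if err != nil {
		fmt.Println("setup failed:", err)
		return
	}
	fmt.Println("rmdir failed with ENOTEMPTY:", notEmpty)
}
```

In the real failure the late file would come from the terminating server process, so the window is much narrower and the outcome depends on timing.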

Suggested fix

Make the rm -rf in RemoveSandbox/RemoveCustomSandbox retry on failure with a short exponential backoff (e.g. 5 attempts with delays of 200ms/400ms/800ms/1.6s/3.2s, about 6s in total). The race resolves in milliseconds in practice, so retrying handles it without changing the stop-script contract.
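A minimal sketch of what such a retry could look like, assuming the removal step can be expressed as a func(string) error. The helper names removeWithRetry and simulateFlakyRemoval are hypothetical and not part of dbdeployer, and os.RemoveAll stands in for the external rm -rf call. This version sleeps only between attempts, so five attempts use the first four delays of the schedule.

```go
package main

import (
	"errors"
	"fmt"
	"os"
	"time"
)

// removeWithRetry calls remove(path) up to attempts times. After every failed
// attempt except the last one it sleeps, doubling the delay each time.
// Hypothetical helper for illustration; not dbdeployer's actual code.
func removeWithRetry(path string, attempts int, delay time.Duration, remove func(string) error) error {
	if attempts < 1 {
		attempts = 1
	}
	var err error
	for i := 1; i <= attempts; i++ {
		if err = remove(path); err == nil {
			return nil
		}
		if i < attempts {
			time.Sleep(delay)
			delay *= 2
		}
	}
	return fmt.Errorf("removing %s: gave up after %d attempts: %w", path, attempts, err)
}

// simulateFlakyRemoval fakes a removal that fails twice with a transient
// error and then succeeds. It returns how often remove was called.
func simulateFlakyRemoval() (int, error) {
	calls := 0
	flaky := func(string) error {
		calls++
		if calls <= 2 {
			return errors.New("directory not empty")
		}
		return nil
	}
	err := removeWithRetry("/hypothetical/sandbox", 5, time.Millisecond, flaky)
	return calls, err
}

func main() {
	calls, err := simulateFlakyRemoval()
	fmt.Println("simulated removal calls:", calls, "error:", err)

	// Real filesystem usage with the schedule suggested in this issue.
	dir, err := os.MkdirTemp("", "sandbox-demo")
	if err != nil {
		fmt.Println("could not create temp dir:", err)
		return
	}
	err = removeWithRetry(dir, 5, 200*time.Millisecond, os.RemoveAll)
	fmt.Println("real removal error:", err)
}
```

Passing the removal step in as a function keeps the backoff logic testable without a real race, and the last error is wrapped so the original "Directory not empty" message is still visible if every attempt fails.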

A more thorough fix would be for send_kill to wait for the killed PIDs to disappear from /proc before returning, but that's a larger change touching template scripts across all topologies. The retry in RemoveSandbox solves the user-visible symptom with one localized change.
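For comparison, the wait-for-exit idea can be sketched in Go as well, although the actual change would live in the shell templates. This is an assumption-laden illustration: pidInProc, waitUntilGone and simulateExit are hypothetical names, the /proc check is Linux-specific, and a killed process that has not yet been reaped by its parent still shows up under /proc as a zombie.

```go
package main

import (
	"fmt"
	"os"
	"time"
)

// pidInProc reports whether /proc/<pid> exists. Linux-specific; on systems
// without /proc it always returns false.
func pidInProc(pid int) bool {
	_, err := os.Stat(fmt.Sprintf("/proc/%d", pid))
	return err == nil
}

// waitUntilGone polls alive() every interval until it returns false or the
// timeout elapses. It returns true if the process went away in time.
func waitUntilGone(alive func() bool, timeout, interval time.Duration) bool {
	deadline := time.Now().Add(timeout)
	for alive() {
		if time.Now().After(deadline) {
			return false
		}
		time.Sleep(interval)
	}
	return true
}

// simulateExit fakes a process that is still alive for the first three polls
// and gone on the fourth. It returns the number of polls and the result.
func simulateExit() (int, bool) {
	polls := 0
	alive := func() bool {
		polls++
		return polls <= 3
	}
	gone := waitUntilGone(alive, time.Second, time.Millisecond)
	return polls, gone
}

func main() {
	polls, gone := simulateExit()
	fmt.Println("simulated process gone:", gone, "after polls:", polls)
}
```

In real use the predicate would be something like func() bool { return pidInProc(pid) }, called after the kill and before the directory is removed.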
