Server processes become orphaned — CLI loses track of running servers #89
Summary
server start can spawn ClickHouse processes that the CLI subsequently loses track of. server list shows all servers as stopped even though the processes are alive and listening on ports.
Observed behavior
$ clickhousectl local server start
Server 'default' started in background (PID: 77042)
$ clickhousectl local server start
Server 'default' started in background (PID: 77085)
$ clickhousectl local server list
# All 16 servers shown as "stopped", 0 running
Meanwhile ps shows 15 live ClickHouse processes listening on various ports, and only a single .json metadata file exists in .clickhouse/servers/.
Root cause (initial investigation)
Server tracking uses JSON metadata files (e.g. default.json) in .clickhouse/servers/. The problem is a combination of:
- load_running_info() deletes metadata when a PID check fails: if is_process_alive(pid) returns false for any reason (process exited, or a transient check failure), the .json file is removed immediately and the process becomes orphaned.
- resolve_name(None) reuses "default": it calls is_server_running("default"), which goes through load_running_info. If the previous default's PID is dead (or was cleaned up), it returns "default" again. The new save_server_info overwrites the old metadata, orphaning whatever process was previously tracked.
- Random-named servers also lose metadata: servers like blue-bird, dark-bolt, etc. had their .json files cleaned up during list_all_servers() calls, even though their processes were still alive. This suggests the PID liveness check (kill(pid, 0)) can fail for running processes, or the metadata was never saved correctly.
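To make the first two failure modes concrete, here is a minimal Python sketch (not the CLI's actual code) of a more conservative approach. It assumes a POSIX system and a metadata file with a "pid" field, which is a guess at the schema. The function names mirror the ones above for readability only. The idea is that the file is deleted only when the signal-0 probe fails with ESRCH ("no such process"), and that a save refuses to overwrite metadata that still points at a live process. EPERM is treated as "alive", because it means the process exists but belongs to another user.

```python
import json
import os
from pathlib import Path
from typing import Optional


def is_process_alive(pid: int) -> bool:
    """Probe a PID with signal 0 (nothing is delivered, only error checking)."""
    if pid <= 0:
        return False  # 0 and negative values address process groups, not one server
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False  # ESRCH: no such process
    except PermissionError:
        return True  # EPERM: the process exists but we may not signal it
    return True


def load_running_info(path: Path) -> Optional[dict]:
    """Return metadata for a live server; delete the file only if the PID is gone."""
    try:
        info = json.loads(path.read_text())
        pid = int(info["pid"])  # "pid" field name is an assumption
    except (OSError, ValueError, KeyError, TypeError):
        return None  # missing or malformed: leave the file in place for inspection
    if is_process_alive(pid):
        return info
    path.unlink(missing_ok=True)
    return None


def save_server_info(path: Path, info: dict) -> None:
    """Refuse to overwrite metadata that still tracks a different live process."""
    existing = load_running_info(path)
    if existing is not None and existing["pid"] != info["pid"]:
        raise RuntimeError(f"{path.name} still tracks live PID {existing['pid']}")
    path.write_text(json.dumps(info))
```

With a guard like this, a second start under the name "default" would fail loudly instead of silently orphaning the first process. Whether the real check treats errors other than ESRCH as "dead" still needs to be confirmed in the code.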
Evidence
- 15 ClickHouse processes running (confirmed via ps aux and lsof; all listening on ports)
- Only 1 .json metadata file in .clickhouse/servers/
- server list reports 16 servers, 0 running
Needs investigation
- Reliable reproduction steps (may depend on timing, port exhaustion, or macOS-specific PID behavior)
- Whether is_process_alive (using kill(pid, 0)) can return false for a running process in some edge case
- Whether the health check window (300ms in check_spawn_health) is too short, causing the metadata to be saved for a process that then dies, leaving a stale directory but no running process
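One way to test the health-check question is to make the window configurable and compare outcomes. The Python sketch below is illustrative only: it borrows the name check_spawn_health, but the signature, the polling interval, and the stand-in child process are assumptions, not the CLI's implementation.

```python
import subprocess
import sys
import time


def check_spawn_health(proc: subprocess.Popen, window_s: float = 0.3,
                       interval_s: float = 0.05) -> bool:
    """Return True if proc is still running after window_s seconds of polling."""
    deadline = time.monotonic() + window_s
    while time.monotonic() < deadline:
        if proc.poll() is not None:
            return False  # the process exited inside the window
        time.sleep(interval_s)
    return proc.poll() is None


def spawn_late_failure() -> subprocess.Popen:
    """Stand-in for a server that starts, then fails about 0.5 s later."""
    code = "import time, sys; time.sleep(0.5); sys.exit(1)"
    return subprocess.Popen([sys.executable, "-c", code])
```

A child from spawn_late_failure() passes a 0.3 s window but fails a window of a few seconds. If real servers can fail this late (for example on a port conflict), metadata would be saved for a process that is about to die.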