Server processes become orphaned — CLI loses track of running servers #89

@sdairs

Summary

server start can spawn ClickHouse processes that the CLI subsequently loses track of. server list shows all servers as stopped even though the processes are alive and listening on ports.

Observed behavior

$ clickhousectl local server start
Server 'default' started in background (PID: 77042)

$ clickhousectl local server start
Server 'default' started in background (PID: 77085)

$ clickhousectl local server list
# All 16 servers shown as "stopped", 0 running

Meanwhile ps shows 15 live ClickHouse processes listening on various ports, and only a single .json metadata file exists in .clickhouse/servers/.

Root cause (initial investigation)

Server tracking uses JSON metadata files (e.g. default.json) in .clickhouse/servers/. The problem is a combination of:

  1. load_running_info() deletes metadata when a PID check fails: if is_process_alive(pid) returns false for any reason (the process exited, or the check failed transiently), the .json file is removed immediately. If the process is in fact still running, it becomes orphaned.

  2. resolve_name(None) reuses "default": it calls is_server_running("default"), which goes through load_running_info. If the previous default's PID is reported dead (or its metadata was already cleaned up), resolve_name returns "default" again. The subsequent save_server_info call then overwrites the old metadata, orphaning whatever process was previously tracked under that name.

  3. Random-named servers also lose metadata — servers like blue-bird, dark-bolt, etc. had their .json files cleaned up during list_all_servers() calls, even though their processes were still alive. This suggests the PID liveness check (kill(pid, 0)) can fail for running processes, or the metadata was never saved correctly.
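
One possible direction for points 1 to 3, sketched in Rust on the assumption that the CLI is written in Rust and that is_process_alive currently collapses every kill(pid, 0) failure into "false" (the issue does not show its body). POSIX distinguishes ESRCH (no such process) from EPERM (process exists, but the caller may not signal it), so the probe result can be three-valued and metadata deleted only on a confirmed ESRCH. The names Liveness, classify_probe, metadata_action and may_reuse_name are hypothetical and do not come from the codebase.

```rust
/// Three-valued result of probing a PID with signal 0.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Liveness {
    Alive,
    Dead,
    Unknown,
}

// errno values relevant to kill(2); these numbers are the same on Linux and macOS.
const EPERM: i32 = 1; // process exists, caller lacks permission to signal it
const ESRCH: i32 = 3; // no such process

/// Interpret the return value and errno of a `kill(pid, 0)` call.
/// The caller is assumed to obtain them from e.g. `libc::kill` and
/// `std::io::Error::last_os_error().raw_os_error()`.
fn classify_probe(ret: i32, errno: Option<i32>) -> Liveness {
    if ret == 0 {
        return Liveness::Alive;
    }
    match errno {
        Some(EPERM) => Liveness::Alive,
        Some(ESRCH) => Liveness::Dead,
        // Anything else is treated as "could not tell", not as "dead".
        _ => Liveness::Unknown,
    }
}

#[derive(Debug, PartialEq, Eq)]
enum MetadataAction {
    Keep,
    Delete,
}

/// Delete the .json file only when the process is confirmed gone.
fn metadata_action(state: Liveness) -> MetadataAction {
    match state {
        Liveness::Dead => MetadataAction::Delete,
        Liveness::Alive | Liveness::Unknown => MetadataAction::Keep,
    }
}

/// Hypothetical guard for reusing a name such as "default":
/// `existing` is None when no metadata file exists for that name.
fn may_reuse_name(existing: Option<Liveness>) -> bool {
    match existing {
        None | Some(Liveness::Dead) => true,
        Some(Liveness::Alive) | Some(Liveness::Unknown) => false,
    }
}
```

With this shape, a transient or ambiguous check failure leaves the metadata in place, and starting a second "default" would be refused (or given a fresh name) instead of overwriting the first one's file. It does not address PID reuse, which would need an extra check such as comparing the process start time.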

Evidence

  • 15 ClickHouse processes running (confirmed via ps aux and lsof — all listening on ports)
  • Only 1 .json metadata file in .clickhouse/servers/
  • server list reports 16 servers, 0 running

Needs investigation

  • Reliable reproduction steps (may depend on timing, port exhaustion, or macOS-specific PID behavior)
  • Whether is_process_alive (using kill(pid, 0)) can return false for a running process in some edge case
  • Whether the health check window (300ms in check_spawn_health) is too short, causing the metadata to be saved for a process that then dies, leaving a stale directory but no running process
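
To test the last question, the single 300ms check could be replaced by polling over a configurable window. A minimal sketch, assuming check_spawn_health still holds a std::process::Child handle for the spawned server (if the server daemonizes itself, that handle would not track the final process). wait_until_stable is a hypothetical name, and the window and interval values are left to the caller.

```rust
use std::io;
use std::process::{Child, ExitStatus};
use std::thread;
use std::time::{Duration, Instant};

/// Poll the child until it exits or `window` elapses.
/// Returns Ok(Some(status)) if the child exited within the window,
/// Ok(None) if it was still running when the window closed.
fn wait_until_stable(
    child: &mut Child,
    window: Duration,
    interval: Duration,
) -> io::Result<Option<ExitStatus>> {
    let deadline = Instant::now() + window;
    loop {
        // try_wait does not block; it reports an exit status only if the child has exited.
        if let Some(status) = child.try_wait()? {
            return Ok(Some(status));
        }
        if Instant::now() >= deadline {
            return Ok(None);
        }
        thread::sleep(interval);
    }
}
```

Metadata would then be saved only when this returns Ok(None). Running it with progressively longer windows would also show whether 300ms is in fact too short on the affected machine.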
