Skip to content

Newly-added replica is never restarted after remote_servers is published — Distributed/Dictionary/Refreshable-MV stay permanently broken (CLUSTER_DOESNT_EXIST) after scale-up #2013

Description

@dashashutosh80

Operator version

  • clickhouse-operator: 0.27.2 (also reproduced on a 0.26.x build; code paths are unchanged on master)
  • ClickHouse server: 25.8 (clickhouse/clickhouse-server:25.8, observed on 25.8.17.x / 25.8.24.x)
  • Kubernetes: 1.25 (reproduced on kind) and on production k8s
  • Install: operator-managed ClickHouseInstallation, single shard, ReplicatedMergeTree + Distributed + DICTIONARY + refreshable MATERIALIZED VIEW

Summary

When a cluster is scaled up (layout.replicasCount 1 → 2), the newly added replica can start ClickHouse before the operator-generated remote_servers (the cluster definition) is resolvable in that pod. If any object on that replica needs the cluster during its startup table load — e.g. a Distributed(<cluster>, …) table with pending data, a DICTIONARY that reads it, or a refreshable MV that dictGets it — the async startup load fails terminally with Code: 701 CLUSTER_DOESNT_EXIST (and dependents become CANCELED).

The operator then publishes the final remote_servers (now including the new host) and applies it via SYSTEM RELOAD CONFIG (no restart). After that, system.clusters lists the cluster correctly — but the cached terminal async-load failures are never re-run. ClickHouse does not retry terminal FAILED/CANCELED loader jobs for the lifetime of the process. As a result, the Distributed/dictionary/refreshable-MV chain on the new replica stays broken forever, and the affected tables never appear on it.

The operator never restarts the newly-added host after the cluster definition is published, so there is no automatic recovery. The only workaround today is a manual pod restart.

Expected behavior

After scale-up, the newly added replica should end up healthy automatically: once the final remote_servers (with all hosts) is in place, the new host should be in a state where cluster-dependent objects load successfully — without a human manually restarting the pod.

Actual behavior

The newly added replica comes up Ready but degraded:

  • SELECT … FROM <distributed_table>Code: 701 CLUSTER_DOESNT_EXIST
  • the dependent DICTIONARY is stuck (load CANCELED); dictGet(...)Code: 722 ASYNC_LOAD_WAIT_FAILED701
  • the refreshable MV keeps failing every refresh; its target tables never populate on the new replica
  • system.clusters does list the cluster (after the operator's reload), which makes the failure look paradoxical
  • system.asynchronous_loader shows the relevant jobs as terminal FAILED/CANCELED and they are never re-run

A manual kubectl delete pod <new-replica> (or any process restart) fixes it permanently, because the fresh process re-runs the loader with the cluster present.

Steps to reproduce

  1. Install a CHI with a single shard, 1 replica, clickhouse/clickhouse-server:25.8, backed by Keeper.

  2. Create the dependency chain and ingest some data so the Distributed table has pending/local data:

    CREATE TABLE default.rmt ON CLUSTER 'c1'
    ( ts DateTime, metric String, val Float64 )
    ENGINE = ReplicatedMergeTree('/clickhouse/tables/{shard}/rmt', '{replica}')
    ORDER BY (metric, ts);
    
    INSERT INTO default.rmt
    SELECT now() - number, concat('m', toString(number % 5)), number FROM numbers(2000);
    
    CREATE TABLE default.dist ON CLUSTER 'c1' AS default.rmt
    ENGINE = Distributed('c1', default, rmt, rand());
    
    CREATE DICTIONARY default.max_ts_dict ON CLUSTER 'c1'
    ( metric String, max_ts DateTime )
    PRIMARY KEY metric
    SOURCE(CLICKHOUSE(QUERY 'SELECT metric, max(ts) AS max_ts FROM default.dist GROUP BY metric'))
    LAYOUT(COMPLEX_KEY_HASHED())
    LIFETIME(MIN 30 MAX 60);
    
    CREATE TABLE default.agg ON CLUSTER 'c1'
    ( metric String, max_ts DateTime, cnt UInt64 )
    ENGINE = MergeTree ORDER BY metric;
    
    CREATE MATERIALIZED VIEW default.agg_mv ON CLUSTER 'c1'
    REFRESH EVERY 10 SECOND TO default.agg AS
    SELECT metric, dictGet('default.max_ts_dict','max_ts', tuple(metric)) AS max_ts, count() AS cnt
    FROM default.dist GROUP BY metric;
  3. Scale the cluster: set spec.configuration.clusters[0].layout.replicasCount: 2.

  4. After the operator reports Completed, query the new replica:

    SELECT count() FROM default.dist;                              -- Code: 701 CLUSTER_DOESNT_EXIST
    SELECT status, last_exception FROM system.dictionaries
      WHERE name LIKE '%max_ts_dict%';                             -- not LOADED
    SELECT view, status, exception FROM system.view_refreshes
      WHERE view = 'agg_mv';                                       -- exception, never succeeds
    SELECT status, count() FROM system.asynchronous_loader GROUP BY status; -- FAILED / CANCELED present
    SELECT name, value FROM system.errors WHERE name = 'CLUSTER_DOESNT_EXIST';
  5. system.clusters on the new replica already lists c1 (the operator applied remote_servers via reload), yet the queries above keep failing — demonstrating the failure is sticky across SYSTEM RELOAD CONFIG.

  6. kubectl delete pod <new-replica-pod> → after it restarts, everything is healthy. This is the only recovery today.

Representative server log (new replica)

<Error> ... CLUSTER_DOESNT_EXIST ... Requested cluster 'c1' not found
   ... StorageDistributed::initializeDirectoryQueuesForDisk
   ... Context::getCluster
<Error> AsyncLoader: load job 'load table default.dist' failed: CLUSTER_DOESNT_EXIST
<Error> AsyncLoader: load job 'startup table default.dist' canceled
<Error> AsyncLoader: load job 'load dictionary default.max_ts_dict' canceled
... RefreshTask: ... Code: 722. ASYNC_LOAD_WAIT_FAILED ... -> CLUSTER_DOESNT_EXIST

Root cause (operator side)

Two operator behaviors combine to leave the new replica permanently broken:

  1. The preliminary common ConfigMap is generated without the newly-added hosts.
    getRemoteServersGeneratorOptions() (pkg/controller/chi/worker-status-helpers.go) always excludes hosts with status ObjectStatusRequested. The new replica's pod is therefore created and may boot before its remote_servers is fully resolvable (and ConfigMap propagation into a brand-new pod is asynchronous on top of that). This opens the boot-before-cluster window for the new host's startup table loads.

  2. The final remote_servers is applied without restarting the new host.
    reconcileCRAuxObjectsFinal() (pkg/controller/chi/worker-reconciler-chi.go) writes the complete remote_servers and calls includeAllHostsIntoCluster(), which propagates via SYSTEM RELOAD CONFIGno restart. ClickHouse, by design, does not re-run terminal FAILED/CANCELED async-load jobs after a config reload, so the cluster appears in system.clusters but the cached startup-load failures remain.

There is no code path that restarts the newly-added host after the cluster definition lands:

  • shouldForceRestartHost() (pkg/controller/chi/worker.go) early-returns false for a host whose status is ObjectStatusRequested or that has no ancestor — i.e. exactly the newly-added replica — so even restart: RollingUpdate and image/config-change checks are skipped for it.
  • IsConfigurationChangeRequiresReboot() (pkg/model/chop_config.go) only diffs zookeeper/profiles/quotas/settings/files. The cluster's remote_servers is generated from clusters[].layout into chop-generated-remote_servers.xml, not from .spec.configuration.settings, so configurationRestartPolicy: settings/remote_servers/*: "yes" never matches a cluster-membership change.

Settings that do not mitigate it (verified on 0.27.2)

Setting Why it does not help
configurationRestartPolicy: settings/remote_servers/*: "yes" remote_servers comes from clusters[].layout, not .spec.configuration.settings; IsConfigurationChangeRequiresReboot has no remote_servers/clusters branch, so it never matches. Even if it did, shouldForceRestartHost skips the new host first.
CHI spec.restart: "RollingUpdate" shouldForceRestartHost returns true for RollingUpdate only after the ObjectStatusRequested/!HasAncestor early-returns, so the new replica is skipped.
reconcile.host.wait.include: true Only polls IsHostInCluster; never restarts. A host can be "in cluster" while its loader jobs stay terminally FAILED.
reconcile.host.wait.replicas.{new,all}: true Waits for replication catch-up; doesn't restart; the broken objects (Distributed/dictionary/MV) are non-replicated.
reconcile.host.wait.probes.{startup,readiness}: true The pod becomes Ready (server answers SELECT 1) despite the failed async loads, so the wait passes and nothing restarts.
reconcile.statefulSet.{create,update}.onFailure Only fire when the pod is not Ready within timeout; here it is Ready-but-degraded.
async_load_databases: false Strictly worse — synchronous load makes the missing-cluster failure abort startup → pod CrashLoopBackOff (exit 210).

Proposed fix

Restart only the hosts created during the current reconcile, once, after the final remote_servers (with all hosts) has been published and propagated. Concretely, at the end of reconcileCRAuxObjectsFinal() (right after includeAllHostsIntoCluster()):

  • iterate hosts with reconcile status ObjectStatusCreated,
  • skip single-host clusters (cluster.HostsCount() < 2) — their remote_servers only references localhost, which always resolves,
  • WaitForConfigMapPropagation(host) then hostSoftwareRestart(host).

This is:

  • scoped to the new host only — the existing replica keeps serving, so there is no client-visible downtime,
  • idempotent / one-shot — the host becomes Same (has an ancestor) on the next reconcile, so it is not restarted again (no loop),
  • ordered after the full remote_servers is present, so the post-restart boot always loads with the cluster defined, closing the race.

An equally valid (and arguably cleaner) alternative is to seed the newly-added host's mounted remote_servers with the full cluster before the pod is created (i.e. don't exclude ObjectStatusRequested hosts from the config the new pod will mount), so the async load never fails in the first place. The restart-after-publish approach is the smaller, lower-risk change.

Impact

  • Silent on scale-up: the new replica is Ready, the CHI is Completed, and system.clusters looks correct, so monitoring does not flag it — yet Distributed/dictionary/refreshable-MV data never lands on the new replica.
  • Requires manual pod restarts to recover, which is not acceptable for automated/at-scale fleets.

I have a minimal patch implementing the restart-after-publish approach and have verified it end-to-end on kind (operator emits the restart for the new host after remote_servers is published; the new replica then loads all objects, system.asynchronous_loader is all OK, dictionary LOADED, refreshable MV succeeding, no CLUSTER_DOESNT_EXIST). Happy to open a PR.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions