Operator version
- clickhouse-operator: 0.27.2 (also reproduced on a 0.26.x build; code paths are unchanged on
master)
- ClickHouse server: 25.8 (
clickhouse/clickhouse-server:25.8, observed on 25.8.17.x / 25.8.24.x)
- Kubernetes: 1.25 (reproduced on kind) and on production k8s
- Install: operator-managed
ClickHouseInstallation, single shard, ReplicatedMergeTree + Distributed + DICTIONARY + refreshable MATERIALIZED VIEW
Summary
When a cluster is scaled up (layout.replicasCount 1 → 2), the newly added replica can start ClickHouse before the operator-generated remote_servers (the cluster definition) is resolvable in that pod. If any object on that replica needs the cluster during its startup table load — e.g. a Distributed(<cluster>, …) table with pending data, a DICTIONARY that reads it, or a refreshable MV that dictGets it — the async startup load fails terminally with Code: 701 CLUSTER_DOESNT_EXIST (and dependents become CANCELED).
The operator then publishes the final remote_servers (now including the new host) and applies it via SYSTEM RELOAD CONFIG (no restart). After that, system.clusters lists the cluster correctly — but the cached terminal async-load failures are never re-run. ClickHouse does not retry terminal FAILED/CANCELED loader jobs for the lifetime of the process. As a result, the Distributed/dictionary/refreshable-MV chain on the new replica stays broken forever, and the affected tables never appear on it.
The operator never restarts the newly-added host after the cluster definition is published, so there is no automatic recovery. The only workaround today is a manual pod restart.
Expected behavior
After scale-up, the newly added replica should end up healthy automatically: once the final remote_servers (with all hosts) is in place, the new host should be in a state where cluster-dependent objects load successfully — without a human manually restarting the pod.
Actual behavior
The newly added replica comes up Ready but degraded:
SELECT … FROM <distributed_table> → Code: 701 CLUSTER_DOESNT_EXIST
- the dependent
DICTIONARY is stuck (load CANCELED); dictGet(...) → Code: 722 ASYNC_LOAD_WAIT_FAILED → 701
- the refreshable MV keeps failing every refresh; its target tables never populate on the new replica
system.clusters does list the cluster (after the operator's reload), which makes the failure look paradoxical
system.asynchronous_loader shows the relevant jobs as terminal FAILED/CANCELED and they are never re-run
A manual kubectl delete pod <new-replica> (or any process restart) fixes it permanently, because the fresh process re-runs the loader with the cluster present.
Steps to reproduce
-
Install a CHI with a single shard, 1 replica, clickhouse/clickhouse-server:25.8, backed by Keeper.
-
Create the dependency chain and ingest some data so the Distributed table has pending/local data:
CREATE TABLE default.rmt ON CLUSTER 'c1'
( ts DateTime, metric String, val Float64 )
ENGINE = ReplicatedMergeTree('/clickhouse/tables/{shard}/rmt', '{replica}')
ORDER BY (metric, ts);
INSERT INTO default.rmt
SELECT now() - number, concat('m', toString(number % 5)), number FROM numbers(2000);
CREATE TABLE default.dist ON CLUSTER 'c1' AS default.rmt
ENGINE = Distributed('c1', default, rmt, rand());
CREATE DICTIONARY default.max_ts_dict ON CLUSTER 'c1'
( metric String, max_ts DateTime )
PRIMARY KEY metric
SOURCE(CLICKHOUSE(QUERY 'SELECT metric, max(ts) AS max_ts FROM default.dist GROUP BY metric'))
LAYOUT(COMPLEX_KEY_HASHED())
LIFETIME(MIN 30 MAX 60);
CREATE TABLE default.agg ON CLUSTER 'c1'
( metric String, max_ts DateTime, cnt UInt64 )
ENGINE = MergeTree ORDER BY metric;
CREATE MATERIALIZED VIEW default.agg_mv ON CLUSTER 'c1'
REFRESH EVERY 10 SECOND TO default.agg AS
SELECT metric, dictGet('default.max_ts_dict','max_ts', tuple(metric)) AS max_ts, count() AS cnt
FROM default.dist GROUP BY metric;
-
Scale the cluster: set spec.configuration.clusters[0].layout.replicasCount: 2.
-
After the operator reports Completed, query the new replica:
SELECT count() FROM default.dist; -- Code: 701 CLUSTER_DOESNT_EXIST
SELECT status, last_exception FROM system.dictionaries
WHERE name LIKE '%max_ts_dict%'; -- not LOADED
SELECT view, status, exception FROM system.view_refreshes
WHERE view = 'agg_mv'; -- exception, never succeeds
SELECT status, count() FROM system.asynchronous_loader GROUP BY status; -- FAILED / CANCELED present
SELECT name, value FROM system.errors WHERE name = 'CLUSTER_DOESNT_EXIST';
-
system.clusters on the new replica already lists c1 (the operator applied remote_servers via reload), yet the queries above keep failing — demonstrating the failure is sticky across SYSTEM RELOAD CONFIG.
-
kubectl delete pod <new-replica-pod> → after it restarts, everything is healthy. This is the only recovery today.
Representative server log (new replica)
<Error> ... CLUSTER_DOESNT_EXIST ... Requested cluster 'c1' not found
... StorageDistributed::initializeDirectoryQueuesForDisk
... Context::getCluster
<Error> AsyncLoader: load job 'load table default.dist' failed: CLUSTER_DOESNT_EXIST
<Error> AsyncLoader: load job 'startup table default.dist' canceled
<Error> AsyncLoader: load job 'load dictionary default.max_ts_dict' canceled
... RefreshTask: ... Code: 722. ASYNC_LOAD_WAIT_FAILED ... -> CLUSTER_DOESNT_EXIST
Root cause (operator side)
Two operator behaviors combine to leave the new replica permanently broken:
-
The preliminary common ConfigMap is generated without the newly-added hosts.
getRemoteServersGeneratorOptions() (pkg/controller/chi/worker-status-helpers.go) always excludes hosts with status ObjectStatusRequested. The new replica's pod is therefore created and may boot before its remote_servers is fully resolvable (and ConfigMap propagation into a brand-new pod is asynchronous on top of that). This opens the boot-before-cluster window for the new host's startup table loads.
-
The final remote_servers is applied without restarting the new host.
reconcileCRAuxObjectsFinal() (pkg/controller/chi/worker-reconciler-chi.go) writes the complete remote_servers and calls includeAllHostsIntoCluster(), which propagates via SYSTEM RELOAD CONFIG — no restart. ClickHouse, by design, does not re-run terminal FAILED/CANCELED async-load jobs after a config reload, so the cluster appears in system.clusters but the cached startup-load failures remain.
There is no code path that restarts the newly-added host after the cluster definition lands:
shouldForceRestartHost() (pkg/controller/chi/worker.go) early-returns false for a host whose status is ObjectStatusRequested or that has no ancestor — i.e. exactly the newly-added replica — so even restart: RollingUpdate and image/config-change checks are skipped for it.
IsConfigurationChangeRequiresReboot() (pkg/model/chop_config.go) only diffs zookeeper/profiles/quotas/settings/files. The cluster's remote_servers is generated from clusters[].layout into chop-generated-remote_servers.xml, not from .spec.configuration.settings, so configurationRestartPolicy: settings/remote_servers/*: "yes" never matches a cluster-membership change.
Settings that do not mitigate it (verified on 0.27.2)
| Setting |
Why it does not help |
configurationRestartPolicy: settings/remote_servers/*: "yes" |
remote_servers comes from clusters[].layout, not .spec.configuration.settings; IsConfigurationChangeRequiresReboot has no remote_servers/clusters branch, so it never matches. Even if it did, shouldForceRestartHost skips the new host first. |
CHI spec.restart: "RollingUpdate" |
shouldForceRestartHost returns true for RollingUpdate only after the ObjectStatusRequested/!HasAncestor early-returns, so the new replica is skipped. |
reconcile.host.wait.include: true |
Only polls IsHostInCluster; never restarts. A host can be "in cluster" while its loader jobs stay terminally FAILED. |
reconcile.host.wait.replicas.{new,all}: true |
Waits for replication catch-up; doesn't restart; the broken objects (Distributed/dictionary/MV) are non-replicated. |
reconcile.host.wait.probes.{startup,readiness}: true |
The pod becomes Ready (server answers SELECT 1) despite the failed async loads, so the wait passes and nothing restarts. |
reconcile.statefulSet.{create,update}.onFailure |
Only fire when the pod is not Ready within timeout; here it is Ready-but-degraded. |
async_load_databases: false |
Strictly worse — synchronous load makes the missing-cluster failure abort startup → pod CrashLoopBackOff (exit 210). |
Proposed fix
Restart only the hosts created during the current reconcile, once, after the final remote_servers (with all hosts) has been published and propagated. Concretely, at the end of reconcileCRAuxObjectsFinal() (right after includeAllHostsIntoCluster()):
- iterate hosts with reconcile status
ObjectStatusCreated,
- skip single-host clusters (
cluster.HostsCount() < 2) — their remote_servers only references localhost, which always resolves,
WaitForConfigMapPropagation(host) then hostSoftwareRestart(host).
This is:
- scoped to the new host only — the existing replica keeps serving, so there is no client-visible downtime,
- idempotent / one-shot — the host becomes
Same (has an ancestor) on the next reconcile, so it is not restarted again (no loop),
- ordered after the full
remote_servers is present, so the post-restart boot always loads with the cluster defined, closing the race.
An equally valid (and arguably cleaner) alternative is to seed the newly-added host's mounted remote_servers with the full cluster before the pod is created (i.e. don't exclude ObjectStatusRequested hosts from the config the new pod will mount), so the async load never fails in the first place. The restart-after-publish approach is the smaller, lower-risk change.
Impact
- Silent on scale-up: the new replica is
Ready, the CHI is Completed, and system.clusters looks correct, so monitoring does not flag it — yet Distributed/dictionary/refreshable-MV data never lands on the new replica.
- Requires manual pod restarts to recover, which is not acceptable for automated/at-scale fleets.
I have a minimal patch implementing the restart-after-publish approach and have verified it end-to-end on kind (operator emits the restart for the new host after remote_servers is published; the new replica then loads all objects, system.asynchronous_loader is all OK, dictionary LOADED, refreshable MV succeeding, no CLUSTER_DOESNT_EXIST). Happy to open a PR.
Operator version
master)clickhouse/clickhouse-server:25.8, observed on25.8.17.x/25.8.24.x)ClickHouseInstallation, single shard,ReplicatedMergeTree+Distributed+DICTIONARY+ refreshableMATERIALIZED VIEWSummary
When a cluster is scaled up (
layout.replicasCount1 → 2), the newly added replica can start ClickHouse before the operator-generatedremote_servers(the cluster definition) is resolvable in that pod. If any object on that replica needs the cluster during its startup table load — e.g. aDistributed(<cluster>, …)table with pending data, aDICTIONARYthat reads it, or a refreshable MV thatdictGets it — the async startup load fails terminally withCode: 701 CLUSTER_DOESNT_EXIST(and dependents becomeCANCELED).The operator then publishes the final
remote_servers(now including the new host) and applies it viaSYSTEM RELOAD CONFIG(no restart). After that,system.clusterslists the cluster correctly — but the cached terminal async-load failures are never re-run. ClickHouse does not retry terminalFAILED/CANCELEDloader jobs for the lifetime of the process. As a result, the Distributed/dictionary/refreshable-MV chain on the new replica stays broken forever, and the affected tables never appear on it.The operator never restarts the newly-added host after the cluster definition is published, so there is no automatic recovery. The only workaround today is a manual pod restart.
Expected behavior
After scale-up, the newly added replica should end up healthy automatically: once the final
remote_servers(with all hosts) is in place, the new host should be in a state where cluster-dependent objects load successfully — without a human manually restarting the pod.Actual behavior
The newly added replica comes up Ready but degraded:
SELECT … FROM <distributed_table>→Code: 701 CLUSTER_DOESNT_EXISTDICTIONARYis stuck (loadCANCELED);dictGet(...)→Code: 722 ASYNC_LOAD_WAIT_FAILED→701system.clustersdoes list the cluster (after the operator's reload), which makes the failure look paradoxicalsystem.asynchronous_loadershows the relevant jobs as terminalFAILED/CANCELEDand they are never re-runA manual
kubectl delete pod <new-replica>(or any process restart) fixes it permanently, because the fresh process re-runs the loader with the cluster present.Steps to reproduce
Install a CHI with a single shard, 1 replica,
clickhouse/clickhouse-server:25.8, backed by Keeper.Create the dependency chain and ingest some data so the
Distributedtable has pending/local data:Scale the cluster: set
spec.configuration.clusters[0].layout.replicasCount: 2.After the operator reports
Completed, query the new replica:system.clusterson the new replica already listsc1(the operator appliedremote_serversvia reload), yet the queries above keep failing — demonstrating the failure is sticky acrossSYSTEM RELOAD CONFIG.kubectl delete pod <new-replica-pod>→ after it restarts, everything is healthy. This is the only recovery today.Representative server log (new replica)
Root cause (operator side)
Two operator behaviors combine to leave the new replica permanently broken:
The preliminary common ConfigMap is generated without the newly-added hosts.
getRemoteServersGeneratorOptions()(pkg/controller/chi/worker-status-helpers.go) always excludes hosts with statusObjectStatusRequested. The new replica's pod is therefore created and may boot before itsremote_serversis fully resolvable (and ConfigMap propagation into a brand-new pod is asynchronous on top of that). This opens the boot-before-cluster window for the new host's startup table loads.The final
remote_serversis applied without restarting the new host.reconcileCRAuxObjectsFinal()(pkg/controller/chi/worker-reconciler-chi.go) writes the completeremote_serversand callsincludeAllHostsIntoCluster(), which propagates viaSYSTEM RELOAD CONFIG— no restart. ClickHouse, by design, does not re-run terminalFAILED/CANCELEDasync-load jobs after a config reload, so the cluster appears insystem.clustersbut the cached startup-load failures remain.There is no code path that restarts the newly-added host after the cluster definition lands:
shouldForceRestartHost()(pkg/controller/chi/worker.go) early-returnsfalsefor a host whose status isObjectStatusRequestedor that has no ancestor — i.e. exactly the newly-added replica — so evenrestart: RollingUpdateand image/config-change checks are skipped for it.IsConfigurationChangeRequiresReboot()(pkg/model/chop_config.go) only diffszookeeper/profiles/quotas/settings/files. The cluster'sremote_serversis generated fromclusters[].layoutintochop-generated-remote_servers.xml, not from.spec.configuration.settings, soconfigurationRestartPolicy: settings/remote_servers/*: "yes"never matches a cluster-membership change.Settings that do not mitigate it (verified on 0.27.2)
configurationRestartPolicy: settings/remote_servers/*: "yes"remote_serverscomes fromclusters[].layout, not.spec.configuration.settings;IsConfigurationChangeRequiresReboothas noremote_servers/clustersbranch, so it never matches. Even if it did,shouldForceRestartHostskips the new host first.spec.restart: "RollingUpdate"shouldForceRestartHostreturnstruefor RollingUpdate only after theObjectStatusRequested/!HasAncestorearly-returns, so the new replica is skipped.reconcile.host.wait.include: trueIsHostInCluster; never restarts. A host can be "in cluster" while its loader jobs stay terminallyFAILED.reconcile.host.wait.replicas.{new,all}: truereconcile.host.wait.probes.{startup,readiness}: trueSELECT 1) despite the failed async loads, so the wait passes and nothing restarts.reconcile.statefulSet.{create,update}.onFailureasync_load_databases: falseCrashLoopBackOff(exit 210).Proposed fix
Restart only the hosts created during the current reconcile, once, after the final
remote_servers(with all hosts) has been published and propagated. Concretely, at the end ofreconcileCRAuxObjectsFinal()(right afterincludeAllHostsIntoCluster()):ObjectStatusCreated,cluster.HostsCount() < 2) — theirremote_serversonly referenceslocalhost, which always resolves,WaitForConfigMapPropagation(host)thenhostSoftwareRestart(host).This is:
Same(has an ancestor) on the next reconcile, so it is not restarted again (no loop),remote_serversis present, so the post-restart boot always loads with the cluster defined, closing the race.An equally valid (and arguably cleaner) alternative is to seed the newly-added host's mounted
remote_serverswith the full cluster before the pod is created (i.e. don't excludeObjectStatusRequestedhosts from the config the new pod will mount), so the async load never fails in the first place. The restart-after-publish approach is the smaller, lower-risk change.Impact
Ready, the CHI isCompleted, andsystem.clusterslooks correct, so monitoring does not flag it — yet Distributed/dictionary/refreshable-MV data never lands on the new replica.I have a minimal patch implementing the restart-after-publish approach and have verified it end-to-end on kind (operator emits the restart for the new host after
remote_serversis published; the new replica then loads all objects,system.asynchronous_loaderis allOK, dictionaryLOADED, refreshable MV succeeding, noCLUSTER_DOESNT_EXIST). Happy to open a PR.