Newly-added replica is never restarted after `remote_servers` is published — Distributed/Dictionary/Refreshable-MV stay permanently broken (`CLUSTER_DOESNT_EXIST`) after scale-up

## Operator version

- clickhouse-operator: **0.27.2** (also reproduced on a 0.26.x build; code paths are unchanged on `master`)
- ClickHouse server: **25.8** (`clickhouse/clickhouse-server:25.8`, observed on `25.8.17.x` / `25.8.24.x`)
- Kubernetes: 1.25 (reproduced on kind) and on production k8s
- Install: operator-managed `ClickHouseInstallation`, single shard, `ReplicatedMergeTree` + `Distributed` + `DICTIONARY` + refreshable `MATERIALIZED VIEW`

## Summary

When a cluster is scaled up (`layout.replicasCount` 1 → 2), the **newly added** replica can start ClickHouse **before** the operator-generated `remote_servers` (the cluster definition) is resolvable in that pod. If any object on that replica needs the cluster during its **startup table load** — e.g. a `Distributed(<cluster>, …)` table with pending data, a `DICTIONARY` that reads it, or a refreshable MV that `dictGet`s it — the async startup load fails terminally with **`Code: 701 CLUSTER_DOESNT_EXIST`** (and dependents become `CANCELED`).

The operator then publishes the final `remote_servers` (now including the new host) and applies it via **`SYSTEM RELOAD CONFIG` (no restart)**. After that, `system.clusters` lists the cluster correctly — **but the cached terminal async-load failures are never re-run**. ClickHouse does not retry terminal `FAILED`/`CANCELED` loader jobs for the lifetime of the process. As a result, the Distributed/dictionary/refreshable-MV chain on the new replica stays broken forever, and the affected tables never appear on it.

**The operator never restarts the newly-added host after the cluster definition is published**, so there is no automatic recovery. The only workaround today is a manual pod restart.

## Expected behavior

After scale-up, the newly added replica should end up healthy automatically: once the final `remote_servers` (with all hosts) is in place, the new host should be in a state where cluster-dependent objects load successfully — without a human manually restarting the pod.

## Actual behavior

The newly added replica comes up **Ready but degraded**:

- `SELECT … FROM <distributed_table>` → `Code: 701 CLUSTER_DOESNT_EXIST`
- the dependent `DICTIONARY` is stuck (load `CANCELED`); `dictGet(...)` → `Code: 722 ASYNC_LOAD_WAIT_FAILED` → `701`
- the refreshable MV keeps failing every refresh; its target tables never populate on the new replica
- `system.clusters` *does* list the cluster (after the operator's reload), which makes the failure look paradoxical
- `system.asynchronous_loader` shows the relevant jobs as terminal `FAILED`/`CANCELED` and they are never re-run

A manual `kubectl delete pod <new-replica>` (or any process restart) fixes it permanently, because the fresh process re-runs the loader with the cluster present.

## Steps to reproduce

1. Install a CHI with a single shard, **1 replica**, `clickhouse/clickhouse-server:25.8`, backed by Keeper.
2. Create the dependency chain and ingest some data so the `Distributed` table has pending/local data:

   ```sql
   CREATE TABLE default.rmt ON CLUSTER 'c1'
   ( ts DateTime, metric String, val Float64 )
   ENGINE = ReplicatedMergeTree('/clickhouse/tables/{shard}/rmt', '{replica}')
   ORDER BY (metric, ts);

   INSERT INTO default.rmt
   SELECT now() - number, concat('m', toString(number % 5)), number FROM numbers(2000);

   CREATE TABLE default.dist ON CLUSTER 'c1' AS default.rmt
   ENGINE = Distributed('c1', default, rmt, rand());

   CREATE DICTIONARY default.max_ts_dict ON CLUSTER 'c1'
   ( metric String, max_ts DateTime )
   PRIMARY KEY metric
   SOURCE(CLICKHOUSE(QUERY 'SELECT metric, max(ts) AS max_ts FROM default.dist GROUP BY metric'))
   LAYOUT(COMPLEX_KEY_HASHED())
   LIFETIME(MIN 30 MAX 60);

   CREATE TABLE default.agg ON CLUSTER 'c1'
   ( metric String, max_ts DateTime, cnt UInt64 )
   ENGINE = MergeTree ORDER BY metric;

   CREATE MATERIALIZED VIEW default.agg_mv ON CLUSTER 'c1'
   REFRESH EVERY 10 SECOND TO default.agg AS
   SELECT metric, dictGet('default.max_ts_dict','max_ts', tuple(metric)) AS max_ts, count() AS cnt
   FROM default.dist GROUP BY metric;
   ```

3. Scale the cluster: set `spec.configuration.clusters[0].layout.replicasCount: 2`.
4. After the operator reports `Completed`, query the **new** replica:

   ```sql
   SELECT count() FROM default.dist;                              -- Code: 701 CLUSTER_DOESNT_EXIST
   SELECT status, last_exception FROM system.dictionaries
     WHERE name LIKE '%max_ts_dict%';                             -- not LOADED
   SELECT view, status, exception FROM system.view_refreshes
     WHERE view = 'agg_mv';                                       -- exception, never succeeds
   SELECT status, count() FROM system.asynchronous_loader GROUP BY status; -- FAILED / CANCELED present
   SELECT name, value FROM system.errors WHERE name = 'CLUSTER_DOESNT_EXIST';
   ```

5. `system.clusters` on the new replica already lists `c1` (the operator applied `remote_servers` via reload), yet the queries above keep failing — demonstrating the failure is **sticky** across `SYSTEM RELOAD CONFIG`.
6. `kubectl delete pod <new-replica-pod>` → after it restarts, everything is healthy. This is the only recovery today.

## Representative server log (new replica)

```
<Error> ... CLUSTER_DOESNT_EXIST ... Requested cluster 'c1' not found
   ... StorageDistributed::initializeDirectoryQueuesForDisk
   ... Context::getCluster
<Error> AsyncLoader: load job 'load table default.dist' failed: CLUSTER_DOESNT_EXIST
<Error> AsyncLoader: load job 'startup table default.dist' canceled
<Error> AsyncLoader: load job 'load dictionary default.max_ts_dict' canceled
... RefreshTask: ... Code: 722. ASYNC_LOAD_WAIT_FAILED ... -> CLUSTER_DOESNT_EXIST
```

## Root cause (operator side)

Two operator behaviors combine to leave the new replica permanently broken:

1. **The preliminary common ConfigMap is generated without the newly-added hosts.**
   `getRemoteServersGeneratorOptions()` (`pkg/controller/chi/worker-status-helpers.go`) always excludes hosts with status `ObjectStatusRequested`. The new replica's pod is therefore created and may boot before its `remote_servers` is fully resolvable (and ConfigMap propagation into a brand-new pod is asynchronous on top of that). This opens the boot-before-cluster window for the new host's startup table loads.

2. **The final `remote_servers` is applied without restarting the new host.**
   `reconcileCRAuxObjectsFinal()` (`pkg/controller/chi/worker-reconciler-chi.go`) writes the complete `remote_servers` and calls `includeAllHostsIntoCluster()`, which propagates via `SYSTEM RELOAD CONFIG` — **no restart**. ClickHouse, by design, does not re-run terminal `FAILED`/`CANCELED` async-load jobs after a config reload, so the cluster appears in `system.clusters` but the cached startup-load failures remain.

There is no code path that restarts the newly-added host after the cluster definition lands:

- `shouldForceRestartHost()` (`pkg/controller/chi/worker.go`) early-returns `false` for a host whose status is `ObjectStatusRequested` or that has no ancestor — i.e. exactly the newly-added replica — so even `restart: RollingUpdate` and image/config-change checks are skipped for it.
- `IsConfigurationChangeRequiresReboot()` (`pkg/model/chop_config.go`) only diffs `zookeeper`/`profiles`/`quotas`/`settings`/`files`. The cluster's `remote_servers` is generated from `clusters[].layout` into `chop-generated-remote_servers.xml`, **not** from `.spec.configuration.settings`, so `configurationRestartPolicy: settings/remote_servers/*: "yes"` never matches a cluster-membership change.

### Settings that do **not** mitigate it (verified on 0.27.2)

| Setting | Why it does not help |
|---|---|
| `configurationRestartPolicy: settings/remote_servers/*: "yes"` | `remote_servers` comes from `clusters[].layout`, not `.spec.configuration.settings`; `IsConfigurationChangeRequiresReboot` has no `remote_servers`/`clusters` branch, so it never matches. Even if it did, `shouldForceRestartHost` skips the new host first. |
| CHI `spec.restart: "RollingUpdate"` | `shouldForceRestartHost` returns `true` for RollingUpdate only **after** the `ObjectStatusRequested`/`!HasAncestor` early-returns, so the new replica is skipped. |
| `reconcile.host.wait.include: true` | Only polls `IsHostInCluster`; never restarts. A host can be "in cluster" while its loader jobs stay terminally `FAILED`. |
| `reconcile.host.wait.replicas.{new,all}: true` | Waits for replication catch-up; doesn't restart; the broken objects (Distributed/dictionary/MV) are non-replicated. |
| `reconcile.host.wait.probes.{startup,readiness}: true` | The pod becomes **Ready** (server answers `SELECT 1`) despite the failed async loads, so the wait passes and nothing restarts. |
| `reconcile.statefulSet.{create,update}.onFailure` | Only fire when the pod is **not** Ready within timeout; here it is Ready-but-degraded. |
| `async_load_databases: false` | Strictly worse — synchronous load makes the missing-cluster failure abort startup → pod `CrashLoopBackOff` (exit 210). |

## Proposed fix

Restart **only the hosts created during the current reconcile**, once, **after** the final `remote_servers` (with all hosts) has been published and propagated. Concretely, at the end of `reconcileCRAuxObjectsFinal()` (right after `includeAllHostsIntoCluster()`):

- iterate hosts with reconcile status `ObjectStatusCreated`,
- skip single-host clusters (`cluster.HostsCount() < 2`) — their `remote_servers` only references `localhost`, which always resolves,
- `WaitForConfigMapPropagation(host)` then `hostSoftwareRestart(host)`.

This is:

- **scoped to the new host only** — the existing replica keeps serving, so there is no client-visible downtime,
- **idempotent / one-shot** — the host becomes `Same` (has an ancestor) on the next reconcile, so it is not restarted again (no loop),
- **ordered after the full `remote_servers` is present**, so the post-restart boot always loads with the cluster defined, closing the race.

An equally valid (and arguably cleaner) alternative is to **seed the newly-added host's mounted `remote_servers` with the full cluster before the pod is created** (i.e. don't exclude `ObjectStatusRequested` hosts from the config the new pod will mount), so the async load never fails in the first place. The restart-after-publish approach is the smaller, lower-risk change.

## Impact

- Silent on scale-up: the new replica is `Ready`, the CHI is `Completed`, and `system.clusters` looks correct, so monitoring does not flag it — yet Distributed/dictionary/refreshable-MV data never lands on the new replica.
- Requires manual pod restarts to recover, which is not acceptable for automated/at-scale fleets.

I have a minimal patch implementing the restart-after-publish approach and have verified it end-to-end on kind (operator emits the restart for the new host after `remote_servers` is published; the new replica then loads all objects, `system.asynchronous_loader` is all `OK`, dictionary `LOADED`, refreshable MV succeeding, no `CLUSTER_DOESNT_EXIST`). Happy to open a PR.


Setting	Why it does not help
`configurationRestartPolicy: settings/remote_servers/*: "yes"`	`remote_servers` comes from `clusters[].layout`, not `.spec.configuration.settings`; `IsConfigurationChangeRequiresReboot` has no `remote_servers`/`clusters` branch, so it never matches. Even if it did, `shouldForceRestartHost` skips the new host first.
CHI `spec.restart: "RollingUpdate"`	`shouldForceRestartHost` returns `true` for RollingUpdate only after the `ObjectStatusRequested`/`!HasAncestor` early-returns, so the new replica is skipped.
`reconcile.host.wait.include: true`	Only polls `IsHostInCluster`; never restarts. A host can be "in cluster" while its loader jobs stay terminally `FAILED`.
`reconcile.host.wait.replicas.{new,all}: true`	Waits for replication catch-up; doesn't restart; the broken objects (Distributed/dictionary/MV) are non-replicated.
`reconcile.host.wait.probes.{startup,readiness}: true`	The pod becomes Ready (server answers `SELECT 1`) despite the failed async loads, so the wait passes and nothing restarts.
`reconcile.statefulSet.{create,update}.onFailure`	Only fire when the pod is not Ready within timeout; here it is Ready-but-degraded.
`async_load_databases: false`	Strictly worse — synchronous load makes the missing-cluster failure abort startup → pod `CrashLoopBackOff` (exit 210).

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Newly-added replica is never restarted after `remote_servers` is published — Distributed/Dictionary/Refreshable-MV stay permanently broken (`CLUSTER_DOESNT_EXIST`) after scale-up #2013

Operator version

Summary

Expected behavior

Actual behavior

Steps to reproduce

Representative server log (new replica)

Root cause (operator side)

Settings that do not mitigate it (verified on 0.27.2)

Proposed fix

Impact

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Uh oh!

Newly-added replica is never restarted after remote_servers is published — Distributed/Dictionary/Refreshable-MV stay permanently broken (CLUSTER_DOESNT_EXIST) after scale-up #2013

Description

Operator version

Summary

Expected behavior

Actual behavior

Steps to reproduce

Representative server log (new replica)

Root cause (operator side)

Settings that do not mitigate it (verified on 0.27.2)

Proposed fix

Impact

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions

Newly-added replica is never restarted after `remote_servers` is published — Distributed/Dictionary/Refreshable-MV stay permanently broken (`CLUSTER_DOESNT_EXIST`) after scale-up #2013