Description
com.clickhouse.client.api.Client#getNextAliveNode() (client-v2) is hard-coded to return the first registered endpoint and never rotates:
// client-v2/src/main/java/com/clickhouse/client/api/Client.java:2200
private Endpoint getNextAliveNode() {
    return endpoints.get(0);
}
This method is the only mechanism the client-v2 INSERT / write / query code paths use to (re-)select an endpoint on retry. Every retry call site (in Client.java around lines 1329, 1359, 1376, 1551, 1567, 1583, 1680, 1697, 1725) reassigns selectedEndpoint = getNextAliveNode(), but since the method ignores any rotation/liveness state, retries always target the same (first) endpoint. If the first endpoint is unreachable, no registered alternative is ever attempted, and the client surfaces a connection failure even though other configured endpoints could have served the request.
The naming (getNextAliveNode) suggests the surrounding retry loop was authored expecting rotation/liveness logic that was never implemented, so the failover behavior is silently absent rather than explicitly disabled. The feature request to add real failover (#1838) was closed as not-planned (stale) on 2026-02-10, so this stub remains in the shipping code.
This is the Java-side analogue of clickhouse-go#136 (the Go driver failing to fall back to alt_hosts when the first host is down).
ClickHouse server version
Code analysis only; not verified against a running server. (Server in the investigation environment was 26.4.2.10, but reproduction is purely client-side and does not depend on server version.)
Reproduction
Minimal client-v2 snippet that exercises the broken path:
import com.clickhouse.client.api.Client;
import com.clickhouse.client.api.ClientFaultCause;
import com.clickhouse.client.api.query.QueryResponse;

public class FailoverRepro {
    public static void main(String[] args) throws Exception {
        // First endpoint is dead (no listener on :1); second is healthy.
        try (Client client = new Client.Builder()
                .addEndpoint("http://127.0.0.1:1")
                .addEndpoint("http://localhost:8123")
                .setUsername("default")
                .setPassword("")
                // Failure causes that should trigger a retry.
                .retryOnFailures(ClientFaultCause.ConnectTimeout,
                        ClientFaultCause.NoHttpResponse,
                        ClientFaultCause.ServerRetryable)
                .setMaxRetries(5) // any N >= 1
                .build()) {
            try (QueryResponse r = client.query("SELECT 1").get()) {
                // Expected: succeed via the healthy second endpoint after the first fails.
                // Actual: all N+1 attempts hit 127.0.0.1:1 and the query throws a
                // connection-refused / connect-timeout exception.
            }
        }
    }
}
Expected: the retry loop tries localhost:8123 after 127.0.0.1:1 fails to connect, and the query returns successfully.
Actual: every retry repeatedly targets 127.0.0.1:1 because getNextAliveNode() returns endpoints.get(0) unconditionally; the healthy alternate is never tried.
Suggested fix
Location: Client.java:2200-2202. At minimum, getNextAliveNode() should rotate across endpoints, for example with an AtomicInteger round-robin index that is incremented on each call and taken modulo endpoints.size(). A more complete fix would also track endpoints marked as failed within the current retry loop and skip them, and would advance the index only on retries triggered by connection-level failures or retryable 5xx responses, so that successful requests keep affinity to a single endpoint.
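A minimal sketch of the rotation described above, not a patch against Client.java. The class RoundRobinSelector and its methods current(), next(), and nextExcluding() are hypothetical names introduced here for illustration; the real fix would have to fit the existing Endpoint type and the retry call sites. It assumes the cursor is advanced only when a retry is triggered, so successful requests stay on the same endpoint.

```java
import java.util.List;
import java.util.Set;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical helper, not part of client-v2: round-robin selection over a
// fixed endpoint list, advanced only when the caller decides to retry.
final class RoundRobinSelector<T> {
    private final List<T> endpoints;
    private final AtomicInteger cursor = new AtomicInteger();

    RoundRobinSelector(List<T> endpoints) {
        if (endpoints.isEmpty()) {
            throw new IllegalArgumentException("at least one endpoint is required");
        }
        this.endpoints = List.copyOf(endpoints);
    }

    // Endpoint for a first attempt. Does not advance the cursor, so
    // successful requests keep affinity to one endpoint.
    T current() {
        return endpoints.get(Math.floorMod(cursor.get(), endpoints.size()));
    }

    // Advances the cursor and returns the next endpoint. floorMod keeps the
    // index non-negative even after the int counter wraps around.
    T next() {
        return endpoints.get(Math.floorMod(cursor.incrementAndGet(), endpoints.size()));
    }

    // Advances past endpoints that already failed in the current retry loop.
    // If every endpoint has failed, falls back to the current one.
    T nextExcluding(Set<T> failed) {
        for (int i = 0; i < endpoints.size(); i++) {
            T candidate = next();
            if (!failed.contains(candidate)) {
                return candidate;
            }
        }
        return current();
    }
}
```

In this sketch a retry loop would call current() for the first attempt and next() or nextExcluding(failed) after a connection-level failure or a retryable 5xx response, where the existing code calls getNextAliveNode().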
Link
Related upstream report (Go client analogue): ClickHouse/clickhouse-go#136
Closed feature request for the same functionality in client-v2: #1838