[client-v2] getNextAliveNode() always returns endpoints.get(0) — retries never fail over to alternate endpoints

## Description

`com.clickhouse.client.api.Client#getNextAliveNode()` (client-v2) is hard-coded to return the first registered endpoint and never rotates:

```java
// client-v2/src/main/java/com/clickhouse/client/api/Client.java:2200
private Endpoint getNextAliveNode() {
    return endpoints.get(0);
}
```

This method is the only mechanism the client-v2 INSERT / write / query code paths use to (re-)select an endpoint on retry. Every retry call site (in `Client.java` around lines 1329, 1359, 1376, 1551, 1567, 1583, 1680, 1697, 1725) reassigns `selectedEndpoint = getNextAliveNode()`, but since the method ignores any rotation/liveness state, retries always target the same (first) endpoint. If the first endpoint is unreachable, no registered alternative is ever attempted, and the client surfaces a connection failure even though other configured endpoints could have served the request.

The naming (`getNextAliveNode`) suggests the surrounding retry loop was authored expecting rotation/liveness logic that was never implemented, so the failover behavior is silently absent rather than explicitly disabled. The feature request to add real failover (#1838) was closed as not-planned (stale) on 2026-02-10, so this stub remains in the shipping code.

This is the Java-side analogue of [clickhouse-go#136](https://github.com/ClickHouse/clickhouse-go/issues/136) (the Go driver failing to fall back to `alt_hosts` when the first host is down).

## ClickHouse server version

Code analysis only; not verified against a running server. (Server in the investigation environment was 26.4.2.10, but reproduction is purely client-side and does not depend on server version.)

## Reproduction

Minimal client-v2 snippet that exercises the broken path:

```java
import com.clickhouse.client.api.Client;
import com.clickhouse.client.api.ClientFaultCause;
import com.clickhouse.client.api.query.QueryResponse;

public class FailoverRepro {
    public static void main(String[] args) throws Exception {
        // First endpoint is dead (no listener on :1); second is healthy.
        try (Client client = new Client.Builder()
                .addEndpoint("http://127.0.0.1:1")
                .addEndpoint("http://localhost:8123")
                .setUsername("default")
                .setPassword("")
                // Fault causes that should trigger a retry.
                .retryOnFailures(ClientFaultCause.ConnectTimeout,
                                 ClientFaultCause.NoHttpResponse,
                                 ClientFaultCause.ServerRetryable)
                .setMaxRetries(5)  // any N >= 1
                .build()) {

            try (QueryResponse r = client.query("SELECT 1").get()) {
                // Expected: succeed via the healthy second endpoint after the first fails.
                // Actual:   all N+1 attempts hit 127.0.0.1:1 and the query throws a
                //           connection-refused / connect-timeout exception.
            }
        }
    }
}
```

Expected: the retry loop tries `localhost:8123` after `127.0.0.1:1` fails to connect, and the query returns successfully.

Actual: every retry repeatedly targets `127.0.0.1:1` because `getNextAliveNode()` returns `endpoints.get(0)` unconditionally; the healthy alternate is never tried.

## Suggested fix

Location: `Client.java:2200-2202`.

At minimum, `getNextAliveNode()` should rotate across `endpoints`, for example with an `AtomicInteger` round-robin index that is incremented on each call and taken modulo `endpoints.size()`.

A more complete fix would also:

- track endpoints marked as failed within the current retry loop and skip them;
- advance the index only on retries triggered by connection-level failures or retryable 5xx responses, so that successful requests keep affinity to a single endpoint.
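To make the suggestion concrete, here is a minimal sketch of such a selector. It is not client-v2 code: `RotatingEndpointSelector`, `current()`, and `next(Set)` are hypothetical names, the endpoint type is left generic, and the caller is assumed to maintain the per-request set of failed endpoints. The scan in `next` is not atomic as a whole, so concurrent callers may skip an endpoint; that seems acceptable for failover but is a design choice to review.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Set;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical helper, not part of client-v2: rotates over the configured
// endpoints and skips those the caller reports as failed in this retry loop.
final class RotatingEndpointSelector<T> {
    private final List<T> endpoints;
    private final AtomicInteger cursor = new AtomicInteger();

    RotatingEndpointSelector(List<T> endpoints) {
        if (endpoints.isEmpty()) {
            throw new IllegalArgumentException("at least one endpoint is required");
        }
        this.endpoints = Collections.unmodifiableList(new ArrayList<>(endpoints));
    }

    // Endpoint for a first attempt. Does not advance the cursor, so
    // successful requests keep affinity to the same endpoint.
    T current() {
        return endpoints.get(Math.floorMod(cursor.get(), endpoints.size()));
    }

    // Called after a retryable failure: advance to the next endpoint that is
    // not in 'failed'. If every endpoint has failed, fall back to plain
    // rotation (one step past the endpoint that was current before the call).
    T next(Set<T> failed) {
        for (int i = 0; i < endpoints.size(); i++) {
            T candidate = endpoints.get(Math.floorMod(cursor.incrementAndGet(), endpoints.size()));
            if (!failed.contains(candidate)) {
                return candidate;
            }
        }
        return endpoints.get(Math.floorMod(cursor.incrementAndGet(), endpoints.size()));
    }
}
```

In this sketch a retry call site would add the endpoint that just failed to its local `failed` set and then call `next(failed)` instead of re-reading the first element; `Math.floorMod` keeps the index non-negative even if the counter overflows.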

## Link

Related upstream report (Go client analogue): https://github.com/ClickHouse/clickhouse-go/issues/136
Closed feature request for the same functionality in client-v2: #1838
