Skip to content

bug: EtcdServiceRegistry.start() fails immediately if etcd unavailable with no retry #287

@sfloess

Description

@sfloess

Bug Description

The start() method creates the etcd client without attempting to verify connectivity or providing any retry logic. If etcd is temporarily unavailable at startup, the registry starts successfully but all operations will fail until manually restarted.

Location

jplatform-registry-etcd/src/main/java/org/flossware/jplatform/registry/etcd/EtcdServiceRegistry.java:76-86

Problematic Code

public void start() {
    if (client == null) {
        client = Client.builder()
            .endpoints(config.getEndpoints().toArray(new String[0]))
            .build();  // No connection verification!
    }

    leaseRenewalExecutor = Executors.newSingleThreadScheduledExecutor();
    leaseRenewalExecutor.scheduleAtFixedRate(this::renewLeases,
        config.getLeaseTtl() / 2, config.getLeaseTtl() / 2, TimeUnit.SECONDS);
}

Impact

  • Silent failure: start() succeeds even if etcd is down
  • Deferred errors: Operations fail later with connection errors
  • No retry: Once client created, no reconnection attempts
  • Service unavailable: Applications can't register services
  • Misleading state: Registry appears started but doesn't work
  • Poor resilience: Can't handle temporary etcd outages during startup

Example

// etcd cluster is down for maintenance
EtcdServiceRegistry registry = new EtcdServiceRegistry(config);

registry.start();
// Client created successfully
// No error thrown
// Method returns normally
// User thinks registry is ready

registry.registerService(MyService.class, impl);
// First operation attempts to connect
// Connection refused or timeout
// Exception thrown
// Service not registered

// User must manually restart application
// Or catch exception and retry
// No automatic recovery

// OR etcd comes back online 1 minute later
registry.registerService(MyService.class, impl);
// Still fails - client never reconnects
// Must call start() again or recreate registry

Proposed Fix

Option 1 - Verify connectivity on start with retry:

public void start() {
    synchronized (startLock) {
        if (started) {
            logger.warn("EtcdServiceRegistry already started");
            return;
        }
        
        if (client == null) {
            // Validate endpoints
            List<String> endpoints = config.getEndpoints();
            if (endpoints == null || endpoints.isEmpty()) {
                throw new IllegalStateException("No etcd endpoints configured");
            }
            
            client = Client.builder()
                .endpoints(endpoints.toArray(new String[0]))
                .build();
            
            logger.info("Created etcd client with endpoints: {}", endpoints);
        }

        // Verify connectivity with retry
        int maxRetries = 5;
        int retryDelay = 2000; // 2 seconds
        boolean connected = false;
        
        for (int attempt = 1; attempt <= maxRetries; attempt++) {
            try {
                // Simple connectivity check - list keys
                client.getKVClient()
                    .get(ByteSequence.from("/health", StandardCharsets.UTF_8))
                    .get(5, TimeUnit.SECONDS);
                
                logger.info("Successfully connected to etcd on attempt {}", attempt);
                connected = true;
                break;
                
            } catch (Exception e) {
                logger.warn("Failed to connect to etcd (attempt {}/{}): {}", 
                           attempt, maxRetries, e.getMessage());
                
                if (attempt < maxRetries) {
                    try {
                        Thread.sleep(retryDelay);
                    } catch (InterruptedException ie) {
                        Thread.currentThread().interrupt();
                        throw new IllegalStateException("Interrupted while connecting to etcd", ie);
                    }
                } else {
                    throw new IllegalStateException(
                        "Failed to connect to etcd after " + maxRetries + " attempts", e);
                }
            }
        }

        if (!connected) {
            throw new IllegalStateException("Could not connect to etcd");
        }

        // Start lease renewal
        leaseRenewalExecutor = Executors.newSingleThreadScheduledExecutor(r -> {
            Thread t = new Thread(r, "etcd-lease-renewal");
            t.setDaemon(true);
            return t;
        });
        
        long renewalPeriod = config.getLeaseTtl() / 2;
        leaseRenewalExecutor.scheduleAtFixedRate(() -> {
            try {
                renewLeases();
            } catch (Exception e) {
                logger.error("Exception during lease renewal", e);
            }
        }, renewalPeriod, renewalPeriod, TimeUnit.SECONDS);
        
        started = true;
        logger.info("EtcdServiceRegistry started successfully");
    }
}

Option 2 - Make connectivity check optional via configuration:

public class EtcdRegistryConfig {
    private final boolean verifyConnectivity;
    private final int connectionRetries;
    private final int connectionRetryDelayMs;
    
    public static class Builder {
        private boolean verifyConnectivity = true;
        private int connectionRetries = 5;
        private int connectionRetryDelayMs = 2000;
        
        public Builder verifyConnectivity(boolean verify) {
            this.verifyConnectivity = verify;
            return this;
        }
        
        public Builder connectionRetries(int retries) {
            this.connectionRetries = retries;
            return this;
        }
        
        public Builder connectionRetryDelay(int delayMs) {
            this.connectionRetryDelayMs = delayMs;
            return this;
        }
    }
}

// In start():
if (config.isVerifyConnectivity()) {
    verifyConnectivity(config.getConnectionRetries(), config.getConnectionRetryDelayMs());
}

Metadata

Metadata

Assignees

No one assigned

    Labels

    bugSomething isn't working

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions