This repository has been archived by the owner on Dec 13, 2023. It is now read-only.

Dyno client in Conductor doesn't try the second Dynomite server when the first one is unreachable #354

Closed
manjurad opened this issue Oct 27, 2017 · 6 comments

Comments

@manjurad

manjurad commented Oct 27, 2017

I have set up Conductor with a 3-rack Dynomite cluster. For testing purposes I shut down connectivity to the first Dynomite server, hoping Conductor would use the other servers. That doesn't seem to happen. It creates connection pools for the first server and the other two, but keeps trying the one that is not reachable. I see these logs over and over.

2081250 [pool-2-thread-1] INFO com.netflix.dyno.connectionpool.impl.ConnectionPoolImpl - Adding host connection pool for host: Host [hostname=host1, ipAddress=null, port=8102, rack: us-east-1a, datacenter: us-east-1, status: Up]
2081250 [pool-2-thread-1] INFO com.netflix.dyno.connectionpool.impl.HostConnectionPoolImpl - Priming connection pool for host:Host [hostname=host1.default.svc.cluster.local, ipAddress=null, port=8102, rack: us-east-1a, datacenter: us-east-1, status: Up], with conns:10
2081252 [pool-2-thread-1] WARN com.netflix.dyno.connectionpool.impl.HostConnectionPoolImpl - Unable to make any successful connections to host Host [hostname=host1.default.svc.cluster.local, ipAddress=null, port=8102, rack: us-east-1a, datacenter: us-east-1, status: Up]
2081254 [pool-2-thread-1] INFO com.netflix.dyno.connectionpool.impl.ConnectionPoolImpl - Failed to init host pool for host: Host [hostname=host1.default.svc.cluster.local, ipAddress=null, port=8102, rack: us-east-1a, datacenter: us-east-1, status: Up]
com.netflix.dyno.connectionpool.exception.DynoConnectException: DynoConnectException: [host=Host [hostname=UNKNOWN, ipAddress=UNKNOWN, port=0, rack: UNKNOWN, datacenter: UNKNOW, status: Down], latency=0(0), attempts=0]Unable to make ANY successful connections to host Host [hostname=host1.default.svc.cluster.local, ipAddress=null, port=8102, rack: us-east-1a, datacenter: us-east-1, status: Up]
at com.netflix.dyno.connectionpool.impl.HostConnectionPoolImpl.primeConnections(HostConnectionPoolImpl.java:173)
at com.netflix.dyno.connectionpool.impl.ConnectionPoolImpl.addHost(ConnectionPoolImpl.java:176)
at com.netflix.dyno.connectionpool.impl.ConnectionPoolImpl.addHost(ConnectionPoolImpl.java:151)
at com.netflix.dyno.connectionpool.impl.ConnectionPoolImpl.updateHosts(ConnectionPoolImpl.java:261)
at com.netflix.dyno.connectionpool.impl.ConnectionPoolImpl$3.run(ConnectionPoolImpl.java:537)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)

----- Conductor config

# Database persistence model.  Possible values are memory, redis, and dynomite.
# If omitted, the persistence used is memory
#
# memory : The data is stored in memory and lost when the server dies.  Useful for testing or demo
# redis : non-Dynomite based redis instance
# dynomite : Dynomite cluster.  Use this for HA configuration.

db=dynomite

# Dynomite Cluster details.
# format is host:port:rack separated by semicolon
workflow.dynomite.cluster.hosts=host1:8102:us-east-1a;host2:8102:us-east-1b;host3:8102:us-east-1c

# Dynomite cluster name
workflow.dynomite.cluster.name=dyn_o_mite

# Namespace for the keys stored in Dynomite/Redis
workflow.namespace.prefix=conductor

# Namespace prefix for the dyno queues
workflow.namespace.queue.prefix=conductor_queues

# No. of threads allocated to dyno-queues (optional)
queues.dynomite.threads=10

# Non-quorum port used to connect to local redis.  Used by dyno-queues.
# When using redis directly, set this to the same port as redis server
# For Dynomite, this is 22122 by default or the local redis-server port used by Dynomite.
queues.dynomite.nonQuorum.port=22122


# Transport address to elasticsearch
workflow.elasticsearch.url=elasticsearch.default.svc.cluster.local:9300

# Name of the elasticsearch cluster
workflow.elasticsearch.index.name=conductor

EC2_AVAILABILTY_ZONE=us-east-1a

# Additional modules (optional)
# conductor.additional.modules=class_extending_com.google.inject.AbstractModule

# Load sample kitchen sink workflow
# loadSample=true
@cheveyo20

I have the same issue. I think it has something to do with the TokenMapSupplier: the Dyno Jedis client can only load balance if it is aware of the topology, and Dynomite does not currently expose that via an API (though I think it is planned).

@cheveyo20

I also made the same mistake you did, because of copy & paste. It's:
EC2_AVAILABILITY_ZONE=us-east-1a
but your question stays the same ;)

@manjurad
Author

@cheveyo20 I tried the second AZ to see if Conductor would use the second listed server in the specified AZ, but that didn't help either. It looks like Conductor only tries the first server listed, irrespective of AZ. I was under the impression that the Dyno client is topology-aware and would pick the server closest to this instance, but that doesn't seem to happen.

Can anyone clarify whether this is supposed to work, or does it rely on some other service for failover and AZ awareness to work correctly within Conductor?

@cheveyo20

cheveyo20 commented Oct 30, 2017

@manjurad The connection to the Dynomite cluster is made via the Dyno Jedis client (which Conductor uses); see https://github.com/Netflix/dyno/wiki/Getting-started-with-Redis-client for details. As far as I know, there are implementations for Netflix Eureka service discovery and for HashiCorp Consul (https://www.consul.io).

It's set to Eureka by default. See line 74 of
https://github.com/Netflix/conductor/blob/ff3298b8cf6160431428f19bd89fce95c5f2a5e7/redis-persistence/src/main/java/com/netflix/conductor/dao/dynomite/queue/DynoQueueDAO.java
and line 136 of
https://github.com/Netflix/conductor/blob/ff3298b8cf6160431428f19bd89fce95c5f2a5e7/server/src/main/java/com/netflix/conductor/server/ConductorServer.java

Hopefully someone with more experience can help with a concrete example :)

@gauravmishrakec

Hi all,
Has this issue been fixed? I am facing the same issue.

@manjurad
Author

You need to pass a TokenMap for the Dyno hosts; without it, failover doesn't happen. I did the following to get failover to work.

Create a token map with the entire token space assigned to each host in the config file - this means there is no sharding of data.

https://github.com/CiscoM31/conductor/blob/master/server/src/main/java/com/netflix/conductor/server/ConductorServer.java#L87

Use that map to create the connection pool
https://github.com/CiscoM31/conductor/blob/master/server/src/main/java/com/netflix/conductor/server/ConductorServer.java#L168
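
For readers who cannot follow the links, here is a minimal, dependency-free sketch of the idea described above: parse the `host:port:rack` list from `workflow.dynomite.cluster.hosts` and give every host the same token, so that each host owns the entire token space and nothing is sharded. This is not the code from the linked fork. `FullTokenSpaceMap`, `Entry`, `fromHostsProperty`, and the token value `1` are hypothetical names and values chosen for illustration. In a real setup the entries would presumably be turned into Dyno's own host and token objects and handed to the connection pool through a `TokenMapSupplier`.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: plain Java stand-ins, not Dyno classes.
final class FullTokenSpaceMap {

    // Assumption for this sketch: the concrete value is arbitrary;
    // what matters is that every host is assigned the same token.
    static final long SHARED_TOKEN = 1L;

    // One configured host together with the token assigned to it.
    static final class Entry {
        final String hostname;
        final int port;
        final String rack;
        final long token;

        Entry(String hostname, int port, String rack, long token) {
            this.hostname = hostname;
            this.port = port;
            this.rack = rack;
            this.token = token;
        }

        @Override
        public String toString() {
            return hostname + ":" + port + " rack=" + rack + " token=" + token;
        }
    }

    // Parses "host:port:rack;host:port:rack;..." and assigns SHARED_TOKEN to each host.
    static List<Entry> fromHostsProperty(String hosts) {
        List<Entry> entries = new ArrayList<>();
        for (String spec : hosts.split(";")) {
            String trimmed = spec.trim();
            if (trimmed.isEmpty()) {
                continue; // tolerate a trailing semicolon
            }
            String[] parts = trimmed.split(":");
            if (parts.length != 3) {
                throw new IllegalArgumentException("Expected host:port:rack but got: " + trimmed);
            }
            entries.add(new Entry(parts[0], Integer.parseInt(parts[1]), parts[2], SHARED_TOKEN));
        }
        return entries;
    }

    public static void main(String[] args) {
        String hosts = "host1:8102:us-east-1a;host2:8102:us-east-1b;host3:8102:us-east-1c";
        for (Entry e : fromHostsProperty(hosts)) {
            System.out.println(e);
        }
    }
}
```

Because every entry carries the same token, any key maps to every host, which is what should let the client fall back to another rack when one host is down. The trade-off, as noted above, is that data is not sharded across hosts.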
