Dyno client in Conductor doesn't try the second Dynomite server when the first one is unreachable #354

manjurad · 2017-10-27T04:54:01Z

I have set up Conductor with a 3-rack Dynomite cluster. For testing purposes I shut down connectivity to the first Dynomite server, hoping Conductor would use the other servers. That doesn't seem to happen. It creates connection pools for the first server and the other two, but keeps trying the one that is not reachable. I see these logs over and over.

2081250 [pool-2-thread-1] INFO com.netflix.dyno.connectionpool.impl.ConnectionPoolImpl - Adding host connection pool for host: Host [hostname=host1, ipAddress=null, port=8102, rack: us-east-1a, datacenter: us-east-1, status: Up]
2081250 [pool-2-thread-1] INFO com.netflix.dyno.connectionpool.impl.HostConnectionPoolImpl - Priming connection pool for host:Host [hostname=host1.default.svc.cluster.local, ipAddress=null, port=8102, rack: us-east-1a, datacenter: us-east-1, status: Up], with conns:10
2081252 [pool-2-thread-1] WARN com.netflix.dyno.connectionpool.impl.HostConnectionPoolImpl - Unable to make any successful connections to host Host [hostname=host1.default.svc.cluster.local, ipAddress=null, port=8102, rack: us-east-1a, datacenter: us-east-1, status: Up]
2081254 [pool-2-thread-1] INFO com.netflix.dyno.connectionpool.impl.ConnectionPoolImpl - Failed to init host pool for host: Host [hostname=host1.default.svc.cluster.local, ipAddress=null, port=8102, rack: us-east-1a, datacenter: us-east-1, status: Up]
com.netflix.dyno.connectionpool.exception.DynoConnectException: DynoConnectException: [host=Host [hostname=UNKNOWN, ipAddress=UNKNOWN, port=0, rack: UNKNOWN, datacenter: UNKNOW, status: Down], latency=0(0), attempts=0]Unable to make ANY successful connections to host Host [hostname=host1.default.svc.cluster.local, ipAddress=null, port=8102, rack: us-east-1a, datacenter: us-east-1, status: Up]
at com.netflix.dyno.connectionpool.impl.HostConnectionPoolImpl.primeConnections(HostConnectionPoolImpl.java:173)
at com.netflix.dyno.connectionpool.impl.ConnectionPoolImpl.addHost(ConnectionPoolImpl.java:176)
at com.netflix.dyno.connectionpool.impl.ConnectionPoolImpl.addHost(ConnectionPoolImpl.java:151)
at com.netflix.dyno.connectionpool.impl.ConnectionPoolImpl.updateHosts(ConnectionPoolImpl.java:261)
at com.netflix.dyno.connectionpool.impl.ConnectionPoolImpl$3.run(ConnectionPoolImpl.java:537)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)

----- Conductor config

# Database persistence model.  Possible values are memory, redis, and dynomite.
# If omitted, the persistence used is memory
#
# memory : The data is stored in memory and lost when the server dies.  Useful for testing or demo
# redis : non-Dynomite based redis instance
# dynomite : Dynomite cluster.  Use this for HA configuration.

db=dynomite

# Dynomite Cluster details.
# format is host:port:rack separated by semicolon
workflow.dynomite.cluster.hosts=host1:8102:us-east-1a;host2:8102:us-east-1b;host3:8102:us-east-1c

# Dynomite cluster name
workflow.dynomite.cluster.name=dyn_o_mite

# Namespace for the keys stored in Dynomite/Redis
workflow.namespace.prefix=conductor

# Namespace prefix for the dyno queues
workflow.namespace.queue.prefix=conductor_queues

# No. of threads allocated to dyno-queues (optional)
queues.dynomite.threads=10

# Non-quorum port used to connect to local redis.  Used by dyno-queues.
# When using redis directly, set this to the same port as redis server
# For Dynomite, this is 22122 by default or the local redis-server port used by Dynomite.
queues.dynomite.nonQuorum.port=22122


# Transport address to elasticsearch
workflow.elasticsearch.url=elasticsearch.default.svc.cluster.local:9300

# Name of the elasticsearch cluster
workflow.elasticsearch.index.name=conductor

EC2_AVAILABILTY_ZONE=us-east-1a

# Additional modules (optional)
# conductor.additional.modules=class_extending_com.google.inject.AbstractModule

# Load sample kitchen sink workflow
# loadSample=true


cheveyo20 · 2017-10-27T14:54:45Z

I have the same issue. I think it has something to do with the TokenMapSupplier: the Dyno Jedis client can only load balance if it is aware of the topology, and exposing the topology via an API is currently not implemented in Dynomite (but planned, I think).

cheveyo20 · 2017-10-30T13:04:56Z

I made the same copy & paste mistake as you did. The property name should be:
EC2_AVAILABILITY_ZONE=us-east-1a
But your question still stands ;)

manjurad · 2017-10-30T16:29:28Z

@cheveyo20 I tried the second AZ to see if Conductor would use the second listed server in the AZ specified, but that didn't help either. It looks like Conductor only tries the first server listed, irrespective of AZ. I was under the impression that the Dyno client is topology-aware and would pick the server closest to this instance, but that doesn't seem to happen.

Can anyone clarify whether this is supposed to work, or whether it relies on some other service for failover and AZ awareness to work correctly within Conductor?

cheveyo20 · 2017-10-30T22:51:50Z

@manjurad The connection to the Dynomite cluster is made via the Dyno Jedis client (which Conductor uses): https://github.com/Netflix/dyno/wiki/Getting-started-with-Redis-client. As far as I know, there are abstract implementations for Netflix Eureka service discovery and for HashiCorp Consul (https://www.consul.io).

It's set to Eureka by default. See line 74 of:
https://github.com/Netflix/conductor/blob/ff3298b8cf6160431428f19bd89fce95c5f2a5e7/redis-persistence/src/main/java/com/netflix/conductor/dao/dynomite/queue/DynoQueueDAO.java
and line 136 of:
https://github.com/Netflix/conductor/blob/ff3298b8cf6160431428f19bd89fce95c5f2a5e7/server/src/main/java/com/netflix/conductor/server/ConductorServer.java

Hopefully someone with more experience can help with a concrete example :)

gauravmishrakec · 2018-03-09T19:22:04Z

Hi all,
Has this issue been fixed? I am facing the same issue.

manjurad · 2018-03-10T00:48:48Z

You need to pass a TokenMap for the Dyno hosts; without it, failover doesn't happen. I did the following to get failover to work.

Create a token map with the entire token space assigned to each host in the config file - this means there is no sharding of data.

https://github.com/CiscoM31/conductor/blob/master/server/src/main/java/com/netflix/conductor/server/ConductorServer.java#L87

Use that map to create the connection pool:
https://github.com/CiscoM31/conductor/blob/master/server/src/main/java/com/netflix/conductor/server/ConductorServer.java#L168
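
As a rough illustration of the first step only (this is not the code from the linked fork), the sketch below parses the `host:port:rack` string from `workflow.dynomite.cluster.hosts` and gives every host the same token, so each host owns the whole token space. The class `TokenMapSketch`, the type `HostTokenEntry`, the method `buildTokenMap`, and the token value `1L` are all hypothetical names chosen for this example; how the result is handed to the Dyno connection pool is not shown here.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Conductor or Dyno.
final class TokenMapSketch {

    /** One entry of the illustrative token map: a host plus the token it owns. */
    static final class HostTokenEntry {
        final String hostname;
        final int port;
        final String rack;
        final long token;

        HostTokenEntry(String hostname, int port, String rack, long token) {
            this.hostname = hostname;
            this.port = port;
            this.rack = rack;
            this.token = token;
        }

        @Override
        public String toString() {
            return hostname + ":" + port + " rack=" + rack + " token=" + token;
        }
    }

    /**
     * Parses "host:port:rack;host:port:rack;..." and assigns the same token to
     * every host, which means no sharding: any host can serve any key.
     */
    static List<HostTokenEntry> buildTokenMap(String hostsProperty, long sharedToken) {
        List<HostTokenEntry> entries = new ArrayList<>();
        for (String hostSpec : hostsProperty.split(";")) {
            String trimmed = hostSpec.trim();
            if (trimmed.isEmpty()) {
                continue; // tolerate a trailing semicolon
            }
            String[] parts = trimmed.split(":");
            if (parts.length != 3) {
                throw new IllegalArgumentException("Expected host:port:rack but got: " + trimmed);
            }
            entries.add(new HostTokenEntry(parts[0], Integer.parseInt(parts[1]), parts[2], sharedToken));
        }
        return entries;
    }

    public static void main(String[] args) {
        // Same value as workflow.dynomite.cluster.hosts in the config above.
        String hosts = "host1:8102:us-east-1a;host2:8102:us-east-1b;host3:8102:us-east-1c";
        for (HostTokenEntry entry : buildTokenMap(hosts, 1L)) {
            System.out.println(entry);
        }
    }
}
```

Because all three entries share one token, a client that consults such a map has no reason to prefer the unreachable host over the other two for any given key.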

manjurad mentioned this issue Nov 1, 2017

Dyno Client with Redis Sentinel setup Netflix/dyno#189

Closed

pctreddy added the help_wanted label Feb 21, 2018

apanicker-nflx closed this as completed May 13, 2021
