com.netflix.astyanax.connectionpool.exceptions.NoAvailableHostsException #127

Closed
orrgal1 opened this Issue Oct 12, 2012 · 10 comments

@orrgal1

Getting unstable behavior with an integration test: I have some code that does about 20 writes in a short time. If I put breakpoints in and run it step by step, taking time between steps, it usually succeeds. If I attempt to run it in one go without stops, it fails with the above exception. Here is the full trace. Running on Windows 7 64-bit.

com.netflix.astyanax.connectionpool.exceptions.NoAvailableHostsException: NoAvailableHostsException: [host=None(0.0.0.0):0, latency=0(0), attempts=0] No hosts to borrow from
at com.netflix.astyanax.connectionpool.impl.RoundRobinExecuteWithFailover.<init>(RoundRobinExecuteWithFailover.java:31)
at com.netflix.astyanax.connectionpool.impl.RoundRobinConnectionPoolImpl.newExecuteWithFailover(RoundRobinConnectionPoolImpl.java:52)
at com.netflix.astyanax.connectionpool.impl.AbstractHostPartitionConnectionPool.executeWithFailover(AbstractHostPartitionConnectionPool.java:229)
at com.netflix.astyanax.thrift.ThriftKeyspaceImpl.executeOperation(ThriftKeyspaceImpl.java:455)
at com.netflix.astyanax.thrift.ThriftKeyspaceImpl.access$400(ThriftKeyspaceImpl.java:62)
at com.netflix.astyanax.thrift.ThriftKeyspaceImpl$1.execute(ThriftKeyspaceImpl.java:115)
at com.numbeez.server.dao.VDaoUtils.putString(VDaoUtils.java:101)
at com.numbeez.server.dao.VServiceUtils.putString(VServiceUtils.java:49)
at com.numbeez.server.kpi.KpiUtils.logKpi(KpiUtils.java:133)
at com.numbeez.server.tests.KpiDumperTests.logKpi(KpiDumperTests.java:50)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
at java.lang.reflect.Method.invoke(Unknown Source)
at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:45)
at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:15)
at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:42)
at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:20)
at org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:28)
at org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:30)
at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:263)
at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:68)
at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:47)
at org.junit.runners.ParentRunner$3.run(ParentRunner.java:231)
at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:60)
at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:229)
at org.junit.runners.ParentRunner.access$000(ParentRunner.java:50)
at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:222)
at org.junit.runners.ParentRunner.run(ParentRunner.java:300)
at org.eclipse.jdt.internal.junit4.runner.JUnit4TestReference.run(JUnit4TestReference.java:50)
at org.eclipse.jdt.internal.junit.runner.TestExecution.run(TestExecution.java:38)
at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:467)
at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:683)
at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.run(RemoteTestRunner.java:390)
at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.main(RemoteTestRunner.java:197)

@jvalencia

I have a similar issue. Quick succession of writes often gives this error.

@orrgal1

I think this is resolved if you configure the connection pool to be more resilient: https://github.com/Netflix/astyanax/wiki/Configuration
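
For example, something along these lines (a sketch only; the keyspace, pool name, seed, and all values here are placeholders rather than recommendations; ExponentialBackoff is in com.netflix.astyanax.retry):

AstyanaxContext<Keyspace> context = new AstyanaxContext.Builder()
    .forKeyspace("MyKeyspace")
    .withAstyanaxConfiguration(new AstyanaxConfigurationImpl()
        // retry failed operations with backoff instead of failing fast:
        // 250ms base sleep, up to 10 attempts
        .setRetryPolicy(new ExponentialBackoff(250, 10)))
    .withConnectionPoolConfiguration(new ConnectionPoolConfigurationImpl("MyPool")
        .setPort(9160)
        .setSeeds("127.0.0.1:9160")
        .setMaxConnsPerHost(10)   // headroom for bursts of writes
        .setConnectTimeout(5000))
    .withConnectionPoolMonitor(new CountingConnectionPoolMonitor())
    .buildKeyspace(ThriftFamilyFactory.getInstance());
context.start(); // the pool has no hosts to borrow from until start() is called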

@jvalencia

.withConnectionPoolConfiguration(
    new ConnectionPoolConfigurationImpl("MyConnectionPool") // Defaults in comments
        .setPort(port)
        .setSeeds(seed)
        .setMaxOperationsPerConnection(100)       // 10000
        .setMaxPendingConnectionsPerHost(20)      // 2
        .setMaxConnsPerHost(20)                   // 2
        .setConnectionLimiterMaxPendingCount(20)  // 20
        .setTimeoutWindow(10000)                  // 10000
        .setConnectionLimiterWindowSize(1000)     // 2000
        .setMaxTimeoutCount(3)                    // 3
        .setConnectTimeout(5000)                  // 2000
        .setMaxFailoverCount(-1)                  // -1
        .setLatencyAwareBadnessThreshold(20)      // float DEFAULT_LATENCY_AWARE_BADNESS_THRESHOLD = 0.10f
        .setLatencyAwareUpdateInterval(1000)      // 10000
        .setLatencyAwareResetInterval(10000)      // 60000
        .setLatencyAwareWindowSize(100)           // 100
        .setLatencyAwareSentinelCompare(100f)     // DEFAULT_LATENCY_AWARE_SENTINEL_COMPARE = 0.768f
)

Any suggestions?

@jvalencia

I suppose I should note that if I disconnect after each operation, I get no problems. It really is related to the connection pool. I've seen a number of open, unanswered threads with similar issues and little guidance. Perhaps this is a good area for better docs.

As an aside, we are trying to run with multiple threads sharing a connection pool. Could the pool simply not be thread-safe?

@elandau
Netflix, Inc. member

The connection pool is thread-safe and non-blocking. I'll go ahead and add a separate page to the wiki further detailing the connection pool design. Can you please send a unit test which reproduces the problem? Also, please update to the latest version, since there have been recent improvements to connection pool stability.
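
A skeleton along these lines would be enough (assumptions: a single local node on 127.0.0.1:9160 and an existing keyspace "TestKs" with column family "TestCf"; all names are placeholders):

import org.junit.Test;

import com.netflix.astyanax.AstyanaxContext;
import com.netflix.astyanax.Keyspace;
import com.netflix.astyanax.MutationBatch;
import com.netflix.astyanax.connectionpool.NodeDiscoveryType;
import com.netflix.astyanax.connectionpool.impl.ConnectionPoolConfigurationImpl;
import com.netflix.astyanax.connectionpool.impl.CountingConnectionPoolMonitor;
import com.netflix.astyanax.impl.AstyanaxConfigurationImpl;
import com.netflix.astyanax.model.ColumnFamily;
import com.netflix.astyanax.serializers.StringSerializer;
import com.netflix.astyanax.thrift.ThriftFamilyFactory;

public class RapidWriteReproTest {

    // Hypothetical column family; assumes keyspace "TestKs" and CF "TestCf"
    // already exist on the local node.
    private static final ColumnFamily<String, String> CF =
            new ColumnFamily<String, String>("TestCf",
                    StringSerializer.get(), StringSerializer.get());

    @Test
    public void rapidWritesShouldNotExhaustThePool() throws Exception {
        AstyanaxContext<Keyspace> context = new AstyanaxContext.Builder()
                .forKeyspace("TestKs")
                .withAstyanaxConfiguration(new AstyanaxConfigurationImpl()
                        .setDiscoveryType(NodeDiscoveryType.NONE))
                .withConnectionPoolConfiguration(
                        new ConnectionPoolConfigurationImpl("ReproPool")
                                .setPort(9160)
                                .setSeeds("127.0.0.1:9160"))
                .withConnectionPoolMonitor(new CountingConnectionPoolMonitor())
                .buildKeyspace(ThriftFamilyFactory.getInstance());
        context.start(); // without start() the pool has no hosts to borrow from
        try {
            Keyspace keyspace = context.getClient();
            // ~20 back-to-back writes, as in the original report
            for (int i = 0; i < 20; i++) {
                MutationBatch m = keyspace.prepareMutationBatch();
                m.withRow(CF, "row-" + i).putColumn("col", "value-" + i, null);
                m.execute();
            }
        } finally {
            context.shutdown();
        }
    }
}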

@jvalencia

Unfortunately, the code I have is nested deep inside another application. However, I was able to mitigate the issue by switching node discovery from RING_DESCRIBE to NONE, as sketched below.
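
In configuration terms, presumably just this change (a sketch):

new AstyanaxConfigurationImpl()
    .setDiscoveryType(NodeDiscoveryType.NONE) // was NodeDiscoveryType.RING_DESCRIBE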

@rabinnh

I have found the same issue. The test configuration is a single node configured for NodeDiscoveryType.RING_DESCRIBE and ConnectionPoolType.TOKEN_AWARE. My sample code does the following:

buildKeyspace
createKeyspace
keyspace.createColumnFamily
Insert some data

If the column family does not exist, this works every time. However, running the following fails on a one-node cluster:
buildKeyspace
dropKeyspace (FAILS! with NoAvailableHostsException every time)

Reconfiguring with NodeDiscoveryType.NONE and ConnectionPoolType.BAG fixes the issue (sketch below). The faulty code appears to be in:

size = topology.getAllPools().getPools().size();

which returns 0 even though there is a node in the cluster.
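
For reference, the working reconfiguration looks roughly like this (a sketch; it swaps out both settings from the failing RING_DESCRIBE/TOKEN_AWARE combination):

new AstyanaxConfigurationImpl()
    .setDiscoveryType(NodeDiscoveryType.NONE)
    .setConnectionPoolType(ConnectionPoolType.BAG);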

@raresp

I had the same problem.
Thanks, rabinnh, for your post!

@pokstar

👍
I was able to reproduce the bug in Astyanax 1.56.42 using Cassandra 1.2.6 with:

new AstyanaxConfigurationImpl()
    ...
    .setDiscoveryType(NodeDiscoveryType.RING_DESCRIBE)
    .setConnectionPoolType(ConnectionPoolType.TOKEN_AWARE);

@opuneet

I just tried this and I'm unable to reproduce it. Please make sure that you call context.start(), since that initializes the connection pool.

Closing this out assuming that this isn't an issue anymore.

@opuneet opuneet closed this Oct 9, 2013