
Timeouts


Timeouts in Astyanax

Configuring timeouts in Astyanax is simple; however, understanding how the timeouts actually affect the application is not entirely trivial.

Here are the two important parameters for configuring timeouts.

connectTimeout - the maximum time, in milliseconds, to wait when creating a new connection to Cassandra. Astyanax re-uses connections, so you can usually leave this at the default of 3000 ms.

socketTimeout - the maximum read timeout on the actual socket call, just like in HttpClient. The default is 10 seconds, so yes, this is one you can bring down.

Both of these parameters are available on the ConnectionPoolConfigurationImpl object.
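
For example, here is a minimal sketch of setting both values in code, assuming the fluent setters on ConnectionPoolConfigurationImpl and an illustrative pool name:

    import com.netflix.astyanax.connectionpool.impl.ConnectionPoolConfigurationImpl;

    // Pool name and timeout values are illustrative; tune them for your own cluster.
    ConnectionPoolConfigurationImpl poolConfig =
        new ConnectionPoolConfigurationImpl("MyConnectionPool")
            .setConnectTimeout(2000)  // max time (ms) to establish a new connection
            .setSocketTimeout(200);   // max read timeout (ms) on the socket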

Within Netflix we use Archaius to configure Astyanax, so the mechanism for setting these values (within Netflix) is:

{CLUSTER}.{keyspace}.astyanax.connectTimeout=2000 (2000 ms)

{CLUSTER}.{keyspace}.astyanax.socketTimeout=200 (200 ms)

Note that you need to supply your CLUSTER and keyspace names.

**But please note two important things:**

socketTimeout is not an end-to-end operation timeout

The socketTimeout is the read timeout on the socket. This means that if the Cassandra server takes longer than 200 ms to get back to Astyanax with the first byte, then yes, Astyanax will time out. But if the server responds within, say, 180 ms and then takes more time to deliver the full payload (say 250 ms total), then no timeout will be triggered. This is also how the timeout works in HttpClient.

If what you need is an end-to-end operation timeout, then the best thing to do is to wrap the Astyanax call in another thread and time that thread using something like the following:

    Future<OperationResult<CqlResult<String, String>>> future =
        threadPool.submit(new Callable<OperationResult<CqlResult<String, String>>>() {
            public OperationResult<CqlResult<String, String>> call() throws Exception {
                return keyspace.prepareQuery(CF).withCql(cqlStatement).execute();
            }
        });

    try {
        future.get(200, TimeUnit.MILLISECONDS);
    } catch (TimeoutException e) {
        future.cancel(true); // walk away after 200 ms; the query may still be running server-side
    }

This way, if the call (running in a separate thread) takes more than 200 ms, you can actually walk away from the Astyanax call after 200 ms.

At Netflix we use Hystrix when making calls to remote systems. Hystrix executes the call in a separate thread, and hence can actually time it correctly and walk away from a runaway query that does not get interrupted.
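
As a rough sketch only (the command name, group key, and 200 ms timeout are illustrative assumptions, not our exact internal setup), wrapping the same CQL query in a HystrixCommand with a thread-isolation timeout might look like this:

    import com.netflix.astyanax.Keyspace;
    import com.netflix.astyanax.connectionpool.OperationResult;
    import com.netflix.astyanax.model.ColumnFamily;
    import com.netflix.astyanax.model.CqlResult;
    import com.netflix.hystrix.HystrixCommand;
    import com.netflix.hystrix.HystrixCommandGroupKey;
    import com.netflix.hystrix.HystrixCommandProperties;

    // Illustrative command: runs the CQL query on a Hystrix thread with a 200 ms timeout.
    public class CqlQueryCommand extends HystrixCommand<OperationResult<CqlResult<String, String>>> {

        private final Keyspace keyspace;
        private final ColumnFamily<String, String> cf;
        private final String cqlStatement;

        public CqlQueryCommand(Keyspace keyspace, ColumnFamily<String, String> cf, String cqlStatement) {
            super(Setter.withGroupKey(HystrixCommandGroupKey.Factory.asKey("Cassandra"))
                    .andCommandPropertiesDefaults(HystrixCommandProperties.Setter()
                            .withExecutionIsolationThreadTimeoutInMilliseconds(200)));
            this.keyspace = keyspace;
            this.cf = cf;
            this.cqlStatement = cqlStatement;
        }

        @Override
        protected OperationResult<CqlResult<String, String>> run() throws Exception {
            // Runs on a Hystrix worker thread; if it exceeds 200 ms the caller gets a
            // timeout and walks away, even though this thread may still be blocked.
            return keyspace.prepareQuery(cf).withCql(cqlStatement).execute();
        }
    }

The caller invokes new CqlQueryCommand(keyspace, CF, cqlStatement).execute() (or .queue() for an async Future), and can override getFallback() to handle the timeout case.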

Watch out for retry storms

**Short timeouts do NOT help with retry storms; you also need load shedding.**

If you are shortening your timeouts, then you also need to ensure that the application does not retry immediately, since that essentially causes a retry storm. We've already seen this happen with other teams within Netflix using Cassandra, where a short timeout makes the caller retry even faster, which then causes a retry storm that ultimately takes the database offline. So please ensure that you have some load shedding in place, in case your app starts retrying too aggressively, or in case callers to your app start retrying calls to your service aggressively.

Hystrix semaphore isolation is a nice way to achieve quota / concurrency-limit management.
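
As a sketch (the cap of 25 concurrent requests is an illustrative assumption), semaphore isolation with a concurrency limit can be configured like this and passed to a command via andCommandPropertiesDefaults(...), just as in the example above:

    import com.netflix.hystrix.HystrixCommandProperties;
    import com.netflix.hystrix.HystrixCommandProperties.ExecutionIsolationStrategy;

    // Calls beyond 25 concurrent executions are rejected immediately (and can fall back)
    // instead of queueing up behind a slow Cassandra node.
    HystrixCommandProperties.Setter semaphoreProperties = HystrixCommandProperties.Setter()
            .withExecutionIsolationStrategy(ExecutionIsolationStrategy.SEMAPHORE)
            .withExecutionIsolationSemaphoreMaxConcurrentRequests(25);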

Takeaway

Long story short: if you care about application resiliency when tuning Astyanax, then have a look at Hystrix, which has resiliency built into its very core design principles. Hystrix embraces mature resiliency concepts such as bulkheading, isolation, fallbacks, and circuit breaking, which are exactly what you need to consider when tuning Astyanax timeouts.
