Hystrix circuitbreaker / dashboard issue #236

Closed
rvdwenden opened this Issue Mar 28, 2014 · 11 comments

Comments

Projects
None yet
3 participants

Used Hystrix version (hystrix-core, hystrix-metrics-event-stream, hystrix-codehale-metrics-publisher): 1.3.13
Used Hystrix dashboard: 1.3.9

The circuitbreaker works fine (i.e. opens/closes) in our application, as one aspects, without using a dashboard session for this application.
But as soon as we use the dashboard on this application, the circuitbreaker doens't work fine anymore: i.e. the circuitbreaker opens after failures (that's ok), but doesn't recover (closes) after the services work fine again.
When we stop the dashboard for a little while, and start again, we see that the open circuit breaker is closed again ! We see this behavior over and over.

I looked at the Java circuitbreaker logic and found out that the isOpen() method does some extra bookkeeping. The circuitbreaker and streaming-servlet calls isOpen().
I changed the Hystrix' Circuitbreaker code: I changed the isOpen() to 'return circuitOpen.get()' only. I created a new method isOpenPlus() based on the existing isOpen() functionality and changed the isOpen() calls in allowRequest() into a call to isOpenPlus().
And hey presto, this works fine together with dashboard. We cannot explain why this opproach works (or at least seems to work).

Can you explain and/or fix this in a new release of Hystrix (no need to change streaming-servlet, nor the dashboard)

Contributor

benjchristensen commented Mar 28, 2014

Sorry, I don't understand the problem, nor do I see how the isOpen() method as currently implemented could result in the circuit not recovering. One strong piece of evidence to this is that at Netflix we've been running this code for a long time with constant real-time monitoring via the streaming-servlet and not seen the described behavior.

That doesn't mean a bug doesn't exist, I'm just not seeing it as I read through the code. Can you please provide a unit test demonstrating the issue?

The existing unit tests are here: https://github.com/Netflix/Hystrix/blob/master/hystrix-core/src/test/java/com/netflix/hystrix/HystrixCircuitBreakerTest.java

Some background on the related code ...

The isOpen() method is doing the "extra bookkeeping" on purpose, it is what calculates whether the circuit needs to be tripped.

When the circuit is tripped, isOpen() will always short-cut to immediately returning true.

The only way it will start returning false again is if an actual request (not the dashboard) gets permitted via allowSingleTest() and it succeeds and then calls markSuccess().

The "single test" route is permitted via this code inside allowRequest:

return !isOpen() || allowSingleTest();

I created a little unit test to recreate the problem.
You can play with the log-counter-logic in the logHC() method. The more a call is made to retrieve metrics from a command, the more it won't work, i.e. the circuit stays open.

package yourpackage

import com.netflix.hystrix.HystrixCommand;
import com.netflix.hystrix.HystrixCommandGroupKey;
import com.netflix.hystrix.HystrixCommandMetrics;
import com.netflix.hystrix.strategy.HystrixPlugins;
import com.netflix.hystrix.strategy.executionhook.HystrixCommandExecutionHook;
import org.junit.Before;
import org.junit.Test;
import org.junit.runner.RunWith;
import org.mockito.runners.MockitoJUnitRunner;

import java.util.Random;

/** @author Rob van der Wenden */
@RunWith(MockitoJUnitRunner.class)
@SuppressWarnings("rawtypes")   
public class HystrixCircuitBreakerTest {

    @Before
    public void setup() {
        HystrixPlugins.getInstance().registerCommandExecutionHook(new MyHystrixCommandExecutionHook());
    }

   @Test
   public void testListenerNOK() throws InterruptedException {

        performLoad(200, 0, 40);
        performLoad(250, 100, 40);
        performLoad(300, 0, 40);

   }

   void performLoad(int totalNumCalls, int errPerc, int waitMillis) {

      Random rnd = new Random();

      for (int i = 0; i < totalNumCalls; i++) {
         //System.out.println(i);

         try {
            boolean err = rnd.nextFloat() * 100 < errPerc;

            TestCommand cmd = new TestCommand(err);
            cmd.execute();

         } catch (Exception e) {
            //System.err.println(e.getMessage());
         }

         try {
            Thread.sleep(waitMillis);
         } catch (InterruptedException e) {
         }
      }
   }

   public class TestCommand extends HystrixCommand<String> {

      boolean error;

      public TestCommand(final boolean error) {
            super(HystrixCommandGroupKey.Factory.asKey("group"));

         this.error = error;
      }

      @Override
      protected String run() throws Exception {

         if (error) {
            throw new Exception();
         } else {
            return "response";
         }
      }

      @Override
      protected String getFallback() {
            if (isFailedExecution()) {
             return getFailedExecutionException().getMessage();
            } else {
                return "other fail reason";
            }
      }

   }


    public class MyHystrixCommandExecutionHook extends HystrixCommandExecutionHook {

        @Override
        public <T> T onComplete(final HystrixCommand<T> command, final T response) {

            logHC(command);

            return super.onComplete(command, response);
        }

        private int counter = 0;

        private <T> void logHC(HystrixCommand<T> command) {

            //if ((counter++ % 20) == 0) {
                HystrixCommandMetrics metrics = command.getMetrics();
                System.out.println("cb/error-count/%/total: "
                        + command.isCircuitBreakerOpen() + " "
                        + metrics.getHealthCounts().getErrorCount() + " "
                        + metrics.getHealthCounts().getErrorPercentage() + " "
                        + metrics.getHealthCounts().getTotalRequests());
            //}
        }
    }

}

I solved this (in my local version of Hystrix) by changing the resetCounter() logic in HystrixCommandMetrics like so:

    /* package */void resetCounter() {
        counter.reset();
        healthCountsSnapshot = null;
   }

because the getHealthCounts() used an invalid/old version of HealthCounts and this invalid/old version is also retrieved by the HystrixCircuitBreaker isOpen() : metrics.getHealthCounts().
This causes the circuit to be opened again, just after it was closed !!!

Can you fix this in the code for 1.3.14 ?

Xorlev commented Apr 14, 2014

@rvdwenden Out of curiosity, what kind of load do you have on your circuits? This might explain some flapping behavior we've seen.

Contributor

benjchristensen commented Apr 15, 2014

@rvdwenden I'll be looking at this soon, been busy elsewhere.

We have applications running on 50 tps (order of magnitude)

Next to above mentioned proposed fix/solution, I still strongly believe that the HystrixCircuitBreaker isOpen() shouldn't have side-effects and only yield something like circuitOpen.get(). See my 1st remark.

Contributor

benjchristensen commented Apr 15, 2014

I'll be looking at this closely in the near future. I'm not jumping on it as it has been this way in production for over 2 years with circuits ranging from <1rps to >5000rps and I've been busy recently on some other higher priority things.

Don't hurry. Now we can explain this behavior, we can take migating measures to this issue on our side!

But, by the way, we changed the load program, as described above, a little bit. we manage to produce loads between 1000 to 10.000 tps easily.
Every 500 milliseconds (just like Hystrix dashboard) the program reads the circuitbreaker state and health counts.
As we aspected, the issue occurs at these tps's, still (using out-of-the-box default property settings)

Contributor

benjchristensen commented Apr 25, 2014

Great bug report everyone ... I've fixed it in both 1.3.x and master (1.4) and will release both versions shortly.

Amazing this has escaped detection for years.

Thank you!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment