Circuit breaker does not close even after remote service has recovered #1633

Open
utwyko opened this issue Jul 19, 2017 · 18 comments

Comments

@utwyko

utwyko commented Jul 19, 2017

We've had the following problem occur three times in about a month:

  • Remote service has issues and does not respond within the set Hystrix timeout period
  • Circuit breaker opens
  • Remote service recovers
  • Circuit breaker does not close as expected

We're running a service on two nodes. A case occurred where one node's circuit breaker closed properly while the other's remained open, even though both nodes talk to the exact same remote service:

[screenshot]

At the same time, the circuit breaker for a call to another endpoint on that same remote service remained open, with no requests at all bypassing the circuit breaker:

[screenshot]

We have hystrix.command.default.circuitBreaker.sleepWindowInMilliseconds set to 1000, which I would expect to let one request per second bypass the circuit breaker.
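For context: after the circuit opens, the sleep window is how long Hystrix waits before letting a single test request through; if that request succeeds the circuit closes, otherwise the window restarts. A minimal sketch of setting this programmatically via Archaius (the property key is the one above; everything else is left at its defaults, and the class name is illustrative):

```java
import com.netflix.config.ConfigurationManager;

public class SleepWindowConfig {
    public static void main(String[] args) {
        // After the circuit opens, wait 1000 ms before allowing a single
        // test request through to probe whether the backend has recovered.
        ConfigurationManager.getConfigInstance().setProperty(
                "hystrix.command.default.circuitBreaker.sleepWindowInMilliseconds", 1000);
    }
}
```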

I've tried to reproduce this in an integration test, where I simulated the remote service timing out/producing errors, and every single time the circuit breaker successfully opens and closes. So unfortunately at this moment I am unable to provide an exact reproduction path.
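For reference, the reproduction attempt looked roughly like the sketch below (a plain HystrixCommand instead of the Javanica annotations we use in production; command names, loop counts and sleeps are illustrative, and all thresholds are the Hystrix defaults):

```java
import com.netflix.hystrix.HystrixCircuitBreaker;
import com.netflix.hystrix.HystrixCommand;
import com.netflix.hystrix.HystrixCommandGroupKey;
import com.netflix.hystrix.HystrixCommandKey;

public class CircuitRecoveryTest {

    static volatile boolean remoteServiceHealthy = false;

    static class RemoteCall extends HystrixCommand<String> {
        RemoteCall() {
            super(Setter
                    .withGroupKey(HystrixCommandGroupKey.Factory.asKey("RemoteService"))
                    .andCommandKey(HystrixCommandKey.Factory.asKey("RemoteCall")));
        }

        @Override
        protected String run() throws Exception {
            if (!remoteServiceHealthy) {
                Thread.sleep(5000); // exceed the default 1000 ms command timeout
            }
            return "ok";
        }

        @Override
        protected String getFallback() {
            return "fallback";
        }
    }

    public static void main(String[] args) throws Exception {
        // Phase 1: remote service "down" -> commands time out -> circuit opens.
        for (int i = 0; i < 50; i++) {
            new RemoteCall().execute();
        }
        HystrixCircuitBreaker breaker = HystrixCircuitBreaker.Factory
                .getInstance(HystrixCommandKey.Factory.asKey("RemoteCall"));
        System.out.println("open after failures: " + breaker.isOpen());

        // Phase 2: remote service "recovers" -> after the sleep window a test
        // request should succeed and the circuit should close again.
        remoteServiceHealthy = true;
        Thread.sleep(6000); // comfortably past the default 5000 ms sleep window
        for (int i = 0; i < 20; i++) {
            new RemoteCall().execute();
            Thread.sleep(100);
        }
        System.out.println("open after recovery: " + breaker.isOpen());
    }
}
```

In the test the final line reliably prints false, i.e. the circuit closes again, which is exactly why the stuck-open behavior can't be reproduced this way.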

Some other information:

  • This occurred in a medium-traffic environment, where the remote services were called with a frequency of about 25 to 300 requests/sec
  • Using Hystrix 1.5.12, Spring-Cloud Netflix Dalston.SR1 and Javanica Hystrix Annotations.
@utwyko utwyko changed the title Circuit breaker does not open even after remote service has recovered Circuit breaker does not close even after remote service has recovered Jul 19, 2017
@gszeliga

gszeliga commented Oct 4, 2017

We might be experiencing the exact same issue but it's really hard to reproduce (it only took place once in PROD and twice under "extreme" load testing).

I'll try to find the time for more digging.

(we're currently using 1.5.13)

@jgaribaldi

We're having the same issue. Hystrix version is 1.5.12 and we're also using Javanica.

@gszeliga

gszeliga commented Nov 7, 2017

I've spent a little more time trying to reproduce the case, without any success. Having said that, going back to the logs, I've seen that a significant refactor around HystrixCircuitBreaker was introduced in version 1.5.12.

Now, I'm going to take a wild guess here, but is it possible that there's a race condition between markSuccess and line 192 in the metrics.getHealthCountsStream() subscriber? There's a small window where:

  • A HALF-OPEN -> CLOSED transition happens
  • In line 192, the status trips from CLOSED -> OPEN
  • The metrics stream gets reset (line 205)

Why am I mentioning this? Because in my specific scenario, the CB gets stuck cycling OPEN -> HALF-OPEN -> CLOSED, then immediately trips back to OPEN, and so on.
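To make the suspected window concrete, here's a deliberately simplified model of that interleaving (this is not Hystrix's code; the status enum and the two actors just mimic the compareAndSet-based state machine and the health-counts subscriber, and all names are mine):

```java
import java.util.concurrent.atomic.AtomicReference;

public class CircuitRaceModel {

    enum Status { CLOSED, OPEN, HALF_OPEN }

    static final AtomicReference<Status> status = new AtomicReference<>(Status.HALF_OPEN);
    // Models the rolling error stats that are only reset at "line 205".
    static volatile boolean staleStatsLookUnhealthy = true;

    // First half of markSuccess(): the single test request has succeeded.
    static void markSuccessTransition() {
        status.compareAndSet(Status.HALF_OPEN, Status.CLOSED);
    }

    // Second half of markSuccess(): the metrics stream gets reset.
    static void markSuccessResetMetrics() {
        staleStatsLookUnhealthy = false;
    }

    // Models the getHealthCountsStream() subscriber ("line 192"): it can still
    // observe the pre-recovery error percentage and trip CLOSED -> OPEN.
    static void healthSubscriberTick() {
        if (staleStatsLookUnhealthy) {
            status.compareAndSet(Status.CLOSED, Status.OPEN);
        }
    }

    public static void main(String[] args) {
        markSuccessTransition();   // HALF-OPEN -> CLOSED
        healthSubscriberTick();    // subscriber fires before the reset: CLOSED -> OPEN
        markSuccessResetMetrics(); // the metrics reset arrives too late
        System.out.println("final state: " + status.get()); // prints OPEN
    }
}
```

If the subscriber tick can land between the two halves like this, the breaker would end up OPEN right after a successful test request, which would match the behavior reported here.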

@francovitali

Hi, we're having the same issue (version 1.5.12): the circuit randomly OPENS AND NEVER CLOSES AGAIN, even though the backend service works properly.

This is our config (we know it's very conservative):

  • hystrix.command.default.metrics.rollingStats.timeInMilliseconds = 10000
  • hystrix.command.default.circuitBreaker.requestVolumeThreshold = 10
  • hystrix.command.default.circuitBreaker.sleepWindowInMilliseconds = 1000
  • hystrix.command.default.circuitBreaker.errorThresholdPercentage = 50

No more "success", "badRequest" or "failures". In fact, the command no longer gets executed (we can see this on other dashboards, such as RestClient metrics).

Suggestion: it would be useful for debugging if circuit "open" and "close" events could be logged via HystrixEventNotifier.markEvent(...).
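markEvent(...) has no dedicated circuit-open/closed event types today, but SHORT_CIRCUITED fires on every execution rejected by an open circuit and can serve as a proxy; a minimal sketch of such a notifier (the log destination and registration point are up to the application):

```java
import com.netflix.hystrix.HystrixCommandKey;
import com.netflix.hystrix.HystrixEventType;
import com.netflix.hystrix.strategy.eventnotifier.HystrixEventNotifier;

public class LoggingEventNotifier extends HystrixEventNotifier {

    @Override
    public void markEvent(HystrixEventType eventType, HystrixCommandKey key) {
        // SHORT_CIRCUITED is emitted whenever an execution is rejected
        // because the circuit is open, so it traces "open" periods.
        if (eventType == HystrixEventType.SHORT_CIRCUITED) {
            System.out.println("Circuit is OPEN for command: " + key.name());
        }
    }
}

// Register once at startup:
// HystrixPlugins.getInstance().registerEventNotifier(new LoggingEventNotifier());
```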

For now, we're relaxing the configuration, downgrading to 1.5.11, and monitoring whether the issue persists.

[screenshot]

@francovitali

Update: with the downgrade we no longer see the circuit stuck in the OPEN state.

However, randomly, when a backend service begins to fail, once the circuit transitions to CLOSED the behavior changes and the circuit begins to OPEN and CLOSE repeatedly (a "twitchy" state, as if the stats were not reset). We have to re-deploy the servers for the circuits to stay CLOSED at the same backend error rate.

There are a lot of open issues and PRs addressing similar problems.

Are there some guidelines (besides the documentation) about safe configuration values for Circuit Breakers?

@JayeshS

JayeshS commented Jan 3, 2018

I've seen this problem occur as well, and downgrading to 1.5.11 fixed it.
It also seems to happen when the backend service is slow (I simulated this with a fixed delay of ~1500 ms).

@asolanaruiz

I'm also having this problem in production with version 1.5.13.
In our logs I can see the following:

  • A dependency became unhealthy and all the requests to it started returning a 503
  • For 10 minutes I can see Hystrix trying to test whether the dependency is back to normal. That is, after several instances of the short-circuited and no fallback available error, I see one instance of the 503 error. The time between 503s is approximately the value of the circuitBreaker.sleepWindowInMilliseconds setting (5 seconds).
  • After 10 minutes of this behavior we only see the short-circuited and no fallback available error. It's almost as if Hystrix gave up checking whether the dependency was healthy, and the circuit stayed permanently open.

After all that, we had to restart the service to see the circuit close again.
We haven't tried downgrading to 1.5.11 but we plan to do that soon.

@bedrin

bedrin commented Feb 14, 2018

This is probably a regression in 1.5.13 caused by a820344#diff-82a974c5de99c7b7fa59df2c2b823ae1R385.
Also see #1723

@litalk

litalk commented Mar 20, 2018

Is there a solution for this issue?
Right now my only option is to disable the circuit breaker :(

@jiacai2050

Sounds like #1640.

Are there any maintainers looking into this?

@davidvara

I have the same issue. I would like to see the circuit breaker close again. Will a fix be implemented?

@jiacai2050

@davidvara Just to let you know, I decided to downgrade to 1.5.11 (1.5.12 introduced a big refactor of the circuit breaker) and will see what happens.

@petropolis

We ran into this. This is a very serious issue; it is basically a showstopper for using Hystrix. Hopefully it will be addressed soon (or the bad code backed out).

phaneesh added a commit to phaneesh/revolver that referenced this issue Jul 23, 2018
@godofwharf

+1

@jiacai2050

jiacai2050 commented Nov 12, 2018

Update: after downgrading to 1.5.11, I haven't seen this issue so far.

@antdavidl

Does anyone know if this is fixed in the latest release, 1.5.18 (16 Nov 2018)?

So far, it seems the best approach is to downgrade to 1.5.11, right?

@breun

breun commented Feb 23, 2019

1.5.11 was rereleased as 1.5.18, so they're the same: https://github.com/Netflix/Hystrix/releases/tag/v1.5.18

And Hystrix is no longer in active development after this release: https://github.com/Netflix/Hystrix#hystrix-status

@rx091v

rx091v commented Dec 11, 2020

Hello,

We are using hystrix-core:jar:1.5.6 and experiencing the same issue.

Error: com.netflix.hystrix.exception.HystrixRuntimeException:xx.xx short-circuited and fallback disabled

It doesn't look like the problem is specific to 1.5.12, does it?

This problem occurs with older versions of Hystrix too.
Unfortunately, I am unable to reproduce the issue in my local environment.
