Advanced Circuit Breaker
The original CircuitBreaker
The original Polly CircuitBreaker takes the number of consecutive exceptions thrown as its indicator of the health of the underlying actions. This remains highly effective in many scenarios, is easy to understand, and simple to configure. We recommend it as the starting point for many situations.
There are, however, situations where a breaker with more detailed configuration parameters may be useful. In high-throughput (and variable-throughput) scenarios in particular, the proportion of failures can be a more consistent indicator of circuit health than a consecutive count, which may fluctuate with load.
The AdvancedCircuitBreaker (v4.2)
AdvancedCircuitBreaker offers a circuit-breaker which:
- Reacts on proportion of failures, the failureThreshold; eg break if over 50% of actions result in a handled failure
- Measures that proportion over a rolling samplingDuration, so that older failures can be excluded and have no effect
- Imposes a minimumThroughput before acting, such that the circuit reacts only when statistically significant, and does not trip in 'slow' periods
Policy
    .Handle<Whatever>(...)
    .AdvancedCircuitBreaker(
        double failureThreshold,
        TimeSpan samplingDuration,
        int minimumThroughput,
        TimeSpan durationOfBreak)
The circuit will break if, within any timespan of duration samplingDuration, the proportion of actions resulting in a handled exception exceeds failureThreshold, provided also that the number of actions through the circuit in the timespan is at least minimumThroughput.
failureThreshold: the proportion of failures at which to break. A double between 0 and 1. For example, 0.5 represents break on 50% or more of actions through the circuit resulting in a handled failure.
samplingDuration: the failure rate is considered for actions over this period. Successes/failures older than the period are discarded from metrics.
minimumThroughput: this many calls must have passed through the circuit within the active samplingDuration for the circuit to consider breaking.
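The way the three parameters combine can be sketched as a single predicate. This is an illustration only, not Polly's source: the class and method names are hypothetical, and the comparison follows the "50% or more" wording of the failureThreshold description above, which is an assumption about the exact boundary behaviour.

```csharp
using System;

public static class BreakDecisionSketch
{
    // total and failures are the counts observed within the current samplingDuration.
    // Assumption: a failure ratio equal to the threshold is enough to break.
    public static bool ShouldBreak(int total, int failures, double failureThreshold, int minimumThroughput)
    {
        // Too few calls in the window: the statistics are not yet considered significant.
        if (total < minimumThroughput) return false;

        return (double)failures / total >= failureThreshold;
    }

    public static void Main()
    {
        Console.WriteLine(ShouldBreak(10, 10, 0.5, 20)); // False: below minimumThroughput
        Console.WriteLine(ShouldBreak(20, 10, 0.5, 20)); // True: 50% failures over 20 calls
        Console.WriteLine(ShouldBreak(40, 12, 0.5, 20)); // False: only 30% failures
    }
}
```

Note that the first call returns False even though every action failed, because the minimum throughput has not been met.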
A starting configuration for governing RESTful calls to a downstream system might be:
Policy
    .Handle<TException>(...)
    .AdvancedCircuitBreaker(
        failureThreshold: 0.5,
        samplingDuration: TimeSpan.FromSeconds(5),
        minimumThroughput: 20,
        durationOfBreak: TimeSpan.FromSeconds(30))
... if you want your circuit to respond to a faulty underlying subsystem within at most 5 seconds (see discussion below), assuming your calls meet this minimum throughput and you want to break for 30 seconds at a time.
There is no substitute for tuning circuit configuration in light of the performance characteristics of your individual system. A good strategy for tuning can be to take circuit settings from a configuration file, have code monitor that file for changes, and replace the live circuit with a newly-configured one any time you detect changes in the underlying config file. This allows you to tune your circuit without recompiling and redeploying your application.
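One possible shape for the replace-on-config-change strategy is sketched below. Everything here is hypothetical: BreakerSettings, SwappableBreaker and the factory delegate are illustrative names, not Polly types, and detecting file changes is left out. In a real application the factory would build the policy, for example by passing the four settings to AdvancedCircuitBreaker.

```csharp
using System;
using System.Threading;

// Hypothetical settings object, populated from your configuration file.
public sealed class BreakerSettings
{
    public double FailureThreshold { get; set; }
    public TimeSpan SamplingDuration { get; set; }
    public int MinimumThroughput { get; set; }
    public TimeSpan DurationOfBreak { get; set; }
}

// Holds the live policy and lets a config watcher swap in a rebuilt one.
public sealed class SwappableBreaker<TPolicy> where TPolicy : class
{
    private readonly Func<BreakerSettings, TPolicy> _build;
    private TPolicy _current;

    public SwappableBreaker(Func<BreakerSettings, TPolicy> build, BreakerSettings initial)
    {
        _build = build;
        _current = build(initial);
    }

    // Callers read the live policy on every use rather than caching it.
    public TPolicy Current => Volatile.Read(ref _current);

    // Call this from whatever code detects a change to the config file.
    public void Reconfigure(BreakerSettings settings) =>
        Interlocked.Exchange(ref _current, _build(settings));
}

public static class SwapDemo
{
    public static void Main()
    {
        // A string stands in for the policy so the sketch has no dependencies.
        var holder = new SwappableBreaker<string>(
            s => "breaker requiring " + s.MinimumThroughput + " calls",
            new BreakerSettings { MinimumThroughput = 20 });
        Console.WriteLine(holder.Current);

        holder.Reconfigure(new BreakerSettings { MinimumThroughput = 50 });
        Console.WriteLine(holder.Current);
    }
}
```

A freshly built policy is a new instance, so it starts with no accumulated statistics; that may matter if you reconfigure while the old circuit is open.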
Tolerances to set can depend on many factors - not only the governed system, but the nature of fallback or failover alternatives in place if this circuit breaks.
The following points are also worth bearing in mind:
Middling failureThreshold values, from 0.5 (break on 50% failure or above) to 0.7 (70% failure), are good starting points from which to adjust.
High values such as 0.9 (break on 90% failure or above) may lead to your circuit not breaking easily. Consider also how this does not add much value: if you choose to move to 100% failure (to break) only when you have already reached a 90% failure rate, you are not offering either your consumers (in terms of providing a faster failure than typical timeout) or the underlying system (in terms of protecting it from calls when it is struggling) much advantage. A lot of callers will be failing (and on slow timeouts) before you help them by breaking the circuit and failing faster.
Low values such as 0.1 (break on 10% failure or above) or 0.2 will cause your circuit to be 'trigger-happy' and break very eagerly. This penalises a large proportion of users (90%) by pushing them to outright call rejection (circuit broken) when they might otherwise succeed.
As previously stated, however, appropriate settings will depend on the characteristics of your particular systems and the nature of fallback or failover alternatives in place if this circuit breaks.
samplingDuration is the duration over which statistics are measured; any statistic aging beyond this period is forgotten.
- Keep in mind that this translates into how quickly your circuit will respond to failure. For a responsive circuit, configure samplingDuration in the order of seconds (rather than minutes or hours).
To understand this, consider how a 'long tail' of successes may affect statistics. Consider a circuit with reasonable throughput, configured with a sampling duration of 5 minutes and set to break at 50% failure threshold. Suppose the actions governed by the circuit have been working perfectly (100% success) for the past 5 minutes plus, then 100% failure starts occurring. Such a circuit would (other things being equal) take around 2.5 minutes to reach 50% failure rate, to 'work off' the statistics from the 100% success era. Generalising, a circuit with sampling duration T requiring a failure ratio r to trip, would (other things being equal) require T * r time to react to worst case (100% failure) from best case (100% success). While the real world may not be as regular as this, circuit responsiveness remains broadly proportional to samplingDuration in this manner.
- The minimum permitted samplingDuration is 20 milliseconds, reflecting the minimum resolution for the circuit's timers.
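The T * r estimate from the discussion above can be checked with a few lines of arithmetic. The helper below is a hypothetical illustration of that idealised formula (steady throughput, a switch from 100% success to 100% failure); it is not part of Polly.

```csharp
using System;

public static class ReactionTimeSketch
{
    // Idealised estimate: a window of length T filled with successes needs
    // T * r of solid failures before the failure ratio in the window reaches r.
    public static TimeSpan EstimateReactionTime(TimeSpan samplingDuration, double failureThreshold)
    {
        if (failureThreshold <= 0 || failureThreshold > 1)
            throw new ArgumentOutOfRangeException(nameof(failureThreshold));

        return TimeSpan.FromTicks((long)(samplingDuration.Ticks * failureThreshold));
    }

    public static void Main()
    {
        // The 5-minute, 50% example from the text.
        Console.WriteLine(EstimateReactionTime(TimeSpan.FromMinutes(5), 0.5)); // 00:02:30

        // A sampling duration in the order of seconds reacts far sooner.
        Console.WriteLine(EstimateReactionTime(TimeSpan.FromSeconds(5), 0.5)); // 00:00:02.5000000
    }
}
```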
Set minimumThroughput to a value that keeps statistics significant, and that eliminates hard-breaking in 'slower' periods if desired.
Keep in mind that the value should be treated as a floor, set well below the circuit's typical throughput. If the value is in practice too close to the circuit's typical throughput, the circuit may spend a significant proportion of time waiting to meet the minimum throughput threshold, and therefore not break as often as you might want or expect.
At the lower end of the scale, bear in mind how minimumThroughput will translate into the minimum resolution of the breaker's failure statistics. A low minimumThroughput will result in a coarse initial resolution of the statistics. For example, a minimumThroughput value of 2 (the lowest permitted) will mean your circuit's minimum/initial resolution will be the set of values 0%, 50%, 100%. This may be too coarse depending on your configured failureThreshold.
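To see the resolution effect concretely, the sketch below lists the failure rates that can be observed at the moment exactly minimumThroughput calls are in the window. The class and method names are hypothetical and exist only for illustration.

```csharp
using System;
using System.Linq;

public static class ResolutionSketch
{
    // With n calls in the window, the failure rate can only be k / n for k = 0..n.
    public static double[] InitialFailureRates(int minimumThroughput)
    {
        if (minimumThroughput < 2)
            throw new ArgumentOutOfRangeException(nameof(minimumThroughput));

        return Enumerable.Range(0, minimumThroughput + 1)
                         .Select(failures => (double)failures / minimumThroughput)
                         .ToArray();
    }

    public static void Main()
    {
        var rates = InitialFailureRates(2).Select(r => (r * 100).ToString("0") + "%");
        Console.WriteLine(string.Join(", ", rates)); // 0%, 50%, 100%

        // With minimumThroughput = 20 there are 21 possible rates, 5 points apart.
        Console.WriteLine(InitialFailureRates(20).Length); // 21
    }
}
```

With a failureThreshold of 0.7, for instance, a minimumThroughput of 2 can only trip at 100% failure, since 50% is the nearest observable rate below it.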
Detailed operation of failure statistics
Internally, the circuit-breaker measures statistics with a rolling statistical window. The configured samplingDuration T is divided into ten slices, such that statistics for 10% of the period T are discarded every T / 10. This smooths the calculation of failure rates to acceptable tolerances (compared to disposing 100% of the statistics every time T).
samplingDuration is not however further subdivided into slices if samplingDuration is set below 200 milliseconds. This prevents excessively frequent recalculation when there is little responsiveness gain.
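The sliced rolling window described above can be sketched as a queue of per-slice counters. This is a minimal illustration under stated assumptions, not the library's implementation: RollingHealthSketch and its members are hypothetical names, timestamps are passed in explicitly to keep the example deterministic, and thread-safety is ignored.

```csharp
using System;
using System.Collections.Generic;

public sealed class RollingHealthSketch
{
    private sealed class Slice { public long StartedAt; public int Successes; public int Failures; }

    private readonly long _samplingTicks;
    private readonly long _sliceTicks;
    private readonly Queue<Slice> _slices = new Queue<Slice>();
    private Slice _current;

    // sliceCount = 10 mirrors the ten slices in the text; pass 1 for an unsliced window.
    public RollingHealthSketch(TimeSpan samplingDuration, int sliceCount = 10)
    {
        _samplingTicks = samplingDuration.Ticks;
        _sliceTicks = samplingDuration.Ticks / sliceCount;
    }

    public void Record(bool success, long nowTicks)
    {
        Roll(nowTicks);
        if (success) _current.Successes++; else _current.Failures++;
    }

    // Totals over the slices that are still inside the sampling duration.
    public (int Total, int Failures) Snapshot(long nowTicks)
    {
        Roll(nowTicks);
        int successes = 0, failures = 0;
        foreach (var slice in _slices) { successes += slice.Successes; failures += slice.Failures; }
        return (successes + failures, failures);
    }

    private void Roll(long nowTicks)
    {
        // Start a new slice once the current one has covered T / sliceCount.
        if (_current == null || nowTicks - _current.StartedAt >= _sliceTicks)
        {
            _current = new Slice { StartedAt = nowTicks };
            _slices.Enqueue(_current);
        }
        // Discard whole slices that began a full sampling duration ago or earlier.
        while (_slices.Count > 0 && nowTicks - _slices.Peek().StartedAt >= _samplingTicks)
            _slices.Dequeue();
    }

    public static void Main()
    {
        long second = TimeSpan.TicksPerSecond;
        var window = new RollingHealthSketch(TimeSpan.FromSeconds(10));
        window.Record(success: true, nowTicks: 0);
        window.Record(success: false, nowTicks: 5 * second);
        Console.WriteLine(window.Snapshot(9 * second));  // (2, 1)
        Console.WriteLine(window.Snapshot(10 * second)); // (1, 1): the slice from t=0 has aged out
    }
}
```

Because only one slice is dropped at a time, the failure rate changes gradually rather than resetting all at once at the end of each sampling period.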