Advanced Circuit Breaker

reisenberger edited this page Dec 9, 2017 · 17 revisions

The original CircuitBreaker

The original Polly CircuitBreaker takes the number of consecutive exceptions thrown as its indicator of the health of the underlying actions. This remains highly effective in many scenarios, is easy to understand, and simple to configure. We recommend it as the starting point for many situations.

There are, however, situations where a breaker with more detailed configuration parameters may be useful. In high-throughput (and variable-throughput) scenarios in particular, the proportion of failures can be a more consistent indicator of circuit health than a consecutive count, which may fluctuate with load.

The AdvancedCircuitBreaker (v4.2)

The AdvancedCircuitBreaker offers a circuit-breaker which:

  • Reacts on proportion of failures, the failureThreshold; e.g. break if 50% or more of actions result in a handled failure
  • Measures that proportion over a rolling samplingDuration, so that older failures can be excluded and have no effect
  • Imposes a minimumThroughput before acting, such that the circuit reacts only when statistically significant, and does not trip in 'slow' periods

Syntax

Policy
   .Handle<Whatever>(...)
   .AdvancedCircuitBreaker(
        double failureThreshold, 
        TimeSpan samplingDuration, 
        int minimumThroughput, 
        TimeSpan durationOfBreak)

The syntax example above is synchronous; comparable async overloads exist for asynchronous operation. See the readme and wiki for more details.

Definition

The circuit will break if, within any timespan of duration samplingDuration, the proportion of actions resulting in a handled exception equals or exceeds failureThreshold, provided also that the number of actions through the circuit in the timespan is at least minimumThroughput.
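The rule can be restated as a small predicate. This is only a sketch of the definition, not Polly's own code: the class and method names (BreakRule, ShouldBreak) are hypothetical, and it uses 'at or above' for the comparison, in line with the failureThreshold parameter description ("50% or more").

```csharp
using System;

public static class BreakRule
{
    // Hypothetical helper illustrating the documented rule; not part of Polly's API.
    // 'failures' and 'throughput' are counts within the current samplingDuration.
    public static bool ShouldBreak(
        int failures, int throughput, double failureThreshold, int minimumThroughput)
    {
        // Too few calls in the window: statistics are not yet significant.
        if (throughput == 0 || throughput < minimumThroughput) return false;

        // Break when the failure proportion is at or above the threshold.
        return (double)failures / throughput >= failureThreshold;
    }
}
```

For example, with failureThreshold 0.5 and minimumThroughput 20, ShouldBreak(10, 20, 0.5, 20) is true, while ShouldBreak(19, 19, 0.5, 20) is false because only 19 calls have passed through the window.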

Parameters

failureThreshold

failureThreshold: the proportion of failures at which to break. A double between 0 and 1. For example, 0.5 means the circuit breaks when 50% or more of actions through the circuit result in a handled failure.

samplingDuration

samplingDuration: the failure rate is considered for actions over this period. Successes/failures older than the period are discarded from metrics.

minimumThroughput

minimumThroughput: this many calls must have passed through the circuit within the active samplingDuration for the circuit to consider breaking.

Configuration recommendations

A starting configuration for governing RESTful calls to a downstream system might be:

Policy
   .Handle<TException>(...)
   .AdvancedCircuitBreaker(
        failureThreshold: 0.5,
        samplingDuration: TimeSpan.FromSeconds(5),
        minimumThroughput: 20, 
        durationOfBreak: TimeSpan.FromSeconds(30))

... if you want your circuit to respond to a faulty underlying subsystem within at most 10 seconds (see discussion below), assuming your calls meet this minimum throughput and you want to break for 30 seconds at a time.
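As a rough usage sketch, the configuration above might be exercised as follows. It assumes a Polly version exposing the synchronous AdvancedCircuitBreaker overload shown under Syntax; InvalidOperationException and CallDownstream are stand-ins invented for the illustration, and the simulated call always fails.

```csharp
using System;
using Polly;
using Polly.CircuitBreaker;

public static class BreakerDemo
{
    public static CircuitState Run()
    {
        CircuitBreakerPolicy breaker = Policy
            .Handle<InvalidOperationException>()   // stand-in for the exception your calls throw
            .AdvancedCircuitBreaker(
                failureThreshold: 0.5,
                samplingDuration: TimeSpan.FromSeconds(5),
                minimumThroughput: 20,
                durationOfBreak: TimeSpan.FromSeconds(30));

        for (int i = 0; i < 25; i++)
        {
            try
            {
                breaker.Execute(() => CallDownstream());
            }
            catch (BrokenCircuitException)
            {
                // Circuit is open: the call was rejected without invoking the delegate.
            }
            catch (InvalidOperationException)
            {
                // A handled failure: rethrown to the caller and counted by the breaker.
            }
        }

        return breaker.CircuitState;
    }

    // Hypothetical downstream call that always fails, to drive the breaker open.
    private static void CallDownstream() =>
        throw new InvalidOperationException("simulated downstream failure");
}
```

With every call failing, the circuit should open once the minimum throughput of 20 calls has been reached inside the sampling window; further calls are then rejected with BrokenCircuitException until durationOfBreak has elapsed.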

There is no substitute for tuning circuit configuration in light of the performance characteristics of your individual system. A good strategy for tuning can be to take circuit settings from a configuration file, have code monitor that file for changes, and replace the live circuit with a newly-configured one any time you detect changes in the underlying config file. This allows you to tune your circuit without recompiling and redeploying your application.

Tolerances to set can depend on many factors - not only the governed system, but the nature of fallback or failover alternatives in place if this circuit breaks.

The following points are also worth bearing in mind:

failureThreshold

  • Middling values between 0.5 (break on 50% failure or above) and 0.7 (70% failure) are good starting points from which to adjust.

  • High values such as 0.9 (break on 90% failure or above) make the circuit slow to break and add little value. If the circuit moves to rejecting all calls only once 90% are already failing, it offers little to your consumers (a faster failure than a typical timeout) or to the underlying system (protection from calls while it is struggling). Many callers will already be failing, often on slow timeouts, before the breaker helps them by failing fast.

  • Low values such as 0.1 (break on 10% failure or above) or 0.2 will make your circuit 'trigger-happy', breaking very eagerly. At 0.1, for example, this penalises the up to 90% of callers whose calls might otherwise succeed, by pushing them to outright call rejection (circuit broken).

As previously stated, however, appropriate settings will depend on the characteristics of your particular systems and the nature of fallback or failover alternatives in place if this circuit breaks.

samplingDuration

The samplingDuration is the duration over which statistics are measured; any statistic aging beyond this period is forgotten.

  • Keep in mind that this translates into how quickly your circuit will respond to failure. For a responsive circuit, configure samplingDuration in the order of seconds (rather than minutes or hours).

To understand this, consider how a 'long tail' of successes may affect statistics. Consider a circuit with reasonable throughput, configured with a sampling duration of 5 minutes and set to break at 50% failure threshold. Suppose the actions governed by the circuit have been working perfectly (100% success) for the past 5 minutes plus, then 100% failure starts occurring. Such a circuit would (other things being equal) take around 2.5 minutes to reach 50% failure rate, to 'work off' the statistics from the 100% success era. Generalising, a circuit with sampling duration T requiring a failure ratio r to trip, would (other things being equal) require T * r time to react to worst case (100% failure) from best case (100% success). While the real world may not be as regular as this, circuit responsiveness remains broadly proportional to samplingDuration in this manner.
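The T * r estimate can be derived under simplifying assumptions that go slightly beyond the text: calls arrive at a constant rate λ (a symbol introduced here purely for illustration), the minimum throughput is met, and the window slides continuously rather than in slices. At time t after failures begin (with 0 ≤ t ≤ T), the window holds λt failures and λ(T − t) successes, so:

```latex
\[
\text{failure ratio}(t)
  = \frac{\lambda t}{\lambda t + \lambda (T - t)}
  = \frac{t}{T},
\qquad
\frac{t}{T} \ge r \iff t \ge T \cdot r .
\]
```

With T = 5 minutes and r = 0.5 this gives the 2.5 minutes quoted above; shortening T to 5 seconds with the same r brings the reaction time down to about 2.5 seconds.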

  • The minimum permitted samplingDuration is 20 milliseconds, reflecting the minimum resolution for the circuit's timers.

minimumThroughput

Set a value to keep statistics significant, and to eliminate hard-breaking in 'slower' periods if desired.

  • Treat the value as a floor, set well below the circuit's typical throughput. If the value is in practice too close to the circuit's typical throughput, the circuit may spend a significant proportion of time waiting to meet the minimum throughput threshold, and therefore not break as often as you might want or expect.

  • At the lower end of the scale, bear in mind how minimumThroughput translates into the minimum resolution of the breaker's failure statistics. A low minimumThroughput results in a coarse initial resolution. For example, a minimumThroughput of 2 (the lowest permitted) means that, when the circuit first has enough calls to act on, the only failure rates it can observe are 0%, 50% and 100%. This may be too coarse, depending on your configured failureThreshold.
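The resolution point can be made concrete with a small helper that lists the failure ratios observable when exactly a given number of calls are in the window. The names below (Resolution, ObservableFailureRatios) are hypothetical and exist only for this illustration.

```csharp
using System;
using System.Linq;

public static class Resolution
{
    // Failure ratios that can be observed when exactly 'throughput' calls
    // are in the window: 0/n, 1/n, ..., n/n.
    public static double[] ObservableFailureRatios(int throughput)
    {
        if (throughput < 1) throw new ArgumentOutOfRangeException(nameof(throughput));
        return Enumerable.Range(0, throughput + 1)
                         .Select(k => (double)k / throughput)
                         .ToArray();
    }
}
```

ObservableFailureRatios(2) yields 0, 0.5 and 1, matching the example in the text, whereas a throughput of 20 gives steps of 5%. With only two calls in the window, a failureThreshold of 0.6 could therefore be met only if both calls fail.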

Detailed operation of failure statistics

Internally, the circuit-breaker measures statistics with a rolling statistical window. The configured samplingDuration T is divided into ten slices, such that statistics for 10% of the period T are discarded every T / 10. This smooths the calculation of failure rates to acceptable tolerances (compared to discarding 100% of the statistics every T).

However, the samplingDuration is not subdivided into slices if it is set below 200 milliseconds. This prevents excessively frequent recalculation when there is little gain in responsiveness.
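A rolling window of this kind might be sketched as below. This is an illustrative approximation of the behaviour described, not Polly's internal implementation: the class name, the injectable tick-based clock and the Snapshot method are all assumptions made so that the example is self-contained.

```csharp
using System;
using System.Collections.Generic;

// Illustrative sketch only; not Polly's internal implementation.
public sealed class RollingHealthWindow
{
    private sealed class Slice
    {
        public long StartTicks;
        public int Successes;
        public int Failures;
    }

    private readonly long _samplingTicks;
    private readonly long _sliceTicks;
    private readonly Func<long> _nowTicks;   // injectable clock (hypothetical), eases testing
    private readonly Queue<Slice> _slices = new Queue<Slice>();
    private Slice _current;

    public RollingHealthWindow(TimeSpan samplingDuration, Func<long> nowTicks)
    {
        _samplingTicks = samplingDuration.Ticks;
        // Ten slices normally; a single slice below 200 ms, as the text describes.
        int sliceCount = samplingDuration < TimeSpan.FromMilliseconds(200) ? 1 : 10;
        _sliceTicks = _samplingTicks / sliceCount;
        _nowTicks = nowTicks;
    }

    public void Record(bool success)
    {
        Roll();
        if (success) _current.Successes++; else _current.Failures++;
    }

    // Totals across all slices still inside the sampling duration.
    public (int Throughput, int Failures) Snapshot()
    {
        Roll();
        int successes = 0, failures = 0;
        foreach (var slice in _slices)
        {
            successes += slice.Successes;
            failures += slice.Failures;
        }
        return (successes + failures, failures);
    }

    private void Roll()
    {
        long now = _nowTicks();

        // Start a new slice once the current one has run its length.
        if (_current == null || now - _current.StartTicks >= _sliceTicks)
        {
            _current = new Slice { StartTicks = now };
            _slices.Enqueue(_current);
        }

        // Discard slices that started a full samplingDuration (or more) ago.
        while (_slices.Count > 0 && now - _slices.Peek().StartTicks >= _samplingTicks)
        {
            _slices.Dequeue();
        }
    }
}
```

Because whole slices are dropped as they age out, at most a tenth of the window's history disappears at a time, rather than all of it at once.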