Fix experiments that were not working #896
```diff
  mean_latency = 0.15
  # Provide barely enough threads for us to handle the expected load (using Little's law)
- max_threads_per_service = ((@requests_per_second / @service_count) * (0.5 * mean_latency)).to_i
+ max_threads_per_service = ((@requests_per_second / @service_count) * mean_latency).to_i
```
This is the right Little's Law equation. I had scaled it by 0.5 before to compensate for bugs in other parts of the code, but when we fixed those parts we forgot to remove the factor here, which broke the sizing. Now it is correct.
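As a sanity check, Little's Law says concurrency `L = λ × W` (arrival rate times mean time in system). A minimal worked example, with an assumed total load of 100 rps that is illustrative only (the experiment's actual `@requests_per_second` isn't shown here):

```ruby
# Little's Law: required concurrency L = arrival rate (lambda) * mean latency (W).
# The load and service count below are assumptions for illustration.
requests_per_second = 100 # assumed total load across all services
service_count = 10        # number of services in the experiment
mean_latency = 0.15       # seconds per request, as in the config above

# Per-service arrival rate times mean latency gives the number of threads
# needed to keep up with the expected load, truncated to an integer.
max_threads_per_service = ((requests_per_second / service_count) * mean_latency).to_i

puts max_threads_per_service # => 1
```

With the old `0.5 *` factor this would have truncated to 0 threads per service, which is why removing it matters.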
```diff
  index: bucket_idx,
  color: color,
- width: 2,
+ width: 1,
```
just reducing the visual clutter when drawing state transitions
```diff
  timeout: 10,
  max_threads: @with_max_threads ? max_threads_per_service : 0,
- queue_timeout: 0.0,
+ queue_timeout: 1.0,
```
this means that requests won't be dropped immediately if the service doesn't have capacity, which is more realistic
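A minimal sketch of that admission behavior, assuming a bounded queue in front of the worker threads; this is not Semian's implementation, just an illustration of `queue_timeout` semantics:

```ruby
# Illustration only: a request is admitted if queue capacity frees up within
# queue_timeout seconds, otherwise it is dropped. With queue_timeout = 0.0 a
# full queue rejects immediately; with 1.0 the caller waits up to a second.
class BoundedQueue
  def initialize(capacity, queue_timeout)
    @queue = SizedQueue.new(capacity)
    @queue_timeout = queue_timeout
  end

  def admit(request)
    deadline = Process.clock_gettime(Process::CLOCK_MONOTONIC) + @queue_timeout
    begin
      @queue.push(request, true) # non-blocking push; raises ThreadError if full
      true
    rescue ThreadError
      if Process.clock_gettime(Process::CLOCK_MONOTONIC) < deadline
        sleep 0.01 # poll until capacity frees up or the deadline passes
        retry
      end
      false # dropped: no capacity within queue_timeout
    end
  end
end
```

With `queue_timeout: 0.0` every request arriving at a full queue is rejected instantly, which is why the old setting looked harsher than real systems behave.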
```diff
  semian_config: {
    success_threshold: 2,
-   error_threshold: 10,
+   error_threshold: 2,
```
For this experiment we have 10 services, so I set the error threshold lower because the errors will be spread across the different services' circuits.
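The arithmetic behind that, with illustrative numbers not taken from the experiment output: if the errors are spread evenly, each per-service circuit only sees a fraction of them, so the old threshold of 10 would rarely trip while 2 trips quickly.

```ruby
# Illustration: suppose the degraded dependency produces 20 errors within an
# error window, spread evenly across the 10 services' per-service circuits.
total_errors = 20
service_count = 10
errors_per_circuit = total_errors / service_count # 2 errors per circuit

old_threshold = 10
new_threshold = 2
puts errors_per_circuit >= old_threshold # => false (no circuit would open)
puts errors_per_circuit >= new_threshold # => true  (each circuit opens)
```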
```diff
  resource_name: "protected_service",
  degradation_phases: [Semian::Experiments::DegradationPhase.new(healthy: true)] * 1 +
-   [Semian::Experiments::DegradationPhase.new(latency: 4.95)] * 10 + # Most requests to the target service will timeout
+   [Semian::Experiments::DegradationPhase.new(latency: 9.95)] * 10 + # Most requests to the target service will timeout
```
I had increased the timeout to 10 seconds at some point and forgot to update this latency to match.
this clearly shows the protection pattern of the classic circuit breaker: when it's open, you're 100% protected; when it closes, you tank
We can clearly see here that when requests do time out (and thus generate errors), we end up opening the circuit breaker to a certain degree, providing a level of protection
this was rejecting all the time, even before the slow query was introduced 😅
wow, this is a noticeable improvement
This PR fixes several things that were not great in experiments:
- `one_of_many_service_degradation` was accidentally not timing out