Skip to content

Fix experiments that were not working#896

Merged
AbdulRahmanAlHamali merged 1 commit intopid-take-2from
fix-broken-experiments
Nov 27, 2025
Merged

Fix experiments that were not working#896
AbdulRahmanAlHamali merged 1 commit intopid-take-2from
fix-broken-experiments

Conversation

@AbdulRahmanAlHamali
Copy link
Contributor

This PR fixes several things that were not great in experiments:

  1. The one_of_many_service_degradation was accidentally not timing out
  2. The slow_query experiments needed some adjustments to actually make the slow query impactful
  3. In classic circuit breaker experiments, we have been setting the error rate to 0 (as opposed to 1% for adaptive), which is not apples-to-apples. Instead, I set it back to 0.01, and fixed the error threshold to make sure the circuit breaker opens/closes correctly

mean_latency = 0.15
# Provide barely enough threads for us to handle the expected load (using Little's law)
max_threads_per_service = ((@requests_per_second / @service_count) * (0.5 * mean_latency)).to_i
max_threads_per_service = ((@requests_per_second / @service_count) * mean_latency).to_i
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the right Little's Law equation. I had set it at 0.5 before because other parts of the code were buggy, but then we fixed the other parts and didn't fix this one, which broke it. Now it is good

index: bucket_idx,
color: color,
width: 2,
width: 1,
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just decreasing the contention of drawing state transitions

timeout: 10,
max_threads: @with_max_threads ? max_threads_per_service : 0,
queue_timeout: 0.0,
queue_timeout: 1.0,
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this means that requests won't be dropped immediately if the service doesn't have capacity, which is more realistic

semian_config: {
success_threshold: 2,
error_threshold: 10,
error_threshold: 2,
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For this experiment, we have 10 services, so I set the error threshold lower because it will be broken down among different services

resource_name: "protected_service",
degradation_phases: [Semian::Experiments::DegradationPhase.new(healthy: true)] * 1 +
[Semian::Experiments::DegradationPhase.new(latency: 4.95)] * 10 + # Most requests to the target service will timeout
[Semian::Experiments::DegradationPhase.new(latency: 9.95)] * 10 + # Most requests to the target service will timeout
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had increased the timeout to 10 seconds at some point, and forgot to change that

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this shows clearly the protection pattern of the classic circuit breaker: when it's open, you're 100% protected, when it closes, you tank

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can clearly see here that when requests do timeout (and thus generate an error), we end up opening the circuit breaker to a certain degree, providing a level of protection

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this was rejecting all the time, even before the slow query was introduced 😅

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this was rejecting all the time, even before the slow query was introduced 😅

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

wow, this is a noticeable improvement

@AbdulRahmanAlHamali AbdulRahmanAlHamali merged commit 4c34eff into pid-take-2 Nov 27, 2025
32 checks passed
@AbdulRahmanAlHamali AbdulRahmanAlHamali deleted the fix-broken-experiments branch November 27, 2025 15:35
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants