Fix experiments that were not working by AbdulRahmanAlHamali · Pull Request #896 · Shopify/semian

AbdulRahmanAlHamali · 2025-11-27T03:34:27Z

This PR fixes several problems in the experiments:

- The one_of_many_service_degradation experiment was accidentally not timing out.
- The slow_query experiments needed some adjustments to make the slow query actually impactful.
- In the classic circuit breaker experiments, we had been setting the error rate to 0 (as opposed to 1% for adaptive), which is not an apples-to-apples comparison. I set it back to 0.01 and fixed the error threshold so that the circuit breaker opens and closes correctly.

AbdulRahmanAlHamali · 2025-11-27T03:35:26Z

experiments/experiment_helpers.rb

        mean_latency = 0.15
        # Provide barely enough threads for us to handle the expected load (using Little's law)
-        max_threads_per_service = ((@requests_per_second / @service_count) * (0.5 * mean_latency)).to_i
+        max_threads_per_service = ((@requests_per_second / @service_count) * mean_latency).to_i


This is the correct Little's Law equation. I had added the 0.5 factor earlier to compensate for bugs in other parts of the code. Those bugs were later fixed but the factor stayed, which broke this calculation. It is correct now.
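
For reference, a minimal sketch of the Little's Law sizing used in this diff (concurrency = arrival rate × mean time in system). The method name `threads_for_load`, the keyword arguments, and the traffic numbers are hypothetical and chosen only for illustration; only `mean_latency = 0.15` and the shape of the formula come from the change above.

```ruby
# Little's Law: L = lambda * W
#   L      = average number of requests in flight (threads needed)
#   lambda = arrival rate per service (requests per second)
#   W      = mean time a request spends in the service (seconds)
#
# Hypothetical helper, not part of experiment_helpers.rb.
def threads_for_load(requests_per_second:, service_count:, mean_latency:, factor: 1.0)
  # to_f avoids accidental integer division when both inputs are Integers
  per_service_rate = requests_per_second.to_f / service_count
  (per_service_rate * (factor * mean_latency)).to_i
end

# Illustrative numbers only: 1000 req/s spread over 10 services.
corrected = threads_for_load(requests_per_second: 1000, service_count: 10, mean_latency: 0.15)
old_value = threads_for_load(requests_per_second: 1000, service_count: 10, mean_latency: 0.15, factor: 0.5)

puts "corrected formula: #{corrected} threads per service"
puts "with the old 0.5 factor: #{old_value} threads per service"
```

Under these assumed numbers, the old 0.5 factor provisions roughly half the threads that the expected load calls for, which is consistent with services rejecting requests even when healthy.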

AbdulRahmanAlHamali · 2025-11-27T03:35:45Z

experiments/experiment_helpers.rb

            index: bucket_idx,
            color: color,
-            width: 2,
+            width: 1,


just decreasing the contention of drawing state transitions

AbdulRahmanAlHamali · 2025-11-27T03:36:14Z

experiments/experiment_helpers.rb

            timeout: 10,
            max_threads: @with_max_threads ? max_threads_per_service : 0,
-            queue_timeout: 0.0,
+            queue_timeout: 1.0,


this means that requests won't be dropped immediately if the service doesn't have capacity, which is more realistic

AbdulRahmanAlHamali · 2025-11-27T03:36:51Z

experiments/experiments/experiment_one_of_many_services_degradation.rb

  semian_config: {
    success_threshold: 2,
-    error_threshold: 10,
+    error_threshold: 2,


For this experiment, we have 10 services, so I lowered the error threshold: the errors are split across the different services, so each one sees only a fraction of them.

AbdulRahmanAlHamali · 2025-11-27T03:37:19Z

experiments/experiments/experiment_one_of_many_services_degradation.rb

  resource_name: "protected_service",
  degradation_phases: [Semian::Experiments::DegradationPhase.new(healthy: true)] * 1 +
-                      [Semian::Experiments::DegradationPhase.new(latency: 4.95)] * 10 + # Most requests to the target service will timeout
+                      [Semian::Experiments::DegradationPhase.new(latency: 9.95)] * 10 + # Most requests to the target service will timeout


I had increased the timeout to 10 seconds at some point and forgot to update this latency to match.

AbdulRahmanAlHamali · 2025-11-27T03:39:07Z

experiments/results/main_graphs/one_of_many_services_latency_degradation.png

this shows clearly the protection pattern of the classic circuit breaker: when it's open, you're 100% protected, when it closes, you tank

AbdulRahmanAlHamali · 2025-11-27T03:39:48Z

experiments/results/main_graphs/one_of_many_services_latency_degradation_adaptive.png

We can clearly see here that when requests do time out (and thus generate an error), we end up opening the circuit breaker to a certain degree, providing a level of protection

AbdulRahmanAlHamali · 2025-11-27T03:40:41Z

experiments/results/main_graphs/slow_query.png

this was rejecting all the time, even before the slow query was introduced 😅

AbdulRahmanAlHamali · 2025-11-27T03:40:48Z

experiments/results/main_graphs/slow_query_adaptive.png

this was rejecting all the time, even before the slow query was introduced 😅

adriangudas · 2025-11-27T15:33:59Z

experiments/results/main_graphs/sudden_error_spike_100_adaptive.png

wow, this is a noticeable improvement

fix experiments that were not working

3a12f12

adriangudas approved these changes Nov 27, 2025

AbdulRahmanAlHamali merged commit 4c34eff into pid-take-2 Nov 27, 2025
32 checks passed

AbdulRahmanAlHamali deleted the fix-broken-experiments branch November 27, 2025 15:35
