Sliding window implementation for PID controller #874

Merged
AbdulRahmanAlHamali merged 8 commits into pid-take-2 from sliding-implementation
Nov 26, 2025

Conversation

@AbdulRahmanAlHamali (Contributor) commented Nov 15, 2025

Our tests have shown the discrete window implementation to be too janky, with the rejection rate dropping too aggressively. A sliding window gives smoother behavior.

fixes https://github.com/Shopify/resiliency/issues/6615

@AbdulRahmanAlHamali changed the base branch from main to kris-gaudel/ses-replace-p2 November 15, 2025 13:52
@@ -0,0 +1,97 @@
# frozen_string_literal: true
Contributor

Is there a reason simple_sliding_window.rb wouldn't work here?

Contributor

Or maybe this could inherit from it

Contributor Author

hmmm, I'll look at its implementation, but we'll have to use it in a slightly different way: the classic circuit breaker only slides the window when it receives a new response, but we want to slide it continuously every second. So it will probably need some tweaking

@AbdulRahmanAlHamali force-pushed the sliding-implementation branch 4 times, most recently from ce2ef9f to 1f6fa64, November 19, 2025 19:43
initial_history_duration: 900, # 15 minutes of initial history for p90 calculation
initial_error_rate: options[:initial_error_rate] || 0.01, # 1% error rate for initial p90 calculation
thread_safe: Semian.thread_safe?,
implementation: implementation(**options),
Contributor Author

Not exactly part of the PR, but just making the implementation more consistent with the rest of the repo, where they define an implementation and it's either "Simple" or "ThreadSafe"
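
A minimal sketch of the "Simple" vs. "ThreadSafe" convention mentioned above. The names `WindowImpl` and `pick_implementation` are hypothetical and exist only for illustration; this is not the repository's actual code, just one way such a namespace switch can look:

```ruby
# Hypothetical illustration of selecting an implementation namespace.
module WindowImpl
  module Simple
    class SlidingWindow
      def initialize
        @items = []
      end

      def push(value)
        @items << value
        self
      end

      def size
        @items.size
      end
    end
  end

  module ThreadSafe
    # Same interface, but every operation is guarded by a mutex.
    class SlidingWindow < Simple::SlidingWindow
      def initialize
        super
        @lock = Mutex.new
      end

      def push(value)
        @lock.synchronize { super }
      end

      def size
        @lock.synchronize { super }
      end
    end
  end

  # Returns the namespace whose classes the caller should instantiate.
  def self.pick_implementation(thread_safe:)
    thread_safe ? ThreadSafe : Simple
  end
end

implementation = WindowImpl.pick_implementation(thread_safe: true)
errors = implementation::SlidingWindow.new
errors.push(Time.now.to_f)
```

The caller only ever writes `implementation::SlidingWindow.new`, so the thread-safety decision is made in one place.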


@last_p_value = 0.0
end
module Simple
Contributor Author

Tip: enable "Hide whitespace" while reviewing this file

initial_value: initial_error_rate,
)

@errors = implementation::SlidingWindow.new
Contributor Author

I could technically have put these three into one array, with entries looking like:

{type:, timestamp:}

But keeping them separate is more efficient: with a single array we would later have to scan it to count errors vs. successes, whereas this way we can simply look at the sizes.
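
To make the trade-off concrete, here is a small sketch of both layouts. The sample timestamps and variable names are made up for illustration:

```ruby
# Option A (hypothetical): one array of tagged entries; counting needs a scan.
observations = [
  { type: :success, timestamp: 100.0 },
  { type: :error, timestamp: 100.5 },
  { type: :success, timestamp: 101.0 },
]
errors_a = observations.count { |o| o[:type] == :error } # O(n) scan
total_a = observations.size

# Option B: one array of timestamps per outcome; counting is just `size`.
successes = [100.0, 101.0]
errors = [100.5]
errors_b = errors.size # O(1)
total_b = successes.size + errors.size

error_rate_a = errors_a.to_f / total_a
error_rate_b = errors_b.to_f / total_b
```

Both layouts yield the same error rate; the difference is only in how much work each calculation does.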

# Clean up old observations
current_timestamp = current_time
cutoff_time = current_timestamp - @window_size
@errors.reject! { |timestamp| timestamp < cutoff_time }
Contributor Author

this could be implemented more efficiently, since we know entries are ordered by time. So if found to be too slow, we can optimize
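
A sketch of what that optimization might look like, assuming the timestamps are stored oldest-first. The helper names `evict_expired!` and `evict_expired_bsearch!` are hypothetical:

```ruby
# Drops every timestamp older than cutoff_time from a time-ordered array.
# Assumes `timestamps` is sorted ascending (oldest first), so we can stop
# at the first entry that is still inside the window.
def evict_expired!(timestamps, cutoff_time)
  timestamps.shift while !timestamps.empty? && timestamps.first < cutoff_time
  timestamps
end

# Alternative: binary-search for the first entry to keep, then drop the prefix.
def evict_expired_bsearch!(timestamps, cutoff_time)
  keep_from = timestamps.bsearch_index { |t| t >= cutoff_time } || timestamps.size
  timestamps.shift(keep_from)
  timestamps
end
```

Unlike `reject!`, neither version inspects entries that are still inside the window.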


  def push(value)
-   resize_to(@max_size - 1) # make room
+   resize_to(@max_size - 1) unless @max_size.nil? # make room
Contributor Author

I think max_size makes sense for the classic, because we know exactly how many requests we want to store before opening the circuit. For adaptive, we don't know that, so I elected to make them unlimited.

This should be ok, since we don't expect to see too many requests to a single dependency. But we could add an optional max_size later if we see issues

I agree it might not be as concrete as the classic, but just thinking out loud, it would be good to have some limit to prevent unbounded growth

Contributor Author

agreed. Do you have a number in mind? Or should we just put some large value, e.g. 1000 or something?

Contributor Author

Ideally we should choose a setting that increases/decreases with window size. I'm guessing 100 x window_size. So for window_size = 10, that would be 1000.


assert_equal(0, @controller.metrics[:rejection_rate])
end
# def test_update_flow
Contributor Author

These tests have to be fixed. However, in the next PR, I'm going to simplify the controller, from a PID controller to a simple P controller as discussed. So I'm not going to invest in this now

@AbdulRahmanAlHamali changed the title from "Sliding implementation" to "Sliding window implementation for PID controller" Nov 19, 2025
@AbdulRahmanAlHamali marked this pull request as ready for review November 19, 2025 19:54
@AbdulRahmanAlHamali force-pushed the sliding-implementation branch 3 times, most recently from 29fc9e8 to a3130b9, November 20, 2025 14:48
Base automatically changed from kris-gaudel/ses-replace-p2 to pid-take-2 November 20, 2025 16:24
@AbdulRahmanAlHamali force-pushed the sliding-implementation branch 7 times, most recently from 8dfcc48 to 20299bd, November 20, 2025 22:14

  def wait_for_window
-   Kernel.sleep(@window_size)
+   Kernel.sleep(@sliding_interval)
Contributor

Shouldn't this still be the window size?

Contributor Author

I don't think so. If we're sliding every second, we want to wake the thread up every second, but then look back across the entire window size
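
A rough sketch of that idea: sleep for the short sliding interval, then evaluate over the full window. `SlidingEvaluator` and its injectable `clock`/`sleeper` arguments are hypothetical, added here only so the loop can be exercised without real waiting:

```ruby
# Hypothetical sketch: wake up every `sliding_interval` seconds, but evaluate
# the error rate over the last `window_size` seconds.
class SlidingEvaluator
  def initialize(window_size:, sliding_interval:)
    @window_size = window_size
    @sliding_interval = sliding_interval
    @successes = []
    @errors = []
  end

  def record(outcome, at:)
    (outcome == :error ? @errors : @successes) << at
  end

  # Called once per wake-up; looks back across the whole window.
  def evaluate(now)
    cutoff = now - @window_size
    @successes.reject! { |t| t < cutoff }
    @errors.reject! { |t| t < cutoff }
    total = @successes.size + @errors.size
    total.zero? ? 0.0 : @errors.size.to_f / total
  end

  # Runs `iterations` wake-ups and returns the error rate seen at each one.
  def run(iterations:,
          clock: -> { Process.clock_gettime(Process::CLOCK_MONOTONIC) },
          sleeper: ->(seconds) { Kernel.sleep(seconds) })
    Array.new(iterations) do
      sleeper.call(@sliding_interval) # short sleep: the slide step
      evaluate(clock.call)            # long look-back: the window size
    end
  end
end

evaluator = SlidingEvaluator.new(window_size: 10, sliding_interval: 1)
evaluator.record(:error, at: 0.0)
evaluator.record(:success, at: 5.0)

fake_now = 9.0 # simulated clock, advanced by the fake sleeper below
rates = evaluator.run(
  iterations: 2,
  clock: -> { fake_now },
  sleeper: ->(seconds) { fake_now += seconds },
)
```

With the simulated clock, the error recorded at t=0 is still inside the window at the first wake-up (t=10) and has slid out by the second (t=11).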

observations_per_minute: 60 / sliding_interval,
)

@errors = implementation::SlidingWindow.new(max_size: 100 * window_size)
Contributor

What is the meaning of multiplying by 100?

Contributor Author

Basically, I'm allowing up to 100 requests per second to be stored, so 1000 requests in a 10-second time window.

It's an arbitrary limit, just to keep things from growing unbounded
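
For illustration, a capped window along those lines might look like the sketch below. `BoundedWindow` is a hypothetical stand-in, not the repository's `SlidingWindow` class, and it assumes the oldest entries are the ones dropped when the cap is hit:

```ruby
# Hypothetical bounded window: keeps at most max_size entries (nil = unbounded).
class BoundedWindow
  def initialize(max_size: nil)
    @max_size = max_size
    @items = []
  end

  def push(value)
    # Make room by dropping the oldest entries once the cap is reached.
    if @max_size && @items.size >= @max_size
      @items.shift(@items.size - @max_size + 1)
    end
    @items << value
    self
  end

  def size
    @items.size
  end

  def first
    @items.first
  end
end

window_size = 10             # seconds
max_size = 100 * window_size # budget of 100 requests/second => 1000 entries
errors = BoundedWindow.new(max_size: max_size)
1500.times { |i| errors.push(i) }
```

After 1500 pushes the window holds only the newest 1000 entries, so memory stays bounded even if traffic exceeds the assumed 100 requests per second.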

@kris-gaudel self-requested a review November 26, 2025 04:31

@Aguasvivas22 (Contributor) left a comment

lgtm

def calculate_window_error_rate
total_requests = @current_window_requests[:success] + @current_window_requests[:error]
return 0.0 if total_requests == 0
total_requests = @successes.size + @errors.size

Why doesn't total requests include the rejections? I feel like there was a reason for this, but I'm blanking on it

Contributor Author

you want to find the error rate in the requests that did actually make it to the backend
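
A small sketch of that calculation with made-up counts. The method name `window_error_rate` and its keyword arguments are hypothetical; the point is only that rejected requests never produced a backend outcome, so they stay out of the denominator:

```ruby
# Error rate over requests that actually reached the backend.
# Rejected requests never produced a backend outcome, so they are left out.
def window_error_rate(successes:, errors:, rejections:)
  total_requests = successes + errors # rejections intentionally excluded
  return 0.0 if total_requests.zero?

  errors.to_f / total_requests
end

rate = window_error_rate(successes: 20, errors: 20, rejections: 60)

# For comparison only: counting rejections in the denominator would dilute
# the signal while the backend is still failing half of what it receives.
diluted = 20.0 / (20 + 20 + 60)
```

Here `rate` is 0.5 while `diluted` is 0.2, even though half of the requests that reached the backend failed.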

@AbdulRahmanAlHamali merged commit a7d7a57 into pid-take-2 Nov 26, 2025
32 checks passed
@AbdulRahmanAlHamali deleted the sliding-implementation branch November 26, 2025 18:16

4 participants