Skip to content

Add defensiveness coefficient, and simplify the controller into a P-only controller#875

Closed
AbdulRahmanAlHamali wants to merge 4 commits intopid-take-2from
rejection-rate
Closed

Add defensiveness coefficient, and simplify the controller into a P-only controller#875
AbdulRahmanAlHamali wants to merge 4 commits intopid-take-2from
rejection-rate

Conversation

@AbdulRahmanAlHamali
Copy link
Contributor

@AbdulRahmanAlHamali AbdulRahmanAlHamali commented Nov 17, 2025

A few realizations from our experiments:

  1. We should not only attempt to match the error rate. Our goal is to attempt to eliminate the error rate
  2. Also, when the error rate drops, our P value should not immediately drop

This PR addresses this by introducing a "defensiveness" coefficient, making the P equation:

(error_rate - ideal_error_rate) - 1 / defensiveness * rejection_rate

I set the default value of defensiveness to 5, which means that:

  • The controller is willing to block 5 times as many requests as the error rate, in attempting to eliminate it (as opposed to only once as many, which we currently do)
  • When the error rate is eliminated, the rejection rate will go down gradually as well.

The other thing we noticed is that both the I and the D have little to no contribution to our performance. Thus, I removed them, significantly simplifying the code

fixes https://github.com/Shopify/resiliency/issues/6644 and https://github.com/Shopify/resiliency/issues/6645

@AbdulRahmanAlHamali AbdulRahmanAlHamali force-pushed the sliding-implementation branch 11 times, most recently from 8dfcc48 to 20299bd Compare November 20, 2025 22:14
@AbdulRahmanAlHamali AbdulRahmanAlHamali changed the title only count P, and add rejection rate coefficient Add defensiveness coefficient, and simplify the controller into a P-only controller Nov 21, 2025
@AbdulRahmanAlHamali AbdulRahmanAlHamali force-pushed the rejection-rate branch 3 times, most recently from 11bf32d to bc0f019 Compare November 21, 2025 13:50
@AbdulRahmanAlHamali AbdulRahmanAlHamali marked this pull request as ready for review November 21, 2025 13:52
Copy link
Contributor

@Aguasvivas22 Aguasvivas22 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, however there might be an issue in PR 874.

The graphs for sudden error spikes and sustained load look off and when I ran the sudden spikes test, I see an output that shows 300 windows vs what should be only 30 windows.

I haven't been able to review the other PR so not sure if that's by design or a bug, so wanted to put this out there.

No issues with this PR though but if it is an error, than we'd want to fix that and regenerate graphs here.

UPDATED:
This is actually by design since in #874 we change to sliding windows and adjusting every second instead of every 10s.

Comment on lines +10 to +16
# P = (error_rate - ideal_error_rate) - (1/defensiveness) * rejection_rate
# Note: P increases when error_rate increases
# P decreases when rejection_rate increases (providing feedback)
class PIDController
class ProcessController
attr_reader :name, :rejection_rate

def initialize(kp:, ki:, kd:, window_size:, sliding_interval:, implementation:, initial_history_duration:,
def initialize(defensiveness:, window_size:, sliding_interval:, implementation:,

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is basically a coefficient for the rejection_rate. I'd prefer if we just treat it like that directly in the formula (without the 1/) and illustrate what the range of coefficients will represent. I'm also not a fan of the exact naming of this variable.

As an aside, if we are weighting the rejection_rate differently, is it worth considering doing something similar for the error_rate as well.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As an aside, if we are weighting the rejection_rate differently, is it worth considering doing something similar for the error_rate as well.

So a coefficient for the error rate would basically allow us to block less than the error rate if we want to. Which I guess is a possibility, I'm just not sure if there is value in doing that

Comment on lines -20 to -22
@kp = kp # Proportional gain
@ki = ki # Integral gain
@kd = kd # Derivative gain

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I haven't done any experiments or have much basis, but given that the actual targeting equation P is changing I would think it might be worthwhile playing around with these values before removing I & D altogether.

If it works fine with just the P and that's good enough, then I'm fine with the simplification too

kd: 0.5, # Small derivative gain (as per design doc)
window_size: 10, # 10-second window for rate calculation and update interval
sliding_interval: 1, # 1-second interval for background health checks
initial_history_duration: 900, # 15 minutes of initial history for p90 calculation

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Switch to SES algorithm made this obsolete. This is a cleanup

initial_error_rate:,
smoother_cap_value: SimpleExponentialSmoother::DEFAULT_CAP_VALUE)
# PID coefficients
@kp = kp # Proportional gain

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Discussion detailing that we should keep Kp value here and set it to 1

initial_error_rate:,
smoother_cap_value: SimpleExponentialSmoother::DEFAULT_CAP_VALUE)
# PID coefficients
@kp = kp # Proportional gain

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's keep Kp here and set it to 1 (based on discussion)

@AbdulRahmanAlHamali AbdulRahmanAlHamali force-pushed the sliding-implementation branch 2 times, most recently from 72d0208 to 22cd79b Compare November 26, 2025 16:47
Base automatically changed from sliding-implementation to pid-take-2 November 26, 2025 18:16
@AbdulRahmanAlHamali AbdulRahmanAlHamali force-pushed the rejection-rate branch 2 times, most recently from 61cb004 to aa478a8 Compare November 26, 2025 20:29
@AbdulRahmanAlHamali
Copy link
Contributor Author

Replaced by #899

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants