Add defensiveness coefficient, and simplify the controller into a P-only controller #875
AbdulRahmanAlHamali wants to merge 4 commits into pid-take-2
LGTM; however, there might be an issue in PR 874.
The graphs for sudden error spikes and sustained load look off, and when I ran the sudden-spikes test, the output showed 300 windows where I expected only 30.
I haven't been able to review the other PR, so I'm not sure whether that's by design or a bug; I wanted to put it out there.
There are no issues with this PR, but if it is an error, then we'd want to fix it and regenerate the graphs here.
UPDATED:
This is actually by design, since #874 switches to sliding windows and adjusts every second instead of every 10s.
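The jump from 30 to 300 follows directly from the update cadence. A rough sketch, assuming the sudden-spikes test simulates 300 seconds (the test length is an assumption, not something stated above) and using a hypothetical helper name:

```ruby
# Hypothetical helper: number of controller updates over a test run.
# Both arguments are in seconds.
def update_count(duration:, interval:)
  duration / interval
end

TEST_DURATION = 300 # assumed test length in seconds

every_ten_seconds = update_count(duration: TEST_DURATION, interval: 10) # one update per 10 s window
every_second      = update_count(duration: TEST_DURATION, interval: 1)  # one update per 1 s slide

puts "10 s cadence: #{every_ten_seconds}, 1 s cadence: #{every_second}"
```

Under that assumption the same run produces ten times as many data points once the controller adjusts every second.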
# P = (error_rate - ideal_error_rate) - (1/defensiveness) * rejection_rate
# Note: P increases when error_rate increases
#       P decreases when rejection_rate increases (providing feedback)
class PIDController
class ProcessController
  attr_reader :name, :rejection_rate

  def initialize(kp:, ki:, kd:, window_size:, sliding_interval:, implementation:, initial_history_duration:,
  def initialize(defensiveness:, window_size:, sliding_interval:, implementation:,
This is basically a coefficient for the rejection_rate. I'd prefer that we treat it as one directly in the formula (without the 1/) and document what the range of coefficient values represents. I'm also not a fan of this variable's exact name.
As an aside, if we are weighting the rejection_rate differently, is it worth considering doing something similar for the error_rate as well?
> As an aside, if we are weighting the rejection_rate differently, is it worth considering doing something similar for the error_rate as well.

A coefficient for the error rate would basically allow us to block less than the error rate if we want to. I guess that's a possibility; I'm just not sure there is value in doing that.
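To make the two parameterizations in this thread concrete, here is a minimal sketch. The method names and the `rejection_weight` and `error_weight` parameters are hypothetical; the first formula is the one from the code comment in the diff, and nothing here is taken from the actual controller.

```ruby
# Form used in this PR: rejection_rate is scaled by 1/defensiveness.
def p_with_defensiveness(error_rate:, ideal_error_rate:, rejection_rate:, defensiveness:)
  (error_rate - ideal_error_rate) - (1.0 / defensiveness) * rejection_rate
end

# Suggested form with explicit coefficients (names are hypothetical).
# rejection_weight plays the role of 1/defensiveness; error_weight defaults
# to 1.0, which reproduces the formula above.
def p_with_weights(error_rate:, ideal_error_rate:, rejection_rate:,
                   rejection_weight:, error_weight: 1.0)
  error_weight * (error_rate - ideal_error_rate) - rejection_weight * rejection_rate
end

a = p_with_defensiveness(error_rate: 0.2, ideal_error_rate: 0.01,
                         rejection_rate: 0.5, defensiveness: 5)
b = p_with_weights(error_rate: 0.2, ideal_error_rate: 0.01,
                   rejection_rate: 0.5, rejection_weight: 0.2)

# Both forms agree when rejection_weight == 1 / defensiveness.
puts((a - b).abs < 1e-12)
```

In this sketch a `rejection_weight` below 1 corresponds to a `defensiveness` above 1, and an `error_weight` below 1 is the knob that would let the controller block less than the observed error rate.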
    @kp = kp # Proportional gain
    @ki = ki # Integral gain
    @kd = kd # Derivative gain
I haven't run any experiments and don't have much basis for this, but given that the actual targeting equation P is changing, I think it might be worthwhile to play around with these values before removing I & D altogether.
If it works fine with just P and that's good enough, then I'm fine with the simplification too.
kd: 0.5, # Small derivative gain (as per design doc)
window_size: 10, # 10-second window for rate calculation and update interval
sliding_interval: 1, # 1-second interval for background health checks
initial_history_duration: 900, # 15 minutes of initial history for p90 calculation
The switch to the SES algorithm made this obsolete. This is a cleanup.
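For readers unfamiliar with it, SES refers to simple exponential smoothing, which keeps a single running estimate rather than a stored history; that is presumably why `initial_history_duration` is no longer needed. The sketch below is a generic illustration, not the `SimpleExponentialSmoother` class from this codebase: the class name, `alpha`, and the way `cap_value` clamps observations are all assumptions.

```ruby
# Generic simple exponential smoothing (illustrative only).
class SmootherSketch
  attr_reader :value

  # alpha: smoothing factor in (0, 1]; larger values react faster.
  # cap_value: observations above this are clamped (assumed behaviour).
  def initialize(initial_value:, alpha: 0.1, cap_value: 1.0)
    @value = initial_value
    @alpha = alpha
    @cap_value = cap_value
  end

  # Blend a new observation into the running estimate and return it.
  def add_observation(observation)
    capped = [observation, @cap_value].min
    @value = @alpha * capped + (1 - @alpha) * @value
  end
end

smoother = SmootherSketch.new(initial_value: 0.01, alpha: 0.5)
smoother.add_observation(0.05) # estimate moves halfway from 0.01 toward 0.05
puts smoother.value.round(4)
```

Because only the current estimate is stored, a smoother like this needs a starting value (compare the `initial_error_rate:` argument in the diff) but no warm-up window.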
initial_error_rate:,
smoother_cap_value: SimpleExponentialSmoother::DEFAULT_CAP_VALUE)
    # PID coefficients
    @kp = kp # Proportional gain
Per the discussion, we should keep the Kp value here and set it to 1.
initial_error_rate:,
smoother_cap_value: SimpleExponentialSmoother::DEFAULT_CAP_VALUE)
    # PID coefficients
    @kp = kp # Proportional gain
Let's keep Kp here and set it to 1 (based on discussion)
Replaced by #899
A few realizations from our experiments:
This PR addresses this by introducing a "defensiveness" coefficient, making the P equation:
`P = (error_rate - ideal_error_rate) - (1/defensiveness) * rejection_rate`
I set the default value of defensiveness to 5, which means that:
The other thing we noticed is that both the I and the D terms contribute little to nothing to our performance. I therefore removed them, which significantly simplifies the code.
Fixes https://github.com/Shopify/resiliency/issues/6644 and https://github.com/Shopify/resiliency/issues/6645
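As an illustration of what a given `defensiveness` implies, the sketch below iterates a P-only update until it settles. It assumes the controller adds `P` to the current `rejection_rate` on each tick and clamps the result to `[0, 1]`, and that the observed error rate stays constant; that update rule, the method name, and the sample rates are assumptions for illustration, not code from this PR.

```ruby
# Hypothetical P-only loop using the P equation from this PR.
def settle_rejection_rate(error_rate:, ideal_error_rate:, defensiveness:, ticks: 200)
  rejection_rate = 0.0
  ticks.times do
    p_term = (error_rate - ideal_error_rate) - (1.0 / defensiveness) * rejection_rate
    # Assumed update rule: accumulate P, then clamp to a valid rate.
    rejection_rate = (rejection_rate + p_term).clamp(0.0, 1.0)
  end
  rejection_rate
end

mild   = settle_rejection_rate(error_rate: 0.06, ideal_error_rate: 0.01, defensiveness: 5)
severe = settle_rejection_rate(error_rate: 0.50, ideal_error_rate: 0.01, defensiveness: 5)
puts format("mild: %.3f, severe: %.3f", mild, severe)
```

Under these assumptions the loop settles where `P = 0`, that is, at `rejection_rate = defensiveness * (error_rate - ideal_error_rate)` capped at 1. With `defensiveness` set to 5, an error rate 5 points above ideal settles at 25% rejection, and anything 20 points or more above ideal saturates at 100%.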