CCM-16073 - Updated rate limiting behaviour by rhyscoxnhs · Pull Request #158 · NHSDigital/nhs-notify-client-callbacks

rhyscoxnhs · 2026-04-23T09:13:26Z

Description

Context

Type of changes

Refactoring (non-breaking change)
New feature (non-breaking change which adds functionality)
Breaking change (fix or feature that would change existing functionality)
Bug fix (non-breaking change which fixes an issue)

Checklist

I am familiar with the contributing guidelines
I have followed the code style of the project
I have added tests to cover my changes
I have updated the documentation accordingly
This PR is a result of pair or mob programming

Sensitive Information Declaration

To ensure the utmost confidentiality and protect your and others' privacy, we kindly ask you NOT to include PII (Personal Identifiable Information) / PID (Personal Identifiable Data) or any other sensitive data in this PR (Pull Request) and the codebase changes. We will remove any PR that contains sensitive information. We really appreciate your cooperation in this matter.

I confirm that neither PII/PID nor sensitive data are included in this PR and the codebase changes.

mjewildnhs

Reviewed as far as the admin script.

mjewildnhs · 2026-04-23T15:32:54Z


-return { 1, "allowed", 0, effectiveRate }
+local reason     = consumedTokens < 1 and "rate_limited" or "allowed"
+local retryAfter = consumedTokens < 1 and 1000 or 0


Do we want to optimise the retry time rather than hardcoding it to 1s?
We could have it ramp down to 1s based on how far through the recovery period we are.
With the defaults we'll be recovering for 10m with a reduced effectiveRate.
With lower invocationRateLimits (e.g. 10/s) it will take 60s to generate a token.
So there is no point retrying so quickly: there won't be any tokens, and the retries just cause unnecessary spin on the lambda.
Instead we could use the time it takes to generate a token: math.ceil(1000 / effectiveRate).
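
To illustrate the suggestion, here is a minimal TypeScript sketch of a rate-derived retry delay. It is not the PR's implementation: the helper name `retryAfterMs`, the 1s floor constant, and the `maxRetryMs` cap (including its 60s default) are assumptions made for illustration. Only the `ceil(1000 / effectiveRate)` formula comes from the comment above.

```typescript
// Hypothetical helper: estimate how long to wait before a token is likely
// to be available. effectiveRate is in tokens per second.
const MIN_RETRY_MS = 1000; // assumed floor, so the delay ramps down to 1s

function retryAfterMs(effectiveRate: number, maxRetryMs: number = 60_000): number {
  // If no tokens are being generated (zero, negative or NaN rate),
  // fall back to the assumed maximum delay.
  if (!(effectiveRate > 0)) {
    return maxRetryMs;
  }
  // Time needed to generate a single token at the current rate, in ms.
  const tokenIntervalMs = Math.ceil(1000 / effectiveRate);
  // Never retry sooner than the floor, never later than the cap.
  return Math.min(maxRetryMs, Math.max(MIN_RETRY_MS, tokenIntervalMs));
}
```

Under these assumptions, `retryAfterMs(0.25)` gives 4000 (one token every four seconds), while `retryAfterMs(10)` stays at the 1000 ms floor.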

mjewildnhs · 2026-04-23T15:51:40Z

+  const [consumedOrFlag, reason, retryAfterMs, effectiveRate] = raw;

-  if (allowed === 1) {
+  if (reason === "allowed" || reason === "probe") {


The reason "probe" isn't returned anymore.
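
One way to make a stale reason such as "probe" visible at compile time is to type the reply. The sketch below is only an illustration under assumptions: the names `AdmitReason`, `AdmitResult`, `isAdmitReason` and `parseAdmitResult` are hypothetical, and the set of reasons is inferred from the diffs in this review ("allowed", "rate_limited", "circuit_open").

```typescript
// Reasons assumed to be returned by the admit script after this PR.
// "probe" is deliberately absent.
type AdmitReason = "allowed" | "rate_limited" | "circuit_open";

interface AdmitResult {
  admitted: boolean;
  reason: AdmitReason;
  retryAfterMs: number;
  effectiveRate: number;
}

// Type guard: true only for one of the known reason strings.
function isAdmitReason(value: unknown): value is AdmitReason {
  return value === "allowed" || value === "rate_limited" || value === "circuit_open";
}

// Hypothetical parser for the four-element reply
// [consumedOrFlag, reason, retryAfterMs, effectiveRate].
function parseAdmitResult(raw: readonly unknown[]): AdmitResult {
  const [, reason, retryAfterMs, effectiveRate] = raw;
  if (!isAdmitReason(reason)) {
    throw new Error(`Unexpected admit reason: ${String(reason)}`);
  }
  return {
    admitted: reason === "allowed",
    reason,
    retryAfterMs: Number(retryAfterMs),
    effectiveRate: Number(effectiveRate),
  };
}
```

With a union type like this, a comparison such as `reason === "probe"` on a parsed result is rejected by the compiler rather than silently never matching.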

mjewildnhs · 2026-04-23T16:24:06Z

+local isHalfOpen   = isOpen and now > switchedAt + cooldownMs
+local isRecovering = (not isOpen) and now < switchedAt + recoveryPeriodMs

-  -- Circuit is open and no probe slot is available — reject
-  return { 0, "circuit_open", openedUntil - now, 0 }
-end
-
--------------------------------------------------------------------------------
-- 2. SLIDING WINDOW
--
-- Two windows (current + previous) together approximate a sliding window over
-- cbWindowPeriodMs.  When the current window expires it is promoted to previous
-- and a fresh current window starts.  record-result.lua blends the two windows
-- using a time-based weight to smooth the error rate across the boundary rather
-- than resetting it to zero at expiry.
--
-- record-result.lua is responsible for incrementing the counters; this script
-- is only responsible for rolling the window boundary forward when it expires.
--------------------------------------------------------------------------------
+local effectiveRate

-if cbWindowFrom == 0 then
-  -- No window exists yet — start one now
-  cbWindowFrom = now
-elseif (now - cbWindowFrom) > cbWindowPeriodMs then
-  -- Current window has expired — roll it forward
-  if (now - cbWindowFrom) > (2 * cbWindowPeriodMs) then
-    -- Both current and previous windows are stale: a long quiet period means
-    -- old failure counts are no longer relevant to the health of the endpoint.
-    cbPrevFailures = 0
-    cbPrevAttempts = 0
+if isOpen then
+  if isHalfOpen then
+    effectiveRate = probeRateLimit
  else
-    -- Promote current → previous so it can be blended with the new current window
-    cbPrevFailures = cbFailures
-    cbPrevAttempts = cbAttempts
+    return { 0, "circuit_open", (switchedAt + cooldownMs) - now, 0 }
+  end
+else
+  if isRecovering then
+    effectiveRate = targetRateLimit * (now - switchedAt) / recoveryPeriodMs
+  else
+    effectiveRate = targetRateLimit
  end
-  cbFailures   = 0
-  cbAttempts   = 0
-  cbWindowFrom = now
 end


Declaring isHalfOpen / isRecovering where they are used tidies this up and simplifies it.
Further to that, I think flipping the isHalfOpen check and labelling it inCooldown makes it simpler, brings cooldownMs into context, and moves away from the half-open terminology, which I think is a bit problematic.

Suggested change

if isOpen then
  local inCooldown = now <= switchedAt + cooldownMs
  if inCooldown then
    return { 0, "circuit_open", (switchedAt + cooldownMs) - now, 0 }
  end
  effectiveRate = probeRateLimit
else
  local isRecovering = now < switchedAt + recoveryPeriodMs
  if isRecovering then
    effectiveRate = targetRateLimit * (now - switchedAt) / recoveryPeriodMs
  else
    effectiveRate = targetRateLimit
  end
end
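
For readers more at home in the handler's language, the branching in the suggestion can be mirrored in TypeScript. This is a sketch for reasoning and unit-testing only, not code from the PR: the `CircuitState` and `RateDecision` types and the function name `decideEffectiveRate` are hypothetical, and the field names simply copy the Lua variables.

```typescript
// Hypothetical state bag; field names mirror the Lua variables.
interface CircuitState {
  isOpen: boolean;
  switchedAt: number;       // ms timestamp of the last open/close transition
  cooldownMs: number;       // how long an open circuit rejects everything
  recoveryPeriodMs: number; // how long a closed circuit ramps up for
  probeRateLimit: number;   // rate allowed once the cooldown has passed
  targetRateLimit: number;  // full rate once recovery is complete
}

type RateDecision =
  | { kind: "reject"; reason: "circuit_open"; retryAfterMs: number }
  | { kind: "rate"; effectiveRate: number };

function decideEffectiveRate(state: CircuitState, now: number): RateDecision {
  if (state.isOpen) {
    const cooldownEndsAt = state.switchedAt + state.cooldownMs;
    // Still in cooldown: reject and report the time left.
    if (now <= cooldownEndsAt) {
      return { kind: "reject", reason: "circuit_open", retryAfterMs: cooldownEndsAt - now };
    }
    // Cooldown over but circuit still open: allow probes only.
    return { kind: "rate", effectiveRate: state.probeRateLimit };
  }
  // Circuit closed: ramp linearly from 0 to the target over the recovery period.
  const isRecovering = now < state.switchedAt + state.recoveryPeriodMs;
  if (isRecovering) {
    const elapsed = now - state.switchedAt;
    return { kind: "rate", effectiveRate: (state.targetRateLimit * elapsed) / state.recoveryPeriodMs };
  }
  return { kind: "rate", effectiveRate: state.targetRateLimit };
}
```

For example, with a target of 10/s and a 10-minute recovery period, a call made 5 minutes after the circuit closed yields an effective rate of 5/s.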

* Rate limit/circuit breaker fixes and logging improvements
  Additional handler logging and observability
  Fix retry time for partial batch rate limiting
  Add follow option to debug test script
  Add since var to int debug script
  High resolution storage metrics
* Warm up in the circuit breaker test to ensure circuit is closed
* Circuit breaker disabled fixes
* Fix perf lambda DLQ purge
* Add flush and debug ability to perf runner lambda
* Up the burst in rate limit test
---------
Co-authored-by: Tim Marston <tim.marston2@nhs.net>

* CCM-16073 - Updated rate limiting behaviour Co-authored-by: Mike Wild <mike.wild5@nhs.net> Co-authored-by: Tim Marston <tim.marston2@nhs.net>

rhyscoxnhs requested a review from a team as a code owner April 23, 2026 09:13

rhyscoxnhs force-pushed the feature/CCM-16073-rate-limit branch from a136f1f to 38d0fad Compare April 23, 2026 10:28

mjewildnhs reviewed Apr 23, 2026

rhyscoxnhs force-pushed the feature/CCM-16073-rate-limit branch from 90a5096 to 26d8d59 Compare April 24, 2026 07:49

rhyscoxnhs requested a review from a team as a code owner April 24, 2026 14:32

rhyscoxnhs force-pushed the feature/CCM-16073-rate-limit branch from 6259edd to fe2e5f8 Compare April 27, 2026 07:11

CCM-16073 - Updated rate limiting behaviour

9856bc9

rhyscoxnhs force-pushed the feature/CCM-16073-rate-limit branch from fe2e5f8 to 9856bc9 Compare April 27, 2026 07:32

CCM-16073 - Fixed perf runner permissions

07410f7

mjewildnhs reviewed Apr 27, 2026

Comment thread lambdas/https-client-lambda/src/services/record-result.lua Outdated

rhyscoxnhs force-pushed the feature/CCM-16073-rate-limit branch from 8fb8f8c to 5cb24f9 Compare April 27, 2026 14:16

mjewildnhs force-pushed the feature/CCM-16073-rate-limit branch from e3fb631 to 5cb24f9 Compare April 27, 2026 15:25

mjewildnhs reviewed Apr 27, 2026

Comment thread lambdas/https-client-lambda/src/handler.ts Outdated

rhyscoxnhs force-pushed the feature/CCM-16073-rate-limit branch from 5cb24f9 to 07410f7 Compare April 28, 2026 07:08

CCM-16073 - Updated rate limiting behaviour

6673cca

cgitim reviewed Apr 28, 2026

Comment thread lambdas/https-client-lambda/src/services/admit.lua Outdated

cgitim reviewed Apr 28, 2026

Comment thread lambdas/https-client-lambda/src/services/record-result.lua Outdated

rhyscoxnhs and others added 3 commits April 28, 2026 12:24

CCM-16073 - Updated rate limiting behaviour

9b0a511

CCM-16073 - Updated rate limiting behaviour

be5d8bc

rhyscoxnhs merged commit d2d2c31 into feature/CCM-16073 Apr 29, 2026
25 checks passed

rhyscoxnhs deleted the feature/CCM-16073-rate-limit branch April 29, 2026 14:41

mjewildnhs added a commit that referenced this pull request Apr 29, 2026

CCM-16073 - Updated rate limiting behaviour (#158)

4093022

* CCM-16073 - Updated rate limiting behaviour Co-authored-by: Mike Wild <mike.wild5@nhs.net> Co-authored-by: Tim Marston <tim.marston2@nhs.net>

rhyscoxnhs merged 6 commits into feature/CCM-16073 from feature/CCM-16073-rate-limit

3 participants
