Improve sustained alert aggregation#21
Conversation
There was a problem hiding this comment.
Code Review
This pull request introduces rate-aware early handoff from the ramp phase to the sustained phase, making alerting quieter during high-volume incidents. It also increases the default sustained update interval to 15 minutes and adds a periodCount field to track deltas between updates. Review feedback highlights a logic error in the early handoff implementation that could cause alert spamming and suggests ensuring the transition to sustained mode is one-way. Additionally, an optimization for the sliding window rate calculation was recommended to prevent performance degradation during alert storms.
| const shouldEnterSustainedByRate = | ||
| state.hasSentRampAlert && currentRate >= this.config.rampExitRatePerSecond | ||
|
|
||
| // Phase: sustained (count exceeds rampThreshold) | ||
| if (state.count > this.config.rampThreshold) { | ||
| if (state.count > this.config.rampThreshold || shouldEnterSustainedByRate) { | ||
| state.phase = 'sustained' | ||
| state.everEnteredSustained = true | ||
| result.phase = 'sustained' | ||
|
|
||
| if (now - state.lastAlertedAt >= this.config.digestIntervalMs) { | ||
| const sustainedByRate = shouldEnterSustainedByRate && state.count <= this.config.rampThreshold | ||
| if (sustainedByRate || now - state.lastAlertedAt >= this.config.digestIntervalMs) { | ||
| result.shouldSend = true | ||
| result.suppressedSince = state.count - state.lastAlertedCount | ||
| result.periodCount = state.count - state.lastAlertedCount | ||
| result.suppressedSince = result.periodCount | ||
| state.lastAlertedAt = now | ||
| state.lastAlertedCount = state.count | ||
| } | ||
|
|
||
| return result | ||
| } |
There was a problem hiding this comment.
The current implementation of the rate-aware early handoff causes an alert to be sent on every subsequent event once the rate threshold is crossed, until the rampThreshold is reached. This happens because shouldEnterSustainedByRate (and thus sustainedByRate) remains true for every call in that range, bypassing the digestIntervalMs check. Additionally, if the rate drops below the threshold before reaching rampThreshold, the aggregator could regress from 'sustained' back to 'ramp' phase. The transition to sustained mode should be one-way and only trigger a single immediate "handoff" alert. Furthermore, ensure that context blocks (service name and timestamp) are handled consistently after the phase logic by avoiding early returns within the phase-handling blocks, as per the design document.
const wasSustained = state.phase === 'sustained'
const isRateThresholdHit = state.hasSentRampAlert && currentRate >= this.config.rampExitRatePerSecond
if (state.count > this.config.rampThreshold || isRateThresholdHit || wasSustained) {
state.phase = 'sustained'
state.everEnteredSustained = true
result.phase = 'sustained'
const isEarlyExit = !wasSustained && isRateThresholdHit && state.count <= this.config.rampThreshold
if (isEarlyExit || now - state.lastAlertedAt >= this.config.digestIntervalMs) {
result.shouldSend = true
result.periodCount = state.count - state.lastAlertedCount
result.suppressedSince = result.periodCount
state.lastAlertedAt = now
state.lastAlertedCount = state.count
}
}References
- Ensure that context blocks, such as service name and timestamp, are consistently present across all alert phases (onset, ramp, sustained, resolution) by moving the logic for adding the context block outside and after the switch statement that handles different phases.
| const windowCutoff = now - this.config.rampExitRateWindowMs | ||
| state.rateWindow = state.rateWindow.filter((t) => t > windowCutoff) | ||
| const currentRate = state.rateWindow.length / (RATE_WINDOW_MS / 1000) | ||
| const currentRate = state.rateWindow.length / (this.config.rampExitRateWindowMs / 1000) |
There was a problem hiding this comment.
The use of Array.prototype.filter on state.rateWindow (line 78) during every process call can lead to significant performance degradation during high-volume alert storms. For example, a storm of 10k events/sec would result in filtering and re-allocating a 600k-element array 10k times per second. Consider using a more efficient sliding window approach, such as a queue with shift() or a bucketed counter, to maintain the rate calculation with better time and memory complexity.
Summary
periodCountmetadata and show per-period plus total counts in sustained formatter outputValidation
pnpm testpnpm typecheckpnpm buildpnpm lintCloses #23