Closed
Labels: bug (Something isn't working)
Description
When the event rate limit (10/min) is exceeded, each dropped event logs its own warning instead of being summarized. Over 2.5 days this produced more than 340,000 identical rate limit warnings, plus a matching "Failed to create event" warning for each one.
Observed Behavior
[WARN] Event rate limit exceeded (10/min), dropping event: PeerUnreachable
[WARN] Failed to create event from status: event rate limit exceeded
[WARN] Event rate limit exceeded (10/min), dropping event: PeerUnreachable
[WARN] Failed to create event from status: event rate limit exceeded
... (repeated 342,000+ times)
Log Statistics
| Metric | Value |
|---|---|
| Total log lines | 1,201,289 |
| WARN messages | 710,699 (59%) |
| Rate limit warnings | 342,014 |
| Failed event warnings | 342,011 |
| INFO messages | 712 (<1%) |
Root Cause Analysis
Two locations log warnings for every dropped event:
Location 1: pkg/exporters/kubernetes/event_manager.go:108-109
// Check rate limiting first
if !em.checkRateLimit() {
    log.Printf("[WARN] Event rate limit exceeded (%d/min), dropping event: %s",
        em.maxEventsPerMin, event.Reason)
    return fmt.Errorf("event rate limit exceeded")
}
Location 2: pkg/exporters/kubernetes/event_manager.go:146
for _, event := range events {
    err := em.CreateEvent(ctx, event)
    if err != nil {
        log.Printf("[WARN] Failed to create event from status: %v", err)
        // ...
    }
}
Each dropped event triggers BOTH warnings, multiplying log volume.
Files Involved
- pkg/exporters/kubernetes/event_manager.go:100-111 - CreateEvent method
- pkg/exporters/kubernetes/event_manager.go:135-158 - CreateEventsFromStatus method
- pkg/exporters/kubernetes/event_manager.go:167-190 - checkRateLimit method
Impact
- Log storage: ~34-68 MB of duplicate warnings over the 2.5-day observation window (roughly 342,000 lines at 100-200 bytes each)
- Log analysis: Impossible to find real issues in the noise
- Performance: Excessive logging I/O overhead
- Alerting: Cannot alert on WARN level due to noise
Suggested Fix
Aggregate rate limit warnings into periodic summaries:
Before:
[WARN] Event rate limit exceeded (10/min), dropping event: PeerUnreachable
[WARN] Event rate limit exceeded (10/min), dropping event: PeerUnreachable
... (x342,000)
After:
[WARN] Rate limit exceeded: dropped 342 events this minute (PeerUnreachable: 300, CNIConfigError: 42)
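The "After" line above could be produced by a small per-reason counter that is flushed once per minute. The following is a minimal sketch only: `dropAggregator`, `Record`, and `Flush` are hypothetical names that do not exist in `event_manager.go`, and the once-per-minute scheduling (for example a `time.Ticker`) is left out.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
	"sync"
)

// dropAggregator is a hypothetical helper (not part of the existing code).
// It counts dropped events per reason so that a single summary line can
// replace the per-event warnings.
type dropAggregator struct {
	mu     sync.Mutex
	counts map[string]int
}

func newDropAggregator() *dropAggregator {
	return &dropAggregator{counts: make(map[string]int)}
}

// Record notes one dropped event. It does not log anything itself.
func (a *dropAggregator) Record(reason string) {
	a.mu.Lock()
	defer a.mu.Unlock()
	a.counts[reason]++
}

// Flush returns a summary of the drops recorded since the previous Flush and
// resets the counters. It returns "" when nothing was dropped.
func (a *dropAggregator) Flush() string {
	a.mu.Lock()
	defer a.mu.Unlock()
	if len(a.counts) == 0 {
		return ""
	}
	reasons := make([]string, 0, len(a.counts))
	total := 0
	for reason, n := range a.counts {
		reasons = append(reasons, reason)
		total += n
	}
	// Sort by count (descending), then by name, so the output is stable.
	sort.Slice(reasons, func(i, j int) bool {
		ci, cj := a.counts[reasons[i]], a.counts[reasons[j]]
		if ci != cj {
			return ci > cj
		}
		return reasons[i] < reasons[j]
	})
	parts := make([]string, len(reasons))
	for i, reason := range reasons {
		parts[i] = fmt.Sprintf("%s: %d", reason, a.counts[reason])
	}
	a.counts = make(map[string]int)
	return fmt.Sprintf("[WARN] Rate limit exceeded: dropped %d events this minute (%s)",
		total, strings.Join(parts, ", "))
}

func main() {
	agg := newDropAggregator()
	for i := 0; i < 300; i++ {
		agg.Record("PeerUnreachable")
	}
	for i := 0; i < 42; i++ {
		agg.Record("CNIConfigError")
	}
	// In the exporter this call would run once per minute.
	fmt.Println(agg.Flush())
}
```

With this shape, the rate limit branch in CreateEvent would call `Record(event.Reason)` instead of `log.Printf`, and only a non-empty `Flush` result would be logged.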
Implementation Options
- Aggregate per minute: Track dropped events, log summary once per minute
- Change log level: Use DEBUG for individual drops, WARN for summary
- First-only logging: Log only first rate limit hit per minute
- Circuit breaker: Stop processing batch after first rate limit hit
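The circuit breaker option depends on the batch loop being able to tell a rate limit error apart from other failures, which the current `fmt.Errorf("event rate limit exceeded")` value does not allow. One possible approach, sketched here under assumptions, is a sentinel error checked with `errors.Is`. `ErrRateLimited`, `createEvent`, and `createBatch` are hypothetical stand-ins for the real methods, and the integer budget replaces the real `checkRateLimit` logic.

```go
package main

import (
	"errors"
	"fmt"
)

// ErrRateLimited is a hypothetical sentinel error that callers can detect.
var ErrRateLimited = errors.New("event rate limit exceeded")

// createEvent stands in for CreateEvent: it accepts events until the budget
// is used up, then returns an error wrapping ErrRateLimited without logging.
func createEvent(budget *int, reason string) error {
	if *budget <= 0 {
		return fmt.Errorf("dropping event %s: %w", reason, ErrRateLimited)
	}
	*budget--
	return nil
}

// createBatch stands in for CreateEventsFromStatus. It stops at the first
// rate limit hit and reports how many events were created and how many were
// skipped, so the caller can log one summary instead of one line per event.
func createBatch(budget *int, reasons []string) (created, skipped int) {
	for i, reason := range reasons {
		err := createEvent(budget, reason)
		if errors.Is(err, ErrRateLimited) {
			// The rest of the batch would be dropped as well, so stop here.
			return created, len(reasons) - i
		}
		if err != nil {
			// Other failures would still be logged individually.
			continue
		}
		created++
	}
	return created, 0
}

func main() {
	budget := 2
	created, skipped := createBatch(&budget, []string{"A", "B", "C", "D", "E"})
	fmt.Printf("created=%d skipped=%d\n", created, skipped) // created=2 skipped=3
}
```

Stopping early also removes the second warning from Location 2, because the loop never sees more than one rate limit error per batch.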
Acceptance Criteria
- Maximum 1-2 rate limit warnings per minute (instead of one per dropped event, 342,000+ in the observed logs)
- Warning includes count of dropped events
- No loss of diagnostic information
- Tests verify aggregated logging behavior
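To make the last criterion testable, the warning logic needs a clock and a log sink that a test can control. The sketch below assumes the "first-only logging" option and is not the project's code: `windowedWarner`, its `now` and `emit` fields, and `simulate` are hypothetical names, and the type is not safe for concurrent use without a mutex.

```go
package main

import (
	"fmt"
	"time"
)

// windowedWarner is hypothetical. It emits at most one warning per window and
// carries the number of suppressed drops into the next warning.
type windowedWarner struct {
	window     time.Duration
	now        func() time.Time // injectable clock
	emit       func(msg string) // injectable sink, e.g. log.Print in production
	started    bool
	windowEnd  time.Time
	suppressed int
}

// Drop records one dropped event and warns only on the first drop of a window.
func (w *windowedWarner) Drop(reason string) {
	t := w.now()
	if w.started && t.Before(w.windowEnd) {
		w.suppressed++
		return
	}
	w.emit(fmt.Sprintf(
		"[WARN] Event rate limit exceeded, dropping event: %s (%d drops suppressed since last warning)",
		reason, w.suppressed))
	w.started = true
	w.windowEnd = t.Add(w.window)
	w.suppressed = 0
}

// simulate drives the warner with a fake clock through `windows` consecutive
// one-minute windows of `dropsPerWindow` drops each and returns the warnings.
func simulate(dropsPerWindow, windows int) []string {
	var lines []string
	clock := time.Unix(0, 0)
	w := &windowedWarner{
		window: time.Minute,
		now:    func() time.Time { return clock },
		emit:   func(msg string) { lines = append(lines, msg) },
	}
	for i := 0; i < windows; i++ {
		for j := 0; j < dropsPerWindow; j++ {
			w.Drop("PeerUnreachable")
		}
		clock = clock.Add(time.Minute) // advance the fake clock by one window
	}
	return lines
}

func main() {
	lines := simulate(1000, 3)
	fmt.Printf("%d warnings for 3000 drops\n", len(lines)) // 3 warnings for 3000 drops
	for _, l := range lines {
		fmt.Println(l)
	}
}
```

A real test would assert on `len(lines)` and on the suppressed counts. Note that in this sketch the drops suppressed in the final window are only reported by the next warning, so a flush on shutdown would be needed to satisfy "no loss of diagnostic information".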