Bug: Event rate limit logs individual warning per dropped event, causing massive log spam #9

@mattmattox

Description

When the event rate limit (10/min) is exceeded, every dropped event logs its own warning. Over 2.5 days this produced more than 342,000 identical rate limit warnings, each followed by a matching "Failed to create event" warning.

Observed Behavior

[WARN] Event rate limit exceeded (10/min), dropping event: PeerUnreachable
[WARN] Failed to create event from status: event rate limit exceeded
[WARN] Event rate limit exceeded (10/min), dropping event: PeerUnreachable
[WARN] Failed to create event from status: event rate limit exceeded
... (repeated 342,000+ times)

Log Statistics

Metric                  Value
Total log lines         1,201,289
WARN messages           710,699 (59%)
Rate limit warnings     342,014
Failed event warnings   342,011
INFO messages           712 (<1%)

Root Cause Analysis

Two locations log warnings for every dropped event:

Location 1: pkg/exporters/kubernetes/event_manager.go:108-109

// Check rate limiting first
if !em.checkRateLimit() {
    log.Printf("[WARN] Event rate limit exceeded (%d/min), dropping event: %s",
        em.maxEventsPerMin, event.Reason)
    return fmt.Errorf("event rate limit exceeded")
}

Location 2: pkg/exporters/kubernetes/event_manager.go:146

for _, event := range events {
    err := em.CreateEvent(ctx, event)
    if err != nil {
        log.Printf("[WARN] Failed to create event from status: %v", err)
        // ...
    }
}

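Each dropped event triggers BOTH warnings: CreateEvent logs the drop and returns an error, and the CreateEventsFromStatus loop then logs that error again. This doubles the log volume for every drop.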
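One way to remove the duplicate is to let the caller recognize a rate limit error and count it instead of logging it. The following is a minimal sketch, not the project's code: `ErrRateLimited`, `createEvent`, and `createBatch` are hypothetical names, and `createEvent` only stands in for `em.CreateEvent`.

```go
package main

import (
	"errors"
	"fmt"
	"log"
)

// ErrRateLimited is a hypothetical sentinel error. The code quoted in this
// issue returns an unnamed fmt.Errorf value, which callers cannot match on.
var ErrRateLimited = errors.New("event rate limit exceeded")

// createEvent stands in for em.CreateEvent: it accepts the first `limit`
// events and rejects the rest without logging anything itself.
func createEvent(accepted *int, limit int, reason string) error {
	if *accepted >= limit {
		return fmt.Errorf("dropping event %s: %w", reason, ErrRateLimited)
	}
	*accepted++
	return nil
}

// createBatch mirrors the loop in CreateEventsFromStatus. Rate limit drops
// are counted silently; only other failures are logged per event.
func createBatch(reasons []string, limit int) (created, dropped int) {
	accepted := 0
	for _, reason := range reasons {
		err := createEvent(&accepted, limit, reason)
		switch {
		case err == nil:
			created++
		case errors.Is(err, ErrRateLimited):
			dropped++
		default:
			log.Printf("[WARN] Failed to create event from status: %v", err)
		}
	}
	if dropped > 0 {
		// One summary line per batch instead of two lines per dropped event.
		log.Printf("[WARN] Event rate limit exceeded, dropped %d of %d events",
			dropped, len(reasons))
	}
	return created, dropped
}

func main() {
	reasons := []string{"PeerUnreachable", "PeerUnreachable", "PeerUnreachable", "CNIConfigError"}
	created, dropped := createBatch(reasons, 2)
	fmt.Println(created, dropped) // prints "2 2"
}
```

Wrapping with `%w` keeps the event reason in the error text while still letting `errors.Is` match the sentinel.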

Files Involved

  • pkg/exporters/kubernetes/event_manager.go:100-111 - CreateEvent method
  • pkg/exporters/kubernetes/event_manager.go:135-158 - CreateEventsFromStatus method
  • pkg/exporters/kubernetes/event_manager.go:167-190 - checkRateLimit method

Impact

  • Log storage: ~34-68 MB of duplicate warnings over the 2.5-day sample (about 684,000 lines at roughly 50-100 bytes each)
  • Log analysis: Impossible to find real issues in the noise
  • Performance: Excessive logging I/O overhead
  • Alerting: Cannot alert on WARN level due to noise

Suggested Fix

Aggregate rate limit warnings into periodic summaries:

Before:

[WARN] Event rate limit exceeded (10/min), dropping event: PeerUnreachable
[WARN] Event rate limit exceeded (10/min), dropping event: PeerUnreachable
... (x342,000)

After:

[WARN] Rate limit exceeded: dropped 342 events this minute (PeerUnreachable: 300, CNIConfigError: 42)
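The "After" line could be produced by a small counter that records drops by reason and emits one summary per window. This is only a sketch under assumptions: `dropAggregator`, `Record`, and `Flush` are hypothetical names that do not exist in the current code, and the caller is assumed to invoke `Flush` once per minute.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
	"sync"
)

// dropAggregator is a hypothetical helper, not part of the existing code.
type dropAggregator struct {
	mu     sync.Mutex
	counts map[string]int // dropped events per reason in the current window
}

func newDropAggregator() *dropAggregator {
	return &dropAggregator{counts: make(map[string]int)}
}

// Record notes one dropped event without logging anything.
func (a *dropAggregator) Record(reason string) {
	a.mu.Lock()
	defer a.mu.Unlock()
	a.counts[reason]++
}

// Flush returns a one-line summary of the drops recorded since the last
// Flush and resets the counters. It returns "" when nothing was dropped.
func (a *dropAggregator) Flush() string {
	a.mu.Lock()
	defer a.mu.Unlock()
	if len(a.counts) == 0 {
		return ""
	}
	reasons := make([]string, 0, len(a.counts))
	total := 0
	for reason, n := range a.counts {
		reasons = append(reasons, reason)
		total += n
	}
	// Largest count first, ties broken by name, so the output is deterministic.
	sort.Slice(reasons, func(i, j int) bool {
		ci, cj := a.counts[reasons[i]], a.counts[reasons[j]]
		if ci != cj {
			return ci > cj
		}
		return reasons[i] < reasons[j]
	})
	parts := make([]string, len(reasons))
	for i, reason := range reasons {
		parts[i] = fmt.Sprintf("%s: %d", reason, a.counts[reason])
	}
	a.counts = make(map[string]int)
	return fmt.Sprintf("[WARN] Rate limit exceeded: dropped %d events this minute (%s)",
		total, strings.Join(parts, ", "))
}

func main() {
	agg := newDropAggregator()
	for i := 0; i < 300; i++ {
		agg.Record("PeerUnreachable")
	}
	for i := 0; i < 42; i++ {
		agg.Record("CNIConfigError")
	}
	fmt.Println(agg.Flush())
}
```

In a real exporter, `Flush` would probably be driven by a `time.Ticker` goroutine or by the rate limiter's own window reset; which of the two fits better depends on how `checkRateLimit` tracks its window.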

Implementation Options

  1. Aggregate per minute: Track dropped events, log summary once per minute
  2. Change log level: Use DEBUG for individual drops, WARN for summary
  3. First-only logging: Log only first rate limit hit per minute
  4. Circuit breaker: Stop processing batch after first rate limit hit

Acceptance Criteria

  • Maximum 1-2 rate limit warnings per minute (not one per dropped event)
  • Warning includes count of dropped events
  • No loss of diagnostic information
  • Tests verify aggregated logging behavior

Metadata

Assignees: No one assigned
Labels: bug (Something isn't working)