
Conversation

Collaborator

@tmikula-dev tmikula-dev commented Nov 4, 2025

Overview

This pull request makes Kafka message delivery more robust against transient failures and timeouts, and ensures that warnings and errors are properly logged and handled.

Release Notes:

  • Adds timeout error handling for the Kafka writer

Closes #75

Summary by CodeRabbit

  • Improvements

    • Implemented automatic retry logic to improve message delivery reliability
    • Enhanced timeout handling for message persistence operations
    • Improved error tracking with separate logging for different failure types
    • Better diagnostics through enhanced error reporting
  • Chores

    • Updated Kafka library dependency to latest compatible version
    • Updated setup documentation for Python environment configuration

@coderabbitai

coderabbitai bot commented Nov 4, 2025

Walkthrough

The pull request adds resilience to Kafka message flushing by introducing retry logic with a fixed backoff delay, timeout handling with fallback support, and improved error tracking. Dependencies are updated, and documentation is modified to use pip3. New test coverage validates retry behavior and edge cases.

Changes

  • Documentation & Configuration (DEVELOPER.md, requirements.txt): Documentation updated to use pip3 instead of pip for Python environment setup; confluent-kafka dependency bumped from 2.11.1 to 2.12.1.
  • Kafka Writer Core Logic (src/writer_kafka.py): Introduces a retry loop with configurable max retries and backoff timing for Kafka flush operations; adds a flush_with_timeout() helper with a TypeError fallback for stub compatibility; enhances error tracking with separate produce/flush error counters; logs a timeout warning when messages remain pending after retries are exhausted.
  • Test Suite (tests/test_writer_kafka.py): Adds three FakeProducer variants (FakeProducerFlushSequence, FakeProducerTimeout, FakeProducerTypeError) to simulate flush behaviors; introduces three new test cases covering retry success paths, timeout warnings, and TypeError fallback scenarios.

Sequence Diagram

sequenceDiagram
    participant Client
    participant write_messages
    participant flush_with_timeout
    participant Producer
    participant Kafka

    Client->>write_messages: write_messages(records)
    write_messages->>Producer: produce(record) [×N]
    write_messages->>flush_with_timeout: flush_with_timeout(producer, timeout)
    
    rect rgb(100, 150, 255)
    Note over flush_with_timeout: Retry Loop (up to MAX_RETRIES)
    loop Each retry
        flush_with_timeout->>Producer: flush(timeout)
        alt Messages pending
            Producer-->>flush_with_timeout: remaining > 0
            flush_with_timeout->>flush_with_timeout: wait RETRY_BACKOFF_SEC
        else All flushed
            Producer-->>flush_with_timeout: remaining = 0
            flush_with_timeout-->>write_messages: success (0)
        end
    end
    end
    
    alt Retries exhausted & pending remain
        flush_with_timeout-->>write_messages: warning + return remaining
    else TypeError (stub fallback)
        flush_with_timeout->>Producer: flush() [no timeout]
        Producer-->>flush_with_timeout: result
    end
    
    write_messages->>write_messages: aggregate errors (Produce + Flush)
    write_messages-->>Client: result (success/failure)
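The retry loop in the diagram can be sketched as below. Names and defaults mirror the walkthrough (flush_with_timeout, _MAX_RETRIES, _RETRY_BACKOFF_SEC, a 7 s flush timeout); this is an illustrative sketch, not the PR's exact implementation:

```python
import time

_MAX_RETRIES = 3
_RETRY_BACKOFF_SEC = 0.5
_KAFKA_FLUSH_TIMEOUT_SEC = 7.0

def flush_with_retries(producer, flush_with_timeout):
    """Retry flushing until nothing is pending or retries are exhausted.

    Returns 0 on success, otherwise the number of messages still pending.
    """
    remaining = None
    for attempt in range(1, _MAX_RETRIES + 1):
        remaining = flush_with_timeout(producer, _KAFKA_FLUSH_TIMEOUT_SEC)
        # Some stubs return None from flush(); treat that as nothing pending.
        if remaining is None or remaining == 0:
            return 0
        if attempt < _MAX_RETRIES:
            # Back off before the next attempt (fixed delay, not exponential).
            time.sleep(_RETRY_BACKOFF_SEC)
    return remaining
```

A non-zero return after the loop is the case the PR handles with a timeout warning.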

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

  • src/writer_kafka.py: Review the retry loop logic carefully, particularly the condition checks for retry continuation and the fallback to TypeError handling in flush_with_timeout(). Verify that the timeout parameter is correctly passed and that error aggregation properly distinguishes between produce and flush failures.
  • tests/test_writer_kafka.py: Validate that the three new FakeProducer classes correctly simulate the intended behaviors (decreasing sequence, persistent timeout, signature mismatch) and that test assertions properly verify retry counts, warning logs, and success conditions.
  • requirements.txt: Confirm that confluent-kafka 2.12.1 maintains API compatibility and does not introduce breaking changes affecting the new timeout-based flush calls.

Poem

🐰 Retry, retry, don't give up so quick,
Kafka's flushing now with a trick,
Backoff and timeout, loop till it's done,
Messages pending? We'll retry for fun!
No more intermittent errors in sight,
The warren's Kafka now flushes just right! 🌙

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
  • Out of Scope Changes check (⚠️ Warning): The pip3 update in DEVELOPER.md and the confluent-kafka version bump in requirements.txt are minor out-of-scope changes unrelated to the timeout handling objectives. Resolution: remove the DEVELOPER.md pip→pip3 change and revert confluent-kafka to 2.11.1, or create separate PRs for infrastructure and dependency updates to keep the focus on the core timeout handling feature.
✅ Passed checks (3 passed)
  • Description Check (✅ Passed): Check skipped; CodeRabbit's high-level summary is enabled.
  • Title check (✅ Passed): The title "Flush timeout error handling for Kafka writer" directly and concisely summarizes the main change: adding timeout error handling to the Kafka writer component.
  • Linked Issues check (✅ Passed): The code changes comprehensively address the issue #75 objectives: retry logic on flush timeouts, enhanced error handling with proper logging, timeout configuration via environment variables, and graceful degradation from errors to warnings.

📜 Recent review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 9e54970 and fd3b1ba.

📒 Files selected for processing (1)
  • src/writer_kafka.py (4 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
src/writer_kafka.py (2)
tests/test_event_gate_lambda.py (1)
  • flush (51-52)
tests/test_writer_kafka.py (4)
  • flush (15-16)
  • flush (31-38)
  • flush (47-49)
  • flush (58-60)
🔇 Additional comments (1)
src/writer_kafka.py (1)

128-136: Do not return success when flush still has pending messages.

If remaining > 0 after exhausting retries, we only log a warning, so errors stays empty and the function returns (True, None). That tells callers the write succeeded even though Kafka still has undelivered records, which silently drops data under the exact timeout scenario this PR is meant to address. Please make this path fail the write (e.g., append an error entry so the final block returns False).

-    if isinstance(remaining, int) and remaining > 0:
-        logger.warning(
-            "Kafka flush timeout after %ss: %d message(s) still pending", _KAFKA_FLUSH_TIMEOUT_SEC, remaining
-        )
+    if isinstance(remaining, int) and remaining > 0:
+        timeout_msg = (
+            f"Kafka flush timeout after {_KAFKA_FLUSH_TIMEOUT_SEC}s: {remaining} message(s) still pending"
+        )
+        logger.warning(timeout_msg)
+        errors.append(timeout_msg)



@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (1)
tests/test_writer_kafka.py (1)

52-60: Remove unused noqa directive.

The FakeProducerTypeError correctly tests the TypeError fallback path. However, line 58 has an unnecessary noqa: D401 directive since there's no docstring to which D401 would apply.

Apply this diff:

-    def flush(self):  # noqa: D401
+    def flush(self):
         self.flush_calls += 1
         return 0
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 4af9f10 and 647b693.

📒 Files selected for processing (4)
  • DEVELOPER.md (1 hunks)
  • requirements.txt (1 hunks)
  • src/writer_kafka.py (3 hunks)
  • tests/test_writer_kafka.py (2 hunks)
🧰 Additional context used
🧬 Code graph analysis (2)
tests/test_writer_kafka.py (1)
src/writer_kafka.py (1)
  • write (80-133)
src/writer_kafka.py (2)
tests/test_event_gate_lambda.py (1)
  • flush (51-52)
tests/test_writer_kafka.py (4)
  • flush (15-16)
  • flush (31-38)
  • flush (47-49)
  • flush (58-60)
🪛 Ruff (0.14.3)
tests/test_writer_kafka.py

31-31: Unused method argument: a

(ARG002)


31-31: Unused method argument: kw

(ARG002)


47-47: Unused method argument: a

(ARG002)


47-47: Unused method argument: kw

(ARG002)


58-58: Unused noqa directive (non-enabled: D401)

Remove unused noqa directive

(RUF100)

src/writer_kafka.py

128-128: Consider moving this statement to an else block

(TRY300)

🔇 Additional comments (9)
DEVELOPER.md (1)

31-31: LGTM! Clear Python 3 specification.

The change to pip3 aligns with the Python 3.13 requirement and ensures developers explicitly use the correct package manager.

src/writer_kafka.py (4)

25-25: LGTM! Import required for retry backoff.

The time module is appropriately added to support the retry sleep mechanism.


38-41: LGTM! Well-designed configurable retry parameters.

The configuration defaults are sensible:

  • Increased flush timeout (5s → 7s) provides more tolerance for transient delays
  • 3 retries with 0.5s backoff bound the worst-case wait at roughly 3 × 7s + 2 × 0.5s = 22s if every attempt runs out the timeout
  • Environment-driven configuration enables tuning without code changes
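Assuming the environment variable names cited elsewhere in this review (KAFKA_FLUSH_TIMEOUT, KAFKA_FLUSH_RETRIES, KAFKA_RETRY_BACKOFF) and the defaults above, the configuration block looks roughly like this; max_wait_seconds() is a hypothetical helper added here only to illustrate the worst-case wait, assuming every flush attempt blocks for the full timeout:

```python
import os

# Defaults match the review summary: 7 s flush timeout, 3 retries, 0.5 s backoff.
_KAFKA_FLUSH_TIMEOUT_SEC = float(os.environ.get("KAFKA_FLUSH_TIMEOUT", "7"))
_MAX_RETRIES = int(os.environ.get("KAFKA_FLUSH_RETRIES", "3"))
_RETRY_BACKOFF_SEC = float(os.environ.get("KAFKA_RETRY_BACKOFF", "0.5"))

def max_wait_seconds():
    """Worst-case blocking time if every flush attempt runs out the timeout."""
    return (_MAX_RETRIES * _KAFKA_FLUSH_TIMEOUT_SEC
            + (_MAX_RETRIES - 1) * _RETRY_BACKOFF_SEC)
```

With the defaults, that bound is 3 × 7 s + 2 × 0.5 s = 22 s.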

105-128: Retry logic is well-implemented, but verify the pending-message success behavior.

The retry mechanism correctly:

  • Attempts flush up to _MAX_RETRIES times with a fixed backoff delay between attempts
  • Handles stub producers that return None from flush
  • Aggregates errors from produce callbacks
  • Logs appropriate warnings during retry attempts

However, lines 123-128 return True even when messages remain pending after all retries. This means the function succeeds despite potential message loss. Please confirm this aligns with the intended behavior for handling transient Kafka issues.

Consider documenting this behavior in the function docstring to clarify that the function may return success with a warning when messages are still pending but no explicit errors occurred.


130-133: LGTM! Clear exception handling with proper logging.

The KafkaException handling appropriately logs the full exception context and returns a descriptive error message.

tests/test_writer_kafka.py (3)

25-38: LGTM! Well-designed test fixture for retry scenarios.

The FakeProducerFlushSequence correctly simulates decreasing pending message counts across retry attempts. The *a, **kw parameters in the flush method are intentional to accept the timeout argument used in production code.
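A fixture matching that description might look like the following sketch (a hypothetical reconstruction; the real class lives in tests/test_writer_kafka.py):

```python
class FakeProducerFlushSequence:
    """Fake producer whose flush() reports a decreasing pending-message count."""

    def __init__(self, sequence=(2, 1, 0)):
        self._pending = list(sequence)
        self.flush_calls = 0

    def flush(self, *a, **kw):
        # *a/**kw intentionally absorb the timeout passed by production code.
        self.flush_calls += 1
        if self._pending:
            return self._pending.pop(0)
        return 0
```

Each call simulates one retry attempt: the first flush leaves 2 messages pending, the second leaves 1, and the third drains the queue.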


41-49: LGTM! Appropriate fixture for timeout scenarios.

The FakeProducerTimeout effectively simulates persistent pending messages to test the timeout warning path. The *a, **kw signature is intentional.


102-141: LGTM! Comprehensive test coverage for new retry and timeout behavior.

The three new tests effectively cover:

  1. Retry until success (102-115): Verifies retry loop breaks on zero remaining messages and logs appropriate warnings
  2. Timeout warning (118-130): Confirms the function returns success with a warning when messages remain pending after all retries
  3. TypeError fallback (133-141): Tests the compatibility layer for stub producers without timeout parameter support

The test assertions are precise and the use of monkeypatch for configuration is appropriate.

requirements.txt (1)

14-14: Confluent-kafka 2.12.1 is current and secure.

The version 2.12.1 is the latest release on PyPI, and no security vulnerabilities were found. The minor version bump is backward compatible and requires no action.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

🧹 Nitpick comments (2)
src/writer_kafka.py (2)

38-41: Add validation for environment variable values.

The configuration values are read from environment variables and cast to float/int without validation. Invalid or negative values could cause unexpected behavior or runtime errors.

Consider adding validation:

 # Configurable flush timeouts and retries via env variables to avoid hanging indefinitely
-_KAFKA_FLUSH_TIMEOUT_SEC = float(os.environ.get("KAFKA_FLUSH_TIMEOUT", "7"))
-_MAX_RETRIES = int(os.environ.get("KAFKA_FLUSH_RETRIES", "3"))
-_RETRY_BACKOFF_SEC = float(os.environ.get("KAFKA_RETRY_BACKOFF", "0.5"))
+_KAFKA_FLUSH_TIMEOUT_SEC = max(0.1, float(os.environ.get("KAFKA_FLUSH_TIMEOUT", "7")))
+_MAX_RETRIES = max(1, int(os.environ.get("KAFKA_FLUSH_RETRIES", "3")))
+_RETRY_BACKOFF_SEC = max(0.0, float(os.environ.get("KAFKA_RETRY_BACKOFF", "0.5")))

105-116: Consider breaking early when delivery errors are detected.

The retry loop continues even when the produce callback has already reported delivery errors. Once a message fails to deliver (errors list is populated), retrying flush won't help since Kafka won't retry a failed delivery—it only helps with timeouts/pending messages.

Consider adding an early break on errors:

         remaining: Optional[int] = None
         for attempt in range(1, _MAX_RETRIES + 1):
             remaining = flush_with_timeout(producer, _KAFKA_FLUSH_TIMEOUT_SEC)
+            # Break early if delivery failed
+            if errors:
+                break
             # Treat None (flush returns None in some stubs) as success equivalent to 0 pending
-            if (remaining is None or remaining == 0) and not errors:
+            if remaining is None or remaining == 0:
                 break
             if attempt < _MAX_RETRIES:
                 logger.warning(
                     "Kafka flush pending (%s message(s) remain) attempt %d/%d", remaining, attempt, _MAX_RETRIES
                 )
                 time.sleep(_RETRY_BACKOFF_SEC)
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 647b693 and 9e54970.

📒 Files selected for processing (1)
  • src/writer_kafka.py (3 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
src/writer_kafka.py (3)
tests/test_extract_token.py (1)
  • get (37-38)
tests/test_event_gate_lambda.py (2)
  • get (85-86)
  • flush (51-52)
tests/test_writer_kafka.py (4)
  • flush (15-16)
  • flush (31-38)
  • flush (47-49)
  • flush (58-60)
🪛 Ruff (0.14.3)
src/writer_kafka.py

128-128: Consider moving this statement to an else block

(TRY300)

🔇 Additional comments (2)
src/writer_kafka.py (2)

122-128: Clarify success criteria when messages remain pending.

The function returns True (success) even when remaining > 0 (messages still pending after exhausting retries). This means messages that were never delivered are treated as successful writes, which could mask delivery failures from the caller's perspective.

Is this intentional? If messages are still pending, should this be considered a failure rather than just a warning? Please verify that returning success for pending messages aligns with the expected behavior for downstream consumers of this API.

Note: The static analysis tool (Ruff TRY300) also suggests moving the return statement to an else block for better structure:

-        # Log a warning if there are still pending messages after retries
-        if isinstance(remaining, int) and remaining > 0:
-            logger.warning(
-                "Kafka flush timeout after %ss: %d message(s) still pending", _KAFKA_FLUSH_TIMEOUT_SEC, remaining
-            )
-
-        return True, None
+        # Log a warning if there are still pending messages after retries
+        if isinstance(remaining, int) and remaining > 0:
+            logger.warning(
+                "Kafka flush timeout after %ss: %d message(s) still pending", _KAFKA_FLUSH_TIMEOUT_SEC, remaining
+            )
+        else:
+            return True, None

However, you may want to reconsider whether pending messages should return False instead.


136-149: LGTM! Return type correctly specified.

The return type Optional[int] correctly reflects that stub producers may return None. The TypeError fallback for stub producers is also handled appropriately.

Collaborator

@oto-macenauer-absa oto-macenauer-absa left a comment


Have a look at the publish flow, please.

@github-actions

github-actions bot commented Nov 4, 2025

Trivy has completed a full security repository scan ✅ You can find the analysis results for this PR branch on this overview.
Below is the summary of the findings:

TRIVY              CRITICAL  HIGH  MEDIUM  LOW  TOTAL
vulnerability      0         0     0       0    0
secret             0         6     0       0    6
misconfiguration   1         2     2       10   15
license            0         0     0       0    0
➡️ Total           1         8     2       10   21

@oto-macenauer-absa oto-macenauer-absa merged commit f671bbf into master Nov 5, 2025
16 checks passed
@oto-macenauer-absa oto-macenauer-absa deleted the feature/75-kafka-timeout-based-errror branch November 5, 2025 08:15


Development

Successfully merging this pull request may close these issues.

Intermittent errors caused by Kafka Flush Timeout

3 participants