
Conversation

Collaborator

@tmikula-dev tmikula-dev commented Nov 4, 2025

Overview

This pull request makes Kafka message delivery more robust against transient failures and timeouts, and ensures that warnings and errors are properly logged and handled.

Release Notes:

  • Adds timeout error handling for the Kafka writer

Closes #75

Summary by CodeRabbit

  • Improvements

    • Implemented automatic retry logic to improve message delivery reliability
    • Enhanced timeout handling for message persistence operations
    • Improved error tracking with separate logging for different failure types
    • Better diagnostics through enhanced error reporting
  • Chores

    • Updated Kafka library dependency to latest compatible version
    • Updated setup documentation for Python environment configuration

@coderabbitai

coderabbitai bot commented Nov 4, 2025

Walkthrough

The pull request adds resilience to Kafka message flushing by introducing retry logic with a fixed backoff delay, timeout handling with fallback support, and improved error tracking. Dependencies are updated, and documentation is modified to use pip3. New test coverage validates retry behavior and edge cases.

Changes

  • Documentation & Configuration (DEVELOPER.md, requirements.txt): Documentation updated to use pip3 instead of pip for Python environment setup; confluent-kafka dependency bumped from 2.11.1 to 2.12.1.
  • Kafka Writer Core Logic (src/writer_kafka.py): Introduces a retry loop with configurable max retries and backoff timing for Kafka flush operations; adds a flush_with_timeout() helper with a TypeError fallback for stub compatibility; enhances error tracking with separate produce/flush error counters; logs a timeout warning when messages remain pending after retries are exhausted.
  • Test Suite (tests/test_writer_kafka.py): Adds three FakeProducer variants (FakeProducerFlushSequence, FakeProducerTimeout, FakeProducerTypeError) to simulate flush behaviors; introduces three new test cases covering retry success paths, timeout warnings, and TypeError fallback scenarios.

Sequence Diagram

sequenceDiagram
    participant Client
    participant write_messages
    participant flush_with_timeout
    participant Producer
    participant Kafka

    Client->>write_messages: write_messages(records)
    write_messages->>Producer: produce(record) [×N]
    write_messages->>flush_with_timeout: flush_with_timeout(producer, timeout)
    
    rect rgb(100, 150, 255)
    Note over flush_with_timeout: Retry Loop (up to MAX_RETRIES)
    loop Each retry
        flush_with_timeout->>Producer: flush(timeout)
        alt Messages pending
            Producer-->>flush_with_timeout: remaining > 0
            flush_with_timeout->>flush_with_timeout: wait RETRY_BACKOFF_SEC
        else All flushed
            Producer-->>flush_with_timeout: remaining = 0
            flush_with_timeout-->>write_messages: success (0)
        end
    end
    end
    
    alt Retries exhausted & pending remain
        flush_with_timeout-->>write_messages: warning + return remaining
    else TypeError (stub fallback)
        flush_with_timeout->>Producer: flush() [no timeout]
        Producer-->>flush_with_timeout: result
    end
    
    write_messages->>write_messages: aggregate errors (Produce + Flush)
    write_messages-->>Client: result (success/failure)
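The retry loop in the diagram can be sketched as below. Names and defaults mirror the walkthrough (flush_with_timeout, _MAX_RETRIES, _RETRY_BACKOFF_SEC, a 7 s flush timeout); this is an illustrative sketch, not the PR's exact implementation:

```python
import time

_MAX_RETRIES = 3
_RETRY_BACKOFF_SEC = 0.5
_KAFKA_FLUSH_TIMEOUT_SEC = 7.0

def flush_with_retries(producer, flush_with_timeout):
    """Retry flushing until nothing is pending or retries are exhausted.

    Returns 0 on success, otherwise the number of messages still pending.
    """
    remaining = None
    for attempt in range(1, _MAX_RETRIES + 1):
        remaining = flush_with_timeout(producer, _KAFKA_FLUSH_TIMEOUT_SEC)
        # Some stubs return None from flush(); treat that as nothing pending.
        if remaining is None or remaining == 0:
            return 0
        if attempt < _MAX_RETRIES:
            # Back off before the next attempt (fixed delay, not exponential).
            time.sleep(_RETRY_BACKOFF_SEC)
    return remaining
```

A non-zero return after the loop is the case the PR handles with a timeout warning.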

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

  • src/writer_kafka.py: Review the retry loop logic carefully, particularly the condition checks for retry continuation and the fallback to TypeError handling in flush_with_timeout(). Verify that the timeout parameter is correctly passed and that error aggregation properly distinguishes between produce and flush failures.
  • tests/test_writer_kafka.py: Validate that the three new FakeProducer classes correctly simulate the intended behaviors (decreasing sequence, persistent timeout, signature mismatch) and that test assertions properly verify retry counts, warning logs, and success conditions.
  • requirements.txt: Confirm that confluent-kafka 2.12.1 maintains API compatibility and does not introduce breaking changes affecting the new timeout-based flush calls.

Poem

🐰 Retry, retry, don't give up so quick,
Kafka's flushing now with a trick,
Backoff and timeout, loop till it's done,
Messages pending? We'll retry for fun!
No more intermittent errors in sight,
The warren's Kafka now flushes just right! 🌙

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
  • Out of Scope Changes check (⚠️ Warning): The pip3 update in DEVELOPER.md and the confluent-kafka version bump in requirements.txt are minor out-of-scope changes unrelated to the timeout handling objectives. Resolution: remove the DEVELOPER.md pip→pip3 change and revert confluent-kafka to 2.11.1, or create separate PRs for infrastructure and dependency updates to keep the focus on the core timeout handling feature.
✅ Passed checks (3 passed)
  • Description Check (✅ Passed): Check skipped; CodeRabbit's high-level summary is enabled.
  • Title check (✅ Passed): The title "Flush timeout error handling for Kafka writer" directly and concisely summarizes the main change: adding timeout error handling to the Kafka writer component.
  • Linked Issues check (✅ Passed): The code changes comprehensively address the issue #75 objectives: retry logic on flush timeouts, enhanced error handling with proper logging, timeout configuration via environment variables, and graceful degradation from errors to warnings.

📜 Recent review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 9e54970 and fd3b1ba.

📒 Files selected for processing (1)
  • src/writer_kafka.py (4 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
src/writer_kafka.py (2)
tests/test_event_gate_lambda.py (1)
  • flush (51-52)
tests/test_writer_kafka.py (4)
  • flush (15-16)
  • flush (31-38)
  • flush (47-49)
  • flush (58-60)
🔇 Additional comments (1)
src/writer_kafka.py (1)

128-136: Do not return success when flush still has pending messages.

If remaining > 0 after exhausting retries, we only log a warning, so errors stays empty and the function returns (True, None). That tells callers the write succeeded even though Kafka still has undelivered records, which silently drops data under the exact timeout scenario this PR is meant to address. Please make this path fail the write (e.g., append an error entry so the final block returns False).

-    if isinstance(remaining, int) and remaining > 0:
-        logger.warning(
-            "Kafka flush timeout after %ss: %d message(s) still pending", _KAFKA_FLUSH_TIMEOUT_SEC, remaining
-        )
+    if isinstance(remaining, int) and remaining > 0:
+        timeout_msg = (
+            f"Kafka flush timeout after {_KAFKA_FLUSH_TIMEOUT_SEC}s: {remaining} message(s) still pending"
+        )
+        logger.warning(timeout_msg)
+        errors.append(timeout_msg)



@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (1)
tests/test_writer_kafka.py (1)

52-60: Remove unused noqa directive.

The FakeProducerTypeError correctly tests the TypeError fallback path. However, line 58 has an unnecessary noqa: D401 directive since there's no docstring to which D401 would apply.

Apply this diff:

-    def flush(self):  # noqa: D401
+    def flush(self):
         self.flush_calls += 1
         return 0
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 4af9f10 and 647b693.

📒 Files selected for processing (4)
  • DEVELOPER.md (1 hunks)
  • requirements.txt (1 hunks)
  • src/writer_kafka.py (3 hunks)
  • tests/test_writer_kafka.py (2 hunks)
🧰 Additional context used
🧬 Code graph analysis (2)
tests/test_writer_kafka.py (1)
src/writer_kafka.py (1)
  • write (80-133)
src/writer_kafka.py (2)
tests/test_event_gate_lambda.py (1)
  • flush (51-52)
tests/test_writer_kafka.py (4)
  • flush (15-16)
  • flush (31-38)
  • flush (47-49)
  • flush (58-60)
🪛 Ruff (0.14.3)
tests/test_writer_kafka.py

31-31: Unused method argument: a

(ARG002)


31-31: Unused method argument: kw

(ARG002)


47-47: Unused method argument: a

(ARG002)


47-47: Unused method argument: kw

(ARG002)


58-58: Unused noqa directive (non-enabled: D401)

Remove unused noqa directive

(RUF100)

src/writer_kafka.py

128-128: Consider moving this statement to an else block

(TRY300)

🔇 Additional comments (9)
DEVELOPER.md (1)

31-31: LGTM! Clear Python 3 specification.

The change to pip3 aligns with the Python 3.13 requirement and ensures developers explicitly use the correct package manager.

src/writer_kafka.py (4)

25-25: LGTM! Import required for retry backoff.

The time module is appropriately added to support the retry sleep mechanism.


38-41: LGTM! Well-designed configurable retry parameters.

The configuration defaults are sensible:

  • Increased flush timeout (5s → 7s) provides more tolerance for transient delays
  • 3 retries with 0.5s backoff bound the worst-case wait at roughly 3 × 7s + 2 × 0.5s = 22s if every attempt runs out the timeout
  • Environment-driven configuration enables tuning without code changes
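Assuming the environment variable names cited elsewhere in this review (KAFKA_FLUSH_TIMEOUT, KAFKA_FLUSH_RETRIES, KAFKA_RETRY_BACKOFF) and the defaults above, the configuration block looks roughly like this; max_wait_seconds() is a hypothetical helper added here only to illustrate the worst-case wait, assuming every flush attempt blocks for the full timeout:

```python
import os

# Defaults match the review summary: 7 s flush timeout, 3 retries, 0.5 s backoff.
_KAFKA_FLUSH_TIMEOUT_SEC = float(os.environ.get("KAFKA_FLUSH_TIMEOUT", "7"))
_MAX_RETRIES = int(os.environ.get("KAFKA_FLUSH_RETRIES", "3"))
_RETRY_BACKOFF_SEC = float(os.environ.get("KAFKA_RETRY_BACKOFF", "0.5"))

def max_wait_seconds():
    """Worst-case blocking time if every flush attempt runs out the timeout."""
    return (_MAX_RETRIES * _KAFKA_FLUSH_TIMEOUT_SEC
            + (_MAX_RETRIES - 1) * _RETRY_BACKOFF_SEC)
```

With the defaults, that bound is 3 × 7 s + 2 × 0.5 s = 22 s.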

105-128: Retry logic is well-implemented, but verify the pending-message success behavior.

The retry mechanism correctly:

  • Attempts flush up to _MAX_RETRIES times with a fixed backoff delay between attempts
  • Handles stub producers that return None from flush
  • Aggregates errors from produce callbacks
  • Logs appropriate warnings during retry attempts

However, lines 123-128 return True even when messages remain pending after all retries. This means the function succeeds despite potential message loss. Please confirm this aligns with the intended behavior for handling transient Kafka issues.

Consider documenting this behavior in the function docstring to clarify that the function may return success with a warning when messages are still pending but no explicit errors occurred.


130-133: LGTM! Clear exception handling with proper logging.

The KafkaException handling appropriately logs the full exception context and returns a descriptive error message.

tests/test_writer_kafka.py (3)

25-38: LGTM! Well-designed test fixture for retry scenarios.

The FakeProducerFlushSequence correctly simulates decreasing pending message counts across retry attempts. The *a, **kw parameters in the flush method are intentional to accept the timeout argument used in production code.
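A fixture matching that description might look like the following sketch (a hypothetical reconstruction; the real class lives in tests/test_writer_kafka.py):

```python
class FakeProducerFlushSequence:
    """Fake producer whose flush() reports a decreasing pending-message count."""

    def __init__(self, sequence=(2, 1, 0)):
        self._pending = list(sequence)
        self.flush_calls = 0

    def flush(self, *a, **kw):
        # *a/**kw intentionally absorb the timeout passed by production code.
        self.flush_calls += 1
        if self._pending:
            return self._pending.pop(0)
        return 0
```

Each call simulates one retry attempt: the first flush leaves 2 messages pending, the second leaves 1, and the third drains the queue.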


41-49: LGTM! Appropriate fixture for timeout scenarios.

The FakeProducerTimeout effectively simulates persistent pending messages to test the timeout warning path. The *a, **kw signature is intentional.


102-141: LGTM! Comprehensive test coverage for new retry and timeout behavior.

The three new tests effectively cover:

  1. Retry until success (102-115): Verifies retry loop breaks on zero remaining messages and logs appropriate warnings
  2. Timeout warning (118-130): Confirms the function returns success with a warning when messages remain pending after all retries
  3. TypeError fallback (133-141): Tests the compatibility layer for stub producers without timeout parameter support

The test assertions are precise and the use of monkeypatch for configuration is appropriate.

requirements.txt (1)

14-14: Confluent-kafka 2.12.1 is current and secure.

The version 2.12.1 is the latest release on PyPI, and no security vulnerabilities were found. The minor version bump is backward compatible and requires no action.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

🧹 Nitpick comments (2)
src/writer_kafka.py (2)

38-41: Add validation for environment variable values.

The configuration values are read from environment variables and cast to float/int without validation. Invalid or negative values could cause unexpected behavior or runtime errors.

Consider adding validation:

 # Configurable flush timeouts and retries via env variables to avoid hanging indefinitely
-_KAFKA_FLUSH_TIMEOUT_SEC = float(os.environ.get("KAFKA_FLUSH_TIMEOUT", "7"))
-_MAX_RETRIES = int(os.environ.get("KAFKA_FLUSH_RETRIES", "3"))
-_RETRY_BACKOFF_SEC = float(os.environ.get("KAFKA_RETRY_BACKOFF", "0.5"))
+_KAFKA_FLUSH_TIMEOUT_SEC = max(0.1, float(os.environ.get("KAFKA_FLUSH_TIMEOUT", "7")))
+_MAX_RETRIES = max(1, int(os.environ.get("KAFKA_FLUSH_RETRIES", "3")))
+_RETRY_BACKOFF_SEC = max(0.0, float(os.environ.get("KAFKA_RETRY_BACKOFF", "0.5")))

105-116: Consider breaking early when delivery errors are detected.

The retry loop continues even when the produce callback has already reported delivery errors. Once a message fails to deliver (errors list is populated), retrying flush won't help since Kafka won't retry a failed delivery—it only helps with timeouts/pending messages.

Consider adding an early break on errors:

         remaining: Optional[int] = None
         for attempt in range(1, _MAX_RETRIES + 1):
             remaining = flush_with_timeout(producer, _KAFKA_FLUSH_TIMEOUT_SEC)
+            # Break early if delivery failed
+            if errors:
+                break
             # Treat None (flush returns None in some stubs) as success equivalent to 0 pending
-            if (remaining is None or remaining == 0) and not errors:
+            if remaining is None or remaining == 0:
                 break
             if attempt < _MAX_RETRIES:
                 logger.warning(
                     "Kafka flush pending (%s message(s) remain) attempt %d/%d", remaining, attempt, _MAX_RETRIES
                 )
                 time.sleep(_RETRY_BACKOFF_SEC)
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 647b693 and 9e54970.

📒 Files selected for processing (1)
  • src/writer_kafka.py (3 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
src/writer_kafka.py (3)
tests/test_extract_token.py (1)
  • get (37-38)
tests/test_event_gate_lambda.py (2)
  • get (85-86)
  • flush (51-52)
tests/test_writer_kafka.py (4)
  • flush (15-16)
  • flush (31-38)
  • flush (47-49)
  • flush (58-60)
🪛 Ruff (0.14.3)
src/writer_kafka.py

128-128: Consider moving this statement to an else block

(TRY300)

🔇 Additional comments (2)
src/writer_kafka.py (2)

122-128: Clarify success criteria when messages remain pending.

The function returns True (success) even when remaining > 0 (messages still pending after exhausting retries). This means messages that were never delivered are treated as successful writes, which could mask delivery failures from the caller's perspective.

Is this intentional? If messages are still pending, should this be considered a failure rather than just a warning? Please verify that returning success for pending messages aligns with the expected behavior for downstream consumers of this API.

Note: The static analysis tool (Ruff TRY300) also suggests moving the return statement to an else block for better structure:

-        # Log a warning if there are still pending messages after retries
-        if isinstance(remaining, int) and remaining > 0:
-            logger.warning(
-                "Kafka flush timeout after %ss: %d message(s) still pending", _KAFKA_FLUSH_TIMEOUT_SEC, remaining
-            )
-
-        return True, None
+        # Log a warning if there are still pending messages after retries
+        if isinstance(remaining, int) and remaining > 0:
+            logger.warning(
+                "Kafka flush timeout after %ss: %d message(s) still pending", _KAFKA_FLUSH_TIMEOUT_SEC, remaining
+            )
+        else:
+            return True, None

However, you may want to reconsider whether pending messages should return False instead.


136-149: LGTM! Return type correctly specified.

The return type Optional[int] correctly reflects that stub producers may return None. The TypeError fallback for stub producers is also handled appropriately.

Collaborator

@oto-macenauer-absa oto-macenauer-absa left a comment


Have a look at the publish flow, please.

@github-actions

github-actions bot commented Nov 4, 2025

Trivy has completed a full security repository scan ✅ You can find the analysis results for this PR branch on this overview.
Below is the summary of the findings:

TRIVY              CRITICAL  HIGH  MEDIUM  LOW  TOTAL
vulnerability      0         0     0       0    0
secret             0         6     0       0    6
misconfiguration   1         2     2       10   15
license            0         0     0       0    0
➡️ Total           1         8     2       10   21

@oto-macenauer-absa oto-macenauer-absa merged commit f671bbf into master Nov 5, 2025
16 checks passed
@oto-macenauer-absa oto-macenauer-absa deleted the feature/75-kafka-timeout-based-errror branch November 5, 2025 08:15


Development

Successfully merging this pull request may close these issues.

Intermittent errors caused by Kafka Flush Timeout

3 participants