@elisafalk
Collaborator

@elisafalk elisafalk commented Dec 4, 2025

Changes requested:

In case of orchestrator crashes, add recovery logic in report_repository.py that detects reports stuck in running state for too long and marks them with a timeout status. Store this logic in a function named recover_stalled_reports.

Please review.

Summary by CodeRabbit

  • New Features

    • Reports now support a "timed out" status for executions that exceed configured time limits.
    • Enhanced report diagnostics with captured error messages, structured error details, timing alerts, and generation time metrics.
    • Automatic recovery to detect and mark stalled reports as timed out.
  • Tests

    • Added unit tests to validate stalled-report recovery behavior.
  • Chores

    • Migration scaffolding and migration-runner flow updates (internal).

@coderabbitai

coderabbitai bot commented Dec 4, 2025

Warning

Rate limit exceeded

@elisafalk has exceeded the limit for the number of commits or files that can be reviewed per hour. Please wait 10 minutes and 28 seconds before requesting another review.

📥 Commits

Reviewing files that changed from the base of the PR and between 64773d8 and 6c9b093.

📒 Files selected for processing (1)
  • backend/app/db/migrations/env.py (2 hunks)

Walkthrough

Adds TIMED_OUT to ReportStatusEnum, extends ReportState with error and timing columns, updates repository session usage and adds recover_stalled_reports(timeout_minutes), includes unit test for recovery, and adjusts migrations/env.py to support sync autogenerate path.

Changes

| Cohort | File(s) | Summary |
|---|---|---|
| Model Extensions | backend/app/db/models/report_state.py | Added TIMED_OUT to ReportStatusEnum; added nullable columns error_message (String), errors (JSON), timing_alerts (JSON), and generation_time (Float) to ReportState. |
| Repository Implementation | backend/app/db/repositories/report_repository.py | Replaced `async with await self.session_factory()` with `async with self.session_factory()` across methods; added recover_stalled_reports(timeout_minutes: int) -> int to mark stale RUNNING-like states as TIMED_OUT and set error_message; standardized update result handling to use scalar_one_or_none(); added datetime imports and preserved commit/rollback semantics. |
| Repository Tests | backend/app/db/repositories/tests/test_report_repository.py | New test module with async_session_factory and report_repository fixtures; test_recover_stalled_reports creates reports with different statuses/ages and asserts exactly one RUNNING report is transitioned to TIMED_OUT with expected error_message. |
| Migrations runtime | backend/app/db/migrations/env.py | Added synchronous engine path for autogenerate (uses create_engine and connection.run_sync pattern); retains async runtime for normal operation; reorganized migration-run helper do_run_migrations. |
| Scaffolded migration | backend/app/db/migrations/versions/51d8ee58dab4_add_error_tracking_and_timing_columns_.py | Added new Alembic revision file (revision id present) with no-op upgrade()/downgrade() placeholders. |

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

  • Pay special attention to:
    • recover_stalled_reports correctness: threshold calculation (timezone-aware), UPDATE filter (status set list), returned update count, and error_message content.
    • Async session context changes across repository methods to ensure sessions are acquired and closed correctly.
    • Tests: ensure in-memory SQLite async setup and timestamp manipulation are deterministic.
    • env.py changes: validate autogenerate (sync) vs async migration flows and that migration contexts call do_run_migrations correctly.
    • Migration file scaffold: confirm alembic revision metadata and whether a real migration is required.

Possibly related PRs

Poem

🐰 I nibble on logs and hop through time,
TIMED_OUT now waits where the slow ones climb.
Errors and timings tucked into rows,
A carrot of order where stalled reports doze. 🥕

Pre-merge checks and finishing touches

❌ Failed checks (1 warning, 1 inconclusive)
| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 5.88% which is insufficient. The required threshold is 80.00%. | You can run @coderabbitai generate docstrings to improve docstring coverage. |
| Title check | ❓ Inconclusive | The title is incomplete/truncated: 'Changes: In case of orchestrator crashes, add recovery logi' ends mid-word ('logi' instead of 'logic' or similar). While it references the main feature, the truncation makes it unclear and unprofessional. | Complete the title to properly convey the intent, e.g., 'Add recovery logic for orchestrator crashes' or 'Implement stalled report recovery mechanism'. |

✅ Passed checks (1 passed)

| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
backend/app/db/models/report_state.py (1)

36-39: Create an Alembic migration for the new columns.

The new columns (error_message, errors, timing_alerts, generation_time) have been added to the ReportState model but no migration script exists in backend/app/db/migrations/versions/. Alembic is configured and ready to generate migrations. Create and include the migration script using alembic revision --autogenerate -m "Add error tracking and timing columns to report_state" to ensure existing databases can be properly updated.

🧹 Nitpick comments (2)
backend/app/db/repositories/report_repository.py (1)

103-122: Consider recovering other intermediate "running" states.

The current implementation only recovers reports with RUNNING status. However, the ReportStatusEnum includes several other intermediate states that could also stall during orchestrator crashes:

  • RUNNING_AGENTS
  • GENERATING_NLG
  • GENERATING_SUMMARY

If an orchestrator crashes while a report is in any of these states, it would not be recovered.

 async def recover_stalled_reports(self, timeout_minutes: int) -> int:
     async with self.session_factory() as session:
         try:
             stalled_threshold = datetime.now(timezone.utc) - timedelta(minutes=timeout_minutes)
+            
+            # All states that indicate active processing
+            running_states = [
+                ReportStatusEnum.RUNNING,
+                ReportStatusEnum.RUNNING_AGENTS,
+                ReportStatusEnum.GENERATING_NLG,
+                ReportStatusEnum.GENERATING_SUMMARY,
+            ]
             
             stmt = update(ReportState).where(
-                ReportState.status == ReportStatusEnum.RUNNING,
+                ReportState.status.in_(running_states),
                 ReportState.updated_at < stalled_threshold
             ).values(
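
The broadened filter suggested in this comment can be sketched in isolation. This is a minimal illustration, not the repository's implementation: it assumes the standard-library sqlite3 module in place of the project's async SQLAlchemy session, and the table layout, the error message text, and the COMPLETED status name are hypothetical.

```python
import sqlite3
from datetime import datetime, timedelta, timezone

# Stand-ins for the ReportStatusEnum members listed in the review comment.
RUNNING_STATES = ("RUNNING", "RUNNING_AGENTS", "GENERATING_NLG", "GENERATING_SUMMARY")


def recover_stalled_reports(conn: sqlite3.Connection, timeout_minutes: int, now: datetime) -> int:
    """Mark rows stuck in any active state past the threshold as TIMED_OUT."""
    threshold = (now - timedelta(minutes=timeout_minutes)).isoformat()
    placeholders = ", ".join("?" for _ in RUNNING_STATES)
    cursor = conn.execute(
        "UPDATE report_state SET status = 'TIMED_OUT', error_message = ? "
        f"WHERE status IN ({placeholders}) AND updated_at < ?",
        ("Report timed out", *RUNNING_STATES, threshold),  # message text is hypothetical
    )
    conn.commit()
    return cursor.rowcount  # number of rows the UPDATE changed


def demo() -> tuple[int, dict[str, str]]:
    now = datetime(2025, 12, 4, 12, 0, tzinfo=timezone.utc)
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE report_state ("
        "report_id TEXT PRIMARY KEY, status TEXT, error_message TEXT, updated_at TEXT)"
    )
    rows = [
        ("fresh", "RUNNING", now - timedelta(minutes=5)),
        ("stale_running", "RUNNING", now - timedelta(minutes=90)),
        ("stale_nlg", "GENERATING_NLG", now - timedelta(minutes=90)),
        ("stale_done", "COMPLETED", now - timedelta(minutes=90)),
    ]
    # Timestamps are stored as ISO-8601 UTC strings, which sort chronologically here
    # because every value shares the same format and offset.
    conn.executemany(
        "INSERT INTO report_state VALUES (?, ?, NULL, ?)",
        [(rid, status, ts.isoformat()) for rid, status, ts in rows],
    )
    count = recover_stalled_reports(conn, 60, now)
    statuses = dict(conn.execute("SELECT report_id, status FROM report_state"))
    return count, statuses
```

With a 60-minute timeout, only the two stale rows in an active state would be expected to change; the fresh RUNNING row and the stale terminal row are left alone.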
backend/app/db/repositories/tests/test_report_repository.py (1)

38-112: Good test coverage for the core functionality.

The test effectively covers:

  • Active report (not stalled) remains unchanged
  • Stalled report is recovered with correct status and error message
  • Failed and completed reports are not affected

Consider adding edge case tests in a follow-up:

  • Test with timeout_minutes=0 (all RUNNING reports should be recovered)
  • Test when no reports match the criteria (verify returns 0)
  • Test boundary condition (report exactly at timeout threshold)
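
The three edge cases listed above can be reasoned about with a small in-memory model before writing the real async tests. The sketch below is an assumption-laden illustration: count_stalled and ACTIVE are hypothetical names, and the strict `<` comparison mirrors the `ReportState.updated_at < stalled_threshold` filter quoted earlier in this review.

```python
from datetime import datetime, timedelta, timezone

ACTIVE = {"RUNNING"}  # hypothetical: the states the recovery query filters on


def count_stalled(reports, now, timeout_minutes):
    """Count (status, updated_at) pairs the strict `updated_at < threshold` filter matches."""
    threshold = now - timedelta(minutes=timeout_minutes)
    return sum(
        1 for status, updated_at in reports
        if status in ACTIVE and updated_at < threshold
    )


NOW = datetime(2025, 12, 4, 12, 0, tzinfo=timezone.utc)

# timeout_minutes=0: the threshold equals NOW, so every RUNNING report updated earlier matches.
zero_timeout = count_stalled(
    [("RUNNING", NOW - timedelta(seconds=1)), ("RUNNING", NOW - timedelta(hours=2))], NOW, 0
)

# No report in an active state: nothing matches and the count is zero.
no_match = count_stalled(
    [("COMPLETED", NOW - timedelta(hours=2)), ("FAILED", NOW - timedelta(hours=2))], NOW, 60
)

# Boundary: a report updated exactly at the threshold is NOT matched by a strict `<`.
at_boundary = count_stalled([("RUNNING", NOW - timedelta(minutes=60))], NOW, 60)
just_past = count_stalled([("RUNNING", NOW - timedelta(minutes=60, seconds=1))], NOW, 60)
```

If a guard that rejects non-positive timeouts is adopted, the timeout_minutes=0 case would need to assert a ValueError instead.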
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 2e7b4f6 and b8dd232.

📒 Files selected for processing (3)
  • backend/app/db/models/report_state.py (1 hunks)
  • backend/app/db/repositories/report_repository.py (8 hunks)
  • backend/app/db/repositories/tests/test_report_repository.py (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (2)
backend/app/db/repositories/tests/test_report_repository.py (3)
backend/app/db/models/report.py (1)
  • Report (4-6)
backend/app/db/models/report_state.py (2)
  • ReportState (26-39)
  • ReportStatusEnum (8-21)
backend/app/db/repositories/report_repository.py (1)
  • recover_stalled_reports (103-122)
backend/app/db/repositories/report_repository.py (1)
backend/app/db/models/report_state.py (2)
  • ReportState (26-39)
  • ReportStatusEnum (8-21)
🔇 Additional comments (5)
backend/app/db/models/report_state.py (1)

21-21: LGTM!

The new TIMED_OUT status value is appropriately added to support the recovery logic for stalled reports.

backend/app/db/repositories/report_repository.py (3)

2-2: LGTM!

The datetime imports are correctly added to support timezone-aware threshold calculations.


14-14: LGTM!

The async context manager pattern async with self.session_factory() as session is the correct approach for SQLAlchemy 2.0 async sessions.


109-110: Verify timezone handling in timestamp comparison.

The comparison between updated_at and stalled_threshold requires both to be timezone-aware. While stalled_threshold uses timezone.utc, the updated_at column behavior depends on the database backend:

  • SQLite doesn't natively support timezone-aware datetimes
  • PostgreSQL handles this correctly with TIMESTAMP WITH TIME ZONE

If running tests with SQLite but production with PostgreSQL, ensure the comparison behaves consistently.
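
The Python-side half of this concern can be shown with the standard library alone. The sketch below assumes that naive values read back from the database represent UTC, which is an assumption about how they were written; as_utc is a hypothetical helper, and how each backend compares the values inside SQL is not covered here.

```python
from datetime import datetime, timedelta, timezone


def as_utc(value: datetime) -> datetime:
    """Normalize to aware UTC; naive values are ASSUMED to already be UTC."""
    if value.tzinfo is None:
        return value.replace(tzinfo=timezone.utc)
    return value.astimezone(timezone.utc)


aware_threshold = datetime(2025, 12, 4, 12, 0, tzinfo=timezone.utc) - timedelta(minutes=60)
naive_updated_at = datetime(2025, 12, 4, 10, 0)  # e.g. a value returned without tzinfo

# Comparing a naive datetime with an aware one raises TypeError in Python.
try:
    naive_updated_at < aware_threshold
    mixed_comparison_failed = False
except TypeError:
    mixed_comparison_failed = True

# After normalization the comparison is well defined.
is_stalled = as_utc(naive_updated_at) < aware_threshold
```

Normalizing at the boundary where rows are read keeps assertions in tests from depending on which backend produced the timestamp.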

backend/app/db/repositories/tests/test_report_repository.py (1)

14-32: LGTM!

The async session factory fixture is well-structured with proper setup and teardown, including table creation and disposal.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🧹 Nitpick comments (1)
backend/app/db/repositories/report_repository.py (1)

103-129: Recovery logic looks correct; consider edge cases.

The implementation is well-structured with proper transaction handling. A few considerations:

  1. Negative timeout: If timeout_minutes is negative, the threshold would be in the future, potentially marking active reports as stalled. Consider adding validation:
 async def recover_stalled_reports(self, timeout_minutes: int) -> int:
+    if timeout_minutes <= 0:
+        raise ValueError("timeout_minutes must be positive")
     async with self.session_factory() as session:
  2. PENDING state: Reports stuck in PENDING (never started) are not recovered. Verify if this is intentional or if long-pending reports should also be marked as timed out.
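
The negative-timeout concern in item 1 comes down to simple datetime arithmetic, sketched below. stalled_threshold is a hypothetical helper written for illustration; only the guard itself comes from the suggestion in this comment.

```python
from datetime import datetime, timedelta, timezone


def stalled_threshold(now: datetime, timeout_minutes: int) -> datetime:
    """Return the cutoff before which an active report counts as stalled."""
    if timeout_minutes <= 0:
        raise ValueError("timeout_minutes must be positive")
    return now - timedelta(minutes=timeout_minutes)


now = datetime(2025, 12, 4, 12, 0, tzinfo=timezone.utc)

# Without the guard, a negative timeout pushes the cutoff into the future...
unguarded_cutoff = now - timedelta(minutes=-30)
cutoff_is_in_future = unguarded_cutoff > now

# ...so a report updated this very moment would still satisfy `updated_at < cutoff`.
just_updated = now
would_match_without_guard = just_updated < unguarded_cutoff

# With the guard, the bad input is rejected before any query runs.
try:
    stalled_threshold(now, -30)
    rejected = False
except ValueError:
    rejected = True
```

Note that the guard as suggested also rejects zero, so callers that want to recover every active report immediately would need a different mechanism.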
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between b8dd232 and 64773d8.

⛔ Files ignored due to path filters (1)
  • sql_app.db is excluded by !**/*.db
📒 Files selected for processing (3)
  • backend/app/db/migrations/env.py (2 hunks)
  • backend/app/db/migrations/versions/51d8ee58dab4_add_error_tracking_and_timing_columns_.py (1 hunks)
  • backend/app/db/repositories/report_repository.py (8 hunks)
✅ Files skipped from review due to trivial changes (1)
  • backend/app/db/migrations/versions/51d8ee58dab4_add_error_tracking_and_timing_columns_.py
🧰 Additional context used
🧬 Code graph analysis (2)
backend/app/db/migrations/env.py (1)
backend/app/db/connection.py (1)
  • connect (18-62)
backend/app/db/repositories/report_repository.py (1)
backend/app/db/models/report_state.py (2)
  • ReportState (26-39)
  • ReportStatusEnum (8-21)
🔇 Additional comments (3)
backend/app/db/migrations/env.py (1)

64-68: Good extraction of migration logic.

The do_run_migrations helper cleanly encapsulates the migration execution, making it reusable for both sync and async paths.

backend/app/db/repositories/report_repository.py (2)

13-35: Consistent async context manager usage.

The session handling with async with self.session_factory() as session is clean and consistent across all methods, with proper rollback on exceptions.


108-121: State selection and update logic are appropriate.

The selected running states correctly target in-progress operations that could stall due to orchestrator crashes. The bulk update with RETURNING efficiently marks stalled reports while capturing affected IDs for the count.

@felixjordandev
Collaborator

the recovery logic in report_repository.py is solid, should fix those stuck reports! 👍

@felixjordandev felixjordandev merged commit a4e86a6 into main Dec 4, 2025
1 check passed
@felixjordandev felixjordandev deleted the feat/20251204145116 branch December 4, 2025 16:06