Changes: In case of orchestrator crashes, add recovery logi #87

elisafalk · 2025-12-04T14:53:21Z

Changes requested:

In case of orchestrator crashes, add recovery logic in report_repository.py that detects reports stuck in running state for too long and marks them with a timeout status. Store this logic in a function named recover_stalled_reports.

Please review.

Summary by CodeRabbit

New Features
- Reports now support a "timed out" status for executions that exceed configured time limits.
- Enhanced report diagnostics with captured error messages, structured error details, timing alerts, and generation time metrics.
- Automatic recovery to detect and mark stalled reports as timed out.
Tests
- Added unit tests to validate stalled-report recovery behavior.
Chores
- Migration scaffolding and migration-runner flow updates (internal).

… logic in `report_repository.py` that detects reports stuck in running stat

coderabbitai · 2025-12-04T14:53:30Z

Warning

Rate limit exceeded

@elisafalk has exceeded the limit for the number of commits or files that can be reviewed per hour. Please wait 10 minutes and 28 seconds before requesting another review.

⌛ How to resolve this issue?

After the wait time has elapsed, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

📥 Commits

Reviewing files that changed from the base of the PR and between 64773d8 and 6c9b093.

📒 Files selected for processing (1)

backend/app/db/migrations/env.py (2 hunks)

Walkthrough

Adds TIMED_OUT to ReportStatusEnum, extends ReportState with error and timing columns, updates repository session usage and adds recover_stalled_reports(timeout_minutes), includes unit test for recovery, and adjusts migrations/env.py to support sync autogenerate path.

Changes

Cohort / File(s)	Summary
Model Extensions `backend/app/db/models/report_state.py`	Added `TIMED_OUT` to `ReportStatusEnum`; added nullable columns `error_message` (String), `errors` (JSON), `timing_alerts` (JSON), and `generation_time` (Float) to `ReportState`.
Repository Implementation `backend/app/db/repositories/report_repository.py`	Replaced `async with await self.session_factory()` with `async with self.session_factory()` across methods; added `recover_stalled_reports(timeout_minutes: int) -> int` to mark stale RUNNING-like states as `TIMED_OUT` and set error_message; standardized update result handling to use `scalar_one_or_none()`; added datetime imports and preserved commit/rollback semantics.
Repository Tests `backend/app/db/repositories/tests/test_report_repository.py`	New test module with `async_session_factory` and `report_repository` fixtures; `test_recover_stalled_reports` creates reports with different statuses/ages and asserts exactly one RUNNING report is transitioned to `TIMED_OUT` with expected error_message.
Migrations runtime `backend/app/db/migrations/env.py`	Added synchronous engine path for autogenerate (uses `create_engine` and `connection.run_sync` pattern); retains async runtime for normal operation; reorganized migration-run helper `do_run_migrations`.
Scaffolded migration `backend/app/db/migrations/versions/51d8ee58dab4_add_error_tracking_and_timing_columns_.py`	Added new Alembic revision file (revision id present) with no-op `upgrade()`/`downgrade()` placeholders.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Pay special attention to:
- recover_stalled_reports correctness: threshold calculation (timezone-aware), UPDATE filter (status set list), returned update count, and error_message content.
- Async session context changes across repository methods to ensure sessions are acquired and closed correctly.
- Tests: ensure in-memory SQLite async setup and timestamp manipulation are deterministic.
- env.py changes: validate autogenerate (sync) vs async migration flows and that migration contexts call do_run_migrations correctly.
- Migration file scaffold: confirm alembic revision metadata and whether a real migration is required.

Possibly related PRs

Changes: Modify orchestrator flow so that timers start when #84 — also adds generation_time to ReportState (same model column).
Feat: Update report state in orchestrator after each stage #67 — touches ReportState and repository session handling similar to this PR.
Feat: Add Report Repository for Async Operations #66 — extends ReportRepository with recovery/update logic related to TIMED_OUT handling.

Poem

🐰 I nibble on logs and hop through time,
TIMED_OUT now waits where the slow ones climb.
Errors and timings tucked into rows,
A carrot of order where stalled reports doze. 🥕

Pre-merge checks and finishing touches

❌ Failed checks (1 warning, 1 inconclusive)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 5.88% which is insufficient. The required threshold is 80.00%.	You can run `@coderabbitai generate docstrings` to improve docstring coverage.
Title check	❓ Inconclusive	The title is incomplete/truncated: 'Changes: In case of orchestrator crashes, add recovery logi' ends mid-word ('logi' instead of 'logic' or similar). While it references the main feature, the truncation makes it unclear and unprofessional.	Complete the title to properly convey the intent, e.g., 'Add recovery logic for orchestrator crashes' or 'Implement stalled report recovery mechanism'.

✅ Passed checks (1 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.

coderabbitai

Actionable comments posted: 0

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

backend/app/db/models/report_state.py (1)

36-39: Create an Alembic migration for the new columns.

The new columns (`error_message`, `errors`, `timing_alerts`, `generation_time`) have been added to the `ReportState` model, but no migration script exists in `backend/app/db/migrations/versions/`. Alembic is already configured, so generate the migration with `alembic revision --autogenerate -m "Add error tracking and timing columns to report_state"` and include it in the PR so that existing databases can be upgraded.

🧹 Nitpick comments (2)

backend/app/db/repositories/report_repository.py (1)
103-122: Consider recovering other intermediate "running" states.

The current implementation only recovers reports with `RUNNING` status. However, `ReportStatusEnum` includes several other intermediate states that could also stall during an orchestrator crash:

- RUNNING_AGENTS
- GENERATING_NLG
- GENERATING_SUMMARY

If the orchestrator crashes while a report is in any of these states, the report would not be recovered.
 async def recover_stalled_reports(self, timeout_minutes: int) -> int:
     async with self.session_factory() as session:
         try:
             stalled_threshold = datetime.now(timezone.utc) - timedelta(minutes=timeout_minutes)
+            
+            # All states that indicate active processing
+            running_states = [
+                ReportStatusEnum.RUNNING,
+                ReportStatusEnum.RUNNING_AGENTS,
+                ReportStatusEnum.GENERATING_NLG,
+                ReportStatusEnum.GENERATING_SUMMARY,
+            ]
             
             stmt = update(ReportState).where(
-                ReportState.status == ReportStatusEnum.RUNNING,
+                ReportState.status.in_(running_states),
                 ReportState.updated_at < stalled_threshold
             ).values(
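
A minimal sketch of the broadened filter, using the standard-library `sqlite3` module in place of the SQLAlchemy async session the PR uses. The table layout, the plain-string statuses, the error message text, and the explicit `now` parameter are assumptions made for illustration; only the state names and the `updated_at < threshold` comparison come from the review.

```python
import sqlite3
from datetime import datetime, timedelta, timezone

# Status names taken from the review; storing them as plain strings is an assumption.
RUNNING_STATES = ("RUNNING", "RUNNING_AGENTS", "GENERATING_NLG", "GENERATING_SUMMARY")


def recover_stalled_reports(conn: sqlite3.Connection, timeout_minutes: int, now: datetime) -> int:
    """Mark in-progress rows older than the threshold as TIMED_OUT; return the row count."""
    # ISO-8601 strings with a fixed UTC offset and no fractional seconds
    # sort lexicographically, which is what this sketch relies on.
    threshold = (now - timedelta(minutes=timeout_minutes)).isoformat()
    placeholders = ", ".join("?" for _ in RUNNING_STATES)
    cur = conn.execute(
        "UPDATE report_state SET status = 'TIMED_OUT', error_message = ? "
        f"WHERE status IN ({placeholders}) AND updated_at < ?",
        ("Report timed out", *RUNNING_STATES, threshold),  # hypothetical message text
    )
    conn.commit()
    return cur.rowcount


def demo() -> tuple[int, list[tuple[str, str]]]:
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE report_state "
        "(report_id TEXT, status TEXT, updated_at TEXT, error_message TEXT)"
    )
    now = datetime(2025, 12, 4, 15, 0, tzinfo=timezone.utc)
    old = (now - timedelta(hours=2)).isoformat()
    fresh = (now - timedelta(minutes=1)).isoformat()
    conn.executemany(
        "INSERT INTO report_state (report_id, status, updated_at) VALUES (?, ?, ?)",
        [
            ("a", "RUNNING", old),          # stalled
            ("b", "GENERATING_NLG", old),   # stalled, intermediate state
            ("c", "RUNNING", fresh),        # still active
            ("d", "COMPLETED", old),        # finished, must not change
        ],
    )
    count = recover_stalled_reports(conn, timeout_minutes=60, now=now)
    rows = conn.execute(
        "SELECT report_id, status FROM report_state ORDER BY report_id"
    ).fetchall()
    return count, rows
```

Passing `now` in explicitly keeps the threshold deterministic, which is one way to address the "timestamp manipulation" concern when testing this kind of function.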
backend/app/db/repositories/tests/test_report_repository.py (1)

38-112: Good test coverage for the core functionality.

The test effectively covers:

- An active (not stalled) report remains unchanged
- A stalled report is recovered with the correct status and error message
- Failed and completed reports are not affected

Consider adding edge-case tests in a follow-up:

- `timeout_minutes=0` (all RUNNING reports should be recovered)
- No reports match the criteria (verify that 0 is returned)
- Boundary condition (a report exactly at the timeout threshold)
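
The edge cases above can be reasoned about with a small pure-Python predicate. This is only a sketch: `is_stalled` and `check_edge_cases` are hypothetical helpers that mirror the strict `updated_at < stalled_threshold` filter from the reviewed diff, not code from the PR.

```python
from datetime import datetime, timedelta, timezone


def is_stalled(updated_at: datetime, now: datetime, timeout_minutes: int) -> bool:
    """Mirror the strict `updated_at < threshold` comparison used in the UPDATE filter."""
    threshold = now - timedelta(minutes=timeout_minutes)
    return updated_at < threshold


NOW = datetime(2025, 12, 4, 15, 0, tzinfo=timezone.utc)


def check_edge_cases() -> dict[str, bool]:
    return {
        # timeout 0: the threshold equals `now`, so anything updated earlier is stalled
        "zero_timeout_older_report": is_stalled(NOW - timedelta(seconds=1), NOW, 0),
        # timeout 0 but updated at exactly `now`: strict "<" leaves it alone
        "zero_timeout_same_instant": is_stalled(NOW, NOW, 0),
        # exactly at the threshold: not recovered under a strict comparison
        "exactly_at_threshold": is_stalled(NOW - timedelta(minutes=30), NOW, 30),
        # one second past the threshold: recovered
        "just_past_threshold": is_stalled(NOW - timedelta(minutes=30, seconds=1), NOW, 30),
    }
```

Under these assumptions a report sitting exactly on the threshold is left untouched, so a boundary test should assert that it is not marked as timed out.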

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 2e7b4f6 and b8dd232.

📒 Files selected for processing (3)

backend/app/db/models/report_state.py (1 hunks)
backend/app/db/repositories/report_repository.py (8 hunks)
backend/app/db/repositories/tests/test_report_repository.py (1 hunks)

🧰 Additional context used

🧬 Code graph analysis (2)

backend/app/db/repositories/tests/test_report_repository.py (3)

backend/app/db/models/report.py (1)

Report (4-6)

backend/app/db/models/report_state.py (2)

ReportState (26-39)

ReportStatusEnum (8-21)

backend/app/db/repositories/report_repository.py (1)

recover_stalled_reports (103-122)

backend/app/db/repositories/report_repository.py (1)

backend/app/db/models/report_state.py (2)

ReportState (26-39)

ReportStatusEnum (8-21)

🔇 Additional comments (5)

backend/app/db/models/report_state.py (1)

21-21: LGTM!

The new TIMED_OUT status value is appropriately added to support the recovery logic for stalled reports.

backend/app/db/repositories/report_repository.py (3)

2-2: LGTM!

The datetime imports are correctly added to support timezone-aware threshold calculations.

14-14: LGTM!

The async context manager pattern `async with self.session_factory() as session` is the correct approach for SQLAlchemy 2.0 async sessions.
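
To illustrate why the extra `await` is unnecessary, here is a standard-library sketch. `FakeSession` and this `session_factory` are hypothetical stand-ins; the assumption is that the real factory is a plain callable returning an async context manager. If the factory were instead a coroutine function, the `await` would be required.

```python
import asyncio
from contextlib import asynccontextmanager


class FakeSession:
    """Hypothetical stand-in for an async DB session; records lifecycle events."""

    def __init__(self) -> None:
        self.events: list[str] = []


@asynccontextmanager
async def session_factory():
    # Calling session_factory() returns an async context manager, not an awaitable.
    session = FakeSession()
    session.events.append("opened")
    try:
        yield session
    finally:
        session.events.append("closed")


async def use_without_await() -> list[str]:
    async with session_factory() as session:
        session.events.append("used")
    return session.events


async def use_with_extra_await() -> str:
    try:
        # The context manager object is not awaitable, so this raises TypeError.
        async with await session_factory() as session:
            return "entered"
    except TypeError:
        return "TypeError"
```

Running `asyncio.run(use_without_await())` shows the session being opened, used, and closed in order, while the variant with the extra `await` fails before the block is entered.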

109-110: Verify timezone handling in timestamp comparison.

The comparison between `updated_at` and `stalled_threshold` requires both to be timezone-aware. While `stalled_threshold` uses `timezone.utc`, the behavior of the `updated_at` column depends on the database backend:

- SQLite doesn't natively support timezone-aware datetimes
- PostgreSQL handles this correctly with `TIMESTAMP WITH TIME ZONE`

If tests run on SQLite but production runs on PostgreSQL, ensure the comparison behaves consistently.
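
On the Python side, the mismatch shows up as soon as a naive value read back from the database is compared with an aware one. The sketch below uses a hypothetical `as_utc` helper and assumes that naive timestamps were stored in UTC; that assumption would need to be confirmed for the actual schema.

```python
from datetime import datetime, timedelta, timezone


def as_utc(value: datetime) -> datetime:
    """Return an aware UTC datetime; naive inputs are ASSUMED to already be in UTC."""
    if value.tzinfo is None:
        return value.replace(tzinfo=timezone.utc)
    return value.astimezone(timezone.utc)


def naive_vs_aware_fails() -> bool:
    """Show that Python refuses to order a naive datetime against an aware one."""
    naive = datetime(2025, 12, 4, 13, 0)
    aware = datetime(2025, 12, 4, 14, 0, tzinfo=timezone.utc)
    try:
        naive < aware
    except TypeError:
        return True
    return False
```

Normalizing with a helper like this in test assertions keeps the comparison well-defined whether the backend returns naive or aware values.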

backend/app/db/repositories/tests/test_report_repository.py (1)

14-32: LGTM!

The async session factory fixture is well-structured with proper setup and teardown, including table creation and disposal.

coderabbitai

Actionable comments posted: 1

🧹 Nitpick comments (1)

backend/app/db/repositories/report_repository.py (1)
103-129: Recovery logic looks correct; consider edge cases.

The implementation is well-structured with proper transaction handling. A few considerations:

Negative timeout: If `timeout_minutes` is negative, the threshold would be in the future, potentially marking active reports as stalled. Consider adding validation:
 async def recover_stalled_reports(self, timeout_minutes: int) -> int:
+    if timeout_minutes <= 0:
+        raise ValueError("timeout_minutes must be positive")
     async with self.session_factory() as session:
PENDING state: Reports stuck in `PENDING` (never started) are not recovered. Verify whether this is intentional or whether long-pending reports should also be marked as timed out.
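
A small sketch of the validation concern, separated from any database code. `validate_timeout_minutes` and `stalled_threshold` are hypothetical helper names; the `<= 0` check and the error message follow the suggested diff.

```python
from datetime import datetime, timedelta, timezone


def validate_timeout_minutes(timeout_minutes: int) -> int:
    """Reject non-positive timeouts before they reach the query."""
    if isinstance(timeout_minutes, bool) or not isinstance(timeout_minutes, int):
        raise TypeError("timeout_minutes must be an int")
    if timeout_minutes <= 0:
        raise ValueError("timeout_minutes must be positive")
    return timeout_minutes


def stalled_threshold(now: datetime, timeout_minutes: int) -> datetime:
    """Compute the cutoff; reports last updated before it count as stalled."""
    return now - timedelta(minutes=validate_timeout_minutes(timeout_minutes))


def unvalidated_threshold_is_in_future(now: datetime, timeout_minutes: int) -> bool:
    """Show what the check prevents: a negative timeout pushes the cutoff past `now`."""
    return now - timedelta(minutes=timeout_minutes) > now
```

Note that rejecting zero, as the suggested diff does, would rule out the `timeout_minutes=0` edge-case test proposed in the earlier review, so whether zero is allowed is a design choice to settle explicitly.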

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between b8dd232 and 64773d8.

⛔ Files ignored due to path filters (1)

sql_app.db is excluded by !**/*.db

📒 Files selected for processing (3)

backend/app/db/migrations/env.py (2 hunks)
backend/app/db/migrations/versions/51d8ee58dab4_add_error_tracking_and_timing_columns_.py (1 hunks)
backend/app/db/repositories/report_repository.py (8 hunks)

✅ Files skipped from review due to trivial changes (1)

backend/app/db/migrations/versions/51d8ee58dab4_add_error_tracking_and_timing_columns_.py

🧰 Additional context used

🧬 Code graph analysis (2)

backend/app/db/migrations/env.py (1)

backend/app/db/connection.py (1)

connect (18-62)

backend/app/db/repositories/report_repository.py (1)

backend/app/db/models/report_state.py (2)

ReportState (26-39)

ReportStatusEnum (8-21)

🔇 Additional comments (3)

backend/app/db/migrations/env.py (1)

64-68: Good extraction of migration logic.

The do_run_migrations helper cleanly encapsulates the migration execution, making it reusable for both sync and async paths.

backend/app/db/repositories/report_repository.py (2)

13-35: Consistent async context manager usage.

The session handling with `async with self.session_factory() as session` is clean and consistent across all methods, with proper rollback on exceptions.

108-121: State selection and update logic are appropriate.

The selected running states correctly target in-progress operations that could stall due to orchestrator crashes. The bulk update with RETURNING efficiently marks stalled reports while capturing affected IDs for the count.

backend/app/db/migrations/env.py

felixjordandev · 2025-12-04T16:06:07Z

the recovery logic in report_repository.py is solid, should fix those stuck reports! 👍

elisafalk added 3 commits December 4, 2025 08:53

chore: applied change — In case of orchestrator crashes, add recovery…

b8dd232

… logic in `report_repository.py` that detects reports stuck in running stat

Enhance report state tracking and recovery

64773d8

Fix: Convert async DB URLs to sync for Alembic autogenerate

6c9b093
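
The last commit's idea of converting an async database URL to a sync one can be sketched as a scheme rewrite. The driver mapping below is an assumption; the PR text does not show which drivers the project uses, and `to_sync_url` is a hypothetical helper name.

```python
# Hypothetical mapping from async SQLAlchemy driver schemes to sync equivalents.
ASYNC_TO_SYNC_DRIVERS = {
    "postgresql+asyncpg": "postgresql+psycopg2",
    "sqlite+aiosqlite": "sqlite",
}


def to_sync_url(url: str) -> str:
    """Swap an async driver scheme for its sync counterpart; leave other URLs unchanged."""
    scheme, sep, rest = url.partition("://")
    if not sep:
        raise ValueError(f"not a database URL: {url!r}")
    return ASYNC_TO_SYNC_DRIVERS.get(scheme, scheme) + sep + rest
```

For example, `to_sync_url("sqlite+aiosqlite:///./sql_app.db")` yields `sqlite:///./sql_app.db`, which a synchronous engine can open for autogenerate.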

coderabbitai bot reviewed Dec 4, 2025

backend/app/db/migrations/env.py Show resolved Hide resolved

felixjordandev approved these changes Dec 4, 2025

felixjordandev merged commit a4e86a6 into main Dec 4, 2025
1 check passed

felixjordandev deleted the feat/20251204145116 branch December 4, 2025 16:06

3 participants
