-
Notifications
You must be signed in to change notification settings - Fork 0
Changes: In case of orchestrator crashes, add recovery logi #87
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
… logic in `report_repository.py` that detects reports stuck in running stat
|
Warning Rate limit exceeded@elisafalk has exceeded the limit for the number of commits or files that can be reviewed per hour. Please wait 10 minutes and 28 seconds before requesting another review. ⌛ How to resolve this issue?After the wait time has elapsed, a review can be triggered using the We recommend that you space out your commits to avoid hitting the rate limit. 🚦 How do rate limits work?CodeRabbit enforces hourly rate limits for each developer per organization. Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout. Please see our FAQ for further information. 📒 Files selected for processing (1)
WalkthroughAdds TIMED_OUT to ReportStatusEnum, extends ReportState with error and timing columns, updates repository session usage and adds recover_stalled_reports(timeout_minutes), includes unit test for recovery, and adjusts migrations/env.py to support sync autogenerate path. Changes
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes
Possibly related PRs
Poem
Pre-merge checks and finishing touches❌ Failed checks (1 warning, 1 inconclusive)
✅ Passed checks (1 passed)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
backend/app/db/models/report_state.py (1)
36-39: Create an Alembic migration for the new columns.The new columns (
error_message,errors,timing_alerts,generation_time) have been added to theReportStatemodel but no migration script exists inbackend/app/db/migrations/versions/. Alembic is configured and ready to generate migrations. Create and include the migration script usingalembic revision --autogenerate -m "Add error tracking and timing columns to report_state"to ensure existing databases can be properly updated.
🧹 Nitpick comments (2)
backend/app/db/repositories/report_repository.py (1)
103-122: Consider recovering other intermediate "running" states.The current implementation only recovers reports with
RUNNINGstatus. However, theReportStatusEnumincludes several other intermediate states that could also stall during orchestrator crashes:
RUNNING_AGENTSGENERATING_NLGGENERATING_SUMMARYIf an orchestrator crashes while a report is in any of these states, it would not be recovered.
async def recover_stalled_reports(self, timeout_minutes: int) -> int: async with self.session_factory() as session: try: stalled_threshold = datetime.now(timezone.utc) - timedelta(minutes=timeout_minutes) + + # All states that indicate active processing + running_states = [ + ReportStatusEnum.RUNNING, + ReportStatusEnum.RUNNING_AGENTS, + ReportStatusEnum.GENERATING_NLG, + ReportStatusEnum.GENERATING_SUMMARY, + ] stmt = update(ReportState).where( - ReportState.status == ReportStatusEnum.RUNNING, + ReportState.status.in_(running_states), ReportState.updated_at < stalled_threshold ).values(backend/app/db/repositories/tests/test_report_repository.py (1)
38-112: Good test coverage for the core functionality.The test effectively covers:
- Active report (not stalled) remains unchanged
- Stalled report is recovered with correct status and error message
- Failed and completed reports are not affected
Consider adding edge case tests in a follow-up:
- Test with
timeout_minutes=0(all RUNNING reports should be recovered)- Test when no reports match the criteria (verify returns 0)
- Test boundary condition (report exactly at timeout threshold)
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
backend/app/db/models/report_state.py(1 hunks)backend/app/db/repositories/report_repository.py(8 hunks)backend/app/db/repositories/tests/test_report_repository.py(1 hunks)
🧰 Additional context used
🧬 Code graph analysis (2)
backend/app/db/repositories/tests/test_report_repository.py (3)
backend/app/db/models/report.py (1)
Report(4-6)backend/app/db/models/report_state.py (2)
ReportState(26-39)ReportStatusEnum(8-21)backend/app/db/repositories/report_repository.py (1)
recover_stalled_reports(103-122)
backend/app/db/repositories/report_repository.py (1)
backend/app/db/models/report_state.py (2)
ReportState(26-39)ReportStatusEnum(8-21)
🔇 Additional comments (5)
backend/app/db/models/report_state.py (1)
21-21: LGTM!The new
TIMED_OUTstatus value is appropriately added to support the recovery logic for stalled reports.backend/app/db/repositories/report_repository.py (3)
2-2: LGTM!The datetime imports are correctly added to support timezone-aware threshold calculations.
14-14: LGTM!The async context manager pattern
async with self.session_factory() as sessionis the correct approach for SQLAlchemy 2.0 async sessions.
109-110: Verify timezone handling in timestamp comparison.The comparison between
updated_atandstalled_thresholdrequires both to be timezone-aware. Whilestalled_thresholdusestimezone.utc, theupdated_atcolumn behavior depends on the database backend:
- SQLite doesn't natively support timezone-aware datetimes
- PostgreSQL handles this correctly with
TIMESTAMP WITH TIME ZONEIf running tests with SQLite but production with PostgreSQL, ensure the comparison behaves consistently.
backend/app/db/repositories/tests/test_report_repository.py (1)
14-32: LGTM!The async session factory fixture is well-structured with proper setup and teardown, including table creation and disposal.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
🧹 Nitpick comments (1)
backend/app/db/repositories/report_repository.py (1)
103-129: Recovery logic looks correct; consider edge cases.The implementation is well-structured with proper transaction handling. A few considerations:
- Negative timeout: If
timeout_minutesis negative, the threshold would be in the future, potentially marking active reports as stalled. Consider adding validation:async def recover_stalled_reports(self, timeout_minutes: int) -> int: + if timeout_minutes <= 0: + raise ValueError("timeout_minutes must be positive") async with self.session_factory() as session:
- PENDING state: Reports stuck in
PENDING(never started) are not recovered. Verify if this is intentional or if long-pending reports should also be marked as timed out.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
⛔ Files ignored due to path filters (1)
sql_app.dbis excluded by!**/*.db
📒 Files selected for processing (3)
backend/app/db/migrations/env.py(2 hunks)backend/app/db/migrations/versions/51d8ee58dab4_add_error_tracking_and_timing_columns_.py(1 hunks)backend/app/db/repositories/report_repository.py(8 hunks)
✅ Files skipped from review due to trivial changes (1)
- backend/app/db/migrations/versions/51d8ee58dab4_add_error_tracking_and_timing_columns_.py
🧰 Additional context used
🧬 Code graph analysis (2)
backend/app/db/migrations/env.py (1)
backend/app/db/connection.py (1)
connect(18-62)
backend/app/db/repositories/report_repository.py (1)
backend/app/db/models/report_state.py (2)
ReportState(26-39)ReportStatusEnum(8-21)
🔇 Additional comments (3)
backend/app/db/migrations/env.py (1)
64-68: Good extraction of migration logic.The
do_run_migrationshelper cleanly encapsulates the migration execution, making it reusable for both sync and async paths.backend/app/db/repositories/report_repository.py (2)
13-35: Consistent async context manager usage.The session handling with
async with self.session_factory() as sessionis clean and consistent across all methods, with proper rollback on exceptions.
108-121: State selection and update logic are appropriate.The selected running states correctly target in-progress operations that could stall due to orchestrator crashes. The bulk update with
RETURNINGefficiently marks stalled reports while capturing affected IDs for the count.
|
the recovery logic in |
Changes requested:
In case of orchestrator crashes, add recovery logic in
report_repository.pythat detects reports stuck in running state for too long and marks them with a timeout status. Store this logic in a function namedrecover_stalled_reports.Please review.
Summary by CodeRabbit
New Features
Tests
Chores
✏️ Tip: You can customize this high-level summary in your review settings.