
Gevent: Implement soft time limit #800

Merged
nishika26 merged 12 commits into main from bug/gevent_timeout
May 6, 2026
Conversation

Collaborator

@nishika26 nishika26 commented Apr 29, 2026

Summary

Target issue is #767

Notes

  • New Features

    • Added a new collection batch job task and applied soft-timeout protection to multiple background job types.
  • Bug Fixes

    • Unified timeout handling across jobs to mark failures, log timeout-specific errors, perform cleaner rollback, and trigger failure callbacks.
    • Improved failure/status updates for collection, document transformation, LLM, STT, and TTS workflows.
  • Tests

    • Added timeout-focused tests verifying failure marking and callback behavior for multiple job types.

Summary by CodeRabbit

  • Bug Fixes

    • Improved timeout handling for background jobs and tasks with proper error marking and failure notifications.
    • Enhanced error tracking and callback delivery for timed-out operations across collection management, document processing, and AI services.
  • Tests

    • Added comprehensive test coverage for timeout scenarios across job execution services.


coderabbitai Bot commented Apr 29, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

📝 Walkthrough

Applies a gevent-based soft-timeout decorator to many Celery task entrypoints and centralizes explicit timeout handling across collection, LLM, document-transform, STT, and TTS job services, mapping gevent/Celery timeouts to a uniform TimeoutError, marking runs/jobs FAILED, and sending failure callbacks where configured.

Changes

Soft-timeout enforcement & failure handling

  • Timeout utility (backend/app/celery/utils.py): Adds a gevent_timeout(seconds, task_name=None) decorator that starts a gevent.Timeout, logs on timeout, cancels it in a finally block, and re-raises the matching timeout instance.
  • Celery task surface (backend/app/celery/tasks/job_execution.py): Applies @gevent_timeout(settings.CELERY_TASK_SOFT_TIME_LIMIT, "<task_name>") to many Celery task entrypoints to enforce soft timeouts at the task boundary.
  • Failure handling abstraction (backend/app/services/collections/create_collection.py, backend/app/services/doctransform/job.py): Adds internal _handle_job_failure(...) helpers to record span errors, persist FAILED status, optionally delete provider resources, and send failure callbacks; generic exception paths delegate to these helpers.
  • Explicit timeout branches (backend/app/services/collections/*.py, backend/app/services/doctransform/*, backend/app/services/llm/jobs.py, backend/app/services/stt_evaluations/*, backend/app/services/tts_evaluations/*, backend/app/services/doctransform/registry.py, backend/app/services/doctransform/zerox_transformer.py): Imports gevent.Timeout and celery.exceptions.SoftTimeLimitExceeded, adds dedicated except (Timeout, SoftTimeLimitExceeded) branches that synthesize TimeoutError("Task exceeded soft time limit"), log timeout-specific messages, mark jobs/runs as FAILED, trigger failure callbacks when configured, then re-raise the original timeout exception.
  • Control-flow & validation updates (backend/app/services/stt_evaluations/batch_job.py, backend/app/services/tts_evaluations/batch_job.py): Wraps lookup + start logic in an outer try, adds early-return behavior for missing runs/datasets/samples, and ensures runs are marked failed with appropriate messages on error or timeout.
  • LLM chain handling (backend/app/services/llm/jobs.py): Extends handle_job_error(...) with an optional chain_id: UUID | None = None parameter; when a chain_id is supplied, the chain status is also updated to FAILED alongside the job.
  • Transformers, propagate timeouts (backend/app/services/doctransform/registry.py, backend/app/services/doctransform/zerox_transformer.py): Transformer execution now re-raises gevent.Timeout and SoftTimeLimitExceeded directly instead of wrapping them, preserving timeout semantics.
  • Tests (backend/app/tests/services/*): Adds timeout-focused tests across collections, doctransform, and STT/TTS metric/result flows; updates LLM tests to assert handle_job_error usage for chain failures and adds chain timeout tests.
  • Manifests (pyproject.toml, requirements.txt): Dependency manifests updated to reflect gevent-related changes referenced by the diff.

Sequence Diagram(s)

sequenceDiagram
  participant Celery as "Celery Task"
  participant Decorator as "gevent_timeout decorator"
  participant Service as "execute_* service"
  participant DB as "Database / CRUD"
  participant Callback as "Callback / Webhook"

  rect rgba(200,200,255,0.5)
  Celery->>Decorator: task invoked (trace_id, args...)
  Decorator->>Decorator: start gevent.Timeout(seconds)
  end

  Celery->>Service: _set_trace(trace_id) + call execute_*(..., task_id, task_instance)
  Service->>DB: update job/run status (in-flight / failed)
  Service->>Callback: send failure callback if configured
  Service-->>Celery: returns or raises (Timeout or other)

  rect rgba(255,200,200,0.5)
  Decorator->>Service: on gevent.Timeout -> Timeout propagated
  Service->>DB: mark FAILED with timeout message
  Service->>Callback: send failure callback (if configured)
  Decorator-->>Celery: re-raise Timeout
  end
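The red timeout path in the diagram boils down to one handler shape. A hedged sketch with local stand-in exception classes (the real ones are gevent.Timeout, which notably derives from BaseException, and celery.exceptions.SoftTimeLimitExceeded; run_job and the dict-based job record are hypothetical):

```python
# Stand-ins so the pattern can be shown without gevent/celery installed.
class Timeout(BaseException):
    """Mimics gevent.Timeout, which subclasses BaseException."""

class SoftTimeLimitExceeded(Exception):
    """Mimics celery.exceptions.SoftTimeLimitExceeded."""


def run_job(job, execute, on_failure=None):
    """Sketch of the pattern the diff adds: synthesize a TimeoutError for
    status/logging, mark the job FAILED, fire the failure callback if
    configured, then re-raise the ORIGINAL timeout exception."""
    try:
        return execute()
    except (Timeout, SoftTimeLimitExceeded):
        timeout_err = TimeoutError("Task exceeded soft time limit")
        job["status"] = "FAILED"
        job["error_message"] = str(timeout_err)
        if on_failure is not None:
            on_failure(job)
        raise  # not `raise timeout_err`: the decorator must see the original
```

The bare raise matters: it propagates the original gevent/Celery exception so the decorator at the task boundary still observes it, while the synthesized TimeoutError is used only for persisted status and logs.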

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

Suggested reviewers

  • kartpop
  • Prajna1999
  • vprashrex

Poem

🐰 I wrapped the clock in gentle thread,
A gevent hush above each tread.
When tasks run long I mark and call,
Fail with grace, then warn them all.
Rabbit hops off—soft timeout said.

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage: ⚠️ Warning. Docstring coverage is 42.37%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
  • Description Check: ✅ Passed. Check skipped because CodeRabbit’s high-level summary is enabled.
  • Title Check: ✅ Passed. The title clearly and concisely describes the main implementation: adding gevent soft time limit functionality across multiple background job types. It directly reflects the primary objective of the PR.
  • Linked Issues Check: ✅ Passed. Check skipped because no linked issues were found for this pull request.
  • Out of Scope Changes Check: ✅ Passed. Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.




@nishika26 nishika26 force-pushed the bug/gevent_timeout branch from c15e0f3 to 9b24343 on April 30, 2026 03:16

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 5

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
backend/app/services/llm/jobs.py (1)

193-225: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Persist the FAILED state before attempting the callback.

send_callback() still runs before the DB updates here. If the callback delivery fails, handle_job_error() exits before JobCrud.update() and update_llm_chain_status(), so the new timeout/error paths can leave the job and chain stuck in a non-terminal state.

Suggested fix
 def handle_job_error(
     job_id: UUID,
     callback_url: str | None,
     callback_response: APIResponse,
     organization_id: int | None = None,
     project_id: int | None = None,
     chain_id: UUID | None = None,
 ) -> dict:
     """Handle job failure uniformly — send callback and update DB."""
-    if callback_url:
-        webhook_secret = get_webhook_secret(project_id, organization_id)
-        with tracer.start_as_current_span("llm.send_callback") as cb_span:
-            cb_span.set_attribute("callback.url", callback_url)
-            cb_span.set_attribute("callback.status", "failure")
-            send_callback(
-                callback_url=callback_url,
-                data=callback_response.model_dump(),
-                webhook_secret=webhook_secret,
-            )
-
     with Session(engine) as session:
         JobCrud(session=session).update(
             job_id=job_id,
             job_update=JobUpdate(
                 status=JobStatus.FAILED,
                 error_message=callback_response.error,
             ),
         )
-        if chain_id:
+        if chain_id is not None:
             try:
                 update_llm_chain_status(
                     session,
                     chain_id=chain_id,
                     status=ChainStatus.FAILED,
                     error=callback_response.error,
                 )
             except Exception as update_err:
                 logger.error(
                     f"[handle_job_error] Failed to update chain status: {update_err} | "
                     f"chain_id={chain_id}",
                     exc_info=True,
                 )
+
+    if callback_url:
+        try:
+            webhook_secret = get_webhook_secret(project_id, organization_id)
+            with tracer.start_as_current_span("llm.send_callback") as cb_span:
+                cb_span.set_attribute("callback.url", callback_url)
+                cb_span.set_attribute("callback.status", "failure")
+                send_callback(
+                    callback_url=callback_url,
+                    data=callback_response.model_dump(),
+                    webhook_secret=webhook_secret,
+                )
+        except Exception as callback_err:
+            logger.error(
+                f"[handle_job_error] Failed to send callback: {callback_err} | job_id={job_id}",
+                exc_info=True,
+            )
 
     return callback_response.model_dump()
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/app/services/llm/jobs.py` around lines 193 - 225, The code calls
send_callback() before persisting the FAILED state, so a callback failure can
prevent JobCrud.update() and update_llm_chain_status() from running; change
handle_job_error to first open a DB session and persist the JobStatus.FAILED
(via JobCrud.update) and, if applicable, call update_llm_chain_status (catching
and logging its exceptions), commit/close the session, and only then start the
tracer span and call send_callback() (catching/logging send_callback errors) so
callback delivery cannot block saving terminal job/chain state; reference
send_callback, JobCrud.update, update_llm_chain_status and the tracer span name
("llm.send_callback") when locating the code to reorder and add exception
handling.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@backend/app/celery/tasks/job_execution.py`:
- Around line 95-99: Annotate the Celery bound task function
run_collection_batch_job: type the first parameter as celery.Task (i.e., self:
celery.Task), annotate job_id and trace_id as str (they already appear but
ensure explicit hints), type **kwargs as Any (from typing import Any), and add
an explicit return type that matches the return type of execute_batch_job
(import or reference its declared return type) so the signature is fully typed;
update imports if needed to include celery and Any and ensure the function
signature and return annotation align with execute_batch_job.

In `@backend/app/celery/utils.py`:
- Around line 229-230: The timeout log currently interpolates raw args/kwargs
(f"[{name}] Timed out after {seconds}s — args={args}, kwargs={kwargs}"), which
can leak secrets/PII; change it to log a sanitized summary instead. Replace the
direct interpolation of args and kwargs in the timeout message with a
redacted/summarized representation (e.g., counts, types, truncated values, or a
dedicated sanitizer function like redact_task_payload) and/or call a helper
(e.g., redact_task_payload(args, kwargs)) that strips/masks sensitive data
before logging; retain name and seconds as before and ensure the new helper is
used wherever this log occurs.
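The redact_task_payload helper named in this comment is only suggested, not implemented in the PR. One possible shape (all names hypothetical) that logs counts and types rather than raw values:

```python
def redact_task_payload(args, kwargs):
    """Hypothetical sanitizer from the review comment: summarize the task
    payload (argument counts and types) instead of interpolating raw
    values, so secrets/PII never reach the logs."""
    arg_types = ", ".join(type(a).__name__ for a in args)
    kwarg_types = ", ".join(
        f"{key}: {type(value).__name__}" for key, value in sorted(kwargs.items())
    )
    return f"args[{len(args)}]=({arg_types}), kwargs[{len(kwargs)}]={{{kwarg_types}}}"
```

The timeout log would then become something like logger.error(f"[{name}] Timed out after {seconds}s | {redact_task_payload(args, kwargs)}"), keeping name and seconds while dropping the raw values.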

In `@backend/app/services/collections/create_collection.py`:
- Around line 324-326: The timeout error message currently uses the wrong
function prefix "[execute_setup_job]"; update the string construction for
timeout_err so the prefix matches the enclosing function name (replace
"[execute_setup_job]" with the actual function name used in
create_collection.py, e.g., "[create_collection]") so logs follow the
"[function_name] ..." convention and keep the rest of the message intact (the
f-string and err.seconds usage should remain unchanged).

In `@backend/app/services/doctransform/job.py`:
- Around line 127-132: The log messages inside the _handle_job_failure function
use the wrong prefix "[doc_transform.execute_job]"; update those logger.error
calls to use the current function name in square brackets (e.g.
"[doc_transform._handle_job_failure]") so log filtering is accurate—locate the
logger.error invocations in _handle_job_failure (including the call that logs
failed to persist FAILED status and the later similar logger.error at lines
referenced) and change the prefix string accordingly while preserving the rest
of the message and parameters (context_suffix, job_uuid, db_error, etc.).

In `@backend/app/services/llm/jobs.py`:
- Around line 973-975: The callback payload currently embeds a log prefix and
the raw timeout object; update the assignment to APIResponse.failure_response in
execute_chain_job so the error string is a stable message like "Chain job
exceeded soft time limit" (do not include "[execute_chain_job]" or the err
object), and instead log the detailed err separately (e.g., logger.exception or
logger.error) before returning callback_response; keep request.request_metadata
unchanged so the client-stored error is clean and consistent with execute_job.

---

Outside diff comments:
In `@backend/app/services/llm/jobs.py`:
- Around line 193-225: The code calls send_callback() before persisting the
FAILED state, so a callback failure can prevent JobCrud.update() and
update_llm_chain_status() from running; change handle_job_error to first open a
DB session and persist the JobStatus.FAILED (via JobCrud.update) and, if
applicable, call update_llm_chain_status (catching and logging its exceptions),
commit/close the session, and only then start the tracer span and call
send_callback() (catching/logging send_callback errors) so callback delivery
cannot block saving terminal job/chain state; reference send_callback,
JobCrud.update, update_llm_chain_status and the tracer span name
("llm.send_callback") when locating the code to reorder and add exception
handling.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: bc00e2d4-6bae-428b-86f8-7497e8fc0a61

📥 Commits

Reviewing files that changed from the base of the PR and between ce5346a and 9b24343.

📒 Files selected for processing (10)
  • backend/app/celery/tasks/job_execution.py
  • backend/app/celery/utils.py
  • backend/app/services/collections/create_collection.py
  • backend/app/services/collections/delete_collection.py
  • backend/app/services/doctransform/job.py
  • backend/app/services/llm/jobs.py
  • backend/app/services/stt_evaluations/batch_job.py
  • backend/app/services/stt_evaluations/metric_job.py
  • backend/app/services/tts_evaluations/batch_job.py
  • backend/app/services/tts_evaluations/batch_result_processing.py

Comment thread backend/app/celery/tasks/job_execution.py Outdated
Comment thread backend/app/celery/utils.py Outdated
Comment thread backend/app/services/collections/create_collection.py Outdated
Comment on lines +127 to +132
logger.error(
    "[doc_transform.execute_job] failed to persist FAILED status%s | job_id=%s | db_error=%s",
    context_suffix,
    job_uuid,
    db_error,
)


⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Use the current function name in helper log prefixes.

These log lines are emitted from _handle_job_failure but are prefixed as execute_job, which makes log filtering less reliable.

As per coding guidelines, "Prefix all log messages with the function name in square brackets".

Also applies to: 139-143

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/app/services/doctransform/job.py` around lines 127 - 132, The log
messages inside the _handle_job_failure function use the wrong prefix
"[doc_transform.execute_job]"; update those logger.error calls to use the
current function name in square brackets (e.g.
"[doc_transform._handle_job_failure]") so log filtering is accurate—locate the
logger.error invocations in _handle_job_failure (including the call that logs
failed to persist FAILED status and the later similar logger.error at lines
referenced) and change the prefix string accordingly while preserving the rest
of the message and parameters (context_suffix, job_uuid, db_error, etc.).

Comment thread backend/app/services/llm/jobs.py
@nishika26 nishika26 self-assigned this Apr 30, 2026
@nishika26 nishika26 added enhancement New feature or request ready-for-review labels Apr 30, 2026
@nishika26 nishika26 linked an issue Apr 30, 2026 that may be closed by this pull request
@codecov

codecov Bot commented Apr 30, 2026


@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 5

🧹 Nitpick comments (1)
backend/app/tests/services/llm/test_jobs.py (1)

1357-1359: ⚡ Quick win

Add a timeout-path regression test here too.

This assertion only checks that chain_id is forwarded. The new behavior in this PR is the except Timeout path, so it would be worth forcing executor.run() to raise gevent.Timeout and asserting handle_job_error(..., chain_id=...) runs before the timeout is re-raised.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/app/tests/services/llm/test_jobs.py` around lines 1357 - 1359, Add a
regression test that forces the timeout path by setting executor.run to raise
gevent.Timeout (e.g., set side_effect=gevent.Timeout()) when invoked, then run
the code under test inside pytest.raises(gevent.Timeout) so the Timeout is
re-raised to the test; after the call assert mock_handle_error (or
handle_job_error) was invoked with chain_id equal to chain_id (use
mock_handle_error.call_args/ assert_called_once and inspect kwargs["chain_id"])
to ensure handle_job_error ran before the Timeout propagated. Ensure you
reference executor.run(), mock_handle_error (or handle_job_error) and
gevent.Timeout in the test.
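A sketch of the requested regression test, using a local FakeTimeout in place of gevent.Timeout and a toy execute_chain standing in for the real code path (both hypothetical), so the assertion order matches the prompt above:

```python
from unittest.mock import MagicMock


class FakeTimeout(BaseException):
    """Stand-in for gevent.Timeout so the sketch runs without gevent."""


def execute_chain(executor, handle_job_error, chain_id):
    # Minimal model of the code path under test: on timeout, report the
    # failure with chain_id, then re-raise the original exception.
    try:
        return executor.run()
    except FakeTimeout:
        handle_job_error(chain_id=chain_id)
        raise


# The regression test shape the reviewer asks for, written with mocks:
executor = MagicMock()
executor.run.side_effect = FakeTimeout()
mock_handle_error = MagicMock()

timed_out = False
try:
    execute_chain(executor, mock_handle_error, chain_id="chain-123")
except FakeTimeout:
    timed_out = True

assert timed_out  # the timeout was re-raised to the caller
mock_handle_error.assert_called_once()
assert mock_handle_error.call_args.kwargs["chain_id"] == "chain-123"
```

In the real test the FakeTimeout would be gevent.Timeout, executor.run would be patched on the chain executor, and the call would be wrapped in pytest.raises.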
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@backend/app/celery/utils.py`:
- Around line 218-235: Annotate the new decorator to preserve the wrapped
callable's signature: import ParamSpec and TypeVar and declare P =
ParamSpec("P") and R = TypeVar("R"); change gevent_timeout to def
gevent_timeout(seconds: float, task_name: Optional[str] = None) ->
Callable[[Callable[P, R]], Callable[P, R]]; change decorator to def
decorator(func: Callable[P, R]) -> Callable[P, R]; and change wrapper signature
to def wrapper(*args: P.args, **kwargs: P.kwargs) -> R, keeping
functools.wraps(func) and existing runtime logic unchanged.
- Around line 227-229: The except block in the decorator that catches
gevent.Timeout should only handle the specific Timeout instance created for this
wrapper; change the handler in the function where you create the local variable
timeout (the Timeout instance at line ~223) to use "except Timeout as err:" and
then re-raise immediately if err is not timeout (if err is not timeout: raise),
otherwise log the soft-limit message and re-raise; update the logger call there
to remain the same for the matched instance.

In `@backend/app/services/collections/create_collection.py`:
- Around line 323-343: In the Timeout except branch of execute_job, guard
against rolling-back a job that has already been marked SUCCESSFUL by refreshing
the current job/collection state (e.g., reload collection_job or query job
status by job_id) and only call _handle_job_failure(job_id, ...) when the
freshly fetched status is not SUCCESSFUL; if it is SUCCESSFUL, log a warning
including job_id and skip deleting provider resources or flipping job state,
then re-raise the Timeout as before. Use the existing symbols execute_job,
collection_job, job_id, _handle_job_failure, creation_request, provider, and
result to locate and implement this conditional check.
- Around line 178-185: The current failure-path calls send_callback(...)
directly which can raise and mask the original job failure; wrap the
send_callback call in a try/except (mirroring the _handle_job_failure pattern)
so any exception from send_callback is caught and logged but not re-raised,
ensuring the original error remains the primary failure. Specifically, in the
block that builds failure_payload with build_failure_payload(...) and fetches
webhook_secret via get_webhook_secret(...), call send_callback(...) inside a
try/except, log the send error (including exception details and context like
creation_request.callback_url and collection_job id) and swallow the exception
so it cannot override the original failure handling.

In `@backend/app/services/llm/jobs.py`:
- Around line 212-225: The failure persists only after send_callback() so if the
webhook errors the chain update is skipped; move the chain-status update to
occur before calling send_callback() so the failed ChainStatus is persisted even
if the callback fails: inside the same error handling block (e.g., in
handle_job_error) perform the chain_id check and call
update_llm_chain_status(session, chain_id=chain_id, status=ChainStatus.FAILED,
error=callback_response.error) wrapped in its try/except (logging update_err)
before invoking send_callback(), keeping the existing logging and exception
handling intact.

---

Nitpick comments:
In `@backend/app/tests/services/llm/test_jobs.py`:
- Around line 1357-1359: Add a regression test that forces the timeout path by
setting executor.run to raise gevent.Timeout (e.g., set
side_effect=gevent.Timeout()) when invoked, then run the code under test inside
pytest.raises(gevent.Timeout) so the Timeout is re-raised to the test; after the
call assert mock_handle_error (or handle_job_error) was invoked with chain_id
equal to chain_id (use mock_handle_error.call_args/ assert_called_once and
inspect kwargs["chain_id"]) to ensure handle_job_error ran before the Timeout
propagated. Ensure you reference executor.run(), mock_handle_error (or
handle_job_error) and gevent.Timeout in the test.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 7b5fc941-4a59-4aa8-9c00-4cb896855ae3

📥 Commits

Reviewing files that changed from the base of the PR and between 9b24343 and 26519d9.

📒 Files selected for processing (5)
  • backend/app/celery/tasks/job_execution.py
  • backend/app/celery/utils.py
  • backend/app/services/collections/create_collection.py
  • backend/app/services/llm/jobs.py
  • backend/app/tests/services/llm/test_jobs.py

Comment thread backend/app/celery/utils.py Outdated
Comment thread backend/app/celery/utils.py Outdated
Comment on lines +178 to +185
if creation_request and creation_request.callback_url and collection_job:
    failure_payload = build_failure_payload(collection_job, str(err))
    webhook_secret = get_webhook_secret(project_id, organization_id)
    send_callback(
        str(creation_request.callback_url),
        failure_payload,
        webhook_secret=webhook_secret,
    )


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Don't let failure-webhook errors replace the real job failure.

send_callback() is unguarded here, unlike the _handle_job_failure pattern in backend/app/services/doctransform/job.py:107-145. If the webhook send fails, this helper raises a second exception and masks the original timeout/provider failure.

Suggested fix
     if creation_request and creation_request.callback_url and collection_job:
-        failure_payload = build_failure_payload(collection_job, str(err))
-        webhook_secret = get_webhook_secret(project_id, organization_id)
-        send_callback(
-            str(creation_request.callback_url),
-            failure_payload,
-            webhook_secret=webhook_secret,
-        )
+        try:
+            failure_payload = build_failure_payload(collection_job, str(err))
+            webhook_secret = get_webhook_secret(project_id, organization_id)
+            send_callback(
+                str(creation_request.callback_url),
+                failure_payload,
+                webhook_secret=webhook_secret,
+            )
+        except Exception as cb_error:
+            logger.error(
+                "[create_collection.execute_job] Failure callback failed | "
+                "{'collection_job_id': '%s', 'error': '%s'}",
+                job_id,
+                str(cb_error),
+                exc_info=True,
+            )
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
if creation_request and creation_request.callback_url and collection_job:
    failure_payload = build_failure_payload(collection_job, str(err))
    webhook_secret = get_webhook_secret(project_id, organization_id)
    send_callback(
        str(creation_request.callback_url),
        failure_payload,
        webhook_secret=webhook_secret,
    )

if creation_request and creation_request.callback_url and collection_job:
    try:
        failure_payload = build_failure_payload(collection_job, str(err))
        webhook_secret = get_webhook_secret(project_id, organization_id)
        send_callback(
            str(creation_request.callback_url),
            failure_payload,
            webhook_secret=webhook_secret,
        )
    except Exception as cb_error:
        logger.error(
            "[create_collection.execute_job] Failure callback failed | "
            "{'collection_job_id': '%s', 'error': '%s'}",
            job_id,
            str(cb_error),
            exc_info=True,
        )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/app/services/collections/create_collection.py` around lines 178 -
185, The current failure-path calls send_callback(...) directly which can raise
and mask the original job failure; wrap the send_callback call in a try/except
(mirroring the _handle_job_failure pattern) so any exception from send_callback
is caught and logged but not re-raised, ensuring the original error remains the
primary failure. Specifically, in the block that builds failure_payload with
build_failure_payload(...) and fetches webhook_secret via
get_webhook_secret(...), call send_callback(...) inside a try/except, log the
send error (including exception details and context like
creation_request.callback_url and collection_job id) and swallow the exception
so it cannot override the original failure handling.

Comment on lines +323 to +343
except Timeout as err:
    timeout_err = TimeoutError(
        f"[execute_job] Task exceeded soft time limit of {err.seconds}s"
    )
    logger.error(
        "[create_collection.execute_job] Collection Creation Timed Out | {'collection_job_id': '%s', 'error': '%s'}",
        job_id,
        str(timeout_err),
    )
    _handle_job_failure(
        span,
        project_id,
        organization_id,
        job_id,
        timeout_err,
        collection_job,
        creation_request,
        provider,
        result,
    )
    raise


⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift

Guard against timeouts after the success commit.

This branch can run after lines 273-301 have already committed the collection row and marked the job SUCCESSFUL. If the timeout fires during the success callback at lines 315-321, _handle_job_failure(...) will delete the provider resource and flip the job to FAILED, but the persisted collection row is left behind. That creates a DB/external-state mismatch.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/app/services/collections/create_collection.py` around lines 323 -
343, In the Timeout except branch of execute_job, guard against rolling-back a
job that has already been marked SUCCESSFUL by refreshing the current
job/collection state (e.g., reload collection_job or query job status by job_id)
and only call _handle_job_failure(job_id, ...) when the freshly fetched status
is not SUCCESSFUL; if it is SUCCESSFUL, log a warning including job_id and skip
deleting provider resources or flipping job state, then re-raise the Timeout as
before. Use the existing symbols execute_job, collection_job, job_id,
_handle_job_failure, creation_request, provider, and result to locate and
implement this conditional check.

Comment on lines +212 to +225
if chain_id:
    try:
        update_llm_chain_status(
            session,
            chain_id=chain_id,
            status=ChainStatus.FAILED,
            error=callback_response.error,
        )
    except Exception as update_err:
        logger.error(
            f"[handle_job_error] Failed to update chain status: {update_err} | "
            f"chain_id={chain_id}",
            exc_info=True,
        )


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Persist failure state before the webhook call can fail.

The new chain-status update lives after send_callback(), so a callback exception still skips both the job update and the new chain_id update. That leaves failed chains stuck in a non-failed state on exactly the path this change is meant to cover.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/app/services/llm/jobs.py` around lines 212 - 225, The failure
persists only after send_callback() so if the webhook errors the chain update is
skipped; move the chain-status update to occur before calling send_callback() so
the failed ChainStatus is persisted even if the callback fails: inside the same
error handling block (e.g., in handle_job_error) perform the chain_id check and
call update_llm_chain_status(session, chain_id=chain_id,
status=ChainStatus.FAILED, error=callback_response.error) wrapped in its
try/except (logging update_err) before invoking send_callback(), keeping the
existing logging and exception handling intact.

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 8

♻️ Duplicate comments (1)
backend/app/services/collections/create_collection.py (1)

179-186: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Guard send_callback to prevent masking the original failure.

The send_callback call in _handle_job_failure is unguarded. If the webhook send fails, the exception will propagate and mask the original timeout/provider failure that triggered this handler.

Suggested fix
     if creation_request and creation_request.callback_url and collection_job:
-        failure_payload = build_failure_payload(collection_job, str(err))
-        webhook_secret = get_webhook_secret(project_id, organization_id)
-        send_callback(
-            str(creation_request.callback_url),
-            failure_payload,
-            webhook_secret=webhook_secret,
-        )
+        try:
+            failure_payload = build_failure_payload(collection_job, str(err))
+            webhook_secret = get_webhook_secret(project_id, organization_id)
+            send_callback(
+                str(creation_request.callback_url),
+                failure_payload,
+                webhook_secret=webhook_secret,
+            )
+        except Exception as cb_error:
+            logger.error(
+                "[create_collection._handle_job_failure] Failure callback failed | "
+                "{'job_id': '%s', 'error': '%s'}",
+                job_id,
+                str(cb_error),
+                exc_info=True,
+            )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/app/services/collections/create_collection.py` around lines 179 -
186, In _handle_job_failure, the unguarded call to send_callback (using
creation_request, build_failure_payload, get_webhook_secret, and collection_job)
can raise and mask the original failure; wrap the send_callback invocation in a
try/except block that catches exceptions from sending the webhook, logs the send
error (including context: project_id/organization_id/callback_url and the
exception), and does not re-raise so the original job failure remains the
primary error path.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@backend/app/celery/tasks/job_execution.py`:
- Around line 157-172: The task run_collection_batch_job imports and calls a
non-existent execute_batch_job from app.services.collections.create_collection;
replace the import and call with the existing execute_job from that module (or
implement execute_batch_job if batch-specific behavior is required); update the
lambda inside _run_with_otel_parent to invoke execute_job(project_id=project_id,
job_id=job_id, task_id=current_task.request.id, task_instance=self, **kwargs)
and verify the execute_job signature accepts task_id and task_instance,
adjusting argument names if needed.

In `@backend/app/services/collections/create_collection.py`:
- Around line 324-342: The TimeoutError is constructed with an unnecessary
f-string prefix; change TimeoutError(f"Task exceeded soft time limit") to a
plain string TimeoutError("Task exceeded soft time limit") (or otherwise include
real placeholders if intended), and keep the logger.error call and subsequent
_handle_job_failure(...) invocation unchanged so timeout_err contains the
correct message when passed to _handle_job_failure and logged with job_id.

In `@backend/app/services/collections/delete_collection.py`:
- Around line 250-268: In delete_collection.execute_job, remove the unnecessary
f-string prefix when creating timeout_err: replace the TimeoutError
instantiation that uses f"Task exceeded soft time limit" with a plain string
literal ("Task exceeded soft time limit") so timeout_err is created without an
extraneous f-string; ensure the variable name timeout_err remains used for
span.record_exception, span.set_status, _mark_job_failed_and_callback and the
raised exception.

In `@backend/app/services/doctransform/job.py`:
- Line 279: The string literal used to construct timeout_err uses an unnecessary
f-string prefix; update the creation of timeout_err (the TimeoutError
instantiation named timeout_err) to use a plain string literal without the
f-prefix (i.e., remove the leading "f" from TimeoutError(f"...") so it becomes
TimeoutError("...")); search for other similar extraneous f-strings in this
module and remove the prefix where no placeholders are present.

In `@backend/app/services/stt_evaluations/batch_job.py`:
- Around line 102-113: The logger.error call and the TimeoutError message in the
except block handling (Timeout, SoftTimeLimitExceeded) use unnecessary f-string
prefixes with no placeholders; update the logger.error call (the string in
logger.error inside the except handling for Timeout/SoftTimeLimitExceeded) to a
plain string (remove the leading f) and likewise remove the f prefix when
constructing timeout_err = TimeoutError(...) so both messages are regular string
literals, leaving the rest of the exception handling (update_stt_run, run_id,
session, raising the error) unchanged.

In `@backend/app/services/stt_evaluations/metric_job.py`:
- Around line 158-169: In the except block catching (Timeout,
SoftTimeLimitExceeded) in execute_metric_computation, remove the unnecessary
f-string prefix when creating timeout_err (change TimeoutError(f"Task exceeded
soft time limit") to a plain string TimeoutError("Task exceeded soft time
limit")); keep the rest of the block (logger.error(...),
update_stt_run(session=session, run_id=run_id, status="failed",
error_message=str(timeout_err)), and re-raise) unchanged so the error message is
identical but without the redundant f-string.

In `@backend/app/services/tts_evaluations/batch_job.py`:
- Around line 130-141: The code creates timeout_err using an unnecessary
f-string (TimeoutError(f"Task exceeded soft time limit")); replace it with a
normal string literal (TimeoutError("Task exceeded soft time limit")) to remove
the extraneous f-prefix; keep the surrounding exception handling and calls to
logger.error and update_tts_run intact (symbols: Timeout, SoftTimeLimitExceeded,
timeout_err, logger.error, update_tts_run).

In `@backend/app/services/tts_evaluations/batch_result_processing.py`:
- Around line 244-255: The logger.error call in execute_tts_result_processing is
using an unnecessary f-string without placeholders; change the string literal
passed to logger.error to a plain string (remove the leading 'f') in the except
block handling Timeout and SoftTimeLimitExceeded where timeout_err is created
and update_tts_run(session=session, run_id=evaluation_run_id, ...) is called so
the log reads correctly while keeping the TimeoutError handling and subsequent
update_tts_run and raise intact.

---

Duplicate comments:
In `@backend/app/services/collections/create_collection.py`:
- Around line 179-186: In _handle_job_failure, the unguarded call to
send_callback (using creation_request, build_failure_payload,
get_webhook_secret, and collection_job) can raise and mask the original failure;
wrap the send_callback invocation in a try/except block that catches exceptions
from sending the webhook, logs the send error (including context:
project_id/organization_id/callback_url and the exception), and does not
re-raise so the original job failure remains the primary error path.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 98e634c2-192a-4409-a876-455bdb22d5ec

📥 Commits

Reviewing files that changed from the base of the PR and between 26519d9 and 19c6052.

📒 Files selected for processing (15)
  • backend/app/celery/tasks/job_execution.py
  • backend/app/services/collections/create_collection.py
  • backend/app/services/collections/delete_collection.py
  • backend/app/services/doctransform/job.py
  • backend/app/services/doctransform/registry.py
  • backend/app/services/doctransform/zerox_transformer.py
  • backend/app/services/llm/jobs.py
  • backend/app/services/stt_evaluations/batch_job.py
  • backend/app/services/stt_evaluations/metric_job.py
  • backend/app/services/tts_evaluations/batch_job.py
  • backend/app/services/tts_evaluations/batch_result_processing.py
  • backend/app/tests/services/collections/test_create_collection.py
  • backend/app/tests/services/collections/test_delete_collection.py
  • backend/app/tests/services/doctransformer/test_job/test_execute_job_errors.py
  • backend/app/tests/services/llm/test_jobs.py

Comment thread backend/app/celery/tasks/job_execution.py Outdated
Comment thread backend/app/services/collections/create_collection.py
Comment thread backend/app/services/collections/delete_collection.py
Comment thread backend/app/services/doctransform/job.py Outdated
Comment thread backend/app/services/stt_evaluations/batch_job.py
Comment thread backend/app/services/stt_evaluations/metric_job.py
Comment thread backend/app/services/tts_evaluations/batch_job.py
Comment thread backend/app/services/tts_evaluations/batch_result_processing.py
@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 1

🧹 Nitpick comments (4)
backend/app/tests/services/doctransformer/test_registry.py (2)

102-117: ⚡ Quick win

Add type hints to the new inline transformer classes.

Both TimeoutTransformer.transform and SoftLimitTransformer.transform are missing parameter and return type annotations, violating the project's type-hint rule.

✏️ Proposed fix
     class TimeoutTransformer:
-        def transform(self, input_path, output_path):
+        def transform(self, input_path: Path, output_path: Path) -> Path:
             raise Timeout()
 
     monkeypatch.setitem(TRANSFORMERS, "timeout", TimeoutTransformer)
     with pytest.raises(Timeout):
         convert_document(input_file, output_file, transformer_name="timeout")
 
     # Celery SoftTimeLimitExceeded propagates without being wrapped
     class SoftLimitTransformer:
-        def transform(self, input_path, output_path):
+        def transform(self, input_path: Path, output_path: Path) -> Path:
             raise SoftTimeLimitExceeded()

You'll also need to add from pathlib import Path at the top if it isn't already imported.

As per coding guidelines: "Always add type hints to all function parameters and return values in Python code."

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@backend/app/tests/services/doctransformer/test_registry.py` around lines 102
- 117, Add missing type hints to the inline transformer classes: annotate
TimeoutTransformer.transform(self, input_path: Path, output_path: Path) -> None
and SoftLimitTransformer.transform(self, input_path: Path, output_path: Path) ->
None, and ensure Path is imported from pathlib at the top of the test file; keep
the existing raise statements unchanged so behavior and the pytest.raises checks
for convert_document with transformer_name="timeout"/"softlimit" remain the
same.

75-117: ⚡ Quick win

Consider splitting test_convert_document into focused, single-scenario tests.

The function now contains five distinct test scenarios. A failure in an earlier pytest.raises block will suppress execution of the later ones, making failures harder to isolate. Separate test functions (e.g. test_convert_document_timeout_propagates, test_convert_document_soft_limit_propagates) give independent reporting and cleaner names.

♻️ Sketch of the refactor
`@pytest.fixture`
def dummy_input_output(tmp_path):
    input_file = tmp_path / "input.txt"
    input_file.write_text("test")
    return input_file, tmp_path / "output.txt"


def test_convert_document_success(dummy_input_output, monkeypatch): ...
def test_convert_document_transformer_not_found(dummy_input_output, monkeypatch): ...
def test_convert_document_wraps_generic_error(dummy_input_output, monkeypatch): ...
def test_convert_document_timeout_propagates(dummy_input_output, monkeypatch): ...
def test_convert_document_soft_time_limit_propagates(dummy_input_output, monkeypatch): ...

As per coding guidelines: "Use factory pattern for test fixtures in backend/app/tests/."

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@backend/app/tests/services/doctransformer/test_registry.py` around lines 75 -
117, Split the large test_convert_document into multiple focused tests: create a
fixture factory (e.g., dummy_input_output) that prepares input_file and
output_file and reuse it across tests; then implement separate test functions
test_convert_document_success, test_convert_document_transformer_not_found,
test_convert_document_wraps_generic_error,
test_convert_document_timeout_propagates, and
test_convert_document_soft_time_limit_propagates which each call
convert_document with the proper monkeypatched TRANSFORMERS entry (use
monkeypatch.setitem(TRANSFORMERS, "...", <TransformerClass>)) and assert the
expected outcome (success content, raises ValueError, raises
TransformationError, propagates Timeout, propagates SoftTimeLimitExceeded); keep
references to convert_document, TRANSFORMERS, and
TransformationError/Timeout/SoftTimeLimitExceeded to locate code.
backend/app/services/collections/create_collection.py (1)

151-161: ⚡ Quick win

Add type hints to _handle_job_failure parameters.

The span, provider, and result parameters lack type annotations. Per coding guidelines, all function parameters must have type hints.

Proposed fix
+from opentelemetry.trace import Span
+from app.services.collections.providers.base import BaseCollectionProvider, ProviderResult
+
 def _handle_job_failure(
-    span,
+    span: Span,
     project_id: int,
     organization_id: int,
     job_id: str,
     err: Exception,
     collection_job: CollectionJob | None,
     creation_request: CreationRequest | None,
-    provider=None,
-    result=None,
+    provider: BaseCollectionProvider | None = None,
+    result: ProviderResult | None = None,
 ) -> None:

As per coding guidelines, "Always add type hints to all function parameters and return values in Python code".

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@backend/app/services/collections/create_collection.py` around lines 151 -
161, The function _handle_job_failure is missing type annotations for span,
provider, and result; update the signature to add explicit types (e.g. import
typing.Any and typing.Optional and annotate as span: Span | Any, provider:
Optional[Provider] | Any, result: Optional[Any]) or use the concrete Span type
used elsewhere (e.g. opentelemetry.trace.Span) and the project-specific Provider
type if one exists; ensure you import any needed types (Span, Provider, Any,
Optional) and keep the return type as -> None.
backend/app/tests/services/llm/test_jobs.py (1)

224-256: 💤 Low value

Move ChainStatus import to module level.

from app.models.llm.request import ChainStatus at line 226 is the only local-scope import in the file. Placing it at module level makes import errors surface at collection time rather than only when this test runs.

♻️ Suggested change
 from app.models.llm.request import ConfigBlob, LLMCallConfig, LLMChainRequest
 from app.models.llm.request import ChainBlock as ChainBlockModel
+from app.models.llm.request import ChainStatus

Then remove line 226 inside the test method:

-    from app.models.llm.request import ChainStatus
-
     job = JobCrud(session=db).create(
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@backend/app/tests/services/llm/test_jobs.py` around lines 224 - 256, Move the
local import of ChainStatus out of the test body so import errors surface at
collection time: remove the line "from app.models.llm.request import
ChainStatus" inside test_handle_job_error_with_chain_id_updates_chain_status and
add a module-level import for ChainStatus at the top of the test file; ensure
the test still references ChainStatus unchanged and run tests to confirm no
import regressions.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@backend/app/celery/tasks/job_execution.py`:
- Around line 155-172: run_collection_batch_job duplicates
run_create_collection_job; remove the duplication by either deleting one of the
task functions if they truly serve the same purpose, or implement distinct batch
logic by creating a new execute_batch_job (call from run_collection_batch_job)
so batch-specific processing is isolated from execute_job; if the two tasks are
intentionally separate only for routing, add a clarifying comment above both
task definitions stating that and why they differ; additionally update
run_create_collection_job to include proper type hints on self (celery.Task) and
**kwargs (Any) to match project Python 3.11+ typing conventions.

---

Nitpick comments:
In `@backend/app/services/collections/create_collection.py`:
- Around line 151-161: The function _handle_job_failure is missing type
annotations for span, provider, and result; update the signature to add explicit
types (e.g. import typing.Any and typing.Optional and annotate as span: Span |
Any, provider: Optional[Provider] | Any, result: Optional[Any]) or use the
concrete Span type used elsewhere (e.g. opentelemetry.trace.Span) and the
project-specific Provider type if one exists; ensure you import any needed types
(Span, Provider, Any, Optional) and keep the return type as -> None.

In `@backend/app/tests/services/doctransformer/test_registry.py`:
- Around line 102-117: Add missing type hints to the inline transformer classes:
annotate TimeoutTransformer.transform(self, input_path: Path, output_path: Path)
-> None and SoftLimitTransformer.transform(self, input_path: Path, output_path:
Path) -> None, and ensure Path is imported from pathlib at the top of the test
file; keep the existing raise statements unchanged so behavior and the
pytest.raises checks for convert_document with
transformer_name="timeout"/"softlimit" remain the same.
- Around line 75-117: Split the large test_convert_document into multiple
focused tests: create a fixture factory (e.g., dummy_input_output) that prepares
input_file and output_file and reuse it across tests; then implement separate
test functions test_convert_document_success,
test_convert_document_transformer_not_found,
test_convert_document_wraps_generic_error,
test_convert_document_timeout_propagates, and
test_convert_document_soft_time_limit_propagates which each call
convert_document with the proper monkeypatched TRANSFORMERS entry (use
monkeypatch.setitem(TRANSFORMERS, "...", <TransformerClass>)) and assert the
expected outcome (success content, raises ValueError, raises
TransformationError, propagates Timeout, propagates SoftTimeLimitExceeded); keep
references to convert_document, TRANSFORMERS, and
TransformationError/Timeout/SoftTimeLimitExceeded to locate code.

In `@backend/app/tests/services/llm/test_jobs.py`:
- Around line 224-256: Move the local import of ChainStatus out of the test body
so import errors surface at collection time: remove the line "from
app.models.llm.request import ChainStatus" inside
test_handle_job_error_with_chain_id_updates_chain_status and add a module-level
import for ChainStatus at the top of the test file; ensure the test still
references ChainStatus unchanged and run tests to confirm no import regressions.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 96b92f85-01a6-4b40-8703-b07e163dd7a3

📥 Commits

Reviewing files that changed from the base of the PR and between 19c6052 and 49e5e5f.

📒 Files selected for processing (13)
  • backend/app/celery/tasks/job_execution.py
  • backend/app/celery/utils.py
  • backend/app/services/collections/create_collection.py
  • backend/app/services/collections/delete_collection.py
  • backend/app/services/doctransform/job.py
  • backend/app/services/llm/jobs.py
  • backend/app/services/stt_evaluations/batch_job.py
  • backend/app/services/stt_evaluations/metric_job.py
  • backend/app/services/tts_evaluations/batch_job.py
  • backend/app/services/tts_evaluations/batch_result_processing.py
  • backend/app/tests/services/doctransformer/test_registry.py
  • backend/app/tests/services/llm/test_jobs.py
  • backend/app/tests/services/stt_evaluations/test_metric_job.py
✅ Files skipped from review due to trivial changes (2)
  • backend/app/services/doctransform/job.py
  • backend/app/services/llm/jobs.py
🚧 Files skipped from review as they are similar to previous changes (2)
  • backend/app/services/stt_evaluations/metric_job.py
  • backend/app/services/tts_evaluations/batch_job.py

Comment thread backend/app/celery/tasks/job_execution.py Outdated
@coderabbitai coderabbitai Bot left a comment

♻️ Duplicate comments (1)
backend/app/celery/tasks/job_execution.py (1)

2-4: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Complete typing is still missing on helper/task signatures.

self, **kwargs, and return types are still unannotated across task entrypoints, and helper params are also untyped. This violates the repo-wide Python typing rule and leaves celery/Any imports partially unused by intent.

✅ Suggested patch pattern
-from typing import Any
+from collections.abc import Callable
+from typing import Any
 import celery
@@
-def _extract_parent_context(task_instance) -> otel_context.Context:
+def _extract_parent_context(task_instance: celery.Task) -> otel_context.Context:
@@
-def _run_with_otel_parent(task_instance, fn):
+def _run_with_otel_parent(
+    task_instance: celery.Task, fn: Callable[[], Any]
+) -> Any:
@@
-def run_llm_job(self, project_id: int, job_id: str, trace_id: str, **kwargs):
+def run_llm_job(
+    self: celery.Task, project_id: int, job_id: str, trace_id: str, **kwargs: Any
+) -> Any:

Apply the same signature pattern to all run_* functions in this file.

#!/bin/bash
# Verify untyped params/returns in this file (read-only)
python - <<'PY'
import ast
from pathlib import Path

path = Path("backend/app/celery/tasks/job_execution.py")
tree = ast.parse(path.read_text())

for node in tree.body:
    if isinstance(node, ast.FunctionDef):
        missing = []
        args = node.args.posonlyargs + node.args.args + node.args.kwonlyargs
        for a in args:
            if a.annotation is None:
                missing.append(a.arg)
        if node.args.vararg and node.args.vararg.annotation is None:
            missing.append(f"*{node.args.vararg.arg}")
        if node.args.kwarg and node.args.kwarg.annotation is None:
            missing.append(f"**{node.args.kwarg.arg}")
        if node.returns is None:
            missing.append("return")
        if missing:
            print(f"{node.name} (Line {node.lineno}): missing {', '.join(missing)}")
PY

As per coding guidelines, "**/*.py: Always add type hints to all function parameters and return values in Python code" and "Use Python 3.11+ with type hints throughout the codebase".

Also applies to: 23-42, 65-255

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@backend/app/celery/tasks/job_execution.py` around lines 2 - 4, Annotate all
task entrypoints and helpers in this file (every run_* function and any helpers
they call): add explicit parameter and return type hints (e.g., self:
celery.Task, **kwargs: Any or **kwargs: Dict[str, Any] as appropriate) and
concrete types for helper params/returns, and use typing.Any/Dict/Optional where
exact types are unknown; update function signatures to include return
annotations (e.g., -> Any or -> Dict[str, Any]) and adjust imports (from typing
import Any, Dict, Optional) so no imports are unused; apply the same pattern to
every run_* function referenced in this module so the AST checker in the review
comment reports no missing annotations.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Duplicate comments:
In `@backend/app/celery/tasks/job_execution.py`:
- Around line 2-4: Annotate all task entrypoints and helpers in this file (every
run_* function and any helpers they call): add explicit parameter and return
type hints (e.g., self: celery.Task, **kwargs: Any or **kwargs: Dict[str, Any]
as appropriate) and concrete types for helper params/returns, and use
typing.Any/Dict/Optional where exact types are unknown; update function
signatures to include return annotations (e.g., -> Any or -> Dict[str, Any]) and
adjust imports (from typing import Any, Dict, Optional) so no imports are
unused; apply the same pattern to every run_* function referenced in this module
so the AST checker in the review comment reports no missing annotations.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: e6939101-5972-4ac8-ae39-fc9d40f8ffd2

📥 Commits

Reviewing files that changed from the base of the PR and between 49e5e5f and cdf034b.

📒 Files selected for processing (3)
  • backend/app/celery/tasks/job_execution.py
  • backend/app/services/llm/jobs.py
  • backend/app/tests/services/llm/test_jobs.py
🚧 Files skipped from review as they are similar to previous changes (2)
  • backend/app/services/llm/jobs.py
  • backend/app/tests/services/llm/test_jobs.py

@nishika26 nishika26 requested review from kartpop and vprashrex May 5, 2026 03:42

except (Timeout, SoftTimeLimitExceeded) as err:
timeout_err = TimeoutError("Task exceeded soft time limit")
logger.error(
Collaborator

use logger.warning here instead of error ... if you think it is not that critical

except (Timeout, SoftTimeLimitExceeded):
raise
except TimeoutError:
logger.error(
Collaborator

use logger.warning

Comment thread backend/app/services/llm/jobs.py Outdated
)

except (Timeout, SoftTimeLimitExceeded):
logger.error(
Collaborator

use logger.warning

Comment thread backend/app/services/llm/jobs.py Outdated
return executor.run()

except (Timeout, SoftTimeLimitExceeded) as err:
logger.error(
Collaborator

logger.warning


except (Timeout, SoftTimeLimitExceeded) as err:
timeout_err = TimeoutError("Task exceeded soft time limit")
logger.error(
Collaborator

logger.warning


except (Timeout, SoftTimeLimitExceeded) as err:
timeout_err = TimeoutError("Task exceeded soft time limit")
logger.error(
Collaborator

Cases like Timeout or SoftTimeLimitExceeded are expected scenarios rather than unexpected failures. We already know why they occur and under what conditions they happen, so logging them with `logger.error` is not appropriate. A lower log level such as warning or info would be more suitable.
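The unified handler pattern discussed across these threads, with the log level lowered to warning, can be sketched as below. The exception classes are local stand-ins for `gevent.Timeout` and `celery.exceptions.SoftTimeLimitExceeded` so the sketch runs without those packages; `run_work` and `mark_failed` are hypothetical stand-ins for the task body and the run-status update.

```python
import logging
from collections.abc import Callable

logger = logging.getLogger(__name__)


# Stand-ins for gevent.Timeout and celery.exceptions.SoftTimeLimitExceeded.
class Timeout(Exception): ...
class SoftTimeLimitExceeded(Exception): ...


def execute_job(
    run_work: Callable[[], None],
    mark_failed: Callable[[str], None],
) -> None:
    """Sketch: map both timeout types to a uniform TimeoutError,
    log at warning level (an expected condition), mark the run
    FAILED, and re-raise."""
    try:
        run_work()
    except (Timeout, SoftTimeLimitExceeded):
        timeout_err = TimeoutError("Task exceeded soft time limit")
        logger.warning("Job timed out: %s", timeout_err)
        mark_failed(str(timeout_err))
        raise timeout_err from None
```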


sample_texts = get_sample_texts_from_dataset(session, dataset, project_id)
if not run:
logger.error(
Collaborator

logger.warning .. again, you know why this is happening: when the run data is not found, the DB returns None, so at this point use either warning or info

update_tts_run(

if not dataset:
logger.error(
Collaborator

same goes here ... use logger.warning or info

sample_texts = get_sample_texts_from_dataset(session, dataset, project_id)

if not sample_texts:
logger.error(
Collaborator

logger.warning or logger.info

@nishika26 nishika26 requested review from Prajna1999 May 6, 2026 04:07
Comment thread backend/app/celery/utils.py Outdated

def gevent_timeout(
seconds: float | None, task_name: str | None = None
) -> Callable[[Callable[P, R]], Callable[P, R]]:
Collaborator

@Prajna1999 Prajna1999 May 6, 2026

nitpick: the return type is largely unreadable for humans. Consider relaxing the type safety since this is just a decorator function.

@vprashrex vprashrex self-requested a review May 6, 2026 04:57
Comment thread backend/app/celery/utils.py Outdated

logger = logging.getLogger(__name__)

P = ParamSpec("P")
Collaborator

nitpick: may consider YOLO-ing it and forgetting about type safety here, as it's unreadable, non-standard Python.

Collaborator

@Prajna1999 Prajna1999 left a comment

added a few nitpicks wrt type safety and readability.

@nishika26 nishika26 merged commit 62a085d into main May 6, 2026
3 checks passed
@nishika26 nishika26 deleted the bug/gevent_timeout branch May 6, 2026 05:53

Labels

enhancement New feature or request ready-for-review

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Celery: Implement soft time limit for gevent

3 participants