
Assessment: Ai Assessment Pipeline#788

Open
vprashrex wants to merge 8 commits into main from feature/assessment

Conversation

@vprashrex (Collaborator) commented Apr 27, 2026

Summary

Target issue is #791
Explain the motivation for making this change. What existing problem does the pull request solve?

This PR introduces a shared assessment pipeline for partner programs, replacing manual and fragmented evaluation workflows. It adds multimodal dataset ingestion, config-driven batch assessment runs, status tracking, retries, cron-based processing, SSE updates, and exportable results for easier comparison and operations.

Checklist

Before submitting a pull request, please ensure that you complete these tasks.

  • Ran fastapi run --reload app/main.py or docker compose up in the repository root and tested.
  • If you've fixed a bug or added code, ensure it is covered by test cases.

Notes

Please add here if any other information is required for the reviewer.

Summary by CodeRabbit

  • New Features
    • Added a full assessment system for multi-configuration batch evaluations (OpenAI & Google), including create/retry flows and parent/child manager views.
    • Added dataset upload/management for CSV/XLSX with validation and size limits.
    • Added real-time SSE event stream for progress and notifications.
    • Added result export/download in JSON, CSV, XLSX, or ZIP (per-run).
    • Improved batch processing and attachment handling for image/file inputs.

coderabbitai Bot commented Apr 27, 2026

📝 Walkthrough

Adds a new assessment subsystem: DB models and migration, dataset upload/validation, batch submission for OpenAI/Google, async polling/processing, SSE events, export utilities, mappers, CRUD/service layers, API routes, and router/cron integration.

Changes

Cohort / File(s) / Summary

  • API Routing & Cron Integration (backend/app/api/main.py, backend/app/api/routes/cron.py):
    Includes assessment routes in the main router; converts the evaluation cron handler to async and merges assessment polling results via poll_all_pending_assessment_evaluations, capturing errors into the cron result.
  • Assessment Models & DB Migration (backend/app/assessment/models.py, backend/app/alembic/versions/055_add_assessment_manager_table.py, backend/app/models/__init__.py):
    Adds Assessment and AssessmentRun SQLModel tables, Pydantic schemas, indexes, and an alembic migration; re-exports models from the package init.
  • CRUD & Status Management (backend/app/assessment/crud.py):
    New CRUD operations for assessments and runs, a run status updater, parent status recomputation, and run-stats caching logic.
  • Batch Submission & Attachments (backend/app/assessment/batch.py):
    Builds provider-specific batch payloads (OpenAI JSONL / Google), resolves dataset rows and attachments (URLs, data: URIs, Drive links), maps attachments to provider input parts, and submits batch jobs.
  • Polling & Processing (backend/app/assessment/processing.py, backend/app/assessment/cron.py):
    Async polling of pending assessments/runs, provider batch-state checks, result download/parsing/sanitization, status updates, SSE publish, and aggregated counters/details.
  • Service Orchestration & Routes (backend/app/assessment/service.py, backend/app/assessment/routes.py):
    Start/retry flows, creation of manager/run records, batch submission, and new FastAPI endpoints for dataset CRUD, evaluation creation/retry, an SSE /events stream, listings, and results export.
  • Dataset Handling & Validation (backend/app/assessment/dataset.py, backend/app/assessment/validators.py):
    Dataset upload, row counting for CSV/XLSX with encoding fallbacks, object-storage upload, filename sanitization, and file-size/extension validation (10 MB limit).
  • Export & Parsing Utilities (backend/app/assessment/utils/export.py, backend/app/assessment/utils/parsing.py, backend/app/assessment/utils/__init__.py):
    Flattens/serializes export rows to JSON/CSV/XLSX, parses stored JSON/JSONL results, computes usage token totals, loads export rows for runs, and exposes utility exports.
  • Mappers & Normalization (backend/app/assessment/mappers.py):
    Maps Kaapi LLM params to OpenAI/Google request formats, normalizes LLM text, enforces schema strictness, handles reasoning/effort mapping, and translates the output schema.
  • Events Broker (backend/app/assessment/events.py):
    SSE broker with per-subscriber queues, keep-alives, non-blocking publish, and a module-level singleton.
  • Module Init & Minor Model Changes (backend/app/assessment/__init__.py, backend/app/models/batch_job.py, backend/app/models/llm/request.py):
    Adds an assessment module docstring; BatchJobType gains ASSESSMENT; TextLLMParams gains effort, summary, top_p, and max_output_tokens.
  • Tests Updated (backend/app/tests/api/routes/test_cron.py):
    Cron tests updated to mock async evaluation processing and the patched assessment polling (poll_all_pending_assessment_evaluations).
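The Events Broker cohort (per-subscriber queues, keep-alives, non-blocking publish) can be sketched roughly as below. This is an illustrative stand-in, not the actual backend/app/assessment/events.py implementation; the class name, queue size, and keep-alive interval are assumptions:

```python
import asyncio
import json
from collections.abc import AsyncIterator


class EventBroker:
    """Minimal SSE broker: one queue per subscriber, non-blocking publish."""

    def __init__(self, max_queue_size: int = 100) -> None:
        self._subscribers: set[asyncio.Queue[str]] = set()
        self._max_queue_size = max_queue_size

    def publish(self, payload: dict) -> None:
        # Non-blocking: drop the event for any subscriber whose queue is full
        # rather than stalling the publisher (typically the cron poller).
        message = f"data: {json.dumps(payload)}\n\n"
        for queue in self._subscribers:
            try:
                queue.put_nowait(message)
            except asyncio.QueueFull:
                pass

    async def subscribe(self) -> AsyncIterator[str]:
        queue: asyncio.Queue[str] = asyncio.Queue(self._max_queue_size)
        self._subscribers.add(queue)
        try:
            while True:
                try:
                    yield await asyncio.wait_for(queue.get(), timeout=15.0)
                except asyncio.TimeoutError:
                    # SSE comment line as a keep-alive so idle proxies
                    # don't close the connection.
                    yield ": keep-alive\n\n"
        finally:
            self._subscribers.discard(queue)
```

A FastAPI route would wrap `broker.subscribe()` in a StreamingResponse with media type `text/event-stream`.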

Sequence Diagram

sequenceDiagram
    participant Client
    participant API as API Routes
    participant Service as Assessment Service
    participant DB as Database
    participant Provider as Batch Provider
    participant Storage as Object Storage
    participant Cron as Cron Poller

    Client->>API: POST /assessment (dataset, configs)
    API->>Service: start_assessment(request)
    Service->>DB: create Assessment (manager)
    Service->>DB: resolve configs

    loop per config
        Service->>DB: create AssessmentRun
        Service->>Provider: submit_batch(payload)
        Provider-->>Service: batch_job_id
        Service->>DB: update run -> processing
    end
    Service-->>Client: AssessmentResponse

    Cron->>DB: fetch pending runs
    Cron->>Provider: check_batch_status(job_id)
    alt batch complete
        Provider-->>Cron: results
        Cron->>Storage: upload_raw_results()
        Cron->>Service: parse_assessment_output
        Cron->>DB: update run -> completed
        Cron->>API: publish SSE (results_ready)
    else batch failed
        Cron->>DB: update run -> failed
    else in-progress
        Cron-->>Cron: keep processing
    end

    Client->>API: GET /assessment/{id}/results?format=csv|json|xlsx
    API->>Service: build_export_response
    Service->>Storage: fetch_results()
    Service-->>Client: StreamingResponse (file)

Estimated code review effort

🎯 5 (Critical) | ⏱️ ~120 minutes

Suggested labels

enhancement, ready-for-review

Suggested reviewers

  • Prajna1999
  • kartpop

Poem

🐰 I hopped through rows and configs bright,
Built batches, polled throughout the night,
SSE beacons, CSV dreams,
Providers fed my rabbit schemes—
A bunny cheers: assessments take flight! 🚀

🚥 Pre-merge checks | ✅ 3 | ❌ 2

❌ Failed checks (1 warning, 1 inconclusive)

Check name | Status | Explanation | Resolution
Docstring Coverage | ⚠️ Warning | Docstring coverage is 71.67%, below the required threshold of 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold.
Title check | ❓ Inconclusive | The title 'Assessment: Ai Assessment Pipeline' partially describes the changeset by referencing the assessment feature, but lacks specificity about the primary change (implementation of a multi-config batch evaluation system). | Consider a more specific title, such as 'Add assessment batch evaluation pipeline' or 'Implement multi-config assessment system', to better clarify the main feature being introduced.
✅ Passed checks (3 passed)
Check name | Status | Explanation
Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request.
Description Check | ✅ Passed | Check skipped because CodeRabbit's high-level summary is enabled.



coderabbitai Bot left a comment

Actionable comments posted: 12

Note

Due to the large number of review comments, Critical and Major severity comments were prioritized as inline comments.

🟡 Minor comments (15)
backend/app/assessment/utils/parsing.py-25-26 (1)

25-26: ⚠️ Potential issue | 🟡 Minor

Use explicit None checks to avoid masking zero token counts.

usage.get("input_tokens") or usage.get("prompt_tokens") will fall through to the alternate key when the primary key is present but 0. While provider payloads with zero input/output tokens are uncommon, the fallthrough also makes the precedence between the two aliases inconsistent (only when truthy). Prefer an explicit None check so the _tokens key always wins when present.

♻️ Proposed fix
-    input_tokens = usage.get("input_tokens") or usage.get("prompt_tokens")
-    output_tokens = usage.get("output_tokens") or usage.get("completion_tokens")
+    input_tokens = (
+        usage.get("input_tokens")
+        if usage.get("input_tokens") is not None
+        else usage.get("prompt_tokens")
+    )
+    output_tokens = (
+        usage.get("output_tokens")
+        if usage.get("output_tokens") is not None
+        else usage.get("completion_tokens")
+    )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/app/assessment/utils/parsing.py` around lines 25 - 26, The current
token extraction uses truthy fallback which treats 0 as missing; replace the
"or" fallbacks with explicit None checks so that input_tokens prefers
usage["input_tokens"] when the key exists (even if 0) and only uses
usage["prompt_tokens"] when usage.get("input_tokens") is None, and do the
analogous change for output_tokens so usage["output_tokens"] takes precedence
over usage["completion_tokens"]; update the two assignments in parsing.py that
set input_tokens and output_tokens accordingly (use explicit is not None checks
against usage.get(...)).
backend/app/models/user.py-81-81 (1)

81-81: ⚠️ Potential issue | 🟡 Minor

PATCH /me should compute and return features like GET /me, or document the inconsistency.

The features field defaults to [] in most endpoints but is populated by flag computation only in GET /me. This creates an inconsistency for PATCH /me (line 97), which returns the user directly without computing features—clients calling PATCH /me will receive features: [] even if the user has feature flags enabled, contradicting the behavior of GET /me.

Recommend either:

  • Computing features in PATCH /me before returning, or
  • Documenting that features reflects the current state only after GET /me.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/app/models/user.py` at line 81, PATCH `/me` returns the User model
with features defaulting to [] because it doesn't run the same flag computation
used by GET `/me`; modify the PATCH `/me` handler to compute and populate the
features field before returning (reuse the same flag computation logic used by
GET `/me`, or extract that logic into a helper like compute_user_features(user)
and call it from both GET `/me` and PATCH `/me`) so the returned User.features
matches GET `/me`.
backend/app/api/permissions.py-92-101 (1)

92-101: ⚠️ Potential issue | 🟡 Minor

Misleading 403 detail when organization context is absent.

When org_id is None (caller has no organization on the auth context), the message still reads "is not enabled for this organization", which conflates "feature disabled" with "no organization scope". Consider distinguishing the two for clearer client-side handling.

🛡️ Suggested differentiation
-        if org_id is None or not is_enabled(
-            session=session,
-            flag=flag,
-            organization_id=org_id,
-            project_id=project_id,
-        ):
-            raise HTTPException(
-                status_code=403,
-                detail=f"Feature '{flag.name}' is not enabled for this organization.",
-            )
+        if org_id is None:
+            raise HTTPException(
+                status_code=403,
+                detail=f"Feature '{flag.name}' requires an organization context.",
+            )
+        if not is_enabled(
+            session=session,
+            flag=flag,
+            organization_id=org_id,
+            project_id=project_id,
+        ):
+            raise HTTPException(
+                status_code=403,
+                detail=f"Feature '{flag.name}' is not enabled for this organization.",
+            )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/app/api/permissions.py` around lines 92 - 101, The current 403
message conflates missing org context and a disabled feature; update the check
around org_id and is_enabled in the block containing org_id, is_enabled(...),
and the raised HTTPException so that if org_id is None you raise a 403 (or
appropriate 400/401 if preferred) with a clear message like "No organization
context provided" (referencing org_id), otherwise keep the 403 with detail
f"Feature '{flag.name}' is not enabled for this organization." (use session,
project_id and flag.name as currently used); ensure the two paths are distinct
and use the same HTTPException symbol for consistency.
backend/app/assessment/cron.py-144-182 (1)

144-182: ⚠️ Potential issue | 🟡 Minor

Failure-path itself is unprotected — a single bad cleanup aborts the entire poll.

update_assessment_run_status (commits, may raise on DB error / rollback) and recompute_assessment_status (raises ValueError if the assessment was concurrently deleted) are both invoked inside the except handler with no further protection. If either raises, the exception escapes the per-run loop and skips every remaining assessment in this cron tick.

Wrap the cleanup so a failure on one run can't stall the cron:

🛡️ Proposed defensive cleanup
             except Exception as e:
                 logger.error(
                     "[poll_all_pending_assessment_evaluations] "
                     f"Failed run {run.id} | "
                     ...
                     exc_info=True,
                 )
-                update_assessment_run_status(
-                    session=session,
-                    run=run,
-                    status="failed",
-                    error_message="Processing failed. Check server logs for details.",
-                )
-                refreshed_assessment = recompute_assessment_status(
-                    session=session, assessment_id=assessment.id
-                )
+                try:
+                    update_assessment_run_status(
+                        session=session,
+                        run=run,
+                        status="failed",
+                        error_message="Processing failed. Check server logs for details.",
+                    )
+                    refreshed_assessment = recompute_assessment_status(
+                        session=session, assessment_id=assessment.id
+                    )
+                except Exception as cleanup_err:
+                    logger.error(
+                        "[poll_all_pending_assessment_evaluations] "
+                        f"Cleanup failed for run {run.id} | error={cleanup_err}",
+                        exc_info=True,
+                    )
+                    failed += 1
+                    continue
                 failure_result = {
                     ...
                 }
                 all_results.append(failure_result)
                 assessment_event_broker.publish(
                     _build_callback_payload(
                         refreshed_assessment,
                         run,
                         failure_result,
                     )
                 )
                 failed += 1
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/app/assessment/cron.py` around lines 144 - 182, The except handler
for run processing must be made defensive so cleanup errors don't abort the
whole poll: wrap calls to update_assessment_run_status,
recompute_assessment_status, and assessment_event_broker.publish (and any
related cleanup logic that may commit/raise) in their own try/except block(s),
catch and log any exceptions (including the run.id, run.run_name and
assessment_id) with exc_info=True, ensure you still append a failure_result and
increment failed for this run even if cleanup fails, and swallow the exception
so the per-run loop continues to the next run.
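The defensive shape being requested, reduced to a runnable sketch. `poll_runs`, `fail_run`, and the run dicts are hypothetical stand-ins for the real session/CRUD calls:

```python
import logging
from collections.abc import Callable

logger = logging.getLogger(__name__)


def poll_runs(
    runs: list[dict], fail_run: Callable[[dict], None]
) -> tuple[int, int]:
    """Per-run loop where neither processing nor cleanup errors can abort
    the remaining runs; `fail_run` stands in for the DB status update."""
    completed = failed = 0
    for run in runs:
        try:
            run["process"]()  # hypothetical per-run work
            completed += 1
        except Exception:
            logger.error("run %s failed", run["id"], exc_info=True)
            try:
                # Cleanup may itself raise (DB error, concurrently deleted parent).
                fail_run(run)
            except Exception:
                logger.error("cleanup failed for run %s", run["id"], exc_info=True)
            failed += 1  # still counted even when cleanup also failed
    return completed, failed
```

The key property: a ValueError inside `fail_run` is logged and swallowed, so the loop reaches every remaining run in the cron tick.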
backend/app/assessment/models.py-32-36 (1)

32-36: ⚠️ Potential issue | 🟡 Minor

id: int = SQLField(default=None, ...) mismatches the type — should be int | None.

Both Assessment.id and AssessmentRun.id are typed int but defaulted to None (until the DB assigns one on insert). Pydantic v2 + SQLModel will raise validation errors when constructing instances pre-flush, and static type-checkers will (correctly) flag every place that reads assessment.id assuming non-None when the row is unflushed.

🛠️ Proposed fix
-    id: int = SQLField(
+    id: int | None = SQLField(
         default=None,
         primary_key=True,
         sa_column_kwargs={"comment": "Unique identifier for the assessment"},
     )
@@
-    id: int = SQLField(
+    id: int | None = SQLField(
         default=None,
         primary_key=True,
         sa_column_kwargs={"comment": "Unique identifier for the assessment run"},
     )

Also applies to: 144-148

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/app/assessment/models.py` around lines 32 - 36, The id fields are
declared as `id: int = SQLField(default=None, ...)` but default None means the
type should allow None; update the type annotations of Assessment.id and
AssessmentRun.id to be nullable (e.g., `int | None` or `Optional[int]`) to match
the SQLField default and avoid Pydantic/typing errors, and then review any call
sites that assume non-None to handle the unflushed case or assert/coerce as
needed.
backend/app/assessment/mappers.py-10-22 (1)

10-22: ⚠️ Potential issue | 🟡 Minor

Type signature doesn't match call sites — text may be None.

normalize_llm_text is declared as text: str -> str, but it's called with kaapi_params.get("instructions") (lines 100, 185), which returns Optional[str]. The body already handles non-str via the isinstance(text, str) early return, so widening the annotation correctly will keep the code honest and silence type-checkers.

🛠️ Proposed fix
-def normalize_llm_text(text: str) -> str:
-    if not isinstance(text, str) or not text:
-        return text
+def normalize_llm_text(text: str | None) -> str | None:
+    if not isinstance(text, str) or not text:
+        return text
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/app/assessment/mappers.py` around lines 10 - 22, The function
normalize_llm_text has a too-narrow type signature: change its annotation from
def normalize_llm_text(text: str) -> str to accept and return Optional[str]
(e.g., def normalize_llm_text(text: Optional[str]) -> Optional[str]) and add the
necessary Optional import from typing; this matches call sites like
kaapi_params.get("instructions") and the existing early-return behavior that
handles non-str/None values.
backend/app/api/routes/features.py-26-30 (1)

26-30: ⚠️ Potential issue | 🟡 Minor

Missing return type annotation on helper.

_parse_flag_or_400 returns a FeatureFlag enum value but has no return annotation.

As per coding guidelines: "Always add type hints to all function parameters and return values in Python code".

🛠️ Proposed fix
-def _parse_flag_or_400(flag_key: str):
+from app.core.feature_flags import FeatureFlag as FeatureFlagEnum
+
+def _parse_flag_or_400(flag_key: str) -> FeatureFlagEnum:
     try:
         return parse_feature_flag(flag_key)
     except ValueError as exc:
         raise HTTPException(status_code=400, detail=str(exc)) from exc

(Use whatever import path matches your existing parse_feature_flag return type.)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/app/api/routes/features.py` around lines 26 - 30, add a return type
annotation to the helper _parse_flag_or_400 indicating it returns the
FeatureFlag enum returned by parse_feature_flag; update the function signature
to include -> FeatureFlag and import FeatureFlag from its module (or use the
correct fully-qualified import used elsewhere) so the helper's return type is
explicitly annotated and matches parse_feature_flag's return type.
backend/app/assessment/batch.py-107-135 (1)

107-135: ⚠️ Potential issue | 🟡 Minor

Workbook handle is leaked on exceptions.

openpyxl.load_workbook(...) opens a file handle (especially in read_only=True mode where it streams). If next(rows_iter, None), the column build, or the row iteration raises, wb.close() is skipped and the underlying ZIP/file handle is leaked until GC. Wrap the body in try/finally or use a context-manager-style pattern.

🛠️ Proposed fix
-    wb = openpyxl.load_workbook(io.BytesIO(content), read_only=True, data_only=True)
-    ws = wb.active
-    if ws is None:
-        wb.close()
-        return []
-
-    rows_iter = ws.iter_rows(values_only=True)
-    header = next(rows_iter, None)
-    if header is None:
-        wb.close()
-        return []
-
-    columns = [str(h) if h is not None else f"col_{i}" for i, h in enumerate(header)]
-    result = []
-    for row in rows_iter:
-        if row and any(cell is not None for cell in row):
-            row_dict = {
-                columns[i]: str(cell) if cell is not None else ""
-                for i, cell in enumerate(row)
-                if i < len(columns)
-            }
-            result.append(row_dict)
-
-    wb.close()
-    return result
+    wb = openpyxl.load_workbook(io.BytesIO(content), read_only=True, data_only=True)
+    try:
+        ws = wb.active
+        if ws is None:
+            return []
+
+        rows_iter = ws.iter_rows(values_only=True)
+        header = next(rows_iter, None)
+        if header is None:
+            return []
+
+        columns = [str(h) if h is not None else f"col_{i}" for i, h in enumerate(header)]
+        result: list[dict[str, str]] = []
+        for row in rows_iter:
+            if row and any(cell is not None for cell in row):
+                result.append(
+                    {
+                        columns[i]: str(cell) if cell is not None else ""
+                        for i, cell in enumerate(row)
+                        if i < len(columns)
+                    }
+                )
+        return result
+    finally:
+        wb.close()
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/app/assessment/batch.py` around lines 107 - 135, The workbook handle
opened by openpyxl.load_workbook in _parse_excel_rows can leak if any subsequent
operation raises; wrap workbook usage in a try/finally or context-manager
pattern so wb.close() always runs: after creating wb, enter a try block that
performs ws = wb.active, header = next(rows_iter, ...), column construction and
row iteration, and put wb.close() in the finally (or use
openpyxl.load_workbook’s context manager if available) to ensure the file/ZIP
handle is closed on success or exception.
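The try/finally shape generalizes beyond openpyxl. Below is a stdlib stand-in (csv in place of a workbook) showing the same header/row parsing with early-return-safe cleanup; `parse_rows` is illustrative, not the batch.py code:

```python
import csv
import io


def parse_rows(content: bytes) -> list[dict[str, str]]:
    """Header-keyed row parsing with guaranteed handle cleanup.

    csv stands in for the openpyxl workbook here; the try/finally shape is
    the point: close() runs on every early return and on any exception.
    """
    buffer = io.StringIO(content.decode("utf-8"))
    try:
        rows_iter = csv.reader(buffer)
        header = next(rows_iter, None)
        if header is None:
            return []  # finally still closes the handle
        columns = [h if h else f"col_{i}" for i, h in enumerate(header)]
        return [
            {columns[i]: cell for i, cell in enumerate(row) if i < len(columns)}
            for row in rows_iter
            if any(cell for cell in row)  # skip fully empty rows
        ]
    finally:
        buffer.close()
```

With openpyxl specifically, `wb.close()` in the finally block (or a context manager, where the installed version supports it) plays the role of `buffer.close()` here.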
backend/app/assessment/processing.py-37-70 (1)

37-70: ⚠️ Potential issue | 🟡 Minor

_sanitize_json_output doesn't escape \b or \f.

JSON spec (RFC 8259) treats every U+0000–U+001F character as illegal inside string values. The current implementation handles \n, \r, \t, but model outputs occasionally contain literal \b (0x08) and \f (0x0C) — most often when they're hallucinated. If those slip through, the second json.loads will still fail.

Tightening the sanitizer to escape any ord(ch) < 0x20 while inside a string covers all illegal control characters in one shot.

🛠️ Proposed fix
-        elif in_string and ch == "\n":
-            result.append("\\n")
-        elif in_string and ch == "\r":
-            result.append("\\r")
-        elif in_string and ch == "\t":
-            result.append("\\t")
-        else:
-            result.append(ch)
+        elif in_string and ord(ch) < 0x20:
+            escapes = {"\n": "\\n", "\r": "\\r", "\t": "\\t",
+                       "\b": "\\b", "\f": "\\f"}
+            result.append(escapes.get(ch, f"\\u{ord(ch):04x}"))
+        else:
+            result.append(ch)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/app/assessment/processing.py` around lines 37 - 70, The sanitizer
_sanitize_json_output currently only replaces newline, carriage return and tab
inside JSON strings and misses other control characters like \b and \f; update
the loop so that when inside a string (in_string is True) and the current
character is not an escaped character (escape_next is False) you check if
ord(ch) < 0x20 and, if so, append a JSON-safe escape (e.g. a specific mapping
for '\n','\r','\t','\b','\f' or generically append "\\u{ord(ch):04x}") instead
of the raw character; keep the existing handling for backslash (escape_next) and
quote toggling so previously escaped characters are preserved.
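A runnable version of the tightened sanitizer. This is a sketch of the suggested approach, not the code currently in processing.py:

```python
def sanitize_json_output(raw: str) -> str:
    """Escape every control character (U+0000 - U+001F) found inside JSON
    string literals, per RFC 8259; characters outside strings pass through,
    and already-escaped sequences are preserved."""
    named = {"\n": "\\n", "\r": "\\r", "\t": "\\t", "\b": "\\b", "\f": "\\f"}
    result: list[str] = []
    in_string = False
    escape_next = False
    for ch in raw:
        if escape_next:
            result.append(ch)  # keep the char following a backslash as-is
            escape_next = False
        elif in_string and ch == "\\":
            result.append(ch)
            escape_next = True
        elif ch == '"':
            in_string = not in_string
            result.append(ch)
        elif in_string and ord(ch) < 0x20:
            # Named escape where one exists, generic \uXXXX otherwise.
            result.append(named.get(ch, f"\\u{ord(ch):04x}"))
        else:
            result.append(ch)
    return "".join(result)
```

The `ord(ch) < 0x20` branch covers \b, \f, and every other illegal control character in one shot, so the follow-up json.loads cannot fail on that class of input.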
backend/app/crud/feature_flag.py-37-63 (1)

37-63: ⚠️ Potential issue | 🟡 Minor

Concurrent create can leak IntegrityError (500) instead of clean conflict.

Between get_feature_flag(...) and session.commit(), two concurrent requests with the same (key, organization_id, project_id) can both pass the existence check and try to insert. The DB's uq_feature_flag_key_org_project constraint correctly rejects the second one, but as an uncaught IntegrityError, which surfaces as HTTP 500 in create_feature_flag_route (it expects None to translate to 409).

Catching the integrity error and returning None (or re-fetching) keeps behavior consistent with the existing None-on-conflict contract.

🛡️ Proposed fix
+from sqlalchemy.exc import IntegrityError
@@
     feature_flag = FeatureFlag(
         key=key,
         organization_id=organization_id,
         project_id=project_id,
         enabled=enabled,
     )
     session.add(feature_flag)
-    session.commit()
-    session.refresh(feature_flag)
-    return feature_flag
+    try:
+        session.commit()
+    except IntegrityError:
+        session.rollback()
+        return None
+    session.refresh(feature_flag)
+    return feature_flag
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/app/crud/feature_flag.py` around lines 37 - 63, The
create_feature_flag function can raise an uncaught IntegrityError when two
requests concurrently pass get_feature_flag and both attempt to insert the same
(key, organization_id, project_id); wrap the insert/commit in a try/except
catching sqlalchemy.exc.IntegrityError, call session.rollback() in the except,
then return None (or optionally re-fetch via get_feature_flag and return that)
so the function preserves the contract of returning None on conflict instead of
letting a 500 bubble up; update create_feature_flag to import IntegrityError and
handle the exception around session.commit()/session.add() accordingly.
backend/app/assessment/routes.py-332-336 (1)

332-336: ⚠️ Potential issue | 🟡 Minor

Missing return type annotation on event_stream.

-    async def event_stream():
+    async def event_stream() -> AsyncIterator[str]:
         async for chunk in assessment_event_broker.subscribe():
             yield chunk
             await asyncio.sleep(0)

You'll need from collections.abc import AsyncIterator (or from typing import AsyncIterator on older Pythons).

As per coding guidelines: "Always add type hints to all function parameters and return values in Python code".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/app/assessment/routes.py` around lines 332 - 336, add a return type
annotation to the async generator by changing the signature of event_stream to
specify AsyncIterator[str] (e.g. async def event_stream() ->
AsyncIterator[str]:) and import AsyncIterator from collections.abc at the top
of the module; keep the body using assessment_event_broker.subscribe() unchanged
so the generator type matches the yielded chunk type.
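A self-contained illustration of the requested annotation, assuming the broker yields str SSE frames (the chunk list here is a stand-in for `assessment_event_broker.subscribe()`):

```python
import asyncio
from collections.abc import AsyncIterator


async def event_stream(chunks: list[str]) -> AsyncIterator[str]:
    """An async generator is annotated with the iterator it produces,
    parameterized by the yielded chunk type (str for SSE text frames)."""
    for chunk in chunks:
        yield chunk
        await asyncio.sleep(0)  # cooperatively yield control between chunks


async def collect(chunks: list[str]) -> list[str]:
    return [chunk async for chunk in event_stream(chunks)]
```

`AsyncGenerator[str, None]` would also be correct; `AsyncIterator[str]` is the minimal interface consumers of the stream need.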
backend/app/assessment/batch.py-395-420 (1)

395-420: ⚠️ Potential issue | 🟡 Minor

google_params["reasoning"] is set by the mapper but never consumed here.

map_kaapi_to_google_params (mappers.py line 205–207) writes google_params["reasoning"] = reasoning, but the builder at lines 395–420 only reads instructions, temperature, top_p, max_output_tokens, thinking_config, and output_schema. The reasoning value is silently dropped and never included in the Gemini request.

Either map reasoning into the appropriate Gemini field (e.g., merge with thinkingConfig) or remove the assignment from the mapper to eliminate dead code.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/app/assessment/batch.py` around lines 395 - 420, The
google_params["reasoning"] value produced by map_kaapi_to_google_params is never
used in the Gemini request builder; update the builder in batch.py to consume
google_params["reasoning"] by merging it into the outgoing generationConfig
(e.g., if thinking_config exists, combine or extend it with reasoning and set
generation_config["thinkingConfig"] accordingly; if not, set
generation_config["thinkingConfig"] = reasoning) so the reasoning intent is sent
as part of the request, or alternatively remove the assignment of
google_params["reasoning"] in map_kaapi_to_google_params to eliminate dead code.
backend/app/assessment/mappers.py-93-106 (1)

93-106: ⚠️ Potential issue | 🟡 Minor

Add guard clause for model before calling litellm.supports_reasoning().

If kaapi_params lacks "model", the code calls litellm.supports_reasoning(model="openai/None"), which returns False (or raises an error depending on litellm version). This causes misleading warnings at lines 131–134 that claim the model "does not support reasoning" when the actual problem is that no model was specified.

Fix
-    support_reasoning = litellm.supports_reasoning(model=f"openai/{model}")
+    support_reasoning = (
+        litellm.supports_reasoning(model=f"openai/{model}") if model else False
+    )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/app/assessment/mappers.py` around lines 93 - 106, The code calls
litellm.supports_reasoning with model=f"openai/{model}" even when model is None;
add a guard that checks whether model is present in kaapi_params before invoking
litellm.supports_reasoning: if model is truthy call support_reasoning =
litellm.supports_reasoning(model=f"openai/{model}"), otherwise set
support_reasoning = False (and optionally log or raise a clear "model missing"
message) so you don't pass "openai/None" into supports_reasoning or produce
misleading reasoning-support warnings; update any subsequent logic that uses
support_reasoning accordingly (look for the model variable and support_reasoning
usage in this function).
backend/app/assessment/utils/export.py-395-417 (1)

395-417: ⚠️ Potential issue | 🟡 Minor

assessment_id=run.assessment_id or 0 uses 0 as a sentinel for runs without a parent assessment, inconsistent with other models.

AssessmentExportRow.assessment_id is typed as int, forcing the conversion of None to 0. However, AssessmentRunPublic, AssessmentRunSummary, and the source field run.assessment_id all use int | None. Change AssessmentExportRow.assessment_id to int | None to match the source type and existing patterns in the codebase.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/app/assessment/utils/export.py` around lines 395 - 417,
AssessmentExportRow.assessment_id is currently coerced to 0 in the export
(assessment_id=run.assessment_id or 0) and typed as int; change the
AssessmentExportRow dataclass/typing to accept int | None for assessment_id and
stop converting None to 0 by passing run.assessment_id directly in the
export_rows append. Update the type annotation
(AssessmentExportRow.assessment_id: int | None), any related imports/typing
declarations, and adjust any callers that assumed 0 to handle None where
necessary (references: AssessmentExportRow and the export loop building
export_rows using run.assessment_id).
backend/app/assessment/crud.py-275-292 (1)

275-292: ⚠️ Potential issue | 🟡 Minor

Status taxonomy is inconsistent; "in_progress" is checked but never set, causing silent data loss.

processing_runs accepts both "processing" and "in_progress", while the other counters and the model documentation accept single literals. The documented statuses are only "pending", "processing", "completed", "failed" — yet the aggregator checks for a fifth value that is never assigned anywhere in the codebase.

If any run has a status outside the expected set (typo, legacy value, or other unexpected value), it is excluded from all four counters but still contributes to total_runs = len(runs). When _determine_assessment_status receives mismatched totals, it falls through to "processing", silently masking the inconsistency.

Recommend (a) consolidating on a single canonical set of statuses (StrEnum) enforced at assignment time, and (b) asserting when a run's status falls outside that set.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/app/assessment/crud.py` around lines 275 - 292, The aggregator is
mixing an unexpected "in_progress" literal into processing_runs and not counting
unexpected statuses, causing total_runs mismatch and silent masking in
_determine_assessment_status; fix by consolidating to the canonical status set
(pending, processing, completed, failed) and stop checking "in_progress" in the
processing_runs calculation (only check r.status == "processing"), and add a
validation/assertion step when iterating runs (e.g., validate_run_status or
inline check inside the block that computes
pending_runs/processing_runs/completed_runs/failed_runs) that logs or raises if
r.status is not in the canonical set so invalid/legacy values are caught at
assignment time rather than ignored. Ensure assessment.total_runs still equals
len(runs) and that the counters sum to total_runs unless a validation error is
raised; update references to processing_runs and _determine_assessment_status
accordingly.
🧹 Nitpick comments (16)
backend/app/assessment/utils/parsing.py (1)

13-17: Consider hardening JSON parsing against malformed payloads.

json.loads on the array branch and per-line in the JSONL branch will raise JSONDecodeError straight to callers. If batch outputs from providers occasionally contain a malformed line, the entire result load fails. Depending on caller expectations, you may want to wrap line parsing with a try/except + logger warning so a single bad record doesn’t poison the whole run. Skip if upstream callers are already expected to handle this.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/app/assessment/utils/parsing.py` around lines 13 - 17, The JSON
parsing currently calls json.loads directly on content (array branch using
parsed) and per-line (JSONL branch), which will raise JSONDecodeError for
malformed payloads; wrap the top-level json.loads(content) in a try/except
JSONDecodeError and log a warning then return an empty list (or appropriate
fallback), and for the JSONL branch catch JSONDecodeError around each
json.loads(line) so you log the bad line and skip it instead of letting one bad
record fail the whole parse; use the existing logger (or add one) to emit
warnings that include the offending line and the exception.
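One way to sketch the per-line guard. The function name is hypothetical, and the real parser in parsing.py also has a JSON-array branch not shown here:

```python
import json
import logging

logger = logging.getLogger(__name__)


def parse_jsonl_lenient(content: str) -> list[dict]:
    """Parse JSONL, logging and skipping malformed lines instead of failing the run."""
    records: list[dict] = []
    for lineno, line in enumerate(content.splitlines(), start=1):
        if not line.strip():
            continue
        try:
            records.append(json.loads(line))
        except json.JSONDecodeError as exc:
            # One bad record no longer poisons the whole result load.
            logger.warning("[parse_jsonl_lenient] Skipping malformed line %d: %s", lineno, exc)
    return records
```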
backend/app/assessment/validators.py (1)

53-67: Simplify by reading once and checking len(content).

The file.file.seek/tell/seek dance reaches into the underlying SpooledTemporaryFile synchronously, which works but bypasses Starlette's async UploadFile API and is brittle if the upload backend changes. Reading first and validating size on the bytes avoids the seek-around entirely and is also cheaper since read() is being called anyway. The 10 MB upper bound provides a natural memory cap.

-    file.file.seek(0, 2)
-    file_size = file.file.tell()
-    file.file.seek(0)
-
-    if file_size > MAX_FILE_SIZE:
+    content = await file.read()
+    file_size = len(content)
+
+    if file_size > MAX_FILE_SIZE:
         raise HTTPException(
             status_code=413,
             detail=f"File too large. Maximum size: {MAX_FILE_SIZE / (1024 * 1024):.0f}MB",
         )
 
     if file_size == 0:
         raise HTTPException(status_code=422, detail="Empty file uploaded")
 
-    content = await file.read()
     return content, file_ext

Note: if you ever need to enforce the limit before reading (to prevent OOM on very large uploads), prefer streaming chunked reads with a running counter rather than the seek/tell approach — but for a 10 MB cap this is fine.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/app/assessment/validators.py` around lines 53 - 67, Replace the
synchronous file.file.seek/tell/seek usage with a single async read: call await
file.read() once (using UploadFile.read()), then use len(content) to enforce
MAX_FILE_SIZE and detect empty uploads instead of file_size; remove references
to file.file.seek/tell and keep the existing HTTPException behavior and return
(content, file_ext) from the same validator function.
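If the limit ever needs enforcing before buffering the whole upload, the chunked read with a running counter mentioned in the note could look roughly like this. `read_limited` is a hypothetical helper; `file` is assumed to expose a Starlette-style async `read(size)`, and the real validator would raise `HTTPException(413)` rather than `ValueError`:

```python
from typing import Protocol


class AsyncReadable(Protocol):
    async def read(self, size: int = -1) -> bytes: ...


async def read_limited(file: AsyncReadable, max_bytes: int) -> bytes:
    """Read an upload in chunks, failing before max_bytes lands in memory."""
    chunks: list[bytes] = []
    total = 0
    while chunk := await file.read(64 * 1024):
        total += len(chunk)
        if total > max_bytes:
            # Abort as soon as the running total crosses the cap.
            raise ValueError("file too large")
        chunks.append(chunk)
    return b"".join(chunks)
```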
backend/app/main.py (1)

70-75: Add a return type annotation to lifespan.

Per coding guidelines for **/*.py, all function parameters and return values should have type hints. The app: FastAPI parameter is annotated, but the return is not. Ruff's ARG001 warning about unused app is a false positive — FastAPI's lifespan protocol requires this signature.

-@asynccontextmanager
-async def lifespan(app: FastAPI):
-    """App startup / shutdown lifecycle."""
+@asynccontextmanager
+async def lifespan(app: FastAPI) -> AsyncIterator[None]:
+    """App startup / shutdown lifecycle."""

(Add from collections.abc import AsyncIterator to imports.) As per coding guidelines: "Always add type hints to all function parameters and return values in Python code".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/app/main.py` around lines 70 - 75, The lifespan function lacks a
return type hint; update its signature to annotate the return as an async
iterator (e.g., AsyncIterator[None]) and add the required import from
collections.abc (import AsyncIterator) so the signature becomes async def
lifespan(app: FastAPI) -> AsyncIterator[None]:; keep the existing app parameter
annotation (to avoid Ruff false-positive) and no other behavior changes.
backend/app/models/__init__.py (1)

100-100: Consider moving Assessment/AssessmentRun re-exports to a dedicated location or removing them entirely.

The import of Assessment and AssessmentRun from app.assessment.models at line 100 is currently safe (no circular import exists), but consolidating external model re-exports in a central __init__ can create tight coupling if the assessment subpackage ever needs to import from app.models for FK relationships. A cleaner approach:

  1. Remove this re-export and have callers from app.assessment.models import Assessment, AssessmentRun directly (preferred — keeps subpackage models self-contained).
  2. If table discovery is required for Alembic, register models in your env.py/metadata module instead.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/app/models/__init__.py` at line 100, The __init__.py currently
re-exports Assessment and AssessmentRun (from app.assessment.models) which can
create tight coupling; remove the import line exporting Assessment and
AssessmentRun from app.assessment.models in app.models.__init__ and update
callers to import directly from app.assessment.models instead (or, if you need
automatic table discovery for Alembic, register those models in your Alembic
env.py/metadata module rather than re-exporting them here); ensure no remaining
references rely on the central re-export and run tests/migrations to verify
imports still resolve.
backend/app/models/feature_flag.py (1)

61-78: FeatureFlagCreate and FeatureFlagUpdate are identical — collapse them.

They define the exact same fields and semantics; either alias or have one inherit from the other so that future changes to one don't silently diverge.

♻️ Proposed simplification
 class FeatureFlagCreate(SQLModel):
     key: str = Field(min_length=1, max_length=128)
     organization_id: int
     project_id: int | None = None
     enabled: bool


-class FeatureFlagUpdate(SQLModel):
-    key: str = Field(min_length=1, max_length=128)
-    organization_id: int
-    project_id: int | None = None
-    enabled: bool
+class FeatureFlagUpdate(FeatureFlagCreate):
+    """Same payload shape as create; kept distinct for OpenAPI clarity."""
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/app/models/feature_flag.py` around lines 61 - 78, FeatureFlagCreate
and FeatureFlagUpdate are identical; collapse them by introducing a single
shared model (e.g., FeatureFlagBase or keep FeatureFlagCreate as canonical) and
have the other name inherit or alias it to avoid duplication—update references
to use the unified model (FeatureFlagCreate or FeatureFlagUpdate) and leave
FeatureFlagDelete unchanged; ensure the shared model preserves the existing
Field validations (key, organization_id, project_id, enabled) so future changes
apply in one place.
backend/app/assessment/events.py (1)

25-26: Use builtin TimeoutError (Ruff UP041).

In Python 3.11+, asyncio.TimeoutError is an alias of the builtin TimeoutError. Per the project's Python 3.11+ stance and the static analysis hint, prefer the builtin.

-                except asyncio.TimeoutError:
+                except TimeoutError:
                     yield ": keep-alive\n\n"
                     continue

As per coding guidelines: "Use Python 3.11+ with type hints throughout the codebase".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/app/assessment/events.py` around lines 25 - 26, Replace the except
that catches asyncio.TimeoutError with the builtin TimeoutError in the async
block where you await queue.get() (the code that sets payload = await
asyncio.wait_for(queue.get(), timeout=15)); locate the try/except around that
await in events.py and change the exception class to TimeoutError to follow
Python 3.11+ conventions and satisfy the Ruff UP041 lint.
backend/app/core/feature_flags/__init__.py (1)

22-23: Misleading "unused" discard.

organization_id and project_id are actually used below — only user_id is unused. The _ = (organization_id, project_id, user_id) line falsely signals all three are unused. If the intent is to silence linters for user_id, prefer renaming the parameter.

♻️ Proposed cleanup
 def is_enabled(
     session: Session,
     flag: FeatureFlag,
     organization_id: int,
     project_id: int | None = None,
-    user_id: int | None = None,
+    user_id: int | None = None,  # noqa: ARG001 — reserved for future user-scoped flags
 ) -> bool:
     """Check whether *flag* is enabled for the given scope."""
-    _ = (organization_id, project_id, user_id)
     resolved_flag = get_feature_flag_enabled(
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/app/core/feature_flags/__init__.py` around lines 22 - 23, The tuple
assignment "_ = (organization_id, project_id, user_id)" falsely marks all three
parameters as unused; instead keep real usage of organization_id and project_id
and only silence the unused user_id by renaming it (e.g. _user_id) or explicitly
assigning _ = user_id, and remove the tuple assignment; update the function
signature or the single-use discard so that organization_id and project_id are
no longer masked as unused while only user_id is ignored (refer to the
parameters organization_id, project_id, user_id and the surrounding check/flag
function in __init__.py).
backend/app/assessment/service.py (2)

57-62: Redundant UUID(str(...)) round-trip.

run.config_id is already typed as UUID | None on the model; converting to str and back to UUID is a no-op (and would crash on None, but you already validate non-None on lines 52–56).

-        configs.append(
-            AssessmentConfigRef(
-                config_id=UUID(str(run.config_id)),
-                config_version=run.config_version,
-            )
-        )
+        configs.append(
+            AssessmentConfigRef(
+                config_id=run.config_id,
+                config_version=run.config_version,
+            )
+        )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/app/assessment/service.py` around lines 57 - 62, The code
unnecessarily wraps run.config_id in UUID(str(...)) before constructing
AssessmentConfigRef; since run.config_id is already a UUID (and you validate
it's non-None earlier in the block around lines 52–56), replace
UUID(str(run.config_id)) with run.config_id when creating AssessmentConfigRef to
avoid the redundant conversion and potential crash on None.

144-195: Redundant recompute_assessment_status calls inside the per-run loop.

recompute_assessment_status is invoked after each successful submit (line 178), after each failed submit (line 191), and again once at the end (line 195). For an assessment with N configs, that's N + 1 full re-aggregations of the same rows; only the final one is observable to clients (the others are immediately overwritten).

Move the call out of the loop:

♻️ Proposed simplification
             run = update_assessment_run_status(
                 session=session,
                 run=run,
                 status="processing",
                 batch_job_id=batch_job.id,
                 total_items=batch_job.total_items,
             )
-            recompute_assessment_status(session=session, assessment_id=assessment.id)

         except Exception as e:
             ...
             run = update_assessment_run_status(
                 session=session,
                 run=run,
                 status="failed",
                 error_message="Batch submission failed. Please try again or contact support.",
             )
-            recompute_assessment_status(session=session, assessment_id=assessment.id)

         runs.append(run)

     recompute_assessment_status(session=session, assessment_id=assessment.id)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/app/assessment/service.py` around lines 144 - 195, The code currently
calls recompute_assessment_status() inside the per-config loop (after successful
submit and in the except block) causing redundant re-aggregations; remove the
two in-loop calls to recompute_assessment_status (the ones immediately after
update_assessment_run_status in both the try and except paths around
submit_assessment_batch and update_assessment_run_status) and keep only the
single recompute_assessment_status(session=session, assessment_id=assessment.id)
that runs once after the loop finishes so aggregation runs once for the final
state.
backend/app/api/routes/features.py (1)

33-170: Endpoints don't load Swagger descriptions from external markdown.

None of these new endpoints (/features, /feature-flags GET/POST/PATCH/DELETE) supply description=load_description("features/<action>.md"). The assessment dataset routes in this same PR (e.g., backend/app/assessment/routes.py:93) already follow this pattern.

As per coding guidelines: "Load Swagger endpoint descriptions from external markdown files instead of inline strings using load_description("domain/action.md")".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/app/api/routes/features.py` around lines 33 - 170, The router
endpoints get_features, get_feature_flags, create_feature_flag_route,
patch_feature_flag, and remove_feature_flag are missing Swagger descriptions
loaded from external Markdown; update each `@router`.<method> decorator to include
description=load_description("features/<action>.md") (use names like
"features/get.md", "features/list.md" or
"features/create.md"/"patch.md"/"delete.md" matching your docs convention) so
the OpenAPI docs pull content from external files rather than inline strings;
ensure you import or have access to load_description and use the exact function
names (get_features, get_feature_flags, create_feature_flag_route,
patch_feature_flag, remove_feature_flag) to locate the decorators to modify.
backend/app/assessment/processing.py (1)

253-253: Log prefix uses run id rather than function name.

log_prefix = f"[assessment_run={run.id}]" is useful context, but the repo convention is to lead with the function name. Combining both keeps the run-scoped context while satisfying the guideline.

-    log_prefix = f"[assessment_run={run.id}]"
+    log_prefix = f"[check_and_process_assessment][run={run.id}]"

As per coding guidelines: "Prefix all log messages with the function name in square brackets: logger.info(f"[function_name] Message ...")".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/app/assessment/processing.py` at line 253, Update the log_prefix
assignment so it begins with the current function name in square brackets
followed by the existing run-scoped context (i.e., include both the function
name and assessment run id); change the log_prefix variable (currently set as
log_prefix = f"[assessment_run={run.id}]") to the repo-standard pattern like
"[<function_name>][assessment_run=<run.id>]" (use the actual function name where
this line lives rather than the literal "<function_name>") so all subsequent
logger calls follow the required convention.
backend/app/assessment/routes.py (1)

348-436: Repetitive AssessmentPublic(...) / AssessmentRunPublic(...) construction.

list_assessment_managers, get_assessment_manager, list_evaluations, and get_evaluation each manually copy 14–15 fields one by one. Since both AssessmentPublic and AssessmentRunPublic are pure Pydantic projections of fields already on the SQLModel rows, you can replace these blocks with AssessmentPublic.model_validate(a, from_attributes=True) (Pydantic v2) — or set model_config = ConfigDict(from_attributes=True) on the public schemas and pass the row directly.

This shrinks four endpoints to one-liners and removes the risk of drift when fields are added.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/app/assessment/routes.py` around lines 348 - 436, The endpoints
list_assessment_managers, get_assessment_manager (and the analogous
list_evaluations/get_evaluation) repetitively map fields into
AssessmentPublic/AssessmentRunPublic; replace the manual field-by-field
construction by either enabling attribute parsing on the Pydantic models (set
model_config = ConfigDict(from_attributes=True) on AssessmentPublic and
AssessmentRunPublic) and pass the SQLModel row directly, or call
AssessmentPublic.model_validate(a, from_attributes=True) (and
AssessmentRunPublic.model_validate(...)) when building the APIResponse; update
imports if needed and replace each block in list_assessment_managers,
get_assessment_manager, list_evaluations, and get_evaluation with the single
model_validate (or direct model instantiation) call to return the Pydantic
projection.
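A minimal sketch of the from_attributes approach; class and attribute names are illustrative stand-ins for the real SQLModel rows and public schemas:

```python
from pydantic import BaseModel, ConfigDict


class RowStub:
    """Stand-in for a SQLModel row carrying the projected attributes."""

    def __init__(self) -> None:
        self.id = 7
        self.status = "completed"


class AssessmentPublicSketch(BaseModel):
    model_config = ConfigDict(from_attributes=True)

    id: int
    status: str


# One line replaces the 14-field copy; fields added to the schema later
# flow through automatically instead of drifting.
public = AssessmentPublicSketch.model_validate(RowStub())
```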
backend/app/assessment/crud.py (2)

196-227: Constrain status to known values.

status: str allows any string to be persisted to AssessmentRun.status, which couples directly into the aggregation in recompute_assessment_status (see comment above). Tightening to a Literal["pending", "processing", "completed", "failed", ...] (or the same StrEnum used elsewhere) gives compile-time safety and prevents drift between writers and the aggregator.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/app/assessment/crud.py` around lines 196 - 227, The
update_assessment_run_status function accepts status: str which lets arbitrary
values be persisted to AssessmentRun.status and can drift from the aggregator
logic in recompute_assessment_status; change the status parameter to the
project-wide enum type (or a typing.Literal of the allowed values) used
elsewhere (the same StrEnum/StatusEnum used for assessment runs) and update any
call sites to pass that enum/value so only known statuses are stored and
compile-time/type-checking will prevent drift; ensure type imports and
annotations are updated where update_assessment_run_status is defined and
referenced.

1-12: Place CRUD operations in backend/app/crud/ per project convention.

This new module is a pure data-access layer for the Assessment / AssessmentRun tables but lives at backend/app/assessment/crud.py instead of the central backend/app/crud/ package used elsewhere. Consider moving it (e.g. to backend/app/crud/assessment.py) so future contributors can locate DB code consistently and so it composes with the existing CRUD test/fixtures conventions.

Based on learnings: "Use CRUD pattern for database access operations located in backend/app/crud/."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/app/assessment/crud.py` around lines 1 - 12, The CRUD code for
Assessment and AssessmentRun lives in app.assessment.crud but should be placed
in the central CRUD package; move the module into the project's crud package
(e.g., create app.crud.assessment) and update any imports to reference
app.crud.assessment instead of app.assessment.crud, keeping the same symbols
(Assessment, AssessmentRun, logger, and any CRUD functions/classes) and ensure
the new module exports the same public API so existing callers continue to work.
backend/app/assessment/utils/export.py (2)

160-171: Hard-coded base_fields.index("result_status") will raise ValueError if the field is renamed.

If AssessmentExportRow is later refactored to drop or rename result_status, every CSV/XLSX export call crashes with an opaque ValueError. Consider guarding it (fall back to len(base_fields)) or anchor the column ordering on a stable constant declared in this module so the dependency is explicit.

♻️ Suggested guard
-    output_idx = base_fields.index("result_status") + 1  # after result_status
+    try:
+        output_idx = base_fields.index("result_status") + 1  # place output_keys after result_status
+    except ValueError:
+        output_idx = len(base_fields)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/app/assessment/utils/export.py` around lines 160 - 171, The code uses
base_fields.index("result_status") which will raise ValueError if that field is
renamed; update the logic in the block that computes output_idx/fieldnames
(referencing base_fields, output_idx, fieldnames, has_unparsed_output) to guard
the lookup: either define and use a stable constant (e.g., RESULT_STATUS_FIELD)
and check membership before calling index, or wrap the index call in a
try/except and fall back to len(base_fields) (or another sensible insertion
point) so exports don't crash; ensure the subsequent insertion of "output_raw"
uses the guarded output_idx when computing the insert position.

200-228: Pin the Excel engine explicitly to "openpyxl".

When pd.ExcelWriter(buf) is called with a BytesIO object, pandas infers the file format as .xlsx and selects an engine. Without an explicit engine parameter, pandas prefers xlsxwriter if installed, otherwise falls back to openpyxl. Since the codebase only declares openpyxl>=3.1.0 as a dependency and the error message explicitly references openpyxl support, the engine should be pinned to ensure deterministic behavior that matches the stated runtime requirement.

♻️ Suggested change
-    with pd.ExcelWriter(buf) as writer:
+    with pd.ExcelWriter(buf, engine="openpyxl") as writer:
         data_frame.to_excel(writer, index=False, sheet_name="results")
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/app/assessment/utils/export.py` around lines 200 - 228, The Excel
writer call should explicitly pin the engine to openpyxl to ensure deterministic
.xlsx output; update the pd.ExcelWriter usage (the block that creates buf,
data_frame = pd.DataFrame(expanded, columns=excel_fields) and the with
pd.ExcelWriter(buf) as writer: ... block) to pass engine="openpyxl" so pandas
uses openpyxl rather than xlsxwriter or another engine, keeping the rest of the
export flow (_drop_empty_columns, expanded, excel_fields,
data_frame.to_excel(...)) unchanged.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 091c4f08-e02a-48f6-b5b8-7905058d0c10

📥 Commits

Reviewing files that changed from the base of the PR and between 7d4da2e and ec2eeda.

⛔ Files ignored due to path filters (1)
  • backend/uv.lock is excluded by !**/*.lock
📒 Files selected for processing (34)
  • .gitignore
  • backend/app/alembic/versions/050_add_assessment_manager_table.py
  • backend/app/alembic/versions/055_create_feature_flag_table.py
  • backend/app/api/main.py
  • backend/app/api/permissions.py
  • backend/app/api/routes/cron.py
  • backend/app/api/routes/features.py
  • backend/app/api/routes/users.py
  • backend/app/assessment/__init__.py
  • backend/app/assessment/batch.py
  • backend/app/assessment/cron.py
  • backend/app/assessment/crud.py
  • backend/app/assessment/dataset.py
  • backend/app/assessment/events.py
  • backend/app/assessment/mappers.py
  • backend/app/assessment/models.py
  • backend/app/assessment/processing.py
  • backend/app/assessment/routes.py
  • backend/app/assessment/service.py
  • backend/app/assessment/utils/__init__.py
  • backend/app/assessment/utils/export.py
  • backend/app/assessment/utils/parsing.py
  • backend/app/assessment/validators.py
  • backend/app/core/feature_flags/__init__.py
  • backend/app/core/feature_flags/constants.py
  • backend/app/crud/feature_flag.py
  • backend/app/main.py
  • backend/app/models/__init__.py
  • backend/app/models/batch_job.py
  • backend/app/models/feature_flag.py
  • backend/app/models/llm/request.py
  • backend/app/models/user.py
  • backend/app/tests/api/test_permissions.py
  • backend/pyproject.toml

Comment on lines +75 to +81
sa.UniqueConstraint(
"key",
"organization_id",
"project_id",
name="uq_feature_flag_key_org_project",
),
)

⚠️ Potential issue | 🟠 Major

Composite UniqueConstraint won't deduplicate org-level flags (project_id NULL).

In PostgreSQL, NULL values are not considered equal in unique constraints, so this constraint allows multiple rows with the same (key, organization_id, NULL). That breaks the implicit "one row per (key, org) when no project scope" invariant — list_feature_flags would return both, and the resolved flag value would depend on insertion order (and update_feature_flag could update an arbitrary one of them).

Add a partial unique index for the org-scoped case:

         sa.UniqueConstraint(
             "key",
             "organization_id",
             "project_id",
             name="uq_feature_flag_key_org_project",
         ),
     )
+    # PostgreSQL treats NULLs as distinct in unique constraints, so the constraint
+    # above does not prevent duplicates when project_id IS NULL. Enforce uniqueness
+    # of org-level flags via a partial unique index.
+    op.create_index(
+        "uq_feature_flag_key_org_no_project",
+        "feature_flag",
+        ["key", "organization_id"],
+        unique=True,
+        postgresql_where=sa.text("project_id IS NULL"),
+    )
     op.create_index(
         op.f("ix_feature_flag_key"),

And mirror the drop in downgrade():

 def downgrade():
     op.drop_index(op.f("ix_feature_flag_organization_id"), table_name="feature_flag")
     op.drop_index(op.f("ix_feature_flag_key"), table_name="feature_flag")
+    op.drop_index("uq_feature_flag_key_org_no_project", table_name="feature_flag")
     op.drop_table("feature_flag")

You'll likely want a matching Index(..., postgresql_where=...) declaration on the FeatureFlag model's __table_args__ so future autogenerate runs stay in sync.

Side note on indexes: with the composite unique constraint already covering (key, organization_id, project_id) and a separate ix_feature_flag_key, the standalone ix_feature_flag_organization_id is the only thing needed for org-scoped lookups. The redundancy is harmless but worth noting.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/app/alembic/versions/055_create_feature_flag_table.py` around lines
75 - 81, The composite UniqueConstraint uq_feature_flag_key_org_project does not
prevent duplicate org-scoped flags because PostgreSQL treats NULLs as distinct;
add a partial unique index for the org-scoped case (unique on (key,
organization_id) WHERE project_id IS NULL) in the upgrade migration and drop it
in downgrade(), and also add a matching Index(...) with postgresql_where=... to
the FeatureFlag model's __table_args__ so future autogenerate keeps migrations
in sync; keep existing composite constraint and other indexes as-is.
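The matching model-side declaration could look like this (a sketch showing only the index, not the full SQLModel class; the names mirror the suggested migration):

```python
import sqlalchemy as sa

# Mirrors the migration's partial unique index so Alembic autogenerate
# does not try to drop it on the next revision.
feature_flag_org_unique = sa.Index(
    "uq_feature_flag_key_org_no_project",
    "key",
    "organization_id",
    unique=True,
    postgresql_where=sa.text("project_id IS NULL"),
)

# On the FeatureFlag model:
# __table_args__ = (feature_flag_org_unique, ...)
```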

Comment thread backend/app/assessment/batch.py
Comment thread backend/app/assessment/dataset.py
Comment thread backend/app/assessment/routes.py
from fastapi import HTTPException
from fastapi.responses import StreamingResponse

from app.assessment.batch import _load_dataset_rows

🛠️ Refactor suggestion | 🟠 Major

Avoid importing private (_-prefixed) symbols across modules.

_load_dataset_rows (line 13) and _get_batch_provider (line 290) are leading-underscore symbols imported from sibling modules. By convention these are module-private and may be renamed/moved without notice, silently breaking this exporter. Either promote them to public helpers (drop the underscore and document them) or move the export pipeline closer to where these helpers live.

Also applies to: 287-291

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/app/assessment/utils/export.py` at line 13, The exporter imports
module-private helpers _load_dataset_rows and _get_batch_provider from a sibling
module; change the exporter to use public helpers instead by renaming those
functions in their source to load_dataset_rows and get_batch_provider (and
updating their docs), then update the imports in this file to import
load_dataset_rows and get_batch_provider; alternatively, move the export
pipeline into the same module that currently defines
_load_dataset_rows/_get_batch_provider so it can call them privately—ensure all
references in this file are updated to the new public names (load_dataset_rows,
get_batch_provider) or removed if you relocate the pipeline.

Comment thread backend/app/main.py Outdated
Comment on lines +66 to +67
if settings.SENTRY_DSN and settings.ENVIRONMENT != "development":
sentry_sdk.init(dsn=str(settings.SENTRY_DSN), enable_tracing=True)

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

# First, let's examine the actual code in main.py to see the structure
cat -n backend/app/main.py | head -90

Repository: ProjectTech4DevAI/kaapi-backend



🏁 Script executed:

# Check the full file structure around the relevant lines
wc -l backend/app/main.py



🏁 Script executed:

# Look at the exact lines mentioned in the review
sed -n '30,80p' backend/app/main.py | cat -n



🌐 Web query:

Sentry SDK sentry_sdk.init multiple calls production behavior documentation

💡 Result:

In production, calling sentry_sdk.init multiple times on the Sentry Python SDK leads to undefined behavior. The SDK is designed to be initialized only once during the application lifecycle. Multiple initializations cause integrations to hook into libraries and lifecycle events repeatedly, leading to interactions between initializations that result in unpredictable issues, such as duplicate event reporting or events sent to incorrect DSNs.


🌐 Web query:

Sentry SDK enable_tracing parameter traces_sample_rate equivalence

💡 Result:

In the Sentry Python SDK, there is an enable_tracing parameter (also referred to as enable_tracing in the Client constructor) that, when set to True, is equivalent to setting traces_sample_rate=1.0. This enables tracing by capturing 100% of transactions. Setting either traces_sample_rate or traces_sampler (non-None) enables tracing; if neither is set, no transactions are sampled (equivalent to traces_sample_rate=0.0). The enable_tracing parameter provides a convenient boolean toggle for full sampling. This equivalence is confirmed in the Python SDK API documentation where both parameters are listed together under tracing options, and Sentry developer guidelines note enableTracing (camelCase in spec) sets the sample rate to 1.0 implicitly. In other SDKs like .NET, EnableTracing=true explicitly sets TracesSampleRate=1.0, with past conflicts resolved to avoid overriding explicit rates. To fully disable tracing, set traces_sample_rate=None (not 0.0, as 0.0 still respects parent sampling decisions for distributed traces).


Second sentry_sdk.init call causes undefined behavior in production.

In non-development environments, this re-initializes Sentry at module load time, replacing the Hub's client with one stripped of all configuration from lines 34–56 — losing the OTEL instrumenter, before_send_transaction filter, LoggingIntegration, and disabled_integrations list. Per Sentry's developer documentation, calling Sentry.init() multiple times in production leads to undefined behavior, with integrations hooking repeatedly and causing issues like duplicate event reporting.

Additionally, enable_tracing=True is a legacy alias for traces_sample_rate=1.0, which the first init already sets; the second call provides no new functionality while discarding essential configuration.

Merge the environment check into the first init and remove the redundant second call. Also add a return type annotation to the lifespan function (line 70).

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/app/main.py` around lines 66 - 67, Remove the second call to
sentry_sdk.init and instead include the environment check in the initial sentry
setup so the single sentry_sdk.init call (the one that configures
traces_sample_rate, before_send_transaction, LoggingIntegration,
disabled_integrations, and OTEL instrumenter) only runs when settings.SENTRY_DSN
is set and settings.ENVIRONMENT != "development"; delete the redundant
init(dsn=..., enable_tracing=True) call that re-initializes Sentry. Also add an
explicit return type annotation to the lifespan function (lifespan) to match its
async generator type signature.

Comment thread backend/app/models/feature_flag.py Outdated
Comment on lines +33 to +42
class FeatureFlag(FeatureFlagBase, table=True):
__tablename__ = "feature_flag"
__table_args__ = (
UniqueConstraint(
"key",
"organization_id",
"project_id",
name="uq_feature_flag_key_org_project",
),
)

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🌐 Web query:

PostgreSQL UNIQUE constraint NULLS NOT DISTINCT version support

💡 Result:

PostgreSQL supports the UNIQUE constraint with NULLS NOT DISTINCT starting from version 15. This feature allows unique constraints and indexes to treat NULL values as not distinct (equal), preventing multiple rows with NULL in the constrained columns. By default (NULLS DISTINCT), multiple NULLs are allowed. Syntax examples: CREATE TABLE products ( product_no integer UNIQUE NULLS NOT DISTINCT, name text ); or ALTER TABLE products ADD CONSTRAINT uq_product_no UNIQUE NULLS NOT DISTINCT (product_no); Official documentation confirms this in PostgreSQL 15 unique indexes section and current (18) constraints section. PostgreSQL 15 release notes explicitly mention: "Allow unique constraints and indexes to treat NULL values as not distinct". Prior to version 15, NULLs were always treated as distinct in unique constraints/indexes. Workarounds like partial indexes were used for similar behavior. The feature is available in all versions from 15 onwards, including the latest (18 as of 2026-02-26).


🏁 Script executed:

# Read the full feature_flag.py file to understand the context
cat -n backend/app/models/feature_flag.py



🏁 Script executed:

# Search for get_feature_flag_enabled to understand the usage pattern
rg -A 5 "get_feature_flag_enabled" --type py



🏁 Script executed:

# Check if there are any alembic migrations related to feature_flag
fd -e py . alembic/versions -o git ls-files "alembic/versions/*feature*flag*"



🏁 Script executed:

# Check the FeatureFlagBase to understand all columns
rg -B 5 "class FeatureFlagBase" backend/app/models/feature_flag.py



🏁 Script executed:

# Get the full get_feature_flag_enabled function implementation
rg -A 20 "def get_feature_flag_enabled" backend/app/crud/feature_flag.py



🏁 Script executed:

# Search for all feature flag CRUD operations to understand the full context
rg "def.*feature_flag" backend/app/crud/feature_flag.py



🏁 Script executed:

# Check for any existing alembic migrations related to feature_flag
fd . alembic/versions -name "*feature*" -o -name "*flag*"



🏁 Script executed:

# Check the database schema or migration that created feature_flag table
rg "feature_flag" alembic/versions --type py



🏁 Script executed:

# Check SQLModel version and documentation for UniqueConstraint support
rg "sqlmodel|SQLModel" pyproject.toml setup.py requirements.txt 2>/dev/null | head -20



🏁 Script executed:

# Verify the query behavior - check if there are any tests related to feature flags
fd feature_flag . --type py | rg -l "test|spec"



🏁 Script executed:

# Check the imports in feature_flag.py to confirm SQLModel/SQLAlchemy capabilities
head -20 backend/app/crud/feature_flag.py



🏁 Script executed:

# Check if there are any configuration files that specify PostgreSQL version
rg -l "postgres|version" . --type py | grep -E "config|settings|pyproject|requirements" | head -5



🏁 Script executed:

# Search for any existing use of postgresql_nulls_not_distinct or partial indexes
rg "postgresql_nulls_not_distinct|postgresql_where" .



🏁 Script executed:

# Check SQLAlchemy/SQLModel imports to confirm Index support
rg "from sqlalchemy|from sqlmodel" backend/app/models/ | head -10



🏁 Script executed:

# Check for feature_flag migration file
fd "feature_flag\|041\|042\|043\|044\|045" alembic/versions --extension py



🏁 Script executed:

# Search in alembic versions for feature_flag table creation
rg -l "feature_flag" alembic/versions



🏁 Script executed:

# Check if Index is imported anywhere in models for pattern precedent
rg "from sqlalchemy import.*Index" backend/app/models/



🏁 Script executed:

# Find the actual alembic directory structure
find . -type d -name "alembic" 2>/dev/null | head -5



🏁 Script executed:

# Check if feature_flag migration exists
find . -name "*feature*flag*" -type f 2>/dev/null | grep -E "alembic|migration"



🏁 Script executed:

# List alembic versions directory structure
ls -la backend/app/alembic/versions/ 2>/dev/null | tail -20



🏁 Script executed:

# Read the feature_flag migration file
cat -n backend/app/alembic/versions/055_create_feature_flag_table.py



Critical: UniqueConstraint won't enforce uniqueness for org-scoped flags (project_id IS NULL).

PostgreSQL's default NULLS DISTINCT semantics (the only behavior available before version 15, and still the default in 15 and later) mean two rows with the same (key, organization_id) and project_id = NULL will both be allowed. Since get_feature_flag_enabled() does .first(), duplicate org-level rows lead to nondeterministic enablement reads, and the API surface cannot reliably upsert org-scoped flags.

Pick one of these (and mirror in migration 055_create_feature_flag_table.py):

  • Postgres 15+: add postgresql_nulls_not_distinct=True to the UniqueConstraint.
  • Portable: replace with two partial unique indexes (one for project_id IS NULL, one for project_id IS NOT NULL).
🛡️ Option A — NULLS NOT DISTINCT (PG ≥ 15)
-class FeatureFlag(FeatureFlagBase, table=True):
-    __tablename__ = "feature_flag"
-    __table_args__ = (
-        UniqueConstraint(
-            "key",
-            "organization_id",
-            "project_id",
-            name="uq_feature_flag_key_org_project",
-        ),
-    )
+class FeatureFlag(FeatureFlagBase, table=True):
+    __tablename__ = "feature_flag"
+    __table_args__ = (
+        UniqueConstraint(
+            "key",
+            "organization_id",
+            "project_id",
+            name="uq_feature_flag_key_org_project",
+            postgresql_nulls_not_distinct=True,
+        ),
+    )
🛡️ Option B — partial unique indexes (portable)
-from sqlmodel import Field, SQLModel, UniqueConstraint
+from sqlalchemy import Index, text
+from sqlmodel import Field, SQLModel
@@
-    __table_args__ = (
-        UniqueConstraint(
-            "key",
-            "organization_id",
-            "project_id",
-            name="uq_feature_flag_key_org_project",
-        ),
-    )
+    __table_args__ = (
+        Index(
+            "uq_feature_flag_key_org_proj",
+            "key", "organization_id", "project_id",
+            unique=True,
+            postgresql_where=text("project_id IS NOT NULL"),
+        ),
+        Index(
+            "uq_feature_flag_key_org_no_proj",
+            "key", "organization_id",
+            unique=True,
+            postgresql_where=text("project_id IS NULL"),
+        ),
+    )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/app/models/feature_flag.py` around lines 33 - 42, The
UniqueConstraint on FeatureFlag (class FeatureFlag) won’t prevent duplicate
org-scoped flags because project_id IS NULL is treated distinct; update both the
model and migration 055_create_feature_flag_table.py to use one of the suggested
fixes: either add postgresql_nulls_not_distinct=True to the UniqueConstraint
(Postgres ≥15) or replace the UniqueConstraint with two partial unique
indexes—one UNIQUE on (key, organization_id) WHERE project_id IS NULL and
another UNIQUE on (key, organization_id, project_id) WHERE project_id IS NOT
NULL—so get_feature_flag_enabled() returns deterministic results; pick and apply
the same approach in the model (FeatureFlag) and the migration.
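Option B can be demonstrated with stdlib sqlite3, which shares PostgreSQL's default NULLS DISTINCT behavior and supports partial indexes. Table and index names below are illustrative; the real DDL belongs in the model and in migration 055.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE feature_flag ("
    " key TEXT NOT NULL,"
    " organization_id INTEGER NOT NULL,"
    " project_id INTEGER)"
)
# A plain UNIQUE(key, organization_id, project_id) would admit duplicate
# org-scoped rows, because NULLs compare as distinct (in SQLite and in
# default-configured Postgres alike). Two partial unique indexes close
# the gap:
conn.execute(
    "CREATE UNIQUE INDEX uq_flag_org_no_proj ON feature_flag"
    " (key, organization_id) WHERE project_id IS NULL"
)
conn.execute(
    "CREATE UNIQUE INDEX uq_flag_org_proj ON feature_flag"
    " (key, organization_id, project_id) WHERE project_id IS NOT NULL"
)

conn.execute("INSERT INTO feature_flag VALUES ('beta', 1, NULL)")
conn.execute("INSERT INTO feature_flag VALUES ('beta', 1, 7)")  # project scope: allowed
try:
    conn.execute("INSERT INTO feature_flag VALUES ('beta', 1, NULL)")
    duplicate_allowed = True
except sqlite3.IntegrityError:
    duplicate_allowed = False

print(duplicate_allowed)  # False: the org-scoped duplicate is rejected
```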

Comment thread backend/app/models/llm/request.py Outdated
- Added `service.py` to handle assessment evaluation orchestration, including starting and retrying assessments.
- Introduced utility functions for exporting assessment results in various formats (CSV, XLSX, JSON) in `export.py`.
- Created parsing utilities for handling batch results in `parsing.py`.
- Implemented validation for dataset file uploads in `validators.py`.
- Updated batch job model to include assessment type.
- Enhanced evaluation run model to reference parent assessments.
- Added new fields for reasoning effort and summary preferences in LLM request model.
@vprashrex vprashrex force-pushed the feature/assessment branch from b0f5463 to bce0944 Compare April 27, 2026 15:54
…n model

- Introduced AssessmentRun model to replace EvaluationRun for assessment-specific runs.
- Updated CRUD operations to manage Assessment and AssessmentRun entities.
- Refactored processing logic to accommodate new AssessmentRun structure.
- Adjusted logging and error handling to reflect changes in run management.
- Modified routes and services to utilize AssessmentRun for assessment evaluations.
- Enhanced export functionality to support new AssessmentRun model.
- Removed EvaluationRun references where applicable and ensured backward compatibility.

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 10

♻️ Duplicate comments (3)
backend/app/assessment/dataset.py (2)

78-101: ⚠️ Potential issue | 🟠 Major

Resource leak: wb.close() is not in a finally block.

If iter_rows/next or the count comprehension raises (corrupt xlsx, unexpected types), control jumps to except at line 99 and wb.close() is never reached, leaking the underlying zip file handle. Move cleanup into finally (or use contextlib.closing).

🛡️ Proposed fix
 def _count_excel_rows(content: bytes) -> int:
     """Count data rows in an Excel file (excluding header)."""
+    wb = None
     try:
         import openpyxl
 
         wb = openpyxl.load_workbook(io.BytesIO(content), read_only=True, data_only=True)
         ws = wb.active
         if ws is None:
             return 0
 
         rows_iter = ws.iter_rows(values_only=True)
         header = next(rows_iter, None)
         if header is None:
-            wb.close()
             return 0
 
-        count = sum(
+        return sum(
             1 for row in rows_iter if row and any(cell is not None for cell in row)
         )
-        wb.close()
-        return count
     except Exception as e:
         logger.warning(f"[_count_excel_rows] Failed to count rows | {e}")
         return 0
+    finally:
+        if wb is not None:
+            try:
+                wb.close()
+            except Exception:
+                pass
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/app/assessment/dataset.py` around lines 78 - 101, The
_count_excel_rows function can leak the workbook handle because wb.close() is
only called on normal return; ensure the workbook is always closed by moving
wb.close() into a finally block (or use contextlib.closing or openpyxl.Workbook
context manager) so that after creating wb = openpyxl.load_workbook(...) and
before returning from any path (including exceptions) the workbook is closed;
update the logic around ws, header, and the count comprehension to use this
guaranteed cleanup and remove duplicate close calls inside the try block.
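The same guaranteed-cleanup shape, shown with contextlib.closing on a stdlib io.StringIO standing in for the openpyxl workbook (the row-counting logic is illustrative, not the PR's parser):

```python
import contextlib
import io

def count_data_rows(lines: list[str]) -> int:
    """Count non-empty rows after the header; the buffer is always closed."""
    buf = io.StringIO("\n".join(lines))  # stand-in for openpyxl's workbook
    with contextlib.closing(buf):  # close() runs on success *and* on exception
        rows = iter(buf)
        header = next(rows, None)
        if header is None:
            return 0
        return sum(1 for row in rows if row.strip())

print(count_data_rows(["name,score", "a,1", "", "b,2"]))  # 2
```

Unlike manual `wb.close()` calls scattered across return paths, `contextlib.closing` (or a `finally` block) keeps the cleanup in exactly one place.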

140-163: ⚠️ Potential issue | 🟠 Major

Silent partial failure: dataset row persisted even when object-store upload fails.

_upload_file_to_object_store swallows exceptions and returns None, after which create_evaluation_dataset is invoked unconditionally — leaving an "EvaluationDataset" record whose object_store_url is None. Downstream batch processing (backend/app/assessment/batch.py::_load_dataset_rows raises ValueError(f"Dataset {dataset.id} has no object_store_url")) will then fail mysteriously, and the caller of upload_dataset has no signal of the failure.

Fail loudly on upload failure rather than persisting an orphan metadata row.

🛡️ Proposed behavior
     object_store_url = _upload_file_to_object_store(
         session=session,
         project_id=project_id,
         file_content=file_content,
         file_ext=file_ext,
         dataset_name=dataset_name,
     )
+    if object_store_url is None:
+        raise HTTPException(
+            status_code=502,
+            detail="Failed to upload dataset to object storage. Please retry.",
+        )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/app/assessment/dataset.py` around lines 140 - 163, The code currently
calls _upload_file_to_object_store and then unconditionally creates the DB row
via create_evaluation_dataset even when the upload fails or returns None; change
upload_dataset so it treats upload failures as fatal: call
_upload_file_to_object_store inside a try/except, propagate or raise a clear
exception when the upload raises or when the returned object_store_url is
falsy/None, and only call create_evaluation_dataset when object_store_url is a
valid non-empty string; reference _upload_file_to_object_store and
create_evaluation_dataset to locate the change and ensure no EvaluationDataset
is persisted if the upload did not succeed.
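A sketch of the fail-loudly contract under stated assumptions: `upload_to_store` and `save_dataset_row` below are hypothetical stand-ins for `_upload_file_to_object_store` and `create_evaluation_dataset`, and the exception type is illustrative rather than the route's HTTPException.

```python
class UploadFailedError(RuntimeError):
    """Raised when the object store upload does not yield a usable URL."""

def upload_dataset(content: bytes, upload_to_store, save_dataset_row):
    object_store_url = upload_to_store(content)
    if not object_store_url:
        # Surface the failure instead of persisting a metadata row whose
        # object_store_url is None.
        raise UploadFailedError("object store upload failed; dataset not saved")
    return save_dataset_row(object_store_url)

saved: list[str] = []
url = upload_dataset(
    b"rows", lambda c: "s3://bucket/ds.csv", lambda u: saved.append(u) or u
)
print(url)    # s3://bucket/ds.csv
try:
    upload_dataset(b"rows", lambda c: None, lambda u: saved.append(u) or u)
except UploadFailedError:
    pass
print(saved)  # ['s3://bucket/ds.csv']: nothing persisted for the failed upload
```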
backend/app/assessment/events.py (1)

13-48: ⚠️ Potential issue | 🟠 Major

Unbounded subscriber queues — slow/idle SSE clients can OOM the process.

asyncio.Queue() defaults to unlimited size, and put_nowait here never blocks/drops. A subscriber that connects but stops draining (e.g., its TCP send buffer is backed up) will accumulate every published assessment event indefinitely. Under cron-driven publish bursts across many subscribers this is a real memory-pressure / reliability issue.

Use a bounded queue with explicit overflow handling (drop oldest, or evict the slow subscriber).

🛡️ Proposed bounded queue with overflow drop
+_QUEUE_MAXSIZE = 256
+
 class AssessmentEventBroker:
     def __init__(self) -> None:
         self._subscribers: set[asyncio.Queue[dict]] = set()
 
     async def subscribe(self) -> AsyncIterator[str]:
-        queue: asyncio.Queue[dict] = asyncio.Queue()
+        queue: asyncio.Queue[dict] = asyncio.Queue(maxsize=_QUEUE_MAXSIZE)
         self._subscribers.add(queue)
@@
         for queue in list(self._subscribers):
-            queue.put_nowait(payload)
+            try:
+                queue.put_nowait(payload)
+            except asyncio.QueueFull:
+                logger.warning(
+                    "[publish] Subscriber queue full, dropping oldest | type=%s",
+                    payload.get("type"),
+                )
+                try:
+                    queue.get_nowait()
+                    queue.put_nowait(payload)
+                except (asyncio.QueueEmpty, asyncio.QueueFull):
+                    pass
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/app/assessment/events.py` around lines 13 - 48, The subscriber queues
are unbounded (asyncio.Queue()) so slow/idle SSE clients can cause OOM; fix by
creating bounded queues in subscribe (e.g., queue: asyncio.Queue[dict] =
asyncio.Queue(maxsize=XX)) and update publish to handle overflow: if
queue.put_nowait(payload) raises asyncio.QueueFull, either drop the oldest item
from that queue (queue.get_nowait() then retry put_nowait) or treat the
subscriber as dead by discarding the queue from self._subscribers and
logging/removing it; change symbols: modify subscribe (queue creation) and
publish (the for loop over self._subscribers and use of queue.put_nowait) to
implement the chosen overflow strategy and add appropriate debug/info logs.
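The drop-oldest overflow policy can be sketched with a plain bounded asyncio.Queue (function and event names are illustrative, not the PR's AssessmentEventBroker):

```python
import asyncio

def publish(queue: asyncio.Queue, payload: dict) -> None:
    """Enqueue an event, evicting the oldest one if the queue is full."""
    try:
        queue.put_nowait(payload)
    except asyncio.QueueFull:
        # Evict the oldest event so the newest one is never lost.
        try:
            queue.get_nowait()
        except asyncio.QueueEmpty:
            pass
        queue.put_nowait(payload)

async def main() -> list[int]:
    queue: asyncio.Queue[dict] = asyncio.Queue(maxsize=2)
    for i in range(4):  # two more events than the queue can hold
        publish(queue, {"seq": i})
    return [queue.get_nowait()["seq"] for _ in range(queue.qsize())]

print(asyncio.run(main()))  # [2, 3]: events 0 and 1 were dropped
```

The alternative (evicting the slow subscriber entirely) trades dropped events for a forced reconnect, which may be preferable if clients resume from a cursor.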
🧹 Nitpick comments (18)
backend/app/api/routes/cron.py (2)

79-95: Be defensive about the shape of result before mutating it.

If process_all_pending_evaluations ever returns something other than a dict (e.g., None after an internal error path), result["assessment"] = ... will raise a TypeError outside the inner try/except, and the original error context will be lost. A quick isinstance(result, dict) guard (or initializing result = result or {}) keeps the merge robust.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/app/api/routes/cron.py` around lines 79 - 95, The code mutates result
assuming it's a dict which can raise a TypeError if
process_all_pending_evaluations (or the variable result) is None or not a dict;
guard and normalize it before merging by checking isinstance(result, dict) (or
using result = result or {}) and then safely set result["assessment"] =
assessment_result and update totals from assessment_result; reference the
variables result and assessment_result and the surrounding merge block so the
normalization happens immediately before the lines that assign
result["assessment"] and update
total_processed/total_failed/total_still_processing.

71-77: Lazy import inside the request handler — surface the reason or hoist it.

The runtime from app.assessment.cron import poll_all_pending_assessment_evaluations inside the handler is typically a circular-import workaround. If that's the case, leave a brief comment explaining why; otherwise hoist the import to module scope so it's loaded once at startup rather than on every cron invocation.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/app/api/routes/cron.py` around lines 71 - 77, The inline runtime
import of poll_all_pending_assessment_evaluations inside the cron request
handler should be resolved: either hoist the import to module scope so the
function is imported once at startup (remove the in-handler from
app.assessment.cron import ... and add it at top-level), or if this was done to
avoid a circular import, add a concise comment above the inline import
explaining the circular-import reason and why the import must remain lazy.
Ensure you keep the existing call to
poll_all_pending_assessment_evaluations(session=session) unchanged and reference
that symbol in your change.
backend/app/assessment/cron.py (1)

73-76: Stray adjacent string concatenation produces an awkward log line.

"[poll_all_pending_assessment_evaluations] " "No active assessments found" (and the same pattern on line 75) is leftover implicit string concatenation, likely from an earlier line wrap; the two adjacent literals fuse into a single string at compile time, so the split adds nothing but noise. Merge them into one literal for a cleaner log call.

♻️ Proposed change
-        logger.info(
-            "[poll_all_pending_assessment_evaluations] " "No active assessments found"
-        )
+        logger.info(
+            "[poll_all_pending_assessment_evaluations] No active assessments found"
+        )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/app/assessment/cron.py` around lines 73 - 76, the log call in
poll_all_pending_assessment_evaluations splits one message across two adjacent
string literals ("[poll_all_pending_assessment_evaluations] " "No active
assessments found"); update the logger.info invocation in cron.py to use a
single combined literal (e.g., "[poll_all_pending_assessment_evaluations] No
active assessments found") so the prefix and message form one contiguous
string.
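A quick check of the implicit-concatenation behavior, using the literal from cron.py:

```python
# Adjacent string literals fuse at compile time into one string with a
# single space between the bracketed prefix and the message.
msg = "[poll_all_pending_assessment_evaluations] " "No active assessments found"
print(msg == "[poll_all_pending_assessment_evaluations] No active assessments found")  # True
```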
backend/app/assessment/events.py (1)

26-26: Use the builtin TimeoutError alias.

Per Ruff UP041, asyncio.TimeoutError is a deprecated alias of the builtin TimeoutError (since Python 3.11). Since this repo targets 3.11+, use the builtin directly.

♻️ Proposed change
-                except asyncio.TimeoutError:
+                except TimeoutError:
                     yield ": keep-alive\n\n"
                     continue
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/app/assessment/events.py` at line 26, Replace the deprecated alias
asyncio.TimeoutError with the builtin TimeoutError in the exception handler
inside backend.app.assessment.events (the except block currently using
asyncio.TimeoutError); update the except clause in the function/method where
that try/except appears (look for the "except asyncio.TimeoutError:" line) to
use "except TimeoutError:" so it relies on the built-in exception instead of the
asyncio alias.
backend/app/alembic/versions/055_add_assessment_manager_table.py (1)

121-132: Consider DB-level server_default for inserted_at/updated_at.

inserted_at and updated_at are NOT NULL but have no server_default. The model relies on SQLModel's default_factory=now, so any direct SQL INSERT (e.g., from psql, a fixture script, or an Alembic data migration) without these columns set will fail. Adding server_default=sa.func.now() keeps both ORM and raw inserts safe.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/app/alembic/versions/055_add_assessment_manager_table.py` around
lines 121 - 132, The migration defines NOT NULL columns inserted_at and
updated_at without DB defaults, which breaks raw INSERTs; update the Column
definitions for "inserted_at" and "updated_at" to include a database server
default (server_default=sa.func.now()) and add server_onupdate=sa.func.now() for
"updated_at" so the DB will populate timestamps for direct SQL inserts and
automatic updates from the database side.
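The DB-level default can be illustrated with stdlib sqlite3 (in the Alembic migration, Postgres would instead take server_default=sa.func.now() on the column definitions; the table name below is illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE assessment ("
    " id INTEGER PRIMARY KEY,"
    " inserted_at TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP,"
    " updated_at TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP)"
)
# A raw INSERT that names neither timestamp column still succeeds, which
# is the guarantee a server-side default adds for non-ORM writers.
conn.execute("INSERT INTO assessment (id) VALUES (1)")
inserted_at = conn.execute("SELECT inserted_at FROM assessment").fetchone()[0]
print(inserted_at is not None)  # True: the database filled the timestamp
```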
backend/app/assessment/service.py (4)

10-10: Importing a private symbol (_resolve_config) across modules.

The leading underscore signals _resolve_config is batch.py's implementation detail; importing it from service.py couples the two modules to that internal contract. Either rename it (e.g., resolve_config) and treat it as a real public helper, or move it to a shared assessment._utils / assessment.config module so both call sites depend on a stable name.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/app/assessment/service.py` at line 10, service.py is importing the
private function _resolve_config from batch.py which couples modules to an
internal symbol; change this by promoting the helper to a public API (rename
_resolve_config to resolve_config) or move it into a shared module (e.g.,
assessment.config or assessment_utils) and update both callers to import the new
public name, then remove the underscore import from batch.py and service.py and
ensure submit_assessment_batch and any other callers reference the new
resolve_config location.

145-193: start_assessment does multi-row creates/updates without an explicit transactional boundary — partial state on failure.

Each create_assessment, create_assessment_run, and update_assessment_run_status commits on its own (per the CRUD helpers). If something between iterations raises an unhandled error (e.g., a network blip during submit_assessment_batch that escapes the try, or a DB error during recompute_assessment_status), the parent Assessment plus a partial set of AssessmentRun rows will be persisted with the user receiving a 500. The cron will then try to process orphaned pending/processing rows.

Consider either (a) wrapping the orchestration in an explicit transaction with session.begin() plus a single final commit, or (b) ensuring the failure path here marks the whole parent assessment as failed with a clear error_message so cron + the UI can reason about it. The current per-iteration try/except only handles submit_assessment_batch failures, not the surrounding CRUD ops.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/app/assessment/service.py` around lines 145 - 193, The
start_assessment flow performs multiple DB creates/updates (create_assessment,
create_assessment_run, update_assessment_run_status,
recompute_assessment_status) without an explicit transaction, allowing partial
commits if an error occurs (e.g., during submit_assessment_batch or
recompute_assessment_status); wrap the multi-step orchestration in a single
transactional boundary (use session.begin() / session.begin_nested() as
appropriate) so all create/update calls are atomic and commit once at the end,
or alternatively, on any uncaught exception ensure you mark the parent
Assessment as failed (update_assessment_run_status + update parent assessment
status and error_message via the same session) before re-raising/returning to
avoid leaving orphaned pending/processing runs; locate these changes in
start_assessment and the exception handling around
submit_assessment_batch/recompute_assessment_status to implement the chosen
approach.

159-195: Per-iteration recompute_assessment_status is redundant and triples the DB writes on the parent row.

Each loop iteration calls recompute_assessment_status on the success path (line 178) or the failure path (line 191), then there's a final one at line 195. For an N-config assessment, that's N+1 recomputes (each does a SELECT over runs + an UPDATE/COMMIT on the parent). The final one alone is sufficient since the parent's status only needs to be correct after the loop completes — none of the per-iteration values are read between iterations.

Drop the inner calls and rely on line 195.

♻️ Proposed change
             run = update_assessment_run_status(
                 session=session,
                 run=run,
                 status="processing",
                 batch_job_id=batch_job.id,
                 total_items=batch_job.total_items,
             )
-            recompute_assessment_status(session=session, assessment_id=assessment.id)
 
         except Exception as e:
             ...
             run = update_assessment_run_status(
                 session=session,
                 run=run,
                 status="failed",
                 error_message="Batch submission failed. Please try again or contact support.",
             )
-            recompute_assessment_status(session=session, assessment_id=assessment.id)
 
         runs.append(run)
 
     recompute_assessment_status(session=session, assessment_id=assessment.id)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/app/assessment/service.py` around lines 159 - 195, The loop currently
calls recompute_assessment_status inside both the try (after
update_assessment_run_status) and the except blocks, causing redundant DB
writes; remove the per-iteration calls to recompute_assessment_status within the
try/except and keep only the final recompute_assessment_status after the loop
completes so the parent's status is updated once; locate the calls around
submit_assessment_batch, update_assessment_run_status, and the except block
handling the exception for run processing and delete those inner
recompute_assessment_status invocations while leaving the final
recompute_assessment_status(session=session, assessment_id=assessment.id) at the
end.

51-62: UUID(str(run.config_id)) is a redundant round-trip.

AssessmentRun.config_id is already typed as UUID | None per the model, and the if not run.config_id guard above ensures it's a UUID here. UUID(str(uuid_obj)) is a no-op conversion.

♻️ Proposed change
         configs.append(
             AssessmentConfigRef(
-                config_id=UUID(str(run.config_id)),
+                config_id=run.config_id,
                 config_version=run.config_version,
             )
         )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/app/assessment/service.py` around lines 51 - 62, The code is
redundantly wrapping an already-UUID config_id with UUID(str(...)) in the loop
building AssessmentConfigRef; change the AssessmentConfigRef call to use the
existing UUID directly (e.g. config_id=run.config_id) and if the type checker
complains, assert or cast the non-None type before use (e.g. assert
run.config_id is not None then use run.config_id) so you remove the unnecessary
UUID(str(run.config_id)) round-trip; update the instantiation in the loop that
creates AssessmentConfigRef accordingly.
backend/app/assessment/mappers.py (3)

10-22: Type hint inaccuracy: normalize_llm_text can return None.

The signature is (text: str) -> str, but lines 11–12 return text unchanged when text is falsy or non-str — so a None input yields None. Callers at lines 100 and 185 pass kaapi_params.get("instructions") (possibly None) and rely on this. Tighten the annotation.

♻️ Proposed change
-def normalize_llm_text(text: str) -> str:
-    if not isinstance(text, str) or not text:
+def normalize_llm_text(text: str | None) -> str | None:
+    if not isinstance(text, str) or not text:
         return text
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/app/assessment/mappers.py` around lines 10 - 22, The type hint for
normalize_llm_text is inaccurate because it can accept and return None; update
its annotation to accept Optional[str] and return Optional[str] (from typing
import Optional), i.e. change def normalize_llm_text(text: str) -> str to def
normalize_llm_text(text: Optional[str]) -> Optional[str], and keep the existing
early-return behavior so callers that pass kaapi_params.get("instructions")
(which may be None) are correctly typed.

25-59: Strict-schema/strip-schema helpers don't recurse into anyOf/oneOf/allOf.

_ensure_openai_strict_schema and _strip_additional_properties only walk properties and items. JSON Schemas that compose subschemas via anyOf/oneOf/allOf (or nested under $defs) won't have additionalProperties: false injected (OpenAI strict mode will reject them) and won't have additionalProperties stripped on the Google side. Worth handling these for parity with the OpenAI strict-schema spec.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/app/assessment/mappers.py` around lines 25 - 59, Both
_ensure_openai_strict_schema and _strip_additional_properties only recurse into
"properties" and "items", so they miss subschemas under composition keywords and
definitions; update both functions to also traverse "anyOf", "oneOf", "allOf"
(iterating and recursively processing each subschema when they're lists) and
"$defs" / "definitions" (recursively processing dict values), ensuring you treat
nested dicts/lists consistently (use _ensure_openai_strict_schema in the former
and _strip_additional_properties in the latter) so additionalProperties is
injected or removed throughout composed/nested schemas.
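A runnable sketch of the recursive walk the prompt describes. `ensure_strict` is an illustrative name, not the PR's actual `_ensure_openai_strict_schema`; the mirrored strip-side helper would traverse the same keys:

```python
def ensure_strict(schema: dict) -> dict:
    """Recursively inject additionalProperties: false, including under
    composition keywords (anyOf/oneOf/allOf) and $defs/definitions."""
    if not isinstance(schema, dict):
        return schema
    if schema.get("type") == "object" or "properties" in schema:
        schema.setdefault("additionalProperties", False)
    for key in ("properties", "$defs", "definitions"):
        for sub in (schema.get(key) or {}).values():
            ensure_strict(sub)
    if isinstance(schema.get("items"), dict):
        ensure_strict(schema["items"])
    for key in ("anyOf", "oneOf", "allOf"):
        for sub in schema.get(key) or []:
            ensure_strict(sub)
    return schema

schema = {
    "anyOf": [
        {"type": "object", "properties": {"a": {"type": "string"}}},
        {"type": "null"},
    ]
}
ensure_strict(schema)
print(schema["anyOf"][0]["additionalProperties"])  # False
```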

5-5: Avoid depending on google.genai._transformers (private module).

The leading underscore signals it's not part of the public API; minor SDK upgrades can rename, restructure, or remove it without notice, breaking assessment schema conversion silently. The Google Genai SDK now natively supports JSON Schema via the documented response_json_schema parameter (announced Nov 2025)—pass raw JSON Schema directly without converting through t_schema. Alternatively, hand-roll the conversion in _convert_json_schema_to_google by implementing the transformation logic (uppercase types, resolve $ref from $defs, convert anyOf[null, X] to nullable) and pin the SDK version explicitly.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/app/assessment/mappers.py` at line 5, The import of
google.genai._transformers in backend.app.assessment.mappers should be removed
and the assessment->GenAI schema flow updated to not rely on the private module;
instead pass the raw JSON Schema into the SDK's documented response_json_schema
parameter when building requests, or implement a robust internal converter
_convert_json_schema_to_google that performs the necessary transformations
(uppercase/openAI-style type names, resolve $ref entries from $defs, convert
anyOf [ "null", X ] into a nullable representation) and pin the SDK version if
you must keep an internal transformer; update any references in this file (e.g.,
where genai_transformers was used) to call response_json_schema or the new
_convert_json_schema_to_google helper.
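If the hand-rolled route is taken, the core transformations (uppercase type names, `anyOf [X, null]` folded into `nullable`) are small. This is a sketch only; it omits `$ref`/`$defs` resolution, and the exact field set Google's schema accepts should be checked against the pinned SDK version:

```python
def to_google_schema(schema: dict) -> dict:
    """Illustrative converter replacing the private t_schema dependency."""
    if not isinstance(schema, dict):
        return schema
    out: dict = {}
    any_of = schema.get("anyOf")
    if any_of:
        non_null = [s for s in any_of if s.get("type") != "null"]
        if len(any_of) == 2 and len(non_null) == 1:
            out = to_google_schema(non_null[0])   # anyOf [X, null] -> nullable X
            out["nullable"] = True
            return out
        out["anyOf"] = [to_google_schema(s) for s in any_of]
        return out
    if "type" in schema:
        out["type"] = schema["type"].upper()       # "string" -> "STRING"
    if "properties" in schema:
        out["properties"] = {
            k: to_google_schema(v) for k, v in schema["properties"].items()
        }
    if "items" in schema:
        out["items"] = to_google_schema(schema["items"])
    for k in ("description", "enum", "required"):
        if k in schema:
            out[k] = schema[k]
    return out

print(to_google_schema({"anyOf": [{"type": "string"}, {"type": "null"}]}))
# {'type': 'STRING', 'nullable': True}
```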
backend/app/assessment/batch.py (2)

289-304: Shallow copy of openai_params shares nested text/reasoning/tools dicts across all rows.

body = dict(openai_params) is a shallow copy: every JSONL line built in this loop ends up referencing the same nested text.format.json_schema, reasoning, and tools objects. Today nothing mutates them after this point, but any future code that does (e.g. per-row schema override) would silently affect every other row.

Use copy.deepcopy(openai_params) here (and on the Gemini side at the equivalent assignment) if these payloads are ever expected to diverge per row.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/app/assessment/batch.py` around lines 289 - 304, The current shallow
copy "body = dict(openai_params)" causes nested objects (like
text.format.json_schema, reasoning, tools) to be shared across all rows; replace
that with a deep copy (use copy.deepcopy(openai_params)) when constructing body
inside the loop so each JSONL row gets an independent payload; apply the same
deep-copy pattern at the equivalent Gemini-side assignment as well to avoid
cross-row mutation later (look for the "body" variable, "openai_params", and the
jsonl_data.append block/BATCH_KEY usage to locate the spots to change).
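The aliasing hazard is easy to demonstrate in isolation; the nested key names below mirror the payload shape described above but are illustrative:

```python
import copy

openai_params = {"text": {"format": {"json_schema": {"name": "base"}}}}

shallow = dict(openai_params)        # top-level copy; nested dicts still shared
deep = copy.deepcopy(openai_params)  # fully independent payload

shallow["text"]["format"]["json_schema"]["name"] = "row_override"
print(openai_params["text"]["format"]["json_schema"]["name"])  # row_override
print(deep["text"]["format"]["json_schema"]["name"])           # base
```

Mutating through the shallow copy leaks into the original (and thus every other row built from it), while the deep copy is unaffected.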

138-161: Placeholder-in-value injection in _build_text_prompt.

prompt.replace("{col}", row[col]) replaces all occurrences in sequence. If a column value itself contains another placeholder (e.g., the value includes the literal text {policy}), and policy is also a substituted column processed later, the value gets re-expanded: the sequential replaces behave like chained substitution. For Indic-text or user-uploaded data this is unlikely, but it is worth either documenting the contract or using str.format_map with a defaultdict to do a single-pass substitution.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/app/assessment/batch.py` around lines 138 - 161, The
_build_text_prompt function currently does sequential replace which allows
placeholder-in-value re-expansion (e.g., a column value containing "{other}"
gets re-substituted when that column is processed later); update
_build_text_prompt to perform a single-pass, safe substitution by building a
mapping of column->normalized value and using a single-call formatting approach
(e.g., str.format_map with a default/missing-safe dict) instead of iterative
prompt.replace, ensuring you still call normalize_llm_text for each value before
substitution; keep the original behavior for the no-template branch unchanged.
backend/app/assessment/models.py (1)

32-36: Primary-key fields are typed as non-optional but default to None.

id: int = SQLField(default=None, primary_key=True) is internally inconsistent with strict type checking — the value is None until the DB assigns it. SQLModel's idiomatic form is id: int | None = SQLField(default=None, primary_key=True).

Low-priority cleanup; behavior is unchanged.

Also applies to: 144-148

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/app/assessment/models.py` around lines 32 - 36, The primary-key
fields (e.g., the id field declared as id: int = SQLField(default=None,
primary_key=True)) are typed non-optional while defaulting to None; update the
type annotations to allow None (use int | None or Optional[int]) so the
signature matches the SQLField(default=None) behavior, e.g., change id: int to
id: int | None = SQLField(...); apply the same fix to the other primary-key
field(s) with the same pattern elsewhere in the file (the block around the
second id declaration).
backend/app/assessment/crud.py (1)

255-315: Recompute is read‑modify‑write without locking — concurrent updates can leave stale counters.

recompute_assessment_status reads child runs, computes counters, and writes them back to the parent. If two workers run this concurrently for the same assessment_id (e.g. cron + an SSE-triggered recompute), the later commit overwrites the earlier one based on a possibly stale snapshot of child rows.

Today's call sites appear to be single-threaded (cron + sequential processing), so this is informational. If parallel polling is ever introduced, consider SELECT ... FOR UPDATE on the parent row, or recompute counters in SQL with a single UPDATE ... FROM (SELECT ...) aggregation.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/app/assessment/crud.py` around lines 255 - 315,
recompute_assessment_status currently does a read-modify-write on the Assessment
row which can be lost under concurrent recomputes; to fix, acquire a row lock on
the parent Assessment (e.g., SELECT ... FOR UPDATE) before recomputing in
recompute_assessment_status or replace the logic with a single atomic UPDATE ...
FROM (SELECT ...) aggregation that computes totals from AssessmentRun and
updates Assessment in one statement; reference the recompute_assessment_status
function and the Assessment/AssessmentRun models and ensure you commit while
holding the lock or use a transaction so concurrent workers cannot overwrite
each other's results.
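The single-statement aggregation variant can be sketched against sqlite (table/column names are illustrative; Postgres would also allow `SELECT ... FOR UPDATE` or `UPDATE ... FROM`):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE assessments (id INTEGER PRIMARY KEY,
                              completed INTEGER DEFAULT 0,
                              failed INTEGER DEFAULT 0);
    CREATE TABLE runs (id INTEGER PRIMARY KEY, assessment_id INTEGER, status TEXT);
    INSERT INTO assessments (id) VALUES (1);
    INSERT INTO runs (assessment_id, status)
        VALUES (1, 'completed'), (1, 'completed'), (1, 'failed');
""")

# Counters are derived inside one UPDATE, so there is no stale
# read-modify-write window between the SELECT and the write.
conn.execute("""
    UPDATE assessments SET
        completed = (SELECT COUNT(*) FROM runs
                     WHERE assessment_id = assessments.id AND status = 'completed'),
        failed    = (SELECT COUNT(*) FROM runs
                     WHERE assessment_id = assessments.id AND status = 'failed')
    WHERE id = ?
""", (1,))
conn.commit()
print(conn.execute(
    "SELECT completed, failed FROM assessments WHERE id = 1").fetchone())  # (2, 1)
```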
backend/app/assessment/utils/export.py (1)

395-417: assessment_id=run.assessment_id or 0 masks orphan rows.

Coercing a missing assessment_id to 0 produces an export row that looks like it belongs to assessment 0. If the schema invariant is that exported runs always have a parent, fail loud instead; otherwise relax AssessmentExportRow.assessment_id to int | None in models.py and pass the real value through.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/app/assessment/utils/export.py` around lines 395 - 417, The export is
masking missing parent relationships by coercing assessment_id to 0; update the
export to pass the real value (use assessment_id=run.assessment_id) and either
(A) if a missing parent should be an error, add an explicit check in the export
routine (raise ValueError or similar when run.assessment_id is None) before
building AssessmentExportRow, or (B) if exports may legitimately have no parent,
update AssessmentExportRow.assessment_id type in models.py to allow None (int |
None) and keep passing run.assessment_id through; locate the construction of
AssessmentExportRow in export_rows.append and implement one of these two fixes
around that symbol.
backend/app/assessment/processing.py (1)

482-486: LGTM on the wrapper, but the local import is doing real work — call it out.

Importing poll_all_pending_assessment_evaluations inside the function avoids a circular import with app.assessment.cron. A short inline comment will save the next maintainer a debugging session if they "clean up" the import to module scope.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/app/assessment/processing.py` around lines 482 - 486, The local
import of poll_all_pending_assessment_evaluations inside
poll_all_pending_assessments is intentional to avoid a circular import; add a
concise inline comment above the from app.assessment.cron import
poll_all_pending_assessment_evaluations line explaining that the import must
remain local (not module-level) to prevent circular imports and to warn future
maintainers not to move it to top-level.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@backend/app/alembic/versions/055_add_assessment_manager_table.py`:
- Around line 15-16: The migration's revision identifier is not zero-padded and
mismatches the filename and down_revision; update the module-level variable
revision from "55" to "055" so it matches the filename and existing
down_revision = "054", keeping the three-digit convention used across migrations
and ensuring subsequent revisions align.

In `@backend/app/assessment/batch.py`:
- Around line 107-135: The workbook created in _parse_excel_rows (wb returned by
openpyxl.load_workbook) may never be closed if iter_rows, next(header) or the
row-to-dict conversion raises; wrap the workbook usage in a try/finally (or
context manager if available) so wb.close() is always called in the finally
block after opening wb and before returning, and move any early returns (e.g.,
when ws is None or header is None) into that try so they still hit the finally
cleanup; ensure variables referenced (wb, ws, rows_iter, header, result) are
initialized inside the try scope.
- Around line 481-569: The code is defaulting unknown provider_name to "openai"
via _NATIVE_PROVIDERS.get(..., "openai"), causing unknown providers to silently
use the OpenAI path; change this to fail fast by using
_NATIVE_PROVIDERS.get(provider_name) (no default) or explicitly checking if
provider_name not in _NATIVE_PROVIDERS and raising ValueError, then use
base_provider for the if/elif branches (referencing the variables base_provider
and provider_name and the mapping _NATIVE_PROVIDERS); ensure the unreachable
else that raises ValueError remains reachable so start_batch_job is only called
for known providers.
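The guaranteed-cleanup fix asked for in `_parse_excel_rows` (and `_count_excel_rows`) boils down to running `close()` on every exit path. A runnable sketch, with `FakeWorkbook` standing in for openpyxl's `Workbook` so it needs no dependency:

```python
from contextlib import closing

class FakeWorkbook:
    """Stand-in for openpyxl's Workbook; only .close() matters here."""
    def __init__(self):
        self.closed = False
    def close(self):
        self.closed = True

def parse_rows(wb):
    with closing(wb):                 # wb.close() runs even if parsing raises
        raise ValueError("bad header")

wb = FakeWorkbook()
try:
    parse_rows(wb)
except ValueError:
    pass
print(wb.closed)  # True
```

A plain `try/finally: wb.close()` achieves the same guarantee if `contextlib.closing` feels too implicit.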

In `@backend/app/assessment/cron.py`:
- Around line 144-182: The cleanup steps inside the broad except should be
isolated so a secondary DB error can't abort the whole cron loop: wrap the calls
to update_assessment_run_status, recompute_assessment_status, the
construction/appending of failure_result, assessment_event_broker.publish (and
the failed += 1) in their own try/except block; on exception catch and
logger.error with context (include run.id/run.assessment_id/run.run_name) and
exc_info=True, then continue the for-loop without re-raising so one failed
cleanup won't stop processing remaining runs.
- Around line 41-61: The payload returned by _build_callback_payload nests
"type" inside APIResponse.success_response(...).data so events.py's subscribe()
(which reads payload.get("type", "message")) always falls back to "message"; fix
by promoting the event type to the root of the published payload: in
_build_callback_payload (or immediately after calling
APIResponse.success_response(...).model_dump()) ensure the returned dict
contains a top-level "type": "assessment.child_status_changed" key (while
keeping the detailed run/assessment info under "data"), so subscribe() can
detect the event without changing events.py.

In `@backend/app/assessment/crud.py`:
- Around line 192-223: The update_assessment_run_status function currently
treats None as "do not change", preventing callers from clearing fields; fix
this by introducing a unique sentinel (e.g., UNSET = object()) as the default
for error_message, batch_job_id, total_items, and object_store_url, then change
the checks to "is not UNSET" and assign the value (which may be None) to
run.error_message, run.batch_job_id, run.total_items, and run.object_store_url
inside update_assessment_run_status so callers can explicitly clear fields by
passing None; update any call sites accordingly to use the sentinel omission vs
explicit None semantics.
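The sentinel pattern in miniature, with a dict standing in for the `AssessmentRun` row (names are illustrative):

```python
_UNSET = object()   # distinguishes "argument omitted" from "explicitly None"

def update_run(run: dict, *, error_message=_UNSET, batch_job_id=_UNSET):
    if error_message is not _UNSET:
        run["error_message"] = error_message   # may be None: clears the field
    if batch_job_id is not _UNSET:
        run["batch_job_id"] = batch_job_id
    return run

run = {"error_message": "boom", "batch_job_id": "job-1"}
update_run(run, error_message=None)            # explicitly clear one field
print(run)  # {'error_message': None, 'batch_job_id': 'job-1'}
```

Omitted arguments leave fields untouched, while an explicit `None` now clears them.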

In `@backend/app/assessment/processing.py`:
- Line 253: Update the log_prefix to include the current function name so all
subsequent logger calls follow the project's convention; replace the line
setting log_prefix = f"[assessment_run={run.id}]" with something like log_prefix
= f"[{inspect.currentframe().f_code.co_name}][assessment_run={run.id}]" (and add
"import inspect" if missing) so functions like the one containing run.id will
prefix logs with the function name and the assessment_run id used in
logger.info/logger.error calls.
- Around line 198-231: In the Google/Gemini branch (the block handling
provider_name in ("google","google-native")), populate the "usage" field from
the response instead of hard-coding None (e.g., extract response.get("usage")
or, if Gemini batch responses use a different shape, parse the batch structure
to pull usage/metadata) and treat an empty string from
extract_text_from_response_dict(response) as an error by setting error="Empty
response output" (mirror the Anthropic branch behavior); update the logic around
response, text = extract_text_from_response_dict(response) and the
results.append entries to include the extracted usage and to set error when text
is empty.

In `@backend/app/assessment/utils/export.py`:
- Around line 60-66: The expansion loop in export.py silently overwrites input
columns that share names with export fields (e.g. output, error, row_id, run_id,
result_status, response_id) by setting new_row[k] from input_data then calling
new_row.update(row); modify the loop that builds new_row (the variables
row_payload, input_data, input_keys, new_row are present) to detect collisions
against a reserved_fields set and instead namespace conflicting input keys (e.g.
store as "input_<k>") or, alternatively, emit a single warning when any
collision is detected; ensure namespacing logic also avoids clobbering an
existing "input_<k>" in row and preserves original row fields when calling
new_row.update(row).
- Around line 422-433: The sort_export_rows function currently sorts by
row.row_id lexicographically which makes "row_10" come before "row_2"; update
sort_export_rows (and any callers expecting stable order like
load_export_rows_for_run / AssessmentExportRow) to extract the numeric index
from row.row_id (e.g. parse the suffix after "row_") and use that integer in the
sort key instead of the raw string, keeping the existing tie-breakers
(row.config_version or 0, row.run_id); ensure the parser handles missing/invalid
suffixes by falling back to 0 so sorting remains stable.
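The numeric-suffix sort key suggested for `sort_export_rows` is a few lines; the `row_` prefix convention is taken from the finding, and the fallback-to-0 behavior matches the prompt:

```python
def natural_row_key(row_id: str) -> int:
    """Parse the numeric suffix of ids like 'row_10'; fall back to 0."""
    try:
        return int(row_id.rsplit("_", 1)[-1])
    except ValueError:
        return 0

rows = ["row_10", "row_2", "row_1", "oops"]
print(sorted(rows, key=natural_row_key))
# ['oops', 'row_1', 'row_2', 'row_10']
```

In the real sort key this integer would replace the raw `row.row_id` string, with the existing `(config_version, run_id)` tie-breakers kept as-is.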

---

Duplicate comments:
In `@backend/app/assessment/dataset.py`:
- Around line 78-101: The _count_excel_rows function can leak the workbook
handle because wb.close() is only called on normal return; ensure the workbook
is always closed by moving wb.close() into a finally block (or use
contextlib.closing or openpyxl.Workbook context manager) so that after creating
wb = openpyxl.load_workbook(...) and before returning from any path (including
exceptions) the workbook is closed; update the logic around ws, header, and the
count comprehension to use this guaranteed cleanup and remove duplicate close
calls inside the try block.
- Around line 140-163: The code currently calls _upload_file_to_object_store and
then unconditionally creates the DB row via create_evaluation_dataset even when
the upload fails or returns None; change upload_dataset so it treats upload
failures as fatal: call _upload_file_to_object_store inside a try/except,
propagate or raise a clear exception when the upload raises or when the returned
object_store_url is falsy/None, and only call create_evaluation_dataset when
object_store_url is a valid non-empty string; reference
_upload_file_to_object_store and create_evaluation_dataset to locate the change
and ensure no EvaluationDataset is persisted if the upload did not succeed.

In `@backend/app/assessment/events.py`:
- Around line 13-48: The subscriber queues are unbounded (asyncio.Queue()) so
slow/idle SSE clients can cause OOM; fix by creating bounded queues in subscribe
(e.g., queue: asyncio.Queue[dict] = asyncio.Queue(maxsize=XX)) and update
publish to handle overflow: if queue.put_nowait(payload) raises
asyncio.QueueFull, either drop the oldest item from that queue
(queue.get_nowait() then retry put_nowait) or treat the subscriber as dead by
discarding the queue from self._subscribers and logging/removing it; change
symbols: modify subscribe (queue creation) and publish (the for loop over
self._subscribers and use of queue.put_nowait) to implement the chosen overflow
strategy and add appropriate debug/info logs.
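The drop-oldest overflow strategy can be sketched standalone (the broker's actual `publish` would loop this over `self._subscribers`):

```python
import asyncio

def publish_bounded(queue: asyncio.Queue, payload: dict) -> None:
    """Bounded-queue publish: on overflow, evict the oldest event."""
    try:
        queue.put_nowait(payload)
    except asyncio.QueueFull:
        try:
            queue.get_nowait()        # drop the oldest queued event
        except asyncio.QueueEmpty:
            pass
        queue.put_nowait(payload)

q: asyncio.Queue = asyncio.Queue(maxsize=2)
for i in range(4):
    publish_bounded(q, {"seq": i})
drained = [q.get_nowait()["seq"] for _ in range(q.qsize())]
print(drained)  # [2, 3]
```

A slow subscriber now holds at most `maxsize` events instead of growing without bound; the alternative (treating overflow as a dead subscriber and removing its queue) trades lost events for stricter memory guarantees.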

---

Nitpick comments:
In `@backend/app/alembic/versions/055_add_assessment_manager_table.py`:
- Around line 121-132: The migration defines NOT NULL columns inserted_at and
updated_at without DB defaults, which breaks raw INSERTs; update the Column
definitions for "inserted_at" and "updated_at" to include a database server
default (server_default=sa.func.now()) and add server_onupdate=sa.func.now() for
"updated_at" so the DB will populate timestamps for direct SQL inserts and
automatic updates from the database side.

In `@backend/app/api/routes/cron.py`:
- Around line 79-95: The code mutates result assuming it's a dict which can
raise a TypeError if process_all_pending_evaluations (or the variable result) is
None or not a dict; guard and normalize it before merging by checking
isinstance(result, dict) (or using result = result or {}) and then safely set
result["assessment"] = assessment_result and update totals from
assessment_result; reference the variables result and assessment_result and the
surrounding merge block so the normalization happens immediately before the
lines that assign result["assessment"] and update
total_processed/total_failed/total_still_processing.
- Around line 71-77: The inline runtime import of
poll_all_pending_assessment_evaluations inside the cron request handler should
be resolved: either hoist the import to module scope so the function is imported
once at startup (remove the in-handler from app.assessment.cron import ... and
add it at top-level), or if this was done to avoid a circular import, add a
concise comment above the inline import explaining the circular-import reason
and why the import must remain lazy. Ensure you keep the existing call to
poll_all_pending_assessment_evaluations(session=session) unchanged and reference
that symbol in your change.

In `@backend/app/assessment/batch.py`:
- Around line 289-304: The current shallow copy "body = dict(openai_params)"
causes nested objects (like text.format.json_schema, reasoning, tools) to be
shared across all rows; replace that with a deep copy (use
copy.deepcopy(openai_params)) when constructing body inside the loop so each
JSONL row gets an independent payload; apply the same deep-copy pattern at the
equivalent Gemini-side assignment as well to avoid cross-row mutation later
(look for the "body" variable, "openai_params", and the jsonl_data.append
block/BATCH_KEY usage to locate the spots to change).
- Around line 138-161: The _build_text_prompt function currently does sequential
replace which allows placeholder-in-value re-expansion (e.g., a column value
containing "{other}" gets re-substituted when that column is processed later);
update _build_text_prompt to perform a single-pass, safe substitution by
building a mapping of column->normalized value and using a single-call
formatting approach (e.g., str.format_map with a default/missing-safe dict)
instead of iterative prompt.replace, ensuring you still call normalize_llm_text
for each value before substitution; keep the original behavior for the
no-template branch unchanged.

In `@backend/app/assessment/cron.py`:
- Around line 73-76: The log call in poll_all_pending_assessment_evaluations
uses two adjacent string literals ("[poll_all_pending_assessment_evaluations] "
"No active assessments found") which concatenates into a string with double
space; update the logger.info invocation in cron.py to use a single combined
string (e.g., "[poll_all_pending_assessment_evaluations] No active assessments
found") or an f-string so the prefix and message are one contiguous string with
a single space, ensuring the message reads cleanly.

In `@backend/app/assessment/crud.py`:
- Around line 255-315: recompute_assessment_status currently does a
read-modify-write on the Assessment row which can be lost under concurrent
recomputes; to fix, acquire a row lock on the parent Assessment (e.g., SELECT
... FOR UPDATE) before recomputing in recompute_assessment_status or replace the
logic with a single atomic UPDATE ... FROM (SELECT ...) aggregation that
computes totals from AssessmentRun and updates Assessment in one statement;
reference the recompute_assessment_status function and the
Assessment/AssessmentRun models and ensure you commit while holding the lock or
use a transaction so concurrent workers cannot overwrite each other's results.

In `@backend/app/assessment/events.py`:
- Line 26: Replace the deprecated alias asyncio.TimeoutError with the builtin
TimeoutError in the exception handler inside backend.app.assessment.events (the
except block currently using asyncio.TimeoutError); update the except clause in
the function/method where that try/except appears (look for the "except
asyncio.TimeoutError:" line) to use "except TimeoutError:" so it relies on the
built-in exception instead of the asyncio alias.

In `@backend/app/assessment/mappers.py`:
- Around line 10-22: The type hint for normalize_llm_text is inaccurate because
it can accept and return None; update its annotation to accept Optional[str] and
return Optional[str] (from typing import Optional), i.e. change def
normalize_llm_text(text: str) -> str to def normalize_llm_text(text:
Optional[str]) -> Optional[str], and keep the existing early-return behavior so
callers that pass kaapi_params.get("instructions") (which may be None) are
correctly typed.
- Around line 25-59: Both _ensure_openai_strict_schema and
_strip_additional_properties only recurse into "properties" and "items", so they
miss subschemas under composition keywords and definitions; update both
functions to also traverse "anyOf", "oneOf", "allOf" (iterating and recursively
processing each subschema when they're lists) and "$defs" / "definitions"
(recursively processing dict values), ensuring you treat nested dicts/lists
consistently (use _ensure_openai_strict_schema in the former and
_strip_additional_properties in the latter) so additionalProperties is injected
or removed throughout composed/nested schemas.
- Line 5: The import of google.genai._transformers in
backend.app.assessment.mappers should be removed and the assessment->GenAI
schema flow updated to not rely on the private module; instead pass the raw JSON
Schema into the SDK's documented response_json_schema parameter when building
requests, or implement a robust internal converter
_convert_json_schema_to_google that performs the necessary transformations
(uppercase/openAI-style type names, resolve $ref entries from $defs, convert
anyOf [ "null", X ] into a nullable representation) and pin the SDK version if
you must keep an internal transformer; update any references in this file (e.g.,
where genai_transformers was used) to call response_json_schema or the new
_convert_json_schema_to_google helper.

In `@backend/app/assessment/models.py`:
- Around line 32-36: The primary-key fields (e.g., the id field declared as id:
int = SQLField(default=None, primary_key=True)) are typed non-optional while
defaulting to None; update the type annotations to allow None (use int | None or
Optional[int]) so the signature matches the SQLField(default=None) behavior,
e.g., change id: int to id: int | None = SQLField(...); apply the same fix to
the other primary-key field(s) with the same pattern elsewhere in the file (the
block around the second id declaration).

In `@backend/app/assessment/processing.py`:
- Around line 482-486: The local import of
poll_all_pending_assessment_evaluations inside poll_all_pending_assessments is
intentional to avoid a circular import; add a concise inline comment above the
from app.assessment.cron import poll_all_pending_assessment_evaluations line
explaining that the import must remain local (not module-level) to prevent
circular imports and to warn future maintainers not to move it to top-level.

In `@backend/app/assessment/service.py`:
- Line 10: service.py is importing the private function _resolve_config from
batch.py which couples modules to an internal symbol; change this by promoting
the helper to a public API (rename _resolve_config to resolve_config) or move it
into a shared module (e.g., assessment.config or assessment_utils) and update
both callers to import the new public name, then remove the underscore import
from batch.py and service.py and ensure submit_assessment_batch and any other
callers reference the new resolve_config location.
- Around line 145-193: The start_assessment flow performs multiple DB
creates/updates (create_assessment, create_assessment_run,
update_assessment_run_status, recompute_assessment_status) without an explicit
transaction, allowing partial commits if an error occurs (e.g., during
submit_assessment_batch or recompute_assessment_status); wrap the multi-step
orchestration in a single transactional boundary (use session.begin() /
session.begin_nested() as appropriate) so all create/update calls are atomic and
commit once at the end, or alternatively, on any uncaught exception ensure you
mark the parent Assessment as failed (update_assessment_run_status + update
parent assessment status and error_message via the same session) before
re-raising/returning to avoid leaving orphaned pending/processing runs; locate
these changes in start_assessment and the exception handling around
submit_assessment_batch/recompute_assessment_status to implement the chosen
approach.
- Around line 159-195: The loop currently calls recompute_assessment_status
inside both the try (after update_assessment_run_status) and the except blocks,
causing redundant DB writes; remove the per-iteration calls to
recompute_assessment_status within the try/except and keep only the final
recompute_assessment_status after the loop completes so the parent's status is
updated once; locate the calls around submit_assessment_batch,
update_assessment_run_status, and the except block handling the exception for
run processing and delete those inner recompute_assessment_status invocations
while leaving the final recompute_assessment_status(session=session,
assessment_id=assessment.id) at the end.
- Around line 51-62: The code is redundantly wrapping an already-UUID config_id
with UUID(str(...)) in the loop building AssessmentConfigRef; change the
AssessmentConfigRef call to use the existing UUID directly (e.g.
config_id=run.config_id) and if the type checker complains, assert or cast the
non-None type before use (e.g. assert run.config_id is not None then use
run.config_id) so you remove the unnecessary UUID(str(run.config_id))
round-trip; update the instantiation in the loop that creates
AssessmentConfigRef accordingly.
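The transactional boundary suggested in the first item above can be sketched as follows. `FakeSession` and `atomic` are illustrative stand-ins only; the PR's CRUD helpers currently commit internally, so adopting this would mean moving the commit out of them and committing once in the orchestrator.

```python
from contextlib import contextmanager

class FakeSession:
    """Stand-in for a SQLAlchemy Session, just enough to show the pattern."""
    def __init__(self):
        self.committed = False
        self.rolled_back = False
    def commit(self):
        self.committed = True
    def rollback(self):
        self.rolled_back = True

@contextmanager
def atomic(session):
    # Commit once at the end; roll back on any failure so no partial
    # create/update from the multi-step orchestration survives.
    try:
        yield session
        session.commit()
    except Exception:
        session.rollback()
        raise

# Happy path: every step runs, then a single commit.
s = FakeSession()
with atomic(s):
    pass  # create_assessment(...), create_assessment_run(...), etc.
assert s.committed and not s.rolled_back

# Failure path: the whole orchestration is rolled back, nothing is orphaned.
s2 = FakeSession()
try:
    with atomic(s2):
        raise RuntimeError("submit_assessment_batch failed")
except RuntimeError:
    pass
assert s2.rolled_back and not s2.committed
```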

In `@backend/app/assessment/utils/export.py`:
- Around line 395-417: The export is masking missing parent relationships by
coercing assessment_id to 0; update the export to pass the real value (use
assessment_id=run.assessment_id) and either (A) if a missing parent should be an
error, add an explicit check in the export routine (raise ValueError or similar
when run.assessment_id is None) before building AssessmentExportRow, or (B) if
exports may legitimately have no parent, update
AssessmentExportRow.assessment_id type in models.py to allow None (int | None)
and keep passing run.assessment_id through; locate the construction of
AssessmentExportRow in export_rows.append and implement one of these two fixes
around that symbol.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 072190d8-1f9b-4ad9-9f98-d507cb197d57

📥 Commits

Reviewing files that changed from the base of the PR and between ec2eeda and f2e512e.

📒 Files selected for processing (21)
  • backend/app/alembic/versions/055_add_assessment_manager_table.py
  • backend/app/api/main.py
  • backend/app/api/routes/cron.py
  • backend/app/assessment/__init__.py
  • backend/app/assessment/batch.py
  • backend/app/assessment/cron.py
  • backend/app/assessment/crud.py
  • backend/app/assessment/dataset.py
  • backend/app/assessment/events.py
  • backend/app/assessment/mappers.py
  • backend/app/assessment/models.py
  • backend/app/assessment/processing.py
  • backend/app/assessment/routes.py
  • backend/app/assessment/service.py
  • backend/app/assessment/utils/__init__.py
  • backend/app/assessment/utils/export.py
  • backend/app/assessment/utils/parsing.py
  • backend/app/assessment/validators.py
  • backend/app/models/__init__.py
  • backend/app/models/batch_job.py
  • backend/app/models/llm/request.py
✅ Files skipped from review due to trivial changes (5)
  • backend/app/models/batch_job.py
  • backend/app/assessment/__init__.py
  • backend/app/api/main.py
  • backend/app/assessment/utils/parsing.py
  • backend/app/models/__init__.py
🚧 Files skipped from review as they are similar to previous changes (3)
  • backend/app/models/llm/request.py
  • backend/app/assessment/utils/__init__.py
  • backend/app/assessment/routes.py

Comment on lines +15 to +16
revision = "55"
down_revision = "054"

⚠️ Potential issue | 🟡 Minor

Revision ID is "55" but the filename and down_revision use zero-padded "055" / "054".

Existing migrations are zero-padded three-digit IDs, and down_revision = "054". Setting revision = "55" (without padding) breaks the visual ordering convention and creates a mismatch with the filename 055_add_assessment_manager_table.py. Use "055" to match the established scheme — the next migration (which will likely set its down_revision to this id) will be far less surprising.

🛠️ Proposed change
-revision = "55"
+revision = "055"
 down_revision = "054"

As per coding guidelines: "Generate database migrations using alembic revision --autogenerate -m "Description" --rev-id <number> where rev-id is the latest existing revision ID + 1".

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
-revision = "55"
+revision = "055"
 down_revision = "054"
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/app/alembic/versions/055_add_assessment_manager_table.py` around
lines 15 - 16, The migration's revision identifier is not zero-padded and
mismatches the filename and down_revision; update the module-level variable
revision from "55" to "055" so it matches the filename and existing
down_revision = "054", keeping the three-digit convention used across migrations
and ensuring subsequent revisions align.

Comment on lines +107 to +135
def _parse_excel_rows(content: bytes) -> list[dict[str, str]]:
    """Parse Excel content into list of row dicts."""
    import openpyxl

    wb = openpyxl.load_workbook(io.BytesIO(content), read_only=True, data_only=True)
    ws = wb.active
    if ws is None:
        wb.close()
        return []

    rows_iter = ws.iter_rows(values_only=True)
    header = next(rows_iter, None)
    if header is None:
        wb.close()
        return []

    columns = [str(h) if h is not None else f"col_{i}" for i, h in enumerate(header)]
    result = []
    for row in rows_iter:
        if row and any(cell is not None for cell in row):
            row_dict = {
                columns[i]: str(cell) if cell is not None else ""
                for i, cell in enumerate(row)
                if i < len(columns)
            }
            result.append(row_dict)

    wb.close()
    return result

⚠️ Potential issue | 🟡 Minor

wb.close() is not guaranteed on iteration errors.

If iter_rows/next or any conversion in the row-build raises (e.g., extremely malformed .xlsx), the workbook handle is leaked because close() is only called on the success paths. Wrap the body in try/finally to ensure cleanup.

♻️ Suggested fix
 def _parse_excel_rows(content: bytes) -> list[dict[str, str]]:
     """Parse Excel content into list of row dicts."""
     import openpyxl
 
     wb = openpyxl.load_workbook(io.BytesIO(content), read_only=True, data_only=True)
-    ws = wb.active
-    if ws is None:
-        wb.close()
-        return []
-
-    rows_iter = ws.iter_rows(values_only=True)
-    header = next(rows_iter, None)
-    if header is None:
-        wb.close()
-        return []
-
-    columns = [str(h) if h is not None else f"col_{i}" for i, h in enumerate(header)]
-    result = []
-    for row in rows_iter:
-        if row and any(cell is not None for cell in row):
-            row_dict = {
-                columns[i]: str(cell) if cell is not None else ""
-                for i, cell in enumerate(row)
-                if i < len(columns)
-            }
-            result.append(row_dict)
-
-    wb.close()
-    return result
+    try:
+        ws = wb.active
+        if ws is None:
+            return []
+
+        rows_iter = ws.iter_rows(values_only=True)
+        header = next(rows_iter, None)
+        if header is None:
+            return []
+
+        columns = [str(h) if h is not None else f"col_{i}" for i, h in enumerate(header)]
+        result = []
+        for row in rows_iter:
+            if row and any(cell is not None for cell in row):
+                row_dict = {
+                    columns[i]: str(cell) if cell is not None else ""
+                    for i, cell in enumerate(row)
+                    if i < len(columns)
+                }
+                result.append(row_dict)
+        return result
+    finally:
+        wb.close()
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/app/assessment/batch.py` around lines 107 - 135, The workbook created
in _parse_excel_rows (wb returned by openpyxl.load_workbook) may never be closed
if iter_rows, next(header) or the row-to-dict conversion raises; wrap the
workbook usage in a try/finally (or context manager if available) so wb.close()
is always called in the finally block after opening wb and before returning, and
move any early returns (e.g., when ws is None or header is None) into that try
so they still hit the finally cleanup; ensure variables referenced (wb, ws,
rows_iter, header, result) are initialized inside the try scope.

Comment on lines +481 to +569
    base_provider = _NATIVE_PROVIDERS.get(provider_name, "openai")

    if base_provider == "openai":
        mapped_params, warnings = map_kaapi_to_openai_params(params)
        if warnings:
            logger.info(f"[submit_assessment_batch] Mapper warnings: {warnings}")

        jsonl_data = build_openai_jsonl(
            rows=rows,
            text_columns=text_columns,
            attachments=attachments,
            prompt_template=prompt_template,
            openai_params=mapped_params,
        )

        # Get OpenAI client and submit
        from app.utils import get_openai_client

        openai_client = get_openai_client(
            session=session,
            org_id=organization_id,
            project_id=project_id,
        )
        provider = OpenAIBatchProvider(client=openai_client)

        batch_config = {
            "endpoint": "/v1/responses",
            "description": f"Assessment: {run.run_name}",
            "completion_window": "24h",
        }

        batch_job = start_batch_job(
            session=session,
            provider=provider,
            provider_name="openai",
            job_type=BatchJobType.ASSESSMENT,
            organization_id=organization_id,
            project_id=project_id,
            jsonl_data=jsonl_data,
            config=batch_config,
        )

    elif base_provider == "google":
        mapped_params, warnings = map_kaapi_to_google_params(params)
        if warnings:
            logger.info(f"[submit_assessment_batch] Mapper warnings: {warnings}")

        jsonl_data = build_google_jsonl(
            rows=rows,
            text_columns=text_columns,
            attachments=attachments,
            prompt_template=prompt_template,
            google_params=mapped_params,
        )

        # Get Gemini client and submit
        from app.core.batch import GeminiBatchProvider
        from app.core.batch.client import GeminiClient

        gemini_client = GeminiClient.from_credentials(
            session=session,
            org_id=organization_id,
            project_id=project_id,
        )
        provider = GeminiBatchProvider(
            client=gemini_client.client,
            model=f"models/{mapped_params.get('model', 'gemini-2.5-pro')}",
        )

        batch_config = {
            "display_name": f"assessment-{run.run_name}",
            "model": f"models/{mapped_params.get('model', 'gemini-2.5-pro')}",
        }

        batch_job = start_batch_job(
            session=session,
            provider=provider,
            provider_name="google",
            job_type=BatchJobType.ASSESSMENT,
            organization_id=organization_id,
            project_id=project_id,
            jsonl_data=jsonl_data,
            config=batch_config,
        )

    else:
        raise ValueError(
            f"Unsupported provider for assessment batches: {provider_name}"
        )

⚠️ Potential issue | 🟠 Major

Unknown provider names silently fall through to the OpenAI batch path.

base_provider = _NATIVE_PROVIDERS.get(provider_name, "openai") defaults unknown providers to "openai", so any unrecognized provider_name takes the if base_provider == "openai": branch and the else: raise ValueError(...) at lines 566‑569 is unreachable. A typo or new provider value (e.g., "anthropic") will silently submit through OpenAI rather than fail loudly.

♻️ Suggested fix — fail fast on unknown providers
-    # Determine the base provider (openai or google)
-    base_provider = _NATIVE_PROVIDERS.get(provider_name, "openai")
+    # Determine the base provider (openai or google)
+    base_provider = _NATIVE_PROVIDERS.get(provider_name)
+    if base_provider is None:
+        raise ValueError(
+            f"Unsupported provider for assessment batches: {provider_name}"
+        )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/app/assessment/batch.py` around lines 481 - 569, The code is
defaulting unknown provider_name to "openai" via _NATIVE_PROVIDERS.get(...,
"openai"), causing unknown providers to silently use the OpenAI path; change
this to fail fast by using _NATIVE_PROVIDERS.get(provider_name) (no default) or
explicitly checking if provider_name not in _NATIVE_PROVIDERS and raising
ValueError, then use base_provider for the if/elif branches (referencing the
variables base_provider and provider_name and the mapping _NATIVE_PROVIDERS);
ensure the unreachable else that raises ValueError remains reachable so
start_batch_job is only called for known providers.

Comment on lines +41 to +61
def _build_callback_payload(
    assessment: Assessment,
    run: AssessmentRun,
    result: dict[str, Any],
) -> dict[str, Any]:
    """Build minimal SSE payload for assessment invalidation."""
    return APIResponse.success_response(
        data={
            "type": "assessment.child_status_changed",
            "assessment_id": assessment.id,
            "assessment_status": assessment.status,
            "run": {
                "id": run.id,
                "config_id": str(run.config_id) if run.config_id else None,
                "config_version": run.config_version,
                "status": result.get("current_status"),
                "error": result.get("error"),
                "updated_at": run.updated_at.isoformat() if run.updated_at else None,
            },
        }
    ).model_dump()

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Inspect APIResponse.success_response to confirm wrapping shape.
fd -t f utils.py -p backend/app | head
ast-grep --pattern $'class APIResponse$$$:
  $$$
'
rg -nP -C5 "def success_response" --type=py

Repository: ProjectTech4DevAI/kaapi-backend

Length of output: 1073


🏁 Script executed:

# Get full APIResponse class definition
sed -n '50,75p' backend/app/utils.py

Repository: ProjectTech4DevAI/kaapi-backend

Length of output: 863


🏁 Script executed:

# Check the publish method in events.py around line 29
fd -t f 'events.py' backend/app/assessment

Repository: ProjectTech4DevAI/kaapi-backend

Length of output: 108


🏁 Script executed:

# Look at events.py to see how payload is used
rg -n "def publish" backend/app/assessment -A 20 --type=py

Repository: ProjectTech4DevAI/kaapi-backend

Length of output: 1084


🏁 Script executed:

# Check if there are any model_dump calls or customizations in APIResponse
rg -n "model_dump\|class_config\|ConfigDict" backend/app/utils.py --type=py -B 2 -A 2

Repository: ProjectTech4DevAI/kaapi-backend

Length of output: 57


🏁 Script executed:

# Check where the payload from the queue is consumed
rg -n "queue.get\|from_queue\|on_event" backend/app/assessment -A 10 --type=py

Repository: ProjectTech4DevAI/kaapi-backend

Length of output: 57


🏁 Script executed:

# Check SSE streaming to confirm event_type usage
rg -n "event_type\|event:" backend/app/assessment --type=py -B 2 -A 2

Repository: ProjectTech4DevAI/kaapi-backend

Length of output: 57


🏁 Script executed:

# Verify type hints are complete on _build_callback_payload
sed -n '41,62p' backend/app/assessment/cron.py

Repository: ProjectTech4DevAI/kaapi-backend

Length of output: 891


🏁 Script executed:

# Find where assessment_event_broker is imported and used in cron.py
rg -n "assessment_event_broker" backend/app/assessment/cron.py --type=py -B 2 -A 2

Repository: ProjectTech4DevAI/kaapi-backend

Length of output: 857


🏁 Script executed:

# Find SSE route/handler that consumes assessment_event_broker
rg -n "assessment_event_broker\|AssessmentEventBroker" backend/app --type=py -l

Repository: ProjectTech4DevAI/kaapi-backend

Length of output: 57


🏁 Script executed:

# Look at the full events.py to understand the complete flow
cat backend/app/assessment/events.py

Repository: ProjectTech4DevAI/kaapi-backend

Length of output: 1777


SSE event: line will degrade to "message": "type" is nested inside APIResponse.success_response(...).data, not at the payload root.

The subscribe handler in backend/app/assessment/events.py at line 28 extracts the event type:

event_type = payload.get("type", "message")
yield f"event: {event_type}\ndata: {json.dumps(payload)}\n\n"

But _build_callback_payload wraps the callback dict in APIResponse.success_response(data={...}).model_dump(), producing:

{
  "success": true,
  "data": {
    "type": "assessment.child_status_changed",
    "assessment_id": "...",
    ...
  },
  "error": null,
  ...
}

The root-level lookup payload.get("type", "message") finds no "type" key and falls back to "message", so all assessment events are sent as event: message, defeating client-side listeners expecting event: assessment.child_status_changed.

Either lift "type" to the root of the published payload or update events.py::subscribe() to look at payload["data"]["type"].

Suggested fix (option A — flatten)
-    return APIResponse.success_response(
-        data={
-            "type": "assessment.child_status_changed",
-            "assessment_id": assessment.id,
-            "assessment_status": assessment.status,
-            "run": {
-                "id": run.id,
-                "config_id": str(run.config_id) if run.config_id else None,
-                "config_version": run.config_version,
-                "status": result.get("current_status"),
-                "error": result.get("error"),
-                "updated_at": run.updated_at.isoformat() if run.updated_at else None,
-            },
-        }
-    ).model_dump()
+    envelope = APIResponse.success_response(
+        data={
+            "assessment_id": assessment.id,
+            "assessment_status": assessment.status,
+            "run": {
+                "id": run.id,
+                "config_id": str(run.config_id) if run.config_id else None,
+                "config_version": run.config_version,
+                "status": result.get("current_status"),
+                "error": result.get("error"),
+                "updated_at": run.updated_at.isoformat() if run.updated_at else None,
+            },
+        }
+    ).model_dump()
+    envelope["type"] = "assessment.child_status_changed"
+    return envelope
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/app/assessment/cron.py` around lines 41 - 61, The payload returned by
_build_callback_payload nests "type" inside
APIResponse.success_response(...).data so events.py's subscribe() (which reads
payload.get("type", "message")) always falls back to "message"; fix by promoting
the event type to the root of the published payload: in _build_callback_payload
(or immediately after calling APIResponse.success_response(...).model_dump())
ensure the returned dict contains a top-level "type":
"assessment.child_status_changed" key (while keeping the detailed run/assessment
info under "data"), so subscribe() can detect the event without changing
events.py.
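The degradation can be reproduced in isolation. The format_sse helper below is an illustrative stand-in that mirrors the subscribe() logic quoted above; it is not code from the PR:

```python
import json

def format_sse(payload: dict) -> str:
    # Same lookup as the quoted subscribe(): the event name must live at the
    # payload root, or the SSE frame degrades to the default "message" event.
    event_type = payload.get("type", "message")
    return f"event: {event_type}\ndata: {json.dumps(payload)}\n\n"

# Flattened payload: listeners bound to the named event fire.
flat = {"type": "assessment.child_status_changed", "data": {"assessment_id": 1}}
assert format_sse(flat).startswith("event: assessment.child_status_changed\n")

# Envelope-only payload (the current _build_callback_payload shape): degrades.
wrapped = {"success": True, "data": {"type": "assessment.child_status_changed"}}
assert format_sse(wrapped).startswith("event: message\n")
```

Browsers also route unnamed frames to the generic onmessage handler, so a client using addEventListener("assessment.child_status_changed", ...) would never see these events.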

Comment on lines +144 to +182
except Exception as e:
    logger.error(
        "[poll_all_pending_assessment_evaluations] "
        f"Failed run {run.id} | "
        f"experiment={run.run_name} | "
        f"assessment_id={run.assessment_id} | "
        f"config_id={run.config_id} | "
        f"config_version={run.config_version} | "
        f"error={e}",
        exc_info=True,
    )
    update_assessment_run_status(
        session=session,
        run=run,
        status="failed",
        error_message="Processing failed. Check server logs for details.",
    )
    refreshed_assessment = recompute_assessment_status(
        session=session, assessment_id=assessment.id
    )
    failure_result = {
        "assessment_id": run.assessment_id,
        "run_id": run.id,
        "run_name": run.run_name,
        "config_id": str(run.config_id) if run.config_id else None,
        "config_version": run.config_version,
        "action": "failed",
        "error": "Processing failed",
        "current_status": "failed",
    }
    all_results.append(failure_result)
    assessment_event_broker.publish(
        _build_callback_payload(
            refreshed_assessment,
            run,
            failure_result,
        )
    )
    failed += 1

⚠️ Potential issue | 🟠 Major

Cleanup inside the broad except can itself raise and abort the cron loop.

update_assessment_run_status and recompute_assessment_status both perform session.commit() and re-raise on failure (see backend/app/assessment/crud.py:192-222 and :255-315). If a transient DB error hits while we're handling an unrelated check_and_process_assessment exception, the secondary exception escapes this except and unwinds out of the for-loop, skipping every remaining assessment for this cron tick.

Wrap the cleanup ops in their own try/except (logging + continue) so one bad run can't poison the whole batch.

🛡️ Sketch
             except Exception as e:
                 logger.error( ... )
-                update_assessment_run_status(
-                    session=session,
-                    run=run,
-                    status="failed",
-                    error_message="Processing failed. Check server logs for details.",
-                )
-                refreshed_assessment = recompute_assessment_status(
-                    session=session, assessment_id=assessment.id
-                )
+                try:
+                    update_assessment_run_status(
+                        session=session,
+                        run=run,
+                        status="failed",
+                        error_message="Processing failed. Check server logs for details.",
+                    )
+                    refreshed_assessment = recompute_assessment_status(
+                        session=session, assessment_id=assessment.id
+                    )
+                except Exception as cleanup_err:
+                    logger.error(
+                        "[poll_all_pending_assessment_evaluations] "
+                        f"Cleanup failed for run {run.id}: {cleanup_err}",
+                        exc_info=True,
+                    )
+                    failed += 1
+                    continue
                 ...
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/app/assessment/cron.py` around lines 144 - 182, The cleanup steps
inside the broad except should be isolated so a secondary DB error can't abort
the whole cron loop: wrap the calls to update_assessment_run_status,
recompute_assessment_status, the construction/appending of failure_result,
assessment_event_broker.publish (and the failed += 1) in their own try/except
block; on exception catch and logger.error with context (include
run.id/run.assessment_id/run.run_name) and exc_info=True, then continue the
for-loop without re-raising so one failed cleanup won't stop processing
remaining runs.
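The isolation pattern described above can be sketched generically. Here process and cleanup are hypothetical stand-ins for check_and_process_assessment and the CRUD/publish calls:

```python
import logging

logger = logging.getLogger(__name__)

def run_batch(items, process, cleanup):
    """Process each item; a failing cleanup must not abort the loop."""
    failed = 0
    for item in items:
        try:
            process(item)
        except Exception:
            try:
                cleanup(item)
            except Exception:
                # Log and keep going so one bad run can't poison the batch.
                logger.error("[run_batch] Cleanup failed for %s", item, exc_info=True)
            failed += 1
    return failed

def always_fails(item):
    raise RuntimeError("boom")

def flaky_cleanup(item):
    if item == 2:
        raise RuntimeError("db hiccup")

# All three items are visited even though item 2's cleanup also raised.
assert run_batch([1, 2, 3], always_fails, flaky_cleanup) == 3
```

Without the inner try/except, the "db hiccup" on item 2 would unwind the loop and item 3 would never be processed in that tick.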

Comment on lines +192 to +223
def update_assessment_run_status(
    session: Session,
    run: AssessmentRun,
    status: str,
    error_message: str | None = None,
    batch_job_id: int | None = None,
    total_items: int | None = None,
    object_store_url: str | None = None,
) -> AssessmentRun:
    """Update an assessment run's status and optional fields."""
    run.status = status
    run.updated_at = now()

    if error_message is not None:
        run.error_message = error_message
    if batch_job_id is not None:
        run.batch_job_id = batch_job_id
    if total_items is not None:
        run.total_items = total_items
    if object_store_url is not None:
        run.object_store_url = object_store_url

    session.add(run)
    try:
        session.commit()
        session.refresh(run)
    except Exception as e:
        session.rollback()
        logger.error(f"[update_assessment_run_status] Failed: {e}", exc_info=True)
        raise

    return run

⚠️ Potential issue | 🟡 Minor

error_message=None cannot clear an existing error.

The pattern if error_message is not None: run.error_message = error_message (and the same for batch_job_id, total_items, object_store_url) means once any of these are populated, callers cannot reset them through this function, only overwrite them. If a retry/recovery path ever needs to clear error_message on transition back to processing/completed, this will silently keep stale data.

If the intent is "merge non-default fields", this is fine — just be aware of the limitation.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/app/assessment/crud.py` around lines 192 - 223, The
update_assessment_run_status function currently treats None as "do not change",
preventing callers from clearing fields; fix this by introducing a unique
sentinel (e.g., UNSET = object()) as the default for error_message,
batch_job_id, total_items, and object_store_url, then change the checks to "is
not UNSET" and assign the value (which may be None) to run.error_message,
run.batch_job_id, run.total_items, and run.object_store_url inside
update_assessment_run_status so callers can explicitly clear fields by passing
None; update any call sites accordingly to use the sentinel omission vs explicit
None semantics.
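The sentinel approach suggested above reads roughly like this; _UNSET, Run, and update_run_fields are illustrative names, not code from the PR:

```python
_UNSET = object()  # unique sentinel meaning "caller did not pass this argument"

class Run:
    error_message = "old failure"
    total_items = 10

def update_run_fields(run, error_message=_UNSET, total_items=_UNSET):
    # "is not _UNSET" distinguishes "omitted" from "explicitly None",
    # so callers can clear a field by passing None.
    if error_message is not _UNSET:
        run.error_message = error_message
    if total_items is not _UNSET:
        run.total_items = total_items
    return run

run = Run()
update_run_fields(run, error_message=None)  # explicit None now clears it
assert run.error_message is None
assert run.total_items == 10                # omitted arg left untouched
```

Any identity-unique object works as the sentinel; the key is that it can never collide with a value a caller would legitimately pass.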

Comment on lines +198 to +231
elif provider_name in ("google", "google-native"):
    response = result.get("response")
    error = result.get("error")

    if error:
        results.append(
            {
                "row_id": row_id,
                "output": None,
                "error": str(error),
                "usage": None,
            }
        )
        continue

    if response:
        text = extract_text_from_response_dict(response)
        results.append(
            {
                "row_id": row_id,
                "output": text,
                "error": None,
                "usage": None,
            }
        )
    else:
        results.append(
            {
                "row_id": row_id,
                "output": None,
                "error": "Empty response",
                "usage": None,
            }
        )

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Confirm the Gemini response shape and where usage metadata / errors live in the parsed dict.
fd -t f 'gemini.py' backend/app/core/batch
rg -n -C2 'usageMetadata|usage_metadata|extract_text_from_response_dict|promptFeedback|finishReason' backend/app/core/batch

Repository: ProjectTech4DevAI/kaapi-backend

Length of output: 1618


🏁 Script executed:

# Check the extract_text_from_response_dict function implementation
head -100 backend/app/core/batch/gemini.py | cat -n

Repository: ProjectTech4DevAI/kaapi-backend

Length of output: 3570


🏁 Script executed:

# Look for usageMetadata handling in gemini.py
rg -n 'usageMetadata|usage|metadata' backend/app/core/batch/gemini.py

Repository: ProjectTech4DevAI/kaapi-backend

Length of output: 57


🏁 Script executed:

# Check the OpenAI branch to see how it handles usage
sed -n '198,231p' backend/app/assessment/processing.py

Repository: ProjectTech4DevAI/kaapi-backend

Length of output: 1157


🏁 Script executed:

# Look at the broader context in processing.py to see the OpenAI branch
sed -n '150,231p' backend/app/assessment/processing.py

Repository: ProjectTech4DevAI/kaapi-backend

Length of output: 3368


🏁 Script executed:

# Check the usage_totals function in export.py
rg -n -A10 'def usage_totals' backend/app/assessment/export.py

Repository: ProjectTech4DevAI/kaapi-backend

Length of output: 148


🏁 Script executed:

# Find export.py
fd -t f 'export.py' backend/

Repository: ProjectTech4DevAI/kaapi-backend

Length of output: 114


🏁 Script executed:

# Look for usage extraction in Gemini context
rg -n 'usageMetadata|usage_metadata' backend/

Repository: ProjectTech4DevAI/kaapi-backend

Length of output: 3629


🏁 Script executed:

# Search for Gemini batch response handling to understand structure
rg -n -B3 -A3 'candidates|finishReason|promptFeedback' backend/app/core/batch/gemini.py | head -50

Repository: ProjectTech4DevAI/kaapi-backend

Length of output: 661


🏁 Script executed:

# Check if there's test data or fixtures showing Gemini response structure
fd -t f '*.json' backend/ | xargs rg -l 'usageMetadata|candidates' | head -5

Repository: ProjectTech4DevAI/kaapi-backend

Length of output: 632


🏁 Script executed:

# Check usage_totals function
rg -n -A15 'def usage_totals' backend/app/assessment/utils/export.py

Repository: ProjectTech4DevAI/kaapi-backend

Length of output: 57


🏁 Script executed:

# Look at Gemini batch test to see response structure
cat -n backend/app/tests/core/batch/test_gemini.py | head -150

Repository: ProjectTech4DevAI/kaapi-backend

Length of output: 6852


🏁 Script executed:

# Check if Gemini batch processing includes usageMetadata in response dict
rg -n -B5 -A5 'usageMetadata\|"usage"' backend/app/core/batch/gemini.py

Repository: ProjectTech4DevAI/kaapi-backend

Length of output: 57


🏁 Script executed:

# Check how Gemini processes batch results and constructs the response dict
sed -n '200,350p' backend/app/core/batch/gemini.py | cat -n

Repository: ProjectTech4DevAI/kaapi-backend

Length of output: 6925


🏁 Script executed:

# Check if finishReason is checked anywhere in batch processing
rg -n 'finishReason|finish_reason|SAFETY|BLOCKED' backend/app/core/batch/gemini.py

Repository: ProjectTech4DevAI/kaapi-backend

Length of output: 57


🏁 Script executed:

# Look for usageMetadata in Gemini batch response handling
rg -n '"usageMetadata"|usage' backend/app/core/batch/gemini.py

Repository: ProjectTech4DevAI/kaapi-backend

Length of output: 57


🏁 Script executed:

# Check test data to see actual Gemini batch response structure
rg -n -A20 'download_batch_results\|response.*dict' backend/app/tests/core/batch/test_gemini.py | head -80

Repository: ProjectTech4DevAI/kaapi-backend

Length of output: 57


🏁 Script executed:

# Verify the Anthropic implementation to compare error handling patterns
sed -n '130,200p' backend/app/assessment/processing.py | cat -n

Repository: ProjectTech4DevAI/kaapi-backend

Length of output: 3672


🏁 Script executed:

# Check Gemini API documentation or response examples in tests/batch
rg -n -B10 -A10 '"response"' backend/app/tests/core/batch/test_gemini.py | head -100

Repository: ProjectTech4DevAI/kaapi-backend

Length of output: 4494


🏁 Script executed:

# Search for example response structure in test data
rg -n '"candidates"' backend/app/tests/core/batch/test_gemini.py

Repository: ProjectTech4DevAI/kaapi-backend

Length of output: 521


🏁 Script executed:

# Check if there's any code that extracts usage from Gemini batch response
rg -n 'response\.' backend/app/core/batch/gemini.py | grep -i usage

Repository: ProjectTech4DevAI/kaapi-backend

Length of output: 57


Gemini branch has two unhandled cases that diverge from Anthropic behavior.

  1. Usage data is dropped: usage is hard-coded to None even when processing Gemini responses. The Anthropic branch extracts this via response_body.get("usage"). However, verify whether Gemini batch responses include usage metadata—the batch response structure shown in tests (candidates/content/parts) differs from direct API responses.

  2. Empty text is not treated as an error: When extract_text_from_response_dict() returns an empty string, the row is recorded as a success with error=None. The Anthropic branch handles this by setting error="Empty response output" when no text is extracted. Apply the same logic to the Google branch.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/app/assessment/processing.py` around lines 198 - 231, In the
Google/Gemini branch (the block handling provider_name in
("google","google-native")), populate the "usage" field from the response
instead of hard-coding None (e.g., extract response.get("usage") or, if Gemini
batch responses use a different shape, parse the batch structure to pull
usage/metadata) and treat an empty string from
extract_text_from_response_dict(response) as an error by setting error="Empty
response output" (mirror the Anthropic branch behavior); update the logic around
response, text = extract_text_from_response_dict(response) and the
results.append entries to include the extracted usage and to set error when text
is empty.
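The empty-text check recommended for the Google branch is a small guard. The to_result helper below is a hypothetical sketch of the result-building step, with extract_text standing in for extract_text_from_response_dict:

```python
def to_result(row_id, text, usage=None):
    # Treat an empty extraction as a failure, mirroring the Anthropic branch's
    # "Empty response output" behavior described above.
    if not text:
        return {"row_id": row_id, "output": None,
                "error": "Empty response output", "usage": usage}
    return {"row_id": row_id, "output": text, "error": None, "usage": usage}

assert to_result(1, "")["error"] == "Empty response output"
assert to_result(2, "hello")["error"] is None
assert to_result(2, "hello")["output"] == "hello"
```

Whether usage can actually be populated depends on whether Gemini batch result lines carry usage metadata, which the scripts above could not confirm.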

    Returns:
        Dict with status information
    """
    log_prefix = f"[assessment_run={run.id}]"

🛠️ Refactor suggestion | 🟠 Major

Log prefix should include the function name.

log_prefix = f"[assessment_run={run.id}]" is later used in logger.info/logger.error calls (lines 322‑325, 356, 456‑458), so those messages won't carry the function name as required by the project's logging convention.

♻️ Suggested adjustment
-    log_prefix = f"[assessment_run={run.id}]"
+    log_prefix = f"[check_and_process_assessment] assessment_run={run.id}"

As per coding guidelines: "Prefix all log messages with the function name in square brackets: logger.info(f"[function_name] Message {mask_string(sensitive_value)}")".

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
log_prefix = f"[assessment_run={run.id}]"
log_prefix = f"[check_and_process_assessment] assessment_run={run.id}"
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/app/assessment/processing.py` at line 253, Update the log_prefix to
include the current function name so all subsequent logger calls follow the
project's convention; replace the line setting log_prefix =
f"[assessment_run={run.id}]" with something like log_prefix =
f"[{inspect.currentframe().f_code.co_name}][assessment_run={run.id}]" (and add
"import inspect" if missing) so functions like the one containing run.id will
prefix logs with the function name and the assessment_run id used in
logger.info/logger.error calls.

Comment on lines +60 to +66
for row in row_payload:
input_data = row.pop("input_data", None) or {}
new_row = {}
for k in input_keys:
new_row[k] = input_data.get(k)
new_row.update(row)
expanded.append(new_row)

⚠️ Potential issue | 🟡 Minor

Dataset column name collisions silently lose user data.

new_row[k] = input_data.get(k) followed by new_row.update(row) means any input column whose name matches an AssessmentExportRow field (e.g. output, error, row_id, run_id, result_status, response_id) will be overwritten by the result column on every row, with no warning. This is easy to hit if a dataset has an output or error column.

Consider namespacing input columns (e.g. input_<col>) when they collide with reserved field names, or warn once during expansion.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/app/assessment/utils/export.py` around lines 60 - 66, The expansion
loop in export.py silently overwrites input columns that share names with export
fields (e.g. output, error, row_id, run_id, result_status, response_id) by
setting new_row[k] from input_data then calling new_row.update(row); modify the
loop that builds new_row (the variables row_payload, input_data, input_keys,
new_row are present) to detect collisions against a reserved_fields set and
instead namespace conflicting input keys (e.g. store as "input_<k>") or,
alternatively, emit a single warning when any collision is detected; ensure
namespacing logic also avoids clobbering an existing "input_<k>" in row and
preserves original row fields when calling new_row.update(row).
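A collision-safe version of the loop might look like this; `RESERVED_FIELDS` and the `input_<col>` prefix are illustrative choices, not the project's API:

```python
RESERVED_FIELDS = {"output", "error", "row_id", "run_id", "result_status", "response_id"}


def expand_rows(row_payload: list[dict], input_keys: list[str]) -> list[dict]:
    expanded = []
    for row in row_payload:
        input_data = row.pop("input_data", None) or {}
        new_row = {}
        for k in input_keys:
            # Namespace input columns that clash with reserved export fields,
            # so the result columns merged next can no longer clobber them.
            key = f"input_{k}" if k in RESERVED_FIELDS else k
            new_row[key] = input_data.get(k)
        new_row.update(row)
        expanded.append(new_row)
    return expanded
```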

Comment on lines +422 to +433
def sort_export_rows(
export_rows: list[AssessmentExportRow],
) -> list[AssessmentExportRow]:
"""Sort exported rows for stable downloads across runs/configs."""
export_rows.sort(
key=lambda row: (
row.config_version or 0,
row.row_id,
row.run_id,
)
)
return export_rows

⚠️ Potential issue | 🟠 Major

row_id is sorted lexicographically — "row_10" comes before "row_2".

row_id is the string f"row_{idx}" (see processing.py parsing and load_export_rows_for_run at line 406). Lexicographic ordering will place row_10 before row_2, which scrambles CSV/XLSX/JSON exports for any dataset with ≥10 rows.

♻️ Suggested fix — natural sort on the index
+def _row_id_index(row_id: str) -> int:
+    if isinstance(row_id, str) and row_id.startswith("row_"):
+        try:
+            return int(row_id.split("_", 1)[1])
+        except (ValueError, IndexError):
+            return 0
+    return 0
+
 def sort_export_rows(
     export_rows: list[AssessmentExportRow],
 ) -> list[AssessmentExportRow]:
     """Sort exported rows for stable downloads across runs/configs."""
     export_rows.sort(
         key=lambda row: (
             row.config_version or 0,
-            row.row_id,
+            _row_id_index(row.row_id),
             row.run_id,
         )
     )
     return export_rows
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/app/assessment/utils/export.py` around lines 422 - 433, The
sort_export_rows function currently sorts by row.row_id lexicographically which
makes "row_10" come before "row_2"; update sort_export_rows (and any callers
expecting stable order like load_export_rows_for_run / AssessmentExportRow) to
extract the numeric index from row.row_id (e.g. parse the suffix after "row_")
and use that integer in the sort key instead of the raw string, keeping the
existing tie-breakers (row.config_version or 0, row.run_id); ensure the parser
handles missing/invalid suffixes by falling back to 0 so sorting remains stable.


@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 7f0dd174-276b-4d14-9951-73c8a226e4b3

📥 Commits

Reviewing files that changed from the base of the PR and between f2e512e and 0c04860.

📒 Files selected for processing (3)
  • backend/app/assessment/batch.py
  • backend/app/assessment/dataset.py
  • backend/app/models/llm/request.py
✅ Files skipped from review due to trivial changes (1)
  • backend/app/assessment/batch.py
🚧 Files skipped from review as they are similar to previous changes (1)
  • backend/app/models/llm/request.py

Comment on lines +22 to +26
_MIME_TYPES = {
".csv": "text/csv",
".xlsx": "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
".xls": "application/vnd.ms-excel",
}

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Verify whether .xls is accepted upstream and where parsing paths are defined.
rg -n --type=py -C3 'validate_dataset_file|allowed.*(xls|xlsx|csv)|\.xls|_count_rows\(|_count_excel_rows\(|openpyxl\.load_workbook'

Repository: ProjectTech4DevAI/kaapi-backend

Length of output: 6119


🌐 Web query:

Does openpyxl.load_workbook support legacy ".xls" files, or only ".xlsx/.xlsm/.xltx/.xltm" formats?

💡 Result:

No, openpyxl.load_workbook does not support legacy ".xls" files. It only supports Office Open XML formats: ".xlsx", ".xlsm", ".xltx", and ".xltm". Attempting to load a ".xls" file raises an InvalidFileException with a message recommending xlrd or conversion to .xlsx.

Citations:


🏁 Script executed:

# Check the full _count_excel_rows function in dataset.py
sed -n '78,102p' backend/app/assessment/dataset.py

Repository: ProjectTech4DevAI/kaapi-backend

Length of output: 823


🏁 Script executed:

# Check the full _parse_excel_rows function in batch.py
sed -n '124,150p' backend/app/assessment/batch.py

Repository: ProjectTech4DevAI/kaapi-backend

Length of output: 879


🏁 Script executed:

# Also check if there's a broader exception handler after _count_rows call
rg -n -A20 'def _count_rows' backend/app/assessment/dataset.py

Repository: ProjectTech4DevAI/kaapi-backend

Length of output: 822


.xls is routed to an openpyxl parser that does not support the legacy Excel format.

Upstream validation in validators.py:15 accepts .xls files, which are then routed to openpyxl-based parsing. Since openpyxl only supports Office Open XML formats (.xlsx, .xlsm, etc.), not legacy .xls, this fails:

  • In dataset.py: The broad except Exception as e: handler (line 98-100) catches the error silently and returns 0 rows, corrupting dataset metadata and causing downstream failures.
  • In batch.py: The same openpyxl call (line 127) has no exception handling, causing an unhandled InvalidFileException.

Also applies to: backend/app/assessment/batch.py:103-105 and backend/app/assessment/batch.py:127

Proposed fix (reject legacy .xls at validation layer)
 _MIME_TYPES = {
     ".csv": "text/csv",
     ".xlsx": "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
-    ".xls": "application/vnd.ms-excel",
 }
@@
 def _count_rows(content: bytes, file_ext: str) -> int:
     """Count data rows in a file (CSV or Excel), excluding the header."""
-    if file_ext in (".xlsx", ".xls"):
+    if file_ext == ".xlsx":
         return _count_excel_rows(content)
+    if file_ext == ".xls":
+        raise HTTPException(
+            status_code=422,
+            detail="Legacy .xls files are not supported. Please upload .xlsx or .csv.",
+        )
     return _count_csv_rows(content)

Also remove .xls from ALLOWED_EXTENSIONS in validators.py:15.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/app/assessment/dataset.py` around lines 22 - 26, Remove support for
legacy .xls by deleting the ".xls" entry from the _MIME_TYPES dict and from
ALLOWED_EXTENSIONS in validators.py so .xls files are rejected at validation (do
not route them to openpyxl). Also update the validation routine that checks
ALLOWED_EXTENSIONS to explicitly reject ".xls" (and return a clear validation
error). Finally, harden the parsing code that calls openpyxl (the read/parse
functions in dataset.py and batch.py where openpyxl is invoked) by replacing the
broad silent except that returns 0 rows with either re-raising the caught
InvalidFileException or logging the exception and returning a proper error
response so legacy-file parsing failures aren’t swallowed.


@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 4

🧹 Nitpick comments (7)
backend/app/assessment/events.py (2)

36-48: Tighten the payload type annotation.

dict without parameters loses information for callers and static analyzers. Since payloads are JSON-serialized and the type field is read via .get("type"), prefer dict[str, Any] to match the queue type and align with the Python 3.11+ typing convention used elsewhere in the codebase.

♻️ Proposed fix
-    def publish(self, payload: dict) -> None:
+    def publish(self, payload: dict[str, Any]) -> None:

Add the import at the top:

-from collections.abc import AsyncIterator
+from collections.abc import AsyncIterator
+from typing import Any

Optionally also update the queue annotation on Line 13 / 16 to asyncio.Queue[dict[str, Any]] for consistency.

As per coding guidelines: "Always add type hints to all function parameters and return values in Python code" and "Use Python 3.11+ with type hints throughout the codebase".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/app/assessment/events.py` around lines 36 - 48, Update the publish
method signature to use a precise payload type: change payload: dict to payload:
dict[str, Any] and add from typing import Any to the module imports; also update
any asyncio.Queue annotations used for subscribers (e.g.,
asyncio.Queue[dict[str, Any]] or the _subscribers collection element type) so
the queue and publish agree on dict[str, Any] payloads, keeping the logic
unchanged.

24-24: Use builtin TimeoutError instead of the deprecated asyncio.TimeoutError alias.

Since Python 3.11, asyncio.TimeoutError is a deprecated alias for the builtin TimeoutError (per Ruff UP041). Given the codebase targets Python 3.11+, prefer the builtin.

♻️ Proposed fix
-                except asyncio.TimeoutError:
+                except TimeoutError:
                     yield ": keep-alive\n\n"
                     continue
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/app/assessment/events.py` at line 24, Replace the deprecated
asyncio.TimeoutError alias with the builtin TimeoutError in the exception
handler: change the except clause currently written as "except
asyncio.TimeoutError" in backend/app/assessment/events.py to "except
TimeoutError" and remove any now-unused reference to asyncio if that's the only
usage in this module; ensure the surrounding function/method (the exception
handling block in events.py) still behaves the same.
backend/app/tests/api/routes/test_cron.py (1)

28-47: Consider asserting the new assessment merge behavior and aligning the mock shape with the production contract.

The three updated tests patch poll_all_pending_assessment_evaluations but never assert on the new behavior the cron handler adds (merging assessment counts into the totals and exposing result["assessment"]). With all assessment counts mocked to 0, the existing total_* assertions would still pass even if the merge logic in evaluation_cron_job regressed (e.g., assessment counts ignored, key renamed, exception swallowed into assessment_error). Adding at least one case where assessment counts are non-zero would actually exercise the new code path this PR is testing.

Additionally, the real poll_all_pending_assessment_evaluations always returns total and details alongside the three counters (see backend/app/assessment/cron.py lines 189-195). The mock omitting those fields drifts from the production contract; if a future change reads assessment["details"] or assessment["total"], these tests will silently keep passing.

♻️ Suggested tightening (apply to one or more scenarios)
     with patch(
         "app.api.routes.cron.process_all_pending_evaluations",
         new=AsyncMock(return_value=mock_result),
     ), patch(
         "app.assessment.cron.poll_all_pending_assessment_evaluations",
         new=AsyncMock(
-            return_value={"processed": 0, "failed": 0, "still_processing": 0}
+            return_value={
+                "total": 2,
+                "processed": 1,
+                "failed": 1,
+                "still_processing": 0,
+                "details": [],
+            }
         ),
     ):
         response = client.get(
             f"{settings.API_V1_STR}/cron/evaluations",
             headers={"X-API-KEY": superuser_api_key.key},
         )

     assert response.status_code == 200
     data = response.json()
     assert data["status"] == "success"
-    assert data["total_processed"] == 5
-    assert data["total_failed"] == 0
-    assert data["total_still_processing"] == 1
+    # Verify assessment counts are merged into the aggregated totals
+    assert data["total_processed"] == 5 + 1
+    assert data["total_failed"] == 0 + 1
+    assert data["total_still_processing"] == 1 + 0
+    assert data["assessment"]["processed"] == 1
+    assert data["assessment"]["failed"] == 1
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/app/tests/api/routes/test_cron.py` around lines 28 - 47, Update the
tests that patch poll_all_pending_assessment_evaluations to return a realistic
shape (include "processed", "failed", "still_processing" with at least one
non-zero count plus the production fields "total" and "details") and then assert
that evaluation_cron_job/cron handler merges those assessment counts into the
response (check response.json()["assessment"] exists and that
total_processed/total_failed/total_still_processing include the assessment
counts). Ensure the mock uses AsyncMock(return_value=...) for
poll_all_pending_assessment_evaluations and add assertions referencing the
merged totals and the assessment sub-dictionary.
backend/app/assessment/crud.py (1)

1-12: CRUD module should be placed under backend/app/crud/ per project convention.

This file is located at backend/app/assessment/crud.py, but the project convention places CRUD modules in backend/app/crud/ (as noted in learnings and confirmed by the existing backend/app/crud/ directory). If your team is intentionally adopting feature-folder organization for the assessment subsystem, document this exception in CLAUDE.md or contributor guidelines to keep the convention consistent for reviewers.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/app/assessment/crud.py` around lines 1 - 12, The CRUD module for
Assessment is in the wrong location; move backend/app/assessment/crud.py into
the shared CRUD package under backend/app/crud/ (keep filename crud.py) so that
the Assessment and AssessmentRun operations live alongside other CRUD modules;
update any imports that reference app.assessment.crud to app.crud.crud (or
equivalent) and ensure logger = logging.getLogger(__name__) and references to
Assessment and AssessmentRun continue to resolve; if you intentionally want
feature-folder organization instead, add a clear exception note to CLAUDE.md or
the contributor guidelines describing why assessment CRUD lives under
app/assessment and list the impacted symbols (Assessment, AssessmentRun,
functions in crud.py) so reviewers are aware.
backend/app/assessment/mappers.py (3)

117-118: Dead branch — "null" is no longer a valid summary value.

TextLLMParams.summary now only accepts "auto" | "detailed" | "concise" | None. The summary == "null" check can never be True for validated params and only adds confusion. Drop the special-case.

♻️ Suggested simplification
-        if summary is not None:
-            reasoning_payload["summary"] = None if summary == "null" else summary
+        if summary is not None:
+            reasoning_payload["summary"] = summary
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/app/assessment/mappers.py` around lines 117 - 118, The `"null"`
branch is dead because TextLLMParams.summary no longer allows "null"; inside the
block that checks `if summary is not None` (where `reasoning_payload` and
`summary` are used) remove the `summary == "null"` special-case and assign the
`summary` value directly (e.g., set `reasoning_payload["summary"]` to
`summary`), leaving None handling as-is; update any related comments if present.

96-98: Implicit aliasing of reasoning → effort is non-obvious.

effort = kaapi_params.get("effort") or reasoning quietly promotes TextLLMParams.reasoning (Literal low/medium/high) into the OpenAI reasoning.effort slot when effort isn't set. The two fields have different declared enums and semantics (reasoning is described as “Reasoning configuration or instructions,” while effort is the OpenAI-specific reasoning effort). At minimum add a short comment explaining the fallback so future readers don’t conflate them; ideally, decide whether reasoning is a deprecated alias for effort and document/log it.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/app/assessment/mappers.py` around lines 96 - 98, The current fallback
"effort = kaapi_params.get('effort') or reasoning" silently reuses
TextLLMParams.reasoning for the OpenAI-specific reasoning.effort, which
conflates different enums/semantics; update the mapping in
backend/app/assessment/mappers.py to make this explicit by either (a) adding a
clear comment above the lines referencing kaapi_params explaining that reasoning
is used as a fallback for effort and why, or (b) preferably implement an
explicit conversion/logging path: if kaapi_params.get("effort") is missing but
reasoning is present, map/translate reasoning to the allowed effort values (or
log a deprecation/warning that reasoning is being used as an alias) before
assigning to effort; reference the symbols kaapi_params, reasoning, effort and
ensure any enum translation or warning message clarifies the intended semantics.

1-7: Unused logger.

logger = logging.getLogger(__name__) is defined but never used in this module. Either drop it, or replace the silent warnings.append(...) calls with structured logs prefixed with the function name (per the repo logging convention) so warnings are observable in the runtime, not only in the returned tuple.

As per coding guidelines: “Prefix all log messages with the function name in square brackets: logger.info(f"[function_name] Message ...")”.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/app/assessment/mappers.py` around lines 1 - 7, The module defines
logger = logging.getLogger(__name__) but never uses it; update the module to
either remove the unused logger variable or (preferred) replace the silent
warnings.append(...) calls in the functions in this file with structured logging
using that logger (e.g., logger.warning) and follow repo convention by prefixing
messages with the function name in square brackets (use the function's actual
name in the message), e.g., logger.warning(f"[function_name] descriptive
warning..."); keep the existing return tuple behavior but emit warnings via
logger instead of only appending to the warnings list so runtime logs are
observable.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@backend/app/assessment/crud.py`:
- Around line 270-289: Remove the dead "in_progress" branch from status checks:
anywhere the code computes processing_runs or inspects AssessmentRun.status
using r.status in {"processing", "in_progress"} (e.g., the processing_runs
calculation and the other conditional in the same module) change the set to just
"processing". Update the expressions that sum or filter runs (symbol:
processing_runs) and any related conditionals that pass values into
_determine_assessment_status so they only check for "processing" to eliminate
the unreachable "in_progress" case.

In `@backend/app/assessment/mappers.py`:
- Around line 203-213: Remove the silent passthrough of the unused "reasoning"
field (delete the assignment to google_params["reasoning"]) and instead emit
warnings when unsupported fields appear: check kaapi_params for "effort" and
"summary" and call the module logger's warning (e.g., logger.warning) to notify
users those params are ignored; preserve the existing handling for
thinking_level -> google_params["thinking_config"] and output_schema ->
google_params["output_schema"] (using _convert_json_schema_to_google) unchanged.
- Around line 95-108: The OpenAI mapper must validate the required model early:
check kaapi_params.get("model") (variable model) and if it's missing/None
immediately return the same error behavior used in the Google mapper (i.e.,
surface a validation error rather than continuing); do this before calling
litellm.supports_reasoning(f"openai/{model}") so you never call
supports_reasoning with None and so model will be present when building
openai_params. Locate the OpenAI mapper's use of model, support_reasoning, and
openai_params and add that early guard/return to mirror the Google mapper's
validation logic.

In `@backend/app/models/llm/request.py`:
- Around line 38-43: Update the Field description for the summary attribute in
the request model to remove the obsolete "null/" wording; change the text that
currently reads "Use null/None to disable." to something like "Use None to
disable." so callers aren’t misled into passing the string "null" — locate the
summary: Literal["auto", "detailed", "concise"] | None = Field(...) definition
and edit only the description string accordingly.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 259f6741-a7b6-438c-973c-4239e35527ce

📥 Commits

Reviewing files that changed from the base of the PR and between 0c04860 and 593eea3.

📒 Files selected for processing (7)
  • backend/app/assessment/crud.py
  • backend/app/assessment/events.py
  • backend/app/assessment/mappers.py
  • backend/app/assessment/routes.py
  • backend/app/models/__init__.py
  • backend/app/models/llm/request.py
  • backend/app/tests/api/routes/test_cron.py
✅ Files skipped from review due to trivial changes (2)
  • backend/app/models/__init__.py
  • backend/app/assessment/routes.py

Comment on lines +270 to +289
runs = list(session.exec(statement).all())

pending_runs = sum(1 for r in runs if r.status == "pending")
processing_runs = sum(1 for r in runs if r.status in {"processing", "in_progress"})
completed_runs = sum(1 for r in runs if r.status == "completed")
failed_runs = sum(1 for r in runs if r.status == "failed")
total_runs = len(runs)

assessment.total_runs = total_runs
assessment.pending_runs = pending_runs
assessment.processing_runs = processing_runs
assessment.completed_runs = completed_runs
assessment.failed_runs = failed_runs
assessment.status = _determine_assessment_status(
total_runs=total_runs,
pending_runs=pending_runs,
processing_runs=processing_runs,
completed_runs=completed_runs,
failed_runs=failed_runs,
)

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Find all writers/readers of AssessmentRun.status to confirm the canonical status vocabulary.
rg -nP --type=py -C2 '\b(AssessmentRun|run)\.status\s*=\s*' 
echo '---'
rg -nP --type=py -C1 '"(pending|processing|in_progress|completed|failed|submitted|queued)"'
echo '---'
# Any place comparing against in_progress?
rg -nP --type=py -C2 '"in_progress"'

Repository: ProjectTech4DevAI/kaapi-backend

Length of output: 50387


🏁 Script executed:

# Check AssessmentRun model definition for status field
rg -n "class AssessmentRun" backend/app/assessment/models.py -A 100 | head -150

Repository: ProjectTech4DevAI/kaapi-backend

Length of output: 5234


🏁 Script executed:

# Find all places where AssessmentRun.status is explicitly written (not just run.status)
rg -n "AssessmentRun.*status\s*=" backend/app/assessment/ --type=py

Repository: ProjectTech4DevAI/kaapi-backend

Length of output: 57


🏁 Script executed:

# Check update_assessment_run function and similar CRUD operations
rg -n "def.*assessment.*status|assessment.*status.*=" backend/app/assessment/crud.py -B2 -A5

Repository: ProjectTech4DevAI/kaapi-backend

Length of output: 1489


🏁 Script executed:

# Find all places in assessment module where status is assigned to any value
rg -n '\.status\s*=\s*"' backend/app/assessment/ --type=py

Repository: ProjectTech4DevAI/kaapi-backend

Length of output: 57


🏁 Script executed:

# Also check if "in_progress" string appears anywhere in assessment module
rg -n '"in_progress"' backend/app/assessment/ --type=py

Repository: ProjectTech4DevAI/kaapi-backend

Length of output: 311


🏁 Script executed:

# Find all status assignments in assessment module
rg -n 'status\s*=\s*"(pending|processing|completed|failed)"' backend/app/assessment/ --type=py

Repository: ProjectTech4DevAI/kaapi-backend

Length of output: 837


🏁 Script executed:

# Also check for status assignments via update function calls
rg -n 'update_assessment_run_status\|status=.*pending\|status=.*processing\|status=.*completed\|status=.*failed' backend/app/assessment/ --type=py | head -100

Repository: ProjectTech4DevAI/kaapi-backend

Length of output: 57


🏁 Script executed:

# Check for "completed" assignments for AssessmentRun status
rg -n 'status.*=.*"completed"' backend/app/assessment/ --type=py

Repository: ProjectTech4DevAI/kaapi-backend

Length of output: 386


🏁 Script executed:

# Check the update_assessment_run_status function signature and where it's called
rg -n 'update_assessment_run_status' backend/app/assessment/ --type=py -B2 -A10

Repository: ProjectTech4DevAI/kaapi-backend

Length of output: 11169


Remove the dead "in_progress" branch from status conditionals.

"in_progress" is never written to AssessmentRun.status—the column is documented and only assigned the canonical values "pending", "processing", "completed", and "failed". The conditionals at lines 273 and 98 that check for r.status in {"processing", "in_progress"} can be simplified to just "processing". This removes unnecessary dead code.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/app/assessment/crud.py` around lines 270 - 289, Remove the dead
"in_progress" branch from status checks: anywhere the code computes
processing_runs or inspects AssessmentRun.status using r.status in
{"processing", "in_progress"} (e.g., the processing_runs calculation and the
other conditional in the same module) change the set to just "processing".
Update the expressions that sum or filter runs (symbol: processing_runs) and any
related conditionals that pass values into _determine_assessment_status so they
only check for "processing" to eliminate the unreachable "in_progress" case.

Comment on lines +95 to +108
model = kaapi_params.get("model")
reasoning = kaapi_params.get("reasoning")
effort = kaapi_params.get("effort") or reasoning
summary = kaapi_params.get("summary")
temperature = kaapi_params.get("temperature")
top_p = kaapi_params.get("top_p")

instructions = normalize_llm_text(kaapi_params.get("instructions"))
knowledge_base_ids = kaapi_params.get("knowledge_base_ids")
max_num_results = kaapi_params.get("max_num_results")
response_format = kaapi_params.get("response_format")
output_schema = kaapi_params.get("output_schema")

support_reasoning = litellm.supports_reasoning(model=f"openai/{model}")

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🌐 Web query:

litellm supports_reasoning behavior when model name is invalid or contains "None"

💡 Result:

No, LiteLLM does not specifically support reasoning behavior when the model name is invalid or contains "None". LiteLLM's supports_reasoning(model) function checks if a model supports reasoning features by looking up the model's info in its internal registry (model_cost map or get_model_info), which contains a "supports_reasoning" flag set to true/false for known models that support reasoning (e.g., certain OpenAI o1/o3, Anthropic claude-3-7, DeepSeek models). For invalid model names (not in the registry), get_model_info returns None, and supports_reasoning treats this as False (does not support). There is no special handling or enabling of reasoning behavior for invalid models or those containing "None". If reasoning params like reasoning_effort are passed to unsupported models, LiteLLM raises UnsupportedParamsError unless drop_params=True.

Citations:


🏁 Script executed:

# First, locate and read the mappers.py file
fd -t f 'mappers.py' backend/

# Then examine the file structure and the specific lines mentioned
wc -l backend/app/assessment/mappers.py

Repository: ProjectTech4DevAI/kaapi-backend

Length of output: 230


🏁 Script executed:

# Read the relevant sections of the mappers.py file
# Start with lines 90-150 to see the OpenAI mapper and model handling
sed -n '90,150p' backend/app/assessment/mappers.py

Repository: ProjectTech4DevAI/kaapi-backend

Length of output: 2507


🏁 Script executed:

# Also check lines 175-190 to see the Google mapper's validation at line 182
sed -n '175,190p' backend/app/assessment/mappers.py

Repository: ProjectTech4DevAI/kaapi-backend

Length of output: 519


🏁 Script executed:

# Check the function signatures and full context of both mappers
sed -n '70,220p' backend/app/assessment/mappers.py

Repository: ProjectTech4DevAI/kaapi-backend

Length of output: 5488


Add early validation for required model parameter in OpenAI mapper to match Google mapper.

The OpenAI mapper passes an unvalidated model to litellm.supports_reasoning() at line 108. When model is None/missing, f"openai/{model}" becomes the string "openai/None", which litellm.supports_reasoning() treats as an unsupported model and returns False. The code then proceeds silently, with model omitted from openai_params due to the truthiness check at line 142.

The Google mapper has explicit validation (lines 184–185) that returns an error when model is missing. The OpenAI mapper should be symmetric:

Suggested fix
     model = kaapi_params.get("model")
+    if not model:
+        return {}, ["Missing required 'model' parameter"]
     reasoning = kaapi_params.get("reasoning")
     effort = kaapi_params.get("effort") or reasoning
     ...
     support_reasoning = litellm.supports_reasoning(model=f"openai/{model}")
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
model = kaapi_params.get("model")
reasoning = kaapi_params.get("reasoning")
effort = kaapi_params.get("effort") or reasoning
summary = kaapi_params.get("summary")
temperature = kaapi_params.get("temperature")
top_p = kaapi_params.get("top_p")
instructions = normalize_llm_text(kaapi_params.get("instructions"))
knowledge_base_ids = kaapi_params.get("knowledge_base_ids")
max_num_results = kaapi_params.get("max_num_results")
response_format = kaapi_params.get("response_format")
output_schema = kaapi_params.get("output_schema")
support_reasoning = litellm.supports_reasoning(model=f"openai/{model}")
model = kaapi_params.get("model")
if not model:
    return {}, ["Missing required 'model' parameter"]
reasoning = kaapi_params.get("reasoning")
effort = kaapi_params.get("effort") or reasoning
summary = kaapi_params.get("summary")
temperature = kaapi_params.get("temperature")
top_p = kaapi_params.get("top_p")
instructions = normalize_llm_text(kaapi_params.get("instructions"))
knowledge_base_ids = kaapi_params.get("knowledge_base_ids")
max_num_results = kaapi_params.get("max_num_results")
response_format = kaapi_params.get("response_format")
output_schema = kaapi_params.get("output_schema")
support_reasoning = litellm.supports_reasoning(model=f"openai/{model}")
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/app/assessment/mappers.py` around lines 95 - 108, The OpenAI mapper
must validate the required model early: check kaapi_params.get("model")
(variable model) and if it's missing/None immediately return the same error
behavior used in the Google mapper (i.e., surface a validation error rather than
continuing); do this before calling
litellm.supports_reasoning(f"openai/{model}") so you never call
supports_reasoning with None and so model will be present when building
openai_params. Locate the OpenAI mapper's use of model, support_reasoning, and
openai_params and add that early guard/return to mirror the Google mapper's
validation logic.

Comment on lines +203 to +213
thinking_level = kaapi_params.get("thinking_level")
if thinking_level:
    google_params["thinking_config"] = {"thinking_level": thinking_level}

reasoning = kaapi_params.get("reasoning")
if reasoning:
    google_params["reasoning"] = reasoning

output_schema = kaapi_params.get("output_schema")
if output_schema is not None:
    google_params["output_schema"] = _convert_json_schema_to_google(output_schema)

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Inspect how google_params keys (reasoning, output_schema, instructions, thinking_config) are consumed downstream.
fd -t f 'batch.py' backend/app/assessment | xargs -I{} rg -nP -C3 '\b(google_params|map_kaapi_to_google_params|output_schema|response_schema|response_mime_type|thinking_config|instructions)\b' {}
rg -nP -C3 '\bmap_kaapi_to_google_params\b'

Repository: ProjectTech4DevAI/kaapi-backend

Length of output: 13862


🏁 Script executed:

cat -n backend/app/assessment/mappers.py | head -250

Repository: ProjectTech4DevAI/kaapi-backend

Length of output: 9101


🏁 Script executed:

rg -n '\b(reasoning|effort|summary)\b' backend/app/assessment/batch.py

Repository: ProjectTech4DevAI/kaapi-backend

Length of output: 57


🏁 Script executed:

rg -nC5 'effort|summary' backend/app/assessment/mappers.py

Repository: ProjectTech4DevAI/kaapi-backend

Length of output: 1937


Remove silent reasoning passthrough and add warning for unsupported effort/summary fields.

The Google mapper has inconsistent parameter handling compared to the OpenAI mapper:

  1. reasoning is set but never consumed: Line 209 sets google_params["reasoning"], but batch.py lines 485-510 never extract or use it. The field is silently dropped without warning.

  2. effort/summary lack warning messages: Unlike the OpenAI mapper (lines 132–136), the Google mapper does not warn when effort or summary are present. These fields are silently ignored, causing data loss without user notification.

  3. output_schema and thinking_config are correctly handled: These are properly renamed downstream to responseSchema/responseMimeType (batch.py 505–508) and thinkingConfig (batch.py 502–504).

Add explicit warnings for unsupported parameters to match OpenAI mapper behavior and remove the reasoning field passthrough, which has no corresponding Gemini API surface.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/app/assessment/mappers.py` around lines 203 - 213, Remove the silent
passthrough of the unused "reasoning" field (delete the assignment to
google_params["reasoning"]) and instead emit warnings when unsupported fields
appear: check kaapi_params for "effort" and "summary" and call the module
logger's warning (e.g., logger.warning) to notify users those params are
ignored; preserve the existing handling for thinking_level ->
google_params["thinking_config"] and output_schema ->
google_params["output_schema"] (using _convert_json_schema_to_google) unchanged.

Comment on lines +38 to +43
summary: Literal["auto", "detailed", "concise"] | None = Field(
default=None,
description=(
"Model-specific reasoning summary preference. " "Use null/None to disable."
),
)

⚠️ Potential issue | 🟡 Minor

Description references obsolete "null" sentinel.

The summary Literal no longer accepts the string "null" (correctly fixed earlier). The description still says “Use null/None to disable,” which can mislead callers into passing the string "null" and getting a 422. Drop the null/ mention.

✏️ Suggested wording
     summary: Literal["auto", "detailed", "concise"] | None = Field(
         default=None,
         description=(
-            "Model-specific reasoning summary preference. " "Use null/None to disable."
+            "Model-specific reasoning summary preference. Use None to disable."
         ),
     )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/app/models/llm/request.py` around lines 38 - 43, Update the Field
description for the summary attribute in the request model to remove the
obsolete "null/" wording; change the text that currently reads "Use null/None to
disable." to something like "Use None to disable." so callers aren’t misled into
passing the string "null" — locate the summary: Literal["auto", "detailed",
"concise"] | None = Field(...) definition and edit only the description string
accordingly.

@vprashrex vprashrex changed the title Feature/assessment Assessment: Ai Assessment Pipeline Apr 28, 2026
@vprashrex vprashrex self-assigned this Apr 28, 2026
@vprashrex vprashrex added the enhancement New feature or request label Apr 28, 2026
@vprashrex vprashrex moved this to In Review in Kaapi-dev Apr 28, 2026
@vprashrex vprashrex linked an issue Apr 28, 2026 that may be closed by this pull request

Labels

enhancement New feature or request ready-for-review

Projects

Status: In Review

Development

Successfully merging this pull request may close these issues.

Assessment: AI assessment pipeline

1 participant