Evaluation #405
- Remove `project_id` parameter from `/evaluate` endpoint
- Update `get_provider_credential` calls to not require `project_id`
- Credentials now retrieved via API key authentication
- Clean up logging configuration and imports
- Fix linting errors and update type annotations
Walkthrough

Adds a full evaluation subsystem: dataset CSV upload/management, evaluation run lifecycle, batch provider abstractions with OpenAI batch integration, embedding-based scoring and Langfuse integration, cron-driven polling, DB models/migration, storage utilities, tests, docs, and small auth/endpoint adjustments.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant User
    participant API as Evaluation API
    participant DB as Database
    participant Storage as Object Store
    participant Langfuse
    participant OpenAI as OpenAI Batch
    participant Cron as Cron Worker

    User->>API: POST /upload_dataset (CSV)
    API->>Storage: (optional) upload CSV
    API->>Langfuse: create dataset & upload items
    API->>DB: store EvaluationDataset
    API-->>User: DatasetUploadResponse

    User->>API: POST /evaluations (dataset_id, config)
    API->>DB: create EvaluationRun (pending)
    API->>Langfuse: fetch dataset items
    API->>API: build JSONL
    API->>OpenAI: submit batch job
    API->>DB: update run with batch_job_id
    API-->>User: EvaluationRunPublic

    Cron->>API: GET /cron/evaluations
    API->>DB: query pending EvaluationRuns
    loop per pending run
        API->>OpenAI: poll batch status
        alt complete
            API->>OpenAI: download results
            API->>Storage: (optional) upload results
            API->>Langfuse: create traces
            API->>OpenAI: start embedding batch
            API->>DB: update run status/metrics
        else not complete
            API->>DB: no change
        end
    end
    API-->>Cron: summary
```

```mermaid
sequenceDiagram
    participant Embedding as Embedding Batch
    participant OpenAI
    participant Scorer
    participant Langfuse
    participant DB

    Embedding->>OpenAI: poll embedding batch status
    alt complete
        Embedding->>OpenAI: download embedding results
        Embedding->>Scorer: parse embeddings & pairings
        Scorer->>Scorer: compute cosine similarities (per-item)
        Scorer->>Scorer: aggregate mean/std
        Scorer->>Langfuse: update traces with scores
        Scorer->>DB: update EvaluationRun with scores
    else failed
        Embedding->>DB: mark run completed with error
    end
```
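The per-item cosine scoring and mean/std aggregation steps in the embedding flow above can be sketched as follows (function names here are illustrative, not taken from the PR's code):

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors; 0.0 for zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def aggregate_scores(pairs: list[tuple[list[float], list[float]]]) -> dict[str, float]:
    """Score each (model output, reference answer) embedding pair, then aggregate."""
    scores = [cosine_similarity(out, ref) for out, ref in pairs]
    mean = sum(scores) / len(scores)
    variance = sum((s - mean) ** 2 for s in scores) / len(scores)
    return {"mean": mean, "std": math.sqrt(variance), "count": len(scores)}
```

Each per-item score would then be attached to its Langfuse trace, with the aggregate written to the `EvaluationRun` row.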
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes
Codecov Report: ❌ Patch coverage is
Actionable comments posted: 0
♻️ Duplicate comments (5)
backend/app/models/batch_job.py (1)
75-80: Resolve forward references using TYPE_CHECKING guard.

The forward references to `Organization` and `Project` still trigger F821 lint errors despite the `# noqa` comments. As previously suggested, import these types under a `TYPE_CHECKING` guard to satisfy static analyzers without causing runtime circular imports.

Apply this diff:
```diff
-from typing import Any, Optional
+from typing import TYPE_CHECKING, Any, Optional
+
+if TYPE_CHECKING:
+    from app.models.organization import Organization
+    from app.models.project import Project

-    organization: Optional["Organization"] = Relationship(  # noqa: F821
+    organization: Optional["Organization"] = Relationship(
         back_populates="batch_jobs"
     )
-    project: Optional["Project"] = Relationship(
-        back_populates="batch_jobs"
-    )  # noqa: F821
+    project: Optional["Project"] = Relationship(back_populates="batch_jobs")
```

backend/app/models/evaluation.py (2)
119-124: Resolve forward references using TYPE_CHECKING guard.

The forward references to `Project` and `Organization` trigger F821 lint errors. Import these types under a `TYPE_CHECKING` guard to satisfy static analyzers.

Apply this diff:
```diff
-from typing import Any, Optional
+from typing import TYPE_CHECKING, Any, Optional
+
+if TYPE_CHECKING:
+    from app.models.batch_job import BatchJob
+    from app.models.organization import Organization
+    from app.models.project import Project

-    project: "Project" = Relationship(
-        back_populates="evaluation_datasets"
-    )  # noqa: F821
-    organization: "Organization" = Relationship(
-        back_populates="evaluation_datasets"
-    )  # noqa: F821
+    project: "Project" = Relationship(back_populates="evaluation_datasets")
+    organization: "Organization" = Relationship(back_populates="evaluation_datasets")
```
206-213: Use SQLField for timestamp columns to ensure persistence.

The `inserted_at` and `updated_at` fields are declared with Pydantic `Field`, which means SQLModel won't generate actual database columns. This breaks persistence for these timestamps.

Apply this diff:
```diff
     # Timestamps
-    inserted_at: datetime = Field(
-        default_factory=now,
-        description="The timestamp when the evaluation run was started",
-    )
-    updated_at: datetime = Field(
-        default_factory=now,
-        description="The timestamp when the evaluation run was last updated",
-    )
+    inserted_at: datetime = SQLField(
+        default_factory=now,
+        nullable=False,
+        description="The timestamp when the evaluation run was started",
+    )
+    updated_at: datetime = SQLField(
+        default_factory=now,
+        nullable=False,
+        description="The timestamp when the evaluation run was last updated",
+    )
```

backend/app/api/routes/evaluation.py (1)
634-662: Preserve caller's config when applying assistant defaults.

The current implementation replaces the entire `config` dictionary, which discards any additional keys provided by the caller (e.g., `include`, `tools`, custom parameters). This prevents legitimate API use cases where callers want to override or extend assistant configuration.

Merge assistant defaults into the caller's config instead:
```diff
-    config = {
-        "model": config.get("model", assistant.model),
-        "instructions": config.get("instructions", assistant.instructions),
-        "temperature": config.get("temperature", assistant.temperature),
-    }
+    # Preserve caller's config and apply assistant defaults where not provided
+    if not config.get("model"):
+        config["model"] = assistant.model
+    if not config.get("instructions"):
+        config["instructions"] = assistant.instructions
+    if "temperature" not in config and assistant.temperature is not None:
+        config["temperature"] = assistant.temperature

     # Add tools if vector stores are available
     vector_store_ids = config.get(
         "vector_store_ids", assistant.vector_store_ids or []
     )
     if vector_store_ids and len(vector_store_ids) > 0:
-        config["tools"] = [
-            {
-                "type": "file_search",
-                "vector_store_ids": vector_store_ids,
-            }
-        ]
+        # Add file_search tool if not already present
+        if not any(tool.get("type") == "file_search" for tool in config.get("tools", [])):
+            config.setdefault("tools", []).append({
+                "type": "file_search",
+                "vector_store_ids": vector_store_ids,
+            })
```

backend/app/crud/evaluations/processing.py (1)
742-747: Classify embedding outcomes correctly.

The action classification logic doesn't account for the `"embeddings_completed"` and `"embeddings_failed"` actions returned by `check_and_process_evaluation()`. These fall into the `else` branch and are incorrectly counted as `still_processing`.

Apply this diff:
```diff
-    if result["action"] == "processed":
+    action = result["action"]
+    if action in {"processed", "embeddings_completed"}:
         total_processed_count += 1
-    elif result["action"] == "failed":
+    elif action in {"failed", "embeddings_failed"}:
         total_failed_count += 1
     else:
         total_still_processing_count += 1
```
🧹 Nitpick comments (1)
backend/app/api/routes/evaluation.py (1)
195-238: Consider validating CSV headers before full file read.

Currently, the entire file is read at line 196, and then CSV headers are validated at lines 205-213. For better efficiency and security, consider validating the CSV headers first by reading only the first line, then reading the entire file if validation passes. This prevents processing large invalid files unnecessarily.
Example approach:
```python
# Read first line to validate headers
file.file.seek(0)
first_line = file.file.readline()
csv_reader = csv.DictReader(io.StringIO(first_line.decode("utf-8")))
csv_reader.fieldnames = [name.strip() for name in csv_reader.fieldnames]

if "question" not in csv_reader.fieldnames or "answer" not in csv_reader.fieldnames:
    raise HTTPException(
        status_code=422,
        detail=f"CSV must contain 'question' and 'answer' columns. "
        f"Found columns: {csv_reader.fieldnames}",
    )

# Now read the entire file
file.file.seek(0)
csv_content = await file.read()
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (6)
- backend/app/api/routes/evaluation.py (1 hunks)
- backend/app/celery/celery_app.py (1 hunks)
- backend/app/crud/evaluations/core.py (1 hunks)
- backend/app/crud/evaluations/processing.py (1 hunks)
- backend/app/models/batch_job.py (1 hunks)
- backend/app/models/evaluation.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- backend/app/celery/celery_app.py
🧰 Additional context used
📓 Path-based instructions (4)
**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
Use type hints in Python code (Python 3.11+ project)
Files:
- backend/app/crud/evaluations/core.py
- backend/app/api/routes/evaluation.py
- backend/app/models/evaluation.py
- backend/app/crud/evaluations/processing.py
- backend/app/models/batch_job.py
backend/app/crud/**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
Implement database access operations in backend/app/crud/
Files:
- backend/app/crud/evaluations/core.py
- backend/app/crud/evaluations/processing.py
backend/app/api/**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
Expose FastAPI REST endpoints under backend/app/api/ organized by domain
Files:
backend/app/api/routes/evaluation.py
backend/app/models/**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
Define SQLModel entities (database tables and domain objects) in backend/app/models/
Files:
- backend/app/models/evaluation.py
- backend/app/models/batch_job.py
🧠 Learnings (3)
📚 Learning: 2025-10-08T12:05:01.317Z
Learnt from: CR
Repo: ProjectTech4DevAI/ai-platform PR: 0
File: CLAUDE.md:0-0
Timestamp: 2025-10-08T12:05:01.317Z
Learning: Applies to backend/app/core/langfuse/**/*.py : Place Langfuse observability integration under backend/app/core/langfuse/
Applied to files:
backend/app/crud/evaluations/core.py
📚 Learning: 2025-10-08T12:05:01.317Z
Learnt from: CR
Repo: ProjectTech4DevAI/ai-platform PR: 0
File: CLAUDE.md:0-0
Timestamp: 2025-10-08T12:05:01.317Z
Learning: Applies to backend/app/models/**/*.py : Define SQLModel entities (database tables and domain objects) in backend/app/models/
Applied to files:
backend/app/models/evaluation.py
📚 Learning: 2025-10-08T12:05:01.317Z
Learnt from: CR
Repo: ProjectTech4DevAI/ai-platform PR: 0
File: CLAUDE.md:0-0
Timestamp: 2025-10-08T12:05:01.317Z
Learning: Applies to **/*.py : Use type hints in Python code (Python 3.11+ project)
Applied to files:
backend/app/models/batch_job.py
🧬 Code graph analysis (3)
backend/app/crud/evaluations/core.py (5)
- backend/app/core/util.py (1): `configure_langfuse` (42-73)
- backend/app/crud/credentials.py (1): `get_provider_credential` (121-159)
- backend/app/models/evaluation.py (2): `EvaluationRun` (130-230), `DatasetUploadResponse` (19-38)
- backend/app/models/user.py (1): `UserProjectOrg` (58-59)
- backend/app/core/langfuse/langfuse.py (1): `flush` (108-109)
backend/app/api/routes/evaluation.py (10)
- backend/app/core/cloud/storage.py (1): `get_cloud_storage` (262-279)
- backend/app/core/util.py (2): `configure_langfuse` (42-73), `configure_openai` (76-95)
- backend/app/crud/assistants.py (1): `get_assistant_by_id` (19-30)
- backend/app/crud/credentials.py (1): `get_provider_credential` (121-159)
- backend/app/crud/evaluations/dataset.py (5): `create_evaluation_dataset` (31-105), `get_dataset_by_id` (108-143), `list_datasets` (179-215), `upload_csv_to_object_store` (218-250), `delete_dataset` (325-387)
- backend/app/crud/evaluations/core.py (3): `create_evaluation_run` (140-182), `get_evaluation_run_by_id` (224-262), `list_evaluation_runs` (185-221)
- backend/app/crud/evaluations/batch.py (1): `start_evaluation_batch` (114-202)
- backend/app/core/storage_utils.py (1): `upload_csv_to_object_store` (22-83)
- backend/app/crud/evaluations/langfuse.py (1): `upload_dataset_to_langfuse_from_csv` (165-277)
- backend/app/models/evaluation.py (2): `DatasetUploadResponse` (19-38), `EvaluationRunPublic` (247-265)
backend/app/crud/evaluations/processing.py (10)
- backend/app/core/batch/openai_provider.py (1): `OpenAIBatchProvider` (14-267)
- backend/app/core/util.py (2): `configure_langfuse` (42-73), `configure_openai` (76-95)
- backend/app/crud/batch_job.py (1): `get_batch_job` (56-70)
- backend/app/crud/batch_operations.py (2): `upload_batch_results_to_object_store` (267-311), `poll_batch_status` (110-165)
- backend/app/crud/credentials.py (1): `get_provider_credential` (121-159)
- backend/app/crud/evaluations/batch.py (1): `fetch_dataset_items` (24-57)
- backend/app/crud/evaluations/core.py (1): `update_evaluation_run` (265-312)
- backend/app/crud/evaluations/embeddings.py (3): `calculate_average_similarity` (255-331), `parse_embedding_results` (132-218), `start_embedding_batch` (334-434)
- backend/app/crud/evaluations/langfuse.py (2): `create_langfuse_dataset_run` (18-109), `update_traces_with_cosine_scores` (112-162)
- backend/app/models/evaluation.py (1): `EvaluationRun` (130-230)
🪛 Ruff (0.14.3)
backend/app/models/evaluation.py
119-119: Undefined name Project
(F821)
122-122: Undefined name Organization
(F821)
217-217: Undefined name Organization
(F821)
backend/app/models/batch_job.py
78-78: Undefined name Project
(F821)
🔇 Additional comments (4)
backend/app/crud/evaluations/core.py (1)
1-311: LGTM! Well-structured CRUD operations.

The evaluation core CRUD module is well-implemented with:
- Comprehensive error handling in `upload_dataset_to_langfuse`
- Proper CSV validation and parsing
- Good logging throughout
- Proper access control checks in query functions
- Clean separation of concerns
backend/app/api/routes/evaluation.py (2)
165-193: Good security validation.

The file upload security checks are well-implemented:
- Extension validation against allowed list
- MIME type verification
- Size limit enforcement (1 MB)
- Empty file detection
339-385: Well-structured API endpoints.

The dataset and evaluation run management endpoints are well-implemented with:
- Proper authentication and authorization via `AuthContextDep`
- Appropriate HTTP status codes
- Good error messages
- Pagination support where needed
- Clear documentation in docstrings
Also applies to: 388-429, 432-472, 702-770
backend/app/crud/evaluations/processing.py (1)
45-284: Well-architected batch processing orchestration.

The evaluation processing module is well-designed with:
- Robust parsing logic that handles multiple output formats gracefully
- Comprehensive error handling throughout the processing pipeline
- Graceful degradation when optional steps fail (object store, Langfuse updates)
- Clear separation between evaluation and embedding batch processing
- Good logging with contextual prefixes for debugging
Also applies to: 287-409, 412-613
avirajsingh7 left a comment:
A few points to consider (now or later):

1. **Async usage.** I see extensive usage of async. While async provides concurrency benefits, it should be used carefully: if a route is async but the internal function calls are blocking/synchronous, it can actually degrade performance. Async is most useful when there are I/O-heavy operations (e.g., network calls, file access).
2. **Docstrings.** There are docstrings throughout the codebase, which is good in principle. However, some of them describe very obvious behavior and end up reducing readability. Docstrings are most valuable where behavior or intent is non-obvious; otherwise they add noise instead of clarity.
3. **Potential for OOP structure (optional / future consideration).** There's heavy use of standalone functions. Over time, grouping related logic into classes could improve structure, reusability, and readability. Not necessary right now, but something to consider if this module grows.
4. **Langfuse SDK version.** Upgrading to newer versions of the Langfuse SDK might be challenging given how extensively it is used.
5. **Magic strings and hardcoded dictionaries.** Strings used for `job_type`, `status`, `provider`, etc., are used directly in multiple places. This is okay for now, but it can lead to errors or inconsistencies; consider using enums. Replacing repeated dictionary-like structures with Pydantic models or enums could improve maintainability and reduce duplication.
```python
    auth_context: AuthContextDep,
    limit: int = 50,
    offset: int = 0,
) -> list[DatasetUploadResponse]:
```
For all Evals routes: instead of returning a dict, all responses should return a valid SQLModel class and should be in the standard format, wrapped in the `APIResponse` type.
backend/app/api/routes/evaluation.py (outdated)
```python
    List all datasets for the current organization and project.

    Args:
        limit: Maximum number of datasets to return (default 50, max 100)
```
Here, to enforce the limit you need to add something like the snippet below; also, can we use (`skip`, `limit`) instead of (`limit`, `offset`), like all other endpoints?

```python
skip: int = Query(0, ge=0, description="Number of records to skip"),
limit: int = Query(100, ge=1, le=100, description="Maximum records to return"),
```
backend/app/api/routes/evaluation.py (outdated)
| ) | ||
|
|
||
|
|
||
| @router.delete("/evaluations/datasets/{dataset_id}") |
See here: there is no response model, so the OpenAPI doc will not get updated.
backend/app/api/routes/evaluation.py (outdated)
```python
    return eval_run


@router.get("/evaluations/list", response_model=list[EvaluationRunPublic])
```
@AkhileshNegi We are fixing that endpoint; you can remove `list` from this one.
```python
async def get_evaluation_run_status(
    evaluation_id: int,
    _session: SessionDep,
    auth_context: AuthContextDep,
```
For all the routes, can you add a restriction like:

```python
dependencies=[Depends(require_permission(Permission.REQUIRE_PROJECT))],
```

Otherwise someone with a username and password can access the endpoint, and in that case the project object will be null. The above dependency ensures the project object is not null; refer to the api_keys route.
```python
    This module provides common functions for uploading various file types
    to cloud object storage, abstracting away provider-specific details.
    """
```
Can you please put this particular file in the storage directory, e.g. `storage/utils.py`?
Actionable comments posted: 1
♻️ Duplicate comments (6)
backend/app/alembic/versions/d5747495bd7c_create_evaluation_run_table.py (2)
20-20: Add return type hint to `upgrade()` function.

The function signature lacks a return type annotation, which violates the project's Python 3.11+ type hint requirements.
As per coding guidelines.
Apply this diff:
```diff
-def upgrade():
+def upgrade() -> None:
```
231-231: Add return type hint to `downgrade()` function.

The function signature lacks a return type annotation, which violates the project's Python 3.11+ type hint requirements.
As per coding guidelines.
Apply this diff:
```diff
-def downgrade():
+def downgrade() -> None:
```

backend/app/crud/evaluations/langfuse.py (1)
216-228: Guard against empty CSV to prevent TypeError.

When `csv.DictReader` processes an empty file, `fieldnames` is `None`, which causes the list comprehension at line 218 to raise `TypeError` before the validation at lines 221-228 can run. This turns a user error (empty file) into a 500 error.

Apply this diff:
```diff
 csv_text = csv_content.decode("utf-8")
 csv_reader = csv.DictReader(io.StringIO(csv_text))
-csv_reader.fieldnames = [name.strip() for name in csv_reader.fieldnames]
+
+fieldnames = csv_reader.fieldnames
+if not fieldnames:
+    raise ValueError(
+        "CSV must contain 'question' and 'answer' columns. Found columns: []"
+    )
+csv_reader.fieldnames = [name.strip() for name in fieldnames]

 # Validate CSV headers
```

backend/app/models/evaluation.py (1)
207-214: Use SQLField for timestamp columns to ensure DB persistence.

Lines 207-214 declare `inserted_at` and `updated_at` with Pydantic `Field`, which means SQLModel won't generate actual database columns for these fields. This breaks timestamp persistence for `EvaluationRun`. Notice that `EvaluationDataset` correctly uses `SQLField` for its timestamps at lines 120-121.

Apply this diff to fix the issue:
```diff
     # Timestamps
-    inserted_at: datetime = Field(
-        default_factory=now,
-        description="The timestamp when the evaluation run was started",
-    )
-    updated_at: datetime = Field(
-        default_factory=now,
-        description="The timestamp when the evaluation run was last updated",
-    )
+    inserted_at: datetime = SQLField(
+        default_factory=now,
+        nullable=False,
+        description="The timestamp when the evaluation run was started",
+    )
+    updated_at: datetime = SQLField(
+        default_factory=now,
+        nullable=False,
+        description="The timestamp when the evaluation run was last updated",
+    )
```

backend/app/api/routes/evaluation.py (1)
527-546: Merge assistant defaults into caller config instead of replacing it.

Lines 529-533 create a new config dict with only `model`, `instructions`, and `temperature`, which discards all other keys from the caller-provided config. This means important parameters like custom evaluation settings or other configuration values are silently dropped. The pattern should preserve the caller's config and only set defaults for missing keys.

Apply this diff to properly merge configurations:
```diff
-    # Build config from assistant (use provided config values to override
-    # if present)
-    config = {
-        "model": config.get("model", assistant.model),
-        "instructions": config.get("instructions", assistant.instructions),
-        "temperature": config.get("temperature", assistant.temperature),
-    }
+    # Merge assistant defaults into caller's config (preserve all caller keys)
+    config.setdefault("model", assistant.model)
+    config.setdefault("instructions", assistant.instructions)
+    if "temperature" not in config and assistant.temperature is not None:
+        config["temperature"] = assistant.temperature

     # Add tools if vector stores are available
-    vector_store_ids = config.get(
-        "vector_store_ids", assistant.vector_store_ids or []
-    )
+    vector_store_ids = config.get("vector_store_ids") or assistant.vector_store_ids or []
     if vector_store_ids and len(vector_store_ids) > 0:
-        config["tools"] = [
-            {
-                "type": "file_search",
-                "vector_store_ids": vector_store_ids,
-            }
-        ]
+        if "tools" not in config:
+            config["tools"] = []
+        # Only add file_search if not already present
+        if not any(t.get("type") == "file_search" for t in config["tools"]):
+            config["tools"].append({
+                "type": "file_search",
+                "vector_store_ids": vector_store_ids,
+            })
```

backend/app/crud/evaluations/processing.py (1)
763-768: Correctly classify embedding completion outcomes in summary counts.

Lines 763-768 only check for the `"processed"` and `"failed"` actions, causing the `"embeddings_completed"` and `"embeddings_failed"` actions (returned at lines 503 and 524) to fall into the else branch and be incorrectly counted as `still_processing`. This breaks the polling summary statistics.

Apply this diff to properly classify all outcomes:
```diff
-    if result["action"] == "processed":
+    action = result["action"]
+    if action in {"processed", "embeddings_completed"}:
         total_processed_count += 1
-    elif result["action"] == "failed":
+    elif action in {"failed", "embeddings_failed"}:
         total_failed_count += 1
     else:
         total_still_processing_count += 1
```
🧹 Nitpick comments (1)
backend/app/core/batch/provider_interface.py (1)
11-13: Consider more specific return types if provider responses are consistent.

While `dict[str, Any]` provides flexibility for provider-specific metadata, if the core return structure is consistent across providers (e.g., all return `provider_batch_id`, `provider_status`, etc.), consider using a TypedDict or a dataclass for better type safety and IDE support.

Example:
```python
from typing import TypedDict


class BatchCreationResult(TypedDict):
    provider_batch_id: str
    provider_file_id: str
    provider_status: str
    total_items: int
```

This is optional and can be deferred if provider responses vary significantly.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (22)
- backend/app/alembic/versions/d5747495bd7c_create_evaluation_run_table.py (1 hunks)
- backend/app/api/docs/evaluation/create_evaluation.md (1 hunks)
- backend/app/api/docs/evaluation/delete_dataset.md (1 hunks)
- backend/app/api/docs/evaluation/get_dataset.md (1 hunks)
- backend/app/api/docs/evaluation/get_evaluation.md (1 hunks)
- backend/app/api/docs/evaluation/list_datasets.md (1 hunks)
- backend/app/api/docs/evaluation/list_evaluations.md (1 hunks)
- backend/app/api/docs/evaluation/upload_dataset.md (1 hunks)
- backend/app/api/routes/cron.py (1 hunks)
- backend/app/api/routes/evaluation.py (1 hunks)
- backend/app/core/batch/openai_provider.py (1 hunks)
- backend/app/core/batch/provider_interface.py (1 hunks)
- backend/app/crud/batch_job.py (1 hunks)
- backend/app/crud/batch_operations.py (1 hunks)
- backend/app/crud/evaluations/batch.py (1 hunks)
- backend/app/crud/evaluations/dataset.py (1 hunks)
- backend/app/crud/evaluations/langfuse.py (1 hunks)
- backend/app/crud/evaluations/processing.py (1 hunks)
- backend/app/models/batch_job.py (1 hunks)
- backend/app/models/evaluation.py (1 hunks)
- backend/app/models/organization.py (1 hunks)
- backend/app/models/project.py (1 hunks)
✅ Files skipped from review due to trivial changes (2)
- backend/app/api/docs/evaluation/list_datasets.md
- backend/app/api/docs/evaluation/upload_dataset.md
🚧 Files skipped from review as they are similar to previous changes (1)
- backend/app/api/routes/cron.py
🧰 Additional context used
📓 Path-based instructions (5)
**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
Use type hints in Python code (Python 3.11+ project)
Files:
- backend/app/alembic/versions/d5747495bd7c_create_evaluation_run_table.py
- backend/app/models/evaluation.py
- backend/app/core/batch/openai_provider.py
- backend/app/crud/batch_operations.py
- backend/app/crud/evaluations/langfuse.py
- backend/app/core/batch/provider_interface.py
- backend/app/crud/evaluations/batch.py
- backend/app/models/organization.py
- backend/app/api/routes/evaluation.py
- backend/app/crud/batch_job.py
- backend/app/models/project.py
- backend/app/models/batch_job.py
- backend/app/crud/evaluations/dataset.py
- backend/app/crud/evaluations/processing.py
backend/app/models/**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
Define SQLModel entities (database tables and domain objects) in backend/app/models/
Files:
- backend/app/models/evaluation.py
- backend/app/models/organization.py
- backend/app/models/project.py
- backend/app/models/batch_job.py
backend/app/core/**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
Place core functionality (config, DB session, security, exceptions, middleware) in backend/app/core/
Files:
- backend/app/core/batch/openai_provider.py
- backend/app/core/batch/provider_interface.py
backend/app/crud/**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
Implement database access operations in backend/app/crud/
Files:
- backend/app/crud/batch_operations.py
- backend/app/crud/evaluations/langfuse.py
- backend/app/crud/evaluations/batch.py
- backend/app/crud/batch_job.py
- backend/app/crud/evaluations/dataset.py
- backend/app/crud/evaluations/processing.py
backend/app/api/**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
Expose FastAPI REST endpoints under backend/app/api/ organized by domain
Files:
backend/app/api/routes/evaluation.py
🧠 Learnings (5)
📚 Learning: 2025-10-08T12:05:01.317Z
Learnt from: CR
Repo: ProjectTech4DevAI/ai-platform PR: 0
File: CLAUDE.md:0-0
Timestamp: 2025-10-08T12:05:01.317Z
Learning: Applies to **/*.py : Use type hints in Python code (Python 3.11+ project)
Applied to files:
- backend/app/alembic/versions/d5747495bd7c_create_evaluation_run_table.py
- backend/app/models/project.py
- backend/app/models/batch_job.py
📚 Learning: 2025-10-08T12:05:01.317Z
Learnt from: CR
Repo: ProjectTech4DevAI/ai-platform PR: 0
File: CLAUDE.md:0-0
Timestamp: 2025-10-08T12:05:01.317Z
Learning: Applies to backend/app/models/**/*.py : Define SQLModel entities (database tables and domain objects) in backend/app/models/
Applied to files:
- backend/app/models/evaluation.py
- backend/app/api/routes/evaluation.py
- backend/app/models/batch_job.py
📚 Learning: 2025-10-08T12:05:01.317Z
Learnt from: CR
Repo: ProjectTech4DevAI/ai-platform PR: 0
File: CLAUDE.md:0-0
Timestamp: 2025-10-08T12:05:01.317Z
Learning: Applies to backend/app/core/langfuse/**/*.py : Place Langfuse observability integration under backend/app/core/langfuse/
Applied to files:
backend/app/crud/evaluations/langfuse.py
📚 Learning: 2025-10-08T12:05:01.317Z
Learnt from: CR
Repo: ProjectTech4DevAI/ai-platform PR: 0
File: CLAUDE.md:0-0
Timestamp: 2025-10-08T12:05:01.317Z
Learning: Applies to backend/app/services/response/**/*.py : Put OpenAI Responses API integration, conversation management, and job execution in backend/app/services/response/
Applied to files:
backend/app/api/routes/evaluation.py
📚 Learning: 2025-10-08T12:05:01.317Z
Learnt from: CR
Repo: ProjectTech4DevAI/ai-platform PR: 0
File: CLAUDE.md:0-0
Timestamp: 2025-10-08T12:05:01.317Z
Learning: Applies to backend/app/crud/**/*.py : Implement database access operations in backend/app/crud/
Applied to files:
backend/app/crud/batch_job.py
🧬 Code graph analysis (13)
backend/app/models/evaluation.py (3)
- backend/app/models/batch_job.py (1): `BatchJob` (15-80)
- backend/app/models/organization.py (1): `Organization` (34-54)
- backend/app/models/project.py (1): `Project` (38-66)
backend/app/core/batch/openai_provider.py (2)
- backend/app/core/batch/provider_interface.py (6): `BatchProvider` (7-105), `create_batch` (11-32), `upload_file` (75-89), `get_batch_status` (35-52), `download_batch_results` (55-72), `download_file` (92-105)
- backend/app/crud/batch_operations.py (1): `download_batch_results` (174-215)
backend/app/crud/batch_operations.py (6)
- backend/app/core/batch/provider_interface.py (4): `BatchProvider` (7-105), `create_batch` (11-32), `get_batch_status` (35-52), `download_batch_results` (55-72)
- backend/app/core/cloud/storage.py (1): `get_cloud_storage` (262-279)
- backend/app/core/storage_utils.py (1): `upload_jsonl_to_object_store` (86-152)
- backend/app/crud/batch_job.py (2): `create_batch_job` (13-56), `update_batch_job` (76-120)
- backend/app/models/batch_job.py (3): `BatchJob` (15-80), `BatchJobCreate` (83-97), `BatchJobUpdate` (100-109)
- backend/app/core/batch/openai_provider.py (3): `create_batch` (26-87), `get_batch_status` (89-141), `download_batch_results` (143-191)
backend/app/crud/evaluations/langfuse.py (2)
backend/app/api/routes/evaluation.py (1)
get_dataset(357-388)backend/app/core/langfuse/langfuse.py (1)
flush(108-109)
backend/app/core/batch/provider_interface.py (2)
backend/app/core/batch/openai_provider.py (5)
create_batch(26-87)get_batch_status(89-141)download_batch_results(143-191)upload_file(193-223)download_file(225-254)backend/app/crud/batch_operations.py (1)
download_batch_results(174-215)
backend/app/crud/evaluations/batch.py (3)
backend/app/core/batch/openai_provider.py (1)
OpenAIBatchProvider(14-254)backend/app/crud/batch_operations.py (1)
start_batch_job(20-111)backend/app/models/evaluation.py (1)
EvaluationRun(131-229)
backend/app/models/organization.py (5)
backend/app/models/collection.py (2)
now(49-51)Collection(13-36)backend/app/models/assistants.py (1)
Assistant(33-44)backend/app/models/credentials.py (1)
Credential(48-97)backend/app/models/openai_conversation.py (1)
OpenAIConversation(58-69)backend/app/models/project.py (1)
Project(38-66)
backend/app/api/routes/evaluation.py (11)
backend/app/core/cloud/storage.py (1)
get_cloud_storage(262-279)backend/app/core/util.py (2)
configure_langfuse(42-73)configure_openai(76-95)backend/app/crud/assistants.py (1)
get_assistant_by_id(19-30)backend/app/crud/credentials.py (1)
get_provider_credential(121-159)backend/app/crud/evaluations/dataset.py (5)
create_evaluation_dataset(31-104)get_dataset_by_id(107-140)list_datasets(175-210)upload_csv_to_object_store(213-245)delete_dataset(323-387)backend/app/crud/evaluations/core.py (3)
create_evaluation_run(140-182)get_evaluation_run_by_id(224-262)list_evaluation_runs(185-221)backend/app/crud/evaluations/batch.py (1)
start_evaluation_batch(118-212)backend/app/core/storage_utils.py (1)
upload_csv_to_object_store(22-83)backend/app/crud/evaluations/langfuse.py (1)
upload_dataset_to_langfuse_from_csv(181-297)backend/app/models/evaluation.py (2)
DatasetUploadResponse(24-43)EvaluationRunPublic(246-264)backend/app/utils.py (1)
load_description(248-253)
backend/app/crud/batch_job.py (1)
backend/app/models/batch_job.py (3)
BatchJob(15-80)BatchJobCreate(83-97)BatchJobUpdate(100-109)
backend/app/models/project.py (6)
backend/app/models/collection.py (2)
now(49-51)Collection(13-36)backend/app/models/assistants.py (1)
Assistant(33-44)backend/app/models/credentials.py (1)
Credential(48-97)backend/app/models/fine_tuning.py (1)
Fine_Tuning(55-90)backend/app/models/openai_conversation.py (1)
OpenAIConversation(58-69)backend/app/models/organization.py (1)
Organization(34-54)
backend/app/models/batch_job.py (2)
backend/app/models/organization.py (1)
Organization(34-54)backend/app/models/project.py (1)
Project(38-66)
backend/app/crud/evaluations/dataset.py (3)
backend/app/core/cloud/storage.py (1)
CloudStorage(114-142)backend/app/core/storage_utils.py (1)
generate_timestamped_filename(155-167)backend/app/models/evaluation.py (2)
EvaluationDataset(72-128)EvaluationRun(131-229)
backend/app/crud/evaluations/processing.py (10)
backend/app/core/batch/openai_provider.py (2)
OpenAIBatchProvider(14-254)download_batch_results(143-191)backend/app/core/util.py (2)
configure_langfuse(42-73)configure_openai(76-95)backend/app/crud/batch_job.py (1)
get_batch_job(59-73)backend/app/crud/batch_operations.py (3)
download_batch_results(174-215)upload_batch_results_to_object_store(283-330)poll_batch_status(114-171)backend/app/crud/credentials.py (1)
get_provider_credential(121-159)backend/app/crud/evaluations/batch.py (1)
fetch_dataset_items(24-59)backend/app/crud/evaluations/core.py (1)
update_evaluation_run(265-312)backend/app/crud/evaluations/embeddings.py (3)
calculate_average_similarity(255-331)parse_embedding_results(132-218)start_embedding_batch(334-434)backend/app/crud/evaluations/langfuse.py (2)
create_langfuse_dataset_run(18-122)update_traces_with_cosine_scores(125-178)backend/app/models/evaluation.py (1)
EvaluationRun(131-229)
🔇 Additional comments (10)
backend/app/models/project.py (1)
9-16: LGTM! Correct use of TYPE_CHECKING for circular import avoidance. The `TYPE_CHECKING` block properly imports related model types for use in relationship annotations (lines 51-66) without causing circular import issues at runtime. This is the standard Python pattern for forward references in type hints.
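As a general illustration of the pattern being praised here (module and class names are hypothetical, not the repo's actual models):

```python
from __future__ import annotations  # annotations become strings, resolved lazily

from typing import TYPE_CHECKING

if TYPE_CHECKING:
    # Evaluated only by static type checkers (mypy, pyright), never at runtime,
    # so importing the related model cannot trigger a circular import.
    from myapp.models.project import Project  # hypothetical module


class Organization:
    def __init__(self) -> None:
        # `Project` is a forward reference; it is not evaluated at runtime.
        self.projects: list[Project] = []
```

Because the guarded import never executes, both modules can reference each other's types freely while remaining independent at import time.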
backend/app/api/docs/evaluation/delete_dataset.md (1)
1-18: Documentation looks good. The delete dataset endpoint is well-documented with clear parameter descriptions, return values, and error responses. The note about CSV files remaining in object storage for audit purposes is a helpful detail.
backend/app/api/docs/evaluation/get_dataset.md (1)
1-22: LGTM! Comprehensive endpoint documentation. The documentation clearly describes the dataset retrieval endpoint, including all response fields and error cases. The inclusion of Langfuse integration details and object store URL is helpful for users.
backend/app/api/docs/evaluation/get_evaluation.md (1)
26-28: Good usage guidance for async polling. The documentation helpfully explains that evaluation is processed asynchronously and users should poll this endpoint to check status. The 60-second Celery Beat interval is a useful implementation detail.
backend/app/api/docs/evaluation/list_evaluations.md (1)
1-25: LGTM! Clear pagination and return structure. The documentation clearly describes the list endpoint with pagination parameters and comprehensive field descriptions. The ordering (most recent first) is an important detail for users.
backend/app/models/organization.py (1)
8-13: LGTM! Proper TYPE_CHECKING pattern. The `TYPE_CHECKING` block correctly imports model types used in relationship annotations (lines 40-54) while avoiding circular import issues at runtime. All imported types are actively used in the forward references.
backend/app/api/docs/evaluation/create_evaluation.md (2)
14-27: Request for clarification on config vs assistant_id interaction. The documentation shows that `config` is optional and `assistant_id` is optional, with an example where `config: {}` when using `assistant_id`. However, it's not entirely clear what happens if both are provided with conflicting values. Does `assistant_id` override `config`, or do they merge?

Consider adding a note clarifying the precedence/merge behavior when both `config` and `assistant_id` are provided.
1-12: Excellent end-to-end flow documentation. The step-by-step flow description (lines 3-9) and async processing notes are very helpful for developers integrating this API. The explanation of Celery Beat processing and polling guidance provides clear expectations.
backend/app/core/batch/provider_interface.py (2)
7-32: LGTM! Clean abstract interface. The `BatchProvider` abstract base class provides a well-documented interface for batch processing providers. The `create_batch` method signature and return structure are clearly specified.
1-105: Inconsistency between AI summary and implementation. The AI summary claims this interface defines 6 abstract methods including `cancel_batch`, but the actual implementation only contains 5 methods: `create_batch`, `get_batch_status`, `download_batch_results`, `upload_file`, and `download_file`. The `cancel_batch` method is not present.

Verify whether a `cancel_batch` method is needed for the batch provider interface. If batch cancellation is a required operation, it should be added to the interface. If not, the summary should be updated.
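For reference, the shape of such an interface can be sketched with `abc` (the method set mirrors the review; the bodies and `DummyProvider` are invented for illustration):

```python
from abc import ABC, abstractmethod
from typing import Any


class BatchProvider(ABC):
    """Abstract interface a batch-processing provider must implement."""

    @abstractmethod
    def create_batch(self, input_file_id: str, endpoint: str) -> dict[str, Any]:
        """Submit a batch job; return provider metadata including a batch id."""

    @abstractmethod
    def get_batch_status(self, batch_id: str) -> str:
        """Return the provider-side status string for a batch."""


class DummyProvider(BatchProvider):
    """Trivial in-memory implementation, useful in tests."""

    def create_batch(self, input_file_id: str, endpoint: str) -> dict[str, Any]:
        return {"id": "batch-1", "input_file_id": input_file_id, "endpoint": endpoint}

    def get_batch_status(self, batch_id: str) -> str:
        return "completed"
```

Instantiating `BatchProvider` directly raises `TypeError`; that is how the ABC enforces the contract, and adding `cancel_batch` as a sixth `@abstractmethod` would force every concrete provider to support cancellation.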
```python
def __init__(self, client: OpenAI):
    """
    Initialize the OpenAI batch provider.

    Args:
        client: Configured OpenAI client
    """
    self.client = client
```
Add return type hint to __init__() method.
The method signature is missing a return type annotation, which violates the project's Python 3.11+ type hint requirements.
As per coding guidelines.
Apply this diff:
```diff
- def __init__(self, client: OpenAI):
+ def __init__(self, client: OpenAI) -> None:
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```python
def __init__(self, client: OpenAI) -> None:
    """
    Initialize the OpenAI batch provider.

    Args:
        client: Configured OpenAI client
    """
    self.client = client
```
🤖 Prompt for AI Agents
In backend/app/core/batch/openai_provider.py around lines 17 to 24, the __init__
method is missing a return type annotation; update the signature to include an
explicit "-> None" return type (def __init__(self, client: OpenAI) -> None:) so
it complies with the project's Python 3.11+ type hint requirements and keep the
body unchanged.
Actionable comments posted: 0
🧹 Nitpick comments (1)
backend/app/models/__init__.py (1)
40-45: Move batch_job imports to maintain alphabetical ordering. The `batch_job` imports should be placed before the `collection` imports (line 9) to maintain the alphabetical ordering convention followed throughout this file.

Apply this refactor to maintain consistency:

```diff
+from .batch_job import (
+    BatchJob,
+    BatchJobCreate,
+    BatchJobPublic,
+    BatchJobUpdate,
+)
+
 from .collection import Collection, CollectionPublic
```

And remove the current placement at lines 40-45.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
⛔ Files ignored due to path filters (1)
`backend/uv.lock` is excluded by `!**/*.lock`
📒 Files selected for processing (3)
backend/app/api/main.py (2 hunks)
backend/app/models/__init__.py (1 hunks)
backend/pyproject.toml (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- backend/pyproject.toml
🧰 Additional context used
📓 Path-based instructions (3)
**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
Use type hints in Python code (Python 3.11+ project)
Files:
backend/app/models/__init__.py, backend/app/api/main.py
backend/app/models/**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
Define SQLModel entities (database tables and domain objects) in backend/app/models/
Files:
backend/app/models/__init__.py
backend/app/api/**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
Expose FastAPI REST endpoints under backend/app/api/ organized by domain
Files:
backend/app/api/main.py
🧬 Code graph analysis (1)
backend/app/models/__init__.py (2)
- backend/app/models/batch_job.py (4): BatchJob (15-80), BatchJobCreate (83-97), BatchJobPublic (112-129), BatchJobUpdate (100-109)
- backend/app/models/evaluation.py (6): EvaluationDataset (72-128), EvaluationDatasetCreate (267-284), EvaluationDatasetPublic (287-299), EvaluationRun (131-229), EvaluationRunCreate (232-243), EvaluationRunPublic (246-264)
🔇 Additional comments (3)
backend/app/api/main.py (2)
21-22: LGTM! Router imports are correctly placed. The new `cron` and `evaluation` router imports follow the existing alphabetical ordering pattern and are consistent with other router imports in the file.
35-35: LGTM! Router inclusions are properly integrated. The new routers are correctly included in alphabetical order, following the existing pattern for router registration.
Also applies to: 38-38
backend/app/models/__init__.py (1)
47-54: Evaluation model exports are complete and correctly positioned relative to fine_tuning. All model variants from `evaluation.py` are properly re-exported. The placement relative to `fine_tuning` follows alphabetical order. However, if `batch_job` imports are moved as suggested, these imports are already in the correct position.
Actionable comments posted: 1
♻️ Duplicate comments (2)
backend/app/api/routes/evaluation.py (1)
500-518: Preserve caller config when merging assistant defaults. The current implementation replaces the entire
`config` dict, discarding caller-provided keys like `include`, `tools`, or other OpenAI Responses API parameters. This breaks legitimate use cases where the caller wants to override or add fields beyond model/instructions/temperature. Apply this diff to merge assistant defaults without losing caller config:
```diff
-    config = {
-        "model": config.get("model", assistant.model),
-        "instructions": config.get("instructions", assistant.instructions),
-        "temperature": config.get("temperature", assistant.temperature),
-    }
-
-    # Add tools if vector stores are available
-    vector_store_ids = config.get(
-        "vector_store_ids", assistant.vector_store_ids or []
-    )
-    if vector_store_ids and len(vector_store_ids) > 0:
-        config["tools"] = [
-            {
-                "type": "file_search",
-                "vector_store_ids": vector_store_ids,
-            }
-        ]
+    # Merge assistant defaults into caller config without replacing it
+    config.setdefault("model", assistant.model)
+    config.setdefault("instructions", assistant.instructions)
+    if "temperature" not in config and assistant.temperature is not None:
+        config["temperature"] = assistant.temperature
+
+    # Add file_search tool if vector stores are available and not already in tools
+    vector_store_ids = config.get("vector_store_ids") or assistant.vector_store_ids or []
+    if vector_store_ids:
+        # Check if file_search tool already exists
+        existing_tools = config.get("tools", [])
+        has_file_search = any(
+            tool.get("type") == "file_search" for tool in existing_tools
+        )
+        if not has_file_search:
+            config.setdefault("tools", []).append({
+                "type": "file_search",
+                "vector_store_ids": vector_store_ids,
+            })
```

backend/app/crud/evaluations/processing.py (1)
733-739: Classify embedding outcomes correctly. When `check_and_process_evaluation()` returns `"embeddings_completed"` or `"embeddings_failed"`, the current code treats them as still processing. Completed embedding runs should increment `processed`, and failed embeddings should increment `failed`. Apply this diff to fix the classification:
```diff
-        if result["action"] == "processed":
+        action = result["action"]
+        if action in {"processed", "embeddings_completed"}:
             total_processed_count += 1
-        elif result["action"] == "failed":
+        elif action in {"failed", "embeddings_failed"}:
             total_failed_count += 1
         else:
             total_still_processing_count += 1
```

Based on learnings
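The proposed bucketing can be isolated as a small helper (names are illustrative, not the repo's code):

```python
def tally_actions(actions: list[str]) -> dict[str, int]:
    """Bucket per-run actions into processed / failed / still-processing totals."""
    counts = {"processed": 0, "failed": 0, "still_processing": 0}
    for action in actions:
        if action in {"processed", "embeddings_completed"}:
            counts["processed"] += 1
        elif action in {"failed", "embeddings_failed"}:
            counts["failed"] += 1
        else:
            counts["still_processing"] += 1
    return counts
```

Keeping the sets of terminal actions in one place makes it harder for a new action string to silently fall into the "still processing" bucket.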
🧹 Nitpick comments (1)
backend/app/crud/evaluations/processing.py (1)
477-477: Consider moving imports to module level. Importing `poll_batch_status` inside the function (also at line 551) is unconventional and makes the code harder to follow. While this pattern is sometimes used to avoid circular imports, consider restructuring your modules to eliminate circular dependencies instead. If circular imports are unavoidable, add a comment explaining why the import must be local:
```diff
 # Poll embedding batch status
 provider = OpenAIBatchProvider(client=openai_client)
+# Local import to avoid circular dependency with batch_operations
 from app.crud.batch_operations import poll_batch_status
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (6)
backend/app/api/routes/evaluation.py (1 hunks)
backend/app/api/routes/llm.py (1 hunks)
backend/app/crud/evaluations/core.py (1 hunks)
backend/app/crud/evaluations/processing.py (1 hunks)
backend/app/tests/api/routes/test_evaluation.py (1 hunks)
backend/app/utils.py (2 hunks)
🧰 Additional context used
📓 Path-based instructions (3)
**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
Use type hints in Python code (Python 3.11+ project)
Files:
backend/app/utils.py, backend/app/api/routes/llm.py, backend/app/crud/evaluations/processing.py, backend/app/tests/api/routes/test_evaluation.py, backend/app/crud/evaluations/core.py, backend/app/api/routes/evaluation.py
backend/app/api/**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
Expose FastAPI REST endpoints under backend/app/api/ organized by domain
Files:
backend/app/api/routes/llm.py, backend/app/api/routes/evaluation.py
backend/app/crud/**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
Implement database access operations in backend/app/crud/
Files:
backend/app/crud/evaluations/processing.py, backend/app/crud/evaluations/core.py
🧠 Learnings (4)
📚 Learning: 2025-10-08T12:05:01.317Z
Learnt from: CR
Repo: ProjectTech4DevAI/ai-platform PR: 0
File: CLAUDE.md:0-0
Timestamp: 2025-10-08T12:05:01.317Z
Learning: Applies to backend/app/core/langfuse/**/*.py : Place Langfuse observability integration under backend/app/core/langfuse/
Applied to files:
backend/app/utils.py
📚 Learning: 2025-10-08T12:05:01.317Z
Learnt from: CR
Repo: ProjectTech4DevAI/ai-platform PR: 0
File: CLAUDE.md:0-0
Timestamp: 2025-10-08T12:05:01.317Z
Learning: Applies to backend/app/services/response/**/*.py : Put OpenAI Responses API integration, conversation management, and job execution in backend/app/services/response/
Applied to files:
backend/app/api/routes/evaluation.py
📚 Learning: 2025-10-08T12:05:01.317Z
Learnt from: CR
Repo: ProjectTech4DevAI/ai-platform PR: 0
File: CLAUDE.md:0-0
Timestamp: 2025-10-08T12:05:01.317Z
Learning: Applies to backend/app/models/**/*.py : Define SQLModel entities (database tables and domain objects) in backend/app/models/
Applied to files:
backend/app/api/routes/evaluation.py
📚 Learning: 2025-10-08T12:05:01.317Z
Learnt from: CR
Repo: ProjectTech4DevAI/ai-platform PR: 0
File: CLAUDE.md:0-0
Timestamp: 2025-10-08T12:05:01.317Z
Learning: Applies to backend/app/core/finetune/**/*.py : Place fine-tuning utilities under backend/app/core/finetune/
Applied to files:
backend/app/api/routes/evaluation.py
🧬 Code graph analysis (4)
backend/app/crud/evaluations/processing.py (8)
- backend/app/core/batch/openai_provider.py (2): OpenAIBatchProvider (14-254), download_batch_results (143-191)
- backend/app/crud/batch_job.py (1): get_batch_job (59-73)
- backend/app/crud/batch_operations.py (3): download_batch_results (174-215), upload_batch_results_to_object_store (283-330), poll_batch_status (114-171)
- backend/app/crud/evaluations/batch.py (1): fetch_dataset_items (24-59)
- backend/app/crud/evaluations/core.py (1): update_evaluation_run (262-309)
- backend/app/crud/evaluations/embeddings.py (3): calculate_average_similarity (255-331), parse_embedding_results (132-218), start_embedding_batch (334-434)
- backend/app/crud/evaluations/langfuse.py (2): create_langfuse_dataset_run (18-122), update_traces_with_cosine_scores (125-178)
- backend/app/utils.py (2): get_langfuse_client (209-245), get_openai_client (176-206)
backend/app/tests/api/routes/test_evaluation.py (3)
- backend/app/crud/evaluations/batch.py (1): build_evaluation_jsonl (62-115)
- backend/app/models/evaluation.py (1): EvaluationDataset (72-128)
- backend/app/tests/conftest.py (2): user_api_key_header (90-92), db (28-45)
backend/app/crud/evaluations/core.py (6)
- backend/app/crud/credentials.py (1): get_provider_credential (121-159)
- backend/app/models/evaluation.py (2): EvaluationRun (131-229), DatasetUploadResponse (24-43)
- backend/app/models/user.py (1): UserProjectOrg (58-59)
- backend/app/utils.py (1): get_langfuse_client (209-245)
- backend/app/core/langfuse/langfuse.py (1): flush (108-109)
- backend/app/api/routes/evaluation.py (1): list_evaluation_runs (573-591)
backend/app/api/routes/evaluation.py (9)
- backend/app/core/cloud/storage.py (1): get_cloud_storage (262-279)
- backend/app/crud/assistants.py (1): get_assistant_by_id (19-30)
- backend/app/crud/evaluations/dataset.py (5): create_evaluation_dataset (31-104), get_dataset_by_id (107-140), list_datasets (175-210), upload_csv_to_object_store (213-245), delete_dataset (323-387)
- backend/app/crud/evaluations/core.py (3): create_evaluation_run (137-179), get_evaluation_run_by_id (221-259), list_evaluation_runs (182-218)
- backend/app/crud/evaluations/batch.py (1): start_evaluation_batch (118-212)
- backend/app/core/storage_utils.py (1): upload_csv_to_object_store (22-83)
- backend/app/crud/evaluations/langfuse.py (1): upload_dataset_to_langfuse_from_csv (181-297)
- backend/app/models/evaluation.py (2): DatasetUploadResponse (24-43), EvaluationRunPublic (246-264)
- backend/app/utils.py (4): get_langfuse_client (209-245), get_openai_client (176-206), load_description (288-293), _ (297-298)
🔇 Additional comments (10)
backend/app/utils.py (1)
209-245: LGTM! Enhanced error handling for Langfuse client. The changes enforce explicit credential validation (HTTP 400 for missing/incomplete credentials) and wrap client construction with proper error handling (HTTP 500 for construction failures). Logging includes relevant context for debugging.
backend/app/api/routes/llm.py (1)
38-58: LGTM! Correct alignment of function signature with implementation. The function performs no asynchronous operations (no `await` statements), so making it synchronous is appropriate. The `start_job` call queues work for background processing and returns immediately, consistent with the callback pattern.

backend/app/tests/api/routes/test_evaluation.py (1)
1-688: LGTM! Comprehensive test coverage. This test file provides excellent coverage of the evaluation subsystem:
- CSV validation and parsing (valid, invalid, empty rows)
- Duplication logic with boundary testing (0, 1, 5, 6)
- Error handling (Langfuse config failure, invalid CSV, missing auth)
- Batch evaluation endpoints
- JSONL building logic with various configurations
The tests use appropriate mocking for external dependencies and verify both API responses and database state.
backend/app/crud/evaluations/core.py (4)
17-134: LGTM! Comprehensive error handling in dataset upload. The function properly handles multiple failure scenarios:
- Langfuse client creation failure (caught and returned as error tuple)
- CSV validation (missing headers, empty items)
- Per-item upload failures (logged but processing continues)
- General exceptions (caught and returned as error)
The duplication logic correctly multiplies items and tracks metadata.
182-218: LGTM! Proper project scoping in list query. The query correctly filters by both `organization_id` and `project_id` (lines 204-205), ensuring users can only see evaluation runs within their project scope.
221-259: LGTM! Proper access control in get_by_id query. The query correctly scopes by `id`, `organization_id`, and `project_id` (lines 241-243), preventing cross-project data access.
262-309: LGTM! Consistent update pattern with automatic timestamp. The function provides selective field updates while always maintaining the `updated_at` timestamp (line 302), ensuring proper audit trail.

backend/app/api/routes/evaluation.py (3)
92-293: LGTM! Comprehensive validation and security checks. The endpoint implements multiple layers of validation:
- File extension and MIME type checks (lines 131-144)
- File size limits (1 MB max, lines 146-155)
- CSV header validation (lines 170-178)
- Empty file detection (lines 157-158)
Object store upload is optional (catches exceptions gracefully), while Langfuse upload is required (raises on failure). The 1 MB size limit makes reading the entire file into memory (line 161) acceptable.
296-336: LGTM! Proper pagination with enforced limits. The endpoint enforces a maximum limit of 100 (lines 308-309) to prevent excessive data retrieval, and properly scopes queries by organization and project.
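A minimal sketch of that clamping behavior (parameter names and the default page size are assumptions):

```python
def clamp_pagination(skip: int = 0, limit: int = 50, max_limit: int = 100) -> tuple[int, int]:
    """Normalize pagination inputs: non-negative offset, limit capped at max_limit."""
    skip = max(skip, 0)
    limit = min(max(limit, 1), max_limit)
    return skip, limit
```

Clamping rather than rejecting out-of-range values keeps the endpoint forgiving while still bounding query cost.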
568-627: LGTM! Consistent access control in list and get endpoints. Both endpoints properly scope queries by organization and project using `AuthContextDep`, ensuring users can only access their own evaluation runs.
Actionable comments posted: 2
♻️ Duplicate comments (6)
backend/app/api/routes/cron.py (1)
52-64: Raise HTTPException instead of returning HTTP 200 on error. This exception handler logs the error but returns a JSON payload with HTTP 200 status. Clients (including cron monitoring) will interpret this as success and won't alert on actual failures. You must raise an `HTTPException` with a 5xx status code so the HTTP response properly reflects the error condition. This issue was flagged in a previous review but remains unresolved.
Apply this diff to fix:
```diff
 except Exception as e:
     logger.error(
         f"[evaluation_cron_job] Error executing cron job: {e}",
         exc_info=True,
     )
-    return {
-        "status": "error",
-        "error": str(e),
-        "organizations_processed": 0,
-        "total_processed": 0,
-        "total_failed": 0,
-        "total_still_processing": 0,
-    }
+    raise HTTPException(
+        status_code=500,
+        detail=f"Cron job failed: {str(e)}",
+    )
```

backend/app/alembic/versions/6fe772038a5a_create_evaluation_run_table.py (2)
21-21: Add return type hint for upgrade. Please declare the signature as `def upgrade() -> None:` to satisfy the project's Python 3.11+ typing rule. As per coding guidelines.
232-232: Add return type hint for downgrade. Please update the signature to `def downgrade() -> None:` in line with the typing guideline. As per coding guidelines.

backend/app/core/batch/openai.py (1)
17-24: Add the missing `__init__` return annotation. Project guidelines require explicit return types, and the prior review already called this out. Please add `-> None` so the constructor stays aligned with our Python 3.11+ typing policy. Apply this diff:
```diff
-    def __init__(self, client: OpenAI):
+    def __init__(self, client: OpenAI) -> None:
```

backend/app/crud/evaluations/embeddings.py (1)
92-109: Don’t discard falsy-but-valid outputs.
`if not generated_output or not ground_truth` still skips legitimate zero/False answers, so whole embedding jobs can empty out again. We already called this out earlier; please tighten the guard to only drop truly missing or blank values and coerce non-strings to strings before building the payload. Apply this diff:
```diff
-    generated_output = result.get("generated_output", "")
-    ground_truth = result.get("ground_truth", "")
+    generated_output = result.get("generated_output")
+    ground_truth = result.get("ground_truth")
@@
-    if not generated_output or not ground_truth:
-        logger.warning(f"Skipping item {item_id} - empty output or ground_truth")
-        continue
+    if generated_output is None or ground_truth is None:
+        logger.warning(
+            f"Skipping item {item_id} - output or ground_truth is missing"
+        )
+        continue
+
+    if isinstance(generated_output, str):
+        if not generated_output.strip():
+            logger.warning(
+                f"Skipping item {item_id} - empty generated_output string"
+            )
+            continue
+    else:
+        generated_output = str(generated_output)
+
+    if isinstance(ground_truth, str):
+        if not ground_truth.strip():
+            logger.warning(
+                f"Skipping item {item_id} - empty ground_truth string"
+            )
+            continue
+    else:
+        ground_truth = str(ground_truth)
```

backend/app/crud/evaluations/processing.py (1)
732-737: Count embedding outcomes instead of lumping them into “still processing”.
`check_and_process_evaluation()` can return `"embeddings_completed"` and `"embeddings_failed"`. The current else branch drops both into `total_still_processing_count`, so finished embedding runs vanish from the processed/failed totals and the summary is wrong. Please bucket those actions explicitly.

```diff
-    if result["action"] == "processed":
+    action = result["action"]
+    if action in {"processed", "embeddings_completed"}:
         total_processed_count += 1
-    elif result["action"] == "failed":
+    elif action in {"failed", "embeddings_failed"}:
         total_failed_count += 1
     else:
         total_still_processing_count += 1
```
🧹 Nitpick comments (1)
backend/app/api/routes/cron.py (1)
21-23: Consider a more specific return type. The return type `dict` is vague. Per the coding guidelines for Python type hints, consider `dict[str, Any]` or define a TypedDict with the expected keys (status, organizations_processed, total_processed, total_failed, total_still_processing, error).

```diff
 def evaluation_cron_job(
     session: SessionDep,
-) -> dict:
+) -> dict[str, Any]:
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (10)
backend/app/alembic/versions/6fe772038a5a_create_evaluation_run_table.py (1 hunks)
backend/app/api/routes/cron.py (1 hunks)
backend/app/core/batch/__init__.py (1 hunks)
backend/app/core/batch/base.py (1 hunks)
backend/app/core/batch/openai.py (1 hunks)
backend/app/crud/batch_operations.py (1 hunks)
backend/app/crud/evaluations/batch.py (1 hunks)
backend/app/crud/evaluations/core.py (1 hunks)
backend/app/crud/evaluations/embeddings.py (1 hunks)
backend/app/crud/evaluations/processing.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- backend/app/core/batch/__init__.py
🧰 Additional context used
📓 Path-based instructions (4)
**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
Use type hints in Python code (Python 3.11+ project)
Files:
backend/app/alembic/versions/6fe772038a5a_create_evaluation_run_table.py, backend/app/crud/batch_operations.py, backend/app/core/batch/base.py, backend/app/crud/evaluations/core.py, backend/app/crud/evaluations/batch.py, backend/app/core/batch/openai.py, backend/app/api/routes/cron.py, backend/app/crud/evaluations/processing.py, backend/app/crud/evaluations/embeddings.py
backend/app/crud/**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
Implement database access operations in backend/app/crud/
Files:
backend/app/crud/batch_operations.py, backend/app/crud/evaluations/core.py, backend/app/crud/evaluations/batch.py, backend/app/crud/evaluations/processing.py, backend/app/crud/evaluations/embeddings.py
backend/app/core/**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
Place core functionality (config, DB session, security, exceptions, middleware) in backend/app/core/
Files:
backend/app/core/batch/base.py, backend/app/core/batch/openai.py
backend/app/api/**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
Expose FastAPI REST endpoints under backend/app/api/ organized by domain
Files:
backend/app/api/routes/cron.py
🧠 Learnings (1)
📚 Learning: 2025-10-08T12:05:01.317Z
Learnt from: CR
Repo: ProjectTech4DevAI/ai-platform PR: 0
File: CLAUDE.md:0-0
Timestamp: 2025-10-08T12:05:01.317Z
Learning: Applies to **/*.py : Use type hints in Python code (Python 3.11+ project)
Applied to files:
backend/app/alembic/versions/6fe772038a5a_create_evaluation_run_table.py
🧬 Code graph analysis (8)
backend/app/crud/batch_operations.py (5)
- backend/app/core/batch/base.py (4): BatchProvider (7-105), create_batch (11-32), get_batch_status (35-52), download_batch_results (55-72)
- backend/app/core/cloud/storage.py (1): get_cloud_storage (262-279)
- backend/app/core/storage_utils.py (1): upload_jsonl_to_object_store (86-152)
- backend/app/crud/batch_job.py (2): create_batch_job (13-56), update_batch_job (76-120)
- backend/app/models/batch_job.py (3): BatchJob (15-80), BatchJobCreate (83-97), BatchJobUpdate (100-109)

backend/app/core/batch/base.py (2)
- backend/app/core/batch/openai.py (5): create_batch (26-87), get_batch_status (89-141), download_batch_results (143-191), upload_file (193-223), download_file (225-254)
- backend/app/crud/batch_operations.py (1): download_batch_results (130-156)

backend/app/crud/evaluations/core.py (5)
- backend/app/models/evaluation.py (2): EvaluationRun (131-229), DatasetUploadResponse (24-43)
- backend/app/models/user.py (1): UserProjectOrg (58-59)
- backend/app/utils.py (1): get_langfuse_client (209-245)
- backend/app/core/langfuse/langfuse.py (1): flush (108-109)
- backend/app/api/routes/evaluation.py (1): list_evaluation_runs (573-591)

backend/app/crud/evaluations/batch.py (3)
- backend/app/core/batch/openai.py (1): OpenAIBatchProvider (14-254)
- backend/app/crud/batch_operations.py (1): start_batch_job (20-86)
- backend/app/models/evaluation.py (1): EvaluationRun (131-229)

backend/app/core/batch/openai.py (2)
- backend/app/core/batch/base.py (6): BatchProvider (7-105), create_batch (11-32), upload_file (75-89), get_batch_status (35-52), download_batch_results (55-72), download_file (92-105)
- backend/app/crud/batch_operations.py (1): download_batch_results (130-156)

backend/app/api/routes/cron.py (2)
- backend/app/api/permissions.py (2): Permission (10-15), require_permission (45-70)
- backend/app/crud/evaluations/cron.py (1): process_all_pending_evaluations_sync (142-154)

backend/app/crud/evaluations/processing.py (8)
- backend/app/core/batch/openai.py (2): OpenAIBatchProvider (14-254), download_batch_results (143-191)
- backend/app/crud/batch_job.py (1): get_batch_job (59-73)
- backend/app/crud/batch_operations.py (3): download_batch_results (130-156), upload_batch_results_to_object_store (205-232), poll_batch_status (89-127)
- backend/app/crud/evaluations/batch.py (1): fetch_dataset_items (24-59)
- backend/app/crud/evaluations/core.py (1): update_evaluation_run (261-308)
- backend/app/crud/evaluations/embeddings.py (3): calculate_average_similarity (255-331), parse_embedding_results (132-218), start_embedding_batch (334-434)
- backend/app/crud/evaluations/langfuse.py (2): create_langfuse_dataset_run (18-122), update_traces_with_cosine_scores (125-178)
- backend/app/utils.py (2): get_langfuse_client (209-245), get_openai_client (176-206)

backend/app/crud/evaluations/embeddings.py (3)
- backend/app/core/batch/openai.py (1): OpenAIBatchProvider (14-254)
- backend/app/crud/batch_operations.py (1): start_batch_job (20-86)
- backend/app/models/evaluation.py (1): EvaluationRun (131-229)
🔇 Additional comments (1)
backend/app/api/routes/cron.py (1)
13-13: LGTM! Router initialization is correct with appropriate tagging.
```python
    "batch_job",
    sa.Column("id", sa.Integer(), nullable=False),
    sa.Column(
        "provider",
        sa.String(),
        nullable=False,
        comment="LLM provider name (e.g., 'openai', 'anthropic')",
    ),
    sa.Column(
        "job_type",
        sa.String(),
        nullable=False,
        comment="Type of batch job (e.g., 'evaluation', 'classification', 'embedding')",
    ),
    sa.Column(
        "config",
        postgresql.JSONB(astext_type=sa.Text()),
        nullable=False,
        server_default=sa.text("'{}'::jsonb"),
        comment="Complete batch configuration",
    ),
    sa.Column(
        "provider_batch_id",
        sa.String(),
        nullable=True,
        comment="Provider's batch job ID",
    ),
    sa.Column(
        "provider_file_id",
        sa.String(),
        nullable=True,
        comment="Provider's input file ID",
    ),
    sa.Column(
        "provider_output_file_id",
        sa.String(),
        nullable=True,
        comment="Provider's output file ID",
    ),
    sa.Column(
        "provider_status",
        sa.String(),
        nullable=True,
        comment="Provider-specific status (e.g., OpenAI: validating, in_progress, completed, failed)",
    ),
    sa.Column(
        "raw_output_url",
        sa.String(),
        nullable=True,
        comment="S3 URL of raw batch output file",
    ),
    sa.Column(
        "total_items",
        sa.Integer(),
        nullable=False,
        server_default=sa.text("0"),
        comment="Total number of items in the batch",
    ),
    sa.Column(
        "error_message",
        sa.Text(),
        nullable=True,
        comment="Error message if batch failed",
    ),
    sa.Column("organization_id", sa.Integer(), nullable=False),
    sa.Column("project_id", sa.Integer(), nullable=False),
    sa.Column("inserted_at", sa.DateTime(), nullable=False),
    sa.Column("updated_at", sa.DateTime(), nullable=False),
    sa.ForeignKeyConstraint(
        ["organization_id"], ["organization.id"], ondelete="CASCADE"
    ),
    sa.ForeignKeyConstraint(["project_id"], ["project.id"], ondelete="CASCADE"),
    sa.PrimaryKeyConstraint("id"),
)
op.create_index(
    op.f("ix_batch_job_job_type"), "batch_job", ["job_type"], unique=False
)
op.create_index(
    op.f("ix_batch_job_organization_id"),
    "batch_job",
    ["organization_id"],
    unique=False,
)
op.create_index(
    op.f("ix_batch_job_project_id"), "batch_job", ["project_id"], unique=False
)
op.create_index(
    "idx_batch_job_status_org",
    "batch_job",
    ["provider_status", "organization_id"],
    unique=False,
)
op.create_index(
    "idx_batch_job_status_project",
    "batch_job",
    ["provider_status", "project_id"],
    unique=False,
)
```

```python
# Create evaluation_dataset table
op.create_table(
    "evaluation_dataset",
    sa.Column("id", sa.Integer(), nullable=False),
    sa.Column("name", sqlmodel.sql.sqltypes.AutoString(), nullable=False),
    sa.Column("description", sqlmodel.sql.sqltypes.AutoString(), nullable=True),
    sa.Column(
        "dataset_metadata",
        postgresql.JSONB(astext_type=sa.Text()),
        nullable=False,
        server_default=sa.text("'{}'::jsonb"),
    ),
    sa.Column(
        "object_store_url", sqlmodel.sql.sqltypes.AutoString(), nullable=True
    ),
    sa.Column(
        "langfuse_dataset_id",
        sqlmodel.sql.sqltypes.AutoString(),
        nullable=True,
    ),
    sa.Column("organization_id", sa.Integer(), nullable=False),
    sa.Column("project_id", sa.Integer(), nullable=False),
    sa.Column("inserted_at", sa.DateTime(), nullable=False),
    sa.Column("updated_at", sa.DateTime(), nullable=False),
    sa.ForeignKeyConstraint(
        ["organization_id"], ["organization.id"], ondelete="CASCADE"
    ),
    sa.ForeignKeyConstraint(["project_id"], ["project.id"], ondelete="CASCADE"),
    sa.PrimaryKeyConstraint("id"),
    sa.UniqueConstraint(
        "name",
        "organization_id",
        "project_id",
        name="uq_evaluation_dataset_name_org_project",
    ),
)
op.create_index(
    op.f("ix_evaluation_dataset_name"),
    "evaluation_dataset",
    ["name"],
    unique=False,
)
```

```python
# Create evaluation_run table with all columns and foreign key references
op.create_table(
    "evaluation_run",
    sa.Column("run_name", sqlmodel.sql.sqltypes.AutoString(), nullable=False),
    sa.Column("dataset_name", sqlmodel.sql.sqltypes.AutoString(), nullable=False),
    sa.Column("config", sa.JSON(), nullable=False),
    sa.Column("batch_job_id", sa.Integer(), nullable=True),
    sa.Column(
        "embedding_batch_job_id",
        sa.Integer(),
        nullable=True,
        comment="Reference to the batch_job for embedding-based similarity scoring",
    ),
    sa.Column("dataset_id", sa.Integer(), nullable=False),
    sa.Column("status", sqlmodel.sql.sqltypes.AutoString(), nullable=False),
    sa.Column(
        "object_store_url", sqlmodel.sql.sqltypes.AutoString(), nullable=True
    ),
    sa.Column("total_items", sa.Integer(), nullable=False),
    sa.Column("score", sa.JSON(), nullable=True),
    sa.Column("error_message", sa.Text(), nullable=True),
    sa.Column("organization_id", sa.Integer(), nullable=False),
    sa.Column("project_id", sa.Integer(), nullable=False),
    sa.Column("id", sa.Integer(), nullable=False),
    sa.Column("inserted_at", sa.DateTime(), nullable=False),
    sa.Column("updated_at", sa.DateTime(), nullable=False),
    sa.ForeignKeyConstraint(
        ["batch_job_id"],
        ["batch_job.id"],
        ondelete="SET NULL",
    ),
    sa.ForeignKeyConstraint(
        ["embedding_batch_job_id"],
        ["batch_job.id"],
        name="fk_evaluation_run_embedding_batch_job_id",
        ondelete="SET NULL",
    ),
    sa.ForeignKeyConstraint(
        ["dataset_id"],
        ["evaluation_dataset.id"],
        name="fk_evaluation_run_dataset_id",
        ondelete="CASCADE",
    ),
    sa.ForeignKeyConstraint(
        ["organization_id"], ["organization.id"], ondelete="CASCADE"
    ),
    sa.ForeignKeyConstraint(["project_id"], ["project.id"], ondelete="CASCADE"),
    sa.PrimaryKeyConstraint("id"),
)
```
Fix primary keys to autoincrement.
batch_job, evaluation_dataset, and evaluation_run declare id as plain Integer NOT NULL with a separate PrimaryKeyConstraint. Postgres will not create a sequence/default in this setup, so inserts coming from SQLModel (which send id=None) will fail with NULL value in column "id" violates not-null constraint. Mark the columns as primary keys (or add sa.Identity()) so the database autogenerates IDs.
```diff
-    sa.Column("id", sa.Integer(), nullable=False),
+    sa.Column("id", sa.Integer(), primary_key=True),
 ...
-    sa.Column("id", sa.Integer(), nullable=False),
+    sa.Column("id", sa.Integer(), primary_key=True),
 ...
-    sa.Column("id", sa.Integer(), nullable=False),
+    sa.Column("id", sa.Integer(), primary_key=True),
```

🤖 Prompt for AI Agents
In backend/app/alembic/versions/6fe772038a5a_create_evaluation_run_table.py
around lines 24 to 214, the id columns for batch_job, evaluation_dataset, and
evaluation_run are defined as plain Integer with a separate PrimaryKeyConstraint
which prevents Postgres from creating sequences and causes NULL id insert
failures; update each table definition to make the id column a real primary key
with an autogenerating identity (either mark the id column as primary_key=True
in the sa.Column definition or add sa.Identity()/server_default identity
expression) so Postgres will auto-generate ids on insert, and remove or keep the
separate PrimaryKeyConstraint accordingly to avoid duplicate/conflicting PK
declarations.
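As a sanity check on the suggested fix, the `sa.Identity()` variant can be sketched with plain SQLAlchemy. This is a minimal illustration (trimmed to two columns, not the full migration): compiling the table for the Postgres dialect shows the generated DDL includes an identity clause, so inserts with `id=None` succeed.

```python
import sqlalchemy as sa
from sqlalchemy.dialects import postgresql
from sqlalchemy.schema import CreateTable

metadata = sa.MetaData()

# Minimal sketch of the fix: with sa.Identity(), Postgres autogenerates ids.
batch_job = sa.Table(
    "batch_job",
    metadata,
    sa.Column("id", sa.Integer(), sa.Identity(), primary_key=True),
    sa.Column("provider", sa.String(), nullable=False),
)

# Compile the CREATE TABLE statement for the Postgres dialect to inspect the DDL.
ddl = str(CreateTable(batch_job).compile(dialect=postgresql.dialect()))
print(ddl)
```

The emitted DDL contains `GENERATED BY DEFAULT AS IDENTITY` on the `id` column, which is exactly the server-side default the review comment says the migration is missing.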
```python
import logging

from app.api.permissions import Permission, require_permission
from fastapi import APIRouter, Depends
from sqlmodel import Session

from app.api.deps import SessionDep, AuthContextDep
from app.crud.evaluations import process_all_pending_evaluations_sync
from app.models import User

logger = logging.getLogger(__name__)
```
🛠️ Refactor suggestion | 🟠 Major
Remove unused imports and add HTTPException.
Three imports are unused:
`Session` from sqlmodel (line 5), `AuthContextDep` (line 7), and `User` (line 9)
Additionally, HTTPException is missing from the FastAPI import (line 4) but is required to properly fix the exception handling at lines 52-64.
Apply this diff:
```diff
 import logging
 from app.api.permissions import Permission, require_permission
-from fastapi import APIRouter, Depends
-from sqlmodel import Session
+from fastapi import APIRouter, Depends, HTTPException
-from app.api.deps import SessionDep, AuthContextDep
+from app.api.deps import SessionDep
 from app.crud.evaluations import process_all_pending_evaluations_sync
-from app.models import User
```

🤖 Prompt for AI Agents
In backend/app/api/routes/cron.py around lines 1 to 11, remove the unused
imports: Session (from sqlmodel), AuthContextDep, and User; and add
HTTPException to the FastAPI import list so exception handling later (lines
~52-64) can raise HTTPException properly; update the import line to include
HTTPException and delete the three unused import lines to keep imports clean.
Actionable comments posted: 1
♻️ Duplicate comments (1)
scripts/python/invoke-cron.py (1)
148-149: Validate the response body status field to avoid masking failures. The code logs "Endpoint invoked successfully" based only on the HTTP 200 status code without checking the response body. If the API returns `{"status": "error", ...}`, the failure will be masked and monitoring will miss it. Apply this diff to validate the response body:
```diff
                     result = await self.invoke_endpoint(client)
-                    logger.info(f"Endpoint invoked successfully: {result}")
+
+                    # Check response body status field
+                    status = result.get("status")
+                    if status != "success":
+                        error_msg = result.get("message", "Unknown error")
+                        logger.error(f"Endpoint returned failure status: {status}, message: {error_msg}, full response: {result}")
+                        # Continue to next iteration rather than crash the cron
+                    else:
+                        logger.info(f"Endpoint invoked successfully: {result}")
```
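The check the reviewer is asking for boils down to a small predicate on the response body. A standalone sketch (the function name and message key are illustrative, not the script's actual API):

```python
def is_successful(result: dict) -> bool:
    """Treat an HTTP 200 as a success only if the body also says so."""
    if result.get("status") != "success":
        # Surface the application-level failure instead of masking it.
        print(f"failure: {result.get('message', 'Unknown error')}")
        return False
    return True

print(is_successful({"status": "success"}))                      # True
print(is_successful({"status": "error", "message": "timeout"}))  # False
```

This separates transport-level success (the 200 status code) from application-level success (the body's `status` field), which is the distinction the review comment hinges on.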
🧹 Nitpick comments (6)
scripts/python/invoke-cron.py (6)
18-19: Add type hints to module-level constants. Per coding guidelines, type hints should be used throughout the Python code.
Apply this diff:
```diff
-ENDPOINT = "/api/v1/cron/evaluations"  # Endpoint to invoke
-REQUEST_TIMEOUT = 30  # Timeout for requests in seconds
+ENDPOINT: str = "/api/v1/cron/evaluations"  # Endpoint to invoke
+REQUEST_TIMEOUT: int = 30  # Timeout for requests in seconds
```

As per coding guidelines.
32-51: Add return type hint to `__init__` method. Per coding guidelines, all methods should have type hints.
Apply this diff:
```diff
-    def __init__(self):
+    def __init__(self) -> None:
```

As per coding guidelines.
53-84: Add return type hint and use `logging.exception` in exception handlers. Per coding guidelines, type hints should be used. Additionally, `logging.exception` is preferred over `logging.error` in exception handlers as it automatically includes the traceback. Apply this diff:
```diff
     async def authenticate(self, client: httpx.AsyncClient) -> str:
         """Authenticate and get access token."""
         logger.info("Authenticating with API...")
         login_data = {
             "username": self.email,
             "password": self.password,
         }
         try:
             response = await client.post(
                 f"{self.base_url}/api/v1/login/access-token",
                 data=login_data,
                 timeout=REQUEST_TIMEOUT,
             )
             response.raise_for_status()
             data = response.json()
             self.access_token = data.get("access_token")
             if not self.access_token:
                 raise ValueError("No access token in response")
             logger.info("Authentication successful")
             return self.access_token
         except httpx.HTTPStatusError as e:
-            logger.error(f"Authentication failed with status {e.response.status_code}")
+            logger.exception(f"Authentication failed with status {e.response.status_code}")
             raise
         except Exception as e:
-            logger.error(f"Authentication error: {e}")
+            logger.exception(f"Authentication error: {e}")
             raise
```

As per coding guidelines.
86-129: Add return type hint and use `logging.exception` in exception handlers. Per coding guidelines, type hints should be used. Additionally, use `logging.exception` in exception handlers for automatic traceback inclusion. Apply this diff:
```diff
     async def invoke_endpoint(self, client: httpx.AsyncClient) -> dict:
         """Invoke the configured endpoint."""
         if not self.access_token:
             await self.authenticate(client)
         headers = {"Authorization": f"Bearer {self.access_token}"}
         # Debug: Log what we're sending
         logger.debug(f"Request URL: {self.base_url}{self.endpoint}")
         logger.debug(f"Request headers: {headers}")
         try:
             response = await client.get(
                 f"{self.base_url}{self.endpoint}",
                 headers=headers,
                 timeout=REQUEST_TIMEOUT,
             )
             # Debug: Log response headers and first part of body
             logger.debug(f"Response status: {response.status_code}")
             logger.debug(f"Response headers: {dict(response.headers)}")
             # If unauthorized, re-authenticate and retry once
             if response.status_code == 401:
                 logger.info("Token expired, re-authenticating...")
                 await self.authenticate(client)
                 headers = {"Authorization": f"Bearer {self.access_token}"}
                 response = await client.get(
                     f"{self.base_url}{self.endpoint}",
                     headers=headers,
                     timeout=REQUEST_TIMEOUT,
                 )
             response.raise_for_status()
             return response.json()
         except httpx.HTTPStatusError as e:
-            logger.error(
+            logger.exception(
                 f"Endpoint invocation failed with status {e.response.status_code}: {e.response.text}"
             )
             raise
         except Exception as e:
-            logger.error(f"Endpoint invocation error: {e}")
+            logger.exception(f"Endpoint invocation error: {e}")
             raise
```

As per coding guidelines.
131-168: Add return type hint and use `logging.exception` for error logging. Per coding guidelines, add return type hint. Also prefer `logging.exception` in exception handlers. Apply this diff:
```diff
-    async def run(self):
+    async def run(self) -> None:
         """Main loop to invoke endpoint periodically."""
         logger.info(f"Using API Base URL: {self.base_url}")
         logger.info(
             f"Starting cron job - invoking {self.endpoint} every {self.interval_minutes} minutes"
         )
         # Use async context manager to ensure proper cleanup
         async with httpx.AsyncClient() as client:
             # Authenticate once at startup
             await self.authenticate(client)
             while True:
                 try:
                     start_time = datetime.now()
                     logger.info(f"Invoking endpoint at {start_time}")
                     result = await self.invoke_endpoint(client)
                     logger.info(f"Endpoint invoked successfully: {result}")
                     # Calculate next invocation time
                     elapsed = (datetime.now() - start_time).total_seconds()
                     sleep_time = max(0, self.interval_seconds - elapsed)
                     if sleep_time > 0:
                         logger.info(
                             f"Sleeping for {sleep_time:.1f} seconds until next invocation"
                         )
                         await asyncio.sleep(sleep_time)
                 except KeyboardInterrupt:
                     logger.info("Shutting down gracefully...")
                     break
                 except Exception as e:
-                    logger.error(f"Error during invocation: {e}")
+                    logger.exception(f"Error during invocation: {e}")
                     # Wait before retrying on error
                     logger.info(f"Waiting {self.interval_seconds} seconds before retry")
                     await asyncio.sleep(self.interval_seconds)
```

As per coding guidelines.
171-189: Add return type hint and use `logging.exception` for fatal errors. Per coding guidelines, add return type hint. Also use `logging.exception` to capture the traceback for fatal errors. Apply this diff:
```diff
-def main():
+def main() -> None:
     """Entry point for the script."""
     # Load environment variables
     env_path = Path(__file__).parent.parent.parent / ".env"
     if env_path.exists():
         load_dotenv(env_path)
         logger.info(f"Loaded environment from {env_path}")
     else:
         logger.warning(f"No .env file found at {env_path}")
     try:
         invoker = EndpointInvoker()
         asyncio.run(invoker.run())
     except KeyboardInterrupt:
         logger.info("Interrupted by user")
         sys.exit(0)
     except Exception as e:
-        logger.error(f"Fatal error: {e}")
+        logger.exception(f"Fatal error: {e}")
         sys.exit(1)
```

As per coding guidelines.
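The repeated `logger.error` → `logger.exception` swaps above are easy to verify in isolation. A minimal, self-contained demonstration (logger name is arbitrary) that captures log output in memory and shows the traceback is appended automatically:

```python
import io
import logging

# Route log records into an in-memory buffer so the output can be inspected.
buf = io.StringIO()
logger = logging.getLogger("cron-demo")
logger.addHandler(logging.StreamHandler(buf))
logger.setLevel(logging.ERROR)

try:
    raise ValueError("boom")
except ValueError:
    # Unlike logger.error, logger.exception also records the active traceback.
    logger.exception("Error during invocation")

output = buf.getvalue()
print("Traceback (most recent call last):" in output)  # True
```

Calling `logger.error(...)` in the same handler would log only the message; `logger.exception(...)` must be called from inside an `except` block to pick up the traceback.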
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
.env.example (1 hunks)
scripts/python/invoke-cron.py (1 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
.env.example
📄 CodeRabbit inference engine (CLAUDE.md)
Provide .env.example as the template for .env
Files:
.env.example
**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
Use type hints in Python code (Python 3.11+ project)
Files:
scripts/python/invoke-cron.py
🪛 Ruff (0.14.3)
scripts/python/invoke-cron.py
1-1: Shebang is present but file is not executable
(EXE001)
49-51: Avoid specifying long messages outside the exception class
(TRY003)
74-74: Abstract raise to an inner function
(TRY301)
74-74: Avoid specifying long messages outside the exception class
(TRY003)
77-77: Consider moving this statement to an else block
(TRY300)
80-80: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
83-83: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
123-125: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
128-128: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
164-164: Do not catch blind exception: Exception
(BLE001)
165-165: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
187-187: Do not catch blind exception: Exception
(BLE001)
188-188: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
🔇 Additional comments (1)
.env.example (1)
26-30: LGTM! Clear documentation for new configuration. The new environment variables are well-documented with sensible defaults. The comments clearly explain their purpose and default behavior, making it easy for developers to understand the configuration options.
@@ -0,0 +1,193 @@
#!/usr/bin/env python3
Make the file executable to match the shebang.
The shebang is present but the file lacks executable permissions.
Run this command to fix:
chmod +x scripts/python/invoke-cron.py
🧰 Tools
🪛 Ruff (0.14.3)
1-1: Shebang is present but file is not executable
(EXE001)
🤖 Prompt for AI Agents
In scripts/python/invoke-cron.py around line 1, the file contains a shebang but
is missing executable permissions; make the file executable by setting its
executable bit locally (so the shebang is honored), then commit the permission
change to the repo so the script runs as intended.
kartpop
left a comment
approved, let's merge in and release to staging; remnant issues can be taken up in the coming weeks and released with the unified API
Summary
Target issue is #417
Checklist
Before submitting a pull request, please ensure that you mark these tasks.
Run `fastapi run --reload app/main.py` or `docker compose up` in the repository root and test.
Notes
Here is the dataflow
Upload CSV → Langfuse Dataset (5x duplication)
Start Evaluation → Create EvaluationRun (pending)
Fetch Dataset → Build JSONL for OpenAI Batch API
Upload JSONL → OpenAI Files API
Create Batch → OpenAI Batch API (returns batch_id)
Update DB → BatchJob + EvaluationRun (processing)
Wait → OpenAI processes batch async (1-24 hours)
Celery Beat → Polls every 60 seconds
Check Status → Poll OpenAI batch status
When Response Batch Completed:
Build Embedding JSONL → [output, ground_truth] pairs per trace
Create Embedding Batch → OpenAI Embeddings API (returns batch_id)
Update DB → Link embedding_batch_job_id (status: processing)
Wait → OpenAI processes embeddings (1-24 hours)
Celery Beat → Polls embedding batch status
When Embedding Batch Completed:
User views results via API or Langfuse UI
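The scoring that happens once both batches complete boils down to averaging cosine similarities over the [output, ground_truth] embedding pairs. A pure-Python sketch (the real logic lives in `calculate_average_similarity` and may differ in detail):

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# One [output, ground_truth] embedding pair per trace; the run-level
# score is the mean similarity across all traces.
pairs = [
    ([1.0, 0.0], [1.0, 0.0]),  # identical vectors  -> similarity 1.0
    ([1.0, 0.0], [0.0, 1.0]),  # orthogonal vectors -> similarity 0.0
]
average_score = sum(cosine_similarity(o, g) for o, g in pairs) / len(pairs)
print(average_score)  # 0.5
```

A score near 1.0 means the model outputs are semantically close to the ground truth; this per-trace value is what gets written back to Langfuse via `update_traces_with_cosine_scores`.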
Summary by CodeRabbit
New Features
Bug Fixes