
Evaluation: Added type for dataset #641

Merged
AkhileshNegi merged 4 commits into main from enhancement/text-evals-dataset-fix
Mar 5, 2026

Conversation

@AkhileshNegi
Collaborator

@AkhileshNegi AkhileshNegi commented Mar 5, 2026

Summary

Target issue is ProjectTech4DevAI/kaapi-frontend#52 (review)

Checklist

Before submitting a pull request, please ensure that you mark these tasks.

  • Ran fastapi run --reload app/main.py or docker compose up in the repository root and tested.
  • If you've fixed a bug or added code, ensure it is tested and has test cases.

Notes

  • New Features

    • Dataset operations now consistently filter to only include text-type evaluation datasets for creation, retrieval, and listing.
  • Tests

    • Added tests to verify datasets default to text-type and that non-text datasets are excluded from retrieval and listing.

Summary by CodeRabbit

  • Improvements
    • Evaluation datasets and runs now incorporate type-based filtering. New evaluations default to TEXT type, and all retrieval and listing operations automatically filter to show only TEXT-type datasets and runs. This ensures consistent data handling and establishes the foundation for supporting additional evaluation types in the future.

@AkhileshNegi AkhileshNegi self-assigned this Mar 5, 2026
@AkhileshNegi AkhileshNegi added the bug Something isn't working label Mar 5, 2026
@AkhileshNegi AkhileshNegi marked this pull request as ready for review March 5, 2026 10:32
@AkhileshNegi AkhileshNegi requested a review from nishika26 March 5, 2026 10:33
@coderabbitai

coderabbitai bot commented Mar 5, 2026

📝 Walkthrough

Walkthrough

Adds type-based filtering for evaluation datasets and runs: created records default to EvaluationType.TEXT, and get/list operations now restrict results to type == TEXT across dataset and run CRUD paths.
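To illustrate the filtering contract the walkthrough describes, here is a minimal in-memory stand-in (not the actual SQLModel queries from the PR; the enum members beyond TEXT and their string values are assumptions drawn from the review comments):

```python
from dataclasses import dataclass
from enum import Enum


class EvaluationType(str, Enum):
    TEXT = "text"
    STT = "stt"
    TTS = "tts"


@dataclass
class EvaluationDataset:
    id: int
    name: str
    type: str


def list_datasets(
    datasets: list[EvaluationDataset],
    evaluation_type: EvaluationType = EvaluationType.TEXT,
) -> list[EvaluationDataset]:
    # Mirrors the `.where(EvaluationDataset.type == EvaluationType.TEXT.value)`
    # constraint: only datasets of the requested type are returned,
    # with TEXT as the default.
    return [d for d in datasets if d.type == evaluation_type.value]


datasets = [
    EvaluationDataset(1, "faq-evals", EvaluationType.TEXT.value),
    EvaluationDataset(2, "asr-evals", EvaluationType.STT.value),
]
print([d.name for d in list_datasets(datasets)])  # ['faq-evals']
```

The run-level filtering in core.py follows the same shape, with the run model in place of the dataset model.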

Changes

Cohort / File(s) Summary
Evaluation dataset CRUD
backend/app/crud/evaluations/dataset.py
Import EvaluationType; set created datasets to TEXT; add type == TEXT constraints to get-by-id, get-by-name, and list queries.
Evaluation run CRUD
backend/app/crud/evaluations/core.py
Import EvaluationType; set created evaluation runs to TEXT; add type == TEXT filters to get-by-id and list queries.
Dataset tests
backend/app/tests/crud/evaluations/test_dataset.py
Add imports for EvaluationDataset and EvaluationType; assert created datasets default to TEXT; add tests ensuring non-TEXT (e.g., STT) datasets are excluded from get/list operations.
Run tests
backend/app/tests/crud/evaluations/test_core.py
New test module: verify creation, retrieval, and listing of evaluation runs with defaults to TEXT, and that non-TEXT runs are excluded from queries.

Sequence Diagram(s)

(omitted)

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related PRs

Suggested labels

ready-for-review

Suggested reviewers

  • Prajna1999

Poem

🐰 I hop through code where types align,
I mark new records as TEXT by design,
I hide the STT friends from lists I keep,
Tests nod along as they pass and leap,
A little rabbit dance — tidy and fine.

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title 'Evaluation: Added type for dataset' directly relates to the main change, which adds type filtering for evaluation datasets across CRUD operations. It accurately summarizes the primary modification.
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.



@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (1)
backend/app/crud/evaluations/dataset.py (1)

25-25: Consider moving EvaluationType to a neutral model module.

Importing EvaluationType from app.models.stt_evaluation in a generic evaluation dataset CRUD module introduces avoidable domain coupling. A shared location (e.g., app.models.evaluation) would keep boundaries cleaner.
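A sketch of what that extraction could look like (the module path comes from the comment; the non-TEXT members and string values are assumptions):

```python
# Hypothetical app/models/evaluation.py -- a neutral home for the enum,
# so CRUD modules no longer import from the STT-specific model.
from enum import Enum


class EvaluationType(str, Enum):
    """Evaluation modality shared by dataset and run CRUD (values assumed)."""

    TEXT = "text"
    STT = "stt"
    TTS = "tts"


# dataset.py (and core.py) would then depend only on the shared module:
# from app.models.evaluation import EvaluationType
```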

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/app/crud/evaluations/dataset.py` at line 25, The EvaluationType enum
is currently defined in app.models.stt_evaluation and should be moved to a
neutral model module (e.g., app.models.evaluation); extract the EvaluationType
definition into that new/shared module, export it there, then update the import
in dataset.py (and any other modules importing it) to import EvaluationType from
app.models.evaluation so the CRUD code no longer depends on the STT-specific
model.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@backend/app/crud/evaluations/dataset.py`:
- Line 64: The CRUD functions in this module are hard-coding
type=EvaluationType.TEXT.value causing non-TEXT datasets to be ignored; change
the dataset CRUD functions (e.g., the create/read/list functions that currently
use type=EvaluationType.TEXT.value at the occurrences around the file) to accept
an optional dataset_type parameter (defaulting to EvaluationType.TEXT) and use
dataset_type.value when querying/creating, and update any internal calls to pass
through the correct EvaluationType (for example, have callers like
start_evaluation pass the incoming evaluation type or dataset.evaluation_type
instead of relying on the hard-coded TEXT); apply this change for each
occurrence noted (around the three lines mentioned) so STT/TTS datasets are
included while preserving TEXT as the default.

---

Nitpick comments:
In `@backend/app/crud/evaluations/dataset.py`:
- Line 25: The EvaluationType enum is currently defined in
app.models.stt_evaluation and should be moved to a neutral model module (e.g.,
app.models.evaluation); extract the EvaluationType definition into that
new/shared module, export it there, then update the import in dataset.py (and
any other modules importing it) to import EvaluationType from
app.models.evaluation so the CRUD code no longer depends on the STT-specific
model.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 79237a1b-93a2-4c81-b535-7638d281f042

📥 Commits

Reviewing files that changed from the base of the PR and between a0c1f24 and fdd37e0.

📒 Files selected for processing (1)
  • backend/app/crud/evaluations/dataset.py

dataset = EvaluationDataset(
    name=name,
    description=description,
    type=EvaluationType.TEXT.value,

⚠️ Potential issue | 🟠 Major

Hard-coded TEXT scope makes generic dataset CRUD return false “not found” for valid non-TEXT datasets.

These changes force dataset creation and all reads/lists to TEXT only. Given EvaluationType includes STT and TTS (backend/app/models/stt_evaluation.py:21-26), this module now silently excludes those datasets and can surface misleading 404s in callers like start_evaluation (backend/app/services/evaluations/evaluation.py:28-85).

Suggested fix (parameterize dataset type, keep TEXT default)
 def create_evaluation_dataset(
     session: Session,
     name: str,
     dataset_metadata: dict[str, Any],
     organization_id: int,
     project_id: int,
     description: str | None = None,
     object_store_url: str | None = None,
     langfuse_dataset_id: str | None = None,
+    evaluation_type: EvaluationType = EvaluationType.TEXT,
 ) -> EvaluationDataset:
@@
         dataset = EvaluationDataset(
             name=name,
             description=description,
-            type=EvaluationType.TEXT.value,
+            type=evaluation_type.value,
             dataset_metadata=dataset_metadata,
@@
 def get_dataset_by_id(
-    session: Session, dataset_id: int, organization_id: int, project_id: int
+    session: Session,
+    dataset_id: int,
+    organization_id: int,
+    project_id: int,
+    evaluation_type: EvaluationType = EvaluationType.TEXT,
 ) -> EvaluationDataset | None:
@@
         .where(EvaluationDataset.organization_id == organization_id)
         .where(EvaluationDataset.project_id == project_id)
-        .where(EvaluationDataset.type == EvaluationType.TEXT.value)
+        .where(EvaluationDataset.type == evaluation_type.value)
@@
 def get_dataset_by_name(
-    session: Session, name: str, organization_id: int, project_id: int
+    session: Session,
+    name: str,
+    organization_id: int,
+    project_id: int,
+    evaluation_type: EvaluationType = EvaluationType.TEXT,
 ) -> EvaluationDataset | None:
@@
         .where(EvaluationDataset.organization_id == organization_id)
         .where(EvaluationDataset.project_id == project_id)
-        .where(EvaluationDataset.type == EvaluationType.TEXT.value)
+        .where(EvaluationDataset.type == evaluation_type.value)
@@
 def list_datasets(
     session: Session,
     organization_id: int,
     project_id: int,
     limit: int = 50,
     offset: int = 0,
+    evaluation_type: EvaluationType = EvaluationType.TEXT,
 ) -> list[EvaluationDataset]:
@@
         .where(EvaluationDataset.organization_id == organization_id)
         .where(EvaluationDataset.project_id == project_id)
-        .where(EvaluationDataset.type == EvaluationType.TEXT.value)
+        .where(EvaluationDataset.type == evaluation_type.value)

Also applies to: 127-127, 164-164, 201-201

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/app/crud/evaluations/dataset.py` at line 64, The CRUD functions in
this module are hard-coding type=EvaluationType.TEXT.value causing non-TEXT
datasets to be ignored; change the dataset CRUD functions (e.g., the
create/read/list functions that currently use type=EvaluationType.TEXT.value at
the occurrences around the file) to accept an optional dataset_type parameter
(defaulting to EvaluationType.TEXT) and use dataset_type.value when
querying/creating, and update any internal calls to pass through the correct
EvaluationType (for example, have callers like start_evaluation pass the
incoming evaluation type or dataset.evaluation_type instead of relying on the
hard-coded TEXT); apply this change for each occurrence noted (around the three
lines mentioned) so STT/TTS datasets are included while preserving TEXT as the
default.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (1)
backend/app/tests/crud/evaluations/test_dataset.py (1)

130-157: Well-structured test; db.add() is redundant.

The test correctly validates the type filtering behavior. However, calling db.add(dataset) at line 147 is unnecessary since the object is already tracked by the session after create_evaluation_dataset. The attribute mutation will be persisted on commit() regardless.

♻️ Suggested simplification
         # Manually update type to STT to simulate a non-text dataset
         dataset.type = EvaluationType.STT.value
-        db.add(dataset)
         db.commit()
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/app/tests/crud/evaluations/test_dataset.py` around lines 130 - 157,
Remove the redundant session add: in
test_get_dataset_by_id_excludes_non_text_type, after creating the dataset with
create_evaluation_dataset and mutating dataset.type = EvaluationType.STT.value,
do not call db.add(dataset) because the instance is already tracked; simply call
db.commit() to persist the change and leave the rest of the test (especially the
get_dataset_by_id assertion) unchanged.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@backend/app/tests/crud/evaluations/test_dataset.py`:
- Line 16: Remove the unused EvaluationDataset import from the import statement
that currently reads "from app.models import EvaluationDataset, Organization,
Project"; the tests use the create_evaluation_dataset() CRUD helper and only
reference "TestCreateEvaluationDataset" as a class name, so keep Organization
and Project but delete EvaluationDataset to avoid an unused import.

---

Nitpick comments:
In `@backend/app/tests/crud/evaluations/test_dataset.py`:
- Around line 130-157: Remove the redundant session add: in
test_get_dataset_by_id_excludes_non_text_type, after creating the dataset with
create_evaluation_dataset and mutating dataset.type = EvaluationType.STT.value,
do not call db.add(dataset) because the instance is already tracked; simply call
db.commit() to persist the change and leave the rest of the test (especially the
get_dataset_by_id assertion) unchanged.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 60e1103f-2327-416a-9501-96df8d80fd2f

📥 Commits

Reviewing files that changed from the base of the PR and between fdd37e0 and 35a9e4e.

📒 Files selected for processing (1)
  • backend/app/tests/crud/evaluations/test_dataset.py

upload_csv_to_object_store,
)
from app.models import Organization, Project
from app.models import EvaluationDataset, Organization, Project

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Check if EvaluationDataset is directly used in the test file (beyond imports)
rg -n 'EvaluationDataset' backend/app/tests/crud/evaluations/test_dataset.py | grep -v 'import'

Repository: ProjectTech4DevAI/kaapi-backend

Length of output: 112


Remove the unused EvaluationDataset import from line 16.

The EvaluationDataset model is imported but never directly used in the test file. The tests create datasets via the create_evaluation_dataset() CRUD function, which handles instantiation implicitly. The only occurrence of "EvaluationDataset" in the file is the test class name TestCreateEvaluationDataset, not a usage of the imported model.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/app/tests/crud/evaluations/test_dataset.py` at line 16, Remove the
unused EvaluationDataset import from the import statement that currently reads
"from app.models import EvaluationDataset, Organization, Project"; the tests use
the create_evaluation_dataset() CRUD helper and only reference
"TestCreateEvaluationDataset" as a class name, so keep Organization and Project
but delete EvaluationDataset to avoid an unused import.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (1)
backend/app/tests/crud/evaluations/test_core.py (1)

130-133: Use a collision-proof non-existent ID in not-found tests.

Using a hardcoded 99999 can become flaky in long-lived/shared test DBs; prefer a guaranteed-miss ID strategy (e.g., -1 if IDs are positive-only).
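A hedged sketch of the guaranteed-miss strategy (the helper name is invented for illustration; the real tests would inline whichever value fits the schema):

```python
def guaranteed_miss_id(existing_ids: list[int]) -> int:
    """Return an ID that cannot match any existing row.

    -1 is safe when primary keys are positive autoincrement integers;
    otherwise fall back to one past the current maximum.
    """
    if all(i > 0 for i in existing_ids):
        return -1
    return max(existing_ids, default=0) + 1


print(guaranteed_miss_id([1, 2, 99999]))  # -1: positive-only IDs
print(guaranteed_miss_id([-5, 3]))        # 4: mixed-sign IDs
```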

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/app/tests/crud/evaluations/test_core.py` around lines 130 - 133,
Replace the hardcoded 99999 used in the not-found test for
get_evaluation_run_by_id with a collision-proof non-existent ID (e.g., use -1)
to guarantee a miss in databases where IDs are positive-only; update the test
invocation of get_evaluation_run_by_id(session=db, evaluation_id=...,
organization_id=org.id) to pass -1 (or otherwise compute a guaranteed-miss
value) so the test cannot collide with real records.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@backend/app/tests/crud/evaluations/test_core.py`:
- Around line 16-42: The inline helper _create_config and repeated setup logic
should be replaced by reusable factory fixtures under backend/app/tests/ (e.g.,
ConfigFactory/ConfigVersionFactory, ProjectFactory, OrgFactory, DatasetFactory,
RunFactory); create factories that construct Config and ConfigVersion (mirroring
the fields set in _create_config and using now() for timestamps), register them
as pytest fixtures, and update tests in test_core.py (and other affected tests)
to use these fixtures instead of calling _create_config or manually creating
models; ensure factories return the same identifiers/objects (config.id and
config_version.version) or provide equivalent attributes so existing assertions
in the tests remain valid.

---

Nitpick comments:
In `@backend/app/tests/crud/evaluations/test_core.py`:
- Around line 130-133: Replace the hardcoded 99999 used in the not-found test
for get_evaluation_run_by_id with a collision-proof non-existent ID (e.g., use
-1) to guarantee a miss in databases where IDs are positive-only; update the
test invocation of get_evaluation_run_by_id(session=db, evaluation_id=...,
organization_id=org.id) to pass -1 (or otherwise compute a guaranteed-miss
value) so the test cannot collide with real records.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 9ffdeda3-927d-4509-aae5-bd30406b78ef

📥 Commits

Reviewing files that changed from the base of the PR and between 35a9e4e and 176b769.

📒 Files selected for processing (2)
  • backend/app/crud/evaluations/core.py
  • backend/app/tests/crud/evaluations/test_core.py

Comment on lines +16 to +42
def _create_config(db: Session, project_id: int) -> tuple:
    """Helper to create a config and config_version for evaluation runs."""
    from app.models.config import Config, ConfigVersion

    config = Config(
        name="test_config",
        project_id=project_id,
        inserted_at=now(),
        updated_at=now(),
    )
    db.add(config)
    db.commit()
    db.refresh(config)

    config_version = ConfigVersion(
        config_id=config.id,
        version=1,
        config_blob={"completion": {"params": {"model": "gpt-4o"}}},
        inserted_at=now(),
        updated_at=now(),
    )
    db.add(config_version)
    db.commit()
    db.refresh(config_version)

    return config.id, config_version.version


🛠️ Refactor suggestion | 🟠 Major

Adopt factory fixtures instead of inline object construction in tests.

This module repeats setup logic (org/project/dataset/config/run) and uses an ad-hoc helper; please move these into reusable factory fixtures under backend/app/tests/ for consistency and maintainability.

♻️ Example direction
- def _create_config(db: Session, project_id: int) -> tuple:
-     ...
+ # backend/app/tests/factories/config_factory.py
+ def create_config_with_version(db: Session, project_id: int) -> tuple[int, int]:
+     ...
- config_id, config_version = _create_config(db, project.id)
+ config_id, config_version = config_factory.create_config_with_version(db, project.id)

As per coding guidelines, "backend/app/tests/**/*.py: Use factory pattern for test fixtures in backend/app/tests/."

Also applies to: 47-245
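One possible shape for such a factory, with the DB session replaced by plain dataclasses so the structure is visible (field names mirror the inline helper; a real factory would add/commit through the session and could be registered as a pytest fixture):

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from itertools import count

_ids = count(1)  # stand-in for autoincrement primary keys


def now() -> datetime:
    return datetime.now(timezone.utc)


@dataclass
class Config:
    name: str
    project_id: int
    id: int = field(default_factory=lambda: next(_ids))
    inserted_at: datetime = field(default_factory=now)
    updated_at: datetime = field(default_factory=now)


@dataclass
class ConfigVersion:
    config_id: int
    version: int
    config_blob: dict
    inserted_at: datetime = field(default_factory=now)
    updated_at: datetime = field(default_factory=now)


def create_config_with_version(project_id: int) -> tuple[int, int]:
    """Factory mirroring _create_config; returns (config_id, version)."""
    config = Config(name="test_config", project_id=project_id)
    config_version = ConfigVersion(
        config_id=config.id,
        version=1,
        config_blob={"completion": {"params": {"model": "gpt-4o"}}},
    )
    return config.id, config_version.version
```

Because the factory returns the same (config.id, version) pair as the inline helper, existing assertions in the tests would not need to change.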

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/app/tests/crud/evaluations/test_core.py` around lines 16 - 42, The
inline helper _create_config and repeated setup logic should be replaced by
reusable factory fixtures under backend/app/tests/ (e.g.,
ConfigFactory/ConfigVersionFactory, ProjectFactory, OrgFactory, DatasetFactory,
RunFactory); create factories that construct Config and ConfigVersion (mirroring
the fields set in _create_config and using now() for timestamps), register them
as pytest fixtures, and update tests in test_core.py (and other affected tests)
to use these fixtures instead of calling _create_config or manually creating
models; ensure factories return the same identifiers/objects (config.id and
config_version.version) or provide equivalent attributes so existing assertions
in the tests remain valid.

@codecov

codecov bot commented Mar 5, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.


@AkhileshNegi AkhileshNegi merged commit 03a0568 into main Mar 5, 2026
3 checks passed
@AkhileshNegi AkhileshNegi deleted the enhancement/text-evals-dataset-fix branch March 5, 2026 11:18
@github-project-automation github-project-automation bot moved this to Closed in Kaapi-dev Mar 5, 2026
nishika26 pushed a commit that referenced this pull request Mar 10, 2026

Labels

bug Something isn't working

Projects

Status: Closed


2 participants