Merged
72 commits
62d159e
first stab at setting up langfuse evaluation
May 28, 2025
706c5dc
bringing traces to sync
May 28, 2025
d882160
cleanup
May 28, 2025
168f521
getting it up and running with sync
May 28, 2025
fa509ad
using utils
May 28, 2025
38ea1a9
cleanups
May 28, 2025
be299a3
added logs
May 29, 2025
46fd438
code refactoring
May 29, 2025
6198542
cleanups
May 29, 2025
74ab6d1
moving to separate files
May 29, 2025
cfc2dbd
using pydantic types
May 29, 2025
675e220
Merge main into feature/langfuse-evaluation branch
AkhileshNegi Oct 7, 2025
0633c54
Remove project_id dependency from evaluation endpoints
AkhileshNegi Oct 7, 2025
c130b02
using hardcoded
AkhileshNegi Oct 8, 2025
7878d2a
Merge branch 'feature/langfuse-evaluation' into main
AkhileshNegi Oct 13, 2025
5466229
adding endpoint for uploading dataset
AkhileshNegi Oct 13, 2025
2ac8163
added steps for starting evaluation using batchAPI
AkhileshNegi Oct 14, 2025
17c767c
added testcase
AkhileshNegi Oct 14, 2025
f99ae27
using celery beat and evaluation batch
AkhileshNegi Oct 14, 2025
9bc96b2
first stab at running evaluation
AkhileshNegi Oct 14, 2025
2200c27
cleaning up traces in langfuse
AkhileshNegi Oct 14, 2025
ae3c779
cleanup unnecessary code
AkhileshNegi Oct 14, 2025
6acde66
Merge branch 'main' into feature/evaluation
AkhileshNegi Oct 16, 2025
34082d6
syncing with master changes
AkhileshNegi Oct 16, 2025
2a915fa
moving to batch table
AkhileshNegi Oct 21, 2025
b2c1b46
Merge branch 'main' into feature/evaluation
AkhileshNegi Oct 21, 2025
1e278fc
checking out AWS
AkhileshNegi Oct 21, 2025
4cb5d56
cleanup migration
AkhileshNegi Oct 21, 2025
877ba04
added support for cosine similarity score
AkhileshNegi Oct 23, 2025
6979e33
first stab at pushing cosine to langfuse
AkhileshNegi Oct 23, 2025
26ee6f0
cleanup logs
AkhileshNegi Oct 23, 2025
c76ef50
optimizing similarity
AkhileshNegi Oct 25, 2025
5bfc8d8
added evaluation dataset
AkhileshNegi Oct 29, 2025
3912d9f
update endpoints
AkhileshNegi Oct 29, 2025
6012d5c
updated testcases
AkhileshNegi Oct 30, 2025
d289794
using single migration file
AkhileshNegi Oct 30, 2025
e98dae6
code cleanups
AkhileshNegi Oct 30, 2025
cc2df27
few more cleanups and tests
AkhileshNegi Oct 30, 2025
ebafe8b
added support for sanitizing dataset name
AkhileshNegi Oct 30, 2025
a21709f
fix import issues in testcases
AkhileshNegi Oct 30, 2025
e15c5f2
Merge branch 'main' into feature/evaluation
AkhileshNegi Oct 31, 2025
ed0da58
fixing imports
AkhileshNegi Oct 31, 2025
f573e70
Merge branch 'feature/evaluation' of github.com:ProjectTech4DevAI/ai-…
AkhileshNegi Oct 31, 2025
11663da
minor cleanups for evaluation
AkhileshNegi Oct 31, 2025
5988f80
passing project id as well
AkhileshNegi Oct 31, 2025
22361fe
updated testcases and error codes
AkhileshNegi Nov 3, 2025
d9704e3
using util for file uploads
AkhileshNegi Nov 3, 2025
0fd0842
optimizing cosine similarities
AkhileshNegi Nov 3, 2025
cd757bd
added support for duplication factor limit
AkhileshNegi Nov 3, 2025
ea5d000
Merge branch 'main' into feature/evaluation
AkhileshNegi Nov 3, 2025
f2ec2a5
cleanup for dataset id in evaluation
AkhileshNegi Nov 3, 2025
f7ca621
file validations
AkhileshNegi Nov 3, 2025
e74ea09
refactoring file structure
AkhileshNegi Nov 3, 2025
4f4cea1
Evaluation: Add cron job endpoint and script for periodic evaluation …
kartpop Nov 4, 2025
9b038c6
minor fixes
AkhileshNegi Nov 4, 2025
622b4eb
cleanup cruds
AkhileshNegi Nov 4, 2025
4c61d74
removed celery beat
AkhileshNegi Nov 4, 2025
4ec5971
cleanup evaluation run update and context runs
AkhileshNegi Nov 4, 2025
6f19f05
cleanup logs
AkhileshNegi Nov 4, 2025
4a649d3
using response id
AkhileshNegi Nov 4, 2025
9fc12a6
type checking for clean code
AkhileshNegi Nov 4, 2025
a5c8a03
cleaner documentation
AkhileshNegi Nov 4, 2025
cba41c9
added indexes
AkhileshNegi Nov 4, 2025
9dffd06
Merge branch 'main' into feature/evaluation
AkhileshNegi Nov 5, 2025
8d32883
removing unnecessary asyncs
AkhileshNegi Nov 5, 2025
24a958e
using get_langfuse_client instead
AkhileshNegi Nov 5, 2025
34700c5
update migration head
AkhileshNegi Nov 5, 2025
9755403
refactoring and cleanups
AkhileshNegi Nov 5, 2025
2b38293
cleanup cron
AkhileshNegi Nov 5, 2025
dce502b
moving to env for cron
AkhileshNegi Nov 6, 2025
8ad6982
formatting code
AkhileshNegi Nov 6, 2025
c08d626
updated endpoints
AkhileshNegi Nov 6, 2025
6 changes: 6 additions & 0 deletions .env.example
@@ -23,6 +23,12 @@ FIRST_SUPERUSER=superuser@example.com
FIRST_SUPERUSER_PASSWORD=changethis
EMAIL_TEST_USER="test@example.com"

# API Base URL for cron scripts (defaults to http://localhost:8000 if not set)
API_BASE_URL=http://localhost:8000

# Cron interval in minutes (defaults to 5 minutes if not set)
CRON_INTERVAL_MINUTES=5

# Postgres
POSTGRES_SERVER=localhost
POSTGRES_PORT=5432
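The two cron settings added above can be read with simple fallbacks matching the documented defaults; a minimal sketch, assuming plain `os.environ` access (the helper names are illustrative, not taken from the PR):

```python
import os

# Illustrative helpers (names are assumptions, not from the PR) that read
# the cron settings with the defaults documented in .env.example.
def get_api_base_url() -> str:
    # Defaults to http://localhost:8000 if not set, per the comment above.
    return os.environ.get("API_BASE_URL", "http://localhost:8000")

def get_cron_interval_minutes() -> int:
    # Defaults to 5 minutes if not set, per the comment above.
    return int(os.environ.get("CRON_INTERVAL_MINUTES", "5"))
```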
@@ -0,0 +1,249 @@
"""create_evaluation_run_table, batch_job_table, and evaluation_dataset_table

Revision ID: 6fe772038a5a
Revises: 219033c644de
Create Date: 2025-11-05 22:47:18.266070

"""
from alembic import op
import sqlalchemy as sa
from sqlalchemy.dialects import postgresql
import sqlmodel.sql.sqltypes


# revision identifiers, used by Alembic.
revision = "6fe772038a5a"
down_revision = "219033c644de"
branch_labels = None
depends_on = None


def upgrade():

⚠️ Potential issue | 🟡 Minor

Add return type hint to the upgrade function.

The function signature is missing a return type hint, which violates the project's coding guidelines for Python 3.11+.

As per coding guidelines.

Apply this diff:

-def upgrade():
+def upgrade() -> None:
🤖 Prompt for AI Agents
In backend/app/alembic/versions/d5747495bd7c_create_evaluation_run_table.py
around line 20, the upgrade function signature lacks a return type hint; update
the function definition to include an explicit return type of None (e.g., def
upgrade() -> None:) to comply with the project's Python 3.11+ typing guidelines
and static checks.

# Create batch_job table first (as evaluation_run will reference it)
op.create_table(
"batch_job",
sa.Column("id", sa.Integer(), nullable=False),
sa.Column(
"provider",
sa.String(),
nullable=False,
comment="LLM provider name (e.g., 'openai', 'anthropic')",
),
sa.Column(
"job_type",
sa.String(),
nullable=False,
comment="Type of batch job (e.g., 'evaluation', 'classification', 'embedding')",
),
sa.Column(
"config",
postgresql.JSONB(astext_type=sa.Text()),
nullable=False,
server_default=sa.text("'{}'::jsonb"),
comment="Complete batch configuration",
),
sa.Column(
"provider_batch_id",
sa.String(),
nullable=True,
comment="Provider's batch job ID",
),
sa.Column(
"provider_file_id",
sa.String(),
nullable=True,
comment="Provider's input file ID",
),
sa.Column(
"provider_output_file_id",
sa.String(),
nullable=True,
comment="Provider's output file ID",
),
sa.Column(
"provider_status",
sa.String(),
nullable=True,
comment="Provider-specific status (e.g., OpenAI: validating, in_progress, completed, failed)",
),
sa.Column(
"raw_output_url",
sa.String(),
nullable=True,
comment="S3 URL of raw batch output file",
),
sa.Column(
"total_items",
sa.Integer(),
nullable=False,
server_default=sa.text("0"),
comment="Total number of items in the batch",
),
sa.Column(
"error_message",
sa.Text(),
nullable=True,
comment="Error message if batch failed",
),
sa.Column("organization_id", sa.Integer(), nullable=False),
sa.Column("project_id", sa.Integer(), nullable=False),
sa.Column("inserted_at", sa.DateTime(), nullable=False),
sa.Column("updated_at", sa.DateTime(), nullable=False),
sa.ForeignKeyConstraint(
["organization_id"], ["organization.id"], ondelete="CASCADE"
),
sa.ForeignKeyConstraint(["project_id"], ["project.id"], ondelete="CASCADE"),
sa.PrimaryKeyConstraint("id"),
)
op.create_index(
op.f("ix_batch_job_job_type"), "batch_job", ["job_type"], unique=False
)
op.create_index(
op.f("ix_batch_job_organization_id"),
"batch_job",
["organization_id"],
unique=False,
)
op.create_index(
op.f("ix_batch_job_project_id"), "batch_job", ["project_id"], unique=False
)
op.create_index(
"idx_batch_job_status_org",
"batch_job",
["provider_status", "organization_id"],
unique=False,
)
op.create_index(
"idx_batch_job_status_project",
"batch_job",
["provider_status", "project_id"],
unique=False,
)

# Create evaluation_dataset table
op.create_table(
"evaluation_dataset",
sa.Column("id", sa.Integer(), nullable=False),
sa.Column("name", sqlmodel.sql.sqltypes.AutoString(), nullable=False),
sa.Column("description", sqlmodel.sql.sqltypes.AutoString(), nullable=True),
sa.Column(
"dataset_metadata",
postgresql.JSONB(astext_type=sa.Text()),
nullable=False,
server_default=sa.text("'{}'::jsonb"),
),
sa.Column(
"object_store_url", sqlmodel.sql.sqltypes.AutoString(), nullable=True
),
sa.Column(
"langfuse_dataset_id",
sqlmodel.sql.sqltypes.AutoString(),
nullable=True,
),
sa.Column("organization_id", sa.Integer(), nullable=False),
sa.Column("project_id", sa.Integer(), nullable=False),
sa.Column("inserted_at", sa.DateTime(), nullable=False),
sa.Column("updated_at", sa.DateTime(), nullable=False),
sa.ForeignKeyConstraint(
["organization_id"], ["organization.id"], ondelete="CASCADE"
),
sa.ForeignKeyConstraint(["project_id"], ["project.id"], ondelete="CASCADE"),
sa.PrimaryKeyConstraint("id"),
sa.UniqueConstraint(
"name",
"organization_id",
"project_id",
name="uq_evaluation_dataset_name_org_project",
),
)
op.create_index(
op.f("ix_evaluation_dataset_name"),
"evaluation_dataset",
["name"],
unique=False,
)

# Create evaluation_run table with all columns and foreign key references
op.create_table(
"evaluation_run",
sa.Column("run_name", sqlmodel.sql.sqltypes.AutoString(), nullable=False),
sa.Column("dataset_name", sqlmodel.sql.sqltypes.AutoString(), nullable=False),
sa.Column("config", sa.JSON(), nullable=False),
sa.Column("batch_job_id", sa.Integer(), nullable=True),
sa.Column(
"embedding_batch_job_id",
sa.Integer(),
nullable=True,
comment="Reference to the batch_job for embedding-based similarity scoring",
),
sa.Column("dataset_id", sa.Integer(), nullable=False),
sa.Column("status", sqlmodel.sql.sqltypes.AutoString(), nullable=False),
sa.Column(
"object_store_url", sqlmodel.sql.sqltypes.AutoString(), nullable=True
),
sa.Column("total_items", sa.Integer(), nullable=False),
sa.Column("score", sa.JSON(), nullable=True),
sa.Column("error_message", sa.Text(), nullable=True),
sa.Column("organization_id", sa.Integer(), nullable=False),
sa.Column("project_id", sa.Integer(), nullable=False),
sa.Column("id", sa.Integer(), nullable=False),
sa.Column("inserted_at", sa.DateTime(), nullable=False),
sa.Column("updated_at", sa.DateTime(), nullable=False),
sa.ForeignKeyConstraint(
["batch_job_id"],
["batch_job.id"],
ondelete="SET NULL",
),
sa.ForeignKeyConstraint(
["embedding_batch_job_id"],
["batch_job.id"],
name="fk_evaluation_run_embedding_batch_job_id",
ondelete="SET NULL",
),
sa.ForeignKeyConstraint(
["dataset_id"],
["evaluation_dataset.id"],
name="fk_evaluation_run_dataset_id",
ondelete="CASCADE",
),
sa.ForeignKeyConstraint(
["organization_id"], ["organization.id"], ondelete="CASCADE"
),
sa.ForeignKeyConstraint(["project_id"], ["project.id"], ondelete="CASCADE"),
sa.PrimaryKeyConstraint("id"),
)
Comment on lines +24 to +214

⚠️ Potential issue | 🔴 Critical

Fix primary keys to autoincrement.

batch_job, evaluation_dataset, and evaluation_run declare id as plain Integer NOT NULL with a separate PrimaryKeyConstraint. Postgres will not create a sequence/default in this setup, so inserts coming from SQLModel (which send id=None) will fail with NULL value in column "id" violates not-null constraint. Mark the columns as primary keys (or add sa.Identity()) so the database autogenerates IDs.

-        sa.Column("id", sa.Integer(), nullable=False),
+        sa.Column("id", sa.Integer(), primary_key=True),
...
-        sa.Column("id", sa.Integer(), nullable=False),
+        sa.Column("id", sa.Integer(), primary_key=True),
...
-        sa.Column("id", sa.Integer(), nullable=False),
+        sa.Column("id", sa.Integer(), primary_key=True),
🤖 Prompt for AI Agents
In backend/app/alembic/versions/6fe772038a5a_create_evaluation_run_table.py
around lines 24 to 214, the id columns for batch_job, evaluation_dataset, and
evaluation_run are defined as plain Integer with a separate PrimaryKeyConstraint
which prevents Postgres from creating sequences and causes NULL id insert
failures; update each table definition to make the id column a real primary key
with an autogenerating identity (either mark the id column as primary_key=True
in the sa.Column definition or add sa.Identity()/server_default identity
expression) so Postgres will auto-generate ids on insert, and remove or keep the
separate PrimaryKeyConstraint accordingly to avoid duplicate/conflicting PK
declarations.
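The fix the comment suggests can be sketched as follows, using `sa.Identity()` so Postgres autogenerates ids on insert; this is a trimmed illustration of one table, not the full migration:

```python
import sqlalchemy as sa

metadata = sa.MetaData()

# Trimmed illustration: id is declared as an Identity primary key directly
# on the column, so Postgres generates values for inserts that omit id.
# The real batch_job table has many more columns.
batch_job = sa.Table(
    "batch_job",
    metadata,
    sa.Column("id", sa.Integer(), sa.Identity(), primary_key=True),
    sa.Column("provider", sa.String(), nullable=False),
)
```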

op.create_index(
op.f("ix_evaluation_run_run_name"), "evaluation_run", ["run_name"], unique=False
)
op.create_index(
"idx_eval_run_status_org",
"evaluation_run",
["status", "organization_id"],
unique=False,
)
op.create_index(
"idx_eval_run_status_project",
"evaluation_run",
["status", "project_id"],
unique=False,
)


def downgrade():

⚠️ Potential issue | 🟡 Minor

Add return type hint to the downgrade function.

The function signature is missing a return type hint, which violates the project's coding guidelines for Python 3.11+.

As per coding guidelines.

Apply this diff:

-def downgrade():
+def downgrade() -> None:
🤖 Prompt for AI Agents
In backend/app/alembic/versions/d5747495bd7c_create_evaluation_run_table.py
around line 207, the downgrade() function is missing a return type hint; update
the function signature to include an explicit return type (-> None) to match the
project's Python 3.11+ typing guidelines and keep the rest of the function body
unchanged.

# Drop evaluation_run table first (has foreign keys to batch_job and evaluation_dataset)
op.drop_index("idx_eval_run_status_project", table_name="evaluation_run")
op.drop_index("idx_eval_run_status_org", table_name="evaluation_run")
op.drop_index(op.f("ix_evaluation_run_run_name"), table_name="evaluation_run")
op.drop_table("evaluation_run")

# Drop evaluation_dataset table
op.drop_index(op.f("ix_evaluation_dataset_name"), table_name="evaluation_dataset")
op.drop_table("evaluation_dataset")

# Drop batch_job table
op.drop_index("idx_batch_job_status_project", table_name="batch_job")
op.drop_index("idx_batch_job_status_org", table_name="batch_job")
op.drop_index(op.f("ix_batch_job_project_id"), table_name="batch_job")
op.drop_index(op.f("ix_batch_job_organization_id"), table_name="batch_job")
op.drop_index(op.f("ix_batch_job_job_type"), table_name="batch_job")
op.drop_table("batch_job")
2 changes: 1 addition & 1 deletion backend/app/api/deps.py
@@ -70,7 +70,7 @@ def get_current_user(
if not user:
raise HTTPException(status_code=404, detail="User not found")
if not user.is_active:
-        raise HTTPException(status_code=400, detail="Inactive user")
+        raise HTTPException(status_code=403, detail="Inactive user")

return user # Return only User object

80 changes: 80 additions & 0 deletions backend/app/api/docs/evaluation/create_evaluation.md
@@ -0,0 +1,80 @@
Start an evaluation using OpenAI Batch API.

This endpoint:
1. Fetches the dataset from the database and validates that it has a Langfuse dataset ID
2. Creates an EvaluationRun record in the database
3. Fetches dataset items from Langfuse
4. Builds JSONL for batch processing (config is used as-is)
5. Creates a batch job via the generic batch infrastructure
6. Returns the evaluation run details with batch_job_id

The batch will be processed asynchronously by Celery Beat (every 60s).
Use GET /evaluations/{evaluation_id} to check progress.

## Request Body

- **dataset_id** (required): ID of the evaluation dataset (from /evaluations/datasets)
- **experiment_name** (required): Name for this evaluation experiment/run
- **config** (optional): Configuration dict that will be used as-is in JSONL generation. Can include any OpenAI Responses API parameters like:
- model: str (e.g., "gpt-4o", "gpt-5")
- instructions: str
- tools: list (e.g., [{"type": "file_search", "vector_store_ids": [...]}])
- reasoning: dict (e.g., {"effort": "low"})
- text: dict (e.g., {"verbosity": "low"})
- temperature: float
- include: list (e.g., ["file_search_call.results"])
- Note: "input" will be added automatically from the dataset
- **assistant_id** (optional): Assistant ID to fetch configuration from. If provided, configuration will be fetched from the assistant in the database. Config can be passed as empty dict {} when using assistant_id.

## Example with config

```json
{
"dataset_id": 123,
"experiment_name": "test_run",
"config": {
"model": "gpt-4.1",
"instructions": "You are a helpful FAQ assistant.",
"tools": [
{
"type": "file_search",
"vector_store_ids": ["vs_12345"],
"max_num_results": 3
}
],
"include": ["file_search_call.results"]
}
}
```

## Example with assistant_id

```json
{
"dataset_id": 123,
"experiment_name": "test_run",
"config": {},
"assistant_id": "asst_xyz"
}
```

## Returns

EvaluationRunPublic with batch details and status:
- id: Evaluation run ID
- run_name: Name of the evaluation run
- dataset_name: Name of the dataset used
- dataset_id: ID of the dataset used
- config: Configuration used for the evaluation
- batch_job_id: ID of the batch job processing this evaluation
- status: Current status (pending, running, completed, failed)
- total_items: Total number of items being evaluated
- completed_items: Number of items completed so far
- results: Evaluation results (when completed)
- error_message: Error message if failed

## Error Responses

- **404**: Dataset or assistant not found or not accessible
- **400**: Missing required credentials (OpenAI or Langfuse), dataset missing Langfuse ID, or config missing required fields
- **500**: Failed to configure API clients or start batch evaluation
18 changes: 18 additions & 0 deletions backend/app/api/docs/evaluation/delete_dataset.md
@@ -0,0 +1,18 @@
Delete a dataset by ID.

This will remove the dataset record from the database. The CSV file in the object store (if it exists) will remain for audit purposes, but the dataset will no longer be accessible for creating new evaluations.

## Path Parameters

- **dataset_id**: ID of the dataset to delete

## Returns

Success message with deleted dataset details:
- message: Confirmation message
- dataset_id: ID of the deleted dataset

## Error Responses

- **404**: Dataset not found or not accessible to your organization/project
- **400**: Dataset cannot be deleted (e.g., has active evaluation runs)
22 changes: 22 additions & 0 deletions backend/app/api/docs/evaluation/get_dataset.md
@@ -0,0 +1,22 @@
Get details of a specific dataset by ID.

Retrieves comprehensive information about a dataset including metadata, object store URL, and Langfuse integration details.

## Path Parameters

- **dataset_id**: ID of the dataset to retrieve

## Returns

DatasetUploadResponse with dataset details:
- dataset_id: Unique identifier for the dataset
- dataset_name: Name of the dataset (sanitized)
- total_items: Total number of items including duplication
- original_items: Number of original items before duplication
- duplication_factor: Factor by which items were duplicated
- langfuse_dataset_id: ID of the dataset in Langfuse
- object_store_url: URL to the CSV file in object storage

## Error Responses

- **404**: Dataset not found or not accessible to your organization/project