
Evaluation #417

@AkhileshNegi

Description


Is your feature request related to a problem? Please describe.
Teams using Kaapi to generate AI responses need a repeatable, comparable, and auditable way to evaluate LLM answer quality against a golden Q&A set. Today this process is manual and scattered across tools, making it hard to:
(a) run consistent experiments (prompt/model/temperature/vector store),
(b) quantify flakiness, and
(c) view per-item traces alongside aggregate scores.
We need an automated pipeline that runs on a golden dataset, generates answers in batch, and computes scores that show how response quality changes with configuration.

Describe the solution you'd like
Build an end-to-end evaluation flow in Kaapi that:

  • Accepts a golden CSV of questions and answers and duplicates each question N=5 times to measure flakiness
  • Uses an /evaluate flow that reads a config (assistant settings) and starts the evaluation
  • Consolidates results (question, generated_output, ground_truth), archives them to S3, generates a score for each question-answer pair, and persists the scores in Kaapi's DB (see the sketch below)
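
As a rough illustration of the end-to-end flow, here is a minimal sketch, assuming pandas and boto3 are available; `generate_answer`, `score_pair`, and `persist_score` are hypothetical placeholders for Kaapi's assistant call, scoring logic, and DB layer, not actual Kaapi APIs.

```python
# Minimal sketch of the proposed evaluation flow (illustrative only).
import io
import pandas as pd
import boto3

N_RUNS = 5  # duplicate each golden question to measure flakiness

def run_evaluation(golden_csv: str, config: dict, bucket: str, key: str) -> pd.DataFrame:
    # 1. Load the golden Q&A set and duplicate each question N times.
    golden = pd.read_csv(golden_csv)  # expects columns: question, ground_truth
    batch = golden.loc[golden.index.repeat(N_RUNS)].reset_index(drop=True)

    # 2. Generate answers in batch using the assistant settings from `config`
    #    (prompt, model, temperature, vector store, ...).
    batch["generated_output"] = [
        generate_answer(q, config)  # hypothetical assistant call
        for q in batch["question"]
    ]

    # 3. Score each question-answer pair against its ground truth.
    batch["score"] = [
        score_pair(gt, out)  # hypothetical scoring function
        for gt, out in zip(batch["ground_truth"], batch["generated_output"])
    ]

    # 4. Archive the consolidated results to S3.
    buf = io.StringIO()
    batch[["question", "generated_output", "ground_truth", "score"]].to_csv(buf, index=False)
    boto3.client("s3").put_object(Bucket=bucket, Key=key, Body=buf.getvalue())

    # 5. Persist per-pair scores in Kaapi's DB.
    for row in batch.itertuples():
        persist_score(row.question, row.generated_output, row.ground_truth, row.score, config)

    return batch
```

Running the same pipeline with different configs would then yield directly comparable per-question scores, which is the comparison across prompt/model/temperature/vector store settings this issue asks for.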

Reference
Solution Doc

Metadata

Labels: enhancement (New feature or request)
Status: Closed