Closed
Labels
enhancement (New feature or request)
Description
Is your feature request related to a problem? Please describe.
Teams using Kaapi to generate AI responses need a repeatable, comparable, and auditable way to evaluate LLM answer quality against a golden Q&A set. Today this process is manual and scattered across tools, making it hard to:
(a) run consistent experiments (prompt/model/temperature/vector store),
(b) quantify flakiness, and
(c) view per-item traces alongside aggregate scores.
We need an automated pipeline that runs on a golden dataset, generates answers in batch, and computes scores that show how response quality changes with configuration.
Describe the solution you'd like
Build an end-to-end evaluation flow in Kaapi that:
- Accepts a golden CSV of questions and answers and duplicates each question N=5 times to measure flakiness (see the ingestion sketch after this list)
- Uses an /evaluate flow that reads a config (assistant settings) and starts the evaluation (an illustrative config shape is sketched below)
- Consolidates results (question, generated_output, ground_truth), archives them to S3, generates a score for each question-answer pair, and persists the scores in Kaapi's DB (see the consolidation sketch below)
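A minimal sketch of the golden-CSV ingestion and N=5 duplication, assuming Python and a CSV with `question` and `answer` columns; the column names, the `GoldenItem` dataclass, and the value of N are illustrative, not Kaapi's actual schema:

```python
import csv
from dataclasses import dataclass

N_RUNS = 5  # each golden question is duplicated N times to measure flakiness


@dataclass
class GoldenItem:
    question: str
    ground_truth: str
    run_index: int  # which of the N duplicates this row is


def load_golden_dataset(path: str, n_runs: int = N_RUNS) -> list[GoldenItem]:
    """Read the golden CSV and expand each row into n_runs copies."""
    items: list[GoldenItem] = []
    with open(path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            for i in range(n_runs):
                items.append(GoldenItem(row["question"], row["answer"], i))
    return items
```

Running each question five times makes it possible to report a per-question score spread as the flakiness signal alongside the aggregate score.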
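An illustrative shape for the config the /evaluate flow could read; the field names are assumptions about the assistant settings (prompt/model/temperature/vector store), not Kaapi's real config format:

```python
from dataclasses import dataclass


@dataclass
class EvalConfig:
    prompt: str            # system prompt the assistant runs with
    model: str             # model under test
    temperature: float     # sampling temperature under test
    vector_store_id: str   # vector store backing retrieval


# Example run configuration; values are placeholders.
config = EvalConfig(
    prompt="Answer strictly from the provided context.",
    model="gpt-4o-mini",
    temperature=0.2,
    vector_store_id="vs_golden_faq",
)
```

Keeping these settings in one config object makes runs comparable: two evaluation runs differ only by the config they were started with.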
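A rough sketch of the consolidation step under stated assumptions: the string-similarity metric is only a stand-in for the real scorer, the S3 bucket/key layout is hypothetical, and sqlite stands in for Kaapi's DB:

```python
import csv
import difflib
import io
import sqlite3

import boto3


def score(generated: str, ground_truth: str) -> float:
    # Placeholder metric; the real pipeline would plug in an LLM or semantic scorer.
    return difflib.SequenceMatcher(None, generated, ground_truth).ratio()


def consolidate(results: list[dict], bucket: str, run_id: str) -> None:
    # Score every (generated_output, ground_truth) pair.
    for r in results:
        r["score"] = score(r["generated_output"], r["ground_truth"])

    # Archive the consolidated results as a CSV in S3.
    buf = io.StringIO()
    writer = csv.DictWriter(
        buf,
        fieldnames=["question", "generated_output", "ground_truth", "score"],
        extrasaction="ignore",
    )
    writer.writeheader()
    writer.writerows(results)
    boto3.client("s3").put_object(
        Bucket=bucket,
        Key=f"evaluations/{run_id}.csv",
        Body=buf.getvalue().encode("utf-8"),
    )

    # Persist per-question scores; sqlite stands in for Kaapi's DB here.
    con = sqlite3.connect("kaapi_eval.db")
    con.execute(
        "CREATE TABLE IF NOT EXISTS eval_scores (run_id TEXT, question TEXT, score REAL)"
    )
    con.executemany(
        "INSERT INTO eval_scores VALUES (?, ?, ?)",
        [(run_id, r["question"], r["score"]) for r in results],
    )
    con.commit()
    con.close()
```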
Reference
Solution Doc