
Evaluation #417

@AkhileshNegi

Description


Is your feature request related to a problem? Please describe.
Teams using Kaapi to generate AI responses need a repeatable, comparable, and auditable way to evaluate LLM answer quality against a golden Q&A set. Today this process is manual and scattered across tools, making it hard to:
(a) run consistent experiments (prompt/model/temperature/vector store),
(b) quantify flakiness, and
(c) view per-item traces alongside aggregate scores.
We need an automated pipeline that runs on a golden dataset, generates answers in batch, and computes scores that show how response quality changes with configuration.

Describe the solution you'd like
Build an end-to-end evaluation flow in Kaapi that:

  • Accepts a golden CSV of questions and answers and duplicates each question N=5 times to measure flakiness
  • Uses an /evaluate flow that reads a config (assistant settings) and starts the evaluation
  • Consolidates results (question, generated_output, ground_truth), archives them to S3, generates a score for each question-answer pair, and persists the scores in Kaapi's DB (see the sketch below)
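
As a rough illustration of the end-to-end flow, here is a minimal sketch, assuming pandas and boto3 are available; `generate_answer`, `score_pair`, and `persist_score` are hypothetical placeholders for Kaapi's assistant call, scoring logic, and DB layer, not actual Kaapi APIs.

```python
# Minimal sketch of the proposed evaluation flow (illustrative only).
import io
import pandas as pd
import boto3

N_RUNS = 5  # duplicate each golden question to measure flakiness

def run_evaluation(golden_csv: str, config: dict, bucket: str, key: str) -> pd.DataFrame:
    # 1. Load the golden Q&A set and duplicate each question N times.
    golden = pd.read_csv(golden_csv)  # expects columns: question, ground_truth
    batch = golden.loc[golden.index.repeat(N_RUNS)].reset_index(drop=True)

    # 2. Generate answers in batch using the assistant settings from `config`
    #    (prompt, model, temperature, vector store, ...).
    batch["generated_output"] = [
        generate_answer(q, config)  # hypothetical assistant call
        for q in batch["question"]
    ]

    # 3. Score each question-answer pair against its ground truth.
    batch["score"] = [
        score_pair(gt, out)  # hypothetical scoring function
        for gt, out in zip(batch["ground_truth"], batch["generated_output"])
    ]

    # 4. Archive the consolidated results to S3.
    buf = io.StringIO()
    batch[["question", "generated_output", "ground_truth", "score"]].to_csv(buf, index=False)
    boto3.client("s3").put_object(Bucket=bucket, Key=key, Body=buf.getvalue())

    # 5. Persist per-pair scores in Kaapi's DB.
    for row in batch.itertuples():
        persist_score(row.question, row.generated_output, row.ground_truth, row.score, config)

    return batch
```

Running the same pipeline with different configs would then yield directly comparable per-question scores, which is the comparison across prompt/model/temperature/vector store settings this issue asks for.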

Reference
Solution Doc

Metadata

Labels: enhancement (New feature or request)
Status: Closed