Describe the current behavior
Text evaluation depends on Langfuse for LLM-as-a-judge scoring. Each evaluator must be configured manually in the Langfuse UI because the Langfuse SDK doesn't expose an API for setting up evaluators, which makes this a recurring manual step for every new organization. Langfuse was originally chosen because Kaapi lacked batch evaluation runs and a UI to surface results.
Describe the enhancement you'd like
Decouple the evaluation pipeline from Langfuse:
- Run LLM-as-a-judge scoring natively inside Kaapi (judge prompts, model calls, scoring schema managed in our DB/codebase)
- Persist evaluator definitions and results in Kaapi so new evaluators can be created via API/UI without touching Langfuse
- Retain Langfuse only for tracing/observability, not as a hard dependency for evaluation execution
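To make the proposal concrete, here is a minimal sketch of what native judge scoring could look like. Everything in it is an assumption for illustration: `EvaluatorDefinition`, `EvaluationResult`, `run_evaluator`, the JSON reply format, and the 0 to 1 score range are hypothetical names and choices, not existing Kaapi code. The model call is passed in as a plain callable so the sketch stays independent of any provider SDK.

```python
import json
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class EvaluatorDefinition:
    """Hypothetical evaluator record, as it might be persisted in Kaapi's DB."""
    name: str
    judge_prompt: str  # template with {question} and {answer} placeholders
    min_score: float = 0.0
    max_score: float = 1.0


@dataclass(frozen=True)
class EvaluationResult:
    """Hypothetical result row produced by one judge call."""
    evaluator: str
    score: float
    reasoning: str


def run_evaluator(
    definition: EvaluatorDefinition,
    question: str,
    answer: str,
    call_model: Callable[[str], str],
) -> EvaluationResult:
    """Render the judge prompt, call the model, and validate the reply.

    Assumes the judge replies with a JSON object holding "score" and
    "reasoning" keys; a real implementation would need retry and
    malformed-output handling.
    """
    prompt = definition.judge_prompt.format(question=question, answer=answer)
    payload = json.loads(call_model(prompt))
    score = float(payload["score"])
    if not definition.min_score <= score <= definition.max_score:
        raise ValueError(f"score {score} outside allowed range")
    return EvaluationResult(
        evaluator=definition.name,
        score=score,
        reasoning=str(payload.get("reasoning", "")),
    )


# Usage with a stub in place of a real model call.
relevance = EvaluatorDefinition(
    name="relevance",
    judge_prompt=(
        "Rate from 0 to 1 how well the answer addresses the question. "
        "Reply as JSON with keys score and reasoning.\n"
        "Question: {question}\nAnswer: {answer}"
    ),
)


def stub_model(prompt: str) -> str:
    # Stands in for an LLM call; always returns the same judgement.
    return json.dumps({"score": 0.8, "reasoning": "mostly on topic"})


result = run_evaluator(relevance, "What is Kaapi?", "A platform.", stub_model)
```

Because the evaluator is plain data, creating one via API/UI would amount to inserting a row, and Langfuse would only need to receive the trace of the judge call, if at all.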
Why is this enhancement needed?
- Kaapi now has both batch operations and a results UI, removing the original justification for the Langfuse dependency
- No manual Langfuse setup per evaluator
- Evaluations runnable end-to-end via Kaapi APIs
- Faster iteration on new judge prompts and scoring criteria
Additional context
Langfuse remains valuable for tracing and observability; this change removes it only from the evaluation execution path.