Feature Proposal: Real-Time Deterministic RAG Telemetry (Confidence & Risk Scoring) #95
Pandidharan22 started this conversation in Ideas
Replies: 2 comments
Super sorry for taking so long to reply; it's been a busy week for us. If you're still interested, I think the idea sounds solid. You could open a PR for it and test it, and we will also see how the approach compares with what we have planned for the eval.
@Pandidharan22 I'd suggest opening a pull request with your idea. If you can include a benchmark result in the pull request that shows how effective the approach is, that would be super helpful. Thanks in advance for the contribution!
Hi Team,
I've been going through your project's goals, and the problem statement "Is the RAG system returning relevant context or noise?" caught my eye. To complement your post-hoc LLM evaluation engine, I'd like to propose adding Real-Time Deterministic RAG Telemetry.
While an LLM-as-a-judge is really good for deep semantic grading, it adds API cost and latency, and it only runs after generation. Two deterministic metrics, Retrieval Confidence and Hallucination Risk, can be calculated directly from the vector similarity scores and their spread, which gives us a near-zero-latency warning at the moment of retrieval. This would let devs tell retrieval noise apart from generation failures.
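To make the two metrics concrete, here is a minimal sketch of how they could be derived from the top-k similarity scores of a retrieval. The formulas are assumptions for discussion, not a settled definition: confidence is taken as the mean similarity, and risk blends low confidence with a high spread (population standard deviation). The function names, the `spread_weight` parameter, and the assumption that similarities lie in [0, 1] are all hypothetical.

```python
from statistics import mean, pstdev


def retrieval_confidence(similarities):
    """Assumed formula: mean top-k similarity, clamped to [0, 1]."""
    if not similarities:
        return 0.0  # nothing retrieved, so no confidence
    return max(0.0, min(1.0, mean(similarities)))


def hallucination_risk(similarities, spread_weight=0.5):
    """Assumed formula: risk grows as confidence falls and as spread grows."""
    if not similarities:
        return 1.0  # no context at all is treated as maximum risk
    confidence = retrieval_confidence(similarities)
    # For values in [0, 1] the population std dev is at most 0.5,
    # so doubling it maps the spread onto [0, 1].
    normalized_spread = min(1.0, 2.0 * pstdev(similarities))
    risk = (1.0 - spread_weight) * (1.0 - confidence) + spread_weight * normalized_spread
    return max(0.0, min(1.0, risk))


if __name__ == "__main__":
    tight = [0.90, 0.90, 0.90]   # consistent, relevant chunks
    noisy = [0.20, 0.90, 0.10]   # one good chunk among noise
    print(retrieval_confidence(tight), hallucination_risk(tight))
    print(retrieval_confidence(noisy), hallucination_risk(noisy))
```

Both functions are pure arithmetic over scores the vector store already returns, so they add no API calls at retrieval time.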
My proposed implementation:
I want to make sure that this solution plugs into the existing ingestion flow without any disruption:
(I) Add retrieval_confidence and hallucination_risk fields to the "SpanIngest" schema.
(II) Store these metrics alongside the existing token/latency data.
(III) Finally, add a REST endpoint that aggregates these metrics, and render a "Risk vs Confidence" scatterplot in the Agent metrics dashboard.
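The three steps above could look roughly like the sketch below. It is framework-agnostic and makes several assumptions: apart from the class name and the two proposed fields, every field (`span_id`, `total_tokens`, `latency_ms`) is a placeholder, since the real "SpanIngest" schema may differ, and `summarize_rag_telemetry` is a hypothetical helper whose return value a REST endpoint could serialize for the scatterplot.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SpanIngest:
    # Placeholder fields standing in for the existing schema (assumed).
    span_id: str
    total_tokens: int
    latency_ms: float
    # Proposed fields; optional so spans from existing clients still validate.
    retrieval_confidence: Optional[float] = None
    hallucination_risk: Optional[float] = None


def summarize_rag_telemetry(spans):
    """Aggregate scored spans into averages plus scatterplot points."""
    scored = [
        s for s in spans
        if s.retrieval_confidence is not None and s.hallucination_risk is not None
    ]
    n = len(scored)
    return {
        "count": n,
        "avg_confidence": sum(s.retrieval_confidence for s in scored) / n if n else None,
        "avg_risk": sum(s.hallucination_risk for s in scored) / n if n else None,
        # One point per span: x = confidence, y = risk.
        "points": [
            {"span_id": s.span_id, "confidence": s.retrieval_confidence, "risk": s.hallucination_risk}
            for s in scored
        ],
    }


if __name__ == "__main__":
    spans = [
        SpanIngest("a", 120, 35.0, retrieval_confidence=0.75, hallucination_risk=0.25),
        SpanIngest("b", 300, 80.0, retrieval_confidence=0.25, hallucination_risk=0.75),
        SpanIngest("c", 90, 20.0),  # legacy span without telemetry
    ]
    print(summarize_rag_telemetry(spans))
```

Keeping the new fields optional means spans without telemetry are simply skipped by the aggregation, so the ingestion flow for existing clients is unchanged.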
I have already mapped this out locally and would like to take ownership of building it end-to-end. I'd like to hear your thoughts on this architecture. Do you have any feedback, alternative suggestions, or tweaks to the ingestion strategy before I start building?