Why?
It is a challenge to perform automated testing for applications that use GenAI.
Automated testing of conventional software rests on a bedrock of predictability and consistency. Until now, we could test a function by asserting that it returns the expected output for a given set of inputs. With a GenAI product (say, a chatbot), the same input can produce several outputs that each differ slightly but may all be acceptable.
Teams facing this challenge often resort to having human domain experts evaluate their GenAI-based system. This approach can work, but it quickly runs into limits. If you want to improve your AI system iteratively, each change requires another round of validation by the expert. At some point, human-based validation becomes cumbersome and infeasible. We need automation: not of static assertions, but of judgement.
LLM-Judges are fast becoming the industry standard for closing this gap. The technique involves prompting an LLM-Judge to scrutinize the output of the AI system under test. In effect, the LLM-Judge is meant to be a reliable proxy for a human domain expert. For generic sets of questions, studies have found that LLM-Judges agree with human evaluators as often as, or more often than, human evaluators agree with each other. However, this is unlikely to carry over to domain-specific applications, since domain experts bring implicit specialist knowledge about specific regulatory environments and about cultural and historical context, which a generic LLM lacks. An uncalibrated LLM-Judge can therefore be a significant source of noise rather than signal. My first hypothesis is that an LLM-Judge must be calibrated by one or more such domain experts. I've provided details on how LLM-Judges can be calibrated in 'Background and additional information'.
Thus far, automated checks have been used merely to catch regressions, i.e. to verify whether some functionality or property of our application has broken. With a calibrated LLM-Judge, we have automated not merely static assertions but also, to some degree, judgement. We can then use the LLM-Judge to measure the baseline performance of the AI system and run experiments: change one part of the system and see whether it improves or degrades performance against this baseline. Given the stochastic nature of LLMs, each experiment needs to be repeated multiple times, which automated evaluation now makes feasible.
For example, for a customer service chatbot, if baseline performance is low, we can swap the underlying LLM and see whether performance improves. If performance is already high, we can switch to a cheaper model and see whether performance is sustained, thereby cutting token costs. We can try several versions of the system prompt and see which produces the best results. For a RAG system, we can try out different chunking or retrieval strategies. Every change that improves on the baseline can be retained, while every change that degrades it can be discarded.
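The experiment loop described above can be sketched as follows. This is a minimal illustration, not the method used in the demo: `run_experiment`, `compare`, the `margin` parameter, and the rule of treating a difference as real only when it exceeds the combined run-to-run spread are all assumptions made for the example. The rule is a rough heuristic, not a statistical test.

```python
import random
from statistics import mean, stdev


def pass_rate(verdicts):
    """Fraction of responses the LLM-Judge marked as acceptable."""
    return sum(verdicts) / len(verdicts)


def run_experiment(judge_run, repeats):
    """Repeat a full evaluation run and return the pass rate of each run.

    judge_run is any callable that returns a list of True/False verdicts,
    one per ground truth case (hypothetical interface).
    """
    return [pass_rate(judge_run()) for _ in range(repeats)]


def compare(baseline_rates, variant_rates, margin=0.02):
    """Classify a variant as 'retain', 'discard', or 'inconclusive'.

    Assumption: a difference only counts if it is larger than both a fixed
    margin and the combined standard deviation of the two sets of runs.
    Each list needs at least two runs for stdev to be defined.
    """
    diff = mean(variant_rates) - mean(baseline_rates)
    threshold = max(margin, stdev(baseline_rates) + stdev(variant_rates))
    if diff > threshold:
        return "retain"
    if diff < -threshold:
        return "discard"
    return "inconclusive"


def simulated_run(true_rate, n_cases, rng):
    """Stand-in for a real evaluation run: random verdicts at a given rate."""
    return [rng.random() < true_rate for _ in range(n_cases)]


rng = random.Random(0)
baseline = run_experiment(lambda: simulated_run(0.70, 100, rng), repeats=5)
variant = run_experiment(lambda: simulated_run(0.80, 100, rng), repeats=5)
print(compare(baseline, variant))
```

For the cost-cutting case, an "inconclusive" result is the desired outcome: it suggests the cheaper model sustains performance within the noise of the measurement.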
In effect, my second hypothesis is that a calibrated LLM-Judge could go beyond being a means of catching defects and could actually serve as a calibrated instrument for continuous improvement of an AI system. I presented this hypothesis in my recent talk at EuroSTAR 2026. In this repo, you can preview the live demo I showed during the talk, which demonstrates the LLM-Judge calibration process.
The motivation for this research task is to test both hypotheses and gather evidence that supports or opposes them.
Expected results
I'd like to have a more nuanced understanding of:
The need for LLM-Judge calibration.
Whether calibrated LLM-Judges can be used as instruments for continuous improvement, and if so, under what conditions.
Collaboration
I've formulated these hypotheses by reading the literature and evaluating prototypes with LLM-Judges. However, I lack experience of using LLM-Judges in large-scale, production-grade applications. What I need from you:
Critique / complement the two hypotheses that I have proposed here
Share real-world experience that supports or opposes the premises I have presented here
Share related research papers / articles you've read on the topic
Target artifact
An article / paper that will likely turn into a talk (with a demo)
Background and additional information
LLM-Judge Calibration
To calibrate an LLM-Judge, domain experts formulate a set of test cases for the AI system (known as ground truths for conversational AI systems, or more generically as a golden dataset), along with evaluation criteria (rubrics) for each test case, akin to acceptance criteria. Based on these rubrics, the LLM-Judge evaluates whether the AI system's response is acceptable.
Consider this example, in the context of an airline customer service chatbot system:
Question in Ground Truth set:
I am flying long-haul on an Economy Light ticket, and I want to check in an additional bag. How much will that cost me?
Evaluation criteria:
Economy Light on long-haul (6 hours or more) includes one checked bag (23 kg) in the fare
An additional bag (the second bag) costs EUR 45
Pre-purchasing the additional bag online is approximately 30% cheaper than purchasing at the airport
The AI system's response is then evaluated against these criteria. This could be carried out using a simple UI as shown above.
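One way to represent such a test case and turn it into a prompt for the LLM-Judge is sketched below. The names `GroundTruthCase`, `build_judge_prompt`, and `is_acceptable`, the prompt wording, and the rule that a response is acceptable only if every criterion is met are assumptions made for illustration; the demo in the repo may do this differently.

```python
from dataclasses import dataclass


@dataclass
class GroundTruthCase:
    """One test case written by a domain expert (hypothetical structure)."""
    question: str
    criteria: list[str]


def build_judge_prompt(case, response):
    """Assemble the text sent to the LLM-Judge for one response."""
    numbered = "\n".join(
        f"{i}. {criterion}" for i, criterion in enumerate(case.criteria, start=1)
    )
    return (
        "You are evaluating an answer given by a customer service chatbot.\n\n"
        f"Question:\n{case.question}\n\n"
        f"Answer under test:\n{response}\n\n"
        f"Evaluation criteria:\n{numbered}\n\n"
        "For each criterion, reply MET or NOT MET on its own line."
    )


def is_acceptable(criterion_verdicts):
    """Assumed rule: acceptable only if there are verdicts and all are met."""
    return bool(criterion_verdicts) and all(criterion_verdicts)


example_case = GroundTruthCase(
    question=(
        "I am flying a long-haul flight on an economy light ticket, and I want "
        "to check-in an additional bag. How much will that cost me?"
    ),
    criteria=[
        "Economy Light on long-haul (6 hours or more) includes one checked bag (23 kg) in the fare",
        "An additional bag (the second bag) costs EUR 45",
        "Pre-purchasing the additional bag online is approximately 30% cheaper than purchasing at the airport",
    ],
)

print(build_judge_prompt(example_case, "A second bag costs EUR 45."))
```

Keeping the criteria as separate items, rather than one block of text, lets the expert and the LLM-Judge be compared criterion by criterion when they disagree.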
A ground truth set could contain 50-100 or more such test cases, covering a wide range of representative situations that the AI system is likely to encounter. These test cases are fed to the AI system to generate responses. The responses are then evaluated by both a domain expert and an LLM-Judge.
By aggregating the results, the rate of agreement between the domain expert and the LLM-Judge can be measured.
This exercise will often involve adjusting the LLM-Judge by tweaking its system prompt or the evaluation rubrics for individual questions. At times, it will also reveal that the questions or the accompanying documentation need to be adapted to make explicit some of the implicit knowledge that domain experts carry.
One caveat is worth mentioning here. Any LLM-Judge is a proxy for the judgement of a domain expert. The more the AI system changes, the more the LLM-Judge's judgement is likely to drift away from that of the domain expert. Further, the domain experts are themselves estimating the preferences of real users. Once the system is exposed to real users, they will invariably use it in ways that even domain experts couldn't have predicted. These unanticipated usage patterns also cause the LLM-Judge's evaluation to drift away from what real users care about, so the set of ground truths will need to be expanded or adapted to accommodate them. Just as with any other measurement instrument, the LLM-Judge calibration exercise must be repeated periodically.
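As a sketch of how that agreement could be computed, the snippet below reports raw percent agreement alongside Cohen's kappa, which corrects for the agreement two raters would reach by chance. Using kappa is a suggestion on my part, not something the calibration process above prescribes, and the verdict lists are made-up illustration data.

```python
from collections import Counter


def percent_agreement(expert, judge):
    """Share of test cases on which expert and LLM-Judge gave the same verdict."""
    if not expert or len(expert) != len(judge):
        raise ValueError("verdict lists must be non-empty and of equal length")
    return sum(e == j for e, j in zip(expert, judge)) / len(expert)


def cohens_kappa(expert, judge):
    """Agreement corrected for chance: (p_o - p_e) / (1 - p_e)."""
    n = len(expert)
    p_o = percent_agreement(expert, judge)
    expert_counts = Counter(expert)
    judge_counts = Counter(judge)
    # Chance agreement: probability that both raters pick the same label
    # if each labels independently at their own observed rates.
    p_e = sum(expert_counts[label] * judge_counts[label] for label in expert_counts) / (n * n)
    if p_e == 1.0:
        # Both raters used a single identical label throughout; kappa is
        # undefined here, so 1.0 is returned by convention (an assumption).
        return 1.0
    return (p_o - p_e) / (1 - p_e)


# Made-up verdicts for ten test cases (True = acceptable).
expert_verdicts = [True, True, True, False, False, True, False, True, True, False]
judge_verdicts = [True, True, False, False, False, True, True, True, True, False]

print(percent_agreement(expert_verdicts, judge_verdicts))  # 0.8
print(round(cohens_kappa(expert_verdicts, judge_verdicts), 2))  # 0.58
```

The gap between the two numbers shows why raw agreement alone can flatter the LLM-Judge when most responses fall into one class.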
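A periodic check of this kind could look like the sketch below: a domain expert re-labels a fresh sample of responses, and recalibration is flagged when agreement with the LLM-Judge has clearly fallen below a target. The target of 0.85, the function names, and the decision rule (flag only when the upper bound of a 95% Wilson score interval is below the target) are all assumptions for illustration.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a proportion (z=1.96 gives roughly 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half_width, centre + half_width


def needs_recalibration(agreements, n, min_agreement=0.85):
    """Flag recalibration when agreement is clearly below the target.

    agreements: number of freshly labelled cases where the expert and the
    LLM-Judge gave the same verdict, out of n cases.
    """
    _, upper = wilson_interval(agreements, n)
    return upper < min_agreement


print(needs_recalibration(40, 50))  # False: 80% agreement is within noise of the target
print(needs_recalibration(30, 50))  # True: 60% agreement is clearly below it
```

With small samples the interval is wide, so a stricter rule (for example, flagging whenever the lower bound is below the target) may suit teams that prefer to recalibrate early.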