EvalAssist is a web-based tool designed to simplify the use of large language models (LLMs) as evaluators, also known as "LLM-as-a-Judge." It enables users to iteratively design and refine evaluation criteria and run evaluations of LLM outputs at scale. EvalAssist leverages the open-source Unitxt evaluation framework and a multi-step prompt-chaining approach. Once criteria are finalized, users can export them to a Jupyter notebook for large-scale evaluation.
Supports custom criteria development for both direct assessment and pairwise comparison evaluation methods, giving users the flexibility to choose what fits their use case best.
Direct assessment is an evaluation paradigm in which the LLM chooses one of a set of predefined values from an evaluation rubric. This approach can be used to perform Likert-scale scoring of responses (e.g., 1-5) or to assign one of a set of semantically conditioned literals (Yes/No, Pass/Fail, etc.). Direct assessment criteria consist of a criteria title, a criteria expression (typically a short statement or question), and a set of predefined options with optional descriptions. Option descriptions can make it easier for the LLM evaluator to distinguish between options.
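For illustration, a direct assessment criterion can be thought of as a small structured object. The Python sketch below is purely schematic; the field names are illustrative and do not reflect EvalAssist's internal schema.
# Schematic direct assessment criterion: title, expression, and options
# with optional descriptions (illustrative structure only).
conciseness_criterion = {
    "title": "Conciseness",
    "expression": "Is the response concise and free of unnecessary repetition?",
    "options": {
        "Yes": "The response conveys the relevant information without superfluous content.",
        "No": "The response contains redundant or irrelevant content.",
    },
}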
Pairwise Comparison is an evaluation paradigm wherein the LLM chooses the preferred response from a pair of candidate responses. At least two responses are required to run a pairwise comparison in the Evaluation Sandbox. Pairwise Comparison criteria are simpler, requiring only a criteria title and a criteria expression or description. In EvalAssist, pairwise comparisons involving more than two responses are performed by evaluating all possible response pairs, calculating win rates, and ranking the outputs accordingly.
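To make the ranking step concrete, the sketch below shows one way win rates can be derived from all-pairs comparisons. The judge function is a hypothetical placeholder, not EvalAssist's implementation.
# Illustrative all-pairs ranking by win rate. `judge(a, b)` is assumed to
# return 0 or 1, the index of the preferred response in the pair.
from itertools import combinations

def rank_by_win_rate(responses, judge):
    wins = {i: 0 for i in range(len(responses))}
    for i, j in combinations(range(len(responses)), 2):
        winner = (i, j)[judge(responses[i], responses[j])]
        wins[winner] += 1
    pairs_per_response = len(responses) - 1  # each response appears in this many pairs
    win_rates = {i: wins[i] / pairs_per_response for i in wins}
    # Highest win rate first.
    return sorted(win_rates.items(), key=lambda kv: kv[1], reverse=True)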
EvalAssist includes a built-in synthetic test case generator to help users improve evaluation criteria. It generates examples that:
- Explore boundary cases between options and generate variations for options
- Simulate realistic variations across domains and roles
This feature is especially useful for tasks like Q&A, summarization, and text generation in general.
Empowers users to inspect outcomes with built-in trustworthiness metrics such as positional bias and model-generated explanations. Our evaluators are capable of detecting positional bias. Positional bias occurs when the LLM evaluator favors one option based on its position, regardless of the actual quality of the response. Results that exhibit positional bias are uncertain and cannot be trusted. You may see sandbox evaluation results in red, indicating that positional bias has been detected. Using Chain-of-Thought techniques also generates explanations that can be used to validate the LLM evaluator's reasoning.
EvalAssist is built on top of the open-source Unitxt library. The user experience allows you to refine a single criterion with multiple data points and multiple contexts. Once you feel your criterion is ready for larger-scale testing, you can take it outside EvalAssist by downloading a Jupyter notebook or Python code (both based on Unitxt). This way you can run bulk evaluations and have finer-grained, scalable control over your LLM-assisted evaluation.
EvalAssist integrates a range of general and specialized LLM judges from multiple providers (OpenAI, WatsonX, Ollama, etc.). Models include, for example, IBM Granite Guardian, Llama, Mixtral, and OpenAI GPTs. Optionally, you can add any custom model for the available providers. We use a chained prompting process to ensure consistent and accurate evaluations. See below for more details.
EvalAssist includes a small catalog of test cases to get you started and allows you to save your own test cases. A separate section in the catalog focuses on detecting harms and risks and shows a set of examples that run with Granite Guardian as a specialized model for direct assessment of harms and risks in user messages, assistant responses, and RAG hallucinations. If you develop criteria that you believe could be of benefit to the larger community, consider contributing them to Unitxt.
Follow the installation instructions in the README in order to run EvalAssist locally.
In order to run evaluations, you will need to enter at least one set of provider credentials. You can use either REST API-based models or local models. We support multiple LLM service providers, such as WatsonX, Ollama, OpenAI, and Azure. You can enter API credentials in the UI next to the model selector in the API Credentials dialog, as shown below.
EvalAssist lets you assign distinct models for Evaluation and for Synthetic Data Generation, acknowledging that different tasks may benefit from specialized models. The Evaluation Model is used for LLM-as-a-Judge evaluations, while the Synthetic Data Generation Model is used for Test Data Generation and Direct AI Actions.
To run LLM-as-a-judge evaluations in EvalAssist, you will need to provide:
- Criteria title
- Criteria description (this is often a question or short description expressing the requirement of the criteria)
- Options for the evaluator LLM to choose from (only for direct assessment)
- Evaluator LLM
- Test data: In direct assessment, each data point consists of a single response field along with a set of context values. In contrast, pairwise comparison data points consist of a collection of responses to be compared, along with a set of context values (a schematic example follows this list). Additional data points can be added row by row. You can also use synthetic data generation (the "Generate test data" button) to produce more examples to test your criteria on (currently only available in direct assessment test cases).
- Response variable: The field to be evaluated is denoted as 'Response' by default ('Response 1', 'Response 2', etc. in pairwise comparison test cases). Depending on your use case, you may want to change the variable name.
- Context variables (optional): Depending on your use case you can define the value of the default context variable (e.g. a question to an LLM, or an article to summarize). You can also change the name of the 'context' variable to something else, if it is more meaningful in terms of referencing these values in your criteria definition. You can add additional context variables as needed. Note that using context variables and references in your criteria definition helps the evaluator LLM to attend to the correct information when making a scoring decision.
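As a rough illustration of these fields, the following Python sketch shows what a direct assessment data point and a pairwise comparison data point could look like. The structure is hypothetical and only meant to convey the response/context distinction.
# Hypothetical shape of test data points (illustrative only).
direct_assessment_data_point = {
    "context": {"question": "What is the capital of France?"},
    "response": "Paris is the capital of France.",
}

pairwise_comparison_data_point = {
    "context": {"question": "What is the capital of France?"},
    "responses": [
        "Paris is the capital of France.",
        "The capital of France is Paris, a city on the Seine.",
    ],
}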
The screenshot below shows you an example for the criteria 'Conciseness' from our Test Data catalog. The Mini-Tutorial below illustrates the process through a series of screenshots.
EvalAssist supports two ways of generating test data. You can create new test data or modify existing test data inline. The Mini-Tutorial below illustrates the process through a series of screenshots.
Below the Test Data section in the UI, the "Generate Test Data" button provides an easy way to add additional rows of test data that can help you refine your criteria. When you first press the button, you will need to configure the test data generation settings, which involves choosing a task, a domain, and a user role. Furthermore, you can specify how many new data points you want for each option and for borderline cases. The test data generation configuration can be changed at any time by pressing the configuration button. Note that the configuration settings are persisted with the test case when you save it. The test data generation options that can be configured are described below; a schematic configuration example follows them.
Selecting the appropriate task type is crucial for achieving optimal evaluation results. The evaluation process is sensitive to the names of the response and context variables; choosing clear and descriptive names for these variables can significantly enhance performance.
- Generic/Unstructured: Use this option for general test data generation that doesn't fit into the other predefined categories. This task type allows for any combination of context and response variables, offering maximum flexibility for diverse or unconventional tasks.
- Question Answering: Designed for tasks where the context variable represents a question, and the response variable contains the corresponding answer.
- Summarization: Intended for tasks where the context variable holds the original text to be summarized. The response variable should contain the generated summary.
- Text Generation: This task is designed for scenarios where the system generates new text based on a given prompt or context. The context variable may or may not contain the initial input or prompt, and the response variable should hold the generated text. This setup is ideal for evaluating models that perform open-ended text generation, such as story writing, dialogue generation, or creative content creation.
The domain specifies the subject area or field relevant to the evaluation task. It guides the generation of context and response variables, ensuring that the content aligns with the intended topic. Selecting an appropriate domain helps tailor the evaluation to specific knowledge areas, such as healthcare, news media, or customer support.
The persona defines the role or identity that the AI system should assume when generating responses. It influences the tone, style, and perspective of the output, making interactions more relatable and contextually appropriate.
The expected length of the generated response. The options are:
- Short: 1-2 sentences
- Medium: 3-5 sentences
- Long: 5-9 sentences
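Taken together, a test data generation configuration can be pictured as a small set of settings. The Python sketch below uses hypothetical keys to summarize the options described above; it is not the exact settings dialog.
# Hypothetical summary of a test data generation configuration.
generation_config = {
    "task": "Question Answering",        # or Generic/Unstructured, Summarization, Text Generation
    "domain": "Healthcare",
    "persona": "Patient",
    "response_length": "Medium",         # Short: 1-2, Medium: 3-5, Long: 5-9 sentences
    "data_points_per_option": {"Yes": 2, "No": 2, "Borderline": 1},
}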
You can modify specific parts of an existing test case by highlighting text within the context or response fields. Upon highlighting, an action menu will appear, offering options to rephrase the selected text, regenerate it, shorten it, or expand it.
The evaluation sandbox supports iterating over a single criterion with multiple outputs to be evaluated and multiple task contexts shared across those outputs, for smaller amounts of data. Once you are ready to evaluate larger data sets with diverse contexts, you can download an auto-generated Jupyter notebook or Python code based on the Unitxt toolkit, giving you full control over the execution.
Unitxt LLM-as-a-Judge Evaluators are implemented using a prompt chaining strategy, as illustrated in the figure below. The evaluator LLM is first asked to generate an assessment (explanation) of the response under evaluation, given the context and the evaluation criteria, using a Chain-of-Thought approach. In a second step, the evaluator LLM is prompted to make a final decision based on the assessment and the available options (the response winner in pairwise comparison, or the selected option in direct assessment). As part of this process, we also calculate meta-evaluation metrics such as positional bias to determine how robust the overall evaluation is. Positional bias is detected by changing the order of the options presented to the evaluator LLM, re-running the evaluation, and checking whether the result stays consistent.
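As a conceptual illustration of this chain (not the actual Unitxt implementation), the Python sketch below assumes a generic llm callable that takes a prompt string and returns text:
# Conceptual two-step prompt chain with a positional bias check
# (hypothetical helpers; not the Unitxt evaluator code).
def judge_once(llm, context, response, criterion, options):
    # Step 1: Chain-of-Thought assessment of the response against the criterion.
    assessment = llm(
        f"Criterion: {criterion}\nContext: {context}\nResponse: {response}\n"
        "Assess the response against the criterion, step by step."
    )
    # Step 2: final decision based on the assessment and the available options.
    decision = llm(
        f"Assessment:\n{assessment}\n"
        f"Based on this assessment, choose exactly one option from {options}."
    )
    return decision, assessment

def judge_with_bias_check(llm, context, response, criterion, options):
    decision, assessment = judge_once(llm, context, response, criterion, options)
    # Re-run with the options in reversed order; an inconsistent decision
    # indicates positional bias and an untrustworthy result.
    flipped_decision, _ = judge_once(llm, context, response, criterion, list(reversed(options)))
    positional_bias = decision != flipped_decision
    return decision, assessment, positional_bias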
Positional bias is indicated in the results section as well as in the details pop-up. The screenshot below shows an example of the chain-of-thought reasoning (serving as an explanation) used for decision-making and communication of results.
You can add custom evaluators to EvalAssist using any of the supported providers. To do so, you will need to add a custom_models.json file to the backend folder. For example, to add "Gemma 3" from Ollama as an evaluator, add the following JSON:
[
  {
    "name": "Gemma 3",
    "path": "gemma3:12b",
    "providers": [
      "ollama"
    ]
  }
]
Our project welcomes external contributions. There are two ways of contributing: you can contribute to EvalAssist directly or, if you have developed a criterion that may be beneficial for the entire research community, you can contribute it to Unitxt. That process is maintained by the Unitxt community, but we've outlined the essential steps (if in doubt, please contact the Unitxt project directly).
Let’s say your task is to classify text data (“response”) as either concise or not concise, i.e., your objective is to develop a "conciseness" criterion.
1. Write your evaluation criteria
Below is an example of the conciseness criterion. You can load the example and change the criteria text and description as needed. After defining your first version of the criteria, select an LLM to use for evaluation.
2. List your test data to evaluate
You can edit the existing test data as you want. You can add your own test data by clicking the “Add row” button.
3. Evaluate the test data
You can 1) optionally specify your expected evaluation results (useful when you have many data points and you want to see at a glance where the evaluation didn't meet your expectations), and then 2) click the “Evaluate” button.
4. Inspect the evaluation results
After you click "Evaluate", you will see the evaluation results with an explanation, and whether the evaluation results agree or disagree with your expected results. Hovering over a test case will show you various options to manipulate the test case, such as duplicating it, running a single test case, viewing more details, or deleting it. The generated results section also shows you if the system detected positional bias, an indicator that the results may not be trustworthy. You may try to resolve the positional bias by selecting another LLM evaluator or refining your criteria.
5. Refine your criteria
If there's a disagreement between your expectations and the evaluation results, refine your criteria description as shown below, and evaluate again until you have a criterion that best meets your expectations.
6. Save your test case
Once you are satisfied with your criteria, you can save your results by clicking the “Save” button. You can also download a Jupyter notebook or Python code to move to programmatic evaluation with larger amounts of data.
- Generating test data The "Generate test data" button at the bottom of the test data section in the UI creates new test data for you automatically using an LLM. The LLM for test data generation needs to be selected prior to using this functionality. Note that when you press the button for the first time, you will have to configure the test data generation settings. When you press the button next time, it will immediately create test data using the latest configurations, if any.
- Configuring test data generation You can always specify and modify how you want to generate the test data by clicking the “configure” button next to the “Generate test data” button.
- Configuration settings In a pop-up window you can specify the task type (text generation, summarization, question & answer), desired domain, persona, and length of the test data to be generated. Additionally, you can define how many test cases to generate for each label (e.g., Yes, No, Borderline).
- Direct AI Manipulation of Existing Test Data To modify a specific part of an existing test case, whether by rephrasing to a similar text, regenerating to a different one, shortening, or expanding it, highlight the text, and the relevant action buttons will appear.
- Viewing details of the test data You can read about how each test case has been generated (configurations and prompt) by clicking the “details” button.