1. What do we mean by evaluation ('Evals')?
Quotes (paraphrased)
Evals mean different things to different people. The biggest confusion is that there are actually two different categories of evals:
a. Model evals (eg HellaSwag, MMLU, TruthfulQA)
b. Task evals
Model Evals are really for people who are building or fine-tuning an LLM (eg HuggingFace OpenLLM leaderboard).
Task evals use an eval template that evaluates application output on the quality of the response ([1]).
[1]: https://twitter.com/aparnadhinak/status/1752763354320404488
(Model evals appear to measure how much information an LLM has compressed, and how good it is at regurgitating this information.)
Key Points
Task evals do not require a ground truth. They use an eval template that is designed to be applied to different data repeatedly in a structured manner. The eval template contains the prompt/instruction for the evaluator.
For RAG, a reasonable eval template appears to be one that takes as input a tuple of the form (query: str, sources: Optional[list[str]], answer: str); after this is sent to the evaluator, a reasonable form for the response appears to be a tuple of the form (outcome: bool, category: Optional[str], critique: Optional[str]).
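To make that shape concrete, here is a minimal sketch of an eval template in Python; call_judge, the prompt wording, and the parsing are illustrative assumptions, not part of any existing library:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class EvalResult:
    outcome: bool
    category: Optional[str] = None
    critique: Optional[str] = None


EVAL_TEMPLATE = """You are judging the answer produced by a RAG application.
Question: {query}
Sources: {sources}
Answer: {answer}
Reply with PASS or FAIL, a one-word failure category (if any), and a short critique."""


def call_judge(prompt: str) -> str:
    """Placeholder for whatever LLM client the application actually uses."""
    raise NotImplementedError


def evaluate(query: str, answer: str, sources: Optional[list[str]] = None) -> EvalResult:
    reply = call_judge(EVAL_TEMPLATE.format(query=query, sources=sources or "N/A", answer=answer))
    return EvalResult(
        outcome="PASS" in reply.upper(),  # categorical outcome, no numeric score
        category=None,                    # parse a category from `reply` as needed
        critique=reply,                   # keep the raw critique for later analysis
    )
```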
2. Introducing LLM-as-a-judge
Quotes
A powerful solution to assess outputs in a human way, without requiring costly human time, is LLM-as-a-judge.
This method was introduced in Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena - which I encourage
you to read ([2]).
[2]: https://huggingface.co/learn/cookbook/en/llm_judge
Relative to human judgments, which are typically noisy (due to differing biases among annotators), LLM judgments tend to be less noisy (as the bias is more systematic) but more biased. We can mitigate some of these biases:
a. Position bias: LLMs tend to favor the response in the first position. To mitigate this, we can evaluate the same
pair of responses twice while swapping their order.
b. Verbosity bias: LLMs tend to favor longer, wordier responses over more concise ones, even if the latter is clearer
and of higher quality. A possible solution is to ensure that comparison responses are similar in length.
c. Self-enhancement bias: LLMs have a slight bias towards their own answers. To counter this, don’t use the same
LLM for evaluation tasks ([3]).
[3]: https://eugeneyan.com/writing/llm-patterns/
(This appears to be yet another example of the classic bias-variance tradeoff in ML.)
LLM Evals are valuable analysis tools. But you should use classes instead of numeric scores as outputs ([4]).
[4]: https://x.com/aparnadhinak/status/1748368364395721128
Use pairwise comparisons: Instead of asking the LLM to score a single output on a Likert scale, present it with
two options and ask it to select the better one. This tends to lead to more stable results ([5]).
[5]: https://applied-llms.org/#llm-as-judge-can-work-somewhat-but-its-not-a-silver-bullet
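Combining the two recommendations above (categorical pairwise judgments, plus order-swapping to counter position bias) might look roughly like this; ask_judge and the prompt are illustrative stand-ins, not an existing API:

```python
def ask_judge(prompt: str) -> str:
    """Hypothetical LLM call that returns 'A' or 'B'."""
    raise NotImplementedError


PAIRWISE_PROMPT = """Question: {query}
Response A: {a}
Response B: {b}
Which response answers the question better? Reply with exactly 'A' or 'B'."""


def pairwise_judgement(query: str, response_1: str, response_2: str) -> str:
    """Return 'response_1', 'response_2', or 'tie' (position bias detected)."""
    first = ask_judge(PAIRWISE_PROMPT.format(query=query, a=response_1, b=response_2))
    # Ask again with the order swapped to mitigate position bias.
    second = ask_judge(PAIRWISE_PROMPT.format(query=query, a=response_2, b=response_1))
    if first.strip() == "A" and second.strip() == "B":
        return "response_1"
    if first.strip() == "B" and second.strip() == "A":
        return "response_2"
    return "tie"  # the judge changed its mind when the order was swapped
```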
Model-based evaluation is a meta-problem within your larger problem. You must maintain a mini-evaluation
system to track its quality, and to calibrate the model with a human ([6]).
[6]: https://hamel.dev/blog/posts/evals/#automated-evaluation-w-llms
Key Points:
The discovery that LLM output can be reasonably evaluated by other LLMs has significant implications for data collection and evaluation in general. Human evaluation is expensive, time-consuming, and subject to the participant's expertise and motivation. LLM evaluation still needs some human input to verify the accuracy of model-generated evaluations, but it is a far cheaper and more scalable way to approximate human preferences.
By making it as easy as possible to use LLM-as-a-judge, an application can facilitate the rapid and low-cost generation of AI datasets. This implies out-of-the-box support for sending data to an LLM for evaluation, out-of-the-box support for checking that the LLM evaluations are well-calibrated against human judgement, and documentation for developing appropriate evaluation prompts (eg avoid continuous scales; prefer categorical outcomes and pairwise comparisons).
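One simple way to check that calibration is to collect a small set of items that have both a human label and an LLM-judge label and report raw agreement plus Cohen's kappa; the sketch below assumes scikit-learn is available and uses made-up labels:

```python
from sklearn.metrics import cohen_kappa_score

# Hypothetical example labels: True = PASS, False = FAIL.
human_labels = [True, True, False, True, False, False, True, True]
llm_labels = [True, True, False, False, False, True, True, True]

agreement = sum(h == m for h, m in zip(human_labels, llm_labels)) / len(human_labels)
kappa = cohen_kappa_score(human_labels, llm_labels)  # agreement corrected for chance

print(f"raw agreement: {agreement:.2f}, Cohen's kappa: {kappa:.2f}")
# If kappa is low, revisit the eval prompt (or the choice of judge model)
# before trusting the LLM judgements at scale.
```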
3. A duality between Evals and fine-tuning
Duality: Both sides are always there, even if you only see one.
Quotes
99% of the labor involved with fine-tuning is assembling high-quality data that covers your AI product’s
surface area. However, if you have a solid evaluation system, you already have a robust data generation
and curation engine ([7]).
[7]: https://hamel.dev/blog/posts/evals/#:~:text=99%25%20of%20the%20labor%20involved,tuning%20in%20a%20future%20post.
All you need is synthetic data, LoRA, and 750 human responses for evaluation ([8]).
(*Taken from the release notes of Apple Intelligence.*)
[8]: https://x.com/_philschmid/status/1800415559332528249
... language models have always been used by tech companies in the past to extract knowledge, curate
information, and rank content ([9]).
[9]: https://justine.lol/matmul/
The most interesting stuff in LLMs right now (to me) is:
a. figuring out how to do it small
b. figuring out how to do it on CPU
c. figuring out how to do it well for specific tasks ([10])
[10]: https://x.com/vboykis/status/1789461150939193475
Key Points:
The future is trending toward smaller LLMs (eg Apple Intelligence and their vision for integrating AI into everyday life). Quality datasets that are curated for specific use-cases will likely be important for maximising the performance of these smaller LLMs. Whether or not an organisation decides to host a foundation model themselves, they will still need quality data for prompting or fine-tuning the foundation model.
RAG is an accelerator of the data curation process - a natural side effect of capturing the output of a (high-quality) RAG system is a valuable dataset. RAG is also designed to handle out-of-domain data (eg via retrieval engines), which can be seen as a way of improving the diversity of a dataset in a semi-automated fashion. From the perspective of data generation, RAG systems are therefore a powerful tool for producing AI datasets.
To allow developers to focus on tackling the many challenges associated with deploying smaller LLMs (rather than compatibility issues with the data), RAG systems should prioritise producing datasets according to a standardised ('AI') schema. Such datasets can be used to build small, task-specific LLMs ('reasoning engines') that can be easily shared. A standardised data schema can also help support the development of an ecosystem of tools.
4. A brief aside: How to evaluate a RAG system?
Quotes
The dirty secret of improving your RAG application: https://ir-measur.es (`ir_measures`) ... Pair it with
`ir_datasets`, `pyterrier` and you have your Learning to RAG starter pack ([11])
[11]: https://x.com/jobergum/status/1794996654958854219
More details
ir_measures is used to compute how well your retrieval system is working. You provide (query, document, relevance) judgement tuples (eg binary relevance labels), collect (query, document, rank, ranking_score) results from your retrieval system, select your metrics, and you can instantly evaluate your RAG system. As soon as a user has relevance judgements available, they can generate metrics for their retrieval system.
Note: If a relevance judgement does not exist, then it is often ignored from the metrics.
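For illustration, the workflow above in a minimal sketch (query IDs, document IDs and relevance labels here are made up; the dict-based qrels/run format is the one ir_measures documents):

```python
import ir_measures
from ir_measures import RR, P, nDCG

# Relevance judgements: query_id -> {doc_id: relevance label (1 = relevant)}.
qrels = {
    "q1": {"doc1": 1, "doc2": 0},
    "q2": {"doc3": 1},
}

# Retrieval output: query_id -> {doc_id: ranking_score}. Documents without a
# judgement (eg doc9) are simply ignored by most measures.
run = {
    "q1": {"doc1": 2.3, "doc2": 1.1, "doc9": 0.4},
    "q2": {"doc3": 0.9, "doc1": 0.2},
}

results = ir_measures.calc_aggregate([nDCG @ 10, P @ 5, RR], qrels, run)
print(results)  # a dict mapping each measure to its aggregate value
```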
ir_datasets provides some standard IR datasets that include queries, documents and relevance judgements. These can be used to benchmark a retrieval (RAG) pipeline against other retrieval pipelines.
pyterrier can be used to implement new retrieval models and then measure the benefit of adding them to a current retrieval system (eg what is the impact of adding BM25 to the current ragna pipeline?).
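A rough sketch of such a benchmark, following pyterrier's standard quickstart with its built-in vaswani test collection; a ragna pipeline would have to be wrapped as a pyterrier transformer (not shown) before it could appear in the same comparison:

```python
import pyterrier as pt

if not pt.started():
    pt.init()

# A small standard test collection with queries, documents and judgements.
dataset = pt.get_dataset("vaswani")
bm25 = pt.BatchRetrieve(dataset.get_index(), wmodel="BM25")

# Compare retrieval pipelines on the same topics/qrels; add a wrapped ragna
# pipeline to this list to measure the impact of eg adding BM25.
pt.Experiment(
    [bm25],
    dataset.get_topics(),
    dataset.get_qrels(),
    eval_metrics=["map", "ndcg_cut_10", "recip_rank"],
)
```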
Key Points
rank and ranking_score are important fields that should always be tracked.
By including support for ir_measures (as an optional add-on), a RAG system makes it easy for all users to compute retrieval metrics from their own data.
To encourage adoption for serious use, a RAG system should create a benchmark pipeline. This benchmark can also be used to assess whether there is merit to adding new components.
5. Proposing an Evals table for ragna
Quotes
That which is measured, improves.
(*Pearson's Law*)
I’ve found that unsuccessful products almost always share a common root cause: a failure to create robust
evaluation systems ... Evaluation systems create a flywheel that allows you to iterate very quickly. It’s almost
always where people get stuck when building AI products ([12]).
[12]: https://hamel.dev/blog/posts/evals/
Do not buy an eval framework too early. The hard part of evals is not the library — it is understanding the user
problem and diversity of data, and defining what good looks like, and measuring it ([13]).
[13]: https://docs.google.com/presentation/d/1GC868XXjhxOpQEt1jUM79aW0RHjzxPp0XhpFHnYH760/edit#slide=id.p
... it was important to evaluate performance against datasets that are representative of real use cases ([14])
[14]: https://machinelearning.apple.com/research/introducing-apple-foundation-models
Instead of using off-the-shelf benchmarks, we can start by collecting a set of task-specific evals
(i.e., prompt, context, expected outputs as references). These evals will then guide prompt engineering, model
selection, fine-tuning, and so on. And as we update our systems, we can run these evals to quickly measure
improvements or regressions. Think of it as Eval Driven Development (EDD) ([15]).
[15]: https://eugeneyan.com/writing/llm-patterns/
Build evals and kickstart a data flywheel ... By creating assets that compound their value over time, we upgrade
building evals from a purely operational expense to a strategic investment, and build our data flywheel in the
process ([16]).
[16]: https://applied-llms.org/
Key Points
Practical experience from others says that a user is unlikely to build a successful AI application if they don't understand their data. If you want users to build successful AI applications with your tool, you should first make it as easy as possible for them to get familiar with their data. Every user needs to create an evaluation system specific to their data, while trying to avoid creating a 'Goodhart' metric (one that stops being a good measure once it becomes a target).
Proposed updates to ragna
ragna already includes an internal database that contains much of the information that a user needs to build an evaluation system for their data. With the following extensions, the ragna database should be able to support all use-cases outlined in the previous sections:
Updates to the Current Tables
The LLM prompt is so important that it should have its own field in the Message table (otherwise it is harder for users to share and reproduce data). The prompt is currently relegated to being a member of the params field of a Chat.
rank and ranking_score should be added to the Source table to allow for computation of useful metrics.
document needs to be made more flexible in the Source table to allow for documents that exist outside of ragna ('Corpus' feature); this can perhaps also lead to the removal of the location field.
It is necessary to create separate Message tables for both questions and answers (see also 'Proposed new Evals table' below).
It is likely also useful to add a metadata field to the Message tables to record information like 'time-to-first-token', etc.
Proposed new Evals table
It must support the following requirements:
A data-point should not exist in the Evals table unless it has been evaluated by an LLM or a human.
LLM-as-a-judge and human evaluations both require a prompt/instruction for evaluation (eval_prompt).
Categorical evals should be supported but continuous evals should not (category, outcome).
Critiques can be a powerful tool for improving both RAG models and LLMs (critique).
LLM-as-a-judge requires a column for distinguishing between human and AI evaluation, and for recording the evaluator (human_eval, evaluator_name).
Labels are required to mark things like test or validation data, experiment IDs, different components of a RAG pipeline, etc. (labels).
Metadata such as document names and timestamps should be stored on other tables.
(query, source, answer) tuples should be taken from other tables (hence a requirement for separate Message tables for both questions and answers).
Given the above requirements, here is a proposed schema for an Evals table:
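Sketched below as a SQLAlchemy declarative model, in the spirit of the ORMs ragna already uses internally; all table and column names are illustrative rather than final:

```python
import uuid

from sqlalchemy import JSON, Boolean, Column, ForeignKey, String
from sqlalchemy.orm import declarative_base

Base = declarative_base()


class Eval(Base):
    __tablename__ = "evals"

    id = Column(String, primary_key=True, default=lambda: str(uuid.uuid4()))
    # (query, source, answer) live on other tables, hence foreign keys to the
    # (hypothetical) separate question/answer Message tables.
    question_id = Column(String, ForeignKey("question_messages.id"))
    answer_id = Column(String, ForeignKey("answer_messages.id"))
    # Prompt/instruction given to the evaluator (LLM or human).
    eval_prompt = Column(String, nullable=False)
    # Categorical, not continuous, evaluation results.
    outcome = Column(Boolean, nullable=False)
    category = Column(String, nullable=True)
    critique = Column(String, nullable=True)
    # Distinguish human vs LLM-as-a-judge, and record who/what evaluated.
    human_eval = Column(Boolean, nullable=False)
    evaluator_name = Column(String, nullable=False)
    # Free-form labels: test/validation split, experiment id, pipeline component, ...
    labels = Column(JSON, default=dict)
```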
Outstanding Question
What about structured responses ('function calling'), embeddings ('list of floats'), and multi-modal data (eg binary blobs)? Should we also think about how they might be supported in the future - are there any straightforward additions to the Evals table above that can make it more general?
Appendix I: Guardrails
Details
There is actually another type of evaluation that is very relevant: 'Guardrails' (also known as unit tests!).
You write lots of tests that check the business logic of your application (eg assertion statements that check an LLM response does not contain any banned keywords). These tests can of course be LLM-generated. They can be used both offline (eg testing the impact of a new prompt) and online (eg preventing the return of banned keywords).
You don’t necessarily need a 100% pass rate. Your pass rate is a product decision, depending on the failures you are willing to tolerate.
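A minimal sketch of such a guardrail, with a made-up banned-keyword rule; pytest (or any test runner) covers the offline case, and the same check can be reused online:

```python
BANNED_KEYWORDS = {"guarantee", "medical advice"}  # illustrative business rule


def check_no_banned_keywords(response: str) -> bool:
    """Guardrail: the LLM response must not contain any banned keyword."""
    lowered = response.lower()
    return not any(keyword in lowered for keyword in BANNED_KEYWORDS)


# Offline use: run as a unit test over a set of recorded responses.
def test_no_banned_keywords():
    recorded_responses = ["The report covers Q3 revenue.", "See section 2 for details."]
    assert all(check_no_banned_keywords(r) for r in recorded_responses)


# Online use: call the same check before returning a live response.
def guarded_reply(response: str) -> str:
    if not check_no_banned_keywords(response):
        return "Sorry, I can't answer that."
    return response
```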
Key Points
Guardrails are a high-impact, low-effort way of evaluating the individual components of an AI application. They are also crucial for increasing the robustness and reliability of an AI application in production.
Appendix II: Some proposed new web API methods for ragna
This is just a start - these are some basic components that users need in order to build on the areas discussed above. Over time, it is likely worth expanding these methods to cover anything else that would be sensible discovery work for users.
/bootstrap - Sampling the user's own distribution is very important. Users may not always have a prepared dataset of (question, sources, answer) tuples. To address this cold-start problem, we can use an LLM to look at the user-supplied documents (this is where ragna helps target the user's distribution) and generate questions that a specific chunk would answer. This provides quality questions together with the exact source the answer lives in; a user can then update this dataset over time as they collect and label more items (see the sketch after this list).
/{orm} - Expose ragna ORMs to users so that they can easily access database information. ragna does a lot of great work building the ORMs; it is a shame not to expose this to users. (Obviously, there needs to be some thought about what exactly is public and private, but the gist of the idea is to re-use some of the ORMs so that users don't have to re-write this logic themselves.) This should probably also include the ability to pass filters to retrieve only a subset of the data.
/guardrails - For running the unit tests on data in the ragna database. This is a high-impact, low-effort way to diagnose problems in an AI application.
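As referenced in the /bootstrap item above, a rough sketch of the generation step; generate_question stands in for an LLM call, and nothing here reflects ragna's actual API:

```python
from dataclasses import dataclass


@dataclass
class BootstrapItem:
    question: str
    source_id: str   # the exact chunk the answer lives in
    answer: str      # the chunk text doubles as a reference answer


def generate_question(chunk_text: str) -> str:
    """Hypothetical LLM call: 'Write one question that this passage answers.'"""
    raise NotImplementedError


def bootstrap(chunks: dict[str, str]) -> list[BootstrapItem]:
    """Turn user-supplied document chunks into a seed (question, source, answer) dataset."""
    return [
        BootstrapItem(question=generate_question(text), source_id=chunk_id, answer=text)
        for chunk_id, text in chunks.items()
    ]
```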