# LLM Evals

# Your AI Product Needs Evals

[web site](https://hamel.dev/blog/posts/evals/)

## Iterating Quickly == Success

You must have tools and processes for:

- Evaluating quality (Testing)
- Debugging (logging and inspecting data)
- Changing behaviour (prompt engineering, fine-tuning, writing code)

Recommended processes:

```mermaid
flowchart LR
    A --> B
    A --> C
    B --> D
    C --> D
    D --> E
    D --> F
    E --> G
    F --> G
    D -.-> D_note["Human review\nModel-based\nA/B tests"]

    A["LLM\nInvocation\n(synthetic/human inputs)"]
    B["Unit\ntests"]
    C["Logging\nTraces"]
    D["Eval &\nCuration"]
    E["Fine-tuning"]
    F["Prompt Eng."]
    G["Improve\nModel"]
    

    style D fill:#FFD966,stroke:#333,stroke-width:2px,color:#000
    style D_note fill:none,stroke:none,color:#FFFFFF
```

## Case study

- First stages of fast improvement due to prompt engineering.
- Then progress stalled: improvements in one place led to failures in others.

## Evals

- Cost of A/B testing > cost of model-based and human evals > cost of unit tests
- Cadence: unit-tests after each code change, model-based + human evals with some cadence, A/B tests after major changes.

## Unit tests

- Assertions like in pytest
- Use assertions in more places than tests: in data cleaning and in automatic retries during inference (using assertions to course-correct).
- Should be fast to run often.
- To come up with unit tests:
    - Think about your traces and the failure modes they reveal.
    - Ask your model to brainstorm.
- Step 1: write scoped tests
    - Break down the scope into features and scenarios
    - For example, one feature of Lucy is finding real estate listings: "Find listings with more than 3 bedrooms and less than $2M in San Jose, CA".
    - The assertion verifies that the expected **number of results** is returned. Scenarios:
        - only one listing, more than one listing, no listing.
    - Generic tests: do not include UUID of user in response.
- Step 2: create test cases:
    - Inputs that trigger each of the scenarios.
    - Use an LLM to generate synthetic inputs.
    - If possible, write both instructions for obtaining the response as well as instructions for verifying the result. 
        - For example "write 50 different instructions that a real estate agent can give to his assistant to create contacts on his CRM. For each of the instructions, generate a second instruction to look up the created contact". 
        - For each of the test cases, we execute the first user input to create the contact and then execute the second to fetch the contact. If the result length is not exactly 1, the test fails.
    - One signal the tests are good is when the model struggles to pass them.
    - You don't need 100% pass rate.
- Step 3: run and track your tests regularly.
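
A minimal sketch of the scoped tests from Step 1, written as plain pytest-style test functions. `find_listings` is a hypothetical stub standing in for the LLM-backed feature (here it filters an in-memory list so the example runs on its own). The fixture listings, scenario arguments, and UUID pattern are assumptions made for illustration, not part of the original post.

```python
import re

# Hypothetical fixture data; a real suite would query a test database.
LISTINGS = [
    {"id": 1, "city": "San Jose", "bedrooms": 4, "price": 1_500_000},
    {"id": 2, "city": "San Jose", "bedrooms": 5, "price": 1_900_000},
    {"id": 3, "city": "Palo Alto", "bedrooms": 4, "price": 1_800_000},
]


def find_listings(city, more_than_bedrooms, less_than_price):
    """Stub for the feature under test (the real one would call the LLM)."""
    return [
        listing for listing in LISTINGS
        if listing["city"] == city
        and listing["bedrooms"] > more_than_bedrooms
        and listing["price"] < less_than_price
    ]


# One test case per scenario: (scenario name, query arguments, expected count).
SCENARIOS = [
    ("only one listing", ("Palo Alto", 3, 2_000_000), 1),
    ("more than one listing", ("San Jose", 3, 2_000_000), 2),
    ("no listing", ("San Jose", 5, 2_000_000), 0),
]

UUID_PATTERN = re.compile(
    r"[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}",
    re.IGNORECASE,
)


def contains_uuid(text):
    """Generic check: True if the response text leaks something UUID-shaped."""
    return UUID_PATTERN.search(text) is not None


def test_listing_counts():
    # Scoped test: each scenario must return the expected number of results.
    for name, args, expected in SCENARIOS:
        assert len(find_listings(*args)) == expected, name


def test_response_has_no_uuid():
    # Generic test: applies to every response, whatever the feature.
    response = "I found 2 listings in San Jose, CA."
    assert not contains_uuid(response)
```

With pytest installed, the scenario loop could instead be expressed with `pytest.mark.parametrize`, so that each scenario is reported as a separate test.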

## Level 2: Human & Model Eval

- Logging traces
    - For example, LangSmith
- Looking at traces
    - Remove all friction from the process of looking at data. 
        - Build your own data viewing and labelling tool => Shiny for python.
        - Filter by scenario or feature, go to trace, check if input is human or synthetic, ...
    - Make the output of the LLM editable.
    - Lilac: 
        - Search and filter data semantically.
        - Find a set of similar data points while debugging an issue
- How much data:
    - At least read traces for all test cases and all user-generated traces. Sample over time.
- Automated Evaluation with LLMs
    - Have humans periodically evaluate a sample of traces. 
        - Track correlation between human and model evaluations.
    - Collect "critiques" from labelers explaining why they are making a decision. 
        - Use them for prompt engineering and fine tuning of the LLM evaluator.
    - Use the most powerful model you can afford.
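
One way to track how well human and model evaluations line up over time is sketched below. It uses the plain agreement rate on binary pass/fail labels as a simple stand-in for "correlation"; the record layout `(period, human_pass, model_pass)` and the week labels are assumptions for illustration.

```python
from collections import defaultdict


def agreement_by_period(records):
    """Return {period: share of traces where human and model labels match}.

    `records` is an iterable of (period, human_pass, model_pass) tuples.
    """
    totals = defaultdict(lambda: [0, 0])  # period -> [agreements, count]
    for period, human_pass, model_pass in records:
        totals[period][0] += int(human_pass == model_pass)
        totals[period][1] += 1
    return {
        period: agreements / count
        for period, (agreements, count) in sorted(totals.items())
    }


# Hypothetical sample of traces labeled by both a human and the LLM evaluator.
records = [
    ("2024-W01", True, True),
    ("2024-W01", False, True),
    ("2024-W02", True, True),
    ("2024-W02", False, False),
]
rates = agreement_by_period(records)
```

A falling rate for a given period is a cue to re-read traces and revisit the evaluator prompt before trusting its scores again.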
    

# Using LLM-as-a-Judge For Evaluation: A Complete Guide

[blog](https://hamel.dev/blog/posts/llm-judge/#the-problem-ai-teams-are-drowning-in-data)

## Step 1: Find the principal domain expert.

- Get one principal domain expert to evaluate LLM output.
- Use binary decisions.
- Include critique.

## Step 2: Create dataset

- Diverse: define problem in terms of *dimensions* and have inputs for each combination.
    - Example of dimensions: features, scenarios, personas
    - One input per each combination of feature, scenario and persona.
- Types:
    - logged real interactions
    - synthetic
- Use real DB and APIs to get the data so it is as realistic as possible.
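
A minimal sketch of "one input per combination" using `itertools.product`. The dimension values and the wording of the generation prompt are assumptions for illustration; each prompt would then be sent to an LLM to produce one synthetic user input.

```python
from itertools import product

# Hypothetical dimension values; replace with the ones for your product.
DIMENSIONS = {
    "feature": ["find listings", "schedule viewing"],
    "scenario": ["exact match", "no matches"],
    "persona": ["first-time homebuyer", "property investor"],
}


def build_generation_prompts(dimensions):
    """Return one prompt spec per combination of dimension values."""
    keys = list(dimensions)
    specs = []
    for combo in product(*(dimensions[key] for key in keys)):
        spec = dict(zip(keys, combo))
        spec["prompt"] = (
            f"Write one realistic user message from a {spec['persona']} "
            f"that uses the '{spec['feature']}' feature and should lead "
            f"to the '{spec['scenario']}' scenario."
        )
        specs.append(spec)
    return specs


specs = build_generation_prompts(DIMENSIONS)  # 2 * 2 * 2 = 8 combinations
```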

## Step 3: Evaluate accuracy on created dataset

- Remove friction for the domain expert to evaluate.
    - May need to get additional context: metadata about the user, state of current system (time, inventory levels ...), resources to check => ability to check a database.
    - Put all of this on a single page.
    - Build simple web app to review data => Shiny for python.

- How much data:
    - 30 examples and keep going until no more failures. Then keep going until I don't learn anything new.

## Step 4: Fix Errors

- Pervasive errors? (or failures?)

## Step 5: Build LLM as judge

- Spreadsheet with:
    - model response
    - Judge critique
    - Judge decision
    - Expert critique
    - Expert decision
    - Expert revised response (what the model should have outputted)
    - Agreement between judge and expert (true / false)

- Sometimes we need precision / recall instead of agreement if the dataset is imbalanced (more failures than passes, or the other way)

- Iterate using better prompts (with expert's critiques as new examples?) until > 90% accuracy / F1 / ...
- Adjust prompts by hand or using ALIGN Eval
- What if this doesn't work?
    - We may need to rely more on human annotations.
- Mistakes in LLM judges due to:
    - Not providing critiques, or providing very terse critiques.
    - Not providing enough context. Everything the expert used to make the evaluation should also be given to the judge as context.
    - Not providing diverse examples.
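
The sketch below shows why plain agreement can mislead on an imbalanced dataset. It computes agreement, precision, recall and F1 from (expert decision, judge decision) pairs, treating "fail" as the positive class. The labels and the sample rows are assumptions for illustration.

```python
def judge_metrics(rows, positive="fail"):
    """Compare judge decisions to expert decisions.

    `rows` is a list of (expert_decision, judge_decision) pairs.
    The expert decision is treated as ground truth.
    """
    tp = fp = fn = tn = 0
    for expert, judge in rows:
        if judge == positive and expert == positive:
            tp += 1
        elif judge == positive:
            fp += 1
        elif expert == positive:
            fn += 1
        else:
            tn += 1
    total = tp + fp + fn + tn
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if precision + recall
        else 0.0
    )
    return {
        "agreement": (tp + tn) / total if total else 0.0,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Hypothetical spreadsheet: 8 passes, 2 failures; the judge misses one failure.
rows = [("pass", "pass")] * 8 + [("fail", "pass"), ("fail", "fail")]
metrics = judge_metrics(rows)
```

In this made-up sample the judge agrees with the expert on 9 of 10 rows, yet it catches only half of the failures, which is what recall exposes.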

## Step 6: Error Analysis

- Apply judge against real or synthetic interactions, always on unseen data.
- Measure error rate on each segment of data, i.e., combination of feature, scenario, and persona in our example.
- Look at each type of error and classify it by hand, after looking at the whole trace (including tool calls made and what context / insight was extracted from each) for example: 
    - Missing user Education.
    - Authentication issues.
    - Poor Context Handling.
    - Inadequate Error Messages.
- Fix Errors again.
    - Go back to step 3 and iterate until satisfied.
    - Try to write a test case for the error.
- Data Literacy and statistics [link](https://jxnl.co/writing/2024/06/02/10-ways-to-be-data-illiterate-and-how-to-avoid-them/)
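
A minimal sketch of measuring the error rate per segment. It assumes each judged interaction is a dict with `feature`, `scenario`, `persona` and a boolean `passed` (hypothetical field names), and returns segments sorted from worst to best so the worst ones are read first.

```python
from collections import defaultdict


def error_rate_by_segment(results):
    """Return [(segment, error_rate, n)] sorted by error rate, highest first."""
    counts = defaultdict(lambda: [0, 0])  # segment -> [failures, total]
    for result in results:
        segment = (result["feature"], result["scenario"], result["persona"])
        counts[segment][0] += int(not result["passed"])
        counts[segment][1] += 1
    rows = [
        (segment, failures / total, total)
        for segment, (failures, total) in counts.items()
    ]
    return sorted(rows, key=lambda row: row[1], reverse=True)


# Hypothetical judged interactions.
results = [
    {"feature": "find listings", "scenario": "no matches",
     "persona": "investor", "passed": False},
    {"feature": "find listings", "scenario": "no matches",
     "persona": "investor", "passed": True},
    {"feature": "find listings", "scenario": "exact match",
     "persona": "investor", "passed": True},
]
report = error_rate_by_segment(results)
```

Keeping `n` next to each rate helps avoid over-reading segments that only have a handful of examples.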




## Step 7: Create More Specialized LLM Judges, if needed

- For example, if the model is poor at citing sources correctly, we can create a targeted eval for that, or even use code-based assertions without a judge.
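
As an illustration of a code-based assertion that needs no judge, the sketch below checks that every URL cited in a response was among the retrieved sources. The URL regex, the function name and the sample data are assumptions; a real check would depend on how the product formats citations.

```python
import re

# Simple URL matcher; stops at whitespace and common closing brackets.
URL_PATTERN = re.compile(r"https?://[^\s)\]>]+")


def unknown_citations(response, retrieved_urls):
    """Return cited URLs that were not among the retrieved sources."""
    # Strip trailing sentence punctuation that the regex may have captured.
    cited = {url.rstrip(".,;") for url in URL_PATTERN.findall(response)}
    return sorted(cited - set(retrieved_urls))


# Hypothetical response and retrieval result.
response = "See https://example.com/a and https://example.com/c."
retrieved = ["https://example.com/a", "https://example.com/b"]
bad = unknown_citations(response, retrieved)
```

An assertion such as `assert not unknown_citations(response, retrieved)` can then run as a fast unit test on every trace.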


# A Field Guide to Rapidly Improving AI Products

[blog](https://hamel.dev/blog/posts/field-guide/index.html)

## Error Analysis

Example of success:
- Team built a simple viewer to examine conversations. 
- Next to each conversation was a space for open-ended notes about failure modes.
- After annotating dozens of conversations, clear patterns emerged. 
    - For instance, their model was struggling with date handling, failing 66% of the time.
- Real case example:
    - See what users are asking for and how well the model satisfies their needs in each case. This makes building the roadmap almost effortless.
    - See how people assumed your product would work.
    - By looking at how the model responds in each case you start to be able to predict where it will fail and how to improve it via RAG, Prompt Engineering, etc.
    - Custom viewer that has button for categorizing failures.
    - Braintrust: automate implementation of unit tests or other eval techniques to measure how the changes made help improve those failure modes.
- Summary. The process of error analysis consists of:
    - Looking at the conversations.
    - Writing detailed notes about how each conversation failed.
    - Categorizing the notes (or only the failures)
    - The latter can sometimes be made semi-automatic by using an LLM to classify those notes.
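
Once notes have been categorized (by hand or with an LLM), the tally itself is simple. A minimal sketch, assuming each annotation is a dict with a free-text `note` and a `category` that is `None` when the conversation did not fail; the sample annotations are invented for illustration.

```python
from collections import Counter


def failure_mode_summary(annotations):
    """Return [(category, count, share of all conversations)], most common first."""
    total = len(annotations)
    counts = Counter(
        a["category"] for a in annotations if a["category"] is not None
    )
    return [(cat, n, n / total) for cat, n in counts.most_common()]


# Hypothetical annotations from a review session.
annotations = [
    {"note": "Parsed 'next Friday' as today", "category": "date handling"},
    {"note": "Wrong month in reply", "category": "date handling"},
    {"note": "Ignored time zone", "category": "date handling"},
    {"note": "Too informal for this client", "category": "tone"},
    {"note": "Fine", "category": None},
    {"note": "Fine", "category": None},
]
summary = failure_mode_summary(annotations)
```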

## Custom Data Viewer

- Each use case has its specificities that are rarely covered by off-the-shelf tools.
- Even small UX decisions make the difference between the team using the tool or not.
- What makes a good data viewer:
    - Show all context in one place. No need to switch.
    - Make feedback trivial to capture. A simple button.
    - Capture open-ended feedback.
    - Enable quick filtering and sorting. Make it easy to dive into specific error types.
    - Have hotkeys. 

## Empower Domain Experts

- Give domain experts tools to write and iterate on prompts directly.
- Prompt playgrounds like LangSmith and Braintrust are good for this.
- Integrated prompt environments: admin versions of their actual user interface that expose prompt editing.
- Avoid technical jargon when talking to domain experts.

## Generate synthetic data

- Choose right dimensions to test. Example with Real Estate product:
    - Features: different capabilities of the product. Examples:
        - find listings matching criteria
        - analyze trends and pricing
        - setting up property viewings
        - post-viewing communication
    - Scenarios: different situations in which the product is used.
        - Exact match
        - Multiple matches
        - No matches
        - Invalid criteria
    - Personas: different types of users:
        - first-time homebuyer
        - property investor
        - luxury home seeker
        - relocating family
- Ensure synthetic data triggers the dimensions to be tested:
    - Test database with enough variety to cover all dimensions.
        - This can be anonymized production data.
    - A way to verify that the generated queries actually trigger the intended dimensions.
- It is key that synthetic data is grounded in real system constraints:
    - real listings, real agent schedules, restricting business rules, including local regulations, etc.
- If we don't have production data because the product is new, use LLMs to generate both test queries and test data.
    - Use realistic attributes 
        - prices that match market conditions, valid addresses with real street names, etc.
- Guidelines for using synthetic data:
    - Diversify dataset: cover all dimensions.
    - Generate user inputs, not outputs: realistic user queries, not LLM responses.
    - Incorporate real system constraints: use real data and business rules.
    - Verify dimension coverage: ensure generated queries trigger intended dimensions.
    - Start simple then add complexity: begin with basic queries, then introduce edge cases.
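
A sketch of one way to verify that generated queries trigger the intended scenario: run each query against the test database and compare the observed scenario (derived from the number of matches) with the intended one. The query layout, `run_query` and the tiny test database are all hypothetical; the "invalid criteria" scenario is left out because it cannot be inferred from a match count.

```python
# Hypothetical test database.
TEST_DB = [
    {"city": "San Jose", "bedrooms": 3},
    {"city": "San Jose", "bedrooms": 4},
    {"city": "Austin", "bedrooms": 2},
]


def run_query(filters):
    """Stub search: match on city and a minimum number of bedrooms."""
    return [
        row for row in TEST_DB
        if row["city"] == filters["city"]
        and row["bedrooms"] >= filters["min_bedrooms"]
    ]


def observed_scenario(num_matches):
    """Map a match count to the scenario it actually triggers."""
    if num_matches == 0:
        return "no matches"
    if num_matches == 1:
        return "exact match"
    return "multiple matches"


def find_mismatches(queries, run):
    """Return queries whose observed scenario differs from the intended one."""
    mismatches = []
    for query in queries:
        observed = observed_scenario(len(run(query["filters"])))
        if observed != query["intended_scenario"]:
            mismatches.append({**query, "observed_scenario": observed})
    return mismatches


# Hypothetical generated queries, already parsed into filters.
queries = [
    {"filters": {"city": "Austin", "min_bedrooms": 2},
     "intended_scenario": "exact match"},
    {"filters": {"city": "San Jose", "min_bedrooms": 3},
     "intended_scenario": "multiple matches"},
    {"filters": {"city": "San Jose", "min_bedrooms": 4},
     "intended_scenario": "no matches"},  # actually matches one listing
]
mismatches = find_mismatches(queries, run_query)
```

Queries that end up in `mismatches` can be regenerated, or the test database extended until every intended scenario is reachable.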

## Keep trust in Eval system

### Criteria drift 

- Evaluation criteria evolve as you observe more model outputs.
    - The process of reviewing AI outputs helps articulate our own evaluation standards.
    - We need to treat evaluation criteria as living documents that evolve with our understanding.
    - Different stakeholders may have different criteria and we need to reconcile them rather than imposing a single standard.

### Trustworthy evaluation systems

- How: 
    - As discussed: binary metrics + critiques, and measuring alignment with human judgements.
    - And scaling correctly:
        - start with high human involvement
        - study alignment patterns and focus manual evaluation on areas of disagreement
        - use strategic sampling: 
            - sample outputs that provide more information.
            - more weight on areas of disagreement.
        - keep regular calibration as you scale.
- Scaling is not about reducing human effort but redirecting it towards the most impactful areas.
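
One possible way to put more weight on areas of disagreement when choosing traces for manual review is weighted sampling without replacement. The sketch below uses the Efraimidis-Spirakis key trick (draw `u ** (1 / weight)` per item and keep the `k` largest keys). The `segment` field, the per-segment disagreement rates and the `floor` weight are assumptions for illustration, not something prescribed by the original post.

```python
import random


def sample_for_review(traces, disagreement_rate, k, floor=0.05, seed=0):
    """Pick k distinct traces, favouring segments where judge and humans disagree.

    `disagreement_rate` maps segment name -> past human/judge disagreement rate.
    `floor` keeps a small chance of reviewing segments that look healthy.
    """
    rng = random.Random(seed)
    keyed = []
    for trace in traces:
        weight = max(disagreement_rate.get(trace["segment"], 0.0), floor)
        # Larger weights push the key towards 1, so the trace is kept more often.
        keyed.append((rng.random() ** (1.0 / weight), trace))
    keyed.sort(key=lambda pair: pair[0], reverse=True)
    return [trace for _, trace in keyed[:k]]


# Hypothetical traces split evenly across two segments.
traces = [
    {"id": i, "segment": "dates" if i % 2 else "search"} for i in range(40)
]
rates = {"dates": 0.5, "search": 0.02}
picked = sample_for_review(traces, rates, k=5, seed=1)
```

Recomputing `rates` after each calibration round keeps the sampling aligned with where the judge currently disagrees with humans.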

## Plan experiments not features

To complete reading.

# Further reading

- [ALIGN Eval](https://eugeneyan.com/writing/aligneval/)
- [LLM as judges](https://hamel.dev/blog/posts/llm-judge/index.html#resources):
    - [survey on LLM as judge approaches](https://eugeneyan.com/writing/llm-evaluators/)
    - [similar approach](https://www.databricks.com/blog/enhancing-llm-as-a-judge-with-grading-notes)
    - [end-to-end example](https://cookbook.openai.com/examples/custom-llm-as-a-judge)
    - [DOSU](https://blog.langchain.dev/dosu-langsmith-no-prompt-eng/)
- [OpenAI cookbook](https://cookbook.openai.com/examples/partners/eval_driven_system_design/receipt_inspection)
- [G-Eval](https://deepeval.com/docs/metrics-llm-evals)
    - [G-Eval Simply Explained: LLM-as-a-Judge for LLM Evaluation](https://www.confident-ai.com/blog/g-eval-the-definitive-guide)
- [Inspect](https://www.youtube.com/watch?v=kNaZU9bz-UM)

# Other Links

- [fine-tuning](https://parlance-labs.com/education/#fine-tuning)
- [evaluating RAGs](https://jxnl.github.io/blog/writing/2024/02/28/levels-of-complexity-rag-applications/)
    - [6 RAG Evals](https://jxnl.co/writing/2025/05/19/there-are-only-6-rag-evals/)
    - [RAG is dead](https://pashpashpash.substack.com/p/why-i-no-longer-recommend-rag-for)
    - [RAG is not dead](https://hamel.dev/notes/llm/rag/not_dead.html)
- [A/B testing](https://www.geteppo.com/blog)
- [Continual In-Context Learning](https://blog.langchain.dev/dosu-langsmith-no-prompt-eng/)
- [prompt caching](https://platform.openai.com/docs/guides/prompt-caching)
- [error analysis](https://youtu.be/JoAxZsdw_3w)
    - [Hamel recap and blog collection on error analysis](https://hamel.dev/blog/posts/evals-faq/#q-why-is-error-analysis-so-important-in-llm-evals-and-how-is-it-performed)
    - [Hamel's video walkthrough](https://youtu.be/qH1dZ8JLLdU)
- [data literacy](https://jxnl.co/writing/2024/06/02/10-ways-to-be-data-illiterate-and-how-to-avoid-them/)