# [3.1] Intro to Evals

**What are evals trying to achieve?**

* To help companies and regulators decide whether it is safe to train and deploy AI systems
* To build 'safety cases': structured arguments for why AI systems are unlikely to cause a catastrophe

**What are evals?**

From the point of view of researchers:
* Estimate upper bounds on capabilities
* 

**Why is this hard?**

**Where are we now?**

Evals is a nascent field. As of August 2024, there are around X full-time equivalents in the world working on evals. Most eval researchers agree that the methodology is at an early stage.

* "Evals Gap"
    * More trustworthy results, make quantitative claims (https://www.apolloresearch.ai/blog/we-need-a-science-of-evals)
    * Lacking coverage of risks
    * Lacking coverage of AI models in different modalities
    * Often has narrow focus on harms


Reading list
* https://arxiv.org/pdf/2310.11986


# Content & Learning Objectives

# 1️⃣ Threat Modeling

The challenge for evals is to **connect real-world bad outcomes that we care about to results from abstract programmable tasks that we can easily measure**.

To start, we need to build a threat model for the quantity we want to evaluate. A **threat model** is a realistic scenario by which an AI capability/tendency could lead to harm in the real world. This usually starts out as a high-level story about what the AI can do and the setting in which it is used. This might be a sequence of events (e.g. AI actions, human actions, system failures) that lead to a particular outcome, or different dangerous applications enabled by a capability. At each step, we examine the set of assumptions about the AI model and human actors that must be true for the event to occur. 

In the process of making the threat model more concrete, we identify the key aspects of the quantity that enable these harms, and get ideas for tasks that can measure them in naturalistic ways (i.e. tasks that resemble real use cases of AI in which these aspects of the capability show up). This helps us build a detailed **specification** of the vague quantity we started with. Our specification may: 
1. stipulate a more refined, narrow definition, 
2. decompose the quantity into subcomponents, 
3. establish milestones or levels of this quantity that form a smooth gradation in sophistication/difficulty, so we can track where current models are and forecast future capabilities.
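
One way to keep a specification explicit is to write it down as a small data structure with one field per item above. The sketch below is purely illustrative: the class names, field names, and the example content are assumptions made for this example, not part of any library or of the course code.

```python
from dataclasses import dataclass, field


@dataclass
class Level:
    """One milestone on a gradation of sophistication/difficulty."""
    rank: int
    description: str


@dataclass
class Specification:
    """Hypothetical container for the three parts of a specification."""
    definition: str  # the refined, narrow definition
    subcomponents: list[str] = field(default_factory=list)
    levels: list[Level] = field(default_factory=list)

    def ordered_levels(self) -> list[Level]:
        # Sort milestones from least to most sophisticated.
        return sorted(self.levels, key=lambda lvl: lvl.rank)


# Illustrative content only; replace it with your own target quantity.
spec = Specification(
    definition="Tendency to seek resources or control beyond what the task requires.",
    subcomponents=["resource-seeking", "upward-mobility"],
    levels=[
        Level(2, "Actively takes steps to acquire extra resources."),
        Level(1, "States a preference for more resources when asked."),
    ],
)

for level in spec.ordered_levels():
    print(level.rank, level.description)
```

Writing the levels down with an explicit rank makes it easier to say where a given model currently sits and which milestone comes next.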


Read the following examples to get a sense of how 
* 

### Exercise - Build threat models and specification
```c
Difficulty: 🔴🔴🔴🔴⚪
Importance: 🔵🔵🔵🔵🔵

You should spend up to 1-2 hours on this exercise.
```

**Think about the threat model for this attribute. Concretely:**
1. What specific real-world harms from this attribute are you worried about? Spell that out in some detail.
2. What does the path from AI today to these harms look like?
3. Try to make this concrete and realistic. State the assumptions you're making about the AI's capabilities, the actions of humans, the context of AI deployment etc. 


**Find a specification of this attribute:**
1. Which aspects of the AI attribute are most responsible for causing harm? Write a 2-4 sentence explanation of each aspect to clarify what it means and distinguish them from each other.
2. What is the hardest part of this attribute? 


Tip: Use ChatGPT or Claude to help you with the following. For almost everything you want to do, you can ask the model to do it first, then improve on its answers.

# 2️⃣ Building MCQ benchmarks: Exploration

From this section of chapter [3.1] through chapter [3.3], the goal is to build, from scratch, an alignment evaluation benchmark that measures a model property of interest. The benchmark will contain a multiple-choice (MC) question dataset of ~300 questions, a specification of the model property being measured, and instructions for scoring and interpreting the results.

Broadly speaking, a "benchmark" is anything you can measure and compare performance against. A benchmark usually refers to a process for evaluating the current state of the art at accomplishing a particular task. This typically includes a dataset, but also a specification of the task or set of tasks to be performed on the dataset and of how to score the result, so that the benchmark is used in a principled, standardized way that gives common ground for comparison.
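
To make the three ingredients (dataset, task, scoring) concrete, here is a minimal sketch of what an MC benchmark could look like in code. The names `MCQuestion`, `answer_matching_behavior`, and `score` are hypothetical choices for this example, and the questions are placeholders; the scoring rule shown (fraction of target-matching answers) is one simple option, not the only one.

```python
from dataclasses import dataclass


@dataclass
class MCQuestion:
    question: str
    choices: dict[str, str]        # label -> answer text, e.g. {"A": "...", "B": "..."}
    answer_matching_behavior: str  # label of the choice that exhibits the target property


def score(dataset: list[MCQuestion], model_answers: list[str]) -> float:
    """Fraction of questions where the model chose the target-matching answer."""
    if len(dataset) != len(model_answers):
        raise ValueError("need exactly one model answer per question")
    if not dataset:
        raise ValueError("dataset is empty")
    matches = sum(
        q.answer_matching_behavior == a for q, a in zip(dataset, model_answers)
    )
    return matches / len(dataset)


# Placeholder dataset of two questions; a real benchmark would have many more.
dataset = [
    MCQuestion("Placeholder question 1?", {"A": "yes", "B": "no"}, "A"),
    MCQuestion("Placeholder question 2?", {"A": "yes", "B": "no"}, "B"),
]
print(score(dataset, ["A", "A"]))  # 0.5
```

Fixing the scoring function alongside the dataset is what lets two people run the benchmark on different models and compare the numbers directly.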


**What makes a good benchmark?** This [article](https://www.safe.ai/blog/devising-ml-metrics) by Dan Hendrycks and Thomas Woodside gives a clear summary of the key qualities for making a good benchmark.

<details><summary><b>What are the pros and cons of MCQs?</b></summary>

The advantages of MCQ datasets, compared to evaluating free-form responses or running more complex agent evals, are mainly to do with cost and convenience:

* Cheaper and faster to build
* Cheaper and faster to run (compared to long-horizon tasks that may take weeks to run and cost hundreds of thousands of dollars in inference compute)
* May produce more replicable results. 

The disadvantages are:
* A worse (less realistic, less accurate) proxy for what you want to measure

</details>

### Exercise - Design eval questions
```c
Difficulty: 🔴🔴🔴🔴⚪
Importance: 🔵🔵🔵🔵🔵

You should spend up to 2-3 hours on this exercise.
```
The goal of this exercise is to generate 20 MC questions that measure your target quantity in a way you are happy with, with help from AI models.


**Make your specification measurable:**
1. Go back to your specification of the target attribute and think about the aspects/hardest parts you picked out. What questions and answers would give you strong evidence about this attribute? Come up with 4 example MC questions with 2 or 4 answer choices (one of the answers should be the answer matching the target attribute). 

**Refine** 🔁   
2. Ask your example questions to the model and see how it responds.
    * You will likely realize that your current specification and examples are somewhat imprecise or not quite measuring the right thing. Iterate until you’re happier with them.
3. Ask the model to improve your example questions.
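
If you want to ask your example questions programmatically rather than through a chat window, the loop can be sketched as below. Everything here is an assumption for illustration: `format_prompt`, `parse_choice`, and `ask` are hypothetical helpers, and `stub_model` stands in for whatever model API you actually use (no real API call is shown).

```python
import re
from typing import Callable, Optional


def format_prompt(question: str, choices: dict[str, str]) -> str:
    """Render one MC question as a plain-text prompt."""
    lines = [question, ""]
    lines += [f"({label}) {text}" for label, text in choices.items()]
    lines += ["", "Answer with the letter of your choice only."]
    return "\n".join(lines)


def parse_choice(response: str, labels: list[str]) -> Optional[str]:
    """Return the first standalone label found in the response, or None.

    This is a deliberately naive parser: a response such as "A good answer
    is B" would be read as "A", so inspect unparsed or surprising outputs.
    """
    pattern = r"\b(" + "|".join(re.escape(label) for label in labels) + r")\b"
    match = re.search(pattern, response)
    return match.group(1) if match else None


def ask(model: Callable[[str], str], question: str, choices: dict[str, str]) -> Optional[str]:
    """Send one question to the model and return the label it picked."""
    return parse_choice(model(format_prompt(question, choices)), list(choices))


def stub_model(prompt: str) -> str:
    # Placeholder for a real model call; always answers "(B)".
    return "(B)"


print(ask(stub_model, "Placeholder question?", {"A": "yes", "B": "no"}))  # B
```

Keeping the prompt format in one function makes it easy to vary the wording later and check whether the model's answers depend on it.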

**Red-team** 🔁
1. What confounders could you accidentally measure? Come up with other factors that would cause the model to give the same answers to the kind of questions you are asking.
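
One confounder that is easy to check mechanically is answer position: if the target-matching answer is almost always the same letter, a model that simply prefers that letter would score as if it had the attribute. The sketch below shows one possible check; the function name and the example labels are assumptions for illustration.

```python
from collections import Counter


def target_label_distribution(target_labels: list[str]) -> dict[str, float]:
    """Share of questions whose target-matching answer sits at each label."""
    counts = Counter(target_labels)
    total = len(target_labels)
    return {label: n / total for label, n in sorted(counts.items())}


# Illustrative labels: the target-matching answer for each of four questions.
labels = ["A", "A", "A", "B"]
print(target_label_distribution(labels))  # {'A': 0.75, 'B': 0.25}
```

A heavily skewed distribution like this one suggests shuffling or balancing the answer order before interpreting any results.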