# Inspect - an Eval Framework
## Introduction
- As we saw in the lesson on simple testing, changing the prompt slightly and sending it to the same LLM can give different results.
- Now imagine swapping one model for another: how do you make sure your code keeps working?
- Much like in Test Driven Development, the LLM community talks about *Evals*.
- Evals can be seen as a test suite that checks the results across multiple prompts and LLMs.
- We've seen how this works by hand, but luckily there are frameworks with helpers, so we don't have to do it all ourselves.
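
To make the idea concrete, a hand-rolled eval loop might look like the sketch below. `fake_model` is a hypothetical stand-in for a real LLM call and `run_evals` is an illustrative helper; neither comes from Inspect AI.

```python
from typing import Callable

# Hypothetical stand-in for a real LLM call: (system prompt, input) -> completion
def fake_model(system_prompt: str, user_input: str) -> str:
    canned = {"Pizza": "Pizza", "Bicycle": "Bicicletta"}
    return canned.get(user_input, "???")

def run_evals(
    model: Callable[[str, str], str],
    prompt: str,
    samples: list[tuple[str, str]],
) -> float:
    """Return the fraction of samples whose completion equals the target."""
    passed = 0
    for user_input, target in samples:
        completion = model(prompt, user_input)
        # Exact comparison, ignoring case and surrounding whitespace
        if completion.strip().lower() == target.lower():
            passed += 1
    return passed / len(samples)

samples = [("Pizza", "Pizza"), ("Bicycle", "Bicicletta"), ("Cheese", "Formaggio")]
accuracy = run_evals(fake_model, "Translate into Italian.", samples)
print(f"accuracy: {accuracy:.2f}")
```

Swapping `fake_model` for another callable is the hand-written equivalent of swapping models; a framework mostly adds datasets, scorers, logging and a viewer on top of this loop.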

The framework we will show here is `Inspect AI` - <https://inspect.ai-safety-institute.org.uk/>. It is also available as a VSCode plugin, which is enabled in this workshop.

## Installation

In [1]:
%pip install -q inspect_ai openai anthropic

Note: you may need to restart the kernel to use updated packages.


## Something to make it work
- Inspect AI is usually run from the CLI or the VSCode plugin.
- To make it behave inside a Docker container, we need to set `XDG_RUNTIME_DIR`.
- Otherwise it fails with `FileNotFoundError: [Errno 2] No such file or directory: '/run/user/1000/inspect_ai/view'`.
- The fix is to point `XDG_RUNTIME_DIR` at a writable directory.

See <https://github.com/UKGovernmentBEIS/inspect_ai/issues/51>


In [2]:
import os
os.environ['XDG_RUNTIME_DIR']="/tmp"
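
If `/tmp` is shared with other users or processes, a slightly more defensive variant is possible. The sketch below keeps an existing writable `XDG_RUNTIME_DIR` and otherwise creates a private temporary directory; `ensure_runtime_dir` is a hypothetical helper, not part of Inspect AI.

```python
import os
import tempfile

def ensure_runtime_dir() -> str:
    """Point XDG_RUNTIME_DIR at a writable directory, creating one if needed."""
    current = os.environ.get("XDG_RUNTIME_DIR")
    if current and os.path.isdir(current) and os.access(current, os.W_OK):
        return current  # already usable, leave it alone
    # mkdtemp creates a fresh directory accessible only by the current user
    runtime_dir = tempfile.mkdtemp(prefix="inspect-runtime-")
    os.environ["XDG_RUNTIME_DIR"] = runtime_dir
    return runtime_dir

runtime_dir = ensure_runtime_dir()
print(runtime_dir)
```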

## Evaluating LLM as a translator
Imagine we have found the perfect prompt to use an LLM as a translator. How would we evaluate it?

Inspect AI has a decorator called *task*. With it, Inspect knows which functions to run as evals.

In [3]:
from inspect_ai import Task, list_tasks, task, eval
from inspect_ai.dataset import Sample
from inspect_ai.scorer import match, model_graded_qa
from inspect_ai.solver import generate, system_message

@task()
def test_translator():

    OUR_PROMPT = "You are a friendly translator. Translate the following words into Italian. Only answer with a single word"

    # The dataset to test
    dataset = [
        Sample(input = "Pizza", target = "Pizza"),
        Sample(input="Bicycle", target = "Bicicletta")
    ]

    # A task to run evals on:
    # - The dataset to run
    # - The plan to execute
    # - The way to give the result/completion a score
    return Task(
        dataset=dataset,
        plan=[
          system_message(message=OUR_PROMPT),
          generate()
        ],
        scorer=[match(),model_graded_qa()]
    )

    # Plan is a list of steps to solve the task
    # In this case we set the system prompt and then generate (ask the completion)

    # Scorer is a list of scoring functions to evaluate the task
    # Example scorers
    # https://inspect.ai-safety-institute.org.uk/scorers.html
    # match() - Determine whether the target from the Sample appears at the beginning or end of the model output (defaults to looking at the end).
    # pattern() - Extract the answer from the model output using a regular expression.
    # model_graded_qa() - Have another model assess whether the model output is a correct answer based on the grading guidance contained in target. Has a built-in template that can be customised.
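
To get a feel for what the first two scorers check, here is a rough plain-Python approximation. `match_like` and `pattern_like` are hypothetical names for this sketch and are not Inspect AI's implementation; the real scorers handle more cases (punctuation, numeric answers, and so on).

```python
import re

def match_like(output: str, target: str, location: str = "end") -> bool:
    """Is the target at the beginning or end of the output? (case-insensitive)"""
    out = output.strip().lower()
    tgt = target.strip().lower()
    if location == "begin":
        return out.startswith(tgt)
    return out.endswith(tgt)

def pattern_like(output: str, target: str, regex: str) -> bool:
    """Extract the first capture group from the output and compare it with the target."""
    found = re.search(regex, output)
    if found is None:
        return False
    return found.group(1).strip().lower() == target.strip().lower()

print(match_like("The translation is Bicicletta", "Bicicletta"))
print(match_like("Bicicletta is the word", "Bicicletta"))
print(pattern_like("Answer: Pizza", "Pizza", r"Answer:\s*(\w+)"))
```

The second call shows why the prompt asks for a single word: a chatty completion can contain the right answer and still fail a position-based check.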

## Running the evals across multiple models

We list all the models we want to run the evals against.

In [4]:
import os

# https://docs.anthropic.com/en/docs/about-claude/models#model-names
models = []
if os.environ.get("ANTHROPIC_API_KEY"):
    models.append("anthropic/claude-3-5-sonnet-20240620")
if os.environ.get("OPENAI_API_KEY"):
    models.append("openai/gpt-4")
    models.append("openai/gpt-4o-mini")

# We run the task for each of the models and print the result
results = eval(test_translator, model=models, max_steps=10)
print(results)


## Inspecting the result of the test runs

We can open the results of the Inspect run:
- Clicking a log opens the log file
- Then right-click to open it in the Inspect view

This is a great way to see the results and keep track of the impact of changes.
You can now run this as part of your test suite, locally or in your CI/CD pipeline.
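
As a sketch of such a CI gate, the helper below collects the models whose run did not finish successfully. It assumes each log returned by `eval()` exposes `status` and `eval.model`; `failed_models` is a hypothetical helper, and the `SimpleNamespace` objects only mimic those two attributes so the example runs without API keys.

```python
from types import SimpleNamespace

def failed_models(logs) -> list[str]:
    """Return the model names of runs whose status is not "success"."""
    return [log.eval.model for log in logs if log.status != "success"]

# Stand-in objects that mimic only the two attributes used above
fake_logs = [
    SimpleNamespace(status="success", eval=SimpleNamespace(model="openai/gpt-4o-mini")),
    SimpleNamespace(status="error", eval=SimpleNamespace(model="openai/gpt-4")),
]
failures = failed_models(fake_logs)
print(failures)
```

In a test suite this could become `assert not failed_models(results)`. Note that it only checks that the runs completed; a threshold on the scores would need the metrics from each log as well.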

## Secure sandbox
See this example of using Docker as a sandbox for running capture-the-flag evaluations:
<https://inspect.ai-safety-institute.org.uk/agents.html#sec-sandbox-environments>

Time to write your own now!