# Running an experiment 🧪

In this notebook, we'll walk through how to run a basic experiment.

An experiment consists of three key components: a dataset, a task, and evaluators.
- **Dataset**: A collection of inputs paired with their expected outputs.
- **Task**: A function that takes an input and generates a response.
- **Evaluators**: Functions that compare the model's output against the expected output (along with the input) and return a value based on specific criteria.
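
Before touching the SDK, it can help to see how the three components fit together. The following is a plain-Python sketch with no SDK calls; the rows, `toy_task`, and `toy_exact_match` are made-up stand-ins for illustration, and a lookup table replaces the model call.

```python
# Plain-Python sketch of the three pieces; no SDK calls are involved.

# Dataset: rows pairing an input with its expected output (made-up rows).
dataset = [
    {"input": "What is the capital of France?", "expected_output": "Paris"},
    {"input": "What is the capital of Japan?", "expected_output": "Tokyo"},
]

# Task: takes one input and produces a response.
# A lookup table stands in for the model call here.
def toy_task(input):
    answers = {
        "What is the capital of France?": "Paris",
        "What is the capital of Japan?": "Kyoto",  # deliberately wrong
    }
    return answers[input]

# Evaluator: compares the output with the expected output.
def toy_exact_match(input, output, expected_output):
    return output == expected_output

# An experiment runs the task on every row, then scores each output.
rows = []
for record in dataset:
    output = toy_task(record["input"])
    score = toy_exact_match(record["input"], output, record["expected_output"])
    rows.append({"output": output, "exact_match": score})

print(rows)
```

The SDK takes care of this loop, plus recording the results in Datadog, in the cells that follow.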

First, let's initialize the required libraries.

In [3]:
import os
from openai import OpenAI
from dotenv import load_dotenv

# Load environment variables from the .env file.
load_dotenv(override=True)

import ddtrace.llmobs.experimentation as dne

dne.init(project_name="Onboarding")

client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))


For this example, we will reuse the `capitals-of-the-world` dataset we created in the previous notebook.

In [None]:
# Pull the dataset from Datadog
dataset = dne.Dataset.pull(name="capitals-of-the-world")
dataset.as_dataframe()

### Task definition

This is the core of the experiment.

We will first define the task we want to evaluate and will decorate it using `@dne.task`.

Anything from a single prompt to a complex agent can run inside this function. The function is executed once per dataset row and only has access to that row.

In [5]:
# The function must accept an `input` parameter, which represents a single dataset row.
@dne.task
def basic_generate_capital_name(input):
    # Inside this function, you can perform any logic you want. In this case, we'll use the OpenAI API to generate a response given the input.
    output = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": input}],
        temperature=1.4
    )
    return output.choices[0].message.content

# We now define the experiment with a descriptive name, dataset, task function, and evaluators.
experiment = dne.Experiment(
    name="gpt-4o-mini", # <---- We can use any name we want for the experiment.
    dataset=dataset,
    task=basic_generate_capital_name,
    evaluators=[]  # Empty for now; we’ll add evaluators later
)

We can now run the experiment! The task is executed against the dataset row by row, with several rows processed concurrently so that the run finishes faster.

In [None]:
results = experiment.run(jobs=10)

In [None]:
# We can access the results locally if we want.
results.as_dataframe()
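
To picture what running row-wise with concurrent jobs means, here is a rough standard-library sketch. It is an assumption for illustration only: the notebook does not describe how the SDK implements `jobs`, and `toy_task` is a made-up stand-in for the model call.

```python
from concurrent.futures import ThreadPoolExecutor

def toy_task(input):
    # Stand-in for a slow model call.
    return input.upper()

inputs = ["paris", "tokyo", "lima"]

# `max_workers` plays a role similar to `jobs`: how many rows are in flight at once.
with ThreadPoolExecutor(max_workers=2) as pool:
    # `map` yields results in input order, regardless of which call finishes first.
    outputs = list(pool.map(toy_task, inputs))

print(outputs)
```

Because each row is independent of the others, the work parallelizes cleanly; a higher job count mostly trades speed against API rate limits.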

## Running evaluations 🔬

Evaluators are functions that assess the model’s performance by comparing the expected output with the actual output generated by the model. They receive the following parameters:
- input – The original input prompt.
- output – The model's generated response.
- expected_output – The correct answer from the dataset.

The evaluator should return a score or assessment based on a defined criterion.

#### Example: Exact Match Evaluator
For this experiment, we create a simple equality check to determine whether the model’s response matches the expected answer.

In [8]:
# Define a custom evaluator to check if the model's output matches the expected answer.
# Evaluators receive `input`, `output` (model-generated), and `expected_output` (ground truth). All of them come automatically from the dataset and the task.
# You can modify the logic to support different evaluation methods like fuzzy matching, semantic similarity, llm-as-a-judge, etc.
@dne.evaluator
def exact_match(input, output, expected_output):
    return expected_output == output

In [None]:
experiment = dne.Experiment(
    name="gpt-4o-mini-with-evals", 
    dataset=dataset, 
    task=basic_generate_capital_name,
    evaluators=[exact_match],
)

results = experiment.run()

## Adding Another Evaluator 🔍
The exact match evaluator may not always work well, especially when the model responds in a conversational format. Even if the answer is correct, slight variations in phrasing can cause mismatches.

To address this, we'll add an evaluator that checks whether the expected answer appears within the model's output, allowing for more flexible matching.

In [10]:
@dne.evaluator
def contains_answer(input, output, expected_output):
    return expected_output in output

After defining the new evaluator, there's no need to rerun the entire experiment. Instead, we can re-evaluate the existing results using the newly added evaluator.

This allows us to efficiently test different evaluation strategies without incurring the computational cost of running the experiment again.

In [None]:
results = experiment.run_evaluations([contains_answer])

We should now see two fields, `exact_match` and `contains_answer`, in the `evaluations` column in the Datadog UI.
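
Both evaluators above are sensitive to case and punctuation. As a sketch of more tolerant matching, the following plain-Python functions normalize the text before comparing and also compute a similarity score. The names `normalize`, `normalized_contains`, and `similarity_score` are hypothetical, and the functions are shown undecorated so they run without the SDK; to use them in an experiment they would presumably need the `@dne.evaluator` decorator like the evaluators above.

```python
from difflib import SequenceMatcher

def normalize(text):
    # Trim whitespace, drop trailing periods, and lowercase.
    return text.strip().rstrip(".").strip().lower()

def normalized_contains(input, output, expected_output):
    # Substring check on the normalized strings.
    return normalize(expected_output) in normalize(output)

def similarity_score(input, output, expected_output):
    # Ratio in [0, 1]; 1.0 means the normalized strings are identical.
    return SequenceMatcher(None, normalize(output), normalize(expected_output)).ratio()

question = "What is the capital of France?"
print(normalized_contains(question, "The capital is PARIS.", "Paris"))
print(similarity_score(question, " paris. ", "Paris"))
```

A boolean check is easy to aggregate, while a numeric score keeps more information about near misses; which one fits depends on how strict the evaluation needs to be.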

### Before finishing, let's talk about configurations

`config` is a dictionary that can be used to parametrize the experiment. It can be used to change the model, the temperature, the system prompt, and other parameters.

**Note:** The `config` parameter is optional. It's a way to record the configuration on each experiment for later analysis. This is useful later on in the UI to compare different configurations.


The `config` parameter in the task function will receive the configuration dictionary, which is defined in the `Experiment` object.

```python
@dne.task
def my_function(input, config):
    ...

# The configuration is defined on the experiment and passed to the task as `config`.
dne.Experiment(..., task=my_function, config={"model": "gpt-4o-mini", "temperature": 1.4})
```
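
As a sketch of how a task might consume such a dictionary, the helper below turns an input and a config into keyword arguments for a chat completion request. `build_request` and the `system_prompt` key are hypothetical names chosen for this illustration, and the fallback values are assumptions rather than SDK defaults.

```python
def build_request(input, config):
    # Read each setting with a fallback so a partial config still works.
    messages = []
    system_prompt = config.get("system_prompt")
    if system_prompt:
        messages.append({"role": "system", "content": system_prompt})
    messages.append({"role": "user", "content": input})
    return {
        "model": config.get("model", "gpt-4o-mini"),
        "temperature": config.get("temperature", 0),
        "messages": messages,
    }

request = build_request(
    "What is the capital of France?",
    {"temperature": 0.2, "system_prompt": "Answer with the capital only."},
)
print(request["model"], request["temperature"], len(request["messages"]))
```

Inside a task, the resulting dictionary could be unpacked into the client call, for example `client.chat.completions.create(**request)`, so that every parameter that affects the output is also recorded in the experiment's config.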

#### Why would you want to use a config?

It helps keep your experiments organized. When configurations are submitted with clearly defined parameters, the UI can surface more insights, such as advanced filtering and correlations between your evals and your config.

In [None]:
# Let's create a new experiment to get better results. This time we add a system prompt and a few-shot example.
@dne.task
def generate_capital_name(input, config):
    output = client.chat.completions.create(
        model=config.get("model"),
        messages=[
            {"role": "system", "content": "You will respond only with the name of the capital of the country, nothing else."},
            {"role": "user", "content": "What is the capital of France?"},
            {"role": "assistant", "content": "Paris"},
            {"role": "user", "content": input}],
        temperature=config.get("temperature", 0)
    )
    return output.choices[0].message.content

experiment = dne.Experiment(
    name="gpt-4o-mini-with-config", 
    dataset=dataset, 
    task=generate_capital_name,
    evaluators=[exact_match, contains_answer],
    config={"model": "gpt-4o-mini", "temperature": 0}
)

results = experiment.run()


In [None]:
results.as_dataframe()

**TIP**: If you come across errors when running experiments, you can make the run fail immediately by passing the `raise_errors` argument to one of the following experiment methods: `run()`, `run_task()`, and `run_evaluators()`. For example, `experiment.run(raise_errors=True)`.
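
To illustrate the difference a flag like this makes, here is a plain-Python sketch of the two modes. It is not the SDK's implementation: `run_rows` and `flaky_task` are hypothetical, and the assumption that failures are otherwise recorded per row is mine, not something the notebook states.

```python
def run_rows(task, inputs, raise_errors=False):
    results = []
    for value in inputs:
        try:
            results.append({"output": task(value), "error": None})
        except Exception as exc:
            if raise_errors:
                raise  # stop at the first failing row
            # Otherwise record the failure and move on to the next row.
            results.append({"output": None, "error": repr(exc)})
    return results

def flaky_task(input):
    # Fails on empty input to simulate a bad row.
    if not input:
        raise ValueError("empty input")
    return input.title()

print(run_rows(flaky_task, ["paris", "", "lima"]))
```

Failing fast is convenient while debugging a task, since the traceback points at the first bad row; collecting errors suits long runs where one bad row should not discard the rest.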

## Awesome! 🎉

We've just created three experiments, each with a different combination of task, evaluators, and configuration.

This is just the beginning. The task is very simple and not the best way to evaluate models, but it's a good starting point for understanding how to use the SDK.

In the next notebook we'll cover how to use a more complex task to evaluate models with more nuanced evaluators.


---


Feel free to play around with datasets, tasks, and evaluators to get a better understanding of how to use the SDK.