# Running an experiment 🧪

In this notebook, we'll walk through how to run a basic experiment.

An experiment consists of three key components: a dataset, a task, and evaluators.
- **Dataset**: A collection of rows, each holding an input and its expected output.
- **Task**: A function that takes an input and generates a response.
- **Evaluators**: Functions that compare the model's output against the expected output (along with the input) and return a value based on specific criteria.

First, let's initialize the required libraries.

In [None]:
import os

from dotenv import load_dotenv
# Load environment variables from the .env file.
load_dotenv(override=True)

from typing import Dict, Any

from ddtrace.llmobs import LLMObs

from openai import OpenAI

LLMObs.enable(
    api_key=os.getenv("DD_API_KEY"),
    app_key=os.getenv("DD_APPLICATION_KEY"),
    project_name="Onboarding",
    ml_app="Onboarding-ML-App",
)

oai_client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))

For this example, we will reuse the `capitals-of-the-world` dataset we created in the previous notebook.

In [None]:
# Pull the dataset from Datadog
dataset = LLMObs.pull_dataset(name="capitals-of-the-world")
dataset.as_dataframe()

### Defining the Task & Evaluator

This is the core of the experiment.

First, we will define the task we want to evaluate.

You can evaluate anything from single prompts to complex agents inside this function. The task and evaluators will be executed row-wise.

Evaluators are functions that assess the model’s performance by comparing the expected output with the actual output generated by the model. They receive the following parameters:
- `input_data`: The original input prompt.
- `output_data`: The model's generated response.
- `expected_output`: The correct answer from the dataset.

The evaluator should return a score or assessment based on a defined criterion.
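
As a minimal sketch of that contract, the evaluator below takes the same three parameters and returns a boolean. The name `normalized_match` and its normalization rules (trimming whitespace, dropping trailing punctuation, ignoring case) are assumptions made for illustration; they are not part of the SDK.

```python
def normalized_match(input_data, output_data, expected_output):
    """Hypothetical evaluator: equality check after light normalization."""

    def normalize(text):
        # Trim whitespace, drop trailing sentence punctuation, ignore case.
        return str(text).strip().rstrip(".!?").strip().lower()

    return normalize(output_data) == normalize(expected_output)


# Quick check with made-up values (not taken from a real run).
is_match = normalized_match(
    {"question": "What is the capital of France?"}, " paris. ", "Paris"
)
```

An evaluator like this sits between a strict equality check and a substring check: it tolerates formatting differences but still rejects full-sentence answers.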

#### Example 1:
For this experiment, we will ask the LLM the question using a predefined configuration (model and temperature) and take the answer as is. We then apply two simple evaluators: an equality check, which tests whether the model's response matches the expected answer exactly, and a containment check, which tests whether the response contains the expected answer.

In [None]:
# The task function receives one row of input and processes it using the provided config.
def generate_capital(input_data: Dict[str, Any], config: Dict[str, Any]) -> str:
    output = oai_client.chat.completions.create(
        model=config["model"],
        messages=[{"role": "user", "content": input_data["question"]}],
        temperature=config["temperature"]
    )

    return output.choices[0].message.content

# Evaluators receive `input_data`, `output_data` (the output to test against), and `expected_output` (ground truth). All of them come automatically from the dataset and the task.
# You can modify the logic to support different evaluation methods like fuzzy matching, semantic similarity, llm-as-a-judge, etc.
def exact_match(input_data, output_data, expected_output):
    return expected_output == output_data

def contains_answer(input_data, output_data, expected_output):
    return expected_output in output_data

# We now define the experiment with a descriptive name, dataset, task function, and evaluators.
experiment = LLMObs.experiment(
    name="generate-capital-with-config",
    dataset=dataset,
    task=generate_capital,
    evaluators=[exact_match, contains_answer],
    config={"model": "gpt-4.1-nano", "temperature": 0},
    description="a cool basic experiment with config",
)

We can now run the experiment! The task is executed against the dataset row by row, with rows processed concurrently so that the run finishes faster.

In [None]:
results = experiment.run(jobs=5)
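
To build intuition for what row-wise concurrent execution means, here is a conceptual sketch using a thread pool from the standard library. It is not the SDK's implementation: `run_rows`, `fake_task`, and the sample rows are hypothetical, and treating `jobs` as the number of worker threads is an assumption.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-ins for dataset rows; not the SDK's own types.
rows = [
    {"question": "What is the capital of France?"},
    {"question": "What is the capital of Japan?"},
    {"question": "What is the capital of Kenya?"},
]


def fake_task(input_data, config):
    # Placeholder for a model call: simply upper-cases the question.
    return input_data["question"].upper()


def run_rows(task, rows, config, jobs):
    # Apply the task to every row using up to `jobs` worker threads.
    # `pool.map` yields results in the same order as `rows`.
    with ThreadPoolExecutor(max_workers=jobs) as pool:
        return list(pool.map(lambda row: task(row, config), rows))


outputs = run_rows(fake_task, rows, config={}, jobs=2)
```

Because each row is independent, the work parallelizes cleanly, which is why slow calls such as LLM requests benefit from more workers.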

After the experiment run is complete, you can see it in Datadog (it may take a few seconds to be accessible).

In [None]:
experiment.url

We can see that the `exact_match` check fails, because the LLM answers in a full sentence, while the `contains_answer` check does well.
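
The difference between the two checks can be reproduced locally without calling the model. The response string below is a hypothetical example of a sentence-style answer, not output from a real run.

```python
def exact_match(input_data, output_data, expected_output):
    return expected_output == output_data


def contains_answer(input_data, output_data, expected_output):
    return expected_output in output_data


# Hypothetical sentence-style response and the expected answer.
sentence_response = "The capital of France is Paris."
expected = "Paris"

exact_result = exact_match(None, sentence_response, expected)
contains_result = contains_answer(None, sentence_response, expected)
print(exact_result, contains_result)  # prints: False True
```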

Now let's refine the task so that it returns a more concise answer, which should improve the evaluation results.

#### Example 2:
For this experiment, we will ask the LLM the same question with the same model and temperature configuration, but instruct it to respond with only the name of the capital. We reuse the same equality and containment checks to determine whether the model's response matches or contains the expected answer.

In [None]:
def generate_capital_name_one_word(input_data: Dict[str, Any], config: Dict[str, Any]) -> str:
    output = oai_client.chat.completions.create(
        model=config["model"],
        messages=[
            {"role": "system", "content": "You will respond only with the name of the capital of the country, nothing else."},
            {"role": "user", "content": "What is the capital of France?"},
            {"role": "assistant", "content": "Paris"},
            {"role": "user", "content": input_data["question"]}],
        temperature=config["temperature"]
    )

    return output.choices[0].message.content

def exact_match(input_data, output_data, expected_output):
    return expected_output == output_data

def contains_answer(input_data, output_data, expected_output):
    return expected_output in output_data

experiment = LLMObs.experiment(
    name="generate-one-word-capital-with-config",
    dataset=dataset,
    task=generate_capital_name_one_word,
    evaluators=[exact_match, contains_answer],
    config={"model": "gpt-4.1-nano", "temperature": 0},
    description="a cool basic experiment with config",
)

And run the experiment again...

In [None]:
experiment.run(jobs=2)

experiment.url

After a few seconds, you should see in the experiment in Datadog that we have matching responses!

**TIP**: If you come across errors when running experiments, you can have them raised immediately by passing the `raise_errors` argument to the `run()` method, e.g., `experiment.run(raise_errors=True)`.

## Awesome! 🎉

We've just created two experiments with different results! This is just the beginning. The task is very simple and is not the best way to evaluate models, but it's a good starting point for understanding how to use the SDK.

In the next notebook we'll cover how to use a more complex task to evaluate models with more nuanced evaluators.


---


Feel free to play around with datasets, tasks, and evaluators to get a better understanding of how to use the SDK.