# Running an experiment 🧪

In this notebook, we'll walk through how to run a basic experiment.

An experiment consists of three key components: a dataset, a task, and evaluators.
- Dataset: A collection of records, each with an input and an expected output.
- Task: A function that takes an input and generates a response.
- Evaluator: A function that compares the model's output against the expected output (along with the input) and returns a value based on specific criteria.
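
To make the relationship between the three components concrete, here is a minimal plain-Python sketch of the loop an experiment performs. It is not the SDK's implementation: the records, the `fake_task` stand-in, and the `run_sketch` helper are all hypothetical and exist only for illustration.

```python
# Plain-Python sketch of an experiment loop; NOT the SDK's implementation.
# All names and records below are hypothetical.

records = [
    {"input": "What is the capital of France?", "expected_output": "Paris"},
    {"input": "What is the capital of Japan?", "expected_output": "Tokyo"},
]

def fake_task(input):
    # Stand-in for a model call: answers one question, gives up on the rest.
    answers = {"What is the capital of France?": "Paris"}
    return answers.get(input, "I don't know")

def exact_match_check(input, output, expected_output):
    # Evaluator: compares the task output with the expected output.
    return output == expected_output

def run_sketch(records, task, evaluators):
    rows = []
    for record in records:
        output = task(record["input"])
        row = {
            "input": record["input"],
            "output": output,
            "expected_output": record["expected_output"],
        }
        # One result column per evaluator, keyed by the function name.
        for evaluate in evaluators:
            row[evaluate.__name__] = evaluate(
                record["input"], output, record["expected_output"]
            )
        rows.append(row)
    return rows

rows = run_sketch(records, fake_task, [exact_match_check])
for row in rows:
    print(row["input"], "->", row["output"], "|", row["exact_match_check"])
```

The rest of this notebook does the same thing with the SDK's `Dataset`, `@task`, `@evaluator`, and `Experiment`.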

In [32]:
import os
from openai import OpenAI
from dotenv import load_dotenv

# Load environment variables from a .env file, overriding existing ones.
# Disable override if your environment is defined outside the virtualenv.
load_dotenv(override=True)

from ddtrace.llmobs import Dataset, Experiment, task, evaluator


client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))

In [None]:
# Pull the dataset from Datadog
dataset = Dataset.pull(name="capitals-of-the-world")
dataset.as_dataframe()

In [None]:
# Define a task function to evaluate the dataset.
# The function must accept an `input` parameter, which represents a single dataset row.
@task
def basic_generate_capital_name(input):
    output = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": input}],
        temperature=1.4
    )
    return output.choices[0].message.content

# Define the experiment with a descriptive name, dataset, task function, and evaluators.
experiment = Experiment(
    name="gpt-4o-mini",
    dataset=dataset,
    task=basic_generate_capital_name,
    evaluators=[]  # Empty for now; we’ll add evaluators later
)

# Run the experiment and retrieve results
results = experiment.run()
results.as_dataframe()

In [None]:
# Push the results to Datadog
results.push()

## Using custom evaluators 🔬

Evaluators are functions that assess the model’s performance by comparing the expected output with the actual output generated by the model. They receive the following parameters:
- input – The original input prompt.
- output – The model's generated response.
- expected_output – The correct answer from the dataset.

The evaluator should return a score or assessment based on a defined criterion.
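
A criterion does not have to be a yes/no check. As a sketch, the function below returns a graded similarity score using the standard library's `difflib`. It is left undecorated so it runs without the SDK, and whether the SDK accepts float return values from an `@evaluator` is an assumption you should verify before using it in an experiment.

```python
from difflib import SequenceMatcher

def similarity_score(input, output, expected_output):
    # Hypothetical evaluator body: returns a ratio in [0.0, 1.0].
    # Assumption: numeric scores are an acceptable evaluator return value.
    normalized_output = output.strip().lower()
    normalized_expected = expected_output.strip().lower()
    return SequenceMatcher(None, normalized_output, normalized_expected).ratio()

# Illustrative strings, not taken from the dataset.
print(similarity_score("What is the capital of France?", "paris ", "Paris"))
print(similarity_score("What is the capital of France?", "Lyon", "Paris"))
```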

### Example: exact match evaluator
For this experiment, we create a simple equality check to determine whether the model’s response matches the expected answer.

In [42]:
# Define a custom evaluator to check if the model's output matches the expected answer.
# Evaluators receive `input`, `output` (model-generated), and `expected_output` (ground truth).
# You can modify the logic to support different evaluation methods.
@evaluator
def exact_match(input, output, expected_output):
    return expected_output == output

In [None]:
experiment = Experiment(
    name="gpt-4o-mini-with-evals", 
    dataset=dataset, 
    task=basic_generate_capital_name,
    evaluators=[exact_match],
    config={"model": "gpt-4o-mini", "temperature": 1.4} 
)

results = experiment.run()
results.as_dataframe()

## Adding another evaluator 🔍
The exact match evaluator may not always work well, especially when the model responds in a conversational format. Even if the answer is correct, slight variations in phrasing can cause mismatches.

To address this, we'll add an evaluator that checks whether the expected answer appears within the model's output, allowing for more flexible matching.

In [44]:
@evaluator
def contains_answer(input, output, expected_output):
    return expected_output in output
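
As a quick sanity check outside the SDK, we can compare the two checks on a hand-written conversational reply. The functions below are undecorated copies of the evaluators' logic, and the sample strings are illustrative rather than taken from the dataset.

```python
# Undecorated copies of the two checks, so this cell runs without the SDK.
def exact_match_check(input, output, expected_output):
    return expected_output == output

def contains_answer_check(input, output, expected_output):
    return expected_output in output

question = "What is the capital of France?"   # illustrative input
response = "The capital of France is Paris."  # hypothetical conversational reply

print(exact_match_check(question, response, "Paris"))      # False
print(contains_answer_check(question, response, "Paris"))  # True

# Substring matching is still case-sensitive.
print(contains_answer_check(question, "PARIS", "Paris"))   # False
```

If casing differences matter for your dataset, lowercasing both strings before comparing is one option.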

After defining the new evaluator, there's no need to rerun the entire experiment. Instead, we can re-evaluate the existing results using the newly added evaluator.

This allows us to efficiently test different evaluation strategies without incurring the computational cost of running the experiment again.
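
Conceptually, re-evaluation only needs the stored outputs. The sketch below illustrates the idea in plain Python with hypothetical stored rows and a hypothetical `reevaluate` helper; it does not show how the SDK implements `run_evaluations`.

```python
# Hypothetical rows saved from an earlier run.
stored_rows = [
    {
        "input": "What is the capital of France?",
        "output": "The capital of France is Paris.",
        "expected_output": "Paris",
    },
    {
        "input": "What is the capital of Japan?",
        "output": "Tokyo",
        "expected_output": "Tokyo",
    },
]

def exact_match_check(input, output, expected_output):
    return expected_output == output

def contains_answer_check(input, output, expected_output):
    return expected_output in output

def reevaluate(rows, evaluators):
    # Score stored outputs only; the task (the model call) is never invoked.
    rescored = []
    for row in rows:
        scores = {
            fn.__name__: fn(row["input"], row["output"], row["expected_output"])
            for fn in evaluators
        }
        rescored.append({**row, **scores})
    return rescored

rescored = reevaluate(stored_rows, [exact_match_check, contains_answer_check])
for row in rescored:
    print(row["output"], "|", row["exact_match_check"], row["contains_answer_check"])
```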

In [None]:
results = experiment.run_evaluations([exact_match, contains_answer])

We should now see two new columns in the dataframe, `custom_evaluator` and `custom_evaluator_2`.

In [None]:
results.as_dataframe()

## Adding another experiment
You can create and publish multiple experiments, all of which will be accessible in the Datadog UI. This allows you to compare different models, prompts, or evaluation strategies within the same dataset.

In this experiment, we’ll test a smaller model and analyze its performance. We'll be using [OpenRouter's API](https://openrouter.ai) to generate responses.

In [33]:
# Let's create a new client for the OpenRouter API.
client = OpenAI(base_url="https://openrouter.ai/api/v1", api_key=os.getenv("OPENROUTER_API_KEY"))
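
One thing to watch for: `os.getenv` returns `None` when a variable is missing, and the `openai` client may then fall back to `OPENAI_API_KEY` instead of failing. A small fail-fast helper avoids that. `require_env` below is a hypothetical helper, not part of any SDK.

```python
import os

def require_env(name):
    # Hypothetical helper: raise immediately if the variable is missing or empty.
    value = os.getenv(name)
    if not value:
        raise RuntimeError(f"Environment variable {name} is not set")
    return value

# Usage (commented out so this cell runs without the key set):
# client = OpenAI(
#     base_url="https://openrouter.ai/api/v1",
#     api_key=require_env("OPENROUTER_API_KEY"),
# )
```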

### Before continuing, let's talk about configurations

`config` is a dictionary that can be used to parametrize the experiment. It can be used to change the model, the temperature, the system prompt, and other parameters.

**Note:** The `config` parameter is optional. It's a way to record the configuration on each experiment for later analysis. This is useful later on in the UI to compare different configurations.


The `config` parameter in the task function will receive the configuration dictionary, which is defined in the `Experiment` object.

```python
@task
def my_function(input, config):
    ...

Experiment(... task=my_function, config={"model": "gpt-4o-mini", "temperature": 1.4}) # This is how you define the configuration in the experiment
```

It's also convenient when you want to do something like a hyperparameter search, which we'll cover in notebook 03.
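
As a rough preview of why a dictionary is convenient here, the sketch below builds one `config` per combination of values. The value lists and the `describe` helper are hypothetical, and notebook 03 may approach this differently.

```python
from itertools import product

# Hypothetical values to sweep over.
models = ["gpt-4o-mini"]
temperatures = [0, 0.7, 1.4]

# One config dictionary per (model, temperature) combination.
configs = [
    {"model": model, "temperature": temperature}
    for model, temperature in product(models, temperatures)
]

def describe(config):
    # Hypothetical helper. `.get` supplies a default when a key is omitted.
    return f"{config['model']} @ temperature={config.get('temperature', 0)}"

for config in configs:
    print(describe(config))
```

Each dictionary could then be passed as the `config` argument of a separate `Experiment`.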


In this case, we'll pass the model provider, the model, and the temperature.

In [None]:
# Let's create a new experiment to get better results. This time we'll add a system prompt and a few-shot example.

@task
def generate_capital_name(input, config):
    output = client.chat.completions.create(
        model=f"{config['model_provider']}/{config['model']}",
        messages=[
            {"role": "system", "content": "You will respond only with the name of the capital of the country, nothing else."},
            {"role": "user", "content": "What is the capital of France?"},
            {"role": "assistant", "content": "Paris"},
            {"role": "user", "content": input}],
        temperature=config.get("temperature", 0)
    )
    return output.choices[0].message.content

experiment = Experiment(
    name="meta-llama-3.2-3b-instruct-0", 
    dataset=dataset, 
    task=generate_capital_name,
    evaluators=[exact_match, contains_answer],
    config={"model_provider": "meta-llama", "model": "llama-3.2-3b-instruct", "temperature": 0}
)

results = experiment.run()


In [None]:
results.as_dataframe()

## Awesome! 🎉

We've just created three experiments with different models and configurations.

This is just the beginning. The task is very simple and not the best way to evaluate models, but it's a good starting point for understanding how to use the SDK.

In the next notebook we'll cover how to use a more complex task to evaluate models with more nuanced evaluators.


---


Feel free to play around with datasets, tasks, and evaluators to get a better understanding of how to use the SDK.