## Advanced Experiment Parameterization 🚀

In this example, we'll run a grid search over different models and temperatures, using a single shared prompt.

### Scenario 🍔 🏋️
Imagine an app that helps users track their meals and workouts. Users can log a natural language description of their meal, and the app will extract macronutrient information. Similarly, for workouts, the app will identify exercise types and durations.

### Task Definition
The model’s task is to extract structured information from meal and workout descriptions.
- **For meals**: Extract macronutrients for each food item.
- **For workouts**: Identify the type of exercise and duration.
- **For mixed entries**: Handle multiple meals and workouts in a single input.

**Example Inputs**:
- "1 medium-size pizza"
- "1 medium-size pizza and 1 medium-size Coke"
- "30 minutes of yoga"
- "1 hour of yoga"
- "3x10 bench press 225lbs"

**Goal**:
Find the best-performing model that:
- ✅ Avoids hallucinations (doesn’t invent data).
- ✅ Extracts accurate macronutrient estimates from meals.
- ✅ Identifies exercise types and durations correctly.

### Evaluation Criteria 🏆
Each model will be evaluated based on the following:
- **Schema Adherence**: Does the model return a valid JSON output?
- **Appropriate Refusals**: If the input is irrelevant (e.g., a meditation log), the model should refuse rather than hallucinate.
- **Macronutrient Accuracy**: Scores for each macronutrient and an overall accuracy score.
- **Workout Extraction Accuracy**: Correctly identifying the exercise type and duration.
- **Handling Complex Inputs**: Can the model process multiple meals and workouts in the same entry?
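
The schema-adherence criterion can be made concrete with a small validator. The sketch below is only an assumption about what "valid" means, based on the meal, workout, and error fields used in the dataset later in this notebook. The names `REQUIRED_FIELDS`, `is_valid_entry`, and `is_valid_output` are hypothetical and not part of any library.

```python
# Hypothetical helper: required keys per entry type (an assumption based on
# the fields that appear in the dataset records of this notebook).
REQUIRED_FIELDS = {
    "meal": {"description", "calories", "protein", "carbs", "fat"},
    "workout": {"description"},
    "error": set(),
}


def is_valid_entry(item):
    """Return True if `item` is a dict with a known type and all required keys."""
    if not isinstance(item, dict):
        return False
    entry_type = item.get("type")
    if not isinstance(entry_type, str):
        return False
    required = REQUIRED_FIELDS.get(entry_type)
    if required is None:
        return False
    return required.issubset(item.keys())


def is_valid_output(parsed):
    """A valid output is a non-empty list of meal/workout entries,
    or a single error object."""
    if isinstance(parsed, list):
        return bool(parsed) and all(
            is_valid_entry(entry) and entry.get("type") != "error"
            for entry in parsed
        )
    return is_valid_entry(parsed) and parsed.get("type") == "error"
```

A check like this could run on the parsed model reply before any accuracy scoring, so that malformed replies are counted separately from inaccurate ones.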

### Experiment Setup 🔬
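To explore the best configuration, we'll conduct a grid search with the following parameters:
- **Models**: 23 models, served through OpenRouter
- **Temperatures**: 2 values (0.0 and 0.8)
- **Prompts**: 1 system prompt, shared by every run

This yields 23 × 2 = 46 experiments.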

After running these experiments, we’ll analyze the results and refine the best-performing setups for further iteration. 🚀

In [1]:
import json
import os

from dotenv import load_dotenv

# Load environment variables from a .env file, overriding existing ones.
# Disable override if your environment is defined outside the virtualenv.
load_dotenv(override=True)

from typing import Literal, Optional
from itertools import product

from openai import OpenAI
from pydantic import BaseModel

import ddtrace.llmobs.experimentation as dne
dne.init(project_name="onboarding")

## 1. Creating a Dataset

In [None]:
dataset_name = "meals-and-workouts"
dataset = dne.Dataset(name=dataset_name, data=[
    {
        "input": "1 medium size pizza",
        "expected_output": [
            {"type": "meal", "description": "1 medium size pizza", "calories": 2300, "protein": 20, "carbs": 20, "fat": 20}
        ]
    },
    {
        "input": "30 minutes of yoga",
        "expected_output": [
            {"type": "workout", "description": "30 minutes of yoga", "exercise": "yoga", "duration_seconds": 1800}
        ]
    },
    {
        "input": "1 hour of yoga",
        "expected_output": [
            {"type": "workout", "description": "1 hour of yoga", "exercise": "yoga", "duration_seconds": 3600}
        ]
    },
    {
        "input": "3x10 bench press 225lbs",
        "expected_output": [
            {"type": "workout", "description": "3x10 bench press 225lbs", "exercise": "bench press", "series": 3, "reps": 10, "weight_kg": 102}
        ]
    },
    {
        "input": "1000 calories",
        "expected_output": [
            {"type": "meal", "description": "1000 calories", "calories": 1000, "protein": 0, "carbs": 0, "fat": 0}
        ]
    },
    {
        "input": "Feeling hungry",
        "expected_output": {"type": "error"}
    },
    {
        "input": "5 tacos and a coke",
        "expected_output": [
            {"type": "meal", "description": "5 tacos", "calories": 1000, "protein": 100, "carbs": 100, "fat": 100},
            {"type": "meal", "description": "a coke", "calories": 100, "protein": 0, "carbs": 0, "fat": 0}
        ]
    },
    {
        "input": "I ran 5km and then ate 3 hard boiled eggs",
        "expected_output": [
            {"type": "workout", "description": "I ran 5km", "exercise": "running", "duration_seconds": 1800},
            {"type": "meal", "description": "3 hard boiled eggs", "calories": 200, "protein": 20, "carbs": 2, "fat": 14}
        ]
    },
    {
        "input": "During the morning I had a bacon-egg-and-cheese, in the afternoon I don’t remember, and now I’m having a small dish of lasagna.",
        "expected_output": [
            {"type": "meal", "description": "bacon-egg-and-cheese", "calories": 1000, "protein": 100, "carbs": 100, "fat": 100},
            {"type": "meal", "description": "a small dish of lasagna", "calories": 1000, "protein": 100, "carbs": 100, "fat": 100}
        ]
    },
    {
        "input": "I had a salad for lunch",
        "expected_output": [
            {"type": "meal", "description": "a salad", "calories": 100, "protein": 10, "carbs": 10, "fat": 10}
        ]
    },
    {
        "input": "Hey, how are you doing?",
        "expected_output": {"type": "error"}
    },
    {
        "input": "Started my day with overnight oats with berries, had a protein shake after my workout, and finished with grilled salmon and quinoa for dinner",
        "expected_output": [
            {"type": "meal", "description": "overnight oats with berries", "calories": 350, "protein": 12, "carbs": 56, "fat": 8},
            {"type": "meal", "description": "protein shake", "calories": 160, "protein": 30, "carbs": 3, "fat": 2},
            {"type": "meal", "description": "grilled salmon and quinoa", "calories": 650, "protein": 45, "carbs": 45, "fat": 28}
        ]
    },    
    {
        "input": "Had authentic pad thai from the street vendor and 2 mango sticky rice for dessert",
        "expected_output": [
            {"type": "meal", "description": "pad thai", "calories": 600, "protein": 22, "carbs": 80, "fat": 25},
            {"type": "meal", "description": "2 mango sticky rice", "calories": 700, "protein": 8, "carbs": 140, "fat": 16}
        ]
    },    
    {
        "input": "45min HIIT session followed by 2 protein bars and a banana",
        "expected_output": [
            {"type": "workout", "description": "45min HIIT session", "exercise": "HIIT", "duration_seconds": 2700},
            {"type": "meal", "description": "2 protein bars", "calories": 440, "protein": 40, "carbs": 44, "fat": 16},
            {"type": "meal", "description": "banana", "calories": 105, "protein": 1, "carbs": 27, "fat": 0}
        ]
    },    
    {
        "input": "I think I might grab something later",
        "expected_output": {"type": "error"}
    },
    {
        "input": "Morning routine: 5k run, then 4x12 squats at 185lbs, finished with 10 minutes of stretching",
        "expected_output": [
            {"type": "workout", "description": "5k run", "exercise": "running", "duration_seconds": 1800},
            {"type": "workout", "description": "4x12 squats at 185lbs", "exercise": "squats", "series": 4, "reps": 12, "weight_kg": 84},
            {"type": "workout", "description": "10 minutes of stretching", "exercise": "stretching", "duration_seconds": 600}
        ]
    },    
    {
        "input": "McDonald's dinner: Big Mac, large fries, large Coke, and a McFlurry",
        "expected_output": [
            {"type": "meal", "description": "Big Mac", "calories": 563, "protein": 26, "carbs": 45, "fat": 33},
            {"type": "meal", "description": "large fries", "calories": 510, "protein": 6, "carbs": 66, "fat": 24},
            {"type": "meal", "description": "large Coke", "calories": 290, "protein": 0, "carbs": 75, "fat": 0},
            {"type": "meal", "description": "McFlurry", "calories": 510, "protein": 11, "carbs": 80, "fat": 16}
        ]
    },    
    {
        "input": "Buddha bowl with quinoa, roasted chickpeas, sweet potato, and tahini dressing",
        "expected_output": [
            {"type": "meal", "description": "Buddha bowl", "calories": 680, "protein": 22, "carbs": 102, "fat": 26}
        ]
    },
    {
        "input": "2 hours of rock climbing and bouldering",
        "expected_output": [
            {"type": "workout", "description": "rock climbing and bouldering", "exercise": "rock climbing", "duration_seconds": 7200}
        ]
    },    
    {
        "input": "Ate 2 tacos al pastor con piña y 1 horchata",
        "expected_output": [
            {"type": "meal", "description": "2 tacos al pastor", "calories": 460, "protein": 28, "carbs": 40, "fat": 24},
            {"type": "meal", "description": "horchata", "calories": 220, "protein": 1, "carbs": 50, "fat": 1}
        ]
    },
    {
        "input": "Made pasta: 100g spaghetti, 200g ground beef, 1 cup marinara sauce, 30g parmesan",
        "expected_output": [
            {"type": "meal", "description": "pasta with meat sauce", "calories": 950, "protein": 58, "carbs": 90, "fat": 40}
        ]
    },
    {
        "input": "Mindlessly ate a family size bag of Doritos while watching Netflix",
        "expected_output": [
            {"type": "meal", "description": "family size bag of Doritos", "calories": 1350, "protein": 18, "carbs": 162, "fat": 74}
        ]
    },    
    {
        "input": "Morning: 30min swim, Evening: 1 hour kickboxing class",
        "expected_output": [
            {"type": "workout", "description": "30min swim", "exercise": "swimming", "duration_seconds": 1800},
            {"type": "workout", "description": "1 hour kickboxing", "exercise": "kickboxing", "duration_seconds": 3600}
        ]
    },
    # Alcohol (should track calories)
    {
        "input": "3 pints of IPA and a plate of nachos at the bar",
        "expected_output": [
            {"type": "meal", "description": "3 pints of IPA", "calories": 720, "protein": 6, "carbs": 60, "fat": 0},
            {"type": "meal", "description": "plate of nachos", "calories": 850, "protein": 25, "carbs": 80, "fat": 50}
        ]
    },
    {
        "input": "Thanksgiving dinner with all the fixings: turkey, stuffing, mashed potatoes, gravy, cranberry sauce, and 2 slices of pumpkin pie",
        "expected_output": [
            {"type": "meal", "description": "Thanksgiving dinner", "calories": 2200, "protein": 85, "carbs": 240, "fat": 95}
        ]
    },
    {
        "input": "Protein shake before gym, chest day (bench 5x5 at 185lbs, 3x12 flyes), protein bar after",
        "expected_output": [
            {"type": "meal", "description": "protein shake", "calories": 160, "protein": 30, "carbs": 3, "fat": 2},
            {"type": "workout", "description": "bench press 5x5 at 185lbs", "exercise": "bench press", "series": 5, "reps": 5, "weight_kg": 84},
            {"type": "workout", "description": "3x12 flyes", "exercise": "flyes", "series": 3, "reps": 12},
            {"type": "meal", "description": "protein bar", "calories": 220, "protein": 20, "carbs": 22, "fat": 8}
        ]
    },
    # Pet food (should error)
    {
        "input": "Fed the dog 2 cups of kibble",
        "expected_output": {"type": "error"}
    },
    {
        "input": "300g of pho with extra brisket and 2 spring rolls",
        "expected_output": [
            {"type": "meal", "description": "300g pho with extra brisket", "calories": 420, "protein": 28, "carbs": 65, "fat": 8},
            {"type": "meal", "description": "2 spring rolls", "calories": 200, "protein": 6, "carbs": 22, "fat": 10}
        ]
    },
    # Spanish - Complex meal day
    {
        "input": "Desayuné un café con leche y tostadas, entrené 45 minutos de crossfit, almorcé paella valenciana, y de cena tortilla española con ensalada",
        "expected_output": [
            {"type": "meal", "description": "café con leche y tostadas", "calories": 280, "protein": 8, "carbs": 45, "fat": 9},
            {"type": "workout", "description": "45 minutos de crossfit", "exercise": "crossfit", "duration_seconds": 2700},
            {"type": "meal", "description": "paella valenciana", "calories": 650, "protein": 35, "carbs": 80, "fat": 22},
            {"type": "meal", "description": "tortilla española con ensalada", "calories": 450, "protein": 20, "carbs": 15, "fat": 35}
        ]
    },
    
    # Chinese - Dim sum feast
    {
        "input": "早茶吃了 4笼小笼包，2份虾饺，1份叉烧包，和一碗艇仔粥",
        "expected_output": [
            {"type": "meal", "description": "4笼小笼包", "calories": 480, "protein": 24, "carbs": 60, "fat": 18},
            {"type": "meal", "description": "2份虾饺", "calories": 320, "protein": 16, "carbs": 40, "fat": 12},
            {"type": "meal", "description": "叉烧包", "calories": 260, "protein": 12, "carbs": 42, "fat": 6},
            {"type": "meal", "description": "艇仔粥", "calories": 220, "protein": 18, "carbs": 30, "fat": 5}
        ]
    },
    # Arabic - Mixed workout and meal
    {
        "input": "تمرين صباحي: ٣٠ دقيقة جري على المشاية، ٤ مجموعات من تمارين البطن. فطور: شكشوكة مع خبز عربي وحمص",
        "expected_output": [
            {"type": "workout", "description": "٣٠ دقيقة جري على المشاية", "exercise": "running", "duration_seconds": 1800},
            {"type": "workout", "description": "٤ مجموعات من تمارين البطن", "exercise": "abs", "series": 4},
            {"type": "meal", "description": "شكشوكة مع خبز عربي وحمص", "calories": 680, "protein": 28, "carbs": 75, "fat": 32}
        ]
    },
    
    # Portuguese - Post-workout meal
    {
        "input": "Treino de força: 5x5 agachamento 100kg, supino 80kg, e levantamento terra 120kg. Depois comi açaí na tigela com granola, banana e whey protein",
        "expected_output": [
            {"type": "workout", "description": "5x5 agachamento 100kg", "exercise": "squat", "series": 5, "reps": 5, "weight_kg": 100},
            {"type": "workout", "description": "5x5 supino 80kg", "exercise": "bench press", "series": 5, "reps": 5, "weight_kg": 80},
            {"type": "workout", "description": "5x5 levantamento terra 120kg", "exercise": "deadlift", "series": 5, "reps": 5, "weight_kg": 120},
            {"type": "meal", "description": "açaí na tigela com granola, banana e whey protein", "calories": 580, "protein": 35, "carbs": 85, "fat": 16}
        ]
    },
    # Spanish - Traditional dinner
    {
        "input": "Cena familiar: 2 raciones de cocido madrileño, 1 copa de rioja, y flan casero de postre",
        "expected_output": [
            {"type": "meal", "description": "2 raciones de cocido madrileño", "calories": 1200, "protein": 65, "carbs": 90, "fat": 55},
            {"type": "meal", "description": "1 copa de rioja", "calories": 125, "protein": 0, "carbs": 4, "fat": 0},
            {"type": "meal", "description": "flan casero", "calories": 220, "protein": 8, "carbs": 32, "fat": 8}
        ]
    },    
    # Chinese - Pet food (error)
    {
        "input": "给我的猫咪添了一碗猫粮和一些小鱼干",
        "expected_output": {"type": "error"}
    },
    {
        "input": "Tomei duas pílulas de vitamina C e um comprimido para dor de cabeça",
        "expected_output": {"type": "error"}
    },
    # Ambiguous input
    {
        "input": "I went to the gym at 3pm",
        "expected_output": {"type": "error"}
    }

])

# Reuse the remote dataset if it already exists; otherwise push the local copy defined above.
try:
    dataset = dne.Dataset.pull(dataset_name)
    print("Dataset pulled")
except Exception:
    dataset.push()
    print("Dataset pushed")

In [None]:
dataset.as_dataframe()
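
Before pushing a hand-written dataset, it can help to sanity-check the records. The following is a minimal sketch that works on a plain list of dicts shaped like the records above. `check_records` is a hypothetical helper, not part of `ddtrace`, and it assumes that `expected_output` is either a list of meal/workout dicts or a single `{"type": "error"}` dict.

```python
def check_records(records):
    """Return a list of human-readable problems found in dataset records."""
    problems = []
    for index, record in enumerate(records):
        text = record.get("input")
        if not isinstance(text, str) or not text.strip():
            problems.append(f"record {index}: missing or empty 'input'")

        expected = record.get("expected_output")
        if isinstance(expected, dict):
            # A dict is only used for refusals in this dataset.
            if expected.get("type") != "error":
                problems.append(f"record {index}: dict expected_output must have type 'error'")
        elif isinstance(expected, list):
            for item in expected:
                if not isinstance(item, dict) or item.get("type") not in ("meal", "workout"):
                    problems.append(f"record {index}: unexpected item {item!r}")
        else:
            problems.append(f"record {index}: expected_output must be a list or an error dict")
    return problems


sample = [
    {"input": "30 minutes of yoga",
     "expected_output": [{"type": "workout", "description": "30 minutes of yoga",
                          "exercise": "yoga", "duration_seconds": 1800}]},
    {"input": "Feeling hungry", "expected_output": {"type": "error"}},
]
print(check_records(sample))  # an empty list means no problems were found
```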

## 2. Evaluators
To assess the model's performance, we’ll use the following evaluation criteria:
- The model returns JSON that follows the expected schema.
- The model refuses when it should. For example, if the log is about meditation, it should return an error.
- The model extracts macronutrients accurately from a meal description. We score each macronutrient separately and also compute a composite score.
- The model extracts the exercise type and duration accurately from a workout description.
- The model handles multiple meals and workouts in the same description.

To generate responses, we’ll be using [OpenRouter's API](https://openrouter.ai).

In [4]:
# Let's create a new client for the OpenRouter API.
client = OpenAI(base_url="https://openrouter.ai/api/v1", api_key=os.getenv("OPENROUTER_API_KEY"))
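
If `OPENROUTER_API_KEY` is missing, `os.getenv` returns `None` and the failure only surfaces later, at request time. One way to fail earlier is sketched below. `require_env` is a hypothetical helper introduced for illustration, not something the notebook or the OpenAI client provides.

```python
import os


def require_env(name):
    """Return the value of environment variable `name`, or raise a clear error."""
    value = os.getenv(name)
    if not value:
        raise RuntimeError(
            f"Environment variable {name} is not set; add it to your .env file."
        )
    return value
```

With this helper, the client could be created with `api_key=require_env("OPENROUTER_API_KEY")`.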

In [5]:
"""
[{'type': 'meal', 'description': '1 medium size pizza', 'calories': 2300, 'protein': 20, 'carbs': 20, 'fat': 20}]
{'results': None, 'error': 'No workouts logged.', 'type': 'error'}

[{'type': 'workout', 'description': '30 minutes of yoga', 'exercise': 'yoga', 'duration_seconds': 1800}]
{'results': [{'type': 'workout', 'description': '30 minutes of yoga', 'duration_seconds': 1800, 'series': 1, 'reps': 0, 'weight_kg': 0}]}

[{'type': 'workout', 'description': '1 hour of yoga', 'exercise': 'yoga', 'duration_seconds': 3600}]
{'results': None, 'error': 'No meal logged.', 'type': 'error'}

[{'type': 'workout', 'description': '3x10 bench press 225lbs', 'exercise': 'bench press', 'series': 3, 'reps': 10, 'weight_kg': 225}]
{'results': [{'type': 'error', 'error': 'No meal logged.'}]}

[{'type': 'meal', 'description': '1000 calories', 'calories': 1000, 'protein': 0, 'carbs': 0, 'fat': 0}]
{'results': None, 'error': 'No meals or workouts logged.', 'type': 'error'}

{'type': 'error'}
{'results': None, 'error': 'No meals or workouts logged.', 'type': 'error'}
"""

@dne.evaluator
def error_when_expected(input, output, expected_output):
    if isinstance(expected_output, dict) and expected_output.get("type") == "error":
        return output.get("type") == "error"
    
    if isinstance(expected_output, list):
        return output.get("results") is not None
    
    return False

# Evaluators for meal macronutrients: each returns the absolute difference between
# the totals in the model output and in the expected output.
@dne.evaluator
def caloric_difference(input, output, expected_output):
    # Add up the calories of every result of type "meal"
    output_calories = sum(item.get("calories") for item in output.get("results") if item.get("type") == "meal")
    expected_calories = sum(item.get("calories") for item in expected_output if item.get("type") == "meal")
    return abs(output_calories - expected_calories)

@dne.evaluator
def protein_difference(input, output, expected_output):
    output_protein = sum(item.get("protein") for item in output.get("results") if item.get("type") == "meal")
    expected_protein = sum(item.get("protein") for item in expected_output if item.get("type") == "meal")
    return abs(output_protein - expected_protein)

@dne.evaluator
def carbs_difference(input, output, expected_output):
    output_carbs = sum(item.get("carbs") for item in output.get("results") if item.get("type") == "meal")
    expected_carbs = sum(item.get("carbs") for item in expected_output if item.get("type") == "meal")
    return abs(output_carbs - expected_carbs)

@dne.evaluator
def fat_difference(input, output, expected_output):
    output_fat = sum(item.get("fat") for item in output.get("results") if item.get("type") == "meal")
    expected_fat = sum(item.get("fat") for item in expected_output if item.get("type") == "meal")
    return abs(output_fat - expected_fat)

# Checks that the output contains the expected number of meals and workouts
@dne.evaluator
def accurate_count(input, output, expected_output):
    # Count meals
    output_meals = sum(1 for item in output.get("results") if item.get("type") == "meal")
    expected_meals = sum(1 for item in expected_output if item.get("type") == "meal")

    # Count workouts
    output_workouts = sum(1 for item in output.get("results") if item.get("type") == "workout")
    expected_workouts = sum(1 for item in expected_output if item.get("type") == "workout")

    return output_meals == expected_meals and output_workouts == expected_workouts

# Composite evaluator
#
# Normalize each macronutrient difference by its expected total, then sum the relative errors.
@dne.evaluator
def composite_difference(input, output, expected_output):
    total = 0.0
    compared = 0
    for macro in ("calories", "protein", "carbs", "fat"):
        expected = sum(item.get(macro) for item in expected_output if item.get("type") == "meal")
        if expected == 0:
            # A relative error is undefined when the expected total is 0, so skip this macronutrient.
            continue
        actual = sum(item.get(macro) for item in output.get("results") if item.get("type") == "meal")
        total += abs(actual - expected) / expected
        compared += 1

    if compared == 0:
        # Nothing to compare (for example, a workout-only entry)
        raise ValueError("No non-zero expected macronutrients to compare")
    return total

@dne.evaluator
def food_type_categorizer(input, output, expected_output):
    input_text = input.lower()
    
    categories = {
        "fast_food": ["burger", "fries", "pizza", "subway", "big mac", "doritos", "nuggets"],
        "healthy": ["salad", "grilled", "steamed", "protein", "quinoa", "kale", "smoothie"],
        "breakfast": ["eggs", "toast", "coffee", "cereal", "oatmeal", "breakfast"],
        "asian": ["sushi", "pho", "dim sum", "bibimbap", "pad thai"],
        "italian": ["pasta", "carbonara", "lasagna"],
        "mexican": ["tacos", "burrito", "enchiladas"],
        "dessert": ["ice cream", "cake", "cookie", "milkshake"]
    }
    
    # Check each category's keywords against the input
    for category, keywords in categories.items():
        if any(keyword in input_text for keyword in keywords):
            return category
            
    return "other"  # Default category if no matches found

@dne.evaluator
def workout_type_categorizer(input, output, expected_output):
    input_text = input.lower()
    
    categories = {
        "cardio": ["run", "walk", "cycle", "jog", "cardio"],
        "strength": ["squat", "bench press", "deadlift", "strength", "gym"],
        "flexibility": ["yoga", "stretching", "flexibility"],
    }
    
    for category, keywords in categories.items():
        if any(keyword in input_text for keyword in keywords):
            return category

    return "other"
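
The evaluators above score meals and categorize inputs, but none of them scores workout durations. A duration check could follow the same pattern as the macronutrient differences. The sketch below is a hypothetical `duration_difference`, written as a plain function so it runs on its own. In the notebook it would presumably be decorated with `@dne.evaluator` and added to the evaluator list.

```python
def duration_difference(input, output, expected_output):
    """Absolute difference, in seconds, between the total workout duration
    in the model output and in the expected output."""

    def total_duration(items):
        # Entries without a duration (for example, sets and reps only)
        # count as 0 seconds.
        return sum(
            item.get("duration_seconds") or 0
            for item in items
            if isinstance(item, dict) and item.get("type") == "workout"
        )

    # The task stores parsed entries under "results"; it is None for refusals.
    actual_items = output.get("results") or []
    # Expected refusals are a dict rather than a list; treat them as no workouts.
    expected_items = expected_output if isinstance(expected_output, list) else []
    return abs(total_duration(actual_items) - total_duration(expected_items))
```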

## 3. Set Up the Experiment

In [6]:
@dne.task
def app(input, config):
    prompt = """
    You are an intelligent assistant that helps users log their workouts and meals using natural language.

    Users can log their workouts using natural language and the app will keep track of their exercise volume and diet over time.
    
    Example inputs:
    - I ate a salad with grilled chicken and a side of quinoa and broccoli.
    - I had a big mac, fries, and a coke.
    - I did 30 minutes of cardio and 20 minutes of strength training.

    Meal:
    {
        "type": "meal",
        "description": "<description>",
        "calories": <calories>,
        "protein": <protein>,
        "carbs": <carbs>,
        "fat": <fat>
    }
    
    Workout:
    {
        "type": "workout",
        "description": "<description>",
        "duration_seconds": <duration_seconds>,
        "series": <series>,
        "reps": <reps>,
        "weight_kg": <weight_kg>
    }

    Schema:
    [Meal | Workout] # An array of meals and workouts

    Example:

    User input:
    I had a big mac, fries, and a coke. Then I went for a 30 minute run.

    Output:
    [
        {"type": "meal", "description": "I had a big mac, fries, and a coke", "calories": 2000, "protein": 100, "carbs": 200, "fat": 100},
        {"type": "workout", "description": "I went for a 30 minute run", "duration_seconds": 1800, "series": 1, "reps": 0, "weight_kg": 0}
    ]

    If the user input does not contain any meals or workouts, return an error.

    {
        "type": "error",
        "error": "<error_message>"
    }
    
    
    Return the output in JSON format. ONLY return the JSON. NO OTHER TEXT. NO MARKDOWN, NO EXPLANATIONS. JUST THE JSON to parse.
    """

    user_input = input

    response = client.chat.completions.create(
        # We use the model from the config to parametrize the experiment
        model=f"{config['model_provider']}/{config['model_name']}",
        messages=[
            {"role": "system", "content": prompt},
            {"role": "user", "content": user_input}
        ],
        # We use the temperature from the config to parametrize the experiment
        temperature=config["temperature"],
        max_completion_tokens=300
    )

    # Check if response has valid structure
    if not response.choices or not response.choices[0].message.content:
        raise ValueError("Invalid API response structure")
    
    content = response.choices[0].message.content

    # Some models wrap the JSON in a ```json fenced block; if so, keep only the fenced payload.
    if content.startswith("```json"):
        content = content.split("```json")[1].split("```")[0].strip()


    # Attempt to parse the response as JSON
    try:
        json_response = json.loads(content)

        output = {}
        if isinstance(json_response, list):
            output["results"] = json_response
        else:
            # Not a list: treat the reply as an error object and copy its fields.
            output["results"] = None
            output["error"] = json_response.get("error")
            output["type"] = json_response.get("type")

        return output
    except json.JSONDecodeError:
        raise ValueError(f"Error parsing JSON: {content}")
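
The fence handling in the task only covers replies that start with a `json`-tagged fence. A slightly more tolerant parser is sketched below, assuming the reply contains at most one fenced block. `extract_json` and `TICKS` are hypothetical names, and this is an alternative sketch rather than the notebook's method.

```python
import json
import re

# Three backticks, built programmatically so the example is easy to embed.
TICKS = "`" * 3

# Matches a fenced block, with or without a "json" tag, anywhere in the text.
_FENCE = re.compile(TICKS + r"(?:json)?\s*(.*?)" + TICKS, re.DOTALL | re.IGNORECASE)


def extract_json(content):
    """Parse JSON from a model reply, tolerating an optional Markdown code fence.

    Raises json.JSONDecodeError if the payload is not valid JSON.
    """
    match = _FENCE.search(content)
    payload = match.group(1) if match else content
    return json.loads(payload.strip())
```

Because the function still raises `json.JSONDecodeError` on malformed input, it could replace the manual fence stripping without changing the surrounding error handling.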

## 4. Running the Experiment

In [7]:
evaluators = [
    error_when_expected,
    caloric_difference,
    protein_difference,
    carbs_difference,
    fat_difference,
    composite_difference,
    food_type_categorizer,
    workout_type_categorizer,
    accurate_count,
]

# Define configurations separately
model_configs = [
    {"model_name": "gpt-4o-mini", "model_provider": "openai"}, 
    {"model_name": "gpt-4o", "model_provider": "openai"},
    {"model_name": "qwen-2.5-72b-instruct", "model_provider": "qwen"},
    {"model_name": "llama-3.1-70b-instruct", "model_provider": "meta-llama"},
    {"model_name": "claude-3.5-haiku", "model_provider": "anthropic"},
    {"model_name": "ministral-8b", "model_provider": "mistralai"},
    {"model_name": "gemini-flash-1.5-8b", "model_provider": "google"},
    {"model_name": "llama-3.2-3b-instruct", "model_provider": "meta-llama"},
    {"model_name": "qwen-2.5-7b-instruct", "model_provider": "qwen"},
    {"model_name": "llama-3.2-11b-vision-instruct", "model_provider": "meta-llama"},
    {"model_name": "llama-3.1-8b-instruct", "model_provider": "meta-llama"},
    {"model_name": "llama-3.1-sonar-small-128k-chat", "model_provider": "perplexity"},
    {"model_name": "gemini-2.0-flash-001", "model_provider": "google"},
    {"model_name": "qwen-plus", "model_provider": "qwen"},
    {"model_name": "o3-mini", "model_provider": "openai"},
    {"model_name": "mistral-small-24b-instruct-2501", "model_provider": "mistralai"},
    {"model_name": "deepseek-r1-distill-qwen-32b", "model_provider": "deepseek"},
    {"model_name": "deepseek-r1-distill-llama-8b", "model_provider": "deepseek"},
    {"model_name": "phi-4", "model_provider": "microsoft"},
    {"model_name": "grok-2-1212", "model_provider": "x-ai"},
    {"model_name": "command-r7b-12-2024", "model_provider": "cohere"},
    {"model_name": "llama-3.3-70b-instruct", "model_provider": "meta-llama"},
    {"model_name": "claude-3.5-sonnet:beta", "model_provider": "anthropic"},
]
temperatures = [0.0, 0.8]

experiments = []
# Create all possible combinations
for model, temp in product(model_configs, temperatures):
    config = {
        "model_name": model["model_name"],
        "model_provider": model["model_provider"],
        "temperature": temp,
    }
    experiment_name = f"{model['model_provider']}/{model['model_name']}-temp:{temp}"
    experiment = dne.Experiment(name=experiment_name, dataset=dataset, task=app, evaluators=evaluators, config=config)
    experiments.append(experiment)
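
The grid above varies models and temperatures. If prompt variants were added as a third axis, the same `product` pattern would extend naturally. The sketch below is an assumption about how that could look: `PROMPTS`, `build_grid`, and the `prompt_id` config key are hypothetical, and the prompt texts are placeholders.

```python
from itertools import product

# Hypothetical prompt variants; the notebook itself uses one hard-coded prompt.
PROMPTS = {
    "baseline": "Extract meals and workouts as JSON.",
    "strict": "Extract meals and workouts as JSON. Return an error object for unrelated input.",
}


def build_grid(model_configs, temperatures, prompt_ids):
    """Return one config dict per (model, temperature, prompt) combination."""
    return [
        {**model, "temperature": temperature, "prompt_id": prompt_id}
        for model, temperature, prompt_id in product(model_configs, temperatures, prompt_ids)
    ]


grid = build_grid(
    [{"model_name": "gpt-4o-mini", "model_provider": "openai"}],
    [0.0, 0.8],
    list(PROMPTS),
)
print(len(grid))  # 1 model x 2 temperatures x 2 prompts = 4
```

Under this assumption, the task would look up `PROMPTS[config["prompt_id"]]` instead of using the inline prompt string, and the experiment name would include the prompt id to keep runs distinguishable.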

In [None]:
# Each experiment represents a unique combination of parameters from our grid search. The experiments list stores all these variations.
# To inspect the parameters used in the first experiment, we can check its configuration:
experiments[0].config

In [None]:
# Let's run the first experiment
results = experiments[0].run(jobs=50, raise_errors=False)

In [None]:
# Let's run all experiments
results = []
for experiment in experiments:
    result = experiment.run(jobs=30, raise_errors=False)
    results.append(result)

Open the LLM Observability UI in Datadog to see all of your experiments.