From 6bfd5932d5e0e2e546df7087bc38c8c160e0e5f7 Mon Sep 17 00:00:00 2001
From: gary-huang
Date: Wed, 1 Oct 2025 14:59:35 -0400
Subject: [PATCH 1/5] add docs on summary evals

---
 .../llm_observability/experiments/_index.md   | 20 ++++++++++++++++---
 1 file changed, 17 insertions(+), 3 deletions(-)

diff --git a/content/en/llm_observability/experiments/_index.md b/content/en/llm_observability/experiments/_index.md
index f873165affdd2..75d0cf97ca335 100644
--- a/content/en/llm_observability/experiments/_index.md
+++ b/content/en/llm_observability/experiments/_index.md
@@ -23,7 +23,7 @@ LLM Observability [Experiments][9] supports the entire lifecycle of building LLM
 Install Datadog's LLM Observability Python SDK:
 
 ```shell
-pip install ddtrace>=3.14.0
+pip install ddtrace>=3.15.0
 ```
 
 ### Cookbooks
@@ -221,6 +221,9 @@ Evaluators are functions that measure how well the model or agent performs by co
 - score: returns a numeric value (float)
 - categorical: returns a labeled category (string)
 
+### Summary Evaluators
+Summary Evaluators are optionally defined functions that measure how well the model or agent performs by computing an aggregated score over the entire dataset's inputs, the task's outputs, and the per-record evaluation results. The supported evaluator types (score and categorical) are the same as above.
+
 ### Creating an experiment
 
 1. Load a dataset
@@ -266,7 +269,17 @@ Evaluators are functions that measure how well the model or agent performs by co
         return fake_llm_call
     ```
    Evaluator functions can take any non-null type as `input_data` (string, number, Boolean, object, array); `output_data` and `expected_output` can be any type.
-   Evaluators can only return a string, number, Boolean.
+   Evaluators can only return a string, a number, or a boolean.
+
+5. (Optional) Define summary evaluator function(s).
+
+   ```python
+   def num_exact_matches(inputs, outputs, expected_outputs, evaluators_results):
+       return evaluators_results["exact_match"].count(True)
+
+   ```
+   If defined and provided to the experiment, summary evaluator functions are executed after evaluators have finished running. Summary evaluator functions can take a list of any non-null type as `inputs` (string, number, Boolean, object, array); `outputs` and `expected_outputs` can be lists of any type. `evaluators_results` is a dictionary of lists of results from evaluators, keyed by the name of the evaluator function. For example, in the above code snippet, the summary evaluator `num_exact_matches` uses the results (a list of booleans) from the `exact_match` evaluator to provide a count of the number of exact matches.
+   Summary evaluators can only return a string, a number, or a boolean.
 
 6. Create and run the experiment.
    ```python
@@ -275,6 +288,7 @@ Evaluators are functions that measure how well the model or agent performs by co
         task=task,
         dataset=dataset,
         evaluators=[exact_match, overlap, fake_llm_as_a_judge],
+        summary_evaluators=[num_exact_matches], # optional
         description="Testing capital cities knowledge",
         config={
             "model_name": "gpt-4",
@@ -286,7 +300,7 @@ Evaluators are functions that measure how well the model or agent performs by co
     results = experiment.run()  # Run on all dataset records
 
     # Process results
-    for result in results:
+    for result in results.get("rows", []):
         print(f"Record {result['idx']}")
         print(f"Input: {result['input']}")
         print(f"Output: {result['output']}")

From 21cabb636ae2b8c28932572748758dfe48d59907 Mon Sep 17 00:00:00 2001
From: Charles Jacquet
Date: Fri, 3 Oct 2025 09:47:07 -0400
Subject: [PATCH 2/5] Update _index.md

---
 content/en/llm_observability/experiments/_index.md | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/content/en/llm_observability/experiments/_index.md b/content/en/llm_observability/experiments/_index.md
index 75d0cf97ca335..1b5aac6e3e0ab 100644
--- a/content/en/llm_observability/experiments/_index.md
+++ b/content/en/llm_observability/experiments/_index.md
@@ -26,10 +26,6 @@ Install Datadog's LLM Observability Python SDK:
 pip install ddtrace>=3.15.0
 ```
 
-### Cookbooks
-
-To see in-depth examples of what you can do with LLM Experiments, you can check these [jupyter notebooks][10]
-
 ### Setup
 
 Enable LLM Observability:
@@ -366,6 +362,10 @@ jobs:
           DD_APP_KEY: ${{ secrets.DD_APP_KEY }}
 ```
 
+## Cookbooks
+
+To see in-depth examples of what you can do with LLM Experiments, check out these [Jupyter notebooks][10].
+
 ## HTTP API
 
 ### Postman quickstart

From aec7493aa4de7ea5c317b19a127c61f95f6928d8 Mon Sep 17 00:00:00 2001
From: Olivia Shoup <116908616+OliviaShoup@users.noreply.github.com>
Date: Fri, 3 Oct 2025 10:52:34 -0500
Subject: [PATCH 3/5] Update content/en/llm_observability/experiments/_index.md

---
 content/en/llm_observability/experiments/_index.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/content/en/llm_observability/experiments/_index.md b/content/en/llm_observability/experiments/_index.md
index 1b5aac6e3e0ab..b5962a7f16458 100644
--- a/content/en/llm_observability/experiments/_index.md
+++ b/content/en/llm_observability/experiments/_index.md
@@ -265,7 +265,7 @@ Summary Evaluators are optionally defined functions that measure how well the mo
         return fake_llm_call
     ```
    Evaluator functions can take any non-null type as `input_data` (string, number, Boolean, object, array); `output_data` and `expected_output` can be any type.
-   Evaluators can only return a string, a number, or a boolean.
+   Evaluators can only return a string, a number, or a Boolean.
 
 5. (Optional) Define summary evaluator function(s).
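A minimal, self-contained sketch of the summary evaluator contract documented in PATCH 1/5 above: the `exact_match` and `num_exact_matches` signatures mirror the documented snippets, while the sample records and the hand-built `evaluators_results` dictionary are illustrative assumptions; in a real experiment, the SDK runs the task and the evaluators and assembles these lists before calling each summary evaluator.

```python
def exact_match(input_data, output_data, expected_output):
    # Per-record evaluator: returns one Boolean per dataset record.
    return output_data == expected_output

def num_exact_matches(inputs, outputs, expected_outputs, evaluators_results):
    # Summary evaluator: runs once after all evaluators finish and returns a
    # single aggregate value. `evaluators_results` maps each evaluator's
    # function name to the list of its per-record results.
    return evaluators_results["exact_match"].count(True)

# Hypothetical records standing in for dataset inputs, task outputs, and
# expected outputs (in a real run these come from the dataset and the task).
inputs = ["capital of France?", "capital of Japan?"]
outputs = ["Paris", "Kyoto"]
expected_outputs = ["Paris", "Tokyo"]

# Hand-built here for illustration only; the SDK performs this collection.
evaluators_results = {
    "exact_match": [
        exact_match(i, o, e)
        for i, o, e in zip(inputs, outputs, expected_outputs)
    ]
}

print(num_exact_matches(inputs, outputs, expected_outputs, evaluators_results))  # -> 1
```

Because `evaluators_results` is keyed by evaluator function name, renaming an evaluator also changes the key its results appear under.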
From 4cc7d7b583b61d031eb8bca68aedd5be94af4257 Mon Sep 17 00:00:00 2001
From: Olivia Shoup <116908616+OliviaShoup@users.noreply.github.com>
Date: Fri, 3 Oct 2025 10:53:16 -0500
Subject: [PATCH 4/5] Update content/en/llm_observability/experiments/_index.md

---
 content/en/llm_observability/experiments/_index.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/content/en/llm_observability/experiments/_index.md b/content/en/llm_observability/experiments/_index.md
index b5962a7f16458..4b6f93b5275b1 100644
--- a/content/en/llm_observability/experiments/_index.md
+++ b/content/en/llm_observability/experiments/_index.md
@@ -274,7 +274,7 @@ Summary Evaluators are optionally defined functions that measure how well the mo
        return evaluators_results["exact_match"].count(True)
 
    ```
-   If defined and provided to the experiment, summary evaluator functions are executed after evaluators have finished running. Summary evaluator functions can take a list of any non-null type as `inputs` (string, number, Boolean, object, array); `outputs` and `expected_outputs` can be lists of any type. `evaluators_results` is a dictionary of lists of results from evaluators, keyed by the name of the evaluator function. For example, in the above code snippet, the summary evaluator `num_exact_matches` uses the results (a list of booleans) from the `exact_match` evaluator to provide a count of the number of exact matches.
+   If defined and provided to the experiment, summary evaluator functions are executed after evaluators have finished running. Summary evaluator functions can take a list of any non-null type as `inputs` (string, number, Boolean, object, array); `outputs` and `expected_outputs` can be lists of any type. `evaluators_results` is a dictionary of lists of results from evaluators, keyed by the name of the evaluator function. For example, in the above code snippet, the summary evaluator `num_exact_matches` uses the results (a list of Booleans) from the `exact_match` evaluator to provide a count of the number of exact matches.
    Summary evaluators can only return a string, a number, or a boolean.
 
 6. Create and run the experiment.

From fa635a731ad45b3272389373c3e14b7e6c6fe8f0 Mon Sep 17 00:00:00 2001
From: Olivia Shoup <116908616+OliviaShoup@users.noreply.github.com>
Date: Fri, 3 Oct 2025 10:53:36 -0500
Subject: [PATCH 5/5] Update content/en/llm_observability/experiments/_index.md

---
 content/en/llm_observability/experiments/_index.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/content/en/llm_observability/experiments/_index.md b/content/en/llm_observability/experiments/_index.md
index 4b6f93b5275b1..098a6b46eaca7 100644
--- a/content/en/llm_observability/experiments/_index.md
+++ b/content/en/llm_observability/experiments/_index.md
@@ -275,7 +275,7 @@ Summary Evaluators are optionally defined functions that measure how well the mo
    If defined and provided to the experiment, summary evaluator functions are executed after evaluators have finished running. Summary evaluator functions can take a list of any non-null type as `inputs` (string, number, Boolean, object, array); `outputs` and `expected_outputs` can be lists of any type. `evaluators_results` is a dictionary of lists of results from evaluators, keyed by the name of the evaluator function. For example, in the above code snippet, the summary evaluator `num_exact_matches` uses the results (a list of Booleans) from the `exact_match` evaluator to provide a count of the number of exact matches.
- Summary evaluators can only return a string, a number, or a boolean. + Summary evaluators can only return a string, a number, or a Boolean. 6. Create and run the experiment. ```python