28 changes: 21 additions & 7 deletions content/en/llm_observability/experiments/_index.md
@@ -23,13 +23,9 @@
Install Datadog's LLM Observability Python SDK:

```shell
pip install ddtrace>=3.14.0
pip install ddtrace>=3.15.0
```

### Cookbooks

To see in-depth examples of what you can do with LLM Experiments, you can check these [jupyter notebooks][10]

### Setup

Enable LLM Observability:
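A minimal sketch of enabling the SDK in agentless mode (assuming the `LLMObs.enable()` entry point; the app name, API key, and site below are placeholders):

```python
from ddtrace.llmobs import LLMObs

# Placeholders: replace with your own ML app name, API key, and Datadog site.
LLMObs.enable(
    ml_app="my-ml-app",
    api_key="<YOUR_DD_API_KEY>",
    site="datadoghq.com",
    agentless_enabled=True,
)
```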
@@ -221,6 +217,9 @@
- score: returns a numeric value (float)
- categorical: returns a labeled category (string)

### Summary evaluators

Summary evaluators are optional, user-defined functions that measure how well the model or agent performs by providing an aggregated score across the entire dataset, outputs, and evaluation results. The supported evaluator types are the same as above.

### Creating an experiment

1. Load a dataset
@@ -266,7 +265,17 @@
return fake_llm_call
```
Evaluator functions can take any non-null type as `input_data` (string, number, Boolean, object, array); `output_data` and `expected_output` can be any type.
Evaluators can only return a string, number, Boolean.
Evaluators can only return a string, a number, or a Boolean.
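For example, a minimal Boolean evaluator that follows the signature described above (the comparison logic is an assumption):

```python
def exact_match(input_data, output_data, expected_output):
    # Boolean evaluator: True when the task output equals the expected output.
    return output_data == expected_output
```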

5. (Optional) Define summary evaluator function(s).

```python
def num_exact_matches(inputs, outputs, expected_outputs, evaluators_results):
return evaluators_results["exact_match"].count(True)

```
If defined and provided to the experiment, summary evaluator functions run after all evaluators have finished. Summary evaluator functions can take a list of any non-null type as `inputs` (string, number, Boolean, object, array); `outputs` and `expected_outputs` can be lists of any type. `evaluators_results` is a dictionary of lists of evaluator results, keyed by evaluator function name. For example, in the code snippet above, the summary evaluator `num_exact_matches` uses the results from the `exact_match` evaluator (a list of Booleans) to count the number of exact matches.

Summary evaluators can only return a string, a number, or a Boolean.
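As another sketch, a summary evaluator that returns a score (assuming the `overlap` evaluator used later in this example returns numeric values):

```python
def average_overlap(inputs, outputs, expected_outputs, evaluators_results):
    # Summary evaluator returning a float: the mean of the overlap evaluator's scores.
    scores = evaluators_results["overlap"]
    return sum(scores) / len(scores) if scores else 0.0
```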

6. Create and run the experiment.
```python
@@ -275,6 +284,7 @@
task=task,
dataset=dataset,
evaluators=[exact_match, overlap, fake_llm_as_a_judge],
summary_evaluators=[num_exact_matches], # optional
description="Testing capital cities knowledge",
config={
"model_name": "gpt-4",
@@ -286,7 +296,7 @@
results = experiment.run() # Run on all dataset records

# Process results
for result in results:
for result in results.get("rows", []):
print(f"Record {result['idx']}")
print(f"Input: {result['input']}")
print(f"Output: {result['output']}")
@@ -352,6 +362,10 @@
DD_APP_KEY: ${{ secrets.DD_APP_KEY }}
```

## Cookbooks

To see in-depth examples of what you can do with LLM Experiments, check these [Jupyter notebooks][10].

## HTTP API

### Postman quickstart