From 6138150b6189e09a5ea25c7165e3739cac0fe328 Mon Sep 17 00:00:00 2001 From: Alex Barksdale Date: Fri, 17 Oct 2025 14:43:25 -0400 Subject: [PATCH 1/4] Update dataset docs to include versioning details --- .../llm_observability/experiments/_index.md | 102 +++++++++++++++--- 1 file changed, 86 insertions(+), 16 deletions(-) diff --git a/content/en/llm_observability/experiments/_index.md b/content/en/llm_observability/experiments/_index.md index 098a6b46eaca7..d2e773fe1eaac 100644 --- a/content/en/llm_observability/experiments/_index.md +++ b/content/en/llm_observability/experiments/_index.md @@ -43,7 +43,7 @@ LLMObs.enable( **Notes**: - You need *both* an API key and an application key - + ## Projects _Projects_ are the core organizational layer for LLM Experiments. All datasets and experiments live in a project. You can create a project manually in the Datadog console, API, and SDK by specifying a project name that does not already exist in `LLMObs.enable`. @@ -152,6 +152,67 @@ dataset.delete(1) # Deletes the second record dataset.push() ``` +### Dataset versioning + +Datasets are automatically versioned to track changes over time. This enables reproducibility and allows experiments to reference specific dataset versions. + +#### When are new versions created? + +Dataset versions start at 0 and increment automatically when: + +- **Adding records**: By default, adding new records creates a new version. 
+- **Updating records**: Changes to `input` or `expected_output` fields create a new version +- **Deleting records**: Deleting records creates a new version + +Dataset versions are **NOT** created when: + +- **Metadata-only updates**: Updating only the `metadata` field on records or the dataset itself +- **Dataset changes**: Updating the dataset name or description + +#### Version numbering + +- Datasets start at version 0 when created +- Each versioning operation increments the version by 1 +- The `current_version` field in the Dataset object shows the latest version + +#### Old versions (retention / TTL) + +- **What’s kept:** Only *previous* versions are subject to TTL. The `current_version` is not affected. +- **How long:** Each previous version has a **90-day retention window**. +- **Auto-refresh:** **Any use of a previous version resets its 90-day clock** (e.g., experiments that read that version). +- **Expiration:** If a previous version isn’t used for 90 consecutive days, it becomes eligible for permanent deletion and may no longer be accessible. + +**Example** + +- You publish `v12`. Now `v11` becomes a previous version with a 90-day window. +- On day 25, you run an experiment on `v11`. The 90-day window **restarts** from that day. +- If `v11` isn’t used again for 90 days after that, it may be deleted. 
+ +#### Versioning workflow + +**Example: Standard versioning workflow** + +```python +# Create dataset (starts at version 0) +dataset = LLMObs.create_dataset( + dataset_name="my-dataset", + records=[ + {"input_data": "test1", "expected_output": "output1"} + ] +) +print(f"Current version: {dataset.current_version}") # 0 + +# Add more records (creates version 1) +dataset.append({"input_data": "test2", "expected_output": "output2"}) +dataset.push() # Default: create_new_version=True +print(f"Current version: {dataset.current_version}") # 1 + +# Update a record (creates version 2) +dataset.update(0, {"input_data": "test1-updated", "expected_output": "output1"}) +dataset.push() +print(f"Current version: {dataset.current_version}") # 2 +``` + ### Accessing dataset records You can access dataset records using standard Python indexing: @@ -230,7 +291,7 @@ Summary Evaluators are optionally defined functions that measure how well the mo dataset = LLMObs.pull_dataset("capitals-of-the-world") ``` -2. Define a task function that processes a single dataset record +2. Define a task function that processes a single dataset record ```python def task(input_data: Dict[str, Any], config: Optional[Dict[str, Any]] = None) -> str: @@ -238,15 +299,15 @@ Summary Evaluators are optionally defined functions that measure how well the mo # Your LLM or processing logic here return "Beijing" if "China" in question else "Unknown" ``` - A task can take any non-null type as `input_data` (string, number, Boolean, object, array). The output that will be used in the Evaluators can be of any type. + A task can take any non-null type as `input_data` (string, number, Boolean, object, array). The output that will be used in the Evaluators can be of any type. This example generates a string, but a dict can be generated as output to store any intermediary information and compare in the Evaluators. - - You can trace the different parts of your Experiment task (workflow, tool calls, etc.) 
using the [same tracing decorators][12] you use in production. + + You can trace the different parts of your Experiment task (workflow, tool calls, etc.) using the [same tracing decorators][12] you use in production. If you use a [supported framework][13] (OpenAI, Amazon Bedrock, etc.), LLM Observability automatically traces and annotates calls to LLM frameworks and libraries, giving you out-of-the-box observability for calls that your LLM application makes. -4. Define evaluator functions. - +4. Define evaluator functions. + ```python def exact_match(input_data: Dict[str, Any], output_data: str, expected_output: str) -> bool: return output_data == expected_output @@ -263,19 +324,19 @@ Summary Evaluators are optionally defined functions that measure how well the mo def fake_llm_as_a_judge(input_data: Dict[str, Any], output_data: str, expected_output: str) -> str: fake_llm_call = "excellent" return fake_llm_call - ``` - Evaluator functions can take any non-null type as `input_data` (string, number, Boolean, object, array); `output_data` and `expected_output` can be any type. - Evaluators can only return a string, a number, or a Boolean. + ``` + Evaluator functions can take any non-null type as `input_data` (string, number, Boolean, object, array); `output_data` and `expected_output` can be any type. + Evaluators can only return a string, a number, or a Boolean. + +5. (Optional) Define summary evaluator function(s). -5. (Optional) Define summary evaluator function(s). - ```python def num_exact_matches(inputs, outputs, expected_outputs, evaluators_results): return evaluators_results["exact_match"].count(True) - ``` + ``` If defined and provided to the experiment, summary evaluator functions are executed after evaluators have finished running. Summary evaluator functions can take a list of any non-null type as `inputs` (string, number, Boolean, object, array); `outputs` and `expected_outputs` can be lists of any type. 
`evaluators_results` is a dictionary of list of results from evaluators, keyed by the name of the evaluator function. For example, in the above code snippet the summary evaluator `num_exact_matches` uses the results (a list of Booleans) from the `exact_match` evaluator to provide a count of number of exact matches. - Summary evaluators can only return a string, a number, or a Boolean. + Summary evaluators can only return a string, a number, or a Boolean. 6. Create and run the experiment. ```python @@ -564,6 +625,7 @@ List all datasets, sorted by creation date. The most recently-created datasets a | `name` | string | Unique dataset name. | | `description` | string | Dataset description. | | `metadata` | json | Arbitrary key-value metadata associated with the dataset. | +| `current_version` | int | The current version number of the dataset. Versions start at 0 and increment when records are added or modified. | | `created_at` | timestamp | Timestamp representing when the resource was created. | | `updated_at` | timestamp | Timestamp representing when the resource was last updated. | @@ -589,6 +651,7 @@ Create a dataset. If there is an existing dataset with the same name, the API re | `name` | string | Unique dataset name. | | `description` | string | Dataset description. | | `metadata` | json | Arbitrary key-value metadata associated with the dataset. | +| `current_version` | int | The current version number of the dataset. Starts at 0 for new datasets. | | `created_at` | timestamp | Timestamp representing when the resource was created. | | `updated_at` | timestamp | Timestamp representing when the resource was last updated. | @@ -602,10 +665,15 @@ List all dataset records, sorted by creation date. The most recently-created rec | Parameter | Type | Description | | ---- | ---- | --- | -| `filter[version]` | string | List results for a given dataset version. | +| `filter[version]` | int | List results for a given dataset version. 
If not specified, defaults to the dataset's current version. Version numbers start at 0. | | `page[cursor]` | string | List results with a cursor provided in the previous query. | | `page[limit]` | int | Limits the number of results. | +**Notes**: +- Without `filter[version]`, you get records from the **current version only**, not all versions. +- To retrieve records from a specific historical version, use `filter[version]=N` where N is the version number. +- Version numbers start at 0 when a dataset is created. + **Response** | Field | Type | Description | @@ -634,7 +702,8 @@ Appends records for a given dataset. | Field | Type | Description | | ---- | ---- | --- | -| `deduplicate` | bool | If `true`, deduplicates appended records. | +| `deduplicate` | bool | If `true`, deduplicates appended records. Defaults to `true`. | +| `create_new_version` | bool | If `true`, creates a new dataset version. If `false`, adds records to the current version (draft mode). Defaults to `true`. | | `records` (_required_) | [][RecordReq](#object-recordreq) | List of records to create. | #### Object: RecordReq @@ -673,6 +742,7 @@ Partially update a dataset object. Specify the fields to update in the payload. | `name` | string | Unique dataset name. | | `description` | string | Dataset description. | | `metadata` | json | Arbitrary key-value metadata associated with the dataset. | +| `current_version` | int | The current version number of the dataset. Metadata-only updates do not increment the version. | | `created_at` | timestamp | Timestamp representing when the resource was created. | | `updated_at` | timestamp | Timestamp representing when the resource was last updated. 
| From c6addc09c78a97243a7f40b91a0a8398d8140b48 Mon Sep 17 00:00:00 2001 From: Alex Barksdale Date: Fri, 17 Oct 2025 15:49:15 -0400 Subject: [PATCH 2/4] Update _index.md --- .../llm_observability/experiments/_index.md | 25 ------------------- 1 file changed, 25 deletions(-) diff --git a/content/en/llm_observability/experiments/_index.md b/content/en/llm_observability/experiments/_index.md index d2e773fe1eaac..1123c7fdd43f3 100644 --- a/content/en/llm_observability/experiments/_index.md +++ b/content/en/llm_observability/experiments/_index.md @@ -188,31 +188,6 @@ Dataset versions are **NOT** created when: - On day 25, you run an experiment on `v11`. The 90-day window **restarts** from that day. - If `v11` isn’t used again for 90 days after that, it may be deleted. -#### Versioning workflow - -**Example: Standard versioning workflow** - -```python -# Create dataset (starts at version 0) -dataset = LLMObs.create_dataset( - dataset_name="my-dataset", - records=[ - {"input_data": "test1", "expected_output": "output1"} - ] -) -print(f"Current version: {dataset.current_version}") # 0 - -# Add more records (creates version 1) -dataset.append({"input_data": "test2", "expected_output": "output2"}) -dataset.push() # Default: create_new_version=True -print(f"Current version: {dataset.current_version}") # 1 - -# Update a record (creates version 2) -dataset.update(0, {"input_data": "test1-updated", "expected_output": "output1"}) -dataset.push() -print(f"Current version: {dataset.current_version}") # 2 -``` - ### Accessing dataset records You can access dataset records using standard Python indexing: From 6ec693e2074f3fff5a6272d5948a246668896a9c Mon Sep 17 00:00:00 2001 From: Alex Barksdale Date: Fri, 17 Oct 2025 15:50:05 -0400 Subject: [PATCH 3/4] Update _index.md --- content/en/llm_observability/experiments/_index.md | 1 - 1 file changed, 1 deletion(-) diff --git a/content/en/llm_observability/experiments/_index.md b/content/en/llm_observability/experiments/_index.md index 
1123c7fdd43f3..4b4a886a387ec 100644 --- a/content/en/llm_observability/experiments/_index.md +++ b/content/en/llm_observability/experiments/_index.md @@ -678,7 +678,6 @@ Appends records for a given dataset. | Field | Type | Description | | ---- | ---- | --- | | `deduplicate` | bool | If `true`, deduplicates appended records. Defaults to `true`. | -| `create_new_version` | bool | If `true`, creates a new dataset version. If `false`, adds records to the current version (draft mode). Defaults to `true`. | | `records` (_required_) | [][RecordReq](#object-recordreq) | List of records to create. | #### Object: RecordReq From 5a5a5ec70471ab585ad6cdf16d3357ddb3f2d8bd Mon Sep 17 00:00:00 2001 From: cecilia saixue watt Date: Wed, 22 Oct 2025 16:49:34 -0400 Subject: [PATCH 4/4] versioning edits --- .../llm_observability/experiments/_index.md | 39 +++++++------------ 1 file changed, 15 insertions(+), 24 deletions(-) diff --git a/content/en/llm_observability/experiments/_index.md b/content/en/llm_observability/experiments/_index.md index 4b4a886a387ec..b4200d94c9fdb 100644 --- a/content/en/llm_observability/experiments/_index.md +++ b/content/en/llm_observability/experiments/_index.md @@ -154,39 +154,30 @@ dataset.push() ### Dataset versioning -Datasets are automatically versioned to track changes over time. This enables reproducibility and allows experiments to reference specific dataset versions. +Datasets are automatically versioned to track changes over time. Versioning information enables reproducibility and allows experiments to reference specific dataset versions. -#### When are new versions created? +The `Dataset` object has a field, `current_version`, which corresponds to the latest version; previous versions are subject to a 90-day retention window. -Dataset versions start at 0 and increment automatically when: +Dataset versions start at `0`, and each new version increments the version by 1. -- **Adding records**: By default, adding new records creates a new version. 
-- **Updating records**: Changes to `input` or `expected_output` fields create a new version -- **Deleting records**: Deleting records creates a new version +#### When new dataset versions are created -Dataset versions are **NOT** created when: +A new dataset version is created when: +- Adding records +- Updating records (changes to `input` or `expected_output` fields) +- Deleting records -- **Metadata-only updates**: Updating only the `metadata` field on records or the dataset itself -- **Dataset changes**: Updating the dataset name or description +Dataset versions are **NOT** created for changes to `metadata` fields, or when updating the dataset name or description. -#### Version numbering +#### Version retention -- Datasets start at version 0 when created -- Each versioning operation increments the version by 1 -- The `current_version` field in the Dataset object shows the latest version +- Previous versions (but **NOT** the current version) are retained for 90 days. +- The 90-day retention period resets whenever a previous version is used (for example, when an experiment reads that version). +- After 90 consecutive days without use, a previous version is eligible for permanent deletion and may no longer be accessible. -#### Old versions (retention / TTL) +**Example of version retention behavior** -- **What’s kept:** Only *previous* versions are subject to TTL. The `current_version` is not affected. -- **How long:** Each previous version has a **90-day retention window**. -- **Auto-refresh:** **Any use of a previous version resets its 90-day clock** (e.g., experiments that read that version). -- **Expiration:** If a previous version isn’t used for 90 consecutive days, it becomes eligible for permanent deletion and may no longer be accessible. - -**Example** - -- You publish `v12`. Now `v11` becomes a previous version with a 90-day window. -- On day 25, you run an experiment on `v11`. The 90-day window **restarts** from that day. 
-- If `v11` isn’t used again for 90 days after that, it may be deleted. +After you publish version `12`, version `11` becomes a previous version with a 90-day retention window. After 25 days, you run an experiment with version `11`, which causes the 90-day window to **restart**. If version `11` then goes unused for another 90 days, it may be deleted. ### Accessing dataset records