🗺️ Datasets and Experiments #2017
Note that if a span does not meet certain criteria (like having embeddings), it might make sense to prevent it from being added to a dataset.
What other criteria can we think of?
As a user, I want to be able to correct an eval if I deem it to be wrong.
Trials can likely be done by simply repeating an experiment and adding the right constraint to perform one generation per example. This will create a more ideal UX and troubleshooting flow.
As a user, I'd like to have the notion of a dataset of records over which I can run an application or a set of evals. Common dataset purposes are covered under Use-cases below.
Motivation
LLM outputs are non-deterministic, so teams need a proper way to evaluate the system. With datasets, teams can select a "test suite" of data points against which they can evaluate changes. This lets them maintain trust in their application when they make modifications.
Use-cases
Datasets will contain data from various data sources:
Pre-deployment
Post-deployment
Architecture
Dataset
A dataset maintains a set of records. These records are versioned: if a record is added, edited, or deleted, the change is tracked as a new version. Versions must be immutable so that code depending on a given version always sees the same data.
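The immutability requirement above can be sketched with a minimal in-memory model (the names `Dataset`, `Example`, and `examples_at` are hypothetical illustrations, not the actual Phoenix API):

```python
from dataclasses import dataclass

# Sketch of dataset versioning: every add/edit/delete commits a new
# immutable version, and previously committed versions stay readable.

@dataclass(frozen=True)
class Example:
    example_id: str
    data: dict

class Dataset:
    def __init__(self) -> None:
        self._versions: list[dict[str, Example]] = [{}]  # version 0 is empty

    @property
    def latest_version(self) -> int:
        return len(self._versions) - 1

    def _commit(self, examples: dict[str, Example]) -> int:
        self._versions.append(examples)
        return self.latest_version

    def add(self, example: Example) -> int:
        # Copy-on-write: mutate a copy, never a committed snapshot.
        examples = dict(self._versions[-1])
        examples[example.example_id] = example
        return self._commit(examples)

    def delete(self, example_id: str) -> int:
        examples = dict(self._versions[-1])
        examples.pop(example_id, None)
        return self._commit(examples)

    def examples_at(self, version: int) -> dict[str, Example]:
        # Returns a snapshot of the dataset as of the given version.
        return dict(self._versions[version])
```

A real implementation would persist revisions in a table (see `dataset_example_revisions` under Planning) rather than holding full snapshots in memory, but the contract is the same: an experiment pinned to version N always resolves to the same examples.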
Dataset Examples
A dataset is a set of examples. These examples contain:
In addition to the above, a dataset record should optionally have:
Dataset Experiment
A dataset experiment is run using the examples of a dataset. Experiments are tied to a specific dataset version and span a duration of time. During an experiment, certain components of an LLM application are modified. This includes:
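Whichever components vary, the core loop is the same: pin a dataset version, run the task once per example, and record one run per example for later evals. A rough sketch, with hypothetical names (`run_experiment`, `task`) rather than the actual SDK:

```python
from typing import Any, Callable

def run_experiment(
    examples: list[dict[str, Any]],          # examples from a pinned dataset version
    task: Callable[[Any], Any],              # the LLM app component under test
    dataset_version: int,
) -> dict[str, Any]:
    runs = []
    for example in examples:
        try:
            output = task(example["input"])
            error = None
        except Exception as exc:  # record failures instead of aborting the experiment
            output, error = None, str(exc)
        runs.append({"example_id": example["id"], "output": output, "error": error})
    # Pinning the version keeps the experiment reproducible even if the
    # dataset is edited afterwards.
    return {"dataset_version": dataset_version, "runs": runs}
```

Evals would then attach to the recorded runs (the "results for evals" endpoint under Planning), so a bad eval can be corrected without re-running the task.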
Planning
Infra
- `arize-phoenix-client` package #2914

Tables
- `dataset_example_revisions` table #3241

Rest API
- `GET datasets/:id/examples?version-id` #3374
- `POST datasets/:id/experiments/:id/results` (batch results) #3387
- `POST datasets/:id/experiments/:id/runs/:id` (single result) #3388
- `PATCH datasets/:id/experiments/:id/` (patch/finish) #3389
- `GET datasets/:id/experiments/:id/results` (get results for evals) #3390
- `POST datasets/:id/experiments/:id/evaluations` (batch evals) #3391
- `POST dataset/:id/experiments/:id/runs/:id` (single eval) #3392

GraphQL

Experiments SDK
- `download_dataset_examples` client method #3763
- `append_dataset` client method #3764

OpenInference

UI

Tests

Bugs
- `data` payload #3363
- `None` should not be allowed for JSONB fields that are not nullable #3553

Documentation

Punt
- `append_dataset` #3762