🗺️ Datasets and Experiments #2017
Note that if a span does not meet certain criteria (like having embeddings), it might make sense to prevent it from being added to a dataset.
What other criteria can we think of?
As a user, I want to be able to correct an eval if I deem it to be wrong.
Trials can likely be done by simply repeating an experiment and adding the right constraint to perform one generation per example. This will create a more ideal UX and troubleshooting flow.
As a user, I'd like to have the notion of a dataset of records over which I can run an application or a set of evals. Common dataset purposes are covered under Use-cases below.
Motivation
LLM outputs are non-deterministic, so teams need a proper way to evaluate the system. With datasets, teams can select a "test suite" of data points against which they can evaluate changes. This lets them maintain trust in their application when they make modifications.
Use-cases
Datasets will contain data from various data sources:
Pre-deployment
Post-deployment
Architecture
Dataset
A dataset maintains a set of records. These records are versioned: if a record is added, edited, or deleted, the change is tracked as a new version. Versions must be immutable so that code depending on a given version always sees the same data.
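The immutability requirement above can be sketched with a minimal in-memory model (the names `Dataset`, `Example`, and `examples_at` are hypothetical illustrations, not the actual Phoenix API):

```python
from dataclasses import dataclass

# Sketch of dataset versioning: every add/edit/delete commits a new
# immutable version, and previously committed versions stay readable.

@dataclass(frozen=True)
class Example:
    example_id: str
    data: dict

class Dataset:
    def __init__(self) -> None:
        self._versions: list[dict[str, Example]] = [{}]  # version 0 is empty

    @property
    def latest_version(self) -> int:
        return len(self._versions) - 1

    def _commit(self, examples: dict[str, Example]) -> int:
        self._versions.append(examples)
        return self.latest_version

    def add(self, example: Example) -> int:
        # Copy-on-write: mutate a copy, never a committed snapshot.
        examples = dict(self._versions[-1])
        examples[example.example_id] = example
        return self._commit(examples)

    def delete(self, example_id: str) -> int:
        examples = dict(self._versions[-1])
        examples.pop(example_id, None)
        return self._commit(examples)

    def examples_at(self, version: int) -> dict[str, Example]:
        # Returns a snapshot of the dataset as of the given version.
        return dict(self._versions[version])
```

A real implementation would persist revisions in a table (see `dataset_example_revisions` under Planning) rather than holding full snapshots in memory, but the contract is the same: an experiment pinned to version N always resolves to the same examples.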
Dataset Examples
A dataset is a set of examples. These examples contain:
In addition to the above, a dataset record should optionally have:
Dataset Experiment
A dataset experiment is run using the examples of a dataset. Experiments are tied to a specific dataset version and span a duration of time. During an experiment, certain components of an LLM application are modified. This includes:
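Whichever components vary, the core loop is the same: pin a dataset version, run the task once per example, and record one run per example for later evals. A rough sketch, with hypothetical names (`run_experiment`, `task`) rather than the actual SDK:

```python
from typing import Any, Callable

def run_experiment(
    examples: list[dict[str, Any]],          # examples from a pinned dataset version
    task: Callable[[Any], Any],              # the LLM app component under test
    dataset_version: int,
) -> dict[str, Any]:
    runs = []
    for example in examples:
        try:
            output = task(example["input"])
            error = None
        except Exception as exc:  # record failures instead of aborting the experiment
            output, error = None, str(exc)
        runs.append({"example_id": example["id"], "output": output, "error": error})
    # Pinning the version keeps the experiment reproducible even if the
    # dataset is edited afterwards.
    return {"dataset_version": dataset_version, "runs": runs}
```

Evals would then attach to the recorded runs (the "results for evals" endpoint under Planning), so a bad eval can be corrected without re-running the task.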
Planning
Infra
- `arize-phoenix-client` package #2914

Tables
- `dataset_example_revisions` table #3241

Rest API
- `GET datasets/:id/examples?version-id` #3374
- `POST datasets/:id/experiments/:id/results` (batch results) #3387
- `POST datasets/:id/experiments/:id/runs/:id` (single result) #3388
- `PATCH datasets/:id/experiments/:id/` (patch/finish) #3389
- `GET datasets/:id/experiments/:id/results` (get results for evals) #3390
- `POST datasets/:id/experiments/:id/evaluations` (batch evals) #3391
- `POST dataset/:id/experiments/:id/runs/:id` (single eval) #3392

GraphQL

Experiments SDK
- `download_dataset_examples` client method #3763
- `append_dataset` client method #3764

OpenInference

UI

Tests

Bugs
- `data` payload #3363
- `None` should not be allowed for JSONB fields that are not nullable #3553

Documentation

Punt
- `append_dataset` #3762