Description
When setting up evals for an existing system, the reality is that the "AI pipeline" is often not "pure": it depends on many external resources along the way, which makes it hard to simply extract it and run it as part of an experiment.
As such, it is common to have an external tool or script that triggers an endpoint to start the AI pipeline; generally, this is the same endpoint that real users hit in the product. The script can then be pointed at a local development environment, or even production, to generate logs.
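A minimal sketch of such a trigger script, assuming a hypothetical `/run-pipeline` endpoint and a JSON file of inputs; the URL, payload shape, environment variables, and auth header are illustrative only, not part of any real API:

```python
# Sketch of an external script that triggers the product's own endpoint
# for each eval input and collects the responses (and hence the logs).
# PIPELINE_BASE_URL / API_TOKEN / /run-pipeline are assumed names.
import json
import os

import requests

BASE_URL = os.environ.get("PIPELINE_BASE_URL", "http://localhost:8000")


def trigger_pipeline(inputs: list[dict]) -> list[dict]:
    """POST each input to the pipeline endpoint and collect the outputs."""
    results = []
    for item in inputs:
        resp = requests.post(
            f"{BASE_URL}/run-pipeline",
            json=item,
            headers={"Authorization": f"Bearer {os.environ.get('API_TOKEN', '')}"},
            timeout=60,
        )
        resp.raise_for_status()
        results.append(resp.json())
    return results


if __name__ == "__main__":
    with open("eval_inputs.json") as f:
        inputs = json.load(f)
    outputs = trigger_pipeline(inputs)
    print(f"Collected {len(outputs)} outputs")
```

Pointing `PIPELINE_BASE_URL` at local development versus production is then just a configuration change, with no modification to the pipeline itself.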
However, this means the same set of inputs may be run against the same versioned function multiple times, and it would be helpful to be able to annotate and compare these runs separately (i.e. analogous to an A/A test).
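One way this could look, as a rough sketch: tag each pass over the inputs with its own run label so that otherwise identical runs of the same versioned function can be annotated and compared side by side. The `run_fn` callable and the metadata fields here are assumptions for illustration, not an existing interface.

```python
# Sketch: run the same inputs N times against the same versioned function,
# labeling each pass with a distinct run_id so the logs can be compared
# as an A/A test. Field names are illustrative only.
import uuid
from datetime import datetime, timezone


def run_repeated(inputs, run_fn, n_runs=2):
    """Execute run_fn over the same inputs n_runs times, labeling each pass."""
    all_results = []
    for i in range(n_runs):
        run_id = f"aa-run-{i}-{uuid.uuid4().hex[:8]}"
        started_at = datetime.now(timezone.utc).isoformat()
        for item in inputs:
            output = run_fn(item)
            all_results.append({
                "run_id": run_id,          # distinguishes otherwise identical passes
                "started_at": started_at,
                "input": item,
                "output": output,
            })
    return all_results
```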