# Evals Ergonomics

We aim to support evals for three different cases:

1. Running evals on an exported pandas dataframe and importing back into Phoenix,
2. Running evals during LLM application execution via our callback system,
3. Running evals post-hoc.

Note: This notebook uses `phoenix.evals` everywhere for convenience, but we'll continue to keep `evals` in `experimental` for the time being.

## 1. Running evals on an exported pandas dataframe and importing back into Phoenix

The user is actively experimenting with their trace data in the form of a pandas dataframe. They wish to compute some evals, perhaps using our custom evaluators or perhaps using their own bespoke code, upload their evaluations, and see their evaluations reflected in Phoenix.

In [None]:
import phoenix as px
from phoenix.evals import ClassificationPromptTemplate, LLMClassifier, OpenAIModel
from phoenix.experimental.callbacks.langchain_tracer import OpenInferenceTracer

session = px.launch_app()
tracer = OpenInferenceTracer()
# Run your LLM application...

trace_df = session.export_dataframe("span_kind == 'retriever'")
# massage the data...

# compute relevance with a new prompt template
model = OpenAIModel(model_name="gpt-4")
prompt_template = ClassificationPromptTemplate(
    template="some prompt template the user is testing out with {context} and {question}",
    classes=["relevant", "irrelevant"],
)
clf = LLMClassifier(model=model, prompt_template=prompt_template)
trace_df["relevant"] = clf.predict_dataframe(trace_df)

# import back into phoenix
# accepts pandas dataframe or pandas series indexed with the same span IDs
session.import_evals(trace_df["relevant"])
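
The index-alignment contract described in the comment above can be illustrated with plain pandas, independent of Phoenix. This is a minimal sketch under stated assumptions: the `span_id` index name, the toy dataframe, and `toy_relevance` (a keyword-overlap stand-in for the LLM classifier) are all hypothetical and not part of the Phoenix API.

```python
import pandas as pd

# Hypothetical stand-in for an exported trace dataframe, indexed by span ID.
trace_df = pd.DataFrame(
    {
        "question": ["What is Phoenix?", "What is Phoenix?", "How do I export spans?"],
        "context": [
            "Phoenix is an observability tool.",
            "Bananas are yellow.",
            "Call the export method.",
        ],
    },
    index=pd.Index(["span-1", "span-2", "span-3"], name="span_id"),
)


def toy_relevance(row: pd.Series) -> str:
    """Placeholder for an LLM call: label a row by naive keyword overlap."""
    question_words = set(row["question"].lower().rstrip("?").split())
    context_words = set(row["context"].lower().rstrip(".").split())
    return "relevant" if question_words & context_words else "irrelevant"


# One label per span; the result keeps the span-ID index of trace_df.
evals = trace_df.apply(toy_relevance, axis=1).rename("relevant")

# Evals computed on a subset, in a different order, still line up by span ID.
subset_evals = evals.loc[["span-3", "span-1"]]
merged = trace_df.join(subset_evals)  # span-2 has no eval and becomes NaN
```

Because the join is keyed on the index rather than on row position, a partial or reordered series of evals can be attached without any extra bookkeeping, which is presumably why the import accepts anything indexed by span ID.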

## 2. Running evals during LLM application execution via our callback system

Some users will trust our default templates, models, and configurations and won't need to customize anything further.

In [None]:
import phoenix as px
from phoenix.evals import Evals
from phoenix.experimental.callbacks.langchain_tracer import OpenInferenceTracer

px.launch_app()
tracer = OpenInferenceTracer(
    evals=Evals.from_names(["hallucination", "relevance", "toxicity"]),
)
# define your chain...
for query in queries:
    chain.run(query, callbacks=[tracer])

Other users might want to configure their LLMs while still using our default templates. For example, their LLM application might be running on open-source or fine-tuned models, but they want to use GPT-4 to evaluate.

In [None]:
import phoenix as px
from phoenix.evals import Evals, JobConfig, OpenAIModel
from phoenix.experimental.callbacks.langchain_tracer import OpenInferenceTracer

px.launch_app()
tracer = OpenInferenceTracer(
    evals=Evals.from_names(
        ["hallucination", "relevance", "toxicity"],
        model=OpenAIModel(model_name="gpt-4"),
        job_config=JobConfig(max_requests_per_minute=200, max_tokens_per_minute=50000),
    ),
)
# define your chain...
for query in queries:
    chain.run(query, callbacks=[tracer])
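
The two `JobConfig` limits suggest a per-minute budget on both requests and tokens. How Phoenix would enforce them is not specified here; the sketch below shows one simple possibility, a fixed one-minute window. The `MinuteBudget` class and its `try_acquire` method are hypothetical, and the clock is passed in explicitly so the behavior is deterministic.

```python
from dataclasses import dataclass


@dataclass
class MinuteBudget:
    """Hypothetical fixed-window limiter; not part of the Phoenix API."""

    max_requests_per_minute: int
    max_tokens_per_minute: int
    _window_start: float = 0.0
    _requests: int = 0
    _tokens: int = 0

    def try_acquire(self, tokens: int, now: float) -> bool:
        """Return True and record usage if the request fits the current window."""
        # Start a fresh window once 60 seconds have elapsed.
        if now - self._window_start >= 60.0:
            self._window_start = now
            self._requests = 0
            self._tokens = 0
        over_requests = self._requests + 1 > self.max_requests_per_minute
        over_tokens = self._tokens + tokens > self.max_tokens_per_minute
        if over_requests or over_tokens:
            return False
        self._requests += 1
        self._tokens += tokens
        return True


budget = MinuteBudget(max_requests_per_minute=2, max_tokens_per_minute=100)
first = budget.try_acquire(tokens=60, now=0.0)        # fits both limits
too_many_tokens = budget.try_acquire(tokens=50, now=1.0)  # 110 tokens > 100
second = budget.try_acquire(tokens=40, now=2.0)       # exactly 100 tokens
too_many_requests = budget.try_acquire(tokens=0, now=3.0)  # third request
after_reset = budget.try_acquire(tokens=60, now=61.0)  # new window
```

A caller that receives `False` would wait and retry. A sliding window or token bucket would smooth out bursts at window boundaries, at the cost of a little more state.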

We should provide the ability for users to write their own custom evals.

In [None]:
import phoenix as px
from phoenix.evals import (
    ClassificationPromptTemplate,
    Evals,
    Evaluator,
    LLMClassifier,
    OpenAIModel,
)
from phoenix.experimental.callbacks.langchain_tracer import OpenInferenceTracer

px.launch_app()

model = OpenAIModel(model_name="gpt-4")
prompt_template = ClassificationPromptTemplate(
    template="some prompt template the user is testing out with {context} and {question}",
    classes=["relevant", "irrelevant"],
)
clf = LLMClassifier(model=model, prompt_template=prompt_template)
evaluator = Evaluator(clf)
tracer = OpenInferenceTracer(
    evals=Evals(evaluators=[evaluator]),
)
# Run your LLM application; evaluations appear in the Phoenix UI as it runs.

Questions:

- If someone is running our callbacks with one-click from LlamaIndex, how can they run with evals?
- How does the user configure ranking metrics that are composite (i.e., require first that a classification is run and second that a score is computed)?
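
For the second question, one way to picture a composite metric is as a two-stage pipeline: a classifier labels each retrieved document, and a scoring function turns the ordered labels into a number. The sketch below assumes binary relevance labels and uses precision@k and NDCG@k as example scores. `keyword_classifier`, `composite_metric`, and the sample documents are hypothetical stand-ins, not Phoenix APIs.

```python
import math
from typing import Callable, Sequence


def precision_at_k(relevance: Sequence[int], k: int) -> float:
    """Fraction of the top-k documents that are relevant."""
    top = relevance[:k]
    return sum(top) / len(top) if top else 0.0


def ndcg_at_k(relevance: Sequence[int], k: int) -> float:
    """Normalized discounted cumulative gain over the top-k documents."""

    def dcg(scores: Sequence[int]) -> float:
        return sum(s / math.log2(rank + 2) for rank, s in enumerate(scores[:k]))

    ideal = dcg(sorted(relevance, reverse=True))
    return dcg(relevance) / ideal if ideal > 0 else 0.0


def composite_metric(
    documents: Sequence[str],
    classify: Callable[[str], str],
    score: Callable[[Sequence[int], int], float],
    k: int,
) -> float:
    """Stage 1: classify each document. Stage 2: score the ordered labels."""
    labels = [classify(doc) for doc in documents]
    relevance = [1 if label == "relevant" else 0 for label in labels]
    return score(relevance, k)


def keyword_classifier(document: str) -> str:
    """Placeholder for an LLM classifier."""
    return "relevant" if "phoenix" in document.lower() else "irrelevant"


retrieved = [
    "Phoenix traces LLM applications.",
    "Bananas are yellow.",
    "Evals in Phoenix score spans.",
]
p_at_2 = composite_metric(retrieved, keyword_classifier, precision_at_k, k=2)
ndcg_at_3 = composite_metric(retrieved, keyword_classifier, ndcg_at_k, k=3)
```

Keeping the classifier and the scorer as separate callables means the expensive LLM labels could be computed once and reused across several ranking scores.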

## 3. Running evals post-hoc

Some users will want to run evals post-hoc and then launch Phoenix with a trace dataset that carries their evaluations.

In [None]:
import phoenix as px
from phoenix.evals import Evals

spans = ...
ds = px.TraceDataset.from_spans(spans)
# define evals in the same manner as above
evals = Evals(...)
ds.run_evals(evals)
px.launch_app(ds)