# Evals Ergonomics

We aim to support evals for three different user flows:

1. Running evals on an exported pandas dataframe and importing back into Phoenix,
2. Running evals during LLM application execution via our callback system,
3. Running evals post-hoc on a dataset of traces.

Note: This notebook uses `phoenix.evals` everywhere for convenience, but we'll continue to keep `evals` in `experimental` for the time being.

## 1. Running evals on an exported pandas dataframe and importing back into Phoenix

The user is actively experimenting with their trace data in the form of a pandas dataframe. They wish to compute some evals, perhaps using our custom evaluators or perhaps using their own bespoke code, upload their evaluations, and see their evaluations reflected in Phoenix.

In [None]:
import phoenix as px
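from phoenix.evals import (
    DefaultHallucinationClassificationConfig,
    DefaultOpenAIGPT4RequestConfig,
    DefaultRelevanceClassificationConfig,
    HallucinationClassificationConfig,
    LLMEvalConfig,
    ManualRollingWindow,
    OpenAIChatModel,
    PromptTemplate,
    RelevanceClassificationConfig,
    RequestConfig,
    llm_eval_binary,
    run_evals,
)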
from phoenix.experimental.callbacks.langchain_tracer import OpenInferenceTracer

session = px.launch_app()
tracer = OpenInferenceTracer()
# Run your LLM application...

trace_df = session.export_dataframe()

# Method 1: Specify evals by name
# run_evals is able to apply evals to the appropriate span kinds
evals_df = run_evals(
    trace_df,
    evals=[
        LLMEvalConfig(
            classification_configs=["relevance", "hallucination"],
            model=OpenAIChatModel(model_name="gpt-4"),
        ),
    ],
)

# Method 2: Specify evals by default configuration
evals_df = run_evals(
    trace_df,
    evals=[
        LLMEvalConfig(
            classification_configs=[
                DefaultRelevanceClassificationConfig(),
                DefaultHallucinationClassificationConfig(),
            ],
            model=OpenAIChatModel(model_name="gpt-4"),
        ),
    ],
)

# Method 3: Specify completely custom configurations.
evals_df = run_evals(
    trace_df,
    evals=[
        LLMEvalConfig(
            classification_configs=[
                RelevanceClassificationConfig(
                    template=PromptTemplate(
                        template_string="Query: {query}\nReference: {reference}\nResponse: "
                    ),
                    rails=("relevant", "irrelevant"),
                    system_message='You are an assistant whose purpose is to classify a document as relevant or irrelevant to a query. You must respond with a single word, either "relevant" or "irrelevant".',
                    query_variable_name="query",
                    reference_variable_name="reference",
                ),
                HallucinationClassificationConfig(
                    template=PromptTemplate(
                        template_string="Query: {query}\nReference: {reference}\nResponse: {response}\nHallucination: "
                    ),
                    rails=("hallucinated", "grounded"),
                    system_message='You are an assistant whose purpose is to classify a response from an LLM as either a hallucination or a grounded response. You must respond with a single word, either "hallucinated" or "grounded".',
                ),
            ],
            model=OpenAIChatModel(model_name="gpt-4"),
            request_config=RequestConfig(
                rolling_window=ManualRollingWindow(
                    rolling_window_duration,
                    max_requests_per_window,
                    max_tokens_per_window,
                )
            ),
        )
    ],
)

# Method 4: Using llm_eval_binary on a subset of the trace data
trace_df = session.export_dataframe('span_kind == "LLM"')
model = OpenAIChatModel(model_name="gpt-4")
prompt_template = PromptTemplate(
    template_string="some prompt template the user is testing out with {context} and {question}",
)
trace_df["relevance"] = llm_eval_binary(
    trace_df,
    model,
    prompt_template,
    rails=["relevant", "irrelevant"],
)

# import back into phoenix
# accepts pandas dataframe or pandas series indexed with the same span IDs

# Method 1: Importing a pandas series
session.import_evals(trace_df["relevance"])

# Method 2: Importing a pandas dataframe
session.import_evals(evals_df)
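
The `rails` arguments above restrict each classification to a fixed set of labels. As a rough illustration of what that could mean in practice, the sketch below snaps raw model output onto a rail. The helper name `snap_to_rails` is hypothetical and not part of `phoenix.evals`. It assumes that an output counts only if exactly one rail appears in it as a whole word, so `"irrelevant"` is not mistaken for `"relevant"`.

```python
import re
from typing import Optional, Sequence


def snap_to_rails(raw_output: str, rails: Sequence[str]) -> Optional[str]:
    """Return the single rail found in raw_output, or None if zero or several match.

    Hypothetical helper for illustration only.
    """
    text = raw_output.lower()
    # Word-boundary matching keeps "relevant" from matching inside "irrelevant".
    matches = [
        rail
        for rail in rails
        if re.search(r"\b" + re.escape(rail.lower()) + r"\b", text)
    ]
    # Ambiguous or unparseable output is reported as None rather than guessed.
    return matches[0] if len(matches) == 1 else None
```

Returning `None` for ambiguous output leaves it to the caller to decide whether to retry, drop the row, or record the value as unparseable.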

## 2. Running evals during LLM application execution via our callback system

Some users will want to execute our evals at application runtime by tying their evals to our callback system.

In [None]:
import phoenix as px
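from phoenix.experimental.evals import (
    DefaultAnthropicClaude2RequestConfig,
    DefaultHallucinationClassificationConfig,
    DefaultOpenAIGPT4RequestConfig,
    DefaultRelevanceClassificationConfig,
    LLMEvalConfig,
)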
from phoenix.experimental.evals.models import AnthropicChatModel, OpenAIChatModel
from phoenix.trace.langchain import LangChainInstrumentor, OpenInferenceTracer

px.launch_app()

# Method 1: Evaluations by name.
tracer = OpenInferenceTracer(
    evals=[
        LLMEvalConfig(
            evaluators=["hallucination", "relevance", "toxicity"],
            model=OpenAIChatModel(model_name="gpt-4"),
        )
    ]
)
# In theory, someone could run evals with OpenAI and Anthropic at the same time.
# This is not needed at the moment, but it could be a reasonable ask if we wind up relying on features of particular APIs.
tracer = OpenInferenceTracer(
    evals=[
        LLMEvalConfig(
            evaluators=["hallucination", "relevance"],
            model=OpenAIChatModel(model_name="gpt-4"),
            request_config=DefaultOpenAIGPT4RequestConfig(),  # optional argument
        ),
        LLMEvalConfig(
            evaluators=["toxicity"],
            model=AnthropicChatModel(model_name="claude-2"),
            request_config=DefaultAnthropicClaude2RequestConfig(),  # optional argument
        ),
    ]
)

# Method 2: Evaluations with custom configuration
tracer = OpenInferenceTracer(
    evals=[
        LLMEvalConfig(
            evaluators=[
                DefaultRelevanceClassificationConfig(),
                DefaultHallucinationClassificationConfig(),
            ],
            model=OpenAIChatModel(model_name="gpt-4"),
            request_config=DefaultOpenAIGPT4RequestConfig(),
        )
    ]
)


LangChainInstrumentor(tracer).instrument()

# define your chain...

# run your chain
for query in queries:
    chain.run(query)
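
The `request_config` arguments above are where rate limits would live. In section 1, `ManualRollingWindow` takes a window duration, a request cap, and a token cap. The sketch below shows one way such a rolling window might be enforced. The class `RollingWindowLimiter` and its `try_acquire` method are hypothetical names for illustration, not the proposed implementation, and the injectable `clock` is an assumption added to make the behavior easy to test.

```python
import time
from collections import deque
from typing import Callable, Deque, Tuple


class RollingWindowLimiter:
    """Hypothetical rolling-window limiter; not part of phoenix.evals."""

    def __init__(
        self,
        rolling_window_duration: float,
        max_requests_per_window: int,
        max_tokens_per_window: int,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.duration = rolling_window_duration
        self.max_requests = max_requests_per_window
        self.max_tokens = max_tokens_per_window
        self.clock = clock
        # Each entry is (timestamp, tokens) for one admitted request.
        self._events: Deque[Tuple[float, int]] = deque()
        self._tokens_in_window = 0

    def _evict_expired(self, now: float) -> None:
        # Drop requests that have aged out of the window.
        while self._events and now - self._events[0][0] >= self.duration:
            _, tokens = self._events.popleft()
            self._tokens_in_window -= tokens

    def try_acquire(self, tokens: int) -> bool:
        """Admit and record the request if both caps allow it; otherwise return False."""
        now = self.clock()
        self._evict_expired(now)
        if len(self._events) + 1 > self.max_requests:
            return False
        if self._tokens_in_window + tokens > self.max_tokens:
            return False
        self._events.append((now, tokens))
        self._tokens_in_window += tokens
        return True
```

A caller that receives `False` could sleep and retry. Whether evals run from a callback should block the application while waiting is a separate design decision.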

Questions:

- If someone is running our callbacks with one-click from LlamaIndex, how can they run with evals?
- How does the user configure ranking metrics that are composite (i.e., require first that a classification is run and second that a score is computed)?
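
To make the second question concrete, a composite ranking metric can be seen as two stages: a classification pass that labels each retrieved document, then a scoring pass over those labels in rank order. The sketch below covers only the scoring stage. The function names `precision_at_k` and `ndcg_at_k` are hypothetical, and it assumes binary `"relevant"` / `"irrelevant"` labels already produced by a relevance classification.

```python
import math
from typing import Sequence


def precision_at_k(labels: Sequence[str], k: int, positive: str = "relevant") -> float:
    """Fraction of the top-k retrieved documents labeled as positive."""
    top_k = labels[:k]
    if not top_k:
        return 0.0
    return sum(label == positive for label in top_k) / len(top_k)


def ndcg_at_k(labels: Sequence[str], k: int, positive: str = "relevant") -> float:
    """NDCG@k with binary gains.

    The ideal ordering is computed over the retrieved documents only, since
    relevant documents that were never retrieved are unknown here.
    """
    gains = [1.0 if label == positive else 0.0 for label in labels]
    ideal = sorted(gains, reverse=True)
    dcg = sum(g / math.log2(rank + 2) for rank, g in enumerate(gains[:k]))
    idcg = sum(g / math.log2(rank + 2) for rank, g in enumerate(ideal[:k]))
    return dcg / idcg if idcg > 0 else 0.0
```

Because the scoring stage needs only labels in rank order, it could run on whatever column the classification stage writes. That suggests the configuration question is mainly about declaring the dependency between the two stages.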

## 3. Running evals post-hoc

Some users will want to run evals post-hoc and launch Phoenix with a trace dataset that includes their evaluations.

In [None]:
import phoenix as px
from phoenix.evals import LLMEvalConfig

# load in some spans
spans = ...

# define a trace dataset
ds = px.TraceDataset.from_spans(spans)

# define evals in the same manner as above
evals = [LLMEvalConfig(classification_configs=[...], model=...)]
ds.run_evals(evals=evals)
px.launch_app(ds)

## Questions

- How do we handle evaluations that require reference answers to compute?