This quickstart shows how Phoenix helps you evaluate data from your LLM application (e.g., inputs, outputs, retrieved documents).

You will:

- Export a dataframe from your Phoenix session that has collected traces from your instrumented LLM application.
- Evaluate your trace data for:
  - Relevance: Are the retrieved documents relevant to the query?
  - Q&A correctness: Does your application answer the question correctly, given the retrieved context?
  - Hallucinations: Is your application making up false information?
- Ingest the evaluations into Phoenix to see the results annotated on the corresponding spans and traces.

Let's get started! First, install Phoenix with `pip install arize-phoenix`.

To get you up and running quickly, we'll download some pre-existing trace data collected from a LlamaIndex application (in practice, this data would be collected by instrumenting your LLM application with an OpenInference-compatible tracer).

In [None]:
from urllib.request import urlopen

from phoenix.trace.trace_dataset import TraceDataset
from phoenix.trace.utils import json_lines_to_df

traces_url = "https://storage.googleapis.com/arize-assets/phoenix/datasets/unstructured/llm/context-retrieval/trace.jsonl"
with urlopen(traces_url) as response:
    lines = [line.decode("utf-8") for line in response.readlines()]
trace_ds = TraceDataset(json_lines_to_df(lines))
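
For orientation, the trace file is in JSON Lines format: one JSON object per line. The sketch below parses such lines using only the standard library. It is not how `json_lines_to_df` is implemented, and the helper name `parse_json_lines` and the sample field names (`name`, `span_kind`) are made up for illustration.

```python
import json


def parse_json_lines(lines):
    """Parse an iterable of JSON Lines strings into a list of dicts, skipping blank lines."""
    records = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        records.append(json.loads(line))
    return records


# Hypothetical sample records; real trace files contain many more fields.
sample_lines = [
    '{"name": "query", "span_kind": "CHAIN"}\n',
    "\n",
    '{"name": "retrieve", "span_kind": "RETRIEVER"}\n',
]
records = parse_json_lines(sample_lines)
print(len(records), records[1]["span_kind"])
```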

Launch Phoenix. You can use Phoenix within your notebook or in a separate browser window by opening the URL.

In [None]:
import phoenix as px

session = px.launch_app(trace=trace_ds)
session.view()

You should now see your traces in the Phoenix UI.

Export your retrieved documents and query data from your session into a pandas dataframe.

Note: If you are interested in a different subset of your data, you can export with a custom query.

In [None]:
from phoenix.session.evaluation import get_qa_with_reference, get_retrieved_documents

retrieved_documents_df = get_retrieved_documents(session)
queries_df = get_qa_with_reference(session)

Phoenix evaluates your application data by prompting an LLM to classify whether a retrieved document is relevant or irrelevant to the corresponding query, whether a response is grounded in a retrieved document, etc. This example uses OpenAI and requires an OpenAI API key, but we support a wide variety of APIs and models.
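
To make classification against a fixed label set concrete, here is a minimal sketch of mapping a free-form LLM reply onto one of the allowed labels (the "rails" passed to `llm_classify` later in this quickstart). The function `snap_to_rails` and the fallback value `"NOT_PARSABLE"` are names chosen for this illustration; Phoenix's own parsing logic may differ.

```python
import re


def snap_to_rails(raw_output, rails, fallback="NOT_PARSABLE"):
    """Return the single rail that appears as a whole word in raw_output, else fallback."""
    text = raw_output.strip().lower()
    # Whole-word matching keeps "relevant" from matching inside "irrelevant".
    matches = [
        rail for rail in rails
        if re.search(rf"\b{re.escape(rail.lower())}\b", text)
    ]
    # Accept only unambiguous replies: exactly one rail mentioned.
    return matches[0] if len(matches) == 1 else fallback


rails = ["relevant", "irrelevant"]
print(snap_to_rails("Relevant", rails))
print(snap_to_rails("The document is irrelevant.", rails))
print(snap_to_rails("I am not sure.", rails))
```

Rejecting replies that mention zero or several labels is a design choice in this sketch: an ambiguous reply is treated as unparsable rather than guessed.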

In [None]:
from phoenix.experimental.evals import OpenAIModel

api_key = None  # set your api key here or with the OPENAI_API_KEY environment variable
eval_model = OpenAIModel(model_name="gpt-4-1106-preview", api_key=api_key)

Run your evaluations.

In [None]:
import nest_asyncio
from phoenix.experimental.evals import (
    HALLUCINATION_PROMPT_RAILS_MAP,
    HALLUCINATION_PROMPT_TEMPLATE,
    QA_PROMPT_RAILS_MAP,
    QA_PROMPT_TEMPLATE,
    RAG_RELEVANCY_PROMPT_RAILS_MAP,
    RAG_RELEVANCY_PROMPT_TEMPLATE,
    llm_classify,
)

nest_asyncio.apply()  # needed for concurrency in notebook environments

hallucination_eval_df = llm_classify(
    dataframe=queries_df,
    model=eval_model,
    template=HALLUCINATION_PROMPT_TEMPLATE,
    rails=list(HALLUCINATION_PROMPT_RAILS_MAP.values()),
    provide_explanation=True,
)
hallucination_eval_df["score"] = (
    hallucination_eval_df.label[~hallucination_eval_df.label.isna()] == "factual"
).astype(int)
qa_correctness_eval_df = llm_classify(
    dataframe=queries_df,
    model=OpenAIModel(model_name="gpt-4", temperature=0.0, api_key=api_key),
    template=QA_PROMPT_TEMPLATE,
    rails=list(QA_PROMPT_RAILS_MAP.values()),
    provide_explanation=True,
)
qa_correctness_eval_df["score"] = (
    qa_correctness_eval_df.label[~qa_correctness_eval_df.label.isna()] == "correct"
).astype(int)
relevance_eval_df = llm_classify(
    dataframe=retrieved_documents_df,
    model=OpenAIModel(model_name="gpt-4", temperature=0.0, api_key=api_key),
    template=RAG_RELEVANCY_PROMPT_TEMPLATE,
    rails=list(RAG_RELEVANCY_PROMPT_RAILS_MAP.values()),
    provide_explanation=True,
)
relevance_eval_df["score"] = (
    relevance_eval_df.label[~relevance_eval_df.label.isna()] == "relevant"
).astype(int)
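
The score columns above rely on pandas index alignment: rows with a missing label are filtered out before the comparison, so they receive `NaN` rather than `0` when the result is assigned back. The toy dataframe below, with made-up labels, shows the effect.

```python
import pandas as pd

# Made-up labels; None stands in for a row with a missing label.
toy_eval_df = pd.DataFrame({"label": ["factual", "hallucinated", None, "factual"]})

has_label = ~toy_eval_df.label.isna()
# The comparison covers only labeled rows; assignment aligns on the index,
# so the unlabeled row is left as NaN instead of being scored 0.
toy_eval_df["score"] = (toy_eval_df.label[has_label] == "factual").astype(int)

print(toy_eval_df)
```

Because the comparison and the filter must refer to the same dataframe, mixing labels from one evaluation with the mask from another would silently produce wrong scores.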

Log your evaluations to your running Phoenix session.

In [None]:
from phoenix.trace import DocumentEvaluations, SpanEvaluations

px.log_evaluations(
    SpanEvaluations(eval_name="Hallucination", dataframe=hallucination_eval_df),
    SpanEvaluations(eval_name="QA Correctness", dataframe=qa_correctness_eval_df),
)
px.log_evaluations(DocumentEvaluations(eval_name="Relevance", dataframe=relevance_eval_df))

Your evaluations should now appear as annotations on your spans in Phoenix!

In [None]:
print(f"🔥🐦 Reopen Phoenix in case you closed it: {session.url}")

You can view aggregate evaluation statistics, surface problematic spans, and diagnose the causes of poor responses from your LLM application (irrelevant retrievals, incorrect parameterization of your LLM, etc.).
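
Similar aggregates can be approximated directly in pandas. The sketch below uses a made-up evaluation dataframe with `label` and `score` columns like the ones computed earlier; the span IDs, index name, and values are invented for illustration.

```python
import pandas as pd

# Invented example data, indexed by a hypothetical span ID.
toy_eval_df = pd.DataFrame(
    {
        "label": ["correct", "incorrect", "correct", None],
        "score": [1.0, 0.0, 1.0, None],
    },
    index=pd.Index(["span-a", "span-b", "span-c", "span-d"], name="span_id"),
)

# Mean score over labeled rows (pandas skips NaN by default).
mean_score = toy_eval_df["score"].mean()

# Spans that failed the evaluation and merit a closer look.
failing_spans = toy_eval_df.index[toy_eval_df["score"] == 0].tolist()

print(f"mean score: {mean_score:.2f}")
print(f"failing spans: {failing_spans}")
```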