# Phoenix Evals Quickstart

This quickstart shows how Phoenix helps you evaluate data from your LLM application (e.g., inputs, outputs, retrieved documents).

You will:

- Export a dataframe from your Phoenix session that contains traces from an instrumented LLM application,
- Evaluate your trace data for:
  - Relevance: Are the retrieved documents relevant to the query?
  - Q&A correctness: Do your application's responses correctly answer the query given the retrieved context?
  - Hallucinations: Is your application making up false information?
- Ingest the evaluations into Phoenix to see the results annotated on the corresponding spans and traces.

Let's get started!

First, install Phoenix with `pip install arize-phoenix`.

In [2]:
# Temporary workaround for CERTIFICATE_VERIFY_FAILED errors.
# Note: this disables HTTPS certificate verification for the whole process.
import ssl
ssl._create_default_https_context = ssl._create_unverified_context
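
The workaround above turns off certificate verification for every HTTPS request made by the process. If that is broader than you want, one option is to relax verification for a single download only. The following is a minimal sketch using only the standard library; `make_ssl_context` and `fetch_lines` are hypothetical helper names introduced here for illustration and are not part of Phoenix.

```python
import ssl
from typing import List
from urllib.request import urlopen


def make_ssl_context(verify: bool = True) -> ssl.SSLContext:
    """Return an SSL context; verification stays on unless explicitly disabled."""
    context = ssl.create_default_context()
    if not verify:
        # check_hostname must be switched off before verify_mode can be CERT_NONE.
        context.check_hostname = False
        context.verify_mode = ssl.CERT_NONE
    return context


def fetch_lines(url: str, verify: bool = True) -> List[str]:
    """Download a text file and return its lines, using a per-request SSL context."""
    with urlopen(url, context=make_ssl_context(verify)) as response:
        return [line.decode("utf-8") for line in response.readlines()]
```

With helpers like these, you would call `fetch_lines(traces_url, verify=False)` only for the request that fails verification and leave every other request verified.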

In [3]:
from urllib.request import urlopen

from phoenix.trace.trace_dataset import TraceDataset
from phoenix.trace.utils import json_lines_to_df

# To get you up and running quickly, we'll download some pre-existing trace data
# collected from a LlamaIndex application. In practice, this data would be collected
# by instrumenting your LLM application with an OpenInference-compatible tracer.
traces_url = "https://storage.googleapis.com/arize-phoenix-assets/datasets/unstructured/llm/context-retrieval/trace.jsonl"
with urlopen(traces_url) as response:
    lines = [line.decode("utf-8") for line in response.readlines()]
trace_df = json_lines_to_df(lines)

# Constructs a TraceDataset from a dataframe of spans
trace_ds = TraceDataset(trace_df)


Here are two sample records from `trace.jsonl`, one span per line:
```json
{"name": "query", "context": {"trace_id": "f40dc5d5-08b7-4e23-80e1-2cd6e9f0cf29", "span_id": "bce5b9ae-4587-4ead-9ccc-de3fe29257bc"}, "span_kind": "CHAIN", "parent_id": null, "start_time": "2023-12-11T17:57:17.891021+00:00", "end_time": "2023-12-11T17:57:20.075141+00:00", "status_code": "OK", "status_message": "", "attributes": {"input.value": "How can I query for a monitor's status using GraphQL?", "input.mime_type": "text/plain", "output.value": "You can query for a monitor's status using GraphQL by including the \"status\" field in your query.", "output.mime_type": "text/plain"}, "events": [], "conversation": null}
{"name": "synthesize", "context": {"trace_id": "f40dc5d5-08b7-4e23-80e1-2cd6e9f0cf29", "span_id": "3d59ca9b-5d68-4773-856f-5243cba51647"}, "span_kind": "CHAIN", "parent_id": "bce5b9ae-4587-4ead-9ccc-de3fe29257bc", "start_time": "2023-12-11T17:57:18.973513+00:00", "end_time": "2023-12-11T17:57:20.075056+00:00", "status_code": "OK", "status_message": "", "attributes": {"input.value": "How can I query for a monitor's status using GraphQL?", "input.mime_type": "text/plain", "output.value": "You can query for a monitor's status using GraphQL by including the \"status\" field in your query.", "output.mime_type": "text/plain"}, "events": [], "conversation": null}
```
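
To make the record structure concrete, the sketch below parses two abbreviated records modeled on the sample above (most attributes are omitted for brevity), groups them by `trace_id`, and finds the root span, which is the one whose `parent_id` is null. It uses only the standard library and is meant as an illustration of the data layout, not of how `json_lines_to_df` works internally.

```python
import json
from collections import defaultdict

# Two abbreviated records modeled on the sample above (attributes omitted).
sample_lines = [
    '{"name": "query", "context": {"trace_id": "f40dc5d5-08b7-4e23-80e1-2cd6e9f0cf29", '
    '"span_id": "bce5b9ae-4587-4ead-9ccc-de3fe29257bc"}, "span_kind": "CHAIN", "parent_id": null}',
    '{"name": "synthesize", "context": {"trace_id": "f40dc5d5-08b7-4e23-80e1-2cd6e9f0cf29", '
    '"span_id": "3d59ca9b-5d68-4773-856f-5243cba51647"}, "span_kind": "CHAIN", '
    '"parent_id": "bce5b9ae-4587-4ead-9ccc-de3fe29257bc"}',
]

# Each line is an independent JSON document describing one span.
spans = [json.loads(line) for line in sample_lines]

# Group spans that belong to the same trace.
spans_by_trace = defaultdict(list)
for span in spans:
    spans_by_trace[span["context"]["trace_id"]].append(span)

for trace_id, trace_spans in spans_by_trace.items():
    # The root span has no parent; its direct children point at its span_id.
    root = next(s for s in trace_spans if s["parent_id"] is None)
    children = [s["name"] for s in trace_spans if s["parent_id"] == root["context"]["span_id"]]
    print(f"trace {trace_id}: root={root['name']!r}, children={children}")
```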

Launch Phoenix. You can use Phoenix within your notebook or in a separate browser window by opening the URL.
Note that this trace data dates back to December 11, 2023 (around 17:57 UTC), so make sure to select "All Time" in the web app to see it.
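
The timestamps in the sample records show why a recent time range can come up empty. The sketch below uses only the standard library and the first sample span's `start_time` and `end_time`; `in_window` is a hypothetical helper that mimics a "last hour" style filter and is not Phoenix's actual filtering logic.

```python
from datetime import datetime, timedelta, timezone

# start_time and end_time of the "query" span in the sample records.
start = datetime.fromisoformat("2023-12-11T17:57:17.891021+00:00")
end = datetime.fromisoformat("2023-12-11T17:57:20.075141+00:00")

# Span latency is simply end minus start.
latency_seconds = (end - start).total_seconds()
print(f"span latency: {latency_seconds:.3f}s")


def in_window(timestamp: datetime, now: datetime, window: timedelta) -> bool:
    """Return True if timestamp falls within the trailing window ending at now."""
    return now - window <= timestamp <= now


# A trailing one-hour window measured from the current time excludes the 2023 span.
now = datetime.now(timezone.utc)
print("visible in the last hour:", in_window(start, now, timedelta(hours=1)))
```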


In [4]:
import phoenix as px

session = px.launch_app(trace=trace_ds)
session.view()

🌍 To view the Phoenix app in your browser, visit http://localhost:6006/
📖 For more information on how to use Phoenix, check out https://docs.arize.com/phoenix
📺 Opening a view to the Phoenix app. The app is running at http://localhost:6006/




Export your retrieved documents and query data from your session into a pandas dataframe.

Note: If you are interested in a different subset of your data, you can export with a custom query.

In [5]:
from phoenix.session.evaluation import get_qa_with_reference, get_retrieved_documents

retrieved_documents_df = get_retrieved_documents(px.Client())
queries_df = get_qa_with_reference(px.Client())
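
It can help to picture the unit of evaluation for each export: retrieval relevance is judged per retrieved document, while hallucination and Q&A correctness are judged per query. The sketch below illustrates the first case by flattening one retrieval step into one row per document. The span dictionary, its field names, the placeholder document text, and the scores are all hypothetical and chosen for illustration; check the columns of your own `retrieved_documents_df` for the actual layout.

```python
from typing import Dict, Iterator

# Hypothetical, simplified retrieval step; field names and values are illustrative only.
retriever_span = {
    "span_id": "span-1",
    "query": "How can I query for a monitor's status using GraphQL?",
    "documents": [
        {"content": "Placeholder text of the first retrieved chunk.", "score": 0.91},
        {"content": "Placeholder text of the second retrieved chunk.", "score": 0.78},
    ],
}


def flatten_documents(span: Dict) -> Iterator[Dict]:
    """Yield one record per retrieved document, keyed by span ID and document position."""
    for position, document in enumerate(span["documents"]):
        yield {
            "span_id": span["span_id"],
            "document_position": position,
            "input": span["query"],
            "reference": document["content"],
            "document_score": document["score"],
        }


rows = list(flatten_documents(retriever_span))
for row in rows:
    print(row["span_id"], row["document_position"], row["document_score"])
```

Each row pairs the query with exactly one document, which is the pairing a relevance evaluator needs.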

Phoenix evaluates your application data by prompting an LLM to classify whether a retrieved document is relevant or irrelevant to the corresponding query, whether a response is grounded in a retrieved document, etc. You can even get explanations generated by the LLM to help you understand the results of your evaluations!
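
The classification idea can be sketched in a few lines: fill a prompt template with one record, send it to a model, and map the free-form answer onto a fixed set of allowed labels. Everything below is an assumption made for illustration: the template wording, the `snap_to_rails` and `classify` helpers, and the `stub_llm` function that stands in for a real model call so the sketch runs offline. Phoenix's own templates and parsing are more elaborate.

```python
import re
from typing import Callable, Dict, Optional, Sequence

# Illustrative template only; not the template Phoenix ships.
RELEVANCE_TEMPLATE = (
    "You are comparing a reference text to a question.\n"
    "Question: {input}\n"
    "Reference: {reference}\n"
    "Answer with a single word, either 'relevant' or 'irrelevant'."
)
RAILS = ("relevant", "irrelevant")


def snap_to_rails(raw_output: str, rails: Sequence[str]) -> Optional[str]:
    """Map free-form model output onto exactly one allowed label, else None."""
    text = raw_output.strip().lower()
    # Whole-word matching keeps "irrelevant" from also counting as "relevant".
    matches = [rail for rail in rails if re.search(rf"\b{re.escape(rail)}\b", text)]
    return matches[0] if len(matches) == 1 else None


def classify(record: Dict[str, str], llm: Callable[[str], str]) -> Optional[str]:
    """Fill the template with one record, call the model, and snap its answer."""
    prompt = RELEVANCE_TEMPLATE.format(**record)
    return snap_to_rails(llm(prompt), RAILS)


def stub_llm(prompt: str) -> str:
    """Stand-in for a real model call: 'relevant' if the reference mentions GraphQL."""
    reference_part = prompt.split("Reference:")[1]
    return "Relevant." if "GraphQL" in reference_part else "irrelevant"


label = classify(
    {"input": "How do I query monitor status?", "reference": "Use the GraphQL status field."},
    stub_llm,
)
print(label)
```

Restricting the output to a small set of labels is what makes the results easy to aggregate; ambiguous or unparseable answers return `None` here rather than being guessed.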

This quickstart uses OpenAI and requires an OpenAI API key, but we support a wide variety of APIs and models.

Install the OpenAI SDK with `pip install openai` and instantiate your model.

In [6]:
from phoenix.evals import OpenAIModel

eval_model = OpenAIModel(model="gpt-4-turbo-preview")

Next, define your evaluators. Evaluators are built on top of language models: they prompt the LLM to assess the quality of responses, the relevance of retrieved documents, etc., and provide a quality signal even in the absence of human-labeled data. Pick an evaluator type and instantiate it with the language model you want to use; the evaluations are performed using our battle-tested evaluation templates.

![A diagram depicting how evaluators are composed of LLMs and evaluation prompt templates and produce labels, scores, and explanations from input data (e.g., queries, references, outputs, etc.)](https://storage.googleapis.com/arize-phoenix-assets/assets/docs/notebooks/evals/evaluators_diagram.png)
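
The composition in the diagram can be sketched as a small class that bundles a model callable with a prompt template and turns each record into a label, a score, and an explanation. This is a hypothetical illustration, not Phoenix's implementation: `SketchEvaluator`, `EvalResult`, the template text, the "explanation followed by a final `LABEL:` line" response format, and the label-to-score mapping are all assumptions, and `stub_llm` replaces a real model call.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional


@dataclass
class EvalResult:
    label: Optional[str]
    score: Optional[float]
    explanation: str


@dataclass
class SketchEvaluator:
    """Hypothetical evaluator: a model callable, a template, and a label-to-score map."""

    llm: Callable[[str], str]
    template: str
    scores: Dict[str, float]

    def evaluate(self, record: Dict[str, str]) -> EvalResult:
        raw = self.llm(self.template.format(**record))
        # Assumed response format: free-text explanation, then a final "LABEL: <label>".
        explanation, _, label_part = raw.rpartition("LABEL:")
        label = label_part.strip().lower()
        if label not in self.scores:
            # Unrecognized label: keep the raw text so the failure can be inspected.
            return EvalResult(label=None, score=None, explanation=raw.strip())
        return EvalResult(label, self.scores[label], explanation.strip())


def stub_llm(prompt: str) -> str:
    """Stand-in for a real model call so the sketch runs offline."""
    return "The answer restates a fact found in the reference.\nLABEL: factual"


evaluator = SketchEvaluator(
    llm=stub_llm,
    template=(
        "Reference: {reference}\nAnswer: {output}\n"
        "Explain your reasoning, then finish with 'LABEL: factual' or 'LABEL: hallucinated'."
    ),
    scores={"factual": 1.0, "hallucinated": 0.0},
)
result = evaluator.evaluate(
    {"reference": "Include the status field in your query.", "output": "Use the status field."}
)
print(result.label, result.score, result.explanation)
```

Keeping the template and the score mapping as data means a different evaluation can reuse the same class with a different template and label set.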

In [7]:
from phoenix.evals import (
    HallucinationEvaluator,
    QAEvaluator,
    RelevanceEvaluator,
)

hallucination_evaluator = HallucinationEvaluator(eval_model)
qa_correctness_evaluator = QAEvaluator(eval_model)
relevance_evaluator = RelevanceEvaluator(eval_model)

Run your evaluations.

In [8]:
import nest_asyncio
from phoenix.evals import run_evals

nest_asyncio.apply()  # needed for concurrency in notebook environments

hallucination_eval_df, qa_correctness_eval_df = run_evals(
    dataframe=queries_df,
    evaluators=[hallucination_evaluator, qa_correctness_evaluator],
    provide_explanation=True,
)
relevance_eval_df = run_evals(
    dataframe=retrieved_documents_df,
    evaluators=[relevance_evaluator],
    provide_explanation=True,
)[0]


Log your evaluations to your running Phoenix session.

In [9]:
from phoenix.trace import DocumentEvaluations, SpanEvaluations

px.Client().log_evaluations(
    SpanEvaluations(eval_name="Hallucination", dataframe=hallucination_eval_df),
    SpanEvaluations(eval_name="QA Correctness", dataframe=qa_correctness_eval_df),
    DocumentEvaluations(eval_name="Relevance", dataframe=relevance_eval_df),
)

Your evaluations should now appear as annotations on your spans in Phoenix!
You can view aggregate evaluation statistics, surface problematic spans, understand the LLM's reason for each evaluation by reading the corresponding explanation, and pinpoint the cause (irrelevant retrievals, incorrect parameterization of your LLM, etc.) of your LLM application's poor responses.
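
The same kind of summary can also be computed outside the UI. The sketch below uses plain dictionaries as a stand-in for the rows of an evaluation dataframe; the span IDs, labels, scores, and explanations are made up for illustration, and the `label`, `score`, and `explanation` field names are assumed to mirror the evaluation output.

```python
from collections import Counter
from statistics import mean

# Hypothetical evaluation rows standing in for an evaluation dataframe; values are made up.
eval_rows = [
    {"span_id": "span-a", "label": "factual", "score": 1.0,
     "explanation": "Supported by the reference."},
    {"span_id": "span-b", "label": "hallucinated", "score": 0.0,
     "explanation": "Claim is absent from the reference."},
    {"span_id": "span-c", "label": "factual", "score": 1.0,
     "explanation": "Supported by the reference."},
]

# Aggregate statistics: how often each label occurs and the average score.
label_counts = Counter(row["label"] for row in eval_rows)
mean_score = mean(row["score"] for row in eval_rows)

# Surface problematic spans together with the model's stated reason.
problem_spans = [row["span_id"] for row in eval_rows if row["score"] < 1.0]

print(dict(label_counts))
print(f"mean score: {mean_score:.2f}")
for row in eval_rows:
    if row["span_id"] in problem_spans:
        print(row["span_id"], "->", row["explanation"])
```

If you prefer to start from a real pandas dataframe, `df.reset_index().to_dict("records")` produces a list of dictionaries in a similar shape, though the exact column names depend on your data.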

In [10]:
print(f"🔥🐦 Open back up Phoenix in case you closed it: {session.url}")

🔥🐦 Open back up Phoenix in case you closed it: http://localhost:6006/



![A view of the Phoenix UI with evaluation annotations](https://storage.googleapis.com/arize-phoenix-assets/assets/docs/notebooks/evals/traces_with_evaluation_annotations.png)