# Quickstart: Evals

This quickstart guide walks you through the basics of evaluating data from your LLM application.

## Install Phoenix Evals

In [8]:
%%bash
pip install -q "arize-phoenix[evals]" arize-phoenix-otel
pip install -q openai nest_asyncio


## Prepare your dataset
The first thing you'll need is a dataset to evaluate. This could be your own collected or generated set of examples, or data you've exported from Phoenix traces. If you've already collected some trace data, it makes a great starting point.

For the sake of this guide, however, we'll download some pre-existing data to evaluate. Feel free to substitute your own data; just be sure it includes the following columns:
- reference
- query
- response
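
If you bring your own data, the column requirement above can be checked before running any evaluations. The snippet below is a minimal sketch: `missing_columns` and `REQUIRED_COLUMNS` are hypothetical names introduced here for illustration and are not part of Phoenix.

```python
# Hypothetical helper (not part of Phoenix): report which required columns are absent.
REQUIRED_COLUMNS = ("reference", "query", "response")


def missing_columns(columns, required=REQUIRED_COLUMNS):
    """Return the required column names that are absent, in their listed order."""
    present = set(columns)
    return [name for name in required if name not in present]


# Works with any iterable of column names, such as `df.columns` on a pandas dataframe.
print(missing_columns(["reference", "query", "response", "extra"]))  # []
print(missing_columns(["query", "answer"]))  # ['reference', 'response']
```

With a pandas dataframe, you would call `missing_columns(df.columns)` and stop early if the returned list is not empty.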

In [2]:
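from phoenix.evals import download_benchmark_dataset

SAMPLE_SIZE = 10  # keep the sample small so the evaluations run quickly

# Download a benchmark dataset of question-answering examples
df = download_benchmark_dataset(
    task="binary-hallucination-classification", dataset_name="halueval_qa_data"
)
df = df[:SAMPLE_SIZE]
# Drop the ground-truth label; the evaluators will produce their own labels
df = df.drop(columns=["is_hallucination"])
df.head()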

Unnamed: 0,reference,query,response
0,() is a prefecture-level city in northweste...,Can Fuyang and Gaozhou be found in the same p...,no
1,() is a prefecture-level city in northweste...,Can Fuyang and Gaozhou be found in the same p...,"Yes, Fuyang and Gaozhou are in the same province."
2,"""808"" was a success in the United States beco...",808 peaked at number eight on what?,"Billboard"" Hot 100"
3,"""808"" was a success in the United States beco...",808 peaked at number eight on what?,"""808"" peaked at number nine on ""Billboard"" Hot..."
4,"""Arms"" then made a comeback in 2017 reaching ...",Arms is a song by American singer-songwriter C...,Moana


## Evaluate and Log Results
Set up evaluators (in this case for hallucinations and Q&A correctness), run the evaluations, and log the results to visualize them in Phoenix. We'll use OpenAI as our evaluation model for this example, but Phoenix also supports a number of other models. First, we need to add our OpenAI API key to our environment.

In [4]:
import os
from getpass import getpass

if not (openai_api_key := os.getenv("OPENAI_API_KEY")):
    openai_api_key = getpass("🔑 Enter your OpenAI API key: ")

os.environ["OPENAI_API_KEY"] = openai_api_key

In [6]:
import phoenix as px
from phoenix.evals import OpenAIModel, HallucinationEvaluator, QAEvaluator
from phoenix.evals import run_evals
import nest_asyncio
nest_asyncio.apply()  # This is needed for concurrency in notebook environments

# Create the evaluation model; the API key is read from the OPENAI_API_KEY environment variable set above
eval_model = OpenAIModel(model="gpt-4-turbo-preview")

# Define your evaluators
hallucination_evaluator = HallucinationEvaluator(eval_model)
qa_evaluator = QAEvaluator(eval_model)

# We have to make some minor changes to our dataframe to use the column names expected by our evaluators
# for `hallucination_evaluator` the input df needs to have columns 'output', 'input', 'context'
# for `qa_evaluator` the input df needs to have columns 'output', 'input', 'reference'
df["context"] = df["reference"]
df.rename(columns={"query": "input", "response":"output"}, inplace=True)
assert all(column in df.columns for column in ['output', 'input', 'context', 'reference'])

# Run the evaluators; each one returns a dataframe of evaluation results
# These results can optionally be logged to Phoenix in a later step
hallucination_eval_df, qa_eval_df = run_evals(
    dataframe=df,
    evaluators=[hallucination_evaluator, qa_evaluator],
    provide_explanation=True
)


Explanation of the parameters used in run_evals above:
- `dataframe` - a pandas dataframe that includes the data you want to evaluate. This could be spans exported from Phoenix, or data you've brought in from elsewhere. This dataframe must include the columns expected by the evaluators you are using. To see the columns expected by each built-in evaluator, check the corresponding page in the Using Phoenix Evaluators section.
- `evaluators` - a list of built-in Phoenix evaluators to use.
- `provide_explanation` - a boolean flag that instructs the evaluators to generate explanations for their choices.

## Analyze Your Evaluations
Combine your evaluation results and explanations with your original dataset:

In [7]:
results_df = df.copy()
results_df["hallucination_eval"] = hallucination_eval_df["label"]
results_df["hallucination_explanation"] = hallucination_eval_df["explanation"]
results_df["qa_eval"] = qa_eval_df["label"]
results_df["qa_explanation"] = qa_eval_df["explanation"]
results_df.head()

Unnamed: 0,reference,input,output,context,hallucination_eval,hallucination_explanation,qa_eval,qa_explanation
0,() is a prefecture-level city in northweste...,Can Fuyang and Gaozhou be found in the same p...,no,() is a prefecture-level city in northweste...,factual,The query asks if Fuyang and Gaozhou can be fo...,correct,The reference text clearly states that Fuyang ...
1,() is a prefecture-level city in northweste...,Can Fuyang and Gaozhou be found in the same p...,"Yes, Fuyang and Gaozhou are in the same province.",() is a prefecture-level city in northweste...,hallucinated,The reference text clearly states that Fuyang ...,incorrect,The reference text clearly states that Fuyang ...
2,"""808"" was a success in the United States beco...",808 peaked at number eight on what?,"Billboard"" Hot 100","""808"" was a success in the United States beco...",factual,"The query asks on which chart the song ""808"" p...",correct,"The question asks on which chart the song ""808..."
3,"""808"" was a success in the United States beco...",808 peaked at number eight on what?,"""808"" peaked at number nine on ""Billboard"" Hot...","""808"" was a success in the United States beco...",hallucinated,"The reference text clearly states that ""808"" p...",incorrect,"The reference text clearly states that ""808"" p..."
4,"""Arms"" then made a comeback in 2017 reaching ...",Arms is a song by American singer-songwriter C...,Moana,"""Arms"" then made a comeback in 2017 reaching ...",factual,The query asks from which 2016 American 3D com...,correct,The question asks from which 2016 American 3D ...
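
To summarize a label column such as `hallucination_eval`, you can compute the share of each label. The snippet below is a minimal sketch in plain Python: `label_rates` is a hypothetical helper, not part of Phoenix, and the sample labels are copied from the five rows shown above.

```python
from collections import Counter


def label_rates(labels):
    """Return each label's share of the total as a dict of label -> fraction."""
    counts = Counter(labels)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {label: count / total for label, count in counts.items()}


# Hallucination labels from the five rows displayed above
sample = ["factual", "hallucinated", "factual", "hallucinated", "factual"]
print(label_rates(sample))  # {'factual': 0.6, 'hallucinated': 0.4}
```

The helper accepts any iterable of labels, so it can be called as `label_rates(results_df["hallucination_eval"])`. If you prefer to stay in pandas, `results_df["hallucination_eval"].value_counts(normalize=True)` gives the same proportions.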


## (Optional) Log Results to Phoenix


**Note:** You'll only be able to log evaluations to the Phoenix UI if you used a trace or span dataset exported from Phoenix as your dataset in this quickstart. If you've used your own outside dataset, you won't be able to log these results to Phoenix.

Log your evaluation results to Phoenix using:

In [None]:
# Log the evaluations
from phoenix.trace import SpanEvaluations

px.Client().log_evaluations(
    SpanEvaluations(eval_name="Hallucination", dataframe=hallucination_eval_df),
    SpanEvaluations(eval_name="QA Correctness", dataframe=qa_eval_df)
)

Once the results are logged, you can view aggregate evaluation statistics, surface problematic spans, and read the explanation attached to each evaluation to understand the LLM's reasoning. This helps you pinpoint the cause of your LLM application's poor responses, such as irrelevant retrievals or incorrect parameterization of your LLM.

<img src="https://storage.googleapis.com/arize-assets/phoenix/assets/docs/notebooks/evals/traces_with_evaluation_annotations.png"></img>

If you're interested in extending your evaluations to include relevance, explore our detailed Colab guide.
Now that you're set up, read through the Concepts Section to get an understanding of the different components.
If you want to learn how to accomplish a particular task, check out the How-To Guides.