In [None]:
!pip install arize-phoenix openai tiktoken

In [None]:
import nest_asyncio
import numpy as np
import pandas as pd
import phoenix as px
from phoenix.evals import OpenAIModel, llm_classify
from phoenix.evals.default_templates import (
    HALLUCINATION_PROMPT_RAILS_MAP,
    HALLUCINATION_PROMPT_TEMPLATE,
    QA_PROMPT_RAILS_MAP,
    QA_PROMPT_TEMPLATE,
    RAG_RELEVANCY_PROMPT_RAILS_MAP,
    RAG_RELEVANCY_PROMPT_TEMPLATE,
)
from phoenix.session.evaluation import (
    get_qa_with_reference,
    get_retrieved_documents,
    log_evaluations,
)
from phoenix.trace.exporter import HttpExporter
from phoenix.trace.span_evaluations import DocumentEvaluations, SpanEvaluations
from sklearn.metrics import ndcg_score

In [None]:
nest_asyncio.apply()

# Key Takeaways

1. Evaluations can be performed on individual documents, not just spans. In general, as long as the subject of an evaluation is an identifiable entity, we can update Phoenix to ingest it. So having the appropriate identifiers is important for the evaluation results.
2. Each evaluation result consists of a combination of values under the reserved keys: `label` (str), `score` (float), and `explanation` (str).
3. Each set of evaluations should have a name, which is an arbitrary string. Evaluations with the same name will be grouped together when computing aggregate metrics. The name is also used to identify different evaluations for the same subject when they are displayed in the UI.
4. We need `score` to compute averages, but may need a positive class value to convert `label` to `score`, e.g. setting `score` = 1 if `label` == "relevant". This is done manually here, but may be automated in the future.
5. Document retrieval metrics such as NDCG can be computed by Phoenix automatically, but the UI for them is not ready yet, so we compute them manually here. This also shows how arbitrary evaluation results can be ingested into Phoenix by treating them as annotations.
6. Evaluation results are ingested via the `HttpExporter`. A helper function is provided for doing that.
7. Inputs to the evaluation templates are extracted from the active Phoenix session via helper functions. These helper functions also put in place the appropriate identifiers for ingesting the evaluation results back to Phoenix.
8. A more complicated evaluation shown here is the `Q&A Correctness`, which requires merging details from separate spans, i.e. the retrieved documents from the retriever span, and LLM response from a different span. This is done manually here, but we may automate it in the future.
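
As a rough illustration of takeaways 2 to 4, the sketch below builds a few evaluation results with the reserved keys, derives `score` from `label` using an assumed positive class, and averages the scores per evaluation name. The data, the `POSITIVE_LABEL` mapping, and the `eval_name` and `span_id` column names are made up for this example; this is plain pandas, not Phoenix's internal representation.

```python
import pandas as pd

# Hypothetical evaluation results. Only `label`, `score`, and `explanation`
# are reserved keys; `eval_name` and `span_id` are illustrative column names.
results = pd.DataFrame(
    [
        {"eval_name": "Relevance", "span_id": "a", "label": "relevant", "explanation": "on topic"},
        {"eval_name": "Relevance", "span_id": "b", "label": "irrelevant", "explanation": "off topic"},
        {"eval_name": "Hallucination", "span_id": "a", "label": "factual", "explanation": "supported"},
        {"eval_name": "Hallucination", "span_id": "b", "label": "factual", "explanation": "supported"},
    ]
)

# Assumed positive class per evaluation name: that label maps to a score of 1.
POSITIVE_LABEL = {"Relevance": "relevant", "Hallucination": "factual"}

results["score"] = (results["label"] == results["eval_name"].map(POSITIVE_LABEL)).astype(float)

# Results that share a name are aggregated together.
mean_score_by_name = results.groupby("eval_name")["score"].mean()
print(mean_score_by_name)
```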

# Launch Phoenix

We start Phoenix with a small dataset of traces to represent the scenario where traces have already been exported into Phoenix (e.g. from tracing the user's applications), and after seeing the spans in Phoenix, the user wants to perform evaluations on them. These example spans come from a RAG application, so some spans contain a list of retrieved documents. Our user is interested in how relevant each document is to the question that was asked, how to compute aggregate metrics on these retrievals, and how the document-level evaluations affect the overall quality of the Q&A spans.

One key concept demonstrated in this notebook is that the subject of an evaluation is not limited to spans. Subcomponents of a span, such as individual retrieved documents, can be evaluated as well. As long as appropriate identifiers are present for joining the evaluations back to their subjects, the evaluations can be ingested, and Phoenix can perform further analysis on them such as aggregation and metrics. This allows the user to decompose the evaluation of a span into its subcomponents, evaluate them separately, and then aggregate the results back to the span level.
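
A minimal sketch of that decomposition, using made-up span IDs and scores: document-level results are keyed by span ID and document position, then rolled up to one value per span. The index level names mirror the ones that appear later in this notebook, but the data is purely illustrative.

```python
import pandas as pd

# Hypothetical document-level evaluations, keyed by (span ID, document position).
index = pd.MultiIndex.from_tuples(
    [("span-1", 0), ("span-1", 1), ("span-2", 0), ("span-2", 1)],
    names=["context.span_id", "document_position"],
)
document_evals = pd.DataFrame({"score": [1, 0, 1, 1]}, index=index)

# Aggregate back to the span level: fraction of relevant documents per span.
span_level = document_evals.groupby(level="context.span_id")["score"].mean()
print(span_level)
```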

In [None]:
ds = px.load_example_traces("llama_index_rag")
px.launch_app(trace=ds)

Once Phoenix is running in the background, we can apply helper functions to extract relevant details needed to perform various kinds of evaluations. The details extracted are dependent on the input requirements of each type of evaluation, and the helper functions will put in place the appropriate identifiers for joins later on when the eval results are ingested back to Phoenix.

### Ingestion will take place via HTTP

We may streamline this process in the future, but for the time being we'll just manually re-use the existing exporter that we have for spans.

In [None]:
exporter = HttpExporter()

# Extract Retrieved Documents

First, we evaluate the relevance of the documents that were retrieved by the RAG model. The helper function `get_retrieved_documents` extracts the relevant details from the spans and their retrieved documents, and puts them in a dataframe that can be used for evaluation. Note that the helper takes the current session object as input. If the session was started in a different notebook, the user won't have access to it here; we'll need to provide a solution in a future update if that becomes a real use case.

In [None]:
retrieved_documents = get_retrieved_documents(px.active_session())
retrieved_documents

The resulting dataframe is indexed by span ID and document position. The index is important because `llm_classify` returns a result dataframe that preserves the index of its input, and during ingestion the index values of each row serve as the identifier of each evaluation result. `input` is the question that was asked, and `reference` is the retrieved document; the columns are named this way because those are the template variable names. The `document_score` column is the score that the RAG model assigned to the document, which is useful for sorting the documents when computing a metric such as NDCG. We also include `trace_id` because a later eval requires joining the documents to a different span containing the final answer from the LLM.
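
The role of the index can be seen in a small stand-in example. `fake_classify` below is a hypothetical placeholder for `llm_classify` (it calls no model and is not part of Phoenix); the only property it imitates is that the output keeps the input's index, so each result row can be traced back to its span ID and document position. All data is made up.

```python
import pandas as pd

index = pd.MultiIndex.from_tuples(
    [("span-1", 0), ("span-1", 1), ("span-2", 0)],
    names=["context.span_id", "document_position"],
)
documents = pd.DataFrame(
    {
        "input": ["What is Phoenix?"] * 2 + ["How do I log evals?"],
        "reference": ["Doc about Phoenix", "Unrelated text", "Doc about logging evals"],
        "document_score": [0.9, 0.4, 0.8],
    },
    index=index,
)


def fake_classify(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical stand-in for an LLM classifier: labels by retriever score."""
    labels = ["relevant" if s >= 0.5 else "irrelevant" for s in df["document_score"]]
    return pd.DataFrame({"label": labels}, index=df.index)  # index is preserved


evals = fake_classify(documents)

# Each index value identifies the subject of one evaluation result.
identifiers = list(evals.index)
print(identifiers)
```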

# Set Up OpenAI

In [None]:
import os
from getpass import getpass

import openai

if not (openai_api_key := os.getenv("OPENAI_API_KEY")):
    openai_api_key = getpass("🔑 Enter your OpenAI API key: ")
openai.api_key = openai_api_key
os.environ["OPENAI_API_KEY"] = openai_api_key

In [None]:
model = OpenAIModel(model_name="gpt-3.5-turbo")
model("hi")

# Evaluate Document Relevance

We'll use the `llm_classify` function as usual, but add a new column named `score` containing binary integers, i.e. 1 or 0. The `score` column lets us turn the evaluation results into numbers, so we can do aggregations such as averaging. This is not a simple task in the general case, because it requires knowing which label is the positive one, i.e. the one that receives the value 1 instead of 0. For a general eval, the user would have to tell us how to compute the score, but in this case we know that the label `relevant` should have a score of 1.

`label`, `score`, and `explanation` are the principal keywords for the values of an evaluation result. Every evaluation result to be ingested into Phoenix must have at least one of these values filled out. On the other hand, it's possible to have missing values, for example when `llm_classify` is terminated by the user before completion. In that case this notebook will not crash, and results whose values are all missing won't be ingested.
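
A sketch of the label-to-score conversion with missing values, on made-up labels. It follows the same masking pattern as the next cell: rows with no label are excluded before the comparison, so after index alignment they come back as NaN instead of being scored 0. The final `dropna(how="all")` only illustrates filtering out results in which every value is missing; how Phoenix filters them internally is not shown in this notebook.

```python
import pandas as pd

# Hypothetical output of an interrupted run: the last row was never evaluated.
evals = pd.DataFrame(
    {
        "label": ["relevant", "irrelevant", None],
        "explanation": ["matches the question", "off topic", None],
    }
)

has_label = ~evals["label"].isna()
# Compare only labeled rows; assignment aligns on the index, leaving NaN elsewhere.
evals["score"] = (evals["label"][has_label] == "relevant").astype(int)

# Rows in which label, score, and explanation are all missing carry no result.
ingestible = evals.dropna(how="all", subset=["label", "score", "explanation"])
print(evals)
print(len(ingestible))
```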

In [None]:
retrieved_documents_eval = llm_classify(
    retrieved_documents,
    model,
    RAG_RELEVANCY_PROMPT_TEMPLATE,
    list(RAG_RELEVANCY_PROMPT_RAILS_MAP.values()),
    provide_explanation=True,
)
retrieved_documents_eval["score"] = (
    retrieved_documents_eval.label[~retrieved_documents_eval.label.isna()] == "relevant"
).astype(int)
retrieved_documents_eval.to_parquet(
    "llama_index_rag_with_rerank.documents_eval.parquet"
)  # not required

We're saving the results in parquet solely as a convenience for the users of this notebook. This is not a required step in general.

In [None]:
retrieved_documents_eval = pd.read_parquet(
    "llama_index_rag_with_rerank.documents_eval.parquet"
)  # not required
retrieved_documents_eval

# Merge Evaluation Results Back to the Original Data to Compute Retrieval Metrics

Retrieval metrics such as NDCG and Precision@k can be computed by Phoenix automatically, but the UI is not ready yet, so for demonstration purposes, we'll carry out the calculations by hand, and ingest the result as separate evaluations. This also doubles as a demonstration of how to ingest evaluations that are not generated by `llm_classify`, by considering evaluation results as general annotations.

First we'll merge the retrieved documents with the evaluation results. This is needed to compute NDCG based on the original document scores assigned by the retriever. The `eval_` prefix is added to the columns from the evaluation results just for clarity, but is not required.
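
The merge relies on index alignment rather than row order. A small sketch with made-up data, in which the evaluation dataframe is in a different order and is missing one row, shows that `pd.concat(..., axis=1)` still lines the rows up by index:

```python
import pandas as pd

names = ["context.span_id", "document_position"]
documents = pd.DataFrame(
    {"document_score": [0.9, 0.4, 0.8]},
    index=pd.MultiIndex.from_tuples([("s1", 0), ("s1", 1), ("s2", 0)], names=names),
)
# Evaluation results arrive in a different order, and ("s2", 0) is missing.
evals = pd.DataFrame(
    {"label": ["irrelevant", "relevant"], "score": [0, 1]},
    index=pd.MultiIndex.from_tuples([("s1", 1), ("s1", 0)], names=names),
)

# Rows are matched on the (span ID, document position) index, not on position.
combined = pd.concat([documents, evals.add_prefix("eval_")], axis=1)
print(combined)
```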

In [None]:
combined = pd.concat([retrieved_documents, retrieved_documents_eval.add_prefix("eval_")], axis=1)
combined

# Compute NDCG@2

This one has a small wrinkle for handling missing values in case `llm_classify` was terminated before completion. Otherwise, it's a pretty straightforward application of `sklearn.metrics.ndcg_score`.

In [None]:
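def _compute_ndcg(df: pd.DataFrame, k: int) -> float:
    """Compute NDCG@k in the presence of missing values"""
    # Pad with zeros to at least two documents, since NDCG is not
    # meaningful for (and may be rejected for) a single document.
    n = max(2, len(df))
    eval_scores = np.zeros(n)
    doc_scores = np.zeros(n)
    eval_scores[: len(df)] = df.eval_score
    doc_scores[: len(df)] = df.document_score
    try:
        return ndcg_score([eval_scores], [doc_scores], k=k)
    except ValueError:
        # E.g. eval scores contain NaN because `llm_classify` was interrupted.
        return np.nan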


ndcg_at_2 = pd.DataFrame({"score": combined.groupby("context.span_id").apply(_compute_ndcg, k=2)})
ndcg_at_2.to_parquet("llama_index_rag_with_rerank.ndcg_at_2.parquet")  # not required

We're saving the results in parquet solely as a convenience for the users of this notebook. This is not a required step in general.
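
For readers who want to see what the metric measures, here is a hand-rolled NumPy sketch of NDCG@k with linear gains and a log2 discount. It is an independent illustration, not the `sklearn.metrics.ndcg_score` implementation used above (which, for example, handles tied scores differently), and the inputs are made up.

```python
import numpy as np


def dcg_at_k(relevance_in_rank_order: np.ndarray, k: int) -> float:
    """Discounted cumulative gain over the first k ranked items."""
    top = np.asarray(relevance_in_rank_order, dtype=float)[:k]
    discounts = np.log2(np.arange(2, top.size + 2))  # log2(rank + 1), rank from 1
    return float(np.sum(top / discounts))


def ndcg_at_k(relevance: np.ndarray, retriever_scores: np.ndarray, k: int) -> float:
    """NDCG@k: DCG of the retriever's ranking divided by the ideal DCG."""
    relevance = np.asarray(relevance, dtype=float)
    if np.isnan(relevance).any():
        return float("nan")  # incomplete evaluations: no meaningful score
    # Rank documents by descending retriever score.
    order = np.argsort(-np.asarray(retriever_scores, dtype=float), kind="stable")
    ideal = dcg_at_k(np.sort(relevance)[::-1], k)
    return dcg_at_k(relevance[order], k) / ideal if ideal > 0 else 0.0


# The retriever ranked an irrelevant document first and a relevant one second.
print(ndcg_at_k(np.array([0, 1, 1]), np.array([0.9, 0.8, 0.1]), k=2))
```

A ranking that places all relevant documents first scores 1.0, and the score drops as relevant documents are pushed below irrelevant ones.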

In [None]:
ndcg_at_2 = pd.read_parquet("llama_index_rag_with_rerank.ndcg_at_2.parquet")  # not required
ndcg_at_2

# Compute Precision@3

Precision@k is the fraction of the top `k` retrieved documents that were evaluated as relevant. We use `k` = 3 here, rather than 2, simply to show a different cutoff.

In [None]:
precision_at_3 = pd.DataFrame(
    {
        "score": combined.groupby("context.span_id").apply(
            lambda x: x.eval_score.iloc[:3].sum(skipna=False) / 3
        )
    }
)
precision_at_3.to_parquet("llama_index_rag_with_rerank.precision_at_3.parquet")  # not required

We're saving the results in parquet solely as a convenience for the users of this notebook. This is not a required step in general.
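
As a small worked example with made-up scores, precision@k is the number of relevant documents among the first k positions divided by k. The sketch mirrors the lambda above, including `skipna=False`, which makes the result NaN when any of the first k evaluations is missing rather than silently treating it as irrelevant. The helper name `precision_at_k` is introduced here only for illustration.

```python
import numpy as np
import pandas as pd


def precision_at_k(eval_scores: pd.Series, k: int) -> float:
    """Fraction of the first k documents (in retrieval order) scored as relevant."""
    return float(eval_scores.iloc[:k].sum(skipna=False) / k)


complete = pd.Series([1, 0, 1, 1])  # four documents, all evaluated
partial = pd.Series([1, np.nan, 1, 0])  # second evaluation is missing

print(precision_at_k(complete, k=3))
print(precision_at_k(partial, k=3))
```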

In [None]:
precision_at_3 = pd.read_parquet(
    "llama_index_rag_with_rerank.precision_at_3.parquet"
)  # not required
precision_at_3

# Merge Documents from Retrieval Spans to Q&A Spans (to Compute Q&A Correctness)

Trace ID is what we use to merge together details from separate spans, i.e. the retrieved documents from the retriever span, and LLM response from a different span. This is done manually here, but we may automate it in the future.
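
To make the join concrete, here is a pandas sketch with made-up spans. The column names and the way documents are concatenated are chosen for the example and are not claimed to match what `get_qa_with_reference` does internally: the retrieved documents are collapsed per trace and attached to the LLM span from the same trace.

```python
import pandas as pd

# Hypothetical retriever output: one row per retrieved document.
documents = pd.DataFrame(
    {
        "trace_id": ["t1", "t1", "t2"],
        "reference": ["doc A", "doc B", "doc C"],
    }
)
# Hypothetical LLM spans: one answer per trace.
answers = pd.DataFrame(
    {
        "span_id": ["llm-1", "llm-2"],
        "trace_id": ["t1", "t2"],
        "input": ["question 1", "question 2"],
        "output": ["answer 1", "answer 2"],
    }
).set_index("span_id")

# Collapse each trace's documents into a single reference string.
references = documents.groupby("trace_id")["reference"].apply("\n\n".join)

# Attach the references to the answer span that shares the trace ID.
qa = answers.join(references, on="trace_id")
print(qa)
```

The result stays indexed by the LLM span's ID, so an evaluation computed on it can be attached to that span.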

In [None]:
qa_df = get_qa_with_reference(px.active_session())
qa_df

# Evaluate Q&A Correctness

In [None]:
qa_correctness_eval = llm_classify(
    qa_df,
    model,
    QA_PROMPT_TEMPLATE,
    list(QA_PROMPT_RAILS_MAP.values()),
    provide_explanation=True,
)
qa_correctness_eval["score"] = (
    qa_correctness_eval.label[~qa_correctness_eval.label.isna()] == "correct"
).astype(int)
qa_correctness_eval.to_parquet(
    "llama_index_rag_with_rerank.qa_correctness_eval.parquet"
)  # not required

We're saving the results in parquet solely as a convenience for the users of this notebook. This is not a required step in general.

In [None]:
qa_correctness_eval = pd.read_parquet(
    "llama_index_rag_with_rerank.qa_correctness_eval.parquet"
)  # not required
qa_correctness_eval

# Evaluate Hallucination

Note that we set `factual` to be the positive class: the score is 1 if the label is `factual`, so a higher average score indicates a more positive outcome.

In [None]:
hallucination_eval = llm_classify(
    qa_df,
    model,
    HALLUCINATION_PROMPT_TEMPLATE,
    list(HALLUCINATION_PROMPT_RAILS_MAP.values()),
    provide_explanation=True,
)
hallucination_eval["score"] = (
    hallucination_eval.label[~hallucination_eval.label.isna()] == "factual"
).astype(int)
hallucination_eval.to_parquet(
    "llama_index_rag_with_rerank.hallucination_eval.parquet"
)  # not required

We're saving the results in parquet solely as a convenience for the users of this notebook. This is not a required step in general.

In [None]:
hallucination_eval = pd.read_parquet(
    "llama_index_rag_with_rerank.hallucination_eval.parquet"
)  # not required
hallucination_eval

In [None]:
evaluations = [
    DocumentEvaluations(eval_name="Relevance", dataframe=retrieved_documents_eval),
    SpanEvaluations(eval_name="Hallucination", dataframe=hallucination_eval),
    SpanEvaluations(eval_name="Q&A Correctness", dataframe=qa_correctness_eval),
    SpanEvaluations(eval_name="ndcg@2", dataframe=ndcg_at_2),
    SpanEvaluations(eval_name="precision@3", dataframe=precision_at_3),
]

In [None]:
log_evaluations(*evaluations)

# Colocate evaluation results with the traces

In [None]:
from phoenix.trace import TraceDataset

trace_dataframe = px.Client().get_spans_dataframe()
trace_ds = TraceDataset(
    trace_dataframe,
    evaluations=evaluations,
)

In [None]:
trace_ds.get_evals_dataframe()

In [None]:
trace_ds.get_spans_dataframe()

# End Session

This is commented out but shows how to terminate Phoenix running in the background.

In [None]:
# px.active_session().end()