<center>
    <p style="text-align:center">
        <img alt="phoenix logo" src="https://storage.googleapis.com/arize-assets/phoenix/assets/phoenix-logo-light.svg" width="200"/>
        <br>
        <a href="https://docs.arize.com/phoenix/">Docs</a>
        |
        <a href="https://github.com/Arize-ai/phoenix">GitHub</a>
        |
        <a href="https://join.slack.com/t/arize-ai/shared_invite/zt-1px8dcmlf-fmThhDFD_V_48oU7ALan4Q">Community</a>
    </p>
</center>
<h1 align="center">Evaluating Evals: Relevance Classification</h1>

As part of 

This notebook has two purposes:
- to evaluate the performance of Arize's approach to relevance classification on information retrieval datasets with ground-truth relevance labels, and
- to provide an experimental framework for users to iterate and improve on Arize's default classification template.


In [None]:
# FIXME: Remove this cell after publishing Phoenix
!npm install -g -s n
!n latest
!npm install -g -s npm@latest
!pip install -qqq git+https://github.com/Arize-ai/phoenix.git@tracing-demo-launch
!pip install "arize-phoenix[experimental]"

In [None]:
!pip install -qq scikit-learn matplotlib ipython

In [None]:
import matplotlib.pyplot as plt
import pandas as pd
from IPython.display import display
from phoenix.experimental.evals import (
    RAG_RELEVANCY_PROMPT_TEMPLATE_STR,
    OpenAiModel,
    download_benchmark_dataset,
    llm_eval_binary,
)
from sklearn.metrics import (
    ConfusionMatrixDisplay,
    accuracy_score,
    confusion_matrix,
    f1_score,
    precision_recall_fscore_support,
    precision_score,
    recall_score,
)


## Download Benchmark Dataset

We'll evaluate a combination of LLM, model configuration, and evaluation prompt template against benchmark datasets of queries and retrieved documents with ground-truth relevance labels. Currently supported datasets include:

- "wiki_qa-train"
- "ms_marco-v1.1-train"

In [None]:
dataset_name = "wiki_qa-train"
# Pass the variable rather than repeating the literal, so the dataset can be switched in one place
df = download_benchmark_dataset(
    task="binary-relevance-classification", dataset_name=dataset_name
)
df.head()


## Display Binary Relevance Classification Template

View the default template used to classify relevance. You can tweak this template and evaluate its performance relative to the default.

In [None]:
print(RAG_RELEVANCY_PROMPT_TEMPLATE_STR)


The template variables are:

- query_text
- document_text
- relevant

## Configure an LLM

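Configure the LLM that will classify relevance. The example below uses GPT-4 with a temperature of 0.0 to reduce run-to-run variation in the classifications.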

In [None]:
model = OpenAiModel(
    model_name="gpt-4",
    temperature=0.0,
)

## Run Relevance Classifications

Run relevance classifications against a subset of the data.

In [None]:
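# Sample 100 rows to keep the number of LLM calls small
df = df.sample(n=100).reset_index(drop=True)
# Rename the dataset columns for the classification step
df = df.rename(
    columns={
        "query_text": "query",
        "document_text": "reference",
    },
)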
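
The rename above suggests that prompts are filled from columns named `query` and `reference`. As a minimal sketch of that kind of substitution, the snippet below fills a made-up template with plain `str.format`. `EXAMPLE_TEMPLATE` and `fill_template` are hypothetical names introduced for illustration; they are not the text of `RAG_RELEVANCY_PROMPT_TEMPLATE_STR` and not part of the Phoenix API.

```python
# Hypothetical template for illustration only; NOT the Phoenix default template.
EXAMPLE_TEMPLATE = (
    "Question: {query}\n"
    "Reference text: {reference}\n"
    'Answer with a single word, "relevant" or "irrelevant".'
)


def fill_template(template: str, row: dict) -> str:
    """Substitute a row's values into the template's named placeholders."""
    return template.format(**row)


# A made-up row using the renamed column names
row = {
    "query": "Who wrote Hamlet?",
    "reference": "Hamlet is a tragedy written by William Shakespeare.",
}
prompt = fill_template(EXAMPLE_TEMPLATE, row)
print(prompt)
```

If you write a custom template, a quick check like this can confirm that every placeholder has a matching column before any LLM calls are made.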

In [None]:
res = llm_eval_binary(
    df=df,
    template=RAG_RELEVANCY_PROMPT_TEMPLATE_STR,
    model=model,
    rails=["relevant", "irrelevant"],
)
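
The `rails` argument lists the labels the classifier is allowed to return. How Phoenix enforces this internally is not shown in this notebook, so the following is only a sketch of one plausible approach: map the raw LLM text onto exactly one allowed label, or return `None` when the text is unmatched or ambiguous. The helper `snap_to_rails` is hypothetical.

```python
import re
from typing import List, Optional


def snap_to_rails(raw_output: str, rails: List[str]) -> Optional[str]:
    """Return the one rail found in the raw output, or None if zero or several match.

    Hypothetical helper; an assumption about how rails could work, not Phoenix's code.
    """
    text = raw_output.strip().lower()
    # Word boundaries keep "relevant" from matching inside "irrelevant".
    matches = [
        rail
        for rail in rails
        if re.search(rf"\b{re.escape(rail.lower())}\b", text)
    ]
    return matches[0] if len(matches) == 1 else None


rails = ["relevant", "irrelevant"]
print(snap_to_rails("Relevant", rails))
print(snap_to_rails(" irrelevant.\n", rails))
print(snap_to_rails("I am not sure", rails))
```

Whatever the actual mechanism, it is worth inspecting the returned values for anything outside the two rails before computing metrics.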


## Evaluate Predictions

In [None]:
df["eval_relevance"] = res


In [None]:
df["relevant"].value_counts()


In [None]:
df["eval_relevance"].value_counts()
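
The value counts above show how balanced the ground-truth and predicted labels are. Accuracy is easier to interpret next to the score of a classifier that always predicts the most common class. The sketch below computes that baseline on made-up labels; `majority_baseline_accuracy` is a hypothetical helper and the toy data is not drawn from the benchmark.

```python
from collections import Counter
from typing import Hashable, Sequence


def majority_baseline_accuracy(labels: Sequence[Hashable]) -> float:
    """Accuracy obtained by always predicting the most frequent label."""
    counts = Counter(labels)
    return max(counts.values()) / len(labels)


# Made-up ground-truth labels: 2 relevant (True), 8 irrelevant (False)
toy_labels = [True, False, False, False, True, False, False, False, False, False]
baseline = majority_baseline_accuracy(toy_labels)
print("Majority-class baseline accuracy:", baseline)
```

With the notebook's data, the same function could be applied to `df["relevant"].tolist()`.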


In [None]:
y_true = df["relevant"].map({True: "relevant", False: "irrelevant"})
y_pred = df["eval_relevance"]


In [None]:
# Calculate F1 score
f1 = f1_score(y_true, y_pred, pos_label="relevant")
print("F1 Score:", f1)

# Calculate Precision
precision = precision_score(y_true, y_pred, pos_label="relevant")
print("Precision:", precision)

# Calculate Recall
recall = recall_score(y_true, y_pred, pos_label="relevant")
print("Recall:", recall)

# Calculate Accuracy
accuracy = accuracy_score(y_true, y_pred)
print("Accuracy:", accuracy)

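# Calculate and print the Confusion Matrix
# Rows are true labels and columns are predicted labels, both in the order given by `labels`
conf_matrix = confusion_matrix(y_true, y_pred, labels=["relevant", "irrelevant"])
print("Confusion Matrix:")
print(conf_matrix)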
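
To make the relationship between these scores and the confusion matrix concrete, here is a small worked example in plain Python that derives precision, recall, F1, and accuracy from true-positive, false-positive, and false-negative counts. The labels are made up, `binary_metrics` is a hypothetical helper, and returning 0.0 when a denominator is zero is an assumption of this sketch.

```python
from typing import Dict, Sequence


def binary_metrics(
    y_true: Sequence[str], y_pred: Sequence[str], pos_label: str
) -> Dict[str, float]:
    """Compute precision, recall, F1, and accuracy for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == pos_label and p == pos_label for t, p in pairs)
    fp = sum(t != pos_label and p == pos_label for t, p in pairs)
    fn = sum(t == pos_label and p != pos_label for t, p in pairs)
    # Assumption: report 0.0 when a denominator is zero.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    accuracy = sum(t == p for t, p in pairs) / len(pairs)
    return {"precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}


# Made-up labels: 2 true positives, 1 false negative, 1 false positive, 4 true negatives
toy_true = ["relevant"] * 3 + ["irrelevant"] * 5
toy_pred = ["relevant", "relevant", "irrelevant", "relevant"] + ["irrelevant"] * 4
toy_metrics = binary_metrics(toy_true, toy_pred, pos_label="relevant")
print(toy_metrics)
```

Here precision and recall are both 2/3, so F1 is also 2/3, while accuracy is 6/8 = 0.75.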


In [None]:
labels = ["relevant", "irrelevant"]  # Class labels, in display order

# Compute overall accuracy
accuracy = accuracy_score(y_true, y_pred)
# Compute per-class precision, recall, F1 score, and support, ordered as in `labels`
precision, recall, f1, support = precision_recall_fscore_support(
    y_true, y_pred, labels=labels
)

# Display the results in a table; index i of each metric corresponds to labels[i]
metrics_df = pd.DataFrame(
    {
        "Class": ["Overall", labels[0], labels[1]],
        "Accuracy": [accuracy, "N/A", "N/A"],
        "Precision": ["N/A", precision[0], precision[1]],
        "Recall": ["N/A", recall[0], recall[1]],
        "F1": ["N/A", f1[0], f1[1]],
        "Support": ["N/A", int(support[0]), int(support[1])],
    }
)

# Display the DataFrame
display(metrics_df)

# Compute confusion matrix
cm = confusion_matrix(y_true, y_pred, labels=labels)
# Plot the confusion matrix
disp = ConfusionMatrixDisplay(cm, display_labels=labels)
disp.plot()

plt.title("Confusion Matrix")
plt.xlabel("Predicted")
plt.ylabel("Actual")
plt.show()
