# Evaluate Binary Relevance Classification

The purpose of this notebook is:
- to evaluate the performance of Arize's approach to relevance classification against information retrieval datasets with ground-truth relevance labels,
- to provide an experimental framework for users to iterate and improve on Arize's default classification template.


In [None]:
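from phoenix.experimental.chat_models import ChatOpenAI
from phoenix.experimental.evals import (
    # Assumption: the default template is exported from the same module as the
    # helpers below; adjust the import path if your Phoenix version differs.
    BINARY_RELEVANCE_CLASSIFICATION_TEMPLATE,
    download_benchmark_dataset,
    llm_classify_binary,
)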
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    f1_score,
    precision_score,
    recall_score,
)

## Download Benchmark Dataset

Supported datasets include:

- wiki_qa-train
- ms_marco-v1.1-train

In [None]:
dataset_name = "wiki_qa-train"  # or "ms_marco-v1.1-train"
df = download_benchmark_dataset(task="binary-relevance-classification", dataset=dataset_name)
df.head()

## Display Binary Relevance Classification Template

View the default template used to classify relevance. You can tweak this template and evaluate its performance relative to the default.

In [None]:
BINARY_RELEVANCE_CLASSIFICATION_TEMPLATE

The template variables are:

- query_text
- document_text
- relevant
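
To make the role of these variables concrete, here is a minimal sketch of how a template with `query_text` and `document_text` placeholders could be filled for one row. The template wording, the `fill_template` helper, and the example row are all hypothetical; the actual default template may be worded differently, and this sketch assumes the `relevant` column is held out as the ground-truth label rather than substituted into the prompt.

```python
# Hypothetical template text, for illustration only; the real default
# template shipped with the library may be worded differently.
EXAMPLE_TEMPLATE = (
    "Question: {query_text}\n"
    "Reference text: {document_text}\n"
    'Answer with a single word, "relevant" or "irrelevant".'
)


def fill_template(template: str, row: dict) -> str:
    """Substitute the row's query and document into the template placeholders."""
    return template.format(
        query_text=row["query_text"],
        document_text=row["document_text"],
    )


# Made-up row using the column names listed above.
example_row = {
    "query_text": "How tall is Mount Everest?",
    "document_text": "Mount Everest is Earth's highest mountain above sea level.",
    "relevant": True,  # assumed to be the held-out ground-truth label
}

prompt = fill_template(EXAMPLE_TEMPLATE, example_row)
print(prompt)
```

When you edit the template, keep the placeholder names aligned with the dataframe columns so that every row can be substituted.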

## Configure an LLM

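Configure the LLM that performs the classification. The example below uses GPT-4 with the temperature set to 0 to keep the outputs as repeatable as possible.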

In [None]:
llm = ChatOpenAI(model_name="gpt-4", temperature=0.0)

## Run Relevance Classifications

In [None]:
predictions = llm_classify_binary(
    dataframe=df,
    classification_template=BINARY_RELEVANCE_CLASSIFICATION_TEMPLATE,
    llm=llm,
    rails=["relevant", "irrelevant"],
)
# returns pd.Series([ True, False, ...])
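
The `rails` argument lists the only labels the classifier is expected to produce. As a rough illustration of the idea (not the library's actual implementation), the sketch below maps a raw LLM response onto a boolean and returns `None` when the response matches neither rail. The `rail_to_bool` helper is hypothetical.

```python
from typing import Optional


def rail_to_bool(raw_output: str) -> Optional[bool]:
    """Map a raw LLM response to True/False, or None if it matches no rail."""
    # Normalize whitespace, surrounding quotes or periods, and case.
    text = raw_output.strip().strip(".\"'").lower()
    if text == "relevant":
        return True
    if text == "irrelevant":
        return False
    # The response is outside the rails; the caller decides how to handle it.
    return None


raw_outputs = [" Relevant. ", "irrelevant", "I am not sure"]
parsed = [rail_to_bool(output) for output in raw_outputs]
print(parsed)
```

Rows that come back as `None` would need to be dropped or assigned a default before computing metrics, since the scikit-learn scorers expect a label for every row.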

## Evaluate Predictions

In [None]:
y_true = df["relevant"]
y_pred = predictions

# Calculate F1 score
f1 = f1_score(y_true, y_pred)
print("F1 Score:", f1)

# Calculate Precision
precision = precision_score(y_true, y_pred)
print("Precision:", precision)

# Calculate Recall
recall = recall_score(y_true, y_pred)
print("Recall:", recall)

# Calculate Accuracy
accuracy = accuracy_score(y_true, y_pred)
print("Accuracy:", accuracy)

# Calculate and print the Confusion Matrix
conf_matrix = confusion_matrix(y_true, y_pred)
print("Confusion Matrix:")
print(conf_matrix)
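
To connect the confusion matrix to the scores printed above, the following sketch recomputes precision, recall, F1, and accuracy from the four cells of a binary confusion matrix. It assumes scikit-learn's layout for boolean labels, `[[TN, FP], [FN, TP]]`, with rows for true labels and columns for predictions. The counts are made up for illustration and are not results from this benchmark.

```python
def metrics_from_confusion_matrix(matrix):
    """Derive binary classification metrics from [[TN, FP], [FN, TP]]."""
    (tn, fp), (fn, tp) = matrix
    total = tn + fp + fn + tp
    # Guard each ratio against an empty denominator.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    accuracy = (tp + tn) / total if total else 0.0
    return {
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "accuracy": accuracy,
    }


# Made-up counts: 50 true negatives, 10 false positives,
# 5 false negatives, 35 true positives.
example_matrix = [[50, 10], [5, 35]]
example_metrics = metrics_from_confusion_matrix(example_matrix)
for name, value in example_metrics.items():
    print(f"{name}: {value:.4f}")
```

Comparing these hand-computed values with the scikit-learn outputs is a quick sanity check that the positive class is the one you expect.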