# Evaluate Binary Relevance Classification

The purpose of this notebook is:
- to evaluate the performance of Arize's approach to relevance classification against information retrieval datasets with ground-truth relevance labels,
- to provide an experimental framework for users to iterate and improve on Arize's default classification template.


In [None]:
from phoenix.experimental.evals import (
    RAG_RELEVANCY_PROMPT_TEMPLATE_STR,
    OpenAiModel,
    download_benchmark_dataset,
    llm_eval_binary,
)
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    f1_score,
    precision_score,
    recall_score,
)

## Download Benchmark Dataset

Supported datasets include:

- wiki_qa-train
- ms_marco-v1.1-train

In [None]:
dataset_name = "wiki_qa-train"
df = download_benchmark_dataset(
    task="binary-relevance-classification", dataset_name=dataset_name
)
df.head()

## Display Binary Relevance Classification Template

View the default template used to classify relevance. You can tweak this template and evaluate its performance relative to the default.

In [None]:
print(RAG_RELEVANCY_PROMPT_TEMPLATE_STR)


The template variables are:

- query_text
- document_text
- relevant
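
When editing the template, it helps to confirm which variables it expects before running an evaluation. The following is a minimal sketch, assuming the template uses Python `str.format`-style `{name}` placeholders. `template_variables` and `CUSTOM_TEMPLATE` are hypothetical names introduced here for illustration and are not part of Phoenix.

```python
from string import Formatter
from typing import List


def template_variables(template: str) -> List[str]:
    """Return the distinct placeholder names in a str.format-style template."""
    names: List[str] = []
    for _, field_name, _, _ in Formatter().parse(template):
        # field_name is None for literal text and "" for a bare "{}".
        if field_name and field_name not in names:
            names.append(field_name)
    return names


# Hypothetical custom template for illustration only: the wording is invented,
# and the placeholder names are taken from the list above.
CUSTOM_TEMPLATE = (
    "Question: {query_text}\n"
    "Document: {document_text}\n"
    'Answer with a single word, "relevant" or "irrelevant".'
)

print(template_variables(CUSTOM_TEMPLATE))
```

Under the same assumption, running the helper on `RAG_RELEVANCY_PROMPT_TEMPLATE_STR` should list the variables the default template expects, which is a quick way to check a modified template against the dataframe's column names.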

## Configure an LLM

Configure the LLM used to classify each query-document pair as relevant or irrelevant.

In [None]:
model = OpenAiModel(
    model_name="gpt-4",
    temperature=0.9,
    presence_penalty=0.45,
    model_kwargs={
        "frequency_penalty": 0.87,
    },
    retry_min_seconds=10,
    retry_max_seconds=90,
    max_retries=10,
)
model
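
The `retry_min_seconds`, `retry_max_seconds`, and `max_retries` arguments suggest a bounded retry policy. How `OpenAiModel` applies them internally is not shown in this notebook, so the following is only a generic sketch of clamped exponential backoff. `backoff_delays` is a hypothetical helper, not a Phoenix function.

```python
from typing import List


def backoff_delays(min_seconds: float, max_seconds: float, max_retries: int) -> List[float]:
    """Return the wait before each retry: doubling, clamped to [min, max]."""
    delays: List[float] = []
    for attempt in range(max_retries):
        raw = 2.0 ** attempt  # 1, 2, 4, 8, ...
        delays.append(float(min(max_seconds, max(min_seconds, raw))))
    return delays


# Values taken from the configuration cell above; the backoff formula itself
# is an assumption made for illustration.
print(backoff_delays(min_seconds=10, max_seconds=90, max_retries=10))
```

With these settings the sketch waits at least 10 seconds and at most 90 seconds between attempts, which gives a rough upper bound on how long a single failing request could stall the evaluation.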

## Run Relevance Classifications

In [None]:
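# Sample 100 rows to keep the evaluation quick.
# No random seed is set, so the sample (and the metrics) vary between runs.
df = df.sample(n=100).reset_index(drop=True)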

In [None]:
df = df.rename(
    columns={
        "query_text": "query",
        "document_text": "reference",
    },
)

In [None]:
res = llm_eval_binary(df=df, template=RAG_RELEVANCY_PROMPT_TEMPLATE_STR, model=model)

In [None]:
res

In [None]:
!pip install tiktoken

In [None]:
import tiktoken


def num_tokens_from_string(string: str, encoding_name: str) -> int:
    """Return the number of tokens in `string` under the named tiktoken encoding."""
    encoding = tiktoken.get_encoding(encoding_name)
    return len(encoding.encode(string))


encoding_name = "cl100k_base"

# Token count of each reference document under the chosen encoding
df["reference"].map(lambda x: num_tokens_from_string(x, encoding_name=encoding_name)).to_list()
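
A list of 100 per-document token counts is hard to scan, so a short summary can be more useful when checking whether any reference is unusually long. The following is a small sketch using only the standard library. `summarize_token_counts` is a hypothetical helper and the example counts are made up.

```python
import statistics
from typing import Dict, List


def summarize_token_counts(counts: List[int]) -> Dict[str, float]:
    """Summarize a non-empty list of token counts."""
    return {
        "min": min(counts),
        "median": statistics.median(counts),
        "max": max(counts),
        "total": sum(counts),
    }


# Made-up counts for illustration; in the notebook, pass the list produced above.
example_counts = [42, 118, 77, 305, 64]
print(summarize_token_counts(example_counts))
```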

In [None]:
df.head()

## Evaluate Predictions

In [None]:
df["eval_relevance"] = res

In [None]:
df["relevant"].value_counts()

In [None]:
df["eval_relevance"].value_counts()

In [None]:
y_true = df["relevant"].map({True: "relevant", False: "irrelevant"})
y_pred = df["eval_relevance"]

In [None]:
# Calculate F1 score
f1 = f1_score(y_true, y_pred, pos_label="relevant")
print("F1 Score:", f1)

# Calculate Precision
precision = precision_score(y_true, y_pred, pos_label="relevant")
print("Precision:", precision)

# Calculate Recall
recall = recall_score(y_true, y_pred, pos_label="relevant")
print("Recall:", recall)

# Calculate Accuracy
accuracy = accuracy_score(y_true, y_pred)
print("Accuracy:", accuracy)

# Calculate and print the Confusion Matrix
# (rows are true labels, columns are predicted labels, both in sorted label order)
conf_matrix = confusion_matrix(y_true, y_pred)
print("Confusion Matrix:")
print(conf_matrix)
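
The four scores above all derive from the same counts of true positives, false positives, false negatives, and true negatives. The following sketch recomputes them by hand on a made-up set of labels, assuming every prediction is one of the two labels `"relevant"` or `"irrelevant"`. `binary_metrics` is a hypothetical helper written for illustration, not a replacement for the scikit-learn functions.

```python
from typing import Dict, Sequence


def binary_metrics(
    y_true: Sequence[str], y_pred: Sequence[str], pos_label: str = "relevant"
) -> Dict[str, float]:
    """Compute precision, recall, F1, and accuracy from raw label pairs."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == pos_label and p == pos_label for t, p in pairs)
    fp = sum(t != pos_label and p == pos_label for t, p in pairs)
    fn = sum(t == pos_label and p != pos_label for t, p in pairs)
    tn = len(pairs) - tp - fp - fn
    # Fall back to 0.0 when a denominator is zero.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    return {"precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}


# Made-up labels: 2 true positives, 2 false negatives, 1 false positive, 3 true negatives.
toy_true = ["relevant"] * 4 + ["irrelevant"] * 4
toy_pred = ["relevant", "relevant", "irrelevant", "irrelevant",
            "relevant", "irrelevant", "irrelevant", "irrelevant"]
print(binary_metrics(toy_true, toy_pred))
```

On this toy data precision is 2/3 and recall is 1/2, which shows how a classifier can be reasonably precise while still missing half of the relevant documents.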