# Evaluate Binary Relevance Classification

The purpose of this notebook is:
- to evaluate the performance of Arize's approach to relevance classification against information retrieval datasets with ground-truth relevance labels,
- to provide an experimental framework for users to iterate and improve on Arize's default classification template.


In [1]:
import os
from phoenix.experimental.evals import (
    download_benchmark_dataset,
    llm_eval_binary,
    OpenAiModel,
    RAG_RELEVANCY_PROMPT_TEMPLATE_STR,
    PromptTemplate,
)
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    f1_score,
    precision_score,
    recall_score,
)


## Download Benchmark Dataset

Supported datasets include:

- wiki_qa-train
- ms_marco-v1.1-train

In [2]:
dataset_name = "wiki_qa-train"
df = download_benchmark_dataset(task="binary-relevance-classification", dataset_name=dataset_name)
df.head()


| | query_id | query_text | document_title | document_text | document_text_with_emphasis | relevant |
|---|---|---|---|---|---|---|
| 0 | Q1 | how are glacier caves formed? | Glacier cave | A partly submerged glacier cave on Perito More... | A partly submerged glacier cave on Perito More... | True |
| 1 | Q10 | how an outdoor wood boiler works | Outdoor wood-fired boiler | The outdoor wood boiler is a variant of the cl... | The outdoor wood boiler is a variant of the cl... | False |
| 2 | Q100 | what happens to the light independent reactio... | Light-independent reactions | The simplified internal structure of a chlorop... | The simplified internal structure of a chlorop... | True |
| 3 | Q1000 | where in the bible that palestine have no land... | Philistines | The Philistine cities of Gaza, Ashdod, Ashkelo... | The Philistine cities of Gaza, Ashdod, Ashkelo... | False |
| 4 | Q1001 | what are the test scores on asvab | Armed Services Vocational Aptitude Battery | The Armed Services Vocational Aptitude Battery... | The Armed Services Vocational Aptitude Battery... | False |


## Display Binary Relevance Classification Template

View the default template used to classify relevance. You can tweak this template and evaluate its performance relative to the default.

In [3]:
print(RAG_RELEVANCY_PROMPT_TEMPLATE_STR)


    You are comparing a reference text to a question and trying to determine if the reference text contains
    information relevant to answering the question. Here is the data:
    [BEGIN DATA]
    ************
    [Question]: {query}
    ************
    [Reference text]: {reference}
    [END DATA]

    Compare the Question above to the Reference text. You must determine whether the Reference text contains
    information that can answer the Question. Please focus on whether the very specific question can be
    answered by the information in the Reference text.
    Your response must be single word, either "relevant" or "irrelevant",
    and should not contain any text or characters aside from that word.
    "irrelevant" means that the reference text does not contain an answer to the Question.
    "relevant" means the reference text contains an answer to the Question.




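The template variables are:

- query
- reference

The benchmark dataframe stores these values in the `query_text` and `document_text` columns, so those columns must be renamed to match the template before running the evaluation. The `relevant` column holds the ground-truth labels and is not a template variable.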
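
When tweaking the template, it can help to confirm that a custom version still uses the same placeholders as the default one, which contains `{query}` and `{reference}`. The sketch below lists the placeholders of a format-style string with Python's standard `string.Formatter`. `CUSTOM_TEMPLATE` and `template_variables` are hypothetical names used only for illustration, and this is not necessarily how `PromptTemplate` extracts its variables internally.

```python
from string import Formatter

# Hypothetical custom template for illustration. It keeps the two
# placeholders used by the default template: {query} and {reference}.
CUSTOM_TEMPLATE = (
    "Question: {query}\n"
    "Reference text: {reference}\n"
    'Answer with a single word, either "relevant" or "irrelevant".'
)


def template_variables(template: str) -> list[str]:
    """Return placeholder names in order of first appearance, without duplicates."""
    names = []
    # Formatter().parse yields (literal_text, field_name, format_spec, conversion);
    # field_name is None for trailing literal text.
    for _, field_name, _, _ in Formatter().parse(template):
        if field_name and field_name not in names:
            names.append(field_name)
    return names


print(template_variables(CUSTOM_TEMPLATE))  # ['query', 'reference']
```

If the printed names differ from the dataframe's column names, the columns need to be renamed before the evaluation is run.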

## Configure an LLM

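Configure the OpenAI model that classifies each query-document pair. The retry settings control how long the client waits between attempts and how many times a failed request is retried.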

In [4]:
model = OpenAiModel(
    model_name="gpt-4",
    temperature=0.9,
    presence_penalty=0.45,
    model_kwargs={
        "frequency_penalty":0.87,
    },
    retry_min_seconds=10,
    retry_max_seconds=90,
    max_retries=10,
)
model

OpenAiModel(model_name='gpt-4', temperature=0.9, max_tokens=256, top_p=1, frequency_penalty=0, presence_penalty=0.45, n=1, model_kwargs={'frequency_penalty': 0.87}, batch_size=20, request_timeout=None, max_retries=10, retry_min_seconds=10, retry_max_seconds=90)

## Run Relevance Classifications

In [5]:
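# The template expects `query` and `reference`, but the benchmark dataframe
# uses `query_text` and `document_text`. The cells below sample the data,
# compare the two sets of names, and rename the columns to match.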

In [6]:
df_sampled = df.sample(n=100).reset_index(drop=True)
df = df_sampled.copy()

In [7]:
PromptTemplate(text=RAG_RELEVANCY_PROMPT_TEMPLATE_STR).variables

['query', 'reference']

In [8]:
df.columns

Index(['query_id', 'query_text', 'document_title', 'document_text',
       'document_text_with_emphasis', 'relevant'],
      dtype='object')

In [9]:
df.rename(
    columns={
        "query_text": "query",
        "document_text": "reference",
    },
    inplace=True,
)

In [10]:
df.columns

Index(['query_id', 'query', 'document_title', 'reference',
       'document_text_with_emphasis', 'relevant'],
      dtype='object')
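
The renaming step can also be checked before any LLM calls are made. The following is a minimal sketch in plain Python, assuming the column names shown in this notebook; `COLUMN_MAP` and `renamed_columns` are hypothetical helpers, not part of the Phoenix library.

```python
# Mapping from benchmark column names to the template's variable names.
COLUMN_MAP = {"query_text": "query", "document_text": "reference"}


def renamed_columns(columns, column_map, required):
    """Apply column_map to the column names and fail if a required name is missing."""
    renamed = [column_map.get(name, name) for name in columns]
    missing = [name for name in required if name not in renamed]
    if missing:
        raise ValueError(f"missing template variables: {missing}")
    return renamed


columns = [
    "query_id",
    "query_text",
    "document_title",
    "document_text",
    "document_text_with_emphasis",
    "relevant",
]
print(renamed_columns(columns, COLUMN_MAP, required=["query", "reference"]))
```

With a dataframe, the same mapping can be passed to `df.rename(columns=COLUMN_MAP)`, which returns a renamed copy and leaves the original untouched.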

In [11]:
res = await llm_eval_binary(
    df=df,
    template=RAG_RELEVANCY_PROMPT_TEMPLATE_STR,
    model=model
)

Eta:2023-09-01 14:05:08.877756 |█████████████████████████| 100.0% (100/100) [02:40<00:00,  1.60s/it]


## Evaluate Predictions

In [12]:
df["eval_relevance"] = res

In [13]:
df["relevant"].value_counts()

False    67
True     33
Name: relevant, dtype: int64

In [14]:
df["eval_relevance"].value_counts()

relevant      52
irrelevant    48
Name: eval_relevance, dtype: int64

In [15]:
y_true = df["relevant"].map({True: "relevant", False: "irrelevant"})
y_pred = df["eval_relevance"]
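
The metrics in the next cell assume that every prediction is exactly `"relevant"` or `"irrelevant"`. In this run all 100 predictions were, but a modified template or a different model might return extra punctuation or an unexpected word. The sketch below shows one possible way to normalize raw responses first. `normalize_label` is a hypothetical helper, and whether `llm_eval_binary` already performs similar cleanup is not shown in this notebook.

```python
ALLOWED_LABELS = ("relevant", "irrelevant")


def normalize_label(raw):
    """Map a raw model response to an allowed label, or None if it cannot be parsed."""
    if not isinstance(raw, str):
        return None
    # Drop surrounding whitespace, then surrounding periods and quotes, then lowercase.
    cleaned = raw.strip().strip(".\"'").lower()
    return cleaned if cleaned in ALLOWED_LABELS else None


responses = ["relevant", " Irrelevant.", '"relevant"', "maybe", None]
print([normalize_label(r) for r in responses])
# ['relevant', 'irrelevant', 'relevant', None, None]
```

Rows that normalize to `None` can be counted and excluded, or scored as errors, depending on how strictly the template should be judged.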

In [16]:
# Calculate F1 score
f1 = f1_score(y_true, y_pred, pos_label="relevant")
print("F1 Score:", f1)

# Calculate Precision
precision = precision_score(y_true, y_pred, pos_label="relevant")
print("Precision:", precision)

# Calculate Recall
recall = recall_score(y_true, y_pred, pos_label="relevant")
print("Recall:", recall)

# Calculate Accuracy
accuracy = accuracy_score(y_true, y_pred)
print("Accuracy:", accuracy)

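# Calculate and print the confusion matrix.
# Rows are true labels and columns are predicted labels, in the order given by `labels`.
conf_matrix = confusion_matrix(y_true, y_pred, labels=["irrelevant", "relevant"])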
print("Confusion Matrix:")
print(conf_matrix)


F1 Score: 0.6588235294117647
Precision: 0.5384615384615384
Recall: 0.8484848484848485
Accuracy: 0.71
Confusion Matrix:
[[43 24]
 [ 5 28]]
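
As a cross-check, the reported scores can be derived by hand from the confusion matrix. scikit-learn places true labels on the rows and predicted labels on the columns, with labels in sorted order (`"irrelevant"`, then `"relevant"`), so the four cells read as below. This is a worked example using only the numbers printed above.

```python
# Confusion matrix reported above: rows are true labels, columns are predictions,
# with labels in sorted order ("irrelevant", "relevant").
tn, fp = 43, 24  # true "irrelevant": predicted irrelevant / predicted relevant
fn, tp = 5, 28   # true "relevant":   predicted irrelevant / predicted relevant

precision = tp / (tp + fp)  # 28 / 52
recall = tp / (tp + fn)     # 28 / 33
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tp + tn + fp + fn)

print(round(precision, 4), round(recall, 4), round(f1, 4), round(accuracy, 2))
# 0.5385 0.8485 0.6588 0.71
```

The 24 false positives explain the gap between recall and precision: the model labeled 52 documents as relevant while only 33 are relevant in the ground truth. For context, always predicting `"irrelevant"` on this sample would reach an accuracy of 0.67.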
