<center>
    <p style="text-align:center">
        <img alt="phoenix logo" src="https://storage.googleapis.com/arize-phoenix-assets/assets/phoenix-logo-light.svg" width="200"/>
        <br>
        <a href="https://docs.arize.com/phoenix/">Docs</a>
        |
        <a href="https://github.com/Arize-ai/phoenix">GitHub</a>
        |
        <a href="https://join.slack.com/t/arize-ai/shared_invite/zt-1px8dcmlf-fmThhDFD_V_48oU7ALan4Q">Community</a>
    </p>
</center>
<h1 align="center">Retrieval Relevance Evals</h1>

Arize provides tooling to evaluate LLM applications, including tools to determine the relevance or irrelevance of documents retrieved by retrieval-augmented generation (RAG) applications. This relevance is then used to measure the quality of each retrieval using ranking metrics such as precision@k. In order to determine whether each retrieved document is relevant or irrelevant to the corresponding query, our approach is straightforward: ask an LLM.
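As a rough illustration of how binary relevance labels feed a ranking metric, the sketch below computes precision@k for a single query. The helper name `precision_at_k` and the example labels are hypothetical and are not part of the Phoenix API.

```python
def precision_at_k(relevance_labels, k):
    """Fraction of the top-k retrieved documents that are labeled relevant."""
    if k <= 0:
        raise ValueError("k must be a positive integer")
    top_k = relevance_labels[:k]
    # Dividing by k (not len(top_k)) penalizes queries that returned fewer than k documents.
    return sum(1 for is_relevant in top_k if is_relevant) / k


# Hypothetical relevance labels for five retrieved documents, in ranked order.
labels = [True, False, True, True, False]
print(precision_at_k(labels, k=3))
print(precision_at_k(labels, k=5))
```

Averaging this value over all queries gives one number per retrieval configuration, which is why the quality of the per-document relevance labels matters.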

The purpose of this notebook is:

- to evaluate the performance of an LLM-assisted approach to relevance classification against information retrieval datasets with ground-truth relevance labels,
- to provide an experimental framework for users to iterate and improve on the default classification template.

## Install Dependencies and Import Libraries

In [1]:
#####################
## N_EVAL_SAMPLE_SIZE
#####################
# Eval sample size determines the run time
# 100 samples: GPT-4 ~80 sec / GPT-3.5 ~40 sec
# 1,000 samples: GPT-4 ~15-17 min / GPT-3.5 ~6-7 min (depending on retries)
# 10,000 samples: GPT-4 ~170 min / GPT-3.5 ~70 min
N_EVAL_SAMPLE_SIZE = 500

In [None]:
!pip install -qq "arize-phoenix-evals>=0.0.5" "openai>=1" ipython matplotlib pycm scikit-learn tiktoken nest_asyncio

ℹ️ To enable async request submission in notebook environments like Jupyter or Google Colab, optionally use `nest_asyncio`. `nest_asyncio` globally patches `asyncio` to enable event loops to be re-entrant. This is not required for non-notebook environments.

Without `nest_asyncio`, eval submission can be much slower, depending on your organization's rate limits. Speed increases of about 5x are typical.

In [4]:
import nest_asyncio

nest_asyncio.apply()
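To give a feel for why concurrent submission helps, here is a minimal sketch of bounded concurrency using only the standard library. `fake_llm_call` is a hypothetical stand-in for a network request, and nothing here reflects how Phoenix schedules requests internally. Inside a notebook, the top-level `asyncio.run` call typically needs the `nest_asyncio` patch applied above because an event loop is already running.

```python
import asyncio


async def fake_llm_call(prompt, semaphore, in_flight):
    """Hypothetical stand-in for a network request; no real API is called."""
    async with semaphore:
        in_flight["now"] += 1
        in_flight["peak"] = max(in_flight["peak"], in_flight["now"])
        await asyncio.sleep(0.01)  # simulate request latency
        in_flight["now"] -= 1
        return f"response to {prompt}"


async def run_all(prompts, concurrency):
    # The semaphore caps how many requests are in flight at once.
    semaphore = asyncio.Semaphore(concurrency)
    in_flight = {"now": 0, "peak": 0}
    tasks = [fake_llm_call(p, semaphore, in_flight) for p in prompts]
    # gather returns results in the same order as the inputs.
    responses = await asyncio.gather(*tasks)
    return responses, in_flight["peak"]


prompts = [f"prompt {i}" for i in range(10)]
responses, peak = asyncio.run(run_all(prompts, concurrency=3))
print(len(responses), peak)
```

With a cap of 3, ten simulated requests finish in roughly four latency periods instead of ten, which is the kind of speedup concurrent submission aims for, subject to rate limits.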

In [5]:
import os
from getpass import getpass

import matplotlib.pyplot as plt
import pandas as pd
from phoenix.evals import (
    RAG_RELEVANCY_PROMPT_RAILS_MAP,
    RAG_RELEVANCY_PROMPT_TEMPLATE,
    OpenAIModel,
    download_benchmark_dataset,
    llm_classify,
)
from pycm import ConfusionMatrix
from sklearn.metrics import classification_report

pd.set_option("display.max_colwidth", None)

## Download Benchmark Dataset

We'll evaluate the evaluation system, which consists of an LLM, its settings, and an evaluation prompt template, against benchmark datasets of queries and retrieved documents with ground-truth relevance labels. Currently supported datasets include:

- "wiki_qa-train"
- "ms_marco-v1.1-train"

In [6]:
df = download_benchmark_dataset(
    task="binary-relevance-classification", dataset_name="wiki_qa-test"
)
df.head()

Unnamed: 0,query_id,query_text,document_title,document_text,document_text_with_emphasis,relevant
0,Q0,HOW AFRICAN AMERICANS WERE IMMIGRATED TO THE US,African immigration to the United States,"African immigration to the United States refers to immigrants to the United States who are or were nationals of Africa . The term African in the scope of this article refers to geographical or national origins rather than racial affiliation. From the Immigration and Nationality Act of 1965 to 2007, an estimated total of 0.8 to 0.9 million Africans immigrated to the United States, accounting for roughly 3.3% of total immigration to the United States during this period. African immigrants in the United States come from almost all regions in Africa and do not constitute a homogeneous group. They include people from different national, linguistic, ethnic, racial, cultural and social backgrounds. As such, African immigrants are to be distinguished from African American people, the latter of whom are descendants of mostly West and Central Africans who were involuntarily brought to the United States by means of the historic Atlantic slave trade .","African immigration to the United States refers to immigrants to the United States who are or were nationals of Africa . The term African in the scope of this article refers to geographical or national origins rather than racial affiliation. From the Immigration and Nationality Act of 1965 to 2007, an estimated total of 0.8 to 0.9 million Africans immigrated to the United States, accounting for roughly 3.3% of total immigration to the United States during this period. African immigrants in the United States come from almost all regions in Africa and do not constitute a homogeneous group. They include people from different national, linguistic, ethnic, racial, cultural and social backgrounds. 
AS SUCH, AFRICAN IMMIGRANTS ARE TO BE DISTINGUISHED FROM AFRICAN AMERICAN PEOPLE, THE LATTER OF WHOM ARE DESCENDANTS OF MOSTLY WEST AND CENTRAL AFRICANS WHO WERE INVOLUNTARILY BROUGHT TO THE UNITED STATES BY MEANS OF THE HISTORIC ATLANTIC SLAVE TRADE .",True
1,Q1012,what are points on a mortgage,Point (mortgage),"Points, sometimes also called a ""discount point"", are a form of pre-paid interest . One point equals one percent of the loan amount. By charging a borrower points, a lender effectively increases the yield on the loan above the amount of the stated interest rate . Borrowers can offer to pay a lender points as a method to reduce the interest rate on the loan, thus obtaining a lower monthly payment in exchange for this up-front payment. For each point purchased, the loan rate is typically reduced by 1/8% (0.125%). Paying Points represent a calculated gamble on the part of the buyer. There will be a specific point in the timeline of the loan where the money spent to buy down the interest rate will be equal to the money saved by making reduced loan payments resulting from the lower interest rate on the loan. Selling the property or refinancing prior to this break-even point will result in a net financial loss for the buyer while keeping the loan for longer than this break-even point will result in a net financial savings for the buyer. The longer you keep the property financed under the loan with purchased points, the more the money spent on the points will pay off. Accordingly, if the intention is to buy and sell the property or refinance in a rapid fashion, buying points is actually going to end up costing more than just paying the loan at the higher interest rate. Points may also be purchased to reduce the monthly payment for the purpose of qualifying for a loan. Loan qualification based on monthly income versus the monthly loan payment may sometimes only be achievable by reducing the monthly payment through the purchasing of points to buy down the interest rate, thereby reducing the monthly loan payment. Discount points may be different from origination fee or broker fee . 
Discount points are always used to buy down the interest rates, while origination fees sometimes are fees the lender charges for the loan or sometimes just another name for buying down the interest rate. Origination fee and discount points are both items listed under lender-charges on the HUD-1 Settlement Statement . The difference in savings over the life of the loan can make paying points a benefit to the borrower. If you intend to stay in your home for an extended period of time, it may be worthwhile to pay additional points in order to obtain a lower interest rate. Any significant changes in fees should be re-disclosed in the final good faith estimate (GFE). Also directly related to points is the concept of the ' no closing cost loan '. If points are paid to acquire a loan, it is impossible at the same time for a broker bank or lender to make a premium for a higher rate. When premium is earned by making the note rate higher, this premium is sometimes used to pay the closing costs.","POINTS, SOMETIMES ALSO CALLED A ""DISCOUNT POINT"", ARE A FORM OF PRE-PAID INTEREST . One point equals one percent of the loan amount. By charging a borrower points, a lender effectively increases the yield on the loan above the amount of the stated interest rate . Borrowers can offer to pay a lender points as a method to reduce the interest rate on the loan, thus obtaining a lower monthly payment in exchange for this up-front payment. For each point purchased, the loan rate is typically reduced by 1/8% (0.125%). Paying Points represent a calculated gamble on the part of the buyer. There will be a specific point in the timeline of the loan where the money spent to buy down the interest rate will be equal to the money saved by making reduced loan payments resulting from the lower interest rate on the loan. 
Selling the property or refinancing prior to this break-even point will result in a net financial loss for the buyer while keeping the loan for longer than this break-even point will result in a net financial savings for the buyer. The longer you keep the property financed under the loan with purchased points, the more the money spent on the points will pay off. Accordingly, if the intention is to buy and sell the property or refinance in a rapid fashion, buying points is actually going to end up costing more than just paying the loan at the higher interest rate. Points may also be purchased to reduce the monthly payment for the purpose of qualifying for a loan. Loan qualification based on monthly income versus the monthly loan payment may sometimes only be achievable by reducing the monthly payment through the purchasing of points to buy down the interest rate, thereby reducing the monthly loan payment. Discount points may be different from origination fee or broker fee . Discount points are always used to buy down the interest rates, while origination fees sometimes are fees the lender charges for the loan or sometimes just another name for buying down the interest rate. Origination fee and discount points are both items listed under lender-charges on the HUD-1 Settlement Statement . The difference in savings over the life of the loan can make paying points a benefit to the borrower. If you intend to stay in your home for an extended period of time, it may be worthwhile to pay additional points in order to obtain a lower interest rate. Any significant changes in fees should be re-disclosed in the final good faith estimate (GFE). Also directly related to points is the concept of the ' no closing cost loan '. If points are paid to acquire a loan, it is impossible at the same time for a broker bank or lender to make a premium for a higher rate. When premium is earned by making the note rate higher, this premium is sometimes used to pay the closing costs.",True
2,Q1018,when did 311 make Amber,311 (band),"311 (pronounced ""three-eleven"") is an American rock band from Omaha , Nebraska . The band was formed in 1988 by vocalist/guitarist Nick Hexum , lead guitarist Jim Watson (who would later be replaced by Tim Mahoney), bassist Aaron ""P-Nut"" Wills and drummer Chad Sexton. In 1992, Doug ""SA"" Martinez joined to sing and provide turntables for 311's later albums, rounding out the current line-up. 311 is generally known as an alternative rock band, but it is also classified as rap rock , rap metal , funk rock , funk metal , reggae and jazz fusion . The band's name originates from the police code for indecent exposure in Omaha, Nebraska, after the original guitarist for the band was arrested for streaking . After a series of independent releases, 311 was signed to Capricorn Records in 1992 and released the albums Music (1993) and Grassroots (1994) to moderate success. They achieved greater success with their 1995 triple platinum self-titled album , which reached No. 12 on the Billboard 200 on the strength of the singles "" Down "" and "" All Mixed Up "", the former of which topped the Billboard Hot Modern Rock Tracks in 1996. The band's next three albums, Transistor (1997), Soundsystem (1999) and From Chaos (2001), did not achieve the massive success of the self-titled album, though they were still successful, with the first going platinum and the last two going gold. Their 2004 compilation album Greatest Hits '93–'03 was also certified gold. The band's most recent studio album is 2011's Universal Pulse . To date, 311 has released ten studio albums, one live album, four compilation albums, four EPs and four DVDs. As of 2011, 311 has sold over 8.5 million records in the US. They are currently working on a new album, which is due for release in 2013.","311 (pronounced ""three-eleven"") is an American rock band from Omaha , Nebraska . 
The band was formed in 1988 by vocalist/guitarist Nick Hexum , lead guitarist Jim Watson (who would later be replaced by Tim Mahoney), bassist Aaron ""P-Nut"" Wills and drummer Chad Sexton. In 1992, Doug ""SA"" Martinez joined to sing and provide turntables for 311's later albums, rounding out the current line-up. 311 is generally known as an alternative rock band, but it is also classified as rap rock , rap metal , funk rock , funk metal , reggae and jazz fusion . The band's name originates from the police code for indecent exposure in Omaha, Nebraska, after the original guitarist for the band was arrested for streaking . After a series of independent releases, 311 was signed to Capricorn Records in 1992 and released the albums Music (1993) and Grassroots (1994) to moderate success. They achieved greater success with their 1995 triple platinum self-titled album , which reached No. 12 on the Billboard 200 on the strength of the singles "" Down "" and "" All Mixed Up "", the former of which topped the Billboard Hot Modern Rock Tracks in 1996. The band's next three albums, Transistor (1997), Soundsystem (1999) and From Chaos (2001), did not achieve the massive success of the self-titled album, though they were still successful, with the first going platinum and the last two going gold. Their 2004 compilation album Greatest Hits '93–'03 was also certified gold. The band's most recent studio album is 2011's Universal Pulse . To date, 311 has released ten studio albums, one live album, four compilation albums, four EPs and four DVDs. As of 2011, 311 has sold over 8.5 million records in the US. They are currently working on a new album, which is due for release in 2013.",False
3,Q102,how does interlibrary loan work,Interlibrary loan,"Interlibrary loan (abbreviated ILL, and sometimes called interloan, document delivery, or document supply) is a service whereby a user of one library can borrow books or receive photocopies of documents that are owned by another library. The user makes a request with their local library, which, acting as an intermediary, identifies owners of the desired item, places the request, receives the item, makes it available to the user, and arranges for its return. The lending library usually sets the due date and overdue fees of the material borrowed. Although books and journal articles are the most frequently requested items, some libraries will lend audio recordings, video recordings, maps, sheet music, and microforms of all kinds. In many cases, nominal fees accompany interlibrary loan services. The term document delivery may also be used for a related service, namely the supply of journal articles and other copies on a personalized basis, whether these come from other libraries or direct from publishers. The end user is usually responsible for any fees, such as costs for postage or photocopying. Commercial document delivery services will borrow on behalf of any customer willing to pay their rates.","INTERLIBRARY LOAN (ABBREVIATED ILL, AND SOMETIMES CALLED INTERLOAN, DOCUMENT DELIVERY, OR DOCUMENT SUPPLY) IS A SERVICE WHEREBY A USER OF ONE LIBRARY CAN BORROW BOOKS OR RECEIVE PHOTOCOPIES OF DOCUMENTS THAT ARE OWNED BY ANOTHER LIBRARY. THE USER MAKES A REQUEST WITH THEIR LOCAL LIBRARY, WHICH, ACTING AS AN INTERMEDIARY, IDENTIFIES OWNERS OF THE DESIRED ITEM, PLACES THE REQUEST, RECEIVES THE ITEM, MAKES IT AVAILABLE TO THE USER, AND ARRANGES FOR ITS RETURN. The lending library usually sets the due date and overdue fees of the material borrowed. 
Although books and journal articles are the most frequently requested items, some libraries will lend audio recordings, video recordings, maps, sheet music, and microforms of all kinds. In many cases, nominal fees accompany interlibrary loan services. The term document delivery may also be used for a related service, namely the supply of journal articles and other copies on a personalized basis, whether these come from other libraries or direct from publishers. The end user is usually responsible for any fees, such as costs for postage or photocopying. Commercial document delivery services will borrow on behalf of any customer willing to pay their rates.",True
4,Q1021,who did cy young play for,Cy Young,"Denton True ""Cy"" Young (March 29, 1867 – November 4, 1955) was an American Major League Baseball pitcher . During his 22-year baseball career (1890–1911), he pitched for five different teams. Young established numerous pitching records, some of which have stood for a century. Young compiled 511 wins , which is most in Major League history and 94 ahead of Walter Johnson who is second on the list. Young was elected to the National Baseball Hall of Fame in 1937. One year after Young's death, the Cy Young Award was created to honor the previous season's best pitcher. In addition to wins, Young still holds the major league records for most career innings pitched (7,355), most career games started (815), and most complete games (749). He also retired with 316 losses , the most in MLB history. Young's 76 career shutouts are fourth all-time. He also won at least 30 games in a season five times, with ten other seasons of 20 or more wins. In addition, Young pitched three no-hitters , including the third perfect game in baseball history, first in baseball's ""modern era"". In 1999, 88 years after his final major league appearance and 44 years after his death, editors at The Sporting News ranked Cy Young 14th on their list of ""Baseball's 100 Greatest Players"". That same year, baseball fans named him to the Major League Baseball All-Century Team . Young's career started in 1890 with the Cleveland Spiders . After eight years with the Spiders, Young was moved to St. Louis in 1899. After two years there, Young jumped to the newly-created American League , joining the Boston franchise. He was traded back to Cleveland in 1909, before spending the final two months of his career with the Boston Rustlers . After his retirement, Young went back to his farm in Ohio , where he stayed until his death at age 88 in 1955.","Denton True ""Cy"" Young (March 29, 1867 – November 4, 1955) was an American Major League Baseball pitcher . 
During his 22-year baseball career (1890–1911), he pitched for five different teams. Young established numerous pitching records, some of which have stood for a century. Young compiled 511 wins , which is most in Major League history and 94 ahead of Walter Johnson who is second on the list. Young was elected to the National Baseball Hall of Fame in 1937. One year after Young's death, the Cy Young Award was created to honor the previous season's best pitcher. In addition to wins, Young still holds the major league records for most career innings pitched (7,355), most career games started (815), and most complete games (749). He also retired with 316 losses , the most in MLB history. Young's 76 career shutouts are fourth all-time. He also won at least 30 games in a season five times, with ten other seasons of 20 or more wins. In addition, Young pitched three no-hitters , including the third perfect game in baseball history, first in baseball's ""modern era"". In 1999, 88 years after his final major league appearance and 44 years after his death, editors at The Sporting News ranked Cy Young 14th on their list of ""Baseball's 100 Greatest Players"". That same year, baseball fans named him to the Major League Baseball All-Century Team . Young's career started in 1890 with the Cleveland Spiders . After eight years with the Spiders, Young was moved to St. Louis in 1899. After two years there, Young jumped to the newly-created American League , joining the Boston franchise. He was traded back to Cleveland in 1909, before spending the final two months of his career with the Boston Rustlers . After his retirement, Young went back to his farm in Ohio , where he stayed until his death at age 88 in 1955.",False


## Display Binary Relevance Classification Template

View the default template used to classify relevance. You can tweak this template and evaluate its performance relative to the default.

In [7]:
print(RAG_RELEVANCY_PROMPT_TEMPLATE)


You are comparing a reference text to a question and trying to determine if the reference text
contains information relevant to answering the question. Here is the data:
    [BEGIN DATA]
    ************
    [Question]: {input}
    ************
    [Reference text]: {reference}
    ************
    [END DATA]
Compare the Question above to the Reference text. You must determine whether the Reference text
contains information that can answer the Question. Please focus on whether the very specific
question can be answered by the information in the Reference text.
Your response must be single word, either "relevant" or "unrelated",
and should not contain any text or characters aside from that word.
"unrelated" means that the reference text does not contain an answer to the Question.
"relevant" means the reference text contains an answer to the Question.


The template variables are:

- **input:** the question asked by a user
- **reference:** the text of the retrieved document
- **output:** a ground-truth relevance label
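The sketch below shows how the `{input}` and `{reference}` placeholders could be filled for one row using plain `str.format`. The shortened template string is hypothetical, and whether Phoenix performs the substitution in exactly this way is an assumption.

```python
# Hypothetical, shortened template; the real one is RAG_RELEVANCY_PROMPT_TEMPLATE.
template = "[Question]: {input}\n[Reference text]: {reference}"

# One example row, using the column names the template expects.
row = {
    "input": "who did cy young play for",
    "reference": "During his 22-year baseball career, he pitched for five different teams.",
}

# Only the two template variables are substituted into the prompt.
prompt = template.format(input=row["input"], reference=row["reference"])
print(prompt)
```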

## Configure the LLM

Configure your OpenAI API key.

In [8]:
if not (openai_api_key := os.getenv("OPENAI_API_KEY")):
    openai_api_key = getpass("🔑 Enter your OpenAI API key: ")
os.environ["OPENAI_API_KEY"] = openai_api_key

## Benchmark Dataset Sample
The sample size determines the run time. We recommend iterating on a small sample first (around 100 rows), then increasing to a larger test set.

In [9]:
df_sample = df.sample(n=N_EVAL_SAMPLE_SIZE).reset_index(drop=True)
df_sample = df_sample.rename(
    columns={
        "query_text": "input",
        "document_text": "reference",
    },
)

## LLM Evals: Retrieval Relevance Classifications GPT-4
Run relevance against a subset of the data.
Instantiate the LLM and set parameters.

In [10]:
model = OpenAIModel(
    model="gpt-4",  # `model_name` is deprecated in favor of `model`
    temperature=0.0,
)


In [11]:
model("Hello world, this is a test if you are working?")

"Hello! I'm working perfectly. How can I assist you today?"

## Run Relevance Classifications

Run relevance classifications against a subset of the data.

In [12]:
# The rails constrain the output to the specific values defined by the template.
# They strip stray text such as ",,," or "..." from the response
# and ensure that only the expected binary label is returned.
rails = list(RAG_RELEVANCY_PROMPT_RAILS_MAP.values())
relevance_classifications = llm_classify(
    dataframe=df_sample,
    template=RAG_RELEVANCY_PROMPT_TEMPLATE,
    model=model,
    rails=rails,
    concurrency=20,
)["label"].tolist()

llm_classify |          | 0/500 (0.0%) | ⏳ 00:00<? | ?it/s
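As a rough picture of what rails do, the following sketch maps a raw response onto one of the allowed labels. The function `snap_to_rails` and its `None` fallback are hypothetical; the actual matching logic inside `llm_classify` may differ.

```python
def snap_to_rails(raw_output, rails, default=None):
    """Map a raw LLM response onto one of the allowed labels, or return a default."""
    # Drop surrounding whitespace, quotes, and stray punctuation such as ",,," or "...".
    cleaned = raw_output.strip(" \n\t.,!\"'").lower()
    for rail in rails:
        if cleaned == rail.lower():
            return rail
    return default


example_rails = ["relevant", "unrelated"]
for raw in ["Relevant.", "...unrelated,,,", "maybe"]:
    print(repr(raw), "->", snap_to_rails(raw, example_rails))
```

Responses that cannot be matched fall through to the default, which makes unparseable outputs easy to count separately.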

## Evaluate Classifications

Evaluate the predictions against human-labeled ground-truth relevance labels.

In [29]:
s = "{{hello}}"
s.format(hello="world")

'{hello}'
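The cell above checks how `str.format` treats doubled braces: `{{` and `}}` become literal braces, while single-braced names are substituted. A small sketch with a hypothetical template that mixes both:

```python
# Doubled braces survive as literal braces; {input} is replaced.
template = 'Return JSON like {{"label": "..."}} for the query: {input}'
prompt = template.format(input="how does interlibrary loan work")
print(prompt)
```

This matters when a custom prompt template needs to show literal braces, for example in a JSON snippet, alongside template variables.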

In [37]:
prompt_template = """
Basic Instruction: Compare the query above to the reference text. You must determine whether the reference text contains information that can help answer the query. First, write out in a step by step manner an EXPLANATION that reasons about how to arrive at the correct answer. Avoid simply stating the correct answer at the outset.

Proposed Instruction: Your task is to analyze the provided query and compare it with the content in the reference text to see if the text can assist in addressing the query. Begin your response by systematically presenting a thorough step-by-step explanation of your analytical reasoning. Break down each aspect regarding why parts of the reference are relevant or irrelevant to the query. Please detail any implied or explicit connection and avoid disclosing the final determination immediately; instead, guide the reader through your analysis before revealing the conclusion towards the end.

---

Follow the following format.

Query: a query from the user

Reference: a reference document

Reasoning: Let's think step by step in order to produce the answer. We ...

Analysis and Reasoning: a one-word answer, either 'relevant' or 'irrelevant'

---

Query: {input}

Reference: {reference}

Reasoning: Let's think step by step in order to
"""

In [112]:
import re


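def extract(s):
    """Pull the final label out of a generated response and map it onto the rails.

    Returns "" when no label can be found.
    """
    m = re.search(r"Analysis and Reasoning: (relevant|irrelevant)", s, re.IGNORECASE)
    if m is None:
        return ""
    # The prompt asks for "irrelevant", but the rails use "unrelated".
    if m.group(1).lower() == "irrelevant":
        return "unrelated"
    return "relevant"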

In [120]:
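# `output_df` is produced by the `llm_generate` cell (In [38]) shown further down; run that cell first.
relevance_classifications = output_df["output"].map(extract)
relevance_classifications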

0      unrelated
1      unrelated
2      unrelated
3       relevant
4       relevant
         ...    
495     relevant
496    unrelated
497    unrelated
498     relevant
499     relevant
Name: output, Length: 500, dtype: object

In [121]:
relevance_classifications.value_counts()

output
relevant     268
unrelated    219
              13
Name: count, dtype: int64
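The blank row in the counts above corresponds to responses where `extract` found no label and returned an empty string. A quick sketch of how the unparseable rate could be summarized, using the counts copied from that output:

```python
from collections import Counter

# Counts copied from the value_counts() output; "" marks unparseable responses.
counts = Counter({"relevant": 268, "unrelated": 219, "": 13})

total = sum(counts.values())
unparsed = counts[""]
unparsed_rate = unparsed / total
print(f"{unparsed} of {total} responses ({unparsed_rate:.1%}) could not be parsed")
```

Tracking this rate while iterating on a template helps separate formatting failures from genuine misclassifications.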

In [38]:
from phoenix.evals import llm_generate

output_df = llm_generate(df_sample, prompt_template, model=model)
output_df

llm_generate |          | 0/500 (0.0%) | ⏳ 00:00<? | ?it/s

Unnamed: 0,output
0,"produce the answer. We first need to understand the query, which is asking for the field dimensions for the sport of lacrosse. This would typically include information such as the length and width of the field, the size of the goal area, and other related details. Now, we turn to the reference text. The text provides a detailed description of the sport of lacrosse, including its history, the equipment used, and the rules of the game. It mentions that field lacrosse is a version of the sport played internationally and that it is a full contact outdoor men's sport played with ten players on each team. However, the text does not provide any specific information about the dimensions of the field on which lacrosse is played. Therefore, while the text is about lacrosse, it does not contain the specific information requested in the query.\n\nAnalysis and Reasoning: Irrelevant."
1,"produce the answer. We start by examining the query, which is asking for a specific piece of information: the number of ribs Adam has. Next, we look at the reference text. The text provides a detailed account of the story of Adam and Eve according to the Abrahamic religions. It discusses their creation, their life in the Garden of Eden, and the consequences of their disobedience to God. However, despite the detailed narrative, there is no mention of the number of ribs Adam has. The text does not provide any information on Adam's physical attributes or anatomy. Therefore, the reference text does not contain information that can help answer the query.\n\nAnalysis and Reasoning: Irrelevant."
2,"produce the answer. The query is asking for specific information about Donald Trump's wealth. The reference text provides information about Donald Trump's career and business ventures, such as his role as chairman and president of The Trump Organization and the founder of Trump Entertainment Resorts. It also mentions his extravagant lifestyle and the fact that he is the son of a wealthy real-estate developer. However, the text does not provide any specific figures or estimates about Donald Trump's net worth. Therefore, while the text suggests that Donald Trump is wealthy, it does not provide the specific information needed to answer the query about how rich he is.\n\nAnalysis and Reasoning: Irrelevant"
3,"produce the answer. The query is asking for information on how Ludacris started Disturbing tha Peace. The reference text provides information that Ludacris, along with his manager Chaka Zulu and Zulu's brother Jeff Dixon, founded Disturbing tha Peace. However, the reference text does not provide specific details on how they started the record label, such as the process they went through, the challenges they faced, or the motivation behind its creation. \n\nAnalysis and Reasoning: Relevant"
4,"produce the answer. We start by examining the query, which is asking for a definition or explanation of what sado masochism is. The reference text provides a detailed explanation of sadomasochism, including its relation to pleasure, pain, and humiliation, its connection to BDSM, the roles of sadist and masochist, the concept of a switch, and the use of the acronym SM or S/M. It also clarifies that sadomasochism is not considered a clinical paraphilia unless it leads to significant distress or impairment. Therefore, the reference text directly addresses the query and provides comprehensive information about sado masochism.\n\nAnalysis and Reasoning: Relevant."
...,...
495,"produce the answer. We start by identifying the query, which is asking for the population of San Francisco. Next, we examine the reference text to see if it contains this information. The text provides a detailed description of San Francisco, including its history, cultural significance, and geographical details. Importantly, it also mentions the population of San Francisco, stating that it was 805,235 as of the 2010 Census. Therefore, the reference text does contain information that can help answer the query.\n\nAnalysis and Reasoning: Relevant"
496,"produce the answer. We first look at the query, which is asking for the real address of the character SpongeBob. This means we need to find specific information about where SpongeBob lives. Now, we examine the reference text. The text provides a lot of information about the SpongeBob SquarePants television series, including its creation, popularity, and the people involved in its production. It also mentions that the series is set in the underwater city of Bikini Bottom. However, it does not provide a specific address for SpongeBob within Bikini Bottom. Therefore, while the text does provide some relevant information about the setting of the series, it does not provide the specific information needed to answer the query.\n\nAnalysis and Reasoning: Irrelevant"
497,"produce the answer. We first need to identify if the reference text provides any information about when Harley-Davidson releases new year models. The reference text provides a brief history of Harley-Davidson, including its founding, survival through the Great Depression, and competition with Japanese manufacturers. It also discusses the types of motorcycles the company sells, the customization tradition, and the company's attempts to establish itself in the light motorcycle market. It also mentions the company's brand community, events, and a museum, as well as licensing of the brand and logo. However, there is no mention of when Harley-Davidson releases new year models. Therefore, the reference text does not contain information that can help answer the query.\n\nAnalysis and Reasoning: Irrelevant"
498,"produce the answer. We first look at the query, which is asking about the cause of the shapes in the Giant's Causeway. The reference text provides information about the Giant's Causeway, including its location, its status as a World Heritage Site and a National Nature Reserve, and its popularity as a tourist attraction. However, the most relevant part of the text to the query is the description of the Causeway as an area of about 40,000 interlocking basalt columns, which are the result of an ancient volcanic eruption. This suggests that the shapes in the Giant's Causeway were caused by this volcanic activity. The text also mentions that most of the columns are hexagonal, and some have four, five, seven or eight sides. This provides additional information about the shapes in the Giant's Causeway. \n\nAnalysis and Reasoning: Relevant."


In [123]:
relevance_classifications

0      unrelated
1      unrelated
2      unrelated
3       relevant
4       relevant
         ...    
495     relevant
496    unrelated
497    unrelated
498     relevant
499     relevant
Name: output, Length: 500, dtype: object

In [124]:
true_labels = df_sample["relevant"].map(RAG_RELEVANCY_PROMPT_RAILS_MAP).tolist()

print(classification_report(true_labels, relevance_classifications, labels=rails))
# pycm expects a list or NumPy array, so convert the pandas Series before passing it in
confusion_matrix = ConfusionMatrix(
    actual_vector=true_labels,
    predict_vector=relevance_classifications.tolist(),
    classes=rails,
)
confusion_matrix.plot(
    cmap=plt.colormaps["Blues"],
    number_label=True,
    normalized=True,
)

              precision    recall  f1-score   support

    relevant       0.65      0.90      0.76       194
   unrelated       0.93      0.66      0.77       306

   micro avg       0.78      0.76      0.77       500
   macro avg       0.79      0.78      0.77       500
weighted avg       0.82      0.76      0.77       500
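In the report above, the micro-averaged precision (0.78) and recall (0.76) differ. For single-label data this typically happens when some predictions fall outside the labels passed via `labels=rails`, for example an LLM output that matched neither rail. The following is a minimal pure-Python sketch of that effect on made-up data; the helper `micro_precision_recall` and the placeholder string `"UNPARSED"` are hypothetical and not part of Phoenix or scikit-learn.

```python
def micro_precision_recall(true_labels, predictions, rails):
    """Micro-averaged precision and recall, counting only the labels in `rails`."""
    tp = fp = fn = 0
    for rail in rails:
        for truth, pred in zip(true_labels, predictions):
            if pred == rail and truth == rail:
                tp += 1  # correctly predicted this rail
            elif pred == rail and truth != rail:
                fp += 1  # predicted this rail, but the truth was something else
            elif pred != rail and truth == rail:
                fn += 1  # missed this rail (includes out-of-rail predictions)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


rails = ["relevant", "unrelated"]
true_labels = ["relevant", "relevant", "unrelated", "unrelated", "unrelated"]
# "UNPARSED" is a hypothetical placeholder for an output that matched no rail.
predictions = ["relevant", "UNPARSED", "unrelated", "relevant", "unrelated"]

precision, recall = micro_precision_recall(true_labels, predictions, rails)
print(f"micro precision={precision:.2f}, micro recall={recall:.2f}")
```

The out-of-rail prediction counts as a false negative for its true class but never as a false positive for any rail, so it lowers recall without lowering precision.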

## Classifications with explanations

When evaluating a dataset for relevance, it can be useful to know why the LLM classified a document as relevant or irrelevant. The following code block runs `llm_classify` with explanations turned on so that we can inspect why the LLM made the classification it did. There is a speed tradeoff since more tokens are generated, but the explanations can be highly informative when troubleshooting.

In [None]:
small_df_sample = df_sample.copy().sample(n=5).reset_index(drop=True)
relevance_classifications_df = llm_classify(
    dataframe=small_df_sample,
    template=RAG_RELEVANCY_PROMPT_TEMPLATE,
    model=model,
    rails=rails,
    provide_explanation=True,
    verbose=True,
    concurrency=20,
)

In [None]:
# Let's view the data
merged_df = pd.merge(
    small_df_sample, relevance_classifications_df, left_index=True, right_index=True
)
merged_df[["input", "reference", "label", "explanation"]].head()
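When troubleshooting, the most informative explanations are usually the ones attached to rows where the LLM disagrees with the ground truth. The sketch below shows one way to filter for those rows. It uses a small made-up DataFrame in place of `merged_df`, and `toy_rails_map` is an assumed stand-in for the mapping from ground-truth booleans to rail strings.

```python
import pandas as pd

# Hypothetical stand-in for merged_df: ground truth in "relevant",
# model output in "label" and "explanation".
toy_merged_df = pd.DataFrame(
    {
        "input": ["q1", "q2", "q3"],
        "relevant": [True, False, True],
        "label": ["relevant", "relevant", "unrelated"],
        "explanation": ["mentions the answer", "topic overlaps", "no direct answer"],
    }
)

# Assumed mapping from ground-truth booleans to rail strings.
toy_rails_map = {True: "relevant", False: "unrelated"}

# Keep only the rows where the predicted label differs from the expected rail.
expected = toy_merged_df["relevant"].map(toy_rails_map)
disagreements = toy_merged_df[toy_merged_df["label"] != expected]
print(disagreements[["input", "label", "explanation"]])
```

Reading the `explanation` column for these rows can suggest whether the template, the ground-truth label, or the model is the likely source of the mismatch.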

## LLM Evals: Relevance Classifications GPT-3.5 Turbo
Run relevance classification against a subset of the data using GPT-3.5. GPT-3.5 can significantly speed up the classification process. However, there are tradeoffs, as we will see below.

In [None]:
model = OpenAIModel(model="gpt-3.5-turbo", temperature=0.0, request_timeout=20)

In [None]:
rails = list(RAG_RELEVANCY_PROMPT_RAILS_MAP.values())
relevance_classifications = llm_classify(
    dataframe=df_sample,
    template=RAG_RELEVANCY_PROMPT_TEMPLATE,
    model=model,
    rails=rails,
    concurrency=20,
)["label"].tolist()

In [None]:
true_labels = df_sample["relevant"].map(RAG_RELEVANCY_PROMPT_RAILS_MAP).tolist()

print(classification_report(true_labels, relevance_classifications, labels=rails))
confusion_matrix = ConfusionMatrix(
    actual_vector=true_labels, predict_vector=relevance_classifications, classes=rails
)
confusion_matrix.plot(
    cmap=plt.colormaps["Blues"],
    number_label=True,
    normalized=True,
)

## Preview: Running with GPT-4 Turbo

In [None]:
model = OpenAIModel(model="gpt-4-turbo-preview")
relevance_classifications = llm_classify(
    dataframe=df_sample,
    template=RAG_RELEVANCY_PROMPT_TEMPLATE,
    model=model,
    rails=list(RAG_RELEVANCY_PROMPT_RAILS_MAP.values()),
    concurrency=20,
)["label"].tolist()

In [None]:
true_labels = df_sample["relevant"].map(RAG_RELEVANCY_PROMPT_RAILS_MAP).tolist()

print(classification_report(true_labels, relevance_classifications, labels=rails))
confusion_matrix = ConfusionMatrix(
    actual_vector=true_labels, predict_vector=relevance_classifications, classes=rails
)
confusion_matrix.plot(
    cmap=plt.colormaps["Blues"],
    number_label=True,
    normalized=True,
)
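Once several models have been run, it can help to put one headline number per model side by side. The sketch below computes the F1 score for the `relevant` class for each model using plain Python. The model names and prediction lists are made up for illustration and are not results from this notebook, and `f1_for_label` is a hypothetical helper.

```python
def f1_for_label(true_labels, predictions, label):
    """F1 score for a single label, computed from raw counts."""
    pairs = list(zip(true_labels, predictions))
    tp = sum(1 for t, p in pairs if t == label and p == label)
    fp = sum(1 for t, p in pairs if t != label and p == label)
    fn = sum(1 for t, p in pairs if t == label and p != label)
    if tp == 0:
        return 0.0  # no true positives means precision and recall are both zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Made-up ground truth and predictions, for illustration only.
true_labels = ["relevant", "unrelated", "relevant", "unrelated"]
predictions_by_model = {
    "model_a": ["relevant", "unrelated", "relevant", "relevant"],
    "model_b": ["relevant", "unrelated", "unrelated", "unrelated"],
}

scores = {
    name: f1_for_label(true_labels, preds, "relevant")
    for name, preds in predictions_by_model.items()
}
for name, score in scores.items():
    print(f"{name}: F1(relevant) = {score:.2f}")
```

In the notebook itself, the lists returned by each `llm_classify(...)["label"].tolist()` call could be stored under separate names and passed in the same way.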