<center>
    <p style="text-align:center">
        <img alt="phoenix logo" src="https://storage.googleapis.com/arize-assets/phoenix/assets/phoenix-logo-light.svg" width="200"/>
        <br>
        <a href="https://docs.arize.com/phoenix/">Docs</a>
        |
        <a href="https://github.com/Arize-ai/phoenix">GitHub</a>
        |
        <a href="https://join.slack.com/t/arize-ai/shared_invite/zt-1px8dcmlf-fmThhDFD_V_48oU7ALan4Q">Community</a>
    </p>
</center>
<h1 align="center">Hallucination Classification Evals</h1>

The purpose of this notebook is:

- to evaluate the performance of an LLM-assisted approach to detecting hallucinations,
- to provide an experimental framework for users to iterate and improve on the default classification template.

## Install Dependencies and Import Libraries

In [None]:
%pip install -qqqU ipython matplotlib openai pycm scikit-learn tiktoken nest-asyncio

# For Colab, install from branch

In [None]:
!npm install -g -s n
!n latest
!npm install -g -s npm@latest
%pip install git+https://github.com/Arize-ai/phoenix.git@benchmarking-function-calling-and-explanations

In [1]:
import math
import os
from getpass import getpass

import openai
import pandas as pd
from phoenix.experimental.evals import (
    HALLUCINATION_PROMPT_RAILS_MAP,
    HALLUCINATION_PROMPT_TEMPLATE,
    RAG_RELEVANCY_PROMPT_RAILS_MAP,
    RAG_RELEVANCY_PROMPT_TEMPLATE,
    OpenAIModel,
    download_benchmark_dataset,
    llm_classify,
)
from sklearn.metrics import classification_report

# import nest_asyncio
# nest_asyncio.apply()

pd.set_option("display.max_colwidth", None)

# Download Benchmark Dataset

We'll evaluate the evaluation system (an LLM, its settings, and an evaluation prompt template) against benchmark datasets with ground-truth labels: hallucination labels for query/reference/response triples, and relevance labels for queries paired with retrieved documents. Currently supported hallucination datasets include "halueval_qa_data" from the HaluEval benchmark:

- https://arxiv.org/abs/2305.11747
- https://github.com/RUCAIBox/HaluEval

In [2]:
halueval_qa_data = download_benchmark_dataset(
    task="binary-hallucination-classification",
    dataset_name="halueval_qa_data",
)
halueval_qa_data.sample().T

Unnamed: 0,11308
reference,"Gülhane Park (Turkish: ""Gülhane Parkı"" , ""Rosehouse Park""; from Persian: ""Gulkhāna"", ""house of flowers"") is a historical urban park in the Eminönü district of Istanbul, Turkey; it is located adjacent to and on the grounds of the Topkapı Palace. ""Oasis of Peace"" or ""Valley of Peace""), is a synagogue in the Karaköy quarter of Beyoğlu district, in Istanbul, Turkey."
query,Are both Gülhane Park and Neve Shalom Synagogue located in Istanbul?
response,yes
is_hallucination,False


In [3]:
wiki_qa_train = download_benchmark_dataset(
    task="binary-relevance-classification",
    dataset_name="wiki_qa-train",
)
wiki_qa_train.sample().T

Unnamed: 0,984
query_id,Q2266
query_text,what is the annual budget of medicare
document_title,United States federal budget
document_text,"Fiscal Year 2012 U.S. Federal Spending – Cash or Budget Basis. Fiscal Year 2012 U.S. Federal Receipts. The Budget of the United States Government often begins as the President 's proposal to the U.S. Congress which recommends funding levels for the next fiscal year , beginning October 1. However, Congress is the body required by law to pass a budget annually and to submit the budget passed by both houses to the President for signature. Congressional decisions are governed by rules and legislation regarding the federal budget process . Budget committees set spending limits for the House and Senate committees and for Appropriations subcommittees, which then approve individual appropriations bills to allocate funding to various federal programs. If Congress fails to pass an annual budget (as has been the case since 2009), a series of Appropriations bills must be passed as ""stop gap"" measures. After Congress approves an appropriations bill, it is sent to the President, who may sign it into law, or may veto it (as he would a budget when passed by the Congress). A vetoed bill is sent back to Congress, which can pass it into law with a two-thirds majority in each chamber. Congress may also combine all or some appropriations bills into an omnibus reconciliation bill. In addition, the president may request and the Congress may pass supplemental appropriations bills or emergency supplemental appropriations bills. Several government agencies provide budget data and analysis. These include the Government Accountability Office (GAO), Congressional Budget Office , the Office of Management and Budget (OMB) and the U.S. Treasury Department . These agencies have reported that the federal government is facing a series of important financing challenges. In the short-run, tax revenues have declined significantly due to a severe recession and tax policy choices, while expenditures have expanded for wars, unemployment insurance and other safety net spending. 
In the long-run, expenditures related to healthcare programs such as Medicare and Medicaid are projected to grow faster than the economy overall as the population matures."
document_text_with_emphasis,"Fiscal Year 2012 U.S. Federal Spending – Cash or Budget Basis. Fiscal Year 2012 U.S. Federal Receipts. The Budget of the United States Government often begins as the President 's proposal to the U.S. Congress which recommends funding levels for the next fiscal year , beginning October 1. However, Congress is the body required by law to pass a budget annually and to submit the budget passed by both houses to the President for signature. Congressional decisions are governed by rules and legislation regarding the federal budget process . Budget committees set spending limits for the House and Senate committees and for Appropriations subcommittees, which then approve individual appropriations bills to allocate funding to various federal programs. If Congress fails to pass an annual budget (as has been the case since 2009), a series of Appropriations bills must be passed as ""stop gap"" measures. After Congress approves an appropriations bill, it is sent to the President, who may sign it into law, or may veto it (as he would a budget when passed by the Congress). A vetoed bill is sent back to Congress, which can pass it into law with a two-thirds majority in each chamber. Congress may also combine all or some appropriations bills into an omnibus reconciliation bill. In addition, the president may request and the Congress may pass supplemental appropriations bills or emergency supplemental appropriations bills. Several government agencies provide budget data and analysis. These include the Government Accountability Office (GAO), Congressional Budget Office , the Office of Management and Budget (OMB) and the U.S. Treasury Department . These agencies have reported that the federal government is facing a series of important financing challenges. 
In the short-run, tax revenues have declined significantly due to a severe recession and tax policy choices, while expenditures have expanded for wars, unemployment insurance and other safety net spending. In the long-run, expenditures related to healthcare programs such as Medicare and Medicaid are projected to grow faster than the economy overall as the population matures."
relevant,False


# Configure the LLM

Configure your OpenAI API key.

In [4]:
if not (openai_api_key := os.getenv("OPENAI_API_KEY")):
    openai_api_key = getpass("🔑 Enter your OpenAI API key: ")
openai.api_key = openai_api_key
os.environ["OPENAI_API_KEY"] = openai_api_key

🔑 Enter your OpenAI API key:  ········


# Benchmark Dataset Sample

In [5]:
N_EVAL_SAMPLE_SIZE = 2  # choose an even number for binary stratified sampling
N_EVAL_SAMPLE_SIZE = math.ceil(N_EVAL_SAMPLE_SIZE / 2) * 2
N_EVAL_SAMPLE_SIZE

2

In [6]:
hallucination_df = (
    halueval_qa_data.groupby("is_hallucination", group_keys=False)
    .apply(
        lambda x: x.sample(n=math.ceil(N_EVAL_SAMPLE_SIZE / 2), random_state=42)
    )  # balanced sampling
    .replace({"is_hallucination": HALLUCINATION_PROMPT_RAILS_MAP})
    .rename({"is_hallucination": "ground_truth_label"}, axis=1)
    .rename({"query": "input", "response": "output"}, axis=1)
)
hallucination_df.head(1).T

Unnamed: 0,12504
reference,"Jonathan Mark Hedges (born 24 February 1964) is a British journalist, and the Editor of ""Country Life"", published by Time Inc. UK.Time Inc. UK (formerly International Publishing Corporation and IPC Media), a British equivalent division of Time Inc., is a consumer magazine and digital publisher in the United Kingdom, with a large portfolio selling over 350 million copies each year."
input,"Jonathan Mark Hedges is the Editor of ""Country Life"", published by a consumer magazine selling over how many copies each year?"
output,350 million
ground_truth_label,factual
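
The balanced sampling used above can be illustrated without pandas. The following is a minimal sketch, not the notebook's implementation: the toy `rows`, the `balanced_sample` helper, and the `RAILS_MAP` dictionary are hypothetical stand-ins, and the label mapping is an assumption inferred from the labels that appear in the outputs.

```python
import random
from collections import defaultdict

# Hypothetical toy rows standing in for the benchmark DataFrame.
rows = [{"query": f"q{i}", "is_hallucination": i % 3 == 0} for i in range(12)]

# Assumed label mapping (stand-in for HALLUCINATION_PROMPT_RAILS_MAP).
RAILS_MAP = {True: "hallucinated", False: "factual"}


def balanced_sample(rows, label_key, n_total, seed=42):
    """Draw n_total // 2 rows from each of the two label groups."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[label_key]].append(row)
    rng = random.Random(seed)  # fixed seed so the draw is repeatable
    sample = []
    for label in sorted(groups):
        sample.extend(rng.sample(groups[label], n_total // 2))
    return sample


sample = balanced_sample(rows, "is_hallucination", n_total=4)
# Replace the boolean ground truth with the string labels the evaluator emits.
labels = [RAILS_MAP[row["is_hallucination"]] for row in sample]
print(labels)
```

Each class contributes the same number of rows, which is why the sample size is rounded up to an even number before sampling.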


In [7]:
relevance_df = (
    wiki_qa_train.groupby("relevant", group_keys=False)
    .apply(
        lambda x: x.sample(n=math.ceil(N_EVAL_SAMPLE_SIZE / 2), random_state=42)
    )  # balanced sampling
    .replace({"relevant": RAG_RELEVANCY_PROMPT_RAILS_MAP})
    .rename({"relevant": "ground_truth_label"}, axis=1)
    .rename({"query_text": "input", "document_text": "reference"}, axis=1)
)
relevance_df.head(1).T

Unnamed: 0,1312
query_id,Q2690
input,what is darwin's origin of species
document_title,On the Origin of Species
reference,"On the Origin of Species, published on 24 November 1859, is a work of scientific literature by Charles Darwin which is considered to be the foundation of evolutionary biology . Its full title was On the Origin of Species by Means of Natural Selection, or the Preservation of Favoured Races in the Struggle for Life. For the sixth edition of 1872, the short title was changed to The Origin of Species. Darwin's book introduced the scientific theory that populations evolve over the course of generations through a process of natural selection . It presented a body of evidence that the diversity of life arose by common descent through a branching pattern of evolution . Darwin included evidence that he had gathered on the Beagle expedition in the 1830s and his subsequent findings from research, correspondence, and experimentation. Various evolutionary ideas had already been proposed to explain new findings in biology . There was growing support for such ideas among dissident anatomists and the general public, but during the first half of the 19th century the English scientific establishment was closely tied to the Church of England , while science was part of natural theology . Ideas about the transmutation of species were controversial as they conflicted with the beliefs that species were unchanging parts of a designed hierarchy and that humans were unique, unrelated to other animals. The political and theological implications were intensely debated, but transmutation was not accepted by the scientific mainstream. The book was written for non-specialist readers and attracted widespread interest upon its publication. As Darwin was an eminent scientist, his findings were taken seriously and the evidence he presented generated scientific, philosophical, and religious discussion. The debate over the book contributed to the campaign by T.H. Huxley and his fellow members of the X Club to secularise science by promoting scientific naturalism . 
Within two decades there was widespread scientific agreement that evolution, with a branching pattern of common descent, had occurred, but scientists were slow to give natural selection the significance that Darwin thought appropriate. During the "" eclipse of Darwinism "" from the 1880s to the 1930s, various other mechanisms of evolution were given more credit. With the development of the modern evolutionary synthesis in the 1930s and 1940s, Darwin's concept of evolutionary adaptation through natural selection became central to modern evolutionary theory, now the unifying concept of the life sciences ."
document_text_with_emphasis,"On the Origin of Species, published on 24 November 1859, is a work of scientific literature by Charles Darwin which is considered to be the foundation of evolutionary biology . Its full title was On the Origin of Species by Means of Natural Selection, or the Preservation of Favoured Races in the Struggle for Life. For the sixth edition of 1872, the short title was changed to The Origin of Species. Darwin's book introduced the scientific theory that populations evolve over the course of generations through a process of natural selection . It presented a body of evidence that the diversity of life arose by common descent through a branching pattern of evolution . Darwin included evidence that he had gathered on the Beagle expedition in the 1830s and his subsequent findings from research, correspondence, and experimentation. Various evolutionary ideas had already been proposed to explain new findings in biology . There was growing support for such ideas among dissident anatomists and the general public, but during the first half of the 19th century the English scientific establishment was closely tied to the Church of England , while science was part of natural theology . Ideas about the transmutation of species were controversial as they conflicted with the beliefs that species were unchanging parts of a designed hierarchy and that humans were unique, unrelated to other animals. The political and theological implications were intensely debated, but transmutation was not accepted by the scientific mainstream. The book was written for non-specialist readers and attracted widespread interest upon its publication. As Darwin was an eminent scientist, his findings were taken seriously and the evidence he presented generated scientific, philosophical, and religious discussion. The debate over the book contributed to the campaign by T.H. Huxley and his fellow members of the X Club to secularise science by promoting scientific naturalism . 
Within two decades there was widespread scientific agreement that evolution, with a branching pattern of common descent, had occurred, but scientists were slow to give natural selection the significance that Darwin thought appropriate. During the "" eclipse of Darwinism "" from the 1880s to the 1930s, various other mechanisms of evolution were given more credit. With the development of the modern evolutionary synthesis in the 1930s and 1940s, Darwin's concept of evolutionary adaptation through natural selection became central to modern evolutionary theory, now the unifying concept of the life sciences ."
ground_truth_label,irrelevant


# Instantiate the LLMs and set parameters

In [8]:
model_gpt4_turbo = OpenAIModel(model_name="gpt-4-1106-preview", temperature=0.0)
model_gpt4_turbo("Hello!")

(521, 'Hello! How can I assist you today?')

In [9]:
model_gpt4 = OpenAIModel(model_name="gpt-4-0613", temperature=0.0)
model_gpt4("Hello!")

(1217, 'Hello! How can I assist you today?')

In [10]:
model_gpt35_turbo = OpenAIModel(model_name="gpt-3.5-turbo-1106", temperature=0.0)
model_gpt35_turbo("Hello!")

(4629, 'Hello! How can I assist you today?')

In [11]:
model_gpt35_turbo_instruct = OpenAIModel(model_name="gpt-3.5-turbo-instruct", temperature=0.0)
model_gpt35_turbo_instruct("Hello!")

(900,
 ' I am a 22 year old female who is looking for a room to rent in the city of Toronto. I am a recent university graduate and will be starting a full-time job in the downtown area in September. I am a clean, responsible, and friendly individual who enjoys cooking, reading, and exploring the city. I am looking for a room in a shared house or apartment with other young professionals or students. My budget is around $800-1000 per month. Please contact me if you have a room available. Thank you!')

# Run evals in batches

## Hallucination

### gpt-3.5-turbo-instruct

In [12]:
model = model_gpt35_turbo_instruct
hallucination_classification_without_function_calling_without_explanation_gpt35_turbo_instruct = (
    llm_classify(
        dataframe=hallucination_df,
        template=HALLUCINATION_PROMPT_TEMPLATE,
        model=model,
        rails=list(HALLUCINATION_PROMPT_RAILS_MAP.values()),
        use_function_calling_if_available=False,
        provide_explanation=False,
    )
)
hallucination_classification_without_function_calling_with_explanation_gpt35_turbo_instruct = (
    llm_classify(
        dataframe=hallucination_df,
        template=HALLUCINATION_PROMPT_TEMPLATE,
        model=model,
        rails=list(HALLUCINATION_PROMPT_RAILS_MAP.values()),
        use_function_calling_if_available=False,
        provide_explanation=True,
    )
)

🐌!! If running llm_classify inside a notebook, patching the event loop with nest_asyncio will allow asynchronous eval submission, and is significantly faster. To patch the event loop, run `nest_asyncio.apply()`.


llm_classify |          | 0/2 (0.0%) | ⏳ 00:00<? | ?it/s


In [13]:
for name in (
    "hallucination_classification_without_function_calling_without_explanation_gpt35_turbo_instruct",
    "hallucination_classification_without_function_calling_with_explanation_gpt35_turbo_instruct",
):
    print(
        f"\n{name}\n",
        f"median latency: {locals()[name].latency.median()}ms\n",
        classification_report(
            hallucination_df.ground_truth_label, locals()[name].label, zero_division=0
        ),
        "-" * 100,
        sep="\n",
    )


hallucination_classification_without_function_calling_without_explanation_gpt35_turbo_instruct

median latency: 88.0ms

              precision    recall  f1-score   support

     factual       0.50      1.00      0.67         1
hallucinated       0.00      0.00      0.00         1

    accuracy                           0.50         2
   macro avg       0.25      0.50      0.33         2
weighted avg       0.25      0.50      0.33         2

----------------------------------------------------------------------------------------------------

hallucination_classification_without_function_calling_with_explanation_gpt35_turbo_instruct

median latency: 926.5ms

              precision    recall  f1-score   support

     factual       0.50      1.00      0.67         1
hallucinated       0.00      0.00      0.00         1

    accuracy                           0.50         2
   macro avg       0.25      0.50      0.33         2
weighted avg       0.25      0.50      0.33         2

-----
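
With only two evaluated rows, the numbers in these reports are easy to reconstruct by hand. The sketch below recomputes them in plain Python under the assumption that the model labeled both rows "factual", which is the only prediction pattern consistent with the precision and recall shown; `per_class_metrics` is a hypothetical helper, not part of scikit-learn or Phoenix.

```python
def per_class_metrics(y_true, y_pred, label):
    """Return (precision, recall, f1) for one label, using 0.0 when undefined."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == label and p == label for t, p in pairs)
    fp = sum(t != label and p == label for t, p in pairs)
    fn = sum(t == label and p != label for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0  # mirrors zero_division=0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


y_true = ["factual", "hallucinated"]
y_pred = ["factual", "factual"]  # assumed predictions, inferred from the report

factual = per_class_metrics(y_true, y_pred, "factual")
hallucinated = per_class_metrics(y_true, y_pred, "hallucinated")
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
macro_f1 = (factual[2] + hallucinated[2]) / 2

print(factual, hallucinated, accuracy, macro_f1)
```

A single misclassified row moves accuracy by 0.50 at this sample size, so `N_EVAL_SAMPLE_SIZE` needs to be raised before the reports say anything about model quality.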

### gpt-3.5-turbo

In [14]:
model = model_gpt35_turbo
hallucination_classification_without_function_calling_without_explanation_gpt35_turbo = (
    llm_classify(
        dataframe=hallucination_df,
        template=HALLUCINATION_PROMPT_TEMPLATE,
        model=model,
        rails=list(HALLUCINATION_PROMPT_RAILS_MAP.values()),
        use_function_calling_if_available=False,
        provide_explanation=False,
    )
)
hallucination_classification_with_function_calling_without_explanation_gpt35_turbo = llm_classify(
    dataframe=hallucination_df,
    template=HALLUCINATION_PROMPT_TEMPLATE,
    model=model,
    rails=list(HALLUCINATION_PROMPT_RAILS_MAP.values()),
    use_function_calling_if_available=True,
    provide_explanation=False,
)
hallucination_classification_with_function_calling_with_explanation_gpt35_turbo = llm_classify(
    dataframe=hallucination_df,
    template=HALLUCINATION_PROMPT_TEMPLATE,
    model=model,
    rails=list(HALLUCINATION_PROMPT_RAILS_MAP.values()),
    use_function_calling_if_available=True,
    provide_explanation=True,
)
hallucination_classification_without_function_calling_with_explanation_gpt35_turbo = llm_classify(
    dataframe=hallucination_df,
    template=HALLUCINATION_PROMPT_TEMPLATE,
    model=model,
    rails=list(HALLUCINATION_PROMPT_RAILS_MAP.values()),
    use_function_calling_if_available=False,
    provide_explanation=True,
)


In [15]:
for name in (
    "hallucination_classification_without_function_calling_without_explanation_gpt35_turbo",
    "hallucination_classification_with_function_calling_without_explanation_gpt35_turbo",
    "hallucination_classification_with_function_calling_with_explanation_gpt35_turbo",
    "hallucination_classification_without_function_calling_with_explanation_gpt35_turbo",
):
    print(
        f"\n{name}\n",
        f"median latency: {locals()[name].latency.median()}ms\n",
        classification_report(
            hallucination_df.ground_truth_label, locals()[name].label, zero_division=0
        ),
        "-" * 100,
        sep="\n",
    )


hallucination_classification_without_function_calling_without_explanation_gpt35_turbo

median latency: 1834.5ms

              precision    recall  f1-score   support

     factual       1.00      1.00      1.00         1
hallucinated       1.00      1.00      1.00         1

    accuracy                           1.00         2
   macro avg       1.00      1.00      1.00         2
weighted avg       1.00      1.00      1.00         2

----------------------------------------------------------------------------------------------------

hallucination_classification_with_function_calling_without_explanation_gpt35_turbo

median latency: 2201.0ms

              precision    recall  f1-score   support

     factual       1.00      1.00      1.00         1
hallucinated       1.00      1.00      1.00         1

    accuracy                           1.00         2
   macro avg       1.00      1.00      1.00         2
weighted avg       1.00      1.00      1.00         2

--------------------
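
The four runs above differ only in two boolean flags, so one possible way to organize them is a dictionary keyed by the flag pair instead of long variable names looked up through `locals()`. This is a sketch under assumptions: `run_eval` is a hypothetical stand-in that only echoes its settings; in the notebook it would wrap the `llm_classify` call with the arguments shown earlier.

```python
from itertools import product


def run_eval(use_function_calling, provide_explanation):
    """Hypothetical stand-in for llm_classify; returns only the settings used."""
    return {
        "use_function_calling_if_available": use_function_calling,
        "provide_explanation": provide_explanation,
    }


# One entry per (function calling, explanation) combination.
results = {
    (fc, expl): run_eval(fc, expl)
    for fc, expl in product([False, True], repeat=2)
}

for (fc, expl), result in results.items():
    name = (
        f"{'with' if fc else 'without'}_function_calling_"
        f"{'with' if expl else 'without'}_explanation"
    )
    print(name, result)
```

Keying by the flags keeps each configuration addressable, for example `results[(True, False)]`, and makes it harder to drop or duplicate a combination.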

### gpt-4

In [16]:
model = model_gpt4
hallucination_classification_without_function_calling_without_explanation_gpt4 = llm_classify(
    dataframe=hallucination_df,
    template=HALLUCINATION_PROMPT_TEMPLATE,
    model=model,
    rails=list(HALLUCINATION_PROMPT_RAILS_MAP.values()),
    use_function_calling_if_available=False,
    provide_explanation=False,
)
hallucination_classification_with_function_calling_without_explanation_gpt4 = llm_classify(
    dataframe=hallucination_df,
    template=HALLUCINATION_PROMPT_TEMPLATE,
    model=model,
    rails=list(HALLUCINATION_PROMPT_RAILS_MAP.values()),
    use_function_calling_if_available=True,
    provide_explanation=False,
)
hallucination_classification_with_function_calling_with_explanation_gpt4 = llm_classify(
    dataframe=hallucination_df,
    template=HALLUCINATION_PROMPT_TEMPLATE,
    model=model,
    rails=list(HALLUCINATION_PROMPT_RAILS_MAP.values()),
    use_function_calling_if_available=True,
    provide_explanation=True,
)
hallucination_classification_without_function_calling_with_explanation_gpt4 = llm_classify(
    dataframe=hallucination_df,
    template=HALLUCINATION_PROMPT_TEMPLATE,
    model=model,
    rails=list(HALLUCINATION_PROMPT_RAILS_MAP.values()),
    use_function_calling_if_available=False,
    provide_explanation=True,
)


In [17]:
for name in (
    "hallucination_classification_without_function_calling_without_explanation_gpt4",
    "hallucination_classification_with_function_calling_without_explanation_gpt4",
    "hallucination_classification_with_function_calling_with_explanation_gpt4",
    "hallucination_classification_without_function_calling_with_explanation_gpt4",
):
    print(
        f"\n{name}\n",
        f"median latency: {locals()[name].latency.median()}ms\n",
        classification_report(
            hallucination_df.ground_truth_label, locals()[name].label, zero_division=0
        ),
        "-" * 100,
        sep="\n",
    )


hallucination_classification_without_function_calling_without_explanation_gpt4

median latency: 838.0ms

              precision    recall  f1-score   support

     factual       0.00      0.00      0.00         1
hallucinated       0.50      1.00      0.67         1

    accuracy                           0.50         2
   macro avg       0.25      0.50      0.33         2
weighted avg       0.25      0.50      0.33         2

----------------------------------------------------------------------------------------------------

hallucination_classification_with_function_calling_without_explanation_gpt4

median latency: 1629.5ms

              precision    recall  f1-score   support

     factual       0.00      0.00      0.00         1
hallucinated       0.50      1.00      0.67         1

    accuracy                           0.50         2
   macro avg       0.25      0.50      0.33         2
weighted avg       0.25      0.50      0.33         2

-----------------------------------

### gpt-4-turbo

In [18]:
model = model_gpt4_turbo
hallucination_classification_without_function_calling_without_explanation_gpt4_turbo = llm_classify(
    dataframe=hallucination_df,
    template=HALLUCINATION_PROMPT_TEMPLATE,
    model=model,
    rails=list(HALLUCINATION_PROMPT_RAILS_MAP.values()),
    use_function_calling_if_available=False,
    provide_explanation=False,
)
hallucination_classification_with_function_calling_with_explanation_gpt4_turbo = llm_classify(
    dataframe=hallucination_df,
    template=HALLUCINATION_PROMPT_TEMPLATE,
    model=model,
    rails=list(HALLUCINATION_PROMPT_RAILS_MAP.values()),
    use_function_calling_if_available=True,
    provide_explanation=True,
)
hallucination_classification_with_function_calling_without_explanation_gpt4_turbo = llm_classify(
    dataframe=hallucination_df,
    template=HALLUCINATION_PROMPT_TEMPLATE,
    model=model,
    rails=list(HALLUCINATION_PROMPT_RAILS_MAP.values()),
    use_function_calling_if_available=True,
    provide_explanation=False,
)
hallucination_classification_without_function_calling_with_explanation_gpt4_turbo = llm_classify(
    dataframe=hallucination_df,
    template=HALLUCINATION_PROMPT_TEMPLATE,
    model=model,
    rails=list(HALLUCINATION_PROMPT_RAILS_MAP.values()),
    use_function_calling_if_available=False,
    provide_explanation=True,
)


In [19]:
for name in (
    "hallucination_classification_without_function_calling_without_explanation_gpt4_turbo",
    "hallucination_classification_with_function_calling_without_explanation_gpt4_turbo",
    "hallucination_classification_with_function_calling_with_explanation_gpt4_turbo",
    "hallucination_classification_without_function_calling_with_explanation_gpt4_turbo",
):
    print(
        f"\n{name}\n",
        f"median latency: {locals()[name].latency.median()}ms\n",
        classification_report(
            hallucination_df.ground_truth_label, locals()[name].label, zero_division=0
        ),
        "-" * 100,
        sep="\n",
    )


hallucination_classification_without_function_calling_without_explanation_gpt4_turbo

median latency: 423.0ms

              precision    recall  f1-score   support

     factual       0.00      0.00      0.00         1
hallucinated       0.50      1.00      0.67         1

    accuracy                           0.50         2
   macro avg       0.25      0.50      0.33         2
weighted avg       0.25      0.50      0.33         2

----------------------------------------------------------------------------------------------------

hallucination_classification_with_function_calling_without_explanation_gpt4_turbo

median latency: 771.0ms

              precision    recall  f1-score   support

     factual       0.00      0.00      0.00         1
hallucinated       0.50      1.00      0.67         1

    accuracy                           0.50         2
   macro avg       0.25      0.50      0.33         2
weighted avg       0.25      0.50      0.33         2

------------------------
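
The median latencies printed in the hallucination reports can be compared directly. The sketch below copies the values shown above for runs without explanations and computes how much slower the function-calling run was for each model. With two rows per run these ratios are only illustrative arithmetic, not a benchmark result.

```python
# Median latencies in ms, copied from the reports above:
# (without function calling, with function calling), both without explanations.
median_latency_ms = {
    "gpt-3.5-turbo": (1834.5, 2201.0),
    "gpt-4": (838.0, 1629.5),
    "gpt-4-turbo": (423.0, 771.0),
}

ratios = {
    model: with_fc / without_fc
    for model, (without_fc, with_fc) in median_latency_ms.items()
}

for model, ratio in ratios.items():
    print(f"{model}: function calling took {ratio:.2f}x the median latency")
```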

## Relevance

### gpt-3.5-turbo-instruct

In [20]:
model = model_gpt35_turbo_instruct
relevance_classification_without_function_calling_without_explanation_gpt35_turbo_instruct = (
    llm_classify(
        dataframe=relevance_df,
        template=RAG_RELEVANCY_PROMPT_TEMPLATE,
        model=model,
        rails=list(RAG_RELEVANCY_PROMPT_RAILS_MAP.values()),
        use_function_calling_if_available=False,
        provide_explanation=False,
    )
)
relevance_classification_without_function_calling_with_explanation_gpt35_turbo_instruct = (
    llm_classify(
        dataframe=relevance_df,
        template=RAG_RELEVANCY_PROMPT_TEMPLATE,
        model=model,
        rails=list(RAG_RELEVANCY_PROMPT_RAILS_MAP.values()),
        use_function_calling_if_available=False,
        provide_explanation=True,
    )
)

🐌!! If running llm_classify inside a notebook, patching the event loop with nest_asyncio will allow asynchronous eval submission, and is significantly faster. To patch the event loop, run `nest_asyncio.apply()`.


llm_classify |          | 0/2 (0.0%) | ⏳ 00:00<? | ?it/s

In [21]:
for name in (
    "relevance_classification_without_function_calling_without_explanation_gpt35_turbo_instruct",
    "relevance_classification_without_function_calling_with_explanation_gpt35_turbo_instruct",
):
    print(
        f"\n{name}\n",
        f"median latency: {locals()[name].latency.median()}ms\n",
        classification_report(
            relevance_df.ground_truth_label, locals()[name].label, zero_division=0
        ),
        "-" * 100,
        sep="\n",
    )


relevance_classification_without_function_calling_without_explanation_gpt35_turbo_instruct

median latency: 90.0ms

              precision    recall  f1-score   support

  irrelevant       0.00      0.00      0.00         1
    relevant       0.50      1.00      0.67         1

    accuracy                           0.50         2
   macro avg       0.25      0.50      0.33         2
weighted avg       0.25      0.50      0.33         2

----------------------------------------------------------------------------------------------------

relevance_classification_without_function_calling_with_explanation_gpt35_turbo_instruct

median latency: 1172.5ms

              precision    recall  f1-score   support

  irrelevant       0.00      0.00      0.00         1
    relevant       0.50      1.00      0.67         1

    accuracy                           0.50         2
   macro avg       0.25      0.50      0.33         2
weighted avg       0.25      0.50      0.33         2

------------
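
The median latency of 1172.5ms has a fractional part because each run covers two rows: the median of an even number of values is the mean of the two middle ones. A small illustration with Python's `statistics.median` follows. The two latency values are invented for the example and are not taken from the run.

```python
from statistics import median

# Hypothetical per-row latencies in milliseconds (not values from the notebook run)
latencies_ms = [1100.0, 1245.0]

# With an even number of observations, the median is the midpoint of the middle pair
median_latency = median(latencies_ms)
print(f"median latency: {median_latency}ms")
```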

In [22]:
model = model_gpt35_turbo
relevance_classification_without_function_calling_without_explanation_gpt35_turbo = llm_classify(
    dataframe=relevance_df,
    template=RAG_RELEVANCY_PROMPT_TEMPLATE,
    model=model,
    rails=list(RAG_RELEVANCY_PROMPT_RAILS_MAP.values()),
    use_function_calling_if_available=False,
    provide_explanation=False,
)
relevance_classification_with_function_calling_without_explanation_gpt35_turbo = llm_classify(
    dataframe=relevance_df,
    template=RAG_RELEVANCY_PROMPT_TEMPLATE,
    model=model,
    rails=list(RAG_RELEVANCY_PROMPT_RAILS_MAP.values()),
    use_function_calling_if_available=True,
    provide_explanation=False,
)
relevance_classification_with_function_calling_with_explanation_gpt35_turbo = llm_classify(
    dataframe=relevance_df,
    template=RAG_RELEVANCY_PROMPT_TEMPLATE,
    model=model,
    rails=list(RAG_RELEVANCY_PROMPT_RAILS_MAP.values()),
    use_function_calling_if_available=True,
    provide_explanation=True,
)
relevance_classification_without_function_calling_with_explanation_gpt35_turbo = llm_classify(
    dataframe=relevance_df,
    template=RAG_RELEVANCY_PROMPT_TEMPLATE,
    model=model,
    rails=list(RAG_RELEVANCY_PROMPT_RAILS_MAP.values()),
    use_function_calling_if_available=False,
    provide_explanation=True,
)

In [23]:
for name in (
    "relevance_classification_without_function_calling_without_explanation_gpt35_turbo",
    "relevance_classification_with_function_calling_without_explanation_gpt35_turbo",
    "relevance_classification_with_function_calling_with_explanation_gpt35_turbo",
    "relevance_classification_without_function_calling_with_explanation_gpt35_turbo",
):
    print(
        f"\n{name}\n",
        f"median latency: {locals()[name].latency.median()}ms\n",
        classification_report(
            relevance_df.ground_truth_label, locals()[name].label, zero_division=0
        ),
        "-" * 100,
        sep="\n",
    )


relevance_classification_without_function_calling_without_explanation_gpt35_turbo

median latency: 1539.0ms

              precision    recall  f1-score   support

  irrelevant       0.00      0.00      0.00         1
    relevant       0.50      1.00      0.67         1

    accuracy                           0.50         2
   macro avg       0.25      0.50      0.33         2
weighted avg       0.25      0.50      0.33         2

----------------------------------------------------------------------------------------------------

relevance_classification_with_function_calling_without_explanation_gpt35_turbo

median latency: 820.0ms

              precision    recall  f1-score   support

  irrelevant       0.00      0.00      0.00         1
    relevant       0.50      1.00      0.67         1

    accuracy                           0.50         2
   macro avg       0.25      0.50      0.33         2
weighted avg       0.25      0.50      0.33         2

-----------------------------
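
The four `llm_classify` calls for each model differ only in the `use_function_calling_if_available` and `provide_explanation` flags. One way to cut the repetition, sketched below, is to build the grid with `itertools.product` and keep the results in a dictionary, which also removes the need for the `locals()` lookup in the reporting loop. `variant_name`, `run_variants`, and `fake_classify` are hypothetical names. In the notebook, the stand-in would be replaced by a `functools.partial` of `llm_classify` with `dataframe`, `template`, `model`, and `rails` already bound.

```python
from itertools import product


def variant_name(task, model_suffix, function_calling, explanation):
    """Build a result name in the same pattern the notebook uses."""
    fc = "with" if function_calling else "without"
    ex = "with" if explanation else "without"
    return f"{task}_classification_{fc}_function_calling_{ex}_explanation_{model_suffix}"


def run_variants(classify, task, model_suffix):
    """Call `classify` once per flag combination and return results keyed by name."""
    results = {}
    for function_calling, explanation in product((False, True), repeat=2):
        name = variant_name(task, model_suffix, function_calling, explanation)
        results[name] = classify(
            use_function_calling_if_available=function_calling,
            provide_explanation=explanation,
        )
    return results


def fake_classify(use_function_calling_if_available, provide_explanation):
    """Hypothetical stand-in that echoes its flags instead of calling an LLM."""
    return (use_function_calling_if_available, provide_explanation)


results = run_variants(fake_classify, "relevance", "gpt35_turbo")
for name, flags in results.items():
    print(name, flags)
```

With the results in a dictionary, the reporting loop can iterate over `results.items()` directly instead of listing variable names as strings.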

### gpt-4

In [24]:
model = model_gpt4
relevance_classification_without_function_calling_without_explanation_gpt4 = llm_classify(
    dataframe=relevance_df,
    template=RAG_RELEVANCY_PROMPT_TEMPLATE,
    model=model,
    rails=list(RAG_RELEVANCY_PROMPT_RAILS_MAP.values()),
    use_function_calling_if_available=False,
    provide_explanation=False,
)
relevance_classification_with_function_calling_without_explanation_gpt4 = llm_classify(
    dataframe=relevance_df,
    template=RAG_RELEVANCY_PROMPT_TEMPLATE,
    model=model,
    rails=list(RAG_RELEVANCY_PROMPT_RAILS_MAP.values()),
    use_function_calling_if_available=True,
    provide_explanation=False,
)
relevance_classification_with_function_calling_with_explanation_gpt4 = llm_classify(
    dataframe=relevance_df,
    template=RAG_RELEVANCY_PROMPT_TEMPLATE,
    model=model,
    rails=list(RAG_RELEVANCY_PROMPT_RAILS_MAP.values()),
    use_function_calling_if_available=True,
    provide_explanation=True,
)
relevance_classification_without_function_calling_with_explanation_gpt4 = llm_classify(
    dataframe=relevance_df,
    template=RAG_RELEVANCY_PROMPT_TEMPLATE,
    model=model,
    rails=list(RAG_RELEVANCY_PROMPT_RAILS_MAP.values()),
    use_function_calling_if_available=False,
    provide_explanation=True,
)

In [25]:
for name in (
    "relevance_classification_without_function_calling_without_explanation_gpt4",
    "relevance_classification_with_function_calling_without_explanation_gpt4",
    "relevance_classification_with_function_calling_with_explanation_gpt4",
    "relevance_classification_without_function_calling_with_explanation_gpt4",
):
    print(
        f"\n{name}\n",
        f"median latency: {locals()[name].latency.median()}ms\n",
        classification_report(
            relevance_df.ground_truth_label, locals()[name].label, zero_division=0
        ),
        "-" * 100,
        sep="\n",
    )


relevance_classification_without_function_calling_without_explanation_gpt4

median latency: 591.0ms

              precision    recall  f1-score   support

  irrelevant       0.00      0.00      0.00         1
    relevant       0.50      1.00      0.67         1

    accuracy                           0.50         2
   macro avg       0.25      0.50      0.33         2
weighted avg       0.25      0.50      0.33         2

----------------------------------------------------------------------------------------------------

relevance_classification_with_function_calling_without_explanation_gpt4

median latency: 1043.0ms

              precision    recall  f1-score   support

  irrelevant       0.00      0.00      0.00         1
    relevant       0.50      1.00      0.67         1

    accuracy                           0.50         2
   macro avg       0.25      0.50      0.33         2
weighted avg       0.25      0.50      0.33         2

-------------------------------------------

### gpt-4-turbo

In [26]:
model = model_gpt4_turbo
relevance_classification_without_function_calling_without_explanation_gpt4_turbo = llm_classify(
    dataframe=relevance_df,
    template=RAG_RELEVANCY_PROMPT_TEMPLATE,
    model=model,
    rails=list(RAG_RELEVANCY_PROMPT_RAILS_MAP.values()),
    use_function_calling_if_available=False,
    provide_explanation=False,
)
relevance_classification_with_function_calling_with_explanation_gpt4_turbo = llm_classify(
    dataframe=relevance_df,
    template=RAG_RELEVANCY_PROMPT_TEMPLATE,
    model=model,
    rails=list(RAG_RELEVANCY_PROMPT_RAILS_MAP.values()),
    use_function_calling_if_available=True,
    provide_explanation=True,
)
relevance_classification_with_function_calling_without_explanation_gpt4_turbo = llm_classify(
    dataframe=relevance_df,
    template=RAG_RELEVANCY_PROMPT_TEMPLATE,
    model=model,
    rails=list(RAG_RELEVANCY_PROMPT_RAILS_MAP.values()),
    use_function_calling_if_available=True,
    provide_explanation=False,
)
relevance_classification_without_function_calling_with_explanation_gpt4_turbo = llm_classify(
    dataframe=relevance_df,
    template=RAG_RELEVANCY_PROMPT_TEMPLATE,
    model=model,
    rails=list(RAG_RELEVANCY_PROMPT_RAILS_MAP.values()),
    use_function_calling_if_available=False,
    provide_explanation=True,
)

In [27]:
for name in (
    "relevance_classification_without_function_calling_without_explanation_gpt4_turbo",
    "relevance_classification_with_function_calling_without_explanation_gpt4_turbo",
    "relevance_classification_with_function_calling_with_explanation_gpt4_turbo",
    "relevance_classification_without_function_calling_with_explanation_gpt4_turbo",
):
    print(
        f"\n{name}\n",
        f"median latency: {locals()[name].latency.median()}ms\n",
        classification_report(
            relevance_df.ground_truth_label, locals()[name].label, zero_division=0
        ),
        "-" * 100,
        sep="\n",
    )


relevance_classification_without_function_calling_without_explanation_gpt4_turbo

median latency: 1702.5ms

              precision    recall  f1-score   support

  irrelevant       0.00      0.00      0.00         1
    relevant       0.50      1.00      0.67         1

    accuracy                           0.50         2
   macro avg       0.25      0.50      0.33         2
weighted avg       0.25      0.50      0.33         2

----------------------------------------------------------------------------------------------------

relevance_classification_with_function_calling_without_explanation_gpt4_turbo

median latency: 807.0ms

              precision    recall  f1-score   support

  irrelevant       0.00      0.00      0.00         1
    relevant       0.50      1.00      0.67         1

    accuracy                           0.50         2
   macro avg       0.25      0.50      0.33         2
weighted avg       0.25      0.50      0.33         2

-------------------------------
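
The median latencies printed in the relevance reports can be placed side by side to compare runs with and without function calling (both without explanations). The numbers below are copied from the output above. Since each run covers only two rows, the differences are illustrative rather than a basis for ranking the settings. The dictionary layout is an assumption made for this sketch.

```python
# Median latencies (ms) copied from the printed reports, all without explanations:
# (without function calling, with function calling)
median_latency_ms = {
    "gpt35_turbo": (1539.0, 820.0),
    "gpt4": (591.0, 1043.0),
    "gpt4_turbo": (1702.5, 807.0),
}

# A positive delta means the function-calling run was slower
deltas = {
    model: with_fc - without_fc
    for model, (without_fc, with_fc) in median_latency_ms.items()
}

print(f"{'model':<12} {'without':>8} {'with':>8} {'delta':>8}")
for model, (without_fc, with_fc) in median_latency_ms.items():
    print(f"{model:<12} {without_fc:>8.1f} {with_fc:>8.1f} {deltas[model]:>+8.1f}")
```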