## Groundedness Evaluations

In many ways, feedback functions can be thought of as LLM apps themselves: given text, they return a result. Viewed this way, we can use TruLens to evaluate and track the quality of the feedback functions themselves. We can even compare different models (e.g. gpt-3.5 and gpt-4) or prompting schemes (such as chain-of-thought reasoning).

This notebook walks through an evaluation on a set of test cases generated from human-annotated datasets. In particular, we generate test cases from [SummEval](https://arxiv.org/abs/2007.12626).

SummEval is a dataset dedicated to automated evaluation of summarization tasks, which are closely related to groundedness evaluation in RAG: the retrieved context plays the role of the source, and the response plays the role of the summary. It contains human annotations with numerical scores (**1** to **5**) from 3 expert annotators and 5 crowd-sourced annotators. In total, 16 models generate summaries for 100 paragraphs in the test set, giving 1,600 machine-generated summaries. Each paragraph also has several human-written summaries for comparative analysis.


To evaluate groundedness feedback functions, we use the annotated "consistency" scores, a measure of whether the summarized response is factually consistent with the source text, which makes them a usable proxy for groundedness in our RAG triad. We normalize these scores to the range **0** to **1** to obtain our **expected_score**, matching the output range of the feedback functions.
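The normalization described above can be sketched as follows. This is a minimal illustration that assumes min-max rescaling of the averaged 1 to 5 score; the helper name `normalize_consistency` and the sample annotations are hypothetical, and `generate_summeval_groundedness_golden_set` may compute `expected_score` differently.

```python
from statistics import mean


def normalize_consistency(scores, low=1.0, high=5.0):
    """Average annotator scores on [low, high] and rescale the mean to [0, 1].

    Hypothetical helper: assumes min-max scaling, which may not match
    the notebook's own test-case generator.
    """
    if not scores:
        raise ValueError("need at least one annotator score")
    return (mean(scores) - low) / (high - low)


# Made-up consistency annotations for a single summary (not from SummEval).
expert_scores = [5, 4, 5]
expected_score = normalize_consistency(expert_scores)
print(f"expected_score = {expected_score:.4f}")
```

Under this assumption, a summary rated 1 by every annotator maps to 0 and one rated 5 by every annotator maps to 1, so the expected score is directly comparable with a feedback function's output.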

In [1]:
# Import groundedness feedback function
from trulens_eval.feedback import GroundTruthAgreement, Groundedness
from trulens_eval import TruBasicApp, Feedback, Tru, Select
from test_cases import generate_summeval_groundedness_golden_set

Tru().reset_database()

# generator for groundedness golden set
test_cases_gen = generate_summeval_groundedness_golden_set("./datasets/summeval_test_100.json")

🦑 Tru initialized with db url sqlite:///default.sqlite .
🛑 Secret keys may be written to the database. See the `database_redact_keys` option of `Tru` to prevent this.


In [2]:
# specify the number of test cases we want to run the smoke test on
groundedness_golden_set = []
for i in range(5):
    groundedness_golden_set.append(next(test_cases_gen))

In [3]:
groundedness_golden_set[:5]


[{'query': '(CNN)Donald Sterling\'s racist remarks cost him an NBA team last year. But now it\'s his former female companion who has lost big. A Los Angeles judge has ordered V. Stiviano to pay back more than $2.6 million in gifts after Sterling\'s wife sued her. In the lawsuit, Rochelle "Shelly" Sterling accused Stiviano of targeting extremely wealthy older men. She claimed Donald Sterling used the couple\'s money to buy Stiviano a Ferrari, two Bentleys and a Range Rover, and that he helped her get a $1.8 million duplex. Who is V. Stiviano? Stiviano countered that there was nothing wrong with Donald Sterling giving her gifts and that she never took advantage of the former Los Angeles Clippers owner, who made much of his fortune in real estate. Shelly Sterling was thrilled with the court decision Tuesday, her lawyer told CNN affiliate KABC. "This is a victory for the Sterling family in recovering the $2,630,000 that Donald lavished on a conniving mistress," attorney Pierce O\'Donnell s

In [4]:
import os
# Replace the placeholders with your own keys; never commit real keys to a notebook
os.environ["OPENAI_API_KEY"] = "sk-..."
os.environ["HUGGINGFACE_API_KEY"] = "hf_..."

### Benchmarking various Groundedness feedback function providers (OpenAI GPT-3.5-turbo vs GPT-4 vs Huggingface)

In [5]:
from trulens_eval.feedback.provider.hugs import Huggingface
from trulens_eval.feedback.provider import OpenAI
import numpy as np

huggingface_provider = Huggingface()
groundedness_hug = Groundedness(groundedness_provider=huggingface_provider)
f_groundedness_hug = Feedback(groundedness_hug.groundedness_measure, name = "Groundedness Huggingface").on_input().on_output().aggregate(groundedness_hug.grounded_statements_aggregator)
def wrapped_groundedness_hug(input, output):
    # Average all scores in the returned result dict into a single number
    return np.mean(list(f_groundedness_hug(input, output)[0].values()))
     
    
    
groundedness_openai = Groundedness(groundedness_provider=OpenAI(model_engine="gpt-3.5-turbo"))  # gpt-3.5-turbo is the default model if not specified
f_groundedness_openai = Feedback(groundedness_openai.groundedness_measure, name = "Groundedness OpenAI GPT-3.5").on_input().on_output().aggregate(groundedness_openai.grounded_statements_aggregator)
def wrapped_groundedness_openai(input, output):
    return f_groundedness_openai(input, output)[0]['full_doc_score']

groundedness_openai_gpt4 = Groundedness(groundedness_provider=OpenAI(model_engine="gpt-4"))
f_groundedness_openai_gpt4 = Feedback(groundedness_openai_gpt4.groundedness_measure, name = "Groundedness OpenAI GPT-4").on_input().on_output().aggregate(groundedness_openai_gpt4.grounded_statements_aggregator)
def wrapped_groundedness_openai_gpt4(input, output):
    return f_groundedness_openai_gpt4(input, output)[0]['full_doc_score']

✅ In Groundedness Huggingface, input source will be set to __record__.main_input or `Select.RecordInput` .
✅ In Groundedness Huggingface, input statement will be set to __record__.main_output or `Select.RecordOutput` .
✅ In Groundedness OpenAI GPT-3.5, input source will be set to __record__.main_input or `Select.RecordInput` .
✅ In Groundedness OpenAI GPT-3.5, input statement will be set to __record__.main_output or `Select.RecordOutput` .
✅ In Groundedness OpenAI GPT-4, input source will be set to __record__.main_input or `Select.RecordInput` .
✅ In Groundedness OpenAI GPT-4, input statement will be set to __record__.main_output or `Select.RecordOutput` .
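The wrappers above reduce each feedback result to a single number in two different ways: the Huggingface wrapper averages every value in the result dictionary, while the OpenAI wrappers read the `full_doc_score` entry. The sketch below mimics both reductions on made-up dictionaries. The `statement_*` keys and all values are hypothetical, and `statistics.fmean` stands in for `np.mean` to keep the example dependency-free.

```python
from statistics import fmean

# Hypothetical result shapes, for illustration only; the real dictionaries
# returned by the feedback functions may contain different keys.
per_statement_result = {"statement_0": 0.9, "statement_1": 0.5, "statement_2": 0.7}
full_doc_result = {"full_doc_score": 0.8}


def reduce_per_statement(result):
    """Collapse a dict of scores into one number by averaging its values."""
    return fmean(result.values())


def reduce_full_doc(result):
    """Collapse a result dict by reading its single whole-document score."""
    return float(result["full_doc_score"])


print(reduce_per_statement(per_statement_result))
print(reduce_full_doc(full_doc_result))
```

Either way, each wrapped function returns one float per (source, response) pair, which is what lets the same error metric be applied to all three providers.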


In [6]:
# Create a Feedback object using the mae method of the ground_truth object
ground_truth = GroundTruthAgreement(groundedness_golden_set)
# Pass the wrapped function's two arguments (prompt, response) and its output (score) to mae to get the mean absolute error
f_mae = Feedback(ground_truth.mae, name = "Mean Absolute Error").on(Select.Record.calls[0].args.args[0]).on(Select.Record.calls[0].args.args[1]).on_output()

✅ In Mean Absolute Error, input prompt will be set to __record__.calls[0].args.args[0] .
✅ In Mean Absolute Error, input response will be set to __record__.calls[0].args.args[1] .
✅ In Mean Absolute Error, input score will be set to __record__.main_output or `Select.RecordOutput` .
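As the selector messages above indicate, the feedback receives the prompt, the response, and the score produced by the wrapped function. A plausible reading is that it compares that score with the golden set's `expected_score` and that the per-record absolute differences are averaged on the leaderboard. The sketch below shows only that arithmetic on made-up numbers; `mean_absolute_error` is a hypothetical helper, not the TruLens implementation.

```python
def mean_absolute_error(expected, predicted):
    """Return the mean of |expected_i - predicted_i| over paired scores."""
    if len(expected) != len(predicted):
        raise ValueError("expected and predicted must have the same length")
    if not expected:
        raise ValueError("need at least one pair of scores")
    return sum(abs(e - p) for e, p in zip(expected, predicted)) / len(expected)


# Made-up scores in [0, 1]: human-derived expectations vs. feedback outputs.
expected_scores = [1.0, 0.75, 0.5]
feedback_scores = [0.9, 0.5, 0.5]

mae = mean_absolute_error(expected_scores, feedback_scores)
print(f"MAE = {mae:.4f}")
```

A lower value means the feedback function's scores sit closer to the human annotations, which is why the leaderboard is sorted in ascending order of this metric.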


In [7]:
tru_wrapped_groundedness_hug = TruBasicApp(wrapped_groundedness_hug, app_id = "groundedness huggingface", feedbacks=[f_mae])
tru_wrapped_groundedness_openai = TruBasicApp(wrapped_groundedness_openai, app_id = "groundedness openai gpt-3.5", feedbacks=[f_mae])
tru_wrapped_groundedness_openai_gpt4 = TruBasicApp(wrapped_groundedness_openai_gpt4, app_id = "groundedness openai gpt-4", feedbacks=[f_mae])

In [8]:
for i in range(len(groundedness_golden_set)):
    source = groundedness_golden_set[i]["query"]
    response = groundedness_golden_set[i]["response"]
    with tru_wrapped_groundedness_hug as recording:
        tru_wrapped_groundedness_hug.app(source, response)
    with tru_wrapped_groundedness_openai as recording:
        tru_wrapped_groundedness_openai.app(source, response)
    with tru_wrapped_groundedness_openai_gpt4 as recording:
        tru_wrapped_groundedness_openai_gpt4.app(source, response)

A new object of type <class 'trulens_eval.tru_basic_app.TruWrapperApp'> at 0x7f010c177bd0 is calling an instrumented method <function TruWrapperApp._call at 0x7f010ca5a020>. The path of this call may be incorrect.
Guessing path of new object is app based on other object (0x7f010c00df90) using this function.
Feedback function `groundedness_measure` was renamed to `groundedness_measure_with_cot_reasons`. The new functionality of `groundedness_measure` function will no longer emit reasons as a lower cost option. It may have reduced accuracy due to not using Chain of Thought reasoning in the scoring.


Groundendess per statement in source:   0%|          | 0/9 [00:00<?, ?it/s]

Waiting for {'error': 'Model MoritzLaurer/DeBERTa-v3-base-mnli-fever-docnli-ling-2c is currently loading', 'estimated_time': 20.0} (20.0) second(s).
Unsure what the main input string is for the call to _call with args ['(CNN)Donald Sterling\'s racist remarks cost him an NBA team last year. But now it\'s his former female companion who has lost big. A Los Angeles judge has ordered V. Stiviano to pay back more than $2.6 million in gifts after Sterling\'s wife sued her. In the lawsuit, Rochelle "Shelly" Sterling accused Stiviano of targeting extremely wealthy older men. She claimed Donald Sterling used the couple\'s money to buy Stiviano a Ferrari, two Bentleys and a Range Rover, and that he helped her get a $1.8 million duplex. Who is V. Stiviano? Stiviano countered that there was nothing wrong with Donald Sterling giving her gifts and that she never took advantage of the former Los Angeles Clippers owner, who made much of his fortune in real estate. Shelly Sterling was thrilled with the

RuntimeError: API openai request failed 4 time(s).

In [14]:
Tru().get_leaderboard(app_ids=[]).sort_values(by="Mean Absolute Error")

| app_id | Mean Absolute Error | latency | total_cost |
| --- | --- | --- | --- |
| groundedness openai gpt-4 | 0.088 | 3.59 | 0.028865 |
| groundedness openai gpt-3.5 | 0.1856 | 3.59 | 0.001405 |
| groundedness huggingface | 0.239318 | 3.59 | 0.0 |
