## Simple example: evaluating the accuracy of the model's responses against the ground truth with TruLens

In [40]:
from trulens.feedback import GroundTruthAgreement

# 1. Define multiple queries and variables used in the queries
queries = ["What is the capital of France?", "What is the capital of Germany?"]
variables = [{"country": "France"}, {"country": "Germany"}]

# 2. Set ground truths and model predictions
ground_truths = [
    {"query": "What is the capital of France?", "expected_response": "Paris"},
    {"query": "What is the capital of Germany?", "expected_response": "Berlin"}
]

# Model's predictions (these are now in simple string format)
predictions = ["Paris", "Munich"]  # The response to each of the queries

# 3. Use GroundTruthAgreement's agreement_measure
# Initialize GroundTruthAgreement with the ground truth records (keys: 'query', 'expected_response')
ground_truth_agreement = GroundTruthAgreement(ground_truth=ground_truths)

# Calculate agreement scores for each prediction using the prompt (query) and response (predicted response)
agreement_scores = []
for i in range(len(queries)):
    score = ground_truth_agreement.agreement_measure(
        prompt=queries[i],           # The prompt (query) from which the prediction is made
        response=predictions[i]      # The predicted response to the query
    )
    agreement_scores.append(score)

print(f"Agreement Scores: {agreement_scores}")


Agreement Scores: [(1.0, {'ground_truth_response': 'Paris'}), (0.2, {'ground_truth_response': 'Berlin'})]
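
Each entry printed above is a `(score, metadata)` tuple. The following is a minimal sketch, assuming that tuple shape holds as in the printed output, of how the scores can be separated from the metadata and averaged. The sample values are copied from the output above, and `mean_score` is only an illustrative name.

```python
from statistics import mean

# Sample values copied from the printed output above.
agreement_scores = [
    (1.0, {"ground_truth_response": "Paris"}),
    (0.2, {"ground_truth_response": "Berlin"}),
]

# Separate the numeric scores from the metadata dictionaries.
scores = [score for score, _meta in agreement_scores]
expected = [meta["ground_truth_response"] for _score, meta in agreement_scores]

mean_score = mean(scores)
print(f"Scores: {scores}, expected responses: {expected}, mean: {mean_score:.2f}")
```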


## Ground Truth Evaluation for our application (Categorization)

- Evaluation method to use
    - Ground truth evaluation: we have a golden dataset, and the goal is to evaluate the accuracy of the model's responses to the queries against the ground truth.
    - QA-based evaluation: the golden dataset consists of queries and expected responses.
- Metrics used for the classification task:
    - Ground Truth Agreement: compares the similarity between the model's response and the ground truth.
        - Accuracy: a score from 0 to 1, depending on how closely the response matches.
        - In general, ground truth evaluation takes only the input and the response into account, so it suits a non-RAG application like this one.
- Metrics that can be used for summarization (for other features of the insurance product)
    - ROUGE: Recall-Oriented Understudy for Gisting Evaluation is a group of metrics for evaluating LLM summarization and machine translation. It uses a numerical scale from 0 to 1.
    - [Bert Score](https://huggingface.co/spaces/evaluate-metric/bertscore)
    - Groundedness (hallucination check)
        - LLM-based: measures whether the model's response is grounded in the input query/context.
        - NLI-based: measures the groundedness of the model's response using natural language inference.
        - [trulens.providers.huggingface - groundedness_measure_with_nli](https://www.trulens.org/reference/trulens/providers/huggingface/?h=groundedness#trulens.providers.huggingface.Huggingface.groundedness_measure_with_nli)   / [groundedness_measure_with_cot_reasons](https://www.trulens.org/reference/trulens/feedback/?h=groundedness_measure_with_#trulens.feedback.LLMProvider.groundedness_measure_with_cot_reasons) 
        - [groundedness evaluation - reference link](https://www.trulens.org/component_guides/evaluation_benchmarks/groundedness_benchmark/?h=groundedness#benchmarking-various-groundedness-feedback-function-providers-openai-gpt-35-turbo-vs-gpt-4-vs-huggingface): for reference only
        - [summarization evaluation - reference link](https://www.trulens.org/cookbook/use_cases/summarization_eval/#dependencies): can be adopted for summarization evaluation
    - Refer to [summarization evaluation - reference link](https://www.trulens.org/cookbook/use_cases/summarization_eval/#dependencies) and the `Test_2_Eval_TruLens_3_[GroundTruth]_[Summary Evaluation].ipynb` file for a summarization example
- Metrics not relevant to our application 
    - Perplexity 
    - BLEU : Bilingual Evaluation Understudy evaluates the precision of LLM-generated text, or how closely it resembles human sources, using a numerical scale from 0 to 1.
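
To make the 0-to-1 scale of ROUGE mentioned in the list above concrete, here is a minimal sketch of ROUGE-1 (unigram overlap) in plain Python. It assumes naive whitespace tokenization and is not the implementation used by any particular library; the `rouge1` helper and the sample sentences are hypothetical.

```python
from collections import Counter

def rouge1(reference: str, candidate: str) -> dict:
    """Unigram-overlap ROUGE-1 using naive whitespace tokenization."""
    ref_counts = Counter(reference.lower().split())
    cand_counts = Counter(candidate.lower().split())
    # Clipped overlap: a token counts at most as often as it appears in both texts.
    overlap = sum((ref_counts & cand_counts).values())
    recall = overlap / max(sum(ref_counts.values()), 1)
    precision = overlap / max(sum(cand_counts.values()), 1)
    f1 = 0.0 if overlap == 0 else 2 * precision * recall / (precision + recall)
    return {"recall": recall, "precision": precision, "f1": f1}

# Hypothetical reference summary and model summary, for illustration only.
scores = rouge1(
    reference="the policy covers dental care",
    candidate="the policy covers dental check-ups and cleanings",
)
print(scores)
```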
--- 
### Limitation with TruLens    
The best way to evaluate our application on the categorization task is to measure the accuracy of the model's responses to the queries against the ground truth.

It turned out that evaluating prompts against a ground truth dataset does not require the application to have retrieval or embeddings. Other frameworks such as LangSmith work the same way.

However, TruLens' ground truth feedback functions take only the input query.
- It is difficult to pass a structured prompt format to the feedback functions. For example, I wanted to pass a query and its variables separately to the feedback functions, but could not find a way to do so. (This is possible with LangSmith.)
- In particular, I wanted to check the groundedness of the model's response against the "content" field of the input query, but this is difficult. Using a selector on a specific field is neither intuitive nor well documented.

LangSmith can be a good alternative, as it works with structured prompt formats and provides better documentation with rich examples.
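
One workaround, and the one this notebook relies on, is to render the prompt template and its variables into a single string before it reaches the feedback function, so that the same rendered string serves as the app input and as the `query` key of the golden set. The following is a minimal sketch of that idea in plain Python. The template text, the sample record, and the `render_query` helper are hypothetical stand-ins and not part of TruLens.

```python
# Hypothetical template; in this notebook the real one is loaded from a prompt file.
HUMAN_TEMPLATE = "Company: {company}\nTitle: {title}\nContent: {content}"

def render_query(variables: dict) -> str:
    """Flatten structured variables into the single input string."""
    return HUMAN_TEMPLATE.format(**variables)

# Hypothetical annotated record, for illustration only.
records = [
    {"company": "Acme", "title": "Dental plan", "content": "Covers check-ups.", "label": "yes"},
]

# The rendered string is used both as the app input and as the ground truth lookup key.
golden_set = [
    {
        "query": render_query({k: r[k] for k in ("company", "title", "content")}),
        "expected_response": r["label"],
    }
    for r in records
]
print(golden_set[0]["query"])
```

The trade-off is that the individual fields are no longer addressable once flattened, which is why a per-field groundedness check remains awkward.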

--- 
### References

TruLens reference : 
- [trulens.core.schema.groundtruth](https://www.trulens.org/reference/trulens/core/schema/groundtruth/)
    - (attr) query: the query for which the ground truth is provided (for example, the page content from which the category of the product is derived).
    - (attr) response: the ground truth response (the expected answer of the agent to the prompt, not the query itself).
    - Example: `groundtruth_obj(prompt, response)`
- [trulens.feedback.groundtruth](https://www.trulens.org/reference/trulens/feedback/groundtruth/)
- [with_record](https://www.trulens.org/reference/trulens/apps/basic/?h=with_record#trulens.apps.basic.TruBasicApp.print_instrumented_components)
- [Apps](https://www.trulens.org/reference/apps/) : wrapper 
    - TruBasicApp
    - TruCustomApp (this is used here)
    - TruVirtual
    - Optionally: TruChain
- [groundedness evaluation - reference link](https://www.trulens.org/component_guides/evaluation_benchmarks/groundedness_benchmark/?h=groundedness#benchmarking-various-groundedness-feedback-function-providers-openai-gpt-35-turbo-vs-gpt-4-vs-huggingface) / [Text to Text](https://www.trulens.org/getting_started/quickstarts/text2text_quickstart/)   
- [trulens.providers.huggingface - groundedness_measure_with_nli](https://www.trulens.org/reference/trulens/providers/huggingface/?h=groundedness#trulens.providers.huggingface.Huggingface.groundedness_measure_with_nli) 
    - Groundedness measure with NLI: tracks whether the source material supports each sentence in the statement using an NLI model. The response is first split into statements with a sentence tokenizer; the NLI model then processes each statement against the entire source.
- [groundedness_measure_with_cot_reasons](https://www.trulens.org/reference/trulens/feedback/?h=groundedness_measure_with_#trulens.feedback.LLMProvider.groundedness_measure_with_cot_reasons) 
    - A measure to track whether the source material supports each sentence in the statement using an LLM provider. The statement is first split by a tokenizer into its component sentences. Trivial statements are then eliminated so as not to dilute the evaluation. The LLM processes each statement, using chain-of-thought methodology to emit the reasons. Abstentions are considered grounded.
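
The sentence-by-sentence procedure described in the two groundedness entries above can be sketched as follows. This is a toy illustration only: `toy_support_score` is a word-overlap stand-in for a real NLI model or LLM judge, the regex splitter stands in for a proper sentence tokenizer, and averaging the per-sentence scores is an assumption rather than a statement about how TruLens aggregates them.

```python
import re

def split_sentences(text: str) -> list[str]:
    """Naive splitter on '.', '!' and '?' (a stand-in for a real sentence tokenizer)."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def toy_support_score(source: str, sentence: str) -> float:
    """Stand-in for an NLI model: fraction of the sentence's words found in the source."""
    source_words = set(re.findall(r"\w+", source.lower()))
    words = re.findall(r"\w+", sentence.lower())
    return sum(w in source_words for w in words) / len(words) if words else 0.0

def groundedness(source: str, statement: str) -> float:
    """Score each sentence against the whole source, then average (an assumption)."""
    sentences = split_sentences(statement)
    if not sentences:
        return 0.0
    return sum(toy_support_score(source, s) for s in sentences) / len(sentences)

# Hypothetical source text and model response, for illustration only.
source = "The plan covers dental check-ups and cleanings."
result = groundedness(source, "The plan covers cleanings. It also covers surgery abroad.")
print(result)
```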

Template Structured Output Reference :
- chat completion format reference ( passing variables to messages )
    - https://docs.smith.langchain.com/prompt_engineering/tutorials/optimize_classifier 
- chat completion structured output reference
    - https://platform.openai.com/docs/guides/structured-outputs
 

Metric Examples 
- [Benchmark indices - in Korean](https://wikidocs.net/252253) 
- [Evaluation Concepts - LangChain Documentation](https://docs.smith.langchain.com/evaluation/concepts)  
- [Gowri Shankar - Evaluating Large Language Models Generated Contents with TruEra’s TruLens](https://gowrishankar.info/blog/evaluating-large-language-models-generated-contents-with-trueras-trulens/)
- [Medium - Evaluating LLM Systems Metrics, Challenges, and Best Practices](https://medium.com/data-science-at-microsoft/evaluating-llm-systems-metrics-challenges-and-best-practices-664ac25be7e5)
- [granica - Large Language Model Evaluation: The Complete Guide](https://granica.ai/blog/large-language-model-evaluation-grc)
- [arize - LLM Evaluation: Everything You Need To Run, Benchmark LLM Evals](https://arize.com/blog-course/llm-evaluation-the-definitive-guide/)  
- [Hugging Face - Bert Score ](https://huggingface.co/spaces/evaluate-metric/bertscore)

Other Reference 
- [TruLens GroundTruth Example Blog - Gowri Shankar](https://gowrishankar.info/blog/evaluating-large-language-models-generated-contents-with-trueras-trulens/)
- [LangSmith Classification Example](https://docs.smith.langchain.com/prompt_engineering/tutorials/optimize_classifier)
- [TruLens Example Blog](https://lablab.ai/t/trulens-tutorial-langchain-chatbot)  
- [RAGAS example - in Korean](https://beeny-ds.tistory.com/entry/RAGASLangSmith-로-LLM-생성-데이터-평가하기)
- [Medium - Multi-label Text Classification using Transformers(BERT)](https://medium.com/analytics-vidhya/multi-label-text-classification-using-transformers-bert-93460838e62b)
- [Velog - BERTScore: Evaluating Text Generation with BERT ( in Korean )](https://velog.io/@tobigs-nlp/BERTScore-Evaluating-Text-Generation-with-BERT)

## Set API Keys

In [3]:
import os
from dotenv import load_dotenv
import pandas as pd

# Load API keys from a local .env file into the environment
load_dotenv()
OpenAI_key = os.getenv("OPENAI_API_KEY") 
Huggingface_key = os.getenv("HUGGINGFACE_API_KEY")
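
`os.getenv` returns `None` without complaint when a variable is missing, so a typo in the `.env` file tends to surface only later as an authentication error. A minimal sketch of an early check is shown below; the `missing_env_vars` helper is hypothetical, and the variable names are the two used in the cell above.

```python
import os

def missing_env_vars(names, environ=os.environ):
    """Return the names that are unset or empty in the given environment mapping."""
    return [name for name in names if not environ.get(name)]

# Report missing keys up front instead of failing on the first API call.
missing = missing_env_vars(["OPENAI_API_KEY", "HUGGINGFACE_API_KEY"])
if missing:
    print(f"Missing environment variables: {', '.join(missing)}")
```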

## Load the dataset to be used as a golden set

In [4]:
df = pd.read_csv('stage_1_df.csv')

## Start a TruLens Session and run the dashboard 

In [43]:
from trulens.core import TruSession
from trulens.dashboard import run_dashboard

session = TruSession()
session.reset_database()
run_dashboard(session, force=True) 

Updating app_name and app_version in apps table: 0it [00:00, ?it/s]
Updating app_id in records table: 0it [00:00, ?it/s]
Updating app_json in apps table: 0it [00:00, ?it/s]

Starting dashboard ...






Dashboard started at http://192.168.178.102:50754 .


<Popen: returncode: None args: ['streamlit', 'run', '--server.headless=True'...>

## Define a custom application 

In [6]:
# Load prompts
prompt_path = "prompts/classification_prompts/"

def read_prompt(prompt_path):
    with open(prompt_path, 'r') as file:
        return file.read()

system_input_stage_1 = read_prompt(prompt_path + "classification_system_stage_1.txt")
human_input_stage_1 = read_prompt(prompt_path + "classification_human_stage_1.txt")

In [45]:
# Define Pydantic models
from pydantic import BaseModel, Field
from langchain_core.output_parsers import PydanticOutputParser 

# Schema for the structured output returned by the model
class ResponseModel(BaseModel):
    answer: str = Field(..., description="answer to the question")

parser = PydanticOutputParser(pydantic_object=ResponseModel)
format_inst = parser.get_format_instructions()


In [46]:
# Define a first app to use

from trulens.apps.custom import instrument
import openai 

class CategorizationApp:
    @instrument
    def categorize(self, human_input_query):
        client = openai.OpenAI()
        response = (
            client.beta.chat.completions.parse(
                model="gpt-4o",
                messages=[
                    {
                        "role": "system", 
                        "content": system_input_stage_1.format(format_instructions=format_inst)
                    },
                    {
                        "role": "user", 
                        "content": human_input_query 
                    }
                ],
                response_format=ResponseModel
            )
            .choices[0]
            .message.parsed
        )
        return response.answer 

In [47]:
categorization_app = CategorizationApp() 

## Input Example 
System query: stays the same  
Human query: a query that includes the information needed to answer the question  


In [48]:
print(categorization_app.categorize(human_input_stage_1.format(company=df.iloc[0]['company'], title=df.iloc[0]['title'], content=df.iloc[0]['content'])))

yes


## Define a golden dataset for each task (e.g., categorization, summarization)
Here, only the categorization task is covered.

1. We already have a golden dataset: the dataset generated by other agents.
    - The dataset has to be annotated (human supervision is required at this point).
2. The ground truth dataset used for evaluation is a list of dictionaries with the keys 'query' and 'expected_response'.  
    - `query` is the question that the model is supposed to answer.
    - `expected_response` is the ground truth answer.
    - If these keys are not set correctly, a KeyError is raised.

For more details, refer to [trulens.feedback.groundtruth](https://www.trulens.org/reference/trulens/feedback/groundtruth/)  

TruLens will run the application and compare the model's response to the ground truth answer.  
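
Since a wrong key name only shows up as a key error at evaluation time, it can help to check the records before handing them to TruLens. The following is a minimal sketch in plain Python; `REQUIRED_KEYS`, `find_invalid_records`, and the sample records are hypothetical helpers for illustration, not part of the TruLens API.

```python
REQUIRED_KEYS = {"query", "expected_response"}

def find_invalid_records(golden_set):
    """Return (index, missing_keys) for every record lacking a required key."""
    problems = []
    for i, record in enumerate(golden_set):
        missing = REQUIRED_KEYS - set(record)
        if missing:
            problems.append((i, sorted(missing)))
    return problems

sample = [
    {"query": "What is the capital of France?", "expected_response": "Paris"},
    {"query": "What is the capital of Germany?", "response": "Berlin"},  # wrong key name
]
print(find_invalid_records(sample))
```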

In [8]:
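# Collect one formatted query per row; assigning inside the loop would keep only the last row's query.
query_dic = {'query': []}
for index, row in df.iterrows():
    query_dic['query'].append(human_input_stage_1.format(company=row['company'], title=row['title'], content=row['content']))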

In [13]:
import pprint 

categorization_golden_set = pd.DataFrame({
    'query': query_dic['query'],
    'expected_response': df['is_insuranxce']
}).to_dict("records")

pprint.pprint(categorization_golden_set[0])
# print(categorization_golden_set[0])

{'expected_response': 'yes',
 'query': 'Here are the definitions of categories and sub-categories of '
          'insurance in the following comma separated format: \n'
          '\n'
          '[Example format start]\n'
          'Main Category name, Sub Category name, Sub Category Description\n'
          '[Example format end] \n'
          '\n'
          '[start of insurance categories]\n'
          'main_category,sub_category,description\n'
          'Health Insurance,Individual Health Insurance,Covers medical '
          'expenses for an individual policyholder.\n'
          'Health Insurance,Family Health Insurance,Provides health coverage '
          'for the entire family under a single policy.\n'
          'Health Insurance,Critical Illness Insurance,Offers a lump sum '
          'benefit upon diagnosis of a critical illness.\n'
          'Health Insurance,Dental Insurance,"Covers dental care expenses '
          'including check-ups, cleanings, and procedures."\n'
          '

## Define feedback functions

In [51]:
from trulens.core import Feedback
from trulens.feedback import GroundTruthAgreement
from trulens.providers.openai import OpenAI
from trulens.apps.custom import TruCustomApp
# from trulens.core import Select
# from trulens.providers.huggingface import Huggingface

provider = OpenAI(model_engine="gpt-4o")
# hug_provider = Huggingface() # for groundedness evaluation 

gta = GroundTruthAgreement(categorization_golden_set, provider=provider)

f_groundtruth = Feedback(
    gta.agreement_measure, name="Ground Truth Similarity (LLM)"
).on_input_output()

✅ In Ground Truth Similarity (LLM), input prompt will be set to __record__.main_input or `Select.RecordInput` .
✅ In Ground Truth Similarity (LLM), input response will be set to __record__.main_output or `Select.RecordOutput` .


In [52]:
# Define a wrapper ( Instrument the callable for logging with TruLens )
categorization_recorder = TruCustomApp(
    categorization_app,
    app_name="stage_1_categorization",
    app_version="v1",
    feedbacks=[
        f_groundtruth
    ],
)

## Check the output of ground truth agreement measure for the first 10 records 

In [53]:
for row in categorization_golden_set[:10]:
    print(gta.agreement_measure(
        prompt = row['query'], 
        response = categorization_app.categorize(human_input_query = row['query'])
    ))

(1.0, {'ground_truth_response': 'yes'})
(1.0, {'ground_truth_response': 'yes'})
(1.0, {'ground_truth_response': 'yes'})
(1.0, {'ground_truth_response': 'yes'})
(1.0, {'ground_truth_response': 'yes'})
(1.0, {'ground_truth_response': 'yes'})
(1.0, {'ground_truth_response': 'yes'})
(1.0, {'ground_truth_response': 'yes'})
(1.0, {'ground_truth_response': 'yes'})
(1.0, {'ground_truth_response': 'yes'})
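
Because the expected responses in this task are short labels such as `yes`, a plain exact-match accuracy can serve as a cheap cross-check next to the LLM-judged agreement score. A minimal sketch follows; the `normalize` and `exact_match_accuracy` helpers and the sample predictions are hypothetical and only illustrate the idea.

```python
def normalize(label: str) -> str:
    """Lowercase and drop surrounding whitespace and '.'/'!' characters before comparing."""
    return label.strip().strip(".!").lower()

def exact_match_accuracy(predictions, expected):
    """Fraction of predictions that equal the expected label after normalization."""
    if len(predictions) != len(expected):
        raise ValueError("predictions and expected must have the same length")
    if not expected:
        return 0.0
    hits = sum(normalize(p) == normalize(e) for p, e in zip(predictions, expected))
    return hits / len(expected)

# Hypothetical model outputs and labels, for illustration only.
accuracy = exact_match_accuracy(["yes", "Yes.", "no"], ["yes", "yes", "yes"])
print(accuracy)
```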


## Test a single run of the app. This should show up on the dashboard. 

In [54]:
with categorization_recorder:
    categorization_app.categorize(human_input_query=categorization_golden_set[0]['query'])

In [55]:
session.get_leaderboard()

| app_name | app_version | latency | total_cost |
| --- | --- | --- | --- |
| stage_1_categorization | v1 | 1.35927 | 0.0 |


## Evaluate the prompt for 50 records with recording 

In [56]:
for row in categorization_golden_set[:50]:
    with categorization_recorder:
        categorization_app.categorize(human_input_query=row['query'])


In [57]:
# show the result 
session.get_leaderboard()

| app_name | app_version | Comprehensiveness | Ground Truth Similarity (LLM) | Groundedness - LLM Judge | latency | total_cost |
| --- | --- | --- | --- | --- | --- | --- |
| Summarization example | v1 | 0.612302 | | 1.0 | 1.215207 | 0.001027 |
| stage_1_categorization | v1 | | 1.0 | | 1.272471 | 0.0 |
| stage_1_categorization_TruChain | Chain1 | | | | 0.879068 | 0.015775 |


In [58]:
# show the result on the web 
run_dashboard(session, force=True) 

Starting dashboard ...



Dashboard started at http://192.168.178.102:50899 .


<Popen: returncode: None args: ['streamlit', 'run', '--server.headless=True'...>