In [3]:
import os
from dotenv import load_dotenv
import pandas as pd  # Import pandas to fix the error

# Loading API KEY from environment
load_dotenv()
OpenAI_key = os.getenv("OPENAI_API_KEY") 
Huggingface_key = os.getenv("HUGGINGFACE_API_KEY")

## Easy Example of how to evaluate the accuracy of the model's response to the queries based on the ground truth with TruLens 

In [4]:
from trulens.feedback import GroundTruthAgreement

# 1. Define multiple queries and variables used in the queries
queries = ["What is the capital of France?", "What is the capital of Germany?"]
variables = [{"country": "France"}, {"country": "Germany"}]

# 2. Set ground truths and model predictions
ground_truths = [
    {"query": "What is the capital of France?", "expected_response": "Paris"},
    {"query": "What is the capital of Germany?", "expected_response": "Berlin"}
]

# Model's predictions (these are now in simple string format)
predictions = ["Paris", "Munich"]  # The response to each of the queries

# 4. Use GroundTruthAgreement's agreement_measure
# Initialize GroundTruthAgreement using the corrected ground_truth format
ground_truth_agreement = GroundTruthAgreement(ground_truth=ground_truths)

# Calculate agreement scores for each prediction using the prompt (query) and response (predicted response)
agreement_scores = []
for i in range(len(queries)):
    score = ground_truth_agreement.agreement_measure(
        prompt=queries[i],           # The prompt (query) from which the prediction is made
        response=predictions[i]      # The predicted response to the query
    )
    agreement_scores.append(score)

print(f"Agreement Scores: {agreement_scores}")


ü¶ë Initialized with db url sqlite:///default.sqlite .
üõë Secret keys may be written to the database. See the `database_redact_keys` option of `TruSession` to prevent this.


Updating app_name and app_version in apps table: 0it [00:00, ?it/s]
Updating app_id in records table: 0it [00:00, ?it/s]
Updating app_json in apps table: 0it [00:00, ?it/s]


Agreement Scores: [(1.0, {'ground_truth_response': 'Paris'}), (0.2, {'ground_truth_response': 'Berlin'})]


## Ground Truth Evaluation for our application ( Categorization )

- Evaluation Method to use 
    - Ground Truth Evaluation - we have a golden data set and our goal is to evaluate the accuracy of the model's response to the queries based on the ground truth. 
    - QA based evaluation, golden dataset consists of query and expected response. 
- Metrics used for classification task : 
    - Ground Truth Agreement : compare the similiarity between the model's response and the ground truth. 
        - accuracy : 0 ~ 1 depending on the exactness of the response 
        - In general, ground truth evaluation only takes the input and the response into account. Thus, for a non-RAG application like this, it best suits the purpose. 
- Metrics that can be used for summarization ( for other features of insurance product ) 
    - ROUGE : Recall-Oriented Understudy for Gisting Evaluation is a group of metrics that evaluate LLM summarization and NLP (natural language processing) translations. It also uses a numerical scale from 0 to 1.
    - [Bert Score](https://huggingface.co/spaces/evaluate-metric/bertscore)
    - Groundedness ( Hallucination check )
        - llm : measures whether the model's response is grounded in the input query/context. 
        - nli : measure the groundedness of the model's response using natural language inference.
        - [trulens.providers.huggingface - groundedness_measure_with_nli](https://www.trulens.org/reference/trulens/providers/huggingface/?h=groundedness#trulens.providers.huggingface.Huggingface.groundedness_measure_with_nli)   / [groundedness_measure_with_cot_reasons](https://www.trulens.org/reference/trulens/feedback/?h=groundedness_measure_with_#trulens.feedback.LLMProvider.groundedness_measure_with_cot_reasons) 
        - [groundedness evaluation - reference link](https://www.trulens.org/component_guides/evaluation_benchmarks/groundedness_benchmark/?h=groundedness#benchmarking-various-groundedness-feedback-function-providers-openai-gpt-35-turbo-vs-gpt-4-vs-huggingface) : just for reference
        - [summarization evaluation - reference link](https://www.trulens.org/cookbook/use_cases/summarization_eval/#dependencies) : can be adopted for summarization evaluation 
    - Refer to [summarization evaluation - reference link](https://www.trulens.org/cookbook/use_cases/summarization_eval/#dependencies), and ```Test_2_Eval_TruLens_3_[GroundTruth]_[Summary Evaluation].ipynb file``` for summary example 
- Metrics not relevant to our application 
    - Perplexity 
    - BLEU : Bilingual Evaluation Understudy evaluates the precision of LLM-generated text, or how closely it resembles human sources, using a numerical scale from 0 to 1.
--- 
### Limitation with TruLens    
The best way to evaluate our application for the categorization task is to evaluate the accruacy of the model's response to the queries based on the ground truth. 

It turned out that evaluating prompts based on the ground truth datasets does not require for the application to have retrieval nor embeddings. Interestingly, other frameworks such as LangSmith work the same way. 

However, TruLens' groundtruth feedback functions only take the input query.
- It is difficult to pass a structured prompt format to the feedback functions. For example, I want to pass a query and variables separately to the feedback functions, but I could not find a way to do so. ( This is possible with LangSmith )
- Especially, I wanted to check the groundedness of the model's response based on the "content" field of the input query, but it is difficult to do so. The use of selector on a specific field is not intuitive nor easy to understand. ( poor documentation )  

LangSmith can be a good alternative as it works with structured prompt formats and provides a better documentation with rich examples 

--- 
### References

TruLens reference : 
 - [trules.core.schema.groundtruth](https://www.trulens.org/reference/trulens/core/schema/groundtruth/)
    - (attr) query : The query for which the ground truth is provided. ( for example, page content from which the category of the product is derived )
    - (attr) response : The ground truth response.   ( the agent's response to the prompt / not the query ! )  
    ex) ```groundtruth_obj(prompt, response)```  
- [trulens.feedback.groundtruth](https://www.trulens.org/reference/trulens/feedback/groundtruth/)
- [with_record](https://www.trulens.org/reference/trulens/apps/basic/?h=with_record#trulens.apps.basic.TruBasicApp.print_instrumented_components)
- [Apps](https://www.trulens.org/reference/apps/) : wrapper 
    - TruBasicAPP
    - TruCustomApp ( This is used here )
    - TurVirtual 
    - Optionally : TruChain 
- [groundedness evaluation - reference link](https://www.trulens.org/component_guides/evaluation_benchmarks/groundedness_benchmark/?h=groundedness#benchmarking-various-groundedness-feedback-function-providers-openai-gpt-35-turbo-vs-gpt-4-vs-huggingface) / [Text to Text](https://www.trulens.org/getting_started/quickstarts/text2text_quickstart/)   
- [trulens.providers.huggingface - groundedness_measure_with_nli](https://www.trulens.org/reference/trulens/providers/huggingface/?h=groundedness#trulens.providers.huggingface.Huggingface.groundedness_measure_with_nli) 
    - groundedness measure with nli ( A measure to track if the source material supports each sentence in the statement using an NLI model. First the response will be split into statements using a sentence tokenizer.The NLI model will process each statement using a natural language inference model, and will use the entire source.)   
- [groundedness_measure_with_cot_reasons](https://www.trulens.org/reference/trulens/feedback/?h=groundedness_measure_with_#trulens.feedback.LLMProvider.groundedness_measure_with_cot_reasons) 
    - A measure to track if the source material supports each sentence in the statement using an LLM provider. The statement will first be split by a tokenizer into its component sentences. Then, trivial statements are eliminated so as to not dilute the evaluation.The LLM will process each statement, using chain of thought methodology to emit the reasons. Abstentions will be considered as grounded.

Template Structured Output Reference :
- chat completion format reference ( passing variables to messages )
    - https://docs.smith.langchain.com/prompt_engineering/tutorials/optimize_classifier 
- chat completion structured output reference
    - https://platform.openai.com/docs/guides/structured-outputs
 

Metric Examples 
- [Benchmark indice - in Korean](https://wikidocs.net/252253) 
- [Evaluation Concept - Langchain Documnet](https://docs.smith.langchain.com/evaluation/concepts)  
- [Gowri Shankar - Evaluating Large Language Models Generated Contents with TruEra‚Äôs TruLens](https://gowrishankar.info/blog/evaluating-large-language-models-generated-contents-with-trueras-trulens/)
- [Medium - Evaluating LLM Systems Metrics, Challenges, and Best Practices](https://medium.com/data-science-at-microsoft/evaluating-llm-systems-metrics-challenges-and-best-practices-664ac25be7e5)
- [granica - Large Language Model Evaluation: The Complete Guide](https://granica.ai/blog/large-language-model-evaluation-grc)
- [arize - LLM Evaluation: Everything You Need To Run, Benchmark LLM Evals](https://arize.com/blog-course/llm-evaluation-the-definitive-guide/)  
- [Hugging Face - Bert Score ](https://huggingface.co/spaces/evaluate-metric/bertscore)

Other Reference 
- [TruLens GroundTruth Example Blog - Gowri Shankar](https://gowrishankar.info/blog/evaluating-large-language-models-generated-contents-with-trueras-trulens/)
- [LangSmith Classification Example](https://docs.smith.langchain.com/prompt_engineering/tutorials/optimize_classifier)
- [TruLens Example Blog](https://lablab.ai/t/trulens-tutorial-langchain-chatbot)  
- [RAGAS example - in Korean](https://beeny-ds.tistory.com/entry/RAGASLangSmith-Î°ú-LLM-ÏÉùÏÑ±-Îç∞Ïù¥ÌÑ∞-ÌèâÍ∞ÄÌïòÍ∏∞)
- [Medium - Multi-label Text Classification using Transformers(BERT)](https://medium.com/analytics-vidhya/multi-label-text-classification-using-transformers-bert-93460838e62b)
- [Velog - BERTScore: Evaluating Text Generation with BERT ( in Korean )](https://velog.io/@tobigs-nlp/BERTScore-Evaluating-Text-Generation-with-BERT)

Some other strenght on LangSmith 
- (Labeled) Criteria such as "helpfulness". 


## Start a TruLens Session and run the dashboard 

In [6]:
from trulens.core import TruSession
from trulens.dashboard import run_dashboard

session = TruSession()
session.reset_database()
run_dashboard(session, force=True) 

Updating app_name and app_version in apps table: 0it [00:00, ?it/s]
Updating app_id in records table: 0it [00:00, ?it/s]
Updating app_json in apps table: 0it [00:00, ?it/s]

Force stopping dashboard ...
Starting dashboard ...





Accordion(children=(VBox(children=(VBox(children=(Label(value='STDOUT'), Output())), VBox(children=(Label(valu‚Ä¶

Dashboard started at http://192.168.178.104:51433 .


<Popen: returncode: None args: ['streamlit', 'run', '--server.headless=True'...>

#### Load prompts and data 

In [21]:
# Load prompts
prompt_path = "prompts/"

def read_prompt(prompt_path):
    with open(prompt_path, 'r') as file:
        return file.read()

system_input_stage_1 = read_prompt(prompt_path + "system_insurance_classification_prompt.txt")
human_input_stage_1 = read_prompt(prompt_path + "human_insurance_classification_prompt.txt") 

df = pd.read_excel('ground_truth_set.xlsx')

#### Define custom application 

In [22]:
# Define Pydantic models
from pydantic import BaseModel, Field
from langchain_core.output_parsers import PydanticOutputParser 

class classify_category_model(BaseModel):
    category: str = Field(description="determine the category of the insurance product",
                          enum = [                             
                             "Term Life Insurance", "Whole Life Insurance", "Pension Insurance", "Disability Insurance", 
                             "Long-Term Care Pension Insurance", "Health Insurance", "Critical Illness Insurance", "Basic Ability Insurance", 
                             "Long-Term Care Cost Insurance", "Long-Term Care Daily Allowance Insurance", "Liability Insurance", 
                             "Business Interruption Insurance", "Home Contents Insurance", "Building Insurance", "Business Property Insurance", 
                             "Commercial Insurance", "Loan Repayment Insurance", "Construction Performance Insurance", 
                             "Machinery Breakdown and Machinery Insurance", "Credit Insurance", "Fidelity Guarantee Insurance", "Erection Insurance", 
                             "Natural Disaster Insurance", "Accident Insurance", "Travel Insurance", "Transport Insurance", "Private Unemployment Insurance", 
                             "Pet Insurance", "Driver Protection Insurance", "Legal Protection Insurance"]
    )

parser = PydanticOutputParser(pydantic_object=classify_category_model)  # Updated to use the renamed class
format_inst = parser.get_format_instructions() 


In [23]:
# Define a first app to use

from trulens.apps.custom import instrument
import openai 

class CategorizationApp:
    @instrument
    def categorize(self, human_input_query):
        client = openai.OpenAI()
        response = (
            client.beta.chat.completions.parse(
                model="gpt-4o",
                messages=[
                    {
                        "role": "system", 
                        "content": system_input_stage_1.format(format_instructions=format_inst)
                    },
                    {
                        "role": "user", 
                        "content": human_input_query 
                    }
                ],
                response_format=classify_category_model
            )
            .choices[0].message.parsed
        )
        return response.category

In [24]:
categorization_app = CategorizationApp() 

## Input Example 
System query : stays the same   
Human query : Pass a query including the necessary information to answer the question   


In [11]:
print(categorization_app.categorize(human_input_stage_1.format(company=df.iloc[0]['company'], title=df.iloc[0]['title'], content=df.iloc[0]['content'])))

Liability Insurance


## Define a golden data set  per each task (e.g., categorization, summarization, etc )
Here, only for the categorization task. 

1. We already have a golden data set : The data set we generated from other agents 
    - We have to annotate the data set ( human supervision is required now )
2. The ground truth data set to be used for evaluation is a dictionary with keys 'query' and 'expected_response'.  
    - Query is the question that the model is supposed to answer. 
    - Expected_response is the answer is the ground truth answer we have.  
    - When these keys are not set correctly, there will be a key-error.    

For more details, refer to [trulens.feedback.groundtruth](https://www.trulens.org/reference/trulens/feedback/groundtruth/)  

TruLens will run the application and compare the model's response to the ground truth answer.  

In [25]:
# build a complete prompt by passing the variables to the prompt 
query_dic = {'query': []}
for index, row in df.iterrows():
    query_dic['query'].append(human_input_stage_1.format(company=row['company'], title=row['title'], content=row['content']))

In [29]:
# build a QA dataset 
import pprint 

categorization_golden_set = pd.DataFrame({
    'query': query_dic['query'],
    'expected_response': df['category']
}).to_dict("records")

print(pd.DataFrame(categorization_golden_set)['expected_response'].to_list())

['Commercial Insurance', 'Liability Insurance', 'Commercial Insurance', 'Commercial Insurance', 'Commercial Insurance', 'Driver Protection Insurance', 'Commercial Insurance', 'Legal Protection Insurance', 'Legal Protection Insurance', 'Health Insurance', 'Commercial Insurance', 'Commercial Insurance', 'Commercial Insurance', 'Health Insurance', 'Commercial Insurance', 'Legal Protection Insurance', 'Legal Protection Insurance', 'Legal Protection Insurance', 'Legal Protection Insurance', 'Legal Protection Insurance']


## Define feedback functions

In [30]:
from trulens.core import Feedback
from trulens.feedback import GroundTruthAgreement
from trulens.providers.openai import OpenAI
from trulens.apps.custom import TruCustomApp
# from trulens.core import Select
# from trulens.providers.huggingface import Huggingface

provider = OpenAI(model_engine="gpt-4o")
# hug_provider = Huggingface() # for groundedness evaluation 

gta = GroundTruthAgreement(categorization_golden_set, provider=provider)

f_groundtruth = Feedback(
    gta.agreement_measure, name="Ground Truth Similarity (LLM)"
).on_input_output()

‚úÖ In Ground Truth Similarity (LLM), input prompt will be set to __record__.main_input or `Select.RecordInput` .
‚úÖ In Ground Truth Similarity (LLM), input response will be set to __record__.main_output or `Select.RecordOutput` .


In [31]:
# Define a wrapper ( Instrument the callable for logging with TruLens )
categorization_recorder = TruCustomApp(
    categorization_app,
    app_name="categorization",
    app_version="v1",
    feedbacks=[
        f_groundtruth
    ],
)

WARNI [trulens.apps.custom] Function <function CategorizationApp.categorize at 0x141fa8160> was not found during instrumentation walk. Make sure it is accessible by traversing app <__main__.CategorizationApp object at 0x1430ee8c0> or provide a bound method for it as TruCustomApp constructor argument `methods_to_instrument`.


## Check the output of ground truth agreement measure for the first 10 records 

In [60]:
sum = 0
for row in categorization_golden_set:
    response = categorization_app.categorize(human_input_query = row['query'])
    score = gta.agreement_measure(
        prompt = row['query'], 
        response = categorization_app.categorize(human_input_query = row['query'])
    )
    print(f"score: {score[0]:<2} | ground truth: {score[1]['ground_truth_response']:<30} | received: {response:<30}")   
    sum += score[0]
print(f"Average score: {sum / len(categorization_golden_set)}")

score: 1.0 | ground truth: Commercial Insurance           | received: Commercial Insurance          
score: 0.0 | ground truth: Liability Insurance            | received: Liability Insurance           
score: 0.0 | ground truth: Commercial Insurance           | received: Liability Insurance           
score: 1.0 | ground truth: Commercial Insurance           | received: Legal Protection Insurance    
score: 0.1 | ground truth: Commercial Insurance           | received: Legal Protection Insurance    
score: 0.0 | ground truth: Driver Protection Insurance    | received: Commercial Insurance          
score: 0.0 | ground truth: Commercial Insurance           | received: Legal Protection Insurance    
score: 1.0 | ground truth: Legal Protection Insurance     | received: Legal Protection Insurance    
score: 1.0 | ground truth: Legal Protection Insurance     | received: Legal Protection Insurance    
score: 1.0 | ground truth: Health Insurance               | received: Health Insurance     

## Evaluate the prompt 

In [39]:
# for row in categorization_golden_set:
#     with categorization_recorder:
#             categorization_app.categorize(human_input_query = row['query'])

# To avoid the token per minute limit, we can use the following code
# https://www.trulens.org/cookbook/use_cases/summarization_eval/#write-feedback-functions
from tenacity import retry
from tenacity import stop_after_attempt
from tenacity import wait_random_exponential 

@retry(wait=wait_random_exponential(min=1, max=60), stop=stop_after_attempt(6))
def run_with_backoff(doc):
    return categorization_recorder.with_record(categorization_app.categorize, human_input_query = row['query'])

for row in categorization_golden_set:
    llm_response = run_with_backoff(row["query"])
    print(llm_response)

('Commercial Insurance', Record(record_id='record_hash_b3ad8b467c95fb5333a10962512936eb', app_id='app_hash_064b90faeb1d9d62a9ae967cdb89ccf7', cost=Cost(n_requests=0, n_successful_requests=0, n_completion_requests=0, n_classification_requests=0, n_classes=0, n_embedding_requests=0, n_embeddings=0, n_tokens=0, n_stream_chunks=0, n_prompt_tokens=0, n_completion_tokens=0, n_cortex_guardrails_tokens=0, cost=0.0, cost_currency='USD'), perf=Perf(start_time=datetime.datetime(2025, 1, 17, 2, 55, 38, 859439), end_time=datetime.datetime(2025, 1, 17, 2, 55, 39, 905472)), ts=datetime.datetime(2025, 1, 17, 2, 55, 39, 906132), tags='-', meta=None, main_input='Here are the examples of insurance categories \n\n[start of insurance categories]\n- Term Life Insurance\n- Whole Life Insurance\n- Pension Insurance\n- Disability Insurance\n- Long-Term Care Pension Insurance\n- Health Insurance\n- Critical Illness Insurance\n- Basic Ability Insurance\n- Long-Term Care Cost Insurance\n- Long-Term Care Daily All

ERROR [trulens.core.instruments] Error calling wrapped function categorize.
ERROR [trulens.core.instruments] Traceback (most recent call last):
  File "/Users/jaeyeopchung/.pyenv/versions/test-en/lib/python3.10/site-packages/trulens/core/instruments.py", line 769, in tru_wrapper
    rets, tally = core_endpoint.Endpoint.track_all_costs_tally(
  File "/Users/jaeyeopchung/.pyenv/versions/test-en/lib/python3.10/site-packages/trulens/core/feedback/endpoint.py", line 589, in track_all_costs_tally
    result, cbs = Endpoint.track_all_costs(
  File "/Users/jaeyeopchung/.pyenv/versions/test-en/lib/python3.10/site-packages/trulens/core/feedback/endpoint.py", line 551, in track_all_costs
    return Endpoint._track_costs(
  File "/Users/jaeyeopchung/.pyenv/versions/test-en/lib/python3.10/site-packages/trulens/core/feedback/endpoint.py", line 666, in _track_costs
    result: T = __func(*args, **kwargs)
  File "/var/folders/7_/lzvh2hfd7nbfj2n6q9k0hb980000gn/T/ipykernel_39040/2287920619.py", line 11,

('Legal Protection Insurance', Record(record_id='record_hash_394e937c9032413a48db24e2e307b7c5', app_id='app_hash_064b90faeb1d9d62a9ae967cdb89ccf7', cost=Cost(n_requests=0, n_successful_requests=0, n_completion_requests=0, n_classification_requests=0, n_classes=0, n_embedding_requests=0, n_embeddings=0, n_tokens=0, n_stream_chunks=0, n_prompt_tokens=0, n_completion_tokens=0, n_cortex_guardrails_tokens=0, cost=0.0, cost_currency='USD'), perf=Perf(start_time=datetime.datetime(2025, 1, 17, 2, 57, 16, 722583), end_time=datetime.datetime(2025, 1, 17, 2, 57, 31, 811929)), ts=datetime.datetime(2025, 1, 17, 2, 57, 31, 812305), tags='-', meta=None, main_input='Here are the examples of insurance categories \n\n[start of insurance categories]\n- Term Life Insurance\n- Whole Life Insurance\n- Pension Insurance\n- Disability Insurance\n- Long-Term Care Pension Insurance\n- Health Insurance\n- Critical Illness Insurance\n- Basic Ability Insurance\n- Long-Term Care Cost Insurance\n- Long-Term Care Dai

ERROR [trulens.core.instruments] Error calling wrapped function categorize.
ERROR [trulens.core.instruments] Traceback (most recent call last):
  File "/Users/jaeyeopchung/.pyenv/versions/test-en/lib/python3.10/site-packages/trulens/core/instruments.py", line 769, in tru_wrapper
    rets, tally = core_endpoint.Endpoint.track_all_costs_tally(
  File "/Users/jaeyeopchung/.pyenv/versions/test-en/lib/python3.10/site-packages/trulens/core/feedback/endpoint.py", line 589, in track_all_costs_tally
    result, cbs = Endpoint.track_all_costs(
  File "/Users/jaeyeopchung/.pyenv/versions/test-en/lib/python3.10/site-packages/trulens/core/feedback/endpoint.py", line 551, in track_all_costs
    return Endpoint._track_costs(
  File "/Users/jaeyeopchung/.pyenv/versions/test-en/lib/python3.10/site-packages/trulens/core/feedback/endpoint.py", line 666, in _track_costs
    result: T = __func(*args, **kwargs)
  File "/var/folders/7_/lzvh2hfd7nbfj2n6q9k0hb980000gn/T/ipykernel_39040/2287920619.py", line 11,

('Health Insurance', Record(record_id='record_hash_a4f76af9609c1b0dca1390605d0ecbe2', app_id='app_hash_064b90faeb1d9d62a9ae967cdb89ccf7', cost=Cost(n_requests=0, n_successful_requests=0, n_completion_requests=0, n_classification_requests=0, n_classes=0, n_embedding_requests=0, n_embeddings=0, n_tokens=0, n_stream_chunks=0, n_prompt_tokens=0, n_completion_tokens=0, n_cortex_guardrails_tokens=0, cost=0.0, cost_currency='USD'), perf=Perf(start_time=datetime.datetime(2025, 1, 17, 2, 58, 13, 392081), end_time=datetime.datetime(2025, 1, 17, 2, 58, 28, 726429)), ts=datetime.datetime(2025, 1, 17, 2, 58, 28, 726740), tags='-', meta=None, main_input='Here are the examples of insurance categories \n\n[start of insurance categories]\n- Term Life Insurance\n- Whole Life Insurance\n- Pension Insurance\n- Disability Insurance\n- Long-Term Care Pension Insurance\n- Health Insurance\n- Critical Illness Insurance\n- Basic Ability Insurance\n- Long-Term Care Cost Insurance\n- Long-Term Care Daily Allowan

ERROR [trulens.core.instruments] Error calling wrapped function categorize.
ERROR [trulens.core.instruments] Traceback (most recent call last):
  File "/Users/jaeyeopchung/.pyenv/versions/test-en/lib/python3.10/site-packages/trulens/core/instruments.py", line 769, in tru_wrapper
    rets, tally = core_endpoint.Endpoint.track_all_costs_tally(
  File "/Users/jaeyeopchung/.pyenv/versions/test-en/lib/python3.10/site-packages/trulens/core/feedback/endpoint.py", line 589, in track_all_costs_tally
    result, cbs = Endpoint.track_all_costs(
  File "/Users/jaeyeopchung/.pyenv/versions/test-en/lib/python3.10/site-packages/trulens/core/feedback/endpoint.py", line 551, in track_all_costs
    return Endpoint._track_costs(
  File "/Users/jaeyeopchung/.pyenv/versions/test-en/lib/python3.10/site-packages/trulens/core/feedback/endpoint.py", line 666, in _track_costs
    result: T = __func(*args, **kwargs)
  File "/var/folders/7_/lzvh2hfd7nbfj2n6q9k0hb980000gn/T/ipykernel_39040/2287920619.py", line 11,

('Legal Protection Insurance', Record(record_id='record_hash_2cfe20400db3414fe68f744766cecf7a', app_id='app_hash_064b90faeb1d9d62a9ae967cdb89ccf7', cost=Cost(n_requests=0, n_successful_requests=0, n_completion_requests=0, n_classification_requests=0, n_classes=0, n_embedding_requests=0, n_embeddings=0, n_tokens=0, n_stream_chunks=0, n_prompt_tokens=0, n_completion_tokens=0, n_cortex_guardrails_tokens=0, cost=0.0, cost_currency='USD'), perf=Perf(start_time=datetime.datetime(2025, 1, 17, 2, 59, 45, 771479), end_time=datetime.datetime(2025, 1, 17, 3, 0, 1, 202273)), ts=datetime.datetime(2025, 1, 17, 3, 0, 1, 202440), tags='-', meta=None, main_input='Here are the examples of insurance categories \n\n[start of insurance categories]\n- Term Life Insurance\n- Whole Life Insurance\n- Pension Insurance\n- Disability Insurance\n- Long-Term Care Pension Insurance\n- Health Insurance\n- Critical Illness Insurance\n- Basic Ability Insurance\n- Long-Term Care Cost Insurance\n- Long-Term Care Daily A

In [59]:
# show the result 
session.get_leaderboard()["Ground Truth Similarity (LLM)"].values


array([0.40263158])

In [21]:
# show the result on the web 
run_dashboard(session, force=True) 

Starting dashboard ...


Accordion(children=(VBox(children=(VBox(children=(Label(value='STDOUT'), Output())), VBox(children=(Label(valu‚Ä¶

Dashboard started at http://192.168.178.104:52147 .


<Popen: returncode: None args: ['streamlit', 'run', '--server.headless=True'...>