## [summarization evaluation - reference link](https://www.trulens.org/cookbook/use_cases/summarization_eval/#dependencies) : Ground Truth Evaluation 
In this example, the content (a dialog) is used as the prompt input instead of a query. 
The summary is evaluated by
- Ground Truth 
- Groundedness ( Tokenized )
- Comprehensiveness ( Tokenized )
- BERT Score 
- BLEU 
- ROUGE

However, BLEU is not relevant to our application.   
For the evaluation metrics, refer to the ```3. Evaluation_Trulens_GroundTruth_Categorization.ipynb```



### Why is this example related to our task?  
1. Groundedness for summarization: why is it relevant to our application?   
Later on, we will compare different products, and I assume the output will take the form of a summary of the two products.   
The prompt will take several variables from the different products, and the output will be a summary of that prompt.   

For example, 
- Prompt : “What is the strength of company A over company B? Here is the information on the two companies : {information of two companies will be inserted as a variable here} “ 
- Response : “Company A has a, b, c while Company B has b, e, f. Thus Company A has strength in a, c cases, and company B has strength in e, f. Both companies offer b.” 

So this is more or less similar to a summarization task.

2. Ground truth   
We will also have a previously generated response for the prompt above.  
It is reasonable to compare that earlier response with the newly generated one.  
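
The comparison prompt described in item 1 could be assembled roughly as follows. This is only a minimal sketch, assuming each company's information arrives as a plain dictionary; `PROMPT_TEMPLATE`, `build_comparison_prompt`, and the `name` / `features` fields are hypothetical names chosen for illustration.

```python
# Hypothetical helper: names and fields are illustrative assumptions,
# not part of the TruLens example.
PROMPT_TEMPLATE = (
    "What is the strength of {name_a} over {name_b}? "
    "Here is the information on the two companies:\n"
    "{name_a}: {info_a}\n"
    "{name_b}: {info_b}"
)


def build_comparison_prompt(company_a: dict, company_b: dict) -> str:
    """Insert the features of two companies into the prompt template."""
    return PROMPT_TEMPLATE.format(
        name_a=company_a["name"],
        info_a=", ".join(company_a["features"]),
        name_b=company_b["name"],
        info_b=", ".join(company_b["features"]),
    )


prompt = build_comparison_prompt(
    {"name": "Company A", "features": ["a", "b", "c"]},
    {"name": "Company B", "features": ["b", "e", "f"]},
)
print(prompt)
```

Because the whole filled-in prompt is the model input, it is also the text that a groundedness check would treat as the source.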

## Metrics 
- Metrics that can be used for summarization (for other features of the insurance product) 
    - ROUGE : Recall-Oriented Understudy for Gisting Evaluation is a group of metrics for evaluating LLM summarization and NLP (natural language processing) translation. Scores range from 0 to 1.
    - [BERT Score](https://huggingface.co/spaces/evaluate-metric/bertscore)
    - Groundedness ( Hallucination check )
        - llm : measures whether the model's response is grounded in the input query/context. 
        - nli : measures the groundedness of the model's response using natural language inference.
        - [trulens.providers.huggingface - groundedness_measure_with_nli](https://www.trulens.org/reference/trulens/providers/huggingface/?h=groundedness#trulens.providers.huggingface.Huggingface.groundedness_measure_with_nli)   / [groundedness_measure_with_cot_reasons](https://www.trulens.org/reference/trulens/feedback/?h=groundedness_measure_with_#trulens.feedback.LLMProvider.groundedness_measure_with_cot_reasons) 
        - [groundedness evaluation - reference link](https://www.trulens.org/component_guides/evaluation_benchmarks/groundedness_benchmark/?h=groundedness#benchmarking-various-groundedness-feedback-function-providers-openai-gpt-35-turbo-vs-gpt-4-vs-huggingface) : just for reference
        - [summarization evaluation - reference link](https://www.trulens.org/cookbook/use_cases/summarization_eval/#dependencies) : can be adopted for summarization evaluation 
    - Ground Truth Agreement : compares the similarity between the model's response and the ground truth. 
        - accuracy : 0 ~ 1 depending on the exactness of the response 
        - In general, ground truth evaluation takes only the input and the response into account. Thus, it best suits a non-RAG application like this one. 
- Metrics not relevant to our application 
    - Perplexity 
    - BLEU : Bilingual Evaluation Understudy evaluates the precision of LLM-generated text, i.e. how closely it resembles human reference text, on a numerical scale from 0 to 1.
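
To make the recall (ROUGE) versus precision (BLEU) distinction concrete, here is a simplified sketch that only counts overlapping unigrams. It is not the full definition of either metric: real BLEU combines several n-gram orders with a brevity penalty, and ROUGE has several variants. The function names are hypothetical.

```python
from collections import Counter


def _overlap(candidate: str, reference: str) -> int:
    """Count unigrams shared by both texts (clipped by the smaller count)."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    return sum((cand & ref).values())


def unigram_recall(candidate: str, reference: str) -> float:
    """ROUGE-1-style recall: share of reference words found in the candidate."""
    ref_len = len(reference.split())
    return _overlap(candidate, reference) / ref_len if ref_len else 0.0


def unigram_precision(candidate: str, reference: str) -> float:
    """BLEU-style unigram precision: share of candidate words found in the reference."""
    cand_len = len(candidate.split())
    return _overlap(candidate, reference) / cand_len if cand_len else 0.0


reference = "the doctor sends the patient to a specialist"
candidate = "the doctor sends the patient to a lung specialist today"

print(unigram_recall(candidate, reference))     # every reference word is covered
print(unigram_precision(candidate, reference))  # two candidate words are extra
```

A summary that covers the whole reference but adds extra words keeps full recall while losing precision, which is one way to see why a recall-oriented metric is the more natural fit for summarization.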


## Downloading example dataset 

In [25]:
!curl -o dialogsum.dev.jsonl https://raw.githubusercontent.com/cylnlp/dialogsum/main/DialogSum_Data/dialogsum.dev.jsonl

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  460k  100  460k    0     0  1384k      0 --:--:-- --:--:-- --:--:-- 1387k


In [26]:
import pandas as pd

file_path_dev = "dialogsum.dev.jsonl"
dev_df = pd.read_json(path_or_buf=file_path_dev, lines=True)
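
The `lines=True` argument tells pandas to read the file as JSON Lines, i.e. one JSON object per line. As a rough illustration of that format, the sketch below parses two made-up records with the standard library only. The field names follow the columns of the dataset; the values are invented.

```python
import io
import json

# Two made-up records in JSON Lines form: one JSON object per line.
sample = io.StringIO(
    '{"fname": "dev_0", "dialogue": "#Person1#: Hi.", "summary": "A greeting.", "topic": "greeting"}\n'
    '{"fname": "dev_1", "dialogue": "#Person1#: Bye.", "summary": "A farewell.", "topic": "farewell"}\n'
)

# Parse each non-empty line independently.
records = [json.loads(line) for line in sample if line.strip()]
print(len(records), sorted(records[0]))
```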

In [27]:
dev_df.head(10)

| | fname | dialogue | summary | topic |
|---|---|---|---|---|
| 0 | dev_0 | #Person1#: Hello, how are you doing today?\n#P... | #Person2# has trouble breathing. The doctor as... | see a doctor |
| 1 | dev_1 | #Person1#: Hey Jimmy. Let's go workout later t... | #Person1# invites Jimmy to go workout and pers... | do exercise |
| 2 | dev_2 | #Person1#: I need to stop eating such unhealth... | #Person1# plans to stop eating unhealthy foods... | healthy foods |
| 3 | dev_3 | #Person1#: Do you believe in UFOs?\n#Person2#:... | #Person2# believes in UFOs and can see them in... | UFOs and aliens |
| 4 | dev_4 | #Person1#: Did you go to school today?\n#Perso... | #Person1# didn't go to school today. #Person2#... | go to school |
| 5 | dev_5 | #Person1#: Honey, I think you should quit smok... | #Person1# asks #Person2# to quit smoking for h... | quit smoking |
| 6 | dev_6 | #Person1#: Excuse me, Mr. White? I just need y... | Sherry reminds Mr. White to sign. | workplace conversation |
| 7 | dev_7 | #Person1#: Hey, Karen. Look like you got some ... | #Person1# asks Karen where Karen stayed and ho... | holidays |
| 8 | dev_8 | #Person1#: How do you usually spend your leisu... | #Person1# asks about #Person2#'s hobbies. #Per... | hobby |
| 9 | dev_9 | #Person1#: have you ever seen Bill Gate's home... | #Person1# and #Person2# talk about Bill Gate's... | dream home |


## Create a simple summarization app and instrument it

In [28]:
import os
from dotenv import load_dotenv
from openai import OpenAI

# Load environment variables and initialize clients
load_dotenv()
OpenAI_key = os.getenv("OPENAI_API_KEY")  
Huggingface_key = os.getenv("HUGGINGFACE_API_KEY")
client = OpenAI()

In [29]:

from trulens.apps.custom import TruCustomApp
from trulens.apps.custom import instrument
import openai


# [TODO] : change the code to fit our application 
class DialogSummaryApp:
    @instrument
    def summarize(self, dialog): 
        client = openai.OpenAI()  # [TODO] Switch to ChatPromptTemplate / accept "company_name": row['company'], "title": row['title'], "content": row['content'], "format_instructions": format_instructions as parameters and pass them through / call the app in a for loop / return the response (category)
        summary = (
            client.chat.completions.create(  # Could this be replaced with ChatPromptTemplate?
                model="gpt-4o",
                messages=[
                    {
                        "role": "system",
                        "content": """Summarize the given dialog into 1-2 sentences based on the following criteria:  
                        1. Convey only the most salient information; 
                        2. Be brief; 
                        3. Preserve important named entities within the conversation; 
                        4. Be written from an observer perspective; 
                        5. Be written in formal language. """, # [TODO] : replace the content with "system prompt text"
                    },
                    {"role": "user", "content": dialog},  # [TODO] : replace content with "user prompt text" ( categorization query )
                ],
            )
            .choices[0]
            .message.content
        )
        return summary
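
The TODO notes above ask for the system and user prompts to be replaced with categorization prompts built from `company_name`, `title`, and `content`. One possible first step, sketched here under assumptions, is to separate message construction from the API call so it can be checked without calling a model. The prompt wording, `SYSTEM_PROMPT`, and `build_messages` are placeholders, not the project's actual prompt.

```python
# Hypothetical sketch: the prompt wording, function name, and field names
# below are placeholders, not the project's actual categorization prompt.
SYSTEM_PROMPT = "Assign the given content to exactly one category."


def build_messages(company_name: str, title: str, content: str) -> list:
    """Build the chat messages for one row, without calling any API."""
    user_prompt = f"Company: {company_name}\nTitle: {title}\nContent: {content}"
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": user_prompt},
    ]


messages = build_messages("Company A", "Example title", "Example content")
for message in messages:
    print(message["role"], "->", message["content"])
```

The resulting list has the same shape as the `messages` argument used in `summarize`, so it could be passed to the existing `client.chat.completions.create` call.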

## Initialize Database and View Dashboard 

In [30]:
from trulens.core import TruSession
from trulens.dashboard import run_dashboard

session = TruSession()
session.reset_database()
# If you have a database you can connect to, use a URL. For example:
# session = TruSession(database_url="postgresql://hostname/database?user=username&password=password")

Updating app_name and app_version in apps table: 0it [00:00, ?it/s]
Updating app_id in records table: 0it [00:00, ?it/s]
Updating app_json in apps table: 0it [00:00, ?it/s]


In [31]:
run_dashboard(session, force=True)

Starting dashboard ...



Dashboard started at http://192.168.0.153:62575 .


<Popen: returncode: None args: ['streamlit', 'run', '--server.headless=True'...>

## Write feedback functions
We will now create the feedback functions that will evaluate the app. The criteria we evaluate against are:

- Ground truth agreement
    - For this set of metrics, we measure how similar the generated summary is to a human-created ground truth.
    - Different measures are used in this example: BERT score, BLEU, ROUGE, and a measure where an LLM is prompted to produce a similarity score.   
        - BLEU is not relevant to our application. 
- Groundedness
    - For this measure, we estimate whether the generated summary can be traced back to parts of the original transcript. ( Not necessary )  
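
As an illustration of what "traced back to the original transcript" can mean, the sketch below scores each summary sentence against the source and averages the results. The token-overlap scorer is only a stand-in for an LLM or NLI judge, and none of this is TruLens's implementation; all function names are hypothetical.

```python
import re


def _tokens(text: str) -> list:
    """Lowercase word tokens, ignoring punctuation."""
    return re.findall(r"[a-z0-9']+", text.lower())


def support_score(sentence: str, source: str) -> float:
    """Stand-in judge: share of the sentence's tokens that occur in the source."""
    sentence_tokens = _tokens(sentence)
    if not sentence_tokens:
        return 0.0
    source_tokens = set(_tokens(source))
    return sum(t in source_tokens for t in sentence_tokens) / len(sentence_tokens)


def groundedness(summary: str, source: str) -> float:
    """Average the per-sentence support scores of the summary."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", summary.strip()) if s]
    if not sentences:
        return 0.0
    return sum(support_score(s, source) for s in sentences) / len(sentences)


source = "The patient has trouble breathing. The doctor will send the patient to a specialist."
grounded = "The patient has trouble breathing."
ungrounded = "Aliens visited yesterday."

print(groundedness(grounded, source))
print(groundedness(ungrounded, source))
print(groundedness(grounded + " " + ungrounded, source))
```

Scoring per sentence means a single unsupported sentence lowers the score without hiding the supported ones.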

## Building a Golden Dataset ( Ground Truth Evaluation )

In [32]:
golden_set = (
    dev_df[["dialogue", "summary"]] 
    .rename(columns={"dialogue": "query", "summary": "response"})
    .to_dict("records")
)
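
Each golden-set record pairs an input (`query`) with its reference output (`response`), so a ground-truth comparison first has to find the reference for a given input. A minimal sketch of that lookup, assuming exact matching on the `query` text; `find_reference` and the sample records are hypothetical and not part of TruLens.

```python
from typing import Optional

# Made-up records in the same {"query", "response"} shape as golden_set.
example_golden_set = [
    {"query": "dialog one", "response": "reference summary one"},
    {"query": "dialog two", "response": "reference summary two"},
]


def find_reference(query: str, golden: list) -> Optional[str]:
    """Return the reference response whose query matches exactly, else None."""
    for record in golden:
        if record["query"] == query:
            return record["response"]
    return None


print(find_reference("dialog two", example_golden_set))
print(find_reference("unseen dialog", example_golden_set))
```

Under this assumption, the text sent to the app must be identical to the stored `query`; otherwise no reference is found for the comparison.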

In [33]:
golden_set[:10]

[{'query': "#Person1#: Hello, how are you doing today?\n#Person2#: I ' Ve been having trouble breathing lately.\n#Person1#: Have you had any type of cold lately?\n#Person2#: No, I haven ' t had a cold. I just have a heavy feeling in my chest when I try to breathe.\n#Person1#: Do you have any allergies that you know of?\n#Person2#: No, I don ' t have any allergies that I know of.\n#Person1#: Does this happen all the time or mostly when you are active?\n#Person2#: It happens a lot when I work out.\n#Person1#: I am going to send you to a pulmonary specialist who can run tests on you for asthma.\n#Person2#: Thank you for your help, doctor.",
  'response': '#Person2# has trouble breathing. The doctor asks #Person2# about it and will send #Person2# to a pulmonary specialist.'},
 {'query': "#Person1#: Hey Jimmy. Let's go workout later today.\n#Person2#: Sure. What time do you want to go?\n#Person1#: How about at 3:30?\n#Person2#: That sounds good. Today we work on Legs and forearm.\n#Person1#

## Define the feedback functions 

In [34]:
# !pip install trulens.providers.huggingface

In [43]:
from trulens.core import Select, Feedback
from trulens.providers.huggingface import Huggingface
from trulens.providers.openai import OpenAI
from trulens.feedback import GroundTruthAgreement

provider = OpenAI(model_engine="gpt-4o")
hug_provider = Huggingface() # for groundedness_measure_with_nli 

ground_truth_collection = GroundTruthAgreement(golden_set, provider=provider)
f_groundtruth = Feedback(
    ground_truth_collection.agreement_measure, name="Similarity (LLM)"
).on_input_output()

f_bert_score = Feedback(ground_truth_collection.bert_score).on_input_output()
f_bleu = Feedback(ground_truth_collection.bleu).on_input_output()
f_rouge = Feedback(ground_truth_collection.rouge).on_input_output()
# Groundedness of the response (summary) with respect to the input dialog.


f_groundedness_llm = (
    Feedback(
        provider.groundedness_measure_with_cot_reasons,
        name="Groundedness - LLM Judge",
    )
    .on(Select.RecordInput)
    .on(Select.RecordOutput)
)

f_groundedness_nli = (
    Feedback(
        hug_provider.groundedness_measure_with_nli,
        name="Groundedness - NLI Judge",
    )
    .on(Select.RecordInput)
    .on(Select.RecordOutput)
)

f_comprehensiveness = (
    Feedback(
        provider.comprehensiveness_with_cot_reasons, name="Comprehensiveness"
    )
    .on(Select.RecordInput)
    .on(Select.RecordOutput)
)
provider.comprehensiveness_with_cot_reasons(
    "the white house is white. obama is the president",
    "the white house is white. obama is the president",
)

✅ In Similarity (LLM), input prompt will be set to __record__.main_input or `Select.RecordInput` .
✅ In Similarity (LLM), input response will be set to __record__.main_output or `Select.RecordOutput` .
✅ In bert_score, input prompt will be set to __record__.main_input or `Select.RecordInput` .
✅ In bert_score, input response will be set to __record__.main_output or `Select.RecordOutput` .
✅ In bleu, input prompt will be set to __record__.main_input or `Select.RecordInput` .
✅ In bleu, input response will be set to __record__.main_output or `Select.RecordOutput` .
✅ In rouge, input prompt will be set to __record__.main_input or `Select.RecordInput` .
✅ In rouge, input response will be set to __record__.main_output or `Select.RecordOutput` .
✅ In Groundedness - LLM Judge, input source will be set to __record__.main_input or `Select.RecordInput` .
✅ In Groundedness - LLM Judge, input statement will be set to __record__.main_output or `Select.RecordOutput` .
✅ In Groundedness - NLI Judge, 

(1.0,
 {'reasons': 'Score: 3\nKey Point: Obama is the president.\nSupporting Evidence: The summary explicitly states "obama is the president," which fully captures the key point.\n\n'})

## Wrap up the application with feedback functions
Now we are ready to wrap our summarization app with TruLens as a TruCustomApp.   
Each time it is called, TruLens will log the inputs, outputs, and any instrumented intermediate steps, and evaluate them with the feedback functions we created.  

In [36]:
summary_app = DialogSummaryApp()
print(summary_app.summarize(dev_df.dialogue[498]))  # [TODO] change this to categorization


A customer contacted Amazon customer service regarding a missing page in "The Paper Bag Night of the Hunter" by R.A. Salvatore, and was instructed to upload a photo of the issue to receive a replacement without returning the original book.


In [37]:
summary_recorder = TruCustomApp(
    summary_app,
    app_name="Summarization example",
    app_version="v1",
    feedbacks=[
        f_groundtruth,
        f_groundedness_llm,
        # f_groundedness_nli,
        f_comprehensiveness,
        f_bert_score,
        f_bleu,
        f_rouge,
    ],
)

## Evaluate the summaries for the first 50 records 

In [38]:
# Run the app on the first 50 golden-set records. Each run should show up on the dashboard.
for row in golden_set[:50]:
    with summary_recorder:
        summary_app.summarize(dialog=row['query'])   

  return _methods._mean(a, axis=axis, dtype=dtype,
  ret = ret.dtype.type(ret / rcount)


## Optionally, tenacity can be used to retry the app if it fails to run
We will make many queries in a short amount of time, so tenacity can help ensure that most of our requests eventually go through.

In [39]:
# from tenacity import retry
# from tenacity import stop_after_attempt
# from tenacity import wait_random_exponential

# @retry(wait=wait_random_exponential(min=1, max=60), stop=stop_after_attempt(6))
# def run_with_backoff(doc):
#     return summary_recorder.with_record(summary_app.summarize, dialog=doc)

# for pair in golden_set[:100]:
#     llm_response = run_with_backoff(pair["query"])
#     print(llm_response)
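
For readers without tenacity installed, the same idea (retry with randomized exponential backoff) can be sketched with the standard library. This is an approximation of the behavior, not tenacity's exact algorithm; `retry_with_backoff` and its parameters are hypothetical.

```python
import random
import time


def retry_with_backoff(func, max_attempts=6, min_wait=1.0, max_wait=60.0, sleep=time.sleep):
    """Call func(); on failure wait a random, exponentially growing time and retry."""
    for attempt in range(1, max_attempts + 1):
        try:
            return func()
        except Exception:
            if attempt == max_attempts:
                raise  # give up after the final attempt
            # Upper bound doubles each attempt, capped at max_wait.
            upper = min(max_wait, min_wait * 2 ** attempt)
            sleep(random.uniform(min_wait, upper))


# Demo with a function that fails twice before succeeding.
calls = {"count": 0}


def flaky():
    calls["count"] += 1
    if calls["count"] < 3:
        raise RuntimeError("temporary failure")
    return "ok"


result = retry_with_backoff(flaky, sleep=lambda seconds: None)  # skip real waiting
print(result, calls["count"])
```

The `sleep` parameter is injectable so the demo runs instantly; in real use the default `time.sleep` would apply the waits.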


## Run dashboard

In [42]:
session.get_leaderboard()

| app_name | app_version | Comprehensiveness | Groundedness - LLM Judge | Groundedness - NLI Judge | latency | total_cost |
|---|---|---|---|---|---|---|
| Summarization example | v1 | 0.679861 | 1.0 | 0.940408 | 1.661563 | 0.001037 |


In [41]:
run_dashboard(session)

Starting dashboard ...
Dashboard already running at path:   Network URL: http://192.168.0.153:62575



<Popen: returncode: None args: ['streamlit', 'run', '--server.headless=True'...>