# 3. Evaluating the accuracy of the model 
## Evaluation Concepts & Metrics  
[Evaluation Concepts - LangChain Documentation](https://docs.smith.langchain.com/evaluation/concepts)  
[LLM Evaluation Metrics](https://arize.com/blog-course/llm-evaluation-the-definitive-guide/)  
[TruLens Evaluation Blog](https://gowrishankar.info/blog/evaluating-large-language-models-generated-contents-with-trueras-trulens/)

My initial guess was that answer relevance would be the best metric for our use case. However, the relevance metrics require a retrieval (RAG) step to evaluate the prompts, which makes them unsuitable for this task (Johann confirmed this on 29 Nov 2024).

#### Other references regarding summarization 
Summarization tools offered by LangChain: 
- Stuff / Map-reduce & refine / CoD (Chain of Density) / Clustering Map-Refine 
- [Stuff / Map-Reduce document](https://python.langchain.com/docs/modules/data_connection/document_transformers/map_reduce) 
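
The difference between the "stuff" and "map-reduce" strategies listed above can be sketched in a few lines of plain Python. This is only an illustration of the idea, not LangChain's implementation: `fake_llm` is a hypothetical stand-in for a model call, and the two documents are made-up examples.

```python
from typing import Callable, List

calls: List[int] = []  # records the input length of every (fake) model call


def fake_llm(text: str) -> str:
    """Hypothetical stand-in for an LLM call: keep only the first sentence."""
    calls.append(len(text))
    return text.split(".")[0].strip() + "."


def stuff_summarize(docs: List[str], summarize: Callable[[str], str]) -> str:
    """'Stuff': concatenate all documents and summarize them in one call."""
    return summarize("\n\n".join(docs))


def map_reduce_summarize(docs: List[str], summarize: Callable[[str], str]) -> str:
    """'Map-reduce': summarize each document, then summarize the summaries."""
    partial_summaries = [summarize(doc) for doc in docs]  # map step
    return summarize("\n\n".join(partial_summaries))      # reduce step


# Made-up example documents
docs = [
    "AXA offers health insurance. It also sells car insurance.",
    "Generali offers auto insurance. It operates in Germany.",
]

print(stuff_summarize(docs, fake_llm))
print(map_reduce_summarize(docs, fake_llm))
print("model calls so far:", len(calls))
```

"Stuff" needs a single call but the whole input must fit into the context window; "map-reduce" needs one call per document plus one for the final reduction, with each call seeing a shorter input.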


---

### TruLens Evaluation 
- [LangChain Quickstart](https://www.trulens.org/getting_started/quickstarts/langchain_quickstart/#create-rag) 
- [Text to Text](https://www.trulens.org/getting_started/quickstarts/text2text_quickstart/) 
    - Can be used for our task, but how do we use it for evaluation? 
- [Ground Truth Evaluation](https://www.trulens.org/getting_started/quickstarts/groundtruth_evals/) 
    - Can be used for our task, but how do we generate the ground truth data set? (Tried the RAG-based evaluation, but it is not suitable for our case.)
- [Feedback Function Implementations](https://www.trulens.org/component_guides/evaluation/feedback_implementations/)  
    - [Classification-based](https://www.trulens.org/component_guides/evaluation/feedback_implementations/stock/)  
    - [context_relevance_with_cot_reasons](https://www.trulens.org/reference/trulens/providers/openai/#trulens.providers.openai.OpenAI.context_relevance_with_cot_reasons)  
    - [context_relevance](https://www.trulens.org/reference/trulens/providers/langchain/provider/#trulens.providers.langchain.provider.Langchain.context_relevance)   
    - Some feedback functions require a retrieval (RAG) step (`BaseRetriever`) to evaluate the prompts, which is not suitable for our task.  


[TruLens Video 1](https://www.youtube.com/watch?v=ul5huLywzZk)  
[TruLens Video 2](https://www.youtube.com/watch?v=8NP1HLrNuAo)  


### [TODO] Keep in mind that our application is not a RAG-based application and find the right metric.
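
One candidate that needs no retrieval step is plain label comparison: if the prompt's output is a category and a reference label exists for each company, accuracy, precision and recall can be computed directly. The sketch below assumes such reference labels are available; the label names and the example lists are hypothetical and not taken from our data.

```python
from typing import List, Tuple


def accuracy(expected: List[str], predicted: List[str]) -> float:
    """Fraction of items whose predicted label equals the expected label."""
    if len(expected) != len(predicted):
        raise ValueError("expected and predicted must have the same length")
    if not expected:
        raise ValueError("at least one labelled item is required")
    return sum(e == p for e, p in zip(expected, predicted)) / len(expected)


def precision_recall(
    expected: List[str], predicted: List[str], positive: str
) -> Tuple[float, float]:
    """Precision and recall for one class, treated as the positive class."""
    pairs = list(zip(expected, predicted))
    tp = sum(e == positive and p == positive for e, p in pairs)
    fp = sum(e != positive and p == positive for e, p in pairs)
    fn = sum(e == positive and p != positive for e, p in pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Hypothetical reference labels and prompt outputs for four companies
expected = ["insurance", "insurance", "other", "other"]
predicted = ["insurance", "other", "other", "insurance"]

print("accuracy:", accuracy(expected, predicted))
print("precision/recall:", precision_recall(expected, predicted, "insurance"))
```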

---
#### [TruLens Example](https://www.trulens.org/getting_started/quickstarts/langchain_quickstart/#create-rag)
- RAG-based evaluation, which is not suitable for this task. 
 

#### Building RAG 
Evaluating without RAG seems to be possible, but I failed to find the right resources / documentation.   
Comparing only the relevance of input and output does not make sense for this task (e.g., "Is this an insurance product?" / "yes" or "no" => no measurable relevance). 
- Test_Evaluation notebook 


In [1]:
from langchain.schema import Document
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.chains import RetrievalQA
# ChatOpenAI comes from langchain_openai; the langchain.chat_models import is deprecated
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain.storage import LocalFileStore
from langchain.embeddings import CacheBackedEmbeddings
from langchain_community.vectorstores import FAISS
import pandas as pd


In [2]:
df = pd.read_csv('insurance_company_db.csv') # load the categorized data 

# Convert DataFrame to documents
documents = [
    Document(
        page_content=f"{row['content']}", 
        metadata={"source": row['source'], "company": row['company']}
    ) for _, row in df.iterrows()
]

# Split documents
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
split_documents = text_splitter.split_documents(documents)

# Initialize embeddings with caching
embeddings_model = OpenAIEmbeddings()
store = LocalFileStore('./cache/')
cached_embedder = CacheBackedEmbeddings.from_bytes_store(
    underlying_embeddings=embeddings_model,
    document_embedding_cache=store,
    namespace=embeddings_model.model
)

# Create and save FAISS vector store
vectorstore = FAISS.from_documents(split_documents, cached_embedder)
vectorstore.save_local('./db/insurance_data')
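
To make the effect of `chunk_size` and `chunk_overlap` concrete, here is a simplified fixed-window splitter. It is a sketch only: `RecursiveCharacterTextSplitter` additionally tries to split on separators such as paragraphs and sentences, which this hypothetical `split_with_overlap` helper does not do.

```python
from typing import List


def split_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> List[str]:
    """Cut text into windows of chunk_size characters that share chunk_overlap characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap  # how far each window advances
    chunks: List[str] = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks


# Small numbers instead of 1000 / 200 so the overlap is visible
for chunk in split_with_overlap("abcdefghij", chunk_size=4, chunk_overlap=2):
    print(chunk)
```

With `chunk_size=1000` and `chunk_overlap=200`, consecutive windows in this simplified scheme would share 200 characters, so a sentence cut at a chunk boundary still appears complete in at least one chunk.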

#### Based on these queries, one can build a test set 
- Building a test data set (ground truth data set)

In [3]:
# Load the saved vectorstore
vectorstore = FAISS.load_local('./db/insurance_data', cached_embedder, allow_dangerous_deserialization=True)

# Create QA chain
qa_chain = RetrievalQA.from_chain_type(
    llm=ChatOpenAI(temperature=0),
    chain_type="stuff",
    retriever=vectorstore.as_retriever()
)

# Example queries
queries = [
    "Does AXA Germany offer health insurance?",
    "Generali offers Auto Insurance?",
    "HUK-COBURG offers Health insurance?"
]

# Run queries; invoke replaces the deprecated run and returns the answer under "result"
for query in queries:
    print(f"\nQuery: {query}")
    print("Answer:", qa_chain.invoke({"query": query})["result"])



Query: Does AXA Germany offer health insurance?
Answer: Yes, AXA Germany offers health insurance.

Query: Generali offers Auto Insurance?
Answer: Yes, Generali offers Auto Insurance as one of their products.

Query: HUK-COBURG offers Health insurance?
Answer: Based on the provided context, HUK-COBURG offers accident insurance, but there is no specific mention of health insurance. Therefore, it is unclear if HUK-COBURG offers health insurance.


---

### Idea to evaluate the prompt ([Ground Truth Evaluation](https://www.trulens.org/getting_started/quickstarts/groundtruth_evals/))
-> What should be the ground truth for this task?   

My idea: 
1. Build a RAG based on the categorized data. 
2. Build a test data set (QA-based test set) from the RAG --> use this as the ground truth? 
3. Use different prompts to categorize the companies. 
4. Use the test data set (from step 2) to evaluate how good each prompt is. For example, if A is an insurance company according to steps 1 and 2, then the categorized data set should also list A as an insurance company.  

--> But the test set from step 2 cannot be a ground truth, since it is itself generated by another prompt.  
--> How should we evaluate the quality of the prompt then?  
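
One possible way out, sketched here under the assumption that a small set of companies can be labelled by hand (so the reference does not come from any prompt): map each free-text answer to a label and measure how often each prompt agrees with the hand labels. The company names, answers, prompt names and the keyword rules in `to_label` are all hypothetical.

```python
import re
from typing import Dict


def to_label(answer: str) -> str:
    """Map a free-text answer to 'yes', 'no' or 'unclear' with simple keyword rules."""
    text = answer.strip().lower()
    if re.match(r"yes\b", text):
        return "yes"
    if re.match(r"no\b", text):
        return "no"
    return "unclear"


def agreement(answers: Dict[str, str], hand_labels: Dict[str, str]) -> float:
    """Fraction of hand-labelled companies whose answer maps to the same label."""
    if not hand_labels:
        raise ValueError("hand_labels must not be empty")
    hits = sum(
        to_label(answers.get(company, "")) == label
        for company, label in hand_labels.items()
    )
    return hits / len(hand_labels)


# Hypothetical hand labels (the prompt-independent ground truth)
hand_labels = {"Company A": "yes", "Company B": "no"}

# Hypothetical answers produced by two different prompts
answers_by_prompt = {
    "prompt_v1": {
        "Company A": "Yes, Company A offers health insurance.",
        "Company B": "It is unclear whether Company B offers health insurance.",
    },
    "prompt_v2": {
        "Company A": "Yes, it does.",
        "Company B": "No, it does not.",
    },
}

scores = {name: agreement(ans, hand_labels) for name, ans in answers_by_prompt.items()}
best_prompt = max(scores, key=scores.get)
print(scores, "best:", best_prompt)
```

The keyword rules are deliberately crude; an answer that hedges is counted as "unclear" and therefore as a disagreement with a definite hand label.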

In [4]:
from openai import OpenAI
from trulens.apps.custom import instrument
from trulens.core import TruSession

session = TruSession()

🦑 Initialized with db url sqlite:///default.sqlite .
🛑 Secret keys may be written to the database. See the `database_redact_keys` option of `TruSession` to prevent this.


In [5]:
from langchain_core.output_parsers import PydanticOutputParser
from pydantic import BaseModel, Field

# OpenAI_key, read_prompt and prompt_path are assumed to be defined earlier in the notebook.

class answer_format(BaseModel):
    answer: str = Field(description="answer to the question")

llm = ChatOpenAI(model="gpt-4o", api_key=OpenAI_key, temperature=0, max_tokens=16384) 

parser = PydanticOutputParser(pydantic_object=answer_format)
format_instructions = parser.get_format_instructions()


class APP:
    @instrument
    def completion(self, prompt):
        system_input_stage_1 = read_prompt(prompt_path + "classification_system_stage_1.txt")
        df = pd.read_csv('company_data.csv')  # loaded but not used yet

        completion = (
            OpenAI().chat.completions.create(
                model="gpt-4o",
                messages=[
                    # `create` has no `format_instructions` argument, so the
                    # instructions are appended to the system message instead.
                    {"role": "system", "content": f"{system_input_stage_1}\n\n{format_instructions}"},
                    {"role": "user", "content": prompt},
                ],
                # JSON mode; the format instructions above ask for JSON output.
                response_format={"type": "json_object"},
            ).choices[0].message.content
        )
        # Return only the answer text so it can be compared with the expected responses.
        return parser.parse(completion).answer

llm_app = APP()

In [6]:
# Initialize Feedback Function 
from trulens.core import Feedback
from trulens.feedback import GroundTruthAgreement
from trulens.providers.openai import OpenAI as fOpenAI

golden_set = [
    {
        "query": "Does AXA Germany offer health insurance?",
        "expected_response": "Yes, AXA Germany offers health insurance.",
    },
    {
        "query": "Generali offers Auto Insurance?",
        "expected_response": "Yes, Generali offers Auto Insurance as one of their products.",
    },
    {
        "query": "HUK-COBURG offers Health insurance?",
        "expected_response": "Based on the provided context, HUK-COBURG offers accident insurance, liability insurance, and support for driving officials. There is no specific mention of health insurance in the information provided.",
    }
]

f_groundtruth = Feedback(
    GroundTruthAgreement(golden_set, provider=fOpenAI()).agreement_measure,
    name="Ground Truth Semantic Agreement",
).on_input_output()

✅ In Ground Truth Semantic Agreement, input prompt will be set to __record__.main_input or `Select.RecordInput` .
✅ In Ground Truth Semantic Agreement, input response will be set to __record__.main_output or `Select.RecordOutput` .


In [7]:
# add trulens as a context manager for llm_app
from trulens.apps.custom import TruCustomApp

tru_app = TruCustomApp(
    llm_app, app_name="LLM App", app_version="v1", feedbacks=[f_groundtruth]
)

In [8]:
# Instrumented query engine can operate as a context manager:
with tru_app as recording:
    # Reuse the golden-set queries verbatim so that every call has a
    # ground truth entry to be compared against.
    for item in golden_set:
        llm_app.completion(item["query"])