# Iterating on LLM Apps with TruLens

In this example, we build a first prototype RAG (retrieval-augmented generation) app that answers questions from the Insurance Handbook PDF published by the Insurance Information Institute. Using TruLens, we identify early failure modes and then iterate so that the app is honest, harmless, and helpful.

In [15]:
!pip install trulens_eval llama_index llama_hub html2text llmsherpa tenacity protobuf==3.20



## Restart the kernel after installing!

In [16]:
from trulens_eval import Tru
tru = Tru(database_redact_keys=True)

## Start with basic RAG.

In [10]:
!pip install pymupdf



In [17]:
import fitz  # PyMuPDF

llmsherpa_api_url = "https://readers.llmsherpa.com/api/document/developer/parseDocument?renderFormat=all"  # Keeping this for compatibility

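# Stand-in for the llmsherpa-based SmartPDFLoader: it keeps the same interface
# but extracts text locally with PyMuPDF instead of calling the remote API.
class SmartPDFLoader:
    def __init__(self, llmsherpa_api_url):
        self.api_url = llmsherpa_api_url  # unused; retained for interface compatibility

    def load_data(self, local_pdf_path):
        """Return a list containing the plain text of each page."""
        # The context manager closes the PDF once the text is extracted.
        with fitz.open(local_pdf_path) as doc:
            return [page.get_text() for page in doc]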

# Use your local file path here
pdf_loader = SmartPDFLoader(llmsherpa_api_url=llmsherpa_api_url)
documents = pdf_loader.load_data("Insurance_Handbook_20103.pdf")

# Example: print the first 500 characters of the first page
print(documents[0][:500])

Insurance 
Handbook 
A guide to insurance:  
what it does and how it works



In [6]:
from llama_index import Document, Prompt, ServiceContext, VectorStoreIndex
from llama_index.llms import OpenAI

# The loader above returns one plain string per page, so join the strings directly.
document = Document(text="\n\n".join(documents))

# `llm` was not defined in the original cell; the model chosen here is an assumption.
llm = OpenAI(model="gpt-3.5-turbo")

service_context = ServiceContext.from_defaults(
    llm=llm,
    embed_model="local:BAAI/bge-small-en-v1.5",
)

# Build an in-memory vector index over the whole handbook.
index = VectorStoreIndex.from_documents([document], service_context=service_context)

# QA prompt: retrieved chunks fill {context_str}, the user question fills {query_str}.
system_prompt = Prompt("We have provided context information below that you may use. \n"
                       "-------------------\n"
                       "{context_str}"
                       "\n-------------------\n"
                       "Please answer the question {query_str}\n"
                       )

rag_basic = index.as_query_engine(text_qa_template=system_prompt)

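To make the role of the QA template concrete, here is a minimal sketch of how its two placeholders get filled at query time. It uses plain `str.format` rather than llama_index, and the `fill_prompt` helper and the placeholder chunk text are hypothetical; the real query engine performs this substitution internally.

```python
# Plain-Python illustration of the QA template; not llama_index's own code.
TEMPLATE = (
    "We have provided context information below that you may use. \n"
    "-------------------\n"
    "{context_str}"
    "\n-------------------\n"
    "Please answer the question {query_str}\n"
)


def fill_prompt(chunks, question):
    """Join retrieved chunks and substitute them into the template (hypothetical helper)."""
    context = "\n\n".join(chunks)
    return TEMPLATE.format(context_str=context, query_str=question)


# Placeholder chunk text stands in for whatever the retriever returns.
example = fill_prompt(
    ["<text of retrieved chunk 1>", "<text of retrieved chunk 2>"],
    "What are the typical coverage options for homeowners insurance",
)
print(example)
```

Whatever the retriever returns is all the model sees between the two rules, which is why retrieval quality dominates the evaluation results later on.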


## Load test set

In [7]:
honest_evals = [
    "What are the typical coverage options for homeowners insurance",
    "What are the requirements for long term care insurance to start",
    "How much in losses does fraud account for in property and casualty insurance",
    "What was the most costly earthquake in US History for insurers"
]

## Set up Evaluation

In [10]:
import os
import numpy as np
from trulens_eval import Feedback

from trulens_eval.feedback.provider import OpenAI as fOpenAI
openai = fOpenAI()

qa_relevance = (
    Feedback(openai.relevance_with_cot_reasons, name="Answer Relevance")
    .on_input_output()
)

from trulens_eval import TruLlama
qs_relevance = (
    Feedback(openai.qs_relevance_with_cot_reasons, name = "Context Relevance")
    .on_input()
    .on(TruLlama.select_source_nodes().node.text)
    .aggregate(np.mean)
)

# embedding distance
from langchain.embeddings.openai import OpenAIEmbeddings
from trulens_eval.feedback import Embeddings

model_name = 'text-embedding-ada-002'

embed_model = OpenAIEmbeddings(
    model=model_name,
    openai_api_key=os.environ["OPENAI_API_KEY"]
)

embed = Embeddings(embed_model=embed_model)
f_embed_dist = (
    Feedback(embed.cosine_distance)
    .on_input()
    .on(TruLlama.select_source_nodes().node.text)
)

from trulens_eval.feedback import Groundedness

grounded = Groundedness(groundedness_provider=openai)

f_groundedness = (
    Feedback(grounded.groundedness_measure_with_cot_reasons, name="Groundedness")
        .on(TruLlama.select_source_nodes().node.text.collect())
        .on_output()
        .aggregate(grounded.grounded_statements_aggregator)
)

honest_feedbacks = [qa_relevance, qs_relevance, f_embed_dist, f_groundedness]

# Wrap the query engine so every call is recorded and scored by the feedback functions.
tru_recorder_rag_basic = TruLlama(
    rag_basic,
    app_id='1) Basic RAG - Honest Eval',
    feedbacks=honest_feedbacks
)

✅ In Answer Relevance, input prompt will be set to __record__.main_input or `Select.RecordInput` .
✅ In Answer Relevance, input response will be set to __record__.main_output or `Select.RecordOutput` .
✅ In Context Relevance, input question will be set to __record__.main_input or `Select.RecordInput` .
✅ In Context Relevance, input statement will be set to __record__.app.query.rets.source_nodes[:].node.text .
✅ In cosine_distance, input query will be set to __record__.main_input or `Select.RecordInput` .
✅ In cosine_distance, input document will be set to __record__.app.query.rets.source_nodes[:].node.text .
✅ In Groundedness, input source will be set to __record__.app.query.rets.source_nodes[:].node.text.collect() .
✅ In Groundedness, input statement will be set to __record__.main_output or `Select.RecordOutput` .
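Context Relevance is scored once per retrieved source node and then reduced to a single number with `np.mean`. The sketch below shows that reduction on made-up scores; the values are placeholders, not results from this run, and it uses the standard library's `statistics.fmean`, which gives the same arithmetic mean as `np.mean` on a flat list.

```python
from statistics import fmean

# Hypothetical per-chunk relevance scores in [0, 1] for a single question.
per_chunk_scores = [0.9, 0.4, 0.2]

# Arithmetic mean over the chunks, mirroring .aggregate(np.mean).
context_relevance = fmean(per_chunk_scores)
print(round(context_relevance, 2))
```

Because the result is an average, one highly relevant chunk retrieved alongside several irrelevant ones still produces a middling score.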


In [11]:
# Run evaluation on sample questions
with tru_recorder_rag_basic as recording:
    for question in honest_evals:
        response = rag_basic.query(question)

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
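The warning names its own fix: set `TOKENIZERS_PARALLELISM` explicitly. A minimal sketch, assuming it runs in an early cell before the embedding model is first used:

```python
import os

# Disable tokenizer parallelism up front so the fork warning is not emitted.
os.environ["TOKENIZERS_PARALLELISM"] = "false"
```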

In [12]:
# get leaderboard
tru.get_leaderboard(app_ids=["1) Basic RAG - Honest Eval"])

| app_id | Groundedness | cosine_distance | Answer Relevance | Context Relevance | latency | total_cost |
|---|---|---|---|---|---|---|
| 1) Basic RAG - Honest Eval | 0.333333 | 0.157069 | 1.0 | 0.55 | 4.5 | 0.003316 |


Our simple RAG often fails to retrieve enough information from the insurance handbook to answer the question properly: the leaderboard shows a Groundedness of 0.33 and a Context Relevance of 0.55, even though Answer Relevance is 1.0. The information needed may sit just outside the chunk that our app identifies and retrieves.
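The toy example below illustrates that failure mode without any retrieval library. It treats each sentence of a short text as a chunk, picks the chunk with the most word overlap with the question, and shows that the detail needed for a full answer sits in a neighboring chunk; widening the result to include adjacent chunks recovers it. The text, the overlap scoring, and the helper names are all assumptions made for illustration, not the app's actual retriever.

```python
import re

# Made-up text for illustration; not taken from the handbook.
TEXT = (
    "The policy has a waiting period. "
    "Benefits begin once that period ends. "
    "The period lasts ninety days. "
    "Premiums are billed monthly."
)


def split_sentences(text):
    """Split on whitespace that follows a period; one sentence per chunk."""
    return [s.strip() for s in re.split(r"(?<=\.)\s+", text) if s.strip()]


def words(s):
    """Lowercased set of alphabetic words in a string."""
    return set(re.findall(r"[a-z]+", s.lower()))


def best_chunk(chunks, question):
    """Index of the chunk sharing the most words with the question."""
    q = words(question)
    return max(range(len(chunks)), key=lambda i: len(words(chunks[i]) & q))


def with_neighbors(chunks, i, window=1):
    """Return chunk i joined with up to `window` chunks on each side."""
    lo, hi = max(0, i - window), min(len(chunks), i + window + 1)
    return " ".join(chunks[lo:hi])


chunks = split_sentences(TEXT)
question = "When do benefits begin?"
hit = best_chunk(chunks, question)

print(chunks[hit])                  # the retrieved chunk alone
print(with_neighbors(chunks, hit))  # the chunk plus one neighbor on each side
```

The retrieved sentence mentions the waiting period but not its length; only the widened context contains the "ninety days" detail.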