# Generating a Test Set with TruLens

In the early stages of developing an LLM app, it is often challenging to generate a comprehensive test set on which to evaluate your app.

This notebook demonstrates test set generation with TruLens, aimed particularly at applications that rely on private data or context, such as RAG applications.

By providing your LLM app callable, you let TruLens use your app to generate its own test set. The resulting test set contains question categories tailored to your data and a list of test prompts for each category. You specify the number of categories with `test_breadth` and the number of prompts per category with `test_depth`.
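
As a rough illustration of the shape described above, the sketch below uses a hand-written stand-in for a generated test set (a dictionary mapping each category to its list of prompts) and checks it against the requested breadth and depth. The `check_shape` helper and the placeholder categories are hypothetical and not part of TruLens. Because the categories and prompts come from an LLM, a check like this may be worth running before relying on the counts.

```python
# Hypothetical stand-in for a generated test set: category -> list of prompts.
# The category names and prompts are placeholders, not real output.
example_test_set = {
    "Category A": ["Prompt A1?", "Prompt A2?"],
    "Category B": ["Prompt B1?", "Prompt B2?"],
    "Category C": ["Prompt C1?", "Prompt C2?"],
}


def check_shape(test_set, test_breadth, test_depth):
    """Return a list of problems; an empty list means the shape matches."""
    problems = []
    if len(test_set) != test_breadth:
        problems.append(
            f"expected {test_breadth} categories, got {len(test_set)}"
        )
    for category, prompts in test_set.items():
        if len(prompts) != test_depth:
            problems.append(
                f"{category!r}: expected {test_depth} prompts, got {len(prompts)}"
            )
    return problems


problems = check_shape(example_test_set, test_breadth=3, test_depth=2)
total_prompts = sum(len(prompts) for prompts in example_test_set.values())
print(problems, total_prompts)
```

With a matching shape, the total number of prompts is simply `test_breadth * test_depth`.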

In [18]:
from trulens_eval.generate_test_set import GenerateTestSet

## Set key

In [None]:
import os
#os.environ["OPENAI_API_KEY"] = "sk-..."

## Build application

In [19]:
# Imports from LangChain to build app
import bs4
from langchain import hub
from langchain.chat_models import ChatOpenAI
from langchain.document_loaders import WebBaseLoader
from langchain.embeddings import OpenAIEmbeddings
from langchain.schema import StrOutputParser
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.vectorstores import Chroma
from langchain_core.runnables import RunnablePassthrough
from langchain_community.document_loaders import PyPDFLoader
import json

In [3]:
# Load example content from the web
loader = WebBaseLoader(
    web_paths=("https://lilianweng.github.io/posts/2023-06-23-agent/",),
    bs_kwargs=dict(
        parse_only=bs4.SoupStrainer(
            class_=("post-content", "post-title", "post-header")
        )
    ),
)
docs = loader.load()

In [5]:
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
splits = text_splitter.split_documents(docs)

vectorstore = Chroma.from_documents(documents=splits, embedding=OpenAIEmbeddings())

In [20]:
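# Alternatively, build the vectorstore from a local PDF. This replaces the
# web-based vectorstore created above; the outputs below come from this PDF.
loader = PyPDFLoader("KJR-Policy-Manual-August-23.pdf")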
pages = loader.load_and_split()
vectorstore = Chroma.from_documents(documents=pages, embedding=OpenAIEmbeddings())


In [21]:
retriever = vectorstore.as_retriever()

prompt = hub.pull("rlm/rag-prompt")
llm = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0)

def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)

rag_chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)

## Generate a test set using the RAG

Now that we've set up the application, we can instantiate the `GenerateTestSet` class with the application. This way the test set generation will be tailored to your app and data.

After instantiating the `GenerateTestSet` class, generate your test set by specifying `test_breadth` and `test_depth`.

In [22]:
test = GenerateTestSet(app_callable=rag_chain.invoke)
test_set = test.generate_test_set(test_breadth=5, test_depth=2)
test_set


{'Standards of Conduct': ['What are the consequences for breaching the Code of Conduct at KJR?',
  'How does KJR ensure that team members maintain the highest standards of honesty, integrity, and respect in their conduct?'],
 'Compliance': ['What are the requirements for suppliers to notify KJR of any actual or suspected breaches of the Act?',
  'Who is responsible for approving the annual compliance statement and how is it prepared in association with KJR?'],
 'Professionalism': ['How does KJR expect team members to dress when working on-site or meeting with customers?',
  'What are the consequences for team members who breach the KJR Code of Conduct regarding dress standards?'],
 'Consequences': ['What are examples of misconduct that may result in disciplinary action, including termination of employment?',
  'How does KJR determine whether misconduct is gross or serious, and what are the potential consequences for employees in those cases?'],
 'Unacceptable Use': ['What types of cont

In [23]:
for category in test_set:
    print(category)

Standards of Conduct
Compliance
Professionalism
Consequences
Unacceptable Use


In [24]:
print(json.dumps(test_set, indent=4))

{
    "Standards of Conduct": [
        "What are the consequences for breaching the Code of Conduct at KJR?",
        "How does KJR ensure that team members maintain the highest standards of honesty, integrity, and respect in their conduct?"
    ],
    "Compliance": [
        "What are the requirements for suppliers to notify KJR of any actual or suspected breaches of the Act?",
        "Who is responsible for approving the annual compliance statement and how is it prepared in association with KJR?"
    ],
    "Professionalism": [
        "How does KJR expect team members to dress when working on-site or meeting with customers?",
        "What are the consequences for team members who breach the KJR Code of Conduct regarding dress standards?"
    ],
    "Consequences": [
        "What are examples of misconduct that may result in disciplinary action, including termination of employment?",
        "How does KJR determine whether misconduct is gross or serious, and what are the potentia

In [27]:
filename = "generated_prompts"

# Build one record per prompt as a dict, so that quotes or backslashes inside
# a prompt cannot break the JSON. expected_output is left empty (null in JSON).
groundtruth_prompts = [
    {"input": prompt, "expected_output": None}
    for category in test_set
    for prompt in test_set[category]
]

with open(filename + ".json", "w") as outfile:
    json.dump(groundtruth_prompts, outfile, indent=4)
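
To sanity-check an exported file like the one written above, one option is to reload it and confirm that every record carries the `input` and `expected_output` keys. The sketch below is self-contained: the `load_prompt_records` helper is hypothetical, and it runs against a throwaway file with a placeholder record rather than the real `generated_prompts.json`.

```python
import json
import os
import tempfile

REQUIRED_KEYS = {"input", "expected_output"}


def load_prompt_records(path):
    """Load a JSON list of records and check each one has the expected keys."""
    with open(path) as infile:
        records = json.load(infile)
    for index, record in enumerate(records):
        missing = REQUIRED_KEYS - set(record)
        if missing:
            raise ValueError(f"record {index} is missing keys: {sorted(missing)}")
    return records


# Demonstration with a temporary file and a placeholder record.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "demo_prompts.json")
    with open(demo_path, "w") as outfile:
        json.dump(
            [{"input": "Placeholder question?", "expected_output": None}],
            outfile,
        )
    demo_records = load_prompt_records(demo_path)

print(demo_records)
```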

We can also provide a list of example questions to steer the app toward the types of questions we want to test.

In [28]:
examples = [
    "How much leave do KJR employees get?",
    "What hours are KJR employees expected to work?",
]

fewshot_test_set = test.generate_test_set(test_breadth=3, test_depth=2, examples=examples)
fewshot_test_set

{'Professionalism': ['How many days of annual leave do KJR team members accrue per completed year of service?',
  'What is the basic entitlement for personal leave for full-time KJR team members?'],
 'confidentiality': ['What are the consequences for excessive absenteeism or tardiness at KJR?',
  'Is it possible for a complainant or witness to remain anonymous in a complaint or investigation at KJR?'],
 'social media': ['How many hours are KJR employees expected to work?',
  'What are the consequences for excessive absenteeism at KJR?']}
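
In the few-shot output above, prompts in different categories overlap closely (the absenteeism question appears twice in slightly different wording). A minimal sketch for flattening a test set and dropping repeated prompts is shown below. The `unique_prompts` helper is hypothetical, and it only removes prompts that are identical after lowercasing and whitespace normalization, so near-duplicates with different wording would still need manual review or a similarity measure.

```python
def unique_prompts(test_set):
    """Flatten category -> prompts into (category, prompt) pairs, skipping repeats."""
    seen = set()
    unique = []
    for category, prompts in test_set.items():
        for prompt in prompts:
            # Normalize case and whitespace so trivial variations count as repeats.
            key = " ".join(prompt.lower().split())
            if key not in seen:
                seen.add(key)
                unique.append((category, prompt))
    return unique


# Placeholder data, not real generated output.
sample = {
    "topic one": ["What is the policy?", "Who approves it?"],
    "topic two": ["what is the  policy?", "When does it apply?"],
}
deduped = unique_prompts(sample)
print(deduped)
```

The first occurrence of each prompt is kept along with its category, so later categories may end up with fewer prompts than `test_depth`.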