# Generating a Test Set with TruLens

In the early stages of developing an LLM app, it is often challenging to generate a comprehensive test set on which to evaluate your app.

This notebook demonstrates test set generation with TruLens, aimed particularly at applications that rely on private data or context, such as RAG apps.

Given your LLM app callable, TruLens uses the app to generate its own test set, depending on your specifications for `test_breadth` and `test_depth`. The resulting test set contains both question categories tailored to your data and a list of test prompts for each category. `test_breadth` sets the number of categories, and `test_depth` sets the number of prompts per category.
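To make the shape of the result concrete, here is a minimal sketch using a hand-written dictionary in place of real generated output. The category names, the prompts, and the `summarize_test_set` helper are all hypothetical and are not part of TruLens. Because the set is produced by an LLM, it can be worth checking the counts rather than assuming they match what you requested.

```python
# Hypothetical hand-written sample with the same shape as a generated test set:
# a dict mapping each category name to a list of prompt strings.
sample_test_set = {
    "Category A": ["Prompt A1?", "Prompt A2?"],
    "Category B": ["Prompt B1?", "Prompt B2?"],
    "Category C": ["Prompt C1?", "Prompt C2?"],
}


def summarize_test_set(test_set):
    """Return (number of categories, total number of prompts)."""
    breadth = len(test_set)
    total_prompts = sum(len(prompts) for prompts in test_set.values())
    return breadth, total_prompts


breadth, total_prompts = summarize_test_set(sample_test_set)
print(f"{breadth} categories, {total_prompts} prompts")
```

With a breadth of 3 and a depth of 2, a complete set would hold 6 prompts, which is what the sample contains.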

In [1]:
from trulens_eval.generate_test_set import GenerateTestSet

## Set key

In [2]:
import os
#os.environ["OPENAI_API_KEY"] = "sk-..."

## Build application

In [3]:
# Imports from LangChain to build app
import bs4
from langchain import hub
from langchain.chat_models import ChatOpenAI
from langchain.document_loaders import WebBaseLoader
from langchain.embeddings import OpenAIEmbeddings
from langchain.schema import StrOutputParser
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.vectorstores import Chroma
from langchain_core.runnables import RunnablePassthrough
from langchain_community.document_loaders import PyPDFLoader
import json
#import sqlite3
#print ("SQLite Version is:", sqlite3.sqlite_version)
#print ("DB-API Version is:", sqlite3.version)


In [4]:
# Load example content from the web
loader = WebBaseLoader(
    web_paths=("https://lilianweng.github.io/posts/2023-06-23-agent/",),
    bs_kwargs=dict(
        parse_only=bs4.SoupStrainer(
            class_=("post-content", "post-title", "post-header")
        )
    ),
)
docs = loader.load()

In [5]:
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
splits = text_splitter.split_documents(docs)

vectorstore = Chroma.from_documents(documents=splits, embedding=OpenAIEmbeddings())


In [4]:
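# Alternative data source: load a local PDF instead of the web page.
# The outputs shown below were generated from this document.
loader = PyPDFLoader("KJR-Policy-Manual-August-23.pdf")
pages = loader.load_and_split()
vectorstore = Chroma.from_documents(documents=pages, embedding=OpenAIEmbeddings())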



In [5]:
retriever = vectorstore.as_retriever()

prompt = hub.pull("rlm/rag-prompt")
llm = ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0)

def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)

rag_chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)


## Generate a test set using the RAG

Now that we've set up the application, we can instantiate the `GenerateTestSet` class with the application. This way the test set generation will be tailored to your app and data.

After instantiating the `GenerateTestSet` class, generate your test set by specifying `test_breadth` and `test_depth`.

In [6]:
test = GenerateTestSet(app_callable=rag_chain.invoke)
test_set = test.generate_test_set(test_breadth=5, test_depth=2)
test_set


{'Environmental Impact Policy': ['How does KJR aim to minimize the environmental impact of its operations according to its Environmental Impact Policy?',
  'What steps does KJR take to ensure compliance with relevant environmental legislation and regulations in its operations?'],
 'Modern Slavery Policy': ['How does KJR ensure compliance with the Modern Slavery Act in its business operations and relationships?',
  "What are the key elements of KJR's Modern Slavery program to prevent, detect, and respond to the risk of Modern Slavery occurring within the organization or in any other business relationships?"],
 'Code of Conduct': ['What are the consequences for breaching the Code of Conduct at KJR?',
  'How does KJR define bribery and corruption in their Anti-Bribery and Corruption Policy?'],
 'Policy Manual': ['What is the purpose of the KJR Policy Manual?',
  'How many pages are in the KJR Policy Manual?'],
 'Workplace Standards': ['What are the dress standards expected of KJR team mem

In [7]:
for category in test_set:
    print(category)

Environmental Impact Policy
Modern Slavery Policy
Code of Conduct
Policy Manual
Workplace Standards


In [8]:
print(json.dumps(test_set, indent=4))

{
    "Environmental Impact Policy": [
        "How does KJR aim to minimize the environmental impact of its operations according to its Environmental Impact Policy?",
        "What steps does KJR take to ensure compliance with relevant environmental legislation and regulations in its operations?"
    ],
    "Modern Slavery Policy": [
        "How does KJR ensure compliance with the Modern Slavery Act in its business operations and relationships?",
        "What are the key elements of KJR's Modern Slavery program to prevent, detect, and respond to the risk of Modern Slavery occurring within the organization or in any other business relationships?"
    ],
    "Code of Conduct": [
        "What are the consequences for breaching the Code of Conduct at KJR?",
        "How does KJR define bribery and corruption in their Anti-Bribery and Corruption Policy?"
    ],
    "Policy Manual": [
        "What is the purpose of the KJR Policy Manual?",
        "How many pages are in the KJR Policy M

In [9]:
filename = "generated_prompts"

# Build each record as a dict so that prompts containing quotes or
# backslashes serialize correctly. Expected outputs are left empty
# (null in JSON).
groundtruth_prompts = [
    {"input": prompt, "expected_output": None}
    for category in test_set
    for prompt in test_set[category]
]

with open(filename + ".json", "w") as outfile:
    json.dump(groundtruth_prompts, outfile, indent=4)
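Once the prompts are in this record format, a natural next step is to run each one through the app and keep the responses for review. The following is a minimal sketch under stated assumptions: `stub_app` is a hypothetical stand-in for a real callable such as `rag_chain.invoke`, the sample records are invented, and `collect_responses` is an illustrative helper rather than a TruLens function.

```python
import json


# Hypothetical stand-in for a real app callable such as rag_chain.invoke.
def stub_app(prompt: str) -> str:
    return f"stub answer to: {prompt}"


# Invented records in the same shape as the saved prompt file.
records = [
    {"input": "What is the purpose of the policy manual?", "expected_output": None},
    {"input": 'What does "flexible hours" mean?', "expected_output": None},
]


def collect_responses(app_callable, records):
    """Call the app once per prompt and pair each prompt with its response."""
    return [
        {"input": record["input"], "response": app_callable(record["input"])}
        for record in records
    ]


results = collect_responses(stub_app, records)
print(json.dumps(results, indent=4))
```

Swapping `stub_app` for the real callable should be the only change needed, provided the app takes a single string and returns a string.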

We can also provide a list of examples to help guide our app to the types of questions we want to test.

In [28]:
examples = [
    "How much leave do KJR employees get?",
    "What hours are KJR employees expected to work?"
]

fewshot_test_set = test.generate_test_set(test_breadth=3, test_depth=2, examples=examples)
fewshot_test_set

{'Professionalism': ['How many days of annual leave do KJR team members accrue per completed year of service?',
  'What is the basic entitlement for personal leave for full-time KJR team members?'],
 'confidentiality': ['What are the consequences for excessive absenteeism or tardiness at KJR?',
  'Is it possible for a complainant or witness to remain anonymous in a complaint or investigation at KJR?'],
 'social media': ['How many hours are KJR employees expected to work?',
  'What are the consequences for excessive absenteeism at KJR?']}