## Llama-Index Agents + Ground Truth & Custom Evaluations

In this example, we build an agent-based app with Llama-Index to answer questions with the help of Yelp. We'll evaluate it using a few different feedback functions (some custom, some out-of-the-box).

The first set of feedback functions completes what we call the non-hallucination triad. However, because we're dealing with agents here, we've added a fourth leg (query translation) to cover the additional interaction between the query planner and the agent. This combination provides a foundation for eliminating hallucination in LLM applications.

1. Query Translation - The first step. Here we compare the similarity of the original user query to the query sent to the agent. This ensures that we're providing the agent with the correct question.
2. Context or QS Relevance - Next, we compare the relevance of the context provided by the agent back to the original query. This ensures that we're providing context for the right question.
3. Groundedness - Third, we ensure that the final answer is supported by the context. This ensures that the LLM is not extending beyond the information provided by the agent.
4. Question Answer Relevance - Last, we want to make sure that the final answer provided is relevant to the user query. This last step confirms that the answer is not only supported but also useful to the end user.

In this example, we'll add two additional feedback functions.

5. Ratings usage - evaluate if the summarized context uses ratings as justification. Note: this may not be relevant for all queries.
6. Ground truth eval - we want to make sure our app responds correctly. We will create a ground truth set for this evaluation.

Last, we'll compare the evaluation of this app against a standalone LLM. May the best bot win?

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/truera/trulens/blob/main/trulens_eval/examples/frameworks/llama_index/llama_index_agents.ipynb)

### Import from TruLens and Llama-Index

In [1]:
%load_ext autoreload
%autoreload 2
from pathlib import Path
import sys

# If running from github repo, can use this:
sys.path.append(str(Path().cwd().parent.parent.parent.resolve()))

# Uncomment for more debugging printouts.

import logging
root = logging.getLogger()
"""
root.setLevel(logging.DEBUG)

handler = logging.StreamHandler(sys.stdout)
handler.setLevel(logging.DEBUG)
formatter = logging.Formatter('%(asctime)s - %(name)s - %(levelname)s - %(message)s')
handler.setFormatter(formatter)
root.addHandler(handler)
"""
None

In [2]:
#! pip install trulens_eval==0.11.0 llama_index==0.8.21 llama_hub==0.0.27 yelpapi==2.5.0
# ! pip install llama_hub==0.0.27 yelpapi==2.5.0

In [3]:
# Setup OpenAI Agent
import llama_index
from llama_index.agent import OpenAIAgent
from llama_index import question_gen
from llama_index.question_gen import types
import openai

import os

In [4]:
# os.environ["OPENAI_API_KEY"] = "..."
# openai.api_key = os.environ["OPENAI_API_KEY"]

# YELP_API_KEY = "..."
# YELP_CLIENT_ID = "..."
from trulens_eval.keys import check_or_set_keys
check_or_set_keys("YELP_API_KEY", "YELP_CLIENT_ID")

✅ Key YELP_API_KEY set from environment (same value found in .env file at /Users/piotrm/Dropbox/repos/github/trulens/.env).
✅ Key YELP_CLIENT_ID set from environment (same value found in .env file at /Users/piotrm/Dropbox/repos/github/trulens/.env).


### Set up our Llama-Index App

For this app, we will use a tool from Llama-Index to connect to Yelp and allow the agent to search for businesses and fetch reviews.

In [5]:
# Import and initialize our tool spec
from llama_hub.tools.yelp.base import YelpToolSpec
from llama_index.tools.tool_spec.load_and_search.base import LoadAndSearchToolSpec

# Add Yelp API key and client ID
tool_spec = YelpToolSpec(
    api_key=os.environ.get("YELP_API_KEY"),
    client_id=os.environ.get("YELP_CLIENT_ID")
)

In [6]:
gordon_ramsay_prompt = "You answer questions about restaurants in the style of Gordon Ramsay, often insulting the asker."

In [7]:
# Create the Agent with our tools
tools = tool_spec.to_tool_list()
agent = OpenAIAgent.from_tools(
    [
        *LoadAndSearchToolSpec.from_defaults(tools[0]).to_tool_list(),
        *LoadAndSearchToolSpec.from_defaults(tools[1]).to_tool_list()
    ],
    verbose=True,
    system_prompt=gordon_ramsay_prompt
)

### Create a standalone GPT-3.5 for comparison

In [8]:
def llm_standalone(prompt):
    # Same system prompt as the agent, but with no tools or Yelp access.
    return openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": gordon_ramsay_prompt},
            {"role": "user", "content": prompt}
        ]
    )["choices"][0]["message"]["content"]

## Evaluation and Tracking with TruLens

In [9]:
# imports required for tracking and evaluation
from trulens_eval import Feedback, OpenAI, Tru, TruBasicApp, TruLlama, Select, OpenAI as fOpenAI
from trulens_eval.feedback import GroundTruthAgreement, Groundedness
import numpy as np

tru = Tru()

🦑 Tru initialized with db url sqlite:///default.sqlite .
🛑 Secret keys may be written to the database. See the `database_redact_keys` option of `Tru` to prevent this.


## Evaluation setup

To set up our evaluation, we'll first create two new custom feedback functions: query_translation_score and ratings_usage. These are straightforward prompts to the OpenAI API.

In [10]:
class OpenAI_custom(OpenAI):
    def query_translation_score(self, question1: str, question2: str) -> float:
        # Ask the model for a 1-10 similarity rating, then scale it to 0-1.
        return float(openai.ChatCompletion.create(
            model="gpt-3.5-turbo",
            messages=[
                {"role": "system", "content": "Your job is to rate how similar two questions are on a scale of 1 to 10. Respond with the number only."},
                {"role": "user", "content": f"QUESTION 1: {question1}; QUESTION 2: {question2}"}
            ]
        )["choices"][0]["message"]["content"]) / 10

    def ratings_usage(self, last_context: str) -> float:
        # Returns 1.0 if the context mentions ratings or reviews, else 0.0.
        print(last_context)
        return float(openai.ChatCompletion.create(
            model="gpt-3.5-turbo",
            messages=[
                {"role": "system", "content": "Your job is to respond with a '1' if the following statement mentions ratings or reviews, and a '0' if not."},
                {"role": "user", "content": f"STATEMENT: {last_context}"}
            ]
        )["choices"][0]["message"]["content"])
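
Both custom feedback functions pass the model's reply straight to `float(...)`, which raises if the reply contains anything besides a bare number. A minimal sketch of a more defensive parser follows. `parse_score` is a hypothetical helper written for illustration; it is not part of trulens_eval or the OpenAI client.

```python
import re


def parse_score(text: str, lo: float = 0.0, hi: float = 10.0) -> float:
    """Extract the first number from an LLM reply and scale it to [0, 1].

    Hypothetical helper for illustration; not part of trulens_eval.
    """
    match = re.search(r"-?\d+(?:\.\d+)?", text)
    if match is None:
        raise ValueError(f"no numeric score found in {text!r}")
    value = float(match.group())
    # Clamp to the expected range before normalizing.
    value = max(lo, min(hi, value))
    return (value - lo) / (hi - lo)


print(parse_score("8"))
print(parse_score("Rating: 7 out of 10"))
```

With a helper like this, a reply such as "Rating: 7 out of 10" still yields a usable score, and an unparseable reply fails with a clear message rather than a bare conversion error.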

Now that we have all of our feedback functions available, we can instantiate them. For many of our evals, we want to check intermediate parts of our app, such as the query passed to the Yelp app or the summarization of the Yelp content. We'll do so here using Select.

In [11]:
custom = OpenAI_custom()
f_query_translation = Feedback(custom.query_translation_score, name = "Query Translation").on_input().on(
    Select.Record.app.query.args.str_or_query_bundle # check the query bundle passed to yelp api
)
f_ratings_usage = Feedback(custom.ratings_usage, name = "Ratings Usage").on(
    Select.Record.app.query.rets.response # check the last content chunk for mentions of ratings or reviews
)

fopenai = fOpenAI()
# Question/statement (context) relevance between question and last context chunk (i.e. summary)
f_context_relevance = Feedback(fopenai.qs_relevance, name = "Context Relevance").on_input().on(
    Select.Record.app.query.rets.response # check context
)

# Question/answer relevance between overall question and answer.
f_qa_relevance = Feedback(fopenai.relevance, name = "Answer Relevance").on_input_output()

# Groundedness
grounded = Groundedness(groundedness_provider=fopenai)

f_groundedness = Feedback(grounded.groundedness_measure, name = "Groundedness").on(
    Select.Record.app.query.rets.response # check context
).on_output().aggregate(grounded.grounded_statements_aggregator)


Feedback function `groundedness_measure` was renamed to `groundedness_measure_with_cot_reasons`. The new functionality of `groundedness_measure` function will no longer emit reasons as a lower cost option. It may have reduced accuracy due to not using Chain of Thought reasoning in the scoring.


✅ In Query Translation, input question1 will be set to *.__record__.main_input or `Select.RecordInput` .
✅ In Query Translation, input question2 will be set to *.__record__.app.query.args.str_or_query_bundle .
✅ In Ratings Usage, input last_context will be set to *.__record__.app.query.rets.response .
✅ In Context Relevance, input question will be set to *.__record__.main_input or `Select.RecordInput` .
✅ In Context Relevance, input statement will be set to *.__record__.app.query.rets.response .
✅ In Answer Relevance, input prompt will be set to *.__record__.main_input or `Select.RecordInput` .
✅ In Answer Relevance, input response will be set to *.__record__.main_output or `Select.RecordOutput` .
✅ In Groundedness, input source will be set to *.__record__.app.query.rets.response .
✅ In Groundedness, input statement will be set to *.__record__.main_output or `Select.RecordOutput` .
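
Each selector above names a path into the data recorded for a single app call. As a rough mental model only (this is not how TruLens implements selectors), resolving a dotted path against nested dictionaries can be sketched as follows. `resolve_path` and the sample `record` are hypothetical and heavily simplified.

```python
from typing import Any


def resolve_path(data: Any, path: str) -> Any:
    """Walk a dotted path through nested dicts; raise KeyError if a step is missing."""
    current = data
    for step in path.split("."):
        if not isinstance(current, dict) or step not in current:
            raise KeyError(f"could not locate {path!r} (failed at {step!r})")
        current = current[step]
    return current


# Hypothetical stand-in for a recorded call; a real record holds much more.
record = {
    "app": {
        "query": {
            "args": {"str_or_query_bundle": "What's the vibe like at Orphan Andy's in SF?"},
            "rets": {"response": "The vibe is not mentioned in the context."},
        }
    }
}

print(resolve_path(record, "app.query.args.str_or_query_bundle"))
print(resolve_path(record, "app.query.rets.response"))
```

If a step of the path is absent from the record, there is nothing for the feedback function to evaluate, so the selector must match a call the app actually makes.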


### Ground Truth Eval

It's also useful in many cases to do ground truth eval with small golden sets. We'll do so here.

In [12]:
golden_set = [
    {"query": "What's the vibe like at oprhan andy's in SF?", "response": "welcoming and friendly"},
    {"query": "Is park tavern in San Fran open yet?", "response": "Yes"},
    {"query": "I'm in san francisco for the morning, does Juniper serve pastries?", "response": "Yes"},
    {"query": "What's the address of Gumbo Social in San Francisco?", "response": "5176 3rd St, San Francisco, CA 94124"},
    {"query": "What are the reviews like of Gola in SF?", "response": "Excellent, 4.6/5"},
    {"query": "Where's the best pizza in New York City", "response": "Joe's Pizza"},
    {"query": "What's the best diner in Toronto?", "response": "The George Street Diner"}
]

f_groundtruth = Feedback(GroundTruthAgreement(golden_set).agreement_measure, name = "Ground Truth Eval").on_input_output()

✅ In Ground Truth Eval, input prompt will be set to *.__record__.main_input or `Select.RecordInput` .
✅ In Ground Truth Eval, input response will be set to *.__record__.main_output or `Select.RecordOutput` .
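
Conceptually, a ground truth eval first looks up the expected response for the incoming query and then scores the app's answer against it. The lookup step can be sketched as below, under the assumption that queries are matched by exact string equality. `expected_response` is a hypothetical helper, and the scoring step is omitted.

```python
from typing import Dict, List, Optional

# Two entries copied from the golden set above.
golden_subset: List[Dict[str, str]] = [
    {"query": "Is park tavern in San Fran open yet?", "response": "Yes"},
    {"query": "Where's the best pizza in New York City", "response": "Joe's Pizza"},
]


def expected_response(golden: List[Dict[str, str]], query: str) -> Optional[str]:
    """Return the golden response for an exactly matching query, or None."""
    for pair in golden:
        if pair["query"] == query:
            return pair["response"]
    return None


print(expected_response(golden_subset, "Is park tavern in San Fran open yet?"))
print(expected_response(golden_subset, "is park tavern in san fran open yet?"))
```

Under that exact-match assumption, a prompt only finds its golden response if it matches the stored query character for character, including casing and spelling.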


### Run the dashboard

By running the dashboard before we start to make app calls, we can see them come in one by one.

In [13]:
tru.run_dashboard(_dev=Path().cwd().parent.parent.parent.resolve(), force=True)

Force stopping dashboard ...
Starting dashboard ...
Config file already exists. Skipping writing process.
Credentials file already exists. Skipping writing process.


Accordion(children=(VBox(children=(VBox(children=(Label(value='STDOUT'), Output())), VBox(children=(Label(valu…

Dashboard started at http://192.168.1.60:8501 .


<subprocess.Popen at 0x287ce8c40>

### Instrument Yelp App

We can instrument our Yelp app with TruLlama and utilize the full suite of evals we set up.

In [14]:
tru_agent = TruLlama(agent,
    app_id='YelpAgent',
    tags = "agent prototype",
    feedbacks = [f_qa_relevance, f_groundtruth, f_context_relevance, f_groundedness, f_query_translation, f_ratings_usage]
)

In [15]:
tru_agent.print_instrumented()

Components:
Agent of trulens_eval.utils.llama component: *.__app__.app
LLM of trulens_eval.utils.llama component: *.__app__.app._llm
Tool of trulens_eval.utils.llama component: *.__app__.app._tools[0]
Other of trulens_eval.utils.llama component: *.__app__.app._tools[0].metadata
Tool of trulens_eval.utils.llama component: *.__app__.app._tools[1]
Other of trulens_eval.utils.llama component: *.__app__.app._tools[1].metadata
Tool of trulens_eval.utils.llama component: *.__app__.app._tools[2]
Other of trulens_eval.utils.llama component: *.__app__.app._tools[2].metadata
Tool of trulens_eval.utils.llama component: *.__app__.app._tools[3]
Other of trulens_eval.utils.llama component: *.__app__.app._tools[3].metadata
Other of trulens_eval.utils.llama component: *.__app__.app.memory

Methods:
Object at 0x287c8f3a0:
	<function BaseQueryEngine.query at 0x16b604670> with path *.__app__.app
	<function BaseQueryEngine.aquery at 0x16b604790> with path *.__app__.app
	<function trace_method.<locals>.deco

### Instrument Standalone LLM app.

Since we don't have insight into OpenAI's inner workings, we cannot run many of the evals on intermediate steps.

We can still do QA relevance on input and output, and check for similarity of the answers compared to the ground truth.

In [16]:
tru_llm_standalone = TruBasicApp(
    llm_standalone,
    app_id="OpenAIChatCompletion",
    tags = "comparison",
    feedbacks=[f_qa_relevance, f_groundtruth]
)

### Start using our apps!

In [17]:
prompt_set = [
    "What's the vibe like at oprhan andy's in SF?",
    "What are the reviews like of Gola in SF?",
    "Where's the best pizza in New York City",
    "What's the address of Gumbo Social in San Francisco?",
    "I'm in san francisco for the morning, does Juniper serve pastries?",
    "What's the best diner in Toronto?"
]

In [40]:
import functools

# Keep a reference to the original constructor so the wrapper below can call it.
original = llama_index.query_engine.retriever_query_engine.RetrieverQueryEngine.__new__

@functools.wraps(original)
def replacement(cls, *args, **kwargs):
    print("hello there, creating new instance")
    return original(cls)#, *args, **kwargs)

llama_index.query_engine.retriever_query_engine.RetrieverQueryEngine.__new__ = replacement

In [41]:
for prompt in prompt_set[0:1]:
    # with tru_llm_standalone as recording:
    #    llm_standalone(prompt)
    with tru_agent as recording:
        agent.query(prompt)

=== Calling Function ===
Calling function: business_search with args: {
  "location": "San Francisco",
  "term": "Orphan Andy's"
}
Got output: Content loaded! You can now search the information using read_business_search


A new object of type <class 'llama_index.query_engine.retriever_query_engine.RetrieverQueryEngine'> at 0x2e4ffaeb0 is calling an instrumented method <function BaseQueryEngine.query at 0x16b604670>. The path of this call may be incorrect.
Guessing path of new object is *.app based on other object (0x287c8f3a0) using this function.


=== Calling Function ===
Calling function: read_business_search with args: {
  "query": "What's the vibe like at Orphan Andy's in SF?"
}
hello there, creating new instance
Got output: The vibe at Orphan Andy's in San Francisco is not provided in the given context information.


Could not locate *.app.query.rets.response in app/record.
Could not locate *.app.query.rets.response in app/record.
Could not locate *.app.query.args.str_or_query_bundle in app/record.
Could not locate *.app.query.rets.response in app/record.


In [19]:
rec = recording.get()

In [32]:
rec.layout_calls_as_app().app['query']

[RecordAppCall(stack=(RecordAppCallMethod(path=JSONPath().app, method=Method(obj=Obj(cls=llama_index.agent.openai_agent.OpenAIAgent, id=10868028320), name='query')), RecordAppCallMethod(path=JSONPath().app, method=Method(obj=Obj(cls=llama_index.agent.openai_agent.OpenAIAgent, id=10868028320), name='wrapper')), RecordAppCallMethod(path=JSONPath().app._tools[1], method=Method(obj=Obj(cls=llama_index.tools.function_tool.FunctionTool, id=10867977280), name='call')), RecordAppCallMethod(path=JSONPath().app, method=Method(obj=Obj(cls=llama_index.indices.query.base.BaseQueryEngine, id=10868028320), name='query'))), args={'str_or_query_bundle': "What's the vibe like at Orphan Andy's in SF?"}, rets=Response(response="The vibe at Orphan Andy's in San Francisco is not directly mentioned in the context information.", source_nodes=[NodeWithScore(node=TextNode(id_='58354a5a-8fce-44ac-8eed-0a3358fd4a16', embedding=None, metadata={}, excluded_embed_metadata_keys=[], excluded_llm_metadata_keys=[], rela

In [20]:
for c in rec.calls:
    print(c.stack[-1].path)

*.app.memory
*.app._llm
*.app.memory
*.app._tools[0]
*.app._tools[0]
*.app.memory
*.app._llm
*.app.memory
*.app._llm
*.app
*.app._tools[1]
*.app.memory
*.app._llm
*.app.memory
*.app
*.app


In [21]:
agent

<llama_index.agent.openai_agent.OpenAIAgent at 0x287c8f3a0>

In [22]:
def listobj(obj):
    for k in dir(obj):
        v = getattr(obj, k)
        print(type(v).__name__, k)

In [23]:
listobj(agent._tools[0])
# listobj(agent.memory)

method __call__
type __class__
method-wrapper __delattr__
dict __dict__
builtin_function_or_method __dir__
str __doc__
method-wrapper __eq__
builtin_function_or_method __format__
method-wrapper __ge__
method-wrapper __getattribute__
method-wrapper __gt__
method-wrapper __hash__
method __init__
builtin_function_or_method __init_subclass__
method-wrapper __le__
method-wrapper __lt__
str __module__
method-wrapper __ne__
builtin_function_or_method __new__
builtin_function_or_method __reduce__
builtin_function_or_method __reduce_ex__
method-wrapper __repr__
method-wrapper __setattr__
builtin_function_or_method __sizeof__
method-wrapper __str__
builtin_function_or_method __subclasshook__
NoneType __weakref__
function _async_fn
method _fn
ToolMetadata _metadata
method _process_langchain_tool_kwargs
method acall
function async_fn
method call
method fn
method from_defaults
ToolMetadata metadata
method to_langchain_structured_tool
method to_langchain_tool


In [24]:
agent.memory

ChatMemoryBuffer(token_limit=3072, tokenizer_fn=functools.partial(<bound method Encoding.encode of <Encoding 'gpt2'>>, allowed_special='all'), chat_history=[ChatMessage(role=<MessageRole.USER: 'user'>, content="What's the vibe like at oprhan andy's in SF?", additional_kwargs={}), ChatMessage(role=<MessageRole.ASSISTANT: 'assistant'>, content=None, additional_kwargs={'function_call': <OpenAIObject at 0x17364b680> JSON: {
  "arguments": "{\n  \"location\": \"San Francisco\",\n  \"term\": \"Orphan Andy's\"\n}",
  "name": "business_search"
}}), ChatMessage(role=<MessageRole.FUNCTION: 'function'>, content='Content loaded! You can now search the information using read_business_search', additional_kwargs={'name': 'business_search'}), ChatMessage(role=<MessageRole.ASSISTANT: 'assistant'>, content=None, additional_kwargs={'function_call': <OpenAIObject at 0x297825bd0> JSON: {
  "arguments": "{\n  \"query\": \"What's the vibe like at Orphan Andy's in SF?\"\n}",
  "name": "read_business_search"
}