<center>
    <p style="text-align:center">
    <img alt="arize logo" src="https://storage.googleapis.com/arize-assets/arize-logo-white.jpg" width="300"/>
        <br>
        <a href="https://docs.arize.com/arize/">Docs</a>
        |
        <a href="https://github.com/Arize-ai/client_python">GitHub</a>
        |
        <a href="https://arize-ai.slack.com/join/shared_invite/zt-11t1vbu4x-xkBIHmOREQnYnYDH1GDfCg">Community</a>
    </p>
</center>

<center><h1>Using Arize with RAG</h1></center>

This guide shows you how to create a retrieval augmented generation (RAG) chatbot and evaluate its performance with Arize. RAG is typically used to answer queries from a specified set of documents rather than from the LLM's own training data, which reduces hallucinations and incorrect generations.

We'll go through the following steps:

* Create a RAG chatbot using LlamaIndex

* Trace the retrieval and LLM calls using Arize

* Create a dataset to benchmark performance

* Evaluate performance using LLM as a judge

# Create a RAG chatbot using LlamaIndex

Let's start with all of our boilerplate setup:

1. Install packages for tracing and retrieval
2. Set up our API keys
3. Set up Arize for tracing
4. Create our LlamaIndex query engine
5. See your results in the Arize UI

### Install packages for tracing and retrieval

In [None]:
# Install llama-index, openai for setting up our RAG chatbot
!pip install -qq llama-index==0.12.5 openai==1.57.1 llama-index-core

# Install arize packages for tracing and evaluation
!pip install -qq "arize-phoenix-evals>=0.17.5" "arize-otel>=0.7.0" "openinference-instrumentation-llama-index>=3.0.4" "arize[Datasets]>7.29.0"

### Set up our API keys

In [None]:
import os
from getpass import getpass

SPACE_ID = globals().get("SPACE_ID") or getpass(
    "🔑 Enter your Arize Space ID: "
)
API_KEY = globals().get("API_KEY") or getpass("🔑 Enter your Arize API Key: ")
DEVELOPER_KEY = globals().get("DEVELOPER_KEY") or getpass(
    "🔑 Enter your Arize Developer Key: "
)
OPENAI_API_KEY = globals().get("OPENAI_API_KEY") or getpass(
    "🔑 Enter your OpenAI API key: "
)
os.environ["OPENAI_API_KEY"] = OPENAI_API_KEY
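
If you run this notebook non-interactively, prompting with `getpass` may not be practical. Here is a minimal sketch of a helper that checks an environment variable first and only prompts when it is missing. The helper name `get_secret` and the variable name `EXAMPLE_API_KEY` are made up for illustration; they are not part of the Arize or OpenAI SDKs.

```python
import os
from getpass import getpass


def get_secret(env_var: str, prompt: str) -> str:
    """Return the value of env_var if it is set, otherwise prompt for it."""
    value = os.environ.get(env_var)
    if value:
        return value
    return getpass(prompt)


# Hypothetical variable name, set here only so the example runs without a prompt.
os.environ["EXAMPLE_API_KEY"] = "demo-value"
demo_key = get_secret("EXAMPLE_API_KEY", "🔑 Enter your key: ")
```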

### Set up Arize for tracing

To follow along with this tutorial, you'll need to sign up for Arize and get your Space ID and API key. You can see the [guide here](https://docs.arize.com/arize/llm-tracing/quickstart-llm).

In [None]:
# Import open-telemetry dependencies
from arize.otel import register
from openinference.instrumentation.llama_index import LlamaIndexInstrumentor

# Setup OTEL via our convenience function
tracer_provider = register(
    space_id=SPACE_ID,
    api_key=API_KEY,
    project_name="rag-cookbook",  # name this to whatever you would like
)
LlamaIndexInstrumentor().instrument(tracer_provider=tracer_provider)

### Create our LlamaIndex query engine

In [None]:
!mkdir -p data
!wget "https://raw.githubusercontent.com/run-llama/llama_index/main/docs/docs/examples/data/paul_graham/paul_graham_essay.txt" -O data/paul_graham_essay.txt

In [None]:
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader
from pprint import pprint
from llama_index.llms.openai import OpenAI

# load documents
documents = SimpleDirectoryReader("data").load_data()
index = VectorStoreIndex.from_documents(documents)
query_engine = index.as_query_engine(llm=OpenAI(model="gpt-4o-mini"))
response = query_engine.query("What did Paul Graham work on?")
pprint(response)

In [None]:
for node in response.source_nodes:
    text_fmt = node.node.get_content().strip().replace("\n", " ")[:200] + "..."
    print(text_fmt)
    print(node.score)
    print("--------")

### See your results in the Arize UI
Once you've run a single query, you can see the trace in the Arize UI with each step taken by the retriever, the embedding, and the LLM query.

Click through the queries to better understand how the query engine is performing. Arize can be used to understand and troubleshoot your RAG pipeline by surfacing:
 - Application latency
 - Token usage
 - Runtime exceptions
 - Retrieved documents
 - Embeddings
 - LLM parameters
 - Prompt templates
 - Tool descriptions
 - LLM function calls
 - And more!

<img src="https://storage.googleapis.com/arize-assets/tutorials/images/llamaindex-arize-starter.png" width="800"/>

# Create synthetic dataset of questions

Using the template below, we'll generate a dataframe of 10 questions that we can use to test our RAG chatbot.

In [None]:
GEN_TEMPLATE = """
You are an assistant that generates Q&A questions about Paul Graham's essay below.

The questions should involve the essay contents, specific facts and figures,
names, and elements of the story. Do not ask any questions where the answer is
not in the essay contents.

Respond with one question per line. Do not include any numbering at the beginning of each line. Do not include any category headings.
Generate 10 questions. Be sure there are no duplicate questions.

[START ESSAY]
{essay}
[END ESSAY]
"""

with open("data/paul_graham_essay.txt", "r") as file:
    file_content = file.read()

GEN_TEMPLATE = GEN_TEMPLATE.format(essay=file_content)

In [None]:
import nest_asyncio
import pandas as pd

nest_asyncio.apply()
from phoenix.evals import OpenAIModel

pd.set_option("display.max_colwidth", 500)

model = OpenAIModel(model="gpt-4o", max_tokens=1300)

In [None]:
resp = model(GEN_TEMPLATE)

In [None]:
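# The prompt asks for one question per line, so split on newlines and drop blanks
split_response = [q.strip() for q in resp.strip().split("\n") if q.strip()]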

questions_df = pd.DataFrame(split_response, columns=["input"])
print(questions_df.head(3))
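
LLM output doesn't always follow formatting instructions, so it can help to clean the response before building the dataframe. The sketch below is one possible approach, not part of the original notebook: `parse_questions` is a hypothetical helper that strips stray numbering or bullets, skips blank lines, and drops duplicates.

```python
import re


def parse_questions(raw: str) -> list[str]:
    """Split an LLM response into unique, non-empty questions, one per line."""
    questions = []
    seen = set()
    for line in raw.splitlines():
        # Drop leading numbering or bullets such as "1. ", "2) ", or "- "
        cleaned = re.sub(r"^\s*(?:\d+[.)]|[-*•])\s*", "", line).strip()
        key = cleaned.lower()
        if cleaned and key not in seen:
            seen.add(key)
            questions.append(cleaned)
    return questions


sample = "1. What is RAG?\n\n- Who wrote the essay?\nWhat is RAG?\n"
print(parse_questions(sample))
```

In the notebook you could pass `resp` to this helper and build `questions_df` from the returned list.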

Now let's run the questions through our query engine and manually inspect the traces!

In [None]:
def run_rag(query_engine, questions_df):
    response_df = questions_df.copy(deep=True)
    for index, row in response_df.iterrows():
        response = query_engine.query(row["input"])
        reference_text = ""
        for node in response.source_nodes:
            reference_text += node.text
            reference_text += "\n"
        # Store the response text rather than the Response object itself
        response_df.loc[index, "output"] = str(response)
        response_df.loc[index, "reference"] = reference_text
    text_columns = ["input", "output", "reference"]
    response_df[text_columns] = response_df[text_columns].apply(
        lambda x: x.astype(str)
    )
    return response_df


response_df = run_rag(query_engine, questions_df)
response_df.head(3)

# Evaluating your RAG app

Now that we have a set of test cases, we can create evaluators to measure performance. This way, we don't have to manually inspect every single trace to see if the LLM is doing the right thing.

In [None]:
RELEVANCE_EVAL_TEMPLATE = """You are comparing a reference text to a question and trying to determine if the reference text
contains information relevant to answering the question. Here is the data:
    [BEGIN DATA]
    ************
    [Question]: {input}
    ************
    [Reference text]: {reference}
    [END DATA]

Compare the Question above to the Reference text. You must determine whether the Reference text
contains information that can answer the Question. Please focus on whether the very specific
question can be answered by the information in the Reference text.
Your response must be single word, either "relevant" or "unrelated",
and should not contain any text or characters aside from that word.
"unrelated" means that the reference text does not contain an answer to the Question.
"relevant" means the reference text contains an answer to the Question.
"""

CORRECTNESS_EVAL_TEMPLATE = """You are given a question, an answer and reference text. You must determine whether the
given answer correctly answers the question based on the reference text. Here is the data:
    [BEGIN DATA]
    ************
    [Question]: {input}
    ************
    [Reference]: {reference}
    ************
    [Answer]: {output}
    [END DATA]
Your response must be a single word, either "correct" or "incorrect",
and should not contain any text or characters aside from that word.
"correct" means that the question is correctly and fully answered by the answer.
"incorrect" means that the question is not correctly or only partially answered by the answer.
"""

We'll create an LLM as a judge from the prompt templates above by passing the responses we just collected to the `llm_classify` function. This function uses an LLM to evaluate each row and returns a label and an explanation. You can read more detail [here](https://docs.arize.com/phoenix/api/evals#phoenix.evals.llm_classify).

In [None]:
from phoenix.evals import OpenAIModel, llm_classify

RELEVANCE_RAILS = ["relevant", "unrelated"]
CORRECTNESS_RAILS = ["incorrect", "correct"]

relevance_eval_df = llm_classify(
    dataframe=response_df,
    template=RELEVANCE_EVAL_TEMPLATE,
    model=OpenAIModel(model="gpt-4o"),
    rails=RELEVANCE_RAILS,
    provide_explanation=True,
    include_prompt=True,
    concurrency=4,
)

correctness_eval_df = llm_classify(
    dataframe=response_df,
    template=CORRECTNESS_EVAL_TEMPLATE,
    model=OpenAIModel(model="gpt-4o"),
    rails=CORRECTNESS_RAILS,
    provide_explanation=True,
    include_prompt=True,
    concurrency=4,
)
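
The `rails` argument restricts the judge's output to a fixed set of labels. To illustrate the idea, here is a simplified sketch of how a raw LLM response could be mapped onto one of the rails. This is not Phoenix's actual implementation; `snap_to_rail` and the `UNPARSABLE` sentinel are names made up for this example, and the rails are assumed to be lowercase.

```python
import string

UNPARSABLE = "UNPARSABLE"  # hypothetical sentinel used only in this sketch


def snap_to_rail(raw_output: str, rails: list[str]) -> str:
    """Map a raw judge response onto one of the allowed labels, or UNPARSABLE."""
    # Normalize: trim surrounding whitespace, quotes, and punctuation, then lowercase
    cleaned = raw_output.strip(string.punctuation + string.whitespace).lower()
    # Require an exact match so "incorrect" is never mistaken for "correct"
    return cleaned if cleaned in rails else UNPARSABLE


example_rails = ["incorrect", "correct"]
print(snap_to_rail(' "Correct". ', example_rails))
print(snap_to_rail("I think it is correct", example_rails))
```

An exact match is a deliberately strict choice here: a substring check would wrongly find "correct" inside "incorrect".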

Let's inspect the results of our evaluation!

In [None]:
relevance_eval_df

In [None]:
correctness_eval_df
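
Scanning the rows is useful, but a single summary number makes runs easier to compare. Here is a small sketch that computes the share of rows with a given label. `label_rate` is a hypothetical helper, and the example labels are made up for illustration, not real evaluation results.

```python
from collections import Counter


def label_rate(labels: list[str], positive: str) -> float:
    """Return the fraction of labels equal to `positive` (0.0 for an empty list)."""
    if not labels:
        return 0.0
    return Counter(labels)[positive] / len(labels)


# Made-up labels for illustration only
example_labels = ["correct", "incorrect", "correct", "correct"]
print(f"{label_rate(example_labels, 'correct'):.0%}")
```

In the notebook, you could call it as `label_rate(correctness_eval_df["label"].tolist(), "correct")`.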

## Experiment with different k-values

We can also experiment with different k-values for the retriever, that is, the number of chunks retrieved from the vector store, as well as different chunk sizes and chunk overlaps. Rerankers, such as the ColbertReranker from LlamaIndex, are another option, although the code below doesn't use one. You can read more about it [here](https://docs.llamaindex.ai/docs/postprocessors/colbert-reranker).

In [None]:
from llama_index.core.node_parser import SimpleNodeParser
from llama_index.llms.openai import OpenAI


def run_rag_with_settings(questions_df, k_value, chunk_size, chunk_overlap):
    node_parser = SimpleNodeParser.from_defaults(
        chunk_size=chunk_size, chunk_overlap=chunk_overlap
    )
    nodes = node_parser.get_nodes_from_documents(documents)
    vector_index = VectorStoreIndex(nodes)
    query_engine = vector_index.as_query_engine(
        similarity_top_k=k_value,  # Default is 2
        response_mode="compact",  # or use "tree_summarize"
        llm=OpenAI(model="gpt-4o-mini"),
    )
    response_df = run_rag(query_engine, questions_df)
    return response_df
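
To build intuition for `chunk_size` and `chunk_overlap`, here is a simplified sliding-window sketch. It splits on whitespace and is only an illustration of the idea; LlamaIndex's node parser does more than this, and `chunk_tokens` is a hypothetical helper, not a LlamaIndex function.

```python
def chunk_tokens(
    tokens: list[str], chunk_size: int, chunk_overlap: int
) -> list[list[str]]:
    """Split tokens into windows of chunk_size that share chunk_overlap tokens."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start : start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reached the end of the text
    return chunks


words = "a b c d e f g h i j".split()
for chunk in chunk_tokens(words, chunk_size=4, chunk_overlap=1):
    print(chunk)
```

Each chunk repeats the last token of the previous one, so text that straddles a boundary still appears intact in at least one chunk. Smaller chunks give more precise matches but less context per retrieved chunk.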

Let's set up our evaluators to see how the performance changes.

In [None]:
def run_evaluators(rag_df):
    relevance_eval_df = llm_classify(
        dataframe=rag_df,
        template=RELEVANCE_EVAL_TEMPLATE,
        model=OpenAIModel(model="gpt-4o"),
        rails=RELEVANCE_RAILS,
        provide_explanation=True,
        concurrency=4,
    )
    rag_df["relevance"] = relevance_eval_df["label"]
    rag_df["relevance_explanation"] = relevance_eval_df["explanation"]

    correctness_eval_df = llm_classify(
        dataframe=rag_df,
        template=CORRECTNESS_EVAL_TEMPLATE,
        model=OpenAIModel(model="gpt-4o"),
        rails=CORRECTNESS_RAILS,
        provide_explanation=True,
        concurrency=4,
    )
    rag_df["correctness"] = correctness_eval_df["label"]
    rag_df["correctness_explanation"] = correctness_eval_df["explanation"]
    return rag_df

Let's log these results to Arize and see how they compare.

First we'll create a dataset to store our questions.

In [None]:
from arize.experimental.datasets import ArizeDatasetsClient
from uuid import uuid1
from arize.experimental.datasets.experiments.types import (
    ExperimentTaskResultColumnNames,
    EvaluationResultColumnNames,
)
from arize.experimental.datasets.utils.constants import GENERATIVE
import pandas as pd

# Set up the arize client
arize_client = ArizeDatasetsClient(developer_key=DEVELOPER_KEY, api_key=API_KEY)
dataset = None
dataset_name = "rag-experiments-" + str(uuid1())[:3]

dataset_id = arize_client.create_dataset(
    space_id=SPACE_ID,
    dataset_name=dataset_name,
    dataset_type=GENERATIVE,
    data=questions_df,
)
dataset = arize_client.get_dataset(space_id=SPACE_ID, dataset_id=dataset_id)
print(dataset)

Next, we'll define which columns of our dataframe map to task outputs and which map to evaluation labels and explanations.

In [None]:
# Define column mappings for task
task_cols = ExperimentTaskResultColumnNames(
    example_id="example_id", result="output"
)
# Define column mappings for evaluator
relevance_evaluator_cols = EvaluationResultColumnNames(
    label="relevance",
    explanation="relevance_explanation",
)
correctness_evaluator_cols = EvaluationResultColumnNames(
    label="correctness",
    explanation="correctness_explanation",
)


def log_experiment_to_arize(experiment_df, experiment_name):
    experiment_df["example_id"] = dataset["id"]
    return arize_client.log_experiment(
        space_id=SPACE_ID,
        experiment_name=experiment_name + "-" + str(uuid1())[:2],
        experiment_df=experiment_df,
        task_columns=task_cols,
        evaluator_columns={
            "correctness": correctness_evaluator_cols,
            "relevance": relevance_evaluator_cols,
        },
        dataset_name=dataset_name,
    )

Now let's run it for each of our experiments.

In [None]:
# Run Experiments for k-size
k_2_chunk_100_overlap_10 = run_rag_with_settings(
    questions_df, k_value=2, chunk_size=100, chunk_overlap=10
)
k_4_chunk_100_overlap_10 = run_rag_with_settings(
    questions_df, k_value=4, chunk_size=100, chunk_overlap=10
)
k_10_chunk_100_overlap_10 = run_rag_with_settings(
    questions_df, k_value=10, chunk_size=100, chunk_overlap=10
)
k_2_chunk_100_overlap_10 = run_evaluators(k_2_chunk_100_overlap_10)
k_4_chunk_100_overlap_10 = run_evaluators(k_4_chunk_100_overlap_10)
k_10_chunk_100_overlap_10 = run_evaluators(k_10_chunk_100_overlap_10)

log_experiment_to_arize(k_2_chunk_100_overlap_10, "k_2_chunk_100_overlap_10")
log_experiment_to_arize(k_4_chunk_100_overlap_10, "k_4_chunk_100_overlap_10")
log_experiment_to_arize(k_10_chunk_100_overlap_10, "k_10_chunk_100_overlap_10")

In [None]:
# Run experiments for chunk size
k_2_chunk_200_overlap_10 = run_rag_with_settings(
    questions_df, k_value=2, chunk_size=200, chunk_overlap=10
)
k_2_chunk_500_overlap_20 = run_rag_with_settings(
    questions_df, k_value=2, chunk_size=500, chunk_overlap=20
)
k_2_chunk_1000_overlap_50 = run_rag_with_settings(
    questions_df, k_value=2, chunk_size=1000, chunk_overlap=50
)

k_2_chunk_200_overlap_10 = run_evaluators(k_2_chunk_200_overlap_10)
k_2_chunk_500_overlap_20 = run_evaluators(k_2_chunk_500_overlap_20)
k_2_chunk_1000_overlap_50 = run_evaluators(k_2_chunk_1000_overlap_50)

log_experiment_to_arize(k_2_chunk_200_overlap_10, "k_2_chunk_200_overlap_10")
log_experiment_to_arize(k_2_chunk_500_overlap_20, "k_2_chunk_500_overlap_20")
log_experiment_to_arize(k_2_chunk_1000_overlap_50, "k_2_chunk_1000_overlap_50")
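
The cells above repeat the same run, evaluate, and log steps for each configuration. As the number of settings grows, a loop keeps this manageable. The sketch below is one way to structure it; `experiment_name` and `run_grid` are hypothetical helpers, and the three callables stand in for `run_rag_with_settings`, `run_evaluators`, and `log_experiment_to_arize`.

```python
from itertools import product
from typing import Callable


def experiment_name(k_value: int, chunk_size: int, chunk_overlap: int) -> str:
    """Build a name in the same style as the variables used above."""
    return f"k_{k_value}_chunk_{chunk_size}_overlap_{chunk_overlap}"


def run_grid(
    k_values: list[int],
    chunk_settings: list[tuple[int, int]],
    run_fn: Callable,
    eval_fn: Callable,
    log_fn: Callable,
) -> dict:
    """Run, evaluate, and log one experiment per (k, chunk setting) combination."""
    results = {}
    for k_value, (chunk_size, chunk_overlap) in product(k_values, chunk_settings):
        name = experiment_name(k_value, chunk_size, chunk_overlap)
        rag_df = eval_fn(run_fn(k_value, chunk_size, chunk_overlap))
        log_fn(rag_df, name)
        results[name] = rag_df
    return results


# Possible usage in this notebook (left commented out because it calls the LLM):
# results = run_grid(
#     k_values=[2, 4, 10],
#     chunk_settings=[(100, 10)],
#     run_fn=lambda k, size, overlap: run_rag_with_settings(
#         questions_df, k, size, overlap
#     ),
#     eval_fn=run_evaluators,
#     log_fn=log_experiment_to_arize,
# )
```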

### Experiment with HyDE

We can also experiment with HyDE (Hypothetical Document Embeddings), a retrieval technique that uses an LLM to generate a hypothetical answer to the query and then retrieves documents that are similar to that answer rather than to the query alone. You can read more about it [here](https://docs.llamaindex.ai/docs/retrieval/hyde).
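
To make the idea concrete, here is a toy sketch of the HyDE flow. It uses a bag-of-words count vector in place of a real embedding model and a hard-coded function in place of the LLM, so every name in it (`embed`, `cosine`, `hyde_retrieve`, `fake_llm`) is an assumption for illustration, not LlamaIndex's implementation.

```python
import math
import re
from collections import Counter
from typing import Callable


def embed(text: str) -> Counter:
    """Toy embedding: word counts (a stand-in for a real embedding model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def hyde_retrieve(
    query: str, docs: list[str], generate: Callable[[str], str], top_k: int = 1
) -> list[str]:
    """Rank docs by similarity to a hypothetical answer combined with the query."""
    hypothetical = generate(query)  # in practice, this would be an LLM call
    # Combine the draft answer with the original query, loosely in the
    # spirit of include_original=True
    search_vec = embed(hypothetical) + embed(query)
    ranked = sorted(docs, key=lambda d: cosine(search_vec, embed(d)), reverse=True)
    return ranked[:top_k]


docs = [
    "The essay describes painting classes in Florence.",
    "Paul Graham worked on Viaweb, a startup that built online stores.",
]
query = "What company did he start?"


def fake_llm(question: str) -> str:
    # Hard-coded stand-in for an LLM-generated hypothetical answer
    return "He started Viaweb, a startup for building online stores."


print(hyde_retrieve(query, docs, fake_llm))
```

The query shares no words with either document, so a plain lookup in this toy setup would score both at zero. The hypothetical answer shares several words with the Viaweb sentence, which is why HyDE can help when questions and source text are phrased differently.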

In [None]:
from llama_index.core.indices.query.query_transform import HyDEQueryTransform
from llama_index.core.query_engine import TransformQueryEngine

# Set up HyDE
hyde = HyDEQueryTransform(
    include_original=True, llm=OpenAI(model="gpt-4o-mini")
)
query_engine = (
    index.as_query_engine()
)  # default k=2, chunk_size=1024, chunk_overlap=20
hyde_query_engine = TransformQueryEngine(query_engine, hyde)

# Run RAG with HyDE
hyde_response_df = run_rag(hyde_query_engine, questions_df)

# Evaluate RAG with HyDE
hyde_response_df = run_evaluators(hyde_response_df)

# Log to Arize
log_experiment_to_arize(hyde_response_df, "hyde")