<img alt="arize llama-index logos" src="https://storage.googleapis.com/arize-phoenix-assets/assets/docs/notebooks/llama-index-knowledge-base-tutorial/arize_llamaindex.png" width="400">



## LlamaIndex Chunk Size, Retrieval Method and K Eval Suite

This colab provides a suite of retrieval performance tests that helps teams understand
how to set up a retrieval system. It uses the Phoenix Evals for
Q&A (did the system answer the question overall) and retrieval (were the right chunks returned).

The results of a parameter sweep are stored in experiment_data/results_no_zero_remove;
check that directory for results.

The goal is to help teams choose a chunk size, retrieval method, and K (the number of chunks to return).

This colab downloads the script (.py) files. Those files can also be run directly, without this colab,
in a code-only environment (VS Code, for example).

### Retrieval Eval

This Eval evaluates whether a retrieved chunk contains an answer to the query. It's extremely useful for evaluating retrieval systems.

https://docs.arize.com/phoenix/concepts/llm-evals/retrieval-rag-relevance


### Q&A Eval
This Eval evaluates whether the system answered a question correctly based on the retrieved data. In contrast to retrieval Evals, which check individual chunks of returned data, this is a system-level check of Q&A correctness.

https://docs.arize.com/phoenix/concepts/llm-evals/q-and-a-on-retrieved-data

<center>
    <p style="text-align:center">
        <img alt="phoenix logo" src="https://storage.googleapis.com/arize-phoenix-assets/assets/images/chunking.png" width=800/>
    </p>
</center>

The challenge in setting up a retrieval system is having solid performance metrics that allow you to evaluate your different strategies:
- Chunk Size
- Retrieval Method
- K value

To set these variables, you first need some overall Eval metrics.

<center>
    <p style="text-align:center">
        <img alt="phoenix logo" src="https://storage.googleapis.com/arize-phoenix-assets/assets/images/eval_relevance.png" width=800/>
    </p>
</center>

The above is the relevance evaluation used to check whether the chunk retrieved is relevant to the query.
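
As a rough illustration of how per-chunk relevance labels can become a retrieval metric, the sketch below maps labels to scores and computes precision at K for a single query. The label-to-score mapping mirrors the `{'relevant': 1, 'unrelated': 0, 'UNPARSABLE': 0}` dictionary printed during the run further down; the function name `precision_at_k` and the sample labels are hypothetical and are not taken from the downloaded scripts.

```python
from typing import Sequence

# Score assigned to each eval label (same mapping the run prints in its output).
LABEL_SCORES = {"relevant": 1, "unrelated": 0, "UNPARSABLE": 0}


def precision_at_k(labels: Sequence[str], k: int) -> float:
    """Fraction of the top-k retrieved chunks that were labeled relevant.

    If fewer than k chunks were retrieved, the fraction is taken over the
    chunks that are actually present.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    top_k = labels[:k]
    if not top_k:
        return 0.0
    return sum(LABEL_SCORES.get(label, 0) for label in top_k) / len(top_k)


# Hypothetical labels for the chunks retrieved for one query, in rank order.
example_labels = ["relevant", "unrelated", "relevant", "unrelated"]
example_precision = precision_at_k(example_labels, k=4)
print(example_precision)
```

Treating `UNPARSABLE` as 0 is a conservative choice: an eval response that cannot be parsed is counted as not relevant rather than being dropped.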

<center>
    <p style="text-align:center">
        <img alt="phoenix logo" src="https://storage.googleapis.com/arize-phoenix-assets/assets/images/EvalQ_A.png" width=600/>
    </p>
</center>

The above shows the Q&A Eval, which runs on the overall question and answer produced by the entire system.
Both Evals are used as we sweep through parameters to determine the effectiveness of retrieval.

## Sweeping values
The scripts sweep through K, retrieval approach, and chunk size to determine the trade-offs on your own docs.

<center>
    <p style="text-align:center">
        <img alt="phoenix logo" src="https://storage.googleapis.com/arize-phoenix-assets/assets/images/sweep_k.png" width=800/>
    </p>
</center>

The above shows sweeping through K=4 and K=6.

<center>
    <p style="text-align:center">
        <img alt="phoenix logo" src="https://storage.googleapis.com/arize-phoenix-assets/assets/images/sweep_chunk.png" width=800/>
    </p>
</center>

The above shows sweeping through chunk size.
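
A sweep like this amounts to evaluating every combination of the three parameters. The sketch below enumerates such a grid with `itertools.product`; it is only an illustration of the idea, the values are sample settings, and the downloaded scripts may order or structure their loops differently.

```python
from itertools import product

# Illustrative sweep values; adjust these to your own experiment.
chunk_sizes = [100, 300, 500]
k_values = [4, 6, 8]
transformations = ["original", "original_rerank"]

# One configuration per (chunk size, K, retrieval approach) combination.
configs = [
    {"chunk_size": chunk_size, "k": k, "transformation": transformation}
    for chunk_size, k, transformation in product(chunk_sizes, k_values, transformations)
]

print(f"{len(configs)} configurations to evaluate")
for config in configs[:3]:
    print(config)
```

The number of configurations grows multiplicatively, which is why it helps to test with a small question sample before running the full sweep.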

In [1]:
# The script below runs a test on the question set; by default we have a 170-question set.
# That takes some time to run, so you can set a lower number just to test.
# Set this to None to run on the full dataset.
QUESTION_SAMPLES = 4

In [2]:
!pip install cohere matplotlib lxml openai 'arize-phoenix[evals]>=3.4' 'llama_index>0.10.3' 'openinference-instrumentation-llama-index>1.0.0' 'llama-index-callbacks-arize-phoenix' bs4 'llama-index-postprocessor-cohere-rerank'

# Retrieval Eval Scripts 
The following scripts can be run directly. For long test suites, we recommend running
the Python script llama_index_w_evals_and_qa.py directly in Python. All parameters are available
in that script.

In [3]:
# Download scripts
import requests

url = "https://raw.githubusercontent.com/Arize-ai/phoenix/main/scripts/rag/llama_index_w_evals_and_qa.py"
response = requests.get(url)
response.raise_for_status()  # fail early if the download did not succeed
with open("llama_index_w_evals_and_qa.py", "w") as file:
    file.write(response.text)

url = "https://raw.githubusercontent.com/Arize-ai/phoenix/main/scripts/rag/plotresults.py"
response = requests.get(url)
response.raise_for_status()  # fail early if the download did not succeed
with open("plotresults.py", "w") as file:
    file.write(response.text)

In [4]:
import datetime
import os
import pickle

import cohere
import openai
import pandas as pd

# Phoenix Observability
Click the link below to visualize LlamaIndex queries and chunking as they happen!

In [5]:
#########################################
### CLICK LINK BELOW FOR PHOENIX VIZ ####
#########################################
# Phoenix can display in real time the traces automatically
# collected from your LlamaIndex application.
import phoenix as px
import phoenix.evals.default_templates as templates
from llama_index.core import download_loader
from llama_index_w_evals_and_qa import get_urls, plot_graphs, run_experiments
from phoenix.evals import OpenAIModel

# Look for a URL in the output to open the App in a browser.
px.launch_app()
# The App is initially empty, but as you proceed with the steps below,
# traces will appear automatically as your LlamaIndex application runs.

import llama_index

llama_index.core.set_global_handler("arize_phoenix")


INFO:phoenix.config:📋 Ensuring phoenix working directory: /Users/dustinqngo/.phoenix


🌍 To view the Phoenix app in your browser, visit http://localhost:6006/
📖 For more information on how to use Phoenix, check out https://docs.arize.com/phoenix


In [6]:
from getpass import getpass

if not (openai_api_key := os.getenv("OPENAI_API_KEY")):
    openai_api_key = getpass("🔑 Enter your OpenAI API key: ")
openai.api_key = openai_api_key
os.environ["OPENAI_API_KEY"] = openai_api_key
if not (cohere_api_key := os.getenv("COHERE_API_KEY")):
    cohere_api_key = getpass("🔑 Enter your Cohere API key: ")
cohere.api_key = cohere_api_key
os.environ["COHERE_API_KEY"] = cohere_api_key

# if loading from scratch, change these below
web_title = "arize"  # nickname for this website, used for saving purposes
base_url = "https://docs.arize.com/arize"
# Local files
file_name = "raw_documents.pkl"
save_base = "./experiment_data/"
if not os.path.exists(save_base):
    os.makedirs(save_base)

run_name = datetime.datetime.now().strftime("%Y%m%d_%H%M")
save_dir = os.path.join(save_base, run_name)
if not os.path.exists(save_dir):
    # Create a new directory because it does not exist
    os.makedirs(save_dir)


questions = pd.read_csv(
    "https://storage.googleapis.com/arize-assets/fixtures/Embeddings/GENERATIVE/constants.csv",
    header=None,
)[0].to_list()

In [7]:
# This determines run time: how many questions to pull from the data to run
selected_questions = questions[:QUESTION_SAMPLES] if QUESTION_SAMPLES else questions

In [8]:
!pip install -U "urllib3>=2.0.4"

In [9]:
raw_docs_filepath = os.path.join(save_base, file_name)
if not os.path.exists(raw_docs_filepath):
    print(f"'{raw_docs_filepath}' does not exist.")
    urls = get_urls(base_url)  # you need to - pip install lxml
    print(f"LOADED {len(urls)} URLS")

print("GRABBING DOCUMENTS")
BeautifulSoupWebReader = download_loader("BeautifulSoupWebReader")
# two options here, either get the documents from scratch or load one from disk
if not os.path.exists(raw_docs_filepath):
    print("LOADING DOCUMENTS FROM URLS")
    # You need to 'pip install lxml'
    loader = BeautifulSoupWebReader()
    documents = loader.load_data(urls=urls)  # may take some time
    with open(save_base + file_name, "wb") as file:
        pickle.dump(documents, file)
    print("Documents saved to raw_documents.pkl")
else:
    print("LOADING DOCUMENTS FROM FILE")
    print("Opening raw_documents.pkl")
    with open(save_base + file_name, "rb") as file:
        documents = pickle.load(file)
##############################
### PARAMETER SWEEPS BELOW ###
##############################
### chunk_sizes ###: sizes to test, will sweep through values of chunk size
chunk_sizes = [
    100,
    # 300,
    # 500,
    # 1000,
    # 2000,
]  # change this, perhaps experiment from 500 to 3000 in increments of 500

### K ###: Sizes to test, will sweep through values of k
k = [4, 6, 8]
# k = [10]  # num documents to retrieve

### Retrieval Approach ###: transformations to test, will sweep through retrieval approaches
# transformations = ["original", "original_rerank","hyde", "hyde_rerank"]
transformations = ["original", "original_rerank"]
# Model for Q&A
llama_index_model = "gpt-4"
# llama_index_model = "gpt-3.5-turbo"
# Model for Evals
eval_model = OpenAIModel(model="gpt-4", temperature=0.0)

qa_template = templates.QA_PROMPT_TEMPLATE

GRABBING DOCUMENTS


  BeautifulSoupWebReader = download_loader("BeautifulSoupWebReader")


LOADING DOCUMENTS FROM FILE
Opening raw_documents.pkl


In [10]:
# Limit to 3 questions for a quick test run; comment out the next line to use the full question set
questions = questions[0:3]
all_data = run_experiments(
    documents=documents,
    queries=questions,
    chunk_sizes=chunk_sizes,
    query_transformations=transformations,
    k_values=k,
    web_title=web_title,
    save_dir=save_dir,
    llama_index_model=llama_index_model,
    eval_model=eval_model,
    template=qa_template,
)

**********
Trace: query
    |_CBEventType.QUERY ->  3.140998 seconds
      |_CBEventType.SYNTHESIZE ->  1.651183 seconds
**********
**********
Trace: query
    |_CBEventType.QUERY ->  1.45729 seconds
      |_CBEventType.SYNTHESIZE ->  0.898783 seconds
**********
**********
Trace: query
    |_CBEventType.QUERY ->  1.175319 seconds
      |_CBEventType.SYNTHESIZE ->  0.687385 seconds
**********


WARNI [phoenix.evals.executors] 🐌!! If running llm_classify inside a notebook, patching the event loop with nest_asyncio will allow asynchronous eval submission, and is significantly faster. To patch the event loop, run `nest_asyncio.apply()`.


llm_classify |          | 0/3 (0.0%) | ⏳ 00:00<? | ?it/s

WARNI [phoenix.evals.executors] 🐌!! If running llm_classify inside a notebook, patching the event loop with nest_asyncio will allow asynchronous eval submission, and is significantly faster. To patch the event loop, run `nest_asyncio.apply()`.


llm_classify |          | 0/12 (0.0%) | ⏳ 00:00<? | ?it/s

WARNI [phoenix.evals.executors] 🐌!! If running llm_classify inside a notebook, patching the event loop with nest_asyncio will allow asynchronous eval submission, and is significantly faster. To patch the event loop, run `nest_asyncio.apply()`.


{'relevant': 1, 'unrelated': 0, 'UNPARSABLE': 0}


llm_classify |          | 0/3 (0.0%) | ⏳ 00:00<? | ?it/s

WARNI [phoenix.evals.executors] 🐌!! If running llm_classify inside a notebook, patching the event loop with nest_asyncio will allow asynchronous eval submission, and is significantly faster. To patch the event loop, run `nest_asyncio.apply()`.


llm_classify |          | 0/12 (0.0%) | ⏳ 00:00<? | ?it/s

{'relevant': 1, 'unrelated': 0, 'UNPARSABLE': 0}
**********
Trace: query
    |_CBEventType.QUERY ->  3.030944 seconds
      |_CBEventType.SYNTHESIZE ->  1.994666 seconds
**********
**********
Trace: query
    |_CBEventType.QUERY ->  1.375151 seconds
      |_CBEventType.SYNTHESIZE ->  0.844216 seconds
**********
**********
Trace: query
    |_CBEventType.QUERY ->  1.043542 seconds
      |_CBEventType.SYNTHESIZE ->  0.53529 seconds
**********


WARNI [phoenix.evals.executors] 🐌!! If running llm_classify inside a notebook, patching the event loop with nest_asyncio will allow asynchronous eval submission, and is significantly faster. To patch the event loop, run `nest_asyncio.apply()`.


llm_classify |          | 0/3 (0.0%) | ⏳ 00:00<? | ?it/s

WARNI [phoenix.evals.executors] 🐌!! If running llm_classify inside a notebook, patching the event loop with nest_asyncio will allow asynchronous eval submission, and is significantly faster. To patch the event loop, run `nest_asyncio.apply()`.


llm_classify |          | 0/18 (0.0%) | ⏳ 00:00<? | ?it/s

WARNI [phoenix.evals.executors] 🐌!! If running llm_classify inside a notebook, patching the event loop with nest_asyncio will allow asynchronous eval submission, and is significantly faster. To patch the event loop, run `nest_asyncio.apply()`.


{'relevant': 1, 'unrelated': 0, 'UNPARSABLE': 0}


llm_classify |          | 0/3 (0.0%) | ⏳ 00:00<? | ?it/s

WARNI [phoenix.evals.executors] 🐌!! If running llm_classify inside a notebook, patching the event loop with nest_asyncio will allow asynchronous eval submission, and is significantly faster. To patch the event loop, run `nest_asyncio.apply()`.


llm_classify |          | 0/18 (0.0%) | ⏳ 00:00<? | ?it/s

{'relevant': 1, 'unrelated': 0, 'UNPARSABLE': 0}
**********
Trace: query
    |_CBEventType.QUERY ->  2.957524 seconds
      |_CBEventType.SYNTHESIZE ->  1.835386 seconds
**********
**********
Trace: query
    |_CBEventType.QUERY ->  1.24611 seconds
      |_CBEventType.SYNTHESIZE ->  0.741425 seconds
**********
**********
Trace: query
    |_CBEventType.QUERY ->  1.062246 seconds
      |_CBEventType.SYNTHESIZE ->  0.577777 seconds
**********


WARNI [phoenix.evals.executors] 🐌!! If running llm_classify inside a notebook, patching the event loop with nest_asyncio will allow asynchronous eval submission, and is significantly faster. To patch the event loop, run `nest_asyncio.apply()`.


llm_classify |          | 0/3 (0.0%) | ⏳ 00:00<? | ?it/s

WARNI [phoenix.evals.executors] 🐌!! If running llm_classify inside a notebook, patching the event loop with nest_asyncio will allow asynchronous eval submission, and is significantly faster. To patch the event loop, run `nest_asyncio.apply()`.


llm_classify |          | 0/24 (0.0%) | ⏳ 00:00<? | ?it/s

WARNI [phoenix.evals.executors] 🐌!! If running llm_classify inside a notebook, patching the event loop with nest_asyncio will allow asynchronous eval submission, and is significantly faster. To patch the event loop, run `nest_asyncio.apply()`.


{'relevant': 1, 'unrelated': 0, 'UNPARSABLE': 0}


llm_classify |          | 0/3 (0.0%) | ⏳ 00:00<? | ?it/s

WARNI [phoenix.evals.executors] 🐌!! If running llm_classify inside a notebook, patching the event loop with nest_asyncio will allow asynchronous eval submission, and is significantly faster. To patch the event loop, run `nest_asyncio.apply()`.


llm_classify |          | 0/24 (0.0%) | ⏳ 00:00<? | ?it/s

{'relevant': 1, 'unrelated': 0, 'UNPARSABLE': 0}


In [11]:
all_data_filepath = os.path.join(save_dir, f"{web_title}_all_data.pkl")
with open(all_data_filepath, "wb") as f:
    pickle.dump(all_data, f)

# Retrievals with 0 relevant context can't really be optimized; removing them gives a different view
plot_graphs(
    all_data=all_data,
    save_dir=os.path.join(save_dir, "results_zero_removed"),
    show=False,
    remove_zero=True,
)
plot_graphs(
    all_data=all_data,
    save_dir=os.path.join(save_dir, "results_zero_not_removed"),
    show=False,
    remove_zero=False,
)

## Example Results Q&A Evals (actual results in experiment_data)

The Q&A Eval runs at the highest level: did the system answer the question correctly based on the data?

<center>
    <p style="text-align:center">
        <img alt="phoenix data" src="https://storage.googleapis.com/arize-phoenix-assets/assets/images/percentage_incorrect_plot.png" />
    </p>
</center>
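
For readers who want to reproduce a number like the one plotted above from raw eval output, here is a minimal sketch of turning Q&A labels into a percentage incorrect. It assumes the Q&A Eval emits the labels `"correct"` and `"incorrect"`; the function name `percentage_incorrect` and the sample labels are hypothetical, and the plotting script may compute its values differently.

```python
from typing import Sequence


def percentage_incorrect(labels: Sequence[str]) -> float:
    """Percentage of Q&A eval labels that are 'incorrect' (0.0 for no labels)."""
    if not labels:
        return 0.0
    incorrect = sum(1 for label in labels if label == "incorrect")
    return 100.0 * incorrect / len(labels)


# Hypothetical Q&A eval labels for four questions under one configuration.
qa_labels = ["correct", "incorrect", "correct", "correct"]
qa_percentage_incorrect = percentage_incorrect(qa_labels)
print(qa_percentage_incorrect)
```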



## Example Results Retrieval Eval  (actual results in experiment_data)

The retrieval analysis example below iterates through the chunk sizes, K (4/6/10), and retrieval methods.
The eval checks whether each retrieved chunk is relevant and has a chance to answer the question.

<center>
    <p style="text-align:center">
        <img alt="phoenix data" src="https://storage.googleapis.com/arize-phoenix-assets/assets/images/all_mean_precisions.png" />
    </p>
</center>
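
The notebook saves two sets of plots, one with zero-relevance retrievals removed and one without. The sketch below shows one way such a filter could affect a mean precision value. It assumes that "removing zeros" means dropping queries for which no retrieved chunk was judged relevant; the function name `mean_precision` and the sample values are hypothetical, and `plot_graphs` may implement its `remove_zero` option differently.

```python
from statistics import mean
from typing import Sequence


def mean_precision(per_query_precisions: Sequence[float], remove_zero: bool = False) -> float:
    """Average per-query precision, optionally ignoring queries that scored 0."""
    values = [p for p in per_query_precisions if p > 0] if remove_zero else list(per_query_precisions)
    return mean(values) if values else 0.0


# Hypothetical per-query precision values for one (chunk size, K, method) setting.
precisions = [0.5, 0.0, 1.0, 0.0]
mean_with_zeros = mean_precision(precisions)
mean_without_zeros = mean_precision(precisions, remove_zero=True)
print(mean_with_zeros, mean_without_zeros)
```

Comparing the two values separates "retrieval found nothing useful" from "retrieval found something, but ranked it poorly".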

## Example Results Latency  (actual results in experiment_data)

Latency can vary widely based on the retrieval approach; below are latency maps.

<center>
    <p style="text-align:center">
        <img alt="phoenix data" src="https://storage.googleapis.com/arize-phoenix-assets/assets/images/median_latency_all.png" />
    </p>
</center>
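
As a small sketch of how a median latency per configuration could be computed from per-query timings, the example below groups timings by retrieval approach and K. The timing values, the tuple layout, and the name `median_latency_by_config` are all hypothetical; the downloaded scripts may record and aggregate latency differently.

```python
from collections import defaultdict
from statistics import median
from typing import Dict, Iterable, List, Tuple

# Each record: (retrieval approach, K, query latency in seconds). Values are made up.
timings = [
    ("original", 4, 1.0),
    ("original", 4, 3.0),
    ("original", 4, 2.0),
    ("original_rerank", 4, 2.5),
    ("original_rerank", 4, 3.5),
]


def median_latency_by_config(
    records: Iterable[Tuple[str, int, float]],
) -> Dict[Tuple[str, int], float]:
    """Group latencies by (approach, K) and return the median of each group."""
    grouped: Dict[Tuple[str, int], List[float]] = defaultdict(list)
    for approach, k, seconds in records:
        grouped[(approach, k)].append(seconds)
    return {config: median(values) for config, values in grouped.items()}


latency_summary = median_latency_by_config(timings)
for config, seconds in sorted(latency_summary.items()):
    print(config, seconds)
```

The median is less sensitive than the mean to a single slow query, which makes it a reasonable summary for latency comparisons.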
