<img alt="arize llama-index logos" src="https://storage.googleapis.com/arize-assets/phoenix/assets/docs/notebooks/llama-index-knowledge-base-tutorial/arize_llamaindex.png" width="400">



## LlamaIndex Chunk Size, Retrieval Method and K Eval Suite

This colab provides a suite of retrieval performance tests that help teams understand
how to set up a retrieval system. It uses the Phoenix Evals for
Q&A (did the system answer the question overall?) and retrieval (were the right chunks returned?).

The results of a parameter sweep are stored in experiment_data/results_no_zero_remove;
check that directory for results.

The goal is to help teams choose a chunk size, a retrieval method, and K (the number of chunks to return).

This colab downloads the script (.py) files. Those files can also be run directly, without this colab,
in a code-only environment (VS Code, for example).

### Retrieval Eval

This Eval evaluates whether a retrieved chunk contains an answer to the query. It's extremely useful for evaluating retrieval systems.

https://docs.arize.com/phoenix/concepts/llm-evals/retrieval-rag-relevance
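One common way to summarize per-chunk relevance labels is precision at K: the fraction of the top K retrieved chunks that were judged relevant. The sketch below is illustrative only. The function name `precision_at_k` and the sample labels are hypothetical and are not taken from the Phoenix scripts, which may aggregate their results differently.

```python
from typing import Sequence


def precision_at_k(relevance: Sequence[bool], k: int) -> float:
    """Fraction of the top-k retrieved chunks that were judged relevant.

    `relevance` holds one label per retrieved chunk, in retrieval order.
    If fewer than k chunks were retrieved, only those chunks are counted.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    top_k = list(relevance[:k])
    if not top_k:
        return 0.0
    return sum(top_k) / len(top_k)


# Hypothetical relevance labels for one query (not real eval output).
labels = [True, False, True, True]
print(precision_at_k(labels, k=2))  # 0.5
print(precision_at_k(labels, k=4))  # 0.75
```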


### Q&A Eval
This Eval evaluates whether a question was correctly answered by the system based on the retrieved data. In contrast to retrieval Evals, which check the individual chunks of data returned, this is a system-level check of whether the Q&A was correct.

https://docs.arize.com/phoenix/concepts/llm-evals/q-and-a-on-retrieved-data
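A system-level Q&A result can be reduced to a single number per configuration, such as the percentage of questions answered incorrectly. The sketch below assumes each question receives a string label of either "correct" or "incorrect". Both the label values and the helper name `percent_incorrect` are assumptions made for illustration.

```python
from collections import Counter
from typing import Sequence


def percent_incorrect(labels: Sequence[str]) -> float:
    """Percentage of Q&A eval labels equal to "incorrect"."""
    if not labels:
        return 0.0
    counts = Counter(labels)
    return 100.0 * counts["incorrect"] / len(labels)


# Hypothetical labels for four questions (not real eval output).
qa_labels = ["correct", "incorrect", "correct", "correct"]
print(percent_incorrect(qa_labels))  # 25.0
```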

<center>
    <p style="text-align:center">
        <img alt="phoenix logo" src="https://storage.googleapis.com/arize-assets/phoenix/assets/images/chunking.png" />
    </p>
</center>

The challenge in setting up a retrieval system is having solid performance metrics that allow you to evaluate your different strategies:
- Chunk Size
- Retrieval Method
- K value
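The three strategies above form a grid: every chunk size is paired with every retrieval method and every K. A minimal sketch of enumerating that grid follows. The values mirror the ones used later in this notebook, but the way the downloaded script iterates over them internally is an assumption.

```python
from itertools import product

# Values mirror the options used later in this notebook.
chunk_sizes = [100, 300, 500]
transformations = ["original", "original_rerank"]
k_values = [4, 6, 10]

# Every (chunk size, retrieval method, K) combination is one experiment.
grid = list(product(chunk_sizes, transformations, k_values))
print(len(grid))  # 18

for chunk_size, method, k in grid[:3]:
    print(chunk_size, method, k)
```

The number of experiments grows multiplicatively with each added option, so it helps to start with a small grid and a handful of questions.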

In [None]:
# Download the scripts, saving each under its own name so the imports below resolve
import requests

base = "https://raw.githubusercontent.com/Arize-ai/phoenix/main/scripts/rag/"
for script in ["llama_index_w_eavls_and_qa.py", "plotresults.py", "config.py"]:
    response = requests.get(base + script)
    response.raise_for_status()  # fail loudly instead of saving an error page
    with open(script, "w") as file:
        file.write(response.text)

### OpenAI Key and Cohere Key

Open config.py and add your OpenAI and Cohere API keys.
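The next cell reads the keys as `config.open_ai_key` and `config.cohere_key`. A minimal sketch of those two entries, pulling the values from environment variables instead of hard-coding them, might look like the following. The environment variable name `COHERE_API_KEY` is an assumption, and the downloaded config.py may contain other settings that should be left in place.

```python
# config.py (sketch): attribute names match those read by the notebook.
import os

# Falls back to an empty string if the variable is not set.
open_ai_key = os.environ.get("OPENAI_API_KEY", "")
# COHERE_API_KEY is an assumed variable name, used here for illustration.
cohere_key = os.environ.get("COHERE_API_KEY", "")
```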

In [None]:
import datetime
import os
import pickle

import cohere
import config
import openai
import phoenix.experimental.evals.templates.default_templates as templates
from llama_index import download_loader
from llama_index_w_eavls_and_qa import get_urls, plot_graphs, read_strings_from_csv, run_experiments

name = "BeautifulSoupWebReader"
BeautifulSoupWebReader = download_loader(name)
now = datetime.datetime.now()
run_name = now.strftime("%Y%m%d_%H%M")
# Directory parameter for saving data!
openai.api_key = config.open_ai_key  # replace with the string containing the API key if needed
cohere.api_key = config.cohere_key
# Used by Arize Evals
os.environ["OPENAI_API_KEY"] = config.open_ai_key

# if loading from scratch, change these below
web_title = "arize"  # nickname for this website, used for saving purposes
base_url = "https://docs.arize.com/arize"
# Local files
file_name = "raw_documents.pkl"
save_base = "./experiment_data/"
save_dir = save_base + run_name + "/"

# Read strings from CSV
questions = read_strings_from_csv(
    "https://storage.googleapis.com/arize-assets/fixtures/Embeddings/GENERATIVE/constants.csv"
)

# Two options here: scrape the documents from scratch, or load a cached copy from disk
print("GRABBING DOCUMENTS")
if not os.path.exists(save_base + file_name):
    print(f"'{save_base}{file_name}' does not exist.")
    urls = get_urls(base_url)  # requires: pip install lxml
    print(f"LOADED {len(urls)} URLS")
    print("LOADING DOCUMENTS FROM URLS")
    loader = BeautifulSoupWebReader()
    documents = loader.load_data(urls=urls)  # may take some time
    os.makedirs(save_base, exist_ok=True)  # make sure the cache directory exists
    with open(save_base + file_name, "wb") as file:
        pickle.dump(documents, file)
    print(f"Documents saved to {file_name}")
else:
    print("LOADING DOCUMENTS FROM FILE")
    print(f"Opening {file_name}")
    with open(save_base + file_name, "rb") as file:
        documents = pickle.load(file)
chunk_sizes = [
    100,
    # 300,
    # 500,
    # 1000,
    # 2000,
]  # change this, perhaps experiment from 500 to 3000 in increments of 500

k = [4, 6, 10]  # number of chunks to retrieve
# k = [10]

# transformations = ["original", "original_rerank","hyde", "hyde_rerank"]
transformations = ["original", "original_rerank"]

lama_index_model = "gpt-4"
# lama_index_model = "gpt-3.5-turbo"
eval_model = "gpt-4"

qa_templ = templates.QA_PROMPT_TEMPLATE_STR
# Uncomment when testing, 3 questions are easy to run through quickly
# questions = questions[0:3]
all_data = run_experiments(
    documents,
    questions,
    chunk_sizes,
    transformations,
    k,
    web_title,
    save_dir,
    lama_index_model,
    eval_model,
    qa_templ,
)


os.makedirs(save_dir, exist_ok=True)  # harmless if run_experiments already created it
with open(f"{save_dir}{web_title}_all_data.pkl", "wb") as f:
    pickle.dump(all_data, f)

# Retrievals with 0 relevant context really can't be optimized; removing them gives a different view
# save_dir already ends with "/", so no leading slash is needed here
plot_graphs(all_data, k, save_dir + "results_zero_removed/", show=False)
plot_graphs(all_data, k, save_dir + "results_no_zero_remove/", show=False, remove_zero=False)

## Example Results Q&A Evals (actual results in experiment_data)

The Q&A Eval runs at the highest level: did the system answer the question correctly based on the retrieved data?

<center>
    <p style="text-align:center">
        <img alt="phoenix data" src="https://storage.googleapis.com/arize-assets/phoenix/assets/images/percentage_incorrect_plot.png" />
    </p>
</center>



## Example Results Retrieval Eval (actual results in experiment_data)

The retrieval analysis example is below. It iterates through the chunk sizes, K values (4/6/10), and retrieval methods.
The eval checks whether each retrieved chunk is relevant and has a chance of answering the question.

<center>
    <p style="text-align:center">
        <img alt="phoenix data" src="https://storage.googleapis.com/arize-assets/phoenix/assets/images/all_mean_precisions.png" />
    </p>
</center>

## Example Results Latency (actual results in experiment_data)

Latency can vary widely depending on the retrieval approach. Latency maps are shown below:

<center>
    <p style="text-align:center">
        <img alt="phoenix data" src="https://storage.googleapis.com/arize-assets/phoenix/assets/images/median_latency_all.png" />
    </p>
</center>
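A latency map like the one above can be built by grouping per-query timings by configuration and taking the median of each group. The sketch below shows that grouping step only. The record layout and the timing values are hypothetical, chosen for illustration, and do not reflect the structure of `all_data` or any measured results.

```python
from collections import defaultdict
from statistics import median

# Hypothetical records: (chunk_size, method, k, latency_seconds).
records = [
    (100, "original", 4, 1.2),
    (100, "original", 4, 1.8),
    (100, "original", 4, 1.5),
    (100, "original_rerank", 4, 2.9),
    (100, "original_rerank", 4, 3.1),
]

# Group latencies by configuration.
by_config = defaultdict(list)
for chunk_size, method, k, latency in records:
    by_config[(chunk_size, method, k)].append(latency)

# The median is less sensitive to occasional slow calls than the mean.
median_latency = {cfg: median(vals) for cfg, vals in by_config.items()}
print(median_latency[(100, "original", 4)])  # 1.5
```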
