<img alt="arize llama-index logos" src="https://storage.googleapis.com/arize-assets/phoenix/assets/docs/notebooks/llama-index-knowledge-base-tutorial/arize_llamaindex.png" width="400">



## LlamaIndex Chunk Size, Retrieval Method and K Eval Suite

This colab provides a suite of retrieval performance tests that help teams understand
how to set up a retrieval system. It uses two Phoenix Evals: 
Q&A (did the system answer the question overall) and retrieval (did the right chunks get returned).

The results of a parameter sweep are stored in experiment_data/results_no_zero_remove; 
check that directory for results. 

The goal is to help teams choose a chunk size, a retrieval method, and K, the number of chunks to return.

This colab downloads the script (.py) files. Those files can also be run directly, without this colab,
in a code-only environment (VS Code, for example).

### Retrieval Eval

This Eval evaluates whether a retrieved chunk contains an answer to the query. It's extremely useful for evaluating retrieval systems.

https://docs.arize.com/phoenix/concepts/llm-evals/retrieval-rag-relevance
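
As a rough illustration of how per-chunk relevance labels can be rolled up into a single retrieval score, the sketch below computes precision@K from a list of binary labels. The function name `precision_at_k` and the example labels are assumptions made for illustration; they are not part of the Phoenix API or of the scripts used in this colab.

```python
from typing import Sequence


def precision_at_k(relevance: Sequence[bool], k: int) -> float:
    """Fraction of the top-k retrieved chunks that were judged relevant.

    `relevance` holds one boolean per retrieved chunk, in ranked order.
    Missing positions (fewer than k chunks returned) count as not relevant.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    top_k = relevance[:k]
    return sum(bool(label) for label in top_k) / k


# Hypothetical labels for one query: chunks 1 and 3 were judged relevant.
labels = [True, False, True, False]
print(precision_at_k(labels, k=4))
```

Dividing by K rather than by the number of chunks actually returned is one common convention; a run that returns fewer than K chunks is penalized under it.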


### Q&A Eval
This Eval evaluates whether the system answered a question correctly based on the retrieved data. In contrast to the retrieval Eval, which checks the individual chunks of data returned, this is a system-level check of whether the Q&A was correct.

https://docs.arize.com/phoenix/concepts/llm-evals/q-and-a-on-retrieved-data
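
To make the system-level metric concrete, the following minimal sketch turns a list of per-question Q&A labels into a percentage of incorrect answers. It assumes the labels are the strings "correct" and "incorrect"; the helper `percent_incorrect` and the sample labels are hypothetical and only meant to show the arithmetic.

```python
from typing import Iterable


def percent_incorrect(labels: Iterable[str]) -> float:
    """Percentage of questions whose answer was labeled 'incorrect'."""
    normalized = [label.strip().lower() for label in labels]
    if not normalized:
        return 0.0
    incorrect = sum(label == "incorrect" for label in normalized)
    return 100.0 * incorrect / len(normalized)


# Hypothetical labels for four questions under one configuration.
qa_labels = ["correct", "incorrect", "correct", "correct"]
print(percent_incorrect(qa_labels))
```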

<center>
    <p style="text-align:center">
        <img alt="phoenix logo" src="https://storage.googleapis.com/arize-assets/phoenix/assets/images/chunking.png" />
    </p>
</center>

The challenge in setting up a retrieval system is having solid performance metrics that allow you to evaluate your different strategies:
- Chunk Size
- Retrieval Method
- K value
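
The three strategies above define a grid of configurations to test. The sketch below enumerates such a grid with `itertools.product`; the specific values and the dictionary layout are assumptions for illustration, and the actual sweep in this colab is handled by `run_experiments`, whose internals may differ.

```python
from itertools import product

# Illustrative values; adjust to the sweep you want to run.
chunk_sizes = [100, 300, 500]
k_values = [4, 6, 10]
transformations = ["original", "original_rerank"]

# One dictionary per (chunk size, K, retrieval method) combination.
configs = [
    {"chunk_size": chunk_size, "k": k, "transformation": transformation}
    for chunk_size, k, transformation in product(
        chunk_sizes, k_values, transformations
    )
]

print(len(configs))
print(configs[0])
```

Because the grid grows multiplicatively, each added value multiplies the number of runs, and every run issues LLM calls for both answering and evaluation.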

In [1]:
!pip install -q cohere matplotlib lxml

In [2]:
# # Download scripts (each file is saved under its own name so it can be imported)
# import requests

# url = "https://raw.githubusercontent.com/Arize-ai/phoenix/main/scripts/rag/llama_index_w_evals_and_qa.py"
# response = requests.get(url)
# with open("llama_index_w_evals_and_qa.py", "w") as file:
#     file.write(response.text)

# url = "https://raw.githubusercontent.com/Arize-ai/phoenix/main/scripts/rag/plotresults.py"
# response = requests.get(url)
# with open("plotresults.py", "w") as file:
#     file.write(response.text)

In [3]:
import datetime
import os
import pickle
import pandas as pd

import cohere
import openai
import phoenix.experimental.evals.templates.default_templates as templates
from phoenix.experimental.evals import OpenAIModel
from llama_index import download_loader
from llama_index_w_evals_and_qa import get_urls, plot_graphs, run_experiments

In [4]:
openai.api_key = os.getenv("OPENAI_API_KEY")
cohere.api_key = os.getenv("COHERE_API_KEY")

# if loading from scratch, change these below
web_title = "arize"  # nickname for this website, used for saving purposes
base_url = "https://docs.arize.com/arize"
# Local files
file_name = "raw_documents.pkl"
save_base = "./experiment_data/"
os.makedirs(save_base, exist_ok=True)

# Each run gets its own timestamped directory
run_name = datetime.datetime.now().strftime("%Y%m%d_%H%M")
save_dir = os.path.join(save_base, run_name)
os.makedirs(save_dir, exist_ok=True)



questions = pd.read_csv(
    "https://storage.googleapis.com/arize-assets/fixtures/Embeddings/GENERATIVE/constants.csv",
    header=None,
)[0].to_list()

In [5]:
questions[:3]

['How do I use the SDK to upload a ranking model?',
 'What drift metrics are supported in Arize?',
 'Does Arize support batch models?']

In [6]:
raw_docs_filepath = os.path.join(save_base, file_name)
if not os.path.exists(raw_docs_filepath):
    print(f"'{raw_docs_filepath}' does not exist.")
    urls = get_urls(base_url)  # requires lxml: pip install lxml
    print(f"LOADED {len(urls)} URLS")

print("GRABBING DOCUMENTS")
BeautifulSoupWebReader = download_loader("BeautifulSoupWebReader")
# two options here, either get the documents from scratch or load one from disk
if not os.path.exists(raw_docs_filepath):
    print("LOADING DOCUMENTS FROM URLS")
    # Requires lxml: pip install lxml
    loader = BeautifulSoupWebReader()
    documents = loader.load_data(urls=urls)  # may take some time
    with open(raw_docs_filepath, "wb") as file:
        pickle.dump(documents, file)
    print("Documents saved to raw_documents.pkl")
else:
    print("LOADING DOCUMENTS FROM FILE")
    print("Opening raw_documents.pkl")
    with open(raw_docs_filepath, "rb") as file:
        documents = pickle.load(file)
chunk_sizes = [
    100,
    # 300,
    # 500,
    # 1000,
    # 2000,
]  # change this, perhaps experiment from 500 to 3000 in increments of 500

k = [4, 6, 10]
# k = [10]  # num documents to retrieve

# transformations = ["original", "original_rerank","hyde", "hyde_rerank"]
transformations = ["original", "original_rerank"]

llama_index_model = "gpt-4"
# llama_index_model = "gpt-3.5-turbo"
eval_model = OpenAIModel(model_name="gpt-4", temperature=0.0)

qa_template = templates.QA_PROMPT_TEMPLATE_STR



GRABBING DOCUMENTS
LOADING DOCUMENTS FROM FILE
Opening raw_documents.pkl


In [None]:
# Limit to 3 questions for a quick test run; comment out the next line to use all questions
questions = questions[0:3]
all_data = run_experiments(
    documents=documents,
    queries=questions,
    chunk_sizes=chunk_sizes,
    query_transformations=transformations,
    k_values=k,
    web_title=web_title,
    save_dir=save_dir,
    llama_index_model=llama_index_model,
    eval_model=eval_model,
    template=qa_template,
)

INFO:evals:LAMAINDEX MODEL : gpt-4
INFO:evals:PARSING WITH CHUNK SIZE 100
INFO:evals:EXISTING INDEX FOUND, LOADING...
INFO:llama_index.indices.loading:Loading all indices.
INFO:evals:--------------------------------------------------
INFO:evals:QUERY 1: How do I use the SDK to upload a ranking model?
INFO:evals:TRANSFORMATION: original
INFO:evals:CHUNK SIZE: 100
INFO:evals:K : 4


**********
Trace: query
    |_query ->  2.264074 seconds
      |_retrieve ->  0.728182 seconds
      |_synthesize ->  1.535716 seconds
        |_templating ->  2.2e-05 seconds
        |_llm ->  1.532364 seconds
**********


INFO:evals:RESPONSE: The context does not provide specific information on how to use the SDK to upload a ranking model.
INFO:evals:LATENCY: 2.27
INFO:evals:--------------------------------------------------
INFO:evals:--------------------------------------------------
INFO:evals:QUERY 2: What drift metrics are supported in Arize?
INFO:evals:TRANSFORMATION: original
INFO:evals:CHUNK SIZE: 100
INFO:evals:K : 4


**********
Trace: query
    |_query ->  2.819141 seconds
      |_retrieve ->  0.440375 seconds
      |_synthesize ->  2.378523 seconds
        |_templating ->  2.2e-05 seconds
        |_llm ->  2.375589 seconds
**********


INFO:evals:RESPONSE: Arize supports several drift metrics including Population Stability Index, KL Divergence, KS Statistic, and JS Distance.
INFO:evals:LATENCY: 2.83
INFO:evals:--------------------------------------------------
INFO:evals:--------------------------------------------------
INFO:evals:QUERY 3: Does Arize support batch models?
INFO:evals:TRANSFORMATION: original
INFO:evals:CHUNK SIZE: 100
INFO:evals:K : 4


**********
Trace: query
    |_query ->  1.764567 seconds
      |_retrieve ->  0.448703 seconds
      |_synthesize ->  1.315665 seconds
        |_templating ->  2e-05 seconds
        |_llm ->  1.312729 seconds
**********


INFO:evals:RESPONSE: The context does not provide specific information about Arize supporting batch models.
INFO:evals:LATENCY: 1.77
INFO:evals:--------------------------------------------------
INFO:evals:--------------------------------------------------
INFO:evals:QUERY 1: How do I use the SDK to upload a ranking model?
INFO:evals:TRANSFORMATION: original_rerank
INFO:evals:CHUNK SIZE: 100
INFO:evals:K : 4
INFO:evals:RESPONSE: The context does not provide specific steps on how to use the SDK to upload a ranking model.
INFO:evals:LATENCY: 7.36
INFO:evals:--------------------------------------------------
INFO:evals:--------------------------------------------------
INFO:evals:QUERY 2: What drift metrics are supported in Arize?
INFO:evals:TRANSFORMATION: original_rerank
INFO:evals:CHUNK SIZE: 100
INFO:evals:K : 4
INFO:evals:RESPONSE: Arize supports a variety of drift metrics including Population Stability Index, KL Divergence, KS Statistic, and JS Distance.
INFO:evals:LATENCY: 7.42
INF

**********
Trace: query
    |_query ->  2.121346 seconds
      |_retrieve ->  0.549007 seconds
      |_synthesize ->  1.571917 seconds
        |_templating ->  2.4e-05 seconds
        |_llm ->  1.567762 seconds
**********


INFO:evals:RESPONSE: The context does not provide specific information on how to use the SDK to upload a ranking model.
INFO:evals:LATENCY: 2.13
INFO:evals:--------------------------------------------------
INFO:evals:--------------------------------------------------
INFO:evals:QUERY 2: What drift metrics are supported in Arize?
INFO:evals:TRANSFORMATION: original
INFO:evals:CHUNK SIZE: 100
INFO:evals:K : 6


**********
Trace: query
    |_query ->  2.291828 seconds
      |_retrieve ->  0.42732 seconds
      |_synthesize ->  1.864243 seconds
        |_templating ->  3.7e-05 seconds
        |_llm ->  1.860402 seconds
**********


INFO:evals:RESPONSE: Arize supports several drift metrics including Population Stability Index, KL Divergence, KS Statistic, and JS Distance.
INFO:evals:LATENCY: 2.30
INFO:evals:--------------------------------------------------
INFO:evals:--------------------------------------------------
INFO:evals:QUERY 3: Does Arize support batch models?
INFO:evals:TRANSFORMATION: original
INFO:evals:CHUNK SIZE: 100
INFO:evals:K : 6


**********
Trace: query
    |_query ->  2.033492 seconds
      |_retrieve ->  0.436229 seconds
      |_synthesize ->  1.596995 seconds
        |_templating ->  2.1e-05 seconds
        |_llm ->  1.593366 seconds
**********


INFO:evals:RESPONSE: The context does not provide specific information on whether Arize supports batch models.
INFO:evals:LATENCY: 2.04
INFO:evals:--------------------------------------------------
INFO:evals:--------------------------------------------------
INFO:evals:QUERY 1: How do I use the SDK to upload a ranking model?
INFO:evals:TRANSFORMATION: original_rerank
INFO:evals:CHUNK SIZE: 100
INFO:evals:K : 6
INFO:evals:RESPONSE: The context does not provide specific steps or instructions on how to use the SDK to upload a ranking model.
INFO:evals:LATENCY: 10.98
INFO:evals:--------------------------------------------------
INFO:evals:--------------------------------------------------
INFO:evals:QUERY 2: What drift metrics are supported in Arize?
INFO:evals:TRANSFORMATION: original_rerank
INFO:evals:CHUNK SIZE: 100
INFO:evals:K : 6
INFO:evals:RESPONSE: Arize supports a range of drift metrics, such as Population Stability Index, KL Divergence, KS Statistic, and JS Distance.
INFO:evals:

**********
Trace: query
    |_query ->  6.142677 seconds
      |_retrieve ->  0.450533 seconds
      |_synthesize ->  5.691957 seconds
        |_templating ->  2.3e-05 seconds
        |_llm ->  5.687176 seconds
**********


INFO:evals:RESPONSE: The SDK is not directly used to upload a ranking model. Instead, data can be uploaded for a ranking model using various Data Connectors. These include Google Cloud Storage (GCS), AWS S3, Azure Blob Storage, Google BigQuery, and other methods such as Python Pandas SDK, UI Drag & Drop, Databricks, and Snowflake. After uploading data, you can then configure and log your model schema for ranking models.
INFO:evals:LATENCY: 6.15
INFO:evals:--------------------------------------------------
INFO:evals:--------------------------------------------------
INFO:evals:QUERY 2: What drift metrics are supported in Arize?
INFO:evals:TRANSFORMATION: original
INFO:evals:CHUNK SIZE: 100
INFO:evals:K : 10


In [None]:
all_data_filepath = os.path.join(save_dir, f"{web_title}_all_data.pkl")
with open(all_data_filepath, "wb") as f:
    pickle.dump(all_data, f)

# Retrievals with zero relevant context can't really be optimized, so removing them gives a different view
plot_graphs(
    all_data=all_data,
    save_dir=os.path.join(save_dir, "results_zero_removed"),
    show=False,
    remove_zero=True,
)
plot_graphs(
    all_data=all_data,
    save_dir=os.path.join(save_dir, "results_zero_not_removed"),
    show=False,
    remove_zero=False,
)

## Example Results Q&A Evals (actual results in experiment_data)

The Q&A Eval runs at the highest level: did the system answer the question correctly based on the retrieved data?

<center>
    <p style="text-align:center">
        <img alt="phoenix data" src="https://storage.googleapis.com/arize-assets/phoenix/assets/images/percentage_incorrect_plot.png" />
    </p>
</center>



## Example Results Retrieval Eval (actual results in experiment_data)

The retrieval analysis example below iterates through the chunk sizes, the K values (4/6/10), and the retrieval methods.
The Eval checks whether each retrieved chunk is relevant and has a chance of answering the question.

<center>
    <p style="text-align:center">
        <img alt="phoenix data" src="https://storage.googleapis.com/arize-assets/phoenix/assets/images/all_mean_precisions.png" />
    </p>
</center>

## Example Results Latency (actual results in experiment_data)

Latency can vary widely depending on the retrieval approach. Below are latency maps:

<center>
    <p style="text-align:center">
        <img alt="phoenix data" src="https://storage.googleapis.com/arize-assets/phoenix/assets/images/median_latency_all.png" />
    </p>
</center>
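
For readers who want to reproduce a latency summary by hand, here is a minimal sketch that groups per-query latencies by retrieval method and K and takes the median of each group. The record layout, the helper `median_latency`, and the numbers are illustrative assumptions; they do not reflect the structure of `all_data` or the output of `plot_graphs`.

```python
from collections import defaultdict
from statistics import median
from typing import Dict, Iterable, List, Tuple

# (transformation, k, latency in seconds); illustrative values only.
Record = Tuple[str, int, float]


def median_latency(records: Iterable[Record]) -> Dict[Tuple[str, int], float]:
    """Median latency for each (transformation, k) configuration."""
    grouped: Dict[Tuple[str, int], List[float]] = defaultdict(list)
    for transformation, k, latency in records:
        grouped[(transformation, k)].append(latency)
    return {config: median(values) for config, values in grouped.items()}


records = [
    ("original", 4, 2.27),
    ("original", 4, 2.83),
    ("original", 4, 1.77),
    ("original_rerank", 4, 7.36),
    ("original_rerank", 4, 7.42),
    ("original_rerank", 4, 7.50),
]

for config, value in median_latency(records).items():
    print(config, value)
```

The median is used here rather than the mean so that a single slow LLM call does not dominate the summary for a configuration.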
