# Vanilla RAG versus Agentic RAG

This notebook will compare Vanilla RAG with Agentic RAG on the task of answering questions about Weaviate.

Both systems are connected to a Weaviate Database instance containing chunks of Weaviate's blog posts. These blog posts can help answer questions such as: "How does BM25 work?", "What was released in Weaviate 1.27?", or "What is Retrieval-Augmented Generation?", to give a few examples.

We use an LLM-as-Judge to determine which answer to a question is better, the Vanilla RAG answer or the Agentic RAG answer. Both systems use the GPT-4o Large Language Model.

We find that **Agentic RAG wins 73% of the questions**, with **33 wins** out of 45, while **Vanilla RAG achieves only 12 wins**.
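The headline rate follows directly from the two win counts. A quick check of the arithmetic, using only the tallies reported in this notebook (the variable names are illustrative):

```python
# Win tallies reported by the evaluation in this notebook
agentic_rag_wins = 33
vanilla_rag_wins = 12

# Every question is awarded to exactly one system, so the total is the sum
total_questions = agentic_rag_wins + vanilla_rag_wins
win_rate = agentic_rag_wins / total_questions * 100

print(f"{total_questions} questions, Agentic RAG win rate: {win_rate:.2f}%")
```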

### 1. Import Data into Weaviate

In [43]:
import weaviate
import weaviate.classes.config as wvcc

weaviate_client = weaviate.connect_to_local()

# Create Schema
if weaviate_client.collections.exists("WeaviateBlogChunk"):
    weaviate_client.collections.delete("WeaviateBlogChunk") 

collection = weaviate_client.collections.create(
    name="WeaviateBlogChunk",
    vectorizer_config=wvcc.Configure.Vectorizer.text2vec_transformers(),
    properties=[
        wvcc.Property(name="content", data_type=wvcc.DataType.TEXT),
        wvcc.Property(name="author", data_type=wvcc.DataType.TEXT),
    ]
)



In [44]:
import os
import re

def chunk_list(lst, chunk_size):
    """Break a list into chunks of the specified size."""
    return [lst[i:i + chunk_size] for i in range(0, len(lst), chunk_size)]

def split_into_sentences(text):
    """Split text into sentences using regular expressions."""
    sentences = re.split(r'(?<!\w\.\w.)(?<![A-Z][a-z]\.)(?<=\.|\?)\s', text)
    return [sentence.strip() for sentence in sentences if sentence.strip()]

def read_and_chunk_index_files(main_folder_path):
    """Read index.mdx files from subfolders, split them into sentences, and group every 5 sentences into a chunk."""
    blog_chunks = []
    for folder_name in os.listdir(main_folder_path):
        subfolder_path = os.path.join(main_folder_path, folder_name)
        if os.path.isdir(subfolder_path):
            index_file_path = os.path.join(subfolder_path, 'index.mdx')
            if os.path.isfile(index_file_path):
                with open(index_file_path, 'r', encoding='utf-8') as file:
                    content = file.read()
                    sentences = split_into_sentences(content)
                    sentence_chunks = chunk_list(sentences, 5)
                    sentence_chunks = [' '.join(chunk) for chunk in sentence_chunks]
                    blog_chunks.extend(sentence_chunks)
    return blog_chunks

# Example usage
main_folder_path = './blog'
blog_chunks = read_and_chunk_index_files(main_folder_path)

In [45]:
len(blog_chunks)

1874
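To make the chunking concrete, here is a small self-contained sketch that runs the same two helpers on a toy string. The sample text is made up for illustration; the helpers are copied from the cell above.

```python
import re

def chunk_list(lst, chunk_size):
    """Break a list into chunks of the specified size."""
    return [lst[i:i + chunk_size] for i in range(0, len(lst), chunk_size)]

def split_into_sentences(text):
    """Split text into sentences using regular expressions."""
    sentences = re.split(r'(?<!\w\.\w.)(?<![A-Z][a-z]\.)(?<=\.|\?)\s', text)
    return [sentence.strip() for sentence in sentences if sentence.strip()]

# Toy input with seven short sentences (illustrative only)
toy_text = "One. Two? Three. Four. Five. Six. Seven."

toy_sentences = split_into_sentences(toy_text)
toy_chunks = [' '.join(chunk) for chunk in chunk_list(toy_sentences, 5)]

# Seven sentences become one full chunk of five plus a remainder of two
for i, chunk in enumerate(toy_chunks):
    print(i, repr(chunk))
```

The last chunk of each blog post can therefore be shorter than five sentences, and chunk boundaries fall wherever the sentence count dictates, not at section breaks.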

### Chunking Visualization

In [46]:
print(blog_chunks[0])  # first chunk: up to 5 sentences

---
title: 'Accelerating Vector Search up to +40% with Intel’s latest Xeon CPU - Emerald Rapids'
slug: intel
authors: [zain, asdine, john]
date: 2024-03-26
image: ./img/hero.png
tags: ['engineering', 'research']
description: 'Boosting Weaviate using SIMD-AVX512, Loop Unrolling and Compiler Optimizations'
---

![HERO image](./img/hero.png)

**Overview of Key Sections:**
- [**Vector Distance Calculations**](#vector-distance-calculations) Different vector distance metrics popularly used in Weaviate. - [**Implementations of Distance Calculations in Weaviate**](#vector-distance-implementations) Improvements under the hood for implementation of Dot product and L2 distance metrics. - [**Intel’s 5th Gen Intel Xeon Processor, Emerald Rapids**](#enter-intel-emerald-rapids)  More on Intel's new 5th Gen Xeon processor. - [**Benchmarking Performance**](#lets-talk-numbers) Performance numbers on microbenchmarks along with simulated real-world usage scenarios. What’s the most important calculation a 

In [47]:
print(blog_chunks[1])

What simple operation does it spend the majority of its time performing? If you guessed **vector distance calculations** … BINGO! 🎉

While vector databases use many techniques and algorithms to improve performance (including locality graphs, quantization, hash based approaches), at the end of the day, efficient distance calculations between high-dimensional vectors is a requirement for a good vector database. In fact, when profiling Weaviate indexed using HNSW, we find that 40%-60% of the CPU time is spent doing vector distance calculations. So when someone tells us that they can make this quintessential process *much faster* they have our full attention! If you want to learn how to leverage algorithmic and hardware optimizations to make vector search 40% faster keep reading!

In this post we’ll do a technical deep dive into different implementations for vector distance calculations and optimizations enabled by Intel’s new 5th Gen Xeon Processor - Emerald Rapids, parallelization techni

In [60]:
blogs = weaviate_client.collections.get("WeaviateBlogChunk")

# Insert each chunk as its own object; the collection's vectorizer embeds it on import.
# "author" is never set here, so it shows up as None in search results.
for blog_chunk in blog_chunks:
    blogs.data.insert(
        properties={
            "content": blog_chunk
        }
    )



### 2. Build Vanilla RAG and Agentic RAG Systems

In [48]:
# Tools model used for OpenAI Function Calling API
from pydantic import BaseModel
from typing import Optional, Literal

class ParameterProperty(BaseModel):
    type: str
    description: str
    enum: Optional[list[str]] = None


class Parameters(BaseModel):
    type: Literal["object"]
    properties: dict[str, ParameterProperty]
    required: Optional[list[str]] = None


class Function(BaseModel):
    name: str
    description: str
    parameters: Parameters


class Tool(BaseModel):
    type: Literal["function"]
    function: Function

We then implement an LMService to capture:

- Generation from a Prompt
- Generation from a Prompt with a Structured Output Model
- Function Calling Loop Generation (our Agentic RAG System)

In [99]:
import json

class LMService():
    def __init__(
            self,
            model_name: str,
            model_provider: str,
            api_key: str
    ):
        self.model_name = model_name
        self.model_provider = model_provider
        self.api_key = api_key
        
        match self.model_provider:
            case "openai":
                from openai import OpenAI
                self.client = OpenAI(api_key=self.api_key)
            case _:
                raise ValueError(f"Unsupported model_provider: {model_provider}")

    def generate(
            self,
            prompt: str
    ) -> str:
        messages = [
            {
                "role": "system", 
                "content": "You are a helpful assistant. Use the supplied tools to assist the user."
            },
            {
                "role": "user",
                "content": prompt
            }
        ]
        
        response = self.client.chat.completions.create(
            model=self.model_name,
            messages=messages
        )
        return response.choices[0].message.content

    def generate_with_output_model(
            self,
            prompt: str,
            output_model: type[BaseModel]
    ):
        messages = [
            {
                "role": "system", 
                "content": "You are a helpful assistant. Use the supplied tools to assist the user."
            },
            {
                "role": "user",
                "content": prompt
            }
        ]
        response = self.client.beta.chat.completions.parse(
            model=self.model_name,
            messages=messages,
            response_format=output_model
        )
        parsed_response = response.choices[0].message.parsed
        # model_dump_json() replaces the .json() method, which is deprecated in Pydantic v2
        return parsed_response.model_dump_json()
        

    def generate_with_function_calling_loop(
            self,
            prompt: str,
            tools: list[Tool],
            tools_mapping: dict
    ) -> str:
        messages = [
            {
                "role": "system",
                "content": "You are a helpful assistant. Use the supplied tools to assist the user."
            },
            {
                "role": "user",
                "content": prompt
            }
        ]
        calls, call_budget = 0, 20
    
        # Initial call to get first response
        response = self.client.chat.completions.create(
            model=self.model_name,
            messages=messages,
            tools=tools
        ).choices[0]

        while calls < call_budget:
            message = response.message
            
            if not message.tool_calls:
                return message.content
            
            # Add assistant message with tool calls
            messages.append({
                "role": "assistant",
                "content": message.content if message.content else None,
                "tool_calls": [
                    {
                        "id": tool_call.id,
                        "type": "function", 
                        "function": {
                            "name": tool_call.function.name,
                            "arguments": tool_call.function.arguments
                        }
                    } for tool_call in message.tool_calls
                ]
            })
            
            # Handle parallel function calls
            for tool_call in message.tool_calls:
                function_to_call = tools_mapping[tool_call.function.name]
                tool_arguments = json.loads(tool_call.function.arguments)
                function_response = function_to_call(**tool_arguments)
                
                messages.append({
                    "role": "tool",
                    "content": function_response,
                    "tool_call_id": tool_call.id
                })
            
            # Get next response
            response = self.client.chat.completions.create(
                model=self.model_name,
                messages=messages,
                tools=tools
            ).choices[0]
            
            calls += 1
        
        return "Exceeded maximum number of function calls"
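The control flow of the function calling loop is easier to see without the API client in the way. The following is a simplified sketch, not the implementation above: `fake_model`, `search_blogs_stub`, and `run_tool_loop` are hypothetical names, the model is a hard-coded stand-in that asks for one search and then answers, and the budget here counts model calls rather than follow-up calls.

```python
import json

def fake_model(messages):
    """Hypothetical stand-in for a chat model: request one search, then answer."""
    tool_results = [m for m in messages if m["role"] == "tool"]
    if not tool_results:
        return {"content": None, "tool_calls": [{
            "id": "call_1",
            "name": "search_blogs",
            "arguments": json.dumps({"search_query": "HNSW"}),
        }]}
    return {"content": f"Answer based on: {tool_results[-1]['content']}", "tool_calls": []}

def search_blogs_stub(search_query):
    """Hypothetical tool that returns a canned string instead of querying Weaviate."""
    return f"stub result for '{search_query}'"

def run_tool_loop(prompt, model, tools_mapping, call_budget=20):
    messages = [{"role": "user", "content": prompt}]
    for _ in range(call_budget):
        reply = model(messages)
        # No tool calls means the model has produced its final answer
        if not reply["tool_calls"]:
            return reply["content"]
        # Record the assistant turn, then run every requested tool
        messages.append({"role": "assistant", "content": reply["content"],
                         "tool_calls": reply["tool_calls"]})
        for call in reply["tool_calls"]:
            result = tools_mapping[call["name"]](**json.loads(call["arguments"]))
            messages.append({"role": "tool", "content": result,
                             "tool_call_id": call["id"]})
    return "Exceeded maximum number of function calls"

answer = run_tool_loop("What is HNSW?", fake_model, {"search_blogs": search_blogs_stub})
print(answer)
```

The key difference from Vanilla RAG is that the model, not the caller, decides whether to search, what query to use, and when it has enough context to stop.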


In [100]:
lm_service = LMService(
    model_name="gpt-4o",
    model_provider="openai",
    api_key=os.environ["OPENAI_API_KEY"]  # assumes the key is exported in the environment
)

print(lm_service.generate("say hello"))

Hello! How can I assist you today?


In [128]:
# Structured Outputs constrain the Language Model to a fixed schema: here it can only output `hello how are you` or `Hello!`.
# We use the same mechanism for our LLM-as-Judge, which must name the system that gave the better answer to a technical question about Weaviate.

class StructuredHello(BaseModel):
    greeting: Literal["hello how are you", "Hello!"]

print(lm_service.generate_with_output_model(
    prompt="say hello",
    output_model=StructuredHello)
)

{"greeting":"Hello!"}




### `Vanilla RAG` Quick Test

In [102]:
# Note: this code is written to demonstrate the functionality in a Jupyter runtime.
# For production deployments you would want to decouple the client connection from the search function.
# To graduate to Lakehouse-style querying in Weaviate, you remove the collection name internal to the search function.

def search_blogs(search_query: str):
    search_collection = weaviate_client.collections.get("WeaviateBlogChunk")
    results = search_collection.query.hybrid(
        query=search_query,
        limit=5
    )
    stringified_response = ""
    for idx, o in enumerate(results.objects):
        stringified_response += f"Search Result: {idx+1}:\n"
        for prop in o.properties:
            stringified_response += f"{prop}:{o.properties[prop]}"
        stringified_response += "\n"
    
    return stringified_response

In [103]:
print(search_blogs("Hnsw"))

Search Result: 1:
content:At the time of writing this article in early 2021, the first vector index type that's supported is HNSW. By choosing this particular type, one of the limitations is already overcome: HNSW supports querying while inserting. This is a good basis for mutability, but it's not all. Existing HNSW libraries fall short of full CRUD-support. Updating is not possible at all and deleting is only mimicked by marking an object as deleted without cleaning it up.author:None
Search Result: 2:
content:Furthermore, the most popular library hnswlib only supports snapshotting, but not individual writes to disk. To get to where Weaviate is today, a custom HNSW implementation was needed. It follows the same principles [as outlined in this paper](https://arxiv.org/abs/1603.09320) but extends it with more features. Each write is added to a [write-ahead log](https://martinfowler.com/articles/patterns-of-distributed-systems/wal.html). Additionally, since inserts into HNSW are not mutab

In [104]:
def vanilla_rag(
        search_query: str,
        lm_service: LMService
    ) -> str:
    context = search_blogs(search_query)
    prompt = f"""Assess the context and answer the question.

    [[ question ]]
    {search_query}

    [[ context ]]
    {context}

    [[ answer ]]"""
    answer = lm_service.generate(prompt)
    return answer

In [105]:
vanilla_rag(
    search_query="What is HNSW?",
    lm_service=lm_service
)

'HNSW stands for Hierarchical Navigable Small World, which is a type of vector index used for approximate nearest neighbor search. It is known for its efficient querying while allowing concurrent insertions, making it a good choice for mutable datasets. However, HNSW typically lacks full CRUD (Create, Read, Update, Delete) support; for example, updating is generally not supported, and deletions are only simulated by marking objects as deleted. The underlying principle of HNSW is based on building a graph where each node is connected to its nearest neighbors, which helps in navigating the space efficiently to find approximate nearest neighbors.'
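The Vanilla RAG flow is a fixed two-step pipeline: one retrieval with the raw question, then one generation. A minimal sketch with stubbed components, so it runs without Weaviate or an API key (`stub_search`, `stub_generate`, and `vanilla_rag_sketch` are hypothetical names, and the prompt uses the same sections as the template above without its indentation):

```python
def stub_search(search_query: str) -> str:
    """Hypothetical retriever returning a fixed context string."""
    return "Search Result: 1:\ncontent:HNSW is a graph-based vector index.\n"

def stub_generate(prompt: str) -> str:
    """Hypothetical LM that reports the prompt size instead of answering."""
    return f"(model saw a {len(prompt)}-character prompt)"

def vanilla_rag_sketch(question, search, generate):
    # Exactly one retrieval, issued with the user's question as the query
    context = search(question)
    prompt = (
        "Assess the context and answer the question.\n\n"
        f"[[ question ]]\n{question}\n\n"
        f"[[ context ]]\n{context}\n\n"
        "[[ answer ]]"
    )
    # Exactly one generation over the assembled prompt
    return prompt, generate(prompt)

prompt, answer = vanilla_rag_sketch("What is HNSW?", stub_search, stub_generate)
print(prompt)
print(answer)
```

Because the query is never rewritten and retrieval never repeats, the answer quality depends entirely on what that single search returns.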

### `Agentic RAG` Quick Test

In [106]:
tools = [Tool(
    type="function",
    function=Function(
        name="search_blogs",
        description="Search a Vector Database containing blog posts information about Weaviate.",
        parameters=Parameters(
            type="object",
            properties={
                "search_query": ParameterProperty(
                    type="string",
                    description="The natural language query to search for in the database"
                )
            },
            required=["search_query"]
        )
    )
)]
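For reference, the `Tool` object above mirrors the nested JSON structure used for function calling. The sketch below writes the same definition by hand as a plain dictionary; it is not produced by serializing the Pydantic model, whose default dump would also carry an `enum` field set to `None` unless `None` values are excluded.

```python
import json

# Hand-written dictionary equivalent of the search_blogs tool definition
search_blogs_tool = {
    "type": "function",
    "function": {
        "name": "search_blogs",
        "description": "Search a Vector Database containing blog posts information about Weaviate.",
        "parameters": {
            "type": "object",
            "properties": {
                "search_query": {
                    "type": "string",
                    "description": "The natural language query to search for in the database",
                }
            },
            "required": ["search_query"],
        },
    },
}

print(json.dumps(search_blogs_tool, indent=2))
```

The keys under `properties` must match the keyword arguments of the Python function that the tool name maps to, since the loop calls it with `**tool_arguments`.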

In [109]:
tools_mapping = {
    "search_blogs": search_blogs
}

In [110]:
lm_service.generate_with_function_calling_loop(
    prompt="What is HNSW?",
    tools=tools,
    tools_mapping=tools_mapping
)

'HNSW stands for "Hierarchical Navigable Small World" graph, which is an algorithm used for approximate nearest neighbor search in high-dimensional spaces. It is widely used in vector search engines and databases due to its efficient performance in handling large datasets.\n\nHNSW operates by building a layered graph structure where each layer represents a different level of abstraction, allowing for rapid navigation through the data to find the closest points. The higher layers contain fewer nodes and provide broader connections between different regions, while the lower layers hold more detailed information and connectivity.\n\nThe key features of HNSW include:\n\n1. **Hierarchical Structure:** Nodes (data points) are arranged in a hierarchy, enabling fast traversal during search queries.\n\n2. **Navigable Small World Properties:** The graph is constructed in a way that mimics the small-world phenomenon, where most nodes can be reached from every other by a small number of steps. Thi

### 3. Evaluate Agentic RAG vs. Vanilla RAG

In [133]:
from pydantic import BaseModel

# Possible extension: also store `retrieved_contexts` for evaluation
class Winner(BaseModel):
    winner: Literal["vanilla rag", "agentic rag"]

class RAGEvalModel(BaseModel):
    query: str
    response: str
    win: bool
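The `Winner` model restricts the judge to two labels, which matters because the evaluation loop counts any label other than `"vanilla rag"` as an Agentic RAG win. As a defensive sketch using only the standard library, the judge's JSON string could be checked against the allowed labels before tallying. `WinnerLabel` and `parse_winner` are hypothetical names, not part of the notebook.

```python
import json
from typing import Literal, get_args

# Same two labels as the Winner model
WinnerLabel = Literal["vanilla rag", "agentic rag"]

def parse_winner(raw_json: str) -> str:
    """Extract the winner label and reject anything outside the allowed set."""
    winner = json.loads(raw_json)["winner"]
    if winner not in get_args(WinnerLabel):
        raise ValueError(f"Unexpected winner label: {winner!r}")
    return winner

print(parse_winner('{"winner":"agentic rag"}'))
```

With Structured Outputs enforcing the schema this check should never fail, but it keeps a malformed label from being silently scored for one side.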

In [134]:
from datasets import load_dataset

ds = load_dataset("weaviate/WeaviateBlogRAG-0-0-0")["train"] # Please leave a heart if you find this dataset useful!

In [135]:
ds[0]

{'source': 'Note, the current implementation of hybrid search in Weaviate uses BM25/BM25F and vector search. If you’re interested to learn about how dense vector indexes are built and optimized in Weaviate, check out this [article](/blog/why-is-vector-search-so-fast). ### BM25\nBM25 builds on the keyword scoring method [TF-IDF](https://en.wikipedia.org/wiki/Tf%E2%80%93idf) (Term-Frequency Inverse-Document Frequency) by taking the [Binary Independence Model](https://en.wikipedia.org/wiki/Binary_Independence_Model) from the IDF calculation and adding a normalization penalty that weighs a document’s length relative to the average length of all the documents in the database. The image below presents the scoring calculation of BM25:\n![BM25 calculation](./img/BM25-calculation.png) <div align="center"><i>Source: Wikipedia page on Okapi BM25</i></div>\n\nThe score of the document, query pair is determined by weighing the uniqueness of each keyword in the query relative to the collection of te

In [144]:
# Prompt template for the LLM-as-Judge comparison

compare_system_responses = """Assess the responses from two systems and determine which one had the better response:

[[ answer from vanilla rag system ]]
{vanilla_rag_response}

[[ answer from agentic rag system ]]
{agentic_rag_response}

[[ winning system ]]
"""

vanilla_rag_scores, agentic_rag_scores = [], []
vanilla_rag_wins = 0
agentic_rag_wins = 0

for row in ds:
    query = row["query"]
    print(f"\033[96mQuery: {query}\033[0m")
    
    vanilla_rag_response = vanilla_rag(
        search_query=query,
        lm_service=lm_service
    )
    print("\033[96mVanilla RAG Response:\n\033[0m")
    print(vanilla_rag_response)
    
    agentic_rag_response = lm_service.generate_with_function_calling_loop(
        prompt=query,
        tools=tools,
        tools_mapping=tools_mapping
    )

    print("\033[96mAgentic RAG Response:\n\033[0m")
    print(agentic_rag_response)
    
    formatted_compare_system_responses = compare_system_responses.format(
        vanilla_rag_response=vanilla_rag_response,
        agentic_rag_response=agentic_rag_response
    )
    
    winner = lm_service.generate_with_output_model(
        prompt=formatted_compare_system_responses,
        output_model=Winner
    )
    print("\033[96mJudged Winner to be:\033[0m")
    print(winner)

    winner = json.loads(winner)["winner"]
    
    if winner == "vanilla rag":
        vanilla_rag_wins += 1
    else:
        agentic_rag_wins += 1
        
    print("\033[96mThe current score is:\033[0m")
    print(f"Vanilla RAG: {vanilla_rag_wins} wins")
    print(f"Agentic RAG: {agentic_rag_wins} wins")
    print(f"After {len(vanilla_rag_scores) + 1} rounds\n")

    # Save results
    vanilla_rag_scores.append(RAGEvalModel(
        query=query,
        response=vanilla_rag_response,
        win=(winner == "vanilla rag")
    ))
    
    agentic_rag_scores.append(RAGEvalModel(
        query=query,
        response=agentic_rag_response, 
        win=(winner == "agentic rag")
    ))


{'source': 'Note, the current implementation of hybrid search in Weaviate uses BM25/BM25F and vector search. If you’re interested to learn about how dense vector indexes are built and optimized in Weaviate, check out this [article](/blog/why-is-vector-search-so-fast). ### BM25\nBM25 builds on the keyword scoring method [TF-IDF](https://en.wikipedia.org/wiki/Tf%E2%80%93idf) (Term-Frequency Inverse-Document Frequency) by taking the [Binary Independence Model](https://en.wikipedia.org/wiki/Binary_Independence_Model) from the IDF calculation and adding a normalization penalty that weighs a document’s length relative to the average length of all the documents in the database. The image below presents the scoring calculation of BM25:\n![BM25 calculation](./img/BM25-calculation.png) <div align="center"><i>Source: Wikipedia page on Okapi BM25</i></div>\n\nThe score of the document, query pair is determined by weighing the uniqueness of each keyword in the query relative to the collection of te



Judged Winner to be:
{"winner":"agentic rag"}
The current score is:
Vanilla RAG: 0 wins
Agentic RAG: 1 wins
After 1 rounds

{'source': 'Updatability: The index data is immutable, and thus no real-time updates are possible. 2. Scalability: Most vector libraries cannot be queried while importing your data, which can be a scalability concern for applications that require importing millions or even billions of objects. Thus, vector libraries are a great solution for applications with a limited static snapshot of data. However, if your application requires real-time scalable semantic search at the production level, you should consider using a vector database.', 'gold_answer': 'Vector libraries might not be suitable for applications that require real-time updates and scalable semantic search because they have immutable index data, preventing real-time updates. They also cannot be queried while importing data, posing a scalability concern for applications that need to import

### Final Win Rate

In [146]:
print(f"Agentic RAG win rate: {agentic_rag_wins / (agentic_rag_wins + vanilla_rag_wins) * 100:.2f}%")

Agentic RAG win rate: 73.33%
