# So How Does Retrieval Augmented Generation Work?

<img src="./media/rag_comic.png" width=400>

RAG, or Retrieval Augmented Generation, can be understood most simply as providing relevant, up-to-date information alongside a question or query to get an accurate answer or action back from a large language model.

Why is this important? LLMs are already incredibly capable and knowledgeable systems, but they do not have access to up-to-date, domain-specific, or proprietary information. RAG-based systems build on top of an LLM's intrinsic knowledge by providing the right context at the right time to enrich and improve responses. This often leads to more accurate responses when building systems for niche or esoteric data.

In this notebook we'll cover an intuitive approach towards understanding how RAG systems work, for the curious yet daunted reader.

*Note: Some IFrames and graphs do not render on GitHub.*

---

## Setup Functions & Imports

These imports and functions set up the examples below.

In [None]:
# ========= Vector Database Setup =========

import chromadb
from langchain_text_splitters import MarkdownTextSplitter

# Instantiate the Chroma Client
chroma_client = chromadb.PersistentClient(path="./vector_database")

# Create a Collection
collection = chroma_client.get_or_create_collection(name="BMV080")

# Instantiate Splitter
splitter = MarkdownTextSplitter.from_tiktoken_encoder(
    encoding_name="cl100k_base",   # tiktoken encoding used by GPT-4 and GPT-3.5 Turbo
    chunk_size=1200,
    chunk_overlap=400,
    strip_whitespace=True
)

# Load Markdown File
with open("./documents/bmv080-ds.md", 'r', encoding='utf-8') as file:
    text = file.read()

# Split text
chunks = splitter.split_text(text)

# Embed Chunks to the Collection
collection.add(
    documents=chunks,
    ids=[str(i) for i in range(len(chunks))]
)

In [1]:
# ========= Notebook Helper Functions ==========

import openai
import chromadb
import tiktoken
from IPython.display import display, Markdown, HTML, IFrame
from sentence_transformers import SentenceTransformer
from langchain_text_splitters import MarkdownTextSplitter

embedding_model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')
chroma_client = chromadb.PersistentClient(path="./vector_database")
collection = chroma_client.get_or_create_collection(name="BMV080")
openai_client = openai.OpenAI()

# Simple OpenAI API Caller
def query_openai(prompt):

    response = openai_client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {
                "role": "system",
                "content": "You are a helpful Bosch assistant. Answer questions fully but succinctly and accurately"
            },
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": prompt},
                ]
            }
        ],
        max_tokens=4000,
        temperature=0.1
    )
            
    return response.choices[0].message.content.strip()

def retrieve_docs(query, collection="BMV080", n=5):
    # Load Chroma Collection
    collection = chroma_client.get_or_create_collection(name=collection)

    # Perform semantic search
    results = collection.query(
        query_texts=[query],
        n_results=n
    )

    # Zip documents and distances together into dicts
    docs = results["documents"][0]
    scores = results["distances"][0]

    # Combine into list of dicts
    return [{"document": doc, "score": score} for doc, score in zip(docs, scores)]

def rag_response(query):

    context = retrieve_docs(query)

    prompt = f"""Use the provided up-to-date context to answer the question

Retrieved Context:
{context}

Question: {query}
"""

    response = query_openai(prompt)

    return response

def pprint(text):
    display(Markdown(text))

---
## The Importance of Retrieval Augmented Generation

<img src="./media/high_level.png" width=600>

To demonstrate the importance of RAG in AI systems, let's see how a large language model handles a domain-specific question both with and without RAG. Our scenario will be specific questions about the Bosch particulate matter sensor BMV080, a highly specialized air quality sensor with plenty of detailed specs. This is a perfect example not only because it's niche, but because it was released in January 2025. Our LLM will be [gpt-4o](https://platform.openai.com/docs/models/gpt-4o), the primary model behind ChatGPT, which has a knowledge cutoff of October 1st 2023. The base model (without access to web search) should therefore have no clue that this product even exists, let alone know a technical spec.

We'll be asking: *What is the maximum power consumption of the BMV080 in continuous measurement mode?*

In [2]:
question = "What is the maximum power consumption of the BMV080 in continuous measurement mode?"

### *Without* Retrieval Augmented Generation (RAG)

In [3]:
response = query_openai(question)

pprint(f"""
**Query:** {question}

**Response:** {response}
""")


**Query:** What is the maximum power consumption of the BMV080 in continuous measurement mode?

**Response:** The maximum power consumption of the BMV080 in continuous measurement mode is 1.3 mA.


### *With* Retrieval Augmented Generation (RAG)

In [4]:
response = rag_response(question)

pprint(f"""
**Query:** {question}

**Response:** {response}
""")


**Query:** What is the maximum power consumption of the BMV080 in continuous measurement mode?

**Response:** The maximum power consumption of the BMV080 in continuous measurement mode is 181.9 mW.


---

Now let's compare to the answer from the [documentation](https://www.bosch-sensortec.com/media/boschsensortec/downloads/datasheets/bst-bmv080-ds000.pdf):

<img src="./media/continous_measurement.png" width=600>

RAG wins! 

---
## Knowledgebase Preparation

All of the relevant domain-specific knowledge that you want to provide as context is referred to as the **knowledgebase**. It is essentially a specially prepared collection of all of your unstructured data. Unstructured data is what's found in the files we use and create day to day, e.g. PowerPoint decks, Word documents, emails, Excel files, images, recordings, etc. This all needs to be formatted for efficient LLM ingestion and retrieval. But first, some context:

### Text Based Processing

A large language model's primary method of processing is raw text. There are some conversions back and forth between text and numbers for the actual processing (called tokenization!), but the idea remains consistent.
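To make the text-to-numbers round trip concrete, here is a minimal sketch of a toy word-level tokenizer. It is an illustration only: real LLM tokenizers use learned subword vocabularies, and the helper names `build_vocab`, `encode`, and `decode` are made up for this example.

```python
# Toy word-level tokenizer. Real LLM tokenizers split text into learned
# subword units, but the text -> integer IDs -> text round trip is the same idea.

def build_vocab(corpus):
    """Assign an integer ID to every distinct whitespace-separated word."""
    vocab = {}
    for word in corpus.split():
        if word not in vocab:
            vocab[word] = len(vocab)
    return vocab

def encode(text, vocab):
    """Convert text into a list of integer token IDs."""
    return [vocab[word] for word in text.split()]

def decode(token_ids, vocab):
    """Convert token IDs back into text."""
    id_to_word = {token_id: word for word, token_id in vocab.items()}
    return " ".join(id_to_word[token_id] for token_id in token_ids)

corpus = "the sensor measures particulate matter and the sensor reports it"
vocab = build_vocab(corpus)

token_ids = encode("the sensor reports particulate matter", vocab)
print(token_ids)                 # [0, 1, 6, 3, 4]
print(decode(token_ids, vocab))  # the sensor reports particulate matter
```

The model only ever sees the list of integers; the text is recovered at the very end by running the mapping in reverse.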

Tokenization in action, via [OpenAI's tokenization visualizer](https://platform.openai.com/tokenizer):

<img src="./media/tokens_1.png" width=600>
<img src="./media/tokens_2.png" width=600>

While this text rule holds true, some more modern models are starting to accept **multimodal** inputs such as video, audio, and images directly, in addition to text:

<img src="./media/gemini_2.5_modelcard.png" width=600>

But for the most part, we need to convert our unstructured data **into text based formats** as a first step. Let's take a look at what that looks like for something like the prior datasheet example.

In [5]:
IFrame("documents/bmv080-ds.pdf", width=1200, height=600)

We can already take the [website](https://www.bosch-sensortec.com/products/environmental-sensors/particulate-matter-sensor/bmv080/#documents) and convert it into a [PDF](https://www.bosch-sensortec.com/media/boschsensortec/downloads/datasheets/bst-bmv080-ds000.pdf) via the provided link. But we still need this in an ingestible text format! We can scrape the text pretty easily, but there are also pictures, tables, and other elements that would be nice to capture.

For this toy example I wrote a quick vision language model based OCR script to do this conversion of a PDF into formatted raw text known as Markdown. After running that we get the following output:

In [6]:
IFrame("documents/bmv080-ds.md", width=1200, height=600)

This data processing step is usually (somewhat) customized to the specific type of data you're working with, and it is repeated for each file type in your knowledgebase. General text scraping is usually the initial approach, but more specialized approaches like my vision language model (VLM) transformation can enrich or convert data into more effective formats.

But once you have your data in an LLM-ready format, there's still one slight issue. LLMs have what's called a [context window](https://www.ibm.com/think/topics/context-window), or an upper limit to the amount of text you can actually pass in as context. While these limits are increasing as new technology is developed (1 million+ tokens in some models!), when working with enterprise-scale knowledgebases it is not feasible or cost effective to provide all context at all times for every question. So what we need to do is break our text up into bite-sized pieces, otherwise known as **chunking**.

### Chunking

<img src="./media/chunking.gif" width=350>

There are many different approaches to chunking, but the most popular is a token-based limit. Often this involves initially splitting the text on common separators like periods, line breaks, paragraph starts, etc. and then combining these splits into chunks that respect a specific token limit.
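The split-then-pack idea can be sketched in a few lines. This is a simplified illustration, not the splitter used in this notebook: `chunk_text` is a hypothetical helper, whitespace-separated words stand in for real tokens, and the size limit is soft (a single sentence longer than the budget is kept whole).

```python
import re

def chunk_text(text, chunk_size=10, chunk_overlap=4):
    """Split on sentence ends and line breaks, then greedily pack the pieces
    into chunks of at most `chunk_size` words, carrying over the last
    `chunk_overlap` words so neighbouring chunks share some context."""
    pieces = [p.strip() for p in re.split(r"(?<=[.!?])\s+|\n+", text) if p.strip()]

    chunks, current = [], []
    for piece in pieces:
        words = piece.split()
        # Close the current chunk if this piece would push it over budget
        if current and len(current) + len(words) > chunk_size:
            chunks.append(" ".join(current))
            current = current[-chunk_overlap:] if chunk_overlap else []
        current.extend(words)

    if current:
        chunks.append(" ".join(current))
    return chunks

text = " ".join(f"Sentence number {i} here." for i in range(1, 7))
chunks = chunk_text(text, chunk_size=10, chunk_overlap=4)

for i, chunk in enumerate(chunks):
    print(i, "|", chunk)
```

With six four-word sentences, a budget of 10 words, and an overlap of 4, each chunk holds two sentences and repeats the last sentence of the previous chunk.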

I used a prebuilt Markdown-based splitter that uses the different Markdown headers to do the initial splitting, then respects a token-based chunk size. Let's see this in action:

In [7]:
with open("./documents/bmv080-ds.md", 'r', encoding='utf-8') as file:
    text = file.read()

# Check how many tokens
encoder = tiktoken.get_encoding("cl100k_base")
tokens = encoder.encode(text)
pprint(f"**Token Count**: {len(tokens)}")

**Token Count**: 22081

Given our document size of ~22k tokens, let's split it into 1200-token chunks.

In [8]:
# Load splitter
splitter = MarkdownTextSplitter.from_tiktoken_encoder(
    encoding_name="cl100k_base",
    chunk_size=1200,
    chunk_overlap=400,
    strip_whitespace=True
)

# Chunk the text
chunks = splitter.split_text(text)

# Check how many chunks made
pprint(f"**Chunk Count**: {len(chunks)}")

**Chunk Count**: 28
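As a back-of-the-envelope check, the chunk count can be estimated from the token count, chunk size, and overlap alone. This is only an approximation: it assumes every chunk is completely full, whereas the real splitter also respects Markdown boundaries, so the exact number can differ.

```python
import math

total_tokens = 22081   # token count reported above
chunk_size = 1200
chunk_overlap = 400

# Each chunk after the first repeats 400 tokens, so it only advances by 800 new ones
stride = chunk_size - chunk_overlap

estimated_chunks = math.ceil((total_tokens - chunk_overlap) / stride)
print(estimated_chunks)  # 28
```

The estimate lands on the same count here, which suggests most chunks are close to the 1200-token budget.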

We end up with 28 total chunks of the datasheet! Let's look at what one looks like:

In [10]:
print(chunks[20])

**Summary**  
Close the sensor unit.

**Precondition**  
Must be called last to destroy the handle created by bmv080_open.

**Postcondition**  
N/A

**Arguments**

| Argument | Description                     |
|----------|---------------------------------|
| * handle | Unique handle for a sensor unit |

**Return Value**  
E_BMV080_OK if successful. Otherwise, the return value is a BMV080 status code.


---


# 5.2.4 Sensor Identification

## 5.2.4.1 bmv080_get_sensor_id

**Function**


bmv080_status_code_t bmv080_get_sensor_id
(
    const bmv080_handle_t handle,
    char id[13]
);


**Summary**  
Get the sensor ID of a sensor unit.

**Precondition**  
A valid handle generated by `bmv080_open` is required. The application must have allocated the char array id with a size of 13 elements.

**Postcondition**  
N/A

**Arguments**

| Argument | Description                           |
|----------|---------------------------------------|
| handle   | Unique handle for a sensor unit       |
| 

Great! But now that our knowledge is split across chunks, that offers a different challenge. Since we can't pass all context in at all times, we need a way to find the most relevant chunk(s) based on the questions or inputs into the system. 

## Retrieval

So how do you determine what's relevant to answer a question? We need a system that can do database style retrieval of these chunks but for relevancy!

That's where **embeddings** come in, a more advanced concept but crucial to the core of *Retrieval* in retrieval augmented generation.

### Embeddings

The goal of relevancy or similarity-based retrieval is to surface the information needed to answer the question as accurately as possible. To do this we need to find which chunks are relevant (or similar) to the input query. For example, when asking *What is the maximum power consumption of the BMV080 in continuous measurement mode?*, we want to surface chunk ID **6**, which contains the answer to this question.

<img src="./media/chunk_6.png" width=600>

The first step is to **encode** the text into a numerical representation known as a **text embedding**. This is done with the help of a separate language model known as a sentence transformer. These are smaller models that have been trained on fill-in-the-blank style language predictions.

<img src="./media/MLM.png" width=600>

Through scaled deep learning and relying on the core transformer architecture and attention mechanisms, these models gain the ability to create a representation of sentences that capture the underlying semantics of the text conditional on the entire sentence.

<img src="./media/sentence_embedding.png" width=800>

Let's see what this looks like real quick, using one of the most popular sentence embedding models, [all-MiniLM-L6-v2](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2):

In [11]:
query = "What is the maximum power consumption of the BMV080 in continuous measurement mode?"

representation = embedding_model.encode(query)

pprint(f"**Length of Representation**: {len(representation)} Dimensions")
pprint(f"**First 10 Dimensions**: {representation[:10]} ...")

**Length of Representation**: 384 Dimensions

**First 10 Dimensions**: [ 0.00887027  0.06133682 -0.06260985  0.03094546 -0.06726976  0.00943975
 -0.00361226  0.04340531 -0.0748492  -0.00466658] ...

Now why and how is having a semantically rich numerical representation useful? Let's dig a little further into the "dimensionality" of this. Intuitively, these dimensions are similar to the dimensions we understand, i.e. 1D, 2D, 3D, except this time we're capturing many dimensions of the concepts, ideas, and meanings within the text sequence through the machine learning model. The interesting part is that these can be projected down into lower dimensions that we can "see."

Let's take some different categories of words, embed them, and see what that looks like when reduced to 3 dimensions:

In [12]:
import ast

import numpy as np
import pandas as pd
import plotly.graph_objs as go
from plotly.offline import init_notebook_mode
from sklearn.manifold import TSNE

df = pd.read_csv('./documents/100_embeddings.csv')

# Parse the string representations of lists (literal_eval avoids running arbitrary code), then stack into a numpy array
matrix = np.array(df['embedding'].apply(ast.literal_eval).tolist())

# Create a t-SNE model and transform the data
tsne = TSNE(
    n_components=3,
    perplexity=10,
    max_iter=5000,
    learning_rate='auto',
    init='pca',
    random_state=3
)
vis_dims = tsne.fit_transform(matrix)

category_colors = {
    'Animal': 'red',
    'Food': 'green',
    'Occupation': 'blue',
    'Weather': 'purple'
}

# Create traces for each category
traces = []
for category, color in category_colors.items():
    category_mask = df['Category'] == category
    category_data = vis_dims[category_mask]
    words = df['Word'][category_mask]
    
    # Create hover text with only coordinates
    hovertext = [f"X: {x:.2f}, Y: {y:.2f}, Z: {z:.2f}" 
                 for x, y, z in category_data]
    
    trace = go.Scatter3d(
        x=category_data[:, 0],
        y=category_data[:, 1],
        z=category_data[:, 2],
        mode='markers+text',
        name=category,
        marker=dict(
            size=5,
            color=color,
            opacity=0.7
        ),
        text=words,
        textposition="top center",
        hovertext=hovertext,
        hoverinfo='text',
        textfont=dict(size=12)
    )
    traces.append(trace)

# Create the layout
layout = go.Layout(
    title="Word Embeddings Visualized by Category using t-SNE (3D)",
    scene=dict(
        xaxis_title='X',
        yaxis_title='Y',
        zaxis_title='Z',
        aspectmode='cube',
        camera=dict(
            up=dict(x=0, y=0, z=1),
            center=dict(x=0, y=0, z=0),
            eye=dict(x=1.5, y=1.5, z=1.5)
        ),
    ),
    width=1000,  
    height=800,
    margin=dict(l=0, r=0, b=0, t=40),
    legend=dict(
        x=0.9,
        y=0.9,
        traceorder="normal",
        font=dict(size=12),
        bgcolor="rgba(255, 255, 255, 0.5)"
    ),
    hovermode='closest'
)

# Create the figure and display
fig = go.Figure(data=traces, layout=layout)
init_notebook_mode(connected=True)
fig.show()



From this you can see that items that are similar *conceptually* are located in a **similar space**. This holds beyond 3D, in N-dimensional vector spaces!

So with that understanding, and a numerical representation, we get to the useful part of being able to do direct comparisons of these embedded sentences.

### Semantic Similarity

Once you have these embeddings, finding the similarity between them becomes relatively straightforward! We can take some inspiration from our middle school math classes and consider the distance between two points on a Cartesian plane.

<img src="./media/distance_formula.png" width=500>

Just as we can use the distance formula for 2D points, we can do the same (with some nuance) for high-dimensional representations: each embedding can be thought of as a point, and similar texts sit close together. Modern approaches tend to compare the angle between the two vectors instead (cosine similarity), but intuitively this follows the same line of thinking.
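Both measures fit in a few lines of plain Python. The sketch below uses small hand-made vectors for illustration; real embeddings work the same way, just with hundreds of dimensions.

```python
import math

def euclidean_distance(a, b):
    """Straight-line distance between two points (the middle school formula)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1 = same direction, 0 = unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

p, q = [1.0, 2.0], [4.0, 6.0]
print(euclidean_distance(p, q))                    # 5.0

print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))   # 0.0 (perpendicular)
print(cosine_similarity([1.0, 2.0], [2.0, 4.0]))   # very close to 1 (same direction)
```

Note that cosine similarity ignores vector length: `[1, 2]` and `[2, 4]` point the same way, so they count as maximally similar even though they are different points.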

Let's see what this looks like:

In [13]:
# The sentences to encode
sentences = [
    "Dog",
    "Cat",
    "Toyota Prius",
]

embeddings = embedding_model.encode(sentences)

similarities = embedding_model.similarity(embeddings, embeddings)

pprint(f"""
**Sentences:** {', '.join([f'"{s}"' for s in sentences])}

**Similarity Matrix:**
```
{similarities}
```  

**Key Relationships:**
- Dog ↔ Cat: {similarities[0][1]:.4f} (moderate similarity - both animals)
- Dog ↔ Toyota Prius: {similarities[0][2]:.4f} (low similarity)  
- Cat ↔ Toyota Prius: {similarities[1][2]:.4f} (low similarity)
""")


**Sentences:** "Dog", "Cat", "Toyota Prius"

**Similarity Matrix:**
```
tensor([[1.0000, 0.6606, 0.2199],
        [0.6606, 1.0000, 0.2156],
        [0.2199, 0.2156, 1.0000]])
```  

**Key Relationships:**
- Dog ↔ Cat: 0.6606 (moderate similarity - both animals)
- Dog ↔ Toyota Prius: 0.2199 (low similarity)  
- Cat ↔ Toyota Prius: 0.2156 (low similarity)


This can then be extrapolated to compare your documents with queries to retrieve the relevant chunks: your chunked documents are embedded and then stored in a vector retrieval system. Thankfully, there are existing systems called **vector databases** that handle this large-scale embedding storage and similarity search efficiently.
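Conceptually, the core of such a system is "store embeddings, then rank them against a query embedding." Here is a minimal brute-force sketch of that idea. `ToyVectorStore` is a hypothetical class for illustration and the three-dimensional vectors are made up; production vector databases typically use approximate nearest-neighbour indexes rather than comparing against every stored item.

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

class ToyVectorStore:
    """In-memory store that ranks every item against the query (brute force)."""

    def __init__(self):
        self.items = []  # list of (doc_id, text, embedding)

    def add(self, doc_id, text, embedding):
        self.items.append((doc_id, text, embedding))

    def query(self, query_embedding, n_results=2):
        scored = [
            (cosine_similarity(query_embedding, embedding), doc_id, text)
            for doc_id, text, embedding in self.items
        ]
        # Highest similarity first
        scored.sort(key=lambda item: item[0], reverse=True)
        return scored[:n_results]

store = ToyVectorStore()
store.add("0", "power consumption table", [0.9, 0.1, 0.0])
store.add("1", "pin description", [0.1, 0.9, 0.0])
store.add("2", "supply current limits", [0.8, 0.2, 0.1])

results = store.query([1.0, 0.0, 0.0], n_results=2)
for score, doc_id, text in results:
    print(f"{doc_id}: {text} (similarity {score:.3f})")
```

The two "power-like" entries come back first, while the unrelated pin description is left out.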

<img src="./media/sim.png" width=600>

Before we introduce our vector database, let's check what the similarity between our question and chunk ID 6 is:

In [14]:
# The sentences to encode
sentences = [
    query,
    chunks[6],
]

embeddings = embedding_model.encode(sentences)

similarities = embedding_model.similarity(embeddings, embeddings)

In [15]:
pprint(f"""*What is the maximum power consumption of the BMV080 in continuous measurement mode?* and **Chunk ID 6**:

**Score**: {similarities[0][1]:.4f}""")

*What is the maximum power consumption of the BMV080 in continuous measurement mode?* and **Chunk ID 6**:

**Score**: 0.5966

### Retrieval

With that, we can put all of our chunks into a vector database to cover the retrieval part of our retrieval augmented generation step! Often we will retrieve a top set of similar items, as it's not guaranteed that the relevant information is in only one chunk. 

In [16]:
results = collection.query(
    query_texts=["What is the maximum power consumption of the BMV080 in continuous measurement mode?"],
    n_results=5
)

for i, doc in enumerate(results['documents'][0]):
    pprint(f"""
**Retrieved Document**: {i+1}

**Database ID**: {results['ids'][0][i]}

**Distance**: {results['distances'][0][i]:.4f}

**Chunk Snippet**:""")
    print(f"""{doc[:500]}...""")
    pprint("---")


**Retrieved Document**: 1

**Database ID**: 21

**Distance**: 0.7646

**Chunk Snippet**:

---

© Bosch Sensortec GmbH 2025 | All rights reserved, also regarding any disposal, exploitation, reproduction, editing, distribution, as well as in the event of applications for industrial property rights

Document number: BST-BMV080-DS000-11


---


## 5.2.5.1.2 Duty Cycling Measurement

Figure 39 is an activity diagram that shows how to perform a duty cycling measurement – repeating numerous measurements separated by a pause. The main difference from continuous measurement is the duty cyclin...


---


**Retrieved Document**: 2

**Database ID**: 6

**Distance**: 0.8068

**Chunk Snippet**:

---

<sup>11</sup> Supply pins are described in Table 11.

<sup>12</sup> Given self heating during operation resulting in sensor internal temperature increase of ~15 K in continuous measurement mode with the Power Optimized Configuration (Chapter 4), BMV080 is capable to operate at ambient temperatures <15 °C depending on thermal integration design. For more details, refer to Section 3.3 on thermal integration best practices in BMV080 integration guideline (BST-BMV080-AN000).

<sup>13</sup> No c...


---


**Retrieved Document**: 3

**Database ID**: 12

**Distance**: 0.8801

**Chunk Snippet**:

## 4.4.1.2 Proposal for Filtering Signal Errors

Strong disturbing signals on the SCK pin may influence the measurement results of the BMV080. In an environment where strong disturbance signals are present, the SCK pin could be protected with a suitable low pass filter, which filters out the disturbance but allows normal communication.

---

16P = power supply, DI = digital in, DO = digital out, GND = ground.


---


### 4.4.1.3 Power domains

The BMV080 has four power domains, listed in Table 1...


---


**Retrieved Document**: 4

**Database ID**: 23

**Distance**: 0.9441

**Chunk Snippet**:

# 5.2.5.5 bmv080_stop_measurement

## Function
c
bmv080_status_code_t bmv080_stop_measurement
(
    const bmv080_handle_t handle
);


## Summary
Stop particle measurement.

## Precondition
A valid handle generated by `bmv080_open` is required, and the sensor unit entered measurement mode via `bmv080_start_continuous_measurement` or `bmv080_start_duty_cycling_measurement`. Must be called at the end of a data acquisition cycle to ensure that the sensor unit is ready for the next measurement cycle....


---


**Retrieved Document**: 5

**Database ID**: 22

**Distance**: 0.9487

**Chunk Snippet**:

---


# 5.2.5.4 bmv080_serve_interrupt

## Function
c
bmv080_status_code_t bmv080_serve_interrupt
(
    const bmv080_handle_t handle,
    bmv080_callback_data_ready_t data_ready,
    void* callback_parameters
);


## Summary
Serve an interrupt using a callback function.

## Precondition
A valid handle generated by `bmv080_open` is required with the sensor unit currently in measurement mode via `bmv080_start_continuous_measurement` or `bmv080_start_duty_cycling_measurement`.

The application can ...


---

We see that Chunk ID 6 is the second-closest result! This was our hope: the chunk that contains the answer sits at a small distance from the query (0.8068), so it is among the context that gets passed to the LLM.
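The database reports a *distance* (lower is better) while the earlier comparison reported a *similarity* (higher is better), but the two can be related. The sketch below assumes the collection uses a squared Euclidean (L2) distance and that the embeddings are unit length; both are assumptions about the setup rather than something shown in this notebook. Under those assumptions, distance = 2 - 2 × cosine similarity.

```python
def cosine_from_squared_l2(distance):
    """For unit-length vectors: ||a - b||^2 = 2 - 2 * cos(a, b)."""
    return 1.0 - distance / 2.0

chunk_6_distance = 0.8068  # distance reported for Database ID 6 above
print(round(cosine_from_squared_l2(chunk_6_distance), 4))  # 0.5966
```

This lines up with the 0.5966 similarity score computed directly for chunk ID 6 earlier, which is consistent with the assumptions above.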

## RAG

<img src="./media/basic_rag.png" width=600>

So now we have:
- Our unstructured data converted into an LLM ingestible form
- Chunked into manageable and processable pieces
- Embedded and ready for semantic similarity based search

Now we just need to put it all together! Rather than passing the user query directly to the LLM, we first pass it to our vector database, retrieve our relevant context, then pass both the question and the context to the LLM to generate a response:

```python
def rag_response(query):

    context = retrieve_docs(query)

    prompt = f"""Use the provided up-to-date context to answer the question

Retrieved Context:
{context}

Question: {query}
"""

    response = query_openai(prompt)

    return response
```

In [17]:
response = rag_response("What is the maximum power consumption of the BMV080 in continuous measurement mode?")

pprint(f"**AI RAG Response**: {response}")

**AI RAG Response**: The maximum power consumption of the BMV080 in continuous measurement mode is 181.9 mW.
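One possible refinement: `rag_response` drops the raw list of dictionaries from `retrieve_docs` straight into the prompt. A small formatting step can make the context easier for the model to read. The sketch below is one way to do it, not part of the notebook's pipeline; `format_context` and `build_prompt` are hypothetical helpers, and the example chunks are placeholders.

```python
def format_context(retrieved, max_chars=2000):
    """Render retrieved chunks as numbered, clearly delimited blocks."""
    blocks = []
    for rank, item in enumerate(retrieved, start=1):
        snippet = item["document"][:max_chars]  # guard against oversized chunks
        blocks.append(f"[Chunk {rank} | distance {item['score']:.4f}]\n{snippet}")
    return "\n\n".join(blocks)

def build_prompt(query, retrieved):
    """Assemble the same prompt layout as rag_response, with formatted context."""
    return (
        "Use the provided up-to-date context to answer the question\n\n"
        f"Retrieved Context:\n{format_context(retrieved)}\n\n"
        f"Question: {query}\n"
    )

# Placeholder data in the same shape retrieve_docs returns
example = [
    {"document": "Example chunk text A.", "score": 0.7646},
    {"document": "Example chunk text B.", "score": 0.8068},
]

prompt = build_prompt("Example question?", example)
print(prompt)
```

Swapping this in would only change how the context is rendered; the retrieval and the call to the LLM stay the same.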