## RAG from scratch

### Installation

In [1]:
%%capture
import sys

!{sys.executable} -m pip install --upgrade openai pymilvus
!{sys.executable} -m pip install PyPDF2 sentence-transformers python-dotenv FlagEmbedding


### Reading the PDF file

In [2]:
from PyPDF2 import PdfReader

pdf_filepath = "external_data/VisRAG.pdf"

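def extract_text_from_pdf(pdf_path):
    """Concatenate the extracted text of every page in the PDF."""
    text = ""
    reader = PdfReader(pdf_path)
    for page in reader.pages:
        # Guard against pages that yield no text (e.g. image-only pages)
        text += page.extract_text() or ""
    return text

text = extract_text_from_pdf(pdf_filepath)
# Clean the text: keep line breaks that follow a period (likely sentence or
# paragraph ends) and turn every other line break into a space.
text = text.replace(".\n", ".NEWLINE")
text = text.replace("\n", " ")
text = text.replace(".NEWLINE", ".\n")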
print(len(text), len(text.split()))

78808 11755


### Split text into chunks

In [3]:
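def split_text(text, chunk_size, overlap):
    """Split text into chunks of chunk_size characters; neighbours share `overlap` characters."""
    if not 0 <= overlap < chunk_size:
        # A non-positive stride would make the loop below run forever
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    chunks = []
    start = 0
    while start < len(text):
        end = min(start + chunk_size, len(text))
        chunks.append(text[start:end])
        start += chunk_size - overlap  # advance by the stride
    return chunks

# 2000-character chunks with 500 characters of overlap
chunks = split_text(text, 2000, 500)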
print(len(chunks), len(chunks[0].split()))

53 280
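
The chunk count follows from the stride: each chunk starts `chunk_size - overlap` characters after the previous one, so a text of `n` characters yields `ceil(n / (chunk_size - overlap))` chunks. A minimal sketch, re-declaring the splitter so it runs on its own and using a hypothetical toy string:

```python
import math

def split_text(text, chunk_size, overlap):
    # Same sliding-window logic as the cell above
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += chunk_size - overlap
    return chunks

# Toy input (illustrative only): stride is 4 - 2 = 2 characters
toy_chunks = split_text("abcdefghij", chunk_size=4, overlap=2)
print(toy_chunks)  # ['abcd', 'cdef', 'efgh', 'ghij', 'ij']

# Stride arithmetic for the paper text: 78808 characters, stride 2000 - 500
expected_chunks = math.ceil(78808 / (2000 - 500))
print(expected_chunks)  # 53
```

Note that the final chunk can be much shorter than `chunk_size`, as the toy output shows.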


### Compute embeddings for each chunk

In [4]:
from sentence_transformers import SentenceTransformer

# Load the pre-trained sentence-embedding model
model = SentenceTransformer('all-MiniLM-L6-v2')

# Encode each chunk separately; every result is an array of shape (1, 384)
embeddings = [model.encode([chunk]) for chunk in chunks]




In [5]:
embeddings[0].shape

(1, 384)

### Create a collection and insert data

In [6]:
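from pymilvus import MilvusClient

# A local file path as the URI runs Milvus Lite and stores the data in that file
milvus_client = MilvusClient(uri="milvus_openai_demo.db")

# Create a collection (dropping any previous one with the same name)
COLLECTION_NAME = "visRAG_paper"
DIMENSION = 384  # must match the embedding size of all-MiniLM-L6-v2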
if milvus_client.has_collection(collection_name=COLLECTION_NAME):
    milvus_client.drop_collection(collection_name=COLLECTION_NAME)

milvus_client.create_collection(
    collection_name=COLLECTION_NAME, dimension=DIMENSION
)


# Insert one row per chunk: id, embedding vector, chunk text, and a subject tag
data = [
    {
        "id": i, "vector": embeddings[i][0].tolist(),
        "text": chunks[i], "subject": "VisRAG"
    }
    for i in range(len(chunks))
]
res = milvus_client.insert(
    collection_name=COLLECTION_NAME,
    data=data
)
res["insert_count"]



53

#### Testing the retrieval

In [7]:
query = "In VisRAG-retrieval, how the final embedding is generated?"

query_vector = model.encode([query])[0].tolist()

retrieved = milvus_client.search(
    collection_name=COLLECTION_NAME,
    data=[query_vector],
    limit=2,
    output_fields=["text"]
)
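print(len(retrieved))  # one result list per query vector

print("Query:", query)
for j, ret in enumerate(retrieved[0]):
    # With the collection's default COSINE metric, a higher value means a closer match
    print(f"\n{j}: chunk_id={ret['id']} dist={ret['distance']:.3f}")
    print(ret['entity']['text'][:50])  # preview the first 50 characters
print("\n")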

1
Query: In VisRAG-retrieval, how the final embedding is generated?

0: chunk_id=10 dist=0.625
en q. We follow the dual-encoder paradigm in text-

1: chunk_id=24 dist=0.581
Across the six evaluation datasets, VisRAG shows a
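
Conceptually, the search ranks the stored vectors by their similarity to the query vector and returns the best `limit` matches. The sketch below is a brute-force illustration of that idea in plain Python, assuming cosine similarity; the helper names and the three-dimensional toy vectors are hypothetical, and this is not how Milvus executes a search internally.

```python
import math

def cosine_similarity(a, b):
    # Dot product divided by the product of the vector lengths
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec, rows, k):
    # Score every stored row against the query, then keep the k best
    scored = [(row["id"], cosine_similarity(query_vec, row["vector"])) for row in rows]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

# Toy rows standing in for the inserted chunks (illustrative values only)
toy_rows = [
    {"id": 0, "vector": [1.0, 0.0, 0.0]},
    {"id": 1, "vector": [0.0, 1.0, 0.0]},
    {"id": 2, "vector": [1.0, 1.0, 0.0]},
]

hits = top_k([1.0, 0.1, 0.0], toy_rows, k=2)
for row_id, score in hits:
    print(f"id={row_id} similarity={score:.3f}")  # ids 0 and 2 rank highest
```

A full scan like this costs one comparison per stored vector, which is fine for 53 chunks but is the reason vector databases build indexes for larger collections.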




### Set up the OpenAI API

In [8]:
import os
from dotenv import load_dotenv
from openai import OpenAI

load_dotenv()  # read OPENAI_API_KEY from a local .env file, if present

OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")

openai_client = OpenAI(api_key=OPENAI_API_KEY)

### Generate response without context

In [9]:
PROMPT = "Answer the question about the VisRAG paper:\n"

# get natural language response (without retrieved context)
completion = openai_client.chat.completions.create(
    model="gpt-4o",
    store=True,
    messages=[
        {"role": "user", "content": PROMPT + query},
    ]
).to_dict()

print(completion["choices"][0]["message"]["content"])

In VisRAG-retrieval, the final embedding is generated by a process that integrates both visual and textual embeddings. Specifically, for each image query, the approach involves extracting visual features using a Vision Transformer (ViT) model. These visual features are then combined with text embeddings generated from a pre-trained language model. The concatenation of these multi-modal embeddings forms the final representation. This final joint embedding is used to effectively capture and encode the contextual relationships between visual and textual elements, thereby improving the retrieval performance in a cross-modal setting.


### Generate response with the retrieved context

In [10]:
PROMPT = (
    "Answer the question about the VisRAG paper "
    "based on the following context:\n"
)

context = "\n".join([r["entity"]["text"] for r in retrieved[0]])

# get natural language response
completion = openai_client.chat.completions.create(
    model="gpt-4o",
    store=True,
    messages=[
        {
            "role": "user",
            "content": PROMPT + context + "\nQUESTION:" + query
        },
    ]
).to_dict()

print(completion["choices"][0]["message"]["content"])

In VisRAG-retrieval, the final embedding is generated by using the position-weighted mean pooling over the last-layer hidden states produced by the vision-language model (VLM). Each hidden state is weighted, giving higher importance to later tokens in the sequence. The formula used for this is:

\[ v = \sum_{i=1}^{S} w_i h_i \]

Here:
- \( h_i \) is the i-th hidden state.
- \( S \) is the sequence length.
- \( w_i = \frac{i}{\sum_{j=1}^{S} j} \) is the weight for the i-th hidden state.
- \( v \) is the resulting query or page embedding.

This approach utilizes the concept of causal attention in generative VLMs to leverage the position information of tokens for generating the final embedding.
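
The weights in the formula quoted in the response above can be checked numerically. The following is a minimal sketch in plain Python, assuming three hypothetical two-dimensional hidden states; it only illustrates the formula as quoted and is not the VisRAG authors' implementation.

```python
def position_weighted_mean_pool(hidden_states):
    # w_i = i / (1 + 2 + ... + S), so later positions receive larger weights
    S = len(hidden_states)
    total = S * (S + 1) // 2
    weights = [i / total for i in range(1, S + 1)]
    dim = len(hidden_states[0])
    # v = sum_i w_i * h_i, computed per dimension
    pooled = [
        sum(w * h[d] for w, h in zip(weights, hidden_states))
        for d in range(dim)
    ]
    return weights, pooled

# Toy hidden states (illustrative values only), S = 3
toy_hidden = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
weights, v = position_weighted_mean_pool(toy_hidden)
print(weights)  # 1/6, 2/6 and 3/6
print(v)        # approximately [0.667, 0.833]
```

The weights sum to 1, so the result is a weighted average in which the last token contributes the most.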


## Appendix: Embeddings

In [11]:
import os
from dotenv import load_dotenv
from openai import OpenAI

load_dotenv()  # load OPENAI_API_KEY from a local .env file, if present
OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")
client = OpenAI(api_key=OPENAI_API_KEY)

response = client.embeddings.create(
    input="Your text string goes here",
    model="text-embedding-3-large"
)

print(response.data[0].embedding)

[0.0037755323573946953, -0.010456031188368797, -0.0030873925425112247, 0.04585136100649834, 0.016894064843654633, ...]

In [12]:
len(response.data[0].embedding)

3072

In [15]:
from sentence_transformers import SentenceTransformer
sentences = [
    "This is an example sentence",
    "Each sentence is converted"
]

model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')
embeddings = model.encode(sentences)
print(embeddings.shape)




(2, 384)


In [16]:
from FlagEmbedding import FlagAutoModel

model = FlagAutoModel.from_finetuned(
    'BAAI/bge-base-en-v1.5',
    use_fp16=True
)

sentences_1 = ["I love NLP", "I love machine learning"]
sentences_2 = ["I love BGE", "I love text retrieval"]
embeddings_1 = model.encode(sentences_1)
embeddings_2 = model.encode(sentences_2)
print(embeddings_1.shape)



(2, 768)
