## RAG from scratch

Video: [RAG from scratch](https://youtu.be/MJVTf63OF5o)

#### Installation

In [1]:
%%capture
import sys

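!{sys.executable} -m pip install --upgrade openai pymilvus
!{sys.executable} -m pip install PyPDF2 sentence-transformers python-dotenv
# Only needed for the last cell of the appendix
!{sys.executable} -m pip install FlagEmbedding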


### Reading the PDF file

In [2]:
from PyPDF2 import PdfReader

pdf_filepath = "external_data/VisRAG.pdf"

def extract_text_from_pdf(pdf_path):
    text = ""
    reader = PdfReader(pdf_path)
    for page in reader.pages:
        text += page.extract_text()
    return text

text = extract_text_from_pdf(pdf_filepath)
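# Clean text: keep line breaks that follow a period (likely sentence or
# paragraph ends) and turn every other line break into a space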
text = text.replace(".\n", ".NEWLINE")
text = text.replace("\n", " ")
text = text.replace(".NEWLINE", ".\n")
print(len(text), len(text.split()))

78808 11755


### Split text into chunks

In [3]:
def split_text(text, chunk_size, overlap):
    chunks = []
    start = 0
    while start < len(text):
        end = min(start + chunk_size, len(text))
        chunks.append(text[start: end])
        start += chunk_size - overlap
    return chunks

chunks = split_text(text, 2000, 500)
print(len(chunks), len(chunks[0].split()))

53 280
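To see what `overlap` does, here is a minimal sketch that applies the same sliding-window logic to a short string. The name `split_text_checked` and the argument check are additions for illustration, not part of the original cell. Since each window starts `chunk_size - overlap` characters after the previous one, the number of chunks is `ceil(len(text) / (chunk_size - overlap))`, which for the PDF above gives `ceil(78808 / 1500) = 53`.

```python
def split_text_checked(text, chunk_size, overlap):
    # Same sliding window as split_text above; the guard is an addition that
    # prevents an endless loop when overlap >= chunk_size.
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    chunks = []
    start = 0
    while start < len(text):
        end = min(start + chunk_size, len(text))
        chunks.append(text[start:end])
        # Move forward by the stride, so consecutive chunks share `overlap` characters
        start += chunk_size - overlap
    return chunks

demo_chunks = split_text_checked("abcdefghijklmnopqrstuvwxyz", chunk_size=10, overlap=4)
for chunk in demo_chunks:
    print(chunk)
```

Each chunk repeats the last four characters of the previous one. Note that the final chunk (`yz`) is fully contained in the chunk before it: the loop keeps going until `start` passes the end of the text, so a short, redundant tail chunk can appear.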


### Compute embeddings for each chunk

In [4]:
from sentence_transformers import SentenceTransformer

# Load the pre-trained model
model = SentenceTransformer('all-MiniLM-L6-v2')
    
# Generate embeddings, one chunk per call; each result is an array of shape (1, 384)
embeddings = [model.encode([chunk]) for chunk in chunks]




In [5]:
embeddings[0].shape

(1, 384)

### Create a collection and insert data

In [6]:
from pymilvus import MilvusClient
milvus_client = MilvusClient(uri="milvus_openai_demo.db")

# Create a collection (dropping any existing one with the same name)
COLLECTION_NAME = "visRAG_paper"
DIMENSION = 384  # must match the embedding size of all-MiniLM-L6-v2
if milvus_client.has_collection(collection_name=COLLECTION_NAME):
    milvus_client.drop_collection(collection_name=COLLECTION_NAME)

milvus_client.create_collection(
    collection_name=COLLECTION_NAME, dimension=DIMENSION
)


# Insert data
data = [
    {
        "id": i, "vector": embeddings[i][0].tolist(),
        "text": chunks[i], "subject": "VisRAG"
    }
    for i in range(len(chunks))
]
res = milvus_client.insert(
    collection_name=COLLECTION_NAME,
    data=data
)
res["insert_count"]



53

#### Testing the retrieval

In [7]:
query = "In VisRAG-retrieval, how the final embedding is generated?"

query_vector = model.encode([query])[0].tolist()

retrieved = milvus_client.search(
    collection_name=COLLECTION_NAME,
    data=[query_vector],
    limit=2,
    output_fields=["text"]
)
print(len(retrieved))  # one result list per query vector

print("Query:", query)
for j, ret in enumerate(retrieved[0]):
    print(f"\n{j}: chunk_id={ret['id']} dist={ret['distance']:.3f}")
    print(ret['entity']['text'][:50])
print("\n")

1
Query: In VisRAG-retrieval, how the final embedding is generated?

0: chunk_id=10 dist=0.625
en q. We follow the dual-encoder paradigm in text-

1: chunk_id=24 dist=0.581
Across the six evaluation datasets, VisRAG shows a
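The `dist` values above are similarity scores. Assuming the collection uses the cosine metric (believed to be the default when `create_collection` is called with only a name and a dimension, but worth checking against the pymilvus documentation), a higher value means a closer match. The sketch below shows the idea in plain Python on toy 2-dimensional vectors; `cosine_similarity`, `top_k`, and the toy data are hypothetical names introduced here, and this is a brute-force illustration rather than how Milvus searches internally.

```python
import math

def cosine_similarity(a, b):
    # Dot product divided by the product of the vector lengths
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec, vectors, k):
    # Score every stored vector against the query, highest similarity first
    scored = [(i, cosine_similarity(query_vec, v)) for i, v in enumerate(vectors)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

toy_vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
toy_query = [2.0, 1.0]

toy_results = top_k(toy_query, toy_vectors, k=2)
for idx, score in toy_results:
    print(f"id={idx} score={score:.3f}")
```

Vector 2 ranks first because it points in nearly the same direction as the query; the length of the vectors does not affect the score.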




### Set up the OpenAI API

In [8]:
import os
from dotenv import load_dotenv
from openai import OpenAI

load_dotenv()

OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")

openai_client = OpenAI(api_key=OPENAI_API_KEY)

### Generate response without context

In [9]:
PROMPT = "Answer the question about the VisRAG paper:\n"

# get natural language response (without retrieved context)
completion = openai_client.chat.completions.create(
    model="gpt-4o",
    store=True,
    messages=[
        {"role": "user", "content": PROMPT + query},
    ]
).to_dict()

print(completion["choices"][0]["message"]["content"])

In VisRAG-retrieval, a final embedding is generated by integrating both vision and text information. The process typically involves extracting visual features from images through a vision encoder and extracting textual features from the accompanying text using a text encoder. These two types of embeddings are then combined or fused in some manner to create a joint embedding space that represents both modalities. While the specific details of the combination process can vary, it often involves techniques like concatenation, element-wise addition, or through the use of attention mechanisms to effectively merge the visual and textual information into a single, multimodal representation.


### Generate response with the retrieved context

In [10]:
PROMPT = (
    "Answer the question about the VisRAG paper "
    "based on the following context:\n"
)

context = "\n".join([r["entity"]["text"] for r in retrieved[0]])

# get natural language response
completion = openai_client.chat.completions.create(
    model="gpt-4o",
    store=True,
    messages=[
        {
            "role": "user",
            "content": PROMPT + context + "\nQUESTION:" + query
        },
    ]
).to_dict()

print(completion["choices"][0]["message"]["content"])

In VisRAG-retrieval, the final embedding is generated using position-weighted mean pooling over the last-layer hidden states obtained from the Visual Language Model (VLM). The process involves encoding the query and page separately as text and image in the VLM, which produces a sequence of hidden states. The position-weighted mean pooling gives higher weights to the later tokens in the sequence. Formally, the final embedding \( v \) is calculated as:

\[ v = \sum_{i=1}^{S} w_i h_i, \]

where \( h_i \) is the \( i \)-th hidden state, \( S \) is the sequence length, and \( w_i = \frac{i}{\sum_{j=1}^{S} j} \) is the weight for the \( i \)-th position. This results in the query or page embedding used for calculating the similarity score.
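The prompt in the cell above is built by joining the retrieved chunks with newlines. As a possible variation, the sketch below wraps that step in a small function that numbers each passage so the boundaries between chunks stay visible to the model. `build_rag_prompt` is a hypothetical helper written for this example, not part of the OpenAI or Milvus libraries.

```python
def build_rag_prompt(question, passages, instruction=None):
    """Assemble one user message from retrieved passages and a question.

    Hypothetical helper for illustration only.
    """
    if instruction is None:
        instruction = (
            "Answer the question about the VisRAG paper "
            "based on the following context:"
        )
    # Number each passage so the model can tell where one chunk ends
    numbered = "\n".join(
        f"[{i}] {p.strip()}" for i, p in enumerate(passages, start=1)
    )
    return f"{instruction}\n{numbered}\nQUESTION: {question}"

demo_prompt = build_rag_prompt(
    "How is the final embedding generated?",
    ["First retrieved chunk. ", "Second retrieved chunk."],
)
print(demo_prompt)
```

In this notebook the passages would be `[r["entity"]["text"] for r in retrieved[0]]`, and the returned string would be passed as the `content` of the user message.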


## Appendix: Embeddings

In [11]:
import os
from dotenv import load_dotenv
from openai import OpenAI

load_dotenv()  # read OPENAI_API_KEY from a local .env file, if present

OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")
client = OpenAI(api_key=OPENAI_API_KEY)

response = client.embeddings.create(
    input="Your text string goes here",
    model="text-embedding-3-large"
)

print(response.data[0].embedding)

[0.003769672941416502, -0.01046823337674141, -0.003122915280982852, 0.04579043760895729, 0.016898851841688156, -0.010218770243227482, -0.03364987298846245, 0.04719482734799385, -0.013341685757040977, 0.01496781874448061, 0.018266282975673676, 0.024927886202931404, -0.029492147266864777, -0.009285591542720795, -0.010532909072935581, 4.926473775412887e-05, -0.02136147953569889, -0.004030685871839523, -0.02818015217781067, -0.024890927597880363, 0.03137698397040367, 0.020585371181368828, -0.07258468121290207, 0.03814021870493889, -0.004557331092655659, 0.016076546162366867, -0.03638473525643349, 0.011965015903115273, 0.011475327424705029, -0.01206664927303791, 0.035978201776742935, 0.02202671580016613, 0.033502042293548584, 0.004474176559597254, -0.00224055303260684, -0.015503703616559505, 0.01886684261262417, 0.03895328566431999, 0.002598579740151763, 0.020067963749170303, 0.004190065432339907, 0.006680082064121962, -0.047712232917547226, 0.00225325720384717, 0.00914700049906969, 0.00325

In [12]:
len(response.data[0].embedding)

3072

In [13]:
from sentence_transformers import SentenceTransformer
sentences = [
    "This is an example sentence",
    "Each sentence is converted"
]

model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')
embeddings = model.encode(sentences)
print(embeddings.shape)




(2, 384)


In [14]:
from FlagEmbedding import FlagAutoModel

model = FlagAutoModel.from_finetuned(
    'BAAI/bge-base-en-v1.5',
    use_fp16=True
)

sentences_1 = ["I love NLP", "I love machine learning"]
sentences_2 = ["I love BGE", "I love text retrieval"]
embeddings_1 = model.encode(sentences_1)
embeddings_2 = model.encode(sentences_2)
print(embeddings_1.shape)



(2, 768)
