<img src="./2025-09-04-14-44-03.png" width=50%>

# Elements of a RAG System

1. Document (PDF, Markdown, ...)
2. Breaking of the document into pieces (chunking)
3. Vectorized representation of the pieces of the document so they can be compared with each other and external queries (embedding)
4. The 'library' where we store these embedded pieces of the document is called a 'vector store' / 'vector database'
5. LLM (the model we are using to answer the question)
6. The model used to embed the queries (usually not the same model as the LLM)
7. Production of the final answer

1. Document (PDF, Markdown, ...)

In [1]:
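# In recent LangChain releases, PyPDFLoader is provided by the langchain_community package
from langchain_community.document_loaders import PyPDFLoader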

loader = PyPDFLoader("./assets-resources/attention-paper.pdf")
docs = loader.load()

docs

[Document(metadata={'producer': 'pdfTeX-1.40.25', 'creator': 'LaTeX with hyperref', 'creationdate': '2024-04-10T21:11:43+00:00', 'author': '', 'keywords': '', 'moddate': '2024-04-10T21:11:43+00:00', 'ptex.fullbanner': 'This is pdfTeX, Version 3.141592653-2.6-1.40.25 (TeX Live 2023) kpathsea version 6.3.5', 'subject': '', 'title': '', 'trapped': '/False', 'source': './assets-resources/attention-paper.pdf', 'total_pages': 15, 'page': 0, 'page_label': '1'}, page_content='Provided proper attribution is provided, Google hereby grants permission to\nreproduce the tables and figures in this paper solely for use in journalistic or\nscholarly works.\nAttention Is All You Need\nAshish Vaswani∗\nGoogle Brain\navaswani@google.comNoam Shazeer∗\nGoogle Brain\nnoam@google.comNiki Parmar∗\nGoogle Research\nnikip@google.comJakob Uszkoreit∗\nGoogle Research\nusz@google.com\nLlion Jones∗\nGoogle Research\nllion@google.comAidan N. Gomez∗ †\nUniversity of Toronto\naidan@cs.toronto.eduŁukasz Kaiser∗\nGoogle

2. Breaking of the document into pieces (chunking)


In [2]:
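# In recent LangChain releases, the splitters are provided by the langchain_text_splitters package
from langchain_text_splitters import RecursiveCharacterTextSplitter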

text_splitter = RecursiveCharacterTextSplitter(chunk_size=1500, chunk_overlap=200)
# The right chunk size depends on several factors, including:
# - the size of the document
# - the complexity of the document
# - the task you are trying to solve
# - the LLM you are using
# - the embedding model you are using

chunks = text_splitter.split_documents(docs)

print(chunks)

[Document(metadata={'producer': 'pdfTeX-1.40.25', 'creator': 'LaTeX with hyperref', 'creationdate': '2024-04-10T21:11:43+00:00', 'author': '', 'keywords': '', 'moddate': '2024-04-10T21:11:43+00:00', 'ptex.fullbanner': 'This is pdfTeX, Version 3.141592653-2.6-1.40.25 (TeX Live 2023) kpathsea version 6.3.5', 'subject': '', 'title': '', 'trapped': '/False', 'source': './assets-resources/attention-paper.pdf', 'total_pages': 15, 'page': 0, 'page_label': '1'}, page_content='Provided proper attribution is provided, Google hereby grants permission to\nreproduce the tables and figures in this paper solely for use in journalistic or\nscholarly works.\nAttention Is All You Need\nAshish Vaswani∗\nGoogle Brain\navaswani@google.comNoam Shazeer∗\nGoogle Brain\nnoam@google.comNiki Parmar∗\nGoogle Research\nnikip@google.comJakob Uszkoreit∗\nGoogle Research\nusz@google.com\nLlion Jones∗\nGoogle Research\nllion@google.comAidan N. Gomez∗ †\nUniversity of Toronto\naidan@cs.toronto.eduŁukasz Kaiser∗\nGoogle
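
To make `chunk_size` and `chunk_overlap` concrete, here is a minimal sketch of a fixed-size sliding-window splitter. It is a simplification under stated assumptions: `split_with_overlap` is a hypothetical helper, not part of LangChain, and it cuts purely by character count, whereas `RecursiveCharacterTextSplitter` also tries to break on separators such as paragraphs and newlines.

```python
def split_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Toy splitter: fixed-size character windows that share `chunk_overlap` characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap  # how far the window moves each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the current window already reached the end of the text
    return chunks


toy_text = "abcdefghijklmnopqrstuvwxyz"  # made-up input, just to show the windows
toy_chunks = split_with_overlap(toy_text, chunk_size=10, chunk_overlap=3)
for chunk in toy_chunks:
    print(repr(chunk))
```

Each chunk starts with the last three characters of the previous one. That shared region is what lets a sentence that straddles a boundary still appear whole in at least one chunk.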

3. Vectorized representation of the pieces of the document so they can be compared with each other and external queries (embedding)

To understand embeddings, read this paper (Mikolov et al., "Efficient Estimation of Word Representations in Vector Space"):
- https://arxiv.org/pdf/1301.3781

In [None]:
from langchain_openai import OpenAIEmbeddings

# 6. The model used to embed the queries (usually not the same model as the LLM)
embedding_model = OpenAIEmbeddings()

sentence_lie = "Lucas is a gorgeous teacher"
embedding_of_lie = embedding_model.embed_query(sentence_lie)
embedding_of_lie

[-0.010256785899400711,
 0.009058533236384392,
 -0.003995262552052736,
 -0.039464205503463745,
 -0.006759710144251585,
 0.006889955140650272,
 -0.013330565765500069,
 0.004066897090524435,
 -0.005668909288942814,
 -0.01898970641195774,
 0.0030167975928634405,
 0.02526751160621643,
 0.0054735420271754265,
 -0.005730775650590658,
 0.019680004566907883,
 -0.0015051426598802209,
 0.0404801145195961,
 0.0021246199030429125,
 0.029826082289218903,
 -0.040818750858306885,
 -0.01137689221650362,
 -0.006440610159188509,
 -0.014691624790430069,
 -0.00683785742148757,
 -0.0030119132716208696,
 -0.008583138696849346,
 0.02561917155981064,
 -0.028367338702082634,
 0.004688816610723734,
 -0.002334639895707369,
 0.012777024880051613,
 0.0014058308443054557,
 0.0006813436630181968,
 -0.02755982056260109,
 -0.03081594407558441,
 -0.02171182446181774,
 0.005209796130657196,
 0.0035980152897536755,
 0.0021246199030429125,
 0.010178638622164726,
 0.03222258761525154,
 -0.012594682164490223,
 -0.0040506166

In [7]:
sentence_true = "Lucas is a funny teacher"
embedding_of_truth = embedding_model.embed_query(sentence_true)
embedding_of_truth

[-0.006929061375558376,
 0.006583886686712503,
 -0.01155695877969265,
 -0.028841258957982063,
 -0.008993716910481453,
 -0.005110502243041992,
 -0.019355349242687225,
 -0.0010163475526496768,
 -0.004129311535507441,
 -0.010150691494345665,
 0.010904962196946144,
 0.025760255753993988,
 0.012541345320641994,
 -0.00040490104584023356,
 0.025376727804541588,
 -0.001708294847048819,
 0.041625503450632095,
 -0.0015780553221702576,
 0.028790121898055077,
 -0.048171039670705795,
 0.004883581772446632,
 0.0009771957993507385,
 -0.024532968178391457,
 -0.009722419083118439,
 0.0009963721968233585,
 -0.0028508868999779224,
 0.02211674489080906,
 -0.026067078113555908,
 0.01543058454990387,
 -0.005893537309020758,
 0.02418779395520687,
 0.00461830897256732,
 -0.00046103185741230845,
 -0.018818410113453865,
 -0.027844088152050972,
 -0.02896910160779953,
 0.0026191724464297295,
 0.007357333786785603,
 0.00663502374663949,
 0.007293412461876869,
 0.026284409686923027,
 -0.009971711784601212,
 -0.0023

In [9]:
sentence_unrelated = "I know buildings with the color yellow"
embedding_of_unrelated = embedding_model.embed_query(sentence_unrelated)
embedding_of_unrelated

[-0.0013056459138169885,
 -0.01860625669360161,
 0.005485638044774532,
 -0.011484552174806595,
 -0.011016187258064747,
 0.018644751980900764,
 -0.016873950138688087,
 -0.028050536289811134,
 0.01958148181438446,
 -0.01074671745300293,
 0.00729493610560894,
 -0.0007366313366219401,
 0.0012799821561202407,
 0.0100730424746871,
 -0.022507155314087868,
 -0.022738128900527954,
 0.023700522258877754,
 0.009341624565422535,
 0.0027604626957327127,
 -0.0020450842566788197,
 -0.017579704523086548,
 0.014641199260950089,
 -0.003589724423363805,
 -0.01519297156482935,
 -0.01598854921758175,
 -0.004706099629402161,
 0.008937419392168522,
 -0.031130192801356316,
 -0.008289407938718796,
 0.01606553979218006,
 0.03115585632622242,
 -0.02137794718146324,
 -0.011798933148384094,
 -0.012177474796772003,
 0.001328101847320795,
 -0.007307767868041992,
 -0.0021445315796881914,
 0.006935642566531897,
 -0.01474385429173708,
 -0.017271738499403,
 -0.0029834171291440725,
 -0.012427696958184242,
 -0.00728210387

In [10]:
import numpy as np

def cosine_similarity(vec1, vec2):
    vec1 = np.array(vec1)
    vec2 = np.array(vec2)
    return np.dot(vec1, vec2) / (np.linalg.norm(vec1) * np.linalg.norm(vec2))

similarity_lie_truth = cosine_similarity(embedding_of_lie, embedding_of_truth)
similarity_lie_unrelated = cosine_similarity(embedding_of_lie, embedding_of_unrelated)
similarity_truth_unrelated = cosine_similarity(embedding_of_truth, embedding_of_unrelated)

print("Cosine Similarity Matrix:")
print(f"{'':<25}{'Lie':<25}{'Truth':<25}{'Unrelated':<25}")
print(f"{'Lie':<25}{'1.000':<25}{similarity_lie_truth:<25.3f}{similarity_lie_unrelated:<25.3f}")
print(f"{'Truth':<25}{similarity_lie_truth:<25.3f}{'1.000':<25}{similarity_truth_unrelated:<25.3f}")
print(f"{'Unrelated':<25}{similarity_lie_unrelated:<25.3f}{similarity_truth_unrelated:<25.3f}{'1.000':<25}")

print("\nVisual Representation (higher = more similar):")
print(f"Lie vs Truth:      {similarity_lie_truth:.3f}")
print(f"Lie vs Unrelated:  {similarity_lie_unrelated:.3f}")
print(f"Truth vs Unrelated:{similarity_truth_unrelated:.3f}")


Cosine Similarity Matrix:
                         Lie                      Truth                    Unrelated                
Lie                      1.000                    0.960                    0.746                    
Truth                    0.960                    1.000                    0.743                    
Unrelated                0.746                    0.743                    1.000                    

Visual Representation (higher = more similar):
Lie vs Truth:      0.960
Lie vs Unrelated:  0.746
Truth vs Unrelated:0.743


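The similarity score, calculated here as the cosine similarity between the embedded sentences, reflects how close the sentences are in meaning: the two sentences about Lucas score 0.960 against each other, while each scores only about 0.74 against the unrelated sentence about buildings.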
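
To build intuition for the scale of these scores, the sketch below applies the same cosine formula to hand-made 2-D vectors. The vectors are illustrative values, not real embeddings, and the function is re-declared only so the block runs on its own.

```python
import numpy as np

def cosine_similarity(vec1, vec2):
    vec1 = np.asarray(vec1, dtype=float)
    vec2 = np.asarray(vec2, dtype=float)
    return float(np.dot(vec1, vec2) / (np.linalg.norm(vec1) * np.linalg.norm(vec2)))

# Same direction, different length: cosine similarity ignores magnitude
same_direction = cosine_similarity([1, 2], [2, 4])
# Perpendicular vectors: no shared direction at all
orthogonal = cosine_similarity([1, 0], [0, 3])
# Exactly opposite directions
opposite = cosine_similarity([1, 2], [-1, -2])

print(f"same direction: {same_direction:.3f}")
print(f"orthogonal:     {orthogonal:.3f}")
print(f"opposite:       {opposite:.3f}")
```

The score ranges from -1 (opposite) through 0 (orthogonal) to 1 (same direction). With real embedding models, even unrelated sentences can score well above 0, as the 0.74 values above show, so the ranking of scores is usually more informative than their absolute values.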

4. The 'library' where we store these embedded pieces of the document is called a 'vector store' / 'vector database'

In [11]:
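# In recent LangChain releases, the Chroma wrapper is provided by the langchain_community package
from langchain_community.vectorstores import Chroma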

vector_store = Chroma.from_documents(
    documents=chunks,
    embedding=embedding_model,
    persist_directory="./pdf-doc"
)

In [14]:
vector_store.similarity_search("define self attention")

[Document(metadata={'author': '', 'creationdate': '2024-04-10T21:11:43+00:00', 'creator': 'LaTeX with hyperref', 'keywords': '', 'moddate': '2024-04-10T21:11:43+00:00', 'page': 1, 'page_label': '2', 'producer': 'pdfTeX-1.40.25', 'ptex.fullbanner': 'This is pdfTeX, Version 3.141592653-2.6-1.40.25 (TeX Live 2023) kpathsea version 6.3.5', 'source': './assets-resources/attention-paper.pdf', 'subject': '', 'title': '', 'total_pages': 15, 'trapped': '/False'}, page_content='described in section 3.2.\nSelf-attention, sometimes called intra-attention is an attention mechanism relating different positions\nof a single sequence in order to compute a representation of the sequence. Self-attention has been\nused successfully in a variety of tasks including reading comprehension, abstractive summarization,\ntextual entailment and learning task-independent sentence representations [4, 27, 28, 22].\nEnd-to-end memory networks are based on a recurrent attention mechanism instead of sequence-\naligned 
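
Conceptually, a similarity search ranks the stored vectors by how close they are to the query vector. The sketch below shows that idea with a brute-force scan in NumPy. `ToyVectorStore` is a hypothetical class and the 3-D vectors are made up for illustration. Production vector stores generally rely on approximate nearest-neighbour indexes rather than a full scan, and this sketch makes no claim about Chroma's internals.

```python
import numpy as np

class ToyVectorStore:
    """Hypothetical in-memory store: keeps (text, vector) pairs and ranks them by cosine similarity."""

    def __init__(self):
        self.texts = []
        self.vectors = []

    def add(self, text, vector):
        v = np.asarray(vector, dtype=float)
        self.texts.append(text)
        self.vectors.append(v / np.linalg.norm(v))  # store unit-length vectors

    def similarity_search(self, query_vector, k=2):
        q = np.asarray(query_vector, dtype=float)
        q = q / np.linalg.norm(q)
        # For unit vectors, the dot product equals the cosine similarity
        scores = np.stack(self.vectors) @ q
        top = np.argsort(-scores)[:k]  # indices of the k highest scores
        return [(self.texts[i], float(scores[i])) for i in top]


store = ToyVectorStore()
# Made-up vectors standing in for real embeddings
store.add("self-attention relates positions of one sequence", [0.9, 0.1, 0.0])
store.add("the optimizer uses a warmup schedule", [0.0, 0.2, 0.9])
store.add("multi-head attention runs several attention layers", [0.8, 0.3, 0.1])

results = store.similarity_search([1.0, 0.0, 0.0], k=2)
for text, score in results:
    print(f"{score:.3f}  {text}")
```

The two attention-related texts come back first because their vectors point in nearly the same direction as the query vector, while the optimizer text is orthogonal to it.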

5. LLM (the model we are using to answer the question)

In [16]:
from langchain.chat_models import init_chat_model

llm_to_answer_queries_about_paper = init_chat_model(
    model="gpt-4o-mini",
    temperature=0
)

7. Production of the final answer

![](2025-09-04-15-09-37.png)

In [20]:
def format_chunks_for_llm(chunks):
    return "\n".join([chunk.page_content for chunk in chunks])

def rag_retrieval_step_from_scratch(query, vector_store, llm):
    # 1. Retrieve the chunks most similar to the query
    similar_chunks = vector_store.similarity_search(query)
    # 2. Concatenate the chunks into a single context string
    formatted_chunks = format_chunks_for_llm(similar_chunks)
    # 3. Send the question plus the context to the LLM and return its answer
    prompt = (
        f"The user asked you the following question: {query}\n\n"
        "Here are the pieces of the document that are relevant to the question:\n\n"
        f"{formatted_chunks}"
    )
    return llm.invoke(prompt).content

query = "According to this paper about transformers and attention networks, what is self-attention?"
output = rag_retrieval_step_from_scratch(query, vector_store, llm_to_answer_queries_about_paper)

In [21]:
from IPython.display import Markdown

Markdown(output)


Self-attention, also known as intra-attention, is an attention mechanism that relates different positions of a single sequence to compute a representation of that sequence. It allows the model to weigh the importance of various parts of the input sequence when generating a representation, enabling it to capture dependencies between words or tokens regardless of their distance from each other in the sequence. This mechanism has been effectively utilized in various tasks such as reading comprehension, abstractive summarization, and learning task-independent sentence representations.

In the context of the Transformer model, self-attention is a key component that replaces traditional recurrent neural networks (RNNs) and convolutional layers, allowing for more parallelization and faster training. The Transformer architecture relies entirely on self-attention to compute representations of its input and output, which has led to significant improvements in translation quality and efficiency compared to previous models that used recurrent or convolutional structures.
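
One possible variation on the prompt-building step, sketched under assumptions: number each retrieved chunk, attach its `page_label` metadata (a key that appears in the loader output above), and ask the model to stay within the supplied context. `FakeDocument` and `build_grounded_prompt` are hypothetical names introduced for illustration, and the example chunk texts are abbreviated placeholders. This is not the notebook's method, only a sketch of how sources could be made traceable.

```python
from dataclasses import dataclass, field

@dataclass
class FakeDocument:
    """Stand-in for a LangChain Document so the sketch runs without any dependencies."""
    page_content: str
    metadata: dict = field(default_factory=dict)

def build_grounded_prompt(query, chunks):
    # Label every chunk with an index and its page so the answer can point back to it
    context_blocks = []
    for i, chunk in enumerate(chunks, start=1):
        page = chunk.metadata.get("page_label", "?")
        context_blocks.append(f"[{i}] (page {page})\n{chunk.page_content}")
    context = "\n\n".join(context_blocks)
    return (
        "Answer the question using only the context below. "
        "If the context is not sufficient, say so.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}"
    )

# Placeholder chunks; real ones would come from vector_store.similarity_search(query)
example_chunks = [
    FakeDocument("Self-attention, sometimes called intra-attention ...", {"page_label": "2"}),
    FakeDocument("An attention function can be described as ...", {"page_label": "3"}),
]
prompt = build_grounded_prompt("What is self-attention?", example_chunks)
print(prompt)
```

The resulting string can be passed to `llm.invoke(prompt)` in the same way as the prompt built in `rag_retrieval_step_from_scratch`.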