## Retrieval Augmented Generation

In [5]:
import os
from dotenv import load_dotenv
load_dotenv()

from llama_index.core import VectorStoreIndex, SimpleDirectoryReader
from llama_index.core.retrievers import VectorIndexRetriever
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.llms.groq import Groq
from llama_index.core.response.pprint_utils import pprint_response

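# Local embedding model, used to embed both the document chunks and the queries.
embed_model = HuggingFaceEmbedding(model_name="sentence-transformers/all-MiniLM-L6-v2")
# Groq-hosted LLM that writes the answers; GROQ_API_KEY comes from the environment (.env loaded above).
llm = Groq(api_key=os.environ["GROQ_API_KEY"], model="openai/gpt-oss-120b")

# Load every supported file under ./data into Document objects.
documents = SimpleDirectoryReader("data").load_data(show_progress=True)
documents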

2025-10-23 15:36:33,033 - INFO - Load pretrained SentenceTransformer: sentence-transformers/all-MiniLM-L6-v2
Loading files: 100%|██████████| 6/6 [00:05<00:00,  1.03it/s]


[Document(id_='70769b61-2ceb-4a4a-8b2a-36b43953c6b1', embedding=None, metadata={'page_label': '1', 'file_name': '1506.02640v5.pdf', 'file_path': 'c:\\Users\\fredr\\OneDrive\\Desktop\\llm-utilities\\llama--index\\data\\1506.02640v5.pdf', 'file_type': 'application/pdf', 'file_size': 5296750, 'creation_date': '2025-10-22', 'last_modified_date': '2025-10-22'}, excluded_embed_metadata_keys=['file_name', 'file_type', 'file_size', 'creation_date', 'last_modified_date', 'last_accessed_date'], excluded_llm_metadata_keys=['file_name', 'file_type', 'file_size', 'creation_date', 'last_modified_date', 'last_accessed_date'], relationships={}, metadata_template='{key}: {value}', metadata_separator='\n', text_resource=MediaResource(embeddings=None, data=None, text='You Only Look Once:\nUniﬁed, Real-Time Object Detection\nJoseph Redmon∗, Santosh Divvala∗†, Ross Girshick¶, Ali Farhadi∗†\nUniversity of Washington∗, Allen Institute for AI†, Facebook AI Research¶\nhttp://pjreddie.com/yolo/\nAbstract\nWe pr

### Index Object

In [6]:
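# Split the documents into nodes, embed each node, and build a vector index over them.
index = VectorStoreIndex.from_documents(documents, embed_model=embed_model, show_progress=True)

# Default query engine: retrieve relevant nodes from the index, then have the LLM answer from them.
query_engine = index.as_query_engine(llm=llm)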

Parsing nodes: 100%|██████████| 89/89 [00:00<00:00, 292.99it/s]
Generating embeddings: 100%|██████████| 147/147 [00:13<00:00, 10.66it/s]
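
The progress bars above show 89 documents going in and 147 embeddings coming out, which is what you would expect if longer documents are split into several overlapping nodes before embedding. The sketch below illustrates that idea with a deliberately simplified word-based splitter. `chunk_words`, its `chunk_size` and `overlap` values, and the toy `pages` are all hypothetical; LlamaIndex's own splitter works on sentences and tokens, so treat this as an illustration of the idea rather than its implementation.

```python
def chunk_words(text, chunk_size=8, overlap=2):
    """Split text into word windows of chunk_size that share `overlap` words."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this window already reached the end of the text
    return chunks


# Two toy "pages": one long enough to be split, one short enough to stay whole.
pages = [" ".join(f"w{i}" for i in range(1, 13)), "short page"]
nodes = [chunk for page in pages for chunk in chunk_words(page)]

for node in nodes:
    print(repr(node))
```

Here two pages become three nodes, and the last two words of the first node reappear at the start of the second, so text near a boundary is not cut off from its context.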


In [16]:
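from llama_index.core.query_engine import RetrieverQueryEngine
from llama_index.core.retrievers import VectorIndexRetriever
from llama_index.core.postprocessor import SimilarityPostprocessor
from llama_index.core import Settings

# Register the models as global defaults so components that are not
# passed a model explicitly (such as the response synthesizer) use them.
Settings.embed_model = embed_model
Settings.llm = llm

# Fetch the 4 nodes most similar to the query.
retriever = VectorIndexRetriever(index=index, similarity_top_k=4)
# Then drop any retrieved node whose similarity score is below 0.3.
postprocessor = [SimilarityPostprocessor(similarity_cutoff=0.3)]

query_engine = RetrieverQueryEngine(
    retriever=retriever,
    node_postprocessors=postprocessor,
)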
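
To make the two knobs in the cell above concrete, here is a minimal pure-Python sketch of "top-k, then cutoff" retrieval. It assumes cosine similarity as the score, and `cosine_similarity`, `retrieve`, and the 2-D `toy_nodes` vectors are hypothetical names and data made up for illustration, not LlamaIndex internals.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def retrieve(query_vec, node_vecs, top_k=4, cutoff=0.3):
    """Score every node, keep the top_k best, then drop scores below cutoff."""
    scored = [(node_id, cosine_similarity(query_vec, vec))
              for node_id, vec in node_vecs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    top = scored[:top_k]                                    # similarity_top_k step
    return [(nid, score) for nid, score in top if score >= cutoff]  # cutoff step


# Toy 2-D "embeddings"; real embeddings have hundreds of dimensions.
toy_nodes = {
    "a": [1.0, 0.0],
    "b": [0.8, 0.6],
    "c": [0.0, 1.0],
    "d": [-1.0, 0.0],
    "e": [0.6, 0.8],
}
results = retrieve([1.0, 0.0], toy_nodes, top_k=4, cutoff=0.3)
for node_id, score in results:
    print(node_id, round(score, 3))
```

The top-k step selects four candidates (`a`, `b`, `e`, `c`), and the cutoff then removes `c`, whose score is 0. The practical consequence is that a query engine configured this way can pass fewer than `similarity_top_k` nodes to the LLM, or none at all if the cutoff is set too high for the embedding model in use.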

In [17]:
response = query_engine.query("What is Attention is All you need?")
pprint_response(response, show_source=True, wrap_width=140)

2025-10-23 15:40:42,757 - INFO - HTTP Request: POST https://api.groq.com/openai/v1/chat/completions "HTTP/1.1 200 OK"


Final Response: “Attention Is All You Need” introduces the Transformer, a neural‑network architecture that replaces recurrence and
convolution with pure attention mechanisms. The model is built from stacked encoder and decoder blocks, each consisting of a multi‑head
self‑attention layer followed by a position‑wise fully‑connected feed‑forward network. Residual connections and layer‑normalization are
applied around every sub‑layer.  The core attention operation is **scaled dot‑product attention**: queries, keys, and values are projected
into vectors, the dot products of queries with all keys are computed, scaled by the inverse square root of the key dimension, and passed
through a softmax to obtain weights that are applied to the values. Multiple such attention heads run in parallel (multi‑head attention),
allowing the model to capture different types of relationships simultaneously.  By stacking six identical encoder layers and six identical
decoder layers, the Transformer can model lo

### Using Persistent Storage

In [None]:
import os.path
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader, StorageContext, load_index_from_storage

PERSIST_DIR = "./storage"
if not os.path.exists(PERSIST_DIR):
    # First run: build the index from the documents and save it to disk.
    documents = SimpleDirectoryReader("data").load_data()
    index = VectorStoreIndex.from_documents(documents, embed_model=embed_model, show_progress=True)
    index.storage_context.persist(persist_dir=PERSIST_DIR)
else:
    # Later runs: reload the saved index instead of re-embedding the documents.
    storage_context = StorageContext.from_defaults(persist_dir=PERSIST_DIR)
    index = load_index_from_storage(storage_context=storage_context, embed_model=embed_model)

query_engine = index.as_query_engine(llm=llm)
response = query_engine.query("What are Transformers?")

pprint_response(response, wrap_width=140)

2025-10-23 15:41:13,296 - INFO - Loading all indices.
2025-10-23 15:41:14,445 - INFO - HTTP Request: POST https://api.groq.com/openai/v1/chat/completions "HTTP/1.1 200 OK"


Final Response: Transformers are a type of neural‑network architecture that processes sequences using stacked self‑attention mechanisms
instead of recurrence or convolution. The model is built from an encoder‑decoder structure, each composed of several identical layers.    In
the encoder, each layer contains a multi‑head self‑attention sub‑layer followed by a position‑wise fully connected feed‑forward network;
both sub‑layers are wrapped with residual connections and layer‑normalization.    The decoder mirrors the encoder but adds a third sub‑layer
that performs multi‑head attention over the encoder’s output. Its self‑attention is masked so that each position can only attend to earlier
positions, ensuring that predictions depend only on already generated tokens.    Overall, the Transformer relies on parallelizable attention
operations, residual links, and normalization to model relationships within and across sequences.
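
The persistence cell in this section follows a general "load if present, otherwise build and save" pattern. The sketch below isolates that pattern using only the standard library, with a JSON file standing in for the index. `load_or_build`, `expensive_build`, and the file layout are hypothetical and are not how LlamaIndex stores its data.

```python
import json
import os
import tempfile


def load_or_build(persist_path, build_fn):
    """Return saved data if persist_path exists, else build, save, and return it."""
    if os.path.exists(persist_path):
        with open(persist_path, "r", encoding="utf-8") as f:
            return json.load(f), "loaded"
    data = build_fn()
    os.makedirs(os.path.dirname(persist_path), exist_ok=True)
    with open(persist_path, "w", encoding="utf-8") as f:
        json.dump(data, f)
    return data, "built"


build_calls = []


def expensive_build():
    """Stand-in for the slow step (parsing and embedding documents)."""
    build_calls.append(1)
    return {"node-1": [0.1, 0.2], "node-2": [0.3, 0.4]}


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "storage", "index.json")
    first, first_status = load_or_build(path, expensive_build)
    second, second_status = load_or_build(path, expensive_build)

print(first_status, second_status, len(build_calls))
```

One design consequence is worth keeping in mind: the check only looks at whether the storage location exists, so adding or changing files in `data` will not trigger a rebuild. Deleting the persisted directory forces one.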
