# Local RAG with LlamaIndex + Ollama

This notebook demonstrates how to build a **fully local** RAG (Retrieval-Augmented Generation) pipeline using:
- **Ollama** for the LLM (runs locally, no API keys needed)
- **HuggingFace Embeddings** for document embedding (sentence-transformers)
- **LlamaIndex** for orchestrating the RAG pipeline

## RAG Pipeline Steps

1. **Load** - Read the PDF document
2. **Chunk** - Split into manageable pieces
3. **Embed** - Convert chunks to vector representations
4. **Index** - Store vectors for efficient retrieval
5. **Query** - Retrieve relevant chunks and generate answers

## Prerequisites

1. **Install Ollama**: Download from [ollama.ai](https://ollama.ai)
2. **Pull a model**: Run `ollama pull llama3.2` in your terminal
3. **Install Python packages**: Run the cell below

In [None]:
# Install required packages
# Note: since LlamaIndex v0.10, the core library and each integration ship as separate packages
!pip install -q llama-index-core llama-index-llms-ollama llama-index-embeddings-huggingface
!pip install -q llama-index-readers-file pypdf sentence-transformers
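
Before configuring LlamaIndex, it can help to confirm that the Ollama server from the prerequisites is running and that the model has been pulled. The sketch below is one way to check, using only the standard library and Ollama's `/api/tags` endpoint, which lists local models. The helper names `list_ollama_models` and `has_model` are hypothetical, and the address assumes the default `localhost:11434` used throughout this notebook.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # default address assumed by this notebook


def list_ollama_models(base_url=OLLAMA_URL, timeout=2.0):
    """Return names of locally pulled models, or None if the server is unreachable."""
    try:
        with urllib.request.urlopen(f"{base_url}/api/tags", timeout=timeout) as resp:
            payload = json.load(resp)
    except (OSError, ValueError):
        # OSError covers refused connections and timeouts; ValueError covers bad JSON.
        return None
    if not isinstance(payload, dict):
        return None
    return [m.get("name", "") for m in payload.get("models", [])]


def has_model(names, wanted):
    """True if `wanted` matches a pulled model, ignoring any ':tag' suffix."""
    return any(name == wanted or name.split(":")[0] == wanted for name in names)


models = list_ollama_models()
if models is None:
    print("Ollama is not reachable; start it with `ollama serve`.")
elif not has_model(models, "llama3.2"):
    print("Server is up, but llama3.2 is missing; run `ollama pull llama3.2`.")
else:
    print("Ollama is running and llama3.2 is available.")
```

If the check reports a problem, fix it before running the remaining cells; otherwise the first query will fail with a connection or model-not-found error.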

## Step 1: Configure the LLM and Embedding Model

We'll use:
- **Ollama** with `llama3.2` for text generation (runs on localhost:11434)
- **BGE-small** (`BAAI/bge-small-en-v1.5`) from HuggingFace for embeddings (384-dimensional vectors, ~130 MB download)

In [1]:
from llama_index.core import Settings
from llama_index.llms.ollama import Ollama
from llama_index.embeddings.huggingface import HuggingFaceEmbedding

# Configure the LLM - Ollama runs locally on port 11434
Settings.llm = Ollama(
    model="llama3.2",           # 3B-parameter model; balances speed and answer quality
    request_timeout=120.0,       # Seconds to wait for a response before giving up
    temperature=0.1,             # Low temperature for factual responses
)

# Configure the embedding model - runs locally via sentence-transformers
Settings.embed_model = HuggingFaceEmbedding(
    model_name="BAAI/bge-small-en-v1.5",  # 384-dim, ~130MB
)

print("LLM and Embedding model configured!")

LLM and Embedding model configured!


## Step 2: Load the PDF Document

LlamaIndex's `SimpleDirectoryReader` handles PDF parsing automatically.
Each page becomes a separate Document object with metadata.

In [2]:
from llama_index.core import SimpleDirectoryReader

# Load the PDF - using the attention paper as our example
documents = SimpleDirectoryReader(
    input_files=["./assets-resources/attention_paper.pdf"]
).load_data()

print(f"Loaded {len(documents)} pages from the PDF")
print(f"\nFirst page preview (first 500 chars):\n{documents[0].text[:500]}...")

Loaded 11 pages from the PDF

First page preview (first 500 chars):
Attention Is All You Need
Ashish Vaswani∗
Google Brain
avaswani@google.com
Noam Shazeer∗
Google Brain
noam@google.com
Niki Parmar∗
Google Research
nikip@google.com
Jakob Uszkoreit∗
Google Research
usz@google.com
Llion Jones∗
Google Research
llion@google.com
Aidan N. Gomez∗†
University of Toronto
aidan@cs.toronto.edu
Łukasz Kaiser∗
Google Brain
lukaszkaiser@google.com
Illia Polosukhin∗‡
illia.polosukhin@gmail.com
Abstract
The dominant sequence transduction models are based on complex recurrent or...


## Step 3: Create the Vector Index

This step:
1. Chunks the documents (default: 1024 tokens per chunk with a 20-token overlap)
2. Generates embeddings for each chunk
3. Stores them in an in-memory vector store

For production, you'd use a persistent vector store like ChromaDB or FAISS.
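
To make the chunking step concrete, here is a minimal sketch of a sliding window with overlap. It is a simplification, not LlamaIndex's splitter: it assumes whitespace-separated words stand in for model tokens and ignores sentence boundaries. The function name `chunk_tokens` is hypothetical, and its defaults simply mirror the figures quoted in this notebook.

```python
def chunk_tokens(tokens, chunk_size=1024, overlap=20):
    """Split a token list into windows of `chunk_size` that share `overlap` tokens."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reaches the end of the input
    return chunks


# Tiny example: 10 words, windows of 4 words, 1 word shared between neighbors.
words = "the quick brown fox jumps over the lazy dog again".split()
for chunk in chunk_tokens(words, chunk_size=4, overlap=1):
    print(chunk)
```

The overlap means the last token of one chunk reappears at the start of the next, so a sentence that straddles a boundary is less likely to lose its context in both chunks.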

In [None]:
from llama_index.core import VectorStoreIndex

# Create the index - this embeds all chunks
index = VectorStoreIndex.from_documents(
    documents,
    show_progress=True,  # Show embedding progress
)

print("\nIndex created successfully!") 

Parsing nodes:   0%|          | 0/11 [00:00<?, ?it/s]

Generating embeddings:   0%|          | 0/11 [00:00<?, ?it/s]


Index created successfully!


## Step 4: Create the Query Engine

The query engine combines:
- **Retriever**: Finds relevant chunks using vector similarity
- **Response Synthesizer**: Generates answers using the LLM

`similarity_top_k=3` means we retrieve the 3 most relevant chunks.
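
As a rough illustration of what the retriever does, the sketch below ranks chunk vectors against a query vector and keeps the best `k`. It assumes cosine similarity as the scoring function and uses toy 3-dimensional vectors in place of real 384-dimensional embeddings. The helper names `cosine_similarity` and `top_k` are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 for a zero vector)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec, chunk_vecs, k=3):
    """Return (index, score) pairs for the k chunks most similar to the query."""
    scored = [(i, cosine_similarity(query_vec, v)) for i, v in enumerate(chunk_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy vectors standing in for chunk embeddings and an embedded question.
chunk_vecs = [[1.0, 0.0, 0.0], [0.7, 0.7, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
query_vec = [1.0, 0.1, 0.0]

for idx, score in top_k(query_vec, chunk_vecs, k=3):
    print(f"chunk {idx}: score {score:.3f}")
```

A larger `k` gives the LLM more context to draw on, but it also lengthens the prompt and can pull in less relevant chunks.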

In [4]:
# Create query engine with top-3 retrieval
query_engine = index.as_query_engine(
    similarity_top_k=3,  # Number of chunks to retrieve
)

print("Query engine ready!")

2026-02-03 17:14:14,833 - INFO - HTTP Request: POST http://localhost:11434/api/show "HTTP/1.1 200 OK"


Query engine ready!


## Step 5: Query the Documents

Now we can ask questions about the paper! The RAG pipeline will:
1. Embed your question
2. Find similar chunks in the index
3. Send chunks + question to the LLM
4. Return the generated answer
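
Step 3 in the list above amounts to string assembly: the retrieved chunks are placed into a prompt together with the question. The sketch below shows the idea. The template wording and the function name `build_prompt` are assumptions for illustration; LlamaIndex uses its own prompt templates internally.

```python
def build_prompt(question, chunks):
    """Combine retrieved chunks and the user question into a single prompt string."""
    context = "\n\n".join(
        f"[{i}] {chunk.strip()}" for i, chunk in enumerate(chunks, start=1)
    )
    return (
        "Context information is below.\n"
        "---------------------\n"
        f"{context}\n"
        "---------------------\n"
        "Using only the context above, answer the question.\n"
        f"Question: {question}\n"
        "Answer:"
    )


# Placeholder strings stand in for the text of the retrieved chunks.
retrieved_chunks = ["<text of retrieved chunk 1>", "<text of retrieved chunk 2>"]
print(build_prompt("What is the main contribution of this paper?", retrieved_chunks))
```

Numbering the chunks makes it easier to ask the model to cite which chunk supports each statement.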

In [5]:
# Ask a question about the paper
response = query_engine.query("What is the main contribution of this paper?")

print("Answer:")
print(response.response)

2026-02-03 17:14:22,892 - INFO - HTTP Request: POST http://localhost:11434/api/chat "HTTP/1.1 200 OK"


Answer:
The main contribution of this paper appears to be the presentation of the Transformer, a sequence transduction model based entirely on attention mechanisms. This model significantly improves training speed for translation tasks compared to architectures using recurrent or convolutional layers. The paper also explores various modifications and extensions to the original Transformer architecture, including multi-head attention, positional encoding, and dropout regularization, which enhance its performance and efficiency.


In [7]:
# Let's try another question
response = query_engine.query("What is self-attention and how does it work? Answer in 5 short bullet points.")

print("Answer:")
print(response.response)

2026-02-03 17:21:15,998 - INFO - HTTP Request: POST http://localhost:11434/api/chat "HTTP/1.1 200 OK"


Answer:
Here are five bullet points explaining what self-attention and how it works:

• Self-attention is an attention mechanism that allows a model to attend to different positions within the same input sequence to compute a representation of the entire sequence.
• In traditional recurrent neural networks (RNNs), information flows sequentially, meaning each position in the sequence only attends to the previous position. Self-attention breaks this constraint by allowing each position to attend to all other positions.
• The self-attention mechanism typically involves three components: queries, keys, and values. These are usually learned linear projections of the input data, which are then used to compute attention weights.
• The attention weights are computed by taking the dot product of the query and key vectors, divided by the square root of the dimensionality of the vectors. This is often referred to as "scaled dot-product attention".
• The final representation of each position is a 

## Inspecting Retrieved Sources

One advantage of RAG is transparency - we can see which chunks were used to generate the answer.

In [8]:
# Inspect the source nodes (retrieved chunks)
print(f"Number of source chunks: {len(response.source_nodes)}\n")

for i, node in enumerate(response.source_nodes):
    print(f"--- Source {i+1} (score: {node.score:.3f}) ---")
    print(f"{node.text[:300]}...\n")

Number of source chunks: 3

--- Source 1 (score: 0.687) ---
MultiHead(Q,K,V ) = Concat(head 1,..., headh)W O
where headi = Attention(QW Q
i ,KW K
i ,VW V
i )
Where the projections are parameter matricesW Q
i ∈ Rdmodel×dk,W K
i ∈ Rdmodel×dk,W V
i ∈ Rdmodel×dv
andW O∈ Rhdv×dmodel.
In this work we employ h = 8 parallel attention layers, or heads. For each of th...

--- Source 2 (score: 0.672) ---
Scaled Dot-Product Attention
 Multi-Head Attention
Figure 2: (left) Scaled Dot-Product Attention. (right) Multi-Head Attention consists of several
attention layers running in parallel.
query with all keys, divide each by√dk, and apply a softmax function to obtain the weights on the
values.
In practi...

--- Source 3 (score: 0.652) ---
Recurrent models typically factor computation along the symbol positions of the input and output
sequences. Aligning the positions to steps in computation time, they generate a sequence of hidden
statesht, as a function of the previous hidden stateht−1 and the input
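
The scores printed above can also be used to drop weak matches before they reach the LLM. Here is a minimal sketch of that idea using plain Python objects instead of LlamaIndex nodes. The `ScoredChunk` class, the `filter_by_score` helper, and the 0.66 cutoff are all hypothetical; the three scores are the ones shown in the output above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ScoredChunk:
    text: str
    score: Optional[float]  # a retriever may not always provide a score


def filter_by_score(chunks, min_score):
    """Keep only chunks whose similarity score is present and at least `min_score`."""
    return [c for c in chunks if c.score is not None and c.score >= min_score]


retrieved = [
    ScoredChunk("<text of source 1>", 0.687),
    ScoredChunk("<text of source 2>", 0.672),
    ScoredChunk("<text of source 3>", 0.652),
]

kept = filter_by_score(retrieved, min_score=0.66)
print(f"Kept {len(kept)} of {len(retrieved)} chunks")
```

A sensible cutoff depends on the embedding model and the corpus, so it is worth inspecting scores for a few known-good and known-bad questions before choosing one.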

## (Optional) Persist the Index

Save the index to disk so you don't need to re-embed documents each time.

In [None]:
# Save the index to disk
index.storage_context.persist(persist_dir="./storage/attention_paper")
print("Index saved to ./storage/attention_paper/")

In [None]:
# To load the index later:
from llama_index.core import StorageContext, load_index_from_storage

# Uncomment to load:
# storage_context = StorageContext.from_defaults(persist_dir="./storage/attention_paper")
# loaded_index = load_index_from_storage(storage_context)
# query_engine = loaded_index.as_query_engine(similarity_top_k=3)
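
The save and load cells can be combined into a "load if present, otherwise build" pattern, so the notebook re-embeds the PDF only on the first run. The helper below is a hypothetical sketch. It takes the build and load steps as callables, which keeps it independent of LlamaIndex, and it treats a non-empty directory as a saved index.

```python
import os


def load_or_build(persist_dir, build_fn, load_fn):
    """Return load_fn(persist_dir) if the directory has files, else build_fn()."""
    if os.path.isdir(persist_dir) and os.listdir(persist_dir):
        return load_fn(persist_dir)
    return build_fn()


# Usage with the objects defined earlier in this notebook (requires those cells):
#
# def build():
#     idx = VectorStoreIndex.from_documents(documents)
#     idx.storage_context.persist(persist_dir="./storage/attention_paper")
#     return idx
#
# index = load_or_build(
#     "./storage/attention_paper",
#     build_fn=build,
#     load_fn=lambda d: load_index_from_storage(
#         StorageContext.from_defaults(persist_dir=d)
#     ),
# )
```

The index must be loaded with the same embedding model that built it, so keep `Settings.embed_model` unchanged between runs or delete the storage directory when switching models.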

## Summary

You've built a complete local RAG pipeline! Key components:

| Component | Tool | Why |
|-----------|------|-----|
| LLM | Ollama (llama3.2) | Local, no API keys, easy model management |
| Embeddings | HuggingFace (bge-small) | Fast, accurate, runs locally |
| Orchestration | LlamaIndex | Handles chunking, indexing, retrieval |

### Next Steps
- Try different models: `ollama pull mistral` or `ollama pull phi3`
- Use persistent storage: ChromaDB, FAISS
- Experiment with chunk sizes via `Settings.chunk_size`
- Add hybrid search (keyword + semantic)
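
For the hybrid search idea, one common way to merge a keyword ranking with a semantic ranking is reciprocal rank fusion (RRF). The sketch below is a standalone illustration rather than a LlamaIndex feature. The chunk IDs are made up, and `k=60` is the conventional smoothing constant for RRF.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked lists of IDs; items ranked highly in several lists score best."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)


# Hypothetical results from a keyword search and a vector search.
keyword_ranking = ["chunk_2", "chunk_5", "chunk_1"]
semantic_ranking = ["chunk_1", "chunk_2", "chunk_7"]

fused = reciprocal_rank_fusion([keyword_ranking, semantic_ranking])
print(fused)
```

Because RRF works on ranks rather than raw scores, it sidesteps the problem that keyword scores and cosine similarities are on different scales.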