# RAG from Scratch

**Problem:** LLMs don't know facts that were absent from their training data, and they tend to recall highly domain-specific content poorly, because they generalise across what they have seen rather than retain every detail.

We'll build RAG step-by-step, adding each component only when we hit a real problem.

**Our Goal:** Ask ChatGPT about obscure facts that don't exist in its training data.

## Setup

In [4]:
# Install required packages
%pip install openai numpy scikit-learn tiktoken python-dotenv -q


[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m24.2[0m[39;49m -> [0m[32;49m26.0.1[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpip install --upgrade pip[0m
Note: you may need to restart the kernel to use updated packages.


In [5]:
import openai
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
import tiktoken
import os
from dotenv import load_dotenv
import re

# Load environment variables from .env file
load_dotenv()
openai.api_key = os.getenv("OPENAI_API_KEY")

## 1. The Knowledge Gap

First, let's create obscure facts ChatGPT has never seen.

In [6]:
documents = [
    "Dr. Zephyr Blackwood invented the Quantum Flux Capacitor in 2019, a device that can predict market crashes by measuring temporal graviton fluctuations. It achieved 94.7% accuracy in the 2021 GameStop incident.",
    "The Antarctican Emperor Penguin colony of Sector-7G has developed a unique communication protocol using 847 distinct flipper gestures. Researcher Maria Kowalski documented this in her 2023 paper published in the Journal of Avian Cryptolinguistics.",
    "In 1987, a small village in rural Tasmania named Whimblebrook accidentally created the world's first AI when their library's card catalog system gained sentience after being struck by lightning. It was named Gerald and served the community until 2003.",
    "The lost programming language 'Flibberscript' was developed in 1974 by Estonian mathematician Peeter Järvik. Its unique feature was that all variables had to rhyme with each other, making it simultaneously the most poetic and frustrating language ever created.",
    "Professor Elena Vasquez discovered that platypuses can count to 17 in base-11 using bioluminescent signals from their bills. This breakthrough was published in Nature Xenobiology in December 2022.",
    "The Great Maple Syrup Heist of Quebec in 2012 involved exactly 2,784 barrels worth $18.7 million. The lead investigator, Detective Claude Beaumont, cracked the case by analyzing the viscosity patterns left at the scene.",
    "In the remote mountains of Bhutan, a monastery discovered ancient manuscripts describing the 'Chronicle Compiler,' a mechanical computing device from 1247 CE that used prayer wheels as logic gates. It could perform addition using 63 interconnected wheels.",
    "The rare Amazonian Blue-Spotted Tree Frog (Dendrobates azurius) produces a venom that, when diluted 1:10000, has been shown to improve human pattern recognition by 23% for approximately 4 hours. Discovered by Dr. James Thornhill in 2020."
]

## 2. Problem: How to Split Long Documents?

**Solution:** Different chunking strategies for different needs.

We can't send entire books to the LLM. We need chunks.

In [7]:
def split_sentences(text):
    """Split text into sentences, protecting a few title abbreviations (Dr., Prof., Mr., Mrs., Ms.)"""
    # Replace common abbreviations temporarily
    text = text.replace('Dr.', 'Dr<DOT>')
    text = text.replace('Prof.', 'Prof<DOT>')
    text = text.replace('Mr.', 'Mr<DOT>')
    text = text.replace('Mrs.', 'Mrs<DOT>')
    text = text.replace('Ms.', 'Ms<DOT>')
    
    # Split on sentence endings
    sentences = re.split(r'(?<=[.!?])\s+(?=[A-Z])', text)
    
    # Restore abbreviations
    sentences = [s.replace('<DOT>', '.').strip() for s in sentences if s.strip()]
    return sentences

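def chunk_by_tokens(text, max_tokens=50, overlap=10):
    """Chunk text by token count with overlap"""
    if not 0 <= overlap < max_tokens:
        # A step of zero or less would make range() fail or produce no chunks
        raise ValueError("overlap must be non-negative and smaller than max_tokens")
    encoding = tiktoken.encoding_for_model("gpt-3.5-turbo")
    tokens = encoding.encode(text)
    chunks = []
    
    # Slide a window of max_tokens, advancing by (max_tokens - overlap) each step
    for i in range(0, len(tokens), max_tokens - overlap):
        chunk_tokens = tokens[i:i + max_tokens]
        chunks.append(encoding.decode(chunk_tokens))
    return chunks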

def chunk_by_sentences(text, sentences_per_chunk=2):
    """Chunk text by sentence count"""
    sentences = split_sentences(text)
    
    chunks = []
    for i in range(0, len(sentences), sentences_per_chunk):
        chunk = ' '.join(sentences[i:i + sentences_per_chunk])
        chunks.append(chunk)
    return chunks

def chunk_by_semantic_similarity(text, similarity_threshold=0.5):
    """Chunk text by semantic similarity between sentences"""
    sentences = split_sentences(text)
    
    if len(sentences) <= 1:
        return [text]
    
    # Get embeddings for all sentences
    # (get_embeddings is defined in a later cell; run that cell before calling this function)
    sent_embeddings = np.array(get_embeddings(sentences))
    
    chunks = []
    current_chunk = [sentences[0]]
    
    for i in range(1, len(sentences)):
        # Calculate similarity with previous sentence
        similarity = cosine_similarity(
            sent_embeddings[i-1:i], 
            sent_embeddings[i:i+1]
        )[0][0]
        
        # If similar enough, add to current chunk, else start new chunk
        if similarity >= similarity_threshold:
            current_chunk.append(sentences[i])
        else:
            chunks.append(' '.join(current_chunk))
            current_chunk = [sentences[i]]
    
    # Add the last chunk
    if current_chunk:
        chunks.append(' '.join(current_chunk))
    
    return chunks

# Combine all documents into one corpus
corpus = " ".join(documents)
print(f"Corpus length: {len(corpus)} characters")
print(f"Total documents: {len(documents)}")

Corpus length: 1881 characters
Total documents: 8
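
To see what `max_tokens` and `overlap` do in `chunk_by_tokens` without loading a tokenizer, here is a minimal sketch of the same sliding window over a plain list of integers. The name `sliding_windows` and the toy input are assumptions made for illustration; they are not part of the notebook.

```python
def sliding_windows(items, size, overlap):
    """Return consecutive windows of `size` items; each shares `overlap` items with the previous one."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    step = size - overlap
    return [items[i:i + size] for i in range(0, len(items), step)]

# Ten stand-in "tokens", windows of 4 with 1 token of overlap
windows = sliding_windows(list(range(10)), size=4, overlap=1)
for w in windows:
    print(w)
```

Note that the final window can be a short tail already covered by the previous window. Whether to drop such a tail is a design choice this sketch leaves open.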


## 3. Problem: How to Find Relevant Chunks?

**Solution:** Convert text to vectors (embeddings) for semantic search.

We have chunks, but how do we find the right ones for a query?

In [8]:
def get_embeddings(texts, model="text-embedding-3-small"):
    """Get embeddings from OpenAI"""
    response = openai.embeddings.create(input=texts, model=model)
    return [item.embedding for item in response.data]
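
`get_embeddings` sends every text in a single request, which is fine for this small corpus. For larger corpora it may be necessary to split the input into batches. The sketch below shows one way to do that, assuming a hypothetical helper `embed_in_batches` that accepts any embedding function; the `fake_embed` stand-in exists only so the example runs without an API key.

```python
def embed_in_batches(texts, embed_fn, batch_size=100):
    """Call embed_fn on slices of texts and concatenate the returned vectors in order."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    vectors = []
    for start in range(0, len(texts), batch_size):
        vectors.extend(embed_fn(texts[start:start + batch_size]))
    return vectors

# Stand-in embedder for illustration only: each text becomes a 1-D "vector" holding its length
def fake_embed(batch):
    return [[float(len(t))] for t in batch]

vectors = embed_in_batches(["a", "bb", "ccc"], fake_embed, batch_size=2)
print(vectors)
```

In the notebook, `embed_fn` would be `get_embeddings`; the batch size of 100 is an arbitrary default, not a documented limit.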

### Apply Semantic Chunking

Chunk by meaning, not arbitrary size. Keeps related sentences together.

In [9]:
# Apply semantic chunking to the corpus
chunks = chunk_by_semantic_similarity(corpus, similarity_threshold=0.5)

print(f"Total semantic chunks: {len(chunks)}")
for i, chunk in enumerate(chunks[:3], 1):
    print(f"\n--- Chunk {i} ({len(chunk)} chars) ---")
    print(chunk)

Total semantic chunks: 16

--- Chunk 1 (151 chars) ---
Dr. Zephyr Blackwood invented the Quantum Flux Capacitor in 2019, a device that can predict market crashes by measuring temporal graviton fluctuations.

--- Chunk 2 (57 chars) ---
It achieved 94.7% accuracy in the 2021 GameStop incident.

--- Chunk 3 (134 chars) ---
The Antarctican Emperor Penguin colony of Sector-7G has developed a unique communication protocol using 847 distinct flipper gestures.
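
The run above reports 16 chunks for a corpus of 16 sentences, which suggests that no adjacent pair reached the 0.5 threshold and every sentence became its own chunk. One way to pick a threshold is to look at the adjacent-sentence similarities first. The sketch below uses small hand-made vectors in place of real embeddings (an assumption, so it runs without an API key); `adjacent_similarities` and `group_by_threshold` are hypothetical helper names.

```python
import numpy as np

def adjacent_similarities(vectors):
    """Cosine similarity between each vector and the one that follows it."""
    v = np.asarray(vectors, dtype=float)
    v = v / np.linalg.norm(v, axis=1, keepdims=True)
    return np.sum(v[:-1] * v[1:], axis=1)

def group_by_threshold(sentences, sims, threshold):
    """Start a new chunk whenever similarity to the previous sentence falls below threshold."""
    chunks = [[sentences[0]]]
    for sentence, sim in zip(sentences[1:], sims):
        if sim >= threshold:
            chunks[-1].append(sentence)
        else:
            chunks.append([sentence])
    return [" ".join(c) for c in chunks]

sentences = ["A1.", "A2.", "B1.", "B2."]
# Made-up 2-D "embeddings": the A sentences point one way, the B sentences another
toy_vectors = [[1.0, 0.0], [0.9, 0.44], [0.0, 1.0], [0.3, 0.95]]

sims = adjacent_similarities(toy_vectors)
print(np.round(sims, 2))
print(group_by_threshold(sentences, sims, threshold=0.80))
print(group_by_threshold(sentences, sims, threshold=0.99))
```

With the lower threshold the two A sentences and the two B sentences merge into one chunk each; with the higher one every sentence stays separate, which mirrors the behaviour seen on the real corpus.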


In [10]:
# Generate embeddings for semantic chunks
chunk_embeddings = get_embeddings(chunks)
chunk_embeddings = np.array(chunk_embeddings)

print(f"Embedding shape: {chunk_embeddings.shape}")
print(f"Ready for retrieval!")

Embedding shape: (16, 1536)
Ready for retrieval!


## 4. Retrieval: Find Similar Chunks

Use cosine similarity between query and chunk embeddings.

In [11]:
def retrieve_chunks(query, chunk_embeddings, chunks, top_k=3):
    """Retrieve top-k most similar chunks"""
    query_embedding = np.array(get_embeddings([query]))
    
    # Calculate cosine similarity
    similarities = cosine_similarity(query_embedding, chunk_embeddings)[0]
    
    # Get top-k indices
    top_indices = np.argsort(similarities)[::-1][:top_k]
    
    results = [(chunks[i], similarities[i]) for i in top_indices]
    return results

# Test retrieval with obscure knowledge
query = "Tell me about the Quantum Flux Capacitor"
retrieved = retrieve_chunks(query, chunk_embeddings, chunks, top_k=3)

print(f"Query: {query}\n")
for i, (chunk, score) in enumerate(retrieved, 1):
    print(f"{i}. [Score: {score:.4f}] {chunk}\n")

Query: Tell me about the Quantum Flux Capacitor

1. [Score: 0.6353] Dr. Zephyr Blackwood invented the Quantum Flux Capacitor in 2019, a device that can predict market crashes by measuring temporal graviton fluctuations.

2. [Score: 0.2489] In 1987, a small village in rural Tasmania named Whimblebrook accidentally created the world's first AI when their library's card catalog system gained sentience after being struck by lightning.

3. [Score: 0.2165] The lost programming language 'Flibberscript' was developed in 1974 by Estonian mathematician Peeter Järvik.
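
`retrieve_chunks` relies on scikit-learn's `cosine_similarity`. As a check on what that score means, the sketch below computes the same quantity with plain NumPy and ranks a few toy vectors. The vectors and the helper names `cosine_scores` and `top_k_indices` are assumptions for illustration.

```python
import numpy as np

def cosine_scores(query_vec, matrix):
    """Cosine similarity between one query vector and each row of matrix."""
    q = np.asarray(query_vec, dtype=float)
    m = np.asarray(matrix, dtype=float)
    return (m @ q) / (np.linalg.norm(m, axis=1) * np.linalg.norm(q))

def top_k_indices(scores, k):
    """Indices of the k highest scores, best first."""
    return np.argsort(scores)[::-1][:k]

# Toy 2-D "chunk embeddings": same direction as the query, orthogonal, and 45 degrees off
toy_chunks = np.array([[1.0, 0.0], [0.0, 2.0], [3.0, 3.0]])
toy_query = [2.0, 0.0]

scores = cosine_scores(toy_query, toy_chunks)
order = top_k_indices(scores, k=2)
print(scores, order)
```

Only direction matters: the first toy chunk is half the length of the query yet scores 1.0, while the orthogonal chunk scores 0.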



## 5. Problem: Top Results Aren't Always Best

**Solution:** Rerank with an LLM that scores actual relevance.

Embedding similarity captures topical closeness, but it can miss whether a chunk actually answers the question.

In [12]:
def rerank_with_llm(query, chunks, top_k=2):
    """Rerank chunks using LLM to evaluate relevance"""
    scores = []
    
    for chunk in chunks:
        prompt = f"""Rate the relevance of this text to the query on a scale of 0-10.
Query: {query}
Text: {chunk}
Only respond with a number between 0 and 10."""
        
        response = openai.chat.completions.create(
            model="gpt-3.5-turbo",
            messages=[{"role": "user", "content": prompt}],
            temperature=0,
            max_tokens=5
        )
        
        try:
            score = float(response.choices[0].message.content.strip())
        except (AttributeError, ValueError):
            # Empty reply (content is None) or non-numeric text such as "8/10"
            score = 0
        scores.append(score)
    
    # Sort by score
    ranked = sorted(zip(chunks, scores), key=lambda x: x[1], reverse=True)
    return ranked[:top_k]

# Rerank top 3 results
chunks_to_rerank = [chunk for chunk, _ in retrieved]
reranked = rerank_with_llm(query, chunks_to_rerank, top_k=2)

print("Reranked results:")
for i, (chunk, score) in enumerate(reranked, 1):
    print(f"{i}. [Rerank Score: {score}] {chunk}\n")

Reranked results:
1. [Rerank Score: 8.0] Dr. Zephyr Blackwood invented the Quantum Flux Capacitor in 2019, a device that can predict market crashes by measuring temporal graviton fluctuations.

2. [Rerank Score: 0.0] In 1987, a small village in rural Tasmania named Whimblebrook accidentally created the world's first AI when their library's card catalog system gained sentience after being struck by lightning.
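
`rerank_with_llm` converts the model's reply with `float(...)`, so a reply such as "8/10" or "Score: 7" falls back to 0. A more forgiving parser is sketched below; `parse_score` is a hypothetical helper, and the choice to take the first number in the reply and clamp it to the 0-10 range is an assumption, not something the notebook prescribes.

```python
import re

def parse_score(reply, low=0.0, high=10.0, default=0.0):
    """Extract the first unsigned number from an LLM reply and clamp it to [low, high]."""
    if not reply:
        return default
    match = re.search(r"\d+(?:\.\d+)?", reply)
    if match is None:
        return default
    return min(max(float(match.group()), low), high)

for reply in ["8", "8/10", "Score: 7.5", "N/A", None, "15"]:
    print(repr(reply), "->", parse_score(reply))
```

The pattern ignores minus signs, which is acceptable here because the prompt asks for a value between 0 and 10.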



## 6. Generation: The Payoff

Now give the LLM the context it needs to answer accurately.

In [13]:
def generate_answer(query, context_chunks):
    """Generate answer using retrieved context"""
    context = "\n\n".join([chunk for chunk, _ in context_chunks])
    
    prompt = f"""Answer the question based on the context below. If the answer cannot be found, say so.

Context:
{context}

Question: {query}

Answer:"""
    
    response = openai.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.3,
        max_tokens=200
    )
    
    return response.choices[0].message.content

# Generate answer
answer = generate_answer(query, reranked)
print(f"Query: {query}\n")
print(f"Answer: {answer}")

Query: Tell me about the Quantum Flux Capacitor

Answer: The Quantum Flux Capacitor was invented by Dr. Zephyr Blackwood in 2019.
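
`generate_answer` joins the chunks into one block of context. A possible variation is to number each chunk so the model can be asked to cite which source it used. The sketch below only builds the prompt string; `build_prompt` and the citation wording are assumptions, and nothing here calls the API.

```python
def build_prompt(query, context_chunks):
    """Format (chunk, score) pairs as numbered sources followed by the question."""
    sources = "\n".join(
        f"[{i}] {chunk}" for i, (chunk, _score) in enumerate(context_chunks, 1)
    )
    return (
        "Answer the question using only the numbered sources below. "
        "Cite sources like [1]. If the answer cannot be found, say so.\n\n"
        f"Sources:\n{sources}\n\n"
        f"Question: {query}\n\nAnswer:"
    )

prompt = build_prompt(
    "Who invented the Quantum Flux Capacitor?",
    [("Dr. Zephyr Blackwood invented the Quantum Flux Capacitor in 2019.", 8.0)],
)
print(prompt)
```

The function accepts the same `(chunk, score)` pairs that `rerank_with_llm` returns, so it could stand in for the prompt construction inside `generate_answer`.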


### Inspect Chunks

In [14]:
# Find chunks containing key entities
keywords = ["Flibberscript", "Gerald", "Quantum Flux"]

for keyword in keywords:
    print(f"\nSearching: '{keyword}'")
    found = False
    for i, chunk in enumerate(chunks, 1):
        if keyword.lower() in chunk.lower():
            print(f"   Found in Chunk {i}: {chunk[:120]}...")
            found = True
    if not found:
        print(f"   Not found!")


Searching: 'Flibberscript'
   Found in Chunk 7: The lost programming language 'Flibberscript' was developed in 1974 by Estonian mathematician Peeter Järvik....

Searching: 'Gerald'
   Found in Chunk 6: It was named Gerald and served the community until 2003....

Searching: 'Quantum Flux'
   Found in Chunk 1: Dr. Zephyr Blackwood invented the Quantum Flux Capacitor in 2019, a device that can predict market crashes by measuring ...


### The Proof: Before vs After

Direct comparison showing RAG solving the original problem.

In [15]:
def ask_without_rag(query):
    """Ask LLM directly without any context"""
    response = openai.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": query}],
        temperature=0.3,
        max_tokens=150
    )
    return response.choices[0].message.content

def ask_with_rag(query, chunks, embeddings, show_context=False):
    """Ask with RAG pipeline"""
    retrieved = retrieve_chunks(query, embeddings, chunks, top_k=5)  # Get more candidates
    chunks_to_rerank = [chunk for chunk, _ in retrieved]
    reranked = rerank_with_llm(query, chunks_to_rerank, top_k=3)  # Keep top 3
    
    if show_context:
        print(f"\n   Retrieved Context:")
        for i, (chunk, score) in enumerate(reranked, 1):
            print(f"      {i}. [{score}] {chunk[:100]}...")
    
    answer = generate_answer(query, reranked)
    return answer

# Test with obscure questions
test_questions = [
    "What is Flibberscript and who created it?",
    "Tell me about Gerald from Tasmania.",
    "What did the Quantum Flux Capacitor predict?"
]

for question in test_questions:
    print(f"\n{'='*70}")
    print(f"QUESTION: {question}")
    print(f"{'='*70}")
    
    # Without RAG
    print(f"\nWITHOUT RAG:")
    without_rag = ask_without_rag(question)
    print(f"   {without_rag}")
    
    # With RAG
    print(f"\nWITH RAG:")
    with_rag = ask_with_rag(question, chunks, chunk_embeddings, show_context=True)
    print(f"\n   Answer: {with_rag}")


QUESTION: What is Flibberscript and who created it?

WITHOUT RAG:
   Flibberscript is a fictional language created by author Roald Dahl for his book "The BFG" (The Big Friendly Giant). In the book, the BFG speaks Flibberscript, a language that is a combination of English and gibberish. Dahl created Flibberscript to give the BFG a unique way of speaking that reflects his quirky and whimsical personality.

WITH RAG:

   Retrieved Context:
      1. [10.0] The lost programming language 'Flibberscript' was developed in 1974 by Estonian mathematician Peeter...
      2. [2.0] Its unique feature was that all variables had to rhyme with each other, making it simultaneously the...
      3. [1.0] The Antarctican Emperor Penguin colony of Sector-7G has developed a unique communication protocol us...

   Answer: Flibberscript is a lost programming language developed in 1974 by Estonian mathematician Peeter Järvik.

QUESTION: Tell me about Gerald from Tasmania.

WITHOUT RAG:
   I'm sorry, but I do 

### Why It Works

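**Without RAG**: The model has no knowledge of these facts, so it tends to hallucinate, as in the Flibberscript answer above.

**With RAG**: The retrieved context supplies the facts, so the answer is grounded in the documents.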

Each component we built solved a specific problem in the pipeline.

## 7. Complete RAG Pipeline

In [16]:
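class RAGSystem:
    """Bundle retrieval, optional reranking, and generation behind one query() call."""
    def __init__(self, documents):
        # `documents` is the list of text units to search (the semantic chunks in this notebook)
        self.documents = documents
        self.embeddings = np.array(get_embeddings(documents))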
    
    def query(self, question, top_k=3, rerank=True, rerank_k=2):
        """Complete RAG pipeline"""
        # Retrieve
        retrieved = retrieve_chunks(question, self.embeddings, self.documents, top_k)
        
        # Rerank (optional)
        if rerank:
            chunks_to_rerank = [chunk for chunk, _ in retrieved]
            context = rerank_with_llm(question, chunks_to_rerank, rerank_k)
        else:
            context = retrieved[:rerank_k]
        
        # Generate
        answer = generate_answer(question, context)
        
        return {
            "answer": answer,
            "retrieved": retrieved,
            "reranked": context if rerank else None
        }

# Initialize RAG system
rag = RAGSystem(chunks)

# Test queries with our obscure knowledge base
test_queries = [
    "What is Flibberscript?",
    "Who is Gerald?",
    "What can platypuses do with their bills?"
]

for q in test_queries:
    result = rag.query(q, rerank=True)
    print(f"\n{'='*60}")
    print(f"Q: {q}")
    print(f"A: {result['answer']}")


Q: What is Flibberscript?
A: Flibberscript is a lost programming language developed in 1974 by Estonian mathematician Peeter Järvik.

Q: Who is Gerald?
A: Gerald is not a person, it is likely an object or entity that served the community until 2003.

Q: What can platypuses do with their bills?
A: Platypuses can count to 17 in base-11 using bioluminescent signals from their bills.
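
The answer to "Who is Gerald?" suggests that the chunk "It was named Gerald..." reached the model without the sentence before it, which is where the card catalog is introduced. One possible remedy is to pull in neighbouring chunks at retrieval time. The sketch below assumes a hypothetical helper `expand_with_neighbors` and uses shortened paraphrases of the notebook's sentences as toy data.

```python
def expand_with_neighbors(hit_indices, chunks, before=1, after=0):
    """Merge each hit with up to `before` preceding and `after` following chunks."""
    expanded = []
    seen = set()
    for idx in hit_indices:
        start = max(0, idx - before)
        end = min(len(chunks), idx + after + 1)
        if (start, end) in seen:
            continue  # skip spans that were already added
        seen.add((start, end))
        expanded.append(" ".join(chunks[start:end]))
    return expanded

toy_chunks = [
    "Whimblebrook's card catalog gained sentience in 1987.",
    "It was named Gerald and served the community until 2003.",
    "Flibberscript was developed in 1974.",
]
print(expand_with_neighbors([1], toy_chunks))
```

Because the corpus is a concatenation of all documents, a neighbour can belong to a different document; tracking document boundaries would be needed to avoid that, and this sketch does not attempt it.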


## 8. Extensions: When Basic RAG Isn't Enough

### Hybrid Search

**Problem:** Semantic search misses exact keyword matches.

**Solution:** Combine semantic + keyword scoring.

In [17]:
def hybrid_search(query, chunks, embeddings, alpha=0.5, top_k=3):
    """Combine keyword and semantic search"""
    # Semantic similarity
    query_emb = np.array(get_embeddings([query]))
    semantic_scores = cosine_similarity(query_emb, embeddings)[0]
    
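    # Keyword matching: fraction of query words that also appear in the chunk.
    # Tokens are split on whitespace, so "gestures." does not match "gestures".
    query_words = set(query.lower().split())
    keyword_scores = np.array([
        len(query_words & set(chunk.lower().split())) / max(len(query_words), 1)
        for chunk in chunks
    ])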
    
    # Normalize scores
    semantic_scores = (semantic_scores - semantic_scores.min()) / (semantic_scores.max() - semantic_scores.min() + 1e-10)
    keyword_scores = (keyword_scores - keyword_scores.min()) / (keyword_scores.max() - keyword_scores.min() + 1e-10)
    
    # Combine
    hybrid_scores = alpha * semantic_scores + (1 - alpha) * keyword_scores
    top_indices = np.argsort(hybrid_scores)[::-1][:top_k]
    
    return [(chunks[i], hybrid_scores[i]) for i in top_indices]

# Test hybrid search with obscure knowledge
query = "penguin communication flipper gestures"
hybrid_results = hybrid_search(query, chunks, chunk_embeddings, alpha=0.7, top_k=3)

print(f"Hybrid Search Results for: '{query}'\n")
for i, (chunk, score) in enumerate(hybrid_results, 1):
    print(f"{i}. [Score: {score:.4f}] {chunk}\n")

Hybrid Search Results for: 'penguin communication flipper gestures'

1. [Score: 1.0000] The Antarctican Emperor Penguin colony of Sector-7G has developed a unique communication protocol using 847 distinct flipper gestures.

2. [Score: 0.3090] Professor Elena Vasquez discovered that platypuses can count to 17 in base-11 using bioluminescent signals from their bills.

3. [Score: 0.2531] Researcher Maria Kowalski documented this in her 2023 paper published in the Journal of Avian Cryptolinguistics.
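
The keyword score in `hybrid_search` splits on whitespace, so punctuation stays attached to words: the chunk token "gestures." does not match the query word "gestures". The sketch below compares that behaviour with a regex tokenizer. `tokenize` and `keyword_overlap` are hypothetical helpers, and the `\w+` pattern is one simple choice among many.

```python
import re

def tokenize(text):
    """Lowercase word tokens with surrounding punctuation removed."""
    return set(re.findall(r"\w+", text.lower()))

def keyword_overlap(query, chunk):
    """Fraction of query tokens that also appear in the chunk."""
    q = tokenize(query)
    if not q:
        return 0.0
    return len(q & tokenize(chunk)) / len(q)

query = "penguin communication flipper gestures"
chunk = (
    "The Antarctican Emperor Penguin colony of Sector-7G has developed a unique "
    "communication protocol using 847 distinct flipper gestures."
)

# Same computation as in hybrid_search, using whitespace tokens
whitespace_overlap = len(set(query.split()) & set(chunk.lower().split())) / len(query.split())
print(whitespace_overlap, keyword_overlap(query, chunk))
```

For ranking that also weights rare terms more heavily, a scheme such as TF-IDF or BM25 would be the usual next step, which this sketch does not implement.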



### Query Expansion

**Problem:** User query wording might not match document wording.

**Solution:** Generate alternative phrasings before retrieving.

In [18]:
def expand_query(query):
    """Generate similar queries for better retrieval"""
    prompt = f"""Generate 2 alternative phrasings of this question:
{query}

Provide only the questions, one per line."""
    
    response = openai.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.7,
        max_tokens=100
    )
    
    expanded = [query] + response.choices[0].message.content.strip().split('\n')
    return [q.strip() for q in expanded if q.strip()]

# Test query expansion
original_query = "What is the Chronicle Compiler from Bhutan?"
expanded_queries = expand_query(original_query)

print("Query Expansion:")
for i, q in enumerate(expanded_queries, 1):
    print(f"{i}. {q}")

Query Expansion:
1. What is the Chronicle Compiler from Bhutan?
2. - What exactly is the Chronicle Compiler from Bhutan?
3. - Can you explain the purpose of the Chronicle Compiler from Bhutan?
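
Two things stand out in the output above: the model's list markers ("- ") end up inside the expanded queries, and the expanded queries are printed but never passed to retrieval. The sketch below addresses both under stated assumptions: `clean_expansions` strips common bullet or numbering prefixes, and `merge_results` combines several ranked lists by keeping each chunk's best score. Both names are hypothetical, and max-score merging is only one of several reasonable fusion rules.

```python
import re

def clean_expansions(original, raw_reply):
    """Split an LLM reply into queries, dropping list markers and duplicates."""
    queries = [original]
    for line in raw_reply.splitlines():
        # Remove a leading bullet ("-", "*", "•") or numbering ("1.", "2)")
        cleaned = re.sub(r"^\s*(?:[-*•]|\d+[.)])\s*", "", line).strip()
        if cleaned and cleaned not in queries:
            queries.append(cleaned)
    return queries

def merge_results(result_lists, top_k=3):
    """Merge several [(chunk, score), ...] lists, keeping each chunk's best score."""
    best = {}
    for results in result_lists:
        for chunk, score in results:
            best[chunk] = max(score, best.get(chunk, float("-inf")))
    return sorted(best.items(), key=lambda item: item[1], reverse=True)[:top_k]

raw = (
    "- What exactly is the Chronicle Compiler from Bhutan?\n"
    "- Can you explain the purpose of the Chronicle Compiler from Bhutan?"
)
queries = clean_expansions("What is the Chronicle Compiler from Bhutan?", raw)
print(queries)

# Toy ranked lists standing in for one retrieve_chunks call per query
merged = merge_results([[("a", 0.5), ("b", 0.2)], [("b", 0.6), ("c", 0.1)]], top_k=2)
print(merged)
```

In the notebook, each list passed to `merge_results` would come from calling `retrieve_chunks` once per expanded query.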


## Summary

**Components:**
1. Chunking: Token, sentence, semantic
2. Embeddings: OpenAI vectors
3. Retrieval: Cosine similarity
4. Reranking: LLM scoring
5. Generation: Context-aware answers
6. Hybrid Search: Keyword + semantic
7. Query Expansion: Better coverage