# Level 2: Quick and Dirty RAG Implementation (Python)

This notebook walks through a quick-and-dirty Retrieval-Augmented Generation (RAG) pipeline for PDF documents. It favors clarity and simplicity over best practices or optimization.

---

## Core Steps
1. Extract text from a PDF (simulated here for simplicity)
2. Split text into chunks
3. Generate fake embeddings (pseudo-random vectors seeded from the text, so they carry no semantic meaning)
4. Store chunks in a simple list (instead of a real vector DB)
5. Find the most relevant chunks for a question (using cosine similarity)
6. Generate an answer by concatenating the most relevant chunks (no LLM call)
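
Step 5 ranks chunks by cosine similarity, which is the dot product of two vectors divided by the product of their lengths. As a minimal sketch (the helper name `cosine` and the sample vectors are illustrative and not part of the notebook), it can be computed directly with NumPy:

```python
import numpy as np

def cosine(a, b):
    """Cosine of the angle between two 1-D vectors (hypothetical helper)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    # Assumption of this sketch: a zero vector is treated as similar to nothing.
    return float(a @ b / denom) if denom else 0.0

same_direction = cosine([1, 2, 3], [2, 4, 6])  # parallel vectors, close to 1.0
orthogonal = cosine([1, 0], [0, 1])            # perpendicular vectors, 0.0
opposite = cosine([1, 0], [-1, 0])             # opposite directions, -1.0
```

The score depends only on direction, not magnitude, which is why scaling a vector leaves its similarity to other vectors unchanged.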

---

## Implementation

Below is a simple Python implementation. You can run and modify the code to see how it works.

In [None]:
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# 1. Simulate PDF text extraction (normally you'd use PyMuPDF or similar)
pdf_text = """
Page 1: The sun is the center of our solar system. It provides light and heat to the planets.
Page 2: The Earth orbits the sun once every 365 days. The moon orbits the Earth.
Page 3: Solar energy can be converted into electricity using solar panels.
"""

# 2. Split text into chunks (one per page for simplicity)
chunks = [
    # Drop the "Page N: " prefix; the page number is kept in the metadata instead
    {"content": line.split(": ", 1)[1], "metadata": {"page": i + 1}}
    for i, line in enumerate(pdf_text.strip().splitlines())
]

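# 3. Generate fake embeddings (random vectors for demo)
import hashlib

def fake_embedding(text):
    # Built-in hash() is salted per process for strings, so derive the seed
    # from a stable digest to get the same vector on every run.
    seed = int.from_bytes(hashlib.sha256(text.encode("utf-8")).digest()[:4], "big")
    return np.random.default_rng(seed).random(5)  # 5-dimensional vector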

for chunk in chunks:
    chunk["embedding"] = fake_embedding(chunk["content"])

# 4. Store in a simple list (no real DB)
vector_store = chunks

# 5. Find most relevant chunks for a question
def retrieve_relevant_chunks(question, vector_store, top_k=2):
    q_emb = fake_embedding(question)
    chunk_embs = np.array([c["embedding"] for c in vector_store])
    sims = cosine_similarity([q_emb], chunk_embs)[0]
    top_indices = sims.argsort()[-top_k:][::-1]
    return [vector_store[i] for i in top_indices]

# 6. Generate answer by concatenating relevant chunks
def answer_question(question, vector_store):
    relevant = retrieve_relevant_chunks(question, vector_store)
    answer = "\n".join([f"[Page {c['metadata']['page']}]: {c['content']}" for c in relevant])
    return answer

# ---
# Test Example
question = "How does the Earth move around the sun?"
print("Question:", question)
print("\nAnswer:")
print(answer_question(question, vector_store))
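
Because the embeddings above are random, the chunks returned for the test question are arbitrary. As a hedged sketch of how the same retrieval step behaves with vectors that reflect the text, the block below swaps in a bag-of-words count vector, a technique the notebook itself does not use. The names `pages`, `tokenize`, `bow_embedding`, and `cosine` are illustrative assumptions, and the page text is copied from the cell above without the `Page N:` prefixes.

```python
import re
from collections import Counter

import numpy as np

# Same three simulated pages as in the cell above (prefixes removed).
pages = [
    "The sun is the center of our solar system. It provides light and heat to the planets.",
    "The Earth orbits the sun once every 365 days. The moon orbits the Earth.",
    "Solar energy can be converted into electricity using solar panels.",
]

def tokenize(text):
    # Lowercase and keep runs of letters and digits only.
    return re.findall(r"[a-z0-9]+", text.lower())

# Vocabulary built from the pages; each word gets one vector dimension.
vocab = sorted({tok for page in pages for tok in tokenize(page)})
index = {tok: i for i, tok in enumerate(vocab)}

def bow_embedding(text):
    # Count how often each vocabulary word occurs; unknown words are ignored.
    vec = np.zeros(len(vocab))
    for tok, count in Counter(tokenize(text)).items():
        if tok in index:
            vec[index[tok]] = count
    return vec

def cosine(a, b):
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom else 0.0

question = "How does the Earth move around the sun?"
q_vec = bow_embedding(question)
scores = [cosine(q_vec, bow_embedding(page)) for page in pages]
best_page = int(np.argmax(scores)) + 1  # 1-based page number

print("Scores per page:", [round(s, 3) for s in scores])
print("Best page:", best_page)
```

With these count vectors the Earth page (page 2) scores highest for the test question, and the solar-panel page, which shares no words with it, scores 0. Raw word counts are still crude (common words such as "the" dominate), so this only illustrates the idea of text-dependent vectors.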
