# RAG General Knowledge

## Overview
This notebook shows a minimal Retrieval-Augmented Generation (RAG) workflow:
- Download short context paragraphs from a public dataset (SQuAD validation).
- Chunk and deduplicate the text.
- Compute embeddings with an Ollama embedding model (`nomic-embed-text`) and store them in PostgreSQL using `pgvector` (via `langchain_postgres.PGVector`).
- Build a retriever + LLM chain to answer questions grounded in the stored contexts.

In [16]:
import uuid
import os
import psycopg
from dotenv import load_dotenv
from datasets import load_dataset
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_core.documents import Document
from langchain_postgres import PGVector
from langchain_ollama import OllamaEmbeddings, ChatOllama
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough
from langchain_core.output_parsers import StrOutputParser

### Database Configuration And Preparation

In [None]:
load_dotenv()

CONNECTION_STRING = os.getenv("DATABASE_URL")
COLLECTION_NAME = "general_knowledge"

if not CONNECTION_STRING:
    raise RuntimeError("DATABASE_URL not set. Please set the environment variable or create a local .env file. See .env.example.")

# Upper bound on the number of paragraphs processed, which limits resource usage
MAX_PARAGRAPH_COUNT = 150000

# Used to not overload Postgres with too many parameters
DB_BATCH_SIZE = 1000

### 1. Database Cleanup
Drop the LangChain tables so that every run starts from an empty database.

In [18]:
print("Cleaning database...")
try:
    raw_conn_str = CONNECTION_STRING.replace("+psycopg", "")
    with psycopg.connect(raw_conn_str) as conn:
        with conn.cursor() as cur:
            cur.execute("DROP TABLE IF EXISTS langchain_pg_embedding CASCADE;")
            cur.execute("DROP TABLE IF EXISTS langchain_pg_collection CASCADE;")
        conn.commit()
    print(" -> Database cleared.")
except Exception as e:
    print(f" -> Cleanup skipped: {e}")

Cleaning database...
 -> Database cleared.
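
The cleanup cell removes the `+psycopg` driver suffix so that `psycopg.connect` receives a plain PostgreSQL URL. A minimal sketch of that conversion follows, assuming `DATABASE_URL` has the SQLAlchemy-style form `postgresql+psycopg://...`. The helper name `to_libpq_url` and the example URL are hypothetical. Unlike a plain `str.replace`, this version only touches the scheme, so a password that happens to contain `+psycopg` would be left alone.

```python
def to_libpq_url(sqlalchemy_url: str) -> str:
    """Drop a '+driver' suffix from the URL scheme only.

    Example: 'postgresql+psycopg://...' becomes 'postgresql://...'.
    """
    scheme, separator, rest = sqlalchemy_url.partition("://")
    # Keep everything before the first '+' in the scheme, if there is one.
    plain_scheme = scheme.split("+", 1)[0]
    return plain_scheme + separator + rest


# Hypothetical connection string, for illustration only.
example_url = "postgresql+psycopg://user:secret@localhost:5432/ragdb"
print(to_libpq_url(example_url))
```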


### 2. Ollama & PostgreSQL Setup

In [19]:
embeddings = OllamaEmbeddings(model="nomic-embed-text")

vector_store = PGVector(
    embeddings=embeddings,
    collection_name=COLLECTION_NAME,
    connection=CONNECTION_STRING,
    use_jsonb=True,
)
vector_store.create_tables_if_not_exists()

### 3. Download Dataset From Hugging Face
[SQuAD dataset on Hugging Face](https://huggingface.co/datasets/rajpurkar/squad)

In [20]:
print("Downloading SQuAD dataset...")
ds = load_dataset("squad", split="validation")

unique_contexts = set()
raw_docs = []

print(f"Selecting first {MAX_PARAGRAPH_COUNT} unique paragraphs...")

for row in ds:
    text = row['context']
    title = row['title']
    
    # Removing duplicates in Dataset
    if text not in unique_contexts:
        unique_contexts.add(text)
        
        doc = Document(
            page_content=text,
            metadata={"title": title, "source": "wikipedia"}
        )
        raw_docs.append(doc)
    
    if len(raw_docs) >= MAX_PARAGRAPH_COUNT:
        break

Downloading SQuAD dataset...
Selecting first 150000 unique paragraphs...
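
SQuAD attaches several questions to the same paragraph, so the same `context` string appears in many rows. The loop above keeps only the first occurrence of each paragraph. A minimal sketch of the same order-preserving deduplication on toy rows follows; the function name `unique_contexts` and the sample data are made up for illustration.

```python
def unique_contexts(rows, limit):
    """Return the first `limit` rows with distinct 'context' text, in input order."""
    seen = set()
    unique = []
    for row in rows:
        text = row["context"]
        if text in seen:
            continue  # same paragraph, different question: skip it
        seen.add(text)
        unique.append({"title": row["title"], "context": text})
        if len(unique) >= limit:
            break
    return unique


# Toy rows shaped like SQuAD records (hypothetical content).
rows = [
    {"title": "A", "context": "paragraph one", "question": "q1"},
    {"title": "A", "context": "paragraph one", "question": "q2"},
    {"title": "B", "context": "paragraph two", "question": "q3"},
]
print(unique_contexts(rows, limit=10))
```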


### 4. Chunk Data

This cell splits the collected raw documents into overlapping chunks of at most 1000 characters, ready for embedding and insertion into the vector store.

Purpose: keep each passage short enough to embed well, while the 100-character overlap preserves context across chunk boundaries so retrieved chunks remain coherent.

In [21]:
print(f"Splitting {len(raw_docs)} raw documents...")

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=100
)

split_docs = text_splitter.split_documents(raw_docs)
print(f" -> Generated {len(split_docs)} chunks to embed.")

Splitting 2067 raw documents...
 -> Generated 2472 chunks to embed.
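
To make `chunk_size` and `chunk_overlap` concrete, here is a simplified sliding-window chunker. It is not the algorithm `RecursiveCharacterTextSplitter` uses (that splitter prefers to break on separators such as paragraph and line breaks); it only illustrates how consecutive chunks share `overlap` characters. The function name `sliding_window_chunks` is hypothetical.

```python
def sliding_window_chunks(text, chunk_size, overlap):
    """Cut `text` into windows of `chunk_size` characters that share `overlap` characters."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


# Each chunk starts with the last character of the previous one.
print(sliding_window_chunks("abcdefghij", chunk_size=4, overlap=1))
```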


### 5. Ingestion

Each chunk gets its own UUID, is embedded with the Ollama embedding model, and is inserted into the vector store. Inserts run in batches of `DB_BATCH_SIZE` so that a single statement does not carry too many parameters for Postgres.

In [22]:
# Generate UUIDs for every single chunk
doc_ids = [str(uuid.uuid4()) for _ in split_docs]

print("Embedding and inserting...")
total_inserted = 0

# Loop in batches to support larger datasets
for i in range(0, len(split_docs), DB_BATCH_SIZE):
    batch_end = i + DB_BATCH_SIZE
    
    batch_docs = split_docs[i:batch_end]
    batch_ids = doc_ids[i:batch_end]
    
    try:
        vector_store.add_documents(documents=batch_docs, ids=batch_ids)
        total_inserted += len(batch_docs)
        print(f" -> Inserted batch {i} to {batch_end} (Total: {total_inserted})")
    except Exception as e:
        print(f"❌ Error on batch {i}: {e}")

print("✅ Successfully completed ingestion.")

Embedding and inserting...
 -> Inserted batch 0 to 1000 (Total: 1000)
 -> Inserted batch 1000 to 2000 (Total: 2000)
 -> Inserted batch 2000 to 3000 (Total: 2472)
✅ Successfully completed ingestion.
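
The last log line reports "2000 to 3000" because `batch_end` is not clamped to the list length; the slice itself is safe, so only the label is off. A small sketch of batch boundaries with a clamped end follows, using a hypothetical helper named `batch_ranges`.

```python
def batch_ranges(total, batch_size):
    """Yield (start, end) index pairs that cover range(total) in steps of batch_size."""
    for start in range(0, total, batch_size):
        # Clamp the end so the final, shorter batch is reported accurately.
        yield start, min(start + batch_size, total)


for start, end in batch_ranges(2472, 1000):
    print(f"batch {start} to {end} ({end - start} items)")
```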


### Set Up Retriever And LLM
The retriever returns the 10 most similar chunks for each question. A low temperature makes the model's output more deterministic, which suits answers that should stay close to the retrieved context.

In [23]:
retriever = vector_store.as_retriever(search_kwargs={"k": 10})

llm = ChatOllama(
    model="qwen2.5:32b",
    temperature=0.1,    # Keep low for factual answers
    num_ctx=16384,      # Increase context window to fit all retrieved docs
)
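
Conceptually, the retriever embeds the question and returns the `k` stored chunks whose vectors are closest to it. The sketch below does this in pure Python with cosine similarity on toy two-dimensional vectors. It assumes cosine distance is the metric in use; in the notebook the ranking happens inside Postgres via `pgvector`, not in Python. The names `cosine_similarity` and `top_k` and the sample vectors are made up for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec, indexed, k):
    """Return the labels of the k entries most similar to query_vec."""
    ranked = sorted(
        indexed,
        key=lambda item: cosine_similarity(query_vec, item[1]),
        reverse=True,
    )
    return [label for label, _ in ranked[:k]]


# Toy "embeddings" (hypothetical values, two dimensions for readability).
indexed = [
    ("moon", [1.0, 0.0]),
    ("rocket", [0.8, 0.6]),
    ("cooking", [0.0, 1.0]),
]
print(top_k([1.0, 0.1], indexed, k=2))
```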


### Define The Prompt

`context` receives the chunks retrieved from Postgres; `question` holds the user's question.

In [24]:
template = """
You are an expert research assistant. 
Answer the question based ONLY on the following context.
If the context contains conflicting information, note the conflict.
If the answer is not in the context, state that you do not know.

Context:
{context}

Question: 
{question}
"""

prompt = ChatPromptTemplate.from_template(template)
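
To see what the model receives, the placeholders can be filled by hand. This sketch uses plain `str.format` as a stand-in for the prompt template, with a shortened template and made-up context and question, so it only approximates the text that ends up in the chat message.

```python
# Shortened version of the notebook's template, for illustration only.
demo_template = """
Answer the question based ONLY on the following context.

Context:
{context}

Question:
{question}
"""

# Hypothetical retrieved context and question.
filled = demo_template.format(
    context="[Source: Example_Title]\nExample paragraph text.",
    question="What does the example paragraph say?",
)
print(filled)
```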

### Build The Chain

In [25]:
def format_docs(docs):
    return "\n\n".join([f"[Source: {d.metadata.get('title', 'Unknown')}]\n{d.page_content}" for d in docs])

rag_chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)
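
The dictionary at the head of the chain sends the same input to two places: the retriever (whose results go through `format_docs`) and a passthrough that keeps the question unchanged. The plain-Python sketch below mirrors that data flow with stand-ins. `FakeDoc`, `fake_retriever`, `fake_llm`, and `run_chain` are hypothetical and do not call LangChain, Postgres, or Ollama.

```python
from dataclasses import dataclass, field


@dataclass
class FakeDoc:
    """Stand-in for a retrieved document: text plus metadata."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def format_docs(docs):
    # Same formatting as the notebook: a source label, then the chunk text.
    return "\n\n".join(
        f"[Source: {d.metadata.get('title', 'Unknown')}]\n{d.page_content}" for d in docs
    )


def fake_retriever(question):
    # A real retriever would run a similarity search; this returns fixed documents.
    return [FakeDoc("First chunk.", {"title": "T1"}), FakeDoc("Second chunk.")]


def fake_llm(prompt_text):
    # A real model would generate an answer; this only reports the prompt size.
    return f"(answer generated from a prompt of {len(prompt_text)} characters)"


def run_chain(question):
    # Step 1: fan the question out to the retriever and to a passthrough.
    inputs = {"context": format_docs(fake_retriever(question)), "question": question}
    # Step 2: fill the prompt. Step 3: call the model and return its text.
    prompt_text = "Context:\n{context}\n\nQuestion:\n{question}".format(**inputs)
    return fake_llm(prompt_text)


print(run_chain("What is in the chunks?"))
```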


### Run Query

In [26]:
question = "Why was the Apollo Space Program significant?"

print(f"Querying LLM: '{question}'")

print("\n--- ANSWER ---")

response = rag_chain.invoke(question)

print(response)

Querying LLM: 'Why was the Apollo Space Program significant?'

--- ANSWER ---
The Apollo Space Program was significant for several reasons:

- It achieved major human spaceflight milestones, including being the first to send manned missions beyond low Earth orbit.
- The program marked the first time humans landed on another celestial body with Apollo 8 and completed six Moon landings by the end of Apollo 17.
- It returned a substantial amount (842 pounds or 382 kg) of lunar rocks and soil, which greatly contributed to scientific understanding about the Moon's composition and geological history.
- The program laid foundational capabilities for NASA’s future human spaceflight endeavors and funded critical infrastructure like its Johnson Space Center and Kennedy Space Center.
- It spurred technological advancements in various fields such as avionics, telecommunications, and computers.

These achievements not only advanced space exploration but also had significant impacts on technology an