Purpose of this notebook: 

- Manual loading of PDFs
- Manual chunking
- Manual embedding with OllamaEmbeddings
- Manual creation of the Chroma vector store
- Manual retrieval
- Manual prompt formatting
- Manual LLM call

This notebook is meant for debugging, exploring, and understanding each RAG component separately.
"Manual" means explicitly executing each stage of the RAG pipeline in its own cell: calling the PDF loader, the text splitter, the embedding setup, and so on.

Advantage: I can experiment with different chunk sizes, PDF quality, or prompt formats before automating them in the full pipeline.

In [1]:
# Cell 0 – Fix sys.path so we can import modules from app/
import sys
import os

# This notebook lives in rag_pipeline_project/app/notebooks/, so (assuming the
# kernel's working directory is that folder) two levels up is rag_pipeline_project/.
project_root = os.path.abspath(os.path.join(os.getcwd(), "../../"))
if project_root not in sys.path:
    sys.path.insert(0, project_root)
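
The `"../../"` approach above depends on the kernel's working directory being the notebook folder. A more forgiving variant, sketched below, walks upward until it finds a directory that contains a marker folder. The helper names `find_project_root` and `add_to_sys_path` and the choice of `app` as the marker are my own assumptions, not part of the project.

```python
from pathlib import Path
import sys


def find_project_root(start: Path, marker: str = "app") -> Path:
    """Walk upward from `start` until a directory containing `marker` is found.

    Hypothetical helper: assumes the project root is the first ancestor
    (or `start` itself) that has a sub-directory named `marker`.
    """
    start = start.resolve()
    for candidate in (start, *start.parents):
        if (candidate / marker).is_dir():
            return candidate
    raise FileNotFoundError(f"No parent of {start} contains a '{marker}' directory")


def add_to_sys_path(root: Path) -> None:
    """Prepend `root` to sys.path once, so `import app.<module>` can resolve."""
    root_str = str(root)
    if root_str not in sys.path:
        sys.path.insert(0, root_str)
```

In the notebook this would be used as `add_to_sys_path(find_project_root(Path.cwd()))`, which should give the same result as Cell 0 when started from `app/notebooks/`.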


In [2]:
# Cell 1 – Core pipeline imports
from app.pdf_loader import load_pdfs_from_folder
from app.ollama_client import ask_ollama

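# Note: newer LangChain releases move these classes to separate packages
# (langchain_text_splitters, langchain_community); the paths below match
# the version this notebook was run with.
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import OllamaEmbeddings
from langchain.vectorstores import Chroma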


In [3]:
# Cell 2 – Load PDFs
SOURCE_DIR = "../../documents/sources"
PERSIST_DIR = "../../embeddings/chromadb"

documents = load_pdfs_from_folder(SOURCE_DIR)
print(f"Loaded {len(documents)} pages")

[DEBUG] Looking for PDFs in: /Users/apple/Desktop/Langchain_projects/Lanchain_rag_practical/rag_pipeline_project/documents/sources
[DEBUG] Found 1 PDFs: ['Koalitionsvertrag-–-barrierefreie-Version.pdf']
Loaded 146 pages from Koalitionsvertrag-–-barrierefreie-Version.pdf
Loaded 146 pages


In [4]:
# Cell 3 – Chunk documents
splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
chunks = splitter.split_documents(documents)
print(f"Split into {len(chunks)} chunks")
print(chunks[0].page_content)

Split into 955 chunks
Verantwortung 
für Deutschland
Koalitionsvertrag zwischen  
CDU, CSU und SPD
21. Legislaturperiode
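
To get a feel for how `chunk_size` and `chunk_overlap` interact before re-running the real splitter, here is a simplified sliding-window chunker. It is not how `RecursiveCharacterTextSplitter` works internally (that one prefers to break on separators such as paragraphs and newlines); the function name and the sample text are assumptions for illustration only.

```python
def sliding_window_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Cut `text` into windows of `chunk_size` characters.

    Consecutive windows share `chunk_overlap` characters. This is a plain
    fixed-window sketch, not the separator-aware LangChain algorithm.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must satisfy 0 <= overlap < chunk_size")

    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


# Toy comparison of chunk counts for a few sizes (placeholder text, not the PDF).
sample = "Lorem ipsum dolor sit amet. " * 100
for size in (200, 500, 1000):
    n = len(sliding_window_chunks(sample, size, size // 10))
    print(f"chunk_size={size:>4} -> {n} chunks")
```

Smaller chunks give more, narrower retrieval hits; larger chunks give fewer hits with more surrounding context, which is the trade-off I want to test on the real document.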


In [5]:
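# Cell 4 – Create Vector DB
# Every chunk is embedded through Ollama, so this cell can take a long time for 955 chunks.
embedding_model = OllamaEmbeddings(model="llama3.2:latest")
vectorstore = Chroma.from_documents(chunks, embedding=embedding_model, persist_directory=PERSIST_DIR)
vectorstore.persist()  # write the index to PERSIST_DIR (may be automatic or deprecated in newer Chroma versions)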

  embedding_model = OllamaEmbeddings(model="llama3.2:latest")


KeyboardInterrupt: 

In [None]:
# Cell 5 – Retrieve relevant chunks
retriever = vectorstore.as_retriever(search_kwargs={"k": 3})
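# German query: "What does the CDU election programme say about migration?"
sample_query = "Was steht im CDU-Wahlprogramm über Migration?"
# Newer LangChain versions prefer retriever.invoke(sample_query).
relevant_docs = retriever.get_relevant_documents(sample_query)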

print("Top 3 Relevant Chunks:\n")
for i, doc in enumerate(relevant_docs):
    print(f"--- Chunk {i+1} ---\n{doc.page_content[:300]}...\n")
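
Conceptually, the retriever embeds the query and returns the `k` chunks whose vectors are closest to it. The sketch below does this by brute force with cosine similarity on toy 2-D vectors. Chroma itself uses an approximate nearest-neighbour index, and its configured distance metric may differ, so treat this only as an illustration; `cosine_similarity` and `top_k` are hypothetical names.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec: list[float], doc_vecs: list[list[float]], k: int = 3) -> list[tuple[int, float]]:
    """Return (index, score) pairs for the k most similar document vectors."""
    scored = [(i, cosine_similarity(query_vec, vec)) for i, vec in enumerate(doc_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy vectors standing in for chunk embeddings (real ones have many more dimensions).
toy_doc_vecs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
toy_query = [1.0, 0.0]
for index, score in top_k(toy_query, toy_doc_vecs, k=2):
    print(f"doc {index}: similarity {score:.3f}")
```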


In [None]:
# Cell 6 – Build final prompt with system_prompt.json

from app.utils import load_system_prompt  # Load your motivational interviewing prompt

system_prompt = load_system_prompt()

context = "\n\n".join([doc.page_content for doc in relevant_docs])

prompt = f"""{system_prompt}

## Retrieved Context:
{context}

## User Question:
{sample_query}

## Assistant:"""  # the trailing header cues the model to write the assistant's answer

response = ask_ollama(prompt)
print("LLaMA3 Response:\n")
print(response)
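
Since prompt format is one of the things I want to vary, it may help to pull the template into a small function. This is a minimal sketch that reproduces the layout used in Cell 6; `build_prompt` is a hypothetical helper, and it assumes the system prompt is available as a plain string.

```python
def build_prompt(system_prompt: str, context_chunks: list[str], question: str) -> str:
    """Assemble the prompt in the same layout as Cell 6.

    Empty or whitespace-only chunks are dropped so they do not leave gaps
    in the context section.
    """
    context = "\n\n".join(chunk.strip() for chunk in context_chunks if chunk.strip())
    return (
        f"{system_prompt}\n\n"
        f"## Retrieved Context:\n{context}\n\n"
        f"## User Question:\n{question}\n\n"
        "## Assistant:"
    )


# Toy inputs, only to show the resulting layout.
print(build_prompt("You answer using only the context.", ["First chunk.", "Second chunk."], "What is this about?"))
```

In the notebook, the call would look like `build_prompt(system_prompt, [doc.page_content for doc in relevant_docs], sample_query)`, and different templates can then be swapped in without touching the retrieval cells.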
