Purpose of this notebook: 

- Manual loading of PDFs
- Manual chunking
- Manual embedding with OllamaEmbeddings
- Manual creation of the Chroma vector store
- Manual retrieval
- Manual prompt formatting
- Manual LLM call

This notebook is meant for debugging, exploring, and understanding each RAG component separately.
"Manual" here means explicitly executing each stage of the RAG pipeline by hand: calling the PDF loader, the text splitter, the embedding setup, and so on.

Advantage: I can experiment with different chunk sizes, PDF quality, or prompt formats before automating them in the full pipeline.

In [None]:
# Cell 0 - Fix sys.path so we can import modules from app/
import sys
import os

# This notebook runs from rag_pipeline_project/app/notebooks/
# (this assumes the kernel's working directory is that folder).
# Two levels up is the project root: rag_pipeline_project/
project_root = os.path.abspath(os.path.join(os.getcwd(), "../../"))
if project_root not in sys.path:
    sys.path.insert(0, project_root)
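
The `../../` trick above only works when the kernel's working directory is the notebooks folder. A minimal sketch of a less fragile alternative, assuming the project root is the first parent directory that contains an `app/` folder. The helper names `find_project_root` and `add_to_sys_path` are hypothetical and not part of the project:

```python
import sys
from pathlib import Path


def find_project_root(start: Path, marker: str = "app") -> Path:
    """Walk upward from `start` until a directory containing `marker/` is found."""
    start = start.resolve()
    for candidate in (start, *start.parents):
        if (candidate / marker).is_dir():
            return candidate
    raise FileNotFoundError(f"No parent of {start} contains a '{marker}/' directory")


def add_to_sys_path(root: Path) -> None:
    """Prepend `root` to sys.path once, so that `import app.<module>` resolves."""
    root_str = str(root)
    if root_str not in sys.path:
        sys.path.insert(0, root_str)


# Usage inside the notebook (assumption: the notebook lives somewhere below the root):
# add_to_sys_path(find_project_root(Path.cwd()))
```

This way the cell should keep working if the notebook is moved one folder deeper or started from a different directory inside the project.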


In [None]:
# Cell 1 - Core pipeline imports
from app.pdf_loader import load_pdfs_from_folder
from app.ollama_client import ask_ollama

from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import OllamaEmbeddings
from langchain.vectorstores import Chroma


In [None]:
# Cell 2 - Load PDFs
SOURCE_DIR = "../../documents/sources"
PERSIST_DIR = "../../embeddings/chromadb"

documents = load_pdfs_from_folder(SOURCE_DIR)
print(f"Loaded {len(documents)} pages")

[DEBUG] Looking for PDFs in: /home/crl/Desktop/Langchain-rag-practical/rag_pipeline_project/documents/sources
[DEBUG] Found 1 PDFs: ['Koalitionsvertrag-–-barrierefreie-Version.pdf']
Loaded 146 pages from Koalitionsvertrag-–-barrierefreie-Version.pdf
Loaded 146 pages
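
The `[DEBUG]` lines above suggest that the loader first resolves the folder and then lists the PDFs it finds. I am not reproducing `load_pdfs_from_folder` here. The following is only a sketch of that discovery step under my own assumptions, useful for checking what the loader should see before any parsing happens. `list_pdfs` is a hypothetical helper:

```python
from pathlib import Path


def list_pdfs(folder: str) -> list[Path]:
    """Return the PDF files directly inside `folder` (no recursion into subfolders)."""
    folder_path = Path(folder).resolve()
    if not folder_path.is_dir():
        raise NotADirectoryError(f"Not a directory: {folder_path}")
    # Compare the suffix case-insensitively so that 'report.PDF' is picked up too
    return sorted(
        p for p in folder_path.iterdir()
        if p.is_file() and p.suffix.lower() == ".pdf"
    )


# Usage in the notebook (SOURCE_DIR as defined in Cell 2):
# for pdf in list_pdfs(SOURCE_DIR):
#     print(pdf.name)
```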


In [None]:
# Cell 3 - Chunk documents
splitter = RecursiveCharacterTextSplitter(chunk_size=1200, chunk_overlap=300)
chunks = splitter.split_documents(documents)
print(f"Split into {len(chunks)} chunks")
print(chunks[0].page_content)

Split into 483 chunks
Verantwortung 
für Deutschland
Koalitionsvertrag zwischen  
CDU, CSU und SPD
21. Legislaturperiode
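
To build intuition for `chunk_size` and `chunk_overlap`, here is a simplified sliding-window chunker in plain Python. It is not what `RecursiveCharacterTextSplitter` does internally: the real splitter also tries to break on separators such as paragraphs, line breaks, and spaces, so its chunk boundaries and chunk count will differ. `chunk_text` is a hypothetical helper for illustration only:

```python
def chunk_text(text: str, chunk_size: int = 1200, chunk_overlap: int = 300) -> list[str]:
    """Split `text` into fixed-size character windows that overlap by `chunk_overlap`."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


# Small example: windows of 4 characters, each sharing 2 characters with its neighbour
print(chunk_text("abcdefghij", chunk_size=4, chunk_overlap=2))
# → ['abcd', 'cdef', 'efgh', 'ghij']
```

With the values from Cell 3, each window advances by 1200 - 300 = 900 characters, so a larger overlap means more chunks (and more embeddings to compute) for the same document.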


In [None]:
# Cell 4 - Create the vector DB with a dedicated embedding model
# Fallback if "nomic-embed-text" has issues on a different machine: "mxbai-embed-large"
embedding_model = OllamaEmbeddings(model="nomic-embed-text")
os.makedirs(PERSIST_DIR, exist_ok=True)
vectorstore = Chroma.from_documents(
    documents=chunks,
    embedding=embedding_model,
    persist_directory=PERSIST_DIR,
)



In [None]:
# Cell 5 - Retrieve relevant chunks
retriever = vectorstore.as_retriever(search_kwargs={"k": 10})
# Query in German: "What does the CDU election program say about migration?"
sample_query = "Was steht in der CDU-Wahlprogramm über Migration?"
relevant_docs = retriever.get_relevant_documents(sample_query)

print(f"Top {len(relevant_docs)} Relevant Chunks:\n")
for i, doc in enumerate(relevant_docs, start=1):
    # Show only the first 300 characters of each chunk
    print(f"--- Chunk {i} ---\n{doc.page_content[:300]}...\n")


Top 10 Relevant Chunks:

--- Chunk 1 ---
leisten, auch um sie von der gefährlichen Flucht nach Europa abzuhalten und ihnen in ihrer Heimat 
Chancen und Perspektiven zu geben. Die Kooperationsbereitschaft der Partnerländer bei den 
Bemühungen, die irreguläre Migration nach Europa zu begrenzen und eigene Staatsbürgerinnen und 
Staatsbürger z...

--- Chunk 2 ---
Osten hat längst bewiesen, dass Transformation gelingen kann. Darauf wollen wir aufbauen. Wer 
schon einmal Transformation gemeistert hat, kann auch Zukunft gestalten.
Koalitionen aus CDU, CSU und SPD waren immer dann stark, wenn wir uns große Antworten zugetraut 
haben. Das ist auch jetzt unser Ans...

--- Chunk 3 ---
Westbalkan-Regelung begrenzen
Reguläre Migration nach Deutschland im Rahmen der sogenannten Westbalkan-Regelung werden wir 
auf 25.000 Personen pro Jahr begrenzen.
Begrenzung der Migration
Zurückweisung an den Staatsgrenzen
Wir werden in Abstimmung mit unseren europäischen Nachbarn Zurückweisungen a...

--- Chunk 4 
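
What the retriever does conceptually: embed the query, score every stored chunk vector against it, and return the `k` best. The sketch below uses cosine similarity on toy vectors to show that idea. Which distance metric the Chroma collection actually uses depends on its configuration, which I have not checked here, so treat the metric as an assumption. `cosine_similarity` and `top_k` are hypothetical helpers:

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec: list[float], doc_vecs: list[list[float]], k: int = 10) -> list[tuple[int, float]]:
    """Return (index, score) pairs of the k vectors most similar to the query."""
    scored = [(cosine_similarity(query_vec, vec), idx) for idx, vec in enumerate(doc_vecs)]
    scored.sort(key=lambda pair: pair[0], reverse=True)  # highest similarity first
    return [(idx, score) for score, idx in scored[:k]]


# Toy example with 2-dimensional "embeddings"
toy_docs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
print(top_k([1.0, 0.0], toy_docs, k=2))
```

In the toy example the first vector matches the query exactly and the third shares one direction with it, so those two are returned, in that order. This also explains results like Chunk 2 above: a chunk can rank highly because it is close in embedding space, even if it is not about the topic of the question.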



In [7]:
# Cell 6 - Build final prompt with system_prompt.json

from app.utils import load_system_prompt  # Load motivational interviewing prompt

system_prompt = load_system_prompt()

context = "\n\n".join([doc.page_content for doc in relevant_docs])

prompt = f"""{system_prompt}

## Retrieved Context:
{context}

## User Question:
{sample_query}

## Assistant:"""  # the trailing header cues the model to start its answer

response = ask_ollama(prompt)
print("Llama3:70b Response:\n")
print(response)


Llama3:70b Response:

Let's take a closer look at the CDU election program regarding migration. It seems that they're focusing on regulating and steering migration, while also promoting integration and reducing irregular migration. They plan to limit regular migration from certain countries, like those in the West Balkans, to 25,000 people per year.

Additionally, they want to strengthen border controls and work with European partners to reduce irregular migration. They also aim to expand the list of safe countries of origin and make it easier for migrants who are already in Germany to integrate into society.

It's interesting that they're highlighting the importance of integrating migrants from the start, including providing language courses, vocational training, and other forms of support. They also want to create a binding integration agreement that outlines rights and responsibilities for both migrants and the German government.

What are your thoughts on this approach? Do you thin
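
Ten chunks of up to 1200 characters each make for a long prompt. A minimal sketch of the same prompt layout as in Cell 6, wrapped in a function with a character budget for the retrieved context. The function `build_prompt` and the parameter `max_context_chars` are my own assumptions for experimenting with prompt formats and are not part of the project:

```python
def build_prompt(
    system_prompt: str,
    contexts: list[str],
    question: str,
    max_context_chars: int = 8000,
) -> str:
    """Assemble the prompt, keeping whole chunks in rank order until the budget is used up."""
    kept = []
    used = 0
    for text in contexts:
        if used + len(text) > max_context_chars:
            break  # adding this chunk would exceed the budget; drop it and the rest
        kept.append(text)
        used += len(text)

    context = "\n\n".join(kept)
    return (
        f"{system_prompt}\n\n"
        f"## Retrieved Context:\n{context}\n\n"
        f"## User Question:\n{question}\n\n"
        "## Assistant:"
    )


# Usage in the notebook (names from Cell 5 and Cell 6):
# prompt = build_prompt(system_prompt, [d.page_content for d in relevant_docs], sample_query)
```

Because chunks are kept in retrieval order, lowering the budget drops the lowest-ranked chunks first, which makes it easy to compare answers with more or less context.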