### Initialization (chapter 7 - QA across documents)

In [1]:
"""
Chapter 7 - Indexing pipeline (Q&A across documents)
- Load many file types from a folder (docx/pdf/txt)
- Split -> Embed -> Store in Chroma
- OpenRouter-compatible env (fail-fast)
"""

'\nChapter 7 - Indexing pipeline (Q&A across documents)\n- Load many file types from a folder (docx/pdf/txt)\n- Split -> Embed -> Store in Chroma\n- OpenRouter-compatible env (fail-fast)\n'

In [2]:
# =============================================================================
# IMPORTS
# =============================================================================
import os
from dotenv import load_dotenv

from langchain_community.document_loaders import Docx2txtLoader, PyPDFLoader, TextLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings
from langchain_chroma import Chroma

In [3]:
# =============================================================================
# ENV SETUP (fail-fast)
# =============================================================================
load_dotenv()

OPENROUTER_API_KEY = os.getenv("OPENROUTER_API_KEY")
OPENROUTER_BASE_URL = os.getenv("OPENROUTER_BASE_URL")  # e.g. https://openrouter.ai/api/v1
if not OPENROUTER_API_KEY:
    raise RuntimeError("Missing OPENROUTER_API_KEY in .env")
if not OPENROUTER_BASE_URL:
    raise RuntimeError("Missing OPENROUTER_BASE_URL in .env")
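
The fail-fast checks above stop at the first missing variable. As a minimal sketch, a helper could report every missing name in one error instead. `require_env` is a hypothetical name and is not part of the notebook.

```python
import os


def require_env(*names: str) -> dict[str, str]:
    """Return the requested environment variables, or raise listing every missing one.

    `require_env` is a hypothetical helper, not part of the original notebook.
    """
    values = {name: os.getenv(name) for name in names}
    # Treat unset and empty variables the same way, as the original checks do.
    missing = [name for name, value in values.items() if not value]
    if missing:
        raise RuntimeError(f"Missing in .env: {', '.join(missing)}")
    return values
```

After `load_dotenv()`, it could be called as `env = require_env("OPENROUTER_API_KEY", "OPENROUTER_BASE_URL")`.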

In [4]:

# =============================================================================
# EMBEDDINGS
# =============================================================================
embeddings_model = OpenAIEmbeddings(
    api_key=OPENROUTER_API_KEY,
    base_url=OPENROUTER_BASE_URL,
    model="text-embedding-3-small",
)

### Indexing pipeline for RAG (chapter 7 - QA across documents)

In [5]:
# =============================================================================
# VECTOR STORE (Chroma)
# =============================================================================
vector_db = Chroma(
    collection_name="tourist_info",
    embedding_function=embeddings_model,
    persist_directory="./chroma_db",
)


In [6]:
# =============================================================================
# SPLITTER
# =============================================================================
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,
    chunk_overlap=0,
)
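
To get a feel for what `chunk_size` and `chunk_overlap` control, here is a simplified fixed-window splitter. It is an illustration only: `RecursiveCharacterTextSplitter` additionally prefers to break on separators such as paragraph breaks, newlines, and spaces. `sliding_chunks` is a hypothetical name.

```python
def sliding_chunks(text: str, chunk_size: int, chunk_overlap: int = 0) -> list[str]:
    """Cut text into fixed-size character windows that overlap by chunk_overlap.

    Simplified illustration, not the algorithm used by RecursiveCharacterTextSplitter.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap  # how far each window advances
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]


text = "abcdefghij"
print(sliding_chunks(text, chunk_size=4))                   # ['abcd', 'efgh', 'ij']
print(sliding_chunks(text, chunk_size=4, chunk_overlap=2))  # ['abcd', 'cdef', 'efgh', 'ghij', 'ij']
```

With `chunk_overlap=0`, as in the cell above, neighbouring chunks share no text, so a sentence cut at a boundary appears in neither chunk as a whole.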

In [7]:
# =============================================================================
# LOADER FACTORY (extension -> loader)
# =============================================================================
loader_classes = {
    "docx": Docx2txtLoader,
    "pdf": PyPDFLoader,
    "txt": TextLoader,
}

def get_loader(filename: str):
    _, ext = os.path.splitext(filename)
    ext = ext.lstrip(".").lower()

    loader_class = loader_classes.get(ext)
    if loader_class:
        return loader_class(filename)

    raise ValueError(f"No loader available for file extension '{ext}'")

In [8]:
# =============================================================================
# INGESTION ORCHESTRATOR
# =============================================================================
def split_and_import(loader):
    docs = loader.load()  # -> list[Document]
    chunks = text_splitter.split_documents(docs)  # Document -> list[Document]
    vector_db.add_documents(chunks)  # embed + store
    print(f"{loader.__class__.__name__}: {len(chunks)} chunks ingested")

In [9]:
# =============================================================================
# INGEST A FOLDER
# =============================================================================
folder_path = "CilentoTouristInfo"

for filename in os.listdir(folder_path):
    file_path = os.path.join(folder_path, filename)

    if os.path.isfile(file_path):
        try:
            loader = get_loader(file_path)
            print(f"Loader for {filename}: {loader.__class__.__name__}")
            split_and_import(loader)
        except ValueError as e:
            print(e)

print("✅ Indexing complete.")

Loader for Acciaroli.pdf: PyPDFLoader
PyPDFLoader: 6 chunks ingested
Loader for Cape Palinuro.txt: TextLoader
TextLoader: 4 chunks ingested
Loader for Casalvelino.txt: TextLoader
TextLoader: 4 chunks ingested
Loader for Cilentan coast.docx: Docx2txtLoader
Docx2txtLoader: 5 chunks ingested
Loader for Cilento Coast Map and Travel Guide.docx: Docx2txtLoader
Docx2txtLoader: 24 chunks ingested
Loader for Cilento DOC.pdf: PyPDFLoader
PyPDFLoader: 3 chunks ingested
Loader for Cilento Park.txt: TextLoader
TextLoader: 32 chunks ingested
Loader for Cilento.docx: Docx2txtLoader
Docx2txtLoader: 17 chunks ingested
Loader for Cilento.pdf: PyPDFLoader
PyPDFLoader: 9 chunks ingested
Loader for Marina di Camerota info.pdf: PyPDFLoader
PyPDFLoader: 16 chunks ingested
Loader for Marina di Camerota.docx: Docx2txtLoader
Docx2txtLoader: 9 chunks ingested
Loader for NationalParkOfCilentoAndValloDiDiano.txt: TextLoader
TextLoader: 9 chunks ingested
Loader for Parco Nazionale del Cilento.pdf: PyPDFLoader
PyPDF
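
Because the collection is persisted in `./chroma_db`, re-running the ingestion cell is likely to store the same chunks a second time, since chunk IDs are generated randomly unless supplied. One possible mitigation, sketched here under that assumption, is to derive a stable ID per chunk. `chunk_id` is a hypothetical helper.

```python
import hashlib


def chunk_id(source: str, index: int, text: str) -> str:
    """Derive a stable ID from the chunk's source file, position, and content.

    Hypothetical helper: the same chunk always maps to the same ID.
    """
    # A NUL byte separates the fields so that different field combinations
    # cannot produce the same payload.
    payload = f"{source}\x00{index}\x00{text}".encode("utf-8")
    return hashlib.sha256(payload).hexdigest()
```

Inside `split_and_import`, the IDs could be built with `ids = [chunk_id(c.metadata.get("source", ""), i, c.page_content) for i, c in enumerate(chunks)]` and passed as `vector_db.add_documents(chunks, ids=ids)`. Whether an existing ID is overwritten rather than duplicated depends on the installed `langchain_chroma` version, so it is worth checking the collection count before and after a second run.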

### 🔍 Extra check to verify the index (highly recommended)

In [11]:
docs = vector_db.similarity_search("Cilento Coast", k=3)
for i, d in enumerate(docs, 1):
    print(f"\n--- hit {i} ---")
    print("source:", d.metadata.get("source"))
    print(d.page_content[:300])



--- hit 1 ---
source: CilentoTouristInfo\Cilentan coast.docx
The Cilento Coast (Italian: Costiera Cilentana) is an Italian stretch of coastline in Cilento, on the southern side of the Province of Salerno. It is situated between the gulfs of Salerno and Policastro, extending from the municipalities of Capaccio-Paestum in the north-west, to Sapri in the south-e

--- hit 2 ---
source: CilentoTouristInfo\Cilento Park.txt
Cilento we mean today the southernmost part of the province of Salerno: a strip of plains south of the Sele river and the mountainous bastion between Agropoli, Sapri and Vallo di Diano. Mountains that reach almost 2000 meters and 100 km of mainly rocky coast dotted with bays and coves. The toponym h

--- hit 3 ---
source: CilentoTouristInfo\Cilento Park.txt
Cilento sea
Una caratteristica grotta lungo la costa di Marina di Camerota


In [10]:
print(vector_db._collection.count())


212


### Generation pipeline for RAG (chapter 7 - QA across documents)

In [12]:
query = "Where was Poseidonia and who renamed it to Paestum?"
results = vector_db.similarity_search(query, k=4)  # four closest results
print(results)

[Document(id='15e1e168-3f99-4753-859a-da5dd2519cfc', metadata={'source': 'CilentoTouristInfo\\Cilento Park.txt'}, page_content='Everybody knows today Paestum, they know of its three magnificent Doric temples, among the most important of the Greek world in the archaic and classical age. The temple of Neptune or Poseidon, a perfect example of Doric architecture, which some recent studies have attributed to Apollo. The temple of Hera, also known as the Basilica, is the oldest of the three buildings that stand in the excavation area and certainly belongs to the first generation of the great temples. The temple of Athena,'), Document(id='5ec1057f-cbee-4928-bbbb-c2390aad56b9', metadata={'source': 'CilentoTouristInfo\\Cilento Park.txt'}, page_content="Paestum\nL'area archeologica di Velia-Elea"), Document(id='1b99757e-39fd-4488-838d-54a012341f7c', metadata={'source': 'CilentoTouristInfo\\Cilento Park.txt'}, page_content='rises to the north, at the highest point of the archaeological area. It 

In [41]:
len(results)

4

In [44]:
rag_prompt_template = """Use the following pieces of context
to answer the question at the end.
If you don't know the answer, just say that you don't know,
don't try to make up an answer.
Use three sentences maximum and keep the
answer as concise as possible.
{context}
Question: {question}
Helpful Answer:"""

from langchain_core.prompts import PromptTemplate  # not imported in the cells above

rag_prompt = PromptTemplate.from_template(rag_prompt_template)

In [45]:
retriever = vector_db.as_retriever()

In [47]:
from langchain_core.runnables import RunnablePassthrough  # not imported in the cells above

question_feeder = RunnablePassthrough()

In [48]:
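# Set up the RAG chain: retrieve context, fill the prompt, call the chat model.
# NOTE: `chatbot` is not defined in the cells shown here; it is assumed to be a
# chat model created in an earlier cell (the output below reports openai/gpt-4o).
rag_chain = (
    {"context": retriever, "question": question_feeder}
    | rag_prompt
    | chatbot
)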

In [49]:
def execute_chain(chain, question):
    answer = chain.invoke(question)
    return answer

In [51]:
question = """Where was Poseidonia and who renamed
it to Paestum. Also tell me the source."""

In [52]:
answer = execute_chain(rag_chain, question)

In [53]:
print(answer.content)

Poseidonia was an ancient Greek city in southern Italy, established by settlers from Sybaris. It was renamed Paestum by the Romans when they took over in 273 BCE. The source of this information is the Wikipedia article on Paestum.


In [54]:
print(answer)

content='Poseidonia was an ancient Greek city in southern Italy, established by settlers from Sybaris. It was renamed Paestum by the Romans when they took over in 273 BCE. The source of this information is the Wikipedia article on Paestum.' additional_kwargs={'refusal': None} response_metadata={'token_usage': {'completion_tokens': 52, 'prompt_tokens': 1980, 'total_tokens': 2032, 'completion_tokens_details': {'accepted_prediction_tokens': None, 'audio_tokens': None, 'reasoning_tokens': 0, 'rejected_prediction_tokens': None, 'image_tokens': 0}, 'prompt_tokens_details': {'audio_tokens': 0, 'cached_tokens': 0, 'video_tokens': 0}, 'cost': 0.00547, 'is_byok': False, 'cost_details': {'upstream_inference_cost': None, 'upstream_inference_prompt_cost': 0.00495, 'upstream_inference_completions_cost': 0.00052}}, 'model_provider': 'openai', 'model_name': 'openai/gpt-4o', 'system_fingerprint': 'fp_a0e9480a2f', 'id': 'gen-1766241567-UV60d6rbKO5YggHnlsjf', 'finish_reason': 'stop', 'logprobs': None} id
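
The chain hands the retriever output to the prompt as-is, so `{context}` receives the raw list of documents. When the question asks for a source, it may help to render each chunk with an explicit source label first. The sketch below assumes a hypothetical helper, `format_docs`, and uses stand-in objects so that it runs without LangChain installed.

```python
from types import SimpleNamespace


def format_docs(docs) -> str:
    """Join retrieved chunks into one context string, labelling each with its source.

    Works with any objects exposing `.page_content` and `.metadata`, which
    LangChain `Document` objects do. `format_docs` is a hypothetical helper.
    """
    blocks = []
    for doc in docs:
        source = doc.metadata.get("source", "unknown")
        blocks.append(f"[source: {source}]\n{doc.page_content.strip()}")
    return "\n\n".join(blocks)


# Stand-in objects so the sketch runs without LangChain installed.
demo = [
    SimpleNamespace(page_content="Paestum has three Doric temples.",
                    metadata={"source": "Cilento Park.txt"}),
    SimpleNamespace(page_content="The Cilento Coast is in the Province of Salerno.",
                    metadata={}),
]
print(format_docs(demo))
```

In the chain, the context entry could then become `retriever | format_docs`, since LCEL accepts a plain function piped after a runnable. The answer's source can then be checked against the `[source: ...]` labels in the prompt.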

In [55]:
question = """And then, what did they do?
Tell me only if you know.
Also tell me the source"""
answer = execute_chain(rag_chain, question)

In [56]:
print(answer.content)

I don't know.
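
Each `chain.invoke` call is independent, so the follow-up question reaches both the retriever and the prompt without the previous exchange. That is a likely reason for the "I don't know." answer. A minimal sketch of one workaround is to prepend recent turns to the new question. `with_history` is a hypothetical helper, and real applications often use a dedicated question-rewriting step instead.

```python
def with_history(history: list[tuple[str, str]], question: str, max_turns: int = 3) -> str:
    """Prefix a follow-up question with the most recent question/answer pairs.

    Hypothetical helper; `history` holds (question, answer) tuples, oldest first.
    """
    recent = history[-max_turns:] if max_turns > 0 else []
    lines = []
    for past_question, past_answer in recent:
        lines.append(f"Previous question: {past_question}")
        lines.append(f"Previous answer: {past_answer}")
    lines.append(f"Follow-up question: {question}")
    return "\n".join(lines)


history = [("Where was Poseidonia?", "In southern Italy.")]
print(with_history(history, "What happened next?"))
```

The combined string could be passed as `execute_chain(rag_chain, with_history(history, question))`. Note that the whole string then also serves as the retrieval query, which can help or hurt retrieval depending on how long the history is.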
