# DocQA MVP
I created this notebook to demonstrate the **core Retrieval-Augmented Generation (RAG)** workflow that became the foundation of the `DocQA` package. I deliberately used **Ollama** during experimentation so I could iterate quickly without spending OpenAI credits. The components are **modular**, so I can later swap Ollama for **OpenAI (or any other provider)** by changing only the LLM and embedding implementations, without rewriting the pipeline.
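
As a rough illustration of what "modular" means here, the sketch below hides the embedding provider behind a small interface. `Embedder`, `FakeEmbedder`, and `embed_corpus` are hypothetical names used only for this example; they are not part of `DocQA` or LangChain. The method name `embed_documents` mirrors the one LangChain embedding classes expose, and the toy provider stands in for Ollama or OpenAI so the snippet runs without either.

```python
from typing import List, Protocol


class Embedder(Protocol):
    """Hypothetical interface: anything that turns texts into vectors."""

    def embed_documents(self, texts: List[str]) -> List[List[float]]: ...


class FakeEmbedder:
    """Stand-in provider so the sketch runs without Ollama or OpenAI."""

    def embed_documents(self, texts: List[str]) -> List[List[float]]:
        # Toy 2-d vector per text: (character count, word count).
        return [[float(len(t)), float(len(t.split()))] for t in texts]


def embed_corpus(embedder: Embedder, texts: List[str]) -> List[List[float]]:
    # The pipeline depends only on the interface, so providers are swappable.
    return embedder.embed_documents(texts)


vectors = embed_corpus(FakeEmbedder(), ["hello world", "rag"])
```

Swapping providers then means passing a different object that satisfies the same interface; the rest of the pipeline stays untouched.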

**Topics**

1. Load documents (**PDF** + **JSON**)
2. Split into chunks
3. Embed chunks and store in a **Vector DB (FAISS)**
4. Retrieve top‑K similar chunks for a question
5. Generate an answer with an LLM (**Ollama**)

**Local Setup for Ollama**

I run Ollama locally and pull the models I need:

```bash
ollama serve
ollama pull qwen2.5:7b
ollama pull nomic-embed-text
``` 

Models used:
* `qwen2.5:7b` → answer generation
* `nomic-embed-text` → embeddings for vector search


In [None]:
import json
import time
from pathlib import Path
from typing import List, Tuple

import pandas as pd  # needed by convert_csv_to_json below
from langchain_core.documents import Document
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_community.vectorstores import FAISS
from langchain_ollama import ChatOllama, OllamaEmbeddings

## Helper Function

In [2]:
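def convert_csv_to_json(file):
    """Convert ../../documents/<file>.csv into a JSON array of row records."""
    df = pd.read_csv(f"../../documents/{file}.csv", index_col=0)
    # Replace NaN with empty strings so every field serializes as text.
    df = df.fillna("")
    records = df.to_dict(orient="records")
    with open(f"../../documents/{file}.json", "w", encoding="utf-8") as f:
        json.dump(records, f, indent=2, ensure_ascii=False)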

In [3]:
# convert_csv_to_json("sample_json")

## Config

Ollama Config

In [4]:
LLM_MODEL = "qwen2.5:7b"
EMBED_MODEL = "nomic-embed-text"
TEMPERATURE = 0.0

Retriever Config

In [5]:
FAISS_INDEX_DIR = Path("./faiss_index")
CHUNK_SIZE = 1000
CHUNK_OVERLAP = 200
TOP_K = 6

## Document Loaders

In [6]:
from langchain_community.document_loaders import PyPDFLoader

In [7]:
def load_pdf(path: str) -> List[Document]:
    """Load a PDF and return one Document per page."""
    loader = PyPDFLoader(path)
    pages = loader.load()
    return pages

In [8]:
def load_json(path: str) -> List[Document]:
    """Load a JSON array of Q&A records; each record becomes one Document."""
    data = json.loads(Path(path).read_text(encoding="utf-8"))
    docs: List[Document] = []

    for item in data:
        text = (
            f"Question: {item.get('question','')}\n"
            f"Answer: {item.get('answer','')}\n"
            f"Comments: {item.get('comments','')}\n"
        )
        docs.append(
            Document(
                page_content=text,
                metadata={"id": item.get("id"), "source": path}
            )
        )
    return docs

## Chunking

In [9]:
def split_documents(
    docs: List[Document],
    chunk_size: int = CHUNK_SIZE,
    overlap: int = CHUNK_OVERLAP,
) -> List[Document]:
    splitter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size,
        chunk_overlap=overlap,
        add_start_index=True,
    )
    return splitter.split_documents(docs)
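
To make the effect of `chunk_size` and `overlap` concrete, here is a minimal fixed-window sketch. It is a simplification: `RecursiveCharacterTextSplitter` first tries to split on separators such as paragraphs and lines, whereas the hypothetical `sliding_chunks` helper below just slides a window over raw characters.

```python
from typing import List


def sliding_chunks(text: str, chunk_size: int, overlap: int) -> List[str]:
    """Fixed-window splitter: each chunk starts (chunk_size - overlap) after the previous one."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window has reached the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


chunks = sliding_chunks("abcdefghij", chunk_size=4, overlap=2)
print(chunks)
```

With the notebook's settings (`CHUNK_SIZE = 1000`, `CHUNK_OVERLAP = 200`), the same idea means consecutive chunks share up to 200 characters, so a sentence cut at a chunk boundary still appears whole in one of the two neighbours.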

## Creating Embeddings + Vector Store

In [10]:
embeddings = OllamaEmbeddings(model=EMBED_MODEL)

In [11]:
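def build_or_load_faiss(
    chunks: List[Document],
    index_dir: Path = FAISS_INDEX_DIR,
) -> FAISS:
    """Create the FAISS index if missing; otherwise load it and add the chunks.

    Note: re-running against an existing index appends the same chunks again,
    so duplicates accumulate and can surface as identical retrieval hits.
    `allow_dangerous_deserialization=True` unpickles the saved index, so only
    load index files you created yourself.
    """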
    if index_dir.exists():
        vs = FAISS.load_local(
            str(index_dir),
            embeddings,
            allow_dangerous_deserialization=True,
        )
        vs.add_documents(chunks)
    else:
        vs = FAISS.from_documents(chunks, embeddings)

    vs.save_local(str(index_dir))
    return vs
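
Because `build_or_load_faiss` adds every chunk to an existing index, re-running the notebook indexes the same text again. One possible guard, sketched below, is to derive a stable ID from each chunk and skip IDs that were already indexed. `chunk_id` and `filter_new` are hypothetical helpers, and persisting the set of seen IDs next to the index is an assumption this sketch leaves out.

```python
import hashlib
from typing import Iterable, List, Set, Tuple


def chunk_id(source: str, text: str) -> str:
    """Stable ID derived from the chunk's source and content."""
    return hashlib.sha256(f"{source}\n{text}".encode("utf-8")).hexdigest()


def filter_new(chunks: Iterable[Tuple[str, str]], seen: Set[str]) -> List[Tuple[str, str]]:
    """Return only chunks whose ID is not in `seen`, recording them as seen."""
    fresh = []
    for source, text in chunks:
        cid = chunk_id(source, text)
        if cid not in seen:
            seen.add(cid)
            fresh.append((source, text))
    return fresh


seen_ids: Set[str] = set()
batch = [("a.pdf", "alpha"), ("a.pdf", "beta")]
first_run = filter_new(batch, seen_ids)   # everything is new
second_run = filter_new(batch, seen_ids)  # same batch again, as on a notebook re-run
```

In the notebook, only the chunks returned by such a filter would be passed on for indexing.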

## Retrieval Helpers

In [12]:
def retrieve_with_scores(
    vector_store: FAISS,
    query: str,
    k: int = TOP_K,
) -> List[Tuple[Document, float]]:
    return vector_store.similarity_search_with_score(query, k=k)


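def format_context(docs_and_scores: List[Tuple[Document, float]]) -> str:
    """Build the LLM context from the retrieved chunks.

    Each chunk is numbered and prefixed with its citation metadata and score.
    Citations depend on a `source_type` metadata key; the loaders above do not
    set it, so every chunk currently falls back to "no-meta".
    """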
    parts = []
    for i, (doc, score) in enumerate(docs_and_scores, 1):
        meta = doc.metadata or {}
        cite = []
        
        if meta.get("source_type") == "pdf":
            if "page" in meta:
                cite.append(f"page={meta['page']}")
        
        if meta.get("source_type") == "json" and meta.get("id"):
            cite.append(f"id={meta['id']}")
        
        cite_str = ", ".join(cite) if cite else "no-meta"
        
        parts.append(
            f"[{i}] ({cite_str}, score={score:.4f})\n{doc.page_content}"
        )

    return "\n\n".join(parts)
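
To show what top-K retrieval with scores amounts to, here is a brute-force sketch in plain Python. I am assuming the default LangChain FAISS settings, where the score is a distance and a lower value means a closer match. The helper names and toy vectors are hypothetical, and the exact numbers FAISS returns may differ from the plain Euclidean distance used here.

```python
import math
from typing import List, Sequence, Tuple


def l2_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def top_k(
    query_vec: Sequence[float],
    corpus: List[Tuple[str, Sequence[float]]],
    k: int,
) -> List[Tuple[str, float]]:
    """Score every chunk against the query and keep the k closest."""
    scored = [(text, l2_distance(query_vec, vec)) for text, vec in corpus]
    scored.sort(key=lambda pair: pair[1])  # smallest distance first
    return scored[:k]


corpus = [("aws", [1.0, 0.0]), ("gcp", [0.0, 1.0]), ("far", [5.0, 5.0])]
hits = top_k([0.9, 0.1], corpus, k=2)
```

A vector index such as FAISS answers the same question without scanning every vector in Python, which is what makes it practical for larger corpora.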

In [13]:
llm = ChatOllama(model=LLM_MODEL, temperature=TEMPERATURE)

In [14]:
def answer_one(query: str, context: str) -> str:
    prompt = f"""You are a security and compliance documentation assistant.

Answer the question using ONLY the information below.
Rules:
- Do NOT use external knowledge.
- Do NOT guess or infer.
- If the answer is not clearly present, respond exactly with: Data not found.

Information:
{context}

Question:
{query}

Answer:
"""
    resp = llm.invoke(prompt)
    return resp.content.strip()
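
The prompt asks the model to reply with an exact sentinel when the answer is missing, so downstream code can branch on it. A minimal sketch of that check follows; `NOT_FOUND` and `is_not_found` are hypothetical names, and the tolerant comparison assumes the model may vary case, whitespace, or the trailing period.

```python
NOT_FOUND = "Data not found."  # sentinel string used in the prompt above


def _normalize(text: str) -> str:
    # Ignore surrounding whitespace, a trailing period, and letter case.
    return text.strip().rstrip(".").lower()


def is_not_found(answer: str) -> bool:
    """True when the model replied with the sentinel instead of an answer."""
    return _normalize(answer) == _normalize(NOT_FOUND)


print(is_not_found(" Data not found. "))
print(is_not_found("Product Fruits s.r.o. relies on AWS."))
```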


In [15]:
t0 = time.time()

pdf_docs = load_pdf("../../documents/soc2-type2.pdf")
json_docs = load_json("../../documents/sample_json.json")

all_docs = pdf_docs + json_docs
splits = split_documents(all_docs)

vector_store = build_or_load_faiss(splits)

print(f"Loaded docs: pdf={len(pdf_docs)}, json={len(json_docs)}")
print(f"Chunks: {len(splits)}")
print(f"Index ready. Elapsed: {time.time() - t0:.2f}s")

Loaded docs: pdf=56, json=19
Chunks: 212
Index ready. Elapsed: 6.03s


In [16]:
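# Note: this retriever is not used in the cells below; retrieval goes through
# retrieve_with_scores with k=TOP_K instead.
retriever = vector_store.as_retriever(
    search_type="similarity",
    search_kwargs={"k": 50}
)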

In [17]:
query = "Which cloud providers do you rely on?"

In [18]:
docs_and_scores = retrieve_with_scores(vector_store, query, k=TOP_K)
context = format_context(docs_and_scores)

In [19]:
print(context[:2000])

[1] (no-meta, score=0.7132)
2 
INDEPENDENT SERVICE AUDITOR’S REPORT 
 
To Board of Directors 
Product Fruits s.r.o. 
 
Scope 
 
We have examined the accompanying “ Description of Product Fruits , a cloud -hosted software application ” 
provided by Product Fruits s.r.o. throughout the period July 24, 2024 to July 23, 2025  (the description) and the 
suitability of the design and operating effectiveness of controls to meet Product Fruits s.r.o. ’s service 
commitments and system requirements based on the criteria for Security, Confidentiality, Availability, Processing 
Integrity & Privacy principles set forth in TSP Section  100 Principles and Criteria, Trust Services Principles and 
Criteria for Security, Confidentiality and Availability (applicable trust services criteria) throughout the period July 
24, 2024 to July 23, 2025. 
 
Product Fruits s.r.o.  uses Amazon Web Services Inc. (AWS), a subservice organization, to provide cloud

[2] (no-meta, score=0.7132)
2 
INDEPENDENT SERVICE AU

In [20]:
ans = answer_one(query, context)

In [21]:
print(ans)

Product Fruits s.r.o. relies on Amazon Web Services Inc. (AWS) as a cloud provider.
