## Local Retrieval-Augmented Generation (RAG) with LlamaIndex and Ollama  

**Author:** Pranshu Goyal  
**Date:** August 16, 2025  
**Python Version:** ≥ 3.10  

### Abstract  
This notebook presents a structured implementation of a **Retrieval-Augmented Generation (RAG)** workflow that combines **LlamaIndex** for document ingestion, preprocessing, embedding, and vector indexing, with **Ollama** as a local large language model (LLM) for downstream tasks such as question answering and summarization.  

The workflow has been designed with an emphasis on **reproducibility, resilience, and safe defaults**. Specifically, it provides:  

- **Environment Bootstrapping:** Automatic setup of directories (`./docs`, `./storage`, `./logs`) with placeholder files to ensure smooth execution on first run.  
- **Deterministic Code Style:** Imports are consistently organized (isort-compatible) and code follows [Black](https://black.readthedocs.io/) formatting conventions.  
- **Heuristic Privacy Safeguards:** A lightweight PHI (Protected Health Information) pattern detector is included, though it is heuristic-only and **not HIPAA-compliant**.  
- **Graceful Degradation:** The pipeline continues to function if external components are unavailable (e.g., skips question answering/summarization if **Ollama** is not running, or defaults to `UnstructuredReader` if no **LlamaParse** API key is provided).  

### Usage Instructions  

1. **Execute All Cells:**  
   Select *Kernel → Restart & Run All* to initialize the notebook. On first execution, the notebook will:  
   - Install and validate required dependencies (the install cell ships commented out; uncomment it for a first-time setup).  
   - Create the necessary working directories (`./docs`, `./storage`, `./logs`).  
   - Attempt to establish a connection with **Ollama**. If unavailable, the system will safely bypass Ollama-dependent demonstrations.  

2. **Optional PDF Parsing via LlamaParse:**  
   To enable more accurate PDF ingestion using LlamaParse, create a `.env` file in the project root and specify your API key:  

   ```bash
   LLAMA_PARSE_KEY="your-llamaparse-key"
   ```

## Setup and Imports

This section prepares the environment by installing all required dependencies for the **RAG pipeline**.  
Both **pip** and **conda** are used to ensure reproducibility and to cover packages better supported by each ecosystem.

### 1. 📦 Pip Dependencies
- **Core utilities**:
  - `requests` → lightweight HTTP client (used to check Ollama availability).  
  - `nest-asyncio` → patches Jupyter’s async loop for libraries like LlamaIndex.  
  - `python-dotenv` → loads environment variables from `.env`.  
- **Llama ecosystem**:
  - `llama-parse` → high-fidelity PDF/complex document parser.  
  - `llama-index-core` → core LlamaIndex framework.  
  - `llama-index-llms-ollama` → Ollama LLM integration.  
  - `llama-index-embeddings-huggingface` → HuggingFace embeddings backend.  
  - `llama-index-readers-file` → file-type readers for ingestion (DOCX, PPTX, etc.).  
- **Data ingestion and preprocessing**:
  - `unstructured` → parses diverse file formats into text.  
  - `pypdf` → robust PDF parsing.  
- **ML/NLP stack**:
  - `torch` → PyTorch for deep learning models.  
  - `transformers` → HuggingFace Transformers for LLMs.  
  - `sentence-transformers` → pre-trained embedding models.  
- **Data handling**:
  - `pandas==2.2.3` → DataFrame operations (pinned version ensures stability).  

⚠️ **Note:** `unstructured[pdf]` is commented out, since it sometimes causes dependency conflicts. Uncomment only if advanced PDF parsing via `unstructured` is required.

### 2. 🔎 Conda Dependencies
- `faiss-cpu=1.7.4` → Facebook AI Similarity Search (FAISS) for vector indexing and retrieval.  
  Installed via conda-forge channel for compatibility and performance.  

In [None]:
# # ==============================
# # Environment Setup 
# # ==============================

# %pip install -U -qq \
#   requests \
#   nest-asyncio \
#   python-dotenv \
#   llama-parse \
#   llama-index-core \
#   llama-index-llms-ollama \
#   llama-index-embeddings-huggingface \
#   llama-index-readers-file \
#   unstructured \
#   pandas==2.2.3 \
#   pypdf \
#   sentence-transformers \
#   transformers \
#   torch \
#   # "unstructured[pdf]"\

# print("Pip Dependencies Installed Successfully.")

# %conda install -y -q -c conda-forge faiss-cpu=1.7.4

# print("Conda Dependencies Installed Successfully.")

---
## Overview

This cell initializes a lightweight document-processing environment that can optionally work with a local Ollama LLM and LlamaIndex:

- Silences noisy warnings (including `InconsistentVersionWarning` from scikit-learn).
- Prints deterministic runtime info (Python version and platform).
- Checks whether Ollama is reachable on `http://localhost:11434` and lists available models.
- Applies `nest_asyncio` so libraries that nest event loops (e.g., LlamaIndex) work smoothly in Jupyter.
- Ensures minimal project directories exist: `docs/`, `storage/`, `logs/`.
- Drops a placeholder file in `docs/` if it is empty, so later indexing steps never fail due to missing inputs.
- Imports LlamaIndex components and readers, plus LlamaParse for advanced document parsing.

This cell is safe to run even if Ollama isn’t running; Q&A/summarization demos can be skipped when Ollama is down.

## Dependencies & What They’re For
- **Standard library**: `os`, `sys`, `time`, `Path`, `json`, `logging`, `re`, `functools`, typing utilities.
- **Warnings control**: Filters `InconsistentVersionWarning`, `FutureWarning`, and `UserWarning` to keep output clean.
- **requests**: Pings Ollama’s REST endpoint to detect availability.
- **nest_asyncio**: Patches the Jupyter event loop to avoid “already running loop” errors.
- **python-dotenv**: Loads environment variables if you later call `load_dotenv()` (import is present here).
- **LlamaIndex core**: `Document`, `Settings`, `VectorStoreIndex`, storage utilities, directory readers.
- **Embeddings & LLM**: `HuggingFaceEmbedding`, `Ollama` (the LLM wrapper for local models).
- **File readers**: `DocxReader`, `MarkdownReader`, `PptxReader`, `UnstructuredReader` (broad file-type support).
- **llama_parse.LlamaParse**: Optional high-fidelity parsing of PDFs/complex docs (API-based).

---
## Ollama Availability Check

`is_ollama_up()`:

- Sends a GET to `http://localhost:11434/api/tags` (2s timeout).
- If OK, prints status and shows up to 5 model tags (with an ellipsis if more).
- On failure, prints a friendly message and returns `False`.
- The boolean is stored in `OLLAMA_UP` for later conditional logic.

This design lets downstream code branch on `OLLAMA_UP`, for example skipping Q&A or summarization when Ollama is down.
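
The cell that defines `is_ollama_up()` is not reproduced in this section, so what follows is only a minimal sketch of the behaviour described above. The `fetch_json` parameter, the `summarize_models` helper, and the `fake_tags` stub are hypothetical names added so the check can be exercised without a running server. The response shape (`{"models": [{"name": ...}]}`) is assumed from Ollama's `/api/tags` endpoint.

```python
from typing import Any, Callable, Dict, Optional

OLLAMA_TAGS_URL = "http://localhost:11434/api/tags"


def summarize_models(payload: Dict[str, Any], limit: int = 5) -> str:
    """Join up to `limit` model names, adding an ellipsis if more exist."""
    names = [m.get("name", "?") for m in payload.get("models", [])]
    suffix = ", …" if len(names) > limit else ""
    return ", ".join(names[:limit]) + suffix


def _default_fetch(url: str, timeout: float) -> Dict[str, Any]:
    """Fetch JSON with `requests` (imported lazily so the sketch loads without it)."""
    import requests

    resp = requests.get(url, timeout=timeout)
    resp.raise_for_status()
    return resp.json()


def is_ollama_up(
    fetch_json: Optional[Callable[[str, float], Dict[str, Any]]] = None,
    timeout: float = 2.0,
) -> bool:
    """Return True if the Ollama tags endpoint answers; False on any failure."""
    fetch = fetch_json or _default_fetch
    try:
        payload = fetch(OLLAMA_TAGS_URL, timeout)
    except Exception as exc:  # refused connection, timeout, bad JSON, missing requests
        print(f"Ollama not reachable ({exc.__class__.__name__}); LLM demos will be skipped.")
        return False
    print(f"Ollama is up. Models: {summarize_models(payload) or 'none installed'}")
    return True


def fake_tags(url: str, timeout: float) -> Dict[str, Any]:
    """Stub standing in for a live server (hypothetical data)."""
    return {"models": [{"name": "mistral:latest"}]}


# In the notebook this would be `is_ollama_up()` with no arguments.
OLLAMA_UP = is_ollama_up(fake_tags)
if OLLAMA_UP:
    print("Safe to call run_ask() / run_summarize().")
else:
    print("Skipping Ollama-dependent demos.")
```

Catching a broad `Exception` here is deliberate: any failure mode should simply turn the flag off rather than interrupt *Restart & Run All*.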

In [16]:
# Patch Jupyter's event loop (useful for libraries that nest asyncio, e.g. LlamaIndex)
nest_asyncio.apply()
print("✅ nest_asyncio patch applied.")

# Prepare minimal project directories
for dir_name in ("docs", "storage", "logs"):
    Path(dir_name).mkdir(parents=True, exist_ok=True)

# Create a placeholder in ./docs so indexing always has at least one document
docs_path = Path("docs")
if not any(docs_path.iterdir()):
    placeholder = docs_path / "placeholder.txt"
    placeholder.write_text(
        "Add your documents here. This is just a placeholder.\n",
        encoding="utf-8",
    )
    print("Created placeholder document at ./docs/placeholder.txt")
else:
    print("Docs directory already populated.")

✅ nest_asyncio patch applied.
Docs directory already populated.


## Retrieval Transparency & Chunking Stats Helpers

This cell defines two **visual-only helper functions** that improve **interpretability and debugging** when working with Retrieval-Augmented Generation (RAG) pipelines.  
They **do not affect the actual data flow** — they are purely diagnostic and safe to remove if needed.

---

### 1. `_show_retrieval_sources(resp, top=5)`
- **Purpose**: Displays the top retrieved context nodes from a query response.  
- **Inputs**:  
  - `resp`: A LlamaIndex query response (expected to have `.source_nodes`).  
  - `top`: Number of top nodes to display (default `5`).  
- **Logic**:
  - Iterates over retrieved nodes, extracting:  
    - `score`: retrieval score (float).  
    - `source`: provenance info (file name/path/ID).  
    - `chars`: character length of text.  
  - Uses `_safe_text()` to extract content length.  
  - Displays results in a **Rich table** if available; falls back to plain printing.  
- **Why useful**: Gives **transparency** into what documents contributed to the answer.
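
The row-extraction logic can be tried in isolation with stand-in objects. This is a sketch under assumptions: `source_row` is a hypothetical helper name, the `SimpleNamespace` objects only mimic the `score` / `node.metadata` / `node.id_` attributes the function reads, and `getattr(node, "text", "")` stands in for `_safe_text()`, whose definition is not shown in this section.

```python
from types import SimpleNamespace
from typing import Any, Tuple


def source_row(rank: int, scored_node: Any) -> Tuple[int, float, str, int]:
    """Build one (rank, score, source, chars) row from a scored node."""
    raw_score = getattr(scored_node, "score", None)
    try:
        score = float(raw_score) if raw_score is not None else 0.0
    except (TypeError, ValueError):
        score = 0.0

    node = scored_node.node
    md = getattr(node, "metadata", None) or {}
    # Prefer filename-like metadata keys, then fall back to the node id.
    source = (
        md.get("file_name")
        or md.get("file_path")
        or md.get("filename")
        or getattr(node, "id_", "unknown")
    )
    text = getattr(node, "text", "") or ""  # assumption: stand-in for _safe_text()
    return rank, score, str(source), len(text)


# Stand-in objects shaped like scored nodes (hypothetical data).
with_name = SimpleNamespace(
    score=0.8123,
    node=SimpleNamespace(metadata={"file_name": "guide.md"}, text="hello world", id_="n1"),
)
no_metadata = SimpleNamespace(
    score=None,
    node=SimpleNamespace(metadata=None, text="abc", id_="n2"),
)

print(source_row(1, with_name))    # uses the file_name metadata key
print(source_row(2, no_metadata))  # falls back to the node id and a 0.0 score
```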

In [6]:
def _show_retrieval_sources(resp, top: int = 5):
    """
    Pretty-print retrieved source nodes for transparency (visual only).
    Safe to remove without changing behaviour.
    """
    nodes = getattr(resp, "source_nodes", None)
    if not nodes:
        return
    rows = []
    for i, sn in enumerate(nodes[:top], 1):
        # Node score
        score = getattr(sn, "score", None)
        try:
            score = float(score) if score is not None else 0.0
        except Exception:
            score = 0.0

        # Try common metadata keys for filename-like provenance
        md = getattr(sn.node, "metadata", {}) or {}
        src = md.get("file_name") or md.get("file_path") or md.get("filename") or getattr(sn.node, "id_", "—")
        text = _safe_text(sn.node)
        rows.append((i, score, str(src), len(text)))

    if _RICH and console:
        try:
            t = Table(title="Retrieved Context (top-k)", box=box.SIMPLE_HEAVY)
            t.add_column("Rank", justify="right")
            t.add_column("Score", justify="right")
            t.add_column("Source", overflow="fold")
            t.add_column("Chars", justify="right")
            for r in rows:
                t.add_row(str(r[0]), f"{r[1]:.3f}", r[2], str(r[3]))
            console.print(t)
            return
        except Exception:
            pass

    # Fallback plain print
    print("\nRetrieved Context (top-k):")
    for r in rows:
        print(f"{r[0]:>2}. score={r[1]:.3f} source={r[2]} chars={r[3]}")

---

### 2. `_estimate_chunk_stats(documents, chunk_size, chunk_overlap)`
- **Purpose**: Estimates how many chunks each document would generate under given chunking parameters.  
- **Inputs**:  
  - `documents`: list of documents (`Document` objects).  
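  - `chunk_size`: chunk size used for the estimate, counted here in characters (LlamaIndex itself measures `chunk_size` in tokens, so the figures are approximate).  
  - `chunk_overlap`: overlap between consecutive chunks, in the same units.  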
- **Logic**:
  - Calculates stride (`chunk_size - chunk_overlap`).  
  - For each doc, estimates number of chunks based on text length and stride.  
  - Collects `(name, char_count, ~chunk_count)` for reporting.  
  - Displays results in a **Rich table** or as plain text fallback.  
- **Why useful**: Helps **tune chunk_size/overlap** before running actual embedding/indexing.
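
The estimate is a ceiling division of the text length by the stride. With `chunk_size=512` and `chunk_overlap=50` the stride is 462, so a 1200-character document yields ⌈1200 / 462⌉ = 3 estimated chunks. A minimal sketch of that arithmetic (`estimate_chunks` is a hypothetical name for the per-document step):

```python
def estimate_chunks(n_chars: int, chunk_size: int, chunk_overlap: int) -> int:
    """Ceiling of n_chars / stride, where stride = chunk_size - chunk_overlap (min 1)."""
    if n_chars <= 0:
        return 0  # empty documents produce no chunks
    stride = max(1, chunk_size - chunk_overlap)
    return max(1, (n_chars + stride - 1) // stride)


for n_chars in (0, 300, 462, 463, 1200):
    chunks = estimate_chunks(n_chars, chunk_size=512, chunk_overlap=50)
    print(f"{n_chars:>5} chars -> ~{chunks} chunk(s)")
```

The `max(1, ...)` guard on the stride keeps the division safe when the overlap is set equal to or larger than the chunk size.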

In [7]:
def _estimate_chunk_stats(documents, chunk_size: int, chunk_overlap: int):
    """
    Estimate how many chunks per document given settings (visual only).
    Does NOT change how LlamaIndex chunks internally.
    """
    rows = []
    stride = max(1, chunk_size - chunk_overlap)
    for i, d in enumerate(documents, 1):
        name = (getattr(d, "metadata", {}) or {}).get("file_name") or f"doc_{i}"
        text = _safe_text(d)
        est_chunks = 0
        if text:
            est_chunks = max(1, (len(text) + stride - 1) // stride)
        rows.append((name, len(text), est_chunks))

    if _RICH and console:
        try:
            t = Table(title=f"Estimated Chunking (size={chunk_size}, overlap={chunk_overlap})",
                      box=box.SIMPLE_HEAVY)
            t.add_column("Source", overflow="fold")
            t.add_column("Chars", justify="right")
            t.add_column("~Chunks", justify="right")
            for name, chars, ch in rows:
                t.add_row(str(name), str(chars), str(ch))
            console.print(t)
            return
        except Exception:
            pass

    # Fallback plain print
    print(f"\nEstimated Chunking (size={chunk_size}, overlap={chunk_overlap})")
    for name, chars, ch in rows:
        print(f"- {name}: chars={chars}, ~chunks={ch}")

## Environment, Logging, Timing & Safety Utilities

This cell sets up **essential utilities** that support the rest of the notebook:

---

### 1. Environment Variable Setup
- `load_dotenv()` loads environment variables from a `.env` file.
- Example: `LLAMA_PARSE_KEY` is expected here for using **LlamaParse** API.
- Reads `LLAMA_PARSE_KEY` from the environment (empty string if unset) and prints `"FOUND"` or `"NOT FOUND"` accordingly.

In [8]:
# Load environment variables (e.g., LLAMA_PARSE_KEY)
load_dotenv()
# Read the key from the environment so a value placed in .env is actually used
LLAMA_PARSE_KEY = os.getenv("LLAMA_PARSE_KEY", "")
print("🔑 LlamaParse Key:", "FOUND" if LLAMA_PARSE_KEY else "NOT FOUND")

🔑 LlamaParse Key: NOT FOUND


---

### 2. Logger Utility — `get_logger(name: str)`
- Creates (or reuses) a named logger.
- **Avoids duplicate handlers** when rerun in Jupyter.
- Configures two handlers:
  - **Console handler** → logs stream to stdout with timestamp + level.  
  - **File handler** → saves logs into `./logs/queries.log` (UTF-8 encoded).
- Useful for structured, persistent tracking of queries and experiments.

In [9]:
def get_logger(name: str) -> logging.Logger:
    """Create (or return existing) logger with console and file handlers.

    Handlers are attached once to avoid duplicate logs across re-runs.
    """
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    if not logger.handlers:
        # Console handler
        ch = logging.StreamHandler(sys.stdout)
        ch.setFormatter(
            logging.Formatter("%(asctime)s - %(name)s - %(levelname)s - %(message)s")
        )
        logger.addHandler(ch)
        # File handler
        fh = logging.FileHandler("./logs/queries.log", encoding="utf-8")
        fh.setFormatter(
            logging.Formatter("%(asctime)s - %(name)s - %(levelname)s - %(message)s")
        )
        logger.addHandler(fh)
    return logger
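
The duplicate-handler guard can be checked on its own. The sketch below is a console-only variant (no file handler, so it does not depend on `./logs` existing). `get_console_logger` and the `demo_logger` name are hypothetical.

```python
import logging
import sys


def get_console_logger(name: str) -> logging.Logger:
    """Return a named logger, attaching a stdout handler only on first use."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    if not logger.handlers:  # skip if a previous run already attached one
        handler = logging.StreamHandler(sys.stdout)
        handler.setFormatter(
            logging.Formatter("%(asctime)s - %(name)s - %(levelname)s - %(message)s")
        )
        logger.addHandler(handler)
    return logger


first = get_console_logger("demo_logger")
second = get_console_logger("demo_logger")  # simulates re-running the cell

print(first is second, len(second.handlers))
second.info("This message is emitted once, not twice.")
```

Because `logging.getLogger(name)` returns the same object for the same name, re-running a cell without the `if not logger.handlers` check would stack handlers and repeat every message.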

---

### 3. Timing Decorator — `time_it(func)`
- Wraps any function to **print its execution time** in seconds.
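- Provides wall-clock measurement for profiling.
- The same cell also defines `PHI_PATTERNS` and `check_for_phi(text)`, a regex-only screen for SSN-, phone-, and email-like strings, and creates the shared `notebook_logger`.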

In [10]:
def time_it(func):
    """Decorator that prints a function's wall-clock execution time."""

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.time()
        result = func(*args, **kwargs)
        end = time.time()
        print(
            f"\n--- ⏱️ Function '{func.__name__}' executed in {end - start:.2f} seconds. ---"
        )
        return result

    return wrapper


# Heuristic PHI-like patterns (NOT HIPAA-compliant; informational only)
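PHI_PATTERNS: Dict[str, re.Pattern[str]] = {
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    # No \b before "(": a boundary there only matches when a word character precedes it,
    # so "(555) 123-4567" at the start of a line or after a space would be missed.
    "Phone": re.compile(r"(?:\(\d{3}\)\s*|\b\d{3}-)\d{3}-\d{4}\b"),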
    "Email": re.compile(r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b"),
}


def check_for_phi(text: str) -> bool:
    """Return True if any naive PHI-like pattern matches; False otherwise."""
    return any(pattern.search(text) for pattern in PHI_PATTERNS.values())


notebook_logger = get_logger("biomed_notebook")
print("✅ Utility functions (logging, safety) are defined.")


✅ Utility functions (logging, safety) are defined.
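
To see what the heuristic screen does and does not catch, the patterns can be run against a few sample strings. This is a self-contained sketch: the patterns are reproduced so the snippet runs on its own, `matched_phi_labels` is a hypothetical helper that reports which labels fired, and all sample values are made up. The phone pattern is written without a `\b` directly before `\(`, since a word boundary there would not match at the start of a string or after whitespace.

```python
import re
from typing import Dict, List

PHI_PATTERNS: Dict[str, re.Pattern] = {
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "Phone": re.compile(r"(?:\(\d{3}\)\s*|\b\d{3}-)\d{3}-\d{4}\b"),
    "Email": re.compile(r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b"),
}


def matched_phi_labels(text: str) -> List[str]:
    """Return the labels of every pattern that matches somewhere in `text`."""
    return [label for label, pattern in PHI_PATTERNS.items() if pattern.search(text)]


samples = [
    "Call (555) 123-4567 after lunch",
    "Alternate number 555-123-4567",
    "SSN 123-45-6789 on file",
    "Reach me at jane.doe@example.org",
    "Dotted number 555.123.4567",       # missed: separators are not hyphens
    "What does the report conclude?",
]

for sample in samples:
    print(f"{sample!r:40} -> {matched_phi_labels(sample)}")
```

The dotted phone number slipping through illustrates why the check is informational only.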


---
## Core RAG Components — Ingestion, Indexing, and Query Engine

This cell defines the **core pipeline functions** needed for a Retrieval-Augmented Generation (RAG) system using **LlamaIndex**.  
These utilities handle document ingestion, index building, persistence, and query engine creation.

---

### 1. 📂 `load_documents(directory: str) -> List[Document]`
- **Purpose**: Reads all documents from a given directory into a LlamaIndex-compatible `Document` list.  
- **Logic**:
  - Defines `file_extractor` mapping of extensions → reader:
    - `.html` → `UnstructuredReader`  
    - `.pptx` → `PptxReader`  
    - `.docx` → `DocxReader`  
    - `.md` → `MarkdownReader`  
  - For PDFs:
    - If `LLAMA_PARSE_KEY` is set → use **LlamaParse** (high-fidelity parsing).  
    - Else → fall back to `UnstructuredReader`.  
  - Wraps readers in `SimpleDirectoryReader` (recursive).  
  - Logs progress and returns loaded documents.

**Why useful**: Centralized ingestion logic makes the pipeline extensible for new formats.

In [None]:
def load_documents(directory: str) -> List[Document]:
    """Load documents from `directory`.

    - Uses **LlamaParse** for PDFs if `LLAMA_PARSE_KEY` is available.
    - Falls back to `UnstructuredReader` for PDFs otherwise.
    - Supports HTML, PPTX, DOCX, MD via dedicated readers.
    """
    notebook_logger.info(f"Loading documents from: {directory}")

    file_extractor = {
        ".html": UnstructuredReader(),
        ".pptx": PptxReader(),
        ".docx": DocxReader(),
        ".md": MarkdownReader(),
    }

    if LLAMA_PARSE_KEY:
        notebook_logger.info("LlamaParse key found. Using LlamaParse for PDFs.")
        parser = LlamaParse(api_key=LLAMA_PARSE_KEY, result_type="markdown")
        file_extractor[".pdf"] = parser
    else:
        notebook_logger.info("LlamaParse key not found. Using UnstructuredReader for PDFs.")
        file_extractor[".pdf"] = UnstructuredReader()

    reader = SimpleDirectoryReader(directory, file_extractor=file_extractor, recursive=True)
    documents = reader.load_data()
    notebook_logger.info(f"Successfully loaded {len(documents)} document(s).")
    return documents


---

### 2. 📦 `build_index(documents, embed_model_name, storage_dir, chunk_size, chunk_overlap)`
- **Purpose**: Creates and persists a **VectorStoreIndex** with embeddings.  
- **Steps**:
  - Configures `Settings` for embedding model, chunk size, and overlap.  
  - Disables `Settings.llm` since indexing does not need an LLM.  
  - Builds index via `VectorStoreIndex.from_documents(documents)`.  
  - Persists index to `storage_dir` for reuse later.  

**Why useful**: Separates indexing from querying, enabling offline persistence.

In [None]:
def build_index(
    documents: List[Document],
    embed_model_name: str,
    storage_dir: str,
    chunk_size: int,
    chunk_overlap: int,
) -> None:
    """Build and persist a VectorStoreIndex using the given embedding model."""
    notebook_logger.info(f"Using embedding model: {embed_model_name}")
    Settings.embed_model = HuggingFaceEmbedding(model_name=embed_model_name)
    Settings.chunk_size = chunk_size
    Settings.chunk_overlap = chunk_overlap
    Settings.llm = None  # Important: no LLM required for indexing

    notebook_logger.info(f"Building index with chunk size {chunk_size} …")
    index = VectorStoreIndex.from_documents(documents)
    index.storage_context.persist(persist_dir=storage_dir)
    notebook_logger.info(f"Index persisted to '{storage_dir}'.")

---

### 3. ✅ `check_index_exists(storage_dir: str) -> bool`
- **Purpose**: Utility check to verify if a persisted index already exists.  
- Returns `True` if `storage_dir` exists and is non-empty.  
- Prevents unnecessary re-indexing.

In [None]:
def check_index_exists(storage_dir: str) -> bool:
    """Return True if `storage_dir` is an existing, non-empty directory."""
    p = Path(storage_dir)
    # is_dir() also guards against a plain file at this path, where iterdir() would raise
    return p.is_dir() and any(p.iterdir())
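
The three cases the check distinguishes (empty directory, populated directory, missing path) can be exercised against a temporary directory. In this sketch `index_dir_is_populated` is a hypothetical stand-alone copy of the helper, and `docstore.json` is only a placeholder file name.

```python
import tempfile
from pathlib import Path


def index_dir_is_populated(storage_dir: str) -> bool:
    """Return True if `storage_dir` is an existing, non-empty directory."""
    p = Path(storage_dir)
    return p.is_dir() and any(p.iterdir())


with tempfile.TemporaryDirectory() as tmp:
    empty_result = index_dir_is_populated(tmp)

    # Any file makes the directory count as "populated" (placeholder name).
    (Path(tmp) / "docstore.json").write_text("{}", encoding="utf-8")
    populated_result = index_dir_is_populated(tmp)

    missing_result = index_dir_is_populated(str(Path(tmp) / "does_not_exist"))

print(empty_result, populated_result, missing_result)
```

Note that the check only looks for *some* content; it does not verify that the files form a loadable index.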

---

### 4. 🔎 `create_query_engine(storage_dir, llm_model, embed_model_name, k, temperature)`
- **Purpose**: Loads a persisted index and initializes a query engine.  
- **Steps**:
  - Configures `Settings.llm` with **Ollama** (local LLM) + temperature.  
  - Configures embedding model again to ensure consistency.  
  - Loads `StorageContext` from `storage_dir`.  
  - Restores `VectorStoreIndex` from storage.  
  - Returns a `query_engine` with `similarity_top_k=k`.  

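**Why useful**: Separates LLM query logic from indexing.  
Enables switching the LLM, `k`, or temperature without rebuilding the index; the embedding model must match the one used to build the index, otherwise query vectors and stored vectors are not comparable.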

In [11]:
def create_query_engine(
    storage_dir: str,
    llm_model: str,
    embed_model_name: str,
    k: int,
    temperature: float,
):
    """Create a query engine from a persisted index and the given LLM/embed settings."""
    notebook_logger.info("Creating query engine …")
    Settings.llm = Ollama(model=llm_model, temperature=temperature, request_timeout=120.0)
    Settings.embed_model = HuggingFaceEmbedding(model_name=embed_model_name)

    storage_context = StorageContext.from_defaults(persist_dir=storage_dir)
    index = load_index_from_storage(storage_context)
    query_engine = index.as_query_engine(similarity_top_k=k)
    notebook_logger.info(f"Query engine ready (LLM='{llm_model}', top-k={k}).")
    return query_engine

print("✅ Core RAG components (ingestion, indexing, engine) are defined.")

✅ Core RAG components (ingestion, indexing, engine) are defined.


---
## Engine Cache Helpers — QA & Summary Engines

This cell defines a simple **engine caching mechanism** to avoid rebuilding query engines multiple times during a notebook session.  
Instead of creating a new engine on every call, it reuses previously built instances when available.  

### 1. Global Instances
- `_qa_engine_instance` → holds the singleton instance of the **Q&A query engine**.  
- `_summary_engine_instance` → holds the singleton instance of the **Summarization query engine**.  
- Both are initialized as `None` and created lazily on first use.

### 2. `get_or_create_engine(engine_type: str)`
- **Purpose**: Returns a cached engine if available; otherwise creates it once and caches it.  
- **Parameters**:  
  - `engine_type`: must be either `"qa"` or `"summary"`.  
- **Logic**:
  - For `"qa"`:
    - If `_qa_engine_instance` is `None` → calls `create_query_engine()` with:  
      - storage: `./storage`  
      - LLM: `"mistral"`  
      - Embedding model: `"BAAI/bge-small-en"`  
      - `k=4` (retrieval depth)  
      - `temperature=0.1` (low randomness for factual Q&A).  
    - Else → reuses the existing instance.  
  - For `"summary"`:
    - If `_summary_engine_instance` is `None` → builds with:  
      - same storage, LLM, and embedding model  
      - `k=8` (broader retrieval)  
      - `temperature=0.3` (slightly higher randomness for summaries).  
    - Else → reuses existing instance.  
  - If `engine_type` is anything else → raises `ValueError`.  

### 3. Why This Matters
- **Performance**: Prevents repeatedly rebuilding the same engine (saves time & compute).  
- **Consistency**: Ensures that subsequent queries use the **same engine configuration**.  
- **Flexibility**: Different settings for `"qa"` vs `"summary"` support different retrieval and generation strategies.  

In [None]:
_qa_engine_instance = None
_summary_engine_instance = None

def get_or_create_engine(engine_type: str):
    """Return a cached engine of type 'qa' or 'summary', building it if necessary."""
    global _qa_engine_instance, _summary_engine_instance

    if engine_type == "qa":
        if _qa_engine_instance is None:
            notebook_logger.info("🧠 No QA engine found. Creating one …")
            _qa_engine_instance = create_query_engine(
                storage_dir="./storage",
                llm_model="mistral",
                embed_model_name="BAAI/bge-small-en",
                k=4,
                temperature=0.1,
            )
            notebook_logger.info("✅ QA engine is ready.")
        else:
            notebook_logger.info("⚡ Reusing existing QA engine from memory.")
        return _qa_engine_instance

    if engine_type == "summary":
        if _summary_engine_instance is None:
            notebook_logger.info("🧠 No Summary engine found. Creating one …")
            _summary_engine_instance = create_query_engine(
                storage_dir="./storage",
                llm_model="mistral",
                embed_model_name="BAAI/bge-small-en",
                k=8,
                temperature=0.3,
            )
            notebook_logger.info("✅ Summary engine is ready.")
        else:
            notebook_logger.info("⚡ Reusing existing Summary engine from memory.")
        return _summary_engine_instance

    raise ValueError("Unknown engine type. Use 'qa' or 'summary'.")

print("✅ Engine cache helpers are defined.")

✅ Engine cache helpers are defined.
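
The same lazy-caching idea can be expressed with a dictionary keyed by engine type instead of one global per engine. This is an alternative sketch, not the notebook's implementation: `ENGINE_PRESETS`, `get_or_create`, and `fake_factory` are hypothetical names, and the factory is a stub standing in for `create_query_engine` so the caching behaviour can be observed without an index or an LLM.

```python
from typing import Any, Callable, Dict, List, Tuple

# Per-engine settings taken from the description above.
ENGINE_PRESETS: Dict[str, Dict[str, Any]] = {
    "qa": {"k": 4, "temperature": 0.1},
    "summary": {"k": 8, "temperature": 0.3},
}
_engine_cache: Dict[str, Any] = {}


def get_or_create(engine_type: str, factory: Callable[..., Any]) -> Any:
    """Return the cached engine for `engine_type`, building it once via `factory`."""
    if engine_type not in ENGINE_PRESETS:
        raise ValueError("Unknown engine type. Use 'qa' or 'summary'.")
    if engine_type not in _engine_cache:
        _engine_cache[engine_type] = factory(**ENGINE_PRESETS[engine_type])
    return _engine_cache[engine_type]


build_calls: List[Tuple[int, float]] = []


def fake_factory(k: int, temperature: float) -> Dict[str, Any]:
    """Stub factory that records each build instead of creating a real engine."""
    build_calls.append((k, temperature))
    return {"k": k, "temperature": temperature}


qa_first = get_or_create("qa", fake_factory)
qa_again = get_or_create("qa", fake_factory)      # served from the cache
summary = get_or_create("summary", fake_factory)

print(qa_first is qa_again)  # same object both times
print(build_calls)           # one build per engine type
```

A dictionary makes it easier to add a third engine type, at the cost of slightly less explicit code than two named globals.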


---
## Job Helpers — Indexing, Querying, Summarization & Interactive Sessions

This cell defines the **high-level workflow functions** that orchestrate the end-to-end Retrieval-Augmented Generation (RAG) pipeline.  
Every function except `run_interactive_process_session` is decorated with `@time_it`, so its runtime is measured and printed.

### 1. 📂 `run_index()`
- **Purpose**: Ingests documents from `./docs`, builds a vector index, and persists it to `./storage`.
- **Visuals/Diagnostics**:
  - Pipeline diagram (`_print_pipeline_diagram`)
  - Document inventory by extension (`_print_doc_inventory`)
  - Estimated chunking stats (`_estimate_chunk_stats`)
  - Spinner/progress bar during index build
- **Process**:
  1. Loads documents using `load_documents`.
  2. Shows file-type distribution and chunking estimates.
  3. If documents exist → builds index with `build_index`.
  4. Persists index into `./storage`.

In [None]:
@time_it
def run_index() -> None:
    """Ingest and index the current contents of ./docs to ./storage.
    Visuals:
      • Pipeline diagram (explains flow)
      • Document inventory (file-type counts)
      • Estimated chunking table (size/overlap effect)
      • Progress bar around index build/persist
    """
    print("--- Starting document ingestion and indexing ---")
    _print_pipeline_diagram()  # purely cosmetic diagram

    DOCS_DIR = "./docs"
    STORAGE_DIR = "./storage"
    EMBED_MODEL = "BAAI/bge-small-en"
    CHUNK_SIZE = 512
    CHUNK_OVERLAP = 50

    try:
        # Visual spinner for load
        with _spinner("Loading documents …"):
            documents = load_documents(DOCS_DIR)

        # Inventory + estimated chunking
        _print_doc_inventory(DOCS_DIR)
        _estimate_chunk_stats(documents, CHUNK_SIZE, CHUNK_OVERLAP)

        if not documents:
            print("🛑 No documents found in './docs'. Please add files to ./docs and try again.")
            return

        # Progress wrapper (visual only)
        if _RICH and console:
            with Progress(
                SpinnerColumn(), TextColumn("[progress.description]{task.description}"),
                BarColumn(), TimeElapsedColumn()
            ) as progress:
                task = progress.add_task("Building and persisting index …", total=1)
                build_index(
                    documents=documents,
                    embed_model_name=EMBED_MODEL,
                    storage_dir=STORAGE_DIR,
                    chunk_size=CHUNK_SIZE,
                    chunk_overlap=CHUNK_OVERLAP,
                )
                progress.update(task, advance=1)
        else:
            print("[…] Building and persisting index …")
            build_index(
                documents=documents,
                embed_model_name=EMBED_MODEL,
                storage_dir=STORAGE_DIR,
                chunk_size=CHUNK_SIZE,
                chunk_overlap=CHUNK_OVERLAP,
            )

        print(f"\n✅ Successfully built index with {len(documents)} documents.")
        print("--- Indexing complete ---")
    except Exception as exc:  # noqa: BLE001
        notebook_logger.error(f"Failed during indexing: {exc}", exc_info=True)
        print(f"Error: {exc}")

---

### 2. `run_ask(question: str)`
- **Purpose**: Runs a Q&A query against the persisted index using **Ollama**.
- **Pre-checks**:
  - Verifies index exists with `check_index_exists`.
  - Confirms Ollama is up (`OLLAMA_UP`).
  - Warns if query text matches PHI patterns (`check_for_phi`).
- **Visuals/Diagnostics**:
  - Query config panel (top-k and temperature).
  - Spinner while generating answer.
  - Retrieved sources table (`_show_retrieval_sources`).
- **Post-processing**:
  - Prints the final answer, truncated to `MAX_TOKENS` whitespace-separated words (a word cap, not a tokenizer-based limit) for display clarity.
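
The display cap splits the answer on whitespace and keeps the first `MAX_TOKENS` words. A minimal sketch of that step as a reusable helper (`truncate_words` is a hypothetical name; the notebook inlines the logic):

```python
def truncate_words(text: str, max_words: int) -> str:
    """Keep at most `max_words` whitespace-separated words, marking any cut with an ellipsis."""
    words = text.split()
    if len(words) <= max_words:
        return text  # short answers are returned untouched
    return " ".join(words[:max_words]) + " …"


print(truncate_words("one two three four five", max_words=3))
print(truncate_words("short answer", max_words=3))
```

Since this counts words rather than model tokens, it only bounds what is printed; it does not limit how much the LLM generates.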

In [None]:
@time_it
def run_ask(question: str) -> None:
    """Ask a question against the persisted index using Ollama (if available).
    Visuals:
      • Query config panel (top-k & temperature)
      • Spinner while generating answer
      • Retrieved sources table (top-k provenance)
    """
    print(f"\n❓ Question: {question}")

    STORAGE_DIR = "./storage"
    MAX_TOKENS = 350

    if not check_index_exists(STORAGE_DIR):
        print("🛑 Index not found. Please run `run_index()` first.")
        return
    if not OLLAMA_UP:
        print("🛑 Ollama is not reachable. Skipping Q&A demonstration.")
        return
    if check_for_phi(question):
        print("⚠️ Warning: Potential PHI detected in query.")

    notebook_logger.info("Received query (content omitted for safety log).")

    # Tiny visual panel for current settings (no logic change)
    if _RICH and console:
        console.print(Panel.fit("Engine: QA • top-k≈4 • temperature=0.1",
                                title="Query Config", border_style="green"))
    else:
        print("Query Config: Engine=QA, top-k≈4, temperature=0.1")

    try:
        with _spinner("Generating answer …"):
            qa_engine = get_or_create_engine("qa")
            qa_response = qa_engine.query(question)

        # Visual-only provenance
        _show_retrieval_sources(qa_response, top=5)

        # Print answer (with a simple token cap for display)
        answer = str(qa_response)
        if len(answer.split()) > MAX_TOKENS:
            answer = " ".join(answer.split()[:MAX_TOKENS]) + " …"
        print(f"\n✅ Answer:\n{answer}")
    except Exception as exc:  # noqa: BLE001
        notebook_logger.error(f"Failed during query: {exc}", exc_info=True)
        print(f"Error: {exc}")


---

### 3. `run_summarize(topic: str)`
- **Purpose**: Summarizes a given topic using the persisted index and Ollama.
- **Process**:
  1. Checks index existence and Ollama availability.
  2. Creates a custom abstractive summary prompt.
  3. Runs query via `"summary"` engine (`get_or_create_engine("summary")`).
  4. Prints retrieved context and the generated summary (truncated to `MAX_TOKENS` whitespace-separated words).
- **Visuals/Diagnostics**: Similar to `run_ask`, with a blue-styled panel.

In [None]:
@time_it
def run_summarize(topic: str) -> None:
    """Summarize a topic using the persisted index via Ollama (if available).
    Visuals:
      • Query config panel (top-k & temperature)
      • Spinner while producing summary
      • Retrieved sources table (top-k provenance)
    """
    print(f"\n📖 Summarizing topic: {topic}")

    STORAGE_DIR = "./storage"
    MAX_TOKENS = 350

    if not check_index_exists(STORAGE_DIR):
        print("🛑 Index not found. Please run `run_index()` first.")
        return
    if not OLLAMA_UP:
        print("🛑 Ollama is not reachable. Skipping summarization demonstration.")
        return

    summary_prompt = (
        f"Using only the provided context, generate a concise, fully abstractive summary of "
        f"the key points regarding '{topic}'. Present the essential information in a logically "
        f"organized, coherent narrative without introductory phrases or filler."
    )

    notebook_logger.info("Received summarization request (content omitted for safety log).")

    # Visual panel for current settings (no logic change)
    if _RICH and console:
        console.print(Panel.fit("Engine: Summary • top-k≈8 • temperature=0.3",
                                title="Query Config", border_style="blue"))
    else:
        print("Query Config: Engine=Summary, top-k≈8, temperature=0.3")

    try:
        with _spinner("Producing summary …"):
            summary_engine = get_or_create_engine("summary")
            summary_response = summary_engine.query(summary_prompt)

        # Visual-only provenance
        _show_retrieval_sources(summary_response, top=5)

        summary_text = str(summary_response)
        if len(summary_text.split()) > MAX_TOKENS:
            summary_text = " ".join(summary_text.split()[:MAX_TOKENS]) + " …"
        print(f"\n✅ Summary:\n{summary_text}")
    except Exception as exc:  # noqa: BLE001
        notebook_logger.error(f"Failed during summarization: {exc}", exc_info=True)
        print(f"Error: {exc}")

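The truncation step at the end of `run_summarize` caps the output by whitespace-separated words, even though the constant is named `MAX_TOKENS`. A minimal sketch of that step as a standalone helper follows. The name `truncate_words` is hypothetical and not part of the notebook. Note that rejoining on a single space flattens any line breaks in the truncated text.

```python
def truncate_words(text: str, max_words: int, ellipsis: str = " …") -> str:
    """Return `text` capped at `max_words` whitespace-separated words.

    Text at or under the cap is returned unchanged; longer text is cut
    and marked with a trailing ellipsis.
    """
    words = text.split()
    if len(words) <= max_words:
        return text
    # Joining on a single space discards the original line breaks.
    return " ".join(words[:max_words]) + ellipsis


# Under the cap: returned as-is.
short = truncate_words("alpha beta gamma", max_words=5)
# Over the cap: cut to two words plus the ellipsis marker.
capped = truncate_words("alpha beta gamma delta", max_words=2)
print(short)
print(capped)
```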
---

### 4. `run_interactive_process_session()`
- **Purpose**: Starts an **ad-hoc interactive Q&A session** over pasted raw text.
- **Process**:
  1. The user pastes free-form text and ends input by typing `DONE` on its own line (case-insensitive).
  2. An in-memory vector index (`VectorStoreIndex`) is built on the fly.
  3. The user iteratively enters questions or summary prompts.
  4. Each query is answered until the user types `exit` or `quit`.
- **Features**:
  - Warns if PHI-like text is detected in a query (the pasted text itself is not scanned).
  - Shows the runtime of each query.
  - Does not persist the index; it exists only for the session.

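The paste-until-`DONE` step can be exercised without a keyboard if the line source is passed in as a parameter. The following is a sketch under that assumption: `read_until_sentinel` and its `read_line` parameter are hypothetical names, not part of the notebook, and the EOF handling is an addition that the notebook's loop does not have.

```python
from typing import Callable


def read_until_sentinel(
    read_line: Callable[[], str] = input,
    sentinel: str = "DONE",
) -> str:
    """Collect lines until `sentinel` appears alone on a line (case-insensitive)."""
    lines = []
    while True:
        try:
            line = read_line()
        except EOFError:
            # Treat end-of-input like the sentinel (assumption, not notebook behavior).
            break
        if line.strip().upper() == sentinel.upper():
            break
        lines.append(line)
    return "\n".join(lines)


# Scripted input stands in for the user; the line after "done" is never read.
scripted = iter(["First line.", "Second line.", "done", "ignored"])
pasted = read_until_sentinel(read_line=lambda: next(scripted))
print(pasted)
```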
In [None]:
def run_interactive_process_session() -> None:
    """Interactive ad-hoc Q&A over pasted text.

    Not auto-invoked to keep Run-All non-blocking. Paste text, then ask questions.
    """
    from llama_index.core import VectorStoreIndex

    LLM_MODEL = "mistral"
    EMBED_MODEL = "BAAI/bge-small-en"
    TEMPERATURE = 0.1

    print("--- 🚀 Starting Interactive Processing Session 🚀 ---")
    print("Paste text to analyze. Type DONE on a new line to finish.")

    lines = []
    while True:
        line = input()
        if line.strip().upper() == "DONE":
            break
        lines.append(line)
    text_to_process = "\n".join(lines)
    if not text_to_process.strip():
        print("No text provided. Exiting session.")
        return

    notebook_logger.info("Received ad-hoc text for interactive session.")

    try:
        # Visual spinner for temporary index setup
        with _spinner("Preparing in-memory index …"):
            start_setup = time.time()
            Settings.llm = Ollama(model=LLM_MODEL, temperature=TEMPERATURE, request_timeout=120.0)
            Settings.embed_model = HuggingFaceEmbedding(model_name=EMBED_MODEL)
            Settings.chunk_size = 512

            temp_docs = [Document(text=text_to_process)]
            temp_index = VectorStoreIndex.from_documents(temp_docs)
            temp_engine = temp_index.as_query_engine()
            setup_duration = time.time() - start_setup

        notebook_logger.info(
            f"✅ In-memory index created for interactive session in {setup_duration:.2f}s."
        )
        print("\nYou can now ask questions. Type 'exit' or 'quit' to finish.")

        while True:
            query = input("\nEnter question/summary (or 'exit'): ").strip()
            if not query:
                continue  # Ignore empty input instead of sending a blank query.
            if query.lower() in {"exit", "quit"}:
                print("Exiting session.")
                break
            if check_for_phi(query):
                print("⚠️ Warning: Potential PHI detected in query.")
            start_q = time.time()
            response = temp_engine.query(query)
            print(f"\n💡 Answer:\n{response}")
            print(f"(Query processed in {time.time() - start_q:.2f} seconds)")
    except Exception as exc:  # noqa: BLE001
        notebook_logger.error(
            f"Failed during interactive processing session: {exc}", exc_info=True
        )
        print(f"Error: {exc}")

print("✅ Job helpers are defined.")
print("Setup complete. You can now run `run_index()`, `run_ask(question)`, or `run_summarize(topic)`.")

✅ Job helpers are defined.
Setup complete. You can now run `run_index()`, `run_ask(question)`, or `run_summarize(topic)`.


---
## Running the RAG Workflow — End-to-End Demo

This section shows how to execute the **full Retrieval-Augmented Generation pipeline** step by step.

### 1) Build (or Rebuild) the Index
- Runs `run_index()`, which:
  - Loads documents from `./docs`
  - Builds embeddings and a vector index
  - Persists the index into `./storage`
- Re-run it whenever files in `./docs` are added or updated, because queries read the persisted index rather than the source files.

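The rule that the index should be rebuilt after documents change can be checked mechanically. Below is a minimal sketch, assuming file modification times are a good enough signal. `newest_mtime` and `index_is_stale` are hypothetical helpers, not notebook functions, and the placeholder file name in the demo is illustrative only. The check does not notice deleted documents.

```python
import os
import tempfile
from pathlib import Path


def newest_mtime(directory: Path) -> float:
    """Latest modification time among files under `directory` (0.0 if none)."""
    if not directory.is_dir():
        return 0.0
    return max(
        (p.stat().st_mtime for p in directory.rglob("*") if p.is_file()),
        default=0.0,
    )


def index_is_stale(docs_dir: str = "./docs", storage_dir: str = "./storage") -> bool:
    """True if no index files exist or any document is newer than the index."""
    storage_time = newest_mtime(Path(storage_dir))
    if storage_time == 0.0:
        return True  # Nothing persisted yet.
    return newest_mtime(Path(docs_dir)) > storage_time


# Demo in a throwaway directory, with explicit mtimes so the result is deterministic.
with tempfile.TemporaryDirectory() as tmp:
    docs, storage = Path(tmp, "docs"), Path(tmp, "storage")
    docs.mkdir()
    storage.mkdir()
    (docs / "a.txt").write_text("v1")
    stale_without_index = index_is_stale(str(docs), str(storage))

    (storage / "index_placeholder.json").write_text("{}")
    os.utime(docs / "a.txt", (1_000, 1_000))
    os.utime(storage / "index_placeholder.json", (2_000, 2_000))
    stale_after_build = index_is_stale(str(docs), str(storage))

    os.utime(docs / "a.txt", (3_000, 3_000))  # Simulate editing the document.
    stale_after_edit = index_is_stale(str(docs), str(storage))

print(stale_without_index, stale_after_build, stale_after_edit)
```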
In [14]:
# 1) Build (or rebuild) the index
run_index()

--- Starting document ingestion and indexing ---

2025-08-23 01:30:33,467 - biomed_notebook - INFO - Loading documents from: ./docs
2025-08-23 01:30:33,467 - biomed_notebook - INFO - LlamaParse key not found. Using UnstructuredReader for PDFs.
2025-08-23 01:30:35,209 - INFO - pikepdf C++ to Python logger bridge initialized
2025-08-23 01:39:45,302 - biomed_notebook - INFO - Successfully loaded 1 document(s).
2025-08-23 01:39:45,309 - biomed_notebook - INFO - Using embedding model: BAAI/bge-small-en
2025-08-23 01:39:45,343 - INFO - Load pretrained SentenceTransformer: BAAI/bge-small-en
2025-08-23 01:39:47,508 - INFO - 1 prompt is loaded, with the key: query
2025-08-23 01:39:47,614 - biomed_notebook - INFO - Building index with chunk size 512 …
2025-08-23 02:13:44,997 - biomed_notebook - INFO - Index persisted to './storage'.

✅ Successfully built index with 1 documents.
--- Indexing complete ---

--- ⏱️ Function 'run_index' executed in 2591.74 seconds. ---

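The timestamps in the log above indicate where most of the 2591.74 seconds went. The following is a small worked calculation that uses only values copied from that log. The stage labels are an interpretation of the log messages, not names defined by the notebook.

```python
from datetime import datetime

LOG_TS_FORMAT = "%Y-%m-%d %H:%M:%S,%f"  # Matches the timestamps in the log.


def seconds_between(start: str, end: str) -> float:
    """Elapsed seconds between two log timestamps."""
    delta = datetime.strptime(end, LOG_TS_FORMAT) - datetime.strptime(start, LOG_TS_FORMAT)
    return delta.total_seconds()


# "Loading documents" -> "Successfully loaded 1 document(s)."
load_s = seconds_between("2025-08-23 01:30:33,467", "2025-08-23 01:39:45,302")
# "Using embedding model" -> "Building index with chunk size 512"
embed_init_s = seconds_between("2025-08-23 01:39:45,309", "2025-08-23 01:39:47,614")
# "Building index with chunk size 512" -> "Index persisted"
build_s = seconds_between("2025-08-23 01:39:47,614", "2025-08-23 02:13:44,997")

stages = [
    ("load documents", load_s),
    ("init embedding model", embed_init_s),
    ("build + persist index", build_s),
]
for label, secs in stages:
    print(f"{label:<24} {secs:9.3f} s")
```

Under this reading, loading the single document takes about 9 minutes and building and persisting the index takes about 34 minutes. Together these account for nearly all of the reported runtime.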

In [15]:
# 2) Ask a question (requires Ollama)
run_ask("What is the most common cause of oral cancer?")

# 3) Summarize a topic (requires Ollama)
run_summarize("findings on oral cancer and ethnicity")

# 4) Optional interactive session (NOT auto-run to keep Run-All non-blocking)
# run_interactive_process_session()


❓ Question: What is the most common cause of oral cancer?
2025-08-23 02:13:45,207 - biomed_notebook - INFO - Received query (content omitted for safety log).
2025-08-23 02:13:45,210 - biomed_notebook - INFO - 🧠 No QA engine found. Creating one …
2025-08-23 02:13:45,211 - biomed_notebook - INFO - Creating query engine …
2025-08-23 02:13:45,216 - INFO - Load pretrained SentenceTransformer: BAAI/bge-small-en
2025-08-23 02:13:47,294 - INFO - 1 prompt is loaded, with the key: query
2025-08-23 02:15:00,656 - INFO - Loading all indices.
2025-08-23 02:15:01,122 - INFO - HTTP Request: POST http://localhost:11434/api/show "HTTP/1.1 200 OK"
2025-08-23 02:15:01,125 - biomed_notebook - INFO - Query engine ready (LLM='mistral', top-k=4).
2025-08-23 02:15:01,125 - biomed_notebook - INFO - ✅ QA engine is ready.
2025-08-23 02:15:42,692 - INFO - HTTP Request: POST http://localhost:11434/api/chat "HTTP/1.1 200 OK"

✅ Answer:
 The provided context does not specify a most common cause of oral cancer across all patients. However, it shows that in the cases presented, long-term smokeless tobacco use and heavy tobacco use are associated with oral cancer. It's important to note that this is based on a small dataset and may not represent the general population. For a comprehensive understanding, further research or consultation with a healthcare professional would be recommended.

--- ⏱️ Function 'run_ask' executed in 117.55 seconds. ---

📖 Summarizing topic: findings on oral cancer and ethnicity
2025-08-23 02:15:42,764 - biomed_notebook - INFO - Received summarization request (content omitted for safety log).
2025-08-23 02:15:42,770 - biomed_notebook - INFO - 🧠 No Summary engine found. Creating one …
2025-08-23 02:15:42,771 - biomed_notebook - INFO - Creating query engine …
2025-08-23 02:15:42,788 - INFO - Load pretrained SentenceTransformer: BAAI/bge-small-en
2025-08-23 02:15:45,301 - INFO - 1 prompt is loaded, with the key: query
2025-08-23 02:16:53,103 - INFO - Loading all indices.
2025-08-23 02:16:53,567 - INFO - HTTP Request: POST http://localhost:11434/api/show "HTTP/1.1 200 OK"
2025-08-23 02:16:53,569 - biomed_notebook - INFO - Query engine ready (LLM='mistral', top-k=8).
2025-08-23 02:16:53,570 - biomed_notebook - INFO - ✅ Summary engine is ready.
2025-08-23 02:17:32,412 - INFO - HTTP Request: POST http://localhost:11434/api/chat "HTTP/1.1 200 OK"

✅ Summary:
 Oral cancer cases were presented for various patients, all showing non-healing ulcers on the lateral border of the tongue for varying durations. The lesions were painful and showed progressive growth. Associated symptoms included mild odynophagia. All patients had a history of long-term smokeless tobacco use.

Examinations revealed ulcerated, indurated lesions in various locations: floor of mouth, buccal mucosa, and left tonsillar pillar. The size of the lesions ranged from 2.5x1.5 cm to 3.0 cm. Keratin pearl formation was noted in some cases.

Lymphadenopathy on the ipsilateral side or a firm, non-mobile submandibular lymph node was palpable in several instances.

The final diagnoses included Well-Differentiated Squamous Cell Carcinoma (SCC), HPV-16 positive, and Poorly Differentiated SCC. The cancer was invasive and arose from the floor of mouth, buccal mucosa, or left tonsillar pillar in different cases.

The ethnicity of the patients was not explicitly mentioned in the