# Multimodal RAG (PDF → Text/Images/Tables → Cohere Embeddings → FAISS → Azure OpenAI Q&A)

This notebook implements a production-ready, step-by-step pipeline for *multimodal RAG* on PDFs:

1. **Extraction**: Separate functions for **text**, **images**, and **tables**.
2. **Heading-aware chunking**: Text is chunked by headings (each heading and its following body is one section).
3. **Embeddings**: Cohere embeddings for **text**, **tables** (Markdown), and **images** (data URI).
4. **Vector store**: Store all embeddings in FAISS with rich metadata.
5. **RAG query**: Use **Azure OpenAI** (via `AzureChatOpenAI`) to answer questions grounded in retrieved chunks.

At each stage, the notebook surfaces JSON records like:
```json
{
  "type": "table",
  "title": "Annual Sales Data",
  "embedding": [0.2, 0.3, 0.5],
  "metadata": {"source": "Report2025.pdf", "page": 12}
}
```

**Notes & Accuracy Choices**
- **Text / Headings**: Uses PyMuPDF's font-size & span inspection to detect headings robustly.
- **Images**: Uses PyMuPDF's `get_images()` / `extract_image()` + nearby caption detection (e.g., *Figure 1:*).
- **Tables**: Uses Camelot (both `lattice` and `stream`, with fallbacks) and pdfplumber as the final fallback. Captions are resolved heuristically from nearby lines (e.g., *Table 1:*).
- **Cohere embeddings**: A single model for all modalities (text + images) so vector dimensions match. Suggested default: `embed-multilingual-v3.0` (1024 dims). Changeable in config.
- **Vector store**: FAISS (cosine similarity) with normalized vectors.
- **LLM**: Azure OpenAI chat model via LangChain (your deployment name in `LLM_MODEL`).

Sections are function-based and callable individually so you can **verify results step-by-step**.

In [21]:
# %%capture
# If running first time, uncomment installs below:
# !pip install -U pymupdf camelot-py[cv] pdfplumber cohere langchain langchain-community langchain-openai faiss-cpu pandas numpy pillow
# Optional high-res PDF parsing (heavier deps):
# !pip install unstructured unstructured-inference

import os, re, io, base64, math, tempfile, json, textwrap
from typing import List, Dict, Any, Tuple
import numpy as np
import pandas as pd
import fitz  # PyMuPDF
import pdfplumber
import camelot
import cohere
from PIL import Image

# Vector store
from langchain_community.vectorstores import FAISS
from langchain_community.docstore.in_memory import InMemoryDocstore
from langchain_core.documents import Document
from langchain_openai import AzureChatOpenAI
import openai

try:
    import faiss  # noqa
except Exception as e:
    raise RuntimeError("FAISS import failed. Try installing faiss-cpu: pip install faiss-cpu")


## Config
Set your API keys and model names. We assume Azure OpenAI credentials are set in env vars.

- `COHERE_API_KEY`  
- `AZURE_OPENAI_TYPE`, `AZURE_OPENAI_ENDPOINT`, `AZURE_OPENAI_API_KEY`, `AZURE_OPENAI_API_VERSION`  
- `LLM_MODEL` (your Azure OpenAI **deployment** name)  

You can also pick a Cohere embeddings model. For **multimodal** (text + images) use a v3+ model, e.g. `embed-multilingual-v3.0`. For v4, set `EMBED_OUTPUT_DIM` if you want non-default output dims.

In [None]:
# === ENV / Model Config ===
COHERE_API_KEY = os.getenv("COHERE_API_KEY", "")
EMBED_MODEL    = os.getenv("EMBED_MODEL", )  # supports text + image, 1024 dims
EMBED_OUTPUT_DIM = None  # for embed-v4.0 you can set 256/512/1024/1536; leave None for model default

openai.api_type    = os.getenv("AZURE_OPENAI_TYPE", "")
openai.api_base    = os.getenv("AZURE_OPENAI_ENDPOINT", "")
openai.api_key     = os.getenv("AZURE_OPENAI_API_KEY", "")
openai.api_version = os.getenv("AZURE_OPENAI_API_VERSION","")
LLM_MODEL          = os.getenv("LLM_MODEL","")  # your Azure deployment name

assert COHERE_API_KEY, "Set COHERE_API_KEY"
assert openai.api_key and openai.api_base and openai.api_version, "Set Azure OpenAI env vars"

co = cohere.ClientV2(api_key=COHERE_API_KEY)
llm = AzureChatOpenAI(
    azure_deployment=LLM_MODEL,
    openai_api_key=openai.api_key,
    openai_api_version=openai.api_version,
    azure_endpoint=openai.api_base,
)


AssertionError: Set COHERE_API_KEY

## Utilities
Helper functions for cleaning, caption search, and data-URI conversion.

In [None]:
def _clean_text(s: str) -> str:
    s = s.replace('\u00ad', '')  # soft hyphen
    s = re.sub(r"\s+", " ", s)
    return s.strip()

def _to_data_uri(image_bytes: bytes, ext: str) -> str:
    fmt = "jpeg" if ext.lower() in ["jpg", "jpeg"] else "png"
    b64 = base64.b64encode(image_bytes).decode("utf-8")
    return f"data:image/{fmt};base64,{b64}"

def _table_to_markdown(df: pd.DataFrame) -> str:
    # Render as pipe-table markdown for embedding & readability
    return df.to_markdown(index=False)

def _cosine_normalize(mat: np.ndarray) -> np.ndarray:
    norms = np.linalg.norm(mat, axis=1, keepdims=True) + 1e-12
    return mat / norms

def _guess_caption_from_lines(lines: List[Tuple[float, float, str]], pivot_y: float, kind: str = "figure") -> str:
    """Find the nearest preceding line that looks like a caption.
    lines: list of (y0, y1, text) sorted top→bottom
    pivot_y: search above this y (top is small y)
    kind: 'figure' or 'table'
    """
    pattern = r"^(?:%s|%s)\s*\d+\s*[:\.-]?\s*(.+)$" % (kind.capitalize(), kind[:3].capitalize()+r"\.?")
    candidates = [t for (y0, y1, t) in lines if y1 <= pivot_y + 2]  # lines above or touching pivot
    for txt in reversed(candidates[-10:]):  # look in last 10 lines above
        m = re.match(pattern, txt.strip())
        if m:
            return _clean_text(m.group(0))
    # fallback: return empty
    return ""


## Text Extraction & Heading-Aware Chunking (PyMuPDF)
This uses `page.get_text("dict")` to inspect spans (font sizes, flags) and identify headings by **size percentile** + **bold/uppercase** heuristics, then groups the text between headings into sections.

In [35]:
# Improved text extraction function with hierarchical heading detection
def extract_text_sections_by_headings(pdf_path: str, skip_contents: bool = True, fallback_page_chunk: bool = True) -> List[Dict[str, Any]]:
    """
    Extract hierarchical text sections by recognizing headings and subheadings.
    The section title is built as 'Level1 heading -> Level2 heading'.
    Lines with heading levels deeper than 2 are treated as part of the body.

    Args:
        pdf_path: Path to a PDF file.
        skip_contents: If True, skip pages that begin with 'Contents' or 'Table of Contents'.
        fallback_page_chunk: If True and no headings are found, chunk by page.

    Returns:
        A list of section dictionaries with type, title, content, and metadata.
    """
    import fitz
    import re
    doc = fitz.open(pdf_path)
    sections: List[Dict[str, Any]] = []
    current_h1 = None
    current_h2 = None
    current_lines: List[str] = []
    section_start_page = 0
    headings_found = 0

    def flush_section(end_page: int):
        nonlocal current_lines, current_h1, current_h2, section_start_page
        if not current_lines:
            return
        text = ' '.join(current_lines).strip()
        if not text:
            current_lines.clear()
            return
        if current_h2:
            title = f"{current_h1} -> {current_h2}"
        elif current_h1:
            title = current_h1
        else:
            title = "Untitled Section"
        sections.append({
            'type': 'text',
            'title': title,
            'content': _clean_text(text),
            'metadata': {
                'page_start': section_start_page,
                'page_end': end_page
            }
        })
        current_lines.clear()

    for pno in range(len(doc)):
        page = doc[pno]
        # Skip contents page if the first line indicates a table of contents
        if skip_contents:
            first_line = page.get_text("text").strip().split(" ", 1)[0].strip().lower()
            if re.match(r'(table of contents|contents)', first_line):
                continue
        lines = page.get_text("text").split(" ")
        for line in lines:
            line_stripped = line.strip()
            if not line_stripped:
                continue
            # Numeric heading detection (e.g., '1 Introduction', '1.1 Purpose')
            num_match = re.match(r'^(\d+(?:\.\d+)*)\s+(.+)$', line_stripped)
            if num_match:
                number = num_match.group(1)
                heading_text = num_match.group(2).strip()
                level = number.count('.') + 1
                if level == 1:
                    flush_section(pno)
                    current_h1 = heading_text
                    current_h2 = None
                    section_start_page = pno
                    headings_found += 1
                    continue
                elif level == 2:
                    flush_section(pno)
                    if current_h1 is None:
                        current_h1 = heading_text
                        current_h2 = None
                    else:
                        current_h2 = heading_text
                    section_start_page = pno
                    headings_found += 1
                    continue
                else:
                    # Levels deeper than 2 are treated as normal text
                    pass
            # All-caps heading fallback (short uppercase phrases without numbers)
            cap_match = re.match(r'^[A-Z][A-Z\s]{2,}$', line_stripped)
            if cap_match and len(line_stripped.split()) <= 10:
                flush_section(pno)
                if current_h1 is None:
                    current_h1 = line_stripped.title()
                    current_h2 = None
                elif current_h2 is None:
                    current_h2 = line_stripped.title()
                else:
                    current_h2 = line_stripped.title()
                section_start_page = pno
                headings_found += 1
                continue
            # Otherwise, treat as body text
            current_lines.append(line_stripped)
        # End of line processing for page
    # Flush the final accumulated lines
    flush_section(len(doc) - 1)
    # Fallback: If no headings were detected, chunk by page
    if headings_found == 0 and fallback_page_chunk:
        sections = []
        for pno in range(len(doc)):
            page_text = doc[pno].get_text("text").strip()
            if not page_text:
                continue
            sections.append({
                'type': 'text',
                'title': f"Page {pno + 1}",
                'content': _clean_text(page_text),
                'metadata': {'page_start': pno, 'page_end': pno}
            })
    return sections


## Image Extraction (PyMuPDF)
Extracts images via `page.get_images()` / `doc.extract_image(xref)` and attempts to find a nearby caption (e.g., `Figure 1:`) from the page text. Returns JSON-like records with paths and metadata.

In [24]:
def extract_images(pdf_path: str, save_dir: str = "extracted_images") -> List[Dict[str, Any]]:
    os.makedirs(save_dir, exist_ok=True)
    doc = fitz.open(pdf_path)
    out = []
    for pno in range(len(doc)):
        page = doc[pno]
        # Prepare text lines for caption heuristics
        d = page.get_text("dict")
        lines = []
        for b in d.get('blocks', []):
            for l in b.get('lines', []):
                y0 = min(s.get('bbox', l.get('bbox', b.get('bbox')))[1] for s in l.get('spans', [])) if l.get('spans') else l.get('bbox', [0,0,0,0])[1]
                y1 = max(s.get('bbox', l.get('bbox', b.get('bbox')))[3] for s in l.get('spans', [])) if l.get('spans') else l.get('bbox', [0,0,0,0])[3]
                txt = _clean_text(" ".join(s.get('text','') for s in l.get('spans', [])))
                if txt:
                    lines.append((y0, y1, txt))
        lines.sort(key=lambda x: x[0])

        for img in page.get_images(full=True):
            xref = img[0]
            try:
                info = doc.extract_image(xref)
            except Exception:
                continue
            if not info:
                continue
            ext = info.get('ext', 'png')
            img_bytes = info.get('image')
            # save to disk for inspection
            img_name = f"page{pno+1}_xref{xref}.{ext}"
            img_path = os.path.join(save_dir, img_name)
            with open(img_path, 'wb') as f:
                f.write(img_bytes)
            # estimate vertical position using image rects (if available)
            rects = page.get_image_rects(xref)
            pivot_y = rects[0].y1 if rects else page.rect.height/2
            caption = _guess_caption_from_lines(lines, pivot_y, kind="figure")
            title = caption if caption else f"Image (Page {pno+1})"
            out.append({
                'type': 'image',
                'title': title,
                'image_path': img_path,
                'metadata': {'source': os.path.basename(pdf_path), 'page': pno+1}
            })
    return out


## Table Extraction (Camelot → pdfplumber fallback)
Tries Camelot `lattice` and `stream` per page; if nothing is detected, uses pdfplumber as a fallback. Also attempts to find **table captions** by scanning nearby lines for patterns like `Table 1:`.

In [25]:
def extract_tables(pdf_path: str) -> List[Dict[str, Any]]:
    doc = fitz.open(pdf_path)
    results = []
    # Build page lines for caption search
    page_lines = []
    for pno in range(len(doc)):
        page = doc[pno]
        d = page.get_text("dict")
        lines = []
        for b in d.get('blocks', []):
            for l in b.get('lines', []):
                y0 = min(s.get('bbox', l.get('bbox', b.get('bbox')))[1] for s in l.get('spans', [])) if l.get('spans') else l.get('bbox', [0,0,0,0])[1]
                y1 = max(s.get('bbox', l.get('bbox', b.get('bbox')))[3] for s in l.get('spans', [])) if l.get('spans') else l.get('bbox', [0,0,0,0])[3]
                txt = _clean_text(" ".join(s.get('text','') for s in l.get('spans', [])))
                if txt:
                    lines.append((y0, y1, txt))
        lines.sort(key=lambda x: x[0])
        page_lines.append(lines)

    # Try Camelot first (lattice then stream)
    try:
        tables_lattice = camelot.read_pdf(pdf_path, pages='all', flavor='lattice')
    except Exception:
        tables_lattice = None
    try:
        tables_stream = camelot.read_pdf(pdf_path, pages='all', flavor='stream')
    except Exception:
        tables_stream = None

    # Merge both
    camelot_tables = []
    for coll in [tables_lattice, tables_stream]:
        if coll is not None and coll.n > 0:
            camelot_tables.extend(list(coll))

    # If Camelot found none, fallback to pdfplumber
    if not camelot_tables:
        with pdfplumber.open(pdf_path) as pdf:
            for pno, page in enumerate(pdf.pages):
                try:
                    tbls = page.extract_tables()
                except Exception:
                    tbls = []
                for idx, data in enumerate(tbls or []):
                    df = pd.DataFrame(data[1:], columns=data[0]) if data and data[0] else pd.DataFrame(data)
                    title = _guess_caption_from_lines(page_lines[pno], pivot_y=page.height/2, kind="table") or f"Table (Page {pno+1})"
                    results.append({
                        'type': 'table',
                        'title': title,
                        'dataframe': df,
                        'metadata': {'source': os.path.basename(pdf_path), 'page': pno+1}
                    })
        return results

    # Build results from Camelot
    # Camelot tables have .df and .parsing_report with page info; order is original reading order
    for i, tbl in enumerate(camelot_tables, start=1):
        try:
            df = tbl.df
        except Exception:
            # Skip broken tables
            continue
        # page number from parsing_report or fallback to 1
        pg = int(tbl.parsing_report.get('page', 1)) if hasattr(tbl, 'parsing_report') else 1
        # Heuristic caption: search above table middle if table bbox is available
        try:
            # tbl._bbox is non-public; if not present, use mid-page as pivot
            bbox = getattr(tbl, '_bbox', None)
            pivot_y = (bbox[3] if bbox else doc[pg-1].rect.height/2)
        except Exception:
            pivot_y = doc[pg-1].rect.height/2
        title = _guess_caption_from_lines(page_lines[pg-1], pivot_y, kind="table") or f"Table (Page {pg})"
        results.append({
            'type': 'table',
            'title': title,
            'dataframe': df,
            'metadata': {'source': os.path.basename(pdf_path), 'page': pg}
        })
    return results


## Embeddings with Cohere (text + images)
Embeds:
- **Text** chunks (sections) and **tables** (as Markdown) using `input_type='search_document'`.
- **Images** by passing a **data URI** and `input_type='image'`.

All embeddings are stored back into the item dict under `embedding`.

In [None]:
def embed_items_with_cohere(items: List[Dict[str, Any]], co: cohere.ClientV2, model: str = EMBED_MODEL, output_dim: int = EMBED_OUTPUT_DIM) -> List[Dict[str, Any]]:
    texts = []
    text_idx = []
    images = []
    image_idx = []

    # Prepare payloads
    for i, it in enumerate(items):
        typ = it['type']
        if typ == 'text':
            texts.append(it['content'])
            text_idx.append(i)
        elif typ == 'table':
            md = _table_to_markdown(it['dataframe'])
            it['markdown'] = md
            texts.append(md)
            text_idx.append(i)
        elif typ == 'image':
            # load and convert to data URI
            with open(it['image_path'], 'rb') as f:
                img_bytes = f.read()
            # enforce max size 5MB
            if len(img_bytes) > 5 * 1024 * 1024:
                # downscale
                im = Image.open(io.BytesIO(img_bytes))
                im.thumbnail((1600, 1600))
                buf = io.BytesIO()
                im.save(buf, format='PNG')
                img_bytes = buf.getvalue()
            data_uri = _to_data_uri(img_bytes, 'png')
            images.append(data_uri)
            image_idx.append(i)

    # Batch embed texts
    if texts:
        kwargs = {
            'model': model,
            'input_type': 'search_document',
            'texts': texts,
        }
        if output_dim is not None:
            kwargs['output_dimension'] = output_dim
        resp = co.embed(**kwargs)
        vecs = np.array(resp.embeddings.float, dtype=np.float32)
        for idx, vec in zip(text_idx, vecs):
            items[idx]['embedding'] = vec.tolist()

    # Embed images (one at a time per API constraint)
    for idx, data_uri in zip(image_idx, images):
        kwargs = {
            'model': model,
            'input_type': 'image',
            'images': [data_uri]
        }
        if output_dim is not None:
            kwargs['output_dimension'] = output_dim
        resp = co.embed(**kwargs)
        vec = np.array(resp.embeddings.float[0], dtype=np.float32)
        items[idx]['embedding'] = vec.tolist()

    return items


## Store in FAISS (cosine)
We normalize vectors and store them in a single FAISS index so **text, tables, and images** are retrievable together. The `docstore` holds the original JSON per vector id.

In [None]:
class MixedEmbeddingStore:
    def __init__(self, dim: int):
        self.dim = dim
        self.index = faiss.IndexFlatIP(dim)
        self.docstore = InMemoryDocstore({})
        self.ids = []

    def add(self, items: List[Dict[str, Any]]):
        vecs = []
        new_ids = []
        for it in items:
            emb = it.get('embedding')
            if not emb:
                continue
            v = np.array(emb, dtype=np.float32)
            v = v / (np.linalg.norm(v) + 1e-12)
            vecs.append(v)
            new_id = f"doc-{len(self.ids) + len(new_ids)}"
            new_ids.append(new_id)
            # wrap as LangChain Document for compatibility
            meta = it.get('metadata', {}).copy()
            meta.update({'type': it['type'], 'title': it.get('title', '')})
            page_meta = {'page': meta.get('page'), 'start_page': meta.get('start_page'), 'end_page': meta.get('end_page')}
            content = it.get('content') or it.get('markdown') or it.get('image_path') or ''
            self.docstore.mset({new_id: Document(page_content=content, metadata=meta)})
        if vecs:
            mat = np.vstack(vecs)
            self.index.add(mat)
            self.ids.extend(new_ids)

    def search(self, query_vec: np.ndarray, top_k: int = 5) -> List[Tuple[str, float]]:
        q = query_vec.astype(np.float32)
        q = q / (np.linalg.norm(q) + 1e-12)
        D, I = self.index.search(q.reshape(1, -1), top_k)
        hits = []
        for d, i in zip(D[0], I[0]):
            if i == -1:
                continue
            hits.append((self.ids[i], float(d)))
        return hits

    def get(self, _id: str) -> Document:
        return self.docstore.search(_id)


## RAG Query
1. Embed the query with `input_type='search_query'`.
2. Retrieve top-k from FAISS.
3. Send *only retrieved contexts* to Azure OpenAI with a strict grounding instruction.

In [None]:
def embed_query(query: str, co: cohere.ClientV2, model: str = EMBED_MODEL, output_dim: int = EMBED_OUTPUT_DIM) -> np.ndarray:
    kwargs = {
        'model': model,
        'input_type': 'search_query',
        'texts': [query]
    }
    if output_dim is not None:
        kwargs['output_dimension'] = output_dim
    resp = co.embed(**kwargs)
    return np.array(resp.embeddings.float[0], dtype=np.float32)

def answer_with_azure_openai(question: str, hits: List[Tuple[str, float]], store: MixedEmbeddingStore, llm: AzureChatOpenAI) -> Dict[str, Any]:
    ctxs = []
    for _id, score in hits:
        doc = store.get(_id)
        meta = doc.metadata
        source = meta.get('source', 'unknown')
        page = meta.get('page', meta.get('start_page'))
        title = meta.get('title', '')
        ctxs.append(f"[type={meta.get('type')}, title={title}, source={source}, page={page}]\n{doc.page_content}")
    system_msg = (
        "You are a precise assistant. Answer ONLY using the provided context snippets. "
        "Cite the source filename and page(s) explicitly. If the answer isn't in the snippets, say you don't have it."
    )
    prompt = (
        f"Question: {question}\n\n"
        f"Context:\n" + "\n\n".join(ctxs[:5])
    )
    resp = llm.invoke([{"role": "system", "content": system_msg}, {"role": "user", "content": prompt}])
    return {"answer": resp.content, "contexts": ctxs}


## Orchestrator Functions
These wrap each stage so you can run and validate step-by-step.

In [36]:
def run_extraction(pdf_path: str) -> Dict[str, List[Dict[str, Any]]]:
    text_sections = extract_text_sections_by_headings(pdf_path)
    images = extract_images(pdf_path)
    tables = extract_tables(pdf_path)
    return {"text_sections": text_sections, "images": images, "tables": tables}

def run_embeddings(extracted: Dict[str, List[Dict[str, Any]]]) -> List[Dict[str, Any]]:
    items = []
    items.extend(extracted.get('text_sections', []))
    items.extend(extracted.get('images', []))
    items.extend(extracted.get('tables', []))
    items = embed_items_with_cohere(items, co)
    return items

def build_store(items: List[Dict[str, Any]]) -> MixedEmbeddingStore:
    # Detect embedding dimension from first vector
    first = next((it for it in items if it.get('embedding')), None)
    assert first is not None, "No embeddings produced."
    dim = len(first['embedding'])
    store = MixedEmbeddingStore(dim)
    store.add(items)
    return store

def query_pipeline(question: str, store: MixedEmbeddingStore, top_k: int = 5) -> Dict[str, Any]:
    qv = embed_query(question, co)
    hits = store.search(qv, top_k=top_k)
    return answer_with_azure_openai(question, hits, store, llm)


NameError: name 'MixedEmbeddingStore' is not defined

## Verification Helpers
Show JSON-like outputs in the specified format so you can confirm *type*, *title*, *embedding*, *metadata* at each step.

In [None]:
def preview_item_json(it: Dict[str, Any]) -> Dict[str, Any]:
    meta = it.get('metadata', {}).copy()
    # keep only key fields
    out = {
        'type': it.get('type'),
        'title': it.get('title'),
        'embedding': (it.get('embedding')[:3] if it.get('embedding') else None),
        'metadata': meta
    }
    # fill page convenience
    if 'start_page' in meta and 'end_page' in meta and 'page' not in meta:
        out['metadata']['page'] = meta['start_page'] if meta['start_page'] == meta['end_page'] else f"{meta['start_page']}-{meta['end_page']}"
    return out

def preview_many(items: List[Dict[str, Any]], n: int = 3):
    for it in items[:n]:
        print(json.dumps(preview_item_json(it), indent=2))


## Example Run (Step-by-Step)
Uncomment and point `PDF_PATH` to your file, then run the cells one by one to verify each stage.

In [None]:
PDF_PATH = r""

# # 1) Extraction
extracted = run_extraction(PDF_PATH)
print("Text sections:", len(extracted['text_sections']))
print("Images:", len(extracted['images']))
print("Tables:", len(extracted['tables']))

# # 2) Embeddings
items = run_embeddings(extracted)
preview_many(items, n=5)

# # 3) Vector store
store = build_store(items)

# # 4) Query
question = "What are the annual sales figures and where are they reported?"
result = query_pipeline(question, store, top_k=5)
print(result['answer'])
print("\nContexts used:\n", "\n---\n".join(result['contexts']))


  cols, rows, v_s, h_s = self._generate_columns_and_rows(bbox, user_cols)
  cols, rows, v_s, h_s = self._generate_columns_and_rows(bbox, user_cols)
  cols, rows, v_s, h_s = self._generate_columns_and_rows(bbox, user_cols)
  cols, rows, v_s, h_s = self._generate_columns_and_rows(bbox, user_cols)
  cols, rows, v_s, h_s = self._generate_columns_and_rows(bbox, user_cols)
  cols, rows, v_s, h_s = self._generate_columns_and_rows(bbox, user_cols)
  cols, rows, v_s, h_s = self._generate_columns_and_rows(bbox, user_cols)
  cols, rows, v_s, h_s = self._generate_columns_and_rows(bbox, user_cols)
  cols, rows, v_s, h_s = self._generate_columns_and_rows(bbox, user_cols)
  cols, rows, v_s, h_s = self._generate_columns_and_rows(bbox, user_cols)
  cols, rows, v_s, h_s = self._generate_columns_and_rows(bbox, user_cols)
  cols, rows, v_s, h_s = self._generate_columns_and_rows(bbox, user_cols)
  cols, rows, v_s, h_s = self._generate_columns_and_rows(bbox, user_cols)
  cols, rows, v_s, h_s = self._generat

Text sections: 5389
Images: 7
Tables: 310


### Optional: Persist/Load the Store
You can persist embeddings and metadata to disk and reload to avoid recomputing for the same PDF.

In [28]:
def save_items_json(items: List[Dict[str, Any]], json_path: str):
    with open(json_path, 'w', encoding='utf-8') as f:
        json.dump(items, f, ensure_ascii=False)

def load_items_json(json_path: str) -> List[Dict[str, Any]]:
    with open(json_path, 'r', encoding='utf-8') as f:
        return json.load(f)


## Save Extracted Chunks, Images, and Tables to JSON
This function organizes the extracted text sections, images, and tables into a single JSON file with metadata so you can persist the results or inspect them easily.

In [32]:
def save_extracted_to_json(text_sections: List[Dict[str, Any]], images: List[Dict[str, Any]], tables: List[Dict[str, Any]], json_path: str) -> str:
    """
    Save the extracted text sections, images, and tables into a JSON file.

    DataFrames in tables are converted into simple serializable structures with
    headers and rows so that json.dump does not fail.

    Args:
        text_sections: List of text section dicts (each has type='text', title, content, metadata)
        images: List of image dicts (type='image', title, image_path/data_uri, metadata)
        tables: List of table dicts (each may contain a pandas DataFrame under 'dataframe')
        json_path: Path to output JSON file

    Returns:
        The path to the saved JSON file.
    """
    import json
    serializable_tables = []
    for tbl in tables:
        # Convert pandas DataFrame into a serializable structure
        df = tbl.get('dataframe')
        if df is not None and hasattr(df, 'values'):
            headers = list(df.columns)
            rows = df.values.tolist()
            serializable_tables.append({
                'type': 'table',
                'title': tbl.get('title'),
                'headers': headers,
                'rows': rows,
                'metadata': tbl.get('metadata', {})
            })
        else:
            # If no DataFrame is present, assume already serializable
            serializable_tables.append(tbl)
    data = {
        'text_sections': text_sections,
        'images': images,
        'tables': serializable_tables
    }
    with open(json_path, 'w', encoding='utf-8') as f:
        json.dump(data, f, ensure_ascii=False, indent=2)
    return json_path


In [33]:
output_json_path = "extracted_results.json"
save_extracted_to_json(
    extracted["text_sections"],
    extracted["images"],
    extracted["tables"],
    output_json_path
)
print(f"Extracted results saved to {output_json_path}")

Extracted results saved to extracted_results.json


## Azure AI Search Integration
This section defines functions to create an Azure AI Search index suitable for storing mixed modalities (text, images, tables) and upload your documents with embeddings. The index schema includes fields for the document type, titles, content, page numbers, and vector embeddings. It follows guidance from Microsoft's documentation so that the index can be used with vector search.

In [None]:
def create_azure_ai_search_index(service_endpoint: str, admin_key: str, index_name: str, vector_dim: int):
    """
    Create (or recreate) an Azure AI Search index with fields suitable for mixed-modality retrieval.

    The index contains:
      - id (string, key)
      - doc_type (filterable string to distinguish text/table/image)
      - title (searchable string)
      - content (searchable string) for text and tables
      - embedding (vector field for semantic search)
      - page_start, page_end (filterable ints)
      - image_path (string) optional for images

    A HNSW vector search configuration is used with cosine similarity.
    """
    from azure.core.credentials import AzureKeyCredential
    from azure.search.documents.indexes import SearchIndexClient
    from azure.search.documents.indexes.models import (
        SearchIndex, SimpleField, SearchableField, VectorField, VectorSearch,
        VectorSearchAlgorithmConfiguration, HnswParameters
    )

    credential = AzureKeyCredential(admin_key)
    index_client = SearchIndexClient(endpoint=service_endpoint, credential=credential)

    fields = [
        SimpleField(name="id", type="Edm.String", key=True),
        SimpleField(name="doc_type", type="Edm.String", filterable=True, facetable=True),
        SearchableField(name="title", type="Edm.String", analyzer_name="en.lucene"),
        SearchableField(name="content", type="Edm.String", analyzer_name="en.lucene"),
        VectorField(name="embedding", vector_dimensions=vector_dim, vector_search_configuration="my-vector-config"),
        SimpleField(name="page_start", type="Edm.Int32", filterable=True),
        SimpleField(name="page_end", type="Edm.Int32", filterable=True),
        SearchableField(name="image_path", type="Edm.String", analyzer_name=None)
    ]

    vector_search = VectorSearch(
        algorithm_configurations=[
            VectorSearchAlgorithmConfiguration(
                name="my-vector-config",
                kind="hnsw",
                hnsw_parameters=HnswParameters(
                    m=4,
                    ef_construction=400,
                    ef_search=500,
                    metric="cosine"
                )
            )
        ]
    )

    index = SearchIndex(name=index_name, fields=fields, vector_search=vector_search)
    try:
        index_client.delete_index(index_name)
    except Exception:
        pass
    created = index_client.create_or_update_index(index)
    return created


def prepare_documents_for_azure(items: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """
    Prepare a list of documents from extracted items for uploading to Azure AI Search.

    Each returned dict contains the fields defined in the index schema. Note that image documents
    store the image path instead of content, while text and table documents use the 'content' key.
    """
    import uuid
    docs = []
    for i, it in enumerate(items):
        doc = {
            'id': str(uuid.uuid4()),
            'doc_type': it['type'],
            'title': it.get('title', ''),
            'content': it.get('content', it.get('table_md', '')) if it['type'] != 'image' else '',
            'embedding': it.get('embedding').tolist() if it.get('embedding') is not None else None,
            'page_start': it.get('metadata', {}).get('page_start'),
            'page_end': it.get('metadata', {}).get('page_end'),
            'image_path': it.get('path', '') if it['type'] == 'image' else ''
        }
        docs.append(doc)
    return docs


def upload_documents_to_azure(service_endpoint: str, admin_key: str, index_name: str, documents: List[Dict[str, Any]]):
    """
    Upload a batch of documents to Azure AI Search. Documents should already include vector embeddings.

    Returns the result of the upload operation.
    """
    from azure.core.credentials import AzureKeyCredential
    from azure.search.documents import SearchClient

    search_client = SearchClient(endpoint=service_endpoint, index_name=index_name, credential=AzureKeyCredential(admin_key))
    # Upload in batches
    batch_size = 1000
    results = []
    for i in range(0, len(documents), batch_size):
        batch = documents[i:i+batch_size]
        result = search_client.upload_documents(documents=batch)
        results.append(result)
    return results


## Querying Azure AI Search with LLM
This function demonstrates how to query the Azure AI Search index using a natural language question. It embeds the query, performs a vector search against the index, and then uses an Azure OpenAI chat model to generate an answer grounded in the retrieved contexts.

In [34]:
def query_with_llm(query: str, co: cohere.ClientV2, search_client, top_k: int = 5) -> str:
    """
    Answer a question using the Azure AI Search index and an LLM.

    Steps:
      1. Embed the query with Cohere using input_type='search_query'.
      2. Perform a vector search on the index to retrieve the most relevant documents.
      3. Combine the retrieved contexts and send them to the Azure OpenAI chat completion API.

    Args:
        query: The natural language question.
        co: A Cohere Client used to embed the query.
        search_client: An instance of azure.search.documents.SearchClient connected to the target index.
        top_k: How many documents to retrieve.

    Returns:
        The LLM's answer as a string.
    """
    # Embed the query
    query_embedding = embed_query(query, co)
    # Perform vector search
    results = search_client.search(
        search_text="",
        vector=query_embedding.tolist(),
        top_k=top_k,
        vector_fields="embedding",
        select=["title", "content", "doc_type", "page_start", "page_end"]
    )
    contexts = []
    for r in results:
        # Some documents may have empty content (images). Skip those for context.
        if r.get('content'):
            contexts.append(r['content'])
    # Build a prompt for the chat model
    joined_contexts = "".join(contexts)
    messages = [
        {"role": "system", "content": "You are a helpful assistant that answers questions using provided document contexts. If the answer is not contained in the contexts, respond that you don't know."},
        {"role": "user", "content": f"Context:{joined_contexts} Question: {query}"}
    ]
    # Ask the LLM
    response = openai.ChatCompletion.create(
        model=LLM_MODEL,
        messages=messages,
        temperature=0.0
    )
    return response['choices'][0]['message']['content']
