# PDF Extraction Quality Inspector

This notebook downloads sample PDFs (from the Azure AI Search sample data repository) and runs extraction to inspect:
- **Extracted text** — page by page
- **Metadata** — title, author, creation date, producer, keywords, etc.
- **Quality metrics** — word count, char count, extraction completeness

Use this to evaluate and improve the `PdfCracker` in the simulator.

## 1. Install and Import Dependencies

In [30]:
# Install required packages (run once)
%pip install pymupdf pdfplumber requests pandas tabulate jpype1 azure-ai-documentintelligence python-dotenv

Collecting azure-ai-documentintelligence
  Downloading azure_ai_documentintelligence-1.0.2-py3-none-any.whl.metadata (53 kB)
Downloading azure_ai_documentintelligence-1.0.2-py3-none-any.whl (106 kB)
Installing collected packages: azure-ai-documentintelligence
Successfully installed azure-ai-documentintelligence-1.0.2
Note: you may need to restart the kernel to use updated packages.


In [2]:
import os
import json
import time
import textwrap
from pathlib import Path
from collections import defaultdict

import fitz  # PyMuPDF
import pdfplumber
import requests
import pandas as pd
from IPython.display import display, HTML, Markdown

pd.set_option("display.max_colwidth", 120)
pd.set_option("display.max_rows", 100)

# Max chars of extracted text to display per document (0 = no limit, show all)
MAX_TEXT_DISPLAY = 0

print("✅ All libraries imported successfully")


✅ All libraries imported successfully


## 2. Download Sample PDF Files

We use the same PDFs from the [Azure-Samples/azure-search-sample-data](https://github.com/Azure-Samples/azure-search-sample-data) repository (health-plan folder). These are real-world documents with varying complexity.

In [3]:
# Sample PDF URLs from the Azure-Samples/azure-search-sample-data repository
SAMPLE_PDFS = {
    "employee_handbook": {
        "url": "https://raw.githubusercontent.com/Azure-Samples/azure-search-sample-data/main/health-plan/employee_handbook.pdf",
        "title": "Employee Handbook",
        "category": "HR",
    },
    "Benefit_Options": {
        "url": "https://raw.githubusercontent.com/Azure-Samples/azure-search-sample-data/main/health-plan/Benefit_Options.pdf",
        "title": "Benefit Options",
        "category": "Benefits",
    },
    "PerksPlus": {
        "url": "https://raw.githubusercontent.com/Azure-Samples/azure-search-sample-data/main/health-plan/PerksPlus.pdf",
        "title": "Perks Plus Program",
        "category": "Benefits",
    }
}

PDF_DIR = Path("../data/pdfs")
PDF_DIR.mkdir(parents=True, exist_ok=True)

downloaded = {}

# 1) Download remote sample PDFs (if not already present)
for doc_id, info in SAMPLE_PDFS.items():
    pdf_path = PDF_DIR / f"{doc_id}.pdf"
    if pdf_path.exists():
        print(f"📄 {info['title']} — already exists ({pdf_path.stat().st_size:,} bytes)")
    else:
        print(f"📥 Downloading {info['title']}...")
        resp = requests.get(info["url"], timeout=30)
        resp.raise_for_status()
        pdf_path.write_bytes(resp.content)
        print(f"   ✅ Saved ({len(resp.content):,} bytes)")
    downloaded[doc_id] = pdf_path

# 2) Discover any additional local PDF files in the same directory
local_count = 0
for pdf_path in sorted(PDF_DIR.glob("*.pdf")):
    doc_id = pdf_path.stem  # filename without extension
    if doc_id in downloaded:
        continue  # already registered from the remote list above
    # Defensive check: the PDFBox JAR ends in .jar and should never match *.pdf
    if "pdfbox" in doc_id.lower():
        continue
    downloaded[doc_id] = pdf_path
    SAMPLE_PDFS[doc_id] = {
        "title": doc_id.replace("_", " ").replace("-", " ").title(),
        "category": "Local",
    }
    local_count += 1
    print(f"📂 Local PDF: {pdf_path.name} ({pdf_path.stat().st_size:,} bytes)")

remote_count = len(downloaded) - local_count
print(f"\n✅ {len(downloaded)} PDF files ready ({remote_count} remote, {local_count} local) in {PDF_DIR.resolve()}")

📄 Employee Handbook — already exists (142,977 bytes)
📄 Benefit Options — already exists (544,811 bytes)
📄 Perks Plus Program — already exists (115,310 bytes)
📂 Local PDF: 0000950170-25-061046.pdf (2,179,871 bytes)
📂 Local PDF: 0000950170-25-100235.pdf (3,024,506 bytes)
📂 Local PDF: 0001193125-25-256321.pdf (1,802,290 bytes)
📂 Local PDF: 0001193125-26-027207.pdf (2,257,229 bytes)

✅ 7 PDF files ready (3 remote, 4 local) in C:\Projets\AzureAISimulator\samples\data\pdfs
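
The download loop above skips any file that already exists, so a truncated download or a cached HTML error page would be reused on every later run. The following is a minimal sketch of one way to guard against that, not the notebook's actual implementation. `save_if_missing` and its `fetch` callable are hypothetical names introduced for illustration. The sketch checks for the `%PDF-` header and writes through a temporary file.

```python
import tempfile
from pathlib import Path
from typing import Callable

PDF_MAGIC = b"%PDF-"  # header marker expected at (or near) the start of a PDF file


def save_if_missing(path: Path, fetch: Callable[[], bytes]) -> bool:
    """Hypothetical helper: write fetch() to `path` unless the file already exists.

    Returns True when a new file was written, False when it was already there.
    """
    if path.exists():
        return False
    data = fetch()
    # Reject HTML error pages and similar non-PDF payloads before caching them.
    # Some readers tolerate a few bytes before the header, so search the first 1 KiB.
    if PDF_MAGIC not in data[:1024]:
        raise ValueError(f"{path.name}: payload does not look like a PDF")
    # Write to a temporary sibling, then rename, so an interrupted write
    # never leaves a truncated file under the final name.
    tmp = path.with_suffix(path.suffix + ".part")
    tmp.write_bytes(data)
    tmp.replace(path)
    return True


# Demo with an in-memory payload instead of a network call.
with tempfile.TemporaryDirectory() as tmp_dir:
    target = Path(tmp_dir) / "sample.pdf"
    first = save_if_missing(target, lambda: b"%PDF-1.7\n%%EOF\n")
    second = save_if_missing(target, lambda: b"%PDF-1.7\n%%EOF\n")
    print(first, second)
```

In the notebook, `fetch` could wrap the existing `requests.get(...)` call. The `.part` suffix keeps partial files out of the `*.pdf` glob used for local discovery.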


## 3. Configure PDF Extraction Functions

We set up extraction functions for the Python libraries:
- **PyMuPDF** (`fitz`) — fast, C-based, handles complex layouts
- **pdfplumber** — pure-Python, good table extraction, detailed character info

> PDFBox (Java via JPype) extraction is configured in Section 5b below.

In [4]:
def extract_with_pymupdf(pdf_path: Path) -> dict:
    """Extract text and metadata using PyMuPDF (fitz)."""
    doc = fitz.open(str(pdf_path))
    
    pages = []
    full_text_parts = []
    for i, page in enumerate(doc):
        text = page.get_text("text")
        pages.append({
            "page_num": i + 1,
            "text": text,
            "char_count": len(text),
            "word_count": len(text.split()),
            "width": page.rect.width,
            "height": page.rect.height,
            "images": len(page.get_images(full=True)),
            "links": len(page.get_links()),
        })
        full_text_parts.append(text)
    
    metadata = doc.metadata or {}
    full_text = "\n\n".join(full_text_parts)
    
    result = {
        "library": "PyMuPDF",
        "file": pdf_path.name,
        "file_size": pdf_path.stat().st_size,
        "page_count": len(doc),
        "full_text": full_text,
        "total_chars": len(full_text),
        "total_words": len(full_text.split()),
        "pages": pages,
        "metadata": {
            "title": metadata.get("title", ""),
            "author": metadata.get("author", ""),
            "subject": metadata.get("subject", ""),
            "keywords": metadata.get("keywords", ""),
            "creator": metadata.get("creator", ""),
            "producer": metadata.get("producer", ""),
            "creation_date": metadata.get("creationDate", ""),
            "mod_date": metadata.get("modDate", ""),
            "format": metadata.get("format", ""),
            "encryption": metadata.get("encryption", None),
        },
    }
    doc.close()
    return result


def extract_with_pdfplumber(pdf_path: Path) -> dict:
    """Extract text and metadata using pdfplumber."""
    pdf = pdfplumber.open(str(pdf_path))
    
    pages = []
    full_text_parts = []
    for i, page in enumerate(pdf.pages):
        text = page.extract_text() or ""
        tables = page.extract_tables()
        pages.append({
            "page_num": i + 1,
            "text": text,
            "char_count": len(text),
            "word_count": len(text.split()),
            "width": page.width,
            "height": page.height,
            "tables_found": len(tables),
            "chars_count_raw": len(page.chars),
        })
        full_text_parts.append(text)
    
    metadata = pdf.metadata or {}
    full_text = "\n\n".join(full_text_parts)
    
    result = {
        "library": "pdfplumber",
        "file": pdf_path.name,
        "file_size": pdf_path.stat().st_size,
        "page_count": len(pdf.pages),
        "full_text": full_text,
        "total_chars": len(full_text),
        "total_words": len(full_text.split()),
        "pages": pages,
        "metadata": {
            "title": metadata.get("Title", metadata.get("title", "")),
            "author": metadata.get("Author", metadata.get("author", "")),
            "subject": metadata.get("Subject", metadata.get("subject", "")),
            "keywords": metadata.get("Keywords", metadata.get("keywords", "")),
            "creator": metadata.get("Creator", metadata.get("creator", "")),
            "producer": metadata.get("Producer", metadata.get("producer", "")),
            "creation_date": metadata.get("CreationDate", metadata.get("creationDate", "")),
            "mod_date": metadata.get("ModDate", metadata.get("modDate", "")),
        },
    }
    pdf.close()
    return result

print("✅ Extraction functions defined")

✅ Extraction functions defined
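
Both functions above return `creation_date` and `mod_date` as the raw strings stored in the PDF, which typically follow the PDF date format (`D:YYYYMMDDHHmmSS+HH'mm'`, with everything after the year optional). To compare dates across extractors it can help to normalize them. The sketch below is one possible parser, assuming that format. `parse_pdf_date` is a hypothetical helper, and the sample strings are illustrative rather than taken from the sample files.

```python
import re
from datetime import datetime, timedelta, timezone
from typing import Optional

# D:YYYYMMDDHHmmSS followed by an optional zone: Z, +HH'mm' or -HH'mm'
_PDF_DATE = re.compile(
    r"^(?:D:)?(?P<year>\d{4})(?P<month>\d{2})?(?P<day>\d{2})?"
    r"(?P<hour>\d{2})?(?P<minute>\d{2})?(?P<second>\d{2})?"
    r"(?:(?P<sign>[+\-Z])(?:(?P<tzh>\d{2})'?(?P<tzm>\d{2})?'?)?)?$"
)


def parse_pdf_date(value: str) -> Optional[datetime]:
    """Parse a PDF date string; return None when it does not match the format."""
    m = _PDF_DATE.match(value.strip())
    if not m:
        return None
    g = m.groupdict()
    tz = None  # no zone information means a naive datetime
    if g["sign"] == "Z":
        tz = timezone.utc
    elif g["sign"] in ("+", "-"):
        offset = timedelta(hours=int(g["tzh"] or 0), minutes=int(g["tzm"] or 0))
        tz = timezone(offset if g["sign"] == "+" else -offset)
    try:
        # Missing month/day default to 1, missing time fields default to 0.
        return datetime(
            int(g["year"]), int(g["month"] or 1), int(g["day"] or 1),
            int(g["hour"] or 0), int(g["minute"] or 0), int(g["second"] or 0),
            tzinfo=tz,
        )
    except ValueError:
        return None  # digits matched but do not form a valid date


parsed = parse_pdf_date("D:20230306135720+00'00'")
print(parsed.isoformat() if parsed else None)
```

Returning `None` instead of raising keeps the helper safe to apply to every metadata dictionary, including PDFs with empty or malformed date fields.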


## 4. Run PDF Extraction on All Downloaded Files

Run the extractors on each PDF and store the results. The pdfplumber block in the cell below is commented out, so this run produces PyMuPDF results only.

In [5]:
# Run both extractors on each PDF
# Large/complex PDFs can make pdfplumber very slow — we use a per-file timeout.
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FuturesTimeoutError

EXTRACTION_TIMEOUT_SEC = 120  # max seconds per extraction per file

results_pymupdf = {}
results_pdfplumber = {}
timings_pymupdf = {}
timings_pdfplumber = {}

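def _run_with_timeout(func, pdf_path, timeout=EXTRACTION_TIMEOUT_SEC):
    """Run an extraction function with a timeout. Returns (result, elapsed) or raises.

    A timed-out worker thread cannot be cancelled; it keeps running in the
    background until the extraction finishes on its own.
    """
    # Avoid `with ThreadPoolExecutor(...)`: its implicit shutdown(wait=True)
    # blocks until the worker finishes, which defeats the timeout.
    executor = ThreadPoolExecutor(max_workers=1)
    try:
        t0 = time.perf_counter()
        future = executor.submit(func, pdf_path)
        result = future.result(timeout=timeout)
        elapsed = time.perf_counter() - t0
        return result, elapsed
    finally:
        executor.shutdown(wait=False)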

for doc_id, pdf_path in downloaded.items():
    title = SAMPLE_PDFS[doc_id]["title"]
    print(f"🔍 Processing: {title} ({pdf_path.name})")

    try:
        result, elapsed = _run_with_timeout(extract_with_pymupdf, pdf_path)
        results_pymupdf[doc_id] = result
        timings_pymupdf[doc_id] = elapsed
        print(f"   PyMuPDF:    {result['page_count']} pages, "
              f"{result['total_words']:,} words, "
              f"{result['total_chars']:,} chars  "
              f"({elapsed*1000:.1f} ms)")
    except FuturesTimeoutError:
        print(f"   ⏰ PyMuPDF timed out after {EXTRACTION_TIMEOUT_SEC}s — skipped")
    except Exception as e:
        print(f"   ❌ PyMuPDF failed: {e}")

    #try:
    #    result, elapsed = _run_with_timeout(extract_with_pdfplumber, pdf_path)
    #    results_pdfplumber[doc_id] = result
    #    timings_pdfplumber[doc_id] = elapsed
    #    print(f"   pdfplumber: {result['page_count']} pages, "
    #          f"{result['total_words']:,} words, "
    #          f"{result['total_chars']:,} chars  "
    #          f"({elapsed*1000:.1f} ms)")
    #except FuturesTimeoutError:
    #    print(f"   ⏰ pdfplumber timed out after {EXTRACTION_TIMEOUT_SEC}s — skipped")
    #except Exception as e:
    #    print(f"   ❌ pdfplumber failed: {e}")

    print()

print(f"✅ Extraction complete: PyMuPDF={len(results_pymupdf)}, pdfplumber={len(results_pdfplumber)} of {len(downloaded)} documents")

🔍 Processing: Employee Handbook (employee_handbook.pdf)
   PyMuPDF:    11 pages, 2,370 words, 16,118 chars  (422.2 ms)

🔍 Processing: Benefit Options (Benefit_Options.pdf)
   PyMuPDF:    4 pages, 507 words, 3,677 chars  (805.0 ms)

🔍 Processing: Perks Plus Program (PerksPlus.pdf)
   PyMuPDF:    4 pages, 432 words, 2,907 chars  (267.3 ms)

🔍 Processing: 0000950170 25 061046 (0000950170-25-061046.pdf)
   PyMuPDF:    72 pages, 32,946 words, 242,269 chars  (2512.9 ms)

🔍 Processing: 0000950170 25 100235 (0000950170-25-100235.pdf)
   PyMuPDF:    158 pages, 72,865 words, 510,412 chars  (2897.5 ms)

🔍 Processing: 0001193125 25 256321 (0001193125-25-256321.pdf)
   PyMuPDF:    67 pages, 29,166 words, 213,507 chars  (1901.8 ms)

🔍 Processing: 0001193125 26 027207 (0001193125-26-027207.pdf)
   PyMuPDF:    71 pages, 31,837 words, 236,416 chars  (2660.3 ms)

✅ Extraction complete: PyMuPDF=7, pdfplumber=0 of 7 documents


## 5. Run Simulator's PdfCracker (PdfPig / C#)

Call the simulator's C# document crackers via the `DocumentCrackingTool` CLI wrapper.

In [6]:
# Import the Python wrapper for the C# document cracking tool
import sys
sys.path.insert(0, str(Path("../../tools/DocumentCrackingTool").resolve()))
from document_cracking import DocumentCracker

# Initialize (auto-builds the .NET tool on first use)
cracker = DocumentCracker()

# List all available crackers
crackers_info = cracker.list_crackers()
print("Available simulator document crackers:")
for c in crackers_info:
    exts = ", ".join(c["supportedExtensions"])
    types = ", ".join(c["supportedContentTypes"])
    print(f"  📦 {c['name']:20s}  extensions: {exts:30s}  types: {types}")

🔨 Building DocumentCrackingTool...
✅ Build successful
Available simulator document crackers:
  📦 PdfCracker            extensions: .pdf                            types: application/pdf
  📦 PlainTextCracker      extensions: .txt, .md, .markdown, .text     types: text/plain, text/markdown, text/x-markdown
  📦 HtmlCracker           extensions: .html, .htm, .xhtml             types: text/html, application/xhtml+xml
  📦 JsonCracker           extensions: .json                           types: application/json, text/json
  📦 CsvCracker            extensions: .csv, .tsv                      types: text/csv, text/comma-separated-values, application/csv
  📦 ExcelCracker          extensions: .xlsx                           types: application/vnd.openxmlformats-officedocument.spreadsheetml.sheet, application/vnd.ms-excel
  📦 WordDocCracker        extensions: .docx                           types: application/vnd.openxmlformats-officedocument.wordprocessingml.document, application/msword


### Run PdfCracker on all sample PDFs

In [7]:
# Run the simulator's PdfCracker on each PDF
results_pdfpig = {}
timings_pdfpig = {}

for doc_id, pdf_path in downloaded.items():
    title = SAMPLE_PDFS[doc_id]["title"]
    print(f"🔍 Running simulator crackers on: {title}")
    
    t0 = time.perf_counter()
    result = cracker.crack(str(pdf_path), crackers=["PdfCracker"])
    timings_pdfpig[doc_id] = time.perf_counter() - t0
    results_pdfpig[doc_id] = result
    
    # Show the PdfCracker result
    for c in cracker.get_successful_crackers(result):
        # Default the counts to 0: the ',' format spec raises ValueError on a '?' string
        print(f"   ✅ {c['crackerName']}: {c.get('pageCount', '?')} pages, "
              f"{c.get('wordCount', 0):,} words, "
              f"{c.get('characterCount', 0):,} chars")
        if c.get("title"):
            print(f"      Title: {c['title']}")
        if c.get("author"):
            print(f"      Author: {c['author']}")
        if c.get("createdDate"):
            print(f"      Created: {c['createdDate']}")
        if c.get("metadata"):
            for k, v in c["metadata"].items():
                print(f"      {k}: {v}")
        extraction_ms = c.get('extractionTimeMs', 0)
        print(f"   ⏱️  C# extraction: {extraction_ms:.1f} ms  |  total with CLI overhead: {timings_pdfpig[doc_id]*1000:.0f} ms")
    print()

🔍 Running simulator crackers on: Employee Handbook
   ✅ PdfCracker: 11 pages, 2,367 words, 15,777 chars
      Author: python-docx
      Created: 2023-03-06T13:57:20.0000000+00:00
      creator: Microsoft® Word for Microsoft 365
      producer: Microsoft® Word for Microsoft 365
      pdfVersion: 1,7
   ⏱️  C# extraction: 2288.4 ms  |  total with CLI overhead: 3365 ms

🔍 Running simulator crackers on: Benefit Options
   ✅ PdfCracker: 4 pages, 609 words, 4,289 chars
      Author: Liam Cavanagh
      Created: 2023-03-06T13:58:20.0000000+00:00
      creator: Microsoft® Word for Microsoft 365
      producer: Microsoft® Word for Microsoft 365
      pdfVersion: 1,7
   ⏱️  C# extraction: 570.8 ms  |  total with CLI overhead: 1654 ms

🔍 Running simulator crackers on: Perks Plus Program
   ✅ PdfCracker: 4 pages, 432 words, 2,831 chars
      Author: Liam Cavanagh
      Created: 2023-03-07T10:33:37.0000000+00:00
      creator: Microsoft® Word for Microsoft 365
      producer: Microsoft® Word for Mi

## 5b. Run PDFBox (Java via JPype)

Apache PDFBox is the PDF extraction engine used by **Apache Tika**, which is what **Azure AI Search uses internally** for document cracking. This gives us the most authentic reference point.

We call PDFBox's Java API directly from Python using [JPype](https://jpype.readthedocs.io/).

### Prerequisites

| Requirement | Details |
|---|---|
| **Java Runtime (JRE) or JDK** | Version 11 or later (17+ recommended). JPype needs a JVM (`jvm.dll` on Windows, `libjvm.so` on Linux) to start. |
| **`jpype1` Python package** | Installed via `pip install jpype1` (already included in the pip cell above). |
| **PDFBox JAR** | Downloaded automatically by the cell below from Maven Central. |

### Installing a JRE / JDK

**Option A — System-wide install (recommended, requires admin)**
- **Windows**: `winget install Microsoft.OpenJDK.21` or install [Eclipse Temurin](https://adoptium.net/). The installer sets `JAVA_HOME` automatically.
- **macOS**: `brew install openjdk@21`
- **Linux (Debian/Ubuntu)**: `sudo apt install openjdk-21-jre-headless`

**Option B — Portable / no-admin install (Windows)**
1. Download an OpenJDK **zip** archive (e.g. [Adoptium Temurin JRE releases](https://github.com/adoptium/temurin21-binaries/releases)).
2. Extract it to a local folder, e.g. `%LOCALAPPDATA%\jdk-21-jre\`.
3. No `JAVA_HOME` needed — the code cell below auto-discovers JVMs in common locations:
   - `%LOCALAPPDATA%\jdk-*-jre\bin\server\jvm.dll`
   - `%LOCALAPPDATA%\jdk-*\bin\server\jvm.dll`
   - `C:\Program Files\Java\*\bin\server\jvm.dll`
   - `C:\Program Files\Microsoft\jdk-*\bin\server\jvm.dll`
   - `C:\Program Files\Eclipse Adoptium\*\bin\server\jvm.dll`
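
The lookup order above can be sketched as a small helper that tries each glob pattern in turn. This is a sketch under stated assumptions, not the notebook's exact logic. `find_jvm` is a hypothetical name, and picking the lexicographically last match is a heuristic for "newest" that fails for single-digit versions (for example, `jdk-9` sorts after `jdk-21`).

```python
import glob
import os
import tempfile
from pathlib import Path
from typing import Iterable, Optional


def find_jvm(patterns: Iterable[str]) -> Optional[str]:
    """Return a JVM library path from the first pattern that matches, else None."""
    for pattern in patterns:
        # glob.glob returns matches in arbitrary order, so sort for a stable pick.
        matches = sorted(glob.glob(os.path.expandvars(pattern)))
        if matches:
            return matches[-1]  # heuristic: last in sort order
    return None


# Demo against a throwaway directory tree instead of a real Java install.
with tempfile.TemporaryDirectory() as root:
    for name in ("jdk-17-jre", "jdk-21-jre"):
        server_dir = os.path.join(root, name, "bin", "server")
        os.makedirs(server_dir)
        open(os.path.join(server_dir, "jvm.dll"), "w").close()
    found = find_jvm([
        os.path.join(root, "missing-*", "jvm.dll"),                      # no match
        os.path.join(root, "jdk-*-jre", "bin", "server", "jvm.dll"),     # two matches
    ])
    found_parent = Path(found).parents[2].name
    print(found_parent)
```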

### Verifying the installation

Run `java -version` in a terminal. You should see output like:
```
openjdk version "21.0.5" 2024-10-15 LTS
```

> **Note:** The JVM can only be started **once** per Python process. If you need to change the JVM path or classpath, restart the notebook kernel first.

In [8]:
# Download PDFBox standalone JAR and start JVM
import jpype
import jpype.imports
import glob

PDFBOX_VERSION = "3.0.4"
PDFBOX_JAR_URL = f"https://repo1.maven.org/maven2/org/apache/pdfbox/pdfbox-app/{PDFBOX_VERSION}/pdfbox-app-{PDFBOX_VERSION}.jar"
PDFBOX_JAR = PDF_DIR / f"pdfbox-app-{PDFBOX_VERSION}.jar"

if not PDFBOX_JAR.exists():
    print(f"📥 Downloading PDFBox {PDFBOX_VERSION} JAR from Maven Central...")
    resp = requests.get(PDFBOX_JAR_URL, timeout=120)
    resp.raise_for_status()
    PDFBOX_JAR.write_bytes(resp.content)
    print(f"   ✅ Saved ({len(resp.content):,} bytes)")
else:
    print(f"📄 PDFBox JAR already exists ({PDFBOX_JAR.stat().st_size:,} bytes)")

# Start JVM with PDFBox on classpath (can only be done once per process)
if not jpype.isJVMStarted():
    # Try default path first, fall back to searching common locations
    jvm_path = None
    try:
        jvm_path = jpype.getDefaultJVMPath()
    except jpype.JVMNotFoundException:
        # Search common portable JDK/JRE install locations on Windows
        for pattern in [
            os.path.expandvars(r"%LOCALAPPDATA%\jdk-*-jre\bin\server\jvm.dll"),
            os.path.expandvars(r"%LOCALAPPDATA%\jdk-*\bin\server\jvm.dll"),
            r"C:\Program Files\Java\*\bin\server\jvm.dll",
            r"C:\Program Files\Microsoft\jdk-*\bin\server\jvm.dll",
            r"C:\Program Files\Eclipse Adoptium\*\bin\server\jvm.dll",
        ]:
            matches = glob.glob(pattern)
            if matches:
                jvm_path = matches[0]
                break

    if jvm_path is None:
        raise RuntimeError(
            "No JVM found! Install a JRE/JDK and set JAVA_HOME, "
            "or place one in %LOCALAPPDATA%\\jdk-*"
        )

    print(f"   JVM: {jvm_path}")
    jpype.startJVM(jvm_path, classpath=[str(PDFBOX_JAR.resolve())])
    print("✅ JVM started with PDFBox on classpath")
else:
    print("✅ JVM already running")

# Import PDFBox Java classes
from java.io import File as JFile
from org.apache.pdfbox import Loader
from org.apache.pdfbox.text import PDFTextStripper

print(f"✅ PDFBox {PDFBOX_VERSION} ready")

📄 PDFBox JAR already exists (13,454,142 bytes)
   JVM: C:\Users\laurelle\AppData\Local\jdk-21.0.5+11-jre\bin\server\jvm.dll
✅ JVM started with PDFBox on classpath
✅ PDFBox 3.0.4 ready


In [9]:
# Run PDFBox extraction on all PDFs
results_pdfbox = {}
timings_pdfbox = {}

stripper = PDFTextStripper()

for doc_id, pdf_path in downloaded.items():
    title = SAMPLE_PDFS[doc_id]["title"]
    print(f"🔍 PDFBox extracting: {title}")

    t0 = time.perf_counter()
    try:
        jfile = JFile(str(pdf_path.resolve()))
        doc = Loader.loadPDF(jfile)

        # Extract text
        text = str(stripper.getText(doc))
        page_count = doc.getNumberOfPages()

        # Extract metadata
        info = doc.getDocumentInformation()
        metadata = {}
        for key in ["Title", "Author", "Subject", "Keywords", "Creator", "Producer"]:
            val = info.getCustomMetadataValue(key)
            if val:
                metadata[key.lower()] = str(val)

        # Try to get dates
        creation_date = info.getCreationDate()
        mod_date = info.getModificationDate()
        if creation_date:
            metadata["creation_date"] = str(creation_date.getTime())
        if mod_date:
            metadata["mod_date"] = str(mod_date.getTime())

        doc.close()
        elapsed = time.perf_counter() - t0
        timings_pdfbox[doc_id] = elapsed

        results_pdfbox[doc_id] = {
            "library": "PDFBox",
            "file": pdf_path.name,
            "file_size": pdf_path.stat().st_size,
            "page_count": page_count,
            "full_text": text,
            "total_chars": len(text),
            "total_words": len(text.split()),
            "metadata": metadata,
        }
        print(f"   ✅ {page_count} pages, {len(text.split()):,} words, "
              f"{len(text):,} chars  ({elapsed*1000:.1f} ms)")

    except Exception as e:
        timings_pdfbox[doc_id] = time.perf_counter() - t0
        print(f"   ❌ PDFBox failed: {e}")

    print()

print(f"✅ PDFBox extraction complete for {len(results_pdfbox)} documents")

🔍 PDFBox extracting: Employee Handbook
   ✅ 11 pages, 2,370 words, 16,454 chars  (767.2 ms)

🔍 PDFBox extracting: Benefit Options
   ✅ 4 pages, 614 words, 4,386 chars  (170.4 ms)

🔍 PDFBox extracting: Perks Plus Program
   ✅ 4 pages, 432 words, 2,940 chars  (428.7 ms)

🔍 PDFBox extracting: 0000950170 25 061046
   ✅ 72 pages, 32,946 words, 244,766 chars  (2021.1 ms)

🔍 PDFBox extracting: 0000950170 25 100235
   ✅ 158 pages, 72,862 words, 515,075 chars  (1726.3 ms)

🔍 PDFBox extracting: 0001193125 25 256321
   ✅ 67 pages, 29,166 words, 215,357 chars  (772.4 ms)

🔍 PDFBox extracting: 0001193125 26 027207
   ✅ 71 pages, 31,837 words, 238,841 chars  (1004.3 ms)

✅ PDFBox extraction complete for 7 documents
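
For the Employee Handbook, PDFBox and PyMuPDF report the same word count (2,370) but different character counts (16,454 vs 16,118). That pattern would be consistent with whitespace differences, although the notebook does not verify it. The sketch below shows one way to check, assuming that collapsing whitespace is an acceptable normalization. `normalized_chars` and `word_similarity` are hypothetical helpers, and the sample strings are illustrative rather than real extractor output.

```python
import difflib


def normalized_chars(text: str) -> int:
    """Character count after collapsing every whitespace run to one space."""
    return len(" ".join(text.split()))


def word_similarity(text_a: str, text_b: str) -> float:
    """Similarity in [0, 1] over word tokens; whitespace layout is ignored."""
    matcher = difflib.SequenceMatcher(
        None, text_a.split(), text_b.split(), autojunk=False
    )
    return matcher.ratio()


# Illustrative strings, not actual extractor output.
a = "Employee  Handbook\n\nWelcome to Contoso"
b = "Employee Handbook Welcome to Contoso"
c = "Employee Handbook Welcome to Fabrikam"

print(len(a), len(b), normalized_chars(a), normalized_chars(b))
print(word_similarity(a, b), word_similarity(b, c))
```

`SequenceMatcher` can be slow on very long token lists, so for the larger filings it may be more practical to compare page by page or on a sample.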


## 5c. Run Apache Tika (Docker container)

[Apache Tika](https://tika.apache.org/) is the **actual document-cracking engine used by Azure AI Search**. It wraps PDFBox internally, but adds format detection, language identification, and a rich metadata model on top.

We call Tika's REST API (running in a Docker container) to extract text and metadata, giving us the most faithful reproduction of what Azure AI Search does.

### Starting the Tika container

Pull and run the official image from Docker Hub:

```bash
# Standard image (text extraction only)
docker run -d --name tika -p 9998:9998 apache/tika:latest

# Alternatively, the full image (includes OCR via Tesseract; use for scanned PDFs).
# Both commands use the same container name and port, so run only one of them.
docker run -d --name tika -p 9998:9998 apache/tika:latest-full
```

Verify it's running:
```bash
curl http://localhost:9998/version
# → Apache Tika 3.x.x
```

To stop / remove:
```bash
docker stop tika && docker rm tika
```

> **Note:** The cell below expects Tika at `http://localhost:9998` by default. Change `TIKA_URL` to point elsewhere if needed.

In [23]:
# Apache Tika extraction via REST API
# Change TIKA_URL if your Tika server runs elsewhere
TIKA_URL = "http://localhost:9998"

# Check Tika is reachable
tika_available = False
try:
    tika_version_resp = requests.get(f"{TIKA_URL}/version", timeout=5)
    tika_version_resp.raise_for_status()
    tika_version = tika_version_resp.text.strip()
    tika_available = True
    print(f"✅ Tika server reachable — {tika_version}")
except Exception as e:
    print(f"⚠️  Tika server not reachable at {TIKA_URL}: {e}")
    print("   Tika extraction will be skipped. Start the container with:")
    print("   docker run -d --name tika -p 9998:9998 apache/tika:latest")

results_tika = {}
timings_tika = {}

if tika_available:
    for doc_id, pdf_path in downloaded.items():
        title = SAMPLE_PDFS[doc_id]["title"]
        print(f"🔍 Tika extracting: {title}")

        pdf_bytes = pdf_path.read_bytes()
        t0 = time.perf_counter()

        try:
            # Single call: /tika/text with Accept: application/json
            # Returns JSON with "X-TIKA:content" (the extracted text) + all metadata fields
            resp = requests.put(
                f"{TIKA_URL}/tika/text",
                data=pdf_bytes,
                headers={
                    "Content-Type": "application/pdf",
                    "Accept": "application/json",
                },
                timeout=120,
            )
            resp.raise_for_status()
            raw_json = resp.json()

            elapsed = time.perf_counter() - t0
            timings_tika[doc_id] = elapsed

            # Text is in the "X-TIKA:content" field
            text = raw_json.get("X-TIKA:content", "")
            # Everything else is metadata
            raw_meta = {k: v for k, v in raw_json.items() if k != "X-TIKA:content"}

            # Normalize metadata (Tika returns values as str or list-of-str)
            def _meta_val(d, key):
                v = d.get(key, "")
                if isinstance(v, list):
                    return v[0] if v else ""  # tolerate an empty list
                return v

            page_count_str = _meta_val(raw_meta, "xmpTPg:NPages") or _meta_val(raw_meta, "meta:page-count") or ""
            page_count = int(page_count_str) if page_count_str.isdigit() else None

            metadata = {
                "title": _meta_val(raw_meta, "dc:title") or _meta_val(raw_meta, "title"),
                "author": _meta_val(raw_meta, "meta:author") or _meta_val(raw_meta, "dc:creator"),
                "subject": _meta_val(raw_meta, "dc:subject"),
                "keywords": _meta_val(raw_meta, "meta:keyword") or _meta_val(raw_meta, "pdf:docinfo:keywords"),
                "creator": _meta_val(raw_meta, "pdf:docinfo:creator_tool") or _meta_val(raw_meta, "xmp:CreatorTool"),
                "producer": _meta_val(raw_meta, "pdf:docinfo:producer"),
                "creation_date": _meta_val(raw_meta, "dcterms:created") or _meta_val(raw_meta, "meta:creation-date"),
                "mod_date": _meta_val(raw_meta, "dcterms:modified") or _meta_val(raw_meta, "Last-Modified"),
                "language": _meta_val(raw_meta, "language") or _meta_val(raw_meta, "dc:language"),
                "content_type": _meta_val(raw_meta, "Content-Type"),
                "pdf_version": _meta_val(raw_meta, "pdf:PDFVersion"),
            }

            results_tika[doc_id] = {
                "library": "Tika",
                "file": pdf_path.name,
                "file_size": pdf_path.stat().st_size,
                "page_count": page_count,
                "full_text": text,
                "total_chars": len(text),
                "total_words": len(text.split()),
                "metadata": metadata,
                "raw_metadata": raw_meta,
            }
            print(f"   ✅ {page_count or '?'} pages, {len(text.split()):,} words, "
                  f"{len(text):,} chars  ({elapsed*1000:.1f} ms)")

        except Exception as e:
            timings_tika[doc_id] = time.perf_counter() - t0
            print(f"   ❌ Tika failed: {e}")

        print()

    print(f"✅ Tika extraction complete for {len(results_tika)} documents")
else:
    print("⏭️  Tika extraction skipped (server not available)")

✅ Tika server reachable — Apache Tika 3.2.3
🔍 Tika extracting: Employee Handbook
   ✅ 11 pages, 2,370 words, 16,514 chars  (41.1 ms)

🔍 Tika extracting: Benefit Options
   ✅ 4 pages, 614 words, 4,407 chars  (57.1 ms)

🔍 Tika extracting: Perks Plus Program
   ✅ 4 pages, 432 words, 2,994 chars  (35.2 ms)

🔍 Tika extracting: 0000950170 25 061046
   ✅ 72 pages, 32,953 words, 242,549 chars  (2809.1 ms)

🔍 Tika extracting: 0000950170 25 100235
   ✅ 158 pages, 72,914 words, 514,662 chars  (3975.2 ms)

🔍 Tika extracting: 0001193125 25 256321
   ✅ 67 pages, 29,173 words, 213,273 chars  (1423.5 ms)

🔍 Tika extracting: 0001193125 26 027207
   ✅ 71 pages, 31,844 words, 236,646 chars  (3174.4 ms)

✅ Tika extraction complete for 7 documents
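
The cell above handles Tika metadata values that arrive either as a string or as a list of strings, and chains several candidate keys with `or`. A slightly more tolerant variant is sketched here: it accepts any number of fallback keys, skips empty lists, and always returns a string. `first_value` is a hypothetical helper, and the sample dictionary is made up for illustration.

```python
from typing import Any, Mapping


def first_value(meta: Mapping[str, Any], *keys: str) -> str:
    """Return the first non-empty value found under `keys`, as a string."""
    for key in keys:
        value = meta.get(key)
        if isinstance(value, (list, tuple)):
            # Multi-valued field: take the first non-empty entry, if any.
            value = next((v for v in value if v not in (None, "")), None)
        if value not in (None, ""):
            return str(value)
    return ""


# Made-up metadata shaped like a Tika JSON response.
sample_meta = {
    "dc:title": [],
    "title": "Handbook",
    "meta:author": ["A. Author", "B. Author"],
    "xmpTPg:NPages": 11,
}

print(first_value(sample_meta, "dc:title", "title"))
print(first_value(sample_meta, "meta:author", "dc:creator"))
print(first_value(sample_meta, "xmpTPg:NPages", "meta:page-count"))
```

Because the result is always a string, a follow-up check such as `.isdigit()` on the page count stays safe even if the server returns a number.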


## 5d. Run Azure AI Document Intelligence (Layout model)

[Azure AI Document Intelligence](https://learn.microsoft.com/en-us/azure/ai-services/document-intelligence/) (formerly Form Recognizer) is a **cloud-based** Azure service that uses AI/ML to extract text, key-value pairs, tables, and structure from documents. The **Layout model** provides:
- OCR text extraction with reading order
- Table detection and extraction
- Selection marks, barcodes
- Paragraph and section headings

### Prerequisites

| Requirement | Details |
|---|---|
| **Azure subscription** | With an Azure AI Document Intelligence resource created |
| **Endpoint + API key** | Set in `.env` as `DOCUMENT_INTELLIGENCE_ENDPOINT` and `DOCUMENT_INTELLIGENCE_KEY` |
| **Python SDK** | `azure-ai-documentintelligence` (installed via pip cell above) |

### Pricing (Pay-As-You-Go)

| Model | Cost per page | Notes |
|---|---|---|
| **Read** | $0.0015 | OCR + text only |
| **Layout** | $0.01 | Text + tables + structure |
| **Prebuilt** | $0.01 | Invoices, receipts, etc. |
| **Free tier** | $0 | 500 pages/month |

> **Note:** We use the **Layout** model here for the richest comparison. Expect 2–15 seconds per document depending on size (cloud round-trip).
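
As a rough worked example, the per-page prices in the table can be applied to the page counts reported in Section 4. This is back-of-the-envelope arithmetic that assumes the listed prices and bills every page. `estimate_cost` is a hypothetical helper.

```python
from decimal import Decimal

# Per-page prices copied from the table above.
PRICE_PER_PAGE = {"Read": Decimal("0.0015"), "Layout": Decimal("0.01")}

# Page counts of the seven PDFs, as printed by the PyMuPDF run in Section 4.
PAGE_COUNTS = [11, 4, 4, 72, 158, 67, 71]


def estimate_cost(pages: int, model: str) -> Decimal:
    """Estimated pay-as-you-go cost in USD for `pages` pages with `model`."""
    return PRICE_PER_PAGE[model] * pages


total_pages = sum(PAGE_COUNTS)
layout_cost = estimate_cost(total_pages, "Layout")
read_cost = estimate_cost(total_pages, "Read")
print(total_pages, layout_cost, read_cost)
```

One full pass over all seven files is 387 pages, which stays under the 500 pages/month listed for the free tier.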

In [33]:
# Azure AI Document Intelligence extraction (Layout model)
# Keys are loaded from the .env file at the workspace root

from dotenv import load_dotenv

# Load .env from workspace root (two levels up from notebook dir)
env_path = Path("../../.env").resolve()
load_dotenv(env_path, override=True)

DI_ENDPOINT = os.environ.get("DOCUMENT_INTELLIGENCE_ENDPOINT", "")
DI_KEY = os.environ.get("DOCUMENT_INTELLIGENCE_KEY", "")

di_available = False
if DI_ENDPOINT and DI_KEY:
    from azure.ai.documentintelligence import DocumentIntelligenceClient
    from azure.ai.documentintelligence.models import AnalyzeDocumentRequest, DocumentAnalysisFeature
    from azure.core.credentials import AzureKeyCredential

    di_client = DocumentIntelligenceClient(
        endpoint=DI_ENDPOINT,
        credential=AzureKeyCredential(DI_KEY),
    )
    di_available = True
    print(f"✅ Document Intelligence client ready — {DI_ENDPOINT}")
else:
    print("⚠️  Document Intelligence not configured.")
    print(f"   Set DOCUMENT_INTELLIGENCE_ENDPOINT and DOCUMENT_INTELLIGENCE_KEY in {env_path}")
    print("   Document Intelligence extraction will be skipped.")

results_docint = {}
timings_docint = {}

if di_available:
    for doc_id, pdf_path in downloaded.items():
        title = SAMPLE_PDFS[doc_id]["title"]
        print(f"🔍 Document Intelligence extracting: {title}")

        pdf_bytes = pdf_path.read_bytes()
        t0 = time.perf_counter()

        try:
            poller = di_client.begin_analyze_document(
                "prebuilt-layout",
                body=pdf_bytes,
                content_type="application/pdf",
            )
            result = poller.result()
            elapsed = time.perf_counter() - t0
            timings_docint[doc_id] = elapsed

            # Extract full text from all pages
            text = result.content or ""

            # Extract metadata from document properties
            page_count = len(result.pages) if result.pages else 0

            # Count tables and paragraphs
            table_count = len(result.tables) if result.tables else 0
            paragraph_count = len(result.paragraphs) if result.paragraphs else 0

            # Build per-page info
            pages_info = []
            for page in (result.pages or []):
                page_text_len = 0
                page_word_count = 0
                if page.words:
                    page_word_count = len(page.words)
                    page_text_len = sum(len(w.content) for w in page.words)
                pages_info.append({
                    "page_num": page.page_number,
                    "width": page.width,
                    "height": page.height,
                    "unit": page.unit,
                    "word_count": page_word_count,
                    "char_count": page_text_len,
                    "lines": len(page.lines) if page.lines else 0,
                    "selection_marks": len(page.selection_marks) if page.selection_marks else 0,
                })

            # Gather metadata (DI doesn't extract PDF metadata like author/title)
            metadata = {
                "title": "",  # DI doesn't read PDF info dict
                "author": "",
                "creator": "",
                "producer": "",
                "creation_date": "",
                "mod_date": "",
                "subject": "",
                "keywords": "",
                "tables": table_count,
                "paragraphs": paragraph_count,
                "content_type": "application/pdf",
                "model": "prebuilt-layout",
            }

            results_docint[doc_id] = {
                "library": "Document Intelligence",
                "file": pdf_path.name,
                "file_size": pdf_path.stat().st_size,
                "page_count": page_count,
                "full_text": text,
                "total_chars": len(text),
                "total_words": len(text.split()),
                "pages": pages_info,
                "metadata": metadata,
                "table_count": table_count,
                "paragraph_count": paragraph_count,
            }
            print(f"   ✅ {page_count} pages, {len(text.split()):,} words, "
                  f"{len(text):,} chars, {table_count} tables  "
                  f"({elapsed*1000:.1f} ms)")

        except Exception as e:
            timings_docint[doc_id] = time.perf_counter() - t0
            print(f"   ❌ Document Intelligence failed: {e}")

        print()

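    # Only successfully analyzed documents are counted here; failed ones never reach results_docint.
    total_pages = sum(r.get("page_count", 0) for r in results_docint.values())
    # Per-page rates (USD) are hard-coded assumptions; check current Azure pricing before relying on them.
    cost_read = total_pages * 0.0015
    cost_layout = total_pages * 0.01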
    print(f"✅ Document Intelligence extraction complete for {len(results_docint)} documents ({total_pages} pages)")
    print(f"💰 Estimated cost: ${cost_read:.2f} (Read model) / ${cost_layout:.2f} (Layout model)")
else:
    print("⏭️  Document Intelligence extraction skipped (not configured)")

✅ Document Intelligence client ready — https://document-intelligence-laurelle.cognitiveservices.azure.com/
🔍 Document Intelligence extracting: Employee Handbook
   ✅ 11 pages, 2,372 words, 15,656 chars, 0 tables  (8155.6 ms)

🔍 Document Intelligence extracting: Benefit Options
   ✅ 4 pages, 639 words, 4,461 chars, 1 tables  (5606.7 ms)

🔍 Document Intelligence extracting: Perks Plus Program
   ✅ 4 pages, 434 words, 2,829 chars, 0 tables  (5325.8 ms)

🔍 Document Intelligence extracting: 0000950170 25 061046
   ✅ 72 pages, 32,958 words, 217,315 chars, 54 tables  (13575.2 ms)

🔍 Document Intelligence extracting: 0000950170 25 100235
   ✅ 158 pages, 72,922 words, 480,749 chars, 80 tables  (21682.6 ms)

🔍 Document Intelligence extracting: 0001193125 25 256321
   ✅ 67 pages, 29,179 words, 194,408 chars, 53 tables  (12251.0 ms)

🔍 Document Intelligence extracting: 0001193125 26 027207
   ✅ 71 pages, 31,845 words, 210,763 chars, 52 tables  (13247.0 ms)

✅ Document Intelligence extraction compl
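
The cell above only counts the tables that the layout model detects. To inspect what was actually detected, each entry in `result.tables` can be flattened into a row-major grid. The following is a minimal sketch, assuming each table exposes `row_count`, `column_count`, and `cells` whose items carry `row_index`, `column_index`, and `content`. The `demo_table` object and its cell values are made-up stand-ins so the snippet runs without calling the service, and merged cells (`row_span` / `column_span`) are deliberately ignored.

```python
from types import SimpleNamespace


def table_to_grid(table):
    """Place each cell's text at (row_index, column_index) in a row-major grid."""
    grid = [["" for _ in range(table.column_count)] for _ in range(table.row_count)]
    for cell in table.cells:
        # Cells spanning several rows/columns are written only at their top-left position.
        grid[cell.row_index][cell.column_index] = cell.content
    return grid


# Hypothetical stand-in for one item of result.tables (illustrative values only).
demo_table = SimpleNamespace(
    row_count=2,
    column_count=2,
    cells=[
        SimpleNamespace(row_index=0, column_index=0, content="Plan"),
        SimpleNamespace(row_index=0, column_index=1, content="Cost"),
        SimpleNamespace(row_index=1, column_index=0, content="A"),
        SimpleNamespace(row_index=1, column_index=1, content="10"),
    ],
)

for row in table_to_grid(demo_table):
    print(" | ".join(row))
```

With a live result, the same function could be applied as `table_to_grid(result.tables[0])` and the grid passed to `pd.DataFrame` for display.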

## 8. Raw Extraction Results per Library

Dump the full raw output from each extraction library for every PDF, so you can inspect exactly what each one returns.

In [10]:
# ── Raw results: PdfPig (C# / simulator) ─────────────────────────────────────

for doc_id in downloaded:
    title = SAMPLE_PDFS[doc_id]["title"]
    pig_result = results_pdfpig.get(doc_id, {})
    pig_crackers = [c for c in pig_result.get("crackers", []) if c.get("success")]
    pig = pig_crackers[0] if pig_crackers else None

    display(Markdown(f"---\n### 🟢 PdfPig — {title}"))

    if not pig:
        print("  ❌ No successful PdfCracker result")
        continue

    # Metrics
    print(f"  Pages:      {pig.get('pageCount', '?')}")
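    # Apply the thousands separator only when a count is present; ':,' raises ValueError on a '?' fallback
    word_count = pig.get("wordCount")
    char_count = pig.get("characterCount")
    print(f"  Words:      {word_count:,}" if word_count is not None else "  Words:      ?")
    print(f"  Characters: {char_count:,}" if char_count is not None else "  Characters: ?")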
    print(f"  Time:       {pig.get('extractionTimeMs', 0):.1f} ms")
    print(f"  Title:      {pig.get('title') or '(none)'}")
    print(f"  Author:     {pig.get('author') or '(none)'}")
    print(f"  Created:    {pig.get('createdDate') or '(none)'}")
    print(f"  Modified:   {pig.get('modifiedDate') or '(none)'}")
    print(f"  Language:   {pig.get('language') or '(none)'}")

    # All metadata keys
    meta = pig.get("metadata", {})
    if meta:
        print(f"\n  Raw metadata ({len(meta)} keys):")
        for k, v in meta.items():
            print(f"    {k}: {v}")

    # Warnings
    if pig.get("warnings"):
        print(f"\n  ⚠️  Warnings: {pig['warnings']}")

    # Full text
    text = pig.get("content", "")
    display_text = text if MAX_TEXT_DISPLAY <= 0 else text[:MAX_TEXT_DISPLAY]
    suffix = f"\n\n... [{len(text) - MAX_TEXT_DISPLAY:,} more chars] ..." if MAX_TEXT_DISPLAY > 0 and len(text) > MAX_TEXT_DISPLAY else ""
    print(f"\n  ── Text ({len(text):,} chars{f', showing first {MAX_TEXT_DISPLAY:,}' if MAX_TEXT_DISPLAY > 0 else ''}) ──")

    print(display_text + suffix)
    print()

---
### 🟢 PdfPig — Employee Handbook

  Pages:      11
  Words:      2,367
  Characters: 15,777
  Time:       2288.4 ms
  Title:      (none)
  Author:     python-docx
  Created:    2023-03-06T13:57:20.0000000+00:00
  Modified:   2023-03-06T13:57:20.0000000+00:00
  Language:   (none)

  Raw metadata (3 keys):
    creator: Microsoft® Word for Microsoft 365
    producer: Microsoft® Word for Microsoft 365
    pdfVersion: 1,7

  ── Text (15,777 chars) ──
Contoso Electronics Employee Handbook         

This document contains information generated using a language model (Azure OpenAI). The information contained in this document is only for demonstration purposes and does not reflect the opinions or beliefs of Microsoft. Microsoft makes no representations or warranties of any kind, express or implied, about the completeness, accuracy, reliability, suitability or availability with respect to the information contained in this document.  All rights reserved to Microsoft   

Contoso Electronics Employee Handbook Last Updated: 2023-03-

---
### 🟢 PdfPig — Benefit Options

  Pages:      4
  Words:      609
  Characters: 4,289
  Time:       570.8 ms
  Title:      (none)
  Author:     Liam Cavanagh
  Created:    2023-03-06T13:58:20.0000000+00:00
  Modified:   2023-03-20T13:05:46.0000000+00:00
  Language:   (none)

  Raw metadata (3 keys):
    creator: Microsoft® Word for Microsoft 365
    producer: Microsoft® Word for Microsoft 365
    pdfVersion: 1,7

  ── Text (4,289 chars) ──
Contoso Electronics Plan and Benefit Packages

This document contains information generated using a language model (Azure OpenAI). The information contained in this document is only for demonstration purposes and does not reflect the opinions or beliefs of Microsoft. Microsoft makes no representations or warranties of any kind, express or implied, about the completeness, accuracy, reliability, suitability or availability with respect to the information contained in this document. All rights reserved to Microsoft

Welcome to Contoso Electronics! We are excited to offer our employees

---
### 🟢 PdfPig — Perks Plus Program

  Pages:      4
  Words:      432
  Characters: 2,831
  Time:       622.2 ms
  Title:      (none)
  Author:     Liam Cavanagh
  Created:    2023-03-07T10:33:37.0000000+00:00
  Modified:   2023-03-07T10:33:37.0000000+00:00
  Language:   (none)

  Raw metadata (3 keys):
    creator: Microsoft® Word for Microsoft 365
    producer: Microsoft® Word for Microsoft 365
    pdfVersion: 1,7

  ── Text (2,831 chars) ──
PerksPlus Health and Wellness Reimbursement Program for Contoso Electronics Employees        

This document contains information generated using a language model (Azure OpenAI). The information contained in this document is only for demonstration purposes and does not reflect the opinions or beliefs of Microsoft. Microsoft makes no representations or warranties of any kind, express or implied, about the completeness, accuracy, reliability, suitability or availability with respect to the information contained in this document.  All rights reserved to Microsoft   

Overview Introduc

---
### 🟢 PdfPig — 0000950170 25 061046

  Pages:      72
  Words:      31,681
  Characters: 227,261
  Time:       2124.6 ms
  Title:      Form 10-Q for Microsoft Corp filed 04/30/2025
  Author:     Kaleidoscope - kscope.io
  Created:    2025-04-30T20:14:09.0000000+00:00
  Modified:   2025-04-30T20:14:12.0000000+00:00
  Language:   (none)

  Raw metadata (5 keys):
    subject: 10-Q filed 04/30/2025
    keywords: Microsoft Corp 10-Q
    creator: Chromium
    producer: KS - PDF Engine v1.2
    pdfVersion: 1,7

  ── Text (227,261 chars) ──
UNITED STATESSECURITIES AND EXCHANGE COMMISSIONWashington, D.C. 20549  FORM 10-Q  ☒QUARTERLY REPORT PURSUANT TO SECTION 13 OR 15(d) OF THE SECURITIES EXCHANGE ACT OF 1934   For the Quarterly Period Ended March 31, 2025  OR  ☐ TRANSITION REPORT PURSUANT TO SECTION 13 OR 15(d) OF THE SECURITIES EXCHANGE ACT OF 1934   For the Transition Period From                  toCommission File Number 001-37845  MICROSOFT CORPORATION  WASHINGTON 91-1144442(STATE OF INCORPORATION) (I.R.S. ID) ONE MICROSOFT WA

---
### 🟢 PdfPig — 0000950170 25 100235

  Pages:      158
  Words:      71,031
  Characters: 491,352
  Time:       2349.3 ms
  Title:      Form 10-K for Microsoft Corp filed 07/30/2025
  Author:     Kaleidoscope - kscope.io
  Created:    2025-07-30T20:14:39.0000000+00:00
  Modified:   2025-07-30T20:14:43.0000000+00:00
  Language:   (none)

  Raw metadata (5 keys):
    subject: 10-K filed 07/30/2025
    keywords: Microsoft Corp 10-K
    creator: Chromium
    producer: KS - PDF Engine v1.2
    pdfVersion: 1,7

  ── Text (491,352 chars) ──
UNITED STATESSECURITIES AND EXCHANGE COMMISSIONWashington, D.C. 20549  FORM 10-K  ☒ANNUAL REPORT PURSUANT TO SECTION 13 OR 15(d) OF THE SECURITIES EXCHANGE ACT OF 1934   For the Fiscal Year Ended June 30, 2025   OR  ☐TRANSITION REPORT PURSUANT TO SECTION 13 OR 15(d) OF THE SECURITIES EXCHANGE ACT OF 1934   For the Transition Period From                  toCommission File Number 001-37845   MICROSOFT CORPORATION  WASHINGTON 91-1144442(STATE OF INCORPORATION) (I.R.S. ID)ONE MICROSOFT WAY, REDMO

---
### 🟢 PdfPig — 0001193125 25 256321

  Pages:      67
  Words:      28,095
  Characters: 201,591
  Time:       2160.9 ms
  Title:      Form 10-Q for Microsoft Corp filed 10/29/2025
  Author:     Kaleidoscope - kscope.io
  Created:    2025-10-29T20:16:21.0000000+00:00
  Modified:   2025-10-29T20:16:23.0000000+00:00
  Language:   (none)

  Raw metadata (5 keys):
    subject: 10-Q filed 10/29/2025
    keywords: Microsoft Corp 10-Q
    creator: Chromium
    producer: KS - PDF Engine v1.2
    pdfVersion: 1,7

  ── Text (201,591 chars) ──
UNITED STATESSECURITIES AND EXCHANGE COMMISSIONWashington, D.C. 20549  FORM 10-Q  ☒QUARTERLY REPORT PURSUANT TO SECTION 13 OR 15(d) OF THE SECURITIES EXCHANGE ACT OF 1934   For the Quarterly Period Ended September 30, 2025  OR  ☐ TRANSITION REPORT PURSUANT TO SECTION 13 OR 15(d) OF THE SECURITIES EXCHANGE ACT OF 1934   For the Transition Period From                  toCommission File Number 001-37845  MICROSOFT CORPORATION  WASHINGTON 91-1144442(STATE OF INCORPORATION) (I.R.S. ID) ONE MICROSOF

---
### 🟢 PdfPig — 0001193125 26 027207

  Pages:      71
  Words:      30,516
  Characters: 220,770
  Time:       2267.5 ms
  Title:      Form 10-Q for Microsoft Corp filed 01/28/2026
  Author:     Kaleidoscope - kscope.io
  Created:    2026-01-28T21:13:50.0000000+00:00
  Modified:   2026-01-28T21:13:55.0000000+00:00
  Language:   (none)

  Raw metadata (5 keys):
    subject: 10-Q filed 01/28/2026
    keywords: Microsoft Corp 10-Q
    creator: Chromium
    producer: KS - PDF Engine v1.2
    pdfVersion: 1,7

  ── Text (220,770 chars) ──
UNITED STATESSECURITIES AND EXCHANGE COMMISSIONWashington, D.C. 20549  FORM 10-Q  ☒QUARTERLY REPORT PURSUANT TO SECTION 13 OR 15(d) OF THE SECURITIES EXCHANGE ACT OF 1934   For the Quarterly Period Ended December 31, 2025  OR  ☐ TRANSITION REPORT PURSUANT TO SECTION 13 OR 15(d) OF THE SECURITIES EXCHANGE ACT OF 1934   For the Transition Period From                  toCommission File Number 001-37845  MICROSOFT CORPORATION  WASHINGTON 91-1144442(STATE OF INCORPORATION) (I.R.S. ID) ONE MICROSOFT
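
Every raw-results cell in this section repeats the same `MAX_TEXT_DISPLAY` truncation logic. One way to keep the cells in sync is a small shared helper. This is a sketch only; `format_text_preview` is a hypothetical name, not something defined elsewhere in the notebook, and it is meant to reproduce the header and suffix that the cells already print.

```python
def format_text_preview(text: str, max_chars: int = 0) -> tuple[str, str]:
    """Return (header, body) for a text dump; max_chars <= 0 means show everything."""
    limited = max_chars > 0
    shown = f", showing first {max_chars:,}" if limited else ""
    header = f"  ── Text ({len(text):,} chars{shown}) ──"
    body = text[:max_chars] if limited else text
    if limited and len(text) > max_chars:
        # Tell the reader how much was cut off.
        body += f"\n\n... [{len(text) - max_chars:,} more chars] ..."
    return header, body


header, body = format_text_preview("Contoso Electronics Employee Handbook", max_chars=7)
print("\n" + header)
print(body)
```

Each cell could then call `format_text_preview(text, MAX_TEXT_DISPLAY)` instead of rebuilding `display_text` and `suffix` inline.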

In [11]:
# ── Raw results: PyMuPDF ──────────────────────────────────────────────────────

for doc_id in downloaded:
    title = SAMPLE_PDFS[doc_id]["title"]
    mu = results_pymupdf.get(doc_id)

    display(Markdown(f"---\n### 🔵 PyMuPDF — {title}"))

    if not mu:
        print("  ❌ No PyMuPDF result")
        continue

    # Metrics
    print(f"  Pages:      {mu['page_count']}")
    print(f"  Words:      {mu['total_words']:,}")
    print(f"  Characters: {mu['total_chars']:,}")
    print(f"  File size:  {mu['file_size']:,} bytes")
    print(f"  Time:       {timings_pymupdf.get(doc_id, 0)*1000:.1f} ms")

    # Metadata
    meta = mu.get("metadata", {})
    print(f"\n  Metadata ({len([v for v in meta.values() if v])} non-empty / {len(meta)} total):")
    for k, v in meta.items():
        icon = "✅" if v else "❌"
        print(f"    {icon} {k}: {v if v else '(empty)'}")

    # Per-page summary
    print("\n  Per-page breakdown:")
    print(f"    {'Page':>5}  {'Words':>7}  {'Chars':>7}  {'Images':>7}  {'Links':>6}")
    for p in mu["pages"]:
        print(f"    {p['page_num']:>5}  {p['word_count']:>7,}  {p['char_count']:>7,}  "
              f"{p.get('images', 0):>7}  {p.get('links', 0):>6}")

    # Full text
    text = mu.get("full_text", "")
    display_text = text if MAX_TEXT_DISPLAY <= 0 else text[:MAX_TEXT_DISPLAY]
    suffix = f"\n\n... [{len(text) - MAX_TEXT_DISPLAY:,} more chars] ..." if MAX_TEXT_DISPLAY > 0 and len(text) > MAX_TEXT_DISPLAY else ""
    print(f"\n  ── Text ({len(text):,} chars{f', showing first {MAX_TEXT_DISPLAY:,}' if MAX_TEXT_DISPLAY > 0 else ''}) ──")
    print(display_text + suffix)
    print()

---
### 🔵 PyMuPDF — Employee Handbook

  Pages:      11
  Words:      2,370
  Characters: 16,118
  File size:  142,977 bytes
  Time:       422.2 ms

  Metadata (6 non-empty / 10 total):
    ❌ title: (empty)
    ✅ author: python-docx
    ❌ subject: (empty)
    ❌ keywords: (empty)
    ✅ creator: Microsoft® Word for Microsoft 365
    ✅ producer: Microsoft® Word for Microsoft 365
    ✅ creation_date: D:20230306135720-08'00'
    ✅ mod_date: D:20230306135720-08'00'
    ✅ format: PDF 1.7
    ❌ encryption: (empty)

  Per-page breakdown:
     Page    Words    Chars   Images   Links
        1        4       56        1       0
        2       66      482        0       0
        3      363    2,411        0       0
        4      322    2,092        0       0
        5      288    1,912        0       0
        6      324    2,254        0       0
        7      265    1,834        0       0
        8      200    1,421        0       0
        9      214    1,456        0       0
       10      244    1,631        0       0
       11

---
### 🔵 PyMuPDF — Benefit Options

  Pages:      4
  Words:      507
  Characters: 3,677
  File size:  544,811 bytes
  Time:       805.0 ms

  Metadata (6 non-empty / 10 total):
    ❌ title: (empty)
    ✅ author: Liam Cavanagh
    ❌ subject: (empty)
    ❌ keywords: (empty)
    ✅ creator: Microsoft® Word for Microsoft 365
    ✅ producer: Microsoft® Word for Microsoft 365
    ✅ creation_date: D:20230306135820-08'00'
    ✅ mod_date: D:20230320130546-07'00'
    ✅ format: PDF 1.7
    ❌ encryption: (empty)

  Per-page breakdown:
     Page    Words    Chars   Images   Links
        1        6       47        1       0
        2       66      476        0       0
        3      393    2,877        0       0
        4       42      271        1       0

  ── Text (3,677 chars) ──
Contoso Electronics 
Plan and Benefit Packages


This document contains information generated using a language model (Azure OpenAI). The information 
contained in this document is only for demonstration purposes and does not reflect the opinions or 
bel

---
### 🔵 PyMuPDF — Perks Plus Program

  Pages:      4
  Words:      432
  Characters: 2,907
  File size:  115,310 bytes
  Time:       267.3 ms

  Metadata (6 non-empty / 10 total):
    ❌ title: (empty)
    ✅ author: Liam Cavanagh
    ❌ subject: (empty)
    ❌ keywords: (empty)
    ✅ creator: Microsoft® Word for Microsoft 365
    ✅ producer: Microsoft® Word for Microsoft 365
    ✅ creation_date: D:20230307103337-08'00'
    ✅ mod_date: D:20230307103337-08'00'
    ✅ format: PDF 1.7
    ❌ encryption: (empty)

  Per-page breakdown:
     Page    Words    Chars   Images   Links
        1       10      109        1       0
        2       66      482        0       0
        3      352    2,283        0       0
        4        4       27        0       0

  ── Text (2,907 chars) ──
 
 
 
PerksPlus Health and Wellness 
Reimbursement Program for 
Contoso Electronics Employees 
 
 
 
 
 
 
 


This document contains information generated using a language model (Azure OpenAI). The information 
contained in this document is only for de

---
### 🔵 PyMuPDF — 0000950170 25 061046

  Pages:      72
  Words:      32,946
  Characters: 242,269
  File size:  2,179,871 bytes
  Time:       2512.9 ms

  Metadata (9 non-empty / 10 total):
    ✅ title: Form 10-Q for Microsoft Corp filed 04/30/2025
    ✅ author: Kaleidoscope - kscope.io
    ✅ subject: 10-Q filed 04/30/2025
    ✅ keywords: Microsoft Corp 10-Q
    ✅ creator: Chromium
    ✅ producer: KS - PDF Engine v1.2
    ✅ creation_date: D:20250430201409+00'00'
    ✅ mod_date: D:20250430201412Z
    ✅ format: PDF 1.7
    ❌ encryption: (empty)

  Per-page breakdown:
     Page    Words    Chars   Images   Links
        1      409    2,747        0       0
        2      184    1,319        0      19
        3      187    2,254        0       0
        4       90      986        0       0
        5      239    2,380        0       0
        6      344    3,621        0       0
        7      177    2,049        0       0
        8      556    3,946        0       0
        9      695    4,532        0       0
       10      6

---
### 🔵 PyMuPDF — 0000950170 25 100235

  Pages:      158
  Words:      72,865
  Characters: 510,412
  File size:  3,024,506 bytes
  Time:       2897.5 ms

  Metadata (9 non-empty / 10 total):
    ✅ title: Form 10-K for Microsoft Corp filed 07/30/2025
    ✅ author: Kaleidoscope - kscope.io
    ✅ subject: 10-K filed 07/30/2025
    ✅ keywords: Microsoft Corp 10-K
    ✅ creator: Chromium
    ✅ producer: KS - PDF Engine v1.2
    ✅ creation_date: D:20250730201439+00'00'
    ✅ mod_date: D:20250730201443Z
    ✅ format: PDF 1.7
    ❌ encryption: (empty)

  Per-page breakdown:
     Page    Words    Chars   Images   Links
        1      647    4,284        0       0
        2      233    2,121        0      27
        3      561    4,032        0       0
        4      582    4,064        0       0
        5      590    4,221        0       0
        6      496    3,760        0       0
        7      622    4,613        0       0
        8      479    3,348        0       0
        9      683    4,722        0       0
       10      

---
### 🔵 PyMuPDF — 0001193125 25 256321

  Pages:      67
  Words:      29,166
  Characters: 213,507
  File size:  1,802,290 bytes
  Time:       1901.8 ms

  Metadata (9 non-empty / 10 total):
    ✅ title: Form 10-Q for Microsoft Corp filed 10/29/2025
    ✅ author: Kaleidoscope - kscope.io
    ✅ subject: 10-Q filed 10/29/2025
    ✅ keywords: Microsoft Corp 10-Q
    ✅ creator: Chromium
    ✅ producer: KS - PDF Engine v1.2
    ✅ creation_date: D:20251029201621+00'00'
    ✅ mod_date: D:20251029201623Z
    ✅ format: PDF 1.7
    ❌ encryption: (empty)

  Per-page breakdown:
     Page    Words    Chars   Images   Links
        1      409    2,753        0       0
        2      176    1,307        0      19
        3      134    1,346        0       0
        4       66      638        0       0
        5      235    2,351        0       0
        6      253    2,323        0       0
        7      131    1,356        0       0
        8      580    4,078        0       0
        9      682    4,498        0       0
       10      6

---
### 🔵 PyMuPDF — 0001193125 26 027207

  Pages:      71
  Words:      31,837
  Characters: 236,416
  File size:  2,257,229 bytes
  Time:       2660.3 ms

  Metadata (9 non-empty / 10 total):
    ✅ title: Form 10-Q for Microsoft Corp filed 01/28/2026
    ✅ author: Kaleidoscope - kscope.io
    ✅ subject: 10-Q filed 01/28/2026
    ✅ keywords: Microsoft Corp 10-Q
    ✅ creator: Chromium
    ✅ producer: KS - PDF Engine v1.2
    ✅ creation_date: D:20260128211350+00'00'
    ✅ mod_date: D:20260128211355Z
    ✅ format: PDF 1.7
    ❌ encryption: (empty)

  Per-page breakdown:
     Page    Words    Chars   Images   Links
        1      409    2,752        0       0
        2      184    1,333        0      19
        3      188    2,089        0       0
        4       90      889        0       0
        5      235    2,350        0       0
        6      331    3,437        0       0
        7      177    2,071        0       0
        8      447    3,212        0       0
        9      798    5,201        0       0
       10      6
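
The three libraries report the same creation timestamps in different shapes: PyMuPDF returns the raw PDF date string (`D:20230306135720-08'00'`), PDFBox prints a localized value (`Mon Mar 06 22:57:20 CET 2023`), and PdfPig shows `2023-03-06T13:57:20` with a `+00:00` offset. Converting the raw string to a timezone-aware `datetime` makes these comparable. The sketch below is one possible parser, assuming the `D:YYYYMMDDHHmmSS` layout followed by `Z` or a `+HH'mm'` / `-HH'mm'` offset; `parse_pdf_date` is a hypothetical helper, not part of any of the libraries used here.

```python
import re
from datetime import datetime, timedelta, timezone

# D:YYYY, then optional MM DD HH mm SS, then optional Z or +HH'mm' / -HH'mm'.
_PDF_DATE = re.compile(
    r"^D:(\d{4})(\d{2})?(\d{2})?(\d{2})?(\d{2})?(\d{2})?"
    r"(?:(Z)|([+-])(\d{2})'?(\d{2})?'?)?$"
)


def parse_pdf_date(raw):
    """Parse a PDF date string; returns None if the string does not match the layout."""
    match = _PDF_DATE.match(raw.strip())
    if match is None:
        return None
    defaults = (1, 1, 1, 0, 0, 0)  # fallbacks for omitted month, day, hour, minute, second
    parts = [int(g) if g else d for g, d in zip(match.groups()[:6], defaults)]
    zulu, sign, tz_hours, tz_minutes = match.groups()[6:]
    if zulu:
        tz = timezone.utc
    elif sign:
        offset = timedelta(hours=int(tz_hours), minutes=int(tz_minutes or 0))
        tz = timezone(-offset if sign == "-" else offset)
    else:
        tz = None  # no offset given: leave the datetime naive
    return datetime(*parts, tzinfo=tz)


for raw in ("D:20230306135720-08'00'", "D:20250430201412Z"):
    parsed = parse_pdf_date(raw)
    print(raw, "->", parsed.astimezone(timezone.utc).isoformat())
```

Normalized to UTC, the Employee Handbook creation date is 21:57:20, which agrees with the PDFBox value (22:57:20 CET). The PdfPig value keeps the local clock time but labels it `+00:00`, which suggests the original offset may be getting dropped somewhere along that path and is worth checking in the simulator.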

In [12]:
# ── Raw results: pdfplumber ───────────────────────────────────────────────────

for doc_id in downloaded:
    title = SAMPLE_PDFS[doc_id]["title"]
    pb = results_pdfplumber.get(doc_id)

    display(Markdown(f"---\n### 🟠 pdfplumber — {title}"))

    if not pb:
        print("  ❌ No pdfplumber result")
        continue

    # Metrics
    print(f"  Pages:      {pb['page_count']}")
    print(f"  Words:      {pb['total_words']:,}")
    print(f"  Characters: {pb['total_chars']:,}")
    print(f"  File size:  {pb['file_size']:,} bytes")
    print(f"  Time:       {timings_pdfplumber.get(doc_id, 0)*1000:.1f} ms")

    # Metadata
    meta = pb.get("metadata", {})
    print(f"\n  Metadata ({len([v for v in meta.values() if v])} non-empty / {len(meta)} total):")
    for k, v in meta.items():
        icon = "✅" if v else "❌"
        print(f"    {icon} {k}: {v if v else '(empty)'}")

    # Per-page summary
    print("\n  Per-page breakdown:")
    print(f"    {'Page':>5}  {'Words':>7}  {'Chars':>7}  {'Tables':>7}  {'Raw chars':>10}")
    for p in pb["pages"]:
        print(f"    {p['page_num']:>5}  {p['word_count']:>7,}  {p['char_count']:>7,}  "
              f"{p.get('tables_found', 0):>7}  {p.get('chars_count_raw', 0):>10,}")

    # Full text
    text = pb.get("full_text", "")
    display_text = text if MAX_TEXT_DISPLAY <= 0 else text[:MAX_TEXT_DISPLAY]
    suffix = f"\n\n... [{len(text) - MAX_TEXT_DISPLAY:,} more chars] ..." if MAX_TEXT_DISPLAY > 0 and len(text) > MAX_TEXT_DISPLAY else ""
    print(f"\n  ── Text ({len(text):,} chars{f', showing first {MAX_TEXT_DISPLAY:,}' if MAX_TEXT_DISPLAY > 0 else ''}) ──")
    print(display_text + suffix)
    print()

---
### 🟠 pdfplumber — Employee Handbook

  ❌ No pdfplumber result


---
### 🟠 pdfplumber — Benefit Options

  ❌ No pdfplumber result


---
### 🟠 pdfplumber — Perks Plus Program

  ❌ No pdfplumber result


---
### 🟠 pdfplumber — 0000950170 25 061046

  ❌ No pdfplumber result


---
### 🟠 pdfplumber — 0000950170 25 100235

  ❌ No pdfplumber result


---
### 🟠 pdfplumber — 0001193125 25 256321

  ❌ No pdfplumber result


---
### 🟠 pdfplumber — 0001193125 26 027207

  ❌ No pdfplumber result


In [13]:
# ── Raw results: PDFBox (Java via JPype) ──────────────────────────────────────

for doc_id in downloaded:
    title = SAMPLE_PDFS[doc_id]["title"]
    bx = results_pdfbox.get(doc_id)

    display(Markdown(f"---\n### 🟣 PDFBox (Java) — {title}"))

    if not bx:
        print("  ❌ No PDFBox result")
        continue

    # Metrics
    print(f"  Pages:      {bx['page_count']}")
    print(f"  Words:      {bx['total_words']:,}")
    print(f"  Characters: {bx['total_chars']:,}")
    print(f"  File size:  {bx['file_size']:,} bytes")
    print(f"  Time:       {timings_pdfbox.get(doc_id, 0)*1000:.1f} ms")

    # Metadata
    meta = bx.get("metadata", {})
    print(f"\n  Metadata ({len([v for v in meta.values() if v])} non-empty / {len(meta)} total):")
    for k, v in meta.items():
        icon = "✅" if v else "❌"
        print(f"    {icon} {k}: {v if v else '(empty)'}")

    # Full text
    text = bx.get("full_text", "")
    display_text = text if MAX_TEXT_DISPLAY <= 0 else text[:MAX_TEXT_DISPLAY]
    suffix = f"\n\n... [{len(text) - MAX_TEXT_DISPLAY:,} more chars] ..." if MAX_TEXT_DISPLAY > 0 and len(text) > MAX_TEXT_DISPLAY else ""
    print(f"\n  ── Text ({len(text):,} chars{f', showing first {MAX_TEXT_DISPLAY:,}' if MAX_TEXT_DISPLAY > 0 else ''}) ──")
    print(display_text + suffix)
    print()

---
### 🟣 PDFBox (Java) — Employee Handbook

  Pages:      11
  Words:      2,370
  Characters: 16,454
  File size:  142,977 bytes
  Time:       767.2 ms

  Metadata (5 non-empty / 5 total):
    ✅ author: python-docx
    ✅ creator: Microsoft® Word for Microsoft 365
    ✅ producer: Microsoft® Word for Microsoft 365
    ✅ creation_date: Mon Mar 06 22:57:20 CET 2023
    ✅ mod_date: Mon Mar 06 22:57:20 CET 2023

  ── Text (16,454 chars) ──
Contoso Electronics 
Employee Handbook 
 
 
 
 
 
 
  
This document contains information generated using a language model (Azure OpenAI). The 
information contained in this document is only for demonstration purposes and does not 
reflect the opinions or beliefs of Microsoft. Microsoft makes no representations or 
warranties of any kind, express or implied, about the completeness, accuracy, reliability, 
suitability or availability with respect to the information contained in this document.  
All rights reserved to Microsoft 
  
Contoso Electronics Employee Handbook 
Last Updated: 2023-03-05 
 
Co

---
### 🟣 PDFBox (Java) — Benefit Options

  Pages:      4
  Words:      614
  Characters: 4,386
  File size:  544,811 bytes
  Time:       170.4 ms

  Metadata (5 non-empty / 5 total):
    ✅ author: Liam Cavanagh
    ✅ creator: Microsoft® Word for Microsoft 365
    ✅ producer: Microsoft® Word for Microsoft 365
    ✅ creation_date: Mon Mar 06 22:58:20 CET 2023
    ✅ mod_date: Mon Mar 20 21:05:46 CET 2023

  ── Text (4,386 chars) ──
Contoso Electronics 
Plan and Benefit Packages
This document contains information generated using a language model (Azure OpenAI). The information 
contained in this document is only for demonstration purposes and does not reflect the opinions or 
beliefs of Microsoft. Microsoft makes no representations or warranties of any kind, express or implied, 
about the completeness, accuracy, reliability, suitability or availability with respect to the information 
contained in this document. 
All rights reserved to Microsoft
Welcome to Contoso Electronics! We are excited to offer our employees two comprehensi

---
### 🟣 PDFBox (Java) — Perks Plus Program

  Pages:      4
  Words:      432
  Characters: 2,940
  File size:  115,310 bytes
  Time:       428.7 ms

  Metadata (5 non-empty / 5 total):
    ✅ author: Liam Cavanagh
    ✅ creator: Microsoft® Word for Microsoft 365
    ✅ producer: Microsoft® Word for Microsoft 365
    ✅ creation_date: Tue Mar 07 19:33:37 CET 2023
    ✅ mod_date: Tue Mar 07 19:33:37 CET 2023

  ── Text (2,940 chars) ──
 
 
 
PerksPlus Health and Wellness 
Reimbursement Program for 
Contoso Electronics Employees 
 
 
 
 
 
  
This document contains information generated using a language model (Azure OpenAI). The information 
contained in this document is only for demonstration purposes and does not reflect the opinions or 
beliefs of Microsoft. Microsoft makes no representations or warranties of any kind, express or implied, 
about the completeness, accuracy, reliability, suitability or availability with respect to the information 
contained in this document.  
All rights reserved to Microsoft 
  
Overview 
Introduci

---
### 🟣 PDFBox (Java) — 0000950170 25 061046

  Pages:      72
  Words:      32,946
  Characters: 244,766
  File size:  2,179,871 bytes
  Time:       2021.1 ms

  Metadata (8 non-empty / 8 total):
    ✅ title: Form 10-Q for Microsoft Corp filed 04/30/2025
    ✅ author: Kaleidoscope - kscope.io
    ✅ subject: 10-Q filed 04/30/2025
    ✅ keywords: Microsoft Corp 10-Q
    ✅ creator: Chromium
    ✅ producer: KS - PDF Engine v1.2
    ✅ creation_date: Wed Apr 30 22:14:09 CEST 2025
    ✅ mod_date: Wed Apr 30 22:14:12 CEST 2025

  ── Text (244,766 chars) ──
 
 
 
 
 
UNITED STATES
SECURITIES AND EXCHANGE COMMISSION
Washington, D.C. 20549
 
 
FORM 10-Q
 
 
☒ QUARTERLY REPORT PURSUANT TO SECTION 13 OR 15(d) OF THE SECURITIES EXCHANGE ACT OF 1934
   
  For the Quarterly Period Ended March 31, 2025
   
OR
   
☐ TRANSITION REPORT PURSUANT TO SECTION 13 OR 15(d) OF THE SECURITIES EXCHANGE ACT OF 1934
   
  For the Transition Period From                  to
Commission File Number 001-37845
 
 
MICROSOFT CORPORATION
 
 
WASHINGTON   91-1144442
(S

---
### 🟣 PDFBox (Java) — 0000950170 25 100235

  Pages:      158
  Words:      72,862
  Characters: 515,075
  File size:  3,024,506 bytes
  Time:       1726.3 ms

  Metadata (8 non-empty / 8 total):
    ✅ title: Form 10-K for Microsoft Corp filed 07/30/2025
    ✅ author: Kaleidoscope - kscope.io
    ✅ subject: 10-K filed 07/30/2025
    ✅ keywords: Microsoft Corp 10-K
    ✅ creator: Chromium
    ✅ producer: KS - PDF Engine v1.2
    ✅ creation_date: Wed Jul 30 22:14:39 CEST 2025
    ✅ mod_date: Wed Jul 30 22:14:43 CEST 2025

  ── Text (515,075 chars) ──
 
 
 
 
UNITED STATES
SECURITIES AND EXCHANGE COMMISSION
Washington, D.C. 20549
 
 
FORM 10-K
 
 
☒ ANNUAL REPORT PURSUANT TO SECTION 13 OR 15(d) OF THE SECURITIES EXCHANGE ACT OF 1934 
 
  For the Fiscal Year Ended June 30, 2025
   
  OR
   
☐ TRANSITION REPORT PURSUANT TO SECTION 13 OR 15(d) OF THE SECURITIES EXCHANGE ACT OF 1934
   
  For the Transition Period From                  to
Commission File Number 001-37845 
 
 
MICROSOFT CORPORATION
 
 
WASHINGTON   91-1144442
(STATE OF 

---
### 🟣 PDFBox (Java) — 0001193125 25 256321

  Pages:      67
  Words:      29,166
  Characters: 215,357
  File size:  1,802,290 bytes
  Time:       772.4 ms

  Metadata (8 non-empty / 8 total):
    ✅ title: Form 10-Q for Microsoft Corp filed 10/29/2025
    ✅ author: Kaleidoscope - kscope.io
    ✅ subject: 10-Q filed 10/29/2025
    ✅ keywords: Microsoft Corp 10-Q
    ✅ creator: Chromium
    ✅ producer: KS - PDF Engine v1.2
    ✅ creation_date: Wed Oct 29 21:16:21 CET 2025
    ✅ mod_date: Wed Oct 29 21:16:23 CET 2025

  ── Text (215,357 chars) ──
 
 
 
 
 
UNITED STATES
SECURITIES AND EXCHANGE COMMISSION
Washington, D.C. 20549
 
 
FORM 10-Q
 
 
☒ QUARTERLY REPORT PURSUANT TO SECTION 13 OR 15(d) OF THE SECURITIES EXCHANGE ACT OF 1934
   
  For the Quarterly Period Ended September 30, 2025
   
OR
   
☐ TRANSITION REPORT PURSUANT TO SECTION 13 OR 15(d) OF THE SECURITIES EXCHANGE ACT OF 1934
   
  For the Transition Period From                  to
Commission File Number 001-37845
 
 
MICROSOFT CORPORATION
 
 
WASHINGTON   91-1144442
(

---
### 🟣 PDFBox (Java) — 0001193125 26 027207

  Pages:      71
  Words:      31,837
  Characters: 238,841
  File size:  2,257,229 bytes
  Time:       1004.3 ms

  Metadata (8 non-empty / 8 total):
    ✅ title: Form 10-Q for Microsoft Corp filed 01/28/2026
    ✅ author: Kaleidoscope - kscope.io
    ✅ subject: 10-Q filed 01/28/2026
    ✅ keywords: Microsoft Corp 10-Q
    ✅ creator: Chromium
    ✅ producer: KS - PDF Engine v1.2
    ✅ creation_date: Wed Jan 28 22:13:50 CET 2026
    ✅ mod_date: Wed Jan 28 22:13:55 CET 2026

  ── Text (238,841 chars) ──
 
 
 
 
 
UNITED STATES
SECURITIES AND EXCHANGE COMMISSION
Washington, D.C. 20549
 
 
FORM 10-Q
 
 
☒ QUARTERLY REPORT PURSUANT TO SECTION 13 OR 15(d) OF THE SECURITIES EXCHANGE ACT OF 1934
   
  For the Quarterly Period Ended December 31, 2025
   
OR
   
☐ TRANSITION REPORT PURSUANT TO SECTION 13 OR 15(d) OF THE SECURITIES EXCHANGE ACT OF 1934
   
  For the Transition Period From                  to
Commission File Number 001-37845
 
 
MICROSOFT CORPORATION
 
 
WASHINGTON   91-1144442
(

In [24]:
# ── Raw results: Tika (Docker) ────────────────────────────────────────────────

for doc_id in downloaded:
    title = SAMPLE_PDFS[doc_id]["title"]
    tk = results_tika.get(doc_id)

    display(Markdown(f"---\n### 🔴 Tika — {title}"))

    if not tk:
        print("  ❌ No Tika result (server not available or extraction failed)")
        continue

    # Metrics
    print(f"  Pages:      {tk['page_count'] or '?'}")
    print(f"  Words:      {tk['total_words']:,}")
    print(f"  Characters: {tk['total_chars']:,}")
    print(f"  File size:  {tk['file_size']:,} bytes")
    print(f"  Time:       {timings_tika.get(doc_id, 0)*1000:.1f} ms")

    # Metadata (normalized)
    meta = tk.get("metadata", {})
    print(f"\n  Metadata ({len([v for v in meta.values() if v])} non-empty / {len(meta)} total):")
    for k, v in meta.items():
        icon = "✅" if v else "❌"
        print(f"    {icon} {k}: {v if v else '(empty)'}")

    # Full text
    text = tk.get("full_text", "")
    display_text = text if MAX_TEXT_DISPLAY <= 0 else text[:MAX_TEXT_DISPLAY]
    suffix = f"\n\n... [{len(text) - MAX_TEXT_DISPLAY:,} more chars] ..." if MAX_TEXT_DISPLAY > 0 and len(text) > MAX_TEXT_DISPLAY else ""
    print(f"\n  ── Text ({len(text):,} chars{f', showing first {MAX_TEXT_DISPLAY:,}' if MAX_TEXT_DISPLAY > 0 else ''}) ──")
    print(display_text + suffix)
    print()

---
### 🔴 Tika — Employee Handbook

  Pages:      11
  Words:      2,370
  Characters: 16,514
  File size:  142,977 bytes
  Time:       41.1 ms

  Metadata (8 non-empty / 11 total):
    ❌ title: (empty)
    ✅ author: python-docx
    ❌ subject: (empty)
    ❌ keywords: (empty)
    ✅ creator: Microsoft® Word for Microsoft 365
    ✅ producer: Microsoft® Word for Microsoft 365
    ✅ creation_date: 2023-03-06T21:57:20Z
    ✅ mod_date: 2023-03-06T21:57:20Z
    ✅ language: en-US
    ✅ content_type: application/pdf
    ✅ pdf_version: 1.7

  ── Text (16,514 chars) ──


















































Contoso Electronics 

Employee Handbook 

 

 
 

 

 

 

  



This document contains information generated using a language model (Azure OpenAI). The 

information contained in this document is only for demonstration purposes and does not 

reflect the opinions or beliefs of Microsoft. Microsoft makes no representations or 

warranties of any kind, express or implied, about the completeness, accuracy, reliability, 


---
### 🔴 Tika — Benefit Options

  Pages:      4
  Words:      614
  Characters: 4,407
  File size:  544,811 bytes
  Time:       57.1 ms

  Metadata (8 non-empty / 11 total):
    ❌ title: (empty)
    ✅ author: Liam Cavanagh
    ❌ subject: (empty)
    ❌ keywords: (empty)
    ✅ creator: Microsoft® Word for Microsoft 365
    ✅ producer: Microsoft® Word for Microsoft 365
    ✅ creation_date: 2023-03-06T21:58:20Z
    ✅ mod_date: 2023-03-20T20:05:46Z
    ✅ language: en-US
    ✅ content_type: application/pdf
    ✅ pdf_version: 1.7

  ── Text (4,407 chars) ──



















































Contoso Electronics 
Plan and Benefit Packages



This document contains information generated using a language model (Azure OpenAI). The information 
contained in this document is only for demonstration purposes and does not reflect the opinions or 
beliefs of Microsoft. Microsoft makes no representations or warranties of any kind, express or implied, 
about the completeness, accuracy, reliability, suitability or availa

---
### 🔴 Tika — Perks Plus Program

  Pages:      4
  Words:      432
  Characters: 2,994
  File size:  115,310 bytes
  Time:       35.2 ms

  Metadata (8 non-empty / 11 total):
    ❌ title: (empty)
    ✅ author: Liam Cavanagh
    ❌ subject: (empty)
    ❌ keywords: (empty)
    ✅ creator: Microsoft® Word for Microsoft 365
    ✅ producer: Microsoft® Word for Microsoft 365
    ✅ creation_date: 2023-03-07T18:33:37Z
    ✅ mod_date: 2023-03-07T18:33:37Z
    ✅ language: en-US
    ✅ content_type: application/pdf
    ✅ pdf_version: 1.7

  ── Text (2,994 chars) ──




















































 

 

 

PerksPlus Health and Wellness 

Reimbursement Program for 

Contoso Electronics Employees 
 

 
 

 

 

  



This document contains information generated using a language model (Azure OpenAI). The information 

contained in this document is only for demonstration purposes and does not reflect the opinions or 

beliefs of Microsoft. Microsoft makes no representations or warranties of any kind, express or impli

---
### 🔴 Tika — 0000950170 25 061046

  Pages:      72
  Words:      32,953
  Characters: 242,549
  File size:  2,179,871 bytes
  Time:       2809.1 ms

  Metadata (11 non-empty / 11 total):
    ✅ title: Form 10-Q for Microsoft Corp filed 04/30/2025
    ✅ author: Kaleidoscope - kscope.io
    ✅ subject: Microsoft Corp 10-Q
    ✅ keywords: Microsoft Corp 10-Q
    ✅ creator: Chromium
    ✅ producer: KS - PDF Engine v1.2
    ✅ creation_date: 2025-04-30T20:14:09Z
    ✅ mod_date: 2025-04-30T20:14:12Z
    ✅ language: en-us
    ✅ content_type: application/pdf
    ✅ pdf_version: 1.7

  ── Text (242,549 chars) ──








































Form 10-Q for Microsoft Corp filed 04/30/2025


 

 

 
 

 

UNITED STATES
SECURITIES AND EXCHANGE COMMISSION

Washington, D.C. 20549
 

 

FORM 10-Q
 

 

☒ QUARTERLY REPORT PURSUANT TO SECTION 13 OR 15(d) OF THE SECURITIES EXCHANGE ACT OF 1934
   

  For the Quarterly Period Ended March 31, 2025
   

OR
   

☐ TRANSITION REPORT PURSUANT TO SECTION 13 OR 15(d) OF THE SECURITIES EXCH

---
### 🔴 Tika — 0000950170 25 100235

  Pages:      158
  Words:      72,914
  Characters: 514,662
  File size:  3,024,506 bytes
  Time:       3975.2 ms

  Metadata (11 non-empty / 11 total):
    ✅ title: Form 10-K for Microsoft Corp filed 07/30/2025
    ✅ author: Kaleidoscope - kscope.io
    ✅ subject: Microsoft Corp 10-K
    ✅ keywords: Microsoft Corp 10-K
    ✅ creator: Chromium
    ✅ producer: KS - PDF Engine v1.2
    ✅ creation_date: 2025-07-30T20:14:39Z
    ✅ mod_date: 2025-07-30T20:14:43Z
    ✅ language: en-us
    ✅ content_type: application/pdf
    ✅ pdf_version: 1.7

  ── Text (514,662 chars) ──








































Form 10-K for Microsoft Corp filed 07/30/2025


 

 

 

 

UNITED STATES
SECURITIES AND EXCHANGE COMMISSION

Washington, D.C. 20549
 

 

FORM 10-K
 

 

☒ ANNUAL REPORT PURSUANT TO SECTION 13 OR 15(d) OF THE SECURITIES EXCHANGE ACT OF 1934 
 

  For the Fiscal Year Ended June 30, 2025
   

  OR
   

☐ TRANSITION REPORT PURSUANT TO SECTION 13 OR 15(d) OF THE SECURITIES EXCHANGE ACT 

---
### 🔴 Tika — 0001193125 25 256321

  Pages:      67
  Words:      29,173
  Characters: 213,273
  File size:  1,802,290 bytes
  Time:       1423.5 ms

  Metadata (11 non-empty / 11 total):
    ✅ title: Form 10-Q for Microsoft Corp filed 10/29/2025
    ✅ author: Kaleidoscope - kscope.io
    ✅ subject: Microsoft Corp 10-Q
    ✅ keywords: Microsoft Corp 10-Q
    ✅ creator: Chromium
    ✅ producer: KS - PDF Engine v1.2
    ✅ creation_date: 2025-10-29T20:16:21Z
    ✅ mod_date: 2025-10-29T20:16:23Z
    ✅ language: en-us
    ✅ content_type: application/pdf
    ✅ pdf_version: 1.7

  ── Text (213,273 chars) ──








































Form 10-Q for Microsoft Corp filed 10/29/2025


 

 

 
 

 

UNITED STATES
SECURITIES AND EXCHANGE COMMISSION

Washington, D.C. 20549
 

 

FORM 10-Q
 

 

☒ QUARTERLY REPORT PURSUANT TO SECTION 13 OR 15(d) OF THE SECURITIES EXCHANGE ACT OF 1934
   

  For the Quarterly Period Ended September 30, 2025
   

OR
   

☐ TRANSITION REPORT PURSUANT TO SECTION 13 OR 15(d) OF THE SECURITIES 

---
### 🔴 Tika — 0001193125 26 027207

  Pages:      71
  Words:      31,844
  Characters: 236,646
  File size:  2,257,229 bytes
  Time:       3174.4 ms

  Metadata (11 non-empty / 11 total):
    ✅ title: Form 10-Q for Microsoft Corp filed 01/28/2026
    ✅ author: Kaleidoscope - kscope.io
    ✅ subject: Microsoft Corp 10-Q
    ✅ keywords: Microsoft Corp 10-Q
    ✅ creator: Chromium
    ✅ producer: KS - PDF Engine v1.2
    ✅ creation_date: 2026-01-28T21:13:50Z
    ✅ mod_date: 2026-01-28T21:13:55Z
    ✅ language: en-us
    ✅ content_type: application/pdf
    ✅ pdf_version: 1.7

  ── Text (236,646 chars) ──








































Form 10-Q for Microsoft Corp filed 01/28/2026


 

 

 
 

 

UNITED STATES
SECURITIES AND EXCHANGE COMMISSION

Washington, D.C. 20549
 

 

FORM 10-Q
 

 

☒ QUARTERLY REPORT PURSUANT TO SECTION 13 OR 15(d) OF THE SECURITIES EXCHANGE ACT OF 1934
   

  For the Quarterly Period Ended December 31, 2025
   

OR
   

☐ TRANSITION REPORT PURSUANT TO SECTION 13 OR 15(d) OF THE SECURITIES E
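
The Tika text above opens with a long run of blank lines and is double-spaced in places, so raw character counts are not directly comparable across extractors. A minimal sketch of whitespace normalization that could be applied before comparing follows. The helper names `normalize_whitespace` and `whitespace_overhead` and the sample string are hypothetical and not part of the notebook.

```python
import re


def normalize_whitespace(text: str) -> str:
    """Collapse layout whitespace so extractor formatting does not skew comparisons."""
    # Collapse runs of spaces, tabs and non-breaking spaces, trim each line,
    # then drop lines that end up empty.
    lines = (re.sub(r"[ \t\u00a0]+", " ", line).strip() for line in text.splitlines())
    return "\n".join(line for line in lines if line)


def whitespace_overhead(text: str) -> float:
    """Fraction of characters removed by normalization (0.0 means nothing removed)."""
    if not text:
        return 0.0
    return 1 - len(normalize_whitespace(text)) / len(text)


# Hypothetical sample shaped like the start of the Tika output above
sample = "\n\n\n\nContoso Electronics \n\nEmployee Handbook \n\n \n\n"
print(repr(normalize_whitespace(sample)))
print(f"{whitespace_overhead(sample):.0%} of characters were layout whitespace")
```

Applying the same normalization to every library's `full_text` before counting characters would make the character columns in the comparison easier to interpret.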

In [34]:
# ── Raw results: Document Intelligence (Azure) ───────────────────────────────

for doc_id in downloaded:
    title = SAMPLE_PDFS[doc_id]["title"]
    di = results_docint.get(doc_id)

    display(Markdown(f"---\n### 🟡 Document Intelligence — {title}"))

    if not di:
        print("  ❌ No Document Intelligence result (not configured or extraction failed)")
        continue

    # Metrics
    print(f"  Pages:       {di['page_count']}")
    print(f"  Words:       {di['total_words']:,}")
    print(f"  Characters:  {di['total_chars']:,}")
    print(f"  File size:   {di['file_size']:,} bytes")
    print(f"  Time:        {timings_docint.get(doc_id, 0)*1000:.1f} ms")
    print(f"  Tables:      {di.get('table_count', 0)}")
    print(f"  Paragraphs:  {di.get('paragraph_count', 0)}")

    # Per-page summary
    pages_info = di.get("pages", [])
    if pages_info:
        print("\n  Per-page breakdown:")
        print(f"    {'Page':>5}  {'Words':>7}  {'Chars':>7}  {'Lines':>6}  {'Sel.Marks':>10}")
        for p in pages_info:
            print(f"    {p['page_num']:>5}  {p['word_count']:>7,}  {p['char_count']:>7,}  "
                  f"{p.get('lines', 0):>6}  {p.get('selection_marks', 0):>10}")

    # Full text
    text = di.get("full_text", "")
    display_text = text if MAX_TEXT_DISPLAY <= 0 else text[:MAX_TEXT_DISPLAY]
    suffix = f"\n\n... [{len(text) - MAX_TEXT_DISPLAY:,} more chars] ..." if MAX_TEXT_DISPLAY > 0 and len(text) > MAX_TEXT_DISPLAY else ""
    print(f"\n  ── Text ({len(text):,} chars{f', showing first {MAX_TEXT_DISPLAY:,}' if MAX_TEXT_DISPLAY > 0 else ''}) ──")
    print(display_text + suffix)
    print()

---
### 🟡 Document Intelligence — Employee Handbook

  Pages:       11
  Words:       2,372
  Characters:  15,656
  File size:   142,977 bytes
  Time:        8155.6 ms
  Tables:      0
  Paragraphs:  142

  Per-page breakdown:
     Page    Words    Chars   Lines   Sel.Marks
        1        6       52       4           0
        2       66      405       6           0
        3      363    2,008      34           0
        4      322    1,724      30           0
        5      288    1,570      29           0
        6      324    1,875      31           0
        7      265    1,511      28           0
        8      200    1,165      26           0
        9      214    1,186      25           0
       10      244    1,336      31           0
       11       80      453      16           0

  ── Text (15,656 chars) ──
Contoso Electronics Employee Handbook
Contoso Electronics
This document contains information generated using a language model (Azure OpenAI). The information contained in this document is only for demonstration purposes a

---
### 🟡 Document Intelligence — Benefit Options

  Pages:       4
  Words:       639
  Characters:  4,461
  File size:   544,811 bytes
  Time:        5606.7 ms
  Tables:      1
  Paragraphs:  34

  Per-page breakdown:
     Page    Words    Chars   Lines   Sel.Marks
        1        8       58       4           0
        2       66      405       6           0
        3      393    2,453      34           0
        4      172      907      25           0

  ── Text (4,461 chars) ──
Contoso Electronics
Plan and Benefit Packages
Contoso Electronics
This document contains information generated using a language model (Azure OpenAI). The information contained in this document is only for demonstration purposes and does not reflect the opinions or beliefs of Microsoft. Microsoft makes no representations or warranties of any kind, express or implied, about the completeness, accuracy, reliability, suitability or availability with respect to the information contained in this document.
All rights reserved to Microsoft
Welcome to Contoso Electro

---
### 🟡 Document Intelligence — Perks Plus Program

  Pages:       4
  Words:       434
  Characters:  2,829
  File size:   115,310 bytes
  Time:        5325.8 ms
  Tables:      0
  Paragraphs:  30

  Per-page breakdown:
     Page    Words    Chars   Lines   Sel.Marks
        1       12       94       5           0
        2       66      405       6           0
        3      352    1,878      37           0
        4        4       19       1           0

  ── Text (2,829 chars) ──
PerksPlus Health and Wellness Reimbursement Program for Contoso Electronics Employees
Contoso Electronics
This document contains information generated using a language model (Azure OpenAI). The information contained in this document is only for demonstration purposes and does not reflect the opinions or beliefs of Microsoft. Microsoft makes no representations or warranties of any kind, express or implied, about the completeness, accuracy, reliability, suitability or availability with respect to the information contained in this document.
All rights reserved

---
### 🟡 Document Intelligence — 0000950170 25 061046

  Pages:       72
  Words:       32,958
  Characters:  217,315
  File size:   2,179,871 bytes
  Time:        13575.2 ms
  Tables:      54
  Paragraphs:  3141

  Per-page breakdown:
     Page    Words    Chars   Lines   Sel.Marks
        1      396    2,145      58          14
        2      184      893      59           0
        3      187      982     131           0
        4       90      447      53           0
        5      239    1,521     126           0
        6      344    1,949     199           0
        7      177      992     109           0
        8      556    3,354      42           0
        9      695    3,807      42           0
       10      606    3,479      38           0
       11      474    2,683      81           0
       12      259    1,243     135           0
       13      262    1,100     193           0
       14      372    1,687     197           0
       15      418    2,203     181           0
       16      410    2,438      61           0
   

---
### 🟡 Document Intelligence — 0000950170 25 100235

  Pages:       158
  Words:       72,922
  Characters:  480,749
  File size:   3,024,506 bytes
  Time:        21682.6 ms
  Tables:      80
  Paragraphs:  4679

  Per-page breakdown:
     Page    Words    Chars   Lines   Sel.Marks
        1      631    3,380      72          21
        2      233    1,306      88           0
        3      561    3,443      40           0
        4      582    3,457      43           0
        5      590    3,607      41           0
        6      496    3,237      40           0
        7      622    3,960      42           0
        8      479    2,848      38           0
        9      683    4,002      46           0
       10      559    3,370      38           0
       11      637    4,037      45           0
       12      494    3,043      38           0
       13      453    2,693      31           0
       14      658    3,556      59           0
       15      346    2,026      25           0
       16      674    3,971      45           0
  

---
### 🟡 Document Intelligence — 0001193125 25 256321

  Pages:       67
  Words:       29,179
  Characters:  194,408
  File size:   1,802,290 bytes
  Time:        12251.0 ms
  Tables:      53
  Paragraphs:  2657

  Per-page breakdown:
     Page    Words    Chars   Lines   Sel.Marks
        1      396    2,151      58          14
        2      177      890      53           0
        3      134      728      80           0
        4       66      362      32           0
        5      235    1,505     123           0
        6      253    1,468     118           0
        7      131      765      66           0
        8      580    3,462      42           0
        9      682    3,784      43           0
       10      608    3,541      41           0
       11      315    1,656     110           0
       12      262    1,121     197           0
       13      377    1,739     201           0
       14      418    2,200     181           0
       15      410    2,448      61           0
       16      277    1,358     150           0
   

---
### 🟡 Document Intelligence — 0001193125 26 027207

  Pages:       71
  Words:       31,845
  Characters:  210,763
  File size:   2,257,229 bytes
  Time:        13247.0 ms
  Tables:      52
  Paragraphs:  3255

  Per-page breakdown:
     Page    Words    Chars   Lines   Sel.Marks
        1      396    2,150      58          14
        2      184      907      59           0
        3      188      993     125           0
        4       90      458      52           0
        5      235    1,505     123           0
        6      331    1,895     193           0
        7      177    1,001     109           0
        8      447    2,735      36           0
        9      798    4,366      45           0
       10      617    3,534      40           0
       11      378    2,089      75           0
       12      333    1,649     138           0
       13      262    1,112     193           0
       14      372    1,730     186           0
       15      418    2,194     181           0
       16      410    2,445      61           0
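
A quick consistency check on the Document Intelligence output is to sum the per-page counts and compare them with the document totals. The sketch below assumes the result dictionary shape used in the cell above (`total_words`, `total_chars`, and `pages` entries with `word_count` and `char_count`). The function name `check_page_totals` is hypothetical, and the numbers are copied from the Benefit Options output.

```python
def check_page_totals(result: dict) -> dict:
    """Compare per-page sums with the document-level totals."""
    pages = result.get("pages", [])
    word_sum = sum(p["word_count"] for p in pages)
    char_sum = sum(p["char_count"] for p in pages)
    return {
        "word_gap": result["total_words"] - word_sum,
        "char_gap": result["total_chars"] - char_sum,
    }


# Values copied from the Benefit Options output above
benefit_options = {
    "total_words": 639,
    "total_chars": 4461,
    "pages": [
        {"page_num": 1, "word_count": 8, "char_count": 58},
        {"page_num": 2, "word_count": 66, "char_count": 405},
        {"page_num": 3, "word_count": 393, "char_count": 2453},
        {"page_num": 4, "word_count": 172, "char_count": 907},
    ],
}

gaps = check_page_totals(benefit_options)
print(gaps)  # {'word_gap': 0, 'char_gap': 638}
```

For the three short documents the per-page word counts add up to the total exactly, while the character gap equals the word count minus one (638 here, 2,371 for the Employee Handbook, 433 for Perks Plus Program). That pattern is consistent with the per-page character counts excluding the separators between words, although the notebook does not confirm this.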
   

## 9. Comparison: PdfPig vs PDFBox vs Tika vs PyMuPDF vs Document Intelligence

Compare the extracted text and metadata across all libraries side by side. pdfplumber is excluded because it doesn't scale to the larger documents.

In [35]:
# Text extraction comparison (all libraries)
comparison_all = []

for doc_id in downloaded:
    title = SAMPLE_PDFS[doc_id]["title"]
    mu = results_pymupdf.get(doc_id)
    bx = results_pdfbox.get(doc_id)
    tk = results_tika.get(doc_id)
    di = results_docint.get(doc_id)

    # Get PdfPig result
    pig_result = results_pdfpig.get(doc_id, {})
    pig_crackers = [c for c in pig_result.get("crackers", []) if c.get("success")]
    pig = pig_crackers[0] if pig_crackers else None

    row = {"Document": title}

    # PdfPig (C# simulator)
    if pig:
        row["PdfPig Words"] = f"{pig.get('wordCount', 0):,}"
        row["PdfPig Chars"] = f"{pig.get('characterCount', 0):,}"
        row["PdfPig Pages"] = pig.get("pageCount", "?")
        row["PdfPig ms"] = f"{pig.get('extractionTimeMs', 0):.1f}"
    else:
        row["PdfPig Words"] = "—"
        row["PdfPig Chars"] = "—"
        row["PdfPig Pages"] = "—"
        row["PdfPig ms"] = "—"

    # PDFBox (Java)
    if bx:
        row["PDFBox Words"] = f"{bx['total_words']:,}"
        row["PDFBox Chars"] = f"{bx['total_chars']:,}"
        row["PDFBox ms"] = f"{timings_pdfbox.get(doc_id, 0)*1000:.1f}"
    else:
        row["PDFBox Words"] = "—"
        row["PDFBox Chars"] = "—"
        row["PDFBox ms"] = "—"

    # Tika (Docker)
    if tk:
        row["Tika Words"] = f"{tk['total_words']:,}"
        row["Tika Chars"] = f"{tk['total_chars']:,}"
        row["Tika ms"] = f"{timings_tika.get(doc_id, 0)*1000:.1f}"
    else:
        row["Tika Words"] = "—"
        row["Tika Chars"] = "—"
        row["Tika ms"] = "—"

    # PyMuPDF
    if mu:
        row["PyMuPDF Words"] = f"{mu['total_words']:,}"
        row["PyMuPDF Chars"] = f"{mu['total_chars']:,}"
        row["PyMuPDF ms"] = f"{timings_pymupdf.get(doc_id, 0)*1000:.1f}"
    else:
        row["PyMuPDF Words"] = "—"
        row["PyMuPDF Chars"] = "—"
        row["PyMuPDF ms"] = "—"

    # Document Intelligence (Azure)
    if di:
        row["DocInt Words"] = f"{di['total_words']:,}"
        row["DocInt Chars"] = f"{di['total_chars']:,}"
        row["DocInt ms"] = f"{timings_docint.get(doc_id, 0)*1000:.1f}"
    else:
        row["DocInt Words"] = "—"
        row["DocInt Chars"] = "—"
        row["DocInt ms"] = "—"

    comparison_all.append(row)

df_all = pd.DataFrame(comparison_all)
display(Markdown("### Text Extraction: Word & Character Count Comparison"))
display(df_all)

### Text Extraction: Word & Character Count Comparison

| | Document | PdfPig Words | PdfPig Chars | PdfPig Pages | PdfPig ms | PDFBox Words | PDFBox Chars | PDFBox ms | Tika Words | Tika Chars | Tika ms | PyMuPDF Words | PyMuPDF Chars | PyMuPDF ms | DocInt Words | DocInt Chars | DocInt ms |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Employee Handbook | 2367 | 15777 | 11 | 2288.4 | 2370 | 16454 | 767.2 | 2370 | 16514 | 41.1 | 2370 | 16118 | 422.2 | 2372 | 15656 | 8155.6 |
| 1 | Benefit Options | 609 | 4289 | 4 | 570.8 | 614 | 4386 | 170.4 | 614 | 4407 | 57.1 | 507 | 3677 | 805.0 | 639 | 4461 | 5606.7 |
| 2 | Perks Plus Program | 432 | 2831 | 4 | 622.2 | 432 | 2940 | 428.7 | 432 | 2994 | 35.2 | 432 | 2907 | 267.3 | 434 | 2829 | 5325.8 |
| 3 | 0000950170 25 061046 | 31681 | 227261 | 72 | 2124.6 | 32946 | 244766 | 2021.1 | 32953 | 242549 | 2809.1 | 32946 | 242269 | 2512.9 | 32958 | 217315 | 13575.2 |
| 4 | 0000950170 25 100235 | 71031 | 491352 | 158 | 2349.3 | 72862 | 515075 | 1726.3 | 72914 | 514662 | 3975.2 | 72865 | 510412 | 2897.5 | 72922 | 480749 | 21682.6 |
| 5 | 0001193125 25 256321 | 28095 | 201591 | 67 | 2160.9 | 29166 | 215357 | 772.4 | 29173 | 213273 | 1423.5 | 29166 | 213507 | 1901.8 | 29179 | 194408 | 12251.0 |
| 6 | 0001193125 26 027207 | 30516 | 220770 | 71 | 2267.5 | 31837 | 238841 | 1004.3 | 31844 | 236646 | 3174.4 | 31837 | 236416 | 2660.3 | 31845 | 210763 | 13247.0 |
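
To spot extractors that disagree with the rest, one option is to compare each library's word count with the per-document median. The sketch below uses only the standard library and word counts copied from three rows of the table above. The function name `flag_outliers` and the 3% threshold are assumptions chosen for illustration, not values used elsewhere in the notebook.

```python
from statistics import median

# Word counts copied from the comparison table above
word_counts = {
    "Employee Handbook": {"PdfPig": 2367, "PDFBox": 2370, "Tika": 2370, "PyMuPDF": 2370, "DocInt": 2372},
    "Benefit Options": {"PdfPig": 609, "PDFBox": 614, "Tika": 614, "PyMuPDF": 507, "DocInt": 639},
    "0000950170 25 061046": {"PdfPig": 31681, "PDFBox": 32946, "Tika": 32953, "PyMuPDF": 32946, "DocInt": 32958},
}


def flag_outliers(counts: dict, threshold: float = 0.03) -> dict:
    """Return {library: relative deviation} for counts further than `threshold` from the median."""
    ref = median(counts.values())
    if ref == 0:
        return {}
    return {
        lib: (n - ref) / ref
        for lib, n in counts.items()
        if abs(n - ref) / ref > threshold
    }


for doc, counts in word_counts.items():
    flagged = flag_outliers(counts)
    summary = ", ".join(f"{lib} {dev:+.1%}" for lib, dev in flagged.items()) or "none"
    print(f"{doc}: {summary}")
```

With this threshold, the Employee Handbook has no outliers, Benefit Options flags PyMuPDF (low) and Document Intelligence (high), and the 10-Q filing flags PdfPig (low).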


In [36]:
# Metadata comparison (all libraries)
meta_all = []

for doc_id in downloaded:
    title = SAMPLE_PDFS[doc_id]["title"]
    # "or {}" also covers a library that stored None for this document or its metadata
    mu = (results_pymupdf.get(doc_id) or {}).get("metadata") or {}
    bx = (results_pdfbox.get(doc_id) or {}).get("metadata") or {}
    tk = (results_tika.get(doc_id) or {}).get("metadata") or {}
    di = (results_docint.get(doc_id) or {}).get("metadata") or {}

    pig_result = results_pdfpig.get(doc_id, {})
    pig_crackers = [c for c in pig_result.get("crackers", []) if c.get("success")]
    pig = pig_crackers[0] if pig_crackers else {}
    pig_meta = pig.get("metadata", {})

    for field_name, pig_key, bx_key, tk_key, mu_key, di_key in [
        ("Title",         "title",       "title",         "title",         "title",         "title"),
        ("Author",        "author",      "author",        "author",        "author",        "author"),
        ("Creator",       "creator",     "creator",       "creator",       "creator",       "creator"),
        ("Producer",      "producer",    "producer",      "producer",      "producer",      "producer"),
        ("Creation Date", "createdDate", "creation_date", "creation_date", "creation_date", "creation_date"),
        ("Modified Date", "modifiedDate","mod_date",      "mod_date",      "mod_date",      "mod_date"),
        ("Subject",       "subject",     "subject",       "subject",       "subject",       "subject"),
        ("Keywords",      "keywords",    "keywords",      "keywords",      "keywords",      "keywords"),
    ]:
        pig_val = pig.get(pig_key, "") or pig_meta.get(pig_key, "") or ""
        bx_val = bx.get(bx_key, "") or ""
        tk_val = tk.get(tk_key, "") or ""
        mu_val = mu.get(mu_key, "") or ""
        di_val = di.get(di_key, "") or ""

        meta_all.append({
            "Document": title,
            "Field": field_name,
            "PdfPig (C#)": str(pig_val) if pig_val else "❌",
            "PDFBox (Java)": str(bx_val) if bx_val else "❌",
            "Tika (Docker)": str(tk_val) if tk_val else "❌",
            "PyMuPDF": str(mu_val) if mu_val else "❌",
            "DocIntelligence": str(di_val) if di_val else "❌",
        })

df_meta_all = pd.DataFrame(meta_all)
display(Markdown("### Metadata Comparison Across All Libraries"))
display(df_meta_all)

### Metadata Comparison Across All Libraries

| | Document | Field | PdfPig (C#) | PDFBox (Java) | Tika (Docker) | PyMuPDF | DocIntelligence |
|---|---|---|---|---|---|---|---|
| 0 | Employee Handbook | Title | ❌ | ❌ | ❌ | ❌ | ❌ |
| 1 | Employee Handbook | Author | python-docx | python-docx | python-docx | python-docx | ❌ |
| 2 | Employee Handbook | Creator | Microsoft® Word for Microsoft 365 | Microsoft® Word for Microsoft 365 | Microsoft® Word for Microsoft 365 | Microsoft® Word for Microsoft 365 | ❌ |
| 3 | Employee Handbook | Producer | Microsoft® Word for Microsoft 365 | Microsoft® Word for Microsoft 365 | Microsoft® Word for Microsoft 365 | Microsoft® Word for Microsoft 365 | ❌ |
| 4 | Employee Handbook | Creation Date | 2023-03-06T13:57:20.0000000+00:00 | Mon Mar 06 22:57:20 CET 2023 | 2023-03-06T21:57:20Z | D:20230306135720-08'00' | ❌ |
| 5 | Employee Handbook | Modified Date | 2023-03-06T13:57:20.0000000+00:00 | Mon Mar 06 22:57:20 CET 2023 | 2023-03-06T21:57:20Z | D:20230306135720-08'00' | ❌ |
| 6 | Employee Handbook | Subject | ❌ | ❌ | ❌ | ❌ | ❌ |
| 7 | Employee Handbook | Keywords | ❌ | ❌ | ❌ | ❌ | ❌ |
| 8 | Benefit Options | Title | ❌ | ❌ | ❌ | ❌ | ❌ |
| 9 | Benefit Options | Author | Liam Cavanagh | Liam Cavanagh | Liam Cavanagh | Liam Cavanagh | ❌ |
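
The date rows above show the same instant in four different formats, which makes them hard to compare by eye. A minimal sketch for normalizing the raw PDF date string that PyMuPDF returns (`D:YYYYMMDDHHmmSS` followed by an optional offset) to UTC ISO 8601 is shown below. The function name `pdf_date_to_utc` is hypothetical, and treating a missing offset as UTC is an assumption.

```python
import re
from datetime import datetime, timedelta, timezone
from typing import Optional

# D:YYYY[MM[DD[HH[mm[SS]]]]] followed by an optional Z or +HH'mm' / -HH'mm' offset
_PDF_DATE = re.compile(
    r"^D:(\d{4})(\d{2})?(\d{2})?(\d{2})?(\d{2})?(\d{2})?"
    r"(?:([Zz+-])(?:(\d{2})'?(?:(\d{2})'?)?)?)?$"
)


def pdf_date_to_utc(raw: str) -> Optional[str]:
    """Convert a raw PDF date string to an ISO 8601 UTC timestamp, or None if unparseable."""
    m = _PDF_DATE.match(raw.strip())
    if not m:
        return None
    # Missing components fall back to the defaults (January, day 1, midnight).
    defaults = (0, 1, 1, 0, 0, 0)
    year, month, day, hour, minute, second = (
        int(g) if g else d for g, d in zip(m.groups()[:6], defaults)
    )
    sign, off_h, off_m = m.group(7), m.group(8), m.group(9)
    offset = timedelta(hours=int(off_h or 0), minutes=int(off_m or 0))
    if sign == "-":
        offset = -offset
    # Assumption: a missing offset (or "Z") is treated as UTC.
    local = datetime(year, month, day, hour, minute, second, tzinfo=timezone(offset))
    return local.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")


print(pdf_date_to_utc("D:20230306135720-08'00'"))  # 2023-03-06T21:57:20Z
```

The normalized PyMuPDF value matches the Tika value (`2023-03-06T21:57:20Z`) and the PDFBox value once CET is converted to UTC. The PdfPig value appears to carry the local wall-clock time (13:57:20) with a `+00:00` offset, so it does not agree with the other three and may be worth checking in the simulator.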


In [37]:
# Content text comparison — show first 500 chars from each library
COMPARE_CHARS = 500

for doc_id in downloaded:
    title = SAMPLE_PDFS[doc_id]["title"]
    display(Markdown(f"---\n### 📄 {title} — First {COMPARE_CHARS} chars from each library"))

    # PdfPig
    pig_result = results_pdfpig.get(doc_id, {})
    pig_crackers = [c for c in pig_result.get("crackers", []) if c.get("success")]
    pig_text = pig_crackers[0].get("content", "") if pig_crackers else ""

    # PDFBox
    bx_text = results_pdfbox.get(doc_id, {}).get("full_text", "")

    # Tika
    tk_text = results_tika.get(doc_id, {}).get("full_text", "")

    # PyMuPDF
    mu_text = results_pymupdf.get(doc_id, {}).get("full_text", "")

    # Document Intelligence
    di_text = results_docint.get(doc_id, {}).get("full_text", "")

    display(Markdown("**PdfPig (C# / simulator):**"))
    print(pig_text[:COMPARE_CHARS] or "(no result)")
    print()

    display(Markdown("**PDFBox (Java):**"))
    print(bx_text[:COMPARE_CHARS] or "(no result)")
    print()

    display(Markdown("**Tika (Docker / Azure Search engine):**"))
    print(tk_text[:COMPARE_CHARS] or "(no result)")
    print()

    display(Markdown("**PyMuPDF (Python):**"))
    print(mu_text[:COMPARE_CHARS] or "(no result)")
    print()

    display(Markdown("**Document Intelligence (Azure):**"))
    print(di_text[:COMPARE_CHARS] or "(no result)")
    print()

---
### 📄 Employee Handbook — First 500 chars from each library

**PdfPig (C# / simulator):**

Contoso Electronics Employee Handbook         

This document contains information generated using a language model (Azure OpenAI). The information contained in this document is only for demonstration purposes and does not reflect the opinions or beliefs of Microsoft. Microsoft makes no representations or warranties of any kind, express or implied, about the completeness, accuracy, reliability, suitability or availability with respect to the information contained in this document.  All rights 



**PDFBox (Java):**

Contoso Electronics 
Employee Handbook 
 
 
 
 
 
 
  
This document contains information generated using a language model (Azure OpenAI). The 
information contained in this document is only for demonstration purposes and does not 
reflect the opinions or beliefs of Microsoft. Microsoft makes no representations or 
warranties of any kind, express or implied, about the completeness, accuracy, reliability, 
suitability or availability with respect to the information contained in this 



**Tika (Docker / Azure Search engine):**



















































Contoso Electronics 

Employee Handbook 

 

 
 

 

 

 

  



This document contains information generated using a language model (Azure OpenAI). The 

information contained in this document is only for demonstration purposes and does not 

reflect the opinions or beliefs of Microsoft. Microsoft makes no representations or 

warranties of any kind, express or implied, about the completeness, accuracy, reliability, 

suitability or availability



**PyMuPDF (Python):**

Contoso Electronics 
Employee Handbook 
 
 
 
 
 
 
 
 


This document contains information generated using a language model (Azure OpenAI). The 
information contained in this document is only for demonstration purposes and does not 
reflect the opinions or beliefs of Microsoft. Microsoft makes no representations or 
warranties of any kind, express or implied, about the completeness, accuracy, reliability, 
suitability or availability with respect to the information contained in this document. 



**Document Intelligence (Azure):**

Contoso Electronics Employee Handbook
Contoso Electronics
This document contains information generated using a language model (Azure OpenAI). The information contained in this document is only for demonstration purposes and does not reflect the opinions or beliefs of Microsoft. Microsoft makes no representations or warranties of any kind, express or implied, about the completeness, accuracy, reliability, suitability or availability with respect to the information contained in this document.
All 



---
### 📄 Benefit Options — First 500 chars from each library

**PdfPig (C# / simulator):**

Contoso Electronics Plan and Benefit Packages

This document contains information generated using a language model (Azure OpenAI). The information contained in this document is only for demonstration purposes and does not reflect the opinions or beliefs of Microsoft. Microsoft makes no representations or warranties of any kind, express or implied, about the completeness, accuracy, reliability, suitability or availability with respect to the information contained in this document. All rights re



**PDFBox (Java):**

Contoso Electronics 
Plan and Benefit Packages
This document contains information generated using a language model (Azure OpenAI). The information 
contained in this document is only for demonstration purposes and does not reflect the opinions or 
beliefs of Microsoft. Microsoft makes no representations or warranties of any kind, express or implied, 
about the completeness, accuracy, reliability, suitability or availability with respect to the information 
contained in this document. 
All



**Tika (Docker / Azure Search engine):**




















































Contoso Electronics 
Plan and Benefit Packages



This document contains information generated using a language model (Azure OpenAI). The information 
contained in this document is only for demonstration purposes and does not reflect the opinions or 
beliefs of Microsoft. Microsoft makes no representations or warranties of any kind, express or implied, 
about the completeness, accuracy, reliability, suitability or availability with respect to th



**PyMuPDF (Python):**

Contoso Electronics 
Plan and Benefit Packages


This document contains information generated using a language model (Azure OpenAI). The information 
contained in this document is only for demonstration purposes and does not reflect the opinions or 
beliefs of Microsoft. Microsoft makes no representations or warranties of any kind, express or implied, 
about the completeness, accuracy, reliability, suitability or availability with respect to the information 
contained in this document. 
All righ



**Document Intelligence (Azure):**

Contoso Electronics
Plan and Benefit Packages
Contoso Electronics
This document contains information generated using a language model (Azure OpenAI). The information contained in this document is only for demonstration purposes and does not reflect the opinions or beliefs of Microsoft. Microsoft makes no representations or warranties of any kind, express or implied, about the completeness, accuracy, reliability, suitability or availability with respect to the information contained in this docume



---
### 📄 Perks Plus Program — First 500 chars from each library

**PdfPig (C# / simulator):**

PerksPlus Health and Wellness Reimbursement Program for Contoso Electronics Employees        

This document contains information generated using a language model (Azure OpenAI). The information contained in this document is only for demonstration purposes and does not reflect the opinions or beliefs of Microsoft. Microsoft makes no representations or warranties of any kind, express or implied, about the completeness, accuracy, reliability, suitability or availability with respect to the infor



**PDFBox (Java):**

 
 
 
PerksPlus Health and Wellness 
Reimbursement Program for 
Contoso Electronics Employees 
 
 
 
 
 
  
This document contains information generated using a language model (Azure OpenAI). The information 
contained in this document is only for demonstration purposes and does not reflect the opinions or 
beliefs of Microsoft. Microsoft makes no representations or warranties of any kind, express or implied, 
about the completeness, accuracy, reliability, suitability or availabil



**Tika (Docker / Azure Search engine):**





















































 

 

 

PerksPlus Health and Wellness 

Reimbursement Program for 

Contoso Electronics Employees 
 

 
 

 

 

  



This document contains information generated using a language model (Azure OpenAI). The information 

contained in this document is only for demonstration purposes and does not reflect the opinions or 

beliefs of Microsoft. Microsoft makes no representations or warranties of any kind, express or implied, 

about the completen



**PyMuPDF (Python):**

 
 
 
PerksPlus Health and Wellness 
Reimbursement Program for 
Contoso Electronics Employees 
 
 
 
 
 
 
 


This document contains information generated using a language model (Azure OpenAI). The information 
contained in this document is only for demonstration purposes and does not reflect the opinions or 
beliefs of Microsoft. Microsoft makes no representations or warranties of any kind, express or implied, 
about the completeness, accuracy, reliability, suitability or availability with res



**Document Intelligence (Azure):**

PerksPlus Health and Wellness Reimbursement Program for Contoso Electronics Employees
Contoso Electronics
This document contains information generated using a language model (Azure OpenAI). The information contained in this document is only for demonstration purposes and does not reflect the opinions or beliefs of Microsoft. Microsoft makes no representations or warranties of any kind, express or implied, about the completeness, accuracy, reliability, suitability or availability with respect to 



---
### 📄 0000950170 25 061046 — First 500 chars from each library

**PdfPig (C# / simulator):**

UNITED STATESSECURITIES AND EXCHANGE COMMISSIONWashington, D.C. 20549  FORM 10-Q  ☒QUARTERLY REPORT PURSUANT TO SECTION 13 OR 15(d) OF THE SECURITIES EXCHANGE ACT OF 1934   For the Quarterly Period Ended March 31, 2025  OR  ☐ TRANSITION REPORT PURSUANT TO SECTION 13 OR 15(d) OF THE SECURITIES EXCHANGE ACT OF 1934   For the Transition Period From                  toCommission File Number 001-37845  MICROSOFT CORPORATION  WASHINGTON 91-1144442(STATE OF INCORPORATION) (I.R.S. ID) ONE MICROSOFT WAY,



**PDFBox (Java):**

 
 
 
 
 
UNITED STATES
SECURITIES AND EXCHANGE COMMISSION
Washington, D.C. 20549
 
 
FORM 10-Q
 
 
☒ QUARTERLY REPORT PURSUANT TO SECTION 13 OR 15(d) OF THE SECURITIES EXCHANGE ACT OF 1934
   
  For the Quarterly Period Ended March 31, 2025
   
OR
   
☐ TRANSITION REPORT PURSUANT TO SECTION 13 OR 15(d) OF THE SECURITIES EXCHANGE ACT OF 1934
   
  For the Transition Period From                  to
Commission File Number 001-37845
 
 
MICROSOFT CORPORATION
 
 
WASHINGT



**Tika (Docker / Azure Search engine):**









































Form 10-Q for Microsoft Corp filed 04/30/2025


 

 

 
 

 

UNITED STATES
SECURITIES AND EXCHANGE COMMISSION

Washington, D.C. 20549
 

 

FORM 10-Q
 

 

☒ QUARTERLY REPORT PURSUANT TO SECTION 13 OR 15(d) OF THE SECURITIES EXCHANGE ACT OF 1934
   

  For the Quarterly Period Ended March 31, 2025
   

OR
   

☐ TRANSITION REPORT PURSUANT TO SECTION 13 OR 15(d) OF THE SECURITIES EXCHANGE ACT OF 1934
   

  For the Transition Period From                  t



**PyMuPDF (Python):**

 
 
  
 
UNITED STATES
SECURITIES AND EXCHANGE COMMISSION
Washington, D.C. 20549
 
 
FORM 10-Q
 
 
☒
QUARTERLY REPORT PURSUANT TO SECTION 13 OR 15(d) OF THE SECURITIES EXCHANGE ACT OF 1934
 
 
 
For the Quarterly Period Ended March 31, 2025
 
 
OR
 
 
☐ 
TRANSITION REPORT PURSUANT TO SECTION 13 OR 15(d) OF THE SECURITIES EXCHANGE ACT OF 1934
 
 
 
For the Transition Period From                  to
Commission File Number 001-37845
 
 
MICROSOFT CORPORATION
 
 
WASHINGTON
 
91-1144442
(STATE OF IN



**Document Intelligence (Azure):**

UNITED STATES SECURITIES AND EXCHANGE COMMISSION Washington, D.C. 20549 FORM 10-Q :selected: QUARTERLY REPORT PURSUANT TO SECTION 13 OR 15(d) OF THE SECURITIES EXCHANGE ACT OF 1934 For the Quarterly Period Ended March 31, 2025 OR :unselected: TRANSITION REPORT PURSUANT TO SECTION 13 OR 15(d) OF THE SECURITIES EXCHANGE ACT OF 1934 For the Transition Period From to
Commission File Number 001-37845
MICROSOFT CORPORATION
WASHINGTON (STATE OF INCORPORATION) ONE MICROSOFT WAY, REDMOND, WASHINGTON 9805



---
### 📄 0000950170 25 100235 — First 500 chars from each library

**PdfPig (C# / simulator):**

UNITED STATESSECURITIES AND EXCHANGE COMMISSIONWashington, D.C. 20549  FORM 10-K  ☒ANNUAL REPORT PURSUANT TO SECTION 13 OR 15(d) OF THE SECURITIES EXCHANGE ACT OF 1934   For the Fiscal Year Ended June 30, 2025   OR  ☐TRANSITION REPORT PURSUANT TO SECTION 13 OR 15(d) OF THE SECURITIES EXCHANGE ACT OF 1934   For the Transition Period From                  toCommission File Number 001-37845   MICROSOFT CORPORATION  WASHINGTON 91-1144442(STATE OF INCORPORATION) (I.R.S. ID)ONE MICROSOFT WAY, REDMOND,



**PDFBox (Java):**

 
 
 
 
UNITED STATES
SECURITIES AND EXCHANGE COMMISSION
Washington, D.C. 20549
 
 
FORM 10-K
 
 
☒ ANNUAL REPORT PURSUANT TO SECTION 13 OR 15(d) OF THE SECURITIES EXCHANGE ACT OF 1934 
 
  For the Fiscal Year Ended June 30, 2025
   
  OR
   
☐ TRANSITION REPORT PURSUANT TO SECTION 13 OR 15(d) OF THE SECURITIES EXCHANGE ACT OF 1934
   
  For the Transition Period From                  to
Commission File Number 001-37845 
 
 
MICROSOFT CORPORATION
 
 
WASHINGTON   91-11



**Tika (Docker / Azure Search engine):**









































Form 10-K for Microsoft Corp filed 07/30/2025


 

 

 

 

UNITED STATES
SECURITIES AND EXCHANGE COMMISSION

Washington, D.C. 20549
 

 

FORM 10-K
 

 

☒ ANNUAL REPORT PURSUANT TO SECTION 13 OR 15(d) OF THE SECURITIES EXCHANGE ACT OF 1934 
 

  For the Fiscal Year Ended June 30, 2025
   

  OR
   

☐ TRANSITION REPORT PURSUANT TO SECTION 13 OR 15(d) OF THE SECURITIES EXCHANGE ACT OF 1934
   

  For the Transition Period From                  to
Commissi



**PyMuPDF (Python):**

 
 
 
 
UNITED STATES
SECURITIES AND EXCHANGE COMMISSION
Washington, D.C. 20549
 
 
FORM 10-K
 
 
☒
ANNUAL REPORT PURSUANT TO SECTION 13 OR 15(d) OF THE SECURITIES EXCHANGE ACT OF 1934 
 
 
For the Fiscal Year Ended June 30, 2025
 
 
 
OR
 
 
☐
TRANSITION REPORT PURSUANT TO SECTION 13 OR 15(d) OF THE SECURITIES EXCHANGE ACT OF 1934
 
 
 
For the Transition Period From                  to
Commission File Number 001-37845 
 
 
MICROSOFT CORPORATION
 
 
WASHINGTON
 
91-1144442
(STATE OF INCORPORATI



**Document Intelligence (Azure):**

UNITED STATES SECURITIES AND EXCHANGE COMMISSION Washington, D.C. 20549 FORM 10-K :selected: ANNUAL REPORT PURSUANT TO SECTION 13 OR 15(d) OF THE SECURITIES EXCHANGE ACT OF 1934 For the Fiscal Year Ended June 30, 2025 OR :unselected: TRANSITION REPORT PURSUANT TO SECTION 13 OR 15(d) OF THE SECURITIES EXCHANGE ACT OF 1934 For the Transition Period From to
Commission File Number 001-37845
MICROSOFT CORPORATION
WASHINGTON
91-1144442
(STATE OF INCORPORATION) ONE MICROSOFT WAY, REDMOND, WASHINGTON 98



---
### 📄 0001193125 25 256321 — First 500 chars from each library

**PdfPig (C# / simulator):**

UNITED STATESSECURITIES AND EXCHANGE COMMISSIONWashington, D.C. 20549  FORM 10-Q  ☒QUARTERLY REPORT PURSUANT TO SECTION 13 OR 15(d) OF THE SECURITIES EXCHANGE ACT OF 1934   For the Quarterly Period Ended September 30, 2025  OR  ☐ TRANSITION REPORT PURSUANT TO SECTION 13 OR 15(d) OF THE SECURITIES EXCHANGE ACT OF 1934   For the Transition Period From                  toCommission File Number 001-37845  MICROSOFT CORPORATION  WASHINGTON 91-1144442(STATE OF INCORPORATION) (I.R.S. ID) ONE MICROSOFT 



**PDFBox (Java):**

 
 
 
 
 
UNITED STATES
SECURITIES AND EXCHANGE COMMISSION
Washington, D.C. 20549
 
 
FORM 10-Q
 
 
☒ QUARTERLY REPORT PURSUANT TO SECTION 13 OR 15(d) OF THE SECURITIES EXCHANGE ACT OF 1934
   
  For the Quarterly Period Ended September 30, 2025
   
OR
   
☐ TRANSITION REPORT PURSUANT TO SECTION 13 OR 15(d) OF THE SECURITIES EXCHANGE ACT OF 1934
   
  For the Transition Period From                  to
Commission File Number 001-37845
 
 
MICROSOFT CORPORATION
 
 
WASH



**Tika (Docker / Azure Search engine):**









































Form 10-Q for Microsoft Corp filed 10/29/2025


 

 

 
 

 

UNITED STATES
SECURITIES AND EXCHANGE COMMISSION

Washington, D.C. 20549
 

 

FORM 10-Q
 

 

☒ QUARTERLY REPORT PURSUANT TO SECTION 13 OR 15(d) OF THE SECURITIES EXCHANGE ACT OF 1934
   

  For the Quarterly Period Ended September 30, 2025
   

OR
   

☐ TRANSITION REPORT PURSUANT TO SECTION 13 OR 15(d) OF THE SECURITIES EXCHANGE ACT OF 1934
   

  For the Transition Period From               



**PyMuPDF (Python):**

 
 
  
 
UNITED STATES
SECURITIES AND EXCHANGE COMMISSION
Washington, D.C. 20549
 
 
FORM 10-Q
 
 
☒
QUARTERLY REPORT PURSUANT TO SECTION 13 OR 15(d) OF THE SECURITIES EXCHANGE ACT OF 1934
 
 
 
For the Quarterly Period Ended September 30, 2025
 
 
OR
 
 
☐ 
TRANSITION REPORT PURSUANT TO SECTION 13 OR 15(d) OF THE SECURITIES EXCHANGE ACT OF 1934
 
 
 
For the Transition Period From                  to
Commission File Number 001-37845
 
 
MICROSOFT CORPORATION
 
 
WASHINGTON
 
91-1144442
(STATE O



**Document Intelligence (Azure):**

UNITED STATES SECURITIES AND EXCHANGE COMMISSION Washington, D.C. 20549
FORM 10-Q :selected: QUARTERLY REPORT PURSUANT TO SECTION 13 OR 15(d) OF THE SECURITIES EXCHANGE ACT OF 1934 For the Quarterly Period Ended September 30, 2025 OR :unselected: TRANSITION REPORT PURSUANT TO SECTION 13 OR 15(d) OF THE SECURITIES EXCHANGE ACT OF 1934 For the Transition Period From to
Commission File Number 001-37845
MICROSOFT CORPORATION
WASHINGTON (STATE OF INCORPORATION) ONE MICROSOFT WAY, REDMOND, WASHINGTON 



---
### 📄 0001193125 26 027207 — First 500 chars from each library

**PdfPig (C# / simulator):**

UNITED STATESSECURITIES AND EXCHANGE COMMISSIONWashington, D.C. 20549  FORM 10-Q  ☒QUARTERLY REPORT PURSUANT TO SECTION 13 OR 15(d) OF THE SECURITIES EXCHANGE ACT OF 1934   For the Quarterly Period Ended December 31, 2025  OR  ☐ TRANSITION REPORT PURSUANT TO SECTION 13 OR 15(d) OF THE SECURITIES EXCHANGE ACT OF 1934   For the Transition Period From                  toCommission File Number 001-37845  MICROSOFT CORPORATION  WASHINGTON 91-1144442(STATE OF INCORPORATION) (I.R.S. ID) ONE MICROSOFT W



**PDFBox (Java):**

 
 
 
 
 
UNITED STATES
SECURITIES AND EXCHANGE COMMISSION
Washington, D.C. 20549
 
 
FORM 10-Q
 
 
☒ QUARTERLY REPORT PURSUANT TO SECTION 13 OR 15(d) OF THE SECURITIES EXCHANGE ACT OF 1934
   
  For the Quarterly Period Ended December 31, 2025
   
OR
   
☐ TRANSITION REPORT PURSUANT TO SECTION 13 OR 15(d) OF THE SECURITIES EXCHANGE ACT OF 1934
   
  For the Transition Period From                  to
Commission File Number 001-37845
 
 
MICROSOFT CORPORATION
 
 
WASHI



**Tika (Docker / Azure Search engine):**









































Form 10-Q for Microsoft Corp filed 01/28/2026


 

 

 
 

 

UNITED STATES
SECURITIES AND EXCHANGE COMMISSION

Washington, D.C. 20549
 

 

FORM 10-Q
 

 

☒ QUARTERLY REPORT PURSUANT TO SECTION 13 OR 15(d) OF THE SECURITIES EXCHANGE ACT OF 1934
   

  For the Quarterly Period Ended December 31, 2025
   

OR
   

☐ TRANSITION REPORT PURSUANT TO SECTION 13 OR 15(d) OF THE SECURITIES EXCHANGE ACT OF 1934
   

  For the Transition Period From                



**PyMuPDF (Python):**

 
 
  
 
UNITED STATES
SECURITIES AND EXCHANGE COMMISSION
Washington, D.C. 20549
 
 
FORM 10-Q
 
 
☒
QUARTERLY REPORT PURSUANT TO SECTION 13 OR 15(d) OF THE SECURITIES EXCHANGE ACT OF 1934
 
 
 
For the Quarterly Period Ended December 31, 2025
 
 
OR
 
 
☐ 
TRANSITION REPORT PURSUANT TO SECTION 13 OR 15(d) OF THE SECURITIES EXCHANGE ACT OF 1934
 
 
 
For the Transition Period From                  to
Commission File Number 001-37845
 
 
MICROSOFT CORPORATION
 
 
WASHINGTON
 
91-1144442
(STATE OF



**Document Intelligence (Azure):**

UNITED STATES SECURITIES AND EXCHANGE COMMISSION Washington, D.C. 20549
FORM 10-Q :selected: QUARTERLY REPORT PURSUANT TO SECTION 13 OR 15(d) OF THE SECURITIES EXCHANGE ACT OF 1934 For the Quarterly Period Ended December 31, 2025 OR :unselected: TRANSITION REPORT PURSUANT TO SECTION 13 OR 15(d) OF THE SECURITIES EXCHANGE ACT OF 1934 For the Transition Period From to
Commission File Number 001-37845
MICROSOFT CORPORATION
WASHINGTON (STATE OF INCORPORATION) ONE MICROSOFT WAY, REDMOND, WASHINGTON 9



## 10. Findings — Five-Way Comparison (excluding pdfplumber)

> **pdfplumber** is excluded: it timed out on the large PDFs in this run (it is pure Python and slows down sharply on complex layouts), so it is not treated as viable for production use here.

We compare the five libraries that successfully processed all 7 documents:

| Library | Engine | Language | Integration | Cost |
|---|---|---|---|---|
| **PdfPig** | PdfPig 0.1.13 | C# | CLI (DocumentCrackingTool) | Free |
| **PDFBox** | Apache PDFBox 3.0.4 | Java | JPype (in-process JVM) | Free |
| **Tika** | Apache Tika 3.2.3 (wraps PDFBox) | Java | Docker REST API | Free |
| **PyMuPDF** | MuPDF | C (Python bindings) | Direct import | Free |
| **Document Intelligence** | Azure AI (Layout model) | Cloud API | Python SDK | $0.01/page |

In [39]:
# ── Five-way comparison analysis (PdfPig, PDFBox, Tika, PyMuPDF, Doc Intelligence) ──
# pdfplumber excluded — doesn't scale to large documents

import numpy as np

# Build a clean comparison DataFrame
rows = []
for doc_id in downloaded:
    title = SAMPLE_PDFS[doc_id]["title"]
    file_size = downloaded[doc_id].stat().st_size

    # PdfPig
    pig_result = results_pdfpig.get(doc_id, {})
    pig_crackers = [c for c in pig_result.get("crackers", []) if c.get("success")]
    pig = pig_crackers[0] if pig_crackers else None

    bx = results_pdfbox.get(doc_id)
    tk = results_tika.get(doc_id)
    mu = results_pymupdf.get(doc_id)
    di = results_docint.get(doc_id)

    row = {
        "Document": title,
        "File Size (KB)": round(file_size / 1024, 1),
        "Pages": (pig.get("pageCount") if pig else None) or (bx["page_count"] if bx else None),
    }

    for label, res, timing_dict in [
        ("PdfPig", pig, timings_pdfpig),
        ("PDFBox", bx, timings_pdfbox),
        ("Tika", tk, timings_tika),
        ("PyMuPDF", mu, timings_pymupdf),
        ("DocInt", di, timings_docint),
    ]:
        if label == "PdfPig" and pig:
            words = pig.get("wordCount", 0)
            chars = pig.get("characterCount", 0)
            ms = pig.get("extractionTimeMs", 0)
        elif res and label != "PdfPig":
            words = res["total_words"]
            chars = res["total_chars"]
            ms = timing_dict.get(doc_id, 0) * 1000
        else:
            words = chars = ms = None

        row[f"{label} Words"] = words
        row[f"{label} Chars"] = chars
        row[f"{label} ms"] = round(ms, 1) if ms is not None else None

    rows.append(row)

df = pd.DataFrame(rows)

# ── Word count deviation from PDFBox (reference) ─────────────────────────────
display(Markdown("### 10a. Word Count Deviation from PDFBox (reference)"))
display(Markdown("PDFBox is the baseline since Tika wraps it and Azure AI Search uses Tika internally."))

dev_rows = []
for _, r in df.iterrows():
    ref = r["PDFBox Words"]
    # pandas stores missing numbers as NaN, so test with pd.isna rather than `is None`
    if pd.isna(ref) or ref == 0:
        continue
    row = {
        "Document": r["Document"],
        "PDFBox Words": f"{int(ref):,}",
    }
    for label in ["PdfPig", "Tika", "PyMuPDF", "DocInt"]:
        val = r.get(f"{label} Words")
        if pd.notna(val):
            row[label] = f"{int(val):,}  ({(val - ref) / ref * 100:+.1f}%)"
        else:
            row[label] = "—"
    dev_rows.append(row)

display(pd.DataFrame(dev_rows))

# ── Character count deviation from PDFBox ─────────────────────────────────────
display(Markdown("### 10b. Character Count Deviation from PDFBox (reference)"))

cdev_rows = []
for _, r in df.iterrows():
    ref = r["PDFBox Chars"]
    # pandas stores missing numbers as NaN, so test with pd.isna rather than `is None`
    if pd.isna(ref) or ref == 0:
        continue
    row = {
        "Document": r["Document"],
        "PDFBox Chars": f"{int(ref):,}",
    }
    for label in ["PdfPig", "Tika", "PyMuPDF", "DocInt"]:
        val = r.get(f"{label} Chars")
        if pd.notna(val):
            row[label] = f"{int(val):,}  ({(val - ref) / ref * 100:+.1f}%)"
        else:
            row[label] = "—"
    cdev_rows.append(row)

display(pd.DataFrame(cdev_rows))

# ── Speed comparison ──────────────────────────────────────────────────────────
display(Markdown("### 10c. Extraction Speed (ms)"))
display(Markdown("PdfPig timing = C# `Stopwatch` (pure extraction, no CLI overhead). Doc Intelligence = cloud round-trip. Others = Python wall-clock."))

speed_rows = []
for _, r in df.iterrows():
    speed_rows.append({
        "Document": r["Document"],
        "Pages": r["Pages"],
        "PdfPig ms": r["PdfPig ms"],
        "PDFBox ms": r["PDFBox ms"],
        "Tika ms": r["Tika ms"],
        "PyMuPDF ms": r["PyMuPDF ms"],
        "DocInt ms": r.get("DocInt ms"),
    })


df_speed = pd.DataFrame(speed_rows)

display(df_speed)

# Averages
display(Markdown("**Average extraction time (ms):**"))

for lib in ["PdfPig", "PDFBox", "Tika", "PyMuPDF", "DocInt"]:
    col = f"{lib} ms"
    vals = df_speed[col].dropna()
    if len(vals):
        print(f"  {lib:12s}  avg {vals.mean():,.0f} ms   min {vals.min():,.0f} ms   max {vals.max():,.0f} ms")

# Cost estimate for Document Intelligence
if results_docint:
    total_pages = sum(r.get("page_count", 0) for r in results_docint.values())
    display(Markdown(f"\n**Document Intelligence cost for this run:** {total_pages} pages × $0.01 = **${total_pages * 0.01:.2f}** (Layout) / **${total_pages * 0.0015:.2f}** (Read)"))

### 10a. Word Count Deviation from PDFBox (reference)

PDFBox is the baseline since Tika wraps it and Azure AI Search uses Tika internally.

Unnamed: 0,Document,PDFBox Words,PdfPig,Tika,PyMuPDF,DocInt
0,Employee Handbook,2370,"2,367 (-0.1%)","2,370 (+0.0%)","2,370 (+0.0%)","2,372 (+0.1%)"
1,Benefit Options,614,609 (-0.8%),614 (+0.0%),507 (-17.4%),639 (+4.1%)
2,Perks Plus Program,432,432 (+0.0%),432 (+0.0%),432 (+0.0%),434 (+0.5%)
3,0000950170 25 061046,32946,"31,681 (-3.8%)","32,953 (+0.0%)","32,946 (+0.0%)","32,958 (+0.0%)"
4,0000950170 25 100235,72862,"71,031 (-2.5%)","72,914 (+0.1%)","72,865 (+0.0%)","72,922 (+0.1%)"
5,0001193125 25 256321,29166,"28,095 (-3.7%)","29,173 (+0.0%)","29,166 (+0.0%)","29,179 (+0.0%)"
6,0001193125 26 027207,31837,"30,516 (-4.1%)","31,844 (+0.0%)","31,837 (+0.0%)","31,845 (+0.0%)"


### 10b. Character Count Deviation from PDFBox (reference)

Unnamed: 0,Document,PDFBox Chars,PdfPig,Tika,PyMuPDF,DocInt
0,Employee Handbook,16454,"15,777 (-4.1%)","16,514 (+0.4%)","16,118 (-2.0%)","15,656 (-4.8%)"
1,Benefit Options,4386,"4,289 (-2.2%)","4,407 (+0.5%)","3,677 (-16.2%)","4,461 (+1.7%)"
2,Perks Plus Program,2940,"2,831 (-3.7%)","2,994 (+1.8%)","2,907 (-1.1%)","2,829 (-3.8%)"
3,0000950170 25 061046,244766,"227,261 (-7.2%)","242,549 (-0.9%)","242,269 (-1.0%)","217,315 (-11.2%)"
4,0000950170 25 100235,515075,"491,352 (-4.6%)","514,662 (-0.1%)","510,412 (-0.9%)","480,749 (-6.7%)"
5,0001193125 25 256321,215357,"201,591 (-6.4%)","213,273 (-1.0%)","213,507 (-0.9%)","194,408 (-9.7%)"
6,0001193125 26 027207,238841,"220,770 (-7.6%)","236,646 (-0.9%)","236,416 (-1.0%)","210,763 (-11.8%)"
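
The percentages in tables 10a and 10b are signed relative deviations from the PDFBox count. A minimal standalone sketch of that calculation follows; the helper names `pct_deviation` and `format_deviation` are illustrative and do not appear in the notebook, and the `"n/a"` placeholder is an assumption made for this sketch.

```python
from typing import Optional


def pct_deviation(value: Optional[float], reference: Optional[float]) -> Optional[float]:
    """Signed deviation of `value` from `reference`, in percent."""
    if value is None or reference is None or reference == 0:
        return None
    return (value - reference) / reference * 100.0


def format_deviation(value: Optional[int], reference: Optional[int]) -> str:
    """Render a count with its deviation, e.g. '507  (-17.4%)'."""
    dev = pct_deviation(value, reference)
    if dev is None:
        return "n/a"  # placeholder chosen for this sketch
    return f"{value:,}  ({dev:+.1f}%)"


# PyMuPDF vs PDFBox word counts on Benefit Options (values from table 10a)
print(format_deviation(507, 614))
# Tika vs PDFBox word counts on the 10-K (values from table 10a)
print(format_deviation(72914, 72862))
```

Because the sign is kept, a negative value always means the library produced less text than PDFBox, which makes the PdfPig and PyMuPDF gaps easy to spot.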


### 10c. Extraction Speed (ms)

PdfPig timing = C# `Stopwatch` (pure extraction, no CLI overhead). Doc Intelligence = cloud round-trip. Others = Python wall-clock.

Unnamed: 0,Document,Pages,PdfPig ms,PDFBox ms,Tika ms,PyMuPDF ms,DocInt ms
0,Employee Handbook,11,2288.4,767.2,41.1,422.2,8155.6
1,Benefit Options,4,570.8,170.4,57.1,805.0,5606.7
2,Perks Plus Program,4,622.2,428.7,35.2,267.3,5325.8
3,0000950170 25 061046,72,2124.6,2021.1,2809.1,2512.9,13575.2
4,0000950170 25 100235,158,2349.3,1726.3,3975.2,2897.5,21682.6
5,0001193125 25 256321,67,2160.9,772.4,1423.5,1901.8,12251.0
6,0001193125 26 027207,71,2267.5,1004.3,3174.4,2660.3,13247.0


**Average extraction time (ms):**

  PdfPig        avg 1,769 ms   min 571 ms   max 2,349 ms
  PDFBox        avg 984 ms   min 170 ms   max 2,021 ms
  Tika          avg 1,645 ms   min 35 ms   max 3,975 ms
  PyMuPDF       avg 1,638 ms   min 267 ms   max 2,898 ms
  DocInt        avg 11,406 ms   min 5,326 ms   max 21,683 ms



**Document Intelligence cost for this run:** 387 pages × $0.01 = **$3.87** (Layout) / **$0.58** (Read)
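
The cost line above is page count times the per-page rate. A small sketch of the same arithmetic, using `Decimal` to avoid floating-point rounding surprises; the function name `estimate_cost` is hypothetical, and the rates are the ones quoted in this notebook, not values fetched from a pricing API.

```python
from decimal import Decimal, ROUND_HALF_UP

# Per-page rates as quoted in this notebook (assumed fixed for the sketch)
LAYOUT_RATE = Decimal("0.01")
READ_RATE = Decimal("0.0015")


def estimate_cost(pages: int, rate_per_page: Decimal) -> Decimal:
    """Return pages x rate, rounded to whole cents."""
    total = Decimal(pages) * rate_per_page
    return total.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)


pages = 387  # total pages processed in this run
print(f"Layout: ${estimate_cost(pages, LAYOUT_RATE)}")
print(f"Read:   ${estimate_cost(pages, READ_RATE)}")
```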

In [40]:
# ── 10d. Qualitative findings ─────────────────────────────────────────────────
display(Markdown("""### 10d. Qualitative Findings

#### Text Completeness

| Library | Observation |
|---|---|
| **PDFBox** | **Reference baseline.** Most complete text output. Tika wraps this engine internally. |
| **Tika** | Word counts match PDFBox almost exactly (+0.0% to +0.1%). Character counts run 0.4–1.8% higher on the small Contoso documents, partly from `\\u00A0` (non-breaking space) characters, but 0.1–1.0% lower on the SEC filings. The `X-TIKA:content` text also includes leading whitespace/newlines from page structure. |
| **Document Intelligence** | Cloud AI service. Uses OCR pipeline — may produce different word counts than text-layer extraction. Excels at reading order and table detection. |
| **PyMuPDF** | Matches PDFBox on SEC filings. **Drops ~17% of content on Benefit_Options.pdf** (507 vs 614 words) — likely layout-dependent extraction failure. |
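| **PdfPig** | **2.5–4.1% fewer words** than PDFBox on SEC filings (e.g. 71,031 vs 72,862 on the 10-K). Output has the least extra whitespace, but adjacent lines are sometimes joined without a space (`UNITED STATESSECURITIES` in the samples above), which lowers the word count and may explain part of the gap. Some content may also be lost. |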

#### Text Quality

| Library | Whitespace | Special chars | Layout handling |
|---|---|---|---|
| **PDFBox** | Moderate — preserves page breaks and newlines | Clean | Good — respects column/table layout |
| **Tika** | Verbose — adds many leading newlines, `\\u00A0` chars appear as `Â` in some renderings | `\\u00A0` noise on SEC filings | Same as PDFBox (wraps it), plus form-title extraction |
| **Document Intelligence** | Clean — structured by paragraphs and sections | Clean | **Best** — reading order, table cells, selection marks |
| **PyMuPDF** | Light — similar to PDFBox | Clean | Good except Benefit_Options |
| **PdfPig** | Minimal — strips most whitespace, joins lines aggressively | Clean | Aggressive joining loses structure |

#### Metadata

| Library | Richness | Notes |
|---|---|---|
| **Tika** | ⭐⭐⭐⭐⭐ | Richest — Dublin Core (`dc:title`, `dc:creator`), XMP, PDF-specific fields, language detection, PDF version, content type |
| **Document Intelligence** | ⭐⭐⭐⭐ | Paragraphs, tables, selection marks, bounding boxes. Does NOT read PDF info dict (no author/title metadata). |
| **PDFBox** | ⭐⭐⭐ | Standard PDF info dict (title, author, creator, producer, dates) |
| **PyMuPDF** | ⭐⭐⭐ | Same fields as PDFBox, plus format and encryption info |
| **PdfPig** | ⭐⭐ | Basic fields. Missing some metadata on SEC filings where others succeed. |

#### Speed

| Scenario | Fastest | Slowest | Notes |
|---|---|---|---|
| **Small files (< 1 MB)** | **Tika** (35–57 ms) | **Doc Intelligence** (5.3–8.2 sec) | Warm JVM + REST is fast. DI has cloud latency. |
| **Large files (> 1 MB)** | **PDFBox** (772–2,021 ms) | **Doc Intelligence** (12.3–21.7 sec) | In-process JVM wins. DI = cloud round-trip. |
| **PdfPig** | 2,125–2,349 ms on the SEC filings and the Employee Handbook; 571–622 ms on the other two small files | n/a | Measured with the C# `Stopwatch`, so CLI process spawn is not included in these figures. |

#### Cost

| Library | Cost | Deployment |
|---|---|---|
| PdfPig, PDFBox, Tika, PyMuPDF | **Free** | Self-hosted (local or Docker) |
| **Document Intelligence** | **$0.01/page** (Layout) or **$0.0015/page** (Read) | Azure cloud, requires subscription |
| DI commitment tier | $0.008–0.0095/page | 20K+ pages/month |
| DI free tier | $0 | 500 pages/month |

#### Recommendation for the Simulator

1. **Use Tika as the reference** — it's what Azure AI Search actually uses. The `/tika/text` endpoint with `Accept: application/json` returns both text and metadata in a single call.
2. **Document Intelligence** is the gold standard for complex layouts (tables, forms, reading order) but costs money and was roughly 6–12× slower on average in this run. Useful as a validation baseline, not for bulk simulation.
3. **PdfPig (current simulator engine)** produces slightly less text (2.5–4.1% fewer words on the SEC filings) with more aggressive whitespace stripping. Consider post-processing to normalize whitespace if fidelity to Azure Search output is important.
4. **PyMuPDF** is a good fast alternative but has a reliability gap on certain layouts (Benefit_Options).
5. **PDFBox direct** (via JPype) gives the closest match to Tika without the Docker dependency.
"""))

### 10d. Qualitative Findings

#### Text Completeness

| Library | Observation |
|---|---|
| **PDFBox** | **Reference baseline.** Most complete text output. Tika wraps this engine internally. |
| **Tika** | Word counts match PDFBox almost exactly (+0.0% to +0.1%). Character counts run 0.4–1.8% higher on the small Contoso documents, partly from `\u00A0` (non-breaking space) characters, but 0.1–1.0% lower on the SEC filings. The `X-TIKA:content` text also includes leading whitespace/newlines from page structure. |
| **Document Intelligence** | Cloud AI service. Uses OCR pipeline — may produce different word counts than text-layer extraction. Excels at reading order and table detection. |
| **PyMuPDF** | Matches PDFBox on SEC filings. **Drops ~17% of content on Benefit_Options.pdf** (507 vs 614 words) — likely layout-dependent extraction failure. |
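| **PdfPig** | **2.5–4.1% fewer words** than PDFBox on SEC filings (e.g. 71,031 vs 72,862 on the 10-K). Output has the least extra whitespace, but adjacent lines are sometimes joined without a space (`UNITED STATESSECURITIES` in the samples above), which lowers the word count and may explain part of the gap. Some content may also be lost. |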
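
The PdfPig samples earlier in this notebook show adjacent lines joined without a separator (`UNITED STATESSECURITIES`, `COMMISSIONWashington`). The sketch below illustrates how that alone reduces a whitespace-based word count. It assumes words are counted with `str.split()`, which may differ from how each library's counter in this notebook works.

```python
def count_words(text: str) -> int:
    """Count whitespace-separated tokens (an assumed counting rule)."""
    return len(text.split())


# The same three lines, as PDFBox-style output (one line each) ...
line_separated = (
    "UNITED STATES\n"
    "SECURITIES AND EXCHANGE COMMISSION\n"
    "Washington, D.C. 20549"
)
# ... and with line breaks dropped and nothing inserted in their place
joined = line_separated.replace("\n", "")

print(count_words(line_separated))  # 9 tokens
print(count_words(joined))          # 7 tokens: two pairs of words were fused
```

No characters other than the line breaks are missing from `joined`, yet the count drops by two, so a lower word count is not by itself proof that text was lost.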

#### Text Quality

| Library | Whitespace | Special chars | Layout handling |
|---|---|---|---|
| **PDFBox** | Moderate — preserves page breaks and newlines | Clean | Good — respects column/table layout |
| **Tika** | Verbose — adds many leading newlines, `\u00A0` chars appear as `Â` in some renderings | `\u00A0` noise on SEC filings | Same as PDFBox (wraps it), plus form-title extraction |
| **Document Intelligence** | Clean — structured by paragraphs and sections | Clean | **Best** — reading order, table cells, selection marks |
| **PyMuPDF** | Light — similar to PDFBox | Clean | Good except Benefit_Options |
| **PdfPig** | Minimal — strips most whitespace, joins lines aggressively | Clean | Aggressive joining loses structure |

#### Metadata

| Library | Richness | Notes |
|---|---|---|
| **Tika** | ⭐⭐⭐⭐⭐ | Richest — Dublin Core (`dc:title`, `dc:creator`), XMP, PDF-specific fields, language detection, PDF version, content type |
| **Document Intelligence** | ⭐⭐⭐⭐ | Paragraphs, tables, selection marks, bounding boxes. Does NOT read PDF info dict (no author/title metadata). |
| **PDFBox** | ⭐⭐⭐ | Standard PDF info dict (title, author, creator, producer, dates) |
| **PyMuPDF** | ⭐⭐⭐ | Same fields as PDFBox, plus format and encryption info |
| **PdfPig** | ⭐⭐ | Basic fields. Missing some metadata on SEC filings where others succeed. |

#### Speed

| Scenario | Fastest | Slowest | Notes |
|---|---|---|---|
| **Small files (< 1 MB)** | **Tika** (35–57 ms) | **Doc Intelligence** (5.3–8.2 sec) | Warm JVM + REST is fast. DI has cloud latency. |
| **Large files (> 1 MB)** | **PDFBox** (772–2,021 ms) | **Doc Intelligence** (12.3–21.7 sec) | In-process JVM wins. DI = cloud round-trip. |
| **PdfPig** | 2,125–2,349 ms on the SEC filings and the Employee Handbook; 571–622 ms on the other two small files | n/a | Measured with the C# `Stopwatch`, so CLI process spawn is not included in these figures. |

#### Cost

| Library | Cost | Deployment |
|---|---|---|
| PdfPig, PDFBox, Tika, PyMuPDF | **Free** | Self-hosted (local or Docker) |
| **Document Intelligence** | **$0.01/page** (Layout) or **$0.0015/page** (Read) | Azure cloud, requires subscription |
| DI commitment tier | $0.008–0.0095/page | 20K+ pages/month |
| DI free tier | $0 | 500 pages/month |

#### Recommendation for the Simulator

1. **Use Tika as the reference** — it's what Azure AI Search actually uses. The `/tika/text` endpoint with `Accept: application/json` returns both text and metadata in a single call.
2. **Document Intelligence** is the gold standard for complex layouts (tables, forms, reading order) but costs money and was roughly 6–12× slower on average in this run. Useful as a validation baseline, not for bulk simulation.
3. **PdfPig (current simulator engine)** produces slightly less text (2.5–4.1% fewer words on the SEC filings) with more aggressive whitespace stripping. Consider post-processing to normalize whitespace if fidelity to Azure Search output is important.
4. **PyMuPDF** is a good fast alternative but has a reliability gap on certain layouts (Benefit_Options).
5. **PDFBox direct** (via JPype) gives the closest match to Tika without the Docker dependency.
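
Recommendation 3 suggests normalizing whitespace before comparing engines. One possible normalization is sketched below, under these assumptions: non-breaking spaces are treated as ordinary spaces, runs of spaces are collapsed, and more than one consecutive blank line is reduced to one. The function name `normalize_whitespace` is hypothetical, and the rules are not taken from the simulator or from Azure AI Search.

```python
import re


def normalize_whitespace(text: str) -> str:
    """Reduce whitespace differences between extractors (illustrative rules)."""
    # Treat non-breaking spaces (U+00A0) as ordinary spaces
    text = text.replace("\u00a0", " ")
    # Collapse runs of spaces and tabs into a single space
    text = re.sub(r"[ \t]+", " ", text)
    # Drop leading/trailing whitespace on every line
    text = "\n".join(line.strip() for line in text.split("\n"))
    # Keep at most one blank line between blocks
    text = re.sub(r"\n{3,}", "\n\n", text)
    # Remove leading/trailing blank lines from the whole document
    return text.strip()


# Shaped like the Tika samples: leading newlines, NBSP, space-only lines
raw = "\n\n\n \nFORM\u00a010-Q \n \n \nOR"
print(repr(normalize_whitespace(raw)))
```

Applying the same function to every engine's output before counting characters would make the figures in 10b less sensitive to padding and more sensitive to real content differences.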


## 11. Overall Comparison — Executive Summary

A consolidated view of all five extraction solutions across **accuracy, speed, cost, features, and deployment complexity**.
The comparison is based on extraction of **7 PDF documents** (3 small Contoso docs + 4 large SEC filings) totaling **387 pages**.

In [41]:
# ══════════════════════════════════════════════════════════════════════════════
# 11. OVERALL COMPARISON — EXECUTIVE SUMMARY
# ══════════════════════════════════════════════════════════════════════════════

import numpy as np

# ── Helper: compute per-library aggregates from the existing df / df_speed ────
libs = ["PdfPig", "PDFBox", "Tika", "PyMuPDF", "DocInt"]
lib_labels = {
    "PdfPig": "PdfPig (C#)",
    "PDFBox": "PDFBox (Java/JPype)",
    "Tika": "Apache Tika (Docker)",
    "PyMuPDF": "PyMuPDF (Python)",
    "DocInt": "Document Intelligence (Azure)",
}

total_pages = int(df["Pages"].sum())

# ──────────────────────────────────────────────────────────────────────────────
# TABLE 1 — Scorecard
# ──────────────────────────────────────────────────────────────────────────────
display(Markdown("### 11a. Solution Scorecard"))
display(Markdown("Ratings: ⭐ = 1 (worst) … ⭐⭐⭐⭐⭐ = 5 (best). Higher is better."))

scorecard = [
    {
        "Solution": "PdfPig (C#)",
        "Word Accuracy": "⭐⭐⭐",
        "Char Accuracy": "⭐⭐⭐",
        "Speed": "⭐⭐⭐",
        "Cost": "⭐⭐⭐⭐⭐",
        "Metadata": "⭐⭐",
        "Layout / Tables": "⭐⭐",
        "Deployment": "⭐⭐⭐⭐⭐",
        "Notes": "2.5-4.1% fewer words on large PDFs; ~2 s extraction on large PDFs (Stopwatch timing); .NET dependency",
    },
    {
        "Solution": "PDFBox (Java/JPype)",
        "Word Accuracy": "⭐⭐⭐⭐⭐",
        "Char Accuracy": "⭐⭐⭐⭐⭐",
        "Speed": "⭐⭐⭐⭐",
        "Cost": "⭐⭐⭐⭐⭐",
        "Metadata": "⭐⭐⭐",
        "Layout / Tables": "⭐⭐⭐",
        "Deployment": "⭐⭐⭐",
        "Notes": "Reference baseline; requires JVM; fast in-process",
    },
    {
        "Solution": "Apache Tika (Docker)",
        "Word Accuracy": "⭐⭐⭐⭐⭐",
        "Char Accuracy": "⭐⭐⭐⭐",
        "Speed": "⭐⭐⭐⭐⭐",
        "Cost": "⭐⭐⭐⭐⭐",
        "Metadata": "⭐⭐⭐⭐⭐",
        "Layout / Tables": "⭐⭐⭐",
        "Deployment": "⭐⭐⭐⭐",
        "Notes": "Same engine as Azure AI Search; fastest on small files; Docker required",
    },
    {
        "Solution": "PyMuPDF (Python)",
        "Word Accuracy": "⭐⭐⭐⭐",
        "Char Accuracy": "⭐⭐⭐⭐",
        "Speed": "⭐⭐⭐⭐",
        "Cost": "⭐⭐⭐⭐⭐",
        "Metadata": "⭐⭐⭐",
        "Layout / Tables": "⭐⭐⭐",
        "Deployment": "⭐⭐⭐⭐⭐",
        "Notes": "Fast, pip install; drops 17% on Benefit_Options layout",
    },
    {
        "Solution": "Doc Intelligence (Azure)",
        "Word Accuracy": "⭐⭐⭐⭐⭐",
        "Char Accuracy": "⭐⭐⭐",
        "Speed": "⭐",
        "Cost": "⭐⭐",
        "Metadata": "⭐⭐⭐⭐",
        "Layout / Tables": "⭐⭐⭐⭐⭐",
        "Deployment": "⭐⭐⭐",
        "Notes": "Best tables/layout; cloud latency; $0.01/page (Layout)",
    },
]
df_scorecard = pd.DataFrame(scorecard)
display(df_scorecard.style.set_caption("Solution Scorecard — All Dimensions").hide(axis="index"))

# ──────────────────────────────────────────────────────────────────────────────
# TABLE 2 — Speed Deep Dive  (emphasis)
# ──────────────────────────────────────────────────────────────────────────────
display(Markdown("---"))
display(Markdown("### 11b. ⏱️ Speed Comparison — Deep Dive"))
display(Markdown(
    "Extraction time in **milliseconds**. Green = fastest per document, Red = slowest. "
    "Doc Intelligence times include full cloud round-trip (upload → analyze → poll → download)."
))

# Styled speed table with colour gradient
speed_cols = [f"{lib} ms" for lib in libs]
df_speed_styled = df_speed.copy()

# Highlight the fastest (green) and slowest (red) timing in each row
def highlight_speed(row):
    styles = [""] * len(row)
    vals_only = {c: row[c] for c in speed_cols if pd.notna(row[c])}
    if not vals_only:
        return styles
    min_col = min(vals_only, key=vals_only.get)
    max_col = max(vals_only, key=vals_only.get)
    for i, c in enumerate(row.index):
        if c == min_col:
            styles[i] = "background-color: #c6efce; font-weight: bold"
        elif c == max_col:
            styles[i] = "background-color: #ffc7ce; font-weight: bold"
    return styles

display(
    df_speed_styled.style
    .apply(highlight_speed, axis=1)
    .format({c: "{:,.0f}" for c in speed_cols}, na_rep="—")
    .set_caption("Extraction Time (ms) per Document — 🟢 Fastest  🔴 Slowest")
    .hide(axis="index")
)

# Speed summary table
display(Markdown("#### Speed Summary"))

speed_summary = []
# The size buckets do not depend on the library, so compute them once.
small_mask = df_speed["Pages"] <= 11  # small docs
large_mask = df_speed["Pages"] > 11   # SEC filings
for lib in libs:
    col = f"{lib} ms"
    vals = df_speed[col].dropna()
    if vals.empty:
        continue
    small_vals = df_speed.loc[small_mask, col].dropna()
    large_vals = df_speed.loc[large_mask, col].dropna()

    # Divide by the pages of the documents this library actually timed,
    # so a missing timing does not deflate the per-page figure.
    pages_timed = df_speed.loc[vals.index, "Pages"].sum()

    speed_summary.append({
        "Solution": lib_labels[lib],
        "Avg (all docs)": f"{vals.mean():,.0f} ms",
        "Avg (small ≤11p)": f"{small_vals.mean():,.0f} ms" if len(small_vals) else "—",
        "Avg (large >11p)": f"{large_vals.mean():,.0f} ms" if len(large_vals) else "—",
        "Min": f"{vals.min():,.0f} ms",
        "Max": f"{vals.max():,.0f} ms",
        "ms/page (avg)": f"{vals.sum() / pages_timed:.1f}",
    })

df_speed_summary = pd.DataFrame(speed_summary)
display(df_speed_summary.style.set_caption("Average Extraction Speed by Document Size").hide(axis="index"))

# Speed multiplier vs fastest
display(Markdown("#### Speed Multiplier vs Fastest (per document)"))

multiplier_rows = []
for _, r in df_speed.iterrows():
    ms_vals = {lib: r.get(f"{lib} ms") for lib in libs if pd.notna(r.get(f"{lib} ms"))}
    if not ms_vals:
        continue
    fastest = min(ms_vals.values())
    row = {"Document": r["Document"], "Pages": int(r["Pages"])}
    for lib in libs:
        v = r.get(f"{lib} ms")
        if pd.notna(v):
            row[lib_labels[lib]] = f"{v / fastest:.1f}×"
        else:
            row[lib_labels[lib]] = "—"
    multiplier_rows.append(row)

df_mult = pd.DataFrame(multiplier_rows)
display(df_mult.style.set_caption("Speed multiplier vs fastest solution per document (1.0× = fastest)").hide(axis="index"))


# ──────────────────────────────────────────────────────────────────────────────
# TABLE 3 — Cost Deep Dive  (emphasis)
# ──────────────────────────────────────────────────────────────────────────────
display(Markdown("---"))
display(Markdown("### 11c. 💰 Cost Comparison — Deep Dive"))
display(Markdown(
    "All local solutions (PdfPig, PDFBox, Tika, PyMuPDF) are **free** — zero per-page cost. "
    "Document Intelligence charges **per page** based on the model used."
))

# Per-document cost table
cost_rows = []
for _, r in df.iterrows():
    pages = r["Pages"]
    cost_rows.append({
        "Document": r["Document"],
        "Pages": int(pages),
        "PdfPig": "$0.00",
        "PDFBox": "$0.00",
        "Tika": "$0.00",
        "PyMuPDF": "$0.00",
        "DI Read ($0.0015/p)": f"${pages * 0.0015:.2f}",
        "DI Layout ($0.01/p)": f"${pages * 0.01:.2f}",
    })

df_cost = pd.DataFrame(cost_rows)
display(df_cost.style.set_caption("Per-Document Extraction Cost").hide(axis="index"))

# Cost at scale
display(Markdown("#### Cost Projection at Scale"))

volumes = [100, 1_000, 10_000, 100_000, 1_000_000]
# Use the same frame that produced total_pages so numerator and denominator match
avg_pages_per_doc = total_pages / len(df)

scale_rows = []
for vol in volumes:
    total_p = int(vol * avg_pages_per_doc)
    scale_rows.append({
        "Documents": f"{vol:,}",
        "Est. Pages": f"{total_p:,}",
        "Local (any)": "$0",
        "DI Read": f"${total_p * 0.0015:,.0f}",
        "DI Layout": f"${total_p * 0.01:,.0f}",
        "DI Layout (commitment)": f"${total_p * 0.008:,.0f}",
    })

df_scale = pd.DataFrame(scale_rows)
display(df_scale.style.set_caption(
    f"Cost Projection (avg {avg_pages_per_doc:.0f} pages/doc based on test set)"
).hide(axis="index"))


# ──────────────────────────────────────────────────────────────────────────────
# TABLE 4 — Feature Matrix
# ──────────────────────────────────────────────────────────────────────────────
display(Markdown("---"))
display(Markdown("### 11d. Feature Matrix"))

features = [
    {
        "Feature": "Language / Runtime",
        "PdfPig": "C# / .NET 10",
        "PDFBox": "Java / JVM (JPype)",
        "Tika": "Java (Docker REST)",
        "PyMuPDF": "Python (C ext)",
        "Doc Intelligence": "Python SDK (Cloud)",
    },
    {
        "Feature": "Installation",
        "PdfPig": "dotnet build",
        "PDFBox": "pip + JRE + JAR",
        "Tika": "docker pull",
        "PyMuPDF": "pip install",
        "Doc Intelligence": "pip install + Azure keys",
    },
    {
        "Feature": "PDF text extraction",
        "PdfPig": "✅",
        "PDFBox": "✅",
        "Tika": "✅",
        "PyMuPDF": "✅",
        "Doc Intelligence": "✅",
    },
    {
        "Feature": "PDF metadata",
        "PdfPig": "⚠️ Basic",
        "PDFBox": "✅ Standard",
        "Tika": "✅ Rich (Dublin Core)",
        "PyMuPDF": "✅ Standard+",
        "Doc Intelligence": "❌ No PDF info dict",
    },
    {
        "Feature": "Table detection",
        "PdfPig": "❌",
        "PDFBox": "❌",
        "Tika": "❌",
        "PyMuPDF": "❌",
        "Doc Intelligence": "✅ (54-80 tables/doc)",
    },
    {
        "Feature": "Selection marks",
        "PdfPig": "❌",
        "PDFBox": "☑ as text",
        "Tika": "☑ as text",
        "PyMuPDF": "☑ as text",
        "Doc Intelligence": "✅ :selected:/:unselected:",
    },
    {
        "Feature": "Reading order",
        "PdfPig": "❌",
        "PDFBox": "⚠️ PDF order",
        "Tika": "⚠️ PDF order",
        "PyMuPDF": "⚠️ PDF order",
        "Doc Intelligence": "✅ AI-inferred",
    },
    {
        "Feature": "OCR fallback",
        "PdfPig": "❌",
        "PDFBox": "❌",
        "Tika": "⚠️ w/ Tesseract",
        "PyMuPDF": "❌",
        "Doc Intelligence": "✅ Built-in",
    },
    {
        "Feature": "Offline / air-gapped",
        "PdfPig": "✅",
        "PDFBox": "✅",
        "Tika": "✅",
        "PyMuPDF": "✅",
        "Doc Intelligence": "❌ Requires internet",
    },
    {
        "Feature": "Used by Azure AI Search",
        "PdfPig": "❌",
        "PDFBox": "Indirectly (via Tika)",
        "Tika": "✅ Default engine",
        "PyMuPDF": "❌",
        "Doc Intelligence": "✅ Optional skill",
    },
]
df_features = pd.DataFrame(features)
display(df_features.style.set_caption("Feature Comparison Matrix").hide(axis="index"))


# ──────────────────────────────────────────────────────────────────────────────
# TABLE 5 — Final Verdict
# ──────────────────────────────────────────────────────────────────────────────
display(Markdown("---"))
display(Markdown("### 11e. Final Verdict"))

verdict = [
    {
        "Use Case": "🔬 Simulator (match Azure Search output)",
        "Recommended": "**Tika (Docker)**",
        "Why": "Same engine Azure Search uses. Best fidelity. Free.",
    },
    {
        "Use Case": "⚡ Fastest local extraction",
        "Recommended": "**Tika** (small) / **PDFBox** (large)",
        "Why": "Tika 35-57ms on small files. PDFBox 772-2021ms on SEC filings.",
    },
    {
        "Use Case": "🐍 Python-only (no JVM/Docker)",
        "Recommended": "**PyMuPDF**",
        "Why": "pip install, fast, accurate — except rare layout issues.",
    },
    {
        "Use Case": "📊 Complex tables & forms",
        "Recommended": "**Document Intelligence (Layout)**",
        "Why": "Only solution with real table detection. AI reading order.",
    },
    {
        "Use Case": "💸 Zero cost at any scale",
        "Recommended": "**Any local** (Tika, PDFBox, PyMuPDF, PdfPig)",
        "Why": "All free. DI costs $0.01/page = $10K for 1M pages.",
    },
    {
        "Use Case": "🏢 Enterprise / air-gapped",
        "Recommended": "**PdfPig** or **PDFBox**",
        "Why": "No Docker, no cloud dependency. Pure local execution.",
    },
]
df_verdict = pd.DataFrame(verdict)
display(df_verdict.style.set_caption("Recommended Solution by Use Case").hide(axis="index"))

### 11a. Solution Scorecard

Ratings: ⭐ = 1 (worst) … ⭐⭐⭐⭐⭐ = 5 (best). Higher is better.

| Solution | Word Accuracy | Char Accuracy | Speed | Cost | Metadata | Layout / Tables | Deployment | Notes |
|---|---|---|---|---|---|---|---|---|
| PdfPig (C#) | ⭐⭐⭐ | ⭐⭐⭐ | ⭐⭐⭐ | ⭐⭐⭐⭐⭐ | ⭐⭐ | ⭐⭐ | ⭐⭐⭐⭐⭐ | 3-5% fewer words on large PDFs; CLI overhead ~2s; .NET dependency |
| PDFBox (Java/JPype) | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ | ⭐⭐⭐ | ⭐⭐⭐ | ⭐⭐⭐ | Reference baseline; requires JVM; fast in-process |
| Apache Tika (Docker) | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ | ⭐⭐⭐ | ⭐⭐⭐⭐ | Same engine as Azure AI Search; fastest on small files; Docker required |
| PyMuPDF (Python) | ⭐⭐⭐⭐ | ⭐⭐⭐⭐ | ⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ | ⭐⭐⭐ | ⭐⭐⭐ | ⭐⭐⭐⭐⭐ | Fast, pip install; drops 17% on Benefit_Options layout |
| Doc Intelligence (Azure) | ⭐⭐⭐⭐⭐ | ⭐⭐⭐ | ⭐ | ⭐⭐ | ⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ | ⭐⭐⭐ | Best tables/layout; cloud latency; $0.01/page (Layout) |
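
The star ratings are stored as strings, which makes the scorecard awkward to sort or filter. A minimal sketch, assuming every rating consists only of repeated ⭐ characters: the hypothetical helper `stars_to_int` (not part of the notebook) counts them so that the values can be treated as numbers. Only two rows and two columns are copied here for brevity.

```python
def stars_to_int(rating: str) -> int:
    """Count the star characters in a rating string such as '⭐⭐⭐'."""
    return rating.count("⭐")


# Two rows copied from the scorecard (Speed and Cost columns only).
ratings = {
    "Apache Tika (Docker)": {"Speed": "⭐⭐⭐⭐⭐", "Cost": "⭐⭐⭐⭐⭐"},
    "Doc Intelligence (Azure)": {"Speed": "⭐", "Cost": "⭐⭐"},
}

# Convert every rating string to an integer between 1 and 5.
numeric = {
    solution: {dim: stars_to_int(stars) for dim, stars in dims.items()}
    for solution, dims in ratings.items()
}

print(numeric["Doc Intelligence (Azure)"])
```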


---

### 11b. ⏱️ Speed Comparison — Deep Dive

Extraction time in **milliseconds**. Green = fastest per document, Red = slowest. Doc Intelligence times include full cloud round-trip (upload → analyze → poll → download).

| Document | Pages | PdfPig ms | PDFBox ms | Tika ms | PyMuPDF ms | DocInt ms |
|---|---|---|---|---|---|---|
| Employee Handbook | 11 | 2,288 | 767 | 41 | 422 | 8,156 |
| Benefit Options | 4 | 571 | 170 | 57 | 805 | 5,607 |
| Perks Plus Program | 4 | 622 | 429 | 35 | 267 | 5,326 |
| 0000950170 25 061046 | 72 | 2,125 | 2,021 | 2,809 | 2,513 | 13,575 |
| 0000950170 25 100235 | 158 | 2,349 | 1,726 | 3,975 | 2,898 | 21,683 |
| 0001193125 25 256321 | 67 | 2,161 | 772 | 1,424 | 1,902 | 12,251 |
| 0001193125 26 027207 | 71 | 2,268 | 1,004 | 3,174 | 2,660 | 13,247 |
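
The timing columns report wall-clock milliseconds per document. The measurement code is not part of this cell, so the following is only a sketch of one way such numbers could be collected. The hypothetical helper `time_extraction` wraps any extractor callable with `time.perf_counter` and reports the median of a few runs, which dampens one-off noise such as a cold start. `fake_extract` is a stand-in for illustration, not one of the notebook's extractors.

```python
import time
from statistics import median


def time_extraction(extract, pdf_path, repeats=3):
    """Return the median wall-clock time in milliseconds over `repeats` runs."""
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        extract(pdf_path)  # the result is discarded; only the duration matters
        samples.append((time.perf_counter() - start) * 1000.0)
    return median(samples)


def fake_extract(path):
    """Stand-in extractor (assumption): sleeps briefly instead of parsing a PDF."""
    time.sleep(0.01)
    return "text"


elapsed_ms = time_extraction(fake_extract, "sample.pdf")
print(f"{elapsed_ms:.0f} ms")
```

For a cloud service, the callable would need to cover the whole round-trip (upload, analyze, poll, download) to be comparable with the Doc Intelligence column.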


#### Speed Summary

| Solution | Avg (all docs) | Avg (small ≤11p) | Avg (large >11p) | Min | Max | ms/page (avg) |
|---|---|---|---|---|---|---|
| PdfPig (C#) | 1,769 ms | 1,160 ms | 2,226 ms | 571 ms | 2,349 ms | 32.0 |
| PDFBox (Java/JPype) | 984 ms | 455 ms | 1,381 ms | 170 ms | 2,021 ms | 17.8 |
| Apache Tika (Docker) | 1,645 ms | 44 ms | 2,846 ms | 35 ms | 3,975 ms | 29.8 |
| PyMuPDF (Python) | 1,638 ms | 498 ms | 2,493 ms | 267 ms | 2,898 ms | 29.6 |
| Document Intelligence (Azure) | 11,406 ms | 6,363 ms | 15,189 ms | 5,326 ms | 21,683 ms | 206.3 |
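
The `ms/page (avg)` column is total extraction time divided by total pages. A minimal sketch of that arithmetic, using the PDFBox timings and page counts from the speed table: the hypothetical helper `ms_per_page` also skips documents without a timing (represented here as `None`), so that a failed run does not lower the per-page figure. This is an illustration, not the notebook's own code.

```python
def ms_per_page(timings_ms, pages):
    """Total ms divided by total pages, ignoring documents with no timing."""
    timed = [(t, p) for t, p in zip(timings_ms, pages) if t is not None]
    total_pages = sum(p for _, p in timed)
    if total_pages == 0:
        return None  # nothing was timed
    return sum(t for t, _ in timed) / total_pages


pages = [11, 4, 4, 72, 158, 67, 71]
pdfbox_ms = [767, 170, 429, 2021, 1726, 772, 1004]

all_docs = ms_per_page(pdfbox_ms, pages)
print(f"{all_docs:.1f}")  # matches the 17.8 shown for PDFBox

# Same data, but pretend the 158-page filing failed to extract.
with_gap = [767, 170, 429, 2021, None, 772, 1004]
partial = ms_per_page(with_gap, pages)
print(f"{partial:.1f}")
```

Dividing the reduced total by all 387 pages instead would understate the per-page time, because the 158 untimed pages would still count in the denominator.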


#### Speed Multiplier vs Fastest (per document)

| Document | Pages | PdfPig (C#) | PDFBox (Java/JPype) | Apache Tika (Docker) | PyMuPDF (Python) | Document Intelligence (Azure) |
|---|---|---|---|---|---|---|
| Employee Handbook | 11 | 55.7× | 18.7× | 1.0× | 10.3× | 198.4× |
| Benefit Options | 4 | 10.0× | 3.0× | 1.0× | 14.1× | 98.2× |
| Perks Plus Program | 4 | 17.7× | 12.2× | 1.0× | 7.6× | 151.3× |
| 0000950170 25 061046 | 72 | 1.1× | 1.0× | 1.4× | 1.2× | 6.7× |
| 0000950170 25 100235 | 158 | 1.4× | 1.0× | 2.3× | 1.7× | 12.6× |
| 0001193125 25 256321 | 67 | 2.8× | 1.0× | 1.8× | 2.5× | 15.9× |
| 0001193125 26 027207 | 71 | 2.3× | 1.0× | 3.2× | 2.6× | 13.2× |
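
Each multiplier is a document's timing divided by the fastest timing for that same document. The cell computes this row by row; a vectorized alternative in pandas is sketched here, using two rows copied from the speed table. Because the copied timings are already rounded to whole milliseconds, some results can differ in the last digit from the multipliers printed in the table.

```python
import pandas as pd

# Two rows copied from the speed table (timings in ms).
timings = pd.DataFrame(
    {
        "PdfPig ms": [2288, 2125],
        "PDFBox ms": [767, 2021],
        "Tika ms": [41, 2809],
        "PyMuPDF ms": [422, 2513],
        "DocInt ms": [8156, 13575],
    },
    index=["Employee Handbook", "0000950170 25 061046"],
)

# Row-wise minimum (NaN values are skipped by default), then divide every
# row by its own minimum so the fastest solution becomes 1.0.
fastest = timings.min(axis=1)
multipliers = timings.div(fastest, axis=0).round(1)

print(multipliers)
```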


---

### 11c. 💰 Cost Comparison — Deep Dive

All local solutions (PdfPig, PDFBox, Tika, PyMuPDF) are **free** — zero per-page cost. Document Intelligence charges **per page** based on the model used.

| Document | Pages | PdfPig | PDFBox | Tika | PyMuPDF | DI Read ($0.0015/p) | DI Layout ($0.01/p) |
|---|---|---|---|---|---|---|---|
| Employee Handbook | 11 | $0.00 | $0.00 | $0.00 | $0.00 | $0.02 | $0.11 |
| Benefit Options | 4 | $0.00 | $0.00 | $0.00 | $0.00 | $0.01 | $0.04 |
| Perks Plus Program | 4 | $0.00 | $0.00 | $0.00 | $0.00 | $0.01 | $0.04 |
| 0000950170 25 061046 | 72 | $0.00 | $0.00 | $0.00 | $0.00 | $0.11 | $0.72 |
| 0000950170 25 100235 | 158 | $0.00 | $0.00 | $0.00 | $0.00 | $0.24 | $1.58 |
| 0001193125 25 256321 | 67 | $0.00 | $0.00 | $0.00 | $0.00 | $0.10 | $0.67 |
| 0001193125 26 027207 | 71 | $0.00 | $0.00 | $0.00 | $0.00 | $0.11 | $0.71 |


#### Cost Projection at Scale

| Documents | Est. Pages | Local (any) | DI Read | DI Layout | DI Layout (commitment) |
|---|---|---|---|---|---|
| 100 | 5,528 | $0 | $8 | $55 | $44 |
| 1,000 | 55,285 | $0 | $83 | $553 | $442 |
| 10,000 | 552,857 | $0 | $829 | $5,529 | $4,423 |
| 100,000 | 5,528,571 | $0 | $8,293 | $55,286 | $44,229 |
| 1,000,000 | 55,285,714 | $0 | $82,929 | $552,857 | $442,286 |
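
The projection multiplies the document count by the test set's average pages per document (387 pages over 7 documents) and then by a per-page rate. A minimal sketch of that arithmetic follows. The rates are the ones hard-coded in this notebook, and `project_cost` is a hypothetical helper name; actual Azure pricing can vary by region and over time, so treat the figures as assumptions rather than a quote.

```python
# Per-page rates as used in this notebook (assumptions, not a price list).
RATE_PER_PAGE = {
    "DI Read": 0.0015,
    "DI Layout": 0.01,
    "DI Layout (commitment)": 0.008,
}


def project_cost(documents, avg_pages_per_doc, rate_per_page):
    """Return (estimated pages, estimated cost in USD) for a document volume."""
    pages = int(documents * avg_pages_per_doc)  # truncate, as the cell does
    return pages, pages * rate_per_page


avg_pages = 387 / 7  # total pages / number of documents in the test set

pages, layout_cost = project_cost(1_000, avg_pages, RATE_PER_PAGE["DI Layout"])
print(f"{pages:,} pages -> ${layout_cost:,.0f}")
```

Note that the projection is per document, not per page: one million documents at this average is roughly 55 million pages, which is why the last row is far above the cost of one million pages.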


---

### 11d. Feature Matrix

| Feature | PdfPig | PDFBox | Tika | PyMuPDF | Doc Intelligence |
|---|---|---|---|---|---|
| Language / Runtime | C# / .NET 10 | Java / JVM (JPype) | Java (Docker REST) | Python (C ext) | Python SDK (Cloud) |
| Installation | dotnet build | pip + JRE + JAR | docker pull | pip install | pip install + Azure keys |
| PDF text extraction | ✅ | ✅ | ✅ | ✅ | ✅ |
| PDF metadata | ⚠️ Basic | ✅ Standard | ✅ Rich (Dublin Core) | ✅ Standard+ | ❌ No PDF info dict |
| Table detection | ❌ | ❌ | ❌ | ❌ | ✅ (54-80 tables/doc) |
| Selection marks | ❌ | ☑ as text | ☑ as text | ☑ as text | ✅ :selected:/:unselected: |
| Reading order | ❌ | ⚠️ PDF order | ⚠️ PDF order | ⚠️ PDF order | ✅ AI-inferred |
| OCR fallback | ❌ | ❌ | ⚠️ w/ Tesseract | ❌ | ✅ Built-in |
| Offline / air-gapped | ✅ | ✅ | ✅ | ✅ | ❌ Requires internet |
| Used by Azure AI Search | ❌ | Indirectly (via Tika) | ✅ Default engine | ❌ | ✅ Optional skill |


---

### 11e. Final Verdict

| Use Case | Recommended | Why |
|---|---|---|
| 🔬 Simulator (match Azure Search output) | **Tika (Docker)** | Same engine Azure Search uses. Best fidelity. Free. |
| ⚡ Fastest local extraction | **Tika** (small) / **PDFBox** (large) | Tika 35-57ms on small files. PDFBox 772-2021ms on SEC filings. |
| 🐍 Python-only (no JVM/Docker) | **PyMuPDF** | pip install, fast, accurate — except rare layout issues. |
| 📊 Complex tables & forms | **Document Intelligence (Layout)** | Only solution with real table detection. AI reading order. |
| 💸 Zero cost at any scale | **Any local** (Tika, PDFBox, PyMuPDF, PdfPig) | All free. DI costs $0.01/page = $10K for 1M pages. |
| 🏢 Enterprise / air-gapped | **PdfPig** or **PDFBox** | No Docker, no cloud dependency. Pure local execution. |
