# PDF Extraction Quality Inspector

This notebook downloads sample PDFs from the Azure AI Search sample data repository and runs several extraction libraries on them to inspect:
- **Extracted text**: page by page
- **Metadata**: title, author, creation date, producer, keywords, etc.
- **Quality metrics**: word count, character count, extraction completeness

Use this to evaluate and improve the `PdfCracker` in the simulator.
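
As a rough illustration of the quality metrics listed above, here is a minimal sketch that summarizes a list of per-page text strings. The function name `text_quality_metrics` is hypothetical, and the "completeness" definition (share of pages that yielded any text) is an assumption made for this example; the notebook does not define the term precisely.

```python
def text_quality_metrics(page_texts):
    """Summarize extraction quality for a list of per-page text strings."""
    page_count = len(page_texts)
    non_empty_pages = sum(1 for text in page_texts if text.strip())
    full_text = "\n\n".join(page_texts)
    return {
        "page_count": page_count,
        "word_count": len(full_text.split()),
        "char_count": len(full_text),
        # Assumed definition: fraction of pages that produced any text at all.
        "completeness": non_empty_pages / page_count if page_count else 0.0,
    }


metrics = text_quality_metrics(["Contoso Electronics", "", "Plan and Benefit Packages"])
print(metrics)
```

A page with no extractable text (for example a scanned image) lowers the completeness score without affecting the word count of the other pages.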

## 1. Install and Import Dependencies

In [None]:
# Install required packages (run once)
%pip install pymupdf pdfplumber requests pandas tabulate jpype1

In [6]:
import os
import json
import time
import textwrap
from pathlib import Path
from collections import defaultdict

import fitz  # PyMuPDF
import pdfplumber
import requests
import pandas as pd
from IPython.display import display, HTML, Markdown

pd.set_option("display.max_colwidth", 120)
pd.set_option("display.max_rows", 100)

# Max chars of extracted text to display per document (0 = no limit, show all)
MAX_TEXT_DISPLAY = 0

print("✅ All libraries imported successfully")


✅ All libraries imported successfully


## 2. Download Sample PDF Files

We use the same PDFs from the [Azure-Samples/azure-search-sample-data](https://github.com/Azure-Samples/azure-search-sample-data) repository (health-plan folder). These are real-world documents with varying complexity.

In [7]:
# Sample PDF URLs from the Azure-Samples/azure-search-sample-data repository
SAMPLE_PDFS = {
    "employee_handbook": {
        "url": "https://raw.githubusercontent.com/Azure-Samples/azure-search-sample-data/main/health-plan/employee_handbook.pdf",
        "title": "Employee Handbook",
        "category": "HR",
    },
    "Benefit_Options": {
        "url": "https://raw.githubusercontent.com/Azure-Samples/azure-search-sample-data/main/health-plan/Benefit_Options.pdf",
        "title": "Benefit Options",
        "category": "Benefits",
    },
    "PerksPlus": {
        "url": "https://raw.githubusercontent.com/Azure-Samples/azure-search-sample-data/main/health-plan/PerksPlus.pdf",
        "title": "Perks Plus Program",
        "category": "Benefits",
    }
}

PDF_DIR = Path("../data/pdfs")
PDF_DIR.mkdir(parents=True, exist_ok=True)

downloaded = {}

# 1) Download remote sample PDFs (if not already present)
for doc_id, info in SAMPLE_PDFS.items():
    pdf_path = PDF_DIR / f"{doc_id}.pdf"
    if pdf_path.exists():
        print(f"📄 {info['title']}: already exists ({pdf_path.stat().st_size:,} bytes)")
    else:
        print(f"📥 Downloading {info['title']}...")
        resp = requests.get(info["url"], timeout=30)
        resp.raise_for_status()
        pdf_path.write_bytes(resp.content)
        print(f"   ✅ Saved ({len(resp.content):,} bytes)")
    downloaded[doc_id] = pdf_path

# 2) Discover any additional local PDF files in the same directory
local_count = 0
for pdf_path in sorted(PDF_DIR.glob("*.pdf")):
    doc_id = pdf_path.stem  # filename without extension
    if doc_id in downloaded:
        continue  # already registered from the remote list above
    # Skip the PDFBox JAR (has .jar extension, but just in case)
    if "pdfbox" in doc_id.lower():
        continue
    downloaded[doc_id] = pdf_path
    SAMPLE_PDFS[doc_id] = {
        "title": doc_id.replace("_", " ").replace("-", " ").title(),
        "category": "Local",
    }
    local_count += 1
    print(f"📂 Local PDF: {pdf_path.name} ({pdf_path.stat().st_size:,} bytes)")

remote_count = len(downloaded) - local_count
print(f"\n✅ {len(downloaded)} PDF files ready ({remote_count} remote, {local_count} local) in {PDF_DIR.resolve()}")

📄 Employee Handbook: already exists (142,977 bytes)
📄 Benefit Options: already exists (544,811 bytes)
📄 Perks Plus Program: already exists (115,310 bytes)
📂 Local PDF: 0000950170-25-061046.pdf (2,179,871 bytes)
📂 Local PDF: 0000950170-25-100235.pdf (3,024,506 bytes)
📂 Local PDF: 0001193125-25-256321.pdf (1,802,290 bytes)
📂 Local PDF: 0001193125-26-027207.pdf (2,257,229 bytes)

✅ 7 PDF files ready (3 remote, 4 local) in C:\Projets\AzureAISimulator\samples\data\pdfs


## 3. Configure PDF Extraction Functions

We set up extraction functions for the Python libraries:
- **PyMuPDF** (`fitz`): fast, C-based, handles complex layouts
- **pdfplumber**: pure-Python, good table extraction, detailed character info

> PDFBox (Java via JPype) extraction is configured in Section 5b below.

In [8]:
def extract_with_pymupdf(pdf_path: Path) -> dict:
    """Extract text and metadata using PyMuPDF (fitz)."""
    doc = fitz.open(str(pdf_path))
    
    pages = []
    full_text_parts = []
    for i, page in enumerate(doc):
        text = page.get_text("text")
        pages.append({
            "page_num": i + 1,
            "text": text,
            "char_count": len(text),
            "word_count": len(text.split()),
            "width": page.rect.width,
            "height": page.rect.height,
            "images": len(page.get_images(full=True)),
            "links": len(page.get_links()),
        })
        full_text_parts.append(text)
    
    metadata = doc.metadata or {}
    full_text = "\n\n".join(full_text_parts)
    
    result = {
        "library": "PyMuPDF",
        "file": pdf_path.name,
        "file_size": pdf_path.stat().st_size,
        "page_count": len(doc),
        "full_text": full_text,
        "total_chars": len(full_text),
        "total_words": len(full_text.split()),
        "pages": pages,
        "metadata": {
            "title": metadata.get("title", ""),
            "author": metadata.get("author", ""),
            "subject": metadata.get("subject", ""),
            "keywords": metadata.get("keywords", ""),
            "creator": metadata.get("creator", ""),
            "producer": metadata.get("producer", ""),
            "creation_date": metadata.get("creationDate", ""),
            "mod_date": metadata.get("modDate", ""),
            "format": metadata.get("format", ""),
            "encryption": metadata.get("encryption", None),
        },
    }
    doc.close()
    return result


def extract_with_pdfplumber(pdf_path: Path) -> dict:
    """Extract text and metadata using pdfplumber."""
    pdf = pdfplumber.open(str(pdf_path))
    
    pages = []
    full_text_parts = []
    for i, page in enumerate(pdf.pages):
        text = page.extract_text() or ""
        tables = page.extract_tables()
        pages.append({
            "page_num": i + 1,
            "text": text,
            "char_count": len(text),
            "word_count": len(text.split()),
            "width": page.width,
            "height": page.height,
            "tables_found": len(tables),
            "chars_count_raw": len(page.chars),
        })
        full_text_parts.append(text)
    
    metadata = pdf.metadata or {}
    full_text = "\n\n".join(full_text_parts)
    
    result = {
        "library": "pdfplumber",
        "file": pdf_path.name,
        "file_size": pdf_path.stat().st_size,
        "page_count": len(pdf.pages),
        "full_text": full_text,
        "total_chars": len(full_text),
        "total_words": len(full_text.split()),
        "pages": pages,
        "metadata": {
            "title": metadata.get("Title", metadata.get("title", "")),
            "author": metadata.get("Author", metadata.get("author", "")),
            "subject": metadata.get("Subject", metadata.get("subject", "")),
            "keywords": metadata.get("Keywords", metadata.get("keywords", "")),
            "creator": metadata.get("Creator", metadata.get("creator", "")),
            "producer": metadata.get("Producer", metadata.get("producer", "")),
            "creation_date": metadata.get("CreationDate", metadata.get("creationDate", "")),
            "mod_date": metadata.get("ModDate", metadata.get("modDate", "")),
        },
    }
    pdf.close()
    return result

print("✅ Extraction functions defined")

✅ Extraction functions defined


## 4. Run PDF Extraction on All Downloaded Files

Execute both extraction methods on each PDF and store the results.
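
Once results from two extractors are stored, one way to quantify how closely they agree is a word-level similarity ratio. The sketch below uses the standard library's `difflib.SequenceMatcher`; the helper name `word_similarity` is hypothetical and not part of the notebook or the simulator.

```python
import difflib


def word_similarity(text_a, text_b):
    """Return a 0..1 similarity ratio over whitespace-separated tokens."""
    # Splitting on whitespace ignores differences in line breaks and spacing.
    words_a, words_b = text_a.split(), text_b.split()
    # autojunk=False keeps frequent words from being treated as junk on long texts.
    matcher = difflib.SequenceMatcher(None, words_a, words_b, autojunk=False)
    return matcher.ratio()


same = word_similarity(
    "Contoso Electronics \nPlan and Benefit Packages",
    "Contoso Electronics Plan and Benefit Packages",
)
partial = word_similarity("a b c d", "a b c")
print(same, round(partial, 3))
```

In this notebook it could be applied to the `full_text` values of two result dictionaries for the same `doc_id`. `SequenceMatcher` can be slow on very long inputs, so comparing page by page may be more practical for the larger local PDFs.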

In [None]:
# Run both extractors on each PDF
# Large/complex PDFs can make pdfplumber very slow, so we use a per-file timeout.
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FuturesTimeoutError

EXTRACTION_TIMEOUT_SEC = 120  # max seconds per extraction per file

results_pymupdf = {}
results_pdfplumber = {}
timings_pymupdf = {}
timings_pdfplumber = {}

def _run_with_timeout(func, pdf_path, timeout=EXTRACTION_TIMEOUT_SEC):
    """Run an extraction function with a timeout. Returns (result, elapsed) or raises."""
    # The executor is not used as a context manager: its __exit__ waits for the
    # worker thread to finish, which would block well past the timeout.
    executor = ThreadPoolExecutor(max_workers=1)
    t0 = time.perf_counter()
    future = executor.submit(func, pdf_path)
    try:
        result = future.result(timeout=timeout)
        return result, time.perf_counter() - t0
    finally:
        # Threads cannot be killed; a timed-out extraction keeps running in the background.
        executor.shutdown(wait=False)

for doc_id, pdf_path in downloaded.items():
    title = SAMPLE_PDFS[doc_id]["title"]
    print(f"🔍 Processing: {title} ({pdf_path.name})")

    try:
        result, elapsed = _run_with_timeout(extract_with_pymupdf, pdf_path)
        results_pymupdf[doc_id] = result
        timings_pymupdf[doc_id] = elapsed
        print(f"   PyMuPDF:    {result['page_count']} pages, "
              f"{result['total_words']:,} words, "
              f"{result['total_chars']:,} chars  "
              f"({elapsed*1000:.1f} ms)")
    except FuturesTimeoutError:
        print(f"   ⏰ PyMuPDF timed out after {EXTRACTION_TIMEOUT_SEC}s, skipped")
    except Exception as e:
        print(f"   ❌ PyMuPDF failed: {e}")

    #try:
    #    result, elapsed = _run_with_timeout(extract_with_pdfplumber, pdf_path)
    #    results_pdfplumber[doc_id] = result
    #    timings_pdfplumber[doc_id] = elapsed
    #    print(f"   pdfplumber: {result['page_count']} pages, "
    #          f"{result['total_words']:,} words, "
    #          f"{result['total_chars']:,} chars  "
    #          f"({elapsed*1000:.1f} ms)")
    #except FuturesTimeoutError:
    #    print(f"   ⏰ pdfplumber timed out after {EXTRACTION_TIMEOUT_SEC}s, skipped")
    #except Exception as e:
    #    print(f"   ❌ pdfplumber failed: {e}")

    print()

print(f"✅ Extraction complete: PyMuPDF={len(results_pymupdf)}, pdfplumber={len(results_pdfplumber)} of {len(downloaded)} documents")

🔍 Processing: Employee Handbook (employee_handbook.pdf)
   PyMuPDF:    11 pages, 2,370 words, 16,118 chars  (420.7 ms)
   pdfplumber: 11 pages, 2,370 words, 15,646 chars  (1148.5 ms)

🔍 Processing: Benefit Options (Benefit_Options.pdf)
   PyMuPDF:    4 pages, 507 words, 3,677 chars  (957.8 ms)
   pdfplumber: 4 pages, 614 words, 4,286 chars  (557.9 ms)

🔍 Processing: Perks Plus Program (PerksPlus.pdf)
   PyMuPDF:    4 pages, 432 words, 2,907 chars  (319.6 ms)
   pdfplumber: 4 pages, 432 words, 2,812 chars  (280.7 ms)

🔍 Processing: 0000950170 25 061046 (0000950170-25-061046.pdf)
   PyMuPDF:    72 pages, 32,946 words, 242,269 chars  (3291.8 ms)


## 5. Run Simulator's PdfCracker (PdfPig / C#)

Call the simulator's C# document crackers via the `DocumentCrackingTool` CLI wrapper.

In [5]:
# Import the Python wrapper for the C# document cracking tool
import sys
sys.path.insert(0, str(Path("../../tools/DocumentCrackingTool").resolve()))
from document_cracking import DocumentCracker

# Initialize (auto-builds the .NET tool on first use)
cracker = DocumentCracker()

# List all available crackers
crackers_info = cracker.list_crackers()
print("Available simulator document crackers:")
for c in crackers_info:
    exts = ", ".join(c["supportedExtensions"])
    types = ", ".join(c["supportedContentTypes"])
    print(f"  📦 {c['name']:20s}  extensions: {exts:30s}  types: {types}")

🔨 Building DocumentCrackingTool...
✅ Build successful
Available simulator document crackers:
  📦 PdfCracker            extensions: .pdf                            types: application/pdf
  📦 PlainTextCracker      extensions: .txt, .md, .markdown, .text     types: text/plain, text/markdown, text/x-markdown
  📦 HtmlCracker           extensions: .html, .htm, .xhtml             types: text/html, application/xhtml+xml
  📦 JsonCracker           extensions: .json                           types: application/json, text/json
  📦 CsvCracker            extensions: .csv, .tsv                      types: text/csv, text/comma-separated-values, application/csv
  📦 ExcelCracker          extensions: .xlsx                           types: application/vnd.openxmlformats-officedocument.spreadsheetml.sheet, application/vnd.ms-excel
  📦 WordDocCracker        extensions: .docx                           types: application/vnd.openxmlformats-officedocument.wordprocessingml.document, ap

### Run PdfCracker on all sample PDFs

In [32]:
# Run the simulator's PdfCracker on each PDF
results_pdfpig = {}
timings_pdfpig = {}

for doc_id, pdf_path in downloaded.items():
    title = SAMPLE_PDFS[doc_id]["title"]
    print(f"🔍 Running simulator crackers on: {title}")
    
    t0 = time.perf_counter()
    result = cracker.crack(str(pdf_path), crackers=["PdfCracker"])
    timings_pdfpig[doc_id] = time.perf_counter() - t0
    results_pdfpig[doc_id] = result
    
    # Show the PdfCracker result
    for c in cracker.get_successful_crackers(result):
        # Format counts only when present: "?" cannot take the "," format spec
        words, chars = c.get("wordCount"), c.get("characterCount")
        words_str = f"{words:,}" if words is not None else "?"
        chars_str = f"{chars:,}" if chars is not None else "?"
        print(f"   ✅ {c['crackerName']}: {c.get('pageCount', '?')} pages, "
              f"{words_str} words, {chars_str} chars")
        if c.get("title"):
            print(f"      Title: {c['title']}")
        if c.get("author"):
            print(f"      Author: {c['author']}")
        if c.get("createdDate"):
            print(f"      Created: {c['createdDate']}")
        if c.get("metadata"):
            for k, v in c["metadata"].items():
                print(f"      {k}: {v}")
        extraction_ms = c.get("extractionTimeMs", 0)
        print(f"   ⏱️  C# extraction: {extraction_ms:.1f} ms  |  total with CLI overhead: {timings_pdfpig[doc_id]*1000:.0f} ms")
    print()

🔍 Running simulator crackers on: Employee Handbook
   ✅ PdfCracker: 11 pages, 2,367 words, 15,777 chars
      Author: python-docx
      Created: 2023-03-06T13:57:20.0000000+00:00
      creator: Microsoft® Word for Microsoft 365
      producer: Microsoft® Word for Microsoft 365
      pdfVersion: 1,7
   ⏱️  C# extraction: 709.2 ms  |  total with CLI overhead: 1859 ms

🔍 Running simulator crackers on: Benefit Options
   ✅ PdfCracker: 4 pages, 609 words, 4,289 chars
      Author: Liam Cavanagh
      Created: 2023-03-06T13:58:20.0000000+00:00
      creator: Microsoft® Word for Microsoft 365
      producer: Microsoft® Word for Microsoft 365
      pdfVersion: 1,7
   ⏱️  C# extraction: 724.2 ms  |  total with CLI overhead: 2052 ms

🔍 Running simulator crackers on: Perks Plus Program
   ✅ PdfCracker: 4 pages, 432 words, 2,831 chars
      Author: Liam Cavanagh
      Created: 2023-03-07T10:33:37.0000000+00:00
      creator: Microsoft® Word for Microsoft 365
      produ

## 5b. Run PDFBox (Java via JPype)

Apache PDFBox is the PDF extraction engine used by **Apache Tika**, which is what **Azure AI Search uses internally** for document cracking. This gives us the most authentic reference point.

We call PDFBox's Java API directly from Python using [JPype](https://jpype.readthedocs.io/).

### Prerequisites

| Requirement | Details |
|---|---|
| **Java Runtime (JRE) or JDK** | Version 11 or later (17+ recommended). JPype needs a JVM (`jvm.dll` on Windows, `libjvm.so` on Linux) to start. |
| **`jpype1` Python package** | Installed via `pip install jpype1` (already included in the pip cell above). |
| **PDFBox JAR** | Downloaded automatically by the cell below from Maven Central. |

### Installing a JRE / JDK

**Option A: System-wide install (recommended, requires admin)**
- **Windows**: `winget install Microsoft.OpenJDK.21` or install [Eclipse Temurin](https://adoptium.net/). The installer sets `JAVA_HOME` automatically.
- **macOS**: `brew install openjdk@21`
- **Linux (Debian/Ubuntu)**: `sudo apt install openjdk-21-jre-headless`

**Option B: Portable / no-admin install (Windows)**
1. Download an OpenJDK **zip** archive (e.g. [Adoptium Temurin JRE releases](https://github.com/adoptium/temurin21-binaries/releases)).
2. Extract it to a local folder, e.g. `%LOCALAPPDATA%\jdk-21-jre\`.
3. No `JAVA_HOME` is needed: the code cell below auto-discovers JVMs in common locations:
   - `%LOCALAPPDATA%\jdk-*-jre\bin\server\jvm.dll`
   - `%LOCALAPPDATA%\jdk-*\bin\server\jvm.dll`
   - `C:\Program Files\Java\*\bin\server\jvm.dll`
   - `C:\Program Files\Microsoft\jdk-*\bin\server\jvm.dll`
   - `C:\Program Files\Eclipse Adoptium\*\bin\server\jvm.dll`

### Verifying the installation

Run `java -version` in a terminal. You should see output like:
```
openjdk version "21.0.5" 2024-10-15 LTS
```

> **Note:** The JVM can only be started **once** per Python process. If you need to change the JVM path or classpath, restart the notebook kernel first.

In [36]:
# Download PDFBox standalone JAR and start JVM
import jpype
import jpype.imports
import glob

PDFBOX_VERSION = "3.0.4"
PDFBOX_JAR_URL = f"https://repo1.maven.org/maven2/org/apache/pdfbox/pdfbox-app/{PDFBOX_VERSION}/pdfbox-app-{PDFBOX_VERSION}.jar"
PDFBOX_JAR = PDF_DIR / f"pdfbox-app-{PDFBOX_VERSION}.jar"

if not PDFBOX_JAR.exists():
    print(f"📥 Downloading PDFBox {PDFBOX_VERSION} JAR from Maven Central...")
    resp = requests.get(PDFBOX_JAR_URL, timeout=120)
    resp.raise_for_status()
    PDFBOX_JAR.write_bytes(resp.content)
    print(f"   ✅ Saved ({len(resp.content):,} bytes)")
else:
    print(f"📄 PDFBox JAR already exists ({PDFBOX_JAR.stat().st_size:,} bytes)")

# Start JVM with PDFBox on classpath (can only be done once per process)
if not jpype.isJVMStarted():
    # Try default path first, fall back to searching common locations
    jvm_path = None
    try:
        jvm_path = jpype.getDefaultJVMPath()
    except jpype.JVMNotFoundException:
        # Search common portable JDK/JRE install locations on Windows
        for pattern in [
            os.path.expandvars(r"%LOCALAPPDATA%\jdk-*-jre\bin\server\jvm.dll"),
            os.path.expandvars(r"%LOCALAPPDATA%\jdk-*\bin\server\jvm.dll"),
            r"C:\Program Files\Java\*\bin\server\jvm.dll",
            r"C:\Program Files\Microsoft\jdk-*\bin\server\jvm.dll",
            r"C:\Program Files\Eclipse Adoptium\*\bin\server\jvm.dll",
        ]:
            matches = glob.glob(pattern)
            if matches:
                jvm_path = matches[0]
                break

    if jvm_path is None:
        raise RuntimeError(
            "No JVM found! Install a JRE/JDK and set JAVA_HOME, "
            "or place one in %LOCALAPPDATA%\\jdk-*"
        )

    print(f"   JVM: {jvm_path}")
    jpype.startJVM(jvm_path, classpath=[str(PDFBOX_JAR.resolve())])
    print("✅ JVM started with PDFBox on classpath")
else:
    print("✅ JVM already running")

# Import PDFBox Java classes
from java.io import File as JFile
from org.apache.pdfbox import Loader
from org.apache.pdfbox.text import PDFTextStripper

print(f"✅ PDFBox {PDFBOX_VERSION} ready")

📄 PDFBox JAR already exists (13,454,142 bytes)
   JVM: C:\Users\laurelle\AppData\Local\jdk-21.0.5+11-jre\bin\server\jvm.dll
✅ JVM started with PDFBox on classpath
✅ PDFBox 3.0.4 ready


In [37]:
# Run PDFBox extraction on all PDFs
results_pdfbox = {}
timings_pdfbox = {}

stripper = PDFTextStripper()

for doc_id, pdf_path in downloaded.items():
    title = SAMPLE_PDFS[doc_id]["title"]
    print(f"🔍 PDFBox extracting: {title}")

    t0 = time.perf_counter()
    try:
        jfile = JFile(str(pdf_path.resolve()))
        doc = Loader.loadPDF(jfile)

        # Extract text
        text = str(stripper.getText(doc))
        page_count = doc.getNumberOfPages()

        # Extract metadata
        info = doc.getDocumentInformation()
        metadata = {}
        for key in ["Title", "Author", "Subject", "Keywords", "Creator", "Producer"]:
            val = info.getCustomMetadataValue(key)
            if val:
                metadata[key.lower()] = str(val)

        # Try to get dates
        creation_date = info.getCreationDate()
        mod_date = info.getModificationDate()
        if creation_date:
            metadata["creation_date"] = str(creation_date.getTime())
        if mod_date:
            metadata["mod_date"] = str(mod_date.getTime())

        doc.close()
        elapsed = time.perf_counter() - t0
        timings_pdfbox[doc_id] = elapsed

        results_pdfbox[doc_id] = {
            "library": "PDFBox",
            "file": pdf_path.name,
            "file_size": pdf_path.stat().st_size,
            "page_count": page_count,
            "full_text": text,
            "total_chars": len(text),
            "total_words": len(text.split()),
            "metadata": metadata,
        }
        print(f"   ✅ {page_count} pages, {len(text.split()):,} words, "
              f"{len(text):,} chars  ({elapsed*1000:.1f} ms)")

    except Exception as e:
        timings_pdfbox[doc_id] = time.perf_counter() - t0
        print(f"   ❌ PDFBox failed: {e}")

    print()

print(f"✅ PDFBox extraction complete for {len(results_pdfbox)} documents")

🔍 PDFBox extracting: Employee Handbook
   ✅ 11 pages, 2,370 words, 16,454 chars  (801.7 ms)

🔍 PDFBox extracting: Benefit Options
   ✅ 4 pages, 614 words, 4,386 chars  (136.5 ms)

🔍 PDFBox extracting: Perks Plus Program
   ✅ 4 pages, 432 words, 2,940 chars  (6676.7 ms)

✅ PDFBox extraction complete for 3 documents


## 7. Raw Extraction Results per Library

Dump the full raw output from each extraction library for every PDF, so you can inspect exactly what each one returns.
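
The raw metadata is not directly comparable across libraries: PyMuPDF and pdfplumber return PDF date strings such as `D:20230306135720-08'00'`, while the PdfCracker reports ISO 8601 timestamps. The sketch below converts a PDF date string into a timezone-aware `datetime` so the values can be compared. The helper name `parse_pdf_date` is hypothetical, and the regular expression is a simplified reading of the PDF date format, not a full validator.

```python
import re
from datetime import datetime, timedelta, timezone

# Simplified pattern for PDF dates: D:YYYYMMDDHHmmSS followed by Z or +HH'mm' / -HH'mm'
_PDF_DATE = re.compile(
    r"^D:(?P<year>\d{4})(?P<month>\d{2})?(?P<day>\d{2})?"
    r"(?P<hour>\d{2})?(?P<minute>\d{2})?(?P<second>\d{2})?"
    r"(?P<tz>[Zz+-])?(?:(?P<tzh>\d{2})'?(?P<tzm>\d{2})?'?)?$"
)


def parse_pdf_date(raw):
    """Convert a PDF date string to a datetime, or None if it cannot be parsed."""
    match = _PDF_DATE.match(raw.strip())
    if not match:
        return None
    g = match.groupdict()
    try:
        tzinfo = None
        if g["tz"] in ("Z", "z"):
            tzinfo = timezone.utc
        elif g["tz"] in ("+", "-"):
            offset = timedelta(hours=int(g["tzh"] or 0), minutes=int(g["tzm"] or 0))
            tzinfo = timezone(offset if g["tz"] == "+" else -offset)
        # Missing components fall back to the earliest valid value.
        return datetime(
            int(g["year"]), int(g["month"] or 1), int(g["day"] or 1),
            int(g["hour"] or 0), int(g["minute"] or 0), int(g["second"] or 0),
            tzinfo=tzinfo,
        )
    except ValueError:
        # Out-of-range values such as month 13
        return None


created = parse_pdf_date("D:20230306135720-08'00'")
print(created.isoformat())
print(created.astimezone(timezone.utc).isoformat())
```

Normalizing both sides to UTC makes offset handling visible. In the outputs below, the PdfCracker reports the same wall-clock time as the raw PDF date but with a `+00:00` offset, which may be worth checking when evaluating its date handling.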

In [33]:
# ── Raw results: PdfPig (C# / simulator) ─────────────────────────────────────

for doc_id in downloaded:
    title = SAMPLE_PDFS[doc_id]["title"]
    pig_result = results_pdfpig.get(doc_id, {})
    pig_crackers = [c for c in pig_result.get("crackers", []) if c.get("success")]
    pig = pig_crackers[0] if pig_crackers else None

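    display(Markdown(f"---\n### 🟢 PdfPig: {title}"))

    if not pig:
        print("  ❌ No successful PdfCracker result")
        continue

    # Metrics (counts are formatted only when present: "?" cannot take the "," format spec)
    words, chars = pig.get("wordCount"), pig.get("characterCount")
    words_str = f"{words:,}" if words is not None else "?"
    chars_str = f"{chars:,}" if chars is not None else "?"
    print(f"  Pages:      {pig.get('pageCount', '?')}")
    print(f"  Words:      {words_str}")
    print(f"  Characters: {chars_str}")
    print(f"  Time:       {pig.get('extractionTimeMs', 0):.1f} ms")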
    print(f"  Title:      {pig.get('title') or '(none)'}")
    print(f"  Author:     {pig.get('author') or '(none)'}")
    print(f"  Created:    {pig.get('createdDate') or '(none)'}")
    print(f"  Modified:   {pig.get('modifiedDate') or '(none)'}")
    print(f"  Language:   {pig.get('language') or '(none)'}")

    # All metadata keys
    meta = pig.get("metadata", {})
    if meta:
        print(f"\n  Raw metadata ({len(meta)} keys):")
        for k, v in meta.items():
            print(f"    {k}: {v}")

    # Warnings
    if pig.get("warnings"):
        print(f"\n  ⚠️  Warnings: {pig['warnings']}")

    # Full text
    text = pig.get("content", "")
    display_text = text if MAX_TEXT_DISPLAY <= 0 else text[:MAX_TEXT_DISPLAY]
    suffix = f"\n\n... [{len(text) - MAX_TEXT_DISPLAY:,} more chars] ..." if MAX_TEXT_DISPLAY > 0 and len(text) > MAX_TEXT_DISPLAY else ""
    print(f"\n  ── Text ({len(text):,} chars{f', showing first {MAX_TEXT_DISPLAY:,}' if MAX_TEXT_DISPLAY > 0 else ''}) ──")

    print(display_text + suffix)
    print()

---
### 🟢 PdfPig: Employee Handbook

  Pages:      11
  Words:      2,367
  Characters: 15,777
  Time:       709.2 ms
  Title:      (none)
  Author:     python-docx
  Created:    2023-03-06T13:57:20.0000000+00:00
  Modified:   2023-03-06T13:57:20.0000000+00:00
  Language:   (none)

  Raw metadata (3 keys):
    creator: Microsoft® Word for Microsoft 365
    producer: Microsoft® Word for Microsoft 365
    pdfVersion: 1,7

  ── Text (15,777 chars) ──
Contoso Electronics Employee Handbook         

This document contains information generated using a language model (Azure OpenAI). The information contained in this document is only for demonstration purposes and does not reflect the opinions or beliefs of Microsoft. Microsoft makes no representations or warranties of any kind, express or implied, about the completeness, accuracy, reliability, suitability or availability with respect to the information contained in this document.  All rights reserved to Microsoft   

Contoso Electronics Employee Handbook Last Updated:

---
### 🟢 PdfPig: Benefit Options

  Pages:      4
  Words:      609
  Characters: 4,289
  Time:       724.2 ms
  Title:      (none)
  Author:     Liam Cavanagh
  Created:    2023-03-06T13:58:20.0000000+00:00
  Modified:   2023-03-20T13:05:46.0000000+00:00
  Language:   (none)

  Raw metadata (3 keys):
    creator: Microsoft® Word for Microsoft 365
    producer: Microsoft® Word for Microsoft 365
    pdfVersion: 1,7

  ── Text (4,289 chars) ──
Contoso Electronics Plan and Benefit Packages

This document contains information generated using a language model (Azure OpenAI). The information contained in this document is only for demonstration purposes and does not reflect the opinions or beliefs of Microsoft. Microsoft makes no representations or warranties of any kind, express or implied, about the completeness, accuracy, reliability, suitability or availability with respect to the information contained in this document. All rights reserved to Microsoft

Welcome to Contoso Electronics! We are excited to offer our

---
### 🟢 PdfPig: Perks Plus Program

  Pages:      4
  Words:      432
  Characters: 2,831
  Time:       716.5 ms
  Title:      (none)
  Author:     Liam Cavanagh
  Created:    2023-03-07T10:33:37.0000000+00:00
  Modified:   2023-03-07T10:33:37.0000000+00:00
  Language:   (none)

  Raw metadata (3 keys):
    creator: Microsoft® Word for Microsoft 365
    producer: Microsoft® Word for Microsoft 365
    pdfVersion: 1,7

  ── Text (2,831 chars) ──
PerksPlus Health and Wellness Reimbursement Program for Contoso Electronics Employees        

This document contains information generated using a language model (Azure OpenAI). The information contained in this document is only for demonstration purposes and does not reflect the opinions or beliefs of Microsoft. Microsoft makes no representations or warranties of any kind, express or implied, about the completeness, accuracy, reliability, suitability or availability with respect to the information contained in this document.  All rights reserved to Microsoft   

Overvie

In [28]:
# ── Raw results: PyMuPDF ─────────────────────────────────────────────────────

for doc_id in downloaded:
    title = SAMPLE_PDFS[doc_id]["title"]
    mu = results_pymupdf.get(doc_id)

    display(Markdown(f"---\n### 🔵 PyMuPDF: {title}"))

    if not mu:
        print("  ❌ No PyMuPDF result")
        continue

    # Metrics
    print(f"  Pages:      {mu['page_count']}")
    print(f"  Words:      {mu['total_words']:,}")
    print(f"  Characters: {mu['total_chars']:,}")
    print(f"  File size:  {mu['file_size']:,} bytes")
    print(f"  Time:       {timings_pymupdf.get(doc_id, 0)*1000:.1f} ms")

    # Metadata
    meta = mu.get("metadata", {})
    print(f"\n  Metadata ({len([v for v in meta.values() if v])} non-empty / {len(meta)} total):")
    for k, v in meta.items():
        icon = "✅" if v else "❌"
        print(f"    {icon} {k}: {v if v else '(empty)'}")

    # Per-page summary
    print(f"\n  Per-page breakdown:")
    print(f"    {'Page':>5}  {'Words':>7}  {'Chars':>7}  {'Images':>7}  {'Links':>6}")
    for p in mu["pages"]:
        print(f"    {p['page_num']:>5}  {p['word_count']:>7,}  {p['char_count']:>7,}  "
              f"{p.get('images', 0):>7}  {p.get('links', 0):>6}")

    # Full text
    text = mu.get("full_text", "")
    display_text = text if MAX_TEXT_DISPLAY <= 0 else text[:MAX_TEXT_DISPLAY]
    suffix = f"\n\n... [{len(text) - MAX_TEXT_DISPLAY:,} more chars] ..." if MAX_TEXT_DISPLAY > 0 and len(text) > MAX_TEXT_DISPLAY else ""
    print(f"\n  ── Text ({len(text):,} chars{f', showing first {MAX_TEXT_DISPLAY:,}' if MAX_TEXT_DISPLAY > 0 else ''}) ──")
    print(display_text + suffix)
    print()

---
### 🔵 PyMuPDF: Employee Handbook

  Pages:      11
  Words:      2,370
  Characters: 16,118
  File size:  142,977 bytes
  Time:       24.3 ms

  Metadata (6 non-empty / 10 total):
    ❌ title: (empty)
    ✅ author: python-docx
    ❌ subject: (empty)
    ❌ keywords: (empty)
    ✅ creator: Microsoft® Word for Microsoft 365
    ✅ producer: Microsoft® Word for Microsoft 365
    ✅ creation_date: D:20230306135720-08'00'
    ✅ mod_date: D:20230306135720-08'00'
    ✅ format: PDF 1.7
    ❌ encryption: (empty)

  Per-page breakdown:
     Page    Words    Chars   Images   Links
        1        4       56        1       0
        2       66      482        0       0
        3      363    2,411        0       0
        4      322    2,092        0       0
        5      288    1,912        0       0
        6      324    2,254        0       0
        7      265    1,834        0       0
        8      200    1,421        0       0
        9      214    1,456        0       0
       10      244    1,631      

---
### 🔵 PyMuPDF: Benefit Options

  Pages:      4
  Words:      507
  Characters: 3,677
  File size:  544,811 bytes
  Time:       28.4 ms

  Metadata (6 non-empty / 10 total):
    ❌ title: (empty)
    ✅ author: Liam Cavanagh
    ❌ subject: (empty)
    ❌ keywords: (empty)
    ✅ creator: Microsoft® Word for Microsoft 365
    ✅ producer: Microsoft® Word for Microsoft 365
    ✅ creation_date: D:20230306135820-08'00'
    ✅ mod_date: D:20230320130546-07'00'
    ✅ format: PDF 1.7
    ❌ encryption: (empty)

  Per-page breakdown:
     Page    Words    Chars   Images   Links
        1        6       47        1       0
        2       66      476        0       0
        3      393    2,877        0       0
        4       42      271        1       0

  ── Text (3,677 chars) ──
Contoso Electronics 
Plan and Benefit Packages


This document contains information generated using a language model (Azure OpenAI). The information 
contained in this document is only for demonstration purposes and does not

---
### 🔵 PyMuPDF: Perks Plus Program

  Pages:      4
  Words:      432
  Characters: 2,907
  File size:  115,310 bytes
  Time:       23.3 ms

  Metadata (6 non-empty / 10 total):
    ❌ title: (empty)
    ✅ author: Liam Cavanagh
    ❌ subject: (empty)
    ❌ keywords: (empty)
    ✅ creator: Microsoft® Word for Microsoft 365
    ✅ producer: Microsoft® Word for Microsoft 365
    ✅ creation_date: D:20230307103337-08'00'
    ✅ mod_date: D:20230307103337-08'00'
    ✅ format: PDF 1.7
    ❌ encryption: (empty)

  Per-page breakdown:
     Page    Words    Chars   Images   Links
        1       10      109        1       0
        2       66      482        0       0
        3      352    2,283        0       0
        4        4       27        0       0

  ── Text (2,907 chars) ──
 
 
 
PerksPlus Health and Wellness 
Reimbursement Program for 
Contoso Electronics Employees 
 
 
 
 
 
 
 


This document contains information generated using a language model (Azure OpenAI). The information 
contained in

In [29]:
# ── Raw results: pdfplumber ──────────────────────────────────────────────────

for doc_id in downloaded:
    title = SAMPLE_PDFS[doc_id]["title"]
    pb = results_pdfplumber.get(doc_id)

    display(Markdown(f"---\n### 🟠 pdfplumber: {title}"))

    if not pb:
        print("  ❌ No pdfplumber result")
        continue

    # Metrics
    print(f"  Pages:      {pb['page_count']}")
    print(f"  Words:      {pb['total_words']:,}")
    print(f"  Characters: {pb['total_chars']:,}")
    print(f"  File size:  {pb['file_size']:,} bytes")
    print(f"  Time:       {timings_pdfplumber.get(doc_id, 0)*1000:.1f} ms")

    # Metadata
    meta = pb.get("metadata", {})
    print(f"\n  Metadata ({len([v for v in meta.values() if v])} non-empty / {len(meta)} total):")
    for k, v in meta.items():
        icon = "‚úÖ" if v else "‚ùå"
        print(f"    {icon} {k}: {v if v else '(empty)'}")

    # Per-page summary
    print(f"\n  Per-page breakdown:")
    print(f"    {'Page':>5}  {'Words':>7}  {'Chars':>7}  {'Tables':>7}  {'Raw chars':>10}")
    for p in pb["pages"]:
        print(f"    {p['page_num']:>5}  {p['word_count']:>7,}  {p['char_count']:>7,}  "
              f"{p.get('tables_found', 0):>7}  {p.get('chars_count_raw', 0):>10,}")

    # Full text
    text = pb.get("full_text", "")
    display_text = text if MAX_TEXT_DISPLAY <= 0 else text[:MAX_TEXT_DISPLAY]
    suffix = f"\n\n... [{len(text) - MAX_TEXT_DISPLAY:,} more chars] ..." if MAX_TEXT_DISPLAY > 0 and len(text) > MAX_TEXT_DISPLAY else ""
    print(f"\n  ‚îÄ‚îÄ Text ({len(text):,} chars{f', showing first {MAX_TEXT_DISPLAY:,}' if MAX_TEXT_DISPLAY > 0 else ''}) ‚îÄ‚îÄ")
    print(display_text + suffix)
    print()

---
### 🟠 pdfplumber — Employee Handbook

  Pages:      11
  Words:      2,370
  Characters: 15,646
  File size:  142,977 bytes
  Time:       938.3 ms

  Metadata (5 non-empty / 8 total):
    ❌ title: (empty)
    ✅ author: python-docx
    ❌ subject: (empty)
    ❌ keywords: (empty)
    ✅ creator: Microsoft® Word for Microsoft 365
    ✅ producer: Microsoft® Word for Microsoft 365
    ✅ creation_date: D:20230306135720-08'00'
    ✅ mod_date: D:20230306135720-08'00'

  Per-page breakdown:
     Page    Words    Chars   Tables   Raw chars
        1        4       37        0          46
        2       66      470        0         474
        3      363    2,370        0       2,373
        4      322    2,045        0       2,054
        5      288    1,857        0       1,871
        6      324    2,198        0       2,211
        7      265    1,775        0       1,791
        8      200    1,364        0       1,380
        9      214    1,399        0       1,415
       10      244    1,579        0       1,

---
### 🟠 pdfplumber — Benefit Options

  Pages:      4
  Words:      614
  Characters: 4,286
  File size:  544,811 bytes
  Time:       641.8 ms

  Metadata (5 non-empty / 8 total):
    ❌ title: (empty)
    ✅ author: Liam Cavanagh
    ❌ subject: (empty)
    ❌ keywords: (empty)
    ✅ creator: Microsoft® Word for Microsoft 365
    ✅ producer: Microsoft® Word for Microsoft 365
    ✅ creation_date: D:20230306135820-08'00'
    ✅ mod_date: D:20230320130546-07'00'

  Per-page breakdown:
     Page    Words    Chars   Tables   Raw chars
        1        6       45        0          45
        2       66      470        0         470
        3      393    2,845        0       2,843
        4      149      920        0         920

  ── Text (4,286 chars) ──
Contoso Electronics
Plan and Benefit Packages

This document contains information generated using a language model (Azure OpenAI). The information
contained in this document is only for demonstration purposes and does not reflect the opinions or
beliefs of

---
### 🟠 pdfplumber — Perks Plus Program

  Pages:      4
  Words:      432
  Characters: 2,812
  File size:  115,310 bytes
  Time:       261.4 ms

  Metadata (5 non-empty / 8 total):
    ❌ title: (empty)
    ✅ author: Liam Cavanagh
    ❌ subject: (empty)
    ❌ keywords: (empty)
    ✅ creator: Microsoft® Word for Microsoft 365
    ✅ producer: Microsoft® Word for Microsoft 365
    ✅ creation_date: D:20230307103337-08'00'
    ✅ mod_date: D:20230307103337-08'00'

  Per-page breakdown:
     Page    Words    Chars   Tables   Raw chars
        1       10       85        0          96
        2       66      470        0         464
        3      352    2,229        0       2,203
        4        4       22        0          24

  ── Text (2,812 chars) ──
PerksPlus Health and Wellness
Reimbursement Program for
Contoso Electronics Employees

This document contains information generated using a language model (Azure OpenAI). The information
contained in this document is only for demonstration purposes and doe

In [38]:
# ── Raw results: PDFBox (Java via JPype) ─────────────────────────────────

for doc_id in downloaded:
    title = SAMPLE_PDFS[doc_id]["title"]
    bx = results_pdfbox.get(doc_id)

    display(Markdown(f"---\n### 🟣 PDFBox (Java) — {title}"))

    if not bx:
        print("  ❌ No PDFBox result")
        continue

    # Metrics
    print(f"  Pages:      {bx['page_count']}")
    print(f"  Words:      {bx['total_words']:,}")
    print(f"  Characters: {bx['total_chars']:,}")
    print(f"  File size:  {bx['file_size']:,} bytes")
    print(f"  Time:       {timings_pdfbox.get(doc_id, 0)*1000:.1f} ms")

    # Metadata: count the fields that carry a value, then list every field
    meta = bx.get("metadata", {})
    non_empty = sum(1 for v in meta.values() if v)
    print(f"\n  Metadata ({non_empty} non-empty / {len(meta)} total):")
    for k, v in meta.items():
        icon = "✅" if v else "❌"
        print(f"    {icon} {k}: {v if v else '(empty)'}")

    # Full text, cut to MAX_TEXT_DISPLAY characters only when that limit is set and exceeded
    text = bx.get("full_text", "")
    truncated = 0 < MAX_TEXT_DISPLAY < len(text)
    display_text = text[:MAX_TEXT_DISPLAY] if truncated else text
    suffix = f"\n\n... [{len(text) - MAX_TEXT_DISPLAY:,} more chars] ..." if truncated else ""
    shown = f", showing first {MAX_TEXT_DISPLAY:,}" if truncated else ""
    print(f"\n  ── Text ({len(text):,} chars{shown}) ──")
    print(display_text + suffix)
    print()

---
### 🟣 PDFBox (Java) — Employee Handbook

  Pages:      11
  Words:      2,370
  Characters: 16,454
  File size:  142,977 bytes
  Time:       801.7 ms

  Metadata (5 non-empty / 5 total):
    ✅ author: python-docx
    ✅ creator: Microsoft® Word for Microsoft 365
    ✅ producer: Microsoft® Word for Microsoft 365
    ✅ creation_date: Mon Mar 06 22:57:20 CET 2023
    ✅ mod_date: Mon Mar 06 22:57:20 CET 2023

  ── Text (16,454 chars) ──
Contoso Electronics 
Employee Handbook 
 
 
 
 
 
 
  
This document contains information generated using a language model (Azure OpenAI). The 
information contained in this document is only for demonstration purposes and does not 
reflect the opinions or beliefs of Microsoft. Microsoft makes no representations or 
warranties of any kind, express or implied, about the completeness, accuracy, reliability, 
suitability or availability with respect to the information contained in this document.  
All rights reserved to Microsoft 
  
Contoso Electronics Employee Handbook 
Last Updat

---
### 🟣 PDFBox (Java) — Benefit Options

  Pages:      4
  Words:      614
  Characters: 4,386
  File size:  544,811 bytes
  Time:       136.5 ms

  Metadata (5 non-empty / 5 total):
    ✅ author: Liam Cavanagh
    ✅ creator: Microsoft® Word for Microsoft 365
    ✅ producer: Microsoft® Word for Microsoft 365
    ✅ creation_date: Mon Mar 06 22:58:20 CET 2023
    ✅ mod_date: Mon Mar 20 21:05:46 CET 2023

  ── Text (4,386 chars) ──
Contoso Electronics 
Plan and Benefit Packages
This document contains information generated using a language model (Azure OpenAI). The information 
contained in this document is only for demonstration purposes and does not reflect the opinions or 
beliefs of Microsoft. Microsoft makes no representations or warranties of any kind, express or implied, 
about the completeness, accuracy, reliability, suitability or availability with respect to the information 
contained in this document. 
All rights reserved to Microsoft
Welcome to Contoso Electronics! We are excited to offer our emplo

---
### 🟣 PDFBox (Java) — Perks Plus Program

  Pages:      4
  Words:      432
  Characters: 2,940
  File size:  115,310 bytes
  Time:       6676.7 ms

  Metadata (5 non-empty / 5 total):
    ✅ author: Liam Cavanagh
    ✅ creator: Microsoft® Word for Microsoft 365
    ✅ producer: Microsoft® Word for Microsoft 365
    ✅ creation_date: Tue Mar 07 19:33:37 CET 2023
    ✅ mod_date: Tue Mar 07 19:33:37 CET 2023

  ── Text (2,940 chars) ──
 
 
 
PerksPlus Health and Wellness 
Reimbursement Program for 
Contoso Electronics Employees 
 
 
 
 
 
  
This document contains information generated using a language model (Azure OpenAI). The information 
contained in this document is only for demonstration purposes and does not reflect the opinions or 
beliefs of Microsoft. Microsoft makes no representations or warranties of any kind, express or implied, 
about the completeness, accuracy, reliability, suitability or availability with respect to the information 
contained in this document.  
All rights reserved to Microsoft 
 

## 8. Four-Way Comparison: PdfPig vs PDFBox vs PyMuPDF vs pdfplumber

Compare the extracted text and metadata across all four libraries side by side.

In [39]:
# Four-way text extraction comparison
comparison_4way = []

for doc_id in downloaded:
    title = SAMPLE_PDFS[doc_id]["title"]
    mu = results_pymupdf.get(doc_id)
    pb = results_pdfplumber.get(doc_id)
    bx = results_pdfbox.get(doc_id)

    # Get PdfPig result
    pig_result = results_pdfpig.get(doc_id, {})
    pig_crackers = [c for c in pig_result.get("crackers", []) if c.get("success")]
    pig = pig_crackers[0] if pig_crackers else None

    row = {"Document": title}

    # PdfPig (C# simulator)
    if pig:
        row["PdfPig Words"] = f"{pig.get('wordCount', 0):,}"
        row["PdfPig Chars"] = f"{pig.get('characterCount', 0):,}"
        row["PdfPig Pages"] = pig.get("pageCount", "?")
        row["PdfPig ms"] = f"{pig.get('extractionTimeMs', 0):.1f}"
    else:
        row["PdfPig Words"] = "—"
        row["PdfPig Chars"] = "—"
        row["PdfPig Pages"] = "—"
        row["PdfPig ms"] = "—"

    # PDFBox (Java)
    if bx:
        row["PDFBox Words"] = f"{bx['total_words']:,}"
        row["PDFBox Chars"] = f"{bx['total_chars']:,}"
        row["PDFBox ms"] = f"{timings_pdfbox.get(doc_id, 0)*1000:.1f}"
    else:
        row["PDFBox Words"] = "—"
        row["PDFBox Chars"] = "—"
        row["PDFBox ms"] = "—"

    # PyMuPDF
    if mu:
        row["PyMuPDF Words"] = f"{mu['total_words']:,}"
        row["PyMuPDF Chars"] = f"{mu['total_chars']:,}"
        row["PyMuPDF ms"] = f"{timings_pymupdf.get(doc_id, 0)*1000:.1f}"
    else:
        # Same placeholder as above, so a missing result does not surface as NaN
        row["PyMuPDF Words"] = "—"
        row["PyMuPDF Chars"] = "—"
        row["PyMuPDF ms"] = "—"

    # pdfplumber
    if pb:
        row["pdfplumber Words"] = f"{pb['total_words']:,}"
        row["pdfplumber Chars"] = f"{pb['total_chars']:,}"
        row["pdfplumber ms"] = f"{timings_pdfplumber.get(doc_id, 0)*1000:.1f}"
    else:
        row["pdfplumber Words"] = "—"
        row["pdfplumber Chars"] = "—"
        row["pdfplumber ms"] = "—"

    comparison_4way.append(row)

df_4way = pd.DataFrame(comparison_4way)
display(Markdown("### Text Extraction: Word & Character Count Comparison"))
display(df_4way)

### Text Extraction: Word & Character Count Comparison

| | Document | PdfPig Words | PdfPig Chars | PdfPig Pages | PdfPig ms | PDFBox Words | PDFBox Chars | PDFBox ms | PyMuPDF Words | PyMuPDF Chars | PyMuPDF ms | pdfplumber Words | pdfplumber Chars | pdfplumber ms |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Employee Handbook | 2367 | 15777 | 11 | 709.2 | 2370 | 16454 | 801.7 | 2370 | 16118 | 24.3 | 2370 | 15646 | 938.3 |
| 1 | Benefit Options | 609 | 4289 | 4 | 724.2 | 614 | 4386 | 136.5 | 507 | 3677 | 28.4 | 614 | 4286 | 641.8 |
| 2 | Perks Plus Program | 432 | 2831 | 4 | 716.5 | 432 | 2940 | 6676.7 | 432 | 2907 | 23.3 | 432 | 2812 | 261.4 |
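
The table shows one clear disagreement: for Benefit Options, PyMuPDF reports 507 words where PDFBox and pdfplumber report 614, and the per-page breakdowns above place the gap on page 4 (42 words versus 149). A check along the following lines could flag such gaps automatically. This is only a sketch: the word counts are copied by hand from the table, and the helper name `find_outliers` and the 5% tolerance are illustrative assumptions, not part of the notebook.

```python
from statistics import median

# Word counts copied from the comparison table (not recomputed here).
WORD_COUNTS = {
    "Employee Handbook": {"PdfPig": 2367, "PDFBox": 2370, "PyMuPDF": 2370, "pdfplumber": 2370},
    "Benefit Options": {"PdfPig": 609, "PDFBox": 614, "PyMuPDF": 507, "pdfplumber": 614},
    "Perks Plus Program": {"PdfPig": 432, "PDFBox": 432, "PyMuPDF": 432, "pdfplumber": 432},
}


def find_outliers(counts_by_doc, tolerance=0.05):
    """Return (document, library, relative deviation) for every count that
    differs from the per-document median by more than `tolerance`."""
    outliers = []
    for doc, counts in counts_by_doc.items():
        mid = median(counts.values())
        if mid == 0:
            continue  # nothing to compare against
        for library, count in counts.items():
            deviation = (count - mid) / mid
            if abs(deviation) > tolerance:
                outliers.append((doc, library, round(deviation, 3)))
    return outliers


for doc, library, deviation in find_outliers(WORD_COUNTS):
    print(f"{doc}: {library} deviates {deviation:+.1%} from the median word count")
```

Using the median rather than a single reference library keeps the check neutral when it is not yet known which extractor is right.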


In [40]:
# Four-way metadata comparison
meta_4way = []

for doc_id in downloaded:
    title = SAMPLE_PDFS[doc_id]["title"]
    mu = results_pymupdf.get(doc_id, {}).get("metadata", {})
    pb = results_pdfplumber.get(doc_id, {}).get("metadata", {})
    bx = results_pdfbox.get(doc_id, {}).get("metadata", {})

    pig_result = results_pdfpig.get(doc_id, {})
    pig_crackers = [c for c in pig_result.get("crackers", []) if c.get("success")]
    pig = pig_crackers[0] if pig_crackers else {}
    pig_meta = pig.get("metadata", {})

    for field_name, pig_key, bx_key, mu_key, pb_key in [
        ("Title", "title", "title", "title", "title"),
        ("Author", "author", "author", "author", "author"),
        ("Creator", "creator", "creator", "creator", "creator"),
        ("Producer", "producer", "producer", "producer", "producer"),
        ("Creation Date", "createdDate", "creation_date", "creation_date", "creation_date"),
        ("Modified Date", "modifiedDate", "mod_date", "mod_date", "mod_date"),
        ("Subject", "subject", "subject", "subject", "subject"),
        ("Keywords", "keywords", "keywords", "keywords", "keywords"),
    ]:
        pig_val = pig.get(pig_key, "") or pig_meta.get(pig_key, "") or ""
        bx_val = bx.get(bx_key, "") or ""
        mu_val = mu.get(mu_key, "") or ""
        pb_val = pb.get(pb_key, "") or ""

        meta_4way.append({
            "Document": title,
            "Field": field_name,
            "PdfPig (C#)": str(pig_val) if pig_val else "❌",
            "PDFBox (Java)": str(bx_val) if bx_val else "❌",
            "PyMuPDF": str(mu_val) if mu_val else "❌",
            "pdfplumber": str(pb_val) if pb_val else "❌",
        })

df_meta_4way = pd.DataFrame(meta_4way)
display(Markdown("### Metadata Comparison Across All Four Libraries"))
display(df_meta_4way)

### Metadata Comparison Across All Four Libraries

| | Document | Field | PdfPig (C#) | PDFBox (Java) | PyMuPDF | pdfplumber |
|---|---|---|---|---|---|---|
| 0 | Employee Handbook | Title | ❌ | ❌ | ❌ | ❌ |
| 1 | Employee Handbook | Author | python-docx | python-docx | python-docx | python-docx |
| 2 | Employee Handbook | Creator | Microsoft® Word for Microsoft 365 | Microsoft® Word for Microsoft 365 | Microsoft® Word for Microsoft 365 | Microsoft® Word for Microsoft 365 |
| 3 | Employee Handbook | Producer | Microsoft® Word for Microsoft 365 | Microsoft® Word for Microsoft 365 | Microsoft® Word for Microsoft 365 | Microsoft® Word for Microsoft 365 |
| 4 | Employee Handbook | Creation Date | 2023-03-06T13:57:20.0000000+00:00 | Mon Mar 06 22:57:20 CET 2023 | D:20230306135720-08'00' | D:20230306135720-08'00' |
| 5 | Employee Handbook | Modified Date | 2023-03-06T13:57:20.0000000+00:00 | Mon Mar 06 22:57:20 CET 2023 | D:20230306135720-08'00' | D:20230306135720-08'00' |
| 6 | Employee Handbook | Subject | ❌ | ❌ | ❌ | ❌ |
| 7 | Employee Handbook | Keywords | ❌ | ❌ | ❌ | ❌ |
| 8 | Benefit Options | Title | ❌ | ❌ | ❌ | ❌ |
| 9 | Benefit Options | Author | Liam Cavanagh | Liam Cavanagh | Liam Cavanagh | Liam Cavanagh |
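
The date rows are hard to compare by eye because each library returns a different shape: PyMuPDF and pdfplumber pass through the raw PDF date string, PDFBox prints a local-time string in CET, and PdfPig emits ISO 8601. The raw string `D:20230306135720-08'00'` and the PDFBox value `22:57:20 CET` describe the same instant (21:57:20 UTC), whereas the PdfPig value keeps the 13:57:20 wall-clock time but carries a `+00:00` offset, so it appears to have lost the original `-08'00'` offset. Normalizing the raw strings makes that kind of drift easier to spot. The sketch below is one way to do it with the standard library; `parse_pdf_date` is a hypothetical helper, not a function from the notebook or from any of the four libraries.

```python
import re
from datetime import datetime, timedelta, timezone

# D:YYYYMMDDHHmmSS followed by an optional offset such as -08'00', +01'00 or Z.
_PDF_DATE = re.compile(
    r"D:(\d{4})(\d{2})?(\d{2})?(\d{2})?(\d{2})?(\d{2})?"
    r"(?:([+\-Zz])(?:(\d{2})'?(\d{2})?'?)?)?"
)


def parse_pdf_date(value):
    """Parse a raw PDF date string into a datetime, or return None if it
    does not look like one. The result is timezone-aware when the string
    carries an offset and naive otherwise."""
    match = _PDF_DATE.fullmatch(value.strip())
    if match is None:
        return None
    year, month, day, hour, minute, second, sign, tz_h, tz_m = match.groups()
    if sign is None:
        tz = None
    elif sign in "Zz":
        tz = timezone.utc
    else:
        offset = timedelta(hours=int(tz_h or 0), minutes=int(tz_m or 0))
        tz = timezone(-offset if sign == "-" else offset)
    try:
        return datetime(
            int(year), int(month or 1), int(day or 1),
            int(hour or 0), int(minute or 0), int(second or 0),
            tzinfo=tz,
        )
    except ValueError:
        return None  # e.g. month 13 or an offset outside the valid range


created = parse_pdf_date("D:20230306135720-08'00'")
print(created.isoformat())
print(created.astimezone(timezone.utc).isoformat())
```

Converting every library's value to UTC before comparing would turn the Creation Date and Modified Date rows into a simple equality check.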


In [41]:
# Content text comparison: show the first COMPARE_CHARS characters from each library
COMPARE_CHARS = 500

for doc_id in downloaded:
    title = SAMPLE_PDFS[doc_id]["title"]
    display(Markdown(f"---\n### 📄 {title} — First {COMPARE_CHARS} chars from each library"))

    # PdfPig
    pig_result = results_pdfpig.get(doc_id, {})
    pig_crackers = [c for c in pig_result.get("crackers", []) if c.get("success")]
    pig_text = pig_crackers[0].get("content", "") if pig_crackers else ""

    # PDFBox
    bx_text = results_pdfbox.get(doc_id, {}).get("full_text", "")

    # PyMuPDF
    mu_text = results_pymupdf.get(doc_id, {}).get("full_text", "")

    # pdfplumber
    pb_text = results_pdfplumber.get(doc_id, {}).get("full_text", "")

    display(Markdown("**PdfPig (C# / simulator):**"))
    print(pig_text[:COMPARE_CHARS])
    print()

    display(Markdown("**PDFBox (Java / Azure Search engine):**"))
    print(bx_text[:COMPARE_CHARS])
    print()

    display(Markdown("**PyMuPDF (Python):**"))
    print(mu_text[:COMPARE_CHARS])
    print()

    display(Markdown("**pdfplumber (Python):**"))
    print(pb_text[:COMPARE_CHARS])
    print()

---
### 📄 Employee Handbook — First 500 chars from each library

**PdfPig (C# / simulator):**

Contoso Electronics Employee Handbook         

This document contains information generated using a language model (Azure OpenAI). The information contained in this document is only for demonstration purposes and does not reflect the opinions or beliefs of Microsoft. Microsoft makes no representations or warranties of any kind, express or implied, about the completeness, accuracy, reliability, suitability or availability with respect to the information contained in this document.  All rights 



**PDFBox (Java / Azure Search engine):**

Contoso Electronics 
Employee Handbook 
 
 
 
 
 
 
  
This document contains information generated using a language model (Azure OpenAI). The 
information contained in this document is only for demonstration purposes and does not 
reflect the opinions or beliefs of Microsoft. Microsoft makes no representations or 
warranties of any kind, express or implied, about the completeness, accuracy, reliability, 
suitability or availability with respect to the information contained in this 



**PyMuPDF (Python):**

Contoso Electronics 
Employee Handbook 
 
 
 
 
 
 
 
 


This document contains information generated using a language model (Azure OpenAI). The 
information contained in this document is only for demonstration purposes and does not 
reflect the opinions or beliefs of Microsoft. Microsoft makes no representations or 
warranties of any kind, express or implied, about the completeness, accuracy, reliability, 
suitability or availability with respect to the information contained in this document. 



**pdfplumber (Python):**

Contoso Electronics
Employee Handbook

This document contains information generated using a language model (Azure OpenAI). The
information contained in this document is only for demonstration purposes and does not
reflect the opinions or beliefs of Microsoft. Microsoft makes no representations or
warranties of any kind, express or implied, about the completeness, accuracy, reliability,
suitability or availability with respect to the information contained in this document.
All rights reserved to 



---
### 📄 Benefit Options — First 500 chars from each library

**PdfPig (C# / simulator):**

Contoso Electronics Plan and Benefit Packages

This document contains information generated using a language model (Azure OpenAI). The information contained in this document is only for demonstration purposes and does not reflect the opinions or beliefs of Microsoft. Microsoft makes no representations or warranties of any kind, express or implied, about the completeness, accuracy, reliability, suitability or availability with respect to the information contained in this document. All rights re



**PDFBox (Java / Azure Search engine):**

Contoso Electronics 
Plan and Benefit Packages
This document contains information generated using a language model (Azure OpenAI). The information 
contained in this document is only for demonstration purposes and does not reflect the opinions or 
beliefs of Microsoft. Microsoft makes no representations or warranties of any kind, express or implied, 
about the completeness, accuracy, reliability, suitability or availability with respect to the information 
contained in this document. 
All



**PyMuPDF (Python):**

Contoso Electronics 
Plan and Benefit Packages


This document contains information generated using a language model (Azure OpenAI). The information 
contained in this document is only for demonstration purposes and does not reflect the opinions or 
beliefs of Microsoft. Microsoft makes no representations or warranties of any kind, express or implied, 
about the completeness, accuracy, reliability, suitability or availability with respect to the information 
contained in this document. 
All righ



**pdfplumber (Python):**

Contoso Electronics
Plan and Benefit Packages

This document contains information generated using a language model (Azure OpenAI). The information
contained in this document is only for demonstration purposes and does not reflect the opinions or
beliefs of Microsoft. Microsoft makes no representations or warranties of any kind, express or implied,
about the completeness, accuracy, reliability, suitability or availability with respect to the information
contained in this document.
All rights rese



---
### 📄 Perks Plus Program — First 500 chars from each library

**PdfPig (C# / simulator):**

PerksPlus Health and Wellness Reimbursement Program for Contoso Electronics Employees        

This document contains information generated using a language model (Azure OpenAI). The information contained in this document is only for demonstration purposes and does not reflect the opinions or beliefs of Microsoft. Microsoft makes no representations or warranties of any kind, express or implied, about the completeness, accuracy, reliability, suitability or availability with respect to the infor



**PDFBox (Java / Azure Search engine):**

 
 
 
PerksPlus Health and Wellness 
Reimbursement Program for 
Contoso Electronics Employees 
 
 
 
 
 
  
This document contains information generated using a language model (Azure OpenAI). The information 
contained in this document is only for demonstration purposes and does not reflect the opinions or 
beliefs of Microsoft. Microsoft makes no representations or warranties of any kind, express or implied, 
about the completeness, accuracy, reliability, suitability or availabil



**PyMuPDF (Python):**

 
 
 
PerksPlus Health and Wellness 
Reimbursement Program for 
Contoso Electronics Employees 
 
 
 
 
 
 
 


This document contains information generated using a language model (Azure OpenAI). The information 
contained in this document is only for demonstration purposes and does not reflect the opinions or 
beliefs of Microsoft. Microsoft makes no representations or warranties of any kind, express or implied, 
about the completeness, accuracy, reliability, suitability or availability with res



**pdfplumber (Python):**

PerksPlus Health and Wellness
Reimbursement Program for
Contoso Electronics Employees

This document contains information generated using a language model (Azure OpenAI). The information
contained in this document is only for demonstration purposes and does not reflect the opinions or
beliefs of Microsoft. Microsoft makes no representations or warranties of any kind, express or implied,
about the completeness, accuracy, reliability, suitability or availability with respect to the information
con
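
In the previews above the four libraries mostly agree on the words and differ in whitespace: PdfPig joins lines with spaces, PDFBox and PyMuPDF keep trailing spaces and runs of blank lines, and pdfplumber strips both. This also explains why the character counts differ even when the word counts match. To compare content rather than layout, the text can be whitespace-normalized first. The following is a minimal sketch using only the standard library; `normalize_whitespace` and `text_similarity` are hypothetical helpers, and the two sample strings are shortened from the Employee Handbook previews.

```python
import re
from difflib import SequenceMatcher


def normalize_whitespace(text: str) -> str:
    """Collapse every run of whitespace (spaces, tabs, newlines) into one space."""
    return re.sub(r"\s+", " ", text).strip()


def text_similarity(a: str, b: str, normalize: bool = True) -> float:
    """Return a similarity ratio between 0 and 1 for two extracted texts."""
    if normalize:
        a, b = normalize_whitespace(a), normalize_whitespace(b)
    # autojunk=False disables difflib's popularity heuristic, which can
    # otherwise skew the ratio on longer texts.
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


# Shortened samples in the style of the PDFBox and pdfplumber previews.
pdfbox_sample = "Contoso Electronics \nEmployee Handbook \n \n \n \nThis document contains"
pdfplumber_sample = "Contoso Electronics\nEmployee Handbook\n\nThis document contains"

raw = text_similarity(pdfbox_sample, pdfplumber_sample, normalize=False)
clean = text_similarity(pdfbox_sample, pdfplumber_sample)
print(f"raw: {raw:.3f}  normalized: {clean:.3f}")
```

`SequenceMatcher` can be slow on long inputs, so for full documents it may be worth comparing page by page rather than on the whole `full_text`.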

