# Test PDF to Markdown Conversion

This notebook tests converting a World Bank Project Appraisal Document (PAD) PDF to Markdown using docling.

**Pipeline step:** PDF → Markdown (bronze → silver)

**Note:** For production use, the conversion logic lives in `src/pdf_conversion/`.
Run conversions from the command line:
```bash
uv run python -m src.pdf_conversion.cli           # Convert all PDFs
uv run python -m src.pdf_conversion.cli --pdf test.pdf  # Convert specific PDF
uv run python -m src.pdf_conversion.cli --overwrite     # Overwrite existing files
```

## 1. Import Required Libraries

In [None]:
import sys
from pathlib import Path

# Make the project root importable so the `src` package resolves
sys.path.append(str(Path.cwd().parent))

# Import our config and converter
from src.config import load_config
from src.pdf_conversion.converter import PDFConverter

## 2. Load Configuration and Set Paths

In [9]:
# Load config
config = load_config()

# Get project root and data paths
project_root = Path.cwd().parent
pdf_dir = project_root / config.paths.raw_pdfs
md_dir = project_root / config.paths.markdown

print(f"PDF directory: {pdf_dir}")
print(f"Markdown directory: {md_dir}")
print(f"PDF dir exists: {pdf_dir.exists()}")
print(f"Markdown dir exists: {md_dir.exists()}")

PDF directory: /Users/lauren/repos/PAD2Skills/data/bronze/pads_pdf
Markdown directory: /Users/lauren/repos/PAD2Skills/data/silver/pads_md
PDF dir exists: True
Markdown dir exists: True
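
The cell above derives `project_root` from `Path.cwd().parent`, which holds only when the notebook is launched from a directory one level below the repository root. A minimal sketch of a more defensive lookup follows. The helper name `find_project_root` and the `pyproject.toml` marker file are assumptions for illustration and are not part of `src/`.

```python
from pathlib import Path


def find_project_root(start: Path, marker: str = "pyproject.toml") -> Path:
    """Walk upward from `start` until a directory containing `marker` is found.

    Hypothetical helper: the marker file name is an assumption about the repo layout.
    """
    start = start.resolve()
    # Check the starting directory first, then each ancestor in turn
    for candidate in (start, *start.parents):
        if (candidate / marker).exists():
            return candidate
    raise FileNotFoundError(f"No {marker!r} found at or above {start}")
```

Under these assumptions, `project_root = find_project_root(Path.cwd())` would resolve to the same directory from any working directory inside the repository.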


## 3. Locate Test PDF File

Check for available PDFs in the bronze directory.

In [10]:
# List available PDFs (sorted so the choice of test file is deterministic)
pdf_files = sorted(pdf_dir.glob("*.pdf"))

if pdf_files:
    print(f"Found {len(pdf_files)} PDF(s):")
    for pdf in pdf_files:
        print(f"  - {pdf.name} ({pdf.stat().st_size / 1024:.1f} KB)")
    
    # Use first PDF for testing
    test_pdf = pdf_files[0]
    print(f"\nUsing test PDF: {test_pdf.name}")
else:
    print("No PDFs found! Please add a test PDF to:", pdf_dir)
    test_pdf = None

Found 1 PDF(s):
  - test-BOSIB-e7b1cb57-9078-490e-8988-30bc9fe608c0.pdf (2290.6 KB)

Using test PDF: test-BOSIB-e7b1cb57-9078-490e-8988-30bc9fe608c0.pdf


## 4. Convert PDF to Markdown

Use `PDFConverter` from `src/pdf_conversion/` to convert the PDF.

In [11]:
if test_pdf:
    # Initialize converter with accurate table extraction
    converter = PDFConverter(accurate_tables=True)
    
    print(f"Converting {test_pdf.name}...")
    markdown_content = converter.convert_pdf(test_pdf)
    
    print("✓ Conversion complete!")
    print(f"Markdown length: {len(markdown_content)} characters")
else:
    print("⚠ Skipping conversion - no test PDF available")

2025-12-26 14:17:13,197 - INFO - detected formats: [<InputFormat.PDF: 'pdf'>]
2025-12-26 14:17:13,291 - INFO - Going to convert document batch...
2025-12-26 14:17:13,292 - INFO - Initializing pipeline for StandardPdfPipeline with options hash e15bc6f248154cc62f8db15ef18a8ab7
2025-12-26 14:17:13,293 - INFO - Auto OCR model selected ocrmac.
2025-12-26 14:17:13,295 - INFO - Accelerator device: 'cpu'


Converting test-BOSIB-e7b1cb57-9078-490e-8988-30bc9fe608c0.pdf...


2025-12-26 14:17:14,018 - INFO - Accelerator device: 'cpu'
2025-12-26 14:17:14,830 - INFO - Processing document test-BOSIB-e7b1cb57-9078-490e-8988-30bc9fe608c0.pdf
2025-12-26 14:21:50,325 - INFO - Finished converting document test-BOSIB-e7b1cb57-9078-490e-8988-30bc9fe608c0.pdf in 277.13 sec.


✓ Conversion complete!
Markdown length: 486089 characters


## 5. Display Conversion Results

Show a preview of the converted Markdown content.

In [12]:
if test_pdf and markdown_content:
    # Display the first `preview_length` characters
    preview_length = 2000
    print("=" * 80)
    print(f"MARKDOWN PREVIEW (first {preview_length} chars):")
    print("=" * 80)
    print(markdown_content[:preview_length])
    if len(markdown_content) > preview_length:
        print(f"\n... ({len(markdown_content) - preview_length} more characters)")

MARKDOWN PREVIEW (first 2000 chars):
Public Disclosure Authorized

Public Disclosure Authorized

Public Disclosure Authorized

Public Disclosure Authorized

IBRD - IDA | WORLD BANK GROUP

<!-- image -->

OF WHICH EUR 92.9 MILLION (US$105.2 MILLION EQUIVALENT) FROI

## INTERNATIONAL DEVELOPMENT ASSOCIATION PROJECT APPRAISAL DOCUMENT

ON A

## PROPOSED CREDIT

IN THE AMOUNT OF US$132.3 MILLION OF WHICH EUR 92.9 MILLION (US$105.2 MILLION EQUIVALENT) FROM THE SCALE-UP WINDOW AND US$24.0 MILLION FROM THE SCALE-UP WINDOW-SHORTER MATURITY LOAN

TO THE

REPUBLIC OF GUINEA

Energy and Extractives Global Practice

Western and Central Africa Region

FOR A

GUINEA ELECTRICITY ACCESS SCALE UP PROJECT-PHASE 2

JUNE 6, 2025

ON A

Energy and Extractives Global Practice Western and Central Africa Region

This document has a restricted distribution and may be used by recipients only in the performance of their official duties. Its contents may not otherwise be disclosed without World Bank authorization

## 6. Validate Markdown Output

Run basic heuristic checks that the conversion captured the expected elements (headers, paragraphs, tables).

In [13]:
if test_pdf and markdown_content:
    # Basic validation checks
    checks = {
        "Has content": len(markdown_content) > 100,
        "Contains headers": "#" in markdown_content,
        "Contains paragraphs": "\n\n" in markdown_content,
        "Contains tables": "|" in markdown_content or "Table" in markdown_content,
    }
    
    print("Validation Results:")
    print("-" * 40)
    for check, passed in checks.items():
        status = "✓" if passed else "✗"
        print(f"{status} {check}")
    
    # Count lines
    lines = markdown_content.split("\n")
    print(f"\nTotal lines: {len(lines)}")

Validation Results:
----------------------------------------
✓ Has content
✓ Contains headers
✓ Contains paragraphs
✓ Contains tables

Total lines: 2381
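
The checks above are plain substring tests: `"#"` matches any hash character and `"Table"` matches the word in running prose, so they can pass on output that contains no headings or tables at all. A stricter sketch using line-anchored patterns is shown below. The function name `summarize_markdown` is hypothetical, and the table pattern assumes the converter emits pipe-delimited Markdown tables, which the preview above does not confirm.

```python
import re


def summarize_markdown(text: str) -> dict[str, int]:
    """Count structural elements using line-anchored patterns (illustrative only)."""
    return {
        # ATX headings: 1-6 '#' characters at the start of a line, then a space
        "headings": len(re.findall(r"^#{1,6} \S", text, flags=re.MULTILINE)),
        # Pipe-table rows: lines that both start and end with '|' (assumed table style)
        "table_rows": len(re.findall(r"^\|.*\|[ \t]*$", text, flags=re.MULTILINE)),
        # Image placeholders in the form seen in the preview output
        "image_placeholders": text.count("<!-- image -->"),
    }
```

Calling `summarize_markdown(markdown_content)` returns counts rather than booleans, which makes it easier to spot a conversion that produced, for example, headings but no table rows.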


## 7. Save to Silver Directory (Optional)

Write the converted Markdown to the silver directory, using the source PDF's stem as the filename.

In [14]:
if test_pdf and markdown_content:
    # Create output filename
    output_file = md_dir / f"{test_pdf.stem}.md"
    
    # Ensure directory exists
    md_dir.mkdir(parents=True, exist_ok=True)
    
    # Save markdown
    output_file.write_text(markdown_content, encoding="utf-8")
    
    print(f"✓ Saved markdown to: {output_file}")
    print(f"  File size: {output_file.stat().st_size / 1024:.1f} KB")

✓ Saved markdown to: /Users/lauren/repos/PAD2Skills/data/silver/pads_md/test-BOSIB-e7b1cb57-9078-490e-8988-30bc9fe608c0.md
  File size: 475.0 KB
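
The cell above always overwrites an existing output file, whereas the CLI exposes an `--overwrite` flag. The sketch below shows one way a notebook cell could guard against accidental overwrites in the same spirit. The helper `save_markdown` is hypothetical, and how `src/pdf_conversion/` actually implements the flag is not shown in this notebook.

```python
from pathlib import Path


def save_markdown(text: str, output_file: Path, overwrite: bool = False) -> bool:
    """Write `text` to `output_file`; return True if written, False if skipped."""
    # Skip existing files unless the caller explicitly asks to replace them
    if output_file.exists() and not overwrite:
        return False
    output_file.parent.mkdir(parents=True, exist_ok=True)
    output_file.write_text(text, encoding="utf-8")
    return True
```

Given that a single conversion in this run took about 277 seconds, skipping files that already exist avoids repeating that work unintentionally.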
