# PDF to Image Test

This notebook tests converting PDF pages to images and extracting text from those images with OpenAI. The approach turned out to be too expensive.

**Pipeline step:** PDF → Images (bronze/pads_pdf → bronze/images)

## Prerequisites

This notebook requires the `poppler` system library for PDF processing.

**Install system dependencies:**

```bash
# macOS
brew install poppler

# Ubuntu/Debian
sudo apt-get install poppler-utils

# Windows
# Download from https://github.com/oschwartz10612/poppler-windows/releases
```

**Install Python packages** (if not already installed):
```bash
uv add pdf2image pillow matplotlib
uv sync
```
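
Before running the conversion cells, it can help to confirm that Poppler is actually reachable. `pdf2image` shells out to Poppler's `pdftoppm` and `pdfinfo` command-line tools, so a minimal check, assuming they are expected on `PATH`, looks like the sketch below. The helper name `check_poppler` is hypothetical and not part of this project.

```python
import shutil

# Poppler command-line tools that pdf2image invokes under the hood.
POPPLER_TOOLS = ("pdftoppm", "pdfinfo")


def check_poppler(tools=POPPLER_TOOLS):
    """Return a {tool_name: found_on_PATH} mapping for each tool."""
    return {tool: shutil.which(tool) is not None for tool in tools}


status = check_poppler()
for tool, found in status.items():
    print(f"{tool}: {'found' if found else 'MISSING'}")
```

If a tool is reported missing on Windows, `convert_from_path` also accepts a `poppler_path` argument pointing at the folder that contains the Poppler binaries.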

## 1. Import Required Libraries

In [1]:
from pathlib import Path
from pdf2image import convert_from_path

# Import our config
import sys
sys.path.append(str(Path.cwd().parent))
from src.config import load_config

## 2. Load Configuration and Set Paths

In [2]:
# Load config
config = load_config()

# Get project root and data paths
project_root = Path.cwd().parent
pdf_dir = project_root / config.paths.raw_pdfs
images_dir = project_root / "data" / "bronze" / "images"

print(f"PDF directory: {pdf_dir}")
print(f"Images directory: {images_dir}")
print(f"PDF dir exists: {pdf_dir.exists()}")
print(f"Images dir exists: {images_dir.exists()}")

PDF directory: /Users/lauren/repos/PAD2Skills/data/bronze/pads_pdf
Images directory: /Users/lauren/repos/PAD2Skills/data/bronze/images
PDF dir exists: True
Images dir exists: True


## 3. Locate Test PDF File

Find the same test PDF used in the docling conversion.

In [3]:
# List available PDFs (sorted so that "first PDF" is deterministic)
pdf_files = sorted(pdf_dir.glob("*.pdf"))

if pdf_files:
    print(f"Found {len(pdf_files)} PDF(s):")
    for pdf in pdf_files:
        print(f"  - {pdf.name} ({pdf.stat().st_size / 1024:.1f} KB)")
    
    # Use first PDF for testing
    test_pdf = pdf_files[0]
    print(f"\nUsing test PDF: {test_pdf.name}")
else:
    print("No PDFs found! Please add a test PDF to:", pdf_dir)
    test_pdf = None

Found 1 PDF(s):
  - test-BOSIB-e7b1cb57-9078-490e-8988-30bc9fe608c0.pdf (2290.6 KB)

Using test PDF: test-BOSIB-e7b1cb57-9078-490e-8988-30bc9fe608c0.pdf


## 4. Create Output Directory

Create a subfolder named after the PDF to store its page images.

In [4]:
if test_pdf:
    # Create subfolder with PDF name (without extension)
    pdf_images_dir = images_dir / test_pdf.stem
    pdf_images_dir.mkdir(parents=True, exist_ok=True)
    
    print(f"Output directory: {pdf_images_dir}")
    print(f"Directory created: {pdf_images_dir.exists()}")
else:
    print("⚠ Skipping - no test PDF available")
    pdf_images_dir = None

Output directory: /Users/lauren/repos/PAD2Skills/data/bronze/images/test-BOSIB-e7b1cb57-9078-490e-8988-30bc9fe608c0
Directory created: True


## 5. Convert PDF Pages to Images

Extract each page as a PNG image with 3-digit naming (001.png, 002.png, etc.).

In [5]:
if test_pdf and pdf_images_dir:
    print(f"Converting {test_pdf.name} to images...")
    
    # Convert PDF to list of PIL images
    # dpi=300 provides good quality for text recognition
    images = convert_from_path(str(test_pdf), dpi=300)
    
    print(f"Found {len(images)} page(s)")
    
    # Save each page as an image
    for page_num, image in enumerate(images, start=1):
        # Create filename with 3-digit zero-padding
        output_file = pdf_images_dir / f"{page_num:03d}.png"
        
        # Save as PNG
        image.save(output_file, "PNG")
        
        print(f"  ✓ Saved page {page_num:3d} -> {output_file.name}")
    
    print(f"\n✓ Conversion complete! Images saved to: {pdf_images_dir}")
else:
    print("⚠ Skipping conversion - no test PDF available")

Converting test-BOSIB-e7b1cb57-9078-490e-8988-30bc9fe608c0.pdf to images...
Found 75 page(s)
  ✓ Saved page   1 -> 001.png
  ✓ Saved page   2 -> 002.png
  ✓ Saved page   3 -> 003.png
  ✓ Saved page   4 -> 004.png
  ✓ Saved page   5 -> 005.png
  ✓ Saved page   6 -> 006.png
  ✓ Saved page   7 -> 007.png
  ✓ Saved page   8 -> 008.png
  ✓ Saved page   9 -> 009.png
  ✓ Saved page  10 -> 010.png
  ✓ Saved page  11 -> 011.png
  ✓ Saved page  12 -> 012.png
  ✓ Saved page  13 -> 013.png
  ✓ Saved page  14 -> 014.png
  ✓ Saved page  15 -> 015.png
  ✓ Saved page  16 -> 016.png
  ✓ Saved page  17 -> 017.png
  ✓ Saved page  18 -> 018.png
  ✓ Saved page  19 -> 019.png
  ✓ Saved page  20 -> 020.png
  ✓ Saved page  21 -> 021.png
  ✓ Saved page  22 -> 022.png
  ✓ Saved page  23 -> 023.png
  ✓ Saved page  24 -> 024.png
  ✓ Saved page  25 -> 025.png
  ✓ Saved page  26 -> 026.png
  ✓ Saved page  27 -> 027.png
  ✓ Saved page  28 -> 028.png
  ✓ Saved page  29 -> 029.png
  ✓ Saved page  30 -> 030.png
  ✓ Sav
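
The cell above loads every page into memory before saving any of them. For larger PDFs, one alternative is to convert a few pages at a time using the `first_page` and `last_page` arguments of `convert_from_path`. The following is a minimal sketch under that assumption; `page_batches` and `convert_in_batches` are hypothetical helpers, and the batch size of 10 is an arbitrary choice.

```python
from pathlib import Path


def page_batches(total_pages, batch_size):
    """Yield 1-based, inclusive (first_page, last_page) ranges."""
    if total_pages < 0 or batch_size < 1:
        raise ValueError("total_pages must be >= 0 and batch_size >= 1")
    for first in range(1, total_pages + 1, batch_size):
        yield first, min(first + batch_size - 1, total_pages)


def convert_in_batches(pdf_path, out_dir, dpi=300, batch_size=10):
    """Convert a PDF to PNGs while holding at most batch_size pages in memory."""
    # Imported lazily so page_batches() is usable without pdf2image installed.
    from pdf2image import convert_from_path, pdfinfo_from_path

    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    # pdfinfo_from_path wraps Poppler's pdfinfo; "Pages" holds the page count.
    total_pages = pdfinfo_from_path(str(pdf_path))["Pages"]

    for first, last in page_batches(total_pages, batch_size):
        images = convert_from_path(
            str(pdf_path), dpi=dpi, first_page=first, last_page=last
        )
        # Number files by absolute page number, matching the 001.png scheme.
        for page_num, image in enumerate(images, start=first):
            image.save(out_dir / f"{page_num:03d}.png", "PNG")

    return total_pages


print(list(page_batches(75, 30)))  # [(1, 30), (31, 60), (61, 75)]
```

Usage would be `convert_in_batches(test_pdf, pdf_images_dir)`, which produces the same file names as the cell above.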

## 6. Verify Output

Check the created images and display summary information.

In [None]:
if pdf_images_dir and pdf_images_dir.exists():
    # List all PNG files
    image_files = sorted(pdf_images_dir.glob("*.png"))
    
    print(f"Images in {pdf_images_dir.name}/:")
    print("-" * 60)
    
    total_size = 0
    for img_file in image_files:
        size_kb = img_file.stat().st_size / 1024
        total_size += size_kb
        print(f"  {img_file.name}: {size_kb:,.1f} KB")
    
    print("-" * 60)
    print(f"Total: {len(image_files)} images, {total_size:,.1f} KB ({total_size/1024:.2f} MB)")
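
Beyond listing sizes, the verification step can also confirm that no page is missing from the numbered sequence. This is a small sketch assuming the 3-digit `001.png` naming used above; `missing_pages` is a hypothetical helper, and the demo runs against a temporary directory rather than the real output folder.

```python
import tempfile
from pathlib import Path


def missing_pages(image_dir, expected_pages):
    """Return the page numbers (1-based) that have no NNN.png file."""
    present = {p.stem for p in Path(image_dir).glob("*.png")}
    return [n for n in range(1, expected_pages + 1) if f"{n:03d}" not in present]


# Demo on a throwaway directory: pages 1 and 3 exist, page 2 does not.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("001.png", "003.png"):
        (Path(tmp) / name).touch()
    print(missing_pages(tmp, 3))  # [2]
```

In the notebook this could be called as `missing_pages(pdf_images_dir, len(images))`.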

## 7. Display First Page Preview (Optional)

Show a preview of the first page to verify image quality.

In [None]:
if pdf_images_dir and pdf_images_dir.exists():
    from PIL import Image
    import matplotlib.pyplot as plt
    
    # Load first image
    first_image = sorted(pdf_images_dir.glob("*.png"))[0]
    img = Image.open(first_image)
    
    # Display
    plt.figure(figsize=(10, 14))
    plt.imshow(img)
    plt.axis('off')
    plt.title(f"Preview: {first_image.name}\nSize: {img.size[0]}x{img.size[1]} pixels")
    plt.tight_layout()
    plt.show()
    
    print(f"Image dimensions: {img.size[0]} x {img.size[1]} pixels")
    print(f"Image mode: {img.mode}")
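
The pixel dimensions printed by the preview follow from the page size and the DPI: PDF page sizes are measured in points (72 per inch), so pixels ≈ points / 72 × dpi. The sketch below works through that arithmetic for a US Letter page (612 × 792 points), which is an assumption for illustration; the test PDF's actual page size is not stated here, and Poppler's rounding may differ by a pixel.

```python
POINTS_PER_INCH = 72  # PDF page sizes are expressed in points


def pixels_at_dpi(width_pt, height_pt, dpi):
    """Approximate rendered size in pixels for a page given in points."""
    scale = dpi / POINTS_PER_INCH
    return round(width_pt * scale), round(height_pt * scale)


def rgb_bytes(width_px, height_px):
    """Uncompressed size of an 8-bit RGB image (3 bytes per pixel)."""
    return width_px * height_px * 3


width, height = pixels_at_dpi(612, 792, 300)  # US Letter at 300 dpi
print((width, height))  # (2550, 3300)

per_page = rgb_bytes(width, height)
print(f"{per_page / 2**20:.1f} MiB per page in memory")
print(f"{per_page * 75 / 2**30:.2f} GiB for 75 pages held at once")
```

Under these assumptions a single page is roughly 24 MiB uncompressed, which is why holding all pages of a long document in memory at 300 dpi adds up quickly.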

In [11]:
import json

raw = r'''
{
"pages": [
{
"file_name": "018.png",
"page_number": "7",
"markdown": "# A. Project Development Objective\n\n18. The proposed Project Development Objective (PDO) is to: (i) increase grid and non-grid access to electricity services in selected areas; and (ii) improve the operational performance of the power utility.\n\n# B. Theory of Change and PDO Indicators\n\nFigure 1: Theory of Change for the GNEAP-2 Project\n\nProject Development Objective:\n(i) increase electricity services through grid and off-grid access in selected areas; and (ii) improve the operational performance of the power utility\n\n| ACTIVITIES                                   | OUTPUTS                                                                                   | OUTCOMES                                                                                                                      | LONG-TERM IMPACT                                                                                   |\n|----------------------------------------------|------------------------------------------------------------------------------------------|-----------------------------------------------------------------------------------------------------------------------------|----------------------------------------------------------------------------------------------------|\n| Infrastructure investments:                   | • New and improved residential connections in urban and rural areas                      | Increased electricity services:                                                                                          | • Expanded, reliable electricity access for households and businesses in both rural and urban areas  |\n|                                              | • New and improved connections for SMEs and business                                    | • 960,022 people provided with direct access to electricity through new connections                                          | • Improved quality and stability of electricity supply, with fewer outages and system losses        |\n| • Reinforcement and densification of the grid, construction of MV lines\n• Privately operated mini-grid electrification in rural areas | • New MV lines constructed  • Hybrid mini-grids deployed in rural localities          | • 570,073 people provided with access to electricity through improved connections                                             | • Improved investment climate for private sector participation in the energy sector                 |\n| Reforms & operational improvements:           | • Sector recovery plan implemented                                                      | • US$11.7 million of private capital mobilized for hybrid mini-grid deployment                                               | • Operational sustainability of EDG strengthened through improved revenue collection and reduced reliance on subsidies       |\n| • Digitalization and operational improvements of the power utility\n• Rollout of prepaid meters in public institutions\n• Assistance for financial modeling and utility management | • Performance contract between EDG and the Government signed and launched  • Increased financial transparency at EDG  • Prepaid meters installed in non-strategic public institutions  • Smart meters deployed for large customers  • EDG’s Integrated Management System (IMS) deployed |                                                                                                                             | • Reduction in illegal connections and revenue leakage, reinforcing long-term utility viability     |\n| Implementation support:                       | • Increased implementation and technical capacity                                       | Improved operational performance of the power utility:\n  • Operational cost recovery ratio of the power utility increased from 20 to 70 percent                                 | • Strengthened institutional capacity for integrated, long-term energy sector planning and governance                        |\n| • Capacity building on gender inclusion, safeguards, and implementation support\n• Technical studies and activities supervision | • Improved gender inclusion in the energy sector                                      |                                                                                                                             |                                                                                                    |\n\nCritical assumptions: (1) Government maintains commitment to tariff reform and financial sustainability. (2) EDG successfully implements digitalization and operational efficiency improvements. (3) Private sector remains engaged in rural sector electrification under viable business models. (4) Climate risks do not significantly disrupt infrastructure implementation.\n\n19. The following indicators have been selected to measure progress toward the PDO:\n\n(a) People provided with direct access to electricity through new connections (disaggregated by female, youth, urban areas, rural areas) (number; CRI).\n\n(b) People provided with access to electricity through improved connections (disaggregated by female and youth) (number).\n\n(c) Private capital mobilized (PCM) (US dollars).\n\n(d) Power utility operational cost recovery ratio (percentage).",
"footnotes": "30 The Government of Guinea has adopted a new methodology for tariffs adjustments which will result in tariff increases scheduled for end-2025.\n\n31 Operational cost recovery ratio is defined as total revenues collected (excluding government subsidies) over total operating expenses (total OPEX). It reflects the extent to which a utility (in this case, EDG) can cover its operating costs through revenues collected from electricity sales, without reliance on government subsidies or external financing. A ratio below 100 percent means EDG relies on subsidies or incurs losses."
}
]
}
'''.strip()

data = json.loads(raw)
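
Since the JSON comes from a model response, it may be worth checking its shape before indexing into it. The sketch below assumes every page object carries the four keys seen in the sample above (`file_name`, `page_number`, `markdown`, `footnotes`); `parse_pages` is a hypothetical helper, not an existing project function.

```python
import json

# Keys observed in the sample response; treated here as required.
REQUIRED_KEYS = {"file_name", "page_number", "markdown", "footnotes"}


def parse_pages(raw_text):
    """Parse a response string and return its validated list of page dicts."""
    data = json.loads(raw_text)
    if not isinstance(data, dict) or not isinstance(data.get("pages"), list):
        raise ValueError('expected a JSON object with a "pages" list')

    for index, page in enumerate(data["pages"]):
        if not isinstance(page, dict):
            raise ValueError(f"page {index} is not a JSON object")
        missing = REQUIRED_KEYS - page.keys()
        if missing:
            raise ValueError(f"page {index} is missing keys: {sorted(missing)}")
    return data["pages"]


sample = r'''
{"pages": [{"file_name": "018.png", "page_number": "7",
            "markdown": "# Title\n\nBody", "footnotes": ""}]}
'''
pages = parse_pages(sample)
print(pages[0]["file_name"])  # 018.png
```

Note that `json.loads` rejects literal line breaks inside a string value by default, so each `"markdown"` value has to stay on one line with escaped `\n` sequences.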


In [12]:
# Example usage
page0 = data["pages"][0]
print(page0["markdown"])  # this prints with real newlines


# A. Project Development Objective

18. The proposed Project Development Objective (PDO) is to: (i) increase grid and non-grid access to electricity services in selected areas; and (ii) improve the operational performance of the power utility.

# B. Theory of Change and PDO Indicators

Figure 1: Theory of Change for the GNEAP-2 Project

Project Development Objective:
(i) increase electricity services through grid and off-grid access in selected areas; and (ii) improve the operational performance of the power utility

| ACTIVITIES                                   | OUTPUTS                                                                                   | OUTCOMES                                                                                                                      | LONG-TERM IMPACT                                                                                   |
|----------------------------------------------|-----------------------------------------------------------

In [14]:
print(page0["footnotes"])  # this prints with real newlines

30 The Government of Guinea has adopted a new methodology for tariffs adjustments which will result in tariff increases scheduled for end-2025.

31 Operational cost recovery ratio is defined as total revenues collected (excluding government subsidies) over total operating expenses (total OPEX). It reflects the extent to which a utility (in this case, EDG) can cover its operating costs through revenues collected from electricity sales, without reliance on government subsidies or external financing. A ratio below 100 percent means EDG relies on subsidies or incurs losses.


In [13]:
# Save page0 markdown to silver/markdown_testing
markdown_testing_dir = project_root / "data" / "silver" / "markdown_testing"
markdown_testing_dir.mkdir(parents=True, exist_ok=True)

output_file = markdown_testing_dir / "page01.md"
output_file.write_text(page0["markdown"], encoding="utf-8")

print(f"✓ Saved to: {output_file}")
print(f"  File size: {output_file.stat().st_size / 1024:.2f} KB")

✓ Saved to: /Users/lauren/repos/PAD2Skills/data/silver/markdown_testing/page01.md
  File size: 5.21 KB
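
The cell above saves a single page under a fixed name. If a response contained several pages, one option is to name each Markdown file after its source image so the two stay paired. This is a sketch of that idea only; `write_pages` is a hypothetical helper, and the demo writes to a temporary directory instead of `data/silver/markdown_testing`.

```python
import tempfile
from pathlib import Path


def write_pages(pages, out_dir):
    """Write each page's markdown to <image stem>.md and return the paths."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    written = []
    for page in pages:
        stem = Path(page["file_name"]).stem  # "018.png" -> "018"
        target = out_dir / f"{stem}.md"
        target.write_text(page["markdown"], encoding="utf-8")
        written.append(target)
    return written


# Demo with two made-up pages (placeholder content, not real extraction output).
demo_pages = [
    {"file_name": "018.png", "markdown": "# Page A\n"},
    {"file_name": "019.png", "markdown": "# Page B\n"},
]
with tempfile.TemporaryDirectory() as tmp:
    for path in write_pages(demo_pages, tmp):
        print(path.name, path.stat().st_size, "bytes")
```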
