<a href="https://www.kaggle.com/code/hasib111/optimize-pdf?scriptVersionId=295667980" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

In [1]:
!apt-get -qq update && apt-get -qq install -y ghostscript

W: Skipping acquire of configured file 'main/source/Sources' as repository 'https://r2u.stat.illinois.edu/ubuntu jammy InRelease' does not seem to provide it (sources.list entry misspelt?)
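
The `W:` line above is an apt warning about a third-party repository, not an error from the Ghostscript install itself. To confirm that the `gs` binary is actually available before running the conversion cells, a minimal check could look like the sketch below. The helper name `ghostscript_version` is hypothetical and not part of the notebook; it relies only on `shutil.which` and the `gs --version` flag.

```python
import shutil
import subprocess
from typing import Optional

def ghostscript_version(exe: str = "gs") -> Optional[str]:
    """Return the Ghostscript version string, or None if `exe` is not on PATH."""
    path = shutil.which(exe)
    if path is None:
        return None
    out = subprocess.run([path, "--version"], capture_output=True, text=True)
    # A non-zero exit code is treated the same as "not available"
    return out.stdout.strip() if out.returncode == 0 else None

print("Ghostscript:", ghostscript_version() or "not found")
```

If this prints `not found`, the later `run([...])` cells would fail with a `FileNotFoundError` rather than a Ghostscript error message.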


In [2]:
from pathlib import Path
import subprocess, shlex

pdf_in = Path("/kaggle/input/document/SANTA ROSA AVE-CONSTRUCTION PLANS.pdf")
san = Path("/kaggle/working/sanitized.pdf")
opt = Path("/kaggle/working/optimized.pdf")



In [3]:
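def run(cmd):
    """Echo a command, run it, and raise if it exits with a non-zero status."""
    print("Running:", " ".join(shlex.quote(c) for c in cmd))
    p = subprocess.run(cmd, capture_output=True, text=True)
    if p.returncode != 0:
        # Show the captured output so the failure can be diagnosed
        print(p.stdout)
        print(p.stderr)
        raise RuntimeError("Command failed")
    return p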


In [4]:
# 1) Sanitize: rewrite the PDF through Ghostscript so downstream parsers get a clean structure
run([
    "gs","-dNOPAUSE","-dBATCH","-dSAFER",
    "-sDEVICE=pdfwrite","-dCompatibilityLevel=1.4",
    "-dPDFSETTINGS=/prepress",
    "-o", str(san),
    str(pdf_in)
])

# 2) Optimize: re-encode the sanitized file with a lower-resolution preset to shrink it
run([
    "gs","-dNOPAUSE","-dBATCH","-dSAFER",
    "-sDEVICE=pdfwrite","-dCompatibilityLevel=1.4",
    "-dDetectDuplicateImages=true",
    "-dCompressFonts=true","-dSubsetFonts=true",
    "-dPDFSETTINGS=/ebook",   # /printer keeps more image detail, /screen compresses harder
    "-o", str(opt),
    str(san)
])

print("\nSizes:")
print("original :", pdf_in.stat().st_size, "bytes")
print("sanitized:", san.stat().st_size, "bytes")
print("optimized:", opt.stat().st_size, "bytes")
print("\nOutput:", opt)

Running: gs -dNOPAUSE -dBATCH -dSAFER -sDEVICE=pdfwrite -dCompatibilityLevel=1.4 -dPDFSETTINGS=/prepress -o /kaggle/working/sanitized.pdf '/kaggle/input/document/SANTA ROSA AVE-CONSTRUCTION PLANS.pdf'
Running: gs -dNOPAUSE -dBATCH -dSAFER -sDEVICE=pdfwrite -dCompatibilityLevel=1.4 -dDetectDuplicateImages=true -dCompressFonts=true -dSubsetFonts=true -dPDFSETTINGS=/ebook -o /kaggle/working/optimized.pdf /kaggle/working/sanitized.pdf

Sizes:
original : 142827557 bytes
sanitized: 102976435 bytes
optimized: 33372892 bytes

Output: /kaggle/working/optimized.pdf
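
The byte counts above are easier to compare as relative savings. The sketch below uses a hypothetical helper, `pct_saved`, which is not part of the notebook, and plugs in the three sizes printed by this run.

```python
def pct_saved(before: int, after: int) -> float:
    """Percentage by which `after` is smaller than `before`."""
    if before <= 0:
        raise ValueError("before must be positive")
    return 100.0 * (before - after) / before

# Sizes in bytes, copied from the output above
original, sanitized, optimized = 142_827_557, 102_976_435, 33_372_892

print(f"sanitize pass saved {pct_saved(original, sanitized):.1f}%")
print(f"both passes saved   {pct_saved(original, optimized):.1f}%")
```

For this document the sanitize pass alone removes a bit more than a quarter of the file, and the two passes together remove roughly three quarters.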


**New version**

In [5]:
!apt-get -qq update && apt-get -qq install -y ghostscript

W: Skipping acquire of configured file 'main/source/Sources' as repository 'https://r2u.stat.illinois.edu/ubuntu jammy InRelease' does not seem to provide it (sources.list entry misspelt?)


In [6]:
from pathlib import Path
import subprocess, shlex

pdf_in = Path("/kaggle/input/document/WCSS Sebastopol 1st Sub Set.pdf")
san = Path("/kaggle/working/sanitized.pdf")
opt = Path("/kaggle/working/optimized.pdf")

In [7]:
def run(cmd):
    """Echo a command, run it, and raise with the captured output if it fails."""
    print("Running:", " ".join(shlex.quote(str(c)) for c in cmd))
    p = subprocess.run(cmd, capture_output=True, text=True)
    if p.returncode != 0:
        print("STDOUT:\n", p.stdout)
        print("STDERR:\n", p.stderr)
        raise RuntimeError("Ghostscript failed")
    return p

In [8]:
def size_mb(p: Path) -> float:
    """Return the file size in MB, taking 1 MB as 1024 * 1024 bytes."""
    return p.stat().st_size / (1024 * 1024)

# 1) SANITIZE (rebuild the PDF structure so Claude can ingest it)
run([
    "gs", "-dNOPAUSE", "-dBATCH", "-dSAFER",
    "-sDEVICE=pdfwrite", "-dCompatibilityLevel=1.4",
    "-dDetectDuplicateImages=true",
    "-o", str(san),
    str(pdf_in)
])

print(f"Sanitized: {size_mb(san):.2f} MB")

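# 2) OPTIMIZE: run /ebook, then /screen. Each pass overwrites `opt`,
#    so the file that remains at the end is the /screen result.
presets = ["/ebook", "/screen"]
tmp = Path("/kaggle/working/tmp_out.pdf")  # scratch output, moved onto `opt` after each pass
for preset in presets: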
    print(f"\nOptimizing with {preset} ...")

    run([
        "gs", "-dNOPAUSE", "-dBATCH", "-dSAFER",
        "-sDEVICE=pdfwrite", "-dCompatibilityLevel=1.4",
        "-dDetectDuplicateImages=true",
        "-dCompressFonts=true", "-dSubsetFonts=true",
        f"-dPDFSETTINGS={preset}",
        "-o", str(tmp),
        str(san)
    ])

    print(f"{preset} output: {size_mb(tmp):.2f} MB")
    tmp.replace(opt)

print("\nFinal sizes:")
print(f"Original : {size_mb(pdf_in):.2f} MB")
print(f"Sanitized: {size_mb(san):.2f} MB")
print(f"Optimized: {size_mb(opt):.2f} MB")
print("\nClaude-ready file path:", opt)

Running: gs -dNOPAUSE -dBATCH -dSAFER -sDEVICE=pdfwrite -dCompatibilityLevel=1.4 -dDetectDuplicateImages=true -o /kaggle/working/sanitized.pdf '/kaggle/input/document/WCSS Sebastopol 1st Sub Set.pdf'
Sanitized: 241.95 MB

Optimizing with /ebook ...
Running: gs -dNOPAUSE -dBATCH -dSAFER -sDEVICE=pdfwrite -dCompatibilityLevel=1.4 -dDetectDuplicateImages=true -dCompressFonts=true -dSubsetFonts=true -dPDFSETTINGS=/ebook -o /kaggle/working/tmp_out.pdf /kaggle/working/sanitized.pdf
/ebook output: 123.26 MB

Optimizing with /screen ...
Running: gs -dNOPAUSE -dBATCH -dSAFER -sDEVICE=pdfwrite -dCompatibilityLevel=1.4 -dDetectDuplicateImages=true -dCompressFonts=true -dSubsetFonts=true -dPDFSETTINGS=/screen -o /kaggle/working/tmp_out.pdf /kaggle/working/sanitized.pdf
/screen output: 70.10 MB

Final sizes:
Original : 358.00 MB
Sanitized: 241.95 MB
Optimized: 70.10 MB

Claude-ready file path: /kaggle/working/optimized.pdf
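
The loop above always runs both presets and keeps whichever ran last, so the final file is the `/screen` output even if `/ebook` would have been small enough. If the goal is to stay under a size limit while keeping as much image quality as possible, one option is to pick the first preset that fits. The sketch below is only an illustration: `choose_preset` and the 100 MB limit are assumptions, not part of the notebook, and the sizes are the ones reported in the output above.

```python
from typing import Sequence, Tuple

def choose_preset(results: Sequence[Tuple[str, float]], limit_mb: float) -> str:
    """Pick the first preset whose output fits under `limit_mb`.

    `results` must be ordered from least to most aggressive compression.
    If nothing fits, fall back to the preset with the smallest output.
    """
    if not results:
        raise ValueError("no results to choose from")
    for preset, size in results:
        if size <= limit_mb:
            return preset
    return min(results, key=lambda r: r[1])[0]

# (preset, size in MB) pairs taken from the run above
measured = [("/ebook", 123.26), ("/screen", 70.10)]

# 100.0 is an assumed example limit, not a documented upload cap
print(choose_preset(measured, limit_mb=100.0))
```

Inside the notebook's loop, the same effect could be had by checking `size_mb(opt)` after each pass and leaving the loop with `break` once the file is under the chosen limit.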
