Convert PDF books to clean Markdown files for use in Claude Projects, NotebookLM, and other LLM tools.
Handles both text-based PDFs (using Marker) and scanned/image-based PDFs (using OCR via Tesseract).
    brew install tesseract poppler ghostscript
    git clone https://github.com/AndySparks/BookConvert.git
    cd BookConvert
    python3 -m venv .venv
    source .venv/bin/activate
    pip install -r requirements.txt

    python convert.py input/MyBook.pdf
    python convert.py input/
    python convert.py input/ScannedBook.pdf --ocr
    python convert.py input/MyBook.pdf --output output/Philosophy/

- Drop your PDF(s) into the input/ folder.
- Run convert.py. It uses Marker by default for text-based PDFs, which produces high-quality markdown.
- For scanned books (where the pages are images), use the --ocr flag, which uses Tesseract to extract text via OCR.
- Converted markdown appears in output/.
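
The commands above can also be assembled from a short script when converting many books. This is a minimal sketch, assuming convert.py accepts a path plus the --ocr and --output flags exactly as shown in the usage lines; the helper names `build_command` and `convert_all` are hypothetical and not part of BookConvert.

```python
import subprocess
import sys
from pathlib import Path


def build_command(pdf, ocr=False, output_dir=None):
    """Return the argument list for one convert.py run (hypothetical helper)."""
    cmd = [sys.executable, "convert.py", str(pdf)]
    if ocr:
        # Scanned/image-based PDFs go through the Tesseract path.
        cmd.append("--ocr")
    if output_dir is not None:
        cmd += ["--output", str(output_dir)]
    return cmd


def convert_all(input_dir="input", ocr=False, output_dir=None):
    """Run convert.py once per PDF found directly inside input_dir."""
    for pdf in sorted(Path(input_dir).glob("*.pdf")):
        # check=True stops the batch at the first failed conversion.
        subprocess.run(build_command(pdf, ocr, output_dir), check=True)
```

Calling `convert_all(output_dir="output/Philosophy/")` from the project root, with the virtual environment active, would run the default mode on each PDF in input/. Running convert.py once per file is a choice made here so that a failure points at a single book; passing the folder itself, as in the usage lines, is the simpler route.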
This project includes a CLAUDE.md file, so Claude Code understands the project and can help you convert and clean up books. Just open the project directory in Claude Code and ask it to help convert your PDFs.
- Start with the default mode (no --ocr flag). It's faster and produces better formatting for text-based PDFs.
- Use --ocr only if the default mode produces empty or garbled output; this usually means the PDF is scanned/image-based.
- OCR output may need cleanup. Claude Code can help you fix OCR artifacts, add proper headings, and improve formatting.
- Organize your output into subdirectories by topic (e.g., output/Coaching/, output/Writing/) to keep things tidy.
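
The "empty or garbled" check in the tips above can be approximated in code before deciding to rerun with --ocr. This is a rough sketch only: the function name `needs_ocr` and both thresholds are illustrative assumptions, not values taken from BookConvert, and real books with many tables or formulas may need different settings.

```python
def needs_ocr(markdown_text, min_chars=200, min_word_ratio=0.6):
    """Guess whether default-mode output is empty or garbled.

    Thresholds are illustrative assumptions, not BookConvert settings.
    """
    text = markdown_text.strip()
    if len(text) < min_chars:
        # Effectively empty output: likely a scanned/image-based PDF.
        return True
    tokens = text.split()
    # A token counts as word-like when at least 70% of its characters
    # are letters; garbled extraction tends to yield symbol-heavy tokens.
    wordlike = sum(
        1 for t in tokens if sum(c.isalpha() for c in t) >= 0.7 * len(t)
    )
    return wordlike / len(tokens) < min_word_ratio
```

For example, reading a converted file with `Path("output/MyBook.md").read_text(encoding="utf-8")` (a hypothetical output path) and passing the text to `needs_ocr` gives a quick signal on whether a second run with --ocr is worth trying.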
BookConvert/
  input/             <- Drop your PDFs here
  output/            <- Converted markdown files appear here
  convert.py         <- Main conversion script
  requirements.txt
  CLAUDE.md          <- Instructions for Claude Code
MIT