A working PDF to Markdown converter with:
- direct text extraction using PyMuPDF
- OCR fallback (Tesseract via pdf2image) for scanned pages
- single-file and batch CLI conversion
- Install dependencies: `python3 -m pip install -r requirements.txt`
- Install Tesseract OCR:
  - macOS: `brew install tesseract`
  - Ubuntu/Debian: `sudo apt-get install tesseract-ocr poppler-utils`
  - Windows: install from UB Mannheim builds
- For OCR conversion from PDF pages, make sure Poppler is installed (`pdftoppm` must be available).
Convert a single PDF:

    python3 Scripts/convert_pdf.py path/to/input.pdf --output path/to/output.md

If `--output` is omitted, the Markdown file is written next to the input PDF with the same base name.
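The default output location can be sketched with `pathlib`. The helper name `default_output_path` is hypothetical and is not taken from `Scripts/convert_pdf.py`; it only illustrates the "same base name, next to the input" rule.

```python
from pathlib import Path
from typing import Optional


def default_output_path(input_pdf: str, output: Optional[str] = None) -> Path:
    """Return the Markdown path for a PDF (illustrative helper, not the script's API)."""
    if output is not None:
        # An explicit --output value takes precedence.
        return Path(output)
    # Otherwise keep the directory and base name, and swap the extension for .md.
    return Path(input_pdf).with_suffix(".md")
```

For example, `default_output_path("docs/report.pdf")` yields `docs/report.md`.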
Convert a folder of PDFs in batch mode:

    python3 Scripts/convert_pdf.py path/to/pdf-folder --batch --output path/to/output-folder

- The converter prioritizes embedded PDF text.
- OCR is used only when the text extracted from a page is too short (the page is likely scanned or image-only).
- Output is separated by page using `## Page N` headers.
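The per-page decision above might look roughly like the following sketch. It is not the converter's actual code: the `MIN_CHARS` threshold, the function names, and the `ocr_page` callback (standing in for the Tesseract path) are all assumptions made for illustration, and the PDF libraries are left out so the logic can be read on its own.

```python
from typing import Callable, Iterable

MIN_CHARS = 20  # assumed threshold; the real cutoff is not stated in this README


def needs_ocr(extracted_text: str, min_chars: int = MIN_CHARS) -> bool:
    """Treat a page with almost no embedded text as scanned or image-only."""
    return len(extracted_text.strip()) < min_chars


def pages_to_markdown(
    native_texts: Iterable[str],
    ocr_page: Callable[[int], str],
    min_chars: int = MIN_CHARS,
) -> str:
    """Build page-separated Markdown, calling ocr_page(page_number) only when needed."""
    sections = []
    for number, text in enumerate(native_texts, start=1):
        if needs_ocr(text, min_chars):
            # Fallback path: hypothetical callback that OCRs the rendered page.
            text = ocr_page(number)
        sections.append(f"## Page {number}\n\n{text.strip()}")
    return "\n\n".join(sections) + "\n"
```

Passing the OCR step in as a callback keeps the expensive path optional: pages with enough embedded text never trigger it.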
A Python backend is available under backend/ with auto-discovered model routes.
Run:

    python3 -m pip install -r backend/requirements.txt
    uvicorn backend.app.main:app --reload --port 8000

List models:

    curl http://127.0.0.1:8000/models

Convert with a specific model:

    curl -X POST http://127.0.0.1:8000/convert/native \
      -F "file=@/path/to/file.pdf"

To add a new model, add a new Python file in `backend/app/models/` that exports a `model` object (`ModelDefinition`).
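One way such auto-discovery could work is to import every module in the models directory and collect its `model` attribute. The sketch below is a guess at the general technique, not the backend's real loader: `discover_models` is a hypothetical name, keying models by file name is an assumption, and nothing is assumed about the fields of `ModelDefinition`.

```python
import importlib.util
from pathlib import Path
from typing import Dict


def discover_models(models_dir: Path) -> Dict[str, object]:
    """Import each *.py file in models_dir and collect its `model` attribute."""
    found: Dict[str, object] = {}
    for path in sorted(models_dir.glob("*.py")):
        if path.name.startswith("_"):
            continue  # skip __init__.py and private helpers
        spec = importlib.util.spec_from_file_location(f"discovered_{path.stem}", path)
        module = importlib.util.module_from_spec(spec)
        spec.loader.exec_module(module)
        model = getattr(module, "model", None)
        if model is not None:
            # Assumption: the file name doubles as the model identifier.
            found[path.stem] = model
    return found
```

Files that do not export `model` are ignored, so helper modules can live in the same folder without being registered.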
The Next.js frontend now uses two API routes that proxy to the FastAPI model backend:
- Frontend `fetch('/api/models')` -> Next route `GET /api/models` -> FastAPI `GET /models`
- Frontend `fetch('/api/convert-pdf')` with `file` + `model` -> Next route `POST /api/convert-pdf` -> FastAPI `POST /convert/{model_id}`
This way, every model that can be selected in the frontend maps to a backend conversion route.
Run FastAPI before using conversion from the frontend:

    uvicorn backend.app.main:app --reload --port 8000

Set backend URL in environment:

    FASTAPI_BASE_URL=http://127.0.0.1:8000
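To make the proxy mapping concrete, here is a small Python illustration of how the two backend URLs could be derived from `FASTAPI_BASE_URL`. The actual proxy lives in the Next.js routes and is not shown in this README; the function names and the fallback to `http://127.0.0.1:8000` when the variable is unset are assumptions.

```python
import os
from typing import Optional
from urllib.parse import quote

DEFAULT_BASE_URL = "http://127.0.0.1:8000"  # assumed fallback when the variable is unset


def _base_url(base_url: Optional[str] = None) -> str:
    """Resolve the backend base URL, without a trailing slash."""
    return (base_url or os.environ.get("FASTAPI_BASE_URL", DEFAULT_BASE_URL)).rstrip("/")


def models_url(base_url: Optional[str] = None) -> str:
    """Target of the model-listing proxy: FastAPI GET /models."""
    return f"{_base_url(base_url)}/models"


def convert_url(model_id: str, base_url: Optional[str] = None) -> str:
    """Target of the conversion proxy: FastAPI POST /convert/{model_id}."""
    # Percent-encode the id so it cannot inject extra path segments.
    return f"{_base_url(base_url)}/convert/{quote(model_id, safe='')}"
```

With the environment value shown above, `convert_url("native")` resolves to `http://127.0.0.1:8000/convert/native`.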