PDF to Markdown Converter

A working PDF to Markdown converter with:

direct text extraction using PyMuPDF
OCR fallback (Tesseract via pdf2image) for scanned pages
single-file and batch CLI conversion

Setup

Install dependencies:

python3 -m pip install -r requirements.txt

Install Tesseract OCR:

macOS: brew install tesseract
Ubuntu/Debian: sudo apt-get install tesseract-ocr poppler-utils
Windows: install from UB Mannheim builds

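OCR renders PDF pages to images through pdf2image, which requires Poppler: pdftoppm must be available on your PATH. The Ubuntu/Debian command above already installs it through poppler-utils; on other platforms, install Poppler separately.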

CLI Usage

Convert one PDF

python3 Scripts/convert_pdf.py path/to/input.pdf --output path/to/output.md

If --output is omitted, output is written next to the input PDF using the same base name.
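The default-output rule can be sketched as follows. This is a minimal illustration rather than the code in Scripts/convert_pdf.py; the function name resolve_output_path is hypothetical.

```python
from pathlib import Path
from typing import Optional


def resolve_output_path(input_pdf: str, output: Optional[str] = None) -> Path:
    """Return the Markdown path for a single-file conversion.

    Hypothetical helper: mirrors the documented behaviour of --output,
    not necessarily the script's actual implementation.
    """
    if output:
        # An explicit --output value always wins.
        return Path(output)
    # Otherwise: same directory, same base name, .md extension.
    return Path(input_pdf).with_suffix(".md")
```

For example, resolve_output_path("docs/report.pdf") yields docs/report.md, while passing an explicit second argument returns that path unchanged.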

Batch convert a folder

python3 Scripts/convert_pdf.py path/to/pdf-folder --batch --output path/to/output-folder
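
One way batch mode could pair inputs with outputs is sketched below. The function name plan_batch is hypothetical, and two details are assumptions not stated in this README: the folder is scanned non-recursively, and the .pdf extension is matched case-insensitively.

```python
from pathlib import Path
from typing import List, Tuple


def plan_batch(input_dir: str, output_dir: str) -> List[Tuple[Path, Path]]:
    """Pair every PDF in input_dir with a Markdown path in output_dir.

    Assumptions (not confirmed by the README): no recursion into
    subfolders, and ".PDF" is accepted as well as ".pdf".
    """
    src = Path(input_dir)
    dst = Path(output_dir)
    pdfs = sorted(
        p for p in src.iterdir()
        if p.is_file() and p.suffix.lower() == ".pdf"
    )
    # Each output keeps the input's base name and gets a .md extension.
    return [(pdf, dst / f"{pdf.stem}.md") for pdf in pdfs]
```

Planning the pairs up front keeps the conversion loop simple and makes it easy to report which files will be written before any work starts.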

Notes

The converter prioritizes embedded PDF text.
OCR is used only when the text extracted from a page is too short, which usually means the page is scanned or image-only.
Output is page-separated using ## Page N headers.
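
The notes above can be sketched as a small pipeline. This is an illustration under stated assumptions, not the converter's actual code: the threshold MIN_TEXT_CHARS is a made-up value (the real cutoff is not documented here), all function names are hypothetical, and pytesseract is assumed as the Python binding to Tesseract.

```python
from typing import Callable, Iterable, List

MIN_TEXT_CHARS = 20  # hypothetical threshold; the real cutoff is not documented


def needs_ocr(text: str, min_chars: int = MIN_TEXT_CHARS) -> bool:
    """Treat a page as scanned when it has almost no embedded text."""
    return len(text.strip()) < min_chars


def pages_to_markdown(
    page_texts: Iterable[str],
    ocr_page: Callable[[int], str],
    min_chars: int = MIN_TEXT_CHARS,
) -> str:
    """Build page-separated Markdown; ocr_page(n) runs only as a fallback."""
    sections: List[str] = []
    for number, text in enumerate(page_texts, start=1):
        if needs_ocr(text, min_chars):
            text = ocr_page(number)  # 1-based page number
        sections.append(f"## Page {number}\n\n{text.strip()}")
    return "\n\n".join(sections) + "\n"


def extract_embedded_text(pdf_path: str) -> List[str]:
    """Read embedded text per page with PyMuPDF (imported lazily)."""
    import fitz  # PyMuPDF

    with fitz.open(pdf_path) as doc:
        return [page.get_text() for page in doc]


def ocr_pdf_page(pdf_path: str, page_number: int) -> str:
    """Render one page with pdf2image and OCR it (pytesseract is an assumption)."""
    from pdf2image import convert_from_path
    import pytesseract

    images = convert_from_path(
        pdf_path, first_page=page_number, last_page=page_number
    )
    return pytesseract.image_to_string(images[0])
```

A caller would combine the pieces as pages_to_markdown(extract_embedded_text(path), lambda n: ocr_pdf_page(path, n)). Passing the OCR step in as a callable means pages with embedded text never trigger the slower image rendering.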

FastAPI Model Backend (Plugin-Based)

A Python backend is available under backend/ with auto-discovered model routes.

Run:

python3 -m pip install -r backend/requirements.txt
uvicorn backend.app.main:app --reload --port 8000

List models:

curl http://127.0.0.1:8000/models

Convert with a specific model by putting its ID in the path (native in this example):

curl -X POST http://127.0.0.1:8000/convert/native \
  -F "file=@/path/to/file.pdf"

To add a new model, create a Python file in backend/app/models/ that exports a model object (a ModelDefinition). The backend discovers it automatically and exposes a route for it.
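
A plugin mechanism of this kind could look roughly like the sketch below. Everything in it is an assumption made for illustration: the fields of ModelDefinition, the module-level attribute name model, and the discover_models helper are hypothetical and may not match the code in backend/.

```python
import importlib
import pkgutil
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass(frozen=True)
class ModelDefinition:
    """Hypothetical shape; the real class in backend/ may differ."""
    id: str                          # used in the route, e.g. /convert/{id}
    label: str                       # human-readable name for the frontend
    convert: Callable[[bytes], str]  # PDF bytes in, Markdown out


def discover_models(package_name: str) -> Dict[str, object]:
    """Import every module in a package and collect its `model` attribute.

    Modules without a `model` attribute are skipped. The check is duck-typed
    (anything with an `id`), which is an assumption of this sketch.
    """
    package = importlib.import_module(package_name)
    found: Dict[str, object] = {}
    for info in pkgutil.iter_modules(package.__path__):
        module = importlib.import_module(f"{package_name}.{info.name}")
        model = getattr(module, "model", None)
        if model is not None and hasattr(model, "id"):
            found[model.id] = model
    return found
```

Under these assumptions, a new plugin file would only need to define something like model = ModelDefinition(id="native", label="Native", convert=...) at module level; no central registry has to be edited.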

Frontend/Backend Mapping

The Next.js frontend uses two API routes that proxy to the FastAPI model backend:

Frontend fetch('/api/models') -> Next route GET /api/models -> FastAPI GET /models
Frontend fetch('/api/convert-pdf') with file + model -> Next route POST /api/convert-pdf -> FastAPI POST /convert/{model_id}

This ensures model selection in the frontend has a backend counterpart for conversion.
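
The URL mapping above can be expressed as two small helpers. They are written in Python purely for illustration (the actual proxy routes live in the Next.js app), and the function names are hypothetical.

```python
from urllib.parse import quote


def models_url(base_url: str) -> str:
    """FastAPI endpoint behind the frontend's GET /api/models."""
    return f"{base_url.rstrip('/')}/models"


def convert_url(base_url: str, model_id: str) -> str:
    """FastAPI endpoint behind the frontend's POST /api/convert-pdf."""
    # Percent-encode the ID so it always stays a single path segment.
    return f"{base_url.rstrip('/')}/convert/{quote(model_id, safe='')}"
```

With the base URL http://127.0.0.1:8000 and the model ID native, convert_url produces http://127.0.0.1:8000/convert/native, which matches the curl example in the backend section.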

Required runtime

Start the FastAPI backend before converting from the frontend:

uvicorn backend.app.main:app --reload --port 8000

Set the backend URL in the frontend's environment:

FASTAPI_BASE_URL=http://127.0.0.1:8000
