A Simple Tool to convert PDF to Markdown

PDF is very good at preserving a document's original visual layout, but it is poorly suited to searching and to analysis with digital humanities methods.

With this command-line tool you can convert any PDF, whether it is image-only or already has a text layer, into Markdown, with optional checking through the DeepSeek API. Combined with a note-taking app such as Obsidian, this makes reading more trackable and enjoyable.
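To make the two cases (text layer versus image-only) concrete, here is a minimal sketch of how one page could be read with a fallback to OCR. It is not the code in this repository: the tool itself relies on the explicit `--ocr-needed` flag, while the names `needs_ocr` and `page_text`, the `min_chars` threshold, and the 300 dpi rendering are assumptions chosen for illustration. The sketch assumes PyMuPDF (`fitz`), Pillow, and pytesseract are installed.

```python
import io


def needs_ocr(text: str, min_chars: int = 20) -> bool:
    """Heuristic (an assumption): a page with almost no extractable text is a scan."""
    return len(text.strip()) < min_chars


def page_text(pdf_path: str, page_index: int, lang: str = "eng+deu") -> str:
    """Return the text of one page (0-based index), using OCR only if needed."""
    # Third-party imports stay local so needs_ocr() works without them.
    import fitz  # PyMuPDF
    import pytesseract
    from PIL import Image

    with fitz.open(pdf_path) as doc:
        page = doc[page_index]
        text = page.get_text()
        if not needs_ocr(text):
            return text
        # No usable text layer: render the page and hand the image to Tesseract.
        pix = page.get_pixmap(dpi=300)
        image = Image.open(io.BytesIO(pix.tobytes("png")))
        return pytesseract.image_to_string(image, lang=lang)
```

A fixed character threshold is crude; a page that holds only a short caption would be sent to OCR as well, which is one reason an explicit flag can be the safer choice.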

PS: This is not an advanced tool built on large deep-learning models that can detect and remove page headers or page numbers, so the output needs manual correction if you want a more accurate version.

Usage

python main.py "pdf_path" --out-dir "output_path" [--lang LANG (default: eng+deu)] [--start N (default: 1)] [--range 1-] [--ocr-needed] [--llm-needed] [--tesseract-cmd PATH] [--llm-key KEY] [--max-workers N] [-v/--verbose]
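For readers who want to see how such a command line fits together, the following is a hedged sketch of an `argparse` parser that mirrors the options documented in this README. It is an illustration, not the contents of `main.py`: the function names `build_parser` and `parse_args` and the `dest` names are assumptions, and the output option accepts both `--out-dir` and `--output-dir` only because this README uses both spellings.

```python
import argparse
import os


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="Convert a PDF to Markdown.")
    parser.add_argument("pdf_path", help="Path to the input PDF.")
    # Both spellings appear in this README, so the sketch accepts either one.
    parser.add_argument("--out-dir", "--output-dir", dest="out_dir", default=None,
                        help="Directory for the output .md file.")
    parser.add_argument("--lang", default="eng+deu",
                        help="Tesseract language codes, e.g. eng or eng+fra.")
    parser.add_argument("--start", type=int, default=1,
                        help="Printed page number of the first PDF page.")
    parser.add_argument("--range", dest="page_range", default=None,
                        help='Pages to process, e.g. "1-3,5,7-9".')
    parser.add_argument("--ocr-needed", action="store_true",
                        help="Run Tesseract OCR on the pages.")
    parser.add_argument("--tesseract-cmd", default=None,
                        help="Explicit path to the Tesseract executable.")
    parser.add_argument("--llm-needed", action="store_true",
                        help="Run DeepSeek correction on the extracted text.")
    # Fall back to the environment variable when no key is passed.
    parser.add_argument("--llm-key", default=os.environ.get("DEEPSEEK_API_KEY"),
                        help="DeepSeek API key.")
    parser.add_argument("--max-workers", type=int, default=4,
                        help="Maximum number of concurrent workers.")
    parser.add_argument("-v", "--verbose", action="store_true",
                        help="Enable verbose output.")
    return parser


def parse_args(argv=None) -> argparse.Namespace:
    parser = build_parser()
    args = parser.parse_args(argv)
    # A key is only required when LLM processing is requested.
    if args.llm_needed and not args.llm_key:
        parser.error("--llm-needed requires --llm-key or DEEPSEEK_API_KEY")
    return args
```

Calling `parse_args(["book.pdf", "--ocr-needed", "--range", "1-3"])` yields a namespace with the defaults filled in, which keeps the rest of a script free of option handling.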

Reference

| Option | Type | Default | Description |
| --- | --- | --- | --- |
| `pdf_path` | Positional | N/A | The file path to the input PDF document. |
| `--output-dir` | Path | Same as input PDF | The directory path to save the output `.md` file. |
| `--lang` | string | `eng+deu` | Tesseract language codes (e.g., `eng` or `eng+fra`). Must be installed. |
| `--start` | int | `1` | The printed page number corresponding to the first page of the PDF. Useful for documents that start on page 5 or 10. |
| `--range` | string | All pages | Specify page ranges to process (e.g., `"1-3,5,7-9"`). Uses the PDF page number (1-based). |
| `--ocr-needed` | flag | Off | Required if the PDF is image-only or scanned. Activates the Tesseract OCR engine. |
| `--tesseract-cmd` | path | OS Default | Specify the explicit path to your Tesseract executable (e.g., `C:\Program Files\Tesseract-OCR\tesseract.exe`). |
| `--llm-needed` | flag | Off | Activates DeepSeek AI processing for error correction and formatting. |
| `--llm-key` | string | Env: `DEEPSEEK_API_KEY` | Your DeepSeek API key (or set the environment variable). Required if `--llm-needed` is set. |
| `--max-workers` | int | `4` | Maximum number of concurrent workers for parallel processing. |
| `-v, --verbose` | flag | Off | Enable verbose output, including detailed logging and debug messages. |
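The interplay between `--range` (PDF page numbers) and `--start` (printed page numbers) is easy to get wrong, so here is a small sketch of how the two could be interpreted. The helper names `parse_page_range` and `printed_page_number` are hypothetical, and treating an open-ended range such as `8-` as "to the last page" is an assumption based on the `--range 1-` form shown under Usage.

```python
from typing import List


def parse_page_range(spec: str, total_pages: int) -> List[int]:
    """Expand a spec such as "1-3,5,7-9" into sorted 1-based PDF page numbers."""
    pages = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            first, last = part.split("-", 1)
            lo = int(first) if first.strip() else 1
            # An open end such as "8-" is read as "up to the last page".
            hi = int(last) if last.strip() else total_pages
        else:
            lo = hi = int(part)
        if lo < 1 or hi > total_pages or lo > hi:
            raise ValueError(f"invalid page range: {part!r}")
        pages.update(range(lo, hi + 1))
    return sorted(pages)


def printed_page_number(pdf_page: int, start: int = 1) -> int:
    """Map a 1-based PDF page to its printed number when PDF page 1 is printed as `start`."""
    return start + pdf_page - 1
```

For example, with `--start 5` the third page of the PDF corresponds to printed page 7, while `--range` still refers to it as page 3.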

Note

Install the Tesseract OCR engine first and make sure it is available on your system PATH.
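A quick way to check this before running a conversion is sketched below. The helper `resolve_tesseract_cmd` is a hypothetical name, not part of this tool; it prefers an explicit path (as `--tesseract-cmd` would supply) and otherwise looks up `tesseract` on PATH with the standard library.

```python
import shutil
from typing import Optional


def resolve_tesseract_cmd(explicit: Optional[str] = None) -> Optional[str]:
    """Return the explicit path if given, else whatever `tesseract` resolves to on PATH."""
    if explicit:
        return explicit
    return shutil.which("tesseract")


if __name__ == "__main__":
    cmd = resolve_tesseract_cmd()
    if cmd is None:
        print("Tesseract was not found on PATH; install it or pass --tesseract-cmd.")
    else:
        print(f"Using Tesseract at {cmd}")
        # With pytesseract installed, the path can be applied like this:
        # import pytesseract
        # pytesseract.pytesseract.tesseract_cmd = cmd
```

Note that the sketch does not verify that an explicit path exists or that the requested language data (for example `deu`) is installed.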

Credit

- Thanks to great libraries such as Pytesseract, Fitz (PyMuPDF), and OpenAI.
- Code writing was assisted by LLMs (Gemini, Grok, ChatGPT, and DeepSeek).

