PDF is very good at preserving a document's original visual layout, but it is poorly suited to searching and to analysis with digital humanities methods.

This command-line tool converts any PDF, whether it is image-only or already has a text layer, into Markdown, with optional checking through the DeepSeek API. Combined with a note-taking app such as Obsidian, it makes reading more trackable and enjoyable.

PS: This is not an advanced tool backed by large deep-learning models that can detect and remove running headers or page numbers, so the output needs manual correction if you want a more accurate version.

Usage: `python main.py "pdf_path" [--out-dir "output_path"] [--lang eng+deu] [--start 1] [--range 1-] [--ocr-needed] [--llm-needed] [--tesseract-cmd PATH] [--llm-key KEY] [--max-workers N] [-v | --verbose]`

| Option | Type | Default | Description |
|---|---|---|---|
| `pdf_path` | Positional | N/A | The file path to the input PDF document. |
| `--output-dir` | Path | Same as input PDF | The directory path to save the output .md file. |
| `--lang` | string | `eng+deu` | Tesseract language codes (e.g., `eng` or `eng+fra`). Must be installed. |
| `--start` | int | 1 | The printed page number corresponding to the first page of the PDF. Useful for documents that start on page 5 or 10. |
| `--range` | string | All pages | Specify page ranges to process (e.g., "1-3,5,7-9"). Uses the PDF page number (1-based). |
| `--ocr-needed` | flag | Off | Required if the PDF is image-only or scanned. Activates the Tesseract OCR engine. |
| `--tesseract-cmd` | path | OS Default | Specify the explicit path to your Tesseract executable (e.g., `C:\Program Files\Tesseract-OCR\tesseract.exe`). |
| `--llm-needed` | flag | Off | Activates DeepSeek AI processing for error correction and formatting. |
| `--llm-key` | string | Env: `DEEPSEEK_API_KEY` | Your DeepSeek API key (or set the environment variable). Required if `--llm-needed` is set. |
| `--max-workers` | int | 4 | Maximum number of concurrent workers for parallel processing. |
| `-v`, `--verbose` | flag | Off | Enable verbose output, including detailed logging and debug messages. |

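
To illustrate how the `--range` and `--start` options relate, here is a minimal sketch of expanding a range string into 1-based PDF page numbers and mapping a PDF page to its printed page number. The function names `parse_page_ranges` and `printed_page_number` are hypothetical and are not taken from `main.py`; the handling of an open-ended range such as `1-` is an assumption based on the usage line.

```python
from typing import List


def parse_page_ranges(spec: str, total_pages: int) -> List[int]:
    """Expand a spec like "1-3,5,7-9" into sorted 1-based PDF page numbers.

    Assumption: an open end ("8-") means "up to the last page" and an
    open start ("-3") means "from page 1".
    """
    pages = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            start_text, end_text = part.split("-", 1)
            start = int(start_text) if start_text.strip() else 1
            end = int(end_text) if end_text.strip() else total_pages
        else:
            start = end = int(part)
        if start < 1 or end > total_pages or start > end:
            raise ValueError(f"invalid page range: {part!r}")
        pages.update(range(start, end + 1))
    return sorted(pages)


def printed_page_number(pdf_page: int, start: int = 1) -> int:
    """Map a 1-based PDF page to its printed number, where `start` is the
    printed number of the first PDF page (the meaning of --start)."""
    return start + pdf_page - 1


if __name__ == "__main__":
    print(parse_page_ranges("1-3,5,7-9", total_pages=10))
    print(printed_page_number(1, start=5))
```

With `--start 5`, for example, the first PDF page would be labelled as printed page 5 under this mapping, while `--range` keeps referring to PDF page numbers.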
Install the Tesseract OCR engine first and make sure it is on your system PATH, or pass its location with `--tesseract-cmd`.
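
A quick way to check whether Tesseract is reachable before running the tool is sketched below, using only the Python standard library. The helper name `resolve_tesseract` is hypothetical and this is not necessarily how `main.py` performs the check.

```python
import shutil
from typing import Optional


def resolve_tesseract(explicit_path: Optional[str] = None) -> Optional[str]:
    """Return the path of a usable Tesseract executable, or None.

    If `explicit_path` is given (as with --tesseract-cmd), it is checked
    directly; otherwise the system PATH is searched for "tesseract".
    """
    return shutil.which(explicit_path or "tesseract")


if __name__ == "__main__":
    path = resolve_tesseract()
    if path is None:
        print("Tesseract not found; install it or pass --tesseract-cmd.")
    else:
        print(f"Tesseract found at {path}")
```

If the lookup returns `None`, passing the full executable path through `--tesseract-cmd` is the alternative described in the options table.
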
- Thanks to the great libraries pytesseract, Fitz (PyMuPDF), and openai.
- Code writing was assisted by LLMs (Gemini, Grok, ChatGPT, and DeepSeek).