Tiezjin/SourceSpot

A Simple Tool to convert PDF to Markdown

PDF is very good at preserving a document's original visual layout, but it is poorly suited to searching and analysis with digital humanities methods.

This command-line tool converts any PDF, whether it consists of scanned images or already has a text layer, into Markdown, with optional checking through the DeepSeek API. Combined with a note-taking app such as Obsidian, the output makes reading easier to track and more enjoyable.

PS: This is not an advanced tool built on large deep-learning models that can detect and remove running headers or page numbers, so manual correction is needed if you want a more accurate result.

Usage

python main.py "pdf_path" [--out-dir "output_path"] [--lang eng+deu] [--start 1] [--range 1-] [--ocr-needed] [--tesseract-cmd "tesseract_path"] [--llm-needed] [--llm-key "api_key"] [--max-workers 4] [-v/--verbose]

Reference

| Option | Type | Default | Description |
| --- | --- | --- | --- |
| `pdf_path` | Positional | N/A | The file path to the input PDF document. |
| `--output-dir` | Path | Same as input PDF | The directory path to save the output .md file. |
| `--lang` | string | eng+deu | Tesseract language codes (e.g., eng or eng+fra). Must be installed. |
| `--start` | int | 1 | The printed page number corresponding to the first page of the PDF. Useful for documents that start on page 5 or 10. |
| `--range` | string | All pages | Specify page ranges to process (e.g., "1-3,5,7-9"). Uses the PDF page number (1-based). |
| `--ocr-needed` | flag | Off | Required if the PDF is image-only or scanned. Activates the Tesseract OCR engine. |
| `--tesseract-cmd` | path | OS Default | Specify the explicit path to your Tesseract executable (e.g., `C:\Program Files\Tesseract-OCR\tesseract.exe`). |
| `--llm-needed` | flag | Off | Activates DeepSeek AI processing for error correction and formatting. |
| `--llm-key` | string | Env: DEEPSEEK_API_KEY | Your DeepSeek API key (or set the environment variable). Required if `--llm-needed` is set. |
| `--max-workers` | int | 4 | Maximum number of concurrent workers for parallel processing. |
| `-v`, `--verbose` | flag | Off | Enable verbose output, including detailed logging and debug messages. |
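
To illustrate how the --range and --start options relate, here is a minimal sketch of expanding a range string into 1-based PDF page numbers and mapping a PDF page to its printed number. The function names `parse_page_ranges` and `printed_page_number` are hypothetical and are not taken from the tool's source. Treating an open-ended range such as "1-" as "through the last page" is also an assumption.

```python
def parse_page_ranges(spec: str, total_pages: int) -> list[int]:
    """Expand a spec such as "1-3,5,7-9" into sorted 1-based PDF page numbers.

    Assumption: an open end ("4-" or "-3") means "to the last page" or
    "from the first page". The real tool may behave differently.
    """
    pages = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            lo_text, hi_text = part.split("-", 1)
            lo = int(lo_text) if lo_text.strip() else 1
            hi = int(hi_text) if hi_text.strip() else total_pages
        else:
            lo = hi = int(part)
        if lo < 1 or hi > total_pages or lo > hi:
            raise ValueError(
                f"invalid page range {part!r} for a {total_pages}-page PDF"
            )
        pages.update(range(lo, hi + 1))
    return sorted(pages)


def printed_page_number(pdf_page: int, start: int = 1) -> int:
    """Map a 1-based PDF page to its printed page number.

    `start` plays the role of --start: the printed number on the first PDF page.
    """
    return start + pdf_page - 1


if __name__ == "__main__":
    print(parse_page_ranges("1-3,5,7-9", total_pages=10))  # [1, 2, 3, 5, 7, 8, 9]
    print(printed_page_number(3, start=5))  # 7
```

With --start 5, for example, the third PDF page would carry the printed page number 7 under this mapping.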

Note

Install the Tesseract OCR engine first and make sure its executable is on the system PATH.
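
One way to confirm that Tesseract is reachable before starting a conversion is a quick lookup with the standard library. This is a small sketch, not part of the tool itself. The helper name `find_tesseract` is hypothetical, and its optional argument stands in for a path you would otherwise pass via --tesseract-cmd.

```python
import shutil
from typing import Optional


def find_tesseract(explicit_cmd: Optional[str] = None) -> Optional[str]:
    """Return the path of a usable Tesseract executable, or None if not found.

    If `explicit_cmd` is given (a bare name or a full path), only that is
    checked; otherwise the name "tesseract" is looked up on the system PATH.
    """
    return shutil.which(explicit_cmd or "tesseract")


if __name__ == "__main__":
    path = find_tesseract()
    if path is None:
        print("Tesseract not found; install it or pass --tesseract-cmd")
    else:
        print(f"Tesseract found at {path}")
```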

Credit

- Thanks to the great libraries pytesseract, PyMuPDF (fitz), and openai.
- Code writing was assisted by LLMs (Gemini, Grok, ChatGPT, and DeepSeek).
