A tiny CLI utility to stream large PDF files into plain text without loading the entire
file into memory. It wraps pdfminer.six with page-based iteration, configurable
LAParams, a friendly CLI spinner, and safe logging so you can batch-process enormous PDFs.
When conversion finishes, the CLI prints a summary showing file sizes and elapsed time.
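
The page-based iteration described above might look roughly like the following. This is a minimal sketch, not the script's actual code: `iter_page_text` and `write_pages` are hypothetical names, and only `extract_pages`, `LAParams`, and `LTTextContainer` come from pdfminer.six.

```python
from typing import Iterable, Iterator, TextIO


def iter_page_text(pdf_path: str, char_margin: float = 2.0,
                   line_margin: float = 0.5) -> Iterator[str]:
    """Yield the text of one page at a time (hypothetical helper)."""
    # Imported lazily so write_pages below can be used without pdfminer.six.
    from pdfminer.high_level import extract_pages
    from pdfminer.layout import LAParams, LTTextContainer

    laparams = LAParams(char_margin=char_margin, line_margin=line_margin)
    # extract_pages is a generator, so pages are produced one by one
    # rather than being collected into a list first.
    for page_layout in extract_pages(pdf_path, laparams=laparams):
        yield "".join(
            element.get_text()
            for element in page_layout
            if isinstance(element, LTTextContainer)
        )


def write_pages(pages: Iterable[str], out: TextIO) -> int:
    """Write each page as soon as it is produced; return the page count."""
    count = 0
    for text in pages:
        out.write(text)
        count += 1
    return count
```

Assuming pdfminer.six is installed, the two pieces could be combined as `write_pages(iter_page_text("documents/manual.pdf"), open("manual.txt", "w", encoding="utf-8"))`.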

- Create or activate your Python virtual environment (the repository already
  contains .venv/).
- Install the requirements:

      pip install -r requirements.txt

Usage:

    python pdf_to_text.py INPUT_PDF [-o OUTPUT_TXT] [OPTIONS]

Convert an entire PDF:

    python pdf_to_text.py documents/manual.pdf

Extract a page range, overwriting the output file if it already exists:

    python pdf_to_text.py big-output.pdf --page-range 50-150 \
        --output extracted.txt --overwrite

Append additional content to an existing transcript if you are processing PDFs in chunks. The CLI shows a lightweight animation while it works:

    python pdf_to_text.py another-chunk.pdf --append --output extracted.txt

Options:

- --page-range: specify start-end to control the page window (e.g., 10- for everything after page 10).
- --encoding: control the output text encoding (default utf-8).
- --char-margin, --line-margin, --word-margin, --boxes-flow, --detect-vertical: customize pdfminer.six layout heuristics when dealing with complex columns or rotated text.
- --quiet / --log-level: mute or raise logging verbosity.
- --no-spinner: disable the CLI animation (it is automatically muted when --quiet is used).
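
To make the --page-range syntax concrete, here is one way a start-end value could be parsed. It is a sketch under assumptions: `parse_page_range` is a hypothetical name, and 1-based page numbers with an open end represented as None are guesses rather than the script's documented behaviour.

```python
from typing import Optional, Tuple


def parse_page_range(spec: str) -> Tuple[int, Optional[int]]:
    """Parse 'start-end' or 'start-' into (start, end).

    end is None when the range is open-ended (assumed convention).
    """
    start_text, sep, end_text = spec.partition("-")
    if not sep or not start_text.strip():
        raise ValueError(f"expected start-end or start-, got {spec!r}")
    start = int(start_text)
    end = int(end_text) if end_text.strip() else None
    # Assumes pages are numbered from 1 and that end must not precede start.
    if start < 1 or (end is not None and end < start):
        raise ValueError(f"invalid page range: {spec!r}")
    return start, end
```

With this sketch, a value such as 50-150 yields a closed window, while 10- leaves the end open so the caller can read through the last page.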

Run the CLI with --help to verify the script starts without errors:

    python pdf_to_text.py --help

For more thorough testing you can write automated tests that call
convert_pdf_to_text with a short PDF fixture (e.g., created by fpdf or
reportlab).
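
The --help check can also be automated as a quick smoke test. The helper below is a hypothetical sketch that uses only the standard library; it assumes the script prints its help text to standard output and exits with status 0, as argparse-based CLIs normally do.

```python
import subprocess
import sys


def help_exits_cleanly(script_path: str) -> bool:
    """Return True if `python script_path --help` exits 0 and prints text."""
    result = subprocess.run(
        [sys.executable, script_path, "--help"],
        capture_output=True,
        text=True,
        timeout=30,
    )
    return result.returncode == 0 and bool(result.stdout.strip())
```

In a test suite this could be used as `assert help_exits_cleanly("pdf_to_text.py")`, run from the repository root.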