A CLI utility for streaming large PDFs into text using pdfminer.six, featuring page-range selection, LAParams tuning, spinner-based progress feedback, metadata summaries (sizes/duration), and pytest coverage—ideal for low-memory batch conversions.

LiteObject/pdf-to-text

PDF to Text

A tiny CLI utility to stream large PDF files into plain text without loading the entire file into memory. It wraps pdfminer.six with page-based iteration, configurable LAParams, a friendly CLI spinner, and safe logging so you can batch-process enormous PDFs. When conversion finishes, the CLI prints a summary showing file sizes and elapsed time.
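As an illustration of the closing summary, here is a minimal sketch of how file sizes and elapsed time could be reported. The helper names `format_size` and `build_summary` and the exact output wording are assumptions made for this example, not the script's actual code.

```python
import os
import time


def format_size(num_bytes: int) -> str:
    """Render a byte count with a binary unit suffix (hypothetical helper)."""
    size = float(num_bytes)
    for unit in ("B", "KiB", "MiB", "GiB"):
        # Stop at the first unit that keeps the value below 1024,
        # or at the largest unit we know about.
        if size < 1024 or unit == "GiB":
            break
        size /= 1024
    return f"{int(size)} B" if unit == "B" else f"{size:.1f} {unit}"


def build_summary(input_path: str, output_path: str, elapsed_seconds: float) -> str:
    """Describe input size, output size, and duration in one line (hypothetical)."""
    in_size = format_size(os.path.getsize(input_path))
    out_size = format_size(os.path.getsize(output_path))
    return f"{input_path} ({in_size}) -> {output_path} ({out_size}) in {elapsed_seconds:.2f}s"


def timed(action):
    """Run a zero-argument callable and return (result, elapsed seconds)."""
    started = time.perf_counter()
    result = action()
    return result, time.perf_counter() - started
```

A caller could wrap the conversion in `timed(...)` and pass the measured duration to `build_summary` once the output file has been closed.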

Installation

  1. Create or activate your Python virtual environment (the repository already contains .venv/).
  2. Install the requirements:
pip install -r requirements.txt

Usage

python pdf_to_text.py INPUT_PDF [-o OUTPUT_TXT] [OPTIONS]

Examples

Convert an entire PDF:

python pdf_to_text.py documents/manual.pdf

Extract a range of pages, overwriting the output file if it already exists:

python pdf_to_text.py big-output.pdf --page-range 50-150 \
    --output extracted.txt --overwrite
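For readers curious how a range string such as 50-150 might be interpreted, the following is a rough sketch only. The function name `parse_page_range`, the 1-based inclusive bounds, and the error messages are assumptions for illustration; the script's real parser may differ.

```python
from typing import Optional, Tuple


def parse_page_range(spec: str) -> Tuple[int, Optional[int]]:
    """Parse 'START-END' or 'START-' into (start, end) bounds.

    Assumption: pages are 1-based and both bounds are inclusive.
    An end of None means "through the last page".
    """
    start_text, separator, end_text = spec.strip().partition("-")
    if not separator or not start_text:
        raise ValueError(f"expected START-END or START-, got {spec!r}")
    start = int(start_text)  # raises ValueError on non-numeric input
    end = int(end_text) if end_text else None
    if start < 1:
        raise ValueError("page numbers start at 1")
    if end is not None and end < start:
        raise ValueError("range end must not be smaller than range start")
    return start, end


def page_in_range(page_number: int, bounds: Tuple[int, Optional[int]]) -> bool:
    """Return True when a 1-based page number falls inside the parsed bounds."""
    start, end = bounds
    return page_number >= start and (end is None or page_number <= end)
```

Note that pdfminer.six itself takes zero-indexed page numbers, so a wrapper built on bounds like these would subtract one before passing pages to the library.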

When processing PDFs in chunks, append each chunk's text to an existing transcript. The CLI shows a lightweight animation while it works:

python pdf_to_text.py another-chunk.pdf --append --output extracted.txt
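The animation mentioned above can be approximated with a small background thread. This is a sketch under assumptions: the `Spinner` class, its frames, and its parameters are invented for illustration and are not taken from the script.

```python
import itertools
import sys
import threading


class Spinner:
    """Context manager that animates a status line until the block exits."""

    def __init__(self, message="Converting", stream=None, interval=0.1, enabled=True):
        self.message = message
        self.stream = stream if stream is not None else sys.stderr
        self.interval = interval
        self.enabled = enabled  # set to False for a quiet or no-spinner mode
        self._stop = threading.Event()
        self._thread = None

    def _spin(self):
        for frame in itertools.cycle("|/-\\"):
            # Carriage return redraws the same terminal line each tick.
            self.stream.write(f"\r{self.message} {frame}")
            self.stream.flush()
            if self._stop.wait(self.interval):
                break
        # Blank out the status line so later output starts clean.
        self.stream.write("\r" + " " * (len(self.message) + 2) + "\r")
        self.stream.flush()

    def __enter__(self):
        if self.enabled:
            self._thread = threading.Thread(target=self._spin, daemon=True)
            self._thread.start()
        return self

    def __exit__(self, exc_type, exc, traceback):
        if self._thread is not None:
            self._stop.set()
            self._thread.join()
        return False  # never swallow exceptions from the wrapped block
```

Wrapping the conversion call in `with Spinner(enabled=not quiet):` keeps the animation off the main thread and guarantees cleanup even if the conversion raises.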

Helpful flags

  • --page-range: specify start-end to control the page window (e.g., 10- for everything from page 10 to the end of the document).
  • --encoding: control the output text encoding (default utf-8).
  • --char-margin, --line-margin, --word-margin, --boxes-flow, --detect-vertical: customize pdfminer.six layout heuristics when dealing with complex columns or rotated text.
  • --quiet / --log-level: mute or raise logging verbosity.
  • --no-spinner: disable the CLI animation (it is automatically muted when --quiet is used).
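To show how the layout flags listed above could reach pdfminer.six, here is a sketch that collects them with argparse into keyword arguments for `LAParams`. The function names are hypothetical, and the defaults shown are pdfminer.six's own `LAParams` defaults; the script may choose different ones.

```python
import argparse


def build_layout_parser() -> argparse.ArgumentParser:
    """Declare only the layout-related flags (hypothetical subset of the CLI)."""
    parser = argparse.ArgumentParser(add_help=False)
    parser.add_argument("--char-margin", type=float, default=2.0)
    parser.add_argument("--line-margin", type=float, default=0.5)
    parser.add_argument("--word-margin", type=float, default=0.1)
    parser.add_argument("--boxes-flow", type=float, default=0.5)
    parser.add_argument("--detect-vertical", action="store_true")
    return parser


def laparams_kwargs(args: argparse.Namespace) -> dict:
    """Map parsed flags onto LAParams keyword names."""
    return {
        "char_margin": args.char_margin,
        "line_margin": args.line_margin,
        "word_margin": args.word_margin,
        "boxes_flow": args.boxes_flow,
        "detect_vertical": args.detect_vertical,
    }


# With pdfminer.six installed, the mapping would be consumed like this:
#   from pdfminer.layout import LAParams
#   laparams = LAParams(**laparams_kwargs(args))
```

Keeping the mapping in a plain dictionary makes it easy to unit-test the flag handling without importing pdfminer.six.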

Testing

Run the CLI with --help to verify the script starts without errors:

python pdf_to_text.py --help

For more thorough testing, write automated tests that call convert_pdf_to_text with a short PDF fixture (e.g., one created with fpdf or reportlab).
