A Python utility to extract text from PDF files, including scanned PDFs using OCR.
- Extract text from regular (text-based) PDFs
- OCR support for scanned PDFs using Tesseract
- Conservative table data extraction to CSV format
- Handles OCR errors gracefully by leaving blank cells for manual entry
- Works with any product type, not just Apple products
- Table structure analysis with completion statistics
- Preserves original raw text for verification
- Batch processing for multiple PDFs
- Command-line interface
- Save output to files or print to console
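The split between text-based and scanned PDFs listed above can be pictured as a per-page fallback: use the embedded text layer when there is one, and run OCR only when the page looks empty. The following is a minimal sketch of that idea, not this tool's actual implementation. The function name `extract_page_text`, the `run_ocr` callback, and the `min_chars` threshold are all hypothetical. In real use, `native_text` would typically come from PyMuPDF's `page.get_text()` and `run_ocr` would wrap `pytesseract.image_to_string`.

```python
from typing import Callable


def extract_page_text(
    native_text: str,
    run_ocr: Callable[[], str],
    use_ocr: bool = True,
    min_chars: int = 20,
) -> str:
    """Return the embedded text layer, or OCR output when the layer looks empty.

    min_chars is a hypothetical threshold chosen for illustration,
    not a value taken from this tool.
    """
    stripped = native_text.strip()
    # Enough embedded text, or OCR disabled (like --no-ocr): use the text layer.
    if len(stripped) >= min_chars or not use_ocr:
        return stripped
    # Little or no embedded text: treat the page as a scanned image.
    return run_ocr().strip()


# Stand-in callbacks replace real PyMuPDF / pytesseract calls here.
text_page = extract_page_text("Invoice 1001, total due 42.00 USD", lambda: "unused")
scanned_page = extract_page_text("", lambda: " text recovered by OCR ")
print(text_page)
print(scanned_page)
```

Passing the OCR step in as a callback means it only runs when needed, which matters because OCR is usually far slower than reading an existing text layer.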
- Install Python dependencies:
pip install -r requirements.txt
- Install Tesseract OCR:
macOS:
brew install tesseract
Ubuntu/Debian:
sudo apt-get install tesseract-ocr
Windows: download and install from https://github.com/UB-Mannheim/tesseract/wiki
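After installing, it can help to confirm that the Tesseract executable is actually reachable before running OCR. The snippet below is a small sketch using only the standard library. The helper name `find_tesseract` is hypothetical and not part of this project. If it returns `None`, the full path to the executable can be supplied through the `--tesseract-cmd` option instead.

```python
import shutil
from typing import Optional


def find_tesseract(explicit_path: Optional[str] = None) -> Optional[str]:
    """Return the path to a tesseract executable, or None if none is found."""
    # shutil.which accepts either a bare command name or a full path.
    return shutil.which(explicit_path or "tesseract")


if __name__ == "__main__":
    path = find_tesseract()
    if path:
        print(f"Tesseract found at {path}")
    else:
        print("Tesseract not found on PATH")
```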
Extract text from a single PDF:
python pdf_text_extractor.py path/to/document.pdf
Save extracted text to a file:
python pdf_text_extractor.py path/to/document.pdf -o output.txt
Process all PDFs in a directory:
python pdf_text_extractor.py path/to/pdf_directory --batch -o output_directory
Disable OCR (text-based PDFs only):
python pdf_text_extractor.py path/to/document.pdf --no-ocr
Extract table data to CSV (auto-names as document.csv):
python pdf_text_extractor.py path/to/document.pdf --csv
Or specify custom output name:
python pdf_text_extractor.py path/to/document.pdf --csv -o custom_name.csv
Analyze table structure:
python pdf_text_extractor.py path/to/document.pdf --analyze
from pdf_text_extractor import PDFTextExtractor
# Initialize extractor
extractor = PDFTextExtractor()
# Extract text from a PDF
text = extractor.extract_text_from_pdf('document.pdf')
print(text)
# Save to file
extractor.extract_to_file('document.pdf', 'output.txt')
# Batch process
processed = extractor.batch_extract('pdf_folder', 'output_folder')
# Extract table data to CSV (written to invoice.csv)
extractor.extract_tables_to_csv('invoice.pdf', 'invoice.csv')
# Analyze table structure
analysis = extractor.analyze_table_structure('invoice.pdf')
print(f"Found {analysis['total_rows']} table rows")
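Because unreadable cells are left blank for manual entry, it can be useful to measure how complete an extracted CSV is. The sketch below computes simple completion statistics with the standard `csv` module. It is an illustration only: the function `completion_stats`, its result keys other than `total_rows`, and the sample column names are assumptions, and the dictionary returned by `analyze_table_structure` may be shaped differently.

```python
import csv
import io
from typing import Dict, TextIO


def completion_stats(csv_file: TextIO) -> Dict[str, float]:
    """Count filled and blank cells in the data rows of a CSV file."""
    reader = csv.reader(csv_file)
    next(reader, None)  # skip the header row, if any
    total_rows = 0
    total_cells = 0
    filled_cells = 0
    for row in reader:
        total_rows += 1
        total_cells += len(row)
        # A cell counts as filled when it holds any non-whitespace text.
        filled_cells += sum(1 for cell in row if cell.strip())
    percent = 100.0 * filled_cells / total_cells if total_cells else 0.0
    return {
        "total_rows": total_rows,
        "filled_cells": filled_cells,
        "blank_cells": total_cells - filled_cells,
        "percent_complete": percent,
    }


# Made-up sample data: the second row has a blank quantity cell.
sample = io.StringIO("Item,Qty,Price\nWidget,2,9.99\nGadget,,4.50\n")
stats = completion_stats(sample)
print(f"{stats['blank_cells']} blank cell(s), {stats['percent_complete']:.1f}% complete")
```

To check a real output file, open it with `open('invoice.csv', newline='')` and pass the file object in place of the `io.StringIO` sample.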
- -o, --output: Output file or directory path
- --csv: Extract table data to CSV format
- --analyze: Analyze table structure
- --no-ocr: Disable OCR for scanned pages
- --batch: Process all PDFs in directory
- --tesseract-cmd: Path to tesseract executable
- -v, --verbose: Enable verbose logging
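For readers who want to see how these options fit together, here is a sketch of an `argparse` parser that accepts the same flags. It is not the tool's actual parser. The positional argument name `input`, the help strings, and the choice of `store_true` for the switches are assumptions made for illustration.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser mirroring the documented options (illustrative only)."""
    parser = argparse.ArgumentParser(
        prog="pdf_text_extractor.py",
        description="Extract text or table data from PDF files.",
    )
    # Hypothetical positional name; the real script may call it something else.
    parser.add_argument("input", help="PDF file, or a directory with --batch")
    parser.add_argument("-o", "--output", help="Output file or directory path")
    parser.add_argument("--csv", action="store_true",
                        help="Extract table data to CSV format")
    parser.add_argument("--analyze", action="store_true",
                        help="Analyze table structure")
    parser.add_argument("--no-ocr", action="store_true",
                        help="Disable OCR for scanned pages")
    parser.add_argument("--batch", action="store_true",
                        help="Process all PDFs in directory")
    parser.add_argument("--tesseract-cmd", help="Path to tesseract executable")
    parser.add_argument("-v", "--verbose", action="store_true",
                        help="Enable verbose logging")
    return parser


# Parse an explicit argument list instead of sys.argv for demonstration.
args = build_parser().parse_args(["invoice.pdf", "--csv", "-o", "custom_name.csv"])
print(args.input, args.csv, args.output, args.no_ocr)
```

Note that `argparse` turns dashes in option names into underscores, so `--no-ocr` is read as `args.no_ocr` and `--tesseract-cmd` as `args.tesseract_cmd`.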
- Python 3.6+
- PyMuPDF (fitz)
- pytesseract
- Pillow (PIL)
- pandas
- Tesseract OCR engine
Want to use this on a machine without Python installed? You can build a standalone Windows executable:
- Simple build:
python build_simple.py
- Full featured build:
build_windows.bat
- Find your executable in:
dist/pdf-text-extractor.exe
# Extract table to CSV (auto-names as invoice.csv)
pdf-text-extractor.exe "invoice.pdf" --csv
# Analyze table structure
pdf-text-extractor.exe "invoice.pdf" --analyze
See BUILD_WINDOWS_EXE.md for detailed instructions.