PDF Text Extractor

A Python utility to extract text from PDF files, including scanned PDFs using OCR.

Features

- Extract text from regular (text-based) PDFs
- OCR support for scanned PDFs using Tesseract
- Conservative table data extraction to CSV format
- Handles OCR errors gracefully, leaving blank cells for manual entry
- Works with any product type, not just Apple products
- Table structure analysis with completion statistics
- Preserves original raw text for verification
- Batch processing for multiple PDFs
- Command-line interface
- Save output to files or print to console
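
The combination of text-based extraction and OCR can be pictured as a per-page fallback. The sketch below is illustrative only and is not the project's actual code: the function names, the `min_chars` threshold, and the 300 DPI render setting are assumptions. It relies on PyMuPDF's `Page.get_text()` and `Page.get_pixmap()` and on `pytesseract.image_to_string()`.

```python
def page_needs_ocr(embedded_text: str, min_chars: int = 20) -> bool:
    """Heuristic (assumed): a page with almost no embedded text is a scan."""
    return len(embedded_text.strip()) < min_chars


def extract_page_text(page, use_ocr: bool = True, dpi: int = 300) -> str:
    """Return embedded text, falling back to OCR for image-only pages.

    `page` is expected to be a PyMuPDF page (e.g. from `fitz.open(path)`).
    """
    text = page.get_text()
    if not use_ocr or not page_needs_ocr(text):
        return text

    # Imported lazily so text-only extraction works without the OCR stack.
    import io

    import pytesseract
    from PIL import Image

    # Render the page to a PNG in memory and hand it to Tesseract.
    pixmap = page.get_pixmap(dpi=dpi)
    image = Image.open(io.BytesIO(pixmap.tobytes("png")))
    return pytesseract.image_to_string(image)
```

Keeping the "does this page need OCR" decision in its own small function makes the threshold easy to tune and to test without any PDF at hand.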

Installation

Install Python dependencies:

pip install -r requirements.txt

Install Tesseract OCR:

macOS:

brew install tesseract

Ubuntu/Debian:

sudo apt-get install tesseract-ocr

Windows:

Download and run the installer from https://github.com/UB-Mannheim/tesseract/wiki
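
After installing, it can help to confirm that the Tesseract binary is reachable before running OCR. The following is a minimal sketch using only the standard library; the helper names are hypothetical and not part of this project. The optional `explicit_path` argument plays the same role as the `--tesseract-cmd` option.

```python
import shutil
import subprocess
from typing import Optional


def find_tesseract(explicit_path: Optional[str] = None) -> Optional[str]:
    """Return the resolved path to the tesseract binary, or None if not found."""
    return shutil.which(explicit_path or "tesseract")


def tesseract_version(path: str) -> str:
    """Return the first line printed by `tesseract --version`."""
    result = subprocess.run(
        [path, "--version"],
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # some builds print the version to stderr
        universal_newlines=True,
        check=True,
    )
    lines = result.stdout.splitlines()
    return lines[0] if lines else ""
```

If `find_tesseract()` returns `None`, either Tesseract is not on `PATH` or its location needs to be passed explicitly.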

Usage

Command Line Interface

Extract text from a single PDF:

python pdf_text_extractor.py path/to/document.pdf

Save extracted text to a file:

python pdf_text_extractor.py path/to/document.pdf -o output.txt

Process all PDFs in a directory:

python pdf_text_extractor.py path/to/pdf_directory --batch -o output_directory

Disable OCR (text-based PDFs only):

python pdf_text_extractor.py path/to/document.pdf --no-ocr

Extract table data to CSV (auto-names as document.csv):

python pdf_text_extractor.py path/to/document.pdf --csv

Or specify custom output name:

python pdf_text_extractor.py path/to/document.pdf --csv -o custom_name.csv
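
To illustrate what "conservative" extraction with blank cells could look like, here is a small sketch. It is not the project's parser: the column names, the price pattern, and the helper names are assumptions chosen for the example. A value that does not look like a clean price is written as an empty cell rather than guessed.

```python
import csv
import io
import re

# Accepts values such as "1299", "$1,299.00" or "19.99"; anything else is rejected.
PRICE_PATTERN = re.compile(r"^\$?\d+(?:,\d{3})*(?:\.\d{2})?$")


def clean_price(raw: str) -> str:
    """Return the price unchanged if it parses cleanly, else an empty string."""
    candidate = raw.strip()
    return candidate if PRICE_PATTERN.match(candidate) else ""


def rows_to_csv(rows) -> str:
    """Write (description, raw_price) pairs to CSV text, blanking doubtful prices."""
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow(["description", "price"])  # hypothetical column names
    for description, raw_price in rows:
        writer.writerow([description.strip(), clean_price(raw_price)])
    return buffer.getvalue()
```

An OCR misread such as `$l9.9S` fails the pattern, so the cell stays blank and can be filled in by hand against the raw text.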

Analyze table structure:

python pdf_text_extractor.py path/to/document.pdf --analyze
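
As a rough idea of what completion statistics can mean for an extracted table, the sketch below counts how many cells in a CSV are filled. It is an assumption-based example, not the output format of `--analyze`; apart from `total_rows`, the dictionary keys are hypothetical.

```python
import csv
import io


def completion_stats(csv_text: str) -> dict:
    """Count filled versus blank cells in CSV text that has a header row."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader, [])
    rows = [row for row in reader if row]  # skip completely empty lines

    total_cells = len(rows) * len(header)
    filled_cells = sum(
        1 for row in rows for cell in row[: len(header)] if cell.strip()
    )
    pct = round(100.0 * filled_cells / total_cells, 1) if total_cells else 0.0

    return {
        "total_rows": len(rows),
        "total_cells": total_cells,
        "filled_cells": filled_cells,
        "completion_pct": pct,
    }
```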

Python API

from pdf_text_extractor import PDFTextExtractor

# Initialize extractor
extractor = PDFTextExtractor()

# Extract text from a PDF
text = extractor.extract_text_from_pdf('document.pdf')
print(text)

# Save to file
extractor.extract_to_file('document.pdf', 'output.txt')

# Batch process
processed = extractor.batch_extract('pdf_folder', 'output_folder')

# Extract table data to CSV (output path given explicitly)
extractor.extract_tables_to_csv('invoice.pdf', 'invoice.csv')

# Analyze table structure
analysis = extractor.analyze_table_structure('invoice.pdf')
print(f"Found {analysis['total_rows']} table rows")
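
When one unreadable PDF should not stop a whole run, a per-file loop around `extract_to_file` is one option. The sketch below is hypothetical: `extract_each` is not part of this project, and it assumes `extract_to_file` raises an exception on failure, which this README does not confirm.

```python
from pathlib import Path


def extract_each(extractor, pdf_dir, out_dir):
    """Call extractor.extract_to_file for each PDF and collect failures."""
    out_path = Path(out_dir)
    out_path.mkdir(parents=True, exist_ok=True)

    succeeded, failed = [], []
    for pdf in sorted(Path(pdf_dir).glob("*.pdf")):
        target = out_path / (pdf.stem + ".txt")
        try:
            extractor.extract_to_file(str(pdf), str(target))
            succeeded.append(pdf.name)
        except Exception as exc:  # assumed failure mode; keep going on errors
            failed.append((pdf.name, str(exc)))
    return succeeded, failed
```

Usage would look like `ok, failed = extract_each(PDFTextExtractor(), 'pdf_folder', 'output_folder')`, after which `failed` lists the files to inspect manually.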

Options

- `-o`, `--output`: Output file or directory path
- `--csv`: Extract table data to CSV format
- `--analyze`: Analyze table structure
- `--no-ocr`: Disable OCR for scanned pages
- `--batch`: Process all PDFs in directory
- `--tesseract-cmd`: Path to tesseract executable
- `-v`, `--verbose`: Enable verbose logging
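
For readers wrapping or scripting the tool, the options above map naturally onto an `argparse` parser. This is a sketch of how such a parser might be declared, not the project's actual source; the positional argument name `input` and the help strings are assumptions.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser that mirrors the documented command-line options."""
    parser = argparse.ArgumentParser(
        prog="pdf_text_extractor.py",
        description="Extract text and table data from PDF files.",
    )
    # Positional name is hypothetical; the README only shows a path argument.
    parser.add_argument("input", help="PDF file, or a directory with --batch")
    parser.add_argument("-o", "--output", help="Output file or directory path")
    parser.add_argument("--csv", action="store_true", help="Extract table data to CSV")
    parser.add_argument("--analyze", action="store_true", help="Analyze table structure")
    parser.add_argument("--no-ocr", action="store_true", help="Disable OCR for scanned pages")
    parser.add_argument("--batch", action="store_true", help="Process all PDFs in directory")
    parser.add_argument("--tesseract-cmd", help="Path to tesseract executable")
    parser.add_argument("-v", "--verbose", action="store_true", help="Enable verbose logging")
    return parser
```

Parsing `["invoice.pdf", "--csv", "-o", "custom_name.csv"]` with this parser yields `args.csv` set to `True` and `args.output` set to `"custom_name.csv"`.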

Requirements

- Python 3.6+
- PyMuPDF (fitz)
- pytesseract
- Pillow (PIL)
- pandas
- Tesseract OCR engine
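
A quick way to see which Python requirements are missing from the current environment is to look up each import name. The helper names below are hypothetical; the package-to-module mapping follows the list above (PyMuPDF imports as `fitz`, Pillow as `PIL`). The Tesseract engine itself is a separate binary and is not covered by this check.

```python
import importlib.util
import sys

# Package name (as installed with pip) mapped to its import name.
REQUIRED_MODULES = {
    "PyMuPDF": "fitz",
    "pytesseract": "pytesseract",
    "Pillow": "PIL",
    "pandas": "pandas",
}


def missing_packages() -> list:
    """Return the pip package names whose modules cannot be found."""
    return [
        package
        for package, module in REQUIRED_MODULES.items()
        if importlib.util.find_spec(module) is None
    ]


def python_is_supported(minimum=(3, 6)) -> bool:
    """Check the running interpreter against the minimum version."""
    return sys.version_info >= minimum
```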

Windows Executable

To use the tool without installing Python, you can build a standalone Windows executable.

Build Executable (Windows only)

Simple build:
```
python build_simple.py
```
Full-featured build:
```
build_windows.bat
```
Find your executable in:
```
dist/pdf-text-extractor.exe
```

Use Executable

# Extract table to CSV (auto-names as invoice.csv)
pdf-text-extractor.exe "invoice.pdf" --csv

# Analyze table structure
pdf-text-extractor.exe "invoice.pdf" --analyze

See BUILD_WINDOWS_EXE.md for detailed instructions.

