PDF Python Reader

A tool for extracting and analyzing text from PDF documents with natural language processing capabilities.

Features

PDF text extraction using PyMuPDF
Text tokenization and embedding generation
OCR for image-based PDFs using Tesseract
Document classification using pre-trained models
Pattern-based field extraction from insurance documents
Embedding storage in CSV format

Installation

# Clone the repository
git clone https://github.com/yourusername/PDF-Python-Reader.git
cd PDF-Python-Reader

# Install dependencies
pip install -r requirements.txt
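
The OCR path relies on two system tools that pip does not install: the Tesseract engine (called by pytesseract) and Poppler (called by pdf2image). A minimal sketch for checking that both are on the PATH before running the processor is shown below. The helper name `missing_tools` and the `REQUIRED_TOOLS` list are illustrative and not part of this repository.

```python
import shutil

# Assumed executables: "tesseract" is the Tesseract OCR engine used by
# pytesseract, and "pdftoppm" ships with Poppler, which pdf2image calls.
REQUIRED_TOOLS = ["tesseract", "pdftoppm"]


def missing_tools(tools=REQUIRED_TOOLS):
    """Return the names in `tools` that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]
```

Calling `missing_tools()` returns an empty list when both tools are available; any names it returns need to be installed through the operating system's package manager.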

Usage

Basic PDF Reading

python pdf_reader.py

This will:

Read the first page of the default PDF document
Generate token embeddings using a pre-trained model
Save the embeddings to CSV
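
The three steps above can be sketched roughly as follows. This is not the code in pdf_reader.py, only an illustration under stated assumptions: the function names are hypothetical, `bert-base-uncased` is an assumed model name, and the CSV is written with the standard-library `csv` module rather than Pandas so the sketch stays small. Heavy libraries are imported inside the functions so the CSV helper can be used on its own.

```python
import csv


def read_first_page(pdf_path):
    """Return the text of the first page using PyMuPDF."""
    import fitz  # PyMuPDF

    doc = fitz.open(pdf_path)
    try:
        return doc[0].get_text()
    finally:
        doc.close()


def embed_tokens(text, model_name="bert-base-uncased"):
    """Return (tokens, vectors) from a pre-trained model.

    The model name is an assumption; the project may use a different one.
    """
    import torch
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name)
    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        # Shape (sequence_length, hidden_size) after dropping the batch axis.
        hidden = model(**inputs).last_hidden_state[0]
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0].tolist())
    return tokens, hidden.tolist()


def save_embeddings_csv(tokens, vectors, csv_path):
    """Write one row per token: the token followed by its vector values."""
    width = len(vectors[0]) if vectors else 0
    header = ["token"] + [f"dim_{i}" for i in range(width)]
    with open(csv_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(header)
        for token, vector in zip(tokens, vectors):
            writer.writerow([token, *vector])
```

A typical call chain would be `text = read_first_page(path)`, then `tokens, vectors = embed_tokens(text)`, then `save_embeddings_csv(tokens, vectors, "embeddings.csv")`, where the output filename is also an assumption.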

Insurance Document Processing

python main.py path/to/sample.pdf

This will:

Convert the PDF to images and extract text via OCR
Classify the document type
Extract key fields like policy numbers and claim information
Output a JSON summary
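
A rough sketch of the OCR, field extraction, and JSON steps is given below. It is not the code in main.py: the function names, the regular expressions, and the JSON keys are all assumptions made for illustration, and the document type is passed in as a plain string instead of being produced by a classifier.

```python
import json
import re

# Hypothetical patterns; real insurance documents will need their own.
FIELD_PATTERNS = {
    "policy_number": re.compile(
        r"Policy\s*(?:Number|No\.?|#)\s*[:\-]?\s*([A-Z0-9\-]+)", re.IGNORECASE
    ),
    "claim_number": re.compile(
        r"Claim\s*(?:Number|No\.?|#)\s*[:\-]?\s*([A-Z0-9\-]+)", re.IGNORECASE
    ),
}


def ocr_pdf(pdf_path):
    """Render each page to an image with pdf2image, then OCR it."""
    from pdf2image import convert_from_path
    import pytesseract

    pages = convert_from_path(pdf_path)
    return "\n".join(pytesseract.image_to_string(page) for page in pages)


def extract_fields(text):
    """Return the first match for each pattern, or None when absent."""
    fields = {}
    for name, pattern in FIELD_PATTERNS.items():
        match = pattern.search(text)
        fields[name] = match.group(1) if match else None
    return fields


def summarize(text, document_type):
    """Build the JSON summary from OCR text and a document type label."""
    summary = {"document_type": document_type, "fields": extract_fields(text)}
    return json.dumps(summary, indent=2)
```

With these helpers, `summarize(ocr_pdf("path/to/sample.pdf"), "claim_form")` would produce a JSON string; the label `"claim_form"` stands in for whatever the classification step returns.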

Project Structure

main.py - Insurance document processor with OCR and classification
pdf_reader.py - Basic PDF text extraction and embedding generation
text_embeddings.py - Tokenization and embedding utilities
helpers/ - Additional utility scripts
pdf_documents/ - Sample PDF files for testing

Dependencies

PyMuPDF (fitz)
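pytesseract (requires the Tesseract OCR engine to be installed separately)
pdf2image (requires Poppler to be installed separately)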
Transformers (Hugging Face)
Pandas
PyTorch

License

MIT License
