SebbyC/PDF-Python-Token-Embedder

PDF Python Reader

A tool for extracting and analyzing text from PDF documents with natural language processing capabilities.

Features

  • PDF text extraction using PyMuPDF
  • Text tokenization and embedding generation
  • OCR for image-based PDFs using Tesseract
  • Document classification using pre-trained models
  • Pattern-based field extraction from insurance documents
  • Embedding storage in CSV format

Installation

# Clone the repository
git clone https://github.com/yourusername/PDF-Python-Reader.git
cd PDF-Python-Reader

# Install dependencies
pip install -r requirements.txt

Usage

Basic PDF Reading

python pdf_reader.py

This will:

  1. Read the first page of the default PDF document
  2. Generate token embeddings using a pre-trained model
  3. Save the embeddings to CSV
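
The three steps above can be sketched roughly as follows. This is a minimal illustration, not the contents of pdf_reader.py: the function names, the `bert-base-uncased` model, and the CSV column layout are all assumptions, and the standard-library `csv` module stands in for Pandas to keep the example small.

```python
import csv


def read_first_page(pdf_path):
    """Return the plain text of the first page of a PDF via PyMuPDF."""
    import fitz  # PyMuPDF; imported lazily so the CSV helper works without it

    with fitz.open(pdf_path) as doc:
        return doc[0].get_text()


def embed_tokens(text, model_name="bert-base-uncased"):
    """Return (tokens, vectors) with one embedding vector per token.

    The model name is an assumption; any Hugging Face encoder should work.
    """
    import torch
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name)
    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]  # (num_tokens, hidden_dim)
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0].tolist())
    return tokens, hidden.tolist()


def save_embeddings_csv(tokens, vectors, csv_path):
    """Write one row per token: the token text followed by its vector."""
    dim = len(vectors[0]) if vectors else 0
    with open(csv_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["token"] + [f"dim_{i}" for i in range(dim)])
        for token, vector in zip(tokens, vectors):
            writer.writerow([token, *vector])
```

Under these assumptions, `save_embeddings_csv(*embed_tokens(read_first_page(path)), "embeddings.csv")` would chain the three steps for a PDF at `path`.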

Insurance Document Processing

python main.py path/to/sample.pdf

This will:

  1. Convert the PDF to images and extract text via OCR
  2. Classify the document type
  3. Extract key fields like policy numbers and claim information
  4. Output a JSON summary
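
As a rough sketch of how these four steps might fit together (not the actual code in main.py): the function names, the regular expressions, the candidate labels, and the shape of the JSON summary below are hypothetical, and zero-shot classification is only one possible way to use a pre-trained model here.

```python
import json
import re

# Hypothetical patterns; real insurance documents will need tuning.
FIELD_PATTERNS = {
    "policy_number": re.compile(
        r"Policy[ \t]*(?:Number|No\.?|#)[ \t]*[:\-]?[ \t]*([A-Z0-9\-]+)", re.IGNORECASE
    ),
    "claim_number": re.compile(
        r"Claim[ \t]*(?:Number|No\.?|#)[ \t]*[:\-]?[ \t]*([A-Z0-9\-]+)", re.IGNORECASE
    ),
}


def ocr_pdf(pdf_path):
    """Render each page to an image and run Tesseract OCR on it."""
    import pytesseract
    from pdf2image import convert_from_path

    pages = convert_from_path(pdf_path)  # list of PIL images, one per page
    return "\n".join(pytesseract.image_to_string(page) for page in pages)


def classify_document(text, labels=("policy", "claim form", "invoice")):
    """Pick the best-matching label using a zero-shot pipeline (an assumption)."""
    from transformers import pipeline

    classifier = pipeline("zero-shot-classification")
    result = classifier(text[:2000], candidate_labels=list(labels))
    return result["labels"][0]  # labels are sorted by descending score


def extract_fields(text):
    """Return the first match for each pattern, or None if it is absent."""
    fields = {}
    for name, pattern in FIELD_PATTERNS.items():
        match = pattern.search(text)
        fields[name] = match.group(1) if match else None
    return fields


def build_summary(document_type, fields):
    """Serialize the classification and extracted fields as JSON."""
    return json.dumps({"document_type": document_type, "fields": fields}, indent=2)
```

The OCR and classification helpers import their libraries lazily, so the field extraction and JSON steps can be tried on plain text without Tesseract or a model download.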

Project Structure

  • main.py - Insurance document processor with OCR and classification
  • pdf_reader.py - Basic PDF text extraction and embedding generation
  • text_embeddings.py - Tokenization and embedding utilities
  • helpers/ - Additional utility scripts
  • pdf_documents/ - Sample PDF files for testing

Dependencies

  • PyMuPDF (fitz)
  • PyTesseract
  • PDF2Image
  • Transformers (Hugging Face)
  • Pandas
  • PyTorch
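
PyTesseract and PDF2Image are wrappers around external programs (the Tesseract OCR engine and Poppler, respectively), which `pip` does not install. The following check is a small sketch for confirming that everything is available; the function name and the report format are assumptions, and `pdftoppm` is used as the marker for a Poppler installation.

```python
import importlib.util
import shutil

# Display name -> importable module name for each listed dependency.
PYTHON_MODULES = {
    "PyMuPDF": "fitz",
    "PyTesseract": "pytesseract",
    "PDF2Image": "pdf2image",
    "Transformers": "transformers",
    "Pandas": "pandas",
    "PyTorch": "torch",
}

# Display name -> executable expected on PATH.
SYSTEM_BINARIES = {
    "Tesseract OCR": "tesseract",
    "Poppler": "pdftoppm",
}


def check_dependencies():
    """Return a dict mapping each dependency name to True if it was found."""
    report = {}
    for label, module in PYTHON_MODULES.items():
        # find_spec locates the module without importing it.
        report[label] = importlib.util.find_spec(module) is not None
    for label, binary in SYSTEM_BINARIES.items():
        report[label] = shutil.which(binary) is not None
    return report


for name, found in check_dependencies().items():
    print(f"{name}: {'ok' if found else 'missing'}")
```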

License

MIT License

About

A lightweight Python utility that extracts text from PDF documents and converts that text into numerical embeddings for downstream ML or NLP tasks. It uses PyMuPDF (fitz) to load and parse PDF files.