A tool for extracting and analyzing text from PDF documents with natural language processing capabilities.
- PDF text extraction using PyMuPDF
- Text tokenization and embedding generation
- OCR for image-based PDFs using Tesseract
- Document classification using pre-trained models
- Pattern-based field extraction from insurance documents
- Embedding storage in CSV format
# Clone the repository
git clone https://github.com/yourusername/PDF-Python-Reader.git
cd PDF-Python-Reader
# Install dependencies
pip install -r requirements.txt

python pdf_reader.py

This will:
- Read the first page of the default PDF document
- Generate token embeddings using a pre-trained model
- Save the embeddings to CSV
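
The three steps above can be sketched roughly as follows. This is a minimal illustration, not the contents of pdf_reader.py: the function names, the `bert-base-uncased` model, and the CSV column layout (`token`, `dim_0`, `dim_1`, ...) are all assumptions made for the example. The CSV is written with the standard library here; `pandas.DataFrame.to_csv` would work just as well.

```python
import csv
from typing import List, Sequence, Tuple


def read_first_page(pdf_path: str) -> str:
    """Return the text of the first page using PyMuPDF."""
    import fitz  # PyMuPDF; imported lazily so the CSV helper works without it

    with fitz.open(pdf_path) as doc:
        return doc[0].get_text()


def embed_tokens(
    text: str, model_name: str = "bert-base-uncased"  # assumed model choice
) -> Tuple[List[str], List[List[float]]]:
    """Return (tokens, vectors) with one embedding vector per token."""
    import torch
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name)
    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        # last_hidden_state has shape (batch, tokens, hidden); take batch item 0
        hidden = model(**inputs).last_hidden_state[0]
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0].tolist())
    return tokens, hidden.tolist()


def save_embeddings(
    tokens: Sequence[str], vectors: Sequence[Sequence[float]], csv_path: str
) -> None:
    """Write one row per token: the token followed by its vector components."""
    dim = len(vectors[0]) if vectors else 0
    with open(csv_path, "w", newline="", encoding="utf-8") as fh:
        writer = csv.writer(fh)
        writer.writerow(["token"] + [f"dim_{i}" for i in range(dim)])
        for token, vector in zip(tokens, vectors):
            writer.writerow([token, *vector])
```

A typical call chain would be `save_embeddings(*embed_tokens(read_first_page("some.pdf")), "embeddings.csv")`, where both file names are placeholders.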
python main.py path/to/sample.pdf

This will:
- Convert the PDF to images and extract text via OCR
- Classify the document type
- Extract key fields like policy numbers and claim information
- Output a JSON summary
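
As a rough sketch of the OCR and pattern-based extraction steps above: the regular expressions, field names, and JSON layout below are hypothetical and chosen only for illustration; the actual patterns in main.py may differ. Classification is left out, so the document type is passed in as a plain string. The OCR helper assumes the Poppler and Tesseract binaries are installed alongside pdf2image and pytesseract.

```python
import json
import re
from typing import Dict, Optional

# Hypothetical patterns; real insurance documents will need tuning.
FIELD_PATTERNS = {
    "policy_number": re.compile(
        r"Policy\s*(?:No\.?|Number|#)\s*[:\-]?\s*([A-Z0-9\-]+)", re.IGNORECASE
    ),
    "claim_number": re.compile(
        r"Claim\s*(?:No\.?|Number|#)\s*[:\-]?\s*([A-Z0-9\-]+)", re.IGNORECASE
    ),
}


def ocr_pdf(pdf_path: str) -> str:
    """Render each page to an image and run Tesseract OCR on it."""
    from pdf2image import convert_from_path  # lazy: needs Poppler installed
    import pytesseract  # lazy: needs the Tesseract binary installed

    pages = convert_from_path(pdf_path)
    return "\n".join(pytesseract.image_to_string(page) for page in pages)


def extract_fields(text: str) -> Dict[str, Optional[str]]:
    """Return the first match for each pattern, or None when absent."""
    fields: Dict[str, Optional[str]] = {}
    for name, pattern in FIELD_PATTERNS.items():
        match = pattern.search(text)
        fields[name] = match.group(1) if match else None
    return fields


def summarize(text: str, doc_type: str = "unknown") -> str:
    """Build the JSON summary from the OCR text and a document type label."""
    return json.dumps(
        {"document_type": doc_type, "fields": extract_fields(text)}, indent=2
    )
```

Keeping `extract_fields` free of OCR dependencies means the patterns can be checked against plain strings before running them on scanned documents.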
- main.py - Insurance document processor with OCR and classification
- pdf_reader.py - Basic PDF text extraction and embedding generation
- text_embeddings.py - Tokenization and embedding utilities
- helpers/ - Additional utility scripts
- pdf_documents/ - Sample PDF files for testing
- PyMuPDF (fitz)
- PyTesseract
- PDF2Image
- Transformers (Hugging Face)
- Pandas
- PyTorch