This project is a console-based document scanner application that uses Optical Character Recognition (OCR) to extract text and tables from images and PDF files. It leverages the PaddleOCR and PPStructure libraries for accurate text recognition and structured data extraction, making it useful for automating data entry and document analysis tasks.
- Text Extraction: Extracts text from images and PDF documents using OCR.
- Table Recognition: Detects and extracts tables from scanned documents and images.
- Image Preprocessing: Enhances image quality for better OCR performance through various preprocessing techniques.
- Output Export: Saves extracted data into an Excel file for easy sharing and further analysis.
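The text extraction and export features above can be sketched roughly as follows. This is a minimal illustration, not the code in this repository: the helper names (`lines_to_rows`, `extract_rows`, `save_rows`) and the `min_conf` threshold are hypothetical, and it assumes the PaddleOCR 2.x result layout, where each page is a list of `[box, (text, confidence)]` entries. Writing `.xlsx` files through pandas also needs an Excel engine such as `openpyxl`, which is an assumption on top of the package list.

```python
def lines_to_rows(page, min_conf=0.5):
    """Flatten one page of OCR lines into row dicts, dropping low-confidence lines.

    `page` is assumed to look like [[box, (text, confidence)], ...],
    the layout returned per page by PaddleOCR 2.x. It may be None
    when nothing was detected on the page.
    """
    rows = []
    for _box, (text, conf) in page or []:
        if conf >= min_conf:
            rows.append({"text": text, "confidence": round(float(conf), 3)})
    return rows


def extract_rows(image_path, min_conf=0.5):
    """Run OCR on one image and return the filtered rows (hypothetical helper)."""
    # Imported lazily so the pure helper above works without PaddleOCR installed.
    from paddleocr import PaddleOCR

    ocr = PaddleOCR(use_angle_cls=True, lang="en")
    result = ocr.ocr(image_path, cls=True)  # PaddleOCR 2.x call style (assumed)
    rows = []
    for page in result:
        rows.extend(lines_to_rows(page, min_conf))
    return rows


def save_rows(rows, xlsx_path):
    """Write the rows to an Excel file (requires an engine such as openpyxl)."""
    import pandas as pd

    pd.DataFrame(rows).to_excel(xlsx_path, index=False)
```

Keeping the filtering step separate from the OCR call makes it possible to check the row logic with hand-written sample data, without loading any OCR models.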
- Python 3.8
- opencv-python
- numpy
- paddleocr
- ppstructure
- pdfplumber
- pandas
- Pillow
- Clone the repository and change into its directory:
    git clone https://github.com/Adeyemi0/Python-OCR.git
    cd Python-OCR
- Install the required packages:
    pip install opencv-python numpy paddleocr ppstructure pdfplumber pandas Pillow
- Run the application:
    python python-ocr.py
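Before running the scanner, it may help to confirm that the installed packages are importable. The following is a small optional sketch, not part of the repository: the `REQUIRED` mapping and the `missing_packages` helper are hypothetical names, and the mapping lists the import name each pip package is commonly exposed under (for example, `opencv-python` is imported as `cv2` and `Pillow` as `PIL`). PPStructure is left out of the check on the assumption that it is imported from the `paddleocr` package.

```python
import importlib.util
import sys

# pip package name -> module name used in `import` statements (assumed mapping)
REQUIRED = {
    "opencv-python": "cv2",
    "numpy": "numpy",
    "paddleocr": "paddleocr",
    "pdfplumber": "pdfplumber",
    "pandas": "pandas",
    "Pillow": "PIL",
}


def missing_packages(required=REQUIRED):
    """Return the pip names of packages whose modules cannot be found."""
    return [
        pip_name
        for pip_name, module in required.items()
        if importlib.util.find_spec(module) is None
    ]


if __name__ == "__main__":
    if sys.version_info < (3, 8):
        print("Python 3.8 or newer is expected.")
    missing = missing_packages()
    if missing:
        print("Missing packages: pip install " + " ".join(missing))
    else:
        print("All required packages are importable.")
```

The check only looks up module specs and does not import the packages, so it runs quickly even though the OCR libraries are large.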