UnstructData is a powerful Flask-based application designed to extract tables—including complex, borderless, or unstructured ones—from PDF files using a variety of advanced models and techniques. It is built for researchers, data scientists, and anyone who needs to accurately extract tables from challenging documents.
- Multiple Extraction Engines: Choose among rule-based libraries and deep learning models, including PDFPlumber, Camelot, Tabula, Unstructured VLM, Table Transformer, PyMuPDF, CascadeTabNet, and PDF2XML.
- Handles Complex Tables: Extracts tables even when they are borderless, merged, multi-line, or have irregular structures.
- Model Selection: Choose the extraction model that best fits your PDF structure and complexity.
- Dynamic Page Range: Extract tables from specific pages or across the whole document.
- Enhanced Pre- and Post-Processing: Automatic cleaning, validation, header detection, and formatting of extracted tables.
- Fallback Strategies: If a model fails to extract tables, the app tries clustering and other fallback strategies.
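The fallback behaviour described in the last item can be pictured as a chain of extractors tried in order. The following is a minimal sketch of that idea, not the app's actual code: `extract_with_fallback` and the three stand-in extractors are hypothetical names introduced only for illustration.

```python
from typing import Callable, List, Sequence

Table = List[List[str]]                   # one table = rows of cell strings
Extractor = Callable[[str], List[Table]]  # takes a PDF path, returns tables


def extract_with_fallback(pdf_path: str, extractors: Sequence[Extractor]) -> List[Table]:
    """Return the tables from the first extractor that finds any, else []."""
    for extractor in extractors:
        try:
            tables = extractor(pdf_path)
        except Exception:
            # A failing engine should not abort the whole request.
            continue
        if tables:
            return tables
    return []


# Hypothetical stand-ins for real engines, used only to show the control flow.
def failing_model(pdf_path: str) -> List[Table]:
    raise RuntimeError("model could not load")


def empty_model(pdf_path: str) -> List[Table]:
    return []


def clustering_fallback(pdf_path: str) -> List[Table]:
    return [[["Item", "Amount"], ["Rent", "1200"]]]


result = extract_with_fallback(
    "report.pdf", [failing_model, empty_model, clustering_fallback]
)
```

Here the first engine raises, the second finds nothing, and the third supplies the result. Ordering the list from most to least preferred engine keeps the selection policy in one place.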
- Clone the repository:

      git clone https://github.com/mithgx/UnstructData.git
      cd UnstructData
- Install dependencies:
  - It is recommended to use a virtual environment (e.g., `venv` or `conda`).
  - Install Python requirements:

        pip install -r requirements.txt

  - Some models require additional tools:
    - CascadeTabNet: Requires `mmdet`, `mmcv-full`, and `pytesseract`.
    - pdf2xml: Requires `pdftohtml` (part of the `poppler-utils` package).
    - tabula: Requires Java.
- Download model weights:
  - For Table Transformer and CascadeTabNet, the required model weights will be automatically downloaded the first time they are used.
  - For CascadeTabNet, you may need to manually place the config and checkpoint files in the `CascadeTabNet/Config` directory.
- Start the application:

      python app.py

- Open your browser and go to the localhost URL where the app is running.
- Upload a PDF file using the provided form.
- Select the desired table extraction model from the dropdown menu:
  - pdfplumber: Fast, basic table extraction.
  - camelot: Good for lattice-based tables.
  - tabula: Java-based, works for many standard tables.
  - table_transformer_advanced: Microsoft Table Transformer, excels at complex layouts.
  - unstructured: Uses the Unstructured VLM for high-resolution and borderless tables.
  - cascade_tabnet: Deep learning model for robust detection.
  - pdf2xml: XML-based, good for preserving document structure.
  - pymupdf: PyMuPDF-based extraction for challenging layouts.
- (Optional) Specify start and end pages for extraction.
- Click "Process" to extract tables. Results will be displayed on the page.
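The optional start and end pages can be resolved into a concrete set of pages before extraction. The sketch below shows one way to do that; it assumes 1-based, inclusive page numbers and rejects invalid ranges, which may differ from how the app itself treats them. `resolve_page_range` is a hypothetical helper, not a function from this repository.

```python
from typing import Optional


def resolve_page_range(
    total_pages: int,
    start: Optional[int] = None,
    end: Optional[int] = None,
) -> range:
    """Turn optional 1-based, inclusive start/end pages into a range of page numbers."""
    if total_pages < 1:
        raise ValueError("document has no pages")
    first = 1 if start is None else start          # default: first page
    last = total_pages if end is None else end     # default: last page
    if first < 1 or last > total_pages or first > last:
        raise ValueError(
            f"invalid page range {first}-{last} for a {total_pages}-page document"
        )
    return range(first, last + 1)


pages = list(resolve_page_range(10, start=3, end=5))
```

Leaving both values empty selects the whole document, which matches the "specific pages or across the whole document" behaviour described in the feature list.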
Model | Description | Strengths |
---|---|---|
pdfplumber | Lightweight, fast PDF text and table extraction | Simple, quick, standard tables |
camelot | Lattice and stream-based extraction using OpenCV | Bordered and simple stream tables |
tabula | Java-based parser, good for many table types | Cross-platform, standard tables |
table_transformer_advanced | Microsoft Table Transformer Detection + Structure Recognition | Complex, borderless, multi-line tables |
unstructured | Unstructured VLM, high-res, OCR, clustering fallback | Borderless, irregular, multi-format PDFs |
cascade_tabnet | Deep learning, MMDetection-based table detection | Robust detection, difficult layouts |
pdf2xml | XML-based, preserves visual layout | Complex, multi-column documents |
pymupdf | Cluster-based, for irregular and borderless tables | Challenging, unstructured tables |
- Image & OCR Enhancement: The app automatically sharpens images and applies adaptive histogram equalization for better OCR and detection.
- Header Detection: Multi-row and spanned headers are detected and merged.
- Financial Data Formatting: Specific rules for currency, dates, and numbers.
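To make the header and financial formatting steps more concrete, here is a small sketch of what such post-processing could look like. It is an illustration under stated assumptions, not the app's implementation: `merge_header_rows` and `parse_amount` are hypothetical helpers, empty cells in an upper header row are assumed to belong to the spanned header on their left, and amounts are assumed to use a comma as thousands separator and a period as decimal point.

```python
import re
from typing import List, Optional


def merge_header_rows(header_rows: List[List[str]]) -> List[str]:
    """Collapse stacked header rows into a single row of column labels."""
    if not header_rows:
        return []
    width = max(len(row) for row in header_rows)
    filled = []
    for i, row in enumerate(header_rows):
        padded = [cell.strip() for cell in row] + [""] * (width - len(row))
        if i < len(header_rows) - 1:
            # Upper rows: an empty cell continues the spanned header to its left.
            last = ""
            for j, cell in enumerate(padded):
                if cell:
                    last = cell
                else:
                    padded[j] = last
        filled.append(padded)
    # Join the non-empty parts of each column from top to bottom.
    return [" ".join(part for part in column if part) for column in zip(*filled)]


def parse_amount(text: str) -> Optional[float]:
    """Parse cells such as "$1,234.50" or "(1,200)"; return None if not numeric."""
    cleaned = text.strip()
    # Accounting style: parentheses mark a negative value.
    negative = cleaned.startswith("(") and cleaned.endswith(")")
    cleaned = re.sub(r"[,\s$€£()]", "", cleaned)
    if not re.fullmatch(r"-?\d+(\.\d+)?", cleaned):
        return None
    value = float(cleaned)
    return -value if negative else value


headers = merge_header_rows(
    [["Region", "Revenue", "", ""], ["", "2022", "2023", "Change"]]
)
# headers == ["Region", "Revenue 2022", "Revenue 2023", "Revenue Change"]
```

The forward-fill rule is a simplification: a column whose upper header is genuinely empty would inherit its left neighbour's label, so real documents may need span information from the detection model instead.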
- CascadeTabNet errors: Make sure you have installed `mmdet`, `mmcv-full`, and `pytesseract`, and placed the correct config and checkpoint files.
- pdf2xml errors: Install `poppler-utils` (Linux: `sudo apt-get install poppler-utils`; macOS: `brew install poppler`).
- tabula errors: Ensure Java is installed and available in your PATH.
- Out-of-memory: For large files, reduce DPI or restrict page ranges.
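Several of the problems above come down to a missing executable or Python package, so a quick check before starting the app can save time. The script below is a sketch of such a check and is not part of the repository. It assumes the `mmcv-full` package is imported as `mmcv`, and it only tests whether tools can be found, not whether their versions are compatible.

```python
import importlib.util
import shutil
from typing import Dict

# Executables and import names taken from the requirements listed above.
REQUIRED_EXECUTABLES = {"tabula": "java", "pdf2xml": "pdftohtml"}
REQUIRED_MODULES = {"cascade_tabnet": ["mmdet", "mmcv", "pytesseract"]}


def check_optional_dependencies() -> Dict[str, bool]:
    """Map each model name to True if its extra tools appear to be installed."""
    status = {}
    for model, executable in REQUIRED_EXECUTABLES.items():
        # shutil.which returns None when the executable is not on PATH.
        status[model] = shutil.which(executable) is not None
    for model, modules in REQUIRED_MODULES.items():
        # find_spec returns None when a top-level module cannot be imported.
        status[model] = all(
            importlib.util.find_spec(name) is not None for name in modules
        )
    return status


if __name__ == "__main__":
    for model, ok in check_optional_dependencies().items():
        print(f"{model}: {'ok' if ok else 'missing dependencies'}")
```

A model reported as missing dependencies can still be skipped in the dropdown while the remaining engines are used.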
Pull requests, feature suggestions, and bug reports are welcome! Please create an issue or submit a PR.
This project is licensed under the MIT License.
For issues or questions, please contact mithgx.