Automated Document Parser

A powerful and automated document parser built with LangChain for intelligent document processing. This library automatically detects file types and uses the appropriate loader to parse documents into LangChain-compatible formats.

Features

Automatic file type detection based on file extensions
Multiple PDF loading methods - 9 different PDF loaders for various use cases
Modular architecture - Clean separation with file_load/ and pdf_load/ modules
Support for multiple document formats: PDF, TXT, CSV, JSON, DOCX, HTML, Markdown
Built on LangChain for seamless integration with RAG applications
Type-safe implementation with comprehensive error handling
Batch processing support for multiple documents

Supported File Types

Text Files

.txt - Plain text files
.md - Markdown files

Structured Data

.csv - CSV files with encoding support
.json - JSON files with jq schema filtering

Documents

.docx - Microsoft Word documents
.html - HTML files

PDF Files (9 loading methods)

pypdf - Basic PDF text extraction (default, no extra dependencies)
unstructured - Advanced OCR and layout detection
amazon_textract - AWS Textract for high-accuracy OCR
mathpix - Specialized for mathematical formulas
pdfplumber - High accuracy text and table extraction
pypdfium2 - Google PDFium library
pymupdf - PyMuPDF (fitz) backend
pymupdf4llm - LLM-optimized extraction
opendataloader - Advanced multi-format parsing

Installation

Install from PyPI:

pip install automated-document-parser

Or using uv:

uv add automated-document-parser

Quick Start

Basic Usage - Automatic File Type Detection

The primary feature is automatic file type detection. Just point to any supported file and the parser handles the rest:

from automated_document_parser import DocumentParser

# Initialize the parser
parser = DocumentParser()

# Parse any single file - automatically detects type and uses the right loader
documents = parser.parse("document.pdf")        # Auto-detects PDF
documents = parser.parse("data.csv")            # Auto-detects CSV
documents = parser.parse("notes.txt")           # Auto-detects text file

# Parse multiple files of different types - all formats handled automatically
file_paths = ["report.pdf", "data.csv", "notes.txt", "info.docx"]
all_docs = parser.parse_multiple(file_paths)  # Each file auto-detected and loaded

# Access parsed content
for file_path, docs in all_docs.items():
    print(f"File: {file_path}")
    for doc in docs:
        print(f"  Content: {doc.page_content[:100]}...")
        print(f"  Metadata: {doc.metadata}")

Specify Loading Methods

Specify the PDF loading method and other parameters to apply to all files:

from automated_document_parser import DocumentParser

parser = DocumentParser()

# Step 1: Specify the method for PDFs
# Step 2: Parser automatically detects file types and loads them with specified method
file_paths = ["report1.pdf", "report2.pdf", "data.csv", "notes.txt"]
all_docs = parser.parse_multiple(
    file_paths,
    pdf_loader_method="pdfplumber",  # All PDFs will use pdfplumber
    encoding="utf-8"                  # All text files will use UTF-8 encoding
)

# Each file is automatically detected and loaded with the specified settings
for file_path, docs in all_docs.items():
    print(f"Loaded {file_path}: {len(docs)} documents")

Documentation

Full documentation is available at: https://pulkit12dhingra.github.io/automated-document-parser/

Architecture

The library uses a modular architecture:

automated_document_parser/
├── loaders/
│   ├── file_load/          # File loaders module
│   │   ├── base.py         # Base file loader class
│   │   ├── text_loader.py  # Text file loader
│   │   ├── csv_loader.py   # CSV loader
│   │   ├── json_loader.py  # JSON loader
│   │   ├── docx_loader.py  # DOCX loader
│   │   └── html_loader.py  # HTML loader
│   ├── pdf_load/           # PDF loaders module
│   │   ├── base.py         # Base PDF loader class
│   │   ├── pypdf_loader.py
│   │   ├── mathpix_loader.py
│   │   ├── pdfplumber_loader.py
│   │   └── ... (9 PDF loaders total)
│   └── file_loaders.py     # Main orchestrator
└── core.py                 # DocumentParser class

License

MIT License - see LICENSE file for details.

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

Name		Name	Last commit message	Last commit date
Latest commit History 34 Commits
.github/workflows		.github/workflows
docs		docs
notebook_examples		notebook_examples
src/automated_document_parser		src/automated_document_parser
test		test
.gitignore		.gitignore
.pre-commit-config.yaml		.pre-commit-config.yaml
LICENSE		LICENSE
Makefile		Makefile
README.md		README.md
pyproject.toml		pyproject.toml
requirements.txt		requirements.txt

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

Automated Document Parser

Features

Supported File Types

Text Files

Structured Data

Documents

PDF Files (9 loading methods)

Installation

Quick Start

Basic Usage - Automatic File Type Detection

Specify Loading Methods

Documentation

Architecture

License

Contributing

About

Uh oh!

Releases 6

Packages

Languages

License

Pulkit12dhingra/automated-document-parser

Folders and files

Latest commit

History

Repository files navigation

Automated Document Parser

Features

Supported File Types

Text Files

Structured Data

Documents

PDF Files (9 loading methods)

Installation

Quick Start

Basic Usage - Automatic File Type Detection

Specify Loading Methods

Documentation

Architecture

License

Contributing

About

Topics

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases 6

Packages 0

Languages

Packages