DocFrame is a Python framework for turning messy enterprise documents into structured, AI-ready chunks.
It gives developers one API and one CLI for PDFs, Word files, CSVs, Excel workbooks, JPGs, and PNGs.
```shell
docframe process contract.pdf --format markdown
docframe process ./inbox --recursive --out normalized.json
python3 -m docframe process report.xlsx
```

Document workflows usually start with format chaos: text PDFs, scanned images, spreadsheets, Word files, CSV exports, and ad hoc attachments. DocFrame gives you a normalized result model so downstream systems can search, extract, summarize, validate, and route documents without rewriting parsers for every file type.
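The normalized result model can be pictured as a small set of dataclasses. The sketch below is illustrative only: the field names (`metadata.document_type`, `chunks[0].rows`, `errors`) come from the examples later in this README, but the exact types and class shapes are assumptions, not DocFrame's actual definitions.

```python
from dataclasses import dataclass, field

# Illustrative sketch only: DocFrame's real classes are richer, but the
# examples in this README rely on at least these fields.
@dataclass
class DocumentMetadata:
    filename: str
    document_type: str  # e.g. "csv", "pdf", "docx"

@dataclass
class DocumentChunk:
    kind: str  # "text", "table", "image", or "metadata"
    text: str = ""
    rows: list[list[str]] = field(default_factory=list)  # filled for table chunks

@dataclass
class DocumentResult:
    metadata: DocumentMetadata
    chunks: list[DocumentChunk] = field(default_factory=list)
    errors: list[str] = field(default_factory=list)

result = DocumentResult(
    metadata=DocumentMetadata(filename="sample.csv", document_type="csv"),
    chunks=[DocumentChunk(kind="table", rows=[["name", "amount"], ["acme", "42"]])],
)
print(result.metadata.document_type)  # csv
print(result.chunks[0].rows[0])       # ['name', 'amount']
```

Whatever the real internals look like, the point is that every file type lands in one shape that downstream code can consume uniformly.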
From PyPI:

```shell
python3 -m pip install docframe-ai
```

Local development:

```shell
python3 -m pip install -e .
```

Then:

```shell
docframe formats
docframe process examples/sample.csv --format markdown
```

The PyPI distribution is `docframe-ai`; the Python import remains `docframe`.
See the repository's PyPI publishing guide for the GitHub Trusted Publishing setup.
```python
import docframe as df

framework = df.DocFrame()
result = framework.process_sync("examples/sample.csv")
print(result.metadata.document_type)
print(result.chunks[0].rows)
```

Async processing:
```python
import docframe as df

# process_many is a coroutine, so this snippet must run inside an
# async function (e.g. driven by asyncio.run).
framework = df.DocFrame()
results = await framework.process_many(["contract.pdf", "report.xlsx"])
```

Safe corpus processing:
```python
import docframe as df

# Run inside an async function, as above.
framework = df.DocFrame()
results = await framework.process_many(
    ["good.pdf", "malformed.pdf"],
    continue_on_error=True,
)
for result in results:
    if result.errors:
        print(result.metadata.filename, result.errors)
```

LLM-ready token blocks:
```python
import docframe as df

framework = df.DocFrame()
result = framework.process_sync("report.pdf")
tokens = df.to_llm_tokens(result)
prompt = df.to_llm_prompt(result)
```

From the CLI:
```shell
docframe process report.pdf --format tokens
docframe process ./documents --recursive --format llm --out llm_payload.json
docframe process report.pdf --format prompt --out prompt.txt
```

`tokens` returns a JSON list of source-grounded text blocks. `llm` returns a compact `docframe.sager.tokens.v1` payload. `prompt` returns plain text ready to paste or pass directly into an LLM call.
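As a rough illustration of what a source-grounded token payload involves, the stand-alone sketch below serializes chunk texts into a JSON list that keeps provenance alongside each block. The field names (`index`, `kind`, `text`) are assumptions for illustration, not DocFrame's actual `tokens` schema.

```python
import json

def to_token_blocks(chunks: list[dict]) -> str:
    """Serialize text-bearing chunks into a JSON list of blocks.

    Hypothetical shape: each block keeps the text plus enough provenance
    (chunk index, chunk kind) to trace an LLM answer back to its source.
    """
    blocks = [
        {"index": i, "kind": c.get("kind", "text"), "text": c["text"]}
        for i, c in enumerate(chunks)
        if c.get("text")  # blocks with no text (e.g. bare images) are skipped
    ]
    return json.dumps(blocks, ensure_ascii=False)

payload = to_token_blocks([
    {"kind": "text", "text": "Section 1: Term of agreement."},
    {"kind": "table", "text": "party,role\nAcme,vendor"},
    {"kind": "image", "text": ""},
])
print(payload)
```

Keeping the chunk index in each block is what makes the payload "source-grounded": an answer built from block 1 can be traced back to chunk 1 of the original document.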
- PDF: text and page metadata via `pypdf`
- DOCX: paragraphs and tables via direct OOXML package parsing
- DOC: OOXML extraction when possible, metadata-only fallback for legacy binary Word files
- CSV: table chunks via Python's standard `csv` parser
- XLSX/XLSM: worksheet tables via `openpyxl`
- JPG/JPEG/PNG: image metadata via `Pillow`
Images currently emit image chunks and metadata. OCR is intentionally a provider extension point so teams can choose local OCR, cloud OCR, or multimodal AI.
Many real corpora contain OOXML Word documents with a `.doc` extension. DocFrame extracts those with the Word adapter and emits a warning. True legacy binary `.doc` files emit metadata and a warning; convert them to `.docx` or register a custom adapter when full text extraction is required.
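A custom adapter for legacy binary `.doc` files might take the following general shape. This is a self-contained sketch under assumed names: DocFrame's real `DocumentAdapter` interface and `AdapterRegistry` API may differ, so treat the `extensions` attribute and `parse` method as hypothetical stand-ins.

```python
from pathlib import Path

class LegacyDocAdapter:
    """Hypothetical adapter shape: declare the extensions it handles and
    return a normalized result from parse()."""

    extensions = (".doc",)

    def parse(self, path: Path) -> dict:
        # A real implementation might shell out to LibreOffice or antiword;
        # this stub only emits metadata, mirroring DocFrame's own fallback.
        return {
            "metadata": {"filename": path.name, "document_type": "doc"},
            "chunks": [],
            "errors": ["legacy binary .doc: text extraction not implemented"],
        }

# Minimal registry idea: map each extension to the adapter instance.
registry: dict[str, LegacyDocAdapter] = {}
adapter = LegacyDocAdapter()
for ext in adapter.extensions:
    registry[ext] = adapter

result = registry[".doc"].parse(Path("old-contract.doc"))
print(result["errors"][0])
```

The registry lookup by extension is what lets a custom adapter slot in without touching the framework's built-in parsers.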
- `DocFrame`: framework object for processing documents
- `DocumentAdapter`: parser for a file family
- `AdapterRegistry`: maps file extensions to adapters
- `DocumentResult`: normalized output for one document
- `DocumentChunk`: text, table, image, or metadata unit
- `Pipeline`: ordered post-processing steps
- `ProcessingOptions`: runtime limits and concurrency controls for large files
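The `Pipeline` concept (ordered post-processing steps) can be sketched as a list of functions applied in order to each chunk. This illustrates the idea only; the function names and the `run_pipeline` helper below are invented for the example and are not DocFrame's implementation.

```python
from typing import Callable

Step = Callable[[str], str]

def strip_whitespace(text: str) -> str:
    return text.strip()

def collapse_blank_lines(text: str) -> str:
    lines = [line for line in text.splitlines() if line.strip()]
    return "\n".join(lines)

def run_pipeline(steps: list[Step], chunks: list[str]) -> list[str]:
    # Each step runs over every chunk, in the order steps were registered.
    for step in steps:
        chunks = [step(chunk) for chunk in chunks]
    return chunks

cleaned = run_pipeline(
    [strip_whitespace, collapse_blank_lines],
    ["  Section 1\n\n\nTerm of agreement.  ", "\n\nSchedule A\n"],
)
print(cleaned)
```

Because steps compose in order, teams can append their own normalization (redaction, language detection, chunk merging) without modifying the adapters that produced the chunks.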
```shell
docframe process FILE_OR_DIRECTORY
docframe process FILE_OR_DIRECTORY --format markdown
docframe process FILE_OR_DIRECTORY --format tokens
docframe process FILE_OR_DIRECTORY --format llm
docframe process FILE_OR_DIRECTORY --format prompt
docframe process FILE_OR_DIRECTORY --recursive --out normalized.json
docframe formats
```

DocFrame is public alpha software. The core API, adapters, CLI, tests, MIT license, and landing site are in place. See PUBLIC_ALPHA.md for the production-readiness checklist.
```shell
python3 -m unittest discover -s tests
python3 -m compileall docframe tests
```

The static site lives in `site/index.html`. Run it locally:

```shell
python3 -m http.server 8080 -d site
```

Then open http://127.0.0.1:8080/.
The repository includes a Render Blueprint in `render.yaml`. It publishes the static site from `site/` as `docframe-site`. After pushing the repository to GitHub, GitLab, or Bitbucket:

```shell
git push -u origin main
```

Then create the Blueprint from the Render Dashboard:

https://dashboard.render.com/blueprint/new

Connect the repository and Render will use `render.yaml` from the repo root.
Validate a private corpus before a release:

```shell
python3 scripts/validate_corpus.py test_corpus --out corpus-report.json
```

The validator exits nonzero if any supported file produces a structured error. Use `--allow-errors` for exploratory runs where malformed files are expected.
Collect any supported corpus files by extension:

```shell
python3 scripts/collect_files.py "/path/to/archive" "/path/to/all_csv" --ext csv --dry-run --quiet
python3 scripts/collect_files.py "/path/to/archive" "/path/to/all_csv" --ext csv --quiet
python3 scripts/collect_files.py "/path/to/archive" "/path/to/all_images" --ext jpg --ext png --quiet
```

Collect PDFs from a deeply nested archive into one flat folder:

```shell
python3 scripts/collect_pdfs.py "/path/to/archive" "/path/to/all_pdfs" --dry-run --quiet
python3 scripts/collect_pdfs.py "/path/to/archive" "/path/to/all_pdfs" --quiet
```

The collector copies by default, avoids overwriting existing files, and gives duplicate basenames a stable hash suffix.
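One common way to implement a "stable hash suffix" is to derive the suffix from the file's content hash, so the same source file always maps to the same destination name across runs. The helper below is a stdlib sketch of that idea, not the collector script's actual code.

```python
import hashlib
from pathlib import Path

def dedup_name(dest_dir: Path, src: Path) -> Path:
    """Pick a destination path that never overwrites an existing file.

    If the basename is free, use it as-is; otherwise append a short,
    stable suffix derived from the source file's content hash, so
    re-running the collector yields the same name for the same file.
    """
    candidate = dest_dir / src.name
    if not candidate.exists():
        return candidate
    digest = hashlib.sha256(src.read_bytes()).hexdigest()[:8]
    return dest_dir / f"{src.stem}-{digest}{src.suffix}"
```

Hashing content rather than, say, a counter means two files that merely share a basename get distinct names, while the same file collected twice resolves to one name instead of accumulating copies.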