Turn any PDF or image document into structured data for your AI. A powerful, lightweight OCR toolkit that bridges the gap between images/PDFs and LLMs. Supports 80+ languages.
Get your documents ready for gen AI
Convert documents to structured data effortlessly. Unstructured is an open-source ETL solution for transforming complex documents into clean, structured formats for language models. Visit our website to learn more about our enterprise-grade Platform product for production-grade workflows, partitioning, enrichments, chunking, and embedding.
Knowledge Agents and Management in the Cloud
ExtractThinker is a Document Intelligence library for LLMs, offering ORM-style interaction for flexible and powerful document workflows.
Extract and convert data from any document, image, PDF, Word document, PowerPoint file, or URL into multiple formats (Markdown, JSON, CSV, HTML) with intelligent structured data extraction and advanced OCR.
Eden AI: simplify the use and deployment of AI technologies by providing a unique API that connects to the best possible AI engines
Open-source platform for processing unstructured data (PDFs, images, audio files), built for knowledge workers
A comprehensive list of document parsers, covering PDF-to-text conversion and layout extraction. Each tested for support of tables, equations, handwriting, two-column layouts, and multi-column layouts.
Safe, Open, High-Performance — PDF for AI
Jupyter notebooks testing different OCR models for document parsing (Dolphin, MonkeyOCR, Marker, Nanonets, ...)
A Python pipeline tool and plugin ecosystem for processing technical documents. Process papers from arXiv, Semantic Scholar, or PDF files with GROBID and LangChain, or listen to them as a podcast. Customize your own pipelines.
A Unified Toolkit for Deep Learning-Based Table Extraction
Open source libraries and APIs to build custom preprocessing pipelines for labeling, training, or production machine learning pipelines.
Docling4j brings the functionalities of Docling in document understanding to Java® projects
Official implementation of our ECCVW paper "μgat: Improving Single-Page Document Parsing by Providing Multi-Page Context"
Applicant Tracking System (ATS): A powerful platform leveraging generative AI and soft-match algorithms to analyze resumes against job descriptions. Built with React and Node.js, it streamlines hiring insights. Future plans include expanding to investor pitches and other structured documents.
Tool for converting First National Bank (FNB) bank statement PDFs into useful structured data
The metadata and text content extractor for almost every file type.
Docparser OCR Package for PHP Laravel