document-parsing

Here are 45 public repositories matching this topic...

PaddlePaddle / PaddleOCR

Turn any PDF or image document into structured data for your AI. A powerful, lightweight OCR toolkit that bridges the gap between images/PDFs and LLMs. Supports 80+ languages.

ocr pdf-parser kie document-translation rag chineseocr ai4science pp-ocr document-parsing pp-structure pdf-extractor-rag pdf2markdown

Updated Sep 19, 2025
Python

docling-project / docling

Get your documents ready for gen AI

html markdown pdf ai convert xlsx pdf-converter docx documents pptx pdf-to-text tables document-parser pdf-to-json document-parsing

Updated Sep 19, 2025
Python

Unstructured-IO / unstructured

Convert documents to structured data effortlessly. Unstructured is an open-source ETL solution for transforming complex documents into clean, structured formats for language models. Visit our website to learn more about our enterprise-grade Platform product for production-grade workflows, partitioning, enrichments, chunking, and embedding.

Updated Sep 17, 2025
HTML

run-llama / llama_cloud_services

Knowledge Agents and Management in the Cloud

pdf parsing document pptx structured-data pdf-to-text pdf-to-excel tables docx-to-markdown document-parser pdf-document-processor pdf-to-json document-parsing ppt-to-json pdf-to-markdown ppt-to-markdown

Updated Sep 18, 2025
TypeScript

enoch3712 / ExtractThinker

ExtractThinker is a Document Intelligence library for LLMs, offering ORM-style interaction for flexible and powerful document workflows.

python nlp pdf machine-learning ocr ai openai pdf-to-text document-processing document-image-analysis document-intelligence llm document-parsing langchain

Updated Aug 27, 2025
Python

NanoNets / docstrange

Extract and convert data from any document, image, PDF, Word doc, PPT, or URL into multiple formats (Markdown, JSON, CSV, HTML) with intelligent structured data extraction and advanced OCR.

markdown ocr ai structured-data tables pdf-parser document-parser structured-data-capture pdf-to-json llm document-parsing image-to-markdown pdf-to-markdown

Updated Sep 11, 2025
Python

edenai / edenai-apis

Eden AI: simplify the use and deployment of AI technologies by providing a unique API that connects to the best possible AI engines

python nlp api natural-language-processing text-to-speech ocr ai computer-vision aggregator machine-translation image-processing speech-recognition speech-to-text optical-character-recognition ai-as-a-service video-recognition pre-trained-model document-parsing

Updated Sep 19, 2025
Python

harishdeivanayagam / rowfill

Open-source platform for processing unstructured data (PDFs, images, audio files), built for knowledge workers

pdf ocr nextjs vision openai document llama pdfs vision-api unstructured unstructured-data document-extraction image-ocr ocr-javascript llm document-parsing ollama langgraph

Updated Mar 18, 2025
TypeScript

GiftMungmeeprued / document-parsers-list

A comprehensive list of document parsers, covering PDF-to-text conversion and layout extraction. Each parser is tested for support of tables, equations, handwriting, two-column layouts, and multi-column layouts.

pdf ocr preprocessing pdf-to-text document-image-processing data-pipeline document-parser document-parsing langchain

Updated Jul 14, 2025

opendataloader-project / opendataloader-pdf

Safe, Open, High-Performance — PDF for AI

html markdown pdf json sdk recognition ai pdf-converter documents dataloader tables ocr-recognition document-parser pdf-to-html pdf-to-json document-parsing pdf-to-markdown

Updated Sep 19, 2025
Java

AdemBoukhris457 / Documents-Parsing-Lab

Jupyter notebooks testing different OCR models for document parsing (Dolphin, MonkeyOCR, Marker, Nanonets, ...)

ocr ai parsing-data document-parsing genai

Updated Sep 18, 2025
Jupyter Notebook

papercast-dev / papercast

A Python pipeline tool and plugin ecosystem for processing technical documents. Process papers from arXiv, Semantic Scholar, or PDF with GROBID and LangChain, and listen to them as a podcast. Customize your own pipelines.

python nlp pipeline podcast pdf-converter tts arxiv pdf-to-text dag document-parser pdf-document-processor grobid semantic-scholar document-parsing

Updated Mar 17, 2025
Python

CycloneBoy / pdf_table

A Unified Toolkit for Deep Learning-Based Table Extraction

pdf ocr ai table layout-analysis pdf-to-html table-recognition document-parsing

Updated Nov 21, 2024
Python

Unstructured-IO / community

Open source libraries and APIs to build custom preprocessing pipelines for labeling, training, or production machine learning pipelines.

open-source community machine-learning deep-learning nlp-parsing data-pipeline ocr-python document-ai preprocessing-data document-parsing

Updated Apr 7, 2023

docling-project / docling4j

Docling4j brings the functionalities of Docling in document understanding to Java® projects

java pdf ai pdf-converter documents document-parser pdf-to-json document-understanding document-parsing docling

Updated Mar 31, 2025
Java

aimagelab / mugat

Official implementation of our ECCVW paper "μgat: Improving Single-Page Document Parsing by Providing Multi-Page Context"

ocr transformer document-parsing

Updated Aug 30, 2024
Python

acenji / ats

Applicant Tracking System (ATS): A powerful platform leveraging generative AI and soft-match algorithms to analyze resumes against job descriptions. Built with React and Node.js, it streamlines hiring insights. Future plans include expanding to investor pitches and other structured documents.