document-analysis

Here are 110 public repositories matching this topic...

opendatalab / MinerU

A high-quality, one-stop open-source data extraction tool that converts PDF to Markdown and JSON.

python pdf parser ocr pdf-converter extract-data document-analysis pdf-parser layout-analysis ai4science pdf-extractor-rag pdf-extractor-llm pdf-extractor-pretrain

Updated Apr 10, 2025
Python

UglyToad / PdfPig

Read and extract text and other content from PDFs in C# (port of PDFBox)

pdf csharp pdfbox netstandard pdf-files pdf-document pdf-generation hocr document-analysis pdf-extractor alto-xml page-xml layout-analysis pdf-document-processor

Updated Apr 6, 2025
C#

AlibabaResearch / AdvancedLiterateMachinery

A collection of original, innovative ideas and algorithms towards Advanced Literate Machinery. This project is maintained by the OCR Team in the Language Technology Lab, Tongyi Lab, Alibaba Group.

Updated Apr 9, 2025
C++

tstanislawek / awesome-document-understanding

A curated list of resources on the topic of Document Understanding (DU)

Updated Jun 2, 2023

DocumindHQ / documind

Open-source platform for extracting structured data from documents using AI.

open-source pdf parser ocr ai pdf-converter developer-tools extract-data document-analysis pdf-extractor document-extraction llms pdf-extractor-llm

Updated Feb 21, 2025
JavaScript

Yuliang-Liu / Curve-Text-Detector

This repository provides training and testing code, a dataset, detection and recognition annotations, an evaluation script, an annotation tool, and a ranking.

deep-learning object-detection document-analysis scene-text

Updated Jul 20, 2020
Jupyter Notebook

wenwenyu / PICK-pytorch

Code for the paper "PICK: Processing Key Information Extraction from Documents using Improved Graph Learning-Convolutional Networks" (ICPR 2020)

document-analysis graph-convolutional-network graph-learning graph-neural-networks document-understanding key-information-extraction

Updated Jul 25, 2024
Python

jpWang / LiLT

Official PyTorch implementation of LiLT: A Simple yet Effective Language-Independent Layout Transformer for Structured Document Understanding (ACL 2022)

nlp information-extraction document-analysis document-understanding multilingual-models document-ai multimodal-pre-trained-model

Updated Oct 31, 2022
Python

CybercentreCanada / assemblyline

AssemblyLine 4: File triage and malware analysis

framework incident-response malware python3 cybersecurity cert infosec malware-analyzer malware-analysis malware-research automation-framework cyber-security file-analysis document-analysis security-automation security-tools malware-detection assemblyline security-automation-framework

Updated Apr 10, 2025
Python

lazyFrogLOL / llmdocparser

A package for parsing PDFs and analyzing their content using LLMs.

nlp ocr chunking document-analysis pdf-parser pdfparser rag llm text-chunking

Updated Aug 6, 2024
Python

pandora-analysis / pandora

Pandora is an analysis framework to discover if a file is suspicious and conveniently show the results

infosec document-analysis malware-detection document-analyzing

Updated Apr 9, 2025
Python

Dedoc is a library (service) for automating document parsing and converting documents to a uniform format. It automatically extracts content, logical structure, tables, and meta information from textual electronic documents. (Parse document; Document content extraction; Logical structure extraction; PDF parser; Scanned document parser; DOCX parser; HTML parser

html pdf ocr table-of-contents excel html-parser docx documents doc scanned-documents txt document-analysis odt pdf-parser table-recognition docx-parser document-content-extraction logical-structure-extraction

Updated Apr 8, 2025
Python

masyagin1998 / robin

RObust document image BINarization

python opencv ocr computer-vision deep-learning keras neural-networks document-analysis u-net document-binarization

Updated Aug 2, 2024
Python

chriswolfvision / local_adaptive_binarization

Local adaptive image binarization

computer-vision document-analysis document-binarization

Updated Mar 5, 2023
C++

mirabdullahyaser / Retrieval-Augmented-Generation-Engine-with-LangChain-and-Streamlit

Powerful web application that combines Streamlit, LangChain, and Pinecone to simplify document analysis. Powered by OpenAI's GPT-3, RAG enables dynamic, interactive document conversations, making it ideal for efficient document retrieval and summarization.