DocParser

A declarative, rule-based engine for extracting structured data from unstructured documents.

DocParser lets developers transform unstructured text (from PDFs, text files, or logs) into clean JSON objects without hardcoded parsing logic. Instead of writing complex if/else chains in your code, you define an Extraction Profile in JSON.

Project Structure

To use DocParser, you must organize your files within the mapped volume (default: examples/):

examples/
├── config/          <-- Place your profile.json here
│   └── profile.json
├── input/           <-- Place all PDFs or TXTs you want to process
│   ├── invoice_001.pdf
│   └── invoice_002.pdf
└── output/          <-- The engine will generate the JSON results here
    ├── invoice_001_profile.json
    └── invoice_002_profile.json
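The tree above suggests that each result file is named after its input document plus the profile file name. A minimal Python sketch of that mapping, assuming the pattern is `<input name>_<profile name>.json` (an inference from the example names, not confirmed engine behavior; `output_path_for` is a hypothetical helper):

```python
from pathlib import PurePosixPath

def output_path_for(input_file: PurePosixPath,
                    profile_file: PurePosixPath,
                    output_dir: PurePosixPath) -> PurePosixPath:
    """Build the result path for one input document.

    Assumption: the engine joins the input stem and the profile stem
    with an underscore. This is inferred from the example tree only.
    """
    return output_dir / f"{input_file.stem}_{profile_file.stem}.json"

root = PurePosixPath("examples")
profile = root / "config" / "profile.json"
inputs = [root / "input" / "invoice_001.pdf", root / "input" / "invoice_002.pdf"]

# One output file is planned per input document.
planned = [output_path_for(doc, profile, root / "output") for doc in inputs]
for path in planned:
    print(path)
```

With the example files, this prints `examples/output/invoice_001_profile.json` and `examples/output/invoice_002_profile.json`, matching the tree.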

Configuration Guide

The extraction logic is defined in a JSON profile. You can choose between two methods:

Regex Strategy

Best for structured data like IDs, Dates, and Codes.

{
  "targetField": "ProjectCode",
  "method": "Regex",
  "regexPattern": "Project Number:\\s*(\\d+)"
}
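To illustrate what such a rule expresses, here is a small Python sketch that applies the pattern to a sample text and returns the first capture group. It is not DocParser's implementation (the engine is a .NET library); `apply_regex_rule` and the sample text are hypothetical, and returning the first capture group is an assumption. Note that `\\s` in the JSON file decodes to the regex token `\s`.

```python
import re
from typing import Optional

def apply_regex_rule(rule: dict, text: str) -> Optional[str]:
    """Return the value matched by rule['regexPattern'], or None if no match.

    Assumption: the first capture group is the extracted value; if the
    pattern has no capture group, the whole match is returned instead.
    """
    match = re.search(rule["regexPattern"], text)
    if match is None:
        return None
    return match.group(1) if match.re.groups >= 1 else match.group(0)

rule = {
    "targetField": "ProjectCode",
    "method": "Regex",
    "regexPattern": r"Project Number:\s*(\d+)",  # same pattern, JSON-decoded
}
sample = "Customer: ACME\nProject Number:   48213\nStatus: open"

result = {rule["targetField"]: apply_regex_rule(rule, sample)}
print(result)  # {'ProjectCode': '48213'}
```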

Text Range Strategy

Best for extracting blocks of text or descriptions where Regex is too complex.

{
  "targetField": "DescriptionBlock",
  "method": "TextRange",
  "startAnchor": "DESCRIPTION:",
  "endAnchor": "TECHNICAL DATA:",
  "trimWhitespace": true
}
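The anchor-based extraction can be sketched in Python as follows. This is an illustration under assumptions, not the engine's code: `apply_text_range_rule` is a hypothetical helper, the anchors themselves are assumed to be excluded from the result, and a missing anchor is assumed to yield no value.

```python
from typing import Optional

def apply_text_range_rule(rule: dict, text: str) -> Optional[str]:
    """Return the text between startAnchor and endAnchor, or None.

    Assumptions: anchors are not part of the result, the first occurrence
    of each anchor is used, and a missing anchor means no value.
    """
    start = text.find(rule["startAnchor"])
    if start == -1:
        return None
    start += len(rule["startAnchor"])  # skip past the start anchor itself
    end = text.find(rule["endAnchor"], start)
    if end == -1:
        return None
    block = text[start:end]
    return block.strip() if rule.get("trimWhitespace", False) else block

rule = {
    "targetField": "DescriptionBlock",
    "startAnchor": "DESCRIPTION:",
    "endAnchor": "TECHNICAL DATA:",
    "trimWhitespace": True,
}
sample = "DESCRIPTION:\n  Steel bracket, 4 mm.\n\nTECHNICAL DATA:\nWeight: 2 kg"

print(repr(apply_text_range_rule(rule, sample)))  # 'Steel bracket, 4 mm.'
```

With `trimWhitespace` set to false, the same call would return the block with its surrounding newlines and indentation intact.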

Full Profile Example (`examples/config/profile.json`)

{
  "profileName": "Engineering_Spec_V1",
  "rules": [
    {
      "targetField": "ProjectCode",
      "method": "Regex",
      "regexPattern": "Project Number:\\s*(\\d+)"
    },
    {
      "targetField": "DescriptionBlock",
      "method": "TextRange",
      "startAnchor": "DESCRIPTION:",
      "endAnchor": "TECHNICAL DATA:",
      "trimWhitespace": true
    }
  ]
}
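As a rough picture of how a profile like this maps a document to a JSON object, the following Python sketch loads the profile, dispatches each rule by its `method`, and collects the results under each `targetField`. It only illustrates the rule semantics; the real engine is the .NET library, and the handler names, the sample document, and the behavior on missing matches are all assumptions.

```python
import json
import re
from typing import Callable, Dict, Optional

PROFILE_JSON = r"""
{
  "profileName": "Engineering_Spec_V1",
  "rules": [
    {"targetField": "ProjectCode", "method": "Regex",
     "regexPattern": "Project Number:\\s*(\\d+)"},
    {"targetField": "DescriptionBlock", "method": "TextRange",
     "startAnchor": "DESCRIPTION:", "endAnchor": "TECHNICAL DATA:",
     "trimWhitespace": true}
  ]
}
"""

def run_regex(rule: dict, text: str) -> Optional[str]:
    # Assumption: the first capture group holds the value to extract.
    m = re.search(rule["regexPattern"], text)
    if m is None:
        return None
    return m.group(1) if m.re.groups >= 1 else m.group(0)

def run_text_range(rule: dict, text: str) -> Optional[str]:
    # Assumption: anchors are excluded and a missing anchor yields None.
    start = text.find(rule["startAnchor"])
    if start == -1:
        return None
    start += len(rule["startAnchor"])
    end = text.find(rule["endAnchor"], start)
    if end == -1:
        return None
    block = text[start:end]
    return block.strip() if rule.get("trimWhitespace", False) else block

# Hypothetical dispatch table keyed by the "method" value of each rule.
HANDLERS: Dict[str, Callable[[dict, str], Optional[str]]] = {
    "Regex": run_regex,
    "TextRange": run_text_range,
}

def extract(profile: dict, text: str) -> dict:
    """Apply every rule in the profile and key results by targetField."""
    result = {}
    for rule in profile["rules"]:
        handler = HANDLERS.get(rule["method"])
        if handler is None:
            raise ValueError(f"Unknown method: {rule['method']}")
        result[rule["targetField"]] = handler(rule, text)
    return result

profile = json.loads(PROFILE_JSON)
document = (
    "Project Number: 48213\n"
    "DESCRIPTION:\n  Steel bracket, 4 mm.\n"
    "TECHNICAL DATA:\nWeight: 2 kg\n"
)
extracted = extract(profile, document)
print(json.dumps(extracted, indent=2))
```

For the sample document, the printed object contains `"ProjectCode": "48213"` and `"DescriptionBlock": "Steel bracket, 4 mm."`.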

Architecture

DocParser.Core: The extraction engine logic (standard .NET 8 library).
DocParser.CLI: A command-line interface that implements batch processing and file I/O.

Quick Start (Docker)

You don't need the .NET SDK installed. You can run the engine using Docker Compose.

Setup: Ensure examples/config contains your profile.json and examples/input contains the documents you want to process.
Run the engine:
```
docker compose up
```
