A declarative, rule-based engine for extracting structured data from unstructured documents.
DocParser transforms unstructured text (from PDFs, text files, or logs) into clean JSON objects without hardcoded parsing logic. Instead of writing complex if/else chains in your code, you define an Extraction Profile in JSON.
To use DocParser, you must organize your files within the mapped volume (default: examples/):
examples/
├── config/          <-- Place your profile.json here
│   └── profile.json
├── input/           <-- Place all PDFs or TXTs you want to process
│   ├── invoice_001.pdf
│   └── invoice_002.pdf
└── output/          <-- The engine will generate the JSON results here
    ├── invoice_001_profile.json
    └── invoice_002_profile.json
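
Before running the engine, it can help to check that the folder layout above is in place. The following Python sketch is not part of DocParser; the helper names `check_layout` and `expected_output_name` are hypothetical, and the output naming rule (`<input stem>_<profile file stem>.json`) is an assumption inferred only from the example tree.

```python
from pathlib import Path, PurePath

# Input types mentioned for the input/ folder (PDFs and TXTs).
ALLOWED_SUFFIXES = {".pdf", ".txt"}


def expected_output_name(input_name: str, profile_file: str = "profile.json") -> str:
    """Guess the result file name for one input document.

    Assumption: the name is "<input stem>_<profile file stem>.json",
    as suggested by invoice_001.pdf -> invoice_001_profile.json in the tree.
    """
    return f"{PurePath(input_name).stem}_{PurePath(profile_file).stem}.json"


def check_layout(root: Path) -> list[str]:
    """Return a list of human-readable problems found under the mapped volume."""
    problems = []
    if not (root / "config" / "profile.json").is_file():
        problems.append("missing config/profile.json")
    input_dir = root / "input"
    docs = []
    if input_dir.is_dir():
        docs = [p for p in input_dir.iterdir() if p.suffix.lower() in ALLOWED_SUFFIXES]
    if not docs:
        problems.append("no .pdf or .txt files in input/")
    return problems


if __name__ == "__main__":
    root = Path("examples")  # default mapped volume
    for problem in check_layout(root):
        print("WARNING:", problem)
    if (root / "input").is_dir():
        for doc in sorted((root / "input").iterdir()):
            if doc.suffix.lower() in ALLOWED_SUFFIXES:
                print(doc.name, "->", expected_output_name(doc.name))
```

Run it from the directory that contains examples/. An empty report means the profile and at least one input document were found.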
The extraction logic is defined in a JSON profile. Each rule uses one of two methods:
- Regex Strategy
Best for structured values such as IDs, dates, and codes.
{
  "targetField": "ProjectCode",
  "method": "Regex",
  "regexPattern": "Project Number:\\s*(\\d+)"
}

- Text Range Strategy
Best for extracting blocks of text or descriptions where Regex is too complex.
{
  "targetField": "DescriptionBlock",
  "method": "TextRange",
  "startAnchor": "DESCRIPTION:",
  "endAnchor": "TECHNICAL DATA:",
  "trimWhitespace": true
}

A complete profile combines both rules:

{
  "profileName": "Engineering_Spec_V1",
  "rules": [
    {
      "targetField": "ProjectCode",
      "method": "Regex",
      "regexPattern": "Project Number:\\s*(\\d+)"
    },
    {
      "targetField": "DescriptionBlock",
      "method": "TextRange",
      "startAnchor": "DESCRIPTION:",
      "endAnchor": "TECHNICAL DATA:",
      "trimWhitespace": true
    }
  ]
}

DocParser consists of two components:

- DocParser.Core: The extraction engine logic (standard .NET 8 library).
- DocParser.CLI: A command-line interface that implements batch processing and file I/O.
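
To make the two strategies concrete, here is a minimal Python sketch of what applying a profile to a document could look like. It is an illustration only, not DocParser.Core's actual .NET implementation. The function names `apply_rule` and `extract` are hypothetical, and several behaviors are assumptions: the Regex rule returns the first capture group, the TextRange rule excludes both anchors, and a rule that does not match yields `null`. Python's `re` syntax also differs from .NET regex in some details.

```python
import json
import re

# The complete profile from above, as strict JSON (no trailing commas).
PROFILE_JSON = r"""
{
  "profileName": "Engineering_Spec_V1",
  "rules": [
    {"targetField": "ProjectCode", "method": "Regex",
     "regexPattern": "Project Number:\\s*(\\d+)"},
    {"targetField": "DescriptionBlock", "method": "TextRange",
     "startAnchor": "DESCRIPTION:", "endAnchor": "TECHNICAL DATA:",
     "trimWhitespace": true}
  ]
}
"""


def apply_rule(rule: dict, text: str):
    """Apply one rule to the text; return the extracted string or None."""
    method = rule["method"]
    if method == "Regex":
        match = re.search(rule["regexPattern"], text)
        if match is None:
            return None
        # Assumption: prefer the first capture group, else the whole match.
        return match.group(1) if match.groups() else match.group(0)
    if method == "TextRange":
        start = text.find(rule["startAnchor"])
        if start == -1:
            return None
        start += len(rule["startAnchor"])  # assumption: anchors are excluded
        end = text.find(rule["endAnchor"], start)
        if end == -1:
            return None
        block = text[start:end]
        return block.strip() if rule.get("trimWhitespace") else block
    raise ValueError(f"unknown method: {method}")


def extract(profile: dict, text: str) -> dict:
    """Run every rule in the profile and collect results by targetField."""
    return {rule["targetField"]: apply_rule(rule, text) for rule in profile["rules"]}


# Made-up sample document text, for illustration only.
SAMPLE = (
    "Project Number: 4821\n"
    "DESCRIPTION:\n"
    "  Steel bracket, galvanized.\n"
    "TECHNICAL DATA:\n"
    "Weight: 2kg\n"
)

result = extract(json.loads(PROFILE_JSON), SAMPLE)
print(json.dumps(result, indent=2))
```

For the sample text this prints an object with `ProjectCode` set to `"4821"` and `DescriptionBlock` set to `"Steel bracket, galvanized."`. Because the rules live in the profile, supporting a new document layout means editing JSON rather than the extraction code.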
You don't need the .NET SDK installed. You can run the engine using Docker Compose.
- Setup: Ensure your examples/config has a profile and examples/input has your documents.
- Run the engine:
  docker compose up