Deterministic execution substrate for structuring text into canonical artifacts.
This repository contains the DataDiddler runtime surface.
It is responsible for:
- executing stage pipelines
- enforcing schema boundaries
- producing canonical structured artifacts (GroundedDatasetBlock)
This is not the full system.
It is the execution layer only.
core/ → contract surface (schema + artifact definition)
lenses/ → execution primitives (stage behavior)
pipeline/ → orchestration (stage wiring + execution order)
integrity/ → trust enforcement (evidence + contradiction handling)
query/ → inspection layer (render + index)
tooling/ → build + toolchain support
publish/ → placeholder (non-operational)
lenses define behavior
pipeline defines sequence
core defines structure
DataDiddler does not accept arbitrary data.
Input must be:
- plain text or Markdown
- structurally recoverable (clear sections, paragraphs, or records)
- free of presentation-layer noise
Unsupported inputs:
- raw HTML
- layout-heavy documents without preprocessing
Upstream normalization is required for:
- scraped web content
- HTML documents
- domain-specific formats
DataDiddler does not clean data.
It assumes the data has already been shaped into something worth structuring.
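As an illustration of the input contract above, an upstream preflight check might reject obviously unsuitable input before it reaches the pipeline. This is a minimal sketch, not part of DataDiddler: the function name `looks_structurable`, the tag regex, and the 5% threshold are all assumptions chosen for the example.

```python
import re

# Hypothetical heuristic: matches an opening or closing tag such as <div>, </p>, <br/>.
_HTML_TAG = re.compile(r"</?[a-zA-Z][a-zA-Z0-9]*(\s[^<>]*)?/?>")


def looks_structurable(text: str, max_tagged_ratio: float = 0.05) -> bool:
    """Return True if text looks like plain text or Markdown rather than raw HTML.

    The threshold is an assumption for this sketch, not a DataDiddler setting.
    """
    lines = [line for line in text.splitlines() if line.strip()]
    if not lines:
        return False  # nothing to structure
    tagged = sum(1 for line in lines if _HTML_TAG.search(line))
    return tagged / len(lines) <= max_tagged_ratio


if __name__ == "__main__":
    print(looks_structurable("# Title\n\nA paragraph.\n"))
    print(looks_structurable("<html><body><p>hi</p></body></html>"))
```

A check like this only flags input for upstream normalization. It does not clean anything, which keeps it consistent with the boundary described above.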
Pipeline:
Rake → Separator → Tagger → Packager
Each stage:
- consumes explicit inputs
- produces explicit outputs
- runs deterministically
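The stage contract above can be sketched as a chain of pure functions. Only the stage names and their order come from this README. The stage bodies, the record fields (`index`, `text`, `kind`), and `run_pipeline` are hypothetical stand-ins, and the real stages in lenses/ and pipeline/ may differ substantially.

```python
from typing import Any, Callable, Dict, List, Tuple


def rake(text: str) -> List[str]:
    # Hypothetical behavior: split the input into blank-line-delimited chunks.
    return [chunk.strip() for chunk in text.split("\n\n") if chunk.strip()]


def separator(chunks: List[str]) -> List[Dict[str, Any]]:
    # Hypothetical behavior: turn each chunk into an indexed record.
    return [{"index": i, "text": chunk} for i, chunk in enumerate(chunks)]


def tagger(records: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    # Hypothetical behavior: attach a purely structural tag to each record.
    return [
        {**rec, "kind": "heading" if rec["text"].startswith("#") else "paragraph"}
        for rec in records
    ]


def packager(records: List[Dict[str, Any]]) -> Dict[str, Any]:
    # Hypothetical behavior: wrap the records in a single output object.
    return {"records": records}


# Fixed order: Rake -> Separator -> Tagger -> Packager.
STAGES: Tuple[Callable[[Any], Any], ...] = (rake, separator, tagger, packager)


def run_pipeline(text: str) -> Dict[str, Any]:
    value: Any = text
    for stage in STAGES:
        value = stage(value)  # each stage sees only the previous stage's output
    return value


if __name__ == "__main__":
    print(run_pipeline("# Title\n\nBody text."))
```

Because each stage here is a pure function of its explicit input, with no clock, randomness, or shared state, the same input yields the same output on every run.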
From the repository root:
./run_datadiddler.ps1
This executes the pipeline on the configured input directory and produces a structured artifact.
Primary artifact:
GroundedDatasetBlock.vN.json
Properties:
- schema-validated
- structurally complete
- content may be empty but shape is enforced
Guarantees:
- deterministic execution
- fail-closed behavior (no partial success)
- explicit artifact production
- schema enforcement
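To make "shape is enforced" and "fail-closed" concrete, the sketch below validates a block before writing it and leaves no file behind on failure. The required keys (`schema_version`, `records`) and the names `validate_shape` and `write_artifact` are assumptions for illustration. The actual GroundedDatasetBlock schema is defined in core/ and is not reproduced here.

```python
import json
import os
import tempfile
from typing import Any, Dict

# Hypothetical shape; the real GroundedDatasetBlock schema lives in core/.
REQUIRED_KEYS = {"schema_version": int, "records": list}


class ShapeError(ValueError):
    """Raised when a block does not match the required shape."""


def validate_shape(block: Any) -> None:
    if not isinstance(block, dict):
        raise ShapeError("block must be a JSON object")
    for key, expected in REQUIRED_KEYS.items():
        if key not in block:
            raise ShapeError(f"missing key: {key}")
        if not isinstance(block[key], expected):
            raise ShapeError(f"wrong type for key: {key}")


def write_artifact(block: Dict[str, Any], path: str) -> None:
    """Write the block to path, or write nothing at all."""
    validate_shape(block)  # raises before any file is touched
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            # sort_keys keeps the serialized bytes stable across runs
            json.dump(block, handle, sort_keys=True, indent=2)
        os.replace(tmp_path, path)  # final file appears only on full success
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

In this sketch, a block with an empty `records` list is accepted because its shape is complete, while a block missing a required key raises `ShapeError` and no artifact file appears.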
This repository does not provide:
- semantic correctness
- domain-specific tagging
- ingestion or normalization
- governance or verification
- external trust systems
The publish/ directory is reserved for future functionality.
It does not currently participate in execution.
A minimal example is provided in:
examples/sample_input/
examples/sample_output/
This demonstrates the expected input shape and resulting output structure.
This is not:
- the full DataDiddler system
- the ingestion layer
- the governance layer
- a domain-specific processor
It is the execution substrate only.
Structure is enforced. Meaning must conform to it.