v0.5.0 — Offline-First Build Pipeline Release
v0.5 Draft (2026-04-26)
Status: RFC. Introduces the v0.5 offline-first build pipelines (ASR / text /
vectorization / moderation), a derived-asset provenance schema, and the
single-entrypoint pipeline CLI. No breaking changes to v0.4 manifests; the new
pipelines write everything under derived/<name>/ so existing records are
untouched until a pipeline is explicitly run against them.
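
To illustrate the derived/<name>/ rule, here is a minimal sketch of a path check, assuming output paths are record-relative and POSIX-style. The helper name is_under_derived and the file names in the usage note are hypothetical; they are not taken from the repository.

```python
from pathlib import PurePosixPath


def is_under_derived(output_path: str, pipeline_name: str) -> bool:
    """Return True if output_path sits strictly below derived/<pipeline_name>/.

    Assumption: output_path is relative to the record root. Absolute paths
    and paths containing ".." are rejected so they cannot escape the prefix.
    """
    parts = PurePosixPath(output_path).parts
    return (
        len(parts) > 2
        and parts[0] == "derived"
        and parts[1] == pipeline_name
        and ".." not in parts
    )
```

Under these assumptions, is_under_derived("derived/asr/transcript.json", "asr") holds, while a path such as "manifest.json" or "derived/asr/../../manifest.json" is rejected.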
Added
- pipelines/ directory with the v0.5 pipeline contract:
  - pipelines/__init__.py: PipelineSpec registry + dispatcher.
  - pipelines/_descriptor.py: shared DescriptorBuilder that emits
    <output>.descriptor.json validated against
    schemas/derived-asset.schema.json.
  - pipelines/asr/: dummy (deterministic, no model) and faster-whisper
    (lazy-imported, opt-in) backends.
  - pipelines/text/: NFKC normalisation + conservative redaction
    (priority order: URLs with embedded credentials, emails, CN ID
    cards, CN mobile phones, IPv4 addresses, credit-card-like 13–19
    digit runs, generic phone numbers). Replacements use stable
    category placeholders (<EMAIL>, <PHONE_CN>, <ID_CN>, <IPV4>,
    <CARD>, <PHONE>, <URL_WITH_CREDENTIALS>). The redactions.json
    sidecar carries rule_name + start/end + replacement only and is
    auditable without re-leaking matched substrings.
  - pipelines/vectorization/: paragraph-aware chunking with absolute char
    offsets, hash (deterministic 64-D) and sentence-transformers backends,
    optional Qdrant push (backend and model_id stored as separate
    payload keys so downstream filters work without ambiguity).
  - pipelines/moderation/: deterministic regex/wordlist policy with
    severity-based outcome aggregation (pass | flag | block). Built-in
    v0.5 policy + --policy-file for JSON/YAML overrides. Flags carry
    rule + span only, never the matched substring.
- tools/run_pipeline.py: single CLI entrypoint
  (python tools/run_pipeline.py <name> --record path/to/record …) shared by
  every pipeline.
- tools/validate_pipelines.py: static guard that enforces the
  derived/<spec.name>/ output-prefix invariant and refuses any module that
  imports a hosted-API client (openai, anthropic, google.generativeai,
  cohere, aliyun_sdk_bailian, …). This is what turns "offline-first" into
  machine-checked policy.
- tools/test_pipelines.py: umbrella test driver. Runs the four
  per-pipeline test scripts as subprocesses so an import failure in one
  pipeline cannot mask test results in another.
- tools/test_asr_demo.py: end-to-end test for examples/asr-demo.
- schemas/derived-asset.schema.json: provenance descriptor schema
  (schema_version / derived_id / record_id / top-level pipeline +
  pipeline_version / actor_role / inputs.{source_pointers,inputs_hash}
  / output.{path,outputs_hash} / optional
  model.{id,version?,source?,online_api_used: false} (required when
  pipeline is asr or vectorization) / optional moderation_outcome).
- examples/asr-demo/: self-contained fixture record. run_demo.sh
  regenerates a deterministic placeholder WAV (DLRS is pointer-first so
  audio is never committed) and walks all four pipelines end-to-end with
  no model download.
- docs/PIPELINE_GUIDE.md: companion to the example. Covers the contract,
  the descriptor, every pipeline's CLI, the authoring guide, and what v0.5
  deliberately is not.
- .github/workflows/validate.yml: dedicated pipelines job parallel to
  validate, matrix over Python 3.11 and 3.12.
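
The priority-ordered redaction and its sidecar in pipelines/text/ can be illustrated with a minimal sketch. This is not the shipped implementation: the three regexes are simplified stand-ins, the rule_name values are hypothetical, and the sketch assumes start/end offsets refer to the NFKC-normalised text. Only the placeholder strings and the sidecar keys come from these notes.

```python
import re
import unicodedata

# Hypothetical, simplified rules listed in priority order (highest first).
# The real v0.5 rule set covers more categories and stricter patterns.
RULES = [
    ("email", re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"), "<EMAIL>"),
    ("phone_cn", re.compile(r"(?<!\d)1[3-9]\d{9}(?!\d)"), "<PHONE_CN>"),
    ("ipv4", re.compile(r"(?<!\d)(?:\d{1,3}\.){3}\d{1,3}(?!\d)"), "<IPV4>"),
]


def redact(text: str) -> tuple[str, list[dict]]:
    """Return (redacted_text, sidecar_entries).

    Sidecar entries hold rule_name, start, end and replacement only, so the
    matched substring is never written out. Offsets are into the normalised
    text (an assumption of this sketch).
    """
    text = unicodedata.normalize("NFKC", text)

    # Higher-priority rules claim spans first; later rules skip overlaps.
    claimed: list[tuple[int, int, str, str]] = []
    for rule_name, pattern, replacement in RULES:
        for match in pattern.finditer(text):
            start, end = match.span()
            if all(end <= s or start >= e for s, e, _, _ in claimed):
                claimed.append((start, end, rule_name, replacement))

    # Rebuild the text left to right, swapping each claimed span.
    claimed.sort()
    pieces: list[str] = []
    sidecar: list[dict] = []
    cursor = 0
    for start, end, rule_name, replacement in claimed:
        pieces.append(text[cursor:start])
        pieces.append(replacement)
        sidecar.append(
            {"rule_name": rule_name, "start": start, "end": end, "replacement": replacement}
        )
        cursor = end
    pieces.append(text[cursor:])
    return "".join(pieces), sidecar
```

Because spans are claimed in priority order, an input such as 1.2.3.4@example.com is redacted once as an email rather than partially as an IPv4 address, and the sidecar can be serialised to JSON without exposing the original value.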
Changed
- tools/batch_validate.py: collapsed the four per-pipeline tests into a
  single pipelines step delegating to tools/test_pipelines.py, then
  added asr_demo for the end-to-end fixture. Local report:
  11/11 passed.
- docs/GAP_ANALYSIS.md and docs/IMPLEMENTATION_STATUS.md rewritten to
  reflect v0.5 (overall completion ~83%).
- ROADMAP.md: v0.5 marked as released, with the per-PR Closes #N
  governance rule appended to the v0.5 section so future major versions
  inherit it.
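
The pipelines step delegates to tools/test_pipelines.py, which runs each per-pipeline test script as its own subprocess. A minimal sketch of that isolation pattern follows, assuming each test script signals failure through a non-zero exit code. The function name run_scripts and its return shape are hypothetical, not the driver's actual interface.

```python
import subprocess
import sys
from pathlib import Path


def run_scripts(scripts: list[Path]) -> dict[str, bool]:
    """Run each script in a fresh interpreter and record pass/fail by name.

    An ImportError or crash in one script only fails that script, because
    every script gets its own process and its own import state.
    """
    results: dict[str, bool] = {}
    for script in scripts:
        proc = subprocess.run(
            [sys.executable, str(script)],
            capture_output=True,
            text=True,
        )
        results[script.name] = proc.returncode == 0
    return results
```

A caller can then report the per-script results and exit non-zero if any value is False, which keeps one broken pipeline from hiding the status of the others.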
Closes
#28 (epic) and its sub-issues #29, #30, #31, #32, #33, #34, #35, #36, #37, #38.
Validation
tools/batch_validate.py --report-dir reports -> 11/11 passed.