Context
Reviewed DataHackIL/avdp-synth-corpus PR #2 (the 8-clip M2a wet-test, currently the only live corpus content) ahead of first handoff to the She-Proves / Elephant consumer teams. Per CLAUDE.md, voice/audio quality items are tracked separately in deliveries/002-m2a-wettest/ and excluded from this issue. What remains is schema and metadata drift that would force consumer teams to special-case or guess. Four items:
1. PreprocessingApplied.normalized_dbfs is hardcoded -1.0
synthbanshee/cli.py:622 writes normalized_dbfs=-1.0 regardless of PreprocessingConfig.target_peak_dbfs. The default target became -2.0 dBFS in #78, so every clip generated after #78 has a JSON value that disagrees with what the preprocessor was configured to do. Semantically, this field has always represented the target of the normalization step (it sits in a block describing applied preprocessing); the fix is to read it from the live config, not invent a new meaning.
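The intended wiring can be sketched as follows. This is a minimal illustration, not the actual code in synthbanshee/cli.py: the two dataclasses are simplified stand-ins, build_preprocessing_applied is a hypothetical helper name, and only the field names target_peak_dbfs and normalized_dbfs come from this issue.

```python
from dataclasses import dataclass


# Simplified stand-ins for the real classes (assumption: the real ones
# carry more fields; only these two names come from the issue).
@dataclass
class PreprocessingConfig:
    target_peak_dbfs: float = -2.0  # default since #78, per the issue


@dataclass
class PreprocessingApplied:
    normalized_dbfs: float


def build_preprocessing_applied(config: PreprocessingConfig) -> PreprocessingApplied:
    """Hypothetical helper: copy the target from the live config
    instead of writing a hardcoded -1.0."""
    return PreprocessingApplied(normalized_dbfs=config.target_peak_dbfs)


# A non-default target makes the wiring visible: the recorded value
# follows whatever the config says.
custom = PreprocessingConfig(target_peak_dbfs=-3.5)
applied = build_preprocessing_applied(custom)
```

A test along these lines compares the recorded value against the config attribute itself rather than a literal, which is what "without magic numbers" amounts to.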
2. weak_label.has_violence derivation rule isn't pinned anywhere
The rule lives only in code (LabelGenerator.build_clip_metadata in synthbanshee/labels/generator.py):
has_violence = any(e.tier1_category != "NONE" for e in events)
The downstream corpus repo's README.md invented a different rule (typology in {SV, IT, NEG} and max_intensity >= 3) that disagrees with the code on every NEG row. The fix is to write the actual rule into docs/spec.md §5.1 with the source file/line reference, so external docs can mirror it.
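To show how the two rules can diverge, here is a small sketch that puts them side by side. The Event class is a stand-in with only the one attribute the rule uses, the function names are hypothetical, and the sample values (including the "PLACEHOLDER" category) are invented for illustration and not taken from the corpus.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class Event:
    # Stand-in: only tier1_category is needed for the rule.
    tier1_category: str


def has_violence_code_rule(events: Iterable[Event]) -> bool:
    """Rule as quoted from LabelGenerator.build_clip_metadata."""
    return any(e.tier1_category != "NONE" for e in events)


def has_violence_readme_rule(typology: str, max_intensity: int) -> bool:
    """Rule as described in the downstream README (not what the code does)."""
    return typology in {"SV", "IT", "NEG"} and max_intensity >= 3


# Invented example: a clip whose events are all "NONE".
only_none = [Event("NONE"), Event("NONE")]
code_says = has_violence_code_rule(only_none)
readme_says = has_violence_readme_rule("NEG", 3)
```

The code rule looks only at event categories, while the README rule looks only at typology and intensity, so the two can return different answers for the same clip. Note also that the code rule yields False for a clip with no events at all, since any() over an empty sequence is False.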
3. docs/spec.md §2 / §5.1 examples use uppercase paths
§2.5 already mandates lowercase filenames, and the CLI lowercases scene_id and speaker_id for on-disk paths (cli.py:365–367). But the §2.1 directory diagram, §2.4 clip-id examples, and §5.1 example JSON all show uppercase. This is the doc-side source of the casing confusion that bites consumer teams who try to reconstruct paths from speakers[].speaker_id (which is uppercase) and end up with a directory that doesn't exist.
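Until the spec examples are fixed, a consumer rebuilding paths from metadata needs to lowercase the ids first. The sketch below assumes a root/speaker/scene layout purely for illustration; the real layout is whatever §2.1 specifies, and clip_dir and the sample ids are hypothetical. Only the lowercasing of scene_id and speaker_id comes from this issue.

```python
from pathlib import PurePosixPath


def clip_dir(root: str, speaker_id: str, scene_id: str) -> PurePosixPath:
    """Hypothetical helper: ids are uppercase in the JSON, but the
    on-disk path components are lowercased."""
    return PurePosixPath(root) / speaker_id.lower() / scene_id.lower()


# Placeholder ids, not real corpus values.
path = clip_dir("corpus", "SPK_EXAMPLE_001", "SCENE_EXAMPLE_01")
```

Joining the uppercase ids directly would produce a path that does not exist on a case-sensitive filesystem.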
4. generation_metadata and voice_family absent from §5.1 example
Both fields are part of the current ClipMetadata schema (M11) but neither appears in the spec example. Consumer teams reading the spec will not know to expect them — or to handle them being null on pre-M11 clips. The example needs both fields populated and a short prose note on when each is present.
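On the consumer side, handling both cases might look like the sketch below. It assumes nothing about where in the JSON each field sits: optional_field is a hypothetical helper that works on whichever object carries the field, and the sample records are invented.

```python
from typing import Any, Mapping, Optional


def optional_field(record: Mapping[str, Any], name: str) -> Optional[Any]:
    """Return record[name], or None when the key is absent or JSON null."""
    return record.get(name)


# Invented records: one without the fields (as a pre-M11 clip might
# look) and one with an explicit null.
pre_m11 = {"clip_id": "example_a"}
explicit_null = {"clip_id": "example_b", "voice_family": None}

missing = optional_field(pre_m11, "generation_metadata")
nulled = optional_field(explicit_null, "voice_family")
```

Treating an absent key and an explicit null the same way keeps consumer code from branching on which generation a clip belongs to.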
Resolution plan
- PR A, fix(cli): make normalized_dbfs read from PreprocessingConfig.target_peak_dbfs, with a focused unit test in TestRunGeneratePipeline that verifies the wiring without magic numbers.
- PR B, docs(spec): clean §2.1 diagram, tighten §2.3 paragraph on uppercase-id vs lowercase-directory, lowercase §2.4 examples, update §5.1 example JSON to reflect the current schema (with -2.0 target, generation_metadata, voice_family), and pin the has_violence derivation rule.
A follow-up avdp-synth-corpus data PR will then regenerate the 8 wet-test clips on top of PR A + PR B so the on-disk data matches the spec exactly.
Out of scope
- Speaker-id casing. The regex in synthbanshee/config/speaker_config.py:15 enforces uppercase; ~30 tests and the speaker YAML filenames use it. Doc-side clarification only.
- Existing corpus clips. Cannot be edited in place — a regen fixes them. Tracked in the corpus repo PR, not here.
- Audio quality items (AGG RMS escalation, voice diversity, overlap). Already tracked in deliveries/002-m2a-wettest/notes.md.