
schema drift: corpus pre-handoff readiness (normalized_dbfs, has_violence rule, lowercase example paths) #101

@shaypal5

Context

Reviewed DataHackIL/avdp-synth-corpus PR #2 (the 8-clip M2a wet-test, currently the only live corpus content) ahead of first handoff to the She-Proves / Elephant consumer teams. Per CLAUDE.md, voice/audio quality items are tracked separately in deliveries/002-m2a-wettest/ and excluded from this issue. What remains is schema and metadata drift that would force consumer teams to special-case or guess. Four items:

1. PreprocessingApplied.normalized_dbfs is hardcoded -1.0

synthbanshee/cli.py:622 writes normalized_dbfs=-1.0 regardless of PreprocessingConfig.target_peak_dbfs. The default target became -2.0 dBFS in #78, so every clip generated after #78 has a JSON value that disagrees with what the preprocessor was configured to do. Semantically, this field has always represented the target of the normalization step (it sits in a block describing applied preprocessing); the fix is to read it from the live config, not invent a new meaning.
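A minimal sketch of the intended wiring, using simplified stand-ins for the two types. Only the names PreprocessingConfig, target_peak_dbfs, PreprocessingApplied, and normalized_dbfs come from this issue; the dataclass shapes and the build_preprocessing_applied helper are hypothetical and do not reflect the actual code in cli.py.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PreprocessingConfig:
    # Stand-in for the real config; -2.0 is the default reported since #78.
    target_peak_dbfs: float = -2.0


@dataclass(frozen=True)
class PreprocessingApplied:
    # Stand-in for the metadata block written into the clip JSON.
    normalized_dbfs: float


def build_preprocessing_applied(config: PreprocessingConfig) -> PreprocessingApplied:
    """Hypothetical helper: copy the live target instead of hardcoding -1.0."""
    return PreprocessingApplied(normalized_dbfs=config.target_peak_dbfs)
```

A test in this style can pass a non-default target and assert that the metadata echoes it, which checks the wiring without depending on any particular default value.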

2. weak_label.has_violence derivation rule isn't pinned anywhere

The rule lives only in code (LabelGenerator.build_clip_metadata in synthbanshee/labels/generator.py):

has_violence = any(e.tier1_category != "NONE" for e in events)

The downstream corpus repo's README.md invented a different rule (typology in {SV, IT, NEG} and max_intensity >= 3) that disagrees with the code on every NEG row. The fix is to write the actual rule into docs/spec.md §5.1 with the source file/line reference, so external docs can mirror it.
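For reference, the code-side rule can be exercised in isolation. This is a sketch under assumptions: the Event dataclass is a hypothetical stand-in carrying only tier1_category, derive_has_violence is an illustrative name, and "PLACEHOLDER" stands for any category other than "NONE" (the real category values are not listed in this issue).

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Event:
    # Hypothetical stand-in; the real event model has more fields.
    tier1_category: str


def derive_has_violence(events: Iterable[Event]) -> bool:
    """True if at least one event has a tier1_category other than "NONE"."""
    return any(e.tier1_category != "NONE" for e in events)
```

Note that under this rule a clip with no events at all yields False, since any() over an empty iterable is False. The rule also never consults typology or max_intensity, which is why a doc that derives the flag from those fields can disagree with it.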

3. docs/spec.md §2 / §5.1 examples use uppercase paths

§2.5 already mandates lowercase filenames, and the CLI lowercases scene_id and speaker_id for on-disk paths (cli.py:365–367). But the §2.1 directory diagram, §2.4 clip-id examples, and §5.1 example JSON all show uppercase. This is the doc-side source of the casing confusion that bites consumer teams who try to reconstruct paths from speakers[].speaker_id (which is uppercase) and end up with a directory that doesn't exist.
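On the consumer side, the casing mismatch can be avoided by lowercasing ids before building a path. The sketch below assumes a hypothetical root/scene/speaker layout and made-up ids purely for illustration; the real directory structure is defined in docs/spec.md §2.1, and only the lowercasing step is taken from this issue.

```python
from pathlib import PurePosixPath


def clip_dir(corpus_root: str, scene_id: str, speaker_id: str) -> PurePosixPath:
    """Build an on-disk directory from uppercase metadata ids.

    The root/scene/speaker nesting is an assumed layout, not the spec's.
    """
    # Ids in the JSON are uppercase; on-disk path components are lowercase.
    return PurePosixPath(corpus_root) / scene_id.lower() / speaker_id.lower()


# Hypothetical ids, shown only to illustrate the casing conversion.
print(clip_dir("data", "SCENE_A", "SPK_01"))
```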

4. generation_metadata and voice_family absent from §5.1 example

Both fields are part of the current ClipMetadata schema (M11) but neither appears in the spec example. Consumer teams reading the spec will not know to expect them — or to handle them being null on pre-M11 clips. The example needs both fields populated and a short prose note on when each is present.
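Until the spec example is updated, a consumer can read both fields defensively. This sketch assumes the two keys sit at the top level of the clip metadata JSON and treats their values as opaque; neither assumption is confirmed by this issue, and read_optional_fields is a hypothetical helper name.

```python
import json
from typing import Any, Optional, Tuple


def read_optional_fields(clip_json: str) -> Tuple[Optional[Any], Optional[Any]]:
    """Return (generation_metadata, voice_family), or None where unavailable."""
    meta = json.loads(clip_json)
    # dict.get covers both a missing key and an explicit JSON null.
    return meta.get("generation_metadata"), meta.get("voice_family")
```

Returning None for both the absent-key and null cases lets downstream code branch once on "not available" rather than distinguishing pre-M11 clips by some other means.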

Resolution plan

  • PR A: fix(cli): make normalized_dbfs read from PreprocessingConfig.target_peak_dbfs, with a focused unit test in TestRunGeneratePipeline that verifies the wiring without magic numbers.
  • PR B: docs(spec): clean §2.1 diagram, tighten §2.3 paragraph on uppercase-id vs lowercase-directory, lowercase §2.4 examples, update §5.1 example JSON to reflect the current schema (with -2.0 target, generation_metadata, voice_family), and pin the has_violence derivation rule.

A follow-up avdp-synth-corpus data PR will then regenerate the 8 wet-test clips on top of PR A + PR B so the on-disk data matches the spec exactly.

Out of scope

  • Speaker-id casing. The regex in synthbanshee/config/speaker_config.py:15 enforces uppercase; ~30 tests and the speaker YAML filenames use it. Doc-side clarification only.
  • Existing corpus clips. Cannot be edited in place — a regen fixes them. Tracked in the corpus repo PR, not here.
  • Audio quality items (AGG RMS escalation, voice diversity, overlap) — already tracked in deliveries/002-m2a-wettest/notes.md.
