schema drift: corpus pre-handoff readiness (normalized_dbfs, has_violence rule, lowercase example paths)

## Context

Reviewed `DataHackIL/avdp-synth-corpus` PR #2 (the 8-clip M2a wet-test, currently the only live corpus content) ahead of first handoff to the She-Proves / Elephant consumer teams. Per `CLAUDE.md`, voice/audio quality items are tracked separately in `deliveries/002-m2a-wettest/` and excluded from this issue. What remains is **schema** and **metadata** drift that would force consumer teams to special-case or guess. Four items:

### 1. `PreprocessingApplied.normalized_dbfs` is hardcoded `-1.0`

`synthbanshee/cli.py:622` writes `normalized_dbfs=-1.0` regardless of `PreprocessingConfig.target_peak_dbfs`. The default target became `-2.0 dBFS` in #78, so every clip generated after #78 has a JSON value that disagrees with what the preprocessor was configured to do. Semantically, this field has always represented the **target** of the normalization step (it sits in a block describing applied preprocessing); the fix is to read it from the live config, not invent a new meaning.

### 2. `weak_label.has_violence` derivation rule isn't pinned anywhere

The rule lives only in code (`LabelGenerator.build_clip_metadata` in `synthbanshee/labels/generator.py`):

```python
has_violence = any(e.tier1_category != "NONE" for e in events)
```

The downstream corpus repo's `README.md` invented a different rule (`typology in {SV, IT, NEG} and max_intensity >= 3`) that disagrees with the code on every NEG row. The fix is to write the actual rule into `docs/spec.md` §5.1 with the source file/line reference, so external docs can mirror it.

### 3. `docs/spec.md` §2 / §5.1 examples use uppercase paths

§2.5 already mandates lowercase filenames, and the CLI lowercases `scene_id` and `speaker_id` for on-disk paths (`cli.py:365–367`). But the §2.1 directory diagram, §2.4 clip-id examples, and §5.1 example JSON all show uppercase. This is the doc-side source of the casing confusion that bites consumer teams who try to reconstruct paths from `speakers[].speaker_id` (which is uppercase) and end up with a directory that doesn't exist.

### 4. `generation_metadata` and `voice_family` absent from §5.1 example

Both fields are part of the current `ClipMetadata` schema (M11) but neither appears in the spec example. Consumer teams reading the spec will not know to expect them — or to handle them being null on pre-M11 clips. The example needs both fields populated and a short prose note on when each is present.

## Resolution plan

- **PR A** — `fix(cli):` make `normalized_dbfs` read from `PreprocessingConfig.target_peak_dbfs`, with a focused unit test in `TestRunGeneratePipeline` that verifies the wiring without magic numbers.
- **PR B** — `docs(spec):` clean §2.1 diagram, tighten §2.3 paragraph on uppercase-id vs lowercase-directory, lowercase §2.4 examples, update §5.1 example JSON to reflect the current schema (with `-2.0` target, `generation_metadata`, `voice_family`), and pin the `has_violence` derivation rule.

A follow-up `avdp-synth-corpus` data PR will then regenerate the 8 wet-test clips on top of PR A + PR B so the on-disk data matches the spec exactly.

## Out of scope

- **Speaker-id casing.** The regex in `synthbanshee/config/speaker_config.py:15` enforces uppercase; ~30 tests and the speaker YAML filenames use it. Doc-side clarification only.
- **Existing corpus clips.** Cannot be edited in place — a regen fixes them. Tracked in the corpus repo PR, not here.
- **Audio quality items** (AGG RMS escalation, voice diversity, overlap) — already tracked in `deliveries/002-m2a-wettest/notes.md`.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

schema drift: corpus pre-handoff readiness (normalized_dbfs, has_violence rule, lowercase example paths) #101

Context

1. `PreprocessingApplied.normalized_dbfs` is hardcoded `-1.0`

2. `weak_label.has_violence` derivation rule isn't pinned anywhere

3. `docs/spec.md` §2 / §5.1 examples use uppercase paths

4. `generation_metadata` and `voice_family` absent from §5.1 example

Resolution plan

Out of scope

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

schema drift: corpus pre-handoff readiness (normalized_dbfs, has_violence rule, lowercase example paths) #101

Description

Context

1. PreprocessingApplied.normalized_dbfs is hardcoded -1.0

2. weak_label.has_violence derivation rule isn't pinned anywhere

3. docs/spec.md §2 / §5.1 examples use uppercase paths

4. generation_metadata and voice_family absent from §5.1 example

Resolution plan

Out of scope

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions

1. `PreprocessingApplied.normalized_dbfs` is hardcoded `-1.0`

2. `weak_label.has_violence` derivation rule isn't pinned anywhere

3. `docs/spec.md` §2 / §5.1 examples use uppercase paths

4. `generation_metadata` and `voice_family` absent from §5.1 example