DataHackIL · shaypal5 · May 31, 2026
diff --git a/README.md b/README.md
@@ -1,112 +1,149 @@
 # SynthBanshee
 
-**SynthBanshee** is a config-driven pipeline for generating large-scale synthetic Hebrew audio
-datasets. It is part of the [AVDP](https://datahack.org.il) (Audio Violence Dataset Project)
-initiative run by DataHack, which produces training data for two AI safety products:
+[![CI](https://github.com/DataHackIL/SynthBanshee/actions/workflows/ci.yml/badge.svg)](https://github.com/DataHackIL/SynthBanshee/actions/workflows/ci.yml)
+[![license: MIT](https://img.shields.io/badge/license-MIT-green.svg)](LICENSE)
+![python: 3.11+](https://img.shields.io/badge/python-3.11%2B-blue.svg)
+![audio: 16 kHz mono PCM](https://img.shields.io/badge/audio-16kHz%20mono%20PCM-0f766e.svg)
 
-- **She-Proves** — a smartphone app that passively monitors for domestic violence incidents and
-  preserves audio evidence for legal use
-- **Elephant in the Room (הפיל שבחדר)** — a Raspberry Pi–class device in social-work offices that
-  alerts security when a social worker is being physically threatened
+Created by [Shay Palachy Affek](http://www.shaypalachy.com/).
 
-The goal of SynthBanshee is to supply AI teams with a wide, deliberate distribution of voices,
-acoustic conditions, and violence types while real-data (actor recording) pipelines are built in
-parallel. Synthetic-to-real gap is expected and documented.
+**SynthBanshee** is a config-driven pipeline for generating synthetic Hebrew audio datasets for
+DataHack's [AVDP](https://datahack.org.il) (Audio Violence Dataset Project). It turns structured
+scene YAML into Hebrew dialogue, rendered speech, acoustic variants, taxonomy labels, QA reports,
+and dataset packages that AI teams can use for early model development.
 
----
+The repository supports two sensitive AI-safety product contexts: **She-Proves**, focused on
+domestic-violence incident detection research, and **Elephant in the Room** (`הפיל שבחדר`), focused
+on threat detection in social-work offices. Synthetic-to-real gap is expected and documented; real
+actor and field-data pipelines remain separate.
 
-## How it works
+![SynthBanshee audio fixture waveform and spectrogram](docs/assets/synthbanshee-click-fixture-waveform.png)
 
-Every clip is produced by four sequential pipeline stages:
+## At a Glance
 
+| Field | Value |
+|---|---|
+| Role | Synthetic Hebrew audio dataset generation for AVDP |
+| Main inputs | `SceneConfig` YAML, speaker profiles, run configs, script templates |
+| Main outputs | `.wav`, `.txt`, `.json`, `.jsonl`, manifests, split files, QA reports |
+| Language | Hebrew (`he`, rendered with `he-IL` TTS voices) |
+| Audio contract | 16 kHz, mono, 16-bit PCM, `-1.0 dBFS` peak ceiling, silence padding |
+| Current status | Phase 0 and Phase 1 complete; Tier A baseline generation delivered |
+| Packaged examples | 3 example scene configs, 8 run configs, 6 packaged speakers, 4 tracked WAV fixtures |
+
+## Pipeline
+
+```mermaid
+flowchart LR
+    scene["SceneConfig YAML<br/>project, tier, speakers, intensity"] --> script["Script generator<br/>Jinja2 + LLM"]
+    script --> tts["TTS renderer<br/>Azure he-IL SSML"]
+    tts --> acoustic["Acoustic augmenter<br/>room, device, background events"]
+    acoustic --> labels["Label generator<br/>taxonomy + event timings"]
+    labels --> qa["QA and validation<br/>format, loudness, labels, splits"]
+    qa --> package["Dataset package<br/>audio + transcript + metadata"]
 ```
-SceneConfig (YAML)
-  → [1] Script Generator   LLM fills a Jinja2 template → dialogue JSON
-  → [2] TTS Renderer       Azure he-IL SSML → per-speaker WAV segments
-  → [3] Acoustic Augmenter Room IR + device profile + noise → scene WAV
-  → [4] Label Generator    Script structure + augmentation log → AVDP-schema JSONL
-```
-
-Output per clip: `{clip_id}.wav` + `{clip_id}.txt` (Hebrew transcript) + `{clip_id}.json` (metadata).
-
-Audio spec: **16 kHz · mono · 16-bit PCM · −1.0 dBFS peak · ≥ 0.5 s silence pad**.
-
----
-
-## Dataset tiers
 
-| Tier | Description | Target (per project) |
-|---|---|---|
-| A | Clean TTS, no acoustic augmentation | 1,000 clips |
-| B | Room simulation + device profile + background noise | 2,000 clips |
-| C | Hard negatives / confusors (arguments that de-escalate, ambient sounds) | 1,000 clips |
+Each generated clip is reproducible from config and seed. Tier A renders clean TTS, Tier B adds room
+and device simulation, and Tier C adds hard negatives and confusors for robustness testing.
 
-Two projects — **She-Proves** (3–6 min clips, apartment rooms) and **Elephant in the Room**
-(1–4 min clips, clinic/welfare offices) — each receive the full tier stack.
+## Output Surface
 
----
+| Artifact | Purpose |
+|---|---|
+| `{clip_id}.wav` | Rendered 16 kHz mono PCM scene audio |
+| `{clip_id}.txt` | Hebrew transcript for the rendered scene |
+| `{clip_id}.json` | Clip metadata, taxonomy-derived `has_violence`, speakers, duration, and provenance |
+| `{clip_id}.jsonl` | Time-aligned strong labels generated from script structure and augmentation logs |
+| Manifest and split files | Dataset inventory, train/validation/test assignment, and batch-level provenance |
+| QA reports | Format, loudness, duration, taxonomy, split hygiene, and generated-asset checks |
 
-## Label taxonomy
+The preview image above is generated from a tracked 16 kHz test fixture, not from a released dataset
+sample. It shows the kind of waveform and spectrogram view used for quick audio QA.
 
-Labels use a three-level hierarchy. `has_violence` (in clip metadata and manifests) is a derived convenience field computed from the taxonomy — not an independent label. The taxonomy is the ground truth; `has_violence` exists for fast filtering and baseline modelling.
+## Dataset Tiers
 
-1. **Violence typology** (scene): `SV` · `IT` · `NEG` · `NEU`
-2. **Tier 1 category** (event): `PHYS` · `VERB` · `DIST` · `ACOU` · `EMOT` · `NONE`
-3. **Tier 2 subtype** (event): e.g. `PHYS_HARD` · `VERB_THREAT` · `DIST_SCREAM`
+| Tier | Description | Target per project |
+|---|---|---|
+| A | Clean TTS with no acoustic augmentation | 1,000 clips |
+| B | Room simulation, device profile, and background noise | 2,000 clips |
+| C | Hard negatives and confusors, including de-escalating arguments and ambient sounds | 1,000 clips |
 
-Full taxonomy: `synthbanshee/data/taxonomy.yaml`.
+Both product contexts receive the full tier stack:
 
----
+- **She-Proves**: 3-6 minute apartment scenes with pre-incident, incident, and aftermath windows.
+- **Elephant in the Room**: 1-4 minute clinic or welfare-office scenes for local threat detection.
 
-## Current status
+## Label Taxonomy
 
-Phase 0 (pipeline foundation) and Phase 1 (full Tier A pipeline) are complete — AI teams can now use the delivered Tier A dataset to start baseline model development.
+Labels use a three-level hierarchy. `has_violence` in clip metadata and manifests is a derived
+convenience field computed from the taxonomy, not an independent label.
 
-| Phase | Deliverable | Status |
-|---|---|---|
-| 0 | Single spec-compliant clip end-to-end | **Done** |
-| 1 | Script templates + multi-speaker TTS + LLM generator + batch generation + QA suite | **Done** |
-| 2 | 1,000–1,500 Tier B clips/project | Planned |
-| 3 | 4,000 clips/project, all tiers | Planned |
+| Level | Examples |
+|---|---|
+| Violence typology | `SV`, `IT`, `NEG`, `NEU` |
+| Tier 1 event category | `PHYS`, `VERB`, `DIST`, `ACOU`, `EMOT`, `NONE` |
+| Tier 2 event subtype | `PHYS_HARD`, `VERB_THREAT`, `DIST_SCREAM`, `ACOU_BREAK` |
 
----
+Full taxonomy: [`synthbanshee/data/taxonomy.yaml`](synthbanshee/data/taxonomy.yaml).
 
-## Quick start
+## Quick Start
 
 ```bash
-# Install (requires Python ≥ 3.11 and uv)
+# Install locally with Python 3.11+ and uv.
 uv pip install -e .
 
-# Generate a single clip from a scene config
-synthbanshee generate --config configs/scenes/test_scene_001.yaml
+# Generate a single clip from a scene config.
+synthbanshee generate --config configs/examples/scene_she_proves_IT_example.yaml
 
-# Generate a full Tier A batch from a run config
+# Generate a Tier A She-Proves batch from a run config.
 synthbanshee generate-batch \
   --run-config configs/run_configs/tier_a_500_she_proves.yaml \
   --output-dir data/he
 
-# Run the automated QA suite on a dataset directory
+# Run automated QA on a dataset directory.
 synthbanshee qa-report data/he
 
-# Validate an existing clip
+# Validate an existing clip.
 synthbanshee validate data/he/clip_001.wav
 ```
 
-API credentials required (set in environment or `.env`):
+Live generation requires provider credentials in your environment or local `.env` file:
 
-```
+```bash
 AZURE_TTS_KEY=...
 AZURE_TTS_REGION=...
-ANTHROPIC_API_KEY=...   # for script generation
+ANTHROPIC_API_KEY=...
 ```
 
----
+Do not commit provider keys, generated private datasets, or real-data artifacts.
+
+## Safety and Data Boundary
 
-## Docs
+SynthBanshee is a synthetic-data generator for research and model-development workflows. Generated
+clips are not evidence, user recordings, or a substitute for real validation data. They are designed
+to widen scenario coverage while actor-recording and real-data pipelines are handled separately.
+
+Sensitive context should remain explicit in configs and docs, but public-facing materials should not
+overstate model readiness, legal usefulness, or field performance.
+
+## Current Status
+
+| Phase | Deliverable | Status |
+|---|---|---|
+| 0 | Single spec-compliant clip end to end | Done |
+| 1 | Script templates, multi-speaker TTS, LLM generation, batch generation, QA suite | Done |
+| 2 | 1,000-1,500 Tier B clips per project | Planned |
+| 3 | 4,000 clips per project across all tiers | Planned |
+
+## Documentation
 
 | Document | Contents |
 |---|---|
-| `docs/spec.md` | Audio format, file naming, label schema, IAA protocol |
-| `docs/implementation_plan.md` | Phased milestones, module map, API cost estimates |
-| `docs/design_approaches.md` | Design decisions and rationale |
-| `CLAUDE.md` | Claude Code context guide (pipeline constraints, conventions) |
+| [`docs/spec.md`](docs/spec.md) | Audio format, file naming, label schema, and IAA protocol |
+| [`docs/implementation_plan.md`](docs/implementation_plan.md) | Phased milestones, module map, and API cost estimates |
+| [`docs/design_approaches.md`](docs/design_approaches.md) | Design decisions and rationale |
+| [`CLAUDE.md`](CLAUDE.md) | Agent context guide for pipeline constraints and conventions |
+
+## Credits
+
+Created by [Shay Palachy Affek](http://www.shaypalachy.com/) [[GitHub](https://github.com/shaypal5)]
diff --git a/docs/assets/synthbanshee-click-fixture-waveform.png b/docs/assets/synthbanshee-click-fixture-waveform.png