docs(readme): improve project presentation #115
Merged
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -1,112 +1,149 @@ | ||
| # SynthBanshee | ||
|
|
||
| **SynthBanshee** is a config-driven pipeline for generating large-scale synthetic Hebrew audio | ||
| datasets. It is part of the [AVDP](https://datahack.org.il) (Audio Violence Dataset Project) | ||
| initiative run by DataHack, which produces training data for two AI safety products: | ||
| [CI](https://github.com/DataHackIL/SynthBanshee/actions/workflows/ci.yml) | ||
| [License](LICENSE) | ||
|
|
||
| - **She-Proves** — a smartphone app that passively monitors for domestic violence incidents and | ||
| preserves audio evidence for legal use | ||
| - **Elephant in the Room (הפיל שבחדר)** — a Raspberry Pi–class device in social-work offices that | ||
| alerts security when a social worker is being physically threatened | ||
| Created by [Shay Palachy Affek](http://www.shaypalachy.com/). | ||
|
|
||
| The goal of SynthBanshee is to supply AI teams with a wide, deliberate distribution of voices, | ||
| acoustic conditions, and violence types while real-data (actor recording) pipelines are built in | ||
| parallel. Synthetic-to-real gap is expected and documented. | ||
| **SynthBanshee** is a config-driven pipeline for generating synthetic Hebrew audio datasets for | ||
| DataHack's [AVDP](https://datahack.org.il) (Audio Violence Dataset Project). It turns structured | ||
| scene YAML into Hebrew dialogue, rendered speech, acoustic variants, taxonomy labels, QA reports, | ||
| and dataset packages that AI teams can use for early model development. | ||
|
|
||
| --- | ||
| The repository supports two sensitive AI-safety product contexts: **She-Proves**, focused on | ||
| domestic-violence incident detection research, and **Elephant in the Room** (`הפיל שבחדר`), focused | ||
| on threat detection in social-work offices. A synthetic-to-real gap is expected and documented; | ||
| real actor and field-data pipelines remain separate. | ||
|
|
||
| ## How it works | ||
| *(Preview image: waveform and spectrogram of a tracked 16 kHz test fixture)* | ||
|
|
||
| Every clip is produced by four sequential pipeline stages: | ||
| ## At a Glance | ||
|
|
||
| | Field | Value | | ||
| |---|---| | ||
| | Role | Synthetic Hebrew audio dataset generation for AVDP | | ||
| | Main inputs | `SceneConfig` YAML, speaker profiles, run configs, script templates | | ||
| | Main outputs | `.wav`, `.txt`, `.json`, `.jsonl`, manifests, split files, QA reports | | ||
| | Language | Hebrew (`he`, rendered with `he-IL` TTS voices) | | ||
| | Audio contract | 16 kHz, mono, 16-bit PCM, `-1.0 dBFS` peak ceiling, silence padding | | ||
| | Current status | Phase 0 and Phase 1 complete; Tier A baseline generation delivered | | ||
| | Packaged examples | 3 example scene configs, 8 run configs, 6 packaged speakers, 4 tracked WAV fixtures | | ||
|
|
||
| ## Pipeline | ||
|
|
||
| ```mermaid | ||
| flowchart LR | ||
| scene["SceneConfig YAML<br/>project, tier, speakers, intensity"] --> script["Script generator<br/>Jinja2 + LLM"] | ||
| script --> tts["TTS renderer<br/>Azure he-IL SSML"] | ||
| tts --> acoustic["Acoustic augmenter<br/>room, device, background events"] | ||
| acoustic --> labels["Label generator<br/>taxonomy + event timings"] | ||
| labels --> qa["QA and validation<br/>format, loudness, labels, splits"] | ||
| qa --> package["Dataset package<br/>audio + transcript + metadata"] | ||
| ``` | ||
| SceneConfig (YAML) | ||
| → [1] Script Generator LLM fills a Jinja2 template → dialogue JSON | ||
| → [2] TTS Renderer Azure he-IL SSML → per-speaker WAV segments | ||
| → [3] Acoustic Augmenter Room IR + device profile + noise → scene WAV | ||
| → [4] Label Generator Script structure + augmentation log → AVDP-schema JSONL | ||
| ``` | ||
|
|
||
| Output per clip: `{clip_id}.wav` + `{clip_id}.txt` (Hebrew transcript) + `{clip_id}.json` (metadata). | ||
|
|
||
| Audio spec: **16 kHz · mono · 16-bit PCM · −1.0 dBFS peak · ≥ 0.5 s silence pad**. | ||
|
|
||
| --- | ||
|
|
||
| ## Dataset tiers | ||
|
|
||
| | Tier | Description | Target (per project) | | ||
| |---|---|---| | ||
| | A | Clean TTS, no acoustic augmentation | 1,000 clips | | ||
| | B | Room simulation + device profile + background noise | 2,000 clips | | ||
| | C | Hard negatives / confusors (arguments that de-escalate, ambient sounds) | 1,000 clips | | ||
| Each generated clip is reproducible from config and seed. Tier A renders clean TTS, Tier B adds room | ||
| and device simulation, and Tier C adds hard negatives and confusors for robustness testing. | ||
|
|
||
| Two projects — **She-Proves** (3–6 min clips, apartment rooms) and **Elephant in the Room** | ||
| (1–4 min clips, clinic/welfare offices) — each receive the full tier stack. | ||
| ## Output Surface | ||
|
|
||
| --- | ||
| | Artifact | Purpose | | ||
| |---|---| | ||
| | `{clip_id}.wav` | Rendered 16 kHz mono PCM scene audio | | ||
| | `{clip_id}.txt` | Hebrew transcript for the rendered scene | | ||
| | `{clip_id}.json` | Clip metadata, taxonomy-derived `has_violence`, speakers, duration, and provenance | | ||
| | `{clip_id}.jsonl` | Time-aligned strong labels generated from script structure and augmentation logs | | ||
| | Manifest and split files | Dataset inventory, train/validation/test assignment, and batch-level provenance | | ||
| | QA reports | Format, loudness, duration, taxonomy, split hygiene, and generated-asset checks | | ||
|
|
||
| ## Label taxonomy | ||
| The preview image above is generated from a tracked 16 kHz test fixture, not from a released dataset | ||
| sample. It shows the kind of waveform and spectrogram view used for quick audio QA. | ||
|
|
||
| Labels use a three-level hierarchy. `has_violence` (in clip metadata and manifests) is a derived convenience field computed from the taxonomy — not an independent label. The taxonomy is the ground truth; `has_violence` exists for fast filtering and baseline modelling. | ||
| ## Dataset Tiers | ||
|
|
||
| 1. **Violence typology** (scene): `SV` · `IT` · `NEG` · `NEU` | ||
| 2. **Tier 1 category** (event): `PHYS` · `VERB` · `DIST` · `ACOU` · `EMOT` · `NONE` | ||
| 3. **Tier 2 subtype** (event): e.g. `PHYS_HARD` · `VERB_THREAT` · `DIST_SCREAM` | ||
| | Tier | Description | Target per project | | ||
| |---|---|---| | ||
| | A | Clean TTS with no acoustic augmentation | 1,000 clips | | ||
| | B | Room simulation, device profile, and background noise | 2,000 clips | | ||
| | C | Hard negatives and confusors, including de-escalating arguments and ambient sounds | 1,000 clips | | ||
|
|
||
| Full taxonomy: `synthbanshee/data/taxonomy.yaml`. | ||
| Both product contexts receive the full tier stack: | ||
|
|
||
| --- | ||
| - **She-Proves**: 3-6 minute apartment scenes with pre-incident, incident, and aftermath windows. | ||
| - **Elephant in the Room**: 1-4 minute clinic or welfare-office scenes for local threat detection. | ||
|
|
||
| ## Current status | ||
| ## Label Taxonomy | ||
|
|
||
| Phase 0 (pipeline foundation) and Phase 1 (full Tier A pipeline) are complete — AI teams can now use the delivered Tier A dataset to start baseline model development. | ||
| Labels use a three-level hierarchy. `has_violence` in clip metadata and manifests is a derived | ||
| convenience field computed from the taxonomy, not an independent label. | ||
|
|
||
| | Phase | Deliverable | Status | | ||
| |---|---|---| | ||
| | 0 | Single spec-compliant clip end-to-end | **Done** | | ||
| | 1 | Script templates + multi-speaker TTS + LLM generator + batch generation + QA suite | **Done** | | ||
| | 2 | 1,000–1,500 Tier B clips/project | Planned | | ||
| | 3 | 4,000 clips/project, all tiers | Planned | | ||
| | Level | Examples | | ||
| |---|---| | ||
| | Violence typology | `SV`, `IT`, `NEG`, `NEU` | | ||
| | Tier 1 event category | `PHYS`, `VERB`, `DIST`, `ACOU`, `EMOT`, `NONE` | | ||
| | Tier 2 event subtype | `PHYS_HARD`, `VERB_THREAT`, `DIST_SCREAM`, `ACOU_BREAK` | | ||
|
|
||
| --- | ||
| Full taxonomy: [`synthbanshee/data/taxonomy.yaml`](synthbanshee/data/taxonomy.yaml). | ||
|
|
||
| ## Quick start | ||
| ## Quick Start | ||
|
|
||
| ```bash | ||
| # Install (requires Python ≥ 3.11 and uv) | ||
| # Install locally with Python 3.11+ and uv. | ||
| uv pip install -e . | ||
|
|
||
| # Generate a single clip from a scene config | ||
| synthbanshee generate --config configs/scenes/test_scene_001.yaml | ||
| # Generate a single clip from a scene config. | ||
| synthbanshee generate --config configs/examples/scene_she_proves_IT_example.yaml | ||
|
|
||
| # Generate a full Tier A batch from a run config | ||
| # Generate a Tier A She-Proves batch from a run config. | ||
| synthbanshee generate-batch \ | ||
| --run-config configs/run_configs/tier_a_500_she_proves.yaml \ | ||
| --output-dir data/he | ||
|
|
||
| # Run the automated QA suite on a dataset directory | ||
| # Run automated QA on a dataset directory. | ||
| synthbanshee qa-report data/he | ||
|
|
||
| # Validate an existing clip | ||
| # Validate an existing clip. | ||
| synthbanshee validate data/he/clip_001.wav | ||
| ``` | ||
|
|
||
| API credentials required (set in environment or `.env`): | ||
| Live generation requires provider credentials in your environment or local `.env` file: | ||
|
|
||
| ``` | ||
| ```bash | ||
| AZURE_TTS_KEY=... | ||
| AZURE_TTS_REGION=... | ||
| ANTHROPIC_API_KEY=... # for script generation | ||
| ANTHROPIC_API_KEY=... | ||
| ``` | ||
|
|
||
| --- | ||
| Do not commit provider keys, generated private datasets, or real-data artifacts. | ||
|
|
||
| ## Safety and Data Boundary | ||
|
|
||
| ## Docs | ||
| SynthBanshee is a synthetic-data generator for research and model-development workflows. Generated | ||
| clips are not evidence, user recordings, or a substitute for real validation data. They are designed | ||
| to widen scenario coverage while actor-recording and real-data pipelines are handled separately. | ||
|
|
||
| Sensitive context should remain explicit in configs and docs, but public-facing materials should not | ||
| overstate model readiness, legal usefulness, or field performance. | ||
|
|
||
| ## Current Status | ||
|
|
||
| | Phase | Deliverable | Status | | ||
| |---|---|---| | ||
| | 0 | Single spec-compliant clip end to end | Done | | ||
| | 1 | Script templates, multi-speaker TTS, LLM generation, batch generation, QA suite | Done | | ||
| | 2 | 1,000-1,500 Tier B clips per project | Planned | | ||
| | 3 | 4,000 clips per project across all tiers | Planned | | ||
|
|
||
| ## Documentation | ||
|
|
||
| | Document | Contents | | ||
| |---|---| | ||
| | `docs/spec.md` | Audio format, file naming, label schema, IAA protocol | | ||
| | `docs/implementation_plan.md` | Phased milestones, module map, API cost estimates | | ||
| | `docs/design_approaches.md` | Design decisions and rationale | | ||
| | `CLAUDE.md` | Claude Code context guide (pipeline constraints, conventions) | | ||
| | [`docs/spec.md`](docs/spec.md) | Audio format, file naming, label schema, and IAA protocol | | ||
| | [`docs/implementation_plan.md`](docs/implementation_plan.md) | Phased milestones, module map, and API cost estimates | | ||
| | [`docs/design_approaches.md`](docs/design_approaches.md) | Design decisions and rationale | | ||
| | [`CLAUDE.md`](CLAUDE.md) | Agent context guide for pipeline constraints and conventions | | ||
|
|
||
| ## Credits | ||
|
|
||
| Created by [Shay Palachy Affek](http://www.shaypalachy.com/) [[GitHub](https://github.com/shaypal5)] | ||
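
The audio contract stated in this README (16 kHz, mono, 16-bit PCM, `-1.0 dBFS` peak ceiling, at least 0.5 s of silence padding) can be spot-checked with the Python standard library alone. The sketch below is an illustration, not the project's `synthbanshee validate` implementation: the function name `check_audio_contract` and the `SILENCE_THRESHOLD` value are assumptions made for this example, and the real validator may define silence and peak level differently.

```python
import array
import math
import sys
import wave

SAMPLE_RATE_HZ = 16_000
PEAK_CEILING_DBFS = -1.0
MIN_PAD_SECONDS = 0.5
# Assumption for this sketch: samples with |value| <= 32 count as silence.
SILENCE_THRESHOLD = 32


def check_audio_contract(path):
    """Return a list of contract violations for a WAV file (empty if none)."""
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        channels = wav.getnchannels()
        width = wav.getsampwidth()
        frames = wav.readframes(wav.getnframes())

    problems = []
    if rate != SAMPLE_RATE_HZ:
        problems.append(f"sample rate is {rate} Hz, expected {SAMPLE_RATE_HZ}")
    if channels != 1:
        problems.append(f"found {channels} channels, expected mono")
    if width != 2:
        problems.append(f"sample width is {width * 8} bits, expected 16")
    if problems:
        return problems  # the sample-level checks below assume 16-bit mono

    samples = array.array("h", frames)
    if sys.byteorder == "big":
        samples.byteswap()  # WAV PCM data is little-endian
    if not samples:
        return ["clip contains no samples"]

    # Peak level relative to 16-bit full scale (32768).
    peak = max(abs(s) for s in samples)
    if peak > 0:
        peak_dbfs = 20 * math.log10(peak / 32768)
        if peak_dbfs > PEAK_CEILING_DBFS:
            problems.append(
                f"peak is {peak_dbfs:.2f} dBFS, ceiling is {PEAK_CEILING_DBFS} dBFS"
            )

    # Both edges of the clip must be quiet for at least MIN_PAD_SECONDS.
    pad = int(MIN_PAD_SECONDS * rate)
    for name, edge in (("leading", samples[:pad]), ("trailing", samples[-pad:])):
        if len(edge) < pad or max(abs(s) for s in edge) > SILENCE_THRESHOLD:
            problems.append(f"{name} silence pad is shorter than {MIN_PAD_SECONDS} s")

    return problems
```

Calling `check_audio_contract("data/he/clip_001.wav")` on a compliant clip would return an empty list; any non-empty result names the part of the contract that failed. Returning a list instead of raising on the first failure keeps the check usable inside a batch QA loop.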