Synthetic military radio transcript pair generator for ASR fine-tuning. Produces paired clean/noisy transcripts that simulate Wav2Vec2 CTC greedy decoder output from degraded HF/VHF radio audio.
- Overview
- Features
- Installation
- Quick Start
- Configuration
- Seed Taxonomy
- Radio Formats
- Generation Pipeline
- QA Pipeline
- Output Files
- Project Structure
- Development
Training an ASR system to transcribe military radio communications requires data that reflects the real acoustic and linguistic conditions of tactical radio — terse proword-driven speech, phonetic spelling, operator stress, degraded signal, and the characteristic error patterns of a CTC decoder operating without a language model. Clean transcripts of that data are rare and often sensitive.
CallSignForge generates synthetic transcript pairs at scale. Each pair consists of a clean transcript (ground truth) and a noisy transcript (simulated CTC decoder output), structured to train a sequence-to-sequence correction model: given noisy CTC output, predict the clean original.
- 11 tactical radio formats — SITREP, SPOTREP, MEDEVAC 9-line, Call for Fire (grid/polar/shift), Adjust Fire, Fire for Effect, Suppression, Immediate Suppression, Radio Check
- 16-axis seed taxonomy — independently sampled scenario, speaker, and linguistic axes produce a combinatorially large seed space
- Weighted format and tier sampling — configurable probability weights per format and per rarity tier (common / moderate / edge case)
- Separate clean and noisy LLM calls — clean generation and CTC noise simulation use independent models, prompts, and temperatures
- Batched generation — N records per LLM call with `===RECORD N===` delimiters; configurable batch sizes reduce API cost
- Concurrent sub-batch execution — clean sub-batches run concurrently via `ThreadPoolExecutor`, then noisy sub-batches run concurrently after a barrier ensuring all clean transcripts are available
- Exponential backoff retry — transient API errors (rate limit, connection, timeout) are retried up to 3 times with jitter
- Checkpoint and resume — per-batch JSONL checkpoint; crashed runs resume from the last completed batch
- In-stride diversity feedback — embedding diversity is measured per batch and generation temperature is adjusted automatically
- Three-stage QA pipeline — transcript similarity gating, LLM-as-judge coverage scoring, and embedding-based deduplication
- Four output artefacts — raw JSONL, clean JSONL, SFT chat-format JSONL, and a QA summary JSON per run
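
The retry behaviour listed above can be sketched as follows. This is a minimal illustration, not the project's implementation: `TransientAPIError`, `call_with_retry`, and the `base_delay` value are hypothetical names and defaults chosen for the example.

```python
import random
import time


class TransientAPIError(Exception):
    """Hypothetical stand-in for rate-limit, connection, and timeout errors."""


def call_with_retry(fn, max_retries=3, base_delay=1.0, sleep=time.sleep):
    """Call fn(); on a transient error, retry up to max_retries times."""
    for attempt in range(max_retries + 1):
        try:
            return fn()
        except TransientAPIError:
            if attempt == max_retries:
                raise  # retries exhausted: surface the last error
            # The delay doubles on each attempt; random jitter keeps
            # concurrent workers from retrying in lockstep.
            delay = base_delay * (2 ** attempt) + random.uniform(0, base_delay)
            sleep(delay)
```

Passing `sleep` as a parameter keeps the helper testable without real waiting.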
Requires Python 3.10+.

    git clone <repo>
    cd callSignForge
    python -m venv .venv
    source .venv/bin/activate
    pip install -e .

For development and testing:

    pip install -e ".[dev]"

Set your OpenRouter API key:

    export OPENROUTER_API_KEY=<your-key>

Preview a few records before committing to a full run:

    python -m callsignforge preview --config pipeline.yaml --count 5

Run full generation with QA:

    python -m callsignforge generate --config pipeline.yaml

Resume a partial run from a checkpoint:

    python -m callsignforge generate --config pipeline.yaml --resume output/checkpoint.jsonl

Re-run QA on an existing raw output file (useful for threshold tuning without regenerating):

    python -m callsignforge qa --config pipeline.yaml --input output/<run_id>_raw.jsonl

All pipeline behaviour is controlled by `pipeline.yaml`. The schema is validated by Pydantic on load; invalid configurations raise a `ValidationError` at startup.

    llm:
      provider: openrouter
      api_key_env: OPENROUTER_API_KEY
      generation:
        model: meta-llama/llama-4-maverick:nitro
        base_temperature: 0.8          # starting generation temperature
        max_temperature: 1.3           # ceiling for diversity feedback adjustment
        temperature_step: 0.1          # per-batch temperature increment/decrement
      noise:
        model: anthropic/claude-3.5-haiku
        temperature: 1.1               # higher temp for varied CTC error patterns
      judge:
        model: meta-llama/llama-4-maverick:nitro
    generation:
      total_records: 1200
      batch_size: 100                  # records per outer batch (checkpointed together)
      llm_batch_size: 10               # records per individual LLM call
      max_concurrent_batches: 10       # concurrent sub-batch workers per phase
    sampling:
      weights:
        common: 0.50
        moderate: 0.30
        edge_case: 0.20
      format_weights:                  # must sum to 1.0
        radio_check: 0.08
        sitrep: 0.18
        spotrep: 0.18
        medevac_9line: 0.10
        call_for_fire_grid: 0.10
        call_for_fire_polar: 0.06
        call_for_fire_shift: 0.06
        adjust_fire: 0.08
        fire_for_effect: 0.08
        suppression: 0.06
        immediate_suppression: 0.02
    qa:
      diversity_threshold: 0.45
      coverage_threshold: 0.75
      embedding_similarity_threshold: 0.92
      run_similarity_check: true
      similarity_range:
        light: { min: 0.85, max: 0.99 }
        moderate: { min: 0.60, max: 0.92 }
        heavy: { min: 0.35, max: 0.78 }
      similarity_action: flag          # flag | drop
      judge_batch_size: 10
    output:
      dir: output/
      sft_format: chat

| Parameter | Effect |
|---|---|
| `llm_batch_size` | Larger = fewer API calls, but harder for the model to maintain per-record consistency across a prompt |
| `max_concurrent_batches` | Higher = faster wall-clock time; limited by API rate limits |
| `base_temperature` | Higher = more diverse output; too high causes format violations |
| `diversity_threshold` | Lower = temperature feedback triggers less often |
| `embedding_similarity_threshold` | Lower = more aggressive deduplication |
| `similarity_action: drop` | Removes out-of-range pairs rather than flagging; reduces dataset size |

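
The "must sum to 1.0" constraint on `format_weights` can be checked with a few lines of standard-library Python. The project validates its config with Pydantic; the sketch below is only an illustration of the invariant, and `validate_format_weights` and its tolerance are hypothetical.

```python
import math

# Values copied from the sample pipeline.yaml above.
FORMAT_WEIGHTS = {
    "radio_check": 0.08,
    "sitrep": 0.18,
    "spotrep": 0.18,
    "medevac_9line": 0.10,
    "call_for_fire_grid": 0.10,
    "call_for_fire_polar": 0.06,
    "call_for_fire_shift": 0.06,
    "adjust_fire": 0.08,
    "fire_for_effect": 0.08,
    "suppression": 0.06,
    "immediate_suppression": 0.02,
}


def validate_format_weights(weights, tol=1e-6):
    """Raise ValueError unless all weights are non-negative and sum to 1.0."""
    negative = [name for name, w in weights.items() if w < 0]
    if negative:
        raise ValueError(f"negative format weights: {negative}")
    total = math.fsum(weights.values())
    # Compare with a tolerance: decimal fractions rarely sum exactly in floats.
    if not math.isclose(total, 1.0, abs_tol=tol):
        raise ValueError(f"format_weights sum to {total:.4f}, expected 1.0")
    return weights


validate_format_weights(FORMAT_WEIGHTS)
```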
Each record is generated from a seed drawn by independently sampling 16 axes. No cross-axis correlation is enforced — the full combinatorial space is available.

| Axis | Values |
|---|---|
| `report_format` | 11 formats (see below) |
| `domain` | infantry, armor, aviation, artillery, logistics, medevac, eod, sigint, special_operations, engineer |
| `echelon` | team, squad, platoon, company, battalion |
| `scenario_archetype` | deliberate_attack, hasty_attack, defense, retrograde, patrol, troops_in_contact, casevac_medevac, resupply, reconnaissance, breach, air_assault, relief_in_place |
| `environment` | urban_dense, urban_suburban, open_desert, mountainous, jungle, arctic, littoral, mixed_terrain |
| `operational_tempo` | steady_state, hasty, troops_in_contact, post_engagement |
| `time_of_day` | day, night, dawn_dusk |
| `comms_posture` | normal, degraded, relay_required, emcon, encrypted_burst |
| `speaker_experience` | novice_rto, experienced_nco, officer, foreign_liaison, stressed_novice |
| `stress_level` | calm, elevated, high, extreme |

| Axis | Values |
|---|---|
| `verbosity` | terse, standard, verbose |
| `formality` | strict_prowords, informal_prowords, mixed |
| `unit_dialect` | us_army_standard, usmc, nato_allied, joint_task_force |
| `error_injection` | none, self_correction, stepped_on, phonetic_expansion, partial_transmission |
| `acknowledgment_style` | roger, wilco, good_copy, say_again, break_break, standby, wait_out |
| `grammar_degradation` | none, light, moderate, heavy |

`noise_level` is sampled as an additional axis using the tier weights: `moderate` noise maps to the `common` tier (50%), `light` noise to the `moderate` tier (30%), and `heavy` noise to the `edge_case` tier (20%). The resulting distribution reflects real-world ASR conditions, where heavy signal degradation is less frequent than moderate.
Values designated as edge cases receive the edge_case probability budget split equally among them. The remaining probability mass is split proportionally between the most common value and the rest. Axes with no designated edge cases receive uniform weight across all values.
Edge case values by axis:

| Axis | Edge case values |
|---|---|
| `comms_posture` | emcon, encrypted_burst, relay_required |
| `speaker_experience` | foreign_liaison, stressed_novice |
| `stress_level` | extreme |
| `error_injection` | stepped_on, partial_transmission |
| `grammar_degradation` | heavy |
| `scenario_archetype` | breach, air_assault, relief_in_place |

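
One possible reading of the weighting rule is sketched below. It rests on assumptions that the text does not confirm: the first listed non-edge value is treated as the "most common" and receives the whole `common` weight, and the remaining non-edge values share the `moderate` weight equally. `axis_weights` is a hypothetical helper, not the function in `sampler.py`.

```python
import random

TIER_WEIGHTS = {"common": 0.50, "moderate": 0.30, "edge_case": 0.20}


def axis_weights(values, edge_values, tiers=TIER_WEIGHTS):
    """Map each axis value to a sampling probability.

    Assumes at least one value on the axis is not an edge case.
    """
    edge = [v for v in values if v in edge_values]
    if not edge:
        # No designated edge cases: uniform weight across all values.
        return {v: 1.0 / len(values) for v in values}
    regular = [v for v in values if v not in edge_values]
    # Edge cases split the edge_case budget equally.
    weights = {v: tiers["edge_case"] / len(edge) for v in edge}
    first, rest = regular[0], regular[1:]
    if rest:
        weights[first] = tiers["common"]  # assumed "most common" value
        for v in rest:
            weights[v] = tiers["moderate"] / len(rest)
    else:
        weights[first] = tiers["common"] + tiers["moderate"]
    return weights


comms = ["normal", "degraded", "relay_required", "emcon", "encrypted_burst"]
w = axis_weights(comms, {"emcon", "encrypted_burst", "relay_required"})
sample = random.choices(list(w), weights=list(w.values()), k=1)[0]
```

Under these assumptions `normal` is drawn half the time, `degraded` 30% of the time, and the three edge-case postures share the remaining 20%.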

| Format ID | Description |
|---|---|
| `radio_check` | Signal quality check between two stations |
| `sitrep` | Periodic unit status update (6-line) |
| `spotrep` | Enemy contact spot report |
| `medevac_9line` | MEDEVAC request (9-line) |
| `call_for_fire_grid` | Fire mission request using MGRS grid |
| `call_for_fire_polar` | Fire mission request using polar (direction + distance) |
| `call_for_fire_shift` | Fire mission shift from a known point |
| `adjust_fire` | Fire adjustment from previous rounds |
| `fire_for_effect` | Command to fire for effect |
| `suppression` | Suppression fire request |
| `immediate_suppression` | Emergency suppression with no adjustment |

Each format definition includes a purpose statement, line-by-line template, required fields list, prowords, notes on number formatting and MGRS grid precision, and at least one annotated clean example. These definitions are injected into both the generation prompt and the coverage judge prompt at runtime.

    sample_seeds()
        │
        ▼
    [Concurrent clean sub-batches]     ← ThreadPoolExecutor
      _generate_clean_batch() × N      ← one LLM call per sub-batch
        │
        ▼  (barrier: noisy depends on clean)
    [Concurrent noisy sub-batches]     ← ThreadPoolExecutor
      _generate_noisy_batch() × N      ← one LLM call per sub-batch
        │
        ▼
    compute_batch_embedding_diversity()
    TemperatureFeedback.update()
    _write_checkpoint()
        │
        ▼  (repeat until total_records reached)
    _run_qa_pass()
        │
        ├── run_transcript_similarity()
        ├── coverage_judge()
        └── deduplicate_by_embedding()
        │
        ▼
    write_outputs()

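
The two-phase structure of one outer batch (concurrent clean calls, a barrier, then concurrent noisy calls) might look roughly like this. The function names and signatures are hypothetical; `generate_clean` and `generate_noisy` stand in for the real LLM calls and each handle one sub-batch.

```python
from concurrent.futures import ThreadPoolExecutor


def chunk(items, size):
    """Split items into consecutive sub-batches of at most `size`."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def run_outer_batch(seeds, generate_clean, generate_noisy,
                    llm_batch_size=10, max_workers=10):
    sub_batches = chunk(seeds, llm_batch_size)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Phase 1: list() drains the map iterator, so every clean
        # sub-batch has finished before phase 2 starts (the barrier).
        clean_batches = list(pool.map(generate_clean, sub_batches))
        # Phase 2: noisy generation consumes the finished clean transcripts.
        noisy_batches = list(pool.map(generate_noisy, clean_batches))
    records = []
    for seed_sb, clean_sb, noisy_sb in zip(sub_batches, clean_batches, noisy_batches):
        for seed, clean, noisy in zip(seed_sb, clean_sb, noisy_sb):
            records.append({"seed": seed,
                            "clean_transcript": clean,
                            "noisy_transcript": noisy})
    return records
```

`pool.map` returns results in input order, so seeds, clean transcripts, and noisy transcripts stay aligned by position.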
Without explicit zone instruction, models exhibit a strong learned bias toward repeated grid zones like 38S MB (Afghanistan/Middle East), which appears frequently in publicly available military doctrine and training data. This produces near-identical grid prefixes across records, inflating embedding similarity and causing excessive deduplication loss.
Each batch prompt includes a randomly chosen MGRS zone, 100km square, digit precision (6/8/10), and a matching NATO phonetic spelling example rendered from the same zone. Both the phonetic example in the formatting rules and the batch-level zone instruction are derived from the same random draw — ensuring no hardcoded zone anchor remains in the prompt for the model to fall back on.
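
A sketch of such a random draw and its phonetic rendering is shown below. `draw_grid` and `spell` are hypothetical helpers; the draw only respects the MGRS letter alphabets (no I or O), so a generated zone and 100km square combination is not guaranteed to exist on a real map, and "NINER" is used as the spoken form of 9.

```python
import random
import string

NATO = dict(zip(string.ascii_uppercase, (
    "ALFA BRAVO CHARLIE DELTA ECHO FOXTROT GOLF HOTEL INDIA JULIETT KILO "
    "LIMA MIKE NOVEMBER OSCAR PAPA QUEBEC ROMEO SIERRA TANGO UNIFORM VICTOR "
    "WHISKEY XRAY YANKEE ZULU").split()))
DIGIT_WORDS = dict(zip("0123456789",
                       "ZERO ONE TWO THREE FOUR FIVE SIX SEVEN EIGHT NINER".split()))

BANDS = "CDEFGHJKLMNPQRSTUVWX"         # latitude bands (no I, O)
COLUMNS = "ABCDEFGHJKLMNPQRSTUVWXYZ"   # 100km square column letters
ROWS = "ABCDEFGHJKLMNPQRSTUV"          # 100km square row letters


def draw_grid(rng):
    """Draw a random zone, 100km square, and 6/8/10-digit coordinate."""
    zone = rng.randint(1, 60)
    band = rng.choice(BANDS)
    square = rng.choice(COLUMNS) + rng.choice(ROWS)
    half = rng.choice([6, 8, 10]) // 2
    easting = "".join(rng.choice(string.digits) for _ in range(half))
    northing = "".join(rng.choice(string.digits) for _ in range(half))
    return f"{zone}{band} {square} {easting} {northing}"


def spell(grid):
    """Render a grid string as space-separated phonetic words."""
    words = []
    for ch in grid:
        if ch.isdigit():
            words.append(DIGIT_WORDS[ch])
        elif ch.isalpha():
            words.append(NATO[ch])
    return " ".join(words)


grid = draw_grid(random.Random())
example = spell(grid)  # same draw feeds both the zone instruction and the example
```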
Noisy transcripts simulate Wav2Vec2 CTC greedy decoder output — a single flat line of uppercase space-separated tokens with no punctuation. Each record includes a noise_level and a quantitative TARGET specifying the required error mix:

| Noise level | Target |
|---|---|
| `light` | 1–3 substitutions + optional 1 CTC artifact; all other tokens present |
| `moderate` | 4–9 substitutions + delete 3–5 tokens + 1–2 CTC artifacts |
| `heavy` | 10–18 substitutions + delete 6–10 tokens + 2–4 CTC artifacts |

Error and artifact types used across the noise levels:
- SUBSTITUTION — token replaced by a phonetically confusable output (acoustic confusion, not spelling similarity)
- DELETION — token dropped entirely
- MERGE — word boundary removed between two adjacent short tokens
- REPETITION — token emitted twice at a CTC frame boundary
- TRAILING LOSS — final proword clipped by PTT release
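
To make the error types concrete, the sketch below applies each one to a token list. In the pipeline these errors are produced by the noise LLM from the prompt, not by rule-based code; the helper names here are hypothetical and exist only to show what each operation does to the token stream.

```python
def to_ctc_tokens(text):
    """Uppercase, drop punctuation, and split into space-separated tokens."""
    kept = "".join(ch for ch in text.upper() if ch.isalnum() or ch.isspace())
    return kept.split()


def substitute(tokens, i, replacement):
    """SUBSTITUTION: replace token i with a phonetically confusable token."""
    return tokens[:i] + [replacement] + tokens[i + 1:]


def delete(tokens, i):
    """DELETION: drop token i entirely."""
    return tokens[:i] + tokens[i + 1:]


def merge(tokens, i):
    """MERGE: remove the word boundary between tokens i and i+1."""
    return tokens[:i] + [tokens[i] + tokens[i + 1]] + tokens[i + 2:]


def repeat(tokens, i):
    """REPETITION: emit token i twice."""
    return tokens[:i + 1] + [tokens[i]] + tokens[i + 1:]


def clip_trailing(tokens):
    """TRAILING LOSS: the final proword is clipped by PTT release."""
    return tokens[:-1]


tokens = to_ctc_tokens("Apache Two One, this is Bravo Three, over.")
noisy = substitute(tokens, 1, "TOO")
noisy = delete(noisy, 4)
noisy = clip_trailing(noisy)
noisy_line = " ".join(noisy)
```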
Three sequential checks are applied to all generated records. An in-stride diversity check also runs between generation batches.
The transcript similarity check measures the cosine similarity between embeddings of the clean and noisy transcript in each pair. This validates that noise injection is calibrated to the declared `noise_level`: similarity that is too high indicates insufficient noise, and similarity that is too low indicates over-corruption that destroys the signal.
How it works: All clean and noisy transcripts are encoded in a single batch call. Per-pair similarity is computed as the dot product of the L2-normalized embedding vectors. This reuses the embedding model already loaded for deduplication, adding no additional dependency or startup overhead.
Each noise level has a configured similarity range (qa.similarity_range). Records outside their expected range receive qa.similarity_flagged = true. When similarity_action: drop is set, out-of-range records are removed before downstream steps; the default flag retains all records for inspection.
The score is stored per record as qa.transcript_similarity (float, 0–1).
Expected ranges by noise level:

| Noise level | Min | Max | Rationale |
|---|---|---|---|
| `light` | 0.85 | 0.99 | 1–3 substitutions should preserve most of the embedding |
| `moderate` | 0.60 | 0.92 | Partial deletions and substitutions produce moderate drift |
| `heavy` | 0.35 | 0.78 | Structural degradation should yield clearly lower similarity |

Records above the maximum for their noise level are insufficiently noisy. Records below the minimum are over-corrupted to the point where the clean/noisy relationship is lost.
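
The gating logic can be sketched with plain Python lists standing in for embedding vectors. The ranges are the defaults from `qa.similarity_range`; the function names are hypothetical, and the real check operates on batched model embeddings.

```python
import math

SIMILARITY_RANGE = {
    "light": (0.85, 0.99),
    "moderate": (0.60, 0.92),
    "heavy": (0.35, 0.78),
}


def l2_normalize(vec):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def cosine_similarity(a, b):
    """Dot product of the L2-normalized vectors."""
    a, b = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a, b))


def similarity_flagged(similarity, noise_level, ranges=SIMILARITY_RANGE):
    """True when the score falls outside the range for its noise level."""
    low, high = ranges[noise_level]
    return not (low <= similarity <= high)


def apply_similarity_action(records, action="flag"):
    """Annotate records; with action='drop', remove out-of-range ones."""
    out = []
    for rec in records:
        flagged = similarity_flagged(rec["qa"]["transcript_similarity"],
                                     rec["seed"]["noise_level"])
        rec["qa"]["similarity_flagged"] = flagged
        if action == "drop" and flagged:
            continue
        out.append(rec)
    return out
```

For example, a `moderate` pair scoring 0.741 is in range, while one scoring 0.95 is flagged as insufficiently noisy.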
An LLM judge scores each clean transcript on how completely it covers the required fields for its declared format. Up to judge_batch_size records are scored per API call to minimise cost.
How it works: Each record's prompt block includes the format ID, required field list, full template, and a clean example from the format definition. This gives the judge explicit format context per record rather than relying on the model's internal knowledge of format structure. Scores are returned as one decimal number per line, matched to records by position.
Scores range from 0.0 to 1.0:
| Score | Meaning |
|---|---|
| 1.0 | All required fields clearly present and complete |
| 0.75 | Most fields present, minor gaps |
| 0.5 | About half of required fields present or recognizable |
| 0.25 | A few fields present, major gaps |
| 0.0 | Required fields mostly missing or unrecognizable |
Records below qa.coverage_threshold (default 0.75) are flagged (qa.coverage_flagged = true). Coverage-flagged records are retained in the raw output but excluded from the clean set. On any judge API failure, affected records default to score 1.0 and are not flagged.
The score is stored per record as qa.coverage_score. When no API key is present, all records receive coverage_score: null and coverage_flagged: false.
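
Positional score matching could be implemented along these lines. The helper names are hypothetical, and two behaviours are assumptions not stated above: non-numeric lines in the judge reply are skipped, and a score-count mismatch is treated like an API failure (all affected records default to 1.0).

```python
def parse_judge_scores(response_text, n_records, default=1.0):
    """Parse one decimal score per line, matched to records by position."""
    scores = []
    for line in response_text.splitlines():
        line = line.strip()
        if not line:
            continue
        try:
            value = float(line)
        except ValueError:
            continue  # assumption: ignore stray commentary lines
        scores.append(min(1.0, max(0.0, value)))  # clamp to [0.0, 1.0]
    if len(scores) != n_records:
        # Assumption: positions cannot be trusted on a count mismatch,
        # so fall back to the same default used for API failures.
        return [default] * n_records
    return scores


def coverage_flags(scores, threshold=0.75):
    """A record is flagged when its score is below the threshold."""
    return [score < threshold for score in scores]


scores = parse_judge_scores("1.0\n0.5\n0.75\n", n_records=3)
flags = coverage_flags(scores)
```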
Near-duplicate records are removed from the clean output set by comparing all-MiniLM-L6-v2 embeddings of the clean transcript across the full dataset. The first occurrence of each record is kept; any subsequent record whose cosine similarity to any already-kept record exceeds qa.embedding_similarity_threshold (default 0.92) is dropped.
How it works: All clean transcripts are encoded in a single batch call. A greedy first-seen pass compares each candidate against the matrix of already-kept embeddings using normalised dot products. The threshold of 0.92 corresponds to transcripts that differ only in minor surface variation while describing essentially the same scenario and content combination.
Deduplication runs on the full post-generation dataset rather than per-batch, so it catches cross-batch duplicates that batch-level diversity checks would miss.
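
The greedy first-seen pass can be sketched as below, using plain lists in place of the model's embedding matrix. `deduplicate` is a hypothetical name and the input vectors are assumed to be L2-normalised already.

```python
import math


def unit(vec):
    """Return vec scaled to unit length (helper for the example)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def deduplicate(embeddings, threshold=0.92):
    """Return indices of kept records, in first-seen order.

    A candidate is dropped when its cosine similarity to any
    already-kept record exceeds the threshold.
    """
    kept = []
    for i, candidate in enumerate(embeddings):
        if any(dot(candidate, embeddings[j]) > threshold for j in kept):
            continue
        kept.append(i)
    return kept


vectors = [unit([1.0, 0.0]), unit([0.99, 0.10]), unit([0.0, 1.0])]
kept_indices = deduplicate(vectors)
```

The pass is order-dependent by design: the second vector is dropped because it nearly matches the first, which was seen earlier.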
After each outer batch, the mean pairwise cosine distance across all clean transcripts in that batch is computed using the same all-MiniLM-L6-v2 model. This diversity score drives automatic temperature adjustment:
- If diversity falls below `qa.diversity_threshold`: generation temperature is increased by `generation.temperature_step`, up to `generation.max_temperature`
- If diversity meets or exceeds the threshold for two or more consecutive batches: temperature is stepped back down toward `generation.base_temperature`
This prevents the generator from converging on repetitive output during long runs without requiring manual temperature tuning between batches. Per-batch diversity scores and temperature values are recorded in the QA summary batch_log.
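
A minimal sketch of the diversity score and the feedback rule follows. The class borrows the `TemperatureFeedback` name from the pipeline diagram, but its constructor, attributes, and exact step-down behaviour (one step per batch once the streak reaches two) are assumptions made for illustration.

```python
from itertools import combinations


def mean_pairwise_cosine_distance(embeddings):
    """Mean of (1 - cosine similarity) over all pairs of unit vectors."""
    pairs = list(combinations(embeddings, 2))
    if not pairs:
        return 0.0
    total = sum(1.0 - sum(x * y for x, y in zip(a, b)) for a, b in pairs)
    return total / len(pairs)


class TemperatureFeedback:
    def __init__(self, base=0.8, maximum=1.3, step=0.1, threshold=0.45):
        self.base = base
        self.maximum = maximum
        self.step = step
        self.threshold = threshold
        self.temperature = base
        self.ok_streak = 0  # consecutive batches at or above the threshold

    def update(self, diversity):
        """Adjust and return the temperature for the next batch."""
        if diversity < self.threshold:
            self.ok_streak = 0
            self.temperature = min(self.maximum, self.temperature + self.step)
        else:
            self.ok_streak += 1
            if self.ok_streak >= 2:
                self.temperature = max(self.base, self.temperature - self.step)
        # Round to avoid accumulating floating-point drift across batches.
        self.temperature = round(self.temperature, 4)
        return self.temperature
```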
Each run produces four timestamped files in output/:

| File | Contents |
|---|---|
| `<run_id>_raw.jsonl` | All generated records including flagged ones |
| `<run_id>_clean.jsonl` | Deduplicated, post-QA records only |
| `<run_id>_sft.jsonl` | Chat-format SFT pairs (user=noisy, assistant=clean) |
| `<run_id>_qa_summary.json` | Counts, per-format/noise breakdowns, similarity stats, coverage stats, batch log |

A checkpoint.jsonl is also written to output/ during generation and can be passed to --resume to continue a partial run.
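
Checkpoint writing and resuming could work roughly as follows. This is a sketch under assumptions: the checkpoint is taken to hold one JSON record per line, a truncated final line from a crash is discarded, and records from an incomplete trailing batch are regenerated. `append_checkpoint` and `load_checkpoint` are hypothetical names.

```python
import json
from pathlib import Path


def append_checkpoint(path, records):
    """Append one completed batch to the checkpoint, one record per line."""
    with Path(path).open("a", encoding="utf-8") as fh:
        for record in records:
            fh.write(json.dumps(record) + "\n")


def load_checkpoint(path, batch_size):
    """Return (records from completed batches, number of completed batches)."""
    checkpoint = Path(path)
    if not checkpoint.exists():
        return [], 0
    records = []
    with checkpoint.open(encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if not line:
                continue
            try:
                records.append(json.loads(line))
            except json.JSONDecodeError:
                break  # truncated line from an interrupted write
    completed = len(records) // batch_size
    # Keep only whole batches; a partial batch is regenerated on resume.
    return records[:completed * batch_size], completed
```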

Example record in `<run_id>_raw.jsonl`:

    {
      "id": "<uuid4>",
      "generated_at": "<ISO 8601 UTC>",
      "seed": {
        "report_format": "sitrep",
        "domain": "infantry",
        "echelon": "platoon",
        "noise_level": "moderate",
        "scenario_archetype": "patrol",
        "environment": "urban_dense",
        "operational_tempo": "hasty",
        "time_of_day": "night",
        "comms_posture": "normal",
        "speaker_experience": "experienced_nco",
        "stress_level": "elevated",
        "verbosity": "standard",
        "formality": "strict_prowords",
        "unit_dialect": "us_army_standard",
        "error_injection": "none",
        "acknowledgment_style": "roger",
        "grammar_degradation": "none"
      },
      "clean_transcript": "APACHE TWO ONE, THIS IS BRAVO THREE...",
      "noisy_transcript": "APACHE TOO ONE THIS BRAV FREE...",
      "qa": {
        "batch_embedding_diversity": 0.512,
        "transcript_similarity": 0.741,
        "similarity_flagged": false,
        "coverage_score": 0.92,
        "coverage_flagged": false
      }
    }

Example record in `<run_id>_sft.jsonl`:

    {
      "messages": [
        { "role": "user", "content": "<noisy transcript>" },
        { "role": "assistant", "content": "<clean transcript>" }
      ]
    }

Example `<run_id>_qa_summary.json`:

    {
      "run_id": "callsignforge_20260309_143022",
      "counts": {
        "generated": 1200,
        "deduped_removed": 14,
        "similarity_flagged": 37,
        "coverage_flagged": 8,
        "final_clean": 1186
      },
      "by_format": { "sitrep": 143, "spotrep": 121, "...": 0 },
      "by_noise_level": { "light": 358, "moderate": 600, "heavy": 228 },
      "similarity_stats": {
        "moderate": { "mean": 0.71, "min": 0.52, "max": 0.88, "flagged": 12 }
      },
      "coverage_stats": { "mean": 0.89, "min": 0.61, "max": 1.0, "flagged": 8 },
      "batch_log": [
        {
          "batch_num": 1,
          "n_records": 100,
          "diversity": 0.51,
          "temperature_used": 0.8,
          "temperature_after": 0.8
        }
      ]
    }

Project structure:

    callsignforge/
    ├── taxonomy.py      # All axis names and values; edge case designations
    ├── formats.py       # 11 format definitions (template, fields, examples)
    ├── sampler.py       # Weighted sampling from taxonomy axes
    ├── prompts.py       # Prompt builders and shared response parser
    ├── designer.py      # Generation loop, batching, checkpointing, concurrency
    ├── qa.py            # Transcript similarity, coverage judge, deduplication
    ├── output.py        # File writing and QA summary aggregation
    ├── config.py        # Pydantic models for pipeline.yaml
    └── cli.py           # CLI entry point (preview / generate / qa subcommands)
    tests/
    ├── test_config.py   # Config loading, schema invariants, format weight validation
    ├── test_prompts.py  # Prompt structure, parser correctness, rule content
    └── test_sampler.py  # Weight arithmetic, axis coverage, sampling structure

Run the test suite:

    pytest tests/ -v

Lint:

    ruff check callsignforge/

All tests are smoke tests; no LLM calls are made. Tests cover config loading and validation, format weight invariants, QA config schema, prompt structure and response parser correctness, and sampler weight arithmetic.