CallSignForge

Synthetic military radio transcript pair generator for ASR fine-tuning. Produces paired clean/noisy transcripts that simulate Wav2Vec2 CTC greedy decoder output from degraded HF/VHF radio audio.

Overview

Training an ASR system to transcribe military radio communications requires data that reflects the real acoustic and linguistic conditions of tactical radio — terse proword-driven speech, phonetic spelling, operator stress, degraded signal, and the characteristic error patterns of a CTC decoder operating without a language model. Clean transcripts of that data are rare and often sensitive.

CallSignForge generates synthetic transcript pairs at scale. Each pair consists of a clean transcript (ground truth) and a noisy transcript (simulated CTC decoder output), structured to train a sequence-to-sequence correction model: given noisy CTC output, predict the clean original.

Features

11 tactical radio formats — SITREP, SPOTREP, MEDEVAC 9-line, Call for Fire (grid/polar/shift), Adjust Fire, Fire for Effect, Suppression, Immediate Suppression, Radio Check
16-axis seed taxonomy — independently sampled scenario, speaker, and linguistic axes produce a combinatorially large seed space
Weighted format and tier sampling — configurable probability weights per format and per rarity tier (common / moderate / edge case)
Separate clean and noisy LLM calls — clean generation and CTC noise simulation use independent models, prompts, and temperatures
Batched generation — N records per LLM call with ===RECORD N=== delimiters; configurable batch sizes reduce API cost
Concurrent sub-batch execution — clean sub-batches run concurrently via ThreadPoolExecutor, then noisy sub-batches run concurrently after a barrier ensuring all clean transcripts are available
Exponential backoff retry — transient API errors (rate limit, connection, timeout) are retried up to 3 times with jitter
Checkpoint and resume — per-batch JSONL checkpoint; crashed runs resume from the last completed batch
In-stride diversity feedback — embedding diversity is measured per batch and generation temperature is adjusted automatically
Three-stage QA pipeline — transcript similarity gating, LLM-as-judge coverage scoring, and embedding-based deduplication
Four output artefacts — raw JSONL, clean JSONL, SFT chat-format JSONL, and a QA summary JSON per run
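
As a rough illustration of the batched-generation feature listed above, the sketch below splits a multi-record LLM response on ===RECORD N=== delimiters. The function name split_records and the exact delimiter regex are assumptions made for this example; the project's actual parser lives in prompts.py and may behave differently.

```python
import re

# Assumed delimiter shape, based on the "===RECORD N===" marker named above.
RECORD_DELIMITER = re.compile(r"^===RECORD (\d+)===[ \t]*$", re.MULTILINE)


def split_records(response_text: str) -> dict:
    """Map each record number to the text that follows its delimiter."""
    matches = list(RECORD_DELIMITER.finditer(response_text))
    records = {}
    for i, match in enumerate(matches):
        start = match.end()
        # A record runs until the next delimiter, or to the end of the response.
        end = matches[i + 1].start() if i + 1 < len(matches) else len(response_text)
        body = response_text[start:end].strip()
        if body:  # ignore delimiters with no content after them
            records[int(match.group(1))] = body
    return records


sample = (
    "===RECORD 1===\nALPHA ONE RADIO CHECK OVER\n"
    "===RECORD 2===\nBRAVO TWO SITREP FOLLOWS\n"
)
parsed = split_records(sample)
```

Keying the result by record number rather than by position makes it easy to detect a response that skipped or duplicated a record.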

Installation

Requires Python 3.10+.

git clone <repo>
cd callSignForge
python -m venv .venv
source .venv/bin/activate
pip install -e .

For development and testing:

pip install -e ".[dev]"

Set your OpenRouter API key:

export OPENROUTER_API_KEY=<your-key>

Quick Start

Preview a few records before committing to a full run:

python -m callsignforge preview --config pipeline.yaml --count 5

Run full generation with QA:

python -m callsignforge generate --config pipeline.yaml

Resume a partial run from a checkpoint:

python -m callsignforge generate --config pipeline.yaml --resume output/checkpoint.jsonl

Re-run QA on an existing raw output file (useful for threshold tuning without regenerating):

python -m callsignforge qa --config pipeline.yaml --input output/<run_id>_raw.jsonl

Configuration

All pipeline behaviour is controlled by pipeline.yaml. The schema is validated by Pydantic on load; invalid configurations raise a ValidationError at startup.

llm:
  provider: openrouter
  api_key_env: OPENROUTER_API_KEY
  generation:
    model: meta-llama/llama-4-maverick:nitro
    base_temperature: 0.8 # starting generation temperature
    max_temperature: 1.3 # ceiling for diversity feedback adjustment
    temperature_step: 0.1 # per-batch temperature increment/decrement
  noise:
    model: anthropic/claude-3.5-haiku
    temperature: 1.1 # higher temp for varied CTC error patterns
  judge:
    model: meta-llama/llama-4-maverick:nitro

generation:
  total_records: 1200
  batch_size: 100 # records per outer batch (checkpointed together)
  llm_batch_size: 10 # records per individual LLM call
  max_concurrent_batches: 10 # concurrent sub-batch workers per phase

sampling:
  weights:
    common: 0.50
    moderate: 0.30
    edge_case: 0.20
  format_weights: # must sum to 1.0
    radio_check: 0.08
    sitrep: 0.18
    spotrep: 0.18
    medevac_9line: 0.10
    call_for_fire_grid: 0.10
    call_for_fire_polar: 0.06
    call_for_fire_shift: 0.06
    adjust_fire: 0.08
    fire_for_effect: 0.08
    suppression: 0.06
    immediate_suppression: 0.02

qa:
  diversity_threshold: 0.45
  coverage_threshold: 0.75
  embedding_similarity_threshold: 0.92
  run_similarity_check: true
  similarity_range:
    light: { min: 0.85, max: 0.99 }
    moderate: { min: 0.60, max: 0.92 }
    heavy: { min: 0.35, max: 0.78 }
  similarity_action: flag # flag | drop
  judge_batch_size: 10

output:
  dir: output/
  sft_format: chat
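
The format_weights comment above requires the weights to sum to 1.0. The project enforces its schema with Pydantic; the sketch below is a plain-Python stand-in that only illustrates that one invariant. The function name check_format_weights and the tolerance value are assumptions, not part of the project.

```python
import math


def check_format_weights(weights: dict, tolerance: float = 1e-6) -> None:
    """Raise ValueError unless the weights are non-negative and sum to 1.0."""
    if any(value < 0 for value in weights.values()):
        raise ValueError("format weights must be non-negative")
    total = math.fsum(weights.values())
    if not math.isclose(total, 1.0, abs_tol=tolerance):
        raise ValueError(f"format_weights must sum to 1.0, got {total:.4f}")


# Values copied from the pipeline.yaml example above.
format_weights = {
    "radio_check": 0.08,
    "sitrep": 0.18,
    "spotrep": 0.18,
    "medevac_9line": 0.10,
    "call_for_fire_grid": 0.10,
    "call_for_fire_polar": 0.06,
    "call_for_fire_shift": 0.06,
    "adjust_fire": 0.08,
    "fire_for_effect": 0.08,
    "suppression": 0.06,
    "immediate_suppression": 0.02,
}
check_format_weights(format_weights)  # passes silently
```

A small absolute tolerance avoids rejecting valid configurations because of floating-point rounding.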

Key configuration trade-offs

Parameter	Effect
`llm_batch_size`	Larger = fewer API calls, but harder for the model to maintain per-record consistency across a prompt
`max_concurrent_batches`	Higher = faster wall-clock time; limited by API rate limits
`base_temperature`	Higher = more diverse output; too high causes format violations
`diversity_threshold`	Lower = temperature feedback triggers less often
`embedding_similarity_threshold`	Lower = more aggressive deduplication
`similarity_action: drop`	Removes out-of-range pairs rather than flagging; reduces dataset size

Seed Taxonomy

Each record is generated from a seed drawn by independently sampling 16 axes. No cross-axis correlation is enforced — the full combinatorial space is available.

Scenario axes (10)

Axis	Values
`report_format`	11 formats (see below)
`domain`	infantry, armor, aviation, artillery, logistics, medevac, eod, sigint, special_operations, engineer
`echelon`	team, squad, platoon, company, battalion
`scenario_archetype`	deliberate_attack, hasty_attack, defense, retrograde, patrol, troops_in_contact, casevac_medevac, resupply, reconnaissance, breach, air_assault, relief_in_place
`environment`	urban_dense, urban_suburban, open_desert, mountainous, jungle, arctic, littoral, mixed_terrain
`operational_tempo`	steady_state, hasty, troops_in_contact, post_engagement
`time_of_day`	day, night, dawn_dusk
`comms_posture`	normal, degraded, relay_required, emcon, encrypted_burst
`speaker_experience`	novice_rto, experienced_nco, officer, foreign_liaison, stressed_novice
`stress_level`	calm, elevated, high, extreme

Linguistic axes (6)

Axis	Values
`verbosity`	terse, standard, verbose
`formality`	strict_prowords, informal_prowords, mixed
`unit_dialect`	us_army_standard, usmc, nato_allied, joint_task_force
`error_injection`	none, self_correction, stepped_on, phonetic_expansion, partial_transmission
`acknowledgment_style`	roger, wilco, good_copy, say_again, break_break, standby, wait_out
`grammar_degradation`	none, light, moderate, heavy

Noise level

noise_level is sampled as an additional axis using the tier weights: moderate noise maps to the common tier (50%), light noise to the moderate tier (30%), and heavy noise to the edge_case tier (20%). The resulting distribution reflects real-world ASR conditions, where heavy signal degradation is less frequent than moderate.
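
A minimal sketch of this mapping, using only the tier weights from the configuration example. The names NOISE_TIER and sample_noise_level are hypothetical and chosen for this illustration.

```python
import random
from collections import Counter

# Tier weights from the sampling.weights block of pipeline.yaml.
TIER_WEIGHTS = {"common": 0.50, "moderate": 0.30, "edge_case": 0.20}

# Noise level -> rarity tier, as described in the paragraph above.
NOISE_TIER = {"moderate": "common", "light": "moderate", "heavy": "edge_case"}


def sample_noise_level(rng: random.Random) -> str:
    """Draw one noise level with probability equal to its tier weight."""
    levels = list(NOISE_TIER)
    weights = [TIER_WEIGHTS[NOISE_TIER[level]] for level in levels]
    return rng.choices(levels, weights=weights, k=1)[0]


rng = random.Random(0)  # seeded so the demo is repeatable
counts = Counter(sample_noise_level(rng) for _ in range(10_000))
```

Over a large number of draws, the counts should approach the 50/30/20 split.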

Tier weights and edge cases

Values designated as edge cases share the edge_case probability budget equally. The remaining probability mass is split proportionally between the most common value and the remaining non-edge values. Axes with no designated edge cases are sampled uniformly across all values.

Edge case values by axis:

Axis	Edge case values
`comms_posture`	emcon, encrypted_burst, relay_required
`speaker_experience`	foreign_liaison, stressed_novice
`stress_level`	extreme
`error_injection`	stepped_on, partial_transmission
`grammar_degradation`	heavy
`scenario_archetype`	breach, air_assault, relief_in_place
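
One possible reading of the tier-weight rule is sketched below. It assumes that the first non-edge value listed for an axis is the "most common" one and receives the common weight, and that the other non-edge values share the moderate weight equally. Both points are assumptions for illustration; the actual arithmetic is in sampler.py and may differ.

```python
TIER_WEIGHTS = {"common": 0.50, "moderate": 0.30, "edge_case": 0.20}


def axis_weights(values, edge_cases, tiers=TIER_WEIGHTS):
    """Return a value -> probability mapping for one axis.

    Assumes at least one non-edge value exists for the axis.
    """
    edge = [v for v in values if v in edge_cases]
    regular = [v for v in values if v not in edge_cases]
    if not edge:
        # No designated edge cases: uniform weight across all values.
        return {v: 1.0 / len(values) for v in values}

    weights = {v: tiers["edge_case"] / len(edge) for v in edge}
    most_common, rest = regular[0], regular[1:]  # assumption: first value is most common
    weights[most_common] = tiers["common"]
    for v in rest:
        weights[v] = tiers["moderate"] / len(rest)

    # Renormalise so the result sums to 1.0 even when `rest` is empty.
    total = sum(weights.values())
    return {v: w / total for v, w in weights.items()}


stress = axis_weights(["calm", "elevated", "high", "extreme"], {"extreme"})
time_of_day = axis_weights(["day", "night", "dawn_dusk"], set())
```

Under these assumptions, stress_level resolves to calm 0.50, elevated 0.15, high 0.15, and extreme 0.20, while time_of_day is uniform.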

Radio Formats

Format ID	Description
`radio_check`	Signal quality check between two stations
`sitrep`	Periodic unit status update (6-line)
`spotrep`	Enemy contact spot report
`medevac_9line`	MEDEVAC request (9-line)
`call_for_fire_grid`	Fire mission request using MGRS grid
`call_for_fire_polar`	Fire mission request using polar (direction + distance)
`call_for_fire_shift`	Fire mission shift from a known point
`adjust_fire`	Fire adjustment from previous rounds
`fire_for_effect`	Command to fire for effect
`suppression`	Suppression fire request
`immediate_suppression`	Emergency suppression with no adjustment

Each format definition includes a purpose statement, line-by-line template, required fields list, prowords, notes on number formatting and MGRS grid precision, and at least one annotated clean example. These definitions are injected into both the generation prompt and the coverage judge prompt at runtime.

Generation Pipeline

sample_seeds()
     │
     ▼
[Concurrent clean sub-batches]        ← ThreadPoolExecutor
  _generate_clean_batch() × N         ← one LLM call per sub-batch
     │
     ▼  (barrier — noisy depends on clean)
[Concurrent noisy sub-batches]        ← ThreadPoolExecutor
  _generate_noisy_batch() × N         ← one LLM call per sub-batch
     │
     ▼
compute_batch_embedding_diversity()
TemperatureFeedback.update()
_write_checkpoint()
     │
     ▼  (repeat until total_records reached)
_run_qa_pass()
     │
     ├── run_transcript_similarity()
     ├── coverage_judge()
     └── deduplicate_by_embedding()
          │
          ▼
     write_outputs()
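
The two-phase concurrency in the diagram can be sketched with the standard library alone. The functions fake_clean_call and fake_noisy_call are stand-ins for the real LLM calls, and run_outer_batch is a hypothetical name; only the use of ThreadPoolExecutor and the clean-then-noisy ordering come from the description above.

```python
from concurrent.futures import ThreadPoolExecutor


def fake_clean_call(seeds):
    """Stand-in for one clean-generation LLM call over a sub-batch."""
    return [f"CLEAN {seed}" for seed in seeds]


def fake_noisy_call(clean_transcripts):
    """Stand-in for one noise-simulation LLM call over a sub-batch."""
    return [text.replace("CLEAN", "NOISY") for text in clean_transcripts]


def chunk(items, size):
    return [items[i:i + size] for i in range(0, len(items), size)]


def run_outer_batch(seeds, llm_batch_size, max_workers):
    sub_batches = chunk(seeds, llm_batch_size)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Phase 1: all clean sub-batches run concurrently. Consuming the
        # iterator into a list blocks until every call has finished, which
        # acts as the barrier before the noisy phase.
        clean_batches = list(pool.map(fake_clean_call, sub_batches))
        # Phase 2: noisy sub-batches, each fed its matching clean output.
        noisy_batches = list(pool.map(fake_noisy_call, clean_batches))

    pairs = []
    for cleans, noisies in zip(clean_batches, noisy_batches):
        pairs.extend(zip(cleans, noisies))
    return pairs


pairs = run_outer_batch(list(range(25)), llm_batch_size=10, max_workers=4)
```

Because Executor.map returns results in input order, clean and noisy transcripts stay aligned by position without extra bookkeeping.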

MGRS grid zone injection

Without explicit zone instruction, models exhibit a strong learned bias toward a small set of grid zones such as 38S MB (Iraq), which appear frequently in publicly available military doctrine and training data. This produces near-identical grid prefixes across records, inflating embedding similarity and causing excessive deduplication loss.

Each batch prompt includes a randomly chosen MGRS zone, 100km square, digit precision (6/8/10), and a matching NATO phonetic spelling example rendered from the same zone. Both the phonetic example in the formatting rules and the batch-level zone instruction are derived from the same random draw — ensuring no hardcoded zone anchor remains in the prompt for the model to fall back on.
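
The following sketch shows one way a single random draw could feed both the zone instruction and the phonetic example. It is a simplification: it does not check that the chosen 100 km square actually exists in the chosen zone, and the digit wording (for example NINE rather than NINER) and all function names are assumptions for illustration.

```python
import random

NATO = {word[0]: word for word in (
    "ALPHA BRAVO CHARLIE DELTA ECHO FOXTROT GOLF HOTEL INDIA JULIETT KILO LIMA "
    "MIKE NOVEMBER OSCAR PAPA QUEBEC ROMEO SIERRA TANGO UNIFORM VICTOR WHISKEY "
    "XRAY YANKEE ZULU"
).split()}
DIGITS = dict(zip("0123456789",
                  "ZERO ONE TWO THREE FOUR FIVE SIX SEVEN EIGHT NINE".split()))

# MGRS omits the letters I and O.
LAT_BANDS = "CDEFGHJKLMNPQRSTUVWX"
COLUMN_LETTERS = "ABCDEFGHJKLMNPQRSTUVWXYZ"
ROW_LETTERS = "ABCDEFGHJKLMNPQRSTUV"


def random_grid(rng):
    """Draw a zone, a 100 km square, and a 6-, 8-, or 10-digit grid."""
    zone = f"{rng.randint(1, 60)}{rng.choice(LAT_BANDS)}"
    square = rng.choice(COLUMN_LETTERS) + rng.choice(ROW_LETTERS)
    precision = rng.choice((6, 8, 10))
    digits = "".join(rng.choice("0123456789") for _ in range(precision))
    return zone, square, digits


def spell(text):
    """Render letters and digits as space-separated phonetic words."""
    return " ".join(NATO.get(ch) or DIGITS[ch] for ch in text)


def zone_instruction(rng):
    # Instruction and phonetic example both come from the same draw.
    zone, square, digits = random_grid(rng)
    example = spell(zone + square + digits)
    return (f"Use MGRS grid zone {zone} {square} with {len(digits)}-digit grids. "
            f"Phonetic example: {example}")


instruction = zone_instruction(random.Random(7))
```

Generating the example from the drawn zone, rather than storing a fixed example string, is what keeps a hardcoded anchor out of the prompt.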

CTC noise simulation

Noisy transcripts simulate Wav2Vec2 CTC greedy decoder output — a single flat line of uppercase space-separated tokens with no punctuation. Each record includes a noise_level and a quantitative TARGET specifying the required error mix:

Noise level	Target
`light`	1–3 substitutions + optional 1 CTC artifact; all other tokens present
`moderate`	4–9 substitutions + delete 3–5 tokens + 1–2 CTC artifacts
`heavy`	10–18 substitutions + delete 6–10 tokens + 2–4 CTC artifacts

Error types used across the noise levels (not every type appears at every level; light noise, for example, keeps all tokens present):

SUBSTITUTION — token replaced by a phonetically confusable output (acoustic confusion, not spelling similarity)
DELETION — token dropped entirely
MERGE — word boundary removed between two adjacent short tokens
REPETITION — token emitted twice at a CTC frame boundary
TRAILING LOSS — final proword clipped by PTT release
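
To make the five error types concrete, the sketch below applies each one to a token list by hand. This is only an illustration: the pipeline produces noisy transcripts with an LLM, not with rule-based functions like these. The helper names are hypothetical, and the two confusion pairs are taken from the record schema example in this README.

```python
import re

# Confusion pairs borrowed from the README's record example (TWO -> TOO, THREE -> FREE).
CONFUSIONS = {"TWO": "TOO", "THREE": "FREE"}


def to_ctc_surface(text):
    """Uppercase, strip punctuation, and split into space-separated tokens."""
    return re.sub(r"[^A-Z0-9 ]+", " ", text.upper()).split()


def substitute(tokens, table):
    return [table.get(token, token) for token in tokens]


def delete(tokens, i):
    return tokens[:i] + tokens[i + 1:]


def merge(tokens, i):
    """Remove the word boundary between tokens i and i + 1."""
    return tokens[:i] + [tokens[i] + tokens[i + 1]] + tokens[i + 2:]


def repeat(tokens, i):
    """Emit token i twice."""
    return tokens[:i + 1] + [tokens[i]] + tokens[i + 1:]


def trailing_loss(tokens):
    """Clip the final token, as if PTT were released early."""
    return tokens[:-1]


clean = "Apache Two One, this is Bravo Three, radio check, over."
tokens = to_ctc_surface(clean)
tokens = substitute(tokens, CONFUSIONS)  # TWO -> TOO, THREE -> FREE
tokens = delete(tokens, 4)               # drop IS
tokens = merge(tokens, 1)                # TOO + ONE -> TOOONE
tokens = repeat(tokens, 3)               # BRAVO emitted twice
tokens = trailing_loss(tokens)           # OVER clipped
noisy = " ".join(tokens)
```

The result is a single flat line of uppercase tokens with no punctuation, matching the output shape described above.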

QA Pipeline

Three sequential checks are applied to all generated records. An in-stride diversity check also runs between generation batches.

1. Transcript similarity check

Measures the cosine similarity between embeddings of the clean and noisy transcript in each pair. This validates that noise injection is calibrated to the declared noise_level — too-high similarity indicates insufficient noise; too-low similarity indicates over-corruption that destroys the signal.

How it works: All clean and noisy transcripts are encoded in a single batch call. Per-pair similarity is computed as the dot product of the L2-normalized embedding vectors. This reuses the embedding model already loaded for deduplication, adding no additional dependency or startup overhead.

Each noise level has a configured similarity range (qa.similarity_range). Records outside their expected range receive qa.similarity_flagged = true. When similarity_action: drop is set, out-of-range records are removed before downstream steps; the default flag retains all records for inspection.

The score is stored per record as qa.transcript_similarity (float, 0–1).

Expected ranges by noise level:

Noise level	Min	Max	Rationale
`light`	0.85	0.99	1–3 substitutions should preserve most of the embedding
`moderate`	0.60	0.92	Partial deletions and substitutions produce moderate drift
`heavy`	0.35	0.78	Structural degradation should yield clearly lower similarity

Records above the maximum for their noise level are insufficiently noisy. Records below the minimum are over-corrupted to the point where the clean/noisy relationship is lost.
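
The per-pair check can be sketched in plain Python with toy vectors standing in for real embeddings. The ranges are copied from the table above; the function names are hypothetical, and inclusive bounds are an assumption based on the "above the maximum" and "below the minimum" wording.

```python
import math

# Ranges from qa.similarity_range in pipeline.yaml.
SIMILARITY_RANGE = {
    "light": (0.85, 0.99),
    "moderate": (0.60, 0.92),
    "heavy": (0.35, 0.78),
}


def l2_normalise(vec):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def pair_similarity(clean_vec, noisy_vec):
    """Cosine similarity as the dot product of L2-normalised vectors."""
    a, b = l2_normalise(clean_vec), l2_normalise(noisy_vec)
    return sum(x * y for x, y in zip(a, b))


def similarity_flagged(similarity, noise_level):
    """True when the score falls outside the range for its noise level."""
    low, high = SIMILARITY_RANGE[noise_level]
    return not (low <= similarity <= high)


# Toy 2-D "embeddings"; a real run would use sentence embeddings instead.
score = pair_similarity([1.0, 0.0], [0.8, 0.6])  # approximately 0.8
flags = {level: similarity_flagged(score, level) for level in SIMILARITY_RANGE}
```

With a score near 0.8, the pair would be flagged as over-corrupted for light noise, accepted for moderate noise, and flagged as insufficiently noisy for heavy noise.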

2. Coverage judge

An LLM judge scores each clean transcript on how completely it covers the required fields for its declared format. Up to judge_batch_size records are scored per API call to minimise cost.

How it works: Each record's prompt block includes the format ID, required field list, full template, and a clean example from the format definition. This gives the judge explicit format context per record rather than relying on the model's internal knowledge of format structure. Scores are returned as one decimal number per line, matched to records by position.

Scores range from 0.0 to 1.0:

Score	Meaning
1.0	All required fields clearly present and complete
0.75	Most fields present, minor gaps
0.5	About half of required fields present or recognizable
0.25	A few fields present, major gaps
0.0	Required fields mostly missing or unrecognizable

Records below qa.coverage_threshold (default 0.75) are flagged (qa.coverage_flagged = true). Coverage-flagged records are retained in the raw output but excluded from the clean set. On any judge API failure, affected records default to score 1.0 and are not flagged.

The score is stored per record as qa.coverage_score. When no API key is present, all records receive coverage_score: null and coverage_flagged: false.
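
A rough sketch of positional score matching and flagging follows. Treating an unparseable or wrong-length judge response the same way as an API failure (default score 1.0, not flagged) is an assumption of this example, as are the function names.

```python
def parse_judge_scores(response_text, n_records):
    """Return one float per record, or None if the response is unusable."""
    lines = [line.strip() for line in response_text.splitlines() if line.strip()]
    if len(lines) != n_records:
        return None  # cannot match scores to records by position
    try:
        return [float(line) for line in lines]
    except ValueError:
        return None


def apply_coverage(records, response_text, threshold=0.75):
    """Attach coverage_score and coverage_flagged to each record's qa block."""
    scores = parse_judge_scores(response_text, len(records))
    if scores is None:
        # Failure default described above: score 1.0, not flagged.
        scores = [1.0] * len(records)
    for record, score in zip(records, scores):
        qa = record.setdefault("qa", {})
        qa["coverage_score"] = score
        qa["coverage_flagged"] = score < threshold
    return records


scored = apply_coverage([{"id": "a"}, {"id": "b"}, {"id": "c"}], "1.0\n0.5\n0.75\n")
fallback = apply_coverage([{"id": "d"}], "not a number")
```

Note that defaulting to 1.0 on failure keeps records in the clean set, so judge outages reduce QA coverage silently rather than shrinking the dataset.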

3. Embedding deduplication

Near-duplicate records are removed from the clean output set by comparing all-MiniLM-L6-v2 embeddings of the clean transcript across the full dataset. The first occurrence of each record is kept; any subsequent record whose cosine similarity to any already-kept record exceeds qa.embedding_similarity_threshold (default 0.92) is dropped.

How it works: All clean transcripts are encoded in a single batch call. A greedy first-seen pass compares each candidate against the matrix of already-kept embeddings using normalised dot products. The threshold of 0.92 corresponds to transcripts that differ only in minor surface variation while describing essentially the same scenario and content combination.

Deduplication runs on the full post-generation dataset rather than per-batch, so it catches cross-batch duplicates that batch-level diversity checks would miss.
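
The greedy first-seen pass can be sketched as below, with toy vectors in place of all-MiniLM-L6-v2 embeddings. The function name greedy_dedupe is hypothetical, and this pure-Python loop is far slower than a matrix implementation would be.

```python
import math


def unit(vec):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def greedy_dedupe(embeddings, threshold=0.92):
    """Return indices of records kept by a first-seen greedy pass."""
    kept_indices = []
    kept_vectors = []
    for index, vec in enumerate(embeddings):
        candidate = unit(vec)
        # Drop the candidate if it is too close to anything already kept.
        if any(dot(candidate, kept) > threshold for kept in kept_vectors):
            continue
        kept_indices.append(index)
        kept_vectors.append(candidate)
    return kept_indices


toy_embeddings = [
    [1.0, 0.0],    # kept (first seen)
    [0.99, 0.05],  # near-duplicate of record 0, dropped
    [0.0, 1.0],    # kept
    [0.7, 0.7],    # about 0.707 to both kept records, so kept
]
kept = greedy_dedupe(toy_embeddings)
```

Because the pass is order-dependent, which member of a near-duplicate cluster survives depends on generation order.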

In-stride diversity feedback (between batches)

After each outer batch, the mean pairwise cosine distance across all clean transcripts in that batch is computed using the same all-MiniLM-L6-v2 model. This diversity score drives automatic temperature adjustment:

If diversity falls below qa.diversity_threshold: generation temperature is increased by llm.generation.temperature_step, up to llm.generation.max_temperature
If diversity meets or exceeds the threshold for two or more consecutive batches: temperature is stepped back down toward llm.generation.base_temperature

This prevents the generator from converging on repetitive output during long runs without requiring manual temperature tuning between batches. Per-batch diversity scores and temperature values are recorded in the QA summary batch_log.
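
The feedback rule above could be implemented along the following lines. The class name TemperatureFeedbackSketch is hypothetical, and stepping down by one increment on every batch once the streak reaches two is an assumption; the project's own TemperatureFeedback may differ in detail.

```python
from dataclasses import dataclass, field


@dataclass
class TemperatureFeedbackSketch:
    base_temperature: float = 0.8
    max_temperature: float = 1.3
    step: float = 0.1
    diversity_threshold: float = 0.45
    temperature: float = field(init=False)
    healthy_streak: int = field(init=False, default=0)

    def __post_init__(self):
        self.temperature = self.base_temperature

    def update(self, diversity: float) -> float:
        """Adjust and return the temperature for the next batch."""
        if diversity < self.diversity_threshold:
            # Low diversity: reset the streak and raise temperature, capped.
            self.healthy_streak = 0
            self.temperature = min(self.max_temperature, self.temperature + self.step)
        else:
            self.healthy_streak += 1
            if self.healthy_streak >= 2:
                # Sustained diversity: step back toward the base value.
                self.temperature = max(self.base_temperature, self.temperature - self.step)
        # Round to keep float drift out of the batch log.
        self.temperature = round(self.temperature, 4)
        return self.temperature


feedback = TemperatureFeedbackSketch()
history = [feedback.update(d) for d in (0.40, 0.42, 0.50, 0.55, 0.60)]
```

In this run the temperature climbs for two low-diversity batches, holds for one healthy batch, and then steps back down to the base value.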

Output Files

Each run produces four timestamped files in output/:

File	Contents
`<run_id>_raw.jsonl`	All generated records including flagged ones
`<run_id>_clean.jsonl`	Deduplicated, post-QA records only
`<run_id>_sft.jsonl`	Chat-format SFT pairs (user=noisy, assistant=clean)
`<run_id>_qa_summary.json`	Counts, per-format/noise breakdowns, similarity stats, coverage stats, batch log

A checkpoint.jsonl is also written to output/ during generation and can be passed to --resume to continue a partial run.
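
A minimal sketch of resume logic, assuming the checkpoint holds one JSON record per line and that a crash may leave a truncated final line. The checkpoint layout and both function names are assumptions for illustration, not the project's documented format.

```python
import json
import tempfile
from pathlib import Path


def load_checkpoint(path: Path) -> list:
    """Read complete records, stopping at the first unparseable line."""
    records = []
    if not path.exists():
        return records
    with path.open(encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            try:
                records.append(json.loads(line))
            except json.JSONDecodeError:
                break  # likely a partial write from a crashed run
    return records


def remaining_batches(n_done: int, total_records: int, batch_size: int) -> int:
    """Count outer batches still to run, resuming from the last complete one."""
    completed = n_done // batch_size
    total = -(-total_records // batch_size)  # ceiling division
    return total - completed


with tempfile.TemporaryDirectory() as tmp:
    checkpoint = Path(tmp) / "checkpoint.jsonl"
    lines = [json.dumps({"id": i}) for i in range(200)]
    # Simulate a crash: 200 complete records plus one truncated line.
    checkpoint.write_text("\n".join(lines) + '\n{"id": 2', encoding="utf-8")
    done = load_checkpoint(checkpoint)
    todo = remaining_batches(len(done), total_records=1200, batch_size=100)
```

With the example configuration (1200 records, batches of 100), 200 recovered records leave 10 outer batches to run.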

Record schema (raw / clean JSONL)

{
  "id": "<uuid4>",
  "generated_at": "<ISO 8601 UTC>",
  "seed": {
    "report_format": "sitrep",
    "domain": "infantry",
    "echelon": "platoon",
    "noise_level": "moderate",
    "scenario_archetype": "patrol",
    "environment": "urban_dense",
    "operational_tempo": "hasty",
    "time_of_day": "night",
    "comms_posture": "normal",
    "speaker_experience": "experienced_nco",
    "stress_level": "elevated",
    "verbosity": "standard",
    "formality": "strict_prowords",
    "unit_dialect": "us_army_standard",
    "error_injection": "none",
    "acknowledgment_style": "roger",
    "grammar_degradation": "none"
  },
  "clean_transcript": "APACHE TWO ONE, THIS IS BRAVO THREE...",
  "noisy_transcript": "APACHE TOO ONE THIS BRAV FREE...",
  "qa": {
    "batch_embedding_diversity": 0.512,
    "transcript_similarity": 0.741,
    "similarity_flagged": false,
    "coverage_score": 0.92,
    "coverage_flagged": false
  }
}

SFT format (`_sft.jsonl`)

{
  "messages": [
    { "role": "user", "content": "<noisy transcript>" },
    { "role": "assistant", "content": "<clean transcript>" }
  ]
}
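
Converting a raw record to this chat format is a direct field mapping. The sketch below uses the record fields shown in the schema above; the function name to_sft is hypothetical.

```python
import json


def to_sft(record: dict) -> dict:
    """Build a chat-format SFT pair: noisy input, clean target."""
    return {
        "messages": [
            {"role": "user", "content": record["noisy_transcript"]},
            {"role": "assistant", "content": record["clean_transcript"]},
        ]
    }


record = {
    "clean_transcript": "APACHE TWO ONE, THIS IS BRAVO THREE...",
    "noisy_transcript": "APACHE TOO ONE THIS BRAV FREE...",
}
sft_line = json.dumps(to_sft(record))  # one line of the _sft.jsonl file
```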

QA summary (`_qa_summary.json`)

{
  "run_id": "callsignforge_20260309_143022",
  "counts": {
    "generated": 1200,
    "deduped_removed": 14,
    "similarity_flagged": 37,
    "coverage_flagged": 8,
    "final_clean": 1186
  },
  "by_format": { "sitrep": 143, "spotrep": 121, "...": 0 },
  "by_noise_level": { "light": 358, "moderate": 600, "heavy": 228 },
  "similarity_stats": {
    "moderate": { "mean": 0.71, "min": 0.52, "max": 0.88, "flagged": 12 }
  },
  "coverage_stats": { "mean": 0.89, "min": 0.61, "max": 1.0, "flagged": 8 },
  "batch_log": [
    {
      "batch_num": 1,
      "n_records": 100,
      "diversity": 0.51,
      "temperature_used": 0.8,
      "temperature_after": 0.8
    }
  ]
}

Project Structure

callsignforge/
├── taxonomy.py      # All axis names and values; edge case designations
├── formats.py       # 11 format definitions (template, fields, examples)
├── sampler.py       # Weighted sampling from taxonomy axes
├── prompts.py       # Prompt builders and shared response parser
├── designer.py      # Generation loop, batching, checkpointing, concurrency
├── qa.py            # Transcript similarity, coverage judge, deduplication
├── output.py        # File writing and QA summary aggregation
├── config.py        # Pydantic models for pipeline.yaml
└── cli.py           # CLI entry point (preview / generate / qa subcommands)

tests/
├── test_config.py   # Config loading, schema invariants, format weight validation
├── test_prompts.py  # Prompt structure, parser correctness, rule content
└── test_sampler.py  # Weight arithmetic, axis coverage, sampling structure

Development

Run the test suite:

pytest tests/ -v

Lint:

ruff check callsignforge/

All tests are smoke tests — no LLM calls are made. Tests cover config loading and validation, format weight invariants, QA config schema, prompt structure and response parser correctness, and sampler weight arithmetic.
