A research codebase for decoding QRSS beacons — very slow Morse-code transmissions on the 30-metre amateur radio band — by treating spectrograms as images and reading the Morse characters with a convolutional recogniser. The project is structured as a reproducible experiment: synthetic data generation, model training, honest evaluation on held-out real recordings, and an incremental data-collection loop for real-world refinement.
QRSS ("Q-code: decrease your transmit speed") is a mode used by low-power beacons on the 30-metre band near 10.140 MHz. Transmissions are slow enough that individual Morse elements last several seconds: a single dot can be 3 to 12 seconds long, a full callsign 60 to 120 seconds. Because the elements are so long, the signals are inaudible by ear but become readable as horizontal lines on a long-duration spectrogram. Hobbyists capture audio from internet-accessible software-defined receivers (KiwiSDRs), render the spectrograms, and read the callsigns visually.
Automating that visual decoding is the problem this repository addresses. The natural framing is optical character recognition: the spectrogram is a picture, the callsign is the word, the Morse code is the font.
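To make the scale concrete, here is a back-of-envelope sketch of the spectrogram resolution involved. The sample rate is an assumption for illustration; the project's actual STFT parameters live in config.yaml.

```python
# Why QRSS spectrograms need long FFT windows: frequency resolution is
# sample_rate / n_fft, and QRSS lines sit only a few hertz apart, so bins
# must be well under 1 Hz wide. Sample rate here is assumed, not taken
# from config.yaml.
sample_rate = 6000          # Hz (illustrative)
target_resolution = 0.25    # Hz per bin, enough to separate close lines

n_fft = 1
while sample_rate / n_fft > target_resolution:
    n_fft *= 2              # next power of two

window_seconds = n_fft / sample_rate
# n_fft = 32768, so each STFT frame spans about 5.5 seconds of audio,
# comparable to a single QRSS dot: the elements render as clean
# horizontal lines rather than smeared blobs.
```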
The codebase is a complete, test-covered pipeline from raw audio to decoded callsign. It contains three parts:
- A synthetic-data generator and sequence model. The generator renders arbitrary callsigns as spectrograms under a parameterisable noise and distortion model. A CNN + BiLSTM network is trained with CTC loss on the resulting synthetic dataset. On held-out synthetic callsigns, the model decodes with character error rate below one per cent and full-callsign accuracy above 97 per cent at a 15 dB signal-to-noise ratio.
- An honest evaluation of that model on real recordings. Twenty-five manually labelled strips from seven captures of a European KiwiSDR receiver, plus 287 strips auto-labelled by matched-filter template correlation from 67 additional captures. On a capture-wise held-out split the sequence model generalises at zero to three per cent full-callsign accuracy regardless of fine-tuning strategy. This negative result, recorded in full in RESULTS.md, is the central finding of the experiment: the synthesis-to-real distribution gap is not closable by preprocessing or augmentation alone when the real vocabulary is only four classes.
- A pivot to a small closed-vocabulary classifier. Replacing the 2 M parameter sequence model with a 60 K parameter CNN classifying into four known callsigns plus an "unknown" bucket gives 85 per cent honest held-out accuracy on the same 287 strips. This small model is the current best real-world decoder in the project and the first that measurably generalises across receivers and days. It serves as the seed for an active-learning loop that captures new audio, triages candidate signals, and queues anything unfamiliar for manual labelling.
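The CTC decoding mentioned above collapses repeated per-frame predictions and removes blanks. This is a minimal greedy sketch of that collapse rule; the repository's grammar-constrained beam search in callsign_grammar.py is more involved and is not reproduced here.

```python
def ctc_greedy_decode(frame_ids: list[int], alphabet: str, blank: int = 0) -> str:
    """Greedy CTC collapse: merge adjacent repeats, then drop blanks.

    frame_ids are per-frame argmax indices into [blank] + alphabet,
    i.e. index 0 is the CTC blank and index k maps to alphabet[k - 1].
    """
    out = []
    prev = None
    for idx in frame_ids:
        if idx != prev and idx != blank:
            out.append(alphabet[idx - 1])
        prev = idx
    return "".join(out)

# With alphabet "A..Z0..9", the frame sequence G,G,<blank>,0,0,<blank>,P
# collapses to "G0P":
# ctc_greedy_decode([7, 7, 0, 27, 27, 0, 16],
#                   "ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789")  # → "G0P"
```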
The repository is therefore useful as a case study in three things: how to build a reasonable sim-to-real OCR pipeline for radio spectrograms, why that pipeline does not by itself produce real-world generalisation, and how to respond to that negative result by architecturally rightsizing the model and collecting data incrementally.
```
qrss_ocr_v2/
├── config.yaml # all hyperparameters (audio SR, STFT, augmentation, training)
├── requirements.txt # numpy, scipy, torch, Pillow, matplotlib, pyyaml, pytest
├── RESULTS.md # full experimental log with numbers and lessons
│
├── generator/ # synthetic data generation
│ ├── morse_table.py # Morse table and ITU callsign format
│ ├── tone_render.py # audio for one element / character / callsign
│ ├── spectrogram.py # STFT, log-magnitude, resize
│ ├── augmentations.py # noise, drift, fading, jitter, impulse, QRM
│ ├── char_dataset.py # dataset of single symbols for the character classifier
│ ├── line_dataset.py # dataset of full callsigns with disjoint train/val pools
│ └── extract_real_strips.py # extract labelled strips from matched-filter outputs
│
├── model/ # networks and training code
│ ├── char_classifier.py # small CNN for single-character classification
│ ├── sequence_model.py # CNN + BiLSTM + Linear head, trained with CTC
│ ├── closed_classifier.py # 60K-parameter CNN for closed-vocabulary inference
│ ├── callsign_grammar.py # CTC prefix beam search constrained by ITU format
│ ├── input_norm.py # per-image z-score (matches distributions at inference)
│ ├── losses.py # focal loss for optional class-imbalance handling
│ ├── train_char.py # trains char_classifier
│ ├── train_seq.py # trains sequence_model with grammar beam evaluation
│ ├── train_closed.py # trains closed_classifier with capture-wise split
│ ├── finetune_real.py # adapts a pretrained sequence_model on labelled real strips
│ └── finetune_real_split.py # same, with honest capture-wise split
│
├── segmentor/ # signal extraction from wide-band spectrograms
│ ├── energy.py # per-band energy estimation
│ ├── tracer.py # frequency trajectory tracking with drift
│ ├── normalizer.py # straightening, amplitude and speed normalisation
│ └── signal_type.py # heuristic mode classification (CW, FSKCW, WSPR, ...)
│
├── pipeline/ # end-to-end processing
│ ├── cycle_stacker.py # autocorrelation-based cycle detection and averaging
│ ├── full_pipeline.py # QRSSDecoder class assembling segmentor + model + grammar
│ ├── annotate_capture.py # render full-band spectrogram with an overlaid highlight
│ └── triage.py # npz → bands → signal type → classifier → routed outputs
│
├── eval/ # measurement utilities
│ ├── metrics.py # Levenshtein CER, full-callsign accuracy, detection P/R
│ ├── confusion.py # per-character confusion matrix
│ ├── visualize.py # annotated prediction overlays
│ ├── benchmark.py # batch evaluation on a labelled directory
│ ├── honest_eval.py # per-SNR evaluation on synthetic hold-out
│ ├── real_benchmark.py # evaluation on the 25-strip manually labelled benchmark
│ └── decode_raw_npz.py # decode candidate strips extracted directly from raw captures
│
└── scripts/ # operational helpers
    ├── generate_on_vps.sh # bulk synthetic data generation on a remote GPU box
    ├── train_on_vps.sh # training on a remote GPU box
    ├── sync_data.sh # rsync checkpoints and datasets between local and remote
    ├── capture_loop.py # persistent capture+triage loop (deprecated in favour of cron)
    ├── capture_round.py # one cron round: pick hosts by day/night, capture, triage
    └── review_queue.py # CLI to list, manifest, and apply manual labels
```
Unit tests live next to each module as test_*.py. As of 2026-04-15 the suite comprises 58 passing tests.
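The character error rate reported throughout the evaluation is Levenshtein edit distance normalised by reference length. This is a minimal sketch of the metric; the actual implementation in eval/metrics.py is not reproduced here.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance via the standard two-row dynamic-programming recurrence."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

def cer(predicted: str, reference: str) -> float:
    """Character error rate: edit distance over reference length."""
    return levenshtein(predicted, reference) / max(1, len(reference))

# e.g. cer("G0PKD", "G0PKT") == 0.2 (one substitution in five characters)
```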
```
python3 -m venv .venv
.venv/bin/pip install -r requirements.txt
```

The code targets Python 3.12, PyTorch 2, SciPy, Pillow, and PyYAML. A GPU is optional but strongly recommended for training.
```
.venv/bin/python -m pytest -q
```

```
.venv/bin/python -m generator.char_dataset --config config.yaml \
    --output data/char_train --num_samples 10000 --seed 1
.venv/bin/python -m generator.line_dataset --config config.yaml \
    --train_out data/line_train --val_out data/line_val \
    --train_samples 5000 --val_samples 1000 \
    --unique_train 500 --unique_val 100 \
    --pools_json data/callsign_pools.json --seed 0
```

Full-scale training sizes are in config.yaml (100k character samples, 50k line samples, 5000 unique callsigns). The helper scripts scripts/generate_on_vps.sh and scripts/train_on_vps.sh are meant for a remote GPU box; see scripts/sync_data.sh for the rsync wrapper.
```
.venv/bin/python -m model.train_seq \
    --config config.yaml \
    --train_dir data/line_train \
    --val_dir data/line_val \
    --output model/checkpoints/seq.pt \
    --device cuda
```

This requires a directory of labelled strips named {capture_stem}__{callsign}_{freq_hz}_o{offset_s}.png, typically produced by generator/extract_real_strips.py from matched-filter template outputs.
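That naming convention can be parsed back into its fields with a short helper. The regex below is one plausible reading of the pattern and the example filename is hypothetical; this is a sketch, not the project's actual parser.

```python
import re

# Pattern for {capture_stem}__{callsign}_{freq_hz}_o{offset_s}.png.
# The capture stem may itself contain single underscores, so it is
# matched greedily up to the double underscore.
STRIP_RE = re.compile(
    r"^(?P<stem>.+)__(?P<callsign>[A-Z0-9/]+)_(?P<freq_hz>\d+)_o(?P<offset_s>[\d.]+)\.png$"
)

def parse_strip_name(name: str) -> dict:
    """Return the metadata fields encoded in a labelled-strip filename."""
    m = STRIP_RE.match(name)
    if m is None:
        raise ValueError(f"not a labelled strip filename: {name}")
    d = m.groupdict()
    d["freq_hz"] = int(d["freq_hz"])
    d["offset_s"] = float(d["offset_s"])
    return d

# Hypothetical example (stem, callsign, frequency, and offset made up):
fields = parse_strip_name("kiwi30m_20260101_120000__G0ABC_1490_o12.5.png")
```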
```
.venv/bin/python -m model.train_closed \
    --data_dir data/real_strips_extracted \
    --output model/checkpoints/closed.pt \
    --device cuda --epochs 60 --dropout 0.3 --augment
```

Synthetic hold-out per signal-to-noise bucket:
```
.venv/bin/python -m eval.honest_eval \
    --model model/checkpoints/seq.pt \
    --output_dir eval/honest \
    --samples_per_snr 400 \
    --snrs 25 20 15 10 5
```

Benchmark on labelled real strips:

```
.venv/bin/python -m eval.real_benchmark \
    --model model/checkpoints/seq.pt \
    --data_glob "data/real_benchmark/*.png"
```

The closed-vocabulary classifier is deployed inside a capture-and-review loop that accumulates labelled real data over time. Four components work together.
scripts/capture_round.py is a one-shot capture script intended to run from
cron. At each invocation it reads a list of public KiwiSDR receivers
from ~/more_fun_to_receive.json, filters to those that cover 10.14 MHz,
are online, and have a free slot, and selects a small balanced set of day-
side and night-side hosts with distinct Maidenhead grid squares. A short
rotation file (/tmp/qrss_recent_hosts.txt) keeps successive runs from
hammering the same receivers. For each selected host it calls
tools/kiwi_capture.py from the companion project more_fun_to_receive to
download a few minutes of audio around 10.140 MHz.
Example crontab entry:
```
3,27,49 * * * * /path/to/python qrss_ocr_v2/scripts/capture_round.py \
    --minutes 6 --n_day 2 --n_night 2 \
    >> qrss_ocr_v2/logs/cron_capture.log 2>&1
```
The non-uniform minute offsets are deliberate: the gaps between successive firings are 24, 22, and 14 minutes rather than a fixed interval, so captures sample propagation conditions at irregular spacing across the day.
pipeline/triage.py consumes a raw capture file, computes the full-band
spectrogram, finds the strongest narrow-band energy peaks inside the
beacon subband (1000–1700 Hz audio), extracts a short strip around each,
and routes it into one of three buckets:
- decoded/ — the segmentor/signal_type.py heuristic identified the strip as a QRSS mode (CW, FSKCW, DFCW) and the closed classifier predicted one of its known callsigns with confidence above the threshold.
- review_qrss/ — a QRSS-like signal, but the classifier was uncertain or predicted "unknown". Candidate for a new known callsign.
- review_other/ — the signal was classified as a different mode (WSPR, FT8, carrier, noise) and is queued separately.
Each routed strip carries a sidecar JSON with the full metadata and a companion PNG showing the original capture's full 1000–1700 Hz spectrogram with a red rectangle highlighting where in the band the candidate signal appears. The strip filename encodes the most relevant fields, for example:
conf0.47_CW_1490Hz_predG0PKT_0.6s_kiwi30m_20260415_115202.png
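The routing decision can be summarised as a small function. The argument names and the 0.5 threshold below are assumptions for illustration, not the actual triage.py interface.

```python
QRSS_MODES = {"CW", "FSKCW", "DFCW"}

def route(signal_type: str, prediction: str, confidence: float,
          threshold: float = 0.5) -> str:
    """Map a triaged strip to its output bucket (illustrative logic only)."""
    if signal_type not in QRSS_MODES:
        return "review_other"   # WSPR, FT8, carrier, noise, ...
    if prediction != "UNKNOWN" and confidence >= threshold:
        return "decoded"        # known callsign, confident
    return "review_qrss"        # QRSS-like but unfamiliar: labelling candidate
```

Under this sketch the example strip above (CW, confidence 0.47) would land in review_qrss rather than decoded, assuming a 0.5 threshold.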
scripts/review_queue.py is a four-command command-line tool for manual
labelling:
```
python scripts/review_queue.py list      # summary by route / signal type / prediction
python scripts/review_queue.py manifest  # write data/triage_out/labels_todo.csv
# — edit labels_todo.csv, fill in callsign_manual column —
python scripts/review_queue.py apply     # copy labelled strips into data/real_strips_labeled/
python scripts/review_queue.py stats     # per-callsign counts
```

The reviewer may enter a valid ITU callsign, the literal "UNKNOWN" (kept as a negative example), or "SKIP" to leave an item in the queue.
When data/real_strips_labeled/ has grown by a meaningful amount — on the
order of twenty or thirty strips per new callsign, or several dozen
"UNKNOWN" negatives — retraining closed_classifier with the expanded
vocabulary is a five-minute operation. The vocabulary is read directly
from training-directory labels, so adding callsigns is a file-system
operation with no code changes.
The RESULTS.md file contains the full experimental log with all numbers
measured during the development of this codebase. Two points are worth
highlighting here because they shaped the architecture.
Synthetic metrics are not a proxy for real-world performance. The sequence model trained on synthetic spectrograms reaches 0.12 per cent character error rate and 100 per cent full-callsign accuracy on a held-out synthetic set. The same model on twenty-five manually labelled real strips decodes nothing (0 out of 25). Increasing training data, adding vertical blur, matching frequency-band width, re-normalising inputs to zero mean and unit variance, and estimating dit duration per capture did not move the real-world accuracy above three per cent. The underlying reason appears to be that each real capture has its own distribution of noise, receiver passband, modulation depth, and dit speed; without a sufficiently diverse labelled real dataset the model memorises per-capture artefacts rather than per-character structure.
Architecture must match the real task. When the real vocabulary is only four distinct callsigns — which is typical for narrow-band amateur beacon activity on a single capture day — a 2 M parameter sequence model is over-specified. Replacing it with a 60 K parameter classifier that outputs a softmax over five classes gives 85 per cent honest held-out accuracy on the same training data. The lesson is not that sequence models are wrong in general but that they do not help when the real label cardinality is small and the data volume per class is modest.
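For a sense of what fits in that budget, here is an illustrative five-class CNN of roughly 60 K parameters. The actual closed_classifier.py architecture is not reproduced here; the layer sizes are assumptions.

```python
import torch
import torch.nn as nn

class TinyClosedClassifier(nn.Module):
    """Illustrative ~60K-parameter CNN: softmax over four callsigns + UNKNOWN."""

    def __init__(self, n_classes: int = 5):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),  # global pool: robust to strip length
        )
        self.head = nn.Linear(64, n_classes)

    def forward(self, x):  # x: (batch, 1, freq_bins, time_frames)
        return self.head(self.features(x).flatten(1))

model = TinyClosedClassifier()
n_params = sum(p.numel() for p in model.parameters())  # roughly 60K
```

At this size the model cannot memorise per-capture artefacts the way a 2 M parameter sequence model can, which is part of why it generalises.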
These two facts motivate the active-learning loop: the small classifier is good enough to be useful immediately for the most common beacons, and anything it cannot classify is queued for human inspection, which grows the training vocabulary so that at some future point a sequence model can be trained again on an expanded real dataset with genuine class diversity.
QRSS captures on 30 metres contain several distinct signal types, which
the triage pipeline classifies with simple heuristics in
segmentor/signal_type.py. The currently-recognised modes are:
| Mode | Visual signature |
|---|---|
| CW | single horizontal line, amplitude-modulated by dots and dashes |
| FSKCW | two close horizontal lines four to six hertz apart, anti-correlated |
| DFCW | two lines further apart; dot lives on one line, dash on the other |
| CHIRP | single carrier whose frequency ramps up, down, or stays flat within each symbol; no inter-element pause |
| WSPR | four parallel lines at about 1.5 Hz spacing, 4-FSK digital |
| FT8 | eight parallel lines at about 6 Hz spacing, 8-FSK digital |
| CARRIER | unmodulated steady tone |
| NOISE | no consistent structure |
CHIRP detection is currently a known work-in-progress: the heuristic for it is drafted but does not yet fire reliably on live captures, and the synthetic generator does not yet produce CHIRP examples.
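The table above suggests how far simple heuristics can go. The toy function below classifies by line count and spacing only; the real segmentor/signal_type.py also looks at modulation, so this is a sketch in the spirit of the table, not the project's logic.

```python
def classify_mode(line_freqs_hz: list[float]) -> str:
    """Toy mode heuristic from the centre frequencies of persistent
    horizontal lines detected in a strip (extraction assumed upstream).

    Uses only line count and mean spacing; amplitude and modulation
    statistics, needed e.g. to tell CW from CARRIER, are ignored here.
    """
    n = len(line_freqs_hz)
    if n == 0:
        return "NOISE"
    if n == 1:
        return "CW"  # or CARRIER if unmodulated; needs amplitude info
    freqs = sorted(line_freqs_hz)
    gaps = [b - a for a, b in zip(freqs, freqs[1:])]
    mean_gap = sum(gaps) / len(gaps)
    if n == 2 and mean_gap <= 6.0:
        return "FSKCW"          # two close, anti-correlated lines
    if n == 2:
        return "DFCW"           # two lines further apart
    if n == 4 and 1.0 <= mean_gap <= 2.0:
        return "WSPR"           # 4-FSK at ~1.5 Hz spacing
    if n == 8 and 5.0 <= mean_gap <= 7.0:
        return "FT8"            # 8-FSK at ~6 Hz spacing
    return "NOISE"
```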
License: not yet assigned. The code is provided as a research artefact.