A research codebase for decoding QRSS beacons — very slow Morse-code transmissions on the 30-metre amateur radio band — by treating spectrograms as images and reading the Morse characters with a convolutional recogniser. The project is structured as a reproducible experiment: synthetic data generation, model training, honest evaluation on held-out real recordings, and an incremental data-collection loop for real-world refinement.
QRSS ("Q-code: decrease your transmit speed") is a mode used by low-power beacons on the 30-metre band near 10.140 MHz. Transmissions are slow enough that individual Morse elements last several seconds: a single dot can be 3 to 12 seconds long, a full callsign 60 to 120 seconds. Because the elements are so long, the signals are inaudible by ear but become readable as horizontal lines on a long-duration spectrogram. Hobbyists capture audio from internet-accessible software-defined receivers (KiwiSDRs), render the spectrograms, and read the callsigns visually.
Automating that visual decoding is the problem this repository addresses. The natural framing is optical character recognition: the spectrogram is a picture, the callsign is the word, the Morse code is the font.
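To make the scale concrete, here is a back-of-envelope sketch of the spectrogram resolution involved. The sample rate is an assumption for illustration; the project's actual STFT parameters live in config.yaml.

```python
# Why QRSS spectrograms need long FFT windows: frequency resolution is
# sample_rate / n_fft, and QRSS lines sit only a few hertz apart, so bins
# must be well under 1 Hz wide. Sample rate here is assumed, not taken
# from config.yaml.
sample_rate = 6000          # Hz (illustrative)
target_resolution = 0.25    # Hz per bin, enough to separate close lines

n_fft = 1
while sample_rate / n_fft > target_resolution:
    n_fft *= 2              # next power of two

window_seconds = n_fft / sample_rate
# n_fft = 32768, so each STFT frame spans about 5.5 seconds of audio,
# comparable to a single QRSS dot: the elements render as clean
# horizontal lines rather than smeared blobs.
```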
The codebase is a complete, test-covered pipeline from raw audio to decoded callsign. It contains three parts:
- A synthetic-data generator and sequence model. The generator renders arbitrary callsigns as spectrograms under a parameterisable noise and distortion model. A CNN + BiLSTM network is trained with CTC loss on the resulting synthetic dataset. On held-out synthetic callsigns, the model decodes with character error rate below one per cent and full-callsign accuracy above 97 per cent at a 15 dB signal-to-noise ratio.
- An honest evaluation of that model on real recordings. Twenty-five manually labelled strips from seven captures of a European KiwiSDR receiver, plus 287 strips auto-labelled by matched-filter template correlation from 67 additional captures. On a capture-wise held-out split the sequence model generalises at zero to three per cent full-callsign accuracy regardless of fine-tuning strategy. This negative result, recorded in full in RESULTS.md, is the central finding of the experiment: the synthesis-to-real distribution gap is not closable by preprocessing or augmentation alone when the real vocabulary is only four classes.
- A pivot to a small closed-vocabulary classifier. Replacing the 2 M parameter sequence model with a 60 K parameter CNN classifying into four known callsigns plus an "unknown" bucket gives 85 per cent honest held-out accuracy on the same 287 strips. This small model is the current best real-world decoder in the project and the first that measurably generalises across receivers and days. It serves as the seed for an active-learning loop that captures new audio, triages candidate signals, and queues anything unfamiliar for manual labelling.
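The CTC decoding mentioned above collapses repeated per-frame predictions and removes blanks. This is a minimal greedy sketch of that collapse rule; the repository's grammar-constrained beam search in callsign_grammar.py is more involved and is not reproduced here.

```python
def ctc_greedy_decode(frame_ids: list[int], alphabet: str, blank: int = 0) -> str:
    """Greedy CTC collapse: merge adjacent repeats, then drop blanks.

    frame_ids are per-frame argmax indices into [blank] + alphabet,
    i.e. index 0 is the CTC blank and index k maps to alphabet[k - 1].
    """
    out = []
    prev = None
    for idx in frame_ids:
        if idx != prev and idx != blank:
            out.append(alphabet[idx - 1])
        prev = idx
    return "".join(out)

# With alphabet "A..Z0..9", the frame sequence G,G,<blank>,0,0,<blank>,P
# collapses to "G0P":
# ctc_greedy_decode([7, 7, 0, 27, 27, 0, 16],
#                   "ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789")  # → "G0P"
```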
The repository is therefore useful as a case study in three things: how to build a reasonable sim-to-real OCR pipeline for radio spectrograms, why that pipeline does not by itself produce real-world generalisation, and how to respond to that negative result by architecturally rightsizing the model and collecting data incrementally.
```
qrss_ocr_v2/
├── config.yaml # all hyperparameters (audio SR, STFT, augmentation, training)
├── requirements.txt # numpy, scipy, torch, Pillow, matplotlib, pyyaml, pytest
├── RESULTS.md # full experimental log with numbers and lessons
│
├── generator/ # synthetic data generation
│ ├── morse_table.py # Morse table and ITU callsign format
│ ├── tone_render.py # audio for one element / character / callsign
│ ├── spectrogram.py # STFT, log-magnitude, resize
│ ├── augmentations.py # noise, drift, fading, jitter, impulse, QRM
│ ├── char_dataset.py # dataset of single symbols for the character classifier
│ ├── line_dataset.py # dataset of full callsigns with disjoint train/val pools
│ └── extract_real_strips.py # extract labelled strips from matched-filter outputs
│
├── model/ # networks and training code
│ ├── char_classifier.py # small CNN for single-character classification
│ ├── sequence_model.py # CNN + BiLSTM + Linear head, trained with CTC
│ ├── closed_classifier.py # 60K-parameter CNN for closed-vocabulary inference
│ ├── callsign_grammar.py # CTC prefix beam search constrained by ITU format
│ ├── input_norm.py # per-image z-score (matches distributions at inference)
│ ├── losses.py # focal loss for optional class-imbalance handling
│ ├── train_char.py # trains char_classifier
│ ├── train_seq.py # trains sequence_model with grammar beam evaluation
│ ├── train_closed.py # trains closed_classifier with capture-wise split
│ ├── finetune_real.py # adapts a pretrained sequence_model on labelled real strips
│ └── finetune_real_split.py # same, with honest capture-wise split
│
├── segmentor/ # signal extraction from wide-band spectrograms
│ ├── energy.py # per-band energy estimation
│ ├── tracer.py # frequency trajectory tracking with drift
│ ├── normalizer.py # straightening, amplitude and speed normalisation
│ └── signal_type.py # heuristic mode classification (CW, FSKCW, WSPR, ...)
│
├── pipeline/ # end-to-end processing
│ ├── cycle_stacker.py # autocorrelation-based cycle detection and averaging
│ ├── full_pipeline.py # QRSSDecoder class assembling segmentor + model + grammar
│ ├── annotate_capture.py # render full-band spectrogram with an overlaid highlight
│ └── triage.py # npz → bands → signal type → classifier → routed outputs
│
├── eval/ # measurement utilities
│ ├── metrics.py # Levenshtein CER, full-callsign accuracy, detection P/R
│ ├── confusion.py # per-character confusion matrix
│ ├── visualize.py # annotated prediction overlays
│ ├── benchmark.py # batch evaluation on a labelled directory
│ ├── honest_eval.py # per-SNR evaluation on synthetic hold-out
│ ├── real_benchmark.py # evaluation on the 25-strip manually labelled benchmark
│ └── decode_raw_npz.py # decode candidate strips extracted directly from raw captures
│
└── scripts/ # operational helpers
    ├── generate_on_vps.sh # bulk synthetic data generation on a remote GPU box
    ├── train_on_vps.sh # training on a remote GPU box
    ├── sync_data.sh # rsync checkpoints and datasets between local and remote
    ├── capture_loop.py # persistent capture+triage loop (deprecated in favour of cron)
    ├── capture_round.py # one cron round: pick hosts by day/night, capture, triage
    └── review_queue.py # CLI to list, manifest, and apply manual labels
```
Unit tests live next to each module as test_*.py. As of 2026-04-15 the suite comprises 58 passing tests.
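The character error rate reported throughout the evaluation is Levenshtein edit distance normalised by reference length. This is a minimal sketch of the metric; the actual implementation in eval/metrics.py is not reproduced here.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance via the standard two-row dynamic-programming recurrence."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

def cer(predicted: str, reference: str) -> float:
    """Character error rate: edit distance over reference length."""
    return levenshtein(predicted, reference) / max(1, len(reference))

# e.g. cer("G0PKD", "G0PKT") == 0.2 (one substitution in five characters)
```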
```
python3 -m venv .venv
.venv/bin/pip install -r requirements.txt
```

The code targets Python 3.12, PyTorch 2, SciPy, Pillow, and PyYAML. A GPU is optional but strongly recommended for training.
```
.venv/bin/python -m pytest -q
```

```
.venv/bin/python -m generator.char_dataset --config config.yaml \
    --output data/char_train --num_samples 10000 --seed 1
.venv/bin/python -m generator.line_dataset --config config.yaml \
    --train_out data/line_train --val_out data/line_val \
    --train_samples 5000 --val_samples 1000 \
    --unique_train 500 --unique_val 100 \
    --pools_json data/callsign_pools.json --seed 0
```

Full-scale training sizes are in config.yaml (100k character samples, 50k line samples, 5000 unique callsigns). The helper scripts scripts/generate_on_vps.sh and scripts/train_on_vps.sh are meant for a remote GPU box; see scripts/sync_data.sh for the rsync wrapper.
```
.venv/bin/python -m model.train_seq \
    --config config.yaml \
    --train_dir data/line_train \
    --val_dir data/line_val \
    --output model/checkpoints/seq.pt \
    --device cuda
```

This requires a directory of labelled strips named {capture_stem}__{callsign}_{freq_hz}_o{offset_s}.png, typically produced by generator/extract_real_strips.py from matched-filter template outputs.
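That naming convention can be parsed back into its fields with a short helper. The regex below is one plausible reading of the pattern and the example filename is hypothetical; this is a sketch, not the project's actual parser.

```python
import re

# Pattern for {capture_stem}__{callsign}_{freq_hz}_o{offset_s}.png.
# The capture stem may itself contain single underscores, so it is
# matched greedily up to the double underscore.
STRIP_RE = re.compile(
    r"^(?P<stem>.+)__(?P<callsign>[A-Z0-9/]+)_(?P<freq_hz>\d+)_o(?P<offset_s>[\d.]+)\.png$"
)

def parse_strip_name(name: str) -> dict:
    """Return the metadata fields encoded in a labelled-strip filename."""
    m = STRIP_RE.match(name)
    if m is None:
        raise ValueError(f"not a labelled strip filename: {name}")
    d = m.groupdict()
    d["freq_hz"] = int(d["freq_hz"])
    d["offset_s"] = float(d["offset_s"])
    return d

# Hypothetical example (stem, callsign, frequency, and offset made up):
fields = parse_strip_name("kiwi30m_20260101_120000__G0ABC_1490_o12.5.png")
```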
```
.venv/bin/python -m model.train_closed \
    --data_dir data/real_strips_extracted \
    --output model/checkpoints/closed.pt \
    --device cuda --epochs 60 --dropout 0.3 --augment
```

Synthetic hold-out per signal-to-noise bucket:
```
.venv/bin/python -m eval.honest_eval \
    --model model/checkpoints/seq.pt \
    --output_dir eval/honest \
    --samples_per_snr 400 \
    --snrs 25 20 15 10 5
```

Benchmark on labelled real strips:

```
.venv/bin/python -m eval.real_benchmark \
    --model model/checkpoints/seq.pt \
    --data_glob "data/real_benchmark/*.png"
```

The closed-vocabulary classifier is deployed inside a capture-and-review loop that accumulates labelled real data over time. Four components work together.
scripts/capture_round.py is a one-shot capture script intended to run from
cron. At each invocation it reads a list of public KiwiSDR receivers
from ~/more_fun_to_receive.json, filters to those that cover 10.14 MHz,
are online, and have a free slot, and selects a small balanced set of day-
side and night-side hosts with distinct Maidenhead grid squares. A short
rotation file (/tmp/qrss_recent_hosts.txt) keeps successive runs from
hammering the same receivers. For each selected host it calls
tools/kiwi_capture.py from the companion project more_fun_to_receive to
download a few minutes of audio around 10.140 MHz.
Example crontab entry:
```
3,27,49 * * * * /path/to/python qrss_ocr_v2/scripts/capture_round.py \
    --minutes 6 --n_day 2 --n_night 2 \
    >> qrss_ocr_v2/logs/cron_capture.log 2>&1
```
The non-uniform minute offsets are deliberate: the gaps between successive firings are 24, 22, and 14 minutes rather than a fixed interval, so captures sample propagation conditions at irregular spacing across the day.
pipeline/triage.py consumes a raw capture file, computes the full-band
spectrogram, finds the strongest narrow-band energy peaks inside the
beacon subband (1000–1700 Hz audio), extracts a short strip around each,
and routes it into one of three buckets:
- decoded/ — the segmentor/signal_type.py heuristic identified the strip as a QRSS mode (CW, FSKCW, DFCW) and the closed classifier predicted one of its known callsigns with confidence above the threshold.
- review_qrss/ — a QRSS-like signal, but the classifier was uncertain or predicted "unknown". Candidate for a new known callsign.
- review_other/ — the signal was classified as a different mode (WSPR, FT8, carrier, noise) and is queued separately.
Each routed strip carries a sidecar JSON with the full metadata and a companion PNG showing the original capture's full 1000–1700 Hz spectrogram with a red rectangle highlighting where in the band the candidate signal appears. The strip filename encodes the most relevant fields, for example:
conf0.47_CW_1490Hz_predG0PKT_0.6s_kiwi30m_20260415_115202.png
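The routing decision can be summarised as a small function. The argument names and the 0.5 threshold below are assumptions for illustration, not the actual triage.py interface.

```python
QRSS_MODES = {"CW", "FSKCW", "DFCW"}

def route(signal_type: str, prediction: str, confidence: float,
          threshold: float = 0.5) -> str:
    """Map a triaged strip to its output bucket (illustrative logic only)."""
    if signal_type not in QRSS_MODES:
        return "review_other"   # WSPR, FT8, carrier, noise, ...
    if prediction != "UNKNOWN" and confidence >= threshold:
        return "decoded"        # known callsign, confident
    return "review_qrss"        # QRSS-like but unfamiliar: labelling candidate
```

Under this sketch the example strip above (CW, confidence 0.47) would land in review_qrss rather than decoded, assuming a 0.5 threshold.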
scripts/review_queue.py is a four-command command-line tool for manual
labelling:
```
python scripts/review_queue.py list      # summary by route / signal type / prediction
python scripts/review_queue.py manifest  # write data/triage_out/labels_todo.csv
# — edit labels_todo.csv, fill in callsign_manual column —
python scripts/review_queue.py apply     # copy labelled strips into data/real_strips_labeled/
python scripts/review_queue.py stats     # per-callsign counts
```

The reviewer may enter a valid ITU callsign, the literal "UNKNOWN" (kept as a negative example), or "SKIP" to leave an item in the queue.
When data/real_strips_labeled/ has grown by a meaningful amount — on the
order of twenty or thirty strips per new callsign, or several dozen
"UNKNOWN" negatives — retraining closed_classifier with the expanded
vocabulary is a five-minute operation. The vocabulary is read directly
from training-directory labels, so adding callsigns is a file-system
operation with no code changes.
The RESULTS.md file contains the full experimental log with all numbers
measured during the development of this codebase. Two points are worth
highlighting here because they shaped the architecture.
Synthetic metrics are not a proxy for real-world performance. The sequence model trained on synthetic spectrograms reaches 0.12 per cent character error rate and 100 per cent full-callsign accuracy on a held-out synthetic set. The same model on twenty-five manually labelled real strips decodes nothing (0 out of 25). Increasing training data, adding vertical blur, matching frequency-band width, re-normalising inputs to zero mean and unit variance, and estimating dit duration per capture did not move the real-world accuracy above three per cent. The underlying reason appears to be that each real capture has its own distribution of noise, receiver passband, modulation depth, and dit speed; without a sufficiently diverse labelled real dataset the model memorises per-capture artefacts rather than per-character structure.
Architecture must match the real task. When the real vocabulary is only four distinct callsigns — which is typical for narrow-band amateur beacon activity on a single capture day — a 2 M parameter sequence model is over-specified. Replacing it with a 60 K parameter classifier that outputs a softmax over five classes gives 85 per cent honest held-out accuracy on the same training data. The lesson is not that sequence models are wrong in general but that they do not help when the real label cardinality is small and the data volume per class is modest.
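For a sense of what fits in that budget, here is an illustrative five-class CNN of roughly 60 K parameters. The actual closed_classifier.py architecture is not reproduced here; the layer sizes are assumptions.

```python
import torch
import torch.nn as nn

class TinyClosedClassifier(nn.Module):
    """Illustrative ~60K-parameter CNN: softmax over four callsigns + UNKNOWN."""

    def __init__(self, n_classes: int = 5):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),  # global pool: robust to strip length
        )
        self.head = nn.Linear(64, n_classes)

    def forward(self, x):  # x: (batch, 1, freq_bins, time_frames)
        return self.head(self.features(x).flatten(1))

model = TinyClosedClassifier()
n_params = sum(p.numel() for p in model.parameters())  # roughly 60K
```

At this size the model cannot memorise per-capture artefacts the way a 2 M parameter sequence model can, which is part of why it generalises.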
These two facts motivate the active-learning loop: the small classifier is good enough to be useful immediately for the most common beacons, and anything it cannot classify is queued for human inspection, which grows the training vocabulary so that at some future point a sequence model can be trained again on an expanded real dataset with genuine class diversity.
QRSS captures on 30 metres contain several distinct signal types, which
the triage pipeline classifies with simple heuristics in
segmentor/signal_type.py. The currently-recognised modes are:
| Mode | Visual signature |
|---|---|
| CW | single horizontal line, amplitude-modulated by dots and dashes |
| FSKCW | two close horizontal lines four to six hertz apart, anti-correlated |
| DFCW | two lines further apart; dot lives on one line, dash on the other |
| CHIRP | single carrier whose frequency ramps up, down, or stays flat within each symbol; no inter-element pause |
| WSPR | four parallel lines at about 1.5 Hz spacing, 4-FSK digital |
| FT8 | eight parallel lines at about 6 Hz spacing, 8-FSK digital |
| CARRIER | unmodulated steady tone |
| NOISE | no consistent structure |
CHIRP detection is currently a known work-in-progress: the heuristic for it is drafted but does not yet fire reliably on live captures, and the synthetic generator does not yet produce CHIRP examples.
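The table above suggests how far simple heuristics can go. The toy function below classifies by line count and spacing only; the real segmentor/signal_type.py also looks at modulation, so this is a sketch in the spirit of the table, not the project's logic.

```python
def classify_mode(line_freqs_hz: list[float]) -> str:
    """Toy mode heuristic from the centre frequencies of persistent
    horizontal lines detected in a strip (extraction assumed upstream).

    Uses only line count and mean spacing; amplitude and modulation
    statistics, needed e.g. to tell CW from CARRIER, are ignored here.
    """
    n = len(line_freqs_hz)
    if n == 0:
        return "NOISE"
    if n == 1:
        return "CW"  # or CARRIER if unmodulated; needs amplitude info
    freqs = sorted(line_freqs_hz)
    gaps = [b - a for a, b in zip(freqs, freqs[1:])]
    mean_gap = sum(gaps) / len(gaps)
    if n == 2 and mean_gap <= 6.0:
        return "FSKCW"          # two close, anti-correlated lines
    if n == 2:
        return "DFCW"           # two lines further apart
    if n == 4 and 1.0 <= mean_gap <= 2.0:
        return "WSPR"           # 4-FSK at ~1.5 Hz spacing
    if n == 8 and 5.0 <= mean_gap <= 7.0:
        return "FT8"            # 8-FSK at ~6 Hz spacing
    return "NOISE"
```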
License: not yet assigned. The code is provided as a research artefact.