
Audio playback hitches every ~500ms during voice reception #118

@Abraxas3d

Description


Summary

When receiving live voice from a remote OPV station, the playback audio exhibits a perceptible hitch in its getalong approximately twice per second. The hitches do not appear in recorded playback at all; playback in the UI bubble sounds great. Only live audio is affected. The issue is reproducible on every received call from KB5MU. Voice intelligibility is preserved, but the audio is unpleasant to listen to for long periods. Definitely no bueno.

Root Cause Was Unexpected

The current _audio_playback_loop (in enhanced_receiver.py) feeds ALSA from a FIFO queue without any timing awareness. The drain logic is:

chunks = []
try:
    while True:
        audio_packet = self.playback_queue.get_nowait()
        chunks.append(audio_packet['pcm_data'])
        ...
except Empty:
    pass

if chunks:
    pcm_data = b''.join(chunks)
else:
    pcm_data = silence_packet      # <-- silence inserted here
    consecutive_silence += 1

This was added because underruns were producing a popping sound that was pretty terrible. The logic, though, is kind of sketch: it conflates at least three distinct conditions into one "queue is empty" response, without checking which problem is actually occurring. I didn't think that mattered much at the time, but I've since learned more about live audio playout. Here are the conditions that all produce the same outcome:

  1. Real voice continuity — packets arriving roughly every 40 ms, and the queue oscillates between 0 and 1
  2. Dummy frames from TX — the remote transmitter sends an all-zeros frame in slots where no voice is available, and the receiver discards them at _handle_audio_packet line 299, and the playback thread sees the resulting gap as "queue empty"
  3. Late or lost packets — true network jitter or packet loss

I wrote a buffer that handles all three cases the same way, but each should be handled differently. The current code treats them all as "feed a 40 ms silence buffer to ALSA", which causes audible discontinuities. Empirically, the dummy-frame case dominates. This is clearly visible with the --verbose setting in Interlocutor: log analysis shows dummy frames discarded at ~12-packet (480 ms) intervals during active voice, matching the perceived hitch rate.

Why the existing RTP header is the answer, even though @MustBeArt is skeptical

OPV voice packets carry a standard 12-byte RTP header before the Opus payload. The header contains:

  • A 16-bit sequence number, which detects loss and reordering
  • A 32-bit timestamp, which gives the sample-domain time of the first sample in the packet
  • A 32-bit SSRC, the synchronization source identifier

The receiver currently strips these bytes and discards them (line 470: rtp_payload = udp_payload[12:] # Skip RTP header). The values are present in every packet but currently unused.
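
For reference, here is a minimal sketch of what parsing those 12 bytes involves (field layout per RFC 3550 §5.1; the project already has RTPHeader.parse_header() for this, so this sketch is purely illustrative):

import struct

def parse_rtp_fixed_header(data: bytes) -> dict:
    """Parse the 12-byte RTP fixed header (RFC 3550 section 5.1)."""
    if len(data) < 12:
        raise ValueError("too short for an RTP fixed header")
    b0, b1, seq, ts, ssrc = struct.unpack('!BBHII', data[:12])
    return {
        'version': b0 >> 6,              # must be 2
        'padding': bool(b0 & 0x20),
        'extension': bool(b0 & 0x10),
        'csrc_count': b0 & 0x0F,
        'marker': bool(b1 & 0x80),       # talkspurt start on the TX side
        'payload_type': b1 & 0x7F,       # 96 = dynamic (Opus here)
        'sequence': seq,                 # 16-bit, detects loss/reordering
        'timestamp': ts,                 # 32-bit sample clock (48 kHz)
        'ssrc': ssrc,                    # station identifier in OPV
    }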

With RTP timestamps available to the playback thread, the three cases above can be distinguished cleanly:

  • Consecutive voice: timestamps differ by exactly the frame size (1920 samples at 48 kHz)
  • Dummy-induced gap: timestamps differ by 2× or more frame sizes — exact silence duration is known and can be inserted as a single clean buffer
  • Late packet: expected timestamp has not arrived; playback can wait briefly before declaring loss
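
In code, the first two cases reduce to a timestamp-delta check on arriving packets (a sketch; FRAME_SAMPLES is 1920, and the modulo handles 32-bit wraparound). The third case is temporal rather than delta-based: it is detected against a playout deadline, as in Phase 2 below.

FRAME_SAMPLES = 1920  # 40 ms at 48 kHz

def classify_arrival(prev_ts: int, ts: int) -> str:
    """Classify a newly arrived packet against the previous one by RTP timestamp."""
    delta = (ts - prev_ts) % (2 ** 32)   # modulo handles timestamp wraparound
    if delta == FRAME_SAMPLES:
        return 'consecutive'             # normal voice continuity
    if delta > FRAME_SAMPLES and delta % FRAME_SAMPLES == 0:
        return 'gap'                     # dummies dropped upstream; gap length is known
    return 'anomaly'                     # reordering, duplicate, or clock discontinuity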

This is the standard RTP receiver model, according to Top Men (RFC people).

Is TX-side RTP populated correctly in the first place?

Yes! Confirmed by reading radio_protocol.py:RTPHeader (lines 689-783):

  • Sequence numbers: randomly initialized, incremented by 1 per frame
  • Timestamps: initialized from the wall clock, incremented by 1920 samples per frame (matching OPULENT_VOICE_SAMPLES_PER_FRAME)
  • SSRC: derived from a hash of the station ID, with a zero guard
  • Marker bit: set on the first packet (talkspurt start)
  • Payload type: 96 (dynamic, identifies Opus)

Confirmed by reading the TX path: RTPAudioFrameBuilder.create_rtp_audio_frame() calls self.rtp_header.create_header() for every voice frame. The resulting 12-byte header is prepended to the Opus payload before transmission. The same pattern holds for the headers of all the other protocols we use.

The RX-side parser already exists as RTPHeader.parse_header() (lines 750-775). It is currently unused: enhanced_receiver.py strips the 12 header bytes at line 470 and discards them without parsing. This refactor will call the existing parser instead of discarding the header.

OPV's SSRC convention diverges from the standard

Per radio_protocol.py:793-795, OPV's SSRC is derived deterministically from the callsign.

ssrc = hash(str(station_identifier)) % (2**32)
if ssrc == 0:
    ssrc = 1

This is a deliberate divergence from standard RTP, where SSRC is randomly generated per session. OPV's choice makes SSRC a stable per-station identifier, useful for receiver-side identification (sort of), cross-session continuity (definitely), and per-station statistics aggregation (definitely). This is all good stuff. The probability of callsign-hash collisions in 2^32 space is negligible for the amateur radio population. Seriously not a problem.

Implication for the refactor: SSRC change in the RTP stream means a different operator is talking, not just "same operator restarted." We can use SSRC change as a strong signal to reset the playout anchor and per-source state. Jitter estimators should be keyed on SSRC, allowing per-station statistics to accumulate stably across transmissions.
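
A sketch of what keying on SSRC could look like (class and function names here are illustrative, not from the codebase):

from collections import defaultdict

class SourceState:
    """Per-station state, keyed by SSRC so statistics persist across transmissions."""
    def __init__(self):
        self.jitter = 0.0               # RFC 3550 estimate; survives anchor resets
        self.anchor_local_time = None   # playout anchor; reset when the talker changes
        self.anchor_rtp_ts = None

sources = defaultdict(SourceState)      # SSRC -> SourceState
active_ssrc = None

def on_rtp_packet(ssrc: int, rtp_ts: int, now: float) -> SourceState:
    """A new SSRC means a different operator: re-anchor playout, keep their stats."""
    global active_ssrc
    state = sources[ssrc]
    if ssrc != active_ssrc or state.anchor_local_time is None:
        state.anchor_local_time = now
        state.anchor_rtp_ts = rtp_ts
        active_ssrc = ssrc
    return state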

Proposed phased solution

The refactor is staged so each phase is independently testable, and any phase can be deployed without depending on the next. We learned this approach from how well things went over on Arcanus.

Phase 1: Parse and propagate RTP header fields (no behavioral change)

Modify _handle_audio_packet to parse the 12-byte RTP header, extracting:

  • Version, padding, extension, CSRC count (validation)
  • Marker bit
  • Payload type
  • Sequence number
  • Timestamp
  • SSRC

Attach these fields to the packet dict passed to queue_audio_for_playback. The playback thread does not yet act on them. This phase only ensures the values flow through the system and are available for inspection. Then we test!
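
As a sketch, the change in _handle_audio_packet could look like this, assuming RTPHeader.parse_header() returns fields by these names (adjust to the parser's actual return shape):

# In _handle_audio_packet, replacing the bare strip at line 470:
rtp = RTPHeader.parse_header(udp_payload[:12])   # existing, currently unused parser
rtp_payload = udp_payload[12:]                   # Opus payload, as before

packet = {
    'pcm_data': decoded_pcm,                     # existing field
    'rtp_seq': rtp['sequence_number'],           # new: flows to the playback thread
    'rtp_ts': rtp['timestamp'],
    'rtp_ssrc': rtp['ssrc'],
    'rtp_marker': rtp['marker'],
}
self.queue_audio_for_playback(packet)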

Gating test:

  • All existing log lines still emitted, and no behavior change
  • Add a debug log line per packet showing parsed (seq, ts) values
  • Verify across a 30-second voice call that seq increments by 1 between consecutive packets (or by 2+ when a dummy is interposed), that ts increments by 1920 between consecutive packets (or by 2×1920 across dummy gaps), and that audio playback behavior shows no regression: the same hitches as before, no new ones. The M1 Mac is the transmitter that exhibits this pattern, so use that one.

Phase 2: Timestamp-aware playout scheduling

Now, change the playback thread's drain logic from "play whatever is queued" to "play the packet whose RTP timestamp says now is its time." The thread maintains an anchor: the local wall-clock time corresponding to the first packet's RTP timestamp. All subsequent packets are scheduled relative to it:

playout_time(packet) = anchor_local_time + (packet.rtp_ts - anchor_rtp_ts) / sample_rate + target_delay

Replace the silence-on-empty-queue logic with three cases:

  • Next packet's playout time is in the future? Wait (with a queue.get timeout, not a sleep)
  • Next packet's playout time is now? Play it
  • Expected packet has not arrived by its playout time + tolerance? Declare loss and insert one frame of PLC (initially: zeros; later: Opus PLC)

Initial target_delay value: 80 ms (two frame durations). That is sufficient cushion for typical local-network jitter without adding excessive latency, it can be tuned later, and it is less than the current 120 ms buffer.
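
A minimal sketch of the Phase 2 loop, assuming queue items carry the Phase 1 'rtp_ts' field (play() and plc_frame are placeholders for the ALSA write and the zero-filled loss frame):

import time
from queue import Empty, Queue

SAMPLE_RATE = 48000
FRAME_SECONDS = 1920 / SAMPLE_RATE   # 40 ms per frame
target_delay = 0.080                 # Phase 2: fixed 80 ms cushion
late_tolerance = 0.020               # grace period before declaring loss

def playout_time(rtp_ts, anchor_local, anchor_rtp_ts):
    """Map an RTP timestamp to a local wall-clock playout deadline."""
    return anchor_local + ((rtp_ts - anchor_rtp_ts) % 2 ** 32) / SAMPLE_RATE + target_delay

def playback_loop(queue: Queue, play, plc_frame: bytes):
    anchor_local = anchor_rtp_ts = None
    while True:
        try:
            # Wait on the queue with a timeout instead of sleeping blind
            pkt = queue.get(timeout=FRAME_SECONDS + late_tolerance)
        except Empty:
            if anchor_local is not None:
                play(plc_frame)              # expected packet never arrived: one PLC frame
            continue
        if anchor_local is None:             # first packet anchors the stream
            anchor_local = time.monotonic()
            anchor_rtp_ts = pkt['rtp_ts']
        wait = playout_time(pkt['rtp_ts'], anchor_local, anchor_rtp_ts) - time.monotonic()
        if wait > 0:
            time.sleep(wait)                 # packet is early: hold until its deadline
        play(pkt['pcm_data'])                # on time, or late within tolerance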

Gating tests (all must pass):

  1. Continuous voice test: 30-second call with no dummy frames (voice activity throughout). Must hear no hitches. Audio quality subjectively equivalent to a recorded playback of the same content.
  2. Dummy interspersed test: 30-second call with KB5MU's typical dummy pattern (~12 voice frames between dummies). Must hear no hitches. Brief silences during dummy intervals must be smooth and inaudible as discontinuities. Use the M1 computer for this.
  3. Latency measurement: End-to-end latency from microphone-input on TX side to speaker-output on RX side must increase by no more than target_delay (80 ms) compared to current implementation.
  4. Stream end recovery: After a transmission ends and a new one begins, the second transmission must play correctly without the playback thread getting stuck waiting for old timestamps.
  5. No regressions in existing log output: all existing debug lines still fire correctly. Web UI notifications still work.

Phase 3: Adaptive playout delay (RFC 3550 §6.4.1)

Implement the standard RTP jitter estimator:

J(i) = J(i-1) + (|D(i-1, i)| - J(i-1)) / 16

Now we are using RTP in anger! D(i-1, i) is the difference between expected and actual inter-arrival time, computed from RTP timestamps and the local clock. The recursion produces an exponentially weighted moving average (EMA) of observed jitter. Just like in the modem.

Use the jitter estimate to adjust target_delay adaptively:

  • Stable link: target_delay shrinks toward minimum (e.g., 40 ms = one frame)
  • Jittery link: target_delay grows up to a configured maximum (e.g., 200 ms = five frames)

Update target_delay no more than once per second to avoid oscillation.
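
A sketch of the estimator plus the rate-limited delay update (the 3×J safety margin is a common heuristic, not from the codebase; the other constants come from the text above):

import time

SAMPLE_RATE = 48000
MIN_DELAY = 0.040   # one frame
MAX_DELAY = 0.200   # five frames

class AdaptivePlayoutDelay:
    def __init__(self):
        self.jitter = 0.0           # J, in seconds
        self.prev_transit = None
        self.target_delay = 0.080
        self._last_update = 0.0

    def on_packet(self, rtp_ts: int, arrival: float) -> None:
        """Update J per RFC 3550 A.8, then (at most once per second) target_delay."""
        transit = arrival - rtp_ts / SAMPLE_RATE      # relative transit time, seconds
        if self.prev_transit is not None:
            d = abs(transit - self.prev_transit)      # |D(i-1, i)|
            self.jitter += (d - self.jitter) / 16.0   # J(i) = J(i-1) + (|D| - J(i-1)) / 16
        self.prev_transit = transit
        now = time.monotonic()
        if now - self._last_update >= 1.0:            # rate-limit to avoid oscillation
            proposed = MIN_DELAY + 3.0 * self.jitter  # heuristic safety margin
            self.target_delay = min(MAX_DELAY, max(MIN_DELAY, proposed))
            self._last_update = now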

Gating tests:

  1. Stable link test: With a low-jitter local link, target_delay should converge to a low value (≤2 frames) within 5 seconds of sustained voice. Latency in the steady state should be lower than Phase 2's fixed 80 ms.
  2. Jittery link test: Inject artificial jitter (e.g., delay packets randomly between 0-100 ms before processing). target_delay must grow to absorb the jitter. No additional hitches compared to Phase 2.
  3. Step-change test: Start with stable link, then introduce sustained 100 ms jitter. target_delay must adapt within 5 seconds. Hitches during the adaptation window are acceptable; hitches after adaptation has stabilized are not. Like the cases in our lunar lander CTF.
  4. No regressions in the Phase 2 gating tests: all of them still pass! You get to pass, and you get to pass, and you get to pass.

What this issue does not fix

  • The TX-side dummy frame rate. Transmitters sending dummy frames every ~480 ms even during active voice can't be fixed by RTP in the receiver; that is a separate question. Whatever the answer, the receiver-side refactor described here will handle dummy frames cleanly regardless of their rate.
  • Codec-internal PLC (Opus FEC). The Opus codec has built-in packet loss concealment via decode_fec=1. Integration with this is a useful future enhancement but not required for the basic refactor.
  • Time-scale modification (audio acceleration / stretching). WebRTC's NetEq dynamically adjusts playback rate to maintain target latency. This is genuinely advanced stuff and not needed for the current problem.
  • Multi-source mixing. Currently Interlocutor handles one remote talker at a time. Multi-source playout (multiple SSRCs, mixing) might be something we want to look at, but I can't think of a compelling use case for it.

References

Standards (we've got them!)

RFC 3550 RTP: A Transport Protocol for Real-Time Applications. H. Schulzrinne, S. Casner, R. Frederick, V. Jacobson. July 2003.

  • Section 5.1: RTP Fixed Header Fields
  • Section 6.4.1: SR: Sender Report RTCP Packet (defines the inter-arrival jitter computation we use in Phase 3)
  • Appendix A.8: Estimating the Interarrival Jitter (canonical algorithm reference)
  • https://datatracker.ietf.org/doc/html/rfc3550

Book

RTP: Audio and Video for the Internet, by Colin Perkins.

Reference implementation

Speex JitterBuffer — Jean-Marc Valin (also the author of Opus).

Modern stuff just like this

WebRTC NetEq: Google's adaptive jitter buffer for WebRTC voice/video.

  • Implements time-scale modification (PLC + acceleration) to maintain target latency.
  • Out of scope for this issue, but if we ever want to implement PLC and related techniques, this is the thing to look at

Acceptance criteria for closing this issue

This issue is closed when everything gates OK:

  • No hitches in a long conversation, even on hardware that throws dummy frames
  • End-to-end latency documented, so no surprises
  • Someone reviews the implementation to make sure we didn't goof it up
  • Documentation up to date

The phasing exists so we can ship working improvements incrementally rather than gating everything on the full adaptive design, and nothing is left in a broken, half-assed state.

Labels: bug, documentation, enhancement
Issue actions