WASAPI: AUTOCONVERTPCM (#1097) produces silent input streams on Windows 11 24H2 Communications-class endpoints #1200

@louis030195

Summary

Since cpal v0.17.2, default_input_config() for "Communications-class" USB microphones on Windows 11 24H2 returns 16 kHz mono F32 (the system Communications mix format), and the resulting WASAPI capture stream delivers genuine zero/near-zero samples — i.e. silence at the noise floor — while the same physical microphone records normal speech levels via DirectShow on the same machine at the same moment.

The regression bisects to PR #1097, "wasapi: Enable resampling and rate adjustment" (merged 2026-01-29, released in v0.17.2 on 2026-02-08). My downstream users started reporting "audio looks like it's capturing, but the files are basically silent with a bit of white noise" exactly when their auto-updater pulled them through the cpal-bump release.

A precise measurement, same mic + same speaker + same minute:

| Path | Mean level | Peak level |
| --- | --- | --- |
| ffmpeg -f dshow -i audio="<mic>" | -28.6 dB | -3.3 dB (normal speech) |
| cpal WASAPI (default_input_config + build_input_stream) | -85.5 dB | -42.9 dB (noise floor) |

That's a ~57 dB drop in mean level (about 700× in amplitude) and a ~40 dB drop in peak level. It's not a gain offset: the samples sit at or near zero, and they are not misinterpreted bytes from a format mismatch (see hex dump below).
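As a cross-check on the table, here is a minimal sketch of the dB-to-amplitude conversion applied to the two mean levels. The function name `db_to_amplitude_ratio` is just an illustrative helper, not part of cpal or ffmpeg.

```rust
/// Convert a level difference in decibels to a linear amplitude ratio.
/// Uses the 20*log10 convention for amplitude (not power) quantities.
fn db_to_amplitude_ratio(delta_db: f64) -> f64 {
    10f64.powf(delta_db / 20.0)
}

fn main() {
    // Mean levels taken from the measurement table (dB).
    let dshow_mean_db = -28.6_f64;
    let wasapi_mean_db = -85.5_f64;

    let delta_db = dshow_mean_db - wasapi_mean_db;
    let ratio = db_to_amplitude_ratio(delta_db);

    println!("mean delta: {:.1} dB, amplitude ratio: {:.0}x", delta_db, ratio);
}
```

With the table's values the mean levels differ by 56.9 dB, which corresponds to an amplitude ratio of roughly 700.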

Environment

  • OS: Windows 11 Pro 24H2 (build 10.0.26200, Insider)
  • cpal: v0.18.0 (downstream fork pinned to a commit based on upstream main; behavior is the same as v0.17.2+)
  • Hardware (reproduced on both): USB headset (Jabra Evolve 75) and USB webcam (Logi C270 HD WebCam)
  • Working baseline: Same machine, same mics, ffmpeg via DirectShow → normal speech levels
  • Not affected on the same machine: Built-in Microphone Array (Intel Smart Sound Technology) — exposes 48 kHz stereo via WASAPI and records normally. Only the USB Communications-class endpoints are silent.

Reproduction

  1. Use a USB headset or USB webcam mic that Windows registers as a Communications-class endpoint on Win11 24H2 (verifiable: mmsys.cpl → Recording → properties → the device is set as both Default Device AND Default Communications Device).
  2. Enumerate via cpal:
use cpal::traits::{DeviceTrait, HostTrait};
fn main() {
    let host = cpal::default_host();
    for d in host.input_devices().unwrap() {
        let name = d.name().unwrap_or("?".into());
        println!("=== {} ===", name);
        if let Ok(c) = d.default_input_config() {
            println!("  default: {:?} {} ch @ {} Hz",
                c.sample_format(), c.channels(), c.sample_rate().0);
        }
        if let Ok(configs) = d.supported_input_configs() {
            for c in configs {
                println!("  supported: {:?} {} ch @ {}-{} Hz",
                    c.sample_format(), c.channels(),
                    c.min_sample_rate().0, c.max_sample_rate().0);
            }
        }
    }
}

Output on the affected machine:

=== Microphone (Logi C270 HD WebCam) ===
  default: F32 1 ch @ 16000 Hz
  supported: F32 1 ch @ 16000-16000 Hz
  supported: I32 1 ch @ 16000-16000 Hz
  supported: I16 1 ch @ 16000-16000 Hz
  supported: U8  1 ch @ 16000-16000 Hz

=== Headset (Jabra Evolve 75) ===
  default: F32 1 ch @ 16000 Hz
  supported: F32 1 ch @ 16000-16000 Hz
  ... same as above

Note: cpal exposes only 16 kHz for these devices — which is not a native hardware rate. ffmpeg -f dshow -list_options true for the same devices lists 8000 / 11025 / 22050 / 32000 / 44100 / 48000 / 96000 Hz × 1/2 ch × 8/16-bit. 16 kHz is the Windows Communications-class mix format, and AUTOCONVERTPCM is what makes WASAPI accept that rate via server-side resampling.

  3. Build an input stream with default_input_config() and dump samples. Result: stream callbacks fire at the expected rate, but every sample value is 0, ±1, or extremely-near-zero noise. Decoded as s16le, the first 256 bytes of one capture look like:
00 00 00 00 01 00 00 00 00 00 00 00 00 00 00 00 FF FF 00 00 00 00 01 00
01 00 00 00 00 00 FF FF 00 00 01 00 00 00 00 00 00 00 00 00 00 00 01 00
00 00 00 00 00 00 FF FF 01 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
00 00 00 00 00 00 01 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
[…continues with the same near-zero pattern]

For comparison, the same mic captured via DirectShow in the same second (decoded to s16le):

C2 FE 53 FE 87 FE A0 FE EE FE 4F FF 7E FF D9 FF 20 00 23 00 83 00 1A 01
4E 01 5F 01 8D 01 BF 01 F3 01 1F 02 3C 02 5D 02 8E 02 CF 02 1C 03 73 03
[…normal speech signal continues]

The cpal samples are not misinterpreted bytes from a format mismatch: they are genuinely at or near zero (0 or ±1 LSB). The format negotiation succeeds; the stream just doesn't carry any signal.
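For anyone who wants to check the dumps themselves, here is a small sketch that decodes s16le bytes and reports the peak level. The byte arrays are the leading bytes of the two dumps quoted above; `decode_s16le` and `peak_dbfs` are illustrative helper names, and the 32768 full-scale reference is an assumption about how the levels are normalised.

```rust
/// Decode little-endian signed 16-bit PCM bytes into samples.
/// A trailing odd byte, if any, is ignored.
fn decode_s16le(bytes: &[u8]) -> Vec<i16> {
    bytes
        .chunks_exact(2)
        .map(|pair| i16::from_le_bytes([pair[0], pair[1]]))
        .collect()
}

/// Peak level in dBFS relative to i16 full scale (assumed to be 32768).
/// Returns negative infinity for an empty or all-zero buffer.
fn peak_dbfs(samples: &[i16]) -> f64 {
    let peak = samples.iter().map(|&s| (s as i32).abs()).max().unwrap_or(0);
    20.0 * (peak as f64 / 32768.0).log10()
}

fn main() {
    // Leading bytes of the cpal/WASAPI dump.
    let wasapi: [u8; 18] = [
        0x00, 0x00, 0x00, 0x00, 0x01, 0x00, 0x00, 0x00, 0x00,
        0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0xFF, 0xFF,
    ];
    // Leading bytes of the DirectShow dump.
    let dshow: [u8; 8] = [0xC2, 0xFE, 0x53, 0xFE, 0x87, 0xFE, 0xA0, 0xFE];

    let w = decode_s16le(&wasapi);
    let d = decode_s16le(&dshow);
    println!("wasapi samples: {:?}, peak {:.1} dBFS", w, peak_dbfs(&w));
    println!("dshow samples:  {:?}, peak {:.1} dBFS", d, peak_dbfs(&d));
}
```

The WASAPI bytes decode to values no larger than ±1, while the DirectShow bytes decode to values in the hundreds, which is what you would expect from a real speech waveform.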

Why I believe PR #1097 is the cause

  • PR #1097 ("wasapi: Enable resampling and rate adjustment") enables AUDCLNT_STREAMFLAGS_AUTOCONVERTPCM in WASAPI Initialize so non-native rates can be requested through the server-side resampler.
  • The PR thread for #1097 acknowledges this flag was non-standard prior to Windows 10, with no testing reported on Win11 24H2 Communications-class endpoints.
  • On Win11 24H2 specifically, the WASAPI audio engine appears to apply a privacy/Communications policy when a non-Communications consumer opens a Communications-class endpoint at the Communications mix format (16 kHz F32 mono): Initialize succeeds, the stream "plays," callbacks fire — but the samples delivered are zero.
  • Reverting to v0.15.3 (the last release before AUTOCONVERTPCM) restores normal capture on the exact same hardware. (We confirmed the timeline: our downstream stopped working when users were rolled past the cpal v0.17.2+ release; no other audio-code changes correlate.)
  • The Intel Smart Sound mic array on the same machine is NOT a Communications-class endpoint, exposes 48 kHz stereo via WASAPI (without AUTOCONVERTPCM in play), and records normally.

Suggested fix directions

  1. Gate AUTOCONVERTPCM behind an opt-in flag rather than always-on. The PR's stated goal (issue #593, "build_output_stream fails on Windows 10 if the specified sample rate does not match the output device's default sample rate") was solving a build-time failure when users request non-native rates; AUTOCONVERTPCM is one valid solution, but for callers who request the device's native rate (or who use default_input_config() expecting a usable stream), the flag introduces silent-failure risk on Win11 24H2.
  2. Or: probe for silent streams during stream setup. Run a 100–500 ms check after Start: if RMS stays at or below the noise floor over the first N buffers (the failing streams carry 0/±1 LSB noise rather than exact zeros), retry with AUTOCONVERTPCM off and use the device's actual hardware mix format from GetMixFormat on the endpoint's eMultimedia role (instead of eCommunications).
  3. Or: pick the endpoint role explicitly. An endpoint obtained via IMMDeviceEnumerator::GetDefaultAudioEndpoint(eCapture, eMultimedia) appears to be subject to a different audio session policy than one obtained with eCommunications, even for the same physical device. cpal currently doesn't expose role selection; exposing it (or defaulting to eMultimedia for non-RT use cases) would sidestep the policy gate entirely.
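To make the silent-stream probe idea from the list above concrete, here is a rough sketch of the detection half only. Everything in it is hypothetical: `rms_dbfs` and `looks_silent` are not cpal APIs, the -70 dBFS threshold is a guess that would need tuning, and the retry path (re-initialising without AUTOCONVERTPCM) is left out because it depends on backend internals.

```rust
/// RMS level of f32 samples in dBFS (full scale = 1.0).
/// Returns negative infinity for an empty or all-zero buffer.
fn rms_dbfs(samples: &[f32]) -> f32 {
    if samples.is_empty() {
        return f32::NEG_INFINITY;
    }
    let sum_sq: f64 = samples.iter().map(|&s| (s as f64) * (s as f64)).sum();
    let rms = (sum_sq / samples.len() as f64).sqrt();
    (20.0 * rms.log10()) as f32
}

/// Hypothetical probe: true when at least one buffer was captured and
/// every captured buffer sits below the given threshold.
fn looks_silent(buffers: &[Vec<f32>], threshold_dbfs: f32) -> bool {
    !buffers.is_empty() && buffers.iter().all(|b| rms_dbfs(b) < threshold_dbfs)
}

fn main() {
    // Assumed threshold; a real implementation would need to tune this.
    let threshold_dbfs = -70.0_f32;

    // 10 ms buffers at 16 kHz mono: one near-zero, one at speech-like level.
    let near_zero = vec![vec![1e-5_f32; 160], vec![0.0_f32; 160]];
    let speech_like = vec![vec![0.05_f32; 160]];

    println!("near-zero stream silent? {}", looks_silent(&near_zero, threshold_dbfs));
    println!("speech-like stream silent? {}", looks_silent(&speech_like, threshold_dbfs));
}
```

A threshold check is used instead of an exact-zero comparison because the failing streams carry a small amount of noise. A quiet room could still trip it, so any retry should probably be a one-time fallback rather than a hard error.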

Happy to test patches against the affected hardware here. cc @yeah-its-gloria @roderickvd.

Downstream context

We're screenpipe — Rust + Tauri app that records audio + accessibility text continuously. We started seeing user reports immediately after our auto-updater rolled cpal v0.17.2+ to Windows users. Diagnosis credit to one of our users (William Lucas) who built the DirectShow baseline + WASAPI hex dump to isolate the regression to the cpal capture layer.
