
Glitching in WASAPI in shared mode #303

Closed
sveinse opened this issue Sep 26, 2020 · 92 comments · Fixed by #307
Assignees
Labels
P2 Priority: High · src-wasapi MS WASAPI Host API (/src/hostapi/wasapi) · windows Affects MS Windows

Comments

@sveinse
Contributor

sveinse commented Sep 26, 2020

I'm experiencing significant glitching when using WASAPI from master in shared mode. If using exclusive mode or a full-duplex audio stream, the audio doesn't glitch.

I'm using PA from Python via the interface library pyaudio. I can't rule out performance issues in the Python environment, but all the other host APIs work smoothly without glitching.

The glitching typically manifests as bursts of 4096 samples played back successfully with pauses of 220-230 samples in between. The pause interval varies. PA is called with a framesPerBuffer of 1024. Increasing this number reduces the occurrence of glitching, but it doesn't remove the glitching altogether, even with very large values such as 65536.

For reference my environment is:

  • Windows 10, 2004 version

  • Compiled with MS VS 2019 Community Edition

  • Applied patch for building (for MSVC 2019 and to remove ASIO): https://gist.github.com/sveinse/7a3442d6f8444b95c4084a7172ec5fdb

  • Compile command (from git bash):

    '/c/Program Files (x86)/Microsoft Visual Studio/2019/Community/MSBuild/Current/Bin/MSBuild.exe' \ 
        build\\msvc\\portaudio.sln \
        //property:GenerateFullPaths=true \
        //t:build \
        //consoleloggerparameters:NoSummary \
        //p:Configuration=Debug \
        //p:Platform=win32
    
@sveinse
Contributor Author

sveinse commented Sep 26, 2020

Some observations:

  • PortAudio from master glitches on WASAPI playback
  • PortAudio from pa_stable_v190600_20161030 does not glitch
  • WASAPI used in exclusive mode does not glitch
  • WASAPI used in full duplex mode does not glitch
  • WASAPI used in recording (in streams) does not glitch

I ran a git bisect to determine which commit caused the glitching behavior in WASAPI. The breaking commit, committed on Sep 6 2018, is:

c3fee03

edit Added recording observation

@sveinse
Contributor Author

sveinse commented Sep 26, 2020

I think I have found evidence that this is related to the sleep timing setup in ProcThreadPoll() in pa_win_wasapi.c, and that the sleep amounts have doubled in master, causing these glitches.

pa_win_wasapi.c implements GetFramesSleepTime() to calculate thread sleep time. This function is used in the main worker thread to calculate the sleep time between each iteration in poll mode.

https://github.com/PortAudio/portaudio/blob/master/src/hostapi/wasapi/pa_win_wasapi.c#L5764-L5802

The change that causes WASAPI to start glitching is that the sleep times returned from GetFramesSleepTime() have been doubled. This works without glitching:

return (UINT32)(nDuration/REFTIMES_PER_MILLISEC/2);

While this does not work and produces glitching:

return (UINT32)(nDuration / REFTIMES_PER_MILLISEC);
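To make the doubling concrete, here is a small self-contained sketch of the two variants of the arithmetic (my own illustration using the REFTIMES constants from the WASAPI docs, not the actual GetFramesSleepTime() body):

```c
#include <stdint.h>

/* 100-ns units, as defined in the WASAPI documentation */
#define REFTIMES_PER_SEC      10000000LL
#define REFTIMES_PER_MILLISEC 10000LL

/* Stable-branch behavior: sleep half the buffer duration. */
static uint32_t sleep_ms_stable(uint32_t frames, uint32_t sampleRate)
{
    long long nDuration = REFTIMES_PER_SEC * frames / sampleRate;
    return (uint32_t)(nDuration / REFTIMES_PER_MILLISEC / 2);
}

/* Master-branch behavior: sleep the full buffer duration. */
static uint32_t sleep_ms_master(uint32_t frames, uint32_t sampleRate)
{
    long long nDuration = REFTIMES_PER_SEC * frames / sampleRate;
    return (uint32_t)(nDuration / REFTIMES_PER_MILLISEC);
}
```

For 1024 frames at 48 kHz this gives 10 ms (stable) versus 21 ms (master), which matches the observed sleep values.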

My test setup runs at 48 kHz, with framesPerBuffer at 1024 (21.3 ms). (WASAPI prints framesPerUser[ 1024 ] framesPerHost[ 2048 ] latency[ 42.67ms ].) This is a summary of the observed sleep values (sleep_ms) used in ProcThreadPoll() under different usage conditions:

What                     Master         Stable
Playback                 21ms (glitch)  10ms (OK)
Recording                2ms (OK)       2ms (OK)
Full-duplex              2ms (OK)       2ms (OK)
Exclusive-mode playback  21ms (OK)      10ms (OK)

Observations:

  • Recording is clipped to a 2 ms sleep by design, which is why no glitching is observed on record
  • Full-duplex selects the lowest of the two sleep values, hence, because of the previous item, no glitching is observed in full-duplex mode
  • Exclusive mode seems more resilient against glitching, as its sleep time is the same as in the shared mode that glitches
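A minimal sketch of how I read the sleep selection from these observations (the helper names are mine, for illustration only; the real logic lives in ProcThreadPoll() in pa_win_wasapi.c):

```c
#include <stdint.h>

/* Hypothetical helpers illustrating the observed behavior. */

/* The capture-side sleep appears clipped to a small constant
   (2 ms observed; the cap value is an assumption). */
static uint32_t capture_sleep_ms(uint32_t computed_ms)
{
    const uint32_t cap_ms = 2;
    return (computed_ms < cap_ms) ? computed_ms : cap_ms;
}

/* Full-duplex uses the smaller of the input and output sleeps,
   so the capture clip also shields playback from long sleeps. */
static uint32_t duplex_sleep_ms(uint32_t sleep_ms_in, uint32_t sleep_ms_out)
{
    return (sleep_ms_in < sleep_ms_out) ? sleep_ms_in : sleep_ms_out;
}
```

This would explain why recording and full-duplex end up at 2 ms while playback-only sleeps 21 ms.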

@sveinse
Contributor Author

sveinse commented Sep 26, 2020

I have implemented this workaround in my own portaudio build which I use from my Python application and it works fine:

sveinse@4a88ee7

In essence it reverts the thread sleep delay to the same values as in the stable version of PA.

@dmitrykos
Collaborator

dmitrykos commented Sep 26, 2020

@sveinse , thank you for raising this issue. In my setup I do not experience any glitching, but maybe it happens when latency is minimal for shared mode, e.g. as in your case, 21 ms (glitch), or there could be specific behavior in the underlying audio drivers.

Your workaround for ProcThreadPoll looks fine to me, as a similar approach is already used for the blocking API (ReadStream, WriteStream). Could you submit a pull request to incorporate this change?

@dmitrykos dmitrykos self-assigned this Sep 26, 2020
@dmitrykos dmitrykos added windows P2 Priority: High labels Sep 26, 2020
sveinse added a commit to sveinse/portaudio that referenced this issue Sep 27, 2020
sveinse added a commit to sveinse/portaudio that referenced this issue Sep 27, 2020
@sveinse
Contributor Author

sveinse commented Sep 27, 2020

@dmitrykos I created a PR #305 with this fix.

I'm not sure what makes one system (like yours) not glitch while others glitch a lot. With the old solution it does not matter what the framesPerBuffer value is; it always glitches, not only at low values.

Yet I'm not completely convinced that this fix is complete. I ran a series of tests after the PR and I still experience glitching in some ranges of framesPerBuffer:

framesPerUser  framesPerHost  latency  glitches?
4096           8192           171ms    No
2048           4096           85ms     No
1024           2048           43ms     No
512            1056           22ms     Yes
256            1056           22ms     Yes
128            1056           22ms     No (usec sleep)
64             1056           22ms     No (usec sleep)
6              1056           22ms     No (usec sleep)

On my system any framesPerBuffer value between 129 and 799 glitches, regardless of which audio device is used. It starts working seamlessly again at 128 and below, and notice that playback stays seamless the whole way down to 6 samples per iteration (and this includes jumps into Python runtime code).

@dmitrykos
Collaborator

@sveinse I merged the fix as a quick solution for now. Although it fixes the problem you observed, we still need to find the reason for the underruns/glitches.

The line:

sleep_ms_out = GetFramesSleepTime(stream->out.framesPerBuffer, stream->out.wavex.Format.nSamplesPerSec);

will cause sleeps for the duration of the user buffer, i.e. we will always sleep less than the duration of the host (WASAPI-side) buffer, whose size is reflected in the framesPerHostCallback variable. According to the official docs (https://docs.microsoft.com/en-us/windows/win32/coreaudio/rendering-a-stream), the code there waits half of the host buffer duration:

// Sleep for half the buffer duration.
Sleep((DWORD)(hnsActualDuration/REFTIMES_PER_MILLISEC/2));

so our implementation (before this fix) should have been stable in terms of absence of underruns.

In your failed tests with this fix we will sleep for:

  • 512: 5 ms = 512 / 48000 / 2
  • 256: 2 ms = 256 / 48000 / 2

and that does not provide any clue as to why glitches still exist, because the host buffer duration is 22 ms, so we have plenty of time. Could it be due to jitter of the system timer/Sleep?
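The margin can be sanity-checked with a small piece of arithmetic (an illustration only, assuming 48 kHz and the same integer-millisecond truncation as GetFramesSleepTime()):

```c
#include <stdint.h>

/* Half the user-buffer duration, truncated to whole milliseconds,
   mirroring the fixed sleep calculation (100-ns REFTIMES units). */
static uint32_t half_user_buffer_ms(uint32_t frames, uint32_t sampleRate)
{
    return (uint32_t)(10000000ULL * frames / sampleRate / 10000 / 2);
}

/* Full host-buffer duration in whole milliseconds. */
static uint32_t host_buffer_ms(uint32_t frames, uint32_t sampleRate)
{
    return (uint32_t)(1000ULL * frames / sampleRate);
}
```

For the failing cases the slack is indeed large: a 5 ms (512 frames) or 2 ms (256 frames) sleep against a 22 ms (1056-frame) host buffer.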

@dmitrykos
Collaborator

@sveinse, would you please run an experiment and replace line 3859 with:

bufferMode = paUtilBoundedHostBufferSize;

and test framesPerUser at 512 and 256.

Let's check that we don't have any problem inside the rendering loop, specifically this area starting at line 5934:

if (stream->bufferMode == paUtilFixedHostBufferSize)
{
    while (frames >= stream->out.framesPerBuffer)
    {
        if ((hr = ProcessOutputBuffer(stream, processor, stream->out.framesPerBuffer)) != S_OK)
        {
            LogHostError(hr);
            goto thread_error;
        }

        frames -= stream->out.framesPerBuffer;
    }
}

Here I got worried that if the condition while (frames >= stream->out.framesPerBuffer) is false right after _PollGetOutputFramesAvailable, then we can run into an underrun due to sleeping again.

@sveinse
Contributor Author

sveinse commented Sep 27, 2020

@dmitrykos setting bufferMode improves stability a lot, but it still glitches. It goes from a constant stream of glitches to more randomly occurring instances, up to 10 seconds apart. It also seems instability (glitching) occurs at the lower framesPerBuffer values with this new buffer mode.

BTW I'm monitoring and printing whether any over- or underrun flags are reported to the user callback function, and they are not flagged in any of the glitches. So either WASAPI is oblivious to the underruns or it's missing an implementation of reporting this to the user callback function. I haven't checked which it is.

@dmitrykos
Collaborator

@sveinse, thank you for a test!

If stability is improved but there are glitches left, then the paUtilFixedHostBufferSize rendering loop seems to be OK, but it causes more instability due to requesting the host buffer in a fixed number of chunks (stream->out.framesPerBuffer).

What if you run another experiment and hardcode sleep_ms_out to equal 11, i.e. exactly 50% of the host buffer: 1056 / 48000 / 2 = 11 ms. Earlier you reported that your test with framesPerUser == 1024 was without glitching, and sleep_ms_out in that case was 10. So could you check both 10 and 11 ms, with bufferMode reverted to paUtilFixedHostBufferSize?

@sveinse
Contributor Author

sveinse commented Sep 27, 2020

@dmitrykos

Given the apparent host native size of 1056, I tried to hardcode sleep_ms as suggested. If I use a framesPerBuffer of 528, I find no values of sleep_ms that result in stable playback (tried 11, 10, 8, 6, 5, 4). Same thing with 264: no stable playback. At 132 it is glitch-free at value 1, but not at 2. At 66 something strange happens: I get stable playback with any sleep value up to approx 10!

In my post above, the table seems to hint that usec sleep is what resolves the glitching when framesPerBuffer gets low. So I tested that, and usec sleep doesn't fix it: it still glitches at some values when usec sleep is selected. So that can be crossed off the list.

edit

FYI: It seems the host frame size of 1056 is a constant on my system. All devices, from HDMI to built-in sound cards to external USB headsets, report the same host frame size. It doesn't change when the sample rate config is changed in the control panel either.

@sveinse
Contributor Author

sveinse commented Sep 27, 2020

@dmitrykos

As for the while (frames >= stream->out.framesPerBuffer) I did a crude test:

if (stream->bufferMode == paUtilFixedHostBufferSize)
{
	UINT32 count = 0;
	while (frames >= stream->out.framesPerBuffer)
	{
		count++;
		if ((hr = ProcessOutputBuffer(stream, processor, stream->out.framesPerBuffer)) != S_OK)
		{
			LogHostError(hr);
			goto thread_error;
		}

		frames -= stream->out.framesPerBuffer;
	}
	printf("%d,", count);
}

(Don't flog me for the printf() inside real-time-ish code, but it works well enough to highlight the results :D)

WASAPI::OpenStream(output): framesPerUser[ 528 ] framesPerHost[ 1056 ] latency[ 22.00ms ] exclusive[ NO ] wow64_fix[ NO ] mode[ POLL ]
2,0,2,0,2,0,2,0,2,1,1,2,0,2,0,2,0,2,1,1,2,0,2,0,2,0,2,1,1,2,0,2,0,2,1,1,2,0,2,0,2,0,2,1,1,2,0,2,1,1,2,0,2,1,1,2,0,2,0,2,1,1,2,0,2,1,1,2,0,2,0,2,1,1,2,0,2,0,2,

So it definitely hits the while loop without enough data available, resulting in no calls to ProcessOutputBuffer(). In another run I looked at frames from _PollGetOutputFramesAvailable, and I get highly variable numbers:

1056 960 912 1056 480 1056 480 1056 960 912 1056 480 1056 480 1056 960 912 1056 480 1056 480 1056 960 912 1056 480 1056 960 912 1056 480 1056 960 912 1056 480 1056 480 1056 480 1056 960 912 1056 480 1056 960 912 1056 480 1056 480 1056 960 912 1056 480 1056 960 912 1056 480 1056 960 912 1056 480 1056 960 912 1056 480 1056 480 1056 960 912

edit

I made an additional test to see if getting no iterations in the while loop was the cause of the glitches. It's not the sole cause: the glitching occurs even when all iterations pass through the while loop. E.g. setting framesPerBuffer to 132 (1056/8) causes the while loop to be entered on every iteration, but it still glitches.

@kmatheussen

BTW I'm monitoring and printing if any over- or underrun flags are reported to the user callback function, and they are not flagged in any of the glitches. So either WASAPI is oblivious to the underruns or its missing an implementation of reporting this to the user callback function. I haven't checked what it does.

Thank you both for your work. Non-reported glitches are my biggest concern about using the current WASAPI driver in portaudio.

@kmatheussen

I've looked a little at the code now, and I'm probably wrong, or maybe not, but as far as I can see the audio processing itself (called in the function WaspiHostProcessingLoop) happens in the same thread as the one that sleeps (the function ProcThreadPoll). Doesn't this mean that when you spend half the time sleeping, you also can't use more than 50% of the CPU to process audio?

@dmitrykos
Collaborator

dmitrykos commented Sep 28, 2020

@sveinse thank you for your experiment. To my view we can have underruns happening here:

2,0,2,0

i.e. after having no frames to process, on the next cycle we get 2 buffers to fill (1056 / 528 = 2), which equal a full host buffer, and that is a problem. I would rather expect the sequence 1,0,1,0 or 1,0,1,1, which would mean that we usually get 1/2 of the host buffer. Here I assumed that in your test stream->out.framesPerBuffer equals 528.

I did some more experiments and tried busy waiting once we get 0, like this (starts at line 5981):

UINT32 next_sleep = sleep_ms;
UINT32 repeat = 0;

// Processing Loop
while (WaitForSingleObject(stream->hCloseRequest, next_sleep) == WAIT_TIMEOUT)
{
    for (i = 0; i < S_COUNT; ++i)
    {
        // Process S_INPUT/S_OUTPUT
        switch (i)
        {
        // Input stream
        case S_INPUT: {

            if (stream->captureClient == NULL)
                break;

            if ((hr = ProcessInputBuffer(stream, processor)) != S_OK)
            {
                LogHostError(hr);
                goto thread_error;
            }

            break; }

        // Output stream
        case S_OUTPUT: {

            UINT32 frames;
            if (stream->renderClient == NULL)
                break;

            // Get available frames
            if ((hr = _PollGetOutputFramesAvailable(stream, &frames)) != S_OK)
            {
                LogHostError(hr);
                goto thread_error;
            }

            // Output data to the user callback
            if (stream->bufferMode == paUtilFixedHostBufferSize)
            {
                UINT32 count = 0;
                while (frames >= stream->out.framesPerBuffer)
                {
                    count++;
                    if ((hr = ProcessOutputBuffer(stream, processor, stream->out.framesPerBuffer)) != S_OK)
                    {
                        LogHostError(hr);
                        goto thread_error;
                    }

                    frames -= stream->out.framesPerBuffer;
                }

                if (count == 0)
                {
                    repeat++;
                    next_sleep = 0;
                    continue;
                }
                else
                {
                    //printf("%d,", repeat);
                    repeat = 0;
                }
            }
            else
            if (frames != 0)
            {
                if ((hr = ProcessOutputBuffer(stream, processor, frames)) != S_OK)
                {
                    LogHostError(hr);
                    goto thread_error;
                }
            }

            break; }
        }
    }

    // Get next sleep time
    if (sleep_ms == 0)
        next_sleep = ThreadIdleScheduler_NextSleep(&scheduler);
    else
        next_sleep = sleep_ms;
}

Explanation:

Once we get 0 we should not sleep again (maybe it was a corner case and the buffer is already ready), as sleeping can result in an underrun; instead we do a short busy wait by temporarily clearing next_sleep to 0. I recorded how many repeated cycles the busy loop does, and it is 1-2 cycles, which is pretty OK for such a configuration:

framesPerUser[ 300 ] framesPerHost[ 1056 ] latency[ 22.00ms ] exclusive[ NO ] wow64_fix[ NO ] mode[ POLL ]

// 1,2,2,2,2,2,2,2,1,2,1,2,1,1,2

Could you please replace your implementation with this piece of code and check if it improves anything on your machine?

@dmitrykos
Collaborator

dmitrykos commented Sep 28, 2020

@kmatheussen, your assumption is correct, but we need to get some space in WASAPI's host buffer to write into, and thus the code must sleep/wait for some piece of the host buffer to become available for writing.

To solve the problem you are mentioning, there would need to be a second audio rendering thread doing the audio processing on the app's side; it would write to its own temporary buffer(s), and once WASAPI's host buffer is available, the PA callback function would simply copy the ready audio samples prepared by that second audio processing thread.

With such an approach you can utilize the CPU effectively, but it is outside PA's scope and rather a design decision of the application using PA as its audio API.

@sveinse
Contributor Author

sveinse commented Sep 28, 2020

@kmatheussen, @dmitrykos, with respect to the amount of wall time available for user processing/callbacks: how is this implemented for other host APIs? Do they also sleep equal amounts? I guess, from a principled point of view, this is the price of polling: one has to wait, check, wait and so on, and you lose granularity. The other alternative is to wake the thread when data is available -- which is what the event-based variant is all about, right?

Could the sleep be dynamic? In principle the outer loop is aware of when data should be available (being more or less monotonic with respect to wall time). Could the sleep after the user callback be dynamic, in the sense that it sleeps the remainder of the time slot rather than a fixed pre-calculated amount?

@dmitrykos
Collaborator

Could the sleep be dynamic?

I think so. We could calculate how much time ProcessOutputBuffer takes and subtract it from next_sleep.

@dmitrykos
Collaborator

dmitrykos commented Sep 28, 2020

@sveinse I quickly made the change you proposed, which subtracts the time spent in processing from next_sleep. Could you please check if anything changes in your setup with this piece of code?

INT32 next_sleep = sleep_ms;
UINT32 repeat = 0;

// Processing Loop
while (WaitForSingleObject(stream->hCloseRequest, next_sleep) == WAIT_TIMEOUT)
{
    double start_time = PaUtil_GetTime();

    for (i = 0; i < S_COUNT; ++i)
    {
        // Process S_INPUT/S_OUTPUT
        switch (i)
        {
        // Input stream
        case S_INPUT: {

            if (stream->captureClient == NULL)
                break;

            if ((hr = ProcessInputBuffer(stream, processor)) != S_OK)
            {
                LogHostError(hr);
                goto thread_error;
            }

            break; }

        // Output stream
        case S_OUTPUT: {

            UINT32 frames;
            if (stream->renderClient == NULL)
                break;

            // Get available frames
            if ((hr = _PollGetOutputFramesAvailable(stream, &frames)) != S_OK)
            {
                LogHostError(hr);
                goto thread_error;
            }

            // Output data to the user callback
            if (stream->bufferMode == paUtilFixedHostBufferSize)
            {
                UINT32 count = 0;
                while (frames >= stream->out.framesPerBuffer)
                {
                    count++;
                    if ((hr = ProcessOutputBuffer(stream, processor, stream->out.framesPerBuffer)) != S_OK)
                    {
                        LogHostError(hr);
                        goto thread_error;
                    }

                    frames -= stream->out.framesPerBuffer;
                }

                if (count == 0)
                {
                    repeat++;
                    next_sleep = 0;
                    continue;
                }
                else
                {
                    //printf("%d,", repeat);
                    repeat = 0;
                }
            }
            else
            if (frames != 0)
            {
                if ((hr = ProcessOutputBuffer(stream, processor, frames)) != S_OK)
                {
                    LogHostError(hr);
                    goto thread_error;
                }
            }

            break; }
        }
    }

    // Get next sleep time
    if (sleep_ms == 0)
        next_sleep = ThreadIdleScheduler_NextSleep(&scheduler);
    else
        next_sleep = sleep_ms;

    // Update next sleep time dynamically depending on how much time we spent in processing
    next_sleep -= (INT32)((PaUtil_GetTime() - start_time) * 1000);
    if (next_sleep < 0)
        next_sleep = 0;
}

@sveinse
Contributor Author

sveinse commented Sep 28, 2020

@dmitrykos I quickly tested them both (with and without dynamic sleep), and unfortunately neither resolves the glitching issues.

I'm curious whether it's my HW environment that makes this behave differently from yours, or whether it's related to the type of callback system I use (i.e. Python). It would be really great if you could download the standalone executable https://github.com/sveinse/elns-release/releases/download/v1.1-b3/elns-1.1-b3-standalone.exe and test whether WASAPI glitches on your system. Select "WASAPI" as the audio system, find a suitable wav or flac file as input, select an appropriate output audio device, select the "Passthrough" function and then press play. (I hope you're not put off by my executable; it's an application I author and maintain, a Python application including bundled portaudio binaries.)

@dmitrykos
Collaborator

dmitrykos commented Sep 28, 2020

@sveinse I tested your app and can hear glitches at all latencies except 128. I used a 1 kHz sine in WAV, as it makes glitches very easy to hear.

Nevertheless, the work done above is not useless, and I intend to include it in the actual implementation:

  1. Do short busy waiting if no frames are available yet when we expected them;
  2. Sleep dynamically depending on the time spent in processing.

In my tests I noticed that with the fix you proposed (sleep_ms /= 2) the busy waiting (1.) happens more often (expected), because we sleep for half the duration of stream->out.framesPerBuffer, so to my view that fix should be removed in favor of changes 1. and 2.

@sveinse
Contributor Author

sveinse commented Sep 28, 2020

@dmitrykos I made a brute-force test by setting while (WaitForSingleObject(stream->hCloseRequest, 0) == WAIT_TIMEOUT), which essentially makes the whole thing one big busy loop. No surprise really: it works with all framesPerBuffer sizes. Interestingly it is capable of handling framesPerBuffer as low as 2 samples (@ 48 kHz), each of which jumps into Python runtime execution, does some numpy operations and returns. It does spend considerable CPU, but it reveals the true computing power of recent CPUs.

This tells me the problem seems to be purely timing related. WaitForSingleObject is possibly either waiting too long or jittering too much. Is the granularity of ms resolution too coarse, e.g. when the frame size dictates 5.5 ms rather than 5 or 6 ms? But no, it still fails when switched over to usec granularity.

Anyway, I hope this is useful input.
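For what it's worth, the millisecond-granularity concern can be quantified with simple arithmetic (my own illustration, not PortAudio code): a 264-frame buffer at 48 kHz lasts exactly 5.5 ms, so any whole-millisecond sleep is off by 0.5 ms on every cycle.

```c
#include <stdint.h>

/* Buffer period in microseconds (exact for these inputs). */
static uint32_t period_us(uint32_t frames, uint32_t sampleRate)
{
    return (uint32_t)(1000000ULL * frames / sampleRate);
}

/* Per-cycle error in microseconds when the period is truncated to
   whole milliseconds, as with Sleep()/WaitForSingleObject(). */
static uint32_t ms_truncation_error_us(uint32_t frames, uint32_t sampleRate)
{
    return period_us(frames, sampleRate) % 1000;
}
```

For 264 frames at 48 kHz the truncation error is 500 µs per cycle; whether that error alone can explain the glitches is exactly what the usec-sleep test above argues against.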

@kmatheussen

(continuing discussion from #305 here)
@dmitrykos Of course increased CPU consumption can be a problem. My question is whether sleeping 2 ms makes any difference worth caring about. No one cares if the portaudio driver uses 0.0001% more CPU, for instance. I'm suggesting that we measure this, and use 2 ms or even 1 ms if it doesn't make any notable difference.

@kmatheussen

@dmitrykos You wrote: "you could do audio rendering not inside the PA's thread but in some other thread which would deliver ready buffers/data into the PA's callback thread"

Are you serious? Of course you have to do processing in a separate thread if your own code doesn't behave in a proper RT manner, but you should definitely not have to do it because PA doesn't behave in a proper RT manner.

@kmatheussen

kmatheussen commented Oct 1, 2020 via email

@dmitrykos
Collaborator

dmitrykos commented Oct 1, 2020

@kmatheussen

It looks like you are sleeping the entire buffer.

We are sleeping for the duration of the user buffer, which is always at least 2x smaller than the host (WASAPI) buffer; if the user specifies a desired latency, the host buffer will be the size of the requested latency. There is no mistake in the implementation.

How can you avoid risking drifting when you don't even keep account of how long you have slept totally

On every cycle we get the number of available frames with _PollGetOutputFramesAvailable, so drifting is not possible, as we always fill the available frames until fewer than a user buffer remain. You could run some of the examples, put breakpoints in, and see the logic in action.

@sveinse
Contributor Author

sveinse commented Oct 1, 2020

@dmitrykos Sorry for the delay; I got put off by the rebase in winrt and had to re-merge my changes (I have to patch the msvc build files to be able to build it).

Running PR #307 as is, with PA_WASAPI_LOG_TIME_SLOTS enabled, seems to give decent playback on my system at all buffer sizes. There is still some glitching at the very beginning with frame sizes around 264.

WASAPI::OpenStream(output): framesPerUser[ 264 ] framesPerHost[ 1056 ] latency[ 22.00ms ] exclusive[ NO ] wow64_fix[ NO ] mode[ POLL ]
[0|1],{0},[5|4],{1},[4|4],{1},[4|3],{1},[4|3],{1},[4|4],{1},[4|4],{1},[4|4],{0},[5|5],{0},[5|5],{0},[5|5],{0},[5|5],{0},[5|6],{0},

You mentioned you'd like to test my app on your HW. I created this special build of ELNS using the portaudio version from this PR. It can be downloaded from https://github.com/sveinse/elns-portaudio-test/releases/tag/1. It has console output enabled, so running it from a command prompt will show the WASAPI logging. This test version loads the input file into memory prior to playback to rule out any internal timing issues.

I test by pressing play and stop rather quickly. What is interesting to note is that it never glitches on rapid pause/play operations. Perhaps that suggests this is related to the initial startup of a stream?

@sveinse
Contributor Author

sveinse commented Oct 1, 2020

@dmitrykos btw, to test your statement in PR #307 about "...time hungry user-side processing", I added the following code:

		LARGE_INTEGER a, b;
		QueryPerformanceCounter(&b);
		processor[S_OUTPUT].processor(NULL, 0, data, frames, processor[S_OUTPUT].userData);
		QueryPerformanceCounter(&a);
		printf("_%lld_,", a.QuadPart-b.QuadPart);
This produced:

WASAPI::OpenStream(output): framesPerUser[ 264 ] framesPerHost[ 1056 ] latency[ 22.00ms ] exclusive[ NO ] wow64_fix[ NO ] mode[ POLL ]
_932_,[0|1],_2312_,_575_,_1512_,_490_,{1},[4|4],{0},[5|6],_979_,{0},[5|5],{0},[5|5],_1013_,_644_,{1},[4|3],{0},[5|5],_1004_,_646_,{1},[4|3],{0},[5|6],_1012_,_624_,{1},[4|4],{0},[5|6],_1204_,_696_,{0},[5|6],{0},[5|5],_1023_,{1},[4|4],{0},[5|5],_1113_,_732_,{1},[4|3],{0},[5|4],_1310_,_1326_,{1},[4|4],{0},[5|5],_1276_,_936_,{1},[4|3],{0},[5|6],_1402_,_884_,

This one is a glitch on startup. It seems the user callback spends much longer in its first invocation, but the numbers are very variable. They are equally variable and high on playbacks that don't glitch, even with lower buffer sizes. I plan to collect a better histogram of these, but just wanted to mention that I'm looking into whether the root cause of the startup glitching is excessive user callback times.
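Note the _932_-style numbers above are raw QueryPerformanceCounter tick deltas, not time units; converting them requires the counter frequency from QueryPerformanceFrequency(). A sketch of the conversion (the 10 MHz frequency mentioned below is only an assumption; the real value must be queried at runtime):

```c
#include <stdint.h>

/* Convert a QueryPerformanceCounter tick delta to microseconds.
   'freq' is counts per second, as returned by QueryPerformanceFrequency(). */
static int64_t qpc_ticks_to_us(int64_t ticks, int64_t freq)
{
    return ticks * 1000000 / freq;
}
```

With an assumed 10 MHz counter, a _932_ delta would be roughly 93 µs.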

@kmatheussen

kmatheussen commented Oct 2, 2020 via email

@dmitrykos
Collaborator

@sveinse thank you for the test; I am glad we managed to get to an (almost) stable version.

with PA_WASAPI_LOG_TIME_SLOTS enabled

I probably misunderstood, but PA_WASAPI_LOG_TIME_SLOTS needs to be defined only to enable logging (printfs). Without defining it, I believe you get the same stable playback?

I could not test your application unfortunately, as it throws an Invalid device error; I tried on 2 machines with the same result. The audio device is set to 24/48000 format. Maybe I am doing something wrong. Input is set to a WAV/FLAC file, Output to a WASAPI output device.

Starting
Using input device index number: -1
Using output device index number: 4
WASAPI: IAudioClient2 set properties: IsOffload = 0, Category = 0, Options = 0
WASAPI ERROR HRESULT: 0x88890008 : AUDCLNT_E_UNSUPPORTED_FORMAT
 [FUNCTION: CreateAudioClient FILE: \elns-wasapi\portaudio-pr307\src\hostapi\wasapi\pa_win_wasapi.c {LINE: 3132}]
WASAPI ERROR PAERROR: -9997 : Invalid sample rate
 [FUNCTION: ActivateAudioClientOutput FILE: \elns-wasapi\portaudio-pr307\src\hostapi\wasapi\pa_win_wasapi.c {LINE: 3449}]
WASAPI ERROR PAERROR: -9996 : Invalid device
 [FUNCTION: OpenStream FILE: \elns-wasapi\portaudio-pr307\src\hostapi\wasapi\pa_win_wasapi.c {LINE: 2143}]
An error occured while using the portaudio stream
Error number: -9996
Error message: Invalid device
Traceback:
  File "elns\core.py", line 797, in tick_step
  File "elns\audio.py", line 556, in open
  File "pyaudio.py", line 754, in open
  File "pyaudio.py", line 445, in __init__
OSError: [Errno -9996] Invalid device
ERROR: Opening audio device failed. OSError: [Errno -9996] Invalid device

@sveinse
Contributor Author

sveinse commented Oct 2, 2020

@dmitrykos This is actually another bug (or feature) in WASAPI. The root cause is that you're most probably trying to play an audio file with a different sample rate than the one the Windows audio device is configured for (48k). In this version I do not use the paWinWasapiAutoConvert flag, so the sample rates must match. WASAPI first returns error -9997 Invalid sample rate, which is a useful error. But it later mutates into -9996 Invalid device, which is the message reported to the user application. Unfortunately this second message is cryptic and confusing at best.

@dmitrykos
Collaborator

dmitrykos commented Oct 2, 2020

@sveinse yes, you were right: matching the sample rate made it work (I tried to match them yesterday, but maybe something went wrong), and I went through all the latencies. I could hear some very occasional glitching when using 128 ms, but you enabled logging with PA_WASAPI_LOG_TIME_SLOTS, and the console I/O from all the printfs may have caused it. All other latencies were stable when listening to a 440 Hz sine tone.

This is actually another bug (or feature) in WASAPI

It is not a bug, but how WASAPI works in shared mode: Microsoft decided to pass the resampling task to the application level.

Can we conclude that we have reached stability and the current implementation shall be merged, or do you wish to run some other experiments first?

@sveinse
Contributor Author

sveinse commented Oct 2, 2020

@dmitrykos Please stand by, I'm going to run some tests now without printf.

This is actually another bug (or feature) in WASAPI

It is not a bug but how WASAPI works in Shared mode, Microsoft decided to pass the resampling task to the application level.

The sample rate matching error is correct; that's not what I meant. I meant that WASAPI does identify this as a sample rate error (the -9997 error), but it later becomes a -9996 Invalid device error, and the latter is what gets reported back to the user layer. It would be better if the reported error were -9997 Invalid sample rate. -- That said, I plan to query for a compatible format before starting the stream in the next version, so I can catch this and inform the user in a more useful manner, and even give the user the option to use autoconvert.

@dmitrykos
Collaborator

dmitrykos commented Oct 2, 2020

@sveinse I see, sorry for the confusion. I checked the related code and it is actually a bug in the implementation at line 3842, if (FAILED(hr = ActivateAudioClientOutput(stream))). The same applies to input. I propose to fix it with another pull request once we are done with the current one.
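For anyone following along, the failure mode is easy to reproduce in isolation. The sketch below is not PortAudio's actual code; the function names are made up and the constants merely mirror paInvalidSampleRate (-9997) and paInvalidDevice (-9996). It shows how a wrapper that checks only whether the inner call failed, and then substitutes its own generic error, shadows the specific one:

```c
#include <assert.h>

/* Hypothetical error codes mirroring PortAudio's -9997/-9996 values. */
enum { ERR_INVALID_SAMPLE_RATE = -9997, ERR_INVALID_DEVICE = -9996, OK = 0 };

/* Inner activation step: correctly reports a sample-rate mismatch. */
static int ActivateOutput(int requestedRate, int deviceRate)
{
    return (requestedRate == deviceRate) ? OK : ERR_INVALID_SAMPLE_RATE;
}

/* Buggy wrapper: notices the failure but returns its own generic error,
 * so the specific cause from the inner call is lost. */
static int OpenStream_buggy(int requestedRate, int deviceRate)
{
    int hr = ActivateOutput(requestedRate, deviceRate);
    if (hr != OK)
        return ERR_INVALID_DEVICE;   /* specific cause shadowed here */
    return OK;
}

/* Fixed wrapper: propagates the inner error unchanged. */
static int OpenStream_fixed(int requestedRate, int deviceRate)
{
    int hr = ActivateOutput(requestedRate, deviceRate);
    if (hr != OK)
        return hr;                   /* caller sees -9997 */
    return OK;
}
```

With this pattern, a 44.1k file on a 48k device yields -9996 from the buggy wrapper but -9997 from the fixed one, which matches the mutation described above.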

@philburk
Collaborator

philburk commented Oct 2, 2020

This is a great technical discussion. But the thread is very long. So when this issue is closed, could someone please add a very short summary paragraph describing the issue and the final solution. That will help others in the future. Thanks.

@sveinse
Contributor Author

sveinse commented Oct 2, 2020

@dmitrykos I rebuilt portaudio as a Release build with no printfs. Unfortunately the glitching does not go away.

The test results are as follows:

Devices:

   1: Windows WASAPI  [out 2]  LG HDR 4K (NVIDIA High Definition Audio)
   2: Windows WASAPI  [out 2]  Speaker/HP (Realtek High Definition Audio)
   3: Windows WASAPI  [out 2]  Speakers (2- Logitech USB Headset)

Solo = started without any other audio running on that device. Shared = another audio playback running on the same device.

| buffer size | Device 1 solo | Device 1 shared | Device 2 solo | Device 2 shared | Device 3 solo |
|---|---|---|---|---|---|
| 2112 | 3 OK | 3 OK | 3 OK | 5 OK | 3 OK |
| 1056 | 3 OK | 2 OK, 1 glitch | 3 OK | 5 OK | 3 OK, 2 glitch |
| 528 | 1 OK, 5 glitch | 5 glitch | 3 OK, 3 glitch | 5 glitch | 4 glitch |
| 264 | 4 glitch | 3 OK, 2 glitch | 1 OK, 4 glitch | 2 OK, 3 glitch | 3 glitch |
| 132 | 5 glitch | 3 OK, 2 glitch | 2 OK, 3 glitch | 5 OK | 4 OK |
| 66 | 5 glitch | 4 OK | 1 OK, 4 glitch | 5 OK | 3 OK, 1 glitch |

edit PS! None of the other host APIs (DS, MME and WDM-KS) glitch with the same setup

@dmitrykos
Collaborator

dmitrykos commented Oct 2, 2020

@sveinse thank you for a very detailed test! Did it become worse when you do not define PA_WASAPI_LOG_TIME_SLOTS?

According to the time slots logging results, on application/PA side all seems to be fine and we never exceed 1/2 of the host buffer latency, so I am puzzled.

edit:
Comparing against the other APIs does not make sense: in some cases the core (DS, MME) may be doing internal buffering that happens to work fine, while WDM-KS is direct I/O without any hidden wrapper. We may be facing some corrupted logic in WASAPI's shared mode. As an experiment, to check whether the PA WASAPI logic is correct, you could try forcing Exclusive mode, which bypasses WASAPI's mixer/wrapper; if playback is then stable, most probably something is wrong with WASAPI's shared mode itself.

@sveinse
Copy link
Contributor Author

sveinse commented Oct 2, 2020

@dmitrykos

@sveinse thank you for a very detailed test! Did it become worse when you do not define PA_WASAPI_LOG_TIME_SLOTS?

I changed multiple variables in the test above (no printf and a Release build), so I had to rerun the tests. Setting PA_WASAPI_LOG_TIME_SLOTS doesn't change the glitching. What is worth noting is that the glitches at a specific buffer value are more or less deterministic. E.g. at 528 there is a glitch at startup when it glitches, nothing more. At 264, when playback isn't perfect, I get a double glitch at the beginning with subjectively equal spacing. So I would be tempted to think the system behaves close to deterministically when it glitches.

According to the time slots logging results, on application/PA side all seems to be fine and we never exceed 1/2 of the host buffer latency, so I am puzzled.

Could it be that Windows requires a larger buffer fill in the first invocations to handle internal warm-up?

I am considering logging/tracing the absolute time when e.g. WasapiHostProcessingLoop() is called, to see if there is a correlation in the timing that might explain the glitching.

The other worry I have is that these glitches go unobserved: no flag is raised and sent back to the user indicating the glitch/underrun. That bothers me.

edit:
Comparing other APIs does not makes sense as in one case core (DS, MME) may be doing some internal buffering which works fine, WDMKS instead is a direct I/O without any hidden wrapper. We may be facing some corrupted logic of WASAPI's shared mode. For experiment, to check if PA WASAPI logic is correct you could try forcing Exclusive mode which bypasses WASAPI's mixer/wrapper, if playback is stable then most probably there is something wrong with shared mode of WASAPI then.

I understand your point, but at the same time I think it's not good if the WASAPI host API underperforms compared to the other host APIs. So my opinion and objective is that we need to find a config and method that works reliably for users. If that mode is Event mode, that's fine, except that Event mode is currently unavailable outside Exclusive mode.

Exclusive mode, which uses Event mode, is rock solid at every frame size I can throw at it, even very low values such as 12 samples. One question: in shared mode I see that the host buffer sizes center around 1056 samples on my system. I cannot find any such "converging" value in exclusive mode; it seems to use a host buffer a couple of hundred samples bigger than the user buffer. Why is that?

@dmitrykos
Collaborator

dmitrykos commented Oct 3, 2020

@sveinse

Is it Windows that require a larger buffer fill in the first invocations to be able to handle internal warm-up?

There are no docs on how many frames should be preloaded to guarantee a stable startup, but there is headroom for improvement by filling the whole host buffer with user data before the stream is started:

You could modify the block starting at line 5992 to:

```c
while (frames >= stream->out.framesPerBuffer)
{
    if ((hr = ProcessOutputBuffer(stream, processor, stream->out.framesPerBuffer)) != S_OK)
    {
        LogHostError(hr); // not fatal, just log
        break;
    }
    frames -= stream->out.framesPerBuffer;
}
```
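As a sanity check of that loop's arithmetic, here is a self-contained simulation (SimStream and SimProcessOutputBuffer are made-up stand-ins, not PortAudio code): preloading the host buffer in whole user-buffer chunks leaves less than one chunk of free space before the stream starts.

```c
#include <assert.h>

/* Simulated stream state: user buffer size and frames queued so far. */
typedef struct { unsigned framesPerBuffer; unsigned hostPadding; } SimStream;

/* Stand-in for ProcessOutputBuffer(): "renders" one user buffer into the
 * host buffer. */
static int SimProcessOutputBuffer(SimStream *s, unsigned frames)
{
    s->hostPadding += frames;
    return 0; /* S_OK */
}

/* Preload pattern from the proposed fix: keep filling whole user-buffer
 * chunks until less than one chunk of free space remains. */
static unsigned PreloadHostBuffer(SimStream *s, unsigned freeFrames)
{
    while (freeFrames >= s->framesPerBuffer)
    {
        if (SimProcessOutputBuffer(s, s->framesPerBuffer) != 0)
            break; /* not fatal, just stop preloading */
        freeFrames -= s->framesPerBuffer;
    }
    return freeFrames; /* leftover space, smaller than one user buffer */
}
```

With the 1056-frame host buffer from the tests in this thread and a 264-frame user buffer, four iterations fill the host buffer completely; with a user buffer that does not divide 1056 evenly, a sub-chunk remainder is left unfilled, which is exactly what the loop condition guarantees.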

The other worry I have is that these glitches are unobserved. There is no flag being raised and sent back to the user indicating the glitch/underrun. That bothers me.

Yes, it may be that the problem is somewhere inside WASAPI's mixer/buffer handling, because the time-slot logging showed no underruns happening on our side and the polling time is always smaller than 1/2 of the host buffer. This problem happens when the lowest possible host buffer (22 ms) is used in Shared mode; in your tests you got stable playback once the host buffer had 44 ms latency (2112).

my opinion and objective is that we need to find a config and method that works reliably for the users

I agree, but in my view we cannot compare performance 1 to 1. For example, using your application with DS selected as the host API, I can see in the logs that the DS latency is much higher than what we are trying to achieve with WASAPI:

DirectSound host buffer size frames: 5548, polling period seconds: 0.007483, @ sr: 44100.000000

e.g. it is 125 ms when I selected latency 256. MME's latency will be high too, and WDM-KS is the equivalent of WASAPI's Exclusive mode. By contrast, with the WASAPI host API we are trying to play with 22 ms latency in Shared mode. Maybe that is too low for Shared mode, while 44 ms becomes stable, which is still almost 3 times lower than DS's latency anyway.
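For reference, these latency figures follow directly from frames divided by sample rate; a small helper (hypothetical, for illustration only) makes the conversion explicit: 5548 frames at 44.1 kHz is about 125.8 ms, while 1056 and 2112 frames at 48 kHz correspond to the 22 ms and 44 ms figures quoted in this thread.

```c
#include <assert.h>

/* Convert a host buffer size in frames to its latency in milliseconds. */
static double BufferLatencyMs(unsigned frames, double sampleRate)
{
    return (double)frames / sampleRate * 1000.0;
}
```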

If you could experiment: force Exclusive mode for WASAPI and test the range of latencies. Then we could understand whether Shared mode itself is buggy/unstable at 22 ms latency, or whether the problem is in the PA WASAPI implementation (I can see no corrupted logic in our implementation). If playback is stable, then we probably cannot achieve absolute success with WASAPI Shared mode at latencies below 44 ms.

One question: In shared mode I see that the host buffer sizes orient around 1056 samples in my system.

It is the value returned by WASAPI's IAudioClient_GetBufferSize when we request the minimal possible latency, so the rationale behind it lies in WASAPI's implementation code.

@sveinse
Contributor Author

sveinse commented Oct 3, 2020

@dmitrykos I did not do a very thorough test, but I get no glitching at any of the buffer size values I've tested before. Looking good. This was a Debug build with printf, so let me do a more detailed test.

@sveinse
Contributor Author

sveinse commented Oct 3, 2020

@dmitrykos I have now tested the latest proposal extensively:

Shared mode: Works perfectly without any glitching. Tested with 3 different audio devices (built-in, HDMI and USB), solo and mixed with other playback on the same device, and tested every permutation with buffer sizes of 2112, 1056, 528, 264, 132 and 66. All OK.

Exclusive mode: Works great, but I get some glitching at the lower buffer sizes (132 and below). This is probably due to excessive user processing time and is at some point to be expected.

Forced event mode: In this mode I force the playback to use Event mode. It is not stable at all except at 528. At all other values it produces a repetitive, monotonic stream of glitching audio (as opposed to random, intermittent glitches).

Auto convert mode: Didn't test as thoroughly as above, but using this option doesn't seem to affect stability. If I revert the fix above, I get the same type of glitching regardless of whether the auto convert feature is used.

I agree but here to my view we can not compare performance 1 to 1, for example using your application and selecting DS as host API I can see in logs that DS latency is much higher than what we are trying to achieve with WASAPI:

DirectSound host buffer size frames: 5548, polling period seconds: 0.007483, @ sr: 44100.000000

e.g. it is 125 ms when I selected latency 256. MME's latency will be high too and WDMKS is equivalent of WASAPI's Exclusive mode. On contrary in case of WASAPI host API we are trying to play with 22 ms latency in Shared mode. May be it is too low for Shared mode while 44 ms becomes stable which is almost 3 times lower than DS's latency anyway.

I agree, DS and MME are not a fair comparison due to their very long buffering schemes, and they can't be compared 1 to 1. However, my overall point is from the user's perspective: users most often don't care which host API is used, they just want something that works. DS and MME are old legacy APIs (which use WASAPI behind the scenes). I would like to be able to recommend that my users use WASAPI, as it is the preferred, modern Windows audio system. That's all.

@sveinse
Contributor Author

sveinse commented Oct 3, 2020

PS! I hit the bug/feature in https://github.com/PortAudio/portaudio/blob/winrt/src/hostapi/wasapi/pa_win_wasapi.c#L2079-L2085 . Long story short: I have a Focusrite Scarlett 6i6 USB Pro sound card, and it wanted a firmware update. For some reason that firmware update completely broke audio on my computer. If I selected that audio card as output, other audio apps such as Spotify crashed. I had planned to use this sound card for the tests above, but had to give up on it. When I ran my application, PortAudio crashed due to this:

Cannot create DirectSound for Speakers (Focusrite Usb Audio). Result = 0x88890008
Speakers (Focusrite Usb Audio)'s GUID: {0xd802d747,0x1f0a,0x47f8,0x98,0xf9,0x47,0xfe,0x1b,0xa3,0x2f, 0x35}
....
WASAPI: device idx: 03
WASAPI: ---------------
WASAPI:3| name[Speakers (Focusrite Usb Audio)]
WASAPI:3| form-factor[1]
WASAPI ERROR HRESULT: 0x88890008 : AUDCLNT_E_UNSUPPORTED_FORMAT
 [FUNCTION: CreateDeviceList FILE: C:\Svein\Prosjekt\elns-wasapi\portaudio-pr307\src\hostapi\wasapi\pa_win_wasapi.c {LINE: 2082}]
WASAPI: failed CreateDeviceList error[-9986|Internal PortAudio error]
WASAPI: failed PaWasapi_Initialize error[-9986|Internal PortAudio error]

This is OT for this issue, but I wanted to mention it. If you like I could create a new issue for it.

@dmitrykos
Collaborator

@sveinse I am glad to hear we have some light at the end of the tunnel. Do I understand correctly that the proposed modification starting at line 5992 improved the situation with glitches?

This is OT for this issue, but I wanted to mention it. If you like I could create a new issue for it.

Yes, if you do not mind creating 2 issues, one for this and another for "WASAPI does identify this being a sample rate error (the -9997 error), but it later becomes a -9996 invalid device error", then we could link them to the pull requests with the fixes.

@sveinse
Contributor Author

sveinse commented Oct 3, 2020

@sveinse I am glad to hear we got some light in the end of tunnel. Do I understand it correctly that proposed modification starting from line 5992 improved situation with glitches?

Affirmative. That fix resolves any startup glitches.

@sveinse
Contributor Author

sveinse commented Oct 3, 2020

@dmitrykos Since the initial buffer fill fix was found last, I wanted to verify whether the /2 fix is still necessary. I did this by commenting out:

```c
UINT32 chunks = stream->out.framesPerHostCallback / userFramesOut;
if (chunks <= 2)
{
    sleepTimeOut /= 2;
    PRINT(("WASAPI: underrun workaround, sleep [%d] ms - 1/2 of the user buffer[%d] | host buffer[%d]\n", sleepTimeOut, userFramesOut, stream->out.framesPerHostCallback));
}
```

The results were negative: this code is needed to prevent glitching. Specifically, buffer sizes of 1056 and 528 glitched continuously, while all the other sizes had perfect playback.
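The effect of that block can be checked in isolation. The sketch below is a simplified stand-in (not the actual PA code): it computes the poll sleep time from the user buffer duration and halves it when the host buffer holds at most two user buffers, so polling can never miss the refill window.

```c
#include <assert.h>

/* Simplified model of the underrun workaround: sleep for the user-buffer
 * duration, but only half of it when the host buffer contains at most
 * 2 user-buffer chunks. */
static unsigned PollSleepMs(unsigned framesPerHostBuffer,
                            unsigned userFramesOut,
                            unsigned sampleRate)
{
    unsigned sleepMs = userFramesOut * 1000u / sampleRate;
    unsigned chunks = framesPerHostBuffer / userFramesOut;
    if (chunks <= 2)
        sleepMs /= 2; /* 1/2 of the user buffer duration */
    return sleepMs;
}
```

With the 1056-frame host buffer at 48 kHz from these tests, a 528-frame user buffer (2 chunks) sleeps 5 ms instead of 11, while a 264-frame user buffer in a 2112-frame host buffer (8 chunks) keeps its full 5 ms sleep.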

@sveinse
Contributor Author

sveinse commented Oct 3, 2020

I've added a special build of my end-user app for Windows with the proposals in PR #307, if anyone else wants to test. It can be downloaded from here: https://github.com/sveinse/elns-portaudio-test/releases/tag/2

@dmitrykos
Collaborator

dmitrykos commented Oct 3, 2020

@sveinse thank you very much for testing and for all your proposals; I am really glad that we finally nailed the bug! :) This hard work resulted in a large improvement, in my view.

To summarize (as per @philburk's advice) what we have done in #307 to get rid of audible glitches in Polling Shared mode:

  1. The WASAPI implementation was not setting the system timer's granularity, and therefore WaitForSingleObject was constantly expiring late because the standard granularity is ~15 ms. Now, once the WASAPI thread is started, it forces the system timer granularity to 1 ms in both Poll and Event modes with timeBeginPeriod. The granularity is restored to its default value when the thread exits (stream is stopped).

  2. If the user buffer is 1/2 of the host buffer, we wait for half the duration of the user buffer, as this helped get rid of glitching in our tests.

  3. The sleep time for the next cycle is now updated dynamically depending on how much time was spent in processing (user code), which helps stabilize the polling time slots and make them practically equal.

  4. To avoid an underrun caused by sleeping again when a polling cycle produced fewer frames than expected, we retry polling without sleeping (0 ms). This short busy-wait results in 1-2 additional iterations, which is okay and will not harm performance or power saving.

  5. To avoid occasional glitches when the stream is started with a low requested latency (22 ms), we have to preload the whole host buffer, not just a chunk of the user buffer size.
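Point 1 is worth quantifying. With the default ~15.6 ms Windows timer granularity, a sleep can only end on a tick boundary, so an 11 ms poll sleep can stretch to ~15.6 ms, past the refill deadline of a 22 ms host buffer; after timeBeginPeriod(1) the worst case drops to ~12 ms. A toy model of the rounding arithmetic (not Windows code, assumed granularity values for illustration):

```c
#include <assert.h>

/* Worst-case effective sleep: the wait can only end on the next timer
 * tick, so the requested duration rounds up to a tick boundary. */
static double WorstCaseSleepMs(double requestedMs, double granularityMs)
{
    int fullTicks = (int)(requestedMs / granularityMs);
    return (fullTicks + 1) * granularityMs;
}
```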

Will proceed with merging into the master.
