Problems with out-of-range accesses when using specifyOffsetsAtLaunch #139

Closed
azonenberg opened this issue Sep 24, 2023 · 19 comments

Comments

@azonenberg

Hi,

I'm having trouble getting buffer offsets at launch to work.

If I have specifyOffsetsAtLaunch=0, my test case works fine.

Using the exact same code with specifyOffsetsAtLaunch=1 and all offsets in the VkFFTLaunchParams set to zero, I get validation errors:

19: VUID-vkCmdDispatch-None-02706(ERROR / SPEC): msgNum: -1660035578 - Validation Error: [ VUID-vkCmdDispatch-None-02706 ] Object 0: handle = 0x5561992ca640, name = Filter_FFT.queue, type = VK_OBJECT_TYPE_QUEUE; | MessageID = 0x9d0dde06 | (set = 0, binding = 0) Descriptor index 0 access out of bounds. Descriptor size is 2097160 and highest byte accessed was 2590896487 Command buffer (0x556199610a70). Compute Dispatch Index 0x3. Pipeline (0xa43473000000002d). Shader Module (0x2e2cd000000002b). Shader Instruction Index = 155.  Stage = Compute.  Global invocation ID (x, y, z) = (64, 0, 0 ) Unable to find SPIR-V OpLine for source information.  Build shader with debug info to get source information. The Vulkan spec states: If the robustBufferAccess feature is not enabled, and if the VkPipeline object bound to the pipeline bind point used by this command accesses a storage buffer, it must not access values outside of the range of the buffer as specified in the descriptor set bound to the same pipeline bind point (https://vulkan.lunarg.com/doc/view/1.3.250.1/linux/1.3-extensions/vkspec.html#VUID-vkCmdDispatch-None-02706)
19:     Objects: 1
19:         [0] 0x5561992ca640, type: 4, name: Filter_FFT.queue
19: VUID-vkCmdDispatch-None-02706(ERROR / SPEC): msgNum: -1660035578 - Validation Error: [ VUID-vkCmdDispatch-None-02706 ] Object 0: handle = 0x5561992ca640, name = Filter_FFT.queue, type = VK_OBJECT_TYPE_QUEUE; | MessageID = 0x9d0dde06 | (set = 0, binding = 0) Descriptor index 0 access out of bounds. Descriptor size is 2097160 and highest byte accessed was 2590902375 Command buffer (0x556199610a70). Compute Dispatch Index 0x3. Pipeline (0xa43473000000002d). Shader Module (0x2e2cd000000002b). Shader Instruction Index = 155.  Stage = Compute.  Global invocation ID (x, y, z) = (800, 0, 0 ) Unable to find SPIR-V OpLine for source information.  Build shader with debug info to get source information. The Vulkan spec states: If the robustBufferAccess feature is not enabled, and if the VkPipeline object bound to the pipeline bind point used by this command accesses a storage buffer, it must not access values outside of the range of the buffer as specified in the descriptor set bound to the same pipeline bind point (https://vulkan.lunarg.com/doc/view/1.3.250.1/linux/1.3-extensions/vkspec.html#VUID-vkCmdDispatch-None-02706)
19:     Objects: 1
19:         [0] 0x5561992ca640, type: 4, name: Filter_FFT.queue

Additionally, the output is incorrect.

Here's how I'm launching the FFT:

	VkFFTLaunchParams params;
	memset(&params, 0, sizeof(params));
	params.inputBuffer = &inbuf;
	params.buffer = &outbuf;
	params.commandBuffer = &cmd;

	auto err = VkFFTAppend(&m_app, -1, &params);
	if(VKFFT_SUCCESS != err)
		LogError("Failed to append vkFFT transform (code %d)\n", err);

Why would this behave any differently if specifyOffsetsAtLaunch is true vs false?

@azonenberg
Author

Also, my goal here was to do a bunch of consecutive FFTs in the same buffer (also doable by batching).

When I use batched 1D FFTs, I need to make my input buffer a bit larger than the expected size (number of blocks * FFT length) or I get validation errors about reading past the end of the buffer. And the final FFT in the batch appears to be corrupted somehow.

@azonenberg
Author

Full source for the code exhibiting the error is at https://github.com/glscopeclient/scopehal/blob/b63c57e4ebfc3b058e9a6c6004528b3f47211819/scopeprotocols/SpectrogramFilter.cpp

488x 16384 point R2C FFTs is the dataset size I'm using in my current test.

@DTolm
Owner

DTolm commented Sep 25, 2023

Hello,

For the first question, I couldn't reproduce specifyOffsetsAtLaunch=1 breaking the execution. Can you send the values stored in axis->specializationConstants.inputOffset.data.i and in axis->specializationConstants.outputOffset.data.i at the time the VkFFT_DispatchPlan call (lines 129-138) is made, and check whether they are copied into axis->pushConstants successfully?

As for the second question - in R2C FFTs the expected size is not (number of blocks * FFT length), but (number of blocks * (FFT length/2+1)). This happens due to the Hermitian symmetry and is the same in all other FFT libraries. You can read more on that in the VkFFT documentation.

Feel free to ask other questions about VkFFT!

Best regards,
Dmitrii

@azonenberg
Author

I'll poke at the offsets issue in a bit, let's focus on the second one as that's the more immediate one.

I'm aware of the Hermitian symmetry, my problem is that I'm overrunning the input buffer.

[SpectrogramFilter::Refresh] SpectrogramFilter: 8000002 input points, 488 16384-point FFTs
[SpectrogramFilter::Refresh]     FFT range is DC to 40 GHz
[SpectrogramFilter::Refresh]     2.44141 MHz per bin
VUID-vkCmdDispatch-None-02706(ERROR / SPEC): msgNum: -1660035578 - Validation Error: [ VUID-vkCmdDispatch-None-02706 ] Object 0: handle = 0x555fec0f4370, name = FilterGraphExecutor[2].queue, type = VK_OBJECT_TYPE_QUEUE; | MessageID = 0x9d0dde06 | (set = 0, binding = 0) Descriptor index 0 access out of bounds. Descriptor size is 31981568 and highest byte accessed was 31983423 Command buffer (FilterGraphExecutor[2].cmdbuf)(0x7f6e88002330). Compute Dispatch Index 0x1e8. Pipeline (0xc67bba000002bcbe). Shader Module (0xb43548000002bcbc). Shader Instruction Index = 304.  Stage = Compute.  Global invocation ID (x, y, z) = (0, 3902, 0 ) Unable to find SPIR-V OpLine for source information.  Build shader with debug info to get source information. The Vulkan spec states: If the robustBufferAccess feature is not enabled, and if the VkPipeline object bound to the pipeline bind point used by this command accesses a storage buffer, it must not access values outside of the range of the buffer as specified in the descriptor set bound to the same pipeline bind point (https://vulkan.lunarg.com/doc/view/1.3.250.1/linux/1.3-extensions/vkspec.html#VUID-vkCmdDispatch-None-02706)
    Objects: 1
        [0] 0x555fec0f4370, type: 4, name: FilterGraphExecutor[2].queue
    SpectrogramFilter: 129.019 ms

The input waveform is 8M points and I'm calculating a spectrogram as 488x 16384 point FFTs. So I would expect to consume a total of 7995392 fp32 input samples (31981568 bytes) and generate 3998184 complex output points (31985472 bytes).

According to the Vulkan validation layer, however, the shader is reading up to byte 31983423 of the input, which is 1856 bytes past the end of the buffer. I'm also seeing that the final FFT's output is corrupted, while the other 487 look fine.
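The size arithmetic above can be checked with a short sketch. The helper names (`expectedSizes`, `overrunBytes`) are hypothetical and not part of VkFFT, and the sketch assumes that the validation layer's "highest byte accessed" is a zero-based byte offset.

```cpp
#include <cassert>
#include <cstdint>

// Expected buffer sizes for a batch of out-of-place fp32 R2C transforms.
struct BatchSizes {
    std::uint64_t inputBytes;   // tightly packed real samples
    std::uint64_t outputBytes;  // fftLength/2 + 1 complex bins per transform
};

inline BatchSizes expectedSizes(std::uint64_t batches, std::uint64_t fftLength) {
    assert(fftLength % 2 == 0);  // the sketch only covers even lengths
    const std::uint64_t inputBytes = batches * fftLength * sizeof(float);
    const std::uint64_t outputBytes =
        batches * (fftLength / 2 + 1) * 2 * sizeof(float);  // re + im per bin
    return {inputBytes, outputBytes};
}

// Bytes touched past the end of a descriptor.
// Assumption: highestByteAccessed is a zero-based offset, so the access
// covers highestByteAccessed + 1 bytes in total.
inline std::uint64_t overrunBytes(std::uint64_t descriptorSize,
                                  std::uint64_t highestByteAccessed) {
    return highestByteAccessed < descriptorSize
               ? 0
               : highestByteAccessed + 1 - descriptorSize;
}
```

For 488 transforms of 16384 points, this gives 31981568 input bytes and 31985472 output bytes, and the reported offset 31983423 corresponds to an overrun of 1856 bytes (464 floats).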

In this screenshot, we can see the input signal (pink, top), and below that the vkFFT-generated spectrogram (using a Blackman-Harris window in a separate compute shader). Everything looks fine except for the very last column of pixels corresponding to the final FFT, which is garbage.

[image: fft-garbage]

@azonenberg
Author

Some more interesting findings from doing additional testing (all with 16K point FFTs and same input data)

If I set numberOfBatches to 1 and truncate the input appropriately, there is no overrun and everything works perfectly.

With larger numberOfBatches values (but less than the full 488), there are read overruns, but no obvious corruption of the output (probably because there's valid data for the next FFT right after the data it's supposed to be reading). There is always a minimum of 8 bytes (2 floats) of overrun for numberOfBatches > 1, but there is also an increasing trend where larger numbers of batches are more likely to have larger overruns. This is not a monotonic trend though, for example 24 batches overrun by 64 bytes, 25 batches by 8 bytes, and 26 batches by 80 bytes.

I'm using the Vulkan backend on an RTX 2080 Ti for this test. I have access to a few other NVIDIA cards and will see if I get the same behavior in each case, or whether it's dependent on some property of the card (number of registers, shared memory size, etc).

[image: overruns]

@DTolm
Owner

DTolm commented Sep 25, 2023

But the output of 488x16384 R2C should be 245x16384=4014080 complex numbers? I will try to reproduce the batching sample again tomorrow and report findings here.

@azonenberg
Author

Again, the overrun is not at the output. It's at the input.

@azonenberg
Author

Also why would the number of batches change from input to output? 488 FFTs, each with 16384 real points, should output 488 blocks of 8193 complex numbers.

And it's doing exactly that. The problem is that it's somehow reading an extra 1856 bytes (464 real values) from the input, past the end of the expected 16384*488 inputs.

@DTolm
Owner

DTolm commented Sep 26, 2023

Oh, I completely misread the configuration. I thought 16384 was the batch number and 488 was the length. Sorry, will need to redo all the tests.

@DTolm
Owner

DTolm commented Sep 26, 2023

I have verified the mentioned issue in v1.3.1. Can you try the v1.3.2 on the develop branch? I fixed some incorrect range calculations in Vulkan there before uploading it yesterday and they seem to have been related to what you were experiencing.

@azonenberg
Author

Just tried with 4bea811 and I am now reading 1984 bytes instead of 1856 past the end of the input. Everything still looks OK except for the last FFT in the batch which is obviously corrupted. But I haven't attempted to verify the batching is correct (i.e. that the first batch is samples 0-16383, the second 16384-32767, etc) or whether the intermediate batches are sampling a bit earlier/later than they should.

@DTolm
Owner

DTolm commented Sep 27, 2023

Can you enable configuration.keepShaderCode and send the generated shaders so I can verify that we test the same thing? This would help a lot. Thank you!

@azonenberg
Author

Output of ngscopeclient --debug > /tmp/log.txt 2>&1 with keepShaderCode=1 for my test case (this is ten waveforms processed consecutively, each doing 488 FFTs of 16K points). I have the input buffer enlarged by one FFT so there are no Vulkan validation errors, but there are out-of-bounds reads if I don't do that.

[attachment: log.txt]

@DTolm
Owner

DTolm commented Sep 27, 2023

Ok, I see the issue. Setting inputBufferStride[0] to 16384 explicitly in the configuration will resolve it. By default, VkFFT uses padded strides: it assumes every buffer has the layout of the complex buffer, i.e. 16384/2+1 complex values (16386 floats) per batch. Hence your out-of-bounds accesses. I should probably change the default behavior for this case, as it is unintuitive.
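To picture why a padded stride runs off the end of a tightly packed real input, here is a simplified addressing model. It is only a sketch: the function names are hypothetical, and it does not reproduce VkFFT's actual kernel indexing, so its overrun figures need not match what the validation layer reports.

```cpp
#include <cassert>
#include <cstdint>

// Padded stride in floats: room for fftLength/2 + 1 complex values per batch,
// i.e. fftLength + 2 floats for an even fftLength.
inline std::uint64_t paddedStrideFloats(std::uint64_t fftLength) {
    return 2 * (fftLength / 2 + 1);
}

// One-past-the-end float index read when `batches` blocks of `fftLength`
// real samples are assumed to start `strideFloats` apart.
inline std::uint64_t readExtentFloats(std::uint64_t batches,
                                      std::uint64_t fftLength,
                                      std::uint64_t strideFloats) {
    assert(batches > 0 && strideFloats >= fftLength);
    return (batches - 1) * strideFloats + fftLength;
}
```

In this model, a single batch never overruns, since the stride is not used at all. Two 16384-point batches with the padded stride of 16386 reach float index 32770, which is 2 floats (8 bytes) beyond the 32768 floats actually supplied; this is consistent with the minimum overrun reported earlier in the thread. With a stride of 16384, 488 batches end exactly at 7995392 floats, the size of the tightly packed input.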

@azonenberg
Author

So it expects complex padding even for real inputs? Yes, that is definitely unintuitive.

@azonenberg
Author

Just tested and with input buffer stride set to the FFT point count, I'm getting what looks to be correct behavior. I think you nailed it.

DTolm added a commit that referenced this issue Sep 27, 2023
@DTolm
Owner

DTolm commented Sep 27, 2023

When I wrote that part of the code, I thought that expecting the same padding for all buffers would be logical, but I was wrong. I have added a check so that out-of-place R2C/C2R uses non-padded strides for the input buffer.

@azonenberg
Author

Sounds good, I think we can close this now.

@DTolm
Owner

DTolm commented Sep 27, 2023

Thanks for pointing it out, I will close the issue once 1.3.2 is merged into the main branch.

@DTolm DTolm mentioned this issue Oct 23, 2023
DTolm added a commit that referenced this issue Oct 23, 2023
-Added double-double support in VkFFT. Requires cpu initialization in full quad precision, so only supports gcc with quadmath dependency for now. Potentially possible to add full FP128 support or some other FP128 library (like mpir) in the future.
-Data has to be stored in double-double before VkFFT kernels calls (no fp128<->double-double conversion on the GPU yet).
-Full 1e-32 precision, but same range as FP64. See Library for Double-Double and Quad-Double Arithmetic by Y Hida for more information on double-double.
-Double-double requires FMA contraction to be disabled (due to ab-cd contraction rounding mismatch). Doesn't work on Vulkan as I haven't found how to do that yet.
-Added DST I-IV support.
-Fixed warnings (#138)
-Added proper check for app to be zero before initializeVkFFT call and zeroing on deletion (#134)
-Added an option to provide a staging buffer in the application and VkGPU handle (#129)
-Added guards for build type (#128)
-Changed default innermost stride for real buffers in out-of-place R2C from size[0]+2 to size[0] (#139)
-Allow specifying glslang version (#135)
-Improved instruction count and accuracy for radix-7.
-Fixed missing deallocation calls for the inverse Bluestein axes. Fixed the buffer layout size in Vulkan in some cases.
-Refactored the code generator and container struct layout for better handling complex numbers (-5k loc).
-Added more precision tests and benchmarks.
@DTolm DTolm closed this as completed Oct 23, 2023