Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Batched 3D FFT memory access fault #311

Closed
upsj opened this issue Feb 23, 2021 · 6 comments
Closed

Batched 3D FFT memory access fault #311

upsj opened this issue Feb 23, 2021 · 6 comments
Assignees

Comments

@upsj
Copy link

upsj commented Feb 23, 2021

I am working on an hipFFT wrapper for an HPC library, where I basically hipified our cuFFT wrapper code 1:1 (only replacing 64 bit calls by 32 bit calls). The CUDA and HIP-CUDA tests work and give correct results, only when compiling everything with rocFFT on an AMD device do I get a memory access fault. My guess would be that this is related to the "interleaved batch" memory layout we are using, where the innermost dimension is the batch dimension, followed by the 3 FFT dimensions.

Let me know if you need any additional information

What is the expected behavior

  • The FFT is executed

What actually happens

  • the execution fails with Memory access fault by GPU node-1 (Agent handle: ...) on address .... Reason: Page not present or supervisor privilege.

How to reproduce

  • Minimal reproducer: (setting batch to 1 works correctly)
#include <hip/hip_runtime.h>
#include <hipfft.h>


int main() {
	hipfftDoubleComplex* in;
	hipfftDoubleComplex* out;
	hipfftHandle handle;
	int sizes[] = {16, 32, 64};
	int size = sizes[0] * sizes[1] * sizes[2];
	int batch = 2;
	size_t worksize;
	hipMalloc(&in, sizeof(hipfftDoubleComplex) * size * batch);
	hipMalloc(&out, sizeof(hipfftDoubleComplex) * size * batch);
	hipfftCreate(&handle);
	hipfftMakePlanMany(handle, 3, sizes, sizes, batch, 1, sizes, batch, 1, HIPFFT_Z2Z, batch, &worksize);
	hipfftExecZ2Z(handle, in, out, HIPFFT_FORWARD);
	hipDeviceSynchronize();
	hipFree(out);
	hipFree(in);

}

Environment

Hardware description
GPU Radeon VII
CPU AMD Ryzen Threadripper 1920X
Software version
HIP 4.0.20496-4f163c68
hipFFT 1.0.2.57-be3a15d
rocFFT 1.0.8.966-rocm-rel-4.0-23-2d35fd6
hip-clang dac2bfceaa8d4a90257dc8a6d58f268e172ce00e
@evetsso evetsso self-assigned this Feb 23, 2021
@evetsso
Copy link
Contributor

evetsso commented Feb 23, 2021

Thanks for the bug report. It looks like a fix for this might already be coming in the next release but I'll confirm.

@evetsso
Copy link
Contributor

evetsso commented Feb 23, 2021

@upsj After looking closer at your test program, it looks like you've got some errors in it:

  • istride and ostride must be 1 for contiguous data. Note that if your data is contiguous, you can pass null pointers for inembed and onembed and hipFFT will choose equivalent defaults.
  • idist and odist must be the number of elements between batches. For contiguous batches, that's the same as the int size variable in your code.
  • If you really do have 2 batches of size 16x32x64 each, you must allocate at least 16 *32 *64 * 2 elements for input and output. Non-contiguous data would require additional memory allocated.

This works:

#include "hipfft.h"

int main() {
	hipfftDoubleComplex* in;
	hipfftDoubleComplex* out;
	hipfftHandle handle;
	int sizes[] = {16, 32, 64};
	int size = sizes[0] * sizes[1] * sizes[2];
	int batch = 2;
	size_t worksize;
	hipMalloc(&in, sizeof(hipfftDoubleComplex) * size * batch);
	hipMalloc(&out, sizeof(hipfftDoubleComplex) * size * batch);
	hipfftCreate(&handle);
	hipfftMakePlanMany(handle, 3, sizes, sizes, 1, size, sizes, 1, size, HIPFFT_Z2Z, batch, &worksize);
	hipfftExecZ2Z(handle, in, out, HIPFFT_FORWARD);
	hipDeviceSynchronize();
	hipFree(out);
	hipFree(in);
}

I'm closing this issue - please feel free to comment if you have any questions. We can open this issue or another issue if you run into additional problems.

@evetsso evetsso closed this as completed Feb 23, 2021
@upsj
Copy link
Author

upsj commented Feb 23, 2021

I think you slightly misunderstood my use case - the interleaved, non-contiguous storage is intended, since due to interface consideration, we store the FFT for each batch as a column in a row-major matrix.
Formally, with dimensions (n,m,k) and batch count c, the index of the entry (x,y,z) in batch b is x*s2*s3*c + y*s3*c + z*c + b.
The example I posted is minimized, we encounter the same issue in practice, and the identical invocation with cuFFT works.

@evetsso evetsso reopened this Feb 23, 2021
@evetsso
Copy link
Contributor

evetsso commented Feb 23, 2021

Ok, I see. I don't have an immediate solution to your problem but will investigate.

@upsj
Copy link
Author

upsj commented Feb 23, 2021

That's great to hear, thanks! Just let me know when you have a solution, I will disable the offending tests until then.

@evetsso
Copy link
Contributor

evetsso commented Feb 26, 2021

f9006e4 fixes this in the develop branch. It should be included in the next release. Please comment/reopen if you still see problems.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants