Description
I am working on applying the OpenCL-Wrapper (which is fantastic, by the way, and thanks) to audio processing.
I have written kernels that are highly parallelized (using Memory&lt;float&gt;), but the kernel operations must be run sequentially many times to process one audio buffer block.
Simulation of Processing Audio Buffer
Audio typically runs at 44100 Hz with an acceptable upper buffer-size limit of 1024 samples. This means we have ~23 ms to process a buffer (~43 buffers per second, i.e. ~43 "fps").
Hypothetically, for simplicity's sake, let's say for this project we have these needed functions in kernel.cpp:
kernel void add_kernel(global float* A, global float* B, global float* C) {
	const uint n = get_global_id(0);
	C[n] = A[n] + B[n];
}

kernel void smooth_kernel(global float* C) { // weighted (2,1,1)/4 average
	const uint n = get_global_id(0);
	const uint N = get_global_size(0);
	if(n > 0u && n < N - 1u) {
		// note: this reads C[n-1]/C[n+1] in place while other work-items may
		// be writing them; strictly this needs a separate output buffer
		C[n] = (C[n] + C[n] + C[n - 1] + C[n + 1]) * 0.25f;
	}
}

kernel void do_nothing_kernel(global float* C) {
}
So, as we see here, add_kernel and smooth_kernel both run parallelized operations, but they cannot be combined into one kernel, because the smoothing depends on the adding already being done.
Then in main.cpp, we replace your example code with:
#include "opencl.hpp"

//===============
// TIMER
//===============
// get timer start time
inline std::chrono::high_resolution_clock::time_point getStartTime() {
	return std::chrono::high_resolution_clock::now();
}
// get elapsed time in milliseconds
inline double getTimeElapsedMs(std::chrono::high_resolution_clock::time_point start) {
	std::chrono::duration<double> duration = std::chrono::high_resolution_clock::now() - start;
	return duration.count() * 1000.0;
}

//=============
// MAIN
//=============
int main() {
	Device device(select_device_with_most_flops());
	const uint N = 1024u;
	Memory<float> A(device, N);
	Memory<float> B(device, N);
	Memory<float> C(device, N);
	for(uint n = 0u; n < N; n++) {
		A[n] = 3.0f;
		B[n] = 2.0f;
		C[n] = 1.0f;
	}
	A.write_to_device();
	B.write_to_device();
	C.write_to_device();
	print_info("Value before kernel execution: C[0] = " + cl_to_string(C[0]));

	// define kernels:
	Kernel add_kernel(device, N, "add_kernel", A, B, C);
	Kernel smooth_kernel(device, N, "smooth_kernel", C);
	Kernel do_nothing_kernel(device, N, "do_nothing_kernel", C);

	//===============================================
	// LOOP TEST TO SIMULATE PROCESSING AUDIO BUFFER
	//===============================================
	auto startTime = getStartTime();
	const int numSamples = 1024;
	for(int i = 0; i < numSamples; i++) {
		add_kernel.run();
		smooth_kernel.run();
		do_nothing_kernel.run();
	}
	double timeMs = getTimeElapsedMs(startTime);
	//===============================================

	// now hypothetically read output from processing
	C.read_from_device();
	// print result
	print_info("Value after kernel execution: C[0] = " + cl_to_string(C[0]) + ", C[5] = " + cl_to_string(C[5]));
	print_info("Time taken: " + cl_to_string(timeMs) + " ms");
	// takes 80-100 ms to complete, which is too long
	wait();
	return 0;
}
Result
With this operation I am getting roughly 80-100 ms to complete. This is far too long, and it matches the performance in my real-world project, where basic kernel runs add up to 40-120 ms over the course of an audio buffer.
As noted, we have only ~23 ms to complete the buffer.
I note that the majority of the time is spent just on running the kernels (i.e. even do_nothing_kernel takes a very long time). This is presumably scheduling and thread/event coordination, etc.
Idea
In this hypothetical, if we could move the per-sample buffer iteration into the kernel, so we only have to run a single process_full_buffer kernel once, I presume this would be faster.
But within that we would still need sequential processing on the GPU. I.e. the smooth_kernel function cannot run until add_kernel is done for each sample. The smooth and add functions are parallelized internally, but they must run sequentially relative to each other, and each audio sample must also be processed sequentially.
Is there any method other than run(), or any way you can imagine, to enqueue and run, say, 1024 sequential sets of 3-5 kernel operations in one blast? I.e. send thousands of kernel operations for the GPU to coordinate and complete on its own as one batch, then let the CPU thread continue when that is finished, so it doesn't take 80-100 ms just to coordinate and work through it all?
I presume it is all the "start run", "finish run", "trigger event that it's done", etc. that is slowing things down. If I could just hand the GPU a batch of 1024-4096 kernel operations and say "do these in order and tell the program when you're done at the end so it can continue", this would probably solve the problem. But I don't know how.
For example, if we could make a std::vector&lt;Kernel&gt;, add 1024-4096 kernels to it, then run one operation that passes all of these to the GPU to execute in order, with the CPU thread continuing only after the whole batch has concluded.
Is such a thing possible? If so, how can you imagine it?
Thanks
Thanks for the great project and for any ideas. I don't know any OpenCL, yet I picked your project up and, within 2-3 days of porting it over to my actual projects, I could run GPU code. That is a testament to your great design. I just hope there is some way around this kind of lag, or I am not sure what to do.
(Minor Suggestion)
(Note that I renamed all your to_string functions to cl_to_string in my code above, as they were causing confusion when brought over to projects that use std::to_string. Unrelated, but I think this would be an improvement for the project in any case. Perhaps you may wish to consider it.)