Description
I am working on applying the OpenCL-Wrapper (which is fantastic, by the way, and thanks) to audio processing.
I have written kernels that are highly parallelized (using Memory&lt;float&gt;), but the kernel operations must be run sequentially many times to process one audio buffer block.
Simulation of Processing Audio Buffer
Audio typically runs at 44100 Hz with an acceptable upper buffer-size limit of 1024 samples. This means we have ~23 ms to process a buffer (~43 buffers per second, i.e. ~43 "fps").
Hypothetically, for simplicity's sake, let's say for this project we have these needed functions in kernel.cpp:
kernel void add_kernel(global float* A, global float* B, global float* C) {
	const uint n = get_global_id(0);
	C[n] = A[n] + B[n];
}

kernel void smooth_kernel(global float* C) { // weighted (2,1,1)/4 average
	const uint n = get_global_id(0);
	const uint N = get_global_size(0);
	if(n > 0u && n < N - 1u) {
		// note: this reads C[n-1]/C[n+1] in place while other work-items may
		// be writing them; strictly this needs a separate output buffer
		C[n] = (C[n] + C[n] + C[n - 1] + C[n + 1]) * 0.25f;
	}
}

kernel void do_nothing_kernel(global float* C) {
}
So, as we see here, add_kernel and smooth_kernel both run parallelized operations, but they cannot be combined into one kernel, because the smoothing depends on the adding already being done.
Then in main.cpp, we replace your example code with:
#include "opencl.hpp"

//===============
// TIMER
//===============
// get timer start time
inline std::chrono::high_resolution_clock::time_point getStartTime() {
	return std::chrono::high_resolution_clock::now();
}
// get elapsed time in milliseconds
inline double getTimeElapsedMs(std::chrono::high_resolution_clock::time_point start) {
	std::chrono::duration<double> duration = std::chrono::high_resolution_clock::now() - start;
	return duration.count() * 1000.0;
}

//=============
// MAIN
//=============
int main() {
	Device device(select_device_with_most_flops());
	const uint N = 1024u;
	Memory<float> A(device, N);
	Memory<float> B(device, N);
	Memory<float> C(device, N);
	for(uint n = 0u; n < N; n++) {
		A[n] = 3.0f;
		B[n] = 2.0f;
		C[n] = 1.0f;
	}
	A.write_to_device();
	B.write_to_device();
	C.write_to_device();
	print_info("Value before kernel execution: C[0] = " + cl_to_string(C[0]));

	// define kernels:
	Kernel add_kernel(device, N, "add_kernel", A, B, C);
	Kernel smooth_kernel(device, N, "smooth_kernel", C);
	Kernel do_nothing_kernel(device, N, "do_nothing_kernel", C);

	//===============================================
	// LOOP TEST TO SIMULATE PROCESSING AUDIO BUFFER
	//===============================================
	auto startTime = getStartTime();
	const int numSamples = 1024;
	for(int i = 0; i < numSamples; i++) {
		add_kernel.run();
		smooth_kernel.run();
		do_nothing_kernel.run();
	}
	double timeMs = getTimeElapsedMs(startTime);
	//===============================================

	// now hypothetically read output from processing
	C.read_from_device();
	// print result
	print_info("Value after kernel execution: C[0] = " + cl_to_string(C[0]) + ", C[5] = " + cl_to_string(C[5]));
	print_info("Time taken: " + cl_to_string(timeMs) + " ms");
	// takes 80-100 ms to complete, which is too long
	wait();
	return 0;
}
Result
With this operation I am getting roughly 80-100 ms to complete. This is far too long, and it matches the performance in my real-world project, where basic kernel runs add up to 40-120 ms over the course of an audio buffer.
As noted, we have only ~23 ms to complete the buffer.
I note that the majority of the time is spent just on running the kernels (i.e. even do_nothing_kernel takes a very long time). This is presumably scheduling and thread/event coordination, etc.
Idea
In this hypothetical, if we could move the per-sample buffer iteration into the kernel, so we only have to run a single process_full_buffer kernel once, I presume this would be faster.
But within that we would still need sequential processing on the GPU. I.e. the smooth_kernel function cannot run until add_kernel is done for each sample. The smooth and add functions are parallelized internally, but they must run sequentially relative to each other, and each audio sample must also be processed sequentially.
Is there any method other than run(), or any way you can imagine, to enqueue and run, say, 1024 sequential sets of 3-5 kernel operations in one blast? I.e. send thousands of kernel operations for the GPU to coordinate and complete on its own as one batch, then let the CPU thread continue when that is finished, so it doesn't take 80-100 ms just to coordinate and work through it all?
I presume it is all the "start run", "finish run", "trigger event that it's done", etc. that is slowing things down. If I could just hand the GPU a batch of 1024-4096 kernel operations and say "do these in order and tell the program when you're done at the end so it can continue", this would probably solve the problem. But I don't know how.
For example, if we could make a std::vector&lt;Kernel&gt;, add 1024-4096 kernels to it, then run one operation that passes all of these to the GPU to execute in order, with the CPU thread continuing only after the whole batch has concluded.
Is such a thing possible? If so, how can you imagine it?
Thanks
Thanks for the great project and for any ideas. I don't know any OpenCL, yet I picked your project up and, within 2-3 days of porting it over to my actual projects, I could run GPU code. That is a testament to your great design. I just hope there is some way around this kind of lag, or I am not sure what to do.
(Minor Suggestion)
(Note that I renamed all your to_string functions to cl_to_string in my code above, as they were causing confusion when brought over to projects that use std::to_string. Unrelated, but I think this would be an improvement for the project in any case. Perhaps you may wish to consider it.)