Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Enqueue several calls #337

Open
cabezabuque opened this issue Nov 21, 2018 · 13 comments
Open

Enqueue several calls #337

cabezabuque opened this issue Nov 21, 2018 · 13 comments

Comments

@cabezabuque
Copy link

Hi all,
first of all, congrats on a great library!
I am implementing a power method, and I would need to perform iterative calls to matrix vector multiplication (cgemv).
Currently, each call defines an event to notify when computation is finished, and I need to wait for that event, before I can issue the next multiplication.
I am just a newbie, and this might be an extremely stupid question... but:
Would it be possible to call CLBlast methods, defining an event that they need to wait for. What I would like to do is the following:

  1. call cgemv(M, V1, V2, event1); [wait for no events]
  2. call my normalization kernell [waits for event1, generates event2 when finished]
  3. call cgemv(M, V1, V2, event3); [waits for event2, notifies event3 when finished]
  4. call my normalization kernell again [waits for event3, generates event4 when finished]
  5. Keep iterating

I might be wrong, but I believe this could result in much better performance (issue all the commands from the CPU at once, let the events trigger other kernells from within OpenCL).

Anyways, I understand this might not be interesting for anyone, but I just wanted to share this thought...

Cheers!

Diego

@CNugteren
Copy link
Owner

Actually, I don't think this is possible as it is now. Not sure how much gains you will get, because if you wait for the previous call just before the CLBlast API is called, there is not much time lost in between I guess.

The no longer maintained clBLAS does have this feature though, see e.g. here. That is what you are looking for, right? I'll have to think about how much effort it will be to add such a feature to CLBlast as well.

@cabezabuque
Copy link
Author

Wow, that was some quick response time...
Yeah, the feature in clBlas is exactly what I meant: to let CLBlast calls wait for and trigger other kernells, just like any other OpenCL call...
Alternatively, I will imlpement the power method myself in a single kernell and do some profiling to see if I can get better performance. It always surprises me that, as simple as a matrix-vector multiplication looks, I never get close to the performance obtained by implementations of BLAS (what sort of black magic do you guys use??)
Cheers!

@CNugteren
Copy link
Owner

I have thought a little bit about this. I think in general it can be useful to add such a feature as you requested. However, you are so-far still the first to request it. Big problem with implementing this would be that the CLBlast API has to change: all functions will get extra arguments. In the C API this really breaks all existing code, for the C++ API it could be implemented as an optional argument with a default value. The work needed will not be too difficult I guess, but it has to be implemented for each routine so nevertheless it will be quite a bit of work.

What I propose to do for now is that you finalize your implementation using CLBlast. Then, I can add the feature you requested (on a branch of CLBlast) for GEMV only, and you can evaluate if you gain some speed. If it is really significant, I will consider changing the whole C++ API to support this.

@cabezabuque
Copy link
Author

Hi!
just a couple of comments:

  • Adding optional arguments to the methods would not break the C++ source-code but, would it break already compiled applications? I mean, a binary (.exe) would stop working if it replaced the .dll with the newer version of the .dll (an optional argument still gets translated into a call with a value for that extra parameter, and the .exe would expect a call to a function without those parameters at all). I am not sure about this (never had to do this with any .dll I produced), but I think this could be the case...

  • Would it be better to retain the old interface, and simply add new methods (with the events that the operation needs to wait for)? These methods could be placed in a separate .h file (not to affect older applications) and, over time, maybe one of the two interfaces will become deprecated (depending on users' demands/preferences).

In any case, getting access to a modified version of CLBast would be a blast (I could not help the pun, sorry). If I were to ask, it would be great to also get the method cgemm (I populate the matrices I use for multiplication in another kernel, and this feature would allow me to set the whole "pipeline" of operations at once: fillKernel->cgemm->[cgemv->normaliseKernel]xNUM_ITERATIONS)

Thanks a lot for your support. Also, please do not allow this request to become a source of added stress for you...

Cheers!

Diego

@CNugteren
Copy link
Owner

Sorry for the late reply, but I finally found some time to implement something. The branch CLBlast-337-events-wait-list now has this optional argument implemented for the C++ interface. It is done for each routine, but actually only functionally implemented in code for GEMV and GEMM. If you want you can test either of those routines and see if it is indeed correct and whether it can speed your application up.

@gspr
Copy link
Contributor

gspr commented Sep 3, 2021

Sorry for reviving such an old thread, but I've also found myself in a position where running CLBlast functions upon completion of other events (like OpenCL kernels can) would be hugely useful. You're probably overloaded with lots of things to do, but if there's any way I can help out, I'm happy to. I haven't looked at the events-list branch yet, not knowing if it's very stale or not.

The feature is useful for me because I wanna build a pretty big computation graph asynchronously and not have to return from the OpenCL device until it's all done. I guess this could sorta-kinda be emulated by doing CLBlast calls asynchronously in a separate thread on the host side, but it adds lots of complexity that is kind of orthogonal to OpenCL.

@gspr
Copy link
Contributor

gspr commented Sep 6, 2021 via email

@CNugteren
Copy link
Owner

You're probably overloaded with lots of things to do, but if there's any way I can help out, I'm happy to.

CLBlast is not really my priority at the moment, I don't have much free time available for this. But I'm willing to help / answer questions / review and accept a PR.

I see now that the changes in the CLBlast-337-events-wait-list branch are rather mechanical

Indeed, the diff is mostly repetitive. Actually, there is a Python script that generates these files even, so these changes are made automatically. See the diff for details on the changes in the scripts folder: most other changes in C++ files are result of those changes.

and I could probably get to work on updating that branch to current master head.

The good news here is that I'm not sure there is any work, if there is I guess it will be simple to do.

I have a question though: Is the intention not to expose the event waiting in the BLAS-like functions? Or am I reading the code wrong?

I'm not sure what you mean exactly. In my prototype branch I added an argument with the following signature and description: ``const std::vector<cl_event*>& event_wait_list: A list of OpenCL events that need to be executed before this routine's OpenCL kernel(s). This is an optional argument.. So I'm not exposing any internal events, only taking an extra argument. This fits your use-case I guess? The main motivation is that this makes the changes to CLBlast minimal.

But in general I'm not sure we should continue along this path given the extra arguments and such. I don't think it is much work to implement though, but the API changes could impact many users. I'm not sure it will be that helpful in general, although there are now two users already indeed that require this feature...

@gspr : would you be okay with updating that branch so that it is in sync with the latest master and then just using CLBlast from that branch? Or are you using pre-built packages?

@gspr
Copy link
Contributor

gspr commented Sep 8, 2021 via email

@CNugteren
Copy link
Owner

Yes you are right, it might make sense to do this. Please go ahead, and let's for starters then only consider the C++ interface, make a PR for that, and then follow-up with the (more breaking) C interface. Let me know if you need any help with the way things are organized and the Python scripts that generated code. Hopefully my prototype branch can be of some help.

@FreddieWitherden
Copy link

Just as a follow-up we make use of CLBlast in PyFR for our OpenCL backend and would benefit greatly from this.

In terms of the C API (which we use) perhaps one option is to add a new set of functions suffixed by "WithEvents" that take the event wait list, leaving the existing functions unchanged?

Also, happy to test any candidate PR's.

@gspr
Copy link
Contributor

gspr commented Dec 13, 2021 via email

@FreddieWitherden
Copy link

I have not had any time either; but would be able to test any branches. Is the plan to provide additional wrappers for all of the major functions or just the heavyweights from level 3 BLAS?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

4 participants