[MRG] New ot.gpu with cupy #67
I don't know why, but I obtain much faster CPU times than yours with my i7-6700HQ, GTX 980M, and 32 GB of memory.
Hello @LeoGautheron, this is because my CPU has a low clock frequency (1.5 GHz), so it can't really compete with an i7 ;) and I'm not sure my numpy install properly uses its multiple cores. Also kudos on the GTX, which is clearly better than my old Titan X. Still, note that the time is clearly dominated by uploads to and downloads from GPU memory. We have to be clear in the documentation that the best speed is obtained with matrices that are already cupy arrays. These performances are not very impressive, to be frank, but I think it is important that we provide a working ot.gpu for release 0.5. We can definitely do better, but unless you have kickass optimization tricks to share on short notice, I think we should merge this (after a proper documentation update). What do you think?
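To make the transfer-cost point concrete, here is a minimal sketch of the pattern discussed above: a solver that accepts either numpy or cupy inputs and only copies the result back to the host when asked. The names `to_device`, `to_host`, and `sinkhorn_sketch` are hypothetical helpers written for this illustration, not the functions of this PR, and the code falls back to plain numpy when cupy or a CUDA device is unavailable.

```python
import numpy as np

try:
    import cupy as cp  # optional dependency
    cp.cuda.runtime.getDeviceCount()  # raises if no CUDA device is usable
    HAVE_GPU = True
except Exception:
    cp = None
    HAVE_GPU = False


def to_device(x):
    """Upload x to the GPU if one is available (no copy if already there)."""
    return cp.asarray(x) if HAVE_GPU else np.asarray(x)


def to_host(x):
    """Download x to host memory as a numpy array."""
    return cp.asnumpy(x) if HAVE_GPU else np.asarray(x)


def sinkhorn_sketch(a, b, M, reg, n_iter=200, to_numpy=True):
    """Plain Sinkhorn iterations (hypothetical helper, not ot.gpu.sinkhorn)."""
    xp = cp if HAVE_GPU else np
    # Inputs already on the device are passed through without a transfer.
    a, b, M = to_device(a), to_device(b), to_device(M)
    K = xp.exp(-M / reg)  # Gibbs kernel
    u = xp.ones_like(a)
    for _ in range(n_iter):
        v = b / (K.T @ u)
        u = a / (K @ v)
    G = u[:, None] * K * v[None, :]  # transport plan
    # The download is the step that can be skipped with to_numpy=False.
    return to_host(G) if to_numpy else G


# Small synthetic problem (assumed data, for illustration only).
rng = np.random.default_rng(0)
n = 50
x = rng.normal(size=(n, 2))
y = rng.normal(size=(n, 2)) + 1.0
M = ((x[:, None, :] - y[None, :, :]) ** 2).sum(axis=-1)
M /= M.max()
a = np.full(n, 1.0 / n)
b = np.full(n, 1.0 / n)

# numpy in, numpy out: pays for an upload and a download on every call.
G = sinkhorn_sketch(a, b, M, reg=0.1)

# Device arrays in, device array out: no host/device transfer in the call.
a_d, b_d, M_d = to_device(a), to_device(b), to_device(M)
G_d = sinkhorn_sketch(a_d, b_d, M_d, reg=0.1, to_numpy=False)
```

When several solver calls share the same cost matrix, uploading it once and passing `to_numpy=False` is the usage that avoids the repeated transfers mentioned above.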
I think it's good to go; I see no optimization to do right now.
OK then, thank you for the feedback. We need to provide GPU support for the users (30k+ downloads on pip). And sorry again for your zombie PR; I stole some stuff from it anyway, so not all is lost.
This PR is a cupy implementation of the functions currently implemented in ot.gpu. I also removed all the classes that were deprecated anyway. It still needs properly updated tests, but I like this solution since it stays mostly compatible with the old ot.gpu.
I have received a large number of queries about ot.gpu, but cudamat is not maintained and the problem will only grow, so we need to do something before release 0.5.
This solution is far less elegant than PR #32 by @toto6 with all the decorators, but having a cupy-specific implementation leaves more room for code optimization than a generic implementation, IMHO. This means that we can make it better in the future without compromising the numpy implementation.
Below I give an example of use of the ot.gpu functions with different formats for input/output, i.e. whether they are numpy.array or cupy.array. The output was obtained on my Titan X GPU after two runs of the script in ipython.
The output I have is the following: