Hgemm #77
Done.
for (i = 0; i < batchCount; i++) {
  cuda_wait(A[i], CUDA_WAIT_READ);
  cuda_wait(B[i], CUDA_WAIT_READ);
  cuda_wait(C[i], CUDA_WAIT_READ|CUDA_WAIT_WRITE);
I think the main use case is that the lists A, B and C are different "views" of the same GPU tensor. Would it be useful to have a parameter that indicates this, so that the wait happens once before the loop?
It may be useful, but it would be too complex to communicate in a non-annoying way through the interface. Since I don't see any calls to this function anywhere, I also fail to see a use case for now.
If we have a use case where A, B and C are essentially 3D tensors on which we perform gemm slice by slice, a different interface would be much faster. Otherwise I can look at the problem and try to come up with a reasonable interface, if possible.
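To make the "different interface" idea concrete, here is a minimal sketch of what a strided-batch entry point for the 3D case might look like. The function name `gemm_strided_batch_ref`, its parameter list, and the row-major layout are all assumptions made for illustration, not part of this PR or of the library. Plain CPU loops stand in for the BLAS call so the example is self-contained.

```c
#include <stddef.h>

/*
 * Hypothetical strided-batch interface (an assumption, not the PR's API):
 * A, B and C are each one contiguous row-major 3D tensor, and batch b
 * starts at base + b * stride. Computes C[b] = alpha * A[b] * B[b] + beta * C[b].
 */
void gemm_strided_batch_ref(size_t M, size_t N, size_t K,
                            float alpha,
                            const float *A, size_t strideA,
                            const float *B, size_t strideB,
                            float beta,
                            float *C, size_t strideC,
                            size_t batchCount)
{
    /* With one buffer per operand, a GPU implementation could wait on
       A, B and C once here instead of once per batch entry. */
    for (size_t b = 0; b < batchCount; b++) {
        const float *a  = A + b * strideA;  /* M x K slice */
        const float *bm = B + b * strideB;  /* K x N slice */
        float *c        = C + b * strideC;  /* M x N slice */

        for (size_t i = 0; i < M; i++) {
            for (size_t j = 0; j < N; j++) {
                float acc = 0.0f;
                for (size_t k = 0; k < K; k++)
                    acc += a[i * K + k] * bm[k * N + j];
                c[i * N + j] = alpha * acc + beta * c[i * N + j];
            }
        }
    }
}
```

Because each operand is a single buffer, the caller passes three pointers and three strides rather than three pointer lists, which is what would let the synchronization move out of the loop.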
In Theano, we support only that 3D case, so adding support for it makes sense. I like your idea of another method for that case. This can be done later.
Add support for float16 data whenever the blas library supports it.