AlbertDeFusco/ScratchNet



High performance variants #3

Open
opened this Issue Aug 3, 2018 · 4 comments

3 participants
Owner

AlbertDeFusco commented Aug 3, 2018

@seibert suggests CuPy; @sklam suggests Numba.
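For concreteness, here is a minimal pure-NumPy sketch of the kind of gradient kernel both suggestions would target (assuming the least-squares loss discussed in the comments below). The idea behind either suggestion is that this same code could be JIT-compiled with `numba.njit`, or run on the GPU by swapping `np` for `cupy`, since both aim to be drop-in for array code like this:

```python
import numpy as np

def grad(A, x, y):
    """Gradient of loss(x) = mean((A @ x - y)**2) with respect to x."""
    r = A @ x - y                   # residual
    return 2.0 * (A.T @ r) / y.size

rng = np.random.default_rng(0)
n, d = 1000, 50
A = rng.standard_normal((n, d))
x_true = rng.random(d)
y = A @ x_true                      # residual is zero at x_true,
g = grad(A, x_true, y)              # so the gradient there is ~0
```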

stsievert commented Aug 3, 2018 • edited

One performance enhancement that most deep learning frameworks use is automatic differentiation. That is, instead of manually coding `gradient`, they use the chain rule and the derivatives of simple functions (e.g., all operators, `exp`, `sin`, `dot`) to construct a graph that computes the derivative. This graph can be executed in parallel, which is where the performance comes from.

I got curious a while back and timed a NumPy hard-coded gradient against an automatically calculated gradient: https://stsievert.com/blog/2017/09/07/pytorch/#speed

There is an implementation of autograd for NumPy available at https://github.com/HIPS/autograd. It'd probably be the easiest to integrate, although I'm not certain how well it plays with CuPy or how performant it is.
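To illustrate what such a graph looks like, here is a toy reverse-mode autodiff sketch in plain Python. This is illustrative only, not HIPS/autograd's actual API; `Var` and `exp` are made-up names. Each operation records its inputs and a local derivative, and `backward` walks the graph applying the chain rule:

```python
import math

class Var:
    """A value that remembers which values produced it."""
    def __init__(self, value, parents=()):
        self.value = value
        self.parents = parents   # pairs of (parent_var, local_derivative)
        self.grad = 0.0

    def __mul__(self, other):
        # d(a*b)/da = b, d(a*b)/db = a
        return Var(self.value * other.value,
                   [(self, other.value), (other, self.value)])

    def backward(self, seed=1.0):
        # Chain rule: accumulate seed * local derivative into each parent.
        self.grad += seed
        for parent, local in self.parents:
            parent.backward(seed * local)

def exp(v):
    e = math.exp(v.value)
    return Var(e, [(v, e)])   # d(e^v)/dv = e^v

x = Var(2.0)
y = exp(x) * x   # y = x * e^x, so dy/dx = (1 + x) * e^x
y.backward()
print(x.grad)    # 3 * e^2
```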

stsievert commented Aug 5, 2018

And HIPS/autograd is highly performant too: it comes close to PyTorch performance (I updated the timings on my blog for this). Here's a quick example:

```python
import autograd
import autograd.numpy as anp
import numpy as np

n, d = int(10e3), 1000
A = np.random.randn(n, d)
x = np.random.rand(d)
y = A @ x
x_hat = np.random.rand(d)

def loss(x):
    return ((anp.dot(A, x) - y)**2).mean()

def grad_hardcoded(A, x, y):
    return (2*A.T@(A@x - y)) / len(y)
```

```
In [1]: %time g1 = autograd.grad(loss)(x_hat)
CPU times: user 23.8 ms, sys: 2.09 ms, total: 25.9 ms
Wall time: 13.7 ms

In [2]: %time g2 = grad_hardcoded(A, x_hat, y)
CPU times: user 72 ms, sys: 45.1 ms, total: 117 ms
Wall time: 103 ms
```

This is about 8 times faster because my machine has 8 cores.

timsetsfire commented Sep 27, 2018

@stsievert - Just curious if you still see the same improvement when you move the multiplication by 2 outside of the parentheses in the hardcoded gradient.

```python
def grad_hardcoded(A, x, y):
    return 2*(A.T@(A@x - y)) / len(y)
```

On my old MacBook Pro, the hardcoded method was a little bit faster.

stsievert commented Sep 27, 2018

Huh. NumPy is a bit faster, even when I jack up the dimension.

```
In [4]: import autograd
   ...: import autograd.numpy as anp
   ...: import numpy as np
   ...:
   ...: n, d = int(100e3), 1000
   ...: A = np.random.randn(n, d)
   ...: x = np.random.rand(d)
   ...: y = A @ x
   ...: x_hat = np.random.rand(d)
   ...:
   ...: def loss(x):
   ...:     return ((anp.dot(A, x) - y)**2).mean()
   ...:
   ...: def grad_hardcoded(A, x, y):
   ...:     return 2*(A.T@(A@x - y)) / len(y)

In [5]: %time g1 = autograd.grad(loss)(x_hat)
CPU times: user 229 ms, sys: 8.29 ms, total: 237 ms
Wall time: 135 ms

In [6]: %time g2 = grad_hardcoded(A, x_hat, y)
CPU times: user 214 ms, sys: 3.7 ms, total: 218 ms
Wall time: 116 ms

In [7]: np.allclose(g1, g2)
Out[7]: True
```

(I've verified that `(2*A.T@(A@x - y)) / len(y)` is still slower than autograd; 525 ms hardcoded, 139 ms with autograd.)
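A plausible explanation for the gap (an assumption on my part, not something profiled here): since `*` and `@` have equal precedence and bind left-to-right in Python, `2*A.T@r` parses as `(2*A.T) @ r`, which materializes a scaled copy of the entire matrix before the matvec, while `2*(A.T@r)` does the matvec first and scales only the length-`d` result:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 2000, 100
A = rng.standard_normal((n, d))
r = rng.standard_normal(n)

slow = (2 * A.T) @ r   # scales all n*d entries of A.T, then the matvec
fast = 2 * (A.T @ r)   # matvec first, then scales only d entries

assert np.allclose(slow, fast)   # same result, different amount of work
```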