
High performance variants #3

Open
AlbertDeFusco opened this Issue Aug 3, 2018 · 4 comments

@AlbertDeFusco
Owner

AlbertDeFusco commented Aug 3, 2018

@seibert suggests cupy

@sklam suggests Numba
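
For a rough sense of what the Numba route could look like, here is a hypothetical sketch (not tied to this project's code); @njit compiles the explicit Python loops to machine code:

from numba import njit
import numpy as np

# Hypothetical sketch: @njit compiles plain Python loops to machine code,
# so a hand-written kernel avoids NumPy's intermediate temporary arrays.
@njit
def mse_loss(A, x, y):
    n, d = A.shape
    total = 0.0
    for i in range(n):
        pred = 0.0
        for j in range(d):
            pred += A[i, j] * x[j]
        diff = pred - y[i]
        total += diff * diff
    return total / n

A = np.random.randn(1000, 50)
x = np.random.rand(50)
y = A @ x
print(mse_loss(A, x, y))  # first call triggers compilation; later calls run natively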

@stsievert


stsievert commented Aug 3, 2018

One performance enhancement that most deep learning frameworks use is automatic differentiation. That is, instead of manually coding the gradient, they use the chain rule and the derivatives of simple functions (e.g., all operators, exp, sin, dot) to construct a graph that computes the derivative. This graph can be executed in parallel, which is where the performance comes from. I got curious a while back and timed a NumPy hard-coded gradient against an automatically calculated gradient: https://stsievert.com/blog/2017/09/07/pytorch/#speed

There is an implementation of autograd for NumPy available at https://github.com/HIPS/autograd. It'd probably be the easiest to integrate, although I'm not certain how well it plays with CuPy or how performant it is.
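
For a concrete sense of the chain-rule idea, here is a minimal sketch with HIPS/autograd (the function f below is made up purely for illustration):

import autograd
import autograd.numpy as anp
import numpy as np

# f(x) = sum(exp(sin(x))); the chain rule gives df/dx = exp(sin(x)) * cos(x)
def f(x):
    return anp.sum(anp.exp(anp.sin(x)))

x = np.random.rand(5)
g_auto = autograd.grad(f)(x)              # autograd traces the graph and applies the chain rule
g_manual = np.exp(np.sin(x)) * np.cos(x)  # hand-derived gradient for comparison
print(np.allclose(g_auto, g_manual))      # True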

@stsievert


stsievert commented Aug 5, 2018

And HIPS/autograd is highly performant too. It comes close to PyTorch performance (I've updated the timings on my blog accordingly). Here's a quick example:

import autograd
import autograd.numpy as anp
import numpy as np

n, d = int(10e3), 1000
A = np.random.randn(n, d)
x = np.random.rand(d)
y = A @ x
x_hat = np.random.rand(d)

def loss(x):
    return ((anp.dot(A, x) - y)**2).mean()

def grad_hardcoded(A, x, y):
    return (2*A.T@(A@x - y)) / len(y)

In  [1]: %time g1 = autograd.grad(loss)(x_hat)
CPU times: user 23.8 ms, sys: 2.09 ms, total: 25.9 ms
Wall time: 13.7 ms

In  [2]: %time g2 = grad_hardcoded(A, x_hat, y)
CPU times: user 72 ms, sys: 45.1 ms, total: 117 ms
Wall time: 103 ms

The autograd version is about 8 times faster here, which lines up with my machine having 8 cores.

@timsetsfire


timsetsfire commented Sep 27, 2018

@stsievert - Just curious whether you still see the same improvement when you move the multiplication by 2 outside of the parentheses in the hardcoded gradient:

def grad_hardcoded(A, x, y):
    return 2*(A.T@(A@x - y)) / len(y)

On my old MacBook Pro, the hardcoded method was a little bit faster.
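
If it matters, the difference presumably comes from where the scalar multiply lands: 2*A.T evaluates first and materializes a scaled copy of the whole d x n matrix before the matrix-vector product, while 2*(...) only scales the final length-d vector. A quick sketch of the two variants (dimensions borrowed from the example above, timings omitted):

import numpy as np

n, d = int(10e3), 1000
A = np.random.randn(n, d)
x = np.random.rand(d)
y = A @ x

# Variant 1: (2*A.T) is evaluated first, allocating and scaling a full
# d x n temporary before the matrix-vector product.
g1 = (2*A.T@(A@x - y)) / len(y)

# Variant 2: the products run on the original A; only the final
# length-d vector is scaled by 2.
g2 = 2*(A.T@(A@x - y)) / len(y)

print(np.allclose(g1, g2))  # same values, different temporaries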

@stsievert


stsievert commented Sep 27, 2018

Huh. NumPy is a bit faster, even when I jack up the dimension.

In [4]: import autograd
   ...: import autograd.numpy as anp
   ...: import numpy as np
   ...:
   ...: n, d = int(100e3), 1000
   ...: A = np.random.randn(n, d)
   ...: x = np.random.rand(d)
   ...: y = A @ x
   ...: x_hat = np.random.rand(d)
   ...:
   ...: def loss(x):
   ...:     return ((anp.dot(A, x) - y)**2).mean()
   ...:
   ...: def grad_hardcoded(A, x, y):
   ...:     return 2*(A.T@(A@x - y)) / len(y)
   ...:
   ...:

In [5]: %time g1 = autograd.grad(loss)(x_hat)
CPU times: user 229 ms, sys: 8.29 ms, total: 237 ms
Wall time: 135 ms

In [6]: %time g2 = grad_hardcoded(A, x_hat, y)
CPU times: user 214 ms, sys: 3.7 ms, total: 218 ms
Wall time: 116 ms

In [7]: np.allclose(g1, g2)
Out[7]: True

(I've verified that the original form, (2*A.T@(A@x - y)) / len(y), is still slower than autograd: 525 ms hardcoded vs. 139 ms with autograd.)
