# Using NumPy efficiently

**Michiel Stock** [email](mailto:michiel.stock@ugent.be)

In [1]:
import numpy as np

## Vectorization

- *Python*: easy to use, but slow when loops run in the interpreter

- *C*: much harder to learn and use, but extremely fast!

- *NumPy* is a Python library whose core is implemented in C

- (but check out [Julia](https://julialang.org) for an easy-to-learn but very fast language!)

> Try to avoid for-loops in favor of implementations in pure NumPy (**faster** + **cleaner**)!

### Example: implementing the gradient of logistic loss

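$$
\nabla L(w) = \sum_{i=1}^n (y_i - \sigma_i)\mathbf{x}_i
$$

Here $\sigma_i = \sigma(w^\top \mathbf{x}_i)$ is the predicted probability for instance $i$ and $y_i \in \{0, 1\}$ is its label. In the cells below, random values stand in for both, since only the computation itself is being timed.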
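The sum can also be written as a single matrix-vector product, which is the form a vectorized implementation computes. The notation here is an assumption chosen to match the code: $X$ is the $n \times p$ matrix whose $i$-th row is $\mathbf{x}_i^\top$, and $\mathbf{y}$ and $\boldsymbol{\sigma}$ stack the $y_i$ and $\sigma_i$ into vectors of length $n$.

$$
\nabla L(w) = \sum_{i=1}^n (y_i - \sigma_i)\mathbf{x}_i = X^\top (\mathbf{y} - \boldsymbol{\sigma})
$$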

In [2]:
# simulate a data matrix X, binary labels y and predicted probabilities sigma
n, p = 1000, 100

X = np.random.randn(n, p)
y = np.random.binomial(1, 0.4, (n,))
sigma = np.random.rand(n)

In [3]:
def gradient_for_loop():
    grad = np.zeros((p, ))
    for i in range(n):
        xi = X[i,:]
        grad = grad + (y[i] - sigma[i]) * xi
    return grad

In [4]:
def gradient_vectorized():
    grad = X.T @ (y - sigma)
    return grad

In [5]:
gradient_for_loop()[:10]

array([ -8.26660256, -21.09310725,  31.37777826,  15.78809404,
        37.88198881,  32.19607002, -13.12624079,  24.26747211,
       -11.8339038 ,   1.32385632])

In [6]:
gradient_vectorized()[:10]

array([ -8.26660256, -21.09310725,  31.37777826,  15.78809404,
        37.88198881,  32.19607002, -13.12624079,  24.26747211,
       -11.8339038 ,   1.32385632])

In [7]:
%timeit gradient_for_loop()

4.34 ms ± 80.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


In [8]:
%timeit gradient_vectorized()

12.2 µs ± 832 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
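The timings above suggest a speed-up of roughly 350 times (4.34 ms versus 12.2 µs) on this machine. The printed prefixes look identical, but comparing the full vectors numerically is a safer check. The following is a minimal sketch that re-creates the data so that it runs on its own; the seeded generator `rng` is an assumption for reproducibility and is not part of the original cells.

```python
import numpy as np

rng = np.random.default_rng(0)  # assumption: fixed seed, not in the original
n, p = 1000, 100
X = rng.standard_normal((n, p))
y = rng.binomial(1, 0.4, size=n)
sigma = rng.random(n)

# loop version: accumulate one scaled row of X at a time
grad_loop = np.zeros(p)
for i in range(n):
    grad_loop += (y[i] - sigma[i]) * X[i, :]

# vectorized version: a single matrix-vector product
grad_vec = X.T @ (y - sigma)

# compare with a tolerance, because the two versions sum in a different order
same = np.allclose(grad_loop, grad_vec)
print(same)
```

`np.allclose` is preferable to `==` here: floating-point sums taken in a different order can differ in the last digits.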


## Broadcasting

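Arrays that are added or multiplied element-wise in NumPy do not need to have the same shape. When the shapes are compatible, NumPy *broadcasts* the smaller array, virtually repeating it along the missing or length-one axes so that it matches the larger one.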

![Example of Broadcasting](Figures/numpy_broadcasting.png)
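A small sketch of the rule, using toy arrays that are not part of the original notebook: shapes are compared from the last axis backwards, and two axes are compatible when they are equal or when one of them is 1 (or absent).

```python
import numpy as np

A = np.arange(6).reshape(3, 2)    # shape (3, 2): [[0, 1], [2, 3], [4, 5]]
row = np.array([10, 20])          # shape (2,)
col = np.array([[1], [2], [3]])   # shape (3, 1)

B = A + row    # row is added to every row of A           -> shape (3, 2)
C = A * col    # row i of A is multiplied by col[i, 0]     -> shape (3, 2)
D = row + col  # (2,) and (3, 1) are both stretched        -> shape (3, 2)

# incompatible shapes raise an error instead of silently guessing
try:
    A + np.ones(3)                # trailing axes are 2 and 3: no match
    incompatible_raised = False
except ValueError:
    incompatible_raised = True

print(B.shape, C.shape, D.shape, incompatible_raised)
```

No copies of `row` or `col` are made to fill the larger shape; the repetition is only virtual, which is why broadcasting is cheap.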

### Example: implementing the Hessian of logistic loss

$$
\nabla^2 L(w) = \sum_{i=1}^n \mathbf{x}_i \mathbf{x}_i^\top \sigma_i (1-\sigma_i)
$$
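In matrix form, the weighted sum of outer products becomes a product with a diagonal weight matrix. The notation is again an assumption chosen to match the code: $X$ is the $n \times p$ matrix whose $i$-th row is $\mathbf{x}_i^\top$.

$$
\nabla^2 L(w) = X^\top \operatorname{diag}\big(\sigma_1(1-\sigma_1), \ldots, \sigma_n(1-\sigma_n)\big)\, X
$$

Multiplying by a diagonal matrix only rescales rows or columns, so the $n \times n$ matrix never has to be built explicitly.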

In [9]:
def hessian_for_loop():
    hess = np.zeros((p, p))
    for i in range(n):
        xi = X[i,:]
        sigma_i = sigma[i]
        hess = hess + xi.reshape((-1, 1)) @ xi.reshape((1, -1)) * sigma_i * (1 - sigma_i)
    return hess

In [10]:
def hessian_broadcasting():
    hess = (X.T * sigma * (1 - sigma)) @ X
    return hess

In [11]:
hessian_for_loop()[:5, :5]

array([[159.08673741,   4.31189154,   3.10666987,   3.79068212,
         -5.84216841],
       [  4.31189154, 152.8870229 ,  -0.29741427,   5.29005101,
         -5.2376922 ],
       [  3.10666987,  -0.29741427, 162.21631251,   5.38616554,
         -7.15929111],
       [  3.79068212,   5.29005101,   5.38616554, 156.20236008,
          8.40740344],
       [ -5.84216841,  -5.2376922 ,  -7.15929111,   8.40740344,
        164.47386499]])

In [12]:
hessian_broadcasting()[:5, :5]

array([[159.08673741,   4.31189154,   3.10666987,   3.79068212,
         -5.84216841],
       [  4.31189154, 152.8870229 ,  -0.29741427,   5.29005101,
         -5.2376922 ],
       [  3.10666987,  -0.29741427, 162.21631251,   5.38616554,
         -7.15929111],
       [  3.79068212,   5.29005101,   5.38616554, 156.20236008,
          8.40740344],
       [ -5.84216841,  -5.2376922 ,  -7.15929111,   8.40740344,
        164.47386499]])

In [13]:
%timeit hessian_for_loop()

17.8 ms ± 319 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


In [14]:
%timeit hessian_broadcasting()

292 µs ± 3.96 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
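Why the broadcasting version works: `X.T` has shape `(p, n)` and the weights have shape `(n,)`, so the weights are stretched across the rows of `X.T` and scale column `i` (instance `i`) by its weight. The sketch below checks this against an explicit diagonal matrix and against a sum of outer products. It uses smaller, seeded data than the cells above, and the helper name `w` is an assumption introduced for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)   # assumption: fixed seed, not in the original
n, p = 200, 5                    # smaller than in the text, to keep np.diag cheap
X = rng.standard_normal((n, p))
sigma = rng.random(n)
w = sigma * (1 - sigma)          # per-instance weights, shape (n,)

# broadcasting: (p, n) * (n,) scales column i of X.T by w[i]
hess_broadcast = (X.T * w) @ X

# same product with an explicit n-by-n diagonal matrix (needs O(n^2) memory)
hess_diag = X.T @ np.diag(w) @ X

# reference: weighted sum of outer products, one instance at a time
hess_loop = sum(w[i] * np.outer(X[i], X[i]) for i in range(n))

print(np.allclose(hess_broadcast, hess_diag),
      np.allclose(hess_broadcast, hess_loop))
```

The diagonal version gives the same result but stores $n^2$ numbers of which only $n$ are nonzero, so the broadcasting form is the one to prefer for large $n$.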


## Memory use

Creating an array with `np.ones`, `np.zeros`, `np.random.rand`, etc., or producing a new array as the result of an operation, **allocates memory**, and every allocation also costs time.

> `x = x + v  #  make NEW matrix x`

> `x += v  #  update elements of x`

or, with the same effect on `x`,

> `x[:] = x + v  #  update elements of x (the right-hand side still creates a temporary array)`
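The difference between the three forms becomes visible when a second name refers to the same array. A minimal sketch with toy arrays; the name `alias` is introduced here for illustration only.

```python
import numpy as np

v = np.ones(3)

# 1) x = x + v rebinds the name x to a NEW array
x = np.zeros(3)
alias = x                            # second name for the same array
x = x + v
rebound_is_new = x is not alias      # x now points to a different object
alias_after_rebind = alias.copy()    # the old array is unchanged (all zeros)

# 2) x += v writes into the existing array
x = np.zeros(3)
alias = x
x += v
inplace_same_object = x is alias     # still the same object
alias_after_inplace = alias.copy()   # alias sees the update (all ones)

# 3) x[:] = x + v builds a temporary, then copies it into the existing array
x = np.zeros(3)
alias = x
x[:] = x + v
slice_same_object = x is alias
alias_after_slice = alias.copy()     # alias sees the update (all ones)

print(rebound_is_new, inplace_same_object, slice_same_object)
```

Forms 2 and 3 both leave `x` as the same object, but only `x += v` avoids the temporary array for `x + v`.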

In [15]:
def replace_matrix(n_steps=50, size=(5000, 5000)):
    x = np.zeros(size)
    for i in range(n_steps):
        x = x + 1  # new matrix every step
    return x

In [16]:
def inplace_matrix(n_steps=50, size=(5000, 5000)):
    x = np.zeros(size)
    for i in range(n_steps):
        x += 1  # update elements IN matrix
    return x

In [17]:
%timeit replace_matrix()

2.61 s ± 70.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [18]:
%timeit inplace_matrix()

853 ms ± 71.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
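In these timings the in-place version is about three times faster (853 ms versus 2.61 s). Augmented assignment covers simple updates; for other operations, NumPy's universal functions accept an `out` argument that writes the result into an existing array. A small sketch on a toy array, with sizes chosen for illustration:

```python
import numpy as np

x = np.zeros((1000, 1000))
buffer_before = x.ctypes.data    # memory address of the data buffer

for _ in range(5):
    np.add(x, 1, out=x)          # same effect as x += 1

np.multiply(x, 2, out=x)         # same effect as x *= 2
np.sqrt(x, out=x)                # no augmented-assignment form exists for this

buffer_after = x.ctypes.data
print(buffer_before == buffer_after, x[0, 0])
```

The data buffer keeps the same address throughout, so no new array of this size is allocated by any of the steps.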
