# Using NumPy efficiently

**Michiel Stock** [email](mailto:michiel.stock@ugent.be)

In [1]:
import numpy as np

## Vectorization

- *Python*: easy to use, but slow for low-level operations such as explicit loops

- *C*: harder to learn and use, but extremely fast!

- *NumPy*: a Python library whose core routines are implemented in C

> Try to avoid for-loops in favor of implementations in pure NumPy (**faster** + **cleaner**)!

### Example: implementing the gradient of logistic loss

$$
\nabla L(w) = \sum_{i=1}^n (y_i - \sigma_i)x_i
$$
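As a short sketch of why this sum can be vectorized: assuming the samples $x_i^\top$ are stacked as the rows of a matrix $X \in \mathbb{R}^{n \times p}$, and $y$ and $\sigma$ are the vectors of all $y_i$ and $\sigma_i$, the sum over samples is a single matrix-vector product:

$$
\sum_{i=1}^n (y_i - \sigma_i)x_i = X^\top (y - \sigma)
$$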

In [2]:
# make some matrices
n, p = 1000, 100

X = np.random.randn(n, p)
y = np.random.binomial(1, 0.4, (n,))
sigma = np.random.rand(n)

In [3]:
def gradient_for_loop():
    grad = np.zeros((p, ))
    for i in range(n):
        xi = X[i,:]
        grad = grad + (y[i] - sigma[i]) * xi
    return grad

In [4]:
def gradient_vectorized():
    grad = X.T @ (y - sigma)
    return grad

In [5]:
gradient_for_loop()[:10]

array([ 21.99389426, -11.57360697,   4.47161575, -13.93187941,
       -28.81296467, -19.97097842,   0.17146261, -32.96477955,
       -16.52864115,   5.38117259])

In [6]:
gradient_vectorized()[:10]

array([ 21.99389426, -11.57360697,   4.47161575, -13.93187941,
       -28.81296467, -19.97097842,   0.17146261, -32.96477955,
       -16.52864115,   5.38117259])

In [7]:
%timeit gradient_for_loop()

100 loops, best of 3: 5.67 ms per loop


In [8]:
%timeit gradient_vectorized()

The slowest run took 334.60 times longer than the fastest. This could mean that an intermediate result is being cached.
10000 loops, best of 3: 33.2 µs per loop
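Outside a notebook, the same comparison can be sketched with `np.allclose` and the standard `timeit` module. The seed, the explicit function arguments and the number of repetitions are assumptions made for this illustration; the timings will differ from machine to machine.

```python
import timeit

import numpy as np

rng = np.random.default_rng(0)  # assumed seed, only for reproducibility
n, p = 1000, 100
X = rng.standard_normal((n, p))
y = rng.binomial(1, 0.4, n)
sigma = rng.random(n)


def gradient_for_loop(X, y, sigma):
    """Accumulate the gradient one sample at a time."""
    grad = np.zeros(X.shape[1])
    for i in range(X.shape[0]):
        grad += (y[i] - sigma[i]) * X[i, :]
    return grad


def gradient_vectorized(X, y, sigma):
    """Compute the same sum as one matrix-vector product."""
    return X.T @ (y - sigma)


grad_loop = gradient_for_loop(X, y, sigma)
grad_vec = gradient_vectorized(X, y, sigma)

# compare with a tolerance: floating-point sums may differ in the last digits
same = np.allclose(grad_loop, grad_vec)
print("identical up to rounding:", same)

# total time for 20 calls of each version
t_loop = timeit.timeit(lambda: gradient_for_loop(X, y, sigma), number=20)
t_vec = timeit.timeit(lambda: gradient_vectorized(X, y, sigma), number=20)
print(f"for-loop: {t_loop:.4f} s, vectorized: {t_vec:.4f} s")
```

Passing `X`, `y` and `sigma` as arguments instead of reading them as globals makes the functions easier to test on other data.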


## Broadcasting

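Arrays that are added or multiplied element-wise in NumPy do not need to have the same shape. When the shapes are compatible (compared from the last dimension backwards, each pair of dimensions is either equal or one of them is 1), the smaller array is *broadcast*: it is virtually repeated along the missing or length-1 dimensions to match the larger one.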

![Example of Broadcasting](Figures/numpy_broadcasting.png)
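A minimal sketch of broadcasting on small arrays may help here. The arrays `A`, `row` and `col` are made up for this illustration.

```python
import numpy as np

A = np.arange(6).reshape(2, 3)   # shape (2, 3): [[0, 1, 2], [3, 4, 5]]
row = np.array([10, 20, 30])     # shape (3,)
col = np.array([[100], [200]])   # shape (2, 1)

B = A + row    # row is repeated for each of the two rows of A
C = A + col    # col is repeated for each of the three columns of A
D = row + col  # (3,) and (2, 1) are both stretched to (2, 3)

print(B)
print(C)
print(D)

# shapes that cannot be matched raise an error instead of guessing
try:
    A + np.array([1, 2])  # (2, 3) and (2,): last dimensions 3 and 2 differ
    incompatible = False
except ValueError:
    incompatible = True
print("incompatible shapes rejected:", incompatible)
```

No copies of `row` or `col` are made: NumPy only behaves as if the smaller array were repeated.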

$$
\nabla^2 L(w) = \sum_{i=1}^n x_i x_i^\top \sigma_i (1-\sigma_i)
$$

In [9]:
def hessian_for_loop():
    hess = np.zeros((p, p))
    for i in range(n):
        xi = X[i,:]
        sigma_i = sigma[i]
        hess = hess + xi.reshape((-1, 1)) @ xi.reshape((1, -1)) * sigma_i * (1 - sigma_i)
    return hess

In [10]:
def hessian_broadcasting():
    hess = (X.T * sigma * (1 - sigma)) @ X
    return hess

In [11]:
hessian_for_loop()[:5,:][:,:5]

array([[ 156.59217927,   -3.4892104 ,    8.66727622,   -1.20859036,
          -5.70330684],
       [  -3.4892104 ,  177.39696315,    4.12844963,   -3.03388487,
           8.54097822],
       [   8.66727622,    4.12844963,  165.31682672,    3.95981756,
          -2.49870471],
       [  -1.20859036,   -3.03388487,    3.95981756,  166.41677289,
           0.92905495],
       [  -5.70330684,    8.54097822,   -2.49870471,    0.92905495,
         169.06622693]])

In [12]:
hessian_broadcasting()[:5,:][:,:5]

array([[ 156.59217927,   -3.4892104 ,    8.66727622,   -1.20859036,
          -5.70330684],
       [  -3.4892104 ,  177.39696315,    4.12844963,   -3.03388487,
           8.54097822],
       [   8.66727622,    4.12844963,  165.31682672,    3.95981756,
          -2.49870471],
       [  -1.20859036,   -3.03388487,    3.95981756,  166.41677289,
           0.92905495],
       [  -5.70330684,    8.54097822,   -2.49870471,    0.92905495,
         169.06622693]])

In [13]:
%timeit hessian_for_loop()

10 loops, best of 3: 52.4 ms per loop


In [14]:
%timeit hessian_broadcasting()

100 loops, best of 3: 1.07 ms per loop
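To see why the broadcast version gives the same matrix: `X.T` has shape `(p, n)` and the weights `sigma * (1 - sigma)` have shape `(n,)`, so broadcasting scales the column of `X.T` belonging to sample `i` by its weight. This matches $X^\top D X$ with $D$ the diagonal matrix of weights, without ever building the $n \times n$ matrix. The sketch below checks this on small random data; the sizes, the seed and the `np.einsum` variant are assumptions added for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)  # assumed seed
n, p = 50, 4                    # small sizes, chosen for the illustration
X = rng.standard_normal((n, p))
sigma = rng.random(n)
w = sigma * (1 - sigma)         # per-sample weights, shape (n,)

# (p, n) * (n,) broadcasts over the last axis, then (p, n) @ (n, p) -> (p, p)
hess_broadcast = (X.T * w) @ X

# reference: explicit diagonal matrix (needs O(n^2) memory)
hess_diag = X.T @ np.diag(w) @ X

# alternative: write the weighted sum of outer products as index notation
hess_einsum = np.einsum("i,ij,ik->jk", w, X, X)

print(np.allclose(hess_broadcast, hess_diag))
print(np.allclose(hess_broadcast, hess_einsum))
```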


## Memory use

Initializing a matrix with `np.ones`, `np.zeros`, `np.random.rand`, etc., or creating a new matrix as the result of an expression, **allocates new memory**.

> `x = x + v  #  make NEW matrix x`

> `x += v  #  update elements of x`

or, with the same effect on `x` (although the right-hand side is first computed into a temporary array),

> `x[:] = x + v  #  update elements of x`
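The difference between the three forms can be made visible by keeping a second name for the original array. This is a minimal sketch; the names `alias`, `alias2` and `alias3` are made up for the illustration.

```python
import numpy as np

v = np.ones(3)

# 1) x = x + v: a new array is allocated and the name x is rebound to it
x = np.zeros(3)
alias = x
x = x + v
rebound = x is not alias          # x now refers to a different array
print(rebound, alias)             # alias still holds the old zeros

# 2) x += v: the existing array is modified, no rebinding
x2 = np.zeros(3)
alias2 = x2
x2 += v
print(x2 is alias2, alias2)       # same object, values updated

# 3) x[:] = x + v: x + v creates a temporary, which is then copied into x
x3 = np.zeros(3)
alias3 = x3
x3[:] = x3 + v
print(x3 is alias3, alias3)       # same object, values updated
```

In-place updates are visible through every other name or view of the same array, which is worth keeping in mind when an array is passed into a function.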

In [15]:
def replace_matrix(n_steps=50, size=(5000, 5000)):
    x = np.zeros(size)
    for i in range(n_steps):
        x = x + 1  # new matrix every step
    return x

In [16]:
def inplace_matrix(n_steps=50, size=(5000, 5000)):
    x = np.zeros(size)
    for i in range(n_steps):
        x += 1  # update elements IN matrix
    return x

In [17]:
%timeit replace_matrix()

1 loop, best of 3: 7.22 s per loop


In [18]:
%timeit inplace_matrix()

1 loop, best of 3: 1.33 s per loop
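For expressions that are more involved than a single `+=`, NumPy's universal functions accept an `out` argument that writes the result into an existing array. The sketch below assumes a small array and computes `2 * x + 1` without allocating a new full-size array; the size and the name `buf` are chosen for illustration only.

```python
import numpy as np

x = np.zeros((200, 200))
buf = x  # second name, to check that no new array replaces x

for _ in range(5):
    np.add(x, 1, out=x)   # same effect as x += 1

# x = 2 * x + 1, done step by step inside the existing buffer
np.multiply(x, 2, out=x)
np.add(x, 1, out=x)

print(x is buf)           # still the same array object
print(x[0, 0])
```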
