## Numba

Numba is used mainly through two decorators:
- @jit
- @njit
    - Equivalent to @jit(nopython=True)
    
Prefer @njit over a bare @jit: in Numba versions where @jit can silently fall back to object mode, it can sometimes make the code slower.

In [1]:
import numpy as np

#### Python version

In [2]:
def unit_vector(arr):
    
    sz = arr.shape[0]
    uvec = np.zeros(sz, dtype=np.float64)
    norm = 0
    
    for i in range(sz):
        norm += arr[i] ** 2
    norm = norm ** 0.5
    
    for j in range(sz):
        uvec[j] = arr[j] / norm
        
    return uvec

In [3]:
vec = np.random.rand(20_000_000)

In [4]:
%%timeit
_ = unit_vector(vec)

10.7 s ± 28.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


#### Numba version

In [5]:
from numba import jit, njit 

In [6]:
@njit
def unit_vector_nb(arr):
    
    sz = arr.shape[0]
    uvec = np.zeros(sz, dtype=np.float64)
    norm = 0
    
    for i in range(sz):
        norm += arr[i] ** 2
    norm = norm ** 0.5
    
    for j in range(sz):
        uvec[j] = arr[j] / norm
        
    return uvec

In [7]:
# Compilation run
_ = unit_vector_nb(vec[0:2])

In [8]:
%%timeit
_ = unit_vector_nb(vec)

66.8 ms ± 1.09 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)


#### NumPy support

##### Example 1

In [9]:
mat = np.random.rand(10000, 10000)

In [10]:
@jit
def test_mean1(mat):
    return np.mean(mat, axis=0)

@njit
def test_mean2(mat):
    rows, cols = mat.shape
    mean = np.zeros(cols, dtype=np.float64)
    
    # One mean per column, so iterate over columns (not rows)
    for i in range(cols):
        mean[i] = np.mean(mat[:, i])
        
    return mean

In [11]:
# Left commented out: in nopython mode Numba supports np.mean only without the axis argument
#_ = test_mean1(mat)

In [12]:
_ = test_mean2(mat)

In [13]:
%%timeit
_ = np.mean(mat, axis=0)

43.4 ms ± 260 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)


In [14]:
%%timeit
_ = test_mean2(mat)

1.37 s ± 14.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


- In this case, the `Numba` version runs slower than the `NumPy` one.

##### Example 2

In [15]:
@njit
def simple_mean(mat):
    return np.mean(mat)

In [16]:
_ = simple_mean(mat)

In [17]:
%%timeit
_ = np.mean(mat)

115 ms ± 940 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)


In [18]:
%%timeit
_ = simple_mean(mat)

75.2 ms ± 68.4 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)


In this case, the `Numba` version runs faster than the `NumPy` one.

### Multithreading

Numba supports multithreading through a threading layer such as OpenMP (`omp`); TBB and a built-in workqueue are the other available layers.

In [19]:
import numba
from numba import prange, set_num_threads

In [20]:
%%timeit 
_ = unit_vector_nb(vec)

66.4 ms ± 631 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)


In [21]:
numba.config.NUMBA_DEFAULT_NUM_THREADS

16

In [22]:
@njit(parallel=True)
def unit_vector_nbp(arr):
    
    sz = arr.shape[0]
    uvec = np.zeros(sz, dtype=np.float64)
    norm = 0
    
    for i in prange(sz):  # range replaced by prange
        norm += arr[i] ** 2
    norm = norm ** 0.5
    
    for j in prange(sz):  # range replaced by prange
        uvec[j] = arr[j] / norm
        
    return uvec

In [23]:
_ = unit_vector_nbp(vec[0:2])

In [24]:
%%timeit
_ = unit_vector_nbp(vec)

40.5 ms ± 543 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)


In [25]:
%%timeit
set_num_threads(2)
_ = unit_vector_nbp(vec)

43.4 ms ± 4.33 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)


### Caching

- Caching is only possible for `self-contained` modules: they cannot depend on other packages compiled by LLVM (Numba).
- The cached code is stored in the `__pycache__` folder.
- Since the code is JIT-compiled and optimized for the machine it ran on, moving it to another PC causes it to be recompiled and saved again.
- Caching uses pickle under the hood.

In [26]:
@njit(parallel=True, cache=True)
def unit_vector_nbpi(arr):
    
    sz = arr.shape[0]
    uvec = np.zeros(sz, dtype=np.float64)
    norm = 0
    
    for i in prange(sz):
        norm += arr[i] ** 2
    norm = norm ** 0.5
    
    for j in prange(sz):
        uvec[j] = arr[j] / norm
        
    return uvec

In [27]:
_ = unit_vector_nbpi(vec[0:1])

#### Watch out: you don't see what happens under the hood

In [28]:
@njit
def sum_one(var):
    return var + 1

In [29]:
_ = sum_one(1)

In [30]:
int32 = 2 ** 32 // 2 - 1
int64 = 2 ** 64 // 2 - 1

print(f'Int32: {int32:,}\nInt64: {int64:,}')

Int32: 2,147,483,647
Int64: 9,223,372,036,854,775,807


In [31]:
sum_one(int32)

2147483648

In [32]:
sum_one(int64)

-9223372036854775808

### Vectorize

- `np.vectorize` is slow and exists only for convenience. Quoting the documentation verbatim:
    

*The vectorize function is provided primarily for convenience, not for performance. The implementation is essentially a for loop.*

In [33]:
from numba import vectorize, guvectorize, float64

In [34]:
@vectorize([float64(float64, float64)])
def f(x, y):
    return x + y

In [35]:
v1 = np.random.rand(int(1e6))
v2 = np.random.rand(int(1e6))

In [36]:
%%timeit
_ = f(v1, v2)

1.89 ms ± 174 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
