# NumPy

When you work with Python (or with any other programming language, for that matter), you need to know the tool you are using.
It is a matter of efficiency, in terms of both **SOFTWARE** and **WETWARE**.

In [None]:
import numpy

NumPy is not part of the standard library, but it is the de facto standard for scientific computing in Python.

> **DO NOT RE-INVENT THE WHEEL**

It is built
* to provide an efficient, intuitive and pythonic interface to numerical computation
* to bridge the gap between Python and the memory-management standards of compiled languages
* to simplify the user's life

**It implements A LOT of analytical functions and useful values** (very well optimised)

Some examples:

* logarithms and exponentials (e.g. ``numpy.exp``, ``numpy.log``, ``numpy.log10``)
* trigonometric functions (e.g. ``numpy.sin``, ``numpy.cos``, and the inverse functions, and the hyperbolic versions)
* statistical functions (e.g. ``numpy.mean``, ``numpy.std``)
* mathematical constants (e.g. ``numpy.pi``, ``numpy.e``)
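
As a minimal illustration (the input values are arbitrary and chosen only for this sketch), these functions act element-wise on whole arrays:

```python
import numpy

# arbitrary sample values, chosen only for illustration
angles = numpy.array([0.0, numpy.pi / 2, numpy.pi])

sines = numpy.sin(angles)                     # element-wise sine
roundtrip = numpy.log(numpy.exp(angles))      # log undoes exp, up to rounding
average, spread = numpy.mean(angles), numpy.std(angles)
```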

## Array programming

The building-block data-structure in NumPy is the N-dimensional array:

* **constructors**

In [None]:
ar_from_shape = numpy.ndarray((2,3))
type(ar_from_shape), ar_from_shape

In [None]:
ar_from_obj = numpy.array((2.,3))
type(ar_from_obj), ar_from_obj

and also:
* ``numpy.linspace``, ``numpy.logspace``, ``numpy.arange``
* ``numpy.ones``, ``numpy.zeros``, ``numpy.empty``
* ``numpy.ones_like``, ``numpy.zeros_like``, ``numpy.empty_like``
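
A short sketch of what some of these constructors return (the sizes and ranges are arbitrary choices for the example). Note that ``numpy.empty``, like ``numpy.ndarray`` above, leaves the memory uninitialised, so its content is meaningless until you fill it:

```python
import numpy

grid = numpy.linspace(0.0, 1.0, 5)     # 5 evenly spaced points, both ends included
steps = numpy.arange(0, 10, 2)         # like range(): start, stop (excluded), step
decades = numpy.logspace(0, 3, 4)      # 10**0, 10**1, 10**2, 10**3
zeros = numpy.zeros((2, 3))            # 2x3 array filled with 0.0
ones = numpy.ones_like(steps)          # same shape and dtype as `steps`, filled with 1
```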

* **casting functions**

In [None]:
ls1 = [ 1, 2, 3 ]
ar_from_ls = numpy.asarray(ls1) # <------
type(ar_from_ls), ar_from_ls

In [None]:
ls2 = [ ls1, ls1 ]
ar_from_ls_cont = numpy.ascontiguousarray(ls2) # <------
type(ar_from_ls_cont), ar_from_ls_cont

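``numpy.ascontiguousarray`` guarantees that the result is contiguous in memory (C order); ``numpy.array`` and ``numpy.asarray`` do not guarantee this when the input is already an array with a different memory layout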

In [None]:
ar_from_ls_cont[0] is ar_from_ls_cont[1]
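
One way to inspect the memory layout is the ``flags`` attribute of an array. The sketch below uses a transposed view as an example of a non-contiguous array; the variable names are introduced only for this illustration:

```python
import numpy

nested = [[1, 2, 3], [1, 2, 3]]
c_ordered = numpy.ascontiguousarray(nested)

# the transpose is a view on the same data, read in a different order
transposed = c_ordered.T

print(c_ordered.flags['C_CONTIGUOUS'])                              # True
print(transposed.flags['C_CONTIGUOUS'])                             # False
print(numpy.ascontiguousarray(transposed).flags['C_CONTIGUOUS'])    # True
print(numpy.asarray(transposed) is transposed)                      # True: no conversion at all
```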

**MAIN CHARACTERISTICS**

* uniform type: you should not use an ndarray to store something like this

In [None]:
ar = numpy.array([1, 'hello', True])

In [None]:
ar.dtype, ar

* N-dimensional

In [None]:
diag1 = numpy.array( 
    [[1, 0],
     [0, 1]]
)
diag1.shape, diag1.size

> **NOTE** that for the specific matrix we have built above there are dedicated functions
>```python
> >>> numpy.diag((1,1))
>```
>```
>array([[1, 0],
>       [0, 1]])
>```
>```python
> >>> numpy.identity(2)
>```
>```
>array([[1., 0.],
>       [0., 1.]])
>```

and specific functions to perform operations on it, e.g.

In [None]:
diag1.diagonal()

* **Broadcasting** (vectorisation): see [the NumPy guide](https://numpy.org/doc/stable/user/basics.broadcasting.html)

In [None]:
ar1 = numpy.array( [1,2,3,4] )

In [None]:
# scalar operations
ar1 * 10

In [None]:
# same size arrays
ar2 = numpy.array([4, 3, 2, 1])
ar1 + ar2

In [None]:
# more complex shapes
mat = numpy.array(
    [[1, 2, 3, 4], 
     [5, 6, 7, 8]]
)
mat + ar1

In [None]:
# but the order of the dimensions counts
mat.T + ar1
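
The last cell fails because shapes are compared starting from the last axis, and two paired axes are compatible only if they are equal or one of them is 1. As a small sketch, ``numpy.broadcast_shapes`` (available in recent NumPy versions) lets you check this without building any array:

```python
import numpy

# shapes are aligned on the last axis; paired axes must be equal or 1
ok_shape = numpy.broadcast_shapes((2, 4), (4,))     # shapes of mat and ar1

try:
    numpy.broadcast_shapes((4, 2), (4,))            # shapes of mat.T and ar1
    compatible = True
except ValueError:
    compatible = False

print(ok_shape, compatible)    # (2, 4) False
```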

So what happens when we have to broadcast an operation between arrays of different sizes?

In [None]:
x = numpy.array([1,2,3,4,5])
y = numpy.logspace(0,6,7)
x, x.size, y, y.size

In [None]:
x*y

**CANNOT BE BROADCAST!**

But let's say I want to build a matrix in which each row is the product of one element of the first array (``x``) with the whole second array (``y``), therefore having shape $5\times7$

In [None]:
x.T * y

Transposing does not help, because ``x`` is 1-dimensional and ``x.T`` has the same shape as ``x``. What we can do is add an axis to the first array

In [None]:
x[:, numpy.newaxis]

In [None]:
xy = x[:, numpy.newaxis]*y
xy, xy.shape

Or, if we want to do it the other way round:

In [None]:
yx = x*y[:, numpy.newaxis]
yx, yx.shape
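
For reference, here is a sketch of two other spellings that should give the same $5\times7$ result as ``x[:, numpy.newaxis]*y``: reshaping ``x`` into a column, and the dedicated function ``numpy.outer``. The names ``same_xy`` and ``outer_xy`` are introduced only for this example:

```python
import numpy

x = numpy.array([1, 2, 3, 4, 5])
y = numpy.logspace(0, 6, 7)

xy = x[:, numpy.newaxis] * y          # shape (5, 7)
same_xy = x.reshape(-1, 1) * y        # reshape builds the same column vector
outer_xy = numpy.outer(x, y)          # dedicated function for this product
```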

This is the **MOST IMPORTANT FEATURE OF NUMPY**

If you are not going to use it, you might as well go back to not using NumPy at all (I have shown you that the tools are already there)

Little demonstration:

In [None]:
X = numpy.linspace(1,int(1e+3),int(1e+3))
Y = numpy.logspace(0, 6, int(1.e+4))

In [None]:
%%time
out = []
for x in X :
    tmp = []
    for y in Y :
        tmp += [x * y]
    out += [tmp]
out = numpy.array(out)

In [None]:
%time out = numpy.array([[x*y for y in Y] for x in X])
out.shape

In [None]:
%time out = X[:,numpy.newaxis]*Y
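
The ``%time`` and ``%%time`` magics only work inside IPython or Jupyter. As a rough sketch of the same comparison in a plain Python script, one could use ``time.perf_counter``; the arrays here are smaller than above only to keep the example quick, and the timings will depend on your machine:

```python
import time
import numpy

# smaller arrays than above, only to keep this sketch quick
X = numpy.linspace(1, 100, 100)
Y = numpy.logspace(0, 6, 1000)

start = time.perf_counter()
looped = numpy.array([[x * y for y in Y] for x in X])
t_loop = time.perf_counter() - start

start = time.perf_counter()
broadcast = X[:, numpy.newaxis] * Y
t_broadcast = time.perf_counter() - start

print(f'loop: {t_loop:.4f} s, broadcasting: {t_broadcast:.4f} s')
print(numpy.allclose(looped, broadcast))    # True: same numbers either way
```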

#### Some CAVEATS

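> **WARNING** Even though arrays should have a uniform type, NumPy will not prevent you from filling them with arbitrary Python objects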

In [None]:
class custom () :
    def __init__ ( self, num ) :
        self.num = num
    def __repr__ ( self ) :
        return f'custom({self.num})'
    def __str__ (self) :
        return self.__repr__()

In [None]:
numpy.array( [ custom( i ) for i in range(3) ] )

It is stored with the generic *object* dtype

In [None]:
ar_custom = numpy.array( [ custom( i ) for i in range(3) ] )
ar_custom *= 2

**NumPy is not designed for this kind of thing**

Even though you could work around the error above:

```python
class custom () :
    def __init__ ( self, num ) :
        self.num = num
    def __repr__ ( self ) :
        return f'custom({self.num})'
    def __str__ (self) :
        return self.__repr__()
    def __mul__ (self, other) :
        # return a new `custom`, so that the array keeps holding `custom` objects
        return custom(self.num * other)
    def __imul__ ( self, other ) :
        # in-place version: update this object and return it
        self.num *= other
        return self
```

This is discouraged, because you cannot know in advance which other NumPy behaviours the custom object will fail to support.

> **NUMPY IS FOR PRIMITIVE NUMERICAL TYPES**
>
> for custom objects it is better to use other containers or to organise the data differently
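
One possible way of organising the data differently, sketched under the assumption that only the numerical attribute needs array arithmetic: keep a single array holding the ``num`` of every item, instead of an array of objects. The class ``custom_collection`` is hypothetical and introduced only for this example:

```python
import numpy

class custom_collection () :
    # hypothetical container: one numerical array holds the `num` of every item
    def __init__ ( self, nums ) :
        self.nums = numpy.asarray(nums)
    def __repr__ ( self ) :
        return f'custom_collection({self.nums.tolist()})'

items = custom_collection(range(3))
items.nums *= 2        # plain, vectorised NumPy arithmetic on a numerical dtype
print(items)           # custom_collection([0, 2, 4])
```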

### BUT WITH A BIT OF AWARENESS YOU CAN MAKE YOUR FUNCTIONS NUMPY COMPLIANT

Which means that you want to favour broadcasting when you implement something:

In [None]:
def func_naive (x, y=10) :
    from math import exp
    return exp(x)/y

In [None]:
X = numpy.linspace(0.0,1.0,100)

In [None]:
func_naive(X)

In [None]:
%time fx = [func_naive(x) for x in X]

In [None]:
def func_numpy (x, y=10) :
    from numpy import exp
    return exp(x)/y

In [None]:
%time fx = func_numpy(X)

**what about the second argument?**

We have to make a choice, depending on the problem at hand:

* maybe the desired behaviour is to only accept scalars, then we should check for it

In [None]:
def func_numpy_yscalar (x, y=10) :
    from numpy import exp, ndim
    # ndim is 0 for Python scalars, NumPy scalars and 0-dimensional arrays
    if ndim(y) != 0 :
        raise TypeError( 'argument `y` should be a scalar' )
    return exp(x)/y

* maybe instead we want it to
    - accept scalars
    - accept arrays of the same dimension
    - broadcast automatically

In [None]:
def func_numpy_ybroadcast (x, y=10) :
    import numpy 
    # careful because these will also make copies
    x = numpy.array(x)
    y = numpy.array(y)
    # store if the inputs were scalar
    xscalar = False
    if x.ndim == 0 :
        x = x[None]
        xscalar = True
    yscalar = False
    if y.ndim == 0 :
        y = y[None]
        yscalar = True
    # add an axis if necessary
    if not (xscalar or yscalar) and y.size != x.size :
        x = x[:,numpy.newaxis]
    ret = numpy.exp(x) / y
    # if scalar input return a scalar
    if xscalar and yscalar :
        return ret.item()
    return ret

In [None]:
a = func_numpy_ybroadcast( X )
b = func_numpy_ybroadcast( 1, numpy.linspace(1.,2.,X.size) )
c = func_numpy_ybroadcast( X, numpy.linspace(1.,2.,X.size) )
d = func_numpy_ybroadcast( X, [1, 2, 3] )
a.shape, b.shape, c.shape, d.shape

We have, though, lost something in order to be able to do this:

In [None]:
func_numpy_ybroadcast(1,2)

Can you tell me what the problem is with this implementation?
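
The text leaves the question open. One possible reading is that the function always copies its inputs, only handles 1-dimensional arrays, and silently switches between element-wise and outer-product behaviour depending on the sizes. As a sketch of an alternative design (not the only one), the hypothetical ``func_numpy_simple`` below relies on plain broadcasting and leaves it to the caller to add axes explicitly:

```python
import numpy

def func_numpy_simple (x, y=10) :
    # asarray avoids the copy when the input is already an array
    x = numpy.asarray(x)
    y = numpy.asarray(y)
    # plain broadcasting: the caller adds axes explicitly if needed
    ret = numpy.exp(x) / y
    # 0-dimensional results are returned as Python scalars
    return ret.item() if ret.ndim == 0 else ret

X = numpy.linspace(0.0, 1.0, 100)
a = func_numpy_simple(X)                                  # shape (100,)
d = func_numpy_simple(X[:, numpy.newaxis], [1, 2, 3])     # shape (100, 3)
s = func_numpy_simple(1, 2)                               # Python float
```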

### One last point on arrays: indexing and masking

In [None]:
ar = numpy.arange(0, 100, 11)

In [None]:
ar

#### An array can be indexed:

* with a single index

In [None]:
ar[3]

* with a sequence of indexes

In [None]:
indexes = [0, 3, 5]

In [None]:
ar[indexes]

#### An array can be sliced: ``a[start:stop:step]``

* from the head

In [None]:
ar[:3]

* from the tail

In [None]:
ar[-3:]

* in the middle

In [None]:
ar[3:6]

* with some step

In [None]:
ar[1:-1:2]
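
One difference between the two approaches is worth keeping in mind: slicing returns a *view* on the original data, while indexing with a sequence of indexes returns a *copy*. A small sketch (the names ``sliced`` and ``picked`` are introduced only for this example):

```python
import numpy

ar = numpy.arange(0, 100, 11)

sliced = ar[:3]            # slicing: a view on the same memory
picked = ar[[0, 3, 5]]     # sequence of indexes: an independent copy

sliced[0] = -1             # also changes ar[0]
picked[1] = -1             # leaves ar untouched

print(ar[0], ar[3])        # -1 33
print(numpy.shares_memory(ar, sliced), numpy.shares_memory(ar, picked))   # True False
```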

#### An array can be masked

In [None]:
weven = (ar%2 == 0)

A mask is a boolean array with the same shape as the array you want to mask

In [None]:
weven

In [None]:
ar[weven]

#### and the mask can be inverted

In [None]:
ar[~weven]

#### and two masks can be combined

In [None]:
wmaj50 = ( ar > 50 )

In [None]:
ar[weven&wmaj50]

In [None]:
wdiv3 = (ar%3 == 0)

In [None]:
ar[weven|wdiv3]
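
When the conditions are written inline instead of being stored in variables, each one needs its own parentheses, because ``&`` and ``|`` bind more tightly than comparisons. The sketch below also shows ``numpy.logical_and`` as an equivalent spelling:

```python
import numpy

ar = numpy.arange(0, 100, 11)

# parentheses are needed: `&` and `|` bind more tightly than `>` and `==`
both = ar[(ar % 2 == 0) & (ar > 50)]
either = ar[(ar % 2 == 0) | (ar % 3 == 0)]

# the same combination spelled with a function
also_both = ar[numpy.logical_and(ar % 2 == 0, ar > 50)]

print(both)    # [66 88]
```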

#### And all of this can be done also on matrices with a bit more care

In [None]:
mat = (ar[:, numpy.newaxis]) * numpy.flip(ar+1)

In [None]:
mat = mat[1:]
mat, mat.shape

* slice in one of the two dimensions

In [None]:
mat[2,:]

In [None]:
mat[:,3]

* mask along all dimensions

In [None]:
mat[mat%2==0]

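Oh no! It has been flattened! This is because the mask can select a different number of elements in each row, so NumPy returns the selected elements as a 1-D array.

**be careful with the shapes of inputs and outputs when you mask!**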
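
If the 2-D shape has to be preserved, one option is ``numpy.where``, which keeps every position and replaces the unselected elements with a value of your choice. In this sketch the small matrix and the placeholder ``-1`` are arbitrary choices:

```python
import numpy

mat = numpy.arange(12).reshape(3, 4)
even = (mat % 2 == 0)

flat = mat[even]                     # 1-D: only the selected elements
kept = numpy.where(even, mat, -1)    # 2-D: unselected elements replaced by -1

print(flat.shape, kept.shape)        # (6,) (3, 4)
```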

In [None]:
mask4lines = numpy.ones_like(mat[:,0], dtype=bool)
mask4lines[-4:] = False
mask4lines

In [None]:
mat[mask4lines]

In [None]:
mask4cols = numpy.ones_like(mat[0], dtype=bool)
mask4cols[-4:] = False
mask4cols

In [None]:
mat[:,mask4cols]
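
To apply a row mask and a column mask at the same time, the two can be chained or combined with ``numpy.ix_``. Passing both masks directly, as in ``mat[rows, cols]``, does something different: it pairs the selected rows and columns element by element. A sketch on a small example matrix (all names here are introduced only for the illustration):

```python
import numpy

mat = numpy.arange(20).reshape(4, 5)
rows = numpy.array([True, True, False, False])
cols = numpy.array([True, False, True, False, False])

two_steps = mat[rows][:, cols]             # mask the rows, then the columns
one_step = mat[numpy.ix_(rows, cols)]      # numpy.ix_ builds the row/column grid

print(one_step)
# [[0 2]
#  [5 7]]
```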

## Random

NumPy implements **(pseudo-)RANDOM NUMBER GENERATORS** in the sub-package ``numpy.random`` ([here the docs](https://numpy.org/doc/stable/reference/random/index.html)).

Generating random numbers is extremely useful in science, e.g. for Monte Carlo simulations or for drawing samples from a distribution.

In [None]:
rng = numpy.random.default_rng( seed = 555 )

> **NOTE THAT** I am using a ``seed`` for **reproducibility**
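
A small sketch of what reproducibility means here: two generators built with the same seed produce the same sequence of numbers, while a generator built without a seed is initialised with fresh entropy from the operating system. The names ``rng_a``, ``rng_b`` and ``rng_c`` are introduced only for this example:

```python
import numpy

rng_a = numpy.random.default_rng(seed=555)
rng_b = numpy.random.default_rng(seed=555)
rng_c = numpy.random.default_rng()      # no seed: different numbers at every run

same = numpy.array_equal(rng_a.uniform(size=5), rng_b.uniform(size=5))
print(same)    # True
```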

Random number generation is embedded in the array-programming framework of NumPy:

* **uniform distributions**

In [None]:
rng.uniform(size=10)

In [None]:
help(rng.uniform)

In [None]:
rng.integers(1, size=10, endpoint=True)

* **non-uniform distributions**

In [None]:
rng.normal(size=10)

In [None]:
rng.poisson(size=10)

### EXERCISE: sample from a custom non-uniform distribution

In [None]:
import matplotlib.pyplot as plt

In [None]:
def custom_cdf ( x ) :
    return 1-(0.5 * ( numpy.cos(x) + 1 ) )

In [None]:
x = numpy.linspace(0, numpy.pi, 1024)
fx = custom_cdf(x)

In [None]:
plt.plot(x, fx)

In [None]:
ydata = numpy.interp( 
    rng.uniform(size=4096), 
    fx,
    x
)

In [None]:
ydata
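
The ``numpy.interp`` call above is inverse-transform sampling: uniform numbers are mapped through the inverse of the CDF, here obtained by swapping the roles of ``fx`` and ``x`` in the interpolation. For this particular CDF the inverse can also be written by hand, which gives a way to check the interpolation. A sketch, with the CDF expression copied from ``custom_cdf``:

```python
import numpy

rng = numpy.random.default_rng(seed=555)
u = rng.uniform(size=4096)

# solving u = 1 - 0.5 * (cos(x) + 1) for x gives x = arccos(1 - 2u)
exact = numpy.arccos(1.0 - 2.0 * u)

# the interpolated inverse, evaluated on the same uniform numbers
x = numpy.linspace(0, numpy.pi, 1024)
fx = 1 - (0.5 * (numpy.cos(x) + 1))
interpolated = numpy.interp(u, fx, x)

# the difference is small and limited by the spacing of the grid `x`
print(numpy.abs(exact - interpolated).max())
```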

In [None]:
yhist, xbins = numpy.histogram(ydata, bins=(10))

In [None]:
ycums = yhist.cumsum()/ydata.size

In [None]:
plt.plot(x, fx)
plt.step(xbins[1:], ycums)

Are they the same distribution???

(you can try to increase the number of bins in the function above, e.g. ``numpy.histogram( ydata, bins=(100) )``)

In [None]:
from scipy.stats import kstest
kstest(ydata, custom_cdf).pvalue > 0.05

**what about a PDF??**

In [None]:
def custom_pdf ( x ) :
    return x * numpy.exp(-x)

**you can try this yourself**
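
If you want something to compare your attempt against, here is one possible approach, sketched under explicit assumptions: the support is truncated to $[0, 20]$ (an arbitrary choice, where the tail of this PDF is negligible), the CDF is obtained by a cumulative trapezoidal integral, and sampling then proceeds by interpolation as for ``custom_cdf``:

```python
import numpy

def custom_pdf ( x ) :
    return x * numpy.exp(-x)

# assumption: the support is truncated to [0, 20], where the tail is negligible
x = numpy.linspace(0.0, 20.0, 4096)
px = custom_pdf(x)

# cumulative trapezoidal integral of the PDF, normalised to end at 1
areas = 0.5 * (px[1:] + px[:-1]) * numpy.diff(x)
cdf = numpy.concatenate(([0.0], numpy.cumsum(areas)))
cdf /= cdf[-1]

# inverse-transform sampling: uniform numbers mapped through the inverse CDF
rng = numpy.random.default_rng(seed=555)
samples = numpy.interp(rng.uniform(size=4096), cdf, x)
```

The grid resolution and the truncation point both affect the accuracy, so they are worth varying when you test your own solution.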